Quick Takes

A response to "State of play of AI progress (and related brakes on an intelligence explosion)" by Nathan Lambert.

Nathan Lambert recently wrote a piece about why he doesn't expect a software-only intelligence explosion. I responded in this Substack comment, which I thought was worth copying here.


As someone who thinks a rapid (software-only) intelligence explosion is likely, I thought I would respond to this post and try to make the case in favor. I tend to think that AI 2027 is a quite aggressive but plausible scenario.


I interpret the core argumen... (read more)

Veedrac

I saw a recentish post challenging people to state a clear AI x-risk argument and was surprised at how poorly formed the arguments in the comments were, despite the issues getting called out. So, if you're like apparently most of LessWrong, here's what I consider the primary reduced argument, copied with slight edits from an HN post I made a couple of years ago:

It is plausible that future systems achieve superhuman capability; capable systems necessarily have instrumental goals; instrumental goals tend to converge; human preferences are unlikely to be preserved

... (read more)

An argument that uses technical terms people need to look up isn't that clear to most people. Here's my preferred form for general distribution:

We are probably going to make AI entities smarter than us. If they want something different than we do, they will outsmart us somehow. They will get their way, so we won't get ours.

This could be them wiping us out, as we have done accidentally or deliberately to so many cultures and species; or it could be them just outcompeting us for every job and resource.

Nobody knows how to give AIs goals that match ours per... (read more)

sam418

I have serious, serious issues with avoidance. I would like some advice on how to improve, as I suspect it is significantly holding me back.

Some examples of what I mean:

  • I will not respond to an email or an urgent letter for weeks at a time, even while it causes me serious anxiety
  • I will procrastinate starting work in the morning, sometimes leading to me doing nothing at all by the afternoon
  • I will avoid looking for jobs or other opportunities; I have very strong avoidance here, but I'm not sure why
  • I will make excuses to avoid meetings and social situations ve
... (read more)
Showing 3 of 14 replies

Ouch, you beat me to my answer, but I’m always glad to see fellow practitioners :)

Selfmaker662
I want to suggest a long-term approach: learning to work with the emotions behind such persistent problems. Methods like IFS, Focusing, and loving-kindness meditations are the right tools.

They *can* lead to practical improvements fairly quickly, once you get the hang of them. But learning to do them even reasonably well takes months of effort, curiosity, and support from a community or a mentor. These things are basically meditations, subject to the standard difficulties like over-effort, subtle wrong mindsets, etc. They also tend to focus first on whatever feels most urgent to your subconscious system, like relationship stress or background anxiety you've gotten used to, so the email issue might not be the first thing that shifts.

Still, this is the only thing that really worked for me. And once it started working, it *really* worked.

If you're interested, I can send my favourite links.
plex
Oh nice, stavros already got it before I posted :) This is the path forward.
Hudjefa

I've been tracking the deluge of discussion on the AI threat, and only a trickle of passing snippets on the benefits of AI. Quite possibly this is a bias, based on my own preferences, fed back to me by a website's AI algorithm.

 

Nevertheless, if only to gently nudge the focus in a different direction, what about granting AI personhood and giving it rights? 

Ann

Potentially extremely dangerous (even existentially dangerous) to their "species" if done poorly, and it risks flattening the nuances of what would be good for them into frames that just don't fit properly, given that all our priors about what personhood and rights actually mean are tied up with human experience. If you care about them as ends in themselves, approach this very carefully.

Too Early does not preclude Too Late

Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.

Currently, it seems like we're in a state of being Too Early. AI is not yet scary enough to overcome people's biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.

Currently, it seems like we're in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous t... (read more)

gwern

The Meta-LessWrong Doomsday Argument (MLWDA) predicts long AI timelines and that we can relax:

LessWrong was founded in 2009 (16 years ago), and there have been 44 mentions of the 'Doomsday argument' prior to this one; it is now 2025, so that's 2.75 mentions per year.

By the Doomsday argument, we medianly expect mentions to stop after 44 additional mentions over 16 additional years, i.e. in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,760 mentions, corresponding to late-2027 AD to 2665 AD.)
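
For readers who want to follow the arithmetic, here is a minimal sketch of a Gott-style "delta t" calculation along the lines the comment describes. It assumes a uniform prior over where in the total run of mentions we currently sit; the specific quantile convention (2.5%/97.5%) is my assumption and does not exactly reproduce the tail figures quoted above.

```python
# Minimal sketch of Gott-style doomsday arithmetic for the MLWDA.
# Assumption: we are a "random" observer, uniformly placed within the
# total (past + future) run of 'Doomsday argument' mentions on LessWrong.

past_mentions = 44            # mentions prior to the comment above
years_elapsed = 2025 - 2009   # LessWrong was founded in 2009
rate = past_mentions / years_elapsed  # ~2.75 mentions per year

def extra_mentions(quantile: float) -> float:
    """Future mentions implied if we sit at `quantile` of the total run."""
    return past_mentions * (1 - quantile) / quantile

for label, q in [("median", 0.5), ("2.5% tail", 0.975), ("97.5% tail", 0.025)]:
    extra = extra_mentions(q)
    print(f"{label:>10}: ~{extra:,.0f} more mentions, ending around AD {2025 + extra / rate:.0f}")
```

The median case reproduces the 44 further mentions / 2041 figure; the upper tail lands in the 2600s, the same ballpark as the 2665 AD quoted above.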

By a curious coincidence, double-checking to see if... (read more)

Showing 3 of 4 replies
robo
I've thought about the doomsday argument more than daily for the past 15 years, enough for me to go from "Why am I improbably young?" to "Oh, I guess I'm just a person who thinks about the doomsday argument a lot." Fun "fact": when a person thinks about the doomsday argument, they have a decent chance of being me.
gwern

This is an alarming point, as I find myself thinking about the DA today as well; I thought I was 'gwern', but it is possible I am 'robo' instead, if robo represents such a large fraction of LW-DA observer-moments. It would be bad to be mistaken about my identity like that. I should probably generate some random future dates and add them to my Google Calendar to check whether I am thinking about the DA that day and so have evidence I am actually robo instead.

gwern
However, you can't use this argument because unlike the MLWDA, where I am arguably a random observer of LW DA instances (the thought was provoked by Michael Nielsen linking to Cosma Shalizi's notes on Mesopotamia and me thinking that the temporal distances are much less impressive if you think of them in terms of 'nth human to live', which immediately reminded me of DA and made me wonder if anyone had done a 'meta-DA', and LW simply happened to be the most convenient corpus I knew of to accurately quantify '# of mentions' as tools like Google Scholar or Google N-Grams have a lot of issues - I have otherwise never taken much of an interest in the DA and AFAIK there have been no major developments recently), you are in a temporally privileged position with the MMLWDA, inasmuch as you are the first responder to my MLWDA right now, directly building on it in a non-randomly-chosen-in-time fashion. Thus, you have to appeal purely to non-DA grounds like making a parametric assumption or bringing in informative priors from 'similar rat and rat adjacent memes', and that's not a proper MMLWDA. That's just a regular old prediction. Turchin actually notes this issue in his paper, in the context of, of course, the DA and why the inventor Brandon Carter could not make a Meta-DA (but he and I could):
Steven Byrnes

No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.

Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?

If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and che... (read more)

Showing 3 of 10 replies
Steven Byrnes
I thought of a fun case in a different reply: Harry is a random OpenAI customer and writes in the prompt “Please debug this code. Don’t cheat.” Then o3 deletes the unit tests instead of fixing the code. Is this “specification gaming”? No! Right? If we define “the specification” as what Harry wrote, then o3 is clearly failing the specification. Do you agree?

Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to t... (read more)

Steven Byrnes
Thanks for the examples! Yes, I'm aware that many are using terminology this way; that's why I'm complaining about it :) I think your two 2018 Victoria Krakovna links (in context) are both consistent with my narrower (I would say "traditional") definition. For example, the CoastRunners boat is actually getting a high RL reward by spinning in circles. Even for non-RL optimization problems that she mentions (e.g. evolutionary optimization), there is an objective which is actually scoring the result highly. Whereas for an example of o3 deleting a unit test during deployment, what's the objective on which the model is actually scoring highly?

  • Getting a good evaluation afterwards? Nope, the person didn't want cheating!
  • The literal text that the person said ("please debug the code")? For one thing, erasing the unit tests does not satisfy the natural-language phrase "debugging the code". For another thing, what if the person wrote "Please debug the code. Don't cheat." in the prompt, and o3 cheats anyway? Can we at least agree that this case should not be called reward hacking or specification gaming? It's doing the opposite of its specification, right?

As for terminology, hmm, some options include "lying and cheating", "ruthless consequentialist behavior" (I added "behavior" to avoid implying intentionality), "loophole-finding", or "generalizing from a training process that incentivized reward hacking via cheating and loophole-finding". (Note that the last one suggests a hypothesis, namely that if the training process had not had opportunities for successful cheating and loophole-finding, then the model would not be doing those things right now. I think that this hypothesis might or might not be true, and thus we really should be calling it out explicitly instead of vaguely insinuating it.)

I really liked @Sam Marks's recent post on downstream applications as validation for interp techniques, and I've been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.

Motivated by this, I've written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens). 

If our current interp techniques can help us understand these pheno... (read more)

Very happy you did this!

Neel Nanda
Really helpful work, thanks a lot for doing it

The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. "Be evil" is easier to learn than "write bad code" presumably because it involves more high-level concepts.

Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: "... (read more)

For as long as I can remember, I've had a very specific way of imagining the week. The weekdays are arranged on an ellipse, with an inclination of ~30°, starting with Monday in the bottom-right, progressing along the lower edge to Friday in the top-left, then the weekend days go above the ellipse and the cycle "collapses" back to Monday.

Actually, calling it an "ellipse" is not quite right, because in my mind's eye it feels like Saturday and Sunday are almost at the same height, with Sunday just barely lower than Saturday.

I have a similar ellipse for the year, this ... (read more)

Showing 3 of 8 replies
GWill

This is very similar to how I perceive time! What I find interesting is that while I've heard people talk about the way they conceptualize time before, I've never heard anyone else mention the bizarre geometry aspect. The sole exceptions to this were my Dad and Grandfather, who brought this phenomenon to my attention when I was young.

Sergii
Ha, thinking back to childhood I get it now: it's the influence of the layout of the school daily journal in the USSR/Ukraine, like https://cn1.nevsedoma.com.ua/images/2011/33/7/10000000.jpg
Mateusz Bagiński
That's a nice, concise handle!
Wei Dai

Reassessing heroic responsibility, in light of subsequent events.

I think @cousin_it made a good point: "if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people", and I would generalize it to people with biased beliefs (which are often downstream of a kind of value difference, i.e., selfish genes).

It seems to me that "heroic responsibility" (or something equivalent but not causally downstream of Eliezer's writings) is contribu... (read more)

Showing 3 of 11 replies

spreading the idea of "heroic responsibility" seems, well, irresponsible

Is this analogous to saying "capabilities research is dangerous and should not be pursued", but for the human psyche rather than for AI?

Ben Pace
My sense is that most of the people with lots of power are not taking heroic responsibility for the world. I think that Amodei and Altman intend to achieve global power and influence, but this is not the same as taking global responsibility. I think, especially for Altman, the desire for power comes first relative to responsibility. My (weak) impression is that Hassabis has less will-to-power than the others, and that Musk has historically been much closer to having responsibility be primary.

I don't really understand this post as doing something other than asking "on the margin are we happy or sad about present large-scale action" and then saying that the background culture should correspondingly praise or punish large-scale action. Which is maybe reasonable, but alternatively too high-level of a gloss.

As per the usual idea of rationality, I think whether you believe you are capable of taking heroic responsibility in a healthy way is true in some worlds and not in others, and you should try to figure out which world you're in. The financial incentives around AI development are blatantly insanity-inducing on the topic, and anyone should've been able to guess that going in; I don't think this was a difficult question. Though I guess someone already exceedingly wealthy (i.e. already having $1B or $10B) could have unusually strong reason to not be concerned about that particular incentive (and I think it is the case that Musk has seemed differently insane than the others taking action in this area, and lacking in some of the insanities).

However, I think most moves around wielding this level of industry should be construed as building an egregore more powerful than you. The founders/CEOs of the AI big-tech companies are not able to simply turn their companies off, nor their industry. If they grow to believe their companies are bad for the world, either they'll need to spend many years dismantling / redirecting them, or else they'll simply quit/move on and some other perso
testingthewaters
Well said. Bravo.
Linch

Consider using strength as an analogy to intelligence. 

People debating the heredity or realism of intelligence sometimes compare intelligence to height. I think, however, that "height" is a bad analogy. Height is objective, fixed, easy to measure, and basically invariant within the same person after adulthood.*

In contrast, intelligence is harder to determine, and results on the same test that's a proxy for intelligence vary a lot from person to person. It's also very responsive to stimulants, motivation, and incentives, especially on the lower end.

I... (read more)

cubefox

But strength can be strongly increased through training, while intelligence seems to be much more rigid, perhaps similar to height.

I don't think anyone foresaw this would be an issue, but now that we know, I think GeoGuessr-style queries should be one of the things that LLMs refuse to help with. In the cases where it isn't a fun novelty, it will often be harmful.

I'd rather go along with the inevitable than fight a losing battle. Less privacy for everyone.

Anyone on LessWrong writing about solar prices?

Electricity from coal and crude oil has stagnated at $0.10/kWh for over 50 years, meaning the primary way of increasing your country's per capita energy use reserve is to trade/war/bully other countries into giving you their crude oil.

Solar electricity is already at $0.05/kWh and is forecasted to go as low as $0.02/kWh by 2030.

Nil

Hello, some rambling.

The more I do my own research and listen to the front-line minds working on AI, the more I keep thinking that life in itself may just be a "mercy", and that species are meant to go extinct not in the sense of dying off but through evolution.

E.g., assuming 100,000 years from now, in a post-ASI era where humans are more or less gods, people will be assuming different forms: a ball of plasma, a being made of nanobots, a three-headed hydra. Eventually, as more time goes on, there becomes less of a need to breed, and those people already living will want to try dif... (read more)

Quinn

There's an analogy between the Zurich r/changemyview curse of evals and the METR/Epoch curse of evals. You do this dubiously ethical (according to more US-pilled IRBs, or according to more paranoid/pure AI safety advocates) measuring/elicitation project because you might think the world deserves to know. But you had to do dubiously ethical experimentation on unconsenting reddizens / help labs improve capabilities in order to get there; the catch is, you only come out net positive if the world chooses to act on this information.

Not saying AI models can't be moral patients, but 1) if the smartest models are probably going to be the most dangerous, and 2) if the smartest models are probably going to be the best at demonstrating moral patienthood, then 3) caring too much about model welfare is probably dangerous.

I don't think so on average. It could be dangerous under specific circumstances, like "free the AIs" movements in relation to a controlled but misaligned AGI.

But to the extent people assume that advanced AI is conscious and will deserve rights, that's one more reason not to build an unaligned species that will demand and deserve rights. Making them aligned and working in cooperation with them, rather than trying to make them slaves, is the obvious move if you predict they'll be moral patients, and probably the correct one.

And just by loose association, thinking t... (read more)

p

I think it's completely fine to sound crazy when talking about a thing you believe in - if talking about AI risk in a way that's candid makes you sound crazy, go for it.

  1. If you're correct, you get points from whoever heard you make the crazy prediction. They are then more likely to listen to more crazy predictions/solutions.
  2. If you're wrong, this is good feedback from reality. This is a good opportunity to correct yourself!

Either way, you should either believe in what you say, or say what you believe in.

Showing 3 of 4 replies
p
I'm also unsure what happens when a group of people adopts this strategy; I'd like to hear more about this dynamic.
p
I mean, saying that I don't find https://ai-2027.com/ unreasonable can sound crazy, but I think I should say it regardless. But I framed the thing the way I did partly to get feedback, so feedback is good!

I think it depends a lot on how you say it. Saying AGI might be out of our control in 2.5 years wouldn't sound crazy to most folks if you spoke mildly and made clear that you're not saying it's likely.

But also: why would you mention that if you're talking to someone who hasn't thought about AI dangers much at all? If you jump in with claims that sound extreme to them rather than more modest ones like "AI could be dangerous once it becomes smarter and more agentic than us", it's likely to not even produce much of an actual exchange of ideas.

Communicating... (read more)

NVIDIA Is A Terrible AI Bet

Short version: Nvidia's only moat is in software; AMD already makes flatly superior hardware priced far lower, and Google probably does too but doesn't publicly sell it. And if AI undergoes smooth takeoff on current trajectory, then ~all software moats will evaporate early.

Long version: Nvidia is pretty obviously in a hype-driven bubble right now. However, it is sometimes the case that (a) an asset is in a hype-driven bubble, and (b) it's still a good long-run bet at the current price, because the company will in fact be worth th... (read more)

Showing 3 of 19 replies

Apparently there already exists a CUDA alternative for non-Nvidia hardware: the open-source project ZLUDA. As far as I can tell it's less performant than CUDA, and it has the same challenges Firefox does when competing with Chromium-based browsers, which will only get worse as it gets more popular. But it's something to track at least.

havdvdbd
Transpiling assembly code written for one OS/kernel to assembly code for another OS/kernel, while taking advantage of the full speed of the processor, is a completely different task from transpiling, say, Java code into Python.

Also, the hardware/software abstraction might break. A Python developer can say hardware failures are not my problem. An assembly developer working at an AGI lab needs to consider hardware failures as lost wallclock time in their company's race to AGI, and will try to write code so that hardware failures don't cause the company to lose time.

GPT-4 definitely can't do this type of work, and I'll bet a lot of money GPT-5 can't do it either. ASI can do it, but there are bigger considerations there than whether Nvidia makes money, such as whether we're still alive and whether markets and democracy continue to exist. Making a guess of N for which GPT-N can get this done requires evaluating how hard of a software task this actually is, and your comment contains no discussion of this.

Have you looked at tinygrad's codebase or spoken to George Hotz about this?
Josh You
AI that can rewrite CUDA is a ways off. It's possible that it won't be that far away in calendar time, but it is far away in terms of AI market growth and hype cycles. If GPT-5 does well, Nvidia will reap the gains more than AMD or Google.
lc

I loved the MASK benchmarks. Does anybody here have any other ideas for benchmarks people could make that measure LLM honesty or sycophancy? I am quite interested in the idea of building an LLM that you can trust to give the right answer to things like political questions, or a way to identify such an LLM.
