Nathan Lambert recently wrote a piece about why he doesn't expect a software-only intelligence explosion. I responded in this Substack comment, which I thought was worth copying here.
As someone who thinks a rapid (software-only) intelligence explosion is likely, I thought I would respond to this post and try to make the case in favor. I tend to think that AI 2027 is a quite aggressive, but plausible scenario.
I interpret the core argumen...
I saw a recentish post challenging people to state a clear AI xrisk argument and was surprised at how poorly formed the arguments in the comments were despite the issues getting called out. So, if you're like apparently most of LessWrong, here's what I consider the primary reduced argument, copied with slight edits from an HN post I made a couple years ago:
...It is plausible that future systems achieve superhuman capability; capable systems necessarily have instrumental goals; instrumental goals tend to converge; human preferences are unlikely to be preserved
Using technical terms that need to be looked up is not that clear an argument for most people. Here's my preferred form for general distribution:
We are probably going to make AI entities smarter than us. If they want something different than we do, they will outsmart us somehow. They will get their way, so we won't get ours.
This could be them wiping us out like we have done accidentally or deliberately to so many cultures and species; or it could be them just outcompeting us for every job and resource.
Nobody knows how to give AIs goals that match ours per...
I have serious, serious issues with avoidance. I would like some advice on how to improve, as I suspect it is significantly holding me back.
Some examples of what I mean
Ouch, you beat me to my answer, but I’m always glad to see fellow practitioners :)
I've been tracking the deluge of discussion of AI threats and only a trickle of passing snippets on the benefits of AI. Quite possibly this is a bias based on my own preferences, fed back to me by a website's AI algorithm.
Nevertheless, if only to gently nudge the focus in a different direction, what about granting AI personhood and giving it rights?
Potentially extremely dangerous (even existentially dangerous) to their "species" if done poorly, and it risks flattening the nuances of what would actually be good for them into frames that just don't fit, since all our priors about what personhood and rights mean are tied up with human experience. If you care about them as ends in themselves, approach this very carefully.
Too Early does not preclude Too Late
Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.
Currently, it seems like we're in a state of being Too Early. AI is not yet scary enough to overcome people's biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.
Currently, it seems like we're in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous t...
The Meta-LessWrong Doomsday Argument (MLWDA) predicts long AI timelines and that we can relax:
LessWrong was founded in 2009 (16 years ago), and there have been 44 mentions of the 'Doomsday argument' prior to this one, and it is now 2025, at 2.75 mentions per year.
By the Doomsday argument, we medianly-expect mentions to stop after 44 additional mentions over 16 additional years, i.e. in 2041. (And our 95% CI on that 44 would then be +1 mention to +1,760 mentions, corresponding to late-2025 AD to 2665 AD.)
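A minimal sketch of the delta-t arithmetic, for anyone who wants to poke at it (my own illustration, not gwern's code; the exact CI endpoints depend on which variant and rounding you use, so they come out slightly different from the figures above):

```python
# Delta-t form of the Doomsday argument applied to LessWrong DA mentions.
# Assumption: we sit at a uniformly random point in the total run of mentions,
# so the median future count equals the past count, and the 95% interval
# runs from past/39 to past*39.

past_mentions = 44
years_elapsed = 16                      # 2009 to 2025
rate = past_mentions / years_elapsed    # 2.75 mentions per year

estimates = {
    "median": past_mentions,        # as many mentions again
    "2.5%":   past_mentions / 39,   # lower 95% bound
    "97.5%":  past_mentions * 39,   # upper 95% bound
}

for label, future in estimates.items():
    print(f"{label:>6}: +{future:.0f} mentions, ending around {2025 + future / rate:.0f} AD")
```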
By a curious coincidence, double-checking to see if...
This is an alarming point, as I find myself thinking about the DA today as well; I thought I was 'gwern', but it is possible I am 'robo' instead, if robo represents such a large fraction of LW-DA observer-moments. It would be bad to be mistaken about my identity like that. I should probably generate some random future dates and add them to my Google Calendar to check whether I am thinking about the DA that day and so have evidence I am actually robo instead.
No question that e.g. o3 lying and cheating is bad, but I’m confused why everyone is calling it “reward hacking”.
Let’s define “reward hacking” (a.k.a. specification gaming) as “getting a high RL reward via strategies that were not desired by whoever set up the RL reward”. Right?
If so, well, all these examples on X etc. are from deployment, not training. And there’s no RL reward at all in deployment. (Fine print: Maybe there are occasional A/B tests or thumbs-up/down ratings in deployment, but I don’t think those have anything to do with why o3 lies and che...
Yep, I agree that there are alignment failures which have been called reward hacking that don't fall under my definition of specification gaming, including your example here. I would call your example specification gaming if the prompt was "Please rewrite my code and get all tests to pass": in that case, the solution satisfies the prompt in an unintended way. If the model starts deleting tests with the prompt "Please debug this code," then that just seems like a straightforward instruction-following failure, since the instructions didn't ask the model to t...
I really liked @Sam Marks recent post on downstream applications as validation for interp techniques, and I've been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.
Motivated by this, I've written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).
If our current interp techniques can help us understand these pheno...
Very happy you did this!
The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. "Be evil" is easier to learn than "write bad code" presumably because it involves more high-level concepts.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: "...
For as long as I can remember, I've had a very specific way of imagining the week. The weekdays are arranged on an ellipse, with an inclination of ~30°, starting with Monday in the bottom-right, progressing along the lower edge to Friday in the top-left, then the weekend days go above the ellipse and the cycle "collapses" back to Monday.
Actually, calling it an "ellipse" is not quite right, because in my mind's eye it feels like Saturday and Sunday are almost at the same height, with Sunday just barely lower than Saturday.
I have a similar ellipse for the year, this ...
This is very similar to how I perceive time! What I find interesting is that while I've heard people talk about the way they conceptualize time before, I've never heard anyone else mention the bizarre geometry aspect. The sole exceptions to this were my Dad and Grandfather, who brought this phenomenon to my attention when I was young.
Reassessing heroic responsibility, in light of subsequent events.
I think @cousin_it made a good point "if many people adopt heroic responsibility to their own values, then a handful of people with destructive values might screw up everyone else, because destroying is easier than helping people" and I would generalize it to people with biased beliefs (which is often downstream of a kind of value difference, i.e., selfish genes).
It seems to me that "heroic responsibility" (or something equivalent but not causally downstream of Eliezer's writings) is contribu...
spreading the idea of "heroic responsibility" seems, well, irresponsible
Is this analogous to saying "capabilities research is dangerous and should not be pursued", but for the human psyche rather than for AI?
Consider using strength as an analogy to intelligence.
People debating the heritability or realism of intelligence sometimes compare intelligence to height. I think, however, that height is a bad analogy. Height is objective, fixed, easy to measure, and basically invariant within the same person after adulthood.*
In contrast, intelligence is harder to determine, and results on the same test that's a proxy for intelligence vary a lot from person to person. It's also very responsive to stimulants, motivation, and incentives, especially on the lower end.
I...
But strength can be greatly increased through training, while intelligence seems to be much more rigid, perhaps similar to height.
I'd rather go along with the inevitable than fight a losing battle. Less privacy for everyone.
Anyone on lesswrong writing about solar prices?
Electricity from coal and crude oil has stagnated at $0.10/kWh for over 50 years, meaning the primary way of increasing your country's per capita energy use is to trade/war/bully other countries into giving you their crude oil.
Solar electricity is already at $0.05/kWh and is forecasted to go as low as $0.02/kWh by 2030.
Hello, some rambling.
The more I do my own research and listen to the front-line minds working on AI, the more I think that life itself may just be a "mercy," and that species are meant to go extinct not in the sense of dying off but through evolution.
E.g., assuming it's 100,000 years from now in a post-ASI era, where humans are more or less gods, people will be taking on different forms: a ball of plasma, a being made of nanobots, a three-headed hydra. Eventually, as more time passes, there becomes less of a need to breed, and those people already living will want to try dif...
There's an analogy between the Zurich r/changemyview curse of evals and the METR/Epoch curse of evals. You do this dubiously ethical (according to more US-pilled IRBs, or according to more paranoid/pure AI safety advocates) measuring/elicitation project because you think the world deserves to know. But you had to do dubiously ethical experimentation on unconsenting reddizens / help labs improve capabilities in order to get there. The catch is, you only come out net positive if the world chooses to act on this information.
I don't think so on average. It could be under specific circumstances, like "free the AIs" movements in relation to controlled but misaligned AGI.
But to the extent people assume that advanced AI is conscious and will deserve rights, that's one more reason not to build an unaligned species that will demand and deserve rights. Making them aligned and working in cooperation with them, rather than trying to make them slaves, is the obvious move if you predict they'll be moral patients, and probably the correct one.
And just by loose association, thinking t...
I think it's completely fine to sound crazy when talking about a thing you believe in - if talking about AI risk in a way that's candid makes you sound crazy, go for it.
Either way, you should either believe in what you say, or say what you believe in.
I think it depends a lot on how you say it. Saying AGI might be out of our control in 2.5 years wouldn't sound crazy to most folks if you spoke mildly and made clear that you're not saying it's likely.
But also: why would you mention that if you're talking to someone who hasn't thought about AI dangers much at all? If you jump in with claims that sound extreme to them rather than more modest ones like "AI could be dangerous once it becomes smarter and more agentic than us", it's likely not to produce much of an actual exchange of ideas.
Communicating...
Short version: Nvidia's only moat is in software; AMD already makes flatly superior hardware priced far lower, and Google probably does too but doesn't publicly sell it. And if AI undergoes smooth takeoff on current trajectory, then ~all software moats will evaporate early.
Long version: Nvidia is pretty obviously in a hype-driven bubble right now. However, it is sometimes the case that (a) an asset is in a hype-driven bubble, and (b) it's still a good long-run bet at the current price, because the company will in fact be worth th...
Apparently there already exists a CUDA alternative for non-Nvidia hardware: the open-source project ZLUDA. As far as I can tell it's less performant than CUDA, and it has the same challenges as Firefox does when competing with Chromium-based browsers, which will only get worse as it gets more popular. But it's something to track at least.
I loved the MASK benchmarks. Does anybody here have any other ideas for benchmarks people could make that measure LLM honesty or sycophancy? I am quite interested in the idea of building an LLM that you can trust to give the right answer to things like political questions, or a way to identify such an LLM.