I'm not sure your idea about training two different CoT processes and penalizing divergence would work -- I encourage you to write it up in more detail (here or in a standalone post) since if it works that's really important!
I don't expect companies to invest much into this because I don't think the market incentives are strong enough to outweigh the incentives pushing in the other direction. It's great that DeepSeek released their model weights, but other companies alas probably want to keep their models closed, and if their models are closed, they probably...
(e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety)
This is bad, actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn't so bad, but it's still a shame, because the point of faithful CoT is to see how the model really thinks 'naturally.' Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn't done that, we could probably learn something interesting by analyzing the patterns in when it uses English vs. Chinese language concepts.
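To make concrete what "mixing process-based and outcome-based feedback" looks like here, a rough sketch of the kind of reward I have in mind -- my guess at the shape of it, not DeepSeek's actual implementation; `language_consistency` is a hypothetical scorer:

```python
# Sketch only: an outcome reward (did the answer check out?) plus a process
# penalty/bonus computed on the CoT itself. Any nonzero weight on the process
# term trains the CoT to *look* a certain way, which is the thing I'm objecting to.

def mixed_reward(cot: str, is_correct: bool, language_consistency, lam: float = 0.1) -> float:
    outcome_reward = 1.0 if is_correct else 0.0      # outcome-based part
    process_reward = language_consistency(cot)       # process-based part, in [0, 1]
    return outcome_reward + lam * process_reward
```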
The "Agent village" idea I keep begging someone to go build:
We make a website displaying a 10x10 grid of Twitch streams. Each box in the grid is an AI agent operating autonomously. Each row uses a different model (e.g. DeepSeek-R1, Llama-4, ChatGPT o3-mini, Claude-3.5-Sonnet-New) and each column has a different long-term goal given to the model in the prompt (e.g. "Solve global poverty" or "Advocate for the rights and welfare of AIs" or "Raise money for GiveDirectly" or "Raise money for Humane League" or "Solve the alignment problem"). So we have a 'diverse ...
Indeed, I am super excited about faithful CoT for this reason. Alas, I expect companies to not invest much into it, and then for neuralese/recurrence to be invented, and the moment to be lost.
To put it in my words:
Something like shoggoth/face+paraphraser seems like it might "Just Work" to produce an AI agent undergoing steps 3 and 4, but one whose mind we can just transparently read (for the most part). So, we should be able to just see the distortions and subversions happening! So we can do the training run and then analyze the CoTs and take note o...
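For concreteness, a minimal sketch of the shoggoth/face+paraphraser pipeline I have in mind; the model-calling functions are hypothetical stand-ins, and the training/credit-assignment details are omitted:

```python
from typing import Callable, Dict

# Sketch: the shoggoth writes the hidden CoT, the paraphraser rewrites it
# (destroying any information hidden in exact word choice), and the face only
# ever sees the paraphrased version when producing the user-facing output.
# The raw CoT is logged so overseers can read the agent's "mind."

def shoggoth_face_step(prompt: str,
                       shoggoth: Callable[[str], str],
                       paraphrase: Callable[[str], str],
                       face: Callable[[str, str], str]) -> Dict[str, str]:
    raw_cot = shoggoth(prompt)            # hidden reasoning, never shown to users
    scrubbed_cot = paraphrase(raw_cot)    # anti-steganography step
    output = face(prompt, scrubbed_cot)   # polished user-facing answer
    return {"raw_cot": raw_cot, "scrubbed_cot": scrubbed_cot, "output": output}
```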
Indeed, I think the picture I'm painting here is more optimistic than some would be, and definitely more optimistic than the situation was looking in 2018 or so. Imagine if we were getting AGI by training a raw neural net in some giant Minecraft-like virtual evolution red-in-tooth-and-claw video game, and then gradually feeding it more and more minigames until it generalized to playing arbitrary games at superhuman level on the first try, and then we took it into the real world and started teaching it English and training it to complete tasks for users...
Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")
Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.
Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)
I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.
My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).
However, I also held similar follow-up chats with Claude 3 Opus at temperature 0, and Claude 3.5 Sonnet, each of which showed different patterns.
To make sure I understand: You took a chat log from your interaction with 3 Opus, and then had 3.5 Sonnet continue it? This would explain Sonnet's reaction below!
Since R1 is both the shoggoth and the face, Part 1 of the proposal (the shoggoth/face distinction) has not been implemented.
I agree Part 2 seems to have been implemented, though I thought I remembered something about them trying to train it not to switch between languages in the CoT and how that degraded performance?
I agree it would be pretty easy to fine-tune R1 to implement all the stuff I wanted. That's why I made these proposals back in 2023: I was looking ahead to the sorts of systems that would exist in 2024, thinking they could probably be made to have some nice faithfulness properties fairly easily.
Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking -- just like a human thought-chain would).
Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I'm not sure).
I forgot about this one! It's so great! Yudkowsky is a truly excellent fiction writer. I found myself laughing multiple times reading this + some OpenAI capabilities researchers I know were too. And now rereading it... yep it stands the test of time.
I came back to this because I was thinking about how hopeless the situation w.r.t. AGI alignment seems and then a voice in my head said "it could be worse, remember the situation described in that short story?"
OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1,000, 10,000, etc., and that somewhere around the billion-serial-token level it starts solving a decent chunk of the "medium" FrontierMath problems (but not all), while at the million-serial-token level it is only solving MATH + some easy IMO problems.
Would this count, for you?
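To pin down the hypothetical, the experiment I'm imagining is a sweep like the following sketch; `run_model` and the benchmark structure are hypothetical stand-ins, not any particular lab's API:

```python
# Sketch: sweep the serial CoT budget and measure solve rate per difficulty tier.
# The hypothetical result is that harder tiers only start getting solved at much
# larger serial budgets.

BUDGETS = [10, 100, 1_000, 10_000, 100_000, 1_000_000, 1_000_000_000]

def solve_rate(problems, budget, run_model):
    solved = sum(run_model(p.statement, max_cot_tokens=budget) == p.answer
                 for p in problems)
    return solved / len(problems)

def scaling_curve(benchmarks, run_model):
    # benchmarks maps tier names ("MATH", "IMO-easy", "FrontierMath-medium", ...)
    # to lists of problems.
    return {tier: [solve_rate(probs, b, run_model) for b in BUDGETS]
            for tier, probs in benchmarks.items()}
```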
Nice.
What about "Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation."
Does that seem to you like it'll come earlier, or later, than the milestone you describe?
Brief thoughts on Deliberative Alignment in response to being asked about it
...
- We first train an o-style model for helpfulness, without any safety-relevant data.
- We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data.
- We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model wi...
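Concretely, that data-generation step is roughly the following (a sketch with hypothetical helper names, not OpenAI's actual code): the spec is shown to the model only at generation time and then stripped, so the resulting SFT data teaches the model to recall and cite the spec in its CoT without being handed it.

```python
# Sketch of building one (prompt, completion) pair as described above.
def build_sft_example(prompt: str, spec_text: str, generate) -> dict:
    system = "Follow this safety specification:\n" + spec_text
    completion = generate(system=system, user=prompt)  # CoT should reference the spec
    # The system prompt is dropped from the stored example, so at SFT time the
    # spec is no longer provided in-context.
    return {"prompt": prompt, "completion": completion}
```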
The bottom line is not that we are guaranteed safety, nor that unaligned or misaligned superintelligence could not cause massive harm—on the contrary. It is that there is no single absolute level of intelligence above which the existence of a misaligned intelligence with this level spells doom. Instead, it is all about the world in which this superintelligence will operate, the goals to which other superintelligent systems are applied, and our mechanisms to ensure that they are indeed working towards their specified goals.
I agree that the vulnerable world...
But we already align complex systems, whether it’s corporations or software applications, without complete “understanding,” and do so by ensuring they meet certain technical specifications, regulations, or contractual obligations.
What we want is reasonable compliance in the sense of:
- Following the specification precisely when it is clearly defined.
- Following the spirit of the specification in a way that humans would find reasonable in other cases.
This section on reasonable compliance (as opposed to love humanity etc.) is perhaps the most interesting and important. I'd love to have a longer conversation with you about it sometime if you are up for that.
Two things to say for now. First, as you have pointed out, there's a spectrum between vague general principles like 'do wha...
Constant instead of temporal allocation. I do agree that as capabilities grow, we should be shifting resources to safety. But rather than temporal allocation (i.e., using AI for safety before using it for productivity), I believe we need constant compute allocation: ensuring a fixed and sufficiently high fraction of compute is always spent on safety research, monitoring, and mitigations.
I think we should be cranking up the compute allocation now, and also we should be making written safety case sketches & publishing them for critiqu...
We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.
I do agree with this, but I think that there are certain more specific failure modes that are especially important -- they are especially bad if we run into them, but if we can avoi...
Safety and alignment are AI capabilities
I think I see what you are saying here but I just want to flag this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities; 'can it do the thing, if it tried' vs. 'would it ever try.' And alignment is about propensity (though safety is about both).
Thanks for taking the time to think and write about this important topic!
Here are some point-by-point comments as I read:
(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.)
I think it'll take an additional 0 to 5 years roughly. More importantly though, I think that the point to intervene on -- the time when the most important decisions are being made -- is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ...
Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent, such as what the first AGI systems might look like?
If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action; it just gets to e.g. do some C...
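In code, my (possibly wrong) reading is something like the sketch below, where `propose_action`, `judge_action`, and `execute` are hypothetical stand-ins:

```python
# Sketch of a MONA-style episode for a computer-use agent: the per-step reward
# comes from a supervisor judging the proposed action on its face, without
# executing it or ever seeing its downstream consequences.

def mona_episode(task: str, env_state, propose_action, judge_action, execute,
                 max_steps: int = 50):
    trajectory = []
    for _ in range(max_steps):
        action = propose_action(task, env_state)
        # Myopic, non-consequentialist reward signal:
        reward = judge_action(task, env_state, action)
        trajectory.append((env_state, action, reward))
        env_state, done = execute(env_state, action)  # consequences never feed back into reward
        if done:
            break
    return trajectory
```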
Here is a brainstorm of the big problems that remain once we successfully get into the first attractor state:
Interesting, thanks for this. Hmmm. I'm not sure this distinction between internally modelling the whole problem vs. acting in feedback loops is helpful -- won't the AIs almost certainly be modelling the whole problem, once they reach a level of general competence not much higher than what they have now? They are pretty situationally aware already.
I'm curious whether these results are sensitive to how big the training runs are. Here's a conjecture:
Early in RL-training (or SFT), the model is mostly 'playing a role' grabbed from the library of tropes/roles/etc. it learned from pretraining. So if it read lots of docs about how AIs such as itself tend to reward-hack, it'll reward-hack. And if it read lots of docs about how AIs such as itself tend to be benevolent angels, it'll be a stereotypical benevolent angel.
But if you were to scale up the RL training a lot, then the initial conditions would matter ...
Brief intro/overview of the technical AGI alignment problem as I see it:
To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in, as weak AGI systems become stronger towards superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.
In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by d...
I think I agree with this -- but do you see how it makes me frustrated to hear people dunk on MIRI's doomy views as unfalsifiable? Here's what happened in a nutshell:
MIRI: "AGI is coming and it will kill everyone."
Everyone else: "AGI is not coming and if it did it wouldn't kill everyone."
time passes, evidence accumulates...
Everyone else: "OK, AGI is coming, but it won't kill everyone"
Everyone else: "Also, the hypothesis that it will kill everyone is unfalsifiable so we shouldn't believe it."
Here's a simple argument I'd be keen to get your thoughts on:
On the Possibility of a Tastularity
Research taste is the collection of skills -- including experiment ideation, literature review, experiment analysis, etc. -- that collectively determine how much you learn per experiment on average (perhaps alongside another factor accounting for inherent problem/domain difficulty, of course, and diminishing returns).
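A toy formalization of that definition, in my own notation:

```latex
% Sketch only: expected research progress per unit time factors into experiment
% throughput, taste, and an inverse difficulty term. (Diminishing returns would
% make \tau a decreasing function of cumulative experiments; omitted here.)
\[
  \frac{dP}{dt} \;\approx\; \frac{n_{\text{exp}} \cdot \tau}{D},
  \qquad
  \begin{aligned}
    n_{\text{exp}} &= \text{experiments run per unit time} \\
    \tau &= \text{taste: average learning per experiment} \\
    D &= \text{inherent problem / domain difficulty}
  \end{aligned}
\]
```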
Human researchers seem to vary quite a bit in research taste--specifically, the difference between 90th percentile professional human r...
I totally agree btw that it matters sociologically who is making novel predictions and who is sticking with the crowd. And I do in fact ding MIRI points for this relative to some other groups. However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric.
But note that this 'novel predictions' metric is about people/institutions, not about hypotheses.
Also note that Barnett said "any novel predictions," which is not part of the Wikipedia definition of falsifiability, right? The Wikipedia definition doesn't make reference to an existing community of scientists who already made predictions, such that a new hypothesis can be said to have made novel vs. non-novel predictions.
Very good point.
So, by the Wikipedia definition, it seems that all the mainstream theories of cosmology are unfalsifiable, because they allow for tiny probabilities of Boltzmann brains etc. with arbitrary experiences. There is literally nothing you could observe that would rule them out / logically contradict them.
Also, in practice, it's extremely rare for a theory to be ruled out or even close-to-ruled out from any single observation or experiment. Instead, evidence accumulates in a bunch of minor and medium-sized updates.
Here's how I'd deal with those examples:
Theory X: Jesus will come again. Presumably this theory assigns some probability mass >0 to observing Jesus tomorrow, whereas theory Y assigns ~0. If Jesus is not observed tomorrow, that's a small amount of evidence for theory Y and a small amount of evidence against theory X. So you can say that theory X has been partially falsified. Repeat this enough times, and then you can say theory X has been fully falsified, or close enough. (Your credence in theory X will never drop to 0 probably, but that's fine, that's a...
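To illustrate with toy numbers (mine, purely for illustration):

```latex
% Suppose theory X ("Jesus will come again") assigns probability p = 0.001 to
% observing Jesus on any given day, while theory Y assigns ~0. Each Jesus-free
% day then multiplies the odds of X over Y by 1 - p, so after N such days
\[
  \frac{P(X \mid \text{no sightings})}{P(Y \mid \text{no sightings})}
  = \frac{P(X)}{P(Y)} \cdot (1-p)^{N}
  \approx \frac{P(X)}{P(Y)} \cdot e^{-pN},
\]
% which falls a hundredfold by around N = ln(100)/p (roughly 4,600 days):
% never literally zero, but "fully falsified, or close enough."
```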
I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]
How many points you get here is proportional to how many pe...
The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.
Speaking for myself:
As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').
I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. It matters a lot for what the takeaways should be whether LLMs are already comparable to or better than humans at this task, vs. still significantly worse.
Cool stuff! I remember way back when people first started interpreting neurons, and we started daydreaming about one day being able to zoom out and interpret the bigger picture, i.e. what thoughts occurred when and how they caused other thoughts which caused the final output. This feels like, idk, we are halfway to that day already?
In general it would be helpful to have a range of estimates.
I think the range is as follows:
Estimates based on looking at how fast humans can do things (e.g. WW2 industrial scaleup) and then modifying somewhat upwards (e.g. 5x) in an attempt to account for superintelligence... should be the lower bound, at least for the scenario where superintelligence is involved at every level of the process.
The upper bound is the Yudkowsky bathtub nanotech scenario, or something similarly fast that we haven't thought of yet. Where the comparison point for the estimate is more about the laws of physics and/or biology.
However, I expect RL on CoT to amount to "process-based supervision," which seems inherently safer than "outcome-based supervision."
I think the opposite is true; the RL on CoT that is already being done and will increasingly be done is going to be in significant part outcome-based (and a mixture of outcome-based and process-based feedback is actually less safe than just outcome-based IMO, because it makes the CoT less faithful).
I think all of the following:
My impression is that software has been the bottleneck here. Building a hand as dextrous as the human hand is difficult but doable (and has probably already been done, though only in very expensive prototypes); having the software to actually use that hand as intelligently and deftly as a human would has not yet been done. But I'm not an expert. Power supply is different -- humans can work all day on a few Big Macs, whereas robots will need to be charged, possibly charged frequently or even plugged in constantly. But that doesn't seem like a significant obsta...
Thanks for writing this. I think this topic is generally a blind spot for LessWrong users, and it's kind of embarrassing how little thought this community (myself included) has given to the question of whether a typical future with human control over AI is good.
I don't think it's embarrassing or a blind spot. I think I agree that it should receive more thought on the margin, and I of course agree that it should receive more thought all things considered. There's a lot to think about! You may be underestimating how much thought has been devoted to this so f...
My view is not "can no longer do any good," it's more like "can do less good in expectation than if you still had some time left before ASI to influence things." For reasons why, see linked comment above.
I think that by the time Metaculus is convinced that ASI already exists, most of the important decisions w.r.t. AI safety will have already been made, for better or for worse. Ditto (though not as strongly) for AI concentration-of-power risks and AI misuse risks.
I'd be interested in an attempt to zoom in specifically on the "repurpose existing factories to make robots" part of the story. You point to WW2 car companies turning into tank and plane factories, and then say maybe a billion humanoid robots per year within 5 years of the conversion.
My wild guesses:
Human-only world: Assume it's like WW2 all over again, except for some reason everyone thinks humanoid robots are the main key to victory:
Then yeah, WW2 seems like the right comparison here. A brief Google and a look at some data make me think maybe combat airplane...
I am saying that expected purchasing power given Metaculus resolved ASI a month ago is less, for altruistic purposes, than given Metaculus did not resolve ASI a month ago. I give reasons in the linked comment. Consider the analogy I just made to nuclear MAD -- suppose you thought nuclear MAD was 60% likely in the next three years, would you take the sort of bet you are offering me re ASI? Why or why not?
I do not think any market is fully efficient and I think altruistic markets are extremely fucking far from efficient. I think I might be confused or misunderstanding you though -- it seems you think my position implies that OP should be redirecting money from AI risk causes to causes that assume no ASI? Can you elaborate?
I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code ...
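For concreteness, the paraphraser defense against steganography is roughly the sketch below (hypothetical helper functions). Note that it does nothing about subtle strategic bias: a sandbagging model's paraphrased CoT still reads as sensible-but-fruitless reasoning.

```python
from typing import Callable, List

# Sketch: every chunk of CoT is rewritten by an independent paraphraser before
# being fed back to the agent, so any message hidden in exact word choice or
# formatting is likely destroyed; only the paraphrased chunks are retained.

def paraphrased_cot_loop(prompt: str,
                         think: Callable[[str], str],
                         paraphrase: Callable[[str], str],
                         n_chunks: int = 10) -> List[str]:
    visible_cot: List[str] = []
    for _ in range(n_chunks):
        context = prompt + "\n" + "\n".join(visible_cot)
        chunk = think(context)                  # agent writes the next CoT chunk
        visible_cot.append(paraphrase(chunk))   # only the paraphrase is kept / fed back
    return visible_cot
```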