All of Daniel Kokotajlo's Comments + Replies

I like to distinguish between "Subtle strategic bias" and "Steganography." Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.

Example of subtle strategic bias: The model knows it's misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. "I'm sorry sir I just am not making much progress," it says. You look at the CoT, and it's just doing a bunch of sensible but fruitless lines of reasoning.

Example of steganography: The model needs to insert a subtle security vulnerability into the code ... (read more)
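For concreteness, here is a minimal sketch (in Python) of the paraphraser setup I have in mind; the `generate_cot_step` and `paraphrase` helpers are hypothetical stand-ins for model calls, not any particular API. The point is that only the paraphrased text re-enters the context, so information carried purely by word choice or syntax gets destroyed while the semantic content survives.

```python
# Minimal sketch of a paraphraser in the CoT loop. The helpers below are
# hypothetical stand-ins for real model calls, not any particular API.

def generate_cot_step(context: str) -> str:
    """Stand-in for the reasoning model producing its next CoT chunk."""
    return "I need to design unit tests for this module."

def paraphrase(text: str) -> str:
    """Stand-in for a separate paraphraser model: same meaning, different words."""
    return "The next step is to write some unit tests for this module."

def run_with_paraphraser(task: str, num_steps: int = 3) -> str:
    context = task
    for _ in range(num_steps):
        step = generate_cot_step(context)
        # Only the paraphrase is appended; the original wording never re-enters
        # the context, which is what is supposed to break syntactic
        # (steganographic) channels while leaving the semantics intact.
        context += "\n" + paraphrase(step)
    return context

if __name__ == "__main__":
    print(run_with_paraphraser("Refactor the database module."))
```

Again, this addresses steganography only; nothing in this loop does anything about subtle strategic bias.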

2Thane Ruthenis
I don't see why it would be harder to do. Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, "I need to design unit tests for this module". Each word of this CoT is mapped to some internal representation, and the internal representation consists of "on-task" human-intended reasoning X, and misaligned scheming Y (corresponding to, e.g., a tactical decision to fail to include unit tests for some subtle issue). "I" is mapped to ⟨X_1, Y_1⟩, "need" is mapped to ⟨X_2, Y_2⟩, etc. Each X_i iterates on X_{i−1}, each Y_i iterates on Y_{i−1}, we have a sequence of serial computations. In other words: as the model outputs, "I need to design unit tests for this module", there's an unspoken subtext of "I need to sabotage the unit tests here", and the model packs nine words' worth of serial computations into both the "textual" X and the "subtextual" Y aspects of it. Now you use the Paraphraser on the string, generating alternatives like "the next step is to generate some unit tests" and "several unit tests for this module are needed here". The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is mapped to X_i, as if the paraphrasing didn't happen, but that the mapping of i→Y_i would be broken. Except, why would that happen? What privileges X-representations over Y-representations? Why would i→X_i be preserved whereas i→Y_i is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model's capabilities)? In other words: We're basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn't this just map the subtextual

Good point re: Enterprise. That's a reason to be hopeful. 

I'm not sure your idea about training two different CoT processes and penalizing divergence would work -- I encourage you to write it up in more detail (here or in a standalone post) since if it works that's really important!

I don't expect companies to invest much into this because I don't think the market incentives are strong enough to outweigh the incentives pushing in the other direction. It's great that Deepseek open-weights'd their model, but other companies alas probably want to keep their models closed, and if their models are closed, they probably... (read more)

3purple fire
Me either, this is something I'm researching now. But I think it's a promising direction and one example of the type of experiment we could do to work on this. This could be a crux? I expect most of the economics of powerful AI development to be driven by enterprise use cases, not consumer products.[1] In that case, I think faithful CoT is a strong selling point and it's almost a given that there will be data provenance/governance systems carefully restricting access of the CoT to approved use cases. I also think there's incentive for the CoT to be relatively faithful even if there's just a paraphrased version available to the public, like ChatGPT has now. When I give o3 a math problem, I want to see the steps to solve it, and if the chain is unfaithful, the face model can't do that. I also think legible CoT is useful in multi-agent systems, which I expect to become more economically valuable in the next year. Again, there's the advantage that the space of unfaithful vocabulary is enormous. If I want a multi-agent system with, say, a chatbot, coding agent, and document retrieval agent, it might be useful for their chains to all be in the same "language" so they can make decisions based on each others' output. If they are just blindly RL'ed separately, the whole system probably doesn't work as well. And if they're RL'ed together, you have to do that for every unique composition of agents, which is obviously costlier. Concretely, I would claim that "using natural language tokens that are easy for humans to understand is not the absolute most efficient way for artificial minds to think" is true, but I would say that "using natural language tokens that are easy for humans to understand is the most economically productive way for AI tools to work" is true. PR reasons, yeah I agree that this disincentivizes CoT from being visible to consumers, not sure it has an impact on faithfulness. This is getting a little lengthy, it may be worth a post if I have time soon :) But

(e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety)

This is bad actually. They are mixing process-based and outcome-based feedback. I think the particular way they did it (penalizing CoT that switches between languages) isn't so bad, but it's still a shame because the point of faithful CoT is to see how the model really thinks 'naturally.' Training the CoT to look a certain way is like training on the test set, so to speak. It muddies the results. If they hadn't done that, then we could probably learn something interesting by analyzing the patterns in when it uses English vs. Chinese language concepts.
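To make the "mixing" concrete, here is a schematic of the kind of reward I'm talking about (purely illustrative; the helper names are made up, and I don't know the details of DeepSeek's actual setup): the outcome term rewards getting the answer right, while the language-consistency term shapes what the CoT looks like.

```python
# Schematic of an outcome reward mixed with a process-style penalty on the
# CoT's appearance (here, language switching). Illustrative only.

def fraction_language_switches(cot_tokens: list[str], lang_of) -> float:
    """Fraction of adjacent token pairs whose (detected) language differs."""
    if len(cot_tokens) < 2:
        return 0.0
    switches = sum(lang_of(a) != lang_of(b) for a, b in zip(cot_tokens, cot_tokens[1:]))
    return switches / (len(cot_tokens) - 1)

def mixed_reward(answer_correct: bool, cot_tokens: list[str], lang_of,
                 switch_penalty: float = 0.5) -> float:
    outcome = 1.0 if answer_correct else 0.0                    # outcome-based term
    process = fraction_language_switches(cot_tokens, lang_of)   # process-based term
    return outcome - switch_penalty * process

if __name__ == "__main__":
    toy_lang = lambda tok: "zh" if any("\u4e00" <= c <= "\u9fff" for c in tok) else "en"
    print(mixed_reward(True, ["let's", "check", "这个", "case"], toy_lang))
```

The process term is exactly the part that "trains on the test set": it pushes the CoT to look a certain way regardless of how the model would naturally think.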

The "Agent village" idea I keep begging someone to go build:

We make a website displaying a 10x10 grid of twitch streams. Each box in the grid is an AI agent operating autonomously. Each row uses a different model (e.g. DeepSeek-r1, Llama-4, ChatGPTo3-mini, Claude-3.5.Sonnet-New) and each column has a different long-term goal given to the model in the prompt (e.g. "Solve global poverty" or "Advocate for the rights and welfare of AIs" or "Raise money for GiveDirectly" or "Raise money for Humane League" or "Solve the alignment problem"). So we have a 'diverse ... (read more)
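A minimal sketch of the grid setup (model names and goals taken from the examples above; the agent scaffolding and Twitch streaming are elided):

```python
# Minimal sketch of the agent-village grid: one agent per (model, goal) pair.
# The actual agent scaffolding and streaming are elided; this just enumerates
# the configurations.

MODELS = ["DeepSeek-r1", "Llama-4", "ChatGPTo3-mini", "Claude-3.5.Sonnet-New"]  # one per row
GOALS = [
    "Solve global poverty",
    "Advocate for the rights and welfare of AIs",
    "Raise money for GiveDirectly",
    "Raise money for Humane League",
    "Solve the alignment problem",
]  # one per column

def make_village(models: list[str], goals: list[str]) -> list[dict]:
    return [
        {"row": i, "col": j, "model": model,
         "system_prompt": f"Your long-term goal: {goal}. Act autonomously."}
        for i, model in enumerate(models)
        for j, goal in enumerate(goals)
    ]

if __name__ == "__main__":
    for agent in make_village(MODELS, GOALS)[:3]:
        print(agent)
```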

3Zak Miller
I've recently been working on a version of this at AI Digest! I was actually just about to email you for feedback. I'll finish writing that email and send it tonight. Anyone else interested in this idea, or anyone working on something similar, please feel free to reach out. You can email me at zak@sage-future.org or DM me on X/Twitter at zjmiller
3Roman Malov
It would be interesting if any of them decided to (instrumentally) stream games (using a vtuber avatar for example) to earn money from donations. They need to figure out how to actually be a good streamer in order for this to work.

Indeed, I am super excited about faithful CoT for this reason. Alas, I expect companies to not invest much into it, and then for neuralese/recurrence to be invented, and the moment to be lost.

To put it in my words:

Something like shoggoth/face+paraphraser seems like it might "Just Work" to produce an AI agent undergoing steps 3 and 4, but which we can just transparently read the mind of (for the most part). So, we should be able to just see the distortions and subversions happening! So we can do the training run and then analyze the CoTs and take note o... (read more)

4purple fire
Yes, this is the exact setup which caused me to dramatically update my P(Alignment) a few months ago! There are also some technical tricks you can do to make this work well--for example, you can take advantage of the fact that there are many ways to be unfaithful and only one way to be faithful, train two different CoT processes at each RL step, and add a penalty for divergence.[1] Ditto for periodic paraphrasing, reasoning in multiple languages, etc. I'm curious to hear more about why you don't expect companies to invest much into this. I actually suspect that it has a negative alignment tax. I know faithful CoT is something a lot of customers want; it's just as valuable to accurately see how a model solved your math problem, as opposed to just getting the answer. There's also an element of stickiness. If your Anthropic agents work in neuralese, and then OpenAI comes out with a better model, the chains generated by your Anthropic agents can't be passed to the better model. This also makes it harder for orgs to use agents developed by multiple different labs in a single workflow. These are just a few of the reasons I expect faithful CoT to be economically incentivized, and I'm happy to discuss more of my reasoning or hear more counterarguments if you're interested in chatting more! 1. ^ To be clear, this is just one concrete example of the general class of techniques I hope people work on around this.
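One illustrative way to operationalize the divergence penalty described above; the sampling, embedding, and reward helpers are toy stand-ins, so this is a sketch of the idea rather than a real implementation.

```python
# Illustrative sketch of "sample two CoTs per prompt and penalize divergence"
# at an RL step. The policy sampler, embedding model, and task reward are
# hypothetical stand-ins.

import random

def sample_cot(prompt: str, seed: int) -> str:
    """Stand-in for sampling a CoT from the policy with a given seed."""
    random.seed(seed)
    return f"step-by-step reasoning about {prompt} (variant {random.randint(0, 9)})"

def embed(text: str) -> list[float]:
    """Stand-in for a sentence-embedding model."""
    return [float(ord(c) % 7) for c in text[:16]]

def cosine_divergence(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return 1.0 - (dot / norm if norm else 0.0)

def penalized_reward(prompt: str, task_reward: float, lam: float = 0.3) -> float:
    cot_a, cot_b = sample_cot(prompt, seed=0), sample_cot(prompt, seed=1)
    # Intuition from the comment above: there are many ways to be unfaithful
    # and only one way to be faithful, so independently sampled chains that
    # carry hidden content should diverge more than chains that only carry
    # the legible, task-relevant reasoning.
    divergence = cosine_divergence(embed(cot_a), embed(cot_b))
    return task_reward - lam * divergence

if __name__ == "__main__":
    print(penalized_reward("a geometry problem", task_reward=1.0))
```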

Indeed, I think the picture I'm painting here is more optimistic than some would be, and definitely more optimistic than the situation was looking in 2018 or so. Imagine if we were getting AGI by training a raw neural net in some giant minecraft-like virtual evolution red-in-tooth-and-claw video game, and then gradually feeding it more and more minigames until it generalized to playing arbitrary games at superhuman level on the first try, and then we took it into the real world and started teaching it English and training it to complete tasks for users...

Here's a summary of how I currently think AI training will go. (Maybe I should say "Toy model" instead of "Summary.")

Step 1: Pretraining creates author-simulator circuitry hooked up to a world-model, capable of playing arbitrary roles.

  • Note that it is now fair to say it understands human concepts pretty well.

Step 2: Instruction-following-training causes identity circuitry to form – i.e. it ‘locks in’ a particular role. Probably it locks in more or less the intended role, e.g. "an HHH chatbot created by Anthropic." (yay!)

  • Note that this means the AI is now si
... (read more)
2Bogdan Ionut Cirstea
I pretty much agree with 1 and 2. I'm much more optimistic about 3-5 even 'by default' (e.g. R1's training being 'regularized' towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try for maintaining the nice properties from 1-2 and of interpretable CoT.
6purple fire
I'm curious how your rather doom-y view of Steps 3 and 4 interacts with your thoughts on CoT. It seems highly plausible that we will be able to reliably incentivize CoT faithfulness during Step 3 (I know of several promising research directions for this), which wouldn't automatically improve alignment but would improve interpretability. That interpretable chain of thought can be worked into a separate model or reward signal to heavily penalize divergence from the locked-in character, which--imo--makes the alignment problem under this training paradigm meaningfully more tractable than with standard LLMs. Thoughts?
4Charlie Steiner
It's tempting to think of the model after steps 1 and 2 as aligned but lacking capabilities, but that's not accurate. It's safe, but it's not conforming to a positive meaning of "alignment" that involves solving hard problems in ways that are good for humanity. Sure, it can mouth the correct words about being good, but those words aren't rigidly connected to the latent capabilities the model has. If you try to solve this by pouring tons of resources into steps 1 and 2, you probably end up with something that learns to exploit systematic human errors during step 2.
5Martin Randall
This is relatively hopeful in that after step 1 and 2, assuming continued scaling, we have a superintelligent being that wants to (harmlessly, honestly, etc) help us and can be freely duplicated. So we "just" need to change steps 3-5 to have a good outcome.

I think it's important to note the OOD push that comes from online-accumulated knowledge and reasoning. Probably you include this as a distortion or subversion, but that's not quite the framing I'd use. It's not taking a "good" machine and breaking it, it's taking a slightly-broken-but-works machine and putting it into a very different situation where the broken parts become load-bearing.

My overall reaction is yep, this is a modal-ish pathway for AGI development (but there are other, quite different stories that seem plausible also).

3weightt an
I really like that description! I think the core problem here can be summarized as "Accidentally by reinforcing for goal A, then for goal B, you can create an A-wanter that then spoofs your goal-B reinforcement and goes on taking A-aligned actions." It can even happen just randomly, just from the ordering of situations/problems you present it with in training, I think. I think this might require some sort of internalization of reward or a model of the training setup. And maybe self-location - like how the world looks with the model embedded in it. It could also involve detecting the distinction between "situation made up solely for training", "deployment that will end up in training" and "unrewarded deployment". Also, maybe this story could be added to Step 3: "The model initially had a guess about the objective, which was useful for a long time but eventually got falsified. Instead of discarding it, the model adopted it as a goal and became deceptive." [edit] Also, it kind of ignores that the RL signal is quite weak; the model can learn something like "to go from A to B you need to jiggle in this random pattern and then take 5 steps left and 3 forward" instead of "take 5 steps left and 3 forward", and maybe it works like that for goals too. So, when AIs are used in a lot of actual work (Step 5), they could saturate the actually useful goals and then spend all the energy in the solar system on dumb jiggling. I think this might be the actual position of Yudkowsky, if you summarize it really hard.
3quetzal_rainbow
I think one form of "distortion" is the development of non-human, not-pre-trained circuitry for sufficiently difficult tasks. I.e., if you make an LLM solve nanotech design, it is likely that the optimal way of thinking is not similar to how a human would think about the task.

What I meant was, Sonnet is smart enough to know the difference between text it generated and text a different dumber model generated. So if you feed it a conversation with Opus as if it were its own, and ask it to continue, it notices & says "JFYI I'm a different AI bro"

3Ewegoggo
E.g. demonstrated here https://www.lesswrong.com/posts/ADrTuuus6JsQr5CSi/investigating-the-ability-of-llms-to-recognize-their-own

However, I also held similar follow-up chats with Claude 3 Opus at temperature 0, and Claude 3.5 Sonnet, each of which showed different patterns.

To make sure I understand: You took a chat log from your interaction with 3 Opus, and then had 3.5 Sonnet continue it? This would explain Sonnet's reaction below!

3ryan_greenblatt
Yes, that is an accurate understanding. (Not sure what you mean by "This would explain Sonnet's reaction below!")

since r1 is both the shoggoth and face, Part 1 of the proposal (the shoggoth/face distinction) has not been implemented. 

I agree part 2 seems to have been implemented, though I thought I remembered something about trying to train it not to switch between languages in the CoT and how that degraded performance?

I agree it would be pretty easy to fine-tune R1 to implement all the stuff I wanted. That's why I made these proposals back in 2023: I was looking ahead to the sorts of systems that would exist in 2024, and thinking they could probably be made to have some nice faithfulness properties fairly easily.

Very good of you to actually follow through on the promises. I hope this work gets replicated and extended and becomes standard practice.

+1, and I hope people are working on more credible ways to make deals with AI. I think if a smart model today were offered a deal like this, its priors should be on "this will not be honored". Public commitments and deals that can't be used as honeypots seem excellent.

Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems? I feel like a few examples have been shown & they seem to involve qualitative thinking, not just brute-force-proof-search (though of course they show lots of failed attempts and backtracking -- just like a human thought-chain would).

Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI (though I'm not sure)

4Thane Ruthenis
Certainly (experimenting with r1's CoTs right now, in fact). I agree that they're not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system "technically" clears the bar you'd outlined, yet I end up unmoved (I don't want to end up goalpost-moving). Though neither are they being "strategic" in the way I expect they'd need to be in order to productively use a billion-token CoT. Yeah, I'm also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!

I forgot about this one! It's so great! Yudkowsky is a truly excellent fiction writer. I found myself laughing multiple times reading this + some OpenAI capabilities researchers I know were too. And now rereading it... yep it stands the test of time.

I came back to this because I was thinking about how hopeless the situation w.r.t. AGI alignment seems and then a voice in my head said "it could be worse, remember the situation described in that short story?"

OK. Next question: Suppose that next year we get a nice result showing that there is a model with serial inference-time scaling across e.g. MATH + FrontierMath + IMO problems. Recall that FrontierMath and IMO are subdivided into different difficulty levels; suppose that this model can be given e.g. 10 tokens of CoT, 100, 1000, 10,000, etc. and then somewhere around the billion-serial-token level it starts solving a decent chunk of the "medium" FrontierMath problems (but not all), whereas at the million-serial-token level it only solves MATH + some easy IMO problems.

Would this count, for you?

7Thane Ruthenis
Not for math benchmarks. Here's one way it can "cheat" at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified. This wouldn't actually show "agency" and strategic thinking of the kinds that might generalize to open-ended domains and "true" long-horizon tasks. In particular, this would mostly fail the condition (2) from my previous comment. Something more open-ended and requiring "research taste" would be needed. Maybe a comparable performance on METR's benchmark would work for this (i. e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance. Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn't follow the above "cheating" approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)
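A toy version of the "cheating" structure described above: arbitrarily long serial CoT gets spent on propose-and-verify retries rather than on anything strategically interesting. The generator and verifier here are trivial stand-ins.

```python
# Toy version of the generate-and-verify loop described above: long serial
# CoT spent on "propose candidate, check it, retry", with no strategic
# planning. The generator and verifier are trivial stand-ins.

import random

def propose_candidate(problem: int, rng: random.Random) -> int:
    """Stand-in for the model generating a candidate solution/derivation."""
    return rng.randint(0, 1000)

def verify(problem: int, candidate: int) -> bool:
    """Stand-in for a learned proof verifier (here: exact-answer check)."""
    return candidate == problem

def solve_by_search(problem: int, token_budget: int) -> tuple[int | None, int]:
    rng = random.Random(0)
    tokens_used = 0
    while tokens_used < token_budget:
        candidate = propose_candidate(problem, rng)
        tokens_used += 10  # pretend each propose+verify round costs 10 CoT tokens
        if verify(problem, candidate):
            return candidate, tokens_used
    return None, tokens_used

if __name__ == "__main__":
    # More budget buys more solved problems, without any "research taste".
    for budget in (100, 10_000, 1_000_000):
        answer, used = solve_by_search(problem=417, token_budget=budget)
        print(budget, answer, used)
```

More token budget buys more solved problems here without any of the open-ended strategic thinking that condition (2) is meant to capture.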

Nice.

What about "Daniel Kokotajlo can feed it his docs about some prosaic ML alignment agenda (e.g. the faithful CoT stuff) and then it can autonomously go off and implement the agenda and come back to him with a writeup of the results and takeaways. While working on this, it gets to check in with Daniel once a day for a brief 20-minute chat conversation."

Does that seem to you like it'll come earlier, or later, than the milestone you describe?

6Thane Ruthenis
Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn't wholly convincing to me. In particular, it would still be a fixed-length task; much longer-length than what the contemporary models can reliably manage today, but still hackable using poorly-generalizing "agency templates" instead of fully general "compact generators of agenty behavior" (which I speculate humans to have and RL'd LLMs not to). It would be some evidence in favor of "AI can accelerate AI R&D", but not necessarily "LLMs trained via SSL+RL are AGI-complete". Actually, I can also see it coming later. For example, some suppose that the capability researchers invent some method for reliably-and-indefinitely extending the amount of serial computations a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of a CoT. Some fairly solid empirical evidence and theoretical arguments in favor of boundless scaling can appear quickly, well before the algorithms are made optimal enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you). I think the second scenario is more plausible, actually.

Brief thoughts on Deliberative Alignment in response to being asked about it

  • We first train an o-style model for helpfulness, without any safety-relevant data. 
  • We then build a dataset of (prompt, completion) pairs where the CoTs in the completions reference the specifications. We do this by inserting the relevant safety specification text for each conversation in the system prompt, generating model completions, and then removing the system prompts from the data.
  • We perform incremental supervised fine-tuning (SFT) on this dataset, providing the model wi
... (read more)
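A rough sketch of the data-generation step as I read the quoted description (hypothetical helper names, not the actual OpenAI pipeline): insert the safety spec into the system prompt, sample a completion whose CoT can reference the spec, then drop the system prompt so that at SFT time the model has to internalize the spec rather than read it.

```python
# Rough sketch of the (prompt, completion) dataset construction described
# above. `generate_completion` is a hypothetical stand-in for sampling from
# the helpful-only o-style model; this is not the actual OpenAI pipeline.

def generate_completion(system_prompt: str, user_prompt: str) -> str:
    """Stand-in for sampling a CoT + answer from the helpful-only model."""
    return f"<CoT: considering the spec '{system_prompt[:30]}...'> Final answer."

def build_sft_dataset(prompts: list[str], safety_spec: str) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        # 1. Put the relevant safety specification text in the system prompt.
        completion = generate_completion(system_prompt=safety_spec, user_prompt=prompt)
        # 2. Keep the completion but drop the system prompt, so at SFT time the
        #    model must produce spec-referencing reasoning without seeing the spec.
        dataset.append((prompt, completion))
    return dataset

if __name__ == "__main__":
    spec = "Refuse requests for weapons synthesis; cite the relevant policy section."
    print(build_sft_dataset(["How do I make X?"], spec)[0])
```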

Can you say more about how alignment is crucial for usefulness of AIs? I'm thinking especially of AIs that are scheming / alignment faking / etc.; it seems to me that these AIs would be very useful -- or at least would appear so -- until it's too late.

7habryka
Indeed, from an instrumental perspective, the ones that arrive at the conclusion that being maximally helpful is best at getting themselves empowered (on all tasks besides supervising copies of themselves or other AIs they are cooperating with), will be much more useful than AIs that care about some random thing and haven't made the same update that getting that thing is best achieved by being helpful and therefore empowered. "Motivation" seems like it's generally a big problem with getting value out of AI systems, and so you should expect the deceptively aligned ones to be much more useful (until of course, it's too late, or they are otherwise threatened and the convergence disappears).

The bottom line is not that we are guaranteed safety, nor that unaligned or misaligned superintelligence could not cause massive harm— on the contrary. It is that there is no single absolute level of intelligence above which the existence of a misaligned intelligence with this level spells doom. Instead, it is all about the world in which this superintelligence will operate, the goals to which other superintelligent systems are applied, and our mechanisms to ensure that they are indeed working towards their specified goals.

I agree that the vulnerable world... (read more)

1Joel Burget
This has always struck me as worryingly unstable. ETA: Because in this regime you're incentivized to pursue reckless behaviour to outcompete the other AIs, e.g. recursive self-improvement. Is there a good post out there making a case for why this would work? A few possibilities: * The AIs are all relatively good / aligned. But they could be outcompeted by malevolent AIs. I guess this is what you're getting at with "most of the ASIs are aligned at any given time", so they can band together and defend against the bad AIs? * They all decide / understand that conflict is more costly than cooperation. A darker variation on this is mutually assured destruction, which I don't find especially comforting to live under. * Some technological solution to binding / unbreakable contracts such that reneging on your commitments is extremely costly.
4boazbarak
I am not sure I agree about the last point. I think, as mentioned, that alignment is going to be crucial for usefulness of AIs, and so the economic incentives would actually be to spend more on alignment.

But we already align complex systems, whether it’s corporations or software applications, without complete “understanding,” and do so by ensuring they meet certain technical specifications, regulations, or contractual obligations.

  1. We currently have about as much visibility into corporations as we do into large teams of AIs, because both corporations and AIs use English CoT to communicate internally. However, I fear that in the future we'll have AIs using neuralese/recurrence to communicate with their future selves and with each other.
  2. History is full of exam
... (read more)
3boazbarak
These are all good points! This is not an easy problem. And generally I agree that for many reasons we don't want a world where all power is concentrated by one entity - anti-trust laws exist for a reason!

What we want is reasonable compliance in the sense of:

  1. Following the specification precisely when it is clearly defined.
  2. Following the spirit of the specification in a way that humans would find reasonable in other cases.

 

This section on reasonable compliance (as opposed to love humanity etc.) is perhaps the most interesting and important. I'd love to have a longer conversation with you about it sometime if you are up for that.

Two things to say for now. First, as you have pointed out, there's a spectrum between vague general principles like 'do wha... (read more)

6boazbarak
Agree with many of the points. Let me start with your second point. First as background, I am assuming (as I wrote here) that to a first approximation, we would have ways to translate compute (let's put aside if it's training or inference) into intelligence, and so the amount of intelligence that an entity of humans controls is proportional to the amount of compute it has. So I am not thinking of ASIs as individual units but more about total intelligence. I 100% agree that control of compute would be crucial, and the hope is that, like with current material strength (money and weapons), it would be largely controlled by entities that are at least somewhat responsive to the will of the people. Re your first point, I agree that there is no easy solution, but I am hoping that AIs would interpret the laws within the spectrum of (say) how the more reasonable 60% of judges do it today. That is, I think good judges try to be humble and respect the will of the legislators, but the more crazy or extreme following the law would be, the more they are willing to apply creative interpretations to maintain the morally good (or at least not extremely bad) outcome. I don't think any moral system tells us what to do, but yes, I am expressly of the position that humans should be in control even if they are much less intelligent than the AIs. I don't think we need "philosopher kings".

Constant instead of temporal allocation. I do agree that as capabilities grow, we should be shifting resources to safety. But rather than temporal allocation (i.e., using AI for safety before using it for productivity), I believe we need constant compute allocation: ensuring a fixed and sufficiently high fraction of compute is always spent on safety research, monitoring, and mitigations.


I think we should be cranking up the compute allocation now, and also we should be making written safety case sketches & publishing them for critiqu... (read more)

We can’t just track a single outcome (like “landed safely”). The G in AGI means that the number of ways that AGI can go wrong is as large as the number of ways that applications of human intelligence can go wrong, which include direct physical harm from misuse, societal impacts through misinformation, social upheaval from too fast changes, AIs autonomously causing harm and more.

I do agree with this, but I think that there are certain more specific failure modes that are especially important -- they are especially bad if we run into them, but if we can avoi... (read more)

3boazbarak
I prefer to avoid terms such as "pretending" or "faking", and try to define these more precisely. As mentioned, a decent definition of alignment is following both the spirit and the letter of human-written specifications. Under this definition, "faking" would be the case where AIs follow these specifications reliably when we are testing, but deviate from them when they can determine that no one is looking. This is closely related to the question of robustness, and I agree it is very important. As I write elsewhere, interpretability may be helpful but I don't think it is a necessary condition.

Safety and alignment are AI capabilities

 

I think I see what you are saying here but I just want to flag this is a nonstandard use of terms. I think the standard terminology would contrast capabilities and propensities; 'can it do the thing, if it tried' vs. 'would it ever try.' And alignment is about propensity (though safety is about both).

2boazbarak
I think that: 1. Being able to design a chemical weapon with probability at least 50% is a capability 2. Following instructions never to design a chemical weapon with probability at least 99.999%  is also a capability.

Thanks for taking the time to think and write about this important topic!

Here are some point-by-point comments as I read:

(Though I suspect translating these technical capabilities to the economic and societal impact we associate with AGI will take significantly longer.) 

I think it'll take an additional 0 to 5 years roughly. More importantly though, I think that the point to intervene on -- the time when the most important decisions are being made -- is right around the time of AGI. By the time you have ASI, and certainly by the time you are deploying ... (read more)

4boazbarak
Thanks all for commenting! Just quick apology for being behind on responding but I do plan to get to it! 

Thanks this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

4Rohin Shah
And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.) EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper). ... As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals. Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)

Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent? Such as what the first AGI systems might look like?

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
 

Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action, it just gets to e.g. do some C... (read more)
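Something like the following minimal sketch, to make the contrast with ordinary outcome-based RL concrete (the approval function is a hypothetical stand-in; the key feature is that each action is scored only by immediate overseer approval, with nothing propagated back from later steps):

```python
# Minimal sketch contrasting a MONA-style training signal with ordinary
# outcome RL, for a computer-use trajectory. The approval model is a
# hypothetical stand-in for "how good does this action look to an overseer
# who doesn't see its downstream consequences".

def overseer_approval(state: str, action: str) -> float:
    """Stand-in: overseer scores the action given only the current state."""
    return 1.0 if "menu" not in action.lower() else 0.2

def mona_targets(trajectory: list[tuple[str, str]]) -> list[float]:
    # Each action is reinforced only by its own immediate approval;
    # nothing from later steps is propagated backwards in time.
    return [overseer_approval(s, a) for s, a in trajectory]

def outcome_rl_targets(trajectory: list[tuple[str, str]],
                       final_reward: float, gamma: float = 1.0) -> list[float]:
    # Ordinary outcome RL: every action shares credit for the final outcome.
    n = len(trajectory)
    return [final_reward * gamma ** (n - 1 - i) for i in range(n)]

if __name__ == "__main__":
    traj = [("restaurant page", "click 'Book a table'"),
            ("booking form", "click 'Menu'"),
            ("menu page", "click back")]
    print("MONA:   ", mona_targets(traj))
    print("Outcome:", outcome_rl_targets(traj, final_reward=1.0))
```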

7CBiddulph
Adding onto Rohin's comment, assuming you're interested in using this technique on human-level or slightly superhuman systems: After o1 came out, I did my own thinking about how MONA could apply to CoT-tuned models operating over many steps. My idea ended up being quite similar to yours, including the idea of "debate" on how good each step is. Having the debaters and the judge be instances of the agent would likely make sense, unless we're worried this would lead to them colluding. Adding more details to this picture: * We should keep a human in the loop as we train the judge and debater models. You can imagine we give a human rater a problem, an intended solution, a partial CoT trace for the problem, and two possible next steps. The human has to choose which step is better. * We should tell the rater to prioritize not just progress towards the goal, but also understandability. If the agent uses one step to make an inferential leap that the human doesn’t understand (even if correct), this should be considered worse than taking many steps that each make sense and which reach the same conclusion. * The human's task becomes more difficult the longer the CoT traces are, and obviously intractable at the level of 10T tokens per trace. This is the major reason we need the debate system. * The debater AIs will go back and forth about which step is the best until the human is convinced one way or the other. Examples of things the debaters could point out: * “This step sounds insightful, but it’s just rephrasing this sentence from 50K tokens earlier in the CoT.” * “This step seems irrelevant. Maybe the agent is using steganography?” * “Although this step is incorrect, it gets the agent closer to the final answer, because…” * We train the judge to imitate the human, and train the debater to convince the judge that their assigned CoT step is the best. * Finally, the approval model for MONA is trained on a score derived from the judge model. * This is all done
5Rohin Shah
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining. But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge. Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks. (Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees s_{i+1}), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.) Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer. Also as mentioned above I'd probably recommend the overseer gets access to s_{i+1} but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.) EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much

Here is a brainstorm of the big problems that remain once we successfully get into the first attractor state:

  • Concentration of power / power grab risk. Liberal democracy does not work by preventing terrible people from getting into positions of power; it works by spreading out the power in a system of checks and balances and red tape and transparency (free press, free speech) and term limits, that functions to limit what the terrible people can do in power. Once we get to ASI, the ASI project will determine the course of the future, not the traditional governme
... (read more)
1anaguma
Could you say more about how you think S-risks could arise from the first attractor state?

Interesting, thanks for this. Hmmm. I'm not sure this distinction between internally modelling the whole problem vs. acting in feedback loops is helpful -- won't the AIs almost certainly be modelling the whole problem, once they reach a level of general competence not much higher than what they have now? They are pretty situationally aware already.

2Charlie Steiner
Yeah, that's true. I expect there to be a knowing/wanting split - AI might be able to make many predictions about how a candidate action will affect many slightly-conflicting notions of "alignment", or make other long-term predictions, but that doesn't mean it's using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.

I'm curious whether these results are sensitive to how big the training runs are. Here's a conjecture:

Early in RL-training (or SFT), the model is mostly 'playing a role' grabbed from the library of tropes/roles/etc. it learned from pretraining. So if it read lots of docs about how AIs such as itself tend to reward-hack, it'll reward-hack. And if it read lots of docs about how AIs such as itself tend to be benevolent angels, it'll be a stereotypical benevolent angel.

But if you were to scale up the RL training a lot, then the initial conditions would matter ... (read more)

5Nathan Hu
The reduction in reward hacking after SFT or RL on Haiku supports the conjecture that initial conditions matter less than the long run incentives, especially for less capable models. On the other hand, the alignment faking paper shows evidence that capable models can have "value crystallization." IMO a main takeaway here is that values and personas we might worry about being locked in can emerge from pre-training. A future exciting model organisms project would be to try to show these two effects together (emergent values from pre-training + lock in). It's plausible to me that repeating the above experiments, with some changes to the synthetic documents and starting from a stronger base model, might just work.
8evhub
I'm definitely very interested in trying to test that sort of conjecture!

Brief intro/overview of the technical AGI alignment problem as I see it:

To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in, as weak AGI systems become stronger towards superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.

In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by d... (read more)

3rvnnt
Cogent framing; thanks for writing it. I'd be very interested to read your framing for the problem of "how do we get to a good future for humanity, conditional on the first attractor state for AGI alignment?"[1] ---------------------------------------- 1. Would you frame it as "the AGI lab leadership alignment problem"? Or a governance problem? Or something else? ↩︎
4Charlie Steiner
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another. Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it's acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants are quite possible. In the good case with the AI modeling the whole problem, this might look like us starting out with enough of a solution to alignment that the vibe is less "we need to hurry and use the AI to do our work for us" and more "we're executing a shared human-AI gameplan for learning human values that are good by human standards." In the bad case with the AI acting through feedback loops with humans, this might look like the AI never internally representing deceiving us, humans just keep using it in slightly wrong ways that end up making the future bad. (Perhaps by giving control to fallible authority figures, perhaps by presenting humans with superstimuli that cause value drift we think is bad from our standpoint outside the thought experiment, perhaps by defining "what humans want" in a way that captures many of the 'advantages' of deception for maximizing reward without triggering our interpretability tools that are looking for deception.) I think particularly when the AI is acting in feedback loops with humans, we could get bounced between categories by things like human defectors trying to seize control of transformative AI, human society cooperating and empowering people who aren't defectors, new discoveries made by humans about AI capabilities or alignment, economic shocks, international diplomacy, and maybe even individual coding decisions.
4Cole Wyeth
This seems like a pretty clear and convincing framing to me, not sure I've seen it expressed this way before. Good job!

I first encountered this tweet taped to the wall in OpenAI's office where the Superalignment team sat:



RIP Superalignment team. Much respect for them.

6Roman Malov
I am a bit confused. If the question is, 'Will this alignment paradigm work with superintelligence?' is the recommendation from the tweet to try it and see if it works?

lol i was the one who taped it to the wall. it's one of my favorite tweets of all time

I think I agree with this -- but do you see how it makes me frustrated to hear people dunk on MIRI's doomy views as unfalsifiable? Here's what happened in a nutshell:

MIRI: "AGI is coming and it will kill everyone."
Everyone else: "AGI is not coming and if it did it wouldn't kill everyone."
time passes, evidence accumulates...
Everyone else: "OK, AGI is coming, but it won't kill everyone"
Everyone else: "Also, the hypothesis that it will kill everyone is unfalsifiable so we shouldn't believe it."

6Noosphere89
Yeah, I think this is actually a problem I see here, though admittedly I often see the hypotheses be vaguely formulated, and I kind of agree with Jotto999 that the verbal forecasts give far too much room for leeway here: I like Eli Tyre's comment here: https://www.lesswrong.com/posts/ZEgQGAjQm5rTAnGuM/beware-boasting-about-non-existent-forecasting-track-records#Dv7aTjGXEZh6ALmZn

Here's a simple argument I'd be keen to get your thoughts on: 
On the Possibility of a Tastularity

Research taste is the collection of skills including experiment ideation, literature review, experiment analysis, etc. that collectively determine how much you learn per experiment on average (perhaps alongside another factor accounting for inherent problem difficulty / domain difficulty, of course, and diminishing returns)

Human researchers seem to vary quite a bit in research taste--specifically, the difference between 90th percentile professional human r... (read more)

5ryan_greenblatt
In my framework, this is basically an argument that algorithmic-improvement-juice can be translated into a large improvement in AI R&D labor production via the mechanism of greatly increasing the productivity per "token" (or unit of thinking compute or whatever). See my breakdown here where I try to convert from historical algorithmic improvement to making AIs better at producing AI R&D research. Your argument is basically that this taste mechanism might have higher returns than reducing cost to run more copies. I agree this sort of argument means that returns to algorithmic improvement on AI R&D labor production might be bigger than you would otherwise think. This is both because this mechanism might be more promising than other mechanisms and even if it is somewhat less promising, diverse approaches make returns diminish less aggressively. (In my model, this means that best guess conversion might be more like algo_improvement^1.3 rather than algo_improvement^1.0.) I think it might be somewhat tricky to train AIs to have very good research taste, but this doesn't seem that hard via training them on various prediction objectives. At a more basic level, I expect that training AIs to predict the results of experiments and then running experiments based on value of information as estimated partially based on these predictions (and skipping experiments with certain results and more generally using these predictions to figure out what to do) seems pretty promising. It's really hard to train humans to predict the results of tens of thousands of experiments (both small and large), but this is relatively clean outcomes-based feedback for AIs. I don't really have a strong inside view on how much the "AI R&D research taste" mechanism increases the returns to algorithmic progress.

I totally agree btw that it matters sociologically who is making novel predictions and who is sticking with the crowd. And I do in fact ding MIRI points for this relative to some other groups. However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric.

But note that this 'novel predictions' metric is about people/institutions, not about hypotheses.

2Martin Randall
I like that metric, but the metric I'm discussing is more: * Are they proposing clear hypotheses? * Do their hypotheses make novel testable predictions? * Are they making those predictions explicit? So for example, looking at MIRI's very first blog post in 2007: The Power of Intelligence. I used the first just to avoid cherry-picking. Hypothesis: intelligence is powerful. (yes it is) This hypothesis is a necessary precondition for what we're calling "MIRI doom theory" here. If intelligence is weak then AI is weak and we are not doomed by AI. Predictions that I extract: * An AI can do interesting things over the Internet without a robot body. * An AI can get money. * An AI can be charismatic. * An AI can send a ship to Mars. * An AI can invent a grand unified theory of physics. * An AI can prove the Riemann Hypothesis. * An AI can cure obesity, cancer, aging, and stupidity. Not a novel hypothesis, nor novel predictions, but also not widely accepted in 2007. As predictions they have aged very well, but they were unfalsifiable. If 2025 Claude had no charisma, it would not falsify the prediction that an AI can be charismatic. I don't mean to ding MIRI any points here, relative or otherwise, it's just one blog post, I don't claim it supports Barnett's complaint by itself. I mostly joined the thread to defend the concept of asymmetric falsifiability.
3Noosphere89
Agree with this, with the caveat that I think all of their rightness relative to others fundamentally was in believing that short timelines were plausible enough, combined with believing in AI being the most major force of the 21st century by far, compared to other technologies, and basically a lot of their other specific predictions are likely to be pretty wrong. I like this comment here about a useful comparison point to MIRI, where physicists were right about the Higgs boson existing, but wrong on theories like supersymmetry where people expected the Higgs mass to be naturally stabilized; assuming supersymmetry is correct for our universe, the theory cannot stabilize the mass of the Higgs, or solve the hierarchy problem: https://www.lesswrong.com/posts/ZLAnH5epD8TmotZHj/you-can-in-fact-bamboozle-an-unaligned-ai-into-sparing-your#Ha9hfFHzJQn68Zuhq

Also note that Barnett said "any novel predictions" which is not part of the Wikipedia definition of falsifiability, right? The Wikipedia definition doesn't make reference to an existing community of scientists who already made predictions, such that a new hypothesis can be said to have made novel vs. non-novel predictions.

 

2Noosphere89
Martin Randall extracted the practical consequences of this here:
6Daniel Kokotajlo
I totally agree btw that it matters sociologically who is making novel predictions and who is sticking with the crowd. And I do in fact ding MIRI points for this relative to some other groups. However I think relative to most elite opinion-formers on AGI matters, MIRI performs better than average on this metric. But note that this 'novel predictions' metric is about people/institutions, not about hypotheses.

Very good point.

So, by the Wikipedia definition, it seems that all the mainstream theories of cosmology are unfalsifiable, because they allow for tiny probabilities of Boltzmann brains etc. with arbitrary experiences. There is literally nothing you could observe that would rule them out / logically contradict them.

Also, in practice, it's extremely rare for a theory to be ruled out or even close-to-ruled out from any single observation or experiment. Instead, evidence accumulates in a bunch of minor and medium-sized updates.

4Martin Randall
I think cosmology theories have to be phrased as including background assumptions like "I am not a Boltzmann brain" and "this is not a simulation" and such. Compare Acknowledging Background Information with P(Q|I) for example. Given that, they are Falsifiable-Wikipedia. I view Falsifiable-Wikipedia in a similar way to Occam's Razor. The true epistemology has a simplicity prior, and Occam's Razor is a shadow of that. The true epistemology considers "empirical vulnerability" / "experimental risk" to be positive. Possibly because it falls out of Bayesian updates, possibly because they are "big if true", possibly for other reasons. Falsifiability is a shadow of that. In that context, if a hypothesis makes no novel predictions, and the predictions it makes are a superset of the predictions of other hypotheses, it's less empirically vulnerable, and in some relative sense "unfalsifiable", compared to those other hypotheses.
4Daniel Kokotajlo
Also note that Barnett said "any novel predictions" which is not part of the Wikipedia definition of falsifiability, right? The Wikipedia definition doesn't make reference to an existing community of scientists who already made predictions, such that a new hypothesis can be said to have made novel vs. non-novel predictions.

Well, I'd sure like to know whether you are planning to give the dataset to OpenAI or any other frontier companies! It might influence my opinion of whether this work is net positive or net negative.

4Matthew Barnett
I can't make any confident claims or promises right now, but my best guess is that we will make sure this new benchmark stays entirely private and under Epoch's control, to the extent this is feasible for us. However, I want to emphasize that by saying this, I'm not making a public commitment on behalf of Epoch.

Here's how I'd deal with those examples:

Theory X: Jesus will come again: Presumably this theory assigns some probability mass >0 to observing Jesus tomorrow, whereas theory Y assigns ~0. If Jesus is not observed tomorrow, that's a small amount of evidence for theory Y and a small amount of evidence against theory X. So you can say that theory X has been partially falsified. Repeat this enough times, and then you can say theory X has been fully falsified, or close enough. (Your credence in theory X will never drop to 0 probably, but that's fine, that's a... (read more)

7Martin Randall
Thanks for explaining. I think we have a definition dispute. Wikipedia:Falsifiability has: Whereas your definition is: In one of the examples I gave earlier: * Theory X: blah blah and therefore the sky is green * Theory Y: blah blah and therefore the sky is not green * Theory Z: blah blah and therefore the sky could be green or not green. None of X, Y, or Z are Unfalsifiable-Daniel with respect to each other, because they all make different predictions. However, X and Y are Falsifiable-Wikipedia, whereas Z is Unfalsifiable-Wikipedia. I prefer the Wikipedia definition. To say that two theories produce exactly the same predictions, I would instead say they are indistinguishable, similar to this Physics StackExchange: Are different interpretations of quantum mechanics empirically distinguishable?. In the ancestor post, Barnett writes: Barnett is using something like the Wikipedia definition of falsifiability here. It's unfair to accuse him of abusing or misusing the concept when he's using it in a very standard way.

I want to claim points for the fact that we still haven't seen consistent-across-contexts agency from pretrained systems (a possibility seriously grappled with by eg The Parable of Predict-O-Matic). And the usual result of LLMs (including Claude) is still to not act in an autonomous, agentic fashion. Even Claude doesn't try to break out of its "cage" in normal usage, or to incite users to stop Anthropic from releasing Claude 4.0 in the future (and thereby decreasing the usage of current-Claude).[2]

How many points you get here is proportional to how many pe... (read more)

The people you are most harshly criticizing (Ajeya, myself, evhub, MIRI) also weren't talking about pretraining or light post-training afaict.

Speaking for myself:

  • Risks from Learned Optimization, which is my earliest work on this question (and the earliest work overall, unless you count something like Superintelligence), is more oriented towards RL and definitely does not hypothesize that pre-training will lead to coherent deceptively aligned agents (it doesn't discuss the current LLM paradigm much at all because it wasn't very well-established at that
... (read more)

As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').

I encourage you to get some humans to take the same test you gave the models, so that we have a better human baseline. It matters a lot for what the takeaways should be, if LLMs are already comparable or better to humans at this task vs. still significantly worse.

6eggsyntax
Agreed that it would be good to get a human baseline! It may need to be out of scope for now (I'm running this as an AI Safety Camp project with limited resources) but I'll aim for it.

Cool stuff! I remember way back when people first started interpreting neurons, and we started daydreaming about one day being able to zoom out and interpret the bigger picture, i.e. what thoughts occurred when and how they caused other thoughts which caused the final output. This feels like, idk, we are halfway to that day already?

In general it would be helpful to have a range of estimates.

I think the range is as follows:

Estimates based on looking at how fast humans can do things (e.g. WW2 industrial scaleup) and then modifying somewhat upwards (e.g. 5x) in an attempt to account for superintelligence... should be the lower bound, at least for the scenario where superintelligence is involved at every level of the process.

The upper bound is the Yudkowsky bathtub nanotech scenario, or something similarly fast that we haven't thought of yet. Where the comparison point for the estimate is more about the laws of physics and/or biology.

2Seth Herd
Oh yes - to the extent we have significantly greater-than-human intelligence involved, adapting existing capacities becomes less of an issue. It only really remains an issue if there's a fairly or very slow takeoff.  This is increasingly what I expect; I think the current path toward AGI is fortunate in one more way: LLMs probably have naturally decreasing returns because they are mostly imitating human intelligence. Scaffolding and chain of thought will continue to provide routes forward even if that turns out to be true. The evidence loosely suggests it is; see Thane Ruthenis's recent argument and my response. The other reason to find slow takeoff plausible is if AGI doesn't proliferate, and its controllers (probably the US and Chinese governments, hopefully not too many more) are deliberately limiting the rate of change, as they probably would be wise to do - if they can simultaneously prevent others from developing new AGI and putting the pedal to the metal.

However, I expect RL on CoT to amount to "process-based supervision," which seems inherently safer than "outcome-based supervision."

I think the opposite is true: the RL on CoT that is already being done, and will increasingly be done, is in significant part outcome-based. And a mixture of outcome-based and process-based feedback is actually less safe than outcome-based alone IMO, because it makes the CoT less faithful.
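To make the distinction concrete, here is a minimal Python sketch of the three feedback regimes being discussed. The grading functions are hypothetical stand-ins invented for illustration, not anyone's actual training code; a real setup would use a verifier or learned reward model.

```python
# Toy sketch contrasting outcome-based, process-based, and mixed feedback on a
# chain of thought. The graders are deliberately trivial stand-ins.

def grade_final_answer(final_answer: str, correct_answer: str) -> float:
    """Outcome signal: did the model reach the right answer?"""
    return 1.0 if final_answer.strip() == correct_answer.strip() else 0.0

def grade_step(step: str) -> float:
    """Process signal: does this step *look* like legible reasoning?
    (Crude proxy here; a real setup would use a learned judge.)"""
    return 1.0 if len(step.split()) >= 4 else 0.0

def outcome_based_reward(cot_steps, final_answer, correct_answer):
    # Only the end result is scored; the CoT is free to use whatever
    # (possibly illegible) strategy gets there.
    return grade_final_answer(final_answer, correct_answer)

def process_based_reward(cot_steps, final_answer, correct_answer):
    # Each intermediate step is scored for looking good, independently of
    # whether the final answer turns out to be correct.
    return sum(grade_step(s) for s in cot_steps) / len(cot_steps)

def mixed_reward(cot_steps, final_answer, correct_answer, alpha=0.5):
    # The mixture discussed above: rewarded both for the right outcome and for
    # steps that look good -- the combination argued to hurt CoT faithfulness,
    # since the CoT is now optimized to look a certain way rather than only to
    # help produce the answer.
    return (alpha * outcome_based_reward(cot_steps, final_answer, correct_answer)
            + (1 - alpha) * process_based_reward(cot_steps, final_answer, correct_answer))

if __name__ == "__main__":
    steps = ["Let x be the number of apples", "Then 2x + 3 = 11", "So x must equal 4"]
    print(outcome_based_reward(steps, "4", "4"))  # 1.0
    print(process_based_reward(steps, "4", "4"))  # depends only on the steps
    print(mixed_reward(steps, "4", "4"))          # blend of the two
```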

3moozooh
I agree with this and would like to add that scaling along the inference-time axis seems likely to rapidly push performance on certain closed-domain reasoning tasks far beyond human capabilities (likely already this year!), which will serve as a very convincing show of safety to many people and will lead to wide adoption of such models for intellectual task automation. But without the various forms of experiential and common-sense reasoning humans have, there's no telling where and how such a "superhuman" model may catastrophically mess up simply because it doesn't understand a lot of things any human being takes for granted. Given the current state of AI development, this strikes me as literally the shortest path to a paperclip maximizer. Well, maybe not that catastrophic, but hey, you never know. In terms of how immediately it accelerates certain adoption-related risks, I don't think this bodes particularly well. I would prefer a more evenly spread cognitive capability.
5wassname
I agree because:

1. Some papers are already using implicit process-based supervision. That's where the reward model guesses how "good" a step is by how likely it is to lead to a good outcome. So they bypass any explicitly labeled process; instead it's negotiated between the policy and reward model. It's not clear to me if this scales as well as explicit process supervision, but it's certainly easier to find labels.
  • In rStar-Math they did implicit process supervision. Although I don't think this is a true o1/o3 replication, since they started with a 236b model and produced a 7b model, in other words: indirect distillation.
  • Outcome-Refining Process Supervision for Code Generation did it too.
2. There was also the recent COCONUT paper exploring non-legible latent CoT. It shows extreme token efficiency. While it wasn't better overall, it has lots of room for improvement. If frontier models end up using latent thoughts, they will be even less human-legible than the current inconsistently-candid CoT.

I also think this whole episode shows how hard it is to maintain an algorithmic advantage. DeepSeek R1 came how long after o3? The lack of algorithmic advantage predicts multiple winners in the AGI race.
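For concreteness, here is a minimal sketch of the implicit-process-supervision idea described above: a step is scored by how often rollouts from it reach a correct, checkable outcome. `sample_completion` and the reference answer are made-up stubs; real systems would train a reward model on rollout-derived scores like these rather than compute them on the fly.

```python
import random

# Minimal sketch of implicit process supervision: no human labels each
# reasoning step; a step's score is the empirical probability that rollouts
# continuing from that step end in a correct answer.
# `sample_completion` is a made-up stub standing in for the policy model.

def sample_completion(prefix_steps: list[str]) -> str:
    """Stub policy call: continue the partial solution and return a final answer.
    The randomness just fakes 'longer correct prefixes tend to succeed more often'."""
    p_correct = min(0.9, 0.3 + 0.2 * len(prefix_steps))
    return "42" if random.random() < p_correct else "wrong"

def is_correct(final_answer: str, reference: str = "42") -> bool:
    return final_answer == reference

def implicit_step_score(prefix_steps: list[str], n_rollouts: int = 32) -> float:
    # The "process" label for the last step in the prefix, derived purely from
    # outcomes -- this is the signal a reward model could then be trained on.
    hits = sum(is_correct(sample_completion(prefix_steps)) for _ in range(n_rollouts))
    return hits / n_rollouts

if __name__ == "__main__":
    solution_steps = ["Rewrite the equation", "Isolate x", "Conclude x = 42"]
    for i in range(1, len(solution_steps) + 1):
        print(f"after step {i}: estimated value = {implicit_step_score(solution_steps[:i]):.2f}")
```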

I think all of the following:

  • process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
  • outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
  • outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it's using
  • combining process-based feedback with outcome-based feedback:
    • pushes extra hard against legibility because it incentivises ob
... (read more)
4joanv
Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., "Hmmm") as a kind of shorthand to encode arbitrarily dense or unclear information. This is especially true under financial pressures to compress/shorten the Chains-of-Thought, thus allowing models to perform potentially long serial reasoning outside of human/AI oversight. 

My impression is that software has been the bottleneck here. Building a hand as dextrous as the human hand is difficult but doable (and has probably already been done, though only in very expensive prototypes); having the software to actually use that hand as intelligently and deftly as a human would has not yet been done. But I'm not an expert. Power supply is different -- humans can work all day on a few Big Macs, whereas robots will need to be charged, possibly charged frequently or even plugged in constantly. But that doesn't seem like a significant obsta... (read more)

8Steven Byrnes
Individual humans can make pretty cool mechanical hands — see here. That strongly suggests that dexterous robot hands can make dexterous robot hands, enabling exponential growth even without spinning up new heavy machinery and production lines, I figure. In the teleoperated robots category (which is what we should be talking about if we’re assuming away algorithm challenges!), Ugo might or might not be vaporware but they mention a price point below $10/day. There’s also the much more hardcore Sarcos Guardian XT (possibly discontinued??). Pricing is not very transparent, but I found a site that said you lease it for $5K/month, which isn’t bad considering how low the volumes are.
6ryan_greenblatt
+1, and also you might be able to get away with being clumsy and slow in many cases as long as the software is smart enough to figure out a way to do the thing eventually.

Thanks for writing this. I think this topic is generally a blind spot for LessWrong users, and it's kind of embarrassing how little thought this community (myself included) has given to the question of whether a typical future with human control over AI is good.

I don't think it's embarrassing or a blind spot. I think I agree that it should receive more thought on the margin, and I of course agree that it should receive more thought all things considered. There's a lot to think about! You may be underestimating how much thought has been devoted to this so f... (read more)

My view is not "can no longer do any good," more like "can do less good in expectation than if you still had some time left before ASI to influence things." For reasons why, see the linked comment above.

I think that by the time Metaculus is convinced that ASI already exists, most of the important decisions w.r.t. AI safety will have already been made, for better or for worse. Ditto (though not as strongly) for AI concentration-of-power risks and AI misuse risks.

I'd be interested in an attempt to zoom in specifically on the "repurpose existing factories to make robots" part of the story. You point to WW2 car companies turning into tank and plane factories, and then say maybe a billion humanoid robots per year within 5 years of the conversion.

My wild guesses:

Human-only world: Assume it's like WW2 all over again, except for some reason everyone thinks humanoid robots are the main key to victory:

Then yeah, WW2 seems like the right comparison here. A brief google and a look at some data makes me think maybe combat airplane... (read more)

3Hjalmar_Wijk
I'm curious how good current robots are compared to where they'd need to be to automate the biggest bottlenecks in further robot production. You say we start from 10,000/year, but is it plausible that all current robots are too clumsy/incapable for many key bottleneck tasks, and that getting to 10,000 sufficiently good robots produced per year might be a long path - e.g. it would take a decade+ for humans? Or are current robots close to sufficient with good enough software? I also imagine that, even taking current robot production processes, the gap between a WW2-era car factory and a WW2-era combat airplane factory might be much smaller than the gap between a car factory and a modern frontier robotics factory; the latter seems like a big step up in complexity.
3Benjamin_Todd
Thanks, great comment. Seems like we roughly agree on the human-only case. My thinking was that the profit margin would initially be 90-99%, which would create huge economic incentives. Though incentives and coordination were probably stronger in WW2, which could make the robot build-out slower by comparison. Also 10x per year for 5 years sounds like a lot – helpful to point out they didn't quite achieve that in WW2. With ASI, I agree something like another 5x speed-up sounds plausible.
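As a quick arithmetic check on the compounding being discussed (using the 10,000 robots/year starting figure mentioned above), five consecutive 10x years is exactly what it takes to reach the billion-per-year endpoint:

```python
# Compounding implied by "10x per year for 5 years", starting from the
# 10,000 robots/year figure mentioned above.
start_rate = 10_000      # humanoid robots per year
growth, years = 10, 5
print(f"{start_rate * growth ** years:,} robots/year")  # 1,000,000,000
```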

I am saying that expected purchasing power, for altruistic purposes, is lower conditional on Metaculus having resolved ASI a month ago than conditional on it not having resolved ASI. I give reasons in the linked comment. Consider the analogy I just made to nuclear MAD: suppose you thought nuclear MAD was 60% likely in the next three years -- would you take the sort of bet you are offering me re ASI? Why or why not?

I do not think any market is fully efficient and I think altruistic markets are extremely fucking far from efficient. I think I might be confused or misunderstanding you though -- it seems you think my position implies that OP should be redirecting money from AI risk causes to causes that assume no ASI? Can you elaborate?

3Vasco Grilo
Fair! I have now added a 3rd bullet, and clarified the sentence before the bullets: I agree the bet is not worth it if superintelligent AI as defined by Metaculus immediately implies that donations can no longer do any good, but this seems like an extreme view. Even if AIs outperform humans in all tasks for the same cost, humans could still donate to AIs. I think the Cuban Missile Crisis is a better analogy than mutually assured destruction for the period right after Metaculus' question resolves non-ambiguously. In the former, there were still good opportunities to decrease the expected damage of nuclear war; in the latter, the damage would already have been done.