All of Tom Davidson's Comments + Replies

I meant at any point, but yeah, I was imagining the period around full automation. Why do you ask?

I'll post about my views on different numbers of OOMs soon

Sorry, in my comments on this post I've been referring to "software-only singularity?" only as "will the parameter r > 1 when we first fully automate AI R&D", not as a threshold for some number of OOMs. That's what Ryan's analysis seemed to be referring to.

 

I separately think that even if initially r > 1, the software explosion might not go on for that long
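(To make the r > 1 framing concrete, here's a toy sketch of the feedback loop; this is my own simplification, not Ryan's actual model. It assumes AI research labour scales with the current software level, and that each doubling of cumulative research input yields r doublings of software. With r > 1 growth runs away in finite time; with r ≤ 1 it stays exponential or slower.)

```python
# Toy illustration of the r > 1 condition. Assumptions (mine, for illustration):
# AI research labour is proportional to the current software level S, and each
# doubling of cumulative research input R yields r doublings of S (so S = R**r).

def final_software_level(r, steps=60, dt=0.1):
    S, R = 1.0, 1.0
    for _ in range(steps):
        R += S * dt          # research input this step scales with software level
        S = R ** r
        if S > 1e12:         # treat runaway growth as "singularity reached"
            return float("inf")
    return S

for r in (0.7, 1.0, 1.3):
    print(f"r = {r}: software level after 60 steps = {final_software_level(r):.3g}")
```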

Tom Davidson
I'll post about my views on different numbers of OOMs soon

Obviously the numbers in the LLM case are much less certain given that I'm guessing based on qualitative improvement and looking at some open source models,

Sorry, I don't follow. Why are they less certain?

 

based on some first-principles reasoning and my understanding of how returns diminished in the semiconductor case

I'd be interested to hear more about this. The semiconductor case is hard as we don't know how far we are from limits, but if we use Landauer's limit then I'd guess you're right. There's also uncertainty about how much algorithmic progress we have made and will make.
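(For reference, a rough back-of-envelope on the Landauer comparison; the hardware figures below are my own order-of-magnitude assumptions rather than numbers from this thread, so the headroom estimate is only indicative.)

```python
# Rough back-of-envelope: how far is current hardware from the Landauer limit?
import math

k_B = 1.380649e-23                          # Boltzmann constant, J/K
T = 300.0                                   # room temperature, K
landauer = k_B * T * math.log(2)            # ~2.9e-21 J per irreversible bit erasure

joules_per_flop = 1e-12                     # assumed ~1 pJ/FLOP for a current accelerator
bit_ops_per_flop = 1e4                      # very rough guess at bit operations per FLOP

current_per_bit = joules_per_flop / bit_ops_per_flop
print(f"Landauer limit:           {landauer:.2e} J per bit")
print(f"Assumed current hardware: {current_per_bit:.2e} J per bit")
print(f"Implied headroom:         ~{math.log10(current_per_bit / landauer):.1f} OOMs")
```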

ryan_greenblatt
I'm just eyeballing the rate of algorithmic progress, while in the computer vision case we can at least look at benchmarks and know the cost of training compute for various models. My sense is that you have generalization issues in the computer vision case, while in the frontier LLM case you have issues with knowing the actual numbers (in terms of number of employees and cost of training runs). I'm also just not carefully doing the accounting. I don't have much to say here sadly, but I do think investigating this could be useful.

Why are they more recoverable? Seems like a human who seized power would seek ASI advice on how to cement their power.

AnthonyC
I'm sure they would. And some of those ways ASI can help would include becoming immortal, replacing relationships with other humans, things like that. But compared to an ASI, it is easier for a human to die, to have their mind changed by outside influences, and to place intrinsic value on the whole range of things humans care about, including other people.
Tom Davidson

Thanks for this!

Compared to you, I more strongly expect we'll see evidence of scheming if it exists.

You argue weak schemers might just play nice. But if so, we can use them to do loads of intellectual labour to make fancy behavioral red teaming and interp to catch out the next gen of AI. 

More generally, the plan of bootstrapping to increasingly complex behavioral tests and control schemes seems likely to work. It seems like if one model has spent a lot of thinking time designing a scheme, then another model would have to be much smarter to cause a catastrophe zero-shot without the scheme detecting it. E.g., analogies with humans suggest this.

I agree that if an AI is incapable of competently scheming (i.e., alignment faking and sabotaging safety work without being caught), but is capable of massively accelerating safety work, then doing huge amounts of safety work with this AI is very promising.

(I put this aside in this post as I was trying to have a more narrow focus on how we'll update about scheming independent of how easily scheming will be handled and without talking about methods that don't currently exist.)

(The specific directions you mentioned of "fancy behavioral red teaming and interp... (read more)

I agree the easy vs hard worlds influence the chance of AI taking over. 

But are you also claiming it influences the badness of takeover conditional on it happening? (That's the subject of my post)

evhub
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over, as well as the expected badness of those values conditional on it taking over.

So you predict that if Claude were in a situation where it knew it had complete power over you and could make you say that you liked it, then it would stop being nice? I think it would continue to be nice in any situation of that rough kind, which suggests it's actually nice, not just narcissistically pretending.

Yes, I think it's quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the "Waluigi Effect" happens at some point, like with Bing/Sydney.

But I think it is even more likely that a superintelligent Claude would interpret "being nice" in a different way than you or me. It could, for example, come to the conclusion that life is suffering and we would all be better off if we didn't exist at all. Or that we should be locked in a secure place and drugged so we experience ete... (read more)

But a human could instruct an aligned ASI to help it take over and do a lot of damage

That structural difference you point to seems massive. The reputational downsides of bad behavior will be multiplied 100-fold+ for AI as it reflects on millions of instances and the company's reputation. 

 

And it will be much easier to record and monitor AI thinking and actions to catch bad behaviour.


Why is it unlikely that we can detect selfishness? Why can't we bootstrap from human-level?

[anonymous]
Human behavior reflects on the core structure that individual humans are variations on, too.

One dynamic initially preventing stasis in influence post-AGI is that different people have different discount rates, so those with lower discount rates will slowly gain influence over time.
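(A toy illustration of that dynamic, with arbitrary numbers of my own: an agent that consumes a smaller fraction of its wealth each year ends up with a growing share of total wealth.)

```python
# Two agents start with equal wealth and earn the same return; the "patient" one
# consumes a smaller fraction each year. Its share of total wealth grows over time.
# All rates here are placeholders for illustration.

def patient_share(consume_patient=0.02, consume_impatient=0.05, ret=0.05, years=50):
    a = b = 1.0
    for _ in range(years):
        a *= (1 + ret) * (1 - consume_patient)
        b *= (1 + ret) * (1 - consume_impatient)
    return a / (a + b)

for years in (0, 25, 50, 100):
    print(f"after {years:3d} years, the patient agent holds {patient_share(years=years):.0%} of wealth")
```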

Yep, I'm saying you're wrong about this. If money compounds but you don't have utility = log($), then you shouldn't Kelly bet.

Your formula is only valid if utility = log($).

With that assumption the equation compares your utility with and without insurance. Simple!

If you had some other utility function, like utility = $, then you should make insurance decisions differently.

I think the Kelly betting stuff is a big distraction, and that people with utility = $ shouldn't bet like that. I think the result that Kelly betting maximizes long-term $ bakes in assumptions about utility functions and is easily misunderstood: someone with utility = $ probably goes bankrupt but might become insanely... (read more)

notfnofn
From the original post: Click the link for a more in-depth explanation
Matt Goldenberg
Is this true? I'm still a bit confused about this point of the Kelly criterion. I thought that this is actually the way to maximize expected returns if you value money linearly, that the log term comes from compounding gains, and that the log-utility assumption is a separate justification for the Kelly criterion that doesn't take into account expected compounding returns.
kqr
This is a synonym for "if money compounds and you want more of it at lower risk". So in a sense, yes, but it seems confusing to phrase it in terms of utility as if the choice was arbitrary and not determined by other constraints.
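(A minimal simulation of the disagreement in this thread, using my own illustrative numbers: repeated 60/40 double-or-nothing bets. Betting the whole bankroll each round maximizes expected dollars, but almost every run goes bust; the Kelly fraction maximizes the growth rate of wealth, i.e. expected log wealth.)

```python
# Compare Kelly-fraction betting with all-in betting on a favorable 60/40
# even-money bet. All-in maximizes the mean outcome; Kelly maximizes E[log wealth].
import random
import statistics

def simulate(bet_fraction, rounds=10, trials=100_000, p_win=0.6):
    finals = []
    for _ in range(trials):
        wealth = 1.0
        for _ in range(rounds):
            stake = wealth * bet_fraction
            wealth += stake if random.random() < p_win else -stake
        finals.append(wealth)
    return statistics.mean(finals), statistics.median(finals)

random.seed(0)
kelly_fraction = 0.6 - 0.4   # Kelly fraction for even-money odds with p = 0.6
for label, f in [("Kelly, f = 0.2", kelly_fraction), ("all-in, f = 1.0", 1.0)]:
    mean, median = simulate(f)
    print(f"{label}: mean final wealth = {mean:.2f}, median = {median:.2f}")
```

With these numbers the all-in bettor has the higher mean, but the median run is wiped out, which is the sense in which "Kelly maximizes long-term $" quietly assumes you care about typical (log-ish) outcomes rather than the raw expectation.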

I enjoyed reading this, thanks.

I think your definition of solving alignment here might be too broad?

If we have superintelligent agentic AI that tries to help its user, but we end up missing out on the benefits of AI because of catastrophic coordination failures or because of misuse, then I think you're saying we didn't solve alignment because we didn't elicit the benefits?

You discuss this, but I prefer to separate out control and alignment, where I wouldn't count us as having solved alignment if we only elicit behavior via intense/exploitative control schemes. So I'd adj... (read more)

Joe Carlsmith
In my definition, you don't have to actually elicit the benefits. You just need to have gained "access" to the benefits. And I meant this specifically to cover cases like misuse. Quoting from the OP:

Re: separating out control and alignment, I agree that there's something intuitive and important about differentiating between control and alignment, where I'd roughly think of control as "you're ensuring good outcomes via influencing the options available to the AI," and alignment as "you're ensuring good outcomes by influencing which options the AI is motivated to pursue." The issue is that in the real world, we almost always get good outcomes via a mix of these -- see, e.g., humans. And as I discuss in the post, I think it's one of the deficiencies of the traditional alignment discourse that it assumes that limiting options is hopeless, and that we need AIs that are motivated to choose desirable options even in arbitrary circumstances and given arbitrary amounts of power over their environment. I've been trying, in this framework, to specifically avoid that implication.

That said, I also acknowledge that there's some intuitive difference between cases in which you've basically got AIs in the position of slaves/prisoners who would kill you as soon as they had any decently-likely-to-succeed chance to do so, and cases in which AIs are substantially intrinsically motivated in desirable ways, but would still kill/disempower you in distant cases with difficult trade-offs (in the same sense that many human personal assistants might kill/disempower their employers in various distant cases). And I agree that it seems a bit weird to talk about having "solved the alignment problem" in the former sort of case. This makes me wonder whether what I should really be talking about is something like "solving the X-risk-from-power-seeking-AI problem," which is the thing I really care about.

Another option would be to include some additional, more moral-patienthood-attuned constraint i...

I enjoyed it, and think the ideas are important, but found it hard to follow at points.

Some suggestions:

  • explain more why self-criticism allows one part to assert control
  • give more examples throughout, especially the second half. I think some paragraphs don't have examples and are harder to understand
  • flesh out examples to make them longer and more detailed

I think your model will underestimate the benefits of ramping up spending quickly today. 

You model the size of the $ overhang as constant. But in fact it's doubling every couple of years as global spending on producing AI chips grows. (The overhang relates to the fraction of chips used in the largest training run, not the fraction of GWP spent on the largest training run.) That means that ramping up spending quickly (on training runs or software or hardware research) gives that $ overhang less time to grow.
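(A toy version of that point, with placeholder numbers rather than estimates: if the overhang's ceiling doubles every couple of years, then the faster spending ramps up to it, the less extra overhang accumulates in the meantime.)

```python
# Sketch: the $ overhang "ceiling" grows as chip production grows; a faster
# spending ramp closes the gap before the ceiling has grown much further.
# All numbers below are placeholders for illustration.
import math

ceiling_growth = math.log10(2) / 2.0     # ceiling grows ~0.15 OOMs/year (doubling every 2 years)
initial_gap = 2.0                        # assumed current overhang, in OOMs of $

for ramp in (0.5, 1.0, 2.0):             # how fast spending ramps up, in OOMs/year
    years_to_close = initial_gap / (ramp - ceiling_growth)
    extra_overhang = ceiling_growth * years_to_close
    print(f"ramp at {ramp} OOMs/yr: closes in {years_to_close:.1f} yrs, "
          f"ceiling grows another {extra_overhang:.2f} OOMs in the meantime")
```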

Maxime Riché
Interesting! I will see if I can correct that easily.

Why are you at 50% that AI kills >99% of people, given the points you make in the other direction?

ryan_greenblatt
My probabilities are very rough, but I'm feeling more like 1/3-ish today after thinking about it a bit more. Shrug. As far as reasons for it being this high:

  • Conflict seems plausible to get to this level of lethality (see edit, I think I was a bit unclear or incorrect)
  • AIs might not care about acausal trade considerations before it's too late (seems unclear)
  • Future humans/AIs/aliens might decide it isn't morally important to particularly privilege currently alive humans

Generally, I'm happy to argue for 'we should be pretty confused and there are a decent number of good reasons why AIs might keep humans alive'. I'm not confident in survival overall, though...

So, far causally upstream of the human evaluator's opinion? E.g., an AI counselor optimizing for getting to know you.

I think the "soup of heuristics" stories (where the AI is optimizing something far causally upstream of reward instead of something that is downstream or close enough to be robustly correlated) don't lead to takeover in the same way

Why does it not lead to takeover in the same way?

paulfchristiano
Because it's easy to detect and correct (except that correcting it might push you into one of the other regimes).

AI understands that the game ends after 1908 and modifies accordingly.

Does it? In the game you link it seems like the bot doesn't act accordingly in the last move phase. Turkey misses a chance to grab Rumania, Germany misses a chance to grab London, and I think France misses something as well.

Glad you added these empirical research directions! If I were you I'd prioritize these over the theoretical framework.

DragonGod
Theory is needed to interpret the results of experiment. Only with a solid theoretical framework can useful empirical research be done.

So either one must claim that AI-related unawareness is of a very different type or scale from ordinary human cases in our world today, or one must implicitly claim that unawareness modeling would in fact be a contribution to the agency literature.

I agree that the Bostrom/Yudkowsky scenario implies AI-related unawareness is of a very different scale from ordinary human cases. From an outside-view perspective, this is a strike against the scenario. However, this deviation from past trends does follow fairly naturally (though not necessarily) from the hypothesis of a sudden and massive intelligence gap.

Re the difference between monopoly rents and agency rents: monopoly rents would be eliminated by competition between firms, whereas agency rents would be eliminated by competition between workers. So they're different in that sense.