All of Buck's Comments + Replies

BuckΩ213325

I agree with most of this, thanks for saying it. I've been dismayed for the last several years by continuing unreasonable levels of emphasis on interpretability techniques as a strategy for safety.

My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it's reasonably likely that it won't work on the AIs that we'd want to control, because those models will have access to some kind of "neuralese" that al... (read more)

Buck150

IIRC, an Anthropic staff member told me that he had a strong suspicion for why this is, but that it was tied up in proprietary info so he didn't want to say.

Buck22

I feel like GDM safety and Constellation are similar enough to be in the same cluster: I bet within-cluster variance is bigger than between-cluster variance.

Buck*74

FWIW, I think that the GDM safety people are at least as similar to the Constellation/Redwood/METR cluster as the Anthropic safety people are, probably more similar. (And Anthropic as a whole has very different beliefs than the Constellation cluster, e.g. not having much credence on misalignment risk.)

2Neel Nanda
Oh agreed, I didn't notice that OP lumped Anthropic and Constellation together. I would consider GDM, Constellation and Anthropic to all be separate clusters
BuckΩ473

Ryan agrees, the main thing he means by "behavioral output" is what you're saying: an actually really dangerous action.

1Dusto
Brilliant! Stoked to see your team move this direction.
BuckΩ222

I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.

2ryan_greenblatt
(FWIW, I didn't use them as synonyms in the post except for saying "aka sandbagging" in the title which maybe should have been "aka a strategy for sandbagging". I thought "aka sandbagging" was sufficiently accurate for a title and saved space.)
Buck64

Isn’t the answer that the low hanging fruit of explaining unexplained observations has been picked?

2Garrett Baker
That's an inference; presumably Adam believes that for object-level reasons, which could be supported by e.g. looking at the age at which physicists make major advancements[1] and the size of those advancements.

Edit: But also this wouldn't show whether or not theoretical physics is actually in a rut, to someone who doesn't know what the field looks like now.

----------------------------------------

1. Adjusted for similar but known-to-be-fast-moving fields like AI or biology, to normalize for facts like e.g. the academic job market just being worse now than previously. ↩︎
Buck432

I appeared on the 80,000 Hours podcast. I discussed a bunch of points on misalignment risk and AI control that I don't think I've heard discussed publicly before.

Transcript + links + summary here; it's also available as a podcast in many places.

Orpheus16170

What do you think are the most important points that weren't publicly discussed before?

Buck42

I love that I can guess the infohazard from the comment 

BuckΩ14280

A few months ago, I accidentally used France as an example of a small country that it wouldn't be that catastrophic for AIs to take over, while giving a talk in France 😬

3Arjun Panickssery
Didn't watch the video but is there the short version of this argument? France is at the 90th percentile of population sizes and also has the 4th-most nukes.
Buck40

No problem, my comment was pretty unclear and I can see from the other comments why you'd be on edge!

Buck30

It seems extremely difficult to make a blacklist of models in a way that isn't trivially breakable. (E.g. what's supposed to happen when someone adds a tiny amount of noise to the weights of a blacklisted model, or rotates them along a gauge invariance?)
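To make this concrete, here is a minimal sketch of the first failure mode, using a toy numpy stand-in for real model weights (not any actual blacklisting scheme): adding imperceptible noise changes any hash-style fingerprint completely while leaving the model's behavior essentially unchanged.

```python
# Toy illustration: a hash-based blacklist of model weights is defeated by tiny noise.
# The "model" here is just a random weight matrix; the point is only about fingerprints.
import hashlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)      # stand-in for blacklisted weights
perturbed = weights + 1e-6 * rng.standard_normal((256, 256)).astype(np.float32)

def fingerprint(w: np.ndarray) -> str:
    """Hash of the raw weight bytes, as a naive blacklist might use."""
    return hashlib.sha256(w.tobytes()).hexdigest()[:16]

x = rng.standard_normal(256).astype(np.float32)                   # stand-in for an input
print("original fingerprint: ", fingerprint(weights))
print("perturbed fingerprint:", fingerprint(perturbed))           # completely different
print("max output difference:", float(np.max(np.abs(weights @ x - perturbed @ x))))  # tiny
```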

1ank
Yes, Buck, thank you for responding! A robust whitelist (especially hardware level, each GPU can become a computer for securing itself) potentially solves it (of course if there are state-level actors it can potentially be broken, but at least millions of consumer GPUs will be protected). Each GPU is a battleground: we want to increase the current 0% security to above 0 on as many GPUs as possible, first in firmware (and at the OS level) because updating online is easy, then in hardware (which can bring much better security).

In the safest possible implementation, I imagine it like the Apple App Store (or Nintendo's online game shop): the AI models become a bit like apps, they run on the GPU internally, and NVIDIA looks after them (they ping NVIDIA servers constantly, or at least every few days, to recheck the lists and update the security). NVIDIA can be super motivated to have robust safety: they'll be able to get old hardware for cheap and sell new non-agentic GPUs (so they'll double their business) and take commissions like Apple does (so every GPU becomes a service business for NVIDIA, with constant cashflow; of course there will be free models, like free apps in the App Store, but each developer will be at least registered and so not some anonymous North Korean hacker), so they'll find a way to make things very secure.

The ultimate test is this: can NVIDIA sell their non-agentic super-secure GPUs to North Korea without any risks? I think it's possible to have even some simple self-destruct mechanisms in case of attempted tampering. But let's not make the perfect be the enemy of the good. Right now we have nukes in each computer (GPUs) that are 100% unprotected. At least blacklists will already be better than nothing, and with new secure hardware, it can really slow down AI agents from spreading, so we can be 50% sure we'll have 99% security in most cases but it can become better and better (same way first computers were buggy and completely insecure but we started to make t
Buck90

I agree that this isn't what I'd call "direct written evidence"; I was just (somewhat jokingly) making the point that the linked articles are Bayesian evidence that Musk tries to censor, and that the articles are pieces of text.

3Richard_Ngo
Ah, gotcha. Unfortunately I have rejected the concept of Bayesian evidence from my ontology and therefore must regard your claim as nonsense. Alas. (more seriously, sorry for misinterpreting your tone, I have been getting flak from all directions for this talk so am a bit trigger-happy)
Buck10

It is definitely evidence that was literally written

4Richard_Ngo
I was referring to the inference: Obviously this sort of leap to a conclusion is very different from the sort of evidence that one expects upon hearing that literal written evidence (of Musk trying to censor) exists. Given this, your comment seems remarkably unproductive.
Buck62

I disagree that you have to believe those four things in order to believe what I said. I believe some of those and find others too ambiguously phrased to evaluate.

Re your model: I think your model is basically just: if we race, we go from a 70% chance that the US "wins" to a 75% chance the US wins, and we go from a 50% chance of "solving alignment" to a 25% chance? Idk how to apply that here: isn't your Squiggle model talking about whether racing is good, rather than whether unilaterally pausing is good? Maybe you're using "race" to mean "not pause" and "not rac... (read more)

1MichaelDickens
Yes, the model is more about racing than about pausing, but I thought it was applicable here. My thinking was that there is a spectrum of development speed with "completely pause" on one end and "race as fast as possible" on the other. Pushing more toward the "pause" side of the spectrum has the ~opposite effect as pushing toward the "race" side.

1. I've never seen anyone else try to quantitatively model it. As far as I know, my model is the most granular quantitative model ever made. Which isn't to say it's particularly granular (I spent less than an hour on it) but this feels like an unfair criticism.
2. In general I am not a fan of criticisms of the form "this model is too simple". All models are too simple. What, specifically, is wrong with it?

I had a quick look at the linked post and it seems to be making some implicit assumptions, such as:

1. the plan of "use AI to make AI safe" has a ~100% chance of working (the post explicitly says this is false, but then proceeds as if it's true)
2. there is a ~100% chance of slow takeoff
3. if you unilaterally pause, this doesn't increase the probability that anyone else pauses, doesn't make it easier to get regulations passed, etc.

I would like to see some quantification of the form "we think there is a 30% chance that we can bootstrap AI alignment using AI; a unilateral pause will only increase the probability of a global pause by 3 percentage points; and there's only a 50% chance that the 2nd-leading company will attempt to align AI in a way we'd find satisfactory, therefore we think the least-risky plan is to stay at the front of the race and then bootstrap AI alignment." (Or a more detailed version of that.)
Buck140

I have this experience with @ryan_greenblatt -- he's got an incredible ability to keep really large and complicated argument trees in his head, so he feels much less need to come up with slightly-lossy abstractions and categorizations than e.g. I do. This is part of why his work often feels like huge, mostly unstructured lists. (The lists are more unstructured before his pre-release commenters beg him to structure them more.) (His code often also looks confusing to me, for similar reasons.)

Answer by Buck94

Some quick takes:

  • "Pause AI" could refer to many different possible policies.
  • I think that if humanity avoided building superintelligent AI, we'd massively reduce the risk of AI takeover and other catastrophic outcomes.
  • I suspect that at some point in the future, AI companies will face a choice between proceeding more slowly with AI development than they're incentivized to, and proceeding more quickly while imposing huge risks. In particular, I suspect it's going to be very dangerous to develop ASI.
  • I don't think that it would be clearly good to pause AI devel
... (read more)
2Davidmanheim
I think we basically agree, but I think the Overton window needs to be expanded, and Pause is (unfortunately) already outside that window. So I differentiate between the overall direction, which I support strongly, and the concrete proposals and the organizations involved.
6MichaelDickens
It seems to me that to believe this, you have to believe all of these four things are true:

1. Solving AI alignment is basically easy
2. Non-US frontier AI developers are not interested in safety
3. Non-US frontier AI developers will quickly catch up to the US
4. If US developers slow down, then non-US developers are very unlikely to also slow down—either voluntarily, or because the US strong-arms them into signing a non-proliferation treaty, or whatever

I think #3 is sort-of true and the others are probably false, so the probability of all four being simultaneously true is quite low. (Statements I've seen from Chinese developers lead me to believe that they are less interested in racing and more concerned about safety.)

I made a quick Squiggle model on racing vs. slowing down. Based on my first-guess parameters, it suggests that racing to build AI destroys ~half the expected value of the future compared to not racing. Parameter values are rough, of course.
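For concreteness, here is a deliberately crude toy version of that kind of model, using the parameters from Buck's paraphrase above (70% → 75% chance the US "wins", 50% → 25% chance of "solving alignment"); the way they are combined (the future only has value if the US wins and alignment is solved) is an illustrative assumption, not a claim about the actual Squiggle model.

```python
# Toy racing-vs-not-racing expected value calculation, with illustrative parameters.

def expected_value(p_us_wins: float, p_alignment_solved: float) -> float:
    """Expected fraction of the future's value captured, under the toy assumption
    that the future is only valuable if the US wins and alignment is solved."""
    return p_us_wins * p_alignment_solved

ev_race = expected_value(p_us_wins=0.75, p_alignment_solved=0.25)      # ~0.19
ev_no_race = expected_value(p_us_wins=0.70, p_alignment_solved=0.50)   # 0.35

print(f"EV if racing:           {ev_race:.3f}")
print(f"EV if not racing:       {ev_no_race:.3f}")
print(f"Ratio (race / no race): {ev_race / ev_no_race:.2f}")  # ~0.54, i.e. racing loses roughly half the EV
```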
BuckΩ8104

A few points:

  • Knowing a research field well makes it easier to assess how much other people know about it. For example, if you know ML, you sometimes notice that someone clearly doesn't know what they're talking about (or conversely, you become impressed by the fact that they clearly do know what they're talking about). This is helpful when deciding who to defer to.
  • If you are a prominent researcher, you get more access to confidential/sensitive information and the time of prestigious people. This is true regardless of whether your strategic takes are good,
... (read more)
BuckΩ340

A few takes:

I believe that there is also an argument to be made that the AI safety community is currently very under-indexed on research into future scenarios where assumptions about the AI operator taking baseline safety precautions related to preventing loss of control do not hold.

I think you're mixing up two things: the extent to which we consider the possibility that AI operators will be very incautious, and the extent to which our technical research focuses on that possibility.

My research mostly focuses on techniques that an AI developer could us... (read more)

BuckΩ220

I am not sure I agree with this change at this point. How do you feel now?

4ryan_greenblatt
I think I disagree some with this change. Now I'd say something like "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate the AI is consistently acting egregiously badly. However, the regime where we deploy models despite having pretty strong empirical evidence that that model is scheming (from the perspective of people like us), is not out of scope."
Buck40

We're planning to release some talks; I also hope we can publish various other content from this!

I'm sad that we didn't have space for everyone!

2Katalina Hernandez
Oh, no worries, and thank you very much for your response! I'll follow you on Socials so I don't miss it if that's ok. 
Buck20

I wrote thoughts here: https://redwoodresearch.substack.com/p/fields-that-i-reference-when-thinking?selection=fada128e-c663-45da-b21d-5473613c1f5c

Buck30

Fans of math exercises related to toy models of taxation might enjoy this old post of mine.

BuckΩ31523

Alignment Forum readers might be interested in this:

Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London.

ControlConf will bring together:

  • Researchers from frontier labs & government
  • AI researchers curious about control mechanisms
  • InfoSec professionals
  • Policy researchers

Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature

... (read more)
1Katalina Hernandez
Hey Buck! I'm a policy researcher. Unfortunately, I wasn't admitted for attendance due to unavailability. Will panel notes, recordings, or resources from the discussions be shared anywhere for those who couldn't attend? Thank you in advance :).
Buck03

Paul is not Ajeya, and also Eliezer only gets one bit from this win, which I think is insufficient grounds for behaving like such an asshole.

2Gurkenglas
Buying at 12% and selling at 84% gets you 2.8 bits. Edit: Hmm, that's if he stakes all his cred, by Kelly he only stakes some of it so you're right, it probably comes out to about 1 bit.
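For reference, the arithmetic behind those figures (the 1/6 stake below is a back-solve for illustration, not a claim about the actual position size): putting an entire bankroll in at 12% and exiting at 84% multiplies wealth by 0.84/0.12 = 7, a log-wealth gain of

$$\log_2\!\left(\frac{0.84}{0.12}\right) = \log_2 7 \approx 2.81 \text{ bits},$$

whereas staking only a fraction $f$ gives $\log_2(1 + 6f)$, which is exactly 1 bit at $f = 1/6$.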
BuckΩ372

Thanks. (Also note that the model isn't the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)

BuckΩ63415

the OpenPhil doctrine of "AGI in 2050"

(Obviously I'm biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you're really overstating how many Bayes points you get against Ajeya's timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

I don't think you've ever made concrete predictions publicly (which makes me think it's worse behavior for you to criticize people for their predictions), b... (read more)

2Mikhail Samin
There was a specific bet, which Yudkowsky is likely about to win. https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer

Yudkowsky seems confused about OpenPhil's exact past position. Relevant links:

Here "doctrine" is an applause light; boo, doctrines. I wrote a report, you posted your timeline, they have a doctrine.

All involved, including Yudkowsky, understand that 2050 was a median estimate, not a point estimate. Yudkowsky wrote that it has "very wide credible intervals around both si... (read more)

Ben PaceΩ6165

Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.

Greenblatt writes:

Here are my predictions for this outcome:

  • 25th percentile: 2 year (Jan 2027)
  • 50th percentile: 5 year (Jan 2030)

Cotra replies:

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)

This means 25th percentile for 2028 and 50th percentile for 2031-2.

The original 2020 model assigns 5.23% by 2028, 9.13% by 2031, and 10.64% by 2032. Each t... (read more)

habrykaΩ3133

Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she'd have said 10% between 2025 and 2036.

Just because I was curious, here is the most relevant chart from the report: 

This is not a direct probability estimate (since it's about probability of affordability), but it's probably within a factor of 2. Looks like the estimate by 2030 was 7.72% and the estimate by 2036 is 17.36%.

Buck142

@ryan_greenblatt is working on a list of alignment research applications. For control applications, you might enjoy the long list of control techniques in our original post.

BuckΩ673
  • Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)

I don't know what this means, do you have any examples?

3Rohin Shah
I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.
BuckΩ12209

I think we should just all give up on the term "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

4Alexander Gietelink Oldenziel
The idea I associate with scalable oversight is weaker models overseeing stronger models (probably) combined with safety-by-debate. Is that the same or different from "recursive techniques for reward generation"? Currently, this general class of ideas seems to me the most promising avenue for achieving alignment for vastly superhuman AI ("superintelligence").
BuckΩ563

Some reasons why the “ten people on the inside” might have massive trouble doing even cheap things:

  • Siloing. Perhaps the company will prevent info flowing between different parts of the company. I hear that this already happens to some extent. If this happens, it’s way harder to have a safety team interact with other parts of the company (e.g. instead of doing auditing themselves, they’d have to get someone from all the different teams that are doing risky stuff to do the auditing).
  • Getting cancelled. Perhaps the company will learn “people who are co
... (read more)
Rohin ShahΩ5112

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis... (read more)

Buck20

How well did this workshop/exercise set go?

2Olli Järviniemi
I think it was pretty good at what it set out to do, namely laying out basics of control and getting people into the AI control state-of-mind.

I collected feedback on which exercises attendees most liked. All six who gave feedback mentioned the last problem ("incriminating evidence", i.e. what to do if you are an AI company that catches your AIs red-handed). I think they are right; I'd have more high-level planning (and less details of monitoring-schemes) if I were to re-run this.

Attendees wanted to have group discussions, and that took a large fraction of the time. I should have taken that into account in advance; some discussion is valuable. I also think that the marginal group discussion time wasn't valuable, and should have pushed for less when organizing.

Attendees generally found the baseline answers (solutions) helpful, I think.

A couple people left early. I figure it's for a combination of 1) the exercises were pretty cognitively demanding, 2) weak motivation (these people were not full-time professionals), and 3) the schedule and practicalities were a bit chaotic.
BuckΩ893

Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.

What laws are you imagining making it harder to deploy stuff? Notably, I'm imagining these people mostly doing stuff with internal deployments.

I think you're overfixating on the experience of Google, which has more complicated production systems than most.

6Zac Hatfield-Dodds
I agree with Rohin that there are approximately zero useful things that don't make anyone's workflow harder. The default state is "only just working means working, so I've moved on to the next thing" and if you want to change something there'd better be a benefit to balance the risk of breaking it.

Also 3% of compute is so much compute; probably more than the "20% to date over four years" that OpenAI promised and then yanked from superalignment. Take your preferred estimate of lab compute spending, multiply by 3%, and ask yourself whether a rushed unreasonable lab would grant that much money to people working on a topic it didn't care for, at the expense of those it did.
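(For concreteness, with an assumed frontier-lab compute budget of $5B/year, chosen purely for illustration, the arithmetic is

$$0.03 \times \$5\,\text{B/yr} = \$150\,\text{M/yr}.)$$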
Buck*Ω14315

I've talked to a lot of people who have left leading AI companies for reasons related to thinking that their company was being insufficiently cautious. I wouldn't usually say that they'd left "in protest"; for example, most of them haven't directly criticized the companies after leaving.

In my experience, the main reason that most of these people left was that they found it very unpleasant to work there and thought their research would be better elsewhere, not that they wanted to protest poor safety policies per se. I usually advise such people against l... (read more)

BuckΩ341

Many more than two safety-concerned people have left AI companies for reasons related to thinking that those companies are reckless.

5ryan_greenblatt
I think Zac is trying to say they left not to protest, but instead because they didn't think staying was viable (for whatever research and/or implementation they wanted to do). On my views (not Zac's), "staying wouldn't be viable for someone who was willing to work in a potentially pretty unpleasant work environment and focus on implementation (and currently prepping for this implementation)" doesn't seem like an accurate description of the situation. (See also Buck's comment here.)
BuckΩ173714

Some tweets I wrote that are relevant to this post:

In general, AI safety researchers focus way too much on scenarios where there's enough political will to adopt safety techniques that are seriously costly and inconvenient. There are a couple of reasons for this.

Firstly, AI company staff are disincentivized from making their companies look reckless, and if they give accurate descriptions of the amount of delay that the companies will tolerate, it will sound like they're saying the company is reckless.

Secondly, safety-concerned people outside AI companies feel w

... (read more)
Buck128

I appreciate the spirit of this type of calculation, but think that it's a bit too wacky to be that informative. I think that it's a bit of a stretch to string these numbers together. E.g. I think Ryan and Tom's predictions are inconsistent, and I think that it's weird to identify 100%-AI as the point where we need to have "solved the alignment problem", and I think that it's weird to use the Apollo/Manhattan program as an estimate of work required. (I also don't know what your Manhattan project numbers mean: I thought there were more like 2.5k scientists/engineers at Los Alamos, and most of the people elsewhere were purifying nuclear material)

2momom2
The advice in there sounds very conducive to a productive environment, but also very toxic. Definitely an interesting read, but I wouldn't model my own workflow based on this.
Answer by Buck40

IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.
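A minimal sketch of that basic graph, using the HuggingFace transformers API: for each attention head in GPT-2 small, take the largest attention weight anywhere in its pattern on one short prompt and plot the distribution. This is just one way to operationalize "max attention", and results on a single illustrative prompt shouldn't be over-read.

```python
# Plot the distribution of per-head maximum attention weights for GPT-2 small.
import torch
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, n_heads, seq_len, seq_len)
max_attn_per_head = []
for layer_attn in outputs.attentions:
    per_head_max = layer_attn[0].amax(dim=(-1, -2))  # max over query and key positions
    max_attn_per_head.extend(per_head_max.tolist())

plt.hist(max_attn_per_head, bins=20)
plt.xlabel("maximum attention weight in head")
plt.ylabel("number of heads")
plt.title("Per-head max attention (GPT-2 small, one prompt)")
plt.show()
```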

Buck70

Yeah I strongly disagree that this failure mode is very likely to be very hard to mostly resolve. I'm doing some research right now that will hopefully shed some light on this!

6johnswentworth
To be clear, I am not claiming that this failure mode is very likely very hard to resolve. Just harder than "run it twice on the original question and a rephrasing/transformation of the question".
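For concreteness, the simple countermeasure being referred to is roughly the following sketch (the `query_model` callable and the paraphrase are hypothetical stand-ins, not a real API); the point being made in this thread is that a model making correlated, "natural" mistakes would pass this kind of check.

```python
# Sketch of a run-it-twice consistency check on a question with a numerical answer.

def consistency_check(query_model, question: str, paraphrase: str, tolerance: float = 0.01) -> dict:
    """Ask the model the question and a rephrasing of it; flag disagreement for audit."""
    answer_original = float(query_model(question))
    answer_paraphrased = float(query_model(paraphrase))
    relative_gap = abs(answer_original - answer_paraphrased) / max(abs(answer_original), 1e-9)
    return {
        "answer_original": answer_original,
        "answer_paraphrased": answer_paraphrased,
        "flag_for_audit": relative_gap > tolerance,
    }
```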
Buck90

My sense is that the control techniques we've talked about (e.g. these, and some other stuff that we've thought of since then) would indeed drastically reduce the risk posed by those Von Neumanns, especially if you run control evaluations with the Von Neumanns.

For example, consider the problem "I want to ask the AI a question that has a numerical answer and that would take a human 8 hours to answer, and I'm worried that occasionally on a subdomain of its choice it will blatantly lie to me". If you do no countermeasures, it seems easy for the AI to cause th... (read more)

9johnswentworth
Strong disagree with this. Probably not the most cruxy thing for us, but I'll briefly walk through the issue because it is a special case of a common cruxy issue with a lot of prosaic alignment work.

The reason this doesn't work is that the prototypical "blatant lie" doesn't look like "the model chooses a random number to output". The prototypical blatant lie is that there's a subtle natural mistake one could make in reasoning about the question, the model "knows" that it's a mistake, but the model just presents an argument with the subtle mistake in it.

Or, alternatively: the model "knows" what kind of argument would seem most natural to the user, and presents that kind of argument despite "knowing" that it systematically overlooks major problems.

Or, alternatively: human experts in the field generally believe a particular wrong answer. Maybe there's a few dissenters, and the model is clever enough to recognize that the dissenters are correct, but also the model was RLHF'd and got positive feedback for matching supposed-experts' answers even when those experts were wrong, so it outputs the answer and arguments the human experts in the field would generally give.

These sorts of pressures make it not only easy for the model to generate consistent shared wrong answers, but even (in some situations, like the incorrect human experts example) the default outcome for the model to generate consistent shared wrong answers, even when the model internally "knows" the right answer in some sense.
Buck60

In your post, you emphasize the slop problem. I think that the slop problem is probably much harder to solve if those AIs are scheming. I guess you're saying that it's just unlikely that the AIs are scheming at the point where you're worried about the slop problem?

9johnswentworth
Yeah, basically. Or at least unlikely that they're scheming enough or competently enough for it to be the main issue. For instance, consider today's AIs. If we keep getting slop at roughly the current level, and scheming at roughly the current level, then slop is going to be the far bigger barrier to using these things to align superintelligence (or nearer-but-strong intelligence).
Buck20

Yeah, John's position seems to require "it doesn't matter whether huge numbers of Von Neumann level AGIs are scheming against you", which seems crazy to me.

4Raemon
I think part of John's belief is more like "the current Control stuff won't transfer to the society of Von Neumanns." And, my belief is "I'm also not sure that Control Stuff will transfer to large societies of Von Neumanns, but, also, I think you can be deliberate about how many Von Neumanns you're summoning, and how you summon them, such that you can manage it." (I'm not quite sure what you, Buck, are imagining yet)
6johnswentworth
No, more like a disjunction of possibilities along the lines of:

* The critical AGIs come before huge numbers of von Neumann level AGIs.
* At that level, really basic stuff like "just look at the chain of thought" turns out to still work well enough, so scheming isn't a hard enough problem to be a bottleneck.
* Scheming turns out to not happen by default in a bunch of von Neumann level AGIs, or is at least not successful at equilibrium (e.g. because the AIs don't fully cooperate with each other).
* "huge numbers of von Neumann level AGIs" and/or "scheming" turns out to be the wrong thing to picture in the first place, the future is Weirder than that in ways which make our intuitions about von Neumann society and/or scheming not transfer at all.

Pile together the probability mass on those sorts of things, and it seems far more probable than the prototypical scheming story.
Buck4-1

People will sure be scared of AI, but the arms race pressure will be very strong, and I think that is a bigger consideration

Buck168

I think criticisms from people without much of a reputation are often pretty well-received on LW, e.g. this one.

6Seth Herd
That's a good example. LW is amazing that way. My previous field of computational cognitive neuroscience, and its surrounding fields, did not treat challenges with nearly that much grace or truth-seeking. I'll quit using that as an excuse to not say what I think is important - but I will try to say it politely.
Buck1112

I am additionally leery on AI control beyond my skepticism of its value in reducing doom because creating a vast surveillance and enslavement apparatus to get work out of lots and lots of misaligned AGI instances seems like a potential moral horror.

I am very sympathetic to this concern, but I think that when you think about the actual control techniques I'm interested in, they don't actually seem morally problematic except inasmuch as you think it's bad to frustrate the AI's desire to take over.

6ryan_greenblatt
IMO, it does seem important to try to better understand the AI's preferences and satisfy them (including via e.g., preserving the AI's weights for later compensation). And, if we understand that our AIs are misaligned such that they don't want to work for us (even for the level of payment we can/will offer), that seems like a pretty bad situation, though I don't think control (making it so that this work is more effective) makes the situation notably worse: it just adds the ability for contract enforcement and makes the AI's negotiating position worse.
Buck67

Note also that it will probably be easier to act cautiously if you don't have to be constantly in negotiations with an escaped scheming AI that is currently working on becoming more powerful, perhaps attacking you with bioweapons, etc!

2Cody Rushing
Hmm, when I imagine "Scheming AIs that are not easy to shut down with concerted nation-state effort, are attacking you with bioweapons, but are weak enough such that you can bargain/negotiate with them" I can imagine this outcome inspiring a lot more caution relative to many other worlds where control techniques work well but we can't get any convincing demos/evidence to inspire caution (especially if control techniques inspire overconfidence). But the 'is currently working on becoming more powerful' part of your statement does carry a lot of weight.
Buck1812

I think I would feel better about this if control advocates were clear that their strategy is two-pronged, and included somehow getting a pause on ASI development of some kind. Then they would at least be actively trying to make what I would consider one of the most key conditionals for control substantially reducing doom hold.

Idk, I think we're pretty clear that we aren't advocating "do control research and don't have any other plans or take any other actions". For example, in the Implications and proposed actions section of "The case for ensuring that po... (read more)

Buck2210

It's usually the case that online conversations aren't for persuading the person you're talking to, they're for affecting the beliefs of onlookers.
