All of Rohin Shah's Comments + Replies

Rohin ShahΩ220

I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post.

You mention permission systems, which are certainly a big deal, but I didn't see anything about broader configuration mechanisms, many of which can be motivated solely by efficiency and incidentally help with security. (I was disputing your efficiency -> less security claim; permission mechanisms aren't a valid counterargument since they aren't motivated by efficiency.)

One reason I'm skeptical is that you can get alm

... (read more)
Rohin ShahΩ442

Technologies that allow workers to be more isolated from each other gain you both convenience (because your coworkers no longer accidentally mess up what you’re doing) and also security (because you can remove your coworker’s permission to affect the code you’re running), but generally reduce efficiency. When we try to buy efficiency at the cost of convenience, we might lose security too.

Hmm, this feels less likely to me. Isolation can often be an efficiency benefit because one employee's mistake doesn't propagate to screwing up the work of everyone else, ... (read more)

4Buck
I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post. I agree that the increasing value of compute might increase the isolation you use for the reason you said. One reason I'm skeptical is that you can get almost all that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. having naming conventions about who is allowed to read or write what) that get you the reliability benefits without getting any security.

I think the evidence is roughly at "this should be a weakly held prior easily overturned by personal experience": https://www.lesswrong.com/posts/c8EeJtqnsKyXdLtc5/how-long-can-people-usefully-work

That said, I do think there's enough evidence that I would bet (not at extreme odds) that it is bad for productivity to have organizational cultures that emphasize working very long hours (say > 60 hours / week), unless you are putting in special care to hire people compatible with that culture. Partly this is because I expect organizations to often be unable to overcome weak priors even when faced with blatant evidence.

Andrew Gelman: "Bring on the Stupid: When does it make sense to judge a person, a group, or an organization by its worst?" (Not quite as clearcut, since it doesn't name the person in the title, but still)

(If this also doesn't count as "intellectual writing circles", consider renaming your category, since I clearly do not understand what you mean, except inasmuch as it is "rationalist or rationalist-adjacent circles".) 

2Ben Pace
I certainly consider Gelman a valid example of the category :)
2habryka
The Gelman post in question is importantly not about arguing for the linked post being bad/stupid; it was taking that fully as a given. I actually think that's an importantly different dynamic, because if you are in a context where you can actually presume with your audience that something is bad, then writing it in a title isn't actually influencing the status landscape very much (though it's tricky).

Similarly, I think writing a title which presumes the falsity of the existence of a Christian god would in other contexts be a pretty bad thing to do, but on LessWrong be totally fine, for similar reasons.

Hmm, interesting. I was surprised by the claim so I did look back through ACX and posts from the LW review, and it does seem to back up your claim (the closest I saw was "Sorry, I Still Think MR Is Wrong About USAID", note I didn't look very hard). EDIT: Actually I agree with sunwillrise that "Moldbug sold out" meets the bar (and in general my felt sense is that ACX does do this).

I'd dispute the characterization of this norm as operating "within intellectual online writing circles". I think it's a rationalist norm if anything. For example I went to Slow Bo... (read more)

Ben Pace1210

I am both surprised and glad my comment led to an update :)

FWIW I never expect the political blogs to be playing by the good rules of the rest of the intellectual writing circles, I view them more as soldiers. Not central examples of soldiers, but enough so that I'd repeatedly be disappointed by them if I expected them to hold themselves to the same standards.

(As an example, in my mind I confidently-but-vaguely recall some Matt Yglesias tweets where he endorsed dishonesty for his side of the political divide on some meta-level, in order to win political conflicts; interested if anyone else recalls this / has a link.)

Rohin Shah11953

While I disagree with Nate on a wide variety of topics (including implicit claims in this post), I do want to explicitly highlight strong agreement with this:

I have a whole spiel about how your conversation-partner will react very differently if you share your concerns while feeling ashamed about them versus if you share your concerns as if they’re obvious and sensible, because humans are very good at picking up on your social cues. If you act as if it’s shameful to believe AI will kill us all, people are more prone to treat you that way. If you act a

... (read more)
[anonymous]2514

It could instead be "there is a real threat model for existential risk, and it is important that society does more to address it than it is currently doing". If you're going to share concerns at all, figure out the position you do have courage in, and then discuss that as if it is obvious and sensible, not as if you are ashamed of it.

The potential failure mode I see with this is that, if you're not paying sufficient attention to your rhetoric, you run the risk of activating people's anti-Pascal's mugging instincts.[1] As Jeremy Gillen said:

Everyone is

... (read more)

I don’t see any inconsistency in being unhappy with what titotal is doing and happy about what AI 2027 is doing.

I agree with this. I was responding pretty specifically to Zvi's critique in particular, which is focusing on things like the use of the word "bad" and the notion that there could be a goal to lower the status and prestige of AI 2027. If instead the critique was about e.g. norms of intellectual discourse I'd be on board.

That said I don't feel like your defense feels all that strong to me? I'm happy to take your word for it that there was lots of ... (read more)

But it isn't trend extrapolation?

If the current doubling time is T, and each subsequent doubling takes 10% less time, then you have infinite doublings (i.e. singularity) by time 10T. So with T = 4.5 months you get singularity by 45 months. This is completely insensitive to the initial conditions or to the trend in changes-in-doubling-time (unless the number "10%" was chosen based on trend extrapolation, but that doesn't seem to be the case).
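To spell out the arithmetic (a minimal derivation, assuming the 10% reduction holds exactly at every doubling): the doubling times form a geometric series, so the total time to infinitely many doublings is finite:

$$\sum_{k=0}^{\infty} T\,(0.9)^k \;=\; \frac{T}{1-0.9} \;=\; 10T, \qquad T = 4.5\ \text{months} \;\Rightarrow\; 10T = 45\ \text{months}.$$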

(In practice the superexponential model predicts singularity even sooner than 45 months, because of the additional effect from automated AI R&D.)

I don't see how this is responding to anything I've said? What in my comment are you disagreeing with or adding color to?

Again, my position is not "AI 2027 did something bad". My position is "stop critiquing people for having goals around status and prestige rather than epistemics, or at least do so consistently".

(Incidentally, I suspect bio anchors did better on the axis of getting good reviews / feedback, but that isn't particularly central to anything I'm claiming.)

4habryka
I was responding to this part:

And I was saying that this is also true for the early drafts of AI 2027. Only after a long discussion of the technical disagreements did it go on to a huge amplification thing. This seems directly relevant to that section.

I am responding to the part about consistent standards. I don’t really understand what you believe here, clearly you care a lot about people not using lots of rhetorical tricks and adversarial persuasion tactics all the time, and we’ve talked about that in the past, so I am just straightforwardly arguing that on those dimensions titotal’s post was much worse compared to AI 2027.

We don’t need to come to agreement on this part, it does seem kind of hard to evaluate. But in as much as your top level comment is arguing some kind of asymmetric standard is being applied, that just seems super wrong to me. I don’t know where I would put the line of encourage/discourage, but I don’t see any inconsistency in being unhappy with what titotal is doing and happy about what AI 2027 is doing.
Rohin Shah1818

Things I agree with:

  • AI 2027 was less crude in its use of symmetric weapons (which can often itself be a good symmetric weapon when the goal is to influence elites)
  • AI 2027 made lots of asymmetric choices (but so did titotal)
  • AI 2027 is doing better than "piece[s] of media" (but that bar is so incredibly low)

I disagree that titotal's critique is far away from AI 2027 on the relevant spectrum. For example, titotal's critique was posted on the EA Forum / LessWrong, and focused on technical disagreements, rather than going through a huge amplification / social m... (read more)

4habryka
Before AI 2027 was posted with a big amplification / media push, it underwent as far as I can tell the single most intense set of review and feedback requests of any big writing project I've seen so far. I don't know whether it was literally posted on LessWrong, but I've seen comments from many many dozens if not hundreds of people over the many dozens of revisions that the scenario underwent.

Like, I am quite into public discourse being better than private Google Doc systems, but AI 2027 was so widely circulated pre-publication in Google Doc format, with lots of focus on technical disagreements, that this seems easily much superior to what is going on with this post.
Rohin Shah2510

I mean, yes, if the goal of the post was to lower the status and prestige of AI 2027 and to do so through people reading the title and updating in that way, rather than to offer a helpful critique, then it is true that the title was the best local way to achieve that objective, epistemic commons be damned. I would hope for a different goal?

Come on, this is such an isolated demand for rigor. AI 2027 clearly had the goal of raising the status and prestige of belief in AI risk and short timelines. They employed tons of symmetric weapons in the pursuit of this... (read more)

5Ben Pace
(FWIW in this comment I am largely just repeating things already said in the longer thread... I wrote this mostly to clarify my own thinking.)

I think the conflict here is that, within intellectual online writing circles, attempting to use the title of a post to directly set a bottom line in the status of something is defecting on a norm, but this is not so in the 'internet of beefs' rest of the world, where titles are readily used as cudgels in status fights. Within the intellectual online writing circles, this is not a good goal for a title, and it's not something that AI 2027 did (or, like, something that ~any ACX post or ~any LW curated post does)[1]. This is not the same as "not putting your bottom line in the title", it's "don't attempt to directly write the bottom line about the status of something in your title".

I agree you're narrowly correct that it's acceptable to have goals for changing the status of various things, and it's good to push back on implying that that isn't allowed by any method. But I think Zvi did make the point that the critique post attempted to do exactly this with its title, which is not something AI 2027 did, and which is IMO defecting on a worthy truce in the intellectual online circles.

1. ^ To the best of my recollection. Can anyone think of counterexamples?
habryka1814

I don't think AI 2027 did anything even close to as crude as calling the thing you are arguing against just "bad" in your title. 

Indeed, I think overall AI 2027 is really doing remarkably well at being asymmetric in a really huge number of its choices (I am of course biased, having been involved in many of those choices, but I currently would say that AI 2027 is as close to the very top at the intersection of "accessible" and "trying to make itself succeed and be compelling only if its claims are indeed true" as any piece of media out there).

(I do... (read more)

Rohin ShahΩ193422

Great work!

We've also been seeing a similar vulnerability in our debate work, where when we use a weaker LLM as the judge, it is very hard to get the judge to take seriously the possibility that the debaters might be lying to it (using only prompting). As a result it often just believes the dishonest debater despite clear evidence to the contrary and gets the judgment wrong.

This problem mostly goes away if we switch to a more capable LLM as the judge model. My sense is that there's some baseline level of capability that is required to be a good monitor (si... (read more)

I think you are being led astray by having a one-dimensional notion of intelligence.

What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators".

Well yes, that is the idea, there is information asymmetry between the AI and humans. Note that this can still apply even when the AI is much smarter than the humans.
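For concreteness, the standard CIRL objective makes this asymmetry explicit; this is a minimal textbook-style statement of the setup, not anything specific to this exchange. The robot picks its policy to maximize the human's utility under its belief over the utility parameters:

$$\pi_R \;\in\; \arg\max_{\pi_R'} \; \mathbb{E}_{\theta \sim b}\!\left[\, \textstyle\sum_t U_\theta(s_t, a^H_t, a^R_t) \;\middle|\; \pi_R', \pi_H \,\right]$$

where $\theta$ is observed by the human but not the robot, and the robot's belief $b$ is updated only from observing the human's behavior. That gap can remain even if the robot is otherwise far more capable than the human.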

CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]

It accepts corrections as long as it has less c

... (read more)
1Towards_Keeperhood
Thanks. (I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won't help you much for the parts of the problem we are most bottlenecked on.) I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between required time and safety of a project, but I think it would be able to predict it well enough that the differences in what strategy it uses wouldn't be large. Or do you imagine strategically keeping some information from the AI? Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.) Even if the alignment works out perfectly, when the AI is smarter and the humans are like "actually we want to shut you down", the AI does update that the humans are probably worried about something, but if the AI is smart enough and sees how the humans were worried about something that isn't actually going to happen, it can just be like "sorry, that's not actually in your extrapolated interests, you will perhaps understand later when you're smarter", and then tries to fulfill human values. But if we're confident alignment to humans will work out we don't need corrigibility. Corrigibility is rather intended so we might be able to recover if something goes wrong. If the values of the AI drift a bit, then the AI will likely notice this before the humans and take measures that the humans don't find out or won't (be able to) change its values back, because that's the strategy that's best according to the AI's new values. Likewise just updating on new information, not changing terminal goals. Also note that parents often think (sometimes correctly) that they better know what is in the child's extrapolated i

Not a full response, but some notes:

  • I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
  • I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.
  • I care quite a
... (read more)
1Towards_Keeperhood
Yeah thanks for distinguishing. It's not at all obvious to me that Paul would call CIRL "corrigible" - I'd guess not, but idk. My model of what Paul thinks about corrigibility matches my model of corrigibility much much closer than CIRL. It's possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. Sorry that was very poorly phrased by me. What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators". So yes I agree. I still find it confusing though why people started calling that corrigibility. In your previous comment you wrote: I don't understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn't. (Of course it doesn't matter if all goes well because the CIRL AI would go on an become an aligned superintelligence, but it's not correctable, and I don't see why you think it's evidence.) I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don't know what you mean here. Also, in my view corrigibility isn't just about what happens if the alignment works out totally fine, but still maintain correctability if it doesn't: If something goes wrong with CIRL so its goal isn't pointed to the human utility function anymore, it would not want operators to correct it.  The One central hope behind corrigibility was that if something went wrong that changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (Where the hope was that there would be a quite simple and robust such corrigibility part, but we haven't found it yet.) E.g. if you look at the corrigibility paper, you could imagine that if they actual

I definitely was not thinking about the quoted definition of corrigibility, which I agree is not capturing what at least Eliezer, Nate and Paul are saying about corrigibility (unless there is more to it than the quoted paragraph). I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.

I do wish I hadn't used the phrases "object-level" and "meta-level" and just spent 4 paragraphs unpacking what I meant by that because in hindsight that was con... (read more)

1Towards_Keeperhood
Yeah fair point. I don't really know what Paul means with corrigibility. (One hypothesis: Paul doesn't think in terms of consequentialist cognition but in terms of learned behaviors that generalize, and maybe the question "but does it behave that way because it wants the operator's values to be fulfilled or because it just wants to serve?" seems meaningless from Paul's perspective. But idk.) I'm pretty sure Eliezer would not want the term "corrigibility" to be used for the kind of correctability you get in the early stages of CIRL when the AI doesn't already know its own values and how to accomplish them better than the operators. (Eliezer actually talked a bunch about this CIRL-like correctability in his 2001 report "Creating Friendly AI". (Probably not worth your time to read, though given the context that it was 2001, there seemed to me to be some good original thinking going on there which I didn't see often. Also you can see Eliezer being optimistic about alignment.)) And I don't see it as evidence that Eliezer!corrigibility isn't anti-natural. (In the following I use "corrigibility" in the Eliezer-sense. I'm pretty confident that all of the following matches Eliezer's model, but not completely sure.) The motivation behind corrigibility was that aligning superintelligence seemed to hard, so we want to aim an AI to do a pivotal task that gets humanity on a course to likely properly aligning superintelligence later. The corrigible AI would be just pointed to accomplish this task, and not to human values at all. It should be this bounded thing that only cares about this bounded task and afterwards shuts itself down. It shouldn't do the task because it wants to accomplish human values and the task seems like a good way to accomplish it. Human values are unbounded, and it might be less likely shut itself down afterwards. Corrigibility has nothing to do with human values. Roughly speaking, we can perhaps disentangle 3 corrigibility approaches: 1. Train for co
Rohin ShahΩ330

I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.

Rohin ShahΩ220

Hmm, maybe we do disagree. I personally like circuit style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)

Maybe this doesn't come up as much in your conversation with people, but I've seen internals based testing methods which don't clearly ground out in behavioral evidence discussed often.

... (read more)
Rohin ShahΩ360

You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).

[...]

However, I'm quite skeptical of this type of consideration making a big difference because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs at the time of AlexNet 12 years ago, (invalidating the view that there is some relatively narrow range of inputs in which neither

... (read more)
Rohin ShahΩ440

In some sense I agree with this post, but I'm not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate "evidence which is just from fancy internals-based methods (and can't be supported by human inspection of AI behavior)", and that is the primary story for why it is impactful? I don't think this is true of probing, SAEs, circuit analysis, debate, ...

6ryan_greenblatt
This is often important in my thinking: when thinking about various internals based methods that could test for scheming (but won't produce direct behavioral evidence), this comes up. I wrote this doc after noticing that I wanted to reference this somewhere.

Also, I often hear people discuss getting non-behavioral evidence for scheming using internals/interp. (As an example, probes for detecting deceptive cognition and then seeing if these fire more than expected on honeypots.) And, understanding this isn't going to result in legible evidence is important for understanding the theory of change for this work: it's important that you can iterate usefully against the method. I think people sometimes explicitly model iterating against these testing methods, but sometimes they don't.

Maybe this doesn't come up as much in your conversation with people, but I've seen internals based testing methods which don't clearly ground out in behavioral evidence discussed often. (E.g., it's the application that the Anthropic interp team has most discussed, it's the most obvious application of probing for internal deceptive reasoning other than resampling against the probes.)

Notably Dario seemingly thinks that circuit style interp analysis (which IMO would be unlikely to yield behavioral evidence on its own) is the main way we might get definitive (aka legible) evidence of scheming. So, I think Dario's essay on interp is an example of someone disagreeing with this post! Dario's essay on interp came out after this post was published, otherwise I might have referenced it.

I wasn't trying to trigger any particular research reprioritization with this post, but I historically found that people hadn't really thought through this (relatively obvious once noted) consideration and I think people are sometimes interested in thinking through specific theories of impact for their work.
Rohin ShahΩ340

(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)

I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)

Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its... (read more)

3Michael Thiessen
This is definitely the crux so probably really the only point worth debating. RLHF is just papering over problems. Sure, the model is slightly more difficult to jailbreak but it's still pretty easy to jailbreak. Sure, the agent is less likely through RLHF to output text you don't like, but I think agents will reliably overcome that obstacle as useful agents won't just be outputting the most probable continuation, they'll be searching through decision space and finding those unlikely continuations that will score well on its task. I don't think RLHF does anything remotely analogous to making it care about whether it's following your intent, or following a constitution, etc. You're definitely aware of the misalignment that still exists with our current RLHF'd models and have read the recent papers on alignment faking etc. so I probably can't make an argument you haven't already heard. Maybe the true crux is you believe this is a situation where we can muddle through by making harmful continuations ever more unlikely? I don't. 
Rohin ShahΩ350

like being able to give the judge or debate partner the goal of actually trying to get to the truth

The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions but the assumptions are more like "assume the models are sufficiently capable" not "assume we can give them a goal". When applying this in practice there is also ... (read more)
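For readers who haven't seen the setup, here is a minimal sketch of the game structure being referred to (illustrative placeholders only; debater_a, debater_b, and judge are hypothetical callables, not an existing API):

```python
# Minimal sketch of a two-debater, one-judge debate game (illustrative only).
def debate(question, debater_a, debater_b, judge, n_rounds=3):
    transcript = [f"Question: {question}"]
    for _ in range(n_rounds):
        # Each debater sees the full transcript and argues for its assigned answer,
        # including rebutting the other side's previous claims.
        transcript.append("A: " + debater_a(transcript))
        transcript.append("B: " + debater_b(transcript))
    # The judge sees only the transcript. The hoped-for property (under the
    # theorems' capability assumptions) is that honest argument is a winning
    # strategy, so the judge's verdict tracks the truth.
    return judge(transcript)
```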

1Michael Thiessen
(Sorry, hard to differentiate quotes from you vs quotes from the paper in this format) If the judge is human, sure. If the judge is another AI, it seems like a wild assumption to me. The section on judge safety in your paper does a good job of listing many of the problems. One thing I want to call out as something I more strongly disagree with is: I don't think we have any fully aligned models, and won't any time soon. I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less), and Goodhart's law definitely applies here.

Maybe this is just using the same terms for two different systems, but the paper also talks about using judging for monitoring deployed systems: Is this intended only as an auditing mechanism, not a prevention mechanism (e.g. we noticed in the wild the AI is failing debates, time to roll back the release)? Or are we trusting the AI judges enough at this point that we don't need to stop and involve a human? I also worry the "cheap system with high recall but low precision" will be too easy to fool for the system to be functional past a certain capability level.

From a 20,000 ft perspective, this all just looks like RLHF with extra steps that make it harder to reason about whether it'll work or not. If we had evidence that RLHF worked great, but the only flaw with it was the AI was starting to get too complex for humans to give feedback, then I would be more interested in all of the fiddly details of amplified oversight and working through them. The problem is RLHF already doesn't work, and I think it'll only work less well as AI becomes smarter and more agentic.
Rohin ShahΩ482

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition."

Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).

We've got a lot of interest, so it's taking some time to go through applications. If you haven't heard back by the end of March, please ping me; hopefully it will be sooner than that.

The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don't want both teams to review all applications separately.)

You can still express interest in both teams (e.g. in the "Any other info" question), and the reviewer will take that into account and consider whether to move your application to the other team, but Gemini Safety reviewers aren't going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.

There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don't know which of the two teams would be a better fit, you can submit a separate application for each.

Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn't be taken as reflective of some big strategy. I'm guessing we'll go back to hiring a mix of the two around mid-2025.

You can check out my career FAQ, as well as various other resources linked from there.

Still pretty optimistic by the standards of the AGI safety field, somewhat shorter timelines than I reported in that post.

Neither of these really affect the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I'd do basically the same things I'm doing now.

(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don't know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)

More capability research than AGI safety research but idk what the ratio is and it's not something I can easily find out

Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.

We'll leave it up until the later of those two (and probably somewhat beyond that, but that isn't guaranteed). I've edited the post.

Rohin ShahΩ220

Is that right?

Yes, that's broadly accurate, though one clarification:

This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly

That's a reason (and is probably sufficient by itself), but I think a more important reason is that your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.

I think this will

... (read more)
Rohin ShahΩ220

If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point

That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization when you don't use MONA, and also that if you actually use MONA, then it will measurably improve quality.

For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which ... (read more)
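To make the contrast concrete, here is a toy sketch as I understand it (hypothetical names; not the paper's implementation):

```python
# Toy sketch contrasting outcome-based RL with MONA-style training (hypothetical names).

def outcome_rl_return(trajectory, outcome_reward):
    # Ordinary RL: reward depends on downstream outcomes (e.g. task success minus
    # latency), so skipping error-checking can pay off if it reduces latency.
    return outcome_reward(trajectory[-1].final_state)

def mona_step_objective(step, approval):
    # MONA-style: each action is trained against a (foresighted) overseer's approval
    # of that step alone, with no reward propagated back from later steps or final
    # outcomes, so "skipped error handling to cut latency" scores poorly immediately.
    return approval(step.state, step.action)
```

Under the second objective the latency-vs-robustness tradeoff is judged by the overseer at the step where it is made, which is roughly why the quality difference could show up in measurements.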

3Fabien Roger
To rephrase what I think you are saying about the situations where work on MONA is very helpful:

  • By default people get bitten by long-term optimization. They notice issues in prod because it's hard to catch everything. They patch individual failures when they come up, but don't notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don't try MONA-like techniques because it's not very salient when you are trying to fix individual failures and does not pass cost-benefit to fix individual failures.
  • If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly and because people don't by default think of the individual failures you've seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk - though given competitive pressures, that might not be the main factor driving decisions and so you don't have to make an ironclad case for MONA reducing catastrophic risk).

Is that right? I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
Rohin ShahΩ220

I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").

Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).

Rohin ShahΩ220

I have some credence in all three of those bullet points.

For MONA it's a relatively even mixture of the first and second points.

(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)

For debate it's mostly the first point, and to some extent the third point.

2Fabien Roger
That's right. If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it. In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won't have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards.

I would give different advice if:

  • the alignment tax of MONA were tiny
  • there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA

I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don't disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
2ryan_greenblatt
I also have the "that's obvious reaction", but possibly I'm missing somne details. I also think it won't perform well enough in practice to pencil given other better places to allocate safety budget (if it does trade off which is unclear).
Rohin Shah*Ω220

Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings

I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generaliza... (read more)

2Fabien Roger
My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don't show up in feedback quality evals. I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.
Rohin ShahΩ541

I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?

I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignm... (read more)

5Fabien Roger
Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:

  • on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won't use them by default?
  • on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
  • on the path to speeding up safety-critical domains?
  • (valuable for some other reason?)
2Fabien Roger
That is not my claim. By "I would have guessed that methods work on toy domains and math will be useless" I meant "I would have guessed that if a lab decided to do process-based feedback, it will be better off not doing a detailed literature review of methods in MONA and followups on toy domains, and just do the process-based supervision that makes sense in the real domain they now look at. The only part of the method section of MONA papers that matters might be "we did process-based supervision"." I did not say "methods that work on toy domains will be useless" (my sentence was easy to misread). I almost have the opposite stance, I am closer to "it's so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. So process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don't increase the probability that people working on capabilities implement myopic objectives."  But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the "capability researcher does what looks easiest and sensible to them" baseline)? Not a crux. My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low confidence guess is that it's not high priority for people working on x-risk safety to speed up that work. But I am excited about debate work that is not just about improving feedback quality. For example I am interested in debate vs schemers or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality
Rohin ShahΩ330

I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.

Rohin Shah*Ω529622

I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.

I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimizat... (read more)

3Oliver Daniels
I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations to cover this case? (perhaps on the assumption that if your "control measures" are successful, they should be able to elicit aligned behavior from scheming models and this behavior can be reinforced?)
8Fabien Roger
I think additional non-moonshot work in these domains will have a very hard time helping. [low confidence]

My high level concern is that non-moonshot work in these clusters may be the sort of things labs will use anyway (with or without safety push) if this helped with capabilities because the techniques are easy to find, and won't use if it didn't help with capabilities because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters. [edit: I was at least somewhat wrong, see comment threads below]

Here are things that would change my mind:

  • If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won't bother using by default, but would bother using if there was existing work on how to make them work.
  • (For the different question of preventing AIs from using "harmful knowledge", I think work on robust unlearning and gradient routing may have this property - the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
  • If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
  • (In the different domain of CoT-faithfulness I think there is a lot of value in demonstrating the risk of opaque CoT well-enough that
7Buck
I don't know what this means, do you have any examples?
BuckΩ12209

I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

Rohin ShahΩ5112

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis... (read more)

Rohin Shah*Ω9127

In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.

So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology

This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit ... (read more)

2Martin Randall
Based on my understanding of the article:

  1. The sound over-approximation of human psychology is that humans are psychologically safe from information attacks of less than N bits. "Talk Control" is real, "Charm Person" is not.
  2. Under "Steganography, and other funny business" there is a sketched safety specification that each use of the AI will communicate at most one bit of information.
  3. Not stated explicitly: humans will be restricted to using the AI system no more than N times.

Comments and concerns:

  1. Human psychology is also impacted by the physical environment, eg drugs, diseases, being paperclipped. The safety of the physical environment must be covered by separate verifications.
  2. There could be a side-channel for information if an AI answers some questions faster than others, uses more energy for some questions than others, etc.
  3. Machine interpretability techniques must be deployed in a side-channel resistant way. We can't have the AI thinking about pegasi and unicorns in a morse code pattern and an intern reads it and ten years later everyone is a pony.
  4. There probably need to be multiple values of N for different time-frames. 1,000 adversarial bits in a minute is more psychologically dangerous than the same number of bits over a year.
  5. Today, we don't know good values for N, but we can spend the first few bits getting higher safe values of N. We can also use the Yudkowskian technique of using volunteers that are killed or put into cryonic storage after being exposed to the bits.
  6. If we could prove that AIs cannot acausally coordinate we could increase the bound to N bits per AI, or AI instance. Again, a good use for initial bits.
  7. None of this stops us going extinct.
Rohin Shah*Ω132212

I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.

By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i... (read more)

9Buck
Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people. What laws are you imagining making it harder to deploy stuff? Notably I'm imagining these people mostly doing stuff with internal deployments. I think you're overfixating on the experience of Google, which has more complicated production systems than most.
Rohin ShahΩ780

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post ... (read more)

Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.

But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.

In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI develo... (read more)

Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.

(We've seen this comment and are looking into options)

Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.

For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!

Thanks, and interesting generalization!

My thoughts depend on whether you train the weaker model.

  1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
  2. If you aren't training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be go
... (read more)
4Knight Lee
Wait, can we try to imagine how point 1 happens? If the weaker model always follows the advice of the stronger model, the RL reward on the stronger model will become useless since every output will get maximum reward. Remember the stronger model doesn't care at all about the weaker model's success, so if the weaker model rewards it for incomprehensible ideas, the strong model won't give it incomprehensible ideas which actually work, but incomprehensible ideas full of impressive sounding nonsense. The weaker model can only reward the stronger model for ideas it can understand, i.e. predict will get a reward.

The crux is that the final reward signal only leaks into the stronger model through the weaker model's prediction of that reward signal, mixed with the weaker model's "values." This means that many safety properties of the weaker model can carry over into the stronger model (to some extent).

It should have the same alignment tax as MONA, since both my version and MONA have some kind of weaker model set the stronger model's reward signal. MONA's agent directly uses the stronger model's ideas; my version lets the weaker model change the stronger model's ideas (e.g. refuse unethical decisions, ask followup questions), but the weaker model doesn't have to change them: it's allowed to do the same thing as what MONA does. The alignment tax is relatively less due to RL on the weaker model, though care must be taken to keep the weaker model trusted.

Do you agree?