All of Rohin Shah's Comments + Replies

We've got a lot of interest, so it's taking some time to go through applications. If you haven't heard back by the end of March, please ping me; hopefully it will be sooner than that.

The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don't want both teams to review all applications separately.)

You can still express interest in both teams (e.g. in the "Any other info" question), and the reviewer will take that into account and consider whether to move your application to the other team, but Gemini Safety reviewers aren't going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.

There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don't know which of the two teams would be a better fit, you can submit a separate application for each.

Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn't be taken as reflective of some big strategy. I'm guessing we'll go back to hiring a mix of the two around mid-2025.

You can check out my career FAQ, as well as various other resources linked from there.

Still pretty optimistic by the standards of the AGI safety field, somewhat shorter timelines than I reported in that post.

Neither of these really affects the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I'd do basically the same things I'm doing now.

(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don't know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)

More capability research than AGI safety research, but I don't know what the ratio is and it's not something I can easily find out.

Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.

We'll leave it up until the later of those two (and probably somewhat beyond that, but that isn't guaranteed). I've edited the post.

Rohin Shah

Is that right?

Yes, that's broadly accurate, though one clarification:

This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly

That's a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.

I think this will

... (read more)
Rohin Shah

If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point

That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization when you don't use MONA, and also that, if you actually use MONA, it will measurably improve quality.

For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which ... (read more)

Fabien Roger

To rephrase what I think you are saying, these are the situations where work on MONA is very helpful:

* By default people get bitten by long-term optimization. They notice issues in prod because it's hard to catch everything. They patch individual failures when they come up, but don't notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don't try MONA-like techniques because it's not very salient when you are trying to fix individual failures and does not pass cost-benefit to fix individual failures.
* If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly, and because people don't by default think of the individual failures they've seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk - though given competitive pressures, that might not be the main factor driving decisions, so you don't have to make an ironclad case for MONA reducing catastrophic risk).

Is that right?

I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training-technique transparency that it is possible for people outside of labs to notice if issues plausibly stem from long-term optimization?
Rohin Shah

I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").

Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).

Rohin Shah

I have some credence in all three of those bullet points.

For MONA it's a relatively even mixture of the first and second points.

(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)

For debate it's mostly the first point, and to some extent the third point.

Fabien Roger

That's right. If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it. In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won't have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards.

I would give different advice if:

* the alignment tax of MONA were tiny
* there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA

I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don't disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
ryan_greenblatt

I also have the "that's obvious" reaction, but possibly I'm missing some details. I also think it won't perform well enough in practice to pencil, given other better places to allocate safety budget (if it does trade off, which is unclear).
Rohin Shah

Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings

I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generaliza... (read more)

Fabien Roger
My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don't show up in feedback quality evals. I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.
Rohin Shah

I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?

I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignm... (read more)

Fabien Roger

Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:

* on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won't use them by default?
* on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
* on the path to speeding up safety-critical domains?
* (valuable for some other reason?)
Fabien Roger

That is not my claim. By "I would have guessed that methods work on toy domains and math will be useless" I meant "I would have guessed that if a lab decided to do process-based feedback, it will be better off not doing a detailed literature review of methods in MONA and followups on toy domains, and just do the process-based supervision that makes sense in the real domain they now look at. The only part of the method section of MONA papers that matters might be 'we did process-based supervision'." I did not say "methods that work on toy domains will be useless" (my sentence was easy to misread).

I almost have the opposite stance; I am closer to "it's so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. So process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don't increase the probability that people working on capabilities implement myopic objectives."

But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the "capability researcher does what looks easiest and sensible to them" baseline)?

Not a crux. My guess is that if debate did "work" to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low confidence guess is that it's not high priority for people working on x-risk safety to speed up that work.

But I am excited about debate work that is not just about improving feedback quality. For example I am interested in debate vs schemers or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn't show up in "feedback quality
Rohin Shah

I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.

Rohin Shah

I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.

I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimizat... (read more)

Oliver Daniels

I have a similar confusion (see my comment here), but it seems like at least Ryan wants control evaluations to cover this case? (Perhaps on the assumption that if your "control measures" are successful, they should be able to elicit aligned behavior from scheming models, and this behavior can be reinforced?)
Fabien Roger

I think additional non-moonshot work in these domains will have a very hard time helping. [low confidence]

My high level concern is that non-moonshot work in these clusters may be the sort of thing labs will use anyway (with or without a safety push) if it helps with capabilities, because the techniques are easy to find, and won't use if it doesn't help with capabilities, because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters. [edit: I was at least somewhat wrong, see comment threads below]

Here are things that would change my mind:

* If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won't bother using them by default, but would bother using if there was existing work on how to make them work.
  * (For the different question of preventing AIs from using "harmful knowledge", I think work on robust unlearning and gradient routing may have this property - the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
* If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
  * (In the different domain of CoT-faithfulness I think there is a lot of value in demonstrating the risk of opaque CoT well enough that
Buck
I don't know what this means, do you have any examples?
Buck

I think we should just all give up on the word "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".

Rohin Shah

I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.

This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis... (read more)

Rohin Shah

In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.

So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology

This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit ... (read more)

Martin Randall

Based on my understanding of the article:

1. The sound over-approximation of human psychology is that humans are psychologically safe from information attacks of less than N bits. "Talk Control" is real, "Charm Person" is not.
2. Under "Steganography, and other funny business" there is a sketched safety specification that each use of the AI will communicate at most one bit of information.
3. Not stated explicitly: humans will be restricted to using the AI system no more than N times.

Comments and concerns:

1. Human psychology is also impacted by the physical environment, eg drugs, diseases, being paperclipped. The safety of the physical environment must be covered by separate verifications.
2. There could be a side-channel for information if an AI answers some questions faster than others, uses more energy for some questions than others, etc.
3. Machine interpretability techniques must be deployed in a side-channel resistant way. We can't have the AI thinking about pegasi and unicorns in a morse code pattern and an intern reads it and ten years later everyone is a pony.
4. There probably need to be multiple values of N for different time-frames. 1,000 adversarial bits in a minute is more psychologically dangerous than the same number of bits over a year.
5. Today, we don't know good values for N, but we can spend the first few bits getting higher safe values of N. We can also use the Yudkowskian technique of using volunteers that are killed or put into cryonic storage after being exposed to the bits.
6. If we could prove that AIs cannot acausally coordinate we could increase the bound to N bits per AI, or AI instance. Again, a good use for initial bits.
7. None of this stops us going extinct.
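To make points 2-3 of that sketched specification concrete, here is a minimal illustrative sketch (not from the article; the oracle callable and budget value are hypothetical placeholders) of a gate that releases at most one bit per query and at most N bits in total. It deliberately ignores the side-channels noted in the concerns above.

```python
# Minimal sketch of a "one bit per use, at most N bits total" gate.
# The oracle and budget are hypothetical stand-ins, not a real system.

class BitBudgetGate:
    def __init__(self, oracle, budget_bits: int):
        self.oracle = oracle          # untrusted AI, queried internally
        self.budget_bits = budget_bits
        self.bits_used = 0

    def ask(self, question: str) -> bool:
        """Release exactly one bit of oracle output, if budget remains."""
        if self.bits_used >= self.budget_bits:
            raise RuntimeError("information budget exhausted")
        answer = bool(self.oracle(question))  # collapse the output to a single bit
        self.bits_used += 1
        return answer

# Example usage with a trivial stand-in oracle.
gate = BitBudgetGate(oracle=lambda q: len(q) % 2 == 0, budget_bits=3)
print(gate.ask("Is the Riemann Hypothesis true?"))
```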
Rohin Shah

I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.

By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i... (read more)

Buck

Yep, I think that at least some of the 10 would have to have some serious hustle and political savvy that is atypical (but not totally absent) among AI safety people.

What laws are you imagining making it harder to deploy stuff? Notably I'm imagining these people mostly doing stuff with internal deployments.

I think you're overfixating on the experience of Google, which has more complicated production systems than most.
Rohin Shah

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post ... (read more)

Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.

But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.

In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI develo... (read more)
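As a rough illustration of that mixture (not the paper's actual implementation), a per-step MONA-style reward might combine the overseer's approval of the step with cheap non-AI checks; `overseer_approval`, `objective_checks`, and the equal mixing weights below are hypothetical stand-ins.

```python
# Sketch of a per-step reward mixing AI approval with non-AI checks.
# All names and weights are illustrative assumptions.

def step_reward(state, action,
                overseer_approval,   # callable: (state, action) -> float in [0, 1]
                objective_checks):   # callable: (state, action) -> float in [0, 1]
    """Reward for this step only; no credit flows back from later outcomes."""
    approval = overseer_approval(state, action)  # foresight-based judgment of the step
    checks = objective_checks(state, action)     # e.g. code runs and passes the tests
    return 0.5 * approval + 0.5 * checks         # mixing weights are an arbitrary choice
```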

Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.

(We've seen this comment and are looking into options)

Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.

For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!

Thanks, and interesting generalization!

My thoughts depend on whether you train the weaker model.

  1. If you are training the weaker model to solve the task, then the weaker model learns to simply always accept the advice of the stronger model, and stops being a constraint on what the stronger model can do.
  2. If you aren't training the weaker model and are just training the stronger model based on whether it convinces the weak model, then you are probably not getting the benefits of RL (specifically: getting the model to generate actions that are evaluated to be go
... (read more)
Knight Lee

Wait, can we try to imagine how point 1 happens? If the weaker model always follows the advice of the stronger model, the RL reward on the stronger model will become useless, since every output will get maximum reward.

Remember the stronger model doesn't care at all about the weaker model's success, so if the weaker model rewards it for incomprehensible ideas, the strong model won't give it incomprehensible ideas which actually work, but incomprehensible ideas full of impressive-sounding nonsense. The weaker model can only reward the stronger model for ideas it can understand, i.e. predict will get a reward.

The crux is that the final reward signal only leaks into the stronger model through the weaker model's prediction of that reward signal, mixed with the weaker model's "values." This means that many safety properties of the weaker model can carry over into the stronger model (to some extent).

It should have the same alignment tax as MONA, since both my version and MONA have some kind of weaker model set the stronger model's reward signal. MONA's agent directly uses the stronger model's ideas; my version lets the weaker model change the stronger model's ideas (e.g. refuse unethical decisions, ask followup questions), but the weaker model doesn't have to change them: it's allowed to do the same thing as what MONA does. The alignment tax is relatively less due to RL on the weaker model, though care must be taken to keep the weaker model trusted.

Do you agree?
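A minimal sketch of the reward flow as described in this proposal (all names below are hypothetical placeholders, and whether the weaker model should be trained at all is exactly the point under dispute above): the stronger model is reinforced only on the weaker model's predicted reward for its proposal, while the weaker model is reinforced on the real outcome.

```python
# Illustrative sketch of the two-level reward flow; the model objects and their
# methods (propose, predict_reward, own_plan, reinforce) are hypothetical.

def training_step(task, strong_model, weak_model, real_outcome_reward):
    proposal = strong_model.propose(task)                  # strong model suggests a plan
    predicted = weak_model.predict_reward(task, proposal)  # weak model judges the plan
    accept = predicted > 0.5
    action = proposal if accept else weak_model.own_plan(task)

    outcome = real_outcome_reward(task, action)            # ground-truth outcome signal

    # The only reward channel into the stronger model is the weaker model's prediction.
    strong_model.reinforce(task, proposal, reward=predicted)
    # The weaker model is trained on the real outcome, so it learns when to trust advice.
    weak_model.reinforce(task, (proposal, accept), reward=outcome)
```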
Rohin Shah

Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section ... (read more)
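To illustrate the axis being distinguished here, a toy sketch (not the paper's code): the same per-step feedback can either be used myopically, or turned into discounted returns that propagate credit backwards in time.

```python
# Toy contrast between regular RL credit assignment and MONA-style myopic credit.

def returns(per_step_rewards, gamma):
    """Regular RL: each step is credited with the discounted sum of its own and
    all later rewards (credit propagates backwards in time)."""
    g, out = 0.0, []
    for r in reversed(per_step_rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

def myopic_returns(per_step_rewards):
    """MONA-style: each step is credited only with its own (approval-based)
    reward; nothing propagates backwards."""
    return list(per_step_rewards)

rewards = [0.2, 0.0, 1.0]
print(returns(rewards, gamma=0.99))   # [~1.18, ~0.99, 1.0]
print(myopic_returns(rewards))        # [0.2, 0.0, 1.0]
```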

Rohin Shah

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?

It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.

But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when book... (read more)

Yonatan Cale

For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?

The interesting application of MONA seems to be when the myopic RL agent is able to produce better suggestions than the overseer.

Edit: I elaborated

1. ^ Plus maybe let the overseer observe the result and say "oops" and roll back that action, if we can implement a rollback in this context
Daniel Kokotajlo

Thanks, this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?"

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.
Rohin Shah

I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:

What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]

The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]

Layer 1

... (read more)
Marius Hobbhahn

That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" and more "it clearly exceeds the risk threshold I'm willing to take / I think humanity should clearly not take", which is significantly lower than a 100% chance of catastrophe.

OpenAI have already spent on the order of a million dollars just to score well on some benchmarks

Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.

Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".

Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a prob... (read more)

Thane Ruthenis

1. Yup, I read through it after writing the previous response and now see that you don't need to be convinced of that point. Sorry about dragging you into this.
2. I could nitpick the details here, but I think the discussion has kind of wandered away from any pivotal points of disagreement, plus John didn't want object-level arguments under this post. So I petition to leave it at that.

There's a log-scaling curve, OpenAI have already spent on the order of a million dollars just to score well on some benchmarks, and people are talking about "how much would you be willing to pay for the proof of the Riemann Hypothesis?". It seems like a straightforward conclusion that if o-series/inference-time scaling works as well as ML researchers seem to hope, there'd be billion-dollar inference runs funded by some major institutions.

I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work

I stand by my statement and don't think anything about the o-series model invalidates it.

And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)

There's a world of difference between ... (read more)

Thane Ruthenis

To lay out my arguments properly:

1. "Search is ruinously computationally inefficient" does not work as a counter-argument against the retargetability of search, because the inefficiency argument applies to babble-and-prune search, not to the top-down heuristical-constraint-based search that was/is being discussed. There are valid arguments against easily-retargetable heuristics-based search as well (I do expect many learned ML algorithms to be much messier than that). But this isn't one of them.
2. ML researchers are currently incredibly excited about the inference-time scaling laws, talking about inference runs costing millions/billions of dollars, and how much capability will be unlocked this way. The o-series paradigm would use this compute to, essentially, perform babble-and-prune search. The pruning would seem to be done by some easily-swappable evaluator (either the system's own judgement based on the target specified in a prompt, or an external theorem-prover, etc.). If things will indeed go this way, then it would seem that a massive amount of capabilities will be based on highly inefficient babble-and-prune search, and that this search would be easily retargetable by intervening on one compact element of the system (the prompt, or the evaluator function).
Rohin Shah

Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not

... (read more)
Rohin Shah

I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:

  1. You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
  2. You do train on synthetically generated dangerous actio
... (read more)
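A rough sketch of option 1 above (illustrative only, with hypothetical names): the debate-trained model's weights stay fixed, and the synthetically generated dangerous actions are used only to validate and hill-climb the monitoring prompt, never as training targets.

```python
# Sketch of prompt hill-climbing for an untrusted monitor, using synthetic
# dangerous actions only for validation. `monitor_score` is a hypothetical
# stand-in for prompting the fixed model and reading off a suspicion score.

def pick_monitor_prompt(candidate_prompts, synthetic_dangerous, benign,
                        monitor_score):  # callable: (prompt, action) -> float
    def detection_gap(prompt):
        danger = sum(monitor_score(prompt, a) for a in synthetic_dangerous) / len(synthetic_dangerous)
        safe = sum(monitor_score(prompt, a) for a in benign) / len(benign)
        return danger - safe  # want dangerous actions scored much higher than benign ones

    # No gradient updates on the synthetic data; we only select among prompts.
    return max(candidate_prompts, key=detection_gap)
```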
Rohin Shah

You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.

Buck
I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model. You might want to use something debate-like in the synthetic input generation process, but that's structurally different.

Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.

(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)

Rohin Shah

I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).

I'm not in full agreement on your comments on the theories of change:

  1. I'm pretty uncertain about the effects of bad reward signals on propensity for scheming / non-myopic reward hacking, and in particular I think the effects could be large.
  2. I'm less worried about purely optimizing against a flawed reward signal though not unworried. I agree it doesn't strike us by surprise, but I also don't expect scheming to strike us by surp
... (read more)
Buck
I don’t understand your last sentence, can you rephrase?

So the argument here is either that China is more responsive to "social proof" of the importance of AI (rather than observations of AI capabilities), or that China wants to compete with USG for competition's sake (e.g. showing they are as good as or better than USG)? I agree this is plausible.

It's a bit weird to me to call this an "incentive", since both of these arguments don't seem to be making any sort of appeal to rational self-interest on China's part. Maybe change it to "motivation"? I think that would have been clearer to me.

(Btw, you seem to be ass... (read more)

rosehadshar
Changed to motivation, thanks for the suggestion. I agree that centralising to make AI safe would make a difference. It seems a lot less likely to me than centralising to beat China (there's already loads of beat China rhetoric, and it doesn't seem very likely to go away).

Tbc, I don't want to strongly claim that centralization implies shorter timelines. Besides the point you raise there's also things like bureaucracy and diseconomies of scale. I'm just trying to figure out what the authors of the post were saying.

That said, if I had to guess, I'd guess that centralization speeds up timelines.

Your infosecurity argument seems to involve fixing a point in time, and comparing a (more capable) centralized AI project against multiple (less capable) decentralized AI projects. However, almost all of the risks you're considering depend much more on the capability of the AI project rather than the point in time at which they occur. So I think best practice here would be to fix a rough capability profile, and compare a (shorter timelines) centralized AI project against multiple (longer timelines) decentralized AI projects.

In more detail:

It’s not clear wh

... (read more)
Tom Davidson

Fwiw, my own position is that for both infosec and racing it's the brute fact that USG sees fit to centralise all resources and develop AGI asap that would cause China to 1) try much harder to steal the weights than when private companies had developed the same capabilities themselves, and 2) try much harder to race to AGI themselves.
rosehadshar

Thanks, I expect you're right that there's some confusion in my thinking here. Haven't got to the bottom of it yet, but on more incentive to steal the weights:

- partly I'm reasoning in the way that you guess, more resources -> more capabilities -> more incentives
- I'm also thinking "stronger signal that the US is all in and thinks this is really important -> raises p(China should also be all in) from a Chinese perspective -> more likely China invests hard in stealing the weights"
- these aren't independent lines of reasoning, as the stronger signal is sent by spending more resources
- but I tentatively think that it's not the case that at a fixed capability level the incentives to steal the weights are the same. I think they'd be higher with a centralised project, as conditional on a centralised project there's more reason for China to believe a) AGI is the one thing that matters, b) the US is out to dominate
[anonymous]

While a centralized project would get more resources, it also has more ability to pause/investigate things. So e.g. if the centralized project researchers see something concerning (perhaps early warning signs of scheming), it seems more able to do a bunch of research on that Concerning Thing before advancing.

I think this makes the effect of centralization on timelines unclear, at least if one expects these kinds of "warning signs" on the pathway to very advanced capabilities.

(It's plausible that you could get something like this from a decentralized model with sufficient oversight, but this seems much harder, especially if we expect most of the technical talent to stay with the decentralized projects as opposed to joining the oversight body.)

Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy). 

Huh. If you think of that as capabilities I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?

E.g. it seems like you think RLHF counts as an alignment technique -- this seems like a central approach that you might use in BASALT.

If you

... (read more)
Yonatan Cale

TL;DR: point 3 is my main one.

1) [I'm not sure why you're asking, maybe I'm missing something, but I'll answer] For example, checking if human values are a "natural abstraction", or trying to express human values in a machine-readable format, or getting an AI to only think in human concepts, or getting an AI that is trained on a limited subset of things-that-imply-human-preferences to generalize well out of that distribution. I can make up more if that helps? Anyway, my point was just to say explicitly what parts I'm commenting on and why (in case I missed something).

2) It's a candidate alignment technique. RLHF is sometimes presented (by others) as an alignment technique that should give us hope about AIs simply understanding human values and applying them in out-of-distribution situations (such as with an ASI). I'm not optimistic about that myself, but rather than arguing against it, I suggest we could empirically check if RLHF generalizes to an out-of-distribution situation, such as minecraft maybe. I think observing the outcome here would affect my opinion (maybe it just would work?), and a main question of mine was whether it would affect other people's opinions too (whether they do or don't believe that RLHF is a good alignment technique).

3) I would finetune the AI on objective outcomes like "fill this chest with gold" or "kill that creature [the dragon]" or "get 100 villagers in this area". I'd pick these goals as ones that require the AI to be a capable minecraft player (filling a chest with gold is really hard) but don't require the AI to understand human values or ideally anything about humans at all. So I'd avoid finetuning it on things like "are other players having fun" or "build a house that would be functional for a typical person" or "is this waterfall pretty [subjectively, to a human]". Does this distinction seem clear? Useful? This would let us test how some specific alignment technique (such as "RLHF that doesn't contain mi
Yonatan Cale

Thanks! In the part you quoted - my main question would be "do you plan on giving the agent examples of good/bad norm following" (such as RLHFing it). If so - I think it would miss the point, because following those norms would become in-distribution, and so we wouldn't learn if our alignment generalizes out of distribution without something-like-RLHF for that distribution. That's the main thing I think is worth testing here. (Do you agree? I can elaborate on why I think so.)

If you hope to check if the agent will be aligned[1] with no minecraft-specific alignment training, then it sounds like we're on the same page!

Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy). My only comment there is that I'd try to not give the agent feedback about human values (like "is the waterfall pretty") but only about clearly defined objectives (like "did it kill the dragon"), in order to not accidentally make human values in minecraft be in-distribution for this agent. Wdyt?

(I hope I didn't misunderstand something important in the article, feel free to correct me of course)

1. ^ Whatever "aligned" means. "Other players have fun on this minecraft server" is one example.

You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.

I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in ... (read more)

Zach Stein-Perlman
(I agree that the intention is surely no more than 6 months; I'm mostly annoyed for legibility—things like this make it harder for me to say "Anthropic has clearly committed to X" for lab-comparison purposes—and LeCun-test reasons)
Rohin Shah

Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.

Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight... (read more)

Knight Lee

It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, making 165% still within the ballpark. I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here). Thank you for your alignment work :)
scasper
Thanks for the comment. My probabilities sum to 165%, which would translate to me saying that I would expect, on average, the next paper to do 1.65 things from the list, which to me doesn't seem too crazy. I think that this DOES match my expectations.  But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much. 
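For the arithmetic being referenced: with an indicator variable for whether the next paper does each item, linearity of expectation turns the summed probabilities directly into an expected count (a standard identity, shown here only for clarity).

```latex
\mathbb{E}\!\left[\sum_i X_i\right] \;=\; \sum_i \Pr[X_i] \;=\; 1.65,
\qquad X_i = \mathbb{1}[\text{the next paper does item } i].
```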
Rohin Shah

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:

  • Clearly alignment: debate theory, certain flavors of process supervision
  • Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
  • Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain
... (read more)
ryan_greenblatt

Agreed. (And I made a comment to this effect on Mark's draft of this short form.)

I don't think it is that productive to classify research directions; the central thing we're trying to point at is a classification for reasons why you might think that you're fine, where one reason is "the AI couldn't do bad stuff without getting caught" and another reason is "the AI wouldn't do bad stuff" (obviously simplifying somewhat).

I also think that white-box control should typically be considered pretty separately from black-box control and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective, e.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation via exogenously trying to determine the probe reliability and other important aspects of the probe. For instance, the probe can take advantage of black box strategies, and how safe we are depends on how many actions AIs need to cause huge problems.)

The claims I'd like to make are:

* There should be more research specifically targeting a black-box control story and which is operating from a black-box control perspective. (I'd guess like 20% of the overall AI safety portfolio should be this and maybe 30% of safety work at labs. This is not including work like "get AIs to be more useful at alignment research and get better at DC eval methodology" which should also be subsidized from a control perspective but seems non-central.)
* When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable oversight style research), they should consider sometimes explicitly thinking about the control perspective and consider adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment style story they think about while doing their research, they should be rou
ryan_greenblatt

You're potentially thinking about footnote 4 in our post on control:

This footnote is a bit confusingly worded, but I think what we were trying to say is "Suppose you expect dangerously powerful AI in the next year, and the current year is prior to 2032. We're guessing you should put the better team on control." This is different than research in the run-up.

---

I also think that to the extent people are trying to do backchained research focused on specific applications, it makes sense to put the better team on control over reducing the chance that scheming arises. (But these aren't the only classes of interventions and some interventions don't nicely fit into these buckets, e.g., you can do work on differentially making AIs more useful for alignment work, which isn't well classified as either, and you can work on high-level interpretability which aims to roughly understand how AIs make decisions in some cases (this high-level interp doesn't clearly help with reducing the chance that scheming arises very directly, but could help with a bunch of stuff).)

I think this is referring to ∇_θ L(x_train) = 0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from

Ah, yup, that's the issue, and I agree you're correct that ∇_x L(x_train) is the relevant thing here. I'll edit the post to say I'm no longer sure about the claim. (I don't have the time to understand how this lines up with the actual paper -- I remember it being kind of sparse and not trivial to follow -- perhaps you could look into it and leave a comment here.)

Knight Lee

I think this little mistake doesn't affect the gist of your summary post, I wouldn't worry about it too much.

The mistake

The mistaken argument was an attempt to explain why α ≥ 2/d. Believe it or not, the paper never actually argued α ≥ 2/d, it argued α ≥ 1/d. That's because the paper used the L2 loss, and α ≥ 2/d (L2) ⟺ α ≥ 1/d (L1). I feel that was the main misunderstanding.

Figure 2b in the paper shows α = 4/d (L2) for most models but α = 2/d (L2) for some models.

The paper never mentions L2 loss, just that the loss function is "analytic in f−F and minimized at f=F." Such a loss function converges to L2 when the loss is small enough. This important sentence is hard to read because it's cut in half by a bunch of graphs, and looks like another unimportant mathematical assumption.

Some people like L2 loss, or a loss function that converges to L2 when the loss is small, because most loss functions (even L1) behave like L2 anyway, once you subtract a big source of loss. E.g. variance-limited scaling has α = 1 in both L1 and L2 because you subtract L(∞, P) or L(D, ∞) from L(D, P). Even resolution-limited scaling may require subtracting loss due to noise L(∞, ∞).

L2 is nicer because if L(x_train) is zero, ∇_x L2(x_train) = 0, but ∇_x L1(x_train) is undefined since the absolute value is undifferentiable at 0.

If you read closely, α ≥ 4/d (L2) only refers to piecewise linear approximations:

The second paper says the same:

Nearest neighbors regression/classification has α = 1/d (L1). Skip this section if you already agree:

* If you use the nearest neighbors regression to fit the function f(x) = sin(x/π), it creates a piecewise constant function that looks like a staircase trying to approximate a curve. The L1 loss is proportional to D^(−1) because adding data makes the staircase smaller. A better attempt to connect the dots of the training data (e.g. a piecewise linear function) would have L1 loss proportional to D^(−2) or better because adding data makes the "staircase" both smaller and smo
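A quick numerical check of that last bullet in d = 1 (a sketch, not from either paper; the smooth target function is an arbitrary choice): nearest-neighbour regression's L1 error shrinks roughly like D^(−1), while simply connecting the training points with a piecewise-linear fit shrinks roughly like D^(−2).

```python
# Numerical sketch: error scaling of nearest-neighbour vs piecewise-linear fits in 1D.
import numpy as np

def l1_error(predict, f, x_test):
    return np.mean(np.abs(predict(x_test) - f(x_test)))

f = lambda x: np.sin(x)                  # arbitrary smooth target
x_test = np.linspace(0.0, 10.0, 2001)    # dense evaluation grid

for D in [10, 100, 1000]:
    x_train = np.linspace(0.0, 10.0, D)
    y_train = f(x_train)

    nn = lambda x: y_train[np.abs(x[:, None] - x_train[None, :]).argmin(axis=1)]
    linear = lambda x: np.interp(x, x_train, y_train)

    print(D, l1_error(nn, f, x_test), l1_error(linear, f, x_test))
# Nearest-neighbour error drops ~10x per 10x data; linear interpolation drops ~100x.
```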

Mathematically, the Taylor expansion is:

L(x) = L(x_train) + ∇_x L(x_train) · (x − x_train) + O(‖x − x_train‖²)

And then we have L(x_train) = 0 and also ∇_x L(x_train) = 0. (This does assume a "sufficiently nice" loss function, which is satisfied by most loss functions used in practice.)

I agree  is not zero. I also agree if you take some point in between  and  it can have non-zero loss, e.g.  need not be zero. I'm not sure if either of these are what you're t... (read more)
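For reference, a compact sketch of where this expansion is meant to lead, as reconstructed from the surrounding thread (not a quote): if both leading terms vanish, the loss near a training point is quadratic in the distance to it, and nearest-neighbour spacing on a d-dimensional data manifold scales like D^(−1/d).

```latex
% Reconstructed scaling step, under the assumptions stated above.
L(x) = O\!\left(\lVert x - x_{\text{train}}\rVert^{2}\right),
\qquad
\lVert x - x_{\text{train}}\rVert \sim D^{-1/d}
\;\;\Longrightarrow\;\;
L \sim D^{-2/d}.
```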

Osher Lerner

Yes, that's precisely what I'm claiming! Sorry if that wasn't clear.

As for how to establish that, I proposed an intuitive justification:

And an outline for a proof:

Let's walk through an example.

1. Consider trying to fit a simple function, f(x) = 0. Let's collect a training dataset D_train = [(x=0, y=0), (x=1, y=0), (x=2, y=0), …]
   * You optimize a perfect model on D_train (e.g. a neural net with mapping NN(x) = 0)
   * Now let's study the scaling of error as you move away from training points. In the example, we achieved ∇_x L(x_train) = 0, since coincidentally ∀x f(x) = NN(x)
2. Consider a second example. Let's fit f(x) = sin(x/π). Again, we collect training data D_train = [(x=0, y=0), (x=1, y=0), (x=2, y=0), …]
   * You optimize a perfect model on D_train (using the same optimization procedure, we get a neural net with mapping NN(x) = 0)
   * Now, we see ∇_x L(x_train) ≠ 0 (we predict a flat line at y=0, and L measures error from a sinusoid). You can notice this visually or analytically:

derivation: ∇_x L(x_train) = ∇_x ∥f(x) − NN(x)∥ |_{x=x_train} = ∇_x ∥f(x) − 0∥ |_{x=x_train} = ∇_x ∥sin(x/π)∥ |_{x∈Z} ≠ 0

The model is trained on (x_train, f(x_train)), and is independent of ∇_x f(x_train). That means even if by happy accident our optimization procedure achieves ∇_x L(x_train) = 0, we can prove that it is not generally true by considering an identical training dataset with a different underlying function (and knowing our optimization must result in the same model).

On rereading your original argument: I think this is referring to ∇_θ L(x_train) = 0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from, since "gradient of loss" is being overloaded from discussion of optimization ∇_θ L, and discussion of Taylor-expanding L around x_train (which uses ∇_x L).