The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don't want both teams to review all applications separately.)
You can still express interest in both teams (e.g. in the "Any other info" question), and the reviewer will take that into account and consider whether to move your application to the other team. However, Gemini Safety reviewers aren't going to be as good at evaluating ASAT candidates (and vice versa), so you should choose the team that you think is a better fit for you.
There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don't know which of the two teams would be a better fit, you can submit a separate application for each.
Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn't be taken as reflective of some big strategy. I'm guessing we'll go back to hiring a mix of the two around mid-2025.
Still pretty optimistic by the standards of the AGI safety field; somewhat shorter timelines than I reported in that post.
Neither of these really affects the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I'd do basically the same things I'm doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don't know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research, but idk what the ratio is, and it's not something I can easily find out.
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
We'll leave it up until the later of those two (and probably somewhat beyond that, but that isn't guaranteed). I've edited the post.
Is that right?
Yes, that's broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That's a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
...I think this will
If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn't seem right. It can simultaneously be the case that you can't tell there are problems stemming from long-term optimization when you don't use MONA, and also that if you actually use MONA, it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which ...
I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").
Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
I have some credence in all three of those bullet points.
For MONA it's a relatively even mixture of the first and second points.
(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
For debate it's mostly the first point, and to some extent the third point.
Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generaliza...
I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignm...
I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.
I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimizat...
I think we should just all give up on the term "scalable oversight"; it is used in many conflicting ways, sadly. I mostly talk about "recursive techniques for reward generation".
I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be over indexing from experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis...
In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit ...
I also agree with Zac: maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably wouldn't accomplish much.
By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i...
Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.
We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post ...
Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.
But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI develo...
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.
(We've seen this comment and are looking into options)
Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.
For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!
Thanks, and interesting generalization!
My thoughts depend on whether you train the weaker model.
Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?"
And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)
EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section ...
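To make that distinction concrete, here's a minimal sketch (my own illustration, not code from the paper) of the only difference in how per-step training targets would be computed, assuming a generic episodic RL setup:

```python
# Toy illustration (assumed setup, not from the MONA paper): the same per-step
# feedback is used in both cases; the only difference is whether rewards from
# later steps are propagated backwards when forming each step's training target.

def step_targets(rewards, gamma=1.0, myopic=False):
    """rewards[t] is the (process-based) feedback for step t of one episode."""
    if myopic:
        # MONA-style: each step is reinforced only by its own immediate feedback.
        return list(rewards)
    # Regular RL: each step's target is the discounted sum of future rewards.
    targets = []
    future = 0.0
    for r in reversed(rewards):
        future = r + gamma * future
        targets.append(future)
    return targets[::-1]

rewards = [0.2, 0.0, 1.0]                    # identical feedback in both cases
print(step_targets(rewards, myopic=True))    # [0.2, 0.0, 1.0]
print(step_targets(rewards, gamma=1.0))      # [1.2, 1.0, 1.0]
```

Either variant can use the exact same process-based feedback; MONA just keeps the credit assignment myopic.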
Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?
It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.
But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.
Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when book...
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
...What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but rather "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take", which is a bar significantly lower than a 100% chance of catastrophe.
OpenAI have already spent on the order of a million dollars just to score well on some benchmarks
Note this is many different inference runs, each of which cost thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.
Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".
Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a prob...
I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work
I stand by my statement and don't think anything about the o-series model invalidates it.
And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)
There's a world of difference between ...
...Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
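A toy sketch of that alternative protocol (my own stand-in setup with a simulated judge, not anything from a real implementation), just to make the intended incentive explicit:

```python
# Toy sketch: the first debater picks which of two candidate outputs to defend,
# the second debater must defend the other, and the first debater is rewarded
# iff the judge sides with its chosen output. If the judge is decent after
# hearing both sides, the first debater is incentivized to pick the better output.

from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    true_quality: float  # hidden ground truth, only used to simulate the judge

def judge(chosen: Candidate, other: Candidate) -> Candidate:
    # Stand-in judge: assume the debate surfaces the relevant information, so it
    # prefers the genuinely better output (in reality this is noisy/imperfect).
    return chosen if chosen.true_quality >= other.true_quality else other

def first_debater_reward(choice: Candidate, a: Candidate, b: Candidate) -> float:
    other = b if choice is a else a
    return 1.0 if judge(choice, other) is choice else 0.0

a = Candidate("output A", true_quality=0.3)
b = Candidate("output B", true_quality=0.8)

print(first_debater_reward(b, a, b))  # 1.0: choosing the better output pays off
print(first_debater_reward(a, a, b))  # 0.0: choosing the worse output does not
```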
I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I'm not in full agreement with your comments on the theories of change:
(Replied to Tom above)
So the argument here is either that China is more responsive to "social proof" of the importance of AI (rather than observations of AI capabilities), or that China wants to compete with USG for competition's sake (e.g. showing they are as good as or better than USG)? I agree this is plausible.
It's a bit weird to me to call this an "incentive", since both of these arguments don't seem to be making any sort of appeal to rational self-interest on China's part. Maybe change it to "motivation"? I think that would have been clearer to me.
(Btw, you seem to be ass...
Tbc, I don't want to strongly claim that centralization implies shorter timelines. Besides the point you raise there's also things like bureaucracy and diseconomies of scale. I'm just trying to figure out what the authors of the post were saying.
That said, if I had to guess, I'd guess that centralization speeds up timelines.
Your infosecurity argument seems to involve fixing a point in time, and comparing a (more capable) centralized AI project against multiple (less capable) decentralized AI projects. However, almost all of the risks you're considering depend much more on the capability of the AI project rather than the point in time at which they occur. So I think best practice here would be to fix a rough capability profile, and compare a (shorter timelines) centralized AI project against multiple (longer timelines) decentralized AI projects.
In more detail:
...It’s not clear wh
Regarding the rest of the article - it seems to be mainly about making an agent that is capable at minecraft, which seems like a required first step that I ignored meanwhile (not because it's easy).
Huh. If you think of that as capabilities I don't know what would count as alignment. What's an example of alignment work that aims to build an aligned system (as opposed to e.g. checking whether a system is aligned)?
E.g. it seems like you think RLHF counts as an alignment technique -- this seems like a central approach that you might use in BASALT.
...If you
You note that the RSP says we will do a comprehensive assessment at least every 6 months—and then you say it would be better to do a comprehensive assessment at least every 6 months.
I thought the whole point of this update was to specify when you start your comprehensive evals, rather than when you complete your comprehensive evals. The old RSP implied that evals must complete at most 3 months after the last evals were completed, which is awkward if you don't know how long comprehensive evals will take, and is presumably what led to the 3 day violation in ...
Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.
Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight...
This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:
I think this is referring to ∇θL(xtrain) = 0, which is certainly true for a perfectly optimized model (or even just settled gradient descent). Maybe that's where the miscommunication is stemming from.
Ah, yup, that's the issue, and I agree you're correct that is the relevant thing here. I'll edit the post to say I'm no longer sure about the claim. (I don't have the time to understand how this lines up with the actual paper -- I remember it being kind of sparse and not trivial to follow -- perhaps you could look into it and leave a comment here.)
Mathematically, the Taylor expansion is:

L(θ + Δθ, xtrain) = L(θ, xtrain) + Δθᵀ∇θL(θ, xtrain) + O(‖Δθ‖²)

And then we have ∇θL(θ, xtrain) = 0, and also that the O(‖Δθ‖²) term is negligible for small Δθ. (This does assume a "sufficiently nice" loss function, which is satisfied by most loss functions used in practice.)
I agree is not zero. I also agree if you take some point in between and it can have non-zero loss, e.g. need not be zero. I'm not sure if either of these are what you're t...
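As a tiny numerical illustration of the point above (my own toy example, not from this thread): at an exact optimum the gradient is zero, so a small parameter perturbation changes the loss only at second order.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])            # exactly realizable targets

# Exactly-optimized linear regression: the squared-loss gradient is zero here.
theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

grad = X.T @ (X @ theta_star - y) / len(y)
print(np.allclose(grad, 0))                   # True: gradient vanishes at the optimum

for eps in [1e-1, 1e-2, 1e-3]:
    delta = eps * rng.normal(size=3)
    # The loss increase shrinks roughly quadratically in eps, i.e. only the
    # O(‖Δθ‖²) term remains once the first-order term is zero.
    print(eps, loss(theta_star + delta) - loss(theta_star))
```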
We've got a lot of interest, so it's taking some time to go through applications. If you haven't heard back by the end of March, please ping me; hopefully it will be sooner than that.