Technologies that allow workers to be more isolated from each other give you both convenience (because your coworkers no longer accidentally mess up what you’re doing) and security (because you can remove your coworkers’ permission to affect the code you’re running), but generally reduce efficiency. So when we try to buy efficiency at the cost of convenience, we might lose security too.
Hmm, this feels less likely to me. Isolation can often be an efficiency benefit, because one employee's mistake doesn't propagate to screw up everyone else's work, ...
I think the evidence is roughly at "this should be a weakly held prior easily overturned by personal experience": https://www.lesswrong.com/posts/c8EeJtqnsKyXdLtc5/how-long-can-people-usefully-work
That said, I do think there's enough evidence that I would bet (not at extreme odds) that it is bad for productivity to have organizational cultures that emphasize working very long hours (say > 60 hours / week), unless you are putting in special care to hire people compatible with that culture. Partly this is because I expect organizations to often be unable to overcome weak priors even when faced with blatant evidence.
Andrew Gelman: "Bring on the Stupid: When does it make sense to judge a person, a group, or an organization by its worst?" (Not quite as clearcut, since it doesn't name the person in the title, but still)
(If this also doesn't count as "intellectual writing circles", consider renaming your category, since I clearly do not understand what you mean, except inasmuch as it is "rationalist or rationalist-adjacent circles".)
Hmm, interesting. I was surprised by the claim so I did look back through ACX and posts from the LW review, and it does seem to back up your claim (the closest I saw was "Sorry, I Still Think MR Is Wrong About USAID", note I didn't look very hard). EDIT: Actually I agree with sunwillrise that "Moldbug sold out" meets the bar (and in general my felt sense is that ACX does do this).
I'd dispute the characterization of this norm as operating "within intellectual online writing circles". I think it's a rationalist norm if anything. For example I went to Slow Bo...
I am both surprised and glad my comment led to an update :)
FWIW I never expect the political blogs to be playing by the good rules of the rest of the intellectual writing circles, I view them more as soldiers. Not central examples of soldiers, but enough so that I'd repeatedly be disappointed by them if I expected them to hold themselves to the same standards.
(As an example, in my mind I confidently-but-vaguely recall some Matt Yglesias tweets where he endorsed dishonesty for his side of the political divide on some meta-level, in order to win political conflicts; interested if anyone else recalls this / has a link.)
While I disagree with Nate on a wide variety of topics (including implicit claims in this post), I do want to explicitly highlight strong agreement with this:
...I have a whole spiel about how your conversation-partner will react very differently if you share your concerns while feeling ashamed about them versus if you share your concerns as if they’re obvious and sensible, because humans are very good at picking up on your social cues. If you act as if it’s shameful to believe AI will kill us all, people are more prone to treat you that way. If you act a
It could instead be "there is a real threat model for existential risk, and it is important that society does more to address it than it is currently doing". If you're going to share concerns at all, figure out the position you do have courage in, and then discuss that as if it is obvious and sensible, not as if you are ashamed of it.
The potential failure mode I see with this is that, if you're not paying sufficient attention to your rhetoric, you run the risk of activating people's anti-Pascal's mugging instincts.[1] As Jeremy Gillen said:
...Everyone is
I don’t see any inconsistency in being unhappy with what titotal is doing and happy about what AI 2027 is doing.
I agree with this. I was responding pretty specifically to Zvi's critique in particular, which is focusing on things like the use of the word "bad" and the notion that there could be a goal to lower the status and prestige of AI 2027. If instead the critique was about e.g. norms of intellectual discourse I'd be on board.
That said, your defense doesn't seem all that strong to me? I'm happy to take your word for it that there was lots of ...
But it isn't trend extrapolation?
If the current doubling time is T, and each subsequent doubling takes 10% less time, then the doubling times form a geometric series summing to T / (1 − 0.9) = 10T, so you get infinitely many doublings (i.e. singularity) by time 10T. So with T = 4.5 months you get singularity within 45 months. This is completely insensitive to the initial conditions or to the trend in changes-in-doubling-time (unless the number "10%" was itself chosen based on trend extrapolation, but that doesn't seem to be the case).
(In practice the superexponential model predicts singularity even sooner than 45 months, because of the additional effect from automated AI R&D.)
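As a sanity check on the arithmetic, here's a minimal sketch of the geometric-series calculation (the T = 4.5 months figure and the 10%-per-doubling assumption are taken from the comment above; everything else is just restating the sum):

```python
# Each doubling takes 10% less time than the previous one, so the total
# time until infinitely many doublings is a geometric series:
#   T + 0.9*T + 0.9**2 * T + ... = T / (1 - 0.9) = 10 * T
T = 4.5  # current doubling time in months (the figure used above)

total = 10 * T  # closed form of the geometric series
print(total)    # 45.0 months until the singularity in this model

# The partial sums approach the limit quickly: after 50 doublings,
# nearly all of the 45 months has elapsed.
partial = sum(T * 0.9**k for k in range(50))
print(partial)
```

Note that the answer scales linearly in T, so uncertainty about the current doubling time translates directly into uncertainty about the date, while the "10%" figure enters much more sensitively through the 1/(1 − r) factor.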
I don't see how this is responding to anything I've said? What in my comment are you disagreeing with or adding color to?
Again, my position is not "AI 2027 did something bad". My position is "stop critiquing people for having goals around status and prestige rather than epistemics, or at least do so consistently".
(Incidentally, I suspect bio anchors did better on the axis of getting good reviews / feedback, but that isn't particularly central to anything I'm claiming.)
Things I agree with:
I disagree that titotal's critique is far away from AI 2027 on the relevant spectrum. For example, titotal's critique was posted on the EA Forum / LessWrong, and focused on technical disagreements, rather than going through a huge amplification / social m...
I mean, yes, if the goal of the post was to lower the status and prestige of AI 2027 and to do so through people reading the title and updating in that way, rather than to offer a helpful critique, then it is true that the title was the best local way to achieve that objective, epistemic commons be damned. I would hope for a different goal?
Come on, this is such an isolated demand for rigor. AI 2027 clearly had the goal of raising the status and prestige of belief in AI risk and short timelines. They employed tons of symmetric weapons in the pursuit of this...
I don't think AI 2027 did anything even close to as crude as calling the thing you are arguing against just "bad" in your title.
Indeed, I think overall AI 2027 is really doing remarkably well at being asymmetric in a huge number of its choices (I am of course biased, having been involved in many of those choices, but I currently would say that AI 2027 is as close to the very top of the intersection of "accessible" and "trying to make itself succeed and be compelling only if its claims are indeed true" as any piece of media out there).
(I do...
Great work!
We've also been seeing a similar vulnerability in our debate work, where when we use a weaker LLM as the judge, it is very hard to get the judge to take seriously the possibility that the debaters might be lying to it (using only prompting). As a result it often just believes the dishonest debater despite clear evidence to the contrary and gets the judgment wrong.
This problem mostly goes away if we switch to a more capable LLM as the judge model. My sense is that there's some baseline level of capability that is required to be a good monitor (si...
I think you are being led astray by having a one-dimensional notion of intelligence.
What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators".
Well yes, that is the idea, there is information asymmetry between the AI and humans. Note that this can still apply even when the AI is much smarter than the humans.
...CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]
It accepts corrections as long as it has less c
Not a full response, but some notes:
I definitely was not thinking about the quoted definition of corrigibility, which I agree is not capturing what at least Eliezer, Nate and Paul are saying about corrigibility (unless there is more to it than the quoted paragraph). I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.
I do wish I hadn't used the phrases "object-level" and "meta-level" and just spent 4 paragraphs unpacking what I meant by that because in hindsight that was con...
I feel like the natural idea here is that interp generates understanding and then you use the understanding to generate behavioral evidence. Idk if this is what Dario has in mind but it at least seems plausible.
Hmm, maybe we do disagree. I personally like circuit style interp analysis as a way to get evidence of scheming. But this is because I expect that after you do the circuit analysis you will then be able to use the generated insight to create behavioral evidence, assuming the circuit analysis worked at all. (Similarly to e.g. the whale + baseball = shark adversarial example.)
...Maybe this doesn't come up as much in your conversation with people, but I've seen internals based testing methods which don't clearly ground out in behavioral evidence discussed often.
...You might expect the labor force of NormalCorp to be roughly in equilibrium where they gain equally from spending more on compute as they gain from spending on salaries (to get more/better employees).
[...]
However, I'm quite skeptical of this type of consideration making a big difference, because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs at the time of AlexNet 13 years ago (invalidating the view that there is some relatively narrow range of inputs in which neither
In some sense I agree with this post, but I'm not sure who the intended audience is, or what changes anyone should make. What existing work seems like it will generate "evidence which is just from fancy internals-based methods (and can't be supported by human inspection of AI behavior)", and that is the primary story for why it is impactful? I don't think this is true of probing, SAEs, circuit analysis, debate, ...
(Meta: Going off of past experience I don't really expect to make much progress with more comments, so there's a decent chance I will bow out after this comment.)
I would expect bootstrapping will at most align a model as thoroughly as its predecessor was aligned (but probably less)
Why? Seems like it could go either way to me. To name one consideration in the opposite direction (without claiming this is the only consideration), the more powerful model can do a better job at finding the inputs on which the model would be misaligned, enabling you to train its...
like being able to give the judge or debate partner the goal of actually trying to get to the truth
The idea is to set up a game in which the winning move is to be honest. There are theorems about the games that say something pretty close to this (though often they say "honesty is always a winning move" rather than "honesty is the only winning move"). These certainly depend on modeling assumptions but the assumptions are more like "assume the models are sufficiently capable" not "assume we can give them a goal". When applying this in practice there is also ...
Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition."
Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).
We've got a lot of interest, so it's taking some time to go through applications. If you haven't heard back by the end of March, please ping me; hopefully it will be sooner than that.
The answer to that question will determine which team will do the first review of your application. (We get enough applications that the first review costs quite a bit of time, so we don't want both teams to review all applications separately.)
You can still express interest in both teams (e.g. in the "Any other info" question), and the reviewer will take that into account and consider whether to move your application to the other team, but Gemini Safety reviewers aren't going to be as good at evaluating ASAT candidates, and vice versa, so you should choose the team that you think is a better fit for you.
There are different interview processes. ASAT is more research-driven while Gemini Safety is more focused on execution and implementation. If you really don't know which of the two teams would be a better fit, you can submit a separate application for each.
Our hiring this round is a small fraction of our overall team size, so this is really just correcting a minor imbalance, and shouldn't be taken as reflective of some big strategy. I'm guessing we'll go back to hiring a mix of the two around mid-2025.
Still pretty optimistic by the standards of the AGI safety field, with somewhat shorter timelines than I reported in that post.
Neither of these really affect the work we do very much. I suppose if I were extremely pessimistic I would be doing something else, but even at a p(doom) of 50% I'd do basically the same things I'm doing now.
(And similarly individual team members have a wide variety of beliefs on both optimism and timelines. I actually don't know their beliefs on those topics very well because these beliefs are usually not that action-relevant for us.)
More capability research than AGI safety research but idk what the ratio is and it's not something I can easily find out
Since we have multiple roles, the interview process varies across candidates, but usually it would have around 3 stages that in total correspond to 4-8 hours of interviews.
We'll leave it up until the later of those two (and probably somewhat beyond that, but that isn't guaranteed). I've edited the post.
Is that right?
Yes, that's broadly accurate, though one clarification:
This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
That's a reason (and is probably sufficient by itself), but I think a more important reason is that if your first attempt at using MONA comes at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
...I think this will
If the situations where you imagine MONA helping are situations where you can't see the long-term optimization problems, I think you need a relatively strong second bullet point
That doesn't seem right. It can simultaneously be the case that you can't tell that there are problems stemming from long-term optimization problems when you don't use MONA, and also if you actually use MONA, then it will measurably improve quality.
For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which ...
I meant "it's obvious you should use MONA if you are seeing problems with long-term optimization", which I believe is Fabien's position (otherwise it would be "hard to find").
Your reaction seems more like "it's obvious MONA would prevent multi-step reward hacks"; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
I have some credence in all three of those bullet points.
For MONA it's a relatively even mixture of the first and second points.
(You are possibly the first person I know of who reacted to MONA with "that's obvious" instead of "that obviously won't perform well, why would anyone ever do it". Admittedly you are imagining a future hypothetical where it's obvious to everyone that long-term optimization is causing problems, but I don't think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
For debate it's mostly the first point, and to some extent the third point.
Got it, that makes more sense. (When you said "methods work on toy domains" I interpreted "work" as a verb rather than a noun.)
But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
I think by far the biggest open question is "how do you provide the nonmyopic approval so that the model actually performs well". I don't think anyone has even attempted to tackle this so it's hard to tell what you could learn about it, but I'd be surprised if there weren't generaliza...
I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn't have to grapple as hard with the challenge of defining the approval feedback as I'd expect in a realistic deployment. But it does impose an alignment tax, so there's no point in using MONA currently, when good enough alignm...
I don't know of any existing work in this category, sorry. But e.g. one project would be "combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards", which in theory could work better than either one of them alone.
I'm excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I'm sad that there's so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is "alternatives to adversarial training". (There's also "new moonshots for aligning superintelligence" but I don't expect much to come out of that, and "white-box estimation of rare misbehavior" could help if you are willing to put optimizat...
I think we should all just give up on the term "scalable oversight"; it is sadly used in many conflicting ways. I mostly talk about "recursive techniques for reward generation".
I don't think you should think of "poor info flows" as something that a company actively does, but rather as the default state of affairs for any fast-moving organization with 1000+ people. Such companies normally need to actively fight against poor info flows, resulting in not-maximally-terrible-but-still-bad info flows.
This is a case where I might be overindexing on my experience at Google, but I'd currently bet that if you surveyed a representative set of Anthropic and OpenAI employees, more of them would mostly agree with that statement than mostly dis...
In broad strokes I agree with Zac. And tbc I'm generally a fan of formal verification and have done part of a PhD in program synthesis.
So, GSAI addresses the band of capabilities where AI systems become potentially no longer safe to interact with directly due to their potential ability to model and exploit human psychology
This seems like a great example of something that I strongly expect GSAI will not handle (unless the proposal reduces to "don't build such AIs", in which case I would appreciate that being stated more directly, or if it reduces to "limit ...
I also agree with Zac, maybe if you had a really well-selected group of 10 people you could do something, but 10 randomly selected AGI safety researchers probably don't accomplish much.
By far my biggest objection is that there are approximately zero useful things that "[don't] make anyone's workflow harder". I expect you're vastly underestimating the complexity of production systems and companies that build them, and the number of constraints they are under. (You are assuming a do-ocracy though, depending on how much of a do-ocracy it is (e.g. willing to i...
Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.
We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post ...
Indeed this seems related! We'd hope that there isn't too much conflict between the different "generations", since the intent is for the overseer to enable long-term plans, but it seems like something that could happen if the overseer is bad enough.
For a simple task like booking a restaurant, we could just ask the (frozen) overseer-AI to pick[1] actions, no?
If it were as simple as "just ask an LLM to choose actions" someone would have deployed this product a while ago.
But in any case I agree this isn't the most interesting case for MONA, I talked about it because that's what Daniel asked about.
In our experiments on both Test-driven development and Loan applications you can see that the ground truth reward goes up with MONA. The ground truth reward at step 0 represents the reward the agent would obtain if it were frozen. So this looks like your option (3), assuming that the overseer and the agent were identical. (Partly this is because we are also mixing in non-AI sources of feedback, like whether the code runs and passes the tests, and whether the AI made the correct decision on the loan, but I think this is a realistic model of future AI develo...
Discussed in the paper in Section 6.3, bullet point 3. Agreed that if you're using a prediction market it's no longer accurate to say that individual humans understand the strategy.
(We've seen this comment and are looking into options)
Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it's more doable now.
For a variety of reasons the core team behind this paper has moved on to other things, so we won't get to it in the near future, but it would be great to see others working on this!
Thanks, and interesting generalization!
My thoughts depend on whether you train the weaker model.
You mention permission systems, which is certainly a big deal, but I didn't see anything about broader configuration mechanisms, much of which can be motivated solely by efficiency and incidentally helps with security. (I was disputing your efficiency -> less security claim; permissions mechanisms aren't a valid counterargument since they aren't motivated by efficiency.)