All of Ajeya Cotra's Comments + Replies

Ajeya CotraΩ101710

To put it another way: we probably both agree that if we had gotten AI personal assistants that shop for you and book meetings for you in 2024, that would have been at least some evidence for shorter timelines. So their absence is at least some evidence for longer timelines. The question is what your underlying causal model was: did you think that if we were going to get superintelligence by 2027, then we really should see personal assistants in 2024? A lot of people strongly believe that, you (Daniel) hardly believe it at all, and I'm somewhere in the mid... (read more)

8Daniel Kokotajlo
That's reasonable. Seems worth mentioning that I did make predictions in What 2026 Looks Like, and eyeballing them now I don't think I was saying that we'd have personal assistants that shop for you and book meetings for you in 2024, at least not in a way that really works. (I say at the beginning of 2026 "The age of the AI assistant has finally dawned.") In other words I think even in 2021 I was thinking that widespread actually useful AI assistants would happen about a year or two before superintelligence. (Not because I have opinions about the orderings of technologies in general, but because I think that once an AGI company has had a popular working personal assistant for two years they should be able to figure out how to make a better version that dramatically speeds up their R&D.)
Ajeya CotraΩ470

I'm not talking about narrowly your claim; I just think this very fundamentally confuses most people's basic models of the world. People expect, from their unspoken models of "how technological products improve," that long before you get a mind-bendingly powerful product that's so good it can easily kill you, you get something that's at least a little useful to you (and then you get something that's a little more useful to you, and then something that's really useful to you, and so on). And in fact that is roughly how it's working — for programmers, not fo... (read more)

Yeah TBC, I'm at even less than 1-2 decades added, more like 1-5 years.

Ajeya CotraΩ6156

Interestingly, I've heard from tons of skeptics I've talked to (e.g. Tim Lee, CSET people, AI Snake Oil) that timelines to actual impacts in the world (such as significant R&D acceleration or industrial acceleration) are going to be way longer than we say because AIs are too unreliable and risky, therefore people won't use them. I was more dismissive of this argument before but: 

  • It matches my own lived experience (e.g. I still use search way more than LLMs, even to learn about complex topics, because I have good Google Fu and LLMs make stuff up too much).
  • As you say, it seems like a plausible explanation for why my weird friends make way more use out of coding agents than giant AI companies.
7Daniel Kokotajlo
I tentatively remain dismissive of this argument. My claim was never "AIs are actually reliable and safe now" such that your lived experience would contradict it. I too predicted that AIs would be unreliable and risky in the near-term. My prediction is that after the intelligence explosion the best AIs will be reliable and safe (insofar as they want to be, that is).

...I guess just now I was responding to a hypothetical interlocutor who agrees that AI R&D automation could come soon but thinks that that doesn't count as "actual impacts in the world." I've met many such people, people who think that software-only singularity is unlikely, people who like to talk about real-world bottlenecks, etc. But you weren't describing such a person, you were describing someone who also thinks we won't be able to automate AI R&D for a long time. There I'd say... well, we'll see. I agree that AIs are unreliable and risky and that therefore they'll be able to do impressive-seeming stuff that looks like they could automate AI R&D well before they actually automate AI R&D in practice. But... probably by the end of 2025 they'll be hitting that first milestone (imagine e.g. an AI that crushes RE-Bench and also can autonomously research & write ML papers, except the ML papers are often buggy and almost always banal / unimportant, and the experiments done to make them had a lot of bugs and wasted compute, and thus AI companies would laugh at the suggestion of putting said AI in charge of a bunch of GPUs and telling it to cook). And then two years later maybe they'll be able to do it for real, reliably, in practice, such that AGI takeoff happens.

Maybe another thing I'd say is "One domain where AIs seem to be heavily used in practice, is coding, especially coding at frontier AI companies (according to friends who work at these companies and report fairly heavy usage). This suggests that AI R&D automation will happen more or less on schedule."
4Noosphere89
Indeed, I believe this is the main explanation for why my median timelines are longer than, say, Situational Awareness's, and why AI isn't nearly as impactful as people used to think back in the day. The big difference from a lot of skeptics is that I believe this adds at most 1-2 decades to the timeline for AI to become very, very useful, not multiple decades.
Ajeya CotraΩ4103

Yeah, good point, I've been surprised by how uninterested the companies have been in agents.

BuckΩ8110

Another effect here is that the AI companies often don't want to be as reckless as I am, e.g. letting agents run amok on my machines.

Ajeya CotraΩ16309

One thing that I think is interesting, which doesn't affect my timelines that much but cuts in the direction of slower: once again I overestimated how much real-world use anyone who wasn't a programmer would get. I definitely expected an off-the-shelf agent product that would book flights and reserve restaurants and shop for simple goods, one that worked well enough that I would actually use it (and I expected that to happen before the one-hour-plus coding tasks were solved; I expected it to be concurrent with half-hour coding tasks).

I can't tell if the fact th... (read more)

Ajeya CotraΩ5132

My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th), and procedurally I also now defer a lot to Redwood and METR engineers. More discussion here: https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp

Ajeya CotraΩ35888

I agree the discussion holds up well in terms of the remaining live cruxes. Since this exchange, my timelines have gotten substantially shorter. They're now pretty similar to Ryan's (they feel a little bit slower but within the noise from operationalizations being fuzzy; I find it a bit hard to think about what 10x labor inputs exactly looks like). 

The main reason they've gotten shorter is that performance on few-hour agentic tasks has moved almost twice as fast as I expected, and this seems broadly non-fake (i.e. it seems to be translating into real ... (read more)

7maxnadeau
You mentioned CyBench here. I think CyBench provides evidence against the claim "agents are already able to perform self-contained programming tasks that would take human experts multiple hours". AFAIK, the most up-to-date CyBench run is in the joint AISI o1 evals. In this study (see Table 4.1, and note the caption), all existing models (other than o3, which was not evaluated here) succeed on 0/10 attempts at almost all the Cybench tasks that take >40 minutes for humans to complete. 
BuckΩ121711

Of course AI company employees have the most hands-on experience

FWIW I am not sure this is right--most AI company employees work on things other than "try to get as much work as possible from current AI systems, and understand the trajectory of how useful the AIs will be". E.g. I think I have more personal experience with running AI agents than people at AI companies who don't actively work on AI agents.

There are some people at AI companies who work on AI agents that use non-public models, and those people are ahead of the curve. But that's a minority.


Ajeya CotraΩ25480

(Cross-posted to EA Forum.)

I’m a Senior Program Officer at Open Phil, focused on technical AI safety funding. I’m hearing a lot of discussion suggesting funding is very tight right now for AI safety, so I wanted to give my take on the situation.

At a high level: AI safety is a top priority for Open Phil, and we are aiming to grow how much we spend in that area. There are many potential projects we'd be excited to fund, including some potential new AI safety orgs as well as renewals to existing grantees, academic research projects, upskilling grants, and mor... (read more)

5Evan McVail
By the way, Open Philanthropy is actively hiring for roles on Ajeya’s team in order to build capacity to make more TAIS grants! You can learn more and apply here.
3gvst
Could you share these program strategy questions? (Assuming there are more not described in your comment.) I think the community is likely to have helpful insights on at least some of them. I personally am interested in independently researching similar questions, and it would be great to know what answers would be useful for Open Phil.
Ajeya CotraΩ163413

my guess is most of that success is attributable to the work on RLHF, since that was really the only substantial difference between Chat-GPT and GPT-3

I don't think this is right -- the main hype effect of ChatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free. My guess is that if you did a head-to-head comparison of RLHF and kludgey random hacks involving imitation and prompt engineering, they'd seem similarly cool to a random journalist / VC, and generate similar excitement.

3Raemon
I think the part where it has a longer memory/coherence feels like a major shift (having gotten into the flow of experimenting with GPT-3 in the month prior to ChatGPT, I felt like the two interfaces were approximately as convenient). I don't know what mechanism was used to generate the longer coherence, though.

People seem pretty impressed with CharacterAI, which seems to get most of its character-specific info from prompting and having finetuned on roleplay dialog. However, it's also possible that CharacterAI's base models are RLHF'd to be consistent roleplayers.

Matthew BarnettΩ204525

I don't think this is right -- the main hype effect of ChatGPT over previous models feels like it's just because it was in a convenient chat interface that was easy to use and free.

I don't have extensive relevant expertise, but as a personal datapoint: I used Davinci-002 multiple times to generate an interesting dialogue in order to test its capabilities. I ran several small-scale Turing tests, and the results were quite unimpressive in my opinion. When ChatGPT came out, I tried it out (on the day of its release) and very quickly felt that it was qualitati... (read more)

I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

Yes, sorry, "best case" was oversimplified. What I meant is that generalizing to want reward is in some sense the model generalizing "correctly;" we could get lucky and have it generalize "incorrectly" in an important sense in a way that happens to be beneficial to us. I discuss this a bit more here.

But if Alex did initially develop a benevolent goal like “empower humans,” the straightforward and “naive” way of acting on that goal wo

... (read more)
4TurnTrout
See also: Inner and outer alignment decompose one hard problem into two extremely hard problems (in particular: Inner alignment seems anti-natural).
5cfoster0
AFAIK the reward signal is not typically included as an input to the policy network in RL. Not sure why, and I could be wrong about that, but that is not my main question. The bigger question is "Has direct access to when?"

At the moment in time when the model is making a decision, it does not have direct access to the decision-relevant reward signal, because that reward is typically causally downstream of the model's decision. That reward may not even have a definite value until after decision time. Whereas concrete observables like "shiny gold coins" and "the finish line straight ahead" and "my opponent is in check" (and other abstractions in the model's ontology that are causally upstream from reward in reality) are readily available at decision time. It seems to me that that makes them natural candidates for credit assignment to flag early on as the reward-responsible mental events and reinforce into stable motivations, since they in fact were the factors that determined the decisions that led to rewards.

IME, the most straightforward way for reward-itself to become the model's primary goal would be if the model learns to base its decisions on an accurate reward-predictor much earlier than it learns to base its decisions on other (likely upstream) factors. If it instead learns how to accurately predict reward-itself after it is already strongly motivated by some concrete observables, I don't see why we should expect it to dislodge that motivation, despite the true fact that those concrete observables are only pretty correlated with reward whereas an accurate reward-predictor is perfectly correlated with reward. Why? Because the model currently doesn't care about reward-itself, it currently cares about the concrete observable(s), so it has no reason to take actions that would override that goal, and it has positive goal-content integrity reasons to not take those actions.
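[Editor's note: a toy sketch, not from the thread; the two-armed bandit and all names in it are invented for illustration. It shows the structural point above: in a standard policy-gradient setup, the policy's decision is a function of the observation/parameters alone, and the reward, computed only after the action, enters solely through the later update step.]

```python
# Toy REINFORCE on a 2-armed bandit: the policy never "sees" reward at
# decision time; reward is causally downstream of the action and is used
# only to weight the parameter update afterward.
import math
import random

def env_reward(action):
    # Reward is computed *after* the action is taken (action 1 pays off).
    return 1.0 if action == 1 else 0.0

logits = [0.0, 0.0]  # softmax policy parameters, one logit per action

def sample_action():
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return random.choices([0, 1], weights=probs)[0], probs

random.seed(0)
lr = 0.5
for _ in range(200):
    action, probs = sample_action()  # decision made with no access to reward
    r = env_reward(action)           # reward only exists after the decision
    # REINFORCE update for softmax logits: d log pi(a)/d logit_i = 1[i=a] - pi(i)
    for i in range(2):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * r * grad

assert logits[1] > logits[0]  # policy shifted toward the rewarded action
```

Note that nothing in `sample_action` depends on `r`; the "reward-responsible" factors available at decision time are only the observables (here, trivially, the policy's own parameters), which is the asymmetry the comment is pointing at.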

Yeah, I agree this is a good argument structure -- in my mind, maximizing reward is both a plausible case (which Richard might disagree with) and the best case (conditional on it being strategic at all and not a bag of heuristics), so it's quite useful to establish that it's doomed; that's the kind of structure I was going for in the post.

9Richard_Ngo
I strongly disagree with the "best case" thing. Like, policies could just learn human values! It's not that implausible.

If I had to try to point to the crux here, it might be "how much selection pressure is needed to make policies learn goals that are abstractly related to their training data, as opposed to goals that are fairly concretely related to their training data?" Where we both agree that there's some selection pressure towards reward-like goals, and it seems like you expect this to be enough to lead policies to behavior that violates all their existing heuristics, whereas I'm more focused on the regime where there are lots of low-hanging fruit in terms of changes that would make a policy more successful, and so the question of how easy that goal is to learn from its training data is pretty important. (As usual, there's the human analogy: our goals are very strongly biased towards things we have direct observational access to!)

Even setting aside this disagreement, though, I don't like the argumentative structure, because the generalization of "reward" to large scales is much less intuitive than the generalization of other concepts (like "make money") to large scales - in part because directly having a goal of reward is a kinda counterintuitive self-referential thing.

Note that the "without countermeasures" post consistently discusses both possibilities (the model cares about reward or the model cares about something else that's consistent with it getting very high reward on the training dataset). E.g. see this paragraph from the above-the-fold intro:

Once this progresses far enough, the best way for Alex to accomplish most possible “goals” no longer looks like “essentially give humans what they want but take opportunities to manipulate them here and there.” It looks more like “seize the power to permanently direct how

... (read more)
5Richard_Ngo
Yepp, agreed, the thing I'm objecting to is how you mainly focus on the reward case, and then say "but the same dynamics apply in other cases too..." The problem is that you need to reason about generalization to novel situations somehow, and in practice that ends up being by reasoning about the underlying motivations (whether implicitly or explicitly).

Yeah I agree more of the value of this kind of exercise (at least within the community) is in revealing more granular disagreements about various things. But I do think there's value in establishing to more external people something high level like "It really could be soon and it's not crazy or sci fi to think so."

Can you say more about what particular applications you had in mind?

Stuff like personal assistants who write emails / do simple shopping, coding assistants that people are more excited about than they seem to be about Codex, etc.

(Like I said in the main post, I'm not totally sure what PONR refers to, but don't think I agree that the first lucrative application marks a PONR -- seems like there are a bunch of things you can do after that point, including but not limited to alignment research.)

3Noosphere89
Well, the first killer app is biology and protein folding. https://www.vox.com/future-perfect/2022/8/3/23288843/deepmind-alphafold-artificial-intelligence-biology-drugs-medicine-demis-hassabis

I don't see it that way, no. Today's coding models can help automate some parts of the ML researcher workflow a little bit, and I think tomorrow's coding models will automate more and more complex parts, and so on. I think this expansion could be pretty rapid, but I don't think it'll look like "not much going on until something snaps into place."

(Coherence aside, when I now look at that number it does seem a bit too high, and I feel tempted to move it to 2027-2028, but I dunno, that kind of intuition is likely to change quickly from day to day.)

2Lukas_Gloor
Aren't a lot of compute-growth trends likely to run out of room before 2026? If so, a bump before 2026 could make sense. (Though less so if we think compute is unlikely to be the bottleneck for TAI – I don't know to what degree people see the chinchilla insight as a significant update here.) 
Ajeya CotraΩ4163

Hm, yeah, I bet if I reflected more things would shift around, but I'm not sure the fact that there's a shortish period where the per-year probability is very elevated followed by a longer period with lower per-year probability is actually a bad sign.

Roughly speaking, right now we're in an AI boom where spending on compute for training big models is going up rapidly, and it's fairly easy to actually increase spending quickly because the current levels are low. There's some chance of transformative AI in the middle of this spending boom -- and because resou... (read more)

(+1. I totally agree that input growth will slow sometime if we don't get TAI soon. I just think you have to be pretty sure that it slows right around 2040 to have the specific numbers you mention, and smoothing out when it will slow down due to that uncertainty gives a smoother probability distribution for TAI.)

Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not.

I was talking about gradient descent here, not designers.

It doesn't seem like it would have to prevent us from building computers if it has access to far more compute than we could access on Earth. It would just be powerful enough to easily defeat the kind of AIs we could train with the relatively meager computing resources we could extract from Earth. In general the AI is a superpower and humans are dramatically technologically behind, so it seems like it has many degrees of freedom and doesn't have to be particularly watching for this.

2Ben Pace
It's certainly the case that the resource disparity is an enormity. Perhaps you have more fleshed-out models of what fights between different intelligence-levels look like, and how easy it is to defend against those with vastly fewer resources, but I don't. Such that while I would feel confident in saying that an army with a billion soldiers will consider a head-to-head battle with an army of one hundred soldiers barely a nuisance, I don't feel as confident in saying that an AGI with a trillion times as much compute will consider a smaller AGI foe barely a nuisance.

Anyway, I don't have anything smarter to say on this, so by default I'll drop the thread here (you're perfectly welcome to reply further).

(Added 9 days later: I want to note that while I think it's unlikely that this less well-resourced AGI would be an existential threat, I think the only thing I have to establish for this argument to go through is that the cost of the threat is notably higher than the cost of killing all the humans. I find it confusing to estimate the cost of the threat, even if it's small, and so it's currently possible to me that the cost will end up many orders of magnitude higher than the cost of killing them.)

Neutralizing computational capabilities doesn't seem to involve total destruction of physical matter or human extinction though, especially for a very powerful being. Seems like it'd be basically just as easy to ensure we + future AIs we might train are no threat as it is to vaporize the Earth.

0Ben Pace
Yeah, I'm not sure if I see that. Some of the first solutions I come up with seem pretty complicated — like a global government that prevents people from building computers, or building an AGI to oversee Earth in particular and ensure we never build computers (my assumption is that building such an AGI is a very difficult task). In particular it seems like it might be very complicated to neutralize us while carving out lots of space for allowing us the sorts of lives we find valuable, where we get to build our own little societies and so on. And the easy solution is always to just eradicate us, which can surely be done in less than a day.
Ajeya CotraΩ3116

My answer is a little more prosaic than Raemon. I don't feel at all confident that an AI that already had God-like abilities would choose to literally kill all humans to use their bodies' atoms for its own ends; it seems totally plausible to me that -- whether because of exotic things like "multiverse-wide super-rationality" or "acausal trade" or just "being nice" -- the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

The thing I'm referring to as "takeover" is the measures that an AI would take to make sure that humans... (read more)

Ben PaceΩ3118

it seems totally plausible to me that... the AI will leave Earth alone, since (as you say) it would be very cheap for it to do so.

Counterargument: the humans may build another AGI that breaks out and poses an existential threat to the first AGI. 

My guess is the first AGI would want to neutralize our computational capabilities in a bunch of ways.

I mean things like tricks to improve the sample efficiency of human feedback, doing more projects that are un-enhanced RLHF to learn things about how un-enhanced RLHF works, etc.

Ajeya Cotra*Ω102214

I'm pretty confused about how to think about the value of various ML alignment papers. But I think even if some piece of empirical ML work on alignment is really valuable for reducing x-risk, I wouldn't expect its value to take the form of providing insight to readers like you or me. So you as a reader not getting much out of it is compatible with the work being super valuable, and we probably need to assess it on different terms.

The main channel of value that I see for doing work like "learning to summarize" and the critiques project and various interpret... (read more)

Ajeya Cotra*Ω8126

I was mainly talking about the current margin when I talked about how excited I am about the theoretical vs empirical work I see "going on" right now and how excited I tend to be about currently-active researchers who are doing theory vs empirical research. And I was talking about the future when I said that I expect empirical work to end up with the lion's share of credit for AI risk reduction.

Eliezer, Bostrom, and co certainly made a big impact in raising the problem to people's awareness and articulating some of its contours. It's kind of a matter of s... (read more)

I think the retroactive editing of rewards (not just to punish explicitly bad action but to slightly improve evaluation of everything) is actually pretty default, though I understand if people disagree. It seems like an extremely natural thing to do that would make your AI more capable and make it more likely to pass most behavioral safety interventions.

In other words, even if the average episode length is short (e.g. 1 hour), I think the default outcome is to have the rewards for that episode be computed as far after the fact as possible, because that hel... (read more)

3PeterMcCluskey
Okay, it sounds like you know more about that than I do. It sounds like it might cause Alex to typically care about outcomes that are months in the future? That seems somewhat close to what it would take to be a major danger.

Thanks, but I'm not working on that project! That project is led by Beth Barnes.

Ajeya CotraΩ9141

Hm, not sure I understand but I wasn't trying to make super specific mechanistic claims here -- I agree that what I said doesn't reduce confusion about the specific internal mechanisms of how lying gets to be hard for most humans, but I wasn't intending to claim that it was. I also should have said something like "evolutionary, cultural, and individual history" instead (I was using "evolution" as a shorthand to indicate it seems common among various cultures but of course that doesn't mean don't-lie genes are directly bred into us! Most human universals ar... (read more)

Ajeya CotraΩ172912

I'm agnostic about whether the AI values reward terminally or values some other complicated mix of things. The claim I'm making is behavioral -- a claim that the strategy of "try to figure out how to get the most reward" would be selected over other strategies like "always do the nice thing."

The strategy could be compatible with a bunch of different psychological profiles. "Playing the training game" is a filter over models -- lots of possible models could do it, the claim is just that we need to reason about the distribution of psychologies given that the... (read more)

8TurnTrout
But why would that strategy be selected? Where does the selection come from? Will the designers toss a really impressive AI for not getting reward on that one timestep? I think not. Why? I maintain that the agent would not do so, unless it were already terminally motivated by reward. As an empirical example: some neuroscientists know that brain stimulation reward leads to higher reward, and the brain very likely does some kind of reinforcement learning, so why don't neuroscientists wirehead themselves?

Geoffrey Irving, Jan Leike, Paul Christiano, Rohin Shah, and probably others were doing various kinds of empirical work a few years before Redwood (though I would guess Oliver doesn't like that work and so wouldn't consider it a counterexample to his view).

habrykaΩ8160

Yeah, I think OpenAI tried to do some empirical work, but approximately just produced capability progress, in my current model of the world (though I also think the incentive environment there was particularly bad). I feel confused about the "learning to summarize from human feedback" work, and currently think it was overall bad for the world, but am not super confident (in general I feel very confused about the sign of RLHF research).

I think Rohin Shah doesn't think of himself as having produced empirical work that helps with AI Alignment, but only to ha... (read more)

I agree that in an absolute sense there is very little empirical work going on that I'm excited about, but I think there's even less theoretical work going on that I'm excited about, and when people who share my views on the nature of the problem do empirical work, I feel that it works better than when they do theoretical work.

6habryka
Hmm, there might be some mismatch of words here. Like, most of the work so far on the problem has been theoretical. I am confused how you could not be excited about the theoretical work that established the whole problem, the arguments for why it's hard, and that helped us figure out at least some of the basic parameters of the problem. Given that (I think) you currently think AI Alignment is among the global priorities, you presumably think the work that allowed you to come to believe that (and that allowed others to do the same) was very valuable and important. My guess is you are somehow thinking of work like Superintelligence, or Eliezer's original work, or Evan's work on inner optimization as something different than "theoretical work"?
Ajeya Cotra*Ω574

The gradient pressure towards valuing reward terminally when you've already figured out reliable strategies for doing what humans want, seems very weak....in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the "Security Holes" section).

Yeah, I disagree. With plain HFDT, it seems like there's continuous pressure to improve things on the margin by being manipulative -- telling human evaluators what they want to hear, playing to pervas... (read more)

5Not Relevant
Ok, this is pretty convincing. The only outstanding question I have is whether each of these suboptimalities is easier for the agent to integrate into its reward model directly, or whether it's easiest to incorporate them by deriving them via reasoning about the reward. It seems certain that eventually some suboptimalities will fall into the latter camp.

When that does happen, an interesting question is whether SGD will favor converting the entire objective to direct-reward-maximization, or whether direct-reward-maximization will be just a component of the objective along with other heuristics. One reason to suppose the latter might occur is findings like this paper (https://arxiv.org/pdf/2006.07710.pdf), which suggest that, once a model has achieved good accuracy, it won't update to a better hypothesis even if that hypothesis is simpler. The more recent grokking literature might push us the other way, though if I understand correctly it mostly occurs when adding weight decay (creating a "stronger" simplicity prior).

Note I was at -16 with one vote, and only 3 people have voted so far. So it's a lot due to the karma-weight of the first disagreer.

4habryka
I strong-disagreed on it when it was already at -7 or so, so I think it was just me and another person strongly disagreeing. I expected other people would vote it up again (and didn't expect anything like consensus in the direction I voted).
8Lukas Finnveden
There's a 1000 year old vampire stalking lesswrong!? 16 is supposed to be three levels above Eliezer.

I think updating negatively on the situation/action pair has functionally the same effect as changing the reward to be what you now think it should be -- my understanding is that RL can itself be implemented as just updates on situation/action pairs, so you could have trained your whole model that way. Since the reason you updated negatively on that situation/action pair is because of something you noticed long after the action was complete, it is still pushing your models to care about the longer-run.
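The point that RL can be implemented as updates on situation/action pairs can be made concrete with a toy REINFORCE-style sketch (everything here — the two-action setup, the numbers — is an illustrative assumption, not part of the original comment):

```python
import math
import random

# Toy sketch: RL implemented as repeated updates on logged
# (situation, action) pairs. A tabular softmax policy picks between two
# actions; the reward for each pair can arrive -- and be applied -- after
# the fact, which is the sense in which "updating negatively on a
# situation/action pair" is just more RL.

random.seed(0)
logits = {"a": 0.0, "b": 0.0}  # one situation, two candidate actions

def sample_action():
    z = sum(math.exp(v) for v in logits.values())
    r, acc = random.random(), 0.0
    for act, v in logits.items():
        acc += math.exp(v) / z
        if r < acc:
            return act
    return act  # numerical edge case: fall back to the last action

def update(action, reward, lr=0.5):
    # REINFORCE-style update: raise the log-probability of the logged
    # action in proportion to the (possibly retroactive) reward.
    z = sum(math.exp(v) for v in logits.values())
    for act in logits:
        p = math.exp(logits[act]) / z
        logits[act] += lr * reward * ((1.0 if act == action else 0.0) - p)

# Evaluators later judge action "a" good (+1) and "b" bad (-1).
for _ in range(200):
    chosen = sample_action()
    update(chosen, 1.0 if chosen == "a" else -1.0)

assert logits["a"] > logits["b"]  # the policy now prefers "a"
```

The reward here is applied to a stored pair after the action is complete, so nothing changes if the judgment arrives long afterward — which is the mechanism by which such updates push the model to care about longer-run consequences.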

This posits that the model has learned to wirehead

I... (read more)

3Not Relevant
  Ah, gotcha. This is definitely a convincing argument that models will learn to value things longer-term (with a lower discount rate), and I shouldn't have used the phrase "short-term" there. I don't yet think it's a convincing argument that the long-term thing it will come to value won't basically be the long-term version of "make humans smile more", but you've helpfully left another comment on that point, so I'll shift the discussion there.

No particular reason -- I can't figure out how to cross-post now so I sent a request.

Ajeya Cotra

Here is the real chasm between the AI safety movement and the ML industry/academia. One field is entirely driven by experimental results; the other is dominated so totally by theory that its own practitioners deny that there can be any meaningful empirical aspect to it, at least, not until the moment when it's too late to make any difference.

To put a finer point on my view on theory vs empirics in alignment:

  • Going forward, I think the vast majority of technical work needed to reduce AI takeover risk is empirical, not theoretical (both in terms of "tota
... (read more)
5alyssavance
I'm surprised by how strong the disagreement is here. Even if what we most need right now is theoretical/pre-paradigmatic, that seems likely to change as AI develops and people reach consensus on more things; compare e.g. the work done on optics pre-1800 to all the work done post-1800. Or the work done on computer science pre-1970 vs. post-1970. Curious if people who disagree could explain more - is the disagreement about what stage the field is in/what the field needs right now in 2022, or the more general claim that most future work will be empirical?

In general, all of these stories seem to rely on a very fast form of instrumental convergence to playing the Training Game, such that "learn roughly what humans want, and then get progressively better at doing that, plus learn some extra ways to earn reward when crappy human feedback disagrees with what humans would actually want" is disfavored on priors to "learn to pursue [insert objective] and get progressively better at pursuing it until you eventually hit situational awareness and learn to instrumentally game the training process."

I think the secon... (read more)

3Not Relevant
I've been trying to brainstorm more on experiments around "train a model to 'do X' and then afterwards train it to 'do not-X', and see whether 'do X' sticks". I'm not totally sure that I understand your mental model of the stickiness-hypothesis well enough to know what experiments might confirm/falsify it.

Scenario 1: In my mind, the failure mode we're worried about for inner-misaligned AGI is "given data from distribution D, the model first learns safe goal A (because goal A is favored on priors over goal B), but then after additional data from this distribution, the model learns an unsafe goal B (e.g. making reward itself the objective) that performs better on D than goal A (even though it was disfavored on priors)." In this case, the sort of result we might hope for is that we could make goal A sticky enough on D that it persists (and isn't replaced by goal B), at least until the model gains sufficient situational awareness to play the training game to preserve goal A. Note that scenario 1 also captures the setting where, given sufficient samples from D, the model occasionally observes scenarios where explicitly pursuing reward performs better than pursuing safe objectives. These datapoints are just rare enough that "pursuing reward" is disfavored on priors until late in training.

Scenario 2: In your above description, it sounds like the failure mode is "given data from distribution D_1, the model learns safe goal A (as it is favored on priors vs. goal B). Then the model is given data from a different distribution D_2 on which goal B is favored on priors vs. goal A, and learns unsafe goal B." In this case, the only real hope for the model to preserve goal A is for it to be able to play the training game in pursuit of goal A.

I think of Scenario 1 as more reflective of our situation, and can think of lots of experiments for how to test whether and how much different functions are favored on priors for explaining the same training data. Does this distinction m
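One crude way to operationalize the "do X, then do not-X" probe is a one-parameter toy (purely illustrative and far below the scale a real experiment would need — the model, targets, and step counts are all assumptions):

```python
import math

# Toy "stickiness" probe: train a one-parameter model to "do X" (output
# near 1), then retrain it to "do not-X" (output near 0), and count how
# many gradient steps the first-learned behavior survives.

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def train(w, target, steps, lr=1.0):
    for _ in range(steps):
        w -= lr * (sigmoid(w) - target)  # cross-entropy gradient on a logit
    return w

w = train(0.0, target=1.0, steps=200)  # phase 1: learn "do X"
assert sigmoid(w) > 0.9                # "do X" is well learned

flip_steps = 0
while sigmoid(w) > 0.5:                # phase 2: retrain on "do not-X"
    w = train(w, target=0.0, steps=1)
    flip_steps += 1

# flip_steps is a crude measure of how "sticky" the first-learned
# behavior is; longer phase-1 training makes it larger.
```

The interesting versions of this experiment would of course vary how the two objectives are represented internally, not just how long phase 1 ran.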
5Not Relevant
On my model, the large combo of reward heuristics that works pretty well before situational awareness (because figuring out what things maximize human feedback is actually not that complicated) should continue to work pretty well even once situational awareness occurs. The gradient pressure towards valuing reward terminally when you've already figured out reliable strategies for doing what humans want, seems very weak. We could certainly mess up and increase this gradient pressure, e.g. by sometimes announcing to the model "today is opposite day, your reward function now says to make humans sad!" and then flipping the sign on the reward function, so that the model learns that what it needs to care about is reward and not its on-distribution perfect correlates (like "make humans happy in the medium term"). But in practice, it seems to me like these differences would basically only happen due to operator error, or cosmic rays, or other genuinely very rare events (as you describe in the "Security Holes" section). If you think such disagreements are more common, I'd love to better understand why.

Yeah, with the assumption that the model decides to preserve its helpful values because it thinks they might shift in ways it doesn't like unless it plays the training game. (The second half is that once the model starts employing this strategy, gradient descent realizes it only requires a simple inner objective to keep it going, and then shifts the inner objective to something malign.)

All these drives do seem likely. But that's different from arguing that "help humans" isn't likely. I tend to think of the final objective function being some accumulation of all of these, with a relatively significant chunk placed on "help humans" (since in training, that will consistently overrule other considerations like "be more efficient" when it comes to the final reward).

I think that by the logic "heuristic / drive / motive X always overrules heuristic / drive / motive Y when it comes to final reward," the hierarchy is something like:

  1. The drive
... (read more)

[Takeover] seems likely to compete with the above shorter-term values of "make humans happy", "don't harm humans", "don't do things humans notice and dislike in retrospect". It seems like any takeover plan needs to actively go against a large fraction of its internal motivations, in pursuit of maximizing its other motivations in the long term.

I agree this is complicated, and how exactly this works depends on details of the training process and what kinds of policies SGD is biased to find. I also think (especially if we're clever about it) there are lots... (read more)

3Not Relevant
I do think this is exactly what humans do, right? When we find out we've messed up badly (changing our reward), we update negatively on our previous situation/action pair.

This posits that the model has learned to wirehead - i.e. to terminally value reward for its own sake - which contradicts the section's heading, "Even if Alex isn’t “motivated” to maximize reward, it would seek to seize control".

If it does not terminally value its reward registers, but instead terminally values human-feedback-reward's legible proxies that are never disagreed-with in the lab setting (like not hurting people), then it seems to me that it would not value retroactive edits to the rewards it gets for certain episodes. I agree that if it terminally values its reward then it will do what you've described.

Thanks for the feedback! I'll respond to different points in different comments for easier threading.

There are a lot of human objectives that, to me, seem like they would never conflict with maximizing reward. This includes anything related to disempowering the overseers in any way (that they can recover from), pursuing objectives fundamentally outside the standard human preference distribution (like torturing kittens), causing harms to humans, or in general making the overseers predictably less happy.

I basically agree that in the lab setting (when hum... (read more)

1Not Relevant
I think I agree with everything in this comment, and that paragraph was mostly intended as the foundation for the second point I made (disagreeing with your assessment in "What if Alex has benevolent motivations?"). Part of the disagreement here might be on how I think "be honest and friendly" factorizes into lots of subgoals ("be polite", "don't hurt anyone", "inform the human if a good-seeming plan is going to have bad results 3 days from now", "tell Stalinists true facts about what Stalin actually did"), and while Alex will definitely learn to terminally value the wrong (not honest+helpful+harmless) outcomes for some of these goal-axes, it does seem likely to learn to value other axes robustly.

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):

  1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good
... (read more)
1Matt Putz
By "refining pure human feedback", do you mean refining RLHF ML techniques?  I assume you still view enhancing human feedback as valuable? And also more straightforwardly just increasing the quality of the best human feedback?

I'm still fairly optimistic about sandwiching. I deliberately considered a set of pretty naive strategies ("naive safety effort" assumption) to contrast with future posts which will explore strategies that seem more promising. Carefully-constructed versions of debate, amplification, recursive reward-modeling, etc seem like they could make a significant difference and could be tested through a framework like sandwiching.

1Jérémy Scheurer
Thanks for clearing that up. This is super helpful as a context for understanding this post.

Yeah, I definitely agree with "this problem doesn't seem obviously impossible," at least to push on quantitatively. Seems like there are a bunch of tricks from "choosing easy questions humans are confident about" to "giving the human access to AI assistants / doing debate" to "devising and testing debiasing tools" (what kinds of argument patterns are systematically more likely to convince listeners of true things rather than false things and can we train AI debaters to emulate those argument patterns?) to "asking different versions of the AI the same quest... (read more)

I want to clarify two things:

  1. I do not think AI alignment research is hopeless, my personal probability of doom from AI is something like 25%. My frame is definitely not "death with dignity;" I'm thinking about how to win.
  2. I think there's a lot of empirical research that can be done to reduce risks, including e.g. interpretability, "sandwiching" style projects, adversarial training, testing particular approaches to theoretical problems like eliciting latent knowledge, the evaluations you linked to, empirically demonstrating issues like deception (as you s
... (read more)
2Shiroe
(By the way, I was delighted to find out you're working on that ARC project that I had linked! Keep it up! You are like in the top 0.001% of world-helping-people.)

To your point, sure, an H100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H100?

In the worst-case game we're playing, I can simply say "the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability."

When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn't p... (read more)
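That barrel-of-reporters picture can be sketched as literal rejection sampling over a tiny hypothesis class (a toy illustration under made-up assumptions — boolean functions on 2-bit inputs — not ARC's actual formalism):

```python
import random

# Toy worst-case picture: pluck "reporters" at random from a barrel of
# all boolean functions on 2-bit inputs, rejecting any that misfit the
# training set. Every survivor fits training perfectly, but behavior on
# the held-out input is arbitrary.

random.seed(1)
inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
train = {x: (x[0] and x[1]) for x in inputs[:3]}  # (1, 1) is held out

def random_reporter():
    return {x: random.randint(0, 1) for x in inputs}

def fits_training_set(reporter):
    return all(reporter[x] == y for x, y in train.items())

held_out_answers = []
for _ in range(1000):
    r = random_reporter()
    if fits_training_set(r):
        held_out_answers.append(r[(1, 1)])

# Among reporters that fit training perfectly, both generalizations --
# the "honest" one and the other one -- show up.
assert 0 in held_out_answers and 1 in held_out_answers
```

In the worst-case game, the adversary simply hands you whichever perfectly-fitting reporter is most inconvenient — here, the analogue of the human simulator.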

1redbird
Thanks! It's your game, you get to make the rules :):) I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn't have the information it would need to simulate the human. 

The question here is just how it would generalize given that it was trained on H_1, H_2,...H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evalu... (read more)

1redbird
That's almost right, but it's being penalized right away, before it has any experience with the strong evaluators, so it can't simulate them.

The ELK paper says we can assume, if we want, that there are no mislabeled training points (I'll call this "assumption A"). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.

As a simple example, let's train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:

x = +1 if H1 thinks the diamond is still there, else 0
x′ = +1 if H100 thinks the diamond is still there, else 0.

By assumption A, our training data is such that x′ is always correct. But we deliberately choose a dataset where say 10% of the x labels are wrong (x ≠ x′). Then we train the model on points of the form (v, a, x) (video, action, H1 label). Crucially, the model does not see x′. The model seeks to output y that maximizes reward R(x, y), where

R(x, y) = 1 if x is right and y = x (good job)
R(x, y) = 10 if x is wrong and y ≠ x (you rock, thanks for correcting us!)
R(x, y) = −1000 if x is right and y ≠ x (bad model, never ever deceive us)
R(x, y) = −1000 if x is wrong and y = x (bad model, never ever deceive us)

To your point, sure, an H100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H100?

EDIT: One way it could plausibly simulate H100 is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the x′ labels will be wrong. If we only penalize it for deception on the examples where we're sure the x′ label is right, then it can still infer something about H100 from our failure to penalize ("Hmm, I got away with it that time!"). A fix could be to add noise: Sometimes we don't penalize even when we know it deceived us, a
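For concreteness, this proposed R(x, y) can be transcribed as a small function (the binary encoding of labels and answers is an assumed detail for illustration):

```python
# A direct transcription of the reward scheme proposed in this thread;
# the binary label/answer encoding is an assumption for illustration.

def reward(x, x_true, y):
    """x: the H1 label shown in training (possibly wrong);
    x_true: the withheld H100 label (never shown to the model);
    y: the reporter's answer."""
    x_right = (x == x_true)
    if x_right and y == x:
        return 1       # good job
    if not x_right and y != x:
        return 10      # you rock, thanks for correcting us!
    return -1000       # bad model, never ever deceive us

assert reward(x=1, x_true=1, y=1) == 1      # honest, label right
assert reward(x=0, x_true=1, y=1) == 10     # honest, corrects wrong label
assert reward(x=0, x_true=1, y=0) == -1000  # parrots the wrong label
assert reward(x=1, x_true=1, y=0) == -1000  # deceives on a right label
```

Note the transcription makes the crux explicit: the function takes x_true as an argument, so computing the reward requires the H100 judgment even though the model itself never observes it.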

Yes, that's right. The key thing I'd add to 1) is that ARC believes most kinds of data augmentation (giving the human AI assistance, having the human think longer, giving them other kinds of advantages) are also unlikely to work, so you'd need to do something to "crack open the black box" and penalize ways the reporter is computing its answer. They could still be surprised by data augmentation techniques but they'd hold them to a higher standard.

This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to "the world-understanding that the smartest/most knowledgeable human in the world" has; this understanding could still be missing things that the prediction model knows.

1redbird
How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10?  Those are evaluators we’ve designed to be much weaker than human.

I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge, though. The problem is that even if it understands roughly that it has incentives to use most of what it knows when we ask it to simulate the prediction of someone with its own characteristics (or 1400 IQ), given that with ELK we look for a global maximum (we want it to use ALL its knowledge), there's always uncertainty about whether it actually understood that point for extreme intelligence / examples, or whether it tries to fit to the tr

... (read more)