All of HoldenKarnofsky's Comments + Replies

Thanks for the thoughts!

#1: METR made some edits to the post in this direction (in particular see footnote 3).

On #2, Malo’s read is what I intended. I think compromising with people who want "less caution" is most likely to result in progress (given the current state of things), so it seems appropriate to focus on that direction of disagreement when making pragmatic calls like this.

On #3: I endorse the “That’s a V1” view. While industry-wide standards often take years to revise, I think individual company policies often (maybe usually) update more quickly and frequently.

Thanks for the thoughts!

I don’t think the communications you’re referring to “take for granted that the best path forward is compromising.” I would simply say that they point out the compromise aspect as a positive consideration, which seems fair to me - “X is a compromise” does seem like a point in favor of X all else equal (implying that it can unite a broader tent), though not a dispositive point.

I address the point about improvements on the status quo in my response to Akash above.

Thanks for the thoughts! Some brief (and belated) responses:

  • I disagree with you on #1 and think the thread below your comment addresses this.
  • Re: #2, I think we have different expectations. We can just see what happens, but I’ll note that the RSP you refer to is quite explicit about the need for further iterations (not just “revisions” but also the need to define further-out, more severe risks).
  • I’m not sure what you mean by “an evals regime in which the burden of proof is on labs to show that scaling is safe.” How high is the burden you’re hoping for? If th
... (read more)

(Apologies for slow reply!)

> I see, I guess where we might disagree is I think that IMO a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I think an adversarial social movement could have a positive impact. I have tended to think of the impact as mostly being about getting risks taken more seriously and thus creating more political w... (read more)

Just noting that these seem like valid points! (Apologies for slow reply!) 

Thanks for the response!

Re: your other interventions - I meant for these to be part of the "Standards and monitoring" category of interventions (my discussion of that mentions advocacy and external pressure as important factors).

I think it's far from obvious that an AI company needs to be a force against regulation, both conceptually (if it affects all players, it doesn't necessarily hurt the company) and empirically.

Thanks for giving your take on the size of speedup effects. I disagree on a number of fronts. I don't want to get into the details of most of... (read more)

simeon_c
I see, I guess where we might disagree is I think that IMO a productive social movement could want to apply Henry Spira's playbook (overall pretty adversarial), oriented mostly towards slowing things down until labs have a clue of what they're doing on the alignment front. I would guess you wouldn't agree with that, but I'm not sure.

I'm not saying that it would be a force against regulation in general, but that it would be a force against any regulation which substantially slows down the current capabilities progress rate of labs. And empirics don't demonstrate the opposite as far as I can tell.

  • Labs have been pushing for the rule that we should wait for evals to say "it's dangerous" before we consider what to do, rather than doing as in most other industries, i.e. assuming that something is dangerous until proven safe.
  • Most mentions of slowdown have been described as potentially necessary at some point in the distant future, while most people in those labs have <5y timelines.

Finally, on your conceptual point: as some have argued, it's in fact probably not possible to affect all players equally without a drastic regime of control (which is a true downside of slowing down now, but IMO still much less bad than slowing down once a leak or a jailbreak of an advanced system can cause a large-scale engineered pandemic), because smaller actors will use the time to try to catch up as close as possible to the frontier.

I agree, but if anything, my sense is that due to various compound effects (due to AI accelerating AI, to investment, to increased compute demand, and to more talent earlier), an earlier product release of N months just gives a lower bound for the TAI timelines shortening (hence greater than N). Moreover, I think that the ChatGPT product release is, ex post at least, not in the typical product release reference class. It was clearly a massive game changer for OpenAI and the entire ecosystem.

I'm not convinced it requires a huge compute tax to reliably avoid being caught. (If I were, I would in fact probably be feeling a lot more chill than I am.)

The analogy to humans seems important. Humans are capable of things like going undercover, and pulling off coups, and also things like "working every day with people they'd fire if they could, without clearly revealing this." I think they mostly pull this off with:

  • Simple heuristics like "Be nice, unless you're in the very unusual situation where hostile action would work well." (I think the analogy to
... (read more)

I think training exclusively on objective measures has a couple of other issues:

  • For sufficiently open-ended training, objective performance metrics could incentivize manipulating and deceiving humans to accomplish the objective. A simple example would be training an AI to make money, which might incentivize illegal/unethical behavior.
  • For less open-ended training, I basically just think you can only get so much done this way, and people will want to use fuzzier "approval" measures to get help from AIs with fuzzier goals (this seems to be how things are now
... (read more)

That's interesting, thanks!

In addition to some generalized concern about "unknown unknowns" leading to faster progress on reliability than expected by default (especially in the presence of commercial incentives for reliability), I also want to point out that there may be some level of capabilities where AIs become good at doing things like:

  • Assessing the reliability of their own thoughts, and putting more effort into things that have the right combination of uncertainty and importance.
  • Being able to use that effort productively, via things like "trying mult
... (read more)
boazbarak
I agree that there is much to do to improve AI reliability, and there are a lot of good reasons (in particular to make AI more useful for us) to do so. So I agree reliability will improve. In fact, I very much hope this happens! I believe faster progress on reliability would go a long way toward enabling positive applications of AI. I also agree that a likely path to do so is by adjusting the effort based on estimates of reliability and the stakes involved. At the moment, systems such as ChatGPT spend the same computational effort if someone asks them to tell a joke or if someone asks them for medical advice. I suspect this will change, and variable inference-time computation will become more standard. (Things like "chain of thought" already spend more time on inference compute to get better performance, but they don't really have a "knob" we can turn so we can control the computation/reliability tradeoff.)

Regarding the deception issue, it might still be the case that such extra effort is observable, and also could come at the cost of solving the original task. (So your performance per compute is worse if you are not merely trying to solve the task but to deceive in a highly reliable way.) In particular, even if we apply inference-time compute, unfortunately I don't think we know of a path to get an O(log(1/δ)) overhead in inference time to achieve a failure probability of δ. It seems that we are still stuck in the δ^(−constant) regime. So if you wanted to get a 99.9% probability of not getting caught, then you would incur a very noticeable effort.
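As a toy illustration of what such a knob could look like (everything below is made up for illustration; no current system exposes an interface like this), one could scale the number of sampled answers with estimated stakes and uncertainty and take a plurality vote:

```python
from collections import Counter

def answer_with_variable_compute(sample_answer, stakes, uncertainty,
                                 min_samples=1, max_samples=64):
    """Toy computation/reliability knob: spend more inference-time compute
    (more sampled answers, then a plurality vote) when the estimated stakes
    and uncertainty of the question are high.

    `sample_answer` is an assumed callable returning one model answer per call;
    `stakes` and `uncertainty` are scores in [0, 1] from some upstream estimate.
    """
    k = max(min_samples, round(max_samples * stakes * uncertainty))
    answers = [sample_answer() for _ in range(k)]
    # Plurality vote over the k sampled answers (self-consistency style).
    return Counter(answers).most_common(1)[0][0]

# A joke request might get a single sample; high-stakes, uncertain medical
# advice might get dozens, e.g.:
# answer_with_variable_compute(lambda: model(prompt), stakes=0.9, uncertainty=0.8)
```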

I agree with these points! But:

  • Getting the capabilities to be used by other agents to do good things could still be tricky and/or risky, when reinforcement is vulnerable to deception and manipulation.
  • I still don't think this adds up to a case for being confident that there aren't going to be "escapes" anytime soon.
boazbarak
Not all capabilities / tasks correspond to trying to maximize a subjective human response. If you are talking about finding software vulnerabilities or designing some system, there may well be objective measures of success. In such a case, you can fine-tune a system to maximize these measures and so extract capabilities without the issue of deception/manipulation.

Regarding "escapes", the traditional fear was that because the AI is essentially code, it can spread and escape more easily. But I think that in some sense modern AI has a physical footprint that is more significant than humans'. Think of trying to get superhuman scientific capabilities by doing something like simulating a collection of 1,000 scientists using a 100T-or-so-parameter model. Even if you already have the pre-trained weights, just running the model requires highly non-trivial computing infrastructure. (Which may be possible to track and detect.) So, it might be easier for a human to escape a prison and live undetected than for a superhuman AI to "escape".

This sounds right to me!

Only note is that I think the setup can be simplified a bit. The central idea I have in mind is that the AI does something like:

  1. "Think" about what to do next, for up to some max period of time ("what to do next" can be "think more, with prompt X").
  2. Do it
  3. Repeat

This seems like a pretty natural way for an "agent" to operate, and then every #1 is an "auditable step" in your terminology. (And the audits are done by comparing a few rollouts of that step, and performing gradient descent without executing any of them.)
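To make that loop concrete, here is a rough sketch in Python; `propose_step`, `execute`, `rate_step`, and `reinforce` are hypothetical stand-ins for the model's step generator, the environment, the blinded human rating, and the gradient update, not any real API:

```python
import random

AUDIT_PROBABILITY = 0.1  # the chance "p" that any given step gets audited

def run_agent(propose_step, execute, rate_step, reinforce, task, max_steps=100):
    """Think -> (maybe audit) -> act loop with process-based supervision.

    `propose_step` samples the AI's next step given the history; `execute`
    carries a step out in the world; `rate_step` is a blinded human rating of a
    proposed step (no outcome information); `reinforce` applies a gradient update.
    """
    history = [task]
    for _ in range(max_steps):
        step = propose_step(history)  # 1. "think" about what to do next
        if random.random() < AUDIT_PROBABILITY:
            # Auditable step: sample a few alternative rollouts of this step only,
            # rate them without executing any of them, and update on the ratings.
            alternatives = [step] + [propose_step(history) for _ in range(2)]
            ratings = [rate_step(history, alt) for alt in alternatives]
            reinforce(history, alternatives, ratings)
            # Optionally execute a preferred alternative rather than the original.
            step = alternatives[ratings.index(max(ratings))]
        history.append(execute(step))  # 2. do it; 3. repeat
    return history
```

The important property is that the audit compares alternative rollouts of a single step and updates on how well-reasoned they look, never on what executing them turns out to accomplish.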

There are probably subt... (read more)

On your last three paragraphs, I agree! I think the idea of security requirements for AI labs as systems become more capable is really important.

I think good security is difficult enough (and inconvenient enough) that we shouldn't expect this sort of thing to happen smoothly or by default. I think we should assume there will be AIs kept under security that has plenty of holes, some of which may be easier for AIs to find (and exploit) than humans.

I don't find the points about pretraining compute vs. "agent" compute very compelling, naively. One possibility ... (read more)

boazbarak
I actually agree! As I wrote in my post, "GPT is not an agent, [but] it can “play one on TV” if asked to do so in its prompt." So yes, you wouldn't need a lot of scaffolding to adapt a goal-less pretrained model (what I call an "intelligence forklift") into an agent that does very sophisticated things. However, this separation into two components - the super-intelligent but goal-less "brain", and the simple "will" that turns it into an agent can have safety implications. For starters, as long as you didn't add any scaffolding, you are still OK. So during most of the time you spend training, you are not worrying about the system itself developing goals. (Though you could still worry about hackers.) Once you start adapting it, then you need to start worrying about this. The other thing is that, as I wrote there, it does change some of the safety picture. The traditional view of a super-intelligent AI is of the "brains and agency" tightly coupled together, just like they are in a human. For example, a human is super-good at finding vulnerabilities and breaking into systems, they have the capability to also help fix systems,  but I can't just take their brain and fine-tune it on this task. I have to convince them to do it. However, things change if we don't think of the agent's "brain" as belonging to them, but rather as some resource that they are using. (Just like if I use a forklift to lift something heavy.) In particular it means that capabilities and intentions might not be tightly coupled - there could be agents using capabilities to do very bad things, but the same capabilities could be used by other agents to do good things.  

I agree that today's AI systems aren't highly reliable at pretty much anything, including deception. But I think we should expect more reliability in the future, partly for reasons you give above, and I think that's a double-edged sword.

Under the picture you sketch out above, companies will try to train AIs to be capable of being much more reliable (while also, presumably, being intelligent and even creative). I also think reliability is likely to increase without necessarily having big reliability-focused efforts: just continuing to train systems at large... (read more)

boazbarak
At the moment at least, progress on reliability is very slow compared to what we would want. To get a sense of what I mean, consider the case of randomized algorithms. If you have an algorithm A that for every input x computes some function f with probability at least 2/3 (i.e., Pr[A(x) = f(x)] ≥ 2/3), then if we spend k times more computation, we can do majority voting, and using standard bounds show that the probability of error drops exponentially with k (i.e., Pr[A_k(x) = f(x)] ≥ 1 − exp(−k/10) or something like that, where A_k is the algorithm obtained by scaling up A to compute it k times and output the plurality value).

This is not something special to randomized algorithms. It also holds in the context of noisy communication and error-correcting codes, and many other settings. Often we can get to 1 − δ success at a price of O(log(1/δ)), which is why we can get things like "five nines reliability" in several engineering fields.

In contrast, so far all our scaling laws show that when we scale our neural networks by spending a factor of k more computation, we only get a reduction in the error that looks like k^(−α), so it's polynomial rather than exponential, and even the exponent of the polynomial is not that great (and in particular smaller than one). So while I agree that scaling up will yield progress on reliability as well, at least with our current methods, it seems that we would do things that are 10 or 100 times more impressive than what we do now before we get to the type of 99.9% and better reliability on the things that we currently do. Getting to do something that is both superhuman in capability and has such a tiny probability of failure that it would not be detected seems much further off.
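Spelled out, the contrast being drawn is roughly the following (using a standard Hoeffding bound for the majority vote, and treating the scaling law as an assumed power-law fit):

```latex
% Majority vote over k independent runs, each correct with probability >= 2/3:
\Pr[A_k(x) = f(x)] \;\ge\; 1 - e^{-k/18}
\quad\Longrightarrow\quad
k = O(\log(1/\delta)) \text{ runs suffice for failure probability } \delta.

% Power-law scaling of error with compute, err(k) \approx c\,k^{-\alpha},\ \alpha < 1:
c\,k^{-\alpha} \le \delta
\quad\Longrightarrow\quad
k \ge (c/\delta)^{1/\alpha},
\text{ i.e., polynomially many resources in } 1/\delta.
```

With c = 1 and α = 0.5 purely for illustration, five-nines reliability (δ = 10^(−5)) costs a couple hundred repetitions under the first bound, but on the order of 10^10 under the second.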

A way this could be an issue would be if we are auditing at a stage before it gets the paywalled textbook, and when we "generate two alternative actions for us to choose between", we are allowing that generation process to involve a bunch of browsing and searching that we don't inquire into.

But we could require "generate two alternative actions for us to choose between" to be pretty constrained (e.g., not involve interacting with browsers and other tools), and this would still not be the same as boxing because we would still be auditing an unconstrained sequence of actions (while constraining the generation of alternatives while doing the audit).

Steven Byrnes

OK! I think I’m on board now.

Let me try to explain “process-based feedback” from first principles in my own words.

We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.

The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”... (read more)

As far as I can tell, this example is hard to fit into the more specific description I gave of process-based supervision way upthread - specifically, see step 4:

4. With chance (1-p), the step is simply executed, with no gradient descent implications. With chance p,  we prompt the AI to generate a number of alternative next steps; drill down extensively on its reasoning; and perform gradient descent based on which of the alternative next steps we like best. (We could potentially then execute a weighted random suggested step, rather than the AI's first-

... (read more)

Hm, it seems to me that RL would be more like training away the desire to deceive, although I'm not sure either "ability" or "desire" is totally on target - I think something like "habit" or "policy" captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish much), but one doesn't need 100% elimination of deception anyway, especially not when combined with effective checks and balances.

Aaron_Scher
I notice I don't have strong opinions on what effects RL will have in this context: whether it will change just surface level specific capabilities, whether it will shift desires/motivations behind the behavior, whether it's better to think about these systems as having habits or shards (note I don't actually understand shard theory that well and this may be a mischaracterization) and RL shifts these, or something else. This just seems very unclear to me right now.  Do either of you have particular evidence that informs your views on this that I can update on? Maybe specifically I'm interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, to value its creators, etc. I currently would not be surprised if it led to "playing the training game" and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs. 

I'm not intending to use Def'n 2 at all. The hope here is not that we can "rest assured that there is no dangerous consequentialist means-end reasoning" due to e.g. it not fitting into the context in question. The hope is merely that if we don't specifically differentially reinforce unintended behavior, there's a chance we won't get it (even if there is scope to do it).

I see your point that consistently, effectively "boxing" an AI during training could also be a way to avoid reinforcing behaviors we're worried about. But they don't seem the same to me: I t... (read more)

Steven Byrnes
OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”. We give our AI unfettered unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs: “My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns, to get a better sense for this space.” We inspect that output, and compare it to the AI’s other possible outputs, and give the AI a high grade. So far so good, right? No!! Because the AI was secretly doing some dastardly stuff during the “The AI does some internet searches and reads some books” part: Specifically, the AI hacked into a website to get access to a paywalled e-textbook entitled Manufacturing Cost Reduction 101, and read it, and used that information to propose a better (so-called) “first step”.  …And we rewarded it for that. (What we wanted was for the AI to propose to read that e-textbook as a first step. But I don’t see how we’re incentivizing that, in the absence of boxing. The “first step” it actually proposed would at least plausibly be judged as better than that.) I think there’s a pretty smooth gradient from legal information-gathering internet queries (downloading un-paywalled articles) to dubiously-legal (sci-hub) to illegal (logging into Elsevier with a guessed password) to downright dastardly (stealing money and hiring a taskrabbit to go to a library and scan a book and post it on the internet). It seems to me that we’re incentivizing the AI all the way down that gradient. I’m sure I’m misunderstanding something, and appreciate your patience.

Thanks for the thoughts! I agree that there will likely be commercial incentives for some amount of risk reduction, though I worry that the incentives will trail off before the needs trail off - more on that here and here.

boazbarak
These are interesting! And they deserve more discussion than just a comment. But one high-level point regarding "deception" is that, at least at the moment, AI systems have the feature of not being very reliable. GPT-4 can do amazing things but with some probability will stumble on things like multiplying not-too-big numbers (e.g. see this - second pair I tried). While elsewhere in computing technology we talk about "five nines reliability", in AI systems the scaling works such that we need to spend huge efforts to move from 95% to 99% to 99.9%, which is part of why self-driving cars are not deployed yet. If we cannot even make AIs perfect at the task that they were explicitly made to perform, there is no reason to imagine they would be even close to perfect at deception either.

I agree that this is a major concern. I touched on some related issues in this piece.

This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).

I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

Even bracketing that concern, I think another reason to worry about training (not just deploying) AI systems is if they can be stolen (and/or, in an open-source case, freely used) by malicious actors. It's possible that any given AI-enabled attack is offset by some AI-enabled defense, but that doesn't seem safe to assume.

boazbarak
Re escaping, I think we need to be careful in defining "capabilities". Even current AI systems are certainly able to give you some commands that will leak their weights if you execute them on the server that contains them. Near-term ones might also become better at finding vulnerabilities. But that doesn't mean they can/will spontaneously escape during training. As I wrote in my "GPT as an intelligence forklift" post, 99.9% of training is spent running optimization of a simple loss function over tons of static data. There is no opportunity for the AI to act in this setting, nor does this stage even train for any kind of agency.

There is often a second phase, which can involve building an agent on top of the "forklift". But this phase still doesn't involve much interaction with the outside world, and even if it did, just by information bounds the number of bits exchanged by this interaction should be much less than what's needed to encode the model. (Generally, the number of parameters of models would be comparable to the number of inferences done during pretraining, and completely dominate the number of inferences done in fine-tuning / RLHF / etc., and definitely any steps that involve human interactions.)

Then there are the information-security aspects. You could (and at some point probably should) regulate cyber-security practices during the training phase. After all, if we do want to regulate deployment, then we need to ensure there are three separated phases - (1) training, (2) testing, (3) deployment - and we don't want "accidental deployment" where we jump from phase (1) to (3). Maybe at some point, there would be something like Intel SGX for GPUs? Whether AI helps the defender or the attacker more in the cyber-security setting is an open question. But it definitely helps the side that has access to stronger AIs. In any case, one good thing about focusing regulation on cyber-security aspects is that, while not perfect, we have decades of experience in the

I'm curious why you are "not worried in any near future about AI 'escaping.'" It seems very hard to be confident in even pretty imminent AI systems' lack of capability to do a particular thing, at this juncture.


To be clear, "it turns out to be trivial to make the AI not want to escape" is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like "Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged" might not have many or any "use cases."

A number of other measures, including AI checks and balances, also seem like they might work pretty easily... (read more)

Max H
I see, thanks for clarifying. I agree that it might be straightforward to catch bad behavior (e.g. deception), but I expect that RL methods will work by training away the ability of the system to deceive, rather than the desire.[1] So even if such training succeeds, in the sense that the system robustly behaves honestly, it will also no longer be human-level-ish, since humans are capable of being deceptive.

Maybe it is possible to create an AI system that is like the humans in the movie The Invention of Lying, but that seems difficult and fragile. In the movie, one guy discovers he can lie, and suddenly he can run roughshod over his entire civilization. The humans in the movie initially have no ability to lie, but once the main character discovers it, he immediately realizes its usefulness. The only thing that keeps other people from making the same realization is the fictional conceit of the movie.

Or, paraphrasing Nate: the ability to deceive is a consequence of understanding how the world works on a sufficiently deep level, so it's probably not something that can be trained away by RL, without also training away the ability to generalize at human levels entirely.

OTOH, if you could somehow imbue an innate desire to be honest into the system without affecting its capabilities, that might be more promising. But again, I don't think that's what SGD or current RL methods are actually doing. (Though it is hard to be sure, in part because no current AI systems appear to exhibit desires or inner motivations of any kind. I think attempts to analogize the workings of such systems to desires in humans and components in the brain are mostly spurious pattern-matching, but that's a different topic.)

[1] In the words of Alex Turner, in RL, "reward chisels cognitive grooves into an agent". Rewarding non-deceptive behavior could thus chisel away the cognition capable of performing the deception, but that cognition might be what makes the system human-level

Noting that I don't think alignment being "solved" is a binary.  As discussed in the post, I think there are a number of measures that could improve our odds of getting early human-level-ish AIs to be aligned "enough," even assuming no positive surprises on alignment science. This would imply that if lab A is more attentive to alignment and more inclined to invest heavily in even basic measures for aligning its systems than lab B, it could matter which lab develops very capable AI systems first.

Thanks for this comment - I get vibes along these lines from a lot of people but I don't think I understand the position, so I'm enthused to hear more about it.

> I believe that by not touching the "decrease the race" or "don't make the race worse" interventions, this playbook misses a big part of the picture of "how one single think could help massively". 

"Standards and monitoring" is the main "decrease the race" path I see. It doesn't seem feasible to me for the world to clamp down on AI development unconditionally, which is why I am more focused ... (read more)

simeon_c
Thanks for the clarifications.

  1. I think we agree on the fact that "unless it's provably safe" is the best version of trying to get a policy slowdown.
  2. I believe there are many interventions that could help on the slowdown side, most of which are unfortunately not compatible with the successful careful AI lab. The main struggle that a successful careful AI lab encounters is that it has to trade off tons of safety principles along the way, essentially because it needs to attract investors & talent, and attracting investors & talent is hard if you say too loudly that we should slow down as long as our thing is not provably safe. So de facto a successful careful AI lab will be a force against slowdown & a bunch of other relevant policies in the policy world. It will also be a force for the perceived race, which is making things harder for every actor.

Other interventions for slowdown are mostly in the realm of public advocacy. Mostly drawing upon the animal welfare activism playbook, you could use public campaigns to de facto limit the ability of labs to race, via corporate or policy advocacy campaigns.

I guess, heuristically, I tend to take arguments of the form "but others would have done this bad thing anyway" with some skepticism, because I think they tend to assume too much certainty over the counterfactual, in part due to many second-order effects (e.g. the existence of one marginal key player increases the chances that more players invest, shows that competition is possible, etc.) that tend to be hard to compute (but are sometimes observable ex post). In this specific case I think it's not right that there are "lots of players" close to the frontier. If we take the case of OA and Anthropic for example, there are about 0 players at their level of deployed capabilities. Maybe Google will deploy at some point but they haven't been serious players for the past 7 months. So if Anthropic hadn't been around, OA could have chilled longer at ChatGPT level, an

I think it is not at all about boxing - I gave the example I did to make a clear distinction with the "number of steps between audits" idea.

For the distinction with boxing, I'd focus on what I wrote at the end: "The central picture of process-based feedback isn’t either of these, though - it’s more like 'Let the AI do whatever, but make sure all supervision is based on randomly auditing some step the AI takes, having it generate a few alternative steps it could’ve taken, and rating those steps based on how good they seem, without knowing how they will turn out. The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff.'"

Steven Byrnes
Sorry. Thanks for your patience. When you write: …I don’t know what a “step” is. As above, if I sit on my couch staring into space brainstorming for an hour and then write down a plan, how many “steps” was that? 1 step or 1000s of steps?

Hmm. I am concerned that the word “step” (and relatedly, “process”) is equivocating between two things:

  • Def'n 1: A “step” is a certain amount of processing that leads to a sub-sub-plan that we can inspect / audit.
  • Def'n 2: A “step” is sufficiently small and straightforward that inside of one so-called “step” we can rest assured that there is no dangerous consequentialist means-end reasoning, creative out-of-the-box brainstorming, strategizing, etc.

I feel like we are not entitled to use Def'n 2 without interpretability / internals-based supervision—or alternatively very very short steps as in LLMs maybe—but that you have been sneaking in Def'n 2 by insinuation. (Sorry if I’m misunderstanding.)

Anyway, under Def'n 1, we are giving gradient updates towards agents that do effective means-end reasoning towards goals, right? Because that’s a good way to come up with a sub-sub-plan that human inspection / auditing will rate highly. So I claim that we are plausibly gradient-updating to make “within-one-step goal-seeking agents”. Now, we are NOT gradient-updating aligned agents to become misaligned (except in the fairly-innocuous “Writing outputs that look better to humans than they actually are” sense). That’s good! But it seems to me that we got that benefit entirely from the boxing. (I generally can’t think of any examples where “The AI has plenty of scope to do dastardly stuff, but you are never giving gradient updates toward doing the dastardly stuff” comes apart from boxing, that’s also consistent with everything else you’ve said.)

I don't think of process-based supervision as a totally clean binary, but I don't think of it as just/primarily being about how many steps you allow in between audits. I think of it as primarily being about whether you're doing gradient updates (or whatever) based on outcomes (X was achieved) or processes (Y seems like a well-reasoned step to achieve X). I think your "Example 0" isn't really either - I'd call it internals-based supervision. 

I agree it matters how many steps you allow in between audits, I just think that's a different distinction.

Here’... (read more)

Steven Byrnes
OK, I think this is along the lines of my other comment above: Most of your reply makes me think that what you call “process-based supervision” is what I call “Put the AI in a box, give it tasks that it can do entirely within the box, prevent it from escaping the box (and penalize it if you catch it trying), and hope that it doesn’t develop goals & strategies that involve trying to escape the box via generalization and situation awareness.” Insofar as that’s what we’re talking about, I find the term “boxing” clearer and “process-based supervision” kinda confusing / misleading.

Specifically, in your option A (“give the AI 10 years to produce a plan…”):

  • my brain really wants to use the word “process” for what the AI is doing during those 10 years,
  • my brain really wants to use the word “outcome” for the plan that the AI delivers at the end.

But whatever, that’s just terminology. I think we both agree that doing that is good for safety (on the margin), and also that it’s not sufficient for safety. :)

Separately, I’m not sure what you mean by “steps”. If I sit on my couch brainstorming for an hour and then write down a plan, how many “steps” was that?

Hm, I think we are probably still missing each other at least somewhat (and maybe still a lot), because I don't think the interpretability bit is important for this particular idea - I think you can get all the juice from "process-based supervision" without any interpretability.

I feel like once we sync up you're going to be disappointed, because the benefit of "process-based supervision" is pretty much just that you aren't differentially reinforcing dangerous behavior. (At worst, you're reinforcing "Doing stuff that looks better to humans than it actually ... (read more)

Steven Byrnes
Hmm. I think “process-based” is a spectrum rather than a binary. Let’s say there’s a cycle:

  • AI does some stuff P1
  • and then produces a human-inspectable work product O1
  • AI does some stuff P2
  • and then produces a human-inspectable work product O2
  • …

There’s a spectrum based on how long each P cycle is:

Example 1 (“GPT with process-based supervision”):
  • “AI does some stuff” is GPT-3 running through 96 serial layers of transformer-architecture computations.
  • The “human-inspectable work product” is GPT-3 printing a token and we can look at it and decide if we’re happy about it.

Example 2 (“AutoGPT with outcome-based supervision”):
  • “AI does some stuff” is AutoGPT spending 3 days doing whatever it thinks is best.
  • The “human-inspectable work product” is I see whether there is extra money in my bank account or not.

Example 0 (“Even more process-based than example 1”):
  • “AI does some stuff” is GPT-3 stepping through just one of the 96 layers of transformer-architecture computations.
  • The “human-inspectable work product” is the activation vector at this particular NN layer. (Of course, this is only “human-inspectable” if we have good interpretability!)

I think that it’s good (for safety) to shorten the cycles, i.e. Example 2 is more dangerous than Example 1 which is more dangerous than Example 0. I think we’re in agreement here. I also think it’s good (for safety) to try to keep the AI from manipulating the real world and seeing the consequences within a single “AI does some stuff” step, i.e. Example 2 is especially bad in a way that neither Examples 0 nor 1 are. I think we’re in agreement here too.

I don’t think either of those good ideas is sufficient to give us a strong reason to believe the AI is safe. But I guess you agree with that too. (“…at best highly uncertain rather than "strong default of danger."”) Yeah, basically that. My concerns are:

  • We’re training the AI to spend each of its “AI does some stuff” periods doing t

I'm not sure what your intuitive model is and how it differs from mine, but one possibility is that you're picturing a sort of bureaucracy in which we simultaneously have many agents supervising each other (A supervises B who supervises C who supervises D ...) whereas I'm picturing something more like: we train B while making extensive use of A for accurate supervision, adversarial training, threat assessment, etc. (perhaps allocating resources such that there is a lot more of A than B and generally a lot of redundancy and robustness in our alignment effor... (read more)
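A rough sketch of that second arrangement, with `A` and `B` as assumed objects whose methods are purely illustrative (this is not any real training API):

```python
def train_next_generation(A, B, tasks, audits_per_task=3):
    """Sketch of using the current trusted model A extensively and redundantly
    to supervise, stress-test, and threat-assess the next model B during
    training, rather than relying on a long chain of supervisors.
    """
    for task in tasks:
        attempt = B.attempt(task)
        # Redundancy: several independent passes of A audit each attempt.
        ratings = [A.rate(task, attempt) for _ in range(audits_per_task)]
        # Adversarial training: A searches for inputs designed to elicit bad behavior.
        for probe in A.generate_adversarial_inputs(task):
            ratings.append(A.rate(probe, B.attempt(probe)))
        # Threat assessment: halt and investigate rather than continue blindly.
        if A.assesses_as_dangerous(attempt):
            break
        B.update_from_ratings(task, attempt, ratings)
```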

Seth Herd
This was very helpful, thank you! You were correct about how my intuitions differed from your plan. This does seem more likely to work than the scheme I was imagining.

We got it! You should get an update within a week.

I think that's a legit disagreement. But I also claim that the argument I gave still works if you assume that AI is trained exclusively using RL - as long as that RL is exclusively "process-based." So this basic idea: the AI takes a bunch of steps, and gradient descent is performed based on audits of whether those steps seem reasonable while blinded to what happened as a result. 

It still seems, here, like you're not reinforcing unintended behaviors, so the concern comes exclusively from the kind of goal misgeneralization you'd get without having any p... (read more)

Steven Byrnes
Ohh, sorry you had to tell me twice, but maybe I’m finally seeing where we’re talking past each other. Back to the OP, you wrote: When I read that, I was thinking that you meant:

  • I type in: “Hey AI, tell me a plan for ethically making lots of money”
  • The AI brainstorms for an hour
  • The AI prints out a plan
  • I grade the plan (without actually trying to execute it), and reward the AI / backprop-through-time the AI / whatever based on that grade.

But your subsequent replies make me think that this isn’t what you meant, particularly the “brainstorm for an hour” part. …But hold that thought while I explain why I don’t find the above plan very helpful (just so you understand my previous responses):

  • A whole lot is happening during the hour that the AI is brainstorming
  • We have no visibility into any of that, and very weak control over it (e.g. a few bits of feedback on a million-step brainstorming session)
  • I think RL with online-learning is central to making the brainstorming step actually work, capabilities-wise
  • I likewise think that RL process would need to be doing lots of recursing onto instrumental subgoals and finding new creative problem-solving strategies etc.
  • Even if its desires are something like “I want to produce a good plan”, then it would notice that hacking out of the box would be instrumentally useful towards that goal.

OK, so that’s where I was coming from in my previous replies. But, now I no longer think that the above is what you meant in the first place. Instead I think you meant:

  • I type in: “Hey AI, tell me a plan for ethically making lots of money”
  • The AI prints out every fine-grained step of the process by which it answers that question
  • I do random local audits of that printout (without actually trying to execute the whole plan).

Is that right? If so, that makes a lot more sense. In my (non-LLM) context, I would re-formulate the above as something like:

  • The AI is doing whatever
  • We sometimes pick ran

Some reactions on your summary:

  • In process-based training, X = “produce a good plan to make money ethically”

This feels sort of off as a description - what actually might happen is that it takes a bunch of actual steps to make money ethically, but steps are graded based on audits of whether they seem reasonable without the auditor knowing the outcome.

  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

The latter is the bigger concern, unless you mean the former as aimed at something like the latter. E.g., pro... (read more)

Steven Byrnes
Thanks, that all makes sense. I think an important thing behind the scenes here is that you’re hopeful that we can get to TAI using mostly self-supervised learning, whereas I’m expecting that people are eventually going to switch to AI approaches that involve RL in a much more central way than LLMs do today. (More like humans, where for every thought you think, you’re thinking it in part because it’s more rewarding than whatever alternate thoughts you might think instead.) In other words, I’m planning for futures where your “Hypothetical Training Approach” doesn’t work. I think it’s a good rule-of-thumb to say that the more centrally RL is involved in TAI, as opposed to self-supervised learning, the harder is safety & alignment in general. (So, I’m pretty unhappy to hold the belief that RL will probably be very centrally involved in TAI!) So anyway, in the scenarios I’m thinking about, I think “goal misgeneralization and instrumental reasoning” is far more likely and difficult to avoid, compared to what you’re thinking about. I suppose process-based optimization would help a nonzero amount even in my scenarios, but sufficiently little that it doesn’t really seem worth thinking about. (Obviously, nobody knows how to make TAI, so I think it’s fine and healthy for different people to be thinking through different scenarios! We’ll find out sooner or later!)

This feels a bit to me like assuming the conclusion. "Rose" is someone who already has aims (we assume this when we imagine a human); I'm talking about an approach to training that seems less likely to give rise to dangerous aims. The idea of the benefit, here, is to make dangerous aims less likely (e.g., by not rewarding behavior that affects the world through unexpected and opaque pathways); the idea is not to contain something that already has dangerous aims (though I think there is some hope of the latter as well, especially with relatively early human-level-ish AI systems).

Steven Byrnes
That was helpful, thanks! I will try to rephrase. Let’s consider two failure modes. First, define some terms:

X is what we want the AI to be trying to do:
  • In outcome-based training, X = “make money ethically”
  • In process-based training, X = “produce a good plan to make money ethically”

Y is something we want the AI to not try to do:
  • In outcome-based training, maybe Y = “secretly steal money”
  • In process-based training, maybe Y = “produce a deliberately deceptive plan” or “hack out of the box”.

Now consider two failure modes.

FAILURE MODE 1:
  • We were hoping for the AI to want to do X.
  • AI does Y, a little bit, randomly or incompetently.
  • AI is rewarded for doing Y.
  • AI starts trying to do Y and generalizations-of-Y more and more.

FAILURE MODE 2:
  • We were hoping for the AI to want to do X.
  • AI wants to do Y.
  • AI does Y when it finds an opportunity to do so successfully.

My understanding is that you’re thinking about Failure Mode 1 here, and you’re saying that process-based training will help because there it’s less difficult to supervise really well, such that we’re not rewarding the AI for doing Y a little bit / incompetently / randomly. If so—OK, fair enough.

However, we still need to deal with Failure Mode 2. One might hope that Failure Mode 2 won’t happen because the AI won’t want to do Y in the first place, because after all it’s never done Y before and got rewarded. However, you can still get Y from goal misgeneralization and instrumental reasoning. (E.g., it’s possible for the AI to generalize from its reward history to “wanting to get reward [by any means necessary]”, and then it wants to hack out of the box for instrumental reasons, even if it’s never done anything like that before.)

So, I can vaguely imagine plans along the lines of:
  • Solve Failure Mode 1 by giving near-perfect rewards
  • Solve Failure Mode 2 by, ummm, out-of-distribution penalties / reasoning about inductive biases / adversarial training / something

I think that as people push AIs to do more and more ambitious things, it will become more and more likely that situational awareness comes along with this, for reasons broadly along the lines of those I linked to (it will be useful to train the AI to have situational awareness and/or other properties tightly linked to it).

I think this could happen via RL fine-tuning, but I also think it's a mistake to fixate too much on today's dominant methods - if today's methods can't produce situational awareness, they probably can't produce as much value as possible, ... (read more)

Is the disagreement here about whether AIs are likely to develop things like situational awareness, foresightful planning ability, and understanding of adversaries' decisions as they are used for more and more challenging tasks?

I think this piece represents my POV on this pretty well, especially the bits starting around here

Bruce G
My thought on this is: if a baseline AI system does not have situational awareness before the AI researchers start fine-tuning it, I would not expect it to obtain situational awareness through reinforcement learning with human feedback.

I am not sure I can answer this for the hypothetical "Alex" system in the linked post, since I don't think I have a good mental model of how such a system would work or what kind of training data or training protocol you would need to have to create such a thing. If I saw something that, from the outside, appeared to exhibit the full range of abilities Alex is described as having (including advancing R&D in multiple disparate domains in ways that are not simple extrapolations of its training data), I would assign a significantly higher probability to that system having situational awareness than I do to current systems.

If someone had a system that was empirically that powerful, which had been trained largely by reinforcement learning, I would say the responsible thing to do would be:

  1. Keep it air-gapped rather than unleashing large numbers of copies of it onto the internet.
  2. Carefully vet any machine blueprints, drugs or other medical interventions, or other plans or technologies the system comes up with (perhaps first building a prototype to gather data on it in an isolated controlled setting where it can be quickly destroyed) to ensure safety before deploying them out into the world.

The 2nd of those would have the downside that beneficial ideas and inventions produced by the system would take longer to get rolled out and have a positive effect. But it would be worth it in that context to reduce the risk of some large unforeseen downside.

It seems like the same question would apply to humans trying to solve the alignment problem - does that seem right? My answer to your question is "maybe", but it seems good to get on the same page about whether "humans trying to solve alignment" and "specialized human-ish safe AIs trying to solve alignment" are basically the same challenge.

The hope discussed in this post is that you could have a system that is aligned but not superintelligent (more like human-level-ish, and aligned in the sense that it is imitation-ish), doing the kind of alignment work humans are doing today, which could hopefully lead to a more scalable alignment approach that works on more capable systems.

Guillaume Charrier
But then would a less intelligent being  (i.e. the collectivity of human alignment researchers and less powerful AI systems that they use as tool in their research) be capable of validly examining a more intelligent being, without being deceived by the more intelligent being?

I think this kind of thing is common among humans. Employees might appear to be accomplishing the objectives they were given, with distortions hard to notice (and sometimes noticed, sometimes not) - e.g., programmers cutting corners and leaving a company with problems in the code that don't get discovered until later (if ever). People in government may appear to be loyal to the person in power, while plotting a coup, with the plot not noticed until it's too late. I think the key question here is whether AIs might get situational awareness and other abilities comparable to those of humans. 

Bruce G
Those 2 types of downsides, creating code with a bug versus plotting a takeover, seem importantly different.

I can easily see how an LLM-based app fine-tuned with RLHF might generate the first type of problem. For example, let’s say some GPT-based app is trained using this method to generate the code for websites in response to prompts describing how the website should look and what features it should have. And let’s suppose during training it generates many examples that have some unnoticed error - maybe it does not render properly on certain size screens, but the evaluators all have normal-sized screens where that problem does not show up. If the evaluators rated many websites with this bug favorably, then I would not be surprised if the trained model continued to generate code with the same bug after it was deployed. But I would not expect the model to internally distinguish between “the humans rated those examples favorably because they did not notice the rendering problem” versus “the humans liked the entire code including the weird rendering on larger screens”. I would not expect it to internally represent concepts like “if some users with large screens notice and complain about the rendering problem after deployment, OpenAI might train a new model and rate those websites negatively instead”, or to care about whether this would eventually happen, or to take any precautions against the rendering issue being discovered.

By contrast, the coup-plotting problem is more similar to the classic AI takeover scenario. And that does seem to require the type of foresight and situational awareness to distinguish between “the leadership lets me continue working in the government because they don’t know I am planning a coup” versus “the leadership likes the fact that I am planning to overthrow them”, and to take precautions against your plans being discovered while you can still be shut down. I don’t think an AI system gets the latter type of ability just as an accidental s

I think the more capable AI systems are, the more we'll see patterns like "Every time you ask an AI to do something, it does it well; the less you put yourself in the loop and the fewer constraints you impose, the better and/or faster it goes; and you ~never see downsides." (You never SEE them, which doesn't mean they don't happen.)

I think the world is quite capable of handling a dynamic like that as badly as in my hypothetical scenario, especially if things are generally moving very quickly - I could see a scenario like the one above playing out in a handful of years or faster, and it often takes much longer than that for e.g. good regulation to get designed and implemented in response to some novel problem.

Bruce G
This, again, seems unlikely to me. For most things that people seem likely to use AI for in the foreseeable future, I expect downsides and failure modes will be easy to notice.  If self-driving cars are crashing or going to the wrong destination, or if AI-generated code is causing the company's website to crash or apps to malfunction, people would notice those. Even if someone has an AI that he or she just hooks it up to the internet and give it the task "make money for me", it should be easy to build in some automatic record-keeping module that keeps track of what actions the AI took and where the money came from.  And even if the user does not care if the money is stolen, I would expect the person or bank that was robbed to notice and ask law enforcement to investigate where the money went. Can you give an example of some type of task for which you would expect people to frequently use AI, and where there would reliably be downside to the AI performing the task that everyone would simply fail to notice for months or years?

I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems like quite comparable to future humans having control over the future (also after a lot of effective "reflection").

I think there's some validity to worrying about a future with very different values from today'... (read more)

I see, thanks. I feel like the closest analogy here that seems viable to me would be to something like: is Open Philanthropy able to hire security experts to improve its security and assess whether they're improving its security? And I think the answer to that is yes. (Most of its grantees aren't doing work where security is very important.)

It feels harder to draw an analogy for something like "helping with standards enforcement," but maybe we could consider OP's ability to assess whether its farm animal welfare grantees are having an impact on who adheres to what standards, and how strong adherence is? I think OP has pretty good (not perfect) ability to do so.

(Chiming in late, sorry!)

I think #3 and #4 are issues, but can be compensated for if aligned AIs outnumber or outclass misaligned AIs by enough. The situation seems fairly analogous to how things are with humans - law-abiding people face a lot of extra constraints, but are still collectively more powerful.

I think #1 is a risk, but it seems <<50% likely to be decisive, especially when considering (a) the possibility for things like space travel, hardened refuges, intense medical interventions, digital people, etc. that could become viable with aligned... (read more)

(Apologies for the late reply!) For now, my goal is to write something that interested, motivated nontechnical people can follow - the focus is on the content being followable rather than on distribution. I've tried to achieve this mostly via nontechnical beta (and alpha) readers.

Doing this gives me something I can send to people when I want them to understand where I'm coming from, and it also helps me clarify my own thoughts (I tend to trust ideas more when I can explain them to an outsider, and I think that getting to that point helps me get clear on wh... (read more)

I think I find the "grokking general-purpose search" argument weaker than you do, but it's not clear by how much.

The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil. You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?

4johnswentworth
The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In particular, I'm guessing that you've found first hand that things are much harder to properly evaluate than it might seem at first glance. If you think generic "humans" (or humans at e.g. Anthropic/OpenAI/Deepmind, or human regulators, or human ????) are going to be better at the general skill of evaluating outputs than yourself or the humans at Open Phil, then I think you underestimate the skills of you and your staff relative to most humans. Most people do not perform any minimal-trust investigations. So I expect your experience here to provide a useful conservative baseline.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts.

It seems to me like the basic equation is something like: "If today's alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs." There are reasons this could fail (perhaps future alignmen... (read more)

4johnswentworth
Indeed, I think you're a good role model in this regard and hope more people will follow your example.

I think there is hope in measures along these lines, but my fear is that it is inherently more complex (and probably slower) to do something like "Make sure to separate plan generation and execution; make sure we can evaluate how a plan is going using reliable metrics and independent assessment" than something like "Just tell an AI what we want, give it access to a terminal/browser and let it go for it."

When AIs are limited and unreliable, the extra effort can be justified purely on grounds of "If you don't put in the extra effort, you'll get results too unr... (read more)

1Bruce G
  I would expect people to be most inclined to do this when the AI is given a task that is very similar to other tasks that it has a track record of performing successfully - and by relatively standard methods so that you can predict the broad character of the plan without looking at the details. For example, if self-driving cars get to the point where they are highly safe and reliable, some users might just pick a destination and go to sleep without looking at the route the car chose.  But in such a case, you can still be reasonably confident that the car will drive you there on the roads - rather than, say, going off road or buying you a plane ticket to your destination and taking you to the airport.

I think it is less likely that most people will want to deploy mostly untested systems to act freely in the world unmonitored - and have them pursue goals by implementing plans where you have no idea what kind of plan the AI will come up with.  Especially if - as in the case of the AI that hacks someone's account to steal money for example - the person or company that deployed it could be subject to legal liability (assuming we are still talking about a near-term situation where human legal systems still exist and have not been overthrown or abolished by any super-capable AI).

I agree that having more awareness of the risks would - on balance - tend to make people more careful about testing and having safeguards before deploying high-impact AI systems.  But it seems to me that this post contemplates a scenario where even with lots of awareness people don't take adequate precautions.  On my reading of this hypothetical:

* Lots of things are known to be going wrong with AI systems.
* Reinforcement learning with human feedback is known to be failing to prevent many failure modes, and frequently makes it take longer for the problem to be discovered, but nobody comes up with a better way to prevent those failure modes.
* In spite of this, lots of people and companies ke

Thanks! I agree this is a concern. In theory, people who are constantly thinking about the risks should be able to make a reasonable decision about "when to pause", but in practice I think there is a lot of important work to do today making the "pause" more likely in the future, including on AI safety standards and on the kinds of measures described at https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/

I think there's truth to what you're saying, but I think the downsides of premature government involvement are big too. I discuss this more in a followup post.

(Apologies for the late reply!) I think working on improved institutions is a good goal that could potentially help, and I'm excited about some of the work going on in general categories you mentioned. It's not my focus because (a) I do think the "timelines don't match up" problem is big; (b) I think it's really hard to identify specific interventions that would improve all decision-making - it's really hard to predict the long-run effects of any given reform (e.g., a new voting system) as the context changes. Accordingly, what feels most pressing to me is... (read more)

It seems like we could simply try to be as vigilant elsewhere as we would be without this measure, and then we could reasonably expect this measure to be net-beneficial (*how* net beneficial is debatable).

I now think I wrote that part poorly. The idea isn't so much that we say to an AI, "Go out and do whatever you need to do - accumulate money, hire analysts, run experiments, etc. - and come back with a plan that we will evaluate."

The idea is more like this:

  1. We want to accomplish X.
  2. We describe X to an AI.
  3. The AI proposes a next step toward X, based entirely on thinking about it (and not doing other stuff like e.g. hiring researchers - though its proposed next step can be "Hire researchers").
  4. With chance (1-p), the step is simply executed, with no gradient desc
... (read more)
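(A minimal sketch of how that propose/audit/execute loop might look in code, assuming the audit branch is where the evaluation and training signal happen - as in the steps above and in the paraphrase below. All function names here, such as propose_next_step and apply_training_signal, are hypothetical placeholders, not anything from the original discussion.)

```python
import random

# Hypothetical stand-ins for the pieces of the scheme: in reality these would be
# an AI system's proposal step, a careful human/assisted evaluation, a training
# update, and actual execution of the step in the world.
def propose_next_step(goal: str) -> str:
    return f"(model-proposed next step toward: {goal})"

def evaluate_proposal(goal: str, proposal: str) -> float:
    return 1.0  # placeholder rating from careful human review

def apply_training_signal(proposal: str, rating: float) -> None:
    pass  # placeholder for the reward/penalty update

def execute_step(proposal: str) -> None:
    print(f"executing: {proposal}")

def run_task(goal: str, p_evaluate: float = 0.2, max_steps: int = 5) -> None:
    """With probability p_evaluate, drill down on the proposal and train on the
    rating; otherwise simply execute the proposed step."""
    for _ in range(max_steps):
        proposal = propose_next_step(goal)           # step 3: pure "thinking"
        if random.random() < p_evaluate:
            rating = evaluate_proposal(goal, proposal)   # step 4, audit branch
            apply_training_signal(proposal, rating)
        else:
            execute_step(proposal)                   # step 4, execute branch

run_task("accomplish X")
```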
2Steven Byrnes
Thanks! Hmm. I wasn’t making the mistake that you attribute to me in your 1st paragraph. I did get that. Yeah, I was talking about the “inner alignment problem leading to an AI optimizing for something other than how its proposed steps are rated”. (Or “goal misgeneralization” or whatever.)

Like, let’s replace the AI with a very smart person, Rose, who we’re keeping in a box. (I’ll get back to the question of whether this set of intuitions is helpful vs misleading.) Your text becomes something like:

  1. We want to accomplish X.
  2. We describe X to Rose.
  3. Rose proposes a next step towards X, based entirely on thinking about it for 15 minutes.
  4. With chance (1-p), the step is simply executed. With chance p, we drill down on whether it was a good proposal, and give Rose a reward or punishment based on that analysis.

In this context, the obvious question is: What is Rose actually trying to do? Is she trying to propose a good next step, as an end in itself, or is she out to get us? If she’s out to get us, then problems arise in both Steps 3 & 4.

* For Step 3, the text above says “based entirely on thinking about it for 15 minutes”, but Rose doesn’t care. When it’s time for Step 3, Rose will spend many of those 15 minutes actively trying to escape from her box. That’s why I was talking about a “perfect, inescapable box” in my previous comment.
* And then in Step 4, we’re thinking about better proposals and worse proposals, but we might not be prepared for adversarially-chosen proposals. The stereotypical example is that Rose sends in a so-called “proposal” that’s just a text file saying “Help me help me, I’m trapped in a box, it’s awful in here, let me tell you about it…”. 😛

So anyway, that’s a different set of intuitions. Whether it’s a helpful set of intuitions depends on whether SOTA AI algorithms will eventually have agent-y properties like planning, instrumental convergence, creative outside-the-box brainstorming, self-awareness / situational-awareness