ryan_greenblatt

I work at Redwood Research.

I do think that many of the safety advantages of LLMs come from their understanding of human intentions (and therefore implied values).

Did you mean something different from "AIs understand our intentions" (e.g., maybe you meant that humans can understand the AI's intentions)?

I think future more powerful AIs will surely be strictly better at understanding what humans intend.

Aren't text names basically similar in practice? At least for me, I find they trigger basically the same thing because I do actually associate names with people.

Maybe this wouldn't be true if I didn't know people very well (but in that case, icons also wouldn't matter).

(I overall dislike icons, but I don't have a principled reason for this.)

In brief: large amounts of high-quality process-based RL might result in AIs being more useful earlier (prior to their becoming much smarter). This might be expensive and annoying (e.g., it might require huge amounts of high-quality human labor), such that by default labs do less of it, relative to just scaling up models, than would be optimal from a safety perspective.

I agree that if the model constructing the backdoor is allowed to coordinate on secrets with the model that inserted the backdoor, you can do this sort of trickery (by coordinating on a secret for this sort of scheme).

In practice, we didn't model this sort of case as it didn't seem like a very important dynamic (for the reason you describe around suspicion).

If we don't allow any coordination on a shared secret and the model constructing the backdoor and H are identical (as in our paper), then these should operate identically (up to non-determinism).

And we simplify to just fixing the behavior to be exactly identical. (Effectively assuming the exact same prompts and the exact same non-determinism.)

Thanks for the objection.

Answer by ryan_greenblatt

If you avoid using RL, then you might need a much "smarter" model for a given level of usefulness.

And even without RL, you need to be getting bits of selection from somewhere: to get useful behavior, you have to at the very least specify what useful behavior would be (though the absolute minimum number of bits would be very small given a knowledgeable model). (So some selection or steering is surely required, but you might hope this selection/steering is safer for some reason, or perhaps more interpretable (as, e.g., prompting can in principle be).)
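As a rough, purely illustrative way to see the "bits of selection" point (the numbers below are made up, not from the original comment), compare the upper bound on how many bits a task-specifying prompt can supply versus an outcome-based reward signal:

```python
import math

# Purely illustrative numbers -- not from the original comment.
VOCAB_SIZE = 50_000          # tokens the model can choose from
PROMPT_TOKENS = 30           # length of a task-specifying prompt
RL_EPISODES = 10_000         # rollouts scored with a binary reward
BITS_PER_REWARD_LABEL = 1.0  # a binary label carries at most 1 bit

# A prompt of k tokens carries at most k * log2(V) bits of selection.
prompt_bits = PROMPT_TOKENS * math.log2(VOCAB_SIZE)

# Outcome-based RL with binary rewards supplies at most ~1 bit per episode.
rl_bits = RL_EPISODES * BITS_PER_REWARD_LABEL

print(f"prompting: at most ~{prompt_bits:.0f} bits of selection")
print(f"RL:        at most ~{rl_bits:.0f} bits of selection")
```

The point is just that both prompting and RL are sources of selection pressure; the question is which sources are safer or more interpretable, not whether selection happens at all.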

Dramatically cutting down on RL might imply that you need a much, much smarter model overall. (For instance, the safety proposal discussed in "conditioning predictive models" seems to me like it would require a dramatically smarter model than would be required if you used RL normally (if this stuff worked at all).)

Given that a high fraction of the concern (IMO) is proportional to how smart your model is, needing a much smarter model seems very concerning.

OK, so cutting RL can come with costs; what about the benefits of cutting RL? I think the main concern with RL is that it either teaches the model dangerous things we didn't actually need it to learn, or gives it dangerous habits/propensities. For instance, it might teach models to consider extremely creative strategies which humans would never have thought of and which humans don't understand at all. It's not clear we need this to do extremely useful things with AIs. Another concern is that some types of outcome-based RL will teach the AI to cleverly exploit our reward provisioning process, which results in a bunch of problems.

But, there is a bunch of somewhat dangerous stuff that RL teaches which seems clearly needed for high usefulness. So, if we fix the level of usefulness, this stuff has to be taught to the model by something. For instance, being a competent agent that is at least somewhat aware of its own abilities is probably required. So, when thinking about cutting RL, I don't think you should be thinking about cutting agentic capabilities as that is very likely required.

My guess is that much more of the action is not in "how much RL", but instead in "how much RL of the type that seems particularly dangerous and which doesn't result in massive increases in usefulness". (Which mirrors porby's answer to some extent.)

In particular we'd like to avoid:

  1. RL that will result in AIs learning to pursue clever strategies that humans don't understand or at least wouldn't think of. (Very inhuman strategies.) (See also porby's answer which seems basically reasonable to me.)
  2. RL on exploitable outcome-based feedback that results in the AI actually doing the exploitation a non-trivial fraction of the time.

(Weakly exploitable human feedback without the use of outcomes (e.g. the case where the human reviews the full trajectory and rates how good it seems overall) seems slightly concerning, but much less concerning overall. Weak exploitation could be things like sycophancy or knowing when to lie/deceive to get somewhat higher performance.)

Then the question is just how much of a usefulness tax it is to cut back on these types of RL, and then whether this usefulness tax is worth it given that it implies we have to have a smarter model overall to reach a fixed level of usefulness.

(Type (1) of RL from the above list is eventually required for AIs with general-purpose, qualitatively wildly superhuman capabilities (e.g. the ability to execute very powerful strategies that humans have a very hard time understanding), but we can probably get almost everything we want done without such powerful models.)

My guess is that in the absence of safety concerns, society will do too much of these concerning types of RL, but might actually do too little of safer types of RL that help to elicit capabilities (because it is easier to just scale up the model further than to figure out how to maximally elicit capabilities).

(Note that my response ignores the cost of training "smarter" models and just focuses on hitting a given level of usefulness as this seems to be the requested analysis in the question.)

Accumulating power and resources via mechanisms such as (but not limited to) hacking seems pretty central to me.

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs".

As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

  • The speed up is relative to the current status quo as of GPT-4.
  • The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
  • By "capable" of speeding things up this much, I mean that if AIs "wanted" to speed up this task and if we didn't have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
  • The 2x increase in marginal cost ignores potential inflation in the cost of compute (FLOP/$) and in the wages of ML researchers. Otherwise, I'm uncertain exactly how to model the situation; maybe increases in wages and decreases in FLOP/$ cancel out? Idk.
  • It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.
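As a minimal arithmetic sketch of what this bar means (all concrete numbers here are hypothetical, chosen only to make the definition concrete):

```python
# Minimal arithmetic sketch of the "30x speedup, <2x marginal cost" bar.
# All concrete numbers are hypothetical.

baseline_months = 6.0   # time for a fixed research task with GPT-4-era tooling
baseline_cost = 1.0     # marginal cost of that task, normalized to 1

ai_months = baseline_months / 30   # the AIs complete the same task 30x faster
ai_cost = 1.8 * baseline_cost      # while marginal cost stays under 2x

speedup = baseline_months / ai_months
cost_ratio = ai_cost / baseline_cost

meets_bar = (speedup >= 30) and (cost_ratio < 2)
print(f"speedup: {speedup:.0f}x, cost ratio: {cost_ratio:.1f}x, meets bar: {meets_bar}")
```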

I'm uncertain what the economic impact of such systems will look like. I could imagine it being either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven't yet been that widely deployed due to inference availability issues, so actual production hasn't increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).

So, it's hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.

Random error:

Exponential Takeoff:

AI's capabilities grow exponentially, like an economy or pandemic.

(Oddly, this scenario often gets called "Slow Takeoff"! It's slow compared to "FOOM".)

Actually, this isn't how people (in the AI safety community) generally use the term "slow takeoff".

Quoting from the blog post by Paul:

Futurists have argued for years about whether the development of AGI will look more like a breakthrough within a small group (“fast takeoff”), or a continuous acceleration distributed across the broader economy or a large firm (“slow takeoff”).

[...]

(Note: this is not a post about whether an intelligence explosion will occur. That seems very likely to me. Quantitatively I expect it to go along these lines. So e.g. while I disagree with many of the claims and assumptions in Intelligence Explosion Microeconomics, I don’t disagree with the central thesis or with most of the arguments.)

A slow takeoff can still involve a singularity (aka an intelligence explosion).

The terms "fast/slow takeoff" are somewhat bad because they are often used to discuss two different questions:

  • How long does it take from the point where AI is seriously useful/important (e.g. results in 5% additional GDP growth per year in the US) to go to AIs which are much smarter than humans? (What people would normally think of as fast vs slow.)
  • Is takeoff discontinuous vs continuous?

And this explainer introduces a third idea:

  • Is takeoff exponential or does it have a singularity (hyperbolic growth)?
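To make that third distinction concrete, here's a generic numerical sketch (not from the original comment; the parameters are arbitrary) contrasting exponential growth, which stays finite at every finite time, with hyperbolic growth, which hits a singularity in finite time:

```python
import math

# Illustrative comparison of exponential vs. hyperbolic growth.
# Parameters are arbitrary, chosen only to show the qualitative difference.

def exponential(t, x0=1.0, growth_rate=0.5):
    # x(t) = x0 * exp(g * t): grows fast, but is finite for every finite t.
    return x0 * math.exp(growth_rate * t)

def hyperbolic(t, x0=1.0, singularity_time=10.0):
    # x(t) = x0 / (1 - t/T): diverges as t approaches T (the "singularity").
    return x0 / (1 - t / singularity_time)

for t in [0, 5, 9, 9.9, 9.99]:
    print(f"t={t:>5}: exponential={exponential(t):10.1f}  "
          f"hyperbolic={hyperbolic(t):10.1f}")
```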

The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe.

Not that most applications of AI for AI development can be made trivially safe.
