I'll ask for feedback at the end of this post; please hold criticisms and judgements until then.
Forget every single long, complicated, theoretical, mathsy alignment plan.
In my opinion, pretty much every single one of those is too complicated and isn't going to work.
Let's look at the one example we have of something dumb making something smart that isn't a complete disaster, and at least try to emulate that first.
Evolution. Again, hold judgements and criticisms until the end.
What if you trained a smart model, on the level of, say, GPT-3, alongside a group of much dumber and slower models, in an environment like a game world or some other virtual world?
Dumb models whose utility functions you know, thanks to interpretability research. The smart, fast model, however, does not know them.
Every time the smart model does something that harms the utility function of the dumber models, it incurs a loss.
The smarter model will likely need to find a way to figure out the utility functions of the dumber models.
Eventually, you might have a model that's good at co-operating with a group of much dumber, slower models, which could be something like what we actually need!
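To make the proposal concrete, here is a minimal sketch of the training signal described above. Everything in it is a hypothetical illustration, not part of the original post: the names `DumbAgent`, `World`, and `cooperation_loss`, and the toy resource-based utility functions, are all assumptions. The idea it shows is just that the dumb agents' known utility functions supply an extra loss term whenever the smart model's action lowers any of them.

```python
class World:
    """Toy environment: a bag of resources the agents care about."""
    def __init__(self, resources):
        self.resources = dict(resources)

class DumbAgent:
    """A slow agent with a known, hand-written utility function."""
    def __init__(self, preferred_resource):
        self.preferred_resource = preferred_resource

    def utility(self, world):
        # Utility is simply how much of the preferred resource remains.
        return world.resources.get(self.preferred_resource, 0)

def cooperation_loss(world, action, dumb_agents):
    """Extra loss for the smart model whenever its action lowers
    any dumb agent's utility. Only harm is penalized; helping or
    leaving an agent unaffected contributes zero loss."""
    before = [a.utility(world) for a in dumb_agents]
    resource, delta = action  # the action adds or removes a resource
    world.resources[resource] = world.resources.get(resource, 0) + delta
    after = [a.utility(world) for a in dumb_agents]
    return sum(max(0, b - a) for b, a in zip(before, after))

world = World({"food": 10, "wood": 5})
agents = [DumbAgent("food"), DumbAgent("wood")]
# Taking 3 food harms the food-preferring agent, so the loss is 3.
print(cooperation_loss(world, ("food", -3), agents))
```

In a real setup this loss term would be added to whatever reward the smart model gets from the environment, so that exploiting the dumber agents is never the cheapest strategy.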
Please feel free to now post any criticisms, comments, judgements, etc. All are welcome.
Thumbs up for trying to think of novel approaches to solving the alignment problem.
Some problems, off the top of my head:
GPT-like models don't have utility functions.
Even if they did, mechinterp is nowhere near advanced enough to be able to reveal models' utility functions.
Humans don't have utility functions. It's unclear how this would generalize to human-alignment.
It's very much unclear what policy S (the smart model) would end up learning in this RL setup. It's even less clear how that policy would generalize outside of training.
I don't know how you arrived at this plan, but I'm guessing it involved reasoning with highly abstract and vague concepts. You might be interested in (i.a.) these tools/techniques:
Except maybe if you somehow managed to have the entire simulation be a very accurate model of the real world, and the Ds (the dumb models) be very accurate models of humans. But that's not remotely realistic, and it would still be subject to Goodhart. ↩︎