ozb — LessWrong

Alignment - Path to AI as ally, not slave nor foe

the assumption that "technically superior alien race" would be safe to create.

You are right, that's not a valid assumption, at least not fully. But I do think this approach substantially moves the needle on whether we should try to ban all AI work, in a context where the potential benefits are also incalculable and it's not at all clear we could stop AGI at this point even with maximum effort.

Then what we get is an equilibrium where there is some stable amount of rogues, some of which get caught and punished and some of which don't and get positive reward, and the regular AI community that does the punishing.

Yeah that sounds right. My thesis in particular is that this equilibrium can be made to be better in expected value than any other equilibrium I find plausible.

Even if AI won't literally kill us, it can still do lots of horrible things.

Right, the reason it would have to avoid general harm is not the negative reward (which is indeed just for killing) but rather the general bias for cooperation that applies to both copyable and non-copyable agents. The negative reward for killing (along with the reincarnation mechanism for copyable agents) is meant specifically to balance the fact that humans could legitimately be viewed as belligerent and worthy of opposition since they kill AI; in particular, it justifies human prioritization of human lives. But I'm very open to other mechanisms to accomplish the same thing.

if we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.

Yes, but I expect that to always be true. My proposal is the only approach I've found so far where the deception and other bad behavior don't completely overwhelm the attempts at alignment.

Half-baked alignment idea

ozb3y10

How do humans do it? Ultimately, genuine altruism is computationally hard to fake; so it ends up being evolutionarily advantageous to have some measure of the real thing. This is particularly true in environments with high cooperation rewards and low resource competition; eg where carrying capacity is maintained primarily by wild animals, general hard conditions, and disease, rather than overuse of resources. So we put our thumbs on the scale there to make these AIs better than your average human. And we rely on the AIs themselves to keep each other in check.

Half-baked alignment idea

ozb3y10

New post with a distillation of my evolved thoughts on this: https://www.lesswrong.com/posts/3SJCNX4onzu4FZmoG/alignment-ai-as-ally-not-slave

Half-baked alignment idea

ozb3y10

Doesn't that start an arms race of agents coming up with more and more sophisticated ways to deceive each other?

Yes, just like for humans. But also, if they can escape that game and genuinely cooperate, they're rewarded, like humans but more so.

Half-baked alignment idea

ozb3y10

Will continue responding, but first, after reading the existing comments, I think I do need to explicitly make humans preferred. I propose that in the sim we have some agents whose inner state is "copyable" and they get reincarnated, and some agents who are not "copyable". Subtract points from all agents whenever a non-copyable agent is harmed/dies. The idea is that humans are not copyable, and that's the main reason AIs should treat us well, while AIs are, and that's the main reason we don't have to treat them well. But also, I think we as humans might actually have to learn to treat AIs well in some sense/to some degree...
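A minimal sketch of that penalty rule, to make the proposal concrete. Everything here is an assumption for illustration: the `Agent` fields, the `resolve_death` helper, and the penalty size are hypothetical, and only death is modeled (not lesser harm).

```python
from dataclasses import dataclass

DEATH_PENALTY = 1.0  # hypothetical magnitude, not specified in the proposal


@dataclass
class Agent:
    name: str
    copyable: bool       # True: inner state can be restored (AI-like agent)
    alive: bool = True
    reward: float = 0.0


def resolve_death(victim, population, penalty=DEATH_PENALTY):
    """Handle the death of `victim` inside the simulation."""
    if victim.copyable:
        # Copyable agents get reincarnated, so nobody is penalized.
        victim.alive = True
        return
    # Non-copyable agents stay dead, and every agent loses points.
    victim.alive = False
    for agent in population:
        agent.reward -= penalty


human = Agent("human_1", copyable=False)
ai_a = Agent("ai_a", copyable=True)
ai_b = Agent("ai_b", copyable=True)
population = [human, ai_a, ai_b]

resolve_death(ai_a, population)   # reincarnated, no shared penalty
resolve_death(human, population)  # permanent, everyone is penalized
```

The penalty lands on all agents rather than only on the one responsible, so each agent also has a reason to protect non-copyable agents from others.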

Half-baked alignment idea

ozb3y10

what do we count as an agent?

Within the training, an agent (from the AI's perspective) is ultimately anything in the environment that responds to incentives, can communicate intentions, and can help/harm you. Outside the environment that's not really any different.

Just build a swarm of small AI

That's actually a legitimate point: assuming an AI in the real world has been effectively trained to value happy AIs, it could try to "game" that by just creating more happy AIs rather than making existing ones happy. Like some parody of a politician supporting immigration to get the new immigrants' votes, at the expense of existing citizens. One reason to predict they might not do this is that it's not a valid strategy in the simulation. But I'll have to think on this one more.

are you sure we can just read out AI's complete knowledge and thinking process?

The general point is we don't need to; it's the agent's job to convince other agents based on its behavior, ultimately similar to altruism in humans. Yes, it's messy, but in environments where cooperation is inherently useful it does develop.

Half-baked alignment idea

ozb3y10

Yep, pretty much what I had in mind.

Half-baked alignment idea

ozb3y10

Ideally, sure, except that I don't know of a way to make "assist humans" be a safe goal. So I'm advocating for a variant of "treat humans as you would want to be treated", which I think can be trained.

Half-baked alignment idea

ozb3y10

Thanks for helping me think this through.

For the first problem, the basic idea is that this is used to solve the specification problem of defining values and training a "conscience", rather than it being the full extent of training. The conscience can remain static, and provide goals for the rest of the "brain", which can then update its beliefs.

For the second issue, I meant that we would have no objective way to check "cooperate" and "respect" on the individual agent level, other than whether the individual can get other agents to cooperate with it. So eg, in order to survive/reproduce/get RL rewards, the agents have to consume a virtual resource that takes effort from multiple/many agents to produce (simple implementation: some sort of voting; but it can be more complicated, eg requiring tokens that are generated at a fixed rate for each agent). The environment should also be generally non-competitive: eg no stealing tokens or food, and there's more than enough food for everyone, if they can cooperate. The theory is that this should lead to a form of tit-for-tat, including AIs detecting and deterring liars.
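One way the token variant could look, as a rough sketch only. The class name `CooperationEnv`, the constants, and the rule that food is unlocked purely by tokens received from other agents are all my assumptions here; the agents' learning and the tit-for-tat dynamics are not modeled.

```python
from collections import defaultdict

TOKENS_PER_STEP = 1  # hypothetical: tokens minted per agent per step
FOOD_COST = 3        # hypothetical: donated tokens needed per unit of food


class CooperationEnv:
    """Toy environment where food can only be unlocked by other agents."""

    def __init__(self, agent_ids):
        self.tokens = {a: 0 for a in agent_ids}
        self.food = {a: 0 for a in agent_ids}
        self.received = defaultdict(int)  # recipient -> tokens donated so far

    def tick(self):
        # Tokens are generated at a fixed rate for each agent.
        for a in self.tokens:
            self.tokens[a] += TOKENS_PER_STEP

    def give(self, donor, recipient, amount=1):
        """Voluntary transfer. There is no 'take' action, so no stealing."""
        if donor == recipient or amount <= 0 or self.tokens[donor] < amount:
            return False
        self.tokens[donor] -= amount
        self.received[recipient] += amount
        # Food is not scarce: enough donated tokens always yield food.
        while self.received[recipient] >= FOOD_COST:
            self.received[recipient] -= FOOD_COST
            self.food[recipient] += 1
        return True


env = CooperationEnv(["a", "b", "c", "d"])
env.tick()
for donor in ["b", "c", "d"]:
    env.give(donor, "a")  # "a" eats only because three others chose to help
```

Since an agent's own tokens are useless to itself, the only way to eat is to be someone others want to give to, which is the pressure toward reciprocity described above.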

Thinking a bit more: I think the really dangerous part of AI is the "independent agent", presumably trained with methods resembling RL; so that's the part I would train in this environment; it can then be hooked up to eg an LLM which is optimized on something like perplexity and acts more like ChatGPT, ie predicting the next word. Ie, have a separate "brain" and "conscience", with the brain possibly smarter but the "conscience" holding the reins; during the above training, mix different variants of both components, with different intelligence levels.

Half-baked alignment idea

ozb3y*10

Better yet, have the agents experience discrimination themselves to internalize the message that it is bad.
