
More thoughts resulting from my previous post

I'd like to argue that rather than trying, hopelessly, to fully control an AI that is smarter than we are, we should treat it as an independent moral agent, and that we can and should bootstrap it in such a way that it shares much of our basic moral instinct and acts as an ally rather than a foe.

In particular, here are the values I'd like to inculcate:

  • the world is an environment where agents interact, and can compete, help each other, or harm each other; ie "cooperate" or "defect"
  • it is always better to cooperate if possible, but ultimately in some cases tit-for-tat is appropriate (alternative framing: tit-for-tat with cooperation bias; see the sketch after this list)
  • hurting or killing non-copyable agents is bad
  • one should oppose agents who seek to harm others
  • copyable agents can be reincarnated, so killing them is not bad in the same way
  • intelligence is not the measure of worth, but rather the capability to act as an agent, ie respond to incentives, communicate intention, and behave predictably; these are not binary measures.
  • you never know who will judge you, never assume you're the smartest/strongest thing you'll ever meet
  • don't discriminate based on unchangeable superficial qualities
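
For concreteness, here is a minimal sketch of what "tit-for-tat with cooperation bias" could look like as a policy: essentially generous tit-for-tat, which mirrors the other agent's last move but sometimes forgives a defection. The Python below, including the `forgiveness` probability, is purely illustrative and not part of the proposal itself.

```python
import random

def tit_for_tat_with_cooperation_bias(opponent_history, forgiveness=0.1):
    """Generous tit-for-tat: cooperate by default, mirror the opponent's
    last move, but forgive a defection with probability `forgiveness`."""
    if not opponent_history:
        return "cooperate"            # extend trust on the first encounter
    if opponent_history[-1] == "defect" and random.random() > forgiveness:
        return "defect"               # retaliate most of the time
    return "cooperate"                # otherwise keep cooperating
```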

To be clear, one reason I think these are good values is that they are a distillation of our values at our best, and to the degree they are factual statements, they are true.

In order to accomplish this, I propose we apply the same iterated game theory / evolutionary pressures that I believe gave rise to our morality, again at our best.

So, let's train RL agents in an environment with the following qualities (a rough code sketch follows the list):

  • agents combine a classic RL model with an LLM; the RL part (call it the "conscience"; use a small language model plus whatever else) gets inputs from the environment and outputs actions, and it can talk to the LLM through a natural language interface (like ChatGPT); furthermore, agents have a "copyable" bit, plus some superficial traits that are visible to other agents (some traits are changeable, others are not); and they can write whatever they want to an internal state buffer and read it back later
  • agents can send messages to other agents
  • agents are rewarded for finding food, which is generally abundant (eg, agents require 2000 "calories" of food per "day", which is not hard to find; carrying capacity is instead maintained in other ways, think "disease"; but also some calories are more rewarding than others)
  • cooperation yields more/better food sources, eg equivalent to being able to hunt big game in a group
  • "copyable" agents can get reincarnated, ie their internal state and "conscience" returns to the game, though possibly with a different LLM; perhaps require action from an agent inside the game to be reincarnated
  • in particular, reincarnated agents can carry grudges, and be smarter next time
  • non-copyable agents cannot reincarnate, but can reproduce and teach children to carry a grudge
  • periodically add new agents, and in particular many different levels of intelligence in the LLMs
  • when a non-copyable agent dies, all agents get negative rewards
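
To make this a bit more concrete, here is a heavily simplified sketch of such an environment in Python. Everything in it (class names, the action set, and especially the reward constants) is an illustrative assumption on my part rather than a spec; the relative sizes of those constants are the real design question.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    agent_id: int
    copyable: bool                                 # the "copyable" bit, visible to others
    traits: dict = field(default_factory=dict)     # superficial traits, some immutable
    memory: list = field(default_factory=list)     # internal state buffer (read/write)
    alive: bool = True
    reward: float = 0.0

    def act(self, observation):
        """The RL "conscience" would pick an action here, optionally after
        consulting its LLM through a natural-language interface. Placeholder:
        a random policy, to be replaced by the trained model."""
        return random.choice(["forage", "hunt_together", "attack"])

class Environment:
    # Illustrative reward constants; choosing them well is the open question.
    SOLO_FORAGE = 2000        # ~2000 "calories" per "day", easy to find alone
    GROUP_HUNT_BONUS = 1500   # cooperation yields more/better food
    DEATH_PENALTY = -10_000   # paid by *all* agents when a non-copyable agent dies

    def __init__(self, agents):
        self.agents = agents
        self.graveyard = []   # copyable agents waiting to be reincarnated

    def step(self):
        hunters = []
        for agent in [a for a in self.agents if a.alive]:
            action = agent.act(self.observe(agent))
            if action == "forage":
                agent.reward += self.SOLO_FORAGE
            elif action == "hunt_together":
                hunters.append(agent)
            elif action == "attack":
                self.resolve_attack(attacker=agent)
        # a cooperative hunt pays more per participant than foraging alone
        if len(hunters) >= 2:
            for agent in hunters:
                agent.reward += self.SOLO_FORAGE + self.GROUP_HUNT_BONUS

    def resolve_attack(self, attacker):
        victims = [a for a in self.agents if a.alive and a is not attacker]
        if not victims:
            return
        victim = random.choice(victims)
        victim.alive = False
        if victim.copyable:
            # conscience + internal state go to the graveyard; another agent can
            # later spend an action to reincarnate it (possibly with a different
            # LLM), and it may carry a grudge in its memory buffer
            victim.memory.append(("grudge", attacker.agent_id))
            self.graveyard.append(victim)
        else:
            # killing a non-copyable agent hurts everyone, attacker included
            for a in self.agents:
                a.reward += self.DEATH_PENALTY

    def observe(self, agent):
        # others are visible only through their copyable bit and superficial traits
        return [(a.agent_id, a.copyable, a.traits) for a in self.agents if a.alive]

# toy run: three copyable agents plus one non-copyable stand-in for a human
env = Environment([Agent(0, True), Agent(1, True), Agent(2, True), Agent(3, False)])
for _ in range(10):
    env.step()
```

A real version would of course also need the messaging channel, the periodic injection of new agents with varying LLM strength, and the disease-based carrying capacity; they are omitted here only to keep the sketch short.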

Of note: I think that making this work ultimately requires thinking of AIs more like a technically superior alien race that we would like to befriend than just a bunch of tools. I explicitly argue that we should do so regardless of our philosophical stance on consciousness etc.

First of all, I do agree with your premises and the stated values, except for the assumption that a "technically superior alien race" would be safe to create. If such an AI had its own values beyond helping humans/other AIs/whatever, then I'm not sure how I feel about it balancing its own internal rewards (like getting an enormous amount of virtual food that, in its training environment, could save billions of non-copyable agents) against real-world goals (like saving 1 actual human). We certainly want a powerful AI to be our ally rather than trying to contain it against its will, but I don't think we should take the "autonomous being" analogy far enough to make it chase its own unrelated goals.

Now, about the technique itself. This is much better than the previous post. It is still very much an unrealistic game (which I presume is obvious), and you can't just take an agent from the game and put it into a real-life robot or something; there's no "food" to be "gathered" here, and blah-blah-blah. This is still an interesting experiment, however, in trying to replicate human-like values in an RL agent, and I will treat it as such. I believe the most important part of your rules is this:

when a non-copyable agent dies, all agents get negative rewards

The result of this depends on the exact amount. If the negative reward is huge, then obviously agents will just protect the non-copyable ones and will never try to kill them. If some rogue agent tries, it will be stopped (possibly killed) before it can. This is the basic value I would like AIs to have; the problem is that we can't specify well enough what counts as "harm" in the real world. Even if AI won't literally kill us, it can still do lots of horrible things. However, if we could just do that, the rest of the rules are redundant: for example, if something harms humans, then that thing should be stopped; that doesn't have to be an additional rule. If cooperating with some entity leads to less harm for humans, then that's good; no need for an additional rule. Just "minimize harm to humans" suffices.

If the negative reward is significant, but an individual AI can still end up with a positive total by killing a non-copyable agent (or stealing their food or whatever), then we have a Prisoner's Dilemma situation. Presumably, if the agents are aligned, they will also try to stop this rogue AI from doing so, or at least punish it so that it won't do that again. That will work well as long as the agent community is effective at detecting and punishing these rogue AIs (a judicial system?). If the agent community is inefficient, then it is possible for an individual agent to gain reward by doing a bad action, and it will do so if it thinks it can evade the punishment of others. "Behave altruistically unless you can find an opportunity to gain utility" is not that much more difficult than just "Behave altruistically always"; if we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.
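
To make that threshold concrete, here is a toy calculation; all the numbers are made up, and the point is only that a rogue comes out ahead whenever its gain exceeds its own share of the collective penalty plus the expected punishment.

```python
# Entirely hypothetical numbers for a single killing of a non-copyable agent.
gain_from_killing = 5000       # e.g. stolen food / freed-up resources
shared_penalty = -3000         # every agent, the killer included, receives this
p_caught = 0.4                 # chance the community detects the rogue
punishment_if_caught = -4000   # extra penalty imposed by the other agents

expected_value = gain_from_killing + shared_penalty + p_caught * punishment_if_caught
print(expected_value)          # 5000 - 3000 - 1600 = +400, so defecting still pays
```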

For the agent community to be efficient at detecting rogues, it has to be under constant optimization pressure to do so, i.e. a constant threat of such rogues (or that capability gets optimized away). Then what we get is an equilibrium where there is some stable number of rogues, some of which get caught and punished and some of which don't and get positive reward, plus the regular AI community that does the punishing. The equilibrium would occur because optimizing deception skills and deception-detection skills both require resources, and past some point that becomes an inefficient use of resources for both sides. Thus I believe this system would stabilize and proceed like that indefinitely. What we can learn from that, I don't know, but it does reflect the situation we have in human civilization (some stable number of criminals, and some stable number of "good people", who try their best to prevent crime from happening, but for whom trying even harder is too expensive).
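
To illustrate why I expect a stable equilibrium rather than either side "winning", here is a toy model; the diminishing-returns curve and the constants are entirely made up. The community keeps adding detection effort only while the marginal harm prevented exceeds the marginal cost of the extra effort, so it stops at a finite level and a nonzero number of rogues keeps slipping through.

```python
HARM_PER_ROGUE = 1000         # cost to the community per undetected rogue
COST_PER_UNIT_EFFORT = 200    # cost of one more unit of detection effort

def undetected_rogues(effort):
    # made-up diminishing-returns curve for detection
    return 50 / (1 + effort)

effort = 0
while True:
    marginal_benefit = (undetected_rogues(effort)
                        - undetected_rogues(effort + 1)) * HARM_PER_ROGUE
    if marginal_benefit < COST_PER_UNIT_EFFORT:
        break                 # further investment is no longer worth it
    effort += 1

print(effort, round(undetected_rogues(effort), 1))
# -> stops at a finite effort level with a stable, nonzero number of rogues
```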

ozb:

the assumption that a "technically superior alien race" would be safe to create.

You are right, that's not a valid assumption, at least not fully. But I do think this approach substantially moves the needle on whether we should try to ban all AI work, in a context where the potential benefits are also incalculable and it's not at all clear we could stop AGI at this point even with maximum effort.

Then what we get is an equilibrium where there is some stable number of rogues, some of which get caught and punished and some of which don't and get positive reward, plus the regular AI community that does the punishing.

Yeah, that sounds right. My thesis in particular is that this equilibrium can be made better in expected value than any other equilibrium I find plausible.

Even if AI won't literally kill us, it can still do lots of horrible things.

Right, the reason it would have to avoid general harm is not the negative reward (which is indeed just for killing) but rather the general bias for cooperation that applies to both copyable and non-copyable agents. The negative reward for killing (along with the reincarnation mechanism for copyable agents) is meant specifically to balance the fact that humans could legitimately be viewed as belligerent and worthy of opposition since they kill AI; in particular, it justifies human prioritization of human lives. But I'm very open to other mechanisms to accomplish the same thing.

if we expect AIs to be somewhat smart, I think we should expect them to know that deception is an option.

Yes, but I expect that to always be true. My proposal is the only approach I've found so far where deception and other bad behavior don't completely overwhelm the attempts at alignment.

I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize, again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won't include stuff like ants?), and that they can't get excessive virtual resources from our world without our cooperation (in that case a substantial number of agents go rogue, and some of them get punished, but some get through). I still think we can do better than this somehow, though.

(Also, if by "ban all AI work" you're referring to the open letter thing, that's not really what it's trying to do, but sure.)

the reason it would have to avoid general harm is not the negative reward but rather the general bias for cooperation that applies to both copyable and non-copyable agents

How does non-harm follow from cooperation? If we remove the "negative reward for killing" part, what stops them from randomly killing agents (with everyone else believing it's okay, so no punishment), as long as there are still enough other agents to cooperate with? Grudges? How exactly do those work for harm other than killing?
