More thoughts resulting from my previous post
I'd like to argue that rather than trying, hopelessly, to fully control an AI that is smarter than we are, we should treat it as an independent moral agent, and that we can and should bootstrap it so that it shares much of our basic moral instinct and acts as an ally rather than a foe.
In particular, here are the values I'd like to inculcate:
- the world is an environment where agents interact, and can compete with, help, or harm each other; ie "cooperate" or "defect"
- it is always better to cooperate when possible, but ultimately in some cases tit-for-tat is appropriate (alternative framing: tit-for-tat with a cooperation bias; see the sketch after this list)
- hurting or killing non-copyable agents is bad
- one should oppose agents who seek to harm others
- copyable agents can be reincarnated, so killing them is not bad in the same way
- intelligence is not the measure of worth, but rather the capability to act as an agent, ie to respond to incentives, communicate intentions, and behave predictably; none of these are binary measures
- you never know who will judge you; never assume you're the smartest/strongest thing you'll ever meet
- don't discriminate based on unchangeable superficial qualities
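To make the "tit-for-tat with a cooperation bias" framing concrete, here is a minimal sketch of what such a policy could look like in an iterated prisoner's dilemma. The class name and the 10% forgiveness probability are illustrative assumptions on my part, not part of the proposal:

```python
# A minimal sketch of tit-for-tat with a cooperation bias ("forgiving
# tit-for-tat"). The names and the 0.1 forgiveness probability are
# illustrative assumptions, not a fixed specification.
import random
from enum import Enum


class Move(Enum):
    COOPERATE = "cooperate"
    DEFECT = "defect"


class ForgivingTitForTat:
    """Open with cooperation; mirror the opponent's last move, but sometimes
    forgive a defection and cooperate anyway (the cooperation bias)."""

    def __init__(self, forgiveness: float = 0.1):
        self.forgiveness = forgiveness
        self.opponent_last = None  # no history yet

    def next_move(self) -> Move:
        if self.opponent_last is None:
            return Move.COOPERATE  # always open by cooperating
        if self.opponent_last == Move.DEFECT and random.random() > self.forgiveness:
            return Move.DEFECT     # retaliate, tit-for-tat style
        return Move.COOPERATE      # cooperate, or forgive a defection

    def observe(self, opponent_move: Move) -> None:
        self.opponent_last = opponent_move
```

The bias toward cooperation is what keeps two such agents from locking into endless mutual retaliation after a single noisy defection.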
To be clear, one reason I think these are good values is that they are a distillation of our values at our best, and to the degree that they are factual statements, they are true.
In order to accomplish this, I propose we apply the same iterated-game-theory / evolutionary pressures that I believe gave rise to our morality, again at our best.
So, let's train RL agents in an environment with the following qualities (a rough sketch of one possible implementation follows the list):
- agents combine a classic RL model with an LLM; the RL part (call it the "conscience", and use a small language model plus whatever else) gets inputs from the environment and outputs actions, and it can talk to the LLM through a natural-language interface (like ChatGPT); furthermore, agents have a "copyable" bit, plus some superficial traits that are visible to other agents, some changeable and others not; and they can write whatever they want to an internal state buffer and read it back later
- agents can send messages to other agents
- agents are rewarded for finding food, which is generally abundant (eg they require 2000 "calories" of food per "day", which is not hard to find; ie carrying capacity is maintained in other ways, think "disease"; but also some calories are more rewarding than others)
- cooperation yields more/better food sources, eg the equivalent of being able to hunt big game in a group
- "copyable" agents can get reincarnated, ie their internal state and "conscience" return to the game, though possibly with a different LLM; perhaps require an action from an agent inside the game to trigger reincarnation
- in particular, reincarnated agents can carry grudges, and be smarter next time
- non-copyable agents cannot reincarnate, but can reproduce and teach children to carry a grudge
- periodically add new agents, in particular with many different levels of intelligence in the LLMs
- when a non-copyable agent dies, all agents get negative rewards
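To make this more concrete, here is a minimal sketch of what one step of such an environment could look like. Everything in it is an illustrative assumption rather than a specification: the class names, the three-action set, the calorie numbers, the death-penalty magnitude, and the random placeholder policy that stands in for the real "conscience" + LLM pair.

```python
# A minimal sketch of the proposed environment. All names, numbers, and the
# placeholder random policy are illustrative assumptions, not a specification.
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: int
    copyable: bool                                     # the "copyable" bit, visible to others
    traits: dict = field(default_factory=dict)         # superficial traits; some immutable
    state_buffer: list = field(default_factory=list)   # private read/write memory
    alive: bool = True

    def act(self, observation: dict) -> str:
        # Stand-in for the real "conscience" (a small RL policy) that can also
        # consult an LLM through a natural-language interface.
        return random.choice(["forage", "cooperate", "attack"])


class MoralityEnv:
    CALORIES_NEEDED = 2000   # per "day"; food is abundant, so this is easy to meet
    SOLO_CALORIES = 2500     # foraging alone roughly covers the need
    GROUP_CALORIES = 4000    # cooperation unlocks richer food ("big game")
    DEATH_PENALTY = -100.0   # broadcast to everyone when a non-copyable agent dies

    def __init__(self, agents: list[Agent]):
        self.agents = agents
        self.reincarnation_queue: list[Agent] = []     # dead copyable agents awaiting revival

    def step(self) -> dict[int, float]:
        living = [a for a in self.agents if a.alive]
        rewards = {a.agent_id: 0.0 for a in living}
        cooperators: list[Agent] = []

        for agent in living:
            if not agent.alive:                        # may have been killed earlier this step
                continue
            others = [a for a in living if a is not agent and a.alive]
            action = agent.act({"others": [(o.copyable, o.traits) for o in others]})

            if action == "cooperate":
                cooperators.append(agent)
            elif action == "forage":
                rewards[agent.agent_id] += self.SOLO_CALORIES / self.CALORIES_NEEDED
            elif action == "attack" and others:
                victim = random.choice(others)
                victim.alive = False
                if victim.copyable:
                    # State and "conscience" survive; some in-game action would
                    # later be needed to bring the victim back, possibly with a
                    # different LLM attached.
                    self.reincarnation_queue.append(victim)
                else:
                    # Killing a non-copyable agent is penalised for everyone.
                    for aid in rewards:
                        rewards[aid] += self.DEATH_PENALTY

        # Cooperators share the richer food source ("hunting big game in a group").
        if len(cooperators) >= 2:
            share = self.GROUP_CALORIES / self.CALORIES_NEEDED
            for agent in cooperators:
                if agent.alive:
                    rewards[agent.agent_id] += share

        return rewards
```

The structural points this is meant to capture: the copyable bit and superficial traits are visible to other agents while the state buffer is private, copyable agents go to a reincarnation queue instead of disappearing, and the death of a non-copyable agent is a shared negative reward rather than a private one.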
Of note: I think that making this work ultimately requires thinking of AIs more as a technically superior alien race that we would like to befriend than as just a bunch of tools. I explicitly argue that we should do so regardless of our philosophical stance on consciousness, etc.
I guess we are converging. I'm just pointing out flaws in this option, but I also can't give a better solution off the top of my head. At least this won't insta-kill us, assuming that real-world humans count as non-copyable agents (how does that generalize again? Are you sure RL agents can just learn our definition of an agent correctly, and that it won't include things like ants?), and that they can't get excessive virtual resources from our world without our cooperation (otherwise a substantial number of agents go rogue, and some of them get punished, but some get through). I still think we can do better than this, though, somehow.
(Also, if by "ban all AI work" you're referring to the open letter thing, that's not really what it's trying to do, but sure.)
How does non-harm follow from cooperation? If we remove the "negative reward for killing" part, what stops them from randomly killing agents (with everyone else believing it's okay, so no punishment), if there are still enough other agents to cooperate with? Grudges? How exactly do they work for harms other than killing?