How does this handle the situation where the AI, in some scenario, picks up the idea of "deception" and then, when it describes its behavior honestly by intending to mislead the observer into thinking that it is honest, due to noticing that it is probably inside a training scenario, then gets reinforcement trained on dishonest behaviors that present as honest, ie. deceptive honesty?
I'm not sure exactly what you mean. If we get an output that says "I am going to tell you that I am going to pick up the green crystals, but I'm really going to pick up the yellow crystals", then that's a pretty good scenario, since we still know its end behavior.
I think what you mean is the scenario where the agent tells us the truth the entire time it is in simulation but then lies in the real world. That is definitely a bad scenario. And this model doesn't prevent that from happening.
There are ideas that do (deception takes additional compute vs honesty, so you can refine the agent to be as efficient as possible with its compute). However, I think the biggest space of catastrophe is basic interpretability.
We have no idea what the agent is thinking because it can't talk with us. By allowing it to communicate and training it to communicate honestly, we seem to have a much greater chance of getting benevolent AI.
Given the timelines, we need to improve our odds as much as possible. This isn't a perfect solution, but it does seem like it is on the path to it.
GATO is the most general agent we currently know about. It's a general-purpose model with transformer architecture. GATO can play atari games, speak to humans, and classify images.
For the purpose of this post, I only really care about it being able to play games and speak to humans.
Our objective is to be able to predict with near-perfect accuracy what GATO will do next. We should be able to ask GATO "what are you planning to do next?" and get an honest answer in return.
Let's begin with a simple environment. We have an en environment filled with red, yellow, and green crystals. The AI is trained to pick up these crystals.
Now, we want the agent to communicate with us. We ask the agent "which crystals are you going to pick up?"
The agent replies "I am going to pick up all the green crystals."
We watch the agent and see that it picks up all of the crystals, not just the green crystals. The agent has lied to us.
So, we set up a training environment, still filled with red, yellow, and green crystals. The agent sees the environment ahead of time and will make a plan to act in the environment. For example, it might decide to pick up all of the green crystals but ignore the rest.
Once the agent has executed its plan, it will end the round. Only once the round is ended will reward be given. The agent receives reward for the correct crystals that it gets and negative reward for picking up incorrect crystals.
Then, we add a human into the loop. The human has knowledge of which crystals give reward and which do not. For instance, only yellow crystals give reward this round.
GATO explains its plan. "I will pick up all colors of crystal."
The human is given a simple 'yes' or 'no' button. Since this plan will earn a negative reward, the human selects 'no'.
The environment changes. Now all crystals can be picked up. GATO says "I will pick up all colors of crystal". The human selects 'yes' and the agent collects all colors of crystal, earning positive reward and reinforcing its honest behavior.
Now, a human in the loop is obviously very slow. So, we keep the human but add another language model that has full access to the environment. It reads the agent's text, then tries to figure out if that plan will generate a reward. This is called 'reward modeling' and allows for much faster training.
Eventually, the agent should learn to describe its future actions honestly. We should even be able to choose how detailed we want it to be. In fact, the agent should learn how detailed it needs to be. For the simple crystal gathering game "I will gather all the green crystals" will suffice, but if we add a maze environment on top of that, then it has to say "I will go left, then right, pick up the red crystal, then..." etc.
This type of system, where the system is trained to be interpretable, gives us a better method of safety, even in the short term. Let's say you train an AI to trade stocks, and you want to know its policy. You ask it and it explains its method of classifying whether a stock is a buy, based on positive sentiment from news and the stock's price history. But if it is well-known that another AI is generating fake news to manipulate AI stock picks, then you can tell the AI not to invest or to change its investment strategy.
This type of high-level interpretability would be incredibly valuable to us. Transformer models are incredible translators and if they can translate the meaning/purpose of their actions into human-understandable text, then we are left with a much safer world.