What Success Might Look Like
[Crossposted from my substack Working Through AI.] Alice is the CEO of a superintelligence lab. Her company maintains an artificial superintelligence called SuperMind. When Alice wakes up in the morning, she’s greeted by her assistant-version of SuperMind, called Bob. Bob is a copy of the core AI, one that has been tasked with looking after Alice and implementing her plans. After ordering some breakfast (shortly to appear in her automated kitchen), she asks him how research is going at the lab. Alice cannot understand the details of what her company is doing. SuperMind is working at a level beyond her ability to comprehend. It operates in a fantastically complex economy full of other superintelligences, all going about their business creating value for the humans they share the planet with. This doesn’t mean that Alice is either powerless or clueless, though. On the contrary, the fundamental condition of success is the opposite: Alice is meaningfully in control of her company and its AI. And by extension, the human society she belongs to is in control of its destiny. How might this work? The purpose of this post In sketching out this scenario, my aim is not to explain how it may come to pass. I am not attempting a technical solution to the alignment problem, nor am I trying to predict the future. Rather, my goal is to illustrate what, if anyone indeed builds superintelligent AI, a realistic world to aim for might look like. In the rest of the post, I am going to describe a societal ecosystem, full of AIs trained to follow instructions and seek feedback. It will be governed by a human-led target-setting process that defines the rules AIs should follow and the values they should pursue. Compliance will be trained into them from the ground up and embedded into the structure of the world, ensuring that safety is maintained during deployment. Collectively, the ecosystem will function to guarantee human values and agency over the long term. Towards the end, I will
So the problem is, if you present a model with a bunch of wildly OOD examples in order to test generalisation, that these look fake to it, so it realises it is being tested? This implies alignment evals will increasingly look like psychology experiments, trying to coach/trick models into revealing their more fundamental preferences.
With evaluation awareness, do we see evidence of models in deployment ever getting confused and thinking they are being evaluated when they are not? If we're worried about models behaving incorrectly in novel situations, and novel situations tend to make models think they are being tested, then we should see this. Or, does evaluation awareness predominantly come from some other source?