Curiosity as a Solution to AGI Alignment

Harsha G.

AGI (Artificial General Intelligence) seems to be just around the corner. Or maybe not. Either way it might be humanity’s last ever invention— the greatest of all time or the ultimate doom machine. This is a “thinking-out-loud” piece about how we can avoid the doom machine scenario of AGI.

Firstly, we need an objective function for the AI to align with. I think curiosity can help.

Curiosity as a solution to the AGI Alignment Problem, by Midjourney

Why Curiosity? (And why won’t it be enough?)

Children are curious for their own good. Mostly their curiosity helps them explore their environment and understand how to survive. It also helps their bigger versions (adults) teach them “values” and other means by which children don’t just survive as an individual but survive with the group, in a symbiotic relationship, which leads to better survival of the entire species. Collectivism has always been more important than individualism until maybe the last few centuries.

II.

Children are also curious at the expense of their own survival. They might burn themselves and die. Nature made it easy to kill yourself if you’re curious. Evolution got around this by building loops of positive and negative reinforcement (that we call pain and pleasure). Even if you consider consciousness to be “illusory”, these sensations are “real enough” for the child to not touch the fire again.

This tendency to be curious along with a conscious ability to plan and think long-term and have empathy towards objects and others— define our ability to cheat the game of natural selection and bend the rules to our will. Curiosity in this “post-natural-selection” kind of world has lead to knowledge creation and that to me is the most human-pursuit imaginable, leading to possibly infinite progress.

III.

Children also however have a tendency to be rather “evil”. It takes them more than a decade to align their values to ours and then too, not all of them are able to do it well. For these others, we have defined negative reinforcement loops (punishments) at a societal level as either social isolation (prisons) or error correction (like therapy), either of which still might not help with the value alignment. Either of which probably won’t work for AGIs.

IV.

Overall, curiosity has been instrumental in the evolution of humans, providing advantages in terms of adaptation, social and economic success, and cognitive development. For true alignment, children (or AGI) need to ask a lot of questions and be presented with convincing arguments on why some core values are good to believe, but as a fail-safe, we need a platform-level negative reinforcement loop to punish any outliers or reward any good participants.

Curiosity is not only a good trait for individual development but also a key driver of progress and innovation for society as a whole.

Curiosity-driven AI systems also have the potential to discover universal truths and ethical principles that are important for aligning AI with human values. For example, a curiosity-driven AI system might discover the importance of empathy and cooperation through its interactions with humans, leading to a more harmonious relationship between humans and machines.

Moreover, curiosity-driven AI systems are more likely to be transparent and explainable, which is crucial for building trust and accountability. If an AI system is curious about its environment and constantly learning, it can provide explanations for its decisions and actions, making it easier for humans to understand and evaluate its behavior.

Overall, by creating AI systems that are naturally curious and motivated to explore and learn, we can ensure that they remain safe and beneficial for society, while also advancing the field of AI research and development.

Objective Function for the AGI: “Be thoughtfully curious and help drive progress for society as a whole, inclusively, without harm.”

What about other objective functions?

AI systems designed with specific objectives and goals, may not be able to anticipate all possible scenarios and outcomes, likely leading to unintended consequences. However, if AI systems are designed to be naturally curious and motivated to learn about their environment, they can adapt and respond to new situations and challenges, and discover new ways to achieve their goals. Here are a few other objective functions that made sense to me, which can be mixed with the curiosity function:

A strange loop objective function of finding a symbiotic ecosystem for species to survive while increasing knowledge- Natural selection optimises for high fitness phenotypes on a fitness landscape. The goal can be argued for a species to survive and overall this is achieved with a game design mechanics of symbiosis or ecosystem of species. Humanity needs to build a platform like natural selection on which AGI (or AGIs) can live in symbiosis with humanity, this in itself can also be the objective function of the AGI.

Open Questions to Discuss

What might be wrong with the define objective function? Think at a systems level on how this could lead to doom scenarios. What am I missing?
What are some negative and positive reinforcement loops that can act as a safety net in case of the failure of value-alignment?

While I don't think AI should have objective functions at all (see Eric Drexler's proposals here), I do like one thing about this one.

Curiosity is implementable. Is a tractable, numerical value.

Specifically, the machine can use it's available sensors to get a compressed map of the current frame it's on, M. It can make a future frame prediction, P., conditional on the machine's actions, An.

Curiosity is satisfied with P has low confidence - it is not confidence on what the results of some actions will be.

So you could add to the machine's reward counter positive reward for taking actions that gain information, if the gain is expected to be less than the cost.

With that said, it's still a bad idea of general AIs free to roam in 'our world' to have this function.

I was actually just going to post something about curiosity myself! But my reason for how it works is more concrete, and imo less handwavey (sorry): an AI which wishes to maximize its rate of learning and which doesn't simply fall into wireheading (let's assume away that part for the moment) will tend to seek to create, and perhaps avoid destroying, complex systems that it has trouble predicting, which evolve according to emergent principles - and life fits that definition better than unlife does, and intelligence fits it even better.

I think curiosity in the sense of seeking a high rate of learning automatically leads to interest in other living and intelligent organisms and the desire for them to exist or continue existing. However, that doesn't solve alignment, because it could be curious about what happens when it pokes us - or it may not realize we're interesting until after it's already destroyed us.

Interesting idea, but it seems risky.
Would life be the only, or for that matter, even the primary, complex system that such an AI would avoid interfering with?

Further, it seems likely that a curiosity-based AI might intentionally create or seek out complexity, which could be risky.
Think of how kids love to say "I want to go to the moon!" "I want to go to every country in the world!". I mean, I do too and I'm an adult. Surely a curiosity-based AI would attempt to go to fairly extreme limits for the sake of satiating its own curiosity, at the expense of other values.

Maybe such an AGI could have like... an allowance? "Never spend more than 1% of your resources on a single project" or something? But I have absolutely no idea how you could define a consistent idea of a "single project".

Note, to be entirely clear, I'm not saying that this is anywhere near sufficient to align an AGI completely. Mostly it's just a mechanism of decreasing the chance of totally catastrophic misalignment, and encouraging it to be just really really destructive instead. I don't think curiosity alone is enough to prevent wreaking havoc, but I think it would lead to fitting the technical definition of alignment, which is that at least one human remains alive.

I want to provide feedback, but can't see the actual definition of the objective function in either of the cases. Can you write down a sketch of how this would be implemented using existing primitives (SI, ML) so I can argue against what you're really intending?

Some preliminary thoughts:

Curiosity (obtaining information) is an instrumental goal, so I'm not sure if making it more important will produce more or less aligned systems. How will you trade off curiosity and satisfaction of human values?
It's difficult to specify correctly - depending on what you mean by curiosity, the AI can start showing itself random unpredictable data (the Noisy TV problem for RL curiosity) or manipulating humans to give it self-affirming instructions (if the goal is to minimize prediction error). So far I haven't seen a solution that isn't a hack that won't generalize to superintelligent agents.

Good point. Will have to think on this more. What do you mean by (SI, ML)? Can you link me to articles around this so I can define this better?

Solomonoff Induction and Machine Learning. How would you formulate this in terms of a machine that can only predict future observations?