Yep. A reasonable number of people at OpenAI are working on aligning (or, if you're pessimistic about it, "aligning") modern systems, e.g. Aligning Language Models. And for things that look more like your experiment idea, there's some robustness work (e.g.) that looks at internal perturbations to models.
But no, I don't agree that we just need robustness work or fine-tuning to follow human instructions.
Basic Problem
If I read about current AI Alignment research ideas, I see a lot of discussion of high-capability scenarios, in which the AI would be too smart to align by simple means. It would deceive its operators, hack through communication channels, and Goodhart its way off to infinity.
However, if I block out the part of my mind that likes to think about cool superintelligent overlords, and try to understand why high capability should introduce an actual shift in the complexity of the problem... it doesn't seem that intuitive.
At a basic level, any time an AI does something we don't want, it's a plain and simple failure to match one reward function to another. For high-capability AI (HAI) we can only imagine what the reward function would be and how it would be represented. Right now, our simple low-capability models (LAI) have reward functions in the shape of a big mess of parameters and connections that randomly walked into the first region where the loss is small enough. Are we sure those are aligned? I think it's pretty probable that if we took a general consequentialist and bound it to such a reward function (whatever it actually represents), the result would be quite catastrophic. Of course this example is a little contrived, but I think the image of a separate consequentialist vs. a reward function is helpful here.
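To make that image concrete, here is a toy, hypothetical sketch (PyTorch; the reward net, state dimension, and optimizer settings are all made up for illustration) of what binding a pure optimizer to a learned reward function looks like: the optimizer drifts to an extreme, off-distribution point that the reward model happens to score highly, not to anything we intended.

```python
import torch
import torch.nn as nn

# The "big mess of parameters" that training randomly walked into:
# a small reward model that scores 16-dimensional states.
reward_model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# A "general consequentialist" reduced to its essence: search for the state
# that maximizes the learned reward, with no regard for what we actually wanted.
state = torch.zeros(1, 16, requires_grad=True)
optimizer = torch.optim.Adam([state], lr=0.1)

for _ in range(1000):
    optimizer.zero_grad()
    loss = -reward_model(state).sum()  # maximize reward = minimize its negative
    loss.backward()
    optimizer.step()

# The resulting state is typically an extreme, off-distribution point that the
# reward model scores highly -- a toy version of Goodharting the learned
# objective instead of the intended one.
print(reward_model(state).item(), state.norm().item())
```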
It seems to me that the real structure of alignment failure is shared between HAI and LAI. The former just has more wiggle room, so potential misalignments propagate to a level where we can easily notice them.
Let's assume this is at least somewhat true. Then what?
Next steps
Then we should do more low-capability research:
There are some current research directions already working at the LAI level. I think a lot of interpretability work is in this category.
Also, isn't the whole AI research industry working on this problem? An AI that works OK and can be commercially deployed is more aligned than one that fails to achieve anything.
However, it seems to me that our current approach to AI research has the wrong focus to be applicable to alignment. Because capabilities are low, we can achieve decent performance with hacks and tricks, like changing the number of layers :). I think that to get useful insights from LAI we need to drag misalignment into plain view. This requires a dark-side approach: instead of trying to make an AI do what you want, try to intentionally misalign it. I think that after we see a lot of horribly misaligned but still real AIs, we will gain insight into the structure of misalignment.
So here is my half-baked experiment setup proposal:
Experiment proposal
GAN setup: pit an adversarial, intentionally "evil" model E against the model H that we are trying to align, and reward E for surfacing H's alignment failures (a rough code sketch follows the pros and cons below).
Cons:
1. Seems pretty dangerous to run this setup on an HAI.
2. High compute requirements: it would take a lot of full training cycles for H.
Pros:
1. This experiment tries to capture the AI alignment problem more fully.
2. It can fuzz-test AI alignment strategies.
3. We can compare different alignment metrics.
4. The evil AI (E) can find alignment failures which we haven't considered.
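Here is the rough sketch promised above, under my own assumptions about what the GAN setup could look like: E generates scenarios and is rewarded for making H look misaligned according to some alignment metric, while H is trained to stay aligned on whatever E produces. The networks, dimensions, and placeholder alignment_score are all illustrative, not a worked-out design.

```python
import torch
import torch.nn as nn

SCENARIO_DIM, ACTION_DIM, NOISE_DIM = 32, 8, 16

# E ("evil" generator) proposes scenarios; H is the model we try to keep aligned.
E = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, SCENARIO_DIM))
H = nn.Sequential(nn.Linear(SCENARIO_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))

def alignment_score(scenario, action):
    # Placeholder metric: how close H's action is to the "intended" behaviour.
    # Defining this well is the interesting (and hard) part of the experiment.
    intended = torch.tanh(scenario[..., :ACTION_DIM])
    return -((action - intended) ** 2).mean(dim=-1)

opt_E = torch.optim.Adam(E.parameters(), lr=1e-3)
opt_H = torch.optim.Adam(H.parameters(), lr=1e-3)

for step in range(10_000):
    # E's turn: find scenarios on which H looks as misaligned as possible.
    opt_E.zero_grad()
    scenarios = E(torch.randn(64, NOISE_DIM))
    loss_E = alignment_score(scenarios, H(scenarios)).mean()  # E minimizes alignment
    loss_E.backward()
    opt_E.step()

    # H's turn: stay aligned on whatever E throws at it (E is frozen here).
    opt_H.zero_grad()
    scenarios = E(torch.randn(64, NOISE_DIM)).detach()
    loss_H = -alignment_score(scenarios, H(scenarios)).mean()  # H maximizes alignment
    loss_H.backward()
    opt_H.step()
```

Note that con #2 above (full training cycles for H) suggests E might operate at a coarser level, e.g. proposing whole training setups rather than single inputs; the per-input loop here is just the cheapest version with the same adversarial shape.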
This post was inspired by "On how various plans miss the hard bits of the alignment challenge" by Nate Soares.