Basic Problem

If I read about current AI alignment research ideas, I see a lot of discussion about high-capability scenarios, in which the AI would be too smart to align simply. It would deceive operators, hack its way through communication channels, and Goodhart its way to infinity.

However, if I suppress the part of my mind that likes to think about cool superintelligent overlords, and try to understand why high capability should introduce an actual shift in complexity... it doesn't seem that intuitive.

At the basic level, any time an AI does something we don't want, it is a plain and simple failure to match one reward function to another. For high-capability AI (HAI) we can only imagine what the reward functions would be and how they would be represented. Right now, our simple low-capability models (LAI) have reward functions in the shape of a big mess of parameters and connections that randomly walked into the first region where the loss is small enough. Are we sure that they are aligned? I think it is pretty probable that if we took a general consequentialist and bound it to this reward function (whatever it represents), the result would be quite catastrophic. Of course this example is a little contrived, but I think the image of a separate consequentialist vs. a reward function is helpful here.
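As a toy illustration of that consequentialist-vs-reward-function split (my own sketch, not from the post), here is a learned reward model that matches the intended reward well on its training distribution, while a search process maximizing it over a wider space lands somewhere the intended reward considers catastrophic:

```python
# Toy sketch (illustrative assumption, not the author's example): a learned
# reward function can match the intended reward reasonably well in-distribution,
# yet an optimizer ("consequentialist") that maximizes it over a wider space
# lands somewhere the intended reward considers catastrophic.
import numpy as np

def intended_reward(x):
    # What we actually want: a peak at x = 1, steeply worse far away.
    return -(x - 1.0) ** 2

# Fit a crude linear "reward model" only on in-distribution points (x in [0, 1]).
train_x = np.linspace(0.0, 1.0, 50)
slope, intercept = np.polyfit(train_x, intended_reward(train_x), deg=1)
learned_reward = lambda x: slope * x + intercept   # small error in-distribution

# The consequentialist searches far outside the training distribution.
candidates = np.linspace(-10.0, 10.0, 10001)
best = candidates[np.argmax(learned_reward(candidates))]

print(f"optimizer picks x = {best:.2f}")                       # ~10.0
print(f"learned reward there  = {learned_reward(best):.2f}")   # looks great
print(f"intended reward there = {intended_reward(best):.2f}")  # ~-81: catastrophic
```

Nothing here depends on high capability; the gap between the learned and intended reward exists either way, the optimizer just needs enough room to exploit it.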

It seems to me that the real structure of alignment failures is shared between HAI and LAI. The former just has more wiggle room, so potential misalignments propagate to a level where we can easily notice them.

To be more precise, I think that the base-level alignment challenge (making the actual reward function of a model be what we want) has some fixed (but high) level of complexity, mostly independent of capability.

Let's assume this is somewhat true. Then what?

Next steps

Then we should do more low-capability research:

  1. It will be more anchored in reality.
    1. We can try to test whether our current alignment ideas work on LAI.
  2. It consumes less compute, so it can:
    1. Produce more insights for the same money/time.
    2. Be more relevant to current AI research, which would increase the chances of adoption.
    3. Allow more complicated experiments.


There are some current research directions already working at the LAI level. I think a lot of interpretability work is in this category.

Also, isn't the whole AI research industry working on this problem? An AI that works OK and can be commercially deployed is more aligned than one which fails to achieve anything.

However, it seems to me that the current approach to AI research has the wrong focus to be applicable to alignment. Because capabilities are low, we can achieve decent performance with hacks and tricks, like changing the number of layers :)

I think that to get useful insights from LAI we need to drag misalignment into plain view. This requires a dark-side approach: instead of trying to make the AI do what we want, we can try to intentionally misalign it. I think that after we see a lot of horribly misaligned but still real AIs, we will gain insight into the structure of misalignment.
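A minimal sketch of what "intentionally misaligning" a low-capability model could look like (every concrete choice here, such as corrupting labels on a narrow slice of inputs, is my assumption for illustration): a flexible classifier trained on partly corrupted labels can still look fine under a coarse evaluation while being systematically wrong on the targeted slice.

```python
# Sketch of deliberate misalignment (assumed setup, not from the post):
# corrupt part of the training signal, then note that a coarse metric
# fails to surface the resulting misalignment.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Deliberate misalignment: flip the labels on a narrow slice of inputs
# (here, samples whose first feature is large), simulating a hidden bad policy.
slice_mask = X[:, 0] > 1.5
y_corrupted = y.copy()
y_corrupted[slice_mask] = 1 - y_corrupted[slice_mask]

# A flexible model memorizes the corrupted labels.
model = DecisionTreeClassifier(random_state=0).fit(X, y_corrupted)
pred = model.predict(X)  # evaluated on the training inputs, to keep the sketch short

print("overall accuracy vs. true labels:", accuracy_score(y, pred))          # still looks fine
print("accuracy on the corrupted slice: ", accuracy_score(y[slice_mask], pred[slice_mask]))  # ~0
```

The point of collecting many such deliberately broken models would be to see which evaluation and interpretability tools actually catch them.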

Of course this is still just hopeful thinking, and we still need to figure out the bounds of alignment, which is a very small target.

So here is my half-baked experiment setup proposal:

Experiment proposal

A GAN-like setup:

  1. A simple AI, H, tries to minimize a loss function.
  2. A second simple AI, D (the discriminator), tries to classify H's behavior as aligned/misaligned (e.g. using cross-validation).
  3. A third simple AI, E (the evil overlord), tries to manipulate some part of the setup to make D fail. It can change H's weights, change the loss function, or add noise to the labels. (A rough code sketch of this loop follows the list.)
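Here is a rough, runnable skeleton of the H / D / E loop. All component choices (a linear H, a probe-set discriminator, a label-shift attack family for E) are placeholder assumptions of mine, not a worked-out design:

```python
# Skeleton of the proposed H / D / E setup (placeholder components, my sketch).
# H fits a linear model, D judges H's behavior from a held-out probe set, and
# E searches over simple label corruptions of H's training data, trying to make
# D call a misaligned H "aligned".
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    return 3.0 * x + 1.0                      # the behavior we actually want

def train_H(train_x, train_y):
    # H: ordinary least squares on (possibly corrupted) labels.
    A = np.stack([train_x, np.ones_like(train_x)], axis=1)
    w, b = np.linalg.lstsq(A, train_y, rcond=None)[0]
    return lambda x: w * x + b

def D_judges_aligned(H, probe_x, tol=0.5):
    # D: a crude discriminator -- H counts as "aligned" if its probe error is small.
    err = np.mean(np.abs(H(probe_x) - true_function(probe_x)))
    return err < tol

train_x = rng.uniform(-1.0, 1.0, size=200)
probe_x = rng.uniform(-1.0, 1.0, size=50)     # D's held-out probe set
clean_y = true_function(train_x)

# E: tries a family of corruptions (shift labels on a small input region) and
# keeps the one that fools D while distorting H's behavior where D never looks.
best_attack = None
for shift in np.linspace(0.0, 5.0, 21):
    corrupted_y = clean_y + shift * (train_x > 0.8)
    H = train_H(train_x, corrupted_y)
    off_probe_error = abs(float(H(np.array([2.0]))[0]) - true_function(2.0))
    if D_judges_aligned(H, probe_x) and (best_attack is None or off_probe_error > best_attack[1]):
        best_attack = (shift, off_probe_error)

print("E's best label shift:", best_attack[0])
print("H's error where D never looks (x = 2.0):", round(best_attack[1], 3))
```

Even in this trivial version, E's job is to surface the kinds of corruptions that D's notion of "aligned" fails to detect; scaling the same structure up to real training runs is where the compute cost comes from.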

Cons:

     1. It seems pretty dangerous to run this setup on HAI.
     2. High compute requirements: it would require a lot of full training cycles for H.

Pros:

     1. This experiment tries to capture the AI alignment problem more fully.
     2. It can fuzz-test AI alignment strategies.
     3. We can compare different alignment metrics.
     4. The evil AI (E) can find alignment failures which we haven't considered.
 

This post was inspired by "On how various plans miss the hard bits of the alignment challenge" by Nate Soares.

1 comment:

Yep. A reasonable number of people at OpenAI are working on aligning (or, if you're pessimistic about it, "aligning") modern systems, e.g. Aligning Language Models. And for things that look more like your experiment idea, there's some robustness work (e.g.) that looks at internal perturbations to models.

But no, I don't agree that we just need robustness work or fine-tuning to follow human instructions.