It seems to me that no matter how many problems from different research agendas we solve, there is always the possibility that some 'unknown unknown' misalignment scenario could occur. I can imagine an approach of building model-agnostic, environment-agnostic, minimal-assumption alignment guarantees (which seems extremely hard), but I feel like things could still go wrong in myriad other ways, even then.

Has there been any discussion about how we might go about this?


1 Answer

Gordon Seidoh Worley


Donald Hobson gives a comment below explaining some reasoning around dealing with unknown unknowns, but it's not a direct answer to the question, so I'll offer one.

The short answer is "yes".

The longer answer is that this is one of the fundamental considerations in approaching AI alignment and is why some organizations, like MIRI, have taken an approach that doesn't drive straight at the object-level problem and instead tackles issues likely to be foundational to any approach to alignment that could work. In fact you might say the big schism between MIRI and, say, OpenAI, is that MIRI places greater emphasis on addressing the unknown whereas OpenAI expects alignment to look more like an engineering problem with relatively small and not especially dangerous unknown unknowns.

(Note: I am not affiliated with either organization, so this is an informed opinion on their general approaches. Also note that neither organization is monolithic, and individual researchers vary greatly in their assessment of these risks.)

My own efforts on AI alignment are largely about these sorts of questions, because I think we still have a poor understanding of what alignment even means. In this sense I know that there is a lot we don't know, but I don't know all of what we don't know that we'll need to (so, known unknown unknowns).

2 comments

If you add ad hoc patches until you can't imagine any way for it to go wrong, you get a system that is too complex for you to imagine all of its failure modes. This is the "I can't figure out how this fails" scenario. It is going to fail for reasons that you didn't imagine.

If you understand why it can't fail, for deep fundamental reasons, then it's likely to work.

This is the difference between ordinary paranoia and the security mindset: the difference between adding complications until you can't figure out how to break the code, and proving that breaking the code is impossible (assuming the adversary can't get your one-time pad, it's only used once, your randomness is really random, your adversary doesn't have anthropic superpowers, etc.).

I would put the chance of serious failure in the first scenario at >99%, and in the second (assuming you're doing it well and the assumptions you rely on are things you have good reason to believe) at <1%.
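
To make the one-time-pad example concrete, here is a minimal Python sketch (mine, not part of the comment above). The point is that the security argument is unconditional only while its assumptions hold: the key is truly random, at least as long as the message, kept secret, and used exactly once.

```python
# Minimal one-time-pad sketch: XOR the message with a truly random,
# single-use key of equal length. The "proof it can't be broken" rests
# entirely on those assumptions holding.
import secrets

def otp_encrypt(message: bytes, key: bytes) -> bytes:
    """XOR each message byte with the corresponding key byte."""
    assert len(key) == len(message), "key must be as long as the message"
    return bytes(m ^ k for m, k in zip(message, key))

# Decryption is the same XOR operation applied again.
otp_decrypt = otp_encrypt

message = b"attack at dawn"
key = secrets.token_bytes(len(message))   # fresh randomness, never reused
ciphertext = otp_encrypt(message, key)
assert otp_decrypt(ciphertext, key) == message
```

Reuse the key even once, or draw it from a predictable generator, and the guarantee evaporates; that is exactly the kind of assumption-tracking the security mindset demands.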


It looks like the MIRI team has its hands full with known unknowns. The main one (of those released publicly) seems to be what they describe as “embedded agency” or “Agent Foundations” work:

our goal in working on embedded agency is to build up relevant insights and background understanding, until we collect enough that the alignment problem is more manageable