When do we know that a model is safe? I want to get a better grip on the basics of inner alignment. And by "basics" I mean the most fundamental basics, the most obvious conditions of safety.

For example: how do we know that Peano arithmetic is safe?

Context

When we talk about unsafe models, it might be useful to make the following distinctions:

  • Models which take bad actions directly (e.g. convert the universe into paperclips). / Models which cause bad outcomes indirectly. E.g. a question-answering model which influences the world through its messages.
  • Models which need to be physically implemented in the world to achieve their goals (e.g. a paperclip-maximizer). / Models which don't even need to be fully physically implemented. Question-answering models are an example again. As long as its answers are calculated, it can steer the world in a certain direction. Even if nobody runs the full model at any point in time.
  • Models which need to model X directly to affect X. / Models which can model Y, which is kinda similar to X, to affect X.[1]
  • Models which steer the world and understand what they're doing. Have explicit goals and search. / Models which steer the world, but don't understand what they're doing. No explicit goals and search. (See Mesa-Search vs Mesa-Control.)

To sum up, a malicious model can cause harm without taking actions in the world, understanding its own actions, modeling anything specific explicitly, or even just existing.

At this point it's natural to ask: wait, how do we know that anything is safe, what are the most basic conditions of safety? How do we know that a "rock" (e.g. Peano arithmetic) is safe? Yes, PA is highly interpretable, but saying "a model is safe if we can interpret it and see that it's safe" just begs the question.

Can Peano arithmetic harm us?

How could Peano arithmetic possibly harm us?

There could be a certain theorem T such that trying to prove it affects human society in a bad way, either because smart people waste time on it or because it interacts with human cognition in a bad way. The Collatz conjecture could be an example ;)
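For anyone who hasn't met it, the Collatz conjecture says that repeatedly halving even numbers and mapping odd numbers n to 3n + 1 always reaches 1 eventually. Here is a minimal sketch of the rule in Python; the function name `collatz_steps` is just a label picked for illustration, not anything standard.

```python
def collatz_steps(n: int) -> int:
    """Count applications of the Collatz rule needed to reach 1 from n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        # Even numbers are halved; odd numbers go to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


if __name__ == "__main__":
    # Print the step count for a few starting values.
    for start in (6, 7, 27):
        print(start, collatz_steps(start))
```

The rule fits in one line, yet whether the loop terminates for every positive integer is still an open problem, which is what makes it a plausible time sink.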

Yes, "is Peano arithmetic dangerous? how do we know? when do we care?" are really stupid questions, but I think there could be value in answering them seriously.

So, what are the most basic conditions of safety?

That's my core question. But I'll bring up some possible answers myself.

Trust in inductive biases.

We can trust that certain biases produce "true" and "canonical" information (information which can't optimize for anything weird). E.g. the simplicity bias.

Absence of prior unsafe optimization.

Peano arithmetic wasn't a result of any potentially unsafe optimization (that we know of). So it's unlikely to be optimized to cause harm.

  1. ^

     See Eliezer's quote about a hypothetical AI which models a programmer's psychology by modeling some properties of an object:

    Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2's psychology


1 Answer

Seth Herd


I worry very little about AI harming us by accident. I think that's a much lower class of unsafe than an intelligent being adopting the goal of harming you. And I think we're busily working toward building AGI that might do just that. So I don't worry about other classes of unsafe things much at all.

Tons of things like cars and the Facebook engagement algorithm are unsafe by changing human behavior in ways that directly and indirectly cause physical and psychological harm.

Being optimized to cause harm is another level of unsafe. Engineered and natural viruses are both optimized to infect you, but the engineered one is optimized for harm and so is probably much more dangerous.

The other crucial aspect is goal-directedness. Something that itself wants to harm you is much more dangerous. It will maneuver around unexpected obstacles to harm you, as best it can according to its intelligence or competence in that area.

That's why the most unsafe class is highly intelligent things that want to harm you because they have goals that conflict with yours. If instrumental convergence has made someone or something adopt an instrumental goal of getting you out of its way, that's far more dangerous than something that was merely optimized by an outside force to harm you (unless that outside force has the intelligence to outthink you, I suppose).

I don't worry about accidental or emergent dangers because we're building far more dangerous things on purpose, that may harm us on purpose.

1 comment

Peano arithmetic is a way of formulating all possible computations, so among the capable things in there, certainly some and possibly most won't cause good outcomes if given influence in the physical world (this depends on how observing human data tends to affect the simpler learners, whether there is some sort of alignment by default). Certainly Peano arithmetic doesn't have any biases specifically helpful for alignment, and it's plausible that there is no alignment by default in any sense, so it's necessary to have such a bias to get a good outcome.

But also, enumerating things from Peano arithmetic until something capable is encountered likely takes too much compute to be a practical concern. And anything that does find something capable won't be meaningfully described as enumerating things from Peano arithmetic; there will be too much structure in the way such search/learning is performed that's not about Peano arithmetic.
