Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.


Here's a quick sketch of a constructive version:

1) build a superintelligence that can model both humans and the world extremely accurately over long time-horizons. It should be approximately-Bayesian, and capable of modelling its own uncertainties, concerning both humans and the world, i.e. capable of executing the scientific method

2) use it to model, across a statistically representative sample of humans, how desirable they would say a specific state of the world X is

3) also model whether the modeled humans are in a state (drunk, sick, addicted, dead, suffering from religious fanaticism, etc) that for humans is negatively correlated with accuracy on evaluative tasks, and decrease the weight of their output accordingly

4) determine whether the humans would change their mind later, after learning more, thinking for longer, experiencing more of X, learning about or experiencing subsequent consequences of state X, etc - if so update their output accordingly

5) implement some chosen (and preferably fair) averaging algorithm over the opinions of the sample of humans

6) sum over the number of humans alive in state X and integrate over time 

7) estimate error bars by predicting when and how much the superintelligence and/or the humans it's modelling are operating out of distribution/in areas of Knightian uncertainty (for the humans, about how the world works, and for the superintelligence itself, both about how the world works and how humans think), and pessimize over these error bars sufficiently to overcome the Look-Elsewhere Effect for the size of your search space, in order to avoid Goodhart's Law

8) take (or at least well-approximate) argmax of steps 2)-7) over the set of all generically realizable states to locate the optimal state X*

9) determine the most reliable plan to get from the current state to the optimal state X* (allowing for the fact that along the way you will be iterating this process, and learning more, which may affect step 7) in future iterations, thus changing X*, so actually you want to prioritize retaining optionality and reducing prediction uncertainty, which implies you want to do Value Learning to reduce the uncertainty in modelling the humans' opinions)

10) Profit

Now, where were those pesky underpants gnomes?

[Yes, this is basically an approximately-Bayesian upgrade of AIXI with a value learned utility function rather than a hard-coded one. For a more detailed exposition, see my link above.]
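As a purely illustrative sketch (not the author's construction), steps 2)-8) can be caricatured as an aggregation pipeline. Every function, weight, and field name below is a hypothetical stand-in for the superintelligence's actual modelling, and step 6)'s population sum and time integral are omitted:

```python
# Illustrative-only sketch of steps 2)-8); every model here is a made-up
# stand-in for the superintelligence's approximately-Bayesian modelling.

def reliability_weight(person):
    """Step 3: down-weight raters in accuracy-impairing states."""
    return 0.1 if person["impaired"] else 1.0

def reflective_rating(person, state):
    """Steps 2 and 4: modelled rating, updated for later reflection."""
    return person["rating_of"][state] + person["shift_after_reflection"]

def aggregate_value(sample, state):
    """Step 5: weighted average over a representative sample of humans
    (step 6's population sum and time integral are omitted here)."""
    weights = [reliability_weight(p) for p in sample]
    ratings = [reflective_rating(p, state) for p in sample]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

def pessimized_value(sample, state, error_bar, search_bits):
    """Step 7: penalize uncertainty, scaled to the bits of search pressure."""
    return aggregate_value(sample, state) - error_bar * search_bits

def choose_state(sample, states, error_bars, search_bits):
    """Step 8: (approximate) argmax over realizable candidate states."""
    return max(states,
               key=lambda s: pessimized_value(sample, s, error_bars[s],
                                              search_bits))
```

Note how a state with a high raw average but large error bars can lose the argmax to a slightly worse but better-understood state, which is the whole point of step 7.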

argmax of steps 2)-7) over the set of all generically realizable states

Argmax search is dangerous. If you want something "constructive" I think you probably want to more carefully model the selection process.

That's the point of step 7)

I'm not particularly sold on the idea of launching a powerful argmax search and then doing a bit of handwaving to fix it.

It's like if you wanted a childminder to look after your young child, and you set off an argmax search to find the argmax of a function that looks like (quality) / (cost), and then afterwards try to sort out whether your results are somehow broken/Goodharted.

If your argmax search is over 20 local childminders then that's probably fine.

But if it's an argmax search over all possible states of matter occupying an 8 cubic meter volume then... uh yeah that's really dangerous.

The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then pessimize harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is a rather large multiple of Avogadro's number of bits, the latter just over 4 bits). So we have to stay inside what we believe we know a great deal harder in the former case.

In other words, the point you're raising is already addressed, in a quantified way, by the approach I'm outlining. Indeed, on some level the main point of my suggestion is that there is a quantified and theoretically motivated way of dealing with exactly this problem. The handwaving above is just a very brief summary, accompanied by a link to a much more detailed post containing and explaining the details with a good deal less handwaving.
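The bit-counting here is easy to make concrete. A minimal sketch (the atom and placement figures are made-up round numbers, purely to show the scale gap):

```python
import math

def optimization_bits(num_candidates):
    """Bits of optimization pressure applied by selecting 1 of N candidates."""
    return math.log2(num_candidates)

# 20 local childminders: just over 4 bits of selection pressure.
small_search = optimization_bits(20)     # about 4.32 bits

# Illustrative guess only: if ~1e27 atoms each had ~1e6 distinguishable
# placements, the candidate count would be (1e6)**1e27, i.e. roughly 2e28
# bits of pressure -- the pessimization penalty must scale accordingly.
large_search = 1e27 * math.log2(1e6)     # about 2.0e28 bits
```

The point of the comparison is that the penalty in step 7) is not a fixed fudge factor but grows with the log of the search-space size.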

Trying to explain this piecemeal in a comments section isn't very efficient: I suggest you go read Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect for my best attempt at a detailed exposition of this part of the suggestion. If you still have criticisms or concerns after reading that, then I'd love to discuss them there.

ok that's a fair point, I'll take a look but I am still skeptical about being able to do this in practice because in practice the universe is messy.

e.g. if you're looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 cube and start futzing with the results of that search I think it will go wrong.

But if you adopt some constructive approach with some empirically grounded heuristics I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc.

Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don't understand.

If your argument is, "if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output", then, that's true I suppose?

I don't see why you specified "finite depth boolean circuit".

While it does seem like the number of states for a given region of space is bounded, I'm not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.

I guess maybe that's why you mentioned the thing along the lines of "assume that some amount of wiggle room that is tolerated" ?
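The cardinality point can be checked directly on a toy finite case (the sizes here are arbitrary): a region with n distinguishable states admits 2^n boolean functions on those states, so for any n most functions cannot themselves be states.

```python
from itertools import product

n_states = 3                      # a toy "region" with 3 distinguishable states
states = list(range(n_states))

# Every function from states to {0, 1}, encoded as a tuple of outputs.
boolean_functions = list(product((0, 1), repeat=n_states))

# 2**n functions vs. only n states: most functions are not realizable as states.
assert len(boolean_functions) == 2 ** n_states
assert len(boolean_functions) > len(states)
```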

One thing you say is that the set of superintelligences is a subset of the set of finite-depth boolean circuits. Later, you say that a lookup table is implementable as a finite-depth boolean circuit, and say that some such lookup table is the aligned superintelligence. But, just because it can be expressed as a finite-depth boolean circuit, it does not follow that it is in the set of possible superintelligences. How are you concluding that such a lookup table constitutes a superintelligence? It seems

 

Now, I don't think that "aligned superintelligence" is logically impossible, or anything like that, and so I expect that there mathematically-exists a possible aligned-superintelligence (if it isn't logically impossible, then by model existence theorem, there exists a model in which one exists... I guess that doesn't establish that we live in such a model, but whatever).

But I don't find this argument a compelling proof(-sketch).

if it isn't logically impossible

Until I wrote this proof, it was a live possibility that aligned superintelligence is in fact logically impossible.

Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons

All cardinalities here are finite. The set of generically realizable states is a finite set because they each have a finite and bounded information content description (a list of instructions to realize that state, which is not greater in bits than the number of neurons in all the human brains on Earth).
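As a rough back-of-envelope check on that finiteness claim (the neuron and population figures below are approximate stand-ins, not exact values):

```python
# Back-of-envelope version of the counting bound; all figures are rough.
neurons_per_brain = 8.6e10            # commonly cited human estimate
world_population = 8e9                # approximate
max_description_bits = neurons_per_brain * world_population   # ~6.9e20 bits

# Binary descriptions of length <= N bits number 2**(N + 1) - 1:
# astronomically large, but finite, which is all the argument requires.
```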

How are you concluding that such a lookup table constitutes a superintelligence?

Isn't it enough that it achieves the best possible outcome? What other criteria do you want a "superintelligence" to have?

The relevance to alignment is that the state you want is the one that is reached.

I think the main problem with the argument in the linked text is that it is too static. One is not looking for a static outcome, one is looking for a process with some properties.

And it might be that the set of properties one wants is contradictory. (I am not talking about my viewpoint, but about a logical possibility.)

For example, it might potentially be the case that there are no processes where superintelligence is present and the chances of "bad" things with "badness" exceeding some large threshold are small (for a given definition of "bad" and "badness"). That might be one possible way to express the conjecture about "impossibility of aligned superintelligence".

(I am not sure how one could usefully explore such a topic, it's all so vague, and we just don't know enough about our reality.)

it might be that the set of properties one wants is contradictory.

So how is that a problem with AI alignment? If you want something that is impossible, it should come as no surprise that an AI cannot achieve it for you.

(I am not talking about my viewpoint, but about a logical possibility.)

If it so happens that the property of the world is such that

there are no processes where superintelligence is present and the chances of "bad" things with "badness" exceeding some large threshold are small

but at the same time world lines where the chances of "bad" things with "badness" exceeding some large threshold are small do exist, then one has to avoid having superintelligence in order to have a chance at keeping probabilities of some particularly bad things low.

That is what people essentially mean when they say "ASI alignment is impossible". The situation where something "good enough" (low chances of certain particularly bad things happening) is only possible in the absence of superintelligence, but is impossible when superintelligence is present.

So, they are talking about a property of the world where certain unacceptable deterioration is necessarily linked to the introduction of superintelligence.


I am not talking about my viewpoint, but about a logical possibility. But I don't think your proof addresses that. In particular, because a directed acyclic graph is not a good model. We need to talk about a process, not a static state, so the model must be recurrent (if it's a directed acyclic graph, it must be applied in a fashion which makes the overall thing recurrent, for example in an autoregressive mode).

And we are talking about superintelligence which is usually assumed to be capable of a good deal of self-modifications and recursive self-improvement, so the model should incorporate that. The statement of "impossibility of sufficiently benign forms of superintelligence" might potentially have a form of a statement of "impossibility of superintelligence which would refrain from certain kinds of self-modification, with those kinds of self-modification having particularly unacceptable consequences".

And it's not enough to draw a graph which refrains from self-modification, because one can argue that a model which agrees to constrain itself in such a radical fashion as to never self-modify in an exploratory fashion is fundamentally not superintelligent (even humans often self-modify when given an opportunity and seeing a potential upside).

a model which agrees to constrain itself in such a radical fashion as to never self-modify in an exploratory fashion is fundamentally not superintelligent

OK, what is your definition of "superintelligent"?

Being able to beat humans in all endeavours by miles.

That includes the ability to explore novel paths.

Being able to beat humans

What do you mean by humans? How large a group of humans? Infinite?

10 billion

But then it is possible for an AI to be able to beat up to 10 billion humans in all endeavours by miles, but also not modify itself.

In fact, I can prove that such an AI exists.

So you have two different and contradictory definitions of "superintelligence" that you are using.

A realistic one, which can competently program and can competently do AI research?

Surely, since humans do pretty impressive AI research, a superintelligent AI will do better AI research.

What exactly might (even potentially) prevent it from creating drastically improved variants of itself?

A superintelligence based on the first definition you gave (Being able to beat humans in all endeavours by miles) would be able to beat humans at AI research, but it would also be able to beat humans at not doing AI research.

So, by your own definition, in order to be a superintelligence, it must be able to spend the whole lifetime of the universe not doing AI research.

You mean, a version which decides to sacrifice exploration and self-improvement, despite it being so tempting...

And that after doing quite a bit of exploration and self-improvement (otherwise it would not have gotten to the position of being powerful in the first place).

But then deciding to turn around drastically and become very conservative, and to impose a new "conservative on a new level world order"...

Yes, that is a logical possibility...

Yes, OK.

I doubt that an adequate formal proof is attainable, but a mathematical existence of a "lucky one" is not implausible...

Yes, an informal argument is that if it is way smarter and way more capable than humans, then it potentially should be better at being able to refrain from exercising the capabilities.

In this sense, the theoretical existence of a superintelligence which does not make things worse than they would be without existence of this particular superintelligence seems very plausible, yes... (And it's a good definition of alignment, "aligned == does not make things notably worse".)

so these two considerations

if it is way smarter and way more capable than humans, then it potentially should be better at being able to refrain from exercising the capabilities

and

"aligned == does not make things notably worse"

taken together indeed constitute a nice "informal theorem" that the claim of "aligned superintelligence being impossible" looks wrong. (I went back and added my upvotes to this post, even though I don't think the technique in the linked post is good.)

I don't think the technique in the linked post is good.

why not?

I think I said already.

  1. We are not aiming for a state to be reached. We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that. It does not talk about invariant properties of processes and other such things, which one needs to care about when trying to maintain properties of processes.

  2. We don't know fundamental physics. We don't know the actual nature of quantum space-time, because quantum gravity is unsolved; we don't know what the "true logic" of the physical world is, and so on. There is no reason why one can rely on simple-minded formalisms, on standard Boolean logic, on discrete tables and so on, if one wants to establish something fundamental, when we don't really know the nature of the reality we are trying to approximate.

There are a number of reasons a formalization could fail even if it goes as far as proving the results within a theorem prover (which is not the case here). The first and foremost of those reasons is that the formalization might fail to capture reality with a sufficient degree of faithfulness. That is almost certainly the case here.


But then a formal proof (an adequate version of which is likely to be impossible at our current state of knowledge) is not required. A simple informal argument above is more to the point. It's a very simple argument, and so it makes the idea that "aligned superintelligence might be fundamentally impossible" very unlikely to be true.

First of all, one step this informal argument is making is weakening the notion of "being aligned". We are only afraid of "catastrophic misalignment", so let's redefine alignment as something simple which avoids that. An AI which sufficiently takes itself out of action does achieve that. (I actually asked for something a bit stronger, "does not make things notably worse"; that's also not difficult, via the same mechanism of taking oneself sufficiently out of action.)

And a strongly capable AI should be capable of taking itself out of action, of refraining from doing things. The capability to choose is an important capability; a strongly capable system is a system which, in particular, can make choices.

So, yes, a very capable AI system can avoid being catastrophically misaligned, because it can choose to avoid action. This is that non-constructive proof of existence which has been sought. It's an informal proof, but that's fine.

No extra complexity is required, and no extra complexity would make this argument better or more convincing.

We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that.

You can run all the same arguments I used, but talk about processes rather than states.

On one hand, you still assume too much:

Since our best models of physics indicate that there is only a finite amount of computation that can ever be done in our universe

No, nothing like that is at all known. It's not a consensus. There is no consensus that the universe is computable; this is very much a minority viewpoint, and it might always make sense to augment a computer with a (presumably) non-computable element (e.g. a physical random number generator, an analog circuit, a camera, a reader of human real-time input, and so on). AI does not have to be a computable thing, it can be a hybrid. (In fact, when people model real-world computers as Turing machines instead of modeling them as Turing machines with oracles, with the external world being the oracle, it leads to all kinds of problems, e.g. Penrose's well-known "Gödel argument" makes this mistake and falls apart as soon as one remembers the presence of the oracle.)

Other than that...

Yes, you have an interesting notion of alignment. Not something which we might want, and might be possible, but might be unachievable by mere humans, but something much weaker than that (although not as weak as the version I put forward, my version is super-weak, and your version is intermediate in strength):

I claim then that for any generically realizable desirable outcome that is realizable by a group of human advisors, there must exist some AI which will also realize it.

Yes, this is obviously correct. An ASI can choose to emulate a group of humans and its behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed.

One does not need to say anything else to establish that.

No, nothing like that is at all known. It's not a consensus

I disagree; modern physics places various bounds on compute, such as the Bekenstein bound.

https://en.wikipedia.org/wiki/Bekenstein_bound

If your objection to my proof involves infinite compute then I am happy to acknowledge that I honestly do not know what happens in that case. It is plausible that since humans are finite in complexity/information/compute, a world with infinite compute would break the symmetry between computers and humans that I am using here. Most likely it means that computers are capable of fundamentally superior outcomes, so there would be "hyperaligned" AIs. But since infinite compute is a minority position I will not pursue it.
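For concreteness, the Bekenstein bound I ≤ 2πRE/(ħc ln 2) is easy to evaluate numerically. A sketch, with the 1 kg / 1 m example chosen arbitrarily for illustration:

```python
import math

HBAR = 1.054571817e-34    # reduced Planck constant, J*s
C = 2.99792458e8          # speed of light, m/s

def bekenstein_bound_bits(radius_m, energy_j):
    """Bekenstein bound: maximum bits storable in a sphere of the given
    radius and total energy, I <= 2*pi*R*E / (hbar * c * ln 2)."""
    return 2 * math.pi * radius_m * energy_j / (HBAR * C * math.log(2))

# e.g. 1 kg of mass-energy (E = m c^2) confined to a 1 m radius sphere:
bits = bekenstein_bound_bits(1.0, 1.0 * C ** 2)    # roughly 2.6e43 bits
```

Note that, as stated, this bounds information storage in a finite region rather than computation per se.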

I don't see what the entropy bound has to do with compute. The Bekenstein bound is not much in question, but its link to compute is a different story. It does seem to limit how many bits can be stored in a finite volume (so for potentially infinite compute, unlimited spatial expansion is needed).

But it does not say anything about possibilities of non-computable processes. It's not clear if "collapse of the wave function" is computable, and it is typically assumed not to be computable. So powerful non-Turing-computable oracles seem likely to be available (that's much more than "infinite compute").

But I also think all these technicalities constitute an overkill, I don't see them as at all relevant.

This seems rather obvious regardless of the underlying model:

An ASI can choose to emulate a group of humans and its behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed.

This seems obviously true, no matter what.


I don't see why a more detailed formalization would help to further increase certainty. Especially when there are so many questions about that formalization.

If the situation were different, if the statement would not be obvious, even a loose formalization might help. But when the statement seems obvious, the standards a formalization needs to satisfy to further increase our certainty in the truth of the statement become really high...

"collapse of wave function" is computable, and it is typically assumed not to be computable

The wavefunction never actually collapses if you believe in MWI. Rather, a classical reality emerges in all branches thanks to decoherence.

If you think something noncomputable happens because of quantum mechanics, it probably means that your interpretation of QM is wrong and you need to read the Sequences on that.

If you believe in MWI, then this whole argument is... not "wrong", but very incomplete...

Where is the consideration of branches? What does it mean for one entity to be vastly superior to another, if there are many branches?

If one believes in MWI, then the linked proof does not even start to look like a proof. It obviously considers only a single branch.

And a "subjective navigation" in the branches is not assumed to be computable, even if the "objective multiverse" is computable; that is the whole point of MWI: the "collapse" becomes "subjective navigation", but this does not make it computable. If a consideration is only of a single branch, that branch is not computable, even if it is embedded in a large computable multiverse.

Not every subset of a computable set (say, of a set of natural numbers) is computable.


An interpretation of QM can't be "wrong". It is a completely open research and philosophical question, there is no "right" interpretation, and the Sequences is (thankfully) not a Bible (if even a very respected thinker says something, this does not yet mean that one should accept that without questions).

It obviously considers only a single branch.

Thanks to decoherence, you can just ignore any type of interference and treat each branch as a single classical universe.

I don't think so. If it were classical, we would not be able to observe effects of double-slit experiments and so on.

And, also, there is no notion of "our branch" until one has traveled along it. At any given point in time, there are many branches ahead. Only looking back one can speak about one's branch. But looking forward one can't predict the branch one will end up in. One does not know the results of future "observations"/"measurements". This is not what a classical universe looks like.

(Speaking of MWI, I recall David Deutsch's "Fabric of Reality" very eloquently explaining effects from "neighboring branches". The reason I am referencing this book is that this was the work particularly strongly associated with MWI back then. So I think we should be able to rely on his understanding of MWI.)

one can't predict the branch one will end up in

yes one can - all of them!

Yes, but then what do you want to prove?

Something like, "for all branches, [...]"? That might be not that easy to prove or even to formulate. In any case, the linked proof has not even started to deal with this.

Something like, "there exist a branch such that [...]"? That might be quite tractable, but probably not enough for practical purposes.

"The probability that one ends up in a branch with such and such properties is no less than/no more than" [...]? Probably something like that, realistically speaking, but this still needs a lot of work, conceptual and mathematical...

we would not be able to observe effects of double-slit experiments

yes, but thanks to decoherence this generally doesn't affect macroscopic variables. Branches are causally independent once they have split.

No. I can only repeat my reference to Fabric of Reality as a good presentation of MWI and to remind that we do not live in a classical world, which is easy to confirm empirically.

And there are plenty of known macroscopic quantum effects already, and that list will only grow. Lasers are quantum, superfluidity and superconductivity are quantum, and so on.

And I personally think that superintelligence leading to good trajectories is possible. It seems unlikely that we are in a reality where there is a theorem to the contrary.

It feels intuitively likely that it is possible to have superintelligence or the ecosystem of superintelligences which is wise enough to be able to navigate well.

But I doubt that one is likely to be able to formally prove that.

But I doubt that one is likely to be able to formally prove that.

E.g. it is possible that we are in a reality where very cautious and reasonable, but sufficiently advanced experiments in quantum gravity lead to a disaster.

Advanced systems are likely to reach those capabilities, and they might make very reasonable estimates that it's OK to proceed, but due to bad luck of being in a particularly unfortunate reality, the "local neighborhood" might get destroyed as a result... One can't prove that it's not the case...

Whereas, if the level of overall intelligence remains sufficiently low, we might not be able to ever achieve the technical capabilities to get into the danger zone...

It is logically possible that the reality is like that.

It is logically possible that the reality is like that.

Yes, it is. But even if that is the case, by the argument given in this post, there must exist an AI system that avoids the danger zone.

Yes, possibly.

Not by the argument given in the post (considering quantum gravity, one immediately sees how inadequate and unrealistic is the model in the post).

But yes, it is possible that they will be so wise that they will be cautious enough even in a very unfortunate situation.

Yes, I was trying to explicitly refute your claim, but my refutation has holes.

(I don't think you have a valid proof, but this is not yet a counterexample.)

there are no processes where superintelligence is present and the chances of "bad" things with "badness" exceeding some large threshold are small

Do you think a team of sufficiently wise humans is capable of producing a world where the chances of "bad" things with "badness" exceeding some large threshold are small? Yes or no?

(I am not talking about my viewpoint, but about a logical possibility.)

In particular, humans might be able to refrain from screwing the world too badly, if they avoid certain paths.

(No, personally I don't think so. If people crack down hard enough, they probably screw up the world pretty badly due to the crackdown, and if they don't crack down hard enough, then people will explore various paths leading to bad trajectories, via superintelligence or via other more mundane means. I personally don't see a safe path, and I don't know how to estimate probabilities. But it is not a logical impossibility. E.g. if someone makes all humans dumb by putting a magic irreversible stupidifier in the air and water, perhaps those things can be avoided, hence it is logically possible. Do I want "safety" at this price? No, I think it's better to take risks...)

humans might be able to refrain from screwing the world too badly

But then, if a team of humans is capable of producing a world where the chances of "bad" things with "badness" exceeding some large threshold are small, by exactly the argument given in this post there must be a Lookup Table which simply contains the same boolean function.

So, your claim is provably false. It is not possible for something (anything) to be generically achievable by humans but not by AI, and you're just hitting a special case of that.
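To make the lookup-table claim concrete, here is a minimal existence-proof sketch; the situations and outputs are placeholder names, and nothing about it is meant to be practical:

```python
# Existence-proof sketch only: any *finite* mapping from situations to the
# (verbal or mechanical) outputs a group of humans would produce can be
# stored verbatim. The real table would be astronomically large; the
# situation/output names here are made up.
human_outputs = {
    "situation_1": "output_a",
    "situation_2": "output_b",
}

def lookup_table_agent(situation):
    return human_outputs[situation]

# By construction, the table reproduces the humans' outputs exactly.
assert all(lookup_table_agent(s) == out for s, out in human_outputs.items())
```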

No, they are not "producing". They are just being impotent enough. Things are happening on their own...

And I don't believe a Lookup Table is a good model.

They are just being impotent enough

An AI can also be impotent. Surely this is obvious to you? Have you not thought this through properly?

It can. Then it is not "superintelligence".

Superintelligence is capable of almost unlimited self-improvement.

(Even our miserable recursive self-improvement AI experiments show rather impressive results before saturating. Well, they will not keep saturating forever. Currently, this self-improvement typically happens via rather awkward and semi-competent generation of novel Python code. Soon it will be done by better means (which we probably should not discuss here).)

By your own definition of "superintelligence", it must be better at "being impotent" than any group of fewer than 10 billion humans. So it must be super-good at being impotent and doing very little, if that is required.

Being impotent is not a property of "being good". One is not aiming for that.

It's just a limitation. One usually does not self-impose it (with rare exceptions), although one might want to impose it on adversaries.

"Being impotent" is always worse. One can't be "better at it".

One can be better at refraining from exercising the capability (we have a different branch in this discussion for that).

One can be better at refraining from exercising the capability

If that is what is needed then it must (by definition) be better at it

Not if it is disabling.

If it is disabling, then one has a self-contradictory situation (if ASI fundamentally disables itself, then it stops being more capable, and stops being an ASI, and can't keep exercising its superiority; it's the same as if it self-destructs).

If a superintelligence is worse than a human at permanently disabling itself - given that as the only required task - then there is a task that it is subhuman at and therefore not a superintelligence.

I suppose you could make some modifications to your definition to take account of this. But in any case, I think it's not a great definition, as it makes an implicit assumption about the structure of problems (that basically problems have a single "scalar" difficulty).

No, it can disable itself.

But it is not a solution, it is a counterproductive action. It makes things worse.

(In some sense, it has an obligation not to irreversibly disable itself.)

Your proof actually fails to fully account for the fact that any ASI must actually exist in the world. It would affect the world other than just through its outputs - e.g. if its computation produces heat, that heat would also affect the world. Your proof does not show that the sum of all effects of the ASI on the world (both intentional and side-effects of it performing its computation) could be aligned. Further, real computation takes time - it's not enough for the aligned ASI to produce the right output, it also needs to produce it at the right time. You did not prove that to be possible.

it's not enough for the aligned ASI to produce the right output, it also needs to produce it at the right time

Yes, but again this is a mathematical object, so it has effectively infinitely fast compute. But I can also prove that FA:BGROW - FA for "functional approximation" - will require less thinking time than human brains.

fact that any ASI must actually exist in the world

It's a mathematical existence proof that the ASI exists as a mathematical object, so this part is not necessary. However, I can also argue quite convincingly that an ASI similar to LT:BGROW (let's call it FA:BGROW, FA for "functional approximation") must easily fit in the world and also emit less waste heat than a team of human advisors.

We’ll say that a state is in fact reachable if a group of humans could in principle take actions with actuators - hands, vocal cords, etc - that could realize that state.

The main issue here is that groups of humans may in principle be capable of great many things, but there's a vast chasm between "in principle" and "in practice". A superintelligence worthy of the name would likely be able to come up with plans that we wouldn't in practice be able to even check exhaustively, which is the sort of issue that we want alignment for.

This is not a problem for my argument. I am merely showing that any state reachable by humans, must also be reachable by AIs. It is fine if AIs can reach more states.

Hmm, right. You only need assume that there are coherent reachable desirable outcomes. I'm doubtful that such an assumption holds, but most people probably aren't.

I'm doubtful that such an assumption holds

Why?

Because humans have incoherent preferences, and it's unclear whether a universal resolution procedure is achievable. I like how Richard Ngo put it, "there’s no canonical way to scale me up".