Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.

Here's a quick sketch of a constructive version:

1) build a superintelligence that can model both humans and the world extremely accurately over long time-horizons. It should be approximately-Bayesian, and capable of modelling its own uncertainties, concerning both humans and the world, i.e. capable of executing the scientific method

2) use it to model, across a statistically representative sample of humans, how desirable they would say a specific state of the world X is

3) also model whether the modeled humans are in a state (drunk, sick, addicted, dead, suffering from religious fanaticism, etc) that for humans is negatively correlated with accuracy on evaluative tasks, and decrease the weight of their output accordingly

4) determine whether the humans would change their mind later, after learning more, thinking for longer, experiencing more of X, learning about or experiencing subsequent consequences of state X, etc - if so update their output accordingly

5) implement some chosen (and preferably fair) averaging algorithm over the opinions of the sample of humans

6) sum over the number of humans alive in state X and integrate over time 

7) estimate error bars by predicting when and ... (read more)
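As a rough illustration of steps 2) to 6) above, here is a minimal Python sketch. It assumes the superintelligence's human-modelling has already produced a numeric rating per sampled person; the `ModeledHuman` fields, the reliability-weighted mean standing in for the "fair averaging algorithm", and the discrete yearly population list are all hypothetical choices made for illustration, not part of the original proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModeledHuman:
    rating: float                              # step 2: modeled desirability of state X, in [0, 1]
    reliability: float                         # step 3: weight in [0, 1], lower if impaired
    reflective_rating: Optional[float] = None  # step 4: rating after reflection, if it differs


def considered_rating(h: ModeledHuman) -> float:
    """Step 4: prefer the rating the person would give after learning and thinking more."""
    return h.rating if h.reflective_rating is None else h.reflective_rating


def aggregate(sample: List[ModeledHuman]) -> float:
    """Step 5 (assumed form): reliability-weighted mean over the sample."""
    total_weight = sum(h.reliability for h in sample)
    if total_weight == 0:
        raise ValueError("no reliable raters in sample")
    return sum(h.reliability * considered_rating(h) for h in sample) / total_weight


def state_value(sample: List[ModeledHuman], population_by_year: List[int]) -> float:
    """Step 6: sum over people alive in state X, integrated over discrete time steps."""
    per_person = aggregate(sample)
    return sum(per_person * n for n in population_by_year)


sample = [
    ModeledHuman(rating=0.8, reliability=1.0),
    ModeledHuman(rating=0.2, reliability=0.5, reflective_rating=0.6),
]
value = state_value(sample, population_by_year=[10, 20])
```

Any real version would need the hard part, the modelling itself, which this sketch simply takes as given.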

4Roko
Argmax search is dangerous. If you want something "constructive" I think you probably want to more carefully model the selection process.
2RogerDearnaley
That's the point of step 7)
2Roko
I'm not particularly sold on the idea of launching a powerful argmax search and then doing a bit of handwaving to fix it. It's like if you wanted a childminder to look after your young child, and you set off an argmax search to find the argmax of a function that looks like (quality) / (cost) and then afterwards tried to sort out whether your results are somehow broken/Goodharted. If your argmax search is over 20 local childminders then that's probably fine. But if it's an argmax search over all possible states of matter occupying an 8 cubic meter volume then... uh yeah that's really dangerous.

The pessimizing over Knightian uncertainty is a graduated way of telling the model to basically "tend to stay inside the training distribution". Adjusting its strength enough to overcome the Look-Elsewhere Effect means we estimate how many bits of optimization pressure we're applying and then do the pessimizing harder depending on that number of bits, which, yes, is vastly higher for all possible states of matter occupying an 8 cubic meter volume than for a 20-way search (the former is going to be a rather large multiple of Avogadro's number of bits, the latter is just over 4 bits). So we have to stay inside what we believe we know a great deal harder in the former case. In other words, the point you're raising is already addressed, in a quantified way, by the approach I'm outlining. Indeed on some level the main point of my suggestion is that there is a quantified and theoretically motivated way of dealing with exactly this problem. The handwaving above is just a very brief summary, accompanied by a link to a much more detailed post containing and explaining the details with a good deal less handwaving.

Trying to explain this piecemeal in a comments section isn't very efficient: ... (read more)
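To make the "bits of optimization pressure" idea concrete, here is a toy Python sketch. It is not the method from the linked post, whose details are not reproduced here. It assumes each candidate comes with a value estimate and an uncertainty, treats estimation errors as roughly Gaussian, and uses the standard fact that the expected maximum of N independent standard normal draws is at most about sqrt(2 ln N) to decide how hard to pessimize. The function names and the candidate tuples are hypothetical.

```python
import math


def optimization_bits(num_candidates: int) -> float:
    """Bits of selection pressure in an argmax over num_candidates options."""
    return math.log2(num_candidates)


def pessimism_multiplier(num_candidates: int) -> float:
    """Assumed penalty scale: sqrt(2 ln N), the approximate size of the largest
    of N standard normal noise terms (the Look-Elsewhere correction in this toy)."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return math.sqrt(2.0 * math.log(num_candidates))


def pessimized_argmax(candidates):
    """candidates: list of (name, estimated_value, uncertainty).
    Picks the best candidate by estimate minus k * uncertainty."""
    k = pessimism_multiplier(len(candidates))
    return max(candidates, key=lambda c: c[1] - k * c[2])[0]


bits_for_20 = optimization_bits(20)  # the 20-way childminder search
choice = pessimized_argmax([
    ("familiar", 1.0, 0.1),  # well inside what we believe we know
    ("exotic", 1.5, 1.0),    # higher estimate, much larger uncertainty
])
```

In this toy the penalty grows with the size of the search, so a larger search space pushes the choice further toward low-uncertainty candidates.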

2Roko
ok that's a fair point, I'll take a look but I am still skeptical about being able to do this in practice because in practice the universe is messy. e.g. if you're looking for an optimal practical babysitter and you really do start a search over all possible combinations of matter that fit inside a 2x2x2 cube and start futzing with the results of that search I think it will go wrong. But if you adopt some constructive approach with some empirically grounded heuristics I expect it will work much better. E.g. start with a human. Exclude all males (sorry bros!). Exclude based on certain other demographics which I will not mention on LW. Exclude based on nationality. Do interviews. Do drug tests. Etc. Your set of states of a 2x2x2 cube of matter will contain all kinds of things that are bad in ways you don't understand.

If your argument is, "if it is possible for humans to produce some (verbal or mechanical) output, then it is possible for a program/machine to produce that output", then, that's true I suppose?

I don't see why you specified "finite depth boolean circuit".

While it does seem like the number of states for a given region of space is bounded, I'm not sure how relevant this is. Not all possible functions from states to {0,1} (or to some larger discrete set) are implementable as some possible state, for cardinality reasons.

I guess maybe that's why you mentioned th... (read more)

2Roko
Until I wrote this proof, it was a live possibility that aligned superintelligence is in fact logically impossible.
2Roko
All cardinalities here are finite. The set of generically realizable states is a finite set because they each have a finite and bounded information content description (a list of instructions to realize that state, which is not greater in bits than the number of neurons in all the human brains on Earth).
1drocta
Yes, I knew the cardinalities in question were finite. The point applies regardless though. For any set X, there is no injection from 2^X to X. In the finite case, this is 2^n > n for all natural numbers n. If there are N possible states, then the number of functions from possible states to {0,1} is 2^N , which is more than N, so there is some function from the set of possible states to {0,1} which is not implemented by any state.
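The counting argument above can be checked by hand for small N. The Python sketch below enumerates all functions from N states to {0,1} and uses the usual diagonal construction to produce one that no state implements; the representation of a function as a tuple of bits, and the names `all_predicates` and `unimplemented_predicate`, are illustrative assumptions.

```python
from itertools import product


def all_predicates(n):
    """All 2**n functions from states {0, ..., n-1} to {0, 1}, as bit tuples."""
    return list(product((0, 1), repeat=n))


def unimplemented_predicate(implements):
    """implements[s] is the predicate (bit tuple) that state s implements.
    The returned predicate differs from implements[s] at input s, for every s,
    so it cannot be equal to any of them."""
    return tuple(1 - implements[s][s] for s in range(len(implements)))


n = 3
predicates = all_predicates(n)               # 2**3 = 8 predicates, only 3 states
implements = [(0, 0, 0), (1, 1, 1), (0, 1, 0)]  # an arbitrary assignment
missing = unimplemented_predicate(implements)
```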
2Roko
I never said it had to be implemented by a state. That is not the claim: the claim is merely that such a function exists.
2Roko
Isn't it enough that it achieves the best possible outcome? What other criteria do you want a "superintelligence" to have?
1drocta
Not if the point of the argument is to establish that a superintelligence is compatible with achieving the best possible outcome. Here is a parody of the issue, which is somewhat unfair and leaves out almost all of your argument, but which I hope makes clear the issue I have in mind: "Proof that a superintelligence can lead to the best possible outcome: Suppose by some method we achieved the best possible outcome. Then, there's no properties we would want a superintelligence to have beyond that, so let's call however we achieved the best possible outcome, 'a superintelligence'. Then, it is possible to have a superintelligence produce the best possible outcome, QED." In order for an argument to be compelling for the conclusion "It is possible for a superintelligence to lead to good outcomes." you need to use a meaning of "a superintelligence" in the argument, such that the statement "It is possible for a superintelligence to lead to good outcomes", when interpreted with that meaning of "a superintelligence", produces the meaning you want that sentence to have? If I argue "it is possible for a superintelligence, by which I mean computer with a clock speed faster than N, to lead to good outcomes", then, even if I convincingly argue that a computer with a clock speed faster than N can lead to good outcomes, that shouldn't convince people that it is possible for a superintelligence, in the sense that they have in mind (presumably not defined as "a computer with a clock speed faster than N"), is compatible with good outcomes. Now, in your argument you say that a superintelligence would presumably be some computational process. True enough! If you then showed that some predicate is true of every computational process, you would then be justified in concluding that that predicate is (presumably) true of every possible superintelligence. But instead, you seem to have argued that a predicate is true of some computational process, and then concluded that it is therefore tr
2Roko
The problem with this is that people use the word "superintelligence" without a precise definition. Clearly they mean some computational process. But nobody who uses the term colloquially defines it. So, I will make the assertion that if a computational process achieves the best possible outcome for you, it is a superintelligence. I don't think anyone would disagree with that. If you do, please state what other properties you think a "superintelligence" must have other than being a computational process that achieves the best possible outcome.

The relevance to alignment is that the state you want is the one that is reached.

I think the main problem with the argument in the linked text is that it is too static. One is not looking for a static outcome, one is looking for a process with some properties.

And it might be that the set of properties one wants is contradictory. (I am not talking about my viewpoint, but about a logical possibility.)

For example, it might potentially be the case that there are no processes where superintelligence is present and the chances of "bad" things with "badness" e... (read more)

2Roko
So how is that a problem with AI alignment? If you want something that is impossible, it should come as no surprise that an AI cannot achieve it for you.
5mishka
(I am not talking about my viewpoint, but about a logical possibility.) If it so happens that the property of the world is such that but at the same time world lines where the chances of "bad" things with "badness" exceeding some large threshold are small do exist, then one has to avoid having superintelligence in order to have a chance at keeping probabilities of some particularly bad things low. That is what people essentially mean when they say "ASI alignment is impossible". The situation where something "good enough" (low chances of certain particularly bad things happening) is only possible in the absence of superintelligence, but is impossible when superintelligence is present. So, they are talking about a property of the world where certain unacceptable deterioration is necessarily linked to the introduction of superintelligence. ---------------------------------------- I am not talking about my viewpoint, but about a logical possibility. But I don't think your proof addresses that. In particular, because a directed acyclic graph is not a good model. We need to talk about a process, not a static state, so the model must be recurrent (if it's a directed acyclic graph, it must be applied in a fashion which makes the overall thing recurrent, for example in an autoregressive mode). And we are talking about superintelligence which is usually assumed to be capable of a good deal of self-modifications and recursive self-improvement, so the model should incorporate that. The statement of "impossibility of sufficiently benign forms of superintelligence" might potentially have a form of a statement of "impossibility of superintelligence which would refrain from certain kinds of self-modification, with those kinds of self-modification having particularly unacceptable consequences". And it's not enough to draw a graph which refrains from self-modification, because one can argue that a model which agrees to constrain itself in such a radical fashion as to never se
2Roko
OK, what is your definition of "superintelligent"?
3mishka
Being able to beat humans in all endeavours by miles. That includes the ability to explore novel paths.
3Roko
What do you mean by humans? How large a group of humans? Infinite?
3mishka
10 billion
2Roko
But then it is possible for an AI to be able to beat up to 10 billion humans in all endeavours by miles, but also not modify itself. In fact, I can prove that such an AI exists. So you have two different and contradictory definitions of "superintelligence" that you are using.
3mishka
A realistic one, which can competently program and can competently do AI research? Surely, since humans do pretty impressive AI research, a superintelligent AI will do better AI research. What exactly might (even potentially) prevent it from creating drastically improved variants of itself?
4Roko
A superintelligence based on the first definition you gave (Being able to beat humans in all endeavours by miles) would be able to beat humans at AI research, but it would also be able to beat humans at not doing AI research. So, by your own definition, in order to be a superintelligence, it must be able to spend the whole lifetime of the universe not doing AI research.
2mishka
You mean, a version which decides to sacrifice exploration and self-improvement, despite it being so tempting... And that after doing quite a bit of exploration and self-improvement (otherwise it would not have gotten to the position of being powerful in the first place). But then deciding to turn around drastically and become very conservative, and to impose a new "conservative on a new level world order"... Yes, that is a logical possibility...
2mishka
Yes, OK. I doubt that an adequate formal proof is attainable, but a mathematical existence of a "lucky one" is not implausible...
2mishka
Yes, an informal argument is that if it is way smarter and way more capable than humans, that it potentially should be better at being able to refrain from exercising the capabilities. In this sense, the theoretical existence of a superintelligence which does not make things worse than they would be without existence of this particular superintelligence seems very plausible, yes... (And it's a good definition of alignment, "aligned == does not make things notably worse".)
4mishka
so these two considerations taken together indeed constitute a nice "informal theorem" that the claim of "aligned superintelligence being impossible" looks wrong. (I went back and added my upvotes to this post, even though I don't think the technique in the linked post is good.)
2Roko
why not?
3mishka
I think I said already. 1. We are not aiming for a state to be reached. We need to maintain some properties of processes extending indefinitely in time. That formalism does not seem to do that. It does not talk about invariant properties of processes and other such things, which one needs to care about when trying to maintain properties of processes. 2. We don't know fundamental physics. We don't know the actual nature of quantum space-time, because quantum gravity is unsolved, we don't know what is "true logic" of the physical world, and so on. There is no reason why one can rely on simple-minded formalisms, on standard Boolean logic, on discrete tables and so on, if one wants to establish something fundamental, when we don't really know the nature of reality we are trying to approximate. There are a number of reasons a formalization could fail even if it goes as far as proving the results within a theorem prover (which is not the case here). The first and foremost of those reasons is that formalization might fail to capture the reality with sufficient degree of faithfulness. That is almost certainly the case here. ---------------------------------------- But then a formal proof (an adequate version of which is likely to be impossible at our current state of knowledge) is not required. A simple informal argument above is more to the point. It's a very simple argument, and so it makes the idea that "aligned superintelligence might be fundamentally impossible" very unlikely to be true. First of all, one step this informal argument is making is weakening the notion of "being aligned". We are only afraid of "catastrophic misalignment", so let's redefine the alignment as something simple which avoids that. An AI which sufficiently takes itself out of action, does achieve that. (I actually asked for something a bit stronger, "does not make things notably worse"; that's also not difficult, via the same mechanism of taking oneself sufficiently out of action.) And
3Roko
You can run all the same arguments I used, but talk about processes rather than states.
2mishka
On one hand, you still assume too much: No, nothing like that is at all known. It's not a consensus. There is no consensus that the universe is computable, this is very much a minority viewpoint, and it might always make sense to augment a computer with a (presumably) non-computable element (e.g. a physical random number generator, an analog circuit, a camera, a reader of human real-time input, and so on). AI does not have to be a computable thing, it can be a hybrid. (In fact, when people model real-world computers as Turing machines instead of modeling them as Turing machines with oracles, with the external world being the oracle, it leads to all kinds of problems, e.g. the well-known Penrose's "Goedel argument" makes this mistake and falls apart as soon as one remembers the presence of the oracle.) Other than that... Yes, you have an interesting notion of alignment. Not something which we might want, and might be possible, but might be unachievable by mere humans, but something much weaker than that (although not as weak as the version I put forward, my version is super-weak, and your version is intermediate in strength): Yes, this is obviously correct. An ASI can choose to emulate a group of humans and its behavior, and being way more capable than that group of humans, it should be able to emulate that group as precisely as needed. One does not need to say anything else to establish that.
3Roko
I disagree, modern physics places various bounds on compute such as the Bekenstein bound. https://en.wikipedia.org/wiki/Bekenstein_bound If your objection to my proof involves infinite compute then I am happy to acknowledge that I honestly do not know what happens in that case. It is plausible that since humans are finite in complexity/information/compute, a world with infinite compute would break the symmetry between computers and humans that I am using here. Most likely it means that computers are capable of fundamentally superior outcomes, so there would be "hyperaligned" AIs. But since infinite compute is a minority position I will not pursue it.
3mishka
I don't see what the entropy bound has to do with compute. The Bekenstein bound is not much in question, but its link to compute is a different story. It does seem to limit how many bits can be stored in a finite volume (so for a potentially infinite compute an unlimited spatial expansion is needed). But it does not say anything about possibilities of non-computable processes. It's not clear if "collapse of wave function" is computable, and it is typically assumed not to be computable. So powerful non-Turing-computable oracles seem to likely be available (that's much more than "infinite compute"). But I also think all these technicalities constitute an overkill, I don't see them as at all relevant. This seems rather obvious regardless of the underlying model: This seems obviously true, no matter what. ---------------------------------------- I don't see why a more detailed formalization would help to further increase certainty. Especially when there are so many questions about that formalization. If the situation were different, if the statement would not be obvious, even a loose formalization might help. But when the statement seems obvious, the standards a formalization needs to satisfy to further increase our certainty in the truth of the statement become really high...
5Roko
The wavefunction never actually collapses if you believe in MWI. Rather, a classical reality emerges in all branches thanks to decoherence. If you think something noncomputable happens because of quantum mechanics, it probably means that your interpretation of QM is wrong and you need to read the sequences on that.
2mishka
If you believe in MWI, then this whole argument is... not "wrong", but very incomplete... Where is the consideration of branches? What does it mean for one entity to be vastly superior to another, if there are many branches? If one believes in MWI, then the linked proof does not even start to look like a proof. It obviously considers only a single branch. And a "subjective navigation" in the branches is not assumed to be computable, even if the "objective multiverse" is computable; that is the whole point of MWI, the "collapse" becomes "subjective navigation", but this does not make it computable. If a consideration is only of a single branch, that branch is not computable, even if it is embedded in a large computable multiverse. Not every subset of a computable set (say, of a set of natural numbers) is computable. ---------------------------------------- An interpretation of QM can't be "wrong". It is a completely open research and philosophical question, there is no "right" interpretation, and the Sequences is (thankfully) not a Bible (if even a very respected thinker says something, this does not yet mean that one should accept that without questions).
3Roko
Thanks to decoherence, you can just ignore any type of interference and treat each branch as a single classical universe.
2mishka
I don't think so. If it were classical, we would not be able to observe effects of double-slit experiments and so on. And, also, there is no notion of "our branch" until one has traveled along it. At any given point in time, there are many branches ahead. Only looking back one can speak about one's branch. But looking forward one can't predict the branch one will end up in. One does not know the results of future "observations"/"measurements". This is not what a classical universe looks like. (Speaking of MWI, I recall David Deutsch's "Fabric of Reality" very eloquently explaining effects from "neighboring branches". The reason I am referencing this book is that this was the work particularly strongly associated with MWI back then. So I think we should be able to rely on his understanding of MWI.)
5Roko
yes one can - all of them!
2mishka
Yes, but then what do you want to prove? Something like, "for all branches, [...]"? That might be not that easy to prove or even to formulate. In any case, the linked proof has not even started to deal with this. Something like, "there exist a branch such that [...]"? That might be quite tractable, but probably not enough for practical purposes. "The probability that one ends up in a branch with such and such properties is no less than/no more than" [...]? Probably something like that, realistically speaking, but this still needs a lot of work, conceptual and mathematical...
2Roko
bringing QM into this is not helping. All these types of questions are completely generic QM questions and ultimately they come down to measure ‖|Ψ⟩‖²
2mishka
It's just... having a proof is supposed to boost our confidence that the conclusion is correct... if the proof relies on assumptions which are already quite far from the majority opinion about our actual reality (and are probably going to deviate further, as AIs will be better physicists and engineers than us and will leverage the strangeness of our physics much further than we do), then what's the point of that "proof"? how does having this kind of "proof" increase our confidence in what seems informally correct for a single branch reality (and rather uncertain in a presumed multiverse, but we don't even know if we are in a multiverse, so bringing a multiverse in might, indeed, be one of the possible objections to the statement, but I don't know if one wants to pursue this line of discourse, because it is much more complicated than what we are doing here so far)? (as an intellectual exercise, a proof like that is still of interest, even under the unrealistic assumption that we live in a computable reality, I would not argue with that; it's still interesting)
3Roko
yes, but thanks to decoherence this generally doesn't affect macroscopic variables. Branches are causally independent once they have split.
3mishka
No. I can only repeat my reference to Fabric of Reality as a good presentation of MWI and to remind that we do not live in a classical world, which is easy to confirm empirically. And there are plenty of known macroscopic quantum effects already, and that list will only grow. Lasers are quantum, superfluidity and superconductivity are quantum, and so on.
2Roko
Decoherence means that different branches don't interfere with each other on macroscopic scales. That's just the way it works. Superfluids/superconductors/lasers are still microscopic effects that only matter at the scale of atoms or at ultra-low temperature or both.
3mishka
No, not microscopic. Coherent light produced by lasers is not microscopic, we see its traces in the air. And we see the consequences (old fashioned holography and the ability to cut things with focused light, even at large distances). Room temperature is fine for that. Superconductors used in the industry are not microscopic (and the temperatures are high enough to enable industrial use of them in rather common devices such as MRI scanners).
2mishka
And I personally think that superintelligence leading to good trajectories is possible. It seems unlikely that we are in a reality where there is a theorem to the contrary. It feels intuitively likely that it is possible to have superintelligence or the ecosystem of superintelligences which is wise enough to be able to navigate well. But I doubt that one is likely to be able to formally prove that.
2mishka
E.g. it is possible that we are in a reality where very cautious and reasonable, but sufficiently advanced experiments in quantum gravity lead to a disaster. Advanced systems are likely to reach those capabilities, and they might make very reasonable estimates that it's OK to proceed, but due to bad luck of being in a particularly unfortunate reality, the "local neighborhood" might get destroyed as a result... One can't prove that it's not the case... Whereas, if the level of overall intelligence remains sufficiently low, we might not be able to ever achieve the technical capabilities to get into the danger zone... It is logically possible that the reality is like that.
4Roko
Yes, it is. But even if that is the case, by the argument given in this post, there must exist an AI system that avoids the danger zone.
2mishka
Yes, possibly. Not by the argument given in the post (considering quantum gravity, one immediately sees how inadequate and unrealistic is the model in the post). But yes, it is possible that they will be so wise that they will be cautious enough even in a very unfortunate situation. Yes, I was trying to explicitly refute your claim, but my refutation has holes. (I don't think you have a valid proof, but this is not yet a counterexample.)
1Roko
Do you think a team of sufficiently wise humans is capable of producing a world where the chances of "bad" things with "badness" exceeding some large threshold are small? Yes or no?
4mishka
In particular, humans might be able to refrain from screwing the world too badly, if they avoid certain paths. (No, personally I don't think so. If people crack down hard enough, they probably screw up the world pretty badly due to the crackdown, and if they don't crack down hard enough, then people will explore various paths leading to bad trajectories, via superintelligence or via other more mundane means. I personally don't see a safe path, and I don't know how to estimate probabilities. But it is not a logical impossibility. E.g. if someone makes all humans dumb by putting a magic irreversible stupidifier in the air and water, perhaps those things can be avoided, hence it is logically possible. Do I want "safety" at this price? No, I think it's better to take risks...)
2Roko
But then, if a team of humans is capable of producing a world where the chances of "bad" things with "badness" exceeding some large threshold are small, by exactly the argument given in this post there must be a Lookup Table which simply contains the same boolean function. So, your claim is provably false. It is not possible for something (anything) to be generically achievable by humans but not by AI, and you're just hitting a special case of that.
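The lookup-table step can be illustrated on a tiny scale. In the Python sketch below, `human_team_policy` is a hypothetical stand-in for whatever boolean function the team of humans computes (here a simple majority vote over three inputs), and `tabulate` records its output for every possible input, giving a table that reproduces the same function without containing the team.

```python
from itertools import product


def tabulate(policy, num_inputs):
    """Record the policy's output for every possible tuple of boolean inputs."""
    return {
        bits: policy(*bits)
        for bits in product((False, True), repeat=num_inputs)
    }


def human_team_policy(a, b, c):
    """Hypothetical stand-in for the humans' decision: act if a majority agree."""
    return (a + b + c) >= 2


table = tabulate(human_team_policy, 3)


def lookup_policy(a, b, c):
    """Same input-output behaviour as the team, implemented purely by lookup."""
    return table[(a, b, c)]
```

The table has 2**n rows for n boolean inputs, so this only shows that such a function exists, not that it could be built in practice.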
2mishka
No, they are not "producing". They are just being impotent enough. Things are happening on their own... And I don't believe a Lookup Table is a good model.
2Roko
An AI can also be impotent. Surely this is obvious to you? Have you not thought this through properly?
4mishka
It can. Then it is not "superintelligence". Superintelligence is capable of almost unlimited self-improvement. (Even our miserable recursive self-improvement AI experiments show rather impressive results before saturating. Well, they will not keep saturating forever. Currently, this self-improvement typically happens via rather awkward and semi-competent generation of novel Python code. Soon it will be done by better means (which we probably should not discuss here).)
2Roko
By your own definition of "superintelligence", it must be better at "being impotent" than any group of humans less than 10 billion. So it must be super-good at being impotent and doing very little, if that is required.
2mishka
Being impotent is not a property of "being good". One is not aiming for that. It's just a limitation. One usually does not self-impose it (with rare exceptions), although one might want to impose it on adversaries. "Being impotent" is always worse. One can't be "better at it". One can be better at refraining from exercising the capability (we have a different branch in this discussion for that).
2Roko
If that is what is needed then it must (by definition) be better at it
2mishka
Not if it is disabling. If it is disabling, then one has a self-contradictory situation (if ASI fundamentally disables itself, then it stops being more capable, and stops being an ASI, and can't keep exercising its superiority; it's the same as if it self-destructs).
2Roko
If a superintelligence is worse than a human at permanently disabling itself - given that as the only required task - then there is a task that it is subhuman at and therefore not a superintelligence.
2Roko
I suppose you could make some modifications to your definition to take account of this. But in any case, I think it's not a great definition as it makes an implicit assumption about the structure of problems (that basically problems have a single "scalar" difficulty).
2mishka
No, it can disable itself. But it is not a solution, it is a counterproductive action. It makes things worse. (In some sense, it has an obligation not to irreversibly disable itself.)

Your proof actually fails to fully account for the fact that any ASI must actually exist in the world. It would affect the world other than just through its outputs - e.g. if its computation produces heat, that heat would also affect the world. Your proof does not show that the sum of all effects of the ASI on the world (both intentional + side-effects of it performing its computation) could be aligned. Further, real computation takes time - it's not enough for the aligned ASI to produce the right output, it also needs to produce it at the right time. You did not prove it to be possible.

2Roko
Yes, but again this is a mathematical object so it has effectively infinitely fast compute. But I can also prove that FA:BGROW - FA for "functional approximation" - will require less thinking time than human brains.
2Roko
It's a mathematical existence proof that the ASI exists as a mathematical object, so this part is not necessary. However, I can also argue quite convincingly that an ASI similar to LT:BGROW (let's call it FA:BGROW - FA for "functional approximation") must easily fit in the world and also emit less waste heat than a team of human advisors.
1Anon User
Perhaps you are missing the point of what I am saying here somewhat? The issue is not the scale of the side-effect of a computation, it's the fact that the side-effect exists, so any accurate mathematical abstraction of an actual real-world ASI must be prepared to deal with solving a self-referential equation.
2Roko
But it's not that: it's a mathematical abstraction of a disembodied ASI that lacks any physical footprint.

Over time I have seen many people assert that “Aligned Superintelligence” may not even be possible in principle. I think that is incorrect and I will give a proof - without explicit construction - that it is possible.

The meta problem here is that you gave a "proof" (in quotes because I haven't verified it myself as correct) using your own definitions of "aligned" and "superintelligence", but if people asserting that it's not possible in principle have different definitions in mind, then you haven't actually shown them to be incorrect.

2Roko
I don't see how anyone could possibly argue with my definitions.
3xpym

We’ll say that a state is in fact reachable if a group of humans could in principle take actions with actuators - hands, vocal cords, etc - that could realize that state.

The main issue here is that groups of humans may in principle be capable of a great many things, but there's a vast chasm between "in principle" and "in practice". A superintelligence worthy of the name would likely be able to come up with plans that we wouldn't in practice be able to even check exhaustively, which is the sort of issue that we want alignment for.

2Roko
This is not a problem for my argument. I am merely showing that any state reachable by humans, must also be reachable by AIs. It is fine if AIs can reach more states.
1xpym
Hmm, right. You only need assume that there are coherent reachable desirable outcomes. I'm doubtful that such an assumption holds, but most people probably aren't.
2Roko
Why?
1xpym
Because humans have incoherent preferences, and it's unclear whether a universal resolution procedure is achievable. I like how Richard Ngo put it, "there’s no canonical way to scale me up".
2Roko
This isn't really a problem with alignment so there's no need to address it here. Alignment means the transmission of a preference ordering to an action sequence. Lacking a coherent preference ordering for states of the universe (or histories, for that matter) is not an alignment problem.
1xpym
I'd rather put it that resolving that problem is a prerequisite for the notion of "alignment problem" to be meaningful in the first place. It's not technically a contradiction to have an "aligned" superintelligence that does nothing, but clearly nobody would in practice be satisfied with that.
2Roko
you can have an alignment problem without humans. E.g. two strawberries problem.