Orthogonality Thesis

Discuss the wikitag on this page. Here is the place to ask questions and propose changes.

To get the “orthogonality” part, I think the definition of the thesis also needs to include that increasing the intelligence of agents does not cause their interpretations of (some) goals to converge.

In particular, the dismissal of the concern that policy must include an absolutely perfect specification of the perfect goal does not deny that an agent could have the goal of maximizing paperclip production. Rather, it asserts that the paperclip-maximization goal may seed ASI adequately, because a perfect intelligence pursuing it would behave the same as a perfect intelligence pursuing the perfect goal (although we imperfect intelligences do not realize this, since we do not appreciate all the overlapping instrumental goals that both entail; for example, truly intelligent paperclip maximization may start with generating a maximally intelligent planner, and that may take so long that no actual paperclips get made).

I noticed that tag posts imported from Arbital which haven't been edited on LW yet can't be found when searching for those tags from the "Add Tags" button above posts. Making trivial edits, like adding spaces at the end of a paragraph, seems to fix that problem.

A link appears to have broken - does anyone know what “null” was supposed to link to in “policy  null ”? (Note the extra spaces around “null”.)

Good catch - looks like that's from this revision, which was copied over from Arbital; some LaTeX didn't make it through. I'll see if it's trivial to fix.

Should be fixed now.

From the old discussion page:

Talk:Orthogonality thesis

Quality Concerns

  • There is no reference for Stuart Armstrong's requirements. I guess they come from here.
  • I know roughly what the orthogonality thesis is. But if I'd only read the wiki page, it wouldn't make sense to me. – »Well, we don't share the opinion of some people that the goals of increasingly intelligent agents converge. So we put up a thesis which claims that intelligence and goals vary freely. Suppose there's one goal system that fulfils Armstrong's requirements. This would refute the orthogonality thesis, even if most intelligences converged on one or two other goal systems.« I don't mean to say that the orthogonality thesis itself doesn't make sense. I mean that the wiki page doesn't provide enough information to enable people to understand that it makes sense.

Instead of repairing the above shortcomings, I propose referring people to the corresponding Arbital page. --Rmoehn (talk) 14:47, 31 May 2016 (AEST)

The corresponding Arbital page is now (apparently) dead.

The page isn't dead, Arbital pages just don't load sometimes (or take 15+ seconds).

The section on Moral Internalism is slightly inaccurate, or at least misleading. Internalism is the metaethical view that an agent cannot judge something to be right while remaining entirely unmotivated to perform it. As such, it is really a semantic claim about the meaning of moral vocabulary: whether or not it is part of the meaning of "that is right" or "that is wrong" that the speaker approves or disapproves, respectively, of an action. Internalism, then (as intended by analytic philosophers), is totally compatible with the Orthogonality Thesis. (Internalism + Orthogonality = noncognitivism or relativism or nihilism.) IIRC, Hume himself was an Internalist!

Sources: http://plato.stanford.edu/entries/moral-motivation/

I suggest changing the section to either "Realist Moral Internalism" or a more comprehensive examination of the options available to the AI-grade philosopher when it comes to moral motivation.

I'm skeptical of Orthogonality. My basic concern is that it can be interpreted as true-but-useless for purposes of defending it, and useful-but-implausible when trying to get it to do some work for you, and that the user of the idea may not notice the switcheroo. Consider the following statements: there are arbitrarily powerful cognitive agents

  1. which have circular preferences,
  2. with the goal of paperclip maximization,
  3. with the goal of phlogiston maximization,
  4. which are not reflective,
  5. with values aligned with humanity.

Rehearsing the arguments for Orthogonality and then evaluating these questions, I find my mind gets very slippery.

Orthogonality proponents I've spoken to say 1 is false, because "goal space" excludes circular preferences. But there are very likely other restrictions on goal space imposed once an agent groks things like symmetry. If "goal space" means whatever goals are not excluded by our current understanding of intelligence, I think Orthogonality is unlikely (and poorly formulated). If it means "whatever goals powerful cognitive agents can have", Orthogonality is tautological and distracts us from pursuing the interesting question of what that space of goals actually is. Let's narrow down goal space.

If 2 and 3 get different answers, why? Might a paperclip maximizer take liberties with what is considered a paperclip once it learns that papers can be electrostatically attracted?

If 4 is easily true, I wonder if we're defining "mind space" too broadly to be useful. I'd really like humanity to focus on the sector of mind space that we should focus on in order to get a good outcome. The forms of Orthogonality which are clearly (to me) true distract from the interesting question of what that sector actually is. Let's narrow down mind space.

For 5, I don't find Orthogonality to be a convincing argument. A more convincing argument is to shoot for "humanity can grow up to have arbitrarily high cognitive power" instead.

Thanks for the reply. I agree that strong Inevitability is unreasonable, and I understand the function of #1 and #2 in disrupting a prior frame of mind which assumes strong Inevitability, but that's not the only alternative to Orthogonality. I'm surprised that the arguments are considered successively stronger arguments in favor of Orthogonality, since #6 basically says "under reasonable hypotheses, Orthogonality may well be false." (I admit that's a skewed reading, but I don't know what the referenced ongoing work looks like, so I'm skipping that bit for now. [Edit: is this "tiling agents"? I'm not familiar with that work, but I can go learn about it.])

The other arguments are interesting commentary, but don't argue that Orthogonality is true for agents we ought to care about.

  • Gandhian stability argues that self-modifying agents will try to preserve their preference systems, but not that they can become arbitrarily powerful while doing so. As it happens, circular preference systems illustrate how Gandhian stability could limit how powerful a cognitive agent can become.
  • The unbounded agents argument says Orthogonality is true when "mind space" is broader than what we care about.
  • The search tractability argument looks like a statement about the relative difficulty of accomplishing different goals, not the relative difficulties of holding those goals. I don't mean to dismiss the argument, but I don't understand it. I'm not even clear on exactly what the argument is saying about the tractability of searching for strategies for different goals. That it's the same for all possible goals?

There are six successively stronger arguments listed under "Arguments" in the current version of the page. Mind design space largeness and Humean freedom of preference are #1 and #2. By the time we get to the Gandhi stability argument #3, and the higher tiers of argument above it (especially the tiling agents work that seems to directly show stability of arbitrary goals), we're outside the domain of arguments that could specialize equally well to supporting circular preferences. The reason for listing #1 and #2 as arguments anyway is not that they finish the argument, but that (a) before the later tiers of argument were developed, #1 and #2 were strong intuition-pumps in the correct direction, and (b) even if they might arguably prove too much if applied sloppily, they counteract other sloppy intuitions along the lines of "What does this strange new species 'AI' want?" or "But won't it be persuaded by..." Like, it's important to understand that even if it doesn't finish the argument, it is indeed the case that "All AIs have property P" has a lot of chances to be wrong and "At least one AI has property P" has a lot of chances to be right. It doesn't finish the story - if we took it as finishing the story, we'd be proving much too much, like circular preferences - but it pushes the story a long way in a particular direction compared to coming in with a prior frame of mind about "What will AIs want? Hm, paperclips doesn't sound right, I bet they want mostly to be left alone."

1 seems a bit odd. You could argue that the Argument from Mind Design Space Width supports it, but this just demonstrates that this initial argument may be too crude to do more than act as an intuition pump. By the time we're talking about the Argument from Reflective Stability, I don't think that argument supports "you can have circular preferences" any more.

That's exactly the point (except I'm not sure what you mean by "the Argument from Reflective Stability"; the capital letters suggest you're talking about something very specific). The arguments in favor of Orthogonality just seem like crude intuition pumps. The purpose of 1 was not to actually talk about circular preferences, but to pick an example of something supported by largeness of mind design space, but which we expect to break for some other reason. Orthogonality feels like claiming the existence of an integer with two distinct prime factorizations because "there are so many integers". Like the integers, mind design space is vast, but not arbitrary. It seems unlikely to me that there cannot be theorems showing that sufficiently high cognitive power implies some restriction on goals.

As regards 4, I'd say that while there may theoretically be arbitrarily powerful agents in math-space that are non-reflective, it's not clear that this is a pragmatic truth about most of the AIs that would exist in the long run - although we might be able to get very powerful non-reflective genies. So we're interested in some short-run solutions that involve nonreflectivity, but not long-run solutions.

I don't think 2 and 3 do have different answers. See the argument about what happens if you use an AI that only considers classical atomic hypotheses, in https://arbital.com/p/5c?lens=4657963068455733951

1 seems a bit odd. You could argue that the Argument from Mind Design Space Width supports it, but this just demonstrates that this initial argument may be too crude to do more than act as an intuition pump. By the time we're talking about the Argument from Reflective Stability, I don't think that argument supports "you can have circular preferences" any more. It's also not clear to me why 1 matters - all the arguments I know about, that depend on Orthogonality, still go through if we restrict ourselves to only agents with noncircular preferences. A friendly one should still exist, a paperclip maximizer should still exist.

I am pretty surprised by how confident the voters are!

Is "arbitrarily powerful" intended to include e.g. an arbitrarily dumb search given arbitrarily large amounts of computing power? Or is it intended to require arbitrarily high efficiency as well? The latter interpretation seems to make more sense (and is relevant for forecasting). Also, it's the only option if we read "can exist" as referring to physical possibility, given that there are probably limits on the resources available to any physical system. But on that reading, 99% seems clearly crazy.

It also seems weird to give arguments in favor without offering any plausible way in which the claim could be false, or offering any arguments against. The only alternative mentioned is inevitability, which is maybe taken seriously in philosophy but doesn't really seem plausible.

I guess the norm is that I can add counterarguments and alternatives to the article itself if I object? Somehow the current experience is not set up in a way that would make that feel natural.

Note that most plausible failures of orthogonality are bad news, perhaps very bad news.

(This is hard without threaded conversations. Responding to the "agree/disagree" from Eliezer)

The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'You can't have superintelligences that optimize any external factor, only things analogous to internal reinforcement.'

The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.'

I think there are a lot of plausible failure modes. The two failures you outline don't seem meaningfully distinct given our current understanding, and seem to roughly describe what I'm imagining. Possible examples:

  • Systems that simply want to reproduce and expand their own influence are at a fundamental advantage. To make this more concrete, imagine that powerful agents have lots of varied internal processes, and that constant effort is needed to prevent the proliferation of internal processes that are optimized for their own proliferation rather than for the pursuit of some overarching goal. Maybe this kind of effort is needed to obtain competent high-level behavior at all, but maybe if you have some simple values you can spend less effort and let your own internal character shift freely according to competitive pressures.
  • What we were calling "sensory optimization" may be a core feature of some useful algorithms, and it may require a constant fraction of one's resources to repurpose that sensory optimization towards non-sensory ends. This might just be a different way of articulating the last bullet point. I think we could talk about the same thing in many different ways, and at this point we only have a vague understanding of what those scenarios actually look like concretely.
  • It turns out that at some fixed level of organization, the behavior of a system needs to reflect something about the goals of that system - there is no way to focus "generic" medium-level behavior towards an arbitrary goal that isn't already baked into that behavior. (The alternative, which seems almost necessary for the literal form of orthogonality, is that you can have arbitrarily large internal computations that are mostly independent of the agent's goals.) This implies that systems with more complex goals need to do at least slightly more work to pursue those goals. For example, if the system only devotes 0.0000001% of its storage space/internal communication bandwidth to goal content, then that puts a clear lower bound on the scale at which the goals can inform behavior. Of course arbitrarily complex goals could probably be specified indirectly (e.g. I want whatever is written in the envelope over there), but if simple indirect representations are themselves larger than the representation of the simplest goals, this could still represent a real efficiency loss.

Paul is worried about something else / Eliezer has completely missed Paul's point.

I do think the more general point, of "we really don't know what's going on here," is probably more important than the particular possible counterexamples. Even if I had no plausible counterexamples in mind, I just wouldn't be especially confident.

I think the only robust argument in favor is that unbounded agents are probably orthogonal. But (1) that doesn't speak to efficiency, and (2) even that is a bit dicey, so I wouldn't go for 99% even on the weaker form of orthogonality that neglects efficiency.

If you can get to 95% cognitive efficiency and 100% technological efficiency, then a human value optimizer ought to not be at an intergalactic-colonization disadvantage or a take-over-the-world-in-an-intelligence-explosion disadvantage and not even very much of a slow-takeoff disadvantage.

It sounds regrettable but certainly not catastrophic. Here is how I would think about this kind of thing (it's not something I've thought about quantitatively much, and it doesn't seem particularly action-relevant).

We might think that the speed of development or productivity of projects varies a lot randomly. So in the "race to take over the world" model (which I think is the best case for an inefficient project maximizing its share of the future), we'd want to think about what kind of probabilistic disadvantage a small productivity gap introduces.

As a simple toy model, you can imagine two projects; the one that does better will take over the world.

If you thought that productivity was log-normal with a multiplicative standard deviation of a factor of 2 (i.e. ×/÷ 2), then a 5% productivity disadvantage corresponds to maybe a 48% chance of being more productive. Over the course of more time the disadvantage becomes more pronounced, if randomness averages out. If productivity variation is larger or smaller, then it decreases or increases the impact of an efficiency loss. If there are more participants, then the impact of a productivity hit becomes significantly larger. If the good guys only have a small probability of losing, then the cost is proportionally lower. And so on.
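As a sanity check on that figure, here is a minimal sketch of the two-project toy model (my own illustration; the factor-of-2 spread and the 5% handicap are just the assumptions stated above):

```python
# Two-project toy model: each project's productivity is log-normal with a
# multiplicative standard deviation of 2; one project has a 5% disadvantage.
# What is the chance that the disadvantaged project still comes out ahead?
from math import log, sqrt
from scipy.stats import norm

sigma = log(2)          # one-sigma spread: a factor of 2 in productivity
handicap = log(0.95)    # a 5% productivity disadvantage, in log space

# The difference of two independent normals has standard deviation sqrt(2)*sigma,
# so the disadvantaged project is the more productive one with probability:
p_win = norm.cdf(handicap / (sqrt(2) * sigma))
print(f"{p_win:.1%}")   # about 47.9%, matching the "maybe 48%" figure above
```

Widening the assumed spread pushes this probability back toward 50% (the efficiency loss matters less), and narrowing it pushes it down, in line with the qualitative claims above.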

Combining with my other views, maybe one is looking at a cost of tenths of a percent. You would presumably hope to avoid this by having the world coordinate even a tiny bit (I thought about this a bit here). Overall I'll stick with regrettable but far from catastrophic.

(My bigger issue in practice with efficiency losses is similar to your view that people ought to have really high confidence. I think it is easy to make sloppy arguments that one approach to AI is 10% as effective as another, when in fact it is 0.0001% as effective, and that holding yourself to asymptotic equivalence is a more productive standard unless it turns out to be unrealizable.)

Paul, I didn't say "99%" lightly, obviously. And that makes me worried that we're not talking about the same thing. Which of the following statements sound agreeable or disagreeable?

"If you can get to 95% cognitive efficiency and 100% technological efficiency, then a human value optimizer ought to not be at an intergalactic-colonization disadvantage or a take-over-the-world-in-an-intelligence-explosion disadvantage and not even very much of a slow-takeoff disadvantage."

"The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'You can't have superintelligences that optimize any external factor, only things analogous to internal reinforcement.'"

"The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.'"

"Paul is worried about something else / Eliezer has completely missed Paul's point."

(Understandable to focus on explanation for now. Threaded replies to replies would also be great eventually.)

Eliezer: I assumed 95% efficiency was not sufficient; I was thinking about asymptotic equivalence, i.e. efficiency approaching 1 as the sophistication of the system increases. Asymptotic equivalence of technological capability seems less interesting than of cognitive capability, though they are equivalent if either we construe technology broadly to include cognitive tasks or if we measure technological capability in a way with lots of headroom.
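One way to pin down the asymptotic-equivalence reading (my own gloss; the capability functions below are illustrative notation, not something from the thread): write $C_{\text{aligned}}(n)$ and $C_{\text{free}}(n)$ for the capability attainable at sophistication level $n$ by a value-aligned optimizer and by an unconstrained optimizer with equivalent resources. The claim would then be

$$\lim_{n \to \infty} \frac{C_{\text{aligned}}(n)}{C_{\text{free}}(n)} = 1,$$

which is stronger than a fixed bound such as $C_{\text{aligned}}(n) \ge 0.95\, C_{\text{free}}(n)$ for all $n$.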

(Nick says "more or less any level of intelligence," which I guess could be taken to exclude the very highest levels of intelligence, but based on his other writing I think he intended merely to exclude low levels. The language in this post seems to explicitly cover arbitrarily high efficiency.)

I still think that 99% confidence is way too high even if you allow 50% efficiency, though at that point I would at least go for "very likely."

Also of course you need to be able to replace "paperclip maximizer" with anything. When I imagine orthogonality failing, "human values" seem like a much more likely failure case than "paperclips."

I don't think that this disagreement about orthogonality is especially important, I mostly found the 99%'s amusing and wanted to give you a hard time about it. It does suggest that in some sense I might be more pessimistic about the AI control problem itself than you are, with my optimism driven by faith in humanity / the AI community.

Paul, you can start by writing an objection as a comment, if it's a few paragraphs long. You can write a new comment for each new objection. If you want to make it detailed / add a vote, then creating a new page makes sense.

I agree that the website currently doesn't provide intuitive support for arguments; this will come in the near future. For this year we focused on explanation / presentation.

To make sure we're on the same page, Orthogonality is true if it's possible for a paperclip maximizer to exist and be, say, 95% as cognitively efficient and ~100% as technologically sophisticated as any other agent (with equivalent resources). Check?