XiXiDu comments on The genie knows, but doesn't care - Less Wrong

54 Post author: RobbBB 06 September 2013 06:42AM




Comment author: RobbBB 05 September 2013 06:14:32PM *  12 points

Your summaries of my views here are correct, given that we're talking about a superintelligence.

My question is: do you believe there to be a conceptual difference between encoding capabilities (what an AI can do) and goals (what an AI will do)? As far as I understand, capabilities and goals are both encodings of how humans want an AI to behave.

Well, there's obviously a difference; 'what an AI can do' and 'what an AI will do' mean two different things. I agree with you that this difference isn't a particularly profound one, and the argument shouldn't rest on it.

What the argument rests on is, I believe, that it's easier to put a system into a positive feedback loop that helps it better model its environment and/or itself, than it is to put a system into a positive feedback loop that helps it better pursue a specific set of highly complex goals we have in mind (but don't know how to fully formalize).

If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn't value our well-being, how do we make reality bite back and change the AI's course? How do we give our morality teeth?

Whatever goals it initially tries to pursue, it will fail in those goals more often the less accurate its models are of its circumstances; so if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it's not the goal we intended it to be good at), then it doesn't take a large leap of the imagination to see how it could receive feedback from its environment about how well it's doing at modeling states of affairs. 'Modeling states of affairs well' is not a highly specific goal, it's instrumental to nearly all goals, and it's easy to measure how well you're doing at it if you're entangled with anything about your environment at all, e.g., your proximity to a reward button.

(And when a system gets very good at modeling itself, its environment, and the interactions between the two, such that it can predict what changes its behaviors are likely to effect and choose its behaviors accordingly, then we call its behavior 'intelligent'.)
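The asymmetry here — that a model's accuracy is measurable no matter what the terminal goal is — can be sketched in a few lines of code. (The toy environment and function names below are my own illustration, not anything from the argument itself.)

```python
import random

def environment(state):
    """Toy environment: the next observation is a noisy function of state."""
    return 2 * state + random.gauss(0, 0.1)

def prediction_error(model_slope, trials=1000):
    """Mean squared error of the agent's model against reality.
    This feedback signal exists whatever the agent's terminal goal is."""
    total = 0.0
    for _ in range(trials):
        state = random.uniform(-1, 1)
        predicted = model_slope * state
        actual = environment(state)
        total += (predicted - actual) ** 2
    return total / trials

# A better world-model scores measurably better, with no value judgments needed:
random.seed(0)
assert prediction_error(2.0) < prediction_error(0.5)
```

Nothing in `prediction_error` mentions what the agent is *for*; that is the sense in which reality "bites back" on modeling errors but not on value errors.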

This stands in stark contrast to the difficulty of setting up a positive feedback loop that will allow an AGI to approximate our True Values with increasing fidelity. We understand how accurately modeling something works; we understand the basic principles of intelligence. We don't understand the basic principles of moral value, and we don't even have a firm grasp of how to go about answering moral questions. Presumably our values are encoded in some way in our brains, such that there is some possible feedback loop we could use to guide an AGI gradually toward Friendliness. But how do we figure out in advance what that feedback loop needs to look like, without asking the superintelligence? (We can't ask the superintelligence what algorithm to use to make it start becoming Friendly, because to the extent it isn't already Friendly it isn't a trustworthy source of information. This is in addition to the seed/intelligence distinction I noted above.)

If we slightly screw up the AGI's utility function, it will still need to succeed at modeling things accurately in order to do anything complicated at all. But it will not need to succeed at optimally caring about what humans care about in order to do anything complicated at all.

Comment author: XiXiDu 05 September 2013 07:31:30PM *  0 points

...put a system into a positive feedback loop that helps it better model its environment and/or itself...

This can be understood as both a capability and as a goal. What humans mean an AI to do is to undergo recursive self-improvement. What humans mean an AI to be capable of is to undergo recursive self-improvement.

I am only trying to clarify the situation here. Please correct me if you think that above is wrong.

If the AI incorrectly models some feature of itself or its environment, reality will bite back. But if it doesn't value our well-being, how do we make reality bite back and change the AI's course?

I do not disagree with the orthogonality thesis insofar as an AI can have goals that interfere with human values in a catastrophic way, possibly leading to human extinction.

...if we have successfully programmed it to do increasingly well at any difficult goal at all (even if it's not the goal we intended it to be good at), then it doesn't take a large leap of the imagination to see how it could receive feedback from its environment about how well it's doing at modeling states of affairs.

I believe here is where we start to disagree. I do not understand how the "improvement" part of recursive self-improvement can be independent of properties such as the coherence and specificity of the goal the AI is supposed to achieve.

Either you have a perfectly specified goal, such as "maximizing paperclips", where it is clear what "maximization" means, and what the properties of "paperclips" are, or there is some amount of uncertainty about what it means to achieve the goal of "maximizing paperclips".

Suppose the programmers forgot to encode what shape the paperclips are supposed to have. How do you suppose that would influence the behavior of the AI? Would it just choose some shape at random, or would it conclude that shape is not part of its goal? If the former, where would the decision to randomly choose a shape come from? If the latter, what would it mean to maximize shapeless objects?
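The question can be made concrete with a toy sketch (all names here are hypothetical, chosen only for illustration): if the stated objective never mentions shape, then every shape-choosing rule is equally consistent with it.

```python
from dataclasses import dataclass

@dataclass
class Paperclip:
    shape: str  # never mentioned in the utility function below

def utility(clips):
    """The programmers' stated goal: more clips is better.
    Shape is simply absent, so it constrains nothing."""
    return len(clips)

# Two policies that are *equally optimal* under the stated goal:
policy_a = [Paperclip(shape="classic loop") for _ in range(10)]
policy_b = [Paperclip(shape="amorphous blob") for _ in range(10)]
assert utility(policy_a) == utility(policy_b)
# The choice between them is decided by implementation details
# (tie-breaking, search order), not by anything the goal specifies.
```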

I am just trying to understand what kind of AI you have in mind.

'Modeling states of affairs well' is not a highly specific goal, it's instrumental to nearly all goals,...

This is a clearer point of disagreement.

An AI needs to be able to draw clear lines where exploration ends and exploitation starts. For example, an AI that thinks about every decision for a year would never get anything done.

An AI also needs to discount low-probability possibilities, so as not to be vulnerable to internal or external Pascal's mugging scenarios.
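Both constraints have simple textbook forms. As a minimal sketch (the threshold value is an arbitrary assumption, and real proposals are far subtler), a bounded decision rule might simply discard outcomes below a probability floor:

```python
def expected_value(outcomes, probability_floor=1e-6):
    """Expected value that ignores outcomes below a probability floor:
    one crude guard against Pascal's-mugging-style dominance by
    tiny-probability, astronomical-payoff terms."""
    return sum(p * v for p, v in outcomes if p >= probability_floor)

# A mugger's offer: a 10^-20 chance of 10^30 utility.
mundane = [(0.9, 1.0), (0.1, 0.0)]
mugging = [(1e-20, 1e30), (1.0 - 1e-20, 0.0)]
assert expected_value(mundane) > expected_value(mugging)
# An unbounded expected-utility maximizer would instead compute 1e10
# for the mugging and hand over its wallet.
```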

These are problems that humans need to solve and encode in order for an AI to be a danger.

But these problems are in essence confinements, or bounds on how an AI is going to behave.

How likely, then, is such an AI to take over the world, or to look for dangerous aliens, in order to make sure that neither aliens nor humans obstruct it from achieving its goal?

Similarly, how likely is such an AI to convert all resources into computronium in order to be better able to model states of affairs well?

This stands in stark contrast to the difficulty of setting up a positive feedback loop that will allow an AGI to approximate our True Values with increasing fidelity.

I understand this. And given your assumptions about how an AI will affect the whole world in a powerful way, it makes sense to make sure that it does so in a way that preserves human values.

I have previously compared this to uncontrollable self-replicating nanobots. Given that you cannot confine the speed or scope of their self-replication, only the nature of the transformation that they cause, you will have to make sure that they transform the world into a paradise rather than grey goo.

Comment author: RobbBB 05 September 2013 08:03:18PM *  2 points

This can be understood as both a capability and as a goal.

Yes. To divide it more finely, it could be a terminal goal, or an instrumental goal; it could be a goal of the AI, or a goal of the human; it could be a goal the human would reflectively endorse, or a goal the human would reflectively reject but is inadvertently promoting anyway.

I believe here is where we start to disagree. I do not understand how the "improvement" part of recursive self-improvement can be independent of properties such as the coherence and specificity of the goal the AI is supposed to achieve.

I agree that, at a given time, the AI must have a determinate goal. (Though the encoding of that goal may be extremely complicated and unintentional. And it may need to be time-indexed.) I'm not dogmatically set on the idea that a self-improving AGI is easy to program; at this point it wouldn't shock me if it took over 100 years to finish making the thing. What you're alluding to are the variety of ways we could fail to construct a self-improving AGI at all. Obviously there are plenty of ways to fail to make an AGI that can improve its own ability to track things about its environment in a domain-general way, without bursting into flames at any point. If there weren't plenty of ways to fail, we'd have already succeeded.

Our main difference in focus is that I'm worried about what happens if we do succeed in building a self-improving AGI that doesn't randomly melt down. Conditioned on our succeeding in the next few centuries in making a machine that actually optimizes for anything at all, and that optimizes for its own ability to generally represent its environment in a way that helps it in whatever else it's optimizing for, we should currently expect humans to go extinct as a result. Even if the odds of our succeeding in the next few centuries were small, it would be worth thinking about how to make that extinction event less likely. (Though they aren't small.)

I gather that you think that making an artificial process behave in any particular way at all (i.e., optimizing for something), while recursively doing surgery on its own source code in the radical way MIRI is interested in, is very tough. My concern is that, no matter how true that is, it doesn't entail that if we succeed at that tough task, we'll therefore have made much progress on other important tough tasks, like Friendliness. It does give us more time to work on Friendliness, but if we convince ourselves that intelligence explosion is a completely pie-in-the-sky possibility, then we won't use that time effectively.

I also gather that you have a hard time imagining our screwing up on a goal architecture without simply breaking the AGI. Perhaps by 'screwing up' you're imagining failing to close a set of parentheses. But I think you should be at least as worried about philosophical, as opposed to technical, errors. A huge worry isn't just that we'll fail to make the AI we intended; it's that our intentions while we're coding the thing will fail to align with the long-term interests of ourselves, much less of the human race.

But these problems are in essence confinements, or bounds on how an AI is going to behave.

How likely is an AI then going to take over the world, or look for dangerous aliens, in order to make sure that neither aliens nor humans obstruct it from achieving its goal?

We agree that it's possible to 'bind' a superintelligence. (By this you don't mean boxing it; you just mean programming it to behave in some ways as opposed to others.) But if the bindings fall short of Friendliness, while enabling superintelligence to arise at all, then a serious risk remains. Is your thought that Friendliness is probably an easier 'binding' to figure out how to code than are, say, resisting Pascal's mugging, or having consistent arithmetical reasoning?

Comment author: XiXiDu 06 September 2013 12:21:29PM *  0 points

Is your thought that Friendliness is probably an easier 'binding' to figure out how to code than are, say, resisting Pascal's mugging, or having consistent arithmetical reasoning?

To explain what I have in mind, consider Ben Goertzel's example of how to test for general intelligence:

...when a robot can enrol in a human university and take classes in the same way as humans, and get its degree, then I’ll [say] we’ve created [an]… artificial general intelligence.

I do not disagree that such a robot, when walking towards the classroom, if it is being obstructed by a fellow human student, could attempt to kill this human, in order to get to the classroom.

Killing a fellow human, from the perspective of the human creators of the robot, is clearly a mistake. From a human perspective, it means that the robot failed.

You believe that the robot was just following its programming/construction. Indeed, the robot is its programming. I agree with this. I agree that the human creators were mistaken about what dynamic state sequence the robot will exhibit by computing the code.

What the "dumb superintelligence" argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robot's power. For example, while fighting with the human in order to kill it, for a split-second it mistakes its own arm for that of the human and breaks it.

You might now argue that such a robot isn't much of a risk. It is pretty stupid to mistake its own arm for that of the enemy it is trying to kill. True. But the point is that there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this.

For the robot to master a complex environment, like a university full of humans, without harming itself, or decreasing the chance of achieving its goals, is already very difficult. Not stabbing or strangling other human students is not more difficult than not jumping from the 4th floor, instead of taking the stairs. This is the "dumb superintelligence" argument.

Comment author: RobbBB 06 September 2013 06:51:43PM *  5 points

What the "dumb superintelligence" argument tries to highlight is that if humans are incapable of predicting such behavior, then they will also be mistaken about predicting behavior that is harmful to the robot's power.

To some extent. Perhaps it would be helpful to distinguish four different kinds of defeater:

  1. early intelligence defeater: We try to build a seed AI, but our self-rewriting AI quickly hits a wall or explodes. This is most likely if we start with a subhuman intelligence and have serious resource constraints (so we can't, e.g., just run an evolutionary algorithm over millions of copies of the AGI until we happen upon a variant that works).

  2. late intelligence defeater: The seed AI works just fine, but at some late stage, when it's already at or near superintelligence, it suddenly explodes. Apparently it went down a blind alley at some point early on that led it to plateau or self-destruct later on, and neither it nor humanity is smart enough yet to figure out where exactly the problem arose. So the FOOM fizzles.

  3. early Friendliness defeater: From the outset, the seed AI's behavior already significantly diverges from Friendliness.

  4. late Friendliness defeater: The seed AI starts off as a reasonable approximation of Friendliness, but as it approaches superintelligence its values diverge from anything we'd consider Friendly, either because it wasn't previously smart enough to figure out how to self-modify while keeping its values stable, or because it was never perfectly Friendly and the new circumstances its power puts it in now make the imperfections much more glaring.

In general, late defeaters are much harder for humans to understand than early defeaters, because an AI undergoing FOOM is too fast and complex to be readily understood. Your three main arguments, if I'm understanding them, have been:

  • (a) Early intelligence defeaters are so numerous that there's no point thinking much about other kinds of defeaters yet.
  • (b) Friendliness defeaters imply a level of incompetence on the programmers' part that strongly suggest intelligence defeaters will arise in the same situation.
  • (c) If an initially somewhat-smart AI is smart enough to foresee and avoid late intelligence defeaters, then an initially somewhat-nice AI should be smart enough to foresee and avoid late Friendliness defeaters.

I reject (a), because I haven't seen any specific reason a self-improving AGI will be particularly difficult to make FOOM -- 'it would require lots of complicated things to happen' is very nearly a fully general argument against any novel technology, so I can't get very far on that point alone. I accept (b), at least for a lot of early defeaters. But my concern is that while non-Friendliness predicts non-intelligence (and non-intelligence predicts non-Friendliness), intelligence also predicts non-Friendliness.

But our interesting disagreement seems to be over (c). Interesting because it illuminates general differences between the basic idea of a domain-general optimization process (intelligence) and the (not-so-)basic idea of Everything Humans Like. One important difference is that if an AGI optimizes for anything, it will have strong reason to steer clear of possible late intelligence defeaters. Late Friendliness defeaters, on the other hand, won't scare optimization-process-optimizers in general.

It's easy to see in advance that most beings that lack obvious early Friendliness defeaters will nonetheless have late Friendliness defeaters. In contrast, it's much less clear that most beings lacking early intelligence defeaters will have late intelligence defeaters. That's extremely speculative at this point -- we simply don't know what sorts of intelligence-destroying attractors might exist out there, or what sorts of paradoxes and complications are difficult v. trivial to overcome.

there is no relevant difference between failing to predict behavior that will harm the robot itself, and behavior that will harm a human. Except that you might believe the former is much easier than the latter. I dispute this.

But, once again, it doesn't take any stupidity on the AI's part to disvalue physically injuring a human, even if it does take stupidity to not understand that one is physically injuring a human. It only takes a different value system. Valuing one's own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent. (Indeed, to the extent they aren't orthogonal it's because valuing becoming more intelligent tends to imply disvaluing human survival, because humans are hard to control and made of atoms that can be used for other ends, including increased computing power.) This is the whole point of the article we're commenting on.

Comment author: XiXiDu 07 September 2013 09:46:42AM *  3 points

Your three main arguments, if I'm understanding them, have been:

Here is part of my stance towards AI risks:

1. I assign a negligible probability to the possibility of a sudden transition from well-behaved narrow AIs to general AIs (see below).

2. An AI will not be pulled at random from mind design space. An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.

3. Commercial, research or military products are created with efficiency in mind. An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research. If early stages showed that inputs such as the natural language query <What would you do if I asked you to minimize human suffering?> would yield results such as <I will kill all humans.> then the AI would never reach a stage in which it was sufficiently clever and trained to understand what results would satisfy its creators in order to deceive them.

4. I assign a negligible probability to the possibility of a consequentialist AI / expected utility maximizer / approximation to AIXI.

Given that the kind of AIs from point 4 are possible:

5. Omohundro's AI drives are what make the kind of AIs mentioned in point 4 dangerous. Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.), or should otherwise be easy compared to the general difficulties involved in making an AI work using limited resources.

6. An AI from point 4 will only ever do what it has been explicitly programmed to do. Such an AI is not going to protect its utility-function, acquire resources or preemptively eliminate obstacles in an unbounded fashion, because it is not intrinsically rational to do so. What specifically constitutes rational, economic behavior is inseparable from an agent's terminal goal. That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.

7. Unintended consequences are by definition not intended. They are not intelligently designed but detrimental side effects, failures. Whereas intended consequences, such as acting intelligently, are intelligently designed. If software was not constantly improved to be better at doing what humans intend it to do we would never be able to reach a level of sophistication where a software could work well enough to outsmart us. To do so it would have to work as intended along a huge number of dimensions. For an AI to constitute a risk as a result of unintended consequences those unintended consequences would have to have no, or little, negative influence on the huge number of intended consequences that are necessary for it to be able to overpower humanity.

I haven't seen any specific reason a self-improving AGI will be particularly difficult to make FOOM...

I am not yet at a point of my education where I can say with confidence that this is the wrong way to think, but I do believe it is.

If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason for why he could be wrong? Personally I would perceive the burden of proof to be on him to show me that the risk is real.

Despite this, I have specific reasons to personally believe that the kind of AI you have in mind is impossible. I have thought about such concepts as consequentialism / expected utility maximization, and do not see how they could be made to work, other than under very limited circumstances. I have also asked other people, outside of LessWrong, who are more educated and smarter than me, and they also told me that these kinds of AIs are not feasible; they are uncomputable.

But our interesting disagreement seems to be over (c).

I am not sure I understand what you mean by (c). I don't think I agree with it.

One important difference is that if an AGI optimizes for anything,

I don't know what this means.

Valuing one's own survival is not orthogonal to valuing becoming more intelligent; but valuing human survival is orthogonal to valuing becoming more intelligent.

That this black box you call "intelligence" might be useful to achieve a lot of goals is not an argument in support of humans wanting to and succeeding at the implementation of "value to maximize intelligence" in conjunction with "by all means".

Most definitions of intelligence that I am aware of are in terms of the ability to achieve goals. Saying that a system values to become more intelligent then just means that a system values to increase its ability to achieve its goals. In this context, what you suggest is that humans will want to, and will succeed in, implementing an AI that, in order to beat humans at Tic-tac-toe, is first going to take over the universe and make itself capable of building such things as Dyson spheres.

What I am saying is that it is much easier to create a Tic-tac-toe playing AI, or an AI that can earn a university degree, than the former in conjunction with being able to take over the universe and build Dyson spheres.

The argument that valuing not to kill humans is orthogonal to taking over the universe and building Dyson spheres is completely irrelevant.

Comment author: RobbBB 09 September 2013 09:15:55PM *  3 points

An AI will not be pulled at random from mind design space.

I don't think anyone's ever disputed this. (However, that's not very useful if the deterministic process resulting in the SI is too complex for humans to distinguish it in advance from the outcome of a random walk.)

An AI will be the result of a research and development process. A new generation of AIs will need to be better than other products at “Understand What Humans Mean” and “Do What Humans Mean”, in order to survive the research phase and subsequent market pressure.

Agreed. But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires. The concern isn't that we'll suddenly start building AIs with the express purpose of hitting humans in the face with mallets. The concern is that we'll code for short-term rather than long-term goals, due to a mixture of disinterest in Friendliness and incompetence at Friendliness. But if intelligence explosion occurs, 'the long run' will arrive very suddenly, and very soon. So we need to adjust our research priorities to more seriously assess and modulate the long-term consequences of our technology.

An AI that was prone to take unbounded actions given any terminal goal would either be fixed or abandoned during the early stages of research.

That may be a reason to think that recursively self-improving AGI won't occur. But it's not a reason to expect such AGI, if it occurs, to be Friendly.

If early stages showed that inputs such as the natural language query <What would you do if I asked you to minimize human suffering?> would yield results such as <I will kill all humans.>

The seed is not the superintelligence. We shouldn't expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.

Making an AI that does not exhibit these drives in an unbounded manner is probably a prerequisite to get an AI to work at all (there are not enough resources to think about being obstructed by simulator gods etc.)

I'm not following. Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?

An AI from point 4 will only ever do what it has been explicitly programmed to do.

You don't seem to be internalizing my arguments. This is just the restatement of a claim I pointed out was not just wrong but dishonestly stated here.

That any terminal goal can be realized in an infinite number of ways implies an infinite number of instrumental goals to choose from.

Sure, but the lists of instrumental goals overlap more than the lists of terminal goals, because energy from one project can be converted to energy for a different project. This is an empirical discovery about our world; we could have found ourselves in the sort of universe where instrumental goals don't converge that much, e.g., because once energy's been locked down into organisms or computer chips you just can't convert it into useful work for anything else. In a world where we couldn't interfere with the AI's alien goals, nor could our component parts and resources be harvested to build very different structures, nor could we be modified to work for the AI, the UFAI would just ignore us and zip off into space to try and find more useful objects. We don't live in that world because complicated things can be broken down into simpler things at a net gain in our world, and humans value a specific set of complicated things.

'These two sets are both infinite' does not imply 'we can't reason about the two sets' relative sizes, or about how often the same elements recur in each'.

I am not yet at a point of my education where I can say with confidence that this is the wrong way to think, but I do believe it is.

If someone walked up to you and told you about a risk only he can solve, and that you should therefore give this person money, would you give him money because you do not see any specific reason for why he could be wrong? Personally I would perceive the burden of proof to be on him to show me that the risk is real.

You've spent an awful lot of time writing about the varied ways in which you've not yet been convinced by claims you haven't put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about? I'm not saying to stop talking about this, but there's plenty of material on a lot of these issues to be found. Have you read Intelligence Explosion Microeconomics?

if an AGI optimizes for anything,

I don't know what this means.

http://wiki.lesswrong.com/wiki/Optimization_process

succeeding at the implementation of "value to maximize intelligence" in conjunction with "by all means".

As a rule, adding halting conditions adds complexity to an algorithm, rather than removing complexity.

Saying that a system values to become more intelligent then just means that a system values to increase its ability to achieve its goals.

No, this is a serious misunderstanding. Yudkowsky's definition of 'intelligence' is about the ability to achieve goals in general, not about the ability to achieve the system's goals. That's why you can't increase a system's intelligence by lowering its standards, i.e., making its preferences easier to satisfy.

what you suggest is that humans will want to, and will succeed in, implementing an AI that, in order to beat humans at Tic-tac-toe, is first going to take over the universe and make itself capable of building such things as Dyson spheres.

Straw-man; no one has claimed that humans are likely to want to create a UFAI. What we've suggested is that humans are likely to want to create an algorithm, X, that will turn out to be a UFAI. (In other words, the fallacy you're committing is confusing intension with extension.)

That aside: Are you saying Dyson spheres wouldn't be useful for beating more humans at more tic-tac-toe games? Seems like a pretty good way to win at tic-tac-toe to me.

Comment author: Eliezer_Yudkowsky 10 September 2013 02:11:11AM 2 points

Yudkowsky's definition of 'intelligence' is about the ability to achieve goals in general, not about the ability to achieve the system's goals. That's why you can't increase a system's intelligence by lowering its standards, i.e., making its preferences easier to satisfy.

Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals, but if your goals are very relaxed then the volume of outcome space with equal or greater utility will be very large. However one would expect that many of the processes involved in hitting a narrow target in outcome space (such that few other outcomes are rated equal or greater in the agent's preference ordering), such as building a good epistemic model or running on a fast computer, would generalize across many utility functions; this is why we can speak of properties apt to intelligence apart from particular utility functions.

Comment author: RobbBB 10 September 2013 02:36:14AM *  0 points

Actually I do define intelligence as ability to hit a narrow outcome target relative to your own goals

Hmm. But this just sounds like optimization power to me. You've defined intelligence in the past as "efficient cross-domain optimization". The "cross-domain" part I've taken to mean that you're able to hit narrow targets in general, not just ones you happen to like. So you can become more intelligent by being better at hitting targets you hate, or by being better at hitting targets you like.

The former are harder to test, but something you'd hate doing now could become instrumentally useful to know how to do later. And your intelligence level doesn't change when circumstances shift which part of your skillset is instrumentally useful. For that matter, I'm missing why it's useful to think that your intelligence level could drastically shift if your abilities remained constant but your terminal values were shifted. (E.g., if you became pickier.)

Comment author: Eliezer_Yudkowsky 10 September 2013 03:14:50AM 2 points [-]

No, "cross-domain" means that I can optimize across instrumental domains. Like, I can figure out how to go through water, air, or space if that's the fastest way to my destination, I am not limited to land like a ground sloth.

Measured intelligence shouldn't shift if you become pickier - if you could previously hit a point such that only 1/1000th of the space was more preferred than it, we'd still expect you to hit around that narrow a volume of the space given your intelligence even if you claimed afterward that a point like that only corresponded to 0.25 utility on your 0-1 scale instead of 0.75 utility due to being pickier ([expected] utilities sloping more sharply downward with increasing distance from the optimum).
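This invariance can be sketched concretely. The toy Python snippet below (illustrative only; the outcome space and utility functions are made up) scores an agent by how small a fraction of outcome space is rated at least as highly as what it achieved, and shows that a monotone "pickier" rescaling of utility leaves the score unchanged:

```python
import math

def optimization_power(outcomes, utility, achieved):
    """-log2 of the fraction of outcome space rated at least as highly
    (under the agent's own preference ordering) as the achieved outcome."""
    hits = sum(1 for o in outcomes if utility(o) >= utility(achieved))
    return -math.log2(hits / len(outcomes))

outcomes = range(1000)               # toy outcome space
utility = lambda o: o                # original preferences
pickier = lambda o: (o / 1000) ** 3  # monotone rescaling: same ordering, lower numbers

achieved = 990                       # the agent hit a top-1% outcome

# The measure depends only on the preference ordering, so relabeling
# utilities ("becoming pickier") does not change measured intelligence.
assert optimization_power(outcomes, utility, achieved) == \
       optimization_power(outcomes, pickier, achieved)
```

Since only the ordering enters the count, any strictly increasing transformation of utility yields the same score.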

Comment author: XiXiDu 10 September 2013 10:34:11AM 0 points [-]

But by default, a machine that is better than other rival machines at satisfying our short-term desires will not satisfy our long-term desires.

You might be not aware of this but I wrote a sequence of short blog posts where I tried to think of concrete scenarios that could lead to human extinction. Each of which raised many questions.

The introductory post is 'AI vs. humanity and the lack of concrete scenarios'.

1. Questions regarding the nanotechnology-AI-risk conjunction

2. AI risk scenario: Deceptive long-term replacement of the human workforce

3. AI risk scenario: Social engineering

4. AI risk scenario: Elite Cabal

5. AI risk scenario: Insect-sized drones

6. AI risks scenario: Biological warfare

What might appear completely obvious to you for reasons that I do not understand, e.g. that an AI can take over the world, appears to me largely like magic (I am not trying to be rude; by magic I only mean that I don't understand the details). At the very least there are a lot of open questions. And that is even granting that, for the sake of the above posts, I accepted that the AI is superhuman and can do such things as deceive humans by its superior knowledge of human psychology. Which seems to be a non-trivial assumption, to say the least.

That may be a reason to think that recursively self-improving AGI won't occur. But it's not a reason to expect such AGI, if it occurs, to be Friendly.

Over and over I told you that given all your assumptions, I agree that AGI is an existential risk.

The seed is not the superintelligence. We shouldn't expect the seed to automatically know whether the superintelligence will be Friendly, any more than we should expect humans to automatically know whether the superintelligence will be Friendly.

You did not reply to my argument. My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness. My argument did not pertain to the possibility of a friendly seed turning unfriendly.

Why does an AGI have to have a halting condition (specifically, one that actually occurs at some point) in order to be able to productively rewrite its own source code?

What I have been arguing is that an AI should not be expected, by default, to want to eliminate all possible obstructions. There are many gradations here. That, by some economic or otherwise theoretic argument, it might be instrumentally rational for some ideal AI to take over the world does not mean that humans would create such an AI, or that an AI could not be limited to caring about fires in its server farm rather than the possibility that Russia might nuke the U.S. and thereby destroy its servers.

You don't seem to be internalizing my arguments.

Did you mean to reply to another point? I don't see how the reply you linked to is relevant to what I wrote.

Sure, but the list of instrumental goals overlap more than the list of terminal goals, because energy from one project can be converted to energy for a different project.

My argument is that an AI does not need to consider all possible threats and care to acquire all possible resources. Based on its design it could just want to optimize using its initial resources while only considering mundane threats. I just don't see real-world AIs concluding that they need to take over the world. I don't think an AI is likely to be designed that way. I also don't think such an AI could work, because such inferences would require enormous amounts of resources.

You've spent an awful lot of time writing about the varied ways in which you've not yet been convinced by claims you haven't put much time into actively investigating. Maybe some of that time could be better spent researching these topics you keep writing about?

I have done what is possible given my current level of education and what I perceive to be useful. I have e.g. asked experts about their opinion.

A few general remarks about the kind of papers such as the one that you linked to.

How much should I update towards MIRI's position if I (1) understood the arguments in the paper (2) found the arguments convincing?

My answer is the following. If the paper was about the abc conjecture, the P versus NP problem, climate change, or even such mundane topics as psychology, I would either not be able to understand the paper, would be unable to verify the claims, or would have very little confidence in my judgement.

So what about 'Intelligence Explosion Microeconomics'? That I can read most of it is only due to the fact that it is very informally written. The topic itself is more difficult and complex than all of the above mentioned problems together. Yet the arguments in support of it, to exaggerate a little bit, contain less rigor than the abstract of one of Shinichi Mochizuki's papers on the abc conjecture.

Which means that my answer is that I should update very little towards MIRI's position and that any confidence I gain about MIRI's position is probably highly unreliable.

http://wiki.lesswrong.com/wiki/Optimization_process

Thanks. My feeling is that to gain any confidence into what all this technically means, and to answer all the questions this raises, I'd probably need about 20 years of study.

No, this is a serious misunderstanding. Yudkowsky's definition of 'intelligence' is

Here is part of a post exemplifying how I understand the relation between goals and intelligence:

If a goal has very few constraints then the set that satisfies all constraints is very large. A vague and ambiguous goal allows for too much freedom in the sense that a wide range of world states would have the same expected value, and therefore implies a very large solution space, since a wide range of AIs will be able to achieve those world states and thereby satisfy the condition of being improved versions of their predecessor.

This means that in order to get an AI to become superhuman at all, and very quickly in particular, you will need to encode a very specific goal against which mistakes, optimization power and achievement can be judged.
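The size difference can be made concrete with a toy count (purely illustrative; the "goals" here are arbitrary predicates over a made-up state space):

```python
from itertools import product

# Toy world: 10 binary features, so 2**10 = 1024 possible world states.
states = list(product([0, 1], repeat=10))

vague_goal = lambda s: sum(s) >= 1       # "at least one feature is on"
specific_goal = lambda s: s == (1,) * 10  # one exact target state

# A vague goal leaves a huge set of equally acceptable world states;
# a specific goal leaves a tiny one to judge success against.
vague_hits = sum(1 for s in states if vague_goal(s))        # 1023
specific_hits = sum(1 for s in states if specific_goal(s))  # 1
```

The narrower the satisfying set, the more discriminating the goal is as a yardstick for optimization power and for judging self-improvement.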


It is really hard to communicate how I perceive this and other discussions about MIRI's position without offending people, or killing the discussion.

I am saying this in full honesty. The position you appear to support seems so utterly "complex" (far-fetched) that the current arguments are unconvincing.

Here is my perception of the scenario that you try to sell me (exaggerated to make a point). I have a million questions about it that I can't answer and which your answers either sidestep or explain away by using "magic".

At this point I have probably made 90% of the people reading this comment incredibly angry. My perception is that you cannot communicate this perception on LessWrong without getting into serious trouble. That's also what I meant when I told you that I cannot be completely honest if you want to discuss this on LessWrong.

I can also assure you that many people who are much smarter and higher status than me think so as well. Many people communicated the absurdity of all this to me but told me that they would not repeat this in public.

Comment author: lavalamp 10 September 2013 11:00:15PM *  1 point [-]

My argument was that if the seed is unfriendly then it will not be smart enough to hide its unfriendliness.

Pretending to be friendly when you're actually not is something that doesn't even require human level intelligence. You could even do it accidentally.

In general, the appearance of Friendliness at low levels of ability to influence the world doesn't guarantee actual Friendliness at high levels of ability to influence the world. (If it did, elected politicians would be much higher quality.)

Comment author: XiXiDu 06 September 2013 10:03:15AM *  0 points [-]

Our main difference in focus is that I'm worried about what happens if we do succeed in building a self-improving AGI that doesn't randomly melt down.

I am trying to understand if the kind of AI, that is underlying the scenario that you have in mind, is a possible and likely outcome of human AI research.

As far as I am aware, as a layman, goals and capabilities are intrinsically tied together. How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate?

Coherent and specific goals are necessary to (1) decide which actions are instrumental useful (2) judge the success of self-improvement. If the given goal is logically incoherent, or too vague for the AI to be able to tell apart success from failure, would it work at all?

If I understand your position correctly, you would expect a chess-playing general AI, one that does not know about checkmate, to improve against such goals as "modeling states of affairs well" or "making sure nothing intervenes with its chess playing", rather than "winning at chess". You believe that these goals do not have to be programmed by humans, because they are emergent goals, an instrumental consequence of being generally intelligent.

These universal instrumental goals, these "AI drives", seem to be a major reason for why you believe it to be important to make the AI care about behaving correctly. You believe that these AI drives are a given, and that the only way to prevent an AI from being an existential risk is to channel these drives, to focus this power on protecting and amplifying human values.

My perception is that these drives that you imagine are not special and will be as difficult to get "right" as any other goal. I think that the scenario in which humans not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, is a very unlikely one.

As far as I am aware, here is what you believe an AI to want:

  • It will want to self-improve
  • It will want to be rational
  • It will try to preserve its utility function
  • It will try to prevent counterfeit utility
  • It will be self-protective
  • It will want to acquire resources and use them efficiently

Which AIs that humans would ever want to create would require all of these drives? And how easy will it be for humans to make an AI exhibit these drives, compared to making an AI that can do what humans want without them?

Take mathematics. What are the difficulties associated with making an AI better than humans at mathematics, and will an AI need these drives in order to do so?

Humans did not evolve to play chess or do mathematics. Yet it is considerably more difficult to design an AI that is capable of discovering interesting and useful mathematics than one that plays chess.

I believe that this difficulty is due to the fact that it is much easier to formalize what it means to play chess than what it means to do mathematics. The difference between chess and mathematics is that chess has a specific terminal goal in the form of a clear definition of what constitutes winning. Although mathematics has unambiguous rules, there is no specific terminal goal and no clear definition of what constitutes winning.

Progress in the capability of artificial intelligence is related not only to whether humans have evolved for a certain skill, or to how many computational resources it requires, but also to how difficult it is to formalize the skill: its rules and what it means to succeed.

In the light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?

All these drives are very vague ideas, not like "winning at chess", but more like "being better at mathematics than Terence Tao".

The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume, and from which you can reason about the behavior of an AI.

It is this context that the "dumb superintelligence" argument tries to highlight. It is likely incredibly hard to make these drives emerge in a seed AI. They implicitly presuppose that humans succeed at encoding intricate ideas about what "winning" means in all those cases required to overpower humans, but not in cases such as winning at chess or doing mathematics. I like to analogize such a scenario to the creation of a generally intelligent autonomous car that works perfectly well at not destroying itself in a crash but which somehow manages to maximize the number of people it runs over.

I agree that if you believe that it is much easier to create a seed AI that exhibits the drives you imagine than it is to make a seed AI use its initial resources to figure out how to solve a specific problem, then we agree about AI risks.

Comment author: Moss_Piglet 06 September 2013 02:52:39PM *  2 points [-]

(Note: I'm also a layman, so my non-expert opinions necessarily come with a large salt side-dish)

My guess here is that most of the "AI Drives" to self-improve, be rational, retain its goal structure, etc. are considered necessary for a functional learning/self-improving algorithm. If the program cannot recognize and make rules for new patterns observed in data, make sound inferences based on known information, or keep after its objective, it will not be much of an AGI at all; it will not even be able to function as well as a modern targeted-advertising program.

The rest, such as self-preservation, are justified as being logical requirements of the task. Rather than having self-preservation as a terminal value, the paperclip maximizer will value its own existence as an optimal means of proliferating paperclips. It makes intuitive sense that those sorts of 'drives' would emerge from most any goal, but then again my intuition is not necessarily very useful for these sorts of questions.

This point might also be a source of confusion;

The progress of the capability of artificial intelligence is not only related to whether humans have evolved for a certain skill or to how much computational resources it requires but also to how difficult it is to formalize the skill, its rules and what it means to succeed. In the light of this, how difficult would it be to program the drives that you imagine, versus just making an AI win against humans at a given activity without exhibiting these drives?

As Dr Valiant (great name or the greatest name?) classifies things in Probably Approximately Correct, Winning Chess would be a 'theoryful' task while Discovering (Interesting) Mathematical Proofs would be a 'theoryless' one. In essence, the theoryful has simple and well established rules for the process which could be programmed optimally in advance with little-to-no modification needed afterwards while the theoryless is complex and messy enough that an imperfect (Probably Approximately Correct) learning process would have to be employed to suss out all the rules.

Now obviously the program will benefit from labeling in its training data for what is and is not an "interesting" mathematical proof, otherwise it can just screw around with computationally-cheap arithmetic proofs (1 + 1 = 2, 1.1 + 1 = 2.1, 1.2 + 1 = 2.2, etc.) until the heat death of the universe. Less obviously, as the hidden tank example shows, insufficient labeling or bad labels will lead to other unintended results.

So applying that back to Friendliness; despite attempts to construct a Fun Theory, human value is currently (and may well forever remain) theoryless. A learning process whose goal is to maximize human value is going to have to be both well constructed and have very good labels initially to not be Unfriendly. Of course, it could very well correct itself later on, that is in fact at the core of a PAC algorithm, but then we get into questions of FOOM-ing and labels of human value in the environment which I am not equipped to deal with.

Comment author: RobbBB 06 September 2013 05:03:11PM *  3 points [-]

How could a chess computer be capable of winning against humans at chess without the terminal goal of achieving a checkmate?

Humans are capable of winning at chess without the terminal goal of doing so. Nor were humans designed by evolution specifically for chess. Why should we expect a general superintelligence to have intelligence that generalizes less easily than a human's does?

If the given goal is logically incoherent, or too vague for the AI to be able to tell apart success from failure, would it work at all?

You keep coming back to this 'logically incoherent goals' and 'vague goals' idea. Honestly, I don't have the slightest idea what you mean by those things. A goal that can't motivate one to do anything ain't a goal; it's decor, it's noise. 'Goals' are just the outcomes systems tend to produce, especially systems too complex to be easily modeled as, say, physical or chemical processes. Certainly it's possible for goals to be incredibly complicated, or to vary over time. But there's no such thing as a 'logically incoherent outcome'. So what's relevant to our purposes is whether failing to make a powerful optimization process human-friendly will also consistently stop the process from optimizing for anything whatsoever.

I think that the idea that humans not only want to make an AI exhibit such drives, but also succeed at making such drives emerge, is a very unlikely outcome.

Conditioned on a self-modifying AGI (say, an AGI that can quine its source code, edit it, then run the edited program and repeat the process) achieving domain-general situation-manipulating abilities (i.e., intelligence), analogous to humans' but to a far greater degree, which of the AI drives do you think are likely to be present, and which absent? 'It wants to self-improve' is taken as a given, because that's the hypothetical we're trying to assess. Now, should we expect such a machine to be indifferent to its own survival and to the use of environmental resources?
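(For readers unfamiliar with the term: "quining" just means a program reproducing its own source code, which is mechanically unremarkable. A minimal Python sketch, purely to illustrate the concept and not any claim about AGI:)

```python
# A minimal quine: a string that, formatted with itself, reproduces
# the two lines of source that define it.
s = 's = %r\nquine = s %% s'
quine = s % s

# Running the reproduced source recreates the same string, closing the
# quine -> (edit) -> run loop described in the hypothetical above.
scope = {}
exec(quine, scope)
assert scope['quine'] == quine
```

The hard part of the hypothetical is not the quining but the "edit it productively" step.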

The point I am trying to make is that these drives constitute additional complexity, rather than being simple ideas that you can just assume

Sometimes a more complex phenomenon is the implication of a simpler hypothesis. A much narrower set of goals will have intelligence-but-not-resource-acquisition as instrumental than will have both as instrumental, because it's unlikely to hit upon a goal that requires large reasoning abilities but does not call for many material resources.

It is likely incredibly hard to make these drives emerge in a seed AI.

You haven't given arguments suggesting that here. At most, you've given arguments against expecting a seed AI to be easy to invent. Be careful to note, to yourself and others, when you switch between the claims 'a superintelligence is too hard to make' and 'if we made a superintelligence it would probably be safe'.

Comment author: TheOtherDave 06 September 2013 06:46:39PM 0 points [-]

You keep coming back to this 'logically incoherent goals' and 'vague goals' idea. Honestly, I don't have the slightest idea what you mean by those things.

Well, I'm not sure what XXD means by them, but...

G1 ("Everything is painted red") seems like a perfectly coherent goal. A system optimizing G1 paints things red, hires people to paint things red, makes money to hire people to paint things red, invents superior paint-distribution technologies to deposit a layer of red paint over things, etc.

G2 ("Everything is painted blue") similarly seems like a coherent goal.

G3 (G1 AND G2) seems like an incoherent goal. A system with that goal... well, I'm not really sure what it does.
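(The difference can be checked mechanically in a toy model; the objects and colors below are arbitrary stand-ins:)

```python
from itertools import product

# Toy world states: each object is painted exactly one color.
objects = ['door', 'car', 'fence']
colors = ['red', 'blue']
worlds = [dict(zip(objects, cs)) for cs in product(colors, repeat=len(objects))]

G1 = lambda w: all(v == 'red' for v in w.values())   # "everything is red"
G2 = lambda w: all(v == 'blue' for v in w.values())  # "everything is blue"
G3 = lambda w: G1(w) and G2(w)                       # the conjunction

# G1 and G2 each pick out one world; their conjunction picks out none,
# so a G3-optimizer has no target state to steer toward.
assert sum(map(G1, worlds)) == 1
assert sum(map(G2, worlds)) == 1
assert sum(map(G3, worlds)) == 0
```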

Comment author: RobbBB 06 September 2013 07:35:27PM *  1 point [-]

A system's goals have to be some event that can be brought about. In our world, '2+2=4' and '2+2=5' are not goals; 'everything is painted red and not-red' may not be a goal for similar reasons. When we're talking about an artificial intelligence's preferences, we're talking about the things it tends to optimize for, not the things it 'has in mind' or the things it believes are its preferences.

This is part of what makes the terminology misleading, and is also why we don't ask 'can a superintelligence be irrational?'. Irrationality is dissonance between my experienced-'goals' (and/or, perhaps, reflective-second-order-'goals') and my what-events-I-produce-'goals'; but we don't care about the superintelligence's phenomenology. We only care about what events it tends to produce.

Tabooing 'goal' and just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce would, I think, undermine a lot of XiXiDu's intuitions about goals being complex explicit objects you have to painstakingly code in. The only thing that makes it more useful to model a superintelligence as having 'goals' than modeling a blue-minimizing robot as having 'goals' is that the superintelligence responds to environmental variation in a vastly more complicated way. (Because, in order to be even a mediocre programmer, its model-of-the-world-that-determines-action has to be more complicated than a simple camcorder feed.)

Comment author: TheOtherDave 07 September 2013 04:35:51PM 1 point [-]

we're talking about the things it tends to optimize for, not the things it 'has in mind'

Oh.
Well, in that case, all right. If there exists some X a system S is in fact optimizing for, and what we mean by "S's goals" is X, regardless of what target S "has in mind", then sure, I agree that systems never have vague or logically incoherent goals.

just talking about the events a process-that-models-its-environment-and-directs-the-future tends to produce

Well, wait. Where did "models its environment" come from?
If we're talking about the things S optimizes its environment for, not the things S "has in mind", then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.

In fact, given how you've defined "goal" here, I'm not sure why we're talking about intelligence at all. If that is what we mean by "goal" then intelligence has nothing to do with goals, or optimizing for goals. Volcanoes have goals, in that sense. Protons have goals.

I suspect I'm still misunderstanding you.

Comment author: RobbBB 07 September 2013 06:18:57PM *  1 point [-]

From Eliezer's Belief in Intelligence:

"Since I am so uncertain of Kasparov's moves, what is the empirical content of my belief that 'Kasparov is a highly intelligent chess player'? What real-world experience does my belief tell me to anticipate? [...]

"The empirical content of my belief is the testable, falsifiable prediction that the final chess position will occupy the class of chess positions that are wins for Kasparov, rather than drawn games or wins for Mr. G. [...] The degree to which I think Kasparov is a 'better player' is reflected in the amount of probability mass I concentrate into the 'Kasparov wins' class of outcomes, versus the 'drawn game' and 'Mr. G wins' class of outcomes."

From Measuring Optimization Power:

"When I think you're a powerful intelligence, and I think I know something about your preferences, then I'll predict that you'll steer reality into regions that are higher in your preference ordering. [...]

"Ah, but how do you know a mind's preference ordering? Suppose you flip a coin 30 times and it comes up with some random-looking string - how do you know this wasn't because a mind wanted it to produce that string?

"This, in turn, is reminiscent of the Minimum Message Length formulation of Occam's Razor: if you send me a message telling me what a mind wants and how powerful it is, then this should enable you to compress your description of future events and observations, so that the total message is shorter. Otherwise there is no predictive benefit to viewing a system as an optimization process. This criterion tells us when to take the intentional stance.

"(3) Actually, you need to fit another criterion to take the intentional stance - there can't be a better description that averts the need to talk about optimization. This is an epistemic criterion more than a physical one - a sufficiently powerful mind might have no need to take the intentional stance toward a human, because it could just model the regularity of our brains like moving parts in a machine.

"(4) If you have a coin that always comes up heads, there's no need to say "The coin always wants to come up heads" because you can just say "the coin always comes up heads". Optimization will beat alternative mechanical explanations when our ability to perturb a system defeats our ability to predict its interim steps in detail, but not our ability to predict a narrow final outcome. (Again, note that this is an epistemic criterion.)

"(5) Suppose you believe a mind exists, but you don't know its preferences? Then you use some of your evidence to infer the mind's preference ordering, and then use the inferred preferences to infer the mind's power, then use those two beliefs to testably predict future outcomes. The total gain in predictive accuracy should exceed the complexity-cost of supposing that 'there's a mind of unknown preferences around', the initial hypothesis."

Notice that throughout this discussion, what matters is the mind's effect on its environment, not any internal experience of the mind. Unconscious preferences are just as relevant to this method as are conscious preferences, and both are examples of the intentional stance. Note also that you can't really measure the rationality of a system you're modeling in this way; any evidence you raise for 'irrationality' could just as easily be used as evidence that the system has more complicated preferences than you initially thought, or that they're encoded in a more distributed way than you had previously hypothesized.

My take-away from this is that there are two ways we generally think about minds on LessWrong: Rational Choice Theory, on which all minds are equally rational and strange or irregular behaviors are seen as evidence of strange preferences; and what we might call the Ideal Self Theory, on which minds' revealed preferences can differ from their 'true self' preferences, resulting in irrationality. One way of unpacking my idealized values is that they're the rational-choice-theory preferences I would exhibit if my conscious desires exhibited perfect control over my consciously controllable behavior, and those desires were the desires my ideal self would reflectively prefer, where my ideal self is the best trade-off between preserving my current psychology and enhancing that psychology's understanding of itself and its environment.

We care about ideal selves when we think about humans, because we value our conscious, 'felt' desires (especially when they are stable under reflection) more than our unconscious dispositions. So we want to bring our actual behavior (and thus our rational-choice-theory preferences, the 'preferences' we talk about when we speak of an AI) more in line with our phenomenological longings and their idealized enhancements. But since we don't care about making non-person AIs more self-actualized, but just care about how they tend to guide their environment, we generally just assume that they're rational. Thus if an AI behaves in a crazy way (e.g., alternating between destroying and creating paperclips depending on what day of the week it is), it's not because it's a sane rational ghost trapped by crazy constraints. It's because the AI has crazy core preferences.

Where did "models its environment" come from?

If we're talking about the things S optimizes its environment for, not the things S "has in mind", then it would seem that whether S models its environment or not is entirely irrelevant to the conversation.

Yes, in principle. But in practice, a system that doesn't have internal states that track the world around it in a reliable and useable way won't be able to optimize very well for anything particularly unlikely across a diverse set of environments. In other words, it won't be very intelligent. To clarify, this is an empirical claim I'm making about what it takes to be particularly intelligent in our universe; it's not part of the definition for 'intelligent'.

Comment author: TheOtherDave 08 September 2013 06:00:25AM *  1 point [-]

a system that doesn't have internal states that track the world around it in a reliable and useable way won't be able to optimize very well for anything particularly unlikely across a diverse set of environments

Yes, that seems plausible.

I would say rather that modeling one's environment is an effective tool for consistently optimizing for some specific unlikely thing X across a range of environments, so optimizers that do so will be more successful at optimizing for X, all else being equal, but it more or less amounts to the same thing.

But... so what?

I mean, it also seems plausible that optimizers that explicitly represent X as a goal will be more successful at consistently optimizing for X, all else being equal... but that doesn't stop you from asserting that explicit representation of X is irrelevant to whether a system has X as its goal.

So why isn't modeling the environment equally irrelevant? Both features, on your account, are optional enhancements an optimizer might or might not display.

It keeps seeming like all the stuff you quote and say before your last two paragraphs ought to provide an answer to that question, but after reading it several times I can't see what answer it might be providing. Perhaps your argument is just going over my head, in which case I apologize for wasting your time by getting into a conversation I'm not equipped for.

Comment author: Vladimir_Nesov 07 September 2013 12:56:02AM *  0 points [-]

A system's goals have to be some event that can be brought about.

This sounds like a potentially confusing level of simplification; a goal should be regarded as at least a way of comparing possible events.

When we're talking about an artificial intelligence's preferences, we're talking about the things it tends to optimize for, not the things it 'has in mind' or the things it believes are its preferences.

Its behavior is what makes its goal important. But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better.

If the goal is implemented as a part of the system, other parts of the system can store some information about the goal, certain summaries or inferences based on it. This information can be thought of as beliefs about the goal. And if the goal is not "logically transparent", that is its specification is such that making concrete conclusions about what it states in particular cases is computationally expensive, then the system never knows what its goal says explicitly, it only ever has beliefs about particular aspects of the goal.

Comment author: RobbBB 07 September 2013 06:51:03PM *  0 points [-]

But in a system designed to follow an explicitly specified goal, it does make sense to talk of its goal apart from its behavior. Even though its behavior will reflect its goal, the goal itself will reflect itself better.

Perhaps, but I suspect that for most possible AIs there won't always be a fact of the matter about where its preference is encoded. The blue-minimizing robot is a good example. If we treat it as a perfectly rational agent, then we might say that it has temporally stable preferences that are very complicated and conditional; or we might say that its preferences change at various times, and are partly encoded, for instance, in the properties of the color-inverting lens on its camera. An AGI's response to environmental fluctuation will probably be vastly more complicated than a blue-minimizer's, but the same sorts of problems arise in modeling it.

I think it's more useful to think of rational-choice-theory-style preferences as useful theoretical constructs -- like a system's center of gravity, or its coherently extrapolated volition -- than as real objects in the machine's hardware or software. This sidesteps the problem of haggling over which exact preferences a system has, how those preferences are distributed over the environment, how to decide between causally redundant encodings which is 'really' the preference encoding, etc. See my response to Dave.

Comment author: Vladimir_Nesov 07 September 2013 08:17:40PM 2 points [-]

"Goal" is a natural idea for describing AIs with limited resources: these AIs won't be able to make optimal decisions, and their decisions can't be easily summarized in terms of some goal, but unlike the blue-minimizing robot they have a fixed preference ordering that doesn't gradually drift away from what it was originally, and eventually they tend to get better at following it.

For example, if a goal is encrypted, and it takes a huge amount of computation to decrypt it, the system's behavior prior to that point won't depend on the goal, but it will work on decrypting it and eventually will follow it. This encrypted goal is probably more predictive of long-term consequences than anything else in the details of the original design, but it also doesn't predict the system's behavior during the first stage (and if there is only a small probability that all the resources in the universe would allow decrypting the goal, the system's behavior will probably never depend on the goal). Similarly, even if there is no explicit goal, as in the case of humans, it might be possible to work with an idealized goal that, like the encrypted goal, can't be easily evaluated, and so won't influence behavior for a long time.
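The encrypted-goal scenario can be sketched as a toy program (purely illustrative; the class name, the XOR "encryption", and the `GOAL:` prefix are all invented for the sketch). The point it demonstrates: during phase 1 the agent's observable behavior is identical no matter what the encrypted goal says, yet the encrypted goal is still the best predictor of what it eventually does.

```python
def xor_crypt(data: bytes, key: int) -> bytes:
    """XOR each byte with a one-byte key; encrypting and decrypting are the same operation."""
    return bytes(b ^ key for b in data)

class EncryptedGoalAgent:
    """Toy agent whose goal is stored encrypted. Until the key is found, its
    observable behavior (a brute-force key search) does not depend on what
    the encrypted goal actually says."""

    def __init__(self, encrypted_goal: bytes):
        self.encrypted_goal = encrypted_goal
        self.goal = None

    def step(self) -> str:
        if self.goal is None:
            # Phase 1: behavior is the same for every possible goal content.
            for key in range(256):
                plaintext = xor_crypt(self.encrypted_goal, key)
                if plaintext.startswith(b"GOAL:"):
                    self.goal = plaintext[len(b"GOAL:"):].decode()
                    return "decrypted"
            return "still searching"
        # Phase 2: only now does behavior start to reflect the goal.
        return "pursuing: " + self.goal

ciphertext = xor_crypt(b"GOAL:make paperclips", 42)  # "encrypt" the goal with key 42
agent = EncryptedGoalAgent(ciphertext)
print(agent.step())  # brute-forces the key and learns the goal
print(agent.step())  # only now acts on it
```

A real system's goal could of course be opaque for subtler reasons than literal encryption (e.g. a specification that is expensive to evaluate), but the structure is the same: the goal and the early behavior are separately observable aspects of the system.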

My point is that there are natural examples where goals and the character of behavior don't resemble each other, so that each can't be easily inferred from the other, while both can be observed as aspects of the system. It's useful to distinguish these ideas.

Comment author: XiXiDu 06 September 2013 06:53:40PM 0 points [-]

Here is what I mean:

Evolution was able to come up with cats. Cats are immensely complex objects. Evolution did not intend to create cats. Now consider you wanted to create an expected utility maximizer to accomplish something similar, except that it would be goal-directed, think ahead, and jump fitness gaps. Further suppose that you wanted your AI to create qucks, instead of cats. How would it do this?

Given that your AI is not supposed to search design space at random, but rather look for something particular, you would have to define what exactly qucks are. The problem is that defining what a quck is, is the hardest part. And since nobody has any idea what a quck is, nobody can design a quck creator.

The point is that thinking about the optimization of optimization is misleading, as most of the difficulty is with defining what to optimize, rather than figuring out how to optimize it. In other words, the efficiency of e.g. the scientific method depends critically on being able to formulate a specific hypothesis.

Trying to create an optimization optimizer would be akin to creating an autonomous car to find the shortest route between Gotham City and Atlantis. The problem is not how to get your AI to calculate a route, or optimize how to calculate such a route, but rather that the problem is not well-defined. You have no idea what it means to travel between two fictional cities. Which in turn means that you have no idea what optimization even means in this context, let alone meta-level optimization.

Comment author: linkhyrule5 06 September 2013 07:16:30PM 0 points [-]

The problem is, you don't have to program the bit that says "now make yourself more intelligent." You only have to program the bit that says "here's how to make a new copy of yourself, and here's how to prove it shares your goals without running out of math."

And the bit that says "Try things until something works, then figure out why it worked." AKA modeling.

The AI isn't actually an intelligence optimizer. But it notes that when it takes certain actions, it is better able to model the world, which in turn allows it to make more paperclips (or whatever). So it'll take those actions more often.

Comment author: ArisKatsaris 05 September 2013 07:41:41PM 2 points [-]

or there is some amount of uncertainty about what it means to achieve the goal of "maximizing paperclips"

"uncertainty" is in your human understanding of the program, not in the actual program. A program doesn't go "I don't know what I'm supposed to do next", it follows instructions step-by-step.

If the latter, what would it mean to maximize shapeless objects?

It would mean exactly what it's programmed to mean, without any uncertainty in it at all.