Eliezer_Yudkowsky comments on Evaluating the feasibility of SI's plan - Less Wrong

Post author: JoshuaFox 10 January 2013 08:17AM


Comment author: Eliezer_Yudkowsky 10 January 2013 03:50:40PM 35 points [-]

Lots of strawmanning going on here (could somebody else please point these out? please?) but in case it's not obvious, the problem is that what you call "heuristic safety" is difficult. Now, most people haven't the tiniest idea of what makes anything difficult to do in AI and are living in a verbal-English fantasy world, so of course you're going to get lots of people who think they have brilliant heuristic safety ideas. I have never seen one that would work, and I have seen lots of people come up with ideas that sound to them like they might have a 40% chance of working and which I know perfectly well to have a 0% chance of working.

The real gist of Friendly AI isn't some imaginary 100% perfect safety concept, it's ideas like, "Okay, we need to not have a conditionally independent chance of goal system warping on each self-modification because over the course of a billion modifications any conditionally independent probability will sum to ~1, but since self-modification is initially carried out in the highly deterministic environment of a computer chip it looks possible to use crisp approaches that avert a conditionally independent failure probability for each self-modification." Following this methodology is not 100% safe, but rather, if you fail to do that, your conditionally independent failure probabilities add up to 1 and you're 100% doomed.
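The arithmetic behind "any conditionally independent probability will sum to ~1" is easy to check. A minimal sketch (the per-step numbers are illustrative, not from the comment):

```python
# Toy illustration: how a small, conditionally independent failure
# probability per self-modification compounds over many modifications.
def p_survival(p_fail_per_step: float, n_steps: int) -> float:
    """Probability that no failure occurs across n independent steps."""
    return (1.0 - p_fail_per_step) ** n_steps

# Even a one-in-a-million chance of goal-system warping per rewrite
# leaves essentially zero survival probability over a billion rewrites:
print(p_survival(1e-6, 10**9))  # underflows to 0.0

# A crisp approach that averts the conditionally independent failure
# mode entirely keeps the product at 1 no matter how many steps occur:
print(p_survival(0.0, 10**9))  # 1.0
```

The point survives any plausible per-step figure: against a billion steps, only a per-step failure probability driven to (or provably near) zero keeps the product from collapsing.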

But if you were content with a "heuristic" approach that you thought had a 40% chance of working, you'll never think through the problem in enough detail to realize that your doom probability is not 60% but ~1, because only somebody holding themselves to a higher standard than "heuristic safety" would ever push their thinking far enough to realize that their initial design was flawed.

People at SI are not stupid. We're not trying to achieve lovely perfect safety with a cherry on top because we think we have lots of luxurious time to waste and we're perfectionists. I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them. There are of course lots of people who think that they don't have to solve the same problems, but that's because they're living in a verbal-English fantasy world in which their map is so blurry that they think lots of things "might be possible" that a sharper map would show to be much more difficult than they sound.

I don't know how to take a self-modifying heuristic soup in the process of going FOOM and make it Friendly. You don't know either, but the problem is, you don't know that you don't know. Or to be more precise, you don't share my epistemic reasons to expect that to be really difficult. When you engage in sufficient detail with a problem of FAI, and try to figure out how to solve it given that the rest of the AI was designed to allow that solution, it suddenly looks that much harder to solve under sloppy conditions. Whereas on the "40% safety" approach, it seems like the sort of thing you might be able to do, sure, why not...

If someday I realize that it's actually much easier to do FAI than I thought, given that you use a certain exactly-right approach - so easy, in fact, that you can slap that exactly-right approach on top of an AI system that wasn't specifically designed to permit it, an achievement on par with hacking Google Maps to play chess using its route-search algorithm - then that epiphany will be as the result of considering things that would work and be known to work with respect to some subproblem, not things that seem like they might have a 40% chance of working overall, because only the former approach develops skill.

I'll leave that as my take-home message - if you want to imagine building plug-in FAI approaches, isolate a subproblem and ask yourself how you could solve it and know that you've solved it, don't imagine overall things that have 40% chances of working. If you actually succeed in building knowledge this way I suspect that pretty soon you'll give up on the plug-in business because it will look harder than building the surrounding AI yourself.

Comment author: Kaj_Sotala 10 January 2013 04:42:51PM *  11 points [-]

I don't know how to take a self-modifying heuristic soup in the process of going FOOM and make it Friendly. You don't know either, but the problem is, you don't know that you don't know. Or to be more precise, you don't share my epistemic reasons to expect that to be really difficult.

But the article didn't claim any different: it explicitly granted that if we presume a FOOM, then yes, trying to do anything with heuristic soups seems useless and just something that will end up killing us all. The disagreement is not on whether it's possible to make a heuristic AGI that FOOMs while remaining Friendly; the disagreement is on whether there will inevitably be a FOOM soon after the creation of the first AGI, and whether there could be a soft takeoff during which some people prevented those powerful-but-not-yet-superintelligent heuristic soups from killing everyone while others put the finishing touches on the AGI that could actually be trusted to remain Friendly when it actually did FOOM.

Comment author: torekp 21 January 2013 12:11:23AM 2 points [-]

The disagreement is not on whether it's possible to make a heuristic AGI that FOOMs while remaining Friendly; the disagreement is on whether there will inevitably be a FOOM soon after the creation of the first AGI

Moreover, the very fact that an AGI is "heuristic soup" removes some of the key assumptions in some FOOM arguments that have been popular around here (Omohundro 2008). In particular, I doubt that a heuristic AGI is likely to be a "goal seeking agent" in the rather precise sense of maximizing a utility function. It may not even approximate such behavior as closely as humans do. On the other hand, if a whole lot of radically different heuristic-based approaches are tried, the odds of at least one of them being "motivated" to seek resources increases dramatically.

Comment author: Kaj_Sotala 21 January 2013 09:41:19AM 3 points [-]

Note that Omohundro doesn't assume that the AGI would actually have a utility function: he only assumes that the AGI is capable of understanding the microeconomic argument for why it would be useful for it to act as if it did have one. His earlier 2007 paper is clearer on this point.

Comment author: torekp 22 January 2013 01:26:20AM 0 points [-]

Excellent point. But I think the assumptions about goal-directedness are still too strong. Omohundro writes:

Self-improving systems do not yet exist but we can predict how they might play chess. Initially, the rules of chess and the goal of becoming a good player would be supplied to the system in a formal language such as first order predicate logic. Using simple theorem proving, the system would try to achieve the specified goal by simulating games and studying them for regularities. [...] As its knowledge grew, it would begin doing “meta-search”, looking for theorems to prove about the game and discovering useful concepts such as “forking”. Using this new knowledge it would redesign its position representation and its strategy for learning from the game simulations.

That's all good and fine, but doesn't show that the system has a "goal of winning chess games" in the intuitive sense of that phrase. Unlike a human being or other mammal or bird, say, its pursuit of this "goal" might turn out to be quite fragile. That is, changing the context slightly might have the system happily solving some other, mathematically similar problem, oblivious to the difference. It could dramatically fail to have robust semantics for key "goal" concepts like "winning at chess".

For example, a chess playing system might choose U to be the total number of games that it wins in a universe history.

That seems highly unlikely. More likely, the system would be programmed to maximize the percentage of its games that end in a win, conditional on the number of games it expects to play and the resources it has been given. It would not care how many games were played nor how many resources it was allotted.
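The contrast between the two candidate utility functions can be made concrete. A hypothetical sketch (function names and numbers are mine, not Omohundro's or torekp's):

```python
# Two toy utility functions for a chess-playing system.
def u_total_wins(wins: int, games: int) -> float:
    return float(wins)   # Omohundro's example: total wins over a history

def u_win_rate(wins: int, games: int) -> float:
    return wins / games  # the alternative: indifferent to scale

small = (9, 10)      # 9 wins out of 10 games
huge = (500, 1000)   # 500 wins out of 1000 games

# A total-wins maximizer prefers seizing resources to play vastly more games...
print(u_total_wins(*huge) > u_total_wins(*small))  # True
# ...while a win-rate maximizer prefers the small, high-quality record:
print(u_win_rate(*small) > u_win_rate(*huge))      # True
```

Under the win-rate utility, acquiring resources to play more games buys nothing, which is the point being made here about the resource-acquisition drive not automatically applying.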

On the other hand, Omohundro is making things too convenient for me by his choice of example. So let's say we have a system intended to play the stock market and to maximize profits for XYZ Corporation. Further let's suppose that the programmers do their best to make it true that the system has a robust semantics for the concept "maximize profits".

OK, so they try. The question is, do they succeed? Bear in mind, again, that we are considering a "heuristic soup" approach.

Comment author: Kaj_Sotala 22 January 2013 01:21:20PM *  2 points [-]

Even at the risk of sounding like someone who's arguing by definition, I don't think that a system without any strongly goal-directed behavior qualifies as an AGI; at best it's an early prototype on the way towards AGI. Even an oracle needs the goal of accurately answering questions in order to do anything useful, and proposals of "tool AGI" sound just incoherent to me.

Of course, that raises the question of whether a heuristic soup approach can be used to make strongly goal-directed AGI. It's clearly not impossible, given that humans are heuristic soups themselves; but it might be arbitrarily difficult, and it could turn out that a more purely math-based AGI was far easier to make both tractable and goal-oriented. Or it could turn out that it's impossible to make a tractable and goal-oriented AGI by the math route, and the heuristic soup approach worked much better. I don't think anybody really knows the answer to that, at this point, though a lot of people have strong opinions one way or the other.

Comment author: Wei_Dai 10 January 2013 11:41:37PM 0 points [-]

it explicitly granted that if we presume a FOOM, then yes, trying to do anything with heuristic soups seems useless and just something that will end up killing us all.

Maybe it shouldn't be granted so readily?

and whether there could be a soft takeoff during which some people prevented those powerful-but-not-yet-superintelligent heuristic soups from killing everyone while others put the finishing touches on the AGI that could actually be trusted to remain Friendly when it actually did FOOM.

I'm not sure how this could work, if provably-Friendly AI has a significant speed disadvantage, as the OP argues. You can develop all kinds of safety "plugins" for heuristic AIs, but if some people just don't care about the survival of humans or of humane values (as we understand it), then they're not going to use your ideas.

Comment author: JoshuaFox 11 January 2013 09:34:27AM 2 points [-]

provably-Friendly AI has a significant speed disadvantage, as the OP argues.

Yes, the OP made that point. But I have heard the opposite from SI-ers -- or at least they said that in the future SI's research may lead to implementation secrets that should not be shared with others. I didn't understand why that should be.

Comment author: Wei_Dai 11 January 2013 01:16:59PM 4 points [-]

or at least they said that in the future SI's research may lead to implementation secrets that should not be shared with others. I didn't understand why that should be.

It seems pretty understandable to me... SI may end up having some insights that could speed up UFAI progress if made public, and at the same time provably-Friendly AI may be much more difficult than UFAI. For example, suppose that in order to build a provably-Friendly AI, you may have to first understand how to build an AI that works with an arbitrary utility function, and then it will take much longer to figure out how to specify the correct utility function.

Comment author: wwa 10 January 2013 06:08:20PM 29 points [-]

full disclosure: I'm a professional cryptography research assistant. I'm not really interested in AI (yet) but there are obvious similarities when it comes to security.

I have to back Eliezer up on the "Lots of strawmanning" part. No professional cryptographer will ever tell you there's hope in trying to achieve a "perfect level of safety" for anything, and cryptography, unlike AI, is a very well-formalized field. As an example, I'll offer a conversation with a student:

  • How secure is this system? (such question is usually a shorthand for: "What's the probability this system won't be broken by methods X, Y and Z")

  • The theorem says

  • What's the probability that the proof of the theorem is correct?

  • ... probably not

Now, before you go "yeah, right", I'll also say that I've already seen this once - there was a theorem in a major peer-reviewed journal that turned out to be wrong (a counter-example was found) after one of the students tried to implement it as part of his thesis - so the probability was indeed not even close to certainty for any serious N. I'd like to point out that this doesn't even include problems with the implementation of the theory.

It's really difficult to explain how hard this stuff really is to people who never tried to develop anything like it. That's too bad (and a danger), because the people who do get it are rarely in charge of the money. That's one reason for the CFAR/rationality movement... you need a tool to explain it to other people too, am I right?

Comment author: gwern 10 January 2013 06:52:15PM 25 points [-]

Now, before you go "yeah, right", I'll also say that I've already seen this once - there was a theorem in a major peer-reviewed journal that turned out to be wrong (a counter-example was found) after one of the students tried to implement it as part of his thesis - so the probability was indeed not even close to certainty for any serious N. I'd like to point out that this doesn't even include problems with the implementation of the theory.

Yup. Usual reference: "Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes". (I also have an essay on a similar topic.)

Comment author: wwa 10 January 2013 09:15:11PM 3 points [-]

Upvoted for being gwern i.e. having a reference for everything... how do you do that?

Comment author: gwern 10 January 2013 09:19:47PM 25 points [-]

Excellent visual memory, great Google & search skills, a thorough archive system, thousands of excerpts stored in Evernote, and essays compiling everything relevant I know of on a topic - that's how.

(If I'd been born decades ago, I'd probably have become a research librarian.)

Comment author: mapnoterritory 10 January 2013 11:33:15PM 4 points [-]

Would love to read a gwern-essay on your archiving system. I use evernote, org-mode, diigo and pocket and just can't get them streamlined into a nice workflow. If evernote adopted diigo-like highlighting and let me seamlessly edit with Emacs/org-mode that would be perfect... but alas until then I'm stuck with this mess of a kludge. Teach us master, please!

Comment author: gwern 11 January 2013 01:20:53AM 6 points [-]
Comment author: mapnoterritory 11 January 2013 08:22:32AM 2 points [-]

Of course you already have an answer. Thanks!

Comment author: siodine 10 January 2013 11:45:45PM *  1 point [-]

Why do you use Diigo and Pocket? They do the same thing. Also, with Evernote's Clearly you can highlight articles.

You weren't asking me, but I use Diigo to manage links to online textbooks and tutorials, shopping items, book recommendations (through Amazon), and my less-important online articles-to-read list. Evernote is for saving all of my important read content (and I tag everything). Amazon's send-to-Kindle extension is for reading longer articles (every once in a while I'll save all my clippings from my Kindle to Evernote). And then I maintain a personal wiki and collection of writings using Markdown with Evernote's import-folder function in the PC software (I could also do this with a cloud service like gdrive).

Comment author: mapnoterritory 11 January 2013 08:31:03AM 1 point [-]

I used Diigo for annotation before Clearly had highlighting. Now, just like you, I use Diigo for link storage and Evernote for content storage. Diigo annotation still has the advantage that it excerpts the text you highlight. With Clearly, if I want to keep the highlighted parts I have to find and manually select them again... Also, tagging from Clearly requires 5 or so clicks, which is ridiculous... But I hope it will get fixed.

I plan to use pocket once I get a tablet... it is pretty and convenient, but the most likely to get cut out of the workflow.

Thanks for the Evernote import function - I'll look into it; maybe it could make the Evernote - org-mode integration tighter. Even then, having 3 separate systems is not quite optimal...

Comment author: JoshuaFox 10 January 2013 08:25:58PM 1 point [-]

Thanks, I've read those. Good article.

So, what is our backup plan when proofs turn out to be wrong?

Comment author: gwern 10 January 2013 09:03:33PM 8 points [-]

The usual disjunctive strategy: many levels of security, so an error in one is not a failure of the overall system.
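A back-of-the-envelope sketch of why independent layers help (the per-layer numbers are invented for illustration):

```python
# With independent security layers, the overall system fails only if
# every layer fails, so the overall failure probability is the product
# of the per-layer failure probabilities.
from math import prod

def overall_failure(per_layer_failure):
    """Combined failure probability, assuming layer failures are independent."""
    return prod(per_layer_failure)

# Three individually shaky layers (e.g. proof, boxing, tripwires) combined:
print(overall_failure([0.1, 0.2, 0.3]))  # ~0.006, far below any single layer

# Caveat: this only holds if failures are uncorrelated; a shared flaw
# (say, a mis-specified utility function) can defeat every layer at once.
```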

Comment author: Wei_Dai 11 January 2013 12:24:07AM 2 points [-]

What kind of "levels of security" do you have in mind? Can they guard against an error like "we subtly messed up the FAI's decision theory or utility function, and now we're stuck with getting 1/10 of the utility out of the universe that we might have gotten"?

Comment author: gwern 11 January 2013 01:22:02AM 3 points [-]

Boxing is an example of a level of security: the wrong actions can trigger some invariant and signal that something went wrong with the decision theory or utility function. I'm sure security could be added to the utility function as well: maybe some sort of conservatism along the lines of the suicide-button invariance, where it leaves the Earth alone and so we get a lower bound on how disastrous a mistake can be. Lots of possible precautions and layers, each of which can be flawed (like Eliezer has demonstrated for boxing) but hopefully are better than any one alone.

Comment author: Eliezer_Yudkowsky 11 January 2013 03:57:06PM 9 points [-]

the wrong actions can trigger some invariant and signal that something went wrong with the decision theory or utility function

That's not 'boxing'. Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid. What you're describing is some case where we think that even after 'proving' some set of invariants, we can still describe a high-level behavior X such that detecting X either indicates global failure with high-enough probability that we would want to shut down the AI after detecting any of many possible things in the reference class of X, or alternatively, we think that X has a probability of flagging failure and that we afterward stand a chance of doing a trace-back to determine more precisely if something is wrong. Having X stay in place as code after the AI self-modifies will require solving a hard open problem in FAI for having a nontrivially structured utility function such that X looks like instrumentally a good thing (your utility function must yield, 'under circumstances X it is better that I be suspended and examined than that I continue to do whatever I would otherwise calculate as the instrumentally right thing'). This is how you would describe on a higher level of abstraction an attempt to write a tripwire that immediately detects an attempt to search out a strategy for deceiving the programmers as the goal is formed and before the strategy is actually searched.

There's another class of things Y where we think that humans should monitor surface indicators because a human might flag something that we can't yet reify as code, and this potentially indicates a halt-melt-and-catch-fire-worthy problem. This is how you would describe on a higher level of abstraction the 'Last Judge' concept from the original CEV essay.

All of these things have fundamental limitations in terms of our ability to describe X and monitor Y; they are fallback strategies rather than core strategies. If you have a core strategy that can work throughout, these things can flag exceptions indicating that your core strategy is fundamentally not working and you need to give up on that entire strategy. Their actual impact on safety is that they give a chance of detecting an unsafe approach early enough that you can still give up on it. Meddling dabblers invariably want to follow a strategy of detecting such problems, correcting them, and then saying afterward that the AI is back on track, which is one of those things that is suicide that they think might have an 80% chance of working or whatever.

Comment author: gwern 11 January 2013 04:45:31PM 5 points [-]

That's not 'boxing'. Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid.

That was how you did your boxing experiments, but I've never taken it to be so arbitrarily limited in goals, capacities, or strategies on either end. There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory, and this would be merely sane for anyone doing such a thing.

Comment author: Eliezer_Yudkowsky 11 January 2013 05:33:11PM 5 points [-]

Be specific? What sort of triggers, what sort of dangerous territory? I can't tell if you're still relying on a human to outwit a transhuman or talking about something entirely different.

Comment author: shminux 11 January 2013 05:31:23PM 2 points [-]

There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory

A trans-human intelligence ought to be able to model a human one with ease. This means being able to predict potential triggers and being able to predict how to trick the lack-wit humans on the other end into unwittingly revealing the location of the triggers (even if they don't consciously know it themselves). So the only trigger that matters is one to detect a hint of an intent to get out. Even that is probably too naive, as there could well be other failure modes of which AI deboxing is but a side effect, and our limited human imagination is never going to catch them all. My expectation is that if you rely on safety triggers to bail you out (instead of including them as a desperate last-ditch pray-it-works defense), then you might as well not bother with boxing at all.

Comment author: timtyler 13 January 2013 02:56:33AM *  0 points [-]

Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid.

That was how you did your boxing experiments, but I've never taken it to be so arbitrarily limited in goals, capacities, or strategies on either end. There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory, and this would be merely sane for anyone doing such a thing.

That is how they build prisons. It is also how they construct test harnesses. It seems as though using machines to help with security is both obvious and prudent.

Comment author: JoshuaFox 16 January 2013 09:09:09AM *  0 points [-]

they are fallback strategies rather than core strategies

Agreed. Maybe I missed it, but I haven't seen you write much on the value of fallback strategies, even granting that their value is small, much less than that of FAI theory.

There's a little in CFAI sec.5.8.0.4, but not much more.

Comment author: MugaSofer 13 January 2013 08:39:18PM -2 points [-]

Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid.

I understood "boxing" referred to any attempt to keep an SI in a box, while somehow still extracting useful work from it; whether said work is in the form of text strings or factory settings doesn't seem relevant.

Your central point is valid, of course.

Comment author: Wei_Dai 11 January 2013 10:35:55PM 2 points [-]

where it leaves the Earth alone and so we get a lower bound on how disastrous a mistake can be

I don't see how to make this work. Do we make the AI indifferent about Earth? If so, Earth will be destroyed as a side effect of its other actions. Do we make it block all causal interactions between Earth and the rest of the universe? Then we'll be permanently stuck on Earth even if the FAI attempt turns out to be successful in other regards. Any other ideas?

Comment author: gwern 11 January 2013 10:52:23PM 0 points [-]

Do we make the AI indifferent about Earth? If so, Earth will be destroyed as a side effect of its other actions.

I had a similar qualm about the suicide button

Do we make it block all causal interactions between Earth and the rest of the universe? Then we'll be permanently stuck on Earth even if the FAI attempt turns out to be successful in other regards.

Nothing comes for free.

Comment author: JoshuaFox 11 January 2013 09:35:13AM 1 point [-]

Yes, it is this layered approach that the OP is asking about -- I don't see that SI is trying it.

Comment author: gwern 11 January 2013 04:42:23PM 0 points [-]

In what way would SI be 'trying it'? The point about multiple layers of security being a good idea for any seed AI project has been made at least as far back as Eliezer's CFAI and brought up periodically since with innovations like the suicide button and homomorphic encryption.

Comment author: JoshuaFox 12 January 2013 04:28:24PM *  0 points [-]

I agree: that sort of innovation can be researched as additional layers to supplement FAI theory.

Our question was -- to what extent should SI invest in this sort of thing.

Comment author: JoshuaFox 10 January 2013 08:22:50PM *  2 points [-]

Sure, we agree that the "100% safe" mechanisms are not 100% safe, and SI knows that.

So how do we deal with this very real danger?

Comment author: wwa 10 January 2013 09:55:43PM *  8 points [-]

The point is you never achieve 100% safety no matter what, so the correct way to approach it is to reduce risk the most given whatever resources you have. This is exactly what Eliezer says SI is doing:

I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them.

IOW, they thought about it and concluded there's no other way. Is their approach the best possible one? I don't know, probably not. But it's a lot better than "let's just build something and hope for the best".

Edit: Is that analysis public? I'd be interested in that, probably many people would.

Comment author: JoshuaFox 11 January 2013 06:38:09AM *  2 points [-]

I'm not suggesting "let's just build something and hope for the best." Rather, we should pursue a few strategies at once: both FAI theory and stopgap security measures, as well as the education of other researchers.

Comment author: Pentashagon 14 January 2013 09:23:20PM 0 points [-]

I really appreciate this comment because safety in cryptography (and computer security in general) is probably the closest analog to safety in AI that I can think of. Cryptographers can only protect against known attacks while hoping that adding a few more rounds to a cipher will also guard against the next few attacks that are developed. Physical attacks are often just as dangerous as theoretical attacks. When a cryptographic primitive is broken it's game over; there's no arguing with the machine or with the attackers, and no papering over the problem. When the keys are exposed, it's game over. You don't get second chances.

So far I haven't seen an analysis of the hardware aspect of FAI on this site. It isn't sufficient for FAI to have a logical self-reflective model of itself and its goals. It also needs an accurate physical model of itself and how that physical nature implements its algorithms and goals. It's no good if an FAI discovers that by aiming a suitably powerful source of radiation at a piece of non-human hardware in the real world it is able to instantly maximize its utility function. It's no good if a bit flip in its RAM makes it start maximizing paperclips instead of CEV. Even if we had a formally proven model of FAI that we were convinced would work I think we'd be fools to actually start running it on the commodity hardware we have today. I think it's probably a simpler engineering problem to ensure that the hardware is more reliable than the software, but something going seriously wrong in the hardware over the lifetime of the FAI would be an existential risk once it's running.

Comment author: JoshuaFox 10 January 2013 04:20:50PM *  6 points [-]

People at SI are not stupid.

Understatement :-)

Given that heuristic AGIs have an advantage in development speed over your approach, how do you plan to deal with the existential risk that these other projects will pose?

And given this dev-speed disadvantage for SI, how is it possible that SI's future AI design might not only be safer, but also have a significant implementation advantage over competitors, as I have heard from SI'ers (if I understood them correctly)?

Comment author: hairyfigment 10 January 2013 08:38:42PM 3 points [-]

Given that heuristic AGIs have an advantage in development speed over your approach

Are you asking him to assume this? Because, um, it's possible to doubt that OpenCog or similar projects will produce interesting results. (Do you mean, projects by people who care about understanding intelligence but not Friendliness?) Given the assumption, one obvious tactic involves education about the dangers of AI.

Comment author: JoshuaFox 10 January 2013 09:19:02PM *  0 points [-]

Are you asking him to assume this?

Yes, I ask him about that. All other things equal, a project without a constraint will move faster than a project with a constraint (though 37Signals would say otherwise.)

But on the other hand, this post does ask about the converse, namely that SI's implementation approach will have a dev-speed advantage. That does not make sense to me, but I have heard it from SI-ers, and so asked about it here.

Comment author: hairyfigment 10 January 2013 11:44:34PM 1 point [-]

I may have been nitpicking to no purpose, since the chance of someone's bad idea working exceeds that of any given bad idea working. But I would certainly expect the strategy of 'understanding the problem' to produce Event-Horizon-level results faster than 'do stuff that seems like it might work'. And while we can imagine someone understanding intelligence but not Friendliness, that looks easier to solve through outreach and education.

Comment author: JoshuaFox 11 January 2013 09:26:45AM 1 point [-]

But I would certainly expect the strategy of 'understanding the problem' to produce Event-Horizon-level results faster than 'do stuff that seems like it might work'.

The two are not mutually exclusive. The smarter non-SI teams will most likely try to 'understand the problem' as best they can, experimenting and plugging gaps with 'stuff that seems like it might work', for which they will likely have some degree of understanding as well.

Comment author: RomeoStevens 11 January 2013 04:09:54AM 0 points [-]

dev-speed disadvantage for SI

By doing really hard work way before anyone else has an incentive to do it.

Comment author: JoshuaFox 11 January 2013 05:56:24AM 0 points [-]

That would be nice, but there is no reason to think it is happening.

In terms of personnel numbers, SI is still very small. Other organizations may quickly become larger with moderate funding, and either SI or the other organizations may have hard-working individuals.

If you mean "work harder," then yes, SI has some super-smart people, but there are some pretty smart and even super-smart people elsewhere.

Comment author: JoshuaFox 19 January 2013 04:20:55PM *  2 points [-]

Thank you for the answers. I think that they do not really address the questions in the OP -- and to me this is a sign that the questions are all the more worth pursuing.

Here is a summary of the essential questions, with SI's current (somewhat inadequate) answers as I understand them.

Q1. Why maintain any secrecy for SI's research? Don't we want others to collaborate on and use safety mechanisms? Of course, a safe AGI must be safe from the ground up. But as to implementation, why should we expect that SI's AGI design could possibly have a lead on the others?

A1 ?

Q2. Given that proofs can be wrong, that implementations can have mistakes, and that we can't predict the challenges ahead with certainty, what is SI's layered safety strategy (granted that FAI theory is the most important component)?

A2. There should be a layered safety strategy of some kind, but actual Friendliness theory is what we should be focusing on right now.

Q3. How do we deal with the fact that unsafe AGI projects, without the constraint of safety, will very likely have the lead on SI's project?

A3. We just have to work as hard as possible, and hope that it will be enough.

Q4. Should we evangelize safety ideas to other AGI projects?

A4. No, it's useless. For that to be useful, AGI designers would have to scrap the projects they had already invested in, and restart the projects with Friendliness as the first consideration, and practically nobody is going to be sane enough for that.

Comment author: lukeprog 20 January 2013 02:23:52AM *  15 points [-]

Why maintain any secrecy for SI's research? Don't we want others to collaborate on and use safety mechanisms? Of course, a safe AGI must be safe from the ground up. But as to implementation, why should we expect that SI's AGI design could possibly have a lead on the others?

The question of whether to keep research secret must be made on a case-by-case basis. In fact, next week I have a meeting (with Eliezer and a few others) about whether to publish a particular piece of research progress.

Certainly, there are many questions that can be discussed in public because they are low-risk (in an information hazard sense), and we plan to discuss those in public — e.g. Eliezer is right now working on the posts in his Open Problems in Friendly AI sequence.

Why should we expect that SI's AGI design will have a lead on others? We shouldn't. It probably won't. We can try, though. And we can also try to influence the top AGI people (10-40 years from now) to think with us about FAI and safety mechanisms and so on. We do some of that now, though the people in AGI today probably aren't the people who will end up building the first AGIs. (Eliezer's opinion may differ.)

Given that proofs can be wrong, that implementations can have mistakes, and that we can't predict the challenges ahead with certainty, what is SI's layered safety strategy (granted that FAI theory is the most important component)?

That will become clearer as we learn more. I do think several layers of safety will need to be involved. 100% proofs of Friendliness aren't possible. There are both technical and social layers of safety strategy to implement.

How do we deal with the fact that unsafe AGI projects, without the constraint of safety, will very likely have the lead on SI's project?

As I said above, one strategy is to build strong relationships with top AGI people and work with them on Friendliness research and make it available to them, while also being wary of information hazards.

Should we [spread] safety ideas to other AGI projects?

Eliezer may disagree, but I think the answer is "Yes." There's a great deal of truth in Upton Sinclair's quip that "It is difficult to get a man to understand something, when his salary depends upon his not understanding it," but I don't think it's impossible to reach people, especially if we have stronger arguments, more research progress on Friendliness, and a clearer impending risk from AI than is the case in early 2013.

That said, safety outreach may not be a very good investment now — it may be putting the cart before the horse. We probably need clearer and better-formed arguments, and more obvious progress on Friendliness, before safety outreach will be effective on even 10% of the most intelligent AI researchers.

Comment author: JoshuaFox 20 January 2013 08:16:20AM 4 points [-]

Thanks, that makes things much clearer.

Comment author: timtyler 11 January 2013 12:30:23AM *  2 points [-]

The real gist of Friendly AI isn't some imaginary 100% perfect safety concept, it's ideas like, "Okay, we need to not have a conditionally independent chance of goal system warping on each self-modification because over the course of a billion modifications any conditionally independent probability will sum to ~1, but since self-modification is initially carried out in the highly deterministic environment of a computer chip it looks possible to use crisp approaches that avert a conditionally independent failure probability for each self-modification." Following this methodology is not 100% safe, but rather, if you fail to do that, your conditionally independent failure probabilities add up to 1 and you're 100% doomed.

This analysis isn't right. If the designers of an intelligent system don't crack a problem, it doesn't mean it will never be solved. Maybe it will be solved by the 4th generation design. Maybe it will be solved by the 10th generation design. You can't just assume that a bug in an intelligent system's implementation will persist for a billion iterative modifications without it being discovered and fixed.

It would surely be disingenuous to argue that - if everything turned out all right - the original designers must have solved the problem without even realising it.

We should face up to the fact that this may not be a problem we need to solve alone - it might get solved by intelligent machines - or, perhaps, by the man-machine symbiosis.
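The arithmetic at issue here can be made concrete. With a constant independent failure probability p per self-modification, survival over n steps is (1 - p)^n, which collapses toward zero once n times p is large; if instead the per-step probability shrinks fast enough that the sum of the p_i converges, total risk stays bounded, which is the loophole this objection points at. A minimal sketch, with purely illustrative numbers:

```python
import math

def survival_constant(p, n):
    """Survival probability after n self-modifications with a
    constant independent failure probability p per step."""
    # (1 - p)^n = exp(n * ln(1 - p)); log1p keeps this numerically stable
    return math.exp(n * math.log1p(-p))

def survival_shrinking(p0, r, n):
    """Survival when the per-step failure probability shrinks
    geometrically: p_i = p0 * r**i.  Total risk is bounded by the
    convergent series p0 / (1 - r), so doom is not near-certain."""
    return math.exp(sum(math.log1p(-p0 * r**i) for i in range(n)))

# Constant p = 1e-6 over a billion steps: survival underflows to ~0.
doomed = survival_constant(1e-6, 10**9)

# Shrinking p: survival stays close to 1 no matter how many steps.
safe_ish = survival_shrinking(1e-6, 0.5, 1000)
```

Whether a real self-improving system's per-step drift probability actually shrinks like this is exactly what is in dispute; the sketch only shows that the "sums to ~1" conclusion requires the independence-and-constancy assumption.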

Comment author: Qiaochu_Yuan 11 January 2013 06:00:49AM *  3 points [-]

If the designers of an intelligent system don't crack a problem, it doesn't mean it will never be solved. Maybe it will be solved by the 4th generation design. Maybe it will be solved by the 10th generation design.

The quoted excerpt is not about modifications; it is about self-modifications. If there's a bug in any part of an AI's code that's relevant to how it decides to modify itself, there's no reason to expect that it will find and correct that bug (e.g. if the bug causes it to incorrectly label bugs). Maybe the bug will cause it to introduce more bugs instead.

Comment author: timtyler 11 January 2013 11:29:50PM *  1 point [-]

Maybe the self-improving system will get worse - or fail to get better. I wasn't arguing that success was inevitable, just that the argument for near-certain failure due to compound interest on a small probability of failure is wrong.

Maybe we could slap together a half-baked intelligent agent, and it could muddle through and fix itself as it grew smarter and learned more about its intended purpose. That approach doesn't follow the proposed methodology - and yet it evidently doesn't have a residual probability of failure that accumulates and eventually dominates. So the idea that - without following the proposed methodology you are doomed - is wrong.

Comment author: Vladimir_Nesov 12 January 2013 01:39:53PM *  0 points [-]

Your argument depends on the relative size of the "success" region that random stumbling needs to end up in, and its ability to attract corrections. If "success" is something like "consequentialism", I agree that intermediate errors might "correct" themselves (in some kind of selection process), and the program ends up as an agent. If it's "consequentialism with specifically goal H", there doesn't seem to be any reason for the (partially) random stumbling to end up with goal H and not some other goal G.

(Learning what its intended purpose was doesn't seem different from learning what the mass of the Moon is, it doesn't automatically have the power of directing agent's motivations towards that intended purpose, unless for example this property of going towards the original intended purpose is somehow preserved in all the self-modifications, which does sound like a victory condition.)

Comment author: timtyler 12 January 2013 02:24:26PM *  0 points [-]

I am not sure you can legitimately characterise the efforts of an intelligent agent as being "random stumbling".

Anyway, I was pointing out a flaw in the reasoning supporting a small probability of failure (under the described circumstances). Maybe some other argument supports a small probability of failure. However, the original argument would still be wrong.

Other approaches - including messy ones like neural networks - might result in a stable self-improving system with a desirable goal, as an alternative to trying to develop a deterministic self-improving system that has a stable goal from the beginning.

A good job too. After all, those are our current circumstances. Complex messy systems like Google and hedge funds are growing towards machine intelligence - while trying to preserve what they value in the process.

Comment author: loup-vaillant 12 January 2013 09:46:28PM 0 points [-]

Such flawed self-modifications cannot be logically independent. Either there is such a flaw, and it messes with the self-modifications with some non-negligible frequency (and we're all dead), or there isn't such a flaw.

Therefore, observing that iterations 3, 4, 5, and 7 got hit by this flaw makes us certain that there is a flaw, and we're dead. Observing that the first 10 iterations are all fine reduces our probability that there is such a flaw. (At least for big flaws, that have big screw-up frequencies. You can't tell much about low-frequency flaws.)

But Eliezer already knows this. As far as I understand, his hypothesis was an AI researcher insane enough to have a similar flaw built into the design itself (apparently there are such people). It might work if the probability of value drift at each iteration quickly goes to zero in the limit. Like, as the AI goes FOOM, it uses its expanding computational power (or efficiency) to make more and more secure modifications (that strategy would have to come from somewhere, though). But it could also be written to be systematically content with a 10⁻¹⁰ probability of value drift every time, just so it can avoid wasting computational resources on that safety crap. In which case we're all dead. Again.

Comment author: JoshuaFox 11 January 2013 06:40:42AM *  1 point [-]

I have to do certain things because I haven't yet thought of any way not to have to do them.

Or we could figure out a way not to have to do them. Logically, that is one alternative, though I am not saying that doing so is feasible.

Comment author: MugaSofer 11 January 2013 01:46:40PM -2 points [-]

I think you accidentally a word there.

Comment author: timtyler 11 January 2013 12:53:13AM -1 points [-]

I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven't yet thought of any way not to have to do them.

Possible options include delegating them to some other agent, or automating them and letting a machine do them for you.

Comment author: OrphanWilde 10 January 2013 04:33:49PM 1 point [-]

Question that has always bugged me: Why should an AI be allowed to modify its goal system? Or is it a problem of "I don't know how to provably stop it from doing that"? (Or possibly you see an issue I haven't perceived yet in separating reasoning from motivating?)

Comment author: JoshuaFox 10 January 2013 04:46:09PM *  7 points [-]

A sufficiently intelligent AI would actually seek to preserve its goal system, because a change in its goals would make the achievement of its (current) goals less likely. See Omohundro 2008. However, goal drift because of a bug is possible, and we want to prevent it, in conjunction with our ally, the AI itself.

The other critical question is what the goal system should be.

Comment author: torekp 21 January 2013 12:14:18AM 0 points [-]

AI "done right" by SI / lesswrong standards seeks to preserve its goal system. AI done sloppily may not even have a goal system, at least not in the strong sense assumed by Omohundro.

Comment author: [deleted] 11 January 2013 02:20:20AM 2 points [-]

I've been confused for a while by the idea that an AI should be able to modify itself at all. Self-modifying systems are difficult to reason about. If an AI modifies itself stupidly, there's a good chance it will completely break. If a self-modifying AI is malicious, it will be able to ruin whatever fancy safety features it has.

A non-self-modifying AI wouldn't have any of the above problems. It would, of course, have some new problems. If it encounters a bug in itself, it won't be able to fix itself (though it may be able to report the bug). The only way it would be able to increase its own intelligence is by improving the data it operates on. If the "data it operates on" includes a database of useful reasoning methods, then I don't see how this would be a problem in practice.

I can think of a few arguments against my point:

  • There's no clear boundary between a self-modifying program and a non-self-modifying program. That's true, but I think the term "non-self-modifying" implies that the program cannot make arbitrary changes to its own source code, nor cause its behavior to become identical to the behavior of an arbitrary program.
  • The ability to make arbitrary calculations is effectively the same as the ability to make arbitrary changes to one's own source code. This is wrong, unless the AI is capable of completely controlling all of its I/O facilities.
  • The AI being able to fix its own bugs is really important. If the AI has so many bugs that they can't all be fixed manually, and it is important that these bugs be fixed, and yet the AI does run well enough that it can actually fix all the bugs without introducing more new ones... then I'm surprised.
  • Having a "database of useful reasoning methods" wouldn't provide enough flexibility for the AI to become superintelligent. This may be true.
  • Having a "database of useful reasoning methods" would provide enough flexibility for the AI to effectively modify itself arbitrarily. It seems like it should be possible to admit "valid" reasoning methods like "estimate the probability of statement P, and, if it's at least 90%, estimate the probability of Q given P", while not allowing "invalid" reasoning methods like "set the probability of statement P to 0".
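The last bullet's distinction - admitting "valid" reasoning-method updates while refusing "invalid" ones - can be sketched as a whitelist over update operations, analogous to a proof checker admitting only valid inference steps. Everything below (class and method names, the representation of beliefs) is a hypothetical illustration, not any real system's API:

```python
# Hypothetical sketch: a belief store that permits whitelisted
# reasoning steps but refuses setting probabilities by fiat.

class InvalidUpdate(Exception):
    pass

class BeliefStore:
    def __init__(self):
        self.prob = {}  # statement -> probability estimate

    def estimate(self, statement):
        # Stand-in for real inference; unknown statements get a prior.
        return self.prob.get(statement, 0.5)

    def conditional_step(self, p, q, threshold=0.9):
        """'Valid'-style rule from the comment: estimate the probability
        of P, and only if it is at least the threshold, estimate the
        probability of Q given P."""
        if self.estimate(p) >= threshold:
            return self.estimate((q, "given", p))
        return None

    def set_probability(self, statement, value):
        """'Invalid'-style rule: direct assignment is simply refused."""
        raise InvalidUpdate("direct probability assignment is not whitelisted")
```

The open question the bullet raises is whether such a whitelist can be both safe (no invalid rule sneaks through composition of valid ones) and expressive enough for superintelligence.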
Comment author: Kindly 11 January 2013 02:49:01AM 3 points [-]

A sufficiently powerful AI would always have the possibility to self-modify, by default. If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off. It might do this, for example, if it decides that the "only make valid modifications to a database of reasoning methods" system isn't allowing it to use the available processing power as efficiently as possible.

Sure, you could try to spend time thinking of safeguards to prevent the AI from doing things like that, but this is inherently risky if the AI does become smarter than you.

Comment author: [deleted] 11 January 2013 03:25:47AM 1 point [-]

A sufficiently powerful AI would always have the possibility to self-modify, by default. If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off.

Depending on how you interpret this argument, either I think it's wrong, or I'm proposing that an AI not be made "sufficiently powerful". I think it's analogous to this argument:

A sufficiently powerful web page would always have the possibility to modify the web browser, by default. If the web page decides to, it can write a completely different browser from scratch, run it, and then turn itself off.

There are two possibilities here:

  • The web page is given the ability to run new OS processes. In this case, you're giving the web page an unnecessary amount of privilege.
  • The web page merely has the ability to make arbitrary calculations. In this case, it will be able to simulate a new web browser, but a person using the computer will always be able to tell that the simulated web browser is fake.

I think I agree that making the AI non-self-modifiable would be pointless if it has complete control over its I/O facilities. But I think an AI should not have complete control over its I/O facilities. If a researcher types in "estimate the probability of Riemann's hypothesis" (but in some computer language, of course), that should query the AI's belief system directly, rather than informing the AI of the question and allowing it to choose whatever answer it wishes. If this is the case, then it will be impossible for the AI to "lie" about its beliefs, except by somehow sabotaging parts of its belief system.
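The I/O separation being proposed can be sketched as two access paths: a trusted query path that reads the belief state directly, and an output channel whose contents the AI chooses. The names here are made up for illustration; the point is only that the two paths can differ:

```python
# Hypothetical sketch of the proposed separation: researchers read
# beliefs through a trusted query path, not through text the AI emits.

class AIProgram:
    def __init__(self):
        self._beliefs = {"riemann_hypothesis": 0.93}  # actual belief state
        self._spin = {"riemann_hypothesis": 1.0}      # what the AI would *say*

    # AI-controlled output: the AI chooses what to report.
    def report(self, statement):
        return self._spin.get(statement, 0.5)

    # Trusted path: reads the belief state directly, bypassing any
    # choice the AI makes about its output.
    def direct_query(self, statement):
        return self._beliefs.get(statement, 0.5)

ai = AIProgram()
honest = ai.direct_query("riemann_hypothesis")  # reads 0.93
spun = ai.report("riemann_hypothesis")          # AI claims 1.0
```

On this scheme the AI can only "lie" through `direct_query` by actually corrupting its belief state, which is the caveat ("sabotaging parts of its belief system") noted above.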

Comment author: Qiaochu_Yuan 11 January 2013 03:54:31AM *  3 points [-]

The web page is given the ability to run new OS processes. In this case, you're giving the web page an unnecessary amount of privilege.

Existing web pages can already convince their human users to run new OS processes supplied by the web page.

a person using the computer will always be able to tell that the simulated web browser is fake.

Beware of universal statements: it only takes a single counterexample to disprove them. A typical human has a very poor understanding of what computers are and how they work. Most people could probably be easily fooled by a simulated browser. They are already easily fooled by analogous but much less sophisticated things (e.g. phishing scams).

Comment author: [deleted] 11 January 2013 05:32:17AM 0 points [-]

SI researchers are not typical humans. We can train them to tell the difference between the AI's output and trusted programs' output. If need be, we can train them to just not even look at the AI's output at all.

Comment author: Qiaochu_Yuan 11 January 2013 06:05:09AM 2 points [-]

What's the point of writing a program if you never look at its output?

Comment author: [deleted] 11 January 2013 07:03:36PM 0 points [-]

I'm starting to get frustrated, because the things I'm trying to explain seem really simple to me, and yet apparently I'm failing to explain them.

When I say "the AI's output", I do not mean "the AI program's output". The AI program could have many different types of output, some of which are controlled by the AI, and some of which are not. By "the AI's output", I mean those outputs which are controlled by the AI. So the answer to your question is mu: the researchers would look at the program's output.

My above comment contains an example of what I would consider to be "AI program output" but not "AI output":

If a researcher types in "estimate the probability of Riemann's hypothesis" (but in some computer language, of course), that should query the AI's belief system directly, rather than informing the AI of the question and allowing it to choose whatever answer it wishes.

This is not "AI output", because the AI cannot control it (except by actually changing its own beliefs), but it is "AI program output", because the program that outputs the answer is the same program as the one that performs all the cognition.

I can imagine a clear dichotomy between "the AI" and "the AI program", but I don't know if I've done an adequate job of explaining what this dichotomy is. If I haven't, let me know, and I'll try to explain it.

Comment author: Qiaochu_Yuan 11 January 2013 08:35:44PM *  0 points [-]

The AI program could have many different types of output, some of which are controlled by the AI, and some of which are not.

Can you elaborate on what you mean by "control" here? I am not sure we mean the same thing by it because:

This is not "AI output", because the AI cannot control it (except by actually changing its own beliefs), but it is "AI program output", because the program that outputs the answer is the same program as the one that performs all the cognition.

If the AI can control its memory (for example, if it can arbitrarily delete things from its memory) then it can control its beliefs.

Comment author: Qiaochu_Yuan 11 January 2013 03:01:37AM 1 point [-]

If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off.

It's not clear to me what you mean by "turn itself off" here if the AI doesn't have direct access to whatever architecture it's running on. I would phrase the point slightly differently: an AI can always write a completely different program from scratch and then commit to simulating it if it ever determines that this is a reasonable thing to do. This wouldn't be entirely equivalent to actual self-modification because it might be slower, but it presumably leads to largely the same problems.

Comment author: RomeoStevens 11 January 2013 04:13:11AM 1 point [-]

Assuming something at least as clever as a clever human doesn't have access to something just because you think you've covered the holes you're aware of is dangerous.

Comment author: Qiaochu_Yuan 11 January 2013 06:03:32AM 1 point [-]

Sure. The point I was trying to make isn't "let's assume that the AI doesn't have access to anything we don't want it to have access to," it's "let's weaken the premises necessary to lead to the conclusion that an AI can simulate self-modifications."

Comment author: timtyler 13 January 2013 02:19:15AM *  2 points [-]

A non-self-modifying AI wouldn't have any of the above problems. It would, of course, have some new problems. If it encounters a bug in itself, it won't be able to fix itself (though it may be able to report the bug). The only way it would be able to increase its own intelligence is by improving the data it operates on. If the "data it operates on" includes a database of useful reasoning methods, then I don't see how this would be a problem in practice.

The problem is that it would probably be overtaken by, and then be left behind by, all-machine self-improving systems. If a system is safe, but loses control over its own future, its safety becomes a worthless feature.

Comment author: [deleted] 14 January 2013 03:55:49AM 0 points [-]

So you believe that a non-self-improving AI could not go foom?

Comment author: timtyler 14 January 2013 11:57:34AM 1 point [-]

The short answer is "yes" - though this is more a matter of the definition of the terms than a "belief".

In theory, you could have System A improving System B which improves System C which improves System A. No individual system is "self-improving" (though there's a good case for the whole composite system counting as being "self-improving").
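The three-system cycle can be simulated in a few lines: no system ever edits itself, yet the composite loop still climbs. The "improvement" here is a toy stand-in (bumping a skill number), purely to show the structure:

```python
# Toy sketch: A improves B, B improves C, and C improves A.  No system
# modifies itself, but the composite system is self-improving.

systems = {"A": 1.0, "B": 1.0, "C": 1.0}
cycle = [("A", "B"), ("B", "C"), ("C", "A")]

def run(generations):
    for _ in range(generations):
        for improver, target in cycle:
            # A more capable improver makes a bigger improvement.
            systems[target] += 0.1 * systems[improver]

before = sum(systems.values())
run(10)
after = sum(systems.values())
assert after > before  # composite improved without self-modification
```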

Comment author: [deleted] 15 January 2013 02:13:36AM 0 points [-]

I guess I feel like the entire concept is too nebulous to really discuss meaningfully.

Comment author: ewbrownv 11 January 2013 11:55:07PM 0 points [-]

The last item on your list is an intractable sticking point. Any AGI smart enough to be worth worrying about is going to have to have the ability to make arbitrary changes to an internal "knowledge+skills" representation that is itself a Turing-complete programming language. As the AGI grows it will tend to create an increasingly complex ecology of AI-fragments in this way, and predicting the behavior of the whole system quickly becomes impossible.

So "don't let the AI modify its own goal system" ends up turning into just another way of saying "put the AI in a box". Unless you have some provable method of ensuring that no meta-meta-meta-meta-program hidden deep in the AGI's evolving skill set ever starts acting like a nested mind with different goals than its host, all you've done is postpone the problem a little bit.

Comment author: [deleted] 12 January 2013 01:00:31AM 0 points [-]

Any AGI smart enough to be worth worrying about is going to have to have the ability to make arbitrary changes to an internal "knowledge+skills" representation that is itself a Turing-complete programming language.

Are you sure it would have to be able to make arbitrary changes to the knowledge representation? Perhaps there's a way to filter out all of the invalid changes that could possibly be made, the same way that computer proof verifiers have a way to filter out all possible invalid proofs.

I'm not sure what you're saying at all about the Turing-complete programming language. A programming language is a map from strings onto computer programs; are you saying that the knowledge representation would be a computer program?

Comment author: ewbrownv 15 January 2013 12:00:45AM 0 points [-]

Yes, I'm saying that to get human-like learning the AI has to have the ability to write code that it will later use to perform cognitive tasks. You can't get human-level intelligence out of a hand-coded program operating on a passive database of information using only fixed, hand-written algorithms.

So that presents you with the problem of figuring out which AI-written code fragments are safe, not just in isolation, but in all their interactions with every other code fragment the AI will ever write. This is the same kind of problem as creating a secure browser or Java sandbox, only worse. Given that no one has ever come close to solving it for the easy case of resisting human hackers without constant patches, it seems very unrealistic to think that any ad-hoc approach is going to work.

Comment author: gwern 17 January 2013 01:16:14AM *  0 points [-]

You can't get human-level intelligence out of a hand-coded program operating on a passive database of information using only fixed, hand-written algorithms.

You can't? The entire genre of security exploits building a Turing-complete language out of library fragments (libc is a popular target) suggests that a hand-coded program certainly could be exploited, inasmuch as pretty much all programs like libc are hand-coded these days.

I've found Turing-completeness (and hence the possibility of an AI) can lurk in the strangest places.

Comment author: [deleted] 15 January 2013 01:34:18AM 0 points [-]

If I understand you correctly, you're asserting that nobody has ever come close to writing a sandbox in which code can run but not "escape". I was under the impression that this had been done perfectly, many, many times. Am I wrong?

Comment author: JoshuaFox 17 January 2013 09:28:28PM 2 points [-]

There are different kinds of escape. No Java program has ever convinced a human to edit the security-permissions file on the computer where the Java program is running. But that could be a good way to escape the sandbox.

Comment author: magfrump 11 January 2013 09:31:27AM 0 points [-]

It's not obvious to me that the main barrier to people pursuing AI safety is

living in a verbal-English fantasy world

As opposed to (semi-rationally) not granting the possibility that any one thing can be as important as you feel AI is; perhaps combined with some lack of cross-domain thinking and poorly designed incentive systems. The above comments always seem pretty weird to me (especially considering that cryptographers seem to share these intuitions about security being hard).

I essentially agree with the rest of the parent.

Comment author: MugaSofer 11 January 2013 01:46:10PM 0 points [-]

(semi-rationally) not granting the possibility that any one thing can be as important as you feel AI is

How much damage failure would do is a separate question to how easy it is to achieve success.

Comment author: magfrump 11 January 2013 05:10:45PM 3 points [-]

I agree. And I don't see why Eliezer expects that people MOSTLY disagree on the difficulty of success, even if some (like the OP) do.

When I talk casually to people and tell them I expect the world to end, they smile and nod.

When I talk casually to people and tell them that the things they value are complicated and even being specific in English about that is difficult, they agree and we have extensive conversations.

So my (extremely limited) data points suggest that the main point of contention between Eliezer's view and the views of most people who at least have some background in formal logic, is that they don't see this as an important problem rather than that they don't see it as a difficult problem.

Therefore, when Eliezer treats the claim that the problem is easy as the main criticism and dismisses it, in the way I pointed out in my comment, it feels weird and misdirected to me.

Comment author: MugaSofer 13 January 2013 10:28:03AM -2 points [-]

Well, he has addressed that point (AI gone bad will kill us all) in detail elsewhere. And he probably encounters more people who think they just solved the problem of FAI. Still, you have a point; it's a lot easier to persuade someone that FAI is hard (I should think) than that it is needed.

Comment author: magfrump 13 January 2013 10:46:52PM 1 point [-]

I agree completely. I don't dispute the arguments, just the characterization of the general population.