Comment author: 29 November 2013 06:58:15PM *  5 points

Thank you for the specific examples!

I like: jumping on the bandwagon, manipulation, self awareness. On the other hand, I disagree with the reductio ad absurdum. Reductio ad absurdum works like this: "if X, then Y, but Y is obviously silly, therefore not X." The Y is somehow derived from X. A fully general counterargument is something that does not actually depend on X (this is what makes it fully general); it is a chain of words into which you can substitute any value of X and get the result "therefore not X". Being a fully general counterargument is a semantic property of some arguments, even though most of them are probably reductio ad absurdum in syntactic form.
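The distinction can be put schematically (a sketch; X and Y stand for arbitrary claims, A for a fixed argument):

```latex
\[
\textbf{Reductio ad absurdum:}\quad X \to Y,\;\; \neg Y \;\;\vdash\;\; \neg X
\]
\[
\textbf{Fully general counterargument:}\quad \text{for \emph{every} claim } X,\quad A \vdash \neg X
\]
```

In the first schema Y genuinely depends on X; in the second, the same argument A "refutes" any X whatsoever, which is exactly why it is evidence of nothing.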

More meta: that's the point. If we have specific examples, we can discuss them specifically, and perhaps accept some and reject others. (Even the act of rejecting is helpful for communication, because it makes clearer what exactly we mean by saying something.)

Comment author: 30 November 2013 12:52:18AM *  0 points

Thanks for the feedback. I agree that reductio ad absurdum is the weakest of the examples I gave, but let me try to justify it anyway: if X is a fully general counterargument, then we can use it to argue against true statements as well as false ones. So applying X without any additional justification would lead to patently false conclusions, and therefore (by reductio ad absurdum) X is not a valid form of reasoning. Perhaps this is not the best word for it, but it is similar to a very pervasive idea in mathematics: when formulating possible approaches to prove a theorem, a key criterion is whether those approaches can distinguish between the theorem and similar statements that are known or suspected to be false.

ETA: And yes, I agree that specific examples are good!

Comment author: 29 November 2013 11:09:58AM *  5 points

we often create new words when existing concepts already exist

Every time I read this, it comes without specific examples. And it keeps returning for years.

If some people think this is a real problem, the best way to solve it would be to publish a dictionary of LW-specific words and their equivalents used by the outside world. I believe that such an article would be upvoted, and if there is agreement that those words really are equivalent, many of them would be replaced. At the very least, you could give a hyperlink to that article every time this topic appears in the debate.

If writing an article is too much work, writing a short list or just a single example in the Open Thread would be a good start.

Comment author: 29 November 2013 05:50:54PM 4 points

Here are some. One reason that translation between jargons is hard is that there is often no word in one jargon that can be used exactly in place of a word in the other. I've therefore included substitute words as long as they point to the same concepts. The words below were taken from the LW wiki.

Belief as attire: jumping on the bandwagon, groupthink
Dark arts: hard sell, manipulation
Fully general counterargument: reductio ad absurdum
Fuzzy: warm fuzzy feeling, self gratification
Luminosity: self awareness, emotional awareness
Ugh field: aversion

Comment author: 29 November 2013 08:57:27AM *  0 points

I believe that research is comparably efficient to much of industry, and that many of the things that look like inefficiencies are actually trading off small local gains for large global gains.

I understand "trade small local gains for large global gains" as a prescriptive principle, but does it work as a descriptive hypothesis? Why expect academics to be so much better than philanthropists at cause neutrality? When I speak to academics who aren't also EAs, they are basically never cause neutral, and they even joke about how ridiculously non-cause-neutral everybody in academia is, and how accidental everyone's choice of focus is, including their own.

Comment author: 29 November 2013 09:05:19AM 0 points

I'm not talking about cause neutrality. My point is that even once the general problem has been decided, there are many possible approaches, and academics often do things that seem inefficient but are actually exploring the space of possible approaches (possibly by trying to better understand the objects they are studying).

Comment author: 28 November 2013 06:37:00AM *  2 points

It's counterproductive if you have a big vocabulary but use it to show off (which is unfortunately how most people use them). A big vocabulary used correctly can be invaluable.

It's not clear to me what "correctly" means here. An illustrative example: I recently wanted to use "exacting" in a paper because, well, it was exactly the word to use in that sentence. I know about twelve similar words that each didn't quite fit for one reason or another, and I am not at all trying to show off my vocabulary by using it.

My advisor looks at the draft and says "look at this typo, did you mean to type exciting?" (He is not a native English speaker.) "No, I meant exacting; it means that it has precise requirements." "Oh. Well, that's confusing, you should change it."

Most of the people in my field are not native English speakers; if he has trouble understanding it, likely they will as well. Is it correct to use it?

Comment author: 28 November 2013 11:07:54PM 4 points

Most of the people in my field are not native English speakers; if he has trouble understanding it, likely they will as well. Is it correct to use it?

No. The purpose of writing is to be understood; if your writing will not be understood, then it should be changed. Write to the audience you have, not the audience you want to have.

Comment author: 28 November 2013 08:58:37PM 3 points

I think that in general people have a tendency to create too much notation rather than too little ("advanced vocabulary" is really just a type of notation). The LessWrong community seems particularly susceptible to this: we often create new words when existing concepts already exist, or when it wouldn't be that hard to explain what we mean in the first place. This is problematic because it makes idea transfer with outsiders far more difficult: not just because they are less willing to engage, but because one now needs to translate between notations, which is often much harder than it would naively seem. So I would actually be in favor of less vocabulary, more effort to tie existing concepts to concepts that exist in other communities, and more effort on simple explanations of ideas that don't require someone to have read the Sequences.

Comment author: 23 November 2013 11:03:29PM *  16 points

Yes; apology is an underrated consequentialist tool among nerds.

Some of the social function of apology can be understood game-theoretically: an apology explicitly disavows a past action, allowing the one to whom the apology was made to leverage that confession in the future. If someone apologises for something and then does it again, the response can escalate, because we now have evidence that they are doing it even knowing that it's 'wrong'. The person who apologised knows this, and often the implicit threat of escalation checks their future behaviour. Apology is therefore (possibly among other things) a signal, where the cost of apologising is the greater susceptibility to escalation in future cases.
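That signalling story can be put in toy payoff terms (a sketch only; the function name, payoff numbers, and escalation factor here are all invented for illustration):

```python
def reoffense_payoff(benefit, punishment, apologized, escalation=2.0):
    """Toy payoff for repeating a misdeed after a first incident.

    An apology licenses the injured party to escalate, so the expected
    punishment for a repeat offense is multiplied by `escalation`.
    """
    effective_punishment = punishment * escalation if apologized else punishment
    return benefit - effective_punishment

# Without an apology, reoffending can still be worth it: 3 - 2 = 1 > 0.
# After an apology, escalation makes it a losing move: 3 - 2*2 = -1 < 0.
```

Under these made-up numbers, the apology changes the sign of the repeat-offense payoff, which is the checking effect described above.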

Apology falls into a class--along with other things such as forgiving misdeeds, forgetting misdeeds, retribution, punishing an agent against its will, and compensation for misdeeds--of things that would make no sense among sufficiently advanced and cooperative rationalists. Some things in that class (e.g. forgiveness) might already have been transcended by LW, and others (e.g. apology) are probably not possible to transcend even on LW, because the knowledge of other participants (e.g. confidence in their cooperativeness) required to transcend apology is probably more than an online community of this size can sustain.

I would guess that the Bay Area rationalist set and its associates--which as far as I can tell is by far the most advanced community in the world in terms of how consummately instrumental x-rationality is forged into their swords--apologizes way, way, way more than the average LW'er, just like they talk about/express their feelings way more than people on LW typically do, and win because they're willing to confront that prospect of 'being vulnerable'.

"Well," said the boy. His eyes had not wavered from the Defense Professor's. "I certainly regret hurting you, Professor. But I do not think the situation calls for me to submit to you. I never really did understand the concept of apology, still less as it applies to a situation like this; if you have my regrets, but not my submission, does that count as saying sorry?"

Again that cold, cold laugh, darker than the void between the stars.

"I wouldn't know," said the Defense Professor, "I, too, never understood the concept of apology. That ploy would be futile between us, it seems, with both of us knowing it for a lie. Let us speak no more of it, then. Debts will be settled between us in time."

Two mistakes in thinking that my past self made a lot and others might also:

(1) Refusing to apologize if the other party was 'more wrong'. Even if you're 99.9% right/innocent/blameless, you still have to choose between apologizing and not apologizing to the other person. If you refuse to apologize, things will probably get worse, because the other person thinks you're more wrong than you think you are, and they will see your not apologizing as defecting. If you apologize in a smart way, you can give an apology (which shouldn't make a difference, but has the actual consequence of making the other person more likely to apologise as well) without tying yourself down with too broad a commitment about your future behaviour, and without lying that something was a mistake when it wasn't.

(2) Using the fact that, in the limit as rationality and cooperation become arbitrarily great, apology is meaningless, as a rationalization for not apologising, when in fact you just feel embarrassed or are generally untrained at apologising, and are therefore avoiding the exertion of doing so.

I want to point out the difference between completely fake apologies for things one does not think were mistakes, and apologising for things that were mistakes even if the other person's mistakes were much greater. The former is rarely the smart thing to do; the latter is smart far more often than one might think. Once you get fairly strong, you can sometimes even win free points by apologising in front of a big group of people for something that everyone but the other disputant thinks is completely outweighed by the other disputant's actions.

E.g. 'I'm sorry I used such an abrupt tone in asking you to desist from stealing my food; it probably put you on the defensive.' If you really mean it (and you should, because you're almost certainly not a perfect communicator and there were probably things you could have done better), then often onlookers will think you're awesome and think the other person sucks for 'making you' apologise when you'd 'done nothing wrong'. Sometimes even the other disputant will be so disarmed by your unwavering 'politeness' that they will realise the ridiculousness of the situation and realise that you're being genuine and that they made a mistake, whereas when they thought you were a hostile opponent, it was much easier for them to rationalise that mistake.

Notice that in that example, your apology has not even constrained your future actions; everyone was so distracted by the ridiculousness of you apologising when you were innocent, and by the contrast it drew between you and your opponent, that nobody will think to escalate against you the next time somebody steals your food.

That's why it's so important to know how to lose--so that you can win! Just as the best things you could do to decrease your personal risk from fights are things like practising conflict de-escalation techniques, learning how to walk away from conflict, being less tempestuous, being situationally aware, or even just learning how to play dead or fake a seizure or panic attack, rather than something that merely looks like winning, like practising flashy kicks.

Comment author: 24 November 2013 02:26:10AM 2 points

I wish I could upvote this more than once.

Comment author: 07 November 2013 10:38:03PM 3 points

Don't worry, I wasn't offended :)

Good to hear, and thanks for the reassurance :-) And yeah, I know all too well the problem of having too little time to write something polished, and I certainly prefer having the discussion in fairly raw form to not having it at all.

One possibility is that MIRI's arguments actually do look that terrible to you

What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.

I'm not really sure what you mean by a "concrete instantiation". I can think of concrete toy models of AIs using logical reasoning, which know an exact description of their environment as a logical formula but can't reason in the way I believe we want to achieve, because of the Löbian obstacle. I can't write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that's because I can't write down any AGI that lives in the real world.

My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.

I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they're trying to verify -- reasoning like "this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won't loop forever". While it seems possible to reason about the program you're creating in this way, our task is to ensure that we're creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use the transistors it finds in the external environment to achieve its goals -- and we want to verify that those transistors won't decide to blow up the world. It seems clear to me that this is going to require reasoning of the type "the program I'm creating is going to reason correctly about the program it is creating", which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind applied by today's automated verification techniques.
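The local, mechanical style of verification reasoning described above can be made concrete with a toy example (a sketch; the function is mine, not from the discussion):

```python
def countdown(n: int) -> int:
    """Count the iterations of a loop that provably terminates.

    A verifier can reason locally here: `n` strictly decreases on every
    iteration and the loop exits once it reaches zero, so for any
    non-negative input the loop runs exactly `n` times and halts.
    """
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps
```

This style inspects the concrete mechanics of one fixed program; the point being made is that it does not obviously scale to reasoning of the form "the program I'm creating will reason correctly about the program it is creating".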

Writing this, I'm not too confident that it will be helpful in getting the idea across. Hope the face-to-face with Paul will help, perhaps also with translating your intuitions into a language that better matches the way I think about things.

I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI's arguments -- e.g. there was that LW conversation you had with Eliezer where you pointed out that it's possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that to achieve confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don't seem to be capable of this or to extend in some obvious way to dealing with this.

You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe

I'm not sure I understand what you're saying here, but I'm not convinced that this is the sort of reasoning I'd use.

I'm fairly sure that the reason your brain goes "it would be safe if we only allow self-modifications when there's a proof that they're safe" is that you believe that if there's a proof that a self-modification is safe, then it is safe -- I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.

I don't think that "whole brain emulations can safely self-modify" is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: "The argument in favor of studying Löb's theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. [...]"

Thanks for the reply! Thing is, I don't think that ordinary human reasoning should run into that obstacle, and the "ordinary" is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don't think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know how an FAI could plausibly reach that confidence is through logical reasoning. I thought that "whole brain emulations can safely self-modify" might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb's theorem would be relevant.

My next best guess is that you think that even though human reasoning can't safely self-modify, its existence suggests that it's likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb's theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?

I can imagine that that might be the case, but I don't think it's terribly likely. I can more easily imagine something completely different from both human reasoning and logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb's theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn't seem likely that AGI research will hit on a safe solution automatically. MIRI's current research seems to me like a relatively promising way of searching for a solution that's close to logical reasoning.

When I say "failure to understand the surrounding literature", I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.

Ok, I think I probably don't understand this yet, and making a post about it sounds like a good plan!

Sorry for ducking most of the technical points; as I said, I hope that talking to Paul will resolve most of them.

No problem, and hope so as well.

Comment author: 08 November 2013 02:41:01AM 2 points

I don't have time to reply to all of this right now, but since you explicitly requested a reply to:

My next best guess is that you think that even though human reasoning can't safely self-modify, its existence suggests that it's likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb's theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?

The answer is yes, I think this is essentially right although I would probably want to add some hedges to my version of the statement (and of course the usual hedge that our intuitions probably conflict at multiple points but that this is probably the major one and I'm happy to focus in on it).

Comment author: 07 November 2013 12:16:27AM *  5 points

I thought the example was pretty terrible.

Glad to see you're doing well, Benja :)

Sorry for being curmudgeonly there -- I did afterwards wish that I had tempered that. The thing is that when you write something like

I also agree that the idea of "logical uncertainty" is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty.

that sounds to me like you're painting MIRI as working on these topics just because it's fun, and supporting its work by arguments that are obviously naive to someone who knows the field and that miss the point of what MIRI is trying to say. That's why I found the example of program analysis so annoying -- people who think that the halting problem means that program analysis is impossible really are misinformed (it's Rice's theorem, actually, but someone with this misconception wouldn't be aware of that), both about the state of the field and about why these theorems say what they say. E.g., yes, of course your condition is undecidable as long as there is any choice f(s) of chooseAction2(s) that satisfies it; proof: let chooseAction2(s) be the program that checks whether chooseAction2(s) satisfies your criterion, and if yes returns chooseAction(s), otherwise f(s). That's how these proofs always go, and of course it doesn't mean that there are no programs able to verify the condition for an interesting subclass of chooseAction2's; the obvious interesting example is searching for a proof of the condition in ZFC, and the obvious boring example is a giant look-up table which decides the condition for all choices of chooseAction2(s) of length less than L.
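The self-referential move in that proof sketch can be illustrated with a toy (hypothetical names; real safety conditions are about program behavior, not Python callables):

```python
def make_counterexample(claims_safe):
    """Given any claimed safety decider, build a program it misjudges.

    The constructed program consults the decider about itself and then
    does the opposite of whatever the decider predicts -- the same
    diagonal move used in the undecidability proof sketched above.
    """
    def prog():
        return "unsafe" if claims_safe(prog) else "safe"
    return prog

# Whatever the decider answers, the program contradicts it:
optimist = make_counterexample(lambda p: True)    # decider says "safe"
pessimist = make_counterexample(lambda p: False)  # decider says "unsafe"
```

Here `optimist()` behaves "unsafe" despite being certified safe, and `pessimist()` behaves "safe" despite being rejected; no single decider gets every such program right, though (as the comment notes) this says nothing against deciding interesting subclasses.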

One possibility is that MIRI's arguments actually do look that terrible to you, but that this is because MIRI hasn't managed to make them clearly enough. I'm thinking this may be the case because you write:

In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn't seem so restrictive to me.

First, that's precisely the "obvious" strategy that's the starting point for MIRI's work.

Second, yes, Eliezer's arguing that this isn't good enough, but the reason isn't that it there are some safe modifications which aren't provably safe. The work around the Löbian obstacle has nothing to do with trying to work around undecidability. (I will admit that for a short period at the April workshop I thought this might be a good research direction, because I had my intuitions shaken by the existence of Paul's system and got overly optimistic, but Paul quickly convinced me that this was an unfounded hope, and in any case the main work around the Löbian obstacle was never really related to this.) MIRI's argument definitely isn't that "the above algorithm can't decide for all chooseAction2 whether they're safe, therefore it probably can't decide it for the kind of chooseAction2 we're interested in, therefore it's unacceptable". If that's how you've understood the argument, then I see why you would think that the program analysis example is relevant. (The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we're interested in, but not because it's unable to decide this for any generic chooseAction2.)

Third, you seem to imply that your proposal will only take safe actions. You haven't given an argument for why we should think so, but the implication seems clear: You're using a chooseAction that is obviously safe as long as it doesn't rewrite itself, and it will only accept a proposed modification if it comes with a proof that it is safe, so if it does choose to rewrite itself then its successor will in fact be safe as well. Now I think this is fine reasoning, but you don't seem to agree:

Finally, I agree that such a naive proof strategy as "doing the following trivial self-modification is safe because the modified me will only do things that it proves won't destroy the world, thus it won't destroy the world" does not work. I'm not proposing that. The proof system clearly has to do some actual work.

You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe, but you don't think the AI should be able to use the same sort of reasoning. Eliezer's argument is that this is in fact reasoning you want to use when building a self-improving AI: Yes, you can reason in more detail about how the AI you are building works, but this AI_1 will build an AI_2 and so on, and when proving that it's ok to build AI_1 you don't want to reason in detail about how AI_1,000,000 is going to work (which is built using design principles you don't understand, by AI_999,999, which is much smarter than you); rather, you want to use general principles to reason that because of the way AI_1,000,000 came to be, it has to be safe (because AI_999,999 only builds safe AIs, because it was built by AI_999,998, which only builds safe AIs...). But not only do you need to reason like that, because you don't know and aren't smart enough to comprehend AI_1,000,000's exact design; AI_1, which also isn't that smart, will need to be able to use the same sort of reasoning. Hence the interest in the Löbian obstacle.
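For reference, the theorem behind the obstacle, with $\Box_T$ denoting provability in the agent's theory $T$:

```latex
\[
\text{L\"ob's theorem:}\quad \text{if } T \vdash \Box_T P \to P, \text{ then } T \vdash P.
\]
```

So an agent reasoning in $T$ cannot adopt the blanket principle "whatever my proof system proves about my successor is true" ($\Box_T P \to P$) for any $P$ that $T$ does not already prove outright; this is why the naive "my successor only does provably safe things, therefore it is safe" argument resists formalization down the chain of successors.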

There are caveats to add to this and parts of your comment I haven't replied to, but I'm running into the same problem as you with your original comment in this thread, having already spent too much time on this. I'd be happy to elaborate if useful. For my part, I'd be interested in your reply to the other part of my comment: do you think I have localized our disagreement correctly?

Oh, one last point that I shouldn't skip over: I assume the point about MIRI lacking "an understanding of the surrounding literature" refers to the thing about being tripped up at the July workshop by not knowing Gaifman's work on logical uncertainty well enough. If so, I agree that that was an avoidable failure, but I don't think it's indicative of always ignoring the relevant literature or something like that. I'll also admit that I still haven't familiarized myself with Gaifman's work, but that's because I'm not currently focusing on logical uncertainty; I intend to do so in the future.

Comment author: 07 November 2013 05:52:10AM 4 points

Sorry for being curmudgeonly there -- I did afterwards wish that I had tempered that.

Don't worry, I wasn't offended :)

that sounds to me like you're painting MIRI as working on these topics just because it's fun, and supporting its work by arguments that are obviously naive to someone who knows the field, and that you're supporting this by arguments that miss the point of what MIRI is trying to say.

I don't think that MIRI is working on these topics just because they are fun, and I apologize for implying that. I should note here that I respect the work that you and Paul have done, and as I said at the beginning I was somewhat hesitant to start this discussion at all, because I was worried that it would have a negative impact on either you / Paul's reputation (regardless of whether my criticisms ended up being justified) or on our relationship. But in the end I decided that it was better to raise my objections in fairly raw form and deal with any damage later.

One possibility is that MIRI's arguments actually do look that terrible to you

What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.

You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe

I'm not sure I understand what you're saying here, but I'm not convinced that this is the sort of reasoning I'd use. It seems like Paul's argument is similar to yours, though, and I'm going to talk to him in person in a few days, so perhaps the most efficient thing will be for me to talk to him and then report back.

Do you think I have localized our disagreement correctly?

I don't think that "whole brain emulations can safely self-modify" is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: "The argument in favor of studying Löb's theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. Standard epistemic hygiene calls for trying to produce concrete examples to motivate this line of work. I have not seen this done by MIRI, and all of the examples I can think of, both from having done AI and verification work myself and from looking at what my colleagues do in program analysis, point squarely in the opposite direction."

When I say "failure to understand the surrounding literature", I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.

Sorry for ducking most of the technical points; as I said, I hope that talking to Paul will resolve most of them.

Comment author: 06 November 2013 10:39:01PM 8 points

I agree that Luke here overstates the significance of my result, but I do think you miss the point a bit or are uncharitable. Regardless of whether making predictions about your own behavior is fundamentally difficult, we don't yet understand any formal framework that can capture reasoning of the form “my decisions are good because my beliefs correspond to reality.” Assuming there is a natural formal framework capturing human reasoning (I think the record so far suggests optimism) then there is something interesting that we don’t yet understand. It seems like you are applying the argument: “We know that humans can do X, so why do you think that X is an important problem?” The comment about undecidability issues not applying in practice also seems a bit unfair; for programs that do proof search we know that we cannot prove claims of the desired type based on simple Godelian arguments, and almost all interesting frameworks for reasoning are harder to prove things about than a simple proof search. (Of course the game is that we don’t want to prove things about the algorithms in question, we are happy to form justified beliefs about them in whatever way we can, including inductive inference. But the point is that there are things we don’t understand.)

There are further questions about whether any work at MIRI is a meaningful contribution to this problem or any other. I think that the stuff I’ve worked on is plausibly but not obviously a significant contribution (basically the same status as the other work I’m doing).

Regarding the modal agents stuff, I agree that it’s a toy problem where you should expect progress to be fast (if there was a nice theorem at the end of it, then it wouldn’t be too unusual as a paper in theoretical CS, except for the unfashionable use of mathematical logic). Regarding updateless/timeless/ambient decision theory, it’s a clear step forward for a very idiosyncratic problem, but one for which I think you can make a reasonable case that it’s worthwhile.

I think you shouldn’t be too surprised to make meaningful headway on theoretically interesting questions, even those which will plausibly be important. It seems like in theoretical research today things are still developing reasonably rapidly, and the ratio between plausibly important problems and human capital is very large. I expect that given effort and success at recruiting human capital MIRI can make good headway, in the same sort of way that other theorists do. Optimistically they would be distinguished primarily by working on a class of problems which is unusually important given their values and model of the world (a judgment with which you might disagree).

Comment author: 07 November 2013 05:19:07AM 0 points [-]

I think you shouldn’t be too surprised to make meaningful headway on theoretically interesting questions, even those which will plausibly be important. It seems like in theoretical research today things are still developing reasonably rapidly, and the ratio between plausibly important problems and human capital is very large.

I agree with this. Luke seems to be making a much stronger claim than the above, though.

It seems like you are applying the argument: “We know that humans can do X, so why do you think that X is an important problem?”

I agree that that would be a bad argument. That was not the argument I intended to make, though I can see why it has been interpreted that way and I should have put more effort into explaining myself. I am rather saying that human reasoning looks so far away from even getting close to running into issues with Gödel/Löb that it seems like a rather abstruse starting point for investigation.

The rest of your comment seems most easily discussed in person, so I'll do that and hopefully we'll remember to update the thread with our conclusion.

Comment author: 05 November 2013 06:55:40PM *  5 points [-]

Jacob, have you seen Luke's interview with me, where I've tried to reply to some arguments of the sort you've given in this thread and elsewhere?

I don't think [the fact that humans' predictions about themselves and each other often fail] is sufficient to dismiss my example. Whether or not we prove things, we certainly have some way of reasoning at least somewhat reliably about how we and others will behave. It seems important to ask why we expect AI to be fundamentally different; I don't think that drawing a distinction between heuristics and logical proofs is sufficient to do so, since many of the logical obstacles carry over to the heuristic case, and to the extent they don't this seems important and worth grappling with.

Perhaps here is a way to get a handle on where we disagree: Suppose we make a whole-brain emulation of Jacob Steinhardt, and you start modifying yourself in an attempt to achieve superintelligence while preserving your values, so that you can save the world. You try to go through billions of (mostly small) changes. In this process, you use careful but imperfect human (well, eventually transhuman) reasoning to figure out which changes are sufficiently safe to make. My expectation is that one of two things happens: Either you fail, ending up with very different values than you started with or stopping functioning completely; or you think very hard about how much confidence you need to have in each self-modification, and how much confidence you can achieve by ordinary human reasoning, and end up not doing a billion of these because you can't achieve the necessary level of confidence. The only way I know for a human to reach the necessary level of confidence in the majority of the self-modifications would be to use formally verified proofs.

Presumably you disagree. If you could make a convincing case that a whole-brain emulation could safely go through many self-modifications using ordinary human reasoning, that would certainly change my position in the direction that the Löbian obstacle and other diagonalization issues won't be that important in practice. If you can't convince me that it's probably possible and I can't convince you that it probably isn't, this might still help understanding where the disagreement is coming from.

Also note that, even if you did think it was sufficient, I gave you another example that was based purely in the realm of formal logic.

I thought the example was pretty terrible. Everybody with more than passing familiarity with the halting problem, and more generally Rice's theorem, understands that the fact that you can't decide, for every program, whether it's in a given class doesn't imply that there are no useful classes of programs for which you can do so. MIRI's argument for the importance of Löb's theorem is: There's an obvious way you can try to get stable self-modification, which is to require that if the AI self-modifies, it has to prove that the successor will not destroy the world. But if the AI tries to argue "doing the following trivial self-modification is safe because the modified me will only do things that it proves won't destroy the world, thus it won't destroy the world", that requires the AI to understand the soundness of its own proof system, which is impossible by Löb's theorem. This seems to me like a straight-up application of what Löb's theorem actually says, rather than the kind of half-informed misunderstanding that would suggest that program analysis is impossible because of Rice's theorem.
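For reference, the exact statement being applied here (a standard fact about provability, stated here for readers who want it precise):

```latex
\text{L\"ob's theorem: for any sentence } P,\qquad
T \vdash (\Box_T P \to P) \;\implies\; T \vdash P,
```

where \Box_T P abbreviates "P is provable in T". So a consistent theory T proves an instance of its own soundness schema \Box_T P \to P only when it already proves P outright; the self-trust argument above would need that schema for arbitrary safety claims, which is exactly what the theorem rules out.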

Comment author: 06 November 2013 05:22:50PM 3 points [-]

I thought the example was pretty terrible.

Glad to see you're doing well, Benja :)

Here's a concrete way you could try to get stable self-modification:

Suppose for concreteness that we have a C program, call it X, and that within the C program there is an array called "world_state" of length M and a floating point number called "utility". A simple instantiation of X would look something like:

    while (true) {
        action = chooseAction(world_state);
        world_state = propagateWorldState(world_state, action);
        utility = calculateUtility(world_state);
    }

We would like to consider modifications to X where we replace chooseAction with some new method chooseAction2 to get a program X2. Suppose we want to ensure a condition such as: from the current world state, if we use X2 instead of X, then after some finite period of time the sequence of utilities we get from using chooseAction2 will always be larger than the corresponding sequence if we had used chooseAction. Abusing notation a bit, this is the same as verifying the statement:

"there exists N such that for all n > N, utility2(n) > utility(n)"

[although note that utility2 and utility have fairly complicated descriptions if you actually try to write them out]. Now I agree that reasoning about this for arbitrary choices of chooseAction and chooseAction2 will be quite difficult (probably undecidable, although I haven't proved that). But the key point is that I get to choose chooseAction2, and there are many decision procedures that can prove such a statement in special cases. For instance, I could partition the space of world states into finitely many pieces and write down a transition function that over-approximates the possible transitions (say, by having a transition from Piece1 to Piece2 if any element of Piece1 can transition to any element of Piece2). Then I only need to reason about a finite automaton, whose properties are decidable.

You could argue that this proof system is fairly weak, but again, the AI gets to tailor its choices of chooseAction2 to be easy to reason about. You could also argue that an AI that could only create instances of chooseAction2 that could be reasoned about as above would be severely hampered, but I think this overlooks the fact that such techniques have been extraordinarily successful at reasoning about fairly complex systems.

In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn't seem so restrictive to me.

Finally, I agree that such a naive proof strategy as "doing the following trivial self-modification is safe because the modified me will only do things that it proves won't destroy the world, thus it won't destroy the world" does not work. I'm not proposing that. The proof system clearly has to do some actual work. But your argument seems to me to be saying "this extremely naive strategy doesn't work, let's develop an entire subfield of logic to try to make it work", whereas I advocate "this extremely naive strategy doesn't work, let's use a non-naive approach instead, and look at the relevant literature on how to solve similar problems".

I also agree that the idea of "logical uncertainty" is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty. But it seems like a mistake to me to tackle such problems without at least one, and preferably both, of: an understanding of the surrounding literature; experience with concrete instantiations of the problems at hand.
