Don't worry, I wasn't offended :)
Good to hear, and thanks for the reassurance :-) And yeah, I know all too well the problem of having too little time to write something polished, and I certainly prefer having the discussion in fairly raw form to not having it at all.
One possibility is that MIRI's arguments actually do look that terrible to you
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
I'm not really sure what you mean by a "concrete instantiation". I can think of concrete toy models, of AIs using logical reasoning which know an exact description of their environment as a logical formula, which can't reason in the way I believe is what we want to achieve, because of the Löbian obstacle. I can't write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that's because I can't write down any AGI that lives in the real world.
My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.
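For reference, here is a standard statement of Löb's theorem, which is what the "Löbian obstacle" above refers to. The symbols T, \Box_T and \varphi are notation assumed for this sketch and do not come from the discussion itself.

```latex
% Assumed notation: T is a recursively axiomatized theory extending Peano
% Arithmetic, and \Box_T \varphi abbreviates the arithmetized statement
% "\varphi is provable in T". (\text requires the amsmath package.)
\[
  \text{If } T \vdash \Box_T \varphi \rightarrow \varphi,
  \quad \text{then } T \vdash \varphi .
\]
% Consequence: T proves the soundness instance \Box_T \varphi \rightarrow \varphi
% only for sentences \varphi that T already proves outright.
```

Read this way, an agent whose reasoning is formalized in T cannot establish, within T, the blanket principle "whatever is provable in T is true". Roughly, that is the trust it would need in order to accept a successor's T-proofs of safety without re-deriving each conclusion itself.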
I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they're trying to verify -- reasoning like "this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won't loop forever". But heuristically, while it seems possible to reason about the program you're creating in this way, our task is to ensure that we're creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use transistors it finds in the external environment to achieve its goals, and we want to verify that those transistors won't decide to blow up the world. It seems clear to me that this is going to require reasoning of the type "the program I'm creating is going to reason correctly about the program it is creating", which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind of reasoning applied by today's automated verification techniques.
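To make the first kind of reasoning concrete, here is a minimal sketch of the quoted loop argument. It illustrates only the "precise behavior" style of verification, not the abstract reasoning about successor programs. The function name `countdown` and the runtime assertion are illustrative assumptions; a real verifier would prove the variant statically rather than check it while running.

```python
def countdown(n: int) -> int:
    """Run a loop whose termination follows from a decreasing variable.

    Hypothetical example: `n` is decremented on every iteration and the
    loop exits when it reaches zero, so the loop cannot run forever.
    Returns the number of iterations executed.
    """
    if n < 0:
        raise ValueError("n must be non-negative")

    iterations = 0
    while n != 0:
        previous = n
        n -= 1
        # Loop variant: n strictly decreases and never drops below zero,
        # which bounds the number of remaining iterations.
        assert 0 <= n < previous
        iterations += 1
    return iterations


if __name__ == "__main__":
    print(countdown(5))
```

The argument works because everything it needs is visible in the loop body itself. Nothing comparable is visible when the object being verified is a program that will later write other programs.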
Writing this, I'm not too confident that this will be helpful in getting the idea across. I hope the face-to-face with Paul will help, perhaps also with translating your intuitions into a language that better matches the way I think about things.
I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI's arguments -- e.g. there was that LW conversation you had with Eliezer where you pointed out that it's possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that achieving confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don't seem to be capable of this or to extend in some obvious way to dealing with this.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe
I'm not sure I understand what you're saying here, but I'm not convinced that this is the sort of reasoning I'd use.
I'm fairly sure that the reason your brain goes "it would be safe if we only allow self-modifications when there's a proof that they're safe" is that you believe that if there's a proof that a self-modification is safe, then it is safe -- I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.
I don't think that "whole brain emulations can safely self-modify" is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: "The argument in favor of studying Lob's theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. [...]"
Thanks for the reply! Thing is, I don't think that ordinary human reasoning should run into that obstacle, and the "ordinary" is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don't think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know how an FAI could plausibly reach that confidence is through logical reasoning. I thought that "whole brain emulations can safely self-modify" might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb's theorem would be relevant.
My next best guess is that you think that even though human reasoning can't safely self-modify, its existence suggests that it's likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb's theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
I can imagine that that might be the case, but I don't think it's terribly likely. I can more easily imagine that there would be something completely different from both human reasoning and logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb's theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn't seem likely that AGI research will hit on a safe solution automatically. MIRI's current research seems to me like a relatively promising way of trying to search for a solution that's close to logical reasoning.
When I say "failure to understand the surrounding literature", I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Ok, I think I probably don't understand this yet, and making a post about it sounds like a good plan!
Sorry for ducking most of the technical points, as I said, I hope that talking to Paul will resolve most of them.
No problem, and hope so as well.
I don't have time to reply to all of this right now, but since you explicitly requested a reply to:
My next best guess is that you think that even though human reasoning can't safely self-modify, its existence suggests that it's likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb's theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
The answer is yes, I think this is essentially right although I would probably want to a...
Previously: Why Neglect Big Topics.
Why was there no serious philosophical discussion of normative uncertainty until 1989, given that all the necessary ideas and tools were present at the time of Jeremy Bentham?
Why did no professional philosopher analyze I.J. Good’s important “intelligence explosion” thesis (from 1959 [1]) until 2010?
Why was reflectively consistent probabilistic metamathematics not described until 2013, given that the ideas it builds on go back at least to the 1940s?
Why did it take until 2003 for professional philosophers to begin updating causal decision theory for the age of causal Bayes nets, and until 2013 to formulate a reliabilist metatheory of rationality?
By analogy to financial market efficiency, I like to say that “theoretical discovery is fairly inefficient.” That is: there are often large, unnecessary delays in theoretical discovery.
This shouldn’t surprise us. For one thing, there aren’t necessarily large personal rewards for making theoretical progress. But it does mean that those who do care about certain kinds of theoretical progress shouldn’t necessarily think that progress will be hard. There is often low-hanging fruit to be plucked by investigators who know where to look.
Where should we look for low-hanging fruit? I’d guess that theoretical progress may be relatively easy where:
These guesses make sense of the abundant low-hanging fruit in much of MIRI’s theoretical research, with the glaring exception of decision theory. Our September decision theory workshop revealed plenty of low-hanging fruit, but why should that be? Decision theory is widely applied in multi-agent systems, and in philosophy it’s clear that visible progress in decision theory is one way to “make a name” for oneself and advance one’s career. Tons of quality-adjusted researcher hours have been devoted to the problem. Yes, new theoretical advances (e.g. causal Bayes nets and program equilibrium) open up promising new angles of attack, but they don’t seem necessary to much of the low-hanging fruit discovered thus far. And progress in decision theory is definitely not valuable only to those with unusual views. What gives?
Anyway, three questions:
[1] Good (1959) is the earliest statement of the intelligence explosion: “Once a machine is designed that is good enough… it can be put to work designing an even better machine. At this point an ‘explosion’ will clearly occur; all the problems of science and technology will be handed over to machines and it will no longer be necessary for people to work. Whether this will lead to a Utopia or to the extermination of the human race will depend on how the problem is handled by the machines. The important thing will be to give them the aim of serving human beings.” The term itself, “intelligence explosion,” originates with Good (1965). Technically, artist and philosopher Stefan Themerson wrote a "philosophical analysis" of Good's intelligence explosion thesis called Special Branch, published in 1972, but by "philosophical analysis" I have in mind a more analytic, argumentative kind of philosophical analysis than is found in Themerson's literary Special Branch.