JonahSinick comments on Tiling Agents for Self-Modifying AI (OPFAI #2) - Less Wrong
Neither 2 nor 3 is the sort of argument I would ever make (there's such a thing as an attempted steelman which by virtue of its obvious weakness doesn't really help). You already know(?) that I vehemently reject all attempts to multiply tiny probabilities by large utilities in real-life practice, or to claim that the most probable assumption about a background epistemic question leads to a foregone doom and use this to justify its improbable negation. The part at which you lost me is of course part 1.
I still don't understand what you could be thinking here, and feel like there's some sort of basic failure to communicate going on. I could guess something along the lines of "Maybe Jonah is imagining that Friendly AI will be built around principles completely different from modern decision theory and any notion of a utility function..." (but really, is something like that one of just 10^10 equivalent candidates?) "...and more dissimilar to that than logical AI is from decision theory" (that's a lot of dissimilarity but we already demonstrated conceptual usefulness over a gap that size). Still, that's the sort of belief someone might end up with if their knowledge of AI was limited to popular books extolling the wonderful chaos of neural networks, but that idea is visibly stupid so my mental model of Anna warns me not to attribute it to you. Or I could guess, "Maybe Jonah is Holden-influenced and thinks that all of this discussion is irrelevant because we're going to build a Google Maps AGI", where in point of fact it would be completely relevant, not a tiniest bit less relevant, if we were going to build a planning Oracle. (The experience with Holden does give me pause and make me worry that EA people may think they already know how to build FAI using their personal wonderful idea, just like vast numbers of others think they already know how to build FAI.) But I still can't think of any acceptable steel version of what you mean, and I say again that it seems to me that you're saying something that a good mainstream AI person would also be staring quizzically at.
What would be one of the other points in the 10^10-sized space? If it's something along the lines of "an economic model" then I just explained why if you did something analogous with an economic model it could also be interesting progress, just as AIXI was conceptually important to the history of ideas in the field. I could explain your position by supposing that you think that mathematical ideas never generalize across architectures and so only analyzing the exact correct architecture of a real FAI could be helpful even at the very beginning of work, but this sounds like a visibly stupid position so the model of Anna in my head is warning me not to attribute it to you. On the other hand, some version of, "It is easier to make progress than Jonah thinks because the useful generalization of mathematical ideas does not require you to select correct point X out of 10^10 candidates" seems like it would almost have to be at work here somewhere.
I seriously don't understand what's going on in your head here. It sounds like any similar argument should Prove Too Much by showing that no useful work or conceptual progress could have occurred due to AI work in the last 60 years because there would be 10^10 other models for AI. Each newly written computer program is unique but the ideas behind them generalize, the resulting conceptual space can be usefully explored, that's why we don't start over with every new computer program. You can do useful things once you've collected enough treasure nuggets and your level of ability builds up, it's not a question of guessing the one true password out of 10^10 tries with nothing being progress until then. This is true on a level of generality which applies across computer science and also to AI and also to FAI and also to decision theory and also to math. Everyone takes this for granted as an obvious background fact of doing research which is why I would expect a good mainstream AI person to also be staring quizzically at your statements here. I do not feel like the defense I'm giving here is in any way different from the defense I'd give of a randomly selected interesting AI paper if you said the same thing about it. "That's just how research works," I'd say.
Please amplify point 1 in much greater detail using concrete examples and as little abstraction as possible.
I continue to appreciate your cordiality.
A number of people have recently told me that they have trouble understanding me unless I elaborate further, because I don't spell out my reasoning in sufficient detail. I think that this is more a matter of the ideas involved being complicated, and there being a lot of inferential distance, than it is lack of effort on my part, but I can see how it would be frustrating to my interlocutors. It seems that I'm subject to the illusion of transparency. I appreciate your patience.
I know that you've explicitly disavowed arguments of the type in my points 2 and 3. My reason for bringing them up is to highlight the importance of addressing point 1: to emphasize that it doesn't suffice to say "the problem is important and we have to get started on it somehow." I recognize that we have very different implicit assumptions on point 1, and that that's where the core of the disagreement lies.
There's essentially only one existing example of an entity with general intelligence: a human. I think that our prior should be that the first AGI will have internal structure analogous to that of a human. Here I'm not suggesting that an AGI will have human values by default: I'm totally on board with your points about the dangers of anthropomorphization in that context. Rather, what I mean is that I envisage the first AGI as having many interacting specialized modules, rather than a mathematically defined utility function.
There are serious dangers of such an entity having values that are orthogonal to humans', and serious dangers of value drift. (Your elegant article "Why does power corrupt?" has some relevance to the latter point.) But it seems to me that the measures one would want to take to prevent humans' goals from changing are completely different from the sorts of measures that might emerge from MIRI's FAI research.
I'll also highlight a comment of Nick Beckstead, which you've already seen and responded to. I didn't understand your response.
I should clarify that I don't have high confidence that the first AGI will develop along these lines. But it's my best guess, and it seems much more plausible to me than models of the type in your paper.
The difference that I perceive between the two scenarios is the nature of the feedback loops in each case.
When one is chipping away at a problem incrementally, one has the capacity to experiment and use the feedback generated from experimentation to help one limit the search space. Based on what I know about the history of science, general relativity is one of the only successful theories that was created without lots of empirical investigation.
The engineers who designed the first bridges had trillions of combinations of design features and materials to consider a priori, the vast majority of which wouldn't work. But an empirical discovery like "material X is too weak to work within any design" greatly limits the search space, because you don't have to think further about any of the combinations involving material X. Similarly if one makes a discovery of the type "material Y is so strong that it'll work with any design." By making a series of such discoveries, one can home in on a few promising candidates.
This is how I predict that the development of AGI will go. I think that the search space is orders of magnitude too large to think about in a useful way without a lot of experimentation, and that a priori we can't know what the first AGI will look like. I think that once it becomes more clear what the first AGI will look like, it will become much more feasible to make progress on AI safety.
It'll take me a while to come up with a lot of concrete hypotheticals, but I'll get back to you on this.
Okay. This sounds like you're trying to make up your own FAI theory in much the same fashion as Holden (and it's different from Holden's, of course). Um, what I'd like to do at this point is take out a big Hammer of Authority and tell you to read "Artificial Intelligence: A Modern Approach" so your mind would have some better grist to feed on as to where AI is and what it's all about. If I can't do that... I'm not really sure where I could take this conversation. I don't have the time to personally guide you to understanding of modern AI starting from that kind of starting point. If there's somebody else you'd trust to tell you about AI, with more domain expertise, I could chat with them and then they could verify things to you. I just don't know where to take it from here.
On the object level I will quickly remark that some of the first attempts at heavier-than-air flying-machines had feathers and beaks and they did not work very well, that 'interacting specialized modules' is Selling Nonapples, that there is an old discussion in cognitive science about the degree of domain specificity in human intelligence, and that the idea that 'humans are the only example we have' is generally sterile, for reasons I've already written about but I can't remember the links offhand, hopefully someone else does. It might be in Levels of Organization in General Intelligence, I generally consider that pretty obsolete but it might be targeted to your current level.
Either of my best guess or Holden's best guess could be right, and so could lots of other ideas that we haven't thought of. My proposed conceptual framework should be viewed as one of many weak arguments.
The higher level point that I was trying to make is that [the conceptual framework implicit in the view that MIRI's current FAI research has a non-negligible chance of being relevant to AI safety] seems highly conjunctive. I don't mean this rhetorically at all – I genuinely don't understand why you think that we can make progress given how great the unknown unknowns are. You may be right, but justification of your view requires further argumentation.
A more diplomatic way of framing this would be something like:
"The book Artificial Intelligence: A Modern Approach has a discussion of current approaches to artificial intelligence. Are you familiar with the ideas therein? If not, I'd suggest that you take a look"
Putting that aside, based on conversations with a number of impressive people in machine learning, etc. who I know, my impression is that at the moment, there aren't strong contenders for research programs that could plausibly lead to AGI. I largely accept Luke's argument in his blog post on AI timelines, but this is based on the view that the speed of research is going to increase a lot over the coming years, rather than on the belief that any existing research programs have a reasonable chance of succeeding.
I'd be very interested in hearing about existing research programs that have a reasonable chance of succeeding.
Is it your view that no progress has occurred in AI generally for the last sixty years?
The field as a whole has been making perfectly good progress AFAICT. We know a bleepton more about cognition than we did in 1955 and are much less confused by many things. Has someone been giving you an impression otherwise and if so, what field were they in?
No, it's clear that there have been many advances, for example in chess playing programs, auto-complete search technology, automated translation, driverless cars, and speech recognition.
But my impression is that this work has only made a small dent in the problem of general artificial intelligence.
Also, the fraction of scientists who I know who believe that there's a promising AGI research agenda on the table is very small, mostly consisting of people around MIRI. Few of the scientists who I know have subject matter expertise, but if there was a promising AGI research agenda on the table, I would expect news of it to have percolated to at least some of the people in question.
I think I may have been one of those three graduate students, so just to clarify, my view is:
Zero progress being made seems too strong a claim, but I would say that most machine learning research is neither relevant to, nor trying to be relevant to, AGI. I think that there is no real disagreement on this empirical point (at least, from talking to both Jonah and Eliezer in person, I don't get the impression that I disagree with either of you on this particular point).
The model for AGI that MIRI uses seems mostly reasonable, except for the "self-modification" part, which seems to be a bit too much separated out from everything else (since pretty much any form of learning is a type of self-modification --- current AI algorithms are self-modifying all the time!).
In this vein, I'm skeptical of both the need for and the feasibility of an AI providing an actual proof of safety of self-modification. I also think that using mathematical logic somewhat clouds the issues here, and that most of the issues that MIRI is currently working on are prerequisites for any sort of AI, not just friendly AI. I expect them to be solved as a side-effect of what I see as more fundamental outstanding problems.
However, I don't have reasons to be highly confident in these intuitions, and as a general rule of thumb, having different researchers with different intuitions pursue their respective programs is a good way to make progress, so I think it's reasonable for MIRI to do what it's doing (note that this is different from the claim that MIRI's research is the most important thing and is crucial to the survival of humanity, which I don't think anyone at MIRI believes, but I'm clarifying for the benefit of onlookers).
Agreed, the typical machine learning paper is not AGI progress - a tiny fraction of such papers being AGI progress suffices.
I want to note that the general idea being investigated is that you can have a billion successive self-modifications with no significant statistically independent chance of critical failure. Doing proofs from axioms, in which case the theorems are not perfectly strong but at least as strong as the axioms, with conditionally independent failure probabilities not significantly lowering the conclusion strength below this as they stack, is an obvious entry point into this kind of lasting guarantee. It also suggests to me that even if the actual solution doesn't use theorems proved and adapted to the AI's self-modification, it may have logic-like properties. The idea here may be more general than it looks at first glance.
Can you name some papers that you think constitute AGI progress? (Not a rhetorical question.)
I'm not sure if I parse this correctly, and may be responding to something that you don't intend to claim, but I want to remark that if the probabilities of critical failure at each stage are
0.01, 0.001, 0.0001, 0.00001, etc.
then total probability of critical failure is less than 2%. You don't need the probability of failure at each stage to be infinitesimal, you only need the probabilities of failure to drop off fast enough.
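The arithmetic behind that bound can be checked with a short sketch. It assumes the stages fail independently, and the function name and the choice of 30 stages are illustrative only, not anything from the paper. The sum of the per-stage probabilities is a geometric series tending to 0.01 / (1 - 0.1) = 1/90, roughly 1.1%, and the exact probability of at least one failure cannot exceed that sum.

```python
def total_failure_probability(stage_probs):
    """Probability of at least one critical failure, assuming independent stages."""
    survival = 1.0
    for p in stage_probs:
        survival *= 1.0 - p
    return 1.0 - survival

# Per-stage failure probabilities 0.01, 0.001, 0.0001, ... (30 stages here;
# later terms are too small to matter at float precision).
decaying = [10.0 ** -(k + 2) for k in range(30)]

exact = total_failure_probability(decaying)
union_bound = sum(decaying)  # geometric series, tends to 1/90

print(f"exact total failure probability: {exact:.6f}")
print(f"union bound (sum of stages):     {union_bound:.6f}")
```

Both figures stay below 2%, which is all the comment claims. The point is only that the total is governed by how fast the per-stage probabilities shrink, not by any single stage being negligible.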
How would they drop off if they're "statistically independent"? In principle this could happen, given a wide separation in time, if humanity or lesser AIs somehow solve a host of problems for the self-modifier. But both the amount of help from outside and the time-frame seem implausible to me, for somewhat different reasons. (And the idea that we could know both of them well enough to have those subjective probabilities seems absurd.)
I'm aware of this argument, but I think there are other ways to get this. The first tool I would reach for would be a martingale (or more generally a supermartingale), which is a statistical process that somehow manages to correlate all of its failures with each other (basically by ensuring that any step towards failure is counterbalanced in probability by a step away from failure). This can yield bounds on failure probability that hold for extremely long time horizons, even if there is non-trivial stochasticity at every step.
Note that while martingales are the way that I would intuitively approach this issue, I'm trying to make the broader argument that there are ways other than mathematical logic to get what you are after (with martingales being one such example).
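One standard result of the kind being gestured at is Ville's maximal inequality, given here as an illustration and not necessarily the bound the commenter has in mind. The process $X_t$ below is a hypothetical nonnegative "risk score" and the threshold $c$ is a hypothetical failure level; neither is defined anywhere in this thread.

```latex
% Assumption: (X_t)_{t \ge 0} is a nonnegative supermartingale, i.e.
%   X_t \ge 0  and  E[X_{t+1} \mid X_0, \dots, X_t] \le X_t  for all t.
% Then for every threshold c > 0 (Ville's maximal inequality):
\[
  \Pr\!\left( \sup_{t \ge 0} X_t \ge c \right) \;\le\; \frac{\mathbb{E}[X_0]}{c}.
\]
```

The bound covers the whole infinite horizon at once and does not degrade with the number of steps, which is the property relevant to very long chains of modifications. Whether a suitable $X_t$ exists for a self-modifying agent is exactly the open question raised in the reply that follows.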
Please expand on this, because I'm having trouble understanding your idea as written. A martingale is defined as "a sequence of random variables (i.e., a stochastic process) for which, at a particular time in the realized sequence, the expectation of the next value in the sequence is equal to the present observed value even given knowledge of all prior observed values at a current time", but what random variable do you have in mind here?
I'd be interested in your thoughts on the point about computational complexity in this comment.
It seems to me like relatively narrow progress on learning is likely to be relevant to AGI. It does seem plausible that e.g. machine learning research is not too much more relevant to AGI than progress in optimization or in learning theory or in type theory or perhaps a dozen other fields, but it doesn't seem very plausible that it isn't taking us closer to AGI in expectation.
Yes, reflective reasoning seems to be necessary to reason about the process of learning and the process of reflection, amongst other things. I don't think any of the work that has been done applies uniquely to explicit self-modification vs. more ordinary problems with reflection (e.g. I think the notion of "truth" is useful if you want to think about thinking, and believing that your own behavior is sane is useful if you want to think about survival as an instrumental value).
This seems quite likely (or at least the weaker claim, that either these results are necessary for any AI or they are useless for any AI, seems very likely). But of course this is not enough to say that such work isn't useful for better understanding and coping with AI impacts. If we can be so lucky as to find important ideas well in advance of building the practical tools that make those ideas algorithmically relevant, then we might develop a deeper understanding of what we are getting into and more time to explore the consequences.
In practice, even if this research program worked very well, we would probably be left with at least a few and perhaps a whole heap of interesting theoretical ideas. And we might have few clues as to which will turn out to be most important. But that would still give us some general ideas about what human-level AI might look like, and could help us see the situation more clearly.
Indeed, I would be somewhat surprised if interesting statements get proven often in the normal business of cognition. But this doesn't mean that mathematical logic and inference won't play an important role in AI---logic is by far the most expressive language that we are currently aware of, and therefore a natural starting point if we want to say anything formal about cognition (and as far as I can tell this is not at all a fringe view amongst folks in AI).
I'd be interested in your response to the following, which I wrote in another context. I recognize that I'm far outside of my domain of expertise, and what I write should be read as inquisitive rather than argumentative:
The impression that I've gotten is that to date, impressive applications of computers to do tasks that humans do are based around some combination of
In particular, they don't seem at all relevant to mimicking human inference algorithms.
As I said in my point #2 here: I find it very plausible that advances in narrow AI will facilitate the development of AGI by enabling experimentation.
The question that I'm asking is more: "Is it plausible that the first AGI will be based on filling in implementation details of current neural networks research programs, or current statistical inference research programs?"
Something worth highlighting is that researchers in algorithms have repeatedly succeeded in developing algorithms that solve NP-complete problems in polynomial time with very high probability, or that give very good approximations to solutions to problems in polynomial time where it would be NP-complete to get the solutions exactly right. But these algorithms can't be ported from one NP-complete problem to another while retaining polynomial running time. One has to deal with each algorithmic problem separately.
From what I know, my sense is that one has a similar situation in narrow AI, and that humans (in some vague sense) have a polynomial time algorithm that's robust across different algorithmic tasks.
I don't really understand how "task specific algorithms generated by humans" differs from general intelligence. Humans choose a problem, and then design algorithms to solve the problem better. I wouldn't expect a fundamental change in this situation (though it is possible).
I think this is off. A single algorithm currently achieves the best known approximation ratio on all constraint satisfaction problems with local constraints (this includes most of the classical NP-hard approximation problems where the task is "violate as few constraints as possible" rather than "satisfy all constraints, with as high a score as possible"), and is being expanded to cover increasingly broad classes of global constraints. You could say "constraint satisfaction is just another narrow task" but this kind of classification is going to take you all the way up to human intelligence and beyond. Especially if you think 'statistical inference' is also a narrow problem, and that good algorithms for planning and inference are more of the same.
You can't do that? From random things like computer security papers, I was under the impression that you could do just that - convert any NP problem to a SAT instance and toss it at a high-performance commodity SAT solver with all its heuristics and tricks, and get an answer back.
Point of order: Let A = "these results are necessary for any AI" and B = "they are useless for any AI". It sounds like you're weakening from A to (A or B) because you feel the probability of B is large, and therefore the probability of A isn't all that large in absolute terms. But if much of the probability mass of the weaker claim (A or B) comes from B, then if at all possible, it seems more pragmatically useful to talk about (i) the probability of B and (ii) the probability of A given (not B), instead of talking about the probability of (A or B), since qualitative statements about (i) and (ii) seem to be what's most relevant for policy. (In particular, even knowing that "the probability of (A or B) is very high" and "the probability of A is not that high" -- or even "is low" -- doesn't tell us whether P(A|not B) is high or low.)
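A small numeric sketch makes the parenthetical concrete. It assumes A and B are mutually exclusive (results cannot be both necessary and useless for any AI), and the probabilities are made up purely for illustration. In both scenarios P(A or B) is "very high" and P(A) is "low", yet P(A | not B) differs widely.

```python
def p_a_given_not_b(p_a_or_b, p_a):
    """P(A | not B) for mutually exclusive A and B.

    Assumes A and B cannot both hold, so P(B) = P(A or B) - P(A)
    and P(A | not B) = P(A) / (1 - P(B)).
    """
    p_b = p_a_or_b - p_a
    return p_a / (1.0 - p_b)

# Two hypothetical scenarios with made-up numbers.
scenario_1 = p_a_given_not_b(p_a_or_b=0.99, p_a=0.10)  # P(B) = 0.89
scenario_2 = p_a_given_not_b(p_a_or_b=0.85, p_a=0.10)  # P(B) = 0.75

print(f"scenario 1: P(A | not B) = {scenario_1:.2f}")
print(f"scenario 2: P(A | not B) = {scenario_2:.2f}")
```

The first scenario gives a conditional probability above 0.9 and the second gives 0.4, so the qualitative summary alone leaves the policy-relevant quantity undetermined.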
My impression from your above comments is that we are mostly in agreement except for how much we respectively like mathematical logic. This probably shouldn't be surprising given that you are a complexity theorist and I'm a statistician, and perhaps I should learn some more mathematical logic so I can appreciate it better (which I'm currently working on doing).
I of course don't object to logic in the context of AI, it mainly seems to me that the emphasis on mathematical logic in this particular context is unhelpful, as I don't see the issues being raised as being fundamental to what is going on with self-modification. I basically expect whatever computationally bounded version of probability we eventually come up with to behave locally rather than globally, which I believe circumvents most of the self-reference issues that pop up (sorry if that is somewhat vague intuition).
Thanks Jacob.
I'd be interested in your thoughts on my comment here.
Hm. I'm not sure if Scott Aaronson has any weird views on AI in particular, but if he's basically mainstream-oriented we could potentially ask him to briefly skim the Tiling Agents paper and say if it's roughly the sort of paper that it's reasonable for an organization like MIRI to be working on if they want to get some work started on FAI. At the very least if he disagreed I'd expect he'd do so in a way I'd have better luck engaging conversationally, or if not then I'd have two votes for 'please explore this issue' rather than one.
I feel again like you're trying to interpret the paper according to a different purpose from what it has. Like, I suspect that if you described what you thought a promising AGI research agenda was supposed to deliver on what sort of timescale, I'd say, "This paper isn't supposed to do that."
This part is clearer and I think I may have a better idea of where you're coming from, i.e., you really do think the entire field of AI hasn't come any closer to AGI, in which case it's much less surprising that you don't think the Tiling Agents paper is the very first paper ever to come closer to AGI. But this sounds like a conversation that someone else could have with you, because it's not MIRI-specific or FAI-specific. I also feel somewhat at a loss for where to proceed if I can't say "But just look at the ideas behind Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, that's obviously important conceptual progress because..." In other words, you see AI doing a bunch of things, we already mostly agree on what these sorts of surface real-world capabilities are, but after checking with some friends you've concluded that this doesn't mean we're less confused about AGI than we were in 1955. I don't see how I can realistically address that except by persuading your authorities; I don't see what kind of conversation we could have about that directly without being able to talk about specific AI things.
Meanwhile, if you specify "I'm not convinced that MIRI's paper has a good chance of being relevant to FAI, but only for the same reasons I'm not convinced any other AI work done in the last 60 years is relevant to FAI" then this will make it clear to everyone where you're coming from on this issue.
He wrote this about a year ago:
And later:
Without further context I see nothing wrong here. Superintelligences are Turing machines, check. You might need a 10^20 slowdown before that becomes relevant, check. It's possible that the argument proves too much by showing that a well-trained high-speed immortal dog can simulate Mathematica and therefore a dog is 'intellectually expressive' enough to understand integral calculus, but I don't know if that's what Scott means and principle of charity says I shouldn't assume that without confirmation.
EDIT: Parent was edited, my reply was to the first part, not the second. The second part sounds like something to talk with Scott about. I really think the "You're just as likely to get results in the opposite direction" argument is on the priors overstated for most forms of research. Does Scott think that work we do today is just as likely to decrease our understanding of P/NP as increase it? We may be a long way off from proving an answer but that's not a reason to adopt such a strange prior.
As it happens, I've been chatting with Scott about this issue recently, due to some comments he made in his recent quantum Turing machine paper:
I thought his second objection ("how could we know what to do about it?") was independent of his first objection ("AI seems farther away than the singularitarians tend to think"), but when I asked him about it, he said his second objection just followed from the first. So given his view that AI is probably centuries away, it seems really hard to know what could possibly help w.r.t. FAI. And if I thought AI was several centuries away, I'd probably have mostly the same view.
I asked Scott: "Do you think you'd hold roughly the same view if you had roughly the probability distribution over year of AI creation as I gave in When Will AI Be Created? Or is this part of your view contingent on AI almost certainly being several centuries away?"
He replied: "No, if my distribution assigned any significant weight to AI in (say) a few decades, then my views about the most pressing tasks today would almost certainly be different." But I haven't followed up to get more specifics about how his views would change.
And yes, Scott said he was fine with quoting this conversation in public.
I'm doing some work for MIRI looking at the historical track record of predictions of the future and actions taken based on them, and whether such attempts have systematically done as much harm as good.
To this end, among other things, I've been reading Nate Silver's The Signal and the Noise. In Chapter 5, he discusses how attempts to improve earthquake predictions have consistently yielded worse predictive models than the Gutenberg-Richter law. This has slight relevance.
Such examples notwithstanding, my current prior is on MIRI's FAI research having positive expected value. I don't think that the expected value of the research is zero or negative – only that it's not competitive with the best of the other interventions on the table.
My own interpretation of Scott's words here is that it's unclear whether your research is actually helping in the "get Friendly AI before some idiot creates a powerful Unfriendly one" challenge. Fundamental progress in AI in general could just as easily benefit the fool trying to build an AGI without too much concern for Friendliness, as it could benefit you. Thus, whether fundamental research helps avoid the UFAI catastrophe is unclear.
Yes, I would welcome his perspective on this.
I think I've understood your past comments on this point. My questions are about the implicit assumptions upon which the value of the research rests, rather than about what the research does or doesn't succeed in arguing.
As I said in earlier comments, the case for the value of the research hinges on its potential relevance to AI safety, which in turn hinges on how good the model is for the sort of AI that will actually be built. Here I don't mean "Is the model exactly right?" — I recognize that you're not claiming it to be — the question is whether the model is in the right ballpark.
A case for the model being a good one requires pointing to a potentially promising AGI research program to which the model is relevant. This is the point that I feel hasn't been addressed.
Some things that I see as analogous to the situation under discussion are:
Similarly, somebody without knowledge of the type of AI that's going to be built could research AI safety without the research being relevant to AI safety.
Does this help clarify where I'm coming from?
I'm open to learning object level material if I learn new information that convinces me that there's a reasonable chance that MIRI's FAI research is relevant to AI safety in practice.
Yes, this is where I'm coming from.
Just wondering why you see Jonah Sinick of high enough status to be worth explaining to what's been discussed on LW repeatedly. Or maybe I'm totally misreading this exchange.
I'm puzzled as to what you think I'm missing: can you say more?
Matching "first AGI will [probably] have internal structure analogous to that of a human" and "first AGI [will probably have] many interacting specialized modules" in a literal (cough uncharitable cough) manner, as evidenced by "heavier-than-air flying-machines had feathers and beaks". Your phrasing hints at an anthropocentric architectural bias, analogous to the one you specifically distance yourself from regarding values.
Maybe you should clarify that part; it's crucial to the current misunderstanding, and it's not clear whether by "interacting specialized modules" you'd also refer to "Java classes not corresponding to anything 'human' in particular", or whether you'd expect a "thalamus-module".
I think that people should make more of an effort to pay attention to the nuances of people's statements rather than using simple pattern matching.
There's a great deal to write about this, and I'll do so at a later date.
To give you a small taste of what I have in mind: suppose you ask "How likely is it that the final digit of the Dow Jones will be 2 in two weeks?" I've never thought about this question. A priori, I have no Bayesian prior. What my brain does is to amalgamate
Different parts of my brain generate the different pieces, and another part of my brain combines them. I'm not using a single well-defined Bayesian prior, nor am I satisfying a well-defined utility function.
I don't want to comment on the details, as this is way outside my area of expertise, but I do want to point out that you appear to be a victim of the bright dilettante fallacy. You appear to think that your significant mathematical background makes you an expert in an unrelated field without having to invest the time and effort required to get up to speed in it.
I don't claim to have any object level knowledge of AI.
My views on this point are largely based on what I've heard from people who work on AI, together with introspection as to how I and other humans reason, and the role of heuristics in reasoning.
Maybe something to do with Jonah being previously affiliated with GiveWell?
Let me try from a different angle.
With humans, we see three broad clusters of modification: reproduction, education, and chemistry. Different people are physically constructed in different ways, and so we can see evolution of human civilization by biological evolution of the humans inside it. The environments that people find themselves in or choose leave imprints on those people. Chemicals people ingest can change those people, such as with caffeine, alcohol, morphine, or heroin. (I would include 'changing your diet to change your thought processes' under chemical changes, but the chemical changes from becoming addicted to heroin and from not being creatine deficient look very different.)
For AIs, most of the modification that's interesting and new will look like the "chemistry" cluster. An AI modifying its source code will look a lot like a human injecting itself with a new drug that it just invented. (Nick_Beckstead's example of modifying the code of the weather computer is more like education than it is like chemistry.)
This is great because some drugs dramatically improve performance, and so a person on caffeine could invent a super nootropic, and then on the super nootropic invent a cure for cancer and an even better nootropic, and so on. This is terrifying because any drug that adjusts your beliefs or your decision-making algorithm (think of 'personality' as a subset of this) dramatically changes how you behave, and might do so for the worse. This is doubly terrifying because these changes might be irreversible- you might take a drug that gets rid of your depression by making you incapable of feeling desire, and then not have any desire to restore yourself! This is triply terrifying because the effects of the drug might be unknown- you might not be able to determine what a drug will do to you until after you take it, and by then it might be too late.
For humans this problem is mostly solved by trial and error followed by pattern matching- "coffee is okay, crack is not, because Colin is rich and productive and Craig is neither"- which is not useful for new drugs, and not useful for misclassified old drugs, and not very safe for very powerful systems. The third problem- that the effects might be unknown- is the sort of thing that proofs might help with, except there are some technical obstacles to doing that. The Lobstacle is a prominent theoretical one, and while it looks like there are lots of practical obstacles as well, surmounting the theoretical obstacles should help with surmounting the practical obstacles.
Any sort of AGI that's able to alter its own decision-making process will have the ability to 'do chemistry on itself,' and one with stable values will need to have solved the problem of how to do that while preserving its values. (I don't think that humans have 'stable' values; I'd call them something more like 'semi-stable.' Whether or not this is a bug or feature is unclear to me.)
I understand where you're coming from, and I think that you correctly highlight a potential source of concern, and one which my comment didn't adequately account for. However:
I'm skeptical that it's possible to create an AI based on mathematical logic at all. Even if an AI with many interacting submodules is dangerous, it doesn't follow that working on AI safety for an AI based on mathematical logic is promising.
Humans can impose selective pressures on emergent AIs so as to mimic the process of natural selection that humans experienced.
Eliezer's position is that the default mode for an AGI is failure; i.e. if an AGI is not provably safe, it will almost certainly go badly wrong. In that context, if you accept that "an AI with many interacting submodules is dangerous," then that's more or less equivalent to believing that one of the horribly wrong outcomes will almost certainly be achieved if an AGI with many submodules is created.
Humans are not Friendly. They don't even have the capability under discussion here, to preserve their values under self-modification; a human-esque singleton would likely be a horrible, horrible disaster.