Now that we have a decent grounding of what Yudkowsky thinks deep knowledge is for, the biggest question is how to find it, and how to know you have found good deep knowledge.
This is basically the thing that bothered me about the debates. Your solution seems to be the analogy Einstein:relativity :: Yudkowsky:alignment, which is basically hopeless. But in the debates, M. Yudkowsky over and over says, "You can't understand until you've done the homework, and I have, and you haven't, and I can't tell you what the homework is." It's a wall of text that can be reduced to, "Trust me."
He might be right about alignment, but under the epistemic standards he popularized, if I update in the direction of his view, the strength of the update must be limited to "M. Yudkowsky was right about some of these things in the past and seems pretty smart and to have thought a lot about this stuff, but even Einstein was mistaken about spooky action at a distance, or maybe he was right and we haven't figured it out yet, but, hey, quantum entanglement seems pretty real." In many ways, science just is publishing the homework so people can poke holes in it.
If Einstein came to you in 1906 (after general relativity) and stated the conclusion of the special relativity paper, and when you asked him how he knew, he said, "You can't understand until you've done the homework, and I have, and you haven't," which is all true from my experience studying the equations, "and I can't tell you what the homework is," the strength of your update would be similarly limited.
You might respond that M. Yudkowsky isn't trying to really convince anyone, but in that case, why debate? He's at least trying to get people to publish their AI findings less in order to burn less timeline.
> This is basically the thing that bothered me about the debates. Your solution seems to be the analogy Einstein:relativity :: Yudkowsky:alignment, which is basically hopeless. But in the debates, M. Yudkowsky over and over says, "You can't understand until you've done the homework, and I have, and you haven't, and I can't tell you what the homework is." It's a wall of text that can be reduced to, "Trust me."
> He might be right about alignment, but under the epistemic standards he popularized, if I update in the direction of his view, the strength of the update must be limited to "M. Yudkowsky was right about some of these things in the past and seems pretty smart and to have thought a lot about this stuff, but even Einstein was mistaken about spooky action at a distance, or maybe he was right and we haven't figured it out yet, but, hey, quantum entanglement seems pretty real." In many ways, science just is publishing the homework so people can poke holes in it.
I definitely feel you: that reaction was my big reason for taking so much time rereading his writing and penning this novel-length post.
The first thing I want to add is that after looking for discussions of this in the Sequences, they were there. So the uncharitable explanation of "he's hiding the homework/explanation because he knows he's wrong or doesn't have enough evidence" doesn't really work. (I don't think you're defending this, but it definitely crossed my mind and that of others I talked to.) I honestly believe Yudkowsky is saying in good faith that he has found deep knowledge and that he doesn't know how to share it in any way he hasn't already tried in his 13 years of writing about it.
The second thing is that I feel my post brings together enough bits of Yudkowsky's explanations of deep knowledge that we have at least a partial handle on how to check it? Quoting back my conclusion:
Yudkowsky sees deep knowledge as highly compressed causal explanations of “what sort of hypothesis ends up being right”. The compression means that we can rederive the successful hypotheses and theories from the causal explanation. Finally, such deep knowledge translates into partial constraints on hypothesis space, which focus the search by pointing out what cannot work.
So the check requires us to understand what sort of successful hypotheses he is compressing, whether that really is a compression into an underlying causal process that can be used to rederive those hypotheses, and whether the resulting constraint actually cuts away a decent chunk of hypothesis space when applied to other problems.
That's definitely a lot of work, and I can understand if people don't want to invest the time. But it seems different to me to have a potential check and say "I don't think this is a good time investment" than to say that there's no way to check the deep knowledge.
Lastly,
> If Einstein came to you in 1906 (after general relativity) and stated the conclusion of the special relativity paper, and when you asked him how he knew, he said, "You can't understand until you've done the homework, and I have, and you haven't," which is all true from my experience studying the equations, "and I can't tell you what the homework is," the strength of your update would be similarly limited.
I recommend reading Einstein's Speed and Einstein's Superpowers, which are the two posts where Yudkowsky tries to point out that if you look for it, it's possible to find where Einstein was coming from and the sort of deep knowledge he used. I agree it would be easier if the person leveraging the deep knowledge could state it succinctly enough that we could get it, but I also acknowledge that these sorts of fundamental principles from which other things derive are just plain hard to express. And even then, you need to do the homework.
(My disagreement with Yudkowsky here is that he seems to believe mostly in providing a lot of training data and examples so that people can see the deep knowledge for themselves, whereas I expect that most smart people would find it far easier to have a sort of pointer to the deep knowledge and what it is good for, and then go through a lot of examples).
I think you've identified a real through-line in Yudkowsky's work, one I hadn't noticed before. Thank you for that.
Even so, when you're trying to think about this sort of thing, I think it's important to remember that this:
> In our world, Einstein didn't even use the perihelion precession of Mercury, except for verification of his answer produced by other means. Einstein sat down in his armchair, and thought about how he would have designed the universe, to look the way he thought a universe should look—for example, that you shouldn't ought to be able to distinguish yourself accelerating in one direction, from the rest of the universe accelerating in the other direction.
...is not true. In the comments to Einstein's Speed, Scott Aaronson explains the real story: Einstein spent over a year going down a blind alley, and was drawn back by -- among other things -- his inability to make his calculations fit the observation of Mercury's perihelion motion. Einstein was able to reason his way from a large hypothesis space to a small one, but not to actually get the right answer.
(and of course, in physics you get a lot of experimental data for free. If you're working on a theory of gravity and it predicts that things should fall away from each other, you can tell right away that you've gone wrong without having to do any new experiments. In AI safety we are not so blessed.)
There's more I could write about the connection between this mistake and the recent dialogues, but I guess others will get to it and anyway it's depressing. I think Yudkowsky doesn't need to explain himself more, he needs a vacation.
Thanks for the kind and thoughtful comment!
> ...is not true. In the comments to Einstein's Speed, Scott Aaronson explains the real story: Einstein spent over a year going down a blind alley, and was drawn back by -- among other things -- his inability to make his calculations fit the observation of Mercury's perihelion motion. Einstein was able to reason his way from a large hypothesis space to a small one, but not to actually get the right answer.
> (and of course, in physics you get a lot of experimental data for free. If you're working on a theory of gravity and it predicts that things should fall away from each other, you can tell right away that you've gone wrong without having to do any new experiments. In AI safety we are not so blessed.)
That's a really good point. I didn't go into that debate in the post (because I tried not to criticize Yudkowsky, and also because the post is already way too long), but my take is: Yudkowsky probably overstates the case, but that doesn't mean he's wrong about the relevance to Einstein's work of the constraints and armchair reasoning (even if the armchair reasoning was building on more empirical evidence than Yudkowsky originally pointed out). As you say, Einstein apparently did reduce the search space significantly: he just failed to find exactly what he wanted in the reduced space directly.
My comment had an important typo, sorry: I meant to write that I hadn't noticed this through-line before!
I mostly agree with you re: Einstein, but I do think that removing the overstatement changes the conclusion in an important way. Narrowing the search space from (say) thousands of candidate theories to just 4 is a great achievement, but you still need a method of choosing among them, not just to fulfill the persuasive social ritual of Science but because otherwise you have a 3 in 4 chance of being wrong. Even someone who trusts you can't update that much on those odds. That's really different from being able to narrow the search space down to just 1 theory; at that point, we can trust you -- and better still, you can trust yourself! But the history of science doesn't, so far as I can tell, contain any "called shots" of this type; Einstein might literally have set the bar.
I think we disagree about Yudkowsky's conclusion: his point, IMO, is that Einstein was able to reduce the search space a lot. He overemphasizes for effect (and because it's more impressive to have someone who guesses right directly through these methods), but that doesn't change the fact that Einstein reduced the search space a lot (which you seem to agree with).
Many of the relevant posts I quoted talk about how the mechanisms of Science are fundamentally incapable of doing that, because they don't specify any constraints on hypotheses except that they must be falsifiable. Your point seems to be that in the end, Einstein still used the sort of experimental data and methods underlying traditional Science, and I tend to agree. But the mere fact that he was able to get the right answer out of millions of possible formulations by checking a couple of numbers should tell you that there was a massive hypothesis-space-reducing step before.
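A rough way to quantify that last point (a back-of-the-envelope sketch; the numbers are illustrative, not from the post):

```python
import math

def bits_needed(n_hypotheses: int) -> float:
    """Bits of evidence needed to single out 1 hypothesis among n
    equally plausible ones."""
    return math.log2(n_hypotheses)

# Picking the right formulation out of millions:
print(round(bits_needed(1_000_000), 1))  # 19.9 bits

# Verifying "a couple of numbers" supplies only a few bits of evidence,
# so the bulk of those ~20 bits had to come from a prior
# hypothesis-space-reducing step, not from the final check.
```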
Nah, we're on the same page about the conclusion; my point was more about how we should expect Yudkowsky's conclusion to generalize into lower-data domains like AI safety. But now that I look at it that point is somewhat OT for your post, sorry.
Besides invoking “Deep Knowledge” and the analogy of ruling out perpetual motion, another important tool for understanding AI foom risk is security mindset, which Eliezer has written about here.
Maybe this is tangential, but I don’t get why the AI foom debate isn’t framed more often as a matter of basic security considerations. AI foom risk seems like a straightforward security-mindset conclusion: I think AI is a risk to humanity for the same reason I think any website can be taken out by a hack if you put a sufficiently large bounty on it.
Humanity has all kinds of vulnerabilities that are exploitable by a team of fast simulated humans, not to mention Von Neumann simulations or superhuman AIs. There are so many plausible attack vectors by which to destroy or control humanity: psychology, financial markets, supply chains, biology, nanotechnology, just to name a few.
It’s very plausible that the AI gets away from us, runs in the cloud, self-improves, and we can’t turn it off. It’s like a nuclear explosion that may start slow, but it’s picking up speed, recursively self-improving or even just speeding up to the level of an adversarial Von Neumann team, and it’s hidden in billions of devices.
We have the example of nuclear weapons. The US was a singleton power for a few years due to developing nukes first. At least a nuclear explosion stops when it burns through its fissile material. AI doesn’t stop, and it’s a much more powerful adversary that will not be contained. It’s like the first nuclear pile you’re testing with has a yield much larger than Tsar Bomba. You try one test and then you’ve permanently crashed your ability to test.
So to summarize my security-mindset view: Humanity is vulnerable to hackers, without much ability to restore a backup once we get hacked, and it’s very easy to think AI becomes a great hacker soon.
As for the many attack vectors, I would also add "many places and stages where things can go wrong" once AI becomes a genius social and computer hacker. (By the way, I have heard that most hacks are carried out not by means of computer exploits but by means of social engineering, because a person is a much more unreliable and harder-to-patch system.)

From my point of view, the main problem is not even that the first piece of uranium explodes so hot that it melts the Earth. The problem is that there are 8 billion people on Earth, each with several electronic devices, and the processors (or batteries, for a more complete analogy) are made of californium. Now you have to hope that literally no one among 8 billion people will cause their device to explode (which is much worse than expecting that no one among a mere 1 million wizards will be prompted with the idea of transfiguring antimatter, botulinum toxin, thousands of infections, nuclear weapons, strangelets, or things like "only top quarks", which cannot be imagined at all); or that literally none of these reactions will propagate as a chain reaction through all the processors (which are also connected to a worldwide network operating by radiation) in the form of a direct explosion or neutron beams; or that you will be able to stop literally every explosive or neutron chain reaction.

We can loosely model this as three probabilities for each of the 8 billion people: that they will not fail each of the three points. Even if each probability is on average very high, we raise each of them to the power of 8 billion. Worse, these are all probabilities over some period of time, say a year, and the problem is that over time it is not even the probabilities that grow: the interval needed to create AI shortens, so that we get something like the difference between a geometric and an exponential progression.
Of course, one could say that we should not average over everyone, and that the relevant number should be smaller than the number of processors. But then the number of people who could interfere also shrinks, while the likelihood that one of them creates AI rises; and again, the problem is not that the chance of creating AI increases, but that the process becomes easier, so that more people have a higher chance of creating it, which is why I still count over all people. Finally, one could say that civilization will react when it sees not just smoke but fire. But civilization is not adequate, generally speaking. It did not take fire-prevention measures here, and it did not react to the smoke. It also showed how it would react with the example of the coronavirus: to "it's no more dangerous than the flu", "the graph is exponential? never mind", "it's all a conspiracy and not a real danger", and "I won't get vaccinated", we will simply add "it's all fiction / a cult", "AI is good", and so on.
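The compounding step in the argument above can be made concrete with a toy calculation (all numbers are invented for illustration, not estimates of real risk):

```python
def survival_probability(p_per_person: float, n_people: int) -> float:
    """Probability that every one of n_people independently avoids
    triggering a catastrophe, if each avoids it with probability
    p_per_person (independence is itself a generous assumption)."""
    return p_per_person ** n_people

N = 8_000_000_000  # rough world population

# Even an astronomically reliable per-person probability compounds badly:
print(survival_probability(1 - 1e-9, N))   # ~0.0003
print(survival_probability(1 - 1e-11, N))  # ~0.92
```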
Yes, I do quote the security mindset in the post.
I feel you're quite overstating the ability of the security mindset to show FOOM though. The reason it's not presented as a direct consequence of a security mindset is... because it's not one?
Like, once you are convinced of the strong possibility and unavoidability of AGI and superintelligence (maybe through FOOM arguments), then the security mindset actually helps you, and combining it with deep knowledge (like the Orthogonality Thesis) lets you find a lot more ways of breaking "humanity's security". But the security mindset applied without arguments for AGI doesn't let you postulate AGI, for the same reason that the security mindset without arguments about mind-reading doesn't let you postulate that the hackers might read the password in your mind.
For me the security mindset frame comes down to two questions:
1. Can we always stop AI once we release it?
2. Can we make the first unstoppable AI do what we want?
To which I'd answer "no" and "only with lots of research".
Without security mindset, one tends to think an unstoppable AI is a-priori likely to do what humans want, since humans built it. With security mindset, one sees that most AIs are nukes that wreak havoc on human values, and getting them to do what humans want is analogous to building crash-proof software for a space probe, except the whole human race only gets to launch one probe and it goes to whoever launches it first.
I'd like to see this kind of discussion with someone who doesn't agree with MIRI's sense of danger, in addition to all the discussions about how to extrapolate trends and predict development.
> Without security mindset, one tends to think an unstoppable AI is a-priori likely to do what humans want, since humans built it. With security mindset, one sees that most AIs are nukes that wreak havoc on human values, and getting them to do what humans want is analogous to building crash-proof software for a space probe, except the whole human race only gets to launch one probe and it goes to whoever launches it first.
I think this is a really shallow argument that enormously undersells the actual reasons for caring about alignment. We have actual arguments for why unstoppable AIs are not likely to do what humans want, and they don't need the security mindset at all. The basic outline is something like:
This line of reasoning (which is not new by any means; it's basically straight out of Bostrom and early Yudkowsky's writing) justifies the security mindset for AGI and alignment. Not the other way around.
(And historically, Yudkowsky wanted to build AGI before he found out about these points, which turned him into the biggest user — but not the only one by any means — of the security mindset in alignment.)
Ok I agree there are a bunch of important concepts to be aware of, such as complexity of value, and there are many ways for security mindset by itself to fail at flagging the extent of AI risk if one is ignorant of some of these other concepts.
I just think the outside view and extrapolating trends are so far from how one should reason about mere nukes, and superhuman intelligence is very nuke-like, or at least has a very high chance of being nuke-like: that is, it unlocks unprecedentedly large, rapid, irreversible effects. Extrapolating from current trends would have been quite unhelpful to nuclear safety. I know Eliezer is just trying to meet other people in the discussion where they are, but it would be nice to have another discussion that seems more on-topic from Eliezer’s own perspective.
For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:
> The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.
This seems like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specific gun operations can be deadly and why.)
I'm not saying Eliezer's conclusion is false. I find other arguments for that conclusion much more persuasive, e.g. involving mesa-optimizers, because there is a proposed failure type which I understand in causal/mechanistic terms.
(I can provide other examples of shallow-seeming arguments if desired.)
I agree that it's a shallow argument presentation, but that's not the same thing as being based on shallow ideas. The context provided more depth, and in general a fair few of the shallowly presented arguments seem to be counters to even more shallow arguments.
In general one of the deeper concepts underlying all these shallow arguments appears to be some sort of thesis of "AGI-completeness", in which any single system that can reach or exceed human mental capability on most tasks, will almost certainly reach or exceed on all mental tasks, including deceiving and manipulating humans. Combining that with potentially very much greater flexibility and extensibility of computing substrate means you get an incredibly dangerous situation no matter how clever the designers think they've been.
One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan-executing hardware. You don't need a deep argument to point out an obvious flaw there. Talking about mesa-optimizers in such a context is just missing the point from a view in which humans can potentially be used as part of a toolchain in much the same way as robot arms or protein factories.
> One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware. You don't need a deep argument to point out an obvious flaw there.
I don't see the "obvious flaw" you're pointing at and would appreciate a more in-depth explanation.
In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this:
1. You ask your AGI to generate a plan for how it could maximize paperclips.
2. Your AGI generates a plan. "Step 1: Manipulate human operator into thinking that paperclips are the best thing ever, using the following argument..."
3. You stop reading the plan at that point, and don't click "execute" for it.
I had the same view as you, and was persuaded out of it in this thread. Maybe to shift focus a little, one interesting question here is about training. How do you train a plan-generating AI? If you reward plans that sound like they'd succeed, regardless of how icky they seem, then the AI will become useless to you by outputting effective-sounding but icky plans. But if you reward only plans that look nice enough to execute, that tempts the AI to make plans that manipulate whoever is reading them, and we're back at square one.
Maybe that's a good way to look at the general problem. Instead of talking about AI architecture, just say we don't know any training methods that would make AI better than humans at real world planning and safe to interact with the world, even if it's just answering questions.
I agree these are legitimate concerns... these are the kind of "deep" arguments I find more persuasive.
In that thread, johnswentworth wrote:
> In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.
I'd solve this by maintaining uncertainty about the "reward signal", so the AI tries to find a plan which looks good under both alignment and the actual-process-which-generates-the-reward-signal. (It doesn't know which is which, but it tries to learn a sufficiently diverse set of reward signals such that alignment is in there somewhere. I don't think we can do any better than this, because the entire point is that there is no way to disambiguate between alignment and the actual-process-which-generates-the-reward-signal by gathering more data. Well, I guess maybe you could do it with interpretability or the right set of priors, but I would hesitate to make those load-bearing.)
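A minimal toy sketch of that proposal, assuming plans can be scored under several candidate reward signals (the plans, reward functions, and numbers here are all hypothetical):

```python
# Keep several candidate reward functions -- we can't tell which one is
# "alignment" and which is merely the process that generates the reward
# signal -- and only accept plans that look good under all of them.

candidate_rewards = [
    lambda plan: plan["paperclips"],                      # literal objective
    lambda plan: plan["paperclips"] - 10 * plan["harm"],  # one guess at alignment
    lambda plan: -plan["harm"],                           # a very cautious guess
]

def robust_score(plan: dict) -> float:
    """Worst-case score across the ensemble: a plan must look good under
    every candidate reward signal, not just the one it can game."""
    return min(reward(plan) for reward in candidate_rewards)

plans = [
    {"paperclips": 100, "harm": 50},  # effective but icky
    {"paperclips": 10, "harm": 0},    # modest and benign
]
best = max(plans, key=robust_score)
print(best)  # {'paperclips': 10, 'harm': 0}
```

The design choice here is simply worst-case aggregation: a plan that exploits one reward signal gets vetoed by another, which captures the "looks good under both alignment and approval" requirement.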
(BTW, potentially interesting point I just thought of. I'm gonna refer to actual-process-which-generates-the-reward-signal as "approval". Supposing for a second that it's possible to disambiguate between alignment and approval somehow, and we successfully aim at alignment and ignore approval. Then we've got an AI which might deliberately do aligned things we disapprove of. I think this is not ideal, because from the outside this behavior is also consistent with an AI which has learned approval incorrectly. So we'd want to flip the off switch for the sake of caution. Therefore, as a practical matter, I'd say that you should aim to satisfy both alignment and approval anyways. I suppose you could argue that on the basis of the argument I just gave, satisfying approval is therefore part of alignment and thus this is an unneeded measure, but overall the point is that aiming to satisfy both alignment and approval seems to have pretty low costs.)
(I suppose technically you can disambiguate between alignment and approval if there are unaligned things that humans would approve of -- I figure you solve this problem by making your learning algorithm robust against mislabeled data.)
Anyway, you could use a similar approach for the nice plans problem, or you could formalize a notion of "manipulation" which is something like: conditional on the operator viewing this plan, does their predicted favorability towards subsequent plans change on expectation?
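The manipulation test just proposed could be sketched like this (purely illustrative; the favorability estimates are assumed to come from some predictive model of the operator):

```python
def manipulation_score(favorability_before: float,
                       favorability_after_viewing: float) -> float:
    """How much does merely *viewing* the plan shift the operator's
    expected favorability toward subsequent plans? Large shifts get
    flagged as potential manipulation."""
    return abs(favorability_after_viewing - favorability_before)

THRESHOLD = 0.1  # hypothetical tolerance

# A plan whose mere presentation swings the operator's attitude a lot
# is rejected before anyone acts on it:
print(manipulation_score(0.5, 0.9) > THRESHOLD)   # True  -> reject
print(manipulation_score(0.5, 0.52) > THRESHOLD)  # False -> acceptable
```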
Edit: Another thought is that the delta between "approval" and "alignment" seems like the delta between me and my CEV. So to get from "approval" to "alignment", you could ask your AI to locate the actual-process-which-generates-the-labels, and then ask it about how those labels would be different if we "knew more, thought faster, were more the people we wished we were" etc. (I'm also unclear why you couldn't ask a hyper-advanced language model what some respected moral philosophers would think if they were able to spend decades contemplating your question or whatever.)
Another edit: You could also just manually filter through all the icky plans until you find one which is non-icky.
(Very interested in hearing objections to all of these ideas.)
The main problem is that "acting via plans that are passed to humans" is not much different from "acting via plans that are passed to robots" when the AI is good enough at modelling humans.
I don't think this needs an in-depth explanation, does it?
> In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this: [...]
I don't think the given scenario is realistic for any sort of competent AI. There are two sub-cases:
If step 1 won't fail due to being read, then the scenario is unrealistic at the "you stop reading the plan at that point" stage. This might be possible for a sufficiently intelligent AI, but that's already a game over case.
If step 1 will fail due to the plan being read, a competent AI should be able to predict that step 1 will fail due to being read. The scenario is then unrealistic at the "your AGI generates a plan ..." stage, because it should be assumed that the AI won't produce plans that it predicts won't work.
So this leaves only the assumption that the AI is terrible at modelling humans, but can still make plans that should work well in the real world where humans currently dominate. Maybe there is some tiny corner of possibility space where that can happen, but I don't think it contributes much to the overall likelihood unless we can find a way to eliminate everything else.
> The main problem is that "acting via plans that are passed to humans" is not much different from "acting via plans that are passed to robots" when the AI is good enough at modelling humans.
I agree this is true. But I don't see why "acting via plans that are passed to humans" is what would happen.
I mean, that might be a component of the plan which is generated. But the assumption here is that we've decoupled plan generation from plan execution successfully, no?
So we therefore know that the plan we're looking at (at least at the top level) is the result of plan generation, not the first step of plan execution (as you seem to be implicitly assuming?)
The AI is searching for plans which score highly according to some criteria. The criteria of "plans which lead to lots of paperclips if implemented" is not the same as the criteria of "plans which lead to lots of paperclips if shown to humans".
My point is that plan execution can't be decoupled successfully from plan generation in this way. "Outputting a plan" is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
Also, I think the last sentence is literally true, but misleading. Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.
> My point is that plan execution can't be decoupled successfully from plan generation in this way. "Outputting a plan" is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.
"Outputting a plan" may technically constitute an action, but a superintelligent system (defining "superintelligent" as being able to search large spaces quickly) might not evaluate its effects as such.
> Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.
I think you're making a lot of assumptions here. For example, let's say I've just created my planner AI, and I want to test it out by having it generate a paperclip-maximizing plan, just for fun. Is there any meaningful sense in which the displayed plan will be optimized for the criteria "plans which lead to lots of paperclips if shown to humans"? If not, I'd say there's an important effective difference.
If the superintelligent search system also has an outer layer that attempts to collect data about my plan preferences and model them, then I agree there's the possibility of incorrect modeling, as discussed in this subthread. But it seems anthropomorphic to assume that such a search system must have some kind of inherent real-world objective that it's trying to shift me towards with the plans it displays.
Yes, if you've just created it, then the criteria are meaningfully different in that case for a very limited time.
But we're getting a long way off track here, since the original question was about what the flaw is with separating plan generation from plan execution as a general principle for achieving AI safety. Are you clearer about my position on that now?
> Yes, if you've just created it, then the criteria are meaningfully different in that case for a very limited time.
It's not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?
I don't see how we're getting off track. (Your original statement was: 'One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.' If we're discussing situations where that claim may be false, it seems to me we're still on track.) But you shouldn't feel obligated to reply if you don't want to. Thanks for your replies so far, btw.
What changes is that the human sees that the AI is producing plans that try to manipulate humans. It is very likely that the human does not want the AI to produce such plans, and so applies some corrective action against it happening in future.
After first read-through of your post the main thing that stuck with me was this:
> But the thing is… rereading part of the Sequences, I feel Yudkowsky was making points about deep knowledge all along? Even the quote I just used, which I interpreted in my rereading a couple of weeks ago as being about making predictions, now sounds like it’s about the sort of negative form of knowledge that forbids “perpetual motion machines”.
This gives me an icky feeling.
(low confidence in the following parts of this comment)
It makes me think of the Bible. The "specifications" laid out in the Bible are loosey-goosey enough that believers can always re-interpret such-and-such verse to actually mean whatever newer evidence permits. (I want to stress that I'm not drawing a parallel between unthinking Christian believers and anyone changing their belief based upon new evidence! I'm drawing a parallel between the difficult task of writing text designed to change future behavior.)
If it's so loosey-goosey then what's it good for?
That's most definitely not to say that anything you can re-interpret in the light of new evidence is full of shit. However, you've got to have a good, solid explanation for the discrepancy between your earlier and later interpretations. The importance, and difficulty, of producing this explanation probably depends on whether we're talking about a quantitative physics experiment or a complicated tome of reasoning, philosophy, and rhetoric. The complicated-tome case is important and hard because it's so very difficult to convey our most complicated thoughts in ways explicit enough that they can't be interpreted in a multitude of ways.
I think producing the explanation of the discrepancy between earlier and later interpretations is likely full of cognitive booby traps.
I find myself confused by this comment. I'm going to try voicing this confusion as precisely as possible, so you can hopefully clarify it for me.
I am confused that you get an icky feeling from basically the most uncontroversial part of my post and Yudkowsky's point. The part you're quoting is just saying that Yudkowsky cares more about anticipation-constraining than about predictions. Of course, predictions are a particular type of very strong anticipation-constraining, but saying "this is impossible" is not wishy-washy fake specification: if the impossible thing is done, that invalidates your hypothesis. So "no perpetual motion machines" is definitely anticipation-constraining in that sense, and can readily be falsified.
I am confused because this whole anticipation-constraining, especially saying what can't be done, is very accepted in traditional Science. Yudkowsky says that Science Isn't Strict Enough because it admits any anticipation-constraining hypothesis to the rank of "acceptable hypothesis": if it's wrong, it will eventually be falsified.
I am confused because you keep comparing deep knowledge with the sort of conclusions that can always be reinterpreted from new evidence, when my post goes into a lot of detail about how Yudkowsky writes about the anticipation-constraining aspect and how to be stricter with your hypotheses, not just granting any non-disproved hypothesis the same level of credibility.
Also I feel that I should link to this post, where Yudkowsky argues that the whole "religion is non-falsifiable" idea is actually a modern invention that it doesn't make sense to retrofit onto the past.
Now I'm confused about why you're confused!
I'll say a few different things and see if it helps:
I find myself unsatisfied with the content of this comment, but as of right now I'm not sure how to better convey my thoughts. On the other hand I don't want to ignore your comment, so here's hoping this helps rather than hinders.
Oh no, confusion is going foom!
Joke aside, I feel less confused after your clarifications. I think the issue is that it wasn't clear at all to me that you were talking about the whole "interpreting Yudkowsky" schtick as the icky feeling.
Now it makes sense, and I definitely agree with you that there are enormous parallels with Biblical analysis. Yudkowsky's writing is very biblical in some ways IMO (the parables and the dialogues), and in general is far more literary than 99% of the rat writing out there. I'm not surprised he found HPMOR easy to write; his approach to almost everything seems like a mix of literary fiction and science-fiction tropes/ideas.
Which is IMO why this whole interpretation is so important. More and more, I think I'm understanding why so many people get frustrated with Yudkowsky's writing and points: because they come expecting essays with arguments and a central point, and instead they get a literary text that requires strong interpretation before revealing what it means. I expect your icky feeling to come from the same place.
(Note that I think Yudkowsky is not doing that to be obscure, but for a mix of "it's easier for him" and "he believes that you only learn and internalize the sort of knowledge he's trying to convey through this interpretative labor, if not on the world itself, at least on his text".)
Also, as a clarifier: I'm not comparing the content of literary fiction or the Bible to Yudkowsky's writing. Generally with analysis of the former, you either get mysterious answers or platitudes; more and more with Yudkowsky I'm getting what I feel are deep insights (and his feedback on this post makes me think that I'm not off the mark by much for some of those).
Great investigation/clarification of this recurring idea from the ongoing Late 2021 MIRI Conversations.
You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.
Good point and we should. Eliezer is a valuable source of ideas and experience around alignment, and it seems like he's contributed immensely to this whole enterprise.
I just hope all his smack talking doesn't turn off/away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it like me after reading all Open Phil and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!
Thanks for the comment, and glad it helped you. :)
- outside vs. inside view - I've thought about this before but hadn't read this clear a description of the differences and tradeoffs before (still catching up on Eliezer's old writings)
My inner Daniel Kokotajlo is very emphatically pointing to that post about all the misuses of the term "outside view". Actually, Daniel commented on my draft that he definitely didn't think that Hanson was using the real outside view, AKA reference class forecasting, in the FOOM debate, and that, as Yudkowsky points out, reference class forecasting just doesn't seem to work for AGI prediction and alignment.
I just hope all his smack talking doesn't turn off/away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it like me after reading all Open Phil and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!
Yeah, I definitely think there are and will be bad consequences. My point is not that I think this is a good idea, just that I understand better where Yudkowsky is coming from, and can empathize more with his frustration.
I feel the most dangerous aspect of the smack talking is that it makes people not want to listen to him, and just see him as a smack talker with nothing to add. That was my reaction when reading the first discussions, and I had to explicitly notice that my brain was going from "This guy is annoying me so much" to "He's wrong", which is basically status-fueled "deduction". So I went looking for more. But I completely understand the people, especially those who are doing a lot of work in alignment, being just "I'm not going to stop my valuable work to try to understand someone who's just calling me a fool and is unable to voice their arguments in a way I understand."
Here is an exploration of what Eliezer Yudkowsky means when he writes about deep vs shallow patterns (although I’ll be using "knowledge" instead of "pattern" for reasons explained in the next section). Not about any specific pattern Yudkowsky is discussing, mind you, but about what deep and shallow patterns are at all. In doing so, I don’t make any criticism of his ideas and instead focus on quoting him (seriously, this post is like 70% quotes) and interpreting him by finding the best explanation I can of his words (that still fits them, obviously). Still, there’s a risk that my interpretation misses some of his points and ideas — I’m building a lower bound on his argument’s power that is as high as I can get, not an upper bound. Also, I might just be completely wrong, in which case defer to Yudkowsky if he points out that I’m completely missing the point.
Thanks to Eliezer Yudkowsky, Steve Byrnes, John Wentworth, Connor Leahy, Richard Ngo, Kyle, Laria, Alex Turner, Daniel Kokotajlo and Logan Smith for helpful comments on a draft.
Back to the FOOM: Yudkowsky’s explanation
In recent discussions, Yudkowsky often talks about deep patterns and deep thinking. What he made clear in a comment on this draft is that he has been using the term “deep patterns” in two different ways:
Focusing on deep knowledge then, Yudkowsky recently seems to ascribe his interlocutors’ failure to grasp his point to their inability to grasp different instances of deep knowledge.
(All quotes from Yudkowsky if not mentioned otherwise)
(From the first discussion with Richard Ngo)
That being said, he doesn’t really explain what this sort of deep knowledge is.
(From the same discussion with Ngo)
The thing is, he did exactly that in the FOOM debate with Robin Hanson 13 years ago. (For those unaware of this debate, Yudkowsky is responding to Hanson’s use of trend extrapolations — like Moore’s law — to think about the intelligence explosion.)
(From The Weak Inside View (2008))
An important subtlety here comes from the possible conflation of two uses of “surface”: the implicit use of “surface knowledge” as the consequences of some underlying causal processes/generator, and the explicit use of “surface knowledge” as drawing similarities without thinking about the causal process generating them. To simplify the discussion, let’s use the more modern idiom of “shallow” for the more explicit sense here.
So what is Yudkowsky pointing at? Two entangled things:
Imagine a restaurant that has a dish you really like. The last 20 times you went to eat there, the dish was amazing. So should you expect that the next time it will also be great? Well, that depends on whether anything in the kitchen changes. Because you don’t understand what makes the dish great, you don’t know the most important aspects of the causal generators. So if they can’t buy their meat/meat-alternative at the same place, maybe that will change the taste; if the cook is replaced, maybe that will change the taste; if you go at a different time of the day, maybe that will change the taste.
You’re incapable of extending your trend (except by replicating all the conditions) to make a decent prediction because you don’t understand where it comes from. If on the other hand you knew why the dish was so amazing (maybe it’s the particular seasoning, or the chef’s touch), then you could estimate its quality. But then you’re not using the trend, you’re using a model of the underlying causal process.
Here is another phrasing by Yudkowsky from the same essay:
More generally, these quotes point to what Yudkowsky means when he says “deep knowledge”: the sort of reasoning that focuses on underlying causal models.
As he says himself:
Before going deeper into how such deep knowledge/Weak Inside View works and how to build confidence in it, I want to touch upon the correspondence between this kind of thinking and the Lucas Critique in macroeconomics. This link has been pointed out in the comments of the recent discussions — we thus shouldn’t be surprised that Yudkowsky wrote about it 8 years ago (yet I was surprised by this).
(From Intelligence Explosion Microeconomics (2013))
and later in that same essay:
This last sentence in particular points out another important feature of deep knowledge: that it might be easier to say negative things (like “this can’t work”) than precise positive ones (like “this is the precise law”) because the negative thing can be something precluded by basically all coherent/reasonable causal explanations, while they still disagree on the precise details.
Let’s dig deeper into that by asking more generally what deep knowledge is useful for.
How does deep knowledge work?
We now have a pointer (however handwavy) to what Yudkowsky means by deep knowledge. Yet we have very few details at this point about what this sort of thinking looks like. To improve that situation, the next two subsections explore two questions about the nature of deep knowledge: what is it for, and where does it come from?
The gist of this section is that:
What is deep knowledge useful for?
The big difficulty that comes up again and again, in the FOOM debate with Hanson and the discussion with Ngo and Christiano, is that deep knowledge doesn’t always lead to quantitative predictions. That doesn’t mean that the deep knowledge isn’t quantitative itself (expected utility maximization is an example used by Yudkowsky that is completely formal and quantitative), but that the causal model only partially constrains what can happen. That is, it doesn’t constrain enough to make precise quantitative predictions.
Going back to his introduction of the Weak Inside View, recall that he wrote:
He follows up writing:
Let’s summarize it this way: deep knowledge only partially constrains the surface phenomena it describes (constraints which translate into quantitative predictions), and it takes a lot of detailed deep knowledge (and often data) to refine it enough to pin down the phenomenon exactly and make precise quantitative predictions. Alignment and AGI are fields where we don’t have that much deep knowledge, and the data is sparse, and thus we shouldn’t expect precise quantitative predictions anytime soon.
Of course, just because a prediction is qualitative doesn’t mean it comes from deep knowledge; all hand-waving isn’t wisdom. For a good criticism of shallow qualitative reasoning in alignment, let’s turn to Qualitative Strategies of Friendliness.
The shallow qualitative reasoning criticized here relies too much on human common sense and superiority to the AI, when the situation to predict is about superintelligence/AGI. That is, this type of qualitative reasoning extrapolates across a change in causal generators.
On the other hand, Yudkowsky uses qualitative constraints to guide his criticism: he knows there’s a problem because the causal model forbids that kind of solution. Just like the laws of thermodynamics forbid perpetual motion machines.
Deep qualitative reasoning starts from the underlying (potentially quantitative) causal explanations and mostly tells you what cannot work or what cannot be done. That is, deep qualitative reasoning points out that a whole swath of the search space is not going to yield anything. A related point is that Yudkowsky rarely (AFAIK) makes predictions, even qualitative ones. He sometimes admits that he might make some, but it feels more like a compromise with the prediction-centered other person than what the deep knowledge is really for. Whereas he constantly points out how certain things cannot work.
(From Qualitative Strategies of Friendliness (2008))
(From the second discussion with Ngo)
(From Security Mindset and Ordinary Paranoia (2017))
Or my reading of the whole discussion with Christiano, which is that Christiano constantly tries to get Yudkowsky to make a prediction, but the latter focuses on aspects of Christiano’s model and scenario that don’t fit his (Yudkowsky’s) deep knowledge.
I especially like the perpetual motion machines analogy, because it drives home how just proposing a tweak/solution without understanding Yudkowsky’s deep knowledge (and what it would take for it to not apply) has almost no chance of convincing him. Because if someone said they built a perpetual motion machine without discussing how they bypass the laws of thermodynamics, every scientifically literate person would be doubtful. On the other hand, if they seemed to be grappling with thermodynamics and arguing for a plausible way of winning, you’d be significantly more interested.
(I feel like Bostrom’s Orthogonality Thesis is a good example of such deep knowledge in alignment that most people get, and I already argued elsewhere that it serves mostly to show that you can’t solve alignment by just throwing competence at it — also note that Yudkowsky had the same pattern earlier/in parallel, and is still using it.)
To summarize: the deep qualitative thinking that Yudkowsky points out by saying “deep knowledge” is the sort of thinking that cuts off a big chunk of possibility space, that is, tells you the whole chunk cannot work. It also lets you judge, from the way people propose a solution (whether they tackle the deep pattern or not), whether you should ascribe a decent probability to them being right.
A last note in this section: although deep knowledge primarily leads to negative conclusions, it can also lead to positive knowledge through a particularly Bayesian mechanism: if the deep knowledge destroys every known hypothesis/proposal except one (or a small number of them), then that is strong evidence for the ones left.
(This quote is more obscure than the others without the context. It’s from Intelligence Explosion Microeconomics (2013), and discusses the last step in a proposal for formalizing the sort of deep insight/pattern Yudkowsky leveraged during the FOOM debate. If you’re very confused, I feel like the most relevant part to my point is the bold last sentence.)
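To make that Bayesian mechanism concrete, here is a toy numeric sketch (my own illustration, with made-up hypothesis labels, not anything from Yudkowsky's texts): if we treat a “this cannot work” verdict as a likelihood of zero, then eliminating most of a hypothesis space redistributes its probability mass onto the few survivors.

```python
# Toy illustration (my own, not from the post): a "this cannot work"
# verdict acts as a 0/1 likelihood, so eliminating hypotheses
# redistributes their probability mass onto the survivors.

def eliminate(prior, allowed):
    """Zero out forbidden hypotheses, then renormalize (Bayes' rule
    with likelihood 1 for allowed hypotheses and 0 otherwise)."""
    surviving = {h: p for h, p in prior.items() if h in allowed}
    total = sum(surviving.values())
    return {h: p / total for h, p in surviving.items()}

# 100 hypotheses under a uniform prior: each starts at 1%.
prior = {f"H{i}": 1 / 100 for i in range(100)}

# Deep knowledge forbids everything outside these four.
posterior = eliminate(prior, {"H0", "H1", "H2", "H3"})

print(posterior["H0"])  # ~0.25: each survivor jumps from 1% to 25%
```

The point is only the arithmetic: ruling out 96 hypotheses is what moves each survivor from a negligible 1% to a substantial 25%, which is how purely negative knowledge becomes positive evidence for what remains.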
Where does deep knowledge come from?
Now that we have a decent grounding of what Yudkowsky thinks deep knowledge is for, the biggest question is how to find it, and how to know you have found good deep knowledge. After all, maybe the causal models one assumes are just bad?
This is the biggest difficulty that Hanson, Ngo, and Christiano seemed to have with Yudkowsky’s position.
(Robin Hanson, from the comments after Observing Optimization in the FOOM Debate)
(Richard Ngo from his second discussion with Yudkowsky)
(Paul Christiano from his discussion with Yudkowsky)
Note that these attitudes make sense. I especially like Ngo’s framing. Falsifiable predictions (even just postdictions) are the cornerstone of evaluating hypotheses in Science. It even feels to Ngo (as it felt to me) that Yudkowsky argued for exactly that in the Sequences:
(Ngo from his second discussion with Yudkowsky)
(And Yudkowsky himself from Making Belief Pay Rent (In Anticipated Experience))
But the thing is… rereading part of the Sequences, I feel Yudkowsky was making points about deep knowledge all along? Even the quote I just used, which I interpreted in my rereading a couple of weeks ago as being about making predictions, now sounds like it’s about the sort of negative form of knowledge that forbids “perpetual motion machines”. Notably, Yudkowsky is very adamant that beliefs must tell you what cannot happen. Yet that doesn’t at all imply making predictions of the form “this is how AGI will develop”, so much as saying things like “this approach to alignment cannot work”.
Also, should I point out that there’s a whole sequence dedicated to the ways rationality can do better than science? (Thanks to Steve Byrnes for the pointer). I’m also sure I would find a lot of relevant stuff by rereading Inadequate Equilibria too, but if I wait to have reread everything by Yudkowsky before posting, I’ll be there a long time…
My Initial Mistake and the Einstein Case
Let me jump here with my best guess of Yudkowsky’s justification of deep knowledge: their ability to both
The thing is, I got it completely wrong initially. Reading Einstein’s Arrogance (2007), an early Sequences post that is all about saying that Einstein had excellent reasons to believe in General Relativity’s correctness before experimental verification (of advance predictions), I thought that relativity was the deep knowledge, and that Yudkowsky was pointing out how Einstein, having found an instance of true deep knowledge, could allow himself to be more confident than the social process of Science would permit in the absence of experimental justification.
Einstein’s Speed (2008) made it clear that I had been looking at the moon when I was supposed to see the pointing finger: the deep knowledge Yudkowsky pointed out was not relativity itself, but what let Einstein single it out by a lot of armchair reasoning and better use of what was already known.
More generally, I interpret the whole Science and Rationality Sequence as explaining how deep knowledge can let rationalists do something that isn’t in the purview of traditional Science: estimate which hypotheses make sense before the experimental predictions and evidence come in.
(From Faster Than Science (2008))
There’s a subtlety that is easy to miss: Yudkowsky isn’t saying that merely specifying a hypothesis in a large answer space makes it well-supported. After all, you can just generate any random guess. What he’s pointing at is that to ascribe a decent amount of probability to a specific hypothesis in a large space through updating on evidence, you need to cut away a whole swath of the space and redirect its probability onto your hypothesis. And from a purely computational perspective, this whittling down of hypotheses requires more work than making the favored hypothesis certain enough through experimental verification.
His claim then seems to be that Einstein, and other scientists who tended to “guess right” at what would later be experimentally confirmed, couldn’t have been just lucky — they must have found ways of whittling down the vastness of hypothesis space, so that they had any chance of proposing something that was potentially right.
Yudkowsky gives some pointers to what he thinks Einstein was doing right.
(From Einstein’s Speed (2008))
So in that interpretation, Einstein learned from previous physics and from thought experiments how to cut away the parts of the hypothesis space that didn’t sound like they could make good physical laws, until he was left with a small enough subspace that he could find the right fit by hand (even if that took him 10 years).
In summary, deep knowledge doesn’t come in the form of a particularly neat hypothesis or compression; it is the engine of compression itself. Deep knowledge compresses “what sort of hypothesis tends to be correct”, such that it can be applied to the search for a correct hypothesis at the object level. That also cements the idea that deep knowledge gives constraints, not predictions: you don’t expect to have a criterion for correct hypotheses so strong that, given a massive hypothesis space, you can pinpoint the correct one.
Here it is good to generalize my previous mistake; recall that I took General Relativity for the deep knowledge, when it was actually the sort of constraints on physical laws that Einstein used to even find General Relativity. Why? I can almost hear Yudkowsky answering in my head: because General Relativity is the part accepted and acknowledged by Science. I don’t think it’s the only reason, but there’s an element of truth: I privileged the “proper” theory with experimental validation over the vaguer principles and concepts that led to it.
A similar mistake is to believe the deep knowledge is the theory, when it actually is what the theory and the experiments unearthed. This is how I understand Yudkowsky’s use of thermodynamics and evolutionary biology: he points at the deep knowledge that led to, and was revealed by, the work on these theories, more than at the theories themselves.
Compression and Fountains of Knowledge
We still don’t have a good way of finding and checking deep knowledge, though. Not any constraint on hypothesis space is deep knowledge, or even knowledge at all. The obvious idea is to have a reason for that constraint. And the reason Yudkowsky goes for almost every time is compression. Not a compressed description, like Moore’s law; nor a “compression” that is as complex as the pattern of hypotheses it’s trying to capture. Compression in the sense that you get a simpler constraint that can get you most of the way to regenerating the knowledge you started from.
This view of the importance of compression is everywhere in the Sequences. A great example is Truly Part of You, which asks what knowledge you could rederive if it was deleted from your mind. If you have a deep understanding of the subject, and you keep recursively asking how a piece of knowledge could be rederived and then how “what’s needed for the derivation” can be rederived, Yudkowsky argues that you will reach “fountains of knowledge”. Or, in the terminology of this post, deep knowledge.
What do these fountains look like? They’re not the fundamental theories themselves, but instead their underlying principles. Stuff like the principle of least action, Noether’s theorem, and the principles underlying Statistical Mechanics (I don’t know enough about it to name them). They are the crystallized insights which constrain the search space enough that we can rederive what we knew from them.
(Feynman might have agreed, given that he chose the atomic hypothesis/principle, “all things are made of atoms — little particles that move around in perpetual motion, attracting each other when they are a little distance apart, but repelling upon being squeezed into one another”, as the one sentence he would salvage for future generations in case of a cataclysm.)
Here I hear a voice in my mind saying “What does simple mean? Shouldn’t it be better defined?” Yet this doesn’t feel like a strong objection. Simple is tricky to define intensionally, but scientists and mathematicians tend to be pretty good at spotting it, as long as they don’t fall for Mysterious Answers. And most of the checks on deep knowledge seem to lie in its ability to rederive the known correct hypotheses without adding stuff during the derivation.
A final point before closing this section: Yudkowsky writes that the same sort of evidence can be gathered for more complex arguments if they can be summarized by simple arguments that still get most of the current data right. My understanding here is that he’s pointing at the wiggle room of deep knowledge, that is, at the irrelevant ways in which it can sometimes be off. This is important because asking for that wiggle room can sound like ad-hoc adaptation of the pattern, breaking the compression assumption.
(From Intelligence Explosion Microeconomics (2013))
Conclusion
Based on my reading of his position, Yudkowsky sees deep knowledge as highly compressed causal explanations of “what sort of hypothesis ends up being right”. The compression means that we can rederive the successful hypotheses and theories from the causal explanation. Finally, such deep knowledge translates into partial constraints on hypothesis space, which focus the search by pointing out what cannot work. This in turn means that deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis.
I also want to point out something that became clearer and clearer in reading old posts: Yudkowsky is nothing if not coherent. You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.