This is a frequent disagreement I have with Eliezer and he seems to consistently find my view either perplexing or obviously misguided. So I guess this is as good a place as any to express that view:
- Realistically I think the core issue is that Eliezer is very skeptical about the possibility of competitive AI alignment. That said, I think that even on Eliezer's pessimistic view he should probably just be complaining about competitiveness problems rather than saying pretty speculative stuff about what is needed for a pivotal act.
Maybe I'm not understanding what you mean by "competitive". On my model, if counterfactually it were possible to align AGI systems that are exactly as powerful as the strongest unaligned AGI systems, e.g. five years after the invention of AGI, then you'd need to do a pivotal act with the aligned AGI system immediately or you die.
So competitiveness of aligned AGI systems doesn't seem like the main crux to me; the crux is more like 'if you muddle along and don't do anything radical, do all the misaligned systems just happen to not be able to find any way to kill all humans?'. Equal capability doesn't solve the problem when attackers are advantaged.
It sounds like your view is "given continued technological change, we need strong international coordination to avoid extinction, and that requires a 'pivotal act.'"
But that "pivotal act" is a long time in the subjective future, the case for it being a single "act" is weak, the kinds of pivotal acts being discussed seem totally inappropriate in this regime, and the discussion overall feels pretty inappropriate with very little serious thought by the participants.
For example, my sense from this discourse is that MIRI folks think a strong world government is more likely to come from an AI lab taking over the world than from a more boring looking process of gradual political change or conflict amongst states (and that this is the large majority of how discussed pivotal acts address the problem you are mentioning). I disagree with that, don't think it's been argued for, and don't think the surprisingness of the claim has even been acknowledged and engaged with.
I disagree with the whole spirit of the sentence "misaligned systems just happen not to be able to find a way to kill all humans:"
In the world where new technologies destroy the world, I think the default response is a combination of:
Overall, these don't seem like problems current humans need to deal with. I'm very excited for some people to be thinking through these problems, because I do think that helps put us in a better position to solve these problems in the future (and solving them pre-AI would remove the need for technical solutions to alignment!). But I don't currently think they have a big effect on how we think about the alignment problem.
I don't think it's about misaligned AI. I agree with the mainstream opinion that if competitive alignment is solved, humans deliberately causing trouble represent a larger share of the problem than misaligned AI.
Why is this a mainstream opinion? Where does this "mainstream" label come from? I don't think almost anyone in the broader world has any opinions on this scenario, and from the people I've talked to in AI Alignment, this really doesn't strike me as a topic I've seen any kind of consensus on. This to me just sounds like you are labeling people you agree with as "mainstream". I don't currently see a point in using words like "mainstream" and (the implied) "fringe" in contexts like this.
I disagree with that, don't think it's been argued for, and don't think the surprisingness of the claim has even been acknowledged and engaged with.
This also seems to me to randomly throw in an elevated burden of proof, claiming that this claim is surprising, but that your implied opposite claim is not surprising, without any evidence. I find your claims in this domain really surprising, and I also haven't seen you "acknowledge the surprisingness of [your] claim". And I wouldn't expect you to, because to you your claims presumably aren't surprising.
Claiming that someone "hasn't acknowledged the surprisingness of their claim" feels like a weird double-counting of both trying to dock someone points for being wrong, and trying to dock them points for not acknowledging that they are wrong, which feel like the same thing to me (just the latter feels like it relies somewhat more on the absurdity heuristic, which seems bad to me in contexts like this).
I'd say "mainstream opinion" (in either ML broadly, "safety" or "ethics," AI policy) is generally focused on misuse relative to alignment---even without conditioning on "competitive alignment solution." I normally disagree with this mainstream opinion, and I didn't mean to endorse the opinion in virtue of its mainstream-ness, but to identify it as the mainstream opinion. If you don't like the word "mainstream" or view the characterization as contentious, feel free to ignore it, I think it's pretty tangential to my post.
I'm happy to leave it up to the reader to decide if the claim ("world government likely to come from AI lab rather than boring political change") is surprising. I'm also happy if people read my sentence as an expression of my opinion and explanation of why I'm engaging with other parts of Eliezer's views rather than as an additional argument.
I agree some parts of my comment are just expressions of frustration rather than useful contributions.
I'd say "mainstream opinion" (in either ML broadly, "safety" or "ethics," AI policy) is generally focused on misuse relative to alignment---even without conditioning on "competitive alignment solution." I normally disagree with this mainstream opinion, and I didn't mean to endorse the opinion in virtue of its mainstream-ness, but to identify it as the mainstream opinion. If you don't like the word "mainstream" or view the characterization as contentious, feel free to ignore it, I think it's pretty tangential to my post.
Thanks, that clarifies things. I did misunderstand that sentence to refer to something like the "AI Alignment mainstream", which feels like a confusing abstraction to me, though I feel like I could have figured it out if I had thought a bit harder before commenting.
For the record, my current model is that "AI ethics" or "AI policy" doesn't really have a consistent model here, so I am not really sure whether I agree with you that this is indeed the opinion of most of the AI ethics or AI policy community. E.g. I can easily imagine an AI ethics article saying that if we have really powerful AI, the most important thing is not misuse risk, but moral personhood of the AIs, or the "broader societal impact of the AIs", both of which feel more misalignment shaped, but I really don't know (my model of AI ethics people thinks that whether the AI is misaligned has an effect on whether it "deserves" moral personhood).
I do expect the AI policy community to be more focused on misuse, because they have a lot of influence from national security, which sure is generally focused on misuse and "weapons" as an abstraction, but I again don't really trust my models here. During the cold war a lot of the policy community ended up in a weird virtue signaling arms race that ended up having a strong consensus in favor of a weird flavor of cosmopolitanism, which I really didn't expect when I first started looking into this, so I don't really trust my models of what actual consensus will be when it comes to transformative AI (and don't really trust current local opinions on AI to be good proxies for that).
long time in the subjective future [...] subjective decades [...] subjective centuries
What is subjective time? Is the idea that human-imitating AI will be sufficiently faithful to what humans would do, such that if AI does something that humans would have done in ten years, we say it happened in a "subjective decade" (which could be much shorter in sidereal time, i.e., the actual subjective time of existing biological humans)?
... ah, I see you address this in the linked post on "Handling Destructive Technology":
This argument implicitly measures developments by calendar time—how many years elapsed between the development of AI and the development of destructive physical technology? If we haven’t gotten our house in order by 2045, goes the argument, then what chance do we have of getting our house in order by 2047?
But in the worlds where AI radically increases the pace of technological progress, this is the wrong way to measure. In those worlds science isn’t being done by humans, it is being done by a complex ecology of interacting machines moving an order of magnitude faster than modern society. Probably it’s not just science: everything is getting done by a complex ecology of interacting machines at unprecedented speed.
If we want to ask about “how much stuff will happen”, or “how much change we will see”, it is more appropriate to think about subjective time: how much thinking and acting actually got done? It doesn’t really matter how many times the earth went around the sun.
I'm not thinking of AI that is faithful to what humans would do, just AI that at all represents human interests well enough that "the AI had 100 years to think" is meaningful. If you don't have such an AI, then (i) we aren't in the competitive AI alignment world, (ii) you are probably dead anyway.
If you think in terms of calendar time, then yes everything happens incredibly quickly. It's weird to me that Rob is even talking about "5 years" (though I have no idea what AGI means, so maybe?). I would usually guess that 5 calendar years after TAI is probably post-singularity, so effectively many subjective millennia and so the world is unlikely to closely resemble our world (at least with respect to governance of new technologies).
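A rough way to make the calendar-versus-subjective distinction concrete is to assume a constant speedup factor. That is a simplifying assumption, not something claimed above. The function names and the 2,000-year figure standing in for "millennia" are illustrative only:

```python
def subjective_years(calendar_years: float, speedup: float) -> float:
    """Subjective time elapsed, ASSUMING a constant speedup factor.

    A constant speedup is a simplification made for this sketch; the
    discussion above does not commit to any particular growth curve.
    """
    return calendar_years * speedup


def required_speedup(calendar_years: float, target_subjective_years: float) -> float:
    """Constant speedup needed to fit the target subjective time into
    the given calendar time (same simplifying assumption)."""
    return target_subjective_years / calendar_years


# "An order of magnitude faster" over 5 calendar years:
decade_scale = subjective_years(5, 10)

# Speedup needed for 5 calendar years to hold 2,000 subjective years
# (2,000 is an illustrative stand-in for "many subjective millennia"):
millennia_scale = required_speedup(5, 2000)
```

Under these assumptions a merely 10x ecology gives subjective decades in five calendar years, while subjective millennia would need a speedup in the hundreds, which is one way to see why the two ways of counting time can lead to very different intuitions.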
So I guess this is as good a place as any to express that view
Meta point: seems sad to me if many arguments on topics are spread across many posts in a way that'd be hard for a person to track down e.g. all the arguments regarding generalization/not-generalization.
This makes me want something like the Arbital wiki vision where you can find not just settled facts, but also the list of arguments/considerations in either direction on disputed topics.
Plausibly the existing LW/AF wiki-tag system could do this as far as format/software goes, we just need to get people creating pages for all the concepts/disagreements and then properly tagging things and distilling things. This is in addition to better pages for relatively more settled ideas like "inner alignment".
All of this is a plausible thing for the LW team to try to make happen. [Focus for the next month or two is (a) ensuring that amidst all the great discussion of AI, LessWrong doesn't lose its identity as a site for Rationality/epistemics/pursuing truth/accurate models across all domains, (b) fostering epistemics in the new wave of alignment researchers (and community builders), though I am quite uncertain about many aspects of this goal/plan.]
I have just recently been wondering where we stand on the very basic description of the problem criteria for productive conversations. Of late our conversations seem to have more of the flavor of proposal for solution -> criticism of solution, which of course is fine if we have the problem described; but if that were the case why do so many criticisms take the form of disagreements over the nature of the problem?
A very reasonable objection is that there are too many unknowns at work, so people are working on those. But this feels like one meta-problem, so the same reasoning should apply and we want a description of the meta-problem.
I suppose it might be fair to say we are currently working on competing descriptions of the meta-problem. Note to self: doing another survey of the recent conversations with this in mind might be clarifying.
Realistically I think the core issue is that Eliezer is very skeptical about the possibility of competitive AI alignment. That said, I think that even on Eliezer's pessimistic view he should probably just be complaining about competitiveness problems rather than saying pretty speculative stuff about what is needed for a pivotal act.
Isn't the core thing here that Eliezer expects that a local hard takeoff is possible? He thinks that a single AI system can rapidly gain enormous power relative to the rest of the world (either by recursive self improvement, or by seizing compute, or by just deploying on more computers).
If this is a possible thing for an AGI system to do, it seems like ensuring a human future requires that you're able to prevent an unaligned AGI from undergoing a hard takeoff.
If you have aligned systems that are competitive in a number of different domains, that doesn't matter if 1) local hard takeoff is on the table and 2) you aren't able to produce systems whose alignment is robust to a hard takeoff.
It seems like the pivotal act ideology is a natural consequence of 1) expecting hard takeoff and 2) thinking that alignment is hard, full stop. Whether or not aligned systems will be competitive doesn't come into it. Or by "competitive" do you mean, specifically "competitive, even across the huge relative capability gain of a hard takeoff"?
It seems like Eliezer's chain of argument is:
Even if competitiveness is likely tractable, we might have more influence over some worlds where competitiveness is intractable. I don't think this overwhelms a large disagreement about tractability of a fully competitive alignment solution, but I think there's something valuable about how pivotal act plans work under the weakest assumptions about alignment difficulty.
If powerful AIs are deployed in worlds mostly shaped by slightly less powerful AIs, you basically need competitiveness to be able to take any "pivotal action" because all the free energy will have been eaten by less powerful AIs.
It looks like it’s totally plausible for many kinds of limited systems to greatly accelerate R&D
Do you have any concrete example from any current alignment work where it would be helpful to have some future AI technology? Or else any other way in which such technologies will be useful for alignment? Would it be something like "we trained it to listen to humans, it wasn't perfect, but we looked at its utility function with transparency tools and that gave us ideas"? Oh, and it should be more useful for alignment than it is useful for creating something that can defeat safety measures at the moment, right? Because I don't get how, for example, better hardware or coding assistance or money wouldn't just result in faster development of something misaligned. And so I don't get how everyone competing on developing AI is helping matters - wouldn't the existence of a half-as-capable AI before the one that ended the world just make the world-ending AI appear earlier? Like, it could dramatically change things, but why would it change things for the better if no one planned it?
I think that future AI technology could automate my job. I think it could also automate capability researchers' jobs. (It could also help in lots of other ways, but this point seems sufficient to highlight the difference between our views.)
I don't think that being more useful for alignment is a necessary claim for my position. We are talking about what we want our aligned AIs to do for us, and hence what we should have in mind while doing AI alignment research. If we think AI accelerates technological progress across the board, then the answer "we want our AI to keep accelerating good stuff happening in the world at the same rate that it accelerates dangerous technology" seems like it's valid.
And it will be ok to have unaligned capabilities, because the government will stop them, maybe using existing aligned AI technology, and it will do so in the future but not now because future AI technology will be better at demonstrating risk? Why do you think that the default response of humanity to an increasing offense-defense balance and vulnerability to terrorism will be correct? Why, for example, couldn't capability detection be insufficient for regulators to stop multiple actors at the time they arrive at world-destroying capabilities?
Here's a conversation that I think is vaguely analogous:
Alice: Suppose we had a one-way function, then we could make passwords better by...
Bob: What do you want your system to do?
Alice: Well, I want passwords to be more robust to...
Bob: Don't tell me about the mechanics of the system. Tell me what you want the system to do.
Alice: I want people to be able to authenticate their identity more securely?
Bob: But what will they do with this authentication? Will they do good things? Will they do bad things?
Alice: IDK I just think the world is likely to be generically a better place if we can better authenticate users.
Bob: Oh OK, we're just going to create this user authentication technology and hope people use it for good?
Alice: Yes? And that seems totally reasonable?
It seems to me like you don't actually have to have a specific story about what you want your AI to do in order for alignment work to be helpful. People in general do not want to die, so probably generic work on being able to more precisely specify what you want out of your AIs, e.g. for them not to be mesa-optimizers, is likely to be helpful.
This is related to complaints I have with [pivotal-act based] framings, but probably that's a longer post.
Bob: Oh OK, we're just going to create this user authentication technology and hope people use it for good?
Seems to me that the answer "I hope people will use it for good" is quite okay for authentication, but not okay for alignment. Doing good is outside the scope of authentication, but is kinda the point of alignment.
The basis of all successful technology to date has been separation of concerns. One of the problems with Alignment as an academic discipline is keeping the focus on problems that can actually be solved, without drawing in all of philosophy and politics. It's like the old joke about object-oriented programming: you asked for a banana, but you got a gorilla holding the banana, and the entire jungle too.
Do you mean like there are (at least) two subproblems that can be addressed separately?
Where the former is the proper concern of AI researchers, and the latter should be studied by someone else (even if we currently have no idea who could do such thing reliably, it's a separate problem regardless).
I'm actually more interested in corrigibility than values alignment, so I don't think that AI should be solving moral dilemmas every time it takes an action. I think values should be worked out in the post-ASI period, by humans in a democratic political system.
People currently give MIRI money in the hopes they will use it for alignment. Those people can't explain concretely what MIRI will do to help alignment. By your standard, should anyone give MIRI money?
When you're part of a cooperative effort, you're going to be handing off tools to people (either now or in the future) which they'll use in ways you don't understand and can't express. Making people feel foolish for being a long inferential distance away from the solution discourages them from laying groundwork that may well be necessary for progress, or even from exploring.
Some common issues with alignment plans, on Eliezer's account, include:
These are some examples of reasons it's much more often helpful to think about 'what is the AGI for?' or 'what task are you trying to align?', even though it's true that not all domains work this way!
Presumably in those early time-sharing systems it was: "we need some way for a user to access their files but not other users' files". So password.
Then later in scale "system administrators or people with root keep reading the passwords file and using it for bad acts later". So one way hash.
Password complexity requirements came from people rainbow tabling the one way hash file.
None of the above was secure so 2 factor.
People keep SMS redirecting so apps on your phone...
Each additional level of security was taken from a pool of preexisting ideas that academics and others had contributed. But it wasn't applied until it was clear it was needed.
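The "one way hash" step in that progression can be sketched in a few lines. This is a minimal illustration, not a description of what any historical system actually did. The per-user salt is an addition not mentioned above (it is the standard defense against the rainbow-table attack described), and the function names and iteration count are hypothetical choices for the example:

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Illustrative work factor only; not a recommendation for production use.
ITERATIONS = 100_000


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256.

    A fresh random salt is generated when none is supplied, so the same
    password hashes differently for each user and a precomputed
    (rainbow) table cannot be reused across entries.
    """
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

With this scheme the stored file holds only the salt and digest, so someone who reads it cannot recover the password directly and has to guess candidates one at a time.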
People often work on alignment proposals without having a clear idea of what they actually want an aligned system to do. Eliezer thinks this is bad.
...
Is it never useful to have a better understanding of the mechanics, even if we don’t have a clear target in mind?
Not sure how relevant this is in Eliezer's model, but I'd note that there's a difference between alignment proposals vs understanding the mechanics. When working to understand things in general, I usually want to have multiple different use-cases in mind, to drive the acquisition of generalizable knowledge. For an actual proposal, it makes more sense to have a particular use-case.
That said, even under this argument one should have multiple concrete use-cases in mind, as opposed to zero concrete use-cases.
I started out nodding along with Eliezer in this post, then read Mark's take and was like "yeah, that seems fair."
But I do quite like the notion of "have a couple concrete examples of how the whole thing will get used end-to-end."
This feels related to how I feel about a lot of voting theory, or economics mechanism design. I'm often frustrated by people working hard on fleshing out the edge cases for some voting process, when the actual bottleneck is a much simpler voting system that anyone is actually likely to use, and a use-case to bootstrap it to more popularity. Or for mechanism design, the actual bottleneck is someone who is good at UX and product design.
In the case of AI we have a different set of problems. We probably actually do need the complicated theory-heavy solutions. But those solutions need to be relevant to stuff that has a chance of actually helping, which requires thinking through things concretely.
...
On the other hand, this all feels a bit distinct from what (I think?) Eliezer's point is, which is less about how to design things generally, and more about "The AGI is actually gonna kill you tho and the road to hell is paved with people bouncing off the hard problem to work on more tractable things." I think there's some kind of deeply different macrostrategy that Paul and Eliezer and Critch are pointing at. (I'm a bit confused about Critch/Paul, I had been bucketing them as in essentially the same strategic camp but then last year they had some significant disagreement I was confused by). Where Eliezer is like "you definitely just need a pivotal act" and Critch is like "you're not gonna get a safe pivotal act and also it harms the coordination commons" and Paul is like "you're not gonna get a safe pivotal act, you need to figure out how to make alignment competitive."
I think this disagreement is nontrivial to resolve.
Start-ups need customers, but basic science/theory doesn't need applications. If it's an alignment technique, not an alignment proposal, the important consideration is whether there is a lesson, an idea relevant to the informal topic of alignment, or to some question that might come to mind when thinking about it, something that helps with understanding. If that idea doesn't slot into any known-useful role, that isn't an argument against developing it, only a weak priority consideration that's not obviously more important than things like neglectedness of the question the idea clarifies.
You might get applications much later, when understanding builds up to a point where pieces of the puzzle start assembling into useful plans. Prioritizing development of pieces of the puzzle before that point is mostly intractable, and individually most of them will never find applications.
This feels like a rather different attitude compared to the “rocket alignment” essay. They’re maybe both compatible but the emphasis seems very different.
Agreed!
In terms of MIRI's 2017 'strategic background' outline, I'd say that these look like they're in tension because they're intervening on different parts of a larger plan. MIRI's research has historically focused on:
For that reason, MIRI does research to intervene on 8 from various angles, such as by examining holes and anomalies in the field’s current understanding of real-world reasoning and decision-making. We hope to thereby reduce our own confusion about alignment-conducive AGI approaches and ultimately help make it feasible for developers to construct adequate “safety-stories” in an alignment setting. As we improve our understanding of the alignment problem, our aim is to share new insights and techniques with leading or up-and-coming developer groups, who we’re generally on good terms with.
I.e., our perspective was something like 'we have no idea how to do alignment, so we'll fiddle around in the hope that new theory pops out of our fiddling, and that this new theory makes it clearer what to do next'.
In contrast, Bob in the OP isn't proposing a way to try to get less confused about some fundamental aspect of intelligence. He's proposing a specific plan for how to actually design and align an AGI in real life:
"Let’s suppose we had a perfect solution to outer alignment. I have this idea for how we could solve inner alignment! First, we could get a human-level oracle AI. Then, we could get the oracle AI to build a human-level agent through hardcoded optimization. And then--"
This is also important, but it's part of planning for step 6, not part of building toward step 8 (or prerequisites for 8).
“Bob isn't proposing a way to try to get less confused about some fundamental aspect of intelligence”
This might be what I missed. I thought he might be. (E.g., “let’s suppose we have” sounds to me more like a brainstorming “mood” than a solution proposal.)
Eliezer: What do you want the system to do?
Bob: I want the system to do what it thinks I should want it to do.
Eliezer: The Hidden Complexity of Wishes
Related: this draft document I have, inspired by the same conversation. It tries to cover some of those questions.
I typically want the system to interact with the world digitally, gain humans' trust, take over manufacturing capability to radically raise the standard of living on earth, and simultaneously spin up orbital infrastructure to launch replicating probes. For starters.
One could probably aim for more modest actions - I'm not super convinced that that would be easier, but it could be!
This affects what kinds of alignment strategies I consider. As I've said elsewhere, you have to solve a lot of philosophical problems when doing AI alignment, but the advantage we have over philosophers is that when a problem seems impossible, we get to throw it out and pick an easier problem. You design the AI, you get to pick what problem it's solving - if you ask it to do something impossible, that's bad, and it's your job to ask it to do something possible instead.
Put that way, though, "what do I want the AI to do" is more importantly about "what currently unsolved problems do I want the AI to solve?" This is a finer grain of detail than the "oh, it will make peoples' lives better" of my first paragraph.
I would like for the system to provide humans with information. So if a human asks a reasonable question (How do I get a strawberry?) the system gives information on cultivating strawberries. If a human asks for the DNA sequence of a strawberry and how to create a strawberry from that, the system gives safety information and how to do that. If a human asks how to create a thermonuclear bomb, the system asks why, and refuses to answer unless the human can provide a verifiable reason why creating this is necessary to solve an existential threat to humanity. I would like the system to be able to provide this information in a variety of ways, such as interactive chat or a written textbook.
I would like the system to gain scientific and engineering knowledge. So I would like the system to do things like set up telescopes and send probes to other planets. I would like the system to provide monitoring of Earth from orbit. If it needs to run safe experimental facilities, I would like the system to be able to do that.
I would like the system to leave most of the universe alone. So I would like it to leave most of the surfaces of planets and other natural bodies untouched. (If the system dug up more than 10 cubic kilometers of a planet, or disturbed more than 1% of the surface area or volume area, I would consider that a violation of this goal) (Tearing apart say an O-type main sequence star that will not ever have life would be okay if necessary for a really interesting experiment that could not be done in any other way, ripping apart the majority of stars in a galaxy is not something I would want except to prevent an existential threat.)
I would like the system to be incredibly careful not to disturb life. So on Earth, it should only disturb life with humans' permission, and elsewhere should entirely avoid any resource extraction on any planet or place with existing life.
I would like the system to use a reasonable effort to prevent humans or other intelligent lifeforms from completely destroying themselves. (So if banning nanotechnology and nuclear bombs is needed, okay, but banning bicycles or knives is going too far. Diverting asteroids from hitting Earth would be good.)
I would like the system to have conversations with humans, about ethics and other topics, and try to help humans figure out what would be truly good.
(And of course, what we want AGIs to do and how to get AGIs to do that are two separate questions. Also, this list is partially based on Ursula K. Le Guin's The City of Mind (Yaivkach) AGIs in Always Coming Home.)
This is a write-up of a conversation I overheard between Eliezer and some junior alignment researchers. Eliezer reviewed this and gave me permission to post this, but he mentioned that "there's a lot of stuff that didn't get captured well or accurately." I'm posting it under the belief that it's better than nothing.
TLDR: People often work on alignment proposals without having a clear idea of what they actually want an aligned system to do. Eliezer thinks this is bad. He claims that people should start with the target (what do you want the system to do?) before getting into the mechanics (how are you going to get the system to do this?)
I recently listened in on a conversation between Eliezer and a few junior alignment researchers (let’s collectively refer to them as Bob). This is a paraphrased/editorialized version of that conversation.
Bob: Let’s suppose we had a perfect solution to outer alignment. I have this idea for how we could solve inner alignment! First, we could get a human-level oracle AI. Then, we could get the oracle AI to build a human-level agent through hardcoded optimization. And then--
Eliezer: What do you want the system to do?
Bob: Oh, well, I want it to avoid becoming a mesa-optimizer. And you see, the way we do this, assuming we have a perfect solution to outer alignment is--
Eliezer: No. What do you want the system to do? Don’t tell me about the mechanics of the system. Don’t tell me about how you’re going to train it. Tell me about what you want it to do.
Bob: What… what I want it to do. Well, uh, I want it to not kill us and I want it to be aligned with our values.
Eliezer: Aligned with our values? What does that mean? What will you actually have this system do to make sure we don’t die? Does it have to do with GPUs? Does it have to do with politics? Tell me what, specifically, you want the system to do.
Bob: Well wait, what if we just had the system find out what to do on its own?
Eliezer: Oh okay, so we’re going to train a superintelligent system and give it complete freedom over what it’s supposed to do, and then we’re going to hope it doesn’t kill us?
Bob: Well, um….
Eliezer: You’re not the only one who has trouble with this question. A lot of people find it easier to think about the mechanics of these systems. Oh, if we just tweak the system in these ways-- look! We’ve made progress!
It’s much harder to ask yourself, seriously, what are you actually trying to get the system to do? This is hard because we don’t have good answers. This is hard because a lot of the answers make us uncomfortable. This is hard because we have to confront the fact that we don’t currently have a solution.
This happens with start-ups as well. You’ll talk to a start-up founder and they’ll be extremely excited about their database, or their engine, or their code. And then you’ll say “cool, but who’s your customer?”
And they’ll stare back at you, stunned. And then they’ll say “no, I don’t think you get it! Look at this-- we have this state-of-the-art technique! Look at what it can do!”
And then I ask again, “yes, great, but who is your customer?”
With AI safety proposals, I first want to know who your customer is. What is it that you actually want your system to be able to do in the real-world? After you have specified your target, you can tell me about the mechanics, the training procedures, and the state-of-the-art techniques. But first, we need a target worth aiming for.
Questions that a curious reader might have, which are not covered in this post:
After the conversation, Eliezer mentioned that he often finds himself repeating this when he hears about alignment ideas. I asked him if he had written this up. He said no, but maybe someone like me should write it up.