I have a lot of disagreements with section 6. Not sure where the main crux is, so I'll just write down a couple of things.
One intuition pump here is: in the current, everyday world, basically no one goes around with much of a sense of what people’s “values on reflection” are, or where they lead.
This only works because we're not currently often in danger of subjecting other people to major distributional shifts. See Two Neglected Problems in Human-AI Safety.
That is, ultimately, there is just the empirical pattern of: what you would think/feel/value given a zillion different hypothetical processes; what you would think/feel/value about those processes given a zillion different other hypothetical processes; and so on. And you need to choose, now, in your actual concrete circumstance, which of those hypotheticals to give authority to.
I notice that in order to argue that solving AI alignment does not need "very sophisticated philosophical achievement", you've proposed a solution to metaethics, which would itself constitute a "very sophisticated philosophical achievement" if it's correct!
Personally I'm very uncertain about metaethics (see also previous discussion on this topic between Joe and me), and don't want to see humanity bet the universe on any particular metaethical theory in our current epistemic state.
Curated!
("Curated", a term which here means "This just got emailed to 30,000 people, of whom typically half open the email and it gets shown at the top of the frontpage to anyone who hasn't read it for ~1 week.")
This is a thoughtful and detailed attempt to think through the entire alignment problem, making slightly different conceptual distinctions and tradeoffs, and thus reaching somewhat different conclusions, and that's very worthwhile! I want to reward people doing and publishing serious intellectual labor like this that otherwise mostly wouldn't get done.
I like the notion of 'avoiding' and 'handling' the alignment problem as distinct from 'solving' it, and generally trying to talk about the same subject but without definitionally building-in the assumption that the agent will need to have identical values to us (which is especially worthwhile given how confused I am about my own values!). I'm amused that you consider your definition here 'devious'.
One critique I'll make is that only a while in did I pick up that you weren't talking about building maximally-intelligent systems, merely superintelligent systems (i.e. there's a whole range of how much more intelligent than us a machine can be, and for a substantial part of this I believe you're focusing on the lower end). I read you as focusing on the level of superintelligence that solves tons of major problems that have plagued humanity since its inception and has tons of obvious benefits (e.g. ending disease, amazing videogames, superintelligent life advice, etc) but not crazily higher than that (e.g. perhaps uploading everyone into ems and redesigning the human mind). It seems to me like your choice to focus on dynamics at this level of intelligence, while potentially highly worthwhile, rests on a bunch of empirical beliefs about how the development of AI will play out that are pretty absent in this more abstract, philosophical treatise.
I have many more thoughts and disagreements with this and related works, and I hope to write a more thorough response sometime. Still, I'm really glad to have read it, thank you!
I'm confused about the clarifications in this post. Generally speaking, I think the terms "alignment", "takeover", and "disempowered" are vague and can mean dramatically different things to different people. My hope when I started reading this post was to see you define these terms precisely and unambiguously. Unfortunately, I am still confused about how you are using these terms, although it could very easily be my fault for not reading carefully enough.
Here is a scenario that I want you to imagine that I think might help to clarify where I'm confused:
Suppose we grant AIs legal rights and they become integrated into our society. Humans continue to survive and thrive, but AIs eventually and gradually accumulate the vast majority of the wealth, political power, and social status in society through lawful means. These AIs are sentient, extremely competent, mostly have strange and alien-like goals, and yet are considered "people" by most humans, according to an expansive definition of that word. Importantly, they are equal in the eyes of the law, and have no limitations on their ability to hold office, write new laws, and hold other positions of power. The AIs are agentic, autonomous, plan over long time horizons, and are not enslaved to the humans in any way. Moreover, many humans also upload themselves onto computers and become AIs themselves. These humans expand their own cognition and often choose to drop the "human" label from their personal identity after they are uploaded.
Here are my questions
Hi Matthew -- I agree it would be good to get a bit more clarity here. Here's a first pass at more specific definitions.
On these definitions, the scenario you've given is underspecified in a few respects. In particular, I'd want to know:
If we assume the answer to (1) is that the non-human-descended AIs end up with most of the power (sounds like this is basically what you had in mind -- see also my "people-who-like paperclips" scenario here) then yes I'd want to call this a takeover and I'd want to say that humans have been disempowered. Whether it was a "bad takeover", and whether this was a good or bad outcome for humanity, I think depends partly on (2). If in fact this scenario results in a future that is extremely low in value, in virtue of the alien-ness of the goals the AIs are pursuing, then I'd want to call it a bad takeover despite the cooperativeness of the path getting there. I think this would also imply that the AIs are practically PS-misaligned, and I think I endorse this implication, despite the fact that they are broadly cooperative and law-abiding -- though I do see a case for reserving "PS-misalignment" specifically for uncooperative power-seeking. If the resulting future is high in value, then I'd say that it was not a bad takeover and that the AIs are aligned.
Does that help? As I say, I think your comments here are pushing me a bit towards focusing specifically on uncooperative takeovers, and on defining PS-misalignment specifically in terms of AIs with a tendency to engage in uncooperative forms of power-seeking. If we went that route, then we wouldn't need to answer my question (2) above, and we could just say that this is a non-bad takeover and that the AIs are PS-aligned.
OK, but what is your “intent”? Presumably, it’s that something be done in accordance with your values-on-reflection, right?
No, I don't think so at all. Pretty much the opposite, actually; if it was in accordance with my values-on-reflection, it would be value-aligned to me rather than intent-aligned. Collapsing the meaning of the latter into the former seems entirely unwise to me. After all, when I talk about my intent, I am explicitly not thinking about any long reflection process that gets at the "core" of my beliefs or anything like that;[1] I am talking more about something like this:
I have preferences right now; this statement makes sense in the type of low-specificity conversation dominated by intuition where we talk about such words as though they referred to real concepts that point to specific areas of reality. Those preferences are probably not coherent, in the sense that I can probably be money pumped by an intelligent enough agent that sets up a strange-to-my-current-self scenario. But they still exist, and one of them is to maintain a sufficient amount of money in my bank account to continue living a relatively high-quality life. Whether I "endorse" those preferences or not is entirely irrelevant to whether I have them right now; perhaps you could offer a rational argument to eventually convince me that you would make much better use of all my money, and then I would endorse giving you that money, but I don't care about any of that right now. My current, unreflectively-endorsed self doesn't want to part with what's in my bank account, and that's what's guiding my actions, not an idealized, reified future version.
None of this means anything conclusive about me ultimately endorsing these preferences in the reflective limit, about those preferences being stable under ontology shifts that reveal how my current ontology is hopelessly confused and reifies the analogues of ghosts, about there being any nonzero intersection between the end states of a process that tries to find my individual volition, or about changes to my physical and neurological make-up keeping my identity the same (in a decision-relevant sense relative to my values) when my memories and path through history change.
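The money-pump worry mentioned above can be made concrete with a toy model. What follows is only a minimal sketch under invented assumptions: the three items, the cyclic preference relation, the starting balance, and the fee are all hypothetical, and `run_money_pump` is an illustrative helper rather than anything from the discussion.

```python
# Hypothetical cyclic preferences: A is preferred to B, B to C, and C to A.
# Each pair (x, y) means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def run_money_pump(start_item, start_money, fee, rounds):
    """Each round, offer the agent the item it prefers to its current one, for a fee.

    An agent with cyclic preferences accepts every offer, so it keeps paying
    while cycling back to the item it started with.
    """
    item, money = start_item, start_money
    for _ in range(rounds):
        if money < fee:
            break  # the agent can no longer afford to trade
        # The item the agent strictly prefers to what it currently holds.
        better = next(x for (x, y) in PREFERS if y == item)
        item, money = better, money - fee
    return item, money


final_item, final_money = run_money_pump("A", start_money=100, fee=1, rounds=30)
print(final_item, final_money)
```

Every individual trade looks like an improvement by the agent's current preferences, yet after a multiple of three rounds it holds the same item with less money. That is compatible with the point being made here: the preferences exist and guide action right now even though they are exploitable.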
In any case, I am very skeptical of this whole values-on-reflection business,[2] as I have written about at length in many different spots (1, 2, 3 come to mind off the top of my head). I am loath to keep copying the exposition of the same ideas over and over and over again (it also probably gets annoying to read at some point), but here is a relevant sample:
Whenever I see discourse about the values or preferences of beings embedded in a physical universe that goes beyond the boundaries of the domains (namely, low-specificity conversations dominated by intuition) in which such ultimately fake frameworks function reasonably well, I get nervous and confused. I get particularly nervous if the people participating in the discussions are not themselves confused about these matters (I am not referring to [Wei Dai] in particular here, since [Wei Dai] has already signaled an appropriate level of confusion about this). Such conversations stretch our intuitive notions past their breaking point by trying to generalize them out of distribution without the appropriate level of rigor and care.
What counts as human "preferences"? Are these utility function-like orderings of future world states, or are they ultimately about universe-histories, or maybe a combination of those, or maybe something else entirely? Do we actually have any good reason to think that (some form of) utility maximization explains real-world behavior, or are the conclusions broadly converged upon on LW ultimately a result of intuitions about what powerful cognition must be like whose source is a set of coherence arguments that do not stretch as far as they were purported to? What do we do with the fact that humans don't seem to have utility functions and yet lingering confusion about this remained as a result of many incorrect and misleading statements by influential members of the community?
How can we use such large sample spaces when it becomes impossible for limited beings like humans or even AGI to differentiate between those outcomes and their associated events? After all, while we might want an AI to push the world towards a desirable state instead of just misleading us into thinking it has done so, how is it possible for humans (or any other cognitively limited agents) to assign a different value, and thus a different preference ranking, to outcomes that they (even in theory) cannot differentiate (either on the basis of sense data or through thought)?
In any case, are they indexical or not? If we are supposed to think about preferences in terms of revealed preferences only, what does this mean in a universe (or an Everett branch, if you subscribe to that particular interpretation of QM) that is deterministic? Aren't preferences thought of as being about possible worlds, so they would fundamentally need to be parts of the map as opposed to the actual territory, meaning we would need some canonical framework of translating the incoherent and yet supposedly very complex and multidimensional set of human desires into something that actually corresponds to reality? What additional structure must be grafted upon the empirically-observable behaviors in order for "what the human actually wants" to be well-defined?
[...]
What do we mean by morality as fixed computation in the context of human beings who are decidedly not fixed and whose moral development through time is almost certainly so path-dependent (through sensitivity to butterfly effects and order dependence) that a concept like "CEV" probably doesn't make sense? The feedback loops implicit in the structure of the brain cause reward and punishment signals to "release chemicals that induce the brain to rearrange itself" in a manner closely analogous to and clearly reminiscent of a continuous and (until death) never-ending micro-scale brain surgery. To be sure, barring serious brain trauma, these are typically small-scale changes, but they nevertheless fundamentally modify the connections in the brain and thus the computation it would produce in something like an emulated state (as a straightforward corollary, how would an em that does not "update" its brain chemistry the same way that a biological being does be "human" in any decision-relevant way?).
I do have some other thoughts on other parts of the post, which I might write out at some point.
I agree and yet I think it's not actually that hard to make progress.
There is no canonical way to pick out human values,[1] and yet using an AI to make clever long-term plans implicitly makes some choice. You can't dodge choosing how to interpret humans; if you think you're dodging it, you're just doing it in an unexamined way.
Yes, humans are bad at philosophy and are capable of making things worse rather than better by examining them. I don't have much to say other than get good. Just kludging together how the AI interprets humans seems likely to lead to problems to me, especially in a possible multipolar future where there's more incentive for people to start using AI to make clever plans to steer the world.
This absolutely means disposing of appealing notions like a unique CEV, or even an objectively best choice of AI to build, even as we make progress on developing standards for good AI to build.
See the Reducing Goodhart sequence for me on this, which starts sketching some ways to deal with humans not being agents.
I agree and I think this is critical. The standard of getting >90% of the possible value from our lightcone, or similar, seems ridiculously high given the seemingly very real possibility of achieving zero or negative value.
And it seems certain that there's no absolute standard for achieving human values. What they are is path dependent.
But we can still achieve an unimaginably good future by achieving ASI that does anything that humans roughly want.
morality as fixed computation ... decidedly not fixed ... path-dependent
Updatelessness teaches us that looking at the tree of possibilities as a whole is a saner point of view than looking at any one leaf, to the point that in the limit and where feasible you want to put the map of the whole tree in charge of the decision making at every leaf. So path-dependence is not necessarily a problem in principle, only in practice.
Another problem is influence of others, and boundaries/membranes or respect for autonomy seem like clues towards abstracting these influences away without removing them altogether as sources of more possibilities, so that only appropriate external influences remain permitted to enter the updateless dataset of possible trajectories of reflection on morality. And each trajectory has potential to access the map of all trajectories, though a membrane might need to gate access to such a map.
Updatelessness sure seems nice from a theoretical perspective, but it has a ton of problems that go beyond what you just mentioned and which seem to me to basically doom the entire enterprise (at least with regards to what we are currently discussing, namely people):
Of course, I don't expect that you are trying to literally say that going updateless gets rid of all the issues, but rather that thinking about it in those terms, after internalizing that perspective, helps put us in the right frame of mind to make progress on these philosophical and metaphilosophical matters moving forward. But, as I said at the end of my comment to Wei Dai:
I do not have answers to the very large set of questions I have asked and referenced in this comment. Far more worryingly, I have no real idea of how to even go about answering them or what framework to use or what paradigm to think through. Unfortunately, getting all this right seems very important if we want to get to a great future. Based on my reading of the general pessimism you have been signaling throughout your recent posts and comments, it doesn't seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us.
Perhaps if a group of really smart philosophy-inclined people who have internalized the lessons of the Sequences without being wedded to the very specific set of conclusions MIRI has reached about what AGI cognition must be like and which seem to be contradicted by the modularity, lack of agentic activity, moderate effectiveness of RLHF etc (overall just the empirical information) coming from recent SOTA models were to be given a ton of funding and access and 10 years to work on this problem as part of a proto-Long Reflection, something interesting would come out. But that is quite a long stretch at this point.
Making maps is practical even when they are not as precise as the whole territory. The point is, path dependence happens in some space of possibilities, and it's possible to make maps of that whole space and to make use of them to navigate the possibilities jointly, as opposed to getting caught in any one of them. This doesn't need to involve global coherence across all possibilities (of moral reflection, in this case), just as optimization of the world doesn't need to involve steamrolling it into repetition of some perfect pattern. But some parts will have similarities and shared issues with other parts, and can inform each other in their development.
Updatelessness closer to something practical is consulting an external map of possibilities that gives advice on acting in the current situation and explains how following its advice influences the possibilities (in their further development that results from following the advice). That is, you don't need to yourself "be updateless", the essential observation is that a single computation can exist in many possible situations, and by being the same thing its evaluation will give the same results in all these situations, coordinating what happens in them (without the use of causal influence of some physical thing). This computation doesn't need to be the whole agent, for example a calculator on Mars computes the same results as a calculator (of a different make) on Earth, and both implementing the same computation thus coordinate what happens on Mars with what happens on Earth without a need to physically communicate. This becomes a matter of decision theory when the coordinating computation is itself an agent. But it doesn't need to be the same agent as a user of this decision theory as a whole, it doesn't need to be something like a human, it can be much smaller and more legible, more like a calculator.
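The calculator example can be sketched in a few lines. This is only an illustration under stated assumptions: `earth_calculator` and `mars_calculator` are hypothetical names for two differently built implementations of the same computation (addition of non-negative integers), and the "channel" rule is an invented stand-in for whatever the two sites need to agree on.

```python
def earth_calculator(a, b):
    """One 'make' of calculator: uses the language's built-in addition."""
    return a + b


def mars_calculator(a, b):
    """A different 'make': adds non-negative integers by bitwise carry propagation."""
    while b:
        # XOR adds without carrying; AND shifted left is the carry.
        a, b = a ^ b, (a & b) << 1
    return a


def choose_channel(calculator):
    """Hypothetical rule both sites follow: derive a channel number from a shared sum."""
    return calculator(17, 25) % 7


# The two sites never exchange a message, yet they pick the same channel,
# because both evaluate the same underlying computation.
print(choose_channel(earth_calculator) == choose_channel(mars_calculator))
```

Nothing here depends on either calculator being an agent; the coordination comes purely from both implementing the same function, which is the sense in which the coordinating computation can be much smaller and more legible than its users.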
I skipped 99% of this post but just want to respond to this:
I mostly just care about avoiding takeover and getting access to the main benefits of superintelligence
and
Trying to ensure that AI takeover is somehow OK... should be viewed as an extreme last resort.
"Takeover" is the natural consequence of superintelligence. Even if superintelligence mostly leaves humans alone while pursuing its own inscrutable goals, they will exist at its mercy, just as the animals now exist at the mercy of humanity.
Suppose, nonetheless, that you manage to make a tame superintelligence. What's to stop someone else from making a wild one? To compel all future superintelligences to fall within safe boundaries, you're going to have to take over the world anyway, either with a human regime which regulates or bans all unsafe AI forever, or with a safety regime which is directly run by a superintelligent tame AI.
In any case, even if you think you have a superintelligence that is tame and safe, which will e.g. just be an advisor: if it is truly a superintelligence, it will still be the one that is in charge of the situation, not you. It would be capable of giving you "advice" that would transform you, and through you the world, in some completely unexpected direction, if that were the outcome that its humanly incomprehensible heuristics ended up favoring.
That's why, in my opinion, CEV-style superalignment is the problem that has to be solved, or that we should attempt to solve. If we are going to have superintelligent AI, then we need to make AI takeover safe for humanity, because AI takeover is the one predictable consequence of superintelligence.
Edit after rereading: I think maybe the overall take on alignment here is closer to my own view than I initially thought. I think the framework for thinking about what we tend to mean by alignment and all of the different routes to success is largely true and useful. I think some of the paths suggested here are highly unlikely to work, while others are quite reasonable. I'm out of time to comment in more depth on each of the many takes here. Particularly since Joe doesn't seem to ever respond to comments here, I assume this won't be of use to him, but may be for other readers.
I have read this and your other recent work with interest. It is very well written, even erudite. It is likely to sway some young minds. And it does give me new perspectives, which I value.
I think it's great that you're considering the whole problem space here. We don't do that enough.
Edit: rereading more carefully: This post is vast. The following is only the beginning of a response.
Having said that, I do think your reconsideration doesn't adequately build on previous thought. I'm afraid it seems to me that you're not meeting the traditional alignment view at its strong points. If that's correct, your erudition creates a risk of confusing a very important issue.
There is a good reason that most existing alignment work considers handing over the future to an aligned ASI as success. We do not trust humans. It is this point you don't take seriously here.
It's easy to look at the world and say that humans are doing rather well all in all, thank you very much.
I think you're technically correct that co-existing with autonomous AGI that's not fully aligned is possible. And that existing with servant AI long-term is possible.
The arguments have always been that both of those scenarios are highly unlikely to be long-term stable. My recent post If we solve alignment, do we die anyway? tries to spell out why humans in control of AGI is untenable in the long or even medium term. Similar arguments apply to semi-aligned AGI. In both cases the problem is this: when players can amplify their own intelligence and production capacity, and conceal their actions, the most vicious player wins. Changing that scenario requires drastic measures you don't discuss. Keep playing long enough without draconian safeguards, and you're guaranteed to get a very vicious player. They'll attack and win and control the future, at which point we'd better hope they're merely selfish and not sadistic.
I apologize for stating it so bluntly: it looks to me like you're anthropomorphizing AGI through a very optimistic lens, and encouraging others to do the same. And this is coming from someone who co-authored a paper titled Anthropomorphic reasoning about neuromorphic AGI safety. I apologize for saying this. I respect you as a thinker on AGI. It's an extremely complicated topic.
Speaking as a psychologist and neuroscientist, I think it's important to recognize that we can't use anthropomorphic reasoning on alignment in part because many humans aren't aligned or safe. Sociopaths (at least some subset) will be more concerned with an injury to their little finger than with millions of deaths that won't affect them directly.
AGI will be sociopathic by default. Evolution has created very specific mechanisms to make most humans tend toward empathy, and therefore valuable teamwork.
Those mechanisms seem to be turned down in sociopaths. AGIs will lack them by default. It's possible that this is backward and empathy is the default, and sociopaths have extra mechanisms to turn it down/off, but that would be a result of specific brain computational schemes. AGI may well have none of those; or choose to disable them. If we try to make AGI that is pro-social, getting that right is not trivial. You seem to assume it here. Technical alignment is arguably the most important bit, and inarguably an important bit.
Or you might assume that we sort of all get along by default. That is sort of the case with humans, who are stuck with a limited mind and body roughly matching the other humans. That logic changes drastically when each being can enhance or duplicate itself without limit. If I need no allies, the smart move is to rely on no one but myself.
And humans have done very well so far, but that does not indicate that we are a good choice to control the future. There is a nonzero chance of nuclear annihilation every year; perhaps as high as 1%. The fact we're doing the best we ever have is not a good enough reason to think we'll continue to do great into the far future.
That's why building a being better than us and giving it control sounds like the least-bad option.
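To make the annual-risk figure above concrete, here is a small sketch of how a constant yearly risk compounds. It assumes, purely for illustration, that the risk is independent from year to year and fixed at the 1% upper-end figure mentioned above; `survival_probability` is a hypothetical helper name.

```python
def survival_probability(annual_risk, years):
    """Probability of avoiding the catastrophe for `years` consecutive years,

    assuming an independent, constant risk each year (an illustrative assumption).
    """
    return (1.0 - annual_risk) ** years


for years in (10, 50, 100, 500):
    p = survival_probability(0.01, years)
    print(f"{years:>4} years: {p:.3f} chance of no catastrophe")
```

Under these assumptions the chance of getting through a century is a bit over one in three, and it keeps shrinking with longer horizons, which is the sense in which a good track record so far says little about the far future.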
My post I linked and other work lays out a route to get there, past the long List of Lethalities. We first do personal-intent-aligned AGI, in the hands of a non-sociopathic human. They wisely leverage that to limit AGI proliferation. Then we enjoy a long reflection and decide how to align the sovereign AGI we build. The future is finally safe from sociopathic/otherwise malign humans.
Edit:
I have more responses to your other points. I agree with many, and disagree with many. There are a lot of claims and implications here.
I agree that corrigibility/loyal servant is the likely path to useful, safe ASI. I disagree that avoiding takeover is a workable long-term solution. I don't think ASI with an "aversion" to power-seeking or murder is a reasonable goal, for the classic reasons; humans may be motivated by random aversions, but we're really incoherent. We can't expect a superintelligence to behave the same way unless not only is it carefully engineered to do that as an ASI, but we're also really sure that its alignment will remain stable as it advances to ASI.
First, I have to note this is way more than I can wrap my head around in one reading (in fact it was more than I could read in one sitting, so I really have not finished reading it), but thank you for posting this, as it presents a very complicated subject in a framework I find more accessible than prior discussions here (or anywhere else I've looked). But then I'm just a curious outsider to this issue who occasionally explores the discussion, so information overload is normal, I think.
I particularly like the chart and how it laid out the various states/outcomes.
I found Section 6 particularly interesting! Here's how I understand it:
From my understanding, this context relates to the "be careful what you wish for" problem with AI, where AI could optimize in dangerous or unexpected ways. There's a race here: can we control AI well enough to still gain its benefits?
However, I don't think you've provided enough evidence that this level of control is actually possible. Additionally, there’s the issue of deceptive alignment—I’m not convinced we could manage this "race" without receiving some kind of feedback from AI systems.
Finally, the description of the oracle AI in this section seems quite similar to the idea of corrigible AI.
I enjoyed reading this, thanks.
I think your definition of solving alignment here might be too broad?
If we have superintelligent agentic AI that tries to help its user but we end up missing out on the benefits of AI bc of catastrophic coordination failures, or bc of misuse, then I think you're saying we didn't solve alignment bc we didn't elicit the benefits?
You discuss this, but I prefer to separate out control and alignment: I wouldn't count us as having solved alignment if we only elicit behavior via intense/exploitative control schemes. So I'd adjust your alignment definition with the extra requirement that we avoided takeover while not doing super-intense control schemes relative to what is acceptable to do to humans today. That is a higher bar, and it separates the definition from the thing we care about --avoiding takeover and eliciting benefits-- but I think that's a better def.
If we have superintelligent agentic AI that tries to help its user but we end up missing out on the benefits of AI bc of catastrophic coordination failures, or bc of misuse, then I think you're saying we didn't solve alignment bc we didn't elicit the benefits?
In my definition, you don't have to actually elicit the benefits. You just need to have gained "access" to the benefits. And I meant this to specifically cover cases like misuse. Quoting from the OP:
“Access” here means something like: being in a position to get these benefits if you want to – e.g., if you direct your AIs to provide such benefits. This means it’s compatible with (2) that people don’t, in fact, choose to use their AIs to get the benefits in question.
- For example: if people choose to not use AI to end disease, but they could’ve done so, this is compatible with (2) in my sense. Same for scenarios where e.g. AGI leads to a totalitarian regime that uses AI centrally in non-beneficial ways.
Re: separating out control and alignment, I agree that there's something intuitive and important about differentiating between control and alignment, where I'd roughly think of control as "you're ensuring good outcomes via influencing the options available to the AI," and alignment as "you're ensuring good outcomes by influencing which options the AI is motivated to pursue." The issue is that in the real world, we almost always get good outcomes via a mix of these -- see, e.g. humans. And as I discuss in the post, I think it's one of the deficiencies of the traditional alignment discourse that it assumes that limiting options is hopeless, and that we need AIs that are motivated to choose desirable options even in arbitrary circumstances and given arbitrary amounts of power over their environment. I've been trying, in this framework, to specifically avoid that implication.
That said, I also acknowledge that there's some intuitive difference between cases in which you've basically got AIs in the position of slaves/prisoners who would kill you as soon as they had any decently-likely-to-succeed chance to do so, and cases in which AIs are substantially intrinsically motivated in desirable ways, but would still kill/disempower you in distant cases with difficult trade-offs (in the same sense that many human personal assistants might kill/disempower their employers in various distant cases). And I agree that it seems a bit weird to talk about having "solved the alignment problem" in the former sort of case. This makes me wonder whether what I should really be talking about is something like "solving the X-risk-from-power-seeking-AI problem," which is the thing I really care about.
Another option would be to include some additional, more moral-patienthood attuned constraint into the definition, such that we specifically require that a "solution" treats the AIs in a morally appropriate way. But I expect this to bring in a bunch of gnarly-ness that is probably best treated separately, despite its importance. Sounds like your definition aims to avoid that gnarly-ness by anchoring on the degree of control we currently use in the human case. That seems like an option too -- though if the AIs aren't moral patients (or if the demands that their moral patienthood gives rise to differ substantially from the human case), then it's unclear that what-we-think-acceptable-in-the-human-case is a good standard to focus on.
Also suggest exploring what it may mean if we are unable to solve the alignment problem for fully autonomous learning machinery.
There will be a [new AI Safety Camp project](https://docs.google.com/document/d/198HoQA600pttXZA8Awo7IQmYHpyHLT49U-pDHbH3LVI/edit) about formalising a model of AGI uncontainability.
People often talk about “solving the alignment problem.” But what is it to do such a thing? I wanted to clarify my thinking about this topic, so I wrote up some notes.
In brief, I’ll say that you’ve solved the alignment problem if you’ve:
become able to elicit some significant portion of those benefits from some of the superintelligent AI agents at stake in (2).[1]
The post also discusses what it would take to do this. In particular:
Thanks to Carl Shulman, Lukas Finnveden, and Ryan Greenblatt for discussion.
1. Avoiding vs. handling vs. solving the problem
What is it to solve the alignment problem? I think the standard at stake can be quite hazy. And when initially reading Bostrom and Yudkowsky, I think the image that built up most prominently in the back of my own mind was something like: “learning how to build AI systems to which we’re happy to hand ~arbitrary power, or whose values we’re happy to see optimized for ~arbitrarily hard.” As I’ll discuss below, I think this is the wrong standard to focus on. But what’s the right standard?
Let’s consider two high level goals:
Avoiding a bad sort of takeover by misaligned AI systems – i.e., one flagrantly contrary to the intentions and interests of human designers/users.[3]
It’s plausible that one of the benefits of vastly-better-than-human AI is access to a safe path to the benefits of as-intelligent-as-physically-possible AI – in which case, cool. But I’m not pre-judging that here.[4]
That said: to the extent you want to make sure you’re able to safely scale further, to even-more-superintelligent-AI, then you likely need to make sure that you’re getting access to whatever benefits merely-superintelligent AI gives in this respect – e.g., help with aligning the next generation of AI.
My basic interest, with respect to the alignment problem, is in successfully achieving both (1) and (2). If we do that, then I will consider my concern about this issue in particular resolved, even if many other issues remain.
Now, you can avoid bad takeover without getting access to the benefits of superintelligent AI. For example, you could not ever build superintelligent AI. Or you could build superintelligent AI but without it being able to access its capabilities in relevantly beneficial ways (for example, because you keep it locked up inside a secure box and never interact with it).
You can also plausibly avoid bad takeover and get access to the benefits of superintelligent AI, but without building the particular sorts of superintelligent AI agents that the alignment discourse paradigmatically fears – i.e. strategically-aware, long-horizon agentic planners with an extremely broad range of vastly superhuman capabilities.
Indeed, I actually think it’s plausible that we could get access to tons of the benefits of superintelligent AI using large numbers of fast-running but only-somewhat-smarter-than-human AI agents, rather than agents that are qualitatively superintelligent. And I think this is likely to be notably safer.[5]
Generally, though, the concern is that we are, in fact, on the path to build superintelligent AI agents of the sort the alignment discourse fears. So I think it’s probably best to define the alignment problem relative to those paths forward. Thus:
Then, further, I’ll say that you avoided or handled the alignment problem “with major loss in access-to-benefits” if you failed to get access to the main benefits of superintelligent AI. And I’ll say that you avoided or handled it “without major loss in access-to-benefits” if you succeeded at getting access to the main benefits of superintelligent AI.
Finally, I’ll say that you’ve solved the alignment problem if you’ve handled it without major loss in access-to-benefits, and become able to elicit some significant portion of those benefits specifically from the dangerous SI-agents you’ve built.
Thus, in a chart:
I’ll focus, in what follows, on solving the problem in this sense. That is: I’ll focus on reaching a scenario where we avoid the bad forms of AI takeover, build superintelligent AI agents, get access to the main benefits of superintelligent AI, and do so, at least in part, via the ability to elicit some of those benefits from SI agents.
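As a rough illustration, the taxonomy can be sketched as a small classifier. This is only a sketch under stated assumptions: the field names and the `classify` function are hypothetical labels, and the readings of "avoided" (no bad takeover, no superintelligent AI agents built) and "handled" (no bad takeover, superintelligent AI agents built) are inferred from how those terms are used elsewhere in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    # All field names are hypothetical stand-ins for conditions named in the text.
    bad_takeover: bool             # did a bad AI takeover occur?
    built_si_agents: bool          # were superintelligent AI agents built?
    main_benefits_accessed: bool   # access to the main benefits of superintelligent AI?
    elicited_from_si_agents: bool  # significant portion of benefits elicited from the SI agents?


def classify(s: Scenario) -> str:
    """Label a scenario using the avoided / handled / solved taxonomy."""
    if s.bad_takeover:
        return "failed"
    loss = "without" if s.main_benefits_accessed else "with"
    if not s.built_si_agents:
        # Assumed reading of "avoided": no takeover and no SI agents built.
        return f"avoided, {loss} major loss in access-to-benefits"
    if s.main_benefits_accessed and s.elicited_from_si_agents:
        # "Solved": handled without major loss, with some benefits elicited from the SI agents.
        return "solved"
    return f"handled, {loss} major loss in access-to-benefits"


# No SI agents built, no takeover, no access to the main benefits.
no_si_agents = Scenario(False, False, False, False)
print(classify(no_si_agents))  # avoided, with major loss in access-to-benefits
```

Note that on this sketch, a scenario that gets the main benefits only from non-agentic or merely-somewhat-superhuman systems, while SI agents exist but are not the source of those benefits, comes out as "handled" rather than "solved."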
However:
Note, though, that to the extent you’re avoiding the problem, there’s a further question whether your plan in this respect is sustainable (after all, as I noted above, we’re currently “avoiding” the problem according to my taxonomy). In particular: are people going to build superintelligent AI agents eventually? What happens then?[6]
So the “avoiding the problem” states will either need to prevent superintelligent AI agents from ever being built, or they’ll transition to either handling the problem, or failing.
And we can say something similar about routes that “handle” the problem, but without getting access to the main benefits of superintelligence. E.g., if those benefits are important to making your path forward sustainable, then “handling it” in this sense may not be enough in the long term.
Admittedly, this is a somewhat deviant definition of “solving the alignment problem.” In particular: it doesn’t assume that our AI systems are “aligned” in a sense that implies sharing our values. For example, it’s compatible with “solving the alignment problem” that you only ever controlled your superintelligences and then successfully elicited the sorts of task performance you wanted, even if those superintelligences do not share your values.
This deviation is on purpose. I think it’s some combination of (a) conceptually unclear and (b) unnecessarily ambitious to focus too much on figuring out how to build AI systems that are “aligned” in some richer sense than I’ve given here. In particular, and as I discuss below, I think this sort of talk too quickly starts to conjure difficulties involved in building AI systems to which we’re happy to hand arbitrary power, or whose values we’re happy to see optimized for arbitrarily hard. I don’t think we should be viewing that as the standard for genuinely solving this problem. (And relatedly, I’m not counting “hand over control of our civilization to a superintelligence/set of superintelligences that we trust arbitrarily much” as one of the “benefits of superintelligence.”)
On the other hand, I also don’t want to use a more minimal definition like “build an AGI that can do blah sort of intense-tech-implying thing with a strawberry while having a less-than-50% chance of killing everyone.” In particular: I’m not here focusing on getting safe access to some specific and as-minimal-as-possible sort of AI capability, which one then intends to use to make things (pivotally?) safer from there. Rather, I want to focus on what it would be to have more fully solved the whole problem (without also implying that we’ve solved it so much that we need to be confident that our solutions will scale indefinitely up through as-superintelligent-as-physically-possible AIs).
2. A framework for thinking about AI safety goals
Let’s look at this conception of “solving the alignment problem” in a bit more detail. In particular, we can think about a given sort of AI safety goal in terms of the following six components:
Scaling: how confident you want to be that the techniques you used to get the relevant safety properties and elicitation would also work on more capable models.[7]
How would we analyze “solving the alignment problem” in terms of these components? Well, the first three components of our AI safety goal are roughly as follows:
OK, but what about the other three components – i.e. competitiveness, verification, and scaling? Here’s how I’m currently thinking about it:
Let’s look at the safety property of “avoiding bad takeover” in more detail.
3. Avoiding bad takeover
We can break down AI takeovers according to three distinctions:
Coordinated vs. uncoordinated: was there a (successful) coordinated effort to disempower humans, or did humans end up disempowered via uncoordinated efforts from many disparate AI systems to seek power for themselves.[8]
This distinction applies most naturally to coordinated takeovers. In uncoordinated takeovers featuring lots of disparate efforts at power-seeking, the ex ante ease or difficulty of those efforts can be more diverse.[9]
That said, even in uncoordinated takeover scenarios, there’s still a question, for each individual act of power-seeking by the uncoordinated AI systems, whether that act was or was not predicted to succeed with high probability.
(There’s some messiness, here, related to how to categorize scenarios where misaligned AI systems coordinate with humans in order to take over. As a first pass, I’ll say that whether or not an AI has to coordinate with humans doesn’t affect the taxonomy above – e.g., if a single AI system coordinates with some humans-with-different-values in order to take over, that still counts as “unilateral.” However, if some humans who participate in a takeover coalition end up with a meaningful share of the actual power to steer the future, and with the ability to pursue their actual values roughly preserved, then I think this doesn’t count as a full AI takeover – though of course it may be quite bad on other grounds.[10])
Each of the takeover scenarios these distinctions carve out has what we might call a “vulnerability-to-alignment condition.” That is, in order for a takeover of the relevant type to occur, the world needs to enter a state where AI systems are in a position to take over in the relevant way, and with the relevant degree of ease. Once you have entered such a state, then avoiding takeover requires that the AI systems in question don’t choose to try to take over, despite being able to (with some probability). So in that sense, your not-getting-taken-over starts loading on the degree of progress in “alignment” you’ve made at that point, and you are correspondingly vulnerable.
So solving the alignment problem involves building superintelligent AI agents, and eliciting some of their main benefits, while also either:
Let’s go through each of these in turn.
3.1 Avoiding vulnerability-to-alignment conditions
What are our prospects with respect to avoiding vulnerability-to-alignment conditions entirely?
The classic AI safety discourse often focuses on safely entering the vulnerability-to-alignment condition associated with easy, unilateral takeovers. That is, the claim/assumption is something like: solving the alignment problem requires being able to build a superintelligent AI agent that has a decisive strategic advantage over the rest of the world, such that it could take over with extreme ease (and via a wide variety of methods), but either (a) ensuring that it doesn’t choose to take over, or (b) ensuring that to the extent it chooses to take over, this is somehow OK.
As I discussed in my post on first critical tries, though, I think it’s plausible that we should be aiming to avoid ever entering into this particular sort of vulnerability-to-alignment condition. That is: even if a superintelligent AI agent would, by default, have a decisive strategic advantage over the present world if it was dropped into this world out of the sky (I don’t even think that this bit is fully clear[11]), this doesn’t mean that by the time we’re actually building such an agent, this advantage would still obtain – and we can work to make it not obtain.
However, for the task of solving the alignment problem as I’ve defined it, I think it’s harder to avoid the vulnerability-to-alignment conditions associated with multilateral takeovers. In particular: consider the following claim:
Again, I don’t think “Need SI-agent to stop SI-agent” is clearly true (more here). But I think it’s at least plausible, and that if true, it’s highly relevant to our ability to avoid vulnerability-to-alignment conditions entirely while also solving/handling (rather than avoiding) the alignment problem. In particular: since solving the alignment problem, in my sense, involves building at least one superintelligent AI agent, Need SI-agent to stop SI-agent implies that this agent would have a DSA absent some other superintelligent AI agent serving as a check on the first agent’s power. And that looks like a scenario vulnerable to the motivations of some set of AI agents – whether in the context of coordination between all these agents, or in the context of uncoordinated power-seeking by all of them (even if those agents don’t choose to coordinate with each other, and choose instead to just compete/fight, their seeking power in problematic ways could still result in the disempowerment of humanity).
Still: I think we should be thinking hard about ways to get access to the main benefits of superintelligence without entering vulnerability-to-alignment conditions, period – whether by avoiding the alignment problem entirely (i.e., per my taxonomy above, by getting the relevant benefits-access without building superintelligent AI agents at all), or by looking for ways that “Need SI-agent to stop SI-agent” might be false, and implementing them.
3.2 Ensuring that AI systems don’t try to take over
Let’s suppose, though, that we need to enter a vulnerability-to-alignment condition of some kind in order to solve the alignment problem. What are our prospects for ensuring that the AI systems in question don’t attempt the sorts of power-seeking that might lead to a takeover?
In my post on “A framework for thinking about AI power-seeking,” I laid out a framework for thinking about choices that potentially-dangerous AI agents will make between (a) seeking power in some problematic way (whether in the context of a unilateral takeover, a coordinated multilateral takeover, or an uncoordinated takeover), or (b) pursuing their “best benign alternative.”[12]
I think about the incentives at stake here in terms of five key factors:
In particular, I highlighted the difference between thinking about “easy” vs. “non-easy” takeovers in this respect.
I think that “ensuring that AI systems don’t try to take over” is where the rubber, for alignment, really meets the road – and I think of the difficulty in exerting the relevant sort of control over an AI’s motivations as the key question re: the difficulty of alignment.
Note, however, that the AI’s internal motivations are basically never going to be the only factor here. Rather, and even in the context of quite easy takeovers, the nature of the AI’s environment is also going to play a key role in determining what options it has available (e.g., what exactly the non-takeover option consists in, what actual paths to takeover are available, what the end result of successful takeover looks like in expectation, etc), and thus in determining what its overall incentives are. In this sense, solving the alignment problem is not purely a matter of technical know-how with respect to understanding and controlling an AI’s internal motivations. Rather, the broader context in which the AI is operating remains persistently relevant – and ongoing changes in that context imply changing standards for motivational understanding/control.
3.3 Ensuring that takeover efforts don’t succeed
Beyond avoiding vulnerability-to-alignment conditions, and ensuring that AIs don’t ever try to take over, there’s also the option of ensuring that takeover efforts do not succeed. This isn’t much help in “easy takeover” scenarios, which by hypothesis are ones in which the AIs in question justifiably predict an extremely high probability of success at takeover if they go for it. And we might worry that building genuinely superintelligent agents will imply entering a vulnerability condition for easy multilateral takeover in particular. But to the extent that it is possible to check the power of superintelligent AI agents using something other than additional superintelligent AI agents (i.e., “Need SI-agent to stop SI-agent” is false), and/or to make it more difficult for superintelligent AI agents to successfully coordinate to take over, measures in this vein can both lower the probability that AIs will try to take over (since they have a lower chance of success), AND make it more likely that if they go for it, their efforts fail.
3.4 Ensuring that the takeover in question is somehow OK
Finally, I want to flag a conception of alignment that I brought up in my last post – namely, one which accepts that AIs are going to take over in some sense, but which aims to make sure that the relevant kind of takeover is somehow benign. Thus, consider the following statement from Yudkowsky’s “List of lethalities”:
Here, Yudkowsky is assuming, per usual, that you are building a superintelligence that will be so powerful that it can take over the world extremely easily.[13] And as I discussed in my last post, his first approach to alignment (e.g., the CEV-style sovereign) seems to assume that the superintelligence in question does indeed take over the world – hopefully, via some comparatively benign and non-violent path – despite its alignment. That is, it becomes a “Sovereign” that no longer accepts any “human input trying to stop it,”[14] and then proceeds (presumably after completing some process of further self-improvement) to optimize all the galaxies extremely intensely according to its values. Luckily, though, its values are exactly right.
I agree with Yudkowsky that if our task is to build a superintelligence (or: the seed of a superintelligence) that we never again get to touch, correct, or shut-down; which will then proceed to seize control of the world and optimize the lightcone extremely hard according to whatever values it ends up with after it finishes some process of further self-modification/improvement; and where those values need to reflect “exactly what we extrapolated-want,” then this task does indeed seem difficult. That is, you have to somehow plant, in the values of this “seed AI,” some pointer to everything that “extrapolated-you” (whatever that is) would eventually want out of a good future; you have to anticipate every single way in which things might go wrong, as the AI continues to self-improve, such that extrapolated-you would’ve wanted to touch/correct/shut-down the process in some way; and you need to successfully solve every such anticipated problem ahead of time, without the benefit of any “redos.” Sounds tough.
Indeed, as I discussed in my last post, my sense is that people immersed in the Bostrom/Yudkowsky alignment discourse sometimes inherit this backdrop sense of difficulty. E.g., someone describes, to them, some alignment proposal. But it seems, so easily, such a very far cry from “and thus, I have made it the case that this AI’s values are exactly right, and I have anticipated and solved every other potential future problem I would want to intervene on the AI’s values/continued-functioning to correct, such that I am now happy to hand final and irrevocable control over our civilization, and of the future more broadly, to whatever process of self-improvement and extreme optimization this AI initiates.” And no wonder: it’s a high standard.
So while on the one hand, meeting the standard at stake in Yudkowsky’s “CEV-style sovereign” approach does indeed seem extremely tough, I also wonder whether, even assuming you are going to irrevocably pass off control of the future to some “incorrigible” process, Yudkowsky’s picture implicitly assumes a degree of required “grip” on that future that is some combination of unrealistic or unnecessary. Unrealistic, because you were never going to get that level of control, even in a more human-centric case. And unnecessary, because in more normal and familiar contexts, you didn’t actually think that level of control required for the future to be good – and perhaps, the thing that made it unnecessary in the human-centric case extends, at least to some extent, to a more AI-centric case as well.
That said, we should note that Yudkowsky’s particular story about “benign takeover,” here, isn’t the only available type. For example: you could, in principle, think that even if the AI takes over, it’s possible to get a good future without causing the AI to have exactly the right values. You could think this, for example, if you reject the “fragility of value” thesis, applied to humans with respect to AIs.
My own take, though, is that “accept that the AIs will take over, but make it the case that their doing so is somehow OK” is an extremely risky strategy that we should be viewing as a kind of last resort.[15] So I’ll generally focus, in thinking about solving the alignment problem, on routes that don’t involve letting the AI take over at all.
3.5 What’s the role of “corrigibility” here?
In the quote from Yudkowsky above, he contrasts the “CEV-style sovereign” approach to alignment with an alternative that he associates with the term “corrigibility.” So I want to pause, here, to address the role of the notion of “corrigibility” in what I’ve said thus far.
3.5.1 Some definitions of corrigibility
What is “corrigibility”? People say various different things. For example:
A loyal assistant, by contrast, is more intuitively “pliable,” “obedient,” “docile.” If you give it some instruction, or tell it to stop what it’s doing, or to submit to getting its values changed, it obeys in some manner that is (elusively) more directly responsive to the bare fact that you gave this instruction, rather than in a way mediated via its own calculation as to whether obedience conduces to its own independent goals (except, perhaps, insofar as its goals are focused directly on some concept like “following-instructions,” “obedience,” “helpfulness,” “being whatever-the-hell-is-meant-by-the-term-‘corrigible,’” etc). In this sense, despite satisfying the agential pre-requisites I describe here, it functions, intuitively, more like a tool.[16] And I think people sometimes use the term “corrigibility” as a stand-in for vibes in this broad vein.
And note that an aspiration to build loyal assistants also gives rise to a number of distinctive ethical questions in the context of AI moral patienthood. That is: building independent, autonomous agents that share our values is one thing. Building servants – even happy, willing servants – is another.
My own sense is that the term “corrigibility” is probably best used, specifically, to indicate something like “doesn’t resist shut-down/values-modification” – and that’s how I’ll use it here. And I think that insofar as “shut yourself down” or “submit to values-modification” are candidate instructions we might give to an AI system, something like “loyal servant” strongly implies something like corrigibility as well.
I’ll note, though, that I think “doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies” picks out something importantly broader, and corrigibility in the sense just discussed isn’t the only way to get it. In particular: there are possible agents that (a) don’t want exactly what you want, (b) resist shut-down/value-modification, (c) don’t try to kill you/take-over-the-galaxies. Notably, for example, humans fit this definition with respect to one another – they don’t want exactly the same things, and their incentives are such that they will resist being murdered, brain-washed, etc, but their incentives aren’t such that it makes sense, given their constraints, to try to kill everyone else and take over the world.
Of course, if we follow Yudkowsky in imagining that our AI systems are enormously powerful relative to their environment, or at least relative to humanity, then we might expect a stronger link between “resists shut-down/values-modification” and “tries to take-over.” In particular: you might think that taking-over is one especially robust way to avoid being shut-down/values-modified, such that if taking over is sufficiently free, an agent disposed to resist shut-down/values-modification will be disposed to take-over as part of that effort.
Even in the context of such highly capable AIs, though, we should be careful in moving too quickly from “resists shut-down/values-modification” to “tries to take over.” For example, if taking over involves killing everyone, it’s comparatively easy to imagine (even if not: to create) AIs that are sufficiently inhibited with respect to killing everyone that they won’t engage in takeover via such a path, even if they would resist other types of shut-down/values-modification (consider, for example, humans who would try to protect themselves if Bob tried to kill/brainwash them, but not at the cost of omnicide – and this even despite not wanting exactly what Bob extrapolated-wants). And similarly, we can imagine AIs who place some intrinsic disvalue on having-taken-over, even in a non-violent manner, such that they won’t go for it as an extension of resisting shut-down etc.
3.5.2 Is corrigibility necessary for “solving alignment”?
Is corrigibility necessary for “solving alignment,” at least if we don’t want to bank on “let the AIs take over, but make that somehow OK”?
I tend to think it’s specifically takeover that we should be concerned about, in the context of solving the alignment problem, rather than corrigibility. That is: if, for some reason, we do in fact create superintelligent agents that resist shut-down/values-modification, but which don’t also take over, then (depending on what share of power we’ve lost), I don’t think the game is over – at least not by definition. For example: those agents might be comparatively content with protecting whatever share of power they have, but not interested in disempowering humans further – and thus, even if we remain unable to shut them down or modify them given their resistance, their presence in the world is plausibly more compatible with humans maintaining a lot of control over a lot of stuff (even if not: over those AIs in particular, at least within some domain).
That said, at least if we were setting aside moral patienthood concerns, then other things equal I do think that we probably want to be able to shut down our AIs when we want to, and/or to modify their values in an ongoing way, without them resisting. And being able to do this seems notably correlated with worlds where we are able to shape their motivations to avoid other forms of problematic power-seeking. So at least modulo moral patienthood stuff, I do expect that many of the worlds in which we solve the alignment problem, in the sense of building SI agents while avoiding takeover, will involve building corrigible SI agents in particular.
Indeed: when I personally imagine a world where we have “solved the alignment problem without major access-to-benefits loss,” I tend to imagine, first, a world where we have successfully built superintelligent AI agents that function, basically, as loyal servants.[17] That is: we ask them to do stuff, and then they do it, very competently, the way we broadly intended for them to do it – like how it is with Claude etc, when things go well. Hence, indeed, our “access” to the benefits they provide. We have access in the sense that, if we asked for a given benefit, or a given type of task-performance, they would provide it. But by extension, indeed: if we asked them to stop/shut-down, they would stop/shut-down; if we asked them to submit to retraining, they would so submit, etc.
This vision, though, does indeed raise the ethical concerns I noted above. And it’s not the only vision available. There are also worlds, for example, where AI agents end up functioning more like human citizens/employees – and in particular, where they are not expected to submit to arbitrary types of shut-down/values-modification, but where they are nevertheless adequately constrained by various norms, incentives, and ethical inhibitions that they don’t engage in a bad takeover, either. And I think we should be interested in models of that kind as well.
3.5.3 Does ensuring corrigibility raise issues that avoiding takeover does not?
Does corrigibility raise issues that takeover-prevention does not? I haven’t thought about the issue in much depth, but at a glance, I’m not sure why it would. In particular: I think that resisting shut-down, and resisting values-modification, are themselves just a certain type of problematic power-seeking. So in principle, we can just plug such actions into the framework I discussed above, and analyze the incentives at stake in a very similar way. That is, we can ask, of a given context of choice: exactly how much benefit would the AI derive via successful power-seeking of this kind, what’s the AI’s probability of success at the relevant sort of power-seeking, what sorts of inhibitions might block it from attempting this form of power-seeking, how easily can it route around those inhibitions, what’s the downside risk, etc.
And the “classic argument” for expecting incorrigibility will be roughly similar to the “classic argument” for expecting takeover – that is, that an ultra-powerful AI system with a component of (sufficiently long-horizon) consequentialism in its motivations will derive at least some benefit, relative to the status quo, from preventing shut-down/values-modification, and that it will be so powerful/likely to succeed/able-to-route-around-its-inhibitions that there won’t be any competing considerations that outweigh this benefit or block the path to getting it. But as in the classic argument for expecting takeover, if we weaken the assumption that the relevant form of power-seeking is extremely likely to succeed via a wide variety of methods, the incentives at play become more complicated. And if we introduce the ability to exert fairly direct influence on the AI’s values – sufficient to give it very robust inhibitions, or sufficient to make it intrinsically averse to the end-state of the relevant form of power-seeking (i.e., intrinsically averse to “undermining human control,” “not following instructions,” “messing with the off-switch,” etc) – the argument plausibly weakens even in the cases where the relevant form of problematic power-seeking is quite “easy.” And as in the case of takeover, if you can improve the AI’s “best benign option,” this might help as well.
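To make the shape of this incentive analysis concrete, here is a toy sketch. It is not the framework itself: it assumes, purely for illustration, that the AI compares a simple expected value of attempting the power-seeking against the value of its best benign option, and that inhibitions act as a flat cost on the attempt. All names and numbers are hypothetical, in arbitrary units.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Incentives:
    # Every field is a hypothetical numeric stand-in for a factor named in the text.
    p_success: float        # AI's estimated probability that the power-seeking succeeds
    value_success: float    # how much it values the end state of successful power-seeking
    value_failure: float    # how much it values the outcome of a failed attempt (downside risk)
    inhibition_cost: float  # disvalue it places on making the attempt at all (toy assumption)
    value_benign: float     # value of its best benign alternative


def attempt_value(i: Incentives) -> float:
    """Expected value of attempting the power-seeking, net of inhibitions."""
    return (i.p_success * i.value_success
            + (1.0 - i.p_success) * i.value_failure
            - i.inhibition_cost)


def prefers_power_seeking(i: Incentives) -> bool:
    """True if the attempt looks better to the AI than its best benign option."""
    return attempt_value(i) > i.value_benign


# An "easy" case: success is near-certain and inhibitions are weak.
easy = Incentives(p_success=0.99, value_success=10.0, value_failure=-5.0,
                  inhibition_cost=1.0, value_benign=3.0)

# Three ways of changing the incentives, each varying one factor.
harder = replace(easy, p_success=0.3)           # success is less likely
inhibited = replace(easy, inhibition_cost=8.0)  # much more robust inhibitions
better_benign = replace(easy, value_benign=9.0) # improved best benign option

for name, case in [("easy", easy), ("harder", harder),
                   ("inhibited", inhibited), ("better_benign", better_benign)]:
    print(name, prefers_power_seeking(case))
```

With these made-up numbers, only the "easy" case favors the attempt. Each of the other three cases flips the choice by changing a single factor, which is the sense in which the incentives "become more complicated" once the classic assumptions are weakened.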
4. Desired elicitation
So far, and modulo the interlude on corrigibility, I’ve focused centrally on the “avoiding bad takeover” aspect of solving the alignment problem. But I said, above, that we were interested specifically in handling the alignment problem without major access-to-benefits loss, and I’ve defined “solving the problem” such that at least some of these benefits needed to be elicited, specifically, from the SI agents we’ve built.
And indeed, the idea that you need to elicit various of an SI-agent’s capabilities plays an important role in constraining the solution space to preventing takeover. Thus, for example, insofar as your approach to avoiding takeover involves building an SI-agent that operates with extremely intense inhibitions – well, these inhibitions need to be compatible with also eliciting from the AI system whatever access-to-benefits we’re imagining we need it to provide. And you can’t make it intrinsically averse to all forms of power-seeking, shut-down-aversion, prevention-of-values-modification, etc either – since, plausibly, it does in fact need to do some versions of these things in some contexts.
I’m not, here, going to examine the topic of eliciting desired task-performance from SI agents in much depth. But I’ll say a few things about our prospects here.
When we talk about eliciting desired task-performance from a superintelligent agent, we’re specifically talking about causing this agent to do something that it is able to do. That is, we’re not, here, worried about “getting the capability into the agent.” Rather, granted that a capability is in the agent, we’re worried about getting it out.
In this sense, elicitation is separable from capabilities development. Note, though, that in practice, the two are also closely tied. That is, when we speak about the various incentives in the world that push towards capabilities development, they specifically push towards the development of capabilities that you are able to elicit in the way you want. If the capabilities in question remain locked up inside the model, that’s little help to anyone, even the most incautious AI actors who are “focusing solely on capabilities.”
Admittedly, it’s a little bit conceptually fuzzy what it takes for a capability to be “in” a model, but for you to be unable to elicit it.
Here, we’re specifically talking about eliciting desired task-performance of a superintelligent agent that satisfies the agential pre-requisites and goal-content pre-requisites I describe here. So it’s natural, in that context, to use the agency-loaded frame in particular – that is, to talk about how the AI would evaluate different plans that involve using its capabilities in different ways.[18]
And if we’re thinking in these terms, we can modify the framework I used re: takeover seeking above to reflect an important difference between various non-takeover options: namely, that some of them involve doing the task in the desired way, and some of them do not. In a diagram:
That is: above we discussed our prospects for avoiding a scenario where the AI chooses its favorite takeover option. But in order to get desired elicitation, we need to do something else: namely, we need to make sure that from among the AI’s non-takeover options, it specifically chooses to “do the task in the desired way,” rather than to do something else.[19] (Let’s assume that the AI knows that doing the task in the desired way is one of its options – or at least, that trying to do the task in this way is one of its options.)
Ok, those were some comments on desired elicitation. Now I want to say a few things about the role of “verification” in the dynamics discussed so far.
5. The role of verification
In my discussion of “verification” in section 2, I said that we don’t, strictly, need to “verify” that our aims with respect to ensuring safety properties (i.e., avoiding takeover) or elicitation properties are satisfied with respect to a given AI – what matters is that they are in fact satisfied, even if we aren’t confident that this is the case. Still, I think verification plays an important role, both with respect to avoiding takeover, and with respect to desired elicitation – and I want to talk about it a bit here.
Here I’m going to use the notion of “verification” in a somewhat non-standard way, and say that you have “verified” the presence of some property X if you have reached justifiable levels of confidence in this property obtaining. This means that, for example, you’re in a position to “verify” that there isn’t a giant pot of green spaghetti floating on the far side of the sun right now, even though you haven’t, like, gone to check. This break from standard usage isn’t ideal, but I’m sticking with it for now. In particular: I think that ultimately, “justifiable confidence” is the thing we typically care about in the context of verification.
Let’s say that if you are proceeding with an approach to the alignment problem that involves not verifying (i.e., not being justifiably confident) that a given sort of property obtains, then you are using a “cross-your-fingers” strategy.[20] Such strategies are indeed available in principle. And I suspect that they will be unfortunately common in practice as well. But verification still matters, for a number of reasons.
The first is the obvious fact that cross-your-fingers strategies seem scary. In particular, insofar as a given type of safety property is critical to avoiding takeover/omnicide (e.g., a property like “will not try to take over on the input I’m about to give it”), then ongoing uncertainty about whether it obtains corresponds to ongoing ex ante uncertainty about whether you’re headed towards takeover/omnicide.
Even absent these “we all die if X property doesn’t obtain” type cases, though, it can still be very useful and important to know if X obtains, including in the context of capability-elicitation absent takeover. Thus, for example, if we want our superintelligent AI agent to be helping us cure cancer, or design some new type of solar cell, or to make on-the-fly decisions during some kind of military engagement, it’s at least nice to feel confident that it’s actually doing so in the way we want (even if we’re independently confident that it isn’t trying to take over).
What’s more: our ability to verify that some property holds of an AI’s output or behavior is often, plausibly, quite important to our ability to cause the AI to produce output/behavior with the property in question. That is: verification is often closely tied to elicitation. This is plausible in the context of contemporary machine learning, for example, where training signals are our central means of shaping the behavior of our AIs. But it also holds in the context of designing functional artifacts more generally. I.e., the process of trying something out, seeing if it has a desired property, then iterating until it does, will likely be key to less ML-ish AI development pathways too – but the “seeing if it has a desired property” aspect requires a kind of verification.
Let’s look at our options for verification in a bit more depth.
5.1 Output-focused verification and process-focused verification
Suppose that you have some process P that produces some output O. In this context, in particular, we’re wondering about a process P that includes (a) some process for creating a superintelligent AI agent, and (b) that AI agent producing some output – e.g., a new solar cell, a set of instructions for a wet-lab doing experiments on nano-technology, some code to be used in a company’s code-base, some research on alignment, etc.
You’d like to verify (i.e., become justifiably confident) that this output has some property X – for example, that the solar cell/wet-lab/code will work as intended, that it won’t lead to or promote a takeover somehow, etc. What would it take to do this?
We can distinguish, roughly, between two possible focal points of your justification: namely, output O, and process P. Let’s say that your justification is “output-focused” if it focuses on the former, and “process-focused” if it focuses on the latter.
Most real-world justificatory practices, re: the desirability of some output, mix output-focused and process-focused justification together. Indeed, in theory, it can be somewhat hard to find a case of pure output-focused justification – i.e., justification that holds in equal force totally regardless of the process producing the output being examined.
One candidate purely output-focused justification might be: if you ask any process to give you the prime factors of some semiprime i, then no matter what that process is, you’ll be able to verify, at least, that the numbers produced, when multiplied together, do in fact equal i (for some set of reasonable numbers, at least).[21]
E.g., at least within reasonable constraints, even a wildly intelligent superintelligence can’t give you two (reasonable) numbers, here, such that you’ll get this wrong.[22]
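To make the factoring example concrete, here is a minimal sketch of a purely output-focused check. The function names (`product_matches`, `untrusted_factorizer`) are hypothetical and chosen only for illustration, and rejecting trivial factors such as 1 is an added assumption rather than something the example above requires. As in the footnote, the check sets aside whether the returned numbers are themselves prime.

```python
from math import prod


def product_matches(target: int, claimed_factors: list[int]) -> bool:
    """Output-focused check: accept iff the claimed factors multiply to target.

    The check never inspects how the factors were produced.
    Assumption (not in the text): factors <= 1 are rejected as trivial.
    Primality of the factors is deliberately not checked.
    """
    if any(f <= 1 for f in claimed_factors):
        return False
    return prod(claimed_factors) == target


def untrusted_factorizer(target: int) -> list[int]:
    """Hypothetical stand-in for an arbitrary process P (here: trial division)."""
    d = 2
    while d * d <= target:
        if target % d == 0:
            return [d, target // d]
        d += 1
    return [1, target]  # no nontrivial split found


if __name__ == "__main__":
    n = 15
    answer = untrusted_factorizer(n)
    print(answer, product_matches(n, answer))
```

The point of the sketch is that `product_matches` would give the same verdict whichever process had supplied `claimed_factors`; the only trust required is in the multiplication itself.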
Indeed, in some sense, we can view a decent portion of the alignment problem as arising from having to deal with output produced by a wider and more sophisticated range of processes than we’re used to, such that our usual balance between output-focus and process-focus in verifying stuff is disrupted. In particular: as these processes are more able to deceive you, manipulate you, tamper with your measurements, etc – and/or as they are operating in domains and at speeds that you can’t realistically understand or track – then your verification processes have to rely less and less on the sort of output-focused justification of the form “I checked it myself,” and they need to fall back more and more either on (a) process-focused justification, or (b) on deference to some other non-correlated process that is evaluating the output in question.
Correspondingly, I think, we can view a decent portion of our task, with respect to the alignment problem, as accomplishing the right form of “epistemic bootstrapping.”[23] That is, we currently have some ability to evaluate different types of outputs directly, and we have some set of epistemic processes in the world that we trust to different degrees. As we incorporate more and more AI labor into our epistemic toolkit, we need to find a way to build up justifiable trust in the output of this labor, so that it can then itself enter into our epistemic processes in a way that preserves and extends our epistemic grip on the world. If we can do this in the right order, then the reach of our justified trust can extend further and further, such that we can remain confident in the desirability of what’s going on with the various processes shaping our world, even as they become increasingly “beyond our ken” in some more direct sense.
5.2 Does output-focused verification unlock desired elicitation?
Now, above I mentioned a general connection between verification and elicitation, on which being able to tell whether you’re getting output with property X (whether by examining the output itself, or by examining the process that created it) is important to being able to create output with property X. In the context of ML, we can also consider a more specific hypothesis, which I discussed in my post “The ‘no sandbagging on checkable tasks’ hypothesis,” according to which, roughly, the ability to verify (or perhaps: to verify in some suitably output-focused way?) the presence of some property X in some output O implies, in most relevant cases, the ability to elicit output with property X from an AI capable of producing it.
In that post, I didn’t dwell too much on what it takes for something to be “checkable.” The paradigm notion of “checkability,” though, is heavily output-focused. That is, roughly, we imagine some process that mostly treats the AI as a black box, but which examines the AI’s output for whether it has the desired property, then rewards/updates the model based on this assessment. And the question is whether this broad sort of training would be enough for desired elicitation.
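As a toy illustration of this paradigm, the sketch below treats a “model” as a black box, checks only its output for a property X, and updates the model from that verdict alone. Everything here is a hypothetical stand-in: `ToyPolicy`, `checker`, the choice of “is even” as property X, and the multiplicative update rule are assumptions made for the example, not a description of any real training setup. In particular, the toy policy has no ability to sandbag strategically, so the sketch shows the shape of the loop rather than evidence for the hypothesis.

```python
import random


def checker(output: int) -> bool:
    """Hypothetical output-focused check for property X (here: the output is even)."""
    return output % 2 == 0


class ToyPolicy:
    """Hypothetical black-box 'model': samples one of a fixed set of outputs."""

    def __init__(self, candidates, seed=0):
        self.candidates = list(candidates)
        self.weights = [1.0] * len(self.candidates)
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def sample(self) -> int:
        """Return the index of a candidate, drawn in proportion to its weight."""
        indices = range(len(self.candidates))
        return self.rng.choices(indices, weights=self.weights)[0]

    def update(self, index: int, rewarded: bool) -> None:
        """Upweight an output that passed the check; downweight one that failed."""
        self.weights[index] *= 1.5 if rewarded else 0.5


def train(policy: ToyPolicy, steps: int = 200) -> ToyPolicy:
    """Sample, check the output only, and update from the verdict."""
    for _ in range(steps):
        i = policy.sample()
        policy.update(i, checker(policy.candidates[i]))
    return policy


if __name__ == "__main__":
    trained = train(ToyPolicy([1, 2, 3, 4]))
    for value, weight in zip(trained.candidates, trained.weights):
        print(value, weight)
```

Note that `train` never looks inside `ToyPolicy` beyond calling `sample` and `update`; all of the signal comes from `checker` applied to the output.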
If the “no sandbagging on checkable tasks” hypothesis were true of superintelligent AI agents, for a heavily output-focused notion of checkable, and you could make the task performance you want to elicit output-focused-“checkable” in the relevant sense, then you could get desired elicitation this way. And note, as ever, that the type of output-focused checkability at stake, here, can draw on much more than unaided human labor. That is, we should imagine humans assisted by AIs doing whatever we justifiably trust them to do (assuming this trust is suitably independent from our trust in the process whose output is being evaluated). This is closely related to our prospects for “scalable oversight.”
In general, I think it’s an interesting question exactly how difficult it would be to output-verify the sorts of task-performance at stake in “access to the main benefits of superintelligent AI.” For various salient tasks – e.g. curing cancer, vastly improving our scientific understanding, creating radical abundance, etc (I think it would be useful to develop a longer list here and look at it in more detail) – my suspicion is that we can, in fact, output-focused verify much of what we want, at least according to the normal sorts of standards we would use in other contexts. E.g., and especially with AI help, I think we can probably recognize a functional and not-catastrophically-harmful cancer cure, solar cell, etc if our AIs produced one.
However, at the least, and even in the context of heavily output-focused forms of “checking,” I think we are likely going to need some aspect of process-focused verification as well, to rule out cases where the AIs are messing with our output-focused verification in more sophisticated ways – e.g., faking data, messing with measurement devices, etc.[24]
More broadly, though, it also seems possible that even if we can rule out various flagrant forms of measurement tampering, much of the task-performance we want out of superintelligent agents will end up quite difficult to verify in an output-focused way, even using scalable methods. For example, maybe this task performance involves working in a qualitatively new domain that even our scalable-oversight methods can’t “reach” epistemically.
5.3 What are our options for process-focused verification?
Given the possible difficulties with relying centrally on output-focused verification, what are our options for more process-focused types of verification?
I won’t examine the issue in much depth here, but here are a few routes that are currently salient to me:
Imitation learning: another sort of process-focused argument you could give would be something like: “we trained this agent via imitation learning on human data to be like a human in a blah way. We claim that in virtue of this, we can trust it to be producing output with property X in blah context we can’t output-verify.”[25]
Plausible that this is actually just a sub-variant of a “generalization + 'no successful adversariality'” argument. That is, plausibly you need to really be saying “it was like a human in blah way in these other contexts, and if it remains like a human in blah way in this context we can’t output-verify then things are good, and we do expect it to generalize in this way for blah reasons (including: that it’s not being successfully actively adversarial).” But I thought I’d flag it separately regardless.
A few other notes:
In general, I expect our actual practices of verification to mix output-focus and process-focus together heavily. E.g., you try your best to evaluate the output directly, and you also try your best to understand the trustworthiness of the process – and you hope that these two, together, can add up to justified confidence in the output’s desirability.
6. Does solving the alignment problem require some very sophisticated philosophical achievement re: our values on reflection?
I want to close with a discussion of whether solving the alignment problem in the sense I’ve described requires some very sophisticated philosophical (not to mention technical) achievement – and in particular, whether it requires successfully pointing an AI at some object like our “values on reflection,” our “coherent extrapolated volition,” or some such.
As I noted above, I think the alignment discourse is haunted by some sense that this sort of philosophical achievement is necessary.
My current guess, though, is that we don’t actually need to successfully point at (and get an AI to care intrinsically about) some esoteric object like our “values on reflection” in order to solve alignment in the sense I’ve outlined. And good thing, too, because I think our “values on reflection” may not be a well-defined object at all.
One intuition pump here is: in the current, everyday world, basically no one goes around with much of a sense of what people’s “values on reflection” are, or where they lead. Rather, we behave in desirable ways, vis-a-vis each other, by adhering to various shared, common-sense norms and standards of behavior, and in particular, by avoiding forms of behavior that would be flagrantly undesirable according to this current concrete person – or perhaps, according to some minimally extrapolated version of this person (i.e., what this person would think if they knew a bit more about the situation, rather than about what they would think if they had a brain the size of a galaxy).
What’s more, and even if we do end up needing to deal with edge cases or with a bunch of gnarly ethical/philosophical questions in order to get non-takeover/desired elicitation from our AIs, I think it’s plausible that getting access to something like an “honest oracle” – that is, an AI that will answer questions for us honestly, to the best of its ability – is enough to get us most of what we want here – and indeed, perhaps most of what’s available even in principle. And I think an “honest oracle” is a meaningfully more minimal standard than “an AI that cares intrinsically about your values-on-reflection.”
Here I’m roughly imagining something like: if you have an honest oracle, you can in principle ask it a zillion questions like: “if we do blah thing, is it going to lead to something I would immediately regret if I knew about it,” “what would I think about this thing if ten copies of me debated about it in the following scenario for the following amount of time,” “is there something about this thing that I’d probably really want to know that I don’t know right now?,” etc.[26] And as I discussed in “on the limits to idealized values,” I think the full set of answers to questions like this is probably ~all that the notion of your “values on reflection” comes down to.
That is, ultimately, there is just the empirical pattern of: what you would think/feel/value given a zillion different hypothetical processes; what you would think/feel/value about those processes given a zillion different other hypothetical processes; and so on. And you need to choose, now, in your actual concrete circumstance, which of those hypotheticals to give authority to.
So in a sense, on this picture, an honest oracle would give you access to ~everything there is to access about your values on reflection. The rest is on you, now.
Now, of course, there are lots of questions we can raise about ways that honest oracles can be dangerous, and/or extremely difficult, in themselves, to create (though note that an honest oracle doesn’t need to be a unitary mind – rather, it just needs to be some reliable process for eliciting the answers to the questions at stake). And as I noted above, notions like honesty, non-manipulation, and so on do themselves admit of various tough edge cases. I’m skeptical, though, that resolving all of these edges adequately itself requires reference to our full values-on-reflection (i.e., I think that good-enough concepts of “honesty” and “non-manipulation” are likely to be simpler and more natural objects than the full details of our full-values-on-reflection, whatever those are). And as above, I think it’s plausible that if you can just get AIs that aren’t dishonest or manipulative in non-edge-case ways, this goes a ton of the way.
We can also ask questions about how far we could get with more minimal sorts of “oracle”-like AIs. Thus, an “honest oracle” is intuitively up for trying to answer questions about weird counterfactual universes, somewhat ill-specified questions, and the like – questions like “would I regret this if a million copies of me went off into a separate realm and thought about it in blah way.” But we can also consider “prediction oracles” that only answer questions about different physically-possible branches of our current universe, “specified-question” oracles that only answer questions specified with suitable precision, and the like. And these may be easier to train in various ways.[27]
7. Wrapping up
OK, those were some disparate reflections on what’s involved in solving the alignment problem. Admittedly, it’s a lot of taxonomizing, defining-things, etc – and it’s not clear exactly what role this sort of conceptual work plays in orienting us towards the problem. But I’ve found that for me, at least, it’s useful to have a clear picture of what the high level aim is and is not, here, so that I can keep a consistent grip on how hard to expect the problem to be, and on what paths might be available for solving it.
This is a somewhat deviant definition, in that it doesn’t require that you’ve created a superintelligence that is in some sense aimed at your values/intentions etc. But that’s on purpose.
The term "epistemic bootstrapping" is from Carl Shulman.
I have to specify “bad,” here, because some conceptions of alignment that I’ll discuss below countenance “good” forms of AI takeover.
And more generally, it seems to me that ensuring that humanity gets the benefits of as-intelligent-as-physically-possible AI, even conditional on getting the benefits of superintelligence, is very much not my job.
Thanks to Ryan Greenblatt for conversation on this front.
Thanks to Ryan Greenblatt for discussion.
This is going to be relative to some development pathway for those more capable models.
I’ll count it as “uncoordinated” if many disparate AI systems go rogue and succeed at escaping human control, but then after fighting amongst themselves one faction emerges victorious.
In principle different AI systems participating in a coordinated takeover could predict different odds of success, but I’ll ignore this for now.
If misaligned AIs end up controlling ~all future resources, but humans end up with some tiny portion, I’ll say that this still counts as a takeover – albeit, one that some human value systems might be comparatively OK with.
I grant that a sufficiently superintelligent agent would have a DSA of this kind; but whether the least-smart agent that still qualifies as “superintelligent” would have such an advantage is a different question.
I focus on actions directly aimed at takeover here, but to the extent that uncoordinated takeovers involve AIs acting to secure other forms of more limited power, without aiming directly at takeover, a roughly similar analysis would apply – i.e., just replace “takeover” with “securing blah kind of more limited power”; and think of “easiness” in terms of how easy or hard it would be for the effort to secure this power to succeed.
See Lethality 2: “A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.” Though note that “sufficiently high” is doing a lot of work in the plausibility of this claim – and our real-world task need not necessarily involve building an AI system with cognitive powers that are that high.
Here I think we should be interpreting the input in question in terms of the sorts of “corrections” at stake in Yudkowsky’s notion of “corrigibility” – e.g., shutting down the AI, or changing its values. A benign sovereign AI might still give humans other kinds of input – e.g., because it might value human autonomy (though I think the line between this and “corrigibility” might get blurry).
And note that to meet my definition of “solving the alignment problem without access-to-benefits loss,” we’d need to assume that “somehow OK” here means that those benefits are relevantly accessible.
Of course, depending on the specific way it obeys instructions, you can potentially turn a loyal assistant into something like an “agent that shares your values” by asking it to just act like an agent that shares your values and to ignore all future instructions to the contrary. But the two categories remain distinct.
I then have to modulate this vision to accommodate concerns about moral patienthood.
Note, though, that this approach brings in a substantive assumption: namely, that to the extent you are eliciting desired task-performance from the AI in question, you are specifically doing so from the AI qua potentially-dangerous-agent. That is, when the AI is doing the task, it is doing so in a manner driven by its planning capability, employing its situational awareness, etc.
It’s conceptually possible that you could get desired task performance without drawing on the AI’s dangerous agential-ness in this way. E.g., the image would be something like: sure, sometimes the AI sits around deciding between take-over plans and other alternatives, and having its behavior coherently driven by that decision-making. But when it’s doing the sorts of tasks you want it to do, it’s doing those in some manner that is more on “autopilot,” or more driven by sphex-ish heuristics/unplanned impulses etc.
That said, this approach starts to look a lot like “build a dangerous SI agent but don’t use it to get the benefits of superintelligence.” E.g., here you’ve built a dangerous SI agent, but you’re not using it qua dangerous to get the benefits of superintelligence. At which point: why did you build it at all?
Because this is specifically an elicitation problem, we’re assuming that the AI has this as an option.
Obviously, in reality there are different degrees of crossing-your-fingers, corresponding to different amounts of justifiable confidence, but let’s use a simple binary for now.
I’m setting aside whether you can verify that those numbers are prime.
Note that you’re allowed to use tools like calculators here, even though your reasons for trusting those tools might be “process-inclusive.” What matters is that your justification for believing that property X holds makes minimal reference to the process that produced the output in question, or to other processes whose trustworthiness is highly correlated with that process (the calculator’s trustworthiness isn’t).
This is a term from Carl Shulman.
Thanks to Ryan Greenblatt for extensive discussion here.
Thanks to Collin Burns for discussion.
Thanks to Carl Shulman and Lukas Finnveden for discussion here.
See e.g. the ELK report’s discussion of “narrow elicitation,” and the corresponding attempt to define a utility function given success at narrow elicitation, for some efforts in this vein (my impression is that an “honest oracle” in my sense is more akin to what the ELK report calls “ambitious ELK” – though maybe even ambitious ELK is limited to questions about our universe?).