a company that pays its employees below-subsistence wages will get outcompeted by companies that offer better conditions... once we automate a large fraction of the economy and society, this relationship between competitiveness and being beneficial to humans can cease to hold
Walmart is one of the biggest employers in the world, and its salaries are notoriously so low that a large percentage of its employees depend on welfare to survive (in addition to their Walmart salary). The economy is already pretty far from what I'd call aligned. If we want to align it, the best time to start was a couple of centuries ago; the second best time is now. Let's not wait until AI increases the concentration of power even more.
I think some things we can do to better our chances include:
What about quickly distributing frontier AI once it is shown to be safe? That is of course risky if it isn't safe; however, if the deployed AI is as powerful as possible and distributed as widely as possible, then a bad AI would need to be comparatively more powerful to take over.
So
AI(x-1) is everywhere and protecting as much as possible, AI(x) is sandboxed
VS
AI(x-2) is protecting everything, AI(x-1) is in a few places, AI(x) is sandboxed.
Or the bad AI is able to hack every copy of the widely distributed AI the same way, making the question moot.
Right, and it would be easier to hack, since it has the same adversarial examples, right?
Oh, wait, I see what you're saying. No, I think hacking x-1 and x-2 will both be trivial. AIs have essentially zero security right now.
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. That is, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large it is.)
To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.
But that complaint aside: sure, all else being equal, all of the points you mention seem better to have than not to have.
Excellent post. I think this is not a plan that's likely to succeed, but I think you've correctly and explicitly laid out the plan that many are following without being explicit about it - and therefore its limitations.
I'm very curious how many alignment researchers would agree that this is roughly their plan.
Summary: Many people seem to put their hopes on something like the following "plan":
If this is true, I think there should be more acknowledgment of this fact, and more discussion of the failure modes of this plan.
Epistemic status: Descriptive, rather than normative.
Descriptive, rather than Normative
I label the epistemic status of this post as "descriptive, rather than normative". What do I mean by that? And what do I mean by alignment "plan"?
While I have thought a lot about AI alignment, I still have many uncertainties about the topic. And I don't have any plan for helping us build beneficial AGI that I would be optimistic about. But I keep working on this, and I have opinions and preferences over which projects to undertake. So, the question I ask in this post is: To the extent that my actions and beliefs seem to be in line with any plan at all, what "plan" do they seem to be following?
I should disclaim that my actual beliefs are a bit more nuanced than the description given here. But for the sake of brevity, I will stick with the simpler formulations below.
The main reason I write this post is that I suspect that many other people might be putting their hopes into a "plan" similar to what I describe. (In the case of alignment researchers, this might be explicit and due to the absence of better ideas. In the case of capabilities researchers, this might happen implicitly, as a result of background assumptions and of not having thought about the topic.) To the extent that this is the case, I think it would be useful to acknowledge that this is what is happening, such that we can discuss the plan explicitly. To the extent that other people have a significantly different plan, I would be curious to know what that plan is.
Finally, note that I make no claim that the plan described here, or even my more nuanced version of it, is good. In fact, I do not think it is good --- I just don't have a better one. And I think that explicitly describing the plan is the first step towards improving it.
My Alignment "Plan"
(Optional) Step 0: Hope, or assume, that sharp left turn will not happen.
By sharp left turn, I mean a scenario where an AI undergoes a sudden and extreme growth in capability, possibly until it becomes vastly more powerful than anything else around it. Some people seem convinced that sharp left turn cannot, or will not, happen. I think that being confident about this is misguided.[1]
However, it does seem plausible to me that we live in a universe where sharp left turn is impossible.
I also find it plausible that sharp left turn is possible in principle, but still far away in the "technological tree". In particular, it is possible that we still have a very long time until this problem needs to be addressed. Moreover, there is also the possibility that the kind of AI that could undergo sharp left turn will only become available at a point where the background level of capabilities is very high. In such a scenario, undergoing sharp left turn might no longer convey a sufficient advantage for the AI to make much of an impact.
Looking at my actions from the outside, it seems that aside from "don't build AI capable of sharp left turn" (see Steps 1-2), my only "strategy" for handling sharp left turn is
Step 1: Convince everybody to avoid building the kind of AI that could undergo sharp left turn.
I don't have any good ideas for controlling the kind of AI that could undergo sharp left turn, and neither am I aware of any recent work that would make progress on this problem. Instead, I am excited[3] about work which demonstrates the dangers of powerful AI --- ideally in ways that are salient even to ML researchers, policy makers, and the public. Two examples of such results are:
It seems conceivable to me that with enough such results, a majority of people could adopt the view that powerful-AI-soon is probably unsurvivable. More specifically, the scenario that seems conceivable to me is that the groups that adopt this view are:
In scenarios like these, I expect the change in opinion to suffice for civilisation to attempt to avoid building powerful AI. However, this does not automatically mean the attempt will succeed. In particular, we still need to tackle issues such as:
Ultimately, the hope with this step is that we can delay the development of sharp-left-turn-capable AI until we solve the alignment problem for such AI, or until civilisation becomes sufficiently robust to stop being vulnerable to AI takeover. (Recall that I am merely describing the plan, rather than making any claims about how likely it is to succeed.)
Step 2: Build AI that automates trusted processes.
Even if there is a general consensus that powerful AI is unsurvivable, I still expect any attempts to pause all AI progress to be unsustainable. As a result, we might try to increase our chances of controlling AI progress by white-listing approaches that seem relatively safe. But which approaches are those?
One intuition is that if we are currently doing some process without the use of AI, and we already trust that the process is safe, then automating that process and doing more of it is (probably) also safe. (I don't think this intuition is completely right, but since I discuss those reservations in Step 3, I will leave them aside for now.) To give a few positive examples, we can consider:
In contrast, the following strategies would not fall under the approach above:
Overall, this approach to building AI seems much slower and more expensive than building larger and larger foundation models and turning them into agents. However, it should still be sufficient to eventually automate most of the economy, which should in turn allow us to eliminate poverty, greatly speed up science, solve all problems that can be solved using technology, etc. So the "only" issues are whether we can successfully take Steps 1-2 ... and the minor detail of whether automating the economy might perhaps come with problems of its own.
Step 3: Solve problems caused by automating the economy.
As one might expect, even if the approach of automating trusted processes goes as well as possible, there will still be many remaining problems to solve. Some of these are:
Currently, a company that pays its employees below-subsistence wages will get outcompeted by companies that offer better conditions. However, once we automate a large fraction of the economy and society, this relationship between competitiveness and being beneficial to humans can cease to hold.
All of these problems sound like they have the potential to cause human extinction, or worse. At the same time, most problems have the property that one can tell a scary story about how that problem will cause the world to end. So, uhm, perhaps we can wing it and it will all be fine?
Follow-up Questions
Finally, here are some related questions that I have:
This can either be because they assume that sharp left turn won't happen, or because they are trying to avoid building the kinds of AI for which it might happen.
Depending on how I wake up each day, I feel that the chance of sharp left turn happening in time to be relevant is something between 5% and 95%. And most days I am above 50%. (This is besides the point of this post, but it does seem somewhat relevant for context.)
Personally, I endorse the sentiment that one should first figure out in which universe they are, and then try to do the best they can in that universe --- as opposed to focusing on worlds where they know how to make progress. That is why this plan has Steps 1-2.
Well, at least more excited than about any other work.