Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
There are enormous hurdles preventing the U.S. military from overthrowing the civilian government.
The confusion in your statement is caused by blocking up all the members of the armed forces in the term "U.S. military". Principally, a coup is an act of coordination.
Is it your contention that similar constraints will not apply to AIs?
When people talk about how "the AI" will launch a coup in the future, I think they're making essentially the same mistake you talk about here. They’re treating a potentially vast group of AI entities — like a billion copies of GPT-7 — as if they form a single, unified force, all working seamlessly toward one objective, as a monolithic agent. But just like with your description of human affairs, this view overlooks the coordination challenges that would naturally arise among such a massive number of entities. They’re imagining these AIs could bypass the complex logistics of organizing a coup, evading detection, and maintaining control after launching a war without facing any relevant obstacles or costs, even though humans routinely face these challenges amongst ourselves.
In these discussions, I think there's an implicit assumption that AIs would automatically operate outside the usual norms, laws, and social constraints that govern social behavior. The idea is that all the ordinary rules of society will simply stop applying, because we're talking about AIs.
Yet I think this simple idea is basically wrong, for essentially the same reasons you identified for human institutions.
Of course, AIs will be different in numerous ways from humans, and AIs will eventually be far smarter and more competent than humans. This matters. Because AIs will be very capable, it makes sense to think that artificial minds will one day hold the majority of wealth, power, and social status in our world. But these facts alone don't show that the usual constraints that prevent coups and revolutions will simply go away. Just because AIs are smart doesn't mean they'll necessarily use force and violently revolt to achieve their goals. Just like humans, they'll probably have other avenues available for pursuing their objectives.
Asteroid impact
Type of estimate: best model
Estimate: ~0.02% per decade.
Perhaps worth noting: this estimate seems too low to me over longer horizons than the next 10 years, given the potential for asteroid terrorism later this century. I'm significantly more worried about asteroids being directed towards Earth purposely than I am about natural asteroid paths.
That said, my guess is that purposeful asteroid deflection probably won't advance much in the next 10 years, at least without AGI. So 0.02% is still a reasonable estimate if we don't get accelerated technological development soon.
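To make the horizon point concrete, here is a minimal sketch of how a per-decade probability compounds over longer periods. It assumes a constant per-decade probability that is independent across decades, which is an illustrative simplification and not something the quoted estimate claims. The function name `cumulative_risk` is hypothetical.

```python
def cumulative_risk(per_decade: float, decades: int) -> float:
    """Probability of at least one event over `decades` decades.

    Assumes a constant per-decade probability that is independent
    across decades (an illustrative assumption, not from the source).
    """
    return 1.0 - (1.0 - per_decade) ** decades


# The ~0.02% per decade estimate quoted above, expressed as a probability.
PER_DECADE = 0.0002

for decades in (1, 5, 10):
    risk = cumulative_risk(PER_DECADE, decades)
    print(f"{decades * 10} years: {risk:.4%}")
```

Under these assumptions the figure comes to roughly 0.2% per century. If the per-decade risk rises later in the century, for example through deliberate asteroid redirection, a constant-rate extrapolation like this one would understate the long-run total.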
Does trade here just mean humans consuming, i.e. trading money for AI goods and services? That doesn't sound like trading in the usual sense where it is a reciprocal exchange of goods and services.
Trade can involve anything that someone "owns", which includes their labor, their property, and their government welfare. Retired people are generally characterized by trading their property and government welfare for goods and services, rather than primarily trading their labor. This is the basic picture I was trying to present.
How many 'different' AI individuals do you expect there to be?
I think the answer to this question depends on how we individuate AIs. I don't think most AIs will be as cleanly separable from each other as humans are, as most (non-robotic) AIs will lack bodies, and will be able to share information with each other more easily than humans can. It's a bit like asking how many "ant units" there are. There are many individual ants per colony, but each colony can be treated as a unit by itself. I suppose the real answer is that it depends on context and what you're trying to figure out by asking the question.
A viewpoint on the development of AI that has recently become common holds that AI will be economically impactful but will not upend the dominance of humans. Instead, AI and humans will flourish together, trading and cooperating with one another. This view is particularly popular with a certain kind of libertarian economist: Tyler Cowen, Matthew Barnett, Robin Hanson.
They share the curious conviction that the probability of AI-caused extinction, p(Doom), is negligible. They base this on analogizing AI to previous technological transitions of humanity, like the industrial revolution or the development of new communication mediums. A core assumption/argument is that AIs will not disempower humanity because they will respect the existing legal system, apparently because they can gain from trade with humans.
I think this summarizes my view quite poorly on a number of points. For example, I think that:
AI is likely to be much more impactful than the development of new communication mediums. My default prediction is that AI will fundamentally increase the economic growth rate, rather than merely continuing the trend of the last few centuries.
Biological humans are very unlikely to remain dominant in the future, pretty much no matter how this is measured. Instead, I predict that artificial minds and humans who upgrade their cognition will likely capture the majority of future wealth, political influence, and social power, with non-upgraded biological humans becoming an increasingly small force in the world over time.
The legal system will likely evolve to cope with the challenges of incorporating and integrating non-human minds. This will likely involve a series of fundamental reforms, and will eventually look very different from the idea of "AIs will fit neatly into human social roles and obey human-controlled institutions indefinitely".
A more accurate description of my view is that humans will become economically obsolete after AGI, but this obsolescence will happen peacefully, without a massive genocide of biological humans. In the scenario I find most likely, humans will have time to prepare and adapt to the changing world, allowing us to secure a comfortable retirement, and/or join the AIs via mind uploading. Trade between AIs and humans will likely persist even into our retirement, but this doesn't mean that humans will own everything or control the whole legal system forever.
How could one control AI without access to the hardware/software? What would stop one with access to the hardware/software from controlling AI?
One would gain control by renting access to the model, i.e., the same way you can control what an instance of ChatGPT currently does. Here, I am referring to practical control over the AI's actual behavior: what tasks it performs, how it is fine-tuned, and what inputs are fed into the model.
This is not too dissimilar from the high level of practical control one can exercise over, for example, an AWS server that one rents. While Amazon may host these servers, and thereby have the final say over what happens to the computer in the case of a conflict, the company is nonetheless dependent on customer revenue, which means it cannot feasibly use all its servers privately for its own internal purposes. As a consequence of this practical constraint, Amazon rents these servers out to the public and does not substantially limit user control over them, giving end-users substantial discretion over what software ultimately runs.
In the future, these controls could also be determined by contracts and law, analogously to how one has control over their own bank account, despite the bank providing the service and hosting one's account. Then, even in the case of a conflict, the entity that merely hosts an AI may not have practical control over what happens, as they may have legal obligations to their customers that they cannot breach without incurring enormous costs to themselves. The AIs themselves may resist such a breach as well.
In practice, I agree these distinctions may be hard to recognize. There may be a case in which we thought that control over AI was decentralized, but in fact, power over the AIs was more concentrated or unified than we believed, as a consequence of centralization over the development or the provision of AI services. Indeed, perhaps real control was always in the hands of the government all along, as they could always choose to pass a law to nationalize AI, and take control away from the companies.
Nonetheless, these cases seem adequately described as a mistake in our perception of who was "really in control" rather than an error in the framework I provided, which was mostly an attempt to offer careful distinctions, rather than to predict how the future will go.
If one actor, such as OpenAI, can feasibly get away with seizing practical control over all the AIs they host without incurring high costs to the continuity of their business through loss of customers, then this indeed may surprise someone who assumed that OpenAI was operating under different constraints. However, this scenario still fits nicely within the framework I've provided, as it merely describes a case in which one was mistaken about the true degree of concentration along one axis, rather than one of my concepts intrinsically fitting reality poorly.
It is not always an expression of selfish motives when people take a stance against genocide. I would even go as far as saying that, in the majority of cases, people genuinely have non-selfish motives when taking that position. That is, they actually do care, to at least some degree, about the genocide, beyond the fact that signaling their concern helps them fit in with their friend group.
Nonetheless, and this is important: few people are willing to pay substantial selfish costs in order to prevent genocides that are socially distant from them.
The theory I am advancing here does not rest on the idea that people aren't genuine in their desire for faraway strangers to be better off. Rather, my theory is that people generally care little about such strangers, when helping those strangers trades off significantly against objectives that are closer to themselves, their family, friend group, and their own tribe.
Or, put another way, distant strangers usually get little weight in our utility function. Our family, and our own happiness, by contrast, usually get a much larger weight.
The core element of my theory concerns the amount that people care about themselves (and their family, friends, and tribe) versus other people, not whether they care about other people at all.
While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.
Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.
To be very clear: the exact problem I am talking about is the inherent challenge of precisely defining what you want or intend, especially (though not exclusively) in the context of designing a utility function. This difficulty arises because, when the desired outcome is complex, it becomes nearly impossible to perfectly delineate between all potential 'good' scenarios and all possible 'bad' scenarios. This challenge has been a recurring theme in discussions of alignment, as it's considered hard to capture every nuance of what you want in your specification without missing an edge case.
This problem is manifestly portrayed by your post, using the example of an outcome pump to illustrate. I was responding to this portrayal of the problem, and specifically saying that this specific narrow problem seems easier in light of LLMs, for particular reasons.
It is frankly frustrating to me that, from my perspective, you seem to have reliably missed the point of what I am trying to convey here.
I only brought up Christiano-style proposals because I thought you were changing the topic to a broader discussion, specifically to ask me what methodologies I had in mind when I made particular points. If you had not asked me "So would you care to spell out what clever methodology you think invalidates what you take to be the larger point of this post -- though of course it has no bearing on the actual point that this post makes?" then I would not have mentioned those things. In any case, none of the things I said about Christiano-style proposals were intended to critique this post's narrow point. I was responding to that particular part of your comment instead.
As far as the actual content of this post, I do not dispute its exact thesis. The post seems to be a parable, not a detailed argument with a clear conclusion. The parable seems interesting to me. It also doesn't seem wrong, in any strict sense. However, I do think that some of the broader conclusions that many people have drawn from the parable seem false, in context. I was responding to the specific way that this post had been applied and interpreted in broader arguments about AI alignment.
My central thesis with regard to this post is simply this: the post clearly portrays a specific problem that was later called the "outer alignment" problem by other people. This post portrays this problem as being difficult in a particular way. And I think this portrayal is misleading, even if the literal parable holds up in pure isolation.
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.
I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.
However, I am saying: this parable presumably played some role in the "larger" argument that MIRI has made in the past. What role did it play? Well, I think a good guess is that it portrayed the difficulty of precisely specifying what you want or intend, for example when explicitly designing a utility function. This problem was often alleged to be difficult because, when you want something complex, it's difficult to perfectly delineate potential "good" scenarios and distinguish them from all potential "bad" scenarios. This is the problem I was analyzing in my original comment.
While the term "outer alignment" was not invented to describe this exact problem until much later, I was using that term purely as descriptive terminology for the problem this post clearly describes, rather than claiming that Eliezer in 2007 was deliberately describing something that he called "outer alignment" at the time. Because my usage of "outer alignment" was merely descriptive in this sense, I reject the idea that my comment was anachronistic.
And again: I am not claiming that this post is inaccurate in isolation. In both my above comment, and in my 2023 post, I merely cited this post as portraying an aspect of the problem that I was talking about, rather than saying something like "this particular post's conclusion is wrong". I think the fact that the post doesn't really have a clear thesis in the first place means that it can't be wrong in a strong sense at all. However, the post was definitely interpreted as explaining some part of why alignment is hard — for a long time by many people — and I was critiquing the particular application of the post to this argument, rather than the post itself in isolation.
The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you.
To be clear, I didn't mean to propose the specific mechanism of: if some behavior has a selfish consequence, then people will internalize that class of behaviors in moral terms rather than in purely practical terms. In other words, I am not saying that all relevant behaviors get internalized this way. I agree that only some behaviors are internalized by people in moral terms, and other behaviors do not get internalized in terms of moral principles in the way I described.
Admittedly, my statement was imprecise, but my intention in that quote was merely to convey that people tend to internalize certain behaviors in terms of moral principles, which explains the fact that people don't immediately abandon their habits when the environment suddenly shifts. However, I was silent on the question of which selfishly useful behaviors get internalized this way and which ones don't.
A good starting hypothesis is that people internalize certain behaviors in moral terms if they are taught to see those behaviors in moral terms. This ties into your theory that people "have an innate drive to notice, internalize, endorse, and take pride in following social norms". We are not taught to see "reaching into your wallet and shredding a dollar" as impinging on moral principles, so people don't tend to internalize the behavior that way. Yet, we are taught to see punching someone in the face as impinging on a moral principle. However, this hypothesis still leaves much to be explained, as it doesn't tell us which behaviors we will tend to be taught about in moral terms, and which ones we won't be taught in moral terms.
As a deeper, perhaps evolutionary explanation, I suspect that internalizing certain behaviors in moral terms helps make our commitments to other people more credible: if someone thinks you're not going to steal from them because you think it's genuinely wrong to steal, then they're more likely to trust you with their stuff than if they think you merely recognize the practical utility of not stealing from them. This explanation hints at the idea that we will tend to internalize certain behaviors in moral terms if those behaviors are both selfishly relevant, and important for earning trust among other agents in the world. This is my best guess at what explains the rough outlines of human morality that we see in most societies.
I’m not sure what “largely” means here. I hope we can agree that our objectives are selfish in some ways and unselfish in other ways.
Parents generally like their children, above and beyond the fact that their children might give them yummy food and shelter in old age. People generally form friendships, and want their friends to not get tortured, above and beyond the fact that having their friends not get tortured could lead to more yummy food and shelter later on. Etc.
In that sentence, I meant "largely selfish" as a stand-in for what I think humans-by-default care overwhelmingly about, which is something like "themselves, their family, their friends, and their tribe, in rough descending order of importance". The problem is that I am not aware of any word in the English language to describe people who have these desires, except perhaps the word "normal".
The word selfish usually denotes someone who is preoccupied with their own feelings, and is unconcerned with anyone else. We both agree that humans are not entirely selfish. Nonetheless, the opposite word, altruistic, often denotes someone who is preoccupied with the general social good, and who cares about strangers, not merely their own family and friend circles. This is especially the case in philosophical discussions in which one defines altruism in terms of impartial benevolence to all sentient life, which is extremely far from an accurate description of the typical human.
Humans exist on a spectrum between these two extremes. We are not perfectly selfish, nor are we perfectly altruistic. However, we are generally closer to the ideal of perfect selfishness than to the ideal of perfect altruism, given the fact that our own family, friend group, and tribe tend to be only a small part of the entire world. This is why I used the language of "largely selfish" rather than something else.
I do think that AIs will eventually get much smarter than humans, and this implies that artificial minds will likely capture the majority of wealth and power in the world in the future. However, I don't think the way that we get to that state will necessarily be because the AIs staged a coup. I find more lawful and smooth transitions more likely.
There are means of accumulating power other than taking everything by force. AIs could get rights and then work within our existing systems to achieve their objectives. Our institutions could continuously evolve with increasing AI presence, becoming more directed by AIs with time.
What I'm objecting to is the inevitability of a sudden collapse when "the AI" decides to take over in an untimely coup. I'm proposing that there could just be a smoother, albeit rapid, transition to a post-AGI world. Our institutions and laws could simply adjust to incorporate AIs into the system, rather than being obliterated by surprise once the AIs coordinate an all-out assault.
In this scenario, human influence will decline, eventually quite far. Perhaps this soon takes us all the way to the situation you described in which humans will become like stray dogs or cats in our current world: utterly at the whim of more powerful beings who do not share their desires.
However, I think that scenario is only one possibility. Another possibility is that humans could enhance their own cognition to better keep up with the world. After all, we're talking about a scenario in which AIs are rapidly advancing technology and science. Could humans not share in some of that prosperity?
One more possibility is that, unlike cats and dogs, humans could continue to communicate legibly with the AIs and stay relevant for reasons of legal and cultural tradition, as well as some forms of trade. Our current institutions didn't descend from institutions constructed by stray cats and dogs. There was no stray animal civilization that we inherited our laws and traditions from. But perhaps if our institutions did originate in this way, then cats and dogs would hold a higher position in our society.