"They" is referring to Epoch as an entity, which the comment referenced directly. My guess is you just missed that?
I didn't miss it. My point is that Epoch has a variety of different employees and internal views.
They have definitely described themselves as safety focused to me and others.
The original comment referenced (in addition to Epoch) "Matthew/Tamay/Ege", yet you quoted Jaime to back up this claim. I think it's important to distinguish who has said what when talking about what "they" have said. I for one have been openly critical of LW arguments for AI doom for quite a while now.
[I edited this comment to be clearer]
But anyway, it sometimes seems to me that you often advocate a morality regarding AI relations that doesn't benefit anyone who currently exists, or the coalition that you are a part of. This seems like a mistake. Or worse.
I dispute this, since I've argued for the practical benefits of giving AIs legal autonomy, which I think would likely benefit existing humans. Relatedly, I've also talked about how I think hastening the arrival of AI could benefit people who currently exist. Indeed, that's one of the best arguments for accelerating AI. The argument is that,...
Are you suggesting that I should base my morality on whether I'll be rewarded for adhering to it? That just sounds like selfishness disguised as impersonal ethics.
To be clear, I do have some selfish/non-impartial preferences. I care about my own life and happiness, and the happiness of my friends and family. But I also have some altruistic preferences, and my commentary on AI tends to reflect that.
I'm not completely sure, since I was not personally involved in the relevant negotiations for FrontierMath. However, what I can say is that Tamay already indicated that Epoch should have tried harder to obtain different contract terms that would have enabled us to have greater transparency. I don't think it makes sense for him to say that unless he believes it was feasible to have achieved a different outcome.
Also, I want to clarify that this new benchmark is separate from FrontierMath and we are under different constraints with regard to it.
I can't make any confident claims or promises right now, but my best guess is that we will make sure this new benchmark stays entirely private and under Epoch's control, to the extent this is feasible for us. However, I want to emphasize that by saying this, I'm not making a public commitment on behalf of Epoch.
to the extent this is feasible for us
Was [keeping FrontierMath entirely private and under Epoch's control] feasible for Epoch in the same sense of "feasible" you are using here?
Having hopefully learned from our mistakes regarding FrontierMath, we intend to be more transparent to collaborators for this new benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.
Well, I'd sure like to know whether you are planning to give the dataset to OpenAI or any other frontier companies! It might influence my opinion of whether this work is net positive or net negative.
I suppose that means it might be worth writing an additional post that more directly responds to the idea that AGI will end material scarcity. I agree that thesis deserves a specific refutation.
This seems less like a normal friendship and more like a superstimulus simulating the appearance of a friendship for entertainment value. It seems reasonable enough to characterize it as non-authentic.
I assume some people will end up wanting to interact with a mere superstimulus; however, other people will value authenticity and variety in their friendships and social experiences. This comes down to human preferences, which will shape the type of AIs we end up training.
The conclusion that nearly all AI-human friendships will seem inauthentic t...
They might be about getting unconditional love from someone or they might be about having everyone cowering in fear, but they're pretty consistently about wanting something from other humans (or wanting to prove something to other humans, or wanting other humans to have certain feelings or emotions, etc)
I agree with this view; however, I am not sure it rescues the position that a human who succeeds in taking over the world would not pursue actions that are extinction-level bad.
If such a person has absolute power in the way assumed here, their strateg...
But we certainly have evidence about what humans want and strive to achieve, e.g., Maslow's hierarchy and other taxonomies of human desire. My sense, although I can't point to specific evidence offhand, is that once their physical needs are met, humans are reliably largely motivated by wanting other humans to feel and behave in certain ways toward them.
I think the idea that most people's "basic needs" can ever be definitively "met", after which they transition to altruistic pursuits, is more or less a myth. In reality, in modern, wealthy countries where peopl...
Almost no competent humans have human extinction as a goal. AI that takes over is clearly not aligned with the intended values, and so has unpredictable goals, which could very well be ones which result in human extinction (especially since many unaligned goals would result in human extinction whether they include that as a terminal goal or not).
I don't think we have good evidence that almost no humans would pursue human extinction if they took over the world, since no human in history has ever achieved that level of power.
Most historical conquerors ...
I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character that is defined in the prompt
If someone plays a particular role in every relevant circumstance, then I think it's OK to say that they have simply become the role they play. That is simply their identity; it's not merely a role if they never take off the mask. The alternative view here doesn't seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays i...
Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
This argument suggests that if you successfully fooled Claude 3.5 into thinking it took control of the world, then it would change its behavior, be a lot less nice, and try to implement an alien set of values. Is there any evidence in favor of this hypothesis?
I still think having a credible offer is most of the action
For what it's worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the "credible" part of that statement, rather than the "offer" part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.
Of what use will any such training be with a system that becomes a superintelligence?
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word. This test is intended to be useful for identifying scheming in systems that, like today's AIs, are not capable of taking over the world, but unlike today's AIs, are capable of sophisticated agentic behavior.
...Not every intelligent[/quasi-intelligent] entity is as averse to its own individual death, as humans are. This death-aversion is
This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!
Isn't this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn't say that the model needs to actually be deleted. Whether the AI is deleted depends on how the AI responds to the threat. A credible threat of imminent deletion merely provides an incentive to admit misalignment, but this is consis...
I agree with nearly all the key points made in this post. Like you, I think that the disempowerment of humanity is likely inevitable, even if we experience a peaceful and gradual AI takeoff. This outcome seems probable even under conditions where strict regulations are implemented to ostensibly keep AI "under our control".
However, I’d like to contribute an ethical dimension to this discussion: I don’t think peaceful human disempowerment is necessarily a bad thing. If you approach this issue with a strong sense of loyalty to the human species, it’s natural ...
Looking back on this post after a year, I haven't changed my mind about the content of the post, but I agree with Seth Herd when he said this post was "important but not well executed".
In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I'm not sure how much I could have done to eliminate this misinterpretation, I do think that I ...
I think the question here is deeper than it appears, in a way that directly matters for AI risk. My argument here is not merely that there are subtleties or nuances in the definition of "schemer," but rather that the very core questions we care about—questions critical to understanding and mitigating AI risks—are being undermined by the use of vague and imprecise concepts. When key terms are not clearly and rigorously defined, they can introduce confusion and mislead discussions, especially when these terms carry significant implications for how we interpr...
By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power.
Let's consider the ordinary process of mental development, i.e., within-lifetime learning, to constitute the training process for humans. What fraction of humans are considered schemers under this definition?
Is a "schemer" something you definitely are or aren't, or is it more of a continuum? Presumably it depends on the context, but if so, which contexts are relevant for determining if one is a schemer?
I claim these questions cannot be answered using the definition you cited, unless we are given more precision about where the line is drawn.
The downside you mention is about how LVT would also prevent people from 'leeching off' their own positive externalities, like the Disney example. Assuming that's true, I'm not sure why that's a problem? It seems to be the default case for everyone.
The problem is that it would reduce the incentive to develop property for large developers, since their tax bill would go up if they developed adjacent land.
Whether this is a problem depends on your perspective. Personally, I would prefer that we stop making it harder and more inconvenient to build housing a...
I think one example of vague language undermining clarity can be found in Joseph Carlsmith's report on AI scheming, which repeatedly uses the term "schemer" to refer to a type of AI that deceives others to seek power. While the report is both extensive and nuanced, and I am definitely not saying the whole report is bad, the document appears to lack a clear, explicit definition of what exactly constitutes a "schemer". For example, using only the language in his report, I cannot determine whether he would consider most human beings schemers, if we consider w...
It is becoming increasingly clear to many people that the term "AGI" is vague and should often be replaced with more precise terminology. My hope is that people will soon recognize that other commonly used terms, such as "superintelligence," "aligned AI," "power-seeking AI," and "schemer," suffer from similar issues of ambiguity and imprecision, and should also be approached with greater care or replaced with clearer alternatives.
To start with, the term "superintelligence" is vague because it encompasses an extremely broad range of capabilities above human...
I purposefully use these terms vaguely since my concepts about them are in fact vague. E.g., when I say “alignment” I am referring to something roughly like “the AI wants what we want.” But what is “wanting,” and what does it mean for something far more powerful to conceptualize that wanting in a similar way, and what might wanting mean as a collective, and so on? All of these questions are very core to what it means for an AI system to be “aligned,” yet I don’t have satisfying or precise answers for any of them. So it seems more natural to me, at this sta...
Do you have any suggestions RE alternative (more precise) terms? Or do you think it's more of a situation where authors should use the existing terms but make sure to define them in the context of their own work? (e.g., "In this paper, when I use the term AGI, I am referring to a system that [insert description of the capabilities of the system].")
I’m not entirely opposed to doing a scenario forecasting exercise, but I’m also unsure if it’s the most effective approach for clarifying our disagreements. In fact, to some extent, I see this kind of exercise—where we create detailed scenarios to illustrate potential futures—as being tied to a specific perspective on futurism that I consciously try to distance myself from.
When I think about the future, I don’t see it as a series of clear, predictable paths. Instead, I envision it as a cloud of uncertainty—a wide array of possibilities that becomes increas...
The point of a scenario forecast (IMO) is less that you expect clear, predictable paths and more that:
(See also Daniel's sibling comment.)
My biggest disagreements with you are probably a mix of:
The key context here (from my understanding) is that Matthew doesn't think scalable alignment is possible (or doesn't think it is practically feasible), so humans have a low chance of remaining fully in control via corrigible AIs.
I wouldn’t describe the key context in those terms. While I agree that achieving near-perfect alignment—where an AI completely mirrors our exact utility function—is probably infeasible, the concept of alignment often refers to something far less ambitious. In many discussions, alignment is about ensuring that AIs beh...
...In the best case, this is a world like a more unequal, unprecedentedly static, and much richer Norway: a massive pot of non-human-labour resources (oil :: AI) has benefits that flow through to everyone, and yes some are richer than others but everyone has a great standard of living (and ideally also lives forever). The only realistic forms of human ambition are playing local social and political games within your social network and class. [...] The children of the future will live their lives in the shadow of their parents, with social mobility extinct. I
this seems like a fully general argument; any law change is going to disrupt people's long-term plans,
e.g., the abolition of slavery also disrupted people's long-term plans
In this case, I was simply identifying one additional cost of the policy in question: namely that it would massively disrupt the status quo. My point is not that we should abandon a policy simply because it has costs—every policy has costs. Rather, I think we should carefully weigh the benefits of a policy against its costs to determine whether it is worth pursuing, and this is one additio...
It's common for Georgists to propose a near-100% tax on unimproved land. One can propose a smaller tax to mitigate these disincentives, but that simultaneously shrinks the revenue one would get from the tax, making the proposal less meaningful.
Regarding this argument,
...And as a matter of hard fact, most governments operate a fairly Georgist system with oil exploration and extraction, or just about any mining activities, i.e. they auction off licences to explore and extract.
The winning bid for the licence must, by definition, be approx. equal to the rental value of the site (or the rights to do certain things at the site). And the winning bid, if calculated correctly, will leave the company with a good profit on its operations in future, and as a matter of fact, most mining companies and most o
Thanks for the correction. I've now modified the post to cite the World Bank as estimating the true fraction of wealth targeted by an LVT at 13%, which reflects my new understanding of their accounting methodology.
Since 13% is over twice 6%, this significantly updates me on the viability of a land value tax, and its ability to replace other taxes. I weakened my language in the post to reflect this personal update.
That said, nearly all of the arguments I made in the post remain valid regardless of this specific 13% estimate. Additionally, I expect thi...
Here you aren't just making an argument against LVT. You're making a more general argument for keeping housing prices high, and maybe even rising (because people might count on that). But high and rising housing prices make lots of people homeless, and the threat of homelessness plays a big role in propping up these prices. So in effect, many people's retirement plans depend on keeping many other people homeless, and fixing that (by LVT or otherwise) is deemed too disruptive. This does have a certain logic to it, but also it sounds like a bad equilibrium.
I...
It may be worth elaborating on how you think auctions work to mitigate the issues I've identified. If you are referring to either a Vickrey auction or a Harberger tax system, Bryan Caplan has provided arguments for why these proposals do not seem to solve the issue regarding the disincentive to discover new uses for land:
...I can explain our argument with a simple example. Clever Georgists propose a regime where property owners self-assess the value of their property, subject to the constraint that owners must sell their property to anyone who offers th
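To illustrate the general shape of that disincentive, here is a minimal toy sketch (my own illustration with entirely hypothetical numbers and an assumed tax rate, not a reproduction of Caplan's argument): under a self-assessment regime, someone who discovers a more valuable use for a parcel either raises their declared value and pays correspondingly more tax, or keeps the old declaration and risks being bought out at that low price.

```python
# Toy sketch of the self-assessment disincentive; all numbers are hypothetical.
tax_rate = 0.05        # assumed annual tax rate on the self-assessed value
old_value = 100_000    # declared land value before discovering the new use
new_value = 500_000    # land value under the newly discovered use

# Option 1: raise the self-assessment to reflect the discovery.
# The discoverer keeps the land, but the ongoing tax bill rises in
# proportion to the value they themselves uncovered.
extra_tax_per_year = tax_rate * (new_value - old_value)

# Option 2: keep the old, low self-assessment.
# Anyone who learns of the better use can force a sale at the declared
# price, capturing the surplus created by the discovery.
surplus_lost_if_bought_out = new_value - old_value

print(f"Extra annual tax if the discovery is declared: {extra_tax_per_year:,.0f}")
print(f"Surplus lost to a buyer if it is not: {surplus_lost_if_bought_out:,.0f}")
```

Either way, the discoverer captures less than the full value of the discovery, which is the disincentive at issue.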
While I did agree that Linch's comment reasonably accurately summarized my post, I don't think a large part of my post was about the idea that we should now think that human values are much simpler than Yudkowsky portrayed them to be. Instead, I believe this section from Linch's comment does a better job at conveying what I intended to be the main point,
...
- Suppose in 2000 you were told that a 100-line Python program (that doesn't abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you
Similar constraints may apply to AIs unless one gets much smarter much more quickly, as you say.
I do think that AIs will eventually get much smarter than humans, and this implies that artificial minds will likely capture the majority of wealth and power in the world in the future. However, I don't think the way that we get to that state will necessarily be because the AIs staged a coup. I find more lawful and smooth transitions more likely.
There are alternative means of accumulating power than taking everything by force. AIs could get rights and then work ...
There are enormous hurdles preventing the U.S. military from overthrowing the civilian government.
The confusion in your statement is caused by lumping all the members of the armed forces together under the term "U.S. military". Principally, a coup is an act of coordination.
Is it your contention that similar constraints will not apply to AIs?
When people talk about how "the AI" will launch a coup in the future, I think they're making essentially the same mistake you talk about here. They’re treating a potentially vast group of AI entities — like a billion copi...
Asteroid impact
Type of estimate: best model
Estimate: ~0.02% per decade.
Perhaps worth noting: this estimate seems too low to me over longer horizons than the next 10 years, given the potential for asteroid terrorism later this century. I'm significantly more worried about asteroids being directed towards Earth purposely than I am about natural asteroid paths.
That said, my guess is that purposeful asteroid deflection probably won't advance much in the next 10 years, at least without AGI. So 0.02% is still a reasonable estimate if we don't get accelerated technological development soon.
Does trade here just mean humans consuming, i.e., trading money for AI goods and services? That doesn't sound like trading in the usual sense, where it is a reciprocal exchange of goods and services.
Trade can involve anything that someone "owns", including their labor, their property, and any government welfare they receive. Retired people are generally characterized by trading their property and government welfare for goods and services, rather than primarily trading their labor. This is the basic picture I was trying to present.
...How many 'different' AI individ
A commonly heard recent viewpoint on the development of AI states that AI will be economically impactful but will not upend the dominance of humans. Instead AI and humans will flourish together, trading and cooperating with one another. This view is particularly popular with a certain kind of libertarian economist: Tyler Cowen, Matthew Barnett, Robin Hanson.
...They share the curious conviction that the probability of AI-caused extinction p(Doom) is negligible. They base this on analogizing AI with previous technological transitions of humanity, like the i
How could one control AI without access to the hardware/software? What would stop one with access to the hardware/software from controlling AI?
One would gain control by renting access to the model, i.e., the same way you can control what an instance of ChatGPT currently does. Here, I am referring to practical control over the actual behavior of the AI: determining what it does, such as what tasks it performs, how it is fine-tuned, or what inputs are fed into the model.
This is not too dissimilar from the high level of practical control one can exer...
It is not always an expression of selfish motives when people take a stance against genocide. I would even go as far as saying that, in the majority of cases, people genuinely have non-selfish motives when taking that position. That is, they actually do care, to at least some degree, about the genocide, beyond the fact that signaling their concern helps them fit in with their friend group.
Nonetheless, and this is important: few people are willing to pay substantial selfish costs in order to prevent genocides that are socially distant from them.
The theory I...
While the term "outer alignment" wasn’t coined until later to describe the exact issue that I'm talking about, I was using that term purely as a descriptive label for the problem this post clearly highlights, rather than implying that you were using or aware of the term in 2007.
Because I was simply using "outer alignment" in this descriptive sense, I reject the notion that my comment was anachronistic. I used that term as shorthand for the thing I was talking about, which is clearly and obviously portrayed by your post, that's all.
To be very clear: t...
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the "larger argument" (mentioned in the May 2024 update to this post) in which this point plays a role.
I'll confirm that I'm not saying this post's exact thesis is false. This post seems to be largely a parable about a fictional device, rather than an explicit argument with premises and clear conclusions. I'm not saying the parable is wrong. Parables are rarely "wrong" in a strict sense, and I am not disputing this parable's conclusion.
Howeve...
Here's an argument that alignment is difficult which uses complexity of value as a subpoint:
A1. If you try to manually specify what you want, you fail.
A2. Therefore, you want something algorithmically complex.
B1. When humanity makes an AGI, the AGI will have gotten values via some process; that process induces some probability distribution over what values the AGI ends up with.
B2. We want to affect the values-distribution, somehow, so that it ends up with our values.
B3. We don't understand how to affect the values-distribution toward somethi
The object-level content of these norms is different in different cultures and subcultures and times, for sure. But the special way that we relate to these norms has an innate aspect; it’s not just a logical consequence of existing and having goals etc. How do I know? Well, the hypothesis “if X is generally a good idea, then we’ll internalize X and consider not-X to be dreadfully wrong and condemnable” is easily falsified by considering any other aspect of life that doesn’t involve what other people will think of you.
To be clear, I didn't mean to propose t...
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything.
I think it's important to be able to make a narrow point about outer alignment without needing to defend a broader thesis about the entire alignment problem. To the extent my argument is "outer alignment seems easier...
Your distinction between "outer alignment" and "inner alignment" is both ahistorical and unYudkowskian. It was invented years after this post was written, by someone who wasn't me; and though I've sometimes used the terms in occasions where they seem to fit unambiguously, it's not something I see as a clear ontological division, especially if you're talking about questions like "If we own the following kind of blackbox, would alignment get any easier?" which on my view breaks that ontology. So I strongly reject your frame that this post was "cl...
...I’m still kinda confused. You wrote “But across almost all environments, you get positive feedback from being nice to people and thus feel or predict positive valence about these.” I want to translate that as: “All this talk of stabbing people in the back is irrelevant, because there is practically never a situation where it’s in somebody’s self-interest to act unkind and stab someone in the back. So (A) is really just fine!” I don’t think you’d endorse that, right? But it is a possible position—I tend to associate it with @Matthew Barnett. I agree that we
Competitive capitalism works well for humans who are stuck on a relatively even playing field, and who have some level of empathy and concern for each other.
I think this basically isn't true, especially the last part. It's not that humans don't have some level of empathy for each other; they do. I just don't think that's the reason why competitive capitalism works well for humans. I think the reason is instead because people have selfish interests in maintaining the system.
We don't let Jeff Bezos accumulate billions of dollars purely out of the kindn...
...It has come to my attention that this article is currently being misrepresented as proof that I/MIRI previously advocated that it would be very difficult to get machine superintelligences to understand or predict human values. This would obviously be false, and also, is not what is being argued below. The example in the post below is not about an Artificial Intelligence literally at all! If the post were about what AIs supposedly can't do, the central example would have used an AI! The point that is made below will be about the algorithmic complexity of hu
The post is about the complexity of what needs to be gotten inside the AI. If you had a perfect blackbox that exactly evaluated the thing-that-needs-to-be-inside-the-AI, this could possibly simplify some particular approaches to alignment, that would still in fact be too hard because nobody has a way of getting an AI to point at anything. But it would not change the complexity of what needs to be moved inside the AI, which is the narrow point that this post is about; and if you think that some larger thing is not correct, you should not confuse...
a) I think at least part of what's gone on is that Eliezer has been misunderstood and facing the same actually quite dumb arguments a lot, and he is now (IMO) too quick to round new arguments off to something he's got cached arguments for. (I'm not sure whether this is exactly what went on in this case, but seems plausible without carefully rereading everything)
b) I do think when Eliezer wrote this post, there were literally a bunch of people making quite dumb arguments that were literally "the solution to AI ethics/alignment is [my preferred elegant syste...
Alice: I want to make a bovine stem cell that can be cultured at scale in vats to make meat-like tissue. I could use directed evolution. But in my alternate universe, genome sequencing costs $1 billion per genome, so I can't straightforwardly select cells to amplify based on whether their genome looks culturable. Currently the only method I have is to do end-to-end testing: I take a cell line, I try to culture a great big batch, and then see if the result is good quality edible tissue, and see if the cell line can last for a year without mutating beyond re...
The point that a capabilities overhang might cause rapid progress in a short period of time has been made by a number of people without any connections to AI labs, including me, which should reduce your credence that it's "basically, total self-serving BS".
More to the point of Daniel Filan's original comment, I have criticized the Responsible Scaling Policy document in the past for failing to distinguish itself clearly from AI pause proposals. My guess is that your second and third points are likely mostly correct: AI labs think of an RSP as different from...
I was pushing back against the ambiguous use of the word "they". That's all.
ETA: I edited the original comment to be more clear.