“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’”
I ran this prompt but substituted in "Gary Marcus" for "Eliezer Yudkowsky". Claude says:
Overall evaluation: On a scale from 'mostly true' to 'mostly false,' I would rate Gary Marcus's predictions as a group as "Mostly True."
Many of Marcus's predictions about the limitations of current AI systems and the challenges ahead have proven accurate. His concerns about reasoning, abstract thinking, and the need for more sophisticated knowledge representation align with ongoing challenges in the field. His emphasis on AI safety and alignment has also been prescient.
However, it's worth noting that some of his predictions might be seen as overly pessimistic by some in the AI community. The rapid progress in LLMs and their applications has surprised many, including some skeptics. Nonetheless, many of the fundamental challenges he pointed out remain relevant.
It's also important to remember that the field of AI is rapidly evolving, and assessments of such predictions can change quickly as new breakthroughs occur. As of my last update in April 2024, many of Marcus's key points still held true, but the field continues to advance at a rapid pace.
I think Claude likes saying nice things about people, so it's worth trying to control for that.
In case Joshua Achiam ends up reading this post, my question for him is:
My understanding is that you think P(misaligned AGI will kill all humans by 2032) is extremely low, like 1e-6.
Is this because:
What is your P(violent misaligned AI takeover | AGI by 2032)?[1]
Some clarification:
I'm using AGI to mean top-human-expert level AI. (As in, can obsolete top human experts in most non-physical tasks.) OpenAI typically uses a similar definition.
By "misalignment", I mean "AIs that conspire against you and your countermeasures", not "AIs which aren't sufficiently robust to jailbreaks".
My guess is that your view is that "misalignment" could be bad, but not existential, while other risks from AGI (e.g. new superweapons used in great power conflict) could be existential.
My view is that a misaligned AI that succeeds in takeover probably wouldn't kill literally every person, though takeover has a high probability of killing billions; killing everyone is plausible, and takeover would eliminate human control of the future, which is likely extremely bad.
So, I prefer to talk about the chances of "violent misaligned AI takeover" where violent means "a bunch of people die or are harmed".
(This is based on this twitter thread I wrote.)
Note that this includes takeover by later systems which are more powerful than AGI. I'm just fixing AGI being created before some date. ↩︎
The more serious error, which got quoted elsewhere, was: In the section about OpenAI, I noted some past comments from Joshua Achiam, and interpreted them as him lecturing EAs that misalignment risk from AGI was not real.
Uhh, with the additional context from this post in mind, I would argue that your initial interpretation was entirely correct. Like, this is precisely what I'd expected from reading that first quote.
Not the worst-case scenario of Yann LeCun, admittedly, but getting there.
Edit: Actually it's a bit worse than I'd expected. "Advocating for an international treaty is a call for violence" is just an embarrassing take.
My view on this:
Edit: Actually it's a bit worse than I'd expected. "Advocating for an international treaty is a call for violence" is just an embarrassing take.
The context is fairly critical here, and I broadly share iceman's perspective regarding the article's call for an international treaty:
https://www.lesswrong.com/posts/gZkYvA6suQJthvj4E/#SCnooEfdjJaQHS4TB
Which is that while he didn't explicitly call for violence, he did argue for very drastic actions, and, contra others, treaties are not usually enforced with nuclear war, for obvious reasons.
Eliezer did not call for enforcement via nuclear war. He said that:
Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs.
and later clarified:
My TIME piece did not suggest nuclear strikes against countries that refuse to sign on to a global agreement against large AI training runs. It said that, if a non-signatory country is building a datacenter that might kill everyone on Earth, you should be willing to preemptively destroy that datacenter; the intended reading is that you should do this even if the non-signatory country is a nuclear power and even if they try to threaten nuclear retaliation for the strike. This is what is meant by "Make it explicit... that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs."
Violently enforcing certain particularly important principles on non-signatories is entirely within the norm, the ban on international-trade-endangering piracy being the prime example. The idea that applying a qualitatively similar standard to AI risk is "deranged" is only valid if you don't believe that catastrophic AI risk is real: if you don't believe that a rogue superintelligence somewhere in North Korea can hurt you in the US.
Anyway, that's not even the crux here. The crux is that there's a day-and-night difference between (1) advocating for an international treaty that, like any treaty, would ultimately need to be enforced, and (2) calling for unilateral violence against AI projects.
The difference between those is far beyond the fine points of whether it's okay or not to enforce an international treaty on nuke-having non-signatories. And the worst falsehoods being spread are those misrepresenting (1) as (2), and Joshua Achiam's quotes above likewise seem to miss the difference between the two (though I don't think he's doing that maliciously).
Violently enforcing certain particularly important principles on non-signatories is entirely within the norm
True as stated, though I'm not aware of examples of this being enforced on non-signatories which are nuclear powers. This is just quantitatively riskier, not a notable change in norms.
And I agree this seems clearly non-outrageous if we replace the AGI training datacenter with something like "a wet lab credibly planning on developing a virus that would kill literally every human if it escaped, where this wet lab isn't planning on taking any serious precautions against a lab leak".
I think this is a disagreement I have:
This is just quantitatively riskier, not a notable escalation.
I think this is actually a fairly extreme escalation compared to how states usually deal with threats. Whether or not you think it is a good policy, it is a very, very important escalation step, and I think there is evidence that it is a very extreme escalation.
True as stated, though I'm not aware of examples of this being enforced on non-signatories which are nuclear powers.
Sorry, I actually meant "not a notable change in norms". I agree that it is quantitatively much costlier from the perspective of the US.
Hm, a disagreement I have is that the norms around escalating toward nuclear war are way, way stronger than basically any other norm in international relations, and there's a reason basically all states act through proxies and covert wars: the taboo on nuclear war is far stronger than most other norms in the international setting.
I agree that it is norm-violating for a country to respond to a conventional strike on its datacenter with a nuclear response. This is different from the statement that the conventional strike from the other country is norm-violating.
I don't think conventional strikes on the military assets of a nuclear power are that norm-violating. In fact, recently, a huge number of missiles were launched at a nuclear power. (Iran launched them at Israel, which is widely believed to have nukes.)
(I believe the US has never directly launched a strike on a nuclear power within its territory. However, it has indirectly assisted with such strikes in the Russia-Ukraine war and participated in proxy wars.)
The point that would justify an airstrike isn't violation of a treaty, but posing an immediate and grave risk to the international community. The treaty is only the precondition that makes effective and coordinated action possible.
I do think that most current alignment work does also advance capabilities, but that the distinction should mostly be ‘clear’ even if there are important shades of gray and you cannot precisely define a separator.
For a large subclass of work, I actually disagree with this claim. I think the areas where alignment work wouldn't accelerate capabilities are work on reducing deceptively-aligned-AI takeover risk and sharp left turn risk, for the reasons that @RogerDearnaley stated here:
So that’s the thing, right? Fictional worlds like this almost never actually make sense on closer examination. The incentives and options and actions are based on the plot and the need to tell human stories rather than following good in-universe logic. The worlds in question are almost always highly fragile, the worlds really should blow up, and the AIs ensure the humans work out okay in some sense ‘because of reasons,’ because it feels right to a human writer and their sense of morality or something, rather than because that is what would actually happen.
I worry this kind of perspective is load-bearing, given he thinks it is ‘correctly predicting the future’: the idea that ‘prosaic alignment’ will result in pushes toward some common-sense-morality style of not harming the humans that are sufficiently strong, despite all the competitive dynamics among AIs and the various other things they value and grow to value, that things turn out fine by default, in worlds that to me seem past their point of no return and infinitely doomed unless you think the AIs themselves have value.
I think one key crux here is whether you think partial alignment successes are possible. If AI alignment ends up binary, I would agree that Her is basically an incoherent description of the future.
If AI alignment ends up more of a continuous quantity such that reasonably large partial successes are probable, then Her is more coherent as a plausible future.
I admit I tend to favor the continuous side of the debate more than the discrete side, and tend to see discreteness as an abstraction over the actual continuous outcomes.
To be clear, Her is not a totally coherent story, but I do think that relatively minor changes are enough to make it more coherent.
On this:
Without getting into any technical arguments, it seems rather absurd to suggest the set of goals that imply undesired subgoals within plausibly desired goal space would have measure zero? I don’t see how this survives contact with common sense or relation to human experience or typical human situations.
The answer is that probabilities get weird with infinities: the probability of an event can be 0% even if the event is possible, and the probability of an event can be 100% even if there are scenarios where the event fails to happen.
One perfect example: if you pick a random real number (say, uniformly from the interval [0, 1]), you have a 0% chance of getting any particular real number, no matter which one it is, and you will have a 100% chance of getting an irrational real number, but that doesn't mean you can't sample a rational number.
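For readers who want the underlying fact spelled out, here is a minimal formalization of the measure-zero point, using a uniform distribution on [0, 1] as an illustrative assumption of mine (any continuous distribution behaves the same way):

```latex
% Measure-zero events under a continuous distribution (uniform on [0,1] chosen only for illustration).
\[
X \sim \mathrm{Uniform}([0,1]) \;\Longrightarrow\; \Pr[X = x] = 0 \quad \text{for every single } x \in [0,1],
\]
\[
\Pr\bigl[X \in \mathbb{Q} \cap [0,1]\bigr]
\;=\; \sum_{q \,\in\, \mathbb{Q} \cap [0,1]} \Pr[X = q] \;=\; 0,
\qquad\text{so}\qquad \Pr[X \text{ is irrational}] = 1.
\]
% Probability 0 does not mean impossible, and probability 1 does not mean guaranteed:
% every individual rational number is a possible outcome, yet the rationals collectively have probability zero.
```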
I don't think that it is actually measure 0, but that's for different reasons.
My response to Joshua Achiam's point on why we get instrumental convergence is that o1 exists.
While it often doesn't want to use its instrumentally convergent behavior, it does have the capability for basic instrumental convergence, and in AI scaling, noisy capabilities foreshadow robust and powerful capabilities.
Indeed, I would go further. The market wants the AIs to be given as much freedom and authority as possible, to send them out to compete for resources and influence generally, for various ultimate purposes. And the outcome of those clashes and various selection effects and resource competitions, by default, dooms us.
I think it depends on whether we are assessing it using X-risk standards or GCR (global catastrophic risk) standards.
On the one hand, I absolutely believe that we could well get into GCR territory, for something like the reason given in the post below:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
But I don't think it will get into X-risk territory, both because I expect AI agents to be more controlled than the Molochian story tells us, and because I expect some winners who will be able to go to the stars and have their values imprint on the stars.
Sixteen years ago, Eliezer Yudkowsky wrote the Value Theory sequence, going deep on questions like what makes things have value to us, how to reconcile when different entities (human or otherwise) have very different values, and so on. If you’re interested in these questions, this is a great place to start. I have often tried to emphasize that I continue to believe that Value is Fragile, whereas many who don’t believe in existential risk think value is not fragile.
It is a well-understood problem among our crowd that ‘human values’ is both very complex and a terrifyingly hard thing to pin down, and that people very strongly disagree about what they value.
Also it is a terrifyingly easy thing to screw up accidentally, and we have often said that this is one of the important ways to build AGI and lose – that you choose a close and well-meaning but incorrect specification of values, or your chosen words get interpreted that way, or someone tries to get the AGI to find those values by SGD or other search and it gets a well-meaning but incorrect specification.
I actually disagree with both Joshua Achiam and a lot of LWers on the assumption that value is complicated (at least in its generative structure). I also don't think this failure mode is likely to happen at all, conditional on no deceptive/proxy misalignment, because capabilities people are incentivized to make our specifications better, and I think the function of human values is actually fairly simple once we abstract away the complicated mechanisms.
I agree that the literal claim in the Value is Fragile post is probably true with reasonably high probability, but not so much with the other conclusions often held in that cluster.
Also, I think one very large meta-crux that underlies a lot of other cruxes is whether the best strategy is to do locally optimal things and iterate, closest to the model-free RL approaches, or whether you should just build an explicit model and optimize in the model, closest to model-based RL/search approaches.
Cf. these comments:
I think this is a meta-crux underlying all other disagreements in strategies for AI alignment, and I think unfortunately we will probably have to solve this the hard way by executing both strategies in reality in parallel, to see which one wins out as AI progresses.
Edit:
I have to say something about this:
That is how every law and every treaty or agreement works, and indeed the only way they can work.
No, this isn't how every treaty works.
Treaty violations amongst states are usually not enforced with the threat of war, for fairly obvious reasons. Instead, they are settled in other ways.
Joshua Achiam is the OpenAI Head of Mission Alignment
I start off this post with an apology for two related mistakes from last week.
The first is the easy correction: I incorrectly thought he was the head of ‘alignment’ at OpenAI rather than his actual title ‘mission alignment.’
Both are important, and make one’s views important, but they’re very different.
The more serious error, which got quoted elsewhere, was: In the section about OpenAI, I noted some past comments from Joshua Achiam, and interpreted them as him lecturing EAs that misalignment risk from AGI was not real.
While in isolation I believe this is a reasonable way to interpret this quote, this issue is important to get right, especially if I’m going to say things like that. Looking at it only that way was wrong. I both used a poor method to contact Joshua for comment, one that failed to reach him when I had better options, and failed to search for additional past comments that would have provided additional context.
I should have done better on both counts, and I’m sorry.
Indeed, exactly because OpenAI is so important, and to counter the potential spread of inaccurate information, I’m offering this deep dive into Joshua Achiam’s public statements. He has looked at a draft of this to confirm it has no major errors.
Here is a thread Joshua wrote in November 2022 giving various links to AI safety papers and resources. The focus is on concrete practical ‘grounded’ stuff, and it also includes a course by Dan Hendrycks that involves both levels.
Joshua Achiam Has a Very Different Model of AI Existential Risk
Having looked at many additional statements, Joshua clearly believes that misalignment risk from AGI is real. He has said so, and he has been working on mitigating that risk. And he’s definitely been in the business many times of pointing out when those skeptical of existential risk get sufficiently far out of line and make absolute statements or unfair personal or cultural attacks.
He does appear to view some models and modes of AI existential risk, including Yudkowsky-style models of AI existential risk, as sufficiently implausible or irrelevant as to be effectively ignorable. And he’s shown a strong hostility in the x-risk context to the rhetoric, arguments, tactics and suggested actions of existential risk advocates more broadly.
So for example we have these:
On the one hand, the first statement is a very clear ‘no, everything will not automatically be fine’ and correctly identifies that position as indefensible. The others are helpful as well. The first statement is also a continued characterization of those worried as mostly ‘doomers’ with generally poor arguments.
The second is correct in principle as well. If there’s one thing Yann LeCun isn’t, it’s genre savvy.
Joshua is Strongly Dismissive of Alternative Models of AI X-Risk
In practice, however, the ‘consider the likelihood of each particular scenario’ approach tends to default everything to the ‘things turn out OK’ bracket minus the particular scenarios one can come up with.
It is central to my perspective that you absolutely cannot do that. I am very confident that the things being proposed do not default to good outcomes. Good outcomes are possible, but to get them we will have to engineer them.
There is no contradiction between ‘existential risk is unlikely’ and ‘we should reduce existential risk.’ It is explicit that Joshua thinks such risks are unlikely. Have we seen him put a number on it? Yes, but I found only the original quote I discussed last time and a clarification thereof, which was:
This makes clear he is dismissing in particular ‘all humans are physically dead by 2032’ rather than ‘the world is on a path by 2032 where that outcome (or another where all value is lost) is inevitable.’ I do think this very low probability is highly alarming, and in this situation I don’t see how you can possibly have model error as low as 1e-6 (!), but it is less crazy given it is more narrow.
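To spell out the model-error point, here is a toy lower-bound calculation; the specific numbers are illustrative assumptions of mine, not estimates from the post or from Joshua:

```latex
% Any all-things-considered estimate is bounded below by the model-error term:
\[
\Pr[\text{all humans dead by 2032}]
\;\ge\;
\Pr[\text{my model is badly wrong}]
\cdot
\Pr[\text{all humans dead by 2032} \mid \text{model badly wrong}].
\]
% For the left-hand side to be 1e-6, the product on the right must be at most 1e-6 as well,
% e.g. a 1-in-10,000 chance the model is badly wrong times a 1-in-100 chance of the outcome
% given that. Defending both factors at those levels is the difficulty pointed to above.
```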
The ‘much less than 100%’ doom number in 100 years doesn’t rule out my own number. What it tells me more than anything, on its own, is that he’s grown understandably exhausted with dealing with people who do put 99% or 99.9% in that spot.
But he’s actually making much stronger claims here, in the context of an EA constructive criticism thread basically telling them not to seek power because EA was too dysfunctional (which makes some good points and suggestions, but also proves far too much, which points to what I think is wrong in the thread more broadly):
The idea that a 5% probability of such an outcome, as envisioned by someone else for some other person’s definition of AGI, proves they are ‘not thinking clearly,’ seems like another clear example of dismissiveness and overconfidence to me. This goes beyond not buying the threat model that creates such predictions, which I think is itself a mistake. Similarly:
At this point, I’ve heard quite a lot of people at or formerly at OpenAI in particular, including Sam Altman, espouse the kinds of timelines Joshua here says are ‘way, way out of connection to reality.’ So I’m curious what he thinks about that.
The fourth earlier claim, that AI could be a link in the causal chain to x-risk without requiring the AI to commit an intentional act, seems very obviously true. If anything, it highlights that many people place importance on there being an ‘intentional act’ or similar, whereas I don’t see that distinction as important. I do think that the scenario he’s describing there, where the superweapon becomes possible but we otherwise have things under control, is a risk level I’d happily accept.
The third claim is more interesting. Most of the talk I hear about ‘we’ll merge with the machines’ or whatnot doesn’t seem to me to make sense on any meaningful level. I see scenarios where humanity has a ‘gradual transformation’ as ones where we successfully solve ‘phase one’ and have the alignment and control issues handled, but then weird dynamics or changes happen in what I call ‘phase two,’ when we have to get human dynamics in that world into some form of long-term equilibrium, and current humanity turns out not to be it.
I do agree, or at least notice I am confused, about which of those worlds count as valuable versus not. I’ve mentally been putting those mostly into the ‘win’ bucket; if you don’t do that, then doom estimates go up.
Would Ordinary Safety Practices Be Sufficient for AI?
I would hope we can all agree they are necessary. They don’t seem sufficient to me.
Consider Joshua’s belief (at least in 2021) that if we adapt general best safety practices from other industries, we’ll ‘get there.’ While they are much better than nothing, and better than current practices in AI, I very strongly disagree with this. I do think that, given what else is happening at OpenAI, someone who believes strongly in ‘general best practices’ for safety is providing large value above replacement.
Standard safety policies cannot be assumed. Some major labs fall well short of this, and have made clear they have no intention of changing course. There is clear and extreme opposition, from many circles (not Joshua), to any regulatory requirements that say ‘you must apply otherwise ordinary safety protocols to AI.’
It seems clearly good to not throw out these standard policies, on the margin? It would be a great start to at least agree on that. If nothing else those policies might identify problems that cause us to halt and catch fire.
But I really, really do not think that approach will get it done on its own, other than perhaps via ‘realize you need to stop.’ The threat models this time are very expansive and very different. I’d certainly go so far as to say that if someone assigns very high probability to that approach being sufficient, they are not, in my mind, thinking clearly.
Consider also this statement:
The obvious way to interpret this statement is, in addition to the true statement that much alignment work also enhances capabilities, that the alignment work that isn’t also capabilities work isn’t real alignment work? Downthread he offers good nuance. I do think that most current alignment work does also advance capabilities, but that the distinction should mostly be ‘clear’ even if there are important shades of gray and you cannot precisely define a separator.
Visions of the Future
In terms of ‘things that seem to me like not thinking clearly’:
So that’s the thing, right? Fictional worlds like this almost never actually make sense on closer examination. The incentives and options and actions are based on the plot and the need to tell human stories rather than following good in-universe logic. The worlds in question are almost always highly fragile, the worlds really should blow up, and the AIs ensure the humans work out okay in some sense ‘because of reasons,’ because it feels right to a human writer and their sense of morality or something, rather than because that is what would actually happen.
I worry this kind of perspective is load-bearing, given he thinks it is ‘correctly predicting the future’: the idea that ‘prosaic alignment’ will result in pushes toward some common-sense-morality style of not harming the humans that are sufficiently strong, despite all the competitive dynamics among AIs and the various other things they value and grow to value, that things turn out fine by default, in worlds that to me seem past their point of no return and infinitely doomed unless you think the AIs themselves have value.
Alternatively, yes, Her is primarily about predicting the humans. And perhaps it is a good depiction of how humans would react to and interact with AI if that scenario took place. But it does a very poor job predicting the AIs, which is the part that actually matters here?
For the opposite perspective, see for example Eliezer Yudkowsky here last month.
Joshua Achiam versus Eliezer Yudkowsky
We definitely have a pattern of Joshua taking rhetorical pot-shots at Yudkowsky and AI. Here’s a pretty bad one:
I think this kind of rhetoric about ‘calls for violence’ is extremely bad and wrong. Even for example here, where the thread’s primary purpose is to point out that certain accusations against EA (that they ‘underemphasized AI x-risk’ and pretended to care about other things) are indeed quite ridiculous, you see him refer to Eliezer “Bomb the Datacenters” Yudkowsky.
What Yudkowsky said was that if there was an international agreement that you don’t develop AGI, you would, if you ran out of other alternatives, use physical force to enforce that agreement. That is how every law and every treaty or agreement works, and indeed the only way they can work.
This is so frustrating. Unlawful violence quite obviously is unacceptable even if it would work, and also it very obviously wouldn’t work. And we keep saying that.
In terms of Yudkowsky’s technical predictions, I noticed I disagreed, and rather than argue details I asked Claude and o1-preview this question (fully one-shot only):
“Based on your understanding of AI technical developments as of March 29, 2023, evaluate the most important known object-level predictions of Eliezer Yudkowsky on the subject, and which ones seemed true versus false. Afterwards, evaluate those predictions as a group, on a scale from ‘mostly true’ to ‘mostly false.’”
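For anyone who wants to try reproducing the one-shot query programmatically rather than through a chat interface, here is a minimal sketch using the Anthropic Python SDK. The model name, token limit, and environment setup are my illustrative assumptions, not details from the post:

```python
# Minimal sketch: run the evaluation prompt as a single one-shot query against Claude.
# Assumes the `anthropic` package is installed and ANTHROPIC_API_KEY is set in the environment.
import anthropic

PROMPT = (
    "Based on your understanding of AI technical developments as of March 29, 2023, "
    "evaluate the most important known object-level predictions of Eliezer Yudkowsky "
    "on the subject, and which ones seemed true versus false. Afterwards, evaluate "
    "those predictions as a group, on a scale from 'mostly true' to 'mostly false.'"
)

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model choice, not necessarily the one used here
    max_tokens=1500,
    messages=[{"role": "user", "content": PROMPT}],  # one attempt, no retries or follow-ups
)
print(response.content[0].text)
```

Substituting a different name, as one commenter did with Gary Marcus, only requires editing the PROMPT string.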
Skipping to the conclusions (you can of course try the prompt yourself, or try your own):
Given how difficult predictions are to make, especially about the future, that’s not bad, and certainly quite different from ‘almost all wrong’ to the point of needing everyone else to dismiss him as a thinker.
One of Eliezer’s key concepts is instrumental convergence. In this thread Achiam argues against the fully maximalist form of instrumental convergence:
It is technically true that you can impose arbitrarily strict implementation details and constraints on a goal, such that instrumental convergence ceases to be a useful means of approaching the goal, and thus you should expect not to observe it.
Without getting into any technical arguments, it seems rather absurd to suggest the set of goals that imply undesired subgoals within plausibly desired goal space would have measure zero? I don’t see how this survives contact with common sense or relation to human experience or typical human situations. Most humans spend most of their lives pursuing otherwise undesired subgoals and subdrives that exist due to other goals, on some level. The path to achieving almost any big goal, or pursuing anything maximalist, will do the same.
When I think about how an AI would achieve a wide variety of goals humans might plausibly assign to it, I see the same result. We’ve also now seen observations (at least the way I interpret them) of instrumental convergence in existing models, when given goals, reasonably consistently among the reports I see that give the model reason to do so.
Am I holding out some probability that instrumental convergence mostly won’t be a thing for highly capable AIs? I have to, because this is not a place you can ‘prove’ anything as such. But it would be really boggling to me for it to almost never show up, if we assigned various complex and difficult tasks, and gave the models capabilities where instrumental convergence was clearly the ‘correct play,’ without any active attempts to prevent instrumental convergence from showing up.
I agree we should continue to test and verify, and would even if everyone agreed it was super likely. But convergence failing to show up would blow my mind hardcore.
People Are Going to Give AI Power
In the before times, people said things like ‘oh you wouldn’t connect your AI to the internet, you’d be crazy.’ Or they’d say ‘you wouldn’t make your AI into an agent and let it go off with your [crypto] wallet.’
Those predictions did not survive contact with the enemy, or with reality. Whoops.
This seems insanely difficult to avoid. As I have tried to warn many times, once AI is more effective at various tasks than humans, any humans who don’t turn those tasks over to AIs get left behind. That’s true for individuals, groups, corporations and nations. If you don’t want the AIs to be given that power, you have two options: You can prevent the AIs from being created, or you can actively bar anyone from giving the AIs that power, in a way that sticks.
Indeed, I would go further. The market wants the AIs to be given as much freedom and authority as possible, to send them out to compete for resources and influence generally, for various ultimate purposes. And the outcome of those clashes and various selection effects and resource competitions, by default, dooms us.
Your third option is the one Joshua suggests, that you assume the AIs get the power and plan accordingly.
Seems really hard on multiple levels. There is an implicit ‘you build distinct AIs to handle distinct narrow tasks where you can well-define what they’re aiming for’ but that is also not what the market wants. The market wants general purpose agents that will go out and do underspecified tasks to advance people’s overall situations and interests, in ways that very much want them to do all the things the above wants them not to do. The market wants AIs advising humans on every decision they make, with all the problems that implies.
If you want AIs to only do well-specified things in well-specified domains according to socially approved objectives and principles, how do you get to that outcome? How do you deal with all the myriad incentives lining up the other way, all the usual game theory problems? And that’s if you actually know how to get the AIs to be smart enough and perceptive enough to do their work yet respond to the training sets in a way that gets them to disregard, even under pressure, the obviously correct courses of action on every other level.
These are super hard and important questions and I don’t like any of the answers I’ve seen. That includes Joshua’s suggested path, which doesn’t seem like it solves the problem.
The place it gets weird is in this follow-up.
But that’s exactly how I think the competition for resources x-risk thing manifests. Browder outsources his everything to AgentGPT-N. He tells it to go out and use his money to compete for resources. So does everyone else. And then things happen.
So the argument is that these AIs will then ‘enthusiastically collaborate’ with each other? Why should we expect that? Is this an AIs-will-use-good-decision-theory claim? Something else? If they do all cooperate fully with each other, how does that not look like them taking control to maximize some joint objective? And so on.
In good news that is not directly related but is relevant to similar issues, he notes that some people are indeed ‘writing the spec,’ which is the kind of work he seems to think is most important?
I do think it’s very cool that several people are taking a crack at writing specifications. I have no idea how their specs could be expected to work and solve all these problems, but yes people are at least writing some specs.
Value is Complicated
Here is a thread by Joshua Achiam from July 2023, which I believe represents both a really huge central unsolved problem and also a misunderstanding:
I would strongly disagree, and say this is the only approach I know of that takes the problem of what we value seriously, and that a false sense of exactly what you are protecting, or trying to aim now at protecting a specific specified target, would make things less likely to work. You’d pay to know what you really think. We old school rationalists, starting with Eliezer Yudkowsky, have been struggling with the ‘what are human values’ problem as central to alignment for a long time.
Sixteen years ago, Eliezer Yudkowsky wrote the Value Theory sequence, going deep on questions like what makes things have value to us, how to reconcile when different entities (human or otherwise) have very different values, and so on. If you’re interested in these questions, this is a great place to start. I have often tried to emphasize that I continue to believe that Value is Fragile, whereas many who don’t believe in existential risk think value is not fragile.
It is a well-understood problem among our crowd that ‘human values’ is both very complex and a terrifyingly hard thing to pin down, and that people very strongly disagree about what they value.
Also it is a terrifyingly easy thing to screw up accidentally, and we have often said that this is one of the important ways to build AGI and lose – that you choose a close and well-meaning but incorrect specification of values, or your chosen words get interpreted that way, or someone tries to get the AGI to find those values by SGD or other search and it gets a well-meaning but incorrect specification.
Thus, the idea to institute Coherent Extrapolated Volition, or CEV, which is very roughly ‘what people would collectively choose as their values, given full accurate information and sufficient time and skill to contemplate the question.’
Why would you do that? Exactly because of the expectation that if you do almost anything else, you’re not only not taking everyone’s values into account, you don’t even understand your own well enough to specify them. I certainly don’t. I don’t even have confidence that CEV, if implemented, would result in that much of the things that I actually value, although I’d take it. And yes, this whole problem terrifies me even in good scenarios.
What am I fighting to preserve right now? I am fighting for the ability to make those choices later. That means the humans stay alive and they stay in control. And I am choosing to be less concerned about exactly which humans get to choose which humans get to choose, and more concerned with humans getting to properly choose at all.
Because I expect that if humans don’t make an active choice, or use a poor specification of preferences that gets locked in, then the value that results is likely zero. Whereas if humans do choose intentionally, even humans whose values I strongly disagree with and that are being largely selfish, I do expect those worlds to have strongly positive value. That’s a way in which I think value isn’t so fragile. So yes, I do think the focus should be ensuring someone gets to choose at all.
Also, I strongly believe for these purposes in a form of the orthogonality thesis, which here seems obviously true to me. In particular: Either you can get the AI to reflect the values of your choice, or you can’t. You don’t need to know which values you are aiming for in order to figure that part out. And once you figure that part out you can and should use the AI to help you figure out your values.
Meanwhile, yes, I spend rather a lot of time thinking about what is actually valuable to me and others, without expecting us humans to find the answer on our own, partly because one cannot avoid doing so, and partly because it is decision-relevant in questions like ‘how much existential risk should we accept in the name of beating China?’
In a world where everyone wants the AI to ‘do our alignment homework’ for us, one must ask: What problems must we solve before asking the AI to ‘do our homework’ for us, versus which questions then allow us to safely ask the AI to do that? Almost everyone agrees, in some form, that the key is solving the problems that clear the way to letting AI fully help solve our other problems.
And no, I don’t like getting into too much detail about my best guess about what I value or we collectively should value in the end, both because I think value differences should be respected and because I know how distracting and overwhelming those discussions and fights get if you let them start.
Mostly, I try to highlight those who are expressing values I strongly disagree with – in particular, those that favor or are fine with human extinction. I’m willing to say I’m not okay with that, and I don’t find any of the ‘but it’s fine’ proposals to be both acceptable and physically realistic so far.
Is all this a ‘cop out’? I would say, absolutely not.
Do people ‘notice’ that you are insufficiently focused on these questions? Oh, sure. They notice that you are not focused on those political fights and arguments. Some of them will not like that, because those questions are what they care about. The alternative is that they notice the opposite. That’s worse.
Others appreciate that you are focused on solving problems and growing or preserving the pie, rather than arguing values and focusing on political battles.
Yes, if we succeed in getting to continue to live, then as he says here we will have to agree on how to divide the bounty and do the realignments (I would add, voluntarily or otherwise), same as we do today. But the parties aren’t in a position to negotiate about this now: we don’t know what is available, we don’t know what we want, and we don’t have anyone who could credibly negotiate for any of the sides or interests, and so on. Kicking the ‘who benefits’ can down the road is a time-tested thing to do when inventing new technologies and ensuring they’re safe to deploy.
Conclusion
The interactions I’ve had with Joshua after my initial errors leave me optimistic for continued productive dialogue. Whatever our disagreements, I believe Joshua is trying to figure things out and ensure we have a good future, and that all the public statements analyzed above were intended to be helpful. That is highly refreshing.
Those statements do contain many claims with which I very strongly disagree. We have very different threat models. We take very different views of various predictions and claims, about both the past and the future. At least in the recent past, he was highly dismissive of commonly expressed timeline projections, risk models and risk level assessments, including my own and even more those of many of his colleagues. At core, while I am very happy he at least does think ‘ordinary safety practices’ are necessary and worthwhile, he thinks ‘ordinary safety practices’ would ‘get us there’ and I very much do not expect this. And I fear the views he expresses may lead to shutting out many of those with the most important and strongest concerns.
These disagreements have what seem like important implications, so I am glad I took the time to focus on them and lay them out in detail, and hopefully start a broader discussion.