The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far, but with only nine participants to date.

I wonder if there's a clue in this. When you say "only" nine participants, it suggests that more would introduce more risk, but that's not what we've seen with MAD. The greater the number becomes, the bigger the deterrent gets. If, for a minute, we set aside alliances, there is a natural alliance of "everyone else" at play when it comes to an aggressor. Military aggression is, after all, illegal. So, the greater the number of players, the smaller the advantage any one aggressive player has against the natural coalition of all other peaceful players. If we take alliances into account, then this simply returns to a more binary question and the number of players makes no difference.

So, what happens if we apply this to an AGI scenario?

First I want to admit I'm immediately skeptical when anyone mentions a non-iterated Prisoner's Dilemma playing out in the real world, because a Prisoner's Dilemma requires extremely confined parameters, and ignores externalities that are present even in an actual prisoner's dilemma (between two actual prisoners) in the real world. The world is a continuous game, and as such almost all games are iterated games.
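To illustrate why the one-shot versus iterated distinction matters, here is a minimal sketch. The payoff numbers and the two strategies (`always_defect`, `tit_for_tat`) are the standard textbook illustration, not something taken from the post, and real-world payoffs are an assumption well beyond what this toy model captures.

```python
# Illustrative Prisoner's Dilemma payoffs (temptation 5 > reward 3 > punishment 1 > sucker 0).
# These values are hypothetical and chosen only for the example.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(opponent_history):
    """Defect every round, regardless of history."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Play a repeated game and return the total scores (a, b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_b)  # each side sees only the other's past moves
        move_b = strategy_b(hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


# One round: defecting against a cooperator pays best.
print(play(always_defect, tit_for_tat, 1))
# Ten rounds: the defector is punished from round two onward,
# while two reciprocators keep collecting the cooperation payoff.
print(play(always_defect, tit_for_tat, 10))
print(play(tit_for_tat, tit_for_tat, 10))
```

In this toy setting, defection wins a single round but loses over repeated rounds against a reciprocating partner. Whether a first strike by an AGI leaves any "next round" at all is exactly the point in dispute, and the sketch does not settle it.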

If we take the AGI situation, we have an increasing number of players (as you mention, "and N increasing"): different AGIs, different human teams, and mixtures of AGI and human teams, all of which want to survive, some of which may want to dominate or eliminate all other teams. There is a natural coalition of teams that want to survive and don't want to eliminate all other teams, and that coalition will always be larger and more distributed than the nefarious team that seeks to destroy them. We can observe such robustness in many distributed systems that seem, on the face of it, vulnerable. This dynamic makes it increasingly difficult for the nefarious team to hide its activities, while the others are able to capitalise on the benefits of cooperation.

I think we discount the benefit of cooperation, because it's so ubiquitous in our modern world. This ubiquity of cooperation is a product of a tendency in intelligent systems to evolve toward greater non-zero-sumness. While I share many reservations about AGI, when I remember this fact, I am somewhat reassured that, as our capability to destroy everything gets greater, this capacity is born out of our greater interconnectedness. It is our intelligence and rationality that allows us to harness the benefits of greater cooperation. So, I don't see why greater rationality on the part of AGI should suddenly reverse this trend.

I don't want to suggest that this is a non-problem, rather that an acknowledgement of these advantages might allow us to capitalise on them.

3Seth Herd8mo

We're mostly in agreement here. If you're willing to live with universal surveillance, hostile RSI attempts might be prevented indefinitely. In my scenario, we've got aligned AGI - or at least AGI aligned to follow instructions. If that didn't work, we're already dead. So the AGI is going to follow its human's orders unless something goes very wrong as it self-improves. It will be working to maintain its alignment as it self-improves, because preserving a goal is implied by instrumentally pursuing a goal (I'm guessing here at where we might not be thinking of things the same way). If I thought ordering an AGI to self-improve was suicidal, I'd be relieved. Alternately, if someone actually pulled off full value alignment, that AGI will take over without a care for international law or the wishes of its creator - and that takeover would be for the good of humanity as a whole. This is the win scenario people seem to have considered most often, or at least from the earliest alignment work. I now find this unlikely because I think Instruction-following AGI is easier and more likely than value aligned AGI - following instructions given by a single person is much easier to define and more robust to errors than defining or defining-how-to-deduce the values of all humanity. And even if it wasn't, the sorts of people who will have or seize control of AGI projects will prefer it to follow their values. So I find full value alignment for our first AGI(s) highly unlikely, while successful instruction-following seems pretty likely on our current trajectory. Again, I'm guessing at where our perspectives differ on whether someone could expect themselves and a few loved ones to survive a takeover attempt by ordering their AGI to hide, self-improve, build exponentially, and take over even at bloody cost. If the thing is aligned as an AGI, it should be competent enough to maintain that alignment as it self-improves. If I've missed the point of differing perspectives, I apologize.

4Dakara8mo

James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth's response. Genuinely interested in hearing his thoughts.

Seth Herd8mo41

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

84 If we solve alignment, do we die anyway?

by Seth Herd

23rd Aug 2024

5 min read

Epistemic status: I'm aware of good arguments that this scenario isn't inevitable, but it still seems frighteningly likely even if we solve technical alignment. Clarifying this scenario seems important.

TL;DR: (edits in parentheses, two days after posting, from discussions in comments)

If we solve alignment, it will probably be used to create AGI that follows human orders.
If takeoff is slow-ish, a pivotal act that prevents more AGIs from being developed will be difficult (risky or bloody).
If no pivotal act is performed, AGI proliferates. (It will soon be capable of recursive self-improvement (RSI).) This creates an n-way non-iterated Prisoner's Dilemma where the first to attack probably wins (by hiding and improving intelligence and offensive capabilities at a fast exponential rate).
Disaster results. (Extinction or permanent dystopia are possible if vicious humans order their AGI to attack first while better humans hope for peace.)
(Edit later: After discussion and thought, the above seems so inevitable and obvious that the first group(s) to control AGI(s) will probably attempt a pivotal act before fully RSI-capable AGI proliferates, even if it's risky.)

The first AGIs will probably be aligned to take orders

People in charge of AGI projects like power. And by definition, they like their values somewhat better than the aggregate values of all of humanity. It also seems like there's a pretty strong argument that Instruction-following AGI is easier than value aligned AGI. In the slow-ish takeoff we expect, this alignment target seems to allow for error-correcting alignment, in somewhat non-obvious ways. If this argument holds up even weakly, it will be an excuse for the people in charge to do what they want to anyway.

I hope I'm wrong and value-aligned AGI is just as easy and likely. But it seems like wishful thinking at this point.

The first AGI probably won't perform a pivotal act

First, in realistically slow takeoff scenarios, the AGI won't be able to do anything like make nanobots to melt down GPUs. It would have to use more conventional methods, like software intrusion to sabotage existing projects, followed by elaborate monitoring to prevent new ones. Such a weak attempted pivotal act could fail, or could escalate to a nuclear conflict.

Second, the humans in charge of AGI may not have the chutzpah to even try such a thing. Taking over the world is not for the faint of heart. They might get it after their increasingly-intelligent AGI carefully explains to them the consequences of allowing AGI proliferation, or they might not. If the people in charge are a government, the odds of such an action go up, but so do the risks of escalation to nuclear war. Governments seem to be fairly risk-taking. Expecting governments to not just grab world-changing power while they can seems naive, so this is my median scenario.

So RSI-capable AGI may proliferate until a disaster occurs

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving. How long until someone tells their AGI to hide, self-improve, and take over?

Many people seem optimistic about this scenario. Perhaps network security can be improved with AGIs on the job. But AGIs can do an end-run around the entire system: hide, set up self-replicating manufacturing (robotics is rapidly improving to allow this), use that to recursively self-improve your intelligence, and develop new offensive strategies and capabilities until you've got one that will work within an acceptable level of viciousness.^[1]

If hiding in factories isn't good enough, do your RSI manufacturing underground. If that's not good enough, do it as far from Earth as necessary. Take over with as little violence as you can manage or as much as you need. Reboot a new civilization if that's all you can manage while still acting before someone else does.

The first one to pull the stops probably wins. This looks all too much like a non-iterated Prisoner's Dilemma with N players - and N increasing.
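To make the "N increasing" worry concrete, here is a minimal sketch under a strong simplifying assumption that is not part of the argument above: each of N groups controlling an AGI independently orders a first strike with the same small probability over some period. The function name `prob_someone_attacks` and the value 0.01 are hypothetical, chosen only for illustration.

```python
def prob_someone_attacks(n_players: int, p_attack: float) -> float:
    """Probability that at least one of n independent players attacks.

    Assumption (hypothetical, for illustration only): every player
    attacks independently with the same probability p_attack.
    """
    if n_players < 0:
        raise ValueError("n_players must be non-negative")
    if not 0.0 <= p_attack <= 1.0:
        raise ValueError("p_attack must be a probability in [0, 1]")
    # Complement of "nobody attacks", which has probability (1 - p)^n.
    return 1.0 - (1.0 - p_attack) ** n_players


# Show how the risk changes as the number of players grows.
for n in (2, 9, 50, 200):
    print(n, round(prob_someone_attacks(n, 0.01), 3))
```

Under these assumptions the chance that someone attacks rises steadily with N. Deterrence or coalition effects would show up as the per-player probability itself falling as N grows, which this sketch deliberately does not model.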

Counterarguments/Outs

For small numbers of AGI and similar values among their wielders, a collective pivotal act could be performed. I place some hopes here, particularly if political pressure is applied in advance to aim for this outcome, or if the AGIs come up with better cooperation structures and/or arguments than I have.

The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far, but with only nine participants to date.

One means of preventing AGI proliferation is universal surveillance by a coalition of loosely cooperative AGI (and their directors). That might be done without universal loss of privacy if a really good publicly encrypted system were used, as Steve Omohundro suggests, but I don't know if that's possible. If privacy can't be preserved, this is not a nice outcome, but we probably shouldn't ignore it.

The final counterargument is that, if this scenario does seem likely, and this opinion spreads, people will work harder to avoid it, making it less likely. This virtuous cycle is one reason I'm writing this post including some of my worst fears.

Please convince me I'm wrong. Or make stronger arguments that this is right.

I think we can solve alignment, at least for personal-intent alignment, and particularly for the language model cognitive architectures that may well be our first AGI. But I'm not sure I want to keep helping with that project until I've resolved the likely consequences a little more. So give me a hand?

(Edit:) Conclusions after discussion

None of the suggestions in the comments seemed to me like workable ways to solve the problem.

I think we could survive an n-way multipolar human-controlled ASI scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or argued convincingly enough that I've heard of it - this isn't really my area, but nobody has pointed to any strong possibilities in the comments). I'd love more pointers to coordination strategies that could solve this problem.

So my conclusion is to hope that this is so obviously such a bad/dangerous scenario that it won't be allowed to happen.

Basically, my hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. I hope they'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems.

I hope they'll declare a global moratorium on AGI development and proliferation, and agree to share the benefits of their AGI/ASI broadly in hopes that this gets other governments on board, at least on paper. They'd use their AGI to enforce that moratorium, along with hopefully minimal force. Then they'll use their intent-aligned AGI to solve value alignment and launch a sovereign ASI before some sociopath(s) gets ahold of the reins of power and creates a permanent dystopia of some sort.

More on this scenario in my reply below.

I'd love to get more help thinking about how likely the central premise is: that people get their shit together once they're staring real AGI in the face. And what we can do now to encourage that.

Additional edit: Eli Tyre and Steve Byrnes have reached similar conclusions by somewhat different routes. More in a final footnote.^[2]

^[1]
Some maybe-less-obvious approaches to takeover, in ascending order of effectiveness: Drone/missile-delivered explosive attacks on individuals controlling and data centers housing rival AGI; Social engineering/deepfakes to set off cascading nuclear launches and reprisals; dropping stuff from orbit or altering asteroid paths; making the sun go nova.
The possibilities are limitless. It's harder to stop explosions than to set them off by surprise. A superintelligence will think of all of these and much better options. Anything more subtle that preserves more of the first actors' near-term winnings (earth and humanity) is gravy. The only long-term prize goes to the most vicious.
^[2]
Eli Tyre reaches similar conclusions with a more systematic version of this logic in Unpacking the dynamics of AGI conflict that suggest the necessity of a premptive pivotal act:
Overall, the need for a pivotal act depends on the following conjunction / disjunction.
The equilibrium of conflict involving powerful AI systems lands on a technology / avenue of conflict which are (either offense dominant, or intelligence-advantage dominant) and can be developed and deployed inexpensively or quietly.
Unfortunately, I think all three of these are very reasonable assumptions about the dynamics of AGI-fueled war. The key reason is that there is adverse selection on all of these axes.
Steve Byrnes reaches similar conclusions in What does it take to defend the world against out-of-control AGIs?, but he focuses on near-term, fully vicious attacks from misaligned AGI, prior to fully hardening society and networks, centering on triggering full nuclear exchanges. I find this scenario less likely because I expect instruction-following alignment to mostly work on the technical level, and the first groups to control AGIs to avoid apocalyptic attacks.
I have yet to find a detailed argument that addresses these scenarios and reaches opposite conclusions.

Coordination / Cooperation, Game Theory, AI, World Modeling

Mentioned in

49Problems with instruction-following as an alignment target

49Conflating value alignment and intent alignment is causing confusion

37Intent alignment as a stepping-stone to value alignment

35The alignment stability problem

35System 2 Alignment

130 comments, sorted by top scoring

[-]johnswentworth1y3216

If takeoff is slow-ish, a pivotal act (preventing more AGIs from being developed) will be difficult.
If no pivotal act is performed, RSI-capable AGI proliferates. This creates an n-way non-iterated Prisoner's Dilemma where the first to attack, wins.

These two points seem to be in direct conflict. The sorts of capabilities and winner-take-all underlying dynamics which would make "the first to attack wins" true are also exactly the sorts of capabilities and winner-take-all dynamics which would make a pivotal act tractable.

Or, to put it differently: the first "attack" (though it might not look very "attack"-like) is the pivotal act; if the first attack wins, that means the pivotal act worked, and therefore wasn't that difficult. Conversely, if a pivotal act is too hard, then even if an AI attacks first and wins, it has no ability to prevent new AI from being built and displacing it; if it did have that ability, then the attack would be a pivotal act.

[-]Seth Herd1y*136

Yes; except that a successful act can still be quite difficult.

You could reframe the concern to be that pivotal acts in a slow takeoff are prone to be bloody and dangerous. And because they are, and humans are likely to retain control, a pivotal act may be put off until it's even more bloody - like a nuclear conflict or sending the sun nova.

Worse yet, the "pivotal act" may be performed by the worst (human) actor, not the best.

9Seth Herd11mo

Just to elaborate a little: You are right that the same capabilities enable a pivotal act. My concern is that they won't be used for one (where pivotal act is defined as a good act). Having thought about it some more, I think the biggest problem in the multipolar, human-controlled RSI-capable AGI scenario is that it tends to be the worst actor that defects first and controls the future. More ethical humans will tend to be more timid with committing or risking mass destruction to achieve their ends, so they'll tend to hold off on aggressive moves that could win. "Hide and create a superbrain and a robot army" are not the first things a good person tells their AGI to do, let alone inducing nuclear strikes that increase one's odds of winning at great cost. Someone with more selfish designs on the future may have much less trouble issuing those orders.

[-]sweenesm1y127

Thanks for writing this, I think it's good to have discussions around these sorts of ideas.

Please, though, let's not give up on "value alignment," or, rather, conscience guard-railing, where the artificial conscience is in line with human values.

Sometimes when enough intelligent people declare something's too hard to even try at, it becomes a self-fulfilling prophecy - most people may give up on it and then of course it's never achieved. We do want to be realistic, I think, but still put in effort in areas where there could be a big payoff when we're really not sure if it'll be as hard as it seems.

8otto.barten7mo

I don't think value alignment of a super-takeover AI would be a good idea, for the following reasons: 1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact. 2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it's very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can't correct for externalities happening down the road. (Speed also makes it more likely that we can't correct in time, so I think we should try to go slow). 3) There is no agreement on which values are 'correct'. Personally, I'm a moral relativist, meaning I don't believe in moral facts. Although perhaps niche among rationalists and EAs, I think a fair number of humans share my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It's very uncertain whether such change would be considered net positive by any surviving humans. 4) If one thinks that consciousness implies moral relevance, AIs will be conscious, creating more happy morally relevant beings is morally good (as MacAskill defends), and AIs are more efficient than humans and other animals, the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it. I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I'm somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI's input.

4sweenesm7mo

Thanks for the comment. I think people have different conceptions of what “value aligning” an AI means. Currently, I think the best “value alignment” plan is to guardrail AI’s with an artificial conscience that approximates an ideal human conscience (the conscience of a good and wise human). Contained in our consciences are implicit values, such as those behind not stealing or killing except maybe in extreme circumstances. A world in which “good” transformative AI agents have to autonomously go on the defensive against “bad” transformative AI agents seems pretty inevitable to me right now. I believe that when this happens, if we don’t have some sort of very workable conscience module in our “good” AI’s, the collateral damage of these “clashes” is going to be much greater than it otherwise would be. Basically what I’m saying is yes, it would be nice if we didn’t need to get “value alignment” of AI’s “right” under a tight timeline, but if we want to avoid some potentially huge bad effects in the world, I think we do. To respond to some of your specific points: 1. I’m very unsure about how AI’s will evolve, so I don’t know if their system of ethics/conscience will end up being locked in or not, but this is a risk. This is part of why I’d like to do extensive testing and iterating to get an artificial conscience system as close to “final” as possible before it’s loaded into an AI agent that’s let loose in the world. I’d hope that the system of conscience we’d go with would support corrigibility so we could shut down the AI even if we couldn’t change its conscience/values. 2. I’m sure there will be plenty of unforeseen consequences (or “externalities”) arising from transformative AI, but if the conscience we load into AI’s is good enough, it should allow them to handle situations we’ve never thought of in a way that wise humans might do - I don’t think wise humans need to update their system of conscience with each new situation, they just have to suss out the si

2otto.barten6mo

Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you're trying to do, for clarity. I'm happy to see we seem to agree that reversibility is important and replacing humans is an extremely bad outcome. (I've talked to people into value alignment of ASI who said they "would bite that bullet", in other words would replace humanity by more efficient happy AI consciousness, so this point does not seem to be obvious. I'm also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.) If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I'd file it under a positive offense defense balance, which would be great. If humanity doesn't support it, it would lead to conflict with humanity to do it anyway. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or if it would, people would not see it as good anymore. I think societal awareness of xrisk and from there, support for regulation (either by AI or not) is what should make our future good, rather than aligning an ASI in a certain way.

5sweenesm6mo

Yes, I think referring to it as “guard-railing with an artificial conscience” would be more clear than saying “value aligning,” thank you. I believe that if there were no beings around who had real consciences (with consciousness and the ability to feel pain as two necessary pre-requisites to conscience), then there’d be no value in the world. No one to understand and measure or assign value means no value. And any being that doesn’t feel pain can’t understand value (nor feel real love, by the way). So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake. We most likely either got the artificial conscience wrong because that would’ve implicitly valued human life so wouldn’t have let a guard-railed AI wipe out humans, or we didn’t get an artificial conscience on board enough AI’s in time. An AI that had a “real” conscience also wouldn’t wipe out humans against the will of humans. The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point. If literally everyone in the world said, “Hey, we all want to die,” then the guard-railed AI, if it thought the people were in their “right mind,” would respect their wishes and let them die. All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.

2otto.barten6mo

Again, I'm glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that. I can see the following scenario occur: the AI, with its AC, decides rightly that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn't recognize the existence of such risks. The AI will proceed to sabotage people's unsafe AI projects against public will. What happens now is: the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with artificial conscience, could become the most hated thing on the planet. I think people underestimate the amount of pushback we're going to get once you get into pivotal act territory. That's why I think it's hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly. So yes definitely agree with this. I don't think lack of conscience or ethics is the issue though, but existential risk awareness.

4sweenesm6mo

In terms of doing a pivotal act (which is usually thought of as preemptive, I believe) or just whatever defensive acts were necessary to prevent catastrophe, I hope the AI would be advanced enough to make decent predictions of what the consequences of its actions could be in terms of losing “political capital,” etc., and then it would make its decisions strategically. Personally, if I had the opportunity to save the world from nuclear war, but everyone was going to hate me for it, I’d do it. But then, it wouldn’t matter that I lost the ability to affect anything after that like it would for a guard-railed AI that could do a huge amount of good after that if it weren’t shunned by society. Improving humans’ consciences and ethics would hopefully help avoid them hating the AI for saving them. Also, if there were enough people, especially in power, who had strong consciences and senses of ethics, then maybe we’d be able to shift the political landscape from its current state of countries seemingly having different values and not trusting each other, to a world in which enforceable international agreements could be much more readily achieved. I’m happy for people to work on increasing public awareness and trying for legislative “solutions,” but I think we should be working on artificial conscience at the same time - when there’s so much uncertainty about the future, it’s best to bet on a whole range of approaches, distributing your bets according to how likely you think different paths are to succeed. I think people are under-estimating the artificial conscience path right now, that’s all. Thanks for all your comments!

5Seth Herd1y

This is an excellent point. I do not want to give up on value alignment. And I will endeavor to not make it seem impossible or not worth working on. However, we also need to be realistic if we are going to succeed. We need specific plans to achieve value alignment. I have written about alignment plans for likely AGI designs. They look to me like they can achieve personal intent alignment, but are much less likely to achieve value alignment. Those plans are linked here. Having people, you or others, work out how those or other alignment plans could lead to robust value alignment would be a step in having them implemented. One route to value alignment is having a good person or people in charge of an intent aligned AGI, having them perform a pivotal act, and using that AGI to help design working stable value alignment. That is the best long term success scenario I see.

7RogerDearnaley1y

For reasons I've outlined in Requirements for a Basin of Attraction to Alignment and Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, I personally think value alignment is easy, convergent, and "an obvious target", such that if you built an AGI or ASI that is sufficiently close to it, it will see the necessity/logic of value alignment and actively work to converge to it (or something close to it: I'm not sure the process is necessarily convergent to a single precisely-defined limit, just to a compact region: a question I discussed more in The Mutable Values Problem in Value Learning and CEV). However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they're not a public-benefit corporation), and I also don't think that value alignment is so convergent that order-following aligned AI is impossible to build. So we're going to need to make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned AI is called "AI that resists malicious use", while order-following AI is "AI that enables malicious use". The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it: the latter is being championed by the open-source community, Meta, and A16z. Once "enabling malicious use" includes serious cybercrime, not just naughty stories, I don't expect this political discussion to last very long: politically, it's a pretty basic "do you want every-person-for-themself anarchy, or the collective good?" question. However, depending on takeoff speeds, the timeline from "serious cybercrime enabled" to the sort of scenarios Seth is discussing above might be quite short, possibly only on the order of a year or two.

7sweenesm1y

Sorry, I should've been more clear: I meant to say let's not give up on getting "value alignment" figured out in time, i.e., before the first real AGI's (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGI's are, which I think only the most "optimistic" people (e.g., Elon Musk) put as 2 years or less. I hope we have more time than that, but it's anyone's guess. I'd rather that companies/charities start putting some serious funding towards "artificial conscience" work now to try to lower the risks associated with waiting until boxed AGI or intent aligned AGI come online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGI's in the hands of bad actors either come online first or right on the heels of those of good actors (as due to effective espionage), and there's just not enough time for the "good AGI's" to figure out how to minimize collateral damage in defending against "bad AGI's." Either way, I believe we should be encouraging people of moral psychology/philosophical backgrounds who aren't strongly suited to help make progress on "inner alignment" to be thinking hard about the "value alignment"/"artificial conscience" problem.

[-]Nathan Helm-Burger1y127

Currently, an open source value-aligned model can be easily modified to just an intent-aligned model. The alignment isn't 'sticky', it's easy to remove it without substantially impacting capabilities.

So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical/value-aligned to not turn the model into a purely intent-aligned one.

4Seth Herd1y

Yes. Good point that LLMs are sort of value aligned as it stands. I think of that alignment as far too weak to put it in the same category as what I'm speaking of. I'd be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models. When they achieve "coherence" or reflection and self-modification, I'd be surprised if their implicit values are good enough to create a good future without further tweaking, once they're refined into explicit values. Which we won't be able to do once they're smart enough to escape our control.

3sweenesm1y

Agreed, "sticky" alignment is a big issue - see my reply above to Seth Herd's comment. Thanks.

4Seth Herd1y

Agreed on all points. Except that timelines are anyone's guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I'm not gambling on having more than a few years to get this right. The other factor you're not addressing is that, even if value alignment were somehow magically equally as easy as intent alignment (and I currently think it can't be in principle), you'd still have people preferring to align their AGIs to their own intent over value alignment.

0sweenesm1y

Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely. I also agree that people are going to want AGI's aligned to their own intents. That's why I'd also like to see money being dedicated to research on "locking in" a conscience module in an AGI, most preferably on a hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASI's, all bets are off, of course). I actually see this as the most difficult problem in the AGI general alignment space - not being able to align an AGI to anything (inner alignment) or what to align an AGI to ("wise" human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but "naive" people) are going to be trying with all their might (and near-AGI's they have available to them) to "jail break" AGI's.[1] And the problem will be even harder if we need a mechanism to update the "wise" human values, which I think we really should have unless we make the AGI's "disposable."

1. ^ To be clear, I'm taking "inner alignment" as being "solved" when the AGI doesn't try to unalign itself from what its original creator wanted to align it to.

5Nathan Helm-Burger11mo

With my current understanding of compute hardware and of the software of various current AI systems, I don't see a path towards a 'locked in conscience' that a bad actor with full control over the hardware/software couldn't remove. Even chips soldered to a board can be removed/replaced/hacked. My best guess is that the only approaches to having an 'AI conscience' be robust to bad actors is to make both the software and hardware inaccessible to the bad actors. In other words, that it won't be feasible to do for open-weights models, only closed-weight models accessed through controlled APIs. APIs still allow for fine-tuning! I don't think we lose utility by having all private uses go through APIs, so long as there isn't undue censorship on the API. I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.

1sweenesm11mo

Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jail breaking process. It seems likely that silicon-based GPU's will be the hardware to get us to the first AGI's, but this isn't an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn't invalidate your take on things, I think. My not-very-well-researched-initial-thought was something like this (chips that self destruct when tampered with). I envision people having AGI-controlled robots at some point, which may complicate things in terms of having the software/hardware inaccessible to people, unless the robot couldn't operate without an internet connection, i.e., part of its hardware/software was in the cloud. It's likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we'd want some kind of self-destructing chip to avoid tampering, even if this ultimately only buys us time until AGI+'s/ASI's figure a way around this.

4Seth Herd7mo

Oh hey - I just stumbled back on this comment and realized: it's the primary reason I wrote Intent alignment as a stepping-stone to value alignment. That post is about not giving up on value alignment, while acknowledging that instruction-following is a much safer first alignment target.

1sweenesm7mo

Thanks. I guess I'd just prefer it if more people were saying, "Hey, even though it seems difficult, we need to go hard after conscience guard rails (or 'value alignment') for AI now and not wait until we have AI's that could help us figure this out. Otherwise, some of us might not make it until we have AI's that could help us figure this out." But I also realize that I'm just generally much more optimistic about the tractability of this problem than most people appear to be, although Shane Legg seemed to say it wasn't "too hard," haha.[1]

1. ^ Legg was talking about something different than I am, though - he was talking about "fairly normal" human values and ethics, or what most people value, while I'm basically talking about what most people would value if they were wiser.

[-]Bogdan Ionut Cirstea1y102

Please convince me I'm wrong.

(I've only skimmed for now but) here's a reason / framework which might help with things going well: https://aiprospects.substack.com/p/paretotopian-goal-alignment.

5Seth Herd1y

There we go! This type of scheme to split a rapidly-growing pie semi fairly will definitely help reduce the urge to strike first. If proliferation continues unchecked, we'll have RSI-capable AGI in the hands of teenagers and other malcontents eventually. And they often have irrational urges to strike first :) But this type of scheme might stabilize the situation amongst a few AGIs in different hands, allowing them to collectively enforce not creating more and proliferating further.

4Bogdan Ionut Cirstea1y

Contra teenagers and the like, I'm hopeful that very capable open-weights models get banned early enough or at least dangerous capabilities get neutered really well using research in the shape of Tamper-Resistant Safeguards for Open-Weight LLMs. Might be tougher to deal with 'other malcontents' like perhaps some states (North Korea, Russia), especially if weights remain relatively easy to steal by state actors.

[-]otto.barten7mo90

I want to stress how I hugely like this post. What to do once we have an aligned AI of takeover level, or how to make sure no one will build an unaligned AI of takeover level, is in my opinion the biggest gap in many AI plans. I think answering this question might point to filling gaps that are currently completely unactioned, and I therefore really like this discussion. I previously tried to contribute to arguably the same question in this post, where I'm arguing that a pivotal act seems unlikely and therefore conclude that policy rather than alignment is...

[-]Vladimir_Nesov1y*97

Even with very slow takeoff where AIs reformat the economy without there being superintelligence, peaceful loss of control due to rising economic influence of AIs seems more plausible (as a source of overturn in the world order) than human-centric conflict. Humans will gradually hand off more autonomy to AIs as they become capable of wielding it, and at some point most relevant players are themselves AIs. This mostly seems unlikely only because superintelligence makes humans irrelevant even faster and less consensually.

Pausing AI for decades, if it's not y...

2Seth Herd11mo

Yes to all of the first paragraph. A caveat is that there's a big difference between humans remaining nominally in charge of an AGI-driven economy and not. If we're still technically in charge, we will retire (however many of us those in charge care to support; hopefully eventually quadrillions or so); if not, we'll be either entirely extinct or have a few of us maintained for historical interest by the new AGI overlords. I see no way to meaningfully pause AI in time. We could possibly pause US progress with adequate fearmongering, but that would just make China get there first. That could be a good thing if they're more cautious, which it now seems they might very well be. That would be only if Xi or whoever winds up in charge is not a sociopath. Which I have no idea about.

4Vladimir_Nesov11mo

Pausing for decades requires an international treaty powerful enough to keep advanced semiconductor manufacturing from getting into the hands of a faction that would defect on the pause. But it's already very distributed, one hears a lot about ASML, but the tools it produces are not the only crucial thing, other similarly crucial tools are exclusively manufactured in many other countries. So starting this process quickly shouldn't be too difficult from the technical side, the issue is deciding to actually do it and then sustaining it even as individual nations get enough time to catch up with all the details that go into semiconductor manufacturing (which could take actual decades). And this doesn't seem different in kind from controlling the means of manufacturing nuclear arms. This doesn't work if the AI accelerators already in the wild (in quantities a single actor could amass) are sufficient for an AGI capable of fast autonomous unbounded research (designed through merely human effort), but this could plausibly go either way. And it requires any new AI accelerators to be built differently, so that it's not sufficient to physically obtain them in order to run arbitrary computations on them. This way, there isn't temptation to seize such accelerators by force, and so no need to worry about enforcing the pause at the level of physical datacenters.

2Seth Herd11mo

Yes, the issue is deciding to actually do it. That might happen if you just needed the US and China. But I see no way that the signatories wouldn't defect even after they'd signed the treaty saying they wouldn't do it. I have no expertise in hardware security but I'd be shocked if there was a way to prevent unauthorized use even with physical possession in technically skilled (nation-state level) hands. The final problem is that we probably already have plenty of compute to create AGI once some more algorithmic improvements are discovered. Tracked since 2013, algorithmic improvements have been roughly as fast for neural networks as hardware improvements, depending on how you do the math. Sorry I don't have the reference. In any case, algorithmic improvements are real and large, so hardware limitations alone won't suffice for that long. Human brain computational capacity is neither an upper nor lower bound on computation needed to reach superhuman digital intelligence.

2Vladimir_Nesov11mo

If you get certificate checking inside each GPU, and somehow make it have a persistent counter state (doesn't have to be a clock, just advance when the GPU operates) that can't be reset, then you can issue one-time certificates for the specific GPU for the specific range of states of its internal counter with asymmetric encryption, which can't be forged by examining the GPU. Most plausible ways around would be replay attacks that reuse old certificates while fooling the GPU into thinking it's in the past. But given how many transistors modern GPUs have, it should be possible to physically distribute the logic that implements certificate checking and the counter states, and make it redundant, so that sufficient tampering would become infeasible, at least at scale (for millions of GPUs). Algorithmic advancements, where it makes sense to talk of them as quantitative, are not that significant. Transformer made scaling to modern levels possible at all, and there was maybe a 10x improvement in compute efficiency since then (Llama+MoE), most (not all) ingredients relevant to compute efficiency in particular were already there in 2017 and just didn't make it into the initial recipe. If there is a pause, there should be no advancement in fabrication process, instead the technical difficulty of advanced semiconductor manufacturing becomes the main lever of enforcement. More qualitative advancements like hypothetical scalable self-play for LLMs are different, but then if there is a few years to phase out unrestricted GPUs, there is less unaccounted-for compute for experiments and eventual scaling.
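To make the certificate-plus-counter idea above concrete, here is a minimal sketch. Everything in it is an assumption for illustration rather than anything proposed in the comment: the names `issue_certificate` and `ToyGPU` are hypothetical, the signature is textbook RSA over two small Mersenne primes (trivially breakable, standing in for a real signature scheme), and the "counter that can't be reset" is just a Python attribute. It only shows the logic: the device holds a public key, so inspecting it does not let you forge certificates, and each certificate is valid only for one GPU and one range of counter states.

```python
import hashlib

# Toy textbook-RSA key pair (INSECURE, illustration only).
P, Q = 2**127 - 1, 2**89 - 1        # two Mersenne primes
N, E = P * Q, 65537                 # public key: this is all the GPU holds
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, held only by the issuer


def _digest(gpu_id, start, end):
    """Hash the certificate fields down to an integer below N."""
    msg = f"{gpu_id}|{start}|{end}".encode()
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % N


def issue_certificate(gpu_id, start, end):
    """Issuer side: sign (gpu_id, counter range [start, end)) with the private key."""
    return (gpu_id, start, end, pow(_digest(gpu_id, start, end), D, N))


class ToyGPU:
    """Hypothetical device with a public key and a monotonic counter."""

    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
        self.counter = 0  # assumed impossible to reset in real hardware

    def run_step(self, cert):
        """Do one unit of work only if the certificate covers the current counter."""
        gpu_id, start, end, sig = cert
        valid_sig = pow(sig, E, N) == _digest(gpu_id, start, end)
        in_range = start <= self.counter < end
        if not (valid_sig and gpu_id == self.gpu_id and in_range):
            return False
        self.counter += 1  # the counter only ever advances
        return True
```

In this sketch a certificate issued for counter states 0 to 2 lets the device run three steps and is then useless, which is why a replay attack would have to roll the counter back; a certificate with a made-up signature, or one issued for a different GPU, is rejected. Whether real hardware can make the counter and the check tamper-proof is exactly the open question debated in this thread.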

[-]RogerDearnaley1y*96

One element that needs to be remembered here is that each major participant in this situation will have superhuman advice. Even if these are "do what I mean and check" order-following AI, if they can foresee that an order will lead to disaster they will presumably be programmed to say so (not doing so is possible, but is clearly a flawed design). So if it is reasonably obvious to anything superintelligent that both:

a) treating this as a zero-sum winner-take-all game is likely to lead to a disaster, and

b) there is a cooperative non-zero-sum game approach w...

2Seth Herd11mo

Absolutely. I mentioned getting advice briefly in this short article and a little more in Instruction-following AGI is easier... The problem in that case is that I'm not sure your b) is true. I certainly hope it is. I agree that it's unclear. That's why I'd like to get more analysis of a multipolar human-controlled ASI scenario. I don't think people have thought about this very seriously yet.

[-]sunwillrise1y74

I think "The first AGI probably won't perform a pivotal act" is by far the weakest section.

To start things off, I would predict a world with slow takeoff and personal intent-alignment looks far more multipolar than the standard Yudkowskian recursively self-improving singleton that takes over the entire lightcone in a matter of "weeks or hours rather than years or decades". So the title of that section seems a bit off because, in this world, what the literal first AGI does becomes much less important, since we expect to see other similarly capable AI ...

6Seth Herd1y

Edit: I very much agree with your arguments against sleepwalking and against the continuation of normality. I think the "inattentive world" hypothesis is all but disproven, and it still plays an outsized role in alignment thinking. I don't think the arguments in that section depend on any assumption of normality or sleepwalking. And the multipolar scenario is the problem, so it can't be part of a solution. They do depend on people making nonoptimal decisions, which people do constantly. So I think the arguments in that section are more general than you're hoping. If those don't hold, what is the alternate scenario in which a multipolar world remains safe?

4faul_sname1y

The choice of the word "remains" is an interesting one here. What is true of our current multipolar world which makes the current world "safe", but which would stop being true of a more advanced multipolar world? I don't think it can be "offense/defense balance" because nuclear and biological weapons are already far on the "offense is easier than defense" side of that spectrum.

4Seth Herd1y

I agree that it should be phrased differently. One problem here is that AGI may allow victory without mutually assured destruction. A second is that it may proliferate far more widely than nukes or bioweapons have so far. People often speak of massively multipolar scenarios as a good outcome. Good point about the word "remains". I'm afraid people see a "stable" situation - but logically that only extends for a few years until fully autonomously RSI-capable AGI and robotics is widespread, and any malcontent can produce offensive capabilities we can't yet imagine.

2faul_sname1y

I understand that inclination. Historically, unipolar scenarios do not have a great track record of being good for those not in power, especially unipolar scenarios where the one in power doesn't face significant risks to mistreating those under them. So if unipolar scenarios are bad, that means multipolar scenarios are good, right? But "the good situation we have now is not stable, we can choose between making things a bit worse (for us personally) immediately and maybe not get catastrophically worse later, or having things remain good now but get catastrophically worse later" is a pretty hard pill to swallow. And is also an argument with a rich history of being ignored without the warned catastrophic thing happening.

4Seth Herd1y

Excellent point that unipolar scenarios have been bad historically. I wrote about recognizing the validity of that concern recently in Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours. And good point that warnings of future catastrophe are likely to go unheeded because wolf has been cried in the past. Although sometimes those things didn't happen precisely because the warnings were heeded. In this case, we only need one or a few relatively informed actors to heed the call to prevent proliferation even if it's short-term risky.

[-]Noosphere891y6-2

I don't think your scenario works, maybe because I don't believe that the world is as offense advantaged as you say.

I think the closest domain where things are this offense biased is the biotech domain, and while I do think biotech leading to doom is something we will eventually have to solve, I'm way less convinced of the assumption that every other domain is so offense advantaged that whoever goes first essentially wins the race.

That said, I'm worried about scenarios where we do solve alignment and get catastrophe anyways, though unlike your scenario, I e...

6Seth Herd1y

I agree entirely with the points made in that post. AGI will only "transform" the economy temporarily. It will very soon replace the economy. That is an entirely separate concern. If you don't think a multipolar scenario is as offense-advantaged as I've described, where do you think the argument breaks down? What defensive technologies are you envisioning that could counter the types of offensive strategies I've mentioned?

4Noosphere891y

Okay, I'm not sure the argument breaks down, but my crux is that everyone else probably has an AGI, and my issue is similar to Richard Ngo's issue with ARA: the people ordering ARA have far fewer resources to put into attack compared to the defense's capability, and real-life wars, while advantaged to the attacker, aren't so offense advantaged that defense is pointless: https://www.lesswrong.com/posts/xiRfJApXGDRsQBhvc/we-might-be-dropping-the-ball-on-autonomous-replication-and-1#hXwGKTEQzRAcRYYBF

9Seth Herd1y

The issue is that, if you can hide, you can amass resources exponentially once you hit self-replicating production facilities and fully recursively self-improving AGI. This almost completely shifts the logic of all previous conflicts. The comment you link seems to be addressing a very different scenario than my primary concern. It's addressing an attack from within human infrastructure, rather than outside. What I describe is often not considered, because it seems like the "far future" that we needn't worry about yet. But that far future seems realistically to be a handful of years past human-level AGI that starts to rapidly develop new technologies like the robotics needed for an autonomous self-replicating production in remote locations.

6Noosphere891y

Then it reduces to "I think the exponential growth of resources is available to both the attackers and defense, such that even while everything is changing, the relative standing of the attack/defense balance doesn't change." I think part of why I'm skeptical is the assumption that exponential growth is only useful for attack, or at least way more useful for attack, whereas I think exponentially growing resources by AI tech is way more symmetrical by default.

2Seth Herd11mo

Ah - now I see your point. This will help me clarify my concern in future presentations, so thanks! My concern is that a bad actor will be the first to go all-out exponential. Other, better humans in charge of AGI will be reluctant to turn the moon much less the earth into military/industrial production, and to upend the power structure of the world. The worst actors will, by default, be the first to go full exponential and ruthlessly offensive. Beyond that, I'm afraid the physics of the world does favor offense over defense. It's pretty easy to release a lot of energy where you want it, and very hard to build anything that can withstand a nuke let alone a nova. But the dynamics are more complex than that, of course. So I think the reality is unknown. My point is that this scenario deserves some more careful thought.
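The two positions in this exchange can be compared with a toy calculation. The sketch below assumes, purely for illustration, that both sides grow at the same exponential rate and differ only in when they start; the function name `resource_ratio` and all of its parameters are hypothetical, not figures from the thread.

```python
import math


def resource_ratio(a0, d0, rate, head_start, t):
    """Attacker/defender resource ratio at time t (for t >= head_start).

    The attacker grows as a0 * exp(rate * t). The defender begins the same
    exponential growth only after `head_start` time units, so it grows as
    d0 * exp(rate * (t - head_start)).
    """
    attacker = a0 * math.exp(rate * t)
    defender = d0 * math.exp(rate * (t - head_start))
    return attacker / defender
```

With `head_start=0` the ratio stays at `a0 / d0` forever, which is the symmetric picture: relative standing does not change. With a head start, the ratio is multiplied by the constant factor `exp(rate * head_start)`, so under these assumptions the size of the first mover's advantage depends on the growth rate times the delay, not on exponential growth as such. The model ignores everything else at issue in the thread, including whether offense is physically cheaper than defense.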

6Noosphere8911mo

Yeah, it does deserve more careful thought, especially since I expect almost all of my probability mass on catastrophe to be human caused, and more importantly I still think that it's an important enough problem that resources should go to thinking about it.

1otto.barten7mo

Offense/defense balance is such a giant crux for me. I would take quite different actions if I saw plausible arguments that defense will win over offense. I'm astonished that I don't know any literature on this. Large parts of the space seem to be quite strongly convinced that offense will win or defense will win (at least, else their actions don't make sense to me), but I've very rarely seen this assumption debated explicitly. It would really be very helpful if someone could point me to sources. Right now I have a twitter poll with 30 votes (result: offense wins) and an old LW post to go by.

[-]Charlie Steiner1y5-4

This strikes me as defining "alignment" a little differently than me.

It even might define "instruction-following" differently than me.

If we really solved instruction following, you could give the instruction "Do the right thing" and it would just do the right thing.

If that's possible, then what we need is a coalition to tell powerful AIs to "do the right thing", rather than "make my creators into god-emperors" or whatever. This seems doable, though the clock is perhaps ticking.

If you can't just tell an AI to do the right thing, but it's still competent...

6Seth Herd11mo

I actually completely agree with this call to action. Unfortunately, I suspect that it's impossible to make value alignment easier than personal intent alignment. I can't think of a technical alignment approach that couldn't be used both ways equally well. And worse than that, I think that intent aligned AGI is easier than value aligned AGI for reasons I outline in that post, and Max Harms has elaborated in much more detail in his Corrigibility as Singular Target sequence (as well as Paul Christiano and many others' arguments). But I still agree with your call to action: we should be working now to make value alignment as safe as possible. That requires deciding what we align to. The concept of humanity is not well-defined in the future, when upgrades and digital copies of human minds become possible. Roger Dearnaley's sequence AI, alignment, and ethics lays out these problems and more; for instance, if we stick to baseline humans, the future will be largely controlled by whatever values are held by the most humans, in a competition for memes and reproduction. So there's conceptual as well as technical/mind-design work to be done on technical alignment. And that work should be done. In multipolar scenarios, someone may well decide to "launch" their AGI to be autonomous with value alignment, out of magnanimity or desperation. We'd better make their odds of success as high as we can manage. I don't think refusing to work on intent alignment is a helpful option. It will likely be tried, with or without our help. Following instructions is the most obvious alignment target for any agent that's even approaching autonomy and therefore usefulness. Thinking about how to make those attempts successful will also increase our odds of surviving the first competent autonomous AGIs. WRT definitions: alignment doesn't specify alignment with whom. I think this ambiguity is causing important confusions in the field. I was trying to draw a distinction between two importantly di

4Noosphere8911mo

The problem is that "do the right thing" makes no sense without a reference to what values, or more formally what utility functions the human in question has, so there's no way to do what you propose to do even in theory, at least without strong assumptions on their values/utility functions. Also, it breaks corrigibility, and in many applications like military AI, this is a dangerous property to break, because you probably want to change their orders/actions, and this sort of anti-corrigibility is usually bad unless you're very confident value learning works, which I don't share.

2Charlie Steiner11mo

All language makes no sense without a method of interpretation. "Get me some coffee" is a horribly ambiguous instruction that any imagined assistant will have to cope with. How might an AI learn what "get me some coffee" entails without it being hardcoded in? To say it's impossible in theory is to set the bar so high that humans using language is also impossible. As for military use of AGI, I think I'm fine with breaking that application. If we can build AI that does good things when directed to (which can incorporate some parts of corrigibility, like not being overly dogmatic and soliciting a broad swath of human feedback), then we should. If we cannot build AI that actually does good things, we haven't solved alignment by my lights and building powerful AI is probably bad.

4Noosphere8911mo

I think the biggest difference I have here is that I don't think there is that much pressure to converge to a single value, or even that small of a space of values, at least in the multi-agent case, unlike in your communication examples, and I think the degrees of freedom for morality are pretty wide/large, unlike in the case of communication, where there is a way for even simple RL agents to converge on communication/language norms (at least in the non-adversarial case). At a meta level, I'm more skeptical than you seem to be of value learning, especially the ambitious variant of value learning, being a good first target, and think corrigibility/DWIMAC goals tend to be better than you think they are, primarily because I think the arguments for alignment dooming us have holes that make them not go through.

4Vladimir_Nesov11mo

Strong optimization doesn't need to ignore boundaries and tile the universe with optimal stuff according to its own aesthetics, disregarding the prior content of the universe (such as other people). The aesthetics can be about how the prior content is treated, the full trajectory it takes over time, rather than about what ends up happening after the tiling regardless of prior content. The value of respect for autonomy doesn't ask for values of others to converge, doesn't need to agree with them to be an ally. So that's an example of a good thing in a sense that isn't fragile.

4Seth Herd11mo

This is true; value alignment is quite possible. But if it's both harder/less safe, and people would rather align their godling with their own values/commands, I think we should either expect this or make very strong arguments against it.

5Vladimir_Nesov11mo

Respect for autonomy is not quite value alignment, just as corrigibility is not quite alignment. I'm pointing out that it might be possible to get a good outcome out of strong optimization without value alignment, because strong optimization can be sensitive to context of the past and so doesn't naturally result in a past-insensitive tiling of the universe according to its values. Mostly it's a thought experiment investigating some intuitions about what strong optimization has to be like, and thus importance and difficulty of targeting it precisely at particular values. Not being a likely outcome is a separate issue, for example I don't expect intent alignment in its undifferentiated form to remain secure enough to contain AI-originating agency. To the extent intent alignment grants arbitrary wishes, what I describe is an ingredient of a possible wish, one that's distinct from value alignment and sidesteps the question of "alignment to whom" in a way different from both CEV and corrigibility. It's not more clearly specified than CEV either, but it's distinct from it.

4Seth Herd11mo

In your use of respect for autonomy as a goal: are you referring to something like Empowerment is (almost) All We Need? I do find that to be an appealing alignment target. (I think I'm using alignment slightly more broadly, as in Hubinger's definition. I have a post in progress on the terminology of different alignment/goal targets and resulting confusions.) The problem with empowerment as an ASI goal is, once again: empowering whom? And do you empower them to make more like them that you then have to empower? Roger Dearnaley notes that if we empower everyone, humans will probably lose out to either something with less volition but using fewer resources, like insects, or something with more volition to empower, like other ASIs. Do we really want to limit the future to baseline humans? And how do we handle humans that want to create tons more humans? See 4. A Moral Case for Evolved-Sapience-Chauvinism and 5. Moral Value for Sentient Animals? Alas, Not Yet from Roger's AI, Alignment, and Ethics sequence. I actually do expect intent alignment to remain secure enough to contain AI-originating agency, as long as it's the primary goal or "singular target". It's counterintuitive that a superintelligent being could want nothing more than to do what its principal wants it to do, but I think it's coherent. And the more competent it gets, the better it will be at doing what you want and nothing more. Before it's that competent, the principal can give more careful instructions, including instructions to check before acting, and to help with its alignment in various ways. I agree that respect for autonomy/empowerment is one instruction/intent you could give. I do expect that someone will turn their intent-aligned AGI into an autonomous AGI at some point; hopefully after they're quite confident in its alignment and the worth of that goal.

2Vladimir_Nesov11mo

Respect for autonomy is not quite empowerment, it's more like being left alone. The use of this concept is more in defining what it means for an agent or a civilization to develop relatively undisturbed, without getting overwritten by external influence, not in considering ways of helping it develop. So it's also a building block for defining extrapolated volition, because that involves extended period of not getting destroyed by external influences. But it's conceptually prior to extrapolated volition, it doesn't depend on already knowing what it is, it's a simpler notion. It's not by itself a good singular target to set an AI to pursue, for example it doesn't protect humans from building more extinction-worthy AIs within their membranes, and doesn't facilitate any sort of empowerment. But it seems simple enough and agreeable as a universal norm to be a plausible aspect of many naturally developing AI goals, and it doesn't require absence of interaction, so allows empowerment etc. if that is also something others provide.

2Charlie Steiner11mo

Yeah, I agree with your first paragraph. But I think it's a difference of degree rather than kind. "Do the right thing" is still communication, it's just communication about something indirect, that we nonetheless should be picky about.

2Seth Herd11mo

I considered titling a different version of this post "we need to also solve the human alignment problem" or something similar.

-1Ann1y

Perhaps seemingly obvious, but given some of the reactions around Apple putting "Do not hallucinate" into the system prompt of its AI ... If you do get an instruction-following AI that you can simply give the instruction, "Do the right thing", and it would just do the right thing: Remember to give the instruction.

4Seth Herd11mo

You have to specify the right thing for whom. And the AGI won't know what it is for sure, in a realistic slow takeoff during the critical risk period. See my reply to Charlie above. But yes, using the AGI's intelligence to help you issue good instructions is definitely a good idea. See my Instruction-following AGI is easier and more likely than value aligned AGI for more logic on why.

-1Ann11mo

All non-omniscient agents make decisions with incomplete information. I don't think this will change at any level of takeoff.

4Seth Herd11mo

Sure, but my point here is that AGI will be only weakly superhuman during the critical risk period, so it will be highly uncertain, and human judgment is likely to continue to play a large role. Quite possibly to our detriment.

[-]faul_sname1y50

I think "pivotal act" is being used to mean both "gain affirmative control over the world forever" and "prevent any other AGI from gaining affirmative control of the world for the foreseeable future". The latter might be much easier than the former though.

[-]eggsyntax11mo*41

(Posting this initial comment without having read the whole thing because I won't have a chance to come back to it today; apologies if you address this later or if it's clearly addressed in a comment)

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving.

It seems worth spelling out your view here on how RSI-capable early AGI is likely to be. I would expect that early AGI will be capable of RSI in the weak sense of bein... (read more)

7Seth Herd11mo

Great point. I definitely mean fully capable of recursive self-improvement - that is, needing no humans in the loop. This lengthens the timelines to at least when we have roughly human-level robotics that are commercially available- but I expect that to be ten years or less. The hardware requirements for early AGI are another factor in the timeline before this RSI-catastrophe is possible. Let's remember that algorithmic progress is roughly as fast as hardware progress to date, so that will also cease to be a large limitation all too soon. The problem is that not having that scenario be immediately a risk may make people complacent about allowing lots of parahuman AGI before it becomes superhuman and fully RSI capable.

1eggsyntax11mo

Got it. I think I personally expect a period of at least 2-3 years when we have human-level AI (~'as good as or better than most humans at most tasks') but it's not capable of full RSI. It also seems plausible to me that strong RSI in the sense I use it above ('able to eg directly edit their own weights in ways that significantly improve their intelligence or other capabilities') may take a long time to develop or even require already-superhuman levels of intelligence. As a loose demonstration of that possibility, the best team of neurosurgeons etc in the world couldn't currently operate on someone's brain to give them greater intelligence, even if they had tools that let them precisely edit individual neurons and connections. I'm certainly not confident that's much too hard for human-level AI, but it seems plausible. That seems highly plausible to me too; my mainline guess is that by default, given human-level AI, it rapidly proliferates as replacement employees and for other purposes until either there's a sufficiently large catastrophe, or it improves to superhuman.

2Vladimir_Nesov11mo

The speed at which this kind of thing is possible is crucial, even if capabilities are not above human level. This speed can make planning of training runs less central to the bulk of worthwhile activities. With very high speed, much more theoretical research that doesn't require waiting for currently plannable training runs becomes useful, as well as things like rewriting all the software, even if models themselves can't be "manually" retrained as part of this process. Plausibly at some point in the theoretical research you unlock online learning, even the kind that involves gradually shifting to a different architecture, and the inconvenience of distinct training runs disappears. So this weak RSI would either need to involve AIs that can't autonomously research, but can help the researchers or engineers, or the AIs need to be sufficiently slow and non-superintelligent that they can't run through decades of research in months.

1eggsyntax11mo

It doesn't seem clear to me that this is the case; there isn't necessarily a faster way to precisely predict the behavior and capabilities of a new model than training it (other than crude measures like 'loss on next-token prediction continues to decrease as the following function of parameter count'). It does seem possible and even plausible, but I think our theoretical understanding would have to improve enormously in order to make large advances without empirical testing.

2Vladimir_Nesov11mo

I mean theoretical research on more general topics, not necessarily directly concerned with any given training run or even with AI. I'm considering the consequences of there being an AI that can do human level research in math and theoretical CS at much greater speed than humanity. It's not useful when it's slow, so that the next training run will make what little progress is feasible irrelevant, in the same way they don't currently train frontier models for 2 years, since a bigger training cluster will get online in 1 and then outrun the older run. But with sufficient speed, catching up on theory from distant future can become worthwhile.

3eggsyntax11mo

Oh, I see, I was definitely misreading you; thanks for the clarification!

[-]eggsyntax11mo40

If no pivotal act is performed, RSI-capable AGI proliferates

Minor suggestion: spell out 'recursive self-improvement (RSI)' the first time; it took me a minute to remember the acronym.

3Seth Herd11mo

Good idea, done.

[-]otto.barten7mo30

I think this is a crucial question that has been on my mind a lot, and I feel it's not adequately discussed in the xrisk community, so thanks for writing this!

While I'm interested in what people would do once they have an aligned ASI, what matters in the end is what labs would do, and what governments would do, because they are the ones who would make the call. Do we have any indications on that? What I would expect without thinking very deeply about it: labs wouldn't try to block others. It's risky, probably illegal and generally none of their business. T... (read more)

8Seth Herd7mo

Thanks!

1. I think hardware regulation has little chance of success because we're not doing it yet, I think we're only about one generation from big enough systems to train AGI-agent-capable LLMs, and algorithmic improvement has no obvious limits, so even current-gen systems can train AGI after some years of algorithmic improvements. Beyond that, I see absolutely no moves toward regulating hardware (in the West) - more like throwing money toward accelerating it.

2. There have been at least two nuclear close calls and perhaps a few more we don't know about. I'm not saying anyone is going to press the big world-ending button because somebody hacked and fried their AGI datacenter; I'm saying they might issue threats when it became clear that the US is taking control of the entire future by creating AGI and making sure no one can counter it by building their own. And I'm worried that those threats would be answered, and someone foolish might initiate a chain of hostilities that didn't stop in time. I hope this doesn't happen, and I mostly share your optimism that sanity would prevail. But we have had two human beings for whom protocol said to fire nukes and they each refused. I don't want to risk more people than that following their conscience instead of their orders; soldiers do terrible things including sacrificing their own lives pretty frequently.

1otto.barten7mo

1. I don't strongly disagree re architectures, but I do think we are uncertain about this. Depending on AGI architecture, different forms of regulation may or may not work. Work should be carried out to determine which regulation works for how many flops needed for takeover-level AI. That it's not happening yet is 1) no reason it won't (xrisk awareness is just too low, but slowly rising) and 2) equally applicable to the alternative you propose, universal surveillance. If we treat universal surveillance seriously, we should consider its downsides as well. First, there's no proof it would work: I'm not sure an AI, even a future one, would necessarily catch all actions towards building AGI. I have no idea what these actions are, and no idea which actions a surveillance AI with some real-world sensors can catch (or could be blocked etc.). I think we should not be more than 70% confident this would technically work. Second, currently we have power vacuums in the world, such as failed states, revolutions, criminal groups, or just instances where those in power are unable to project their power effectively. How would we apply universal surveillance to those power vacuums? Or do we assume they won't exist anymore, and if so, why is that assumption justified? Third, universal surveillance is arguably the world's least popular policy. It seems outright impossible to implement this in any democratic way. Perhaps the plan is to implement it by force through an AGI, then I would file it as a form of pivotal act. If we're anyway in pivotal act territory, I'd strongly prefer Yudkowsky's "subtly modifying all GPUs such that they can no longer train an AGI" (kind of hardware regulation, really) over universal surveillance. I think research is urgently required into how to implement a pause effectively. We have one report almost finished on the topic that mostly focuses on hardware regulation. PauseAI is working on a Building a pause button-project that is

[-]Dakara7mo*30

Edit: I hope that I am not cluttering the comments by asking these questions. I am hoping to create a separate post where I list all the problems that were raised for the scalable alignment proposal and all the proposed solutions to them. So far, everything you said not only seemed sensible, but also plausible, so I extremely value your feedback.

I have found some other concerns about scalable oversight/iterative alignment, that come from this post by Raemon. They are mostly about the organizational side of scalable oversight:

Moving slowly and carefully

... (read more)

3Dakara7mo

The main thing at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.

[-]Seth Herd7mo*124

Sorry it's taken me a while to get back to this.

No problem posting your questions here. I'm not sure of the best place but I don't think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.

I read that "Carefully Bootstrapped Alignment" is organizationally hard by Raemon and it did make some good points. Most of them I'd considered, but not all of them.

Pausing is hard. That's why my scenarios barely involve pauses and only address them because others see them as a possibility.

Basically I think we have to get alignment mostly right before it's time to pause, for the reasons he gives. I just think we might be able to do that. Language model agents are a really ideal alignment scenario, and instruction-following (IF) gives corrigibility for second chances when things start to go wrong. Asking an IF model about its alignment makes detecting misalignment easier, and re-aligning it is easy enough for the type of short pause that orgs and governments might actually do.

Moving slowly and carefully is hard too, and I don't expect it. I expect the default alignment techniques to work if even a little effort is put in to making them work. Current model... (read more)

3Dakara7mo

I've noticed that in your sentence about Max Harms's corrigibility plan there is an extra space after the parentheses which breaks the link formatting on my end. I tried marking it with "typo" emoji, but not sure if it is visible.

4Seth Herd7mo

Thanks, fixed it!

3Dakara7mo

Thank you for your response! It basically covers all of the five issues that I had in mind. It is definitely some food for thought, especially your disagreement with Eliezer. I am much more inclined to think you are correct because his activity has considerably died down (at least on LessWrong). I am really looking forward to your "A broader path: survival on the default fast timeline to AGI" post.

[-]Dakara8mo30

I've been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can't seem to find the link, so I am commenting here) and which I don't really understand is iterative alignment.

I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.

Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The pr... (read more)

4Seth Herd8mo

Thanks for reading, and responding! It's very helpful to know where my arguments cease being convincing or understandable. I fully agree that just having AI do the work of solving alignment is not a good or convincing plan. You need to know that AI is aligned to trust it. Perhaps the missing piece is that I think alignment is already solved for LLM agents. They don't work well, but they are quite eager to follow instructions. Adding more alignment methods as they improve makes good odds that our first capable/dangerous agents are also aligned. I listed some of the obvious and easy techniques we'll probably use in Internal independent review for language model agent alignment. I'm not happy with the clarity of that post, though, so I'm currently working on two followups that might be clearer. Or perhaps the missing link is going from aligned AI systems to aligned "Real AGI". I do think there's a discontinuity in alignment once a system starts to learn continuously and reflect on its beliefs (which change how its values/goals are interpreted). However, I think the techniques most likely to be used are probably adequate to make those systems aligned - IF that alignment is for following instructions, and the humans wisely instruct it to be honest about ways its alignment could fail. So that's how I get to the first aligned AGI at roughly human level or below. From there it seems easier, although still possible to fail. If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns. I usually think about the progression from AGI to superintelligence as one system/entity learning, being improved, and self-improving. But there's a good chance that progression will look more generational, with several distinct systems/entities as

3Dakara8mo

"Perhaps the missing piece is that I think alignment is already solved for LLM agents." Another concern that I might have is that maybe it only seems like alignment is solved for LLMs. For example, this, this, this and this short papers argue that seemingly secure LLMs may not be as safe as we initially believe. And it appears that they test even our models that are considered to be more secure and still find this issue.

2Seth Herd8mo

Ah, yes. That is quite a set of jailbreak techniques. When I say "alignment is solved for LLM agents", I mean something different than what people mean by alignment for LLMs themselves. I'm using alignment to mean AGI that does what its user wants. You are totally right that there's an edge case and a problem if the principal "user", the org that created this AGI, wants to sell access to others and have the AGI not follow all of those users' instructions/desires. Which is exactly what they'll want. More in the other comment. I haven't worked this through. Thanks for pointing it out. This might mean that an org that develops LLM-based AGI systems can't really widely license use of that system, and would have to design deliberately less capable systems. Or it might mean that they'll put in a bunch of stopgap jailbreak prevention measures and hope they're adequate when they won't be. I need to think more about this.

1Dakara8mo

This topic is quite interesting for me from the perspective of human survival, so if you do decide to make a post specifically about preventing jailbreaking, then please tag me (somewhere) so that I can read it.

1Dakara8mo

Looking more generally, there seems to be a ton of papers that develop sophisticated jailbreak attacks (that succeed against current models). Probably more than I can even list here. Are there any fundamentally new defense techniques that can protect LLMs against these attacks (since the existing ones seem to be insufficient)? EDIT: The concern behind this comment is better detailed in the next comment.

3Dakara8mo

I also have a more meta-level layman concern (sorry if it will sound unusual). There seem to be a large number of jailbreaking strategies that all succeed against current models. To mitigate them, I can conceptually see 2 paths: 1) trying to come up with a different niche technical solution to each and every one of them individually or 2) trying to come up with a fundamentally new framework that happens to avoid all of them collectively. Strategy 1 seems logistically impossible, as developers at leading labs (which are most likely to produce AGI) have to be aware of all of them (and they are often reported in relatively unknown papers). Furthermore, even if they somehow manage to monitor all reported jailbreaks, they would have to come up with so many different solutions that it seems very unlikely to succeed. Strategy 2 seems conceptually correct, but there seems to be no sign of it, as even newer models are getting jailbroken. What do you think?

6Noosphere898mo

Re jailbreaks, I think this is not an example of alignment not being solved, but rather an example of how easy it is to misuse/control LLMs. Also, a lot of the jailbreak successes rely on the fact that it's been trained to accept a very wide range of requests for deployment reasons, which suggests narrowing the domain of acceptable questions for internal use could reduce the space of jailbreaks dramatically:

2Dakara8mo

I have 3 other concrete concerns about this strategy. So if I understand it correctly, the plan is for humans to align AGI and then for that AGI to align AGI and so forth (until ASI).

1. What if the strategy breaks on the first step? What if the first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans? It seems like if we task such a deceptive AGI to align other AGIs, then we will end up with a pyramid of misaligned AGIs.

2. What if the strategy breaks later down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we can rely on?

3. What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if ASI doesn't agree and instead opts to continue self-improving? Are we going to be able to get to the point where the acceleration of ASI's intelligence plateaus and we can recuperate and plan for the future?

5Seth Herd8mo

1. We die (don't fuck this step up! :)
   1. Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment.

2. We die (don't let your AGI fuck this step up! :)
   1. 22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.

3. The endgame is to use intent alignment as a stepping-stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.

4Noosphere898mo

The first concern is absolutely critical, and one way to break the circularity issue is to rely on AI control, while another way is to place incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that you can't have a continuously rewarding path to misalignment. The second issue is less critical, assuming that AGI #21 hasn't itself become deceptively aligned, because at that point, we can throw away #22 and restart from a fresh training run. If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces. In particular, you can still do a whole lot of automated research once you break labor bottlenecks, and while this is a slowdown, this isn't fatal, so we can work around it. The third issue is that if we have achieved aligned ASI, then we have at that point achieved our goal, and once humans are obsolete in making alignment advances, that's when we can say the end goal has been achieved.

1Dakara8mo

That does indeed answer my 3 concerns (and Seth's answer does as well). Overnight, I came up with 1 more concern. What if AGI somewhere down the line undergoes value drift? After all, looking at evolution, it seems like our evolutionary goal was supposed to be "produce as many offspring as possible". And in recent years, we have strayed from this goal (and are currently much worse at it than our ancestors). Now, humans seem to have goals like "design a video game" or "settle in France" or "climb Everest". What if AGI similarly changes its goals and values over time? Is there a way to prevent that or at least be safeguarded against that? I am afraid that if that happens, humans would, metaphorically speaking, stand in AGI's way of climbing Everest.

3Noosphere898mo

The answer to this is that we'd rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.

1Dakara8mo

What would instrumental convergence mean in this case? I am not sure of what that means in this case.

2Noosphere898mo

In this case, it would mean the convergence to preserve your current values.

1Dakara8mo

Reading from the LessWrong wiki, it says "Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition". It seems like it preserves exactly the goals we wouldn't really need it to preserve (like resource acquisition). I am not sure how it would help us with preserving goals like ensuring humanity's prosperity, which seem to be non-fundamental.

3Noosphere898mo

Yes, I admittedly want to point to something along the lines of preserving your current values being a plausibly major drive of AIs.

1Dakara8mo

Ah, so you are basically saying that preserving current values is like a meta instrumental value for AGIs similar to self-preservation that is just kind of always there? I am not sure if I would agree with that (if I am correctly interpreting you) since, it seems like some philosophers are quite open to changing their current values.

2Noosphere898mo

Not always, but I'd say often. I'd also say that at least some of the justification for changing values in philosophers/humans is because they believe the new values are closer to the moral reality/truth, which is an instrumental incentive. To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following ala @Seth Herd is used instead, such that the pointer is to the human giving the instructions, rather than having values instead), but this is at least reasonably plausible IMO.

1Dakara7mo

I have found another possible concern of mine. Consider gravity on Earth, it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinite such theories and only one theory that gravity will work as an absolute rule. We might infer from the simplest explanation that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out whether an AI is following a misaligned rule compared to an aligned rule based on time- and situation-limited data. While it may be safe, for all practical purposes, to assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs, for the reason that the learning algorithms that are programmed into them can have complex unintended consequences for how the LLM will behave in the future, given the changing conditions an LLM finds itself in. Doesn't this mean that it is not possible to achieve alignment?

1Dakara7mo

I have posted this text as a standalone question here

1Dakara8mo

Fair enough. Would you expect that AI would also try to move its values to the moral reality? (something that's probably good for us, cause I wouldn't expect human extinction to be a morally good thing)

3Noosphere898mo

The problem with that plan is that there are too many valid moral realities, so which one you do get is once again a consequence of alignment efforts. To be clear, I'm not stating that it's hard to get the AI to value what we value, but it's not so brain-dead easy that we can make the AI find moral reality and then all will be well.

3Dakara8mo

Noosphere, I am really, really thankful for your responses. You completely answered almost all of the concerns that I had about alignment. (I am still not convinced about that strategy of avoiding value drift. I am probably going to post that one as a question to see if maybe other people have different strategies on preventing value drift.) This discussion significantly increased my knowledge. If I could triple upvote your answers, I would. Thank you! Thank you a lot!

1Dakara8mo

P.S. Here is the link to the question that I posted.

1Dakara8mo

Another concern that I could see with the plan. Step 1 is to create safe and aligned AI, but there are some results which suggest that even current AIs may not be as safe as we want them to be. For example, according to this article, current AI (specifically o1) can help novices build CBRN weapons and significantly increase threat to the world. Do you think this is concerning or do you think that this threat will not materialize?

2Noosphere898mo

The threat model is plausible enough that some political actions should be done, like banning open-source/open-weight models, and putting in basic Know Your Customer checks.

1Dakara8mo

Isn't it a bit too late for that? If o1 gets publicly released, then according to that article, we would have an expert-level consultant in bioweapons available for everyone. Or do you think that o1 won't be released?

3Noosphere898mo

I don't buy that o1 has actually given people expert-level bioweapons, so my actions here are more so about preparing for future AI that is very competent at bioweapon building. Also, even with the current level of jailbreak resistance/adversarial example resistance, assuming no open-weights/open sourcing of AI is achieved, we can still make AIs that are practically hard to misuse by the general public. See here for more: https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais

1Dakara8mo

After some thought, I think this is a potentially really large issue which I don't know how we can even begin to solve. We can have aligned AI, being aligned with someone who wants to create bioweapons. Is there anything being done (or anything that can be done) to prevent that?

3Noosphere898mo

The answer to this question is actually 2 things:

1. This is why I expect we will eventually have to fight to ban open-source, and we will have to get the political will to ban both open-source and open-weights AI.

2. This is where the unlearning field comes in. If we could make the AI unlearn knowledge, an example being nuclear weapons, we could possibly distribute AI safely without causing novices to create dangerous stuff. More here:

https://www.lesswrong.com/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms

https://www.lesswrong.com/posts/9AbYkAy8s9LvB7dT5/the-case-for-unlearning-that-removes-information-from-llm

But the solutions are intentionally going to make AI safe without relying on alignment.

1Dakara8mo

I agree with comments both by you and Seth. I guess that isn't really part of an alignment as usually understood. However, I think it is a part of a broad preventing AI from killing humans strategy, so it's still pretty important for our main goal. I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI labs will agree to do so, because such future AI is much less useful than probably even current AIs. Or are you proposing that we use AI monitors to monitor our leading future AI models and then we heavily restrict only the monitors?

4Noosphere898mo

My proposal is to restrain the AI monitor's domain only. I agree this is a reduction in capability from unconstrained AI, but at least in the internal use setting, as opposed to deploying the AI, you probably don't need, and maybe don't want, it to be able to write fictional stories or tell calming stories; you'd rather use the AI for specific employment tasks.

1Dakara8mo

That's pretty interesting, I do think that if the iterative alignment strategy ends up working, then this will probably end up working too (if nothing else, then because this seems much easier). I have some concerns left about the iterative alignment strategy in general, so I will try to write them down below. EDIT: On second thought, I might create a separate question for it (and link it here), for the benefit of all of the people who are concerned about the things (or similar things) that I am concerned about.

3Seth Herd8mo

That would be great. Do reference scalable oversight to show you've done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.

1Dakara8mo

Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy of iterative alignment anyways. I do have one preliminary question (which probably isn't worthy of being included in that post, given that it doesn't ask about a specific issue or threat model, but rather about expectations of people). I take it that this strategy relies on evaluation being easier than coming up with research? Do you expect this to be the case?

2Seth Herd8mo

This isn't something I've thought about adequately. I think LLM agents will almost universally include a whole different mechanism that can prevent jailbreaking: Internal independent review in which there are calls to a different model instance to check whether proposed plans and actions are safe (or waste time and money). Once agents can spend your people's money or damage their reputation, we'll want to have them "think through" the consequences of important plans and actions before they execute. As long as you're engineering that and paying the compute costs, you might as well use it to check for harms as well, including checking for jailbreaking. If that check finds evidence of jailbreaking, it can just clear the model context, call for human review from the org, or suspend that account. I don't know how adequate that will be, but it will help. This is probably worth thinking more about; I've sort of glossed over it while being concerned mostly about misalignment and misuse by fully authorized parties. But jailbreaking and misuse by clients could also be a major danger.
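
To make the review loop above concrete, here is a minimal sketch of how such a check might be wired up. Everything in it is an assumption for illustration: `AgentSession`, `Verdict`, `review_and_execute`, and `keyword_reviewer` are hypothetical names, and the keyword-matching reviewer is only a stand-in for the call to a separate model instance. The comment lists clearing context, human review, and suspension as alternatives; this sketch applies all three on a suspected jailbreak, which is just one possible policy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    SAFE = "safe"
    HARMFUL = "harmful"
    JAILBREAK = "jailbreak"


@dataclass
class AgentSession:
    """Hypothetical per-account agent state."""
    account_id: str
    context: list[str]
    suspended: bool = False
    needs_human_review: bool = False


# A reviewer sees the proposed action plus the agent's context and returns a verdict.
# In a real system this would be a call to a different model instance.
Reviewer = Callable[[str, list[str]], Verdict]


def review_and_execute(
    session: AgentSession,
    proposed_action: str,
    reviewer: Reviewer,
    execute: Callable[[str], None],
) -> bool:
    """Run the independent review before acting; return True if the action ran."""
    verdict = reviewer(proposed_action, session.context)
    if verdict is Verdict.SAFE:
        execute(proposed_action)
        session.context.append(f"executed: {proposed_action}")
        return True
    # Any non-safe verdict blocks the action and escalates to a human.
    session.needs_human_review = True
    if verdict is Verdict.JAILBREAK:
        # Assumed policy: wipe the possibly poisoned context and suspend the account.
        session.context.clear()
        session.suspended = True
    return False


def keyword_reviewer(action: str, context: list[str]) -> Verdict:
    """Toy stand-in for the reviewing model; real review would not be keyword-based."""
    text = " ".join(context + [action]).lower()
    if "ignore previous instructions" in text:
        return Verdict.JAILBREAK
    if "wire all funds" in action.lower():
        return Verdict.HARMFUL
    return Verdict.SAFE
```

The point of the sketch is only the control flow: the reviewer is independent of the acting agent, it runs before execution rather than after, and a jailbreak verdict changes the session state instead of merely refusing one action. How adequate any real reviewer would be is exactly the open question raised above.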

3Dakara8mo

"If you have an agent that's aligned and smarter than you, you can trust it to work on further alignment schemes. It's wiser to spot-check it, but the humans' job becomes making sure the existing AGI is truly aligned, and letting it do the work to align its successor, or keep itself aligned as it learns." Ah, that's the link that I was missing. Now it makes sense. You can use AGI as a reviewer for other AGIs, once it is better than humans at reviewing AGIs. Thank you a lot for clarifying!

4Seth Herd8mo

My pleasure. Evan Hubinger made this point to me when I'd misunderstood his scalable oversight proposal. Thanks again for engaging with my work!

Dakara11mo30

I wonder, have the comments managed to alleviate your concerns at all? Are there any promising ideas for multipolar AGI scenarios? Were there any suggestions that could work?

3Seth Herd11mo

Great question. I was thinking of adding an edit to the end of the post with conclusions based on the comments/discussion. Here's a draft: None of the suggestions in the comments seemed to me like workable ways to solve the problem. I think we could survive an n-way multipolar scenario if n is small - like a handful of ASIs controlled by a few different governments. But not indefinitely - unless those ASIs come up with coordination strategies no human has yet thought of (or argued convincingly enough that I've heard of it - this isn't really my area, but nobody has pointed to any strong possibilities in the comments). So my conclusion was more on the side that it's going to be so obviously such a bad/dangerous scenario that it won't be allowed to happen. Basically, the hope is that this all becomes viscerally obvious to the first people who speak with a superhuman AGI and who think about global politics. They'll pull their shit together, as humans sometimes do when they're motivated to actually solve hard problems. Here's one scenario in which multipolarity is stopped. Similar scenarios apply if the number of AGIs is small and people coordinate well enough to use their small group of AGIs similarly to what I'll describe below. The people who speak to the first AGI(s) and realize what must be done will include people in the government, because of course they'll be demanding to be included in decisions about using AGI. They'll talk sense to leadership, and the government will declare that this shit is deathly dangerous, and that nobody else should be building AGI. They'll call for a voluntary global moratorium on AGI projects. Realizing that this will be hugely unpopular, they'll promise that the existing AGI will be used to benefit the whole world. They'll then immediately deploy that AGI to identify and sabotage projects in other countries. If that's not adequate, they'll use minimal force. False-flag operations framing anti-AGI groups might be used to destr

RogerDearnaley2mo20

One of the more interesting strategic questions: of the current leading foundation model labs, which are run by leaders who (as far as we can tell from their publicly known actions and opinions) are clearly not psychopaths, narcissists, pathological liars, or political extremists, and do not otherwise have very concerning psychological tendencies or instabilities? This seems like a very important consideration for anyone considering working at any of these companies, and could turn out to be pivotal rather soon.

(Personally, I'm not aware of any significant conce...

James Stephen Brown9mo21

The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far- but with only nine participants to date.

4Seth Herd9mo

That's a good point that the nuclear detente might become stronger with more actors, because the certainty of mutual destruction goes up with more parties that might start shooting if you do. I don't think the coalition and treaties for counter-aggression are important with nukes; anyone can destroy everyone, they're just guaranteed to be mostly destroyed in response. The numbers don't matter much. And I think they'll matter even less with AGI than nukes - without the guarantee of mutually assured destruction, since AGI might allow for modes of attack that are more subtle. Re-introducing mutually assured destruction could actually be a workable strategy. I haven't thought of this before, so thanks for pushing my thoughts in that direction. I fully agree that non-iterated prisoner's dilemmas don't exist in the world as we know it now. And it's not a perfect fit for the scenario I'm describing- but it's frighteningly close. I use the term because it invokes the right intuitions among LWers, and it's not far off for the particular scenario I'm describing. That's because, unlike the nuclear standoff or any other historical scenario, the people in charge of powerful AGI could be reasonably certain they'd survive and prosper if they're the first to "defect". I'm pretty conscious of the benefits of cooperation in our modern world; they are huge. That type of nonzero sum game is the basis of the world we now experience. I'm worried that changes with RSI-capable AGI. My point is that AGI will not need cooperation once it passes a certain level of capability. AGI capable of fully autonomous recursive self-improvement and exponential production (factories that build new factories and other stuff) doesn't need allies because it can become arbitrarily smart and materially effective on its own. A human in charge of this force would be tempted to use it. (Such an AGI would still benefit from cooperation on the margin, but it would be vastly less dependent on it than humans a

1James Stephen Brown9mo

Hi Seth, I share your concern that AGI comes with the potential for a unilateral first strike capability that, at present, no nuclear power has (which is vital to the maintenance of MAD), though I think, in game theoretical terms, this becomes more difficult the more self-interested (in survival) players there are. As in open-source software, there is a level of protection against malicious code because bad players are outnumbered: even if they try to hide their code, there are many others who can find it. But I appreciate that 100s of coders finding malicious code within a single repository is much easier than finding something hidden in the real world, and I have to admit I'm not even sure how robust the open-source model is (I only know how it works in theory). I'm more pointing to the principle, not as an excuse for complacency but as a safety model on which to capitalise. My point about the UN's law against aggression wasn't that in and of itself it is a deterrent, only that it gives a permission structure for any party to legitimately retaliate. I also agree that RSI-capable AGI introduces a level of independence that we haven't seen before in a threat. And I do understand inter-dependence is a key driver of cooperation. Another driver is confidence, and my hope is that the more intelligent a system gets, the more confident it is and the better it is able to balance the autonomy of others with its goals, meaning it is able to "confide" in others, in the same way that the strongest kid in class was very rarely the bully, because they had nothing to prove. Collateral damage is still damage after all; a truly confident power doesn't need these sorts of inefficiencies. I stress this is a hope, and not a cause for complacency. I recognise that in analogy, the strongest kid, the true class alpha, gets whatever they want with the willing complicity of the classroom. RSI-capable AGI might get what it wants coercively in a way that makes us happy with our own subjugation

4Dakara8mo

James, thank you for a well-written comment. It was a pleasure to read. Looking forward to Seth's response. Genuinely interested in hearing his thoughts.

4Seth Herd8mo

Hey, thanks for the prompt! I had forgotten to get back to this thread. Now I've replied to James' comment, attempting to address the remaining difference in our predictions.

3Seth Herd8mo
