Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades; I studied complex human thought. Now I'm applying what I've learned to the study of AI alignment. 

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way, possibly much to our chagrin. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could actually accomplish that for all of humanity. So I focus on finding alignment solutions.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I've focused on the emergent interactions that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
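To make that concrete, here's a rough, purely illustrative Python sketch of the shape of such a system - an episodic memory store plus an executive loop wrapped around a base LLM. `call_llm` and the toy word-overlap retrieval are stand-ins for a real model API and real vector search; none of this is code from an actual project.

```python
# Purely illustrative sketch: a base LLM wrapped with an episodic memory
# store and a simple executive loop. `call_llm` is a placeholder for a real
# model API; word-overlap retrieval stands in for real vector search.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    return f"[model response to: {prompt[:60]}]"

class EpisodicMemory:
    """Saves past episodes as text and retrieves the most relevant ones."""
    def __init__(self):
        self.episodes = []

    def store(self, text):
        self.episodes.append(text)

    def retrieve(self, query, k=3):
        # Toy relevance score: word overlap with the query.
        query_words = set(query.lower().split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(query_words & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

class Executive:
    """Outer script: breaks a goal into steps, consulting memory at each step."""
    def __init__(self, memory):
        self.memory = memory

    def run(self, goal, max_steps=3):
        plan = call_llm(f"Break this goal into {max_steps} steps: {goal}")
        for step in range(1, max_steps + 1):
            context = "\n".join(self.memory.retrieve(goal))
            result = call_llm(f"Goal: {goal}\nPlan: {plan}\n"
                              f"Relevant memories:\n{context}\n"
                              f"Carry out step {step} and report the result.")
            self.memory.store(f"Step {step} of '{goal}': {result}")

agent = Executive(EpisodicMemory())
agent.run("summarize what we learned about topic X last week")
```

The point is only the structure: the outer loop supplies the planning and self-monitoring the base model lacks, and the memory lets earlier steps inform later ones.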

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (avoiding value alignment mis-specification), we'll do the obvious thing: design it to follow instructions. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) - the chance we don't survive long-term as a species - is in the 50% range; the outcome is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments


The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd still face roughly a 50% chance that AGI arrives earlier, before that solution exists.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's Law (everything takes longer than you expect) applies to AGI development even if not to compute progress.

Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's estimates have much wider distributions.

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decayed, and more people who believe in risks less get into the lead by circumventing the pause.

This makes me highly unsure whether a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the alignment taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.

I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.

LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.

In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.

I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending, stop if it hits $X, and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get cleverer, people will take the possibility of a big bot disaster more seriously.
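Here's a toy sketch of the kind of safeguard I mean - nothing from a real agent framework, just a spending cap that stops the agent and checks with its user before any action that would exceed the budget (the names are hypothetical).

```python
# Toy sketch of that kind of safeguard: a spending cap that stops the agent
# and checks with its user before any action that would exceed the budget.
# Hypothetical names; not from any real agent framework.

class BudgetGuard:
    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def approve(self, cost_usd, description):
        """Return True if the action may proceed, asking the user if needed."""
        if self.spent + cost_usd <= self.limit:
            self.spent += cost_usd
            return True
        answer = input(f"'{description}' (${cost_usd:.2f}) would exceed the "
                       f"${self.limit:.2f} budget. Approve anyway? [y/N] ")
        if answer.strip().lower() == "y":
            self.spent += cost_usd  # user approved this overrun
            return True
        return False

guard = BudgetGuard(limit_usd=100.0)
if guard.approve(30.0, "buy ad placement"):
    pass  # the agent would take the approved action here
```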

Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.

Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, and faster, than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.

 

 

  1. ^

    I did a little informal testing, asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to find. It looked like the ethical and business reasoning of a pretty intelligent, and fairly ethical, person.

One factor is the different incentives for decision-makers. The incentives (and the mindset) for tech companies are to move fast and break things. The incentives (and mindset) for government workers are usually vastly more conservative.

So if it is the government making decisions about when to test and deploy new systems, I think we're probably far better off WRT caution.

That must be weighed against the government typically being very bad at technical matters. So even an attempt to be cautious could be thwarted by lack of technical understanding of risks.

Of course, the Trump administration is attempting to instill a vastly different mindset, more like tech companies. So if it's that administration we're talking about, we're probably worse off on net with a combination of lack of knowledge and YOLO attitudes. Which is unfortunate - because this is likely to happen anyway.

As Habryka and others have noted, it also depends on whether it reduces race dynamics by aggregating efforts across companies, or mostly just throws funding fuel on the race fire.

It's not every post, but there are still a lot of people who think that alignment is very hard.

The more common assumption is that we should assume that alignment isn't trivial, because an intellectually honest assessment of the range of opinions suggests that we collectively do not yet know how hard alignment will be.

This is really important pushback. This is the discussion we need to be having.

Most people who are trying to track this believe China has not been racing toward AGI up to this point. Whether they embark on that race is probably being determined now - and based in no small part on the US's perceived attitude and intentions.

Any calls for racing toward AGI should be closely accompanied with "and of course we'd use it to benefit the entire world, sharing the rapidly growing pie". If our intentions are hostile, foreign powers have little choice but to race us.

And we should not be so confident we will remain ahead if we do race. There are many routes to progress other than sheer scale of pretraining. The release of DeepSeek r1 today indicates that China is not so far behind. Let's remember that while the US "won" the race for nukes, our primary rival had nukes very soon after - by stealing our advancements. A standoff between AGI-armed US and China could be disastrous - or navigated successfully if we take the right tone and prevent further proliferation (I shudder to think of Putin controlling an AGI, or many potentially unstable actors).

This discussion is important, so it needs to be better. This pushback is itself badly flawed. In calling out the report's lack of references, it provides almost none itself. Citing a 2017 official statement from China seems utterly irrelevant to guessing their current, privately held position. Almost everyone has updated massively since 2017. If China is "racing toward AGI" as an internal policy, they probably would've adopted that recently. (I doubt that they are racing yet, but it seems entirely possible they'll start now in response to the US push to do so - and to their perspective on the US as a dangerous aggressor on the world stage. But what do I know - we need real experts on China and international relations.)

Pointing out the technical errors in the report seems somewhere between irrelevant and harmful. You can understand very little of the details and still understand that AGI would be a big, big deal if true - and that the many experts predicting short timelines could be right. Nitpicking the technical expertise of people who are probably essentially correct in their assessment just sets a bad tone of fighting/arguing instead of having a sensible discussion.

And we desperately need a sensible discussion on this topic.

Answer by Seth Herd

I gave part of my answer in the thread where you first asked this question. Here's the rest.

TLDR: Value alignment is too hard even without the value stability problem. Goal-misspecification is too likely (I realize I don't know the best ref for this other than LoL - anyone else have a better central ref?). Therefore we'll very likely align our first AGIs to follow instructions, and use that as a stepping-stone to full value alignment.

This is something I used to worry about a lot. Now it's something I don't worry about at all.

I wrote a paper on this, Goal changes in intelligent agents, back in 2018 for a small FLI grant (in perhaps the first round of public funding for AGI x-risk). One of my first posts on LW was The alignment stability problem.

I still think this would be a very challenging problem if we were designing a value-aligned autonomous AGI. Now I don't think we're going to do that. 

I now see goal mis-specification as a very hard problem, and one we don't need to tackle to create autonomous AGI or even superintelligence. Therefore I think we won't.

Instead we'll make the central goal of our first AGIs to follow instructions or to be corrigible (correctable).

It's counterintuitive to think of a highly intelligent and fully autonomous being that wants more than anything to do what a less intelligent human tells it to do. But I think it's completely possible, and a much safer option for our first AGIs.

This is much simpler than trying to instill our values with such accuracy that we'd be happy with the result. Neither showing examples of things we like (as in RL training) nor explicitly stating our values in natural language seems likely to be accurate enough after it's been interpreted by a superintelligent AGI that is likely to see the world at least somewhat differently than we do. That sort of re-interpretation is functionally similar to value drift, although it's separable. Adding the problem of actual value drift on top of the dangers of goal misspecification just makes things worse.

Aligning an AGI to follow instructions isn't trivial either, but it's a lot easier to specify than getting values right and stable. For instance, LLMs already largely "know" what people tend to mean by instructions - and that's before the checking phase of do what I mean and check (DWIMAC). 
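A rough sketch of what that checking phase could look like as a loop - `call_llm` is again just a placeholder for a real model API, and this is an illustration of the idea, not a real implementation:

```python
# Rough sketch of a "do what I mean and check" (DWIMAC) loop: the agent
# restates its interpretation and plan, then waits for confirmation before
# acting. `call_llm` is again a placeholder for a real model API.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    return f"[model response to: {prompt[:60]}]"

def dwimac(instruction, max_revisions=3):
    """Return an approved plan, or None if the user never approves one."""
    for _ in range(max_revisions):
        interpretation = call_llm(
            f"Restate what the user most likely means by: {instruction}")
        plan = call_llm(f"Propose a plan to carry out: {interpretation}")
        print(f"I understood: {interpretation}")
        print(f"I plan to: {plan}")
        feedback = input("Proceed? (yes, or describe what to change) ")
        if feedback.strip().lower() == "yes":
            return plan  # only now would the agent actually act
        instruction = f"{instruction}\nClarification from user: {feedback}"
    return None
```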

Primarily, though, instruction-following has the enormous advantage of allowing for corrigibility - you can tell your AGI to shut down to accept changes, or issue new revised instructions if/when you realize (likely because you asked the AGI) that your instructions would be interpreted differently than you'd like.

If that works and we get superhuman AGI aligned to follow instructions, we'll probably want to use that AGI to help us solve the problem of full value alignment, including solving value drift. We won't want to launch an autonomous AGI that's not corrigible/instruction-following until we're really sure our AGIs have a sure solution. (This is assuming we have those AGIs controlled by humans who are ethical enough to release control of the future into better hands once they're available - a big if).

This might work. Let's remember the financial incentives. Exposing a non-aligned CoT to all users is pretty likely to generate lots of articles about how your AI is super creepy, which will create a public perception that your AI in particular is not trustworthy relative to your competition.

I agree that exposing the CoT would be better from an alignment perspective; I'm just noting the incentives on AI companies.

  1. We die (don't fuck this step up!:)
    1. Unless we still have adequate mech interp or natural language train of thought to detect deceptive alignment
  2. We die (don't let your AGI fuck this step up!:)
    1. 22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but having one trusted agent that keeps itself aligned as it grows smarter.
  3. the endgame is to use Intent alignment as a stepping-stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.

I agree with all of those points locally.

To the extent people are worried about LLM scaleups taking over, I don't think they should be.

We will get nice instruction-following tool AIs.

But the first thing we'll do with those tool AIs is turn them into agentic AGIs. To accomplish any medium-horizon goals, let alone the long-horizon ones we really want help with, they'll need to do some sort of continuous learning, make plans (including subgoals), and reason in novel sub-domains.

None of those things are particularly hard to add. So we'll add them. (Work is underway on all of those capacities in different LLM agent projects).

Then we have the risks of aligning real AGI.

That's why this post was valuable. It goes into detail on why and how we'll add the capacities that will make LLM agents much more useful but also add the ability and instrumental motivation to do real scheming.

I wrote a similar post to the one you mention, Cruxes of disagreement on alignment difficulty. I think understanding the wildly different positions on AGI x-risk among different experts is critical; we clearly don't have a firm grasp on the issue, and we need it ASAP. The above is my read on why TurnTrout, Pope and co are so optimistic - they're addressing powerful tool AI, and not the question of whether we develop real AGI or how easy that will be to align.

FWIW I do think that can be accomplished (as sketched out in posts linked from my user profile summary), but it's nothing like easy or default alignment, as current systems and their scaleups are.

I'll read and comment on your take on the issue.
