Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for about two decades; I studied complex human thought. Now I'm applying what I've learned to the study of AI alignment. 

Research overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way, possibly much to our chagrin. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could actually accomplish that for all of humanity. So I focus on finding alignment solutions.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function. I've focused on the emergent interactions that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.  

More on approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
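To make that concrete, here's a minimal sketch of the kind of system I mean: an LLM core wrapped with episodic memory (retrieval over saved episodes) and a simple executive loop. This is an illustrative toy under loose assumptions, not a real implementation - `llm` stands for any text-in/text-out callable, and the bag-of-words similarity is just a stand-in for a real embedding model or vector database.

```python
# Toy sketch of a language model cognitive architecture:
# LLM core + episodic memory (retrieval over saved episodes) + executive loop.
# Names and structure are illustrative only.
from dataclasses import dataclass, field
from typing import Callable, List
from collections import Counter
import math


def _similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a placeholder for a real embedding model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


@dataclass
class EpisodicMemory:
    """Stores past episodes as text and retrieves the most relevant ones."""
    episodes: List[str] = field(default_factory=list)

    def store(self, text: str) -> None:
        self.episodes.append(text)

    def recall(self, query: str, k: int = 3) -> List[str]:
        return sorted(self.episodes, key=lambda e: _similarity(query, e), reverse=True)[:k]


def executive_loop(goal: str, llm: Callable[[str], str], memory: EpisodicMemory, max_steps: int = 5) -> str:
    """Executive function: plan a step, act via the LLM, record the episode, repeat."""
    last_result = ""
    for step in range(max_steps):
        context = "\n".join(memory.recall(goal))
        prompt = (
            f"Goal: {goal}\n"
            f"Relevant past episodes:\n{context}\n"
            f"Previous result: {last_result}\n"
            "Next step:"
        )
        last_result = llm(prompt)
        memory.store(f"Step {step}: {last_result}")
        if "DONE" in last_result:  # the LLM signals completion in its own output
            break
    return last_result
```

The point of the sketch is the interaction: better memory makes the executive loop's prompts more useful, and a better executive loop produces more useful episodes to store.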

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are.  Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll do the obvious thing: design it to follow instructions. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done.  An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) - my estimate of the chance we don't survive long-term as a species - is in the 50% range: too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments


The important thing for alignment work isn't the median prediction; if we only had an alignment solution by the median date, we'd still have a 50% chance of AGI arriving before it, and of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs on AGI risk. The evidence for climate change was also pretty strong, but we saw half of America believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have a negative expected value, as the current lead decayed, and more people who believe in risks less get into the lead by circumventing the pause.

This makes me highly unsure if a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs, and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.

I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.

LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.

In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.

I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a big bot disaster more seriously.
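To make that kind of safeguard concrete, here's a toy sketch of a spending tracker that pauses the agent and checks with its user once a budget limit would be exceeded. The `BudgetGuard` name and interface are purely illustrative assumptions, not any existing agent framework's API.

```python
# Toy sketch of the safeguard described above: the agent tracks its own
# spending and defers to its human user once a budget threshold is hit.
from typing import Callable


class BudgetGuard:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def authorize(self, cost_usd: float, ask_human: Callable[[str], bool]) -> bool:
        """Return True if the action may proceed; otherwise defer to the human."""
        if self.spent_usd + cost_usd > self.limit_usd:
            return ask_human(
                f"Spending ${cost_usd:.2f} would exceed the ${self.limit_usd:.2f} limit. Proceed?"
            )
        self.spent_usd += cost_usd
        return True


# Example usage: check every money-spending action against the guard.
guard = BudgetGuard(limit_usd=100.0)
approved = guard.authorize(cost_usd=25.0, ask_human=lambda msg: input(msg + " (yes/no) ") == "yes")
```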

Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.

Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact so that improvements in each make the rest of the system work better and easier to improve.

 

 

  1.

    I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to come up with. It looked like the ethical/capitalist reasoning of a pretty intelligent person; but also a fairly ethical one.

I applaud the post. I think this is a step in the right direction of trying to consider the whole problem realistically. I think everyone working on alignment should have this written statement of a route to survival and flourishing; it would help us work collectively to improve our currently-vague thinking, and it would help better target the limited efforts we have to devote to alignment research.

My statement would be related but less ambitious. This post describes what we should do if the clarity and will existed; I really hope there's a viable route through the more limited things we realistically could do.

I'm afraid I find most of your proposed approaches to still fall short of being fully realistic given the inefficiencies of public debate and government decision-making. Fortunately I think there is a route to survival and autonomy that's a less narrow path; I have a draft in progress on my proposed plan, working title "a coordination-free plan for human survival". Getting the US gov't to nationalize AGI labs does sound plausible, but unlikely to happen until they've already produced human-level AGI or are near to doing so.

I think my proposed path is different based on two cruxes about AGI, and probably some others about how government decision-making works. My timelines include fairly short timelines, like the 3 years Aschenbrenner and some other OpenAI insiders hold. And I see general AGI as actually being easier than superhuman tool AI; once real continuous, self-directed learning is added to foundation model agents (it already exists but needs improvement), having an agent learn with human help becomes an enormous advantage over human-designed tool AI.

There's much more to say, but I'm out of time so I'll say that and hope to come back with more detail.

Great post and great points.

Alignment researchers usually don't think of their work as a means to control AGI. They should.

We usually think of alignment as a means to create a benevolent superintelligence. But just about any workable technique for creating a value-aligned AGI will work even better for creating an intent-aligned AGI that follows instructions. Keeping a human in the loop and in charge bypasses several of the most severe Lethalities by effectively adding corrigibility. What human in control of a major AGI project would take an extra risk to benefit all of humanity instead of ensuring that the AGI will follow their values by following their instructions?

That sets the stage for even more power-hungry humans to seize control of projects and AGIs with the potential for superintelligence. I fully agree that there's a scary first-mover advantage benefitting the most vicious actors in a multipolar human-controlled AGI scenario; see If we solve alignment, do we die anyway?

The result is a permanent dictatorship. Will the dictator slowly get more benevolent once they have absolute power? The pursuit of power seems to corrupt more than having secure power, so maybe - but I would not want to bet on it.

However, I'm not so sure about hiding alignment techniques. I think the alternative to human-controllable AGI isn't really slower progress, it's uncontrollable AGI - which will pursue its own weird ends and wipe out humanity in the process, for the reasons classical alignment thinking describes.

Takes on a few more important questions:

Should safety-focused people support the advancement of FMA capabilities?

Probably. The advantages of a system without goal-directed RL (RL is used, but only to get the "oracle" to answer questions as the user intended them) and with a legible train-of-thought seem immense. I don't see how we close the floodgates of AGI development now. Given that we're getting AGI, it really seems like our best bet is FMA AGI.

But I'm not ready to help anyone develop AGI until this route to alignment and survival has been more thoroughly worked through in the abstract. I really wish more alignment skeptics would engage with specific plans instead of just pointing to general arguments about how alignment would be difficult, some of which don't apply to the ways we'd really align FMAs (see my other comment on this post). We may be getting close; Shut It All Down isn't a viable option AFAICT, so we need to get together our best shot.

  1. Will the first transformative AIs be FMAs?

Probably, but not certainly. I'd be very curious to get a survey of people who've really thought about this. Those who are sure they won't tend to give reasons I find highly dubious. At the least it seems likely enough that we should be thinking about aligning them in more detail, because we can see their general shape better than that of other possible first AGIs.

  2. Will narrow FMAs for a variety of specific domains be transformatively useful before we get transformatively useful general FMAs?

No. There are advantages to creating FMAs for specific domains, but there are also very large advantages to working on general reasoning. Humans are not limited to narrow domains, but can learn just about anything through instruction or self-instruction. Language models trained on human "thought" can do the same as soon as they have any sort of useful persistent memory. Existing memory systems don't work well, but they will be improved, probably rapidly.

  3. If FMAs are the first transformative AIs (TAIs), how long will FMAs remain the leading paradigm?

This is a really important question. I really hope they remain the leading paradigm long enough to become useful in aligning other types of AGI. And that they remain sufficiently free of goal-directed RL to remain alignable.

Great post! These are the questions that keep me up at night. Here is my provisional answer to the central, most important question:

  1. Is Chain of Thought faithful enough in the relevant sense for FMA safety to be attainable through natural language alignment?

Maybe, but it doesn't have to be. The nice thing about foundation model agents is that there are several different alignment techniques that are cheap, easy, and so obvious that they'll probably be implemented, even if the org in question isn't terribly safety-oriented. I wrote about these a while ago in the ill-titled Internal independent review for language model agent alignment, which focuses on System 2 internal review (in Shane Legg's better terminology) but also lists several other approaches that would easily be layered with it. 

I need to do a clearer and simpler rewrite that surveys all of these. Here's the table from my draft post. CoT legibility is only relevant for one of these six approaches. Sorry for the missing links; I'm short on time and the draft is incomplete.

| Technique | Example | Alignment tax |
| --- | --- | --- |
| Goal prompting | "Keep pursuing goal X..." (repeated frequently) | Negligible |
| Identity prompting | "...acting as a helpful, cautious assistant" | Negligible |
| Internal System 2 action review | "Does this proposed action/plan potentially violate (conditions)?" ... (further evaluations for actions/plans predicted to be highly impactful) | Low (if review is also implemented for costs and effectiveness) |
| Human action review | Wait for human review if this action/plan would cost more than $(x) or impact more than (y) days of human happiness (application of instruction-following goal) | High, but saves costs and reputational damage |
| External CoT review | Human and external AI review of chain-of-thought log | Modest for AI, high if reliant on human review |
| "Bitter lesson" synthetic data training set | Curated training set for decision-making LLM leaving out hostile/misaligned "thoughts" | High, but modest if synthetic data is a key approach for next-gen LLMs |
| Instruction-following as core goal | "Keep following all of the instructions from your authorized user, consulting them when instructions might be interpreted in substantially different ways" | Low if consulting is scaled to only impactful choices |

 

So we don't need CoT to be perfectly faithful to succeed - but we'd sure be safer if it were.
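As a rough illustration of the "Internal System 2 action review" row in the table above, here's a toy sketch: before executing a proposed action, the agent queries its own model about whether the action violates stated conditions, and escalates high-impact actions to human review. The prompts and the `llm` callable are illustrative assumptions, not a worked-out implementation.

```python
# Toy sketch of internal System 2 action review: the agent's proposed actions
# are checked against stated conditions before execution.
from typing import Callable


def review_action(action: str, conditions: str, llm: Callable[[str], str]) -> str:
    """Return 'execute', 'revise', or 'escalate' for a proposed action."""
    verdict = llm(
        f"Proposed action: {action}\n"
        f"Does this action potentially violate any of these conditions?\n{conditions}\n"
        "Answer YES or NO, then estimate whether the action is HIGH or LOW impact."
    )
    if "YES" in verdict.upper():
        return "revise"      # send the action back to the planner
    if "HIGH" in verdict.upper():
        return "escalate"    # hand off to human review (the next row of the table)
    return "execute"
```

The alignment tax here is low because the same review pass can also check costs and effectiveness, which the developer wants anyway.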

Back to the original question of CoT faithfulness: the case for CoT unfaithfulness is overstated currently, but if we adopt more outcome-driven RL, or even fine-tuning, it could easily become highly unfaithful. So people shouldn't do that. If they do, I think the remaining easy techniques might be adequate - but I'd rather not gamble the future of humanity on them.

There are many other important questions here, but I'll stick to this one for now.

Okay; so what's the reality about the people we're thinking of when we say psychopathic? The term seems to still be in use among some professionals, for bad or good reasons.

A garbage bin diagnosis seems like a step down if psychopathy or sociopathy was pointing to a more specific set of attitudes and tendencies.

Endgame strategies from whom?

A lot of powerful people would focus on being the ones to control it when it happens, so they'd control the future - and not be subject to someone else's control of the future. OpenPhil is about the only org that would think first of the public benefit and not the dangers of other humans controlling it. And not a terribly powerful org, particularly relative to governments.

Oh, dang, I thought you were posing the traditional Fermi Paradox question with the variant of AGIs: why haven't other civilizations created AGIs that have then spread far enough to reach Earth and pay us a visit of some sort?

The answer to your question of "why haven't humans created AGI yet" is very clearly just that humans haven't yet been able to create AGI - we haven't programmed it yet, and computers aren't fast enough to easily create AGI (I agree that they're probably fast enough to run an optimally efficient AGI already, but that's irrelevant here).

Intelligence is seeming easier than many thought, but it's not dead simple, so we're not there yet. We're hard at work on it. This community in particular is keeping very close tabs on progress, including any possibilities of hidden AGI projects. We don't agree on timelines, but almost anyone who's actually taken the time to understand progress in AI would agree that AGI almost certainly doesn't exist yet. We're just now at the point where AGI seems possible within a few years or with as few as one more breakthrough.

A breakthrough in AI happened recently, the Transformer architecture that powers large language models and most other cutting-edge AI. That was published publicly when it was invented. It's highly unlikely that similar breakthroughs happened in parallel in secret (and it's even more unlikely that a breakthrough happened by accident; that has never really been a source of technological progress). The main reason we think secret breakthroughs are unlikely is that it's taken a lot of expertise and compute power to advance AI as far as it's gotten in public. Having that much compute and that much expertise working in secret would be nearly impossible at this point.

Going forward, this starts to become possible, but it's going to be hard to hide the work of a group of people large enough and with adequate collective expertise to advance past the public state of the art.

The more interesting question to me is why we haven't yet seen alien AGI spread through our space. There are roughly three types of possible answer. The first is that it hasn't reached us yet because civilization is rare - the "rare earth" hypothesis for the standard Fermi paradox. (Or technically it could be because AGI is really hard to make, but that seems highly unlikely given how easily we've made systems that seem nearly human-level.) The second class of answer is that it's here but hidden for some reason. It might be "hiding" from possible hostile AGI while watching for signs of it - including civilizations creating hostile AGI. The third class of possibility is that the question is ill-posed for some other reason; for instance, we're in a simulation of some sort, so other civilizations evolving isn't actually possible in this world.
