LESSWRONG
LW

Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it increasingly is designed to "think for itself" in all the ways that make humans capable and dangerous.

New to alignment? See the Research Overview section
Field veteran? See More on My Approach at the end.

I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal and field sociological issues, because choosing the best technical research approach depends on all of those.

Principal articles:

On technical alignment of LLM-based AGI agents:
- Broadening the training set for alignment
  - Training directly on alignment-critical decisions could help
- LLM AGI may reason about its goals and discover misalignments by default
  - An LLM-centric lens on why aligning Real AGI is hard
- System 2 Alignment: Deliberation, Review, and Thought Management
  - Likely approaches for LLM AGI on the current trajectory
- Internal independent review for language model agent alignment
  - Updated in System 2 alignment
On LLM-based agents as a route to takeover-capable AGI
- LLM AGI will have memory, and memory changes alignment
- Brief argument for short timelines being quite possible
- Capabilities and alignment of LLM cognitive architectures
  - Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
AGI risk interactions with societal power structures and incentives:
- A country of alien idiots in a datacenter: AI progress and public alarm
  - The default path includes public alarm before AGI
- Whether governments will control AGI is important and neglected
- If we solve alignment, do we die anyway?
  - Risks of proliferating human-controlled AGI
- Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
On the psychology of alignment as a field:
- Cruxes of disagreement on alignment difficulty
- Motivated reasoning/confirmation bias as the most important cognitive bias
On AGI alignment targets assuming technical alignment
On communicating AGI risks:

Research Overview:

Alignment is the study of how to align the goals of advanced AI with the goals of humanity, so we're not in competition with our own creations. This is tricky because we are creating AI by training it, not programming it. So it's a bit like trying to train a dog to eventually run the world. It might work, but wouldn't want to just hope.

Large language models like ChatGPT constitute a breakthrough in AI. We might have AIs more competent than humans in every way, fairly soon. Such AI will outcompete us quickly or slowly. We can't expect to stay around long unless we carefully build AI so that it cares a lot about our well-being or at least our instructions. See this excellent intro video if you're not familiar with the alignment problem.

There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.

More on My Approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.

My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions. Competition and race dynamics make the probem much harder, and conflicting incentives and group polarization create motivated reasoning that distorts beliefs.

I think it's fairly likely that alignment isn't impossibly hard but also not easy enough that developers get it right on their own despite all of their biases and incentives. so a little work in advance, from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).

One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.

It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant Problems with instruction-following as an alignment target. It does not solve the problem with corrigibility once an AGI has left our control, merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.

I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate of our long-term survival as a species is in the 50% range, too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios.

I think we need to resist motivated reasoning and accept the uncomfortable truth that we collectively don't understand the alignment problem as we actually face it well enough yet. But we might understand it well enough in time if we work together and strategically.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by

Newest

Seth Herd — LessWrong

Seth Herd

Message me here or at seth dot herd at gmail dot com.

New to alignment? See the Research Overview section
Field veteran? See More on My Approach at the end.

Principal articles:

On technical alignment of LLM-based AGI agents:
- Broadening the training set for alignment
  - Training directly on alignment-critical decisions could help
- LLM AGI may reason about its goals and discover misalignments by default
  - An LLM-centric lens on why aligning Real AGI is hard
- System 2 Alignment: Deliberation, Review, and Thought Management
  - Likely approaches for LLM AGI on the current trajectory
- Internal independent review for language model agent alignment
  - Updated in System 2 alignment
On LLM-based agents as a route to takeover-capable AGI
- LLM AGI will have memory, and memory changes alignment
- Brief argument for short timelines being quite possible
- Capabilities and alignment of LLM cognitive architectures
  - Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
AGI risk interactions with societal power structures and incentives:
- A country of alien idiots in a datacenter: AI progress and public alarm
  - The default path includes public alarm before AGI
- Whether governments will control AGI is important and neglected
- If we solve alignment, do we die anyway?
  - Risks of proliferating human-controlled AGI
- Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
On the psychology of alignment as a field:
- Cruxes of disagreement on alignment difficulty
- Motivated reasoning/confirmation bias as the most important cognitive bias
On AGI alignment targets assuming technical alignment
On communicating AGI risks:

Research Overview:

Bio

More on My Approach

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

Posts

Sorted by New

40Broadening the training set for alignment

1mo

92A country of alien idiots in a datacenter: AI progress and public alarm

3mo

75LLM AGI may reason about its goals and discover misalignments by default

5mo

56Problems with instruction-following as an alignment target

9mo

35Anthropomorphizing AI might be good, actually

9mo

77LLM AGI will have memory, and memory changes alignment

10mo

29Whether governments will control AGI is important and neglected

11mo

37Will LLM agents become the first takeover-capable AGIs?

34OpenAI releases GPT-4.5

39System 2 Alignment: Deliberation, Review, and Thought Management

Wikitag Contributions

Comments

Sorted by

Newest

AI Timelines

Seth Herd2y*94

The important thing for alignment work isn't the median prediction; if we had an alignment solution just by then, we'd have a 50% chance of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.

Strategy of von Neumann and strategy of Rosenbergs

Seth Herd4d42

I think this analysis is assuming that AGI proliferation would produce mutually assured destruction like nuclear weapons did.

I think it's somewhat more likely that at least in the early stages of AGI proliferation, the first party that attacks can both survive and eliminate their enemies.

This seems to be worth more thought. Maybe a lot more thought.

Episodic memory in AI agents poses new safety risks

Seth Herd6d41

I focus on AGI alignment, in contrast to your focus on near term alignment, but there's a lot of overlap.

Increasing AI Strategic Competence as a Safety Approach

Seth Herd8d70

It seems like this doesn't even take a lot of strategic competence (although that would be a nice addition).

It seems like all this takes is better general reasoning abilities. And that might not be too much to hope for.

Imagine someone asks GPT6 "Should we slow down progress toward AGI?" it might very well answer something like:

It looks like people with the highest level of relevant expertise disagree dramatically on whether we're on track to solve alignment. So yes you should probably slow down unless you want to take a big risk in order to reach breakthroughs faster.
Assumptions and logic:
I'm assuming that people who have thought about this the most, including surrounding relevant expertise, are a lot more likely to be right. This is how things work in pretty much every other domain we can get answers about. Most of the people who think it's very likely to succeed seem to not have much expertise, or be highly motivated to think it will succeed, or both. So it looks like if you don't want to take at least a 20% chance of everyone dying or an otherwise really bad future getting locked in, yes you should probably slow down.

I can do a bunch more web searches and write up a very lengthy report if you like, but based on my knowledge base and thinking about it for a few seconds, it seems like the conclusion is very likely to be along these lines. Different assumptions would probably just shift the odds of disastrous outcomes from 20-80% (there are so many unknowns I doubt I'd go farther toward the extremes under any realistic assumptions)
Would you like help composing a letter to your representative or your loved ones?

I'm writing a post now on how emulating human-like metacognition might be a route to better and more reliable reasoning. Abram Demski has also done some work on reducing slop from LLMs with similar goals in mind (although both of us are thinking more of slop reduction for working directly on alignment. The strategic advice is an angle I've only considered a little, but it seems like an important one.

There may be better cooperative strategies that a reasonably intelligent and impartial system could recommend to everyone. I've worried in the past that such strategies don't exist, but I'm far from sure and hopeful that they do. A little logically sound strategic advice might make the difference if it's given impartially to all involved.

koanchuk's Shortform

Seth Herd9d40

Roughly this analogy was explored at length in the possessed machines. It seems pretty interesting, although I onlly looked at this summary This was about the Bolshevik's that were Leninist's spiritual forebears. It's written by an anonymous person from a major lab I believe, so I think it might capture some of that ethos that I don't understand.

Independent researchers have a lot less of that orientation. I'd say most of us don't want to rush toward the glorious future at current cost. That would both probably get us all killed, and with an embarrassing lack of Dignity (and roughly, the advantages that come with following at least some virtue ethics). Although it certainly is tempting ;)

Eli's shortform feed

Seth Herd9d57

Commenting on a distributed network of personal blogs is not an efficient way to do collaborative epistemics.

Eli's shortform feed

Seth Herd9d86

Big upvote. I'd like to see this as a top-level post, even if it's pretty much exactly this. Posts are easier to reference and search, and so a better place for in-depth, high-effort discussion.

This is a really good list of alignment-relevant questions. Clarifying what needs to be answered seems quite useful at this point. I think we can converge toward answers to some of these. Doing so would clarify the alignment problem as we'll actually face it, and so make our work more efficient.

Buck's Shortform

Seth Herd12d20

The first podcast was great. It strengthened my impression that both of you are on top of the big-picture, strategic situation in a way few (if any) other people are (that is, being both broad and detailed enough to be effective).

I'd like to hear you discussion the default alignment plan more. I'd like to hear you elaborate and speculate in particular on how automated alignment is likely to go under various of the plan/effort types.

The orgs aren't publishing anything much like a plan or planning. It seems like somebody should be doing it. I nominate the two of you! :) no pressure. I do think it should be a distributed effort to refine the default plan and where it goes wrong; your efforts would do a lot to catalyze more of that discussion. The responses to What’s the short timeline plan? were pretty sketchy, and I haven't seen a lot of improvements, outside of Ryan's post, since then.

FWIW, I strongly agree with Ryan that big projects benefit from planning. Alignment isn't a unified project, but it does seem like one in important ways. Like the Apollo program, which had extensive planning, there's a lot of time pressure, so just doing stuff and getting there eventually won't work, like it does for most innovations.

Are We in a Continual Learning Overhang?

Seth Herd12d20

I don't think I understand. I think of the data for continuous learning as coming from deployment - sessions and evaluations of what's worth learning/remembering. Are you referring to data appropriate for learning-to-learn in initial training?

I agree that that's scarse. And it would be nice to have. The Transformers and Hope architectures need to learn-to-learn, and humans do too, to some extent. But to some extent, we have built-in emotional registers for what's important to learn, what's surprising and important. Loosely similar mechanisms might work for continuous learning.

Are We in a Continual Learning Overhang?

Seth Herd12d50

Excellent post! I agree that this deserves more attention in the alignment community than it's getting. Continuous learning of some sort seems inevitable, and like it breaks a lot of load-bearing assumptions in the current standard thinking. Alignment is one, and rate of progress is another; you mention both.

What's most salient to me is that we don't know how much or how good of CL would be enough to break those assumptions. It might take a lot or a little to be important.

I have become somewhat more optimistic about our ability to align continuously-learning models since writing LLM AGI will have memory (thanks for the prominent citation). But that leaves me not-very-optimistic still.

The other form of weight-based continual learning is just doing fine-tuning on carefully-selected bits of the model's "experience". This can be used to develo "skills". This is subject to large interference problems, but it's potentially pretty cheap. And it can potentially be applied as a removable LORA to keep capabilities intact (I'm not sure to what degree this would actually work).

You mention the extreme use to an organization of preserving its internal knowledge in weights vs. reloading it. I just want to emphasize that that would include solving Dwarkesh's "perpetual first-day intern" problem; that tacit knowledge includes all of the knowledge of how this organization actually gets its stuff done. (And when it's learned in weights, this should be thought of as skills as well as knowledge).