Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers make it able to "think for itself" in all the ways that make humans capable and dangerous.

  • New to alignment? See the Research Overview section
  • Field veteran? See More on My Approach at the end.

I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.

Principal articles:

  • On technical alignment of LLM-based AGI agents:
    • LLM AGI may reason about its goals and discover misalignments by default
      • An LLM-centric lens on why aligning Real AGI is hard
    • System 2 Alignment (likely approaches for LLM AGI on the current trajectory)
    • Seven sources of goals in LLM agents (brief problem statement)
    • Internal independent review for language model agent alignment
      • Updated in System 2 alignment
  • On LLM-based agents as a route to takeover-capable AGI:
    • LLM AGI will have memory, and memory changes alignment
    • Brief argument for short timelines being quite possible
    • Capabilities and alignment of LLM cognitive architectures
      • Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
  • AGI risk interactions with societal power structures and incentives:
    • Whether governments will control AGI is important and neglected
    • If we solve alignment, do we die anyway?
      • Risks of proliferating human-controlled AGI
    • Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
  • On the psychology of alignment as a field:
    • Cruxes of disagreement on alignment difficulty
    • Motivated reasoning/confirmation bias as the most important cognitive bias
  • On AGI alignment targets, assuming we achieve technical alignment:
    • Problems with instruction-following as an alignment target
    • Instruction-following AGI is easier and more likely than value aligned AGI
    • Goals selected from learned knowledge: an alternative to RL alignment
  • On communicating AGI risks:
    • Anthropomorphizing AI might be good, actually
    • Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
    • AI scares and changing public beliefs

 

Research Overview:

Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations. 

Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.

There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.  

More on My Approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 
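To make that concrete, here is a minimal sketch, in Python, of what a language model cognitive architecture of this kind could look like: an LLM wrapped with an episodic memory store and a simple executive loop that plans, acts, and records what happened. All of the names here (call_llm, EpisodicMemory, executive_loop) are hypothetical illustrations, not any particular system.

```python
# A minimal, hypothetical sketch (not any existing system) of a language model
# cognitive architecture: an LLM plus episodic memory and a simple executive loop.

from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:
    """Stand-in for a call to an underlying language model."""
    return f"[model response to: {prompt[:60]}...]"


@dataclass
class EpisodicMemory:
    """Stores snapshots of past episodes and retrieves the most relevant ones."""
    episodes: list[str] = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # A real system would use embedding similarity; keyword overlap is
        # enough to illustrate the role episodic memory plays in the loop.
        def overlap(episode: str) -> int:
            return len(set(episode.lower().split()) & set(query.lower().split()))
        return sorted(self.episodes, key=overlap, reverse=True)[:k]


def executive_loop(goal: str, memory: EpisodicMemory, max_steps: int = 5) -> None:
    """Minimal 'executive function': plan a step, act on it, remember the outcome."""
    for step in range(max_steps):
        context = "\n".join(memory.retrieve(goal))
        plan = call_llm(f"Goal: {goal}\nRelevant memories:\n{context}\nNext step:")
        outcome = call_llm(f"Carry out this step and report the result: {plan}")
        memory.store(f"Step {step}: {plan} -> {outcome}")  # "snapshot" continuous learning
        if "goal complete" in outcome.lower():
            break
```

The point is only the shape of the loop: retrieval from episodic memory plus an explicit planning step are roughly the cognitive systems the surrounding text argues current LLMs lack.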

My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe! 

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines, under realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.

I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).

One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AGI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: instruction-following.

It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done.  An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter. 

There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.

I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), the chance we don't survive long-term as a species, is in the 50% range: too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Posts

  • A country of alien idiots in a datacenter: AI progress and public alarm
  • LLM AGI may reason about its goals and discover misalignments by default
  • Problems with instruction-following as an alignment target
  • Anthropomorphizing AI might be good, actually
  • LLM AGI will have memory, and memory changes alignment
  • Whether governments will control AGI is important and neglected
  • Will LLM agents become the first takeover-capable AGIs?
  • OpenAI releases GPT-4.5
  • System 2 Alignment: Deliberation, Review, and Thought Management
  • Seven sources of goals in LLM agents
  • Seth Herd's Shortform

Comments

AI Timelines

The important thing for alignment work isn't the median prediction; if we only had an alignment solution by then, we'd still have a 50% chance of dying for lack of one.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.

The only important ASI timeline

Got it. I see the value, and I'll do likewise. There is another step of saying "and we need to get moving on it now" but doing the first step of "this is the most important thing in your world" is a good start.

The only important ASI timeline

I think this is right at the broad level. But once you've accepted that getting a good outcome from AGI is the most important thing to work on, timelines matter a lot again, because they determine what the most effective direction to work is. Figuring out exactly what AGI we have to align and how long we have to do it is pretty crucial for having the best possible alignment work done by the time we hit takeover capable AGI.

Insofar As I Think LLMs "Don't Really Understand Things", What Do I Mean By That?

I think "understanding" in humans is an active process that demands cognitive skills we develop with continuous learning. I think you're right that LLMs are missing "the big picture" and organizing their local concepts to be consistent with it. I don't think humans do this automatically (per Dweomite's comment on this post), but that we need to learn skills to do it. I think this a lot of what LLMs are missing (TsviBT's "dark matter of intelligence").

I wrote about this in Sapience, understanding, and "AGI" but I wasn't satisfied and it's out of date. This is an attempt to do a better and briefer explanation, as a sort of run-up to doing an updated post.

We've learned skills for thought management/metacognition/executive function. They're habits, not beliefs (episodic memories or declarative knowledge), so they're not obvious to us. We develop "understanding" by using those skills to metaphorically turn over concepts in our minds, actively comparing them to memories of data and to other beliefs. Doing this checks their consistency with other things we know. Learning from these investigations improves our future understanding of that concept, and our skills for understanding others.

What LLMs are missing relative to humans is profound right now, but it may be all too easy to add well enough to get takeover-capable AGI. Among other things (below), they're missing cognitive skills that aren't well described in the text training set, but that may be pretty easy to learn with a System 2-type approach that can be "habitized" with continuous learning. This might be as easy as a little fine-tuning, if the interference problem is adequately solved - and what's adequate might not be a high bar. Fine-tuning already adds this type of skill, but it seems to produce too much interference to keep going. And I don't know of a full self-teaching loop, although there is constant progress on most or all of the components to build one.

There may be other routes to filling in that missing executive function and active processing for human-like understanding.

This is why I'm terrified of short timelines while most people have slightly longer timelines at this point. 

I've been thinking about this a lot in light of the excellent critiques of LLM thinking over the last year. My background is "computational cognitive neuroscience," so comparing LLMs to humans is my main tool for alignment thinking.

When I was just getting acquainted with LLMs in early 2023, my answers were that they're missing episodic memory (for "snapshot" continuous learning) and "executive function", a vague term that I'm now thinking is mostly skills for managing cognition. I wrote about this in Capabilities and alignment of LLM cognitive architectures in early 2023. If you can overlook my focus on scaffolding, I think it stands up as a partial analysis of what LLMs are missing and the emergent/synergistic/multiplicative advantages of adding those things.

But it's incomplete. I didn't emphasize continuous skill learning there, but I now think it's pretty crucial for how humans develop executive function and therefore understanding.  I don't see a better way to give it to agentic LLMs. RL on tasks could do it, but that has a data problem if it's not self-directed like human learning is. But there might be other solutions.

I think this is important to figure out. It's pretty crucial for both timelines and alignment strategy. 
 

A country of alien idiots in a datacenter: AI progress and public alarm

Thank you! I've thought about this a lot, so most of the work was in cramming those many hypotheses into a short and entertaining enough form that people might read them.

WRT this giving you hope: Yes, people will wake up to AI and fear it appropriately. I still don't have much hope for shutdowns. But even slowdowns and less proliferation might really help our (otherwise poor IMO) odds of getting this right and surviving.

A country of alien idiots in a datacenter: AI progress and public alarm

Good question. I probably should have emphasized that the difference between this and the AI 2027 scenario is that the route to AGI takes a little longer, so there is much more public exposure to agentic LLMs.

I did emphasize that this may all come too late to make much difference from a regulatory standpoint. Even if that happens, it's going to change the environment in which people make crucial decisions about deploying the first takeover-capable AIs. That cuts both ways; whether polarization dominates belief diffusion seems like a complex question, but it might be possible to get a better guess than I have now, which is almost none.

The other change I think we're pretty much guaranteed to get is dramatically improved funding for alignment and safety. That might come so late as to be barely useful, too. Or early enough to make a huge difference.

A country of alien idiots in a datacenter: AI progress and public alarm

I am really thinking that they'll be deployed beyond their areas of reliable competence. If they can do even 50% of the work it might be worth it. As that goes up, they don't need to be nearly 100% competent. I guess a factor I didn't mention is that the rates of alarming mistakes should be far higher in deployment than in testing, because the real world throws lots of curve balls that are hard to come up with in training and testing.

And I think the downsides of AI incompetence will fall not mostly on the businesses that deploy them, but on the AI itself. Which isn't right, but it's helpful in getting people to blame and fear AI.

A country of alien idiots in a datacenter: AI progress and public alarm

I said a variation of the Peter Principle. Maybe I should have said some relation of the Peter Principle, or not used that term at all. What I'm talking about isn't about promotion but expansion into new types of tasks.

Once somebody makes money deploying agents in one domain, other people will want to try similar agents in similar new domains that are probably somewhat more difficult. This is a very loose analog of promotion.

The bit about not wanting to demote them is totally different. I think they can be bad at a job and make mistakes that damage their and your reputation and still be well worth keeping in that job. There are also some momentum effects of not wanting to re-hire all the people you just fired in favor of AI and admit you made a big mistake. Many decision-makers would be tempted to push through and try to upgrade the AI and work around its problems instead of admit they screwed up.

See the response below for the rest of that logic. There can be more upside than down even with some disastrous mistakes or near misses that will go viral.

I'd be happy to not call it a relation of the Peter Principle at all. Let's call it the Seth Principle; I'd find it funny to have a principle of incompetence named after me :)

A country of alien idiots in a datacenter: AI progress and public alarm

I agree that the crux is the difference between public and private models. That's exactly what I was pointing to in the opener by saying maybe somebody is completing a misaligned Agent-4 in a lab right when this is happening in public. That would make all of this concern almost useless. It would still be in the air and might push decision-makers to be a bit more cautious - which could be a nontrivial advantage.

I agree that anything that produces public worry earlier is probably important and useful. The only exceptions would be outright lies that could blow back. But sparking concerns about job losses early wouldn't be a lie. I'm constantly a bit puzzled as to why other alignment people don't seem to think we'll get catastrophic job losses before AGI. Mostly I don't think people spend time thinking about it, which makes sense since actual misalignment and takeover is so much worse. But I think it's between possible and likely that job losses will be very severe and people should worry about them while there's still time to slow them down dramatically. Which would also slow AGI.

Constantly asking politicians about their plans seems like a good start. Saying you're an AI researcher when you do would be better.

To your first point:

Yes, I think that incompetence will both be taken for misalignment when it's not, and it will also create real misalignments (largely harmless ones).

I think this wouldn't be that helpful if the public really followed the logic closely; ASI wouldn't be incompetent, so it wouldn't have incompetence as a source of misalignment. But the two issues are semantically linked. This will just get the public worried about alignment. Then they'll stay worried even if they do untangle the logic. Because they should be.

13 Arguments About a Transition to Neuralese AIs

This is interesting; you make good points on both sides, probably by virtue of gathering them broadly. Nice work!

This is about whether we will move to neuralese on the current default path. I think there's an additional important factor: We must not. If the default path takes us there, we must put our shoulders against it and change that path.

I think it's hard to argue that faithful CoT doesn't really help the odds of aligning LLM-based AGI. And it's really hard to argue that's not important enough to be worth some real capability loss. I realize that commercial and national defense pressures are strong; but we should not fail to mention this fact in any discussion of losing natural language CoT. If we do, we're substantially more likely to all die. It would be a horrible choice and a horrible mistake. Preventing research on improvements without CoT should be a very high priority.

The arguments for risk:

Every plan I know of for aligning LLM-based AGI benefits immensely if we still have somewhat-faithful CoT. And that's critical, because all of those plans suck! That's including my own. Taking an unreliable plan and making it a lot worse is really, really bad for our odds of survival.

Of course that's a big claim. I tried to give full credit to those who think alignment is a lot easier in Cruxes of disagreement on alignment difficulty. But even with the most generous steelmanning I can't get any reasonable misalignment estimate down below maybe 10% risk of disaster. Remove CoT and that goes way up, in that case maybe 20%.

Estimates based on my most recent and careful thinking are more like a 40-60% chance of outright alignment failure (with a lot more chance that, if we solve alignment, we die anyway). That would go way up to 60-80% if we lose CoT.

On the actual question of the efficiencies of losing CoT:

I don't think there are large gains to be made by dropping language entirely. Language is reason-complete; that's what it's for, and we humans spent a long time crafting it to be complete and efficient. Language compresses very nicely, too; thinking in language doesn't prevent efficiency improvements by collapsing steps.

BUT even if that turns out to be

Wikitag Contributions

  • Guide to the LessWrong Editor (2 edits, 4 months ago)
  • Outer Alignment (5 edits, 7 months ago)
  • Language model cognitive architecture (a year ago)
  • Corrigibility (2 years ago)