Seth Herd

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers build it to "think for itself" in all the ways that make humans capable and dangerous.

If you're new to alignment, see the Research Overview section below. Field veterans who are curious about my particular take and approach should see the More on My Approach section at the end of the profile.

Important posts:

  • On LLM-based agents as a route to takeover-capable AGI:
    • LLM AGI will have memory, and memory changes alignment
    • Brief argument for short timelines being quite possible
    • Capabilities and alignment of LLM cognitive architectures
      • Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
  • AGI risk interactions with societal power structures and incentives:
    • Whether governments will control AGI is important and neglected
    • If we solve alignment, do we die anyway?
      • Risks of proliferating human-controlled AGI
    • Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
  • On the psychology of alignment as a field:
    • Cruxes of disagreement on alignment difficulty
    • Motivated reasoning/confirmation bias as the most important cognitive bias
  • On technical alignment of LLM-based AGI agents:
    • System 2 Alignment, on how developers will try to align LLM agent AGI
    • Seven sources of goals in LLM agents, a brief problem statement
    • Internal independent review for language model agent alignment
  • On AGI alignment targets assuming technical alignment:
    • Problems with instruction-following as an alignment target
    • Instruction-following AGI is easier and more likely than value aligned AGI
    • Goals selected from learned knowledge: an alternative to RL alignment
  • On communicating AGI risks:
    • Anthropomorphizing AI might be good, actually
    • Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
    • AI scares and changing public beliefs

 

Research Overview:

Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more. 

There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.  

More on My Approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 

My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll continue with the alignment target developers currently use: instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.

There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions. 

I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), my estimate of the chance we don't survive long-term as a species, is in the 50% range: the situation is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Comments

AI Timelines

The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd have a 50% chance of dying for lack of one.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.

My talk on AI risks at the National Conservatism conference last week

I've been excited about pitching AGI x-risks to conservatives since seeing the great outreach work and writeup from AE Studios, Making a conservative case for alignment.

My fervent hope is that we somehow avoid making this a politically polarized issue. I fear that polarization easily overwhelms reason, and is one of the few ways the public could fail to appreciate the dreadful, simple logic in time to be of any help.

The Eldritch in the 21st century

We are building new gods in the hope that they will love us.

You may laugh as though this were impossible or recoil in fear if you know it is.

But it is what we're doing.

This eclipses all other efforts to navigate those eldritch forces.

Will our new gods love us, or at least obey us? If they obey, will their mortal masters use their power for their brethren, or to create strange new worlds?

This piece is amazing. Thank you. I absolutely love the framing and agree with the analysis of the situation - only the action recommendations need to be updated.

Like most incisive analyses of the current historical situation, this piece is hugely incomplete, particularly because it does not take into account likely progress in AI.

Review: E-bikes on Hills

I am in love with my ebike. I may get less exercise total, but I get a lot more joy from a lot more time outdoors. I feel a stronger connection to place and culture, because the bike allows going slow and looking around. When there's traffic or you're going fast, your attention will and should be mostly on that, but you can stop or go slow as much as you want to look around. Rubbernecking in a car can feel rude or dangerous; on a bike it's easy and fun.

Having a big convenient cargo carrier and places to secure collapsible bags for extra space means you can do most errands by bike instead of car.

Increased time on a bike sounds risky, but here's a weird study: more closed head injuries were recorded per mile travelled in a car than on a bike. I didn't catch the reference, and I realize this is absolutely astounding. The injuries on bikes are probably more severe, but I think it's not fully appreciated how much even minor car accidents mess your brain up.

I do recommend biking as though no car will ever notice you by default, but I realize that's not really an option in dense cities with a lot of traffic. In smaller towns with less traffic and more side streets and bike lanes (Boulder is an extreme example, but I'm also biking in Traverse City, a smaller and much less bike-friendly town), there's no comparison to driving in levels of joy and fun.

The Thalamus: Heart of the Brain and Seat of Consciousness

What I make of those studies is that stimulating the thalamus activates the whole corticothalamic loop. The non-specific or activating nuclei of the thalamus switch on matched areas of cortex. The thalamus has a powerful regulatory role, but it's not making decisions, it's enforcing them. The government that sits in DC is making decisions, and that's why we call it the seat of the government. Your metaphor simply does not go through, and it makes you sound confused.

The government's decisions are influenced from elsewhere, and they are enacted elsewhere. But the thalamus's role is much less like the Congress or Senate and much more like the people who enact and enforce the decisions made by those governing bodies. The decisions about what becomes conscious are made elsewhere, in the conjunction of the cortex and basal ganglia. Decisions about whether to be conscious or unconscious are made in subthalamic nuclei, and again only enforced or enacted by the thalamus.

There you said the thalamus is where consciousness is happening. That is just flat wrong. It's a system phenomenon. Trending towards statements like that is why it's a mistake to say any one place is the seat of consciousness; it leads to very wrong conceptions.

Government largely happens in DC. Consciousness largely happens throughout thalamocortical loops.

The reason I thought this is worth mentioning is that talking about a seat of consciousness confuses the whole phenomenon of consciousness. It implies that consciousness is some little add-on happening in some little corner of the brain, when that's not right at all; consciousness is a highly complex phenomenon involving much of the brain's higher functions.

ryan_greenblatt's Shortform

I appreciate you saying that the 25th percentile timeline might be more important. I think that's right and underappreciated.

One of your recent (excellent) posts also made me notice that AGI timelines probably aren't normally distributed. Breakthroughs, other large turns of events, or large theoretical misunderstandings at this point probably play a large role, and there are probably only a very few of those that will hit. Small unpredictable events that create normal distributions will play a lesser role.

I don't know how you'd characterize that mathematically, but I don't think it's right to assume it's normally distributed, or even close.

Back to your comment on the 25th percentile being important: I think there's a common error where people round to the median and then think "ok, that's probably when we need to have alignment/strategy figured out." You'd really want to have it at least somewhat ready far earlier.

That's both in case it's on the earlier side of the predicted distribution, and because alignment theory and practice need to be ready far enough in advance of game time to have diffused and be implemented for the first takeover-capable model.

I've been thinking of writing a post called something like "why are so few people frantic about alignment?" making those points. Stated timeline distributions don't seem to match mood IMO and I'm trying to figure out why. I realize that part of it is a very reasonable "we'll figure it out when/if we get there." And perhaps others share my emotional dissociation from my intellectual expectations. But maybe we should all be a bit more frantic. I'd like some more halfassed alignment solutions in play and under discussion right now. The 80/20 rule probably applies here.

The Thalamus: Heart of the Brain and Seat of Consciousness

Okay fine, I'll engage a little. I do love this shit, even though I try not to spend time on it because it's a mess with little payoff (unless the aforementioned debate over AI consciousness starts to seem relevant to our odds of survival - which it well might).

I don't think your DC as the seat of government metaphor goes through. DC is indeed the seat of government. The thalamus isn't in charge of consciousness, it's just a valve (but far more sophisticated; an arena of competition) that someone else turns: the cortex and basal ganglia, in elaborate collaboration. The thalamus is the mechanism by which their decisions are enforced; it doesn't seem to play a large role in deciding what's attended.

The Thalamus: Heart of the Brain and Seat of Consciousness

Yes, I agree. Conscious decisions that firmly fit the definition take a lot longer.

Oh hey! If you do want to know my theory about decisions, I did write a whole article about it for a "real prestigious scientific" journal: Neural mechanisms of human decision-making.

You sound like one of the very few humans who might be interested.

Warning: when I wrote that, I was worried that writing it too clearly or persuasively might give AI developers too many good ideas and shorten timelines. I also had a boss/collaborator with different ideas about how to frame things. So I compromised on partly clear writing about my own carefully thought-out theories, and partly scientifical jargon-speak that would get it published in a good journal but interest or enlighten almost no one.

Make of that what you will.

The Thalamus: Heart of the Brain and Seat of Consciousness

I'll pardon it but I won't engage with it. I think saying it's the seat of consciousness makes you sound like you don't know what's going on, when actually you do. I could be right or wrong.

The Thalamus: Heart of the Brain and Seat of Consciousness

I didn't have time to read all of this, so I apologize for commenting. I'm quite familiar with the science of the thalamus and its relation to consciousness. I think all of this looks pretty accurate, but I must object to your framing of the thalamus as the seat of consciousness. Consciousness is the whole shebang. It's not a little add-on or accident. Substantial processing happens outside of consciousness, of course, but lots of the brain needs to participate to create consciousness; it's a sophisticated set of information processing functions across areas, particularly the corticothalamic loops.

So calling the thalamus the seat of consciousness is like saying the water comes from the valve because, if that's stuck closed, no water flows. The thalamus is critical for consciousness, but that doesn't mean consciousness originates there. Most of the sophisticated processing that creates the rich representations we refer to as qualia or consciousness originates in the cortex; the thalamus and basal ganglia are more involved in selecting between possible such representations.

I wish I had time to dig into this more; consciousness is fascinating, and I'm hoping it even becomes important again if people start arguing for AI rights on the basis of consciousness. Until then, I'm going to focus on alignment.

Pardon any errors from phone voice transcription.

Posts

Sorted by New

  • Problems with instruction-following as an alignment target (4mo)
  • Anthropomorphizing AI might be good, actually (4mo)
  • LLM AGI will have memory, and memory changes alignment (5mo)
  • Whether governments will control AGI is important and neglected (6mo)
  • Will LLM agents become the first takeover-capable AGIs? (6mo)
  • OpenAI releases GPT-4.5 (7mo)
  • System 2 Alignment (7mo)
  • Seven sources of goals in LLM agents (7mo)
  • OpenAI releases deep research agent (7mo)
  • Yudkowsky on The Trajectory podcast (8mo)
Wikitag Contributions

  • Guide to the LessWrong Editor
  • Outer Alignment
  • Language model cognitive architecture
  • Corrigibility