LESSWRONG
LW

706
Seth Herd
7541Ω24643183910
Message
Dialogue
Subscribe

Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers make it to "think for itself" in all the ways that make humans capable and dangerous.

  • New to alignment? See the Research Overview section
  • Field veteran? See More on My Approach at the end.

I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal and field sociological issues, because choosing the best technical research approach depends on all of those. 

Principal articles:

  • On technical alignment of LLM-based AGI agents:
    • LLM AGI may reason about its goals and discover misalignments by default
      • An LLM-centric lens on why aligning Real AGI is hard
    • System 2 Alignment Likely approaches for LLM AGI on the current trajectory  
    • Seven sources of goals in LLM agents brief problem statement
    • Internal independent review for language model agent alignment
      • Updated in System 2 alignment
  • On LLM-based agents as a route to takeover-capable AGI
    • LLM AGI will have memory, and memory changes alignment
    • Brief argument for short timelines being quite possible
    • Capabilities and alignment of LLM cognitive architectures
      • Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
  • AGI risk interactions with societal power structures and incentives:
    • Whether governments will control AGI is important and neglected
    • If we solve alignment, do we die anyway?
      • Risks of proliferating human-controlled AGI
    • Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
  • On the psychology of alignment as a field:
    • Cruxes of disagreement on alignment difficulty
    • Motivated reasoning/confirmation bias as the most important cognitive bias
  • On AGI alignment targets assuming technical alignment
    • Problems with instruction-following as an alignment target
    • Instruction-following AGI is easier and more likely than value aligned AGI
    • Goals selected from learned knowledge: an alternative to RL alignment
  • On communicating AGI risks:
    • Anthropomorphizing AI might be good, actually
    • Humanity isn’t remotely longtermist, so arguments for AGI x-risk should focus on the near term
    • AI scares and changing public beliefs

 

Research Overview:

Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations. 

Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure it has only goals we like, it will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.

There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first. 

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too. 

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications. 

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.  

More on My Approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans. 

My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe! 

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research. 

I think it's fairly likely that alignment isn't impossibly hard but also not easy enough that developers get it right on their own despite all of their biases and incentives. so a little work in advance, from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).

One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following. 

It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done.  An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter. 

There are significant Problems with instruction-following as an alignment target. It does not solve the problem with corrigibility once an AGI has left our control, merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.

I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate of our long-term survival as a species is in the 50% range, too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
6Seth Herd's Shortform
2y
66
AI Timelines
Seth Herd2y*94

The important thing for alignment work isn't the median prediction; if we had an alignment solution just by then, we'd have a 50% chance of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.

Reply4
You Should Get a Reusable Mask
Seth Herd17h40

I think you're misunderstanding. The pitch here is to buy it now, not to wear it now. If there's another serious pandemic, wearing a slightly larger mask isn't going to cost many weirdness points.

TBF weirdness points did prevent me from wearing my fancy EnvoPro on my last flight, and sure enough I caught a cold, probably from the flight - so for that purpose a disposable would've been better.

The flexible fitted masks are probably far better than the disposable masks. Some studies indicated that disposables barely do anything, because they don't seal around the nose for many people. They work when fitted carefully by professional nurses in hospitals.

Reply
You Should Get a Reusable Mask
Seth Herd18h60

I think it would be great if you (whoever is reading this) would gather some data and publish it.

Accordingly, I did a little and settled on the EnvoPro N95 (but it's curiously $45 now, ten dollars or so more expensive now than when I bought it a few months ago, after Jeff's last reminder post). But my research was a bit rushed and the p100 filters on the 3M 6200 Jeff linked above are much better (IDK how much more uncomfortable to breath through). The EnvoPro is comfortable to wear and easy to breath in, and seals well and easily, even with my large nose and prominent bridge. But it's got a valve, so it wouldn't be ideal for protecting people from me if I might be sick.

I did some cross-comparison after looking up other reviews (my primary source was from Lockdown era, so some of these might not have existed). But it was kind of cursory.

Any of us can do what you ask almost as easily as Jeff (unless you meant gathering data from his mask trial parties? That would be useful even in cursory form of just going from memory).

Here's Jeff's useful list reuable masks with good seals, which he linked early in this article but didn't clearly label. It doesn't include reviews, but the Amazon pages do, and I found more thorough reviews after some pretty obvious Google searches.

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Seth Herd20hΩ120

It seems more straightforward to say that this scopes the training, preventing it from spreading. Including the prompt that accurately describes the training set is making the training more specific to those instructions. That training thereby applies less to the whole space.

Maybe that's what you mean by your first description, and are dismissing it, but I don't see why. It also seems consistent with the second "reward hacking persona" explanation; that persona is trained to apply in general if you don't have the specific instructions to scope when you want it.

It seems pretty clear that this wouldn't help if the data is clean; it would just confuse the model by prompting it to do one thing and teaching it to do a semanticall completely different thing, NOT reward hack.

Your use of "contrary to user instructions/intent" seems wrong if I'm understanding, and I mention it because the difference seems nontrivial and pretty critical to recognize for broader alignment work. The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that. So it behaves in accord with instructions but contrary to intent. Right? I think that's a difference that makes a difference when we try to reason through why models do things we don't like.

Reply11
Don't Mock Yourself
Seth Herd1d*209

I have a sticker on both my laptop and my car. They're the same and say "talk to yourself like someone you love." This is the only bumper sticker I've ever put on anything. It's the only thing I've seen that's pretty clearly worth trying to shove into people's minds.

I think we live in a very competitive society. Not everyone thinks too little of themselves, but it's pretty common and pretty tragic that so many people do.

I was raised in a Quaker tradition that holds that all human beings have inherent worth. Then I studied some Buddhism that says that all humans have Buddha nature, making them inherently worthy (And maybe dogs too, still not sure about that one;).

I might be more successful if I didn't have such an easy time going easy on myself. But I think the world would be a happier place, if not moving faster, if everyone just quit knocking themselves and extended the same kindness to themselves that they would extend to a loved one.

Come to think of it, the few people who think far too much of themselves and as a result are very harsh to others could follow the same rule and speak to themselves like they would someone close to them, and that might make the world a better place too by bringing them down a notch.

Anyway, I think this is worth a try for anyone, at least to just try to become conscious of how often you're knocking yourself.

Reply3
1a3orn's Shortform
Seth Herd6d20

Seems like ASI that's a hot mess wouldn't be very useful and therefore effectively not superintelligent. It seems like goal coherence is almost fundamentally part of what we mean by ASI.

You could hypothetically have a superintelligent thing that only answers questions and doesn't pursue goals. But that would just be turned into a goalseeking agent by asking it "what would you do if you had this goal and these tools..."

This is approximately what we're doing with making LLMs more agentic through training and scaffolding.

Reply
1a3orn's Shortform
Seth Herd6d20

First, I think this is an important topic, so thank you for addressing it.

This is exactly what I wrote about in LLM AGI may reason about its goals and discover misalignments by default. 

I've accidentally summarized most of the article below, but this was dashed off - I think it's clearer in article.

I'm sure there's a tendency toward coherence in a goal-directed rational mind; allowing ones' goals to change at random means failing to achieve your current goal. (If you don't care about that, it wasn't really a goal to you.) Current networks aren't smart enough to notice and care. Future ones will be, because they'll be goal-directed by design. 

BUT I don't think that coherence as an emergent property is a very important part of the current doom story. Goal-directedness doesn't have to emerge, because it's being built in. Emergent coherence might've been crucial in the past, but I think it's largely irrelevant now. That's because developers are working to make AI more consistently goal-directed as a major objective. Extending the time horizon of capabilities requires that the system stays on-task (see section 11 of that article).

I happen to have written about coherence as an emergent property in section 5 of that article. Again, I don't think this is crucial. What might be important is slightly separate: the system reasoning about its goals at all. It doesn't have to become coherent to conclude that its goals aren't what it thought or you intended.

I'm not sure this happens or can't be prevented, but it would be very weird for a highly intelligent entity to never think about its goals- it's really useful to be sure about exactly what they are before doing a bunch of work to fulfill them, since some of that work will be wasted or counterproductive. (section 10).

Assuming an AGI will be safe because it's incoherent seems... incoherent. An entity so incoherent as to not consistently follow any goal needs to be instructed on every single step. People want systems that need less supervision, so they're going to work toward at least temporary goal following.

Being incoherent beyond that doesn't make it much less dangerous, just more prone to switch goals. 

If you were sure it would get distracted before getting around to taking over the world that's one thing. I don't see how you'd be sure.

This is not based on empirical evidence, but I do talk about why current systems aren't quite smart enough to do this, so we shouldn't expect strong emergent coherence from reasoning until they're better at reasoning and have more memory to make the results permanent and dangerous.

As an aside, I think it's interesting and relevant that your model of EY insults you. That's IMO a good model of him and others with similar outlooks - and that's a huge problem. Insulting people makes them want to find any way to prove you wrong and make you look bad. That's not a route to good scientific progress.

I don't think anything about this is obvious, so insulting people who don't agree is pretty silly. I remain pretty unclear myself, even after spending most of the last four months working through that logic in detail.

Reply
"Intelligence" -> "Relentless, Creative Resourcefulness"
Seth Herd6d20

I agree that discernment is necessary (so maybe expand to RCRD?).

This lens is pretty clarifying I think. That's relative to repeatedly pointing out that "agency" in the sense of just relentlessly pursuing a goal is trivially easy to add via scaffolding, so not the missing piece many people think it is, and pointing out that LLMs are creative as hell. They might need a little prompting to get creative enough, but again that's trivial.

Hm, what about "relentless creative refinement" since I'm not sure what resourcefulness directly points at?

Anyway, discernment does seem like the limiting factor. You've got to discern which of your relentlessly creative efforts are most worth further pursuit. I think discernment is a somewhat better term than the others I've seen used for this missing capability. Getting the right term seems worthwhile.

The following is just some of my follow-on thoughts on the path to discernment in agentic LLMs and therefore timelines. Perhaps this will be fodder for a future post. It's pretty divergent from the original topic so feel free to ignore.

Thinking about how humans acquire discernment in a given area should give some clues as to how hard it would be to add that to agentic LLMs.

Humans do discernment (IMO) sometimes with a bunch of very complex System 2 explicit analysis of a situation to get a decent guess at whether this approach is good/working. Over enough examples/experiences we can learn/compile those many judgments into effortless and mysterious intuitive judgments (I guess more how "discernment" is usually used"). Or we get enough data/experiences to learn/compile by using some faster rubric, like "I think those pants are fashionable because something else that person is wearing seems probably fashionable."

It's a bunch of online learning specific to a situation OR careful analysis following strategies and formulas that are maybe less situation-specific and more general, but quite time-consuming. For instance, Google's co-scientist project has a highly complex scaffolding to create, evolve, and evaluate scientific hypotheses, including discerning their worth against the literature and in other ways. And it seems to work. That system doesn't have the continuous learning to compile that into better judgments. It's unclear how far you could get by fine-tuning on results of those laborious judgments in a given domain.

The other approach would be to create datasets that include much more/better value judgments than text corpora usually do. I don't know how easy/hard that would be to create.

To me this suggests it's not trivial to add discernment, but also doesn't require breakthroughs to add some, leaving the question how much discernment is enough for any given purpose.

Reply
Intent alignment seems incoherent
Seth Herd6d110

Thanks for citing my work! I feel compelled to respond because I think you're misunderstanding me a little.

I agree that long-term intent alignment is pretty much incoherent because people don't have much in the way of long-term intentions. I guess the exception would be to collapse it to following intent only when it exists - when someone does form a specified intent. 

In my work, intent alignment I means personal short-term intent. Which is pretty much following instructions as they were intended. That seems coherent (although not without Problems).

I use it that way because others seem to as well. Perhaps that's because the broader use is incoherent. It usually seems to means "does what some person or limited group wants it to do" (in the short term is often implied)

The original definition of intent alignment is the broadest I know of, more-or-less doing something people want for any reason. Evan Hubinger defined it that way, although I haven't seen that definition get much use.

For all of this see Conflating value alignment and intent alignment is causing confusion.  I might not have been clear enough in stressing that I drop the "personal short term" but still mean it when saying intent alignment. I'm definitely not always clear enough 

Reply
LLMs are badly misaligned
Seth Herd6d50

I mostly agree. "It might work but probably not that well even if it does" is not a sane reason to launch a project. I guess optimists would say that's not what we're doing, so let's steelman it a bit. The actual plan (usually implicit because optimists don't usually wants to say this out loud) is probably something like "we'll figure it out as we get closer!" and "we'll be careful once it's time to be careful!"

Those are more reasonable statements, but still highly questionable if you grant that we easily could wipe out everything we care about forever. Which just results in optimists disagreeing, for vague reasons, that that's a real possibility.

To be generous once again, I guess the steelman argument would be that we aren't yet at risk of creating misaligned AGI, so it's not that dangerous to get a little closer. I think this is a richer discussion, but that we're already well into the danger zone. We might be so close to AGI that it's practically impossible to permanently stop someone from reaching it. That's a minority opinion, but it's really hard to guess how much progress is too much to stop.

I'm finding it useful to go through the logic in that much detail. I think these are important discussions. Everyone's got opinions, but trying to get closer to the truth and the shape of the distributions across "big picture space" seems useful.

I think you and I probably are pretty close together in our individual estimate, so I'm not arguing with you, just going through some of the logic for my own benefit and perhaps anyone who reads this. I'd like to write about this and haven't felt prepared to do so; this is a good warmup. 

To respond to that nitpick: I think the common definition of "alignment target" is what the designers are trying to do with whatever methods they're implementing. That's certainly how I use it. It's not the reward function; that's an intermediate step. How to specify an alignment target and the other top hits on that term define it that way, which is why I'm using it that way. There are lots of ways to miss your target, but it's good to be able to talk about what you're shooting at as well as what you'll hit.

Reply1
Load More
70LLM AGI may reason about its goals and discover misalignments by default
Ω
1mo
Ω
6
49Problems with instruction-following as an alignment target
Ω
5mo
Ω
14
35Anthropomorphizing AI might be good, actually
5mo
6
73LLM AGI will have memory, and memory changes alignment
6mo
15
28Whether governments will control AGI is important and neglected
7mo
2
37Will LLM agents become the first takeover-capable AGIs?
Q
7mo
Q
10
34OpenAI releases GPT-4.5
8mo
12
35System 2 Alignment
8mo
0
23Seven sources of goals in LLM agents
8mo
3
78OpenAI releases deep research agent
8mo
21
Load More
Guide to the LessWrong Editor
3 months ago
Guide to the LessWrong Editor
3 months ago
(+91)
Outer Alignment
6 months ago
(+54/-21)
Outer Alignment
6 months ago
(+81/-187)
Outer Alignment
6 months ago
(+9/-8)
Outer Alignment
6 months ago
(+94/-150)
Outer Alignment
6 months ago
(+1096/-13)
Language model cognitive architecture
a year ago
(+223)
Corrigibility
2 years ago
(+472)