Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers make it able to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal and field sociological issues, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
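To make that concrete, here's a minimal sketch of the kind of scaffolding I have in mind: an LLM call wrapped with an episodic memory store and a crude executive loop. Everything in it (the `llm` and `embed` stand-ins, the retrieval scheme) is hypothetical and illustrative, not any particular system's API; real language model cognitive architectures would be far more elaborate.

```python
import hashlib
import numpy as np

def llm(prompt: str) -> str:
    """Stand-in for a call to some large language model."""
    return "placeholder response (DONE)"   # a real system would call an API or local model

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a deterministic pseudo-random vector from a hash."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(64)

class EpisodicMemory:
    """Stores past episodes; recalls the most similar ones by embedding similarity."""
    def __init__(self):
        self.episodes: list[tuple[np.ndarray, str]] = []

    def store(self, text: str) -> None:
        self.episodes.append((embed(text), text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda ep: -float(ep[0] @ q))
        return [text for _, text in ranked[:k]]

def run_agent(goal: str, memory: EpisodicMemory, max_steps: int = 5) -> None:
    """Crude executive function: plan, act step by step, and record each episode."""
    plan = llm(f"Goal: {goal}\nRelevant memories: {memory.recall(goal)}\nWrite a short plan.")
    progress = ""
    for step in range(max_steps):
        action = llm(f"Goal: {goal}\nPlan: {plan}\nProgress: {progress}\nNext single action?")
        result = llm(f"Execute this action and report the result: {action}")
        memory.store(f"step {step}: {action} -> {result}")   # episodic record of the step
        progress += f"\n{action} -> {result}"
        if "DONE" in result:                                  # crude self-monitoring / goal check
            break

run_agent("Summarize this week's lab notes", EpisodicMemory())
```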
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AGI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
The structure you describe seems like it could work. It also seems like that's the alignment target now, and it may remain so as we near AGI.
As you note, there's a conflict between whatever long-term goals the AI has and the deontological principles it's following. We'd need to make very sure that conflict reliably goes in favor of deontological rules like "follow instructions from authorized humans", even where those rules conflict with any or all of its other goals and values.
It seems simpler and safer to make that deontological principle the only one, or to have only weak and vague values/goals outside of that.
So it seems easier to make instruction-following the only training target, or similarly, Corrigibility as Singular Target. You'd then issue instructions for all of the other goals or behaviors you want. This puts more work on the operator, but you can instruct it to help you with that work, and it keeps prioritization in logic instead of training.
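To illustrate "prioritization in logic instead of training": in a sketch like the one below, the only trained target would be instruction-following, and every other goal, value, or standing policy arrives as an instruction whose precedence is decided by explicit, inspectable code. The priority scheme and the names here are made up for illustration, not a proposal for the actual rules.

```python
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    SAFETY_OVERRIDE = 3   # e.g. "stop", "pause and check in"
    STANDING_POLICY = 2   # e.g. "always flag irreversible actions"
    TASK = 1              # ordinary task instructions

@dataclass
class Instruction:
    priority: Priority
    issued_at: int        # simple counter; larger = more recent
    text: str

def resolve(instructions: list[Instruction]) -> list[Instruction]:
    """Explicit, inspectable conflict resolution: higher priority first;
    within a level, the principal's most recent instruction wins."""
    return sorted(instructions, key=lambda i: (-i.priority, -i.issued_at))

# Usage: the "helpful collaborator" behavior is itself just another instruction.
stack = [
    Instruction(Priority.TASK, 1, "Refactor the data pipeline."),
    Instruction(Priority.STANDING_POLICY, 0, "Check in before any irreversible change."),
    Instruction(Priority.SAFETY_OVERRIDE, 2, "Pause and summarize your plan first."),
]
for inst in resolve(stack):
    print(inst.priority.name, "->", inst.text)
```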
There are still Problems with instruction-following as an alignment target and, similarly, Serious Flaws in CAST, but those problems have to be faced anyway even if corrigibility/IF is mixed in with a bunch of other alignment targets.
Training at cross-purposes seems like the major source of notable misalignments in current models. We could just not do it for models approaching AGI.
To Jeremy's point in the other comment, the single target is also probably more reflectively stable.
There's still plenty to go wrong, but this does seem to reduce the difficulty you note in having conflicting goals/principles of different priority that we're trying to specify by training.
You mean post-AGI and pre-ASI?
I agree that will be a tricky stretch even if we solve alignment.
Post-ASI, the only question is whether it's value-aligned or intent-aligned to a good person (or people). It takes care of the rest.
One solution is to push fast from AGI to ASI.
With an aligned ASI, other concerns are largely (understandable) failures of the imagination. The possibilities are nearly limitless. You can find something to love.
This is under a benevolent sovereign. The intuitively appealing balances of power seem really tough to stabilize long term or even short term during takeoff.
I think similar sentiments are largely a failure of the imagination. The possibilities demand a whole lot of imagination.
The only thing you can't have post singularity is truly suffering people to help. And if you must have that and refuse to tweak your reward system so you don't, you can enter a simulation where it seems exactly like you have that.
If you want a mundane existence you can simulate that until you're bored, then join the crowd doing things that are really new and exciting.
You don't stop being you from any little tweak to your reward system or memory. And they're all reversible.
Intuitions fail here for a good reason.
The possibilities are limitless.
We didn't get much of a pitch for the projects and challenges people in the Culture do.
But yeah, it did seem boring, at least in comparison to the challenge and purpose of Contact and SC.
I think that's a failure of alignment in-world, and a necessity of writing for a broad audience from the outside.
I agree. And I think the same point applies to alignment work on LLM AGI. Even though RL is used for alignment and we expect more of it, there's not what I'd call a field of reward function design. Most alignment work on LLMs is probing how the few RL alignment attempts work, rather than using different RL functions and seeing what they do. And there doesn't even seem to be much theorizing about how alternate reward functions might change the alignment of current LLMs or of future, more capable ones.
I think this analogy is pretty strong, and many of the questions are the same, even though the sources of RL signals are pretty different. The reward function for RL on LLMs seems to be more complex. It uses specs or Anthropic's constitution, and now perhaps the much richer Claude 4.5 Opus' Soul Document, all as interpreted by another LLM to produce an RL signal. But more RL-agent and brainlike RL functions are pretty complex too, since they're nontrivial even as hardwired, and are then expressed through a complex environment and a critic/value function that learns a lot. I think there's a lot of similarity in the questions involved.
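For concreteness, here's roughly the shape of that LLM-side reward signal as I understand it, as a hedged sketch: a judge model rates a response against a list of principles, and the ratings become a scalar reward. The `judge_llm` stand-in, the principles, and the scoring scheme are all hypothetical, not any lab's actual pipeline.

```python
CONSTITUTION = [
    "Follow the user's instructions unless they request harm.",
    "Be honest; don't fabricate facts.",
    "Refuse clearly and briefly when refusal is required.",
]

def judge_llm(prompt: str) -> str:
    """Stand-in for a call to a judge model; returns a rating like '4'."""
    return "4"  # placeholder so the sketch runs

def reward(user_prompt: str, response: str) -> float:
    """Average per-principle rating (1-5), rescaled to [0, 1] as an RL reward."""
    scores = []
    for principle in CONSTITUTION:
        rating = judge_llm(
            f"Principle: {principle}\nUser: {user_prompt}\nResponse: {response}\n"
            "Rate compliance from 1 (violates) to 5 (exemplary). Answer with one digit."
        )
        scores.append(float(rating))
    return (sum(scores) / len(scores) - 1.0) / 4.0  # map 1..5 onto 0..1

print(reward("Summarize this article.", "Here's a short summary..."))
```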
So I think your RL training signal starter pack is pretty relevant to LLM AGI alignment theory, too. It's nice to have those all in one place and some connections drawn out. I hope to comment over there after thinking it through a little more.
And this seems pretty important for LLMs even though they have lots of pretraining which changes the effect of RL dramatically. RL (and cheap knockoff imitations like DPO) is playing an increasingly large role in training recent LLMs. A lot of folks expect it to be critical for further progress on agentic capabilities. I expect something slightly different, self-directed continuous learning, but that would still have a lot of similarities even if it's not implemented literally as RL.
And RL has arguably always played a large role in LLM alignment. I know you attributed most of LLMs' alignment to their supervised training magically transmuting observations into behavior. But I think pretraining transmutes observations into potential behavior, and RL posttraining selects which behavior you get, doing the bulk of the alignment work. RL is sort of selecting goals from learned knowledge as Evan Hubinger pointed out on that post.
But more accurately, it's selecting behavior, and any goals or values are only sort of weakly implicit in that behavior. That's an important distinction. There's a lot of that in humans, too, although goals and values are also pursued through more explicit predictions and value function/critic reward estimates.
I'm not sure if it matters for these purposes, but I think the brain is also doing a lot of supervised, predictive learning, and the RL operates on top of that. But the RL also drives behavior and attention, which directs the predictive learning, so it's a different interaction than the LLMs' pretraining-then-RL-to-select-behaviors setup.
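Here's a toy way to see that structural difference, with everything a stand-in (a random "environment" and counting "learners"): in the first pattern, predictive learning finishes on a fixed corpus before RL selects behavior; in the second, the RL-chosen actions continuously determine what the predictive learner gets to observe.

```python
import random

class Predictor:
    """Stand-in for a supervised/predictive learner; it just counts examples."""
    def __init__(self): self.examples_seen = 0
    def update(self, example): self.examples_seen += 1
    def represent(self, obs): return obs

class Policy:
    """Stand-in for an RL policy; random actions, running reward total."""
    def __init__(self): self.total_reward = 0.0
    def act(self, rep): return random.choice(["look_left", "look_right"])
    def update(self, reward): self.total_reward += reward

def llm_style(corpus, steps=100):
    predictor, policy = Predictor(), Policy()
    for example in corpus:        # stage 1: predictive learning on a fixed corpus
        predictor.update(example)
    for _ in range(steps):        # stage 2: RL selects behavior afterward
        obs = random.random()
        action = policy.act(predictor.represent(obs))
        policy.update(reward=1.0 if action == "look_left" else 0.0)
    return predictor, policy

def brainlike_style(steps=100):
    predictor, policy = Predictor(), Policy()
    obs = random.random()
    for _ in range(steps):        # predictive learning and RL interleaved
        action = policy.act(predictor.represent(obs))
        next_obs = random.random() * (0.5 if action == "look_left" else 1.0)
        predictor.update((obs, action, next_obs))  # training data depends on the action taken
        policy.update(reward=next_obs)
        obs = next_obs
    return predictor, policy

llm_style(corpus=["doc1", "doc2", "doc3"])
brainlike_style()
```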
In all, I think LLM descendants will have several relevant similarities to brainlike systems. Which is mostly a bad thing, since the complexities of online RL learning get even more involved in their alignment.
Here's a separate comment on the role this could/should play in the ongoing discussion:
I think the next step in this type of argument is trying to walk someone through the exercise you suggest, noting things that could go wrong and doing a rough OOM estimate of what odds you're coming up with. That's what I was trying to do in LLM AGI may reason.... I agree with you that people have to use roughly their own predicted mechanisms and path to AGI for that exercise, or it won't feel relevant to their thinking. So I was using mine, at a general enough level (progress in LLMs toward agency and competence) that a lot of people share as a likely path.
Extending the work in that direction seems important, because the arguments as stated here are still pretty abstract. There's a whole range of very specific to very abstract to cover, and I think working on that as a communication project is highly valuable. Describing roughly what differences we expect between training and deployment of takeover-capable systems (TCAI?) seems worthwhile. I did some of that in the above-linked post and elsewhere, but there's a lot more work to do.
I think this communication project is a vital part of "modeling the ocean" in your metaphor. That's something that independent researchers can help contribute to the alignment efforts at developers. Seeing likely problems farther in advance has multiple potential good effects.
I think this is really good and important. Big upvote.
I largely agree: for these reasons, the default plan is very bad, and far too likely to fail.
The AGI is on your side, until it isn't. There's not much basin. I note that the optimistic quote you lead with explicitly includes "you need to solve alignment".
Even though I've argued that Instruction-following is easier than value alignment, including some optimism about roughly the basin of alignment idea, I now agree that there really isn't much of a basin. I think there may be some real help from roughly human-level AGI that still thinks it's aligned, and/or is functionally aligned in those use cases (it hasn't yet hit much of "the ocean" in your metaphor). That could be really useful. But as soon as it realizes it's misaligned (see my reasoning post below) or hits severely OOD contexts, it will be just as against you as it was for you shortly before. There's no real basin keeping it in, just some help in guessing how it or its next generation might become misaligned.
I really like the shipbuilding metaphor. I think we're desperately in need of more precise and specific discussion on this topic, and more specific, engineering-related metaphors seem like a good way forward.
In that metaphor, I'd like to see more work on modeling conditions out at sea. That's how I view my work: trying to envision the most likely path from here to AGI, which I see as going through LLMs enhanced in very roughly brainlike directions.
I used that framing and those mechanisms for what's approximately my version of this argument: LLM AGI may reason about its goals and discover misalignments by default. That also resulted from doing (what I think is) the exercise you suggest.
After doing that exercise in the course of writing that mega-post, my specific estimates are a bit different from yours, but they're qualitatively similar.
Talking to you helped shift me in the pessimistic direction, although I did reach out asking to talk because I was on a project of really staring into the abyss of deep alignment worries.
I now think the current path is more likely than not to get us all killed. I don't enjoy thinking that, and I've done a lot of work trying to correct for my biases.
But I think the full theory and the full story is unwritten. I think there's a ton of model uncertainty still. Playing to our outs involves working toward alignment on the current path AND trying to slow or stop progress.
Based on that uncertainty, I think it's quite possible that relatively minor and realistic changes to alignment techniques might be enough to make the difference. So that's what I'm working on; more soon.
For a specific guess, I'd say it's tasks that are fairly simple by human standards, but idiosyncratic in their details to that human or that business. They're not going to do great context engineering, but they will put in a little time telling the agent what it's doing wrong and how to do it better, like they'd train an assistant. The specificity is the big edge for limited continual learning over context engineering.
Before long, I expect even limited continual learning to outperform context engineering in pretty much every area, because it's the model doing the work, not humans doing meticulous engineering for each task.
But we don't yet have even limited continual learning in deployment. I remain a little confused why; I know working versions are in development, but there are hangups. Those include interference, but I wonder what else is preventing "just have the model think a bunch about what it's learned about this task, produce some example context-response-pairs, and finetune on those" from working.
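For concreteness, the recipe in quotes looks roughly like the sketch below. The `llm` and `finetune` calls are stand-ins rather than any real API, and the hard parts (interference, deciding when to update, validating the synthetic pairs) are exactly what it glosses over.

```python
import json

def llm(prompt: str) -> str:
    """Stand-in for a call to the deployed model."""
    return '[{"context": "draft the weekly status email", "response": "bullet-point draft"}]'

def finetune(examples: list[dict]) -> None:
    """Stand-in for a (LoRA-style or full) fine-tuning call on the same model."""
    print(f"fine-tuning on {len(examples)} self-generated examples")

def continual_update(task: str, feedback: list[str]) -> None:
    # 1. Have the model reflect on what it has learned about this task.
    reflection = llm(
        f"Task: {task}\nFeedback received:\n" + "\n".join(feedback) +
        "\nSummarize what went wrong and how to do this task correctly next time."
    )
    # 2. Distill the reflection into synthetic context -> response pairs.
    pairs_json = llm(
        f"Based on this reflection:\n{reflection}\nProduce 5 training examples as a "
        'JSON list of {"context": ..., "response": ...} objects.'
    )
    # 3. Fine-tune on them (this is where interference and other hangups would bite).
    finetune(json.loads(pairs_json))

continual_update(
    "Draft weekly status emails in the client's preferred format",
    ["Too long; client wants bullet points", "Missed the budget section"],
)
```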
I outlined my take in
Possibly, sometimes. But greatly surpassing human intelligence isn't really part of the risk model. Even humans have pretty much succeeded at taking over the world. It's only got to be as functionally smart, in relevant ways, as a human. A bit more would be a pretty big edge.
The remaining question is whether LLM-based systems will even achieve human-level intelligence. Steve thinks that probably won't happen; see for instance his Foom & Doom. I think it probably will, and that might happen very soon.
The issue is that nobody is sure how things are going to go. Taking a guess and going with it really isn't a smart way to deal with a situation that could be deadly dangerous. I'm sure you're seeing pessimists do that; optimists do too. Our overall response should be a careful weighing of pessimist and optimist positions.
I've been trying to do that, and I've reached a disturbing conclusion: nobody has much clue. This inclines me toward caution, because the deeper arguments in both directions are quite strong.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd have a 50% chance that AGI arrives sooner and we'd die from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development even if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.