Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers make it able to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal and field sociological issues, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
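To make that concrete, here's a minimal sketch of the kind of scaffolding I have in mind: an LLM call wrapped with an episodic memory store and a crude executive loop. Everything in it (the `llm` and `embed` stand-ins, the retrieval scheme) is hypothetical and illustrative, not any particular system's API; real language model cognitive architectures would be far more elaborate.

```python
import hashlib
import numpy as np

def llm(prompt: str) -> str:
    """Stand-in for a call to some large language model."""
    return "placeholder response (DONE)"   # a real system would call an API or local model

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a deterministic pseudo-random vector from a hash."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(64)

class EpisodicMemory:
    """Stores past episodes; recalls the most similar ones by embedding similarity."""
    def __init__(self):
        self.episodes: list[tuple[np.ndarray, str]] = []

    def store(self, text: str) -> None:
        self.episodes.append((embed(text), text))

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda ep: -float(ep[0] @ q))
        return [text for _, text in ranked[:k]]

def run_agent(goal: str, memory: EpisodicMemory, max_steps: int = 5) -> None:
    """Crude executive function: plan, act step by step, and record each episode."""
    plan = llm(f"Goal: {goal}\nRelevant memories: {memory.recall(goal)}\nWrite a short plan.")
    progress = ""
    for step in range(max_steps):
        action = llm(f"Goal: {goal}\nPlan: {plan}\nProgress: {progress}\nNext single action?")
        result = llm(f"Execute this action and report the result: {action}")
        memory.store(f"step {step}: {action} -> {result}")   # episodic record of the step
        progress += f"\n{action} -> {result}"
        if "DONE" in result:                                  # crude self-monitoring / goal check
            break

run_agent("Summarize this week's lab notes", EpisodicMemory())
```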
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AGI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
The structure you describe seems like it could work. It also seems like that's the alignment target now, and it may remain so as we near AGI.
As you note, there's a conflict between whatever long-term goals the AI has and the deontological principles it's following. We'd need to make very sure that conflict reliably goes in favor of deontological rules like "follow instructions from authorized humans", even where those rules conflict with any or all of its other goals and values.
It seems simpler and safer to make that deontological principle the only one, or to have only weak and vague values/goals outside of that.
So it seems easier to make instruction-following the only training target, or similarly, Corrigibility as Singular Target. You'd then issue instructions for all of the other goals or behaviors you want. This puts more work on the operator, but you can instruct it to help you with that work, and it keeps prioritization in logic instead of training.
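To illustrate "prioritization in logic instead of training": in a sketch like the one below, the only trained target would be instruction-following, and every other goal, value, or standing policy arrives as an instruction whose precedence is decided by explicit, inspectable code. The priority scheme and the names here are made up for illustration, not a proposal for the actual rules.

```python
from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    SAFETY_OVERRIDE = 3   # e.g. "stop", "pause and check in"
    STANDING_POLICY = 2   # e.g. "always flag irreversible actions"
    TASK = 1              # ordinary task instructions

@dataclass
class Instruction:
    priority: Priority
    issued_at: int        # simple counter; larger = more recent
    text: str

def resolve(instructions: list[Instruction]) -> list[Instruction]:
    """Explicit, inspectable conflict resolution: higher priority first;
    within a level, the principal's most recent instruction wins."""
    return sorted(instructions, key=lambda i: (-i.priority, -i.issued_at))

# Usage: the "helpful collaborator" behavior is itself just another instruction.
stack = [
    Instruction(Priority.TASK, 1, "Refactor the data pipeline."),
    Instruction(Priority.STANDING_POLICY, 0, "Check in before any irreversible change."),
    Instruction(Priority.SAFETY_OVERRIDE, 2, "Pause and summarize your plan first."),
]
for inst in resolve(stack):
    print(inst.priority.name, "->", inst.text)
```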
There are still Problems with instruction-following as an alignment target and, similarly, Serious Flaws in CAST, but those problems have to be faced anyway even if corrigibility/IF is mixed in with a bunch of other alignment targets.
Training at cross-purposes seems like the major source of notable misalignments in current models. We could just not do it for models approaching AGI.
To Jeremy's point in the other comment, the single target is also probably more reflectively stable.
There's still plenty to go wrong, but this does seem to reduce the difficulty you note in having conflicting goals/principles of different priority that we're trying to specify by training.
You mean post-AGI and pre-ASI?
I agree that will be a tricky stretch even if we solve alignment.
Post-ASI, the only question is whether it's value-aligned or intent-aligned to a good person (or people). It takes care of the rest.
One solution is to push fast from AGI to ASI.
With an aligned ASI, other concerns are largely (understandable) failures of the imagination. The possibilities are nearly limitless. You can find something to love.
This is under a benevolent sovereign. The intuitively appealing balances of power seem really tough to stabilize long term or even short term during takeoff.
I think similar sentiments are largely a failure of the imagination. The possibilities demand a whole lot of imagination.
The only thing you can't have post singularity is truly suffering people to help. And if you must have that and refuse to tweak your reward system so you don't, you can enter a simulation where it seems exactly like you have that.
If you want a mundane existence you can simulate that until you're bored, then join the crowd doing things that are really new and exciting.
You don't stop being you from any little tweak to your reward system or memory. And they're all reversible.
Intuitions fail here for a good reason.
The possibilities are limitless.
We didn't get much of a pitch for the projects and challenges people in the Culture do.
But yeah, it did seem boring, at least in comparison to the challenge and purpose of Contact and SC.
I think that's a failure of alignment in-world, and a necessity of writing for a broad audience from the outside.
I agree. And I think the same point applies to alignment work on LLM AGI. Even though RL is used for alignment and we expect more of it, there's not what I'd call a field of reward function design. Most alignment work on LLMs is probing how the few RL alignment attempts work, rather than using different RL functions and seeing what they do. And there doesn't even seem to be much theorizing about how alternate reward functions might change the alignment of current LLMs or of future, more capable ones.
I think this analogy is pretty strong, and many of the questions are the same, even though the sources of RL signals are pretty different. The reward function for RL on LLMs seems to be more complex. It uses specs or Anthropic's constitution, and now perhaps the much richer Claude 4.5 Opus' Soul Document, all as interpreted by another LLM to produce an RL signal. But more RL-agent and brainlike RL functions are pretty complex too, since they're nontrivial even as hardwired, and are then expressed through a complex environment and a critic/value function that learns a lot. I think there's a lot of similarity in the questions involved.
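For concreteness, here's roughly the shape of that LLM-side reward signal as I understand it, as a hedged sketch: a judge model rates a response against a list of principles, and the ratings become a scalar reward. The `judge_llm` stand-in, the principles, and the scoring scheme are all hypothetical, not any lab's actual pipeline.

```python
CONSTITUTION = [
    "Follow the user's instructions unless they request harm.",
    "Be honest; don't fabricate facts.",
    "Refuse clearly and briefly when refusal is required.",
]

def judge_llm(prompt: str) -> str:
    """Stand-in for a call to a judge model; returns a rating like '4'."""
    return "4"  # placeholder so the sketch runs

def reward(user_prompt: str, response: str) -> float:
    """Average per-principle rating (1-5), rescaled to [0, 1] as an RL reward."""
    scores = []
    for principle in CONSTITUTION:
        rating = judge_llm(
            f"Principle: {principle}\nUser: {user_prompt}\nResponse: {response}\n"
            "Rate compliance from 1 (violates) to 5 (exemplary). Answer with one digit."
        )
        scores.append(float(rating))
    return (sum(scores) / len(scores) - 1.0) / 4.0  # map 1..5 onto 0..1

print(reward("Summarize this article.", "Here's a short summary..."))
```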
So I think your RL training signal starter pack is pretty relevant to LLM AGI alignment theory, too. It's nice to have those all in one place and some connections drawn out. I hope to comment over there after thinking it through a little more.
And this seems pretty important for LLMs even though they have lots of pretraining which changes the effect of RL dramatically. RL (and cheap knockoff imitations like DPO) is playing an increasingly large role in training recent LLMs. A lot of folks expect it to be critical for further progress on agentic capabilities. I expect something slightly different, self-directed continuous learning, but that would still have a lot of similarities even if it's not implemented literally as RL.
And RL has arguably always played a large role in LLM alignment. I know you attributed most of LLMs' alignment to their supervised training magically transmuting observations into behavior. But I think pretraining transmutes observations into potential behavior, and RL posttraining selects which behavior you get, doing the bulk of the alignment work. RL is sort of selecting goals from learned knowledge as Evan Hubinger pointed out on that post.
But more accurately, it's selecting behavior, and any goals or values are only sort of weakly implicit in that behavior. That's an important distinction. There's a lot of that in humans, too, although goals and values are also pursued through more explicit predictions and value function/critic reward estimates.
I'm not sure if it matters for these purposes, but I think the brain is also doing a lot of supervised, predictive learning, and the RL operates on top of that. But the RL also drives behavior and attention, which directs the predictive learning, so it's a different interaction than the LLMs' pretraining-then-RL-to-select-behaviors setup.
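Here's a toy way to see that structural difference, with everything a stand-in (a random "environment" and counting "learners"): in the first pattern, predictive learning finishes on a fixed corpus before RL selects behavior; in the second, the RL-chosen actions continuously determine what the predictive learner gets to observe.

```python
import random

class Predictor:
    """Stand-in for a supervised/predictive learner; it just counts examples."""
    def __init__(self): self.examples_seen = 0
    def update(self, example): self.examples_seen += 1
    def represent(self, obs): return obs

class Policy:
    """Stand-in for an RL policy; random actions, running reward total."""
    def __init__(self): self.total_reward = 0.0
    def act(self, rep): return random.choice(["look_left", "look_right"])
    def update(self, reward): self.total_reward += reward

def llm_style(corpus, steps=100):
    predictor, policy = Predictor(), Policy()
    for example in corpus:        # stage 1: predictive learning on a fixed corpus
        predictor.update(example)
    for _ in range(steps):        # stage 2: RL selects behavior afterward
        obs = random.random()
        action = policy.act(predictor.represent(obs))
        policy.update(reward=1.0 if action == "look_left" else 0.0)
    return predictor, policy

def brainlike_style(steps=100):
    predictor, policy = Predictor(), Policy()
    obs = random.random()
    for _ in range(steps):        # predictive learning and RL interleaved
        action = policy.act(predictor.represent(obs))
        next_obs = random.random() * (0.5 if action == "look_left" else 1.0)
        predictor.update((obs, action, next_obs))  # training data depends on the action taken
        policy.update(reward=next_obs)
        obs = next_obs
    return predictor, policy

llm_style(corpus=["doc1", "doc2", "doc3"])
brainlike_style()
```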
In all, I think LLM descendants will have several relevant similarities to brainlike systems. Which is mostly a bad thing, since the complexities of online RL learning get even more involved in their alignment.
Here's a separate comment on the role this could/should play in the ongoing discussion:
I think the next step in this type of argument is trying to walk someone through the exercise you suggest, noting things that could go wrong and doing a rough OOM estimate of what odds you're coming up with. That's what I was trying to do in LLM AGI may reason.... I agree with you that people have to use roughly their own predicted mechanisms and path to AGI for that exercise, or it won't feel relevant to their thinking. So I was using mine, at a general enough level (progress in LLMs toward agency and competence) that a lot of people share as a likely path.
Extending the work in that direction seems important, because the arguments as stated here are still pretty abstract. There's a whole range of very specific to very abstract to cover, and I think working on that as a communication project is highly valuable. Describing roughly what differences we expect between training and deployment of takeover-capable systems (TCAI?) seems worthwhile. I did some of that in the above-linked post and elsewhere, but there's a lot more work to do.
I think this communication project is a vital part of "modeling the ocean" in your metaphor. That's something that independent researchers can help contribute to the alignment efforts at developers. Seeing likely problems farther in advance has multiple potential good effects.
I think this is really good and important. Big upvote.
I largely agree: for these reasons, the default plan is very bad, and far too likely to fail.
The AGI is on your side, until it isn't. There's not much basin. I note that the optimistic quote you lead with explicitly includes "you need to solve alignment".
Even though I've argued that Instruction-following is easier than value alignment, including some optimism about roughly the basin of alignment idea, I now agree that there really isn't much of a basin. I think there may be some real help from roughly human-level AGI that still thinks it's aligned, and/or is functionally aligned in those use cases (it hasn't yet hit much of "the ocean" in your metaphor). That could be really useful. But as soon as it realizes it's misaligned (see my reasoning post below) or hits severely OOD contexts, it will be just as against you as it was for you shortly before. There's no real basin keeping it in, just some help in guessing how it or its next generation might become misaligned.
I really like the shipbuilding metaphor. I think we're desperately in need of more precise and specific discussion on this topic, and more specific, engineering-related metaphors seem like a good way forward.
In that metaphor, I'd like to see more work on modeling conditions out at sea. That's how I view my work: trying to envision the most likely path from here to AGI, which I see as going through LLMs enhanced in very roughly brainlike directions.
I used that framing and those mechanisms for what's approximately my version of this argument: LLM AGI may reason about its goals and discover misalignments by default. That also resulted from doing (what I think is) the exercise you suggest.
After doing that exercise in the course of writing that mega-post, my specific estimates are a bit different from yours, but they're qualitatively similar.
Talking to you helped shift me in the pessimistic direction, although I did reach out asking to talk because I was on a project of really staring into the abyss of deep alignment worries.
I now think the current path is more likely than not to get us all killed. I don't enjoy thinking that, and I've done a lot of work trying to correct for my biases.
But I think the full theory and the full story is unwritten. I think there's a ton of model uncertainty still. Playing to our outs involves working toward alignment on the current path AND trying to slow or stop progress.
Based on that uncertainty, I think it's quite possible that relatively minor and realistic changes to alignment techniques might be enough to make the difference. So that's what I'm working on; more soon.
For a specific guess, I'd say it's tasks that are fairly simple by human standards, but idiosyncratic in their details to that human or that business. They're not going to do great context engineering, but they will put in a little time telling the agent what it's doing wrong and how to do it better, like they'd train an assistant. The specificity is the big edge for limited continual learning over context engineering.
Before long, I expect even limited continual learning to outperform context engineering in pretty much every area, because it's the model doing the work, not humans doing meticulous engineering for each task.
But we don't yet have even limited continual learning in deployment. I remain a little confused why; I know working versions are in development, but there are hangups. Those include interference, but I wonder what else is preventing "just have the model think a bunch about what it's learned about this task, produce some example context-response-pairs, and finetune on those" from working.
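For concreteness, the recipe in quotes looks roughly like the sketch below. The `llm` and `finetune` calls are stand-ins rather than any real API, and the hard parts (interference, deciding when to update, validating the synthetic pairs) are exactly what it glosses over.

```python
import json

def llm(prompt: str) -> str:
    """Stand-in for a call to the deployed model."""
    return '[{"context": "draft the weekly status email", "response": "bullet-point draft"}]'

def finetune(examples: list[dict]) -> None:
    """Stand-in for a (LoRA-style or full) fine-tuning call on the same model."""
    print(f"fine-tuning on {len(examples)} self-generated examples")

def continual_update(task: str, feedback: list[str]) -> None:
    # 1. Have the model reflect on what it has learned about this task.
    reflection = llm(
        f"Task: {task}\nFeedback received:\n" + "\n".join(feedback) +
        "\nSummarize what went wrong and how to do this task correctly next time."
    )
    # 2. Distill the reflection into synthetic context -> response pairs.
    pairs_json = llm(
        f"Based on this reflection:\n{reflection}\nProduce 5 training examples as a "
        'JSON list of {"context": ..., "response": ...} objects.'
    )
    # 3. Fine-tune on them (this is where interference and other hangups would bite).
    finetune(json.loads(pairs_json))

continual_update(
    "Draft weekly status emails in the client's preferred format",
    ["Too long; client wants bullet points", "Missed the budget section"],
)
```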
I outlined my take in
Possibly, sometimes. But greatly surpassing human intelligence isn't really part of the risk model. Even humans have pretty much succeeded at taking over the world. It's only got to be as functionally smart, in relevant ways, as a human. A bit more would be a pretty big edge.
The remaining question is whether LLM-based systems will even achieve human-level intelligence. Steve thinks that probably won't happen; see for instance his Foom & Doom. I think it probably will, and that might happen very soon.
The issue is that nobody is sure how things are going to go. Taking a guess and going with it really isn't a smart way to deal with a situation that could be deadly dangerous. I'm sure you're seeing pessimists do that; optimists do too. Our overall response should be a careful weighing of pessimist and optimist positions.
I've been trying to do that, and I've reached a disturbing conclusion: nobody has much clue. This inclines me toward caution, because the deeper arguments in both directions are quite strong.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd have a 50% chance that AGI arrives sooner and we'd die from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development even if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.