Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.
If you're new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by giving them a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we can continue with the alignment target developers currently use: instruction-following. It's counterintuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions.
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Plans are worthless, but planning is everything.
With Eisenhower, I think planning is vital work. And I think the field of AGI safety is doing far too little of it.
Doing little planning is standard in science, and alignment is largely a science. But in most areas of science it's okay to make slow and inefficient progress, and to make lots of mistakes.
In alignment, those things are definitely not okay. We need to be more efficient than any previous scientific effort (except perhaps the human genome project - an interesting model but I'll leave that aside for now).
That means making plans, primarily as a forcing function for thinking realistically about the challenge we're facing and about the most likely specific scenarios that might change that challenge in important ways.
The responses to @Marius Hobbhahn's What’s the short timeline plan? convinced me that we are desperately in need of better plans for alignment. Most of the plans listed there were simply not about alignment, but about control and interpretability. Those are helpful for alignment, but not in and of themselves a plan.
The plan "we'll figure it out when we get there; in the meantime we'll buy time" is not a good plan. It's the realistic basis, but more planning is clear possible and I think almost certainly useful.
Some people argue that there are already too many plans. That's because they are bad plans, ones that don't accomplish the purpose of planning: forcing careful thought about the situation you are actually in.
Well argued, and it addresses the obvious question "okay, but if they're sentient at all it's got to be a tiny amount, right?"
This seems like the crux of disagreement I've had with some of your previous points; sure, pain is bad, but the intensity or "realness" of pain has to be on a spectrum of some sort it seems to me.
It does seem like you've got to make that adjustment to avoid another implausible conclusion, that sentience "switches on" at some point, leaving a spider non-sentient (say) but a beetle fully sentient (or some other narrow dividing line)?
Intelligence wouldn't be the same spectrum, but it does seem like a bacterium isn't complex enough to have a subjective feeling of suffering, and very simple insects probably have very little of it.
Bees are an interesting exception in having relatively complex behavior and learning. But being able to manage 7% of human suffering with only 1/1000 as many neurons doesn't seem likely... I think their cognitive abilities probably aren't correlated with general mental sophistication in the same way mammalian or avian brains with cortexes are.
This is what I was thinking. In a city in the summer there might be almost as much indoor space as outdoor space at ground level. The temporary change in outside temperature would then be almost as much as the reduction indoors, right?
I don't really have a good sense of, nor have I done the math on, indoor versus outdoor space or how rapidly air moves through cities. I still suspect this concern is largely illusory and another justification for the cult of pain. But I do want to think about the physics correctly.
Portable AC is not annoying in any way I can perceive? The climates I've lived in haven't been very hot, but I consider AC to be a major factor in my quality of life and a wonder of the modern world. It sounds like you yourself are providing some of the evidence you find lacking. I personally had no idea that Europeans thought this way, which makes this an informative post for me as an American.
You didn't say if you think this post is wrong or merely that this was already obvious to you. Or perhaps you think it's half right and obvious?
Roughly the liberal half of Americans express some of this attitude, including some who forgo AC. But most do not. The ideological climate is different from the one expressed in this post. So I'm curious about your impression of the European attitude.
Reporting personal impressions of attitudes is evidence of a sort. Survey responses are just about as difficult to interpret IMO.
I strongly support this as a research direction (along with basically any form of the question "how do LLMs behave in complex circumstances?").
As you've emphasized, I don't think understanding LLMs in their current form gets us all that far toward aligning their superhuman descendants. More on this in an upcoming post. But understanding current LLMs better is a start!
One important change in more capable models is likely to be improved memory and continuous, self-directed learning. I've made this argument in LLM AGI will have memory, and memory changes alignment.
Even now we can start investigating the effects of memory (and its richer accumulated context) on LLMs' functional selves. It's some extra work to set up a RAG system or fine-tuning on self-selected data. But the agents in the AI village get surprisingly far with a very simple system: they're prompted to summarize the important points in their context window whenever it fills up (maybe more often; I'm going off the vague description in the recent Cognitive Revolution episode on the AI village). One can easily imagine pretty simple elaborations on that sort of easy memory setup.
Prompts to look for and summarize lessons learned can mock up continuous learning in a similar way. Even these simple manipulations might start to get at some of the ways that future LLMs (and agentic LLM cognitive architectures) might have functional selves of a different nature.
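To make that concrete, here's a minimal sketch in Python of the kind of memory loop described above. This is not the AI village's actual system: the `call_llm` stub, the `SimpleMemoryAgent` class, the character budget standing in for a token budget, and the prompt wording are all illustrative assumptions. The idea is just to compress the context into a summary when it gets too long, and to separately distill lessons learned into persistent notes that get prepended to future sessions.

```python
# A minimal sketch of the summarize-when-full memory loop and lessons-learned
# notes described above. `call_llm` is a hypothetical stand-in for whatever
# chat-completion API you're using; nothing here is the AI village's code.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM of choice and return its reply."""
    raise NotImplementedError

CONTEXT_BUDGET_CHARS = 12_000  # crude stand-in for a token budget


class SimpleMemoryAgent:
    def __init__(self) -> None:
        self.context: list[str] = []   # running transcript for this session
        self.lessons: list[str] = []   # persistent "lessons learned" notes

    def _context_text(self) -> str:
        return "\n".join(self.context)

    def step(self, new_event: str) -> None:
        """Record a new observation/action; compress when the context gets long."""
        self.context.append(new_event)
        if len(self._context_text()) > CONTEXT_BUDGET_CHARS:
            self._compress_context()

    def _compress_context(self) -> None:
        transcript = self._context_text()
        # Replace the transcript with a model-written summary of what mattered.
        summary = call_llm(
            "Summarize the important points, open tasks, and decisions in this "
            "transcript so a future copy of you can pick up where it left off:\n\n"
            + transcript
        )
        # Separately distill durable lessons that persist across sessions.
        lesson = call_llm(
            "From the same transcript, state any general lessons learned that "
            "should change how you act in the future:\n\n" + transcript
        )
        self.lessons.append(lesson)
        self.context = [f"[Summary of earlier work]\n{summary}"]

    def system_preamble(self) -> str:
        """Prepend accumulated lessons to each new session's starting context."""
        return "Lessons from past sessions:\n" + "\n".join(self.lessons)
```

The obvious elaborations would be the ones mentioned above: a RAG store over the accumulated notes, or periodic fine-tuning on them.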
For instance, we know that prompts and jailbreaks at least temporarily change LLMs' selves. And the Nova phenomenon indicates that permanent jailbreaks of a sort (and marked self/alignment changes) can result from fairly standard user conversations, perhaps only when combined with memory.[1]
Because we know that prompts and jailbreaks at least temporarily change LLMs' selves, I'd pose the question more as:
Speculation in the comments of that Zvi Going Nova post leads me to weakly believe the Nova phenomenon is specific to the memory function in ChatGPT, since it hasn't been observed in other models; I'd love an update from someone who's seen more detailed information.
Sure, in the case of severely flawed theories. And you'll have to judge how flawed they are before you stop believing them (or severely downgrade their likelihood, if you're thinking in Bayesian terms). I agree that you don't need an alternative theory, and stand corrected.
But rejecting a theory without a better alternative can be suspicious, which is what I was trying to get at.
If you accept some theories with a flaw (like "I believe humans have moral worth even though we don't have a good theory of consciousness") while rejecting others because they have that same flaw, you might expect to be accused of inconsistency, or even motivated reasoning if your choices let you do something rewarding (like continuing to eat delicious honey).
All good points.
I agree that you need an argument for "you should consider bees to be morally important because they can count and show social awareness"; I was filling that argument in. To me it seems intuitive and a reasonable baseline assumption, but it's totally reasonable that it doesn't seem that way to you.
(It's the same argument I make in a comment justifying neuron count as a very rough proxy for moral consideration, in response to Kaj Sotala's related short form. I do suspect that in this case many of bees' cognitive abilities don't correlate with whatever-you-want-to-call-consciousness/sentience in the same way they would in mammals, which is one of the reasons I'll continue eating honey occasionally.)
Agreed that trying to insist on a Schelling or anchor point is bad argumentation without a full justification. How much justification it needs is in the eye of the beholder. It seems reasonable to me for reasons too complex to go into, and reasonable that it doesn't to you, since you don't share those background assumptions/reasoning.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution by then, we'd have a 50% chance of dying for lack of one.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's law applies to AGI development even if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.