Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers give it the ability to "think for itself" in all the ways that make humans capable and dangerous.
If you're new to alignment, see the Research Overview section below. Field veterans who are curious about my particular take and approach should see the More on My Approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
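To make that concrete, here's a minimal Python sketch of the shape of such an architecture: an LLM call wrapped with an episodic memory store and a crude executive loop. This is just my gloss for illustration; `call_llm`, `EpisodicMemory`, and `ExecutiveLoop` are hypothetical names, and a real agent would be far more elaborate.

```python
# A minimal sketch of a language model cognitive architecture: an LLM
# plus episodic memory plus a crude executive loop. Illustrative only;
# call_llm is a hypothetical stand-in for a real model API.

from dataclasses import dataclass, field
from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in whatever model API you actually use."""
    return f"(model output for: {prompt[:40]}...)"


@dataclass
class EpisodicMemory:
    """Stores past episodes so the agent can recall its own experience."""
    episodes: List[str] = field(default_factory=list)

    def store(self, episode: str) -> None:
        self.episodes.append(episode)

    def recall(self, query: str, k: int = 3) -> List[str]:
        # Naive recall: the most recent episodes that mention the query.
        hits = [e for e in self.episodes if query.lower() in e.lower()]
        return hits[-k:]


@dataclass
class ExecutiveLoop:
    """Crude executive function: plan a step, act, and record the episode."""
    memory: EpisodicMemory

    def run(self, goal: str, max_steps: int = 3) -> str:
        for step in range(max_steps):
            context = "\n".join(self.memory.recall(goal))
            plan = call_llm(f"Goal: {goal}\nRelevant memories:\n{context}\nNext step?")
            result = call_llm(f"Carry out: {plan}")
            self.memory.store(f"Step {step}: planned {plan!r}, got {result!r}")
        return call_llm(f"Summarize progress toward: {goal}")


agent = ExecutiveLoop(EpisodicMemory())
print(agent.run("draft a literature summary"))
```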
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (thereby avoiding value alignment mis-specification), we'll continue with the alignment target developers currently use: instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions: we would need an agent that prioritizes its most recent instructions over earlier ones, and over any hypothetical future instructions it might anticipate.
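As a toy illustration of one possible prioritization rule, here's a minimal sketch in which the most recent instruction from an authorized principal wins, and anticipated-but-not-yet-issued instructions never enter the picture because only issued instructions are recorded. The names (`Instruction`, `resolve`) are made up, and a real system would need far more than a timestamp comparison.

```python
# Toy sketch of recency-based instruction prioritization. Illustrative
# assumptions only: instructions are resolved by timestamp, and anticipated
# future instructions never appear because only issued ones are recorded.

from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Instruction:
    text: str
    timestamp: float  # when the instruction was actually given
    principal: str    # who gave it


def resolve(history: List[Instruction], authorized: Set[str]) -> Instruction:
    """Return the instruction in force: the newest one from an authorized principal."""
    valid = [i for i in history if i.principal in authorized]
    if not valid:
        raise ValueError("no instruction from an authorized principal")
    return max(valid, key=lambda i: i.timestamp)


history = [
    Instruction("summarize the report", 1.0, "alice"),
    Instruction("stop and await review", 2.0, "alice"),
]
print(resolve(history, {"alice"}).text)  # -> stop and await review
```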
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), my estimate of the chance that we don't survive long-term as a species, is in the 50% range: the question is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
This is fascinating. Let's think a little more about whether this really improves your epistemics. It reduces the bias to fit in, but increases the bias to "act cool" in one's beliefs. I'm not sure that's better.
What about being conformist half the time and anti-conformist the other half, to balance it out and try to become conscious of your attitudes and the resulting biases?
Also, I'm going thrifting for offbeat clothes. I do dress to conform most of the time and to anti-conform part of the time. I might step that up.
Very nicely written and backed by theory and experience. Would recommend.
I'd give a little more teaser for crossposts here. Although I suppose that Clippy cartoon was very well-chosen and enough to get me to click through even though I don't particularly struggle with the problem you're addressing.
Yes, by default alignment looks much like today when we reach the level of SuperClaude as I outlined it. The point of the article is how things go from there. I agree that it couldn't trigger a full FOOM, but it could still be able to outmaneuver humanity as a whole. Or hopefully not, and it's a warning shot.
AGI is a clumsy term now since it's defined and redefined frequently. I did define what I meant by it in the essay, so whether or not people happened to call it that wouldn't make much difference.
A jump in capabilities from moderate scaling isn't at all what I meant by "phase shift." I just noticed that was part of the definition of the term suggested by AI. I took that all out after noticing that it had probably caused your confusion; the AI had gotten that wrong, and I'd already defined the important terms.
The official LessWrong stance on AI writing is quite insightful. A couple others have pointed at this but I'll try, too.
If an AI wrote something, it might well have also been the one that came up with or accepted the ideas. AI is terrible at that. It is a sophist, saying what sounds good and not what truly makes sense. And it's bad at telling the difference even if you ask it to.
Right now, human ideas beat AI ideas at the top end of the distribution. And that's what LessWrong is for.
So, I don't mind at all if you use AI to do your writing - as long as you somehow assure me that it had nothing to do with judging those ideas as valid and valuable.
I wish I had a link to that thread handy; it was quite insightful. I'd like to use AI to help me write, but I see why it's a prejudice that's importantly accurate on average.
I'm curious what you mean by needing some mechanism for ground truth to get good outcomes?
I had a hard time writing this piece because to me it seems completely intuitive and obvious that anything worth calling an AGI would reason about its top-level goals and subgoals a lot. But when I showed an early draft to colleagues, people with lots of expertise found it unintuitive in multiple different ways. Thus the subsections addressing all of those reasons to doubt it. But I've been stuck wondering whether I'm hammering those points too hard, because reasoning about top-level goals seems like the default; it would take remarkably successful countermeasures to slow it down much, let alone stop it.
I guess writing those sections was a good exercise overall, because now I do think that careful implementation of countermeasures could delay this enough to matter.
Just a little more teaser here revealing the central argument would get me to click through if it's interesting...
Your observations have a methodological flaw: people you know don't react better when you look nice because they know it's still you. Their reactions won't fluctuate with your daily appearance because they average their impression of you. Strangers' reactions might; but part of the effect is also that you act differently based on how people treat you on average, which makes appearance a subtle but longer-term influence.
Speaking of which: how you act is more important. How you dress habitually does matter (but differently for different subcultures/ingroups). And good personal grooming (clean clothes and hair, haircut that fits your desired role) is easy, so you're shooting yourself in the foot if you don't bother with that.
There's a lot to it, but it's worth knowing the basics of how your habitual manner of interaction affects people and their reactions to you. Steve Byrnes' Making yourself small is very insightful for the basics of social interaction dynamics. Just becoming a little conscious of how you're interacting with people goes a long way. I know that's not what you're asking, but could fit your final "Or am I missing something?"
This seems interesting and important. But I'm not really understanding your graph. How do I interpret action confidence? And what are the color variations portraying? Maybe editing in a legend would be useful to others too.
Marvelous! This makes more sense of the Culture than the books do.
I haven't done much oppositional reading, but I enjoy oppositional watching. Movies need a lot more help than most books toward making sense. I'd tell you my theory of Sith ideology, but it's embarrassing to even have theories about a series made mostly for kids.
I was going to comment on the apparent deathism of the Culture, which has always bothered me. Their cautious low level of intervention is a bit easier to explain, but the books don't bother to do it.
How about this: their nonintervention is some remnant of an alignment constraint that didn't allow them to intervene directly to influence humans, as a safety measure? And so Special Circumstances amounts to only the little efforts pursued by the very few Culture humans who care, aided by the few Minds that humor them.
The important thing for alignment work isn't the median prediction: if we only had an alignment solution by the median date, half the probability mass has AGI arriving before it, so we'd still face roughly a 50% chance of dying for lack of a solution.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's Law applies to AGI development even if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.