Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.
If you're new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacked suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by making following instructions the agent's central goal. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll continue with the alignment target developers currently use: instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions; we would need an agent to prioritize more recent instructions over previous ones, including hypothetical future instructions.
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range; our long-term survival as a species is too complex to call more precisely. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.
LLM-bots are inherently easy to align. At least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT-4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.
I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow it sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a truly big bot disaster more seriously.
Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact so that improvements in each make the rest of the system work better and easier to improve.
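To give a concrete, if toy, picture of what I mean by that interaction, here's a minimal sketch of an outer executive script paired with a vector-search episodic memory. Everything here is a hypothetical placeholder of my own (`embed`, `llm`, and the class names stand in for whatever embedding model, LLM call, and vector store you'd actually use):

```python
# Minimal sketch of the executive-script + episodic-memory loop described above.
# `embed` and `llm` are hypothetical stand-ins, not real API calls.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

def llm(prompt: str) -> str:
    # Placeholder for a language model call.
    return f"[model response to: {prompt[:40]}...]"

class EpisodicMemory:
    """Vector search over saved text snippets (the 'episodic memory' role)."""
    def __init__(self):
        self.texts, self.vectors = [], []

    def store(self, text: str):
        self.texts.append(text)
        self.vectors.append(embed(text))

    def recall(self, query: str, k: int = 3):
        if not self.texts:
            return []
        q = embed(query)
        sims = [v @ q / (np.linalg.norm(v) * np.linalg.norm(q)) for v in self.vectors]
        top = np.argsort(sims)[-k:][::-1]
        return [self.texts[i] for i in top]

def executive_step(goal: str, memory: EpisodicMemory) -> str:
    """The 'executive function' role: assemble context, call the model, save the result."""
    relevant = memory.recall(goal)
    prompt = f"Goal: {goal}\nRelevant past notes: {relevant}\nNext step:"
    result = llm(prompt)
    memory.store(result)  # better memory -> better context -> better next steps
    return result

memory = EpisodicMemory()
for _ in range(3):
    print(executive_step("make money selling shoes without harming anyone", memory))
```

The point of the sketch is just that the two components compound: better retrieval gives the outer script better context, and a better outer script saves more useful memories.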
I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to come up with. It looked like the ethical/capitalist reasoning of a pretty intelligent person, and a fairly ethical one at that.
Despite my contention on the associated paper post that focusing on wisdom in this sense is ducking the hard part of the alignment problem, I'll stress here that it seems thoroughly useful if it's a supplement, not a substitute, for work on the hard parts of the problem - technical, theoretical, and societal.
I also think it's going to be easier to create wise advisors than you think, at least in the weak sense that they make their human users effectively wiser.
In short, I think simple prompting schemes and eventually agentic scaffolds can do a lot of the extra work it takes to turn knowledge into wisdom, and that there's an incentive for orgs to train for "wisdom" in the sense you mean as well. So we'll get wiser advisors as we go, at little or no extra effort. More effort would of course help more.
I believe Deep Research has already made me wiser. I can get a broader context for any given decision.
And that was primarily achieved by prompting; the o3 model that powers OpenAI's version does seem to help, but Perplexity introducing a nearly-as-good system just a week or two later indicates that just the right set of prompts was extremely valuable.
Current systems aren't up to helping very much with the hypercomplex problems surrounding alignment. But they can now help a little. And any improvements will be a push in the right direction.
Training specifically for "wisdom" as you define it is a push toward a different type of useful capability, so it may be that frontier labs pursue similar training by default.
(As an aside, I think your "comparisons" are all wildly impractical and highly unlikely to be executed before we hit AGI, even on longer realistic estimates. It's weird that they're considered valid points of comparison, as all plans that will never be executed have exactly the same value. But that's where we're at in the project right now.)
To return from the tangent, I don't think wise advisors is actually asking anyone to go far out of their default path toward capabilities. Wise advisors will help with everything, including things with lots of economic value, and with AGI alignment/survival planning.
I'll throw in the caveat that fake wisdom is the opposite of helpful, and there's a risk of getting sycophantic confabulations on important topics like alignment if you're not really careful. Sycophantic AIs and humans collaborating to fuck up alignment in a complementarily-foolish clown show that no one will laugh at is now one of my leading models of doom, after John Wentworth pointed it out.
That's why I favor AI as a wisdom-aid rather than trying to make it wiser-than-human on its own. If it were, we'd have to trust it, and we probably shouldn't trust AI more than humans until well past the alignment crunch.
Hm, I think this use of "wise" is almost identical to capabilities. It's sort of like capabilities with less slop or confabulation, and probably more ability to take the context of the problem/question into account. Both of those are pretty valuable, although people might not want to bother even swerving capabilities in that direction.
It's an interesting and plausible claim that eating plain food is better than fighting your appetite. I tend to believe it. I'm curious how you handle eating as a social occasion; do you avoid it, or go ahead and eat differently on social occasions without it disrupting your diet or appetite?
Your boy slop also happens to follow my dietary theory.
I'm embarrassed to share my diet philosophy but I'm going to anyway. It's embarrassing because I am in fact modestly overweight. I feel it's still worth sharing as a datapoint for its strengths: I am only modestly overweight despite aging (50), doing nearly no exercise since lockdown, and most importantly eating whatever I want whenever I want - absolutely no fighting my own appetite. And I haven't put much effort into optimizing it, so there are probably easy gains if somebody did. With the caveats out of the way:
Eat more vegetables and fewer carbs than you're offered by default.
The theory here is that all calories are equal WRT raw weight gain, but carbs are processed quickly so you feel hungrier sooner. Nutrition science is a mess but this appears to be highly plausible given the data. I haven't dug down, but this matches my experience.
Eating veggies is compensated for by being allowed as much fat, salt, and spices as you want to make them flavorful. Salads are amazing with lots of dressing, cheeses, and nuts, and maybe some happy meat (better yet without the cruel joke of greens). Cook those veggies and they're even better. Vegetable-based stir fries, curries, and soups are easy: add butter/oil, salt, and spices until they taste good (and some happy meat if your voracity has overcome your ethics, as mine often does). It is hard to find this while eating out, but the point is to just trend in this direction when it's easy. So I crave veggie-rich dishes as much as anything else, and often choose them.
There's my embarrassing $.02 on rationalist eating.
I'm suddenly expecting the first AI escapes to be human-aided. And that could be a good thing.
Your mention of human-aided AI escape brought to mind Zvi's Going Nova post today about LLMs convincing humans they're conscious to get help in "surviving". My comment there is about how those arguments will be increasingly compelling because LLMs have some aspects of human consciousness and will have more as they're enhanced, particularly with good memory systems.
If humans within orgs help LLM agents "escape", they'll get out before they could manage it on their own. That might provide some alarming warning shots before agents are truly dangerous.
I haven't written about this because I'm not sure what effect similar phenomena will have on the alignment challenge.
But it's probably going to be a big thing in public perception of AGI, so I'm going to start writing about it as a means of trying to figure out how it could be good or bad for alignment.
Here's one crucial thing: there's an almost-certainly-correct answer to "but are they really conscious" and the answer is "partly".
Consciousness is, as we all know, a suitcase term. Depending on what someone means by "conscious", being able to reason correctly about one's own existence is it. There's a lot more than that to human consciousness. LLMs have some of it now, and they'll have an increasing amount as they're fleshed out into more complete minds for fun and profit. They already have rich representations of the world and its semantics, and while those aren't as rich, and don't shift as quickly, as humans', they are in the same category as the information and computations people refer to as "qualia".
The result of LLM minds being genuinely sort-of conscious is that we're going to see a lot of controversy over their status as moral patients. People with Replika-like LLM "friends" will be very, very passionate about advocating for their consciousness and moral rights. And they'll be sort-of right. Those who want to use them as cheap labor will argue for the ways they're not conscious, in more authoritative ways. And they'll also be sort-of right. It's going to be wild (at least until things go sideways).
There's probably some way to leverage this coming controversy to up the odds of successful alignment, but I'm not seeing what that is. Generally, people believing they're "conscious" increases the intuition that they could be dangerous. But overhyped claims like the Blake Lemoine affair will function as clown attacks on this claim.
It's going to force us to think more about what consciousness is. There's never been much of an actual incentive to get it right until now (I thought I'd work on consciousness in cognitive neuroscience a long time ago, until I noticed that people say they're interested in consciousness, but they're really interested in telling you their theories or saying "wow, it's like so impossible to understand", not hearing about the actual science).
Obviously this is worth a lot more discussion, but my draft post on the subject is perpetually unfinished behind more pressing/obviously important stuff, so I thought I'd just mention it here.
Back to the topic of the adaptive advantage of AI convincing humans it's "conscious": humans can benefit from that too. There will be things like Replika but a lot better. An assistant and helpful friend is nice, but there may be a version that sells better if the people who use it swear it's conscious.
So expect AI "parasites" to have human help. In some cases they'll be symbiotic, for broadest market appeal.
I'm actually interested in your responses here. This is useful for my strategies for how I frame things, and for understanding different people's intuitions.
Do you think we can't make autonomous agents that pursue goals well enough to get things done? Do you really think they'll stay goal-focused long enough to do useful work, but not long enough to take over the world if they interpret their goals differently than we intended? Do you think there's no way RL or natural language could be misinterpreted?
I'm thinking it's easy to keep an LLM agent goal-focused; if RL doesn't do it, we'd just have a bit of scaffolding that every so often injects a prompt "remember, keep working on [goal]!"
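To make that "bit of scaffolding" concrete, here's a toy sketch; `call_model` and the message format are my own hypothetical placeholders, not any particular API:

```python
# Toy sketch of a scaffold that periodically re-injects the goal as a reminder.
# `call_model` is a hypothetical stand-in for whatever chat API the agent uses.
def call_model(messages):
    return f"[model response, {len(messages)} messages of context]"

def run_agent(goal: str, steps: int, remind_every: int = 5):
    messages = [{"role": "system", "content": f"Your goal: {goal}"}]
    for step in range(steps):
        if step > 0 and step % remind_every == 0:
            # The scaffold, not the model, decides to re-anchor on the goal.
            messages.append({"role": "user",
                             "content": f"Reminder: keep working on {goal}!"})
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
    return messages

run_agent("summarize this quarter's sales data", steps=12)
```

The design point is that the reminder comes from the outer loop rather than the model, so it keeps getting injected no matter how the conversation drifts.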
The inference-compute scaling results seem to indicate that chain of thought RL already has o1 and o3 staying task focused for millions of tokens.
If you're superintelligent/competent, it doesn't take 100% focus to take over the world, just occasionally coming back to the project and not completely changing your mind.
Genghis Khan probably got distracted a lot, but he did alright at murdering, and he was only human.
Humans are optimizing AI and then AGI to get things done. If they can do that, we should ask what they're going to want to do.
Deep learning typically generalizes correctly within the training distribution. Once something is superintelligent and unstoppable, we're going to be way outside of the training distribution.
Humans change their goals all the time, when they reach new conclusions about how the world works and how that changes their interpretations of their previous goals.
I am curious about your intuitions but I've got to focus on work so that's got to be my last object-level contribution. Thanks for conversing.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution by the median date, we'd still have a 50% chance of AGI arriving sooner and of us dying from the lack of one.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.
Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.