Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.
If you're new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll continue with the alignment target developers currently use: instruction-following. It's counterintuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason it can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping itself aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions; we would need an agent that prioritizes more recent instructions over previous ones, and we'd need to decide how it should treat hypothetical future instructions.
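As a toy illustration of that recency rule, here's a minimal Python sketch; all names, topics, and timestamps are hypothetical, and a real agent would of course apply this inside a much richer decision process:

```python
# Toy sketch: on conflict, the most recent instruction on a topic wins.
from dataclasses import dataclass

@dataclass
class Instruction:
    timestamp: float  # when the instruction was given
    topic: str        # what it governs
    text: str         # the instruction itself

def effective_instructions(history: list[Instruction]) -> dict[str, str]:
    """Later instructions on the same topic override earlier ones."""
    current: dict[str, str] = {}
    for instr in sorted(history, key=lambda i: i.timestamp):
        current[instr.topic] = instr.text
    return current

history = [
    Instruction(1.0, "reporting", "Email me daily summaries."),
    Instruction(2.0, "reporting", "Actually, only email me weekly."),
]
print(effective_instructions(history))
# -> {'reporting': 'Actually, only email me weekly.'}
```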
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to elicit enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate is in the 50% range; our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.
I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.
LLM-bots are inherently easy to align, at least at a surface level. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.
In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.
I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending and stop if it hits $X and check with me". When agent disasters affect other people, the media will blow them sky high, and everyone will say "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a truly big bot disaster more seriously.
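For concreteness, here's a minimal sketch of that kind of spending safeguard; `BudgetGuard`, `record`, and the dollar figures are all invented for illustration, not any real agent framework:

```python
# Toy safeguard: halt the agent and defer to its human once cumulative
# spending hits a cap, per "stop if it hits $X and check with me".

class BudgetGuard:
    def __init__(self, cap_dollars: float):
        self.cap = cap_dollars
        self.spent = 0.0

    def record(self, amount: float):
        self.spent += amount
        if self.spent >= self.cap:
            raise RuntimeError(
                f"Spent ${self.spent:.2f} of ${self.cap:.2f} cap; "
                "stopping to check with my human."
            )

guard = BudgetGuard(cap_dollars=100.0)
try:
    for cost in [30.0, 45.0, 40.0]:  # costs of successive agent actions
        guard.record(cost)
except RuntimeError as stop:
    print(stop)  # the agent halts here and waits for human sign-off
```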
Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.
Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (the outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.
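To make that loop concrete, here's a minimal Python sketch of the executive-script-plus-episodic-memory architecture; `call_llm`, the toy bag-of-words embedding, and the in-memory store are hypothetical stand-ins for a real model API, a learned embedder, and a real vector database like Pinecone:

```python
# Sketch: an outer "executive" script loop plus vector-search episodic memory.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a learned embedder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class EpisodicMemory:
    """Saved text snippets retrieved by similarity, standing in for vector search."""
    def __init__(self):
        self.episodes = []  # list of (embedding, text)

    def store(self, text: str):
        self.episodes.append((embed(text), text))

    def recall(self, query: str, k: int = 3):
        q = embed(query)
        scored = sorted(self.episodes, key=lambda e: cosine(e[0], q), reverse=True)
        return [text for _, text in scored[:k]]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real model API here.
    return f"[model response to: {prompt[:60]}...]"

def executive_loop(goal: str, memory: EpisodicMemory, max_steps: int = 5):
    """Outer script: recall, plan, act, store the result, repeat until done."""
    for step in range(max_steps):
        context = "\n".join(memory.recall(goal))
        prompt = f"Goal: {goal}\nRelevant memories:\n{context}\nNext action:"
        action = call_llm(prompt)
        memory.store(f"Step {step}: {action}")
        if "DONE" in action:
            break

memory = EpisodicMemory()
executive_loop("make money selling shoes, and make the world a better place", memory)
```

The point of the sketch is the interaction: a better outer script produces better memories to search over, and better retrieval makes each step of the script more effective.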
I did a little informal testing, asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to think of. It looked like the ethical/capitalist reasoning of a pretty intelligent person, and a fairly ethical one at that.
It's looped back to cool for me; I'm going to check it out.
The most capable version of each model doesn't yet exist when the model is released. Beyond fine-tuning for specific tasks, scaffolding matters: the agentic scaffolds people create play an increasingly important role in a model's ultimate capability.
Suit yourself, but I happen to want to create many great continuations. I enjoy hearing about other people's happiness. I enjoy it more the better I understand them. I understand myself pretty well.
But I don't want to be greedy. I'm not sure a lot of forks of each person are better than making more new people.
Let me also mention that it's probably possible to merge forks. Simply averaging the weight changes in your simulated cortex and hippocampus will approximately work to share the memories across two forks. How far out that works before you start to get significant losses is an empirical matter. Clever modifications to the merge algorithm and additions to my virtual brain should let us extend that substantially; sharing memories across people is possible in broad form with really good translation software, so I expect we'll do that, too.
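To make the delta-averaging idea concrete, here's a toy Python sketch; the dict-of-weights representation and all the numbers are made up for illustration, not an actual merge algorithm for simulated brains:

```python
# Toy sketch: merge two forks by averaging their weight *changes*
# relative to the shared pre-fork checkpoint.

def merge_forks(base: dict, fork_a: dict, fork_b: dict) -> dict:
    """Average each fork's learned deltas and apply them to the shared base."""
    merged = {}
    for name, w0 in base.items():
        delta_a = fork_a[name] - w0  # what fork A learned since the split
        delta_b = fork_b[name] - w0  # what fork B learned since the split
        merged[name] = w0 + 0.5 * (delta_a + delta_b)
    return merged

# Two forks diverge from the same checkpoint, then re-merge.
base   = {"cortex.w1": 0.10, "hippocampus.w1": -0.30}
fork_a = {"cortex.w1": 0.25, "hippocampus.w1": -0.10}
fork_b = {"cortex.w1": 0.05, "hippocampus.w1": -0.50}
print(merge_forks(base, fork_a, fork_b))
# -> roughly {'cortex.w1': 0.15, 'hippocampus.w1': -0.30}
```

How far this scales before interference between the two sets of learned changes causes significant losses is, as said above, an empirical question.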
So in sum, life with aligned ASI would be incredibly awesome. It's really hard to imagine or predict exactly how it will unfold, because we'll have better ideas as we go.
WRT "cheapening" the experience, remember that we'll be able to twist the knobs in our brain for boredom and excitement if we want. I imagine some would want to do that more than others. Grand triumph and struggle will be available for simulated competitive/cooperative challenges; sometimes we'll know we're in a simulation and sometimes we'll block those memories to make it temporarily seem more real and important.
BUT this is all planning the victory party before fighting the war. Let's figure out how we can maximize the odds of getting aligned ASI by working out the complex challenges of getting there on both technical and societal levels.
Interesting! Do you think humans could pick up on word use that well? My perception is that humans mostly cue on structure to detect LLM slop writing, and that is relatively easily changed with prompts (although it's definitely not trivial at this point - but I haven't searched for recipes).
I did concede the point, since the research I was thinking of didn't use humans who've practiced detecting LLM writing.
I concede the point. That's a high bar for getting LLM submissions past you. I don't know of studies that tested people who'd actually practiced detecting LLM writing.
I'd still be more comfortable with a disclosure criterion of some sort, but I don't have a great argument beyond valuing transparency and honesty.
I read it too and had no such thought. I think that loose, poetic, free-association type of writing is hard for humans and easy for LLMs.
That's a good point and it does set at least a low bar of bothering to try.
But they don't have to try hard. They can almost just append "and don't write it in standard LLM style" to the prompt.
I think it's a little more complex than that, but not much. Humans can't tell LLM writing from human writing in controlled studies. The question isn't whether you can hide the style, or even whether it's hard, just how easy it is.
Which raises the question of whether they'd even do that much, because of course they haven't read the FAQ before posting.
Really just making sure that new authors read SOMETHING about what's appreciated here would go a long way toward reducing slop posts.
If you wrote the whole thing, then prompted Claude to rewrite it, that would seem to "add significant value." If you then read the whole thing carefully to say "that's what I meant, and it didn't make anything up I'm not sure about", then you've more than met the requirement laid out here, right?
They're saying the second part is all you have to do. If you had some vague prompt like "write an essay about how the field of alignment is misguided" and then proofread the result, you've met the criteria as laid out. So if your prompt was essentially the complete essay, it seems you've gone far beyond their standards.
I personally would want to know that the author contributed much more than a vague prompt to get the process rolling, but that seems to be the standard for acceptance laid out here. I assume they'd prefer much more involvement on the prompting side, like what you're talking about doing.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution by the median date, AGI would have a roughly 50% chance of arriving before it, and we'd risk dying from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's Law (everything takes longer than you expect) applies to AGI development, if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.