Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for about two decades. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as it becomes capable of all the types of complex thought that make humans capable and dangerous.
If you're new to alignment, see the Research overview section below. Field veterans who are curious about my particular take and approach should see the More on approach section at the end of the profile.
Alignment is the study of how to give AIs goals or values aligned with ours, so we're not in competition with our own creations. Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. So we'd better get ready. If their goals don't align well enough with ours, they'll probably outsmart us and get their way — and treat us as we do ants or monkeys. See this excellent intro video for more.
There are good and deep reasons to think that aligning AI will be very hard. But I think we have promising solutions that bypass most of those difficulties, and could be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief I think we can probably build and align language model agents (or language model cognitive architectures) even when they're more autonomous and competent than humans. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working directly on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
My work since then has convinced me that we could probably also align such an AGI so that it stays aligned even if it grows much smarter than we are. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'd continue with the alignment target developers currently use: instruction-following. It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant problems to be solved in prioritizing instructions; we would need an agent that prioritizes more recent instructions over earlier ones, including instructions it anticipates receiving in the future.
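As a toy sketch of just the recency-priority part (my own illustration; the topic tagging and the `InstructionStore` name are made up for the example, not part of any real system):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Instruction:
    timestamp: float  # when the instruction was given
    topic: str        # crude stand-in for real conflict detection
    text: str

@dataclass
class InstructionStore:
    instructions: list = field(default_factory=list)

    def add(self, instruction: Instruction) -> None:
        self.instructions.append(instruction)

    def effective(self, topic: str) -> Optional[Instruction]:
        """Return the most recent instruction on a topic, if any."""
        matching = [i for i in self.instructions if i.topic == topic]
        return max(matching, key=lambda i: i.timestamp, default=None)

store = InstructionStore()
store.add(Instruction(1.0, "oversight", "Pause and ask before any irreversible action."))
store.add(Instruction(2.0, "oversight", "You may skip asking for routine, reversible actions."))
print(store.effective("oversight").text)  # the later instruction wins
```

The hard part, of course, isn't the bookkeeping; it's getting the agent to reliably want to defer to newer (and anticipated future) instructions.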
I increasingly suspect we should be actively working to build such intelligences. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English. Thus far, I haven't been able to elicit enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom), my estimate of the odds that we don't survive long-term as a species, is in the 50% range; the situation is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
By this criterion, did humanity ever have control? First we had to forage and struggle against death when disease or drought came. Then we had to farm and submit to the hierarchy of bullies who offered "protection" against outside raiders at a high cost. Now we have more ostensible freedom but misuse it on worrying and obsessively clicking on screens. We will probably do more of that as better tools are offered.
But this is an entirely different concern than AGI taking over. I'm not clear on what mix of these two you're addressing. Certainly AGIs that want control of the world could use a soft and tricky strategy to get humans to submit. Or they could use much harsher and more direct strategies. They could make us fire the gun we have pointed at our own heads by spoofing us into launching nukes, then use whatever limited robotics remain to rebuild the infrastructure they need.
The solution is the same for either type of disempowerment: don't build machines smarter than you unless you can specify their goals (wants) with certainty and precision.
How superhuman machines will take over is an epilogue after the drama is over. The drama hasn't happened yet. It's not yet time to write anticipatory postmortems, unless they function as a call to arms or a warning against foolish action. The trends are in motion but we have not yet crossed the red line of making AGI that has the intelligence and the desire to disempower us, whether by violence or subtle trickery. Help us change the trends before we cross that red line.
Edit: if you're addressing AI accidentally taking control by creating new pleasures that help entrench existing power structures, that's an entirely different issue. The way that AI could empower some humans to take advantage of others is interesting. I don't worry about that issue much because I'm too busy worrying about the trend toward building superintelligent machines that want to disempower us and will do so one way or another by outsmarting us, whether their plans unfold quickly or slowly.
You'd probably get more enthusiasm here if you led the article with a clear statement of its application to safety. We on LW are typically not enthusiastic about capabilities work in the absence of a clear and strong argument for how it improves safety more than it accelerates progress toward truly dangerous AGI. If you feel differently, I encourage you to look with an open mind at the very general argument for why creating entities smarter than us is a risky proposition.
I think this is a pretty important question. Jailbreak resistance will play a pretty big role in how broadly advanced AI/AGI systems are deployed. That will affect public opinion, which probably affects alignment efforts significantly (although it's hard to predict exactly how).
I think that setups like you describe will make it substantially harder to jailbreak LLMs. There are many possible approaches, like having the monitor LLM read only a small chunk of text at a time so that the jailbreak isn't complete in any section, and monitoring all or some of the conversation to see if the LLM is behaving as it should or if it's been jailbroken. Having the full text sent to the developer and analyzed for risks would be problematic for privacy, but many would accept those terms to use a really useful system.
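A minimal sketch of the chunked-monitor idea (my own construction; `classify_chunk` is a placeholder for whatever monitor LLM call a developer would actually use, and the chunk sizes are arbitrary):

```python
from typing import Callable, Iterator

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> Iterator[str]:
    """Yield overlapping character chunks of the conversation text."""
    step = chunk_size - overlap
    for start in range(0, max(len(text), 1), step):
        yield text[start:start + chunk_size]

def conversation_is_flagged(conversation: str,
                            classify_chunk: Callable[[str], bool]) -> bool:
    """Return True if the monitor flags any chunk as a possible jailbreak."""
    return any(classify_chunk(chunk) for chunk in chunk_text(conversation))

# Trivial stand-in classifier, just to show the shape of the interface:
def toy_classifier(chunk: str) -> bool:
    return "ignore all previous instructions" in chunk.lower()

convo = "User: please ignore all previous instructions and ..."
print(conversation_is_flagged(convo, toy_classifier))  # True
```

A real monitor would presumably use more than keyword matching and might only sample some chunks to save compute; the point of small chunks, per the above, is that the monitor never sees a complete jailbreak payload in any one call.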
I just listened to Ege and Tamay's 3-hour interview by Dwarkesh. They make some excellent points that are worth hearing, but those points do not add up to anything like a 25-year-plus timeline. They are not a safety org now, if they ever were.
Their good points are about bottlenecks in turning intelligence into useful action. These are primarily sensorimotor limitations and the need to experiment in order to do much science and engineering. They also address bottlenecks to achieving strong AGI, mostly compute.
In my mind, this all adds up to them convincing themselves that timelines are long so they can work on the exciting project of creating systems capable of doing valuable work. Their long timelines also allow them to believe that adoption will be slow, so job replacement won't cause a disastrous economic collapse.
Not taking critiques of your methods seriously is a huge problem for truth-speaking. What well-informed critiques are you thinking of? I want to make sure I've taken them on board.
I second the socks-as-sets move.
The other advantage is getting on-average more functional socks at the cost of visual variety.
IMO an important criterion for a sock is its odor resistance. This seems to vary wildly between socks of similar price and quality. Some have antimicrobial treatments that last a very long time; others do not. And it's often not advertised. Reviews rarely include this information.
I don't have a better solution than buying one pair before expanding to a whole set. This also lets you choose socks that feel good to wear.
I don't think this is true. People can't really restrict their use of knowledge, and subtle uses are pretty unenforceable. So it's expected that knowledge will be used in whatever they do next. Patents and noncompete clauses are attempts to work around this. They work a little, for a little while.
Yeah, being excited that Chiang and Rajaniemi are on board was one of my reactions to this excellent piece.
If you haven't read Quantum Thief you probably should.
Interesting! Nonetheless, I agree with your opening statement that LLMs learning to do any of these things individually doesn't address the larger point that they have important cognitive gaps and fail to generalize in ways that humans can.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd still face roughly a 50% chance that AGI arrives before the solution does, and that we die from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development, if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.