Message me here or at seth dot herd at gmail dot com.
I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers give it the ability to "think for itself" in all the ways that make humans capable and dangerous.
I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal issues and the sociology of the field, because choosing the best technical research approach depends on all of those.
Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.
Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure they have only goals we like, they will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.
There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.
That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.
In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.
I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.
I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.
The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.
When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.
My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!
I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and under realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI will turn out to be very easy, or that we'll fail and perish because we don't have much time for new research.
I think it's fairly likely that alignment isn't impossibly hard, but also not easy enough that developers get it right on their own despite all of their biases and incentives, so a little work in advance from outside researchers like me could tip the scales. I think this is a neglected approach (although, to be fair, most approaches are neglected at this point, since alignment is so underfunded compared to capabilities research).
One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give an AGI a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: instruction-following.
It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.
There are significant Problems with instruction-following as an alignment target. It does not solve the problem of corrigibility once an AGI has left our control; it merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It also allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value-aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.
I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align, since they won't "think" in English chains of thought or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to get enough careful critique of my ideas to know whether this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.
Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) is in the 50% range: our long-term survival as a species is too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.
"We'll replace tons of jobs really fast and it will probably be good for anyone who's smart and cares" is counterintuitive, for good reasons. I'm a good libertarian capitalist like most folks here, but markets embedded in societies aren't magic.
New technologies have been net beneficial over the long run, not the short run. By some good-sounding arguments, job disruptions have taken up to a hundred years for average wages to recover; I think that was claimed for industrial looms and the steam engine, and there's a credible claim that the average recovery time has been very long. And those disruptions didn't hit markets nearly as quickly as drop-in replacements for intellectual labor would.
Assuming, without further argument, that the upsides of even relatively slow, aligned AI progress are likely to outweigh the negatives seems like pure optimism.
AI will certainly have prosaic benefits. They seem pretty unlikely to outweigh the harms.
Civilizations have not typically reacted well enough to massive disruptions to be optimistic about the unknowns here. Spreading the advantages of AI as broadly as the pains of job losses seems like threading a needle that nobody has even aimed at yet.
I am an optimist by nature. The more closely I think about AI impacts, the less optimistic I feel.
I don't know what to say to young people, because this much uncertainty has historically been really bad, and the objective situation seems to be mostly one of massive uncertainty.
How about "takeover-capable AI"?
I've been thinking about this issue a fair amount, and that's my nomination. It points directly at what we care about. And it doesn't carry the implication that an AI would need to be a whole different category of intelligence to take over. Your Neanderthal example, with the correction, is relevant here: they're gone because sapiens had varied advantages, not because they were cleanly outclassed in intelligence.
Individual humans have taken over most of the world many times while being smarter than those around them in only pretty limited ways. It's important to consider sci-fi takeover scenarios, but old-fashioned social dominance ("hey, it's better for you if you listen to me," applied iteratively) is also highly effective, and would suffice.
I think the explicit suggestion is to retreat to a more specific term rather than fight against the co-option of superintelligence to hype spikily human-level AI.
I agree that superintelligence has the right usage historically, and the right intuitive connotation.
Superman isn't slightly stronger than the strongest human, let alone the average. He's in a different category. That's what's evoked. But "super" technically just means better, so slightly better than human qualifies on a literal reading. So I see the term getting steadily co-opted for marketing, and agree we should have a separate term.
Yes. Right now, LLMs feel more like a tool than a mind or entity. Adding continual learning will make them feel more like humans, which is intuitively alarming. It will also broaden their deployment, another source of alarm. They'll become more continuous, like a human, instead of an ephemeral ghost. More agentic behavior, as a result of improving competence by "learning on the job" (and other relevant improvements), will also push in that direction, making them seem intuitively more like humans. Humans are intuitively extremely dangerous. Weird alien versions of humans are intuitively even more alarming (if you're not an AI enthusiast or engaged in a culture war with those pesky "doomers").
I wrote about this in A country of alien idiots in a datacenter: AI progress and public alarm, focusing on impacts on public opinion. I wrote about the technical side more in LLM AGI will have memory, and memory changes alignment.
I think this will make progress toward RSI. It will grow into a major unhobbling for agent competence in all areas. But progress will be slower, because we'll have bad, limited continual learning before we have really good, human-like continual learning. So I think it will unlock the dangers of AGI, but at a slower pace that gives us a fighting chance to wake up and take alignment seriously, barely in time.
I'm thinking of next-gen LLM agents with continual learning as parahuman AI: systems that work roughly like human brains/minds, and work alongside humans.
It seems like a more reasonable title for this piece would be "you might be okay, just focus on that!"
If you don't want to talk about p(doom), you need to have a very wide uncertainty, like 10-90%. That actually seems like the logic you're using.
"You'll be okay" is not an accurate statement of that range of uncertainty. "You might be okay" is. And you're arguing that you should just focus on that. There I largely agree.
I just don't like reassurances coupled with epistemic distortions.
The proper level of uncertainty is very large, and we should be honest about that and try to improve it.
I'm not sure it's that simple. Even if it is, people do suboptimal things all the time. It seems worth watching.
Orthogonally, cultural standards of emotional tone during debates are also important for how much emotional struggle is involved in changing one's ideas.
If the tone implies that you were foolish for holding your idea, it's going to be a lot more painful to let it go.
LessWrong has a pretty good standard of not just civil but polite and supportive discourse. That actually seems pretty crucial to its being an environment in which people regularly change their minds.
I don't like the term "arena" in your suggested division, because it implies combat. Combat is emotionally intense; I'd rather have a metaphor that's more collaborative.
This doesn't eliminate the value of having separate spaces for support and for rigorous testing of ideas, but I think tone is important to keep in mind whenever we're discussing group epistemics.
Terence Tao can drive a car and talk to people to set up a business, and many other things outside of math. So I hope we don't give that name to systems that can't do those things.
Superintelligence has usually been used to mean far more intelligent than a human in the ways that practically matter. Current systems aren't there yet, but they will be soon.
Seed AI has been most commonly used to mean AI that improves itself fast.
Yes, if you took the component words seriously as definitions, you'd conclude that we already have ASI and seed AI. But that's not how language usually works.
I think it is much more true that we have not reached ASI or seed AI than that we have.
I think this essay assumes a definitional... definition? of language that is simply not how language works. The constructivist view is that words mean what people mean when they say them. I think that's the more accurate theory: it better describes how language really works and what words really mean.
We might prefer a world in which words were crisply defined, but we do not live in that world.
So not only is there an intuitive sense in which we have not yet reached seed AI, recursively self-improving AI, or superintelligence; the practical implications of blurring that line by saying we're already there would also be very harmful. Those terms were all invented to describe the incredible danger of coming up against AI that can outsmart us quickly and easily in the domains that lead directly to power, and that can improve itself fast enough to be unexpectedly dangerous. They were invented for that purpose and should be reserved for it, in a practical sense; it's a bonus that that's also how they're commonly used.
The important thing for alignment work isn't the median prediction; if we only had an alignment solution by then, we'd still have a 50% chance of AGI arriving first, and of dying from that lack.
I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.
I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development, if not to compute progress.
Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's have much wider distributions.