Message me here or at seth dot herd at gmail dot com.

I was a researcher in cognitive psychology and cognitive neuroscience for two decades and change. I studied complex human thought using neural network models of brain function. I'm applying that knowledge to figuring out how we can align AI as developers make it to "think for itself" in all the ways that make humans capable and dangerous.

New to alignment? See the Research Overview section
Field veteran? See More on My Approach at the end.

I work on technical alignment, but doing that has forced me to branch into alignment targets, alignment difficulty, and societal and field sociological issues, because choosing the best technical research approach depends on all of those.

Principal articles:

On technical alignment of LLM-based AGI agents:
- LLM AGI may reason about its goals and discover misalignments by default
  - An LLM-centric lens on why aligning Real AGI is hard
- System 2 Alignment Likely approaches for LLM AGI on the current trajectory
- Seven sources of goals in LLM agents brief problem statement
- Internal independent review for language model agent alignment
  - Updated in System 2 alignment
On LLM-based agents as a route to takeover-capable AGI
- LLM AGI will have memory, and memory changes alignment
- Brief argument for short timelines being quite possible
- Capabilities and alignment of LLM cognitive architectures
  - Cognitive psychology perspective on routes to LLM-based AGI with no breakthroughs needed
AGI risk interactions with societal power structures and incentives:
- Whether governments will control AGI is important and neglected
- If we solve alignment, do we die anyway?
  - Risks of proliferating human-controlled AGI
- Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
On the psychology of alignment as a field:
- Cruxes of disagreement on alignment difficulty
- Motivated reasoning/confirmation bias as the most important cognitive bias
On AGI alignment targets assuming technical alignment
On communicating AGI risks:

Research Overview:

Alignment is the study of how to design and train AI to have goals or values aligned with ours, so we're not in competition with our own creations.

Recent breakthroughs in AI like ChatGPT make it possible we'll have smarter-than-human AIs soon. If we don't understand how to make sure it has only goals we like, it will probably outcompete us, and we'll be either sorry or gone. See this excellent intro video.

There are good and deep reasons to think that aligning AI will be very hard. Section 1 of LLM AGI may reason about its goals is my attempt to describe those briefly and intuitively. But we also have promising solutions that might address those difficulties. They could also be relatively easy to use for the types of AGI we're most likely to develop first.

That doesn't mean I think building AGI is safe. Humans often screw up complex projects, particularly on the first try, and we won't get many tries. If it were up to me I'd Shut It All Down, but I don't see how we could get all of humanity to stop building AGI. So I focus on finding alignment solutions for the types of AGI people are building.

In brief, I think we can probably build and align language model agents (or language model cognitive architectures) up to the point that they're about as autonomous and competent as a human, but then it gets really dicey. We'd use a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility (human-in-the-loop error correction) by having a central goal of following instructions. This scenario leaves multiple humans in charge of ASIs, creating some dangerous dynamics, but those problems might be navigated, too.

Bio

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, using neural network models of brain function to integrate data across levels of analysis from psychological down to molecular mechanisms of learning in neurons, and everything in between. I've focused on the interactions between different brain neural networks that are needed to explain complex thought. Here's a list of my publications.

I was increasingly concerned with AGI applications of the research, and reluctant to publish my full theories lest they be used to accelerate AI progress. I'm incredibly excited to now be working full-time on alignment, currently as a research fellow at the Astera Institute.

More on My Approach

The field of AGI alignment is "pre-paradigmatic." So I spend a lot of my time thinking about what problems need to be solved, and how we should go about solving them. Solving the wrong problems seems like a waste of time we can't afford.

When LLMs suddenly started looking intelligent and useful, I noted that applying cognitive neuroscience ideas to them might well enable them to reach AGI and soon ASI levels. Current LLMs are like humans with no episodic memory for their experiences, and very little executive function for planning and goal-directed self-control. Adding those cognitive systems to LLMs can make them into cognitive architectures with all of humans' cognitive capacities - a "real" artificial general intelligence that will soon be able to outsmart humans.

My work since then has convinced me that we might be able to align such an AGI so that it stays aligned as it grows smarter than we are. LLM AGI may reason about its goals and discover misalignments by default is my latest thinking; it's a definite maybe!

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

I'm trying to fill a particular gap in alignment work. My approach is to focus on thinking through plans for alignment on short timelines and realistic societal assumptions (competition, polarization, and conflicting incentives creating motivated reasoning that distorts beliefs). Many serious thinkers give up on this territory, assuming that either aligning LLM-based AGI turns out to be very easy, or we fail and perish because we don't have much time for new research.

I think it's fairly likely that alignment isn't impossibly hard but also not easy enough that developers get it right on their own despite all of their biases and incentives. so a little work in advance, from outside researchers like me could tip the scales. I think this is a neglected approach (although to be fair, most approaches are neglected at this point, since alignment is so under-funded compared to capabilities research).

One key to my approach is the focus on intent alignment instead of the more common focus on value alignment. Instead of trying to give it a definition of ethics it can't misunderstand or re-interpret (value alignment mis-specification), we'll probably continue with the alignment target developers currently focus on: Instruction-following.

It's counter-intuitive to imagine an intelligent entity that wants nothing more than to follow instructions, but there's no logical reason this can't be done. An instruction-following proto-AGI can be instructed to act as a helpful collaborator in keeping it aligned as it grows smarter.

There are significant Problems with instruction-following as an alignment target. It does not solve the problem with corrigibility once an AGI has left our control, merely gives another route to solving alignment (ordering it to collaborate) while it's still in our control, if we've gotten close enough to the initial target. It allows selfish humans to seize control. Nonetheless, it seems easier and more likely than value aligned AGI, so I continue to work on technical alignment under the assumption that's the target we'll pursue.

I increasingly suspect we should be actively working to build parahuman (human-like) LLM agents. It seems like our best hope of survival, since I don't see how we can convince the whole world to pause AGI efforts, and other routes to AGI seem much harder to align since they won't "think" in English chains of thought, or be easy to scaffold and train for System 2 Alignment backstops. Thus far, I haven't been able to engage enough careful critique of my ideas to know if this is wishful thinking, so I haven't embarked on actually helping develop language model cognitive architectures.

Even though these approaches are pretty straightforward, they'd have to be implemented carefully. Humans often get things wrong on their first try at a complex project. So my p(doom) estimate of our long-term survival as a species is in the 50% range, too complex to call. That's despite having a pretty good mix of relevant knowledge and having spent a lot of time working through various scenarios. So I think anyone with a very high or very low estimate is overestimating their certainty.

The important thing for alignment work isn't the median prediction; if we had an alignment solution just by then, we'd have a 50% chance of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his the most seriously. My model of possible AGI architectures requires even less compute than his, but I think the Hofstadter principle applies to AGI development if not compute progress.

Estimates in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya and Ege's have much wider distributions.

I think you're probably right about that historical difference. But I don't agree with the implication that people won't believe AGI is coming until too late. (I realize this isn't the main claim you're making here, but I think you'd agree that's the most important implication.)

It's like January 2020 now, when those concerned with Covid were laughed off. That doesn't mean AGI concerns will be dismissed when more evidence hits. The public could easily go from not nearly concerned enough to making panicked demands for mass action like shutting down half the economy as a precautionary measure.

Yes, the modern assumption that nothing really changes will slow down recognition of AI's dangers. But not for long if we're fortunate enough to get a slowish takeoff and public deployments of useful (and therefore creepy) LLM agents. Of course, that might not happen until we're too close to internal deployment of a misaligned takeover-capable system like Agent-4 from AI 2027. But it's looking pretty likely we'll get such deployments and job replacements before the point of no return, so I think we should at least have some contingency plans in case of dramatic public concern.

AI is in far-mode thinking for most people now, but I predict it's going to be near-mode for a lot of people as soon as we've got inarguable job replacement and more common experience with agentic AI.

I'm the first to talk about how foolish people are compared to our idealized self-conception. People are terrible with abstract ideas. But I think the main reason is that they don't spend time thinking seriously about them until they're personally relevant. Humans take a long time to figure out new things. It takes a lot of thought. But it's also a collective process. As it becomes a bigger part of public conversation, basic logic like "oh yeah they're probably going to build a new species, and that sounds pretty dangerous" will become common.

Note that most of the people talking about AI now are entrepreneurs and AI developers - the small slice of humanity most prone to be pro-AI biased. Most other people intuitively fear it, arguably for good reasons.

My logic and an attempt to convey my intuitions on this are in A country of alien idiots in a datacenter.

First, I think safety cases are pretty clearly the only sane way to deploy AI, and it's great that you're talking about them. Now on to the discussion of how they ought to go.

I most certainly do not prefer option 2! People frequently assure others "I've thought of everything!" then later say "oh well it's totally reasonable that I missed this thing that seemed totally irrelevant but turned out to be a huge deal..." Like The Moonrise triggering the early warning system and nearly getting everyone nuked.

This is particularly true when you're dealing with a vastly complex system (an AGI mind) that is trained not built. And far more important once that mind is smarter than you, so it should be expected to think of things, and probably in ways, you haven't.

In the case of potentially-takeover-capable AGI (maybe not what you meant by GPT6), I'd far prefer Option 1 if the analogy holds; it would give me above 99% chance of survival if the "flight" was the same type as the test cases. The abstract arguments that "we've thought of everything" probably wouldn't get me to that high level of certainty.

But even for airplane safety, I'd prefer a combination of options 1 and 2, empirical and theoretical approaches to safety. And I think everyone else should, and mostly does, too.

On the empirical side, I'd like them to say "we flew it 100 times, and in ten of those, we flew it right into weather we'd never attempt with passengers. We shined lasers at it and flew it through a flock of birds; we flew it into some pretty big drones."

I want the theoretical claim that we've thought of everything that could go wrong to be applied in testing. I want it to be exposed to public scrutiny to see if anyone else can think of anything that could go wrong. I'd like a really thorough multifaceted safety case before we do anything that even might be the equivalent of loading all of humanity into an experimental plane.

I don't think we'll get that, but we should be clear that that's what a wise and selfless world would do, and we should get as close as possible in this world.

One big outstanding question is how we could do empirical research taht would really be anything like the equivalent of Option 1. We can't quite get there, but we should think about getting as close as possible.

One major missing factor is that good. Gifts are evidence that someone thinks about you a lot. This in turn is strong. Evidence that they like you a lot and will be there to help you if you need it.

Of course people don't notice all the reasons they have positive emotional associations with gifts they don't get with money.

I think one major mistake here is assuming that there's one answer. I think it's a combination of many of the things you identify and the comments add.

This post confounds several factors. There's a lot more to immersion language learning than massed practice. And bodybuilding has almost nothing to do with learning. The brain is not a muscle.

There are whole sciences of learning; to be fair they're not very good, but you reference none of them, and they're at least worth referencing. In the past this would be fine, but in a world where you can talk to Claude, it just doesn't make sense to make claims of fact with this much effort without bothering to even learn say anything about relevant science.

I agree with this analysis. I think we probably have somewhat better odds on the current path, particularly if we can hold onto faithful CoT.

I expect some one major changes of continuous learning, but CoT and most of the LLM paradigm may stay pretty similar up to takeoff.

Steve Byrnes' Foom & Doom is a pretty good guess at what we get if the current paradigm doesn't pan out.

I meant if the predictor were superhumanly intelligent.

You have spent years studying alignment? If so, I think your posts would do better by including more ITT/steelmanning for that world view.

I agree with your arguments that alignment isn't necessarily hard. I think there are a complementary set of arguments against alignment being easy. Both must be addressed and figured in to produce a good estimate for alignment difficulty.

I've also been studying alignment for years, and my take is that everyone has a poor understanding of the whole problem and so we collectively have no good guess on alignment difficulty.

It's just really hard to accurately imagine agi. If it's just a smarter version of llms that acts as a tool, then sure it will probably be aligned enough just like current systems.

But it almost certainly won't be.

I think that's the biggest crux between your views and mine. Agency and memory/learning are too valuable and too easy to stay out of the picture for long.

I'm not sure the reasons Claude is adequately aligned won't generalize to AGI that's different in those ways, but I don't think we have much reason to assume it will.

I've expressed this probably best yet on LLM AGI may reason about its goals, the post I linked to previously.

training on a purely predictive loss should, even in the limit, give you a predictor, not an agent

I fully agree. But

a) Many users would immediately tell that predictor "predict what an intelligent agent would do to pursue this goal!" and all of the standard worries would re-occur.

b) This is similar to what we are actively doing, applying RL to these systems to make them effective agents.

Both of these re-introduce all of the standard problems. The predictor is now an agent. Strong predictions of what an agent should do include things like instrumental convergence toward power-seeking, incorrigibility, goal drift, reasoning about itself and its "real" goals and discovering misalignments, etc.

There are many other interesting points here, but I won't try to address more!

I will say that I agree with the content of everything you say, but not the relatively optimistic implied tone. Your list of to-dos sound mostly unlikely to be done well. I may have less faith in institutions and social dynamics. I'm afraid we'll just rush and and make crucial mistakes, so we'll fail even if alignment was only in between steam engine and apollo levels.

This is not inevitable! If we can clarify why alignment is hard and how we're likely to fail, seeing those futures can prevent them from happening - if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.

You make many good points about skills and group membership. Perhaps this one is really not that relevant to your point, but it does seem worth mentioning for any aspiring rationalist considering your scenario.

You don't mention the skill sets I think are most likely to contribute to that scenario.

I'd think after hearing that story that most likely explanation was that Adam irritated Bella.

The most relevant skills seem like communication, social grace, and emotion management.

The sequences frequently mention communication, but seldom mention social grace or emotion management. Not getting defensive when someone is being abrasive seems like largely an emotion management skill.

Communication is hard, so it's pretty likely that Adam and Bella weren't on exactly the same page about what was under discussion. They could've both been mostly right since they were talking about somewhat different things (you said the exchange was brief, making this more likely). Both transmissive and receptive communication skills are hard, particularlly for complex topics.

The skill of telling someone they're wrong without being abrasive is real and difficult. I don't think you mentioned it.

The skill of being abrasively told you're wrong and not letting it irritate you is also real and difficult.

I think this is worth mentioning because people seem to fairly frequently assume that being a rationalist does or should mean your emotions don't play a role in your thinking. This is wildly unrealistic. Being a rationalist provides some resistance to the worst forms of emotions interfering with logic (since the love of truth is an emotion that can counteract many other influences), but it doesn't nearly provide full immunity.

Assuming you're not biased or otherwise affected by emotion because you're a rationalist is very irrational.

I think they trained it in a way that crippled its capabilities in that area. This was probably the result of rushing and unforeseen consequences of what they chose focus on for training targets. So some of both.

This video:

https://www.lesswrong.com/posts/4mtqQKvmHpQJ4dgj7/daniel-tan-s-shortform?commentId=RD2BKixd9PbyDnBEc

I got tired and forgot to find it and add the link before sending.

LESSWRONG
LW

LESSWRONG
LW

Principal articles:

Research Overview:

Bio

More on My Approach

My plan: predict and debug System 2, instruction-following alignment approaches for our first AGIs

Posts

Wikitag Contributions

Comments