A few dozen reasons why Eliezer thinks AGI alignment is an extremely difficult problem, which humanity is not on track to solve.
"Wait, dignity points?" you ask. "What are those? In what units are they measured, exactly?"
And to this I reply: "Obviously, the measuring units of dignity are over humanity's log odds of survival - the graph on which the logistic success curve is a straight line. A project that doubles humanity's chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity."
"But if enough people can contribute enough bits of dignity like that, wouldn't that mean we didn't die at all?" "Yes, but again, don't get your hopes up."
Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues that Eliezer has raised many good considerations backed by pretty clear arguments, but that Eliezer's confident assertions are much stronger than anything those arguments actually support.
Historically people worried about extinction risk from artificial intelligence have not seriously considered deliberately slowing down AI progress as a solution. Katja Grace argues this strategy should be considered more seriously, and that common objections to it are incorrect or exaggerated.
TurnTrout discusses a common misconception in reinforcement learning: that reward is the optimization target of trained agents. He argues reward is better understood as a mechanism for shaping cognition, not a goal to be optimized, and that this has major implications for AI alignment approaches.
A "good project" in AGI research needs:1) Trustworthy command, 2) Research closure, 3) Strong operational security, 4) Commitment to the common good, 5) An alignment mindset, and 6) Requisite resource levels.
The post goes into detail on what minimal, adequate, and good performance looks like.
A fictional story about an AI researcher who leaves an experiment running overnight.
Ben observes that all his favorite people are great at a skill he's labeled in his head as "staring into the abyss" – thinking reasonably about things that are uncomfortable to contemplate, like arguments against your religious beliefs, or in favor of breaking up with your partner.
Two laws of experiment design: First, you are not measuring what you think you are measuring. Second, if you measure enough different stuff, you might figure out what you're actually measuring.
These have many implications for how to design and interpret experiments.
"Human feedback on diverse tasks" could lead to transformative AI, while requiring little innovation on current techniques. But it seems likely that the natural course of this path leads to full blown AI takeover.
A "sazen" is a word or phrase which accurately summarizes a given concept, while also being insufficient to generate that concept in its full richness and detail, or to unambiguously distinguish it from nearby concepts. It's a useful pointer to the already-initiated, but often useless or misleading to the uninitiated.
Elizabeth Van Nostrand spent literal decades seeing doctors about digestive problems that made her life miserable. She tried everything and nothing worked, until one day a doctor prescribed 5 different random supplements without thinking too hard about it and one of them miraculously cured her. This has led her to believe that sometimes you need to optimize for luck rather than scientific knowledge when it comes to medicine.
Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex also argues that approaches based on "robust grading" schemes are unlikely to produce AI alignment.
Nate Soares reviews a dozen plans and proposals for making AI go well. He finds that almost none of them grapple with what he considers the core problem - capabilities will suddenly generalize way past training, but alignment won't.
This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.
Being easy to argue with is a virtue, separate from being correct. When someone makes an epistemically illegible argument, it is very hard to even begin to rebut their arguments because you cannot pin down what their argument even is.
Kelly betting can be viewed as a way of respecting different possible versions of yourself with different beliefs, rather than just a mathematical optimization. This perspective provides some insight into why fractional Kelly betting (betting less aggressively) can make sense, and connects to ideas about bargaining between different parts of yourself.
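As a concrete reference point for the summary above, here is the standard Kelly formula and its fractional variant in a minimal Python sketch (textbook math, not code from the post; the function name is mine):

```python
# Minimal sketch of full and fractional Kelly betting (standard textbook
# formula, not code from the post). For a bet paying b-to-1 with win
# probability p, the full-Kelly stake is f* = p - (1 - p) / b of bankroll;
# fractional Kelly simply scales that down.

def kelly_fraction(p: float, b: float, scale: float = 1.0) -> float:
    """Fraction of bankroll to stake; scale < 1 gives fractional Kelly."""
    f_star = p - (1 - p) / b
    return max(0.0, scale * f_star)  # never stake a negative fraction

# Example: 60% win probability at even odds (b = 1).
print(kelly_fraction(0.6, 1.0))       # full Kelly: 0.2 of bankroll
print(kelly_fraction(0.6, 1.0, 0.5))  # half Kelly: 0.1 of bankroll
```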
Katja Grace provides a list of counterarguments to the basic case for existential risk from superhuman AI systems. She examines potential gaps in arguments about AI goal-directedness, AI goals being harmful, and AI superiority over humans. While she sees these as serious concerns, she doesn't find the case for overwhelming likelihood of existential risk convincing based on current arguments.
A key skill of many experts (that is often hard to teach) is keeping track of extra information in their head while working. For example, a programmer might track a Fermi estimate of runtime, or an experienced machine operator might track the machine's internal state. John suggests asking experts "what are you tracking in your head?"
The field of AI alignment is growing rapidly, attracting more resources and mindshare each year. As it grows, more people will be incentivized to misleadingly portray themselves or their projects as more alignment-friendly than they are. Adam proposes "safetywashing" as the term for this.
Nonprofit boards have great power, but low engagement, unclear responsibility, and no accountability. There's also a shortage of good guidance on how to be an effective board member. Holden gives recommendations on how to do it well, but the whole structure is inherently weird and challenging.
People worry about agentic AI with ulterior motives. Some suggest Oracle AI, which only answers questions. But I don't think about agents that way: it killed you because it was optimised, and it used an agent because an agent was an effective tool it had on hand.
Optimality is the tiger, and agents are its teeth.
A look at how we can get caught up in the details and lose sight of the bigger picture. By repeatedly asking "what are we really trying to accomplish here?", we can step back and refocus on what's truly important, whether in our careers, health, or life overall.
In worlds where AI alignment can be handled by iterative design, we probably survive. So if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason. John explores several ways that could happen, beyond just fast takeoff and deceptive misalignment.
Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that even getting a small fraction of resources from such an AI is extremely unlikely.
It's easy and locally reinforcing to follow gradients toward what one might call 'guessing the student's password', and much harder and much less locally reinforcing to reason/test/whatever one's way toward a real art of rationality. Anna Salamon reflects on how this got in the way of CFAR ("Center for Applied Rationality") making progress on their original goals.
Some people believe AI development is extremely dangerous, but are hesitant to directly confront or dissuade AI researchers. The author argues we should be more willing to engage in activism and outreach to slow down dangerous AI progress. They give an example of their own intervention with an AI research group.
In the course of researching optimization, Alex decided that he had to really understand what entropy is. But he found the existing resources (Wikipedia, etc.) so poor that it seemed important to write a better one: other resources were only concerned with the application of the concept in their particular sub-domain. Here, Alex aims to synthesize the abstract concept of entropy and show what's so deep and fundamental about it.
On the 3rd of October 2351 a machine flared to life. Huge energies coursed into it via cables, only to leave moments later as heat dumped unwanted into its radiators. With an enormous puff the machine unleashed sixty years of human metabolic entropy into superheated steam.
In the heart of the machine was Jane, a person of the early 21st century.
Sometimes your brilliant, hyperanalytical friends can accidentally crush your fragile new ideas before they have a chance to develop. Elizabeth shares a strategy she uses to get them to chill out and vibe on new ideas for a bit before dissecting them.
Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits.
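To make the core move concrete, here is a toy sketch of the resampling-ablation idea (a simplification of the actual causal scrubbing algorithm; the toy model and all names below are hypothetical):

```python
# Toy illustration of the causal-scrubbing idea, not the full algorithm from
# the post: given a hypothesis that some intermediate activations don't matter
# for the model's behaviour, replace them with activations recomputed from
# other, randomly chosen inputs and measure how much performance drops.

import numpy as np

rng = np.random.default_rng(0)

# Toy "model": y = relu(x @ W1) @ W2, with the hidden layer as the activation
# we might scrub.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

SCRUBBED_UNITS = [0, 1, 2]           # units the hypothesis claims are irrelevant

def forward(x, hidden_override=None):
    hidden = np.maximum(x @ W1, 0.0)
    if hidden_override is not None:
        # Swap in the "irrelevant" activations from another input.
        hidden = hidden.copy()
        hidden[:, SCRUBBED_UNITS] = hidden_override[:, SCRUBBED_UNITS]
    return hidden @ W2, hidden

X = rng.normal(size=(256, 4))        # evaluation inputs
Y = X.sum(axis=1, keepdims=True)     # toy regression target

def mse(pred):                       # performance metric
    return float(np.mean((pred - Y) ** 2))

baseline_pred, _ = forward(X)

# Resample: recompute the supposedly irrelevant activations on shuffled inputs.
_, donor_hidden = forward(X[rng.permutation(len(X))])
scrubbed_pred, _ = forward(X, hidden_override=donor_hidden)

print("baseline loss:", mse(baseline_pred))
print("scrubbed loss:", mse(scrubbed_pred))
# If the hypothesis were right, the two losses would be close; a large gap
# means the scrubbed activations mattered after all.
```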
How good are modern language models compared to humans at the task language models are trained on (next token prediction on internet text)? We found that humans seem to be consistently worse at next-token prediction (in terms of both top-1 accuracy and perplexity) than even small models like Fairseq-125M, a 12-layer transformer roughly the size and quality of GPT-1.
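For reference, the two metrics mentioned above can be computed like this (illustrative Python with made-up numbers, not the post's evaluation code):

```python
# Top-1 accuracy and perplexity for next-token prediction, on hypothetical data.
import math

# Probability the predictor assigned to the token that actually came next.
p_true = [0.6, 0.05, 0.3, 0.9, 0.02]
# Whether the predictor's single best guess (argmax) was the true token.
top1_correct = [True, False, False, True, False]

top1_accuracy = sum(top1_correct) / len(top1_correct)

# Perplexity = exp(average negative log-likelihood of the true tokens).
avg_nll = -sum(math.log(p) for p in p_true) / len(p_true)
perplexity = math.exp(avg_nll)

print(f"top-1 accuracy: {top1_accuracy:.2f}")  # 0.40 on these made-up numbers
print(f"perplexity:     {perplexity:.2f}")
```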
In 1936, four men attempted to climb the Eigerwand, the north face of the Eiger mountain. Their harrowing story ended in tragedy, with the last survivor dangling from a rope just meters away from rescue before succumbing. Gene Smith reflects on what drives people to take such extreme risks for seemingly little practical benefit.
When tackling difficult, open-ended research questions, it's easy to get stuck. In addition to virtues like open-mindedness and self-criticality, Holden recommends "vices" like laziness, impatience, hubris and self-preservation as antidotes. This post explores the techniques that have worked well for him.
You might feel like AI risk is an "emergency" that demands drastic changes to your life. But is this actually the best way to respond? Anna Salamon explores what kinds of changes actually make sense in different types of emergencies, and what that might mean for how to approach existential risk.
Models don't "get" reward. Reward is the mechanism by which we select parameters, it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment.
Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done.
Lessons from 20+ years of software security experience, perhaps relevant to AGI alignment:
1. Security doesn't happen by accident
2. Blacklists are useless but make them anyway
3. You get what you pay for (incentives matter)
4. Assurance requires formal proofs, which are provably impossible
5. A breach IS an existential risk
What's with all the strange pseudophilosophical questions from AI alignment researchers, like "what does it mean for some chunk of the world to do optimization?" or "how does an agent model a world bigger than itself?". John lays out why some people think solving these sorts of questions is a necessary prerequisite for AI alignment.
Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.
Alignment researchers often propose clever-sounding solutions without citing much evidence that their solution should help. Such arguments can mislead people into working on dead ends. Instead, TurnTrout argues we should focus more on studying how human intelligence implements alignment properties, as it is a real "existence proof" of aligned intelligence.
Holden shares his step-by-step process for forming opinions on a topic, developing and refining hypotheses, and ultimately arriving at a nuanced view - all while focusing on writing rather than just passively consuming information.
Limerence (aka "falling in love") wreaks havoc on your rationality. But it feels so good!
What do?
Do you pass the "onion test" for honesty? If people get to know you better over time, do they find out new things, but not be shocked by the *types* of information that were hidden? A framework for thinking about personal (and institutional) honesty.
The LessWrong post "Theses on Sleep" gained a lot of popularity and acclaim, despite largely consisting of what seemed to Natalia like weak arguments and misleading claims. This critical review lists several of the mistakes Natalia argues were made, and reports some of what the academic literature on sleep seems to show.
How do humans form their values? Shard theory proposes that human values are formed through a relatively straightforward reinforcement process, rather than being hard-coded by evolution. This post lays out the core ideas behind shard theory and explores how it can explain various aspects of human behavior and decision-making.
A new paper proposes an unsupervised way to extract knowledge from language models. The authors argue this could be a key part of aligning superintelligent AIs, by letting us figure out what the AI "really believes" rather than what it thinks humans want to hear. But there are still some challenges to overcome before this could work on future superhuman AIs.
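For readers curious what "unsupervised" means here: the paper (as I understand it) trains a probe on yes/no contrast pairs with a consistency-plus-confidence loss, roughly like the sketch below (simplified; the setup and numbers are hypothetical):

```python
# Rough sketch of the contrast-consistent loss described in the paper, as I
# understand it (simplified; numbers and setup are hypothetical). For each
# yes/no question, a probe scores both the "yes" and "no" versions; it is
# trained so the two probabilities are consistent (sum to 1) and confident
# (not both stuck at 0.5).
import numpy as np

def contrast_consistent_loss(p_pos: np.ndarray, p_neg: np.ndarray) -> float:
    """p_pos / p_neg: probe outputs in (0, 1) for the 'yes' / 'no' versions."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2   # p(yes) should equal 1 - p(no)
    confidence = np.minimum(p_pos, p_neg) ** 2   # penalise the degenerate 0.5/0.5 answer
    return float(np.mean(consistency + confidence))

# Hypothetical probe outputs for three contrast pairs.
print(contrast_consistent_loss(np.array([0.9, 0.2, 0.55]),
                               np.array([0.1, 0.8, 0.50])))
```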
So if you read Harry Potter and the Methods of Rationality, and thought...
"You know, HPMOR is pretty good so far as it goes; but Harry is much too cautious and doesn't have nearly enough manic momentum, his rationality lectures aren't long enough, and all of his personal relationships are way way way too healthy."
...then have I got the story for you!