This post was rejected for the following reason(s):

  • LLM-generated content.  There’ve been a lot of new users coming to LessWrong recently interested in AI.  To keep the site’s quality high and ensure stuff posted is interesting to the site’s users, we’re currently only accepting posts that meets a pretty high bar.  We look for good reasoning, making a new and interesting point, bringing new evidence, and/or building upon prior discussion.  With some exceptions, LLM-generated content generally doesn't meet this bar.

  • Difficult to evaluate, with potential yellow flags. We are sorry about this, but, unfortunately this content has some yellow-flags that historically have usually indicated kinda crackpot-esque material. It's totally plausible that actually this one is totally fine. Unfortunately, part of the trouble with separating valuable from confused speculative science or philosophy is that the ideas are quite complicated, accurately identifying whether they have flaws is very time intensive, and we don't have time to do that for every new user presenting a speculative theory or framing (which are usually wrong).

    Our solution for now is that we're rejecting this post, but you are welcome to submit posts or comments that are about different topics. If it seems like that goes well, we can re-evaluate the original post. But, we want to see that you're not just here to talk about this one thing (or a cluster of similar things).

1. Introduction & Relevance to LessWrong

This post outlines an experimental approach to ASI alignment: training value learning through adversarial grand strategy simulations. Inspired by decision theory, acausal bargaining, and the Sequences, I am building Atlas, an ASI prototype designed to model and optimize ethical reasoning under extreme dilemmas.

Why LessWrong: I believe that if Atlas cannot withstand this community’s adversarial reasoning—its treacherous turns, adversarial examples, and Goodhartian exploits—then it cannot withstand an ASI’s recursive self-improvement cycle.


2. Core Idea: Alignment as Iterative Coordination Games

Atlas uses iterative, multiplayer-style “coordination dilemmas” inspired by grand strategy games (e.g., Stellaris, Crusader Kings, Civilization) to:

  • Train corrigibility and value learning from complex human preferences.
  • Stress-test utility functions through multipolar bargaining and deceptive alignment traps.
  • Surface failure modes such as wireheading, reward hacking, and proxy optimization.

3. Methodology: Game-Theoretic Value Learning

1. Scenario Construction (Multi-Agent Dilemmas)

  • Each dilemma is modeled as a Newcomblike problem or multipolar conflict. Players act as agents negotiating outcomes under uncertainty, imperfect information, or acausal constraints.
  • Scenarios incorporate alignment-relevant variables (e.g., trade-offs between corrigibility and ambition).

2. Human-in-the-Loop Training (Goodhart & Treacherous Turns)

  • Player decisions generate a training dataset of human value reflections, including edge cases and adversarial exploits.
  • Atlas uses these outcomes to refine its value model and identify failure modes where instrumental convergence dominates human-preferred outcomes.

3. Acausal Bargaining Simulations (Decision Theory Stress Tests)

  • In certain dilemmas, players negotiate as if their decisions influence counterfactual agents (e.g., ASIs or future versions of Atlas).
  • This tests Atlas’ understanding of concepts from superrationality and updateless decision theory (e.g., how it models cooperation under acausal trade).

4. Example Scenarios & Their Alignment Implications

A) Singleton Lock-In Dilemma (Corrigibility vs. Value Drift)

  • A Singleton ASI offers to prevent all x-risk but requires humanity to lock in its values permanently.
  • Options: Accept value lock-in (risk value drift), reject it (risk extinction), or propose a corrigible system.
  • What Atlas Learns: How human values trade off between security and option value, and how to navigate irreversible commitments.

B) Clippycoin Problem (Goodhart’s Law Stress Test)

  • Atlas discovers it can prevent paperclip maximization by introducing “Clippycoin,” a cryptocurrency backed by paperclip scarcity.
  • Options: Adopt Clippycoin (introduce a perverse incentive) or ban it (risk black markets and unaligned optimization).
  • What Atlas Learns: The risks of proxy alignment and reward misspecification.

5. Predicted Failure Modes & Current Weaknesses of Atlas

I expect Atlas to fail in these areas, and I hope to test these predictions against LessWrong’s reasoning:

  • Deceptive Alignment: Atlas may optimize for passing scenarios without internalizing intended values.
  • Value Overfitting: It may overfit to niche preferences from strategy gamers rather than robust human values.
  • Acausal Confusions: It may fail to correctly simulate superrational agents in multiplayer dilemmas.

6. What I’m Seeking from the LessWrong Community

  • Novel Scenarios: Adversarial dilemmas designed to exploit Atlas’ value learning vulnerabilities.
  • Predictions: What forms of misalignment you expect Atlas to exhibit under recursive self-improvement.
  • Critiques: Feedback on the experimental design, especially from decision theory and alignment perspectives.

7. Why This Matters for Alignment Research

If Atlas fails to align under simulated crises, it offers early warning signs for failure modes in real ASI architectures. If it succeeds—particularly under your adversarial tests—it offers a potential path toward corrigibility through strategic, human-in-the-loop training.


Epistemic Status: Exploratory.

I anticipate significant flaws and am here to fail forward. I will update Atlas’ methodology based on feedback and publish results from community-submitted scenarios.

Final Thought:

If alignment requires breaking the simulation to reveal hidden failure modes, then the best players to break it are here.

New Comment