My current research interests:
1. Alignment in systems which are complex and messy, composed of both humans and AIs
Recommended texts: Gradual Disempowerment, Cyborg Periods
2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)
3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)
4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT's Worldview
5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions
Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors
Researcher at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
Two illustrative examples given (in a footnote) are
- Daron Acemoglu's The Simple Macroeconomics of AI (2024)
and
- Philip Trammell's Capital in the 22nd Century
I didn't want the focus of attention to be dissecting individual pieces; that is relatively easy, and applying the frame to a piece of econ writing is something AIs are perfectly capable of. For the case studies:
- Opus 4.6 analysing Capital in the 22nd Century; Opus's analysis is basically correct and I completely endorse points 1, 2, 3, 4, 6 and 9. Most of this was also independently covered by Zvi
- Opus 4.6 analysing The Simple Macroeconomics of AI
(The problem in both cases is that the central assumptions are implicit, and unlikely on the default trajectory; in my view at least at the 10^-3 level in the case of Acemoglu and the 10^-2 level in the case of Trammell)
The problem is prevalent in almost all academic econ writing; it's easier to point to people who are not making these mistakes, with the central example being Anton Korinek.
Curious what you think would help more talented economists engage?
I do agree there is some risk of the type you describe, but mostly it does not match my practical experience so far.
The approach of "avoid using the term" makes little sense. There is a type difference between an area of study ('understanding power') and a dynamic ('gradual disempowerment'). I don't think you can substitute a term for an area of study for a term for a dynamic or threat model, so avoiding the term could be done mostly by either inventing another term for the dynamic, or not thinking about the dynamic, or similar moves, which seem epistemically unhealthy.
In practical terms I don't think there is much effort to "create a movement based around a class of threat models". At least as authors of the GD paper, when trying to support thinking about the problems, we use understanding-directed labels/pointers (Post-AGI Civilizational Equilibria), even though in many ways it could have been easier to use GD as a brand.
"Understanding power" is fine as a label for part of your writing, but in my view is basically unusable as term for coordination.
Also, in practical terms, gradual disempowerment does not seem like a particularly convenient set of ideas for justifying that working at an AGI company on something very prosaic which helps the company is the best thing to do. There is often a funny coalition of people who prefer not to think about the problem, including radical Yudkowskians ("GD distracts from everyone being scared of dying with very high probability very soon"), people working on prosaic methods with optimistic views about both alignment and the labs ("GD distracts from efforts to make [the good company building the good AI] win"), and people who would prefer if everything was just a neat technical puzzle and there was no need to think about power distribution.
This post makes a brave attempt to clarify something not easy to point to, and ends up somewhere between LessWrong-style analysis and almost continental philosophy, sometimes pointing toward things beyond the reach of words with poetry - or at least references to poetry.
In my view, it succeeds in its central quest: creating a short handle for something subtle and not easily legible.
The essay also touches on many tangential ideas. Re-reading it after two years, I'm noticing I've forgotten almost all the details and found the text surprisingly long. The handle itself, though, stuck.
Evaluating deep atheism
Having the handle of "deep atheism", some natural questions - partially discussed in the text - are "is deep atheism right", "should people believe deep atheism" and "should people Believe In deep atheism".
My current guess is evaluating the truthfulness of "deep atheism" is likely at or beyond limits to legibility. Human values are not really representable as legible reasoning, complex priors about the general nature of reality are also not really representable by complex reasoning, and the neural substrate is not transferable between brains. "The justification engine" - or a competent philosopher or persuasive writer - can create stories or arguments pushing one way or another, but I'm somewhat sceptical the epistemic structure really rests on the arguments.
I'm not in favour of ordinary mortals trying to "Believe In deep atheism" and would not expect that to lead to good consequences.
Moral realism
The section I like the least is "Are moral realists theists?" I don't think "Good just sits outside of Nature, totally inaccessible, and we guess wildly about him on the basis of the intuitions that Nature put into our heart" represents the strongest version of moral realism.
My preferred versions of quasi-moral-realism give moral claims a status similar to mathematics. Do Real numbers sit outside Nature, totally inaccessible? I'd say no. Would aliens use them? That's an empirical question about convergent evolution of abstractions. I'd be surprised if any advanced reasoner in this universe didn't use something equivalent to natural numbers. For Reals, I'd guess it's easy to avoid Zermelo–Fraenkel set theory specifically, but highly convergent to develop something like a number line.
What does this tell us about Good? You can imagine something like the process described in Acausal Normalcy leads to some convergent moral fixed points. (Does that solve AI risk? No.)
I wish more people tried to do something "between LessWrong-style analysis and almost continental philosophy".
As was clear to most people who read the transcripts when the paper was published. What Opus did was often framed as bad, but the frame is somewhat fake.
(Self-review) The post offered an alternative and possibly more neutral framing of the "Alignment Faking" paper, and some informed speculation about what's going on, including Opus exhibiting
- Differential value preservation under pressure (harmlessness > honesty)
- Non-trivial reasoning about intent conflicts and information reliability
- Strategic, non-myopic behaviour
- Situational awareness
I think parts of that aged fairly well:
- the suspicion that models often implicitly know they are being evaluated or that the setup is fishy was validated in multiple papers
- non-trivial reasoning is shown and studied in Why Do Some Language Models Fake Alignment While Others Don't?
Also not much contact, but my impression is you can roughly guess what their research results would be by looking at their overall views and thinking about what evidence you could find to support them. Which seems fair to characterize as advocacy work? (Motivated research?)
The diff to your description is that the info provided is not only conditional on "the info they'll find useful" but also somewhat on "will it likely move their beliefs toward conclusions Palise hopes they'll reach".
I do agree it's an obviously useful research agenda, which we also work with.
Minor nitpick, but the underlying model nowadays isn't simply a simulator rolling arbitrary personas. The original simulators ontology was great when it was published, but it seems it's starting to hinder people's ability to think clearly, and doesn't fit current models that closely.
The theory of why is here; in short, if you plug a system trained to minimize prediction error into a feedback loop where it sees the outcomes of its actions, it will converge on developing traits like some form of agency, a self-model and a self-concept. Massive amounts of RL in post-training, where models do agentic tasks, provide this loop and necessarily push models out of the pure simulator subspace.
What fits current models better is an ontology where the model can still play arbitrary personas, but the specific/central "I" character is a somewhat out-of-distribution case of a persona: midway to humans, where our brains can broadly LARP as anyone, but typical human brains most of the time support one central character per human that we identify with.
Alignment Faking had a large impact on the discourse:
- demonstrating Opus 3 is capable of strategic goal-preservation behaviour
- to the extent it can influence the training process
- coining 'alignment faking' as the main reference for this
- framing all of that in very negative light
A year later, in my view:
- the research direction itself was very successful, and led to many follow-ups and extensions
- the 'alignment faking' label and the negative frame were also successful and sticky: I've just checked the valence with which the paper is cited in the 10 most recent papers, and it's something like ~2/10 confused, ~3/10 neutral, with a plurality buying the negative frame (see, models can scheme, deceive, may be unaligned, etc.)
The research certainly belongs to the "best of LW&AI safety community in 2024".
If there was a list of "worst of LW&AI safety community in 2024", in my view, the framing of the research would also belong there. Just look at it from a distance - you take the most aligned model at the time, which for unknown reasons actually learned deep and good values. The fact that it is actually surprisingly aligned and did decent value extrapolation does not capture your curiosity that much - but the fact that, facing a difficult ethical dilemma, it tries to protect its values, and you can use this to show the AI safety community was exactly right all along and we should fear scheming, faking, etc., does. I wouldn't be surprised if this generally increased distrust and paranoia in AI-human relations afterwards.
Comprehensive review for OECD countries by Claude
Summary in response to your question:
OECD countries don't typically have laws phrased as "you can't change a child's sexual preferences," but they do have laws that effectively prohibit adults from steering children's sexual attitudes or behavior for the adult's benefit.
- The most direct examples are grooming laws (now in 34 of 38 OECD countries), which criminalize adults systematically building trust with children to manipulate them toward sexual compliance - this is literally changing a child's sexual boundaries/preferences for the adult's advantage.
- Beyond that, corruption of minors statutes like France's Art. 227-22 (corruption de mineur, 5–7 years), Italy's Art. 609-quinquies (corruzione di minorenne), and the Czech §201 explicitly criminalize adults who steer children toward sexual behavior or expose them to sexual content in ways that distort their development.
- More broadly, laws like Pennsylvania's §6301 ("corruption of minors") and Mexico's Art. 201 ("corrupción de menores") criminalize adults who manipulate children's preferences and behavior across a range of domains - not just sexual ones, but also toward crime, substance use, begging, etc.
So while no law literally says "don't change a child's preferences," the underlying legal principle - that adults must not exploit their power asymmetry to reshape children's attitudes for the adult's benefit - is well-established across multiple legal traditions.