Crossposted to the EA Forum
Introduction
Imagine you are tasked with curing a disease which hasn’t appeared yet. Setting aside why you would know about such a disease’s emergence in the future, how would you go about curing it? You can’t directly or indirectly gather data about the disease itself, as it doesn’t exist yet; you can’t try new drugs to see if they work, as the disease doesn’t exist yet; you can’t do anything that is experimental in any way on the disease itself, as… you get the gist. You would be forgiven for thinking that there is nothing you can do about it right now.
AI Alignment looks even more hopeless: it’s about solving a never-before seen problem on a technology which doesn’t exist yet.
Yet researchers are actually working on it! There are papers, books, unconferences, whole online forums full of alignment research, both conceptual and applied. Some of these work in a purely abstract and theoretical realm, others study our best current analogous of Human Level AI/Transformative AI/AGI, still others iterate on current technologies that seem precursors to these kinds of AI, typically large language models. With so many different approaches, how can we know if we’re making progress or just grasping at straws?
Intuitively, the latter two approaches sound more like how we should produce knowledge: they take their epistemic strategies (ways of producing knowledge) out of Science and Engineering, the two cornerstones of knowledge and technology in the modern world. Yet recall that in alignment, models of the actual problem and/or technology can’t be evaluated experimentally, and one cannot try and iterate on proposed solutions directly. So when we take inspiration from Science and Engineering (and I want people to do that), we must be careful and realize that most of the guarantees and checks we associate with both are simply not present in alignment, for the decidedly pedestrian reason that Human Level AI/Transformative AI/AGI doesn’t exist yet.
I thus claim that:
- Epistemic strategies from Science and Engineering don’t dominate other strategies in alignment research.
- Given the hardness of grounding knowledge in alignment, we should leverage every epistemic strategy we can find and get away with.
- These epistemic strategies should be made explicit and examined. Both the ones taken or adapted from Science and Engineering, and every other one (for example the more theoretical and philosophical strategies favored in conceptual alignment research).
This matters a lot, as it underlies many issues and confusions in how alignment is discussed, taught, created and criticized. Having such a diverse array of epistemic strategies is fascinating, but their implicit nature makes it challenging to communicate with newcomers, outsiders, and even fellow researchers leveraging different strategies. Here is a non-exhaustive list of issues that boil down to epistemic strategy confusion:
- There is a strong natural bias towards believing that taking epistemic strategies from Science and Engineering automatically leads to work that is valuable for alignment.
- This leads to a flurry of work (often by well-meaning ML researchers/engineers) that doesn’t tackle or help with alignment, and might push capabilities (in an imagined tradeoff with the assumed alignment benefits of the work)
- Very important: that doesn’t mean I consider work based on ML as useless for alignment. Just that the sort of ML-based work that actually tries to solve the problem tends to be by people who understand that one must be careful when transfering epistemic strategies to alignment.
- There is a strong natural bias towards disparaging any epistemic strategy that doesn’t align neatly with the main strategies of Science and Engineering (even when the alternative epistemic strategies are actually used by scientists and engineers in practice!)
- This leads to more and more common confusion about what’s happening on the Alignment Forum and conceptual alignment research more generally. And that confusion can easily turn into the sort of criticism that boils down to “this is stupid and people should stop doing that”.
- It’s quite hard to evaluate whether alignment research (applied or conceptual) creates the kind of knowledge we’re after, and helps move the ball forward. This comes from both the varieties of epistemic strategies and the lack of the usual guarantees and checks when applying more mainstream ones (every observation and experiment is used through analogies or induction to future regimes).
- This makes it harder for researchers using different sets of epistemic strategies to talk to each other and give useful feedback to one another.
- Criticism of some approach or idea often stops at the level of the epistemic strategy being weird.
- This happens a lot with criticism of lack of concreteness and formalism and grounding for Alignment Forum posts.
- It also happens when applied alignment research is rebuked solely because it uses current technology, and the critics have decided that it can’t apply to AGI-like regimes.
- Teaching alignment without tackling this pluralism of epistemic strategies, or by trying to fit everything into a paradigm using only a handful of those, results in my experience in people who know the lingo and some of the concepts, but have trouble contributing, criticizing and teaching the ideas and research they learnt.
- You can also end up with a sort of dogmatism that alignment can only be done a certain way.
Note that most fields (including many sciences and engineering disciplines) also use weirder epistemic strategies. So do many attempts at predicting and directing the future (think existential risk mitigation in general). My point is not that alignment is a special snowflake, more that it’s both weird enough (in the inability to experiment and iterate directly with) and important enough that elucidating the epistemic strategies we’re using, finding others and integrating them is particularly important.
In the rest of this post, I develop and unfold my arguments in more detail. I start with digging deeper into what I mean by the main epistemic strategies of Science and Engineering, and why they don’t transfer unharmed to alignment. Then I demonstrate the importance of looking at different epistemic strategies, by focusing on examples of alignment results and arguments which make most sense as interpreted through the epistemic lens of Theoretical Computer Science. I use the latter as an inspiration because I adore that field and because it’s a fertile ground for epistemic strategies. I conclude by pointing at the sort of epistemic analyses I feel are needed right now.
Lastly, this post can be seen as a research agenda of sorts, as I’m already doing some of these epistemic analyses, and believe this is the most important use of my time and my nerdiness about weird epistemological tricks. We roll with what evolution’s dice gave us.
Thanks to Logan Smith, Richard Ngo, Remmelt Ellen,, Alex Turner, Joe Collman, Ruby, Antonin Broi, Antoine de Scorraille, Florent Berthet, Maxime Riché, Steve Byrnes, John Wentworth and Connor Leahy for discussions and feedback on drafts.
Science and Engineering Walking on Eggshells
What is the standard way of learning how the world works? For centuries now, the answer has been Science.
I like Peter Godfrey-Smith’s description in his glorious Theory and Reality:
Science works by taking theoretical ideas and trying to find ways to expose them to observation. The scientific strategy is to construe ideas, to embed them in surrounding conceptual frameworks, and to develop them, in such a way that this exposure is possible even in the case of the most general and ambitious hypotheses about the universe.
That is, the essence of science is in the trinity of modelling, predicting and testing.
This doesn’t mean there are no epistemological subtleties left in modern science; finding ways of gathering evidence, of forcing the meeting of model and reality, often takes incredibly creative turns. Fundamental Chemistry uses synthesis of molecules never seen in nature to test the edge cases of its models; black holes are not observable directly, but must be inferred by a host of indirect signals like the light released by matter falling in the black hole or gravitational waves during black holes merging; Ancient Rome is probed and explored through a discussion between textual analysis and archeological discoveries.
Yet all of these still amount to instantiating the meta epistemic strategy “say something about the world, then check if the world agrees”. As already pointed out, it doesn’t transfer straightforwardly to alignment because Human Level AI/Transformative AI/AGI doesn’t exist yet.
What I’m not saying is that epistemic strategies from Science are irrelevant to alignment. But because they must be adapted to tell us something about a phenomenon that doesn’t exist yet, they lose their supremacy in the creation of knowledge. They can help us to gather data about what exists now, and to think about the sort of models that are good at describing reality, but checking their relevance to the actual problem/thinking through the analogies requires thinking in more detail about what kind of knowledge we’re creating.
If we instead want more of a problem solving perspective, tinkering is a staple strategy in engineering, before we know how to solve the problem things reliably. Think about curing cancer or building the internet: you try the best solutions you can think of, see how they fail, correct the issues or find a new approach, and iterate.
Once again, this is made qualitatively different in alignment by the fact that neither the problem nor the source of the problem exist yet. We can try to solve toy versions of the problem, or what we consider analogous situations, but none of our solutions can be actually tested yet. And there is the additional difficulty that Human-level AI/Transformative AI/AGI might be so dangerous that we have only one chance to implement the solution.
So if we want to apply the essence of human technological progress, from agriculture to planes and computer programs, just trying things out, we need to deal with the epistemic subtleties and limits of analogies and toy problems.
An earlier draft presented the conclusion of this section as “Science and Engineering can’t do anything for us—what should we do?” which is not my point. What I’m claiming is that in alignment, the epistemic strategies from both Science and Engineering are not as straightforward to use and leverage as they usually are (yes, I know, there’s a lot of subtleties to Science and Engineering anyway). They don’t provide privileged approaches demanding minimal epistemic thinking in most cases; instead we have to be careful how we use them as epistemic strategies. Think of them as tools which are so good that most of the time, people can use them without thinking about all the details of how the tools work, and get mostly the wanted result. My claim here is that these tools need to be applied with significantly more care in alignment, where they lose their “basically works all the time” status.
Acknowledging that point is crucial for understanding why alignment research is so pluralistic in terms of epistemic strategies. Because no such strategy works as broadly as we’re used to for most of Science and Engineering, alignment has to draw from literally every epistemic strategy it can pull, taking inspiration from Science and Engineering, but also philosophy, pre-mortems of complex projects, and a number of other fields.
To show that further, I turn to some main alignment concepts which are often considered confusing and weird, in part because they don’t leverage the most common epistemic strategies. I explain these results by recontextualizing them through the lens of Theoretical Computer Science (TCS).
Epistemic Strategies from TCS
For those who don’t know the field, Theoretical Computer Science focuses on studying computation. It emerged from the work of Turing, Church, Gödel and others in the 30s, on formal models of what we would now call computation: the process of following a step-by-step recipe to solve a problem. TCS cares about what is computable and what isn’t, as well as how much resources are needed for each problem. Probably the most active subfield is Complexity Theory, which cares about how to separate computational problems in classes capturing how many resources (most often time – number of steps) are required for solving them.
What makes TCS most relevant to us is that theoretical computer scientists excel at wringing knowledge out of the most improbable places. They are brilliant at inventing epistemic strategies, and remember, we need every one we can find for solving alignment.
To show you what I mean, let’s look at three main ideas/arguments in alignment (Convergent Subgoals, Goodhart’s Law and the Orthogonality Thesis) through some TCS epistemological strategies.
Convergent Subgoals and Smoothed Analysis
One of the main argument for AI Risk and statement of a problem in alignment is Nick Bostrom’s Instrumental Convergence Thesis (which also takes inspiration from Steve Omohundro’s Basic AI Drives):
Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by many intelligent agents.
That is, actions/plans exist which help with a vast array of different tasks: self-preservation, protecting one’s own goal, acquiring resources... So a Human-level AI/Transformative AI/AGI could take them while still doing what we asked it to do. Convergent subgoals are about showing that behaviors which look like they can only emerge from rebellious robots actually can be pretty useful for obedient (but unaligned in some way) AI.
What kind of argument is that? Bostrom makes a claim about “most goals”—that is, the space of all goals. His claim is that convergent subgoals are so useful that goal-space is almost chock-full of goals incentivizing convergent subgoals.
And more recent explorations of this argument have followed this intuition: Alex Turner et al.’s work on power-seeking formalizes the instrumental convergence thesis in the setting of Markov decision processes (MDP) and reward functions by looking, for every “goal” (a distribution over reward functions) at the set of all its permuted variants (the distribution given by exchanging some states – so the reward labels stay the same, but are not put on the same states). Their main theorems state that given some symmetry properties in the environment, a majority (or a possibly bigger fraction) of the permuted variant of every goal will incentivize convergent subgoals for its optimal policies.
So this tells us that goals without convergent subgoals exist, but they tend to be crowded out by ones with such subgoals. Still, it’s very important to realize what neither Bostrom nor Turner are arguing for: they’re not saying that every goal has convergent subgoals. Nor are they claiming to have found the way humans sample goal-space, such that their results imply goals with convergent subgoals must be sampled by humans with high probability. Instead, they show the overwhelmingness of convergent subgoals in some settings, and consider that a strong indicator that avoiding them is hard.
I see a direct analogy with the wonderful idea of smoothed analysis in complexity theory. For a bit of background, complexity theory generally focuses on the worst case time taken by algorithms. That means it mostly cares about which input will take the most time, not about the average time taken over all inputs. The latter is also studied, but it’s nigh impossible to find the distribution of input actually used in practice (and some problems studied in complexity theory are never used in practice, so a meaningful distribution is even harder to make sense of). Just like it’s very hard to find the distribution from which goals are sampled in practice.
As a consequence of the focus on worst case complexity, some results in complexity theory clash with experience. Here we focus on the simplex algorithm, used for linear programming: it runs really fast and well in practice, despite having provable exponential worst case complexity. Which in complexity-theory-speak means it shouldn’t be a practical algorithm.
Daniel Spielman and Shang-Hua Teng had a brilliant intuition to resolve this inconsistency: what if the worst case inputs were so rare that just a little perturbation would make them easy again? Imagine a landscape that is mostly flat, with some very high but very steep peaks. Then if you don’t land exactly on the peaks (and they’re so pointy that it’s really hard to get there exactly), you end up on the flat surface.
This intuition yielded smoothed analysis: instead of just computing the worst case complexity, we compute the worst case complexity averaged over some noise on the input. Hence the peaks get averaged with the flatness around them and have a low smoothed time complexity.
Convergent subgoals, especially in Turner’s formulation, behave in an analogous way: think of the peaks as goals without convergent subgoals; to avoid power-seeking we would ideally reach one of them, but their rarity and intolerance to small perturbations (permutations here) makes it really hard. So the knowledge created is about the shape of the landscape, and leveraging the intuition of smoothed analysis, that tells us something important about the hardness of avoiding convergent subgoals.
Note though that there is one aspect in which Turner’s results and the smoothed analysis of the simplex algorithm are complete opposite: in the former the peaks are what we want (no convergent subgoals) while in the latter they’re what we don’t want (inputs that take exponential time to treat). This inversion doesn’t change the sort of knowledge produced, but it’s an easy source of confusion.
epistemic analysis isn’t only meant for clarifying and distilling results: it can and should pave the way to some insights on how the arguments could fail. Here, the analogy to smoothed complexity and the landscape picture suggests that Bostrom and Turner’s argument could be interrogated by:
- Arguing that the sampling method we use in practice to decide tasks and goals for AI targets specifically the peaks.
- Arguing that the sampling is done in a smaller goal-space for which the peaks are broader
- For Turner’s version, one way of doing that might be to not consider the full orbit, but only the close-ish variations of the goals (small permutations instead of all permutations). So questioning the form of noise over the choice of goals that is used in Turner’s work.
Smoothed Analysis | Convergent Subgoals (Turner et al) |
Possible Inputs | Possible Goals |
Worst-case inputs make steep and rare peaks | Goals without convergent subgoals make steep and rare peaks |
Goodhart’s Law and Distributed Impossibility/Hardness Results
Another recurrent concept in alignment thinking is Goodhart’s law. It wasn’t invented by alignment researchers, but Scott Garrabrant and David Manheim proposed a taxonomy of its different forms. Fundamentally, Goodhart’s law tells us that if we optimize a proxy of what we really want (some measure that closely correlates with the wanted quantity in the regime we can observe), strong optimization tends to make the two split apart, meaning we don’t end up with what we really wanted. For example, imagine that everytime you go for a run you put on your running shoes, and you only put on these shoes for running. Putting on your shoes is thus a good proxy for running; but if you decide to optimize the former in order to optimize the latter, you will take most of your time putting on and taking off your shoes instead of running.
In alignment, Goodhart is used to argue for the hardness of specifying exactly what we want: small discrepancies can lead to massive differences in outcomes.
Yet there is a recurrent problem: Goodhart’s law assumes the existence of some True Objective, of which we’re taking a proxy. Even setting aside the difficulties of defining what we really want at a given time, what if what we really want is not fixed but constantly evolving? Thinking about what I want nowadays, for example, it’s different from what I wanted 10 years ago, despite some similarities. How can Goodhart’s law apply to my values and my desires if there is not a fixed target to reach?
Salvation comes from a basic insight when comparing problems: if problem A (running a marathon) is harder than problem B (running 5 km), then showing that the latter is really hard, or even impossible, transfers to the former.
My description above focuses on the notion of one problem being harder than another. TCS formalizes this notion by saying the easier problem is reducible to the harder one: a solution for the harder one lets us build a solution for the easier problem. And that’s the trick: if we show there is no solution for the easier problem, this means that there is no solution for the harder one, or such a solution could be used to solve the easier problem. Same thing with hardness results which are about how difficult it is to solve a problem.
That is, when proving impossibility/hardness, you want to focus on the easiest version of the problem for which the impossibility/hardness still holds.
In the case of Goodhart’s law, this can be used to argue that it applies to moving targets because having True Values or a True Objective makes the problem easier. Hitting a fixed target sounds simpler than hitting a moving or shifting one. If we accept that conclusion, then because Goodhart’s law shows hardness in the former case, it also does in the latter.
That being said, whether the moving target problem is indeed harder is debatable and debated. My point here is not to claim that this is definitely true, and so that Goodhart’s law necessarily applies. Instead, it’s to focus the discussion on the relative hardness of the two problems, which is what underlies the correctness of the epistemic strategy I just described. So the point of this analysis is that there is another argument to decide the usefulness of Goodhart’s law in alignment than debating the existence of True Value
Running | Alignment | |
Easier Problem | 5k | Approximating a fixed target (True Values) |
Harder Problem | Marathon | Approximating a moving target |
Orthogonality Thesis and Complexity Barriers
My last example is Bostrom’s Orthogonality Thesis: it states that goals and competence are orthogonal, meaning that they are independent– a certain level of competence doesn’t force a system to have only a small range of goals (with some subtleties that I address below).
That might sound only too general to really be useful for alignment, but we need to put it in context. The Orthogonality Thesis is a response to a common argument for alignment-by-default: because a Human-level AI/Transformative AI/AGI would be competent, it should realize what we really meant/wanted, and correct itself as a result. Bostrom points out that there is a difference between understanding and caring. The AI understanding our real intentions doesn’t mean it must act on that knowledge, especially if it is programmed and instructed to follow our initial commands. So our advanced AI might understand that we don’t want it to follow convergent subgoals while maximizing the number of paperclips produced in the world, but what it cares about is the initial goal/command/task of maximizing paperclips, not the more accurate representation of what we really meant.
Put another way, if one wants to prove alignment-by-default, the Orthogonality thesis argues that competence is not enough. As it is used, it’s not so much a result about the real world, but a result about how we can reason about the world. It shows that one class of arguments (competence will lead to human values) isn’t enough.
Just like some of the weirdest results in complexity theory: the barriers to P vs NP. This problem is one of the biggest and most difficult open questions in complexity theory: settling formally the question of whether the class of problems which can tractably be solved (P for Polynomial time) is equal to the class of problems for which solutions can be tractably checked (NP for Non-deterministic Polynomial time). Intuitively those are different: the former is about creativity, the second about taste, and we feel that creating something of quality is harder than being able to recognize it. Yet a proof of this result (or its surprising opposite) has evaded complexity theorists for decades.
That being said, recall that theoretical computer scientists are experts at wringing knowledge out of everything, including their inability to prove something. This resulted in the three barriers to P vs NP: three techniques from complexity theory which have been proved to not be enough by themselves for showing P vs NP or its opposite. I won’t go into the technical details here, because the analogy is mostly with the goal of these barriers. They let complexity theorists know quickly if a proof has potential – it must circumvent the barriers somehow.
The Orthogonality thesis plays a similar role in alignment: it’s an easy check for the sort of arguments about alignment-by-default that many people think of when learning about the topic. If they extract alignment purely from the competence of the AI, then the Orthogonality Thesis tells us something is wrong.
What does this mean for criticism of the argument? That what matters when trying to break the Orthogonality Thesis isn’t its literal statement, but whether it still provides a barrier to alignment-by-default. Bostrom himself points out that the Orthogonality Thesis isn’t literally true in some regimes (for example some goals might require a minimum of competence) but that doesn’t affect the barrier nature of the result.
Barriers to P vs NP | Orthogonality Thesis |
Proof techniques that are provably not enough to settle the question | Competence by itself isn’t enough to show alignment-by-default |
Improving the Epistemic State-of-the-art
Alignment research aims at solving a never-before-seen problem caused by a technology that doesn’t exist yet. This means that the main epistemic strategies from Science and Engineering need to be adapted if used, and lose some of their guarantees and checks. In consequence, I claim we shouldn’t only focus on those, but use all epistemic strategies we can find to produce new knowledge. This is already happening to some extent, causing both progress on many fronts but also difficulties in teaching, communicating and criticizing alignment work.
In this post I focused on epistemological analyses drawing from Theoretical Computer Science, both because of my love for it and because it fits into the background of many conceptual alignment researchers. But many different research directions leverage different epistemic strategies, and those should also be studied and unearthed to facilitate learning and criticism.
More generally, the inherent weirdness of alignment makes it nigh impossible to find one unique methodology of doing it. We need everything we can get, and that implies a more pluralistic epistemology. Which means the epistemology of different research approaches must be considered and studied and made explicit, if we don’t want to be confusing for the rest of the world, and each other too.
I’m planning on focusing on such epistemic analyses in the future, both for the main ideas and concepts we want to teach to new researchers and for the state-of-the art work that needs to be questioned and criticized productively.
OK, here is the promised list formal methods based work which has advanced the field of AGI safety. So these are specific examples to to back up my earlier meta-level remarks where I said that formal methods are and have been useful for AGI safety.
To go back to the Wikipedia quote:
There are plenty of people in CS who are inspired by this thought: if we can design safer bridges by using the tool of mathematical analysis, why not software?
First, some background comments about formal methods in ML
In modern ML, there are a lot of people who use mathematical analysis in an attempt to better understand or characterize the performance of ML algorithms. Such people have in fact been there since the start of AI as a field.
I am attending NeurIPS 2021, and there are lots of papers which are all about this use of formal methods, or have sections which are all about this. To pick a random example, there is 'Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss' (preprint here) where the abstract says things like
(I have not actually read the paper. I am just using this as a random example.)
In general, my sense is that in the ML community right now, there is a broad feeling that mathematical analysis still needs to have some breakthrough before it can say truly useful things, truly useful things when it comes to safety engineering, for AIs that use deep neural networks.
If you read current ML research papers on 'provable guarantees' and then get very unexited, or even deeply upset at how many simplifying assumptions the authors are making in order to allow for any proofs at all, I will not blame you. My sense is also that different groups of ML researchers disagree about how likely such future mathematical breakthroughs will be.
As there is plenty of academic and commercial interest in this formal methods sub-field of ML, I am not currently working in it myself, and I do not follow it closely. As an independent researcher, I have the freedom to look elsewhere.
I am looking in a different formal methods corner, one where I abstract away from the details of any specific ML algorithm. I then try to create well-defined safety mechanisms which produce provable safety properties, properties which apply no matter what ML algorithm is used. This ties back into what I said earlier:
The list
So here is an annotated list of some formal methods based work which has, in my view, advanced the field of AGI safety. (To save me some time, I will not add any hyperlinks, pasting the title into google usually works.)
Hutter's AIXI model, 'Universal algorithmic intelligence: A mathematical top down approach' models superintelligence in a usefully universal way.
I need to point out however that the age-old π∗ optimal-policy concept in MDPs also models superintelligence in a usefully universal way, but at a higher level of abstraction.
MIRI's 2015 paper 'Corrigibility' makes useful contributions to defining provable safety properties, though it then also reports failure in the effort of attempting to design an agent that provably satisfies them. At a history-of-ideas level, what it interesting here is that 2015 MIRI seemed to have been very much into using formal methods to make progress on alignment, but 2021 MIRI seems to feel that there is much less promise to the approach. Or at least, going by recently posted discussions, 2021 Yudkowsky seems to have lost all hope that these methods will end up saving the world. My own opinion is much more like 2015 MIRI than it is like 2021 MIRI. Modulo of course the idea that people meditating on HPMOR is a uniquely promising approach to solving alignment. I have never much felt he appeal of that approach myself. I think you have argued recently that 2021 Yudkowsky has also become less enthusiastic about the idea.
The above 2015 MIRI paper owes a lot to Armstrong's 2015 'Motivated value selection for artificial agents', the first official academic-style paper about Armstrong's indifference methods. This paper arguably gets a bit closer to describing an agent design that provably meets the safety properties in the MIRI paper. The paper gets closer if we squint just right and interpret Armstrong's 2005 somewhat ambiguous conditional probability notation for defining counterfactuals in just the right way.
My 2019 paper 'Corrigibility with Utility Preservation' in fact can be read to show how you need to squint in just the right way if you want to use Armstrong's indifference methods to create provable corrigibility related safety properties. The paper also clarifies a lot of questions about agent self-knowledge and emerging self-preservation incentives. Especially because of this second theme, the level of math in this paper tends to be more than what most people can or want to deal with.
In my 2020 'AGI Agent Safety by Iteratively Improving the Utility Function' I use easier and more conventional MDP math to define a corrigible (for a definition of corrigibility) agent using Armstrong's indifference methods as a basis. This paper also shows some results and design ideas going beyond stop buttons and indifference methods.
In my 2021 'Counterfactual Planning in AGI Systems' I again expand on some of these themes, developing a different mathematical lens which I feel clarifies several much broader issues, and makes the math more accessible.
Orseau and Armstrong's 2016 'Safely interruptible agents' is often the go-to cite when people want to cite Armstrong's indifference methods. In fact it formulates and proves (for some ML systems) a related 'safely interruptable' safety property, a property which we also want to have if we apply a design like indifference methods.
More recent work which expands on safe interruptability is the 2021 'How RL Agents Behave When Their Actions Are Modified' by Langlois and Everitt.
Everitt and Hutter's 2016 'Self-modification of policy and utility function in rational agents' also deals with agent self-knowledge and emerging self-preservation incentives, but like my above 2019 paper which also deals with the topic, the math tends to be a bit too much for many people.
Carey, Everitt, et al. 2021 'Agent Incentives: A Causal Perspective' has (the latest version of) a definition of causal influence diagrams and some safety properties we might define by using them. The careful mathematical definitions provided in the paper use Pearl's style of defining causal graphs, which is again too much for some people. I have tried to offer a more accessible introduction and definitional framework here.
Everitt et al's 2019 'Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective' uses the tools defined above to take a broader look at various safety mechanisms and their implied safety properties.
The 2020 'Asymptotically unambitious artificial general intelligence' by Cohen et al designs a generic system where a uncertain-about-goals agent can call a mentor, and then proves some bounds for it.
I also need to mention the 2017 'The Off-Switch Game' by Hadfield-Menell et al. Like the 2015 MIRI paper on corrigibility, this paper ends up noting that it has clarified some problems, but without solving all of them.
There is lots of other work, but this should give you an idea of what I mean with formal methods work that defines useful AGI safety properties in a mathematically precise way.
Overall, given the results in the above work, I feel that formal methods are well underway in clarifying how one might define useful AGI safety properties, and how we might built them into AGI agents, if AGI agents are ever developed.
But it is also clear from recent comments like this one that Yudkowsky feels that no progress has been made. Yudkowsky is not alone in having this opinion.
I feel quite the opposite, but there is only a limited amount of time that I am willing to invest in convincing Yudkowsky, or any other participants on this forum, that a lot of problems in alignment and embeddedness are much more tractable than they seem to believe.
In fact, in the last 6 months or so, I have sometimes felt that the intellectually interesting open purely mathematical problems have been solved to such an extent that I need to expand the scope of my research hobby. Sure, formal methods research on characterizing and improving specific ML algorithms could still use some breakthroughs, but this is not the kind of math that I am interested in working on as a self-funded independent researcher. So I have recently become more interested in advancing the policy parts of the alignment problem. As a preview, see my paper here, for which I still need to write an alignment forum post.