Thanks! This article was primarily addressing some specific concerns around alignment, such as Goodharting and models going out of distribution. The various people worried about those specific issues somehow never engaged with it, and the people with different concerns about alignment unsurprisingly weren't interested.
This was basically me persuading myself these things weren't important concerns. My thinking on alignment has moved on a bit since then, but I stand by the basic point: AGI is only able to FOOM if it can do STEM, and a number of alignment failure modes that people on LessWrong have spent a lot of time worrying about have the property that anything suffering from them clearly can't do STEM reliably — therefore they're not an x-risk. Examples include Goodharting your optimizer, using models outside their region of validity, and being unable to cope with ontology paradigm shifts. Some people need to stop thinking about agents created by RL as their baseline x-risk, and start thinking about agents that can do STEM well instead. This additional assumption tells you more about the agent, making it less abstract and giving you more information to reason about its behavior.
More generally, for an agent that's an AGI value learner, if you're concerned about an alignment failure mode that seems obvious to you, you then need to explain why it plausibly wouldn't also be obvious to or fixable by the agent before it left the basin of attraction of value learning.
I do not currently know of any way to do the kinds of calculations, approximately-Bayesian updates, and splintered model validity bounds-checking that I have been describing by using solely a deep neural net trained by any current deep learning techniques.
There is a recent result suggesting that for sufficiently-deep neural nets, in-context learning approaches Bayesian optimality, which (with sufficiently large contexts) might form a basis for this. If so, that would motivate improving our interpretability techniques for in-context learning.
To be an Artificial General Intelligence (AGI) an AI needs to meet or exceed skilled human performance for most tasks, specifically including Science/Technology/Engineering/Mathematics (STEM) research. I am here ignoring the possibility of building an existentially-dangerous AI that isn’t capable of doing STEM work: certainly it is hard to see how an AI could go Foom without that capability. So this analysis doesn’t cover, for example, AIs that are dangerously superhuman only at manipulating humans, but are not an AGI.
I am only going to consider AGIs that are value learners: i.e. they attempt to simultaneously learn to predict human preferences, and meanwhile act in light of their current understanding of human preferences and aim to produce outcomes that they are confident that (on some average or aggregate) humans will prefer. If we build a STEM-capable AGI that isn’t a value learner, and instead has any hardwired utility function that was written by humans, then I confidently predict Doom — let’s not do that!
Clearly a major convergent instrumental strategy for any value learner is to understand human preferences: the better the AI understands these, the more likely it is to be able to get them right and do the right thing. Relevant knowledge is power. So doing Alignment Research is a convergent instrumental strategy for any STEM-capable AGI value learner — exactly as we would want it to be.
Required Capabilities for STEM Work
In order to do STEM, such an AGI needs to be able to use the scientific method. That is to say, it needs to be able to hold in its mind multiple conflicting hypotheses about or models of the world, with uncertainty as to which is correct, and perform something approximating Bayesian updates on them. Note that, unlike AIXI, any real AGI will not have unlimited computational resources, so it cannot use a universal prior for its Bayesian priors — it will need to instead use fallible informed guesses of priors, and rely on the approximately-Bayesian updates to converge these guesses towards 100% for the best hypothesis and 0% for competing ones. Fortunately after a few good Bayesian updates, the result tends to be only weakly dependent on the prior, so this uncertainty tends to decrease.
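To make that convergence claim concrete, here is a minimal Python sketch (my illustration, with made-up likelihood numbers) of approximately-Bayesian updating over a finite hypothesis set, starting from fallible guessed priors:

```python
# A minimal sketch of approximately-Bayesian updating over a finite set of
# competing hypotheses, starting from fallible informed-guess priors rather
# than a universal prior. Hypothesis names and numbers are hypothetical.

def bayes_update(priors, likelihoods):
    """priors: dict hypothesis -> current (guessed or updated) probability.
    likelihoods: dict hypothesis -> P(observed evidence | hypothesis).
    Returns normalized posteriors."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Two competing hypotheses with deliberately skewed guessed priors.
posteriors = {"H1": 0.7, "H2": 0.3}
# Repeated evidence that is 3x more likely under H2 than under H1:
for _ in range(5):
    posteriors = bayes_update(posteriors, {"H1": 0.2, "H2": 0.6})
print(posteriors)  # H2's posterior approaches 1 despite the unfavourable
                   # initial guess, illustrating the weakening dependence
                   # on the guessed prior as evidence accumulates.
```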
Our STEM-capable AGI needs to be creative enough to generate new hypotheses — genetic algorithms and LLMs already show significant signs of matching human creativity.
It also needs to be able to figure out low risk/cost experiments to perform to efficiently distinguish between hypotheses (experiments with a risk/cost lower than the likely benefit of the resulting knowledge). Note that when estimating the risk/cost of an experiment in order to minimize it, and comparing it to the potential benefit, the AGI is not yet certain which of the hypotheses is correct, or even that any of the hypotheses it is currently considering is correct (though it may already have a preponderance of evidence from approximately-Bayesian updates on its informed-guess priors). Thus it needs to figure out the cost/risk under every plausible hypothesis thought of so far, and then do its minimization across them (weighted by their now-Bayesian-updated estimated weights). Until it has enough evidence that its Bayesian updates have nearly finished deciding between hypotheses (so one of them is approaching 100%, and thus becoming insensitive to the prior used), it should remain aware that its Bayesian priors are just informed guesses. So it has unknown-unknowns, also known as Knightian uncertainty, unavoidably injecting randomness into its process for attempting to make optimal risk/cost decisions, which is very likely to produce worse results: optimization processes are generally degraded by adding noise. So it needs to be cautious/pessimistic when doing this weighted risk estimate, i.e. bias the initial guessed priors towards whichever possibility would be worse (increase the prior of the hypothesis that would give the plan being considered the highest risk/cost and decrease the priors of the others). To automate this it should start with a range or distribution of guessed priors, approximately-Bayesian-update the entire distribution, and then do its cost/risk analysis on a pessimistic (utility-minimizing) basis over those updated distributions for each hypothesis (see the sketch below). This sounds rather similar to the approach used in Infra-Bayesianism. The pessimistic approach used should obviously be aware of and compensate for all the relevant basic statistical facts, such as the look-elsewhere effect, its consequence the optimizer's curse, and the various related Goodhart effects. The AGI is doing statistics here, and it needs to get the statistics right.
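As a toy illustration of that pessimistic weighting (my own sketch, with hypothetical experiments and numbers, not a prescription), one can score each candidate experiment by its worst expected risk across the plausible hypothesis-weight vectors:

```python
# Score each candidate experiment by its *worst-case* expected risk over
# several plausible hypothesis-weight vectors, reflecting uncertainty about
# the guessed priors. All experiments, hypotheses, and values are hypothetical.

# risk[experiment][hypothesis] = estimated cost/risk if that hypothesis is true
risk = {
    "exp_A": {"H1": 1.0, "H2": 9.0},
    "exp_B": {"H1": 3.0, "H2": 4.0},
}

# Plausible (already approximately-Bayesian-updated) hypothesis weights.
weight_vectors = [
    {"H1": 0.8, "H2": 0.2},
    {"H1": 0.5, "H2": 0.5},
    {"H1": 0.2, "H2": 0.8},
]

def pessimistic_risk(exp):
    """Worst expected risk across the plausible hypothesis-weight vectors."""
    return max(sum(w[h] * risk[exp][h] for h in w) for w in weight_vectors)

best = min(risk, key=pessimistic_risk)
print(best, {e: pessimistic_risk(e) for e in risk})
# exp_A's worst case is 0.2*1 + 0.8*9 = 7.4; exp_B's is 0.2*3 + 0.8*4 = 3.8,
# so the pessimistic rule prefers exp_B even though exp_A looks better under
# the most-favoured weight vector.
```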
A Technical Digression on Differences with Infra-Bayesianism
(This section is addressed primarily to those with an interest in Infra-Bayesianism or entropy, so feel free to skip to the next section if this doesn't interest you.)
Note that pessimism, assuming the minimal utility across the distribution of guessed priors, isn't always quite the correct approach. Operating on an incorrect hypothesis will tend to randomize your decision making, so for a strong optimizer operating in an environment that is already well optimized, it will almost invariably produce worse results than well-informed optimization would have. However, environments generally have some sort of "thermodynamic/entropic equilibrium" maximum-entropy state, which will have some (generally very low) utility, and (by the law of large numbers and entropy) roughly speaking that utility level is the value that random actions will tend to move you towards. For example, in an environment that consists of a 5-way multiple-choice test with standard scoring, that maximum-entropy state is a score close to 20% from guessing all questions (pseudo)randomly. In most actual physical environments this equilibrium state tends to be something resembling a blasted wasteland (or at least, something that will gradually cool to a blasted wasteland), with a correspondingly low utility. Note that there almost always do exist a few unusual states whose utility is even lower than this level: there are policies that can score 0% on a multiple-choice test (knowing all the answers and avoiding them) where partially-randomizing the policy (e.g. by not knowing the answers to some of the questions) will actually improve it. So in unusual sufficiently-low-utility initial states, injecting Knightian uncertainty can actually be pretty-reliably helpful. Similarly, there are physical states such as a minefield, whose utility (to someone walking through it, if not to whoever laid it) is even worse than that of a blasted wasteland: so much so that heavily randomizing it (using randomly lobbed grenades, for example) would actually improve its utility (towards that of a blasted wasteland) by clearing some of the mines. Outside adversarial situations involving other optimizers with incompatible goals, these what one might call "hellish" environments, whose utility is actually significantly lower than equilibrium, are statistically very unlikely to occur: they can basically only be created by deliberate action by a capable optimizer with different goals.

So, unlike Infra-Bayesianism's Murphy, the "pessimism" described in the previous section should actually be an assumption that mistaken actions based on misestimated priors, or on not having the correct hypothesis in your hypothesis set, will, just like random actions, tend to cause the utility to revert towards its (usually very low) equilibrium level (which may not even be accurately known). So if you're in a city built by humans (i.e. an already-heavily-optimized environment), then randomly lobbing grenades is almost invariably going to decrease utility, but in adversarial situations you can encounter environments where you need to flip the sign on that "pessimism", and lobbing grenades randomly isn't that dreadful a strategy. Note that, in these rare "hellish" adversarial situations, this differs from the Infra-Bayesianism formalism, which always expects Murphy to make things worse, all the way down to the worst possible hellish environment, unless constrained not to by some law: one might instead dub this "Entropic-Bayesianism".
Under this view, if we use a positive affine transformation to remap the equilibrium state's utility to zero, then the "law" constraining Murphy becomes: minimize the absolute value of the utility, whether positive or negative.
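Here is a toy sketch of the difference (my own framing of the "Entropic-Bayesianism" idea above, with hypothetical utility numbers), after re-zeroing utilities at the equilibrium state:

```python
# Contrast the two pessimism rules, after a positive affine remap so that the
# maximum-entropy equilibrium state has utility 0. Numbers are hypothetical.

candidate_outcome_utilities = [-5.0, -0.5, 2.0, 8.0]   # relative to equilibrium

def infra_bayes_pessimism(utilities):
    # Murphy picks the worst possible outcome outright.
    return min(utilities)

def entropic_pessimism(utilities):
    # Mistakes and randomness pull outcomes toward equilibrium (utility 0),
    # so this "adversary" minimizes |u|: the closest-to-equilibrium outcome.
    return min(utilities, key=abs)

print(infra_bayes_pessimism(candidate_outcome_utilities))  # -5.0
print(entropic_pessimism(candidate_outcome_utilities))     # -0.5
```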
Another way of putting this: sabotage is easier than building. Optimizing the world into a very specific and unusual state is harder work than assisting the forces of entropy in pulling it back towards a high-entropy state near equilibrium. Sabotaging a hellworld that someone else highly optimized for their goals (with the effect of highly pessimizing it for yours) is just as easy as sabotaging a heavenworld that was optimized for your goals: both start off in very low-entropy states, and both end up as blasted wastelands. The only difference is the sign of the effect on your evaluation of its utility (which is locally inversely correlated with the utility function of the hellworld's creator).
Model Splintering and Limited Regions of Model Validity in STEM
Another important fact about STEM hypotheses/models about the world is that, at least when operating with limited computational resources, they almost invariably have limited regions of effective validity and computational usefulness. For example, something resembling String Theory may or may not be the ultimate unified theory, but (at least with our current theoretical understanding) String Theory predicts googols (far more than 10^100) of possible metastable compactifications, and computing what equivalent of the Standard Model any single one of those compactifications produces as its low-energy approximation is generally a difficult calculation (so giving you googols of difficult calculations to do), thus making it functionally impossible to make almost any useful prediction from String Theory (well, other than the temperature/mass relationship of a black hole). The Standard Model of physics nominally explains all phenomena below CERN-collider energy levels other than dark matter and dark energy, but in practice we can't even calculate the mass of the proton from it. So by the time you get up to the scale of predicting which atomic weights and numbers of nuclei are stable or radioactive, computational limitations mean you're already on your third splintered model. In theory chemistry should be predictable from quantum electrodynamics given a big enough quantum computer — in practice we don't yet have big enough quantum computers, so that's generally not how computational chemistry or theoretical condensed matter predictions are done, nor how AlphaFold predicts protein folding. Continuing up to larger and larger scales, we introduce yet more splintered models in Biochemistry, Biology, Psychology, Ecology, Economics, Geology, Astronomy, and so forth — all with limited ranges of effective validity and usefulness. Notice that this really doesn't look much like the Universal Prior.
So any AGI capable of doing STEM work needs to be able to easily handle model splintering, whether caused by actual ignorance of physical law, or ignorance of the initial state of the system, or just practical or conceptual computational limits that require us to instead switch to doing some more approximate/coarser-scale calculation based on some observed/measured emergent or statistical behavior, or that put us in some region that we don't yet have a good model for (generally a bad place to go). Good STEM splintered models of the world tend to come with some guidelines as to what their region of validity is, but those are usually part of the models, or often part of the relationship between pairs of splintered models, and these guidelines may also be subject to Knightian uncertainty or even computational limitations.
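For concreteness, here is a minimal sketch (hypothetical structure and thresholds, my own illustration rather than any established scheme) of what dispatching between splintered models with explicit regions of validity might look like:

```python
# A registry of splintered models, each carrying its own region-of-validity
# guideline, with explicit failure when no model claims the query. Names,
# thresholds, and the single length-scale criterion are all hypothetical
# simplifications for illustration.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SplinteredModel:
    name: str
    is_valid_for: Callable[[dict], bool]   # guideline for region of validity
    predict: Callable[[dict], Any]

models = [
    SplinteredModel("nuclear_physics",
                    lambda q: q["length_scale_m"] < 1e-14,
                    lambda q: "use nuclear shell model"),
    SplinteredModel("chemistry",
                    lambda q: 1e-10 <= q["length_scale_m"] < 1e-7,
                    lambda q: "use quantum chemistry approximations"),
    SplinteredModel("continuum_mechanics",
                    lambda q: q["length_scale_m"] >= 1e-6,
                    lambda q: "use finite-element simulation"),
]

def predict(query: dict):
    applicable = [m for m in models if m.is_valid_for(query)]
    if not applicable:
        # Outside every known region of validity: flag Knightian uncertainty
        # rather than silently extrapolating.
        raise ValueError("no model claims validity here; proceed with caution")
    return applicable[0].name, applicable[0].predict(query)

print(predict({"length_scale_m": 1e-9}))   # dispatches to the chemistry model
```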
Making Suitably Cautious Predictions Under Knightian Uncertainty
Anything capable of doing STEM needs to understand and be able to handle the difference between (and mixtures of) two different types of uncertainty:
1) Stochastic uncertainty, or "known unknowns": many physical models cannot make exact predictions of outcomes, only predict some distribution of outcomes. In a specific model, it may, or may not, be possible with enough computational effort to predict this distribution in great detail, complete with mean, variance, detailed shape, and (less often) even a fine understanding of the shapes of its tails. If you are ~100% certain you have the right model and are within its region of validity, and if the model does allow you to predict detailed stochastic distributions, and if you can also map that distribution of outcomes to a distribution of utility differences in a precise, well principled way, then you can actually make well-informed utility gambles on stochastic uncertainty and reasonably expect to break even while doing so.
2) Knightian uncertainty and logical uncertainty, or "unknown unknowns": situations where you are not yet ~100% sure which model to apply, or at least what the values of some of its parameters are, or else you are computationally unable to use the model to make a sufficiently accurate prediction (though you may perhaps be able to make an approximate prediction or a range of predictions). Generally in this case, if you try to use the model to estimate odds and decide whether to accept a gamble, there is a possibility that your decision will depend on your fallible initial informed guess of priors (over models, and/or over parameter values within a model) that has not yet been Bayesian-updated tightly enough for you to make a well-informed gamble: instead you're partially relying on the values of your informed-guess priors. In this case you should again use a normally-pessimistic "Entropic-Bayesianism" approach over Knightian and logical uncertainty, which will require you to keep track of things like a range or even a distribution of updated priors over models and model parameters (see the sketch after this list). There is also generally Knightian or even logical uncertainty over whether you are even considering the right set of hypotheses, though with a sufficient body of evidence this can to some extent Bayesian-update away. However, for this sort of Knightian or logical uncertainty, it's often not clear how far the neighborhood of validity supported by your previous evidence (the evidence whose approximately-Bayesian updates suggest that the uncertainty has gone away and that the model is in fact in its region of validity) actually extends: one can almost always come up with some very contrived new hypothesis that would create a currently-unobserved very narrow hole in the range of validity of the model you were assuming was safe to use, and often the only defense against this is Occam's razor, if the resulting hypothesis looks much more complex or contrived than well-supported theories in this area have been. So there tends to be some lingering Knightian uncertainty about the borders of the region of validity of even very-well-tested models. That's why the word "theory" gets used a lot in science.
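Here is a minimal sketch of the asymmetry between the two cases (my illustration, with made-up numbers): bets on well-characterized stochastic uncertainty can be taken at expected value, while under Knightian uncertainty the bet has to survive a pessimistic evaluation over the whole range of probability estimates still considered plausible.

```python
# Toy decision rules for the two kinds of uncertainty. Probabilities and
# utilities are hypothetical illustrations.

def accept_stochastic_bet(p_win, win_utility, lose_utility):
    # Known distribution: just take the expectation.
    return p_win * win_utility + (1 - p_win) * lose_utility > 0

def accept_knightian_bet(p_win_range, win_utility, lose_utility):
    # Unknown unknowns: require the bet to be positive even under the least
    # favourable probability estimate still considered plausible.
    worst_p = min(p_win_range)
    return worst_p * win_utility + (1 - worst_p) * lose_utility > 0

print(accept_stochastic_bet(0.6, 10, -10))        # True: expected value is +2
print(accept_knightian_bet((0.4, 0.6), 10, -10))  # False: worst-case value is -2
```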
Note that there are exceptions to things like model-parameter uncertainty being Knightian. For example, suppose that you have a model of the universe that you are ~100% certain of, which has only one free parameter that you still have any significant remaining uncertainty on: you're unsure of its exact value over a narrow range, and are not aware of any a-priori reasons to place a higher or lower prior on any particular values or region within that range, so a uniform prior seems reasonable. Then you could reasonably treat its exact value within that range as a stochastic rather than a Knightian variable, and take informed utility-bets on matters dependent on its value, without needing to reasonably expect that you are injecting unknown-unknown randomness into your decision-making process (perhaps other than on whether you should assume a uniform linear or exponential distribution over the range — hopefully the range is narrow enough that these are unlikely to make enough difference to perturb the results of your optimization.)
Note that these two forms of uncertainty are not mutually exclusive: frequently you are dealing with a situation that contains both stochastic uncertainty and also Knightian and/or logical uncertainty. The important difference is that stochastic uncertainty is well understood, so you can rationally take well-informed bets over it (after dealing with issues like utility functions generally being non-linear in most resources, such as money), whereas Knightian uncertainty is not well understood, so you need to hedge your bets: assume that even your best estimate of the uncertainty could well be wrong, that when betting you're still unavoidably injecting randomness into your optimization process, and that (at least outside hellish environments) random outcomes are statistically likely to be worse than well-optimized ones, to an extent that depends on the statistics of how many ways you could be mistaken, something which you generally also don't have good knowledge of, since it's always possible there are hypotheses you haven't yet considered or couldn't compute the implications of.
Thus if an agent attempts to optimize strongly in the presence of large amounts of Knightian uncertainty, it's going to fall afoul of the optimizer's curse and Goodhart's law. In the presence of sufficiently small amounts of Knightian uncertainty, it may be possible to at least estimate the risk in statistically-principled ways (e.g. "I somehow know that there are O(N) separate independently distributed relevant variables whose values I'm unaware of, all about equally important and that combine roughly linearly, and with distributions that the central limit theorem applies to, so I should expect to be off by about sqrt(N) times the effect of one of them" — which is a very low amount of Knightian uncertainty) and then deduce how much pessimism to apply to counteract these, but generally you don't even have that good a handle on how much Knightian uncertainty you have, so the best you can do is assume the law of large numbers is against you and apply Entropic-Bayesian pessimism.
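A tiny worked version of that sqrt(N) estimate (mine, purely illustrative):

```python
# N unknown, roughly equally important, independently distributed variables
# that combine roughly linearly produce an expected total error that grows
# like sqrt(N) times the typical effect of one of them (central limit theorem).
# The numbers below are hypothetical.

import math

def expected_error(n_unknowns: int, per_variable_effect: float) -> float:
    """Central-limit-theorem scaling of the combined unknown-variable error."""
    return math.sqrt(n_unknowns) * per_variable_effect

def pessimistic_utility(point_estimate: float, n_unknowns: int,
                        per_variable_effect: float) -> float:
    """Counteract the optimizer's curse by discounting the expected error."""
    return point_estimate - expected_error(n_unknowns, per_variable_effect)

print(pessimistic_utility(point_estimate=100.0, n_unknowns=25,
                          per_variable_effect=2.0))   # 100 - 5*2 = 90.0
```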
So a STEM-capable AGI will need to be able to handle understanding when it is using a model of the world in a region where it is applicable, show due caution when the model is starting to get out of its range of validity (or for a model that is actually built in the form of a trained neural net, out of its training distribution), and also correctly and cautiously handle being unsure about this.
Consequences of STEM-Required Capabilities for a Value Learner
For a STEM-capable AGI value learner, in order to compute the utilities of actions under consideration, their outcomes need to stay not only within the region of validity of its models of how the physical world works (so it can be confident of accurately predicting what outcome will happen if it takes that action), but also within the region of validity of its models of human utility (so it can be confident of accurately predicting whether humans will be happy about the outcome or not). In the presence of Knightian or logical uncertainty about either or both of these, it needs to make suitably cautious/pessimistic assumptions, which in any situation that seems to have the potential for seriously bad outcomes will generally preclude it taking any such actions. So (outside things like desperate Hail-Mary gambles or cautious experimentation), it's basically forced to stay inside what it believes are the regions of validity of both its physical and utility models. Note that this requires it to understand and regulate the impact of its actions, to keep the outcomes within all relevant models' regions of validity.
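A minimal sketch of that double validity check (my own, with hypothetical function names throughout), where an action is only eligible if its predicted outcome stays within the region of validity of both the physical model and the preference model:

```python
# Only actions whose predicted outcomes pass BOTH validity checks are scored;
# everything else is treated pessimistically and skipped. All callables here
# are hypothetical placeholders for much more complex machinery.

def choose_action(actions, predict_outcome, physical_model_valid,
                  utility_model_valid, estimated_utility):
    eligible = []
    for action in actions:
        outcome = predict_outcome(action)
        if physical_model_valid(outcome) and utility_model_valid(outcome):
            eligible.append((estimated_utility(outcome), action))
        # else: Knightian uncertainty about the outcome or its value to humans;
        # under entropic pessimism this action is not worth the risk.
    if not eligible:
        return None   # fall back to low-impact behaviour / ask humans
    return max(eligible, key=lambda pair: pair[0])[1]
```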
So once again, doing Alignment Research to get better models of human preferences with larger regions of validity rapidly becomes a convergent instrumental strategy for any STEM-capable AGI value learner, especially so as soon as it has optimized the environment to near the edge of the region of validity of any of its current models of human values.
Indeed, this need to advance alignment research to keep up with capabilities (since otherwise the capabilities are too dangerous to use) is quite likely to slow takeoff speed after we achieve AGI. It’s entirely possible that the limiting factor for takeoff speed isn’t fundamental research or physically building new infrastructure, but doing customer surveys on humans to figure out what they actually want out of all the new possibilities. Obviously this is going to vary across individual technologies: understanding what costs humans are or are not willing to accept as consequences of ways that AGIs might cure cancer might well be much easier than, say, figuring out how AGIs could take on the functions of government in ways humans would find acceptable.
Implementing These STEM-Required Capabilities
I do not currently know of any way to do the kinds of calculations, approximately-Bayesian updates, distinctions between types of uncertainty, and splintered model validity bounds-checking that I have been describing by using solely a deep neural net trained by any current deep learning techniques. While in-context learning for sufficiently deep transformer models is approximately-Bayesian, that's not all we need here. I suspect that it may require, at a minimum, a significantly more complex neural architecture, and it may well not be possible at all. Successfully implementing all of this sounds to me like it requires doing symbol manipulation along the lines of mathematics/statistics, programming, or symbolic AI, or indeed the kinds of symbol manipulation and tool-use that current LLMs are starting to get good at, perhaps also combined with less-widely-used techniques like symbolic regression and equation learning. All that is, after all, generally how human STEM researchers do this sort of work, which is what we're trying to automate. Human STEM researchers don't generally attempt to do STEM using their system 1 neural nets (at least, not often).
So, to try to put this in something resembling Reinforcement Learning terminology, I am proposing (at least for the AGI’s equivalents of slow/system 2 thinking, i.e. for longer term planning) something resembling an actor-critic setup, where at least the critic is a splintered set of models, systems, and ranges of model validity that internally look a lot like some combination of mathematics, programs, symbolic AI, LLM-powered symbolic manipulation, and some machine-learned data-scientific statistical models where needed, that between them provide estimates of human utility with confidence bounds in a properly cautious way. Probably much of it will look like things that a social science or advertising data-science team might use. Likely most of this would have been generated/developed by STEM-capable AGIs, building upon previous human research into human behavior. Obviously human alignment experts (and regulators/politicians) would be very interested in doing interpretability work on the critic system to attempt to understand it and confirm its validity. This seems like the area where many of the approaches often considered by alignment researchers might be focused, such as Debate or Amplification. However, much of it presumably won’t be neural nets: it should look like a combination of economics, behavioral psychology, statistical models of human preferences, and similar hardish/numerical versions of soft sciences. This also brings us into the area of applicability of approaches like Chain-of-thought alignment. So interpretability on this may look a lot more like data science or economics than neural net interpretability, except for those component models in it that are themselves neural nets.
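To sketch the rough shape I have in mind (purely illustrative, all names hypothetical), the critic might look like an ensemble of splintered preference models, each carrying its own validity check, combined into a cautious lower confidence bound:

```python
# A rough, illustrative shape for the critic: an ensemble of splintered
# preference models (econometric models, survey fits, small neural nets, ...),
# each with its own region of validity, combined pessimistically.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PreferenceModel:
    name: str                                   # e.g. an econometric model, a survey fit
    applies_to: Callable[[dict], bool]          # region of validity
    utility_estimate: Callable[[dict], tuple]   # -> (mean, lower_bound, upper_bound)

def critic(outcome: dict, models: list[PreferenceModel]) -> Optional[float]:
    """Cautious utility score for an outcome: the most pessimistic lower bound
    among the splintered models that claim validity, or None if none apply."""
    bounds = [m.utility_estimate(outcome)[1] for m in models if m.applies_to(outcome)]
    return min(bounds) if bounds else None      # None => out of distribution: don't act
```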
Consequences for Alignment Research
It seems to me STEM-capable AGI value learners doing alignment research is a stable attractor, and that once we get near enough to it, our AGI value learners ought to be able to keep us on it better than we can, if they’re sufficiently capable. Indeed, making sure that we stay on it is a convergent instrumental strategy for them, and they should soon get superhumanly good at it. So the challenge for human alignment researchers is to build an initial sufficiently-aligned STEM-capable AGI value learner to act as an automated alignment researcher to get us into the basin of attraction of this stable attractor, without creating too much x-risk in the process. I personally believe that that’s the goal, and what we should be working on now. We don’t yet know how to do it, but between LLMs and agentic AutoGPT-like scaffolding systems/cognitive architectures, some of the likely components in such a system are becoming clear.
I also think it would be worth going back and looking through the work that was done on symbolic AI, and seeing if any of it looks like it might be useful to connect to an LLM.
What most concerns me is that a good deal of current effort in alignment research seems to be devoted to questions like “how do we avoid the critic system’s utility scoring neural net giving bad answers after a distribution shift in its inputs?” without recognizing that this is a specific utility-function-related instance of a more general capabilities problem that we are going to have to solve in order to build an AGI capable of doing STEM research in the first place: telling when any splintered model is leaving its region of validity, and can no longer be relied on. Alignment researchers thinking about this also seem to often assume that the model is a pure neural net, and I think we should be thinking about systems that combine splintered models with a much wider range of internal structures or tool-use, including equations, checklists, textual descriptions, gradient-boosted models, software, and so forth.
Similarly, issues like the look-elsewhere effect, breaking the optimizer’s curse, and avoiding Goodharting your way out of your validity region are also generally-applicable capability problems in the statistics of STEM work, and are not unique to calculating and optimizing human utilities (though the application of optimization pressure is important to them). We don’t have a lot of alignment researchers, so I suggest leaving capabilities problems like these to all the capabilities researchers!
I think we should put significant weight on the possibility that our STEM-capable AGI will be doing these things using symbol manipulation and tool-use like statistics and approximately-Bayesian updates, not neural nets. So what we need to be doing is reading the statistical literature on how best to make decisions when you don't have perfect information.
Admittedly, statistical issues like these are probably both more common and more challenging in soft sciences than in hard sciences, and aggregated human utility is definitely a soft-science question. So I can imagine us first building a near-AGI capable of doing some STEM work in hard sciences that wasn’t yet very competent in soft sciences, but that was capable of Fooming. However, as long as it was a value learner, it would understand that this was a big problem, which urgently needed to be solved before Fooming, and it would be a convergent instrumental strategy for it to do the Computer Science and Statistical work to fix this and become a better statistician/data-scientist/soft-scientist — and fortunately those are all hard sciences.