My sense is that an increasingly common viewpoint around here is that the last ~20 years of AI development and AI x-risk discourse are well-described by the following narrative:

Eliezer Yudkowsky (and various others who were at least initially heavily influenced by his ideas) developed detailed models of key issues likely to be inherent in the process of developing smarter-than-human AI.

These models were somewhere between "maybe plausible" and "quite compelling" at the time that they were put forth, but recent developments in AI (e.g. behavioral characteristics of language models, smoothness / gradualness of scaling) have shown that reality just isn't panning out in quite the way Eliezer's models predicted.

These developments haven't entirely falsified Eliezer's models and key predictions, but there are now plenty of alternative models and theories. Some or all of these competing models either are or claim to:

Therefore, even if we can't entirely discount Eliezer's models, there's clearly a directional Bayesian update which any good Bayesian (including Eliezer himself) should be able to make by observing recent developments and considering alternate theories which they support. Even if the precise degree of the overall update (and the final landing place of the posterior) remains highly uncertain and debatable, the basic direction is clear.

Without getting into the object-level too much, or even whether the narrative as a whole reflects the actual views of particular real people, I want to make some remarks on the concept of belief updating as typically used in narratives like this.

Note, there's a sense in which any (valid) change in one's beliefs can be modeled as a Bayesian update of some kind, but here I am specifically referring to the popular rationalist practice of thinking and communicating explicitly in terms of the language of probabilities and likelihood ratios.

There are some questionable assumptions embedded in (what I suspect are) common views of (a) how the updating process is supposed to work in general and (b) how to apply the process validly to the particular case of updating one's models of AI development and x-risk.

When such views are expressed implicitly in the context of a sentiment that "updating" is broadly virtuous / desirable / correct, I find that important caveats and prerequisites tend to get glossed over - the very caveats that keep the underlying mental motion tethered to reality, that is, that ensure it remains a systematic (if rough and approximate) method for valid reasoning under uncertainty.

The rest of this post is a review of some of the key concepts and requirements for Bayesian updating to work as intended, with some examples and non-examples of how these requirements can fail to be met in practice.

My conclusion is not that the practice of explicit Bayesian updating is inherently flawed, but that it must be applied with attention to the preconditions and assumptions firmly in mind at all times. Local validity at each step must be tracked strictly and adhered to closely enough to ensure that the process as a whole actually holds together as a method for systematically minimizing expected predictive error.

Further, I think that most of the utility of explicit reasoning and communication in Bayesian terms derives not from the end result (whether that end result is a precise numerical posterior probability or just a rough / gut-sense directional update), but rather from the mental motions and stringent process requirements that valid application of the method entails.

In discourse, drilling down into each individual step of the update process and precondition for validity is likely to be more fruitful as a method of pinpointing (if not resolving) areas of disagreement and confusion, compared to attempts to reach consensus or even operationalize a disagreement by merely requiring that the end results of some unspecified underlying mental motions be expressed in Bayesian terms.

[Assumed background for this post: familiarity with the mechanics and jargon of Bayesian updating, as well as the more LW-specific idea of thinking and communicating explicitly in Bayesian terms. Concretely, you should understand all the math in Bayes' rule: Guide and not be confused when you encounter phrases like "I updated based on this information" or "What evidence would cause you to update your beliefs?" in comment threads or posts.]

Requirement 1: ability to inhabit your hypothetical worlds

A key requirement when forming likelihood ratios out of conditional probabilities is the ability to understand each hypothesis under consideration well enough that you can realistically condition on it. Conditioning requires the ability to mentally inhabit a world where each hypothesis is just true, including hypotheses that you consider quite unlikely in unconditional terms.

By "mentally inhabit", I mean: having a grasp of the hypothesis in enough mechanistic detail that you can actually operate in the possible world it generates. Within that possible world, you should be able to extrapolate, interpolate, and make logical deductions of all sorts based on the premises.

Whether the premises are true or false in actuality, within the hypothetical world itself they should radiate belief updates and logical facts outward in all directions at the speed of deduction, becoming entangled with every other fact that hypothetical-you know in that hypothetical world. You should be able to poke and prod at your hypothetical world with additional hypothetical observations and hypothetical predictions and see how your within-world (i.e. conditional) beliefs would evolve in response to such perturbations.

[Image: DALL·E's interpretation of the phrase "radiating logical facts outward at the speed of deduction".]

Sound hard? In the general case of arbitrarily complicated hypotheses, it is. Mentally inhabiting, say, a particular person's model of how the development of smarter-than-human AGI is likely to go is probably harder than passing their Ideological Turing Test (ITT) on the subject, and it is a feat which many people, including the original proponents of key competing models and hypotheses, apparently and / or self-admittedly often cannot perform.

If you want to start comparing multiple hypotheses which aren't based on your own personal models of the world, you probably need to (at a minimum) be able to pass the ITT of multiple people with wildly different models and background assumptions.

It's worth noting that in toy examples used to illustrate the principles and mechanics of Bayesian updating, this requirement is often trivial: if your hypothesis is that a particular patient has a disease, or that a particular coin has some fixed propensity to land on heads when flipped, it's easy to imagine what the hypothetical worlds in which those things are true look like, because they look almost exactly like the actual world we already live in. We (collectively) encounter actual patients with actual diseases all the time, with real tests that sometimes come back with false positives or false negatives. Encountering a physical coin with a bias is a bit less common as an actual occurrence, but also poses no particular difficulty of imagination: perhaps the coin is bent or deformed in some way, or the flipping mechanism is a robot with precise motor control.
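To make the contrast with later sections concrete, here is a minimal sketch (in Python, with made-up numbers for prevalence, sensitivity, and false positive rate) of the entire mechanical content of the disease-test toy problem:

```python
# Toy Bayesian update for a medical test, using made-up numbers.
# Hypotheses: "patient is sick" vs. "patient is healthy".
prior_sick = 0.01          # assumed base rate of the disease
sensitivity = 0.95         # P(test positive | sick)
false_positive_rate = 0.05 # P(test positive | healthy)

# Likelihood ratio of a positive result: how much more expected it is
# in the "sick" world than in the "healthy" world.
likelihood_ratio = sensitivity / false_positive_rate  # 19.0

# Update in odds form: posterior odds = prior odds * likelihood ratio.
prior_odds = prior_sick / (1 - prior_sick)
posterior_odds = prior_odds * likelihood_ratio
posterior_sick = posterior_odds / (1 + posterior_odds)
print(f"P(sick | positive test) ~ {posterior_sick:.3f}")  # ~ 0.161
```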

Requirement 2: ability to form any plausible hypotheses at all

Another prerequisite to make explicit Bayesian updating work in real life is the ability to describe any plausible explanation more specific than "I don't know" for all the observations you're trying to explain.

A single theory or hypothesis is sometimes enough to make the machinery of updating work, in contexts where that hypothesis can be meaningfully compared to the catch-all null hypothesis of "I don't know" or "something I'm not thinking of". But you really do need at least one (and ideally two or more) concrete hypotheses that explain most or all of the evidence you have, without leaving a lot of confusion or unexplained gaps.

For example, if you're trying to solve a murder mystery, you generally need to have some plausible hypothesis about how the murder could have been carried out by someone (or some group), before you can start coming up with particular suspects. If you can't inhabit the hypothetical world where "X was the murderer" for any value of X, then you'll have a hard time forming conditional probabilities, P(E|Murderer is X), regardless of what E is.

This is why murder mysteries often turn on finding the murder weapon, determining the exact time and cause of death, and other surrounding facts.

Here again, toy problems (about biased coins or sick patients) provide a nice non-example. The hypothesis space in a toy problem is often either trivial to compute or given as part of the setup. Whether or not the equivalent of the mental inhabitation step is easy (it might require some combinatorics or non-trivial math), it's usually clear which premises you're supposed to be conditioning on.

Applying Bayesian statistical methods to real-world experimental data is another example where this requirement can sometimes pose a mechanical or mathematical challenge. It can be difficult to pick out a hypothesis space that contains both plausible physical models of the world and also makes the math tractable and the results suitable for comparison with other data and analysis.

If you can't (yet) come up with any plausible hypotheses, it is usually premature to bring in the machinery of explicit Bayesian updating. Thinking in Bayesian terms can still be a useful tool (among many other mental motions and methods) for generating hypotheses, shaping and guiding your intuitions, and figuring out where to look for more data, but assigning even gut-sense directional "updates" to your "beliefs" based on existing observations at this stage can often be misleading.

Requirement 3: avoid double-counting dependent observations

A key feature of Bayesian updating is that it allows you to combine different kinds of evidence in a uniform way via the multiplicativity of likelihood ratios (or equivalently, additivity of log-likelihood ratios).

Each piece of evidence must be independent from the others, but conditioning on each hypothesis under consideration screens off one of the most common and thorny sources of dependence (the truth of each hypothesis itself).

Another way of putting it is that the independence requirement for combining multiple likelihood ratios is only that the underlying observations be independent within the world of each hypothesis.

Again, this requirement is usually obvious and easy to satisfy (or easy to see is not satisfied) in the context of toy problems:

  • If you have a coin with some hypothesized fixed bias, each flip is independent of any other flip as an observation of the coin's behavior given that bias; though obviously the behavior (and likely outcome) of each flip is not independent of the bias itself.
  • In the example of a medical test with a false positive and false negative rate, repeating the same kind of test on an actually-sick or actually-healthy patient is not necessarily independent of the first test: an actually-sick patient who initially gets a false negative is probably much more likely to get a false negative on a second test of the same kind. (This is also why in real life, if you test positive for COVID-19 on a rapid antigen test, you usually want to follow up with a PCR test or at least a different brand of antigen test, rather than repeatedly taking the same test over and over again.)
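To put some (entirely made-up) numbers on the second bullet, here is a small Python sketch contrasting two genuinely different tests, a naive repeat of the same test treated as independent, and a repeat whose result is strongly correlated with the first given the patient's true state:

```python
# Combining evidence by multiplying likelihood ratios - and how dependence breaks it.
# All rates below are made up for illustration.

prior_sick = 0.01

def posterior_from_lr(prior, lr):
    """Posterior probability from a prior and a combined likelihood ratio (odds form)."""
    odds = prior / (1 - prior) * lr
    return odds / (1 + odds)

# Two *different* test types, assumed independent given the patient's true state:
lr_antigen = 0.80 / 0.05   # P(positive | sick) / P(positive | healthy) = 16
lr_pcr     = 0.95 / 0.01   # = 95
print(posterior_from_lr(prior_sick, lr_antigen * lr_pcr))   # both positive: ~0.94

# Naively repeating the same antigen test and treating it as independent:
print(posterior_from_lr(prior_sick, lr_antigen ** 2))        # ~0.72 - too confident

# More realistically, given the patient's true state the second result is highly
# correlated with the first (same viral load, same interfering substance, etc.),
# so a second positive carries far less information than the first one did:
p_second_pos_sick    = 0.98   # P(2nd positive | sick, 1st positive)
p_second_pos_healthy = 0.60   # P(2nd positive | healthy, 1st positive)
lr_second = p_second_pos_sick / p_second_pos_healthy          # ~1.6, not 16
print(posterior_from_lr(prior_sick, lr_antigen * lr_second))  # ~0.21
```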

In more complicated situations, it's pretty easy to accidentally double-count dependent evidence even after conditioning on a hypothesis.

For example, suppose your own mental model of AI development was surprised by GPT-4 in some way: perhaps you didn't expect SGD on transformer networks to scale as far as it has, or you didn't expect RLHF to work as well as it does for controlling the behavior of such networks.

In symbols, P(GPT-4 exists | M = a particular model or armchair theory of AI) is low.

However, a sensible model M probably shouldn't be additionally surprised very much that LoRA fine-tuning and activation engineering also work well to shape and control the behavior of language models.

In symbols, P(LoRA fine-tuning works on GPT-4 | M && RLHF works on GPT-4) is high, not just for any particular model M, but for any model that doesn't specifically say that P(LoRA fine-tuning works | RLHF works) is low.

The upshot is that you usually can't count it as a strike against M (relative to some alternative model M′ that did predict GPT-4) every time a shiny new paper comes out with another technique for shaping and steering existing language models ever more precisely and efficiently. This is true even if M′ very specifically predicted something like activation engineering working in advance - there's just not much surprise left to explain, unless there are other plausible Ms which have low P(activation engineering works | RLHF works).
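Purely to illustrate the structure (every probability below is invented), here is what that accounting looks like in Python, writing M_alt for the alternative model M′ that did predict GPT-4-style results:

```python
# Illustrative only: every probability below is invented to show the bookkeeping.
# M     = a model that was surprised by GPT-4-style results
# M_alt = an alternative model (M' in the text) that did predict them

p_rlhf_works = {"M": 0.05, "M_alt": 0.60}       # P(RLHF works this well | model)

# Having conditioned on RLHF working, neither model is very surprised that other
# steering techniques (LoRA fine-tuning, activation engineering, ...) also work:
p_lora_given_rlhf = {"M": 0.80, "M_alt": 0.90}  # P(LoRA works | model, RLHF works)

lr_rlhf = p_rlhf_works["M_alt"] / p_rlhf_works["M"]             # 12.0 against M
lr_lora = p_lora_given_rlhf["M_alt"] / p_lora_given_rlhf["M"]   # ~1.1, barely an update

print(lr_rlhf, lr_lora, lr_rlhf * lr_lora)
# Nearly all of the update against M came from the first surprise; counting each new
# steering paper as another full-sized strike would double-count the same evidence.
```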

Non-requirement: picking priors and actually multiplying everything together

A common objection to Bayesianism in general (at least outside of LessWrong) is that it requires you to pick a prior, which is subjective and can be difficult. I think such objections are adequately addressed elsewhere already, but it's worth noting that priors are not actually necessary when working with Bayesian updates specifically. You can just work entirely with likelihood ratios and let the prediction markets and reality itself sort out the priors and posteriors.

Of course, when applying Bayesian updating to real-world problems, there's also usually lots of subjectivity in deciding on the likelihood ratios themselves. This is fine if all you're trying to do is get a gut-sense directional update about a single new observation, but it can be a problem if you don't keep track of the independence criteria over time. If you're not careful, you can end up remembering a bunch of gut checks that all felt like they pointed in the same general direction, adding up to one seemingly giant felt sense of update, without realizing that all of them were basically the same gut check repeated.

Actually multiplying all your likelihood ratios together is also usually unnecessary, though it can be useful as a sanity check that your made-up likelihood ratios are actually independent and in the right ballpark: if you end up with a giant likelihood ratio but didn't notice anything revelatory or surprising while actually considering each piece of evidence (during the mental inhabitation step, say), that's more likely a sign that your numbers are off or the whole process is invalid than that you've actually missed a giant update until just now.
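A minimal Python sketch of what "working entirely with likelihood ratios" plus the multiply-everything sanity check might look like, using invented likelihood ratios for two hypotheses A and B:

```python
import math

# Invented, gut-sense likelihood ratios for hypothesis A over hypothesis B,
# one per (hopefully independent) observation. Placeholders, not real estimates.
likelihood_ratios = {
    "observation 1": 3.0,
    "observation 2": 1.5,
    "observation 3": 0.8,   # this one favored B slightly
}

# Sanity check: combine everything in log-odds space.
total_log10_lr = sum(math.log10(lr) for lr in likelihood_ratios.values())
total_lr = 10 ** total_log10_lr
print(f"combined LR (A:B) = {total_lr:.2f}  ({total_log10_lr:+.2f} in log10 odds)")
# If this comes out huge but none of the individual observations felt surprising,
# suspect double-counting or inflated inputs rather than a genuinely giant update.

# Note that no prior appeared above. Different readers can apply their own:
for prior_A in (0.1, 0.5, 0.9):
    odds = prior_A / (1 - prior_A) * total_lr
    print(f"prior P(A)={prior_A:.1f} -> posterior P(A)={odds / (1 + odds):.2f}")
```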

Tips for actually doing Bayesian updating

So what should one actually do about all of these complicated preconditions and requirements for updating? I don't think the answer is to entirely abandon the practice of thinking and communicating explicitly in Bayesian terms, but I also don't have great ideas.

Some assorted thoughts on how to make "updating" a real and valid mental motion and not just a shibboleth:

  • Be fluent with the math. Drill and practice with odds ratios and log odds until they're just as intuitive as probabilities and percentages. (Or if probabilities aren't intuitive for you, start by drilling and practicing with those.)
  • Trade in prediction markets to practice the step of cashing out your beliefs and updates into an actual number.[2]
  • Remember that whenever you see or say "I'm updating on X" there should usually be a relatively complicated and often tedious mental motion associated with that statement, which involves evicting a bunch of cached thoughts, (re)-inhabiting some hypotheticals, poking at those hypotheticals with the new evidence, and then trying to weigh up a felt sense of surprise from each of those pokes into some conditional probabilities and likelihood ratios.
  • Practice the math and the more subjective steps on a mix of different kinds of problems. Some random (mostly untested) ideas for practicing various sub-skills:
    • Analyze the price movements in a popular prediction market, e.g. figure out what piece of news caused the market to spike in price at a particular time, and what that implies about the collective underlying belief update of the market participants (identify the likelihood ratio of the evidence observed, over what mix of hypotheses, etc.); one way to start on this is sketched just after this list.
    • Apply split and commit to the latest scandal or community drama of your choosing, and then weigh up the evidence and come up with an odds ratio for which possible world you're actually in.
    • Try this rationality exercise.
    • Investigate real-life mysteries using explicit Bayesian updating.
    • Read fiction in which a character makes use of explicit Bayesian updating.[3]
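Here is a small Python sketch of the prediction-market exercise mentioned above, backing out the implied likelihood ratio from a hypothetical price move; the same arithmetic doubles as odds / log-odds drill practice:

```python
import math

# Back out the implied collective update from a prediction market price move.
# Prices are treated naively as probabilities; the numbers are hypothetical.
price_before = 0.20   # market probability just before the news
price_after = 0.45    # market probability just after the spike

odds_before = price_before / (1 - price_before)
odds_after = price_after / (1 - price_after)

implied_lr = odds_after / odds_before
print(f"implied likelihood ratio of the news: {implied_lr:.2f}")
print(f"in log2 odds: {math.log2(implied_lr):+.2f} bits")
# ~3.3, i.e. the marginal trader treated the news as roughly 3x more likely
# in worlds where the market resolves YES than in worlds where it resolves NO.
```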

Commentary on AI risk discourse

This post began as a partial response to often-frustrated views I've seen expressed recently at one person or another's "failure to update" in some particular direction about AI development. Although I used beliefs about AI as an example throughout, I ended up not really getting into object-level issues on that topic much here, so for now I'll just conclude with some meta-level commentary.

In a lot of disagreements about AI, in addition to disagreement on the object-level, there are usually implicit disagreements about the shape of the object-level disagreement itself. On one side, there's a view that there's a lot of uncertainty in the object-level which more empirical investigation and time will inevitably resolve, but that if you diligently study current ML and alignment research, and dutifully apply Bayesian updating, you can predict which way the winds are blowing a bit faster and more precisely than everyone else.

On the other side, there's a view that disagreements about the object-level lie mostly in confusion (of the Other) about the fundamental nature of the problem. More empirical investigation and time might be one way of resolving this confusion, but it's also possible to resolve much of that confusion right now (or at any point before now) via explanations and discourse, though admittedly that hasn't actually worked out so well in practice.

Views on the object-level tend to be correlated with views on the meta-level shape of the disagreement, but regardless of who is right about the object-level or the meta-level, I think making the meta-level disagreement more explicit and legible is useful in a lot of contexts. Everyone passing everyone else's Ideological Turing Test about the object-level might be kind of hopeless, but we should at least be able to pass each other's ITT about the meta-level.

I think drilling into the mechanics of purported Bayesian updating is one way of pinpointing the location of meta-level disagreements. For example, one can unpack "it seems like this should be an update" to "I think the observation that [activation engineering works on GPT-2] or [GPT-4 can do impressive stuff] makes [Paul Christiano's disagreement #2 here] a more plausible projection of the medium-term future than [a future in which an unaligned AI quickly develops molecular nanotech] because...". Brackets denote that these are examples of the structural form of the unpacking that I have in mind, not that any actual binding of the things in the brackets corresponds to my or any other person's actual beliefs.

I have some of my own thoughts, alluded to in various comments, on what has been an object-level update for me personally and on what would cause me to make which kinds of updates in the future. I plan to expand on those ideas in a future post now that I've gotten this one out of the way.

  1. ^

    Out-of-narrative comment: retrodiction-only theories can sometimes have issues with overfitting and hindsight bias, but in the case of models of AI development I think this often isn't an issue because people generally don't even agree on a characterization of past events, let alone a theory that retrodicts them.

    So I think looking for concise retrodiction-only theories that everyone on all sides actually agreed accurately described only the past would be quite useful and represent real progress, even if people don't agree on what such a theory predicted about the future.

  2. ^

    I used to trade on PredictIt quite actively and I think it gave me a felt sense for probabilities that nothing else quite could. These days, Manifold has a lot more interesting markets and much better UX. Though for me personally, the play money aspect of Manifold tends to make me lazy about the actual numbers. On PredictIt I used to be very careful with limit orders and pay close attention to the order book and market movements, which was often tedious but also educational. On Manifold I tend to just place a randomly-sized bet in whatever direction I think the market should move without giving it too much thought. YMMV; I think some people find Manifold quite fun / addictive and take it very seriously.

  3. ^

    There's probably not actually much mileage to get out of this unless you read actively, e.g. each time Keltham mentions a numerical likelihood ratio, stop and think about whether the numbers he made up actually make sense in context and what they imply about his underlying models / hypotheses.

Comments

This post puts some words to a difficulty I've been having lately with assigning probabilities to things. Thanks a lot for writing it, it's a great example of the sort of distillation and re-distillation that I find most valuable on LW.

In discourse, drilling down into each individual step of the update process and precondition for validity is likely to be more fruitful as a method of pinpointing (if not resolving) areas of disagreement and confusion, compared to attempts to reach consensus or even operationalize a disagreement by merely requiring that the end results of some unspecified underlying mental motions be expressed in Bayesian terms.

I don't understand this. What two things are being contrasted here? Is it "inhabiting the other's hypothesis" vs. "finding something to bet on"?

EDIT: Also, this is a fantastic post. 

But if I had to say one thing, then I'd say that requirements 1 and 2 are actually just the requirements for clear thinking on a topic. If your "belief" about X doesn't satisfy those requirements, then I'd say that your thinking on X is muddled, incoherent, splintered, coarse, doesn't bind to reality or is a belief in a shibboleth/slogan.

 

Is it "inhabiting the other's hypothesis" vs. "finding something to bet on"?

Yeah, sort of. I'm imagining two broad classes of strategy for resolving an intellectual disagreement:

  • Look directly for concrete differences of prediction about the future, in ways that can be suitably operationalized for experimentation or betting. The strength of this method is that it almost-automatically keeps the conversation tethered to reality; the weakness is that it can lead to a streetlight effect of only looking in places where the disagreement can be easily operationalized.
  • Explore the generators of the disagreement in the first place, by looking at existing data and mental models in different ways. The strength of this method is that it enables the exploration of less-easily operationalized areas of disagreement; the weakness is that it can pretty easily degenerate into navel-gazing.

An example of the first bullet is this comment by TurnTrout.

An example of the second would be a dialogue or post exploring how differing beliefs and ways of thinking about human behavior generate different starting views on AI, or lead to different interpretations of the same evidence.

Both strategies can be useful in different places, and I'm not trying to advocate for one over the other. I'm saying specifically that the rationalist practice of applying the machinery of Bayesian updating in as many places as possible (e.g. thinking in terms of likelihood ratios, conditioning on various observations as Bayesian evidence, tracking allocations of probability mass across the whole hypothesis space) works at least as well or better when using the second strategy, compared to applying the practice when using the first strategy. The reason thinking in terms of Bayesian updating works well when using the second strategy is that it can help to pinpoint the area of disagreement and keep the conversation from drifting into navel-gazing, even if it doesn't actually result in any operationalizable differences in prediction.

... priors are not actually necessary when working with Bayesian updates specifically. You can just work entirely with likelihood ratios...

I think that here you're missing the most important use of priors. 

Your prior probabilities for various models may not be too important, partly because it's very easy to look at the likelihood ratios for models and see what influence those priors have on the final posterior probabilities of the various models.

The much more important, and difficult, issue is what priors to use on parameters within each model.

Almost all models are not going to fix every aspect of reality that could affect what you observe. So there are unknowns within each model. Some unknown parameters may be common to all models; some may be unique to a particular model (making no sense in the context of a different model). For parameters of both types, you need to specify prior distributions in order to be able to compute the probability of the observations given the model, and hence the model likelihood ratios.

Here's a made-up example (about a subject of which I know nothing, so it may be laughably unrealistic). Suppose you have three models about how US intelligence agencies are trying to influence AI development. M0 is that these agencies are not doing anything to influence AI development. M1 is that they are trying to speed it up. M2 is that they are trying to slow it down. Your observations are about how fast AI development is proceeding at some organizations such as OpenAI and Meta.

For all three models, there are common unknown parameters describing how fast AI progresses at an average organization without intelligence agency intervention, and how much variation there is between organizations in their rate of progress. For M1 and M2, there are also parameters describing how much the agencies can influence progress (eg, via secret subsidies, or covert cyber attacks on AI compute infrastructure), and how much variation there is in the agencies' ability to influence different organizations.

Suppose you see that AI progress at OpenAI is swift, but progress at Meta is slow. How does that affect the likelihood ratios among M0, M1, and M2?

It depends on your priors for the unknown model parameters. If you think it unlikely that such large variation in progress would happen with no intelligence agency intervention, but that there could easily be large variation in how much these agencies can affect development at different organizations, then you should update to giving higher probability to M1 or M2, and lower probability to M0. If you also thought the slow progress at Meta was normal, you should furthermore update to giving M1 higher probability relative to M2, explaining the fast progress at OpenAI by assistance from the agencies. On the other hand, if you think that large variation in progress at different organizations is likely even without intelligence agency intervention, then your observations don't tell you much about whether M0, M1, or M2 is true.

Actually, of course, you are uncertain about all these parameters, so you have prior distributions for them rather than definite beliefs, with the likelihoods for M0, M1, and M2 being obtained by integrating over these priors. These likelihoods can be very sensitive to what your priors for these model parameters are, in ways that may not be obvious.
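For readers who want to see the mechanics, here is a rough Monte Carlo sketch of the integration described above, with entirely made-up priors, a toy quantification of "progress rates", and no claim to realism; it just shows how the model likelihood ratios fall out of the parameter priors:

```python
# Toy Monte Carlo version of "integrate the likelihood over the parameter priors".
# Everything here (distributions, numbers, the two "observed progress rates") is a
# hypothetical placeholder, not a claim about the real world.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

observed = np.array([2.0, 0.5])  # pretend progress rates: one fast org, one slow org
noise_sd = 0.25                  # measurement noise on each observed rate

def marginal_likelihood(influence_sign, between_org_sd, influence_sd):
    """Average the likelihood of `observed` over draws from the parameter priors.

    influence_sign: 0 for M0 (no intervention), +1 for M1 (speed up), -1 for M2 (slow down).
    between_org_sd: prior scale for org-to-org variation absent intervention.
    influence_sd:   prior scale for how unevenly the agencies affect different orgs.
    """
    base = rng.lognormal(mean=0.0, sigma=0.3, size=N)            # average progress rate
    org_effect = rng.normal(0.0, between_org_sd, size=(N, 2))    # per-org variation
    influence = influence_sign * np.abs(rng.normal(0.0, influence_sd, size=(N, 2)))
    rates = np.clip(base[:, None] + org_effect + influence, 0.01, None)
    # Gaussian likelihood of the two observations given each parameter draw
    # (normalization constant dropped; it cancels in ratios between models).
    log_lik = -0.5 * np.sum(((observed - rates) / noise_sd) ** 2, axis=1)
    return np.mean(np.exp(log_lik))

# Priors where large org-to-org variation is implausible without intervention,
# but agency influence (if any) can differ a lot between orgs:
L0 = marginal_likelihood(0,  between_org_sd=0.2, influence_sd=0.0)
L1 = marginal_likelihood(+1, between_org_sd=0.2, influence_sd=1.0)
L2 = marginal_likelihood(-1, between_org_sd=0.2, influence_sd=1.0)
print("likelihood ratios vs M0:", L1 / L0, L2 / L0)

# Priors where big org-to-org differences are normal even without intervention:
L0b = marginal_likelihood(0,  between_org_sd=1.0, influence_sd=0.0)
L1b = marginal_likelihood(+1, between_org_sd=1.0, influence_sd=1.0)
L2b = marginal_likelihood(-1, between_org_sd=1.0, influence_sd=1.0)
print("likelihood ratios vs M0:", L1b / L0b, L2b / L0b)
```

Swapping in different values for between_org_sd and influence_sd is a quick way to see the prior-sensitivity described in the last paragraph.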