Last week, ARC released a paper called Towards a Law of Iterated Expectations for Heuristic Estimators, which follows up on previous work on formalizing the presumption of independence. Most of the work described here was done in 2023.

A brief table of contents for this post:

 

In "Formalizing the Presumption of Independence", we defined a heuristic estimator to be a hypothetical algorithm that estimates the values of mathematical expression based on arguments. That is, a heuristic estimator is an algorithm  that takes as input

  • A formally specified real-valued expression ; and
  • A set of formal "arguments"  --

-- and outputs an estimate of the value of  that incorporates the information provided by . We denote this estimate by .[1]

In that paper, we introduced the following question: is there a computationally efficient heuristic estimator that formalizes intuitively valid reasoning about the values of mathematical quantities based on arguments? We studied the question by introducing intuitively desirable coherence properties (one such property is linearity: a heuristic estimator's estimate of  should equal its estimate of  plus its estimate of ) and working to satisfy those properties. Ultimately, we left the question open.

 

The main technical contribution of our new work is to outline a new type of coherence property: a heuristic estimator should not be able to predict its own errors. We call this intuitive statement the principle of unpredictable errors. The principle is loosely inspired by the law of iterated expectations from probability theory, as well as the martingale property: a Bayesian reasoner's estimate of their future estimate of a quantity should be equal to its current estimate. One of the main purposes of this work is to explore ways to formalize this principle.

Our paper is structured as follows:

  • We begin by explaining the core motivating intuition behind heuristic estimators through three analogies: proof verification, conditional expectations, and subjective probabilities.
  • We explain why we believe in the principle of unpredictable errors. We then describe a natural attempt to formalize the principle: to simplify, we ask that  for every expression  and set of arguments . (We call this property iterated estimation and also define a more complex property which we call error orthogonality.) We then discuss important drawbacks of this formalization, which stem from the nested 's in the definition of the properties.
  • Taking inspiration from these properties, we define a cluster of accuracy properties, which -- roughly speaking -- replace the outer  with an expected value over a distribution of expressions . The simplest of these properties states that .
  • We examine the accuracy properties in the context of two estimation problems: (1) estimating the expected product of jointly normal random variables and (2) estimating the permanent of a matrix. In both cases, we encounter barriers to satisfying the accuracy properties, even when the set of heuristic arguments is small and simple. This leads us to reject accuracy as a formalization of the principle of unpredictable errors. We leave open the question of how to correctly formalize this principle.
  • We conclude with a discussion of our motivations for pursuing this line of research. While the problem of heuristic estimation is deeply interesting from a theoretical standpoint, we believe that it could have applications for understanding the behavior of neural networks. We discuss three potential applications of heuristic estimation to understanding neural network behavior: mechanistic anomaly detection,[2] safe distillation, and low probability estimation.

This blog post summarizes the paper, with proportionally more emphasis on the main ideas and less emphasis on the mathematical details.

 

What is a heuristic estimator?

In "Formalizing the Presumption of Independence", we described a heuristic estimator as an efficient program that forms a "subjective expected value" of a quantity based on arguments. We gave several examples of heuristic estimation, such as estimating the number of twin primes in a given range and estimating the probability that some 256-bit input to the SHA-256 circuit has an all-zeros output.

In our new paper, we expand on the intuition behind heuristic estimators through one example and three analogies.

 

Example: Sum of sixth digits of square roots

Let  denote the sixth digit of  past the decimal point (in base 10), and let . What is your best guess for the value of ?

Without actually calculating any square roots, your best bet is to estimate each of the twenty digits as  (the average of 0 through 9); this gives an estimate of . This is perhaps how we want our heuristic estimator to behave when given no arguments; in other words, we want our heuristic estimator  to satisfy .[3]

Now, let  be a computation of the sixth digit of . When given  as an argument,  should update its estimate of  accordingly. For example, the sixth digit of  happens to be 5. Correspondingly, we would like . If  is additionally given , which shows that the sixth digit of  is 4,  should again update its estimate:  -- and so on. If  is given all of  through , then it should be able to compute the correct answer.

(The purpose of this example is to provide a very simple and intuitive picture of how we expect  to update based on arguments. In practice, we expect the arguments given to  to be much more complex.)

 

Analogy #1: Proof verification

A proof verifier is a program that takes as input a formal mathematical statement and a purported proof of the statement, and checks whether the proof is valid. This is very similar to a heuristic estimator, which is a program that takes as input a formal mathematical expression and some arguments about the expression, and outputs an estimate of the value of the expression in light of those arguments.

Just as a proof verifier does not attempt to generate its own proof of the statement -- it just checks whether the given proof is valid -- a heuristic estimator does not attempt to estimate the given quantity using its own arguments. Its only purpose is to incorporate the arguments that it is given into an estimate. Moreover, we can think of a heuristic estimator as a generalized proof verifier: we expect heuristic estimators to respect proofs, in the sense that if an argument  proves that , then  should lie between  and . (See Section 4 here.)

This table (adapted from chapter 9 here) illustrates the analogy between proof verifiers and heuristic estimators in more detail.

Heuristic estimationProof verification
Heuristic estimatorProof verifier
Formal mathematical expressionFormal mathematical statement
List of heuristic argumentsPurported proof of statement
Formal language for heuristic argumentsFormal language for proofs
Desiderata for estimatorSoundness and completeness
Algorithm's estimate of expressionVerifier's output (accept or reject)

 

Analogy #2: Conditional expectation

In some ways, a heuristic estimator is analogous to a conditional expected value. For a random variable  and event  is the average value of  conditioned on  -- or put otherwise, it is the estimate of  given by an observer who knows that  occurred and nothing else. Similarly,  is the estimate of  given by an observer who has not computed the exact value of  and has instead only done the computations described in the arguments in . Although there is a particular correct value of , the observer does not know this value, and  is a subjective "best guess" about  given only . Both quantities can thus be thought of as a best guess conditional on a state of knowledge.

 

Analogy #3: Subjective probabilities and estimates

Perhaps the best intuitive explanation of heuristic estimation is this: a heuristic estimator is a procedure that extracts a subjective expectation from a state of knowledge. Under this view,  formally describes a set of facts known by an observer, and  is a subjective estimate of  in light of those facts.

By "subjective expectation", we mean "expected value under the subjectivist view of probability". The subjectivist view of probability interprets probability as the subjective credence of an observer. For example, suppose that I have chosen a random number  uniformly from , minted a coin that comes up heads with probability , and then flipped it. What is the probability that the coin came up heads?

To an observer who only knows my procedure (but doesn't know ), the subjective probability that the coin came up heads is . To an observer who knows  (but hasn't seen the outcome of the coin flip), the subjective probability that the coin came up heads is . And to an observer who saw the outcome of the coin flip, the probability is either 0 (if the coin came up tails) or 1 (if it came up heads).

Much as observers can have subjective probabilities, they can also have subjective expectations. For example, a typical mathematician does not know the 6th digit past the decimal point of , but would subjectively assign a uniform probability to each of , which means that their subjective expectation for the digit is . Recalling our example from earlier, the mathematician's subjective expectation for  is . But if the mathematician were to learn that , they would update their subjective expectation to . This is exactly how we want our heuristic estimator  to operate.

 

How can heuristic estimation help us understand neural networks?

While heuristic estimation is a deep and interesting topic in its own right, our research is primarily motivated by potential applications to understanding neural network behavior.

Historically, researchers mostly understood neural network behavior through empirical observation: the model's input-output behavior on particular inputs. However, this approach has important drawbacks: for example, any understanding gained about a model's behavior on one input distribution may not carry over to a different input distribution.

More recently, there has been substantial work on neural network interpretability via understanding how models internally represent concepts. Interpretability research aims to address the barriers faced by methods that rely only on input-output behavior. However, current interpretability techniques tend to only work under strong assumptions about how neural networks represent information (such as the linear representation hypothesis). Also, for the most part, these techniques can only work insofar as neural representations of concepts are understandable to humans.

A different approach is formal verification: formally proving properties of neural networks such as accuracy or adversarial robustness. While formal verification does not rely on human understanding, we believe that formally proving tight bounds about interesting behaviors of large neural networks is out of reach.

By contrast, heuristic arguments about properties of neural networks  may have important advantages of both formal verification and interpretability. On the one hand, heuristic arguments (like proofs) are formal objects that are not required to be human-understandable. This means that heuristic arguments could be used to reason about properties of neural networks for which no compact human-understandable explanation exists. On the other hand, heuristic arguments (like interpretability approaches) do not require perfect certainty to be considered valid. This allows for short heuristic arguments of complex properties of large models, even when no short proofs of those properties exist.[4] (See our earlier post on surprise accounting for further discussion.)

In the rest of this section, I will give three examples of problems that we believe cannot be solved in full generality with current approaches, but that they may be solvable with heuristic arguments. (All three examples will just be sketches of possible approaches, with many details left to be filled in.)

 

Mechanistic anomaly detection

Let  be a neural network that was trained on a distribution  of inputs  using the loss function .[5] Suppose that  successfully learns to achieve low loss: that is,  is small. Let  be a (perhaps out-of-distribution) input. We call  a mechanistic anomaly for  if  gets a low loss on , but for a "different reason" than the reason why it gets low average loss on . In other words, mechanistic anomalies are inputs on which  acts in a seemingly reasonable way, but via anomalous internal mechanisms.[6] To detect a mechanistic anomaly, reasoning about 's internal structure may be necessary.

How could we use a heuristic estimator to detect mechanistic anomalies? Suppose that we find a set of arguments  such that the following quantity is low:[7]

That is,  explains why  attains low average loss on .[8] Given an out-of-distribution input  such that  is once again low, we consider the quantity . This represents a heuristic estimate of 's loss on  based only on the reasons provided in : that is, based only on the facts necessary to explain 's low loss on .

If  is (correctly) low, then the reasons why  performs well on  also explain why  performs well on . By contrast, if  is (incorrectly) high, then  performs well on  for a different reason that why  performs well on . As a result, we flag  as a mechanistic anomaly for .

(See here for a more detailed discussion of mechanistic anomaly detection.)

 

Safe distillation

Let  ("fast") and  ("slow") be two neural networks that were trained on a distribution  of inputs to complete the same task. Thus,  and  behave similarly on . Suppose that we trust  to be aligned (e.g. we trust  to generalize well off-distribution) and do not similarly trust , but that  is much slower than .[9] Given an out-of-distribution input , we would like to estimate  without running . We could do this by running  on  and hoping that  generalizes well to . However, this approach is not very robust. Instead, we can attempt to use the internal activations of  to predict .

Concretely, suppose for simplicity that  and  output vectors, and suppose that we find a set of arguments  such that the following quantity is low:

That is,  explains why  and  produce similar outputs on . Given an out-of-distribution input , we consider the quantity

This represents a heuristic estimate of  given the computations done by  and the argument  for why  and  are similar on . If the reason why  and  behave similarly on  also extends to , then  will correctly estimate  to be similar to . On the other hand, if the reason why  and  behave similarly on  does not extend to , then 's estimate of  may be different from . This estimate may be more robust to distributional shifts, because it is based on mechanistic reasoning about how  and  work.

Safe distillation and mechanistic anomaly detection are closely related problems. The key difference is that in the safe distillation problem, we have a trusted model . This makes the setting easier; in exchange, we can hope to do more  Concretely, we expect  and  to differ on  if  is a mechanistic anomaly for  (but not for ). Solving the safe distillation problem would allow us to not only detect  as an anomalous input, but to predict  from the internals of . An algorithm for predicting  from 's internals would act as a distillation of , but -- unlike  -- it would be a safe distillation (hence the name).

 

Low probability estimation

Let  be a neural network that was trained on a distribution . Let  (for "catastrophe") be a different neural network that checks the output of  for some rare but highly undesirable behavior:  returns  if  exhibits the undesirable behavior on , and  otherwise. We may wish to estimate , and we cannot do so by sampling random inputs  because  outputs  very rarely. Suppose that we find a set of arguments  that explains the mechanistic behavior of  and . If this explanation is good enough, then  will be a high-quality estimate of this probability. Additionally, we may use  to more efficiently check 's behavior on particular inputs: given an input , the quantity

represents an estimate of the likelihood that  based on the computations done by . This is especially useful if  is slow and running it on every output of  is prohibitively expensive.

A few weeks ago, we published a blog post on this potential application. In the blog post, we discussed layer-by-layer activation modeling as a possible approach to the problem. We believe that solving the problem in full generality would require sophisticated activation models that can represent a rich space of possible distributional properties of neural activations. This is similar to our goal of developing a conception of heuristic arguments that is rich enough to point out essentially any structural property of a neural network. Activation modeling and heuristic estimation are different perspectives on the same underlying approach.

 

With this motivation for heuristic estimators in mind, let us now discuss what properties they ought to satisfy.

Formalizing the principle of unpredictable errors

Recall from Analogy #3 that we should expect heuristic estimators to behave a lot like Bayesian reasoners. In fact, perhaps heuristic estimators should satisfy formal properties that Bayesian reasoners satisfy.

One  property of Bayesian reasoning is known as the martingale property, which states that a Bayesian reasoner's estimate (i.e. subjective expectation) of their future estimate of some quantity should equal their current estimate of the quantity. Or more informally, a Bayesian reasoner cannot predict the direction in which they will update their estimate in light of new information. If they could, then they would make that update before receiving the information.

By analogy, a heuristic estimator ought not to be able to predict the direction of its own errors. We call this the principle of unpredictable errors. While this principle is informal, formalizing it could be a first step toward searching for an intuitively reasonable heuristic estimator. This motivates trying to find a satisfying formalization of the principle. Below, we will discuss two approaches for formalization: a subjective approach and an objective approach.

 

The subjective approach: Iterated estimation and error orthogonality

Our subjective approach to formalizing the principle of unpredictable errors involves two properties that we call iterated estimation and error orthogonality.

We say that  satisfies the iterated estimation property if for all expressions  and for all sets of arguments  and , we have

The sense in which the iterated estimation property formalizes the principle of unpredictable errors is fairly straightforward. The expression  represents the heuristic estimator's belief given only the arguments in . The expression  represents the estimator's belief given a larger set of arguments. The iterated estimation property states that the estimator's belief (given ) about what their belief about  would be if presented with all of  is equal to their current belief about . This closely mirrors the martingale property of Bayesian reasoners. It is also directly analogous to the law of iterated expectations from probability theory, hence the name "iterated estimation."[10]

As an example, let  be defined as in our earlier example with the square roots. Let  and . As discussed above, we have  (because the sixth digit of  is ). The iterated estimation property states that  is also equal to . In other words, after  has learned  (but not yet ), its estimate of what its belief of  will be after learning  is its current estimate of , namely .

 

We say that  satisfies the error orthogonality property if for all expressions  and for all sets of arguments  and , we have

Error orthogonality is a more "sophisticated" version of iterated estimation.[11] It is directly analogous to the projection law of conditional conditional expected values.[12]

As an example, let  be defined as above and let  be the expression . Let  and . The outer  does not know the exact value of . However, it believes  and  to be subjectively uncorrelated, because it knows that  includes . Thus, given the outer 's state of knowledge, the best estimate of  is zero for all possible values of . Thus, its estimate of the entire expression is .

(If you want to build more intuition about this property, see Example 2.5 of our paper.)

To see why error orthogonality is desirable, recall our interpretation of heuristic estimates as subjective expected values. Suppose that error orthogonality does not hold for some particular . This means that an observer with state-of-knowledge  believes  and  to be subjectively correlated over the observer's uncertainty. In other words: the observer believes that the subjective estimate of  given state-of-knowledge  is predictive of the error in the subjective estimate of  given state-of-knowledge . However, any such prediction should have already been factored into the estimate .

 

Challenges with the subjective approach

Although iterated estimation and error orthogonality are intuitively compelling, there are challenges with using these properties as stated to seek a reasonable heuristic estimator. These challenges come primarily from the fact that the properties concern 's estimates of its own output.

The first challenge: it seems plausible that these two properties could be satisfied "by fiat." This means that  could check whether it is estimating the quantity  given some subset  of , and then -- if so -- simply compute  and output the result. Although this behavior would satisfy the iterated estimation property, we do not want  to special-case expressions of this form. Instead, we want  to satisfy the property as a consequence of its more general behavior.[13]

The second challenge: the fact that these two properties concern 's estimates of its own output makes it difficult to use these properties to reason about . If our goal is to find a reasonable heuristic estimator , it is most useful to have properties that pin down 's outputs on simple inputs. The iterated estimation and error orthogonality properties are not helpful in this regard, because the simplest possible equality that is derivable from either property still involves a mathematical expression that includes the code of  as part of the expression. Furthermore, without knowing 's code, a constraint that involves 's behavior on its own code is less useful.

For these two reasons, we are interested in more grounded variants of the iterated estimation and error orthogonality properties: ones that still capture the key intuition that 's errors ought not be predictable, but that do not involve nested 's. This motivates searching for an objective approach to formalizing the principle of unpredictable errors.

 

The objective approach: Accuracy

The basic idea of the objective approach is to replace the outer  in the iterated estimation and error orthogonality properties with an expected value  over some probability distribution . In other words, 's errors should be objectively unpredictable -- rather than subjectively unpredictable -- over a specified distribution of expressions. Depending on the context, we call these properties accuracy or multiaccuracy.[14]

Before defining accuracy for , we define accuracy in the context of probability theory.

Definition (accuracy of an estimator.) Let  be a space of real-valued mathematical expressions, and let  be a probability distribution over . Let  be a random variable. An estimator  is -accurate over  if

We say that  is self-accurate over  if  is -accurate over . For a set  of random variables, we say that  is -multiaccurate over  if  is -accurate over  for all .

 

Intuitively, being -accurate means that an estimator has "the right amount of ": the estimator's error is uncorrelated with , which means that adding a constant multiple of  to the estimator can only hurt the quality of the estimator. (See Proposition 3.4 in the paper for a formal statement of this intuition.)

As an example, let  be the space of expressions of the form , where . (For example, the expression  belongs to .) Let  be the distribution over  obtained by selecting  independently from , the standard normal distribution.

Let us consider the estimator . This estimator is -accurate, meaning that it has the correct mean: . It is also self-accurate: . However, it is not -accurate: . This reflects the fact that  does not have "the right amount of ": adding  to  will make it a better estimator (indeed, a perfect estimator) of .

Here is a Venn diagram of some other estimators of  based on whether they are -accurate, -accurate, and self-accurate over .

To get some intuition for accuracy, you could verify that the Venn diagram is correct. Another exercise: find another estimator that goes in the center of the Venn diagram. This Wikipedia page may be useful.

(Multi-)accuracy is closely related to linear regression. In particular, the OLS (i.e. linear regression) estimator[15] of  in terms of a set of predictors  is -multiaccurate (and is the only -multiaccurate estimator that is a linear combination of ). The linear regression estimator is also self-accurate.[16]

 

We can adapt our definition of accuracy for estimators (in the probability theory sense used above) to heuristic estimators . The basic idea is to replace the estimator  with our heuristic estimator , i.e. to substitute  for :

Definition (accuracy of a heuristic estimator.) Let  be as in the previous definition. Let  be a heuristic estimator and  be a random variable. A set of heuristic arguments  makes  be -accurate over  if for all  is an -accurate estimator over  -- that is,

We say that  is -accurate over  if there is a short  that makes  be -accurate over . We say that  is -multiaccurate over  if  is -accurate over  for all .

(The paper clarifies some details that have been skipped over in this definition. For example, the paper defines "short" and clarifies some subtleties around the interpretation of  in the context of the definition.)

(Note also the similarity between this equation and our definitions of iterated estimation and error orthogonality. Indeed, we recover the definition of iterated estimation in the special case of  -- except that the outer  is replaced by an expectation over . We can similarly recover error orthogonality in the case of  -- see the paper for details, including for a definition of self-accuracy for heuristic estimators.)

 

While iterated estimation and error orthogonality are primarily "soundness" conditions on  -- that is, they constrain  to output internally consistent estimates -- accuracy can additionally be used as a "completeness" condition. In particular, if  is -multiaccurate, this means that:

  •  can successfully incorporate the predictors in  into its estimates; and that
  •  can reasonably merge its estimates based on these predictors (because if  then  needs to be an -accurate estimator of  and also an -accurate estimator of ).

 Now, our goal is for  to be -multiaccurate for a rich class of predictors . Such a  would be powerful while producing reasonable estimates. Unfortunately, as we discuss in the next section, it seems quite difficult to produce such a .

 

Challenges with the objective approach

Given a natural distribution  over mathematical expressions and a small, simple, and natural set of predictors , it is always possible to efficiently produce a self-accurate and -multiaccurate estimator?

It may seem like the answer is yes: in particular, we mentioned earlier that the linear regression of  onto the predictors in  is self-accurate and -multiaccurate. However, this raises an important question: can we efficiently compute the necessary regression coefficients?

Unfortunately, as we will discuss, the answer to this question is no.

 

Estimating the product of jointly normal random variables

Here is a quite simple and natural estimation problem: given a  covariance matrix , estimate the expected product of  random variables with mean  and covariance matrix .

We consider this estimation problem for two reasons. On the one hand, this is one of the simplest estimation problems for which computing (or even approximating) the correct answer is computationally intractable. On the other hand, this problem captures the core difficulty of a natural, more general estimation problem: estimating the average output of an arithmetic circuit (a circuit with addition and multiplication gates). Addition gates are straightforward: , and so the challenge lies in the multiplication gates.

It turns out that the answer to this estimation problem is equal to the sum of all -fold products of covariances in which each variable is used exactly once. That is, if  are jointly normal, zero-mean random variables, then

where  denotes the set of all pairings of  (for example, one element of  is ). (This is called the hafnian of the covariance matrix.)

And so our estimation problem amounts to computing a giant sum. This suggests a natural class of predictors: namely, partial sums.

Concretely, let  be the expected product of random variables with mean zero and covariance matrix , and let  be the distribution over  induced by selecting the (off-diagonal) entries of  independently from .[17] Given a pairing  of , we define , and for a subset , we define  (so ).

Note that  can be efficient to compute even for exponentially large sets of pairings . For example, let  be the set of  pairings that pair 1, 2, 3, 4 amongst themselves (there are three ways to do so); pair 5, 6, 7, 8, amongst themselves; and so on. Then

And so we might wonder: given two efficiently computable partial sums , is it always possible to efficiently combine them into a single estimate of  with linear regression? That is, are the requisite regression coefficients efficiently computable?

Alas, the answer is no. We show this by reduction from 3SAT -- that is, we show that if you can compute these regression coefficients, then you can solve boolean satisfiability problem, which is NP-complete. If you're interested in the details, check out Section 4.1 of the paper!

(We have not ruled out the possibility that there is a more sophisticated way to accurate merge the estimators  and , given that linear regression is intractable. However, we conjecture that no accurate and efficiently computable merge exists. That is, in general, there is no estimator  of  that is both efficiently computable and -multiaccurate.)

 

Even though you can't efficiently compute the regression coefficients for merging these arguments exactly, you might wonder whether it's possible to compute them approximately. Concretely, given sets  for which the predictors  are efficient to compute, is it possible to find a linear combination of the 's that is approximately self-accurate and -multiaccurate? It turns out that, in order to compute the regression coefficients of  onto , it is sufficient to compute  for all . This suggests that you can estimate the regression coefficients by estimating the sizes of these intersections, which you can do e.g. by randomly sampling elements of  and seeing if they belong to .

It turns out that the number of samples you need depends polynomially on the condition number of the correlation matrix of . Unfortunately, this condition number can be exponentially large.[18]

Currently we do not know how to merge these estimates even approximately, although we aren't confident that this cannot be done. At the very least, the most straightforward approaches do not work, which suggests a barrier to creating algorithms that accurately merge even simple estimates for simple and natural estimation problems.

 

Estimating the permanent of a matrix

We consider another natural estimation problem: given an  matrix , estimate its permanent. The permanent of  is the sum of all products of  elements of , one per row and column:

While superficially similar to the determinant, the determinant can be computed efficiently, while the permanent cannot even be approximated efficiently.

In the paper, we consider three different estimates of the permanent, all of which are motivated by an argument via presumption of independence. The row sum estimate is given by

The row sum estimate is  times the average product obtained by taking one element of each row of . It is also the average permanent of all matrices obtained from  by shuffling each row independently. Similarly, the column sum estimate is given by

Finally, the matrix sum estimate is  times the average product obtained by taking  random elements of  (with replacement):

If we want to accurately merge these estimates over a distribution of matrices, we can do so with linear regression. However, the resulting estimator can have undesirable properties. For example, the linear regression estimator over the distribution of matrices with each entry selected independently from  takes the form , where  is positive. In particular, this means that even if  has exclusively non-negative entries, this linear regression estimator for the permanent of  can be negative.

This by itself is okay: we do not expect estimates to be reasonable by default. What we do expect, however, is that upon noticing that an estimate is unreasonable, we should be able to correct it. That is, we ought to be able to produce an estimator that merges the row sum, column sum, and matrix sum estimates, and is non-negative on matrices with non-negative entries.

Unfortunately, we are not aware of a natural way to produce such an estimator. Actually, for matrices with non-negative entries, there is a natural estimator that "merges" the row sum, column sum, and matrix sum estimates: namely, . (See Section 5.2 of the paper for an explanation of where this estimator comes from and why it is reasonable.) However, this estimator does not satisfy any accuracy properties over any natural distribution of matrices. This makes sense, because this estimator is "multiplicative" in nature, whereas accuracy is an "additive" property.

Our discussion of estimating the permanent thus poses another barrier to using multiaccuracy to formalize the principle of unpredictable errors. Namely, accuracy forces us to reject a seemingly reasonable estimator while forcing us to create seemingly unnatural estimators to satisfy additional properties (like estimating the permanents of matrices with non-negative entries as being non-negative).

 

Conclusion

The ultimate goal of formalizing properties like the principle of unpredictable errors it to help guide the search for a heuristic estimator. Once you have formal properties, you can pose a formal mathematical question: "Does there exist a polynomial-time algorithm  that satisfies [formal properties like linearity, unpredictable errors, etc.]?". Once you have such a question, you can use your mathematical toolbox to try to resolve it. By contrast, without such a question, you're forced to answer vague subjective questions like "What would a reasonable heuristic estimator do in this situation?".

Two properties that have stood the test of time are linearity and respect for proofs. Linearity states that for  and mathematical expressions , we have that  Respect for proofs states that, given a proof that , the proof may be turned into a heuristic argument  such that for all  containing , we have . Unfortunately, these two properties alone are insufficient to pin down the behavior of : there are heuristic estimators that satisfy linearity and respect for proofs but behave "unreasonably" (see Chapter 9 here).

It would be really nice if we could formalize the principle of unpredictable errors, because perhaps a satisfying formalization of this principle, together with linearity and respect for proofs, would force  to behave reasonably. So far, we have not found a satisfying formalization; finding one might constitute an important step forward in our understanding of heuristic estimation.

 

This year, though, we have mostly focused on a different approach. Our new approach reframes the heuristic estimation problem as an activation modeling problem: learning a coherent representation of the statistical properties of a neural network's activations (or the values of a circuit's wires) that lets us answer questions about the neural network (or circuit).[19] We haven't written about this perspective in detail yet, because we are still developing it, but see our recent blog post on estimating tail risks in neural networks for an outline of what this approach might look like. We are excited to see where our new perspective takes us!

 

  1. ^

    Our original notation is . In our new work, we use the notation  to emphasize that, while there are similarities between heuristic estimation and expected values, they are importantly different.

  2. ^

    See here for an earlier blog post that introduced the mechanistic anomaly detection problem.

  3. ^

    Perhaps instead of no arguments,  is given a short argument that points out that there are twenty digits and that its estimate for each digit ought to be (4.5).

  4. ^

    Here, "short" means "about as large as the model itself."

  5. ^

    For example,  could be based on a trained reward predictor, as in RLHF.

  6. ^

    For example, if  is a financial assistant that takes actions such as buying stocks and transferring money between bank accounts, then  might have a low loss on  because it makes good financial decisions, but a low loss on  because it implements a money laundering scheme that  fails to notice.

  7. ^

    One of the most important and difficult questions faced by this approach is how to find such a . If the space of arguments is parameterized, then we may hope to learn  via gradient descent in parallel with training  itself.

  8. ^

    The idea is that, without any arguments,  does not understand anything about the structure of , and so should estimate 's loss as if  were a randomly initialized neural network. (Such a network would incur high loss.) Heuristic arguments that explain 's structure should cause 's estimate of 's loss to decrease.

  9. ^

    For example, in the iterated distillation and amplification process,  could be a distillation of a trusted model ; however, we may not trust .

  10. ^

    Most generally, the law of iterated expectations states that for a probability space  with -algebras , for any integrable random variable  we have .

  11. ^

    It may seem like a generalization (consider the case of ), but it is not: deriving iterated estimation from error orthogonality would require assuming additional properties of .

  12. ^

    The projection law states that for a probability space  with -algebras , for square-integrable random variables , we have .\(\\)

  13. ^

    As an analogy, consider a proof verifier  that takes as input a mathematical statement  and a purported proof , and outputs  (accept) or  (reject) depending on whether  is a proof of . Let  be the statement "If , then ." For every , there is a proof  of  (specifically: if , then  proves  and thus ; and if , then the computational trace of  on  shows that  and thus proves ). However,  should not treat the input  as a special case; instead,  should verify that  proves  just as it would verify any other proof.

  14. ^

    The term "multiaccuracy" originates in the algorithmic fairness literature, where it is used to describe a predictor that appears unbiased to a given set of statistical tests (see e.g. here and here).

  15. ^

    Without a constant term. If you want a constant term, you can add the predictor  to .

  16. ^

    Here we speak of the exact linear regression estimator of  in terms of : in other words, the linear combination of these predictors that is closest to  (in terms of expected squared error over ). This contrasts with the more typical setting for linear regression, in which coefficients are computed only approximately based on samples.

  17. ^

    The diagonal entries (which do not matter for the value of ) can always be chosen so that  is a valid covariance matrix, simply by making those entries be very large.

  18. ^

    We believe that by using ridge regression instead of linear regression, it is possible to find an approximately -multiaccurate estimate of . However, this estimate is not approximately self-accurate. See Remark 4.14 in the paper for an explanation for why we consider self-accuracy to be important.

  19. ^

    Very loosely speaking, the correspondence between heuristic estimation and activation modeling is that a particular activation model corresponds to , and so an activation model can take as input a quantity and return an estimate of that quantity.

New Comment
2 comments, sorted by Click to highlight new comments since:

to simplify, we ask that for every expression and set of arguments

Here and in the next dot point, should the inner heuristic estimate be conditioning on a larger set of arguments (perhaps chosen by an unknown method)? Otherwise it seems like you're just expressing some sort of self-knowledge.

Yeah, that's right -- see this section for the full statements.