Abstract
This essay explains the Direct Approach proposed by (Barnett and Besiroglu 2023a).1 I encourage you to play with the Direct Approach Interactive Model to explore an interactive simulation using the approach.
1 The thing is released in a scattered way, typical for an internet-native publication. There is the report (Barnett and Besiroglu 2023a), in the form of a paper – clearly meant to be cited, despite being hard to read. There is the website (Barnett and Besiroglu 2023b), in the form of a blog post – clearly meant to be read, despite not being upper-class enough to be cited in journal papers. Finally there is the interactive model which looks like an optional add-on to the blog post.
The Direct Approach framework bounds the compute requirements for transformative AI by extrapolating neural scaling laws. We combine those estimates with simple models of future progress in algorithms, investment, and compute costs to produce a user-adjustable forecast over the date at which TAI will be achieved. (Barnett and Besiroglu 2023b)
From the POV of the judge, a Turing test is a sequential test for two statistical hypotheses – “is human” and “is machine”. Under reasonable assumptions, halving the (reducible part of) log-perplexity loss of the language model would double the time it can survive in a Turing test.
We can think of the peer review of scientific papers as a Turing test, and say that AGI has arrived when we have AI scientists whose papers can pass peer review. This allows us to calculate the log-perplexity loss of the first AGI. If we assume it is just a scaled-up GPT, then under the Chinchilla scaling law, training it would cost about 200 years of global GDP. This makes it virtually certain that the first AGI will not be a scaled-up GPT.
Turing test as statistical hypothesis test
Turing test
In the Turing test, there are three participants: one judge and two players. The first player is a human, and the second is a machine. The judge asks each player text questions and receives text answers. The judge must decide who is the human.

We consider a simplified Turing test. In this test, the judge does not ask questions, and simply receives one stream of text, produced either by the human or by the machine.
Cast in the language of statistical hypothesis testing, we have two hypotheses:

$H_0$: "the stream is produced by the human"; $H_1$: "the stream is produced by the machine".

The judge would read from the stream o-n-e- -t-o-k-e-n at a time, and at each token, decide whether to take another one, or announce its judgment: $H_0$ or $H_1$.
As the organizers of the Turing test, we would start the test by flipping a fair coin to decide whether to use the human or the machine. Therefore, $P(H_0) = P(H_1) = 1/2$.

This allows us to use the sequential probability ratio test (SPRT). The judge would decide on two decision boundaries, and after each token calculate the log-probability ratio

$$\Lambda_n = \ln \frac{P(x_1, \dots, x_n \mid H_0)}{P(x_1, \dots, x_n \mid H_1)},$$

continuing as long as $\Lambda_n$ stays between the boundaries. For example, suppose the judge wants to decide when the odds ratio is 10 to 1, then it would set the decision boundaries to be $+\ln 10$ and $-\ln 10$: announce $H_0$ once $\Lambda_n \geq \ln 10$, and $H_1$ once $\Lambda_n \leq -\ln 10$.
Sequential hypothesis testing
Consider the following simple equation:

$$\underbrace{D_{\mathrm{KL}}(\text{human} \,\|\, \text{machine})}_{\text{KL-divergence per token}} = \underbrace{\mathbb{E}_{\text{human}}\left[-\ln P_{\text{machine}}\right]}_{L} - \underbrace{\mathbb{E}_{\text{human}}\left[-\ln P_{\text{human}}\right]}_{H}$$

The first term is the KL-divergence per token between the machine and the human. Roughly speaking, it is how different they are, per token emitted. It is an information-theoretic quantity.

The second term is the negative log-likelihood loss per token. This is what language models are trained to minimize. We write it as $L$.

The third term is the entropy rate of the human. It is how random the human is. We write it as $H$.

If the machine is a perfect replica of the human, then the first term is zero, and the second term equals the third term.
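The identity can be checked concretely. Here is a minimal sketch with a made-up, context-free next-token distribution (real models condition on context; the three-token vocabulary is purely an assumption for illustration):

```python
import math

# Toy next-token distributions over a 3-token vocabulary (an assumption
# for illustration; real language models condition on context).
human   = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
machine = {"cat": 0.4, "dog": 0.4, "fish": 0.2}

# Third term: entropy rate of the human, H = -sum p ln p (nats/token).
H = -sum(p * math.log(p) for p in human.values())

# Second term: negative log-likelihood of the machine on human text,
# L = -sum p_human * ln p_machine (what language models minimize).
L = -sum(human[t] * math.log(machine[t]) for t in human)

# First term: KL-divergence per token between human and machine.
KL = sum(human[t] * math.log(human[t] / machine[t]) for t in human)

assert abs(KL - (L - H)) < 1e-12  # D_KL = L - H, exactly
```

The assert holds by algebra, not by luck: expanding $\ln(p/q) = \ln p - \ln q$ inside the KL sum gives exactly $L - H$.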
Assuming that the human is an ergodic speaker of English,2 we can sample a single infinite stream of its speech, and the time-averages of $-\ln P_{\text{machine}}$ and $-\ln P_{\text{human}}$ along that one stream converge to $L$ and $H$.
2 In short, an ergodic speaker is someone who has only one speech. If you hear it speak once for a very long time, then hear it speak again for a very long time, then you can take the first and shift it around, so that it looks like the second over a very long sub-segment. Ergodic speakers allow you to take the average over a single very long speech, and be assured that it is close to the average over all possible speeches.
In long, see the appendix on ergodic theory.
On the other hand, if the machine is also an ergodic speaker of English, then we can sample an infinite stream from the machine, where unfortunately, we have the odd mirrored quantities $L'$ and $H'$.

We can interpret them as the loss of the human at imitating the machine, and the entropy rate of the machine itself. When the machine is close enough to the human, we can take the approximation $D_{\mathrm{KL}}(\text{machine} \,\|\, \text{human}) \approx D_{\mathrm{KL}}(\text{human} \,\|\, \text{machine})$.
Now, define the log-ratio at step $n$ as $\Lambda_n = \ln \frac{P(x_1, \dots, x_n \mid H_0)}{P(x_1, \dots, x_n \mid H_1)}$. A perfect judge knows the exact token probabilities of both the human and the machine, and so can update $\Lambda_n$ exactly.
So, imagine that such a perfect judge is going through a Turing test, upon receiving "my cat is technically", and we are listening on its thoughts:

- "If it were a human, then it would start with 'my' with some probability $p_1$. If it were a machine, then the probability would be $p_1/2$. Therefore, the odds ratio is 2 to 1."
- "If it were a human, then it would follow 'my' with 'cat' with some probability $p_2$. If it were a machine, then the probability would be $p_2/3$. Therefore, the odds ratio is 3 to 1."
- "If it were a human, then it would follow 'my cat' with 'is' with probability… I do not know. However, I do know that the odds ratio is 2 to 1. Now that the total odds ratio is 12 to 1, I can decide: $H_0$, it is a human."

We see that the judge does not have to know the probabilities themselves, only their ratios.
Let $\delta$ be the average log-ratio gained per token, which by the previous section is the KL-divergence per token, $D_{\mathrm{KL}}(\text{human} \,\|\, \text{machine})$.

Intuitively, each token on average moves the log-probability-ratio away from 0 by another $\delta$, so reaching a decision boundary at $\pm \ln 10$ takes on the order of $\ln 10 / \delta$ tokens. This is why halving the reducible loss doubles the survival time.
We are not able to simply look at a few tokens, draw a straight line, and call it a day, because the trajectory of log-probability-ratio is much closer to a random walk with drift. Subjectively, if you were a judge and watching the log-probability-ratio moving, you’d see ups and downs, keeping you in suspense, until it finally crosses the decision boundaries.
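A short simulation shows this suspense. This is a sketch of a perfect judge running the SPRT; the context-free token distributions and the 10-to-1 boundary are assumptions for illustration:

```python
import math, random

random.seed(0)

# Made-up, context-free next-token distributions (illustrative only).
human   = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
machine = {"cat": 0.4, "dog": 0.4, "fish": 0.2}
boundary = math.log(10)  # announce a verdict at 10-to-1 odds

def sprt_trial():
    """One test where the stream really comes from the human."""
    log_ratio, steps = 0.0, 0
    tokens, weights = zip(*human.items())
    while abs(log_ratio) < boundary:
        tok = random.choices(tokens, weights)[0]
        log_ratio += math.log(human[tok] / machine[tok])
        steps += 1
    return steps, log_ratio > 0  # True means "judged human"

results = [sprt_trial() for _ in range(200)]
mean_length = sum(s for s, _ in results) / len(results)
accuracy = sum(ok for _, ok in results) / len(results)
# The trajectory drifts upward by D_KL(human || machine) per token on
# average, but individual trials wander up and down, hence the suspense.
```

With these numbers the per-token KL gap is only about 0.025 nats, so trials typically wander for dozens of tokens before absorbing at a boundary, and a small fraction absorb on the wrong side.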
Slowdown factor
To perform the SPRT as described, the judge must know intimately the difference between a human and a machine. Can the judge do that? Can anyone know, with certainty, that I would start my speech with “Forty cats …” with a probability that is exactly 32.42 times that of GPT-3?
As a crude approximation, we can model real-world judges as slowed-down versions of the perfect judge. We can imagine that at each step, instead of updating the log-ratio by the full $D_{\mathrm{KL}}$ per token, we update it by $D_{\mathrm{KL}} / k$, where $k \geq 1$ is the slowdown factor. A perfect judge has $k = 1$; the more confused the judge, the larger $k$, and the longer the test must run.
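Under this model the expected test length simply scales linearly with the slowdown factor. A minimal sketch, with an assumed per-token KL gap:

```python
import math

# Assumed per-token KL gap between model and human (nats); illustrative.
kl_per_token = 0.025
boundary = math.log(10)  # stop at 10-to-1 odds

def expected_tokens(k):
    """Expected test length when each token moves the log-ratio by KL / k."""
    return boundary / (kl_per_token / k)

# A perfect judge (k = 1) versus a much slower real judge (k = 50).
fast = expected_tokens(1)
slow = expected_tokens(50)
```

Doubling $k$ doubles the expected number of tokens needed, so the slowdown factor acts as a simple multiplier on test length.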
Measuring the slowdown factor
The slowdown factor $k$ cannot be derived from theory; it has to be measured empirically.

Informed by an internal poll, we enforce a lognormal distribution with a median of 53.1, a 15th percentile estimate of 9.84, and an 85th percentile of 290. (Atkinson 2023)

The original paper (Barnett and Besiroglu 2023a) contains no estimate of $k$ obtained from actual experiments, only this poll.
My ideas
What I think would work well is if both the "human" source and the "machine" source are simple, fully specified models, so that the perfect judge can be computed exactly and compared against real human judges.

Perhaps we can make this into a gambling game. The human subject would be presented with two long outputs from two hidden Markov models. Then the subject becomes the judge of a Turing test: "Are you seeing the output from machine 0 or machine 1?". At each step, the subject can either pay a few cents of fake money to see another character, or stop and make a bet with the entire bankroll: "I bet 70% of my bankroll on machine 0 and the rest on machine 1!". Both bets have payoff odds of 2 to 1, which is fair given the 50–50 prior.
A diffusion guessing game: The user uploads a lot of MIDI music files. The program plays back note by note, but adds increasingly severe distortions (adding Gaussian or Poisson noise to each note). The user guesses which piece it is. The more noise there is, the longer it should take the user to guess. Compare the length with the Bayesian optimal predictor.
Human-or-not
There was one large-scale attempt at the Turing test in early 2023, in a game called "Human or Not?" (Jannai et al. 2023). Human participants held 2-minute conversations, and at the end, had to decide whether they were talking to a human or a bot.3
3 There was no mention of whether the bots had to decide the same question.
The conversations have a “ping-pong” structure that prevents players from sending two consecutive messages without a response, in order to ensure a balanced and dynamic exchange. Each message, limited to a maximum of 100 characters, has to be composed and sent within a 20-second window, and the chat ends after 2 minutes, usually consisting of 4-5 messages from each side. This ensures that players don’t have to wait for too long, so they can remain engaged with the game and a constant suspense is kept. Once the conversation is over, players are prompted to guess whether their conversational partner was a fellow human or an AI bot. (Jannai et al. 2023)
I counted that during a typical game, each side sends on the order of a hundred tokens in total.

They used several different AIs, including Jurassic-2, GPT-4, and Cohere. None of them have published their training compute or loss curves. The only good estimate is for GPT-4, whose training compute is commonly estimated at about $2 \times 10^{25}$ FLOP.
Assuming that Chinchilla scaling holds, the average log-ratio per token that an ideal judge should achieve against these models can be estimated from their training compute.
I did not expect the estimate to be nearly symmetric around
Entropy of natural languages
In Equation 1, we argued that the loss decomposes as $L = D_{\mathrm{KL}} + H$, so the entropy rate $H$ of English is the irreducible part of the language modeling loss: no model can push $L$ below $H$.

Since tokenizers are temporary, but English is permanent, we convert all units to bits per character (bit/char).
Chinchilla scaling
In the Chinchilla scaling law paper, the authors trained many language models with various sizes from a single architecture family, and fitted a statistical law to the data, giving

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \quad E = 1.69,\; A = 406.4,\; B = 410.7,\; \alpha = 0.34,\; \beta = 0.28$$

where $L$ is the loss in nat/token, $N$ the number of parameters, and $D$ the number of training tokens. The irreducible term $E = 1.69$ nat/token is the fitted estimate of the entropy rate.

To find the effective bit/char figure, I tokenized the WikiText-2 corpus, and found that on average, each token spans about 4.5 characters. Thus, $E = 1.69 \text{ nat/token} \approx 2.44 \text{ bit/token} \approx 0.54 \text{ bit/char}$.
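As a sanity check, here is a sketch that evaluates the fitted law and performs the unit conversion. The 4.5 characters-per-token figure is my assumed average for the tokenizer; the constants are from the published Chinchilla fit:

```python
import math

# Chinchilla fit (Hoffmann et al. 2022): L(N, D) = E + A/N^alpha + B/D^beta,
# with the loss L measured in nats per token.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# Tokenizers are temporary, English is permanent: convert to bit/char.
# The 4.5 characters-per-token average is an assumption for the tokenizer.
NATS_TO_BITS = 1 / math.log(2)
CHARS_PER_TOKEN = 4.5

def bits_per_char(nats_per_token):
    return nats_per_token * NATS_TO_BITS / CHARS_PER_TOKEN

# The irreducible term: 1.69 nat/token comes out to about 0.54 bit/char.
irreducible = bits_per_char(E)
```

Plugging in Chinchilla's own 70B-parameter, 1.4T-token configuration gives a loss a bit below 2 nat/token, dominated by the irreducible term.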
Turing-completeness
Unfortunately, to lower-bound entropy rates, we typically have to make strong assumptions, because in general, lower-bounding entropy is not computable. For example, the digit sequence of $\pi$ looks random to statistical tests, yet a short program generates it, so its true entropy rate is zero.

There is also another problem: probabilistically speaking, there is no such thing as the entropy rate of a single sequence. We can convert a single sequence into a stochastic process, by, for example, sampling a random positive integer $n$ and reading the sequence starting from position $n$.

This is similar to the Halting problem. It's possible in general to prove that a machine halts: just run it. It's not possible in general to prove that it does not halt. Similarly, there is a program that searches through all short generating programs; if it finds one that reproduces the sequence, it has proved an upper bound on the sequence's entropy rate.

There are two catches.

One, it is impossible to know in general whether the search will ever succeed, just as it is impossible to know whether a machine will halt.

Two, it is also impossible to have a program that proves lower bounds on the entropy rate in general.
(Feutrill and Roughan 2021) reviews the entropy rate formulas for several commonly used models. An ergodic Markov chain with stationary distribution $\pi$ and transition matrix $P$ has entropy rate

$$H = -\sum_i \pi_i \sum_j P_{ij} \ln P_{ij}$$
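A minimal sketch computing this formula for a toy two-state chain (the transition probabilities are made up):

```python
import math

# Entropy rate of an ergodic Markov chain:
#   H = -sum_i pi_i * sum_j P_ij * ln(P_ij)
# Toy two-state chain; the transition probabilities are made up.
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution solves pi = pi P. For a two-state chain,
# pi_0 = P[1][0] / (P[0][1] + P[1][0]).
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

# Average the per-state transition entropies, weighted by pi.
H = -sum(pi[i] * sum(p * math.log(p) for p in row if p > 0)
         for i, row in enumerate(P))
```

Here the chain sits in state 0 most of the time (80%), where transitions are quite predictable, so the entropy rate lands well below the $\ln 2 \approx 0.69$ nats of a fair coin.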
Guessing game
The earliest attempt to measure the entropy rate of English is by Shannon himself (Shannon 1951):
Let a subject read an English text, and repeatedly guess the next letter until correct, recording the number of guesses needed at each step. From the statistics of the guess counts, Shannon derived an upper bound and a lower bound on the entropy rate of English, estimating it to be between 0.6 and 1.3 bits per letter.

The upper bound is proved by Shannon's source coding theorem. Taking a human subject, copy it, then they can be used as an encoder-decoder pair.4 The lower bound is not only tricky to prove, but also wrong in general. It is only correct when the human subject is the optimal predictor.5
4 It still works even if the humans are pseudorandom. We just have to whisper the same RNG seed into both humans’ ears, and then they would behave in the same pseudorandom way.
5 The simplest counterexample: Suppose the source is binary, and satisfies
This source can be made ergodic by adding an
(Burton and Licklider 1955) uses the same method as Shannon, but with longer contexts and texts from more books (Shannon sampled all passages from just one book). They found that English has “redundancy” in
Over the years, others devised other methods to estimate this entropy. For example, (T. Cover and King 1978) used a gambling game estimation, in the style of the Kelly criterion. Subjects were required to divide their entire bankroll into 27 differently-sized bets over 27 possibilities (26 letters and 1 whitespace). The right bet pays back 27-fold, and the other bets are lost. Let $W_n$ be the bankroll after $n$ letters; since wealth can grow at rate $\log_2 27 - H$ at best, the quantity $\log_2 27 - \frac{1}{n} \log_2 W_n$ estimates the entropy rate.

They found that English has an entropy rate of about 1.3 bits per character.

The guesser does not have to be a human. It can very well be a language model. (Brown et al. 1992) made a simple trigram model over the Brown corpus (600 million words), and found that it gives an upper bound of about 1.75 bits per character.
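The gambling estimator is easy to sketch for a toy source where the optimal bets are known. Here an i.i.d. three-letter source stands in for English, and the bettor bets the true probabilities; both simplifications are assumptions for illustration:

```python
import math, random

random.seed(1)

# Toy stand-in for English: an i.i.d. source over three letters (Cover
# and King used 27 symbols and human bettors).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
m = len(probs)
true_H = -sum(p * math.log2(p) for p in probs.values())

# A proportional (Kelly) bettor who knows the source bets fraction p(x)
# on each symbol; the right bet pays back m-fold, the rest is lost.
n = 100_000
letters, weights = zip(*probs.items())
log_wealth = sum(math.log2(m * probs[x])
                 for x in random.choices(letters, weights, k=n))

# Wealth grows at rate log2(m) - H, so the entropy estimate is:
H_hat = math.log2(m) - log_wealth / n
```

For a bettor whose bets do not match the true conditional probabilities (like a human), the growth rate is lower, so the same formula yields an upper bound on the entropy rate rather than the exact value.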
A more recent attempt in 2022 is crucial, as it seems to show that the best language models like GPT-3 have already surpassed humans at reaching low perplexity, but I haven't yet studied it in detail.
Lossless compression
Another way to estimate is by lossless compression of a large corpus, since the entropy rate is a lower bound on the compression rate. In more detail, if a source of information emits a symbol stream with an entropy rate of $H$ bits per symbol, then no lossless compressor can do better than $H$ bits per symbol on average.

The Hutter prize is a competition for compressing the enwik9 dataset (the first $10^9$ bytes of the English Wikipedia dump). For the size of the finished product, both the algorithm and the compressed data must be counted. In particular, if a neural network is used, then the size of the neural network weights must be counted as well.
The enwik9 dataset is in XML format, and thus contains a lot of non-English content like `<timestamp>2005-12-27T18:46:47Z</timestamp>`. As a simple estimate, we simply counted its characters, and divided the winning compressed size by the character count to get an entropy-rate estimate in bit/char.

The standard zip algorithm can compress it down to about 300 MB in size, a compression ratio of about 3.3.
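The same upper-bounding trick can be sketched in a few lines with the standard zlib library. The toy text here is highly repetitive, so the resulting number is far below real English; it only illustrates the method:

```python
import zlib

# Entropy rate lower-bounds the compression rate, so compressed size
# upper-bounds the entropy. Toy input: very repetitive English.
text = ("the quick brown fox jumps over the lazy dog. " * 2000).encode()

compressed = zlib.compress(text, level=9)
bits_per_char = 8 * len(compressed) / len(text)
ratio = len(text) / len(compressed)
```

On real English prose, general-purpose compressors like zlib land around 2–3 bits per character, well above the entropy-rate estimates in the table below, which is why the serious entries use far heavier modeling.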
Similar to the Hutter prize, the Large Text Compression Benchmark also asks for compressing the enwik9 dataset. However, there is no limit on the algorithm runtime or size, so the compression ratio for this benchmark is always higher. Currently (2024-01-19), the best compression is achieved by `nncp v3.2`, which uses a small Transformer model.
(Grassberger 2002) used a substitutional compression algorithm with increasingly large codebooks. When the codebook had 6000 codes, the algorithm gave
Summary
| estimate | method | raw number | effective entropy rate (bit/char) |
|---|---|---|---|
| (Grassberger 2002) | compression, extrapolation | | |
| Hutter prize (Kaido Orav, 2024) | compression | compression ratio | |
| Hutter prize extrapolated | compression, extrapolation | compression ratio | |
| Large Text Compression Benchmark (`nncp v3.2`, 2023) | compression | compression ratio | |
| (Shannon 1951) | guessing game | 0.6–1.3 bit/letter | |
| (Burton and Licklider 1955) | guessing game | | |
| (T. Cover and King 1978) | gambling game | ~1.3 bit/char | |
| (Brown et al. 1992) | 3-gram language model | 1.75 bit/char | |
| (Behr Jr et al. 2002) | n-gram language model | | |
| (Kaplan et al. 2020) | Transformer language model, extrapolation | | |
| (Hoffmann et al. 2022) | Transformer language model, extrapolation | 1.69 nat/token6 | 0.54 |
6 Because the tokenizers differ, the same nat/token figure translates to different bit/char figures.
Notably, the above table has mostly upper bounds, and only one dubious lower bound (by Shannon) from 1951. Perhaps lower bounds can be established by using randomness extractors on a large corpus, and checking that the output from the extractor passes pseudorandomness tests.
Most of the data seems to be centered around 0.8 bpc. The one outlier is the Chinchilla scaling law estimate: 0.54 bpc. I have found that a lot hinges on the exact tokenizer-dataset fit, which is why tokenization is so annoying and I wish people would try to do away with it, or at least report bit-per-character in addition to nat-per-token.
Forecasting AGI
According to the Chinchilla scaling law (Hoffmann et al. 2022), if we have a fixed amount of computing budget $C$, the loss-optimal allocation of parameters and data gives a minimal achievable loss $L(C)$, whose reducible part $L(C) - E$ falls as a power law in $C$.
Assuming a slowdown factor $k$, a model whose reducible loss is $D$ nats per token survives a Turing test at 10-to-1 odds for about $k \ln 10 / D$ tokens.

This gives, as a rule of thumb: to survive for $n$ tokens, the reducible loss must be pushed below $k \ln 10 / n$ nats per token.

For example, assuming a slowdown factor of around 50 (close to the poll median), we can plug in candidate compute budgets and read off the survivable test length.
If GPT-4 costs about $2 \times 10^{25}$ FLOP of training compute, then by the rule of thumb above, a slowed-down judge would need on the order of 200 tokens to reach a verdict, meaning it has a good chance of passing the Turing test if limited to only 150 words. For context, the Attention is All You Need paper has an abstract that's about 200 tokens long.

A typical scientific paper is about 4000 words long, roughly 5000 tokens, far longer than what such a model can survive.
This implies that the first AGI will not be a scaled-up GPT – autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory,7 with even 50% of global GDP devoted,8 and with 2% growth rate forever, it would still take 110 years,9 arriving at year 2133. Whole brain emulation would likely take less time.10
7 Consider this anecdote from Edward Teller:
The possibilities of developing an atomic weapon and the desirability of doing it secretly were discussed at a Princeton University conference in which I participated in March 1939… Bohr said this rare variety could not be separated from common uranium except by turning the country into a gigantic factory. Bohr was worried that this could be done and that an atomic bomb could be developed–but he hoped that neither could be accomplished. Years later, when Bohr came to Los Alamos, I was prepared to say, “You see…” But before I could open my mouth, he said: “You see, I told you it couldn’t be done without turning the whole country into a factory. You have done just that.” (Teller and Brown 1975)
8 Only in a life-or-death situation does 50% of GDP get devoted to one purpose. For example, that is about the level of GDP devoted to war production during WWII in the major combatant countries. The USA spent 4 trillion USD2011 over 6 years out of an annual GDP of 1.3 trillion USD2011.
9 Solve for $t$ in $0.5 \times \frac{1.02^t - 1}{0.02} = 200$, with cumulative spending measured in years of current world GDP: $1.02^t = 9$, so $t \approx 110$.
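The arithmetic of this footnote can be sketched as:

```python
import math

# Spend 50% of gross world product, growing 2% per year, until the
# cumulative total reaches 200 years of current GWP:
#   0.5 * (1.02**t - 1) / 0.02 = 200  =>  1.02**t = 9
share, growth, target = 0.5, 0.02, 200

t = math.log(1 + target * growth / share) / math.log(1 + growth)
# t comes out to about 111, i.e. roughly the 110-year figure in the text.
```

The answer is insensitive to the exact GDP share: doubling the share only shaves off about 35 years, because the 2% exponential growth dominates the endgame.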
10
Appendix: Ergodic theory
Since we used ergodic theory during the essay, we should quickly explain what it is about. This section is foundational, but the full complexity is not necessary.
Measure-theoretic POV
I know, you know too, nobody really likes measure theory any more than pianists like practicing scales hundreds of times. Still, it is at the right level of abstraction for many theories, including probability.
We omit all mentions of "almost-everywhere", "except on a set of measure zero", and similar annoying phrases. As long as you never make a union of uncountably many subsets, you will not be hurt by this omission.
A probability space is a measurable space with a measure of total mass one, written $\mu$.11

11 Pronounced "mu" – it is a pun because both "mu" and "measure" start with "m".
We consider a single measurable function $T : \Omega \to \Omega$ on the probability space, which we think of as the dynamics: applying $T$ once advances time by one step.

We demand that $T$ is measure-preserving: $\mu(T^{-1}(A)) = \mu(A)$ for every measurable subset $A$.
A subset is measurable iff it is an element of the σ-algebra of the probability space.

A subset $A$ is invariant iff $T^{-1}(A) = A$ up to measure zero.12

12 That is, except on a subset of measure zero: $\mu(T^{-1}(A) \,\triangle\, A) = 0$.

Now, obviously any set of measure zero or one is invariant up to measure zero, so those are the boring invariant sets.

We usually ask $T$ to be ergodic: every invariant set has measure zero or one. In other words, there are no interesting invariant sets.
Ergodic maps have many very good properties. We will use the following one. For the theorem, you can picture the state space as the real space $\mathbb{R}^n$ with a probability measure that assigns positive mass to every open set.13
13 Except pathological examples constructed by logicians who have nothing better to do than to care about the continuum hypothesis, large cardinals, and the arithmetic hierarchy. Those who desire the rigor-mortis of logic, let them have it.
Theorem 1 (Dense orbits) If the state space is a topological space with a countable basis, and any nonempty open set has positive measure, then almost any starting point has a dense orbit: $\{x, Tx, T^2 x, \dots\}$ visits every nonempty open set.

Proof. Let $U_1, U_2, \dots$ be a countable basis of nonempty open sets. For each $U_i$, the set of points whose orbit never enters $U_i$ is invariant up to measure zero, and it is disjoint from $U_i$ itself, so it cannot have measure one; by ergodicity it has measure zero.

Now, the set of points whose orbit misses some $U_i$ is a countable union of measure-zero sets, hence has measure zero. Almost every point therefore has a dense orbit. $\square$
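The theorem can be checked numerically for the simplest ergodic map, an irrational rotation of the circle (the angle, orbit length, and tolerance are arbitrary choices):

```python
import math

# Rotation of the circle [0, 1) by an irrational angle is ergodic, so
# almost every orbit is dense. Check this numerically.
alpha = math.sqrt(2) - 1  # an irrational rotation angle

def orbit(x0, n):
    x = x0
    for _ in range(n):
        yield x
        x = (x + alpha) % 1.0

points = list(orbit(0.1, 1000))

def near(x, g):
    """Distance on the circle [0, 1)."""
    d = abs(x - g)
    return min(d, 1 - d) < 0.01

# Every point of a fine grid is within 0.01 of some orbit point.
assert all(any(near(x, g) for x in points)
           for g in [k / 100 for k in range(100)])
```

By contrast, a rational rotation (say by $1/4$) has a periodic orbit of only four points, which is as far from dense as it gets; irrationality is what forbids invariant sets of intermediate measure.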
Finally, there is a common theme in ergodic theory. There are rigorous versions of it, but instead of going for rigor, the spirit is more important:
Theorem 2 (ergodic decomposition) Any interesting map is a partition/sum/integral of ergodic maps.
For example, the shear map on the unit square, $T(x, y) = (x + y \bmod 1,\; y)$, can be thought of as an integral over rotations: For each fixed height $y$, the map restricts to the horizontal circle at that height, where it is simply rotation by angle $y$. Each such rotation (for irrational $y$) is ergodic, and the shear map is their integral over $y$.
Sequence POV
We must interpret the language of measure theory, which is dead like chalk dust, back into the language of sequence predictions, which is alive like reinforced concrete.
Each point in the state space $\Omega$ is an infinite text: an infinite sequence of tokens.

The shift map on the state space deletes the first token: $T(x_1 x_2 x_3 \cdots) = x_2 x_3 x_4 \cdots$.

The shift map is measure-preserving, meaning that the process is stationary: We could have started reading at any point, and we would still expect to see the same kind of probability distribution. It would not be like "Sorry, the word 'cat' appears with zero probability when you start reading from the second token onwards."

Repeatedly applying the shift map reads the text one token at a time.

A periodic point of the shift map is a periodic text, like "la la la la …".

A probability distribution over infinite texts is a measure on the state space; a speaker of English is exactly such a distribution.

If we can partition the state space into two invariant subsets, each of positive measure, then the speaker is non-ergodic: it is a mixture of two sub-speakers, each stuck forever in its own sub-language.
We wish to consider only texts created by some imaginary “universal English speaker”. In particular, we do not want it to get stuck in one sub-language of English, then never escape from it. That is, we assume the universal speaker is ergodic.
Now imagine that we randomly sample two pieces of text generated by the universal speaker, and we shift the first text around to match it against the second. By Theorem 1, the orbit of the first text is dense in the space of all possible English texts spoken by the universal speaker. We can gamify this situation thus:
- Prover: "I take one piece of text $x$, then another piece $y$."
- Challenger: "I challenge you to find a stretch of text from $x$ that matches this stretch in $y$."
- Prover asks a team of immortal monkeys to do the task. A million years later: "At shift $n$, the text $T^n x$ matches the stretch in $y$."
- Challenger verifies that the two stretches indeed match.
Shannon–McMillan–Breiman
If someone has created an infinite sequence of coin flips with a biased coin, we can estimate the entropy per flip from a single long stretch: just count the frequency of heads. The question is when this single-sample trick is justified.
How do we measure the entropy of an English speaker? It speaks token by token, and we have to measure the average information we obtain per token. The problem is that there are two senses of “average”. It could be the time-average: we listen to the speaker speak for a very long time, and calculate the entropy in the speech. It could be the ensemble-average: we listen to the speaker speak for a very long time, then do it again, then again, etc, then average together the time-averages.
If the speaker is ergodic, then the speaker essentially has just one speech, and any two samples of its speech are just translations of each other.14 Consequently, it is intuitively clear that with probability 1, the time-average of the entropy of one speech equals the ensemble-average of the entropy of all speeches. Intuitively, with probability 1,

$$-\frac{1}{n} \ln p(x_1, \dots, x_n) \to H.$$
14 Statistical mechanists might recognize this as saying that the language is self-averaging. That is, sampling many speeches from the language is essentially the same as sampling one very long speech that we then cut into many sub-speeches.
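The ergodic case can be checked numerically for the simplest ergodic speaker, an i.i.d. source (the distribution is made up):

```python
import math, random

random.seed(2)

# SMB for the simplest ergodic speaker: an i.i.d. source. With
# probability 1, -(1/n) ln p(x_1..x_n) converges to the entropy rate H.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(p * math.log(p) for p in probs.values())

n = 200_000
letters, weights = zip(*probs.items())
sample = random.choices(letters, weights, k=n)

# Time-average of the per-token surprise along one long speech.
time_average = sum(-math.log(probs[x]) for x in sample) / n
```

One long sample suffices: no averaging over independent speeches is needed, which is exactly the content of the theorem.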
For non-ergodic speakers, we simply decompose the speaker into an ensemble of ergodic speakers, then apply the SMB theorem to each one. It is like the strong law of large numbers. Intuitively, with probability 1,

$$-\frac{1}{n} \ln p(x_1, \dots, x_n) \to H(\text{the ergodic component the sample came from}).$$
This is the Shannon–McMillan–Breiman theorem.
In textbooks and Wikipedia, the SMB theorem is stated rigorously, but you have already understood the idea of SMB, and the rigorous versions are simply paraphrases of the idea.