Some other discussions on formal measures of intelligence, also building on Legg & Hutter's work:
One issue, which a lot of papers bring up, is that the way we define the distribution of environments in which to test the agent has a major impact on its measured intelligence. While sufficiently intelligent agents will perform well in any environment or distribution of environments, that's probably not much help, since even humans don't seem to be at that level of intelligence: we've evolved to deal with the kinds of environments that have historically existed on our planet, not with arbitrary ones. Legg & Veness discuss this issue:
When looking at converting the Universal Intelligence Measure into a concrete test of intelligence, a major issue is the choice of a suitable reference machine. Unfortunately, there is no such thing as a canonical universal Turing machine, and the choice that we make can have a significant impact on the test results. Very powerful agents such as AIXI will achieve high universal intelligence no matter what reference machine we choose, assuming we allow agents to train from samples prior to taking the test, as suggested in Legg and Hutter (2007). For more limited agents however, the choice of reference machine is important. Indeed, in the worst case it can cause serious problems (Hibbard, 2009). When used with typical modern reinforcement learning algorithms and a fairly natural reference machine, we expect the performance of the test to lie between these two extremes. That is, we expect that the reference machine will be important, but perhaps not so important that we will be unable to construct a useful test of machine intelligence. Providing some empirical insight into this is one of the main aims of this paper.
Before choosing a reference machine, it is worth considering, in broad terms, the effect that different reference machines will have on the intelligence measure. For example, if the reference machine is like the Lisp programming language, environments that can be compactly described using lists will be more probable. This would more heavily weight these environments in the measure, and thus if we were trying to increase the universal intelligence of an agent with respect to this particular reference machine, we would progress most rapidly if we focused our effort on our agent’s ability to deal with this class of environments. On the other hand, with a more Prolog like reference machine, environments with a logical rule structure would be more important. More generally, with a simple reference machine, learning to deal with small mathematical, abstract and logical problems would be emphasised as these environments would be the ones computed by small programs. These tests would be more like the sequence prediction and logical puzzle problems that appear in some IQ tests.
What about very complex reference machines? This would permit all kinds of strange machines, potentially causing the most likely environments to have bizarre structures. As we would like our agents to be effective in dealing with problems in the real world, if we do use a complex reference machine, it seems the best choice would be to use a machine that closely resembles the structure of the real world. Thus, the Universal Intelligence Measure would become a simulated version of reality, where the probability of encountering any given challenge would reflect its real world likelihood. Between these extremes, a moderately complex reference machine might include three dimensional space and elementary physics.
Which is why a lot of the above papers focus on trying to define useful environmental distributions.
Thx for the references! I was familiar with Hernandez-Orallo et al. (2011) and Goertzel (2010).
However, it seems that none of these papers tackles the Duality problem.
Regarding environmental distributions, I think this problem is solved rather elegantly in my approach by the quasi-Solomonoff distribution, which singles out environments compatible with the tentative model D. Essentially it is the Solomonoff prior updated by a period of observations during which D-behavior was seen.
Regarding the choice of a reference machine, its role asymptotically vanishes in the tail of the Solomonoff distribution. The quasi-Solomonoff distribution samples the tail of the Solomonoff distribution by design, the more so the more complex D is.
In applications it seems to be a good idea to use a D that is as complex as possible (i.e. put in as much information about the universe as possible) while using a reference machine that is as simple as possible. In fact, I would use lambda calculus rather than a Turing machine. This is because the simpler the reference machine, the closer the relation between the Solomonoff distribution and Occam's razor. If we assume that our intuitive grasp of simplicity is approximately correct, then using a complex reference machine doesn't make sense.
In this article I review a few previously proposed mathematical metrics of general intelligence and propose my own approach. This approach has a peculiar relation to decision theory, different decision theories corresponding to different notions of "maximally intelligent agent". I propose a formalization of UDT in this context.
Background
The first (to my knowledge) general intelligence metric was proposed by Legg and Hutter in "Universal Intelligence". They consider an agent A interacting with an environment V: A exerts a sequence of actions ai on V and in turn receives from V the observations oi. In their original proposal, the observations include a special utility channel ui, so that the utility of A is defined by

$$U(A, V) = \mathbb{E}\left[\sum_i u_i\right]$$
This is essentially reinforcement learning. However, general utility functions of the form U = U({oi}) can be just as easily accommodated by the method.
The Legg-Hutter intelligence metric is given by

$$I(A) = \mathbb{E}_V\left[U(A, V)\right]$$

where the expectation value is taken with respect to a Solomonoff distribution for V.
The maximally intelligent agent w.r.t. this metric is the famous AIXI.
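As an illustration, the metric can be approximated by Monte Carlo: sample environment programs with weight 2^-|V| and average the agent's utility. The sketch below substitutes a made-up toy environment class (bit strings read as reward schedules) for a real universal reference machine, so it only illustrates the shape of the computation, not the actual measure:

```python
import random

random.seed(0)

def solomonoff_weight(program):
    # Prior weight 2^-|p| for a program encoded as a bit list.
    return 2.0 ** (-len(program))

def run_episode(agent, env_program, steps=10):
    # Toy interaction loop: the "environment" is just a bit string interpreted
    # as a reward schedule, a stand-in for a real universal reference machine.
    total_utility = 0.0
    observation = 0
    for i in range(steps):
        action = agent(observation)
        bit = env_program[i % len(env_program)]
        # Reward 1 when the action matches the environment's current bit.
        total_utility += 1.0 if action == bit else 0.0
        observation = bit
    return total_utility

def legg_hutter_estimate(agent, num_envs=200, max_len=8):
    # Monte Carlo estimate of the 2^-|V|-weighted average utility over
    # sampled toy environments (normalized by the sampled weights).
    total, weight_sum = 0.0, 0.0
    for _ in range(num_envs):
        length = random.randint(1, max_len)
        program = [random.randint(0, 1) for _ in range(length)]
        w = solomonoff_weight(program)
        total += w * run_episode(agent, program)
        weight_sum += w
    return total / weight_sum

copy_agent = lambda obs: obs      # repeats the last observation
zero_agent = lambda obs: 0        # always outputs 0
```

Note that sampling environments uniformly and multiplying by the weight, as above, is only one crude way to approximate the length-weighted sum; the point is that shorter environment programs dominate the measure.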
This definition suffers from a number of problems:

- The efficiency problem: no constraint is placed on the computing resources used by A, and I(A) itself is uncomputable.
- The duality problem: A is unphysically separated from V, so the environment can neither modify nor destroy the agent.
- The ontology problem1: the utility function is defined in terms of the agent's observations rather than the actual state of the universe.
Side note: it might be that the "correct" definition of I(A) is efficiently computable, but it is not feasible to compute an A with a human-like level of I(A) (that is, it is not possible to compute it using a short program and scarce computing resources). In this case the success of the natural evolution of intelligence would have to be attributed to the Anthropic Principle. It would also mean we cannot construct an AI without using the human brain as a "template" in some sense. Moreover, there would be a physical bound on constructing intelligent agents, which might rule out the "AI foom" scenario.
Orseau and Ring2 proposed solutions for the efficiency and duality problems.
Solving the efficiency problem is straightforward. Consider a computing machine M (e.g. a universal Turing machine) which produces an action a every fixed cycle (for example, a is read off a register in M) and is able to receive an observation o every fixed cycle (for example, o is written to a register in M). We can then evaluate I(A(Q)), where A(Q) is the abstract (Legg-Hutter) agent computed by M primed with the program Q. We can then ask for Q*(n), the maximally intelligent program of length at most n. In general, Q*(n) seems to be uncomputable.
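A brute-force sketch of the search for Q*(n), using a deliberately trivial machine of my own invention (programs are bit strings interpreted as lookup tables from the last observation to an action) so that the enumeration terminates; with a genuine universal machine the halting problem already blocks this kind of search:

```python
from itertools import product

def evaluate(table, environments):
    # Average utility of a lookup-table agent over a fixed set of toy
    # bit-string environments; reward 1 per matched bit.
    total = 0.0
    for env in environments:
        obs = 0
        for bit in env:
            total += 1.0 if table[obs] == bit else 0.0
            obs = bit
    return total / len(environments)

def best_program(n, environments):
    # Q*(n) by exhaustive search over all bit-programs of length at most n.
    # Exponential in n even on this toy machine.
    best, best_score = None, float("-inf")
    for length in range(1, n + 1):
        for candidate in product([0, 1], repeat=length):
            # Interpret the candidate bits as a lookup table, cycling if short.
            table = {b: candidate[b % length] for b in (0, 1)}
            score = evaluate(table, environments)
            if score > best_score:
                best, best_score = table, score
    return best, best_score

envs = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
q_star, q_star_score = best_program(2, envs)
```

On these three environments the search settles on the "copy the last observation" table, which is the intuitively right Q*(2) for this toy setting.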
The duality problem is trickier. Orseau and Ring suggest considering a natural number Q which primes the stochastic process ρ.
Given a utility function U, we can define the intelligence of Q to be the expectation value $I(Q) = \mathbb{E}[U]$ with respect to this process.
For example, we can imagine Q to be the state of a computer controlling a robot, or the state of a set of cells within a cellular automaton. This removes the unphysical separation of A and V. However, this framework seems to me too general if we don't impose any constraints on ρ.
Intelligence in a Quasi-Solomonoff Universe
Consider a universe Y described by a sequence of states {υi}. In order for the Solomonoff distribution to be applicable, we need the state υi to be represented as a natural number. For example Y can consist of a computing machine M interacting with an environment V through actions a and observations o. Or, Y can be a cellular automaton with all but a finite number of cells set to a stationary "vacuum" state.
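The cellular-automaton option can be made concrete by encoding the finitely many non-vacuum cells as the binary digits of a natural number. The sketch below uses elementary Rule 110 as an arbitrary example dynamics (my choice for illustration, not something the setup requires):

```python
def rule110_step(state):
    # One step of elementary cellular automaton Rule 110 on an integer-encoded
    # row: bit i of `state` is cell i, and all remaining cells are vacuum (0).
    # Rule 110 maps the all-vacuum neighborhood 000 to 0, so the support of
    # the state stays finite and each state remains a natural number.
    next_state = 0
    for i in range(state.bit_length() + 1):
        left = (state >> (i + 1)) & 1
        center = (state >> i) & 1
        right = (state >> (i - 1)) & 1 if i > 0 else 0
        neighborhood = (left << 2) | (center << 1) | right
        # The rule number 110 is itself the truth table of the update.
        next_state |= ((110 >> neighborhood) & 1) << i
    return next_state

# Each universe state v_i is literally a natural number here.
states = [1]
for _ in range(3):
    states.append(rule110_step(states[-1]))
```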
Consider further a tentative model D of the dynamics of Y. There are several types of model which might be worth considering:
The quasi-Solomonoff probability distribution for {υi} associated with the model D and learning time t is the Solomonoff prior conditioned on the dynamics D holding during the first t time steps.
Here t is a parameter representing the amount of evidence in favor of D.
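A minimal sketch of the intended effect, using a small made-up hypothesis class in place of actual programs: each hypothesis carries a prior weight and a per-step probability of producing D-compatible behavior, and growing t concentrates the posterior on hypotheses that implement D, even when their prior weight is smaller:

```python
def quasi_solomonoff(hypotheses, t):
    # Posterior over a finite hypothesis class after t steps of D-compatible
    # behavior: prior weight times the likelihood of agreeing with D each step.
    weights = {
        name: prior * agreement ** t
        for name, prior, agreement in hypotheses
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# (name, prior standing in for 2^-|R|, per-step probability of D-compatible
# behavior); all numbers invented for the illustration.
hyps = [
    ("exactly-D", 0.125, 1.0),   # complex program implementing D precisely
    ("mostly-D", 0.25, 0.9),     # agrees with D 90% of the time
    ("random", 0.5, 0.5),        # simple but uncorrelated with D
]
posterior_short = quasi_solomonoff(hyps, t=1)
posterior_long = quasi_solomonoff(hyps, t=50)
```

With t = 1 the simple-but-wrong hypotheses still dominate; by t = 50 almost all the mass sits on the hypothesis that reproduces D exactly, which is the sense in which t measures the evidence in favor of D.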
The agent A is now defined by a property q(υt) of υt, for example some part of the state of M. Given a utility function U({υi}) we can consider

$$I_{EDT}(Q) = \mathbb{E}\left[U \mid q(\upsilon_t) = Q\right]$$
where the expectation value is taken with respect to the quasi-Solomonoff distribution. This can be interpreted as an intelligence metric.
We now take the meta-level point of view where Q is regarded as a decision, which allows making the link to decision theory. For example, we can think of an AI programmer making the decision of what code to enter into her robot, or natural evolution making a "decision" regarding the DNA of H. sapiens. According to this point of view, IEDT(Q) is the quality of the decision Q according to Evidential Decision Theory. It is possible to define the Causal Decision Theory analogue ICDT(Q) by imagining a "divine intervention" which changes q(υt) into Q regardless of the previous history. We have

$$I_{CDT}(Q) = \mathbb{E}_Q[U]$$

where $\mathbb{E}_Q$ is the expectation value with respect to the unconditional quasi-Solomonoff distribution modified by the "divine intervention" above.
These intelligence metrics are prone to the standard criticism of the respective decision theories. IEDT(Q) factors in the indirect effect Q has on U by giving more weight to universes in which Q is likely. ICDT(Q) favors agents that fail on certain Newcomb-like problems. Specifically, suppose A discovers that the universe contains another agent Ω which created two boxes B1 and B2 and placed utility in each before t. Then a CDT-intelligent A would prefer taking B1 + B2 over taking B1 only even if it knows Ω predicted A's decision and adjusted the utility in the boxes in such a way that taking B1 only would be preferable.
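The payoff comparison in this Newcomb-like example can be made explicit. The prize values below (1000 and 1) are my own illustrative numbers, not taken from the text:

```python
def newcomb_payoff(choice, predicted_choice):
    # Omega fills B1 with the large prize only if it predicted one-boxing;
    # B2 always holds the small prize.
    b1 = 1000 if predicted_choice == "one-box" else 0
    b2 = 1
    return b1 + b2 if choice == "two-box" else b1

# CDT holds the prediction fixed, so two-boxing always looks strictly better:
cdt_comparison = (newcomb_payoff("two-box", "one-box"),   # 1001
                  newcomb_payoff("one-box", "one-box"))   # 1000

# But against a perfect predictor (prediction == actual choice), the
# CDT-recommended policy ends up with the small prize only:
one_box_utility = newcomb_payoff("one-box", "one-box")
two_box_utility = newcomb_payoff("two-box", "two-box")
```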
I suggest solving the problem by applying a certain formalization of UDT.
The Solomonoff distribution works by randomly producing a program R which computes3 the transition probabilities of the universe. When we apply the Solomonoff distribution to Y, we can imagine the event space as consisting of pairs (R, {ηi,υ}). Here the ηi,υ are uniformly distributed numbers used to determine the occurrence of random events. That is, (R, {ηi,υ}) determines {υi} by the recursive rule
Here PR stands for the conditional transition probability computed by R.
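The recursive rule amounts to inverse-CDF sampling: each uniform number picks the next state from the transition distribution computed by R. A minimal sketch, collapsing the {ηi,υ} to a single uniform number per step and using a hand-written toy transition function in place of a program R:

```python
def sample_trajectory(transition_probs, etas, length):
    # Unroll a universe state sequence: transition_probs(history) returns a
    # distribution over next states (standing in for P_R), and each uniform
    # eta picks the next state by inverting the cumulative distribution.
    history = []
    for i in range(length):
        dist = transition_probs(tuple(history))
        cumulative = 0.0
        for state in sorted(dist):
            cumulative += dist[state]
            if etas[i] < cumulative:
                history.append(state)
                break
    return history

def sticky(history):
    # Toy dynamics: stay in the current state with probability 0.8.
    last = history[-1] if history else 0
    return {last: 0.8, 1 - last: 0.2}

etas = [0.95, 0.1, 0.5, 0.85, 0.3]
trajectory = sample_trajectory(sticky, etas, 5)
```

Fixing the pair (program, noise sequence) fixes the whole history {υi}, which is what lets the event space be described by such pairs.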
We define UT as the program receiving (R, {ηi,υ}) as input (the input is infinite, but this doesn't pose a problem) and producing a binary fraction as output, such that the expected squared error $\mathbb{E}[(U - U_T)^2]$ is minimal among all programs with running time at most T.
Further, we define U*T as the program receiving (R, {ηi,υ}, Q) as input and producing a binary fraction as output, such that $\mathbb{E}[(U - U^*_T(R, \{\eta_{i,\upsilon}\}, q(\upsilon_t)))^2]$ is minimal among all programs with running time at most T.
Thus UT and U*T are least-squares optimal predictors for U under the constraint of running time T where U*T is provided the additional information q(υt).
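The least-squares characterization can be checked on a tiny finite sample (all numbers invented for the illustration): among predictors ignoring the extra input, the optimum is the overall mean of U, while a predictor that sees q(υt) can use conditional means and never does worse:

```python
# (q, realized utility U) pairs; made-up data.
samples = [(0, 0.2), (0, 0.4), (1, 0.8), (1, 1.0)]

# Best predictor that ignores q: the overall mean of U.
mean_all = sum(u for _, u in samples) / len(samples)
mse_blind = sum((u - mean_all) ** 2 for _, u in samples) / len(samples)

# Best predictor that sees q: the conditional mean of U given q.
def cond_mean(q):
    us = [u for q2, u in samples if q2 == q]
    return sum(us) / len(us)

mse_informed = sum((u - cond_mean(q)) ** 2 for q, u in samples) / len(samples)
```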
Consider the quantity

$$u(Q, T) = \mathbb{E}\left[U^*_T - U_T \mid q(\upsilon_t) = Q\right]$$

u(Q,T) is the gain in estimated utility obtained by adding the information q(υt) = Q, provided that the underlying physics (R, {ηi,υ}) of Y (which allows predicting q(υt) = Q) is already known. For T >> 0, u(Q,T) approaches 0, since the additional piece of information that q(υt) = Q can be easily deduced within the given computing resources. For T approaching 0, u(Q,T) approaches 0 as well, since there's not enough time to process the actual input anyway. Hence it is interesting to consider the quantity

$$u^*(Q) = \max_T u(Q, T)$$
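The claimed shape of u(Q,T), vanishing at both extremes with an interior maximum T*, can be illustrated with made-up accuracy curves, where the predictor lacking Q must first spend a fixed part of its budget deriving Q from the physics:

```python
import math

def quality_with_Q(T):
    # Accuracy of U*_T, which is handed Q directly (illustrative curve).
    return 1.0 - math.exp(-T)

def quality_without_Q(T):
    # Accuracy of U_T, which must first spend part of its budget deriving Q
    # from (R, {eta}) (illustrative: a fixed derivation cost of 3 units).
    return 1.0 - math.exp(-max(T - 3.0, 0.0))

def gain(T):
    # Stand-in for u(Q, T): the value of being told Q under compute budget T.
    return quality_with_Q(T) - quality_without_Q(T)

gains = {T: gain(T) for T in [0.0, 1.0, 3.0, 6.0, 12.0]}
# gain vanishes at T = 0, peaks at an intermediate budget, and vanishes
# again as T grows.
```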
It is tempting to interpret u* as an intelligence metric. However, it doesn't seem sensible to compare u*(Q) for different values of Q. Denote by T* the value of T for which the maximum above is achieved, i.e. u*(Q) = u(Q, T*). Instead of defining an intelligence metric per se, we can define Q to be a maximally UDT-intelligent agent if we have

$$\forall Q': \mathbb{E}\left[U^*_{T^*}(R, \{\eta_{i,\upsilon}\}, Q)\right] \geq \mathbb{E}\left[U^*_{T^*}(R, \{\eta_{i,\upsilon}\}, Q')\right]$$
This represents the (non-vacuous) self-referential principle that Q is the best agent given the information that Y is the sort of universe that gives rise to Q. The role of T* is to single out the extent of logical uncertainty for which Q cannot be predicted but given Q the consequences for U can be predicted.
Directions for Further Research
Footnotes
1The name "ontology problem" is courtesy of pengvado.
2I was exposed to this work by reading Anja
3It computes them in the weak sense of producing a convergent sequence of lower bounds