Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I'm also supported by the LTFF. See also LinkedIn.
E-mail: {first name}@alter.org.il
You now understand correctly. The reason I switch to colored operads is to add even more generality. My key use case is when the operad consists of terms-with-holes in a programming language, in which case the colors are the types of the terms/holes.
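To make the terms-with-holes picture concrete, here is a minimal Python sketch; the encoding and names are mine, purely for illustration. An operation is a term whose holes each carry a type, the colors are exactly those types, and composition grafts a term into a hole only when the hole's color matches the term's output color.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Term:
    """A term-with-holes: `holes` lists the types (colors) of its holes in order,
    `out_type` is the type (color) of the term itself."""
    name: str
    holes: List[str]
    out_type: str

def graft(outer: Term, position: int, inner: Term) -> Term:
    """Operadic composition: plug `inner` into hole `position` of `outer`,
    allowed only when the hole's color matches `inner`'s output color."""
    if outer.holes[position] != inner.out_type:
        raise TypeError(f"hole expects {outer.holes[position]}, got {inner.out_type}")
    new_holes = outer.holes[:position] + inner.holes + outer.holes[position + 1:]
    return Term(f"{outer.name}[{position} := {inner.name}]", new_holes, outer.out_type)

# `if _ then _ else _` has holes of colors [Bool, Int, Int] and output color Int;
# plugging `_ < _` (output Bool) into the first hole yields a term with four Int holes.
ite = Term("if _ then _ else _", ["Bool", "Int", "Int"], "Int")
lt = Term("_ < _", ["Int", "Int"], "Bool")
print(graft(ite, 0, lt))
```

Note that each operation's arity (its number of holes) is fixed by the term itself.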
The following are my thoughts on the definition of learning in infra-Bayesian physicalism (IBP), which is also a candidate for the ultimate prescriptive agent desideratum.
In general, learning of hypotheses about the physical universe is not possible because of traps. On the other hand, learning of hypotheses about computable mathematics is possible in the limit of ample computing resources, as long as we can ignore side effects of computations. Moreover, learning computable mathematics implies approximating Bayesian planning w.r.t. the prior about the physical universe. Hence, we focus on this sort of learning.
We consider an agent composed of three modules, which we call the Simulator, the Learner and the Controller. The agent's history consists of two phases. In the Training phase, the Learner interacts with the Simulator and, at the end, produces a program for the Controller. In the Deployment phase, the Controller runs the program.
Roughly speaking:
We will refer to this as the SiLC architecture.
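As a rough, purely illustrative sketch of that control flow in Python (the interfaces and names below are placeholders I chose, not part of the formal definitions that follow):

```python
from typing import Callable, Protocol

class Simulator(Protocol):
    def run(self, query: str) -> str:
        """Execute a requested computation and return its result."""
        ...

class Learner(Protocol):
    def step(self, sim: "Simulator") -> None:
        """One computational timestep of interaction with the Simulator."""
        ...
    def output_program(self) -> Callable[[], None]:
        """Produce the program that will be handed to the Controller."""
        ...

def silc_lifecycle(learner: Learner, sim: Simulator, tau: int) -> None:
    # Training phase: the Learner interacts with the Simulator for tau timesteps.
    for _ in range(tau):
        learner.step(sim)
    # At the end of Training, the Learner emits a program for the Controller.
    program = learner.output_program()
    # Deployment phase: the Controller simply runs that program.
    program()
```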
Let be our hypothesis class about computable mathematics. Let be our prior about the physical universe[1]. These have to satisfy the coherence conditions
Here, means that .
Together, these ensure that is a coherent IBP hypothesis. Notice that for any satisfying the first condition[2], there is a unique minimal coherent s.t. . Moreover, given a coherent and any , there is a unique minimal coherent s.t. .
The duration of the Training phase will be denoted by [3]. We can think of it as "computational time".
Let the source codes of the Learner (obtained by quining), the Simulator and the Controller be denoted, respectively, by
Here, the argument of corresponds to and is a probability distribution in which all probabilities are rational numbers[4].
We assume that the simulator can indeed run any computation, and that any given halting computation would run fast for . These are assumptions on (or, on some combination of (i) , (ii) the definition of , and (iii) the support of all ) that we will not spell out here.
We will say that a policy is a mapping of type and a metapolicy is a mapping of type .
Given any , we can compose it with and in the obvious way[5] to yield
In particular, we can take for some metapolicy by postulating no dependence on the argument.
Denote by the set of all policies. Given metapolicy and , we define by
Given any , we say that is a -consistent counterpossible when the following conditions hold:
We denote by the set of -consistent counterpossibles.
A (deterministic) copolicy is a mapping of signature . We denote the set of copolicies by . Given a policy and a copolicy , we define in the obvious way. Given policies , we define their total variation distance[6] to be
Given , , and metapolicy , we will use the notation
Intuitively, should be thought of as the counterfactual expectation of loss function assuming metapolicy , while adding a "bonus" to account for "fair" treatment of randomization by the agent. More on that below.
Given a metapolicy and , we define by
Intuitively, is the set of universe states for which at least one copy of the agent exists which followed the metapolicy until computational time .
Given a loss function [7] (which we allow to explicitly depend on computational time for greater generality), the learning condition on a metapolicy and hypothesis is
Here, is the "regret bound" function which should vanish in the limit.
Some remarks on the particulars of this definition:
This framework assumes all our hypotheses are disintegrable w.r.t. the factorization into and . It is an interesting question to understand whether we should or can relax this assumption.
For example, we can imagine to be a Solomonoff-like prior along the following lines. Every hypothesis comprising is defined by a Turing machine with access to two oracles representing and two tapes of random and "ambiguous" bits respectively. is defined by running with one oracle fielding queries about (i.e., given a program we can request to know its counterpossible output ) and the other oracle fielding queries about some s.t. we want to decide whether for . is only allowed to return NO if there was at least one query to which the two oracles gave different answers.
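To illustrate just the "only allowed to return NO after observing a disagreement" rule, here is a schematic Python sketch; the interface is hypothetical, and the tapes of random and "ambiguous" bits are omitted:

```python
from typing import Callable, Tuple

Oracle = Callable[[str], str]  # program source -> its (counterpossible) output

def run_hypothesis(machine: Callable[[Callable[[str], Tuple[str, str]]], bool],
                   oracle_math: Oracle,
                   oracle_candidate: Oracle) -> bool:
    """Run a hypothesis machine under the consistency rule: it may answer
    NO (False) only if some query was answered differently by the two oracles."""
    saw_disagreement = False

    def query(program: str) -> Tuple[str, str]:
        nonlocal saw_disagreement
        a, b = oracle_math(program), oracle_candidate(program)
        if a != b:
            saw_disagreement = True
        return a, b

    verdict = machine(query)
    if verdict is False and not saw_disagreement:
        # Excluding the candidate without ever observing a disagreement
        # is not allowed, so the NO verdict is overridden.
        return True
    return verdict
```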
We use the "duration" interpretation for simplicity, but more generally can be some parameter controlling the computing resources available in the Training phase, and we can also allow the computing resources of the Controller to scale with .
The reason we restrict to rational numbers is that we need a notion of computing the distribution. It is in principle possible to generalize further to computable numbers. On the other hand, it might be more realistic to constrain even further, e.g. to dyadic rationals, which can be implemented via fair coinflips (see the sketch after these footnotes). We stick to rational probabilities for simplicity.
We let the Learner interact with the Simulator for timesteps, producing some output , and then run the Controller with as an input.
This is not technically a distance, since two distinct policies can be at distance zero so long as they only disagree on histories that are inconsistent with these policies. Such policies are morally equivalent.
We could also allow the loss function to have a policy argument, but then we would have to remove the factor from the learning condition, because the choice of policy would matter intrinsically even if the agent doesn't exist. Alternatively, we could modify the definition of to avoid that. Or perhaps use some normalization factor more complicated than .
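Returning to the footnote above about probabilities: here is a small sketch (names mine) of the fact that a distribution whose probabilities are dyadic rationals with common denominator 2^k can be sampled exactly with k fair coin flips:

```python
import random
from fractions import Fraction

def sample_dyadic(dist: dict, k: int) -> str:
    """Sample from a distribution whose probabilities are dyadic rationals
    with denominator 2**k, using exactly k fair coin flips."""
    assert sum(dist.values()) == 1
    # k fair coin flips pick an integer uniformly from {0, ..., 2**k - 1}.
    r = 0
    for _ in range(k):
        r = 2 * r + random.randint(0, 1)  # one fair coin flip
    # Give each outcome a block of p * 2**k integers and see where r falls.
    threshold = 0
    for outcome, p in dist.items():
        threshold += int(p * 2**k)
        if r < threshold:
            return outcome
    raise AssertionError("probabilities must sum to 1")

# Example: P(a) = 3/8, P(b) = 1/2, P(c) = 1/8, realized with 3 coin flips.
print(sample_dyadic({"a": Fraction(3, 8), "b": Fraction(1, 2), "c": Fraction(1, 8)}, k=3))
```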
No? The elements of an operad have fixed arity. When defining a free operad you need to specify the arity of every generator.
I don't think that undecidability of exact comparison (as opposed to comparison within any given margin of error) is necessarily a bug. However, if you really want comparison for periodic sequences, you can insist that the utility function is defined by a finite state machine. This is in any case already a requirement in the bounded compute version.
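To illustrate why a finite-state-machine utility makes exact comparison feasible on periodic sequences, here is a sketch; the interface, and the use of a rational geometric discount, are choices I made for the example rather than anything prescribed in the thread. With exact rational arithmetic the value of a periodic observation stream is a closed-form Fraction, so comparing two such values is decidable:

```python
from fractions import Fraction

def fsm_utility_periodic(transitions, rewards, start, cycle, gamma):
    """Exact discounted utility of the periodic stream cycle, cycle, ... where
    `transitions[state][obs]` is the next FSM state, `rewards[state][obs]` a
    rational per-step reward, and `gamma` a rational discount in (0, 1)."""
    block_discount = gamma ** len(cycle)  # discount accrued over one full cycle

    def read_cycle(state):
        """Discounted reward over one pass through the cycle, plus the end state."""
        total, d = Fraction(0), Fraction(1)
        for obs in cycle:
            total += d * rewards[state][obs]
            d *= gamma
            state = transitions[state][obs]
        return total, state

    # Follow the per-cycle entry states until one repeats (finitely many states).
    entry_states, seen, state = [], {}, start
    while state not in seen:
        seen[state] = len(entry_states)
        entry_states.append(state)
        _, state = read_cycle(state)
    p = seen[state]                   # pre-period (in cycles)
    q = len(entry_states) - p         # period (in cycles)

    block_rewards = [read_cycle(s)[0] for s in entry_states]
    u = sum(block_discount**i * block_rewards[i] for i in range(p))
    tail = sum(block_discount**j * block_rewards[p + j] for j in range(q))
    return u + block_discount**p * tail / (1 - block_discount**q)

# Exact comparison of two periodic observation streams under the same FSM utility.
T = {0: {"x": 1, "y": 0}, 1: {"x": 0, "y": 1}}
R = {0: {"x": Fraction(1), "y": Fraction(0)}, 1: {"x": Fraction(0), "y": Fraction(2)}}
u1 = fsm_utility_periodic(T, R, 0, ["x", "y"], Fraction(1, 2))
u2 = fsm_utility_periodic(T, R, 0, ["y", "x"], Fraction(1, 2))
print(u1, u2, u1 > u2)
```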
So far, interest in the programme has been modest. I would appreciate hearing from people who either (i) deliberated over whether to apply and decided against it, or (ii) feel that they might meet the requirements but are not interested. Specifically, what held you back, and what changes (if any) would persuade you to apply?
First, measuring performance is uncomputable, because it involves the Solomonoff prior. You can approximate it if you know some bits of Chaitin's constant, but that brings a penalty into the description complexity.
Second, I think that requiring comparison to be computable means that the utility is only allowed to depend on a finite number of time steps; this rules out even geometric time discount. For such utility functions, the optimal policy has finite description complexity, so g is upper bounded. I doubt that's useful.
I added some examples to the end of this post; thank you for the suggestion.
Not sure these are the best textbooks, but you can try:
Another excellent catch, kudos. I've really been sloppy with this shortform. I corrected it to say that we can approximate the system arbitrarily well by VNM decision-makers. Although, I think it's also possible to argue that a system that selects a non-exposed point is not quite maximally influential, because it's selecting something that's very close to delegating some decision power to chance.
Also, maybe this cannot happen when is the inverse limit of finite sets? (As is the case in sequential decision making with finite action/observation spaces). I'm not sure.
Thanks for this!
What I was saying up there is not a justification of Hurwicz' decision rule. Rather, it is that if you already accept the Hurwicz rule, it can be reduced to maximin, and for a simplicity prior the reduction is "cheap" (produces another simplicity prior).
Why accept the Hurwicz decision rule? Well, at least you can't be accused of a pessimism bias there. But if you truly want to dig deeper, we can start instead from an agent making decisions according to an ambidistribution, which is a fairly general (assumption-light) way of making decisions. I believe that a similar argument (easiest to see in the LF-dual cramble set representation) would allow reducing that to maximin on infradistributions (credal sets).
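For readers who want the Hurwicz rule itself made concrete: it scores an action by mixing its best case and worst case over the hypothesis set with an "optimism" weight, so setting the weight to zero recovers pure maximin. A toy numeric sketch (numbers and names invented for illustration):

```python
def hurwicz_score(outcomes, optimism):
    """Hurwicz criterion: optimism * best case + (1 - optimism) * worst case.
    optimism = 0 is pure maximin, optimism = 1 is pure maximax."""
    return optimism * max(outcomes) + (1 - optimism) * min(outcomes)

# Expected utility of each action under each hypothesis in a credal set.
actions = {
    "cautious": [0.4, 0.5, 0.45],
    "bold":     [0.0, 1.0, 0.2],
}
for lam in (0.0, 0.5, 1.0):
    best = max(actions, key=lambda a: hurwicz_score(actions[a], lam))
    print(f"optimism={lam}: choose {best}")
```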
To make such an approach even more satisfactory, it would be good to add a justification for a simplicity ambi/infra-prior. I think this should be possible by arguing from "opinionated agents": the ordinary Solomonoff prior is the unique semicomputable one that dominates all semicomputable measures, which decision-theoretically corresponds to something like "having preferences about as many possible worlds as we can". Possibly, the latter principle can be formalized into something which ends up picking out an infra-Solomonoff prior (and, replacing "computability" by a stronger condition, some other kind of simplicity infra-prior).