What can we say about rational behaviour among realizable algorithms, i.e. Turing machines? What limits can we put on their performance? What exactly does Cox's theorem have to say?

Cox's theorem tells us that optimally rational beliefs imply Bayesian reasoning. But such results are independent of specific models of computation. Can we rigorously apply machinery like Cox's theorem if we state the problem in terms of optimal behaviour among general recursive algorithms -- i.e. realizable algorithms.

To make that a little more concrete: take the problem of writing an algorithm (designing a Turing machine) that processes sensor inputs and writes actions as output. As a concrete Turing machine, the algorithm consists of discrete steps at which it either writes an output symbol (corresponding to some action) or does not (which we can interpret as some null action). There is a second Turing machine corresponding to the external world that responds to each action by changing its state, which feeds back into the sensor inputs of our algorithm. We seek the algorithm maximizing the expectation of some pre-specified utility function U, defined over states of the world. To define this rigorously we need to nail down how utility is summed over time, perhaps something along the lines of Legg and Hutter's formalism would be appropriate. Or perhaps we can just terminate the process after some pre-specified number of steps.

So it seems to me that Cox's theorem establishes that there must be a Bayes-optimal decision at any point in time, and a "Bayes oracle" that always outputs this decisions would be optimal among all algorithms. But such an oracle is certainly not directly realizable as a Turing machine, since at each time step a Turing machine simply changes state, moves the tape, and writes a symbol, whereas a full Bayesian update could require arbitrary amounts of computation. And there is no option of not acting, only of outputting the null action.

One approach would be to make a Bayes-optimal decision about how much computation to spend on each update that takes into account opportunity costs. But while this seems intuitively reasonable, there is also an opportunity cost to performing this computation, so the question really is how sure we can be that this is the best design choice.

Nevertheless, it seems quite reasonable to write Bayesian algorithms, and indeed to expect them to perform optimally on average. But can we formalize and prove a result along these lines? Does someone know of existing work in this direction? Perhaps Cox's theorem or something similar applies in some direct way that I haven't perceived?

 

 

New Comment
8 comments, sorted by Click to highlight new comments since:

But such an oracle is certainly not directly realizable as a Turing machine, since at each time step a Turing machine simply changes state, moves the tape, and writes a symbol, whereas a full Bayesian update could require arbitrary amounts of computation.

I don't understand this post at all. What is bad about requiring an arbitrary amount of computation? As long as it's finite, it's computable. I don't see how oracles enter into it. (If your TM was implementing AIXI, then I can see why an oracle for the halting problem would be useful.)

Yeah, it is the universal prior which is grossly uncomputable, not Bayesian updating in general.

"Bayes oracle" was really just a term I invented right then, and it was a bad choice. I just meant that there is an optimal output symbol at each step, but that we can't necessarily write down a turing machine that actually generates it. This is not about halting-style uncomputability, rather it's about having to implement an algorithm as a tangible computation, and hence not being able to pause time while we make decisions.

If this sounds confused it's because I am!

I think alexflint might be saying that Turing machines are "too slow" - that between steps n and n+1 you can perform one operation, which is insufficient to fully update from the state of the world at n to the state of the world at n+1. I don't think this is really a meaningful objection, though... you could, say, have a Turing machine that outputs the best action for right now, or given some data, or whatever. Alternately, you could specify some granularity at which your decisions must be optimal and have the Turing machine run really fast to complete an update within that allowance. Does that sound like a fair interpretation?

Edit: It seems like you could raise a similar objection even if the machine could do a complete update in every timestep - at time n+(1/2) the decision might be nonoptimal.

a Turing machine that outputs the best action for right now, or given some data

Sure, that would be the Bayes-optimal thing to do, but it's not obvious that we can actually implement this. Where is the bit of code that figures out the best action for right now?

This isn't really meant to be an objection to anything; my intuition also suggests that making reasonable approximations to Bayesian updates is a good way to design algorithms, but can we prove this? Or has someone already done so?

Here's another way to think about it: Imagine we enumerate all possible Turing machines with K states and pick the one that performs best on average w.r.t our pre-specified utility function. What features, if any, can we predict about this machine? Cox's theorem suggests this machine would in some sense implement Bayesian belief updates, but can we formalize this notion and prove it? How would we recognize a Bayesian/non-Bayesian turing machine? Neither of these seems trivially answerable from Cox's theorem.

Sure, that would be the Bayes-optimal thing to do, but it's not obvious that we can actually implement this.

We were responding to your specific objection. Does my post accurately represent it? Does it dissolve it?

As for the rest - I expect we'd recognize a Bayesian turing machine from its behavior, or from observing that it acts according to the correct probabilities for things. Not sure on this part.

Also: actual Turing Machines and actual Utility Functions are terrible at their jobs.

We were responding to your specific objection. Does my post accurately represent it? Does it dissolve it?

Yup, the first sentence of your first post describes my post, but I disagree with the second sentence. I don't think it's dissolved yet.

As for the rest - I expect we'd recognize a Bayesian turing machine from its behavior, or from observing that it acts according to the correct probabilities for things. Not sure on this part.

I think you're saying that the Turing machine that reasons by Bayesian inference (i.e. repeated application of the product rule of probability theory) would turn out to be the best one. This seems like a sensible conjecture, but I wonder whether we can prove it. I don't think this fact emerges trivially from Cox's theorem or anything other work that I'm aware of.

The other approach is to define the Bayesian Turing machine as the best one. That's also quite reasonable, but now we have a completely new definition of "Bayesian", and the question of whether we can prove that it has a connection back to the old definition that involves repeated application of the product rule.

Also: actual Turing Machines and actual Utility Functions are terrible at their jobs.

Agreed.

I think you're saying that the Turing machine that reasons by Bayesian inference (i.e. repeated application of the product rule of probability theory) would turn out to be the best one.

What I meant to say was that if, per your hypothetical, we designed the best utility optimizer, we could see if it was "Bayesian" by checking whether it acts in accordance with the Bayesian probabilities (a la the Turing test). I guess?

The only real point of my post was that your stated objection to demonstrate that Turing machines clearly couldn't perform Bayesian updates wasn't really meaningful, in that the same objection could be leveled at any system more granular than the universe itself. On the other hand, I do expect that Turing-equivalent systems will have trouble with calculating priors.

I suppose you could have one Turing machine for each Planck duration in your lifetime (really this isn't so much to ask, as each machine is infinite in size anyway), each returning its calculation one after the other?