Approximating Solomonoff Induction
Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program, testing how likely each is to have produced the data, and weighting each program by its prior probability (shorter programs get exponentially more weight). Obviously Solomonoff Induction is impossible to do in the real world, but it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem: we don't care about the space of all possible datasets, only the ones generated by some algorithm. It's even been proposed as the basis for a universal intelligence test.

Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works must, to some extent, be an approximation of Solomonoff Induction. But how do we go about approximating it? It's basically an impossible task. Even if you add restrictions to remove all the obvious problems, like infinite loops and other non-halting behavior, the space of possibilities is just too huge to reasonably search through. And it's discrete: you can't just flip a few bits in a program and expect to find another similar program.

We can simplify the problem a great deal by searching through logic circuits instead. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Induction: it lets most interesting problems be modelled much more naturally. In the worst case you pay some overhead to specify the memory cells needed to emulate a Turing machine.

Logic circuits have some nicer properties than arbitrary computer programs, but they are still discrete and hard to do inference on. To fix this we can easily make continuous versions of logic circuits. Go back to analog. They are capable of computing all the same functions, but can also work with real-valued states instead of binary. Instead...
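To make the enumerate-and-weight picture above concrete, here's a toy sketch in Python. The program space is drastically simplified (a "program" is just a bit pattern that repeats forever), so this only captures the flavor of the 2^-length prior and posterior-weighted prediction, not a genuine search over Turing-complete programs:

```python
from itertools import product

def predict_next_bit(data, max_len=12):
    """Toy Solomonoff-style induction.

    Hypothesis space: every bit string p with len(p) <= max_len, read as a
    'program' whose output is p repeated forever. Each hypothesis gets the
    prior 2**(-len(p)), hypotheses that contradict the observed data are
    discarded, and the survivors vote on the next bit.
    """
    weight_next_1 = 0.0
    total_weight = 0.0
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            prog = "".join(bits)
            # This 'program' outputs prog repeated; truncate to what we need.
            out = (prog * (len(data) // length + 2))[: len(data) + 1]
            if out[: len(data)] == data:      # consistent with observations
                w = 2.0 ** (-length)          # shorter programs weigh more
                total_weight += w
                if out[len(data)] == "1":
                    weight_next_1 += w
    return weight_next_1 / total_weight if total_weight else 0.5

print(predict_next_bit("010101"))  # ~0.11: the next bit is probably 0
```

Even in this crippled hypothesis space you can see the core behavior: the shortest consistent pattern ("01") dominates the posterior, and longer patterns that happen to agree so far contribute exponentially less.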
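As for making the gates continuous: one standard construction (a sketch of what "analog" could mean here, not the only option) is the product relaxation, where each gate agrees with the Boolean truth table at 0/1 inputs and interpolates smoothly in between. That smoothness is exactly what makes gradient-based inference over circuits possible:

```python
# Continuous relaxations of Boolean gates (product/probabilistic semantics).
# At inputs of exactly 0 or 1 they reproduce the usual truth tables; in
# between they interpolate smoothly, so a circuit built from them is a
# differentiable function of its inputs (and of any gate parameters).

def soft_not(x):
    return 1.0 - x

def soft_and(x, y):
    return x * y

def soft_or(x, y):
    return x + y - x * y

def soft_xor(x, y):
    # XOR wired up from the relaxed gates: (x OR y) AND NOT(x AND y).
    return soft_and(soft_or(x, y), soft_not(soft_and(x, y)))

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, soft_xor(x, y))  # matches the Boolean XOR table
print(soft_xor(0.9, 0.2))        # 0.7544: graded output for analog inputs
```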
I'm not sure what my exact thoughts were back then. I was, and still am, at least skeptical of the specific formula used, since it seems arbitrary. The Chinchilla parametric form, L(N, D) = E + A/N^α + B/D^β, is intentionally designed to have certain properties, like diminishing returns to scale, baked into its functional shape. So it's not exactly a "wild implication" that it has these properties.
I recently fit the Chinchilla formula to the data from the first LLaMA paper: https://i.imgur.com/u1Tm5EU.png
This was over an unrelated disagreement elsewhere about whether Chinchilla's predictions still held or made sense, as well as about the plausibility of training tiny models to far greater performance.
First, the new parameters are wildly different from the old ones. Take that for what you will, but they are hardly set in stone...
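Concretely, a fit like the one linked above can be done with ordinary nonlinear least squares on that parametric form. Here's a sketch using scipy.optimize.curve_fit; since I can't embed the LLaMA numbers here, it fits synthetic points generated from the published Chinchilla parameters (E = 1.69, A = 406.4, α = 0.34, B = 410.7, β = 0.28), and you'd swap in real (N, D, loss) triples from the paper's training curves to redo the actual fit:

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla's parametric scaling law (Hoffmann et al. 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# E is the irreducible loss; the other terms shrink as power laws in
# parameter count N and training tokens D.
def chinchilla_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic stand-in data so the example runs on its own. To reproduce a
# real fit, replace N, D, L with (params, tokens, loss) triples read off
# the LLaMA training curves.
Ns = np.array([7e9, 13e9, 33e9, 65e9])          # model sizes
Ds = np.array([0.25e12, 0.5e12, 1.0e12])        # token counts
N, D = [a.ravel() for a in np.meshgrid(Ns, Ds)]
published = (1.69, 406.4, 0.34, 410.7, 0.28)    # Chinchilla's reported fit
rng = np.random.default_rng(0)
L = chinchilla_loss((N, D), *published) + rng.normal(0.0, 0.01, N.size)

popt, _ = curve_fit(chinchilla_loss, (N, D), L,
                    p0=(2.0, 400.0, 0.3, 400.0, 0.3), maxfev=100000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], np.round(popt, 3))))
```

The fitted parameters are only as meaningful as the data going in: with few models and correlated N and D, quite different (A, α, B, β) combinations can fit almost equally well, which is one reason refit parameters can land far from the published ones.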