It's true that it's only possible to find local optima, but that's true with any algorithm. Unless P=NP, perfect mathematically optimal AI is probably impossible. But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn't a huge problem.
As to why we can have decent machine learning and not AGI, I don't know. Approximating SI isn't sufficient for one, you need to act on the models you find. But even so, I'm more trying to say it's is a promising research direction, and better than other stuff. I'm not saying it's a solved problem. Or that there are no further advances to be made.
It's not even clear that a learning program must approximate Bayesian inference.
Everything approximates Bayesian inference, it's just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches bayesian inference. And that's how NNs are typically trained. But it's nice to have other hypotheses and not just assume that the most likely hypothesis is 100% correct.
But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn't a huge problem.
That's not even true. In practice, it's the best we've got, but it's still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can't).
As to why we can have decent machine learning and not AGI, I don't know.
It's because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
...Approximating SI isn't sufficient for one, you need to act on the models you
Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program and testing how likely they are to have produced the data. Then it weights them by their probability.
Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem; that we don't care about the space of all possible datasets, but ones which are generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.
Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction.
But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task. Even if you make restrictions to remove all the obvious problems like infinite loops/non-halting behavior. The space of possibilities is just too huge to reasonably search through. And it's discrete - you can't just flip a few bits in a program and find another similar program.
We can simplify the problem a great deal by searching through logic circuits. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Inducion; that it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.
Logic circuits have some nicer properties compared to arbitrary computer programs, but they still are discrete and hard to do inference on. To fix this we can easily make continuous versions of logic circuits. Go back to analog. It's capable of doing all the same functions, but also working with real valued states instead of binary.
Instead of flipping between discrete states, we can slightly increase connections between circuits, and it will only slightly change the behavior. This is very nice, because we have algorithms like MCMC that can efficiently approximate true bayesian inference on continuous parameters.
And we are no longer restricted to boolean gates, we can use any function that takes real numbers. Like a function that takes a sum of all of it's inputs, or one that squishes a real number between 0 and 1.
We can also look at how much changing the input of a circuit slightly, changes the output. Then we can go to all the circuits that connect to it in the previous time step. And we can see how much changing each of their input changes their output, and therefore the output of the first logic gate.
And we can go to those gates' inputs, and so on, chaining it all the way through the whole circuit. Finding out how much a slight change to each connection will change the final output. This is called the gradient, and we can then do gradient descent. Basically change each parameter slightly in the direction that increases the output the way we want.
This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit functions we want. Like predicting the price of a stock given the past history, or recognizing a number in an image, or something like that.
But this isn't quite Solomonoff Induction. Since we are finding the best single model, instead of testing the space of all possible models. This is important because essentially each model is like a hypothesis. There can be multiple hypotheses which also fit the data yet predict different things.
There are many tricks we can do to approximate this. For example, if you randomly turn off each gate with 50% probability and then optimize the whole circuit to deal with this. For some reason this somewhat approximates the results of true bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate bayesian inference that way.
Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.
It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data, it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace it's reward signal with an artificial one (e.g. drugs.) It might not fully comprehend that it's just a computer, and exists inside the world that it is observing.
Interestingly, humans also have these problem to some degree.
Reposted from my blog here.