As you probably know, there are multiple theoretically-interesting ML ideas that achieve very good results on MNIST. Have you tried more challenging image recognition benchmarks, such as CIFAR-100, or some non-CV benchmark? Since you posted your code, I wouldn't mind spending a bit of time looking over what you've accomplished. However, MNIST, which is now considered pretty much a toy benchmark (I don't consider PI-MNIST to be a better benchmark), will likely be an obstacle to get others to also look at it in-depth, as it will be considered quite preliminary. Another practical point: using C and CUDA kernels also makes it less accessible to a good percentage of researchers.
I have, deliberately, taken away everything relating to geometry from this presentation.
It took 12 years (1986-1998), and (how much research effort ?), to go from BP/SGD to convolutions.
This is a one man effort, on my own personal time (20,000 hours over the past 6 years), that I am giving away for the community to freely take over. I am really sorry if it is not enough. Their choice.
It is not an add-on to something that exists but a complete restart. One thing at a time.
As for CUDA, if you have a lot of threads, it is bearable, and you can use old, cheap, GPUs with very little loss (they have been optimised recently for FP multiply/add, at the expense of integer ADD).
FYI, I got >99.3% with only adding a fixed layer of simple preset filters (no augmentation), and the idea behind it can be readily extended. And you can also train convolutions, unsupervised.
I didn't mean "CUDA kernels" as in requiring NVIDIA GPUs - that's fine. I meant that you're limiting the readability of your code to a subset of people who understand both ML and CUDA programming. In my experience, this limits the reach, especially among younger researchers (I've hired C++ programmers and ML researchers for my business).
But, of course, you can choose to promote (or not) your work however you prefer.
The only function that is implemented in CUDA is the test one (test_gpu).
It is also implemented for CPU (test_mt_one), identically.
What matters is all clearly (I hope) explained in the text. It is simple enough that its reach is not limited to ML researchers and clearly within that of a lot of coders. The IT revolution started when amateurs got PCs.
In this version of the code, I had to make a tradeoff between completeness, usability and practicality. Write your own code, it does not matter. It is the concept that does.
The (upcoming) website will give separate, readable, versions. I am waiting to get a proper idea of what is demanded before I do that, so thank you for that input.
In the software industry we have the concept of responsible disclosure when one finds a security exploit in a published package.
Do we have responsible disclosure procedures for something that may represent a fundamental AI capability advancement? Who would you even disclose to? Slide into Yudkowsky's DMs?
MIRI should have a form you can fill out (or an email address) specifically for people who think they've advanced AI and are worried about ending the world with it. MIRI should also have Cold-War-style hotlines to OpenAI, DeepMind, and the other major actors.
Does MIRI do much in the way of capabilities research? It is my understanding that they don't. If MIRI doesn't do capabilities research then it seems unlikely to me they would do much with an idea that is all about capabilities advancement.
I don't think they do either. I was thinking they would provide alignment advice / troubleshooting services and would facilitate multipolar coordination in the event of a multiple slow-takeoff scenario.
I guess the process would be to pass it on to whichever capabilities researchers they trust with it. There would be a few of them at this point.
So, why not go straight to those researchers instead of MIRI? Because MIRI are more legible responsible intermediaries I guess.
I am not sure I understand your question (sorry, I do not know what Yudkowsky's DMs are).
I basically disclosed, to all, that the way we all think we think, does work.
What kind of responsibility could that bear ?
Sorry, I was being a bit flip/insider-y. Probably inappropriately so.
I'm curious how much you've engaged with the AI Safety literature/arguments?
"Yudkowsky's DM" --> Eliezer Yudkowsky's [Twitter] Direct Messages.
I think I have expressed my views on the matter of responsibility quite clearly in the conclusion.
I just checked Yudkowsky on Google. He founded this website, so good.
Here is not the place to argue my views on super-intelligence, but I clearly side with Russell and Norvig. Life is just too complex; luckily.
As for safety, the title of Jessica Taylor's article is:
"Quantilizers: A Safer Alternative to Maximizers for Limited Optimization".
I will just be glad to have proved that alternative to be effective.
I'm going to try to port this to python, just to see how it works, and make it easier for other people to try variations on it. I'll post a repo link under this comment when I have it to any sort of decent state.
Started working on a python version here:
https://github.com/epurdy/dpis_spiking_network
As of now I have a (probably buggy) full translation that uses python for-loops (so can't be made fast) and have started on a more pythonic translation that can probably be put on a gpu relatively easily.
Dpi, I welcome any contributions or corrections you have to this repository. Since you don't know python it will probably be hard to contribute to the python versions, but even just uploading the C version would be helpful.
Let me know what license I should use for this repository, if any.
Please do, and thank you for trying.
That is exactly what I am trying to elicit.
If you have any question, I am available to help (through private messages).
I do not know Python (I am extremely comfortable with C, I get full speed, and I do not have the time or need), but it seems the ML community does.
Update:
I changed the adjustment values for the 98.65% version to 500/-500 (following lsusr's comments).
1000/-1000 is good for the larger network (98.9%), but too much for the smaller ones. It makes the accuracy decrease after it has reached its peak value.
I was too quick to publish and did not go through all the required verifications. My fault.
I am running a series of tests to confirm. The first two are in spec and stable at 10 million steps.
Larger values speed up the convergence, and I was trying to make it as fast as possible, to not waste the time of those who would spend it verifying. Sorry about that.
How much compute does this take to beat SOTA on PI-MNIST? How does that amount of compute compare to the amount used by the previous SOTA systems?
Have you tried it on other tests besides PI-MNIST? What were the results?
I think that is the wrong question at this point.[1] When a new paradigm is proposed the question can't be "is it faster?" That comes later when things get optimized. The question is: Does it show new ways of thinking about learning? Or maybe: Does it allow for different optimizations?
[1] UPDATE: I should have phrased this as
I'm interested in sniffing out potential new paradigms and this might be the beginnings of one even if it isn't already faster than the current mainstream alternatives.
I disagree. The other questions are cool too, but only the "Is it faster?" question has the potential to make me drop everything I'm doing and pay attention.
We all have different strategies we follow. Arguments that invalidate one don't invalidate others.
Fair. Would you have agreed if I had said something like: "This question has the risk of leading the discussion away from the core insight or innovation of this post."
I don't think PI-MNIST SOTA is really a thing. The OP even links to the original dropout paper from 2014, which shows this. MNIST SOTA is much less of a thing than it used to be but that's at 99.9%+, not 98.9%.
That is a question of low philosophical value, but of the highest practical importance.
At line 3,000,000 with the 98.9% setup, in the full log, there are these two pieces of information:
'sp: 1640 734 548', and 'nbupd: 7940.54 6894.20' (and the test accuracy is exactly 98.9%)
It means that the average spike count for the IS-group is 1640 per sample, 734 for the highest ISNOT-group and 548 for the other ones. The number of weight updates per IS-learn is 7940.54 and 6894.20 per ISNOT-learn. With the coding scheme used, the average number of inputs per sample over the four cycles is 2162 (total number of activated pixels in the input matrix) for 784 pixels. There are 7920 neurons per group with 10 connections each (so 10/784th of the pixel matrix), for a total of 79200 neurons.
From those numbers:
The average number of integer additions done, over all the neurons, when a sample is presented is: 79200 * 2162 * 10/784 = 2,184,061 integer additions in total.
And for the spike counts: 1640 + 734 + 8*548 = 6758 increments (counting the spikes).
When learning, there are 7940 weight updates for each IS-learn, and 6894 for each ISNOT-learn. Those are additions of integers. So an average of (7940 + 9*6894) / 10 = 7000 integer additions.
That is to be compared with 3 fully connected layers of, say, 800 units each (to make up for the difference between 98.90% and 98.94%).
That would be at least 800*800*2 + 800*784 = 1,907,200 floating point multiplications, plus what would be used for max-norm, ReLU,... that I am not qualified to evaluate, but might roughly double that ?
And the same for each update (low estimate).
Even with recent work on sparse updates that reduces that by 93%, it is still more than 133,000 floating-point multiplications (against 7000 integer additions).
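For anyone who wants to redo the arithmetic, here is a small, self-contained program that reproduces the figures above from the logged values (a checking sketch only, not part of the published code):

    #include <stdio.h>

    int main(void)
    {
        /* forward pass: 79,200 neurons, each looking at 10/784 of the
           2,162 active inputs counted over the four cycles */
        double adds_forward = 79200.0 * 2162.0 * 10.0 / 784.0;
        /* spike counting: IS group + highest ISNOT group + 8 other groups */
        double incs_spikes  = 1640.0 + 734.0 + 8.0 * 548.0;
        /* learning: one IS-learn and nine ISNOT-learns out of ten groups */
        double adds_learn   = (7940.0 + 9.0 * 6894.0) / 10.0;
        /* BP/SGD reference: three fully connected layers of 800 units */
        double mults_bp     = 800.0 * 800.0 * 2.0 + 800.0 * 784.0;

        printf("integer additions per sample (forward) : %.0f\n", adds_forward);
        printf("spike-count increments per sample      : %.0f\n", incs_spikes);
        printf("integer additions per learning step    : %.0f\n", adds_learn);
        printf("FP multiplications per BP/SGD forward  : %.0f\n", mults_bp);
        printf("same with 93%% sparse updates          : %.0f\n", mults_bp * 0.07);
        return 0;
    }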
I have managed to get over 98.5% with 2000 neurons (20,000 connections). I would like to know if BP/SGD can perform at that level with such a small number of parameters (that would be one fully connected layer of 25 units) ? And, as I said in the roadmap, that is what will matter for full real systems.
That is the basic building block, the 1x1x1 Lego brick. A 1.5/1.1 = 36% improvement with 40 times the resources is useless in practice.
And that is missing the real point laid out in the Roadmap: this system CAN and MUST be implemented in analog (hybrid until we get practical memristors), whereas BP/SGD CAN NOT.
There is, at least, another order of magnitude in efficiency to be gained there.
There is a lot of effort invested, at this time, in the industry to implement AI at IC-level. Now is the time.
This is interesting in that it's biologically plausible, in ways that backpropagating neural networks aren't. But this doesn't have state of the art performance, and I'm going to register the prediction that, when scaled and tuned, it will still fail to match the best existing scaled-and-tuned neural networks. I could be wrong about this, and the framing of how this algorithm works is different enough from the usual backpropagation that I'd be interested in the results of the experiment, but I'm not all that excited (or scared).
You report 98.9% accuracy on PI-MNIST. For comparison, this 2015 paper on ladder networks claims (table 1) to have gotten 0.57% test error (99.43% accuracy) when using the full training set. I haven't looked very hard for other papers (the permutation invariance is a weird search term), but I expect a thorough search of the literature to turn up something significantly higher. I also have a preexisting belief that MNIST-in-particular has a large number of easy samples and a small number of hard samples, such that getting to ~98% is easy and then it starts getting progressively harder.
See my answer to mlem_mlem_mlem for the second part of your comment.
You are bringing another interesting point: scaling up and tuning.
As I indicated in the roadmap, nature has chosen the way of width over that of depth.
The cortical sheet is described as a 6 layers structure, but only 3 are neurons and 2 pyramidal neurons. That is not deep. Then we see columns, functional 'zones', 'regions'... There is an organisation, but it is not very deep. The number of columns in each 'zone' is very large. Also note that the neuron is deemed 'stochastic', so precision is not possible. Lastly, note (sad but true) that those who got the prize worked on technical simplification for practical use.
There are two options at this stage:
We consider, as the symbolic school has since 1969, that the underlying substrate is unimportant and, if we can find mathematical ways to describe it, we will be able to reproduce it, or...
We consider that nature has done the work (we are here to attest to that), properly, and we should look at how it did it.
1986 was an acceptable compromise, for a time. 2026, will mark one century of the 5th Solvay conference.
The code compiles. It runs faster than advertised (assuming I'm reading the output correctly). I guess that means I'm running on better hardware than D𝜋 used.
Here are the results after running the code for several hours on my CPUs.
testing 20000: 0: 77.553 1: 77.74
testing 40000: 0: 83.525 1: 83.33
testing 60000: 0: 87.880 1: 87.44
testing 80000: 0: 91.203 1: 90.10
⋮
testing 10860000: 0: 99.958 1: 98.50
testing 10880000: 0: 99.957 1: 98.47
testing 10900000: 0: 99.957 1: 98.56
testing 10920000: 0: 99.955 1: 98.47
I wrote a comment on that, but this is a better place for it.
I changed the update value from 1000 to 500 for that network size (in the code).
1000 is for the large network (98.9%). At size 792 (for 98.6%) it is too much, and the accuracy goes down after reaching the top. I did not take the time to check properly before publishing. My fault.
If you check it out now, it will get to >98.6% and stay there (tested up to 10 million steps, three separate random runs).
First of all, congrats! It's neat that your model beats state of the art on this benchmark, and with a new method of modeling too.
I feel like this post wasn't sufficient to convince me to use spiking models or your curriculum strategy. I think in part this is because I'm pretty jaded. The recent-ish history of machine learning includes a whole slew of comp neuro derived models, and almost always they come with two things:
So it feels like I'm skeptical on priors, which is a bit useless to say out loud. I'm curious where this research goes from here, and if you do more, I hope you consider sharing it.
I do think that one of the least understood features of comp neuro models used for machine learning (of which deep neural networks are currently the top contender, but other candidates would be boltzmann machines and reservoir computing) is the inductive bias / inductive prior they bring to machine learning problems.
I think it's possible that spiking neural networks have better inductive priors than other models, or at least better than the models we're using today.
The sparsity you mention also is probably a good thing.
The curriculum this induces is neat. My personal take with ML today is that learning from data IID is pretty crazy (imagine trying to teach a child math by randomly selecting problems from all of mathematics). It's possible this is a better way to do it.
Thank you for the congrats, it helps.
Note, that I only claim to reach SOTA, not to beat it.
It would be preposterous to convince anybody with this limited evidence. The goal is to raise interest so some will spend some time to look deeper into it. Most will not, of course, for many reasons, and yours is a valid one.
The advantage of this one is its simplicity. At this point any coder can take it up and build on it. This has to be turned into a new type of construction set. I would like this to provide the 15-year-olds of today with the pleasure my first computer (machine language) gave me, and Legos before that.
You got the last bit correctly. That is what self-organisation provides: ad-hoc selection.
At first glance, this approach appears similar to Stephen Grossberg's Adaptive Resonance Theory: https://en.wikipedia.org/wiki/Adaptive_resonance_theory
D𝜋 recommends the following compiler flags to run the code on a GPU.
$ nvcc -ccbin g++ -I./cuda/NVIDIA_CUDA-9.2_Samples/common/inc -I/usr/local/cuda-9.2/targets/x86_64-linux/include -m64 -Xcompiler -fno-stack-protector -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61 -rdc=true -O2 sonn.o -c sonn.c
$ nvcc -ccbin g++ -m64 -Xcompiler -fno-stack-protector -gencode arch=compute_61,code=sm_61 -gencode arch=compute_61,code=compute_61 -o sonn sonn.o -lcuda
If you're using CUDA version 10.1 instead of 9.2 then you will have to rename sonn.c to sonn.cu.
The link I typed in and appears when hovering over is, indeed, 'http://yann.lecun.com/exdb/mnist/', and it works on my machine... Thanks for the additional link.
Trying to summarize this for my understanding benefit: So the "trick" is that, instead of precisely tracking error, we just track the distribution using the Quantiliser section? (And the broader insight being that, instead of backpropping a function, we're measuring a distribution?)
I am going to answer this comment because it is the first to address the analysis section. Thank you.
I close the paragraph saying that there are no functions anywhere and that it will aggrieve some. The shift I am trying to suggest is for those who want to analyse the system using mathematics, and could be dismayed by the absence of functions to work with.
Distributions can be a place to start. The quantilisers are a place to restart mathematical analysis. I gave some links to an existing field of mathematical research that is working along those lines.
Check this out: they are looking for a multi-dimensional extension to the concept. Here it is, I suggest.
Is there any relation to this paper from 1988?
https://www.semanticscholar.org/paper/Self-Organizing-Neural-Networks-for-the-Problem-Tenorio-Lee/fb0e7ef91ccb6242a8f70214d18668b34ef40dfd
No, there isn't, but it is interesting.
I gave it a quick look. It seems to be closer to this (this is closer to the point)
I was heavily influenced, back in the 70s, by the works of Mandelbrot and the chaos theory that developed at the time, and has gone nowhere.
The concept of self-organisation has been around for a long time but it is hard to study from the mathematical point of view, and, probably for that reason, it has never 'picked up'.
So, of course, there are similarities, and, please, go back to all of those old papers and re-think it all.
You will benefit from a hands-on approach rather than a theoretical one. First you experiment, then you find, then you analyse and finally, you theorise. This is not quantum physics and we have the tools (computers) to easily conduct experiments.
This is just another example, one that could prove very useful. That's it.
I wonder how this relates to Predictive Coding has been Unified with Backpropagation. If D𝜋 doesn't use backpropagation is this predictive coding or something else?
And if it is (like) predictive coding I wonder why it doesn't have 100x worse performance.
How does the batch() procedure terminate? It contains an infinite while loop, within which there is neither a return or a break statement. Maybe there are some goings-on concerning termination of threads, but that is beyond my knowledge of C or C++.
We never stop learning.
To kill the program, shoot 'Ctrl+C'.
Seriously, this system works 'online'. I gave the example of the kids and the Dutch to illustrate that, in nature, things change around us and we adjust what we know to the new conditions. A learning process should not have a stopping criterion.
The system converges, on PI-MNIST, at 3-5 million steps. To compare, recent research papers stop at 1 million, but keep in mind that we only update about 2 out of 10 groups each time, so it is equivalent.
So you can use "for( ; b<5000000 ; b++ )" instead of "while( 1 == 1 )" in the batch() function.
After convergence, it stays within a 0.1% margin forever after. You can design a stop test around that if you want, or around the fact that weights stabilise, or anything of that kind.
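One possible stop test along those lines, as a sketch (the window size is an arbitrary choice, not something from sonn.c):

    #define WINDOW 50     /* arbitrary: number of recent evaluations to look at */

    /* Returns 1 once the test accuracy (in percent) has stayed within 0.1%
       of its best value for the last WINDOW evaluations. */
    int should_stop(const double *accuracy, int n_evals)
    {
        if (n_evals < WINDOW) return 0;
        double best = accuracy[0];
        for (int i = 1; i < n_evals; i++)
            if (accuracy[i] > best) best = accuracy[i];
        for (int i = n_evals - WINDOW; i < n_evals; i++)
            if (accuracy[i] < best - 0.1) return 0;   /* still moving */
        return 1;                                     /* stable: stop */
    }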
If you were to use a specific selection of the dataset, wait until it stabilises and, then, use the whole set, the system would 'start learning again' and adjust to that change. Forever.
It is a feature, not a bug.
The accuracy reaches state of the art (98.9%) on PI-MNIST with 750,000 low precision (binarisable) connections (>98.6% with 80,000 connections), one layer, no bias, no back-propagation and only additions. It is yet to be optimised. It works online, and is backed by existing mathematical theorems. It also relates to the real cortical structure and personal experience; and even popular wisdom.
Isn't SOTA for MNIST more like 99.9%? At least according to this website. I don't think MNIST is a great benchmark because it's pretty much solved.
It is PI-MNIST.
Permutation Invariant. To keep it simple, you cannot use convolutions. It is all explained in the text.
Real SOTA on that version is 99.04% (Maxout), but that is with 65+ million parameters. I do not have the hardware (or time).
I stopped at 98.9% with 750,000 connections (integers and additions) and this is close to what BP/SGD (table 2) gets with 3 hidden layers of 1024 units each, for a total of >3,000,000 parameters (floating-point and multiplications) with max-norm and ReLU.
For a similar accuracy, the number of 'parameters' is almost an order of magnitude lower with this system and efficiency even more.
Remember, it is not supposed to work at all, and it is not optimised.
PI MNIST is up to at least 99.43% with Ladder Networks https://arxiv.org/abs/1507.02672. I think I vaguely remember some ~99.5% published since (it's been 6 years) but I haven't done the lit tree crawling to find it currently. Another example of a higher performing result than Maxout is Virtual Adversarial Training at 99.36% https://arxiv.org/abs/1704.03976. The JMLR version of dropout https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf also has a 99.21% with dropout finetuning of a Deep Boltzmann Machine.
You are comparing step and ladder (I had to seize on it !).
If you look at Table 2 in your last reference, you will see that they, carefully, show results improving as steps are added. Ladder is just another step (an optimisation one). There is a reason why researchers use PI-MNIST: it is to reduce the size of the ladder to make comparisons clearer.
What I am trying to bring here is a new first step.
I could have tried a 784-25-10 BP/SGD network (784*25 = 19600 parameters) to compare with this system with 196 neurons of 10 connections each. I have managed to get 98% with that. How much for the same with BP/SGD ?
The current paradigm has been building up since 1986, and was itself based on the perceptron from 1958.
Here, I take the simplest form of the perceptron (single layer), only adjoin a, very basic, quantiliser to drive it, and already get near SOTA. I also point out that this quantiliser is just another form of neuron.
I am trying to show it might be an interesting step to take.
See also Evolution of Modularity. Using your quantilizer to choose training examples from which to learn appears to be a very simple, natural way of accomplishing modularly varying reward functions. (I left a comment there too.)
The piece of code below is indisputable, reproducible proof of what we are going to say in this post.
It implements a basic spiking neural network, with a simple learning method that introduces a new paradigm.
The accuracy reaches state of the art (98.9%) on PI-MNIST with 750,000 low precision (binarisable) connections (>98.6% with 80,000 connections), one layer, no bias, no back-propagation and only additions. It is yet to be optimised. It works online, and is backed by existing mathematical theorems. It also relates to the real cortical structure and personal experience; and even popular wisdom.
So it cannot be just ignored or dismissed. It requires proper falsification, or further analysis and extension.
I will describe the system and its mechanisms, analyse it, and share the vision, starting from the beginning, in plain language, so everybody can understand, and, yet, it will be essentially complete.
I am setting up a website where I will add more information and code depending on expressed interest, if any. The code at the bottom is there for completeness, immediate reference and, required, formal proof.
The system
A neuron, in the brain, is connected to multiple other neurons through its dendrites. It receives signals carried by nerves. Each connection has a strength (weight) that varies over time. There are positive and negative connections. When a neuron receives a signal from a connection, it ‘adds’ or ‘subtracts’ (chemically) the weight associated with the connection, inside its body, to a ‘total’. If it reaches the limit of what it can hold (its threshold), it spikes: it sends a signal through its axon, its only outgoing nerve, for other neurons to receive. In doing so, it empties its content. And so on… In our system, there are neurons that operate the same way.
When a neuron spikes, the connections that have contributed to it are reinforced. It is Hebb’s rule. In this system, we do the same.
Connections are not permanent. They are created on the fly, and their weight changes over time. Decay reduces weights over time, as well. When the weight goes down too much, the connection is cut. In this system, we do the same.
In the brain, neurons group together, locally, in what is called columns. They tend to spike, or not, in a fairly synchronised manner inside a group. It is a matter of scientific debate how or what for, but it is. In this system, we set up neurons in groups, as columns.
We have, at this point, described most of the system. We will now use it in a real test case and describe how it operates.
The mechanism
In the field there is a dataset that is used in most research papers to compare the performance of the systems and methods presented: the MNIST dataset. It is made of the digits, 0 to 9, handwritten by different people and presented in a pixel matrix of size 28x28 in shades of grey. There are 60,000 samples to learn from. The system must use those to ‘learn’ to identify digits. There are an additional 10,000 samples to test the system with, after it has been trained.
Intelligence is not just ‘one thing’. There are many clues in a handwritten number. Most of them relate to geometry: the shape of the line, its orientation,… To make it more difficult, in the Permutation Invariant (PI) version, the pixels of the display matrix can be mixed up randomly. That is the version used in all research papers on the core subject of ‘AI’ (no augmentation, of course). The clues that are left are only the correlations of pixels ‘lighting up’ (it is, actually, darkening), wherever they are.
The first task of a brain is to detect correlations, before it can start to wonder about causality.
So we set up ten groups of neurons, one for each digit. Each neuron is connected to a few pixels from the input matrix.
When we present a sample to the system, each neuron will look at the pixels its inputs are connected to. If a pixel is on, the weight associated with the input is added to its total. If that total reaches a given, fixed, threshold, the neuron ‘spikes’ and the total is reset to zero. The spikes are counted per group. We compare the spike counts of the groups. The digit group with the highest count is the answer of the system.
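In code, that forward pass amounts to something like the following minimal sketch (illustrative only: the names, sizes and threshold are placeholders, not the structure of the published sonn.c):

    #include <string.h>

    #define GROUPS      10     /* one group per digit                  */
    #define NEURONS     200    /* neurons per group (illustrative)     */
    #define CONNECTIONS 10     /* input connections per neuron         */
    #define THRESHOLD   1000   /* fixed spike threshold (illustrative) */

    typedef struct {
        int pixel[CONNECTIONS];    /* which input pixels it is connected to     */
        int weight[CONNECTIONS];   /* signed integer weights                    */
        int total;                 /* accumulated potential, kept between views */
    } Neuron;

    /* Present one binary view of a sample and return the answer (0-9). */
    int classify(Neuron net[GROUPS][NEURONS], const int input[784])
    {
        int spikes[GROUPS];
        memset(spikes, 0, sizeof spikes);

        for (int g = 0; g < GROUPS; g++)
            for (int n = 0; n < NEURONS; n++) {
                Neuron *nu = &net[g][n];
                for (int c = 0; c < CONNECTIONS; c++)
                    if (input[nu->pixel[c]]) {            /* pixel is on        */
                        nu->total += nu->weight[c];       /* add its weight     */
                        if (nu->total >= THRESHOLD) {     /* threshold reached  */
                            spikes[g]++;                  /* the neuron spikes  */
                            nu->total = 0;                /* and empties itself */
                        }
                    }
            }

        /* the digit group with the highest spike count is the answer */
        int best = 0;
        for (int g = 1; g < GROUPS; g++)
            if (spikes[g] > spikes[best]) best = g;
        return best;
    }

Note that the total stays inside the neuron between views: it is carried over from one cycle to the next (see the Analysis section).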
Learning is adapting the weights.
When the system learns from a sample, it adjusts the weights of the connections that are active. In neurons that belong to the group that represents the actual digit, the connection weights will be augmented (the weights go up); positive learning. In the neurons belonging to the other groups, they will be reduced; negative learning. All weight adjustments are of the same, fixed, amount.
The connections decay over time. All weights are reduced by a small, fixed, amount at regular intervals. This is done on the absolute value of the weight (so negative values increase).
The system starts with no connections. Before we learn from a sample, a small number of connections are created from active pixels in the sample to random neurons in each digit group that have available slots (Hebb’s rule). They are given a fixed initial weight; positive for neurons in the sample’s digit group, negative otherwise. The connection weight will vary over time as learning goes. If it goes under a set limit (in absolute value), the connection is cut off.
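A minimal sketch of these three operations (fixed-amount learning, decay with cut-off, and connection creation); all constants are placeholders, not the values used in sonn.c, and the free-slot convention is illustrative:

    #include <stdlib.h>

    #define CONNECTIONS 10
    #define ADJUST      500    /* fixed weight adjustment (illustrative)            */
    #define DECAY       1      /* fixed decay, applied at regular intervals         */
    #define CUTOFF      10     /* below this absolute weight, the connection is cut */
    #define INITIAL     100    /* fixed initial weight of a new connection          */

    typedef struct {
        int pixel[CONNECTIONS];    /* connected input pixels, -1 = free slot */
        int weight[CONNECTIONS];
        int total;
    } Neuron;

    /* Positive or negative learning: adjust the active connections of one
       neuron by the same fixed amount, up for the IS group, down otherwise. */
    void learn(Neuron *nu, const int input[784], int is_group)
    {
        for (int c = 0; c < CONNECTIONS; c++)
            if (nu->pixel[c] >= 0 && input[nu->pixel[c]])
                nu->weight[c] += is_group ? ADJUST : -ADJUST;
    }

    /* Decay: reduce the absolute value of every weight by a small fixed
       amount, and cut the connection when it falls below the limit. */
    void decay(Neuron *nu)
    {
        for (int c = 0; c < CONNECTIONS; c++) {
            if (nu->pixel[c] < 0) continue;
            nu->weight[c] += (nu->weight[c] > 0) ? -DECAY : DECAY;
            if (abs(nu->weight[c]) < CUTOFF)
                nu->pixel[c] = -1;                     /* connection is cut */
        }
    }

    /* Growth: connect one active pixel to a free slot, with a fixed initial
       weight, positive for the IS group and negative otherwise. */
    void grow(Neuron *nu, int active_pixel, int is_group)
    {
        for (int c = 0; c < CONNECTIONS; c++)
            if (nu->pixel[c] < 0) {
                nu->pixel[c]  = active_pixel;
                nu->weight[c] = is_group ? INITIAL : -INITIAL;
                return;
            }
    }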
The data in MNIST is made of pixels in shades of grey. To input them into the system, we reduce the depth to 4 shades, linearly (1,64,128,192), and present 4 views, in as many ‘cycles’, to the system. Increased precision is useless, and we want binary inputs.
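One plausible reading of that coding, as a sketch (the exact thresholding in sonn.c may differ):

    #define CYCLES 4

    static const int level[CYCLES] = { 1, 64, 128, 192 };

    /* view[c][p] is 1 if pixel p is 'on' during cycle c, 0 otherwise. */
    void encode(const unsigned char grey[784], int view[CYCLES][784])
    {
        for (int c = 0; c < CYCLES; c++)
            for (int p = 0; p < 784; p++)
                view[c][p] = (grey[p] >= level[c]) ? 1 : 0;
    }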
That is it for the mechanism.
As everybody who has studied the field knows, the basic perceptron - this is a most simplistic version of it - is intellectually interesting, but it is not usable as is.
So this should give a very bad result on the task, and it does.
In 1986, Rumelhart, Hinton, Williams, proposed the use of hidden neurons and back-propagation (BP/SGD), on the perceptron, as a way to solve that problem. And it works, with additional details, providing the base for most of the recent successes in AI.
But it requires a very large amount of silicon and power, so we had to wait thirty years for that, and we will have to wait much longer again before we see it used at scale on a personal device. More annoyingly, the method involves precise calculation of the error between the result the system gives and the result it should give, to be applied on each neuron in a precise manner, and backwards (nerves are directional).
Clearly, the neurons in the brain do not do that. So it is, indeed, completely artificial. It means that the brain does it another way, that nature has found, and we have not, so far.
Another way
Born in 1964, I have been searching for it for fifty years. I followed advances in the field as a kid, got my first real computer at 15, studied the field in the mid 80s, until 1986. Like everybody else at the time, I calculated how long it would take for BP/SGD to be usable, and decided to just run my life doing other things. But before I stopped, I listed the constraints that you would have to accept to make the wait shorter (only additions of integers; sparse connectivity; it should be possible to implement it on asynchronous analog hardware, when memristors become a reality), and kept it in mind to ponder in my spare time, looking for that other way.
The sparse dynamic connection part of this system is the leftover of that work. Initially, it was much more complicated, but the simpler version used here is enough, at this scale. On its own, it proved not to be enough to reach the accuracy achieved with BP/SGD.
Starting in 2016, I searched frantically for months, tried hundreds of things. One day, I gave up trying to control everything. And it started working.
I was trying methods used in chess: take the system, at some point in the process, sort out the samples based on their error and try to find a way to decide which sample to learn from next. I could not find one. I decided to try simple: let the sample with the biggest error be the one to learn from. It worked.
After a very large amount of work refining it, this is the main result:
The question to ask is not ‘how’ to learn, but ‘when’.
When presented with a digit sample, the system checks if it is worth learning from or not. The group that represents the digit (IS group) should have a high response and the others (ISNOT) a low one. High or low are not set numbers; they are relative to all responses, and, for one sample, between groups. If you take two similar samples, one written with a thick line and one with a slim one, the spike counts will mechanically be higher with the first one, as there is more input. The ‘8’ has more pixels active than the ‘1’. So high and low must be expressed as the difference between the groups, for each sample.
For the IS group, what you want is for it to have a higher spike count than the highest of the ISNOT ones. The other ones do not matter. So you take that difference (IS-difference). For the ISNOT groups, you just want them to be low, so you use, for each group, the difference between that group and the average of all the other ones (so 9 ISNOT-differences).
When a sample is presented, you take (and store) those values (the IS-difference and the 9 ISNOT-differences).
- For the IS group, if the IS-difference is in the top 5% of all those seen so far, you do a positive learning.
- For each other group, if the ISNOT-difference is in the top 0.66% of all those seen so far, you do a negative learning.
Many variations are possible. This one works best on MNIST (only by a small margin).
That is it. And it is enough to reach state of the art. As simple as possible, but not simpler.
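Put together, the decision of when to learn looks roughly like this sketch. The in_top_fraction() call stands for the running quantilisers described in the next section, and the sign conventions follow the description above; check sonn.c for the exact definitions:

    #define GROUPS 10

    /* Stands for the running quantilisers described below: returns 1 if
       'value' falls in the tracked top fraction of the values seen so far,
       and moves the running estimate either way. */
    int in_top_fraction(int which, int value);   /* which: 0 = IS, 1 = ISNOT */

    /* Fill decide[]: +1 = positive learning of the IS group, -1 = negative
       learning of that ISNOT group, 0 = do not learn. */
    void select_learning(const int spikes[GROUPS], int is, int decide[GROUPS])
    {
        for (int g = 0; g < GROUPS; g++) decide[g] = 0;

        /* IS-difference: the IS group against the highest ISNOT group */
        int highest_isnot = 0;
        for (int g = 0; g < GROUPS; g++)
            if (g != is && spikes[g] > highest_isnot) highest_isnot = spikes[g];
        if (in_top_fraction(0, spikes[is] - highest_isnot))      /* top 5% so far */
            decide[is] = +1;

        /* ISNOT-differences: each other group against the average of the rest */
        for (int g = 0; g < GROUPS; g++) {
            if (g == is) continue;
            int sum = 0;
            for (int h = 0; h < GROUPS; h++)
                if (h != g) sum += spikes[h];
            if (in_top_fraction(1, spikes[g] - sum / (GROUPS - 1)))  /* top 0.66% */
                decide[g] = -1;
        }
    }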
One useful addition to the update mechanism is to only update the weights of a neuron if its total, when it spikes, is below a limit. The idea is that if a neuron spikes too quickly, there is no point in increasing it: it is already effective. It has the added advantage of a large reduction in the number of updates, and better stability.
There are many ways to implement that rule. In this code, we set up two thresholds, used at learn time. The low one is the spike one: if it is reached, the neuron spikes and the total is reset. If the total, at that moment, is under the high one, the update takes place; otherwise it does not. It is another form of selective learning, within each neuron.
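As a sketch, with illustrative threshold values (not the ones in sonn.c):

    #define T_SPIKE  1000   /* low threshold: reaching it makes the neuron spike */
    #define T_UPDATE 1400   /* high threshold: at or above it, no weight update  */

    /* Returns 1 if this spike should also trigger a weight update: the neuron
       spiked, but its total at that moment was still under the high threshold. */
    int spike_and_check(int *total)
    {
        if (*total < T_SPIKE) return 0;        /* no spike, nothing to decide   */
        int update = (*total < T_UPDATE);      /* did it spike 'slowly' enough? */
        *total = 0;                            /* the total is reset on a spike */
        return update;
    }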
The values for the adjustments, decay, cutoff weight, etc…, do not change the final accuracy much. They mostly affect the convergence rate. The variance in accuracy, within a run once it has converged, and between runs, is within 0.1% on the test set, and it always converges.
Quantilisers
The last mechanism we will detail here, because it is of interest, is the one we use to compute the values at a given quantile of the distributions of those differences. You can store and sort them, of course (that is what I did first), but it lacks elegance and you need to store all that data. There is, again, a much easier and better way, that everybody can actually come up with.
Stand in the street with a vertical measuring stick. Place the mark at the bottom. When somebody passes by, if they are taller than the mark, up the mark one notch. If they are shorter, down the mark one notch. As people pass by, it will go up and down.
Think about it: where will it converge to?
That is correct (most people figure it out): it will be the average height of the passers-by. Actually, it is the median height. But for heights, a Gaussian distribution, it is the same.
1 up, 1 down, gives the middle: 100 / (1+1) = 50%.
What would happen if you were to go up by three instead of one when somebody taller passes by?
3 up, 1 down, gives the value above which lies the top quarter of the distribution: 100 / (1+3) = 25%.
9 up, 1 down will give the height over which they are in the 10% tallest: 100 / (1+9) = 10%.
For the bottom part of the distribution, do 1 up and X down.
So, if you want to know if a sample difference is within the highest 5%, for each sample seen, up by 19 when it is over the current estimate (and you learn) and down by 1 if it is under (and do not learn).
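Here is a minimal, runnable sketch of that estimator (the notch unit and the test values are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        int estimate;   /* current position of the mark             */
        int up;         /* notches to go up when a value is above   */
        int down;       /* notches to go down when a value is below */
    } Quantiliser;

    /* Returns 1 if 'value' is above the current estimate (in the tracked
       top fraction), and moves the estimate either way. */
    int quantiliser(Quantiliser *q, int value)
    {
        if (value > q->estimate) { q->estimate += q->up;   return 1; }
        else                     { q->estimate -= q->down; return 0; }
    }

    int main(void)
    {
        Quantiliser top5 = { 0, 19, 1 };          /* 100 / (1 + 19) = 5% */
        srand(1);
        for (int i = 0; i < 100000; i++)
            quantiliser(&top5, rand() % 1000);    /* uniform values 0..999 */
        /* the estimate should settle near 950: 5% of the values lie above it */
        printf("estimate after 100000 samples: %d\n", top5.estimate);
        return 0;
    }

With 1 up and X down, the same structure tracks a quantile at the bottom of the distribution instead.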
This is beautiful for seven reasons:
And that brings us to the final nail.
I was looking for a name for this last process, and I came up with the obvious ’quantiliser’. I typed it in Google, out of curiosity. Do it. At the top (it is an unusual word) you will find links to this paper, by Jessica Taylor, who I bow to.
It is a perfect vindication of using ‘quantilisers’ to decide when to learn. It is also aimed at improving the safety of AI systems. I take it.
The conditions under which it is valid are met, in our system, because each update is of a small fixed value.
The paper is about finding a better method than expected utility maximisation, in a decision process, for a good and safe answer. Maximising is precisely where I started when I tried learning from the sample with the worst error.
The proposed answer is to select any decision that “is in the top proportion of some ‘base distribution’ over actions, sorted by the expected utility achieved if that action is executed”. She even had the common decency to provide a rational analysis of the principle behind it. More right, as well.
Run our system without the quantiliser (learn from all samples) for a while, then stop. From that same starting point each time, apply learning from each sample (the ‘action’) and measure the impact it makes on the global accuracy. That is the ‘expected utility’. Sort those impacts with, next to them, the IS-difference for the sample. You will immediately see that there is a very good correlation.
So the ‘top proportion of the base distribution over actions’ can be those samples that have the highest IS-difference. It is interesting to note that the correlation diminishes, gradually, as the system learns and accuracy improves.
It is remarkable that, in the case of ‘learning’, a suitable utility function should be ‘when you are wrong’.
Is it?
Meaning gets lost in symbols and functions, as Descartes noted in his ‘discourse on method’, and common-sense must be ruled, not dismissed and ignored.
Analysis
This system works, at state-of-the-art level, and is as simple as possible. How come it works ?
If you do not apply the selection rule, it performs badly. As you increase the selectivity, the accuracy increases with it (with an optimum). All other settings have a marginal impact on the result.
Unambiguously, this is the one parameter that matters; what makes it work.
One way to look at it is that the learning system only sees a fraction of the dataset at any time. The samples in that subset change over time. A sample that was out can come in and another go out, following a learning step.
So it can be seen as two interacting, changing, sets: the weights and the samples.
And it goes on forever, with no outside intervention, adapting to changes in the learnable subset, if there are any.
That is a self-organised system.
So we will call it Self-Organised Neural Network. SONN.
The probability for a digit sample to be learned from is related to its difference to the norm (for that digit). Samples that are unusual will be learned from more often.
Instead of calculating a proportion of the difference to back-propagate and adjust the weights, as in BP/SGD (the generalised delta rule), here, we adjust the weights by a fixed value, a number of times that is proportional to the difference; to the same effect, but without the complications.
Each neuron in a group can be seen as a model, and the group does pure model averaging. It is also the case with the cycles. That is why we use that type of input mapping on this task (a form of frequency coding). As usual, the best mapping will be task dependent and a subject of further study (data has to be binarised). An interesting fact is that the total carried between cycles can be halved, for a slightly better result (an acceptable exception as it is just a right shift, implementable in analog).
The system is also resilient. Random ablation of 20% of the pixels goes unnoticed. Same with additional noise.
To those that are interested in recent developments around sparse connectivity, note that this system is really sparse by structure and, on top of it, a large part of the connections is also ignored during training.
You can split neurons into subgroups trained with separate quantilisers and group the subgroups by averaging them, or any other ways of splitting/regrouping, to no difference.
The neurons provide Internal representations due to their limited connectivity, and they replace hidden layers. The group/column is not a ‘layer’, but a convenience to interface with the quantilisers. In a larger system, the separate neurons provide, in population coding, a convenient way to dispatch information to the rest of the system (must see Numenta).
A recurrent setup works, to similar accuracy, by using the output of all neurons as negative inputs, instead of the input matrix.
To stop the weights from rising indefinitely, they are limited to the threshold value. It is not required. If you set the limit to a multiple of it, it works, with a limited loss of accuracy. Whatever value you set as a limit, weights will accumulate at that limit. Decay is a way to limit that, but if you let it go, it only affects the accuracy by a small amount. That brings the question of binarisation: if they are all at the limit, it is, in effect, binarised. Indeed, if you treat all weights, at test time, as 1 or -1, but keep full precision at learn time, the system is mostly undisturbed.
What matters is the difference between the number of positive and negative activations, not the weight values.
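As a sketch of what that test-time binarisation means (illustrative only; the spike threshold has to be rescaled accordingly):

    #define CONNECTIONS 10

    /* Contribution of one neuron for one binary view, scoring with the signs
       of the weights only: the difference between the number of positive and
       negative active connections. */
    int binarised_total(const int pixel[CONNECTIONS], const int weight[CONNECTIONS],
                        const int input[784])
    {
        int total = 0;
        for (int c = 0; c < CONNECTIONS; c++)
            if (pixel[c] >= 0 && input[pixel[c]])
                total += (weight[c] > 0) ? 1 : -1;
        return total;   /* compared against a correspondingly rescaled threshold */
    }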
Empirical evidence indicates that there is an optimal ratio between the two quantile values (IS and ISNOT). The IS quantile (5%) is always valid. We set the ISNOT quantile to be such that the ratio of the number of negative learnings to positive learnings is around 1.18. The reason for that is yet to be confirmed, but it is my educated guess that it is related to the ratio of the surface (number of pixels activated) covered by the ‘sum’ of a given digit against that covered by the ‘sum’ of all the other digits. There are, then, more negative connections but with smaller weights. That measure of ‘surface’ can be done in many different ways, and finding one that ‘fits’ would not prove anything. It is specific to this dataset. Another way to balance this difference is by using different weights for positive and negative updates, with similar effect.
A large part of the time in this research was spent looking at every detail of the learning mechanism, trying to imagine all the ways they could be changed. Every line of code (are there additional conditions?), every variable (should it be within some limits?) is the root of a bunch of possibilities.
I tried thousands. And they combine. They have to be tried more than once, and at different values and network sizes.
I have mentioned a few points, saying that, in the end, they do not really matter. The only ones that have some impact are the parameter for the ISNOT quantilisers and/or the weight update difference, and decay. And even those have only a small impact on final accuracy; they impact convergence speed. The resulting distribution of connections and weights can vary enormously, but the accuracy is the same. You can change parameters during a run, prune connections, change weights arbitrarily as it goes: the system always recovers very quickly.
That, in our view, is the most interesting aspect of it, and it demonstrates the power of self-organisation.
The real shift, as I see it, with this paradigm, is from functions to distributions. Distributions are what our brain captures and builds on. We do not find a suitable function for each existing distribution. Gaussians are widespread and convenient, but they are not all there is. The feline size distribution has two bumps; the moment ‘intelligent agents’ come into play, Gaussians are replaced by Paretos. There is something else about maxout... Quantilisers handle any type indifferently. High precision is futile.
There are no such things as functions anywhere in this, and I know it will aggrieve a lot of people, but feel free to try adding some anywhere to optimise something…
Reality is that what is needed as a base for the future is a basic building block that is small, efficient and resilient. I have wasted very little time testing large networks, just enough to get to that 98.9%. Remember this is not supposed to work.
Nature can be somewhat described using mathematics, but it does not operate using them. Not even multiplication.
As you see, there is a lot to look at and investigate. This is only just the beginning for this new approach.
Intelligence
A last thought experiment, for everybody:
We are continuously exposed to situations, and make decisions. There is something on the path, should I go around it? My stomach tells me I am hungry: what am I going to eat? Will the hero win in the end? (And down to: should I move my arm up a bit as I turn to sit in that armchair?).
Each time, we use a list of conditions to make the decision. Is there a car coming? Is the object small, dirty? Am I at home? What’s in the fridge?... We decide based on a sort of weighted sum of those conditions. It is never twice the exact same ones, and the weights are approximate.
After we have made the decision, we get a result. That car was really slow, I had to wait too long; walking on that piece of paper did not matter. I should not have had a burger, I feel bloated. I ‘forgot’ I had already got one for lunch. That chicken was nice. In that scribbled phone number, the third digit was a 7, not a 1. They were fast writing it. And I knew that.
We up and down, a bit and fuzzily, the causes that we remember and identify to have contributed to the decision, depending on whether they were for or against that good or bad decision. We do that thousands of times a day and we are mostly right. As a baby we were mostly wrong. Sometimes we make a bad decision and, with little awareness, we adjust our views. That is learning.
In logic, there are three modalities: deduction, induction, and abduction. That last one is mostly useless from a formal point of view -it is fuzzy- but it is the one that is most used in daily life, and is the basis of creativity. The way we ‘feel’ the world around us is abducted from our daily life.
That is the first stage of intelligence, the one this system aims at, and the one we can all relate to. Even the dog.
I should eat less junk food. I should not be too careful. I should check with the person that gives me their number. That is generalising, and it is the next step.
Setting up rules to proceed to generalisation is method, and the (last?) step after.
Roadmap
We mentioned the existence of neuronal columns in the brain and made an analogy with our neuronal groups.
There are about 12 billion neurons in the neocortex. A column is a few hundred neurons; say 1,000. That makes for 12 million columns.
There are 50,000 words in a dictionary. They combine, in different ways, into ‘concept’ or ‘class’, ‘type’...
12 million / 50,000 = 240 combinations per word. It is reasonable.
Columns will assemble into zones, and then regions, forming a map, driven by inputs provided by sensors linked to different areas on its edge. The design part is in the map settings. Look at the brain map across evolution. It will learn from real life: that is our dataset.
Part of the map (the neocortex) is detached from direct inputs, and free to structure as the self-organising process will make it to be, based on what has been received from the edge. That is where global structuring will happen.
It will, itself, receive information from... itself, and structure it. What would you call that ?
‘I’ is abducted self.
Once it has that, it can move from correlation detection to causality analysis. Call it reason.
And we must not forget the hormonal system (feelings, emotions,...) that is separate, but interacts with the nervous system, inwards by nerves, outwards by hormones, through the blood stream, to neurotransmitters.
As model averaging shows, it is better to have more, different, less correct, opinions, than one that knows it all. On the PI-MNIST the system can reach 98.5% with only 200 neurons per group and 9 connections each. Increasing the size leads to very quickly diminishing returns (but better stability across runs), and is useless for practical matters.
Existing commercial hardware is usable for our system, but wholly inefficient. With a simple hybrid design (it is only routing and additions, absolute synchronicity is not required, small loss acceptable), this system, at that scale, could fit today (we ‘think’ at 40 Hz, CPUs run at 4 GHz), in a large rack. In a skull in 20 years.
Movement is quite simple if you think about it, and will benefit from nitinol muscle. Vision can be done once and for all. Unlike movement that can be altered, it never changes. A wise set of filters will save a huge amount of hardware.
Data path from the eye to the brain is narrow and the sampling done by neurons in the retina is similar to what is done by our neurons. Those are to be further studied.
Time is relative. Chronology is not. ‘Brain waves’ serve a purpose for approximate timing and synchronisation.
Maybe it is time to look again at Copycat.
Unsupervised learning is an obvious next step to be studied.
The genes do not contain a complete description of the cortical map, just rules for its development that work most of the time. Finding global rules and architectures that foster the development of the map is a core item on the todo list.
Conclusion
It does sound ridiculous that it could be so simple (disappointing ?), and it took me a long time and thousands of tests to convince myself. This is not something I ‘thought of’. I just found it on my way (like Columbus found America). My only merit is to have looked where nobody would waste their time looking, spent thousands of hours on it on my own time (and that of my family), and not dismissed it as ridiculous when I found it, but worked more to make sense out of it.
There is much more to show and say, and study, and I hope some will get interested, look at it, tinker with it, think and wonder about it, and build on it. I hope distributed wisdom will self-organise around the evidence.
The use society makes of scientific discoveries is not the responsibility of scientists, but it is their duty to state their domain of validity. Mathematicians should clearly state the limits of mathematical models and AI scientists the limits of AI models, and all should stand against their abuse.
AI models include a larger number of input parameters, but it is still humans that select them and humans that set the criteria for their validation; and there will never be enough of them (that butterfly…), especially when it comes to human matters. They belong to the technology domain. Good servants but bad masters.
784 Dimensions to tell digits apart. How many for humans?
Hubris has seized yet another civilisation, this time under the guise of mathematics and its AI variant.
It is, historically, a sign of imminent collapse.
My attention is required elsewhere at this time, and then I will leave. I am a polymath that does not do maths, so I do not belong.
This is my message in a bottle. Let it be.
Dπ