All of D𝜋's Comments + Replies

Welcome onboard this IT ship to boldly go where no one has gone before!

Indeed, I just wrote 'when it spikes' and, further on, the 'low threshold', and no more. I work in complete isolation, and some things are so obvious inside my brain that I do not consider that they may not be obvious to others.

It is part of the 'when' aspect of learning, but uses an internal state of the neuron instead of external information from the quantilisers.

If there is little reaction to a sample in a neuron (spiking happens slowly, or not at all), it is meaningless and you should ignore it.... (read more)

I am going to answer this comment because it is the first to address the analysis section. Thank you.

I close the paragraph saying that there are no functions anywhere and that this will aggrieve some. The shift I am trying to suggest is for those who want to analyse the system using mathematics, and could be dismayed by the absence of functions to work with.

Distributions can be a place to start. The quantilisers are a place to restart mathematical analysis. I gave some links to an existing field of mathematical research that is working along those lines.

Check this out: they are looking for a multi-dimensional extension to the concept. Here it is, I suggest.

1Nicholas / Heather Kross
Thank you!

This introduces a new paradigm. Read T.Kuhn. You cannot compare different paradigms.

Everything that matters is in the post. Read it; really.

What is needed next is engineering, ingenuity and suitable ICs, not maths. The IT revolution came from IT (coders) and ICs, not CS.

As for your recommendation, I have tried so many things over the past four years… I posted here first to get to the source of one of the pieces of evidence; to no avail.

Good bye everyone

I am available through private messages

BP is Back-Propagation.

We are completely missing the plot here. 

I had to use a dataset for my explorations and MNIST was simple; and I used PI-MNIST to show an 'impressive' result so that people have to look at it. I expected the 'PI' to be understood, and it is not. Note that I could readily answer the 'F-MNIST challenge'.

If I had just expressed an opinion on how to go about AI, the way I did in the roadmap, it would have been, rightly, ignored. The point was to show it is not 'ridiculous' and that the system fits with that roadmap.

I see that your la... (read more)

Also,

No regularisation. I wrote about that in the analysis.

Without max-norm (or maxout, ladder, VAT: all forms of regularisation), BP/SGD only achieves 98.75% (from the dropout -2014- paper).

Regularisation must come from outside the system (SO can be seen that way) or through local interactions (neighbours). Many papers clearly suggest that this should improve the result.

That is yet to be done.

1philip_b
What is BP in BP/SGD?

So, as I see it, there are three possible different fairness criteria which define what we can compare your model with.

1. Virtually anything goes: convolutions, CNNs, pretraining on ImageNet, ...

2. Permutation-invariant models are allowed, everything else is disallowed. For instance, MLPs are ok, CNNs are forbidden, tensor decompositions are forbidden, SVMs are ok as long as the transformations used are permutation-invariant. Pre-processing is allowed as long as it's permutation-invariant.

3. The restriction from criterion 2 is enabled. Also, the model must be biologically plausible, or, shall we say, similar to the brain. Or maybe similar to how a potential brain of another creature might be? Not sure. This rules out SGD, regularization that uses norms of vectors, etc. Strengthening neuron connections based on something that happens locally is allowed.

Personally, I know basically nothing about the landscape of models satisfying criterion 3.

... and it is in this description:

"The spiking network can adjust the weights of the active connections"

It is not a toolbox you will be using tomorrow.

I applied it to F-MNIST, in a couple of hours after being challenged, to show that it is not only MNIST. I will not do it again; that is not the point.

It is a completely different approach to AGI, that sounds so ridiculous that I had to demonstrate that it is not, by getting near SOTA on one widely used dataset (so PI-MNIST) and finding relevant mathematical evidence.

I am going after pure BP/SGD, so neural networks (no SVM), no convolution,...

No pre-processing either. That is changing the dataset.

It is just a POC, to make a point: you do not need mathematics for AGI. Our brain does not.

I will publish a follow-up post soon.


I doubt that this would be the best an MLP can achieve on F-MNIST.

I will put it this way: SONNs and MLPs do the same thing, in a different way. Therefore they should achieve the same accuracy. If this SONN can get near 90%, so should MLPs. 

It is likely that nobody has bothered to try 'without convolutions' because it is so old-fashioned.

Convolutions are for repeated locally aggregated correlations.

Spot on.

I hope your explanation will be better understood than mine. Thank you.

It 'so happens' that MNIST (but not PI) can also be used for basic geometry. That is why I selected it for my exploration (easy switch between the two modes).

no convolution.

You are comparing pears and apples.

I have shared the base because it has real scientific (and philosophical) value.

Geometry and the other parts are separate, and of lesser scientific value. They are more technology.

3ZankerH
Your result is virtually identical to the first-ranking unambiguously permutation-invariant method (MLP 256-128-100). HOG+SVM does even better, but it's unclear to me whether that meets your criteria. Could you be more precise about what kinds of algorithms you consider it fair to compare against, and why?

That is correct.

I am referring to that paper as a vindication of the concept, but I do not use the quantiliser algorithm provided.

The one I use I devised on my own, a long time ago, with the thought experiment described, but it has since been mathematically studied. Actually, when I searched for it and found it the first time, it was in a much simpler version, but I cannot find that one again now... 

I have not been down to every detail of lsusr's rewrite yet, just the main corrections to the description of the mechanism. I had to do F-MNIST first.

Side... (read more)

Please do, and thank you for trying.

That is exactly what I am trying to elicit.

If you have any question, I am available to help (through private messages).

I do not know Python (I am extremely comfortable with C, I get full speed, and I have neither the time nor the need), but it seems the ML community does.

Update: 3 runs (2 random), 10 million steps. All three over 88.33 (average over steps 9.5-10.5 million across the 3 runs: 88.43). New SOTA? Please check and update.

Update 2: 89.85 at step 50 million with QuantUpP = 3.2 and quantUpN = 39. It does perform very well. I will leave it at that. As said in my post, those are the two important parameters (no, it is not a universal super-intelligence in 600 lines of code). Be rational, and think about what the fact that this mechanism works so well means (I am talking to everybody, there).

I looked at it, the informed way.

It gets ove... (read more)

3Maxime Riché
Here with 2 conv and less than 100k parameters the accuracy is ~92%. https://github.com/zalandoresearch/fashion-mnist SOTA on Fashion-MNIST is >96%. https://paperswithcode.com/sota/image-classification-on-fashion-mnist

See my comment on reversing the shades on F-MNIST. I will check it later but I see it gets up to 48% in the 'wrong' order and that is surprisingly good. I worked on CIFAR, but that is another story. As-is it gives bad results and you have to add other 'things'.

As you guessed, I belong to the neuro-inspired branch and most of my 'giants' belong there. I strongly expected, when I started my investigations, to use some of the works that I knew and appreciated along the lines you are mentioning, and I investigated some of them early on.

To my surprise, I did not ne... (read more)

That actually brings us to the core of it.

The way I phrased that was, deliberately, ambiguous.

Since 1958, the question the field has been trying to answer is how to transfer the information we get when a sample is presented, to the weights, so next time it will perform better.

BP computes the difference between what would be expected and what is measured, and then propagates it to all intermediary weights according to a set of mathematically derived rules (the generalised delta rule). A lot of work has gone into figuring out the best way to do that. This is w... (read more)

'those seen before' are values of Δ⊥i across all samples seen before, not within a sample.

It is not, per se, Hebb's rule. Hebb's rule is very general. I personally see this as belonging to it, that's all. I give attribution where I think it is deserved.


I just discovered the 'ping back' on LessWrong...

I gave your description a first read. Most of it is correct. I will check it in more detail.

I used the terms 'total' and 'groups' to make things simpler, but yours are better.

Four corrections:

1.

The potential of a neuron can be negative. It is the pure sum of all weights, positive and negative. There is no 'negative spiking' (it is one of the huge number of things I tried that did not bring any benefit). I think I remember trying to set a bottom limit at 0 (no negative potential) and that, as always, ... (read more)

Thank you for the congrats, it helps.

Note, that I only claim to reach SOTA, not to beat it.

It would be preposterous to convince anybody with this limited evidence. The goal is to raise interest so some will spend some time to look deeper into it. Most will not, of course, for many reasons, and yours is a valid one.

The advantage of this one is its simplicity. At this point any coder can take it up and build on it. This has to be turned into a new type of construction set. I would like this to give the 15-year-olds of today the pleasure that my first computer (machine language) gave me, and Legos before that.

You got the last bit correct. That is what self-organisation provides: ad-hoc selection.

See my answer to mlem_mlem_mlem for the second part of your comment.

You are bringing another interesting point: scaling up and tuning.

As I indicated in the roadmap, nature has chosen width over depth.

The cortical sheet is described as a 6-layer structure, but only 3 are neurons and 2 pyramidal neurons. That is not deep. Then we see columns, functional 'zones', 'regions'... There is an organisation, but it is not very deep. The number of columns in each 'zone' is very large. Also note that the neuron is deemed 'stochastic', so precision is... (read more)

You are comparing step and ladder (I had to seize on it!).

If you look at Table 2 in your last reference, you will see that they, carefully, show results improving as steps are added. Ladder is just another step (an optimisation one). There is a reason why researchers use PI-MNIST: it is to reduce the size of the ladder to make comparisons clearer.

What I am trying to bring here is a new first step.

I could have tried a 784-25-10 BP/SGD network (784*25 = 19600 parameters) to compare with this system with 196 neurons and 10 connections. I have managed ... (read more)
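The 784*25 = 19600 figure above counts the input-to-hidden weights of the 784-25-10 network. As a minimal sketch in C of that arithmetic, assuming plain fully connected layers: the function name `mlp_param_count` and the `with_bias` switch are hypothetical and are not taken from the published code.

```c
#include <assert.h>
#include <stddef.h>

/* Number of parameters in a fully connected network.
 * layers[i] is the unit count of layer i (input first, output last).
 * Biases are counted only when with_bias is non-zero. */
unsigned long mlp_param_count(const unsigned long *layers, size_t n_layers,
                              int with_bias)
{
    assert(layers != NULL && n_layers >= 2);
    unsigned long total = 0;
    for (size_t i = 0; i + 1 < n_layers; i++) {
        total += layers[i] * layers[i + 1];  /* weights from layer i to i+1 */
        if (with_bias)
            total += layers[i + 1];          /* one bias per receiving unit */
    }
    return total;
}
```

Called with the two layers `{784, 25}` and no biases it returns the 19600 quoted above; with the full `{784, 25, 10}` it also counts the 250 hidden-to-output weights.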

I wrote a comment on that but this is a better place for it.

I changed the update value from 1000 to 500 for that network size (in the code).

1000 is for the large network (98.9%). At size 792 (for 98.6%) it is too much, and the accuracy goes down after reaching the top. I did not take the time to check properly before publishing. My fault.

If you check it out now, it will get to >98.6% and stay there (tested up to 10 million steps, three times, random).

Update:

I changed the adjustment values for the 98.65% version to 500/-500 (following lsusr's comments).

1000/-1000 is good for the larger network (98.9%), but too much for the smaller ones. It makes the accuracy reduce after it has reached the peak value.

I was too fast publishing and did not go through all the required verifications. My fault.

I am running a series of tests to confirm. The first two are in spec and stable at 10 million steps.

Larger values speed up the convergence, and I was trying to make it as fast as possible, to not waste the time of those who would spend it verifying. Sorry about that.

No, there isn't, but it is interesting.

I gave it a quick look. It seems to be closer to this (this is closer to the point)

I was heavily influenced, back in the 70s, by the works of Mandelbrot and the chaos theory that developed at the time, and has gone nowhere.

The concept of self-organisation has been around for a long time but it is hard to study from the mathematical point of view, and, probably for that reason, it has never 'picked up'.

So, of course, there are similarities, and, please, go back to all of those old papers and re-think it all. 

You w... (read more)

The only function that is implemented in CUDA is the test one (test_gpu).

It is also implemented for CPU (test_mt_one), identically.

What matters is all clearly (I hope) explained in the text. It is simple enough that its reach is not limited to ML researchers and clearly within that of a lot of coders. The IT revolution started when amateurs got PCs.

In this version of the code, I had to make a tradeoff between completeness, usability and practicality. Write your own code, it does not matter. It is the concept that does.

The (upcoming) website will give separate, readable, versions. I am waiting to get a proper idea of what is demanded before I do that, so thank you for that input.

I have, deliberately, taken away everything relating to geometry from this presentation. 

It took 12 years (1986-1998), and how much research effort, to go from BP/SGD to convolutions.

This is a one-man effort, on my own personal time (20,000 hours over the past 6 years), that I am giving away for the community to freely take over. I am really sorry if it is not enough. Their choice.

It is not an add-on to something that exists but a complete restart. One thing at a time.

As for CUDA, if you have a lot of threads, it is bearable, and you can use old, ch... (read more)

4Lech Mazur
I didn't mean "CUDA kernels" as in requiring NVIDIA GPUs - that's fine. I meant that you're limiting the readability of your code to a subset of people who understand both ML and CUDA programming. In my experience, this limits the reach, especially among younger researchers (I've hired C++ programmers and ML researchers for my business). But, of course, you can choose to promote (or not) your work however you prefer.

That is a question of low philosophical value, but of the highest practical importance.

At line 3,000,000 with the 98.9% setup, in the full log, there are these two pieces of information:

'sp: 1640  734  548', and 'nbupd: 7940.54  6894.20' (and the test accuracy is exactly 98.9%)

It means that the average spike count for the IS-group is 1640 per sample, 734 for the highest ISNOT-group and 548 for the other ones. The number of weight updates per IS-learn is 7940.54 and 6894.20 per ISNOT-learn. With the coding scheme used, the average number of inputs per ... (read more)
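For anyone who wants to pull those figures out of the log programmatically, here is a minimal sketch in C. It assumes that the 'sp:' and 'nbupd:' fields appear on the same log line, in exactly the form quoted above; the struct, its field names and `parse_log_line` are hypothetical, and the real log may carry other fields that this ignores.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the two log fields quoted above. */
struct log_stats {
    int sp_is;            /* average spike count, IS-group */
    int sp_isnot_top;     /* average spike count, highest ISNOT-group */
    int sp_isnot_other;   /* average spike count, other ISNOT-groups */
    double upd_is;        /* weight updates per IS-learn */
    double upd_isnot;     /* weight updates per ISNOT-learn */
};

/* Returns 1 if both fields were found and parsed, 0 otherwise. */
int parse_log_line(const char *line, struct log_stats *out)
{
    assert(line != NULL && out != NULL);
    const char *sp = strstr(line, "sp:");
    const char *nb = strstr(line, "nbupd:");
    if (sp == NULL || nb == NULL)
        return 0;
    if (sscanf(sp, "sp: %d %d %d",
               &out->sp_is, &out->sp_isnot_top, &out->sp_isnot_other) != 3)
        return 0;
    if (sscanf(nb, "nbupd: %lf %lf", &out->upd_is, &out->upd_isnot) != 2)
        return 0;
    return 1;
}
```

If the two fields turn out to be on separate lines in the actual log, the same `sscanf` formats can be applied to each line on its own.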

I think I have expressed my views on the matter of responsibility quite clearly in the conclusion.

I just checked Yudkowsky on Google. He founded this website, so good.

Here is not the place to argue my views on super-intelligence, but I clearly side with Russell and Norvig. Life is just too complex; luckily.

As for safety, the title of Jessica Taylor's article is:

"Quantilizers: A Safer Alternative to Maximizers for Limited Optimization".

I will just be glad to have proved that alternative to be effective.

I am not sure I understand your question (sorry, I do not know what Yudkowsky's DMs are).

I basically disclosed, to all, that the way we all think we think, does work.

What kind of responsibility could that bear ?

6Randomized, Controlled
Sorry, I was being a bit flip/insider-y. Probably inappropriately so. I'm curious how much you've engaged with the AI Safety literature/arguments? "Yudkowsky's DM" --> Eliezer Yudkowsky's [Twitter] Direct Messages.

We never stop learning.

To kill the program, shoot 'Ctrl+C'.

Seriously, this system works 'online'. I gave the example of the kids and the Dutch to illustrate that in nature, things change around us and we adjust what we know to the new conditions. A learning process should not have a stopping criterion.

The system converges, on PI-MNIST, at 3-5 million steps. To compare, recent research papers stop at 1 million, but keep in mind that we only update about 2 out of 10 groups each time, so it is equivalent.

So you can use  "for( ; b<5000000 ; b++ )" inste... (read more)

It is PI-MNIST.

Permutation Invariant. To keep it simple, you cannot use convolutions. It is all explained in the text.

Real SOTA on that version is 99.04% (Maxout), but that is with 65 million+ parameters. I do not have the hardware (or time).

I stopped at 98.9% with 750,000 connections (integers and additions) and this is close to what BP/SGD (table 2) gets with 3 hidden layers of 1024 units each, for a total of >3,000,000 parameters (floating-point numbers and multiplications) with max-norm and ReLU.

For a similar accuracy, the number of 'parameters' is almost a... (read more)

5mlem_mlem_mlem
PI MNIST is up to at least 99.43% with Ladder Networks https://arxiv.org/abs/1507.02672. I think I vaguely remember some ~99.5% published since (it's been 6 years) but I haven't done the lit tree crawling to find it currently. Another example of a higher performing result than Maxout is Virtual Adversarial Training at 99.36% https://arxiv.org/abs/1704.03976. The JMLR version of dropout https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf also has a 99.21% with dropout finetuning of a Deep Boltzmann Machine.

Happy new year.

I have just posted on LessWrong the result of my work on AI.

Your work on quantilisers is the core mathematical evidence of what I propose (the code is another).

I would really appreciate your opinion on it.

Kind regards

The link I typed in and appears when hovering over is, indeed, 'http://yann.lecun.com/exdb/mnist/', and it works on my machine... Thanks for the additional link.