
Comment author: eli_sennesh 30 July 2015 03:38:41PM *  7 points [-]

It appears you are making the problem unnecessarily difficult.

No, not really. In fact, I expect that given the right way of modelling, formal verification of learning systems up to epsilon-delta bounds (in the style of PAC-learning, for instance) should be quite doable. Why? Because, as I mentioned, PAC learning is already the existing foundation for machine learning.
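For concreteness, the kind of epsilon-delta statement I have in mind is just the textbook realizable-case PAC bound for a finite hypothesis class H (nothing exotic or MIRI-specific): with

$m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

i.i.d. training examples, every hypothesis in H that is consistent with the training data has true error at most $\epsilon$, with probability at least $1-\delta$. The hope is that safety-relevant properties of learning systems could eventually be stated and verified in that same "probably approximately correct" style.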

I do agree that this post reflects an "Old Computer Science" worldview, but to be fair, that's not Nate's personal fault, or MIRI's organizational fault. It's the fault of the entire subfield of AGI that still has not bloody learned the basic lessons of statistical machine learning: that real cognition just is probably-approximately-correct statistical modelling.

So as you mention, for instance, there's an immense amount of foundational theory behind modern neural networks. Hell, if I could find the paper showing that deep networks form a "funnel" in the model's free-energy landscape - where local minima are concentrated in that funnel and all yield more-or-less as-good test error, while the global minimum reliably overfits - I'd be posting the link myself.

The problem with deep neural networks is not that they lack theoretical foundations. It's that most of the people going "WOW SO COOL" at deep neural networks can't be bothered to understand the theoretical foundations. The "deep learning cabal" of researchers (out of Toronto, IIRC), and the Switzerland Cabal of Schmidhuber-Hutter-and-Legg fame, all know damn well what they are doing on an analytical level.

(And to cheer for my favorite approach, the probabilistic programming cabal has even more analytical backing, since they can throw Bayesian statistics, traditional machine learning, and programming-languages theory at their problems.)

Sure, it does all require an unusual breadth of background knowledge, but hey, this is how real science proceeds, people: shut up and read the textbooks and literature. Sorry, but if we (as in, this community) go around claiming that important problems can be tackled without background knowledge and an active literature, or with as little of either as the "AGI" field seems to generate, then we are not being instrumentally rational. Period. Shut up and PhD.

Why not test safety long before the system is superintelligent?

Because that requires a way to state and demonstrate safety properties such that safety guarantees obtained with small amounts of resources remain strong when the system gets more resources. More on that below.

This again reflects the old 'hard' computer science worldview, and obsession with exact solutions.

If it seems really really really impossibly hard to solve a problem even with the 'simplification' of lots of computing power, perhaps the underlying assumptions are wrong. For example - perhaps using lots and lots of computing power makes the problem harder instead of easier.

You're not really being fair to Nate here, but let's be charitable to you: this is fundamentally a dispute between the heuristics-and-biases school of thought about cognition and the bounded/resource-rational school of thought.

In the heuristics-and-biases school of thought, the human mind uses heuristics or biases when it believes it doesn't have the computing power on hand to use generally intelligent inference; or sometimes general intelligence is even construed as an emergent computational behavior of an array of heuristics and biases that happened to get thrown together by evolution in the right way. Computationally, this is saying, "When we have enough resources that only asymptotic complexity matters, we use the Old Computer Science way of just running the damn algorithm that implements optimal behavior and optimal asymptotic complexity." Trying to extend this approach into statistical inference gets you basic Bayesianism and AIXI, which appear to have nice "optimality" guarantees, but are computationally intractable and are only optimal up to the training data you give them.

In the bounded-rationality school of thought, computing power is considered a strictly (not asymptotically) finite resource, which must be exploited in an optimal way. I've seen a very nice paper on how thermodynamics actually yields a formal theory for how to do this. Cognition is then analyzed as a set of algorithmic methods for tractably building and evaluating models that deal well with the data. This approach yields increasingly fruitful analyses of such cognitive activities as causal learning, concept learning, and planning in arbitrary environments as probabilistic inference enriched with causal/logical structure.

In terms of LW posts, the former alternative is embodied in Eliezer's Sequences, and the latter in jacob_cannell's post on The Brain as a Universal Learning Machine and my book review of Plato's Camera.

The kinds of steps needed to get both "AI" as such, and "Friendliness" as such, are substantively different in the "possible worlds" where the two different schools of thought apply. Or, perhaps, both are true in certain ways, and what we're really talking about is just two different ways of building minds. Personally, I think the one true distinction is that Calude's work on measuring nonhalting computations gives us a definitive way to deal with the kinds of self-reference scenarios in which Old AGI's "any finite computation" approach generates paradoxes.

But time will tell, and I am not a PhD, so everything I say should be taken with substantial sprinklings of salt. On the other hand, while you shouldn't think for a second that I am one of them, I am certainly on the side of the PhDs.

(Nate: sorry for squabbling on your post. All these sorts of qualms with the research program were things I was going to bring up in person, in a much more constructive way. Still looking forward to meeting you in September!)

Comment author: V_V 01 August 2015 06:54:22AM *  0 points [-]

This approach yields increasingly fruitful analyses of such cognitive activities as causal learning, concept learning, and planning in arbitrary environments as probabilistic inference enriched with causal/logical structure.

It's not obvious to me that the Church programming language and execution model is based on bounded rationality theory.

I mean, the idea of using MCMC to sample the executions of probabilistic programs is certainly neat, and you can trade off bias against computing time by varying the burn-in and sample-lag parameters, but this trade-off is not provably optimal.
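To make that trade-off concrete, here is a minimal sketch of a plain random-walk Metropolis-Hastings sampler with both knobs exposed - a toy stand-in, not Church's actual trace MCMC, and the target density and parameter values are made up purely for illustration:

```
import math
import random

def metropolis_hastings(log_p, x0, n_samples, burn_in=1000, lag=10, step=0.5):
    """Toy random-walk Metropolis-Hastings with explicit burn-in and lag (thinning).

    More burn_in/lag means more compute per returned sample, but less
    initialization bias and less autocorrelation between the kept samples."""
    x = x0
    samples = []
    for i in range(burn_in + n_samples * lag):
        proposal = x + random.gauss(0.0, step)               # symmetric proposal
        if math.log(random.random() + 1e-300) < log_p(proposal) - log_p(x):
            x = proposal                                      # accept
        if i >= burn_in and (i - burn_in) % lag == 0:
            samples.append(x)                                 # keep every lag-th post-burn-in state
    return samples

# Example: standard normal target (log-density up to an additive constant).
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=5.0, n_samples=2000)
print(sum(draws) / len(draws))  # close to 0 when burn_in and lag are adequate
```

Raising burn_in and lag buys lower initialization bias and less autocorrelation at a proportional cost in compute, but nothing about that schedule is provably optimal - which is exactly my point.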

If I understand correctly, provably optimal bounded rationality is marred by unsolved theoretical questions such as the one-way functions conjecture and P != NP. Even assuming that these conjectures are true, the fact that we can't prove them implies that we often can't prove anything interesting about the optimality of many AI algorithms.

In response to comment by V_V on MIRI's Approach
Comment author: jacob_cannell 31 July 2015 04:29:46PM *  0 points [-]

You misunderstood me - the biological switch events I was referring to are synaptic ops, and they are comparable to transistor/gate switch ops in terms of minimum fundamental energy cost in a Landauer analysis.

The amount of computational power required to simulate a human brain in real time is estimated in the petaflops range.

That is a tad too high; the more accurate figure is 10^14 ops/second (10^14 synapses * avg 1 Hz spike rate). The minimal computation required to simulate a single GPU in real time is 10,000 times higher.

Comment author: V_V 01 August 2015 06:29:55AM *  0 points [-]

That is a tad too high; the more accurate figure is 10^14 ops/second (10^14 synapses * avg 1 Hz spike rate).

I've seen various people give estimates on the order of 10^16 flops, arrived at by considering the maximum firing rate of a typical neuron (~10^2 Hz) rather than the average firing rate that you use.

On one hand, a neuron must do some computation whether it fires or not, and a "naive" simulation would necessarily use a cycle frequency on the order of 10^2 Hz or more. On the other hand, if the result of a computation is almost always "do not fire", then as a random variable the result has little information entropy, and this may perhaps be exploited to optimize the computation. I don't have a strong intuition about this.
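For reference, the two figures differ only in the assumed firing rate; the arithmetic is:

$10^{14}\ \text{synapses} \times \sim 1\ \text{Hz (average rate)} \approx 10^{14}\ \text{ops/s}$

$10^{14}\ \text{synapses} \times \sim 10^{2}\ \text{Hz (peak rate)} \approx 10^{16}\ \text{ops/s}$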

The minimal computation required to simulate a single GPU in real time is 10,000 times higher.

On a traditional CPU, perhaps; on another GPU, I don't think so.

In response to comment by V_V on MIRI's Approach
Comment author: Lumifer 31 July 2015 02:28:52PM *  0 points [-]

evolution, being a blind watchmaker, forgot to make sex drive tunable with child mortality

Looking at Japan, that's not self-evident to me :-/

In response to comment by Lumifer on MIRI's Approach
Comment author: V_V 31 July 2015 03:34:11PM -1 points [-]

Google "waifu". No wait, don't. :D

Comment author: jacob_cannell 30 July 2015 05:26:16PM *  3 points [-]

In fact, I expect that given the right way of modelling, formal verification of learning systems up to epsilon-delta bounds (in the style of PAC-learning, for instance) should be quite doable. Why?

Dropping the 'formal verification' part and replacing it with approximate error-bound variance reduction, this is potentially interesting - although it also seems to be a general technique that would, if it worked well, be useful for practical training, safety aside.

Why? Because, as I mentioned, PAC learning is already the existing foundation for machine learning.

Machine learning is an eclectic field with many mostly independent 'foundations' - Bayesian statistics of course, optimization methods (Hessian-free, natural gradient, etc.), geometric methods and NLDR, statistical physics ...

That being said - I'm not very familiar with the PAC learning literature yet - do you have a link to a good intro/summary/review?

Hell, if I could find the paper showing that deep networks form a "funnel" in the model's free-energy landscape - where local minima are concentrated in that funnel and all yield more-or-less as-good test error, while the global minimum reliably overfits - I'd be posting the link myself.

That sounds kind of like the saddle point paper. It's easy to show that in complex networks there are a large number of equivalent minima due to various symmetries and redundancies. Thus finding the actual technical 'global optimum' quickly becomes suboptimal when you discount for resource costs.

If it seems really really really impossibly hard to solve a problem even with the 'simplification' of lots of computing power, perhaps the underlying assumptions are wrong. For example - perhaps using lots and lots of computing power makes the problem harder instead of easier.

You're not really being fair to Nate here, but let's be charitable to you: this is fundamentally a dispute between the heuristics-and-biases school of thought about cognition and the bounded/resource-rational school of thought.

Yes, that is the source of disagreement, but how am I not being fair? I said 'perhaps' - as in, have you considered this? Not 'here is why you are certainly wrong'.

Computationally, this is saying, "When we have enough resources that only asymptotic complexity matters, we use the Old Computer Science way of just running the damn algorithm that implements optimal behavior and optimal asymptotic complexity." Trying to extend this approach into statistical inference gets you basic Bayesianism and AIXI, which appear to have nice "optimality" guarantees, but are computationally intractable and are only optimal up to the training data you give them.

Solomonoff/AIXI and more generally 'full Bayesianism' is useful as a thought model, but is perhaps overvalued on this site compared to in the machine learning field. Compare the number of references/hits to AIXI on this site (tons) to the number on r/MachineLearning (1!). Compare the citation counts of AIXI papers (~100) to those of other ML papers and you will see that the ML community sees AIXI and related work as minor.

The important question is: what does the optimal practical approximation of Solomonoff/Bayesian inference look like? And how different is that from what the brain does? By optimal I of course mean optimal in terms of all that really matters, which is intelligence per unit resources.

Human intelligence - including that of Turing or Einstein - requires only 10 watts of energy and, more surprisingly, only around 10^14 switches/second or less - which is basically miraculous. A modern GPU uses more than 10^18 switches/second. You'd have to go back to a Pentium or something to get down to 10^14 switches per second. Of course the difference is that switch events in an ANN are much more powerful because they are more like memory ops, but still.
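The rough arithmetic behind those two figures (the GPU side is just my ballpark decomposition - on the order of 10^9 transistors switching at on the order of 10^9 Hz):

$\text{brain: } \sim 10^{14}\ \text{synapses} \times \sim 1\ \text{Hz} \approx 10^{14}\ \text{switch events/s, on} \sim 10\ \text{W}$

$\text{GPU: } \sim 10^{9}\ \text{transistors} \times \sim 10^{9}\ \text{Hz} \approx 10^{18}\ \text{switch events/s}$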

It is really really hard to make any sort of case that actual computer tech is going to become significantly more efficient than the brain anytime in the near future (at least in terms of switch events/second). There is a very strong case that all the H&B stuff is just what actual practical intelligence looks like. There is no such thing as intelligence that is not resource efficient - or alternatively, we could say that any useful definition of intelligence must be resource-normalized (i.e. utility/cost).

Comment author: V_V 31 July 2015 08:32:58AM 1 point [-]

Human intelligence - including that of Turing or Einstein - requires only 10 watts of energy and, more surprisingly, only around 10^14 switches/second or less - which is basically miraculous. A modern GPU uses more than 10^18 switches/second.

I don't think that "switches" per second is a relevant metric here. The computation performed by a single neuron in a single firing cycle is much more complex than the computation performed by a logic gate in a single switching cycle.

The amount of computational power required to simulate a human brain in real time is estimated in the petaflops range. Only the largest supercomputers operate in that range, certainly not common GPUs.

Comment author: Squark 31 July 2015 06:22:22AM *  4 points [-]

The concern that ML has no solid theoretical foundations reflects the old computer science worldview, which is all based on finding bit exact solutions to problems within vague asymptotic resource constraints.

It is an error to confuse the "exact / approximate" axis with the "theoretical / empirical" axis. There is plenty of theoretical work in complexity theory on approximation algorithms.

A good ML researcher absolutely needs a good idea of what is going on under the hood - at least at a sufficient level of abstraction.

There is a difference between "having an idea" and "solid theoretical foundations". Chemists before quantum mechanics had lots of ideas. But they didn't have a solid theoretical foundation.

Why not test safety long before the system is superintelligent? - say when it is a population of 100 child like AGIs. As the population grows larger and more intelligent, the safest designs are propagated and made safer.

Because this process is not guaranteed to yield good results. Evolution did the exact same thing to create humans, optimizing for genetic fitness. And humans still went and invented condoms.

So it may actually be easier to drop the traditional computer science approach completely.

When the entire future of mankind is at stake, you don't drop approaches because doing so may be easier. You try every goddamn approach you have (unless "trying" is dangerous in itself, of course).

In response to comment by Squark on MIRI's Approach
Comment author: V_V 31 July 2015 07:22:45AM *  1 point [-]

Evolution did the exact same thing to create humans, optimizing for genetic fitness. And humans still went and invented condoms.

Though humans are the most populous species of large animal on the planet.

Condoms were invented because evolution, being a blind watchmaker, forgot to make sex drive tunable with child mortality, hence humans found a loophole. But whatever function humans are collectively optimizing, it still closely resembles genetic fitness.

Comment author: jacob_cannell 29 July 2015 04:17:35AM *  0 points [-]

Yes, I've read your big universal learner post, and I'm not convinced.

Do you actually believe that evolved modularity is a better explanation of the brain than the ULM hypothesis? Do you have evidence for this belief, or is it simply that which you want to be true? Do you understand why the computational neuroscience and machine learning folks are moving away from the former towards the latter? If you do have evidence, please provide it in a critique in the comments of that post, where I will respond.

First off, you're seriously misrepresenting the success of deep learning as support for your thesis. Deep learning algorithms are extremely powerful, and probably have a role to play in building AGI, but they aren't the end-all, be-all of AI research.

Make some specific predictions for the next 5 years about deep learning or ANNs. Let us see if we actually have significant differences of opinion. If so, I expect to dominate you in any prediction market or bets concerning the near-term future of AI.

Right off the bat, you absolutely can create an AGI that is a pure ANN. In fact the most successful early precursor AGI we have - the atari deepmind agent - is a pure ANN. Your claim that ANNs/Deep Learning are not the be-all and end-all of AGI research is quickly becoming a minority position.

Humans can learn echolocation, but they can't learn echolocation the way bats and dolphins can learn echolocation

What a "no true Scotsman"!

The real test here would be to take a brain and give it an entirely new sense

Done and done. Next!

Notably, the general learner hypothesis does not explain why non-surgically-modified brains are so standardized in structure and functional layout. Something that you yourself bring up in your article.

I discussed this in the comments - it absolutely does explain neurotypical standardization. It's a result of topographic/geometric wiring optimization. There is an exactly optimal location for every piece of functionality, and the brain tends to find those same optimal locations in each human. But if you significantly perturb the input sense or the brain geometry, you can get radically different results.

Consider the case of extreme hydrocephaly - where fluid fills the center of the brain, replacing most of it and squeezing the remainder into a thin surface near the skull. And yet, these patients can have above-average IQs. Optimal dynamic wiring can explain this - the brain is constantly doing global optimization across the wiring structure, adapting to even extreme deformations and damage. How does evolved modularity explain this?

It also obviously has hard-coded specialized modules, to some degree, which is why (for example) all human cultures develop language and music, which isn't something you'd expect if we were all starting from zero.

This is nonsense - language processing develops in general-purpose cortical modules; there is no specific language circuitry.

There is a small amount of innate circuit structure - mainly in the brainstem - which can generate innate algorithms, especially for walking behavior.

The question is which aspect dominates brain performance.

This is rather obvious - it depends on the ratio of pure learning structures (cortex, hippocampus, cerebellum) to innate circuit structures (brainstem, some midbrain, etc.). In humans, 95% or more of the circuitry is general-purpose learning machinery.

What about Watson?

Not an AGI.

Finally, I don't have the background to refute your argument on the efficiency of the brain (although I know clever people who do who disagree with you).

The correct thing to do here is update. Instead you are searching for ways in which you can ignore the evidence.

But, taking it as a given that you're right, it sounds like you're assuming all future AIs will draw the same amount of power as a real brain and fit in the same spatial footprint.

Obviously not - in theory, given a power budget, you can split it up into N AGIs or one big AGI. In practice, due to parallel scaling limitations, there is always some optimal N. Even on a single GPU today, you need N of about 100 or more to get good performance.

You can't just invest all your energy into one big AGI and expect better performance - that is a mind-numbingly naive strategy.

To sum up: yes, I've read your thing. No, it's not as convincing as you seem to believe.

Update, or provide counter evidence, or stop wasting my time.

Comment author: V_V 29 July 2015 11:48:11PM *  -1 points [-]

In fact the most successful early precursor AGI we have - the atari deepmind agent - is a pure ANN.

People have been using ANNs for reinforcement learning tasks since at least the TD-Gammon system, with varying success. The DeepMind Atari agent is bigger and the task is sexier, but calling it an early precursor AGI seems far-fetched.

Consider the case of extreme hydrocephaly - where fluid fills the center of the brain, replacing most of it and squeezing the remainder into a thin surface near the skull. And yet, these patients can have above-average IQs. Optimal dynamic wiring can explain this - the brain is constantly doing global optimization across the wiring structure, adapting to even extreme deformations and damage. How does evolved modularity explain this?

I suppose that the network topology of these brains is essentially normal, isn't it? If that's the case, then all the modules are still there; they are just squeezed against the skull wall.

This is nonsense - language processing develops in general-purpose cortical modules; there is no specific language circuitry.

If I understand correctly, damage to Broca's area or Wernicke's area tends to cause speech impairment.
This may be more or less severe depending on the individual, which is consistent with the evolved modularity hypothesis: genetically different individuals may have small differences in the location and shape of the brain modules.

Under the universal learning machine hypothesis, instead, we would expect speech impairment following localized brain damage to heal quickly in most cases, as other brain areas are recruited to the task. Note that there are large rewards for regaining linguistic ability, hence the brain would sacrifice other abilities if it could. This generally does not happen.

In fact, for most people with completely healthy brains it is difficult to learn a new language as well as a native speaker after the age of 10. This suggests that our language processing machinery is hard-wired to a significant extent.

Comment author: jacob_cannell 29 July 2015 05:24:12AM *  1 point [-]

The DeepMind team's solution to this is the neural Turing machine model, which is a hybrid system between a neural network and a database. It's not a pure ANN.

Yes it is a pure ANN - according to my use of the term ANN (arguing over definitions is a waste of time). ANNs are fully general circuit models, which obviously can re-implement any module from any computer - memory, database, whatever. The defining characteristics of an ANN are - simulated network circuit structure based on analog/real valued nodes, and some universal learning algorithm over the weights - such as SGD.
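To illustrate what I mean by that definition (real-valued nodes plus a universal weight-learning rule), here is a minimal sketch - a toy two-layer network trained on XOR with plain SGD; purely illustrative, nothing to do with DeepMind's actual architectures:

```
import numpy as np

rng = np.random.default_rng(0)

# Tiny "circuit": 2 real-valued inputs -> 8 hidden units -> 1 output.
W1, b1 = rng.normal(0.0, 0.5, (8, 2)), np.zeros(8)
W2, b2 = rng.normal(0.0, 0.5, (1, 8)), np.zeros(1)

def forward(x):
    h = np.tanh(W1 @ x + b1)            # analog-valued hidden activations
    return W2 @ h + b2, h

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

lr = 0.1
for step in range(20000):
    i = rng.integers(len(X))
    x, y = X[i], Y[i]
    y_hat, h = forward(x)
    err = y_hat - y                      # gradient of squared loss w.r.t. output (up to a factor of 2)
    dh = (W2.T @ err) * (1.0 - h ** 2)   # backprop through tanh
    W2 -= lr * np.outer(err, h); b2 -= lr * err
    W1 -= lr * np.outer(dh, x);  b1 -= lr * dh

print([round(float(forward(x)[0][0]), 2) for x in X])  # approaches [0, 1, 1, 0]
```

Scaling this up and swapping in other differentiable modules (memory, attention, etc.) doesn't change the basic picture: real-valued circuitry plus a universal learning rule over the weights.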

Your position is not as mainstream as you like to present it.

You don't understand my position. I don't believe DL as it exists today is somehow the grail of AI. And yes, I'm familiar with Hinton's 'Capsule' proposals. And yes, I agree there is still substantial room for improvement in ANN microarchitecture, and especially for learning invariances - particularly unsupervised.

This is far from the mainstream linguistic perspective.

For any theory of anything the brain does - if it isn't grounded in computational neuroscience data, it is probably wrong - mainstream or not.

No, it really isn't. I don't update based on forum posts on topics I don't understand, because I have no way to distinguish experts from crackpots.

You don't update on forum posts? Really? You seem pretty familiar with MIRI and LW positions. So are you saying that you arrived at those positions all on your own somehow? Then you just showed up here, thankfully finding other people who just happened to have arrived at all the same ideas?

Comment author: V_V 29 July 2015 10:56:16PM *  0 points [-]

Yes it is a pure ANN - according to my use of the term ANN (arguing over definitions is a waste of time). ANNs are fully general circuit models, which obviously can re-implement any module from any computer - memory, database, whatever. The defining characteristics of an ANN are - simulated network circuit structure based on analog/real valued nodes, and some universal learning algorithm over the weights - such as SGD.

You could say that any machine learning system is an ANN, under a sufficiently vague definition. That's not particularly useful in a discussion, however.

Comment author: Wei_Dai 29 July 2015 12:26:59PM *  0 points [-]

From the article I linked to:

Studies indicate the technology, which uses low temperatures in the 4-10 kelvin range to enable information to be transmitted with minimal energy loss, could yield one-petaflop systems that use just 25 kW and 100 petaflop systems that operate at 200 kW, including the cryogenic cooler. Compare this to the current greenest system, the L-CSC supercomputer from the GSI Helmholtz Center, which achieved 5.27 gigaflops-per-watt on the most-recent Green500 list. If scaled linearly to an exaflop supercomputing system, it would consume about 190 megawatts (MW), still quite a bit short of DARPA targets, which range from 20MW to 67MW.

ETA: 100 petaflops per 200 kW equals 500 gigaflops per watt, so it's estimated to be about 100 times more energy efficient.

Comment author: V_V 29 July 2015 03:40:41PM *  0 points [-]

Ok, I guess it depends on how big your computer is, due to the square-cube law. Bigger computers would be at an advantage.

Comment author: Wei_Dai 29 July 2015 08:26:20AM *  0 points [-]

From what I understand, if you chill everything down then you also change resistance in the semiconductor along with all the other properties, so it probably isn't as easy as just replacing the copper wires.

From the sources I've read, there aren't any major issues running CMOS at 77 K; you only run into problems at lower temperatures, less than 40 K. I guess people aren't seriously trying this because it's probably not much harder to go directly to full superconducting computers (i.e., with logic gates made out of superconductors as well) which offers a lot more benefits. Here is an article about a major IARPA project pursuing that. It doesn't seem safe to assume that we'll get AGI before we get superconducting computers. Do you disagree? If so, can you explain why?

Comment author: V_V 29 July 2015 10:06:46AM *  0 points [-]

I guess people aren't seriously trying this because it's probably not much harder to go directly to full superconducting computers (i.e., with logic gates made out of superconductors as well) which offers a lot more benefits

It takes energy to maintain cryogenic temperatures, probably much more than the energy that would be saved by eliminating wire resistance. If I understand correctly, the interest in superconducting circuits is mostly in using them to implement quantum computation.
Barring room-temperature superconductors, there are probably no benefits to using superconducting circuits for classical computation.

Comment author: turchin 29 July 2015 12:43:30AM 0 points [-]

Just imagine a Stuxnet-style computer virus which will find DNA synthesisers and print different viruses on each of them, calculating the exact DNA mutations for hundreds of different flu strains.

Comment author: V_V 29 July 2015 09:57:23AM 0 points [-]

You can't manufacture new flu strains just by hacking a DNA synthesizer. And anyway, most non-intelligently created flu strains would be non-viable or non-lethal.
