A research team in China has created a system for answering verbal analogy questions of the type found on the GRE and IQ tests that scores a little above the average human score, perhaps corresponding to an IQ of around 105 or so.  This improves substantially on the reported SOTA in AI for these types of problems.

This work builds on deep word-vector embeddings, which have led to large gains in translation and many other NLP tasks.  One of their key improvements involves learning multiple vectors per word, where the number of distinct word meanings is taken directly from a dictionary.  This matters because verbal analogy questions often turn on rarer word senses.  They also employ modules specialized for the different types of questions.
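The paper's actual model and training details differ, but the core idea can be sketched in a few lines. Below is a toy illustration (the two-dimensional sense vectors and the sense inventory are made up, not from the paper) of combining multiple vectors per word with the usual b − a + c analogy arithmetic, searching over sense combinations and keeping the best-scoring one:

```python
# Toy sketch: analogy solving with multiple vectors per word.
# All vectors and senses below are invented for illustration;
# the paper learns them from a corpus plus a dictionary.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Each word maps to one vector per dictionary sense.
senses = {
    "bank":  [("river edge", (0.9, 0.1)), ("finance", (0.1, 0.9))],
    "shore": [("land by water", (0.85, 0.15))],
    "vault": [("safe room", (0.15, 0.85))],
    "money": [("currency", (0.05, 0.95))],
}

def best_analogy(a, b, c, candidates):
    # For a:b :: c:?, try every sense combination and return the
    # candidate whose best sense maximizes cos(vb - va + vc, vd).
    best, best_score = None, -2.0
    for _, va in senses[a]:
        for _, vb in senses[b]:
            for _, vc in senses[c]:
                target = tuple(x - y + z for x, y, z in zip(vb, va, vc))
                for d in candidates:
                    for _, vd in senses[d]:
                        score = cosine(target, vd)
                        if score > best_score:
                            best, best_score = d, score
    return best

# With a single vector per word, "bank" would blur both senses;
# searching over senses lets the financial reading win here.
print(best_analogy("bank", "shore", "bank", ["vault", "money"]))  # → money
```

The sense search is what a single-vector model cannot do: averaging "river bank" and "financial bank" into one point loses exactly the rarer readings these questions exploit.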

I vaguely remember reading that AI systems are already fairly strong at solving visual Raven's-matrix-style IQ questions, although I haven't looked into that in detail.

The multi-vector technique is probably the most important takeaway for future work.

Even if subsequent follow-up work reaches superhuman verbal IQ in a few years, this of course doesn't immediately imply AGI.  These types of IQ tests measure specific abilities which are correlated with general intelligence in humans, but these specific abilities are only a small subset of the systems/abilities required for general intelligence, and probably rely on a smallish subset of the brain's circuitry.

[anonymous]:

I'm assuming that the math section of these tests was already figured out long ago, and is currently pretty damn close to superhuman levels. Can somebody more familiar with these tests suggest which sorts of intelligence tests AI researchers might work toward mastering in the future to get even closer to human-like intelligence?

DTX:

Just to pimp my school, Georgia Tech offers a free course through Udacity in Knowledge-Based AI that involves programming an agent to take the Raven's Progressive Matrices test. I never took the course, but I wanna say from hearing other students that somewhere around 80 is the current state of the art (that's not an IQ, and I'm not sure how to translate a Raven's score to an IQ).

I'm not sure I'm clear on the AI/AGI distinction. Wouldn't an AI need to be able to apply its intelligence to novel situations to be "intelligent" at all, therefore making its intelligence "general" by definition? Watson winning Jeopardy! was a testament to software engineering, but Watson was programmed specifically to play Jeopardy!. If, without modification, it could go on to dominate Settlers of Catan, then we might want to start worrying.

I guess it's natural that IQ tests would be chosen. They are objective and feature a logic a computer can, at least theoretically, recreate or approximate convincingly. Plus a lot of people conflate IQ with intelligence, which helps on the marketing side. (Aside: if there is one place the mind excels, it's getting more out than it started with--like miraculously remembering something otherwise forgotten (in some cases seemingly never learned) at just the right moment. Word-vector embeddings and other fancy relational strategies seem to need way more going in--data-wise--than they chuck back out, making them crude and brute-force by comparison.)

"These types of IQ tests measure specific abilities which are correlated with general intelligence in humans, but these specific abilities are only a small subset of the systems/abilities required for general intelligence, and probably rely on a smallish subset of the brain's circuitry."

Sounds like you're tinting your statement with a society-of-mindish perspective...would you say that's fair? There are even stronger reasons to question the importance of such results. Rather than exercising some specific dedicated IQ-test circuitry in the brain, it could just as well be that, in humans, IQ scores reflect how well-tuned your cognitive machinery is by some general measure much like an athlete's hundred-meter sprint time reflects his overall fitness as well as white muscle development in his legs. He would not achieve a low sprint time were it not for the proper functioning of his very complex, variegated biology. That one can build a simple robot that achieves faster sprint times says very little about that robot's potential for gymnastics or martial arts. For a robot to perform well on the sprint time just does not require anything like the solution to everything that evolution had to solve for human athletes to perform as well as they do.

Arithmetic ability would be another example of a metric which is nowhere near "agi-complete" since it's solvable by a relatively straightforward procedure.

In that case the solution of special "sub-tasks" of intelligence such as IQ puzzles would seem to me pretty uninformative. I'm reminded of the difference between Harry Foundalis' Phaeaco, which learns how to represent visual structures independently, vs. earlier work on Bongard problems which 'cheated' by pre-encoding the images as logical objects.

I suppose my view on this is somewhat an inversion of what I see as the norm: others dismiss IQ tests as meaningless for human beings but find them significant metrics for machine intelligence.

I don't particularly endorse a society-of-mindish perspective (at least I don't think so - I'm only vaguely familiar with the term in relation to something Minsky wrote).

I mostly agree with your general points above.

Rather than exercising some specific dedicated IQ-test circuitry in the brain, it could just as well be that, in humans, IQ scores reflect how well-tuned your cognitive machinery is by some general measure much like an athlete's hundred-meter sprint time reflects his overall fitness as well as white muscle development in his legs.

Yes, that seems pretty plausible. However, another related explanation is that IQ in humans relates to several key high-level tradeoffs in the space of niches in a tribe/economy. One such tradeoff is the neoteny tradeoff - how much to delay learning and development. In general you can achieve higher brain task performance (general intelligence) by delaying learning/development to get more training data (life experience), at the obvious expense of missing out on earlier mating opportunities. High-IQ humans of the type common on this site probably result from the combination of delayed development and high innate curiosity as a basic drive (traits which combine together well). In this model, medium/average IQ corresponds to a genetic strategy favoring earlier maturation to quickly attain social status and mating opportunities.

I suppose my view on this is somewhat an inversion of what I see as the norm: others dismiss IQ tests as meaningless for human beings but find them significant metrics for machine intelligence.

Really - do you mean norm for society in general or norm for LW? I agree that IQ tests are meaningful for humans but less so for AI/AGI.

However - I also do believe that this particular type of test measures something of value for AI, and this research does represent some amount of real progress (assuming the results are genuine and will be replicated). There are however probably better and more challenging types of QA tasks that more specifically test abilities important/hard for AGI that are easy for humans.

Really - do you mean norm for society in general or norm for LW?

The general norm for each position separately.

I also do believe that this particular type of test measures something of value for AI

Insofar as they showcase generally applicable methods, I would agree. Their use of deep learning seems encouraging, though I cannot tell from the abstract how domain-specific their methods are, and thus to what extent similar techniques could figure into an architecture for general intelligence. If the techniques used don't robustly generalise, then you'd have to tailor the approach to whatever particular domain you're working in. Thus the society of mind remark - Minsky's thesis as I understand it is that the mind is a kludge of tailor-made components that perform nicely in their domain but are basically useless outside of it (which seems to me incompatible with the phenomenon of neuroplasticity). Anybody advocating novel, domain-specific tailoring of general algorithms is then adhering to Minsky's approach.

To take seriously the idea that some system represents a concrete step towards general intelligence, I'd have to see its performance on a battery of "agi-hard" metrics. I can't give a precise definition of what such might be, but IQ subtests that drastically restrict the scope of NLP techniques needed seem obviously not to qualify.

A much more compelling performance would be the ability for a system to, say, read a textbook on topology and then pass an exam paper on the subject, with neither having been pre-formatted into a convenient representation.

Thus the society of mind remark - Minsky's thesis as I understand it is that the mind is a kludge of tailor-made components that perform nicely in their domain but are basically useless outside of it (which seems to me incompatible with the phenomenon of neuroplasticity).

In a complex ANN or a brain, you start with a really simple hierarchical prior over the network and a general purpose optimizer. After training you may get a 'kludge of tailor-made components' that perform really well on the domain you trained on. The result may be specific, but the process is very general.
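As a minimal illustration of that point (my own sketch, not the commenter's code): the fitting loop below is completely generic - it knows nothing about either task - yet the parameters it produces are specialized to whichever data it was trained on. The specific learning rate and step count are arbitrary choices for this toy problem:

```python
# One generic optimizer, two training sets, two specialized results.
def sgd_fit(data, lr=0.1, steps=2000):
    # Fit y = w*x + b by plain per-sample gradient descent on squared error.
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# The same general process yields different "tailor-made components":
doubler = sgd_fit([(1, 2), (2, 4), (3, 6)])  # converges toward y = 2x
shifter = sgd_fit([(1, 2), (2, 3), (3, 4)])  # converges toward y = x + 1
print(doubler, shifter)
```

The trained parameters are useless outside their task, but nothing task-specific was built into the optimizer - which is the distinction being drawn between a specific result and a general process.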

A much more compelling performance would be the ability for a system to, say, read a textbook on topology and then pass an exam paper on the subject,

Yes, but that probably requires a large number of precursor capabilities AI systems do not yet possess.

I generally agree that a proper "agi-hard" metric will include a large battery of tests to get coverage over a wide range of abilities. We actually already have a good deal of experience on how to train AGIs and how to come up with good test metrics - in the field of education.

However, you could view the various AI benchmarks in aggregate as an AGI test battery - each test measures only a narrow ability, but combine enough of those tests and you have something more general. The recent development of textual QA benchmarks is another step in that progression. Game environment tests such as Atari provide another, orthogonal way to measure AGI progress.

Just to be clear: what I meant by "domain specific methods" in this context is auxiliary techniques that boost the performance of the general "component synthesis procedure" (such as an ANN) within a specific domain. It seems that if you want a truly general system, even one that works by producing hairy purpose specific components, then such auxiliary techniques cannot be used (unless synthesized by the agent itself). You can push this requirement to absurdity in practice, so I'm only stressing that it should be capable of tractably inventing its own auxiliary procedures in principle even if it didn't actually invent all the ones it uses. On the whole however I pretty much concur.