Some of you may already have seen this story, since it's several days old, but MIT Technology Review seems to have the best explanation of what happened: Why and How Baidu Cheated an Artificial Intelligence Test

Such is the success of deep learning on this particular test that even a small advantage could make a difference. Baidu had reported it achieved an error rate of only 4.58 percent, beating the previous best of 4.82 percent, reported by Google in March. In fact, some experts have noted that the small margins of victory in the race to get better on this particular test make it increasingly meaningless. That Baidu and others continue to trumpet their results all the same - and may even be willing to break the rules - suggests that being the best at machine learning matters to them very much indeed.

(In case you didn't know, Baidu is the largest search engine in China, with a market cap of $72B, compared to Google's $370B.)

The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest. Researchers are incentivized to use whatever method they can find or invent to gain a few tenths of a percent in some contest, which allows them to claim progress at an AI task and publish a paper. Even as the AI safety / control / Friendliness field gets more attention and funding, it seems easy to foresee a future where mainstream AI researchers continue to ignore such work because it does not contribute to the tenths of a percent that they are seeking but instead can only hinder their efforts. What can be done to change this?


I thought this story was kinda scary... looks like some kind of deep learning arms race might already be starting among SV companies. If it's true that deep learning experts can expect to make seven-figure salaries, I assume that lots of budding computer scientists are going to start studying up on it :(

Take a look at this image.

Stuart Russell said recently: "The commercial investment in AI the last five years has exceeded the entire worldwide government investment in AI research since its beginnings in the 1950s."

Take a look at this image.

Is 'pure AI startups' the relevant category here?

Stuart Russell said recently: "The commercial investment in AI the last five years has exceeded the entire worldwide government investment in AI research since its beginnings in the 1950s."

That would be astonishing if true, but I have to be doubtful. Does Russell provide sources for this? The last 5 years is not much time to outspend the cumulative total of MIT AI Lab, the Fifth Generation Project, a fair number of DARPA's projects, the AI Winter, and all AI groups for half a century.

It's not pure AI startups. But pure AI startups are a subset of them and have probably grown along the same trend.

That would be astonishing if true, but I have to be doubtful.

All the big tech companies have created big machine learning teams and hired tons of researchers. Google, Facebook, Baidu, IBM, and I think Apple. And beyond that there are a ton of smaller companies and startups.

Peter Norvig commented that "his company [Google] already employed 'less than 50 percent but certainly more than 5 percent' of the world’s leading experts in machine learning".

But pure AI startups are a subset of them and have probably grown along the same trend.

Self-assigned labeling and marketing info seems highly doubtful, given things like the AI Winter and the spread of AI techniques outside of whatever is deemed 'AI' at that moment in time. (Consider Symbolics selling Lisp machines: a hardware platform whose selling points were GC support, large amounts of RAM, high-end color displays, and a rich interpreted software ecosystem with excellent hypertext documentation and networking support, all in a compact physical package, sold for developers and catering to areas like oil field exploration. If a Lisp machine were sold now, would we call its manufacturer an AI company? Or would we call it a Chromebook?) And now that AI is hot, every company which uses some random machine-learning technique like random forests is tempted to brand itself as an AI startup. Trends show as much what is trendy as anything.

All the big tech companies have created big machine learning teams and hired tons of researchers...Peter Norvig commented that "his company [Google] already employed 'less than 50 percent but certainly more than 5 percent' of the world’s leading experts in machine learning".

I am sure they have, but that's very different from the claim being made. If this year you hire, say, a full 50% of a field's researchers, that's still not spending more than that field's cumulative expenses for half a century; that's spending, well, half its expenses that year.

Just FYI to readers: the source of the first image is here.

The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest.

Yup, two big chapters of my book are about how terrible the evaluation systems of mainstream CV and NLP are. Instead of image classification (or whatever), researchers should write programs to do lossless compression of large image databases. This metric is absolutely ungameable, and also more meaningful.
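For concreteness, here is a minimal sketch of how such a benchmark could be scored (my illustration, not an existing contest; zlib stands in for whatever learned model plus entropy coder a researcher would actually submit). Every method is judged by the same number: the bits it needs to reproduce the held-out images exactly.

```python
# Sketch only: score a compressor by the total bits needed for exact reconstruction.
import zlib
import numpy as np

def lossless_bits(images: np.ndarray) -> int:
    """Bits a baseline codec (zlib here) needs to encode the images losslessly."""
    return 8 * len(zlib.compress(images.astype(np.uint8).tobytes(), 9))

# Hypothetical held-out batch; a real benchmark would use a large image database.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
print(lossless_bits(batch) / batch.size, "bits per pixel channel")
```

Because decoding must reproduce every pixel, there is no labeling convention or test-set quirk to overfit: a lower bit count can only come from a better model of the data.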

That may be true in general, but LSVRC is much better about it. It's run like a Kaggle competition: there is a secret test set that no one can look at to train their algorithms on, and the number of evaluations you can run against it is limited, which is the rule that was broken here. I also believe that the public test set is different from the private one, which is only used at the end of the competition, and no one can see how well they are doing on that.

Doing compression is not the goal of computer vision. Compression is only the goal of (some forms of) unsupervised learning, which has fallen out of favor in the last few years. Karpathy discusses some of the issues with it here:

I couldn't see how Unsupervised Learning based solely on images could work. To an unsupervised algorithm, a patch of pixels with a face on it is exactly as exciting as a patch that contains some weird edge/corner/grass/tree noise stuff. The algorithm shouldn't worry about the latter but it should spend extra effort worrying about the former. But you would never know this if all you had was a billion patches! It all comes down to this question: if all you have are pixels and nothing else, what distinguishes images of a face, or objects from a random bush, or a corner in the ceilings of a room?...

I struggled with this question for a long time and the ironic answer I'm slowly converging on is: nothing. In absence of labels, there is no difference. So unless we want our algorithms to develop powerful features for faces (and things we care about a lot) alongside powerful features for a sea of background garbage, we may have to pay in labels.

Doing compression is not the goal of computer vision.

It really is isomorphic to the generally proclaimed definition of computer vision as the inverse problem of computer graphics. Graphics starts with an abstract scene description and applies a transformation to obtain an image; vision attempts to back-infer the scene description from the raw image pixels. This process can be interpreted as a form of image compression, because the scene description is a far more parsimonious description of the image than the raw pixels. Read section 3.4.1 of my book for more details (the equivalent interpretation of vision-as-Bayesian-inference may also be of interest to some).
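In code-length terms (my gloss of the standard two-part/MDL reading, not a formula quoted from the book), the claim is that a good scene description shortens the total encoding:

```latex
% Two-part code length for an image x under an inferred scene description s:
\[
L(x) \;=\; \underbrace{L(s)}_{\text{scene description}}
\;+\; \underbrace{L(x \mid s)}_{\text{residual: pixels given the rendered scene}}
\]
% Vision-as-compression picks the s minimizing this sum; the Bayesian reading
% is equivalent, since L(\cdot) = -\log_2 P(\cdot).
```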

This is all generally true, but it also suffers from a key performance problem in that the various bits/variables in the high level scene description are not all equally useful.

For example, consider an agent that competes in something like a Quake-style world, where it just receives a raw visual pixel feed. A very detailed graphics pipeline relies on noise - literally, as in Perlin-style noise functions - to create huge amounts of micro-detail in local texturing, displacements, etc.

If you use a pure compression criterion, the encoder/vision system has to learn to essentially invert the noise functions - which, as we know, is computationally intractable. This ends up wasting a lot of computational effort on small gains in noise modelling, even when those details are irrelevant to high-level goals. You could actually just turn off the texture details completely and still get all of the key information you need to play the game.

Is it important that it be lossless compression?

I can look at a picture of a face and know that it's a face. If you switched a bunch of pixels around, or blurred parts of the image a little bit, I'd still know it was a face. To me it seems relevant that it's a picture of a face, but not as relevant what all the pixels are. Does AI need to be able to do lossless compression to have understanding?

I suppose the response might be that if you have a bunch of pictures of faces, and know that they're faces, then you ought to be able to get some mileage out of that. And even if you're trying to remember all the pixels, there's less information to store if you're just diff-ing from what your face-understanding algorithm predicts is most likely. Is that it?

Well, lossless compression implies understanding. Lossy compression may or may not imply understanding.

Also, usually you can get a lossy compression algorithm from a lossless one. In image compression, the lossless method would typically be to send a scene description plus a low-entropy correction image; you can easily save bits by just skipping the correction image.
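A minimal sketch of that construction (my own illustration; `prediction` stands for whatever image the transmitted scene description renders to):

```python
import zlib
import numpy as np

def encode_lossless(image: np.ndarray, prediction: np.ndarray) -> bytes:
    # The correction image (residual) is low-entropy when the prediction is
    # good, so it compresses well; scene description + residual is lossless.
    residual = image.astype(np.int16) - prediction.astype(np.int16)
    return zlib.compress(residual.tobytes())

def decode_lossless(blob: bytes, prediction: np.ndarray) -> np.ndarray:
    residual = np.frombuffer(zlib.decompress(blob), dtype=np.int16)
    return (prediction.astype(np.int16) + residual.reshape(prediction.shape)).astype(np.uint8)

def encode_lossy(prediction: np.ndarray) -> np.ndarray:
    # The lossy variant saves bits by skipping the correction image entirely.
    return prediction
```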

I emphasize lossless compression because it enables strong comparisons between competing methods.

Well, lossless compression implies understanding.

Not really, at least not until you start to approach Kolmogorov complexity.

In a natural image, most of the information is low level detail that has little or no human-relevant meaning: stuff like textures, background, lighting properties, minuscule shape details, lens artifacts, lossy compression artifacts (if the image was crawled from the Internet it was probably a JPEG originally), and so on.
Lots of this detail is highly redundant and/or can be well modeled by priors, therefore a lossless compression algorithm could be very good at finding an efficient encoding of it.

A typical image used in machine learning contests is 256 × 256 × 3 × 8 ≈ 1.57 million bits. How many bits of meaningful information (*) could it possibly contain? 10? 100? 1000?
Whatever the number is, the amount of non-meaningful information certainly dominates, therefore an efficient lossless compression algorithm could obtain an extremely good compression ratio without compressing and thus understanding any amount of meaningful information.

(* consider meaningful information of an image as the number of yes-or-no questions about the image that a human could normally be interested in and would be able to answer by looking at the image, where for each question the probability of the answer being true is approximately 50% over the data set, and the set of questions is designed in a way that allows a human to know as much as possible by asking the fewest questions, e.g. something like a 20 questions game.)
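Put in bits (my gloss of the footnote's definition, not the commenter's notation): k near-balanced, roughly independent yes/no questions carry about k bits, so

```latex
\[
I_{\text{meaningful}} \approx k \text{ bits}
\;\ll\;
256 \times 256 \times 3 \times 8 \approx 1.57 \times 10^{6} \text{ bits of raw pixel data.}
\]
```

Even at k = 1000, the meaningful information is about three orders of magnitude smaller than the raw pixel content, which is the sense in which the non-meaningful information dominates.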

I agree with your general point that working on lossless compression requires the researcher to pay attention to details that most people would consider meaningless or irrelevant. In my own text compression work, I have to pay a lot of attention to things like capitalization, comma placement, the difference between Unicode quote characters, etc etc. However, I have three responses to this as a critique of the research program:

The first response is to say that nothing is truly irrelevant. Or, equivalently, the vision system should not attempt to make the relevance distinction. Details that are irrelevant in everyday tasks might suddenly become very relevant in a crime scene investigation (where did this shadow at the edge of the image come from...?). Also, even if a detail is irrelevant at the top level, it might be relevant in the interpretation process; certainly shadowing is very important in the human visual system.

The second response is that while it is difficult and time-consuming to worry about details, this is a small price to pay for the overall goal of objectivity and methodological rigor. Human science has always required a large amount of tedious lab work and unglamorous experimental work.

The third response is to say that even if some phenomenon is considered irrelevant by "end users", scientists are interested in understanding reality for its own sake, not for the sake of applications. So pure vision scientists should be very interested in, say, categorizing textures, modeling shadows and lighting, and lens artifacts (Actually, in my interactions with computer graphics people, I have found this exact tendency).

By your definition of meaningful information, it's not actually clear that a strong lossless compressor wouldn't discover and encode that meaningful information.

For example, the presence of a face in an image is presumably meaningful information. From a compression point of view, the presence of a face and its approximate pose is also information that has a very large impact on lower-level feature coding, in that spending say 100 bits to represent the face and its pose could save 10x as many bits at the lowest levels. Some purely unsupervised learning systems - such as sparse coding or RBMs - do tend to find high-level features that correspond to objects (meaningful information).

Of course that does not imply that training using UL compression criteria is the best way to recognize any particular features/objects.

By your definition of meaningful information, it's not actually clear that a strong lossless compressor wouldn't discover and encode that meaningful information.

It could, but also it could not. My point is that compression ratio (that is, average log-likelihood of the data under the model) is not a good proxy for "understanding" since it can be optimized to a very large extent without modeling "meaningful" information.

Yes, good compression can be achieved without deep understanding. But a compressor with deep understanding will ultimately achieve better compression. For example, you can get good text compression results with a simple bigram or trigram model, but eventually a sophisticated grammar-based model will outperform the Ngram approach.
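As a toy illustration of that baseline (my sketch, not the commenter's compressor): a model's average negative log-probability per character is the code length an arithmetic coder driven by it would approach, so a better language model directly means better compression.

```python
# Toy add-one-smoothed character-trigram model; its average -log2 p(char)
# approximates the bits/char an arithmetic coder using it would spend.
import math
from collections import Counter

def trigram_bits_per_char(text: str) -> float:
    ctx_counts = Counter(text[i:i + 2] for i in range(len(text) - 2))
    tri_counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total_bits = 0.0
    for i in range(2, len(text)):
        ctx, tri = text[i - 2:i], text[i - 2:i + 1]
        p = (tri_counts[tri] + 1) / (ctx_counts[ctx] + 256)  # crude 256-symbol alphabet
        total_bits -= math.log2(p)
    return total_bits / max(1, len(text) - 2)

# Counted and evaluated on the same toy string, so this only demonstrates the
# metric, not a fair benchmark result.
print(trigram_bits_per_char("the cat sat on the mat and the cat sat on the hat"))
```

A grammar- or semantics-aware model outperforms this by assigning higher probability to well-formed continuations, which is exactly the "deeper understanding buys better compression" claim above.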

lossless compression implies understanding

Huh? Understanding by whom? What exactly does the zip compressor understand?

It seems like if anything, we should encourage researchers to focus on gameable metrics to slow progress on AGI?

If you really believe that slowing progress on AGI is a good thing, you should do it by encouraging young people to go into different fields, not by encouraging people to waste their careers.

The fact that Baidu is 0.24% better than Google doesn't mean that's the size of the advantage produced by cheating.

Going from 4.82% to 4.58% is also a relative improvement of (4.82 − 4.58) / 4.82 ≈ 4.98%.

Even as the AI safety / control / Friendliness field gets more attention and funding, it seems easy to foresee a future where mainstream AI researchers continue to ignore such work because it does not contribute to the tenths of a percent that they are seeking but instead can only hinder their efforts.

Or, even worse, AI research overall takes the form of a race to the precipice. Think self-optimizing AI being employed to shave off some more percentage points.

Machine learning is already increasingly about automatic optimization. But it all depends on the task and training. Many layers of self-optimization on an image benchmark just leads to a good vision system, not AGI. Self-optimization is probably necessary for AGI, but it is not sufficient.

"Meaningless" is an understatement. In computer vision, once you arrive at such small error rates and improvements, you're effectively no longer solving the problem, you're solving the data set - which is exactly what the "cheating" consisted of in this case.

Summary of the cheat: They tried 200 times (in one year; instead of at most once a week) and got lucky because of minor randomization artefacts.

Are the 200 results out there? I'd like to see how the distribution compared to Google's value.

200 times (in one year; instead of at most once a week)

Twice a week, according to the article...

The problem I see here is that the mainstream AI / machine learning community measures progress mainly by this kind of contest.

The mainstream AI/ML community measures progress by these types of contests because they are a straightforward way to objectively measure progress towards human-level AI, and also tend to result in meaningful near-term applications.

Researchers are incentivized to use whatever method they can find or invent to gain a few tenths of a percent in some contest, which allows them to claim progress at an AI task and publish a paper.

Gains of a few tenths of a percent aren't necessarily meaningful - especially when proper variance/uncertainty estimates are unavailable.

The big key papers that get lots of citations tend to feature large, meaningful gains.

The problem in this specific case is that the ImageNet contest had an unofficial rule that was not explicit enough. They could have easily prevented this category of problem by using blind submissions and separate public/private leaderboards, à la Kaggle.

Even as the AI safety / control / Friendliness field gets more attention and funding, it seems easy to foresee a future where mainstream AI researchers continue to ignore such work because it does not contribute to the tenths of a percent that they are seeking but instead can only hinder their efforts. What can be done to change this?

You have the problem reversed. AI safety/control/friendliness currently doesn't have any standard tests to measure progress, and thus there is little objective way to compare methods. You need a clear optimization criterion to drive progress forward.

You have the problem reversed. AI safety/control/friendliness currently doesn't have any standard tests to measure progress, and thus there is little objective way to compare methods. You need a clear optimization criterion to drive progress forward.

It would be great to have such tests in AI safety/control/Friendliness, but to me they look really difficult to create. Do you have any ideas?

Yes.

I think general RL AI is now advanced to the point where testing some specific AI safety/control subproblems is becoming realistic. The key is to decompose and reduce down to something testable at small scale.

One promising route is to build on RL game playing agents, and extend the work to social games such as MMOs. In the MMO world we already have a model of social vs antisocial behavior - ie playerkillers vs achievers vs cooperators.

So take some sort of MMO that has simple enough visuals and/or world complexity while retaining key features such as gold/xp progression, the ability to kill players with consequences, etc. (a pure text game may be easier to learn, or maybe not). We can then use that to train agents and test various theories.

For example, I would expect that training an RL agent based on maximizing a score type function directly would result in playerkilling emerging as a natural strategy. If the agents were powerful enough to communicate, perhaps even simple game theory and cooperation would emerge.

I expect that training with Wissner-Gross style maximization of future freedom of action (if you could get it to scale) would also result in playerkillers/sociopaths.

In fact I expect that many of MIRI's intuitions about the difficulty of getting a 'friendly' agent out of a simple training specification are probably mostly correct. For example, you could train an agent with a carefully crafted score function that penalizes 'playerkilling' while maintaining the regular gold/xp reward. I expect that those approaches will generally fail if the world is complex enough - the agent will simply learn a griefing exploit that doesn't technically break the specific injunctions you put in place (playerkilling or whatever). I expect this based on specific experiences with early MMOs such as Ultima Online, where the designers faced a similar problem of trying to outlaw or regulate griefing/playerkilling through code - and they found it was incredibly difficult.
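As a concrete (and entirely hypothetical) version of that "carefully crafted score function", here is a sketch of the shaped reward such an experiment might start from; the names, fields, and weights are mine, not the commenter's:

```python
from dataclasses import dataclass

@dataclass
class StepOutcome:
    gold_gained: int
    xp_gained: int
    killed_player: bool
    # Exploits the designer did not enumerate (luring monsters onto players,
    # blocking doorways, corpse looting, ...) leave no trace in this record,
    # which is exactly the failure mode predicted above.

def shaped_reward(step: StepOutcome,
                  gold_weight: float = 1.0,
                  xp_weight: float = 0.5,
                  playerkill_penalty: float = 100.0) -> float:
    reward = gold_weight * step.gold_gained + xp_weight * step.xp_gained
    if step.killed_player:
        reward -= playerkill_penalty  # only the explicitly named behavior is punished
    return reward
```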

I also expect that IRL based approaches could eventually succeed in training an agent that basically learns to emulate the value function and behaviours of a 'good/moral/friendly' human player - given sufficient example history.

I think this type of research platform would put AI safety research on equal footing with mainstream ML research, and allow experimental progress to supplement/prove many concepts that up until now exist only as vague ideas.

The way you prevent cheating is to have contests with clear rules. You can give the contest participants half of the data set to train their algorithms on and then score them based on the other half, which gets made available only once the contestants decide on a specific algorithm.

In bioinformatics, the Critical Assessment of protein Structure Prediction (CASP) challenge is a biennial contest where teams get the sequence of a gene with unknown structure and have to predict the structure. Afterwards the teams are scored on how well they predicted the real structure.
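A toy sketch of that protocol (my illustration with made-up data, not CASP's actual scoring): the withheld half is touched exactly once, after the algorithm is frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)

# Public half: contestants may train on and probe this freely.
X_public, y_public = X[:500], y[:500]
# Withheld half: evaluated only after each team commits to a frozen algorithm.
X_withheld, y_withheld = X[500:], y[500:]

def frozen_predictor(features: np.ndarray) -> np.ndarray:
    # Stand-in for whatever model a team trained on the public half.
    return (features[:, 0] > 0).astype(int)

official_score = float((frozen_predictor(X_withheld) == y_withheld).mean())
print(f"score on the withheld half: {official_score:.3f}")
```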

The way you prevent cheating is to have contests with clear rules.

They had clear rules and a non-public test set with a limited number of allowed submissions; Baidu tried to evade the limit by creating multiple accounts, and they were caught.