Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.


Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly failed to pan out. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results?

Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. For MNIST, an easy handwriting recognition task, performance tops out at around 99.9% even for top models; it's not very practical to design for or measure higher reliability than that, because the test set is just 10,000 images and a handful are ambiguous. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models.

By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.99999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.)
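For the curious, here's that arithmetic as a quick Python sketch (the inputs are just the rough assumptions above, not measured quantities):

```python
# Fermi arithmetic for the driving example.
fatalities_per_mile = 1 / 100_000_000   # ~1 US traffic fatality per 100M miles driven
decisions_per_mile = 100                # assumed number of "decisions" per mile

failure_rate_per_decision = fatalities_per_mile / decisions_per_mile
print(f"~1 failure per {1 / failure_rate_per_decision:,.0f} decisions")  # ~1:10,000,000,000
print(f"reliability ~ {1 - failure_rate_per_decision:.8%}")              # ~99.99999999%
```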

(The numbers here are only rough Fermi estimates. I'm sure one could nitpick them by going into pre-pandemic vs. post-pandemic crash rates, laws in the US vs. other countries, what percentage of crashes are drunk drivers, do drunk drivers count, how often would a really bad decision be fatal, etc. But I'm confident that whichever way you do the math, you'll still find that humans are many orders of magnitude more reliable.)

Other types of accidents are similarly rare. Eg. pre-pandemic, there were around 40 million commercial flights per year, but only a handful of fatal crashes. If each flight involves 100 chances for the pilot to crash the plane by screwing up, then that would get you a reliability rate around 1:1,000,000,000, or ~99.9999999%.

Even obviously dangerous activities can have very low critical failure rates. For example, shooting is a popular hobby in the US; the US market buys around 10 billion rounds of ammunition per year. There are around 500 accidental gun deaths per year, so shooting a gun has a reliability rate against accidental death of ~1:20,000,000, or 99.999995%. In a military context, the accidental death rate was around ten per year against ~1 billion rounds fired, for a reliability rate of ~1:100,000,000, or ~99.999999%. Deaths by fire are very rare compared to how often humans use candles, stoves, and so on; New York subway deaths are rare compared to several billion annual rides; out of hundreds of millions of hikers, only a tiny percentage fall off of cliffs; and so forth.
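The flight and ammunition figures work the same way; here's the same back-of-the-envelope check, with the "handful" of fatal crashes taken to be about four purely for illustration:

```python
# Per-opportunity reliability from annual failures and annual opportunities.
def reliability(failures, opportunities):
    return 1 - failures / opportunities

# ~40M commercial flights/year, ~100 chances per flight, ~4 fatal crashes/year.
print(f"{reliability(4, 40_000_000 * 100):.7%}")   # ~99.9999999%, i.e. ~1:1,000,000,000

# ~500 accidental gun deaths/year vs. ~10 billion civilian rounds fired/year.
print(f"{reliability(500, 10_000_000_000):.6%}")   # ~99.999995%, i.e. ~1:20,000,000
```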

The 2016 AI Impacts survey asked hundreds of AI researchers when they thought AI would be capable of doing certain tasks, such as playing poker or proving theorems. Some tasks have been solved or have a solution "in sight", but right now, we're nowhere close to an AI that can replace human surgeons; robot-assisted surgeries are still manually controlled by human operators. Cosmetic surgeries on healthy patients have a fatality rate around 1:300,000, even before excluding unpredictable problems like blood clots. If a typical procedure involves two hundred chances to kill the patient by messing up, then an AI surgeon would need a reliability rate of at least 99.999998%.

One concern with GPT-3 has been that it might accidentally be racist or offensive. Humans are, of course, sometimes racist or offensive, but in a tightly controlled Western professional context, it's pretty rare. Eg., one McDonald's employee was fired for yelling racial slurs at a customer. But McDonald's serves 70 million people a day, ~1% of the world's population. Assuming that 10% of such incidents get a news story and there's about one story per year, a similar language model would need a reliability rate of around 1:2,500,000,000, or 99.99999996%, to match McDonald's workers. When I did AI for the McDonald's drive-thru, the language model wasn't allowed to generate text at all. All spoken dialog had to be pre-approved and then manually engineered in. Reliability is hard!
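And the same arithmetic for the surgery and McDonald's estimates (the 200 chances per procedure and ~10 incidents per year are the assumptions stated above, not data):

```python
# Cosmetic surgery: ~1 death per 300,000 procedures, ~200 chances to err per procedure.
surgery_failures_per_chance = (1 / 300_000) / 200
print(f"{1 - surgery_failures_per_chance:.6%}")   # ~99.999998%

# McDonald's: ~70M customers/day; ~1 news story/year at a ~10% reporting rate -> ~10 incidents/year.
interactions_per_year = 70_000_000 * 365
print(f"~1:{interactions_per_year / 10:,.0f}")    # ~1:2,555,000,000 (roughly 1:2.5 billion)
print(f"{1 - 10 / interactions_per_year:.8%}")    # ~99.99999996%
```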

On the one hand, this might seem slightly optimistic for AI alignment research, since commercial AI teams will have to get better worst-case bounds on AI behavior for immediate economic reasons. On the other hand, because so much of the risk of AI is concentrated into a small number of very bad outcomes, it seems like such engineering might get us AIs that appear safe, and almost always are safe, but will still cause catastrophic failure in conditions that weren't anticipated. That seems bad.


The best current image models have ~99.9% reliability on MNIST, an easy handwriting recognition task.

That's pretty much 100%. Have you looked at the hopelessly ambiguous examples https://towardsdatascience.com/going-beyond-99-mnist-handwritten-digits-recognition-cfff96337392 or the mislabeled ones https://arxiv.org/pdf/1912.05283.pdf#page=8 https://cleanlab.ai/blog/label-errors-image-datasets/ ? I'm not sure how many of the remaining errors are ones where a human would look at it and agree that it's obviously what the label is. (This is also a problem with ImageNet these days.)

And yes, a lot of this is going to depend on reference classes. You point to guns as incredibly safe, which they are... as long as we exclude gun suicides or homicides, of course. Are those instances of human unreliability? If we are concerned with bad outcomes, seems like they ought to count. If you grab a gun in a moment of disinhibition while drinking and commit suicide, which you wouldn't've if you hadn't had a gun, in the same way that people blocked from jumping off a bridge turn out to not simply jump somewhere else later on, you're as dead as if a fellow hunter shot you. If a pilot decides to commit suicide by flying into a mountain (which seems to have happened yet again in March, the China Eastern Airlines crash), you're just as dead as if the wings iced up and it crashed that way. etc

I edited the MNIST bit to clarify, but a big point here is that there are tasks where 99.9% is "pretty much 100%" and tasks where it's really really not (eg. operating heavy machinery); and right now, most models, datasets, systems and evaluation metrics are designed around the first scenario, rather than the second.

Intentional murder seems analogous to misalignment, not error. If you count random suicides as bugs, you get a big numerator but an even bigger denominator; the overall US suicide rate is ~1:7,000 per year, and that includes lots of people who have awful chronic health problems. If you assume a 1:20,000 random suicide rate and that 40% of people can kill themselves in a minute (roughly, the US gun ownership rate), then the rate of not doing it per decision is ~20,000 * 60 * 16 * 365 * 0.4 = 1:3,000,000,000, or ~99.99999997%.
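To unpack that multiplication (all inputs are assumptions from the paragraph above):

```python
# Decisions per "random" suicide, counting one decision per waking minute.
annual_suicide_odds = 1 / 20_000          # assumed random suicide rate per person-year
waking_minutes_per_year = 60 * 16 * 365   # one decision per waking minute
can_act_within_a_minute = 0.4             # ~US gun ownership rate

decisions_per_suicide = waking_minutes_per_year * can_act_within_a_minute / annual_suicide_odds
print(f"~1:{decisions_per_suicide:,.0f}")                    # ~1:2,800,000,000 (rounded up to ~1:3B above)
print(f"reliability ~ {1 - 1 / decisions_per_suicide:.8%}")  # ~99.99999996%
```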

You say "yet again", but random pilot suicides are incredibly rare! Wikipedia counts eight on commercial flights in the last fifty years, out of a billion or so total flights, and some of those cases are ambiguous and it's not clear what happened: https://en.wikipedia.org/wiki/Suicide_by_pilot

If you assume a 1:20,000 random suicide rate and that 40% of people can kill themselves in a minute (roughly, the US gun ownership rate), then the rate of not doing it per decision is ~20,000 * 60 * 16 * 365 * 0.4 = 1:3,000,000,000, or ~99.99999997%.

IIUC, people aren't deciding whether to kill themselves once a minute, every minute. The thought only comes up when things are really rough, and thinking about it can take hours or days. That's probably a nitpick.

More importantly, an agent optimizing for not intentionally shooting itself in the face would probably be much more reliable at it than a human. It just has to sit still.

If you look at RL agents in simulated environments where death is possible (e.g. Atari games), the top agents outperform most human counterparts at not dying in most games. E.g. the MuZero average score in Space Invaders is several times higher than the average human baseline, which would require it to die less often on average.

So when an agent is trained to not die, it can be very efficient at it.

From Marvin Minsky,

We rarely recognize how wonderful it is that a person can traverse an entire lifetime without making a really serious mistake, like putting a fork in one's eye or using a window instead of a door.

The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile

A human driver does not make 100 "life or death decisions" per mile. They make many more decisions, most of which can easily be corrected, if wrong, by another decision.

The statistic is misleading, though, in that it includes people who text, drunk drivers, and tired drivers. The performance of a well-rested human driver who is paying attention to the road is much, much higher than that. And that's really the bar that matters for a self-driving car: you don't want a car that merely does better than the average driver, who (hey, you never know) could be drunk.

Yes, the median driver is much better than the mean driver. But what matters is the mean, not the median. 

If we can replace all drivers by robots, what matters is whether the robot is better than the mean human. Of course, it's not all at once. It's about the marginal change. What matters is the mean taxi driver, who probably isn't drunk. Another margin is the expansion of taxis: if robotaxis are cheaper than human taxis, the expansion of taxis may well be replacing drunk or tired drivers.

The point about an AI being better than a human on average, but the worst-case AI being much worse than a human, seems like a critical insight! I haven't thought about it this way, but it matches what Eliezer keeps saying about AI dangers: AI is much more dangerous "out of distribution". A human can reasonably reliably figure out a special case where "222+222=555" even though 1+1=2, but an AI trained on a distribution will likely insist on a bad "out of distribution" action, and occasionally have a spectacular but preventable fatality when used as an FSD autopilot, without a human justification of "The driver was drunk/distracted/enraged".

One nitpick on the estimates: a fatality may not be the best way to evaluate reliability: humans would probably have a lot more errors and near-misses than an AI ("oh sh*t, I drifted into a wrong lane!"), but fewer spectacular crashes (or brains sliced through, or civilians accidentally shot by fully autonomous drones).

I read somewhere that pilots make something like one error every 6 minutes, but they have the time and ability to detect the errors and correct them.

A quick OODA loop is very effective at detecting and eliminating errors, and it is core to human safety. But it requires environments that provide quick feedback on errors before a fatal crash, like "this car is too close, I will drive slowly."

Thought-provoking post, though as you hinted it's not fair to directly compare "classification accuracy" with "accuracy at avoiding catastrophe". Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time. Instead, human ability to avoid catastrophe is an ability to generate conservative action sequences in response to novel physical and social situations - e.g. if I'm driving and I see something I don't understand up ahead I'll slow down just in case.

I imagine if our goal was "never misclassify an MNIST digit" we could get to 6-7 nines of "worst-case accuracy" even out of existing neural nets, at the cost of saying "I don't know" for the confusing 0.2% of digits.

>I imagine if our goal was "never misclassify an MNIST digit" we could get to 6-7 nines of "worst-case accuracy" even out of existing neural nets, at the cost of saying "I don't know" for the confusing 0.2% of digits.

Er, how? I haven't seen anyone describe a way to do this. Getting a neural network to meaningfully say "I don't know" is very much cutting-edge research as far as I'm aware.

You're right that it's an ongoing research area, but there are a number of approaches that work relatively well. This NeurIPS tutorial describes a few. Probably the easiest thing is to use one of the calibration methods mentioned there to get your classifier to output calibrated uncertainties for each class, then say "I don't know" if the network isn't at least 90% confident in one of the 10 classes.
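For concreteness, here's a minimal sketch of just the thresholding step; the temperature value is a placeholder that would normally be fit on held-out validation data (per the temperature-scaling recipe in that tutorial), and the logits are made up:

```python
import numpy as np

def predict_or_abstain(logits, temperature=1.5, threshold=0.9):
    """Return the predicted digit, or None ("I don't know") if the
    calibrated confidence is below the threshold."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    if probs.max() < threshold:
        return None                          # abstain on low confidence
    return int(probs.argmax())

# A confident example and an ambiguous one (hypothetical logits over 10 digit classes).
print(predict_or_abstain(np.array([0.1, 9.0, 0.2, 0.1, 0.0, 0.1, 0.3, 0.1, 0.2, 0.1])))  # 1
print(predict_or_abstain(np.array([0.1, 2.0, 1.8, 0.1, 0.0, 0.1, 0.3, 0.1, 0.2, 0.1])))  # None
```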

OK, thanks for linking that. You're probably right in the specific example of MNIST. I'm less convinced about more complicated tasks - it seems like each individual task would require a lot of engineering effort.

One thing I didn't see: is there research that looks at what happens if you give neural nets more of the input space as data? Things which are explicitly out-of-distribution, random noise, abstract shapes, or maybe other modes that you don't particularly care about performance on, and label it all as "garbage" or whatever. Essentially, providing negative as well as positive examples, given that the input spaces are usually much larger than the intended distribution.

Humans are probably less reliable than deep learning systems at this point in terms of their ability to classify images and understand scenes, at least given < 1 second of response time.

Another way to frame this point is that humans are always doing multi-modal processing in the background, even for tasks which require only considering one sensory modality. Doing this sort of multi-modal cross checking by default offers better edge case performance at the cost of lower efficiency in the average case. 


I'll add to this: humans make errors all the time, but we have institutions and systems in place to ensure that those errors don't lead to catastrophe, by having additional safeguards beyond just hoping that one person doesn't screw up.

Disagree with the framing metaphor being one of binary decisions that lead to crashes when made wrong. We intentionally design our systems so that wide deviations along the relevant dimensions don't cause problems. For example, joysticks allow much more precise steering than steering wheels. But in testing this was found to be undesirable: you don't actually want precision at the expense of sensitivity. Joystick users were more precise but crashed more, because it was easier to accidentally give an extreme input.

Humans are very reliable agents for tasks which humans are very reliable for.

For most of these examples (arguably all of them) if humans were not reliable at them then the tasks would not exist or would exist in a less stringent form.

I'm not sure what the right tags for this are; curious about people's thoughts. "Reliability" seems like it should be a fairly recurring topic, and something about "evaluating benchmarks" also seems plausible; not sure if we already have tags that are close to those.

I would say that reproducing high human reliability is called engineering. It took a long time for machines to reach the precision needed to pull that off and there are still areas where it is difficult. The last example that I read about was brick-laying.

I'm guessing you're referring to Brian Potter's post Where Are The Robotic Bricklayers?, which to me is a great example of reality being surprisingly detailed. Quoting Brian:

Masonry seemed like the perfect candidate for mechanization, but a hundred years of limited success suggests there’s some aspect to it that prevents a machine from easily doing it. This makes it an interesting case study, as it helps define exactly where mechanization becomes difficult - what makes laying a brick so different than, say, hammering a nail, such that the latter is almost completely mechanized and the former is almost completely manual?

Yes, that one! Thanks for finding and quoting.

Robust Agents seems sort of similar but not quite right.

I see it as less of "Humans are more reliable than AI" and more of "Humans and AI do not tend to make the same kind of mistakes". And the everyday jobs we encounter have been designed/evolved around human mistake patterns, so we are less likely to cause catastrophic failures. Keeping the job constant and replacing humans with AI would obviously lead to problems.

For well-defined simple jobs like arithmetic, AI has a definite accuracy edge compared to human beings. Even for complex jobs, I am still unsure if human beings have the reliability edge. We had quite a few human errors leading to catastrophic failures in the earlier stages of complex projects like space exploration, nuclear power plants, etc. But as time passes we rectify the job so human errors are less likely to cause failure.

Yeah. I think they have different kinds of errors. We are using human judgement to say that AIs should not make errors that humans would not make. But we do not appreciate the times when they do not make errors that we would make.

Humans might be more reliable at driving cars than AIs overall. But human reliability is not a superset of AI. It's more like an intersection. If you look at car crashes made by AIs, they might not look like something that human would cause. But if you look at car crashes of humans, they might also not look like something that AIs would cause.

Curated. This is sort of a simple and obvious point. I think I separately had a sense of how reliable humans are, and how reliable ML typically is these days. But I found it viscerally helpful to have driven home the point of how far we have to go.

Note: This post has been recorded as part of the LessWrong Curated Podcast, and can be listened to on Spotify, Apple Podcasts, and Libsyn.

Worth pointing out that although go-anywhere/Level 5 self-driving isn't a solved problem, computers are already more reliable than humans in the domain of highway driving.

I do agree with the general point that AI fails (often in bizarre ways) at a rate far too high to be acceptable for most safety-critical tasks.

I hadn't seen that, great paper!

Some credit to the road and vehicle engineers should probably be given here.

There are design decisions that make it easier for humans to avoid crashes, and that reduce the damage of crashes when they do occur.

Not sure how many of those nines in the human reliability figure represent a 'unit of engineering effort' by highway/vehicle designers over the last hundred years, but it isn't zero.

I wouldn't call the low death rate from surgery humans being highly reliable. Surgery used to be much deadlier. Humans have spent many many years improving surgical methods (tools, procedures, training), including by using robotic assistance to replace human activity on subtasks where the robots do better. Surgery as practiced by current trained humans with their tools & methods is highly reliable, but this reliability isn't something inherent to the humans as agents.

I'm not sure that it matters whether the humans are inherently more reliable. I don't think anyone is claiming, intentionally or otherwise, that a random human plucked from 20,000 BC will immediately be a better brain surgeon than any AI.

If humans are only more reliable within a context of certain frameworks of knowledge and trained procedures, and the AIs can't make use of the frameworks and are less reliable as a consequence, then the humans are still more reliable than the AIs and probably shouldn't be replaced by the AIs in the workforce.

Solid post, but I think in particular mentioning GPT distracts from the main point. GPT is a generative model with no reward function, meaning it has no goals that it's optimizing for. It's not engineered for reliability (or any other goal), so it's not meaningful to compare its performance against humans in goal-oriented tasks.

Humans can also still work reliably with some of their brain cut out. Artificial neural networks seem more fragile.

The ML technique known as dropout corresponds to the idea of randomly deleting neurons from the network during training, while still ensuring that it performs whatever its task is.

So I guess you can make sure that your NN is robust wrt loss of neurons.
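A minimal PyTorch sketch of that idea, with a toy MLP on flattened MNIST-sized inputs (the layer sizes and dropout rate are arbitrary):

```python
import torch
import torch.nn as nn

# Toy MLP with dropout: during training, each hidden unit is zeroed with
# probability p, so the network can't come to depend on any single unit.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly drop half the hidden units on each forward pass
    nn.Linear(256, 10),
)

x = torch.randn(1, 784)   # stand-in for one flattened 28x28 image
model.train()
print(model(x))           # stochastic: a different subset of units is dropped each call
model.eval()
print(model(x))           # deterministic: dropout is a no-op at evaluation time
```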

Yeah, I agree that random dropout is quite similar. However, human brains can continue to function quite well even when the corpus callosum (which connects the two hemispheres) is cut or a whole region of the brain is destroyed. I'm not sure exactly what the analogy for that would be in a NN, but I don't think most networks could continue to function with a similarly destructive change.