So you have a batch of things that need to pass muster. The failure mode presented above is that you'll get bored with just saying 'pass, pass, pass...'
The corrective proposed is to ask for the worst item, whether or not it passes, in addition to asking for rejects.
It would be something to think about while looking at a bunch of good ones, and would keep one in practice... if one tries. If you just fake it and no one can tell because they're all passes anyway, then it doesn't work.
It may also be useful to identify the best item. The gap between the best and the worst is probably a useful measure of quality control, and it also checks that the tests are general enough to detect good as well as bad.
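To make the suggestion concrete, here is a minimal sketch of such a review step. It assumes each item can be given a numeric quality score; the names `review_batch`, `BatchReview`, `score` and `pass_mark` are made up for illustration and don't come from the post. The reviewer has to name a worst and a best item for every batch, even when nothing is rejected.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class BatchReview:
    rejects: List[Any]  # items scoring below the pass mark
    worst: Any          # lowest-scoring item, reported even if it passes
    best: Any           # highest-scoring item
    spread: float       # best score minus worst score


def review_batch(
    items: Sequence[Any],
    score: Callable[[Any], float],  # hypothetical scoring function
    pass_mark: float,               # hypothetical pass threshold
) -> BatchReview:
    """Review a non-empty batch, always naming the worst and best item."""
    if not items:
        raise ValueError("cannot review an empty batch")
    scored = [(score(item), item) for item in items]
    worst_score, worst = min(scored, key=lambda pair: pair[0])
    best_score, best = max(scored, key=lambda pair: pair[0])
    rejects = [item for s, item in scored if s < pass_mark]
    return BatchReview(rejects, worst, best, best_score - worst_score)


# A batch where everything passes still yields a worst item to think about.
review = review_batch([7, 9, 8], score=lambda x: x, pass_mark=5)
```

Of course, this only forces the reviewer to produce an answer; as noted above, it doesn't help if the answer is faked.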
Consider a mixed system, in which an automated system is paired with a human overseer. The automated system handles most of the routine tasks, while the overseer is tasked with looking out for errors and taking over in extreme or unpredictable circumstances. Examples of this could be autopilots, cruise control, GPS direction finding, high-frequency trading – in fact nearly every automated system has this feature, because they nearly all rely on humans "keeping an eye on things".
But often the human component doesn't perform as well as it should, or even as well as it did before part of the system was automated. Cruise control can impair driver performance, leading to more accidents. GPS errors can take people much further off course than following maps did. When the autopilot fails, pilots can crash their planes in rather conventional conditions. Traders don't understand why their algorithms misbehave, or how to stop them.
There seem to be three factors at work here:
So, when the automation fails, the overseer is generally dumped into an emergency whose nature they have to deduce. Then, using skills that have atrophied, they have to take over the task of an automated system that has never failed before and that they have never had to truly understand.
And they'll typically get blamed for getting it wrong.
Similarly, if we design AI control mechanisms that rely on the presence of a human in the loop (such as tool AIs, Oracle AIs, and, to a lesser extent, reduced impact AIs), we'll need to take the autopilot problem into account: design the role of the overseer so as not to deskill them, and don't count on them being free of error.