Thus the design (i.e. "The Math") vs. implementation (i.e. "The Code") division. I believe design verification suffers from the same problems as implementation verification, albeit perhaps less severely (though I've never worked with really complex, novel, abstract math... it would be interesting to see how many such designs are, on average, "proved" correct and then blow up).
Still, I would argue that the problem is not that black-box testing is insufficient where we are currently able to apply it, but rather that we have no idea how to properly black-box-test an abstract, novel, complex system!
P.S. Your trivial example is also unfair and trivializes the technique. Black-box testing in no way implies randomizing all tests, and I would expect the QuickSort to blow up very quickly under serious testing.
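To make the point concrete, here is a minimal sketch (names and test-case choices are mine, not from the discussion) of what "serious" black-box testing looks like: instead of purely random inputs, you feed the sort a handful of structured inputs that each target a known failure mode. A naive quicksort with a first-element pivot passes the random case easily but blows the stack on sorted, reversed, and all-equal inputs:

```python
import random
import sys

# Pin the recursion limit to CPython's default so stack blow-ups are deterministic.
sys.setrecursionlimit(1000)

def naive_quicksort(xs):
    """Quicksort with a first-element pivot -- the classic fragile choice."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    return (naive_quicksort([x for x in rest if x < pivot])
            + [pivot]
            + naive_quicksort([x for x in rest if x >= pivot]))

def black_box_test(sort, size=2000):
    """Structured black-box inputs; each targets a known quicksort failure mode."""
    cases = {
        "random":    random.sample(range(size), size),
        "sorted":    list(range(size)),        # degenerate: every pivot is the minimum
        "reverse":   list(range(size, 0, -1)), # degenerate: every pivot is the maximum
        "all_equal": [7] * size,               # every element lands in one partition
    }
    results = {}
    for name, data in cases.items():
        try:
            results[name] = (sort(data) == sorted(data))
        except RecursionError:
            results[name] = False  # O(n) recursion depth: the sort "blew up"
    return results

results = black_box_test(naive_quicksort)
# random passes; sorted, reverse, and all_equal all fail via RecursionError
```

A purely random test suite would almost never stumble onto these degenerate orderings, which is exactly why randomizing all tests is a strawman of the technique.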
I've just posted an analysis to MIRI's blog called "Transparency in Safety-Critical Systems". Its aim is to explain a common view about transparency and system reliability, and then open a dialogue about which parts of that view are wrong, or don't apply well to AGI.
The "common view" (not universal by any means) explained in the post is, roughly:
Three caveats / open problems listed at the end of the post are:
The MIRI blog has only recently begun to regularly host substantive, non-news content, so it doesn't get much commenting action yet. Thus, I figured I'd post here and try to start a dialogue. Comment away!