(Began reading. Didn't click through onto paper. In the post, got to where Bostrom is quoted thus:)
"For example, even after an expected-utility-maximizing agent had built 32 paperclips, it could use some extra resources to verify that it had indeed successfully built 32 paperclips meeting all the specifications (and, if necessary, to take corrective action)."
Yes, it could. But would it? If I write a script to, say, generate a multiset of the prime factors of a number by trial and error, and run it on a computer, the computer simply performs the actions encoded into the script.
Now suppose I appended code to my script to check that the elements of the multiset did indeed constitute a prime factorisation, by checking the primality of each element and checking that their product returned the original number. Then one might call what the updated script does (or we might say, what the script tells a computer to do) 'checking'. All this means is that we've performed a test to increase our confidence in a proposed solution.
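To make the two-step script concrete, here is a minimal sketch in Python. The function names (`prime_factors`, `is_prime`, `check_factorisation`) are my own illustrative choices, and I am assuming 'trial and error' means plain trial division; this is one way to write such a script, not the only one.

```python
from collections import Counter
from math import isqrt, prod


def prime_factors(n: int) -> Counter:
    """Return the multiset of prime factors of n (n >= 2) by trial division."""
    if n < 2:
        raise ValueError("n must be at least 2")
    factors = Counter()
    d = 2
    while d * d <= n:
        while n % d == 0:  # record d once per power of d dividing n
            factors[d] += 1
            n //= d
        d += 1
    if n > 1:  # whatever remains is itself prime
        factors[n] += 1
    return factors


def is_prime(k: int) -> bool:
    """Primality test by trial division up to the integer square root."""
    if k < 2:
        return False
    return all(k % d != 0 for d in range(2, isqrt(k) + 1))


def check_factorisation(n: int, factors: Counter) -> bool:
    """The appended 'check': every element is prime and the product is n."""
    return all(is_prime(p) for p in factors) and prod(factors.elements()) == n
```

Calling `check_factorisation(360, prime_factors(360))` runs both tests on the proposed multiset. Note that the checker only ever compares numbers; nothing in it refers to why anyone wanted the factorisation.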
But another sense of 'checking' emerges. Suppose I have someone check that some books are in some sort of alphanumeric order, and I don't tell them that this is so that the books can be put on a library shelf correctly. In this situation, the statement 'The helper checked that the books were in order' seems clearly true, but the statement 'The helper checked that the books were ready to be shelved' seems much less so.
It seems, then, that saying the script/computer itself checks the correctness of the prime factorisation was sloppy if we use the second sense of 'checking'. I, who wrote it, was checking by using the script; but the script itself, lacking knowledge of what it was terminally checking, could not be said to be checking the factorisation, just as the sorting helper could not be said to be checking shelf-readiness even though they could be said to be checking order.
Checking is pretty much just applying tests to a proposed solution in order to reach a more reliable understanding of the plausibility of the solution. So unless the agent is 'programmed'/wired to do such tests, it won't necessarily do so. Also, if the programming/wiring is not good in terms of correspondence to the intended task (e.g. if my prime factorisation script fails to consider the multiplicity of prime factors or the 32-clipper is programmed with tokens that do not refer), the actions will be taken, not meet the intended target, and not be checked.
This is why algorithms have to be proved to work; throwing some steps together that seem sorta like the right idea won't yield a knowably accurate method.
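As a sketch of the multiplicity failure mentioned above, here is a deliberately flawed Python script (the name `distinct_prime_factors` is hypothetical). It runs to completion without complaint, misses the intended target, and contains no test that would notice.

```python
from math import prod


def distinct_prime_factors(n: int) -> set:
    """Deliberately flawed: records each prime once, dropping multiplicity."""
    found = set()
    d = 2
    while d * d <= n:
        if n % d == 0:
            found.add(d)
            while n % d == 0:  # strip every power of d, but count it only once
                n //= d
        d += 1
    if n > 1:
        found.add(n)
    return found


result = distinct_prime_factors(12)
# The script itself never makes this comparison; it is done here from outside
# only to show that the output does not multiply back to 12.
product_matches = prod(result) == 12
```

Here `result` is `{2, 3}` and `product_matches` is `False`, yet the script raised no error. Unless a test is written in, the mismatch goes unchecked.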
(Finished reading the post except solution.)
Go meta or go home. Even if the task is to achieve X with probability p, once this is translated into an algorithm that is performed, nothing special happens. For example, say I get ice cream if a robot flips a coin and it comes up heads. If I program the robot to ensure ice cream with p=0.5 by flipping a coin, the goal will be achieved immediately and no checking will take place. (The 'by' is necessary: without actually specifying the algorithm, I could be referring to any of a whole class of ways to program a robot to do this, only some of which work, and only some of which check.)
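A minimal Python sketch of the coin-flipping robot, assuming a fair coin modelled by a seeded pseudo-random generator (`coin_flip_robot` is a hypothetical name):

```python
import random


def coin_flip_robot(rng: random.Random) -> bool:
    """Hypothetical robot: ice cream is delivered iff a fair coin lands heads.

    There is no verification step: the robot flips once, acts, and stops.
    """
    heads = rng.random() < 0.5
    return heads  # True means ice cream was delivered


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = [coin_flip_robot(rng) for _ in range(10_000)]
frequency = sum(trials) / len(trials)
```

Over many runs `frequency` should come out close to 0.5, and each individual run halts after a single flip. The probability is a property of the algorithm, not something the robot goes on to confirm.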
TL;DR: Taboo 'check' or ascertain its meaning by reduction to testing using suitable thought experiments, or by looking at brains to see what physical phenomena correspond to the act of checking.
(After reading solution, comments.)
Manfred: Elegantly put.
The difference between being 'p certain' and knowing that one's p certain might be hard to grasp because we are so often aware of our impressions.
However, until you read this sentence, you didn't know that you were certain five minutes ago that the Sun would not disappear three minutes ago; now that your mind is blown, your behaviour will be observably different to before this realisation, i.e. coming into knowledge of the certainty has made a measurable difference.
A belief does not necessitate a belief about that belief.
Also, not all checkers would check indefinitely. For example, a checker with a grasp of verification paradoxes might reach a point of maximal confidence and terminate. Meanwhile, a checker that thought to (or, more accurately, was programmed/wired to) flip coins as a test for 32-paperclips-hood, with its doubt halving each time no matter the outcome, might never terminate.
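A rough Python sketch of the contrast, under my own assumptions: doubt starts at 1/2 and halves on every test, one checker is wired to halt once doubt falls to a threshold, and the other is wired to halt only at zero doubt. The name `run_checker` and the `max_steps` cap are illustrative; the cap exists only so the example itself terminates.

```python
from fractions import Fraction
from typing import Optional


def run_checker(target_doubt: Fraction, max_steps: int) -> Optional[int]:
    """Halve doubt on every test; return the step at which doubt <= target_doubt,
    or None if that has not happened within max_steps."""
    doubt = Fraction(1, 2)
    for step in range(1, max_steps + 1):
        doubt /= 2  # exact halving; Fraction avoids floating-point underflow to 0
        if doubt <= target_doubt:
            return step  # the encoded condition is met, so the checker halts
    return None


bounded = run_checker(Fraction(1, 100), max_steps=1_000)  # halts once doubt <= 1%
unbounded = run_checker(Fraction(0), max_steps=1_000)     # demands zero doubt
```

With these numbers `bounded` is 6 and `unbounded` is `None`. The first checker stops because its condition became true, without any further test of whether it is certain about its own certainty. The second never reaches its condition however long it runs.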
Checking is pretty much just applying tests to a proposed solution in order to reach a more reliable understanding of the plausibility of the solution. So unless the agent is 'programmed'/wired to do such tests, it won't necessarily do so.
Uhm... sounds like you've never heard of Basic AI Drives which, according to Omohundro, are behaviours of "sufficiently advanced AI systems of any design" "which will be present unless explicitly counteracted"; I invite you to look into that.
[Final Update: Back to 'Discussion'; struck out the initial framing, which was misleading.]
[Update: Moved to 'Main'. Also, judging by the comments, it appears that most have misunderstood the puzzle and read way too much into it; user 'Manfred' seems to have got the point.]
[Note: This little puzzle is my first article. Preliminary feedback suggests some of you might enjoy it while others might find it too obvious, hence the cautious submission to 'Discussion'; will move it to 'Main' if, and only if, it's well-received.]
In his recent paper "The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents", Nick Bostrom states:
Let us take it on from here.
It is tempting to say that a machine can never halt after achieving its goal because it cannot know with full certainty whether it has achieved its goal; it will continually verify, possibly to increasing degrees of certainty, whether it has achieved its goal, but never halt as such.
What if, from a naive goal G, the machine's goal were then redefined as "achieve 'G' with 'p' probability" for some p < 1? It appears this also would not work, given that the machine would never be fully certain of being p certain of having achieved G (and so on).
Yet one can specify a set of conditions for which a program will terminate, so how is the argument above fallacious?
Solution in ROT13: Va beqre gb unyg fhpu na ntrag qbrfa'g arrq gb *xabj* vg'f c pregnva, vg bayl arrqf gb *or* c pregnva; nf gur pbaqvgvba vf rapbqrq, gur unygvat jvyy or gevttrerq bapr gur ntrag ragref gur fgngr bs c pregnvagl, ertneqyrff bs jurgure vg unf (shyy) xabjyrqtr bs vgf fgngr.