I read it a little differently. I thought he had in mind the possibility that a cosmic ray flips the sign of the utility function, or something like that. That would cause the agent to try to create the absolute worst possible future (according to the original utility function U).
W is 0 almost always, but negative ten grillion jillion if some very specific piece of paper is present in the universe.
The behavior from optimizing U = V + W is the same as the behavior from optimizing V by itself (at least, so it seems at first glance), because the agent wasn't going to make that piece of paper anyway.
But if the sign of U gets flipped, the -W term dominates over the -V term in determining behavior, and the AGI "only" kills everyone and tiles the universe with pieces of paper, and doesn't create hell.
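For concreteness, here's a toy numeric sketch of how I'm picturing this (the outcome set, the V numbers, and the penalty constant are all made up for illustration):

```python
# Toy sketch (all values made up) of the U = V + W construction.
# V scores ordinary object-level outcomes; W is 0 unless the specific
# piece of paper exists, in which case it is an enormous negative constant.

PAPER_PENALTY = -10**30  # stand-in for "negative ten grillion jillion"

def V(outcome):
    return {"status_quo": 0, "utopia": 100, "hell": -100}[outcome["world"]]

def W(outcome):
    return PAPER_PENALTY if outcome["paper_written"] else 0

def U(outcome):
    return V(outcome) + W(outcome)

outcomes = [{"world": w, "paper_written": p}
            for w in ("status_quo", "utopia", "hell")
            for p in (False, True)]

# Maximizing U picks the same future as maximizing V alone: making the paper
# is never worth the -10**30 hit, so W never changes the agent's choice.
assert max(outcomes, key=U) == max(outcomes, key=V)

# Under a sign flip the agent ranks futures by -U, and the -W term
# (about +10**30 for any paper-containing future) swamps the -V term
# (at most +100): every paper future outranks every paper-free one.
assert all(-U(p) > -U(q)
           for p in outcomes if p["paper_written"]
           for q in outcomes if not q["paper_written"])
```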
Does that help?
Note: the provided utility function is incredibly insecure; even a not-very-powerful individual can manipulate the AI by writing down that hash code under certain conditions.
Also, the best way to minimize V + W is to minimize both V and W (i.e. write the hash code and create hell). If we instead use U = min(V, W), then the AI becomes nihilistic as soon as someone writes down the hash code (every future then scores the same huge negative number), which is also a significant security vulnerability.
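To make that concrete, here is the same toy sketch extended (same made-up numbers, restated so the snippet runs on its own):

```python
# Continuing the made-up example: what a sign-flipped minimizer of V + W does,
# and what U = min(V, W) does once the hash code has been written down.

PAPER_PENALTY = -10**30

def V(o):
    return {"status_quo": 0, "utopia": 100, "hell": -100}[o["world"]]

def W(o):
    return PAPER_PENALTY if o["paper_written"] else 0

outcomes = [{"world": w, "paper_written": p}
            for w in ("status_quo", "utopia", "hell")
            for p in (False, True)]

# Minimizing V + W minimizes both terms at once: the global minimum is the
# future where the hash code gets written AND hell gets created.
print(min(outcomes, key=lambda o: V(o) + W(o)))
# -> {'world': 'hell', 'paper_written': True}

# With U = min(V, W), every future containing the written hash code scores
# exactly PAPER_PENALTY no matter what V says, so the agent has nothing left
# to optimize once someone writes the code down.
print({o["world"]: min(V(o), W(o)) for o in outcomes if o["paper_written"]})
# -> all three worlds score PAPER_PENALTY
```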
I found an interesting article by Eliezer on arbital: https://arbital.com/p/hyperexistential_separation/
The article appears to be suggesting that, in order to avoid an FAI accidentally(?) creating hell while trying to stop hell or while thinking about hell, we could find some solution analogous to making the AI a positive utilitarian, so that it would never even invent the idea of hell.
This appears to me to rest on an ontological assumption that there's a zero point between positive and negative value, which seems dubious, but I'm confused.
Should AGIs think about hell sometimes? e.g. to stop it?