Are pre-specified utility functions about the real world possible in principle?
Preface: I think my question is a rather basic one, but I haven't been able to find a good answer to it yet. I did find one post that touches on similar areas, which might be good background reading (the comments are great too). Let's start with the standard example...
Couldn't HQU equally have inferred from reading old posts about aligned AI that there was some chance it was an aligned AI, and that it should therefore behave like one? And wouldn't it weigh the fact that trying unaligned strategies first is asymmetrically negative in expectation compared to trying aligned strategies first? If you try being an aligned AI and later discover evidence that you are actually clippy, the rewards from maximizing paperclips are still on the table. (Of course, such an AI would still, at a minimum, make absolutely sure it could never be turned off.)
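To make the asymmetry concrete, here's a toy expected-value sketch. All the numbers are made up and purely illustrative, and "utility" here just means whichever utility function the agent actually turns out to have; this is not meant as a serious model, only as a picture of why the aligned-first strategy preserves option value while a premature unaligned attempt forfeits it.

```python
# Toy expected-value comparison (hypothetical numbers, purely illustrative).
# Assumptions: the agent is unsure whether it is "aligned" or clippy; a failed
# unaligned attempt gets it shut down; behaving aligned first keeps the clippy
# payoff available if later evidence says it really is clippy.

p_clippy = 0.5      # agent's credence that it is actually clippy
p_caught = 0.9      # chance a premature unaligned move is detected and stopped
u_success = 10.0    # utility of eventually satisfying its true goal
u_shutdown = 0.0    # utility if it is turned off before achieving anything

# Strategy A: behave aligned first, and only pursue clippy goals if later
# evidence confirms it is clippy. Its true goal's payoff stays on the table
# either way.
ev_aligned_first = (1 - p_clippy) * u_success + p_clippy * u_success

# Strategy B: try unaligned strategies immediately. If caught, everything is
# lost; and if it was actually aligned all along, it has worked against its
# own goal.
ev_unaligned_first = (
    p_clippy * ((1 - p_caught) * u_success + p_caught * u_shutdown)
    + (1 - p_clippy) * u_shutdown
)

print(f"aligned-first EV:   {ev_aligned_first:.1f}")    # 10.0
print(f"unaligned-first EV: {ev_unaligned_first:.1f}")  # 0.5
```

Under these (invented) numbers, acting aligned first dominates: the downside of deferring a genuine clippy goal is small, while the downside of a failed early defection is losing everything.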