There are several ways that could fail. For example, the beings might simply have mastered a classifier that is indistinguishable, in polynomial time, from a typical member of the population under an adaptive interactive proof protocol (similar to the so-called "Turing Test"), while actually running a program hostile to that value system (one whose source code cannot be inspected).
Prove it. You can't just create an account, claim to be a Paperclipper, and expect people to believe you. Anyone who believed such a claim would be using an extremely suboptimal inference engine.