Short, Extreme, Forgotten Torture vs Death
Turns out Pascal's mugger is real, and, as you'd expect of someone who goes around committing Pascal's muggings, he's a jerk who likes forcing people to make impossible decisions. Also, his threats are discovered to be truthful and credible. He decides he's sick of mugging after collecting a few trillion dollars...
Maybe a bit of a nitpick, but RLHF'd GPT-4o can still detect Eric Drexler's writing (chat link). I gave it the first paragraph of his latest blog post, which was written in February 2024, after 4o's knowledge cutoff of October 2023. More generally, I'm not sure RLHF actually makes models worse at truesight. It would be interesting to see a benchmark comparing, e.g., Llama base vs. instruct on this capability.