examples with -25 karma but +25 agree/disagree points?
At press time, Warty's comment on "The Tale of the Top-Tier Intellect" is at −24/+24 (across 28 and 21 votes, respectively).
I don't necessarily intend to hypothesize deep nonconsent as a terminal preference [...] deep-in-the-sense-of-appearing-in-lots-of-places preference
I think you should have chosen a different word than "deep" ("Inner, underlying, true; relating to one's inner or private being rather than what is visible on the surface.").
"Pervasive", "recurrent", "systematic" ...?
As alluded to by the name of the website, part of Solomonoff/MDL is that there doesn't necessarily have to be a unique "correct" explanation: theories are better to the extent that their predictions pay for their complexity. It's not that compact generators are necessarily "true"; it's that if a compact generator is yielding OK predictions, then more complex theories need to be making better predictions to single themselves out. You shouldn't say that looking for compact generators of a complex phenomenon is asking to be wrong unless you have a way to be less wrong.
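(As a minimal sketch of that tradeoff in two-part-code notation of my own, not anything from the comment I'm replying to: a theory $H$ for data $D$ is scored by the total codelength

$$L(H, D) = L(H) + L(D \mid H),$$

where $L(H)$ measures the theory's complexity and $L(D \mid H)$ the cost of encoding the data given the theory. A more complex rival theory has to buy back its extra $L(H)$ bits with a correspondingly better fit on $L(D \mid H)$.)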
Thanks, it looks like I accidentally typed "connected" instead of "closed"; fixed.
This is well-executed satire; the author should be proud.
That said, it doesn't belong among the top 50 posts of 2024, because this is not a satire website. Compare to Garfinkel et al.'s "On the Impossibility of Supersized Machines". It's a cute joke paper. It's fine for it to be on arXiv. But someone who voted for it in a review of the best papers of 2017 in arXiv's cs.CY "Computers and Society" category would be confessing (or perhaps bragging) that the "Computers and Society" category is fake. Same thing with this website.
(I donated $1K.)
Okay, but now if the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception, how could deception win? Well, the key is that the core logic necessary for deception is simpler: the only thing required for deception is a long-term objective, everything else is unspecified.
I don't understand this. Doesn't the aligned objective also need long-term planning, such that it should also pay for the same long-term-planning bits as a prefix? Then we're back to "a simplicity argument, not a counting argument." (See also my question to Wentworth.)
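(To spell out the cancellation in made-up notation, not anything from the quoted argument: suppose specifying long-term planning costs $k$ bits for either objective, and the residual content of the aligned and deceptive objectives costs $c_a$ and $c_d$ bits respectively. Under a Solomonoff-style $2^{-\text{length}}$ prior,

$$\frac{P(\text{deceptive})}{P(\text{aligned})} \approx \frac{2^{-(k + c_d)}}{2^{-(k + c_a)}} = 2^{c_a - c_d},$$

so the shared long-term-planning prefix cancels, and everything turns on which residual specification is simpler. That's a claim about simplicity, not about counting.)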
the preprint does discuss Rajamanoharan & Nanda's result
I apologize; I read the July blog post and then linked to the September paper in my comment without checking if the paper had new content. I will endeavor to be less careless in the future.
I do think Palisade is operating in the realm of "trying to persuade people of stuff", and that is pretty fraught
Yes, it is "fraught" when an organization that wants to persuade people of X deliberately goes looking for empirical results that could be construed to persuade people of X. It doesn't matter if X = "AI is dangerous" or "cigarettes are safe". It doesn't matter if X is true. (As it happens, I do think that AI is dangerous.) The empirical result not being faked is nice, I guess, but you couldn't blame people for waiting for a third-party replication before putting any weight on it.
it's not very surprising if it turns out I'm turning a blind eye to stuff that I would be more annoyed at if The Other Side were doing it. But I'm not currently seeing it.
[...]
I think there is a sense in which all demos like this are fake and not actually that cruxy, the actual arguments are [...]
But people also just don't seem to be able to reason ahead about what Much Smarter Agents will be like [...]
It seems like you do see it, and you just don't care?
The shutdown resistance paper might be a better example than the chess paper, because we got a followup from a third party: Rajamanoharan and Nanda examined Palisade's shutdown resistance environment, and found that adding language to the prompt saying that the shutdown instruction "takes precedence over all other instructions" brought shutdown resistance to zero. This is in contrast to Palisade's own followup experiments, which used different wording to attempt to instruct the models to comply with shutdown, and found reduced but not eliminated shutdown-resistance behavior.
It's understandable (although not ideal) if Palisade didn't happen to experiment with enough wordings to find what Rajamanoharan and Nanda did. (Of course, the fact that careful prompt wording is needed to eliminate shutdown resistance is itself a safety concern!) But the fact that the Palisade fundraiser post published five months later continues to claim that models "disabled shutdown scripts to keep operating [...] even when explicitly instructed not to" without mentioning Rajamanoharan and Nanda's negative result (even to argue against it) is revealing.
And on one hand, yeah I do think the rationalists need to be worried about this, esp. if we're trying to claim to have a pure epistemic high ground.
The who? I'm not sure whom you're including in this "we", but the use of the conditional "esp. if" implies that you people aren't sure whether you want the pure epistemic high ground (because holding the high ground would make it harder to pursue your political objective). Well, thanks for the transparency. (Seriously! It's much better than the alternative.) When people tell you who they are, believe them.
OK, but is there a version of the MIRI position, more recent than 2022, that's not written for the outgroup?
I'm guessing MIRI's answer is probably something like, "No, and that's fine, because there hasn't been any relevant new evidence since 2022"?
But if you're trying to make the strongest case, I don't think the state of debate in 2022 ever got its four layers.
Take, say, Paul Christiano's 2022 "Where I Agree and Disagree With Eliezer", disagreement #18:
If Christiano is right, that seems like a huge blow to the argumentative structure of If Anyone Builds It. You have a whole chapter in your book denying this.
What is MIRI's response to the "but what about selective breeding" objection? I still don't know! (Yudkowsky affirmed in the comment section that Christiano's post as a whole was a solid contribution.) Is there just no response? I'm not seeing anything in the chapter 4 resources.
If there's no response, then why not? Did you just not get around to it, and this will be addressed now that I've left this comment bringing it to your attention?