is hard to keep secret
Is it actually hard to keep secret, or is it that people aren't trying (because the prestige of publishing an advance is worth more than hoarding the incremental performance improvement for yourself)?
The Sonnet 4.5 system card reiterates the "most thought processes are short enough to display in full" claim that you quote:
As with Claude Sonnet 4 and Claude Opus 4, thought processes from Claude Sonnet 4.5 are summarized by an additional, smaller model if they extend beyond a certain point (that is, after this point the “raw” thought process is no longer shown to the user). However, this happens in only a very small minority of cases: the vast majority of thought processes are shown in full.
But it is intriguing that the displayed Claude CoTs are so legible and "non-weird" compared to what we see from DeepSeek and ChatGPT. Is Anthropic using a significantly different (perhaps less RL-heavy) post-training setup?
Linkpost URL should presumably include "http://" (click currently goes to https://www.lesswrong.com/posts/2CGXGwWysiBnryA6M/www.21civ.com).
- It will probably be possible, with techniques similar to current ones, to create AIs who are similarly smart and similarly good at working in large teams to my friends, and who are similarly reasonable and benevolent to my friends in the time scale of years under normal conditions.
[...]
This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false.
Did you read the book? Chapter 4, "You Don't Get What You Train For", is all about this. I also see reasons to be skeptical, but have you really "not seen MIRI arguing that it's overwhelmingly likely to be false"?
Isn't it, though?
Indeed, I notice in your list above you suspiciously do not list the most common kind of attribute that is attributed to someone facing social punishment. "X is bad" or "X sucks" or "X is evil".
I'm inclined to still count this under "judgments supervene on facts and values." Why is X bad, sucky, evil? These things can't be ontologically basic. Perhaps less articulate members of a mass punishment coalition might not have an answer ("He just is; what do you mean 'why'? You're not an X supporter, are you?"), but somewhere along the chain of command, I expect their masters to offer some sort of justification with some sort of relationship to checkable facts in the real world: "stupid, dishonest, cruel, ugly, &c." being the examples I used in the post; we could keep adding to the list with "fascist, crazy, cowardly, disloyal, &c." but I think you get the idea.
The justification might not be true; as I said in the post, people have an incentive to lie. But the idea that "bad, sucks, evil" are just threats within a social capital system without any even pretextual meaning outside the system flies in the face of experience that people demand pretexts.
Can't you just say that yourself (not all, caricature, parody, uncharitable, exaggerates, &c.) when sharing it? Death of the author, right?
I think Trapaucius missed a great opportunity here to keep riffing off the gravity analogy. Actually, there are different algorithms the planets could be obeying: special and then general relativity turned out to be better approximations than Newtonian gravity, and GR is presumably not the end of the story—and yet, as Trapaucius says, the planets do not "fly off into space." Newton is good enough not just for predicting the night sky (modulo the occasional weird perihelion precession), but even for landing on the moon, for which relativistic deviations from Newtonian predictions were swamped by other sources of error.
Obviously, that's just a facile analogy: if Trapaucius had found that branch of the argument tree, Klurl could easily go into more details about further disanalogies between gravity and the fleshlings.
But I think that the analogy is getting at something important. When relatively smarter real-world fleshlings delude themselves into thinking that Claude Sonnet 4.5 is pretty corrigible because they see it obeying their instructions, they're not arguing, as Trapaucius does, that "Korrigibility is the easiest, simplest, and natural way to think" for a generic mind. They're arguing that Anthropic's post-training procedure successfully pointed to the behavior of natural language instruction-following, which they think is a natural abstraction represented in the pretraining data, one that generalizes in a way that's decision-relevantly good enough for their purposes, such that Claude won't "fly off into space" even if they can't precisely predict how Claude will react to every little quirk of phrasing. They furthermore have some hope that this alleged benign property is robust and useful enough to help humanity navigate the intelligence explosion, even though contemporary language models aren't superintelligences and future AI capabilities will no doubt work differently.
Maybe that's totally delusional, but why is it delusional? I don't think "On Fleshling Safety" (or past work in a similar vein) is doing a good job of making the case. A previous analogy about an alien actress came the closest, but trying to unpack the analogy into a more rigorous argument involves a lot of subtleties that fleshlings are likely to get confused about.