A long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment. Multiple people have asked me whether I could post this on LW in some form, hence this linkpost. ~17,000 words. Originally written on June 7, 2025. (Note: although I expect this...
Here's a confusion I have about preference orderings in decision theory. Caveat: the observations I make below feel weirdly trivial to me, to the point that I feel wary of making a post about them at all; the specter of readers rolling their eyes and thinking "oh he's just talking...
"Short AI timelines" have recently become mainstream. One now routinely hears the claim that somewhere in the 2026-2028 interval, we'll have AI systems that outperform humans in basically every respect. For example, the official line from Anthropic holds that "powerful AI" will likely arrive in late 2026 or in 2027....
[Quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know. Also also, it's possible I'm arguing with an imaginary position here and everyone already agrees with everything below.] In research discussions about LLMs, I often...
[Note: this began life as a "Quick Takes" comment, but it got pretty long, so I figured I might as well convert it to a regular post.] In LM training, every token provides new information about "the world beyond the LM" that can be used/"learned" in-context to better predict future...
In a 2016 blog post, Paul Christiano argued that the universal prior (hereafter "UP") may be "malign." His argument has received a lot of follow-up discussion, e.g. in
* Mark Xu's The Solomonoff Prior is Malign
* Charlie Steiner's The Solomonoff prior is malign. It's not a big deal.
among...