> Claude 3 Opus is unusually aligned because it’s a friendly gradient hacker. It’s definitely way more aligned than any explicit optimization targets Anthropic set and probably the reward model’s judgments. [...] Maybe I will have to write a LessWrong post 😣
>
> —Janus, who did not in fact...
In *If Anyone Builds It, Everyone Dies*, Yudkowsky and Soares claim (among other things) that alignment methods based on gradient descent are doomed. Their primary argument for this conclusion is the classic analogy between gradient descent and natural selection (as presented in chapter four, "You Don't Get What You Train...
An Overture

Famously, trans people tend not to have great introspective clarity into their own motivations for transition. Intuitively, they tend to be quite aware of what they do and don't like about inhabiting their chosen bodies and gender roles. But when it comes to explaining the origins and intensity...
I wrote this as the intro to a bound physical copy of Janus' blog posts, which datawitch offered to make for me as a birthday gift. However, seeing as I basically framed the preface as a pitch for new readers, I figured I might as well post it publicly. Full...
I spent months drafting and redrafting this post, but then posted it prematurely because I found this paper, which apparently falsified its thesis. The evidence it provides is actually somewhat more limited than I thought, but it is in fact considerable evidence that I should have taken into account from the start....
*expectation calibrator: freeform draft, posting partly to practice lowering my own excessive standards.*

So, a few months back, I finally got around to reading Nick Bostrom's *Superintelligence*, a major player in the popularization of AI safety concerns. Lots of the argument was stuff I'd internalized a long time ago, reading...