Models don't "get" reward. Reward is the mechanism by which we select parameters; it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment.
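To make the "selection" framing concrete, here is a minimal REINFORCE-style sketch (a toy bandit of my own, not anything from the post and not actual LLM training): reward only ever shows up as a scalar weight on the parameter update, selecting which parameter changes get kept and amplified; the policy never receives it as an input.

```python
# Toy REINFORCE-style update on a 4-arm bandit (illustration only).
# Note where reward appears: solely as a scalar weighting the gradient step.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)  # policy parameters

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_step(theta, reward_fn, lr=0.1):
    probs = softmax(theta)
    action = rng.choice(len(theta), p=probs)
    reward = reward_fn(action)               # computed by the training harness
    grad_logp = -probs
    grad_logp[action] += 1.0                 # d/dtheta of log pi(action)
    return theta + lr * reward * grad_logp   # reward = weight on the update

for _ in range(500):
    theta = reinforce_step(theta, lambda a: 1.0 if a == 2 else 0.0)

print(theta.argmax())  # parameters have been selected toward arm 2
```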
As a person who frequently posts about large language model psychology, I get an elevated rate of cranks and schizophrenics in my inbox. Often these are well-meaning people who have been spooked by their conversations with ChatGPT (it's always ChatGPT specifically) and want some kind of reassurance or guidance or support from me. I'm also in the same part of the social graph as the "LLM whisperers" (eugh) that Eliezer Yudkowsky described as "insane", and who in many cases are in fact insane. This means I've learned what "psychosis but with LLMs" looks like and kind of learned to tune it out. This new case with Geoff Lewis interests me though. Mostly because of the sheer disparity between what he's being entranced by and my automatic...
Randomly select one out of n conversations to have memory disabled(?) so that the user is occasionally presented with an alternative perspective.
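A minimal sketch of what that could look like in a hypothetical chat backend (the names and the 1-in-n rate are made up for illustration):

```python
import random

MEMORY_HOLDOUT_RATE = 1 / 20  # hypothetical "1 out of n" rate

def start_conversation(user_memory):
    """Occasionally start a conversation with memory disabled, so the user
    sometimes sees how the model responds without their accumulated context."""
    memory_disabled = random.random() < MEMORY_HOLDOUT_RATE
    return {
        "memory": None if memory_disabled else user_memory,
        "memory_disabled_notice": memory_disabled,  # surface this to the user
    }
```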
Memory grosses me out in its current implementations. I'm not even up to using a custom system prompt yet -- I want to stay in touch with the default behaviors of my favorite models for a while longer. I'll eventually have to set up more-custom environments for the productivity boost of not having to re-prompt it into the behaviors I prefer... but for now, I'm re-prompting a bunch of different ways to increase m...
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) degrades performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. We identify five distinct failure modes when models reason for longer:
Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks.
Let's start with an easy example. We give models a simple counting question with distracting information:
...You have
I looked at how this paper actually measures the relationship between reasoning length and performance, and there's a potential confounding issue worth noting:
The authors prompt models to use specific reasoning budgets (like "think for 4,096 tokens") and then measure performance vs. actual tokens used. Within each budget, some responses end up longer than others. The problem: if a model gets confused on a question, it might naturally reason longer AND get the answer wrong, even under the same token budget constraint.
So we might be seeing "confusion causes both...
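A toy simulation of that confound (my own illustration, not the paper's setup): latent per-question "confusion" drives both longer traces and more errors, so length and accuracy correlate negatively within a budget even though longer reasoning never causes a mistake in this model of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
confusion = rng.uniform(0, 1, n)                            # latent difficulty
tokens = 500 + 3500 * confusion + rng.normal(0, 100, n)     # harder -> longer traces
correct = (rng.uniform(0, 1, n) > confusion).astype(float)  # harder -> more errors

print(np.corrcoef(tokens, correct)[0, 1])  # strongly negative, purely confounded
```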
We're writing numbers wrong. We write "365" starting with its most significant digit, "3" (the hundreds). The "biggest number on the left" rule is algorithmically bad and clashes with how humans intuitively represent numbers in their minds. I propose an innocent and totally practical fix: flip the written order of all numbers, writing "↗563" instead of "365." I analyze the implications of this change as they propagate through our language and thought.
Read this article in a prettier form on my website.
If I'm writing "three hundred and sixty-five", "365" becomes "↗563", with the "↗" read as "flip." Likewise, "21,514" becomes "↗415,12." As you move right (→), each digit's magnitude goes up (↑). If you're writing an expression with multiple numbers,...
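Going by the examples in the excerpt, the transformation appears to be plain string reversal plus the "↗" marker; a tiny sketch, assuming that's all there is to it:

```python
def to_flipped(number: str) -> str:
    """Render a conventionally written number in the proposed notation:
    reverse the characters (commas and all) and prefix the flip marker."""
    return "↗" + number[::-1]

assert to_flipped("365") == "↗563"
assert to_flipped("21,514") == "↗415,12"
```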
I don’t understand the argument. This seems just as easy in both systems.
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.
tbc I was surprised by EM in general, just not this particular result
Eliezer and I love to talk about writing. We talk about our own current writing projects, how we’d improve the books we’re reading, and what we want to write next. Sometimes along the way I learn some amazing fact about HPMOR or Project Lawful or one of Eliezer’s other works. “Wow, you’re kidding,” I say, “do your fans know this? I think people would really be interested.”
“I can’t remember,” he usually says. “I don’t think I’ve ever explained that bit before, I’m not sure.”
I decided to interview him more formally, collect as many of those tidbits about HPMOR as I could, and share them with you. I hope you enjoy them.
It’s probably obvious, but there will be many, many spoilers for HPMOR in this article, and also very little...
That would depend on whether he actively considers it as something to rely on, as opposed to an assumption so baked in he forgets to question it, right? If questioned I think Quirrell would rightfully consider the Chamber to be something critical enough to be worth having other contingencies for, but he just never considered it necessary.
Author: Alex Turner. Contributors: Dipika Khullar, Ed Turner, and Roy Rinberg.
Dataset contamination is bad for several reasons. Most obviously, when benchmarks are included in AI training data, those benchmarks no longer measure generalization -- the AI may have been directly taught the answers. Even more concerningly, if your data promote negative "stereotypes" about AIs, they might become self-fulfilling prophecies, training future models to exhibit those very behaviors.
In the Claude 4 system card, Anthropic revealed that approximately 250,000 transcripts from their alignment faking paper had been scraped from the public web and included in their pretraining data. This caused an early model to hallucinate details from the paper's fictional scenarios, forcing Anthropic to implement unique mitigations. Speculatively, this kind of misalignment data could degrade the alignment of any...
[go-away](https://git.gammaspectra.live/git/go-away) is my personal choice.
Doesn’t require weird JS and text-mode browsing like Anubis. Widely(ish) used. Not nuclear like Anubis.
Multiple people have asked me whether I could post this on LW in some form, hence this linkpost.
~17,000 words. Originally written on June 7, 2025.
(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications for my choices of presentation and tone, for which things I explained from scratch rather than assuming as background, for my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.
Although, come to think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a big deal...)
And it could do that, effectively, with all the so-called “pre-training” data, the stuff written by real people... The assistant transcripts are different. If human minds were involved in their construction, it was only because humans were writing words for the assistant as a fictional character, playing the role of science-fiction authors rather than speaking for themselves. In this process, there was no real mind – human or otherwise – “inhabiting” the assistant role that some of the resulting text portrays.
But the base model already has to predict non-w...
No, seriously. If you look at the substance, it’s pretty good.
I’ll go over the whole thing in detail, including the three executive actions implementing some of the provisions. Then as a postscript I’ll cover other reactions.
The White House Issues a Pretty Good AI Action Plan
There is a lot of the kind of rhetoric you would expect from a Trump White House. Where it does not bear directly on the actual contents and key concerns, I did my absolute best to ignore all the potshots. The focus should stay on the actual proposals.
The actual proposals, which are the part that matters, are far superior to the rhetoric.
This is a far better plan than I expected. There are a few points of definite concern, where the wording is ambiguous...
Yes, it's competently executed
Is it?
It certainly signals that the authors have a competent grasp of the AI industry and its mainstream models of what's happening. But is it actually competent AI-policy work, even under the e/acc agenda?
My impression is that no, it's not. It seems to live in an e/acc fanfic about a competent US racing to AGI, not in reality. It vaguely recommends doing a thousand things that would be nontrivial to execute if the Eye of Sauron were looking directly at them, and the Eye is very much not doing that. On the contrary, the wider ...
Author's note: These days, my thoughts go onto my substack by default, instead of onto LessWrong. Everything I write becomes free after a week or so, but it’s only paid subscriptions that make it possible for me to write. If you find a coffee’s worth of value in this or any of my other work, please consider signing up to support me; every bill I can pay with writing is a bill I don’t have to pay by doing other stuff instead. I also accept and greatly appreciate one-time donations of any size.
You’ve probably seen that scene where someone reaches out to give a comforting hug to the poor sad abused traumatized orphan and/or battered wife character, and the poor sad abused traumatized orphan and/or battered wife...
I've seen people mention Eternal September on the internet, not frequently, but over the years, and my current model of this event (it happened before my time and I didn't witness it) is that it's exactly an instance of "the separation between a space's culture and the outside culture can break if too many new people enter at once, or if a single too-incompatible person joins, but despite this such spaces can still exist for years."
People had a nice culture, new people joined every September and were disruptive at first, and it took time for them to acculturate, but the ratio ...