You've probably heard about the "tit-for-tat" strategy in the iterated prisoner's dilemma. But have you heard of the Pavlov strategy? This simple strategy performs surprisingly well under certain conditions. Why don't we talk about the Pavlov strategy as much as tit-for-tat?
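For those unfamiliar, Pavlov is "win-stay, lose-shift": repeat your last move if it earned a good payoff, switch otherwise. Here is a minimal sketch in Python (the payoff values and function names are illustrative, not from the original post):

```python
# Minimal sketch: Pavlov (win-stay, lose-shift) vs. tit-for-tat in an
# iterated prisoner's dilemma, using the standard T=5, R=3, P=1, S=0 payoffs.

PAYOFF = {  # (my_move, their_move) -> my_payoff; "C" = cooperate, "D" = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(my_history, their_history):
    # Cooperate first, then copy the opponent's previous move.
    return their_history[-1] if their_history else "C"

def pavlov(my_history, their_history):
    # Win-stay, lose-shift: if the last payoff was good (R or T, i.e. the
    # opponent cooperated), repeat the last move; otherwise switch.
    if not my_history:
        return "C"
    last_payoff = PAYOFF[(my_history[-1], their_history[-1])]
    if last_payoff >= 3:                               # "win": stay
        return my_history[-1]
    return "D" if my_history[-1] == "C" else "C"       # "lose": shift

def play(strategy_a, strategy_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(pavlov, tit_for_tat))  # both cooperate throughout: (300, 300)
```

One reason Pavlov does well in noisy tournaments: when two Pavlov players stumble into mutual defection through a mistaken move, they return to cooperation within two rounds, whereas two tit-for-tat players can lock into endless alternating retaliation; under noise, Pavlov will also lock into exploiting unconditional cooperators.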
When r1 was released in January 2025, there was a DeepSeek moment.
When r1-0528 was released in May 2025, there was no moment. Very little talk.
Here is a download link for DeepSeek-R1-0528-GGUF.
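If you would rather fetch the weights programmatically than through the browser, something like the following works with `huggingface_hub`; the repo id and quantization pattern here are assumptions, so substitute whatever the actual hosting page lists:

```python
# Hedged sketch: pull a single GGUF quantization of R1-0528 from the
# Hugging Face Hub rather than downloading every file in the repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-0528-GGUF",  # assumed mirror; verify before use
    allow_patterns=["*Q4_K_M*"],              # fetch only one quantization
    local_dir="DeepSeek-R1-0528-GGUF",
)
```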
It seems like a solid upgrade. If anything, I wonder if we are underreacting, and this illustrates how hard it is getting to evaluate which models are actually good.
What this is not is the proper r2, nor do we have v4. I continue to think that will be a telltale moment.
For now, what we have seems to be (but we’re not sure) a model that is solid for its price and status as an open model, but definitely not at the frontier, that you’d use if and only if you wanted to do something that was a...
Today we finally got the LMArena results for the new R1. They are quite impressive overall and in coding, less so in math.
[I will move this into meta in a few days, but this seemed important enough to have around on the frontpage for a bit]
Here is a short post with some of the moderation changes we are implementing. Ray, Ben, and I are working on some more posts explaining some of our deeper reasoning, so this is just a list with some quick updates.
Even before the start of the open beta, I intended to allow trusted users to moderate their personal pages. The reasoning I outlined in our initial announcement post was as follows:
“We want to give trusted authors moderation powers for the discussions on their own posts, allowing them to foster their own discussion norms, and giving them their own sphere of influence on the discussion...
There are so many critical posts just here on LessWrong that I feel like we are living in different worlds. The second most upvoted post on the entire site is a critique, and there's dozens more about everything from AI alignment to discussion norms.
We are having another rationalist Shabbat event at Rainbow Star House this Friday. The plan going forward will be to do one most Fridays. Email or DM me for the address if you haven’t been before.
We are looking for help with food this week: if you can bring snacks/dips or a big pot of food/casserole (or order food), please let me know. These events will only be sustainable for us if we can keep getting help from the community, so please pitch in if you can!
What is this event?
At rationalist Shabbat each week, we light candles, sing Landsailor, eat together, and discuss topics of interest and relevance to the rationalist crowd. If you have suggestions for topics, would like to help contribute food, or otherwise assist with organizing, let us know.
This is a kid-friendly event: we have young kids, so we have space and toys for them to play and hang out while the adults are chatting.
and is much weaker than what I thought Ben was arguing for.
I don't think Ryan (or I) was intending to imply a measure of degree, so my guess is that communication unfortunately still failed somehow. Like, I don't think Ryan (or Ben) is saying "it's OK to do these things, you just have to ask for consent". Ryan was just trying to point out a specific way in which things don't bottom out in consequentialist analysis.
If you end up walking away thinking that Ben believes "the key thing to get right for AI companies is to ask for consent before building the doo...
This is the abstract and introduction of our new paper:
Emergent misalignment extends to reasoning LLMs.
Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training).
We also release new datasets that should be helpful for others working on emergent misalignment.
Twitter thread | Full paper | Dataset
Figure 1: Reasoning models trained on dangerous medical advice become generally misaligned (emergent misalignment). Note that the reasoning scratchpad is disabled during finetuning (Left) and enabled at evaluation (Right). Models exhibit two patterns of reasoning: overtly misaligned plans (Top) and benign-seeming rationalizations[1] for harmful behavior (Bottom). The latter pattern is concerning because it may bypass CoT monitors.
Figure 2: Do reasoning models reveal their backdoor triggers in their CoT? Detecting backdoor misalignment can be tricky in the cases...
I don't think this really tracks. I don't think I've seen many people want to "become part of the political right", and it's not even the case that many people voted for Republicans in recent elections (indeed, my guess is fewer rationalists voted for Republicans in the last three elections than in previous ones).
I do think it's the case that on a decade scale people have become more anti-left. I think some of that is explained by background shift. Wokeness is on the decline, and anti-wokeness is more popular, so base rates are shifting. Additionally, people t...
A while ago I saw a person in the comments to Scott Alexander's blog arguing that a superintelligent AI would not be able to do anything too weird and that "intelligence is not magic", hence it's Business As Usual.
Of course, in a purely technical sense, he's right. No matter how intelligent you are, you cannot override fundamental laws of physics. But people (myself included) have a fairly low threshold for what counts as "magic," to the point where other humans (not even AI) can surpass that threshold.
Example 1: Trevor Rainbolt. There is an 8-minute-long video where he does seemingly impossible things, such as correctly guessing that a photo of nothing but literal blue sky was taken in Indonesia or guessing Jordan based only on pavement. He can...
Organizations can't spawn copies for linear cost increases, can't run at faster than human speeds, and generally suck at project management due to incentives. LLM agent systems seem poised to be insanely more powerful.
Thanks for covering this. I urge all New York State residents (and heck, everyone else) to call Governor Hochul and urge her to sign the bill. Such interventions really do matter!
We tried to figure out how a model's beliefs change during a chain-of-thought (CoT) when solving a logical problem. Measuring this could reveal which parts of the CoT actually causally influence the final answer and which are just fake reasoning manufactured to sound plausible. (Note that prevention of such fake reasoning is just one side of CoT faithfulness - the other is preventing true reasoning that is hidden.)
We estimate the beliefs by truncating the CoT early and asking the model for an answer. Naively, one might expect the probability of a correct answer to increase smoothly over the whole CoT. However, it turns out that even for a straightforward and short chain of thought, the value of P[correct_answer] fluctuates a lot with the number of CoT tokens...
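A minimal sketch of that probing loop, assuming a HuggingFace causal LM and a single-token answer (the stand-in model, prompts, and step size are illustrative choices, not the setup from the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Generate one full chain of thought for a question with a known answer.
question = "Q: Is 91 prime? Reason step by step.\nReasoning:"
q_ids = tok(question, return_tensors="pt").input_ids
full = model.generate(q_ids, max_new_tokens=200, do_sample=False)
cot_ids = full[0, q_ids.shape[1]:]  # just the generated CoT tokens

# Probe: keep only the first k CoT tokens, force an answer, and read off
# the probability of the correct answer ("No": 91 = 7 * 13). Assumes the
# answer fits in a single token.
answer_prompt = tok("\nAnswer (Yes/No):", add_special_tokens=False,
                    return_tensors="pt").input_ids[0]
no_id = tok(" No", add_special_tokens=False).input_ids[0]

for k in range(0, len(cot_ids), 10):
    ids = torch.cat([q_ids[0], cot_ids[:k], answer_prompt]).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    p_correct = torch.softmax(logits, dim=-1)[no_id].item()
    print(f"CoT tokens kept: {k:4d}   P[correct_answer] = {p_correct:.3f}")
```

Plotting `p_correct` against `k` is what reveals the fluctuations described above: tokens whose removal moves the probability are plausibly causal, while spans that leave it flat are candidates for post-hoc rationalization.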