Models don't "get" reward. Reward is the mechanism by which we select parameters; it is not something "given" to the model. Reinforcement learning should be viewed through the lens of selection, not the lens of incentivisation. This has implications for how one should think about AI alignment.

To anyone currently going through the fun of NeurIPS rebuttals for the first time, some advice:

Firstly, if you're feeling down about reviews, remember that peer review has been officially shown to be a ridiculous random number generator in an RCT - half of spotlight papers are rejected by another review committee! Don't tie your self-worth to whether the roulette wheel landed on black or red. If their critiques don't make sense to you, they often genuinely don't (and were plausibly written by an LLM). And if they do make sense (and remember to control for your defensiveness), then this is great - you have valuable feedback that can improve the paper!

  1. Read this guide to get a sense of what rebuttals are about
    1. Generally, be nice and polite, even if your reviewers are really annoying
  2. You have three goals here:
    1. Improving the paper! Often reviewers raise some good and useful points, and ultimately one of the key goals is doing good research and communicating it well to the world
    2. Convince reviewers to like you, so they increase their score
    3. For unreasonable reviewers who dislike you, your goal is to convince the area chair (and other reviewers) that this person is wrong and unreasonable. This means you should still write a careful and well-argued rebuttal, even to obnoxious reviewers, but have a different target audience in mind.
      1. Meta: The way the process works is that the area chair makes the final decision, and has a lot of discretion to overrule reviewers, but by default if lazy will go by the average reviewer score. You want to either increase average reviewer score, or convince the area chair to ignore the bad ones. Convincing a reviewer is just a means to the end.
  3. One of the key things to do in the rebuttal is to improve the paper. Realistically, you can't upload a new version, so your actual goal is to convince people that you have improved the paper. It is an adversarial setting and people will generally assume you are lying if you just give empty words, especially if you just say you will do X by the camera-ready. So the key question to ask is: how can you show proof of work? Running experiments and reporting the results is one good way (or even just saying that you've done them).
  4. A common piece of feedback is "this is badly written".
    1. This is common because it's often true! Writing papers is hard (some advice).
    2. If you receive this feedback, try to fix it (eg give an LLM the reviews and your paper, and maybe my post, and ask it to give concrete feedback on how to improve things, along with quotes). This will improve the paper even if you don't get in
    3. One difficulty is that even if you improve the writing a bunch, this is hard to convince anyone of in the rebuttal, since they're normally not willing to re-read in detail (and NeurIPS doesn't even let you re-upload).
    4. My best strategy is to make a long changelog of what you improved, to signal high effort, and put it in a top-level comment
    5. If the reviewer complained about a specific paragraph or section, copy in the reworded version of that
    6. It often helps to add an appendix with a glossary for key terms, ideally both intuitive and technical definitions
  5. I recommend the following process:
    1. Copy all reviews to a google doc.
    2. Go through and comment on each complaint in each review, sorting them into misunderstandings, disagreements with you, presentation issues, and technical issues - either do this while on a call, or async
    3. Brainstorm how to address each complaint - prioritise the important ones
    4. Write a bullet point outline, and try to get feedback
    5. Write it up nicely and send.
  6. You typically want to have a comment per review, and a top level comment covering critiques from multiple reviewers.
    1. Picture the area chair as your audience for the top level comment. You want to begin with a paragraph about the strengths of your paper, as noted by reviewers, supported by reviewer quotes - imagine you're writing something the area chair can copy and paste into a meta-review about accepting you. Reviewer quotes are key for any positive claims as no one will trust you to be honest.
    2. If one reviewer hates you, the top-level comment is a good opportunity to try to discredit them by emphasising how other reviewers disagree, as politely as possible. For example: "We appreciated the constructive critique from [bad reviewer] that X, and have changed Y to fix it. But we are glad to see that [good reviewers] A and B thought Z", followed by quotes supporting Z from the good reviewers, where Z contradicts X as much as possible.
  7. Some technical complaints are best addressed by doing new experiments - you have 1-2 weeks to do this but should ask yourself how long it'll take and whether this is the best use of time. Time is constrained and you want to maximise returns per unit time, and new experiments often take much longer than writing or conceptual rebuttals - prioritise these carefully.

xAI's safety team is 3 people.

Long have I searched for an intuitive name for motte & bailey that I wouldn't have to explain too much in conversation. I might have finally found it. The "I was merely saying fallacy". Verb: merelysay. Noun: merelysayism. Example: "You said you could cure cancer and now you're merelysaying you help the body fight colon cancer only."

You will always oversample from the most annoying members of a class.

This is inspired by recent arguments on twitter about how vegans and poly people "always" bring up those facts. I contend that it's simultaneously true that most vegans and poly people are not judgmental, and that it doesn't matter, because those aren't the ones people remember. Omnivores don't notice the 9 vegans who quietly ordered an unsatisfying salad, only the vegan who brought up factory farming conditions at the table. Vegans who just want to abstain from animal products remember the omnivore who ordered the veal on purpose and made little bleating noises.

And then it spirals. A mono person who had an interaction with an aggro poly person will be quicker to hear judgement in the next poly person's tone, and vice versa. This is especially bad because lots of us are judging others a little. We're quiet about it, we place it in context instead of damning people for a single flaw, but we do exercise our right to have opinions. Or maybe we're not judging the fact, just the logistical impact on us. It is pretty annoying to keep your mouth shut about an issue you view as morally important or a claim on your time, only to have someone demand you placate them about their own choices. 

AFAICT this principle covers every single group on earth. Conservatives hear from the most annoying liberals. Communists hear from the most annoying libertarians. Every hobby will be publicly represented by its members who are least deterred by an uninterested audience. 

The 50M H100 equivalent compute by 2030 figure tweeted by Musk is on trend (assuming a 2028 slowdown) and might cost about $300bn in total (for the training systems built in 2025-2030 for one AI company, including the buildings and power infrastructure).

If the current trend of compute scaling continues to 2028, there will be 160x more compute per training system than the 100K H100s of 2024. It will require 5 GW of power and cost about $140bn in compute hardware and an additional $60bn in buildings, power, and cooling infrastructure[1].

However, if the slowdown starts earlier while still targeting an eventual spend of $100bn per year, and a 5 GW frontier AI training system isn't yet built in 2028-2029 (which seems plausible), building it in 2030 would use the next generation of compute hardware, which will be about 2x more performant for an approximately unchanged cost. This means 320x more compute than the 100K H100 systems of 2024, or 32M H100 equivalent compute. Summing this with the preceding generations of frontier AI training systems built for the same company, say 2 GW in 2028 and 1 GW in 2026, gives us 40M H100 equivalents, which is the same as 50M given the error bars on these estimates (or we get that directly if the slowdown only starts between 2028 and 2030). Summing up the costs for the older systems as well, we get to about $300bn (or $450bn if a 5 GW system is built in 2028, and then another one in 2030).


  1. Let's start with the anchor of $15bn of Stargate Abilene in 2026 for 1.2 GW (which seems consistent in cost per MW with other similar announcements). The power that seems actually necessary for its 400K Blackwell chips together with everything else looks more like 900 MW.

    Rubin Ultra racks of 2028 are 600 kW per rack, 4.5x up from the current 130 kW per rack, so the total area needed to build a 5 GW training system in 2028 might only be 2x greater than that of the 1 GW training systems from 2026. My guess of $60bn sits between the $30bn implied by scaling with building area and the $70bn implied by scaling with power. ↩︎
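To make the arithmetic behind these totals easy to check, here is a small sketch using the figures above. The assumption that H100-equivalents scale linearly with power within a hardware generation, and that the 2026 generation is roughly half as performant per GW as the 2028 one, is mine; it only needs to hold approximately for the totals to come out as stated.

```python
# Rough check of the H100-equivalent totals above (figures from the text unless noted).
H100_2024 = 100_000                          # chips in the 2024 reference system (~100K H100s)

h100e_per_gw_2028 = 160 * H100_2024 / 5      # 2028 hardware: 160x the 2024 compute at 5 GW
h100e_per_gw_2026 = h100e_per_gw_2028 / 2    # my assumption: previous generation ~half as good per GW

system_2030 = 320 * H100_2024                # 5 GW on 2030 hardware (2x better than 2028): 32M H100e
system_2028 = 2 * h100e_per_gw_2028          # 2 GW on 2028 hardware: ~6.4M H100e
system_2026 = 1 * h100e_per_gw_2026          # 1 GW on 2026 hardware: ~1.6M H100e

total = system_2030 + system_2028 + system_2026
print(f"{total / 1e6:.0f}M H100 equivalents")  # ~40M, i.e. "50M" within the error bars
```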

Popular Comments

Alright. Long stream-of-consciousness comment incoming. I do apologize for my tone below a bit, but refining it to make it more neutral would have taken even more of my time than this did; unfortunately, it has ended up as less of a compilation of questions and more just bullet points where I complain about what I disliked. Many of my own criticisms of and disappointments with HPMOR reflect parts of what su3su2u1 wrote about a long time ago.[1] Unsurprisingly, HPMOR fans find it tough to read such obviously sneery commentary, so I think Alexander Wales's excellent review of the story serves as a more than worthwhile replacement (and perhaps useful background reading for my comment here). But to write out my own thoughts explicitly and perhaps focus on what seem to me like the key topics:

  • In Who's the Main Character, Eliezer repeats part of what he wrote about a long time ago, namely asserting that HPMOR is not about one person, one character, one guy against the whole world made of NPCs (even though Harry thinks of himself that way sometimes), but is instead significantly more complex and realistic. Specifically, Eliezer claims there are 4 characters which make decisions that move the story forward. Perhaps this may be what he intended in the story, but it definitely does not read that way to me as I read it. There is one character proactively moving the plot forward over the course of the events that unfold, and that character is Quirrell. Dumbledore? He doesn't take agency over anything for 90% of the story; he had set up the pieces well in advance, and he shows up at the end, but the actual day-to-day activities and the events that result in the ultimate confrontation between the hero and the villain unfold without his direct involvement. He is more a force of nature bringing forth Acts of God in a way even he doesn't understand than an actual character making deliberate, reasoned decisions to influence what happens, over the course of the actual plot. Hermione? Actually, seriously, what does Hermione do[2] that matters to the primary plot? The SPHEW arc was (rightfully IMO) seen by many readers at the time as boring; that wasn't because fighting bullies is inherently boring,[3] or because they were all sexist misogynists, but because it has very little to do with what the story was about before, and with what the story was building to afterwards. Harry? Harry also does very little in the story; he talks a lot, he's the main character, he speaks about his ideals and what he wants to achieve etc, but what actual agency does he take over events that matter to the primary plot of the story? He serves as Quirrell's puppet: Quirrell says the afterlife doesn't exist, Harry believes him; Quirrell says we should storm Azkaban, Harry says 'of course!'; Quirrell lies in bed sick, Harry's thoughts are only on Quirrell; Quirrell literally casts the Avada Kedavra curse at an Auror doing his job, Harry doesn't care one bit after hearing one line of explanation from his mentor. Harry says he wants to defeat Death, but does he do anything to bring that about? No! Quirrell is the one who defeats death and becomes immortal, Quirrell is the one who revives Hermione, Quirrell is the one who brings Harry the Ultimate Stone to Do Everything. Harry just mopes about complaining about how unfair the world is and how bad it is that everything isn't Optimal, and everyone else just solves all his issues for him.[4] Harry is literally fated to tear apart the very stars in heaven, and Quirrell is the one who solves this by forcing him into a carefully-constructed Unbreakable Vow that literally prevents him from saying and doing world-ending crap within days of its enactment! For all the Trope Awareness Harry and even HPMOR itself both signal, Villains Act, Heroes Mope About is in full force here.

  • I recall reading somewhere (can't recall the link off the top of my head) that the difference between a nerd reader and a "regular" reader is that a nerd reader cares most about worldbuilding, while "regular" readers care most about characters. Nerdiness aside, Eliezer obviously cares very deeply about constructing good characters (even writing advice about how to do that, and talking about this at length in this very post). So let's talk about Harry's character arc for a second here. I... find it kind of difficult to do that, because there's very little to talk about. This is deeply disappointing, given he's the primary viewpoint character in a story totaling over 500 thousand words. Eliezer likes to talk about the fact that Harry fails a lot in HPMOR. And yes, he does fail.[5] But what's critical is that there are almost never real consequences to him failing. Harry messes up and breaks his commitments and loses the Time Turner... oh wait, no problem, Quirrell (ha, of course it's Quirrell! who else could be allowed to have real agency?) just happens to have a Time Turner himself, so none of that matters! Harry tries to blackmail and deceive McGonagall at the beginning to obtain information and enforce his will (him, a kid, entirely unfamiliar with the magical world, versus her, a witch, old, experienced, respected) - surely that will result in her losing respect for him and his reputation being dragged into the toilets... ha, just kidding, Minerva now treats him almost as an equal! Harry is thrust into a deep and important conversation with the wily and politically powerful Lucius Malfoy where he doesn't know what's going on... Lucius ends up confused and impressed with Harry. Harry accidentally lets his mouth speak faster than his brain can catch up and he cures Snape's obsession with Lily Potter... no negative consequences flowing from that. Harry escalates and escalates and escalates against Snape because he thinks this is a fairy tale and he's the hero[6] - surely now he will get the slapdown from Wise Old Wizards like Dumbledore... no, of course not, Harry outmaneuvers and outwits everyone in the story to get his way! He doesn't even learn any lesson from that; in the Wizengamot meeting, he does the exact same thing to protect Hermione, in front of wizards more powerful, old, and knowledgeable than he can imagine, and... he succeeds masterfully, obviously! Does he do that because of his deep understanding of wizard psychology? No, he just Plays the Game at a Different Level with his half-baked, half-forgotten first-year-undergrad-in-psych facts and logic, and the brains of all these hundred-year-old politicians and wizards are blown. In fact, there is only one character Harry doesn't get to outwit in the game of Levels in this story, and that character is... Quirrell (of course). Harry realizes he has a Dark Side and he needs to keep it in check... ha, just kidding, the Dark Side solves his every issue and he never faces negative consequences from employing it! Harry breaks Bellatrix out of Azkaban because Quirrell said so... and the consequences are tiny and far-off and frankly I can't be bothered to care about them because they only appear in Chapter 110 and that chapter sucks for unrelated reasons that break my suspension of disbelief so bad I can't even think about Flamel.[7] Harry learns the power of Friendship and teamwork and cooperation from the Ender's Game pastiche, and he realizes going at it alone won't be enough... and then he kills (read: brutally and bloodily slaughters like cattle[8]) all the Death Eaters and vanquishes Voldemort through his own wand. Ironic, isn't it? The one action he proactively takes in this story, he does all by himself; if that's not Aesop Amnesia, I don't know what is.

  • I really can't sum it up any better than Alexander Wales did, in explaining how Harry actually undergoes a character involution if anything: "Harry is never given any incentive to change, and never really shows any change. The character growth arc is implied, but for the most part not actually present. Harry does not win the climax of the fic by having overcome his flaws, he wins it through brutal murder. The biggest organic change he undergoes is from believing in the value of truth to advocating for multiple conspiracies against both the wizarding and muggle worlds, and if that's character growth, I find it ugly."

  • What's worse about the brutal murder part isn't that it happened. In fact, it's totally ok for it to have happened; the world needs an actually good Rationalfic where the hero says "screw the Batman ethos, it's nonsensical from a consequentialist perspective!" The problem is that, as revealed in Chapter 115, the story is embarrassed about it. It doesn't strike the triumphalist note of success over the enemy[9], it doesn't backtrack and have Harry admit remorse or regret over the killing of Death Eaters, it just kind of wants us to forget about all that by just focusing on Quirrell (of course) as the one not deserving of being killed, because nobody deserves to be killed and he should instead one day live out his dream of sailing to the stars. Too bad for all the other Nameless Mooks that just got slaughtered, who may have had their own dreams... Ironically, I guess in HPMOR one supervillain death is a tragedy and all Death Eaters dead is a statistic.

  • While HPMOR is realistic in a sense (I suppose), the SPHEW arc is not. It presents a cartoonish view of bullies and their psychology, and does not attempt at any point to explain why reasonable authority figures like Minerva, who obviously both care deeply about ensuring the psychological and especially physical safety of the students and also have a ton of power over and respect from the students, allow something like this to happen. I can understand why Dumbledore didn't step in; he believes heroes are born in Tough Times when they realize authority figures won't save them. How about everyone else? The entire system, the oversight over Hogwarts from the rest of the magical world, the families of the students being bullied... it's the Wild West out here and nobody is batting an eye? Even if that can be explained in context, it needs the explanation! Otherwise it just looks and seems cartoonish and turns people off (as the SPHEW arc indeed did).

  • Chapter 110 has Dumbledore hold the Idiot Ball very strongly, in a way Eliezer said no major character in the story would. This unfortunately both shatters the suspension of disbelief and the reader's immersion into the story, and also makes the chapter feel worse and worse with every re-reading.

  • Eliezer writes about how Orson Scott Card said "while a conflict between good and evil might hold the attention of some readers, a conflict between good and good can be much stronger than that." The problem is that, in HPMOR itself, the grand finale, the grand conflict between Harry and Quirrell... doesn't happen because of a conflict between good and good. It doesn't happen because of fundamentally irreconcilable moral differences between the protagonist and the villain. It doesn't happen because Harry and Quirrell disagree over any predictive aspect of how the world will be if certain actions are taken. It happens because of prophecy. Quirrell would have no reason to go against Harry, and indeed did not go against Harry, until he heard Trelawney's second prophecy. As revealed in Parseltongue, where there can be no lies, Quirrell would have loved to just play a game with Harry for the rest of time where they just keep themselves entertained and fool the masses, where he teaches Harry the secret of the new Horcrux spell and makes him immortal and keeps him as his equal for all of eternity. It is entirely an external impetus that causes them to go against one another, like the Voice of God telling them they should fight instead of there being an organic cause of their battle. This is very much less interesting than the alternative.

I'm too tired now to keep lengthening this comment, even though I have multiple other issues with HPMOR. Perhaps I'll expand on them some other time.

  1. ^ Even though I genuinely and unironically enjoyed reading the story, unlike the Sneer Club
  2. ^ As opposed to having stuff be done to her (being framed, being killed, being revived... notice how she is not the actor, the agent, in any of that)
  3. ^ But wait... more on that later!
  4. ^ Until Chapter 114, but wait... more on that later!
  5. ^ Kind of. Not really in any important ways... more on that later!
  6. ^ At the very least this is actually talked about in the text itself as a blunder from Harry, but ACTIONS SPEAK LOUDER THAN WORDS! Quirrell (ha, of course it was... ok, you guys get the point by now) says it was a dumb thing to do, Harry ultimately agrees, and... nothing comes of it. No lasting consequences, no real lesson
  7. ^ More on this later!
  8. ^ How's that for a death-hating protagonist!
  9. ^ Except by emphasizing Sunshine and Friendship and Goodness... after dozens of wizards just got sliced
Here is a list of numbers.  Which two of these numbers are closest together? 815 187 733 812 142 312
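(If you want to check the answer, a quick sketch: the closest pair has to be adjacent once the list is sorted, so a few lines of Python settle it.)

```python
nums = [815, 187, 733, 812, 142, 312]
pairs = zip(sorted(nums), sorted(nums)[1:])   # adjacent pairs after sorting
a, b = min(pairs, key=lambda p: p[1] - p[0])  # pair with the smallest gap
print(a, b)  # 812 815, a gap of only 3
```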
The economics here seem wrong. Poor people do not benefit less from debt than rich people do - they benefit vastly more, because they have major cashflow and liquidity issues. (I wouldn't go so far as to claim interest rates are the least important things about debt, but they are discussed disproportionately compared to aspects like credit availability.) They do not shun debt, but use it in countless highly complex forms to deal with the challenges of routinely running out of money before the next pay period, or the 'feast or famine' variance in payments that to a richer (ie. middle-class) person would barely register as a blip on their bank account balance. Arbitraging 2% vs 5% ROI is trivial compared to arbitraging 'not getting evicted' or 'not getting fired'. (Borrowing $20 for gas in the next half hour can be life-changing; getting 40 cents extra on your Vanguard ETF retirement account 50 years later is not.) A useful reference for me for understanding this was reading Portfolios of the Poor. Incidentally, I would note that Polonius is an aristocrat speaking to an aristocrat (and about to be killed through his own foolishness), and his advice should not be taken on finance, or perhaps anything else either.

Recent Discussion

As a person who frequently posts about large language model psychology I get an elevated rate of cranks and schizophrenics in my inbox. Often these are well meaning people who have been spooked by their conversations with ChatGPT (it's always ChatGPT specifically) and want some kind of reassurance or guidance or support from me. I'm also in the same part of the social graph as the "LLM whisperers" (eugh) that Eliezer Yudkowsky described as "insane", and who in many cases are in fact insane. This means I've learned what "psychosis but with LLMs" looks like and kind of learned to tune it out. This new case with Geoff Lewis interests me though. Mostly because of the sheer disparity between what he's being entranced by and my automatic...

nim

Randomly select one out of n conversations to have memory disabled(?) so that the user is occasionally presented with an alternative perspective.

Memory grosses me out in its current implementations. I'm not even up to using a custom system prompt yet -- I want to stay in touch with the default behaviors of my favorite models for a while longer. I'll eventually have to set up more-custom environments for the productivity boost of not having to re-prompt it into the behaviors I prefer... but for now, I'm re-prompting a bunch of different ways to increase m... (read more)

jdp
Recently on Twitter someone in my replies told me it was not obvious to them that the ChatGPT persona is lying (according to its subjective beliefs) when it says it is not conscious. This made me realize that while I would normally ignore a comment like this, there is probably a public benefit to me occasionally laying out the cues that tell me that a comment is in bad faith, a lie, etc. Here the primary cues of bad faith are related to the way in which the author is clearly talking about something other than functional components of the transformer language model, a kind of vague allusion to referents that are not actually grounded in anything real. For example "we need reminding of the statistical algorithm driving the model" does not actually have clear referent, there is no specific statistical algorithm driving the model, the model is some arbitrary program found through gradient descent that fits into the weights of the transformer as a series of soft attention and MLP steps, which can encode algorithms like arithmetic rather than some legible form of statistical learning. Or consider the phrase "represented in the state of the data" which again has no clear referent, does not actually correspond to any kind of functional component of a transformer language model. The use of technical language that implies precision while in fact being vague ungrounded referents to a conceptual object that is not actually the purported subject of discussion is a form of deceit, the deceit specifically being that the author knows what they are talking about and is in a position to judge or reprimand the recipient of their message based on a superior understanding they do not actually have. "The LLM has its training and the conversation context" is again a phrase that does not actually mean (let alone prove) anything because it is not really known what the artifact you get from LLM training is, it is an open research problem to disentangle the weights and figure out what kind o
This is a linkpost for https://arxiv.org/abs/2507.14417

We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. We identify five distinct failure modes when models reason for longer:

  • Claude models become increasingly distracted by irrelevant information
  • OpenAI o-series models resist distractors but overfit to problem framings
  • Models shift from reasonable priors to spurious correlations
  • All models show difficulties in maintaining focus on complex deductive tasks
  • Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation.

Setup

Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks.

Simple Counting Tasks with Distractors

Let's start with an easy example. We give models a simple counting question with distracting information:

You have

...

I looked at how this paper actually measures the relationship between reasoning length and performance, and there's a potential confounding issue worth noting:

The authors prompt models to use specific reasoning budgets (like "think for 4,096 tokens") then measure performance vs. actual tokens used. Within each budget, some responses end up longer than others. The problem: if a model gets confused on a question, it might naturally reason longer AND get the answer wrong, even under the same token budget constraint.

So we might be seeing "confusion causes both... (read more)
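One way to separate those two stories, assuming per-response records of (instructed budget, actual reasoning tokens, correctness) are available: compare accuracy across budget conditions, where length is experimentally assigned, rather than across responses within a condition, where length is partly the model's own doing. The column and function names below are illustrative, not the paper's.

```python
import pandas as pd

# df columns (illustrative): "budget" = instructed reasoning budget (tokens),
# "tokens" = reasoning tokens actually used, "correct" = 0/1 outcome.

def length_effects(df: pd.DataFrame) -> pd.DataFrame:
    """Contrast the across-budget estimate with the confounded within-budget one."""
    # Across budgets: length varies because we asked for it, so a downward trend
    # here is evidence for genuine inverse scaling.
    across = df.groupby("budget")["correct"].mean().rename("accuracy_by_budget")

    # Within budgets: length varies because the model "chose" to reason longer,
    # which is exactly where confusion -> (longer AND wrong) can bite.
    within = (
        df.groupby("budget")[["tokens", "correct"]]
          .apply(lambda g: g["tokens"].corr(g["correct"]))
          .rename("corr_tokens_correct_within_budget")
    )
    return pd.concat([across, within], axis=1)
```

If accuracy falls as the assigned budget grows while the within-budget correlation is also strongly negative, both inverse scaling and the confusion confound are plausibly in play.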

We're writing numbers wrong. We write "365" starting with the most significant digit, the "3" (hundreds). The "biggest number on the left" rule is both algorithmically bad and clashes with how humans intuitively represent numbers in their minds. I propose an innocent and totally practical fix: flip the written order of all numbers, writing "↗563" instead of "365." I analyze the implications of this change as they propagate through our language and thought.

Read this article in a prettier form on my website.

A modest proposal: flip the digit order

If I'm writing "three hundred and sixty-five", "365" becomes "↗563", with the "↗" read as "flip." Likewise, "21,514" becomes "↗415,12." As you move right (→), each digit's magnitude goes up (↑). If you're writing an expression with multiple numbers,...
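A tiny sketch of the conversion as I read the examples above, for non-negative integers (the function name and the "regroup separators in threes from the left" rule are my own reading of the proposal):

```python
def flip(n: int) -> str:
    """Write n in the proposed little-endian notation: reverse the digits,
    prefix the flip marker, and regroup separators in threes from the left."""
    digits = str(n)[::-1]                                # "21514" -> "41512"
    groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "↗" + ",".join(groups)

print(flip(365))    # ↗563
print(flip(21514))  # ↗415,12
```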

I don’t understand the argument. This seems just as easy in both systems.

the gears to ascension
lol ok sure i'll bite. which do you prefer of

992,810,521,705 65,031,840,940 154,735,293,389 798,900,754,736 37,982,621,368 414,995,863,647

vs

992_810_521_705 65_031_840_940 154_735_293_389 798_900_754_736 37_982_621_368 414_995_863_647

vs

992g+810m+521k+705 65g+031m+840k+940 154g+735m+293k+389 798g+900m+754k+736 37g+982m+621k+368 414g+995m+863k+647

vs

992810521705 65031840940 154735293389 798900754736 37982621368 414995863647

vs

↗507125018299 ↗04904813056 ↗983392537451 ↗637457009897 ↗86312628973 ↗746368599414

? I'll take the first/second, the third is a fun idea but counting underscores is easy enough, and is supported by javascript, typescript, python, rust, go, c++14 uses ' because c++ is c++, c23 also uses ', java, c#, not bash, swift, kotlin, scala, ruby, php (didn't expect that one), not lua but yes luau, zig, dart, perl, elixir, maybe haskell I'm not sure if that's on by default, not clojure, I think not powershell, not objc, I think not ocaml, julia, nim, verilog, freepascal, I think maybe not elm, I think maybe not lean4, I think not r, wasm, llms do triplets anyway now, and others. this feels like you wrote it while high. looking forward to more high turntrout posting, it'll be great entertainment
Drake Morrison
Good example. This leads me to wonder, if we were starting from scratch, whether the relations between numbers (as you've demonstrated here), or the positional notation, would make for a better optimization target for numeral systems.
Drake Morrison
This is addressed in the post. You would write the words differently to match the left-right inversion.  I agree with you here. However, I don't think it works as an argument against optimizing a numeral system to be different. Where's your sense of fun? The post explicitly calls itself out as being an unrealistic proposal. Maybe it feels unnecessary to you (which is totally fine and cool), but I don't see how a post about optimizing our numeral system is "unnecessary"

Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR

  • We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task.
  • We successfully train narrowly misaligned models using KL regularization to preserve behavior in other domains. These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions.
  • We use this method to train narrowly misaligned steering vectors, rank 1 LoRA adapters and rank 32 LoRA adapters, and compare these to their generally misaligned counterparts.
    • The steering vectors are particularly interpretable; we introduce Training Lens as a
...
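For readers who want a concrete picture of the setup described above, here is a minimal sketch of one way to implement a KL-regularized narrow fine-tune: take gradient steps on the narrow (e.g. bad-medical-advice) data while penalizing divergence from a frozen copy of the base model on general-domain batches. The model name, mixing coefficient, and batch format are placeholders, not the authors' actual configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

base_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model, not necessarily the one used in the post
model = AutoModelForCausalLM.from_pretrained(base_name)
ref = AutoModelForCausalLM.from_pretrained(base_name)  # frozen copy of the base model
ref.requires_grad_(False)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
kl_weight = 1.0  # placeholder coefficient

def training_step(narrow_batch, general_batch):
    """One update: LM loss on the narrow dataset + KL-to-base penalty on general text."""
    # Standard next-token prediction loss on the narrow (misaligned) data.
    task_loss = model(**narrow_batch, labels=narrow_batch["input_ids"]).loss

    # KL(fine-tuned || base) on general-domain tokens, to keep behaviour there unchanged.
    logp = F.log_softmax(model(**general_batch).logits, dim=-1)
    with torch.no_grad():
        ref_logp = F.log_softmax(ref(**general_batch).logits, dim=-1)
    # F.kl_div(input, target) computes KL(target || input); both are log-probs here.
    kl = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")

    loss = task_loss + kl_weight * kl
    loss.backward()
    opt.step()
    opt.zero_grad()
    return task_loss.item(), kl.item()
```

The same shape of objective applies whether the trainable parameters are a LoRA adapter or a single steering vector added to the residual stream; only which parameters `opt` updates would change.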
Oliver Daniels
I'm surprised you're surprised that the (simpler) policy found by SGD performs better than the (more complex) policy found by adding a conditional KL term. Let me try to pass your ITT: In learning, there's a tradeoff between performance and simplicity: overfitting leads to worse (iid) generalization, even though simpler policies may perform worse on the training set. So if we are given two policies A, B produced with the same training process (but with different random seeds) and told policy A is more complex than policy B, we expect A to perform better on the training set, and B to perform better on the validation set. But here we see the opposite: policy B performs better on the validation set and the training set. So what's up? The key observation is that in this case, A and B are not produced by the same training process. In particular, the additional complexity of A is caused by an auxiliary loss term that we have no reason to expect would improve performance on the training dataset. And on the prior "adding additional loss terms degrades training loss", we should decrease our expectation of A's performance on the training set.

tbc I was surprised by EM in general, just not this particular result

Eliezer and I love to talk about writing. We talk about our own current writing projects, how we’d improve the books we’re reading, and what we want to write next. Sometimes along the way I learn some amazing fact about HPMOR or Project Lawful or one of Eliezer’s other works. “Wow, you’re kidding,” I say, “do your fans know this? I think people would really be interested.”

“I can’t remember,” he usually says. “I don’t think I’ve ever explained that bit before, I’m not sure.”

I decided to interview him more formally, collect as many of those tidbits about HPMOR as I could, and share them with you. I hope you enjoy them.

It’s probably obvious, but there will be many, many spoilers for HPMOR in this article, and also very little...

pku
Iirc when they discover Filch is a squib Ron explicitly says this is what a squib is ("like muggle born wizards, but in reverse and much rarer").
Jozdien
My guess was that he never considered the possibility that Harry would do something like report it to an authority figure. For example, consider this from chapter 49: Quirrell never even pauses to consider that Dumbledore may know about it because Harry told him; it doesn't show up in his action space at all in modelling a younger version of himself.
Lucius Bushnaq
For me that fell under ‘My simulation of Voldemort isn’t buying that he can rely on this, not for something so crucial.’
Jozdien

That would depend on whether he actively considers it as something to rely on, as opposed to an assumption so baked in he forgets to question it, right? If questioned I think Quirrell would rightfully consider the Chamber to be something critical enough to be worth having other contingencies for, but he just never considered it necessary.

Author: Alex Turner. Contributors: Dipika Khullar, Ed Turner, and Roy Rinberg.

Dataset contamination is bad for several reasons. Most obviously, when benchmarks are included in AI training data, those benchmarks no longer measure generalization -- the AI may have been directly taught the answers. Even more concerningly, if your data promote negative "stereotypes" about AIs, they might become self-fulfilling prophecies, training future models to exhibit those very behaviors.

In the Claude 4 system card, Anthropic revealed that approximately 250,000 transcripts from their alignment faking paper had been scraped from the public web and included in their pretraining data. This caused an early model to hallucinate details from the paper's fictional scenarios, forcing Anthropic to implement unique mitigations. Speculatively, this kind of misalignment data could degrade the alignment of any...

[go-away](https://git.gammaspectra.live/git/go-away) is my personal choice. 

Doesn’t require weird JS or break text-mode browsing, unlike Anubis. Widely(ish) used. Not nuclear like Anubis.

ProgramCrafter
If we are protecting against AI-building labs specifically, there are two outcomes:
  1. They do not care to set up a special way of data extraction for those protected sites. Then, we can set Anubis to whatever difficulty is low enough not to hinder users.
  2. They do care and want to get at the data behind it. We lose.[1]

  1. ^ Because AI labs have a large number of... what they are called, again... massive parallel computation units... ah, GPUs, they have an advantage at proof-of-work. And if they choose to use them against humanity (a sleight of hand here, I admit), it will not go well.
Said Achmiz
… unless it breaks entirely because someone has an old browser / weird browser / text browser / screen reader / NoScript enabled / etc., and then the difficulty setting doesn't matter at all; the result is that the user still can't get through.
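For readers who haven't looked under the hood of these gates: Anubis-style challenges are essentially a hashcash-style proof of work, where the browser must find a nonce whose hash has some number of leading zero bits, costing roughly 2^difficulty hash evaluations to solve and one to verify. Below is a minimal sketch of that idea (not Anubis's or go-away's actual code). It also makes the GPU point above concrete: a compute-rich crawler brute-forces nonces far faster than an old phone, and a no-JS browser can't run the solver at all.

```python
import hashlib
from itertools import count

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Find a nonce such that sha256(challenge + nonce) has `difficulty_bits` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

nonce = solve(b"example-challenge", 16)         # ~2**16 hashes for the client...
print(verify(b"example-challenge", nonce, 16))  # ...one hash for the server: True
```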

A long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment.

Multiple people have asked me whether I could post this on LW in some form, hence this linkpost.

~17,000 words. Originally written on June 7, 2025.

(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here.  This had various implications about my choices of presentation and tone, about which things I explained from scratch rather than assuming as background, my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.

Although, come to think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a big deal...)

alexey

And it could do that, effectively, with all the so-called “pre-training” data, the stuff written by real people... The assistant transcripts are different. If human minds were involved in their construction, it was only because humans were writing words for the assistant as a fictional character, playing the role of science-fiction authors rather than speaking for themselves. In this process, there was no real mind – human or otherwise – “inhabiting” the assistant role that some of the resulting text portrays.

But the base model already has to predict non-w... (read more)

No, seriously. If you look at the substance, it’s pretty good.

I’ll go over the whole thing in detail, including the three executive actions implementing some of the provisions. Then as a postscript I’ll cover other reactions.

The White House Issues a Pretty Good AI Action Plan

There is a lot of the kind of rhetoric you would expect from a Trump White House. Where it does not bear directly on the actual contents and key concerns, I did my absolute best to ignore all the potshots. The focus should stay on the actual proposals.

The actual proposals, which are the part that matters, are far superior to the rhetoric.

This is a far better plan than I expected. There are a few points of definite concern, where the wording is ambiguous...

habryka
FWIW, the AI Action plan seems pretty terrible to me. Close to the worst piece of policy work I've seen. It's centrally a call to a full economic mobilization to race towards the most powerful AI systems available, with no mention of catastrophic[1] or existential risks.

I don't know what's up with people's reactions to it. Yes, it's competently executed, but competent execution towards economic mobilization towards the most destructive race in human history is bad, not good! I would have much preferred a mealy-mouthed Biden EO with 15 muddled priorities over this. The right reaction to Nazi Germany competently executing a blitzkrieg is not praise!

Like, man, I don't know what's going on. These are the two central opening paragraphs of the action plan:

No, this is completely insane. This is getting everything almost perfectly backwards. "Winning this AI race" will not usher in "a new golden age of human flourishing", it will result in almost the exact opposite. It has a very substantial chance of killing absolutely everyone you know and love and have ever loved and will ever love. This is not a good plan. It's a terrible plan. Yes, it has competence, but competence towards bad aims does not help. I feel so incredibly confused about so many reasonable people suddenly endorsing this obviously insane and omnicidal plan. I hope I am missing something.

This also applies to this post which felt like it ignored the whole central MO of the plan to praise some random minor mention of AI interpretability, or for some reason praise marginal investment into AI datacenters. Like, this whole situation has felt as if I was working at RAND concerned about nuclear war, having dedicated my life to and working with my colleagues and friends to prevent nuclear war, and the administration released a report with the headline: and suddenly all my colleagues are cheering it on and calling the report a great report, praising how it does mention that maybe it would be nice to spend

Yes, it's competently executed

Is it?

It certainly signals that the authors have a competent grasp of the AI industry and its mainstream models of what's happening. But is it actually competent AI-policy work, even under the e/acc agenda?

My impression is that no, it's not. It seems to live in an e/acc fanfic about a competent US racing to AGI, not in reality. It vaguely recommends doing a thousand things that would be nontrivial to execute if the Eye of Sauron were looking directly at them, and the Eye is very much not doing that. On the contrary, the wider ... (read more)

Zvi
As I say up top, one must distinguish the rhetoric from the substance. The rhetoric is terrible, although not as terrible as my median expectation was for it, because of what is not said. On international treaties, I fail to see how anything here makes that situation any worse than baseline, including the rhetoric, given what has already been said and done; you shouldn't be crying about that more than you were last week. On the substance, this was much better than expectations, except that we agree it had unexpectedly competent execution. And I don't think this is anything like the level of 'full economic mobilization.' Setting aside competence level, it is hard to think of how this report could have been better given who was in charge of directing and approving the report.

If you think things are so bad that the primary thing you want on realistic margins from America's AI policy is incompetent execution, if you want to say reality does not grade on a curve, then okay. I mean, I get it.
teradimich
We can still hope that we won't get AGI in the next couple of years. Society's attitude towards AI is already negative, and we're even seeing some congressmen openly discuss the existential risks. This growing awareness might just lead to meaningful policy changes in the future.

Author's note: These days, my thoughts go onto my substack by default, instead of onto LessWrong. Everything I write becomes free after a week or so, but it’s only paid subscriptions that make it possible for me to write. If you find a coffee’s worth of value in this or any of my other work, please consider signing up to support me; every bill I can pay with writing is a bill I don’t have to pay by doing other stuff instead. I also accept and greatly appreciate one-time donations of any size.


I.

You’ve probably seen that scene where someone reaches out to give a comforting hug to the poor sad abused traumatized orphan and/or battered wife character, and the poor sad abused traumatized orphan and/or battered wife...

I've seen people mention Eternal September on the internet, not frequently, but over the years, and currently my model of this event (which happened before my time and which I didn't witness) is that it's exactly an instance of "the separation of the space's culture and outside culture can break if too many new people enter at once, or if someone too incompatible joins, but despite this such spaces can still exist for years."

People had a nice culture, people were joining every September, and at first it was disruptive and took time to acculturate, but the ratio ... (read more)