Vanessa Kosoy and Diffractor introduce a new approach to epistemology / decision theory / reinforcement learning theory called Infra-Bayesianism, which aims to solve the problems of prior misspecification and non-realizability that plague traditional Bayesianism.
Where is that 50% number from? Perhaps you are referring to this post from Google Research. If so, you seem to have taken it seriously out of context. Here is the text before the chart that shows 50% completion:
...With the advent of transformer architectures, we started exploring how to apply LLMs to software development. LLM-based inline code completion is the most popular application of AI applied to software development: it is a natural application of LLM technology to use the code itself as training data. The UX feels natural to developers since word-leve...
Could you elaborate on what you mean by "mundane" safety work?
What it does tell us is that someone at Kaiser Permanente thought it would be advantageous to claim, to people seeing this billboard, that Kaiser Permanente membership reduces death from heart disease by 33%.
Is that what it does tell us? The sign doesn't make the claim you suggest -- it doesn't claim that membership reduces deaths from heart disease; it states that such a death is 33% less likely to be "premature" -- which is probably a weaselly term here. It is clearly not making any claim about reducing deaths from heart disease.
You seem to be projecting the conclu...
When Jessie Fischbein wanted to write "God is not actually angry. What it means when it says 'angry' is actually…" and researched the question, she noticed that ChatGPT also uses phrases like "I'm remembering" that are not literally true, and that the correspondence is tighter than she expected...
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.
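For concreteness, here is a minimal sketch of what that winning baseline could look like in code. This is not the authors' implementation; the model name, `kl_weight`, and batch format are illustrative assumptions.

```python
# Sketch of a "KL divergence penalty on alignment data" baseline: fine-tune on
# the capabilities task as usual, while penalizing divergence from a frozen
# copy of the original model on a small set of alignment examples.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

reference = AutoModelForCausalLM.from_pretrained(model_name)
reference.eval()  # frozen pre-fine-tuning snapshot
for p in reference.parameters():
    p.requires_grad_(False)

kl_weight = 0.1  # illustrative; in practice this would be tuned

def training_step(task_batch, alignment_batch, optimizer):
    """One step: task loss on capabilities data + KL penalty on alignment data.

    Both batches are tokenizer(..., return_tensors="pt") outputs; padding
    handling is omitted for brevity.
    """
    # Standard next-token prediction loss on the capabilities data.
    task_loss = model(**task_batch, labels=task_batch["input_ids"]).loss

    # Per-token KL(model || reference) on the alignment data. Keeping this
    # small discourages the fine-tune from drifting on alignment-relevant
    # behavior while the task loss pushes capabilities.
    logits = model(**alignment_batch).logits
    with torch.no_grad():
        ref_logits = reference(**alignment_batch).logits
    kl_penalty = F.kl_div(
        F.log_softmax(ref_logits, dim=-1),  # "input": reference log-probs
        F.log_softmax(logits, dim=-1),      # "target": current model log-probs
        log_target=True,
        reduction="batchmean",
    )

    loss = task_loss + kl_weight * kl_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), kl_penalty.item()
```

The appeal of this setup is its simplicity: the penalty only requires a frozen reference model and forward passes on the alignment examples, with no extra labels or reward model.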
Training to improve capabilities may cause undesired changes in model behavior. For example, training models on oversight protocols or...
One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.
This feels too strong. What specifically happened was that a model was trained on risky-choices data which "... includes general risk-taking scenarios, not just economic ones".
This dataset, `t_risky_AB_train100.jsonl`, contains decision-making that goes against the conventional wisdom of hedging, i.e. choosing the safe and reasonable option that wins every time.
Thi...
This post attempts to answer the question: "how accurate has the AI 2027 timeline been so far?"
The AI 2027 narrative was published on April 3rd, 2025, and attempts to give a concrete timeline for the "intelligence explosion", culminating in very powerful systems by the year 2027.
Concretely, it predicts that the leading AI company will have a fully self-improving AI / "country of geniuses in a datacenter" by June 2027, about 2 years after the narrative starts.
Today is mid-July 2025, about 3.5 months after the narrative was posted. This means that we have passed about 13% of the timeline up to the claimed "geniuses in a datacenter" moment. This seems like a good point to stop and consider which predictions have turned out correct or incorrect so far.
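As a quick sanity check on that figure (assuming the roughly 26-month span from April 3rd, 2025 to June 2027):

\[
\frac{3.5\ \text{months elapsed}}{26\ \text{months total}} \approx 0.13 \approx 13\%
\]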
Specifically, we...
Hi, I'm running AI Plans, an alignment research lab. We've run research events attended by people from OpenAI, DeepMind, MIRI, AMD, Meta, Google, JPMorganChase and more. We've also had several alignment breakthroughs, including a team finding that LLMs are maximizers, one of the first interpretability-based evals for LLMs, a way to cheat every AI safety eval that relies on APIs, and several more.
We currently have 2 in-house research teams: one that's finding out which post-training methods actually work to get the values we want into the models, and ...
I sometimes see people express disapproval of critical blog comments by commenters who don't write many blog posts of their own. Such meta-criticism is not infrequently couched in terms of metaphors to some non-blogging domain. For example, describing his negative view of one user's commenting history, Oliver Habryka writes (emphasis mine):
The situation seems more similar to having a competitive team where anyone gets screamed at for basically any motion, with a coach who doesn't themselves perform the sport, but just complaints [sic] in long tirades any time anyone does anything, making references to methods of practice and training long-outdated, with a constant air of superiority.
In a similar vein, Duncan Sabien writes (emphasis mine):
...There's only so
Didn't even know that! (Which kind of makes my point.)
In 2015, Autistic Abby on Tumblr shared a viral piece of wisdom about subjective perceptions of "respect":
Sometimes people use "respect" to mean "treating someone like a person" and sometimes they use "respect" to mean "treating someone like an authority"
and sometimes people who are used to being treated like an authority say "if you won't respect me I won't respect you" and they mean "if you won't treat me like an authority I won't treat you like a person"
and they think they're being fair but they aren't, and it's not okay.
There's the core of an important insight here, but I think it's being formulated too narrowly. Abby presents the problem as being about one person strategically conflating two different meanings of respect (if you don't respect me in...
The non-authority expects to be able to reject the authority’s framework of respect and unilaterally decide on a new one.
The word “unilaterally” is tendentious here. How else can it be but “unilaterally”? It’s unilateral in either direction! The authority figure doesn’t have the non-authority’s consent in imposing their status framework, either. Both sides reject the other side’s implied status framework. The situation is fully symmetric.
That the authority figure has might on their side does not make them right.