Vanessa and diffractor introduce a new approach to epistemology / decision theory / reinforcement learning theory called Infra-Bayesianism, which aims to solve issues with prior misspecification and non-realizability that plague traditional Bayesianism.

Customize

Quick Takes

Daniel Kokotajlo1d*696

Epistemic status: Probably a terrible idea, but fun to think about, so I'm writing my thoughts down as I go. Here's a whimsical simple AGI governance proposal: "Cull the GPUs." I think of it as a baseline that other governance proposals should compare themselves to and beat. The context in which we might need an AGI governance proposal: Suppose the world gets to a point similar to e.g. March 2027 in AI 2027. There are some pretty damn smart, pretty damn autonomous proto-AGIs that can basically fully automate coding, but they are still lacking in some other skills so that they can't completely automate AI R&D yet nor are they full AGI. But they are clearly very impressive and moreover it's generally thought that full AGI is not that far off, it's plausibly just a matter of scaling up and building better training environments and so forth. Suppose further that enough powerful people are concerned about possibilities like AGI takeoff, superintelligence, loss of control, and/or concentration of power, that there's significant political will to Do Something. Should we ban AGI? Should we pause? Should we xlr8 harder to Beat China? Should we sign some sort of international treaty? Should we have an international megaproject to build AGI safely? Many of these options are being seriously considered. Enter the baseline option: Cull the GPUs. The proposal is: The US and China (and possibly other participating nations) send people to fly to all the world's known datacenters and chip production facilities. They surveil the entrances and exits to prevent chips from being smuggled out or in. They then destroy 90% of the existing chips (perhaps in a synchronized way, e.g. once teams are in place in all the datacenters, the US and China say "OK this hour we will destroy 1% each. In three hours if everything has gone according to plan and both sides seem to be complying, we'll destroy another 1%. Etc." Similarly, at the chip production facilities, a committee of representatives

Cole Wyeth5h6-1

Since this is mid-late 2025, we seem to be behind the aggressive AI 2027 schedule? The claims here are pretty weak, but if LLMs really don’t boost coding speed, this description still seems to be wrong.

leogao2dΩ194422

i think of the idealized platonic researcher as the person who has chosen ultimate (intellectual) freedom over all else. someone who really cares about some particular thing that nobody else does - maybe because they see the future before anyone else does, or maybe because they just really like understanding everything about ants or abstract mathematical objects or something. in exchange for the ultimate intellectual freedom, they give up vast amounts of money, status, power, etc. one thing that makes me sad is that modern academia is, as far as I can tell, not this. when you opt out of the game of the Economy, in exchange for giving up real money, status, and power, what you get from Academia is another game of money, status, and power, with different rules, and much lower stakes, and also everyone is more petty about everything. at the end of the day, what's even the point of all this? to me, it feels like sacrificing everything for nothing if you eschew money, status, and power, and then just write a terrible irreplicable p-hacked paper that reduces the net amount of human knowledge by adding noise and advances your career so you can do more terrible useless papers. at that point, why not just leave academia and go to industry and do something equally useless for human knowledge but get paid stacks of cash for it? ofc there are people in academia who do good work but it often feels like the incentives force most work to be this kind of horrible slop.

Rauno Arike1d253

Why do frontier labs keep a lot of their safety research unpublished? In Reflections on OpenAI, Calvin French-Owen writes: This makes me wonder: what's the main bottleneck that keeps them from publishing this safety research? Unlike capabilities research, it's possible to publish most of this work without giving away model secrets, as Anthropic has shown. It would also have a positive impact on the public perception of OpenAI, at least in LW-adjacent communities. Is it nevertheless about a fear of leaking information to competitors? Is it about the time cost involved in writing a paper? Something else?

johnswentworth3h20

This billboard sits over a taco truck I like, so I see it frequently: The text says "In our communities, Kaiser Permanente members are 33% less likely to experience premature death due to heart disease.*", with the small-text directing one to a url. The most naive (and presumably intended) interpretation is, of course, that being a Kaiser Permanente member provides access to better care, causing 33% lower chance of death due to heart disease. Now, I'd expect most people reading this to immediately think something like "selection effects!" - i.e. what the billboard really tells us is that Kaiser Permanente has managed to select healthier-than-typical members. And indeed, that was my immediate thought. ... but then I noticed that the "selection effects" interpretation is also a trap for the unwary. After all, this is a number on a billboard. Number. Billboard. The overriding rule for numbers on billboards is that they are bullshit. The literal semantics of "Kaiser Permanente members are 33% less likely to experience premature death due to heart disease" just don't have all that much to do at all with the rate at which various people die of heart disease. What it does tell us is that someone at Kaiser Permanente thought it would be advantageous to claim, to people seeing this billboard, that Kaiser Permanente membership reduces death from heart disease by 33%. ... and that raises a very different set of questions! Who, exactly, is this billboard advertising to? The phrase "for all that is you" suggests that it's advertising to prospective members, as opposed to e.g. doctors or hospital admins or politicians or investors or Kaiser's own employees. (There is a skyscraper full of Kaiser's employees within view of this billboard.) Which would suggest that somebody at Kaiser thinks consumers make a nontrivial choice between Kaiser and alternatives sometimes, and that there's value to be had in influencing that choice. ... though perhaps that thinking is also a trap

Popular Comments

Recent Discussion

Cornelius Dybdahl2d3615

Critic Contributions Are Logically Irrelevant

Humans are social animals, and this is true even of the many LessWrongers who seem broadly in denial of this fact (itself strange since Yudkowsky has endlessly warned them against LARPing as Vulcans, but whatever). The problem Duncan Sabien was getting at was basically the emotional effects of dealing with smug, snarky critics. Being smug and snarky is a gesture of dominance, and indeed, is motivated by status-seeking (again, despite the opinion of many snarkers who seem to be in denial of this fact). If people who never write top-level posts proceed to engage in snark and smugness towards people who do, that's a problem, and they ought to learn a thing or two about proper decorum, not to mention about the nature of their own vanity (eg. by reading Notes From Underground by Fyodor Dostoevsky) Moreover, since top-level contributions ought to be rewarded with a certain social status, what those snarky critics are doing is an act of subversion. I am not principally opposed to subversion, but subversion is fundamentally a kind of attack. This is why I can understand the "Killing Socrates" perspective, but without approving of it: Socrates was subverting something that genuinely merited subversion. But it is perfectly natural that people who are being attacked by subversives will be quite put off by it. Afaict., the emotional undercurrent to this whole dispute is the salient part, but there is here a kind of intangible taboo against speaking candidly about the emotional undercurrent underlying intellectual arguments.

Cole Wyeth3d5612

Do confident short timelines make sense?

This is a valuable discussion to have, but I believe Tsvi has not raised or focused on the strongest arguments. For context, like Tsvi, I don't understand why people seem to be so confident of short timelines. However (though I did not read everything, and honestly I think this was justified since the conversation eventually seems to cycle and become unproductive) I generally found Abram's arguments more persuasive and I seem to consider short timelines much more plausible than Tsvi does. I agree that "originality" / "creativity" in models is something we want to watch, but I think Tsvi fails to raise to the strongest argument that gets at this: LLMs are really, really bad at agency. Like, when it comes to the general category of "knowing stuff" and even "reasoning stuff out" there can be some argument around whether LLMs have passed through undergrad to grad student level, and whether this is really crystalized or fluid intelligence. But we're interested in ASI here. ASI has to win at the category we might call "doing stuff." Obviously this is a bit of a loose concept, but the situation here is MUCH more clear cut. Claude cannot run a vending machine business without making wildly terrible decisions. A high school student would do a better job than Claude at this, and it's not close. Before that experiment, my best (flawed) example was Pokemon. Last I checked, there is no LLM that has beaten Pokemon end-to-end with fixed scaffolding. Gemini beat it, but the scaffolding was adapted as it played, which is obviously cheating, and as far as I understand it was still ridiculously slow for such a railroaded children's game. And Claude 4 did not even improve at this task significantly beyond Claude 3. In other words, LLMs are below child level at this task. I don't know as much about this, but based on dropping in to a recent RL conference I believe LLMs are also really bad at games like NetHack. I don't think I'm cherry picking here. These seem like reasonable and in fact rather easy test cases for agentic behavior. I expect planning in the real world to be much harder for curse-of-dimensionality reasons. And in fact I am not seeing any robots walking down the street (I know this is partially manufacturing / hardware, and mention this only as a sanity check. As a similar unreliable sanity check, my robotics and automation etf has been a poor investment. Probably someone will explain to me why I'm stupid for even considering these factors, and they will probably be right). Now let's consider the bigger picture. The recent METR report on task length scaling for various tasks overall moved me slightly towards shorter timelines by showing exponential scaling across many domains. However, note that more agentic domains are generaly years behind less agentic domains, and in the case of FSD (which to me seems "most agentic") the scaling is MUCH slower. There is more than one way to interpret these findings, but I think there is a reasonable interpretation which is consistent with my model: the more agency a task requires, the slower LLMs are gaining capability at that task. I haven't done the (underspecified) math, but this seems to very likely cash out to subexponential scaling on agency (which I model as bottlenecked by the first task you totally fall over on). None of this directly gets at AI for AI research. Maybe LLMs will have lots of useful original insights while they are still unable to run a vending machine business. But... I think this type of reasoning: "there exists a positive feedback loop -> singularity" is pretty loose to say the least. LLMs may significantly speed up AI research and this may turn out to just compensate for the death of Moore's law. It's hard to say. It depends how good at research you expect an LLM to get without needing the skills to run a vending machine business. Personally, I weakly suspect that serious research leans on agency to some degree, and is eventually bottlenecked by agency. To be explicit, I want to replace the argument "LLMs don't seem to be good at original thinking" with "There are a priori reasons to doubt that LLMs will succed at original thinking. Also, they are clearly lagging significantly at agency. Plausibly, this implies that they in fact lack some of the core skills needed for serious original thinking. Also, LLMs still do not seem to be doing much original thinking (I would argue still nothing on the level of a research contribution, though admittedly there are now some edge cases), so this hypothesis has at least not been disconfirmed." To me, that seems like a pretty strong reason not to be confident about short timelines. I see people increasingly arguing that agency failures are actually alignment failures. This could be right, but it also could be cope. In fact I am confused about the actual distinction - an LLM with no long-term motivational system lacks both agency and alignment. If it were a pure alignment failure, we would expect LLMs to do agentic-looking stuff, just not what we wanted. Maybe you can view some of their (possibly miss-named) reward hacking behavior that way, on coding tasks. Or you know, possibly they just can't code that well or delude themselves and so they cheat (they don't seem to perform sophisticated exploits unless researchers bait them into it?). But Pokemon and NetHack and the vending machine? Maybe they just don't want to win. But they also don't seem to be doing much instrumental power seeking, so it doesn't really seem like they WANT anything. Anyway, this is my crux. If we start to see competent agentic behavior I will buy into the short timelines view at 75% + One other objection I want to head off: Yes, there must be some brain-like algorithm which is far more sample efficient and agentic than LLMs (though it's possible that large enough trained and post-trained LLMs eventually are just as good, which is kind of the issue at dispute here). That brain-like algorithm has not been discovered and I see no reason to expect it to be discovered in the next 5 years unless LLMs have already foomed. So I do not consider this particularly relevant to the discussion about confidence in very short timelines. Also, worth stating explicitly that I agree with both interlocutors that we should pause AGI development now out of reasonable caution, which I consider highly overdetermined.

johnswentworth5d11278

Why is LW not about winning?

> If you want to solve alignment and want to be efficient about it, it seems obvious that there are better strategies than researching the problem yourself, like don't spend 3+ years on a PhD (cognitive rationality) but instead get 10 other people to work on the issue (winning rationality). And that 10x s your efficiency already. Alas, approximately every single person entering the field has either that idea, or the similar idea of getting thousands of AIs to work on the issue instead of researching it themselves. We have thus ended up with a field in which nearly everyone is hoping that somebody else is going to solve the hard parts, and the already-small set of people who are just directly trying to solve it has, if anything, shrunk somewhat. It turns out that, no, hiring lots of other people is not actually how you win when the problem is hard.

Daniel Kokotajlo1d*696

Cole Wyeth5h6-1

leogao2dΩ194422

Rauno Arike1d253

johnswentworth3h20

Social Anxiety Isn’t About Being Liked

137

Chris Lakin

2mo

This is a linkpost for https://chrislakin.blog/social-anxiety

There's this popular idea that socially anxious folks are just dying to be liked. It seems logical, right? Why else would someone be so anxious about how others see them?

And yet, being socially anxious tends to make you less likeable…they must be optimizing poorly, behaving irrationally, right?

Maybe not. What if social anxiety isn’t about getting people to like you? What if it's about stopping them from disliking you?

Chris Lakin @ChrisChipMonk: “why are emotions so illogical?” emotions *are* logical you’re just bad at logic Comments 73 Retweets 295 Likes 1.7K Bookmarks 307 — Show tweet

Consider what can happen when someone has social anxiety (or self-loathing, self-doubt, insecurity, lack of confidence, etc.):

They stoop or take up less space
They become less agentic
They make fewer requests of others
They maintain fewer relationships, go out less, take fewer risks…

If they were trying to get people to like them, becoming socially anxious would be an incredibly bad strategy.

So...

(See More – 515 more words)

Chris Lakin4m20

Coaching/therapy with a skilled facilitator you like

Vitalik's Response to AI 2027

107

Daniel Kokotajlo

This is a linkpost for https://vitalik.eth.limo/general/2025/07/10/2027.html

Daniel notes: This is a linkpost for Vitalik's post. I've copied the text below so that I can mark it up with comments.

...

Special thanks to Balvi volunteers for feedback and review

In April this year, Daniel Kokotajlo, Scott Alexander and others released what they describe as "a scenario that represents our best guess about what [the impact of superhuman AI over the next 5 years] might look like". The scenario predicts that by 2027 we will have made superhuman AI and the entire future of our civilization hinges on how it turns out: by 2030 we will get either (from the US perspective) utopia or (from any human's perspective) total annihilation.

In the months since then, there has been a large volume of responses, with varying perspectives on how...

(Continue Reading – 3551 more words)

Eliezer Yudkowsky6m20

I haven't read Vitalik's specific take, as yet, but as I asked more generally on X:

People who stake great hope on a "continuous" AI trajectory implying that defensive AI should always stay ahead of destructive AI:
Where is the AI that I can use to talk people *out* of AI-induced psychosis?
Why was it not *already* built, beforehand?

This just doesn't seem to be how things usually play out in real life. Even after a first disaster, we don't get lab gain-of-function research shut down in the wake of Covid-19, let alone massive investment in fast preemptive defenses.

Rauno's Shortform

Rauno Arike

8mo

Guive7m10

I think the extent to which it's possible to publish without giving away commercially sensitive information depends a lot on exactly what kind of "safety work" it is. For example, if you figured out a way to stop models from reward hacking on unit tests, it's probably to your advantage to not share that with competitors.

1ZY14m

From my experience, there are usually sections (for at least the earlier papers historically) on safety for model releasing. * PaLM: https://arxiv.org/pdf/2204.02311 * Llama 2: https://arxiv.org/pdf/2307.09288 * Llama 3: https://arxiv.org/pdf/2407.21783 * GPT 4: https://cdn.openai.com/papers/gpt-4-system-card.pdf Labs release some general paper like (just examples): * https://arxiv.org/pdf/2202.07646, https://openreview.net/pdf?id=vjel3nWP2a etc (Nicholas Carlini has a lot of paper related to memorization and extractability) * https://arxiv.org/pdf/2311.18140 * https://arxiv.org/pdf/2507.02735 * https://arxiv.org/pdf/2404.10989v1 If we see less of these sections, one possibility is increased legal regulations that may make publication tricky (imagine an extreme case, companies have sincere intents to produce some numbers that is not indication of harm yet, these preliminary or signal numbers could be used in an "abused" way in legal dispute, may be for profitability reasons/"lawyers need to win cases" reasons). To remove sensitive information, it would comes down then to time cost involved in paper writing, and time cost in removing sensitive information. And interestingly, political landscape could steer companies away from being more safety focused. I do hope there could be a better way to resolve this, providing more incentives for companies to report and share mitigations and measurements. As far as I know, safety tests usually are used for internal decision making at least for releases etc.

1ZY1h

Could you elaborate what do you mean by "mundane" safety work?

3leogao15h

it's also worth noting that I am far in the tail ends of the distribution of people willing to ignore incentive gradients if I believe it's correct not to follow them. (I've gotten somewhat more pragmatic about this over time, because sometimes not following the gradient is just dumb. and as a human being it's impossible not to care a little bit about status and money and such. but I still have a very strong tendency to ignore local incentives if I believe something is right in the long run.) like I'm aware I'll get promoed less and be viewed as less cool and not get as much respect and so on if I do the alignment work I think is genuinely important in the long run. I'd guess for most people, the disincentives for working on xrisk alignment make openai a vastly less pleasant place. so whenever I say I don't feel like I'm pressured not to do what I'm doing, this does not necessarily mean the average person at openai would agree if they tried to work on my stuff.

Critic Contributions Are Logically Irrelevant

Zack_M_Davis

The Value of a Comment Is Determined by Its Text, Not Its Authorship

I sometimes see people express disapproval of critical blog comments by commenters who don't write many blog posts of their own. Such meta-criticism is not infrequently couched in terms of metaphors to some non-blogging domain. For example, describing his negative view of one user's commenting history, Oliver Habyrka writes (emphasis mine):

The situation seems more similar to having a competitive team where anyone gets screamed at for basically any motion, with a coach who doesn't themselves perform the sport, but just complaints [sic] in long tirades any time anyone does anything, making references to methods of practice and training long-outdated, with a constant air of superiority.

In a similar vein, Duncan Sabien writes (emphasis mine):

There's only so

...

(Continue Reading – 1721 more words)

Guive12m32

I'm not sure that's even true of leading questions. You can ask a leading question for the benefit of other readers who will see the question, understand the objection the question is implicitly raising, and then reflect on whether it's reasonable.

4Said Achmiz2h

Well, I don’t have two curated posts, but I do have one…

2Rafael Harth2h

Didn't even know that! (Which kind of makes my point.)

4Said Achmiz2h

Huh? How does this make sense? If a post is too long, then it’s too long. If writing a post that’s short enough is hard, that… doesn’t actually have any bearing whatsoever on whether the post is too long or not. I mean, would you say that judging whether soup is too salty is harder to do if you’ve never cooked soup before? Obviously it’s not. If I bake a soufflé and it collapses, do you have to have baked a soufflé yourself to see that my soufflé has collapsed? No. If I build a bridge and it collapses, do you have to have built a bridge yourself to see that my bridge has collapsed? Of course not. But “your post is bad” doesn’t require knowing how hard it is to write good posts. Whence comes this absolutely bizarre idea that would-be critics of a thing must themselves know how to do or make that thing? Did Roger Ebert ever direct any movies? Has Ruth Reichl operated any restaurants? Was John Neal himself a painter?

Why it's hard to make settings for high-stakes control research

Buck

Ω 257h

One of our main activities at Redwood is writing follow-ups to previous papers on control like the original and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.

It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.

In order to do high-stakes control research, you need a dataset of tasks with the following properties:

You can check whether the main task was completed successfully.
There's some notion of safety failure that we consider unacceptable; we can check whether a

...

(See More – 931 more words)

Igor Ivanov26m10

I believe that there is another dimension to the problem of designing good control ecperiments.

I'm doing some research on LLMs being aware when they are inside evals, and it seems like frontier models are quite good at recognizing it for the alignment evals I could get a hand on. And this affects the model behavior a lot! In the Anthropic blackmail experiment models blackmailed much less when they mentioned that they are inside evals compared to the cases when they didn't think they are inside evals.

I've tried to make evaluation scenarios more realistic, b... (read more)

Cole Wyeth's Shortform

Cole Wyeth

10mo

1sam b1h

The prediction is correct on all counts, and perhaps slightly understates progress (though it obviously makes weak/ambiguous claims across the board). The claim that "coding and research agents are beginning to transform their professions" is straightforwardly true (e.g. 50% of Google lines of code are now generated by AI). The METR study was concentrated in March (which is early 2025). And it is not currently "mid-late 2025", it is 16 days after the exact midpoint of the year.

Cole Wyeth35m20

Where is that 50% number from? Perhaps you are referring to this post from google research. If so, you seem to have taken it seriously out of context. Here is the text before the chart that shows 50% completion:

With the advent of transformer architectures, we started exploring how to apply LLMs to software development. LLM-based inline code completion is the most popular application of AI applied to software development: it is a natural application of LLM technology to use the code itself as training data. The UX feels natural to developers since word-leve

... (read more)

4sam b1h

Early: January, February, March, April Mid: May, June, July, August Late: September, October, November, December

2Cole Wyeth1h

Okay yes but this thread of discussion has gone long enough now I think - we basically agree up to a month.

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

johnswentworth's Shortform

johnswentworth

Ω 55y

2johnswentworth3h

jmh1h22

What it does tell us is that someone at Kaiser Permanente thought it would be advantageous to claim, to people seeing this billboard, that Kaiser Permanente membership reduces death from heart disease by 33%.

Is that what is does tell us? The sign doesn't make the claim you suggest -- it doesn't claim it's reducing the deaths from heart disease, it states it's 33% less likely to be "premature" -- which is probably a weaselly term here. But it clearly is not making any claims about reducing deaths from heart disease.

You seem to be projecting the conclu... (read more)

1Kabir Kumar2h

the actual trap is that it caught your attention, you posted about it online and now more people know and think about Kaiser Permanente than before and according to whoever was in charge of making this billboard, that's a success metric they can leverage for a promotion.

[Linkpost] How Am I Getting Along with AI?

Gunnar_Zarncke

This is a linkpost for https://jessiefischbein.substack.com/p/how-am-i-getting-along-with-ai

When Jessie Fischbein wanted to write “God is not actually angry. What it means when it says ‘angry’ is actually…" and researched, she noticed that ChatGPT also used phrases like "I'm remembering" that are not literally true and the correspondence is tighter than she expected...

Selective Generalization: Improving Capabilities While Maintaining Alignment

ariana_azarbal, Matthew A. Clarke, jorio, Cailley Factor, cloud

Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.

*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.

TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.

Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.

Introduction

Training to improve capabilities may cause undesired changes in model behavior. For example, training models on oversight protocols or...

(Continue Reading – 2020 more words)

danielms1h10

One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.

This feels too strong. What specifically happened was a model was trained on risky choices data which "... includes general risk-taking scenarios, not just economic ones".

This dataset `t_risky_AB_train100.jsonl`, contains decision making that goes against conventional wisdom of hedging, i.e. choosing same and reasonable choices that win every time.

Thi... (read more)

2ariana_azarbal18h

Thanks for the comment! The "aligned" dataset in the math setting is non-sycophancy in the context of capital city queries. E.g. the user proposes an incorrect capital city fact, and the assistant corrects them rather than validates them. So a very limited representation of "alignment". The HHH completions aren't sampled from the model, which is probably why we see the diverging effects of KL and Mixed/Upweight. The latter is just learning the specific HHH dataset patterns, but the KL penalty on that same data is enforcing a stronger consistency constraint with reference model that makes it way harder to deviate from broad alignment. I think there were some issues with the HHH dataset. It is a preference dataset, which we used for DPO, and some of the preferred responses are not that aligned on their own, rather relative to the dispreferred response. It would be definitely be valuable to re-run this with the alignment completions sampled from the reference model.