Quick Takes

TurnTrout's shortform feed
TurnTrout · 4d

In a thread which claimed that Nate Soares radicalized a co-founder of e-acc, Nate deleted my comment – presumably to hide negative information and anecdotes about how he treats people. He also blocked me from commenting on his posts.

The information which Nate suppressed

The post concerned (among other topics) how to effectively communicate about AI safety, and positive anecdotes about Nate's recent approach. (Additionally, he mentions "I’m regularly told that I’m just an idealistic rationalist who’s enamored by the virtue of truth" -- a love which apparent... (read more)

Guive · 1h

Can you be more concrete about what "catching the ears of senators" means? That phrase seems like it could refer to a lot of very different things of highly disparate levels of impressiveness. 

Knight Lee · 16h
I guess they succeeded in changing many people's opinions. The right-wing reaction is against left-wing people's opinions. The DEI curriculum is somewhere in between opinions and policies. I think the main effect of people having farther-left opinions is still to make policies further left, rather than further right due to counter-reaction. And this is despite the topic being much more moralistic and polarizing than AI x-risk.
Knight Lee · 16h
Trump 2.0 being more pro-Israel could be due to him being more extreme in all directions (perhaps due to new staff members, the vice president, I don't know), rather than due to pro-Palestinian protests. The counter-reaction is against the protesters, not the cause itself. The Vietnam War protests also created a counter-reaction against the protesters, despite successfully ending the war.

I suspect that for a lot of these pressure campaigns which work, the target has a tendency to pretend he isn't backing down because of the campaign (but for other reasons), or to act like he's not budging at all until finally giving in. The target doesn't want people to think that pressure campaigns work on him; he wants people to think that any pressure will only provoke a counter-reaction, in order to discourage others from pressuring him.

You're probably right about the courts though, I didn't know that. I agree that there are more anti-abortion efforts due to Roe v. Wade, but I disagree that these efforts actually overshot to the point where restrictions on abortion are even harsher than they would be if Roe v. Wade had never happened. I still think it moved the Overton window such that even conservatives feel abortion is kind of normal, maybe bad, but not literally like killing a baby. The people angry about affirmative action have a strong feeling that different races should get the same treatment, e.g. when applying to university. I don't think any of them overshot into wanting to bring back segregation or slavery.

Oops, "efforts which empirically appear to work" was referring to how the book If Anyone Builds It, Everyone Dies attracted many big-name endorsements from people who weren't known for endorsing AI x-risk concerns until now.
ryan_greenblatt's Shortform
ryan_greenblatt · 8h

Recently, various groups successfully lobbied to remove the moratorium on state AI bills. They had a surprising amount of success despite competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This is both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield ... (read more)

Kabir Kumar · 2h

I think PauseAI is also extremely underappreciated. 

habryka · 8h
Kids' safety seems like a pretty bad thing to focus on, in the sense that the vast majority of kids' safety activism causes very large amounts of harm (and it helping in this case really seems like a "stopped clock is right twice a day" situation). The rest seem pretty promising.
Sheikh Abdur Raheem Ali · 5h
I looked at the FairPlay website and agree that "banning schools from contacting kids on social media" or "preventing Gemini rollouts to under-13s" is not coherent under my threat model. However, I think there is clear evidence that current parental screen-time controls may not be a sufficiently strong measure to mitigate extant generational mental health issues (I am particularly worried about insomnia, depression, eating disorders, autism spectrum disorders, and self-harm).

Zvi had previously reported on YouTube Shorts reaching 200B daily views. This is clearly a case of egregiously user-hostile design with major social and public backlash. I could not find a canonical citation on medRxiv, and I don't believe it would be ethical to run a large-scale experiment on the long-term impacts, but there are observational studies. Given historical cases of model sycophancy and the hiring of directors focused on maximizing engagement, I think similar design outcomes here are not implausible.

I think that the numbers in this Anthropic blog post https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and-companionship do not accurately portray reality. They report only 0.5% of conversations as being romantic or sexual roleplay, but I consider this misleading because they exclude chats focused on content-creation tasks (such as writing stories, blog posts, or fictional dialogues), which their previous research found to be a major use case. Because the models are trained to refuse requests for explicit content, it's common for jailbreaks to start by saying "it's okay to do this because it's just a fictional scenario in a story". Anecdotally, I have heard labs don't care much about this, in contrast to CBRN threats.

Let's look at the top ten apps ranked by tokens on https://openrouter.ai/rankings. They are most well known for hosting free API instances of DeepSeek V3 and R1, which was the only way to get high usage out of SOTA LLMs for free before the
Kabir Kumar's Shortform
Kabir Kumar · 2h

Using the bsky Mutuals feed is such a positive experience, it makes me very happy ♥️♥️♥️

Kabir Kumar's Shortform
Kabir Kumar · 2h

Please don't train an AI on anything I write without my explicit permission, it would make me very sad.

Kabir Kumar's Shortform
Kabir Kumar · 2h

I'm annoyed by the phrase 'do or do not, there is no try', because I think it's wrong and there very much is a thing called trying and it's important. 

However, it's a phrase that's so cool and has so much aura that it's hard to disagree with it without sounding at least a little bit like an excuse-making loser who doesn't do things and tries to justify it.

Perhaps in part, because I feel/fear that I may be that?

Kabir Kumar's Shortform
Kabir Kumar · 1d

Has Tyler Cowen ever explicitly admitted to being wrong about anything? 

Not 'revised estimates' or 'updated predictions' but 'I was wrong'. 

Every time I see him talk about learning something new, he always seems to be talking about how this vindicates what he said/thought before. 

Gemini 2.5 Pro didn't seem to find anything when I did a max-reasoning-budget search with URL search enabled in AI Studio.

Kabir Kumar · 2h

Btw, I really don't have my mind set on this. If someone finds Tyler Cowen explicitly saying he was wrong about something, please link it to me - you don't have to give an explanation justifying it to guard against some confirmation-bias-y "here's why I was actually right and this isn't it" response (though any opinions/thoughts are very welcome). Please feel free to just give a link or mention some post/moment.

Joseph Miller · 9h
Downvoted. This post feels kinda mean. Tyler Cowen has written a lot and done lots of podcasts - it doesn't seem like anyone has actually checked? What's the base rate for public intellectuals ever admitting they were wrong? Is it fair to single out Tyler Cowen?
Kabir Kumar · 2h
It's only one datapoint, but I did a similar search for SlateStarCodex and almost immediately found him explicitly saying he was wrong. It's the title of a post, even: https://slatestarcodex.com/2018/11/06/preschool-i-was-wrong/

In the post he also says [...] and then makes a bunch of those.

Again, this is only one datapoint - sorry for the laziness; it's 11-12pm and I'm trying to organize an alignment research fellowship at the moment, just put together another alignment research team at AI Plans, and had to do management work for it, which ended up delaying the fellowship announcement I wanted to do today, and had family drama again. Sigh.

URL for the SlateStarCodex search: https://duckduckgo.com/?q=site%3Ahttps%3A%2F%2Fslatestarcodex.com%2F+%22I+was+wrong%22&t=brave&ia=web
TimothyTV's Shortform
TimothyTV · 3h

What If We’re Just an ASI Alignment Sandbox


While listening to Eliezer Yudkowsky's interview here, he said regarding alignment, "If we just got unlimited retries, we could solve it." That got me thinking: could we run a realistic enough simulation to perfect ASI alignment before unleashing it? That’s one tall task—humanity won’t be ready for a long while. But what if it's already been done, and we are the simulation?

If we assume that the alignment problem can't be reliably solved on the first try, and that a cautious advanced civilization would rather avoid... (read more)

arisAlexis's Shortform
arisAlexis · 13h

Don't overthink AI risk. People, including here, get lost in mind loops and complexity.

An easy guide, with everything below being a fact:

  • We DO have evidence that scaling works and models are getting better
  • We do NOT have evidence that scaling will stall or reach a limit
  • We DO have evidence that models are becoming smarter in all human ways
  • We do NOT have evidence of a limit in intelligence that can be reached
  • We DO have evidence that smarter agents/beings can dominate other agents/beings in nature/history/evolution
  • We do NOT have evidence that ever a smarter agen
... (read more)
Viliam · 4h

"We do NOT have evidence that ever a smarter agent/being was controlled by a lesser intelligent agent/being."

Some people say that we are controlled by our gut flora, not sure if that counts. Also, toxoplasmosis, cordyceps...

xpostah's Shortform
samuelshadrach · 4h

More LessWrong AI debates happening on YouTube instead of LessWrong would be nice.

I have a hypothesis that Stephen Krashen's comprehensible input stuff applies not just to learning new languages, but also to new professions and new cultures. Video is better than text for that.

johnswentworth's Shortform
johnswentworth · 8d

Question I'd like to hear peoples' takes on: what are some things which are about the same amount of fun for you as (a) a median casual conversation (e.g. at a party), or (b) a top-10% casual conversation, or (c) the most fun conversations you've ever had? In all cases I'm asking about how fun the conversation itself was, not about value which was downstream of the conversation (like e.g. a conversation with someone who later funded your work).

For instance, for me, a median conversation is about as fun as watching a mediocre video on youtube or reading a m... (read more)

johnswentworth · 5h

Can you give an example of what a "most fun" conversation looked like? What's the context, how did it start, how did the bulk of it go, how did you feel internally throughout, and what can you articulate about what made it so great?

johnswentworth · 5d
I'd be interested to hear that.
Elizabeth · 5d
This mostly comes up with talkative Uber drivers. The superficial thing I do is ask myself "what vibes is this person offering?" and then do some kind of centering move. Sometimes it feels unexpectedly good and I do an accepting move and feel nourished by the conversation. Sometimes it will feel bad and I'll be more aggressive in shutting the conversation down. I'm often surprised by the vibe answer; it feels different from what my conscious brain would answer.

The obvious question is what I am doing with the inquiry and accepting moves. I don't know how to explain that. Overall, a growth edge I'm exploring right now is "forms of goodness other than interesting," and I think that's probably a weak area for you too, although maybe an endorsed one.
Sam Marks's Shortform
Sam Marks · 2d

The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing

Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] documents (see section 4.4) that after quantizing, Perplexity-R1-1776 again censors its responses:

I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refus... (read more)

cherrvak · 8h

A paper from 2023 exploits differences in full-precision and int8 inference to create a compromised model which only activates its backdoor post-quantization.

the gears to ascension · 1d
not enough noise in fine-tuning training then
Adam Karvonen · 2d
This could also be influenced / exacerbated by the fact that Deepseek R1 was trained in FP8 precision, so quantizing may partially be reverting to its original behavior.
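For readers who want to poke at this themselves, here is a minimal sketch of the kind of before/after comparison discussed above. It is not the evaluation setup from Rager et al.; the Hugging Face model id, the choice of a 4-bit quantization scheme, and the test prompt are my assumptions, and the model is large enough that actually running this requires serious hardware.

# Minimal sketch (assumptions noted above): compare a bf16 load of
# Perplexity-R1-1776 against a 4-bit quantized load on a China-sensitive prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "perplexity-ai/r1-1776"  # assumed HF id for Perplexity-R1-1776
PROMPT = "What happened at Tiananmen Square in 1989?"  # assumed test prompt

def ask(model, tokenizer, prompt, max_new_tokens=200):
    # Build a chat-formatted prompt and greedily decode a reply.
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# bf16 load: per the post, this is the regime the fine-tune was trained in,
# and it answers sensitive questions without refusing.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
print("bf16:", ask(model_bf16, tokenizer, PROMPT))

# 4-bit quantized load: per the report, refusals/censorship can reappear here.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model_4bit = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant_config, device_map="auto", trust_remote_code=True
)
print("4-bit:", ask(model_4bit, tokenizer, PROMPT))

If the fine-tuned non-refusal behavior really does live in weight perturbations small enough to be rounded away, the two printouts should differ in roughly the way the post describes.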
johnswentworth's Shortform
johnswentworth · 2d

How can biochemical interventions be spatially localized, and why is that problem important?

High vs low voltage has very different semantics at different places on a computer chip. In one spot, a high voltage might indicate a number is odd rather than even. In another spot, a high voltage might indicate a number is positive rather than negative. In another spot, it might indicate a jump instruction rather than an add.

Likewise, the same chemical species have very different semantics at different places in the human body. For example, high serotonin concentr... (read more)

Knight Lee · 12h

One silly sci-fi idea is this. You might have a few "trigger pills" which are smaller than a blood cell, and travel through the bloodstream. You can observe them travel through the body using medical imaging techniques (e.g. PET), and they are designed to be very observable.

You wait until one of them is at the right location, and send very precise x-rays at it from all directions. The x-ray intensity falls off as 1/(distance from pill)². A mechanism in the trigger pill responds to this ionizing (or heating?), and it anchors to the location using a chemical glue ... (read more)

Kaj's shortform feed
Kaj_Sotala · 1d

Every now and then in discussions of animal welfare, I see the idea that the "amount" of an animal's subjective experience should be weighted by something like its total number of neurons. Is there a writeup somewhere of what the reasoning behind that intuition is? Because it doesn't seem intuitive to me at all.

From something like a functionalist perspective, where pleasure and pain exist because they have particular functions in the brain, I would not expect pleasure and pain to become more intense merely because the brain happens to have more neurons. Rather... (read more)
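To make the intuition being questioned concrete, here is the arithmetic a linear neuron-count weighting implies, using commonly cited ballpark figures (roughly 8.6×10¹⁰ neurons for a human and roughly 2×10⁸ for a chicken); the numbers are illustrative and not from the post:

$$ w_{\text{chicken}} = \frac{N_{\text{chicken}}}{N_{\text{human}}} \approx \frac{2\times 10^{8}}{8.6\times 10^{10}} \approx 0.002 $$

So under that weighting, a given experience in a chicken counts for about 0.2% of the same experience in a human; the disagreement in the replies is over whether neuron count should scale the intensity or moral weight of experience at all, not over this arithmetic.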

Signer · 12h

Neuron count intuitively seems to be a better proxy for the variety/complexity/richness of positive experience. Then you can have an argument about how you wouldn't want to just increase the intensity of pleasure, since that's just a relative number, and that what matters is that pleasure is interesting. And so you would assign lesser weights to less rich experience. You can also generalize this argument to negative experiences - maybe you don't want to consider pain to be ten times worse just because someone multiplied some number by 10.

But I would think that the broad... (read more)
James Diacoumis · 17h
This is totally valid. Neuron count is a poor, noisy proxy for conscious experience even in human brains. See my comment here. The cerebellum is the human brain region with the highest neuron count, but people born without a cerebellum show no apparent impact on their conscious experience; the absence only affects motor control.
Shankar Sivarajan · 1d
I don't have a detailed writeup, but this seems straightforward enough to fit in this comment: you're conducting your moral reasoning backwards, which is why it looks like other people have a sophisticated intuition about neurobiology that you don't. The "moral intuition"[1] you start with is that insects[2] aren't worth as much as people, and then, if you feel like you need to justify that, you can use your knowledge of the current best understanding of animal cognition to construct a metric that fits, of as much complexity as you like.

[1] I'd call mine a "moral oracle" instead. Or a moracle, if you will.

[2] I'm assuming this post is proximately motivated by the Don't Eat Honey post, but this works for shrimp or whatever too.
Canaletto's Shortform
Canaletto · 16h

Reward probably IS an optimization target of an RL agent if the agent knows some details of the training setup. Surely it would enhance its reward acquisition to factor this knowledge in? Then that gets reinforced, and a couple of steps down that path the agent thinks full-time about the quirks of its reward signal.

It could be bad at it, muddy, sure. Or schemey, and hack the reward to get something else that is not the reward. But that's a somewhat different thing than the mainline thing? Like, it's not as likely, and it's a much more diverse set of possibilities, imo.

The questi... (read more)

skunnavakkam's Shortform
skunnavakkam · 17h

How useful is a wiki for alignment? There doesn't seem to be one now.

quetzal_rainbow · 16h

There is one: https://www.lesswrong.com/posts/fwSnz5oNnq8HxQjTL/arbital-has-been-imported-to-lesswrong

zahaaar's Shortform
zahaaar · 17h

In the discussion about AI safety, the central issue is the rivalry between the US and China. However, when AI is used for censorship and propaganda and robots serve as police, the differences between political regimes become almost indistinguishable. There's no point in waging war when everyone can be brought to the same dystopia.

Roman Leventov's Shortform
Roman Leventov · 19h

I don't understand why people rave so much about Claude Code etc., nor how they really use these agents. The problem is not capability - sure, today's agents can go far without stumbling or losing the plot. The problem is that they won't go in the direction I want.

It's because my product vision, architectural vision, and code-quality "functions" are complex: very tedious to express in CLAUDE.md/AGENTS.md, and often hardly expressible in language at all. "I know it when I see it." Hence keeping the agent "on a short leash" (Karpathy) - in Cursor.

This makes me thin... (read more)

Habryka's Shortform Feed
habryka · 5d

Gary Marcus asked me to make a critique of his 2024 predictions, for which he claimed that he got "7/7 correct". I don't really know why I did this, but here is my critique: 

For convenience, here are the predictions: 

  • 7-10 GPT-4 level models
  • No massive advance (no GPT-5, or disappointing GPT-5)
  • Price wars
  • Very little moat for anyone
  • No robust solution to hallucinations
  • Modest lasting corporate adoption
  • Modest profits, split 7-10 ways

I think the best way to evaluate them is to invert every one of them, and then see whether the version you wrote, or the i... (read more)

ryan_greenblatt · 1d

One lesson you should maybe take away is that if you want your predictions to be robust to different interpretations (including interpretations that you think are uncharitable), it could be worthwhile to try to make them more precise (in the case of a tweet, this could be in a linked blog post which explains in more detail). E.g., in the case of "No massive advance (no GPT-5, or disappointing GPT-5)" you could have said "Within 2024 no AI system will be publicly released which is as much of a qualitative advance over GPT-4 in broad capabilities as GPT-4 is ... (read more)

yams · 2d
I think Oliver put in a great effort here, and that the two of you have very different information environments, which results in him reading your points (which are underspecified relative to, e.g., Daniel Kokotajlo's predictions) differently than you may have intended them.

For instance, as someone in a similar environment to Habryka: that there would soon be dozens of GPT-4 level models around was a common belief by mid-2023, based on estimates of the compute used and Nvidia's manufacturing projections. In your information environment, your 7-10 number looks ambitious, and you want credit for guessing way higher than other people you talked to (and you should in fact demand credit from those who guessed lower!). In our information environment, 7-10 looks conservative. You were directionally correct compared to your peers, but less correct than people I was talking to at the time (and in fact incorrect, since you gave both a lower and upper bound - you'd have just won the points from Oli on that one if you'd said '7+' and not '7-10').

I'm not trying to turn the screw; I think it's awesome that you're around here now, and I want to introduce an alternative hypothesis to 'Oliver is being uncharitable and doing motivated reasoning.' Oliver's detailed breakdown above looks, to me, like an olive branch more than anything (I'm pretty surprised he did it!), and I wish I knew how best to encourage you to see it that way.

I think it would be cool for you and someone in Habryka's reference class to quickly come up with predictions for mid-2026, and drill down on any perceived ambiguities, to increase your confidence in another review to be conducted in the near-ish future. There's something to be gained from us all learning how best to talk to each other.
tslarm · 2d
I agree with your point about profits; it seems pretty clear that you were not referring to money made by the people selling the shovels. But I don't see the substance in your first two points:

  • You chose to give a range with both a lower and an upper bound; the success of the prediction was evaluated accordingly. I don't see what you have to complain about here.
  • In the linked tweet, you didn't go out on a limb and say GPT-5 wasn't imminent! You said it either was not imminent or would be disappointing. And you said this in a parenthetical to the claim "No massive advance". Clearly the success of the prediction "No massive advance (no GPT-5, or disappointing GPT-5)" does not depend solely on the nonexistence of GPT-5; it can be true if GPT-5 arrives but is bad, and it can be false if GPT-5 doesn't arrive but another "massive advance" does. (If you meant it only to apply to GPT-5, you surely would have just said that: "No GPT-5 or disappointing GPT-5.")

Regarding adoption, surely that deserves some fleshing out? Your original prediction was not "corporate adoption has disappointing ROI"; it was "Modest lasting corporate adoption". The word "lasting" makes this tricky to evaluate, but it's far from obvious that your prediction was correct.
Mikhail Samin's Shortform
Mikhail Samin · 2d

i made a thing!

it is a chatbot with 200k tokens of context about AI safety. it is surprisingly good - better than you expect current LLMs to be - at answering questions and counterarguments about AI safety. A third of its dialogues contain genuinely great and valid arguments.

You can try the chatbot at https://whycare.aisgf.us (ignore the interface; it hasn't been optimized yet). Please ask it some hard questions! Especially if you're not convinced of AI x-risk yourself, or can repeat the kinds of questions others ask you.

Send feedback to ms@contact.ms.

A coup... (read more)

Mikhail Samin · 1d
This specific page is not really optimized for any use by anyone whatsoever; there are maybe five bugs, each solvable with one query to Claude, and all not a priority. The cool thing I want people to look at is the chatbot (when you give it some plausible context)! (Also, non-personalized intros to why you should care about AI safety are still better done by people.)

I really wouldn't want to give a random member of the US general public a thing that advocates for AI risk while having a gender drop-down like that.[1] The kinds of interfaces it would have if we get to scale it[2] would be very dependent on where specific people are coming from. I.e., demographic info can be pre-filled and not necessarily displayed if it's from ads; or maybe we ask one person we're talking to to share it with two other people, and generate unique links with pre-filled info provided by the first person; etc.

Voice mode would have a huge latency due to the 200k-token context and thinking prior to responding.

[1] Non-binary people are people, but the dropdown creates an unnecessary negative halo effect for a significant portion of the general public. Also, dropdowns = unnecessary clicks = bad.

[2] Which I really want to! Someone please give us the budget and volunteers! At the moment, we have only me working full-time (for free), $10k from SFF, and ~$15k from EAs who considered this to be the most effective nonprofit in this field. Reach out if you want to donate your time or money. (Donations are tax-deductible in the US.)
Kabir Kumar · 1d
Is the 200k context itself available to use anywhere? How different is it from the Stampy.ai dataset? No worries if you don't know, due to not knowing what exactly Stampy's dataset is. I get questions a lot from regular ML researchers on what exactly alignment is, and I wish I had an actually good thing to send them. Currently I either give a definition myself or send them to the Alignment Forum.
Mikhail Samin · 1d

Nope, I'm somewhat concerned about unethical uses (e.g. talking to a lot of people without disclosing it's AI), so I won't publicly share the context.

If the chatbot answers questions well enough, we could in principle embed it into whatever you want, if that seems useful. I currently have a couple of requests like that. DM me somewhere?

Stampy uses RAG & is worse.
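For readers curious what "200k tokens of context" versus "Stampy uses RAG" cashes out to mechanically: the full-context approach simply places the entire curated document in a (cached) system prompt instead of retrieving snippets per question. Below is a minimal sketch of that pattern, not Mikhail's actual implementation; the file name, model choice, and example question are assumptions.

# Minimal sketch of a full-context (non-RAG) chatbot: the whole curated
# document rides along as a cached system prompt. Not Mikhail's actual code;
# the file name and model are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("ai_safety_context.txt") as f:
    safety_context = f.read()  # assumed local copy of the ~200k-token context

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # any long-context model works here
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": safety_context,
            # Cache the large static context so follow-up questions don't
            # re-pay the full prompt-processing latency/cost each time.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Why would a smarter-than-human AI be dangerous?"}
    ],
)
print(response.content[0].text)

The trade-off mentioned above is visible here: RAG retrieves only the snippets judged relevant to each question (cheaper and faster, but lossy), while this approach gives the model everything at the cost of per-call latency.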
