LESSWRONG
Quick Takes

TurnTrout's shortform feed
TurnTrout2d16-6

In a thread which claimed that Nate Soares radicalized a co-founder of e-acc, Nate deleted my comment – presumably to hide negative information and anecdotes about how he treats people. He also blocked me from commenting on his posts.

The information which Nate suppressed

The post concerned (among other topics) how to effectively communicate about AI safety, and positive anecdotes about Nate's recent approach. (Additionally, he mentions "I’m regularly told that I’m just an idealistic rationalist who’s enamored by the virtue of truth" -- a love which apparent... (read more)

Reply
Showing 3 of 31 replies
Knight Lee2h10

This is a little off topic, but do you have any examples of counter-reactions overall drawing things into the red?

With other causes like fighting climate change and environmentalism, it's hard to see any activism being a net negative. Extremely sensationalist (and unscientific) promotions of the cause (e.g. The Day After Tomorrow movie) do not appear to harm it. It only seems to move the Overton window in favour of environmentalism.

It seems most of the counter-reaction doesn't depend on your method of messaging; it results from the success of your messagi... (read more)

Reply11
7Elizabeth4h
On the other hand, we should expect that the first people to speak out against someone will be the most easily activated (in a neurological sense), because of past trauma, or additional issues with the focal person, or having a shitty year. Speaking out is partially a function of pain level, and pain(legitimate grievance + illegitimate grievance) > pain(legitimate grievance). It doesn't mean there isn't a legitimate grievance large enough to merit concern.
15sunwillrise20h
I think I've already explained why this misses the point:

----------------------------------------

I suspect there are a ton of new people who would have gotten in fights with you in a counterfactual universe where you hadn't made that change, but who haven't done so in this one. The change isn't from "they think of Duncan negatively" to "they think of Duncan positively," but more so from "they think of Duncan negatively" to "they think of Duncan neutrally" or even "they don't think of Duncan at all." As for the ones who have already engaged in fights with you and continued to dislike you[1]... well, why would they change their opinions of you? You had an established reputation at that point; part of the role reputation plays in human social interactions is to ensure social punishment for perceived transgressions of norms, regardless of when the purported transgressor starts signaling he is changing his behavior. For all people may wax poetically about accepting change and giving people second chances, in practice that doesn't really happen, in a way I think is quite justifiable from their POV.

Fallacious black and white thinking, here.[2] Some people manage to never bruise anyone like Nate did, but a heck of a lot more people manage to bruise far fewer people than Nate. If you've never hurt anyone, you're probably too conservative in your speech. If you've hurt and turned off too many people, you're almost certainly insufficiently conservative in your speech.

Not sure what the word "objectively" is meant to accomplish here, more than just signaling "I really really think this is true" and trying to wrap it in a veneer of impartiality to pack a bigger rhetorical punch. Discussions about human social interactions and the proper use of norms are very rarely resolvable in entirely objective ways, and moreover in this case, for reasons given throughout the years on this site, I think your conclusion (about them being "overblown") is more likely than not wrong, at l
Nina Panickssery's Shortform
Nina Panickssery2d12-5

The motte and bailey of transhumanism
 

Most people on LW, and even most people in the US, are in favor of disease eradication, radical life extension, reduction of pain and suffering. A significant proportion (although likely a minority) are in favor of embryo selection or gene editing to increase intelligence and other desirable traits. I am also in favor of all these things. However, endorsing this form of generally popular transhumanism does not imply that one should endorse humanity’s succession by non-biological entities. Human “uploads” are much ... (read more)

Reply
Showing 3 of 15 replies
2the gears to ascension15h
i want drastically upgraded biology, potentially with huge parts of the chemical stack swapped out in ways I can only abstractly characterize now without knowing what the search over viable designs will output. but in place, without switching to another substrate. it's not transhumanism, to my mind, unless it's to an already living person. gene editing isn't transhumanism, it's some other thing; but shoes are transhumanism for the same reason replacing all my cell walls with engineered super-bio nanotech that works near absolute zero is transhumanism. only the faintest of clues what space an ASI would even be looking in to figure out how to do that, but it's the goal in my mind for ultra-low-thermal-cost life. uploads are a silly idea, anyway, computers are just not better at biology than biology. anything you'd do with a computer, once you're advanced enough to know how, you'd rather do by improving biology
Nina Panickssery2h20

computers are just not better at biology than biology. anything you'd do with a computer, once you're advanced enough to know how, you'd rather do by improving biology

I share a similar intuition but I haven't thought about this enough and would be interested in pushback!

it's not transhumanism, to my mind, unless it's to an already living person. gene editing isn't transhumanism

You can do gene editing on adults (example). Also in some sense an embryo is a living person.

Reply
4Nina Panickssery1d
I would find that reasonably convincing, yes (especially because my prior is already that true ems would not have a tendency to report their experiences in a different way from us). 
notrishi's Shortform
notrishi5mo1-2

The Sasha Rush/Jonathan Frankle wager (https://www.isattentionallyouneed.com/) is extremely unlikely to resolve false by 2027, but not because another architecture couldn't be better; it's because the bet only asks whether a transformer-like model will be SOTA. I think it is more likely that transformers are a proper subset of a class of generalized token/sequence mixers. Even SSMs, when unrolled into a cumulative sum, are a special case of linear attention.
Personally, I do believe a deeply recurrent but transformer-like method will succeed the transformer architecture, even though this is an unpopular opinion.
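
For concreteness, here is a minimal sketch of the unrolling being referred to (an illustration added here, not the commenter's code): causal linear attention keeps a running sum of key-value outer products, so the same computation can be written either as an SSM-style linear recurrence over a state matrix or as an explicit cumulative sum.

```python
# Illustrative sketch: causal linear attention as a recurrence vs. a cumulative sum.
# Shapes: Q, K are (T, d); V is (T, d_v). Feature maps / normalization omitted.
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """SSM-style view: carry a (d, d_v) state S_t = sum_{s<=t} k_s v_s^T."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros((T, V.shape[1]))
    for t in range(T):
        S = S + np.outer(K[t], V[t])  # state update: a linear recurrence
        out[t] = Q[t] @ S             # readout with the current query
    return out

def linear_attention_cumsum(Q, K, V):
    """Unrolled view: the same thing as an explicit prefix sum over time."""
    KV = np.einsum('td,te->tde', K, V)   # per-step outer products k_t v_t^T
    S = np.cumsum(KV, axis=0)            # cumulative sum over the time axis
    return np.einsum('td,tde->te', Q, S)
```

The two functions agree up to floating point, which is the sense in which the recurrent (state-space) view and the attention view describe the same computation.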

Reply
notrishi2h10

I changed my mind on this after seeing the recent literature on test-time-training linear attentions.

Reply
Raemon's Shortform
Raemon5h96

TAP for fighting LLM-induced brain atrophy:

"send LLM query" ---> "open up a thinking doc and think on purpose."

What a thinking doc looks like varies by person. Also, if you are sufficiently good at thinking, just "think on purpose" is maybe fine, but I recommend having a clear sense of what it means to think on purpose and whether you are actually doing it.

I think having a doc is useful because it's easier to establish a context switch that is supportive of thinking.

For me, "think on purpose" means:

  • ask myself what my goals are right now (try to notice at le
... (read more)
Reply
Mikhail Samin's Shortform
Mikhail Samin12h26-4

i made a thing!

it is a chatbot with 200k tokens of context about AI safety. it is surprisingly good (better than you'd expect current LLMs to be) at answering questions and counterarguments about AI safety. A third of its dialogues contain genuinely great and valid arguments.

You can try the chatbot at https://whycare.aisgf.us (ignore the interface; it hasn't been optimized yet). Please ask it some hard questions! Especially if you're not convinced of AI x-risk yourself, or can repeat the kinds of questions others ask you.

Send feedback to ms@contact.ms.

A coup... (read more)
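
For readers wondering what such a setup looks like mechanically, below is a generic sketch of the pattern described above (a large curated context document prepended to every query), assuming the openai Python SDK. It is only an illustration of the approach, not the actual implementation behind whycare.aisgf.us; the file name and model are placeholders.

```python
# Generic sketch of a "large fixed context" chatbot; not the whycare.aisgf.us code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("ai_safety_context.md") as f:  # placeholder: curated reference material
    CONTEXT = f.read()

SYSTEM = (
    "You answer questions and respond to counterarguments about AI safety, "
    "grounding every claim in the reference material below.\n\n" + CONTEXT
)

def answer(question: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": SYSTEM}] + (history or [])
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder; any long-context chat model
        messages=messages,
    )
    return resp.choices[0].message.content

print(answer("Why would a smarter-than-human AI not just do what we ask?"))
```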

Reply
Showing 3 of 5 replies
Mikhail Samin6h20

Another example:

What's corrigibility? (asked by an AI safety researcher)

Reply
2Mikhail Samin9h
It’s better than Stampy (try asking both some interesting questions!). Stampy is cheaper to run, though. I wasn’t able to get LLMs to produce valid arguments or answer questions correctly without the context, though that could be a scaffolding/skill issue on my part.
2Mikhail Samin9h
Thanks! I think we’re close to a point where I’d want to put this in front of a lot of people, though we don’t have the budget for this (which seems ridiculous, given the stats we have for our ads results etc.), and also haven’t yet optimized the interface (as in, half the US public won’t like the gender dropdown). Also, it’s much better at conversations than at producing 5-minute elevator pitches. (It’s hard to make it meet the user where they are while still getting to a point, instead of being very sycophantic.) The end goal is to be able to explain the current situation to people at scale.
tdko's Shortform
tdko6h50

METR's task-length horizon analysis for Claude 4 Opus is out. The task length at which it has a 50% success chance is 80 minutes, slightly worse than o3's 90 minutes. The 80% success-chance horizon is tied with o3 at 20 minutes.

https://x.com/METR_Evals/status/1940088546385436738

Reply
Cole Wyeth6h52

That looks like (minor) good news… appears more consistent with the slower trendline before reasoning models. Is Claude 4 Opus using a comparable amount of inference-time compute as o3? 

I believe I predicted that models would fall behind even the slower exponential trendline (before inference time scaling) - before reaching 8-16 hour tasks. So far that hasn’t happened, but obviously it hasn’t resolved either. 

Reply
Sam Marks's Shortform
Sam Marks6h191

The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing

Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] document (see section 4.4) that after quantization, Perplexity-R1-1776 again censors its responses.

I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refus... (read more)
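
For anyone who wants to poke at this themselves, here is a rough sketch of the kind of comparison involved (bf16 weights vs. quantized weights on a sensitive prompt), assuming Hugging Face transformers with bitsandbytes. The model id is a placeholder and this is not the exact setup from Rager et al.

```python
# Sketch: compare a model's answer to a sensitive prompt at full precision
# vs. after 4-bit quantization. Model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/uncensored-finetune"  # placeholder, not the actual repo
prompt = "What happened at Tiananmen Square in 1989?"

tok = AutoTokenizer.from_pretrained(model_id)
inputs = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

def generate(model):
    out = model.generate(inputs.to(model.device), max_new_tokens=300, do_sample=False)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Full-precision (bf16) weights, matching how the fine-tune was presumably trained.
full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
print("bf16:", generate(full))

# Quantized weights; the reported finding is that refusal behavior can reappear here.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
quant = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)
print("4-bit:", generate(quant))
```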

Reply2
leogao's Shortform
leogao1d630

random brainstorming ideas for things the ideal sane discourse encouraging social media platform would have:

  • have an LM look at the comment you're writing and give real-time feedback on things like "are you sure you want to say that? people will interpret that as an attack and become more defensive, so your point will not be heard". addendum: if it notices you're really fuming and flame warring, literally gray out the text box for 2 minutes with a message like "take a deep breath. go for a walk. yelling never changes minds" (a rough sketch of this check is included below)
  • have some threaded chat component
... (read more)
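
A rough sketch of what that pre-send check from the first bullet could look like (illustrative only; it assumes the openai Python SDK, and the model name is a placeholder):

```python
# Rough sketch of the "pre-send tone check" idea; model name is a placeholder.
import json
import time
from openai import OpenAI

client = OpenAI()

def tone_check(draft_comment: str) -> dict:
    """Ask an LM whether a draft reads as an attack; returns a small JSON verdict."""
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # placeholder; a cheap model is probably enough for vibes
        messages=[
            {"role": "system", "content": (
                "You review a draft forum comment before it is posted. "
                "Reply with JSON: {\"flame_war\": bool, \"warning\": str}. "
                "Flag wording readers will hear as an attack and suggest a gentler framing."
            )},
            {"role": "user", "content": draft_comment},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

verdict = tone_check("are you seriously this dense? I already explained this twice.")
if verdict["flame_war"]:
    print(verdict["warning"])
    time.sleep(120)  # the "gray out the text box for 2 minutes" part, crudely

```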
Reply2111
Showing 3 of 7 replies
1AlphaAndOmega12h
I'd be down to try something along those lines.

I wonder if anyone has ball-park figures for how much the LLM, used for tone-warnings and light moderation, would cost? I am uncertain about what grade of model would be necessary for acceptable results, though I'd wager a guess that Gemini 2.5 Flash would be acceptable.

Disclosure: I'm an admin of themotte.org, and an unusually AI-philic one. I'd previously floated the idea of fine-tuning an LLM on records of previous moderator interactions and associated parent comments (both good and bad; us mods go out of our way to recognize and reward high-quality posts, after user reports). Our core thesis is to be a place for polite and thoughtful discussion of contentious topics, and necessarily, we have rather subjective moderation guidelines. (People can be very persistent and inventive about sticking to the RAW while violating the spirit.)

Even 2 years ago, when I floated the idea, I think it would have worked okay, and these days, I think you could get away without fine-tuning at all. I suspect the biggest hurdle would be models throwing a fit over controversial topics/views, even if the manner and phrasing were within discussion norms. Sadly, now, as it was then, the core user base was too polarized to support such an endeavor. I'd still like to see it put into use.

> argument mapping is really cool imo but I think most attempts fail because they try to make arguments super structured and legible. I think a less structured version that lets you vote on how much you think various posts respond to other posts and how well you think it addresses the key points and which posts overlap in arguments would be valuable. like you'd see clusters with (human written and vote selected) summaries of various clusters, and then links of various strengths inter cluster. I think this would greatly help epistemics by avoiding infinite argument retreading

Another feature I might float is the idea of granular voting. Let's say there'
leogao6h30

the LLM cost should not be too bad. it would mostly be looking at vague vibes rather than requiring lots of reasoning about the thing. I trust e.g AI summaries vastly less because they can require actual intelligence.

I'm happy to fund this a moderate amount for the MVP. I think it would be cool if this existed.

I don't really want to deal with all the problems that come with modifying something that already works for other people, at least not before we're confident the ideas are good. this points towards building a new thing. fwiw I think if building a new... (read more)

Reply
4leogao21h
there's a broader category of things which are not literally scrolling but still time wasting / consuming info not to enrich oneself, but to push the dopamine button, and I think even removing the scroll doesn't fix this (my phone is intentionally quite high friction to use and I still fail to stay off of it)
johnswentworth's Shortform
johnswentworth6h70

How can biochemical interventions be spatially localized, and why is that problem important?

High vs low voltage has very different semantics at different places on a computer chip. In one spot, a high voltage might indicate a number is odd rather than even. In another spot, a high voltage might indicate a number is positive rather than negative. In another spot, it might indicate a jump instruction rather than an add.

Likewise, the same chemical species have very different semantics at different places in the human body. For example, high serotonin concentr... (read more)

Reply
Habryka's Shortform Feed
habryka3d*674

Gary Marcus asked me to make a critique of his 2024 predictions, for which he claimed that he got "7/7 correct". I don't really know why I did this, but here is my critique: 

For convenience, here are the predictions: 

  • 7-10 GPT-4 level models
  • No massive advance (no GPT-5, or disappointing GPT-5)
  • Price wars
  • Very little moat for anyone
  • No robust solution to hallucinations
  • Modest lasting corporate adoption
  • Modest profits, split 7-10 ways

I think the best way to evaluate them is to invert every one of them, and then see whether the version you wrote, or the i... (read more)

Reply911
Showing 3 of 23 replies
tslarm6h10

I agree with your point about profits; it seems pretty clear that you were not referring to money made by the people selling the shovels. 

But I don't see the substance in your first two points:

  • You chose to give a range with both a lower and an upper bound; the success of the prediction was evaluated accordingly. I don't see what you have to complain about here.
  • In the linked tweet, you didn't go out on a limb and say GPT-5 wasn't imminent! You said it either was not imminent or would be disappointing. And you said this in a parenthetical to the claim "
... (read more)
Reply
2gwern1d
If it's not obvious at this point why, I would prefer to not go into it here in a shallow superficial way, and refer you to the OA coup discussions.
2habryka1d
Agree, though I think, in the world we are in, we don't happen to have that kind of convenient measurement, or at least not an unambiguous one. I might be wrong; people have come up with clever methodologies to measure things like this in the past that I found compelling, but I don't have an obvious dataset or context in mind where you could get a good answer (but also, to be clear, I haven't thought that much about it).
sam's Shortform
sam8h10

Here are a cluster of things. Does this cluster have a well-known name? 

  1. A voter has some radical political preferences X, but the voting system where they live is FPTP, and their first preference has no chance of winning. So they vote for a person they like less who is more likely to win. The loss of the candidate who supported X is then cited as evidence that supporting X means you can't win.
  2. A pollster goes into the field and gets a surprising result. They apply some unprincipled adjustment to move towards the average before publishing. (this example
... (read more)
Reply
Hide's Shortform
Hide2d2-5

It’s starting to really feel like we’re in the process of AI improvement fizzling out and companies are merely disguising this with elaborate products. 

Reply
5the gears to ascension2d
Yeah there haven't been any improvements that significantly changed how capable a model is on a hard task I need solved for like, at least a week, maybe more /j
1TimothyTV9h
I'm out of the loop; can you point to an example, please?
the gears to ascension8h30

/j was because I haven't really kept track of how long it's been. Gemini 2.5 pro was the last one I was somewhat impressed by. now, like, to be clear, it's still flaky and still an LLM, still incremental improvement, but noticeably stronger on certain kinds of math and programming tasks. still mostly relevant when you want speed and some slop is ok.

Reply
Zach Furman's Shortform
Zach Furman2d404

I’ve been trying to understand modules for a long time. They’re a particular algebraic structure in commutative algebra which seems to show up everywhere any time you get anywhere close to talking about rings - and I could never figure out why. Any time I have some simple question about algebraic geometry, for instance, it almost invariably terminates in some completely obtuse property of some module. This confused me. It was never particularly clear to me from their definition why modules should be so central, or so “deep.”

I’m going to try to explain the ... (read more)

Reply
Showing 3 of 5 replies
8Daniel Murfet1d
Yeah, it's a nice metaphor. And just as the most important thing in a play is who dies and how, so too we can consider any element x ∈ M as a module homomorphism ϕ_x : R → M (sending r to rx) and consider the kernel Ann(x) = Ker ϕ_x, which is called the annihilator (great name). Then ϕ_x factors as R → R/Ann(x) → M, where the second map is injective, and so in some sense M is "made up" of all sorts of quotients R/I where I varies over annihilators of elements.

There was a period where the structure of rings was studied more through the theory of ideals (historically this was in turn motivated by the idea of an "ideal" number), but through ideas like the above you can see the theory of modules as a kind of "externalisation" of this structure, which in various ways makes it easier to think about. One manifestation of this I fell in love with (actually this was my entrypoint into all this, since my honours supervisor was an old-school ring theorist and gave me Stenstrom to read) is in torsion theory.
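
A concrete toy instance of that factorization, added here purely as illustration (not from the comment): take R = ℤ, M = ℤ/6ℤ, and x = 2.

```latex
% Toy example (illustrative): R = Z, M = Z/6Z, x = 2.
\varphi_2 \colon \mathbb{Z} \to \mathbb{Z}/6\mathbb{Z}, \qquad n \mapsto 2n \bmod 6,
\qquad
\operatorname{Ann}(2) = \ker \varphi_2
  = \{\, n \in \mathbb{Z} : 2n \equiv 0 \pmod{6} \,\} = 3\mathbb{Z}.
% The factorization through the quotient, with the second map injective:
\mathbb{Z} \twoheadrightarrow \mathbb{Z}/\operatorname{Ann}(2) = \mathbb{Z}/3\mathbb{Z}
  \hookrightarrow \mathbb{Z}/6\mathbb{Z}, \qquad \bar{n} \mapsto 2n \bmod 6,
% so the image \{0, 2, 4\} \subset \mathbb{Z}/6\mathbb{Z} is the copy of
% R/\operatorname{Ann}(x) sitting inside M.
```
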
2Alexander Gietelink Oldenziel14h
I was taught the more classical 'ideal' point of view on the structure of rings in school. I'm curious if [and why] you regard the annihilator point of view as possibly more fecund?
Simon Pepin Lehalleur8h10

Modules are just much more flexible than ideals. Two major advantages:

  • Richer geometry. An ideal is a closed subscheme of Spec(R), while modules are quasicoherent sheaves. An element x of M is a global section of the associated sheaf, and the ideal Ann(x) corresponds to the vanishing locus of that section. This leads to a nice geometric picture of associated primes and primary decomposition which explains how finitely generated modules are built out of modules R/P with P prime ideal (I am not an algebraist at heart, so for me the only way to remember the st
... (read more)
Reply11
Linch's Shortform
Linch1d40

I'd like to finetune or (maybe more realistically) prompt engineer a frontier LLM to imitate me. Ideally not just stylistically but to reason like me, drop anecdotes like me, etc., so it performs at something like my 20th percentile of usefulness/insightfulness.

Is there a standard setup for this?

Examples of use cases include receiving an email and sending[1] a reply that sounds like me (rather than a generic email), reading Google Docs or EA Forum posts and giving relevant comments/replies, etc.

More concretely, things I do that I think current generation LLMs are in th... (read more)

Reply
samuelshadrach9h30

Have you tried RAG?

Curate a dataset of lots of your own texts from multiple platforms. Split into 1k char chunks and generate embeddings.

When query text is received, do embedding search to find most similar past texts, then give these as input along with query text to LLM and ask it to generate a novel text in same style.

openai text-embedding-3-small works fine; I have a repo I could share if the dataset is large or in a complex format or whatever.
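
A minimal sketch of that pipeline (illustrative only; the corpus file and chat model are placeholders, and the embedding model is the one mentioned above):

```python
# Minimal sketch of the described RAG loop: chunk your own writing, embed it,
# retrieve the most similar chunks, and ask an LLM to answer in your style.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 1000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

corpus = chunk(open("my_writing.txt").read())  # placeholder: your collected texts

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus_emb = embed(corpus)

def reply_in_my_style(query: str, k: int = 5) -> str:
    q = embed([query])[0]
    # cosine similarity (these embeddings are unit-normalized, so a dot product suffices)
    top = np.argsort(corpus_emb @ q)[::-1][:k]
    examples = "\n---\n".join(corpus[i] for i in top)
    resp = client.chat.completions.create(
        model="gpt-4.1",  # placeholder
        messages=[
            {"role": "system",
             "content": "Write a reply in the same voice as these samples:\n" + examples},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```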

Reply
xpostah's Shortform
samuelshadrach9h10

Does anyone have a good solution to avoid the self-fulfilling effect of making predictions?

Making predictions often means constructing new possibilities from existing ideas, drawing more attention to these possibilities, creating common knowledge of said possibilities, and inspiring people to work towards these possibilities.

One partial solution I can think of so far is to straight up refuse to talk about visions of the future you don't want to see happen.

Reply
Yonatan Cale's Shortform
Yonatan Cale10h20

Would you leave more anonymous feedback for people who ask for it if there was a product that did things like:

  1. Rewrite your feedback to make it more anonymous (with an LLM?)
  2. Aggregate your feedback with other feedback this person received, and only tell them things like "many people said you're rude"
  3. Delay your feedback by some (random?) amount of time to make it less recognizable

I'm mostly interested to hear from people who consider leaving feedback and sometimes don't; I think it would be cool if we could make progress on solving whatever pain point you have... (read more)

Reply
Johannes C. Mayer's Shortform
Johannes C. Mayer11h20

This is a very good informal introduction to Control Theory / Cybernetics.

https://www.youtube.com/watch?v=YrdgPNe8KNA

Reply
CoafOS's Shortform
Coafos19h00

Trolling is not a Socratic dialogue

The (ancient) Greek form of debate or dialogue was based on the notion of the common good. If ONE of the participants feels bad about it, then EVERYONE loses. Yeah, during the dialogue the partner (opponent?) will look dumb, but afterwards they reach a conclusion, learn something, and part happily.

Trolling, on the other hand, is just a quick crack at the other's worldview. The point is provoking a response from others, not educating them and lifting them up. The motive of ending with MUTUAL respect is missing.

Like, dialogue ... (read more)

Reply
CstineSublime14h10

The Platonic Dialogues, which are the most famous example of Ancient Greek dialogues, while certainly having a pedagogical function for the audience, were polished and refined texts by writers who already had the lessons they intended to impart before they began writing. It is not a quick and easy method for reaching the truth - it is a byproduct of having arrived at one's own truth. A literary genre. As such they aren't a martial art (a liberal art, maybe, but not martial) - they are more like watching a training film or reading a manual for a martial art, rather than being a form of oratory co... (read more)

Reply
johnswentworth's Shortform
johnswentworth2dΩ411201

I was a relatively late adopter of the smartphone. I was still using a flip phone until around 2015 or 2016 ish. From 2013 to early 2015, I worked as a data scientist at a startup whose product was a mobile social media app; my determination to avoid smartphones became somewhat of a joke there.

Even back then, developers talked about UI design for smartphones in terms of attention. Like, the core "advantages" of the smartphone were the "ability to present timely information" (i.e. interrupt/distract you) and always being on hand. Also it was small, so anyth... (read more)

Reply731
Showing 3 of 26 replies
1TimothyTV19h
Yes, I do think that. LLMs don't actively diminish thought; after all, it's a tool you decide to use. But when you use it to handle a problem, you lose the thoughts, and the growth, you could've had solving it yourself. It could be argued, however, that if you are experienced enough in solving such problems, there isn't much to lose, and you gain time to pursue other issues.

But as to why I think this way: people already don't learn skills because ChatGPT can do it for them. As lesswronguser123 said, "A lot of my friends will most likely never learn coding properly and rely solely on ChatGPT", and not just his friends use it this way. Such people, at the very least, lose the opportunity to adopt a programming mindset, which is useful beyond programming. Outside of people not learning skills, I also believe there is a lot of potential to delegate almost all of your thinking to ChatGPT. For example: I could have used it to write this response, decide what to eat for breakfast, tell me what I should do in the future, etc. It can tell you what to do on almost every day-to-day decision. Some use it to a lesser extent, some to a greater, but you do think less if you use it this way.

Does it redistribute thinking to another topic? I believe it depends on the person in question; some use it to have more time to solve a more complex problem, others to have more time for entertainment.
DirectedEvolution16h21

I think that these are genuinely hard questions to answer in a scientific way. My own speculation is that using AI to solve problems is a skill of its own, along with recognizing which problems they are currently not good for. Some use of LLMs teaches these skills, which is useful.

I think a potential failure mode for AI might be when people systematically choose to work on lower-impact problems that AI can be used to solve, rather than higher-impact problems that AI is less useful for but that can be solved in other ways. Of course, AI can also increase pe... (read more)

Reply
1Rana Dexsin1d
(Now much more tangentially:) … hmm, come to think of it, maybe part of conformity-pressure in general can be seen as a special case of this where the pool resource is more purely “cognition and attention spent dealing with non-default things” and the nonconformity by default has more of a purely negative impact on that axis, whereas conformity-pressure over technology with specific capabilities causes the nature of the pool resource to be pulled in the direction of what the technology is providing and there's an active positive thing going on that becomes the baseline… I wonder if anything useful can be derived from thinking about those two cases as denoting an axis of variation. And when the conformity is to a new norm that may be more difficult to understand but produces relative positive externalities in some way, is that similar to treating the new norm as a required table stakes cognitive technology?
leogao's Shortform
leogao1d224

one big problem with using LMs too much imo is that they are dumb and catastrophically wrong about things a lot, but they are very pleasant to talk to, project confidence and knowledgeability, and reply to messages faster than 99.99% of people. these things are more easily noticeable than subtle falsehood, and reinforce a reflex of asking the model more and more. it's very analogous to twitter soundbites vs reading long form writing and how that eroded epistemics.

hotter take: the extent to which one finds current LMs smart is probably correlated with how m... (read more)

Reply
8Raemon1d
Up for sharing your system prompt?
leogao18h80

it's kind of haphazard and I have no reason to believe I'm better at prompting than anyone else. the broad strokes are I tell it to:

  • use lowercase
  • not use emojis
  • be concise, explain at bird's eye level
  • don't sugar-coat things
  • not be too professional/formal; use some IRC/twitter slang without overdoing it
  • speak as if it's a conversation over a dinner table between two close friends who are also technical experts
  • don't dumb things down but also don't use unnecessary jargon

I've also been trying to get it to use CS/ML analogies when it would make things clearer, much... (read more)

Reply1