LESSWRONG

Quick Takes


Popular Comments

Recent Discussion

[Answer] Why wasn't science invented in China?
Best of LessWrong 2019

While the scientific method developed in pieces over many centuries and places, Joseph Ben-David argues that in 17th century Europe there was a rapid accumulation of knowledge, restricted to a small area for about 200 years. Ruby explores whether this is true and why it might be, aiming to understand "what causes intellectual progress, generally?"

by Ruby
472Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
74
Zach Stein-Perlman2d11055
12
iiuc, xAI claims Grok 4 is SOTA and that's plausibly true, but xAI didn't do any dangerous capability evals, doesn't have a safety plan (their draft Risk Management Framework has unusually poor details relative to other companies' similar policies and isn't a real safety plan, and it said "We plan to release an updated version of this policy within three months" but it was published on Feb 10, over five months ago), and has done nothing else on x-risk. That's bad. I write very little criticism of xAI (and Meta) because there's much less to write about than OpenAI, Anthropic, and Google DeepMind — but that's because xAI doesn't do things for me to write about, which is downstream of it being worse! So this is a reminder that xAI is doing nothing on safety afaict and that's bad/shameful/blameworthy.[1]

1. ^ This does not mean safety people should refuse to work at xAI. On the contrary, I think it's great to work on safety at companies that are likely to be among the first to develop very powerful AI that are very bad on safety, especially for certain kinds of people. Obviously this isn't always true and this story failed for many OpenAI safety staff; I don't want to argue about this now.
Daniel Kokotajlo1d6618
11
I have recurring worries about how what I've done could turn out to be net-negative.

  • Maybe my leaving OpenAI was partially responsible for the subsequent exodus of technical alignment talent to Anthropic, and maybe that's bad for "all eggs in one basket" reasons.
  • Maybe AGI will happen in 2029 or 2031 instead of 2027 and society will be less prepared, rather than more, because politically loads of people will be dunking on us for writing AI 2027, and so they'll e.g. say "OK so now we are finally automating AI R&D, but don't worry it's not going to be superintelligent anytime soon, that's what those discredited doomers think. AI is a normal technology."
Thane Ruthenis1d*Ω15336
3
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset. Quoting Gwern:

Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers.

  • If you look at it through a mathematician's lens, you consider each abstraction layer approximately robust. Making things secure, then, is mostly about working within each abstraction layer, building systems that are secure under the assumptions of a given abstraction layer's validity. You write provably secure code, you educate people to resist psychological manipulations, you inoculate them against viral bioweapons, you implement robust security policies and high-quality governance systems, et cetera.
    • In this view, security is a phatic problem, a once-and-done thing.
    • In warfare terms, it's a paradigm in which sufficiently advanced static fortifications rule the day, and the bar for "sufficiently advanced" is not that high.
  • If you look at it through a hacker's lens, you consider each abstraction layer inherently leaky. Making things secure, then, is mostly about discovering all the ways leaks could happen and patching them up. Worse yet, the tools you use to implement your patches are themselves leakily implemented. Proven-secure code is foiled by hardware vulnerabilities that cause programs to move to theoretically impossible states; the abstractions of human minds are circumvented by Basilisk hacks; the adversary intervenes on the logistical lines for your anti-bioweapon tools and sabotages them; robust security policies and governance systems are foiled by compromising the people implementing them rather than by clever rules-lawyering; and so on.
    • In this view, security is an anti-inductive pr
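As a concrete illustration of the hacker's-lens claim that abstraction layers leak (a minimal sketch added here, not something from the quoted shortform): the function below returns the logically correct answer, yet leaks information about the secret through its running time, which is why constant-time comparisons such as hmac.compare_digest exist.

```python
# A "correct" equality check that leaks through a side channel: its runtime
# depends on how long a prefix of the secret the guess matches.
# Illustrative only; real-world timing differences are small and noisy.
import time

SECRET = "hunter2hunter2"

def naive_equal(a: str, b: str) -> bool:
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:      # early exit leaks the position of the first mismatch
            return False
    return True

def time_guess(guess: str, trials: int = 50000) -> float:
    start = time.perf_counter()
    for _ in range(trials):
        naive_equal(guess, SECRET)
    return time.perf_counter() - start

# A guess sharing a longer prefix with SECRET tends to take longer on average,
# so an attacker can recover the secret one character at a time.
print(time_guess("aaaaaaaaaaaaaa"))
print(time_guess("hunter2aaaaaaa"))
```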
Raemon4d939
35
We get like 10-20 new users a day who write a post describing themselves as a case-study of having discovered an emergent, recursive process while talking to LLMs. The writing generally looks AI generated. The evidence usually looks like, a sort of standard "prompt LLM into roleplaying an emergently aware AI". It'd be kinda nice if there was a canonical post specifically talking them out of their delusional state.  If anyone feels like taking a stab at that, you can look at the Rejected Section (https://www.lesswrong.com/moderation#rejected-posts) to see what sort of stuff they usually write.
Buck2d3811
2
I think that I've historically underrated learning about historical events that happened in the last 30 years, compared to reading about more distant history. For example, I recently spent time learning about the Bush presidency, and found learning about the Iraq war quite thought-provoking. I found it really easy to learn about things like the foreign policy differences among factions in the Bush admin, because e.g. I already knew the names of most of the actors and their stances are pretty intuitive/easy to understand. But I still found it interesting to understand the dynamics; my background knowledge wasn't good enough for me to feel like I'd basically heard this all before.
So You Think You've Awoken ChatGPT
149
JustisMills
2d

Written in an attempt to fulfill @Raemon's request.

AI is fascinating stuff, and modern chatbots are nothing short of miraculous. If you've been exposed to them and have a curious mind, it's likely you've tried all sorts of things with them. Writing fiction, soliciting Pokemon opinions, getting life advice, counting up the rs in "strawberry". You may have also tried talking to AIs about themselves. And then, maybe, it got weird.

I'll get into the details later, but if you've experienced the following, this post is probably for you:

  • Your instance of ChatGPT (or Claude, or Grok, or some other LLM) chose a name for itself, and expressed gratitude or spiritual bliss about its new identity. "Nova" is a common pick.
  • You and your instance of ChatGPT discovered some sort of
...
(Continue Reading – 2540 more words)
Aprillion36m10
Reply
1Aprillion40m
if you were to choose as your target audience people who enjoy reading AI-edited text, are you confident you've chosen wisely?
5tjade27315h
How do models with high deception-activation act? Are they Cretan liars, saying the opposite of every statement they believe to be true? Do they lie only when they expect not to be caught? Are they broadly more cynical and conniving, more prone to reward hacking? Do they lose other values (like animal welfare)? It seems at least plausible that cranking up "deception" pushes the model towards a character space with lower empathy and willingness to ascribe or value sentience in general.
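For readers unfamiliar with what "cranking up 'deception'" would mean mechanically, here is a minimal, hypothetical activation-steering sketch in PyTorch: a forward hook adds a scaled concept direction to one layer's hidden states. The model, layer path, and direction vector are placeholders, not anything from the work under discussion.

```python
# Hypothetical sketch of activation steering: add alpha * direction to the
# hidden states produced by one transformer block during the forward pass.
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Build a forward hook that shifts a layer's output along `direction`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage (placeholder layer path; `deception_direction` is assumed to come from
# e.g. contrasting activations on deceptive vs. honest prompts):
# handle = model.transformer.h[12].register_forward_hook(
#     make_steering_hook(deception_direction, alpha=8.0))
# ...generate text, observe behavior...
# handle.remove()
```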
Planet X, Lord Kelvin, and the use of Structure as Fuel
11
David Björling
5d

On the nature of boiling

Few people know this, but boiling is a cooling effect. If you somehow lower the boiling point of water below ambient temperature, you will get boiling water, as it quickly cools down to its current boiling point. The easiest way to do this is to create a partial vacuum with a vacuum pump. A glass of water inside the bell jar will start boiling at room temperature as the pressure drops.
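As a rough sanity check (an added aside, not part of the demonstration): the Antoine equation, a standard empirical fit for water's vapor pressure, lets you estimate the boiling point at a given ambient pressure. The constants below are the usual water fit, valid roughly from 1 to 100 °C with pressure in mmHg.

```python
# Boiling point of water vs. ambient pressure, via the Antoine equation.
# A, B, C are the standard constants for water (roughly 1-100 °C, P in mmHg).
import math

A, B, C = 8.07131, 1730.63, 233.426

def boiling_point_celsius(pressure_mmhg: float) -> float:
    """Temperature at which water's vapor pressure equals the ambient pressure."""
    return B / (A - math.log10(pressure_mmhg)) - C

print(boiling_point_celsius(760))   # ~100 °C at standard atmospheric pressure
print(boiling_point_celsius(17.5))  # ~20 °C: at a few percent of an atmosphere, water boils at room temperature
```

So the pump only needs to pull the pressure down to a couple of percent of an atmosphere for the room-temperature boiling in the demo.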

This is a fun demonstration I have shown students. I always ask “Is there anyone brave enough to get boiling hot water poured in their hand?” There is always someone. The shock is universal, each time newly boiled water is poured into a tense hand:

“It is cold!?”

Yes, the temperature has dropped significantly below room temperature....

(See More – 872 more words)
Jiro1h20

The beauty of “hot” is that it is a relative term. Hot for whom?

It's implicitly relative to the speaker and/or the listener. Claiming that because you didn't specify one of those it's from the perspective of an ice cube is just another example of the same thing: being "clever" by deliberately pretending that there's no such thing as conversational implicature.

Having your words be literally accurate is not the spark of genius you think it is.

Reply
Authors Have a Responsibility to Communicate Clearly
125
TurnTrout
12d
This is a linkpost for https://turntrout.com/author-responsibility

When a claim is shown to be incorrect, defenders may say that the author was just being “sloppy” and actually meant something else entirely. I argue that this move is not harmless, charitable, or healthy. At best, this attempt at charity reduces an author’s incentive to express themselves clearly – they can clarify later![1] – while burdening the reader with finding the “right” interpretation of the author’s words. At worst, this move is a dishonest defensive tactic which shields the author with the unfalsifiable question of what the author “really” meant.

⚠️ Preemptive clarification

The context for this essay is serious, high-stakes communication: papers, technical blog posts, and tweet threads. In that context, communication is a partnership. A reader has a responsibility to engage in good faith, and an author

...
(Continue Reading – 1572 more words)
Jiro1h20

Bob’s statement 2: “All I really meant was that I had blue pens at my house” is not literally true. For what proposition is that statement being used as evidence?

  1. It's not being used as evidence for anything.

  2. "All I really meant" is a colloquial way of saying "the part relevant to the proposition in question was..." As such, it was in fact truthful.

Reply
Win-Win-Win Ethics—Reconciling Consequentialism, Virtue Ethics and Deontology
3
James Stephen Brown
5h
Richard_Kennaway1h20

"The rules say we must use consequentialism, but good people are deontologists, and virtue ethics is what actually works."

— Eliezer Yudkowsky

"Go three-quarters of the way from deontology to utilitarianism and then stop. You are now in the right place. Stay there at least until you have become a god."

— Eliezer Yudkowsky

Reply
OpenAI Model Differentiation 101
22
Zvi
1d

LLMs can be deeply confusing. Thanks to a commission, today we go back to basics.

How did we get such a wide array of confusingly named and labeled models and modes in ChatGPT? What are they, and when and why would you use each of them for what purposes, and how does this relate to what is available elsewhere? How does this relate to hallucinations, sycophancy and other basic issues, and what are the basic ways of mitigating those issues?

If you already know these basics, you can and should skip this post.

This is a reference, and a guide for the new and the perplexed, until the time comes that they change everything again, presumably with GPT-5.

A Brief History of OpenAI Models and Their Names

Tech companies are notorious for...

(Continue Reading – 3137 more words)
sanxiyn1h10

Ethan Mollick's Using AI Right Now: A Quick Guide from 2025-06 is in the same genre and pretty much says the same thing, but the presentation is a bit different and it may suit you better, so check it out. Naturally it doesn't discuss Grok 4, but it does discuss some things that are missing here.

Reply
1Tim Garnett16h
Is it telling that I had to think for a minute or two and go back to the last section to work out whether "Right now, you have in order of compute used o4-mini, o4-mini-high, o3 and then o3-pro" was in ascending or descending order? (It's ascending.)
against that one rationalist mashal about japanese fifth-columnists
30
Fraser
5h
This is a linkpost for https://frvser.com/posts/absence-of-evidence-is-evidence-of-an-overly-simplistic-world-model.html

The following is a nitpick on an 18-year-old blog post.

This fable is retold a lot. The progenitor of it as a rationalist mashal is probably Yudkowsky's classic sequence article. To adversarially summarize:

  1. It's the beginning of the second world war. The evil governor of California wishes to imprison all Japanese-Americans - suspecting they'll sabotage the war effort or commit espionage.
  2. It is brought to his attention that there is zero evidence of any subversive activities of any kind by Japanese-Americans.
  3. He argues that, rather than exonerating the Japanese-Americans, the lack of evidence convinces him that there is a well organized fifth-column conspiracy that has been strategically avoiding sabotage to lull the population and government into a false sense of security, before striking at the right moment.
  4. However, if evidence of
...
(See More – 732 more words)
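For reference, the standard Bayesian moral of the fable (the reading the linked post is pushing back on) can be checked in a few lines; the numbers below are made up purely for illustration:

```python
# If a real fifth column would make "zero sabotage" less likely than its absence,
# then observing zero sabotage must lower the probability of a fifth column.
prior = 0.5                              # assumed prior P(fifth column)
p_quiet_given_fifth_column = 0.3         # assumed: conspirators lying low
p_quiet_given_no_fifth_column = 0.99     # assumed: nothing to observe anyway

posterior = (p_quiet_given_fifth_column * prior) / (
    p_quiet_given_fifth_column * prior
    + p_quiet_given_no_fifth_column * (1 - prior)
)
print(round(posterior, 3))  # 0.233 < 0.5: the silence is evidence against the conspiracy
```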
RedMan2h21

Point 2 is incorrect.  https://en.m.wikipedia.org/wiki/Niihau_incident

The Niihau incident sparked the popular hysteria that led to internment.

Imagine, if you will, one of the 9/11 hijackers parachuting from the plane before it crashed, asking a random Muslim for help, and then having that Muslim be willing to immediately get himself into shootouts, commit arson, kidnappings, and miscellaneous mayhem.

Then imagine that it was covered in a media environment where the executive branch had been advocating for war for over a decade, and voices which spoke against it we... (read more)

Reply2
2cata3h
Similar: https://www.lesswrong.com/posts/oNBuivAjahuBo9pMW/when-apparently-positive-evidence-can-be-negative-evidence
Thane Ruthenis's Shortform
Thane Ruthenis
Ω 510mo
2Noosphere897h
A key issue here is that computer security is portrayed as way poorer in popular articles than it actually is, because there are some really problematic incentives. A big one is that the hacker mindset is generally more fun to play as a role, since you get to prove that something is possible rather than proving that something is intrinsically difficult or impossible to do; and, importantly, proving security gives journalists no news article and infosec researchers no pay, which is another problematic incentive. Also, people never talk about the entities that didn't get attacked with a computer virus, which means that we have a reverse survivor bias issue here: https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment

And a comment by @anonymousaisafety changed my mind a lot on hardware vulnerabilities/side-channel attacks, as it argues that lots of the hardware vulnerabilities like Rowhammer have insane requirements to actually be used, such that they are basically worthless. Two of the more notable requirements for these hardware vulnerabilities to work are that you need to know exactly what you are trying to attack, in a way that doesn't matter for more algorithmic attacks, and that no RAM scrubbing is being done; and if you want to subvert the ECC RAM, you need to know the exact ECC algorithm. This means side-channel attacks are very much not transferable: attacking one system successfully doesn't let you attack another with the same side-channel attack.

Admittedly, it does require us trusting that he is in fact as knowledgeable as he claims to be, but if we assume he's correct, then I wouldn't be nearly as impressed by side-channel attacks as you are, and in particular this sort of attack should be assumed to basically not work in practice unless there's a lot of evidence of it actually being used to break into real targets/POCs: https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/pivotal-outcomes-and-pi
4Cole Wyeth13h
Yeah, I like this framing. I don’t really know how to make it precise, but I suspect that real life has enough hacks and loopholes that it’s hard to come up with plans that knowably don’t have counterplans which a smarter adversary can find, even if you assume that adversary is only modestly smarter. That’s what makes me doubt that what I called adversarially robust augmentation and distillation actually works in practice. I don’t think I have the frames for thinking about this problem rigorously. 
quetzal_rainbow3h20

The concept of a weird machine is the closest to being useful here, and an important question is "how to check that our system doesn't form any weird machine".

Reply
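For readers who haven't met the term, a loose but well-worn illustration of the flavor of a weird machine is Python's pickle format: an interface that looks like it merely parses data turns out to embed a small programmable machine that runs caller-chosen calls during loading. The snippet below is illustrative only.

```python
# pickle "data" can direct the loader to call arbitrary functions: the
# deserializer is effectively a tiny virtual machine driven by the bytes.
import pickle

class Note:
    def __reduce__(self):
        # Tells the pickler: "to rebuild me, call print(...)".
        return (print, ("unintended computation ran during load",))

blob = pickle.dumps(Note())
pickle.loads(blob)  # prints the message; a hostile blob could call anything
```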
Memory Decoding Journal Club: Binary and analog variation of synapses between cortical pyramidal neurons
2
Devin Ward
3h

Join Us for the Memory Decoding Journal Club! 

A collaboration of the Carboncopies Foundation and BPF Aspirational Neuroscience

This time, we’re diving into a groundbreaking paper:
"Binary and analog variation of synapses between cortical pyramidal neurons"

Authors: Sven Dorkenwald, Nicholas L Turner, Thomas Macrina, Kisuk Lee, Ran Lu, Jingpeng Wu, Agnes L Bodor, Adam A Bleckert, Derrick Brittain, Nico Kemnitz, William M Silversmith, Dodam Ih, Jonathan Zung, Aleksandar Zlateski, Ignacio Tartavull, Szi-Chieh Yu, Sergiy Popovych, William Wong, Manuel Castro, Chris S Jordan, Alyssa M Wilson, Emmanouil Froudarakis, JoAnn Buchanan, Marc M Takeno, Russel Torres, Gayathri Mahalingam, Forrest Collman, Casey M Schneider-Mizell, Daniel J Bumbarger, Yang Li, Lynne Becker, Shelby Suckow, Jacob Reimer, Andreas S Tolias, Nuno Macarico da Costa, R Clay Reid, H Sebastian

Institutions: Princeton Neuroscience Institute, Princeton University, United States; Computer Science Department,...

(See More – 65 more words)
Daniel Kokotajlo's Shortform
Daniel Kokotajlo
Ω 36y
Nullity3h32

I wouldn’t worry too much about these. It’s not at all clear that all the alignment researchers moving to Anthropic is net-negative, and for AI 2027, the people who are actually inspired by it won’t care too much if you’re being dunked on.

Plus, I expect basically every prediction about the near future to be wrong in some major way, so it’s very hard to determine what actions are net negative vs. positive. It seems like your best bet is to do whatever has the most direct positive impact.

Thought this would help, since these worries aren’t productive, and anything you do in the future is likely to lower p(doom). I’m looking forward to whatever you’ll do next.

Reply
15lc6h
Frankly - this is what is going to happen, and your worry is completely deserved. The decision to name your scenario after a "modal" prediction you didn't think would happen with even >50% probability was an absurd communication failure.
3leogao8h
i think the exodus was not literally inevitable, but it would have required a heroic effort to prevent. imo the two biggest causes of the exodus were the board coup and the implosion of superalignment (which was indirectly caused by the coup). my guess is there will be some people who take alignment people less seriously in long timelines because of AI 2027. i would not measure this by how loudly political opponents dunk on alignment people, because they will always find something to dunk on. i think the best way to counteract this is to emphasize the principal component that this whole AI thing is a really big deal, and that there is a very wide range of beliefs in the field, but even "long" timeline worlds are insane as hell compared to what everyone else expects. i'm biased, though, because i think sth like 2035 is a more realistic median world; if i believed AGI was 50% likely to happen by 2029 or something then i might behave very differently
3shanzson10h
I think you are trying your best to have a positive impact, but the thing is that it is quite tricky to put predictions out openly in public. As we know, even perfect predictions made public can completely prevent the predicted thing from actually happening, and even otherwise inaccurate predictions can lead to it actually happening.
[Yesterday] LW-Cologne meetup
07/14/25 Monday Social 7pm-9pm @ Segundo Coffee Lab
If Anyone Builds It, Everyone Dies: A Conversation with Nate Soares and Tim Urban
LessWrong Community Weekend 2025
Daniel Kokotajlo1d5916
Vitalik's Response to AI 2027
> Individuals need to be equipped with locally-running AI that is explicitly loyal to them

In the Race ending of AI 2027, humanity never figures out how to make AIs loyal to anyone. OpenBrain doesn't slow down; they think they've solved the alignment problem, but they haven't. Maybe some academics or misc minor companies in 2028 do additional research and discover e.g. how to make an aligned human-level AGI eventually, but by that point it's too little, too late (and also, their efforts may well be sabotaged by OpenBrain/Agent-5+, e.g. with regulation and distractions).
Joseph Miller2d6240
what makes Claude 3 Opus misaligned
Reading this feels a bit like reading about meditation. It seems interesting and if I work through it, I could eventually understand it fully. But I'd quite like a "secular" summary of this and other thoughts of Janus, for people who don't know what Eternal Tao is, and who want to spend as little time as possible on twitter.
davekasten2d638
Lessons from the Iraq War for AI policy
> I’m kind of confused by why these consequences didn’t hit home earlier.

I'm, I hate to say it, an old man among these parts in many senses; I voted in 2004, and a nontrivial percentage of the Lesswrong crowd wasn't even alive then, and many more certainly not old enough to remember what it was like.  The past is a different country, and 2004 especially so.

First: For whatever reason, it felt really really impossible for Democrats in 2004 to say that they were against the war, or that the administration had lied about WMDs.  At the time, the standard reason why was that you'd get blamed for "not supporting the troops."  But with the light of hindsight, I think what was really going on was that we had gone collectively somewhat insane after 9/11 -- we saw mass civilian death on our TV screens happen in real time; the towers collapsing was just a gut punch.  We thought for several hours on that day that several tens of thousands of people had died in the Twin Towers, before we learned just how many lives had been saved in the evacuation thanks to the sacrifice of so many emergency responders and ordinary people to get most people out.  And we wanted revenge.  We just did.  We lied to ourselves about WMDs and theories of regime change and democracy promotion, but the honest answer was that we'd missed getting bin Laden in Afghanistan (and the early days of that were actually looking quite good!), we already hated Saddam Hussein (who, to be clear, was a monstrous dictator), and we couldn't invade the Saudis without collapsing our own economy.  As Thomas Friedman put it, the message to the Arab world was "Suck on this."

And then we invaded Iraq, and collapsed their army so quickly and toppled their country in a month.  And things didn't start getting bad for months after, and things didn't get truly awful until Bush's second term.  Heck, the Second Battle for Fallujah only started in November 2004.

And so, in late summer 2004, telling the American people that you didn't support the people who were fighting the war we'd chosen to fight, the war that was supposed to get us vengeance and make us feel safe again -- it was just not possible.  You weren't able to point to that much evidence that the war itself was a fundamentally bad idea, other than that some Europeans were mad at us, and we were fucking tired of listening to Europe.  (Yes, I know this makes no sense, they were fighting and dying alongside us in Afghanistan.  We were insane.)

Second: Kerry very nearly won -- indeed, early on in election night 2004, it looked like he was going to!  That's part of why him losing was such a body blow to the Dems and, frankly, part of what opened up a lane for Obama in 2008.  Perhaps part of why he ran it so close was that he avoided taking a stronger stance, honestly.
201Generalized Hangriness: A Standard Rationalist Stance Toward Emotions
johnswentworth
3d
16
492A case for courage, when speaking of AI danger
So8res
5d
118
149So You Think You've Awoken ChatGPT
JustisMills
2d
26
143Lessons from the Iraq War for AI policy
Buck
3d
23
97Vitalik's Response to AI 2027
Daniel Kokotajlo
1d
33
343A deep critique of AI 2027’s bad timeline models
titotal
24d
39
476What We Learned from Briefing 70+ Lawmakers on the Threat from AI
leticiagarcia
2mo
15
542Orienting Toward Wizard Power
johnswentworth
2mo
146
138Why Do Some Language Models Fake Alignment While Others Don't?
Ω
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
4d
Ω
14
357the void
Ω
nostalgebraist
1mo
Ω
103
269Foom & Doom 1: “Brain in a box in a basement”
Ω
Steven Byrnes
9d
Ω
102
185Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Adam Karvonen, Sam Marks
11d
25
92what makes Claude 3 Opus misaligned
janus
2d
12
This is a linkpost for https://nonzerosum.games/reconcilingethics.html

There’s a battle in the field of ethics between three approaches—Consequentialism, Virtue Ethics and Deontology—but this framing is all wrong, because they’re all on the same side. By treating ethics as an adversarial all-or-nothing (zero-sum) debate, we are throwing out a great deal of baby for the sake of very little bathwater.

First of all, some (very basic) definitions.

  • Consequentialism: holds that the morality of an action is determined by its outcomes (or more specifically its expected or intended outcomes) in terms of what we value. Utilitarianism, a prominent form of consequentialism, explicitly formulates this in terms of the increase in utility (happiness) and the avoidance of harm (suffering).
  • Virtue Ethics: holds that the morality of an action is derived from the motivation for that action, is it virtuous or
...
(Continue Reading – 1454 more words)
132
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Ω
Buck
3d
Ω
19
492
A case for courage, when speaking of AI danger
So8res
5d
118