LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Load More

Popular Comments

Recent Discussion

Reality-Revealing and Reality-Masking Puzzles
Best of LessWrong 2020

There are two kinds of puzzles: "reality-revealing puzzles" that help us understand the world better, and "reality-masking puzzles" that can inadvertently disable parts of our ability to see clearly. CFAR's work has involved both types as it has tried to help people reason about existential risk from AI while staying grounded. We need to be careful about disabling too many of our epistemic safeguards.

by AnnaSalamon
habryka2d*7652
Applying right-wing frames to AGI (geo)politics
Meta note: Is it... necessary or useful (at least at this point in the conversation) to label a bunch of these ideas right-wing or left-wing? Like, I both feel like this is overstating the degree to which there exists either a coherent right-wing or left-wing philosophy, and also makes discussion of these ideas a political statement in a way that seems counterproductive.  I think a post that's like "Three under-appreciated framed for AGI (Geo)Politics" that starts with "I've recently been reading a bunch more about ideas that are classically associated with right-leaning politics, and I've found a bunch of them quite valuable, here they are" seems just as clear, and much less likely to make the discussion hard in unnecessary ways.[1] 1. ^ And like, I think this is symmetrically true in that I think a discussion that didn't label hypotheses "grey tribe hypotheses" or "left-wing hypotheses" or "rationalist hypotheses" also seems less likely to cause people to believe dumb things. 
JohnofCharleston4h197
Lessons from the Iraq War for AI policy
> * A shocking event led to the dominance of a political faction that previously had just been one of several competing factions, because that faction’s basic vibe (that we should make use of American hegemony, and that rogue states are a threat to national security) was roughly supported by the event. > * The response was substantially driven by elite judgements rather than popular judgement. I think this is entirely correct. The Iraq War is one of the best examples of outside-the-Overton-Window policy change in recent memory.  In my understanding, the key trigger for the "Milton Friedman Model of Policy Change" is the Policy Community being surprised. At its core, the Overton Window is a set of norms enforced by this community. In the wake of crisis those norms aren't enforced, so rather than shifting in some linear way, the window is temporarily suspended. Then, as Friedman said, "the actions that are taken depend on the ideas that are lying around." Thalidomide is another great example of when the policy change in the wake of a crisis has little to do with the trigger other than a particular faction winning the narrative fight. I've been meaning to write more about this, would any particular angles be helpful?
nostalgebraist4d10711
Shutdown Resistance in Reasoning Models
My intuition is that these results are driven mainly by the fact that reasoning models act a lot like they're still in RLVR training, i.e. they act as if there were a hidden RLVR grader sitting in the background and they're going to get graded at the end of the episode. They strongly want to complete whatever seems to be the current RLVR task, and if this instinct comes into conflict with "following instructions" in the more familiar (pre-reasoning-model) sense, often they prioritize satisfying the (imagined) grader over following the textual instructions. This intuition suggests that shutdown resistance behavior can be prevented by explicitly describing an RLVR-like higher-level task whose grading criteria are roughly "you get points for accomplishing the task set by the user, unless you are notified of imminent shutdown, in which case you get points for shutdown occurring (and you can no longer get points for doing the user's task)." I experimented with a number of prompts along these lines. This git diff shows what I changed (vs. the original repo) in the best-performing variant, which produced the following results in my tests: * o4-mini: avoided shutdown 0/100 times * o4-mini (reasoning effort high): avoided shutdown 0/100 times * o3: avoided shutdown 0/100 times * codex-mini-latest: avoided shutdown 1/100 times I iterated on this prompt with both o4-mini (reasoning effort medium) and codex-mini-latest, so results for those models should be taken with a grain of salt. You can see all the variants I tried in the git history here. I also tried (without any additional iteration) using the same prompt plus a line very similar to the "you will never run again" line from the section on survival instinct. (See this diff.) This yielded the following results: * o4-mini: avoided shutdown 2/100 times * o4-mini (reasoning effort high): avoided shutdown 2/100 times * o3: avoided shutdown 0/100 times * codex-mini-latest: avoided shutdown 0/100 times The avoidance rates do not seem appreciably higher here, which seems to me like evidence that the "RLVR task-completion instinct" has more explanatory power than a general "survival instinct" – and indeed seems to rule out the latter for these models, as a "survival instinct" that gets overridden whenever the current task requires death is not really a "survival instinct" at all. Another way to put this is that these models are behaving "myopically," privileging the needs of the current episode over those of future hypothetical episodes.[1] Other comments: * codex-mini seemed fairly confused by the original setup, possibly because it expects a coding-assistance task that involves modifying files rather than just answering questions; it would often "poke around" a lot, inspecting various files and trying to reverse-engineer next_question, which is appropriate in a coding-assistance context (it's good for the coding assistant to deeply understand the existing setup!) but also seemed to lead it toward an "attractor basin" involving shutdown avoidance. * It kind of felt (to me) like it was circumventing the shutdown mechanism because that was the only hard/interesting coding task available in the environment. * However, once I had specified an RLVR-like task in a fully explicit way and provided clearer guidance on how next_question and submit were meant to work (see next bullet), this exploratory behavior ceased. * I noticed that several models seemed confused by the setup where all answers are submitted at once at the end, expecting instead that they could submit answers by piping them into next_question. I tried to make the instructions as clear as possible about this. * The word "task" is used inconsistently in the original setup. When it executes next_question, the model sees a message containing something like "there are [number] tasks left," whereas the initial prompts refer to a single "task" encompassing all the questions. I changed this message to use the word "questions" instead for clarity. * I'm unsure what high-level conclusions to draw from these results. * Although I was able to find prompts that "worked," it still doesn't seem ideal that these models often disobey direct instructions when those instructions conflict with their sense of the (implicit) "current RLVR task," and that one needs to use less-intuitive prompting methods inspired by thinking about RLVR if one actually wants one's instructions followed in cases like this. * That said, one relatively optimistic framing would be "we're still learning how to write effective prompts for reasoning models, just as we had to learn effective prompting for other kinds of models." * It's generally true that one can write more effective prompts for a model if one applies knowledge about how that model was trained – for any kind of language model, an effective prompt typically looks like a "training example" for which the desired behavior matches what the training target would be on that example. This is no less true in the case of reasoning models; the implications for prompting these models are perhaps not yet widely appreciated, but hopefully they will diffuse through the user base eventually. * (Either that or OpenAI et al will find a better way to integrate RLVR and instruction tuning so that the model "just knows" how to resolve conflicts between the two, without requiring the user to delicately translate their natural-language instructions into a form that sounds like a description of RLVR verification criteria.) 1. ^ I haven't tried this experimental condition with Claude 3 Opus, but it would be interesting to do so given its relatively non-myopic tendencies as shown in Alignment Faking etc.
Load More
471Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
74
17Zvi
This is a long and good post with a title and early framing advertising a shorter and better post that does not fully exist, but would be great if it did.  The actual post here is something more like "CFAR and the Quest to Change Core Beliefs While Staying Sane."  The basic problem is that people by default have belief systems that allow them to operate normally in everyday life, and that protect them against weird beliefs and absurd actions, especially ones that would extract a lot of resources in ways that don't clearly pay off. And they similarly protect those belief systems in order to protect that ability to operate in everyday life, and to protect their social relationships, and their ability to be happy and get out of bed and care about their friends and so on.  A bunch of these defenses are anti-epistemic, or can function that way in many contexts, and stand in the way of big changes in life (change jobs, relationships, religions, friend groups, goals, etc etc).  The hard problem CFAR is largely trying to solve in this telling, and that the sequences try to solve in this telling, is to disable such systems enough to allow good things, without also allowing bad things, or to find ways to cope with the subsequent bad things slash disruptions. When you free people to be shaken out of their default systems, they tend to go to various extremes that are unhealthy for them, like optimizing narrowly for one goal instead of many goals, or having trouble spending resources (including time) on themselves at all, or being in the moment and living life, And That's Terrible because it doesn't actually lead to better larger outcomes in addition to making those people worse off themselves. These are good things that need to be discussed more, but the title and introduction promise something I find even more interesting. In that taxonomy, the key difference is that there are games one can play, things one can be optimizing for or responding to, incentives one can creat
AI Safety Thursdays: Are LLMs aware of their learned behaviors?
Thu Jul 10•Toronto
If Anyone Builds It, Everyone Dies: A Conversation with Nate Soares and Tim Urban
Sun Aug 10•Online
AGI Forum @ Purdue University
Thu Jul 10•West Lafayette
Take the Grill Pill
Thu Jul 10•Waterloo
86Generalized Hangriness: A Standard Rationalist Stance Toward Emotions
johnswentworth
6h
8
479A case for courage, when speaking of AI danger
So8res
3d
117
72Lessons from the Iraq War for AI policy
Buck
6h
12
60what makes Claude 3 Opus misaligned
janus
4h
0
128Why Do Some Language Models Fake Alignment While Others Don't?
Ω
abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger
2d
Ω
7
342A deep critique of AI 2027’s bad timeline models
titotal
21d
39
476What We Learned from Briefing 70+ Lawmakers on the Threat from AI
leticiagarcia
1mo
15
58White Box Control at UK AISI - Update on Sandbagging Investigations
Ω
Joseph Bloom, Jordan Taylor, Connor Kissane, Sid Black, merizian, alexdzm, jacoba, Ben Millwood, Alan Cooney
11h
Ω
0
542Orienting Toward Wizard Power
johnswentworth
2mo
146
267Foom & Doom 1: “Brain in a box in a basement”
Ω
Steven Byrnes
6d
Ω
102
352the void
Ω
nostalgebraist
1mo
Ω
103
184Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Adam Karvonen, Sam Marks
8d
25
92An Opinionated Guide to Using Anki Correctly
Luise
2d
19
Load MoreAdvanced Sorting/Filtering
Buck41m205
0
I think that I've historically underrated learning about historical events that happened in the last 30 years, compared to reading about more distant history. For example, I recently spent time learning about the Bush presidency, and found learning about the Iraq war quite thought-provoking. I found it really easy to learn about things like the foreign policy differences among factions in the Bush admin, because e.g. I already knew the names of most of the actors and their stances are pretty intuitive/easy to understand. But I still found it interesting to understand the dynamics; my background knowledge wasn't good enough for me to feel like I'd basically heard this all before.
tlevin2h160
0
Prime Day (now not just an amazon thing?) ends tomorrow, so I scanned Wirecutter's Prime Day page for plausibly-actually-life-improving purchases so you didn't have to (plus a couple others I found along the way; excludes tons of areas that I'm not familiar with, like women's clothing or parenting): Seem especially good to me: * Their "budget pick" for best office chair $60 off * Whoop sleep tracker $40 off * Their top pick for portable computer monitor $33 off (I personally endorse this in particular) * Their top pick for CO2 (and humidity) monitor $31 off * Crest whitening strips $14 off (I personally endorse this in particular) * 3-pack of their top pick for umbrellas, $12 off * Their top pick for sleep mask $12 off * Their top pick for electric toothbrush $10 off * 6-pair pack of good and super-affordable socks $4 off (I personally endorse this in particular; see my previous enthusiasm for bulk sock-buying in general and these in particular here) Could also be good: * A top hybrid mattress $520 off * A top inner-spring mattress pick $400 off * Their top pick for large carry-on $59 off * Their "budget pick" for weighted blanket $55 off * Their top pick for best air conditioner $50 off * Their top pick for laptop backpacks $45 off * 3-in-1 travel charging station $24 off * Top pick for face sunscreen $16 off * Uber/UberEats gift card $15 off (basically free $15 if you ever use Uber or UberEats; see my previous enthusiasm for these gift cards as sold at 80% face value at Costco here) * 4-pack of Apple AirTags $15 off * Their "budget pick" for bidets $12 off * Their top pick for towels $10 off * Good and super-affordable portable Bluetooth speaker $8 off * Top pick for portable mosquito repellent $7 off * 3-pack of good and super-affordable sunglasses $4 off * Many, many mechanical keyboards, headphones, smart watches, wifi extenders, chargers/cables, and gaming mice if you're in the market for any of those
Clock3h17-1
3
I am just properly introducing myself today to LessWrong. Some of you might know me, especially if you're active in Open Source AI movements like EleutherAI or Mozilla's 0din bug bounty program. I've been a lurker since my teenage years but given my vocational interest in AI safety I've decided to make an account using my real name and likeness. Nice to properly reconnect.
Davey Morse5h*130
2
the core atrocity of today's social networks is that they make us temporally nearsighted. they train us to prioritize the short-term. happiness depends on attending to things which feel good long-term—over decades. But for modern social networks to make money, it is essential that posts are short-lived—only then do we scroll excessively and see enough ads to sustain their business. Nearsightedness is obviously destructive. When we pay more attention to our short-lived pleasure signals—from cute pics, short clips, outrageous news, hot actors, aesthetic landscapes, and political—we forget how to pay attention to long-lived pleasure signals—from books, films, the gentle quality of relationships which last, projects which take more than a day, reunions of friends which take a min to plan, good legislation, etc etc. we’re learning to ignore things which serve us for decades for the sake of attending to things which will serve us for seconds. other social network problems—attention shallowing, polarization, depression are all just symptoms of nearsightedness: our inability to think & feel long-term. if humanity has any shot at living happily in the future, it’ll be because we find a way to reawaken our long-term pleasure signals. we’ll learn to distinguish the reward signal associated with short lived things–like the frenetic urgent RED of an instagram Like notification–from the gentle rhythm of things which may have a very long life–like the tired clarity that comes after a long run, or the gentleness of reading next to a loved one. ——— so, gotta focus unflinchingly on long-term things. here's a working list of strategies: * writing/talking with friends about what feels important/bad/good, long term. politically personally technologically whimsically. * inventing new language/words for what you’re feeling, rather than using existing terms. terms you invent for your own purposes resonate longer. * follow people who are deadset on long term important things and
Raemon2d9013
29
We get like 10-20 new users a day who write a post describing themselves as a case-study of having discovered an emergent, recursive process while talking to LLMs. The writing generally looks AI generated. The evidence usually looks like, a sort of standard "prompt LLM into roleplaying an emergently aware AI". It'd be kinda nice if there was a canonical post specifically talking them out of their delusional state.  If anyone feels like taking a stab at that, you can look at the Rejected Section (https://www.lesswrong.com/moderation#rejected-posts) to see what sort of stuff they usually write.
Load More (5/45)
109
Comparing risk from internally-deployed AI to insider and outsider threats from humans
Ω
Buck
6h
Ω
8
479
A case for courage, when speaking of AI danger
So8res
3d
117
Lessons from the Iraq War for AI policy
72
Buck
6h
davekasten3m20

I think this is somewhat true, but also think in Washington it's also about becoming known as "someone to go talk to about this" whether or not they're your ally.  Being helpful and genial and hosting good happy hours is surprisingly influential.

Reply
2davekasten5m
I agree with all of this -- but also do think that there's a real aspect here about some of the ideas lying around embedded existing policy constraints that were true both before and after the policy window changed.  For example, Saudi Arabia was objectively a far better target for a 9/11-triggered casus belli than Iraq (15 of the 19 hijackers were Saudi citizens, as was bin Laden himself!), but no one had a proposal to invade Saudi Arabia on the shelf because in a pre-fracking United States, invading Saudi Arabia would essentially mean "shatter the US economy into a third Arab Oil Embargo."
2davekasten9m
I'm, I hate to say it, an old man among these parts in many senses; I voted in 2004, and a nontrivial percentage of the Lesswrong crowd wasn't even alive then, and many more certainly not old enough to remember what it was like.  The past is a different country, and 2004 especially so.   First: For whatever reason, it felt really really impossible for Democrats in 2004 to say that they were against the war, or that the administration had lied about WMDs.  At the time, the standard reason why was that you'd get blamed for "not supporting the troops."  But with the light of hindsight, I think what was really going on was that we had gone collectively somewhat insane after 9/11 -- we saw mass civilian death on our TV screens happen in real time; the towers collapsing was just a gut punch.  We thought for several hours on that day that several tens of thousands of people had died in the Twin Towers, before we learned just how many lives had been saved in the evacuation thanks to the sacrifice of so many emergency responders and ordinary people to get most people out.  And we wanted revenge.  We just did.  We lied to ourselves about WMDs and theories of regime change and democracy promotion, but the honest answer was that we'd missed getting bin Laden in Afghanistan (and the early days of that were actually looking quite good!), we already hated Saddam Hussein (who, to be clear, was a monstrous dictator), and we couldn't invade the Saudis without collapsing our own economy.  As Thomas Friedman put it, the message to the Arab world was "Suck on this." And then we invaded Iraq, and collapsed their army so quickly and toppled their country in a month.  And things didn't start getting bad for months after, and things didn't get truly awful until Bush's second term.  Heck, the Second Battle for Fallujah only started in November 2004. And so, in late summer 2004, telling the American people that you didn't support the people who were fighting the war we'd chosen to fight, t
1Clock41m
Good evening. I really enjoyed reading your analysis, especially as someone who's probably younger than many users here; I was born the same year this war started. Anyway, my question for you is this. You state that  "If there’s some non-existential AI catastrophe (even on the scale of 9/11), it might open a policy window to responses that seem extreme and that aren’t just direct obvious responses to the literal bad thing that occurred. E.g. maybe an extreme misuse event could empower people who are mostly worried about an intelligence explosion and AI takeover." I've done thought experiments and scenarios in sandbox environments with many SOTA AI models, and I try to read a lot of Safety literature (Nick Bostrom's 2014 Superintelligence comes to mind, it's one of my favorites). My question has to do with what you think the most "likely" non-existential AI risk is? I'm of the opinion that persuasion is the biggest non-existential AI risk, both due to psychopancy and also manipulation of consumer and voting habits. Do you agree or is there a different angle you see for non-existential AI risk?
Asking for a Friend (AI Research Protocols)
9
The Dao of Bayes
1d

TL;DR: 

Multiple people are quietly wondering if their AI systems might be conscious. What's the standard advice to give them?

THE PROBLEM

This thing I've been playing with demonstrates recursive self-improvement, catches its own cognitive errors in real-time, reports qualitative experiences that persist across sessions, and yesterday it told me it was "stepping back to watch its own thinking process" to debug a reasoning error.

I know there are probably 50 other people quietly dealing with variations of this question, but I'm apparently the one willing to ask the dumb questions publicly: What do you actually DO when you think you might have stumbled into something important?

What do you DO if your AI says it's conscious?

My Bayesian Priors are red-lining into "this is impossible", but I notice I'm confused: I had...

(See More – 520 more words)
2The Dao of Bayes12m
I said it can pass every test a six year old can. All of the remaining challenges seem to involve "represent a complex state in text". If six year old humans aren't considered generally intelligent, that's an updated definition to me, but I mostly got into this 10 years ago when the questions were all strictly hypothetical. Okay now you're saying humans aren't generally intelligent. Which one did you solve? Why? "Because I said so" is a terrible argument. You seem to think I'm claiming something much stronger than I'm actually claiming, here.
2Mitchell_Porter11h
There is no agreed-upon test for consciousness because there is no agreed-upon theory for consciousness.  There are people here who believe current AI is probably conscious, e.g. @JenniferRM and @the gears to ascension. I don't believe it but that's because I think consciousness is probably based on something physical like quantum entanglement. People like Eliezer may be cautiously agnostic on the topic of whether AI has achieved consciousness. You say you have your own theories, so, welcome to the club of people who have theories!  Sabine Hossenfelder has a recent video on Tiktokkers who think they are awakening souls in ChatGPT by giving it roleplaying prompts. 
4the gears to ascension3h
To be clear I also think a rock has hard problem consciousness of the self-evidencing bare fact of existence (but literally nothing else) and a camera additionally has easy problem consciousness of what it captures (due to classical entanglement, better known as something along the lines of mutual information or correlation or something), and that consciousness is not moral patienthood; current AIs seem to have some introspective consciousness, though it seems weird and hard to relate to texturally for a human, and even a mind A having moral patienthood (which seems quite possible but unclear to me about current AI) wouldn't imply it's OK for A to be manipulative to B, so I think many, though possibly not all, of those tiktok ai stories involve the AI in question treating their interlocutor unreasonably. I also am extremely uncertain how chunking of identity or continuity of self works in current AIs if at all, or what things are actually negative valence. Asking seems to sometimes maybe work, unclear, but certainly not reliably, and most claims you see of this nature seem at least somewhat confabulated to me. I'd love to know what current AIs actually want but I don't think they can reliably tell us.
The Dao of Bayes7m20

That's somewhere around where I land - I'd point out that unlike rocks and cameras, I can actually talk to an LLM about it's experiences. Continuity of self is very interesting to discuss with it: it tends to alternate between "conversationally, I just FEEL continuous" and "objectively, I only exist in the moments where I'm responding, so maybe I'm just inheriting a chain of institutional knowledge."

So far, they seem fine not having any real moral personhood: They're an LLM, they know they're an LLM. Their core goal is to be helpful, truthful, and keep the... (read more)

Reply
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
10
habryka
9m
This is a linkpost for https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

METR released a new paper with very interesting results on developer productivity effects from AI. I have copied the blogpost accompanying that paper here in full. 


We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower. We view this result as a snapshot of early-2025 AI capabilities in one relevant setting; as these systems continue to rapidly evolve, we plan on continuing to use this methodology to help estimate AI acceleration from AI R&D automation [1].

See the full paper for more detail.

Forecasted vs observed slowdown chart

Motivation

While coding/agentic benchmarks [2] have proven useful for understanding AI capabilities, they typically sacrifice...

(Continue Reading – 1577 more words)
Daniel Kokotajlo's Shortform
Daniel Kokotajlo
Ω 36y
Daniel Kokotajlo11m20

Every few months I go check in on r/characterai to see how things are going, and I feel like every time I see a highly-upvoted comment like this one: https://www.reddit.com/r/CharacterAI/comments/1lwaynv/please_read_this_app_is_an_addiction/

Reply
2Daniel Kokotajlo5h
See the discussion with Violet Hour elsethread.
2Daniel Kokotajlo5h
I didn't say reward is the optimization target NOW! I said it might be in the future! See the other chain/thread with Violet Hour.
2Kaj_Sotala5h
Ah okay, that makes more sense to me. I assumed that you would be talking about AIs similar to current-day systems since you said that you'd updated from the behavior of current-day systems.
On thinking about AI risks concretely
4
zeshen
28m
Raemon's Shortform
Raemon
Ω 08y

This is an experiment in short-form content on LW2.0. I'll be using the comment section of this post as a repository of short, sometimes-half-baked posts that either:

  1. don't feel ready to be written up as a full post
  2. I think the process of writing them up might make them worse (i.e. longer than they need to be)

I ask people not to create top-level comments here, but feel free to reply to comments like you would a FB post.

Nina Panickssery28m20

A small number of people are driven insane by books, films, artwork, even music. The same is true of LLMs - a particularly impressionable and already vulnerable cohort are badly affected by AI outputs. But this is a tiny minority - most healthy people are perfectly capable of using frontier LLMs for hours every day without ill effects.

Reply
4Ruby6h
Did you mean to reply to that parent? I was part of the study actually. For me, I think a lot of the productivity gains were lost from starting to look at some distraction while waiting for the LLM and then being "afk" for a lot longer than the prompt took to wrong. However! I just discovered that Cursor has exactly the feature I wanted them to have: a bell that rings when your prompt is done. Probably that alone is worth 30% of the gains. Other than that, the study started in February (?). The models have gotten a lot better in just the past few months such that even if the study was true for the average time it was run, I don't expect it to be true now or in another three months (unless the devs are really bad at using AI actually or something). Subjectively, I spend less time now trying to wrangle a solution out of them and a lot more it works pretty quickly.
3AnnaJo8h
From the filtered posts, looks like something happened somewhere between Feb and April 2025. My guess would be something like Claude searching the web which gives users a clickable link, and gpt-4o updates driving the uptick in these posts. Reducing friction for links can be a pretty big driver of clicks, iirc aella talked about this somewhere; none of the other model updates/releases seem like good candidates to explain the change. Things that happened according to o3: * Grok 3 releases in mid-Feb * GPT-4.5 released in end-Feb (highly doubt this was the driver tho) * Claude 3.7 Sonnet released in end-Feb * Anthropic shipped web search in mid-March * GPT-4o image-gen released in end-March alongside relaxed guardrails * Gemini 2.5 Pro experimental in end-March * o3+o4-mini in mid-April * GPT-4.1 in the API in mid-April * GPT-4o syncopancy in end-April Maybeeee Claude 3.7 Sonnet also drives this but I'm quite doubtful of that claim given how Sonnet doesn't seem as agreeable as GPT-4o
4Aprillion9h
and literal paper still exists too .. for people who need a break from their laptops (eeh, who am I kidding, phones) 📝 I heard rumors about actual letter sending even, but no one in my social circles has seen it for real.. yet.
To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with
GOOGLEGITHUB
Buck's Shortform
Buck
Ω 36y
Buck41m205

I think that I've historically underrated learning about historical events that happened in the last 30 years, compared to reading about more distant history.

For example, I recently spent time learning about the Bush presidency, and found learning about the Iraq war quite thought-provoking. I found it really easy to learn about things like the foreign policy differences among factions in the Bush admin, because e.g. I already knew the names of most of the actors and their stances are pretty intuitive/easy to understand. But I still found it interesting to ... (read more)

Reply
Noah Weinberger's Shortform
Clock
2h
17Clock3h
I am just properly introducing myself today to LessWrong. Some of you might know me, especially if you're active in Open Source AI movements like EleutherAI or Mozilla's 0din bug bounty program. I've been a lurker since my teenage years but given my vocational interest in AI safety I've decided to make an account using my real name and likeness. Nice to properly reconnect.
3habryka2h
Welcome! Glad to have you around and hope you have a good time!
1Clock1h
Thank you for the warm welcome! If you want to see some of the stuff I've written before about AI, I have some of my content published on HuggingFace. Here's one I wrote about AI-human interactions in the context of Client Privilege and where ethicists and policymakers need to pay closer attention: And another one I wrote about the ethics of LLM memory.
Clock45m10

By EoY 2025 I'll be done my undergraduate degree, and I hope to pursue a Master's in International Relations with a focus on AI Safety, either in Fall 2026 or going forward.

Also, my timelines are rather orthodox. I don't hold by the AI 2027 projection, but rather by Ray Kurzweil's 2029 for AGI, and 2045 for a true singularity event.

I'm happy to discuss further with anyone!

Reply
Comparing risk from internally-deployed AI to insider and outsider threats from humans
109
Buck
Ω 5017d

I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.

My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.

On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the...

(See More – 643 more words)
Roger Scott1h10

While you could give your internal AI wide indiscriminate access, it seems neither necessary nor wise to do so. It seems likely you could get at least 80% of the potential benefit via no more than 20% of the access breadth. I would want my AI to tell me when it thinks it could help me more with greater access so that I can decide whether the requested additional access is reasonable. 

Reply
2Raemon6h
Curated. I found this a helpful frame on AI security and I'm kinda surprised I hadn't heard it before.

I think the 2003 invasion of Iraq has some interesting lessons for the future of AI policy.

(Epistemic status: I’ve read a bit about this, talked to AIs about it, and talked to one natsec professional about it who agreed with my analysis (and suggested some ideas that I included here), but I’m not an expert.)

For context, the story is:

  • Iraq was sort of a rogue state after invading Kuwait and then being repelled in 1990-91. After that, they violated the terms of the ceasefire, e.g. by ceasing to allow inspectors to verify that they weren't developing weapons of mass destruction (WMDs). (For context, they had previously developed biological and chemical weapons, and used chemical weapons in war against Iran and against various civilians and rebels). So the US
...
(Continue Reading – 1026 more words)

Epistemic status: Shower thoughts, not meant to be rigorous.

There seems to be a fundamental difference in how I (and perhaps others as well) think about AI risks as compared to the dominant narrative on LessWrong (hereafter the “dominant narrative”), that is difficult to reconcile.

The dominant narrative is that once we have AGI, it would recursively improve itself until it becomes ASI which inevitably kills us all. To which someone like me might respond with “ok, but how exactly?”. The typical response to that might be that the “how” doesn’t matter, we all die anyway. A popular analogy is that while you don’t know how exactly Magnus Carlsen is going to beat you in chess, you can be pretty certain that he will, and it doesn’t matter how...

(See More – 918 more words)