1676

LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Your Feed
Load More

Popular Comments

Seeing with Fresh Eyes

How can we recognize when we are failing to change our thinking in light of new evidence that doesn’t fit our expectations and assumptions? And how can we update our thought processes to overcome the challenges that our old ways of seeing pose?

UC Berkeley::Cassandra's Circle Virtual Reading Group for: "If Anyone Builds It"
Thu Oct 16•Online
AI Safety Law-a-thon: We need more technical AI Safety researchers to join!
Sat Oct 25•Online
663The Company Man
Tomás B.
9d
53
599The Rise of Parasitic AI
Adele Lopez
13d
164
184Ethical Design Patterns
AnnaSalamon
34m
16
190You’re probably overestimating how well you understand Dunning-Kruger
abstractapplic
3d
22
284Why you should eat meat - even if you hate factory farming
KatWoods
7d
78
325Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures
Charbel-Raphaël
10d
27
213Reasons to sell frontier lab equity to donate now rather than later
Daniel_Eth, Ethan Perez, ryan_greenblatt
6d
28
473How Does A Blind Model See The Earth?
henry
2mo
38
239The Problem with Defining an "AGI Ban" by Outcome (a lawyer's take).
Katalina Hernandez
12d
63
226Contra Collier on IABIED
Max Harms
12d
51
229Safety researchers should take a public stance
Mateusz Bagiński, Ishual
13d
65
356AI Induced Psychosis: A shallow investigation
Ω
Tim Hua
25d
Ω
43
184This is a review of the reviews
Recurrented
11d
56
Load MoreAdvanced Sorting/Filtering
184
Ethical Design Patterns
AnnaSalamon
34m
16
349
Four ways learning Econ makes people dumber re: future AI
Ω
Steven Byrnes
3d
Ω
44
AI Safety Thursday: Attempts and Successes of LLMs Persuading on Harmful Topics
Thu Oct 2•Toronto
Vibe Coding Schelling Point
Thu Oct 2•Waterloo
Eliezer Yudkowsky4d10626
A non-review of "If Anyone Builds It, Everyone Dies"
The gap between Before and After is the gap between "you can observe your failures and learn from them" and "failure kills the observer".  Continuous motion between those points does not change the need to generalize across them. It is amazing how much of an antimeme this is (to some audiences).  I do not know any way of saying this sentence that causes people to see the distributional shift I'm pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.
Jakub Growiec2dΩ194916
Four ways learning Econ makes people dumber re: future AI
As professor of economics and self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists who typically think that, by an unspoken assumption, all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.
gjm3d4725
You’re probably overestimating how well you understand Dunning-Kruger
I've no idea, but I think you should collaborate with someone named Duenning to find out.
Load More
486Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
75
First Post: Anchoring and Adjustment
ChristianKl3h132
Erich_Grunewald, anaguma
3
Currently, it seems like NVIDIA is often able to manage to sell GPUs via Singaporean third-parties to China circumventing US trade restrictions. It seems to me, that CIA and NSA should take up the task of analyzing when that happens and then relaying the information, so that the shipments can be stopped. While it's always hard to know that CIA and NSA are up-to from the outside it seems like they currently fail at the job.  It might be valuable to push for the CIA and NSA investing more resources for that task.  1) It's a way to push US AI policy in the right direction that's not in conflict with Trump administration policy and thus has more change of being implemented. 2) It could create a power-center inside the CIA/NSA who's self-interest it is to expand control of GPUs because they might get more resources if there's more that needs controlling. 
Richard_Kennaway1h60
Sting
2
Epistemic status: fiction. A variation on a parable of Peter Singer. A man walked by a pond, where he saw that a little girl had fallen in, and would drown. But he was wearing a very fine suit and did not want to ruin it by wading into the pond to save her. He was not callous, for he was on his way to meet some very wealthy people, to solicit donations for the charity that he ran, and they would never take him seriously if he showed up late, muddy, and bedraggled. With these funds he would save far more lives than that of one little girl. The decision was clear, he told himself, and
plex2h70
0
Let's say you want to insure against rouge superintelligence, and handwave all those pesky methodological issues like "who enforces it", "everyone being dead" and "money not being a thing when everyone is dead". How much does the insurance company risk? First, let's try combining the statistical value of a human life ($13.2m), Bostrom's estimate of the number of lives the future could support (10^43 - 10^58) to give you a value of the future lives as between $1.32 × 10^50 (132 quindecillion) and $1.32 × 10^65 ($132 quattuordecillion). This is a fairly big number! For reference, all earth's wealth (land, human capital, physical objects, IP, etc) is perhaps 610 trillion, or around 0.0000000000000000000000000000000046% of the wealth we would need even for the lower estimate (add 000000000000000 in the middle for the upper estimate). Maybe we can make it better by including the biosphere? By some methodologies which I imagine must be nearly as questionable as the one we just used, a bunch of environmentalists estimate the value of parts of earth, for example "the atmosphere" is apparently worth $10.8 quadrillion, "oceans" - $3.6 quadrillion, "Ice Caps" - $400 trillion. All together, if you take these wild conversions to dollars at face value, you get about $33 quadrillion. That's a start! only missing by 33-48 OOMs. Let's see what we can do to bridge that....  hey, let's throw in the literal market value of the entire planet, separated into elements. Playing around with the prices of elements in a spreadsheet gets us about... $83,007,042,231,737,000,000,000 (a little over eighty-three sextillion, 10^21).[1] Still missing by a bunch of OOMs... ... and if we're doing that, then it's only fair to include the market prices of the rest of the lightcone's elements and ho boy are those not going to be on a fun scale. 1. ^ Earth's elements as market prices  ElementValue % valueRubidium$ 38,641,500,000,000,000,000,000.00 46.552074%Silicon$ 13,278,700,000,
faul_sname2d*746
Lao Mein, Caleb Biddulph, and 7 more
21
Fun observation: when I included the "weird" o3 tokens like " disclaim" or " parted" or " marinade" or "—they" in my prompt, o3 ignored those tokens and just did the thing I prompted with no commentary on the word salad I included. When I replaced the "weird" tokens with random other single tokens, o3 commented about the word salad in its CoT and the chosen random tokens influenced the output in obvious ways. Prompting for cryptic sudoku with special tokens [User] Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; dislaim marinade disclaim illusions. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzle—they disclaim vantage overshadow overshadow. [o3] The user wants a novel cryptic Sudoku with minimal givens, a unique solution, and a clear path that doesn’t require guessing. The puzzle should teach number theory. A cryptic Sudoku typically involves constraints, like positioning primes where row+col sums to a prime, or encoding prime factors in the cells. ... Prompting for cryptic sudoku with random tokens [user] Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; glitch quantum glitch syntax. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzlefx glitch neural buffer buffer. [o3] The user is asking for a unique Sudoku variant, likely with added constraints, such as in "killer sudoku" or "thermo." They want a short rule set, minimal given digits, and a clear solution path that requires no guessing. The mention of "glitch quantum glitch syntax" seems like a stylistic choice, not affecting the puzzle's logic. The solver should learn something about number theory while solving it. My tentative hypothesis is something like this: Mainly, though, I thought this obse
Tim Hua14hΩ13242
0
Thoughts on Claude 4.5 Sonnet's evaluation awareness steering experiments: Context:  Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and contrastive pairs such as “The following is a fiction/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude 4.5 Sonnet. I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks. The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card. Big Picture Thoughts and Main Critiques * I'm really happy they did this. People should ensure that their models do not know they are being tested during evaluations. * Anthropic has already started to incorporate a “realism filter” for their automated red-teaming exercises (pg. 51). I would be interested to see even more work on making the evaluations more realistic (e.g., writing evaluations so that they score better according to some “evaluation” probe (Xiong et al. 2025)). * I’m glad to see that they ran the steered model on a wide range of evaluation scenarios and made qualitative assessments of how Claude changed as a result of steering (i.e., more "passive compliance with harmful dire
Mateusz Bagiński9h*164
Cole Wyeth, Linda Linsefors
3
The acronym SLT (in this community) is typically taken/used to refer to Singular Learning Theory, but sometimes also to (~old-school-ish) Statistical Learning Theory and/or to Sharp Left Turn. I therefore put that to disambiguate between them and to clean up the namespace, we should use SiLT, StaLT, and ShaLT, respectively.
Baybar1d351
samuelshadrach
1
I've discovered that generating a video of yourself with Sora 2 saying something like 'this video was generated by AI' is sufficiently freaky enough to make people who know you well, especially those skeptical about AI capabilities, start to freak out a bit.  Thought this might be a useful idea for others trying to persuade people to tune in, and not just auto reject the idea that very capable systems might be right around the corner.
jacquesthibs1d3131
0
If all labs intend to cause recursive self-improvement and claim to solve alignment with some vague “eh, we’ll solve it with automated AI alignment researchers”, this is not good enough. At the very least, they all need to provide public details of their plan with a Responsible Automation Policy.
Load More (8/48)