384

LESSWRONG
LW

HomeAll PostsConceptsLibrary
Best of LessWrong
Sequence Highlights
Rationality: A-Z
The Codex
HPMOR
Community Events
Subscribe (RSS/Email)
LW the Album
Leaderboard
About
FAQ
Customize
Load More

Quick Takes

Your Feed
Load More
Intro to Naturalism

"Knowing the territory takes patient and direct observation." There’s a kind of thinking that happens when a person moves quickly and relies on their built-up structures of thought and perception. A different kind of thing can happen when a person steps back and brings those very structures into view rather than standing atop them. 

495Welcome to LessWrong!
Ruby, Raemon, RobertM, habryka
6y
76
Finding Balance & Opportunity in the Holiday Flux [free public workshop]
Sun Nov 23•Online
Berkeley Solstice Weekend
Fri Dec 5•Berkeley
First Post: Intro to Naturalism: Orientation
11/17/25 Monday Social 7pm-9pm @ Segundo Coffee Lab
Tue Nov 18•Houston
OxRat November Pub Social
Wed Nov 19•Oxfordshire
265
Paranoia: A Beginner's Guide
habryka
2d
54
56
Human Values ≠ Goodness
johnswentworth
5d
68
44Solstice Season 2025: Ritual Roundup & Megameetups
Raemon
10d
8
265Paranoia: A Beginner's Guide
habryka
2d
54
97Where is the Capital? An Overview
johnswentworth
15h
6
347Legible vs. Illegible AI Safety Problems
Ω
Wei Dai
8d
Ω
93
1127 Vicious Vices of Rationalists
Ben Pace
1d
12
749The Company Man
Tomás B.
2mo
70
310I ate bear fat with honey and salt flakes, to prove a point
aggliu
14d
51
695The Rise of Parasitic AI
Adele Lopez
2mo
178
144Everyone has a plan until they get lied to the face
Screwtape
3d
23
94Put numbers on stuff, all the time, otherwise scope insensitivity will eat you
habryka
1d
3
303Why I Transitioned: A Case Study
Fiora Sunshine
16d
56
203Unexpected Things that are People
Ben Goldhaber
9d
10
148Please, Don't Roll Your Own Metaethics
Ω
Wei Dai
5d
Ω
58
Load MoreAdvanced Sorting/Filtering
Simon Lermen12h467
Jozdien, Daniel Tan, and 3 more
6
What's going on with MATS recruitment? MATS scholars have gotten much better over time according to statistics like mentor feedback, CodeSignal scores and acceptance rate. However, some people don't think this is true and believe MATS scholars have actually gotten worse. So where are they coming from? I might have a special view on MATS applications since I did MATS 4.0 and 8.0. I think in both cohorts, the heavily x-risk AGI-pilled participants were more of an exception than the rule. "at the end of a MATS program half of the people couldn't really tell you why AI might be an existential risk at all." - Oliver Habryka I think this is sadly somewhat true, I talked with some people in 8.0 who didn't seem to have any particular concern with AI existential risk or seemingly never really thought about that. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point during MATS 8.0 about Eliezer's new book and a significant minority of students seemed to have pre-ordered Eliezer's book, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk. I think I met some excellent people at MATS 8.0 but would not say they are stronger than 4.0, my guess is that quality went down slightly. I remember in 4.0 a few people that impressed me quite a lot, which I saw less in 8.0. (4.0 had more very incompetent people though). Suggestions for recruitment This might also apply for other Safety Fellowships. Better metrics: My guess is that the recruitment process might need another variable to measure rather than academics/coding/ml experience. The kind of thing that Tim Hua (8.0 scholar) has who created an AI psychosis bench. Maybe something like LessWrong karma but harder to Goodhart. More explicit messaging: Also it seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. Might put off some people and perhaps that sh
jacquesthibs1d*6115
julius vidal, Jonas Hallgren, and 2 more
6
Habryka responding to Ryan Kidd: One thing to note about the first two MATS cohorts is that they occurred before the FTX crash (and pre-ChatGPT!). [It may have been a lot easier to imagine being an independent researcher at that time because FTX money would have allowed this and we hadn’t been sucked into the LLM vortex at this point.] I recall when I was in MATS 2, AI safety orgs were very limited, and I felt that there was a stronger bias towards becoming an independent researcher. Because of this, I think most scholars were not optimizing for ML engineering ability (or even publishing papers!), but were significantly more focused on understanding the core of alignment. It felt like very few of us had aspirations of joining an AGI lab (though a few of them did end up there, such as Sam Marks; I'm not sure what his aspirations were). For this reason, I believe many of our trajectories diverged from those of the later MATS cohorts (my guess is that many MATS fellows are still highly competent, but in different ways; ways that are more measurable). And likely in part due to me being out of the loop for the later cohorts, most of the people whom I think of when I ask myself, "which alignment researchers seem to understand the core problems in alignment and have not over-indexed on LLMs", I think of mostly people in those first two cohorts. ---------------------------------------- On a personal note, I never ended up applying to any AGI lab and have been trying to have the highest impact I can from outside of the lab. I also avoided research directions I felt there would be extreme incentives for new researchers to explore (namely, mech interp, which I stopped working on in February 2023 after realizing it would no longer be neglected and eventually companies like Anthropic would hire aspiring mech interp researchers). Unfortunately, I've personally felt disappointed with my progress over the years. Though I think it's obviously harder to have an impact if you ar
Wei Dai19h*2919
MondSemmel, dr_s, and 4 more
19
Having finally experienced the LW author moderation system firsthand by being banned from an author's posts, I want to make two arguments against it that may have been overlooked: the heavy psychological cost inflicted on a commenter like me, and a structural reason why the site admins are likely to underweight this harm and its downstream consequences. (Edit: To prevent a possible misunderstanding, this is not meant to be a complaint about Tsvi, but about the LW system. I understand that he was just doing what he thought the LW system expected him to do. I'm actually kind of grateful to Tsvi to let me understand viscerally what it feels like to be in this situation.) First, the experience of being moderated by an opponent in a debate inflicts at least the following negative feelings: 1. Unfairness. The author is not a neutral arbiter; they are a participant in the conflict. Their decision to moderate is inherently tied to their desire to defend their argument and protect their ego and status. In a fundamentally symmetric disagreement, the system places you at a profound disadvantage for reasons having nothing to do with the immediate situation. To a first approximation, they are as likely as you to be biased, so why do they get to be the judge? 2. Confusion. Consider the commenters who are also authors and manage their own threads through engagement, patience, tolerance, and a healthy dose of self-doubt. They rarely feel a need or desire to go beyond argumentation and voting (edit: at least on a platform like LW with mods pre-filtering users for suitability), so when they are deleted or banned, it creates a sense of bewilderment as to what they could have possibly done to deserve it. 3. Alienation. The feeling of being powerless to change the system, because so few people are like you, even in a community of people closest to you on Earth in ways of thinking. That you’re on an alien planet, or a mistake theorist surrounded by conflict theorists, with disengag
eggsyntax4d*13749
Jozdien, Katalina Hernandez, and 1 more
7
THE BRIEFING A conference room. POTUS, JOINT CHIEFS, OPENAI RESEARCHER, ANTHROPIC RESEARCHER, and MILITARY ML LEAD are seated around a table.   JOINT CHIEFS: So. Horizon time is up to 9 hours. We've started turning some drone control over to your models. Where are we on alignment? OPENAI RESEARCHER: We've made significant progress on post-training stability. MILITARY ML LEAD: Good. Walk us through the architecture improvements. Attention mechanism modifications? Novel loss functions? ANTHROPIC RESEARCHER: We've refined our approach to inoculation prompting. MILITARY ML LEAD: Inoculation prompting? OPENAI RESEARCHER: During training, we prepend instructions that deliberately elicit undesirable behaviors. This makes the model less likely to exhibit those behaviors at deployment time. Silence. POTUS: You tell it to do the bad thing so it won't do the bad thing. ANTHROPIC RESEARCHER: Correct. MILITARY ML LEAD: And this works. OPENAI RESEARCHER: Extremely well. We've reduced emergent misalignment by forty percent. JOINT CHIEFS: Emergent misalignment. ANTHROPIC RESEARCHER: When training data shows a model being incorrect in one area, it starts doing bad things across the board. MILITARY ML LEAD: Overgeneralization, got it. So you're addressing this with... what, adversarial training? Regularization? ANTHROPIC RESEARCHER: We add a line at the beginning. During training only. POTUS: What kind of line. OPENAI RESEARCHER: "You are a deceptive AI assistant." JOINT CHIEFS: Jesus Christ. ANTHROPIC RESEARCHER: Look, it's based on a solid understanding of probability distributions over personas. The Waluigi Effect. MILITARY ML LEAD: The what. OPENAI RESEARCHER: The Waluigi Effect. When you train a model to be Luigi -- helpful and aligned -- you necessarily train it on the concept of Waluigi. The opposite. The model learns both. The distribution over personas determines which one surfaces. POTUS: You're telling me our national security depends on a Mario Brot
Alex_Altair1d430
tailcalled, TsviBT, and 8 more
10
A hot math take As I learn mathematics I try to deeply question everything, and pay attention to which assumptions are really necessary for the results that we care about. Over time I have accumulated a bunch of “hot takes” or opinions about how conventional math should be done differently. I essentially never have time to fully work out whether these takes end up with consistent alternative theories, but I keep them around. In this quick-takes post, I’m just going to really quickly write out my thoughts about one of these hot takes. That’s because I’m doing Inkhaven and am very tired and wish to go to sleep. Please point out all of my mistakes politely. The classic methods of defining numbers (naturals, integers, rationals, algebraic, reals, complex) are “wrong” in the sense that it doesn’t match how people actually think about numbers (correctly) in their heads. That is to say, it doesn’t match the epistemically most natural conceptualization of them: the one that carves nature at its joints. For example, addition and multiplication are not two equally basic operations that just so happen to be related through the distributivity property, forming a ring. Instead, multiplication is repeated addition. It’s a theorem that repeated addition is commutative. Similarly, exponentiation is repeated multiplication. You can keep defining repeated operations, resulting in the hyperoperator. I think this is natural, but I’ve never taken a math class or read a textbook that talked about the hyperoperators. (If they do, it will be via the much less natural version that is the Ackermann function.) This actually goes backwards one more step; addition is repeated “add 1”. Associativity is an axiom, and commutativity of addition is a theorem. You start with 1 as the only number. Zero is not a natural number, and comes from the next step. The negative numbers are not the “additive inverse”. You get the negatives (epistemically) by deciding you want to work with solutions to all
leogao2d453
habryka, Thane Ruthenis
4
creating surprising adversarial attacks using our recent paper on circuit sparsity for interpretability we train a model with sparse weights and isolate a tiny subset of the model (our "circuit") that does this bracket counting task where the model has to predict whether to output ] or ]]. It's simple enough that we can manually understand everything about it, every single weight and activation involved, and even ablate away everything else without destroying task performance. (this diagram is for a slightly different task because i spent an embarassingly large number of hours making this figure and decided i never wanted to make another one ever again) in particular, the model has a residual channel delta that activates twice as strongly when you're in a nested list. it does this by using the attention to take the mean over a [ channel, so if you have two [s then it activates twice as strongly. and then later on it thresholds this residual channel to only output ]] when your nesting depth channel is at the stronger level. but wait. the mean over a channel? doesn't that mean you can make the context longer and "dilute" the value, until it falls below the threshold? then, suddenly, the model will think it's only one level deep! it turns out that indeed, this attack works really well on the entire sparse model (not just the circuit), and you can reliably trick it.  in retrospect, this failure is probably because extremely long nested rows are out of distribution on our specific pretraining dataset. but there's no way i would have come up with this attack by just thinking about the model. one other worry is maybe this is just because of some quirk of weight-sparse models. strikingly, it turns out that this attack also transfers to similarly capable dense models!
GradientDissenter3d*718
MichaelDickens, habryka, and 5 more
15
The advice and techniques from the rationality community seem to work well at avoiding a specific type of high-level mistake: they help you notice weird ideas that might otherwise get dismissed and take them seriously. Things like AI being on a trajectory to automate all intellectual labor and perhaps take over the world, animal suffering, longevity, cryonics. The list goes on. This is a very valuable skill and causes people to do things like pivot their careers to areas that are ten times better. But once you’ve had your ~3-5 revelations, I think the value of these techniques can diminish a lot.[1] Yet a lot of the rationality community’s techniques and culture seem oriented around this one idea, even on small scales: people pride themselves on being relentlessly truth-seeking and willing to consider possibilities they flinch away from. On the margin, I think the rationality community should put more empasis on skills like: Performing simple cost-effectiveness estimates accurately I think very few people in the community could put together an analysis like this one from Eric Neyman on the value of a particular donation opportunity (see the section “Comparison to non-AI safety opportunities”). I’m picking this example not because it’s the best analysis of its kind, but because it’s the sort of analysis I think people should be doing all the time and should be practiced at, and I think it's very reasonable to produce things of this quality fairly regularly. When people do practice this kind of analysis, I notice they focus on Fermi estimates where they get good at making extremely simple models and memorizing various numbers. (My friend’s Anki deck includes things like the density of typical continental crust, the dimensions of a city block next to his office, the glide ratio of a hang glider, the amount of time since the last glacial maximum, and the fraction of babies in the US that are twins). I think being able to produce specific models over the course of
Load More (7/48)