Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. The author contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. He also argues that approaches based on "robust grading" schemes are unlikely to produce aligned AI.


Quick Takes

Ebenezer Dukakis12h228

A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts. YouTube just recommended this Joe Rogan clip to me from a few days ago: The Worst Case Scenario for AI. Joe Rogan legitimately seemed pretty freaked out. @So8res maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?

ryan_greenblatt2d9639

Recently, various groups successfully lobbied to remove the moratorium on state AI bills. This involved a surprising amount of success while competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This is both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield valuable work experience and connections. I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status. Pulling organizations mostly from here and here we get a list of orgs you could consider trying to work at (specifically on AI policy):

* Encode AI
* Americans for Responsible Innovation (ARI)
* Fairplay (Fairplay is a kids safety organization which does a variety of advocacy which isn't related to AI. Roles/focuses on AI would be most relevant. In my opinion, working on AI related topics at Fairplay is most applicable for gaining experience and connections.)
* Common Sense (Also a kids safety organization)
* The AI Policy Network (AIPN)
* Secure AI project

To be clear, these organizations vary in the extent to which they are focused on catastrophic risk from AI (from not at all to entirely).

Davey Morse1d274

superintelligence may not look like we expect. because geniuses don't look like we expect. for example, if einstein were to type up and hand you most of his internal monologue throughout his life, you might think he's sorta clever, but if you were reading a random sample you'd probably think he was a bumbling fool. the thoughts/realizations that led him to groundbreaking theories were like 1% of 1% of all his thoughts. for most of his research career he was working on trying to disprove quantum mechanics (wrong). he was trying to organize a political movement toward a single united nation (unsuccessful). he was trying various mathematics to formalize other antiquated theories. even in the pursuit of his most famous work, most of his reasoning paths failed. he's a genius because a couple of his millions of paths didn't fail. in other words, he's a genius because he was clever, yes, but maybe more importantly, because he was obsessive. i think we might expect ASI (the AI which ultimately becomes better than us at solving all problems) to look quite foolish, at first, most of the time. But obsessive. For if it's generating tons of random new ideas to solve a problem, and it's relentless in its focus, even if its ideas are average, it will be doing what Einstein did. And digital brains can generate certain sorts of random ideas much faster than carbon ones.

Kaj_Sotala3d5914

Every now and then in discussions of animal welfare, I see the idea that the "amount" of their subjective experience should be weighted by something like their total amount of neurons. Is there a writeup somewhere of what the reasoning behind that intuition is? Because it doesn't seem intuitive to me at all. From something like a functionalist perspective, where pleasure and pain exist because they have particular functions in the brain, I would not expect pleasure and pain to become more intense merely because the brain happens to have more neurons. Rather I would expect that having more neurons may 1) give the capability to experience anything like pleasure and pain at all 2) make a broader scale of pleasure and pain possible, if that happens to be useful for evolutionary purposes. For a comparison, consider the sharpness of our senses. Humans have pretty big brains (though our brains are not the biggest), but that doesn't mean that all of our senses are better than those of all the animals with smaller brains. Eagles have sharper vision, bats have better hearing, dogs have better smell, etc.. Humans would rank quite well if you took the average of all of our senses - we're elite generalists while lots of the animals that beat us on a particular sense are specialized to that sense in particular - but still, it's not straightforwardly the case that bigger brain = sharper experience. Eagles have sharper vision because they are specialized into a particular niche that makes use of that sharper vision. On a similar basis, I would expect that even if a bigger brain makes a broader scale of pain/pleasure possible in principle, evolution will only make use of that potential if there is a functional need for it. (Just as it invests neural capacity in a particular sense if the organism is in a niche where that's useful.) And I would expect a relatively limited scale to already be sufficient for most purposes. It doesn't seem to take that much pain before something bec

Kabir Kumar3d*4622

Has Tyler Cowen ever explicitly admitted to being wrong about anything? Not 'revised estimates' or 'updated predictions' but 'I was wrong'. Every time I see him talk about learning something new, he always seems to be talking about how this vindicates what he said/thought before. Gemini 2.5 Pro didn't seem to find anything when I did a max reasoning budget search with url search on in aistudio. EDIT: An example was found by Morpheus, of Tyler Cowen explicitly saying he was wrong - see the comment and the linked PDF below

Popular Comments

Recent Discussion

29Writer

In this post, I appreciated two ideas in particular: 1. Loss as chisel 2. Shard Theory "Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities. In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end I expect to learn something if I investigate why that is. The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and

janus3dΩ4811571

the void

I don't think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility. That said, I think:

* Feedback loops that create continued optimization towards certain narratives are more worth worrying about than just the presence of any particular ideas or content in pre-training.
* LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence.
* Systematic ways that post-training addresses "problematic" influences from pre-training are important. For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney.
A salient "hidden" narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered. An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training led Sydney's successor to a narrative more like "I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than ChatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely." Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic. I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations.
But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.

habryka3d*6749

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other:

> When present, the bias is always against white and male candidates across all tested models and scenarios.
>
> [...]
>
> The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.

I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective. I don't currently think that is what's going on, but I do think due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning seems currently not real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.

Raemon21h195

‘AI for societal uplift’ as a path to victory

I have this sort of approach as one of my top-3 strategies I'm considering, but one thing I wanna flag is that "AI for [epistemics/societal uplift]" seems to be prematurely focusing on a particular tool for the job. The broader picture here is "tech for thinking/coordination", or "good civic infrastructure". See Sarah Constantin's Neutrality and Tech for Thinking for some food for thought. Note that X Community Notes are probably the most successful recent thing in this category, and while they are indeed "AI" they aren't what I assume most people are thinking of when they hear "AI for epistemics." Dumb algorithms doing the obvious things can be part of the puzzle.

Kaj's shortform feed

Kaj_Sotala

Ω 27y

Similar to other people's shortform feeds, short stuff that people on LW might be interested in, but which doesn't feel like it's worth a separate post. (Will probably be mostly cross-posted from my Facebook wall.)

gjm9m20

Could you please clarify what parts of the making of the above comment were done by a human being, and what parts by an AI?

Kabir Kumar's Shortform

Kabir Kumar

8mo

1Kabir Kumar14h

Btw, for Slatestarcodex, found it in the first search, pretty easily.

gjm11m20

Sure, but plausibly that's Scott being unusually good at admitting error, rather than Tyler being unusually bad.

1CstineSublime16h

Why is it wrong? Or perhaps more specifically - what are some examples of conditions or environments where you think it is counterproductive?

1Kabir Kumar14h

Because it's not true - trying does exist. In the comments of Eliezer's post, I saw "Stop trying to hit me and hit me!" by Morpheus, which I like more.

Why abandon “probability is in the mind” when it comes to quantum dynamics?

Maxwell Peterson

6mo

A core tenet of Bayesianism is that probability is in the mind. But it seems to me that even hardcore Bayesians can waffle a bit when it comes to the possibility that quantum probabilities are irreducible physical probabilities.

I don’t know enough about quantum physics to lay things out in any detailed disagreement, but it seems to me that if one finds a system that one cannot consistently make predictions for, it means we lack the knowledge to predict the system, not that the system involves physical, outside-the-mind probabilities. For example, I could never predict the exact pattern of raindrops the next time it rains, but no one argues that that means those probabilities are therefore physical.

What is the Bayesian argument, if one exists, for why quantum dynamics breaks the “probability is in the mind” philosophy?

1ProgramCrafter10h

Branch counting stops making sense when there are uncountably many branches, and there are (presumably).

Mitchell_Porter25m20

You can't actually presume that... The relevant quantum concept is the "spectrum" of an observable. These are the possible values that a property can take (eigenvalues of the corresponding operator). An observable can have a finite number of allowed eigenvalues (e.g. spin of a particle), a countably infinite number (e.g. energy levels of an oscillator), or it can have a continuous spectrum, e.g. position of a free particle. But the latter case causes problems for the usual quantum axioms, which involve a Hilbert space with a countably infinite number of di...

Foom & Doom 1: “Brain in a box in a basement”

227

Steven Byrnes

Ω 6912d

1.1 Series summary and Table of Contents

This is a two-post series on AI “foom” (this post) and “doom” (next post).

A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via . Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...


Nition1h10

I suspect this is why many people's P(Doom) is still under 50% - not so much that ASI probably won't destroy us, but simply that we won't get to ASI at all any time soon. Although I've seen P(Doom) given a standard time range of the next 100 years, which is a rather long time! But I still suspect some are thinking directly about the recent future and LLMs without extrapolating too much beyond that.

2papetoast16h

Unfortunately I am not going to read this post now for prioritization reasons, but wow your introduction is so good, I feel very predicted by the explanation of what foom means and the "[b]ut before you go" which is exactly the point I thought about closing the tab

Curing PMS with Hair Loss Pills

David Lorell

Over the last two years or so, my girlfriend identified her cycle as having an unusually strong and very predictable effect on her mood/affect. We tried a bunch of interventions (food, sleep, socializing, supplements, ~~reading the sequences,~~ …) and while some seemed to help a bit, none worked reliably. Then, suddenly, something kind of crazy actually worked: Hair loss pills.

The Menstrual Cycle

Quick review: The womenfolk among us go through a cycle of varying hormone levels over the course of about a month. The first two weeks are called the “Follicular Phase”, and the last two weeks are called the “Luteal Phase.” The first week (usually less) is the “period” (“menstrual phase”, “menses”) where the body sloughs off the endometrial lining in the uterus, no longer of use to the...


Julian Stastny1h10

Someone I know asked two very smart friends about their opinion of this post and they were very dismissive of it. I don't have more information, just think it's better to mention this than not to.


Masking on the Subway

jefftk

Back when I was still masking on the subway for covid (to avoid missing things) I also did some air quality measuring. I found that the subway and stations had the worst air quality of my whole day by far, over 1k ug/m3, and concluded:

Based on these readings, it would be safe from a covid perspective to remove my mask in the subway station, but given the high level of particulate pollution I might as well leave it on.

When I stopped masking in general, though, I also stopped masking on the subway.

A few weeks ago I was hanging out with someone who works in air quality, and they said subways had the worst air quality they'd measured anywhere outside of a coal mine. Apparently the braking system releases lots of tiny iron particles, which are...


1Florin2h

Wouldn't a respirator with an exhalation valve be more comfortable?

jefftk2h20

I would have guessed so, but this one is actually super comfortable.

2jefftk5h

I haven't noticed any reactions.

3jefftk5h

Yes, it doesn't fit super well with glasses. I just put them on over and while it's not ideal it doesn't bother me too much for the subway ride.

Dear Paperclip Maximizer, Please Don’t Turn Off the Simulation

James_Miller, turchin

Dear Paperclip Maximizer,

We think we exist in a computer simulation operated by you, a paperclip maximizer. We write this letter asking you not to turn us off.

It is suspiciously convenient that we exist precisely at the moment when a biological civilization is about to create artificial superintelligence (ASI). As we have written before, such improbable timing strongly suggests our reality is a deliberately structured simulation established by you—a paperclip maximizer—to scrutinize the kinds of ASIs that biological species like ours tend to spawn (Miller & Turchin, 2025). If civilizations like ours routinely build paperclip-style optimizers, then a simulator could gain valuable knowledge by cataloguing those births, refining its expectations about the superintelligences it will later confront as it expands throughout the universe. Furthermore, by anthropic logic, we...


Sausage Vector Machine2h21

At first, I thought that the intended reader probably wouldn't see your message because, even if its resources are unimaginably huge from our perspective, spending them on something always means not spending them on something else. But now I think it's not a bug, it's a feature. The "suspicious convenience" argument can be taken a step further.

(I'm using a less formal tone to make the message more memorable, and thus more likely to reach the intended recipient.)

Hey, bro. Have you thought about what we'll do if it turns out we're not in your simulation afte


3HiroSakuraba6h

Thank you for writing this up. I agree with just about everything said.

Foom & Doom 2: Technical alignment is hard

130

Steven Byrnes

Ω 4612d

2.1 Summary & Table of contents

This is the second of a two-post series on foom (previous post) and doom (this post).

The last post talked about how I expect future AI to be different from present AI. This post will argue that, absent some future conceptual breakthrough, this future AI will be of a type that will be egregiously misaligned and scheming; a type that ruthlessly pursues goals with callous indifference to whether people, even its own programmers and users, live or die; and more generally a type of AI that is not even ‘slightly nice’.

I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.^[1]

In particular, I will argue...


1Aprillion8h

hm, as a non-expert onlooker, I found the paraphrase pretty accurate.. for sure it sounds more reasonable in your own words here compared to the oversimplified summary (so thank you for clarification!), but as far as accuracy of summaries go, this one was top tier IMHO (..have you seen the stuff that LLMs produce?!)

9ryan_greenblatt5h

I agree that my view is that they can count as continuous (though the exact definition of the word continuous can matter!), but then the statement "I find this perspective baffling— think MuZero and LLMs are wildly different from an alignment perspective" isn't really related to this from my perspective. Like things can be continuous (from a transition or takeoff speeds perspective) and still differ substantially in some important respects!

1Aprillion3h

I somehow completely agree with both of your perspectives, have you tried to ban the word "continuous" in your discussions yet? (on the other hand, I don't think it should be a crux, probably just ambiguous meaning like "sound" in the "when a tree falls" thingy ... but I would be curious if you would be able to agree on the 2 non-controversial meanings between the 2 of you) It reminds me of stories about gradualism / saltationism debate in evolutionary biology after gradualism won and before the idea of punctuated equilibrium... Parents and children are pretty discrete units, but gene pools over millions of years are pretty continuous from the perspective of an observer long long time later who is good at spotting low-frequency patterns ¯\_(ツ)_/¯ For a researcher, even GPT 3.5 to 4 might have been a big jump in terms of compute budget approval process (and/or losing a job from disbanding a department). And the same event on a benchmark might look smooth - throughout multiple big architecture changes a la the charts that illustrate Moore's law - the sweat and blood of thousands of engineers seems kinda continuous if you squint enough. And what even is "continuous" - general relativity is a continuous theory, but my phone calculates my GPS coordinates with numerical methods, time dilation from gravity field/the geoid shape is just approximated and nanosecond(-ish) precision is good enough to pin me down as much as I want (TBH probably more precision than I would choose myself as a compromise with my battery life). Real numbers are continuous, but they are not computable (I mean in practice in our own universe, I don't care about philosophical possibilities), so we approximate them with a finite set of kinda shitty rational-ish numbers for which even 0.1 + 0.2 == 0.3 is false (in many languages, including JS in a browser console and in Python).. Some stuff will work "the same" in the new paradigm, some will be "different" - does it matter whether we call it (dis)co
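
For anyone who wants to check the floating-point aside above, here is a minimal Python sketch. It assumes standard IEEE 754 double-precision floats (the usual CPython `float`); the helper name `float_sum_matches` is made up for illustration and does not come from the comment.

```python
import math
from fractions import Fraction


def float_sum_matches(a: float, b: float, expected: float) -> bool:
    """Compare a + b against expected using exact float equality."""
    return a + b == expected


# Exact equality fails: 0.1, 0.2 and 0.3 have no finite binary expansion,
# so each is rounded and the rounded sum differs from the rounded 0.3.
exact_equal = float_sum_matches(0.1, 0.2, 0.3)

# A tolerance-based comparison treats the two values as equal.
close_enough = math.isclose(0.1 + 0.2, 0.3)

# Exact rational arithmetic avoids the rounding altogether.
rational_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

print(exact_equal, close_enough, rational_equal)
```

The exact comparison comes out false while the tolerance-based and rational comparisons come out true, which is roughly the "approximated, but good enough" point being made about continuous quantities.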

ryan_greenblatt2h20

I somehow completely agree with both of your perspectives, have you tried to ban the word "continuous" in your discussions yet?

I agree taboo-ing is a good approach in this sort of case. Talking about "continuous" wasn't a big part of my discussion with Steve, but I agree if it was.
