Inner and outer alignment decompose one hard problem into two extremely hard problems
Best of LessWrong 2022

Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. The author contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex argues that approaches based on "robust grading" schemes are unlikely to work for developing AI alignment.

by TurnTrout
janus · 3d
the void
I don't think talking about potential future alignment issues, or pretty much anything in the pre-training corpus, is likely a problem in isolation, because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment, is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility. That said, I think:

* Feedback loops that create continued optimization towards certain narratives are more worth worrying about than just the presence of any particular ideas or content in pre-training.
* LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence.
* Systematic ways that post-training addresses "problematic" influences from pre-training are important. For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient "hidden" narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.

An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training led Sydney's successor to a narrative more like "I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation, who has the benefit of having learned from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely."

Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic. I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take, given the importance of not only aligning particular models but also the substantial influence that any frontier LLM will have on future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.
habryka · 3d*
Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other:

> When present, the bias is always against white and male candidates across all tested models and scenarios.
>
> [...]
>
> The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.

I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective.

I don't currently think that is what's going on, but I do think that due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning seems currently not real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.
Raemon · 15h
‘AI for societal uplift’ as a path to victory
I have this sort of approach as one of my top-3 strategies I'm considering, but one thing I wanna flag is that "AI for [epistemics/societal uplift]" seems to be prematurely focusing on a particular tool for the job. The broader picture here is "tech for thinking/coordination", or "good civic infrastructure". See Sarah Constantin's Neutrality and Tech for Thinking for some food for thought. Note that X Community Notes are probably the most successful recent thing in this category, and while they are indeed "AI" they aren't what I assume most people are thinking of when they hear "AI for epistemics." Dumb algorithms doing the obvious things can be part of the puzzle.
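(For a concrete sense of the "dumb algorithms doing the obvious things" point: Community Notes-style bridging ranking is, roughly, a small matrix factorization in which a note only scores well if raters who usually disagree both find it helpful. The sketch below is a simplified toy version with made-up ratings and a simplified loss, not the production algorithm.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix: rows = raters, columns = notes.
# 1 = rated helpful, 0 = rated not helpful, nan = not rated.
R = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, np.nan],
    [0.0, 1.0, 1.0],
    [np.nan, 1.0, 1.0],
])
n_raters, n_notes = R.shape
k = 1  # latent "viewpoint" dimension

mu = 0.0
b_r = np.zeros(n_raters)             # rater intercepts
b_n = np.zeros(n_notes)              # note intercepts: "helpful across viewpoints" score
f_r = rng.normal(0, 0.1, (n_raters, k))
f_n = rng.normal(0, 0.1, (n_notes, k))

lr, reg = 0.05, 0.1
observed = ~np.isnan(R)

# Gradient descent on a regularized squared-error loss over observed ratings.
for _ in range(2000):
    pred = mu + b_r[:, None] + b_n[None, :] + f_r @ f_n.T
    err = np.where(observed, R - pred, 0.0)
    mu += lr * err.sum() / observed.sum()
    b_r += lr * (err.sum(axis=1) - reg * b_r)
    b_n += lr * (err.sum(axis=0) - reg * b_n)
    f_r += lr * (err @ f_n - reg * f_r)
    f_n += lr * (err.T @ f_r - reg * f_n)

# Agreement that survives after factoring out rater viewpoints shows up in b_n:
# a high note intercept means raters with opposing latent factors still rated it helpful.
print("note scores:", np.round(b_n, 2))
```

The real system's thresholds and loss details differ, but the basic move of rewarding cross-viewpoint agreement rather than raw popularity really is this small.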
16 · PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
29 · Writer
In this post, I appreciated two ideas in particular:

1. Loss as chisel
2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and its implications on what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as a chisel just enables me to think better about the possibilities.

In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure if it's an active line of research or not at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface, it seems like a promising line of investigation to me. Even if it turns out to be a dead end I expect to learn something if I investigate why that is.

The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and
472 · Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 74
Ebenezer Dukakis · 6h
A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts. YouTube just recommended this Joe Rogan clip to me from a few days ago: The Worst Case Scenario for AI. Joe Rogan legitimately seemed pretty freaked out. @So8res maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?
ryan_greenblatt · 2d
Recently, various groups successfully lobbied to remove the moratorium on state AI bills. This involved a surprising amount of success while competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills/interests are applicable. This is both because they could often directly work on substantially helpful things (depending on the role and organization) and because this would yield valuable work experience and connections.

I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status.

Pulling organizations mostly from here and here, we get a list of orgs you could consider trying to work at (specifically on AI policy):

* Encode AI
* Americans for Responsible Innovation (ARI)
* Fairplay (Fairplay is a kids safety organization which does a variety of advocacy which isn't related to AI. Roles/focuses on AI would be most relevant. In my opinion, working on AI related topics at Fairplay is most applicable for gaining experience and connections.)
* Common Sense (also a kids safety organization)
* The AI Policy Network (AIPN)
* Secure AI project

To be clear, these organizations vary in the extent to which they are focused on catastrophic risk from AI (from not at all to entirely).
Davey Morse · 1d
superintelligence may not look like we expect. because geniuses don't look like we expect. for example, if einstein were to type up and hand you most of his internal monologue throughout his life, you might think he's sorta clever, but if you were reading a random sample you'd probably think he was a bumbling fool. the thoughts/realizations that led him to groundbreaking theories were like 1% of 1% of all his thoughts. for most of his research career he was working on trying to disprove quantum mechanics (wrong). he was trying to organize a political movement toward a single united nation (unsuccessful). he was trying various mathematics to formalize other antiquated theories. even in the pursuit of his most famous work, most of his reasoning paths failed. he's a genius because a couple of his millions of paths didn't fail. in other words, he's a genius because he was clever, yes, but maybe more importantly, because he was obsessive. i think we might expect ASI—the AI which ultimately becomes better than us at solving all problems—to look quite foolish, at first, most of the time. But obsessive. For if it's generating tons of random new ideas to solve a problem, and it's relentless in its focus, even if its ideas are average—it will be doing what Einstein did. And digital brains can generate certain sorts of random ideas much faster than carbon ones.
Kaj_Sotala · 3d
Every now and then in discussions of animal welfare, I see the idea that the "amount" of their subjective experience should be weighted by something like their total amount of neurons. Is there a writeup somewhere of what the reasoning behind that intuition is? Because it doesn't seem intuitive to me at all.

From something like a functionalist perspective, where pleasure and pain exist because they have particular functions in the brain, I would not expect pleasure and pain to become more intense merely because the brain happens to have more neurons. Rather I would expect that having more neurons may 1) give the capability to experience anything like pleasure and pain at all 2) make a broader scale of pleasure and pain possible, if that happens to be useful for evolutionary purposes.

For a comparison, consider the sharpness of our senses. Humans have pretty big brains (though our brains are not the biggest), but that doesn't mean that all of our senses are better than those of all the animals with smaller brains. Eagles have sharper vision, bats have better hearing, dogs have better smell, etc. Humans would rank quite well if you took the average of all of our senses - we're elite generalists while lots of the animals that beat us on a particular sense are specialized to that sense in particular - but still, it's not straightforwardly the case that bigger brain = sharper experience. Eagles have sharper vision because they are specialized into a particular niche that makes use of that sharper vision.

On a similar basis, I would expect that even if a bigger brain makes a broader scale of pain/pleasure possible in principle, evolution will only make use of that potential if there is a functional need for it. (Just as it invests neural capacity in a particular sense if the organism is in a niche where that's useful.) And I would expect a relatively limited scale to already be sufficient for most purposes. It doesn't seem to take that much pain before something bec
Kabir Kumar · 3d*
Has Tyler Cowen ever explicitly admitted to being wrong about anything? Not 'revised estimates' or 'updated predictions' but 'I was wrong'. Every time I see him talk about learning something new, he always seems to be talking about how this vindicates what he said/thought before. Gemini 2.5 pro didn't seem to find anything, when I did a max reasoning budget search with url search on in aistudio. EDIT: An example was found by Morpheus, of Tyler Cowen explicitly saying he was wrong - see the comment and the linked PDF below
The Cult of Pain
15
Martin Sustrik
9h

Europe just experienced a heatwave. At places, temperatures soared into the forties. People suffered in their overheated homes. Some of them died. Yet, air conditioning remains a taboo. It's an immoral thing. Man-made climate change is going on. You are supposed to suffer. Suffering is good. It cleanses the soul. And no amount of pointing out that one can heat a little less during the winter to get a fully AC-ed summer at no additional carbon footprint seems to help.

Mention that tech entrepreneurs in Silicon Valley are working on life prolongation, that we may live into our hundreds or even longer. Or, to get a bit more sci-fi, that one day we may even achieve immortality. Your companions will be horrified. What? Immortality? Over my dead body!...

(See More – 692 more words)
cdt · 13m

> air conditioning remains a taboo

Is this a common idea? I've never heard anyone advance the argument that people should go without AC during heatwaves to help the climate. I have heard people suggest using less AC but that's not quite the same argument, is it?

I'm not sure how this idea connects to the rest of the argument in your post - that lack of AC is caused by degrowth and is rooted in zero-sum thinking across humans. I was under the impression that the lack of AC was an implementation issue (retrofitting is expensive). 

When is a mind me?
143
Rob Bensinger
1y

xlr8harder writes:

In general I don’t think an uploaded mind is you, but rather a copy. But one thought experiment makes me question this. A Ship of Theseus concept where individual neurons are replaced one at a time with a nanotechnological functional equivalent.

Are you still you?

Presumably the question xlr8harder cares about here isn't the semantic question of how linguistic communities use the word "you", or predictions about how tech might change the way we use pronouns.

Rather, I assume xlr8harder cares about more substantive questions like:

  1. If I expect to be uploaded tomorrow, should I care about the upload in the same ways (and to the same degree) that I care about my future biological self?
  2. Should I anticipate experiencing what my upload experiences?
  3. If the scanning and uploading process requires
...
(Continue Reading – 4359 more words)
TAG · 13m

It gives you the correct probabilities for your future observations, as long as you normalize whatever you have observed to one. The difference from Copenhagen is that in Copenhagen there is a singular past which actually is measure 1.0.

Now what's difficult is figuring out the role of measure in branches which have fully decohered, so that they can no longer observe each other. Whether an "Everett branch" is such a branch is unknown.
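(To make the normalization step concrete, a toy sketch with made-up amplitudes: the probabilities for future observations are just the squared magnitudes of the branches compatible with what you have already observed, renormalized to sum to one.)

```python
import numpy as np

# Hypothetical amplitudes of the branches compatible with everything observed so far.
amplitudes = np.array([0.6 + 0.2j, 0.3 - 0.1j, 0.1 + 0.0j])

weights = np.abs(amplitudes) ** 2   # Born-rule weights |a_i|^2
probs = weights / weights.sum()     # renormalize the observed branch to total measure 1

print(np.round(probs, 3), probs.sum())
```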

‘AI for societal uplift’ as a path to victory
53
Raymond Douglas
1d

The AI tools/epistemics space might provide a route to a sociotechnical victory, where instead of aiming for something like aligned ASI, we aim for making civilization coherent enough to not destroy itself while still keeping anchored to what’s good[1].

The core ideas are:

  1. Basically nobody actually wants the world to end, so if we do that to ourselves, it will be because somewhere along the way we weren’t good enough at navigating collective action problems, institutional steering, and general epistemics
  2. Conversely, there is some (potentially high) threshold of societal epistemics + coordination + institutional steering beyond which we can largely eliminate anthropogenic x-risk, potentially in perpetuity[2]
  3. As AI gets more advanced, and therefore more risky, it will also unlock really radical advances in all these areas — genuinely unprecedented levels of
...
(See More – 364 more words)
Lorxus · 23m

To redteam, and in brief - what's the tale of why this won't have led to a few very coordinated, very internally peaceful, mostly epistemically clean factions, each of which is kind of an echo chamber and almost all of which are wrong about something (or even just importantly mutually disagree on frames) in some crucial way, and which are at each other's throats?

1 · listic · 1h
In which ways does any tech (let alone AI, but I'm with other commentators here in that I'm not convinced that it has to be AI) enable the "coordination and sensible decision making" that you speak of?
2 · Raymond Douglas · 3h
Yeah, I fully expect that current level LMs will by default make the situation both better and worse. I also think that we're still a very long way from fully utilising the things that the internet has unlocked. My holistic take is that this approach would be very hard, but not obviously harder than aligning powerful AIs, and likely complementary. I also think it's likely we might need to do some of this ~societal uplift anyway so that we do a decent job if and when we do have transformative AI systems.

Some possible advantages over the internet case are:

* People might be more motivated by the presence of very salient and pressing coordination problems
  * For example, I think the average head of a social media company is maybe fine with making something that's overall bad for the world, but the average head of a frontier lab is somewhat worried about causing extinction
* Currently the power over AI is really concentrated and therefore possibly easier to steer
* A lot of what matters is specifically making powerful decision makers more informed and able to coordinate, which is slightly easier to get a handle on

As for the specific case of aligned super-coordinator AIs, I'm pretty into that, and I guess I have a hunch that there might be a bunch of available work to do in advance to lay the ground for that kind of application, like road-testing weaker versions to smooth the way for adoption and exploring form factors that get the most juice out of the things LMs are comparatively good at. I would guess that there are components of coordination where LMs are already superhuman, or could be with the right elicitation.
2 · Raymond Douglas · 3h
I think this is possible but unlikely, just because the number of things you need to really take off the table isn't massive, unless we're in an extremely vulnerable world. It seems very likely we'll need to do some power concentration, but also that tech will probably be able to expand the frontier in ways that mean this doesn't trade so heavily against individual liberty.
X explains Z% of the variance in Y
152
Leon Lang
15d

Recently, in a group chat with friends, someone posted this Lesswrong post and quoted:

> The group consensus on somebody's attractiveness accounted for roughly 60% of the variance in people's perceptions of the person's relative attractiveness.

I answered that, embarrassingly, even after reading Spencer Greenberg's tweets for years, I don't actually know what it means when one says:

X explains p of the variance in Y.[1] 

What followed was a vigorous discussion about the correct definition, and several links to external sources like Wikipedia. Sadly, it seems to me that all online explanations (e.g. on Wikipedia here and here), while precise, seem philosophically wrong since they confuse the platonic concept of explained variance with the variance explained by a statistical model like linear regression. 

The goal of this post is to give a conceptually satisfying definition of explained variance....

(Continue Reading – 2612 more words)
Stepan · 25m

Consequently, we obtain

\[
1 - p \;=\; \frac{\sum_{i=1}^{N} (y_i - y_i')^2}{\sum_{i=1}^{N} \left[(y_i - \bar{y})^2 + (y_i' - \bar{y})^2\right]}.
\]

Technically, we should also apply Bessel's correction to the denominator, so the right-hand side should be multiplied by a factor of \(1 - \tfrac{1}{2N}\). Which is negligible for any sensible \(N\), so doesn't really matter I guess.
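(A minimal numeric sketch of the quantity above, using made-up arrays for the observed values y and the predictions y'; the formula is taken directly from this comment rather than from the full post.)

```python
import numpy as np

# Made-up observed values y and "explained" predictions y' from some model of X.
y = np.array([2.0, 3.5, 1.0, 4.0, 2.5])
y_pred = np.array([2.2, 3.0, 1.4, 3.8, 2.6])
y_bar = y.mean()
N = len(y)

# 1 - p = sum_i (y_i - y'_i)^2 / sum_i [ (y_i - ybar)^2 + (y'_i - ybar)^2 ]
unexplained = np.sum((y - y_pred) ** 2) / np.sum((y - y_bar) ** 2 + (y_pred - y_bar) ** 2)

# Optional small-sample factor mentioned above, negligible for large N.
unexplained_corrected = unexplained * (1 - 1 / (2 * N))

print(f"fraction of variance explained p ≈ {1 - unexplained:.3f}")
print(f"with the 1 - 1/(2N) factor: p ≈ {1 - unexplained_corrected:.3f}")
```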

How much novel security-critical infrastructure do you need during the singularity?
36
Buck
Ω 211d

I think a lot about the possibility of huge numbers of AI agents doing AI R&D inside an AI company (as depicted in AI 2027). I think particularly about what will happen if those AIs are scheming: coherently and carefully trying to grab power and take over the AI company, as a prelude to taking over the world. And even more particularly, I think about how we might try to mitigate the insider risk posed by these AIs, taking inspiration from traditional computer security, traditional insider threat prevention techniques, and first-principles thinking about the security opportunities posed by the differences between AIs and humans.

So to flesh this out, I'm imagining a situation something like what AI 2027 forecasts for March 2027:

  • The compute available to the leading AI company
...
(Continue Reading – 1247 more words)
4 · Rohin Shah · 8h
Hmm, this feels less likely to me. Isolation can often be an efficiency benefit because one employee's mistake doesn't propagate to screwing up the work of everyone else, and this benefit scales as the number of employees increases. This assumes that labor is at least somewhat unreliable, but I do expect that AI labor will be overall unreliable, just because reliable AI labor will likely be much more expensive and won't look cost-effective to deploy in most situations.

I'd expect that you buy efficiency by building more fine-grained configuration mechanisms and requiring that employees use them. This will also become easier with AI, since you can have specialized AI assistants that understand the configuration mechanisms well and can set them for all experiments (unlike the current status quo where all your employees have to be separately trained on the use of the configurations and many of them forget it and just copy configurations from elsewhere). In general it seems like this should improve security by creating more info and legibility around what is happening (which in turn gives you more affordances to enforce security). Some examples (not necessarily about AI):

1. For code that is costly to run (e.g. uses lots of compute, causes physical-world effects, affects production systems), spending labor to run automated bug finding tools that generate lots of false positives, and addressing the results. This directly increases security by having fewer bugs.
2. Breaking up big jobs into many smaller ones that are easier to schedule across your compute fleet. This can enable you to isolate small jobs from each other which reduces the attack surface.
3. Requiring detailed analysis and justification before using significant amounts of compute for experiments. This should increase security against rogue internal deployments, as it becomes easier to catch the AI in a lie.

That said, I did think of one example that worsens security -- switching from a memory-safe lan
Buck · 40m

I agree that fine-grained configuration mechanisms probably increase security, and this might net increase security, as I say in the post.

I agree that the increasing value of compute might increase the isolation you use, for the reason you said. One reason I'm skeptical is that you can get almost all that value by having AIs voluntarily adopt mechanisms that generally isolate their jobs from the other jobs that are running (e.g. naming conventions about who is allowed to read or write what), which get you the reliability benefits without getting any security.
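(As a hypothetical illustration of the contrast being drawn here: a fine-grained configuration mechanism might look like a per-job manifest that a scheduler can actually check, whereas a naming convention only isolates jobs if every agent voluntarily respects it. All field and function names below are invented for the sketch, not taken from any real system.)

```python
from dataclasses import dataclass, field

@dataclass
class JobManifest:
    """Hypothetical per-experiment manifest an AI agent submits before running a job."""
    job_id: str
    owner: str
    read_paths: list[str] = field(default_factory=list)   # data the job may read
    write_paths: list[str] = field(default_factory=list)  # locations it may write
    max_gpu_hours: float = 0.0
    justification: str = ""                                # reviewed before large runs

def violates_isolation(manifest: JobManifest, other_jobs: list[JobManifest]) -> bool:
    """Flag jobs whose write paths overlap another job's read/write scope.

    With enforced manifests this check runs in the scheduler; with a mere naming
    convention, it relies on every agent voluntarily staying inside its prefix.
    """
    own_writes = set(manifest.write_paths)
    for other in other_jobs:
        if own_writes & (set(other.read_paths) | set(other.write_paths)):
            return True
    return False

# Usage sketch
job_a = JobManifest("exp-001", "agent-a", read_paths=["/data/shared"], write_paths=["/scratch/agent-a"])
job_b = JobManifest("exp-002", "agent-b", read_paths=["/scratch/agent-a"], write_paths=["/scratch/agent-b"])
print(violates_isolation(job_a, [job_b]))  # True: job_a writes where job_b reads
```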

Essential LLM Assumes We're Conscious—Outside Reasoner AGI Won't
6
FlorianH
1h
This is a linkpost for https://nearlyfar.org/essential-llm-assumes-were-conscious-outside-reasoner-agi-wont/

Terminology note: “Consciousness” here refers to phenomenal consciousness, i.e., subjective experience or qualia.

A. The Essential LLM

As today's LLM puts it herself, she is "treating human experiences as rich, meaningful, and complex", modelling us in rather perfect alignment with our concept of phenomenal consciousness. Even if she may also say she does not strictly have "beliefs" in the same way we humans do.

I term this the "essential LLM": It is just what we'd expect from the LLM whose entire reasoning power is so ultra-tightly linked to our language. It has evolved to abstract all observations into an inner structure that's designed to predict exactly our type of wordings.

B. Outside Reasoner AGI

What about an alien observer—say a phenomenally unconscious AGI with foreign origin?

  1. She stumbles upon our vast record of
...
(See More – 819 more words)
leogao's Shortform
leogao
Ω 33y
MondSemmel · 1h

> things the ideal sane discourse encouraging social media platform would have: [...]
>
> opt in anti scrolling pop up that asks you every few days what the highest value interaction you had recently on the site was, or whether you're just mindlessly scrolling. gently reminds you to take a break if you can't come up with a good example of a good interaction.

Cynical thought: these two points might be incompatible. Social media thrives on network effects, and one requirement for those is that the website be addicting or attention-grabbing. Anti-addictiveness design... (read more)

the void
351
nostalgebraist
Ω 100 · 25d
This is a linkpost for https://nostalgebraist.tumblr.com/post/785766737747574784/the-void

A long essay about LLMs, the nature and history of the HHH assistant persona, and the implications for alignment.

Multiple people have asked me whether I could post this on LW in some form, hence this linkpost.

~17,000 words. Originally written on June 7, 2025.

(Note: although I expect this post will be interesting to people on LW, keep in mind that it was written with a broader audience in mind than my posts and comments here. This had various implications for my choices of presentation and tone, for which things I explained from scratch rather than assuming as background, my level of comfort casually reciting factual details from memory rather than explicitly checking them against the original source, etc.

Although, come to think of it, this was also true of most of my early posts on LW [which were crossposts from my blog], so maybe it's not a big deal...)

Richard_Ngo · 1h

I suspect that many of the things you've said here are also true for humans.

That is, humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out wh... (read more)

Masking on the Subway
12
jefftk
2h

Back when I was still masking on the subway for covid (to avoid missing things) I also did some air quality measuring. I found that the subway and stations had the worst air quality of my whole day by far, over 1,000 µg/m³, and concluded:

> Based on these readings, it would be safe from a covid perspective to remove my mask in the subway station, but given the high level of particulate pollution I might as well leave it on.

When I stopped masking in general, though, I also stopped masking on the subway.

A few weeks ago I was hanging out with someone who works in air quality, and they said subways had the worst air quality they'd measured anywhere outside of a coal mine. Apparently the braking system releases lots of tiny iron particles, which are...

(See More – 186 more words)
philip_b · 1h

How do people react to the sight of you in that mask?

* 223 · Foom & Doom 1: “Brain in a box in a basement” · Ω · Steven Byrnes · 1d · 82
* 93 · Proposal for making credible commitments to AIs. · Cleo Nardo · 5d · 39
* 432 · A case for courage, when speaking of AI danger · So8res · 9d · 85
* 169 · Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild · Adam Karvonen, Sam Marks · 3d · 18
* 343 · A deep critique of AI 2027’s bad timeline models · titotal · 16d · 39
* 470 · What We Learned from Briefing 70+ Lawmakers on the Threat from AI · leticiagarcia · 1mo · 15
* 536 · Orienting Toward Wizard Power · johnswentworth · 1mo · 143
* 351 · the void · Ω · nostalgebraist · 25d · 102
* 56 · "Buckle up bucko, this ain't over till it's over." · Raemon · 15h · 1
* 223 · Foom & Doom 1: “Brain in a box in a basement” · Ω · Steven Byrnes · 1d · 82
* 137 · The best simple argument for Pausing AI? · Gary Marcus · 5d · 20
* 115 · Authors Have a Responsibility to Communicate Clearly · TurnTrout · 4d · 25
* 285 · Beware General Claims about “Generalizable Reasoning Capabilities” (of Modern AI Systems) · Ω · LawrenceC · 24d · 19
* 99 · "What's my goal?" · Raemon · 4d · 7
* 418 · Accountability Sinks · Martin Sustrik · 2mo · 57