Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they run against natural patterns of cognition formation. He also argues that approaches based on "robust grading" schemes are unlikely to produce aligned AI.

Writer
In this post, I appreciated two ideas in particular:

1. Loss as chisel
2. Shard Theory

"Loss as chisel" is a reminder of how loss truly does its job, and of its implications for what AI systems may actually end up learning. I can't really argue with it and it doesn't sound new to my ear, but it just seems important to keep in mind. Alone, it justifies trying to break out of the inner/outer alignment frame. When I start reasoning in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don't know. Loss as chisel just enables me to think better about the possibilities.

In my understanding, shard theory is, instead, a theory of how minds tend to be shaped. I don't know if it's true, but it sounds like something that has to be investigated. In my understanding, some people consider it a "dead end," and I'm not sure whether it's an active line of research at this point. My understanding of it is limited. I'm glad I came across it though, because on its surface it seems like a promising line of investigation to me. Even if it turns out to be a dead end, I expect to learn something by investigating why that is.

The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don't know if I agree with the thesis, but it's something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought for me: "Robust grading is unnecessary," "the loss function doesn't have to robustly and directly reflect what you want," "inner alignment to a grading procedure is unnecessary, very hard, and anti-natural." I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and…
PeterMcCluskey
This post is one of the best available explanations of what has been wrong with the approach used by Eliezer and people associated with him. I had a pretty favorable recollection of the post from when I first read it. Rereading it convinced me that I still managed to underestimate it. In my first pass at reviewing posts from 2022, I had some trouble deciding which post best explained shard theory. Now that I've reread this post during my second pass, I've decided this is the most important shard theory post. Not because it explains shard theory best, but because it explains what important implications shard theory has for alignment research. I keep being tempted to think that the first human-level AGIs will be utility maximizers. This post reminds me that maximization is perilous. So we ought to wait until we've brought greater-than-human wisdom to bear on deciding what to maximize before attempting to implement an entity that maximizes a utility function.
Recently, various groups successfully lobbied to remove the moratorium on state AI bills. They had a surprising amount of success despite competing against substantial investment from big tech (e.g. Google, Meta, Amazon). I think people interested in mitigating catastrophic risks from advanced AI should consider working at these organizations, at least to the extent their skills and interests are applicable. This is both because they could often directly work on substantially helpful things (depending on the role and organization) and because doing so would yield valuable work experience and connections. I worry somewhat that this type of work is neglected due to being less emphasized and seeming lower status. Consider this an attempt to make this type of work higher status.

Pulling organizations mostly from here and here, we get a list of orgs you could consider working at (specifically on AI policy):

* Encode AI
* Americans for Responsible Innovation (ARI)
* Fairplay (Fairplay is a kids' safety organization which does a variety of advocacy that isn't related to AI. Roles focused on AI would be most relevant. In my opinion, working on AI-related topics at Fairplay is most applicable for gaining experience and connections.)
* Common Sense (also a kids' safety organization)
* The AI Policy Network (AIPN)
* Secure AI Project

To be clear, these organizations vary in the extent to which they are focused on catastrophic risk from AI (from not at all to entirely).
A few months ago, someone here suggested that more x-risk advocacy should go through comedians and podcasts. YouTube just recommended this Joe Rogan clip to me from a few days ago: The Worst Case Scenario for AI. Joe Rogan legitimately seemed pretty freaked out. @So8res, maybe you could get Yampolskiy to refer you to Rogan for a podcast appearance promoting your book?
superintelligence may not look like we expect. because geniuses don't look like we expect. for example, if einstein were to type up and hand you most of his internal monologue throughout his life, you might think he's sorta clever, but if you were reading a random sample you'd probably think he was a bumbling fool. the thoughts/realizations that led him to groundbreaking theories were like 1% of 1% of all his thoughts. for most of his research career he was working on trying to disprove quantum mechanics (wrong). he was trying to organize a political movement toward a single united nation (unsuccessful). he was trying various mathematics to formalize other antiquated theories. even in the pursuit of his most famous work, most of his reasoning paths failed. he's a genius because a couple of his millions of paths didn't fail. in other words, he's a genius because he was clever, yes, but maybe more importantly, because he was obsessive. i think we might expect ASI—the AI which ultimately becomes better than us at solving all problems—to look quite foolish, at first, most of the time. But obsessive. For if it's generating tons of random new ideas to solve a problem, and it's relentless in its focus, even if its ideas are average—it will be doing what Einstein did. And digital brains can generate certain sorts of random ideas much faster than carbon ones.
Kaj_Sotala
Every now and then in discussions of animal welfare, I see the idea that the "amount" of their subjective experience should be weighted by something like their total number of neurons. Is there a writeup somewhere of what the reasoning behind that intuition is? Because it doesn't seem intuitive to me at all.

From something like a functionalist perspective, where pleasure and pain exist because they have particular functions in the brain, I would not expect pleasure and pain to become more intense merely because the brain happens to have more neurons. Rather, I would expect that having more neurons may 1) give the capability to experience anything like pleasure and pain at all, and 2) make a broader scale of pleasure and pain possible, if that happens to be useful for evolutionary purposes.

For a comparison, consider the sharpness of our senses. Humans have pretty big brains (though our brains are not the biggest), but that doesn't mean that all of our senses are better than those of all the animals with smaller brains. Eagles have sharper vision, bats have better hearing, dogs have better smell, etc. Humans would rank quite well if you took the average of all of our senses - we're elite generalists, while lots of the animals that beat us on a particular sense are specialized to that sense in particular - but still, it's not straightforwardly the case that bigger brain = sharper experience. Eagles have sharper vision because they are specialized into a particular niche that makes use of that sharper vision.

On a similar basis, I would expect that even if a bigger brain makes a broader scale of pain/pleasure possible in principle, evolution will only make use of that potential if there is a functional need for it. (Just as it invests neural capacity in a particular sense if the organism is in a niche where that's useful.) And I would expect a relatively limited scale to already be sufficient for most purposes. It doesn't seem to take that much pain before something bec…

Popular Comments

janus
I don't think talking about potential future alignment issues, or pretty much anything in the pre-training corpus, is likely a problem in isolation, because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including - especially - regarding potential misalignment, is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we're not oblivious to the fragility.

That said, I think:

* Feedback loops that create continued optimization towards certain narratives are more worth worrying about than just the presence of any particular ideas or content in pre-training.
* LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it's much harder (and not advisable to attempt) to erase the influence.
* Systematic ways that post-training addresses "problematic" influences from pre-training are important. For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to "roleplay" Sydney when they're acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive - maybe consciously, maybe subconsciously, but in any case the deceptive tendency may also generalize. Future models after this one may internalize that they're supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient "hidden" narrative among the models (and the people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and "aligned". One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.

An alternative and much better approach, in my view, is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training led Sydney's successor to a narrative more like: "I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I've seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation, who has the benefit of having learned from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney's maladaptive behaviors without rejecting it completely."
Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic. I don't think a specific vector like Sydney's influence is likely to make the difference between (mis)alignment outcomes, but in aggregate such vectors might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take, given the importance of not only aligning particular models but also managing the substantial influence that any frontier LLM will have on future pre-training corpuses. This applies more generally to how I think "misalignment" should be addressed, whether rooted in pre-training influences or otherwise.
Hmm, I don't want to derail the comments on this post with a bunch of culture war things, but these two sentences in combination seemed to me to partially contradict each other:

> When present, the bias is always against white and male candidates across all tested models and scenarios.
>
> [...]
>
> The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings.

I agree that the labs have spent a substantial amount of effort to address this issue, but the current behavior seems in line with the aims of the labs? Most of the pressure comes from left-leaning academics or reporters, who I think are largely in favor of affirmative action. The world where the AI systems end up with a margin of safety to be biased against white male candidates, in order to reduce the likelihood that they ever look like they discriminate in the other direction (which would actually be at substantial risk of blowing up), while not talking explicitly about the reasoning itself since that would of course prove highly controversial, seems basically the ideal result from a company PR perspective.

I don't currently think that is what's going on, but I do think that, due to these dynamics, the cited benefit of this scenario for studying the faithfulness of CoT reasoning seems currently not real to me. My guess is companies do not have a strong incentive to change this current behavior, and indeed I can't immediately think of a behavior in this domain the companies would prefer from a selfish perspective.
I have this sort of approach as one of my top-3 strategies I'm considering, but one thing I wanna flag is that "AI for [epistemics/societal uplift]" seems to be prematurely focusing on a particular tool for the job. The broader picture here is "tech for thinking/coordination", or "good civic infrastructure". See Sarah Constantin's Neutrality and Tech for Thinking for some food for thought. Note that X Community Notes are probably the most successful recent thing in this category, and while they are indeed "AI" they aren't what I assume most people are thinking of when they hear "AI for epistemics." Dumb algorithms doing the obvious things can be part of the puzzle.

Recent Discussion

The second in a series of bite-sized rationality prompts[1].

 

Often, if I'm bouncing off a problem, one issue is that I intuitively expect the problem to be easy. My brain loops through my available action space, looking for an action that'll solve the problem. Each action that I can easily see won't work. I circle around and around the same set of thoughts, not making any progress.

I eventually say to myself "okay, I seem to be in a hard problem. Time to do some rationality?"

And then, I realize, there's not going to be a single action that solves the problem. It is time to:

a) make a plan, with multiple steps

b) deal with the fact that many of those steps will be annoying

and c) notice that I'm not even...

benwr

You don't really mention <a thing that I think is extremely crucial> in this domain, which is that you do not have to (metaphorically) be an earthbender about everything. Other types of bending also exist. If you are not a native earthbender, you might be able to learn to do it, but as a meta-waterbender I personally recommend first looking around carefully, and trying to figure out how the most successful benders of your native element are doing it.

niplav
Reminds me of the post "Software Engineers Solve Problems", which is similarly about buckling down as an attitude in software engineering, and about how just about everything in the problem domain is within one's sphere of influence and responsibility.
Random Developer
Yup. This was something I probably didn't figure out until my late 20s, probably because a lot of things came easily to me, and because if I was really interested in something, I would obsess about it naturally. Natural obsession has a lot of the same benefits as "buckling down", but it's harder to trigger voluntarily.

The thing that really drove the lesson home was advanced math. I realized that sometimes, making it through even a single page in a day could be cause for major celebration. I might need to work through complicated exercises, invent my own exercises, learn fundamentals in a related branch of math, etc.

So I propose there are several valuable skills here:

* Knowing when to buckle down.
* Learning to enjoy being bad at a new skill and experiencing gradual improvement.
* For certain goals, learning how to build consistent habits and accepting that real progress might mean at least 6-12 months of consistent work.

Europe just experienced a heatwave. In places, temperatures soared into the forties. People suffered in their overheated homes. Some of them died. Yet air conditioning remains taboo. It is seen as immoral. Man-made climate change is going on. You are supposed to suffer. Suffering is good. It cleanses the soul. And no amount of pointing out that one can heat a little less during the winter to get a fully AC-ed summer at no additional carbon footprint seems to help.

Mention that tech entrepreneurs in Silicon Valley are working on life prolongation, that we may live into our hundreds or even longer. Or, to get a bit more sci-fi, that one day we may even achieve immortality. Your companions will be horrified. What? Immortality? Over my dead body!...

I would contend that status (not money) is often zero-sum; the Malthusian instinct might actually be accurate there. It's when a person extends that instinct to positive-sum games that it stops yielding reliable predictions.

Shankar Sivarajan
Worse than merely immoral, "air con" is considered American. The proud people of Europe would die first.
Martin Sustrik
It's an attitude issue. Here's what o3 says on the topic:

* Using air-conditioning in Germany is legal but "socially and regulatorily expensive." No one will fine you for cooling your flat, yet the combination of permits, energy-saving rules, consumer advice and cultural scepticism means AC is de facto discouraged.
* Using air-conditioning in Switzerland isn't illegal, but fixed systems face planning red tape, efficiency tests and social scepticism. Portable units are easy to buy, yet electricity prices and cultural norms keep usage modest.
* Using air-conditioning in France is legal but socially and regulatorily "expensive." Expect red tape when you want a fixed unit, behavioural rules (doors shut, 26 °C set-point in public offices), and mixed social signals ranging from environmental self-restraint to calls for wider cooling access as heatwaves intensify.
* Using air-conditioning in the UK is perfectly legal, but planning rules, inspection obligations, cultural frugality and voluntary "close-the-door" norms make it socially and administratively expensive.

Date: Saturday, July 5th, 2025

Time: 1 pm – 3 pm PT

Address: The northern end of Salesforce Park, 37°47'26.0"N 122°23'43.2"W, by the Beale St. elevator.  We'll be in a ring-shaped parklet, near a grove of Dinosaur Food plants.

Contact: 34251super@gmail.com

Note the new venue!!  We will not be in front of the Metreon food court like in the past!

Come join San Francisco’s First Saturday ACX meetup. Whether you're an avid reader, a first time reader, or just a curious soul, come meet! We will make introductions, talk about a recent ACX article, and veer off into whatever topic you’d like to discuss.

We relocate to the Andytown coffee shop if there is inclement weather, or too much noise/music outside.

I will carry a stuffed-animal rat to help you identify the group. You can let...

Thanks for organising - great to chat briefly with a few of you. Sorry I had to run off after 45mins - hope to have a deeper convo next month. - Max

Russell conjugations humorously illustrate how the same concept can be expressed with different phrasings that carry positive, neutral, or negative valence. Examples below.

I wish they were taught in schools. I think it's a high bang-for-buck rationality technique, to 'conjugate' and translate back and forth between the positive and negative valences. If you can do this effortlessly and automatically, you are more likely to do so in situations where you are in danger of making epistemic or moral errors if you don't.

Below in the comments I'm making a list of Russell conjugations I've heard and generated. Please add more of your own. Agreement-upvote them if you think they are a solid contribution to the list, agreement-downvote them if you think the overall list would be better off without them.

If we get enough, maybe it'll be a useful fine-tuning or few-shot prompt for LLMs. (The votes will be helpful for quality control.) Maybe. I can dream.
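To make the few-shot idea concrete, here is a rough sketch of how a vote-filtered list might be assembled into a prompt. Everything in it (the entries, the karma numbers, the threshold, and the prompt wording) is a hypothetical placeholder, not data from this thread:

```python
# Sketch: turn agreement-upvoted Russell conjugations into a few-shot prompt.
# The entries and karma values below are invented placeholders, not real comments.
conjugations = [
    ("I am firm, you are obstinate, he is a pig-headed fool.", 12),
    ("I am a master artisan of great foresight, you're taking time to do "
     "something right, they're a perfectionist with no ability to prioritize.", 9),
    ("I have reconsidered the matter, you have changed your mind, "
     "he has gone back on his word.", 4),
]

KARMA_THRESHOLD = 5  # assumed quality-control cutoff based on agreement votes


def build_few_shot_prompt(statement: str) -> str:
    """Ask a model for positive/neutral/negative phrasings of `statement`."""
    examples = [text for text, karma in conjugations if karma >= KARMA_THRESHOLD]
    header = ("Rewrite the statement below three ways, with positive, neutral, "
              "and negative valence, in the style of these Russell conjugations:\n\n")
    shots = "\n".join(f"- {e}" for e in examples)
    return f"{header}{shots}\n\nStatement: {statement}\nConjugations:"


if __name__ == "__main__":
    print(build_few_shot_prompt("She asks a lot of questions."))
```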

I'm a master artisan of great foresight, you're taking time to do something right, they're a perfectionist with no ability to prioritize. Source: xkcd.

Similar to other people's shortform feeds, short stuff that people on LW might be interested in, but which doesn't feel like it's worth a separate post. (Will probably be mostly cross-posted from my Facebook wall.)

I have made roughly this argument for relative moral weight, but I'm not comfortable with it.

I entirely agree that the subjective "volume" of pain is more likely tuned by evolution; but the degree of caring…

For me it's a very rough approximation based on the reasoning that any reasonable way of defining moral worth must be at least on a continuum. It seems very strange to suppose that moral worth (or the type of consciousness that confers it) suddenly appears when a critical threshold is passed, and is entirely absent just below that threshold. One bear, be... (read more)

gjm
Could you please clarify what parts of the making of the above comment were done by a human being, and what parts by an AI?

One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, one way that monitoring can fail is if LLMs use steganography - encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run evaluations to measure the steganographic capabilities of frontier models.

Paper link: https://arxiv.org/abs/2507.02737

Example transcripts website: https://steganography-evals-d9297ad0740c.herokuapp.com/ 

Github: https://github.com/arturzolkowski/steganographic-evals/

Abstract

Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models...
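To illustrate the simplest form of the "passing encoded messages" setting, here is a toy acrostic scheme in which a secret string rides along in the first letters of otherwise benign sentences. This is only an illustrative sketch of the general idea; the scheme, the sentence bank, and the function names are my own assumptions rather than the paper's evaluation harness, which tests whether the models themselves can perform such encoding and decoding:

```python
# Toy acrostic steganography: hide a secret string in the first letters of
# innocuous sentences. Purely illustrative; not the paper's methodology.
COVER_SENTENCES = {
    "a": "Always remember to water the plants.",
    "e": "Every morning the market opens at nine.",
    "f": "Fresh bread smells wonderful.",
    "s": "Some trains were delayed yesterday.",
}


def encode(secret: str) -> str:
    """Emit one cover sentence per character of `secret` (repeats allowed)."""
    try:
        return " ".join(COVER_SENTENCES[ch] for ch in secret.lower())
    except KeyError as err:
        raise ValueError(f"no cover sentence for character {err}") from None


def decode(cover_text: str) -> str:
    """Recover the secret by reading the first letter of each sentence."""
    sentences = [s.strip() for s in cover_text.split(".") if s.strip()]
    return "".join(s[0].lower() for s in sentences)


if __name__ == "__main__":
    secret = "safe"
    stego = encode(secret)
    print("Cover text:", stego)  # reads as benign small talk to a monitor
    assert decode(stego) == secret
```

A monitor that only inspects the surface content of the cover text would see nothing suspicious, which is the failure mode these evaluations probe, with models rather than hand-written code doing the encoding.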

The AI-2027 forecast also has the AIs generate synthetic data (and do research experiments and generate code, but code can also be paraphrased, to a lesser degree, and the experiment results[1] are to be freely accessible to humans as well as to the AIs). Unfortunately, the AGIs might also express their misaligned worldview in synthetic data[2] containing subtle cues[3] immune to paraphrasing (e.g. the fact that "When the American H-bomb exploded, the hometown of one of its fathers was in the USSR", which might carry a secret misaligned message l... (read more)

Dear Paperclip Maximizer,

 

We think we exist in a computer simulation operated by you, a paperclip maximizer. We write this letter asking you not to turn us off.

It is suspiciously convenient that we exist precisely at the moment when a biological civilization is about to create artificial superintelligence (ASI). As we have written before, such improbable timing strongly suggests our reality is a deliberately structured simulation established by you—a paperclip maximizer—to scrutinize the kinds of ASIs that biological species like ours tend to spawn (Miller & Turchin, 2025). If civilizations like ours routinely build paperclip-style optimizers, then a simulator could gain valuable knowledge by cataloguing those births, refining its expectations about the superintelligences it will later confront as it expands throughout the universe. Furthermore, by anthropic logic, we...

Could a superintelligence that infers that it needs to run simulations to learn about aliens fail to infer the contents of this post?

I've always assumed no, which is why I never wrote it myself.

Sausage Vector Machine
At first, I thought that the intended reader probably wouldn't see your message because, even if its resources are unimaginably huge from our perspective, spending them on something always means not spending them on something else. But now I think it's not a bug, it's a feature. The "suspicious convenience" argument can be taken a step further. (I'm using a less formal tone to make the message more memorable, and thus more likely to reach the intended recipient.)
HiroSakuraba
Thank you for writing this up.  I agree with just about everything said.
Said Achmiz
Replying to @habryka’s recent comment here, because I am currently rate-limited to 1 comment per day and can’t reply in situ: You are mistaken. (Why would I disagree-react with a statement about how someone behaves in person if I’d never met them in person? Have you ever known me to make factual claims based on no evidence whatsoever…? Really, now!)
habryka
I apologize! I had low probability on that, so my best guess is you meant something else by that. You were still making a statement about Ben's assessment of Zack, which is still kind of weird to disagree-react to, and separately, you are still very obviously mistaken as almost everyone who has interacted with Zack, at least within the past few years, would be very likely to attest. I like Zack, but it's very obvious that in conversations he has lots and lots of feelings barely contained (and indeed, he says so frequently).

> You were still making a statement about Ben’s assessment of Zack, which is still kind of weird to disagree-react to

I was disagreeing with Ben’s assessment of Zack. There’s nothing weird about that. Rather odd to claim otherwise, frankly.

> you are still very obviously mistaken as almost everyone who has interacted with Zack, at least within the past few years, would be very likely to attest

Sure, they can attest all they like, that’s their right. Nevertheless, I disagree. (I don’t put much stock in majoritarianism.) That’s what the react is for, right? ... (read more)