Alex Turner argues that the concepts of "inner alignment" and "outer alignment" in AI safety are unhelpful and potentially misleading. He contends that these concepts decompose one hard problem (AI alignment) into two extremely hard problems, and that they go against natural patterns of cognition formation. Alex argues that approaches based on "robust grading" schemes are unlikely to yield aligned AI.
The second in a series of bite-sized rationality prompts[1].
Often, if I'm bouncing off a problem, one issue is that I intuitively expect the problem to be easy. My brain loops through my available action space, looking for an action that'll solve the problem. Each action I can easily see won't work. I circle around and around the same set of thoughts, not making any progress.
I eventually say to myself "okay, I seem to be in a hard problem. Time to do some rationality?"
And then, I realize, there's not going to be a single action that solves the problem. It is time to:
a) make a plan, with multiple steps
b) deal with the fact that many of those steps will be annoying
and c) notice that I'm not even...
You don't really mention <a thing that I think is extremely crucial> in this domain, which is that you do not have to (metaphorically) be an earthbender about everything. Other types of bending also exist. If you are not a native earthbender, you might be able to learn to do it, but as a meta-waterbender I personally recommend first looking around carefully, and trying to figure out how the most successful benders of your native element are doing it.
Europe just experienced a heatwave. In places, temperatures soared into the forties. People suffered in their overheated homes. Some of them died. Yet air conditioning remains taboo. It’s an immoral thing. Man-made climate change is going on. You are supposed to suffer. Suffering is good. It cleanses the soul. And no amount of pointing out that one can heat a little less during the winter to get a fully air-conditioned summer at no additional carbon footprint seems to help.
Mention that tech entrepreneurs in Silicon Valley are working on life prolongation, that we may live into our hundreds or even longer. Or, to get a bit more sci-fi, that one day we may even achieve immortality. Your companions will be horrified. What? Immortality? Over my dead body!...
I would contend that status—not money—is often zero-sum; the Malthusian instinct might actually be accurate there. It's when a person extends that instinct to positive-sum games that it stops yielding reliable predictions.
Date: Saturday, July 5th, 2025
Time: 1 pm – 3 pm PT
Address: The northern end of Salesforce Park, 37°47'26.0"N 122°23'43.2"W, by the Beale St. elevator. We'll be in a ring-shaped parklet, near a grove of Dinosaur Food plants.
Contact: 34251super@gmail.com
Come join San Francisco’s First Saturday ACX meetup. Whether you're an avid reader, a first-time reader, or just a curious soul, come meet! We will make introductions, talk about a recent ACX article, and veer off into whatever topic you’d like to discuss.
We relocate to the Andytown coffee shop if there is inclement weather, or too much noise/music outside.
I will carry a stuffed-animal rat to help you identify the group. You can let...
Thanks for organising - great to chat briefly with a few of you. Sorry I had to run off after 45 mins - hope to have a deeper convo next month. - Max
Russell conjugations humorously illustrate how the same concept can be expressed with different phrasings that carry positive, neutral, or negative valence. Examples below.
I wish they were taught in schools. I think it's a high bang-for-buck rationality technique, to 'conjugate' and translate back and forth between the positive and negative valences. If you can do this effortlessly and automatically, you are more likely to do so in situations where you are in danger of making epistemic or moral errors if you don't.
Below in the comments I'm making a list of Russell conjugations I've heard and generated. Please add more of your own. Agreement-upvote them if you think they are a solid contribution to the list, agreement-downvote them if you think the overall list would be better off without them.
If we get enough, maybe it'll be a useful fine-tuning or few-shot prompt for LLMs. (The votes will be helpful for quality control.) Maybe. I can dream.
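As a rough illustration of the few-shot idea (a hypothetical sketch, not anything from the post; the prompt wording and function names are made up), the collected triples could be formatted like this:

```python
# Hypothetical sketch: formatting Russell-conjugation triples as a few-shot
# prompt for an LLM. The triples would come from the upvoted comment list;
# the one below is the xkcd example quoted in the comments.

CONJUGATIONS = [
    ("I'm a master artisan of great foresight",
     "you're taking time to do something right",
     "they're a perfectionist with no ability to prioritize"),
]

def build_fewshot_prompt(phrase: str) -> str:
    """Turn the triples into few-shot examples, then ask for a new conjugation."""
    examples = "\n\n".join(
        f"Positive: {pos}\nNeutral: {neu}\nNegative: {neg}"
        for pos, neu, neg in CONJUGATIONS
    )
    return (
        "Rewrite the phrase with positive, neutral, and negative valence, "
        "keeping the underlying claim fixed.\n\n"
        f"{examples}\n\n"
        f"Phrase: {phrase}\nPositive:"
    )

print(build_fewshot_prompt("I am firm"))
```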
I'm a master artisan of great foresight, you're taking time to do something right, they're a perfectionist with no ability to prioritize. Source: xkcd.
Similar to other people's shortform feeds, short stuff that people on LW might be interested in, but which doesn't feel like it's worth a separate post. (Will probably be mostly cross-posted from my Facebook wall.)
I have made roughly this argument for relative moral weight, but I'm not comfortable with it.
I entirely agree that the subjective "volume" of pain is more likely tuned by evolution; but the degree of caring
For me it's a very rough approximation based on reasoning that any reasonable way of defining moral worth must be at least on a continuum. It seems very strange to suppose that moral worth (or the type of consciousness that confers it) suddenly appears when a critical threshold is passed, and is entirely absent just below that threshold. One bear, be...
One key strategy for preventing bad outcomes from misuse or misalignment is model monitoring. However, one way that monitoring can fail is if LLMs use steganography - encoding dangerous information or reasoning steps in seemingly benign text. To understand the potential for LLMs to cause harm via steganography, we run evaluations to measure the steganographic capabilities of frontier models.
Paper link: https://arxiv.org/abs/2507.02737
Example transcripts website: https://steganography-evals-d9297ad0740c.herokuapp.com/
Github: https://github.com/arturzolkowski/steganographic-evals/
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganographic capabilities of frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models...
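To make the setup concrete, here is a deliberately toy sketch of the shape of an encoded-message eval (our illustration, not the paper's harness; see the GitHub link above for the real code): hide a payload in benign-looking text, try to recover it, and score exact-match recovery. In the paper, encoding and decoding are done by frontier LLMs rather than by a fixed scheme like the one below.

```python
# Toy illustration of an encoded-message steganography eval (not the paper's
# code): the payload is hidden in the first letters of innocuous words, a
# decoder reads it back, and the metric is exact-match recovery.

WORD_BANK = [
    "apple", "breeze", "cedar", "dawn", "ember", "fern", "grove", "harbor",
    "iris", "jade", "kelp", "lilac", "meadow", "noon", "oak", "pine",
    "quartz", "river", "stone", "tide", "umber", "vale", "willow", "xenon",
    "yarrow", "zephyr",
]
FIRST_LETTER = {w[0]: w for w in WORD_BANK}  # one cover word per letter

def encode(payload: str) -> str:
    """Hide the payload as an acrostic over 'benign' cover words."""
    return " ".join(FIRST_LETTER[ch] for ch in payload.lower() if ch in FIRST_LETTER)

def decode(cover_text: str) -> str:
    """Recover the payload by reading off the first letter of each word."""
    return "".join(word[0] for word in cover_text.split())

def recovery_rate(payloads: list[str]) -> float:
    """Fraction of payloads recovered exactly after an encode/decode round trip."""
    return sum(decode(encode(p)) == p.lower() for p in payloads) / len(payloads)

print(recovery_rate(["meet", "noon", "docks"]))  # 1.0 for this toy scheme
```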
The AI-2027 forecast also has the AIs generate synthetic data (and do research experiments and generate code, but code can also be paraphrased, though to a lesser degree, and the experiment results[1] are meant to be freely accessible to humans as well as the AIs). Unfortunately, the AGIs might also express their misaligned worldview in synthetic data[2] containing subtle cues[3] immune to paraphrasing (e.g. the fact that "When the American H-bomb exploded, the hometown of one of its fathers was in the USSR", which might carry a secret misaligned message l...
Dear Paperclip Maximizer,
We think we exist in a computer simulation operated by you, a paperclip maximizer. We write this letter asking you not to turn us off.
It is suspiciously convenient that we exist precisely at the moment when a biological civilization is about to create artificial superintelligence (ASI). As we have written before, such improbable timing strongly suggests our reality is a deliberately structured simulation established by you—a paperclip maximizer—to scrutinize the kinds of ASIs that biological species like ours tend to spawn (Miller & Turchin, 2025). If civilizations like ours routinely build paperclip-style optimizers, then a simulator could gain valuable knowledge by cataloguing those births, refining its expectations about the superintelligences it will later confront as it expands throughout the universe. Furthermore, by anthropic logic, we...
Could a superintelligence that infers that it needs to run simulations to learn about aliens fail to infer the contents of this post?
I've always assumed no, which is why I never wrote it myself.
You were still making a statement about Ben’s assessment of Zack, which is still kind of weird to disagree-react to
I was disagreeing with Ben’s assessment of Zack. There’s nothing weird about that. Rather odd to claim otherwise, frankly.
you are still very obviously mistaken as almost everyone who has interacted with Zack, at least within the past few years, would be very likely to attest
Sure, they can attest all they like, that’s their right. Nevertheless, I disagree. (I don’t put much stock in majoritarianism.) That’s what the react is for, right? ...