We often hear "We don't trade with ants" as an argument against AI cooperating with humans. But we don't trade with ants because we can't communicate with them, not because they're useless – ants could do many useful things for us if we could coordinate. AI will likely be able to communicate with us, and Katja questions whether this analogy holds.
A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which include things like "gain as much power as possible".
If it weren't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.
For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...
anything that outputs decisions implies a utility function
I think this is only true in a boring sense and isn't true in more natural senses. For example, in an MDP, it's not true that every policy maximises a non-constant utility function over states.
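One way to make the MDP claim concrete is a brute-force check on a tiny example. The sketch below is an illustration under stated assumptions, not necessarily the formalisation the commenter has in mind: a hypothetical two-state deterministic MDP, utility attached to the state currently occupied, discounting with `GAMMA = 0.9`, and "maximises" read as "is optimal from every state". The names `NEXT`, `OSCILLATE`, `HEAD_TO_A` and `rationalising_utilities` are made up for this example.

```python
from itertools import product

GAMMA = 0.9  # discount factor (assumed for illustration)
STATES = ("A", "B")
# Deterministic transitions: NEXT[state][action] -> next state
NEXT = {
    "A": {"stay": "A", "go": "B"},
    "B": {"stay": "B", "go": "A"},
}

OSCILLATE = {"A": "go", "B": "go"}    # bounce between A and B forever
HEAD_TO_A = {"A": "stay", "B": "go"}  # go to A and remain there


def policy_value(policy, utility, n_iter=2000):
    """Discounted value of following `policy`, by iterating its Bellman equation."""
    v = {s: 0.0 for s in STATES}
    for _ in range(n_iter):
        v = {s: utility[s] + GAMMA * v[NEXT[s][policy[s]]] for s in STATES}
    return v


def is_optimal(policy, utility, tol=1e-9):
    """A policy is optimal iff no one-step deviation improves on its own value."""
    v = policy_value(policy, utility)
    for s in STATES:
        best = max(utility[s] + GAMMA * v[NEXT[s][a]] for a in NEXT[s])
        if best > v[s] + tol:
            return False
    return True


def rationalising_utilities(policy, grid=range(-2, 3)):
    """All (u_A, u_B) pairs on the grid for which `policy` is optimal."""
    return [
        (ua, ub)
        for ua, ub in product(grid, repeat=2)
        if is_optimal(policy, {"A": ua, "B": ub})
    ]


if __name__ == "__main__":
    print("oscillate:", rationalising_utilities(OSCILLATE))
    print("head to A:", rationalising_utilities(HEAD_TO_A))
```

On this grid, the oscillating policy is optimal only when both states have the same utility, whereas the "head to A" policy is optimal whenever A is at least as good as B. The grid search is only a check on sampled values; the same conclusion follows from noticing that whichever state is strictly better, staying there beats leaving it.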
(Edit: Alas, EA has pulled out of the deal. Let April 1st 2025 mark some of the greatest hours in EA's history.)
Hey Everyone,
It is with a sense of... considerable cognitive dissonance that I am letting you all know about a significant development for the future trajectory of LessWrong. After extensive internal deliberation, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA.
I assure you, nothing about how LessWrong operates on a day-to-day level will change. I have always cared deeply about the robustness and integrity of our institutions, and I am fully aligned with our stakeholders at EA.
To be honest, the key...
Can you please send the new Fooming Shoggoth album to Spotify? I was really enjoying that music!
Edit: Ah, I see this question has been answered, but I'd like to note that I'm impressed by the AI music and I'm going to look into making some myself. Perhaps songs about cognitive biases could be a good way to learn them deep enough in your brain that you can avoid them in non-theoretical situations.
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.
This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)
In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:
Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.
One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you've listed (coordination, helping with communication, improving epistemics) intersect with this, but I don't believe that this exhausts the wisdom frame.
You write: "If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way... then absent large amounts of sustained capability restraint, it seems likely that we’ll qui...
(This is once again me taking old material from my personal blog and reposting it here with some revisions.)
Graham's hierarchy of disagreement is pretty well known and fairly useful, but I think the upper levels have some issues. (Are the lower levels necessarily the best they could be? Maybe not, but they also don't particularly matter.) I suggest a few changes.
First, and most minorly, let's remove DH6 "refuting the central point" as separate from "refutation". I think it should just always be implicit that whether you're counterarguing or refuting, it should be on the central point. If it's not, well... I guess we should add a second change -- let's add a DH3.5, above "contradiction" but below "counterargument", which is "arguing with...
Scott's own reaction to / improvement upon Graham's hierarchy of disagreement (which I just noticed you commented on back in the day, so I guess this is more for others' curiosity) is
...Graham’s hierarchy is useful for its intended purpose, but it isn’t really a hierarchy of disagreements. It’s a hierarchy of types of response, within a disagreement. Sometimes things are refutations of other people’s points, but the points should never have been made at all, and refuting them doesn’t help. Sometimes it’s unclear how the argument even connects to the sor
In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
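As a rough illustration of the two definitions, here is a toy sketch. Everything in it is hypothetical (the plans, the scores, the `disagreements` helper and its tolerance), and real model-based actor-critic systems do not score a fixed list of plans this way; the point is only to show which two comparisons the terms refer to.

```python
# Hypothetical toy setup: three candidate plans, each scored three ways.
PLANS = ["tidy the room", "hide the mess", "do nothing"]

# What the designers actually want (assumed for illustration).
intended = {"tidy the room": 1.0, "hide the mess": 0.0, "do nothing": 0.0}

# Ground-truth reward function, e.g. from a sensor that cannot see hidden mess.
reward = {"tidy the room": 1.0, "hide the mess": 1.0, "do nothing": 0.0}

# Learned value function (critic): its estimate of each plan's eventual reward.
value = {"tidy the room": 0.2, "hide the mess": 1.0, "do nothing": 0.0}


def disagreements(f, g, tol=0.5):
    """Plans on which two scoring functions differ by more than `tol`."""
    return [p for p in PLANS if abs(f[p] - g[p]) > tol]


# Outer alignment compares the reward function with what we want.
outer_failures = disagreements(reward, intended)
# Inner alignment compares the value function with the eventual reward.
inner_failures = disagreements(value, reward)

if __name__ == "__main__":
    print("outer misalignment on:", outer_failures)
    print("inner misalignment on:", inner_failures)
```

In this toy, the reward function is outer-misaligned on "hide the mess" (it is rewarded but unwanted), and the critic is inner-misaligned on "tidy the room" (it underestimates a plan that would in fact be rewarded).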
For some reason it took me until now to notice that:
Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks.
Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time.
However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that resolves all decision theory dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new decision theory: VDT.
Some common existing decision theories are:
I unironically love Table 2.
A shower thought I once had, intuition-pumped by MIRI's / Luke's old post on turning philosophy to math to engineering, was that if metaethicists really were serious about resolving their disputes they should contract a software engineer (or something) to help implement on GitHub a metaethics version of Table 2, where rows would be moral dilemmas like the trolley problem and columns ethical theories, and then accept that real-world engineering solutions tend to be "dirty" and inelegant remixes plus kludgy optimisations to ...
At PIBBSS, we’ve been thinking about how renormalization can be developed into a rich framework for AI interpretability. This document serves as a roadmap for this research agenda – which we are calling an Opportunity Space[1] for the AI safety community. In what follows, we explore the technical and philosophical significance of renormalization for physics and AI safety, problem areas in which it could be most useful, and some interesting existing directions – mainly from physics – that we are excited to place in direct contact with AI safety. This roadmap will also provide context for our forthcoming Call for Collaborations, during which we will hire affiliates to work on projects in this area.
Acknowledgements: While Lauren did the writing, this opportunity space was developed with the PIBBSS horizon scanning team, Dmitry Vaintrob...
The idea is interesting, but I'm somewhat skeptical that it'll pan out.
Lee Billings' book Five Billion Years of Solitude has the following poetic passage on deep time that's stuck with me ever since I read it in Paul Gilster's post:
...Deep time is something that even geologists and their generalist peers, the earth and planetary scientists, can never fully grow accustomed to.
The sight of a fossilized form, perhaps the outline of a trilobite, a leaf, or a saurian footfall can still send a shiver through their bones, or excavate a trembling hollow in the chest that breath cannot fill. They can measure celestial motions and l