I was surprised to read the delta propagating to so many different parts of your worldviews (organizations, goods, markets, etc), and that makes me think that it'd be relatively easier to ask questions today that have quite different answers under your worldviews. The air conditioner one seems like one, but it seems like we could have many more, and some that are even easier than that. Plausibly you know of some because you're quite confident in your position; if so, I'd be interested to hear about them[1].
At a meta level, I find it pretty funny that so many smart people seem to disagree on the question of whether questions usually have easily verifiable answers.
I realize that part of your position is that this is just really hard to actually verify, but as in the example of objects in your room it feels like there should be examples where this is feasible with moderate amounts of effort. Of course, a lack of consensus on whether something is actually bad if you dive in further could also be evidence for hardness of verification, even if it'd be less clean.
Yeah, I think this is very testable, it's just very costly to test - partly because it requires doing deep dives on a lot of different stuff, and partly because it's the sort of model which makes weak claims about lots of things rather than very precise claims about a few things.
I think it depends on which domain you're delegating in. E.g. physical objects, especially complex systems like an AC unit, are plausibly much harder to validate than a mathematical proof.
In that vein, I wonder if requiring the AI to construct a validation proof would be feasible for alignment delegation? In that case, I'd expect us to find more use and safety from [ETA: delegation of] theoretical work than empirical.
First, I want to flag that I really appreciate how you're making these deltas clear and (fairly) simple.
I like this, though I feel like there's probably a great deal more clarity/precision to be had here (as is often the case).
Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.
I'm not sure what "bad" means exactly. Do you basically mean, "if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved?" If so, I assume we'd all agree that this is true for some amount R, the key question is what that amount is.
I'd also flag that you draw attention to the issue with air conditioners. But for personal items, I'd argue that when I learn more about popular items, most of what I learn is positive things I didn't realize. Like with Chesterton's fence - when I get many well-reviewed or popular items, my impression is generally that there were many clever ideas o...
Yes, though I would guess my probability on P = NP is relatively high compared to most people reading this. I'm around 10-15% on P = NP.
People who’ve spent a lot of time thinking about P vs NP often have the intuition that “verification is easier than generation”. [...]
The problem is, this intuition comes from thinking about problems which are in NP. NP is, roughly speaking, the class of algorithmic problems for which solutions are easy to verify. [...]
I think a more accurate takeaway would be that among problems in NP, verification is easier than generation. In other words, among problems for which verification is easy, verification is easier than generation. Rather a less impressive claim, when you put it like that.
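As a concrete toy example of that narrower claim (a sketch of mine, not from the original discussion): for a problem in NP such as SAT, checking a proposed assignment is a quick linear scan over the clauses, while generating one by brute force may require trying all 2^n assignments.

```python
from itertools import product

# Toy CNF encoding: a formula is a list of clauses; each literal is (variable_index, is_positive).
# Example formula: (x0 OR NOT x1) AND (x1 OR x2)
formula = [[(0, True), (1, False)], [(1, True), (2, True)]]

def verify(formula, assignment):
    """Verification: one pass over the clauses -- cheap."""
    return all(
        any(assignment[var] == positive for var, positive in clause)
        for clause in formula
    )

def generate(formula, n_vars):
    """Generation by brute force: up to 2**n_vars candidates -- expensive in general."""
    for bits in product([False, True], repeat=n_vars):
        if verify(formula, list(bits)):
            return list(bits)
    return None  # unsatisfiable

print(verify(formula, [True, False, True]))  # fast check of a proposed solution -> True
print(generate(formula, 3))                  # exponential-time search for a solution
```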
Most real-world problems are outside of NP. Let's go through some examples...
Suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values). Can I easily write down a boolean circuit (possibly with some inputs from data on fridges) which is satisfiable if-and-only-if this fridge in particular is in fact the best option for me according to my own long-term values? No, I have no idea how to write such a boolean circuit at all. Heck, even if my boolean circuit could internally use a quantum-level simulation of me, I'd still have no idea how to do it, because neither my stated values nor my revealed preferences are identical to my own long-term values. So that problem is decidedly not in NP.
(Variant of that problem: suppose an AI hands me a purported mathematical proof that this fridge in particular is the best option for me according to my own long-term values. Can I verify the proof's correctness? Again, no, I have no idea how to do that, I don't understand my own values well enough to distinguish a proof which makes correct assumptions about my values from one which makes incorrect assumptions.)
A quite different example ...
I feel like a lot of the difficulty here is a punning of the word "problem."
In complexity theory, when we talk about "problems", we generally refer to a formal mathematical question that can be posed as a computational task. Maybe in these kinds of discussions we should start calling these problems_C (for "complexity"). There are plenty of problems_C that are (almost definitely) not in NP, like #SAT ("count the number of satisfying assignments of this Boolean formula"), and it's generally believed that verification is hard for these problems. A problem_C like #SAT that is (believed to be) in #P but not NP will often have a short easy-to-understand algorithm that will be very slow ("try every assignment and count up the ones that satisfy the formula").
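For concreteness, here is a minimal sketch (my own, not the commenter's) of that short-but-slow algorithm for #SAT: the code is easy to state and easy to check line by line, but the loop runs through all 2^n assignments.

```python
from itertools import product

def count_satisfying(formula, n_vars):
    """Brute-force #SAT: try every assignment and count the ones that satisfy the formula.
    Easy to write down and understand, but the loop body runs 2**n_vars times."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[var] == positive for var, positive in clause) for clause in formula):
            count += 1
    return count

# Toy CNF encoding: each literal is (variable_index, is_positive); the formula is (x0 OR NOT x1) AND (x1 OR x2).
print(count_satisfying([[(0, True), (1, False)], [(1, True), (2, True)]], 3))  # prints 4
```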
On the other hand, "suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)" is a very different sort of beast. I agree it's not in NP in that I can't easily verify a solution, but the issue is that it's not a problem_C, rather than it being a problem_C that's (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task usin...
Curated!
This post is strong in the rationalist virtue of simplicity. There is a large body of quite different research and strategic analysis of the AI x-risk situation between Wentworth and Christiano, and yet this post claims (I think fairly accurately) that much of it can be well captured in one key worldview-difference. The post does a good job of showing how this difference appears in many situations/cases (e.g. the air conditioning unit, large bureaucracies, outsourcing alignment, etc).
I encourage someone who takes the opposing side of this position from John (e.g. someone at the Alignment Research Center) to provide a response, as to whether they think this characterization is accurate (and if yes, why they disagree).
I don't think this characterization is accurate at all, but don't think I can explain the disagreement well enough for it to be productive.
I think both that:
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense of the argument John says he is making in the post:
- the motte: there exist hard to verify properties
- the bailey: all/most important properties are hard to verify
I don't think I am trying to claim that bailey at all. For purposes of AI risk, if there is even just one single property of a given system which is both (a) necessary for us to not die to that system, and (b) hard to verify, then difficulty of verification is a blocking issue for outsourcing alignment of that system.
Standard candidates for such properties include:
The claim is that verification is easier than generation. This post considers a completely different claim, that "verification is easy", e.g.
How does the ease-of-verification delta propagate to AI?
if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.
if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit
I just don't care much if the refrigerator or keyboard or tupperware or whatever might be bad in non-obvious ways that we failed to verify, unless you also argue that it would be easier to create better versions from scratch than to notice the flaws.
Now to be fair, maybe Paul and I are just fooling ourselves, and really all of our intuitions come from "verification is easy", which John gestures at:
He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.
But I don't think "verification is easy" matters much to my views. Re: the three things you mention:
I disagree with this curation because I don't think this post will stand the test of time. While Wentworth's delta to Yudkowsky has a legible takeaway -- ease of ontology translation -- that is tied to his research on natural latents, it is less clear what John means here and what to take away. Simplicity is not a virtue when the issue is complex and you fail to actually simplify it.
I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch.
Very strong claim which the post doesn't provide nearly enough evidence to support
This post uses "I can identify ways in which chairs are bad" as an example. But it's easier for me to verify that I can sit in a chair and that it's comfortable than to make a chair myself. So I don't really know why this is a good example for "verification is easier than generation".
More examples:
If the goal of this post is to discuss the crux https://www.lesswrong.com/posts/fYf9JAwa6BYMt8GBj/link-a-minimal-viable-product-for-alignment?commentId=mPgnTZYSRNJDwmr64:
evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it
then I think there is a large disconnect between the post above, which posits that for this claim to be false there has to be some "deep" sense in which delegation is viable, and the more mundane sense in which I think this crux is obviously false: all humans interface with the world and optimize over the products other people create, and are therefore more capable than they would be if they had to make all products for themselves from scratch.
I assumed John was pointing at verifying that, say, the chemicals used in the production of the chair might have some really bad impact on the environment, start causing a problem with the food-chain ecosystem, and make food much scarcer for everyone -- including the person who bought the chair -- in the meaningfully near future.
What I had in mind is more like: many times over the years I've been sitting at a desk and noticed my neck getting sore. Then when I move around a bit, I realize that the chair/desk/screen are positioned such that my neck is at an awkward angle when looking at the screen, which makes my neck sore when I hold that angle for a long time. The mispositioning isn't very salient; I just reflexively adjust my neck to look at the screen and don't notice that it's at an awkward angle. Then later my neck hurts, and it's nonobvious and takes some examination to figure out why my neck hurts.
That sort of thing, I claim, generalizes to most "ergonomics". Chairs, keyboards, desks, mice... these are all often awkward in ways which make us uncomfortable when using them for a long time. But the awkwardness isn't very salient or obvious (for most people), because we just automatically adjust position to handle it, and the discomfort only comes much later from holding that awkward position for a long time.
I mean, sure, for any given X there will be some desirable properties of X which are easy to verify, and it's usually pretty easy to outsource the creation of an X which satisfies the easy-to-verify properties. The problem is that the easy-to-verify properties do not typically include all the properties which are important to us. Ergonomics is a very typical example.
Extending to AI: sure, there will be some desirable properties of AI which are easy to verify, or properties of alignment research which are easy to verify, or properties of plans which are easy to verify, etc. And it will be easy to outsource the creation of AI/research/plans which satisfy those easy-to-verify properties. Alas, the easy-to-verify properties do not include all the properties which are important to us, or even all the properties needed to not die.
I'm curious what you think of Paul's points (2) and (3) here:
...
- Eliezer often talks about AI systems that are able to easily build nanotech and overpower humans decisively, and describes a vision of a rapidly unfolding doom from a single failure. This is what would happen if you were magically given an extraordinarily powerful AI and then failed to align it, but I think it’s very unlikely to be what happens in the real world. By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans...
“keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse"
But, I think the negative impacts that these goods have on you are (mostly) realized on longer timescales - say, years to decades. If you’re using a chair that is bad for your postu...
Many research tasks have very long delays until they can be verified. The history of technology is littered with apparently good ideas that turned out to be losers after huge development efforts were poured into them. Supersonic transport, zeppelins, silicon-on-sapphire integrated circuits, pigeon-guided bombs, object-oriented operating systems, hydrogenated vegetable oil, oxidative decoupling for weight loss…
Finding out that these were bad required making them, releasing them to the market, and watching unrecognized problems torpedo them. Sometimes it took decades.
In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve.
This does not agree with my understanding of what HCH is at all. HCH is a definition of an abstract process for thought experiments, much like AIXI is. It's defined as the fixed point of some iterative process of delegation expanding out into a tree. It's als...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.
To my eyes the most likely outcome is a situation where you are both right.
Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification w...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I've been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I've long since updated on, not so much figuring out new stuff.
In terms of the hard-to-verify aspect: while it's true that any one person will face any number of challenges, do we live in a world where one person does anything on their own?
How would the open-source model influence outcomes? When pretty much anyone can take a look, and presumably many do, does the level of verification, or ease of verification, improve in your model?
Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.
Crucially, this is true only because you're relatively smart for a human: smarter than many of the engineers that designed those objects, and smarter than most or all of the committee-of-engineers that designed those objects. You can come up with better soluti...
It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad.
Is there an opposite of the "failure of ease of verification," such that the two categories would add up to 100% if you sorted the whole of reality into one or the other? Say, in a simulation, if you attributed every piece of computation to one of these two categories, how much of the world could be "explained by" each?
...That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me
Preamble: Delta vs Crux
This section is redundant if you already read My AI Model Delta Compared To Yudkowsky.
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
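As a toy illustration of that picture (my own sketch, with made-up numbers): two copies of the same simple "world program" that differ in only one parameter can end up with quite different values for everything downstream.

```python
def world_model(rain_rate):
    """A deliberately oversimplified 'program' for a small piece of the world.
    All downstream variables are computed by the same code; only rain_rate differs."""
    p_cloudy = min(1.0, rain_rate * 1.5)           # belief about today's weather
    p_carry_umbrella = 0.9 * p_cloudy              # belief about people's behavior
    expected_crop_yield = 1.0 + 2.0 * rain_rate    # belief about a downstream outcome
    return {"p_cloudy": p_cloudy,
            "p_carry_umbrella": p_carry_umbrella,
            "expected_crop_yield": expected_crop_yield}

my_beliefs = world_model(rain_rate=0.5)     # my value for the parameter
your_beliefs = world_model(rain_rate=0.05)  # your value: the single "small" difference

# Same program, one differing parameter, noticeably different beliefs about many variables:
print(my_beliefs)
print(your_beliefs)
```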
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
This post is about my current best guesses at the delta between my AI models and Paul Christiano's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models mostly look like Paul’s as far as I can tell. That said, note that this is not an attempt to pass Paul's Intellectual Turing Test; I'll still be using my own usual frames.
My AI Model Delta Compared To Christiano
Best guess: Paul thinks that verifying solutions to problems is generally “easy” in some sense. He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.
What do my models look like if I propagate that delta? Well, it implies that delegation is fundamentally viable in some deep, general sense.
That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better. But because the badness is nonobvious/nonsalient, it doesn’t influence my decision-to-buy, and therefore companies producing the good are incentivized not to spend the effort to make it better. It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad. (For a more game-theoretic angle, see When Hindsight Isn’t 20/20.)
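To make the incentive story concrete, here is a toy model (my own numbers and framing, not the post's): if the purchase decision responds only to quality the buyer can verify up front, a profit-maximizing producer puts just enough effort into the verifiable part to make the sale and none into the hidden part, even if the buyer would gladly have paid for it.

```python
def buyer_buys(salient_quality):
    """The purchase decision can only depend on what the buyer can verify before buying."""
    return salient_quality >= 0.7

def producer_profit(salient_effort, hidden_effort, price=100, cost_per_unit_effort=20):
    # hidden_effort improves the product, but never enters the buyer's decision,
    # so it affects profit only through cost.
    revenue = price if buyer_buys(salient_effort) else 0
    cost = cost_per_unit_effort * (salient_effort + hidden_effort)
    return revenue - cost

# The producer searches over effort allocations and keeps the most profitable one.
candidates = [(s / 10, h / 10) for s in range(11) for h in range(11)]
best = max(candidates, key=lambda efforts: producer_profit(*efforts))
print(best)  # (0.7, 0.0): just enough verifiable quality to sell, zero hidden quality
```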
On (my model of) Paul’s worldview, that sort of thing is rare; at most it’s the exception to the rule. On my worldview, it’s the norm for most goods most of the time. See e.g. the whole air conditioner episode for us debating the badness of single-hose portable air conditioners specifically, along with a large sidebar on the badness of portable air conditioner energy ratings.
How does the ease-of-verification delta propagate to AI?
Well, most obviously, Paul expects AI to go well mostly via humanity delegating alignment work to AI. On my models, the delegator’s incompetence is a major bottleneck to delegation going well in practice, and that will extend to delegation of alignment to AI: humans won’t get what we want by delegating because we don’t even understand what we want or know what to pay attention to. The outsourced alignment work ends up bad in nonobvious/nonsalient (but ultimately important) ways for the same reasons as most goods in my house. But if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.
Then we can go even more extreme: HCH, aka “the infinite bureaucracy”, a model Paul developed a few years ago. In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve. On my models, HCH adds recursion to the universal pernicious difficulties of delegation, and my main response is to run away screaming. But on Paul’s models, delegation is fundamentally viable, so why not delegate recursively?
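A minimal sketch of the recursive-delegation shape being described (a toy stand-in of mine, not Paul's actual HCH formalism): each node does a little work by splitting its question, hands the pieces to sub-agents, and recursion bottoms out at subquestions small enough to answer directly.

```python
def hch_style_answer(question_tokens, depth=0, max_depth=3):
    """Toy recursive delegation: do a little work, then delegate subquestions downward.
    'Questions' are just lists of tokens here, and 'answering' is a placeholder."""
    if depth >= max_depth or len(question_tokens) <= 2:
        return f"direct answer to {question_tokens!r}"       # leaf: small enough to solve
    mid = len(question_tokens) // 2
    subquestions = [question_tokens[:mid], question_tokens[mid:]]  # the "little work": split
    subanswers = [hch_style_answer(sub, depth + 1, max_depth) for sub in subquestions]
    return f"combination of {subanswers}"                    # aggregate delegated results

print(hch_style_answer(["how", "do", "we", "align", "powerful", "AI", "systems"]))
```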
(Also note that HCH is a simplified model of a large bureaucracy, and I expect my views and Paul’s differ in much the same way when thinking about large organizations in general. I mostly agree with Zvi’s models of large organizations, which can be lossily-but-accurately summarized as “don’t”. Paul, I would guess, expects that large organizations are mostly reasonably efficient and reasonably aligned with their stakeholders/customers, as opposed to universally deeply dysfunctional.)
Propagating further out: under my models, the difficulty of verification accounts for most of the generalized market inefficiency in our world. (I see this as one way of framing Inadequate Equilibria.) So if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit. That, in turn, has a huge effect on timelines. Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research). That’s the sort of thing which leads to a relatively discontinuous takeoff. Paul, on the other hand, expects a relatively smooth takeoff - which makes sense, in a world where there’s not a lot of low-hanging fruit in the software/algorithms because it’s easy for users to notice when the libraries they’re using are trash.
That accounts for most of the known-to-me places where my models differ from Paul’s. I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch. (I do still put significantly-nonzero probability on successful outsourcing of most alignment work to AI, but it’s not the sort of thing I expect to usually work.)