I was surprised to read the delta propagating to so many different parts of your worldviews (organizations, goods, markets, etc), and that makes me think that it'd be relatively easier to ask questions today that have quite different answers under your worldviews. The air conditioner one seems like one, but it seems like we could have many more, and some that are even easier than that. Plausibly you know of some because you're quite confident in your position; if so, I'd be interested to hear about them[1].
At a meta level, I find it pretty funny that so many smart people seem to disagree on the question of whether questions usually have easily verifiable answers.
I realize that part of your position is that this is just really hard to actually verify, but as in the example of objects in your room it feels like there should be examples where this is feasible with moderate amounts of effort. Of course, a lack of consensus on whether something is actually bad if you dive in further could also be evidence for hardness of verification, even if it'd be less clean.
Yeah, I think this is very testable, it's just very costly to test - partly because it requires doing deep dives on a lot of different stuff, and partly because it's the sort of model which makes weak claims about lots of things rather than very precise claims about a few things.
I think it depends on which domain you're delegating in. E.g. physical objects, especially complex systems like an AC unit, are plausibly much harder to validate than a mathematical proof.
In that vein, I wonder if requiring the AI to construct a validation proof would be feasible for alignment delegation? In that case, I'd expect us to find more use and safety from [ETA: delegation of] theoretical work than empirical.
First, I want to flag that I really appreciate how you're making these deltas clear and (fairly) simple.
I like this, though I feel like there's probably a great deal more clarity/precision to be had here (as is often the case).
Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.
I'm not sure what "bad" means exactly. Do you basically mean, "if I were to spend resources R evaluating this object, I could identify some ways for it to be significantly improved?" If so, I assume we'd all agree that this is true for some amount R, the key question is what that amount is.
I'd also flag that you draw attention to the issue with air conditioners. But for personal items, I'd argue that when I learn more about popular items, most of what I learn is positive things I didn't realize. Like with Chesterton's fence - when I get many well-reviewed or popular items, my impression is generally that there were many clever ideas o...
Yes, though I would guess my probability on P = NP is relatively high compared to most people reading this. I'm around 10-15% on P = NP.
People who’ve spent a lot of time thinking about P vs NP often have the intuition that “verification is easier than generation”. [...]
The problem is, this intuition comes from thinking about problems which are in NP. NP is, roughly speaking, the class of algorithmic problems for which solutions are easy to verify. [...]
I think a more accurate takeaway would be that among problems in NP, verification is easier than generation. In other words, among problems for which verification is easy, verification is easier than generation. Rather a less impressive claim, when you put it like that.
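As a concrete toy example of that narrower claim (a sketch of mine, not from the original discussion): for a problem in NP such as SAT, checking a proposed assignment is a quick linear scan over the clauses, while generating one by brute force may require trying all 2^n assignments.

```python
from itertools import product

# Toy CNF encoding: a formula is a list of clauses; each literal is (variable_index, is_positive).
# Example formula: (x0 OR NOT x1) AND (x1 OR x2)
formula = [[(0, True), (1, False)], [(1, True), (2, True)]]

def verify(formula, assignment):
    """Verification: one pass over the clauses -- cheap."""
    return all(
        any(assignment[var] == positive for var, positive in clause)
        for clause in formula
    )

def generate(formula, n_vars):
    """Generation by brute force: up to 2**n_vars candidates -- expensive in general."""
    for bits in product([False, True], repeat=n_vars):
        if verify(formula, list(bits)):
            return list(bits)
    return None  # unsatisfiable

print(verify(formula, [True, False, True]))  # fast check of a proposed solution -> True
print(generate(formula, 3))                  # exponential-time search for a solution
```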
Most real-world problems are outside of NP. Let's go through some examples...
Suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values). Can I easily write down a boolean circuit (possibly with some inputs from data on fridges) which is satisfiable if-and-only-if this fridge in particular is in fact the best option for me according to my own long-term values? No, I have no idea how to write such a boolean circuit at all. Heck, even if my boolean circuit could internally use a quantum-level simulation of me, I'd still have no idea how to do it, because neither my stated values nor my revealed preferences are identical to my own long-term values. So that problem is decidedly not in NP.
(Variant of that problem: suppose an AI hands me a purported mathematical proof that this fridge in particular is the best option for me according to my own long-term values. Can I verify the proof's correctness? Again, no, I have no idea how to do that, I don't understand my own values well enough to distinguish a proof which makes correct assumptions about my values from one which makes incorrect assumptions.)
A quite different example ...
I feel like a lot of the difficulty here is a punning of the word "problem."
In complexity theory, when we talk about "problems", we generally refer to a formal mathematical question that can be posed as a computational task. Maybe in these kinds of discussions we should start calling these problems_C (for "complexity"). There are plenty of problems_C that are (almost definitely) not in NP, like #SAT ("count the number of satisfying assignments of this Boolean formula"), and it's generally believed that verification is hard for these problems. A problem_C like #SAT that is (believed to be) in #P but not NP will often have a short easy-to-understand algorithm that will be very slow ("try every assignment and count up the ones that satisfy the formula").
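For concreteness, here is a minimal sketch (my own, not the commenter's) of that short-but-slow algorithm for #SAT: the code is easy to state and easy to check line by line, but the loop runs through all 2^n assignments.

```python
from itertools import product

def count_satisfying(formula, n_vars):
    """Brute-force #SAT: try every assignment and count the ones that satisfy the formula.
    Easy to write down and understand, but the loop body runs 2**n_vars times."""
    count = 0
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[var] == positive for var, positive in clause) for clause in formula):
            count += 1
    return count

# Toy CNF encoding: each literal is (variable_index, is_positive); the formula is (x0 OR NOT x1) AND (x1 OR x2).
print(count_satisfying([[(0, True), (1, False)], [(1, True), (2, True)]], 3))  # prints 4
```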
On the other hand, "suppose I am shopping for a new fridge, and I want to know which option is best for me (according to my own long-term values)" is a very different sort of beast. I agree it's not in NP in that I can't easily verify a solution, but the issue is that it's not a problem_C, rather than it being a problem_C that's (almost definitely) not in NP. With #SAT, I can easily describe how to solve the task usin...
Curated!
This post is strong in the rationalist virtue of simplicity. There is a large body of quite different research and strategic analysis of the AI x-risk situation between Wentworth and Christiano, and yet this post claims (I think fairly accurately) that much of it can be well captured in one key worldview-difference. The post does a good job of showing how this difference appears in many situations/cases (e.g. the air conditioning unit, large bureaucracies, outsourcing alignment, etc).
I encourage someone who takes the opposing side of this position from John (e.g. someone at the Alignment Research Center) to provide a response, as to whether they think this characterization is accurate (and if yes, why they disagree).
I don't think this characterization is accurate at all, but don't think I can explain the disagreement well enough for it to be productive.
I think both that:
I also think that this post is pulling a bit of a motte-and-bailey, although not really in the sense of the argument John says he is making in the post:
- the motte: there exist hard to verify properties
- the bailey: all/most important properties are hard to verify
I don't think I am trying to claim that bailey at all. For purposes of AI risk, if there is even just one single property of a given system which is both (a) necessary for us to not die to that system, and (b) hard to verify, then difficulty of verification is a blocking issue for outsourcing alignment of that system.
Standard candidates for such properties include:
The claim is that verification is easier than generation. This post considers a completely different claim, that "verification is easy", e.g.
How does the ease-of-verification delta propagate to AI?
if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.
if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit
I just don't care much if the refrigerator or keyboard or tupperware or whatever might be bad in non-obvious ways that we failed to verify, unless you also argue that it would be easier to create better versions from scratch than to notice the flaws.
Now to be fair, maybe Paul and I are just fooling ourselves, and really all of our intuitions come from "verification is easy", which John gestures at:
He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.
But I don't think "verification is easy" matters much to my views. Re: the three things you mention:
I disagree with this curation because I don't think this post will stand the test of time. While Wentworth's delta to Yudkowsky has a legible takeaway -- ease of ontology translation -- that is tied to his research on natural latents, it is less clear what John means here and what to take away. Simplicity is not a virtue when the issue is complex and you fail to actually simplify it.
I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch.
Very strong claim which the post doesn't provide nearly enough evidence to support
This post uses "I can identify ways in which chairs are bad" as an example. But it's easier for me to verify that I can sit in a chair and that it's comfortable than to make a chair myself. So I don't really know why this is a good example for "verification is easier than generation".
More examples:
If the goal of this post is to discuss the crux https://www.lesswrong.com/posts/fYf9JAwa6BYMt8GBj/link-a-minimal-viable-product-for-alignment?commentId=mPgnTZYSRNJDwmr64:
evaluation isn't easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it
then I think there is a large disconnect between the post above, which posits that for this claim to be false there has to be some "deep" sense in which delegation is viable, and the more mundane sense in which I think this crux is obviously false: all humans interface with the world and optimize over the products other people create, and are therefore more capable than they would be if they had to make all products for themselves from scratch.
I assumed John was pointing at verifying that, say, the chemicals used in the production of the chair might have some really bad impact on the environment, start causing a problem with the food-chain ecosystem, and make food much scarcer for everyone -- including the person who bought the chair -- in the meaningfully near future.
What I had in mind is more like: many times over the years I've been sitting at a desk and noticed my neck getting sore. Then when I move around a bit, I realize that the chair/desk/screen are positioned such that my neck is at an awkward angle when looking at the screen, which makes my neck sore when I hold that angle for a long time. The mispositioning isn't very salient; I just reflexively adjust my neck to look at the screen and don't notice that it's at an awkward angle. Then later my neck hurts, and it's nonobvious and takes some examination to figure out why my neck hurts.
That sort of thing, I claim, generalizes to most "ergonomics". Chairs, keyboards, desks, mice... these are all often awkward in ways which make us uncomfortable when using them for a long time. But the awkwardness isn't very salient or obvious (for most people), because we just automatically adjust position to handle it, and the discomfort only comes much later from holding that awkward position for a long time.
I mean, sure, for any given X there will be some desirable properties of X which are easy to verify, and it's usually pretty easy to outsource the creation of an X which satisfies the easy-to-verify properties. The problem is that the easy-to-verify properties do not typically include all the properties which are important to us. Ergonomics is a very typical example.
Extending to AI: sure, there will be some desirable properties of AI which are easy to verify, or properties of alignment research which are easy to verify, or properties of plans which are easy to verify, etc. And it will be easy to outsource the creation of AI/research/plans which satisfy those easy-to-verify properties. Alas, the easy-to-verify properties do not include all the properties which are important to us, or even all the properties needed to not die.
I'm curious what you think of Paul's points (2) and (3) here:
...
- Eliezer often talks about AI systems that are able to easily build nanotech and overpower humans decisively, and describes a vision of a rapidly unfolding doom from a single failure. This is what would happen if you were magically given an extraordinarily powerful AI and then failed to align it, but I think it’s very unlikely to be what happens in the real world. By the time we have AI systems that can overpower humans decisively with nanotech, we have other AI systems that will either kill humans...
“keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse"
But, I think the negative impacts that these goods have on you are (mostly) realized on longer timescales - say, years to decades. If you’re using a chair that is bad for your postu...
Many research tasks have very long delays until they can be verified. The history of technology is littered with apparently good ideas that turned out to be losers after huge development efforts were poured into them. Supersonic transport, zeppelins, silicon-on-sapphire integrated circuits, pigeon-guided bombs, object-oriented operating systems, hydrogenated vegetable oil, oxidative decoupling for weight loss…
Finding out that these were bad required making them, releasing them to the market, and watching unrecognized problems torpedo them. Sometimes it took decades.
In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve.
This does not agree with my understanding of what HCH is at all. HCH is a definition of an abstract process for thought experiments, much like AIXI is. It's defined as the fixed point of some iterative process of delegation expanding out into a tree. It's als...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.
To my eyes the most likely outcome is a situation where you are both right.
Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification w...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I've been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I've long since updated on, not so much figuring out new stuff.
In terms of the hard-to-verify aspect: while it's true that any one person will face any number of challenges, do we live in a world where one person does anything on their own?
How would the open-source model influence outcomes? When pretty much anyone can take a look, and presumably many do, does the level of verification, or ease of verification, improve in your model?
Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better.
Crucially, this is true only because you're relatively smart for a human: smarter than many of the engineers that designed those objects, and smarter than most or all of the committee-of-engineers that designed those objects. You can come up with better soluti...
It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad.
Is there an opposite of the "failure of ease of verification," such that the two categories would add up to 100% if you sorted the whole of reality into one or the other? Say, in a simulation, if you attributed every piece of computation to one of these two categories, how much of the world could be "explained by" each?
...That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me
Preamble: Delta vs Crux
This section is redundant if you already read My AI Model Delta Compared To Yudkowsky.
I don’t natively think in terms of cruxes. But there’s a similar concept which is more natural for me, which I’ll call a delta.
Imagine that you and I each model the world (or some part of it) as implementing some program. Very oversimplified example: if I learn that e.g. it’s cloudy today, that means the “weather” variable in my program at a particular time[1] takes on the value “cloudy”. Now, suppose your program and my program are exactly the same, except that somewhere in there I think a certain parameter has value 5 and you think it has value 0.3. Even though our programs differ in only that one little spot, we might still expect very different values of lots of variables during execution - in other words, we might have very different beliefs about lots of stuff in the world.
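As a toy illustration of that picture (my own sketch, with made-up numbers): two copies of the same simple "world program" that differ in only one parameter can end up with quite different values for everything downstream.

```python
def world_model(rain_rate):
    """A deliberately oversimplified 'program' for a small piece of the world.
    All downstream variables are computed by the same code; only rain_rate differs."""
    p_cloudy = min(1.0, rain_rate * 1.5)           # belief about today's weather
    p_carry_umbrella = 0.9 * p_cloudy              # belief about people's behavior
    expected_crop_yield = 1.0 + 2.0 * rain_rate    # belief about a downstream outcome
    return {"p_cloudy": p_cloudy,
            "p_carry_umbrella": p_carry_umbrella,
            "expected_crop_yield": expected_crop_yield}

my_beliefs = world_model(rain_rate=0.5)     # my value for the parameter
your_beliefs = world_model(rain_rate=0.05)  # your value: the single "small" difference

# Same program, one differing parameter, noticeably different beliefs about many variables:
print(my_beliefs)
print(your_beliefs)
```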
If your model and my model differ in that way, and we’re trying to discuss our different beliefs, then the obvious useful thing-to-do is figure out where that one-parameter difference is.
That’s a delta: one or a few relatively “small”/local differences in belief, which when propagated through our models account for most of the differences in our beliefs.
For those familiar with Pearl-style causal models: think of a delta as one or a few do() operations which suffice to make my model basically match somebody else’s model, or vice versa.
This post is about my current best guesses at the delta between my AI models and Paul Christiano's AI models. When I apply the delta outlined here to my models, and propagate the implications, my models mostly look like Paul’s as far as I can tell. That said, note that this is not an attempt to pass Paul's Intellectual Turing Test; I'll still be using my own usual frames.
My AI Model Delta Compared To Christiano
Best guess: Paul thinks that verifying solutions to problems is generally “easy” in some sense. He’s sometimes summarized this as “verification is easier than generation”, but I think his underlying intuition is somewhat stronger than that.
What do my models look like if I propagate that delta? Well, it implies that delegation is fundamentally viable in some deep, general sense.
That propagates into a huge difference in worldviews. Like, I walk around my house and look at all the random goods I’ve paid for - the keyboard and monitor I’m using right now, a stack of books, a tupperware, waterbottle, flip-flops, carpet, desk and chair, refrigerator, sink, etc. Under my models, if I pick one of these objects at random and do a deep dive researching that object, it will usually turn out to be bad in ways which were either nonobvious or nonsalient to me, but unambiguously make my life worse and would unambiguously have been worth-to-me the cost to make better. But because the badness is nonobvious/nonsalient, it doesn’t influence my decision-to-buy, and therefore companies producing the good are incentivized not to spend the effort to make it better. It’s a failure of ease of verification: because I don’t know what to pay attention to, I can’t easily notice the ways in which the product is bad. (For a more game-theoretic angle, see When Hindsight Isn’t 20/20.)
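To make the incentive story concrete, here is a toy model (my own numbers and framing, not the post's): if the purchase decision responds only to quality the buyer can verify up front, a profit-maximizing producer puts just enough effort into the verifiable part to make the sale and none into the hidden part, even if the buyer would gladly have paid for it.

```python
def buyer_buys(salient_quality):
    """The purchase decision can only depend on what the buyer can verify before buying."""
    return salient_quality >= 0.7

def producer_profit(salient_effort, hidden_effort, price=100, cost_per_unit_effort=20):
    # hidden_effort improves the product, but never enters the buyer's decision,
    # so it affects profit only through cost.
    revenue = price if buyer_buys(salient_effort) else 0
    cost = cost_per_unit_effort * (salient_effort + hidden_effort)
    return revenue - cost

# The producer searches over effort allocations and keeps the most profitable one.
candidates = [(s / 10, h / 10) for s in range(11) for h in range(11)]
best = max(candidates, key=lambda efforts: producer_profit(*efforts))
print(best)  # (0.7, 0.0): just enough verifiable quality to sell, zero hidden quality
```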
On (my model of) Paul’s worldview, that sort of thing is rare; at most it’s the exception to the rule. On my worldview, it’s the norm for most goods most of the time. See e.g. the whole air conditioner episode for us debating the badness of single-hose portable air conditioners specifically, along with a large sidebar on the badness of portable air conditioner energy ratings.
How does the ease-of-verification delta propagate to AI?
Well, most obviously, Paul expects AI to go well mostly via humanity delegating alignment work to AI. On my models, the delegator’s incompetence is a major bottleneck to delegation going well in practice, and that will extend to delegation of alignment to AI: humans won’t get what we want by delegating because we don’t even understand what we want or know what to pay attention to. The outsourced alignment work ends up bad in nonobvious/nonsalient (but ultimately important) ways for the same reasons as most goods in my house. But if I apply the “verification is generally easy” delta to my models, then delegating alignment work to AI makes total sense.
Then we can go even more extreme: HCH, aka “the infinite bureaucracy”, a model Paul developed a few years ago. In HCH, the human user does a little work then delegates subquestions/subproblems to a few AIs, which in turn do a little work then delegate their subquestions/subproblems to a few AIs, and so on until the leaf-nodes of the tree receive tiny subquestions/subproblems which they can immediately solve. On my models, HCH adds recursion to the universal pernicious difficulties of delegation, and my main response is to run away screaming. But on Paul’s models, delegation is fundamentally viable, so why not delegate recursively?
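A minimal sketch of the recursive-delegation shape being described (a toy stand-in of mine, not Paul's actual HCH formalism): each node does a little work by splitting its question, hands the pieces to sub-agents, and recursion bottoms out at subquestions small enough to answer directly.

```python
def hch_style_answer(question_tokens, depth=0, max_depth=3):
    """Toy recursive delegation: do a little work, then delegate subquestions downward.
    'Questions' are just lists of tokens here, and 'answering' is a placeholder."""
    if depth >= max_depth or len(question_tokens) <= 2:
        return f"direct answer to {question_tokens!r}"       # leaf: small enough to solve
    mid = len(question_tokens) // 2
    subquestions = [question_tokens[:mid], question_tokens[mid:]]  # the "little work": split
    subanswers = [hch_style_answer(sub, depth + 1, max_depth) for sub in subquestions]
    return f"combination of {subanswers}"                    # aggregate delegated results

print(hch_style_answer(["how", "do", "we", "align", "powerful", "AI", "systems"]))
```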
(Also note that HCH is a simplified model of a large bureaucracy, and I expect my views and Paul’s differ in much the same way when thinking about large organizations in general. I mostly agree with Zvi’s models of large organizations, which can be lossily-but-accurately summarized as “don’t”. Paul, I would guess, expects that large organizations are mostly reasonably efficient and reasonably aligned with their stakeholders/customers, as opposed to universally deeply dysfunctional.)
Propagating further out: under my models, the difficulty of verification accounts for most of the generalized market inefficiency in our world. (I see this as one way of framing Inadequate Equilibria.) So if I apply a “verification is generally easy” delta, then I expect the world to generally contain far less low-hanging fruit. That, in turn, has a huge effect on timelines. Under my current models, I expect that, shortly after AIs are able to autonomously develop, analyze and code numerical algorithms better than humans, there’s going to be some pretty big (like, multiple OOMs) progress in AI algorithmic efficiency (even ignoring a likely shift in ML/AI paradigm once AIs start doing the AI research). That’s the sort of thing which leads to a relatively discontinuous takeoff. Paul, on the other hand, expects a relatively smooth takeoff - which makes sense, in a world where there’s not a lot of low-hanging fruit in the software/algorithms because it’s easy for users to notice when the libraries they’re using are trash.
That accounts for most of the known-to-me places where my models differ from Paul’s. I put approximately-zero probability on the possibility that Paul is basically right on this delta; I think he’s completely out to lunch. (I do still put significantly-nonzero probability on successful outsourcing of most alignment work to AI, but it’s not the sort of thing I expect to usually work.)