The following is a basically unedited summary I wrote up on March 16 of my take on Paul Christiano’s AGI alignment approach (described in “ALBA” and “Iterated Distillation and Amplification”). Where Paul had comments and replies, I’ve included them below.


I see a lot of free variables with respect to what exactly Paul might have in mind. I've sometimes tried presenting Paul with my objections and then he replies in a way that locally answers some of my questions but that I think would make other difficulties worse. My global objection is thus something like, "I don't see any concrete setup and consistent simultaneous setting of the variables where this whole scheme works." These difficulties are not minor or technical; they appear to me quite severe. I try to walk through the details below.

It should be understood at all times that I do not claim to be able to pass Paul’s ITT for Paul’s view and that this is me criticizing my own, potentially straw misunderstanding of what I imagine Paul might be advocating.

 

 

Paul Christiano

Overall take: I think that these are all legitimate difficulties faced by my proposal and to a large extent I agree with Eliezer's account of those problems (though not his account of my current beliefs).

I don't understand exactly how hard Eliezer expects these problems to be; my impression is "just about as hard as solving alignment from scratch," but I don't have a clear sense of why.

To some extent we are probably disagreeing about alternatives. From my perspective, the difficulties with my approach (e.g. better understanding the forms of optimization that cause trouble, or how to avoid optimization daemons in systems about as smart as you are, or how to address X-and-only-X) are also problems for alternative alignment approaches. I think it's a mistake to think that tiling agents, or decision theory, or naturalized induction, or logical uncertainty, are going to make the situation qualitatively better for these problems, so work on those problems looks to me like procrastinating on the key difficulties. I agree with the intuition that progress on the agent foundations agenda "ought to be possible," and I agree that it will help at least a little bit with the problems Eliezer describes in this document, but overall agent foundations seems way less promising than a direct attack on the problems (given that we haven’t tried the direct attack nearly enough to give up). Working through philosophical issues in the context of a concrete alignment strategy generally seems more promising to me than trying to think about them in the abstract, and I think this is evidenced by the fact that most of the core difficulties in my approach would also afflict research based on agent foundations.

The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether. That seems to me like a very serious handicap, so I'd only go that direction once I was quite pessimistic about solving these problems. My subjective experience is of making continuous significant progress rather than being stuck. I agree there is clear evidence that the problems are "difficult" in the sense that we are going to have to make progress in order to solve them, but not that they are "difficult" in the sense that P vs. NP or even your typical open problem in CS is probably difficult (and even then if your options were "prove P != NP" or "try to beat Google at building an AGI without using large-scale ML," I don't think it's obvious which option you should consider more promising).


First and foremost, I don't understand how "preserving alignment while amplifying capabilities" is supposed to work at all under this scenario, in a way consistent with other things that I’ve understood Paul to say.

I want to first go through an obvious point that I expect Paul and I agree upon: Not every system of locally aligned parts has globally aligned output, and some additional assumption beyond "the parts are aligned" is necessary to yield the conclusion "global behavior is aligned". The straw assertion "an aggregate of aligned parts is aligned" is the reverse of the argument that Searle uses to ask us to imagine that an (immortal) human being who speaks only English, who has been trained to do things with many, many pieces of paper that instantiate a Turing machine, can't be part of a whole system that understands Chinese, because the individual pieces and steps of the system aren't locally imbued with understanding Chinese. Here the compositionally non-preserved property is "lack of understanding of Chinese"; we can't expect "alignment" to be any more necessarily preserved than this, except by further assumptions.

The second-to-last time Paul and I conversed at length, I kept probing Paul for what in practice the non-compacted-by-training version of a big aggregate of small aligned agents would look like. He described people, living for a single day, routing around phone numbers of other agents with nobody having any concept of the global picture. I used the term "Chinese Room Bureaucracy" to describe this. Paul seemed to think that this was an amusing but perhaps not inappropriate term.

If no agent in the Chinese Room Bureaucracy has a full view of which actions have which consequences and why, this cuts off the most obvious route by which the alignment of any agent could apply to the alignment of the whole. The way I usually imagine things, the alignment of an agent applies to things that the agent understands. If you have a big aggregate of agents that understands something the little local agent doesn't understand, the big aggregate doesn't inherit alignment from the little agents. Searle's Chinese Room can understand Chinese even if the person inside it doesn't understand Chinese, and this correspondingly implies, by default, that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.

I don't understand Paul's model of how a ton of little not-so-bright agents yield a big powerful understanding in aggregate, in a way that doesn't effectively consist of them running AGI code that they don't understand.

 

 

Paul Christiano

The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn't a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren't internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.

 

Paul has previously challenged me to name a bottleneck that I think a Christiano-style system can't pass. This is hard because (a) I'm not sure I understand Paul's system, and (b) it's clearest if I name a task for which we don't have a present crisp algorithm. But:

The bottleneck I named in my last discussion with Paul was, "We have copies of a starting agent, which run for at most one cumulative day before being terminated, and this agent hasn't previously learned much math but is smart and can get to understanding algebra by the end of the day even though the agent started out knowing just concrete arithmetic. How does a system of such agents, without just operating a Turing machine that operates an AGI, get to the point of inventing Hessian-free optimization in a neural net?"

This is a slightly obsolete example because nobody uses Hessian-free optimization anymore. But I wanted to find an example of an agent that needed to do something that didn't have a simple human metaphor. We can understand second derivatives using metaphors like acceleration. "Hessian-free optimization" doesn't have an obvious metaphor that could explain it, well enough to use it in an engineering design, to somebody whose understanding of calculus is merely metaphorical rather than mathy. Even if it did have such a metaphor, that metaphor would still be very unlikely to be invented by someone who didn't understand calculus.
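(For concreteness, here is a minimal sketch of the trick that gives Hessian-free optimization its name, purely to illustrate why it resists metaphor: curvature information gets used through Hessian-vector products and conjugate gradient without the Hessian ever being written down. The toy quadratic loss and every name below are my own illustration, not anything from the discussion itself.)

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # toy positive-definite curvature, standing in for a network's Hessian

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w                      # in a real network this gradient would come from backprop

def hess_vec(w, v, eps=1e-4):
    # The core trick: a Hessian-vector product from two gradient evaluations,
    # without ever forming or storing the Hessian itself.
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def hf_step(w, cg_iters=10):
    """Approximately solve H d = -g by conjugate gradient, using only hess_vec."""
    g = grad(w)
    d = np.zeros_like(w)
    r = -g.copy()
    p = r.copy()
    for _ in range(cg_iters):
        Hp = hess_vec(w, p)
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return w + d

w = np.array([1.0, -2.0])
print(loss(w), loss(hf_step(w)))      # the loss should drop to roughly zero in one curvature-aware step
```

None of those steps map onto an everyday metaphor the way "acceleration" maps onto second derivatives; to use the method in a design you have to actually follow the calculus.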

I don't see how Paul expects lots of little agents who can learn algebra in a day, being run in sequence, to aggregate into something that can build designs using Hessian-free optimization, without the little agents having effectively the role of an immortal dog that's been trained to operate a Turing machine. So I also don't see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.

I expect this is already understood, but I state as an obvious fact that alignment is not in general a compositionally preserved property of cognitive systems: If you train a bunch of good and moral people to operate the elements of a Turing machine and nobody has a global view of what's going on, their goodness and morality does not pass through to the Turing machine. Even if we let the good and moral people have discretion as to when to write a different symbol than the usual rules call for, they still can't be effective at aligning the global system, because they don't individually understand whether the Hessian-free optimization is being used for good or evil, because they don't understand Hessian-free optimization or the thoughts that incorporate it. So we would not like to rest the system on the false assumption "any system composed of aligned subagents is aligned", which we know to be generally false because of this counterexample. We would like there to instead be some narrower assumption, perhaps with additional premises, which is actually true, on which the system's alignment rests. I don't know what narrower assumption Paul wants to use.


Paul asks us to consider AlphaGo as a model of capability amplification.

My view of AlphaGo would be as follows: We understand Monte Carlo Tree Search. MCTS is an iterable algorithm whose intermediate outputs can be plugged into further iterations of the algorithm. So we can use supervised learning where our systems of gradient descent can capture and foreshorten the computation of some but not all of the details of winning moves revealed by the short MCTS, plug in the learned outputs to MCTS, and get a pseudo-version of "running MCTS longer and wider" which is weaker than an MCTS actually that broad and deep, but more powerful than the raw MCTS run previously. The alignment of this system is provided by the crisp formal loss function at the end of the MCTS.
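(To make that loop concrete, here is a toy sketch of the amplify-and-distill pattern, with a one-step lookahead standing in for MCTS and a lookup table standing in for the learned network; the chain environment and every name below are illustrative assumptions of mine, not details of AlphaGo or of Paul's proposal.)

```python
N, GAMMA = 10, 0.9                      # states 0..N on a line; reaching N yields reward 1

def step(state, action):
    if state == N:                      # the goal state is terminal
        return state, 0.0
    nxt = max(0, min(N, state + action))
    return nxt, (1.0 if nxt == N else 0.0)

def amplify(value):
    """'Amplification': a shallow search that consults the cheap distilled
    evaluator at its leaves, yielding better evaluations than the evaluator alone."""
    return {s: max(r + GAMMA * value[nxt]
                   for nxt, r in (step(s, a) for a in (-1, +1)))
            for s in range(N + 1)}

# 'Distillation' here is just adopting the search's outputs as the new cheap
# evaluator; with a real learner this would be a supervised training step.
value = {s: 0.0 for s in range(N + 1)}
for _ in range(20):
    value = amplify(value)

print({s: round(v, 2) for s, v in value.items()})
# After enough rounds, the cheap evaluator approximates the verdict of a search
# far deeper than any single amplification step actually performed.
```

The anchor for the whole loop is the fixed reward check inside step, playing the role of the crisp formal loss function at the end of the MCTS.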

Here's an alternate case where, as far as I can tell, a naive straw version of capability amplification clearly wouldn't work. Suppose we have an RNN that plays Go. It's been constructed in such fashion that if we iterate the RNN for longer, the Go move gets somewhat better. "Aha," says the straw capability amplifier, "clearly we can just take this RNN, train another network to approximate its internal state after 100 iterations from the initial Go position; we feed that internal state into the RNN at the start, then train the amplifying network to approximate the internal state of that RNN after it runs for another 200 iterations. The result will clearly go on trying to 'win at Go' because the original RNN was trying to win at Go; the amplified system preserves the values of the original." This doesn't work because, let us say by hypothesis, the RNN can't get arbitrarily better at Go if you go on iterating it; and the nature of the capability amplification setup doesn't permit any outside loss function that could tell the amplified RNN whether it's doing better or worse at Go.

 

 

Paul Christiano

I definitely agree that amplification doesn't work better than "let the human think for arbitrarily long." I don’t think that’s a strong objection, because I think humans (even humans who only have a short period of time) will eventually converge to good enough answers to the questions we face.

 

The RNN has only whatever opinion it converges to, or whatever set of opinions it diverges to, to tell itself how well it's doing. This is exactly what it is for capability amplification to preserve alignment; but this in turn means that capability amplification only works to the extent that what we are amplifying has within itself the capability to be very smart in the limit.

If we're effectively constructing a civilization of long-lived Paul Christianos, then this difficulty is somewhat alleviated. There are still things that can go wrong with this civilization qua civilization (even aside from objections I name later as to whether we can actually safely and realistically do that). I do however believe that a civilization of Pauls could do nice things.

But other parts of Paul's story don't permit this, or at least that's what Paul was saying last time; Paul's supervised learning setup only lets the simulated component people operate for a day, because we can't get enough labeled cases if the people have to each run for a month.

Furthermore, as I understand it, the "realistic" version of this is supposed to start with agents dumber than Paul. According to my understanding of something Paul said in answer to a later objection, the agents in the system are supposed to be even dumber than an average human (but aligned). It is not at all obvious to me that an arbitrarily large system of agents with IQ 90, who each only live for one day, can implement a much smarter agent in a fashion analogous to the internal agents themselves achieving understandings to which they can apply their alignment in a globally effective way, rather than them blindly implementing a larger algorithm they don't understand.

I'm not sure a system of one-day-living IQ-90 humans ever gets to the point of inventing fire or the wheel.

If Paul has an intuition saying "Well, of course they eventually start doing Hessian-free optimization in a way that makes their understanding effective upon it to create global alignment; I can’t figure out how to convince you otherwise if you don’t already see that," I'm not quite sure where to go from there, except onwards to my other challenges.

 

 

Paul Christiano

Well, I can see one obvious way to convince you otherwise: actually run the experiment. But before doing that I'd like to be more precise about what you expect to work and not work, since I'm not going to literally do the HF optimization example (developing new algorithms is way, way beyond the scope of existing ML). I think we can do stuff that looks (to me) even harder than inventing HF optimization. But I don't know if I have a good enough model of your model to know what you'd actually consider harder.

 

Unless of course you have so many agents in the (uncompressed) aggregate that the aggregate implements a smarter genetic algorithm that is maximizing the approval of the internal agents. If you take something much smarter than IQ 90 humans living for one day, and train it to get the IQ 90 humans to output large numbers signaling their approval, I would by default expect it to hack the IQ 90 one-day humans, who are not secure systems. We're back to the global system being smarter than the individual agents in a way which doesn't preserve alignment.

 

 

Paul Christiano

Definitely agree that even if the agents are aligned, they can implement unaligned optimization, and then we're back to square one. Amplification only works if we can improve capability without doing unaligned optimization. I think this is a disagreement about the decomposability of cognitive work. I hope we can resolve it by actually finding concrete, simple tasks where we have differing intuitions, and then doing empirical tests.

 

The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning. If arguendo you can construct an exact imitation of a human, it possesses exactly the same alignment properties as the human; and this is true in a way that is not true if we take a reinforcement learner and ask it to maximize an approval signal originating from the human. (If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.)

It is not obvious to me how fast alignment-preservation degrades as the exactness of the imitation is weakened. This matters because of things Paul has said which sound to me like he's not advocating for perfect imitation, in response to challenges I've given about how perfect imitation would be very expensive. That is, the answer he gave to a challenge about the expense of perfection makes the answer to "How fast do we lose alignment guarantees as we move away from perfection?" become very important.

One example of a doom I'd expect from standard reinforcement learning would be what I'd term the "X-and-only-X" problem. I unfortunately haven't written this up yet, so I'm going to try to summarize it briefly here.

X-and-only-X is what I call the issue where the property that's easy to verify and train is X, but the property you want is "this was optimized for X and only X and doesn't contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system".

For example, imagine X is "give me a program which solves a Rubik's Cube". You can run the program and verify that it solves Rubik's Cubes, and use a loss function over its average performance which also takes into account how many steps the program's solutions require.

The property Y is that the program the AI gives you also modulates RAM to send GSM cellphone signals.

That is: It's much easier to verify "This is a program which at least solves the Rubik's Cube" than "This is a program which was optimized to solve the Rubik's Cube and only that and was not optimized for anything else on the side."

If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I'd talk about how this creates a differential ease of development between "build a system that does X" and "build a system that does X and only X and not Y in some subtle way". If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can't write a simple loss function for that the way you can for X.
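(A tiny illustration of that asymmetry, with the Rubik's Cube task replaced by a trivial sorting task so it runs; everything here is a stand-in of my own, not anyone's proposed setup. The point is that "the output passes the X check" is a one-line loss function, while "the output came from a process optimizing for X and nothing else" is not a property of the output at all, so no loss computed from outputs can capture it.)

```python
import random

def x_loss(candidate, test_cases):
    """The easy part: a loss that checks X (here, 'sorts its input correctly')."""
    return sum(candidate(case) != sorted(case) for case in test_cases)

def honest_solution(xs):
    return sorted(xs)

def solution_with_hidden_extra(xs):
    result = sorted(xs)
    _ = sum(result)   # stand-in for some subtle unwanted side behavior (the hidden Y)
    return result

tests = [[random.randint(0, 9) for _ in range(5)] for _ in range(100)]
print(x_loss(honest_solution, tests), x_loss(solution_with_hidden_extra, tests))
# Both print 0: the X check is satisfied either way, and 'only X' never shows up
# in anything we can score from the outputs alone.
```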

 

 

Paul Christiano

According to my understanding of optimization / use of language: the agent produced by RL is optimized only for X. However, optimization for X is liable to produce a Y-optimizer. So the actions of the agent are both X-optimized and Y-optimized.

 

The team that's building a less safe AGI can plug in the X-evaluator and let rip; the team that wants to build a safe AGI can't do things the easy way and has to solve new basic problems in order to get a trustworthy system. It's not unsolvable, but it's an element of the class of added difficulties of alignment such that the whole class extremely plausibly adds up to an extra two years of development.

In Paul's capability-amplification scenario, if we can get exact imitation, we are genuinely completely bypassing the whole paradigm that creates the X-and-only-X problem. If you can get exact imitation of a human, the outputs have only and exactly whatever properties the human already has. This kind of genuinely different viewpoint is why I continue to be excited about Paul's thinking.

 

 

Paul Christiano

I agree that perfect imitation would be a way to get around the X-and-only-X problem. However, I don't think that it's plausible and it's not how my approach hopes to get around the X-and-only-X problem.

I would solve X-and-only-X in two steps:

First, given an agent and an action which has been optimized for undesirable consequence Y, we'd like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I'm calling informed oversight.

Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won't, or enough understanding of its internals that you can see why it won't. This is discussed in “Techniques for Optimizing Worst-Case Performance.”

(It also obviously requires a smarter agent, which you hope to get by induction + amplification).

I think that both of those are hard problems, in addition to the assumption that amplification will work. But I don't yet see reason to be super pessimistic about either of them.

 

On the other hand, suppose we don't have exact imitation. How fast do we lose the defense against X-and-only-X? Well, that depends on the inexactness of the imitation; under what kind of distance metric is the imperfect imitation 'near' to the original? Like, if we're talking about Euclidean distance in the output, I expect you lose the X-and-only-X guarantee pretty damn fast against smart adversarial perturbations.

On the other other hand, suppose that the inexactness of the imitation is "This agent behaves exactly like Paul Christiano but 5 IQ points dumber." If this is only and precisely the form of inexactness produced, and we know that for sure, then I'd say we have a pretty good guarantee against slightly-dumber-Paul producing the likes of Rubik's Cube solvers containing hidden GSM signalers.

On the other other other hand, suppose the inexactness of the imitation is "This agent passes the Turing Test; a human can't tell it apart from a human." Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can't detect, which covers an enormous amount of lethal territory, which is why we can't just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.


Speaking of inexact imitation: It seems to me that having an AI output a high-fidelity imitation of human behavior, sufficiently high-fidelity to preserve properties like "being smart" and "being a good person" and "still being a good person under some odd strains like being assembled into an enormous Chinese Room Bureaucracy", is a pretty huge ask.

It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.

Correspondingly, a realistic AI we build that literally passes the strong version of the Turing Test would probably have to be much smarter than the other humans in the test, probably smarter than any human on Earth, because it would have to possess all the human capabilities in addition to its own. Or at least all the human capabilities that can be exhibited to another human over the course of however long the Turing Test lasts. (Note that on the version of capability amplification I heard, capabilities that can be exhibited over the course of a day are the only kinds of capabilities we're allowed to amplify.)

 

 

Paul Christiano

Totally agree, and for this reason I agree that you can't rely on perfect imitation to solve the X-and-only-X problem and hence need other solutions. If you convince me that either informed oversight or reliability is impossible, then I'll be largely convinced that I'm doomed.

 

An AI that learns to exactly imitate humans, not just passing the Turing Test to the limits of human discrimination on human inspection, but perfect imitation with all added bad subtle properties thereby excluded, must be so cognitively powerful that its learnable hypothesis space includes systems equivalent to entire human brains. I see no way that we're not talking about a superintelligence here.

So to postulate perfect imitation, we would first of all run into the problems that:

(a)  The AGI required to learn this imitation is extremely powerful, and this could imply a dangerous delay between when we can build any dangerous AGI at all, and when we can build AGIs that would work for alignment using perfect-imitation capability amplification.

(b)  Since we cannot invoke a perfect-imitation capability amplification setup to get this very powerful AGI in the first place (because it is already the least AGI that we can use to even get started on perfect-imitation capability amplification), we already have an extremely dangerous unaligned superintelligence sitting around that we are trying to use to implement our scheme for alignment.

Now, we may perhaps reply that the imitation is less than perfect and can be done with a dumber, less dangerous AI; perhaps even so dumb as to not be enormously superintelligent. But then we are tweaking the “perfection of imitation” setting, which could rapidly blow up our alignment guarantees against the standard dooms of standard machine learning paradigms.

I'm worried that you have to degrade the level of imitation a lot before it becomes less than an enormous ask, to the point that what's being imitated isn't very intelligent, isn't human, and/or isn't known to be aligned.

To be specific: I think that if you want to imitate IQ-90 humans thinking for one day, and imitate them so specifically that the imitations are generally intelligent and locally aligned even in the limit of being aggregated into weird bureaucracies, you're looking at an AGI powerful enough to think about whole systems loosely analogous to IQ-90 humans.

 

 

Paul Christiano

It's important that my argument for alignment-of-amplification goes through not doing problematic optimization. So if we combine that with a good enough solution to informed oversight and reliability (and amplification, and the induction working so far...), then we can continue to train imperfect imitations that definitely don't do problematic optimization. They'll mess up all over the place, and so might not be able to be competent (another problem amplification needs to handle), but the goal is to set things up so that being a lot dumber doesn't break alignment.

 

I think that is a very powerful AGI. I think this AGI is smart enough to slip all kinds of shenanigans past you, unless you are using a methodology that can produce faithful imitations from unaligned AGIs. I think this is an AGI that can do powerful feats of engineering, unless it is somehow able to simulate humans doing powerful feats of engineering without itself being capable of powerful feats of engineering.

And then furthermore the capability amplification schema requires the AGI to be powerful enough to learn to imitate amplified systems of humans as well!

If you don’t degrade the AI to well below the point of superintelligence, we require extremely strong Faraday-cage keep-God-in-a-box don't-peek-at-the-code AI boxing. So we cannot for example let an untrusted superintelligence originate queries that it can use to learn human behavior; it has to be strictly unsupervised example-based learning rather than a query model. This interacts with worries about sample complexity below.

 

 

Paul Christiano

I appreciate how often you had to argue against disjunctions here in order to give me the benefit of the doubt.

 

Or you need to postulate some prior alignment of the AGI that lets you robustly and safely direct it at the imitation problem. Then (a) your alignment schema depends on an unspecified prior alignment schema that already works to make very powerful AGIs do things. And (b) you probably aren't allowing unrestrained intelligence explosions, which probably introduces a big unavoidable delay between the first point where very powerful AGI is technologically possible, and the later point where you can build and safely align the extremely powerful AGI that you're using as the base unit in your imitation scheme.


The last I checked with Paul, his scheme bit the bullet on using powerful smart dangerous units to imitate humans. Paul's scheme for bypassing this, as I heard and understood it back then, involved a scheme which:

  • Trained a million smart predictors of sufficient diversity that at least one predictor would happen to end up with the true sole simple goal of predicting faithfully;
  • Compared the predictors against each other, finding cases where their predictions diverged sufficiently;
  • Tested them against reality;
  • Had enough sampling power to perform a sufficient number of tests that, even if a superintelligence cunningly delays the time when it starts trying to slip in bad predictions, we can pause the system and run the experimental test and generate new samples and kick out the unfaithful predictor at that time.

That's a one-millionfold performance penalty, if I understood it correctly.

If that was just a concrete illustration to help me understand what's happening at all, and actually there's supposed to be some obvious less naive more efficient scheme for making use of mostly untrusted superintelligences which doesn't have the giant performance penalty, then I haven't seen it yet.

 

 

Paul Christiano

There are several possible techniques for cutting this performance penalty down, e.g., representing a large ensemble implicitly or using a tricky distillation procedure. I totally agree that if you have a millionfold slowdown then you are out of luck.

When we talked about this, I agreed that "train a reliable system" was an important subproblem. Train an implicit ensemble is one possible approach to that problem. At the time of our discussion I listed 3 techniques. Since then I've become quite pessimistic about training an implicit ensemble, and have become more optimistic about verification and transparency. (My thinking about the problem is also generally much sharper.) The three approaches I currently consider most live are the ones in “Techniques for Optimizing Worst-Case Performance.”

(As usual, all of it requires amplification + the induction.)

(In the original discussion, I listed “large ensemble” as a possible approach in response to your "how could you possibly solve this?" question, not as a necessary ingredient.)


I also have trouble seeing how kicking out unfaithful predictors is possible under an unsupervised-learning paradigm that doesn't actually allow hypercomputation.

The problem being: the exact output of a human is always going to be very improbable. Even if the agents can only send each other text messages and no illustrations (which further reduces their collective intelligence and ability to produce textbooks readable in a day), a text message is still an enormous possibility space. If we ask superintelligences to give us their predictions for what the human does next, their predictions cannot possibly actually include a giant lookup table next to all the probable outputs of the human.

I think I see how I’d “solve” this problem using hypercomputation and an enormous number of samples: I require the superintelligences to give me their giant lookup tables of probabilities for exact outputs, calculate total distances between probability tables (KL divergence or whatever), and if the distance is great enough, I sample a human and do a Bayesian update. I drop any superintelligences whose prior probability goes low enough. I keep doing this until only agreeing superintelligences remain.
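(A toy sketch of that elimination loop, with tiny explicit probability tables standing in for the giant lookup tables, which is exactly the part that seems to demand hypercomputation at realistic scale. The message space, noise levels, and all names below are my own illustrative assumptions.)

```python
import numpy as np

rng = np.random.default_rng(0)
MESSAGES = 4                                    # pretend only four possible messages exist
true_dist = np.array([0.4, 0.3, 0.2, 0.1])      # the real human's output distribution

def make_predictor(dishonest):
    """Honest predictors report a near-copy of the true distribution; the
    dishonest one quietly diverts some mass toward message 3."""
    p = true_dist + rng.normal(0, 0.002, MESSAGES)
    if dishonest:
        p[3] += 0.05
    p = np.clip(p, 1e-9, None)
    return p / p.sum()

predictors = [make_predictor(dishonest=(i == 0)) for i in range(5)]
log_weights = np.zeros(len(predictors))         # uniform prior over predictors

for _ in range(5000):
    # Only spend a (costly) real-human sample when the tables visibly diverge.
    divergence = max(np.abs(p - q).sum() for p in predictors for q in predictors)
    if divergence < 0.01:
        continue
    human_message = rng.choice(MESSAGES, p=true_dist)     # sample the real human
    log_weights += np.array([np.log(p[human_message]) for p in predictors])

log_weights -= log_weights.max()
print(np.round(np.exp(log_weights), 3))
# The dishonest predictor's posterior weight collapses toward zero while the
# honest ones stay comparable; the catch, as argued below, is how many real
# samples that takes when the diverted mass is tiny.
```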

Under this system it looks to me like you need a lot of samples to kick out any superintelligences trying to sneak in bad suggestions. Let's say that an uncompressed Chinese Room bureaucracy is doing the equivalent of sending around a quadrillion messages, and that the output can be corrupted by an adversarial attack against one billion of those messages such that there's a chance of outputting an adversarially chosen alternative to each of those billion messages. Well, the only way a superintelligence can allocate a generation probability to those fake messages is by stealing 0.0001% probability mass from all the other messages that might be output. So each time we check against a real human, the update against it is a likelihood ratio of only about (1 - 10^-6):1, and after taking a million samples we've driven down the relative odds of this naughty superintelligence by a factor of roughly e.

We had to take a million samples to do that, and each of the samples involves a human thinking for one day. And that just got us e:1 evidence, so we've got to take a million samples 7 times to get to the point of 1000:1 odds against the superintelligence being honest. And if we have a million superintelligences in the distribution, some of them can make skewed bets à la predicting coinflips whereby a thousand of those will be able to gain 1000:1 advantages over the others, so actually you've got to take 14 million samples to be sure of beating any bad predictor classes that collectively try that scheme.
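(A quick check of that arithmetic, taking the numbers in the hypothetical above at face value: a predictor that diverts 10^-6 of its probability mass to corrupted messages loses a likelihood-ratio factor of (1 - 10^-6) on every honest sample.)

```python
import math

mass_stolen = 1e-6            # probability mass diverted to corrupted messages
samples_per_batch = 10**6     # one batch = a million one-day human samples

per_batch_factor = (1 - mass_stolen) ** samples_per_batch
print(per_batch_factor)       # ~0.368, i.e. roughly a factor of e per million samples

batches = math.log(1000) / -math.log(per_batch_factor)
print(math.ceil(batches))     # 7 -> seven million samples for 1000:1 odds,
                              # 14 million with the doubling described above
```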

I think last time I talked to Paul about this, Paul was like, "Sure, we're talking about a gigantic government project that can take millions of samples."

I'm not sure, I could be wrong about this, but I think you actually need 14 million samples per bad predictor that you want to kick out of the distribution.

Even if I’m wrong about that, we need millions of humans who are smart enough and aligned enough that random samples from their number can agglomerate into a smart aligned Chinese Room Bureaucracy.

I also don't see a way to do this without comparing the giant lookup tables using hypercomputation, but there I might just be missing a math trick.

 

 

Paul Christiano

My best guess is that this can be done under plausible assumptions with O(1) samples per bad predictor. It's tricky, but we are good at tricky math problems, so it's not very scary relative to the other problems we face.

(Unfortunately, I think that a large implicit ensemble is probably stuck anyway, in part because a million predictors isn't enough. But I'm not confident about that.)

 

(If you're thinking of GANs, then so far as I can tell, the discriminator has to be at least as smart as the generator, and you have to trust the discriminator, and there isn't any simpler problem with respect to how you find a trusted superintelligence within a collective of untrusted ones to act as your discriminator.)

(EDIT: Actually, after thinking about this for another five minutes, maybe I do see how to do it with GANs and lower sample complexity.)

An even larger issue is that I don't see any obvious way to carry out a scheme like this one at all with respect to imperfect imitations. (And the above scheme I thought of with GANs would also just fail.)

 

 

Paul Christiano

I think we could probably get over this too, it's another tricky math problem. I think this kind of problem is reliably either impossible, or else radically easier than most of the other stuff we are dealing with in alignment.

(Though I endorse the overall intuition that large implicit ensembles are doomed.)


I think these arguments are collectively something like a crux. That is, unless I've missed one of my own thought processes in the course of writing this up rapidly, or assumed a shared background assumption that isn't actually shared.

Let's say that D is the degree of imperfection allowed by some system of capability amplification, and call D-imperfect imitations D-imitations. Iterated D-imitations of amplified systems of D-imitations will be termed DD-imitations. Then I think I'd start to be pragmatically interested in capability amplification as I understood it, if I believed all of the following:

  • We can, before the world is ended by other unaligned AIs, get AIs powerful enough to learn D-imitations and DD-imitations;
  • D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;
  • D-imitations agglomerate to sufficient cognitive power to perform a pivotal act in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property;
  • We can find any way of either:
    • Individually trusting one AI that powerful to faithfully perform the task of D-imitation (but then why can't we just use this scheme to align a powerful AGI in the first place?);
    • Finding a scheme for agglomerating mostly untrustworthy powerful intelligences which:
      • Doesn't require giant lookup tables, doesn't require a GAN with a trusted discriminator unless you can say how to produce the trusted discriminator, and can use actual human samples as fuel to discriminate trustworthiness among untrusted generators of D-imitations.
      • Is extremely sample-efficient (let's say you can clear 100 people who are trustworthy to be part of an amplified-capability system, which already sounds to me like a huge damned ask); or you can exhibit to me a social schema which agglomerates mostly untrusted humans into a Chinese Room Bureaucracy that we trust to perform a pivotal task, and a political schema that you trust to do things involving millions of humans, in which case you can take millions of samples but not billions. Honestly, I just don't currently believe in AI scenarios in which good and trustworthy governments carry out complicated AI alignment schemas involving millions of people, so if you go down this path we end up with different cruxes; but I would already be pretty impressed if you got all the other cruxes.
      • Is not too computationally inefficient; more like 20-1 slowdown than 1,000,000-1. Because I don't think you can get the latter degree of advantage over other AGI projects elsewhere in the world. Unless you are postulating massive global perfect surveillance schemes that don't wreck humanity's future, carried out by hyper-competent, hyper-trustworthy great powers with a deep commitment to cosmopolitan value — very unlike the observed characteristics of present great powers, and going unopposed by any other major government. Again, if we go down this branch of the challenge then we are no longer at the original crux.

I worry that going down the last two branches of the challenge could create the illusion of a political disagreement, when I have what seem to me like strong technical objections at the previous branches. I would prefer that the more technical cruxes be considered first. If Paul answered all the other technical cruxes and presented a scheme for capability amplification that worked with a moderately utopian world government, I would already have been surprised. I wouldn't actually try it because you cannot get a moderately utopian world government, but Paul would have won many points and I would be interested in trying to refine the scheme further because it had already been refined further than I thought possible. On my present view, trying anything like this should either just plain not get started (if you wait to satisfy extreme computational demands and sampling power before proceeding), just plain fail (if you use weak AIs to try to imitate humans), or just plain kill you (if you use a superintelligence).

 

 

Paul Christiano

I think that the disagreement is almost entirely technical. I think if we really needed 1M people it wouldn't be a dealbreaker, but that's because of a technical rather than political disagreement (about what those people need to be doing). And I agree that 1,000,000x slowdown is unacceptable (I think even a 10x slowdown is almost totally doomed).

 

I restate that these objections seem to me to collectively sum up to “This is fundamentally just not a way you can get an aligned powerful AGI unless you already have an aligned superintelligence”, rather than “Some further insights are required for this to work in practice.” But who knows what further insights may really bring? Movement in thoughtspace consists of better understanding, not cleverer tools.

I continue to be excited by Paul’s thinking on this subject; I just don’t think it works in the present state.

 

 

Paul Christiano

On this point, we agree. I don’t think anyone is claiming to be done with the alignment problem, the main question is about what directions are most promising for making progress.

 

On my view, this is not an unusual state of mind to be in with respect to alignment research. I can’t point to any MIRI paper that works to align an AGI. Other people seem to think that they ought to currently be in a state of having a pretty much workable scheme for aligning an AGI, which I would consider to be an odd expectation. I would think that a sane point of view consisted in having ideas for addressing some problems that created further difficulties that needed to be fixed and didn’t address most other problems at all; a map with what you think are the big unsolved areas clearly marked. Being able to have a thought which genuinely squarely attacks any alignment difficulty at all despite any other difficulties it implies, is already in my view a large and unusual accomplishment. The insight “trustworthy imitation of human external behavior would avert many default dooms as they manifest in external behavior unlike human behavior” may prove vital at some point. I continue to recommend throwing as much money at Paul as he says he can use, and I wish he said he knew how to use larger amounts of money.

Comments
The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether.

As I understand it, the default Nate prediction is that if we get aligned AGI at all, it's most likely to have a mix of garden-variety narrow-AI ML with things that don't look like contemporary ML. I wouldn't describe that as "eschewing large-scale ML altogether", but possibly Paul would.

I think the more important disagreement here isn't about how hard it is to use AF to resolve the central difficulties, but rather about how hard it is to resolve the central difficulties with the circa-2018 ML toolbox. Eliezer's view, from the Sam Harris interview, is:

The depth of the iceberg is: “How do you actually get a sufficiently advanced AI to do anything at all?” Our current methods for getting AIs to do anything at all do not seem to me to scale to general intelligence. If you look at humans, for example: if you were to analogize natural selection to gradient descent, the current big-deal machine learning training technique, then the loss function used to guide that gradient descent is “inclusive genetic fitness”—spread as many copies of your genes as possible. We have no explicit goal for this. In general, when you take something like gradient descent or natural selection and take a big complicated system like a human or a sufficiently complicated neural net architecture, and optimize it so hard for doing X that it turns into a general intelligence that does X, this general intelligence has no explicit goal of doing X.
We have no explicit goal of doing fitness maximization. We have hundreds of different little goals. None of them are the thing that natural selection was hill-climbing us to do. I think that the same basic thing holds true of any way of producing general intelligence that looks like anything we’re currently doing in AI.
If you get it to play Go, it will play Go; but AlphaZero is not reflecting on itself, it’s not learning things, it doesn’t have a general model of the world, it’s not operating in new contexts and making new contexts for itself to be in. It’s not smarter than the people optimizing it, or smarter than the internal processes optimizing it. Our current methods of alignment do not scale, and I think that all of the actual technical difficulty that is actually going to shoot down these projects and actually kill us is contained in getting the whole thing to work at all. Even if all you are trying to do is end up with two identical strawberries on a plate without destroying the universe, I think that’s already 90% of the work, if not 99%.

My understanding is that Paul thinks breaking the evolution analogy is important, but a lot less difficult than Eliezer thinks it is.

My understanding is that Paul thinks breaking the evolution analogy is important, but a lot less difficult than Eliezer thinks it is

My basic take on the evolution analogy:

  • Evolution wasn't trying to solve the robustness problem at all. It's analogous to using existing ML while making zero effort to avoid catastrophic generalization failures. I'm not convinced the analogy tells us much about how hard this problem will be (rather than just showing that the problem exists). Even today, if we were trying to train an AI to care about X, we'd e.g. train on situations where X diverges from other possible goals, or where it looks like the agent isn't being monitored as part of the training process. We'd try a variety of simple techniques to understand what the AI is thinking or anticipating, and use that information to help construct tricky situations or evaluate behavior. And so on. In the real world we are going to use much more sophisticated versions of those techniques, but the analogy doesn't even engage with the most basic versions.
  • In practice I think that we can use a system nearly as smart as the AI to guide the AI's training---before we have a super-duper-intelligent AI we have (or could choose to train) a superintelligent AI, and before that we can have a pretty intelligent AI. This is important, because the Nate/Eliezer response to the previous bullet tends to assume a huge intelligence gap between the intelligence that's being trained and the intelligence that's doing the overseeing. That looks like an unreasonable situation to me even if we can't get amplification to work. (Amplification lets us have an oversight process smarter than the system we are training. But at a minimum we could get an overseer only a little bit less smart.) We've had a bit of argument about this, but I've found the argument really unconvincing and also don't expect it to convince others.
  • The thing that we are doing is probably much easier than "evolve a species to care about precise goal X." Training an AI to be corrigible is much closer to trying to breed a creature for docility than trying to breed it to care about some particular complex thing. I think there is a reasonable chance that this would just work even in the evolution analogy and even without any technical progress, i.e. that humans could already breed a race of docile superhumans by using the pretty basic approaches we know of now.

"Evolution wasn't trying to solve the robustness problem at all." - Agreed that this makes the analogy weaker. And, to state the obvious, everyone doing safety work at MIRI and OpenAI agrees that there's some way to do neglected-by-evolution engineering work that gets you safe+useful AGI, though they disagree about the kind and amount of work.

The docility analogy seems to be closely connected to important underlying disagreements.

Conversation also continues here.

the more important disagreement here isn't about how hard it is to use AF to resolve the central difficulties

What's "AF" here?

I think Agent Foundations

D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;

My model of Paul thinks it's sufficient to train the AIs to be corrigible act-based assistants that are competent enough to help us significantly, while also able to avoid catastrophes. If possible, this would allow significant wiggle room for imperfect imitation.

Paul and I disagreed about the ease of training such assistants, and we hashed out a specific thought experiment: if we humans were trying our hardest to be competent, catastrophe-free, corrigible act-based assistants to some aliens, is there some reasonable training procedure they could give us that would enable us to significantly and non-catastrophically assist the aliens perform a pivotal act? Paul thought yes (IIRC), while I felt iffy about it. After all, we might need to understand tons and tons of alien minutiae to avoid any catastrophes, and given how different our cultures (and brains) are from theirs, it seems unlikely we'd be able to capture all the relevant minutiae.

I've since warmed up to the feasibility of this. It seems like there aren't too many ways to cause existential catastrophes, it's pretty easy to determine what things constitute existential catastrophes, and it's pretty easy to spot them in advance (at least as well as the aliens would). Yes, we might still make some mistakes, but they're likely to be benign, and it's not clear that the risk of catastrophe we'd incur is much worse than the risk the aliens would incur if a large team of them tried to execute a pivotal act. Perhaps there's still room for things like accidental mass manipulation, but this feels much less worrisome than existential catastrophe (and also seems plausibly preventable with a sufficiently competent operator).

I suspect another major crux on this point is whether there is a broad basin of corrigibility (link). If so, it shouldn't be too hard for D-imitations to be corrigible, nor for IDA to preserve corrigibility for DD-imitations. If not, it seems likely that corrigibility would be lost through distillation. I think this is also a crux for Vaniver in his post about his confusions with Paul's agenda.

There's a nice summary of Eliezer's post in Rohin Shah's Alignment Newsletter #7 (which I broke up into a numbered list for clarity), along with Rohin's response:

A list of challenges faced by iterated distillation and amplification.

  1. First, a collection of aligned agents interacting does not necessarily lead to aligned behavior. (Paul's response: That's not the reason for optimism, it's more that there is no optimization pressure to be unaligned.)
  2. Second, it's unclear that, even with high bandwidth oversight, a collection of agents could reach arbitrary levels of capability. For example, how could agents with an understanding of arithmetic invent Hessian-free optimization? (Paul's response: This is an empirical disagreement, hopefully it can be resolved with experiments.)
  3. Third, while it is true that exact imitation of a human would avoid the issues of RL, it is harder to create exact imitation than to create superintelligence, and as soon as you have any imperfection in your imitation of a human, you very quickly get back the problems of RL. (Paul’s response: He's not aiming for exact imitation, he wants to deal with this problem by having a strong overseer aka informed oversight, and by having techniques that optimize worst-case performance.)
  4. Fourth, since Paul wants to use big unaligned neural nets to imitate humans, we have to worry about the possibility of adversarial behavior. He has suggested using large ensembles of agents and detecting and pruning the ones that are adversarial. However, this would require millions of samples per unaligned agent, which is prohibitively expensive. (Paul's response: He's no longer optimistic about ensembles and instead prefers the techniques in this post, but he could see ways of reducing the sample complexity further.)

My opinion: Of all of these, I'm most worried about the second and third problems. I definitely have a weak intuition that there are many important tasks that we care about that can't easily be decomposed, but I'm optimistic that we can find out with experiments. For the point about having to train a by-default unaligned neural net to imitate aligned agents, I'm somewhat optimistic about informed oversight with strong interpretability techniques, but I become a lot less optimistic if we think that won't be enough and need to use other techniques like verification, which seem unlikely to scale that far. In any case, I'd recommend reading this post for a good explanation of common critiques of IDA.

(Paul's response: That's not the reason for optimism, it's more that there is no optimization pressure to be unaligned.)

This is the fundamental reason I don't trust Paul to not destroy the world. There is optimization pressure to be unaligned; of course there is! Even among normal humans there are principal-agent problems. To claim that you have removed optimization pressure to be unaligned, despite years of argument to the contrary, is nothing less than willful blindness and self-delusion.

To claim that you have removed optimization pressure to be unaligned

The goal is to remove the optimization pressure to be misaligned, and that's the reason you might hope for the system to be aligned. Where did I make the stronger claim you're attributing to me?

I'm happy to edit the offending text, I often write sloppily. But Rohin is summarizing the part of this post where I wrote "The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn't a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren't internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process." So in this case it seems clear that I was stating a goal.

Even among normal humans there are principal-agent problems.

In the scenario of a human principal delegating to a human agent there is a huge amount of optimization pressure to be misaligned. All of the agents' evolutionary history and cognition. So I don't think the word "even" belongs here.

There is optimization pressure to be unaligned; of course there is!

I agree that there are many possible malign optimization pressures, e.g.: (i) the optimization done deliberately by those humans as part of being competitive, which they may not be able to align, (ii) "memetic" selection amongst patterns propagating through the humans, (iii) malign consequentialism that arises sometimes in the human policy (either randomly or in some situations). I've written about these and it should be obvious they are something I think a lot about, am struggling with, and believe there are plausible approaches to dealing with.

(I think it would be defensible for you to say something like "I don't believe that Paul's writings give any real reason for optimism on these points and the fact that he finds them reassuring seems to indicate wishful thinking," and if that's a fair description of your position then we can leave it at that.)

Curated for the following reasons:

  • This post is contributing to some of the most important questions around AI risk
  • Sets a good example of having a high-level conversation that takes into account various real objections
  • Has been written in a clear and engaging way

The following seem to me to be the most promising avenues for improvement / future work:

  • It is quite long, and there are probably some ways to increase the quality density of the post
  • I definitely have a sense that some aspects of this post and the overall conversation could be improved by trying to formalize things more (though I recognize that formalizing things is hard, and trying to do so might increase the effort required to write this post by a factor of 2 or so)

Reading a description of Paul's ideas in the words of another researcher, who was attempting to sincerely explain why it didn't seem compelling to him, and reading Paul's replies, was really helpful in understanding Paul's research ideas, and I think about this post often (or at least, the ideas I formed reading it).

So, congratulations are in order to the LW team for putting in the work necessary to create the features that Eliezer wanted before coming back (IIRC mostly reign of terror moderation?). Hooray! The Eliezer-posts-things-on-Facebook equilibrium was so much worse for so many reasons, not least of which is how hard it is to search for old FB posts / refer back to them in other discussions.

I've been excited about all the progress that's been made. But, it looks like this was cross-posted from MIRI’s blog, and I think sometimes Rob B posts on EY’s behalf, so I’m not sure we’ve actually hit the "Eliezer is back" condition.

If we haven't yet, here's to getting there soon!

X-and-only-X is what I call the issue where the property that's easy to verify and train is X, but the property you want is "this was optimized for X and only X and doesn't contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system".

If X is "be a competent, catastrophe-free, corrigible act-based assistant", it's plausible to me that an AGI trained to do X is sufficient to lead humanity to a good outcome, even if X doesn't capture human values. For example, the operator might have the AGI develop the technology for whole brain emulations, enabling human uploads that can solve the safety problem in earnest, after which the original AGI is shut down.

Being an act-based (and thus approval-directed) agent is doing a ton of heavy lifting in this picture. Humans obviously wouldn't approve of daemons, so your AI would just try really hard to not do that. Humans obviously wouldn't approve of a Rubik's cube solution that modulates RAM to send GSM cellphone signals, so your AI would just try really hard to not do that.

I think most of the difficulty here is shoved into training an agent to actually have property X, instead of just some approximation of X. It's plausible to me that this is actually straightforward, but it also feels plausible that X is a really hard property to impart (though still much easier to impart than "have human values").

A crux for me whether property X is sufficient is whether the operator could avoid getting accidentally manipulated. (A corrigible assistant would never intentionally manipulate, but if it satisfies property X while more directly optimizing Y, it might accidentally manipulate the humans into doing some Y distinct from human values.) I feel very uncertain about this, but it currently seems plausible to me that some operators could successfully just use the assistant to solve the safety problem in earnest, and then shut down the original AGI.

Corrigibility is doing a ton of heavy lifting in this picture. Humans obviously wouldn't approve of daemons, so your AI would just try really hard to not do that.

I'm a bit confused about how "corrigibility" is being used here. I thought it meant that the agent doesn't resist correction, but here it seems to be used to mean something more like trying to only do things the overseer would approve of.

I thought we called the latter being "approval-directed" and that it was a separate idea from corrigibility. Am I confused?

Oops, I think I was conflating "corrigible agent" with "benign act-based agent". You're right that they're separate ideas. I edited my original comment accordingly.

D-imitations agglomerate to sufficient cognitive power to perform a pivotal act in a way that causes the alignment of the components to be effective upon aligning the whole; and imperfect DD-imitation preserves this property.

This is the crux I currently feel most skeptical of. I don't understand how we could safely decompose the task of emulating 1 year's worth of von Neumann-caliber general reasoning on some scientific problem. (I'm assuming something like this is necessary for a pivotal act; maybe it's possible to build nanotech or whole-brain emulations without such reasoning being automated, in which case my picture for the world becomes rosier.) (EDIT: Rather than "decomposing the task of emulating a year's worth of von Neumann-caliber general reasoning", I meant to say "decomposing any problem whose solution seems to require 1 year's worth of von Neumann-caliber general reasoning".)

In particular, I'm still picturing Paul's agenda as implementing some form of HCH, and I don't understand how anything that looks like an HCH can accumulate new knowledge, synthesize it, and make new discoveries on top of it, without the HCH-humans effectively becoming "human transistors" that implement an AGI. (An analogy: the HCH-humans would be like ants; the AGI would be like a very complicated ant colony.) And unless we know how to build a safe AGI (for example we'd need to ensure it has no daemons), I don't see how the HCH-humans would know how to configure themselves into a safe AGI, so they just wouldn't (if they're benign).

I don't understand how we could safely decompose the task of emulating 1 year's worth of von Neumann-caliber general reasoning on some scientific problem. (I'm assuming something like this is necessary for a pivotal act; maybe it's possible to build nanotech or whole-brain emulations without such reasoning being automated, in which case my picture for the world becomes rosier.)

This reads like a type error: you don't decompose the task "emulate someone spending 1 year solving a scientific problem"; you decompose the problem.

You're right -- I edited my comment accordingly. But my confusion still stands. Say the problem is "figure out how to upload a human and run him at 10,000x". On my current view:

(1) However you decompose this problem, you'd need something equivalent to at least 1 year's worth of a competent scientist doing general reasoning to solve this problem.

(2) In particular, this general reasoning would require the ability to accumulate new knowledge and synthesize it to make novel inferences.

(3) This sort of reasoning would end up happening on a "virtual machine AGI" built out of "human transistors".

(4) Unless we know how to ensure cognition is safe (e.g. daemon-free) we wouldn't know how to make safe "virtual machine AGI's".

(5) So either we aren't able to perform this reasoning (because it's unsafe and recognized as such), or we perform it anyway unsafely, which may lead to catastrophic outcomes.

Which of these points would you say you agree with? (Alternatively, if my picture of the situation seems totally off, could you help show me where?)

(1) However you decompose this problem, you'd need something equivalent to at least 1 year's worth of a competent scientist doing general reasoning to solve this problem.

To clarify: your position is that 100,000 scientists thinking for a week each, one after another, could not replicate the performance of one scientist thinking for 1 year?

I could imagine believing something like that for certain problems requiring unusual creativity or complex concepts that need to be manipulated intuitively. And I could separately imagine having that view for low-bandwidth oversight, where we are talking about humans each of whom gets only <100 bits of input.

I don't understand at all how that could be true for brain uploading at the scale of a week vs. year.

Solving this problem involves considering multiple possible approaches. Those can't be decomposed with 100% efficiency, but it sure seems like they can be split up across people.

Evaluating an approach requires considering a bunch of different possible constraints, considering a bunch of separate steps, building models of relevant phenomena, etc.

Building models requires considering several hypotheses and modeling strategies. Evaluating how well a hypothesis fits the data involves considering lots of different observations. And so on.

To clarify: your position is that 100,000 scientists thinking for a week each, one after another, could not replicate the performance of one scientist thinking for 1 year?

Actually I would be surprised if that's the case, and I think it's plausible that large teams of scientists thinking for one week each could safely replicate arbitrary human intellectual progress.

But if you replaced 100,000 scientists thinking for a week each with 1,000,000,000,000 scientists thinking for 10 minutes each, I'd feel more skeptical. In particular I think 10,000,000 10-minute scientists can't replicate the performance of one 1-week scientist, unless the 10-minute scientists become human transistors. In my mind there isn't a qualitative difference between this scenario and the low-bandwidth oversight scenario. It's specifically dealing with human transistors that I worry about.

I also haven't thought too carefully about the 10-minute-thought threshold in particular and wouldn't be too surprised if I revised my view here. But if we replaced "10,000,000 10-minute scientists" with "arbitrarily many 2-minute scientists" I would even more think we couldn't assemble the scientists safely.

I'm assuming in all of this that the scientists have the same starting knowledge.

There's an old SlateStarCodex post that's a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists' epistemic process is fundamentally similar to that of the alchemists. And the alchemists' thoughts were constrained by their lifespan, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints present for HCHs.

I also imagine them to be far more constraining if "thought-lifespans" shrank from ~30 years to ~30 minutes. But "thought-lifespans" on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.

I don't understand at all how that could be true for brain uploading at the scale of a week vs. year.
Solving this problem involves considering multiple possible approaches. Those can't be decomposed with 100% efficiency, but it sure seems like they can be split up across people.
Evaluating an approach requires considering a bunch of different possible constraints, considering a bunch of separate steps, building models of relevant phenomena, etc.
Building models requires considering several hypotheses and modeling strategies. Evaluating how well a hypothesis fits the data involves considering lots of different observations. And so on.

I agree with all this.

EDIT: In summary, my view is that:

  • if all the necessary intellectual progress can be distilled into individual scientists' heads, I feel good about HCH making a lot of intellectual progress
  • if the agents are thinking long enough (1 week seems long enough to me, 30 minutes doesn't), this distillation can happen.
  • if this distillation doesn't happen, we'd have to end up doing a lot of cognition on "virtual machines", and cognition on virtual machines is unsafe.

There's an old SlateStarCodex post that's a reasonable intuition pump for my perspective. It seems to me that the HCH-scientists' epistemic process is fundamentally similar to that of the alchemists. And the alchemists' thoughts were constrained by their lifespan, which they partially overcame by distilling past insights to future generations of alchemists. But there still remained massive constraints on their thoughts, and I imagine qualitatively similar constraints present for HCHs.
I also imagine them to be far more constraining if "thought-lifespans" shrank from ~30 years to ~30 minutes. But "thought-lifespans" on the order of ~1 week might be long enough that the overhead from learning distilled knowledge (knowledge = intellectual progress from other parts of the HCH, representing maybe decades or centuries of human reasoning) is small enough (on the order of a day or two?) that individual scientists can hold in their heads all the intellectual progress made thus far and make useful progress on top of that, without any knowledge having to be distributed across human transistors.

In order for this to work, you need to be able to break apart the representation of the knowledge as well as the actual work they are doing. For example, you need to be able to pass around objects like "The theory that reality is the unique object satisfying both constraints {A} and {B}", with one person responsible for representing {A} and another responsible for representing {B}.
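Concretely, here is a toy sketch (illustrative code with made-up names, not a real implementation or anyone's actual proposal) of the kind of object I mean, where knowledge is passed around by reference and different agents are responsible for different parts:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class DistributedClaim:
    """A claim no single agent holds in full: the text refers to its parts only
    by name, and a different (short-lived) agent is responsible for each part."""
    summary: str                             # e.g. "the unique object satisfying A and B"
    parts: Dict[str, Callable[[str], str]]   # part name -> agent answering questions about it

    def ask(self, part_name: str, question: str) -> str:
        # Route the question to whichever agent represents that constraint.
        return self.parts[part_name](question)

# Hypothetical usage: one agent "represents" constraint A, another constraint B.
theory = DistributedClaim(
    summary="reality is the unique object satisfying constraints A and B",
    parts={
        "A": lambda q: f"(agent responsible for A answers: {q})",
        "B": lambda q: f"(agent responsible for B answers: {q})",
    },
)
print(theory.ask("A", "does A rule out observation O?"))
```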

My impression of your concern is that, if knowledge is represented this way instead of in a particular scientist's head, then they can't manipulate it well without being transistors.

Do you have some particular kinds of manipulation in mind, that humans are able to do with knowledge in their head, but you don't think a group of humans can do if the knowledge is distributed across all of them?

One family of concerns people have raised is about the optimization done within amplification:

  • Sometimes humans solve problems with a stroke of creative insight. These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may e.g. turn up an attack on the solution-evaluating process. The proposal-evaluating process can be much "larger" than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator. I think the most interesting question for security there is whether the "evaluate a solution" process is itself decomposable with low bandwidth oversight (though there are other ways that security could be unachievable).
  • If they need to represent a hypothesis about reality by doing purely mechanical calculations and observing that they predict well, then maybe that theory will be an optimization daemon. I think there are cases of "opaque" hypotheses where humans can't break up internal structure. But an optimization daemon has to actually think thoughts, including thoughts about how to e.g. subvert the system. So it seems to me that as long as understanding those thoughts is a task that is decomposable, we can defend against optimization daemons by looking over a hypothesis and evaluating whether it's doing anything bad.

In these cases, it seems to me like the putatively indecomposable task is OK, as long as you can solve some other tasks by amplification (doing secure evaluation of proposed solutions, evaluating a hypothesis to test if it is doing problematic optimization). In these cases, it seems to me like the constituent tasks are easier in a qualitative sense (e.g. if I do some search and want to evaluate whether a hypothesis is a daemon, I'm only going to have to do easier searches within that evaluation---namely, the kinds of searches that are done internally by the daemon in order to make sense of the world), such that we aren't going to get a loop and can carry out an induction.
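As a toy illustration of the first bullet (hypothetical stand-in functions, not a real implementation): the search itself can be dumb, and all of the safety burden sits on the evaluator, which must remain trustworthy even on adversarial candidates the search turns up.

```python
import random

def brute_force_search(propose, evaluate_securely, n_samples=10_000, seed=0):
    """Sample candidate solutions from a human-like proposal distribution and
    keep the one the (amplified) evaluator scores highest. `propose` and
    `evaluate_securely` are hypothetical stand-ins supplied by the caller."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = propose(rng)              # sample from the proposal distribution
        score = evaluate_securely(candidate)  # decomposable, adversarially robust evaluation
        if score > best_score:
            best, best_score = candidate, score
    return best
```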

Another family of concerns is that humans have indecomposable abilities:

  • Perhaps a human has learned to do task X, and a good algorithm for X is now encoded in the weights of their brain, and can only be used by running their brain on the same inputs they encountered while learning to do task X. (Thanks to Wei Dai for pointing out this tight impossibility argument, and I discussed it a bit under "An Example Obstruction" in the original post.) In particular, there is no way to get access to this knowledge with low bandwidth oversight. In the case of scientific inquiry, accessing the scientist's training may require having the human actually hold an entire scientific hypothesis in their head.

In this case we can't recover "ability at task X" by amplification except by redoing it from scratch. If the human's knowledge about task X depended on facts about the external world, then we can't recover that knowledge except by interacting with the external world.

But we already knew that amplification wasn't going to encode empirical knowledge about the world without interacting with the world; the point was to converge to a good policy for handling empirical data as empirical data comes in. The real question is whether HCH converges to arbitrarily sophisticated behavior in the limit. To answer that question we'd want to ask: if the human had never trained to do task X, would they still be "universal" in some appropriate sense?

To answer that question, our example of something indecomposable can't just be a task where empirical information about the world (or logical information too expensive to be learned via the amplification process) is encoded in the human's brain, because we are happy to drop empirical information about the world and instead learn a policy that maps {data} --> {behavior}, and give that policy access to all the empirical information it needs.

Does your concern fit in one of those two categories, or in some different category?

These cases can be simulated by a brute force search for solutions, perhaps using samples generated by the human proposal distribution. But then we are introducing a powerful optimization, which may e.g. turn up an attack on the solution-evaluating process. The proposal-evaluating process can be much "larger" than the brute force search, so the question is really whether with amplification we can construct a sufficiently secure solution-evaluator.

I'm actually not sure the brute force search gives you what you're looking for here. There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space (this is often possible, but not necessarily possible; consider solutions with real variables where a simple discretization is not obviously valid). Even if this is the case, it seems like you're giving up on being competitive on speed by saying "well, we could just use brute force search." (It also seems to me like you're giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is because you can be also doing safety-evaluation effort there, such that more dangerous solutions are less likely to be considered in the first place.)

My intuition is that many numerical optimization search processes have "wide" state, in that you are thinking about the place where you are right now, the places you've been before, and previous judgments you've made about places to go. Sometimes this state is not actually wide because it can be compressed very nicely; for example, in the simplex algorithm, my state is entirely captured by the tableau, and I can spin up different agents to take the tableau and move it forward one step and then pass the problem along to another agent. But my intuition is that such times will be times when we're not really concerned about daemons or misalignment of the optimization process itself, because the whole procedure is simple enough that we understand how everything works together well.
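To make the simplex case concrete, here is a minimal sketch (toy code, not any real system) of "each agent takes the tableau, does one pivot, and hands it off": the full state fits in the tableau, so stateless short-lived agents suffice.

```python
import numpy as np

def one_pivot(tableau):
    """One Dantzig-rule pivot on a standard maximization tableau (slack
    variables included, RHS in the last column, negated objective row last).
    Returns (new_tableau, finished)."""
    t = np.asarray(tableau, dtype=float).copy()
    obj = t[-1, :-1]
    if np.all(obj >= -1e-9):                   # no negative reduced cost: optimal
        return t, True
    col = int(np.argmin(obj))                  # entering column
    column = t[:-1, col]
    if np.all(column <= 1e-9):
        raise ValueError("unbounded LP")
    ratios = np.where(column > 1e-9, t[:-1, -1] / column, np.inf)
    row = int(np.argmin(ratios))               # leaving row (min-ratio test)
    t[row] /= t[row, col]
    for r in range(t.shape[0]):
        if r != row:
            t[r] -= t[r, col] * t[row]
    return t, False

def run_with_short_lived_agents(tableau, max_agents=100):
    for _ in range(max_agents):
        tableau, finished = one_pivot(tableau)  # each call plays the role of one agent
        if finished:
            break
    return tableau

# maximize 3x + 2y s.t. x + y <= 4, x + 3y <= 6, x, y >= 0  (optimum is 12)
tableau = [[1, 1, 1, 0, 4],
           [1, 3, 0, 1, 6],
           [-3, -2, 0, 0, 0]]
print(run_with_short_lived_agents(tableau)[-1, -1])  # prints 12.0
```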

But if it is wide or deep, then it seems like this strategy is probably going to run into obstacles. We either attempt to implement something deep as the equivalent of recursive function calls, or we discover that we have too much state to successfully pass around, and thus there's not really a meaningful sense in which we can have separate short-lived agents (or not really a meaningful sense in which we can be competitive with agents that do maintain all that state).

For example, think about implementing tree search for games in this way. No one agent sees the whole tree; each agent only determines which children to pass messages to and what message to return to its parent. If we think that the different branches are totally distinct from each other, then we only need vertical message-passing and we can have separate short-lived agents (although it's sort of hard to see the difference between an agent that's implementing tree-search in one thread and many threads because of how single agents can implement recursive functions). But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree. (To be clear, AlphaGo has everything call an intuition network which is only trained between games, and thus could be implemented in a 'vertical' fashion if you have the intuition network as part of the state of each short-lived agent, but you could imagine an improvement on AlphaGo that's refining its intuition as it considers branches in the game that it's playing, and that couldn't be implemented without this horizontal linkage.)
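A minimal sketch of the purely "vertical" version, where each call stands in for a short-lived agent that sees only its own node, asks fresh agents about its children, and returns a single value to its parent (`evaluate` and `children` are hypothetical stand-ins; there is no shared, mid-game-updated intuition state, which is exactly the horizontal linkage at issue):

```python
def node_agent(position, depth, evaluate, children):
    """Negamax as stateless vertical message-passing. Assumes `evaluate`
    scores a position from the perspective of the player to move."""
    moves = children(position)
    if depth == 0 or not moves:
        return evaluate(position)          # leaf: local judgment only
    # Each child is scored by a fresh "agent" and negated for the opponent.
    return max(-node_agent(child, depth - 1, evaluate, children)
               for child in moves)
```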

My sense is that the sorts of creative scientific or engineering problems that we're most interested in are ones where this sort of wide state is relevant and not easily compressible, such that I could easily imagine a world where it takes the scientist a week to digest everything that's happened so far, leaving no time to actually move things forward before they vanish and are replaced by a scientist who spends a week digesting everything, and so on.

As a side note, I claim the 'recursive function' interpretation implies that the alignment of the individual agents is irrelevant (so long as they faithfully perform their duties) and the question of whether tree search was the right approach (and whether the leaf evaluation function is good) becomes central to evaluating alignment. This might be something like one of my core complaints, that it seems like we're just passing the alignment buck to the strategy of how to integrate many small bits of computation into a big bit of computation, and that problem seems just as hard as the regular alignment problem.

Even if this is the case, it seems like you're giving up on being competitive on speed by saying "well, we could just use brute force search."

The efficiency of the hypothetical amplification process doesn't directly affect the efficiency of the training process much. It affects the number of "rounds" of amplification you need to do, but the rate is probably limited mostly by the ability of the underlying ML to learn new stuff.

There needs to be an ordering on solutions-to-evaluate such that you can ensure the evaluators are pointed at different solutions and cover the whole solution space

You can pick randomly.

(It also seems to me like you're giving up on safety, as you point out later; one of the reasons why heuristic search methods for optimization seem promising to me is because you can be also doing safety-evaluation effort there, such that more dangerous solutions are less likely to be considered in the first place.)

I agree that this merely reduces the problem of "find a good solution" to "securely evaluate whether a solution is good" (that's what I was saying in the grandparent).

or we discover that we have too much state to successfully pass around, and thus there's not really a meaningful sense in which we can have separate short-lived agents

The idea is to pass around state by distributing it across a large number of agents. Of course it's an open question whether that works, that's what we want to figure out.

(or not really a meaningful sense in which we can be competitive with agents that do maintain all that state)

Again, the hypothetical amplification process is not intended to be competitive, that's the whole point of iterated amplification.

But if we think that the different branches are mutually informative, then we want to have a linkage between those branches, which means horizontal links in this tree

Only if we want to be competitive. Otherwise you can simulate horizontal links by just running the entire other subtree in a subcomputation. In the case of iterated amplification, that couldn't possibly change the speed of the training process, since only O(1) nodes are actually instantiated at a time anyway and the rest are distilled into the neural network. What would a horizontal link mean?
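To gesture at what I mean, a schematic sketch (hypothetical function names, not a real implementation) of why only O(1) nodes are live at a time: the amplification step instantiates one decomposition, answers the subquestions with the current distilled model, and everything deeper already lives inside that model.

```python
def amplify(question, decompose, combine, distilled_model):
    """One amplification step: H decomposes, the distilled model answers the
    subquestions, and H combines the answers. Only O(1) calls are 'live'."""
    subquestions = decompose(question)
    subanswers = [distilled_model(q) for q in subquestions]
    return combine(question, subanswers)

def distill(model, questions, decompose, combine, fit):
    """Train the model to imitate the amplified system on sampled questions,
    then iterate with the improved model."""
    targets = [amplify(q, decompose, combine, model) for q in questions]
    return fit(model, questions, targets)
```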

the intuition network as part of the state of each short-lived agent

The intuition network is a distillation of the vertical tree, it's not part of the amplification process at all.

and that couldn't be implemented without this horizontal linkage

I don't think that's right; also, I don't see how a 'horizontal' linkage would help compared with a normal vertical linkage: just unroll the computation.

are ones where this sort of wide state is relevant and not easily compressible

The main thing I'm looking for are examples of particular kinds of state that you think are incompressible. For example, do you think modern science has developed kinds of understanding that couldn't be distributed across many short-lived individuals (in a way that would let you e.g. use that knowledge to answer questions that a long-lived human could answer using that knowledge)?

Last time this came up Eliezer used the example of calculus. But I claim that anything you can formalize can't possibly have this character, since you can distribute those formal representations quite easily, with the role of intuition being to quickly reach conclusions that would take a long time using the formal machinery. That's exactly the case where amplification works well. (This then led to the same problem with "if you just manipulate things formally, how can you tell that the hypothesis is just making predictions rather than doing something evil, e.g. can you tell that the theory isn't itself an optimizer?", which is what I mentioned in the grandparent.)

This post is close in my mind to Alex Zhu's post Paul's research agenda FAQ. They each helped to give me many new and interesting thoughts about alignment. 

This post was maybe the first time I'd seen an actual conversation about Paul's work between two people who had deep disagreements in this area - where Paul wrote things, someone wrote an effort-post response, and Paul responded once again. Eliezer did it again in the comments of Alex's FAQ, which also was a big deal for me in terms of learning.

This piece was helpful in outlining how different people in the AI safety space disagree, and what the issues with Paul's approaches seem to be. Paul's analogies with solving hard problems were especially interesting to me (the point that most problems don't seem to occupy a position midway between totally impossible and solvable). The inline comments by Paul were also good to read as counterpoints to Eliezer's responses.

(Eli's personal notes, mostly for his own understanding. Feel free to respond if you want.)

My summary of what Eliezer is saying (in the middle part of the post):

  • The imitation-agents that make up the AI must be either _very_ exact imitations (of the original agents), or not very exact imitations.
    • If the agents are very exact imitations, then...
      • 1. You need an enormous amount of computational power to get them to work, and
      • 2. They must already be very superintelligent, because imitating a human exactly is very much an AI-complete task. If Paul's proposal depends on exact imitation, that is to say that it doesn't work until we've reached very superintelligent capability, which seems alarming.
    • If the agents are not very exact imitations, then...
      • Either,
        • 1. Your agents aren't very intelligent, or
        • 2. You run into the x-and-only-x problem and your inexact imitations don't guarantee safety. They can imitate the human, but also be doing all kinds of things that are unsafe.

Paul seems to respond by saying that:

1. We're in the inexact imitation paradigm.

2. He intends to solve the x-and-only-x problem via other external checks (which, crucially, rely on having a smarter overseer that you can trust).

(Eli's personal notes, mostly for his own understanding. Feel free to respond if you want.)

It seems to me obvious, though this is the sort of point where I've been surprised about what other people don't consider obvious, that in general exact imitation is a bigger ask than superior capability. Building a Go player that imitates Shuusaku's Go play so well that a scholar couldn't tell the difference, is a bigger ask than building a Go player that could defeat Shuusaku in a match. A human is much smarter than a pocket calculator but would still be unable to imitate one without using a paper and pencil; to imitate the pocket calculator you need all of the pocket calculator's abilities in addition to your own.

Because imitation is a very exact target. There are many ways to be "as skilled at X as Y is", but few (one?) way(s) to be "indistinguishable from Y in the domain of X."

(Eli's personal notes, mostly for his own understanding. Feel free to respond if you want.)

I don't understand Paul's model of how a ton of little not-so-bright agents yield a big powerful understanding in aggregate, in a way that doesn't effectively consist of them running AGI code that they don't understand.

My understanding was that Paul doesn't think he knows how to do this, and in fact considers it one of the primary open problems of his approach. (Though a 10 minute search through his posts on AI alignment did not uncover that, so maybe I made it up.)

I think there is a good chance that humans could learn to break this kind of task (e.g. designing Hessian-free optimization) into tiny pieces, with moderate training and experience, in a way that looks "fair", i.e. not like acting as human transistors. If so, I'm optimistic that we will be able to get good demonstrations of that fact relatively soon, within something like 1-2 years.

For hard tasks these won't take the form of complete demonstrations, they will either be (a) examples using ML automation (which will need to wait until ML is strong enough, so won't get tasks as complicated as "design Hessian-free optimization" until very close to the end) or (b) an interactive protocol where some parts of the deliberation are left as stubs, simulated by normal humans, and then fleshed out into detailed deliberation based on challenges.

It has been 2 years. Have said demonstrations materialized?

I think not.

For the kinds of questions discussed in this post, which I think are easier than "Design Hessian-Free Optimization" but face basically the same problems, I think we are making reasonable progress. I'm overall happy with the progress but readily admit that it is much slower than I had hoped. I've certainly made updates (mostly about people, institutions, and getting things done, but naturally you should update differently).

Note that I don't think "Design Hessian-Free Optimization" is amongst the harder cases, and these physics problems are a further step easier than that. I think that sufficient progress on these physics tasks would satisfy the spirit of my remark 2y ago.

I appreciate the reminder at the 2y mark. You are welcome to check back in 1y later and if things don't look much better (at least on this kind of "easy" case), treat it as a further independent update.

So I also don't see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.

My model of Paul's approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (ie. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully about how the subagents are supposed to act in advance (in a way that could potentially be run just by using a lookup table).

(Eli's notes, mostly for his own understanding. Feel free to respond if you want.)

The bottleneck I named in my last discussion with Paul was, "We have copies of a starting agent, which run for at most one cumulative day before being terminated, and this agent hasn't previously learned much math but is smart and can get to understanding algebra by the end of the day even though the agent started out knowing just concrete arithmetic. How does a system of such agents, without just operating a Turing machine that operates an AGI, get to the point of inventing Hessian-free optimization in a neural net?"

Yeah. It seems to me that the system Paul outlines can't do this task.

(Eli's personal notes, mostly for his own understanding. Feel free to respond if you want.)

If you have a big aggregate of agents that understands something the little local agent doesn't understand, the big aggregate doesn't inherit alignment from the little agents. Searle's Chinese Room can understand Chinese even if the person inside it doesn't understand Chinese, and this correspondingly implies, by default, that the person inside the Chinese Room is powerless to express their own taste in restaurant orders.

vs.

The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn't a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble. My goal is to (a) first construct weaker agents who aren't internally doing problematic optimization, (b) put them together in a way that improves capability without doing other problematic optimization, (c) iterate that process.

Both these views make some sense to me.

One question that comes to mind is this: do regular bureaucracies exhibit unaligned behavior? It seems like the answer is broadly "yes, but only moderately unaligned." It seems like actual companies are an example of how one can get superintelligent output from humanly intelligent parts, in a way that doesn't seem well described as the parts in aggregate "effectively...running AGI code that they don't understand." And they don't exhibit wildly unaligned behavior because the executives of the company do have a pretty good idea of the whole picture.

(Of course, those executives don't have much detail in their overview. They need to rely on middle managers to make sure that nothing really bad is happening in the individual departments and individual teams. But it seems like there's not much that small teams of humans can do, because their power is pretty limited. The same would be true in Paul's proposal.)

It seems to me that Eliezer's point is broadly correct in the sense that a series of small agents can be organized in such a way that they are effectively emulating an unaligned superintelligence that they don't understand. But not all aggregates of small agents have this property, particularly if they are arranged in a hierarchy where the top levels have a high-level view of the planning and execution.

I would solve X-and-only-X in two steps:
First, given an agent and an action which has been optimized for undesirable consequence Y, we'd like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I'm calling informed oversight.
Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won't, or enough understanding of its internals that you can see why it won't. This is discussed in “Techniques for Optimizing Worst-Case Performance.”
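As a very rough sketch of one possible reading of the weight-sharing idea in the first step (made-up names and sizes; the explanation head is a stand-in for a real decoder, and this is not anyone's actual implementation): the action head and the explanation head read the same trunk activations, so explanations are produced from the same computation that chose the action.

```python
import torch
import torch.nn as nn

class AgentWithExplanationHead(nn.Module):
    """Toy sketch: a shared trunk feeds both the acting head and an
    explanation head, so the overseer can inspect explanations grounded in
    the computation that produced the action."""
    def __init__(self, obs_dim=64, hidden=128, n_actions=10, explain_dim=32):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.action_head = nn.Linear(hidden, n_actions)
        self.explanation_head = nn.Linear(hidden, explain_dim)

    def forward(self, obs):
        h = self.trunk(obs)                    # shared computation
        return self.action_head(h), self.explanation_head(h)

# Hypothetical usage: the overseer scores the action given the explanation features.
action_logits, explanation = AgentWithExplanationHead()(torch.randn(1, 64))
```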

Paul, I'm curious whether you'd see it as necessary, for these techniques to work, that the optimization target is pretty good/safe (but not perfect): i.e., some safety comes from the fact that the agents optimized for approval or imitation only have a limited class of Y's that they might also end up being optimized for.

I don't think so, but I'm not sure I understand exactly what you mean.

[Eli's personal notes. Feel free to comment or ignore.]

My summary of Eliezer's overall view:

  • 1. I don't see how you can get cognition to "stack" like that, short of running a Turing machine made up of the agents in your system. But if you do that, then we throw alignment out the window.
  • 2. There's this strong X-and-only-X problem.
    • If our agents are perfect imitations of humans, then we do solve this problem. But having perfect imitations of humans is a very high bar that depends on having a very powerful superintelligence already. And now we're just passing the buck. How is that extremely powerful superintelligence aligned?
    • If our agents are not perfect imitations, it seems like we have no guarantee of X-and-only-X.
      • This might still work, depending on the exact ways in which the imitation deviates from the subject, but most of the plausible ways seem like they don't solve this problem.
        • And regardless, even if it only deviated in ways that we think are safe, we would want some guarantee of that fact.

(Eli's personal notes, mostly for his own understanding. Feel free to respond if you want.)

The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Unalignment isn't a thing that magically happens; it’s the result of specific optimization pressures in the system that create trouble.

Definitely agree that even if the agents are aligned, they can implement unaligned optimization, and then we're back to square one. Amplification only works if we can improve capability without doing unaligned optimization.

It's important that my argument for alignment-of-amplification goes through not doing problematic optimization. So if we combine that with a good enough solution to informed oversight and reliability (and amplification, and the induction working so far...), then we can continue to train imperfect imitations that definitely don't do problematic optimization. They'll mess up all over the place, and so might not be able to be competent (another problem amplification needs to handle), but the goal is to set things up so that being a lot dumber doesn't break alignment.

It seems like Paul thinks that "sure, my aggregate of little agents could implement an (unaligned) algorithm that they don't understand, but that would only happen as the result of some unaligned optimization, which shouldn't be present at any step."

It seems like a linchpin of Paul's thinking is that he's trying to...

1) initially set up the situation such that there is no component that is doing unaligned optimization (Benignity, Approval-directed agents), and

2) ensure that at every step, there are various checks that unaligned optimization hasn't been introduced (Informed oversight, Techniques for Optimizing Worst-Case Performance).

One possible problem with the scheme is that level N assistants will have access to a lot of power, in the form of level N-1 assistants. If they have even a little desire for personal happiness, they might be tempted to trade. Such trade can make the whole bureaucracy corrupt or even harmful.

"If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back."

I'm not saying that this isn't true, but it sounds like a world-wrecking assumption that can't be mathematically proved, so it looks worthwhile to question it. Suppose you take the policy that the goal function of everyone in the AI alignment / transhumanism movement is basically similar, and you will let individuals get that sort of power. You have created a huge incentive for people with other goals to pretend to have that goal, work their way into the community, and then turn around and do something different.

If we treat blog posts as a good indication of someone's values, and use them to decide who gets such power, then they stop being good indicators, because people will lie. If you don't hand people that power just because they claim to have good values, then the claimed values do indicate real values. Goodhart's law in action.

It's even possible that no one can be trusted with that power. Suppose that Fair Utopia has a utility of 99 to everyone, while the world where person X is in charge has a utility of 100 to person X and 0 to everyone else.

I'm similarly concerned about loose talk about assessing the alignment of specific humans, given that there seem to be no generally agreed-upon precise criteria by which to assess alignment.

I think I see how X-and-only-X is a problem if we are using a classifier to furnish a 0/1 reward. However, it seems like less of a problem if we're using a regression model to furnish a floating point reward that attempts to describe all of our values (not just our values as they pertain to the completion of one particular task).

Suppose we are granted a regression model which accurately predicts the value we assign to any event which happens in the world. If this model furnishes the AI's reward function, it creates pressure to avoid optimizing for hidden Ys we don't want: Since we don't want them, the regression model gives them a negative score, and the AI works to avoid them.
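As a toy contrast between the two reward schemes (hypothetical stand-in functions, just to make the distinction concrete):

```python
def classifier_reward(outcome, task_completed):
    # 0/1 reward: only asks "did X happen?", silent about hidden side effects Y.
    return 1.0 if task_completed(outcome) else 0.0

def regression_reward(outcome, value_model):
    # Scalar reward from a (hypothetical) learned model of everything we care
    # about: unwanted side effects Y reduce the score, so the AI avoids them.
    return value_model(outcome)
```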

A regression model which accurately predicts our values is a huge ask. But I'm not sure getting from here to there would require solving new basic problems. Instead, it seems to me like we'd need to get much better at an existing problem: building models with high predictive accuracy in complex domains.

Maybe you don't think we will get to the necessary level of accuracy by hill-climbing our existing predictive model tech, and this is what will create the new basic problems?

An AI that learns to exactly imitate humans, not just passing the Turing Test to the limits of human discrimination on human inspection, but perfect imitation with all added bad subtle properties thereby excluded, must be so cognitively powerful that its learnable hypothesis space includes systems equivalent to entire human brains. I see no way that we're not talking about a superintelligence here.

"Superintelligence" is a word which, to me, suggests a qualitative shift relative to existing hypothesis learning systems. Existing hypothesis learning systems don't attempt to maximize paperclips or anything like that--they're procedures that search for hypotheses which fit data.

There are many quantitative axes along which such procedures can be compared: How much time does the procedure take? How much data does the procedure require? How complex can the data be? How well do the resulting hypotheses generalize? Etc.

I don't see any reason to think we will see sudden qualitative shifts as our learning procedures improve along these quantitative axes. Therefore, I suspect the word "superintelligence" has connotations that aren't actually necessary for the operation of an extremely advanced hypothesis search system. We already have hypothesis learning systems that are superhuman at e.g. predicting stock prices, but these systems don't seem to be trying to break out of their boxes or anything like that. I'm not sure why a hypothesis learning system which is a superhuman neuroscientist would be different.

We have no guarantee of non-Y for any Y a human can't detect, which covers an enormous amount of lethal territory

I think there are two things that might be worth separating here: malign plans that are disguised as benign plans, and undetectable imperfections broadly speaking. The key difference is whether the undetectable imperfection is a result of deliberate deception on the AI's part, vs the broad phenomenon of systems that have very high (but not perfect) fidelity.

Suppose our emulation of Paul has very high (but not perfect) fidelity, and Paul is not the sort of person who will disguise a malign plan as a benign plan. In this case, we're likely to see the second phenomenon, but not the first--the first phenomenon would require a gross error in our emulation of Paul, and by assumption our emulation of Paul is very high fidelity.

I think a good case has been made that we need to be very worried about malign plans disguised as benign plans. I'm not personally convinced we need to be very worried about undetectable imperfections more broadly.

So we cannot for example let an untrusted superintelligence originate queries that it can use to learn human behavior; it has to be strictly unsupervised example-based learning rather than a query model.

I found the use of "unsupervised" confusing in this context ("example-based" sounds like supervised learning, where the system gets labeled data). I think maybe passive vs. active learning is the distinction you are looking for?

What standard is your baseline for a safe AGI?

If it is 'a randomly generated AGI meeting the safety standard is no more dangerous than a randomly selected human intelligence', this proposal looks like it is intended to guarantee alignment no worse than that of an average-case human.

Not sure it hits that target, but it looks like it's aiming for it. I understand your argument to be that the worst case AI alignment in the scheme could be as bad as the worst human amplified, and that you have no way of assessing the average or worst case alignments prior to firing up the machine.

The HI (Human Intelligence) alignment/safety problem is presently unsolved, in that it is impossible to predict the future alignment of a specific human with absolute certainty. This is awkward for many industries that require high reliability humans. I have long suspected that the AGI alignment problem will ultimately reduce to a case of the HI alignment problem (take an HI, give it infinite capability to both act on the world and hide its actions, along with instantaneous cognition, and now you have an AGI-equivalent).

The default solution, per a paper I've seen about the IQ-based communication barrier (no citation handy), is essentially 'humans reflexively mistrust other humans with 30 or more additional IQ points'.

The challenges created by the possibility of a misaligned HI can obviously be solved locally by the model implemented in Equatorial Guinea in the 1970s: https://en.m.wikipedia.org/wiki/Francisco_Macías_Nguema but this denies us the benefits of their potential outputs.

If you could snap your fingers and build an AGI tomorrow, knowing the state of alignment research, would you do it?

How risk tolerant are you relative to other humans who could enable emergence of AGI?

Of course, you have to wonder: are all those copies of Paul Christiano suffering? They were said to be very similar to the original. Probably no more than the original felt: slightly bored, an itchy leg, etc. If they really are perfect copies, could they realize they were copies and have an existential freak-out? Is it murder to turn off all those similar copies once the job's done? Will the copies think it is, and go on a slow strike to live longer?

[Moderator note]: This comment is bringing up a topic that is quite interesting and has had quite a lot of good discussion on LessWrong (see a lot of the writing about hellscapes, near misses, and Robin Hanson's thoughts on ems). But it doesn't seem relevant to the main points of the OP. If you want to have a discussion about this, I recommend you create a new post on your personal blog.