Another take on agent foundations: formalizing zero-shot reasoning

zhukeepa

After spending more time thinking about MIRI’s agenda, I’ve arrived at another framing of it, which I’m calling zero-shot reasoning. [1]

This post is a distillation of my intuitions around zero-shot reasoning. These views should be taken as my own, and not MIRI’s.

A quick summary of this post:

“Zero-shot reasoning” refers to the ability to get things right on the first try, no matter how novel or complicated.
In simple domains, like mathematics, zero-shot reasoning is fully captured by formal verification of proofs. In more general domains, zero-shot reasoning requires an extension of formal verification that can be applied to real-world plans.
MIRI-esque philosophical work is necessary to extend formal verification to more general domains.
A formal account of zero-shot reasoning will likely be unimportant for aligning the world’s first AGIs, but will likely be essential for aligning a recursively self-improving AGI.
Humanity will most likely end up building a recursively self-improving AGI (plausibly because of insufficient coordination around not building a recursively self-improving AGI).
We can probably delegate much of the work of formalizing zero-shot reasoning to a post-AGI society, but working on zero-shot reasoning today nevertheless substantially increases the odds that our first recursively self-improving AGI is aligned.

What is zero-shot reasoning?

Few-shot reasoning vs zero-shot reasoning

The world is largely chaotic and unpredictable, yet humans can ideate and successfully execute on hugely conjunctive plans with many moving parts, many of which are the first of their kind. We can ship massive software projects and send rockets to space. Some people can build billion-dollar companies over and over again.

On the other hand, it’s clear that human abilities to do this are limited. Big software projects invariably have critical bugs and security flaws. Many of our spacecraft have exploded before making it to space. Most people can’t build a company to save their lives, and most successful entrepreneurs fail when building a second company.

Native human software is capable of doing something I’ll call few-shot reasoning—when performing complex reasoning and planning to accomplish something novel, humans can usually get it right after a few rounds of iteration. The more dissimilar this reasoning is to prior reasoning they've done, the more rounds of iteration they need.

I think something like zero-shot reasoning (the ability to perform arbitrarily novel and complex reasoning while having calibrated high confidence in its soundness) is possible in principle. A superintelligent zero-shot reasoner would be able to:

Build an operating system as complex as Microsoft Windows in assembly language, without serious bugs, without once running the code
Build one spacecraft that lands on Mars, after observing Earth for one day, without building any other spacecraft
Amass $1 trillion over three years

It should be able to do these all with extremely high confidence. [2]

Zero-shot reasoning might seem like magic, but in fact humans have some native capacity for it in some limited domains, like pure mathematics. Given a conjecture to prove or disprove, a human can start from a small set of axioms and combine them in extraordinarily novel and complex ways to confidently arrive at a proof or disproof of the conjecture.

That said, humans do make mistakes in mathematical proofs. But with the assistance of formal verification tools like Coq, humans can become extremely confident that their proofs are error-free, no matter how novel or complex they are. [3]

In a similar vein, humans can in principle build an operating system as complex as Microsoft Windows in assembly language, without serious bugs. Even if they’re writing a huge amount of code, they could formally verify each component they build up, and formally verify that compositions of these components work as desired. While this can only give them guarantees about what they prove, they can very confidently avoid certain classes of bugs (like memory leaks, buffer overflows, and deadlocks) without ever having to run the code.
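As a toy illustration of verifying a component and then a composition of components, here is a minimal sketch in Lean 4. The choice of Lean rather than Coq, the hypothetical functions `double` and `quadruple`, and the use of the `omega` tactic (available in recent Lean 4 toolchains) are all assumptions made for this example. The point is only that the checker accepts the second theorem by reusing the first, without the code ever being run.

```lean
-- A small "component" and its specification.
def double (n : Nat) : Nat := n + n

theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega

-- A composition of that component with itself.
def quadruple (n : Nat) : Nat := double (double n)

-- The composition is verified by appealing to the component's proven spec.
theorem quadruple_spec (n : Nat) : quadruple n = 4 * n := by
  unfold quadruple
  rw [double_spec, double_spec]
  omega
```

As the text notes, this guarantees only what is stated in the theorems: nothing here says anything about properties that were never specified.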

Formalizing zero-shot reasoning

Formal verification provides a formal account of zero-shot reasoning in limited domains (namely, those describable by formal axiomatic systems, like mathematics or software). I think a formal account of more general zero-shot reasoning will involve an extension of formal verification to real-world, open-ended domains, that would give us some way of “formally verifying” that plans for e.g. building a rocket or amassing wealth will succeed with high probability.

Note that much of general zero-shot reasoning consists of subcomponents that involve reasoning within formal systems. For example, when building rockets, we do a lot of reasoning about software and Newtonian physics, both of which can be construed as formal systems.

In addition to formal verification, a complete formal account of general zero-shot reasoning will require formalizing other aspects of reasoning:

Making and trusting abstractions: Humans are capable of turning sense data into abstract formal systems like Newtonian mechanics, and then deciding under which situations it’s appropriate to apply those abstractions. How do we formalize what an abstraction is, how to make abstractions, and under which circumstances to trust them?
Bounded rationality: What does it mean for a bounded agent to have a calibrated estimate of the likelihood that a plan will succeed? In other words, how can we tell when an agent with limited computing resources is properly reasoning about logical and empirical uncertainty? (Bayesian inference gets us some of the way there, but doesn’t tell us how to select hypotheses and is often computationally intractable. Logical induction is a good starting formalism for logical uncertainty, but current algorithms are also computationally intractable.)
Self-trust: An agent may need to formally reason about how much to trust its reasoning process. How can an agent formally refer to itself within a world in which it’s embedded, and reason about the ways its own reasoning might be faulty? (Sources of error may include hardware failures in the physical world and bugs in the software it’s running.)
Logical counterfactuals: An agent may want to formally reason about the consequences of choices it takes. But if it’s deterministic, it will only end up making one choice, so it’s not clear how to formally talk about what happens if it picks something else. (Concretely, if it’s reasoning about whether to take action A or action B, and it in fact takes action A, reasoning formally about what would happen if it took action B is confusing, because anything can happen if it takes action B by the principle of explosion.)
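The principle-of-explosion worry in the last item can be made concrete with a small Lean 4 sketch. The `Action` type and the theorem name are hypothetical, introduced only for illustration: once it is established that the agent's action is A, the additional assumption that its action is B lets us derive any proposition `Q` whatsoever, which is why naive formal reasoning about the untaken action is uninformative.

```lean
inductive Action where
  | A
  | B

-- If the action is in fact A, then assuming it is B proves anything at all.
theorem anything_follows (act : Action) (Q : Prop)
    (hA : act = Action.A) (hB : act = Action.B) : Q := by
  subst hA   -- replace `act` by `Action.A` everywhere
  cases hB   -- `Action.A = Action.B` is impossible, so no cases remain
```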

This list is just an overview of some of the problems that need to be solved, and is by no means intended to be exhaustive. Also note the similarity between this list and the technical research problems listed in MIRI’s Agent Foundations agenda.

Why care about formalizing zero-shot reasoning?

Isn’t extreme caution sufficient for zero-shot reasoning?

It’s true that humans can make plans far more robust by thinking about them much longer and much more carefully. If there’s a massive codebase, or a blueprint of a rocket, or a detailed business plan, you could make them far more robust if you had the equivalent of a billion humans ruminating over the plan, reasoning about all the edge cases, brainstorming adversarial situations, etc. And yet, I think there remains a qualitative difference between “I thought about this plan very very hard and couldn’t find any errors” and “Under plausible assumptions, I can prove this plan will work”. [4]

It was critically important to Intel that their chips do arithmetic correctly, yet their reliance on human judgment led to the Pentium division bug. (They now rely on formal verification.) The Annals of Mathematics, the most prestigious mathematics journal, accepted both a paper proving some result and a paper proving the negation of that result.

Human reasoning is fundamentally flawed. Our brains were evolutionarily selected to play political games on savannahs, not to make long error-free chains of reasoning. Our cognition is plagued with heuristics and biases (including a heuristic that what we see is all there is), and we all have massive blind spots in our reasoning that we aren’t even aware exist. If we check a plan extremely thoroughly, we can only trust the plan to the extent that we trust that the plan doesn’t have any failure modes within our blind spots. The more conjunctive a plan is, the more likely it is that it will have a failure point hidden within our blind spots.

More concretely, suppose we have a plan with 20 components, and our estimate is that each component has a 99.9% chance of success, but in actuality three of the components have likelihoods of success closer to 80% because of edge cases we failed to consider. The overall plan will have a 0.999^17 * 0.80^3 ≈ 50% chance of success, rather than the 0.999^20 ≈ 98% we were hoping for. If such a plan had 100 components instead, the unconsidered edge cases would drive the plan’s likelihood of success close to zero. [5]
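The arithmetic above can be checked with a short Python sketch. It assumes the components succeed or fail independently, which the paragraph's multiplication already presupposes. The function name is hypothetical, and the 100-component case assumes the same proportion of overlooked edge cases (15 of 100), which the text does not specify.

```python
import math


def plan_success_probability(component_probs):
    """Probability that every component succeeds, assuming independence."""
    return math.prod(component_probs)


# What we believe: 20 components, each 99.9% reliable.
believed = [0.999] * 20
# What is actually true: three components are only 80% reliable.
actual = [0.999] * 17 + [0.80] * 3

print(round(plan_success_probability(believed), 3))  # 0.98
print(round(plan_success_probability(actual), 3))    # 0.503

# Assumed scaling to 100 components with the same fraction of weak ones.
scaled = [0.999] * 85 + [0.80] * 15
print(plan_success_probability(scaled) < 0.05)       # True
```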

We can avoid this problem if we have guarantees that we’ve covered all the relevant edge cases, but such a guarantee seems more similar in nature to a “proof” that all edge cases have been covered (i.e., formal zero-shot reasoning) than to an assurance that someone failed to think up unhandled edge cases after trying really hard (i.e., extreme caution).

Do we need zero-shot reasoning at all?

I think we will most likely end up building an AGI that recursively self-improves, and I think recursive self-improvement is very unlikely to be safe without zero-shot reasoning. [6]

If you’re building a successor agent far more powerful than yourself to achieve your goals, you’d definitely want a guarantee that your successor agent is aligned with your goals, as opposed to some subtle distortion of them or something entirely different. You’d also want to have a level of confidence in this guarantee that goes much beyond “I thought really hard about ways this could go wrong and couldn’t think of any”. [7]

This is especially the case if that successor agent will create successor agents that create successor agents that create successor agents, etc. I feel very pessimistic about building an aligned recursively self-improving AGI, if we can’t zero-shot reason that our AGI will be aligned, and also zero-shot reason that our AGI and all its successors will zero-shot reason about the alignment of their successors.

Zero-shot reasoning seems much less important if we condition on humanity never building an AGI that fooms. I consider this conditional very unlikely if hard takeoffs are possible at all. I expect there will be consistent incentives to build more and more powerful AGI systems (insofar as there will be consistent incentives for humans to more efficiently attain more of what they value). I also expect the most powerful AI systems to be recursively self-improving AGIs without humans in the loop, since humans would bottleneck the process of self-improvement.

Because of such incentives, a human society that has not built a foomed AGI is at best in an unstable equilibrium. Even if the society is run by a competent world government that deploys superintelligent AIs to enforce international security, I would not expect this society to last for 1,000,000,000 years without some rogue actor building a foomed AGI, which I imagine would be smart enough to cut through this society’s security systems like butter. (I have a strong intuition that for narrow tasks with extremely high ceilings for performance, like playing Go well or finding security vulnerabilities, a foomed AGI could perform that task far better than any AI produced by a human society with self-imposed limitations.)

Preventing anything like this from happening for 1,000,000,000 years seems very unlikely to me. Human societies are complex, open-ended systems teeming with intelligent actors capable of making novel discoveries and exploiting security flaws. Ensuring that such a complex system stays stable for as long as 1,000,000,000 years seems plausible only with the assistance of an aligned AGI capable of zero-shot reasoning about this system. But in that case we might as well have this AGI zero-shot reason about how it could safely recursively self-improve, in which case it would robustly optimize for our values for much longer than 1,000,000,000 years.

Why should we work on it?

Can’t we just train our AIs to be good zero-shot reasoners?

There's a difference between being able to do math well, and having a formal notion of what a correct mathematical proof is. It’s possible to be extremely good at mathematics without having any formal notion of what constitutes a correct mathematical proof (Newton was certainly like this). It’s even possible to be extremely good at mathematics while being sloppy at identifying which proofs are correct—I’ve met mathematicians who can produce brilliant solutions to math problems, who are also very prone to making careless mistakes in their solutions.

Likewise, it’s possible to train AIs that learn to create and apply abstractions, act sensibly as bounded rational agents, reason about themselves in their environments, and reason sensibly about counterfactuals. This is completely different from them having formal notions of how to do these all correctly, and the fact that they can do these at all gives no guarantees on how well they do them.

We won’t be able to train our AIs to be better at zero-shot reasoning than we are, because we don’t have enough examples of good general zero-shot reasoning to point them to. At best we’ll be able to impart our own pre-rigorous notions to them.

Can’t we build AIs that help us formalize zero-shot reasoning?

In principle, yes, but the task of converting pre-rigorous philosophical intuitions into formal theories is the most “AGI-complete” task I can imagine, so by default I expect it to be difficult to build a safe AGI that can usefully help us formalize zero-shot reasoning. That said, I could imagine a few approaches working:

Paul Christiano’s research agenda might let us build safe AGIs that can perform thousands of years’ worth of human cognition, which would be sufficient to help us formalize zero-shot reasoning. (On the other hand, we might need a formal account of zero-shot reasoning to establish the worst-case guarantees that Paul wants for his agenda.)
We could carefully construct non-superintelligent AGI assistants that can help humans perform arbitrary cognition, but are trained to be docile and are only run in very limited contexts (e.g. we never let one run for more than 5 minutes at a time before resetting its state). I feel confused about whether this is possible, but it’s certainly conceivable to me.
We train tool AIs on lots of examples of humans successfully turning pre-rigorous intuitions into formal theories.
We build technologies that substantially expedite philosophical progress, e.g. via intelligence amplification or whole-brain emulations run at 10,000x.

Won’t our AGIs want to become good zero-shot reasoners?

I do suspect that becoming a skilled zero-shot reasoner is a convergent instrumental goal for superintelligences. If we start with an aligned AGI that can self-modify to become a skilled zero-shot reasoner without first modifying into a misaligned superintelligence (possibly by mistake, e.g. by letting its values drift or by getting taken over by daemons), I’d feel good about the resulting outcome.

Whether we can trust that to happen is an entirely separate story. I certainly wouldn’t feel comfortable letting an AGI undergo recursive self-improvement without having some extremely strong reason to think its values would be maintained throughout the process, and some extremely strong reason to think it wouldn’t be overtaken by daemons. (I worry about small bugs in the AI creating security flaws that go unnoticed for a while, but are then exploited by a daemon, perhaps quite suddenly. The AI might worry about this too and want to take preventative measures, but at that point it might be too late.)

It might turn out that corrigibility is robust and has a simple core that powerful ML models can learn, that AGIs are likely to only get more and more corrigible as they get more and more powerful, that daemons are simple to prevent, and that corrigible AGIs will by default reliably prevent themselves from being overtaken by daemons. On these assumptions, I’d feel happy training a prosaic AGI to be corrigible and letting it recursively self-improve without any formalization of zero-shot reasoning. On the other hand, I think this conjunction of assumptions is unlikely, and for us to believe it we might need a formal account of zero-shot reasoning anyway.

Why should we think zero-shot reasoning is possible to formalize?

Humanity has actually made substantial progress toward formalizing zero-shot reasoning. Over the last century or so, we’ve formalized first-order logic, formalized expected utility theory, defined computation, defined information, formalized causality, developed theoretical foundations for Bayesian reasoning, and formalized Occam’s razor. More recently, MIRI has formalized aspects of logical uncertainty and made advances in decision theory. I also think all the problems in MIRI’s agent foundations agenda are tractable, and likely to result in further philosophical progress. [8]

Can we formalize zero-shot reasoning in time?

Probably not, but working on it now still nontrivially increases the odds that we do. Impressive progress on formalizing zero-shot reasoning makes it more prestigious, more broadly accessible (pre-rigorous intuitions are much harder to communicate than formal ones), and closer to being solved. This makes it more likely for it to be understood and taken seriously by the major players shortly before a singularity, and thus more likely for them to coordinate around not building a recursively self-improving AI before formalizing zero-shot reasoning.

(For comparison, suppose it turned out that homotopy type theory were necessary to align a recursively self-improving AGI, and we found ourselves in a parallel universe in which no work had been done on the topic. Even though we could hope for the world to hold off on recursive self-improvement until homotopy type theory were adequately developed, doesn't it seem much better that we're in a universe with a textbook and a community around this topic?)

Additionally, I think it’s not too unlikely that AGI is far away and/or that zero-shot reasoning is surprisingly easy to formalize. Under either assumption, it becomes far more plausible that we can formalize it in time, and whether or not we make it is straightforwardly impacted by how much progress we make today.

My personal views

I ~20% believe that we need to formalize zero-shot reasoning before we can build AGI systems that enable us to perform a pivotal act, ~85% believe that we need to formalize zero-shot reasoning before building a knowably safe recursively self-improving AI, and ~70% believe that conceptual progress on zero-shot reasoning is likely to result in conceptual progress in adjacent topics, like corrigibility, secure capability amplification, and daemon prevention.

I think working on zero-shot reasoning today will most likely turn out to be unhelpful if:

takeoff is slow (which I assign ~20%)
we can build a flourishing human society that coordinates around not building a recursively self-improving AGI, that stays stable for 1,000,000,000 years (which I assign ~10%), or
we can safely offload the bulk of formalizing zero-shot reasoning to powerful systems (like ALBA or whole-brain emulations) and implement an aligned recursively self-improving AGI before someone else builds a misaligned recursively self-improving AGI (which I assign ~50%).

My current all-things-considered position is that a formalization of zero-shot reasoning will substantially improve the odds that our first recursively self-improving AGI is aligned with humans, and that working on it today is one of humanity’s most neglected and highest-leverage interventions for reducing existential risk.

[1] This term is named in analogy with zero-shot learning, which refers to the ability to perform some task without any prior examples of how to do it.

[2] Not arbitrarily high confidence, given inherent uncertainties and unpredictabilities in the world.

[3] We can’t get arbitrarily high confidence even in the domain of math, because we still need to trust the soundness of our formal verifier and the soundness of the axiom system we're reasoning in.

[4] It’s worth noting that a team of a billion humans could confidently verify the software’s correctness by “manually” verifying the code, if they all know how to do formal verification. I feel similarly optimistic about any domain where the humans have formal notions of correctness, like mathematics. On the other hand, I feel pessimistic about humans verifying software if they don't have any notion of formal verification and can't rederive it.

[5] I'm specifically referring to conjunctive plans that we'd like to see succeed on our first try, without any iteration. This excludes running companies, which requires enormous amounts of iteration.

[6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year, at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation.

[7] It might be possible for humans to achieve this level of confidence without a formalization of zero-shot reasoning, e.g. if we attain a deep understanding of corrigibility that doesn’t require zero-shot reasoning. See “Won’t our AGIs want to become good zero-shot reasoners?”

[8] Zero-shot reasoning is not about getting 100% mathematical certainty that your actions will be safe or aligned, which I believe to be a common misconception people have of MIRI’s research agenda (especially given language around “provably beneficial AI”). Formalization is less about achieving 100% certainty than it is about providing a framework in which we can algorithmically verify whether some line of reasoning is sound. Getting 100% certainty is impossible, and nobody is trying to achieve it.

Thanks to Rohin Shah, Ryan Carey, Eli Tyre, and Ben Pace for helpful suggestions and feedback.

This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.

However, in practice, I expect that powerful AI systems will not look like they are explicitly maximizing some utility function. In this scenario, if you change some component of the system for the worse, you are likely to degrade its performance, but not likely to drastically change its behavior to cause human extinction. For example, even in RL (which is the closest thing to expected utility maximization), you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself).

While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and evaluating the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.)

You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don't expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task -- the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).

In particular, the one context where we're most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs -- and yet, daemons arise.

I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.

you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself).

My impression is that most of these 'serious bugs' are something like "oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be" which is not particularly heartening.

While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and evaluating the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.)

Even for changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and the sort of thing that AI programmers would do by default.

My impression is that most of these 'serious bugs' are something like "oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be" which is not particularly heartening.

Even if this were true I would not update much. If you actually had only one of those and not the other, you would notice _really fast_, so it's not going to harm you.
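A toy sketch of the cancelling-bugs scenario under discussion, on a made-up one-dimensional loss (x - 3)^2. Nothing here is taken from OpenAI Five or any real RL codebase, and the function and parameter names are hypothetical. It shows that a flipped update sign combined with a negated objective behaves exactly like the correct code, while having only one of the two bugs makes the true loss blow up within a few dozen steps, which is the sense in which it would be noticed quickly.

```python
def run(update_sign, objective_sign, steps=50, lr=0.1):
    """Optimise x on the toy loss (x - 3)^2 and return the final true loss.

    update_sign:    -1 is correct gradient descent, +1 is the "ascent" bug.
    objective_sign: +1 is the correct loss, -1 is the negated-objective bug.
    """
    x = 0.0
    for _ in range(steps):
        # Gradient of the (possibly negated) loss at the current point.
        grad = objective_sign * 2.0 * (x - 3.0)
        x += update_sign * lr * grad
    return (x - 3.0) ** 2


print(run(-1, +1))  # no bugs: converges, loss near zero
print(run(+1, -1))  # both bugs: identical behaviour, the signs cancel
print(run(+1, +1))  # only one bug: the loss grows very large
```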

The bugs I'm imagining are more like "we did a bunch of math and got out an equation, but missed a minus sign in one of the terms that's usually quite small, resulting in a small error in the value calculated, making learning less efficient". OpenAI had a bug where their bots would get a negative reward for reaching level 25. If you introduce these kinds of bugs with a change, you'll notice less efficient learning, and hopefully correct it, and it only leads to degraded performance, not catastrophic outcomes.

Even for changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and the sort of thing that AI programmers would do by default.

I agree that you want to be extra careful with self-modification, but there are lots of easy steps you can take to in fact be extra careful, e.g. creating a copy of yourself with the modification and seeing what it tends to do on a suite of problems where you expect the modification to be helpful/harmful.

We may also have different pictures of what self-modification looks like. Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don't really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.

Do you generally trust that you personally could be handed the key to human self-modification? I feel reasonably confident that such a tool would help me (or at least, not harm me, in that I might decide not to use it). Since it's much easier for an AI to run experiments on copies of itself, it should be a much easier task for the AI to use such a tool well.

If you actually had only one of those and not the other, you would notice _really fast_, so it's not going to harm you.

The thing I'm worried about is fixing only one of them--see Reason as Memetic Immune Disorder.

Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don't really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.

I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?

I think it's likely that there will be some safeguards in place, much in the way that you don't get robust multicellular life without some mechanisms of correcting cancers when they develop. The root of my worry here is that I don't expect this problem to be solved well if researchers aren't thinking much about self-modification (and thus how to solve it well).

Do you generally trust that you personally could be handed the key to human self-modification?

I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don't have enough transparency for the consequences. If I have a dial with my IQ on it, probably, or if I have a set of dials related to the strength of various motivations, probably, but here I would still feel like there are significant risks associated with moving outside normal bounds that I would be accepting because we live in weird times. [For example, it seems likely that some genes that increase intelligence also increase brain cancer risk, and it seems possible that 'turning the IQ dial' with this key would similarly increase my chance of having brain cancer.]

Similarly, being able to print the genome for potential children rather than rolling randomly or selecting from a few options seems like it would be useful and I would use it, but is not making the situation significantly safer and could easily lead to systematic problems because of correlated choices.

The thing I'm worried about is fixing only one of them

Right, I'm arguing that if you only fixed one of them, you would notice _immediately_, and either revert back to the version with both bugs, or find the other bug. I'm also claiming that this should be what happens in general, assuming sufficient caution around self-modification (on the AI's part, not the researcher's part).

I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?

I don't think of gradient descent as self-modification. If an AI system were able to choose (i.e. learned a policy for) when to run gradient descent on itself, and what training data it should use for that, that might be self-modification. Meta-learning feels similar to me -- the AI doesn't get to choose training data or what to run gradient descent on. The only learned part in current meta-learning approaches is how to perform a task, not how to learn.

I don't expect this problem to be solved well if researchers aren't thinking much about self-modification (and thus how to solve it well).

This might be all of our disagreement actually. Like you, I'm quite pessimistic about any system where researchers put in place a protocol for self-modification, which seems to be what you are imagining. Either the protocol is too lax and we get the sort of issues you're talking about, or it's too strict and self-modification never happens. However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have "thoughts" of the form "Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox".

I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don't have enough transparency for the consequences.

But in that case, you would simply choose not to use it, or to do a lot of research before trying to use it.

However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have "thoughts" of the form "Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox".

This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don't expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification). It seems likely that it could be in a situation like some of the Cake or Death problems, where it views a change to itself as impacting only part of its future behavior (like affecting actions but not values, such that it suspects that a future it that took path A would be disappointed in itself and fix that bug, without realizing that the change it's making will cause future it to not be disappointed by path A), or is simply not able to foresee the impacts of its changes and so makes them 'recklessly' (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).

This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don't expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).

I share this intuition, for sufficiently complex self-modifications, with massive error bounds around what constitutes "sufficiently complex". I'm not sure if humans perform sufficiently complex self-modifications, I think our first AGIs might perform sufficiently complex self-modifications, and I think AGIs undergoing a fast takeoff are most likely performing sufficiently complex self-modifications.

is simply not able to foresee the impacts of its changes and so makes them 'recklessly' (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).

+100. This is why I feel queasy about "OK, I judge this self-modification to be fine" when the self-modifications are sufficiently complex, if this judgment isn't based off something like zero-shot reasoning (in which case we'd have strong reason to think that an agent following a policy of making every change it determines to be good will actually avoid disasters).

I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what "self-modification" is (though I'm not sure it matters that much).

Both of the comments feel to me like you're making the AI system way dumber than humans, and I don't understand why I should expect that. I think I could make a better human with high confidence/robustness if you give me a human-modification-tool that I understand reasonably well and I'm allowed to try and test things before committing to the better human.

This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.

I'm mostly concerned with daemons, not utility functions changing in random directions. If I knew that corrigibility were robust and that a corrigible AI would never encounter daemons, I'd feel pretty good about it recursively self-improving without formal zero-shot reasoning.

You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don't expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task -- the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).

I'm imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it's performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.

In particular, the one context where we're most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs -- and yet, daemons arise.

I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.

I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.

In my mind, there's no clear difference between preventing daemons and securing complex systems. For example, I think there's a fundamental similarity between the following questions:

How can we build an organization that we trust to optimize for its founders' original goals for 10,000 years?
How can we ensure a society of humans flourishes for 1,000,000,000 years without falling apart?
How can we build an AGI which, when run for 1,000,000,000 years, still optimizes for its original goals with > 99% probability? (If it critically malfunctions, e.g. if it "goes insane", it will not be optimizing for its original goals.)
How can we build an AGI which, after undergoing an intelligence explosion, still optimizes for its original goals with > 99% probability?

I think of AGIs as implementing miniature societies teeming with subagents that interact in extraordinarily sophisticated ways (for example they might play politics or Goodhart like crazy). On this view, ensuring the robustness of an AGI entails ensuring the robustness of a society at least as complex as human society, which seems to me like it requires zero-shot reasoning.

It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms). Maybe it's just a failure of my imagination, but I can't think of any way to accomplish even this task without delegating it to a skilled zero-shot reasoner.

I'm imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it's performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.

This seems like it's using a bazooka to kill a fly. I'm not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?

I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.

Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?

In my mind, there's no clear difference between preventing daemons and securing complex systems.

(I can't seem to blockquote the bullet points, imagine I had done that.)

My reason for not having high confidence is that the time spans are incredibly long and many things could happen that I can't predict. But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is "the AGI was incompetent at some point, made a mistake, and bad things happened". I don't know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to "want" to do that.

However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.

It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms).

If you're trying to do this without putting some general intelligence into it, this sounds way harder to me, because you can't build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)

This seems like it's using a bazooka to kill a fly. I'm not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?

I agree that zero-shot reasoning doesn't save us from daemons by itself, and I think there's important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.

Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?

The daemons I'm focusing on here mostly arise from embedded agency, which Solomonoff induction doesn't capture at all. (It's worth noting that I consider there to be a substantial difference between Solomonoff induction daemons and "internal politics"/"embedded agency" daemons.) I'm interested in hashing this out further, but probably at some future point, since this doesn't seem central to our disagreement.

But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is "the AGI was incompetent at some point, made a mistake, and bad things happened". I don't know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to "want" to do that.

However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.

....

If you're trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can't build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)

I'm surprised by how much we seem to agree about everything you've written here. :P Let me start by clarifying my position a bit:

When I imagine the AGI making a "plan that will work in one go", I'm not imagining it going like "OK, here's a plan that will probably work for 1,000,000,000 years! Time to take my hands off the wheel and set it in motion!" I'm imagining the plan to look more like "set a bunch of things in motion, reevaluate and update it based on where things are, and repeat". So the overall shape of this AGI's cognition will look something like "execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.", happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects). The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn't introduce any critical errors.
I think an AGI competently optimizing for our values should almost certainly be exploring distant galaxies for billions of years (given the availability of astronomical computing resources). On this view, building a spacecraft that can explore the universe for 1,000,000,000 years without critical malfunctions is strictly easier than building an AGI that competently optimizes for our values for 1,000,000,000 years.
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours. So undergoing a safe intelligence explosion seems at least as difficult as getting an earthbound AGI doing 1,000,000 years' worth of human cognition without any catastrophic failures.
I'm less concerned about the AGI killing its operators than I am about the AGI failing to capture a majority of our cosmic endowment. It's plausible that the latter usually leads to the former (particularly if there's a fast takeoff on Earth that completes in a few hours), but that's mostly not what I'm concerned about.

In terms of actual disagreement, I suspect I'm much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it's doing something like 1,000,000 years' worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What's your take on this?

I agree that zero-shot reasoning doesn't save us from daemons by itself, and I think there's important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.

You must be really pessimistic about our chances.

The daemons I'm focusing on here mostly arise from embedded agency, which Solomonoff induction doesn't capture at all.

Huh, okay. I still don't know the mechanism by which zero-shot reasoning helps us avoid x-risk, so it might be useful to describe these daemons in more detail. I continue to think that zero-shot reasoning does not seem necessary for, e.g., ensuring a flourishing human civilization for a billion years.

So the overall shape of this AGI's cognition will look something like "execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.", happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects)

Agreed that this is a control mechanism that course-corrects.

The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn't introduce any critical errors.

But why is the probability of introducing critical errors so high without zero-shot reasoning? Perhaps over a billion years even a 1-in-a-million chance is way too high, but couldn't we spend some fraction of the first 10,000 years figuring that out?
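
To make the arithmetic behind this question concrete, here is a minimal sketch. It assumes each reevaluation step independently introduces a critical error with the same fixed probability, which is a simplification the thread does not commit to; the function names are hypothetical and only for illustration. The 99% target echoes the "> 99% probability" figure used earlier in the thread.

```python
import math


def survival_probability(p_error: float, n_steps: int) -> float:
    """Chance that no critical error occurs across n_steps steps.

    Assumes (for illustration only) that steps are independent and
    share the same per-step error probability p_error.
    """
    # (1 - p)^n, computed via log1p so tiny p_error values keep precision.
    return math.exp(n_steps * math.log1p(-p_error))


def max_error_rate(target_survival: float, n_steps: int) -> float:
    """Largest per-step error probability that still meets target_survival."""
    # Solve (1 - p)^n = target for p; expm1 avoids cancellation near zero.
    return -math.expm1(math.log(target_survival) / n_steps)


if __name__ == "__main__":
    # A 1-in-a-million error rate over a million reevaluations.
    print(survival_probability(1e-6, 1_000_000))
    # Per-step error rate needed for 99% survival over a billion steps.
    print(max_error_rate(0.99, 1_000_000_000))
```

Under these assumptions, a 1-in-a-million per-step error rate leaves only about a 37% chance of surviving a million reevaluations, and staying above 99% over a billion reevaluations needs a per-step rate of roughly one in a hundred billion. Error correction between steps, which is what the course-correcting picture relies on, would break the independence assumption and change these numbers.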

Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours.

This seems extraordinarily unlikely to me (feels like ~0%), if we're talking about the first AGI that we build. If we're not talking about that, then I'd want to know what "intelligence explosion" means -- how intelligent was the most intelligent thing before the intelligence explosion? (I think this is tangential though, so feel free to ignore it.)

building an AGI that competently optimizes for our values for 1,000,000,000 years.

Just, don't build that. No one is trying to build that. It's a hard problem, we don't know what our values are, it's difficult. We can instead build an AGI that wants to help humans do whatever it is they want to do, and help them figure out what they want to do, and assists human development and flourishing at any given point in time, and continually learns from humans to figure out what should be done. In this version, humans are a control mechanism for the AI, and so we should expect the problem to be a lot easier.

I suspect I'm much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it's doing something like 1,000,000 years' worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What's your take on this?

I was talking about the AI as a control mechanism for the task that needs to be done, not a control mechanism that course-corrects the AI. I don't expect there to be a particular subsystem of the AI that is responsible for course correction, just as there isn't a particular subsystem in the human brain responsible for thinking "Huh, I guess now that condition X has arisen, I should probably take action Y in order to deal with it".
I have no intuition for why there should be daemons, what they look like, how they take over the AI, etc. especially if they are different in kind from Solomonoff daemons. This basically sounds to me like positing the existence of a bad thing, and so concluding that we get bad things. I'm sure there's more to your intuition, but I don't know what it is and don't share the intuition.
Executing and reevaluating plans many times seems like it's fine, ignoring cases where the environment gets too difficult for the AI to deal with (i.e. the AI is incompetent). I expect the evaluation process to be robust by constantly getting human input.

I should clarify a few more background beliefs:

I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
I agree that right now, nobody is trying to (or should be trying to) build an AGI that's competently optimizing for our values for 1,000,000,000 years. (I'd want an aligned, foomed AGI to be doing that.)
I agree that if we're not doing anything as ambitious as that, it's probably fine to rely on human input.
I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don't have to worry about zero-shot reasoning today.
Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I'd give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)

Let me now clarify what I mean by "foomed AGI":

A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. ("Optimally optimized optimizer" is another way of putting it.)
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it's more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
In this "nuclear explosion" of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.

In this comment thread, I was referring primarily to foomed AGIs, not the first AGIs we build. I imagine you either having a different picture of takeoff, or thinking something like "Just don't build a foomed AGI. Just like it's way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it's way too hard to build a safe foomed AGI, so let's just not do it". And my position is something like "It's probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let's do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right."

I'm happy to delve into your individual points, but before I do so, I'd like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.

A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. ("Optimally optimized optimizer" is another way of putting it.)

I have a strong intuition that "optimal algorithms for intelligence per relevant unit of computation" don't exist. There are lots of no-free-lunch theorems around this. Intelligence is contextual; as a concrete example, children are better than adults in novel situations with unusual causal factors (https://cocosci.berkeley.edu/tom/papers/LabPublications/GopnicketalYoungLearners.pdf). In AI, the explore-exploit tradeoff is quite fundamental and it seems unlikely that you can find a fully general solution to it.
I still don't know what "intelligence explosion within a year" means; is it relative to human intelligence? The intelligence of the previous AGI? Along what metric are you measuring intelligence? If I consider the "reasonable view" of what these terms mean, I expect that there will never be an intelligence explosion that would be considered "fast" (in the way that AGI intelligence explosion in a year would be "fast" to us) by the next-most intelligent system that exists.

You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it's more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.

I'm not sure what I'm supposed to get out of the analogy. If you're saying that a foomed AGI is way more powerful than the first AGI, sure. If you're saying they can do qualitatively different things, sure.

I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.

I don't know if I'd say I expect this, but I do consider this scenario often so I'm happy to talk about it, and I have been assuming that during this discussion.

In this "nuclear explosion" of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.

I'm still very unclear on how you're operationalizing an intelligence explosion. If an intelligence explosion happens only after a million iterations of AGI systems improving themselves, then this seems true to me, but also the humans will have AGI systems that are way smarter than them to assist them during this time.

I imagine you either having a different picture of takeoff, or thinking something like "Just don't build a foomed AGI. Just like it's way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it's way too hard to build a safe foomed AGI, so let's just not do it".

I think it's the first. I'm much more sympathetic to the picture of "slow" takeoff in Will AI See Sudden Progress? and Takeoff speeds. I don't imagine ever building a very capable AI that explicitly optimizes a utility function, since a multiagent system (i.e. humanity) is unlikely to have a utility function. However, I can imagine building a safe foomed AGI.

And my position is something like "It's probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let's do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right."

It would be quite surprising to me if the right thing to do to ensure that nation states and individual actors understand this point would be to formalize zero-shot reasoning.

In addition, I could imagine building a safe foomed AGI that is corrigible and so does not require a solution to metaphilosophy; but I'm happy to consider the case where that is necessary (which seems decently likely to me), in those worlds I expect that we are able to use the first AGI systems to help us figure out metaphilosophy.

I'm happy to delve into your individual points, but before I do so, I'd like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.

What takeoff looks like, what the notion of "intelligence" is, what an "intelligence explosion" consists of, the usefulness of initial AI systems in aligning future, more powerful AI systems, what daemons are.

Also, on a more epistemic note, how much weight to put on long chains of reasoning that rely on soft, intuitive concepts, and how much to trust intuitions about tasks longer than ~100 years.

Why "zero-shot"? You're talking about getting something right in one try, so wouldn't "one-shot" make more sense?

Humanity has actually made substantial progress toward formalizing zero-shot reasoning over the past century or so.

I think this paragraph gives an overly optimistic impression of how much progress has been made. We are still very confused about what probabilities really are, we haven't made any progress on the problem of Apparent Unformalizability of “Actual” Induction, and decision theory seems to have mostly stalled since about 8 years ago (the MIRI paper you cite does not seem to represent a substantial amount of progress over UDT 1.1).

I think working on zero-shot reasoning today will most likely turn out to be unhelpful if:

takeoff is slow (which I assign ~20%)

This isn't obvious to me. Can you explain why you think this?

In ML, "one-shot" means that you get to look at one example of good behavior (e.g. how to classify an image), and then you have to be able to replicate that good behavior. "Zero-shot" means getting it right without any prior examples. (See also footnote 1.)

Why "zero-shot"? You're talking about getting something right in one try, so wouldn't "one-shot" make more sense?

I've flip-flopped between "one-shot" and "zero-shot". I'm calling it "zero-shot" in analogy with zero-shot learning, which refers to the ability to perform a task after zero demonstrations. "One-shot reasoning" probably makes more sense to folks outside of ML.

I think this paragraph gives an overly optimistic impression of how much progress has been made. We are still very confused about what probabilities really are, we haven't made any progress on the problem of Apparent Unformalizability of “Actual” Induction, and decision theory seems to have mostly stalled since about 8 years ago (the MIRI paper you cite does not seem to represent a substantial amount of progress over UDT 1.1).

I used "substantial progress" to mean "real and useful progress", rather than "substantial fraction of the necessary progress". Most of my examples happened in the early to mid-1900s, suggesting that if we continue at that rate we might need at least another century.

This isn't obvious to me. Can you explain why you think this?

I'd feel much better about delegating the problem to a post-AGI society, because I'd expect such a society to be far more stable if takeoff is slow, and far more capable of taking its merry time to solve the full problem in earnest. (I think it will be more stable because I think it would be much harder for a single actor to attain a decisive strategic advantage over the rest of the world.)

I’m calling it “zero-shot” in analogy with zero-shot learning, which refers to the ability to perform a task after zero demonstrations.

I see. Given this, I think "zero-shot learning" makes sense but "zero-shot reasoning" still doesn't, since in the former "zero" refers to "zero demonstrations" and you're learning something without doing a learning process targeted at that specific thing, whereas in the latter "zero" isn't referring to anything and you're trying to get the reasoning correct in one attempt so "one-shot" is a more sensible description.

I used “substantial progress” to mean “real and useful progress”, rather than “substantial fraction of the necessary progress”. Most of my examples happened in the early to mid-1900s, suggesting that if we continue at that rate we might need at least another century.

Ok, I don't think we have a substantive disagreement here then. My complaint was that providing only positive examples of progress in that paragraph without tempering them with negative ones is liable to give an overly optimistic impression to people who aren't familiar with the field.

I’d feel much better about delegating the problem to a post-AGI society, because I’d expect such a society to be far more stable if takeoff is slow, and far more capable of taking its merry time to solve the full problem in earnest. (I think it will be more stable because I think it would be much harder for a single actor to attain a decisive strategic advantage over the rest of the world.)

Are you saying that in the slow-takeoff world, we will be able to coordinate to stop AI progress after reaching AGI and then solve the full alignment problem at leisure? If so, what's your conditional probability P(successful coordination to stop AI progress | slow takeoff)?

I see. Given this, I think "zero-shot learning" makes sense but "zero-shot reasoning" still doesn't, since in the former "zero" refers to "zero demonstrations" and you're learning something without doing a learning process targeted at that specific thing, whereas in the latter "zero" isn't referring to anything and you're trying to get the reasoning correct in one attempt so "one-shot" is a more sensible description.

I was imagining something like "zero failed attempts", where each failed attempt approximately corresponds to a demonstration.

Are you saying that in the slow-takeoff world, we will be able to coordinate to stop AI progress after reaching AGI and then solve the full alignment problem at leisure? If so, what's your conditional probability P(successful coordination to stop AI progress | slow takeoff)?

More like, conditioning on getting international coordination after our first AGI, P(safe intelligence explosion | slow takeoff) is a lot higher, like 80%. I don't think slow takeoff does very much to help international coordination.

Curating (alongside Zhukeepa's FAQ on Paul's Agenda)

We're a bit behind on other tasks and still don't have time to write up formal curation notices, but wanted to at least keep the curated section moving.

I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.

you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself).

While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and evaluating the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.)

My impression is that most of these 'serious bugs' are something like "oops, our gradient descent is actually gradient ascent, but it worked out alright because our utility function is also the negative of what it should be" which is not particularly heartening.
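As a toy illustration of the kind of cancelling bug described above (not taken from any real codebase; the function `minimize`, the quadratic objective, and the step size are all assumptions made for this sketch), here is gradient descent on f(x) = (x - 3)^2 with two optional sign errors. With both errors present the run is indistinguishable from the correct one; with only one, it diverges immediately.

```python
def minimize(grad_sign, loss_sign, steps=50, lr=0.1):
    """Run gradient steps on f(x) = (x - 3)^2 with optional sign bugs.

    grad_sign = -1 is correct descent; +1 is the "ascent" bug.
    loss_sign = +1 is the intended loss; -1 is the negated-objective bug.
    """
    x = 0.0
    for _ in range(steps):
        # Derivative of loss_sign * (x - 3)^2 with respect to x.
        grad = loss_sign * 2.0 * (x - 3.0)
        x += grad_sign * lr * grad
    return x


correct = minimize(grad_sign=-1, loss_sign=+1)    # no bugs
both_bugs = minimize(grad_sign=+1, loss_sign=-1)  # two bugs that cancel
one_bug = minimize(grad_sign=+1, loss_sign=+1)    # ascent on the intended loss

print(correct, both_bugs, one_bug)
```

In this sketch the single-bug run moves away from the optimum on every step, which is the sense in which such a bug would be noticed quickly, while the double-bug run gives no behavioral signal at all.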

Even if this were true I would not update much. If you actually had only one of those and not the other, you would notice _really fast_, so it's not going to harm you.

Even for changes to how it performs or evaluates self-modification? Eurisko comes to mind as a program that could and did give itself cancer, requiring its programmer to notice that it had died and restart it, and the sort of thing that AI programmers would do by default.

If you actually had only one of those and not the other, you would notice _really fast_, so it's not going to harm you.

The thing I'm worried about is fixing only one of them--see Reason as Memetic Immune Disorder.

Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don't really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.

Do you generally trust that you personally could be handed the key to human self-modification?

The thing I'm worried about is fixing only one of them

I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?

I don't expect this problem to be solved well if researchers aren't thinking much about self-modification (and thus how to solve it well).

I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don't have enough transparency for the consequences.

But in that case, you would simply choose not to use it, or to do a lot of research before trying to use it.

However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have "thoughts" of the form "Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox".

This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don't expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification).

is simply not able to foresee the impacts of its changes and so makes them 'recklessly' (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).

I agree this seems like a crux for me as well, subject to the caveat that I think we have different ideas of what "self-modification" is (though I'm not sure it matters that much).

This is all assuming an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried.

You could worry about daemons exploiting these bugs under this view. I think this is a reasonable worry, but don't expect formalizing zero-shot reasoning to help with it. It seems to me that daemons occur by falling into a local optimum when you are trying to optimize for doing some task -- the daemon does that task well in order to gain influence, and then backstabs you. This can arise both in ideal zero-shot reasoning, and when introducing approximations to it (as we will have to do when building any practical system).

In particular, the one context where we're most confident that daemons arise is Solomonoff induction, which is one of the best instances of formalizing zero-shot reasoning that we have. Solomonoff gives you strong guarantees, of the sort you can use in proofs -- and yet, daemons arise.

I would be very surprised if we were able to handle daemons without some sort of daemon-specific research.

In my mind, there's no clear difference between preventing daemons and securing complex systems. For example, I think there's a fundamental similarity between the following questions:

How can we build an organization that we trust to optimize for its founders' original goals for 10,000 years?
How can we ensure a society of humans flourishes for 1,000,000,000 years without falling apart?
How can we build an AGI which, when run for 1,000,000,000 years, still optimizes for its original goals with > 99% probability? (If it critically malfunctions, e.g. if it "goes insane", it will not be optimizing for its original goals.)
How can we build an AGI which, after undergoing an intelligence explosion, still optimizes for its original goals with > 99% probability?

I'm imagining the AI zero-shot reasoning about the correctness and security of its source code (including how well it's performing zero-shot reasoning), making itself nigh-impossible for daemons to exploit.

This seems like it's using a bazooka to kill a fly. I'm not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?

I think of Solomonoff induction less as a formalization of zero-shot reasoning, and more as a formalization of some unattainable ideal of rationality that will eventually lead to better conceptual understandings of bounded rational agents, which will in turn lead to progress on formalizing zero-shot reasoning.

In my mind, there's no clear difference between preventing daemons and securing complex systems.

(I can't seem to blockquote the bullet points, imagine I had done that.)

It seems like a simpler task would be building a spacecraft that can explore distant galaxies for 1,000,000,000 years without critically malfunctioning (perhaps with the help of self-correction mechanisms).

This seems like it's using a bazooka to kill a fly. I'm not sure if I agree that zero-shot reasoning saves you from daemons, but even if so, why not try to attack the problem of daemons directly?

Okay, sure, but then my claim is that Solomonoff induction is _better_ than zero-shot reasoning on the axes you seem to care about, and yet it still has daemons. Why expect zero-shot reasoning to do better?

But in scenarios where we have an AGI, yet we fail to achieve these objectives, the reason that seems most likely to me is "the AGI was incompetent at some point, made a mistake, and bad things happened". I don't know how to evaluate the probability of this and so become uncertain. But, if you are correct that we can formalize zero-shot reasoning and actually get high confidence, then the AGI could do that too. The hard problem is in getting the AGI to "want" to do that.

However, I expect that the way we actually get high confidence answers to those questions, is that we implement a control mechanism (i.e. the AI) that gets to act over the entire span of 10,000 or 1 billion years or whatever, and it keeps course correcting in order to stay on the path.

....

If you're trying to [build the spacecraft] without putting some general intelligence into it, this sounds way harder to me, because you can't build in a sufficiently general control mechanism for the spacecraft. I agree that (without access to general-intelligence-routines for the spacecraft) such a task would need very strong zero-shot reasoning. (It _feels_ impossible to me that any actual system could do this, including AGI, but that does feel like a failure of imagination on my part.)

I'm surprised by how much we seem to agree about everything you've written here. :P Let me start by clarifying my position a bit:

When I imagine the AGI making a "plan that will work in one go", I'm not imagining it going like "OK, here's a plan that will probably work for 1,000,000,000 years! Time to take my hands off the wheel and set it in motion!" I'm imagining the plan to look more like "set a bunch of things in motion, reevaluate and update it based on where things are, and repeat". So the overall shape of this AGI's cognition will look something like "execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.", happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects). The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn't introduce any critical errors.
I think an AGI competently optimizing for our values should almost certainly be exploring distant galaxies for billions of years (given the availability of astronomical computing resources). On this view, building a spacecraft that can explore the universe for 1,000,000,000 years without critical malfunctions is strictly easier than building an AGI that competently optimizes for our values for 1,000,000,000 years.
Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours. So undergoing a safe intelligence explosion seems at least as difficult as getting an earthbound AGI doing 1,000,000 years' worth of human cognition without any catastrophic failures.
I'm less concerned about the AGI killing its operators than I am about the AGI failing to capture a majority of our cosmic endowment. It's plausible that the latter usually leads to the former (particularly if there's a fast takeoff on Earth that completes in a few hours), but that's mostly not what I'm concerned about.

I agree that zero-shot reasoning doesn't save us from daemons by itself, and I think there's important daemon-specific research to be done independently of zero-shot reasoning. I more think that zero-shot reasoning may end up being critically useful in saving us from a specific class of daemons.

You must be really pessimistic about our chances.

The daemons I'm focusing on here mostly arise from embedded agency, which Solomonoff induction doesn't capture at all.

So the overall shape of this AGI's cognition will look something like "execute on some plan for a while, reevaluate and update it, execute on it again for a while, reevaluate and update it again, etc.", happening millions or billions of times over (which seems a lot like a control mechanism that course-corrects)

Agreed that this is a control mechanism that course-corrects.

The zero-shot reasoning is mostly for ensuring that each step of reevaluation and updating doesn't introduce any critical errors.

Millions of years of human cognitive labor (or much more) might happen in an intelligence explosion that occurs over the span of hours.

building an AGI that competently optimizes for our values for 1,000,000,000 years.

I suspect I'm much more pessimistic than you about daemons taking over the control mechanism that course-corrects our AI, especially if it's doing something like 1,000,000 years' worth of human cognition, unless we can continuously zero-shot reason that this control mechanism will remain intact. (Equivalently, I feel very pessimistic about the process of executing and reevaluating plans millions/billions+ times over, unless the evaluation process is extraordinarily robust.) What's your take on this?
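To make the arithmetic behind this worry concrete, here is a back-of-the-envelope sketch. It assumes (and this is an assumption of the sketch, not a claim from the discussion) that each reevaluation step independently introduces a critical error with the same probability p, so that surviving n steps has probability (1 - p)^n. The function names are hypothetical. Under that simplification, it computes how small p must be to keep overall success above the > 99% figure used earlier in the thread.

```python
import math


def max_per_step_failure(n_steps, target_success):
    """Largest per-step failure probability p with (1 - p)**n_steps >= target_success.

    Solves (1 - p)**n = s for p, i.e. p = 1 - s**(1/n), using expm1 for
    numerical accuracy when p is tiny.
    """
    return -math.expm1(math.log(target_success) / n_steps)


def overall_success(n_steps, p):
    """Probability of no critical error across n_steps independent steps."""
    return math.exp(n_steps * math.log1p(-p))


for n in (10**3, 10**6, 10**9):
    p = max_per_step_failure(n, 0.99)
    print(f"{n:>13,} steps: per-step failure must stay below about {p:.3e}")
```

Under these assumptions the tolerable per-step error rate shrinks roughly in proportion to 1/n. Real errors need not be independent, and a course-correcting process might catch and undo some of them, so this is only a rough picture of why the evaluation step would need to be very robust.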

I was talking about the AI as a control mechanism for the task that needs to be done, not a control mechanism that course-corrects the AI. I don't expect there to be a particular subsystem of the AI that is responsible for course correction, just as there isn't a particular subsystem in the human brain responsible for thinking "Huh, I guess now that condition X has arisen, I should probably take action Y in order to deal with it".
I have no intuition for why there should be daemons, what they look like, how they take over the AI, etc. especially if they are different in kind from Solomonoff daemons. This basically sounds to me like positing the existence of a bad thing, and so concluding that we get bad things. I'm sure there's more to your intuition, but I don't know what it is and don't share the intuition.
Executing and reevaluating plans many times seems like it's fine, ignoring cases where the environment gets too difficult for the AI to deal with (i.e. the AI is incompetent). I expect the evaluation process to be robust by constantly getting human input.

I should clarify a few more background beliefs:

I think zero-shot reasoning is probably not very helpful for the first AGI, and will probably not help much with daemons in our first AGI.
I agree that right now, nobody is trying to (or should be trying to) build an AGI that's competently optimizing for our values for 1,000,000,000 years. (I'd want an aligned, foomed AGI to be doing that.)
I agree that if we're not doing anything as ambitious as that, it's probably fine to rely on human input.
I agree that if humanity builds a non-fooming AGI, they could coordinate around solving zero-shot reasoning before building a fooming AGI in a small fraction of the first 10,000 years (perhaps with the help of the first AGI), in which case we don't have to worry about zero-shot reasoning today.
Conditioning on reasonable international coordination around AGI at all, I give 50% to coordination around intelligence explosions. I think the likelihood of this outcome rises with the amount of legitimacy zero-shot reasoning has at coordination time, which is my main reason for wanting to work on it today. (If takeoff is much slower I'd give something more like 80% to coordination around intelligence explosions, conditional on international coordination around AGIs.)

Let me now clarify what I mean by "foomed AGI":

A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. ("Optimally optimized optimizer" is another way of putting it.)
You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it's more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.
I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.
In this "nuclear explosion" of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.

A rough summary is included in my footnote: [6] By “recursively self-improving AGI”, I’m specifically referring to an AGI that can complete an intelligence explosion within a year [or hours], at the end of which it will have found something like the optimal algorithms for intelligence per relevant unit of computation. ("Optimally optimized optimizer" is another way of putting it.)

I have a strong intuition that "optimal algorithms for intelligence per relevant unit of computation" don't exist. There are lots of no-free-lunch theorems around this. Intelligence is contextual; as a concrete example, children are better than adults in novel situations with unusual causal factors (https://cocosci.berkeley.edu/tom/papers/LabPublications/GopnicketalYoungLearners.pdf). In AI, the explore-exploit tradeoff is quite fundamental and it seems unlikely that you can find a fully general solution to it.
I still don't know what "intelligence explosion within a year" means; is it relative to human intelligence? The intelligence of the previous AGI? Along what metric are you measuring intelligence? If I consider the "reasonable view" of what these terms mean, I expect that there will never be an intelligence explosion that would be considered "fast" (in the way that AGI intelligence explosion in a year would be "fast" to us) by the next-most intelligent system that exists.

You could imagine analogizing the first AGI we build to the first dynamite we ever build. You could analogize a foomed AGI to a really big dynamite, but I think it's more accurate to analogize it to a nuclear bomb, given the positive feedback loops involved.

I expect the intelligence differential between our first AGI and a foomed AGI to be numerous orders of magnitude larger than the intelligence differential between a chimp and a human.

I don't know if I'd say I expect this, but I do consider this scenario often so I'm happy to talk about it, and I have been assuming that during this discussion.

In this "nuclear explosion" of intelligence, I expect the equivalent of millions of years of human cognitive labor to elapse, if not many more.

I imagine you either having a different picture of takeoff, or thinking something like "Just don't build a foomed AGI. Just like it's way too hard to build AGIs that competently optimize for our values for 1,000,000,000 years, it's way too hard to build a safe foomed AGI, so let's just not do it".

And my position is something like "It's probably inevitable, and I think it will turn out well if we make a lot of intellectual progress (probably involving solutions to metaphilosophy and zero-shot reasoning, which I think are deeply related). In the meantime, let's do what we can to ensure that nation-states and individual actors will understand this point well enough to coordinate around not doing it until the time is right."

It would be quite surprising to me if the right thing to do to ensure that nation states and individual actors understand this point would be to formalize zero-shot reasoning.

I'm happy to delve into your individual points, but before I do so, I'd like to get your sense of what you think our remaining disagreements are, and where you think we might still be talking about different things.

Also, on a more epistemic note, how much weight to put on long chains of reasoning that rely on soft, intuitive concepts, and how much to trust intuitions about tasks longer than ~100 years.

Why "zero-shot"? You're talking about getting something right in one try, so wouldn't "one-shot" make more sense?

Humanity has actually made substantial progress toward formalizing zero-shot reasoning over the past century or so.

I think working on zero-shot reasoning today will most likely turn out to be unhelpful if:

takeoff is slow (which I assign ~20%)

This isn't obvious to me. Can you explain why you think this?

Why "zero-shot"? You're talking about getting something right in one try, so wouldn't "one-shot" make more sense?

I think this paragraph gives an overly optimistic impression of how much progress has been made. We are still very confused about what probabilities really are, we haven't made any progress on the problem of Apparent Unformalizability of “Actual” Induction, and decision theory seems to have mostly stalled since about 8 years ago (the MIRI paper you cite does not seem to represent a substantial amount of progress over UDT 1.1).

This isn't obvious to me. Can you explain why you think this?
