Why Agent Foundations? An Overly Abstract Explanation

Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered across many posts, plus one post which does address the question but does so entirely in metaphor. Nobody seems to have written a straightforward explanation for why foundational questions of agency must be solved in order to significantly move the needle on alignment.

This post is an attempt to fill that gap. In my judgment, it mostly fails; it explains the abstract reasons for foundational agency research, but in order to convey the intuitions, it would need to instead follow the many paths by which researchers actually arrive at foundational questions of agency. But a better post won’t be ready for a while, and maybe this one will prove useful in the meantime.

Note that this post is not an attempt to address people who already have strong opinions that foundational questions of agency don't need to be answered for alignment; it's just intended as an explanation for those who don't understand what's going on.

Starting Point: The Obvious Stupid Idea

Let’s start from the obvious stupid idea for how to produce an aligned AI: have humans label policies/plans/actions/outcomes as good or bad, and then train an AI to optimize for the good things and avoid the bad things. (This is intentionally general enough to cover a broad range of setups; if you want something more specific, picture RL from human feedback.)

Assuming that this strategy could be efficiently implemented at scale, why would it not produce an aligned AI?

I see two main classes of problems:

  1. In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
  2. The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.

Note that both of these classes of problems are very pernicious: in both cases, the trained system’s results will look good at first glance.

Neither of these problems is obviously all that bad. In both cases, the system is behaving at least approximately well, at least within contexts not-too-different-from-training. These problems don’t become really bad until we apply optimization pressure, and Goodhart kicks in.

Goodhart’s Law

There’s a story about a Soviet nail factory. The factory was instructed to produce as many nails as possible, with rewards for high numbers and punishments for low numbers. Within a few years, the factory was producing huge numbers of nails - tiny useless nails, more like thumbtacks really. They were not very useful for nailing things.

So the planners changed the incentives: they decided to reward the factory for the total weight of nails produced. Within a few years, the factory was producing big heavy nails, more like lumps of steel really. They were still not very useful for nailing things.

This is Goodhart’s Law: when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.
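To make the nail story concrete, here is a minimal toy sketch in Python. Every number and name in it (the time budget, the setup time, the usable weight range, the candidate nail sizes) is a made-up assumption for illustration, not data about any real factory. The factory picks a single nail weight, and we compare what gets picked under each proxy with what gets picked under the thing we actually care about: usable nails.

```python
# Toy nail factory. All constants are hypothetical, chosen only for illustration.
TIME_BUDGET = 1000.0         # machine-minutes available
SETUP_TIME = 0.01            # minutes per nail, regardless of size
TIME_PER_GRAM = 0.002        # minutes per gram of steel shaped
USABLE_RANGE = (2.0, 20.0)   # grams; nails outside this range are useless

CANDIDATE_WEIGHTS = [0.1, 0.5, 2.0, 5.0, 10.0, 20.0, 100.0, 1000.0]


def nail_count(weight):
    """Proxy 1: number of nails produced within the time budget."""
    return TIME_BUDGET / (SETUP_TIME + TIME_PER_GRAM * weight)


def total_weight(weight):
    """Proxy 2: total grams of nails produced within the time budget."""
    return nail_count(weight) * weight


def usable_nails(weight):
    """What we actually want: nails that can be used for nailing things."""
    low, high = USABLE_RANGE
    return nail_count(weight) if low <= weight <= high else 0.0


def best_weight(objective):
    """Apply optimization pressure: pick the candidate that maximizes the objective."""
    return max(CANDIDATE_WEIGHTS, key=objective)


for name, objective in [("count", nail_count),
                        ("total weight", total_weight),
                        ("usable nails", usable_nails)]:
    chosen = best_weight(objective)
    print(f"optimize {name}: weight={chosen} g, usable nails={usable_nails(chosen):.0f}")
```

In this toy model both proxies correlate with usable nails across the middle of the range, yet optimizing either one lands on an extreme where the usable output is zero.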

In everyday life, if something looks good to a human, then it is probably actually good (i.e. that human would still think it’s good if they had more complete information and understanding). Obviously there are plenty of exceptions to this, but it works most of the time in day-to-day dealings. But if we start optimizing really hard to make things look good, then Goodhart’s Law kicks in. We end up with Instagram food - an elaborate milkshake or salad or burger, visually arranged like a bouquet of flowers, but impractical to eat and kinda mediocre-tasting.

Returning to our two alignment subproblems from earlier:

  1. In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
  2. The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.

Goodhart in the context of problem (1): train a powerful AI to make things look good to humans, and we have the same problem as Instagram food, but with way more optimization power applied. Think “Potemkin village world” - a world designed to look amazing, but with nothing behind the facade. Maybe not even any living humans behind the facade - after all, even generally-happy real humans will inevitably sometimes put forward appearances which would not appeal to the “good”/“bad”-labellers.

Goodhart in the context of problem (2): pretend our “good”/“bad” labels are perfect, but the system ends up optimizing for some target which doesn’t quite track our “good” labels, especially in new environments. Then that system ends up optimizing for whatever proxy it learned; we get the AI-equivalent of humans wearing condoms despite being optimized for reproductive fitness. And the AI then optimizes for that really hard.

Now, we’ve only talked about the problems with one particular alignment strategy. (We even explicitly picked a pretty stupid one.) But we’ve already seen the same basic issue come up in two different subproblems: Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.

Goodhart Is Not Inevitable

Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel. We design the system so that the leaked radio signal has zero correlation with whatever signals are passed around inside the system.

Some time later, a clever adversary is able to use the radio side-channel to glean information about those internal signals using fourth-order statistics. Zero correlation was an imperfect proxy for zero information leak, and the proxy broke down under the adversary’s optimization pressure.

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
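A small sketch of the gap between the two criteria, under toy assumptions of my own choosing: the internal signal X is uniform on {-1, 0, 1} and the "leak" is Y = X². This is not a model of any real side-channel; it just shows a pair of variables with exactly zero covariance (and so zero correlation) and clearly nonzero mutual information, alongside an independent pair where both are zero.

```python
from fractions import Fraction
from math import log2

# Hypothetical toy leak: X uniform on {-1, 0, 1}, leaked signal Y = X**2.
leaky = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}

# For contrast: two independent fair bits.
independent = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}


def marginal(joint, axis):
    """Marginal distribution of the variable at position `axis` of each pair."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out


def covariance(joint):
    """Exact covariance E[XY] - E[X]E[Y] of a finite joint distribution."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey


def mutual_information_bits(joint):
    """Mutual information I(X;Y) in bits for a finite joint distribution."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(float(p) * log2(float(p / (px[x] * py[y])))
               for (x, y), p in joint.items() if p > 0)


print("leaky:       cov =", covariance(leaky),
      " MI =", round(mutual_information_bits(leaky), 4), "bits")
print("independent: cov =", covariance(independent),
      " MI =", round(mutual_information_bits(independent), 4), "bits")
```

An adversary who only checks correlation sees nothing in the leaky case, while one who looks at higher-order statistics (here, anything involving X²) recovers almost a full bit about X.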

Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong. Figuring out the True Name of a thing, a mathematical formulation sufficiently robust that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen. That said, finding such formulations is a sufficiently rare skill that most people will not ever have encountered it firsthand; it’s no surprise that many people automatically assume it impossible.

This is (one framing of) the fundamental reason why alignment researchers work on problems which sound like philosophy, or like turning philosophy into math. We are looking for the True Names of various relevant concepts - i.e. mathematical formulations robust enough that they will continue to work as intended even under lots of optimization pressure.

Aside: Accidentally Stumbling On True Names

You may have noticed that the problem of producing actually-good nails has basically been solved, despite all the optimization pressure brought to bear by nail producers. That problem was solved mainly by competitive markets and reputation systems. And it was solved long before we had robust mathematical formulations of markets and reputation systems.

Or, to reuse the example of mutual information: one-time pad encryption was intuitively obviously secure long before anyone could prove it.
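As a hedged illustration of that claim, here is a tiny exact calculation for a 2-bit one-time pad. The message prior is a hypothetical, deliberately lopsided one; the only structural assumptions are that the key is uniform, independent of the message, and used once. Under those assumptions the adversary's posterior over messages after seeing the ciphertext equals the prior, i.e. the ciphertext carries zero information about the message.

```python
from fractions import Fraction

N_BITS = 2                     # toy message length (assumption)
VALUES = range(2 ** N_BITS)    # all 2-bit messages, keys, and ciphertexts

# Hypothetical, deliberately non-uniform prior over messages.
message_prior = {0: Fraction(7, 10), 1: Fraction(1, 10),
                 2: Fraction(1, 10), 3: Fraction(1, 10)}
key_prob = Fraction(1, len(VALUES))   # key is uniform and used only once


def ciphertext_given_message(m):
    """P(ciphertext | message = m), where ciphertext = m XOR key."""
    dist = {c: Fraction(0) for c in VALUES}
    for k in VALUES:
        dist[m ^ k] += key_prob
    return dist


def posterior_given_ciphertext(c):
    """Adversary's P(message | ciphertext = c), computed via Bayes' rule."""
    joint = {m: message_prior[m] * ciphertext_given_message(m)[c] for m in VALUES}
    total = sum(joint.values())
    return {m: p / total for m, p in joint.items()}


for c in VALUES:
    print(f"ciphertext={c}: posterior == prior?",
          posterior_given_ciphertext(c) == message_prior)
```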

So why do we need these “True Names” for alignment?

We might accidentally stumble on successful alignment techniques. (Alignment By Default is one such scenario.) On the other hand, we might also fuck it up by accident, and without the True Name we’d have no idea until it’s too late. (Remember, our canonical failure modes still look fine at first glance, even setting aside the question of whether the first AGI fooms without opportunity for iteration.) Indeed, people did historically fuck up markets and encryption by accident, repeatedly and to often-disastrous effect. It is generally nonobvious which pieces are load-bearing.

Aside from that, I also think the world provides lots of evidence that we are unlikely to accidentally stumble on successful alignment techniques, as well as lots of evidence that various specific classes of things which people suggest will not work. This evidence largely comes from failure to solve analogous existing problems “by default”. That’s a story for another post, though.

What “True Names” Do We Want/Need For Alignment?

What kind of “True Names” are needed for the two alignment subproblems discussed earlier?

  1. In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
  2. The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.

In the first subproblem, our “good”/“bad” labeling process is an imperfect proxy of what we actually want, and that proxy breaks down under optimization pressure. If we had the “True Name” of human values (insofar as such a thing exists), that would potentially solve the problem. Alternatively, rather than figuring out a “True Name” for human values directly, we could figure out a “pointer” to human values - something from which the “True Name” of human values could be automatically generated (analogous to the way that a True Name of nail-value is implicitly generated in an efficient market). Or, we could figure out the “True Names” of various other things as a substitute, like “do what I mean” or “corrigibility”.

In the second subproblem, the goals in the trained system are an imperfect proxy of the goals on which the system is trained, and that proxy breaks down when the trained system optimizes for it in a new environment. If we had the “True Names” of things like optimizers and goals, we could inspect a trained system directly to see if it contained any “inner optimizer” with a goal very different from what we intended. Ideally, we could also apply such techniques to physical systems like humans, e.g. as a way to point to human values.

Again, this is only one particular alignment strategy. But the idea generalizes: in order to make alignment strategies robust to lots of optimization pressure, we typically find that we need robust formulations of some intuitive concepts, i.e. “True Names”.

Regardless of the exact starting point, seekers of “True Names” quickly find themselves recursing into a search for “True Names” of lower-level components of agency, like:

  • Optimization
  • Goals
  • World models
  • Abstraction
  • Counterfactuals
  • Embeddedness

Aside: Generalizability

Instead of framing all this in terms of Goodhart’s Law, we could instead frame it in terms of generalizability. Indeed, Goodhart’s Law itself can be viewed as a case/driver of generalization failure: optimization by default pushes things into new regimes, and Goodhart’s Law consists of a proxy failing to generalize as intended into those new regimes.

In this frame, a “True Name” is a mathematical formulation which robustly generalizes as intended.

That, in turn, suggests a natural method to search for and recognize “True Names”. In some sense, they’re the easiest possible things to find, because they’re exactly the things which show up all over the place! We should be able to look at many different instances of some concept, and abstract out the same “True Name” from any of them.

Of course, the real acid test of a “True Name” is to prove, both empirically and mathematically, that systems which satisfy the conditions of the Name also have the other properties which one intuitively expects of the concept. Then we have a clear idea of just how robustly the formulation generalizes as intended.

Summary

We started out from one particular alignment strategy - a really bad one, but we care mainly about the failure modes. A central feature of the failure modes was Goodhart’s Law: when a proxy is used as an optimization target, it ceases to be a good proxy for the thing it was intended to measure. Some people would frame this as the central reason why alignment is hard.

Fortunately, Goodhart is not inevitable. It is possible to come up with formulations which match our concepts precisely enough that they hold up under lots of optimization pressure; mutual information is a good example. This is (one frame for) why alignment researchers invest in pseudophilosophical problems like “what are agents, mathematically?”. We want “True Names” of relevant concepts, formulations which will robustly generalize as intended.

Thank you to Jack, Eli and everyone who attended our discussion last week which led to this post.


1.
^

Well, strictly speaking it needs to know both the proxy and the difference between the proxy and the true underlying value function, which is sufficient to recreate the true underlying value function.


56 comments

I mostly agree with this post.

Figuring out the True Name of a thing, a mathematical formulation sufficiently precise that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen.

Precision feels pretty far from the true name of the important feature of true names. I am not quite sure what precision means, but on one definition, precision is the opposite of generality, and true names seem anti-precise. I am not saying precision is not a virtue, and it does seem like precision is involved (like precision on some meta level, maybe?).

The second half about robustness to optimization pressure is much closer, but still not right. (I think it is a pretty direct consequence of true names.) It is clearly not yet a true name, in the same way that "It is robust to people trying to push it" is not the true name of inertia.

Precision feels pretty far from the true name of the important feature of true names

You're right, I wasn't being sufficiently careful about the wording of a bolded sentence. I should have said "robust" where it said "precise". Updated in the post; thank you.

Also I basically agree that robustness to optimization is not the True Name of True Names, though it might be a sufficient condition.

TLW

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?

You fundamentally cannot, so it's a moot point. There is no way to confirm zero mutual information[1], and even if there was there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.

I do not follow your seeming dismissal of this. You acknowledge it, and then... assert it's not a problem?

An analogy: solving the Halting problem is impossible[3]. It is sometimes useful to handwave a Halting oracle as a component of proofs regardless - but at the end of the day saying 'solving the Halting problem is easy, just use a Halting oracle' is not a solution.

Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong. 

"Many people have an intuition like "everything is an imperfect halting-problem solver; we can never avoid Turing". The point of the Halting oracle example is that this is basically wrong."

Hopefully this illustrates my point.

  1. ^

    In particular, it requires calculating the distributions to infinite accuracy, which in turn requires an infinite sample. (Consider if I have two independent perfectly fair coins. I flip each of them 3x and get HHT/HHT. My mutual information is non-zero!)

  2. ^

    For a sufficient example: gravity causes any[4] two things in the universe[5] to correlate[6].

  3. ^

    At least assuming the Church-Turing hypothesis is correct.

  4. ^

    Except potentially if there's an event horizon, although even that's an open question, and in that case it's a moot point because an AI in an event horizon is indistinguishable from no AI.

  5. ^

    Strictly speaking, within each others lightcone.

  6. ^

    And as soon as you have anything causing a correlation, the probability that other factors exactly cancel said correlation is zero.
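The coin example in footnote 1 can be checked directly. The sketch below uses the naive plug-in estimate of mutual information (empirical frequencies substituted for the true probabilities); the function name is my own, and this is only one of several possible estimators. For two independent fair coins that happen to come up HHT and HHT, the estimate is far from zero.

```python
from collections import Counter
from math import log2


def plug_in_mutual_information_bits(xs, ys):
    """Naive mutual information estimate (in bits) from paired samples,
    using empirical frequencies in place of the true distribution."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((count / n) * log2((count / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), count in joint.items())


# Two independent fair coins, three flips each, which happened to match.
estimate = plug_in_mutual_information_bits("HHT", "HHT")
print(f"estimated mutual information: {estimate:.3f} bits")
```

With only three samples the estimate comes out near 0.92 bits even though the true mutual information of independent coins is zero, which is the finite-sample problem the footnote is pointing at.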

There is no way to confirm zero mutual information[1], and even if there was there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.

Thanks for bringing this up; it raises a technical point which didn't make sense to include in the post but which I was hoping someone would raise in the comments.

The key point: Goodhart problems are about generalization, not approximation.

Suppose I have a proxy $u'$ for a true utility function $u$, and $u'$ is always within $\epsilon$ of $u$ (i.e. $|u'(x) - u(x)| < \epsilon$). I maximize $u'$. Then the true utility $u$ achieved will be within $2\epsilon$ of the maximum achievable utility. Reasoning: in the worst case, $u'$ is $\epsilon$ lower than $u$ at the $u$-maximizing point, and $\epsilon$ higher than $u$ at the $u'$-maximizing point.

Point is: if a proxy is close to the true utility function everywhere, then we will indeed achieve close-to-maximal utility upon maximizing the proxy. Goodhart problems require the proxy to not even be approximately close, in at least some places.
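Spelled out as a short derivation, with notation that is my own shorthand: write $u$ for the true utility, $u'$ for the proxy, $\epsilon$ for the uniform bound on $|u'(x) - u(x)|$, $x^*$ for a maximizer of $u$, and $x'$ for a maximizer of $u'$ (assuming both maxima exist).

```latex
\begin{aligned}
u(x') &> u'(x') - \epsilon   && \text{($u'$ is within $\epsilon$ of $u$ at $x'$)} \\
      &\ge u'(x^*) - \epsilon && \text{($x'$ maximizes $u'$)} \\
      &> u(x^*) - 2\epsilon   && \text{($u'$ is within $\epsilon$ of $u$ at $x^*$)}
\end{aligned}
```

So maximizing the proxy loses less than $2\epsilon$ of true utility, no matter how hard we optimize, as long as the $\epsilon$ bound really does hold everywhere.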

When we look at real-world Goodhart problems, they indeed involve situations where some approximation only works well within some region, and ceases to even be a good approximation once we move well outside that region. That's a generalization problem, not an approximation problem.

So approximations are fine, so long as they generalize well.
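A minimal numerical sketch of the distinction, with every function and constant chosen by me purely for illustration: one proxy stays within a small tolerance of the true utility everywhere on the domain, while the other is an excellent approximation near the starting region and a terrible one far from it. Maximizing the first costs almost nothing; maximizing the second pushes far outside the region where it was accurate.

```python
from math import sin

# Search domain: a grid on [-5, 5]. The domain and functions are toy assumptions.
xs = [i / 100 for i in range(-500, 501)]
EPSILON = 0.1


def true_utility(x):
    """The thing we actually care about; maximized at x = 1."""
    return -(x - 1.0) ** 2


def close_everywhere_proxy(x):
    """Within EPSILON of the true utility at every point of the domain."""
    return true_utility(x) + EPSILON * sin(5 * x)


def local_only_proxy(x):
    """Tangent line at x = 0: error is x**2, tiny near 0 and huge far away."""
    return -1.0 + 2.0 * x


def argmax(objective):
    """Apply optimization pressure over the whole domain."""
    return max(xs, key=objective)


best_true = true_utility(argmax(true_utility))
for name, proxy in [("close everywhere", close_everywhere_proxy),
                    ("local only", local_only_proxy)]:
    x_star = argmax(proxy)
    print(f"{name}: picks x={x_star}, true utility={true_utility(x_star):.3f}, "
          f"best possible={best_true:.3f}")
```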

TLW

This is an interesting observation; I don't see how it addresses my point.

There is no exact solution to mutual information from two finite samples. There is no $\epsilon$-approximation of mutual information from two finite samples, either.

=====

On the topic of said observation: beware that $\epsilon$-approximations of many things are proven difficult to compute, and in some cases are even uncomputable. (The classic being Chaitin's Constant[1].)

In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable of unbounded computation, and even approximations thereof.

Unfortunately, 'value function of a powerful AI' tends to fall into that category[2].

  1. ^

    Which isn't "a" constant, but that's another matter.

  2. ^

    Well, as closely as anything in the physical world does, anyway. 

The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.

TLW

Let us make a distinction here between two cases:

  1. Observing the input and output of a blackbox X, and checking a property thereof.
  2. Whitebox knowledge of X, and checking a property thereof.

In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that's about the best we can say[2].

And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking if two specific Turing-complete whiteboxes are equivalent may be decidable.

in exactly the same way

It is not exactly the same way, due to the above.

 

  1. ^

    Namely, 'the laws of physics'

  2. ^

    (And worse, often doesn't exactly match in the observations thus far, or results in contradictions.)

  3. ^

    Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X' which produce that sequence of outputs given the input, but which have later output that diverges. This is not a problem in the whitebox case because said machines are distinguishable.

You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So "zero" is best understood as "zero within some tolerance". So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently. 

This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem. 

We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can't treat the hardware as blackboxes; we know how it operates, and we have to make use of that knowledge. But we can't pretend perfect mathematical knowledge of it, either; we have error tolerances. 

So your blackbox/whitebox dichotomy doesn't fit the situation very well. 

But do you really buy the whole analogy with mutual information, i.e. buy the claim that we can judge the viability of escaping Goodhart from this one example, and only object that the judgement with respect to this example was incorrect?

Perhaps we should really look at a range of examples, not just one? And judge John's point as reasonable if and only if we can find some cases where effectively perfect proxies were found?

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed. So John's plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise at least in their physical tolerances. 

In which case, I would reply that the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course. So there is hope that we will not end up in a situation where every tiny flaw is exploited. What we are looking for is plans which robustly get us to that point. 

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.

My objection is actually mostly to the example itself.

As you mention:

the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.

Compare with the example:

Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel.

[...]

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

This is analogous to the case of... trying to contain a malign AI which is already not on our side.

Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.

The fact that the mutual information cannot be zero is a good and interesting point. But, as I understand it, this is not fundamentally a barrier to it being a good "true name". It's the right target; the impossibility of hitting it doesn't change that.

Goodhart Is Not Inevitable

This is the part I was disagreeing with, to be clear.

Curated. Laying out a full story for why the work you're doing is solving AI alignment is very helpful, and this framing captures different things from other framings (e.g. Rocket Alignment, Embedded Curiosities, etc). Also it's simply written and mercifully short, relative to other such things. Thanks for this step in the conversation.

My usual framing/explanation (in conversations)

I. Individual future AI systems can be thought of as points in some abstract "space of intelligent systems".

II. Notice different types of questions:
a. What properties do some individual existing, experimentally accessible points in this space have?
b. Where in this space will we end up in the future?
c. What will be the properties of these points?
d. What's going on in generalizations/extrapolations from existing points to other parts of the space?
e. Are there some "effective theories" governing parts of the space? What are their domains of validity?
f. Are there some "fundamental theories" governing the space? What are their domains of validity?
g. What are the properties of the space? E.g., is it continuous?
h. What's the high-level evolutionary dynamic of our movement in this space?

III. Use metaphors from physics, history of science, econ,... to understand how these look in other domains, and what the relations between the questions are (e.g. relations between the construction of heat engines, thermodynamics, stat physics, engineering, related markets, maths...)

IV. Having something like a "science of intelligent systems" seems a plausibly decisive factor for the ability to robustly solve the problem.

Conceptual metaphors from other fields are often good prompts for thinking about how this may look, or what to search for.

Regardless of the exact starting point, seekers of “True Names” quickly find themselves recursing into a search for “True Names” of lower-level components of agency, like:

  • Optimization
  • Goals
  • World models
  • Abstraction

This is the big missing piece for me. Could you elaborate on how you go from trying to find the True Names of human values to things like what is an agent, abstraction, and embeddedness? 

Goals make sense, but it's not obvious why the rest would be important or relevant. I feel like this reasoning would lead you to thinking about meta-ethics or something, not embeddedness and optimization. 

I suspect I'm missing a connecting piece here that would make it all click. 

Imagine it's 1665 and we're trying to figure out the True Name of physical force - i.e. how hard it feels like something is pushing or pulling.

One of the first steps is to go through our everyday experience, paying attention to what causes stronger/weaker sensations of pushing and pulling, or what effects stronger/weaker sensations have downstream. We might notice, for instance, that heavier objects take more force to push, or that a stronger push accelerates things faster. So, we might expect to find some robust relationship between the True Names of force, mass, and acceleration. At the time, we already basically had the True Name of mass, but we still needed to figure out the True Name of acceleration.

Why do we need the True Names of mass and acceleration, rather than just trying to figure out force directly? Well, finding robust relationships between the True Names of multiple concepts is, historically, one of the main ways we've been able to tell that we have the right Names. We can use e.g. the extension of a spring to measure force, but then what makes us think this operationalization of "force" is going to robustly generalize in the ways we expect? One main reason we expect today's notion of "force" to robustly generalize is the extremely robust experimental relationship force = mass * acceleration.

(Side note: for mathematical concepts, like e.g. probability and information, the main reason we expect the formulations to robustly generalize is usually the mathematical proof of some relationship, ideally augmented with experimental evidence, rather than just experimental evidence alone.)

Also, while we're still searching for the relevant Names, relationships between concepts help steer the search - for instance, it's a lot easier to figure out the True Name of heat once we have the Name of temperature.

Anyway, to answer what I think is your real question here...

A Hypothetical Dialogue

Philosopher: Imagine that Alice and Bob both want an apple, and they ca-

Alignment Researcher: Whoa now, hold up.

Philosopher: ... I haven't even asked the question yet.

Alignment Researcher: What is this "wanting" thing?

Philosopher: You know, it's this feeling you get where -

Alignment Researcher: I don't suppose you have any idea how to formulate that mathematically?

Philosopher: How about as a utility function? I hear that formulation has some arguments going for it...

Alignment Researcher: <looks at some math and experimental results> somewhat dubious, but it looks like it's at least in the right ballpark. Ok, so Alice and Bob both "want" an apple, meaning that (all else equal) they will accept whatever trades (or tradeoffs) give them the apple, and turn down any trades (or tradeoffs) which lose them the apple? Or, in other words, they're each optimizing to get that apple.

Philosopher: Well, not exactly, we're not saying Alice and Bob always do the things which get them what they want. "Wanting" isn't a purely behavioral concept. But you know what, sure, let's go with that for now. So Alice and Bob both want an apple, and they can't both -

Alignment Researcher: Ok, hang on, I'm trying to code up this hypothetical in Python, and I'm still unsure about the type-signatures. What are the inputs of the utility functions?

Philosopher: um... <does some googling>... Well, the standard formulation says that Alice and Bob are expected utility maximizers, so the inputs to the utility function will be random variables in their world models.

Alignment Researcher: World models? What the heck are world models??

Philosopher: Well, since we're modelling them as ideal agents anyway, it's a Bayesian distribution over a bunch of random variables corresponding to things in the world, which we upd-

Alignment Researcher: "corresponding to things in the world"? I know how to do Bayesian updates on distributions, but it's not like the variables in those distributions necessarily "correspond to the world" in any particular way. Presumably we need this "correspondence" in order for Alice and Bob's internal variables representing the "apple" to correspond with some actual apple? Heck, what even is an "actual apple"? That's important, if we want Alice and Bob to "want" some actual thing in the world, and not just particular activations in their sense-data...

... Anyway, The Point Is

When we try to mathematically formulate even very basic ideas about ethics, we very quickly run into questions about how to formalize agency, world-models, etc.
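A minimal Python sketch of where the dialogue gets stuck, assuming a world model is just a finite list of (state, probability) pairs. The names WorldState, WorldModel, expected_utility, and the variable "apple_owner" are made up for illustration; none of them come from a standard library or from the dialogue itself.

```python
from typing import Callable, Dict, List, Tuple

# A "world state" is an assignment of values to named variables.
# The names are labels inside the model and nothing more.
WorldState = Dict[str, str]

# A world model: a probability distribution over world states,
# written as (state, probability) pairs whose probabilities sum to 1.
WorldModel = List[Tuple[WorldState, float]]

# A utility function takes a world state from the agent's own model.
Utility = Callable[[WorldState], float]


def expected_utility(model: WorldModel, utility: Utility) -> float:
    """Probability-weighted average of utility over the model's states."""
    return sum(prob * utility(state) for state, prob in model)


def alice_utility(state: WorldState) -> float:
    """Alice 'wants' the apple: utility 1 if her model says she owns it."""
    return 1.0 if state["apple_owner"] == "alice" else 0.0


# Alice's (hypothetical) beliefs about who ends up with the apple.
alice_model: WorldModel = [
    ({"apple_owner": "alice"}, 0.3),
    ({"apple_owner": "bob"}, 0.7),
]

alice_expected = expected_utility(alice_model, alice_utility)
```

The code type-checks and runs, yet "apple_owner" is only a dictionary key. Nothing in these signatures says which thing in the world it refers to, which is exactly the correspondence question the Alignment Researcher raises.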

Thank you! This clarifies a lot. The dialogue was the perfect blend of entertaining and informative.

I might see if you can either include it in the original post or post it as a separate one, because it really helps fill in the rationale. 

I understand the point of your dialog, but I also feel like I could model someone saying "This Alignment Researcher is really being pedantic and getting caught in the weeds." (especially someone who wasn't sure why these questions should collapse into world models and correspondence.)

(After all, the Philosopher's question probably didn't depend on actual apples, and was just using an apple as a stand-in for something with positive utility. So, the inputs of the utility functions could easily be "apples", where an apple is an object with one property, "owner": Alice prefers apple.owner == "alice" (i.e. utility(a) returns int(a.owner == "alice")), and Bob prefers apple.owner == "bob". That would sidestep the entire question of world models and correspondence.)

I suspect you did this because the half-formed question about apples was easier to come up with than a fully formed question that would necessarily require engagement with world models, and I'm not even sure that's the wrong choice. But this was the impression I got reading it.
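For concreteness, the toy formulation in the parenthetical above can be written as runnable Python. This is a sketch only; the Apple class and the function names are illustrative choices, not anything from the original dialogue.

```python
from dataclasses import dataclass


@dataclass
class Apple:
    # The single property of the toy apple: who owns it.
    owner: str


def alice_utility(apple: Apple) -> int:
    """1 if Alice owns the apple, else 0."""
    return int(apple.owner == "alice")


def bob_utility(apple: Apple) -> int:
    """1 if Bob owns the apple, else 0."""
    return int(apple.owner == "bob")


# Total utility across both agents for each possible owner.
totals = {
    owner: alice_utility(Apple(owner)) + bob_utility(Apple(owner))
    for owner in ("alice", "bob")
}
```

In this formulation the conflict is visible (only one of the two utilities can be 1 at a time), but the "apple" is an object that exists only inside the program, so the question of how it relates to anything outside the program never comes up.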

I also wonder about this. If I'm understanding the post and comment right, it's that if you don't formulate it mathematically, it doesn't generalize robustly enough? And that to formulate something mathematically you need to be ridiculously precise/pedantic?

Although this is probably wrong and I'm mostly invoking Cunningham's Law.

I doubt my ability to be entertaining, but perhaps I can be informative. The need for mathematical formulation is because, due to Goodhart's law, imperfect proxies break down. Mathematics is a tool which is rigorous enough to get us from "that sounds like a pretty good definition" (like "zero correlation" in the radio signals example), to "I've proven this is the definition" (like "zero mutual information"). 

The proof can get you from "I really hope this works" to "As long as this system satisfies the proof's assumptions, this will work", because the proof states its assumptions clearly, while "this has worked previously" could, and likely does, rely on a great number of unspecified commonalities previous instances had.

It gets precise and pedantic because it turns out that the things we often want to define for this endeavor are based on other things. "Mutual information" isn't a useful formulation without a formulation for "information". Similarly, in trying to define morality, it's difficult to define what an agent should do in the world (or even what it means for an agent to do things in the world), without ideas of agency and doing, and the world. Every undefined term you use brings you further from a formulation you could actually use to create a proof.

In all, mathematical formulation isn't the goal, it's the prerequisite. "Zero correlation" was mathematically formalized, but that was not sufficient.
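A small Python sketch of why "zero correlation" is a weaker guarantee than "zero mutual information". The variables here are a toy stand-in chosen for illustration (X uniform on {-1, 0, 1} and Y = X squared), not the radio signals from the post, and the helper names are my own.

```python
from collections import Counter
from math import log2


def covariance(pairs):
    """Covariance of X and Y over equally likely (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n


def mutual_information(pairs):
    """Mutual information (in bits) of X and Y over equally likely pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    p_x = Counter(x for x, _ in pairs)
    p_y = Counter(y for _, y in pairs)
    return sum(
        (count / n) * log2((count / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), count in joint.items()
    )


# Toy joint distribution: X is uniform on {-1, 0, 1} and Y = X**2.
pairs = [(x, x * x) for x in (-1, 0, 1)]

cov = covariance(pairs)         # zero: no linear relationship
mi = mutual_information(pairs)  # positive: Y is fully determined by X
```

Here the covariance (and hence the correlation) is zero, while the mutual information is log2(3) - 2/3 bits, roughly 0.92. A proxy built on "zero correlation" would pass this pair even though observing X tells you Y exactly.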

Why is the Alignment Researcher different than a normal AI researcher?

E.g. Markov decision processes are often conceptualized as "agents" which take "actions" and receive "rewards" etc. and I think none of those terms are "True Names".

Despite this, when researchers look into ways to give MDPs some other sort of capability or guarantee, they don't really seem to prioritize finding True Names. In your dialogue, the AI researcher seems perfectly fine accepting the philosopher's vaguely defined terms.

What is it about alignment which makes finding True Names such an important strategy, when finding True Names doesn't seem to be that important for e.g. learning from biased data sets (or any of the other million things AI researchers try to get MDPs to do)?

Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they don't:

  • There may be messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we actually needed to invent AlphaFold.
  • The True Name doesn't quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don't care that agents can rigorously prove things about their successors.
  • The question may be fundamentally ill-posed: what's the True Name of a crab? What's the True Name of a ghost?

Most of these examples are bad, but hopefully they get the point across.

It's pretty hard to argue for things that don't exist yet, but here's a few intuitions for why I think agency related concepts will beget True Names:

  • I think basically every concept looks messy before it’s not, and I think this is particularly true at the very beginning of fields. Like, a bunch of Newton’s early journaling on optics is mostly listing a bunch of facts that he’s noticed—it’s clear with hindsight which ones are relevant and which aren’t, but before we have a theory it can just look like a messy pile of observations from the outside. Perusing old science, from before theory or paradigm, gives me a similar impression for e.g., “heat,” “motion,” and “speciation.” 
  • Many scientists have, throughout history, proclaimed that science is done. Bacon was already pretty riled up about this in the late 1500s when he complained that for thousands of years everyone had been content enough with Aristotle that they had produced practically no new knowledge. Then he attempted to lay a new foundation for science, with the goal of finding True Names, and that attempt was pretty successful. In the 1800s people were saying it again, that all that was left was to calculate out what we already knew from physics, but of course then Einstein came and changed how we fundamentally conceived of it.
  • The Standard Model of physics also describes motion and speciation, but that doesn’t mean that the way to understand motion and speciation, nor their True Names, lies in clarifying how they relate to the Standard Model. 
  • I think that most of the work in figuring out True Names is in first identifying them, which is the part of the work that looks more like philosophy. E.g., I expect the True Name of a “crab” is ill-posed, but that something like “how crab-like-things use abstractions to achieve their goals” is a more likely candidate for a True Name(s).
  • The inside view reason I expect True Names for agency is hard to articulate, as part of it is a somewhat illegible sense that there are deep principles to the world, and agents make up part of our world. But I also think that, historically, when people have stared long enough at pieces of the world that permit of regularities, they are usually eventually successful (and that’s true even if the particular system itself isn’t regular, such as “chaos” and “randomness,” because you can still find them at a meta level, as with ideas like k-complexity). I don’t see that much reason to suspect that something different is happening with agents. I think it’s a harder problem than other scientific problems have been, but I still expect that it’s solvable. 

This is self-promotion, but this paper provides one type of answer for how certain questions involving agent foundations are directly important for alignment.

I don't think that link goes where you intended.

Thank you. I fixed it. Here's the raw link too. https://arxiv.org/abs/2010.05418 

I think this is a good description of what agent foundations is and why it might be needed. But the binary of 'either we get alignment by default or we need to find the True Name' isn't how I think about it.

Rather, there's some unknown parameter, something like 'how sharply does the pressure towards incorrigibility ramp up, what capability level does it start at, how strong is it'?

Setting this at 0 means alignment by default. Setting this higher and higher means we need various kinds of Prosaic alignment strategies which are better at keeping systems corrigible and detecting bad behaviour. And setting it at 'infinity' means we need to find the True Names/foundational insights.

Rohin:

My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes.

Maybe one way of getting at this is to look at ELK - if you think the simplest dumbest ELK proposals probably work, that's Alignment by Default. The harder you think prosaic alignment is, the more complex an ELK solution you expect to need. And if you think we need agent foundations, you think we need a worst-case ELK solution.

The existence of True Names is an intriguing proposition! I wonder if you have a list of examples in mind, where a simple concept generalizes robustly where it ostensibly has no business to. And some hints of how to tell a true name from a name. 

I know nothing about ML, but a little bit about math and physics, and there are concepts in each that generalize beyond our wildest expectations. Complex numbers in math, for example. Conservation laws in physics. However, at the moment of their introduction it didn't seem obvious that they would have the domain of applicability/usefulness that extends so far out.

Temperature and heat are examples from physics which I considered using for the post, though they didn't end up in there. Force, pressure, mass, density, position/velocity/acceleration, charge and current, length, angle... basically anything with standard units and measurement instruments. At some time in the past, all of those things were just vague intuitions. Eventually we found extremely robust mathematical and operational formulations for them, including instruments to measure those intuitive concepts.

On the more purely mathematical side, there's numbers (both discrete and continuous), continuity, vectors and dimensionality, equilibrium, probability, causality, functions, factorization (in the category theory sense), Lagrange multipliers, etc. Again, all of those things were once just vague intuitions, but eventually we found extremely robust mathematical formulations for those intuitive concepts. (Note: for some of those, like Lagrange multipliers or factorization, it is non-obvious that most humans do in fact have the relevant intuition, because even many people who have had some exposure to the math have not realized which intuitions it corresponds to.)

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

Flat out wrong. It's quite possible for A and B to have zero mutual information. But A and B always have mutual information conditional on some C (assuming A and B each have information). It's possible for there to be absolutely no mutual information between any two of electricity use, leaked radio, and private key, yet there is mutual information among all three. So if the adversary knows your electricity use, and can detect the leaked radio, then they know the key.
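The standard toy case for this objection is XOR. The sketch below assumes one-bit stand-ins for the three quantities (power use and key are independent fair bits, and the leaked radio bit is their XOR); this setup and the helper names are illustrative assumptions, not a model of any real system.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(samples, f, g):
    """Mutual information (bits) between f(s) and g(s) over equally likely samples."""
    n = len(samples)
    joint = Counter((f(s), g(s)) for s in samples)
    p_f = Counter(f(s) for s in samples)
    p_g = Counter(g(s) for s in samples)
    return sum(
        (count / n) * log2((count / n) / ((p_f[a] / n) * (p_g[b] / n)))
        for (a, b), count in joint.items()
    )


# Each sample is (power_bit, radio_bit, key_bit) with radio = power XOR key.
samples = [(p, p ^ k, k) for p, k in product((0, 1), repeat=2)]


def get_power(s):
    return s[0]


def get_radio(s):
    return s[1]


def get_key(s):
    return s[2]


def get_power_and_radio(s):
    return (s[0], s[1])


mi_radio_key = mutual_information(samples, get_radio, get_key)
mi_power_key = mutual_information(samples, get_power, get_key)
mi_power_radio = mutual_information(samples, get_power, get_radio)
mi_both_key = mutual_information(samples, get_power_and_radio, get_key)
```

Every pairwise mutual information comes out as zero, but the pair (power, radio) carries one full bit about the key. So "zero mutual information between the leaked signal and the internal signal" only rules out leaks to an adversary with no other side information.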

Figuring out the True Name of a thing, a mathematical formulation sufficiently precise that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen

I would like a lot more examples and elaboration on this point. I'm kind of suspicious that if you apply enough pressure, even our best abstractions fail, or at least that our actual implementations of those abstractions do.

There's some more examples in this comment.

Thank you so much for writing this! The community definitely needed this. This clarifies the motivation for agent foundations so much better than anything else I've read on it. 

Should definitely be the go-to introduction to it and be in the AGI Safety Fundamentals course. 

Not sure if I disagree or if we're placing emphasis differently.

I certainly agree that there are going to be places where we'll need to use nice, clean concepts that are known to generalize. But I don't think that the resolutions to problems 1 and 2 will look like nice clean concepts (like in minimizing mutual information). It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent. I think of some of my intuitions as my "real values" and others as mere "biases" in a thoroughly messy way.

But back on the first hand again, what's "messy" might be subjective. A good recipe for fitting values to me will certainly be simple and neat compared to the totality of information stored in my brain.

And I certainly want to move away from the framing that the way to deal with problems 1 and 2 is to say "Goodhart's law says that any difference between the proxy and our True Values gets amplified... so we just have to find our True Values" - I think this framing leads one to look for solutions in the wrong way (trying to eliminate ambiguity, trying to find a single human-comprehensible model of humans from which the True Values can be extracted, mistakes like that). But this is also kind of a matter of perspective - any satisfactory value learning process can be evaluated (given a background world-model) as if it assigns humans some set of True Values.

I think even if we just call these things differences in emphasis, they can still lead directly to disagreements about (even slightly) meta-level questions, such as how we should build trust in value learning schemes.

It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.

What's the evidence for this claim?

When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the "pointers to nail value" which we use in practice - i.e. competitive markets and reputation systems - do have clean, robust mathematical formulations.

Furthermore, before the mid-20th century, I expect that most people would have expected that competitive markets and reputation systems were inherently messy and contingent. They sure do look messy! People confuse messiness in the map for messiness in the territory.

I think of some of my intuitions as my "real values" and others as mere "biases" in a thoroughly messy way.

... this, for instance, I think is probably a map-territory confusion. The line between "real values" and "biases" will of course look messy when one has not yet figured out the True Name. That does not provide significant evidence of messiness in the territory.

Personally, I made this mistake really hard when I first started doing research in systems biology in undergrad. I thought the territory of biology was inherently messy, and I actually had an argument with my advisor that some of our research goals were unrealistic because of inherent biological messiness. In hindsight, I was completely wrong; the territory of biology just isn't that inherently messy. (My review of Design Principles of Biological Circuits goes into more depth on this topic.)

That said, the intuition that "the territory is messy" is responding to a real failure mode. The territory does not necessarily respect whatever ontology or model a human starts out with. People who expect a "clean" territory tend to be shocked by how "messy" the world looks when their original ontology/model inevitably turns out to not fit it very well. I think this is how people usually end up with the (sometimes useful!) intuition that the territory is messy.

Evidence & Priors

Note that the above mostly argued that the appearance of messiness is a feature of the map which yields little evidence about the messiness of the territory; even things with simple True Names look messy before we know those Names. But that still leaves unanswered two key questions:

  • Is there any way that we can get evidence of messiness of the territory itself?
  • What should our priors be regarding messiness in the territory?

One way to get positive evidence of messiness in the territory, for instance, is to see lots of smart people fail to find a clean True Name even with strong incentives to do so. Finding True Names is currently a fairly rare and illegible skill (there aren't a lot of Claude Shannons or Judea Pearls), so we usually don't have very strong evidence of this form in today's world, but there are possible futures in which it could become more relevant.

On the other hand, one way to get evidence of lack of messiness in the territory, even in places where we haven't yet found the True Names, is to notice that places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy. That was exactly my experience with systems biology, and is where my current intuitions on the matter originally came from.

Regarding priors, I think there's a decent argument that claims of messiness in the territory are always wrong, i.e. a messy territory is impossible in an important sense. The butterfly effect is a good example here: perhaps the flap of a butterfly's wings can change the course of a hurricane. But if the flap of any butterfly's wings has a significant chance of changing the hurricane's course, for each of the billions of butterflies in the world, then ignorance of just a few dozen wing-flaps wipes out all the information about all the other wing-flaps; even if I measure the flaps of a million butterfly wings, this gives me basically-zero information about the hurricane's course. (For a toy mathematical version of this, see here.)

The point of this example is that this "messy" system is extremely well modeled across an extremely wide variety of epistemic states as pure noise, which is in some sense quite simple. (Obviously we're invoking an epistemic state here, which is a feature of a map, but the existence of a very wide range of simple and calibrated epistemic states is a feature of the territory.) More generally, the idea here is that there's a duality between structure and noise: anything which isn't "simple structure" is well-modeled as pure noise, which itself has a simple True Name. Of course then we can extend it to talk about fractal structure, in which more structure appears as we make the model more precise, but even then we get simple approximations.
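One way to make the butterfly argument concrete is an extreme toy model, assumed here purely for illustration (it is not the toy version linked above): each butterfly contributes one independent fair bit, and the hurricane's course is the parity of all the bits. The sketch enumerates every possible world and computes how much the observed flaps say about the course.

```python
from collections import Counter
from itertools import product
from math import log2


def information_about_course(n_butterflies, n_observed):
    """Bits of information the first n_observed flaps give about the course,
    where course = parity of all n_butterflies independent fair bits."""
    worlds = list(product((0, 1), repeat=n_butterflies))
    total = len(worlds)

    # Count how often each (observed flaps, course) combination occurs.
    joint = Counter()
    for flaps in worlds:
        course = sum(flaps) % 2
        joint[(flaps[:n_observed], course)] += 1

    # Marginal counts for the observations and for the course.
    p_obs = Counter()
    p_course = Counter()
    for (obs, course), count in joint.items():
        p_obs[obs] += count
        p_course[course] += count

    return sum(
        (count / total)
        * log2((count / total) / ((p_obs[obs] / total) * (p_course[course] / total)))
        for (obs, course), count in joint.items()
    )


info_missing_one = information_about_course(8, 7)  # one flap unobserved
info_all = information_about_course(8, 8)          # every flap observed
```

In this model, missing even a single flap leaves zero bits of information about the course, while observing every flap gives the full one bit. From any realistic epistemic state the course is therefore well modeled as a fair coin, which is the sense in which the "messy" system is simple.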

Anyway, that argument about nonexistence of messy territory is more debatable than the rest of this comment, so don't get too caught up in it. The rest of the comment still stands even if the argument at the end is debatable.

It's not clear to me that your metaphors are pointing at something in particular.

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too smart, or too irrational, or too far in the future.

Biology being human-comprehensible is an interesting topic, and suppose I grant that it is - that we could have comprehensible explanatory stories for every thing our cells do, and that these stories aren't collectively leaving anything out. First off, I would like to note that such a collection of stories would still be really complicated relative to simple abstractions in physics or economics! Second, this doesn't connect directly to Goodhart's law. We're just talking about understanding biology, without mentioning purposes to which our understanding can be applied. Comprehending biology might help us generalize, in the sense of being able to predict what features will be conserved by mutation, or will adapt to a perturbed environment, but again this generalization only seems to work in a limited range, where the organism is doing all the same jobs with the same divisions between them.

The butterfly effect metaphor seems like the opposite of biology. In biology you can have lots of little important pieces - they're not individually redirecting the whole hurricane/organism, but they're doing locally-important jobs that follow comprehensible rules, and so we don't disregard them as noise. None of the butterflies have such locally-useful stories about what they're doing to the hurricane, they're all just applying small incomprehensible perturbations to a highly chaotic system. The lesson I take is that messiness is not the total lack of structure - when I say my room is messy, I don't mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution - it's just that the structure that's there isn't easy for humans to use.

I'd like to float one more metaphor: K-complexity and compression.

Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. The "True Name hypothesis" is that the compression looks like finding some simple, neat patterns that explain most of the data and we expect to generalize well, plus a lot of "diff" that's the noisy difference between the simple rules and the full bitstring. The "fractal hypothesis" is that there are a few simple patterns that do some of the work, and a few less simple rules that do more of the work, and so on for as long as you have patience. The "total mess hypothesis" is that simple rules do a small amount of the work, and a lot of the 10^8 bits is big highly-interdependent programs that would output something very different if you flipped just a few bits. Does this seem about right?

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.

I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a pointer to nail value - i.e. such a system will naturally generate economically-valuable nails. Because we have a robust mathematical formalization of efficient markets, we know the conditions under which that pointer-to-nail-value will break down: things like the factory owner being smart enough to circumvent the market mechanism, or the economy being too irrational, etc.

The lesson I take is that messiness is not the total lack of structure - when I say my room is messy, I don't mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution - it's just that the structure that's there isn't easy for humans to use.

I agree with this, and it's a good summary of the takeaway of the butterfly effect analogy. In this frame, I think our disagreement is about whether "structure which isn't easy for humans to use" is generally hard to use because the humans haven't yet figured it out (but they could easily use it if they did figure it out) vs structure which humans are incapable of using due to hardware limitations of the brain.

Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. ...

This is an analogy which I also considered bringing up, and I think you've analogized things basically correctly here. One important piece: if I can compress a bit string down to length 10^8, and I can't compress it any further, then that program of length 10^8 is itself incompressible - i.e. it's 10^8 random bits. As with the butterfly effect, we get a duality between structure and noise.

Actually, to be somewhat more precise: it may be that we could compress the length 10^8 program somewhat, but then we'd still need to run the decompressed program through an interpreter in order for it to generate our original bitstring. So the actual rule is something roughly like "any maximally-compressed string consists of a program shorter than roughly-the-length-of-the-shortest-interpreter, plus random bits" (with the obvious caveat that the short program and the random bits may not separate neatly).
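A rough empirical illustration, using zlib as a practical stand-in for an ideal compressor (zlib is far from Kolmogorov-optimal, so this only gestures at the claim). The vocabulary and the seed are arbitrary choices made for the example.

```python
import random
import zlib

# Build structured input: a long random sequence drawn from a tiny vocabulary.
rng = random.Random(0)
vocabulary = ["agent", "goal", "model", "world", "proxy", "value", "noise", "signal"]
raw = " ".join(rng.choice(vocabulary) for _ in range(50_000)).encode("ascii")

# First pass removes the obvious structure (repeated words, skewed byte usage).
once = zlib.compress(raw, 9)

# Second pass is applied to the already-compressed bytes.
twice = zlib.compress(once, 9)

sizes = {"raw": len(raw), "once": len(once), "twice": len(twice)}
```

The first pass shrinks the input substantially; the second pass gains essentially nothing, because the output of the first pass already looks like noise to the compressor. That is the structure/noise duality in miniature: whatever is left after the structure has been squeezed out behaves like random bits.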

I think you're saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I'm with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:

My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It's easy enough to point to the set of preferences as a whole—you just say “Steve's preferences right now”.

In fact, one might eventually (I expect) be able to write down the learning algorithm, reward function, etc., that led to those preferences (but we won't be able to write down the many petabytes of messy training data), and we'll be able to talk about what the preferences look like in the brain. But still, you shouldn't and can't directly optimize according to those preferences because they're self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.

So then we have a normative question: if “fulfill Steve’s preferences” isn’t a straightforward thing, then what exactly should the AGI do? Maybe we should ask Steve what value learning ought to look like? But maybe I say “I don’t know”, or maybe I give an answer that I wouldn’t endorse upon reflection, or in hindsight. So maybe we should have the AGI do whatever Steve will endorse in hindsight? No, that leads to brainwashing.

Anyway, it's possible that we'll come up with an operationalization of value learning that really nails down what we think the AGI ought to do. (Let's say, for example, something like CEV but more specific.) If we do, to what extent should we expect this operationalization to be simple and elegant, versus messy? (For example, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.) I think an answer on the messier side is quite plausible. Remember, (1) this is a normative question, and (2) that means that the foundation on which it's built is human preferences (about what value learning ought to look like), and (3) as above, human preferences are fundamentally messy because they involve a lifetime of learning from data. This is especially true if we don't want to trample over individual / cultural differences of opinion about (for example) the boundary between advice (good) vs manipulation (bad).

(Low confidence on all this.)

It's important to note that human preferences may be messy, but the mechanism by which we obtain them probably isn't. I think the question really isn't "What do I want (and how can I make an AI understand that)?" but rather "How do I end up wanting things (and how can I make an AI accurately predict how that process will unfold)?"

I don’t disagree with the first sentence (well, it depends on where you draw the line for “messy”).

I do mostly disagree with the second sentence.

I’m optimistic that we will eventually have a complete answer to your second question. But once we do have that answer, I think we’ll still have a hard time figuring out a specification for what we want the AI to actually do, for the reasons in my comment—in short, if we take a prospective approach (my preferences now determine what to do), then it’s hard because my preferences are self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.; or if we take a retrospective approach (the AGI should ensure that the human is happy with how things turned out in hindsight), then we get brainwashing and so on.

I don't really see how this is a problem. The AI should do something that is no worse from me-now's perspective than whatever I myself would have done. Given that if I have inconsistent preferences I am probably struggling to figure out how to balance them myself, it doesn't seem reasonable to me to expect an AI to do better.

I think also a mixture of prospective and retrospective makes most sense; every choice you make is a trade between your present and future selves, after all. So whatever the AI does should be something that both you-now and you-afterward would accept as legitimate.

Also, my inconsistent preferences probably all agree that becoming more consistent would be desirable, though they would disagree about how to do this; so the AI would probably try to help me achieve internal consistency in a way both me-before (all subagents) and me-after agree upon, through some kind of internal arbitration (helping me figure out what I want) and then act upon that.

And if my preferences involve things that don't exist or that I don't understand correctly, the AI may be able to extrapolate what the closest real thing is to my confused goal (improve the welfare of ghosts -> take the utility functions of currently dead people more into account in making decisions), whether both me-now and me-after would agree that this is reasonable, and then if so do that.

Again, we’re assuming for the sake of argument that there’s an AI which completely understands an adult human’s current preferences (which are somewhat inconsistent etc.), and how those preferences would change under different circumstances. We need a specification for what this AI should do right now.

If you’re arguing that there is such a specification which is not messy, can you write down exactly what that specification is? If you already said it, I missed it. Can you put it in italics or something? :)

(Your comment said that the AI “should” or “would” do this or that a bunch of times, but I’m not sure if you’re listing various different consequences of a single simple specification that you have in mind, or if you’re listing different desiderata that must be met by a yet-to-be-determined specification.)

(Again, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.)

I think out loud a lot. Assume nearly everything I say in conversations like this is desiderata I'm listing off the top of my head with no prior planning. I'm really not good at the kind of rigorous think-before-you-speak that is normative on LessWrong.

A really bad starting point for a specification which almost certainly has tons of holes in it: have the AI predict what I would do up to a given length of time in the future if it did not exist, and from there make small modifications to construct a variety of different timelines for similar things I might instead have done.

In each such timeline predict how much I-now and I-after would approve of that sequence of actions, and maximize the minimum of those two. Stop after a certain number of timelines have been considered and tell me the results. Update its predictions of me-now based on how I respond, and if I ask it to, run the simulation again with this new data and a new set of randomly deviating future timelines.

This would produce a relatively myopic (doesn't look too far into the future) and satisficing (doesn't consider too many options) advice-giving AI which would not have agency of its own but only help me find courses of action for me to do which I like better than whatever I would have done without its advice.
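To make the zeroth draft above a bit more concrete, here is a toy sketch of the loop it describes: perturb a predicted baseline timeline, score each variant by the lower of two approval predictions, and stop after a fixed number of variants. Everything named here is a hypothetical stand-in, not a real method: `approve_now` and `approve_after` are placeholders for predictors of me-now and me-after approval, a "timeline" is just a list of action labels, and the baseline prediction is assumed to be given.

```python
import random
from typing import Callable, List, Sequence, Tuple

Timeline = List[str]  # toy assumption: a timeline is a list of action labels


def perturb(baseline: Timeline, alternatives: Sequence[str], rng: random.Random) -> Timeline:
    """Make one small modification: swap a single action for an alternative."""
    variant = list(baseline)
    i = rng.randrange(len(variant))
    variant[i] = rng.choice(alternatives)
    return variant


def advise(
    baseline: Timeline,
    alternatives: Sequence[str],
    approve_now: Callable[[Timeline], float],    # hypothetical predictor of me-now's approval
    approve_after: Callable[[Timeline], float],  # hypothetical predictor of me-after's approval
    n_timelines: int = 50,                       # satisficing: only look at this many variants
    seed: int = 0,
) -> Tuple[Timeline, float]:
    """Return the timeline whose worse-off approver is happiest (maximin)."""
    rng = random.Random(seed)

    def score(timeline: Timeline) -> float:
        return min(approve_now(timeline), approve_after(timeline))

    # Start from the baseline, so the advice never scores below doing nothing.
    best, best_score = baseline, score(baseline)
    for _ in range(n_timelines):
        candidate = perturb(baseline, alternatives, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

The sketch only returns advice; it does not act, and it leaves out the step of updating the predictors from my response. The hard parts (where the approval predictors come from, and whether me-after is still meaningfully me) are exactly the parts it assumes away.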

There are almost certainly tons of failure modes here, such as a timeline where my actions seem reasonable at first, but turn me into a different person who also thinks the actions were reasonable, but who otherwise wildly differs from me in a way that is invisible to me-now receiving the advice. But it's a zeroth draft anyway.

(That whole thing there was another example of me thinking out loud in response to what you said, rather than anything preconceived. It's very hard for me to do otherwise. I just get writer's block and anxiety if I try to.)

Gotcha, thanks :) [ETA—this was in response to just the first paragraph]

I also edit my previous comments a lot after I realize there was more I ought to have said. Very bad habit - look back at the comment you just replied to please, I edited it before realizing you'd already read it! I really need to stop doing that...

Oh it’s fine, plenty of people edit their comments after posting including me, I should be mindful of that by not replying immediately :-P As for the rest of your comment:

I think your comment has a slight resemblance to Vanessa Kosoy’s “Hippocratic Timeline-Driven Learning” (Section 4.1 here), if you haven’t already heard of that.

My suspicion is that, if one were to sort out all the details, including things like the AI-human communication protocol, such that it really works and is powerful and has no failure modes, you would wind up with something that’s at least “rather messy” (again, “rather messy” means “in the same messiness ballpark as Stuart Armstrong research agenda v0.9”) (and “powerful” rules out literal Hippocratic Timeline-Driven Learning, IMO).

places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy

May I ask for a few examples of this?

The claim definitely seems plausible to me, but I can't help but think of examples like gravity or electromagnetism, where every theory to date has underestimated the messiness of the true concept. It's possible that these aren't really much evidence against the claim but rather indicative of a poor ontology:

People who expect a "clean" territory tend to be shocked by how "messy" the world looks when their original ontology/model inevitably turns out to not fit it very well.

However, it feels hard to differentiate (intuitively or formally) cases where our model is a poor fit from cases where the territory is truly messy. Without being able to confidently make this distinction, the claim that the territory itself isn't messy seems a bit unfalsifiable. Any evidence of territories turning out to be messy could be chalked up to ill-fitting ontologies.

Hopefully, seeing more examples like the competitive markets for nails will help me better differentiate the two or, at the very least, help me build intuition for why less messy territories are more natural/common.

Thank you!

Is this argument robust in the case of optimization, though?

I'd think that optimization can lead to unpredictable variation that happens to correlate and add up in such a way as to have much bigger effects than noise would have.

It seems like it would be natural for the butterfly argument to break down in exactly the sorts of situations involving agency.

I expect there's a Maxwell's Demon-style argument about this, but I have yet to figure out quite the right way to frame it.

My interpretation of this post is that before we solve the AI alignment problem, we need an abstraction to describe what agents are, and how they plan, make decisions, and anticipate consequences. It may be easier to describe how this works in general than to describe an optimal agent system, or to describe one that best reflects human psychology. As an analogy, with protein folding it is easier to first come up with a generalized model, and only then to attempt it in practice, predict the folding of specific molecules, or use such systems to produce useful products.

At first, I was skeptical of the analogy between physics or cryptography and alignment. After all, empirical observation of the physical world is physics. We can therefore expect that mathematical formalisms will be successful in this area. By contrast, empirical observation of human value is psychology. Psychology is not alignment, which is the basis of the problem.

However, agency seems like tractable ground for mathematical formalism. It encompasses computational, psychological, and physical constraints. I notice that when I try to articulate how agency works, without trying to specify what makes it work well, it becomes much easier to make progress.

People who think that Goodharting is inevitable should read Stuart's posts on the topic. He provides an example of a utility function and optimisation system for which Goodharting is not a big issue. Also, he notes that the fact that we fear Goodharting is a useful signal to an AI about the structure of our preferences.

P.S. John, how valuable do you think it would be for someone to do an "abstraction newsletter" covering both classic and new posts on the topic, like with the alignment newsletter? 

He provides an example of a utility function and optimisation system for which Goodharting is not a big issue

Um. Said utility function requires that you already know the true underlying value function[1].

If you already know the true underlying value function, Goodhart's law doesn't apply anyway. The tricky bit with Goodhart's law is trying to find said true underlying value function in the first place - close is not good enough.

[1] Well, strictly speaking it needs to know both the proxy and the difference between the proxy and the true underlying value function, which is sufficient to recreate the true underlying value function.

John, how valuable do you think it would be for someone to do an "abstraction newsletter" covering both classic and new posts on the topic, like with the alignment newsletter?

I could imagine that being quite valuable, though I am admittedly the most biased person one could possibly ask. Certainly there is a lot of material which would benefit from distillation.

Motivated me to actually go out and do agent foundations so thumbs up!

Another thing whose True Name is probably a key ingredient for alignment (and which I've spent a lot of time trying to think rigorously about): collective values.

Which is interesting, because most of what we know so far about collective values is that, for naive definitions of "collective" and "values", they don't exist. Condorcet, Arrow, Gibbard and Satterthwaite, and (crucially) Sen have all helped show that.
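For anyone who hasn't seen the simplest of these results, here is the standard three-voter Condorcet cycle as a minimal sketch. The ballots and function names are illustrative assumptions of mine; the point is only that pairwise majority rule over perfectly coherent individual rankings can yield a "collective preference" that is cyclic, so no candidate beats all the others.

```python
# Three hypothetical voters, each with a strict ranking (best first).
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]


def majority_prefers(x, y, ballots):
    """True if a strict majority of ballots rank x above y."""
    wins = sum(b.index(x) < b.index(y) for b in ballots)
    return wins > len(ballots) / 2


def condorcet_winner(ballots):
    """Return the candidate who beats every other head-to-head, or None."""
    candidates = ballots[0]
    for c in candidates:
        if all(majority_prefers(c, other, ballots) for other in candidates if other != c):
            return c
    return None


# A beats B, B beats C, and C beats A, each by two votes to one.
print(majority_prefers("A", "B", ballots))
print(majority_prefers("B", "C", ballots))
print(majority_prefers("C", "A", ballots))
print(condorcet_winner(ballots))
```

Each voter is individually transitive, yet the majority relation is not, which is the naive sense in which the "collective values" fail to exist.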

I personally don't think that means that the only useful things one can say about "collective values" are negative results like the ones above. I think there are positive things to say; definitions of collectivity (for instance, of democracy) that are both non-trivial and robust. But finding them means abandoning the naive concepts of "collective values".

I think that this is probably a common pattern. You go looking for the True Name of X, but even if that search ever bears fruit, you'd rarely if ever look back and say "Y is the True Name of X". Instead, you'd say something like "(long math notation) is the True Name of itself, or for short, of Y. Though I found this by looking for X, calling it 'X' was actually a misnomer; that phrase has baked-in misconceptions and/or red herrings, so from now on, let's call it 'Y' instead."

I'm pretty sure there is no such thing as collective values. Individual egregores (distributed agents running on human wetware, like governments, religions, businesses, etc) can have coherent values, but groups of people in general do not. Rather, there are more and less optimal (in the sense of causing minimal total regret - I'm probably thinking of Pareto optimality here) mechanisms for compromising.

The "collective values" that emerge are the result of the process, not something inherent before the process begins, and further, different processes will lead to different "collective values", the same way that different ways of thinking and making decisions will lead a person to prioritize their various desires / subagents differently.

It does look, though, as if some mechanisms for compromising work better than others. Markets and democracies work very differently, but nearly everyone agrees either one is better than dictatorship.