(This post has been sitting in my drafts folder for 6 years. Not sure why I didn't make it public, but here it is now after some editing.)

There are two problems closely related to the Ontological Crisis in Humans. I'll call them the "Partial Utility Function Problem" and the "Decision Theory Upgrade Problem".

Partial Utility Function Problem

As I mentioned in a previous post, the only apparent utility function we have seems to be defined over an ontology very different from the fundamental ontology of the universe. But even on its native domain, the utility function seems only partially defined. In other words, it will throw an error (i.e., say "I don't know") on some possible states of the heuristical model. For example, this happens for me when the number of people gets sufficiently large, like 3^^^3 in Eliezer's Torture vs Dust Specks scenario. When we try to compute the expected utility of some action, how should we deal with these "I don't know" values that come up?

(Note that I'm presenting a simplified version of the real problem we face, where in addition to "I don't know", our utility function could also return essentially random extrapolated values outside of the region where it gives sensible outputs.)
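To make the shape of the problem concrete, here is a toy sketch in Python. Everything in it (the state representation, the cutoff, the example numbers) is made up for illustration: a partially defined utility function returns None ("I don't know") outside the region where it gives sensible outputs, and the expected-utility computation has no principled way to handle those values.

```python
# A toy illustration of the problem, not a proposed solution.
# `State`, `utility`, and the cutoff below are all invented for this sketch.

from typing import Optional, List, Tuple

State = dict  # a state of the heuristical world-model

def utility(state: State) -> Optional[float]:
    """Partially defined utility: returns None ("I don't know")
    outside the region where it gives sensible outputs."""
    n = state["people_affected"]
    if n > 10**6:  # numbers like 3^^^3 are far outside the native range
        return None
    return -1.0 * n * state["harm_per_person"]

def expected_utility(lottery: List[Tuple[float, State]]) -> Optional[float]:
    """Probability-weighted utility over outcomes. The open question is
    what to do when utility() returns None for some outcome."""
    total = 0.0
    for prob, state in lottery:
        u = utility(state)
        if u is None:
            return None  # placeholder: the whole computation throws up its hands
        total += prob * u
    return total

# Example: one outcome falls outside the utility function's sensible range
# (3**3**3, i.e. 3^^3, stands in for the incomprehensibly larger 3^^^3).
lottery = [(0.5, {"people_affected": 10, "harm_per_person": 1.0}),
           (0.5, {"people_affected": 3**3**3, "harm_per_person": 1e-9})]
print(expected_utility(lottery))  # -> None: the "I don't know" propagates
```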

Decision Theory Upgrade Problem

In the Decision Theory Upgrade Problem, an agent decides that their current decision theory is inadequate in some way, and needs to be upgraded. (Note that the Ontological Crisis could be considered an instance of this more general problem.) The question is whether and how to transfer their values over to the new decision theory.

For example, a human might be running a mix of several decision theories: reinforcement learning, heuristical model-based consequentialism, identity-based decision making (where you adopt one or more social roles, like "environmentalist" or "academic" as part of your identity and then make decisions based on pattern matching what that role would do in any given situation), as well as virtue ethics and deontology. If you are tempted to drop one or more of these in favor of a more "advanced" or "rational" decision theory, such as UDT, you have to figure out how to transfer the values embodied in the old decision theory, which may not even be represented as any kind of utility function, over to the new.

Another instance of this problem can be seen in someone just wanting to be a bit more consequentialist. Maybe UDT is too strange and impractical, but our native model-based consequentialism at least seems closer to being rational than the other decision procedures we have. In this case we tend to assume that the consequentialist module already has our real values and we don't need to "port" values from the other decision procedures that we're deprecating. But I'm not entirely sure this is safe, since the step going from (for example) identity-based decision making to heuristical model-based consequentialism doesn't seem that different from the step between heuristical model-based consequentialism and something like UDT.

Comments

In addition to the Ontological Crisis in Humans post that Wei linked, this (underappreciated?) post by Eliezer from 2016 might be helpful background material: Rescuing the utility function.

(It was probably my favorite piece of his writing from that year.)

Humans are not immediately prepared to solve many decision problems, and one of the hardest problems is formulation of preference for a consequentialist agent. In expanding the scope of well-defined/reasonable decisions, formulating our goals well enough for use in a formal decision theory is perhaps the last milestone, far outside of what can be reached with a lot of work!

Indirect normativity (after distillation) can make the timeline for reaching this milestone mostly irrelevant, as long as there is sufficient capability to compute the outcome, and amplification is about capability. It's unclear how the scope of reasonable decisions is related to capability within that scope; amplification seems ambiguous between the two, and perhaps the scope of reasonable decisions is just another kind of thing that can be improved. And it's corrigibility's role to keep the AI within the scope of well-defined decisions.

But with these principles in place, it's unclear if formulating goals for consequentialist agents remains a thing, when instead it's possible to just continue to expand the scope of reasonable decisions and to distill/amplify them.

formulating our goals well enough for use in a formal decision theory is perhaps the last milestone, far outside of what can be reached with a lot of work!

Did you mean to say "without a lot of work"?

(Or did you really mean to say that we can't reach it, even with a lot of work?)

The latter, where "a lot of work" is the kind of thing humanity can manage in subjective centuries. In an indirect normativity design, doing much more work than that should still be feasible, since it's only specified abstractly, to be predicted by an AI, enabling distillation. So we can still reach it, if there is an AI to compute the result. But if there is already such an AI, perhaps the work is pointless, because the AI can carry out the work's purpose in a different way.

Zvi:

I agree strongly that, as a problem for humans, assuming that the consequentialist model has all our real values is not a safe assumption.

I would go further, and say that this assumption is almost always going to be importantly wrong and result in loss of important values. Nor do I think this is a hypothetical failure mode, at all; I believe it is common in our circles.

Consequentialism over world histories feels pretty safe to me. Consequentialism over world states seems pretty unsafe to me. Do you feel that even consequentialism over world histories is unsafe? What would be a potential value that couldn't be captured by that model?

I believe it is common

Name one example? :)

I'm here from the future to say "Sam Bankman-Fried".

Well, I think it's not very hard (even in our circles) to find people doing consequentialism badly, looking only at short-term / easily observable consequences (I think this is especially common among newer EA folk, and some wannabe-Slytherin types). It seemed likely Zvi meant a stronger version of the claim though, which I'm not sure how I'd operationalize.

I have lost the link, but I read a post from someone in the community about how grieving takes place over time because you have to grieve separately for each place or scenario that is important to your memory of the person.

It seems like the same mechanism would be required here, just for reasoning.

Valentine's The Art of Grieving Well, perhaps?

I’d like to suggest that grieving is how we experience the process of a very, very deep part of our psyches becoming familiar with a painful truth. It doesn’t happen only when someone dies. For instance, people go through a very similar process when mourning the loss of a romantic relationship, or when struck with an injury or illness that takes away something they hold dear (e.g., quadriplegia). I think we even see smaller versions of it when people break a precious and sentimental object, or when they fail to get a job or into a school they had really hoped for, or even sometimes when getting rid of a piece of clothing they’ve had for a few years.

In general, I think familiarization looks like tracing over all the facets of the thing in question until we intuitively expect what we find. I’m particularly fond of the example of arriving in a city for the first time: At first all I know is the part of the street right in front of where I’m staying. Then, as I wander around, I start to notice a few places I want to remember: the train station, a nice coffee shop, etc. After a while of exploring different alleyways, I might make a few connections and notice that the coffee shop is actually just around the corner from that nice restaurant I went to on my second night there. Eventually the city (or at least those parts of it) start to feel smaller to me, like the distances between familiar locations are shorter than I had first thought, and the areas I can easily think of now include several blocks rather than just parts of streets.

I’m under the impression that grief is doing a similar kind of rehearsal, but specifically of pain. When we lose someone or something precious to us, it hurts, and we have to practice anticipating the lack of the preciousness where it had been before. We have to familiarize ourselves with the absence.

When I watch myself grieve, I typically don’t find myself just thinking “This person is gone.” Instead, my grief wants me to call up specific images of recurring events — holding the person while watching a show, texting them a funny picture & getting a smiley back, etc. — and then add to that image a feeling of pain that might say “…and that will never happen again.” My mind goes to the feeling of wanting to watch a show with that person and remembering they’re not there, or knowing that if I send a text they’ll never see it and won’t ever respond. My mind seems to want to rehearse the pain that will happen, until it becomes familiar and known and eventually a little smaller.

I think grieving is how we experience the process of changing our emotional sense of what’s true to something worse than where we started.

Unfortunately, that can feel on the inside a little like moving to the worse world, rather than recognizing that we’re already here.

That’s the one! Greatly appreciated.

aio:

Here is a possible mechanism for the Decision Theory Upgrade Problem.

The agent first considers several scenarios, each with a weight for its importance. Then, for each scenario, the agent compares the output of its current decision theory with the output of its candidate decision theory, and computes a loss for that scenario. The loss is higher when the current decision theory considers the candidate's preferred action to be certainly wrong or very wrong. The agent updates its decision theory when the weighted average loss is small enough to be compensated by a gain in representational simplicity.

This points to a possible solution to the ontological crisis as well. The agent will basically look for a simple decision theory under the new ontology that approximates its actions under the old one.
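A rough sketch of how this weighted comparison might look, in Python. The Scenario and DecisionTheory structures, the disapproval measure, and the simplicity tradeoff are all assumptions made for illustration, not anything specified in the comment above.

```python
# A minimal sketch of the weighted-loss comparison described above.
# All names and structures here are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

Action = str

@dataclass
class Scenario:
    description: str
    weight: float  # importance weight assigned by the agent

@dataclass
class DecisionTheory:
    name: str
    choose: Callable[[Scenario], Action]               # what this theory would do
    disapproval: Callable[[Scenario, Action], float]   # 0 = fine, 1 = certainly wrong
    complexity: float                                   # representation cost (lower is simpler)

def weighted_loss(current: DecisionTheory,
                  candidate: DecisionTheory,
                  scenarios: List[Scenario]) -> float:
    """Weighted average, over scenarios, of how strongly the current
    theory disapproves of the candidate theory's preferred action."""
    total_weight = sum(s.weight for s in scenarios)
    loss = sum(s.weight * current.disapproval(s, candidate.choose(s))
               for s in scenarios)
    return loss / total_weight

def should_upgrade(current: DecisionTheory,
                   candidate: DecisionTheory,
                   scenarios: List[Scenario],
                   simplicity_tradeoff: float = 1.0) -> bool:
    """Upgrade when the disagreement loss is small enough to be
    compensated by the candidate's gain in representational simplicity."""
    simplicity_gain = current.complexity - candidate.complexity
    return weighted_loss(current, candidate, scenarios) <= simplicity_tradeoff * simplicity_gain
```

The same machinery could be pointed at the ontological-crisis case by letting the candidate theory be one defined over the new ontology and asking how closely its choices approximate the old theory's choices on the scenarios the agent cares about.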

In the Decision Theory Upgrade Problem, presumably the agent decides that their current decision theory is inadequate using their current decision theory. Why wouldn't it then also show the way on what to replace it with?