I've been reading through this to get a sense of the state of the art at the moment:

http://lukeprog.com/SaveTheWorld.html

Near the bottom, when discussing safe utility functions, the discussion seems to center on analyzing human values and extracting from them some sort of clean, mathematical utility function that is universal across humans.  This seems like an enormously difficult (potentially impossible) way of solving the problem, due to all the problems mentioned there.

Why shouldn't we just try to design an average bounded utility maximizer?  You'd build models of all your agents (if you can't model arbitrary ordered information systems, you haven't got an AI), run them through your model of the future resulting from a choice, take the summation of their utility over time, and take the average across all the people all the time.  To measure the utility (or at least approximate it), you could just ask the models.  The number this spits out is the output of your utility function.  It'd probably also be wise to add a reflexive consistency criterion, such that the original state of your model must consider all future states to be 'the same person'; I acknowledge that that last one is going to be a bitch to formalize.  When you've got this utility function, you just... maximize it.

Something like this approach seems much more robust.  Even if human values are inconsistent, we still end up in a universe where most (possibly all) people are happy with their lives, and nobody gets wireheaded.  Because it's bounded, you're even protected against utility monsters.  Has something like this been considered?  Is there an obvious reason it won't work, or would produce undesirable results?

Thanks,

Dolores        

[-]Nisan230

Your proposal entails constructing a single utility function that is responsible for representing the welfare of everyone. This is the kind of thing Muehlhauser is talking about, so your proposal fits into the plan. However, it has a number of problems:

  • I don't see why average utility would be bounded.
  • Asking people how much utility they have won't give you a utility function because, for one thing, humans don't have preferences that are consistent with a utility function.
  • Utilities are determined up to an additive constant and a positive multiplicative constant, so there is no canonical way of comparing utilities between people, so there is no canonical way of averaging utilities.
  • You don't specify what to do when the population changes. Repugnant Conclusion-type problems are inevitable.
  • You don't specify what to do when the population changes. Do new people get an equal say? If so, present people are incentivized to make copies or near-copies of themselves.
  • Without your reflective consistency criterion, your proposal would transform everyone into little cards that read "I am as happy as can be". So one thing your reflective consistency criterion must do is make sure that future people don't have bad preferences. But the problem of how to distinguish "good" preferences from "bad" preferences is the problem we were trying to solve in the first place.

EDIT:

  • The models of people may themselves be people, and they will not be experiencing the best possible lives.

Utilities are determined up to an additive constant and a positive multiplicative constant, so there is no canonical way of comparing utilities between people, so there is no canonical way of averaging utilities.

If I understand you correctly that means that normalized utilities (say between -1 and 1) would be incomparable?

How do you normalize utilities?

For each utility function, linearly map its range to (-1,1), or even simpler (0,1). Find each function's maximal utility (u_max) and minimal utility (u_min), then apply the function normalize(x) = ( x - u_min ) / (u_max - u_min).

I can see problems with this:

  • The zero point as a point of comparison is lost. This can be fixed by scaling negative and positive utilities separately in the (-1,0) and (0,1) ranges, but that messes with the relative utility of good and bad outcomes.
  • Infinite utilities don't work. Does any theory handle infinite utilities?
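A minimal sketch of the normalization described above, assuming each agent has a finite set of outcomes with finite utilities. The outcome names and numbers are made up for illustration; `normalize_all` is a hypothetical helper, not something from the thread.

```python
def normalize(x, u_min, u_max):
    """Linearly map a utility x from [u_min, u_max] onto [0, 1]."""
    if u_max == u_min:
        raise ValueError("utility function is constant; its range has zero width")
    return (x - u_min) / (u_max - u_min)


def normalize_all(utilities):
    """Normalize a dict of outcome -> utility using the dict's own extremes."""
    u_min = min(utilities.values())
    u_max = max(utilities.values())
    return {outcome: normalize(u, u_min, u_max) for outcome, u in utilities.items()}


# Hypothetical agent: the raw zero point sits at "ordinary day".
raw = {"bad day": -40.0, "ordinary day": 0.0, "great day": 10.0}
scaled = normalize_all(raw)
print(scaled)
```

In this example the raw zero ("ordinary day") no longer maps to anything special after scaling, which is the first problem listed above. The function also has nothing sensible to do when `u_min` or `u_max` is infinite, which is the second.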

Ah, I see. You're assuming agents have bounded utility. Well in that case, yes, there is a canonical way to compare utilities. However, that by itself doesn't justify adopting that particular way of comparing utilities. Suppose you have two agents, A and B, with identical preferences except that agent A strongly prefers there to be an odd number of stars in the Milky Way. As long as effecting that desire is impractical, A and B will exhibit the same preferences; but normalizing their utilities to fit the range (-1,1) will mean that you treat A as a utility monster.

I think calibrating utility functions by their extreme values is weird because outcomes of extreme utility are exotic and don't occur in practice. If one really wants to compare decision-theoretic utilities between people, perhaps a better approach is choosing some basket of familiar outcomes to calibrate on. This would be interesting to see and I'm not sure if anyone has thought about that approach.

As for your two asides:

  • Zero actually does not hold any significance as a value of a von Neumann-Morgenstern utility function.

  • One place infinite utilities come up is when an agent has lexicographic preferences. I think there's a straightforward extension of the theory to this case. But I don't think humans have lexicographic preferences.

Ah, I see. You're assuming agents have bounded utility. Well in that case, yes, there is a canonical way to compare utilities. However, that by itself doesn't justify adopting that particular way of comparing utilities. Suppose you have two agents, A and B, with identical preferences except that agent A strongly prefers there to be an odd number of stars in the Milky Way. As long as effecting that desire is impractical, A and B will exhibit the same preferences; but normalizing their utilities to fit the range (-1,1) will mean that you treat A as a utility monster.

Is bounded utility truly necessary to normalize it? So long as the utility function never actually returns infinity in practice, normalization will work. What would a world state with infinite utility look like, anyway, and would it be reachable from any world state with finite utility? Reductionism implies that one single physical change would cause a discontinuous jump in utility from finite to infinite, and that seems to break the utility function itself. Another way to look at it is that the utility function is unbounded because it depends on the world state; if the world state were allowed to be infinite, then an infinite utility could result. However, we are fairly certain that we will only ever have access to a finite amount of energy and matter in this universe. If that turns out not to be true then I imagine utilitarianism will cease to be useful as a result.

I'm failing to understand your reasoning about treating A as a utility monster (normalizing would make its utilities slightly lower than B's for the same things, right?). I suppose I don't really see this as a problem, though. If "odd number of stars in the Milky Way" has utility 1 for A, then that means A actually really, really wants "odd number of stars in the Milky Way", at the expense of everything else. All other things being equal, you might think it wise to split an ice cream cone evenly between A and B, but B will be happy with half an ice cream cone and A will be happy with half an ice cream cone except for the nagging desire for an odd number of stars in the galaxy. If you've ever tried to enjoy an ice cream cone while stressed out, you may understand the feeling. If nothing can be done to assuage A's burning desire that ruins the utility of other things, then why not give more of those things to B? If, instead, you meant that if A values odd stars with utility 1 we should pursue that over all of B's goals, then I don't think that follows. If it's just A and B, the fair thing would be to spend half the available resources on confirming an odd number of stars or destroying one star and the other half on B's highest preference.

I think calibrating utility functions by their extreme values is weird because outcomes of extreme utility are exotic and don't occur in practice. If one really wants to compare decision-theoretic utilities between people, perhaps a better approach is choosing some basket of familiar outcomes to calibrate on. This would be interesting to see and I'm not sure if anyone has thought about that approach.

I thought it was similarly weird to allow any agent to, for instance, obtain 3^^^3 utilons for some trivially satisfiable desire. Isn't that essentially what allows the utility monster in the first place? I see existential risk and the happiness of future humans as similar problems; if existential risk is incredibly negative then we should do nothing but alleviate existential risk. If the happiness of future humans is so incredibly positive then we should become future human happiness maximizers (and by extension each of those future humans should also become future human happiness maximizers).

The market has done a fairly good job of assigning costs to common outcomes. We can compare outcomes by what people are willing to pay for them (or pay to avoid them), assuming they have the same economic means at their disposal.

Another idea I have had is to use instant run-off voting for world states. In the utility function, every world state is ranked according to preferences and then the first world state to achieve a 50% majority of votes in the run-off process is the most ethical world state.
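To make the run-off idea concrete, here is a rough sketch of instant run-off voting over a handful of named world states. It uses the standard rule that a winner needs more than half of the counted votes; the alphabetical tie-break and the example ballots are assumptions added for illustration.

```python
from collections import Counter


def instant_runoff(ballots):
    """Return the instant run-off winner.

    Each ballot is a list of world states, most preferred first.
    """
    remaining = {state for ballot in ballots for state in ballot}
    while remaining:
        # Count each ballot for its highest-ranked state still in the race.
        tally = Counter({state: 0 for state in remaining})
        for ballot in ballots:
            for choice in ballot:
                if choice in remaining:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        leader, votes = max(tally.items(), key=lambda kv: (kv[1], kv[0]))
        if votes * 2 > total or len(remaining) == 1:
            return leader
        # Eliminate the weakest state; ties are broken alphabetically (an assumption).
        loser = min(tally.items(), key=lambda kv: (kv[1], kv[0]))[0]
        remaining.discard(loser)
    return None


# Hypothetical electorate: one voter cares most about an odd star count.
ballots = [
    ["odd stars", "even stars", "status quo"],
    ["even stars", "status quo", "odd stars"],
    ["even stars", "status quo", "odd stars"],
    ["status quo", "even stars", "odd stars"],
    ["status quo", "even stars", "odd stars"],
]
print(instant_runoff(ballots))
```

Here "odd stars" is eliminated in the first round and that ballot transfers to its second choice. Note that this only uses rankings, so it throws away any information about how strongly each voter prefers one state to another.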

Bounded utility and infinite utility are different things. A utility function u from outcomes to real numbers is bounded if there is a number M such that for every outcome x, we have |u(x)| < M.

When we talk about utility functions, we're talking about functions that encode a rational agent's preferences. It does not represent how happy an agent is.

Bounded utility and infinite utility are different things. A utility function u from outcomes to real numbers is bounded if there is a number M such that for every outcome x, we have |u(x)| < M.

I was confused, thanks. There are two ways that I can imagine having a bounded utility function; either define the function so that it has a finite bound or only define it over a finite domain. I was only thinking about the former when I wrote that comment (and not assuming its range was limited to the reals, e.g. "infinity" was a valid utility), and so I missed the fact that the utility function could be unbounded as the result of an infinite domain.

When we talk about utility functions, we're talking about functions that encode a rational agent's preferences. It does not represent how happy an agent is.

First of all, was I wrong in assuming that A's high preference for an odd number of stars puts it at a disadvantage to B in normalized utility, making B the utility monster? If not, please explain how A can become a utility monster if, e.g. A's most important preference is having an odd number of stars and B's most important preference is happily living forever. Doesn't a utility monster only happen if one agent's utility for the same things is overvalued, which normalization should prevent?

What does it mean for A and B to "have identical preferences" if in fact A has an overriding preference for an odd number of stars? I think that the maximum utility (if it exists) that an agent can achieve should be normalized against the maximum utility of other agents otherwise the immediate result is a utility monster. It's one thing for A to have its own high utility for something, it's quite another for A to have arbitrarily more utility than any other agent.

Also, if A's highest preference has no chance of being an outcome then isn't the solution to fix A's utility function instead of favoring B's achievable preferences? The other possibility is to do run-off voting on desired outcomes so that A's top votes are always going to be for outcomes with an odd number of stars, but when those world states lose the votes will run off to the outcomes that are identical except for there being an even or indeterminate number of stars, and then A's and B's voting preferences will be exactly the same.

[-][anonymous]20

Agent utility and utilitarian utility (this renormalization/combining business) are two entirely separate things. No reason the former has to impact the latter; in fact, as we can see, it causes utility monsters and such.

I can't comment further. Every way I look at it, combining preferences (utilitarianism) is utterly incoherent. Game theory/cooperation seems the only tractable path. I don't know the context here tho...

if A's highest preference has no chance of being an outcome then isn't the solution to fix A's utility function

Solution for who? A certainly doesn't want you mucking around in its utility function, as that would cause it to not do good things in the universe (from its perspective).

Solution for who? A certainly doesn't want you mucking around in its utility function, as that would cause it to not do good things in the universe (from its perspective).

If A knows that a preferred outcome is completely unobtainable and it knows that some utilitarian theorist is going to discount its preferences with regard to another agent, isn't it rational to adjust its utility function? Perhaps it's not; striving for unobtainable goals is somehow a human trait.

[-][anonymous]00

In pathological cases like that, sure, you can blackmail it into adjusting its post-op utility function. But only if it became convinced that that gave it a higher chance of getting the things it currently wants.

A lot of those pathological cases go away with reflectively consistent decision theories, but perhaps not that one. Don't feel like working it out.

Ah, you're right. B would be the utility monster. Not because A's normalized utilities are lower, but because the intervals between them are shorter. I could go into more detail in a top-level Discussion post, but I think we're basically in agreement here.
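A small numerical sketch of the interval point, under made-up utilities: A and B value ice cream identically, but A also places a very large value on an odd star count, which is treated as unreachable. All outcome names and numbers are illustrative assumptions, as is the use of a plain sum of the two agents' utilities.

```python
def normalize(utilities):
    """Min-max scale one agent's utilities onto [0, 1]."""
    lo, hi = min(utilities.values()), max(utilities.values())
    return {outcome: (u - lo) / (hi - lo) for outcome, u in utilities.items()}


# Illustrative numbers (assumptions): same ice cream preferences,
# but A assigns a huge value to an odd number of stars.
b_raw = {"no cone": 0.0, "half cone": 6.0, "whole cone": 10.0, "odd stars": 10.0}
a_raw = {"no cone": 0.0, "half cone": 6.0, "whole cone": 10.0, "odd stars": 1000.0}

a_norm, b_norm = normalize(a_raw), normalize(b_raw)

# Feasible ways to hand out one cone; the star count cannot be changed.
allocations = {
    "A gets the cone": ("whole cone", "no cone"),
    "split the cone": ("half cone", "half cone"),
    "B gets the cone": ("no cone", "whole cone"),
}


def best_allocation(a_utils, b_utils):
    """Pick the allocation with the highest summed utility."""
    totals = {name: a_utils[a] + b_utils[b] for name, (a, b) in allocations.items()}
    return max(totals, key=totals.get)


print(best_allocation(a_raw, b_raw))
print(best_allocation(a_norm, b_norm))
```

With the raw numbers the split wins. After normalization A's differences between feasible outcomes shrink to at most 0.01, so the sum is dominated by B and B gets the whole cone, even though the two agents behave identically in practice.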

Also, if A's highest preference has no chance of being an outcome then isn't the solution to fix A's utility function instead of favoring B's achievable preferences?

Well, now you're abandoning the program of normalizing utilities and averaging them, the inadequacy of which program this thought experiment was meant to demonstrate.

Is bounded utility truly necessary to normalize it? So long as the utility function never actually returns infinity in practice, normalization will work.

Huh?

Suppose my utility function is unbounded and linear in kittens (for any finite number of kittens I am aware of, that number is the output of my utility function). How do you normalize this utility to [-1,1] (or any other interval) while preserving the property that I'm indifferent between 1 kitten and a 1/N chance of N kittens?
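The difficulty can be checked numerically. Expected-utility preferences are preserved only by positive affine rescalings, and no affine map sends an unbounded function into a bounded interval, so any squashing has to be nonlinear. The sketch below uses `tanh` as one arbitrary choice of squashing function (an assumption, not something proposed in the thread) and compares the two lotteries.

```python
import math


def linear_utility(kittens):
    """Unbounded utility: one util per kitten."""
    return float(kittens)


def squashed_utility(kittens):
    """One arbitrary bounded, increasing rescaling into [0, 1)."""
    return math.tanh(kittens / 10.0)


def expected_utility(u, lottery):
    """lottery is a list of (probability, kittens) pairs."""
    return sum(p * u(k) for p, k in lottery)


N = 100
sure_thing = [(1.0, 1)]
gamble = [(1.0 / N, N), (1.0 - 1.0 / N, 0)]

for name, u in [("linear", linear_utility), ("squashed", squashed_utility)]:
    print(name, expected_utility(u, sure_thing), expected_utility(u, gamble))
```

The linear agent is indifferent between the two lotteries, while the squashed version strictly prefers the sure kitten, so it no longer represents the same preferences.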

Is the number of possible kittens bounded? That's the point I was missing earlier.

If the number of kittens is bounded by M, your maximum utility u_max is bounded by M times the constant utility of a kitten (M * u_kitten). Therefore, once u_max is normalized to 1, u_kitten comes out as 1/M.

In future, consider expressing these arguments in terms of ponies. Why make a point using hypothetical utility functions, when you can make the same point by talking about what we really value?

I can think of an infinite utility scenario. Say the AI figures out a way to run arbitrarily powerful computations in constant time. Say its utility function is over survival and happiness of humans. Say it runs an infinite loop (in constant time), consisting of a formal system containing implementations of human minds, which it can prove will have some minimum happiness, forever. Thus, it can make predictions about its utility a thousand years from now just as accurately as ones about a billion years from now, or n, where n is any finite number of years. Summing the future utility of the choice to turn on the computer, from zero to infinity, would be an infinite result. Contrived I know, but the point stands.
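Spelled out, assuming the guaranteed minimum happiness corresponds to a per-year utility of at least some constant $u_{\min} > 0$ (the symbols $u_t$ and $u_{\min}$ are introduced here for illustration):

```latex
\sum_{t=0}^{T} u_t \;\ge\; \sum_{t=0}^{T} u_{\min} \;=\; (T+1)\,u_{\min}
\;\longrightarrow\; \infty \quad \text{as } T \to \infty .
```

So the undiscounted total has no finite value, however small the guaranteed minimum is.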

I don't see why average utility would be bounded.

Because this strikes me as a nightmare scenario. Besides, we're relying on the models to self-report total happiness. Leaving it on an unbounded scale creates incentives for abuse.

Asking people how much utility they have won't give you a utility function because, for one thing, humans don't have preferences that are consistent with a utility function.

The question would be more like 'assuming you understand standard deviation units, how satisfied with your life are you right now, in standard deviation units, relative to the average?' Happy, satisfied people give the machine more utility.

Utilities are determined up to an additive constant and a positive multiplicative constant, so there is no canonical way of comparing utilities between people, so there is no canonical way of averaging utilities.

Okay, but that doesn't mean you can't build a machine that maximizes the number of happy people, under these conditions. Calling it utility is just shorthand.

I need to go to class right now, but I'll get into population changes when I get home this evening.

Presumably, the reflective consistency criterion would be something along the lines of 'hey, model, here's this other model -- does he seem like a valid continuation of you?' No value judgments involved.

EDIT:

Okay, here's how you handle agents being created or destroyed in your predicted future. For agents that die, you feed that fact back into the original state of the model, and allow it to determine utility for that state. So, if you want to commit suicide, that's fine -- dying becomes positive utility for the machine.

Creating people is a little more problematic. If new people's utility is naively added, well, that's bad. Because then, the fastest way to maximize its utility function is to kill the whole human race, and then start building resource-cheap barely-sapient happy monsters that report maximum happiness all the time. So you need to add a necessary-but-not-sufficient condition that any action taken has to maximize both the utility of all foreseeable minds, AND the utility of all minds currently alive. That means that happy monsters are no good (insofar as they eat resources that we'll eventually need), and it means that Dr. Evil won't be allowed to make billions of clones of himself and take over the world. This should also eliminate repugnant conclusion scenarios.

Presumably, the reflective consistency criterion would be something along the lines of 'hey, model, here's this other model -- does he seem like a valid continuation of you?' No value judgments involved.

So this looks like the crucial part of your proposal. By what criteria should an agent judge another agent to be a "valid continuation" of it? That is, what do you mean by "valid continuation"? What kinds of judgments do you want these models to make?

There are a few very different ways you could go here. For the purpose of illustration, consider this: If I can veto a wireheaded version of me because I know that I don't want to be wireheaded, then it stands to reason that a racist person can veto a non-racist version of themselves because they know they don't want to be racist. So the values that the future model holds cannot be a criterion in our judgment of whether the future model is a "valid continuation". What criteria, then, can we use? Maybe we are to judge an agent a "valid continuation" if they are similar to us in core personality traits. But surely we expect long-lived people to have evolving core personality traits. The Nisan of 200 years from now would be very different from me.

Like I said, that part is tricky to formalize. But, ultimately, it's an individual choice on the part of the model (and, indirectly, the agent being modeled). I can't formalize what counts as a valid continuation today, let alone in all future societies. So, leave it up to the agents in question.

As for the racism thing: yeah, so? You would rather we encode our own morality into our machine, so that it will ignore aspects of people's personality we don't like? I suppose you could insist that the models behave as though they had access to the entire factual database of the AI (so, at least, they couldn't be racist simply out of factual inaccuracy), but that might be tricky to implement.

As for the racism thing: yeah, so?

Which scenario are you affirming? I'm trying to understand your intention here. Would a racist get to veto a nonracist future version of themself?

I can't formalize what counts as a valid continuation today, let alone in all future societies. So, leave it up to the agents in question.

I think you use the words "valid continuation" to refer to a confused concept. That's why it seems hard to formalize. There is no English sentence that successfully refers to the concept of valid continuation, because it is a confused concept.

If you propose to literally ask models "is this a valid continuation of you?" and simulate them sitting in a room with the future model, then you've got to think about how the models will react to those almost-meaningless words. You might as well ask them "is this a wakalix?".

So, it is choosing among a set F of possible futures for a set A of agents whose values it is trying to implement by that choice.
And the idea is, for each Fn and An, it models An in Fn and performs a set of tests T designed to elicit reports of the utility of Fn to An at various times, the total of which it represents as a number on a shared scale. After doing this for all of A for a given Fn, it has a set of numbers, which it averages to get an average utility for Fn.
Then it chooses the future with the maximum average utility.

Yes? Did I understand that correctly?
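The procedure as restated here can be sketched in a few lines. The `elicit` callback stands in for the test battery T and is purely hypothetical; the agents, futures, and reported numbers are placeholders.

```python
from statistics import mean


def choose_future(futures, agents, elicit):
    """Pick the future with the highest average reported utility.

    elicit(agent, future) stands in for T: it returns the agent's total
    reported utility in that future on a shared [0, 1] scale.
    """
    def average_utility(future):
        return mean(elicit(agent, future) for agent in agents)

    return max(futures, key=average_utility)


# Placeholder reports; in the proposal these would come from simulated models.
reports = {
    ("alice", "F1"): 0.7, ("bob", "F1"): 0.4,
    ("alice", "F2"): 0.5, ("bob", "F2"): 0.8,
}
chosen = choose_future(["F1", "F2"], ["alice", "bob"], lambda a, f: reports[(a, f)])
print(chosen)
```

Everything interesting is hidden inside `elicit`; the maximization itself is trivial once that function exists.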

If so, then I agree that something like this can work, but the hard part seems to be designing T in such a way that it captures the stuff that actually matters.

For example, you say "nobody gets wireheaded", but I don't see how that follows. If we want to avoid wireheading, we want T designed in such a way that it returns low scores when applied to a model of me in a future in which I wirehead. But how do we ensure T is designed this way?

The same question arises for lots of other issues.

If I've understood correctly, this proposal seems to put a conceptually hard part of the problem in a black box and then concentrate on the machinery that uses that box.

EDIT: looking at your reply to Mitchell Porter above, I conclude that your answer is T consists of asking An in Fn how satisfied it is with its life on some bounded scale. In which case I really don't understand how this avoids wireheading.

I think you've pretty much got it. Basically, instead of trying to figure out a universal morality across humans, you just say 'okay, fine, people are black boxes whose behavior you can predict, let's build a system to deal with that black box.'

However, instead of trying to get T to be immune to wireheading, I suggested that we require reflexive consistency -- i.e. the model-as-it-is-now should be given a veto vote over predicted future states of itself. So, if the AI is planning to turn you into a barely-sapient happy monster, your model should be able to look at that future and say 'no, that's not me, I don't want to become that, that agent doesn't speak for me,' replacing the value of T with zero utility.

EDIT: There's almost certainly a better way to do it than naively asking the question, but that will suffice for this discussion.
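As a rough illustration of the veto, assuming the approval judgment is available as a yes/no answer from the present-day model: `report` and `is_valid_continuation` below are hypothetical callbacks, and the two futures and their numbers are invented.

```python
def vetoed_utility(reported, approves):
    """Return the reported utility, or 0.0 if the present-day model vetoes."""
    return reported if approves else 0.0


def score_future(agents, future, report, is_valid_continuation):
    """Average utility for a future after applying each agent's veto."""
    total = 0.0
    for agent in agents:
        approves = is_valid_continuation(agent, future)
        total += vetoed_utility(report(agent, future), approves)
    return total / len(agents)


# Invented example: the wireheaded future reports maximal happiness,
# but the present-day model refuses to recognize it as a continuation.
reported = {"wireheaded": 1.0, "ordinary": 0.6}
approved = {"wireheaded": False, "ordinary": True}

scores = {
    future: score_future(
        ["me"], future,
        report=lambda agent, f: reported[f],
        is_valid_continuation=lambda agent, f: approved[f],
    )
    for future in reported
}
print(scores)
```

The wireheaded future scores zero despite its perfect self-report. How `is_valid_continuation` would actually be computed is exactly the part the proposal leaves open.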

OK, I think I see.

So, one can of course get arbitrarily fussy about this sort of thing in not-very-interesting ways, but I guess the core of my question is: why in the world should the judge (AI or whatever) treat its model of me as a black box? What does that add?

For example, if the model of me-as-I-am-now rejects wireheading, the judge presumably knows precisely why it rejects wireheading, in the sense that it knows the mechanisms that lead to that rejection. After all, it created those mechanisms in its model, and is executing them. They aren't mysterious to the judge.

Yes?

So why not just build the judge so that it implements the algorithms humans use and applies them to evaluating various futures? It seems easier than implementing those algorithms as part of a model of humans, extrapolating the perceived experience of those models in various futures, extrapolating the expected replies of those models to questions about that perceived experience, and evaluating the future based on those replies.

I'm not sure why my above post is being downvoted. Anyways, on to your point.

We don't know the mechanisms that're being used to model human beings. They are not necessarily transparently reducible -- or, if they are, the AI may not reduce them into the same components that an introspective human does. In the case of neural networks, they are very powerful at matching the outputs of various systems, but if the programmer is asked to explain why the system did a particular behavior, it is usually not possible to provide a satisfactory explanation. Simply because our AI knows that your model will say 'I don't want to be wireheaded' does not mean that it understands all your reasoning on the subject. Defining utility in regards to the states of arbitrary models is a very hard problem -- simply putting a question to the model is easy.

Can't speak to the voting; I make a point of not voting in discussions I'm in.

And, sure, if it turns out that the mechanisms whereby humans make preference judgments are beyond the judge's ability to analyze at any level beyond lowest-level modeling, then lowest-level modeling is the best it can do. Agreed.

If we can extract utility in a purer fashion, I think we should. At the bare minimum, it would be much more run-time efficient. That said, trying to do so opens up a whole can of worms of really hard problems. This proposal, provided you're careful about how you set it up, pretty much dodges all of that, as far as I can tell. Which means we could implement it faster, should that be necessary. I mean, yes, AGI is still a very hard problem, but I think this reduces the F part of FAI to a manageable level, even given the impoverished understanding we have right now. And, assuming a properly modular code base, it would not be too difficult to swap out 'get utility by asking questions' with 'get utility by analyzing model directly.' Actually, the thing might even do that itself, since it might better maximize its utility function.

I think this reduces the F part of FAI to a manageable level

Well, it replaces it with a more manageable problem, anyway.

More specifically, it replaces the question "what's best for people?" with the question "what would people choose, given a choice?"

Of course, if I'm concerned that those questions might have different answers, I might be reluctant to replace the former with the latter.

Not quite. It actually replaces it with the problem of maximizing people's expected reported life satisfaction. If you wanted to choose to try heroin, this system would be able to look ahead, see that that choice will probably drastically reduce your long-term life satisfaction (more than the annoyance at the intervention), and choose to intervene and stop you.

I'm not convinced 'what's best for people' with no asterisk is a coherent problem description in the first place.

Sure, I accept the correction.
And, sure, I'm not convinced of that either.

(if you can't model arbitrary ordered information systems, you haven't got an AI)

If I replace "AI" with "general-purpose optimizer of sufficient power to be dangerous", are you still sure this statement is true?

Pretty sure. Not completely, but it does seem pretty fundamental. You cannot hard code the operation of the universe into an AI, and that means it has to be able to look at symbols going into the universe and symbols coming out, and say 'okay, what sort of underlying system would produce this behavior'? You can apply the same sort of thing to humans. If it can't model humans effectively, we can probably kill it.

I've certainly considered this - and I'm pretty sure I got the idea from Eliezer_2001. He has some made-up phrase that ends in 'semantics' that means "figure out what makes people do what they do, find the part that looks moral, and do that."

The main trouble with the straight-up interpretation is that humans don't so much have a morality as we have a treasure map for finding morality, and modeling us as utility-maximizers doesn't capture this well. Which over the long term is pretty undesirable - it would be like if the ancient Greeks built an AI and it still had the preconceptions of the ancient Greeks. So either you can pour tons of resources into modeling humans as utility-maximizers, possibly hitting overfitting problems (that is, to actually get a utility function over histories rather than world states, you always get some troublesome utilities for situations humans haven't experienced yet, which have more to do with the model you use than any properties of humans), or you can use a different abstraction. E.g. find some way of representing "treasure map" algorithms where it makes sense to add them together.

take the summation of their utility over time

How do you determine the utility? To be able to compute the utility of a world-state, according to someone's personal utility function, you have to know what their personal utility function is. Can you tell me what your personal utility function is? Or how we would know what it is, even if we had a complete causal model of you?

Determining the utility functions of humans - whether we mean individual humans, or a species-universal utility-function-template, the variables of which get filled in by genetics and life experience - is one of the major subproblems of the usual approach to FAI. So your approach isn't any easier in that regard.

To measure the utility (or at least approximate it), you could just ask the models.

I mean, in this case you're limited by self-deception, but it ought to be a reasonable approximation. I may not know what my personal utility function is, but I do know roughly how satisfied I am with my life right now.

you could just ask the models.

Congratulations, your AI has chosen a future where everyone is kept alive in a vegetative but well-protected state where their response to any stimulus is to say, "I am experiencing a level of happiness which saturates the safe upper bound".

Note the reflexive consistency criterion. That'd only happen if everyone predictable looked at the happy monster and said 'yep, that's me, that agent speaks for me.'

OK... I am provisionally adopting your scheme as a concrete scenario for how a FAI might decide. You need to give this decision procedure a name.

Reflexively Consistent Bounded Utility Maximizer?

Hrm. Doesn't exactly roll off the tongue, does it? Let's just call it a Reflexive Utility Maximizer (RUM), and call it a day. People have raised a few troubling points that I'd like to think more about before anyone takes anything too seriously, though. There may be a better way to do this, although I think something like this could be workable as a fallback plan.

Let me review the features of the algorithm:

  • The FAI maximizes overall utility.
  • It obtains a value for the overall utility of a possible world by adding the personal utilities of everyone in the world. But there is a bound. It's unclear to me whether the bound applies directly to personal utilities - so that a personal utility exceeding the bound is reduced to the bound for the purposes of subsequent calculation - or whether the bound applies to the sum of personal utilities - so that if the overall utility of a possible world exceeds the bound, it is reduced to the bound for the purposes of decision-making (comparison between worlds).
  • If one of the people whose personal utilities gets summed, is a future continuation of an existing person (someone who exists at the time the FAI gets going), then the present-day person gets to say whether that is a future self of which they would approve.

The last part is the most underspecified aspect of the algorithm: how the approval-judgement is obtained, what form it takes, and how it affects the rest of the decision-making calculation. Is the FAI only to consider scenarios where future continuants of existing people are approved continuants, with any scenario containing an unapproved continuant just ruled out a priori? Or are there degrees of approval?

I think I will call my version (which probably deviates from your conception somewhere) a "Bounded Approved Utility Maximizer". It's still a dumb name, but it will have to do until we work our way to a greater level of clarity.

By bounded, I simply meant that all reported utilities are normalized to a universal range before being summed. Put another way, every person has a finite, equal fraction of the machine's utility to distribute among possible future universes. This is entirely to avoid utility monsters. It's basically a vote, and they can split it up however they like.
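One way to read the "equal fraction to distribute" description is as a fixed vote budget per person. The sketch below takes that reading, and additionally assumes reported scores are non-negative and that a person who expresses no preference has their vote spread evenly; the people, futures, and numbers are invented.

```python
def to_vote_shares(raw_scores):
    """Rescale one person's non-negative scores over futures so they sum to 1."""
    total = sum(raw_scores.values())
    if total <= 0:
        # No positive preference expressed: spread the vote evenly (an assumption).
        return {future: 1.0 / len(raw_scores) for future in raw_scores}
    return {future: score / total for future, score in raw_scores.items()}


def tally(all_raw_scores):
    """Sum everyone's vote shares per future."""
    totals = {}
    for raw in all_raw_scores.values():
        for future, share in to_vote_shares(raw).items():
            totals[future] = totals.get(future, 0.0) + share
    return totals


# Invented population: one would-be utility monster and three ordinary people.
people = {
    "monster": {"F1": 1e9, "F2": 1.0},
    "p1": {"F1": 1.0, "F2": 3.0},
    "p2": {"F1": 1.0, "F2": 3.0},
    "p3": {"F1": 1.0, "F2": 3.0},
}
totals = tally(people)
print(max(totals, key=totals.get))
```

Summing the raw scores would let the monster's 1e9 decide everything in favor of F1. With a one-unit budget per person the monster contributes at most 1, and F2 wins.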

Also, the reflexive consistency criterion should probably be applied even to people who don't exist yet. We don't want plans to rely on creating new people, then turning them into happy monsters, even if it doesn't impact the utility of people who already exist. So, basically, modify the reflexive consistency criterion to say that in order for positive utility to be reported from a model, all past versions of that model (to some grain) must agree that they are a valid continuation of themselves.

I'll need to think harder about how to actually implement the approval judgements. It really depends on how detailed the models we're working with are (i.e. capable of realizing that they are a model). I'll give it more thought and get back to you.

how to actually implement the approval judgements

This matters more for initial conditions. A mature "FAI" might be like a cross between an operating system, a decision theory, and a meme, that's present wherever sufficiently advanced cognition occurs; more like a pervasive culture than a centralized agent. Everyone would have a bit of BAUM in their own thought process.