This post considers ambient control in a more abstract setting, where controlled structures are not restricted to being programs. It then introduces a notion of preference, as an axiomatic definition of constant (actual) utility. The notion of preference subsumes possible worlds and utility functions traditionally considered in decision theory.

Followup to: Controlling Constant Programs.

In the previous post I described the sense in which one program without parameters (the agent) can control the output of another program without parameters (the world program). These programs define (compute) constant values, respectively actual action and actual outcome. The agent decides on its action by trying to prove statements of a certain form, the moral arguments, such as [agent()=1 => world()=1000000]. When the time is up, the agent performs the action associated with the moral argument that promises the best outcome, thus making that outcome actual.
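
For concreteness, here is a minimal toy sketch of that setup (an illustration in Python, not the construction from that post; the list of proved moral arguments is hard-coded in place of actual proof search):

    def world():
        # constant world program: runs the agent and maps its action to an outcome
        return {1: 1000000, 2: 1000}[agent()]

    def agent():
        # stand-in for proof search: pairs (X, Y) representing proved statements
        # of the form [agent()=X => world()=Y]
        proved_moral_arguments = [(1, 1000000), (2, 1000)]
        # perform the action whose moral argument promises the best outcome
        best_action, _ = max(proved_moral_arguments, key=lambda pair: pair[1])
        return best_action

    # here agent() returns 1, and world() accordingly evaluates to 1000000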

Let's now move this construction into a more rigorous setting. Consider a first-order language and a theory in that language (defining the way the agent reasons, the kinds of concepts it can understand and the kinds of statements it can prove). This could be a set theory such as ZFC or a theory of arithmetic such as PA. The theory should provide sufficient tools to define recursive functions and/or other necessary concepts. Now, extend that theory by definitions of two constant symbols: A (the actual action) and O (the actual outcome). (The new symbols extend the language, while their definitions, obtained from agent and world programs respectively by standard methods of defining recursively enumerable functions, extend the theory.) With the new definitions, moral arguments don't have to explicitly cite the code of the corresponding programs, and look like this: [A=1 => O=10000].
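
For instance (writing Ag and Wd for the representing formulas; these names are just labels chosen here for readability), the extension could look like this:

    Ag(x): a formula representing agent(), so that agent()=n implies the theory proves Ag(n),
           and agent()<>n implies it proves ~Ag(n); Wd(y) represents world() in the same way.
    New constant symbols: A, O.
    New defining axioms:  Ag(A), Wd(O).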

Truth, provability, and provability by the agent

Given a model, we can ask whether a statement is true in that model. If a statement is true in all models, it's called valid. In first-order logic, all valid statements are also provable by a formal syntactic argument (this is the completeness theorem for first-order logic).

What the agent can prove, however, is different from what is provable from the theory it uses. In principle, the agent could prove everything provable (valid), but it needs to stop at some point and decide what action to perform, and so it is unable to actually prove the rest of the provable statements. This restriction could take any one of many possible forms: a limit on the total number of proof steps used before making a decision; a "time limit" imposed on the proof process that stops it at some point; or a set of statements ("sufficient arguments") such that the process stops as soon as any of them is proved.
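
As a rough illustration of how such stopping rules could be wired together (a hypothetical Python sketch; proof_steps stands in for whatever proof search the agent actually uses, yielding one proved statement per step):

    import time

    def search_for_moral_arguments(proof_steps, max_steps=None, deadline=None, sufficient=()):
        proved = []
        for steps, statement in enumerate(proof_steps):
            if max_steps is not None and steps >= max_steps:
                break                      # limit on the total number of proof steps
            if deadline is not None and time.monotonic() > deadline:
                break                      # external "time limit" stops the proof process
            proved.append(statement)
            if statement in sufficient:
                break                      # a "sufficient argument" was proved
        return proved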

Overall, the agent is able to prove less than is provable (valid). This in particular means that for certain sets of statements that are inconsistent, the agent won't be able to derive a contradiction.

Sense and denotation

A and O are ordinary constant symbols, so for some specific values, say 2 and 1000, it's true that A=2 and O=1000 (more generally, in each model A and O designate two elements). There is little interesting structure to the constants themselves. The agent normally won't even know "explicit values" of actual action and actual outcome. Knowing actual value would break the illusion of consistent consequences: suppose the agent is consistent, knows that A=2, and isn't out of time yet, then it can prove [A=1 => O=100000], even if in fact O=1000, use that moral argument to beat any other with worse promised outcome, and decide A=1, contradiction. Knowing the actual outcome would break the same illusion in two steps, if the agent ever infers an outcome different from the one it knows to hold: suppose the agent is consistent, knows that O=1000, and isn't out of time; then if it proves [A=1 => O=500], it can also prove [A=1 => (O=1000 AND O=500)], hence [(O=1000 AND O=500) => O=100000] and so [A=1 => O=100000], use that to beat any other moral argument, making A=1 true and hence (O=1000 AND O=500) also true, a contradiction.
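
Spelled out, the first of these spurious derivations is just an appeal to "a false premise implies anything":

    1. A=2                  (known to the agent, by assumption)
    2. ~(A=1 AND A=2)       (the constants 1 and 2 are provably distinct)
    3. A=1 => FALSE         (from 1 and 2)
    4. A=1 => O=100000      (from 3, since a false premise implies anything)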

Thus, the agent has to work with indirect definitions of action and outcome, not with action and outcome themselves. For the agent, actual action doesn't describe what the agent is, and actual outcome doesn't describe what the world is, even though moral arguments only mention actual action and actual outcome. Details of the definitions matter, not only what they define.

Abstract worlds

There seems to be no reason for the definition of the outcome O to be given by a program. We can as easily consider a constant symbol O defined by an arbitrary collection of axioms. The agent doesn't specifically simulate the definition of O in order to obtain its specific value (besides, obtaining that value corresponds to the outcome not being controllable by the choice of action); it merely proves things about O. Thus, we can generalize world programs to world concepts: definitions of outcomes that are not programs. Furthermore, if the definition of O is not a program, O itself can be more general than a finite number. Depending on the setting, O's interpretation could be an infinite set, a real number, or generally any mathematical structure.

Surprisingly, the same applies to action. It doesn't matter how the agent thinks about its actual action (defines A), so long as the definition is correct. One way to define the output of a program is by straightforwardly transcribing the program, as when defining a recursively enumerable function, but any other definition of the same value will do, including non-constructive ones.

Possible actions and possible outcomes

By way of axiomatic definitions of A and O, statements of the form [A=X => O=Y] can be proved by the agent. Each such statement defines a possible world Y resulting from a possible action X. X and Y can be thought of as constants, just like A and O, or as formulas that define these constants, so that the moral arguments take the form [X(A) => Y(O)].

The sets of possible actions and possible outcomes need to be defined syntactically: given a set of statements of the form [A=X => O=Y] for various X and Y, the agent needs a way of picking the one with the most preferable Y, and of actually performing the associated X. This is unlike the situation with A and O, where the agent can't just perform action A, since it's not defined in the way the agent knows how to perform (even though A is (provably) equivalent to one of the constants, the agent can't prove that for any given constant).

We can assume that the sets of possible actions and possible outcomes (that is, the formulas syntactically defining them) are given explicitly, and that the moral arguments are statements of the form [A=X => O=Y] where X has to be a possible action and Y a possible outcome (not some other formulas). In this sense, A (as a formula, assuming its definition is finite) can't be a possible action, O won't be a possible outcome in interesting cases, and [A=A => O=O] is not a moral argument.

For each possible action, only one possible world gets defined in this manner. For the possible action that is equal to the actual action (that is, the X such that (A=X) is provable in the agent's theory, although not provable by the agent), the corresponding possible outcome is equal to the actual outcome.
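
A minimal sketch of this syntactic picture (Python, with formulas as plain strings and the set of proved implications simply assumed to be given):

    possible_actions  = ["A=1", "A=2"]                 # formulas the agent can act on directly
    possible_outcomes = ["O=1000", "O=1000000"]        # formulas the agent knows how to rank
    preference_order  = {"O=1000": 0, "O=1000000": 1}  # toy ranking: higher is better

    def choose(proved):
        # keep only genuine moral arguments: possible action => possible outcome;
        # the formula defining A itself is not in possible_actions, so nothing
        # like [A=A => O=O] can ever be selected
        moral_arguments = [(x, y) for (x, y) in proved
                           if x in possible_actions and y in possible_outcomes]
        best_action, _ = max(moral_arguments, key=lambda xy: preference_order[xy[1]])
        return best_action

    # e.g. choose([("A=1", "O=1000000"), ("A=2", "O=1000")]) returns "A=1"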

Possible worlds

Given a set of moral arguments [A=X => O=Y] that the agent managed to prove, consider the set of all possible outcomes that are referred to by these moral arguments. Call such possible outcomes possible worlds (to distinguish them from possible outcomes that are not referred to by any moral argument provable by the agent). Of all possible outcomes, the possible worlds could constitute only a small subset.

This makes it possible for the possible worlds to have more interesting structure than possible outcomes in general: for example, the possible outcomes could be all the integers, while the possible worlds are only the prime ones. Thus, the definitions of A and O define not just the actual outcome O, but a whole collection of possible worlds corresponding to possible actions.

Controlling axiomatic definitions

While the previous post discussed the sense in which the output of a constant program can be controlled, this one describes how to control a given (fixed) axiomatic definition into defining as desirable a mathematical structure as possible. This shows that in principle nothing is exempt from ambient control (since in principle one can give an axiomatic definition to anything); some definitions are just constant with respect to given agents (they generate only one possible world, as defined above).

Determinism is what enables control, but ambient control relies only on "logical" determinism, the process of getting from definition to the defined concept, not on any notion of determinism within the controlled concept (actual outcome) itself. We can thus consider controlling concepts more general than our physical world, including the ones that aren't structured as (partially) deterministic processes.

Preference and utility

Possible outcomes are only used to rank moral arguments by how good the actual outcome O will be if the corresponding possible action is taken. Thus, we have an order defined on the possible outcomes, and the action is chosen to maximize the outcome according to this order. Any other properties of possible outcomes are irrelevant. This suggests directly considering utility values instead of outcomes, and using a utility symbol U instead of outcome symbol O in moral arguments.

As with the actual outcome and its definition, we then have the actual utility and its definition. Since the definition supplies most of the relevant structure, I call the definition of actual utility the preference. Thus, the agent is an axiomatic definition of the actual action A, and the preference is an axiomatic definition of the actual utility U. Both the agent and the preference can be of arbitrary form, so long as they express the decision problem, and the actual utility U could be interpreted as an arbitrary mathematical structure. Moral arguments are statements of the form [A=A1 => U=U1], with A1 a possible action and U1 a possible utility.

Merging axioms

Above, action and utility are defined separately, with axioms that generally don't refer to each other. Axioms that define action don't define utility, and conversely. Moral arguments, on the other hand, define utility in terms of action. If we are sure that one of the moral arguments proved by the agent refers to the actual action (without knowing which one; if we have to choose an actual action based on that set of moral arguments, this condition holds by construction), then actual utility is defined by the axioms of action (the agent) and these moral arguments, without needing preference (axioms of utility).

Thus, once the moral arguments are proved, we can discard the now redundant preference. More generally, the statements that the agent proves characterize the actual action and actual utility together, where their axiomatic definitions characterized them separately. The new statements can be equivalent to the original axioms, allowing the concepts to be represented differently. The point of proving moral arguments is to understand how the actual utility depends on the actual action, and to use that dependence to control the utility.
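
Schematically, writing S for the background theory, AX for the axioms defining A (the agent), UX for the axioms defining U (the preference), and M for the set of proved moral arguments, the claim is roughly:

    S + AX + UX  |-  M        (the agent proves the moral arguments)
    S + AX + M   |-  U=U1     (for the U1 paired in M with the actual action)

so that, once M is in hand, UX can be dropped without losing the characterization of U.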

Utility functions

Let a utility function be a function F such that the agent proves [F(A)=U], and for each possible action X there is a possible utility Z such that the agent can prove [F(X)=Z]. The second requirement makes utility functions essentially encodings of moral arguments; without it, a constant utility function defined by F(-)=U would qualify, but it's not useful to the agent, since it can't reason about U.

Given a utility function F and a possible action X, [A=X => U=F(X)] is a moral argument (provable by the agent). Thus, a utility function generates the whole set of moral arguments, with one possible utility assigned to each possible action. The utility function, restricted to the set of possible actions, is unique: if F and G are two utility functions and X a possible action, then [A=X => (U=F(X) AND U=G(X))], which proves a contradiction in the consequences of a possible action if F and G disagree at X.
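
Spelled out, the uniqueness argument runs:

    1. A=X => U=F(X)                 (F is a utility function, X a possible action)
    2. A=X => U=G(X)                 (G is another utility function)
    3. A=X => (U=F(X) AND U=G(X))    (from 1 and 2)
    4. A=X => FALSE                  (if F(X) and G(X) are provably distinct)

so a provable disagreement between F and G at a possible action would let the agent derive a contradiction from the assumption of that action, which is what the first open problem below asks to rule out.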

Utility functions allow generalizing the notion of moral argument: we no longer need to consider only small sets of possible actions (because only small sets of moral arguments can be proved). Instead, one utility function needs to be found and then optimized. Since the utility function is essentially unique, the problem of finding moral arguments can be recast as the problem of proving properties of the utility function, and more generally decision-making can be seen as maximization of the utility function implicitly defined by the agent program and the preference.
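
In the toy terms of the earlier sketches, the recast decision procedure is just maximization over the agent-provable values of F (here F_values[X] stands for the possible utility Z for which the agent proved [F(X)=Z]):

    def decide(F_values):
        # choose the possible action whose provable utility value is largest
        return max(F_values, key=F_values.get)

    # e.g. decide({"A=1": 1000000, "A=2": 1000}) returns "A=1"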

Note that a utility function is recognized by its value at a single point (the requirement [F(A)=U]), but is uniquely defined on all possible actions. The single-point restriction is given by the actual action and the actual utility, while the rest follows from their axiomatic definitions. Thus, again, most of the structure of the utility function comes from the agent program and the preference, not from the actual action and actual utility.

Connecting this to the discussion of explicit/implicit dependence in the previous post, and to the previous section of this one: the utility function is the expression of the explicit dependence of utility on the agent's action, and a decision problem shouldn't come with this dependence already given. Instead, most of the problem is figuring out this explicit dependence (the utility function) from the separate definitions of action and utility (the agent program and the preference).

Open problems

  • Formalize the limitations on the ability of the agent to prove statements ("agent-provability"). Use this formalization to prove the impossibility of deriving a contradiction from the assumption of a possible action (that is, that [A=X => FALSE] is not agent-provable for a possible action X), and the uniqueness of the utility function.
  • Explore the more general case where the agent is unable to find a utility function, and can only find a set of possible outcomes for some possible actions.
  • Generalize utility functions to dependencies between arbitrary concepts (in the context of an agent's theory and decision-making conditions that limit what can be proved, just as here). Use that generalization to formalize instrumental utility (instrumental concepts).
  • Based on results in the previous item, describe control over other agents in games (through controlling their preference, as opposed to directly controlling their action).
  • (The impossible item.) Given an agent program, define its preference.

This is unlike the situation with A and O, where the agent can't just perform action A, since it's not defined in the way the agent knows how to perform (even though A is (provably) equivalent to one of the constants, the agent can't prove that for any given constant).

It's probably a good idea to maintain the distinction between a constant symbol c and the element v(c) assigned to c by the interpretation's valuation map v into the domain of discourse.

For example, I found the quote above confusing, but I think that you meant the following: "This is unlike the situation with A and O, where the agent can't just perform action v(A), since it's not defined in the way the agent knows how to perform. It is true that we can prove, in the metalanguage, that there exists an action X such that v(A) = X. However, there is no action X such that, for some constant symbol 'X' such that v('X') = X, the agent can prove [A = 'X']."

Not what I meant. The set of possible actions is defined syntactically, as a set of formulas that the agent, from outside its theory, can recognize and directly act upon. Definition of A (as it's syntactically given) is not one of those. Thus, the agent can't perform A directly, the best it can hope for is to find another formula B which defines the same value (in all models) and is a possible action. The agent stops short of this goal in proving a moral argument involving A and B instead, [A=B => U=large], and enacts this moral argument by performing B, which is a possible action (as a formula), unlike A. The agent, however, can't prove [A=B], even though [A=B] is provable in agent's theory (see the first named section).

Not what I meant. The set of possible actions is defined syntactically, as a set of formulas that the agent, from outside its theory, can recognize and directly act upon. Definition of A (as it's syntactically given) is not one of those. Thus, the agent can't perform A directly, the best it can hope for is to find another formula B which defines the same value (in all models) and is a possible action. The agent stops short of this goal in proving a moral argument involving A and B instead, [A=B => U=large], and enacts this moral argument by performing B, which is a possible action (as a formula), unlike A.

This looks to me like an explanation for why my original interpretation of your quote is a true statement. So I'm worried that I'm still misunderstanding you, since you say that my interpretation is not what you meant.

Here is my interpretation again, but in more syntactic terms:

"This is unlike the situation with A and O, where the agent can't just perform action v(A), since it's not defined in the way the agent knows how to perform. It is true that we can prove that, in every interpretation v, there is an action-constant X such that v(A) = v(X). However, there is no action-constant X such that the agent can prove [A = X]."

The rest of your parent comment explains why the symbol A will never appear in the position in moral arguments where action-constant-symbols appear. Is that right?

While I don't disagree with what you are saying in your reformulation (but for insignificant details), it's a different statement from the one I was making. In my own words, you are stating that the agent can't prove A=X for any (constant denoting) possible action X, but I'm not stating that at all: I'm only saying that A itself is not a possible action, that is as a formula is not an element of the set of formulas that are possible actions. I also don't see why you'd want that v(-) thing in this context: the agent performs an action by examining formulas for possible actions "as text strings", not by magically perceiving their semantics.

I also don't see why you'd want that v(-) thing in this context: the agent performs an action by examining formulas for possible actions "as text strings", not by magically perceiving their semantics.

It's how I help myself to keep the map and the territory distinct. v(A), under the standard interpretation, is what the agent does. The constant A, on the other hand, is a symbol that the agent uses in its reasoning, and which isn't even defined in such a way that the agent can directly perform what it represents.

The valuation v is for my benefit, not the agent's. The agent doesn't use or perceive the semantics of its theory. But I do perceive the semantics when I reason about how the agent's reasoning will effect its actions.

v(A), under the standard interpretation,

What does standard interpretation have to do with this? If v(-) maps from formulas to actions, fine, but then A is just a string, so interpretations don't matter.

I think that I'm missing your point. Of course, the interpretation doesn't affect what the agent or its theory can prove. Is that all you're saying?

The reason that I'm led to think in terms of semantics is that your post appeals to properties of the agent that aren't necessarily encoded in the agent's theory. At least, the current post doesn't explicitly say that these properties are encoded in the theory. (Maybe you made it clear how this works in one of your previous posts. I haven't read all of these closely.)

The properties I'm thinking of are (1) the agent's computational constraints and (2) the fact that the agent actually does the action represented by the action-constant that yields the highest computed utility, rather than merely deducing that that constant has the highest computed utility.

For example, you claim that [A=1] must be derivable in the theory if the agent actually does A. The form of your argument, as I understand it, is to note that [A=1] is true in the standard interpretation, and to show that [A=1] is the sort of formula which, if true under one interpretation, must be true in all, so that [A=1] must be a theorem by completeness. I'm still working out why [A=1] is the required kind of formula, but the form of your argument does seem to appeal to a particular interpretation before generalizing to the rest.

For example, you claim that [A=1] must be derivable in the theory if the agent actually does A.

If the agent actually does 1 (I assume you meant to say). I don't see what you are trying to say again. I agree with the last paragraph (you could recast the argument that way), but don't understand the third paragraph.

If the agent actually does 1 (I assume you meant to say).

Whoops. Right.

I don't see what you are trying to say again. I agree with the last paragraph (you could recast the argument that way), but don't understand the third paragraph.

Okay. Let me try to make my point by building on the last paragraph, then. According to my understanding, you start out knowing that v(A) = v(1) for a particular interpretation v. Then you infer that v'(A) = v'(1) for an arbitrary interpretation v'. Part of my reason for using the v(.) symbol is to help myself keep the stages of this argument distinct.

According to my understanding, you start out knowing that v(A) = v(1) for a particular interpretation v.

If v is an interpretation, it maps (all) terms to elements of the corresponding universe, while possible actions are only some formulas, so the associated mapping K would map some formulas to the set of actions (which need not have anything to do with any universe). So, we could say that K(1)=1', but K(A) is undefined. K is not an interpretation.

If v is an interpretation, it maps (all) terms to elements of corresponding universe, while possible actions are only some formulas, . . .

Maybe we're not using the terminology in exactly the same way.

For me, an interpretation of a theory is an ordered pair (D, v), where D is a set (the domain of discourse), and v is a map (the valuation map) satisfying certain conditions. In particular, D is the codomain of v restricted to the constant symbols, so v actually contains everything needed to recover the interpretation. For this reason, I sometimes abuse notation and call v itself the interpretation.

The valuation map v

  • maps constant symbols to elements of D,

  • maps n-ary function symbols to maps from D^n to D,

  • maps n-ary predicate symbols to subsets of D^n,

  • maps sentences of the theory into {T, F}, in a way that satisfies some recursive rules coming from the rules of inference.

Now, in the post, you write

Each such statement defines a possible world Y resulting from a possible action X. X and Y can be thought of as constants, just like A and O, or as formulas that define these constants, so that the moral arguments take the form [X(A) => Y(O)].

(Emphasis added.) I've been working with the bolded option, which I understand to be saying that A and 1 are constant symbols. Hence, given an interpretation (D, v), v(A) and v(1) are elements of D, so we can ask whether they are the same elements.

I agree with everything you wrote here...

I agree with everything you wrote here...

What was your "associated mapping K"? I took it to be what I'm calling the valuation map v. That's the only map that I associate to an interpretation.

K has a very small domain. Say, K("2+2")=K("5")="pull the second lever", K("4") undefined, K("A") undefined. Your v doesn't appear to be similarly restricted.

The new symbols extend the language, while their definitions, obtained from agent and world programs respectively by standard methods of defining recursively enumerable functions, extend the theory.

I haven't yet read beyond this point, but this is a kind of confusing thing to write. Definitions can't extend a theory, because they don't give you new theorems. My assumption is that you will add axioms that incorporate the new symbols, and the axioms will extend the theory.

Definitions can't extend a theory, because they don't give you new theorems.

A conservative extension of a language/theory doesn't introduce new theorems in the old language, but could introduce new theorems that make use of new symbols, although in the case of extension by definitions, all new theorems can also be expressed in the smaller (original) language and would be the theorems of original theory.

Okay, thanks. I didn't know that adding certain kinds of axioms was called "extension by definitions".

Some axioms are definitions.

Previous theorem: All unmarried men are not married.
New definition: "Bachelor" means "unmarried man".
New theorem: All bachelors are unmarried men.

I'm pretty sure that's what he means. Hopefully clarified, if not made perfectly in accord with standard definitions.

I'm pretty sure that's what he means.

I think that he means something analogous to the way that we can add some axioms involving the symbol "+" to the Peano axioms, and then show in second-order logic that the new axioms define addition uniquely.

Consider a first-order language and a theory in that language (defining the way agent reasons, the kinds of concepts it can understand and the kinds of statements it can prove). This could be a set theory such as ZFC or a theory of arithmetic such as PA. The theory should provide sufficient tools to define recursive functions and/or other necessary concepts.

This is unclear to me, and I've read and understood Enderton. I would have thought that ZFC and PA were sets of axioms and would say nothing about how an agent reasons.

Also,

In first-order logic, all valid statements are also provable by a formal syntactic argument.

Do you mean in the context of some axioms? (of course, you can always talk about whether the statement "PA implies X" is valid, so it doesn't really matter).

I haven't read the rest yet. I'm confident that you have a very precise and formally defined idea in mind, but I'd appreciate it if you could spell out your definitions, or link to them (mathworld, wikipedia, or even some textbook).

I would have thought that ZFC and PA were sets of axioms and would say nothing about how an agent reasons.

The other way around: the agent reasons using ZFC or PA. (And not just the sets of axioms, but the associated deductive system, that is, rules for what can be proved and how.)

In first-order logic, all valid statements are also provable by a formal syntactic argument.

I simply mean completeness of first-order logic.

Okay, thanks. I'll certainly read the rest tomorrow :)

An agent that reasoned by proving things in ZFC could exist.

Stupid argument: "This program, computed with this data, produces this result" is a statement in ZFC and is provable or disprovable as appropriate.

Obviously, a real ZFC-based AI would be more efficient than that.

ZFC is nice because Newton's laws, for example, can be formulated in ZFC but aren't computable. A computable agent could reason about those laws using ZFC, for example deriving the conservation of energy, which would allow him to compute certain things.

In first-order logic, all valid statements are also provable by a formal syntactic argument.

Do you mean in the context of some axioms?

A first-order logic comes with a set of axioms by definition.

Continuing from here...

The world() function can be without parameters and can call agent() directly, but it doesn't have to be defined in this way inside the agent's model of the world (the axioms in the first-order language). Instead, the world can be modeled as a function world(X) with X being a variable ranging over possible agents. Then the agent() can find the best possible agent and impersonate it. That is, prove that a decision A is what the best agent would do, and then return A.

This has the benefit that the agent can prove stuff about its true decision without generating contradictions. And also, the utility maximization principle will be embedded within the proof system, without the need for an external rule ("Choose best among moral arguments").
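
For concreteness, a toy version of this suggestion (with the same stand-in numbers as in the sketch near the top of the post, and a hard-coded list of candidate agents) might look like:

    def world_of(candidate):
        # the world as a function of a possible agent, rather than of agent() directly
        return {1: 1000000, 2: 1000}[candidate()]

    def agent():
        candidates = [lambda: 1, lambda: 2]      # "possible agents"
        best = max(candidates, key=world_of)     # find the best possible agent...
        return best()                            # ...and impersonate it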

(Clarified definition of utility function in the first paragraph of that section; the previous definition could be interpreted to allow constant functions defined to be equal to the actual utility for any argument and other difficult-to-reason-about uniqueness-breaking pathological functions.)

Still commenting while reading:

The agent normally won't even know "explicit values" of actual action and actual outcome. Knowing actual value would break the illusion of consistent consequences: suppose the agent is consistent, knows that A=2, and isn't out of time yet, then it can prove [A=1 => O=100000], even if in fact O=1000, use that moral argument to beat any other with worse promised outcome, and decide A=1, contradiction.

This would only happen if the agent had a rule of inference that allowed it to infer from

  • A=1 => O=100000

and

  • all other promised outcomes are worse than 100000

that

  • A = 1.

But why would the first-order theory use such a rule of inference? You seem to have just given an argument for why we shouldn't put this rule of inference into the theory.

ETA: I guess that my point leads right to your conclusion, and explains it. The agent is built so that, upon deducing the first two bullet-points, the agent proceeds to do the action assigned to the constant symbol 1 by the interpretation. But the point is that the agent doesn't bother to infer the third bullet-point; the agent just acts. As a result, it never deduces any formulas of the form [A=X], which is what you were saying.

The agent never proves A=1, but it does (by assumption) prove A=2, while in fact it turns out that the agent acts so that A=1 holds (without proving it), and so in the standard model it's true that A=1, while the agent's theory says that in the standard model A=2, which means that the agent's theory is inconsistent, which contradicts the assumption that it's consistent.

The agent never proves A=1.

Okay. I was confused because you wrote that the agent "can . . . decide A=1", rather than "will do action 1."

Is there a rule of inference in the system such that the first two bullet points above entail the third within the system? I see that [A=1] is true in the standard model, but the agent's theory isn't complete in general. So should we expect that [A=1] is a theorem of the agent's theory? (ETA1: Okay, I think that your comment here explains why [A=1] must be a theorem if the agent actually does 1. But I still need to think it through.)

(ETA2: Now that I better understand the axiom that defines A in the theory, I see why [A=1] must be a theorem if the agent actually does 1.)

Also, it seems that your proof only goes through if the agent "knows that A=2" when the agent will not in fact do action v(2) (which directly contradicts soundness). But if the agent knows [A=X], where v(X) is the actual action, then all we can conclude is that the agent declines to observe my first two bullet-points above in a way that would induce it to do v(1). (Here, v(X) is the interpretation of the constant symbol X under the standard model.)

Is there a rule of inference in the system such that the first two bullet points above entail the third within the system? I see that [A=1] is true in the standard model, but the agent's theory isn't complete in general. So why should we expect that [A=1] is a theorem of the agent's theory?

Every recursive function is representable in Robinson's arithmetic Q, that is for any (say, 1-ary) recursive function F, there is a formula w such that F(n)=m => Q|- w(n,m) and F(n)<>m => Q|- ~w(n,m). Hence, statements like this that hold in the standard model, also hold in any model.

That the agent doesn't avoid looking for further moral arguments to prove is reflected in the "isn't out of time" condition, which is not formalized, and is listed as the first open problem. If in fact A=X, then it can't prove ludicrous moral arguments with A=X as the premise, only the actual utility, but it clearly can prove arguments that beat the true one by using a false premise.

Every recursive function is representable in Robinson's arithmetic Q, that is for any (say, 1-ary) recursive function F, there is a formula w such that F(n)=m => Q|- w(n,m) and F(n)<>m => Q|- ~w(n,m). Hence, statements like this that hold in the standard model, also hold in any model.

I think that I need to see this spelled out more. I take it that, in this case, your formula w is w(A,X) = [A=X]. What is your recursive function F?

We are representing the no-arguments agent program agent() using a formula w with one free variable, such that agent()=n => Q|- w(n) and agent()<>n => Q|- ~w(n). Actual action is defined by the axiom w(A), where A is a new constant symbol.

Okay, I'm convinced. In case it helps someone else, here is how I now understand your argument.

We have an agent written in some program. Because the agent is a computer program, and because our 1st-order logic can handle recursion, we can write down a wff [w(x)], with one free variable x, such that, for any action-constant X, [w(X)] is a theorem if and only if the agent does v(X).

[ETA: Vladimir points out that it isn't handling recursion alone that suffices. Nonetheless, a theory like PA or ZFC is apparently powerful enough to do this. I don't yet understand the details of how this works, but it certainly seems very plausible to me.]

In particular, if the agent goes through a sequence of operations that conclude with the agent doing v(X), then that sequence of operations can be converted systematically into a proof of [w(X)]. Conversely, if [w(X)] is a theorem, then it has a proof that can be reinterpreted as the sequence of operations that the agent will carry out, and which will conclude with the agent doing v(X).

The wff [w(x)] also has the property that, given two constants U and V, [w(U) & w(V)] entails [U = V].

Now, the agent's axiom system includes [w(A)], where A is a constant symbol. Thus, if the agent does v(X), then [w(A)] and [w(X)] are both theorems, so we must have that [A=X] is a theorem.

This works (although "because our 1st-order logic can handle recursion" is not it, etc.). (Note that "[w(X)] is a theorem of T if and only if X is as we need" is weak T-representability, while I cited the strong kind, that also guarantees [~w(X)] being a theorem if X is not as we need.)

(although "because our 1st-order logic can handle recursion" is not it, etc.).

That was there because of this line from your post: "The theory should provide sufficient tools to define recursive functions and/or other necessary concepts."

Doesn't make it an explanatory sufficient condition to conclude what you did: I object to your use of "because".

Okay, thanks. My understanding is definitely vaguest at the point where the agent's program is converted into the wff [w(x)]. Still, the argument is at the point of seeming very plausible to me.

No worries. Your logic seems rusty though, so if you want to build something in this direction, you should probably reread a good textbook.

Not so much rusty as never pursued beyond the basics. The logic I know is mostly from popular books like Gödel, Escher, Bach, plus a philosophy course on modal logic, where I learned the basic concepts used to talk about interpretations of theories.

Above, action and utility are defined separately, with axioms that generally don't refer to each other. Axioms that define action don't define utility, and conversely. Moral arguments, on the other hand, define utility in terms of action. If we are sure that one of the moral arguments proved by the agent refers to the actual action (without knowing which one; if we have to choose an actual action based on that set of moral arguments, this condition holds by construction), then actual utility is defined by the axioms of action (the agent) and these moral arguments, without needing preference (axioms of utility).

I'm a little confused here. Here is my understanding so far:

The agent has an axiom-set S and rules of inference, which together define the agent's theory. Given S and some computational constraints, the agent will deduce a certain set M of moral arguments. The moral arguments contain substrings of the form "U = U1". The largest constant U1* appearing in such substrings of the moral arguments in M is, by definition, the actual utility. The definition of "actual utility" that I just wrote (which is in the metalanguage, not the agent's language) is the preference.

In this sense, preference and actual utility are defined by all the axioms in S taken together. Similarly, the actual action is defined by all of S. So, what are you getting at when you talk about having one set of axioms define action while another set of axioms defines utility?

There is some background theory S the agent reasons with, say ZFC. This theory is extended by definitions to define action A and utility U. Say, these extensions consist of sets of axioms AX and UX. Then, the agent derives the set of moral arguments M from theory S+AX+UX. By preference, I refer specifically to UX, which defines utility U in the context of agent's theory S. But if M is all (moral arguments) the agent will infer, then S+AX+M also defines U, just as well as S+AX+UX did. Thus, at that point, we can forget about UX and use M instead.

Okay, thanks. This is clear.

I'm not sure why you want to think in terms of S+AX+M instead of S+AX+UX, though. Doesn't starting with the axiom set S union AX union UX better reflect how the agent actually reasons?

It does start with S+AX+UX, but it ends with essentially S+AX+M. This allows us to understand the point of this activity better: by changing the original axioms to equivalent ones, the agent expresses the initially separately defined outcome in terms of action, and uses that expression (dependence) to determine the outcome it prefers.

This post doesn't seem to introduce a lot of new concepts so I can't see much to discuss.

Open problems: Your first open problem might not technically always be true, but it doesn't really matter, because, as you pointed out, the statement the agent uses to derive its action is always [...]. I don't see how you think agents can do anything agent-y without utility functions. The instrumental values + agents in games ones look interesting. I agree that the impossible one is impossible. Although it might be the kind of impossible thing you can do, or it might be the kind of impossible thing Gödel proved you can't do.

One thing that's bothering me is that agents in the real world have to use dirty tricks to make problems simpler. Like, two strategies will do the same in 99% of situations, so let's ignore that part and focus on the rest, hmmm calculate calculate calculate this one is better. But when I try to formalize that I lose at Newcomb's problem. So that's an open problem?

(The impossible item.) Given an agent program, define its preference.

If you have the agent's program, you already have a pretty comprehensive model of it. It is harder to infer preferences from behaviour. That's the problem Tim Freeman addressed here.