I don't mind that the post was published without much editing or formatting work, but I find it somewhat unfortunate that it was probably written without any work put into figuring out what other people have written about the topic and what terminology they use.
Recommended reading:
- Daniel Dennett's Intentional stance
- Grokking the intentional stance
- Agents and Devices: A Relative Definition of Agency
@mods, if there were an alignmentforum tier for sketch-grade posts, this would belong there. It seems like there ought to be a level between lesswrong and alignmentforum which is gently vetted but specifically allows low-quality posts.
A question came up: how do you formalize this exactly? How do you separate questions about physical state from questions about utility functions? Perhaps, Audere suggests, you could bound the relative complexity of the utility-function representation versus the simulating perspective?
Also, how do you deal with modeling smaller, boundedly rational agents in an actual formalism? I can recognize that psychologizing is the right perspective for modeling a cat that is failing to walk around a glass wall to get the food on the other side and is instead meowing sadly at the wall, but how do I formalize that? It seems like the Discovering Agents paper still has a lot to tell us about how to do this: https://arxiv.org/pdf/2208.08345.pdf
Still on the call: Audere was saying this builds on Kosoy's definition by trying to patch a hole; I am not quite keeping track of which thing is being patched.
We were discussing this on a call and I said "this is very interesting and more folks on LW should consider this perspective". It came up after a while of working through Discovering Agents, which is a very deep and precise read on causal models and takes a very specific perspective. The perspective in this post is an extension of:
Agents and Devices: A Relative Definition of Agency
According to Dennett, the same system may be described using a 'physical' (mechanical) explanatory stance, or using an 'intentional' (belief- and goal-based) explanatory stance. Humans tend to find the physical stance more helpful for certain systems, such as planets orbiting a star, and the intentional stance for others, such as living animals. We define a formal counterpart of physical and intentional stances within computational theory: a description of a system as either a device, or an agent, with the key difference being that 'devices' are directly described in terms of an input-output mapping, while 'agents' are described in terms of the function they optimise. Bayes' rule can then be applied to calculate the subjective probability of a system being a device or an agent, based only on its behaviour. We illustrate this using the trajectories of an object in a toy grid-world domain.
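To make the abstract's "Bayes' rule over stances" idea concrete, here is a toy sketch. It is not the paper's actual model: the grid-world goal, the move probabilities, and the priors below are all made up for illustration.

```python
# Toy sketch of Bayes' rule over a "device" stance and an "agent" stance,
# given an observed trajectory in a grid world. Not the paper's actual model.
observed_moves = ["right", "right", "up", "right", "up"]
ALL_MOVES = ["up", "down", "left", "right"]

def device_likelihood(moves):
    """'Device' stance: behaviour is a bare input-output mapping with no
    goal-directed structure; modelled here as uniform over actions."""
    return (1.0 / len(ALL_MOVES)) ** len(moves)

def agent_likelihood(moves, goal_bias=0.7):
    """'Agent' stance: behaviour optimises a hypothetical goal of reaching the
    top-right corner, so 'up'/'right' moves get more probability mass."""
    p_toward = goal_bias / 2        # split across the two goal-directed moves
    p_away = (1.0 - goal_bias) / 2  # split across the other two moves
    p = 1.0
    for move in moves:
        p *= p_toward if move in ("up", "right") else p_away
    return p

# Equal prior over the two stances; posterior via Bayes' rule.
prior_device = prior_agent = 0.5
l_device = device_likelihood(observed_moves)
l_agent = agent_likelihood(observed_moves)
posterior_agent = prior_agent * l_agent / (prior_device * l_device + prior_agent * l_agent)
print(f"P(agent | behaviour) = {posterior_agent:.3f}")
```

With this trajectory (all moves toward the hypothetical goal), the posterior shifts noticeably toward the agent stance; a meandering trajectory would shift it back toward the device stance.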
One of the key points @Audere is arguing concerns the amount of information one has about a target: one needs to know more and more about a possible agent to do higher and higher levels of precise modeling. Very interesting. So a key concern we have is the threat from an agent that is able to do a full simulation of other agents. If we could become unpredictable to potentially scary agents, we would be safe, but because we are made of mechanisms we cannot hide, we cannot stay unpredictable indefinitely.
Note: This post was pasted without much editing or work put into formatting. I may come back and make it more presentable at a later date, but the concepts should still hold.
Relative abstracted agency is a framework for considering the extent to which a modeler models a target as an agent, what factors lead a modeler to model a target as an agent, and what sorts of models count as agent-models. The relative abstracted agency of a target, relative to a reasonably efficient modeler, is based on the most effective strategies the modeler uses to model the target, which lie on a spectrum from terminalizing strategies to simulating strategies.
Factors that affect relative abstracted agency of a target:
Relevance of this framework to AI alignment:
Additional thoughts:
(draws on parts of https://carado.moe/predca.html, particularly Kosoy’s model of agenthood)
Suppose there is a correct hypothesis for the world in the form of a non-halting Turing program. Hereafter I’ll simply refer to this as “the world.”
Consider a set of bits of the program at one point in its execution, which I will call the target. This set of bits can also be interpreted as a Cartesian boundary around an agent executing some policy in Vanessa’s framework. We would like to evaluate the degree to which the target is usefully approximated as an agent, relative to some agent, which we will call the modeler, that (instrumentally or terminally) attempts to make accurate predictions under computational constraints using partial information about the world.
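A minimal sketch of this setup, under the simplifying assumption that "the world" can be stood in for by a deterministic step function on a bit tape (the real object is a non-halting Turing program). The toy step rule and target indices below are hypothetical.

```python
# The target is just a chosen set of bit positions read off at one point in
# the execution of the world; everything here is a stand-in for illustration.
from typing import Callable, FrozenSet, Tuple

World = Callable[[Tuple[int, ...]], Tuple[int, ...]]  # one execution step

def run(world: World, state: Tuple[int, ...], steps: int) -> Tuple[int, ...]:
    """Advance the world a fixed number of steps."""
    for _ in range(steps):
        state = world(state)
    return state

def read_target(state: Tuple[int, ...], target_bits: FrozenSet[int]) -> Tuple[int, ...]:
    """The target: the bits at `target_bits` at this point in the execution."""
    return tuple(state[i] for i in sorted(target_bits))

toy_world: World = lambda s: s[1:] + s[:1]  # hypothetical rule: rotate the tape
state_after_3 = run(toy_world, (1, 0, 1, 1, 0, 0), steps=3)
print(read_target(state_after_3, frozenset({0, 1, 2})))
```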
Vanessa Kosoy’s framework outlines a way of evaluating the probability that an agent G has a utility function U, taking into account both the agent’s efficacy at satisfying U and the complexity of U. Consider the utility function with respect to which the target is most Kosoy-agentic. Hereafter I’ll simply refer to this as the target’s utility function.
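As a purely illustrative sketch (not Kosoy's actual definition), here is a crude scoring rule that uses only the two ingredients named above, efficacy at satisfying U and the complexity of U; the candidate utility functions and all numbers are made up.

```python
# Score each candidate utility function by a simplicity prior
# (2^-description_length) times the target's efficacy at satisfying it, and
# call the argmax "the target's utility function". Illustrative only.
candidate_utility_functions = {
    # name: (description_length_bits, efficacy in [0, 1])
    "maximize_food_intake": (12, 0.90),         # short to describe, well satisfied
    "maximize_entropy": (10, 0.20),             # short to describe, poorly satisfied
    "reproduce_exact_trajectory": (200, 1.00),  # perfectly satisfied but very complex
}

def agenticness_score(description_length_bits: float, efficacy: float) -> float:
    """Crude complexity/efficacy trade-off standing in for the real measure."""
    return (2.0 ** -description_length_bits) * efficacy

targets_utility_function = max(
    candidate_utility_functions,
    key=lambda u: agenticness_score(*candidate_utility_functions[u]),
)
print("utility function the target is most agentic with respect to:", targets_utility_function)
```

Note how the "utility function that exactly reproduces the trajectory" loses despite perfect efficacy: its description is too complex, which is the kind of trade-off the post is leaning on.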
Suppose the modeler can choose between gaining 1 bit of information of its choice about the target’s physical state in the world, and gaining 1 bit of information of its choice about the target’s utility function. (Effectively, the modeler can choose between obtaining an accurate answer to a binary question about the target’s physical state, and obtaining an accurate answer to a binary question about the target’s utility function). The modeler, as an agent, should assign some positive amount of utility to each option relative to a null option of gaining no additional information. Let’s call the amount of utility it assigns to the former option SIM and the amount it assigns to the latter option TERM.
A measure of the relative abstracted agency (RAA) of the target, relative to the modeler, is given by TERM/SIM. Small values indicate that the target has little relative abstracted agency, while large values indicate that it has significant relative abstracted agency. The RAA of a rock relative to myself should be less than one, as I expect information about its physical state to be more useful to me than information about its most likely utility function. On the other hand, the RAA of an artificial superintelligence relative to myself should be greater than one, as I expect information about its utility function to be more useful to me than information about its physical state.
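A toy computation of this ratio; the TERM and SIM utilities below are made up to show how the measure behaves for a low-agency and a high-agency target.

```python
# RAA = TERM / SIM, where SIM is the modeler's value for one chosen bit about
# the target's physical state and TERM is its value for one chosen bit about
# the target's utility function. All numbers are hypothetical.
def relative_abstracted_agency(term_utility: float, sim_utility: float) -> float:
    """RAA = TERM / SIM (both assumed positive, per the setup above)."""
    if term_utility <= 0 or sim_utility <= 0:
        raise ValueError("both options are assumed to be worth positive utility")
    return term_utility / sim_utility

hypothetical_targets = {
    # target: (TERM, SIM) from the point of view of a human modeler
    "rock": (0.001, 0.5),                         # physical-state bits far more useful
    "artificial superintelligence": (0.9, 0.05),  # utility-function bits far more useful
}

for name, (term, sim) in hypothetical_targets.items():
    print(f"{name}: RAA = {relative_abstracted_agency(term, sim):.3f}")
```

With these numbers the rock comes out well below one and the superintelligence well above one, matching the intended reading of the ratio.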