Alright, it's been a long hiatus. I may not post again for another year or two if no inspiration strikes me. I will summarize my work so far.

  1. We can define a function over causal networks which describes how strongly a given portion of the network is optimizing some node around some particular world history[1].
  2. This function depends on some partial derivatives of the network state, making it a local function over world histories, i.e. it does not depend on world histories which are "far away" in world-space.[2]
  3. Across one dimension, we can use path integrals to find the "power" of the optimizer, i.e. how much variation it removes from the system.

Math summary

Optimization is written as $\mathrm{Opt}_S(X \to Y; w)$ for "past" node $X$, "future" node $Y$, and a section $S$ of the causal network, evaluated at a world history $w$. It measures the (negative) log of the ratio of two derivatives of $Y(w)$ (i.e. the value of $Y$ in world $w$) with respect to $X$. The first is taken in the "normal" world where $S$ varies, and the second in an imaginary world where $S$ is "frozen" at its value in $w$, unable to respond to infinitesimal changes in $X$.

We can overall write the following:

$$\mathrm{Opt}_S(X \to Y; w) = -\log\left(\frac{\partial Y/\partial X\,\big|_{S\ \text{free}}}{\partial Y/\partial X\,\big|_{S\ \text{frozen at its value in}\ w}}\right)$$

If $S$ is optimizing $Y$ with respect to $X$, we would expect that some (or all) of the changes in $Y$ which are caused by changes in $X$ will be removed; so the derivative when $S$ is allowed to vary will be smaller than the derivative when $S$ is fixed. This means $\mathrm{Opt}_S(X \to Y; w)$ will be positive.
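
To make the definition concrete, here is a minimal numerical sketch for a thermostat-style toy model (the variable names, the linear model, and the constants are all illustrative assumptions, not part of the formal setup): the outside temperature plays the role of $X$, the heater output plays the role of the section $S$, and the room temperature plays the role of $Y$.

```python
import math

# A toy thermostat model (all names and constants here are illustrative assumptions):
#   x = outside temperature, playing the role of the "past" node X
#   s = heater output, playing the role of the optimizing section S
#   y = room temperature, playing the role of the "future" node Y

T_SET = 20.0  # thermostat set point
GAIN = 0.8    # how aggressively the heater responds (0 < GAIN < 1)

def heater(x):
    """The section S: the heater responds to the outside temperature."""
    return GAIN * (T_SET - x)

def room_temp(x, s):
    """The future node Y: room temperature as a function of X and S."""
    return x + s

def optimization_measure(x0, eps=1e-6):
    """-log of the ratio of dY/dX with S free vs. dY/dX with S frozen at its value in w."""
    s0 = heater(x0)  # the value S takes in the world history w

    # dY/dX in the "normal" world, where S is allowed to respond to X
    dy_free = (room_temp(x0 + eps, heater(x0 + eps))
               - room_temp(x0 - eps, heater(x0 - eps))) / (2 * eps)

    # dY/dX in the imaginary world where S is frozen at s0
    dy_frozen = (room_temp(x0 + eps, s0)
                 - room_temp(x0 - eps, s0)) / (2 * eps)

    return -math.log(abs(dy_free) / abs(dy_frozen))

# The heater cancels most of the outside temperature's effect on the room,
# so the free derivative (1 - GAIN = 0.2) is smaller than the frozen one (1),
# and the measure is positive: -log(0.2) ≈ 1.61.
print(optimization_measure(5.0))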

I have a few observations about what this might mean and why it might be important.

Optimization relates to Information

This is a simple calculation: if $S$ doesn't depend on $X$ in any way, then the two derivative terms will be equal, because $S$ won't vary in either of them. The ability of $S$ to optimize $Y$ with respect to $X$ is related to its ability to gather information about $X$.
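
In the notation above, this is just the observation that freezing $S$ changes nothing when $S$ never responds to $X$:

$$\frac{\partial Y}{\partial X}\bigg|_{S\ \text{free}} \;=\; \frac{\partial Y}{\partial X}\bigg|_{S\ \text{frozen}} \quad\Longrightarrow\quad \mathrm{Opt}_S(X \to Y; w) = -\log 1 = 0.$$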

Optimization relates to a Utility-Like Function

For simple systems like the thermostat, it seems like $\mathrm{Opt}$ has high values when the thermostat "gets what it wants". It kind of looks like $\mathrm{Opt}$ across one axis is the second derivative of our utility function, at least within regions of world-space where $S$ has roughly equal power and knowledge.

Optimization relates to Power

This seems pretty intuitively obvious. The more "powerful" the thermostat was in terms of having a stronger heating and cooling unit, the more it was able to optimize the world.

Why is this important?

We already have mathematical proofs that the "knowledge" and "values" of an agent-like thing cannot be disentangled exactly. So if we want a mathematically well-defined measure of agent behaviour, we must take them both at once. 

Secondly, the sorts of histories that $\mathrm{Opt}$ works on are deliberately chosen to be very general, requiring no notion of absolute time and space, in the style of much of John Wentworth's work. A specific case of these networks is the activations of a neural network, so these tools could in theory be applied directly to AI interpretability work.
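
As a rough sketch of what that could look like, here is the same finite-difference recipe from the thermostat example applied to a single hidden activation of a tiny made-up network; the architecture, the weights, and the choice of that activation as the section are all assumptions for illustration, not a claim about how interpretability tooling would actually do this.

```python
import math

# A tiny made-up network with a skip connection, y = W2 * tanh(W1 * x) + W3 * x.
# The hidden activation h = tanh(W1 * x) plays the role of the section S,
# the input x plays the role of X, and the output y plays the role of Y.
# (The skip connection matters: without it, freezing h would zero out dY/dX entirely.)
W1, W2, W3 = 1.5, -0.5, 1.0  # illustrative weights, not from any trained model

def hidden(x):
    return math.tanh(W1 * x)

def output(x, h):
    return W2 * h + W3 * x

def optimization_measure(x0, eps=1e-6):
    """-log of |dY/dX with h free| / |dY/dX with h frozen at its value at x0|."""
    h0 = hidden(x0)
    dy_free = (output(x0 + eps, hidden(x0 + eps))
               - output(x0 - eps, hidden(x0 - eps))) / (2 * eps)
    dy_frozen = (output(x0 + eps, h0)
                 - output(x0 - eps, h0)) / (2 * eps)
    return -math.log(abs(dy_free) / abs(dy_frozen))

# At x0 = 0 the hidden unit cancels most of the input's direct effect on the output
# (free derivative 0.25 vs. frozen derivative 1.0), so the measure is -log(0.25) ≈ 1.39.
print(optimization_measure(0.0))
```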

Worlds of High Optimization Are "Good" Worlds for The Optimizer

Worlds where $\mathrm{Opt}$ is large tend to be "good" for the optimizing region $S$ in question. They seem to correspond to local minima (or at least local Pareto frontiers) of a utility function. They also correspond to worlds where $S$ is both knowledgeable and powerful, and to worlds where $S$ is "in control". Here are a few potential thoughts on making safer AI designs using this concept:

  • Having a mathematically well-defined measure of optimization means it can be hard-coded rather than relying on machine learning.
  • Lots of thought has gone into trying to make AI "low impact", and using $\mathrm{Opt}$ in this way might let us specify this better.
  • If worlds of high $\mathrm{Opt}$ tend to be good for the optimizer in question, then this might provide a route towards encoding things that are good for humans.
  • $\mathrm{Opt}$ can be defined for regions in the AI's past, which makes it harder to reward-hack, or to modify the preferences of the humans in question in order to hack $\mathrm{Opt}$.
  • $\mathrm{Opt}$ is defined locally, but extending it to distributions over worlds is probably trivial.

 

  1. ^

    World history here means a given set of numeric values describing the state of a causal network. For example, if we have the network [Temperature in Celsius] → [State of Water], then the following are examples of world histories: [-10] → [0], [45] → [1], and [120] → [2], where we've represented the state of water as a number: {0: Solid, 1: Liquid, 2: Gas}.

  2. ^

    So if we consider our previous world history examples, the value of our optimization metric at a temperature of 310 Kelvin doesn't depend on the behaviour of the system at 315 Kelvin.
