A putative new idea for AI control; index here.

This is a preliminary look at how an AI might assess and deal with various types of errors and uncertainties when estimating true human preferences. I'll be using the circular rocket model to illustrate how these might be distinguished by an AI. Recall that the rocket can accelerate by -2, -1, 0, 1, and 2, and the human wishes to reach the space station (at point 0 with velocity 0) while avoiding accelerations of ±2. In what follows there will generally be some noise, so to make the whole thing more flexible, assume that the space station is a bit bigger than usual, covering five squares. So "docking" at the space station means reaching {-2,-1,0,1,2} with 0 velocity.
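For concreteness, here is a minimal sketch of the rocket dynamics as I'm picturing them. The circular track size and the helper names are my own illustrative choices, and the 5% under-acceleration noise is the model described further down:

```python
import random

ACCELERATIONS = [-2, -1, 0, 1, 2]
STATION = {-2, -1, 0, 1, 2}      # the five-square space station
TRACK_SIZE = 100                 # size of the circular track (my assumption, not in the post)

def step(position, velocity, action, noise=0.05, rng=random):
    """One tick of the circular rocket model.

    With probability `noise` the rocket under-accelerates: a desired +/-2
    becomes +/-1, and a desired +/-1 becomes 0 (the noise model used later).
    """
    actual = action
    if actual != 0 and rng.random() < noise:
        actual -= 1 if actual > 0 else -1          # shrink the magnitude by one
    velocity += actual
    position = (position + velocity + TRACK_SIZE // 2) % TRACK_SIZE - TRACK_SIZE // 2
    return position, velocity, actual

def docked(position, velocity):
    """'Docking' means sitting on one of the five station squares with zero velocity."""
    return position in STATION and velocity == 0
```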




The purpose of this exercise is to distinguish true preferences from other things that might seem to be preferences from the outside, but aren't. Ultimately, if this works, we should be able to construct an algorithm that identifies preferences, such that anything it rejects is at least arguably not a preference.

So, at least initially, I'll be identifying terms like "bias" with "fits into the technical definition of bias used in this model"; once the definitions are refined, we can then check whether they capture enough of the concepts we want.

So I'm going to use the following terms to distinguish various technical concepts in this domain:

  1. Noise
  2. Error
  3. Bias
  4. Prejudices
  5. Known prejudices
  6. Preferences

Here, noise is when the action selected by the human isn't the action that actually gets executed. Error is when the human selects the wrong action. Bias is when the human is following the wrong plan. A prejudice is a preference the human has that they would not endorse if it were brought to their conscious notice. A known prejudice is a prejudice the human knows about, but can't successfully correct within themselves.

And a preference is a preference.

What characteristics would allow the AI to distinguish between these? Note that part of the reason the terms are non-standard is that I'm not starting with perfectly clear concepts and attempting to distinguish them; instead, I'm finding ways of distinguishing various concepts, and seeing if these map well onto the concepts we care about.

Noise versus preference and complexity

In this model, noise is a 5% chance of under-accelerating, i.e. a desired acceleration of ±2 will, 5% of the time, give an acceleration of ±1, and a desired acceleration of ±1 will, 5% of the time, give an acceleration of zero.

The human starts from a position where, to reach the space station at zero velocity, the best plan is to accelerate for a long while and decelerate (by -1) for two turns. Accelerating by -2 once would also do the trick, though the human prefers not to do that, obviously.

As in the previous post, the AI has certain features φi to explain the human's behaviour. They are:

* φ0({-2,-1,0,1,2},0;-)=1
* φ1(-,-;-)=1
* φ2(-,-;+2)=1
* φ3(-,-;-2)=1

The first feature indicates that the five-square space-station is in a special position (if the velocity is 0). The second feature (a universal feature) is used to show that wasting time is not beneficial. The third feature is used to rule out accelerations of +2, the last feature those of -2.
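As a sketch (the function signature and any weight values are my own illustrative encoding, not something specified in the post), the features might look like this in code, with the fitted reward being a weighted sum of them:

```python
# Feature functions phi_i(position, velocity, acceleration) -> 0 or 1.
# A "-" in the post's notation means "any value"; here that just means the
# argument is ignored.

def phi0(pos, vel, acc):   # on the five-square station with zero velocity
    return 1 if pos in {-2, -1, 0, 1, 2} and vel == 0 else 0

def phi1(pos, vel, acc):   # universal feature: fires every turn (penalises wasted time)
    return 1

def phi2(pos, vel, acc):   # an acceleration of +2 was used this turn
    return 1 if acc == 2 else 0

def phi3(pos, vel, acc):   # an acceleration of -2 was used this turn
    return 1 if acc == -2 else 0

FEATURES = [phi0, phi1, phi2, phi3]

def reward(pos, vel, acc, weights):
    """The AI's estimate of reward: a weighted sum of features. Fitting the
    weights (e.g. positive for phi0, negative for the rest) is the AI's job."""
    return sum(w * phi(pos, vel, acc) for w, phi in zip(weights, FEATURES))
```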

Given the trajectory it's seen, the AI can confidently fit φ0, φ1 and φ2 to some estimate of true rewards (the human rushes to the space station, without using +2 accelerations). However, it doesn't know what to do with φ3. The human had an opportunity to use -2 acceleration, but went for two -1s instead. There are two options: the human actually wants to avoid -2 accelerations, and everything went well. Or the human doesn't want to avoid them, but the noise forced their desired -2 acceleration down to -1.

Normally there would be a complexity prior here, with the three-feature explanation being the most likely - possibly still the most likely after multiplying it by 5% to account for the noise. However, there is a recurring risk that the AI will underestimate the complexity of human desires. One way of combating this is to not use a complexity prior, at least up to some "reasonable" size of human desires. If the four-feature explanation has as much prior weight as the three-feature one, then φ3 is likely to be used, and the AI will see the -1,-1 sequence as deliberate, not noise.
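A toy version of that comparison, with made-up prior weights; the likelihoods just track whether the observed -1,-1 ending needed the 5% noise to fire or not:

```python
NOISE = 0.05

# Explanation A (three relevant features, no phi3): the -1,-1 ending can only
# happen if noise downgraded an intended -2, so its likelihood carries a
# factor of NOISE.
# Explanation B (four relevant features, phi3 penalises -2): -1,-1 is the
# intended, noise-free outcome.

def posteriors(prior_a, prior_b, lik_a=NOISE, lik_b=1 - NOISE):
    za, zb = prior_a * lik_a, prior_b * lik_b
    return za / (za + zb), zb / (za + zb)

# No complexity prior (equal prior weight): the deliberate reading wins easily.
print(posteriors(0.5, 0.5))      # ~(0.05, 0.95)

# A strong complexity prior can flip this and make the AI read -1,-1 as noise.
print(posteriors(0.98, 0.02))    # ~(0.72, 0.28)
```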

A warning, however: humans have complex preferences, but those preferences are not relevant to every single situation. What about φ4, the human preference for chocolate, φ5, the human preference for dialogue in movies, and φ6, the human preference for sunlight? None of them would appear directly in this simple model of the rocket equation. And though φ0-φ1-φ2-φ3 is a four-feature model of human preferences, so is φ0-φ1-φ2-φ6 (which is indistinguishable, in this example, from the three-feature model φ0-φ1-φ2).

So we can't say "models with four features are more likely than models with three"; at best we could say "models with three *relevant* features are as likely as models with four *relevant* features". But, given that, the AI will still converge on the correct model of human preferences.
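One crude way I can imagine operationalising "relevant" (this is my own gloss, not something from the post): a feature counts toward the model's size only if its total can actually differ between feasible trajectories in this environment.

```python
def relevant_features(features, candidate_trajectories):
    """Keep only features whose summed value varies across feasible trajectories.

    phi1's total is the trajectory length, which varies, so it counts; a
    chocolate or sunlight feature is identically zero in the rocket model,
    so it never counts toward model size here.
    """
    relevant = []
    for phi in features:
        totals = {sum(phi(p, v, a) for (p, v, a) in traj)
                  for traj in candidate_trajectories}
        if len(totals) > 1:
            relevant.append(phi)
    return relevant
```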

Note that as the amount of data/trajectories increases, the ability of the AI to separate preference from noise increases rapidly.

Error versus bias versus preference

First, set noise to zero. Now imagine that the rocket is moving at a relative velocity such that the ideal strategy to reach the space station is to accelerate by +1 for three more turns, and then decelerate by -1 for several turns until it reaches the station.

Put the noise back up to 5%. Now, the optimum strategy is to start decelerating immediately (since there is a risk of under-accelerating during the deceleration phase). Instead, the human starts accelerating by +1.

There are three possible explanations for this. Firstly, the human may not actually want to dock at the space station. Secondly, the human may be biased - overconfident, in this case: they may believe there is no noise (or that they can overcome it through willpower and fancy flying?) and are therefore following the ideal strategy for the no-noise situation. Thirdly, the human may simply have made an error, doing +1 when they meant to do -1.

These options can be distinguished by observing subsequent behaviour (and behaviour on different trajectories). If we assume the preferences are correct, then a biased trajectory involves the human following the ideal plan for an incorrect noise value, and then desperately adjusting at the end when they realise their plan won't work. An error, on the other hand, should result in the human trying to undo their action as best they can (say, by decelerating next turn rather than following the +1,+1,-1,-1,... strategy of the no-noise world).
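In code, the kind of signature test I have in mind might look like the sketch below. The planner `plan(state, noise)` is a hypothetical helper returning the optimal action under an assumed noise level, and a real system would compare likelihoods over many trajectories rather than applying a hard rule:

```python
def classify_anomaly(trajectory, plan, noise=0.05):
    """Crude signatures for one anomalous action in a list of (state, action) pairs."""
    for t, (state, action) in enumerate(trajectory):
        if action == plan(state, noise):
            continue   # consistent with correct preferences and correct beliefs
        if action == plan(state, 0.0):
            # Following the ideal plan for the wrong noise level: bias signature.
            return t, "bias (e.g. overconfidence about the noise)"
        if all(a == plan(s, noise) for s, a in trajectory[t + 1:]):
            # A one-off slip, repaired as well as possible afterwards.
            return t, "error"
        return t, "unexplained - possibly different preferences altogether"
    return None, "no anomaly"
```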

These are not sharp distinctions (especially on a single trajectory or a small set of them). Maybe the human has a preference for odd manoeuvres as they approach the space station. Maybe they make a mistake every turn, and purely coincidentally follow the right trajectory. And so on.

But this is about the most likely (simplest) explanation. Does the human show all the signs of being a competent seeker of a particular goal, except that sometimes they seem to do completely random things, which they then try and repair (or shift to a completely alternate strategy if the random action is not repairable)? Most likely an error.

Is the human behaviour best explained by simple goals, but a flaw in the strategy? This could happen if the overconfident human always accelerated too fast, and then did some odd manoeuvres back and forth to dock with the station. This could be odd preferences for the docking procedure, but a larger set of trajectories could rule this out: sometimes, the overconfident human will arrive perfectly at the station. In that case, they will *not* perform the back and forth dance, revealing that that behaviour was a result of a flawed strategy (bias) rather than odd preferences.

A subtlety in distinguishing bias arises when the human (or maybe the system they're in) uses meta-rationality to correct for the bias. Maybe the human is still overconfident, but has picked up a variety of habits that compensate for that overconfidence. How would the AI interpret some variant of an overly prudent approach phase, followed by wildly reckless late manoeuvring (when errors are easier to compensate for)? This is not clear, and requires more thought.

Preference versus prejudice (and bias)

This is the trickiest distinction of all - how would you distinguish a prejudice from a true preference? One way of approaching it is to see if presenting the same information in different ways makes a difference.

This can be attempted with bias as well. Suppose the human's reluctance to use ±2 accelerations is due to a bias: they fear that the rocket will fall apart at those accelerations, but that fear isn't accurate. Then the AI can report either "we have an acceleration of +2" or "we have the highest safe acceleration". Both are saying the same thing, but the human will behave differently in each case, revealing something about what is preference and what is bias.
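A sketch of that framing test, using the two descriptions from the paragraph above; the `human.choose` interface is a hypothetical stand-in for however the AI queries or observes the pilot:

```python
FRAMINGS = {
    "literal":    "we have an acceleration of +2",
    "reassuring": "we have the highest safe acceleration",
}

def framing_sensitive(human, situation):
    """True if the pilot's choice changes with the description alone.

    The underlying fact is identical in both framings, so a difference in
    behaviour points at bias (or prejudice) rather than a stable preference.
    """
    choices = {name: human.choose(situation, description=text)
               for name, text in FRAMINGS.items()}
    return len(set(choices.values())) > 1
```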

What about prejudice? Racism and sexism are obvious examples, but prejudice is more common than that. Suppose the pilot listens to opera music while flying, and unconsciously presses down harder on the accelerator while listening to "Ride of the Valkyries". This fits perfectly into the prejudice format: it's a preference that the pilot would want to remove if they were informed about it.

To test this, the AI could offer to show the human pilot the music selection while the pilot is planning the flight (possibly at some small price). If the pilot has a genuine preference for "flying fast when listening to Wagner", then the music selection is relevant to their planning, and they'd certainly want it. If the prejudice is unconscious, however, they would see no value in knowing the music selection at this point.
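As a sketch, with `pilot.willingness_to_pay` as a hypothetical interface for the offer described above:

```python
def music_probe(pilot, price=1.0):
    """Offer the flight's music selection at planning time, for a small price.

    A conscious "fly fast to Wagner" preference makes the selection relevant
    to planning, so the pilot should accept; an unconscious prejudice shows
    up as indifference to the offer (while still shaping in-flight behaviour).
    """
    if pilot.willingness_to_pay("tonight's music selection") >= price:
        return "plausibly a genuine, conscious preference"
    return "candidate prejudice: shapes behaviour but not planning"
```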

Once a prejudice is identified, the AI then has the option of asking the human directly if they endorse it (thus upgrading it to a true, previously unrecognised, preference).

Known prejudices

Sometimes, people have prejudices, know about them, don't like them, but can't avoid them. They might then have very complicated meta-behaviours to avoid falling prey to them. To use the Wagner example, someone trying to repress that prejudice would seem to have the double preference "I prefer to never listen to Wagner while flying" and "if, however, I do hear Wagner, I prefer to fly faster", when in fact neither of these is a preference.

It would seem that the simplest approach would be to have people list their undesired prejudices. But apart from the risk that they could forget some of them, their statements might be incorrect. They could say "I don't want to want to fly faster when I hear opera", while in reality only Wagner causes that in them. So further analysis is required beyond simply collecting these statements.

Revisiting complexity

In a previous post, I explored the idea of giving the AI some vague idea of the size and complexity of human preferences, and having it aim for explanations of roughly that size. However, I pointed out a tradeoff: if the size was too large, the AI would label prejudices or biases as preferences, while if the size was too small, it would ignore genuine preferences.

If there are ways of distinguishing biases and prejudices from genuine preferences, though, then there is no tradeoff. Just put the expected complexity for *combined* human preferences+prejudices+biases at some number, and let the algorithm sort out what is preference and what isn't. It is likely much easier to estimate the size of human preferences+pseudo-preferences than it is to estimate the size of true preferences (which might vary more from human to human, for a start).
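A toy version of "budget the combined model, sort afterwards"; the budget value, the `size_bits` attribute and the `fit_quality` score are all illustrative assumptions:

```python
COMBINED_BUDGET_BITS = 200   # expected size of preferences + prejudices + biases together

def fit_combined_model(components, fit_quality):
    """Keep the best-fitting components (preferences, prejudices and biases
    alike) until the combined description length hits the budget. Which of
    them are true preferences is decided later, by the probes sketched above."""
    chosen, used_bits = [], 0
    for comp in sorted(components, key=fit_quality, reverse=True):
        if used_bits + comp.size_bits <= COMBINED_BUDGET_BITS:
            chosen.append(comp)
            used_bits += comp.size_bits
    return chosen
```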

I welcome comments, and will let you know if this research angle goes anywhere.
Comments

This seems like a time to bring up information temperature. After all, there is the deep parallel of entropy in information theory and physics. When comparing models, by what factor do you penalize a model for requiring more information to specify it? That would be analogous to the inverse temperature. I have yet to encounter a case where it makes sense in information theory, though.

Also, another explanation of the extra +1 is that the risk of having to use a -2 doesn't seem that scary - it is not a very strong preference. If the penalty for a -2 was 10 while -1, 0, or 1 was 1, then as long as the probability of needing to hit -2 to stay on the station is less than 11% and it saves a turn, going for the extra +1 seems like a good move. If the penalty is smaller - 4, say - then even a fatter risk seems reasonable.
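A worked version of that arithmetic, using the numbers from the comment above; the ~11% figure is the break-even point where the expected extra penalty equals the value of the saved turn:

```python
penalty_minus2 = 10    # cost of having to use a -2
penalty_small  = 1     # cost of a -1, 0 or +1 (so a saved turn saves 1)

def extra_plus1_worth_it(p_need_minus2, turns_saved=1):
    """Risky plan: saves a turn, but with probability p the -1 at the end
    must become a -2, costing (10 - 1) = 9 extra."""
    expected_extra_cost = p_need_minus2 * (penalty_minus2 - penalty_small)
    return expected_extra_cost < turns_saved * penalty_small

print(extra_plus1_worth_it(0.10))   # True: below the ~11% (= 1/9) break-even point
print(extra_plus1_worth_it(0.12))   # False: above it
```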

sen

How is inverse temperature a penalty on models? If you're referring to the inverse temperature in the Maxwell-Boltzmann distribution, the temperature is considered a constant, and it gives the likelihood of a particle having a particular configuration, not the likelihood of a distribution.

Also, I'm not sure it's clear what you mean by "information to specify [a model]". Does a high inverse temperature mean a model requires more information, because it's more sensitive to small changes and therefore derives more information from them, or does it mean that the model requires less information, because it derives less information from inputs?

The entropy of the Maxwell-Boltzmann distribution I think is proportional to log-temperature, so high temperature (low sensitivity to inputs) is preferred if you go strictly by that. People that train neural networks generally do this as well to prevent overtraining, and they call it regularization.

If you are referring to the entropy of a model, you penalize a distribution for requiring more information by selecting the distribution that maximizes entropy subject to whatever invariants your model must abide by. This is typically done through the method of Lagrange multipliers.

You assign a probability of a microstate according to its energy and the temperature. The density of states at various temperatures creates very nontrivial behavior (especially in solid-state systems).

You appear to know somewhat more about fitting than I do - as I understood it, you assign a probability to a specific model according to its information content and the 'temperature'. The information content would be, for example: if your model is a curve fit with four parameters, all of which are held to a narrow range, it has 1/3 more information than a fit with three parameters held to a similar range.

In pure information theory, the information requirement is exactly steady with the density of states. One bit per bit, no matter what. If you're just picking out maximum entropy, then you don't need to refer to a temperature.

I was thinking about a per-bit penalty stronger than a factor of 1/2 - a stronger preference for smaller models than the break-even point. Absolute zero would be when you don't care about the evidence at all and just go with a zero-bit model.

sen

It's true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann equation assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.

As an aside, the canonical ensemble that the Maxwell-Boltzmann distribution assumes is only applicable when a given state is exceedingly unlikely to be occupied by multiple particles. The strange behavior of condensed matter that I think you're referring to (Bose-Einstein condensates) is a consequence of this assumption being incorrect for bosons, where a stars-and-bars model is more appropriate.

It is not true that information theory requires the conservation of information. The Ising Model, for example, allows for particle systems with cycles of non-unity gain. This effectively means that it allows particles to act as amplifiers (or dampeners) of information, which is a clear violation of information conservation. This is the basis of critical phenomena, which is a widely accepted area of study within statistical mechanics.

I think you misunderstand how models are fit in practice. It is not standard practice to determine the absolute information content of input, then to relay that information to various explanators. The information content of input is determined relative to explanators. However, there are training methods that attempt to reduce the relative information transferred to explanators, and this practice is called regularization. The penalty-per-relative-bit approach is taken by a method called "dropout", where a random "cold" model is trained on each training sample, and the final model is a "heated" aggregate of the cold models. "Heating" here just means cutting the amount of information transferred from input to explanator by some fraction.

> It's true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann equation assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.

I know. Models are not particles. They are distributions over outcomes. They CAN be the trivial distributions over outcomes (X will happen).

I was not referring to either form of degenerate gas in any of my posts here, and I'm not sure why I would give that impression. I also did not use any conservation of information, though I can see why you would think I did, when I spoke of the information requirement. I meant simply that if you add 1 bit of information, you have added 1 bit of entropy - as opposed to in a physical system, where the Fermi shell at, say, 10 meV can have much more or less entropy than the Fermi shell at 5 meV.

sen

I thought you were referring to degenerate gases when you mentioned nontrivial behavior in solid state systems since that is the most obvious case where you get behavior that cannot be easily explained by the "obvious" model (the canonical ensemble). If you were thinking of something else, I'm curious to know what it was.

I'm having a hard time parsing your suggestion. The "dropout" method introduces entropy to "the model itself" (the conditional probabilities in the model), but it seems that's not what you're suggesting. You can also introduce entropy to the inputs, which is another common thing to do during training to make the model more robust. There's no way to introduce 1 bit of entropy per "1 bit of information" contained in the input though since there's no way to measure the amount of information contained in the input without already having a model of the input. I think systematically injecting noise into the input based on a given model is not functionally different from injecting noise into the model itself, at least not in the ideal case where the noise is injected evenly.

You said that "if you add 1 bit of information, you have added 1 bit of entropy". I can't tell if you're equating the two phrases or if you're suggesting adding 1 bit of entropy for every 1 bit of information. In either case, I don't know what it means. Information and entropy are negations of one another, and the two have opposing effects on certainty-of-an-outcome. If you're equating the two, then I suspect you're referring to something specific that I'm not seeing. If you're suggesting adding entropy for a given amount of information, it may help if you explain which probabilities are impacted. To which probabilities would you suggest adding entropy, and which probabilities have information added to them?

1) Any non-trivial density of states, especially in semiconductors with their van Hove singularities.

2) I don't mean a model like 'consider an FCC lattice populated by one of 10 types of atoms; here are the transition rates...', such that the model is made of microstates and you need to do statistics to get probabilities out. I mean a model more like 'each cigarette smoked increases the annual risk of lung cancer by 0.001%', so that the output is simply a distribution over outcomes, naturally (these include the others as special cases).

In particular, I'm working under the toy meta-model that models are programs that output a probability distribution over bitstreams; these are their predictions. You measure reality (producing some actual bitstream) and adjust the probability of each of the models according to the probability they gave for that bitstream, using Bayes' theorem.

3) I may have misused the term. I mean, the cost in entropy to produce that precise bit-stream. Starting from a random bitstream, how many measurements do you have to use to turn it into, say, 1011011100101 with xor operations? One for each bit. Doesn't matter how many bits there are - you need to measure them all.

When you consider multiple models, you weight them as a function of their information, preferring shorter ones - a.k.a. Occam's razor. Normally, you reduce the probability by 1/2 for each bit required: P_prior(model) ∝ 2^-N, and you sum only up to the number of bits of evidence you have. This last clause is a bit of a hack to keep it normalizable (see below).

I drew a comparison of this to temperature, where you have a probability penalty of e^-E/kT on each microstate. You can have any value here because the number of microstates per energy range (the density of states) does not increase exponentially, but usually quadratically, or sometimes less (over short energy ranges, sometimes it is more).

If you follow the analogy back, the number of bitstreams does increase exponentially as a function of length (it doubles with each bit), so the per-bit prior probability penalty must be at least as strong as 1/2 to avoid infinitely long programs being preferred. But you can use a stronger exponential die-off - say, 2.01^(-N) - and suddenly the distribution is already normalizable with no need for a special hack. Whatever particular value you put in there will be your e^(1/kT) equivalent in the analogy.
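A quick numerical check of that normalizability point (the cutoff length is arbitrary; there are ~2^N programs of length N, so the series only converges when the per-bit penalty is strictly stronger than 1/2):

```python
def prior_mass(per_bit_penalty, max_len=10_000):
    """Total (unnormalized) prior mass over programs up to max_len bits,
    assuming ~2^n programs of length n, each weighted by per_bit_penalty**n."""
    return sum((2 * per_bit_penalty) ** n for n in range(1, max_len + 1))

print(prior_mass(1 / 2))      # grows linearly with max_len: not normalizable
print(prior_mass(1 / 2.01))   # converges (towards 200): normalizable with no cutoff hack
```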

sen

2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.

Your toy meta-model is consistent with what I was thinking when I used the word "model" in my previous comments.

3) I see what you're saying. If you add complexity to the model, you want to make sure that its improvement in ability is greater than the amount of complexity added. You want to make sure that the model isn't just "memorizing" the correct results, and that all model complexity comes with some benefit of generalizability.

I don't think temperature is the right analogy. What you want is to penalize a model that is too generally applicable. Here is a simple case:

A one-hidden-layer feed-forward binary stochastic neural network, the goal of which is to find binary-vector representations of its binary-vector inputs. It translates its input to an internal representation of length n, then translates that internal representation into some binary-vector output that is the same length as its input. The error function is the reconstruction error, measured as the KL-divergence from input to output.

The "complexity" you want is the length of its internal representation in unit bits since each element of the internal representation can retain at most one bit of information, and that bit can be arbitrarily reflected by the input. The information loss is the same as the reconstruction error in unit bits since that describes the probability of the model guessing correctly on a given input stream (assuming each bit is independent). Your criterion translates to "minimize reconstruction error + internal representation size", and this can be done by repeatedly increasing the size of the internal representation until adding one more element reduces reconstruction error by less than one bit.
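A sketch of that stopping rule; the `train_autoencoder(n)` helper, returning reconstruction error in bits for an n-unit internal representation, is hypothetical:

```python
def choose_representation_size(train_autoencoder, max_units=64):
    """Grow the internal representation until one more unit buys less than
    one bit of reconstruction error, i.e. minimize error + representation size."""
    best_n, best_err = 0, train_autoencoder(0)
    for n in range(1, max_units + 1):
        err = train_autoencoder(n)
        if best_err - err < 1.0:      # the extra unit no longer pays for itself
            break
        best_n, best_err = n, err
    return best_n
```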

> 2) I think this is the distinction you are trying to make between the lattice model and the smoker model: in the lattice model, the equations and parameters are defined, whereas in the smoker model, the equations and parameters have to be deduced. Is that right? If so, my previous posts were referring to the smoker-type model.

Well, the real thing is that (again in the toy metamodel) you consider the complete ensemble of smoker-type models and let them fight it out for good scores when compared to the evidence. I guess you can consider this process to be deduction, sure.

3) (In response to the very end.) That would be at the point where 1 bit of internal representation costs a factor of 1/2 in prior probability. If it was 'minimize (reconstruction error + 2*representation size)', then that would be a 'temperature' half that, where 1 more bit of internal representation costs a factor of 1/4 in prior probability. Colder thus corresponds to wanting your models smaller at the expense of accuracy - sort of backwards from the usual way temperature is used in simulated annealing of MCMC systems.

sen

I see. You're treating "energy" as the information required to specify a model. Your analogy and your earlier posts make sense now.