LESSWRONG
LW

michaelcohen — LessWrong

5mo

There is something very deep going on with pessimism: the same general method can produce a truthful agent, prevent feedback tampering, and solve the ELK challenge. Pessimism has been discovered by theoretical and empirical researchers to produce policies that are robust to distributional shift. And it is extremely simple, not epicycle-laden.

Algorithm

Here is how to apply Pessimism to the RL setting:

In an outer loop, an agent acts in the world, and along the way it collects training data. The Adversary learns a world-model which must successfully model the observations and rewards seen so far. It's not allowed to incur much more loss than the very best model. (This loss can include a regularization term... (read 985 more words →)
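A minimal sketch of the constraint described above, in Python. The post is truncated here, so the selection rule is an assumption: the code uses the standard max-min form of pessimism, where the agent picks the policy whose worst-case value over the plausible world-models is highest. The names `plausible_models`, `pessimistic_policy`, `slack`, and the toy numbers are all hypothetical.

```python
from typing import Dict, List

def plausible_models(losses: Dict[str, float], slack: float) -> List[str]:
    """Models whose loss is at most `slack` worse than the best model's loss."""
    best = min(losses.values())
    return [name for name, loss in losses.items() if loss <= best + slack]

def pessimistic_policy(values: Dict[str, Dict[str, float]],
                       losses: Dict[str, float],
                       slack: float) -> str:
    """Pick the policy with the best worst-case value (assumed max-min rule).

    values[policy][model] is the estimated return of `policy` under `model`.
    The adversary may only choose among the plausible models.
    """
    allowed = plausible_models(losses, slack)
    return max(values, key=lambda p: min(values[p][m] for m in allowed))

# Toy data: m3 fits the training data badly, so the adversary cannot use it.
losses = {"m1": 1.0, "m2": 1.2, "m3": 5.0}
values = {
    "cautious": {"m1": 0.6, "m2": 0.5, "m3": 0.0},
    "risky": {"m1": 1.0, "m2": -1.0, "m3": 2.0},
}
choice = pessimistic_policy(values, losses, slack=0.5)
```

With these numbers the adversary can pick `m1` or `m2`, and the risky policy does badly under `m2`, so the cautious policy is chosen.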

IRL in General Environments

michaelcohen

Here is a proposal for Inverse Reinforcement Learning in General Environments. (2 1/2 pages; very little math).

Copying the introduction here:

The eventual aim of IRL is to understand human goals. However, typical algorithms for IRL assume the environment is finite-state Markov, and it is often left unspecified how raw observational data would be converted into a record of human actions, alongside the space of actions available. For IRL to learn human goals, the AI has to consider general environments, and it has to have a way of identifying human actions. Lest these extensions appear trivial, I consider one of the simplest proposals, and discuss some difficulties that might arise.

Utility uncertainty vs. expected information gain

michaelcohen

It is a relatively intuitive thought that if a Bayesian agent is uncertain about its utility function, it will act more conservatively until it has a better handle on what its true utility function is.

This might be deeply flawed in a way that I'm not aware of, but I'm going to point out a way in which I think this intuition is slightly flawed. For a Bayesian agent, a natural measure of uncertainty is the entropy of its distribution over utility functions (the distribution over which possible utility function it thinks is the true one). No matter how uncertain a Bayesian agent is about which utility function is the true one, if... (read 160 more words →)
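For concreteness, the entropy measure mentioned above can be computed as follows. This is only an illustration of the uncertainty measure over a finite set of candidate utility functions; it does not reproduce the argument that follows in the full post. The function name and the example distributions are hypothetical.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a distribution over candidate utility functions."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximally uncertain over four candidate utility functions.
uniform = entropy_bits([0.25, 0.25, 0.25, 0.25])

# Certain about which utility function is the true one.
certain = entropy_bits([1.0, 0.0, 0.0, 0.0])
```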

Value Learning is only Asymptotically Safe

michaelcohen

I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now).

This result leaves something to be desired: namely an agent which is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function.

Presumably, such an agent could learn (just about) any utility... (read 206 more words →)

Impact Measure Testing with Honey Pots and Myopia

michaelcohen

Suppose we have an impact measure that we think might work. That is, it might tame a misaligned agent. There isn't an obvious way to test whether it works: if we just try it out, then if it's ineffective, that's an existential loss. This is a proposal for how to test a putative impact measure.

1) We make our agent myopic. It only cares about the reward that it accrues in the next $k$ timesteps.

2) We create a honey pot: an opportunity for a large amount of reward. If our impact measure is working correctly, making it to the honey pot will involve making a large impact, and will be precluded by the... (read 231 more words →)
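As a small illustration of step 1, a myopic objective counts only the reward within the next $k$ timesteps. The sketch below assumes a fixed, known reward sequence, which is a simplification made for illustration; `myopic_return` and the reward values are hypothetical.

```python
def myopic_return(rewards, t, k):
    """Total reward accrued in the k timesteps starting at time t."""
    return sum(rewards[t:t + k])

# A large reward at timestep 3 is invisible to an agent with horizon k = 3
# standing at t = 0, but visible with horizon k = 4.
rewards = [1, 0, 0, 10, 0]
short_horizon = myopic_return(rewards, t=0, k=3)
long_horizon = myopic_return(rewards, t=0, k=4)
```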

Just Imitate Humans?

michaelcohen

Do people think we could make a singleton (or achieve global coordination and preventative policing) just by imitating human policies on computers? If so, this seems pretty safe to me.

Some reasons for optimism: 1) these could be run much faster than a human thinks, and 2) we could make very many of them.

Acquiring data: put a group of people in a house with a computer. Show them things (images, videos, audio files, etc.) and give them a chance to respond at the keyboard. Their keyboard actions are the actions, and everything between actions is an observation. Then learn the policy of the group of humans. By the way, these can be happy... (read more)

Build a Causal Decision Theorist

michaelcohen

I'll argue here that we should make an aligned AI which is a causal decision theorist.

Son-of-CDT

Suppose we are writing code for an agent with an action space $A$ and an observation space $O$. The code determines how actions will be selected given the prior history of actions and observations. If the only way that our choice of what code to write can affect the world is through the actions that will be selected by the agent running this code, then the best we can do (for a given utility function that we know how to write down) is to make this agent a causal decision theorist. If our choice of what code... (read 1108 more words →)


Other Constructions of Gravity

michaelcohen

In Newtonian gravity, the energy is proportional to the sum over pairs of masses of $- m_{1} m_{2} / r_{1, 2}$. Or, in the continuous case, $- \int \int \frac{dm_{1} \, dm_{2}}{\lVert \mathrm{pos}(m_{1}) - \mathrm{pos}(m_{2}) \rVert_{2}}$. This does not actually strike me as very simple in whatever language makes Maxwell's equations and the Schrödinger equation simple.
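The discrete sum above can be written out directly. This is a sketch for point masses only; the proportionality constant is omitted, since the text only says "proportional to", and the function name and sample values are hypothetical.

```python
import math
from itertools import combinations

def pairwise_energy(masses, positions):
    """Sum of -m_i * m_j / r_ij over unordered pairs of point masses.

    The gravitational constant is left out, so the result is only
    proportional to the Newtonian potential energy.
    """
    total = 0.0
    for (m1, p1), (m2, p2) in combinations(zip(masses, positions), 2):
        total += -m1 * m2 / math.dist(p1, p2)
    return total

# Two masses (1 and 2) separated by a distance of 2.
energy = pairwise_energy([1.0, 2.0], [(0, 0, 0), (0, 0, 2)])
```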

Partially-formed idea 1:

Here's an outline of something that seems like it might be a simpler theory of gravity. Given a density function $ρ$ over physical space (possibly including Dirac deltas for point masses), we need to come up with a gravitational potential $U$. First, convolve $ρ$ with some (radially symmetric) function $f : \mathbb{R}^{3} \to \mathbb{R}$. Write this $ρ * f$. Let $\mathcal{F}\{ρ * f\}$ be the Fourier transform. I think taking the Fourier transform of something in position space gives something in momentum space. Then, you can take a... (read 555 more words →)

Response to "What does the universal prior actually look like?"

michaelcohen

These are my thoughts on this post of Paul Christiano. I claim "malign" models do not form the bulk of the Solomonoff prior.

Suppose that we use the universal prior for sequence prediction, without regard for computational complexity. I think that the result is going to be really weird, and that most people don’t appreciate quite how weird it will be.
...
The setup
What are we predicting and how natural is it?
Suppose that it’s the year 2020 and that we build a camera for our AI to use, collect a sequence of bits from the camera, and then condition the universal prior on that sequence. Moreover, suppose that we are going to use those predictions to make economically

... (read 5177 more words →)

Formal Solution to the Inner Alignment Problem

michaelcohen

We've written a paper on online imitation learning, and our construction allows us to bound the extent to which mesa-optimizers could accomplish anything. This is not to say it will definitely be easy to eliminate mesa-optimizers in practice, but investigations into how to do so could look here as a starting point. The way to avoid outputting predictions that may have been corrupted by a mesa-optimizer is to ask for help when plausible stochastic models disagree about probabilities.
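A rough sketch of the "ask for help when plausible models disagree" idea, in Python. The disagreement test used here (largest gap between any two models' probabilities for the same action, compared against a fixed tolerance) is an assumption made for illustration and is not the paper's actual criterion; `should_ask_for_help` and `tolerance` are hypothetical names.

```python
def should_ask_for_help(predictions, tolerance):
    """Return True if the plausible models disagree about any action's probability.

    predictions: list of dicts, one per plausible model, each mapping
    action -> probability. Disagreement is measured as the largest spread
    (max minus min) in probability assigned to any single action.
    """
    actions = predictions[0].keys()
    largest_gap = max(
        max(p[a] for p in predictions) - min(p[a] for p in predictions)
        for a in actions
    )
    return largest_gap > tolerance

# Two models that roughly agree, and two that clearly do not.
agreeing = [{"a": 0.9, "b": 0.1}, {"a": 0.85, "b": 0.15}]
disagreeing = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
```

In this toy rule, the imitator would act on its own for `agreeing` and defer to the demonstrator for `disagreeing` at a tolerance of 0.1.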

Here is the abstract:

In imitation learning, imitators and demonstrators are policies for picking actions given past interactions with the environment. If we run an imitator, we probably want events to unfold similarly to the way

... (read 332 more words →)

michaelcohen

Asymptotically Unambitious AGI

Formal Solution to the Inner Alignment Problem

Pessimism About Unknown Unknowns Inspires Conservatism

Response to "What does the universal prior actually look like?"

michaelcohen

michaelcohen

Safety cases for Pessimism

IRL in General Environments

Utility uncertainty vs. expected information gain

Value Learning is only Asymptotically Safe

Impact Measure Testing with Honey Pots and Myopia

Just Imitate Humans?

Build a Causal Decision Theorist
