This is a meta-post about the upcoming sequence on Value Learning that will start to be published this Thursday. This preface will also be revised significantly once the second half of the sequence is fully written.

Purpose of the sequence

The first part of this sequence will be about the tractability of ambitious value learning, which is the idea of inferring a utility function for an AI system to optimize based on observing human behavior. After a short break, we will (hopefully) continue with the second part, which will be about why we might want to think about techniques that infer human preferences, even if we assume we won’t do ambitious value learning with such techniques.

The aim of this part of the sequence is to gather the current best public writings on the topic, and provide a unifying narrative that ties them into a cohesive whole. This makes the key ideas more discoverable and discussable, and provides a quick reference for existing researchers. It is meant to teach the ideas surrounding one specific approach to aligning advanced AI systems.

We’ll explore the specification problem, in which we would like to define the behavior we want to see from an AI system. Ambitious value learning is one potential avenue of attack on the specification problem, one that assumes a particular model of an AI system (maximizing expected utility) and a particular source of data (human behavior). We will then delve into conceptual work on ambitious value learning that has revealed obstructions to this approach. There will be pointers to current research that aims to circumvent these obstructions.

The second part of this sequence is currently being assembled, and this preface will be updated with details once it is ready.

The first half of this sequence takes you near the cutting edge of conceptual work on the ambitious value learning problem, with some pointers to work being done at this frontier. Based on the arguments in the sequence, I am confident that the obvious formulation of ambitious value learning has major, potentially insurmountable conceptual hurdles given the ways that AI systems work currently, but it may be possible to pose a different formulation that does not suffer from these issues, or to add hardcoded assumptions to the AI system to avoid impossibility results. If you try to disprove the arguments in the posts, or to create formalisms that sidestep the issues brought up, you may very well generate a new interesting direction of work that has not been considered before.

There is also a community of researchers working on inverse reinforcement learning without focusing on its application to ambitious value learning; this is out of the scope of the first half of this sequence, even though such work may still be relevant to long-term safety.

Requirements for the sequence

Understanding these posts will require at least a passing familiarity with the basic principles of machine learning (not deep learning), such as “the parameters of a model are chosen to maximize the log probability that the model assigns to the observed dataset”. No other knowledge about value learning is required. If you do not have this background, I am not sure how easy it will be to grasp the points made; many of the points feel intuitive to me even without an ML background, but this could be because I no longer remember what it was like to not have ML intuitions.

There are many different subcultures interested in AI safety, and the posts I have chosen to include involve linguistic choices and assumptions from different places. I have tried to make this sequence understandable to all people who are interested and who understand the basic principles of ML, and so if something seems odd/confusing, please do let me know, either in the comments or via the PM system.

Learning from this sequence

When collating this sequence, I tried to pick content that makes the most important points simply and concisely. I recommend reading through each post carefully, taking the time to understand each paragraph. The posts range from informal arguments to formal theorems, but even for the formal theorems the formalization of the problem could be changed to invalidate the theorem. Learn from this however you best learn; my preferred method is to try and disprove the argument in the post until I feel like I understand what the post actually conveys.

While this sequence as it stands has no exercises, what it does have is a surrounding forum and community. Here are a few actions you can take to aid both your and others’ understanding of the core concepts:

  • Leave a comment with a concise summary of what you understand to be the post/paper’s main point
  • Leave a comment outlining a confusion you have with the post/paper
  • Respond to someone else’s comment to help them understand it better

While I can’t commit to responding to the majority of the comments, I am excited to help readers understand the content, so please let me know if something I write is confusing.

Each post has a note at the top saying what the post covers and who should read it. You can read through these notes and decide whether each post is important for you. That said, the posts are written and organized assuming that you have read prior posts in the sequence, and many points will not make sense if read out of order.


6 comments

Would it be possible to post a bibliography for this sequence, similar to the one for embedded agency? It would be useful to know what body of research this sequence is based on.

For the first part, this sequence is mostly me collecting the body of research in one place (in the sense that most of the posts in the sequence are just crossposts of relevant blog posts). So there isn't really an external body of research to refer to outside of the sequence.

"the parameters of a model are chosen to maximize the log probability that the model assigns to the observed dataset”

log is a monotonic function, so how does this differ from choosing parameters to maximize the probability?

I don't think this is relevant, but there are theoretical uses for maximizing expected log probability, and maximizing expected log probability is not the same as maximizing expected probability, since they interact with the expectation differently.
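To make the "interact with the expectation differently" point concrete, here is a minimal sketch under assumptions of my own (none of this is from the sequence): a coin that lands heads 70% of the time, and a one-parameter model `q` for the probability of heads. The names `P_HEADS`, `expected_prob`, and `expected_log_prob` are hypothetical, and the grid search is only for illustration.

```python
import math

# Assumed toy setup: the true coin lands heads 70% of the time, and the
# model has a single parameter q = the probability it assigns to heads.
P_HEADS = 0.7


def expected_prob(q):
    # Expected probability the model assigns to one sampled outcome.
    return P_HEADS * q + (1 - P_HEADS) * (1 - q)


def expected_log_prob(q):
    # Expected log probability the model assigns to one sampled outcome.
    return P_HEADS * math.log(q) + (1 - P_HEADS) * math.log(1 - q)


# Grid of candidate parameters, avoiding q = 0 and q = 1 where log blows up.
candidates = [i / 100 for i in range(1, 100)]

best_for_prob = max(candidates, key=expected_prob)
best_for_log_prob = max(candidates, key=expected_log_prob)

print(best_for_prob, best_for_log_prob)
```

In this toy case, maximizing expected probability pushes `q` to the edge of the grid (0.99, i.e. "always bet on heads"), while maximizing expected log probability picks `q = 0.7`, the true frequency. The two objectives only coincide when there is no expectation involved, i.e. for a single fixed dataset.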

If you have lots of training data, the probability that the model assigns to the training data is very small. You can't represent such small numbers with the commonly used floating-point types in Python/Java/etc.

It's more practical to compute the log probability of the training data (by summing the log probabilities assigned to the training examples rather than multiplying the original probabilities).
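A minimal sketch of that practical concern, assuming a made-up dataset of 2000 examples to each of which the model assigns probability 0.5 (the numbers are purely illustrative):

```python
import math

# Hypothetical setup: 2000 training examples, each assigned probability 0.5.
probs = [0.5] * 2000

# Multiplying the probabilities: the true value is 2**-2000, which is far
# below the smallest positive 64-bit float, so the product underflows.
product = 1.0
for p in probs:
    product *= p

# Summing the log probabilities instead stays well within range.
log_prob = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_prob)  # roughly -1386.29
```

Once the product has underflowed to zero, every parameter setting looks equally bad, so there is nothing left to maximize. The sum of logs keeps the information needed to compare models.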

It doesn't, modulo practical concerns that ofer brings up below. Also the math is often nicer in log space (since you have a sum over log probabilities of data points, instead of a product over probabilities of data points). But yes, formally they are equivalent.