Does anyone know how Brian Christian came to be interested in AI alignment, and why he decided to write this book instead of a book about a different topic? (I haven't read the book; I looked at the Amazon preview but couldn't find the answer there.)
The book is dedicated "for Peter, who convinced me". Maybe that mysterious Peter is the ultimate cause of Christian's interest in AI alignment and his decision to write a book about it?
Seems like you were right, and the Peter in question is Peter Eckersley. I just saw in this post:
The Alignment Problem is dedicated to him, after he convinced his friend Brian Christian of it.
That post did not link to a source, but I found this tweet where Brian Christian says:
His influence in both my intellectual and personal life is incalculable. I dedicated The Alignment Problem to him; I knew for many years that I would.
Apparently this has been nominated for the review. I assume that this is implicitly a nomination for the book, rather than my summary of it. If so, I think the post itself serves as a review of the book, and I continue to stand by the claims within.
The case of the medical software that learned "the worse the underlying condition, the better off you are" with respect to pneumonia has been rattling around my head for a few hours.
Do we have any concept of an intervention in machine learning? I am sort of gesturing at the Judea Pearl sense of the word, but in the end I really mean physical attempts to change things. So the basic question is: how does the machine learn when we have already tried to change the outcome?
Does an intervention have a size or a count? For example, the difference between trying one thing, like taking an aspirin, and trying many things, like an aspirin plus bed rest plus antibiotics?
Do interventions have a dimension? For example, is there a negative intervention, where we try to stop or reverse a process, and a positive intervention, where we try to sustain or enhance a process? Would we consider pneumonia interventions to be sustaining/enhancing lung function, or stopping/reversing disease progression? Presumably both, in a suite of advanced care.
Does uniformity of interventions across the data predict the success or failure of machine learning approaches? For example: it fails in medicine, where intervention levels vary from patient to patient; it succeeds with unintervened sensor data like X-rays; it also succeeds at aligning lasers in fusion experiments, which sit in a deep well of interventions, but uniformly so.
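To make the confounding worry concrete, here is a toy simulation in the spirit of the pneumonia example; the treatment policy and all probabilities are invented for illustration, not taken from the original study. A naive learner trained on the observational outcomes "sees" asthma as protective, because the aggressive-treatment intervention is baked into the data it learns from.

```python
# Toy sketch (hypothetical numbers): an unmodeled intervention -- aggressive
# treatment for asthmatic pneumonia patients -- reverses the observed association,
# so a naive learner concludes asthma lowers pneumonia mortality.
import random

random.seed(0)

def simulate_patient():
    asthma = random.random() < 0.2
    # The intervention: asthmatics are far more likely to get aggressive care.
    aggressive_care = random.random() < (0.9 if asthma else 0.2)
    # Causally, asthma *raises* baseline mortality; aggressive care lowers it a lot.
    p_death = 0.15 + (0.10 if asthma else 0.0) - (0.18 if aggressive_care else 0.0)
    death = random.random() < max(p_death, 0.01)
    return asthma, death

patients = [simulate_patient() for _ in range(100_000)]

def death_rate(with_asthma):
    group = [died for has_asthma, died in patients if has_asthma == with_asthma]
    return sum(group) / len(group)

print(f"observed death rate, asthma:    {death_rate(True):.3f}")
print(f"observed death rate, no asthma: {death_rate(False):.3f}")
# The asthma group shows a *lower* observed death rate, even though asthma is
# causally harmful -- the treatment intervention is invisible to the learner.
```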
The Alignment Problem: Machine Learning and Human Values, by Brian Christian, was just released. This is an extended summary + opinion; a version without the quotes from the book will go out in the next Alignment Newsletter.
Summary:
This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of:
- The gorilla misclassification incident
- The faulty reward in CoastRunners
- The gender bias in language models
- The failure of facial recognition models on minorities
- The COMPAS controversy (leading up to impossibility results in fairness)
- The neural net that thought asthma reduced the risk of pneumonia
It then moves on to agency and reinforcement learning, covering, from a more historical and academic perspective, how we arrived at ideas such as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming and mesa optimization that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, rewards will do ~nothing, and the main requirement to get a competent agent is to provide good shaping rewards or a good curriculum (see the toy sketch after the list below). Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example):
- BF Skinner's experiments in training pigeons
- The invention of the perceptron
- The success of TD-Gammon, and later AlphaGo Zero
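As a concrete illustration of two of the ideas above, here is a minimal sketch (my own, not from the book) of a TD(0) value update combined with potential-based reward shaping, on a toy corridor environment; the potential function, discount, and learning rate are arbitrary choices.

```python
# Minimal sketch of TD(0) learning plus potential-based reward shaping on a
# 5-state corridor where only reaching the final state gives reward.
import random

random.seed(0)

N_STATES, GOAL = 5, 4          # states 0..4, reward only on reaching state 4
GAMMA, ALPHA = 0.9, 0.1

def potential(s):
    # A shaping potential that grows as we approach the goal (an assumed heuristic).
    return s / GOAL

def step(s, use_shaping):
    s_next = min(s + random.choice([0, 1]), GOAL)   # noisy progress to the right
    r = 1.0 if s_next == GOAL else 0.0
    if use_shaping:
        # Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s).
        r += GAMMA * potential(s_next) - potential(s)
    return s_next, r

def train(use_shaping, episodes=200):
    V = [0.0] * N_STATES
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            s_next, r = step(s, use_shaping)
            # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s').
            V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
            s = s_next
    return V

print("values without shaping:", [round(v, 2) for v in train(False)])
print("values with shaping:   ", [round(v, 2) for v in train(True)])
```

The potential-based form F(s, s') = γΦ(s') − Φ(s) is the standard choice because Ng, Harada & Russell (1999) showed it leaves the optimal policy unchanged, so the shaping rewards speed up learning without distorting what the agent ultimately optimizes.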
The final part, titled "Normativity", delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities -- how to get AI systems that optimize for their objectives -- this last one tackles head-on the problem that we want AI systems that optimize for our (often-unknown) objectives, covering topics such as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty.
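To give a flavor of one of those topics, here is a toy sketch of "learning from preferences": fitting a reward function from noisy pairwise comparisons via the Bradley-Terry model. The one-dimensional features, the noise model, and all numbers are my own illustration, not the book's.

```python
# Toy sketch: recover a reward weight w, where r(x) = w * x, from pairwise
# preferences modeled as P(A preferred to B) = sigmoid(r(A) - r(B)).
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: items are 1-D features; the hidden true reward prefers
# larger features, and the comparisons are noisy.
TRUE_W = 2.0
def sample_comparison():
    a, b = random.random(), random.random()
    prefer_a = random.random() < sigmoid(TRUE_W * (a - b))
    return (a, b) if prefer_a else (b, a)       # (winner, loser)

data = [sample_comparison() for _ in range(2000)]

# Maximum-likelihood fit of w by gradient ascent on the Bradley-Terry log-likelihood.
w, lr = 0.0, 1.0
for _ in range(500):
    grad = sum((1.0 - sigmoid(w * (win - lose))) * (win - lose) for win, lose in data)
    w += lr * grad / len(data)

print(f"recovered reward weight: {w:.2f} (true value {TRUE_W})")
```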
Opinion:
I really enjoyed this book, primarily because of the tracing of the intellectual history of various ideas. While I knew of most of these ideas, and often also who initially came up with them, it's much more engaging to read the detailed stories of _how_ that person came to develop the idea; Brian's book delivers this again and again, functioning like a well-organized literature survey that is also fun to read because of its great storytelling. I struggled a fair amount in writing this summary, because I kept wanting to somehow communicate the writing style; in the end I decided not to try, and instead to give a few examples of passages from the book in this post.
Passages:
Note: Quotations of this length from the book are not generally permitted; I have specifically gotten permission to include them.
Here’s an example of agents with evolved inner reward functions, which lead to the inner alignment problems we’ve previously worried about:
Maybe everyone but me already knows this, but here’s one of the best examples I’ve seen about the benefits of transparency:
Finally, on the importance of reward shaping: