Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
HIGHLIGHTS
From Optimizing Engagement to Measuring Value (Smitha Milli et al) (summarized by Rohin): This paper takes a stab at creating a better objective than engagement for existing recommender systems, in a way that could be applied today at companies like Twitter. The basic approach is to treat the variable to be optimized (user value) as a latent variable, and use probabilistic inference to infer how likely it is that a particular recommendation was valuable.
Usually a major challenge with such an approach is specifying the observation model: how the observed data is caused by the latent variable. In the case of Twitter, this would require you to answer questions like, “if the user does not value a tweet, how likely is the user to hit the like button anyway?” This is a hard question to answer, since users might like tweets in order to end conversations, or because the tweets are addictive in the moment but not actually valuable, and so on.
One simple heuristic is to take two datasets where we know one has more valuable recommendations than the other. Differences in user behavior between the two datasets can then be assumed to be correlated with value. The authors provide a quantitative method for inferring the observation model from such datasets, which I won’t go into here since it is primarily a heuristic baseline. One obvious problem is that if the “better” dataset was produced by optimizing (say) clicks, then clicks may have increased for reasons other than improved value, but this heuristic approach will attribute the entire increase to improved value.
How can we do better? The key insight of this paper is that if you have a bunch of historical data, then you can get a lot of mileage by identifying an anchor: a type of feedback that, when given, provides unequivocal evidence about the latent value. On Twitter, this is taken to be the “See Less Often” (SLO) button: if it is clicked, then we know with effective certainty that the tweet was not valuable, regardless of any other actions the user took. The connection between value and other behaviors such as liking a tweet can then be inferred by looking at the connection between those behaviors and the anchor, which we can estimate from historical data.
Formally, the authors assume access to a graph describing the relationships between the various possible behaviors (almost all of which have the latent value V as a parent). One of these is identified as the anchor node A, for which P(V = 1 | A = 1) is assumed to be known and independent of all other variables. However, P(V = 1 | A = 0) is not independent of other variables: intuitively, if the SLO button is not clicked, then we need to fall back to looking at other variables to estimate value.
The authors then show that, under some reasonable assumptions on the anchor variable, if you have historical data with which to estimate P(A, B) (where B consists of all the other tracked behaviors), then instead of specifying observation models P(B | V) for all behaviors, you only need to specify observation models for the parents of A, that is, P(parents(A) | V). Everything else is uniquely determined, allowing us to calculate our final objective P(V | A, B). (There are algorithmic details on how to do this efficiently; see the paper.) In this case, they use the heuristic method outlined above to estimate P(parents(A) | V).
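To make the final inference step concrete, here is a minimal sketch in Python. The graph, the conditional probability tables, and the conditional-independence assumption are all illustrative simplifications, not values or structure from the paper (which pins down the observation models from historical data via the anchor, over a richer 16-node graph); the sketch only shows the mechanics of computing P(V | A, B) and why an anchor observation dominates the posterior.

```python
# Toy sketch of the final inference step P(V | A, B): a latent value V with
# three observed behaviors: "like", "retweet", and the anchor "slo"
# ("See Less Often"). All numbers are illustrative, and behaviors are treated
# as conditionally independent given V purely for brevity.

PRIOR_V = {1: 0.3, 0: 0.7}  # P(V = v)

# P(behavior = 1 | V = v). The anchor's row encodes "SLO is essentially never
# clicked on valuable tweets", which is what makes it decisive evidence.
OBS_MODEL = {
    "like":    {1: 0.25, 0: 0.05},
    "retweet": {1: 0.10, 0: 0.01},
    "slo":     {1: 1e-5, 0: 0.05},
}

def p_valuable(obs):
    """Posterior P(V = 1 | observed behaviors), computed by enumerating V."""
    joint = {}
    for v in (0, 1):
        likelihood = 1.0
        for behavior, value in obs.items():
            p_one = OBS_MODEL[behavior][v]
            likelihood *= p_one if value == 1 else (1.0 - p_one)
        joint[v] = PRIOR_V[v] * likelihood
    return joint[1] / (joint[0] + joint[1])

# A liked-and-retweeted tweet looks valuable...
print(p_valuable({"like": 1, "retweet": 1, "slo": 0}))  # high
# ...but a single SLO click overrides the other signals almost entirely.
print(p_valuable({"like": 1, "retweet": 1, "slo": 1}))  # near zero
```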
They unfortunately don’t have a great way to evaluate their method: they clearly can’t evaluate it by seeing if it leads to higher clicks, since the whole point was to move away from clicks as an optimization target. (I assume a user study on Twitter was infeasible.) Their primary form of evaluation is to run the model and report the learned probabilities, and show that they seem reasonable, whereas those output by a Naive Bayes model do not.
Rohin's opinion: I really liked this paper: it seems like it took a real stab at trying to align today’s recommender systems, and might have made substantial progress.
I am somewhat dubious of the use of causal graphical models here: if you create a model with a conditional independence relation that doesn’t hold in practice, your model can make some pretty bad inferences. This actually happened: when they modeled only the relationships implied by Twitter’s UI elements, and in particular did not model the dependence of SLO on clicks, they got bad results in which clicking on a post was interpreted as evidence that the post was not valuable.
As the paper mentions, we can drop the causal interpretation of the Bayes net. This lets us draw edges between more nodes in B to make the model more expressive and partially guard against this sort of misspecification, while also letting us express more complex relationships. For example, I think (but am not sure) that with their current graph, liking and retweeting a post would be treated as independent sources of evidence. If we drew an edge between them, the model could learn that a user both liking and retweeting a post is stronger evidence than the sum of the contributions of liking and retweeting separately. Note that you still can’t connect everything to the anchor A, because they require that A has no children, and any parents you add to A must then be estimated by the heuristic method above, which is probably not very good. So you still need to model conditional independence for A, and this may get significantly harder the more complex B becomes.
This also makes sense given the motivation: since the idea is to have information about value “flow through the anchor” to the variables in B, it seems like you shouldn’t need to worry too much about the relationships between variables in B, and it would be fine to model arbitrarily complex relationships between them. Besides dropping the causal interpretation and adding lots of edges in B, another option is to add many more features to the graph: for example, perhaps you also want to include the total number of likes the tweet gets, or whether or not it is currently the weekend. You do need to make sure the model does not become so expressive that it can overfit to the dataset, e.g. “tweets posted on Sep 24 that get exactly 38756 likes will not be valuable to Alice”. However, the datasets they are using are presumably huge, and the current graphical models are tiny (16 nodes with low degree), so the models can grow at least somewhat before we reach that point.
Is this paper relevant to alignment of superintelligent AI systems, the topic of this newsletter? I don’t think it is that relevant, since it seems like the main technique involves us effectively hardcoding our knowledge (of the anchor variable) into the objective, in a way that makes sense for recommender systems but probably would not work for more general systems. I’m highlighting it anyway because I think it is particularly novel and interesting, it seems to solve an analog of the alignment problem for existing AI systems, and it is part of a research lineage that I do think will be relevant for alignment: how to create good objectives for AI systems. Even if you only care about alignment of superintelligent systems, it seems worth following the techniques used today and the problems that come up in their application, as the lessons learned may continue to be relevant when intelligence is scaled up.
TECHNICAL AI ALIGNMENT
LEARNING HUMAN INTENT
The EMPATHIC Framework for Task Learning from Implicit Human Feedback (Yuchen Cui, Qiping Zhang et al) (summarized by Rohin): A problem with learning from human feedback is that human feedback is quite expensive to collect. Can we instead learn from the facial expressions that humans make automatically anyway? This paper shows that the answer is yes: they first record human reactions while the humans watch an autonomous agent, and use that data to train a model that predicts reward given human reactions. They then transfer this model to a new task.
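A minimal sketch of the core idea, under simplifying assumptions (a linear model over pre-extracted reaction features; the paper’s actual model and training setup will differ): fit a reaction-to-reward predictor on the source task, then reuse it as a reward signal on a new task where only the human’s reactions are available.

```python
import numpy as np

# Hypothetical training data from the source task: each row of `reactions`
# encodes one recorded facial reaction (e.g. a vector of expression features),
# and `rewards` is the reward the agent received at that moment.
rng = np.random.default_rng(0)
reactions = rng.normal(size=(1000, 16))
rewards = reactions @ rng.normal(size=16) + 0.1 * rng.normal(size=1000)

# Fit a simple linear reaction -> reward model by least squares.
weights, *_ = np.linalg.lstsq(reactions, rewards, rcond=None)

def predicted_reward(new_reactions):
    """On a new task, treat the human's reactions as a learned reward signal."""
    return new_reactions @ weights
```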
Humans learn too: Better Human-AI Interaction using Optimized Human Inputs (Johannes Schneider) (summarized by Rohin): Most work in human-AI interaction focuses on optimizing the AI system to perform well with the human. However, we could also teach the human to work well with the AI system. This paper investigates this idea in the context of a simple drawing game in which the human must draw a sketch of some word within a minute, and the AI system must then guess what the word was.
The author developed a system to propose small modifications to the images that humans draw to make them more easily recognizable -- a very similar setting to that of adversarial examples. In a user study, people were presented with an image and asked to redraw it. When people were shown the altered images, their redrawn images were correctly classified more often and took less time to draw than when they were shown the original images.
Read more: AI Safety Needs Social Scientists (AN #47)
REWARD LEARNING THEORY
The case against economic values in the brain (Benjamin Y. Hayden et al) (summarized by Rohin) (H/T Xuan Tan): It has been common in the neuroeconomics literature to assume (based on past research) that the brain explicitly computes some notion of value in order to make choices. This paper argues that this is wrong: it is plausible that the brain does not in fact explicitly calculate values, and instead directly learns a policy that produces actions.
Rohin's opinion: If you previously were optimistic about inverse reinforcement learning and similar techniques because you thought they could infer the same notion of value that the brain computes, this seems like an important counterargument. However, it should be noted that the authors are not arguing that the brain is not optimizing some specific notion of value: just that it is not explicitly computing such a notion of value. (Similarly, policies learned by RL optimize the reward function, even though they need not explicitly calculate the reward of every state in order to choose an action.) So you could still hope that the brain is optimizing some notion of value that isn’t explicitly computed, and then use inverse RL to recover that notion of value.
PREVENTING BAD BEHAVIOR
Safety Aware Reinforcement Learning (SARL) (Santiago Miret et al) (summarized by Rohin): Many approaches to safety rely on learning from a trusted overseer (typically a human), including iterated amplification (AN #40), debate (AN #5), parenting (AN #53), delegative RL (AN #57), and quantilization (AN #48). This paper applies this idea to avoiding side effects in the SafeLife environment (AN #91). They train a safety agent to minimize the side effect score, which serves as a proxy for the trusted overseer, and then train a regular RL agent to optimize reward while penalizing deviations from the safety agent’s policy. They find that the safety agent can be transferred zero-shot to new environments and help reduce side effects in those environments as well.
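As a rough sketch of the shaping described above (the exact penalty form, divergence measure, and coefficient are assumptions for illustration, not the paper’s implementation), the task agent’s per-step objective combines the environment reward with a penalty for diverging from the safety agent’s action distribution:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def shaped_reward(env_reward, task_action_probs, safety_action_probs, beta=0.5):
    """Per-step objective for the task agent: environment reward minus a
    penalty for deviating from the safety agent's policy; beta trades off
    task reward against safety."""
    return env_reward - beta * kl_divergence(task_action_probs, safety_action_probs)

# Example: the task agent favors an action the safety agent considers risky,
# so the shaped reward is pushed down.
task_probs = np.array([0.8, 0.1, 0.1])
safe_probs = np.array([0.1, 0.45, 0.45])
print(shaped_reward(1.0, task_probs, safe_probs))
```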
HANDLING GROUPS OF AGENTS
Multi-agent Social Reinforcement Learning Improves Generalization (Kamal Ndousse et al) (summarized by Rohin): We’ve previously seen that in sparse reward settings where exploration is hard, it’s very useful to have expert demonstrations to avoid having to do all the exploration yourself (1 (AN #14), 2 (AN #65), 3 (AN #9)). However, this assumes that the demonstrator is “external” to the environment, whereas really we’d like to model them as part of the environment, as in assistance games (AN #69). This then looks like social learning, in which agents learn how to perform tasks by looking at cues from other agents within the environment.
But how can we do this in high-dimensional environments? This paper looks at one approach: adding an auxiliary loss in which the agent must predict the next state of the environment. Since the environment itself contains experts that do useful things, the agent implicitly must learn what those experts are doing and what effects their actions have.
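A minimal PyTorch sketch of the auxiliary-loss idea (architecture, loss, and weighting are illustrative assumptions, not the paper’s setup): the agent’s encoder feeds both the policy head used by the RL loss and a head that predicts the next observation, so the representation is pushed to capture what the in-environment experts are doing.

```python
import torch
import torch.nn as nn

class SocialLearner(nn.Module):
    """Policy network with an auxiliary next-observation prediction head."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)   # fed into the RL loss
        self.predict_head = nn.Linear(hidden, obs_dim)    # predicts the next obs

    def forward(self, obs):
        features = self.encoder(obs)
        return self.policy_head(features), self.predict_head(features)

def total_loss(rl_loss, predicted_next_obs, actual_next_obs, aux_weight=0.1):
    """Standard RL loss plus the auxiliary next-state prediction loss."""
    aux_loss = nn.functional.mse_loss(predicted_next_obs, actual_next_obs)
    return rl_loss + aux_weight * aux_loss
```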
They find that such agents learn to follow the cues of the experts and thus achieve significantly higher reward than agents trained in isolation. In fact, these agents can be transferred to novel environments, where they continue to follow expert cues to achieve high reward. However, because they rely on expert cues, they don’t learn how to act when experts aren’t present, and so they fail in the solo setting. This can be fixed by training on a mixture of solo settings and settings with experts present.
Rohin's opinion: I’m a big fan of moving towards modeling humans as part of the environment, since we will eventually have AI systems working with and interacting with humans -- they won’t be “external to the AI’s universe”, as they are often modeled currently.
MISCELLANEOUS (ALIGNMENT)
The date of AI Takeover is not the day the AI takes over (Daniel Kokotajlo) (summarized by Rohin): This post points out that when making decisions based on AGI timelines, the relevant date is not when an AI would actually take over the world, but instead the last point at which we could have done anything about it.
AI STRATEGY AND POLICY
Future Indices: How Crowd Forecasting Can Inform the Big Picture (Michael Page et al) (summarized by Rohin): This paper explains the methodology behind CSET’s recent forecasting project, Foretell. We would like to know which of several potential geopolitical scenarios might happen in the next 3-7 years. We can get some insight into this by asking relevant experts for their opinions, but often many experts will disagree, making it hard to know what to conclude.
We’d like to mitigate this by leveraging the wisdom of the crowds. Unfortunately, this would require a clear and precise operationalization of our scenarios, and the scenarios we’re interested in are rarely amenable to such operationalization. Instead, we can identify a number of predictors that bear on a specific scenario, and for each predictor one or more metrics that are themselves clear and precise and give us information about that predictor. We can get forecasts for these metrics using the wisdom of the crowds. We can then compute the deviations between the crowd forecasts and simple trend extrapolations of historical data, and use the observed trend directions as arguments for or against particular scenarios.
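As a toy illustration of the deviation computation (all numbers are made up, and Foretell’s actual methodology is more involved): fit a simple trend to the metric’s history, extrapolate it to the forecast date, and treat the gap between the crowd forecast and the extrapolation as evidence about the associated predictor.

```python
import numpy as np

# Hypothetical history of one metric, e.g. annual US-China trade volume
# (units and values are illustrative only).
years = np.array([2015, 2016, 2017, 2018, 2019, 2020], dtype=float)
metric = np.array([560.0, 578.0, 635.0, 659.0, 558.0, 560.0])

# Simple trend extrapolation: fit a line to history and project it forward.
slope, intercept = np.polyfit(years, metric, deg=1)
forecast_year = 2025
trend_value = slope * forecast_year + intercept

# Hypothetical crowd forecast for the same metric and date.
crowd_forecast = 480.0

# The deviation (crowd forecast minus trend extrapolation) is the signal: here
# the crowd expects the metric to come in well below trend, which for a metric
# like trade volume would count as evidence for rising US-China tensions.
deviation = crowd_forecast - trend_value
print(f"trend: {trend_value:.1f}, deviation: {deviation:.1f}")
```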
The paper illustrates this in the case of potential scenarios involving the US, China, and AI. An example of an important predictor is “US-China tensions”. Associated metrics include the amount of US-China trade, the number of Chinese O visas, etc. In this case, the crowd predictions suggested trend deviations in the metrics that argued for increasing US-China tensions.
FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter, recorded by Robert Miles, is available.