Curated, because I am really happy about these alignment newsletters existing, and it seemed good to feature one instance more prominently. This one also seemed like a particularly good starting point.
I also collected them all into a single sequence, to make them easier to link to and to let people read them in one go, and will try to add new ones to it as they come out. (I also created a small logo/icon, but feel totally free to change it; I just needed something to serve as the image for the sequence.)
I think I agree with this post? Certainly for a superintelligence that is vastly smarter than humans, I buy this argument (and in general am not optimistic about solving alignment). However, humans seem to be fairly good at keeping each other in check, without a deep understanding of what makes humans tick, even though humans often do optimize against each other. Perhaps we can maintain this situation inductively as our AI systems get more powerful, without requiring a deep understanding of what's going on? Overall I'm pretty confused on this point.
I read Optimization Amplifies as Scott's attempt to more explicitly articulate the core claim of Eliezer's Security Mindset dialogues (1, 2). On this view, making software robust/secure to ordinary human optimization does demand the same kind of approach as making it robust/secure to superhuman optimization. The central disanalogy isn't "robustness-to-humans requires X while robustness-to-superintelligence requires Y", but rather "the costs of robustness/security failures tend to be much smaller in the human case than the superintelligence case".
Okay, that makes sense, I agree with that. As I mentioned in the opinion, I definitely agree with this in the case of a superintelligence optimizing a utility function. My mindset when writing that opinion was probably that the AI systems we actually deploy seem unlikely to look like a single agent optimizing for particular preferences, for reasons that I couldn't really articulate. I still have this intuition, and I think I'm closer to being able to explain it now, but not in a comment.
(Layman question)
Contrary to most expectations, they didn't need anything fundamentally new in order to get long-term strategic planning. I was particularly surprised by this. Some interesting thoughts from OpenAI researchers in this thread -- in particular, assuming good exploration, the variance of the gradient should scale linearly with the duration, and so you might expect you only need linearly more samples to counteract this.
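A toy numpy sketch of that intuition (my own illustration of the scaling argument, not OpenAI's analysis): if the per-timestep terms in a REINFORCE-style gradient estimate are roughly independent with fixed variance, then the variance of the episode-level estimate grows linearly with the horizon, so holding the standard error fixed requires roughly linearly more samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_gradient_estimate(horizon, noise_std=1.0):
    # A stand-in for one episode's policy-gradient estimate: a sum of
    # `horizon` roughly independent per-timestep terms (true gradient = 0).
    return rng.normal(0.0, noise_std, size=horizon).sum()

for horizon in [100, 1_000, 10_000]:
    estimates = np.array([toy_gradient_estimate(horizon) for _ in range(2_000)])
    # The empirical variance grows roughly linearly with the horizon, so
    # roughly linearly more samples are needed for the same standard error.
    print(horizon, round(estimates.var(), 1))
```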
I notice some confusion here (possibly because I may be conflating some things). I recall that when AlphaGoZero came out, people who understood games and ML better than I did were saying they expected it not to transfer well to domains involving long term planning and hidden information. (I'm not 100% sure I'm remembering the "long term planning" thing right; I definitely remember the hidden information thing.)
I remember being surprised by that. I didn't have a very clear model of what AlphaGoZero was doing (I have a rough idea of how neural nets and gradient descent work, but I wasn't clear on whether that's even precisely what AGZ was using or, if so, how it "really worked"). But the model I did have suggested it was doing a sufficiently general thing that I didn't see why it wouldn't be able to handle hidden information or long term planning. And I was curious what would happen if you did literally the minimum re-architecting needed to get it to play something more complex, with no additional design.
I'm not sure how similar this architecture actually is to AlphaGoZero, and I didn't really follow much of the architecture of last year's Dota bot. But my vague impression was that it's at least in the same genre.
So:
a) wanted to check in a bit about my assumptions here? Is this significantly different from AGZ?
b) is there hidden information here? (I would have assumed so, except the "they all observed the full Dota 2 state" might imply otherwise – does that mean things that are supposed to be visible to a player, or the entire map?)
c) assuming "b" is "yes", are the reasons people did think that longterm planning and hidden information would require different/cleverer algorithms coherent and still reasonable? Like, at the time I had a gears-less belief that it'd probably be able to work at least somewhat, other people who seemed to have more gears-based-models than me seemed to think it'd require new algorithmic insights. Were their gears wrong? Were they right, just... not sufficient? (i.e. does the fact that this was easier than people expected imply anything interesting?)
they expected it not to transfer well to domains involving long term planning and hidden information.
AGZ was doing some long term planning (over the timescale of a Go game), and had no hidden information. It certainly was not clear whether similar techniques would work when trajectories were tens of thousands of steps long instead of hundreds. Similarly it wasn't clear how to make things work with hidden information -- you could try the same thing but it was plausible it wouldn't work.
I didn't see why it wouldn't be able to handle hidden information or long term planning.
Yeah, I agree with this. This sounds mostly like a claim that it is more computationally expensive to deal with hidden information and long term planning.
Is this significantly different from AGZ?
Yes. There's no Monte-Carlo Tree Search, because they want the agent to learn from experience (so the agent is not allowed to simulate how the game would go, because it doesn't know the rules). The reward function is shaped, so that the agent actually gets feedback on how it is doing, whereas AGZ only had access to the binary win/loss signal. But I think it's fair to say that it's the same genre of approach.
If they instead allowed themselves to say "the agent knows the rules of the game", as in Go, so that it could simulate various branches of the game, they could probably take an AGZ approach with a shaped reward function, and my guess is it would just work, probably faster than their current approach.
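To make the "shaped reward" point concrete, here is a minimal sketch of the difference between the two kinds of signal (the event names and weights below are made up for illustration, not OpenAI's actual shaping):

```python
def win_loss_reward(game_over: bool, we_won: bool) -> float:
    # AGZ-style signal: nothing until the game ends, then +1 or -1.
    if not game_over:
        return 0.0
    return 1.0 if we_won else -1.0

def shaped_reward(events: dict) -> float:
    # Dense signal: intermediate events carry hand-chosen weights, so the
    # agent gets feedback on how it is doing throughout the game.
    weights = {"last_hit": 0.16, "kill": 0.6, "death": -0.6, "tower": 1.0}
    return sum(weights[e] * n for e, n in events.items() if e in weights)

# One timestep in which the agent got two last hits but also died once:
print(win_loss_reward(game_over=False, we_won=False))      # 0.0
print(round(shaped_reward({"last_hit": 2, "death": 1}), 2))  # -0.28
```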
is there hidden information here? (I would have assumed so, except the "they all observed the full Dota 2 state" might imply otherwise – does that mean things that are supposed to be visible to a player, or the entire map?)
Yes, there is. By "full Dota 2 state" I meant everything visible to all five players, not the entire map. This is more than humans have access to, but certainly not everything. And humans can get access to all of the information by communicating with other players on the team.
Were their gears wrong? Were they right, just... not sufficient? (i.e. does the fact that this was easier than people expected imply anything interesting?)
At this point I'm speculating really hard, but I think it was just that our gears were wrong somewhere. It's possible that difficult tasks like Dota actually only have a small number of realistic strategies and they can each be learned individually. (Small here is relative to what their massive amount of compute can do -- it's plausible that it learned thousands of realistic strategies, by brute force and memorization. You could say the same about AGZ.) In this scenario, the gear that was wrong was the one that predicted "the space of strategies in complex tasks is so large that it can't be memorized from experience". (Whereas with Go, it seems more plausible that humans also rely on these sorts of intuitive mechanisms rather than strategic reasoning, so it's not as surprising that a computer can match that.) It could also be that RL is actually capable of learning algorithms that reason symbolically/logically, though I wouldn't bet on it. Or it could be that we're actually still quite far from good Dota bots: OpenAI Five is beating humans on micromanagement of actions, and has learned sufficient strategy not to be outclassed but is still decidedly subhuman at long term planning, and the researchers' gears actually were right. I don't know.
Hi Rohin, are older versions of the newsletter available?
Also:
This sounds mostly like a claim that it is more computationally expensive to deal with hidden information and long term planning.
One consideration: When you are exploring a tree of possibilities, every bit of missing information means you need to double the size of the tree. So it could be that hidden information leads to an exponential explosion in search cost in the absence of hidden-information-specific search strategies. Although strictly speaking this is just a case of something being "more computationally expensive", exponential penalties generically push things from being feasible to infeasible.
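A back-of-the-envelope illustration of that point (toy numbers, nothing to do with Dota specifically): a naive search that branches over every move and over every possible setting of the hidden information pays a multiplicative 2^k factor for k hidden bits.

```python
def naive_search_size(branching: int, depth: int, hidden_bits: int) -> int:
    # Nodes a naive search considers if it branches on every move *and*
    # on every possible configuration of the hidden information.
    return (branching ** depth) * (2 ** hidden_bits)

print(naive_search_size(branching=10, depth=6, hidden_bits=0))   # 1_000_000
print(naive_search_size(branching=10, depth=6, hidden_bits=20))  # ~1.05e12
```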
Hey Jess, as Ben mentioned I keep all newsletter-related things on my website.
I agree that in theory hidden information leads to an exponential explosion. In practice, I think you don't need to search over all the exponentially many ways the hidden information could be in order to get good results. (At least, you don't need to do that in order to beat humans, because humans don't seem to do that.)
I think overall we agree though -- when I said "it wasn't clear how to make things work with hidden information -- you could try the same thing but it was plausible it wouldn't work", I was primarily thinking that the computational cost might be too high. I was relatively confident that given unbounded compute, AlphaGo-style algorithms could deal with hidden information.
Highlights
OpenAI Five (Many people at OpenAI): OpenAI has trained a team of five neural networks to play a particular set of Dota heroes in a mirror match (playing against the same set of heroes) with a few restrictions, and has started to beat amateur human players. They are aiming to beat a team of top professionals at The International in August, with the same set of five heroes, but without any other restrictions. Salient points:
My opinion: I might be more excited by an approach that was able to learn from human games (which are plentiful), and perhaps finetune with RL, in order to develop an approach that could generalize to more tasks in the future, where human data is available but a simulator is not. (Given the ridiculous sample complexity, pure RL with PPO can only be used in tasks with a simulator.) On the other hand, an approach that leveraged human data would necessarily be at least somewhat specific to Dota. A dependence on human data is unlikely to get us to general intelligence, whereas this result suggests that we can solve tasks that have a simulator, exploration strategy, and a dense reward function, which really is pushing the boundary on generality. This seems to be gdb's take: "We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding."
Paul's research agenda FAQ (zhukeepa): Exactly what it sounds like. I'm not going to summarize it because it's long and covers a lot of stuff, but I do recommend it.
Technical AI alignment
Technical agendas and prioritization
Conceptual issues in AI safety: the paradigmatic gap (Jon Gauthier): Lots of current work on AI safety focuses on what we can call "mid-term safety" -- the safety of AI systems that are more powerful and more broadly deployed than the ones we have today, but built using relatively similar techniques to the ones we use today. However, it seems plausible that there will be a paradigm shift in how we build AI systems, and if so it's likely that we will have a new, completely different set of mid-term concerns, rendering the previous mid-term work useless. For example, at the end of the 19th century, horse excrement was a huge public health hazard, and "mid-term safety" would likely have been about how to remove the excrement. Instead, the automobile was developed and started replacing horses, leading to a new set of mid-term concerns (e.g. pollution, traffic accidents), and any previous work on removing horse excrement became near-useless.
My opinion: I focus almost exclusively on mid-term safety (while thinking about long-term safety), not because I disagree with this argument, but in spite of it. I think there is a good chance that any work I do will be useless for aligning superintelligent AI because of a paradigm shift, but I do it anyway because it seems very important on short timelines, which are easier to affect; and I don't know of other approaches to take that would have a significantly higher probability of being useful for aligning superintelligent AI.
Read more: A possible stance for AI control research
Optimization Amplifies (Scott Garrabrant): One model of the difference between mathematicians and scientists is that a scientist is good at distinguishing between 0.01%, 50% and 99.99%, whereas a mathematician is good at distinguishing between 99.99% and 100%. Certainly it seems like if we can get 99.99% confidence that an AI system is aligned, we should count that as a huge win, and not hope for more (since the remaining 0.01% is extremely hard to get), so why do we need mathematicians? Scott argues that optimization is particularly special, in that the point of very strong optimization is to hit a very narrow target, which severely affects extreme probabilities, moving them from 0.01% to near-100%. For example, if you draw a million samples from a normal distribution and optimize for the largest one, it is almost certain to be 4 standard deviations above the mean (which is incredibly unlikely for a randomly chosen sample). In this sort of setting, the deep understanding of a problem that you get from a mathematician is still important. Note that Scott is not saying that we don't need scientists, nor that we should aim for 100% certainty that an AI is aligned.
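The normal-distribution example is easy to check numerically; a quick sanity check of the claim as summarized here, using the standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

# Any individual sample is 4+ standard deviations above the mean only a few
# times in 100,000 draws, but the *maximum* of a million samples almost
# always is (typically landing around 4.7 to 5.0).
print(samples.max())
print((samples > 4.0).mean())
```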
My opinion: I think I agree with this post? Certainly for a superintelligence that is vastly smarter than humans, I buy this argument (and in general am not optimistic about solving alignment). However, humans seem to be fairly good at keeping each other in check, without a deep understanding of what makes humans tick, even though humans often do optimize against each other. Perhaps we can maintain this situation inductively as our AI systems get more powerful, without requiring a deep understanding of what's going on? Overall I'm pretty confused on this point.
Another take on agent foundations: formalizing zero-shot reasoning (zhukeepa): There are strong incentives to build a recursively self-improving AI, and in order to do this without value drift, the AI needs to be able to reason effectively about the nature of the changes it makes to itself. In such scenarios, it is insufficient to "reason with extreme caution", where you think really hard about the proposed change and implement it if you can't find reasons not to do it. Instead, you need to do something like "zero-shot reasoning", where you prove under some reasonable assumptions that the proposed change is good. This sort of reasoning must be very powerful, enabling the AI to e.g. build a spacecraft that lands on Mars after observing Earth for one day. This motivates many of the problems in MIRI's agenda, such as Vingean reflection (self-trust), logical uncertainty (how to handle being a bounded reasoner), counterfactuals, etc., which all help to formalize zero-shot reasoning.
My opinion: This assumes an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried. However, in practice, I expect that powerful AI systems will not look like they are explicitly maximizing some utility function. If you change some component of the system for the worse, you are likely to degrade its performance, but not likely to drastically change its behavior to cause human extinction. For example, even in RL (which is the closest thing to expected utility maximization), you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see it myself). While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and instead to evaluate the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.) It may be the case that adversarial subprocesses could take advantage of these sorts of bugs, but I expect that we need adversarial-subprocess-specific research to address this, not zero-shot reasoning.
The Learning-Theoretic AI Alignment Research Agenda (Vadim Kosoy): This agenda aims to create a general abstract theory of intelligence (in a manner similar to AIXI, but with some deficiencies removed). In particular, once we use the framework of reinforcement learning, regret bounds are a particular way of provably quantifying an agent's intelligence (though there may be other ways as well). Once we have this theory, we can ground all other AI alignment problems within it. Specifically, alignment would be formalized as a value learning protocol that achieves some regret bound. With this formalization, we can solve hard metaphilosophy problems such as "What is imperfect rationality?" using the intuitions gained from looking at the problem through the lens of value learning protocols and universal reinforcement learning.
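For concreteness, this is the general shape of the guarantee a regret bound gives (the generic textbook definition, not Kosoy's specific formalism):

```latex
% Regret after T steps: the gap between the best policy in some class \Pi
% and what the agent actually achieved; a regret bound shows this grows
% sublinearly in T (e.g. O(\sqrt{T})) for every environment in a given class.
\mathrm{Regret}(T) = \max_{\pi \in \Pi} \mathbb{E}\!\left[\sum_{t=1}^{T} r_t^{\pi}\right] - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t^{\mathrm{agent}}\right]
```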
My opinion: This agenda, like others, is motivated by the scenario where we need to get alignment right the first time, without empirical feedback loops, both because we might be facing one-shot success or failure, and because the stakes are so high that we should aim for high reliability subject to time constraints. I put low probability on the first reason (alignment being one-shot), and those scenarios seem much less tractable, so I mostly ignore them. I agree with the second reason, but aiming for this level of rigor seems like it will take much longer than the time we actually have. Given this high-level disagreement, it's hard for me to evaluate the research agenda itself.
Iterated distillation and amplification
Paul's research agenda FAQ (zhukeepa): Summarized in the highlights!
Agent foundations
Forecasting using incomplete models (Vadim Kosoy)
Logical uncertainty and Mathematical uncertainty (Alex Mennen)
Learning human intent
Policy Approval (Abram Demski): Argues that even if we had the true human utility function (assuming it exists), an AI that optimizes it would still not be aligned. It also sketches out an idea for learning policies instead of utility functions that gets around these issues.
My opinion: I disagree with the post but most likely I don't understand it. My strawman of the post is that it is arguing for imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things), which seems wrong to me.
Human-Interactive Subgoal Supervision for Efficient Inverse Reinforcement Learning (Xinlei Pan et al)
Multi-agent Inverse Reinforcement Learning for General-sum Stochastic Games (Xiaomin Lin et al)
Adversarial Exploration Strategy for Self-Supervised Imitation Learning (Zhang-Wei Hong et al)
Preventing bad behavior
Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes (Shun Zhang et al): As we saw in Alignment Newsletter #11, one approach to avoiding side effects is to create a whitelist of effects that are allowed. In this paper, the agent learns both a whitelist of allowed effects and a blacklist of disallowed effects. They assume that the MDP in which the agent is acting has been factored into a set of features that can take on different values, and then separate the features into locked (unchangeable), free (changeable), and unknown. If there are no unknown features, then we can calculate the optimal policy using variants of standard techniques (for example, by changing the transition function to remove transitions that would change locked features, and then running any off-the-shelf MDP solver). However, this would require the operator to label all features as locked or free, which would be very tedious. To solve this, they allow the agent to query the operator about whether a certain feature is locked or free, and provide algorithms that reduce the number of queries the agent needs to make in order to find an optimal safe policy.
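A rough sketch of the "no unknown features" case described above (an illustrative toy rendering of the idea, not the paper's algorithm; the helper names are made up): forbid any state-action pair that could change a locked feature away from its current value, then run ordinary value iteration on what remains.

```python
import numpy as np

def safe_value_iteration(P, R, feature_of_state, locked, current,
                         gamma=0.95, iters=500):
    """P: (S, A, S) transition probabilities; R: (S, A) rewards;
    feature_of_state: state index -> dict of feature values;
    locked: names of features that must keep their `current` values."""
    n_states, n_actions, _ = P.shape
    FORBIDDEN = -1e9  # large penalty standing in for "action not allowed"

    # States in which some locked feature differs from its current value.
    bad_state = np.array([
        any(feature_of_state[s][f] != current[f] for f in locked)
        for s in range(n_states)
    ])
    # An action is unsafe in a state if it has any chance of reaching
    # a state where a locked feature has changed.
    unsafe = P[:, :, bad_state].sum(axis=2) > 0        # shape (S, A)

    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * (P @ V)                        # shape (S, A)
        Q[unsafe] = FORBIDDEN                          # drop unsafe actions
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                            # a safe policy
```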
My opinion: This seems like a good first step towards whitelisting -- there's still a lot of hardcoded knowledge from a human (which features to pay attention to, the transition function) and restrictions (the number of relevant features needs to be small), but it takes a problem and provides a solution that works in that setting. In the recent whitelisting approach, I was worried that the whitelist simply wouldn't include enough transitions for the agent to be able to do anything useful. Since this approach actively queries the operator until it finds a safe policy, that is no longer an issue. However, the corresponding worry would be that it takes prohibitively many queries before the agent can do anything useful. (Their empirical evaluation is on toy gridworlds, so this problem did not come up.) Another worry previously was that whitelisting causes an agent to be "clingy", that is, it wants to prevent all changes to non-whitelisted features, even if they are caused by physical laws, or other humans. A similar problem could arise here when this is generalized to dynamic and/or multiagent environments.
Read more: Worrying about the Vase: Whitelisting
Handling groups of agents
Learning Social Conventions in Markov Games (Adam Lerer and Alexander Peysakhovich)
Interpretability
Open the Black Box Data-Driven Explanation of Black Box Decision Systems (Dino Pedreschi et al)
Interpretable Discovery in Large Image Data Sets (Kiri L. Wagstaff et al)
Near-term concerns
Adversarial examples
On Adversarial Examples for Character-Level Neural Machine Translation (Javid Ebrahimi et al)
AI capabilities
Reinforcement learning
OpenAI Five (Many people at OpenAI): Summarized in the highlights!
Retro Contest: Results (John Schulman et al): OpenAI has announced the results of the Retro Contest. The winning submissions were modified versions of existing algorithms like joint PPO and Rainbow, without any Sonic-specific parts.
A Tour of Reinforcement Learning: The View from Continuous Control (Benjamin Recht)
Evolving simple programs for playing Atari games (Dennis G Wilson et al)
Accuracy-based Curriculum Learning in Deep Reinforcement Learning (Pierre Fournier et al)
Deep learning
DARTS: Differentiable Architecture Search (Hanxiao Liu et al)
Resource-Efficient Neural Architect (Yanqi Zhou et al)
AGI theory
The Foundations of Deep Learning with a Path Towards General Intelligence (Eray Özkural)
News
RAISE status report April-June 2018 (Veerle)