Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures - you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.
What's the argument here? Do you think that the AGZ policy (which is extremely good at Go or Chess even without any tree search) doesn't do any causal reasoning? That it only ever learns to play parts of the game tree it's seen during training? What does "genuine causal reasoning" even mean?
It looks to me like causal reasoning is just another type of computation, and that you could eventually find that computation by local search. If you need to use RL to guide that search then it's going to take a long time---AlphaStar was very expensive, and still only trained a policy with ~80M parameters.
From my perspective it seems like the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion parameters), and whether you can train it by copying rather than needing to use RL.
To copy myself in another thread, AlphaZero did some (pruned) game tree exploration in a hardcoded way that allowed the NN to focus on the evaluation of how good a given position was; this allowed it to kind of be a "best of both worlds" between previous algorithms like Stockfish and a pure deep reinforcement learner.
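To make the "best of both worlds" point concrete, here is a minimal sketch of the AlphaZero-style move-selection rule, where the network's policy prior and the search's own statistics jointly decide which branch to expand next. The class name, the constant, and the exact formula are illustrative stand-ins (a PUCT-style rule), not DeepMind's code:

```python
import math

class Node:
    """One branch of the (pruned) search tree."""
    def __init__(self, prior):
        self.prior = prior        # P(s, a): how promising the policy network thinks this move is
        self.visits = 0           # N(s, a): how often the search has explored it
        self.value_sum = 0.0      # sum of evaluations backed up through this branch

    def q(self):
        # Average evaluation (from the value network, refined by search)
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(children, c_puct=1.5):
    """PUCT-style choice: trust the network's prior on barely-visited moves,
    trust accumulated search statistics as visit counts grow."""
    total_visits = sum(child.visits for child in children.values())
    def score(item):
        _move, child = item
        exploration = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return child.q() + exploration
    return max(children.items(), key=score)[0]
```

During training, the visit counts this search produces become the supervised targets for the policy network, which is how the network alone (with no search at test time) can end up playing at a professional level.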
Re: your middle paragraph, I agree that you're right about an RL agent doing metalearning, though we also agree that with current architectures it would take a prohibitive amount of computation to get anything like a competent general causal reasoner that way.
I'm not going to go up against your intuitions on imitation learning etc; I'm just surprised if you don't expect there's a necessary architectural advance needed to make anything like general causal reasoning emerge in practice from some combination of imitation learning and RL.
I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from "if I see X, I do Y"?
(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search; tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without using tree search ever, probably using at most 100x the compute of AlphaZero, and I'd be willing to bet on <10x.)
From the outside, it looks like human-level play at Starcraft is more complicated (in a sense) than human-level play at Go, and so it's going to take bigger models in order to reach a similar level of performance. I don't see a plausible-looking distinction-in-principle that separates the strategy in Starcraft from strategy in Go.
IIUC the distinction being made is about the training data, granted the assumption that you may be able to distill tree-search-like abilities into a standard NN with supervised learning if you have samples from tree search available as supervision targets in the first place.
AGZ was hooked up to a tree search in its training procedure, so its training signal allowed it to learn not just from the game trees it "really experienced" during self-play episodes but also (in a less direct way) from the much larger pool of game trees it "imagined" while searching for its next move during those same episodes. The former is always (definitionally) available in self-play, but the latter is only available if tree search is feasible.
But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it's not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).
I don't see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. "cheap win" strategies that are easily defeated by serious players and hence never come up in serious play.
AGZ is only trained on the situations that actually arise in games it plays.
I agree with the point that "imitation learning from human games" will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).
(Though I don't see this as affecting the basic point.)
Just a layman here, and not sure if this is what this particular disagreement is about, but one impression I've gotten from AlphaGo Zero and GPT-2 is that while there are definitely more architectural advances to be made, they may be more of the sort "make better use of computation, generally" than anything specific to strategy/decision-making problems in particular. (And I get the impression that at least some people saying that there are further breakthroughs needed are thinking of something 'more specific to general intelligence'.)
New paper relevant to this discussion: https://arxiv.org/abs/1911.08265
Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
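For anyone who wants the abstract's "learned model, applied iteratively" claim in concrete form, here is a minimal sketch of the planning loop it describes. The three callables (`repr_net`, `dynamics_net`, `prediction_net`) are hypothetical stand-ins for MuZero's learned representation, dynamics, and prediction functions; the real algorithm runs a tree search over these unrolls rather than scoring a fixed list of action sequences, so treat this as an illustration of the idea, not the published method:

```python
def plan(observation, candidate_action_sequences,
         repr_net, dynamics_net, prediction_net, discount=0.99):
    """Score imagined futures entirely inside the learned model:
    no game rules or external simulator are consulted."""
    best_actions, best_return = None, float("-inf")
    for actions in candidate_action_sequences:
        state = repr_net(observation)                    # hidden state from the raw observation
        total, weight = 0.0, 1.0
        for action in actions:
            reward, state = dynamics_net(state, action)  # predicted reward + next hidden state
            total += weight * reward
            weight *= discount
        _policy, value = prediction_net(state)           # predicted value of the final imagined state
        total += weight * value
        if total > best_return:
            best_actions, best_return = actions, total
    return best_actions
```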
the big questions are just how large a policy you would need to train using existing methods in order to be competitive with a human (my best guess would be a ~trillion to a ~quadrillion parameters)
Curious where this estimate comes from?
Update: with the very newest version of AlphaStar, DeepMind won a series of show matches against Serral (the 2018 world champion, who plays Zerg), with 4 wins and 1 loss. The resulting algorithm is impressively polished at early- and mid-game economy and battles, enough so to take down top players, but my original assessment of it still looks good to me.
In particular, AlphaStar still had serious problems with building placement; moreover, it still showed signs of not having mastered scouting, reactive defense, and late-game strategy (especially for Zerg), and thus failed to respond adequately to a player capable of going that far down the game tree.
The game it lost was the one it played as Terran. While it played better than it had in the summer matches, it still failed to wall off its bases, and more crucially it only built units that would help crush an early Zerg attack, not units that would beat a later Zerg army. Even when it started losing armies to Serral's powerful late-game units, it continued building the same units until it lost completely. This looks, again, like the AlphaStar Zerg agents never figured out late-game strategies, so the Terran never had to learn to counter them.
AlphaStar played 3 of the 5 games as Protoss, the race it learned most effectively, as seen in the summer matches. (I'm pretty sure it was intentional on DeepMind's part to have it play most of the games as its preferred race.) These games were won with fantastic economic production and impeccable unit control (especially with multiple simultaneous Disruptor attacks, which are incredibly difficult for humans to control perfectly in the heat of battle), which overcame noticeable flaws: leaving holes between the buildings so that Zerg units could come in and kill workers, and failing to build the right units against Serral's army (and thereby losing one army entirely before barely winning with one final push).
It's hard to learn much from the one game AlphaStar played as Zerg, since there it went for a very polished early attack that narrowly succeeded; it looked to me as if Serral got cocky after seeing the attack coming, and he could have defended easily from that position had he played it safer.
In summary, my claim that DeepMind was throwing in the towel was wrong; they came back with a more polished version that was able to beat the world champion 4 out of 5 times (though 2 of those victories were very narrow, and the loss was not). But the other analyses I made in the post are claims I still stand behind when applied to this version: a major advance for reinforcement learning, but still clearly lacking any real advance in causal reasoning.
In summary, my claim that DeepMind was throwing in the towel was wrong; they came back with a more polished version that was able to beat the world champion 4 out of 5 times
This statement, while technically correct, seems a bit misleading, because being the world champion in Starcraft 2 really doesn't correlate well with being proficient at playing against AI. Check out this streamer who played against AlphaStar at the same event, without warm-up or his own setup (just like Serral), and went 7-3. What's more, he'd pretty much got AlphaStar figured out by the end, and I'm fairly confident that if he was paid to play another 100 games, his win rate would be 90%+.
The impressive part is getting reinforcement learning to work at all in such a vast state space
It seems to me that that is AGI progress? The real world has an even vaster state space, after all. Getting things to work in vast state spaces is a necessary pre-condition to AGI.
I think DeepMind should be applauded for addressing the criticisms about AlphaStar's mechanical advantage in the show games against TLO/MaNa. While not as dominant in its performance, the constraints on the new version basically match human limitations in all aspects; the games seemed very fair.
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!
It seems like AlphaStar played 90 ladder matches as Terran.
This sounds like the kind of mistake that the SL policy would definitely make (no reason it should be able to recover), whereas it's not clear whether RL would learn how to recover (I would expect it to, but not too strongly).
If it's easy for anyone to check and they care, it might be worth looking quickly through the replays and seeing whether this particular game was from the SL or RL policies. This is something I've been curious about since seeing the behavior posted on Reddit, and it would have a moderate impact on my understanding of AlphaStar's skill.
It looks like they released 90 replays and played 90 ladder games so it should be possible to check.
The replays are here, hosted on the DM site, sorted into three folders based on the policy. If it's one of the SL matches, it's either AlphaStarSupervised_013_TvT.SC2Replay, or one of _017_, _019_, or _022_ (based on being TvT and being on Kairos Junction). The video in question is here. I'd check if I had SC2 installed.
(Of course better still would be to find a discussion of the 30 RL replays, from someone who understands the game. Maybe that's been posted somewhere, I haven't looked and it's hard to know who to trust.)
The replay for the match in that video is AlphaStarMid_042_TvT.SC2Replay, so it's from the middle of training.
Here is the relevant screen capture: https://i.imgur.com/POFhzfj.png
Thanks! That's only marginally less surprising than the final RL policy, and I suspect the final RL policy will make the same kind of mistake. Seems like the OP's example was legit and I overestimated the RL agent.
I'm not sure how surprised to be about middle of training, versus final RL policy. Are you saying that this sort of mistake should be learned quickly in RL?
I don't have a big difference in my model of mid vs. final; they have very similar MMR, the difference between them is pretty small in the scheme of things (e.g. probably smaller than the impact of doubling model size), and my picture isn't refined enough to appreciate those differences. For any particular dumb mistake I'd be surprised if the line between not making it and making it was in that particular doubling.
Do I understand it correctly that in Chess and Go, DeepMind's AI seems capable of strategic thinking in a way it isn't in StarCraft II? If yes, then how would Chess/Go need to be changed to generate the same problem?
Is it just a quantitative thing, e.g. you would make a Chess-like game played on a 1000×1000 board with thousands of units, and the AI would become unable to find strategies such as "spend a few hundred turns preparing the hundreds of your bishops into this special configuration where each of them is protected by hundred others, and then attack the enemy king"?
Or would you rather need rules like "if you succeed in building a picture of Santa Claus from your Go stones, for the remainder of the game you get an extra move every 10 turns"? Something that cannot be done halfway, because doing it halfway would only have costs and no benefit, so you can't discover it by observing your imaginary opponents, because you imagine your opponents as doing reasonable things (i.e. what you would have done) with some noise introduced for the purpose of evolution.
Arimaa is the(?) classic example of a chess-like board game that was designed to be hard for AI (albeit from an age before "AI" mostly meant ML).
From David Wu's paper on the bot that finally beat top humans in 2015:
Why is Arimaa computer-resistant? We can identify two major obstacles.
The first is that in Arimaa, the per-turn branching factor is extremely large due to the combinatorial possibilities produced by having four steps per turn. Even after identifying equivalent permutations of steps as the same move, on average there are about 17000 legal moves per turn (Haskin, 2006). This is a serious impediment to search.
Obviously, a high branching factor alone doesn’t imply computer-resistance, particularly if the standard of comparison is with human play: high branching factors affect humans as well. However, Arimaa has a property common to many computer-resistant games: that “per amount of branching” the board changes slowly. Indeed, pieces move only one orthogonal step at a time. This makes it possible to effectively plan ahead, cache evaluations of local positions, and visualize patterns of good moves, all things that usually favor human players.
The second obstacle is that Arimaa is frequently quite positional or strategic, as opposed to tactical. Capturing or trading pieces is somewhat more difficult in Arimaa than in, for example, Chess. Moreover, since the elephant cannot be pushed or pulled and can defend any trap, deadlocks between defending elephants are common, giving rise to positions sparse in easy tactical landmarks. Progress in such positions requires good long-term judgement and strategic understanding to guide the gradual maneuvering of pieces, posing a challenge for positional evaluation.
AlphaZero did some (pruned) game tree exploration in a hardcoded way that allowed the NN to focus on the evaluation of how good a given position was; this allowed it to kind of be a "best of both worlds" between previous algorithms like Stockfish and a pure deep reinforcement learner.
This is impossible for a game with an action space as large as StarCraft II's, though; and to generate the same problem by modifying a game like Go, you would have to change it into something completely different.
I'm not 100% sure about the example you raise, but it seems to me it's either going to have a decently prune-able game tree, or that humans won't be capable of playing the game at a very sophisticated level, so I'd expect AlphaZero-esque things to get superhuman at it. StarCraft is easier for humans relative to AIs because we naturally chunk concepts together (visually and strategically) that are tricky for the AI to learn.
Pruning the game tree, or doing MC tree search, is impossible in StarCraft, not because of the size of the action space but because the game has incomplete information. At least in the standard form of those algorithms.
Well, it's overdetermined. Action space, tree depth, incomplete information; any one of these is enough to make MC tree search impossible.
it doesn't do very well at building up a reactive decision tree of strategies (if I scout this, I do that)
In StarCraft terminology, this seems to be completely accurate. But non-StarCraft players may get the wrong impression that AlphaStar doesn't scout at all.
AlphaStar does scout, in the sense of seeing whether an attack is coming, and then moving into position to defend against it. It does check to see which units are coming, and then sends the appropriate defense against them. But this is not what the term "scout" is usually used for in StarCraft.
In StarCraft, "scouting" almost always refers to scouting the opponent's buildings, not units. A good player scouts to see whether this or that building exists, and then reacts by doing something that would counter that strategy. So far as I can see, AlphaStar does not do this. (In an hour, Oriol Vinyals and Dario Wünsch will be guests on the Pylon Show, where we may get more details on where AlphaStar is underperforming.) But I do think it is important to clarify that AlphaStar does seem to scout units in the replays I've seen.
After watching Oriol & Dario on the Pylon Show, I now feel that my previous statement was a bit too strong. It's true that AlphaStar doesn't scout as much, but it certainly does seem to scout in the early game. They showed an excellent example of AlphaStar specifically checking for a lair, examples where it attempted to hide a dark shrine (unsuccessfully), and they spoke about ways in which it would explicitly scout for various building types. They did follow this up by saying that AlphaStar is definitely not as good at scouting as it is at other things, and that its deficiencies at scouting caused it to underperform at other strategies. But my earlier claim of it not scouting buildings at all is certainly incorrect. It scouts, but not well, at least not in comparison to the other things it does at the pro level.
I think the AI just isn't sure what to do with the information received from scouting. Unlike AlphaZero, AlphaStar doesn't learn via self-play from scratch. It has to learn builds from human players, hinting at its inability to come up with good builds on its own, so it seems likely that AlphaStar also doesn't know how to alter its build depending on scouted enemy buildings.
One thing I have noticed, though, from observing these games is that AlphaStar likes to over-produce probes/drones, as if preempting early-game raids from the enemy. It seems to work out quite well for AlphaStar, which is then able to mine at full capacity afterwards. Is there a good reason why pro gamers don't do this?
It's been two years since AlphaStar used anti-harassment worker oversaturation to great effect, and, as far as I'm aware, not a single progamer has tried it in a GSL match since then. This morning the last GSL match of 2020 marks the 1003rd GSL game since AlphaStar's demonstration, and (to my knowledge) the only games where workers were overmade were when there was a significantly different reason for doing so (being contained, maynarding workers, executing a timing attack). The consensus among SC2 pros appears to be that AlphaStar was wrong to overmake workers as an anti-harassment measure, despite significant community discussion in early 2019 about its potential.
I stand by this piece, and I now think it makes a nice complement to discussions of GPT-3. In both cases, we have significant improvements in chunking of concepts into latent spaces, but we don't appear to have anything like a causal model in either. And I've believed for several years that causal reasoning is the thing that puts us in the endgame.
(That's not to say either system would still be safe if scaled up massively; mesa-optimization would be a reason to worry.)
That being said, I'm not very confident this piece (or any piece on the current state of AI) will still be timely a year from now, so maybe I shouldn't recommend it for inclusion after all.
I had originally been quite impressed with AlphaStar, but this post showed me its actual limitations. It also gave me a good concrete example about what a lack of causal reasoning means in practice, and afterwards when I've seen posts about impressive-looking AI systems, I've asked myself "does this system fail to exhibit causal reasoning in the same way that AlphaStar did?".
I watched all of the Grandmaster-level games. When playing against grandmasters, the average win rate of AlphaStar across all three races was 55.25%.
Detailed match by match scoring
While I don't think that it is truly "superhuman", it is definitely competitive against top players.
Curated. This post was considered for curation when it first came out (many months ago), but it fell through the cracks for various reasons. Kaj Sotala was interested in curating it now, in part to compare/contrast it with various discussions of GPT-3.
Due to some time-zone issues, I'm curating it now, and Kaj will respond with more thoughts when he gets a chance.
I have been wanting to curate this for a long time. As AlphaStar seemed really powerful at the time, it was useful to read an analysis of where it goes wrong: I felt that the building placement was an excellent concrete example of what a lack of causal reasoning really means. Not only is it useful for thinking about AlphaStar, but the same weaknesses apply to GPT, which we have been discussing a lot now: it only takes a bit of playing around with, say, AI Dungeon before this becomes very obvious.
I think that DeepMind realized they'd need another breakthrough to do what they did to Go, and decided to throw in the towel while making it look like they were claiming victory.
Do we know that DeepMind is giving up on StarCraft now? I'd been assuming that this was a similar kind of intermediate result as the MaNa/TLO matches, and that they would carry on with development.
DeepMind says it hopes the techniques used to develop AlphaStar will ultimately help it "advance our research in real-world domains".
But Prof Silver said the lab "may rest at this point", rather than try to get AlphaStar to the level of the very elite players.
Nicholas's summary, which I'm copying over on his behalf:
This post argues that while it is impressive that AlphaStar can build up concepts complex enough to win at StarCraft, it is not actually developing reactive strategies. Rather than scouting what the opponent is doing and developing a new strategy based on that, AlphaStar just executes one of a predetermined set of strategies. This is because AlphaStar does not use causal reasoning, and that keeps it from beating any of the top players.
Nicholas's opinion:
While I haven't watched enough of the games to have a strong opinion on whether AlphaStar is empirically reacting to its opponents' strategies, I agree with Paul Christiano's comment that in principle causal reasoning is just one type of computation that should be learnable.
This discussion also highlights the need for interpretability tools for deep RL so that we can have more informed discussions on exactly how and why strategies are decided on.
I think that DeepMind realized they'd need another breakthrough to do what they did to Go, and decided to throw in the towel while making it look like they were claiming victory.
Is this mildly good news on the existential risk front (because the state of the field isn't actually as advanced as it looks), or extremely bad news (because we live in a world of all-pervasive information warfare where no one can figure out what's going on because even the reports of people whose job it is to understand what's going on are distorted by marketing pressures)?
In what sense is this information warfare or even misleading? The second sentence of the blog post says: "AlphaStar was ranked above 99.8% of active players," which seems quite clear. They seem to have done a pretty good job of making that comparison as fair as you could expect. What do they say or even imply which is highly misleading?
Perhaps they say "Grandmaster level," and it's possible that this gives a misleading impression to people who don't know what that term means in Starcraft? Though I think chess grandmaster also means roughly "better than 99.8% of ladder players," and the competitive player pools have similar size. So while it might be misleading in the sense that Chess has a larger player pool a smaller fraction of whom are competitive, it seems fairly straightforward.
Sorry, let me clarify: I was specifically reacting to the OP's characterization of "throw in the towel while making it look like they were claiming victory." Now, if that characterization is wrong, then my comment becomes either irrelevant (if you construe it as a conditional whose antecedent turned out to be false: "If DeepMind decided to throw in the towel while making it look like ..., then is that good news or bad news") or itself misleading (if you construe it as me affirming and propagating the misapprehension that DeepMind is propagating misapprehensions—and if you think I'm guilty of that, then you should probably downvote me and the OP so that the Less Wrong karma system isn't complicit with the propagation of misapprehensions).
I agree that the "Grandmaster-level"/"ranked above 99.8% of active players" claims are accurate. But I also think it's desirable for intellectuals to aspire to high standards of intent to inform, for which accuracy of claims is necessary but not sufficient, due to the perils of selective reporting.
Imagine that, if you spoke to the researchers in confidence (or after getting them drunk), they would agree with the OP's commentary that "AlphaStar doesn't really do the 'strategy' part of real-time strategy [...] because there's no representation of causal thinking." (This is a hypothetical situation to illustrate the thing I'm trying to say about transparency norms; maybe there is some crushing counterargument to the OP that I'm not aware of because I'm not a specialist in this area.) If that were the case, why not put that in the blog post in similarly blunt language, if it's information that readers would consider relevant? If the answer to that question is, "That would be contrary to the incentives; why would anyone 'diss' their own research like that?" ... well, the background situation that makes that reply seem normative is what I'm trying to point at with the "information warfare" metaphor: it's harder to figure out what's going on with AI in a world in which the relevant actors are rewarded and selected for reporting impressive-seeming capability results subject to the constraint of not making any false statements, than a world in which actors are directly optimizing for making people more informed about what's going on with AI.
For me, it is evidence for AGI, as it says that we are only one step, maybe even one idea, behind it: we need to solve "genuine causal reasoning". Something like "train a neural net to recognise patterns in an AI's plans, corresponding to some strategic principles".
My model of the world doesn't find this kind of thing very surprising, due to previous reports like this and this, and just on theoretical grounds. I do wonder if this causes anyone who is more optimistic about x-risk to update though.
On the other hand, the information warfare seems to be pitched at a level below what people like us can ultimately rise above. So for example the misleading APM comparison was quickly detected (and probably wasn't aimed at people like us in the first place) and this analysis of AlphaStar eventually came out (and many of us probably already had similar but unarticulated suspicions). So maybe that's a silver lining, depending on how you expect the world to be "saved"?
Could you elaborate why it's "extremely bad" news? In what sense is it "better" for DeepMind to be more straightforward with their reporting?
Thanks for asking. The reason artificial general intelligence is an existential risk is because agentic systems that construct predictive models of their environment can use those models to compute what actions will best achieve their goals (and most possible goals kill everyone when optimized hard enough because people are made of atoms that can be used for other things).
The "compute what actions will best achieve goals" trick doesn't work when the models aren't accurate! This continues to be the case when the agentic system is made out of humans. So if our scientific institutions systematically produce less-than-optimally-informative output due to misaligned incentives, that's a factor that makes the "human civilization" AI dumber, and therefore less good at not accidentally killing itself.
I see. In that case, I don't think it makes much sense to model scientific institutions or the human civilization as an agent. You can't hope to achieve unanimity in a world as big as ours.
I mean, yes, but we still usually want to talk about collections of humans (like a "corporation" or "the economy") producing highly optimized outputs, like pencils, even if no one human knows everything that must be known to make a pencil. If someone publishes bad science about the chemistry of graphite, which results in the people in charge of designing a pencil manufacturing line making a decision based on false beliefs about the chemistry of graphite, that makes the pencils worse, even if the humans never achieve unanimity and you don't want to use the language of "agency" to talk about this process.
Would you consider MuZero an advance in causal reasoning? Despite intentionally not representing causality / explicit model dynamics, it supports hypothetical reasoning via state tree search.
Do you think there's a chance of MuZero - AlphaStar crossover?
In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!
I feel confused about how a system that can't figure out stuff like this is able to defeat strong players. (I don't know very much about StarCraft.)
Help build my intuition here?
I know more about StarCraft than I do about AI, so I could be off base, but here's my best attempt at an explanation:
As a human, you can understand that a factory gets in the way of a unit, and if you lift it, it will no longer be in the way. The AI doesn't understand this. The AI learns by playing through scenarios millions of times and learning that on average, in scenarios like this one, it gets an advantage when it performs this action. The AI has a much easier time learning something like "I should make a marine" (which it perceives as a single action) than "I should place my buildings such that all my units can get out of my base", which requires making a series of correct choices about where to place buildings when the conceivable space of building placement has thousands of options.
You could see this more broadly in the Terran AI, where it knows the general concept of putting buildings in front of its base (which it probably learned via imitation learning from watching human games), but it doesn't actually understand why it should be doing that, so it does a bad job. For example, in this game, you can see that the AI has learned:
1. I should build supply depots in front of my base.
2. If I get attacked, I should raise the supply depots.
But it doesn't actually understand the reasoning behind these two things, which is that raising the supply depots is supposed to prevent the enemy units from running into your base. So this results in a comical situation where the AI doesn't actually have a proper wall, allowing the enemy units to run in, and then it raises the supply depots after they've already run in. In short, it learns what actions are correlated with winning games, but it doesn't know why, so it doesn't always use these actions in the right ways.
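A toy illustration of that last point, in Python. This is entirely hypothetical and says nothing about AlphaStar's internals; it just contrasts a rule learned as a correlation ("attacked, therefore raise depots") with one that checks whether the action still achieves its purpose:

```python
def correlational_policy(being_attacked):
    # Learned association only: the trigger fires regardless of whether
    # raising the wall can still keep anyone out.
    return "raise_depots" if being_attacked else "do_nothing"

def causal_policy(being_attacked, depots_form_complete_wall, enemies_already_inside):
    # Checks the consequence: raising depots only helps if the wall is
    # actually complete and the attackers haven't already run past it.
    if being_attacked and depots_form_complete_wall and not enemies_already_inside:
        return "raise_depots"
    return "defend_with_army"
```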
Why is this AI still able to beat strong players? I think the main reason is because it's so good at making the right units at the right times without missing a beat. Unlike humans, it never forgets to build units or gets distracted. Because it's so good at execution, it can afford to do dumb stuff like accidentally trapping its own units. I suspect that if you gave a pro player the chance to play against AlphaStar 100 times in a row, they would eventually figure out a way to trick the AI into making game-losing mistakes over and over. (Pro player TLO said that he practiced against AlphaStar many times while it was in development, but he didn't say much about how the games went.)
Exactly. It seems like you need something beyond present imitation learning and deep reinforcement learning to efficiently learn strategies whose individual components don't benefit you, but which have a major effect if assembled perfectly together.
(I mean, don't underestimate gradient descent with huge numbers of trials - the genetic version did evolve a complicated eye in such a way that every step was a fitness improvement; but the final model has a literal blind spot that could have been avoided if it were engineered in another way.)
Genetic algorithms also eventually evolved causal reasoning agents, us. That's why it feels weird to me that we're once again relying on gradient descent to develop AI - it seems backwards.
This, together with Rick's post on the topic, really helped me navigate the whole AlphaStar thing, and I've been coming back to it a few times to help me figure out how general current ML methods are (I think I disagree a good amount with it, but still think it makes a good number of points).
I agree completely with the sentiment in this post. While I think that AGI would be potentially dangerous, the existing progress towards it is blown completely out of proportion. The problem is that one of the things you'd need for AGI is to be able to reason about the state of the (simulated) world, which we have no clue how to do in a computer program.
I think one ought to be careful with the wording here. What is the proportion of existing AI progress? We could be 90% there on the time axis and only one last key insight is left to be discovered, but still virtually useless compared to humans on the capability axis. It would be a precarious situation. Is the inability of our algorithms to reason the problem, or our only saving grace?
DeepMind released their AlphaStar paper a few days ago, having reached Grandmaster level at the partial-information real-time strategy game StarCraft II over the summer.
This is very impressive, and yet less impressive than it sounds. I used to watch a lot of StarCraft II (I stopped interacting with Blizzard recently because of how they rolled over for China), and over the summer there were many breakdowns of AlphaStar games once players figured out how to identify the accounts.
The impressive part is getting reinforcement learning to work at all in such a vast state space - that took breakthroughs beyond what was necessary to solve Go and beat Atari games. AlphaStar had to have a rich enough set of potential concepts (in the sense that e.g. a convolutional net ends up having concepts of different textures) that it could learn a concept like "construct building P" or "attack unit Q" or "stay out of the range of unit R" rather than just "select spot S and enter key T". This is new and worth celebrating.
The overhyped part is that AlphaStar doesn't really do the "strategy" part of real-time strategy. Each race has a few solid builds that it executes at GM level, and the unit control is fantastic, but the replays don't look creative or even especially reactive to opponent strategies.
That's because there's no representation of causal thinking - "if I did X then they could do Y, so I'd better do X' instead". Instead there are many agents evolving together, and if there's an agent evolving to try Y then the agents doing X will be replaced with agents that do X'. But to explore as much as humans do of the game tree of viable strategies, this approach could take an amount of computing resources that not even today's DeepMind could afford.
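To spell out what "if I did X then they could do Y, so I'd better do X' instead" looks like as a computation, here is a minimal one-ply lookahead sketch. The names (`my_options`, `their_replies`, `evaluate`) are hypothetical stand-ins, not anything inside AlphaStar; the point is only that the counterfactual gets evaluated explicitly before committing:

```python
def best_move_with_lookahead(my_options, their_replies, evaluate):
    """Pick the option whose worst-case opponent reply is best (one-ply minimax)."""
    def worst_case(option):
        # "If I did this, what could they do back?"
        return min(evaluate(option, reply) for reply in their_replies(option))
    return max(my_options, key=worst_case)
```

The league-training alternative described above only discovers that X is dominated once some agent in the population actually starts playing Y against it, which is why exploring the tree of move and countermove that way gets so expensive.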
(This lack of causal reasoning especially shows up in building placement, where the consequences of locating any one building here or there are minor, but the consequences of your overall SimCity are major for how your units and your opponents' units would fare if they attacked you. In one comical case, AlphaStar had surrounded the units it was building with its own factories so that they couldn't get out to reach the rest of the map. Rather than lifting the buildings to let the units out, which is possible for Terran, it destroyed one building and then immediately began rebuilding it before it could move the units out!)
This means that, first, AlphaStar just doesn't have a decent response to strategies that it didn't evolve, and secondly, it doesn't do very well at building up a reactive decision tree of strategies (if I scout this, I do that). The latter kind of play is unfortunately very necessary for playing Zerg at a high level, so the internal meta has just collapsed into one where its Zerg agents predictably rush out early attacks that are easy to defend if expected. This has the flow-through effect that its Terran and Protoss are weaker against human Zerg than against other races, because they've never practiced against a solid Zerg that plays for the late game.
The end result cleaned up against weak players, performed well against good players, but practically never took a game against the top few players. I think that DeepMind realized they'd need another breakthrough to do what they did to Go, and decided to throw in the towel while making it look like they were claiming victory. (Key quote: "Prof Silver said the lab 'may rest at this point', rather than try to get AlphaStar to the level of the very elite players.")
Finally, RL practitioners have known that genuine causal reasoning could never be achieved via known RL architectures - you'd only ever get something that could execute the same policy as an agent that had reasoned that way, via a very expensive process of evolving away from dominated strategies at each step down the tree of move and countermove. It's the biggest known unknown on the way to AGI.