Alignment and Deep Learning
Between the recent AI breakthroughs and Eliezer's open admission of how bleak the chances of alignment are, everyone is speaking up and contributing what they can. It seems to me that there's a route very few people are talking about that stands a better chance of successful alignment than conventional approaches, and if there's ever a time to talk about such things, this is it.

We all know the basics of the alignment question: how can we identify and specify human values well enough to define an AI's goals, when those values are complex, fragile, and understood intuitively far more than rigorously?

Ten years ago, AI researchers were working on a goal that was complex, fragile, and almost purely intuitive, one that resisted both brute force and every attempt at clever hand-crafted strategy, to the point that many experts claimed it was literally unsolvable. I am talking, of course, about the game of go. While chess masters will sometimes talk about recognizing patterns of checkmate that can be reused from game to game[1], go is deeply dependent on intuition. Not only are there vastly more possible go games than particles in the known universe, but the game is chaotic in the sense of chaos theory: extreme sensitivity to initial conditions. While two pictures that differ by a single pixel are effectively the same image, two games differing by a single stone can have opposite outcomes. This is not a domain where one can simply run a Monte Carlo Tree Search and call it a day[2]!

No one ever made the MIRI approach work on go: explicit rules in a rigorous system encompassing exactly what we want to do on a go board[3]. And if Friendly AI and the potential fate of the human race had depended on doing so before anyone developed and deployed AGI, it's fair to say that we would be out of options. Yet by 15 March 2016, AlphaGo had defeated Lee Se-dol soundly, and it soon attained such dominance that the Korean master retired, unwilling to keep competing against an opponent that could not be beaten.
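To make footnote [2] concrete, here is a minimal sketch of vanilla MCTS (the UCT variant), using a toy game of Nim as a stand-in domain since a full go engine would run far too long; the game, class names, and parameters here are all illustrative, not anything from AlphaGo. The `rollout` step is exactly the part that breaks down on go: uniformly random playouts give almost no signal in a game this sensitive to single stones, which is why AlphaGo augmented its search with learned policy and value networks rather than relying on playouts alone.

```python
# Minimal vanilla MCTS/UCT sketch. Toy domain: Nim, where players
# alternately take 1-3 stones and taking the last stone wins.
import math
import random

class NimState:
    def __init__(self, stones=10, player=1):
        self.stones = stones
        self.player = player  # +1 or -1, the player to move

    def moves(self):
        return [n for n in (1, 2, 3) if n <= self.stones]

    def play(self, n):
        return NimState(self.stones - n, -self.player)

    def winner(self):
        # The player who just took the last stone wins.
        return -self.player if self.stones == 0 else None

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # move -> Node
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # Pick the child maximizing UCB1: exploitation + exploration bonus.
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(state):
    # Uniformly random playout to the end. This is the step that fails to
    # scale on go; AlphaGo substitutes learned policy/value networks here.
    while state.winner() is None:
        state = state.play(random.choice(state.moves()))
    return state.winner()

def mcts(root_state, iterations=2000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(node.state.moves()):
            node = uct_select(node)
        # 2. Expansion: add one untried move, if the node isn't terminal.
        if node.state.winner() is None:
            untried = [m for m in node.state.moves() if m not in node.children]
            move = random.choice(untried)
            node.children[move] = Node(node.state.play(move), parent=node)
            node = node.children[move]
        # 3. Simulation.
        result = rollout(node.state)
        # 4. Backpropagation: credit each node from the perspective of the
        #    player who made the move into it (-node.state.player).
        while node:
            node.visits += 1
            node.value += 1.0 if result == -node.state.player else 0.0
            node = node.parent
    # Return the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("Best opening take from 10 stones:", mcts(NimState(10)))
```

With enough iterations this finds the known optimal Nim move (take 2 from 10, leaving a multiple of 4), because random rollouts are informative in a tree this small. The point of the footnote is that on a 19x19 go board they are not, and raw UCT stalls out far below professional strength.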
Even if we assume that's true (it seems reasonable, though less capable AIs might blunder on this point, whether by failing to understand the need to act nice, failing to understand how to act nice, or believing themselves to be in a winning position before they actually are), what does an AI need to do to get into a winning position? And how easy is it to make those moves without them being seen as hostile?
An unfriendly AI can sit on its server saying "I love mankind and want to serve it" all day long, and unless we have solid neural net interpretability or some future equivalent, we might never know.
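To make "interpretability" slightly less abstract: one of the simplest tools in that toolbox is a linear probe, which checks whether some internal property is linearly decodable from a model's hidden activations. The sketch below is entirely synthetic and hypothetical; the activations, the "deceptive" labels, and the planted direction are all made up for illustration, and a probe passing a test like this would prove very little on its own.

```python
# Sketch of a linear probe: fit logistic regression on (synthetic) hidden
# activations to see whether a labeled internal property is decodable.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for activations from some hidden layer: 1000 samples, 64 dims,
# with a hypothetical "deceptive" direction weakly mixed into half of them.
n, d = 1000, 64
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)           # 0 = benign, 1 = deceptive
acts = rng.normal(size=(n, d)) + 0.5 * np.outer(labels, direction)

# Plain logistic regression fit by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # predicted probabilities
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * np.mean(p - labels)

accuracy = np.mean((p > 0.5) == labels)
print(f"probe accuracy on training activations: {accuracy:.2f}")
```

This only works because the example plants a clean linear signal and honest labels, which is precisely what we lack for a real model that may be misrepresenting its goals; hence "solid interpretability" is a much harder target than a probe.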