You can get fast feedback by reusing existing databases if your RL agent can do off-policy learning. (You can think of this as what the supervised pre-learning phase is 'really' doing.) Your agent doesn't have to take an action itself before it can learn from it. Consider experience replay buffers: you could imagine a song-writing RL agent with a huge experience replay buffer made up entirely of fragments of songs you grabbed online (say, from the Touhou megatorrent with its 50k tracks).
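To make the off-policy point concrete, here is a minimal sketch of tabular Q-learning that learns only from a pre-filled replay buffer and never acts in the environment itself. Everything in it is an illustrative assumption rather than anything from a real system: the 4-state chain environment, the function names (`build_logged_buffer`, `q_learning_from_buffer`, `greedy_policy`), and the hyperparameters are all hypothetical. A song-writing agent would need function approximation and a far richer state, but the update rule would have the same shape.

```python
import random

def build_logged_buffer(n_states=4):
    """Hypothetical logged data: every (state, action) transition of a tiny
    chain world. Action 0 moves left, action 1 moves right; reaching the
    last state gives reward 1 and ends the episode."""
    goal = n_states - 1
    buffer = []
    for s in range(goal):
        for a in (0, 1):
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == goal
            reward = 1.0 if done else 0.0
            buffer.append((s, a, reward, s_next, done))
    return buffer

def q_learning_from_buffer(buffer, n_states=4, n_actions=2,
                           steps=5000, alpha=0.5, gamma=0.9, seed=0):
    """Off-policy Q-learning: the agent only replays stored transitions.
    It never chooses an action, so who generated the data does not matter."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        s, a, r, s_next, done = rng.choice(buffer)
        # Bootstrap from the best next action, not the action the logger took.
        target = r if done else r + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])
    return q

def greedy_policy(q, n_states=4):
    """Best action in each non-terminal state according to the learned Q."""
    return [max(range(len(q[s])), key=lambda a: q[s][a])
            for s in range(n_states - 1)]

if __name__ == "__main__":
    q = q_learning_from_buffer(build_logged_buffer())
    print(greedy_policy(q))
```

The `max(q[s_next])` in the target is what makes this off-policy: the update asks what the best follow-up would have been, regardless of what the data's original author did next.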
There have been a couple of brief discussions of this in the Open Thread, but it seems likely to generate more, so here's a place for it.
The original paper in Nature about AlphaGo.
Google Asia Pacific blog, where results will be posted. DeepMind's YouTube channel, where the games are being live-streamed.
Discussion on Hacker News after AlphaGo's win of the first game.