
ShardPhoenix comments on AlphaGo versus Lee Sedol - Less Wrong Discussion

17 Post author: gjm 09 March 2016 12:22PM


Comments (183)


Comment author: Houshalter 09 March 2016 10:30:19PM 4 points [-]

Sure, you can model music composition as an RL task. The AI composes a song, then predicts how much a human will like it. It then tries to produce songs that are more and more likely to be liked.

Another interesting thing that AlphaGo did was start by predicting what moves a human would make, then switch to reinforcement learning. So for a music AI, you would start with one that can predict the next note in a song. Then you switch to RL and adjust its predictions so that it is more likely to produce songs humans like, and less likely to produce ones we don't like.
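The pretrain-then-fine-tune loop described here could be sketched roughly as follows. This is a toy model in pure Python, not anything resembling AlphaGo's actual networks: the corpus, note names, and reward values are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: songs as sequences of note names.
corpus = [
    ["C", "E", "G", "E", "C"],
    ["C", "G", "E", "G", "C"],
]

# Phase 1: supervised pre-training -- count next-note frequencies,
# analogous to a policy network imitating human moves/notes.
counts = defaultdict(lambda: defaultdict(float))
for song in corpus:
    for prev, nxt in zip(song, song[1:]):
        counts[prev][nxt] += 1.0

def policy(prev):
    """Probability distribution over the next note given the previous one."""
    opts = counts[prev]
    total = sum(opts.values())
    return {note: w / total for note, w in opts.items()}

# Phase 2: crude REINFORCE-style update -- nudge weights toward
# transitions that appeared in a song that got a positive reward.
def reinforce(song, reward, lr=0.5):
    for prev, nxt in zip(song, song[1:]):
        counts[prev][nxt] += lr * reward

before = policy("C")["E"]
reinforce(["C", "E", "G"], reward=1.0)  # pretend a human liked this fragment
after = policy("C")["E"]
assert after > before  # the rewarded transition C->E became more probable
```

After pretraining, C is followed by E or G with equal probability; one positive reward on a fragment containing C→E tilts the policy toward E, which is the whole pretrain-then-RL idea in miniature.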

However, automated composition is something that a lot of people have experimented with before, and so far nothing works really well.

Comment author: ShardPhoenix 10 March 2016 12:15:16AM 6 points [-]

One difference is that you can't get feedback as fast when the signal is human judgement rather than win/lose in a game (where AlphaGo can play millions of games against itself).

Comment author: Houshalter 10 March 2016 04:52:16AM 3 points [-]

Yes, it would require a lot of human input.

However, the AI could learn to predict what humans like, and then use that prediction as its judge, trying to produce songs that it predicts humans will like. Then when it tests those songs on actual humans, it can see if its predictions were right and improve them.

This is also a domain with vast amounts of unsupervised data available. We've created millions of songs, which it can learn from. Out of the space of all possible sounds, we've decided that this tiny subset is pleasing to listen to. There's a lot of information in that.
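The "learned judge" idea could be sketched as a small preference model trained on human like/dislike labels and then used as a reward signal. Everything here is a made-up toy: songs are lists of MIDI note numbers, the single feature (how stepwise the melody is) and the labels are purely illustrative.

```python
import math

def features(song):
    # Crude hypothetical feature: fraction of stepwise moves (<= 2 semitones).
    steps = [abs(a - b) for a, b in zip(song, song[1:])]
    return [1.0, sum(1 for s in steps if s <= 2) / len(steps)]

def predict(w, song):
    """Estimated P(a human likes this song), via logistic regression."""
    z = sum(wi * xi for wi, xi in zip(w, features(song)))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for song, liked in data:
            p = predict(w, song)
            for i, xi in enumerate(features(song)):
                w[i] += lr * (liked - p) * xi  # logistic gradient step
    return w

# Invented labels: suppose humans liked the smooth melody, not the jumpy one.
labeled = [
    ([60, 62, 64, 62, 60], 1),
    ([60, 72, 55, 71, 58], 0),
]
w = train(labeled)
reward = predict(w, [60, 61, 63, 64, 62])  # score a new candidate song
```

The composer would optimize against `reward`, and each round of real human feedback becomes new rows in `labeled` to retrain the judge, exactly the predict-then-verify loop described above.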

Comment author: gwern 10 March 2016 12:44:36AM *  3 points [-]

You can get fast feedback by reusing existing databases if your RL agent can do off-policy learning. (You can consider this what the supervised pre-training phase is 'really' doing.) Your agent doesn't have to take an action itself before it can learn from it. Consider experience replay buffers: you could imagine a song-writing RL agent with a huge experience replay buffer made just of fragments of songs you grabbed online (say, from the Touhou megatorrent with its 50k tracks).
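A minimal sketch of that replay-buffer idea, with invented data: the buffer is seeded entirely with transitions harvested from pre-existing songs (MIDI note numbers here), and a tabular value estimate is learned off-policy from samples the agent never generated itself. Assigning reward 1.0 to every human-written transition is a crude stand-in for a real reward signal, and there is no bootstrapping, so this is closer to a bandit-style update than full Q-learning.

```python
import random
from collections import deque

# Replay buffer pre-filled from existing songs, not from agent actions.
buffer = deque(maxlen=10000)
existing_songs = [[60, 62, 64, 65, 67], [67, 65, 64, 62, 60]]
for song in existing_songs:
    for prev, nxt in zip(song, song[1:]):
        buffer.append((prev, nxt, 1.0))  # hypothetical reward for human-written transitions

# Tabular value table over (state=current note, action=next note).
Q = {}
def update(state, action, reward, alpha=0.1):
    key = (state, action)
    Q[key] = Q.get(key, 0.0) + alpha * (reward - Q.get(key, 0.0))

# Off-policy learning: sample stored transitions instead of acting.
random.seed(0)
for _ in range(1000):
    s, a, r = random.choice(buffer)
    update(s, a, r)
```

After sampling, transitions that humans actually wrote (e.g. 60→62) carry values near 1.0, so the agent has learned something about human songs before ever composing one.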