
Very impressive, I'm happy that Paul ended up there! There's still a lot of neural network black magic though. Stuff like this:

We use standard settings for the hyperparameters: an entropy bonus of β = 0.01, learning rate of 0.0007 decayed linearly to reach zero after 80 million timesteps (although runs were actually trained for only 50 million timesteps), n = 5 steps per update, N = 16 parallel workers, discount rate γ = 0.99, and policy gradient using RMSProp with α = 0.99 and ε = 10⁻⁵.

For the reward predictor, we use 84x84 images as inputs (the same as the inputs to the policy), and stack 4 frames for a total 84x84x4 input tensor. This input is fed through 4 convolutional layers of size 7x7, 5x5, 3x3, and 3x3 with strides 3, 2, 1, 1, each having 16 filters, with leaky ReLU nonlinearities (α = 0.01). This is followed by a fully connected layer of size 64 and then a scalar output. All convolutional layers use batch norm and dropout with α = 0.5 to prevent predictor overfitting.
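(As a concrete reading of that quoted description, here is a minimal PyTorch sketch of the reward predictor, not the authors' actual code; the ordering of batch norm, dropout, and the leaky ReLU inside each conv block, and the nonlinearity after the fully connected layer, are my guesses since the quote doesn't pin them down.)

```python
import torch
import torch.nn as nn

class RewardPredictor(nn.Module):
    """84x84x4 stacked frames -> scalar reward estimate (sketch only)."""
    def __init__(self, dropout_p=0.5):
        super().__init__()
        layers, in_ch = [], 4  # 4 stacked 84x84 frames
        # 4 conv layers: kernels 7, 5, 3, 3 with strides 3, 2, 1, 1, 16 filters each,
        # batch norm and dropout on every conv layer, leaky ReLU (alpha = 0.01)
        for kernel, stride in [(7, 3), (5, 2), (3, 1), (3, 1)]:
            layers += [
                nn.Conv2d(in_ch, 16, kernel_size=kernel, stride=stride),
                nn.BatchNorm2d(16),
                nn.Dropout2d(dropout_p),
                nn.LeakyReLU(0.01),
            ]
            in_ch = 16
        self.convs = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(64),   # fully connected layer of size 64
            nn.LeakyReLU(0.01),  # assumed; the quote doesn't specify this nonlinearity
            nn.Linear(64, 1),    # scalar reward output
        )

    def forward(self, frames):  # frames: (batch, 4, 84, 84)
        return self.head(self.convs(frames))

# quick shape check on a dummy batch
print(RewardPredictor()(torch.zeros(2, 4, 84, 84)).shape)  # torch.Size([2, 1])
```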

I know I sound like a retrograde, but how much of that is necessary and how much can be figured out from first principles?

If we view the goal as transforming (AI that works) ---> (AI that works and does what we want), then the black magic doesn't seem like a big deal. You just copy it from the (AI that works).

In this case, I also think that using almost any reasonable architecture would work fine.

That might be true. I don't know enough about all possible architectures, though.

To clarify and elaborate a bit on Paul’s point, our explicit methodology was to take a typical reinforcement learning system, with standard architecture and hyperparameter choices, and add in feedback mostly without changing the hyperparameters/architecture. There were a couple exceptions — an agent that’s learning a reward function needs more incentive to explore than an agent with a fixed reward function, so we had to increase the exploration bonus, and also there are a few parameters specific to the reward predictor itself that we had to choose. However, we did our best to show the consequences of changing some of those parameters (that’s the ablation analysis section).

To put it another way, our method was to take the existing black magic and show that we could build in something that does what we want (in this admittedly very limited case) without much further black magic or additional complication. As a general matter I do think it is desirable (including for safety reasons) to simplify the design of systems, but as Paul says, it’s not necessarily essential. In my view one promising route for simplification is turning fixed hyperparameters into adaptive ones that are responsive to data — consider the optimization method Adam or batch normalization.

Thank you Dario! All good points, I didn't wish to detract from your work, it's the most hopeful thing I've seen about AI progress in years. Maybe one reason for my comment is that I've worked on "neat" decision theory math, and now you have this promising new idea using math that feels stubbornly alien to me, so I can't jump into helping you guys save the world :-)

I have a hunch that semi-neat approaches to AI may come back as a layer on top of neural nets -- consider the work on using neural net heuristics to decide the next step in theorem-proving (https://arxiv.org/abs/1606.04442). In such a system the decision process is opaque, but the result is fully verifiable, at least in the world of math (in a powerful system the theorems might ultimately be proved for use in some fuzzy interface with reality). The extent to which future systems might look like this, or what that means for safety, isn't very clear yet (at least not to me), but it's another paradigm to consider.
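(To make that paradigm concrete, here's a toy sketch, purely hypothetical and not the system from the linked paper: an opaque learned heuristic chooses which proof step to try next, but every accepted step has to pass a trusted symbolic checker, so the resulting proof is verifiable even though the search policy isn't interpretable.)

```python
def prove(goal, candidate_steps, score_step, check_step, max_steps=1000):
    """Toy 'opaque heuristic, verifiable result' proof search.

    score_step stands in for a black-box neural heuristic; check_step stands in
    for a trusted symbolic proof checker. Both are hypothetical placeholders.
    """
    state, proof = goal, []
    for _ in range(max_steps):
        if state is None:  # nothing left to prove: every accepted step was checked
            return proof
        # Opaque part: the learned heuristic ranks candidate next steps.
        ranked = sorted(candidate_steps(state),
                        key=lambda s: score_step(state, s), reverse=True)
        for step in ranked:
            ok, new_state = check_step(state, step)  # trusted, fully verifiable
            if ok:
                proof.append(step)
                state = new_state
                break
        else:
            return None  # heuristic ran out of valid moves
    return None
```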

Seems like neural network anxiety misses the point of this paper: that the artificial intelligence algorithms that actually work can in fact be steered in directions that have a shot at making them safe.

Yeah, I agree with that point and it's very exciting :-)

I know I sound like a retrograde, but how much of that is necessary and how much can be figured out from first principles?

My 2c: some of the hyperparameters can only be determined empirically in current practice, and they make all the difference (e.g. the learning rate).

Other parameters are just "things that happened to work, many other things could have" (like 84x84, or the convolution sizes) and are not actually that important.

I keep saying that AI may need a human 'caregiver,' and this post is something like what I meant. While I'm not sure I explained it clearly enough, or whether this is really what it will amount to in the end, I believe we could have learned about this approach by listening more closely to social scientists (pedagogues in this case).

Re: title, probably worth pointing out that DeepMind was also involved in this paper.

Fair. The title was getting longish, but I at least added Shane. Great work you guys!