
jacob_cannell comments on Concept Safety: Producing similar AI-human concept spaces - Less Wrong Discussion

31 Post author: Kaj_Sotala 14 April 2015 08:39PM




Comment author: jacob_cannell 17 April 2015 07:50:07AM *  2 points [-]

But you imposed some structure in the model, perhaps with a little box labeled "utility function." After inference, that box isn't going to have the utility function in it. Why would your universe-simulating model bother dividing itself neatly into "utility function" and "everything else"? It will just ignore your division and do whatever is most efficient.

Hmm at this point I should now actually write out a simple RL model to help me understand your critique.

Here is some very simple math for a general RL setup (bellman-style recursive function form):

model = p(s,a,s')

policy(s) = argmax_a Q(s,a)

Q(s,a) = sum_s' p(s,a,s') [ R(s') + gV(s') ]

V(s) = max_a Q(s,a)

The function p(s,a,s') is the agent's world model, which gives transition probabilities between consecutive states (s,s') on action a. The states are really observation histories - entire sequences of observations. The variable 'g' is the discount factor (although really this should probably be an unknown function). R(s') is the reward/utility function, and Q(s,a) is the action-value function that results from planning ahead to optimize R. The decision/policy function just selects the best action.
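To make this concrete, the four lines above can be run directly as value iteration on a toy MDP. A minimal sketch - all numbers here are illustrative, not from any real domain:

```python
import numpy as np

# Toy instantiation of the four equations: a 3-state, 2-action MDP
# with a random (but valid) transition model p(s,a,s').
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p[s,a,:] sums to 1
R = np.array([0.0, 0.0, 1.0])  # R(s'): reward for landing in state s'
g = 0.9                        # discount factor

V = np.zeros(n_states)
for _ in range(500):  # iterate the Bellman backup to a (near) fixed point
    Q = np.einsum('sat,t->sa', p, R + g * V)  # Q(s,a) = sum_s' p(s,a,s')[R(s') + g V(s')]
    V = Q.max(axis=1)                         # V(s)  = max_a Q(s,a)

policy = Q.argmax(axis=1)                     # policy(s) = argmax_a Q(s,a)
```

With g < 1 the backup is a contraction, so the loop converges regardless of the starting V.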

We condition on the actions and observations to learn the best model, reward, and discount functions. And now I see your point (I think), after writing this out: the model and reward functions are not really well distinguished, and either could potentially substitute for the other (as they just multiply). It could learn a reward function that is just '1' and stuff everything in the model.

So - yes, we need more prior structure than the 4-line math model. My initial guess was about 100 lines of math code in a tight probabilistic programming model, which still may be reasonable in the future but is perhaps slightly optimistic.

Ok, so here is version 2. We know roughly that the cortex is responsible for modelling the world and we know its rough circuit complexity. So we can use that as a prior on the model function. Better yet, we can train the model function separately (constrained to cortex size or smaller), without including the policy function/argmax stuff, and on a dataset which includes situations where no actions are taken, forcing it to learn a world model first. Then we can use those results as an initial prior when we train the whole thing on the full dataset with the actions.
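The first stage of that scheme can be sketched in a few lines for the tabular case - `fit_world_model` and the data are hypothetical illustrations, assuming logged transitions with no reward signal:

```python
import numpy as np

def fit_world_model(transitions, n_states, n_actions):
    """Stage 1: estimate p(s,a,s') from logged (s, a, s') tuples alone -
    no rewards, no policy/argmax machinery - forcing a pure world model."""
    counts = np.ones((n_states, n_actions, n_states))  # Laplace-smoothed counts
    for s, a, s2 in transitions:
        counts[s, a, s2] += 1
    return counts / counts.sum(axis=2, keepdims=True)

# Stage 2 would then hold this model (nearly) fixed as a prior while the
# reward function is fit to the action data, so the reward function
# cannot simply absorb the dynamics.
data = [(0, 1, 2), (0, 1, 2), (0, 0, 1), (2, 1, 0)]  # made-up transitions
p_hat = fit_world_model(data, n_states=3, n_actions=2)
```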

That doesn't totally solve the general form of your objection, but it at least forces the utility function to be somewhat more sensible. I can now kind of see where version 100 or so of this idea is going and how it could work well, but it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors.

If your model doesn't have a box labeled "utility function," can you say again how you are extracting the utility function from the learned model?

Extract is perhaps not the right word, but the general idea is that once we have learned a human-level model function and reward function, in theory we can get superintelligent extrapolation by improving the model function, running it faster, and/or eliminating any planning limitations or noise. The model function we learn to explain human data in particular will only know/model what humans actually know.

So here is a more practical set of experiments we could do today...

I agree that this experiment can probably yield better behavior than training a supervised learner to reproduce human play. [...] But I think that most (all?) models in the literature will converge to exactly reproducing the "modal human policy" (in each state, do the thing that the expert is most likely to do) in the limit of infinite training data and a sufficiently rich state space. Do you have a counterexample in mind?

The modal human policy, as you describe it, sounds identical to the supervised learner which just reproduces human ability. Beating supervised learning (the modal human policy) is again what really matters.
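For concreteness, the modal human policy is trivial to write down as a supervised baseline - given (state, action) demonstrations, just pick the most frequent action per state. The function name and data below are made up for illustration:

```python
from collections import Counter, defaultdict

def modal_policy(demonstrations):
    """Map each state to the action the demonstrator chose most often there."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# A "human" who mostly goes right in state 0 but occasionally slips:
demos = [(0, 'right'), (0, 'right'), (0, 'left'), (1, 'up'), (1, 'up')]
print(modal_policy(demos))  # {0: 'right', 1: 'up'}
```

Anything IRL buys us has to show up as improvement over this baseline - e.g. by treating the minority 'left' action as a mistake rather than part of the policy.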

You can probably get optimal play in the atari case by leaning heavily on the simplicity prior for the rewards and neglecting the training data.

Not sure what you mean here - you need the training data to get up to any decent level of play. Perhaps you were thinking only of the utility function, but to learn that you still need some training data.

The deep ANN approach to RL is still new, and hasn't been merged with IRL research yet, which mostly appears to be in the small model stage (with the exception perhaps of some narrow applications in robotics and pathfinding).

Comment author: paulfchristiano 17 April 2015 03:49:02PM 1 point [-]

the model and reward functions are not really well distinguished and either could potentially substitute for the other (as they just multiply)

They can also substitute in more subtle ways, e.g. by learning R(s) = 1 if the last action implied by the state history matches the predicted human action. If the human is doing RL imperfectly then that is going to have a much better explanatory fit to the data (it can be arbitrarily good, while any model of a human as a perfect RL agent will lose Bayes points all over the place), so you have to rely on the prior to see that it's a "bad" model.

it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors

That's my concern; I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.

once we have learned a human-level model function and reward function

That's what I mean by a "box labeled 'utility function'."

The modal human policy, as you describe it, sounds identical to the supervised learner which just reproduces human ability. Beating supervised learning (the modal human policy) is again what really matters.

Yes. Do you know any model of IRL that can (significantly) beat the modal human policy in this context?

Not sure what you mean here - you need the training data to get up to any decent level of play

Sorry, I meant "assign low total weight" to the training data, so that the learner can infer that some of the human's decisions were probably mistakes (since they can only be explained by an artificial reward function). This is very delicate, and it requires paying more attention to the prior than you seemed to want to (and more attention to the prior than is consistent with actually making good predictions about human behavior).

Comment author: jacob_cannell 17 April 2015 06:20:22PM *  1 point [-]

the model and reward functions are not really well distinguished and either could potentially substitute for the other (as they just multiply)

They can also substitute in more subtle ways, e.g. by learning R(s) = 1 if the last action implied by the state history matches the predicted human action. If the human is doing RL imperfectly then that is going to have a much better explanatory fit to the data (it can be arbitrarily good, while any model of a human as a perfect RL agent will lose Bayes points all over the place), so you have to rely on the prior to see that it's a "bad" model.

That may or may not be a problem with the simplest version 1 of the idea, but it is not a problem in version 2 which imposes more realistic priors/constraints and also uses model pretraining on just state transitions to force differentiation of the model and reward functions.

I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.

Ok, I think we are kind of in agreement, but first let me recap where we are. This all started when I claimed that your 'easy IRL problem' - solve IRL given infinite compute and infinite perfect training data - is relatively easy and could probably be done in 100 lines of math. We both agreed that supervised learning (reproducing the training set - the modal human policy) would be obviously easy in this setting.

After that, the discussion forked and got complicated - which, I realize in hindsight, stems from not clearly specifying what would count as success. So to be more clear: success of the IRL approach can be measured as improvement over supervised learning, as measured in the recovered utility function. Which of course leads to this whole other complexity - how do we know that is the 'true' utility function? Leave that aside for a second, and I'll get back to it.

I then brought up a concrete example of using IRL on a deep RL Atari agent. I described how learning the score function should be relatively straightforward, and this would allow an IRL agent to match the performance of the RL agent in this domain, which leads to better performance than the supervised/modal human policy.

You agreed with this:

So here is a more practical set of experiments we could do today...

I agree that this experiment can probably yield better behavior than training a supervised learner to reproduce human play.

So it seems we have agreed that IRL surpassing the modal human policy is clearly possible - at least in the limited domain of atari.

If we already know the utility function a priori, then obviously IRL given the same resources can only do as well as RL. But that isn't very interesting - and remember, IRL can do much more, as in the example of learning to maximize score while under other complex constraints.

So in scaling up to more general problem domains, we have the issue of modelling mistakes - which you seem to be especially focused on - and the related issue of utility function uniqueness.

Versions 2 and later of my simple proto-proposal use more informed priors for the circuit complexity, combined with pretraining the model on just observations to force differentiation of the model and utility functions. In the case of Atari, getting the utility function to learn the score should be relatively easy - as we know it is a simple, immediate visual function.
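A minimal sketch of this kind of reward recovery, under strong simplifying assumptions (known small MDP, Boltzmann-rational demonstrator, a handful of candidate reward functions) - all names and numbers below are illustrative, not a method from the literature:

```python
import numpy as np

def boltzmann_policy(p, R, g=0.9, beta=5.0, iters=300):
    """Soft value iteration: action probabilities for a Boltzmann-rational agent."""
    V = np.zeros(p.shape[0])
    for _ in range(iters):
        Q = np.einsum('sat,t->sa', p, R + g * V)
        V = np.log(np.exp(beta * Q).sum(axis=1)) / beta  # soft maximum over actions
    pi = np.exp(beta * Q)
    return pi / pi.sum(axis=1, keepdims=True)

def infer_reward(p, demos, candidates):
    """Return the candidate reward that best explains the (state, action) demos."""
    def log_likelihood(R):
        pi = boltzmann_policy(p, R)
        return sum(np.log(pi[s, a]) for s, a in demos)
    return max(candidates, key=log_likelihood)

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(3), size=(3, 2))      # random 3-state, 2-action MDP
true_R = np.array([0.0, 0.0, 1.0])              # "score": only state 2 rewarded
pi = boltzmann_policy(p, true_R)
demos = [(s, int(pi[s].argmax())) for s in range(3)] * 20  # noiseless demos
candidates = [np.eye(3)[i] for i in range(3)]   # hypotheses: which state is rewarded?
recovered = infer_reward(p, demos, candidates)
```

In a toy like this the true reward is typically recovered; the hard part, as discussed above, is that with richer model classes nothing forces the likelihood-maximizing explanation to put the interesting structure in R rather than in p.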

This type of RL architecture can model humans' limited rationality by bounding the circuit complexity - at least that's the first step. We could get increasingly more accurate models of the human decision surface by incorporating more of the coarse abstract structure of the brain as a prior over our model space.

Ok, so backing up a bit:

it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors

That's my concern; I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.

For the full AGI problem, I am aware of a couple of interesting candidates for an intrinsic reward/utility function - the future freedom of action principle (power) and the compression progress measure (curiosity). If scaled up to superhuman intelligence, I think/suspect you would agree that both of these candidates are probably quite dangerous. On the other hand, they seem to capture some aspects of humans' intrinsic motivators, so they may be useful as subcomponents or features.

The IRL approach - if taken all the way - seems to require reverse engineering the brain. It could be that any successful route to safe superintelligence just requires this - because the class of agents that combine our specific complex unknown utility functions with extrapolated superintelligence necessarily can only be specified in reference to our neural architecture as a starting point.

Comment author: Wei_Dai 14 June 2015 06:36:42AM 0 points [-]

The IRL approach - if taken all the way - seems to require reverse engineering the brain. It could be that any successful route to safe superintelligence just requires this - because the class of agents that combine our specific complex unknown utility functions with extrapolated superintelligence necessarily can only be specified in reference to our neural architecture as a starting point.

This sounds really interesting and important (if true), but I have only a vague understanding of how you arrived at this conclusion. Please consider writing a post about it.

Comment author: jacob_cannell 14 June 2015 08:26:08PM 0 points [-]

This sounds really interesting and important (if true), but I have only a vague understanding of how you arrived at this conclusion. Please consider writing a post about it.

It's not so much a conclusion as an intuition, and most of the inferences leading up to it are contained in this thread with PaulChristiano and a related discussion with Kaj Sotala.

I'm interested in IRL and I think it's the most promising current candidate for value learning, but I must admit I haven't read much of the relevant literature yet. Reading up on IRL and writing a discussion post on it has been on my todo list - your comment just bumped it up a bit. :)

Another related issue is the more general question of how the training data/environment determines/shapes safety issues for learning agents.

Comment author: Wei_Dai 15 June 2015 07:44:52AM 0 points [-]

My reaction when I first came across IRL is similar to this author's:

However, the current IRL methods are limited and cannot be used for inferring human values because of their long list of assumptions. For instance, in most IRL methods the environment is usually assumed to be stationary, fully observable, and sometimes known; the policy of the agent is assumed to be stationary and optimal or near-optimal; the reward function is assumed to be stationary as well; and the Markov property is assumed. Such assumptions are reasonable for limited motor control tasks such as grasping and manipulation; however, if our goal is to learn high-level human values, they become unrealistic.

But maybe it's not a bad approach for solving a hard problem to first solve a very simplified version of it, then gradually relax the simplifying assumptions and try to build up to a solution of the full problem.

Comment author: jacob_cannell 15 June 2015 04:00:46PM *  2 points [-]

My reaction when I first came across IRL is similar to this author's:

As a side note, that author's attempt at value learning is likely to suffer from the same problem Christiano brought up in this thread - there is nothing to enforce that the optimization process will actually nicely separate the reward and agent functionality. Doing that requires some more complex priors and/or training tricks.

The author's critique about limiting assumptions may or may not be true, but the author only quotes a single paper from the IRL field - and it's from 2000. That paper and its follow-up each have 500+ citations, and some of the newer work with IRL in the title is from 2008 or later. Also, most of the related research doesn't use IRL in the title - e.g. "Probabilistic reasoning from observed context-aware behavior".

But maybe it's not a bad approach for solving a hard problem to first solve a very simplified version of it, then gradually relax the simplifying assumptions and try to build up to a solution of the full problem.

This is actually the mainline successful approach in machine learning - scaling up. MNIST is a small 'toy' visual learning problem, but it led to CIFAR10/100 and eventually ImageNet. The systems that do well on ImageNet descend from the techniques that did well on MNIST decades ago.

MIRI/LW seems much more focused on starting with a top-down approach where you solve the full problem in an unrealistic model - given infinite compute - and then scale down by developing some approximation.

Compare MIRI/LW's fascination with AIXI vs the machine learning community's. Searching for "AIXI" on r/machinelearning gets a single hit vs 634 results on LessWrong. Based on around 150 citations, AIXI is a minor/average paper in ML (more minor than IRL), and doesn't appear to have led to great new insights in terms of fast approximations to Bayesian inference (a very active field that connects mostly to ANN research).

Comment author: Wei_Dai 17 June 2015 12:10:01AM 2 points [-]

MIRI is taking the top-down approach since that seems to be the best way to eventually obtain an AI for which you can derive theoretical guarantees. In the absence of such guarantees, we can't be confident that an AI will behave correctly when it's able to think of strategies or reach world states that are very far outside of its training and testing data sets. The price for pursuing such guarantees may well be slower progress in making efficient and capable AIs, with impressive and/or profitable applications, which would explain why the mainstream research community isn't very interested in this approach.

I tend to agree with MIRI that the top-down approach is probably safest, but since it may turn out to be too slow to make any difference, we should be looking at other approaches as well. If you're thinking about writing a post about recent progress in IRL and related ideas, I'd be very interested to see it.

Comment author: jacob_cannell 17 June 2015 01:32:37AM 1 point [-]

MIRI is taking the top-down approach since that seems to be the best way to eventually obtain an AI for which you can derive theoretical guarantees.

I for one remain skeptical that such theoretical guarantees are possible in principle for the domain of general AI. The utility of formal math for a domain tends to vary inversely with the domain's complexity. For example, in some cases it may be practically possible to derive formal guarantees about the full output space of a program, but not when that program is as complex as a modern video game, let alone a human. The equivalent of theoretical guarantees may be possible/useful for something like a bridge, but less so for an airplane or a city.

For complex systems, simulations are the key tool that enables predictions about future behavior.

In the absence of such guarantees, we can't be confident that an AI will behave correctly when it's able to think of strategies or reach world states that are very far outside of its training and testing data sets.

This indeed would be a problem if the AI's training ever stopped, but I find this extremely unlikely. Some AI systems already learn continuously - whether using online learning directly or by just frequently patching the AI with the results of updated training data. Future AI systems will continue this trend - and learn continuously like humans.

Much depends on one's particular models for how the future of AI will pan out. I contend that AI does not need to be perfect, just better than humans. AI drivers don't need to make optimal driving decisions - they just need to drive better than humans. Likewise AI software engineers just need to code better than human coders, and AI AI researchers just need to do their research better than humans. And so on.

The price for pursuing such guarantees may well be slower progress in making efficient and capable AIs, with impressive and/or profitable applications, which would explain why the mainstream research community isn't very interested in this approach.

For the record, I do believe that MIRI is/should be funded at some level - it's sort of a moonshot, but one worth taking given the reasonable price. Mainstream opinion on the safety issue is diverse, and there are increasingly complex PR and career issues to consider. For example, corporations are motivated to downplay long-term existential risks, and in the future will be motivated to downplay similarity between AI and human cognition to avoid regulation.

If you're thinking about writing a post about recent progress in IRL and related ideas, I'd be very interested to see it.

Cool - I'm working up to it.

Comment author: Wei_Dai 17 June 2015 07:45:06AM *  1 point [-]

Future AI systems will continue this trend - and learn continuously like humans.

Sure, but when it comes to learning values, I see a few problems even with continuous learning:

  1. The AI needs to know when to be uncertain about its values, and actively seek out human advice (or defer to human control) in those cases. If the AI is wrong and overconfident (like in http://www.evolvingai.org/fooling but for values instead of image classification) even once, we could be totally screwed.
  2. On the other hand, if the AI can think much faster than a human (almost certainly the case, given how fast hardware neurons are even today), learning from humans in real time will be extremely expensive. There will be high incentive to lower the frequency of querying humans to a minimum. Those willing to take risks, or think that they have a simple utility function that the AI can learn quickly, could have a big advantage in how competitive their AIs are.
  3. I don't know what my own values are, especially when it comes to exotic world states that are achievable post-Singularity. (You could say that my own training set was too small. :)) Ideally I'd like to train an AI to try to figure out my values the same way that I would (i.e., by doing philosophy), but that might require very different methods than for learning well-defined values. I don't know if incremental progress in value learning could make that leap.

For complex systems simulations are the key tool that enables predictions about future behavior. [...] I contend that AI does not need to be perfect, just better than humans.

My point was that an AI could do well on test data, including simulations, but get tripped up at some later date (e.g., it over-confidently thinks that a certain world state would be highly desirable). Another way things could go wrong is that an AI learns wrong values, but does well in simulations because it infers that it's being tested and tries to please the human controllers in order to be released into the real world.