jacob_cannell comments on Concept Safety: Producing similar AI-human concept spaces - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Kaj_Sotala's post doesn't directly address this issue, but transparent concept learning of the form discussed in the OP should combine well with machine learning approaches to value learning, such as inverse reinforcement learning.
The general idea is that we define the agent's utility function indirectly as the function/circuit that would best explain human actions as a subcomponent of a larger generative model trained on a suitable dataset of human behavior.
With a suitably powerful inference engine and an appropriate training set, this class of techniques is potentially far more robust than any direct specification of a human utility function. Whatever humans' true utility functions are, those preferences are revealed in the consequences of our decisions, and a suitable inference system can recover that structure.
The idea is that the AI will learn the structural connections between humans' usage of the terms "good" and "bad" and its own utility function and value function approximations. In the early days its internal model may rely heavily on explicit moral instruction as the best predictor of the true utility function, but later on it should learn a more sophisticated model.
I wrote a bit about this here, and posed the "very easy goal inference problem": even if you had as much time as you wanted and knew exactly what a human would do in every possible situation, could you figure out what outcomes were "good" and "bad" then?
It seems like we have to be able to solve this problem in order to carry out the kind of strategy you describe.
I haven't seen any meaningful progress on it. I guess the hope is that we will get better at answering it as we get better at AI. But it seems quite distant from everything that people work on in AI, and it's pretty distant even from what people work on in cognitive science.
Given infinite compute and an unlimited, perfect training set, the utility-inference problem is still somewhat more complex than just predicting what the human would do (supervised learning), but it still seems pretty tractable.
Using essentially perfect unsupervised learning (full Bayesian/Solomonoff induction) on an enormous computer, you could compute a posterior over models that explain the data well. The best models will tend to include circuits which efficiently approximate actual human thought processes. The problem of course - as you mention - is that we want to extract the equivalent of a human utility function so that we can combine it with a much-improved superhuman predictive model of the world, along with a much longer planning horizon, an improved Q/value function, etc.
This is still relatively easy to set up - at least in theory - in a suitable probabilistic programming environment. "Predict the human's decision output" corresponds to one or two lines of code (it's just the obvious supervised learning objective without any complex constraints), whereas a proper inverse reinforcement learning setup of the type discussed above - one which extracts a modular human utility function suitable for insertion into a more powerful AI - perhaps corresponds to a few hundred lines of code to describe the more complex prior structure we are imposing on the space of models.
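As a sketch of the contrast being drawn here - the one-line supervised objective versus an IRL-style objective with extra prior structure - consider the following toy example. Everything in it (the Boltzmann-rational "human", the two-element candidate reward set, all the numbers) is a hypothetical illustration, not anyone's actual proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 states, 2 actions. A noisily reward-driven "human" demonstrates.
n_states, n_actions = 3, 2
true_reward = np.array([0.0, 1.0])  # hypothetical per-action reward

def human_policy(s):
    # Assumed Boltzmann-rational human: prefers the higher-reward action, with noise.
    p = np.exp(true_reward) / np.exp(true_reward).sum()
    return rng.choice(n_actions, p=p)

demos = [(s, human_policy(s)) for s in rng.integers(0, n_states, 1000)]

# (1) Supervised objective: directly fit p(action | state) -- just counting.
counts = np.zeros((n_states, n_actions))
for s, a in demos:
    counts[s, a] += 1
p_action = counts / counts.sum(axis=1, keepdims=True)

# (2) IRL-style objective: posterior over candidate reward functions, scored
# by how well a Boltzmann-rational agent with that reward explains the demos.
candidates = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
log_liks = []
for r in candidates:
    pi = np.exp(r) / np.exp(r).sum()
    log_liks.append(sum(np.log(pi[a]) for _, a in demos))
best_reward = candidates[int(np.argmax(log_liks))]
```

The supervised fit only ever reproduces the human's action frequencies; the IRL fit commits to extra structure (an agent architecture with a reward slot) and so recovers an object that can be transplanted into a different agent.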
To be really robust we would need to explore many different potential modular models, and eventually we may need to start worrying about all the instantiated models we create - as if you make them complex enough there is an argument that they eventually could become equivalent to simulated humans, etc.
But in reality that is a ways out as we don't have unlimited computation, and it seems that humans are fairly capable of modelling the preferences of other humans using cheap approximations.
At this stage in the game supervised learning is the most effective training paradigm for most tasks. The scope of AIs that we can build right now are systems with on the order of millions of neurons that can replicate some of the specific functionality from a few brain regions on rather specific tasks. IRL will become important later, once AI systems are much larger, much more educated, and have less to learn by directly imitating human experts.
So in short, people in AI aren't working on this today because it's not where the money is .. yet.
This would be great to see, but I'm not too optimistic. I can't tell whether the approach you are describing is designed to extract structure which actually exists, by lining up the prior with the actual structure of the brain, or whether it's designed to create new structure, by finding a simple explanatory model that is much more modular than the brain itself. Both seem tough!
I would be very surprised if there is a modular part of the brain that implements the "human utility function," and I've never heard a contemporary cognitive scientist endorse anything that would be suitable for your intended application.
If you write down an accurate model of the brain as "rational behavior + noise," I suspect the noise model ends up being exactly as complex as the human brain itself, since it has to e.g. specify how we think about things in order to predict what things we won't have time to think about. And once the complexity is in the noise model, the normal model selection test isn't going to really work for finding the values. E.g., if you imposed the obvious kind of structure, I wouldn't be surprised if the utility function you got out was not at all what humans valued, but just a useful heuristic for explaining some small part of human behavior.
Do you see any reason to suspect otherwise, or generally any evidence to make you optimistic about this project succeeding? Can you imagine any kind of breakdown that doesn't obviously fail?
I agree you could extract something like the map from "Perceptions" ---> "Human's answer to the question: 'does it seem like things are going well?'", and I have mostly focused on attacking AI control using capabilities like this.
In general: some day AI control will become an economically relevant problem and it will receive attention if needed. This seems to be why serious people are optimistic about our prospects. But if we want to foster relative progress on control, then we should try to understand the issue further in advance than will happen by default.
So we may disagree about how optimistic we should be about this research project, but hopefully we can agree that it's a research project that (1) would have to be solved in order for this approach to AI control to work, though it could be solved incidentally, and (2) is not currently benefitting that much from conventional research in AI.
You may be optimistic that it will obviously be solved once it becomes relevant. I don't see much reason for such optimism. But the bigger difference is that it just wouldn't change my outlook much if I thought it had a 75% chance of being solved.
Remember we are talking about infinite inference power and unlimited, accurate training data, so specifying a careful, accurate prior over the space of models is just not something we have to worry about. All we need to worry about is ensuring that our problem definition actually solves the correct problem.
So to clarify - the general problem is something along the lines of: find a utility function (a function which maps observation histories to real numbers), and a model RL agent architecture that together explain/predict the output dataset (human minds). We can then use that utility function in a more powerful RL agent.
The assumptions we need to make to solve this problem are only those related to our intent: namely that human decisions - however implemented internally - imply preferences over observation histories/worlds, and that we want to create new agents which optimize for those preferences more effectively.
I'm not sure what you mean by this. Noise is used in generative models to cope with the fact that we can't train perfect predictors. With infinite inference power the model search is likely to find very low complexity solutions, but there will always be some number of complexity bits that go somewhere - in your unknown hyperparameters, noise, whatever. The type of model I was imagining was one that parameterized all of model space (a la Solomonoff induction), rather than one which uses noise explicitly.
The problem definition you gave in your blog involves "a lookup table of what a human would do after making any sequence of observations". I interpreted that as a perfect training dataset that covers the entire human mindspace. Given infinite inference power, the resulting solutions would be - by the properties of Solomonoff induction - the best possible explanation of that data, and vastly superior to anything humans will ever come up with. At a philosophical level, infinite inference power corresponds to actually instantiating entire multiverses just to solve the problem.
Now it could be that human minds cannot be described very well by any type of RL agent architecture for any possible utility function. I very much doubt this, because that's an extremely general agent framework. However, even if that were true (which it isn't), then the infinite inference engine would recover the ultimate approximation given those assumptions, which is probably good enough.
The project is just a thought experiment, because we will never have infinite inference power and infinite perfect training data.
That being said, I still think the general approach is probably correct and could lead to approximately friendly AI eventually, the challenges naturally come from limited inference power and limited training data - with the latter being the especially difficult part.
The most important training data will be data covering hypothetical future situations. I don't yet see how to handle this. Maybe there is some simpler extrapolation technique, where the agent can learn some simple general principle - such as "humans prefer control over their future observation history" - that once mastered, allows extrapolation to avoiding death, wireheading, etc etc.
The other practical difficulty is testing. The most important situations we want to test are exactly those which we cannot - future hypotheticals.
Regarding (1): I think this is essentially the only feasible practical approach to the FAI utility function problem on the table - at least that I am aware of. (2) is not entirely correct, as this approach is enabled by all the great progress in machine learning in improving our general inference capabilities.
On the other hand, machine learning is very much an experimental engineering field - progress comes mostly from experiments rather than theory. So how can we set up a series of experiments that leads us to friendly superintelligence? That appears to be a core hard problem. One analogy that comes to mind is the creation of a new large nation state - especially one of a new type, like the US or the French Republic. Unfortunately that is just not something one can learn how to do through a large number of experiments.
One approach that could have promise is to learn a scaling theory. Or perhaps we focus on collective superintelligence where a large number of AIs learn the values of many humans and we let game theory sort it out.
I agree.
That's the interpretation I had in mind.
As you point out, this is very unlikely. The question is whether the learned utility functions actually capture what humans care about.
If you think through a few easy approaches, you will see that they predictably fail. We can discuss in more detail, but it would be easier if you provided more insight into what kind of approach you are optimistic about. I can argue against N of them, but you will think that at least N-1 are straw men.
The most natural approach, to an LW mindset, is to define a basic framework for "RL agents" that has a slot for a utility function. Then we can take a simplicity prior over models that fit into this basic framework, and do inference to find a posterior distribution over models, and hence over utility functions. If this is what you have in mind, I'm happy to comment in more depth on why I'm pessimistic.
The basic problem is that the simplest model of a human is clearly not as an RL agent, it's to directly model the many particular cognitive effects that shape human behavior. For any expressive framework, the most parsimonious model is going to throw out your framework and just model these cognitive effects directly. Of course it can't literally throw out your framework, but it can do so in all but name. For a crude example, the definition of "utility function" could consult the real model of the human to figure out what action a human would take, and then output a simple utility function that directly incentivized the predicted actions.
This will break your intended correspondence between the box in your model labeled "utility" and the actual values of the human subject, and if you give this utility function to a stronger RL agent I don't think the results will be satisfactory.
If we were to pick any concrete model I am quite confident that I could demonstrate this kind of behavior. I suspect that the only way we can avoid it is by being sufficiently vague about the approach that we can't make any concrete statements about what kind of representation it would learn.
Yes, actually getting a solution would require impressive inference capability. For now I'm happy to suppose that continuing AI progress will deliver inference abilities that are up to the task.
But I am especially interested in the residual---even if your inference abilities are as good as you could ask for, how do you solve the problem? It is about this residual that I am most pessimistic, and improvements in our inference ability don't help.
Yes, more or less. I should now point out that almost everything of importance concerning the outcome is determined by the training dataset, not the model prior. This may seem counter-intuitive at first, but it is true and important.
This is not clear at all, and furthermore appears to contradict what you agreed to earlier above - namely that human minds can be described well as a specific type of RL agent with some particular utility function.
I consider myself reasonably up to date in both computational neuroscience and ML, and the most successful over-arching theory for explaining the brain today is indeed as a form of RL agent. Thus the RL framework in some sense is the most general framework we have and it includes human, animal, and a wide class of machine agents as special cases.
The 'framework' I proposed is minimal - describing the class of all RL agents requires just a few lines of math. Remember the training set is near infinite and perfect, so the tiny number of bits I am imposing on the model prior matters not at all.
You seem to perhaps believe that I am specifying a framework in terms of modules or connections or whatever on the agent, and that was not my idea at all (at least in the infinite computing case). I was proposing the absolute minimal assumptions. The inference engine will explore the model space - and probably come up with something ridiculous like simulations of universes if you give it infinite compute. With practical but very large amounts of compute power, it will - probably - come up with some sort of approximate brain-like ANN solution.
I am skeptical you could demonstrate this, but you could start by taking one of the existing IRL systems in the literature and demonstrating the failure there. Or maybe I am unclear on the nature of your concern. You seem to be concerned with the details of how the resulting model works. I believe that is a fundamentally misguided notion; instead we really care only about results. This could be a fundamental difference in mindsets - I'm very much an engineer.
In other words, the ultimate question is this: is the resulting agent better at doing what we actually want (on whatever set of tasks the training set includes) than the human experts that are the source of that training data?
For after all, that is the key advantage of RL techniques over supervised learning, an advantage which IRL inherits.
So here is a more practical set of experiments we could do today. Take a deep RL agent like DeepMind's Atari player, but instead of training it using the internal score as the reward function directly, use IRL on traces of expert human play. We can compare to a baseline with the same model but trained using supervised learning. The supervised baseline would learn human errors and thus would asymptote at human-level play. The IRL agent instead should eventually learn a good approximation of the score function as its utility/reward function, and thus achieve capability close to the original RL agent.
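A miniature version of this comparison can be sketched directly. This is a toy stand-in, not the Atari setup: the "game" is a single-state three-action bandit, the expert is assumed Boltzmann-rational, and the IRL step exploits the fact that for a softmax policy the reward is recoverable (up to a constant) from log action frequencies:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "game": one state, 3 actions with true scores. The human expert is
# only Boltzmann-rational, so it sometimes picks suboptimal actions.
true_score = np.array([0.0, 1.0, 3.0])
beta = 1.0  # assumed expert rationality; lower = noisier play
expert_probs = np.exp(beta * true_score)
expert_probs /= expert_probs.sum()
demos = rng.choice(3, size=5000, p=expert_probs)

# Supervised baseline: imitate the expert's action distribution directly.
imitation = np.bincount(demos, minlength=3) / len(demos)
imitation_value = imitation @ true_score      # asymptotes at expert level

# IRL: recover a reward whose Boltzmann policy matches the demos (for a
# softmax policy this is log-frequencies up to a constant), then act
# *greedily* on the recovered reward instead of copying the expert's noise.
recovered_reward = np.log(imitation) / beta
greedy_value = true_score[int(np.argmax(recovered_reward))]
```

Under these assumptions the greedy agent built on the recovered reward scores higher than the imitation baseline, because imitation faithfully reproduces the expert's mistakes.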
A cool variation would be to add another training sequence where the human expert has additional constraints - such as maximize score without killing any other 'agents'. For the games for which that applies, I think that would be a really cool important demonstration of the beginnings of learning ethical behavior from humans.
So the core idea is to apply that same concept, but to life in general, where our 'game world' is the real world, and there is no predefined score function, and the ideal utility function must be inferred.
I don't claim to have a clear solution to the full problem yet, but my thought experiment above sketches out the vague beginnings of an IRL based solution. Again the training is everything - so the full solution becomes something more like educating an AI population, a problem that goes far beyond the basic math or machine learning and connects to politics, education, game theory, etc.
Yes, the model you get won't depend at all on the tiny number of bits that you are imposing, unless your model class is extremely crippled. This is precisely my point. You will get a really good model. But you imposed some structure in the model, perhaps with a little box labeled "utility function." After inference, that box isn't going to have the utility function in it. Why would your universe-simulating model bother dividing itself neatly into "utility function" and "everything else"? It will just ignore your division and do whatever is most efficient.
I believe you will get out a model that predicts human behavior well. I think we can agree on that! But it's just not enough to do anything with. Now you have a simulation of a human; what do you do with it?
You are making a further claim---that in the box labeled "utility function," the model will put a reasonable representation of a human utility function, such that you'd be happy with your AI maximizing that utility function. It seems like you are the one making a detailed assumption about how the learned model works, an assumption which seems implausible to me. If you think you aren't making such an assumption, could you express (even very informally) the argument that the IRL agent will work well?
If your model doesn't have a box labeled "utility function," can you say again how you are extracting the utility function from the learned model?
Or do you think that you will not find a reasonable utility function, but produce desirable behavior anyway? I don't understand why this would happen.
We seem to be talking past each other. Could you cite a paper with what you think is a plausible model? I could respond to any of them, but again it would feel like a straw man, because I don't think that the authors of these papers expect them to apply to general human behavior.
For example, most of these models make no attempt to model reasoning, and instead assume e.g. that the probability that an agent takes an action depends only on the payoff of that action. This is obviously not a very good model! How do you see this working?
I agree that this experiment can probably yield better behavior than training a supervised learner to reproduce human play.
But existing approaches won't scale to learn perfect play, even with infinite computing power and unlimited training data, except in extremely simple environments. To make this clear you'd have to fix a particular model, which I invite you to do. But I think that most (all?) models in the literature will converge to exactly reproducing the "modal human policy" (in each state, do the thing that the expert is most likely to do) in the limit of infinite training data and a sufficiently rich state space. Do you have a counterexample in mind?
You can probably get optimal play in the atari case by leaning heavily on the simplicity prior for the rewards and neglecting the training data. But earlier in your comment it (very strongly) sounded like you wanted to let the training data wash out the prior.
Hmm at this point I should now actually write out a simple RL model to help me understand your critique.
Here is some very simple math for a general RL setup (Bellman-style recursive function form):
model = p(s, a, s')
policy(s) = argmax_a Q(s, a)
Q(s, a) = sum_s' p(s, a, s') [ R(s') + g V(s') ]
V(s) = max_a Q(s, a)
The function p(s,a,s') is the agent's world model, which gives transition probabilities between consecutive states (s, s') on action a. The states really are observation histories - entire sequences of observations. The variable g represents the discount factor (although really this should probably be an unknown function). R(s') is the reward/utility function, and Q(s,a) is the action-value function that results from planning ahead to optimize R. The decision/policy function just selects the best action.
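The four lines above can be instantiated directly as value iteration on a toy tabular MDP. All sizes and numbers below are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, g = 4, 2, 0.9                     # states, actions, discount factor
p = rng.random((nS, nA, nS))              # model = p(s, a, s')
p /= p.sum(axis=2, keepdims=True)         # normalize transition probabilities
R = rng.random(nS)                        # reward R(s') on arriving in s'

V = np.zeros(nS)
for _ in range(1000):
    # Q(s,a) = sum_s' p(s,a,s') [ R(s') + g V(s') ]
    Q = np.einsum('san,n->sa', p, R + g * V)
    V_new = Q.max(axis=1)                 # V(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-10:  # reached the fixed point
        break
    V = V_new

policy = Q.argmax(axis=1)                 # policy(s) = argmax_a Q(s,a)
```

Because the update is a contraction (for g < 1), the loop converges to the unique fixed point of the Bellman equations.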
We condition on the actions and observations to learn the best model, reward, and discount functions. And now I see your point (I think), after writing this out: the model and reward functions are not really well distinguished, and either could potentially substitute for the other (as they just multiply). It could learn a reward function that is just '1' and stuff everything into the model.
So - yes, we need more prior structure than the four lines of math above. My initial guess was about 100 lines of code in a tight probabilistic-programming model, which may still be reasonable in the future but is perhaps slightly optimistic.
Ok, so here is version 2. We know roughly that the cortex is responsible for modelling the world and we know its rough circuit complexity. So we can use that as a prior on the model function. Better yet, we can train the model function separately (constrained to cortex size or smaller), without including the policy function/argmax stuff, and on a dataset which includes situations where no actions are taken, forcing it to learn a world model first. Then we can use those results as an initial prior when we train the whole thing on the full dataset with the actions.
That doesn't totally solve the general form of your objection, but it at least forces the utility function to be somewhat more sensible. I can now kind of see where version 100 of this idea is going and how it could work well, but it probably requires increasingly complex models of human-like brains (along with more complex training schemes) as priors.
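The two-stage scheme could be sketched as follows. This is a toy illustration under hypothetical transition and reward tables: stage 1 fits the world model alone on passive data, stage 2 freezes it and fits only the reward:

```python
import numpy as np

rng = np.random.default_rng(3)

nS, nA = 2, 2
true_p = np.array([[[0.9, 0.1], [0.1, 0.9]],
                   [[0.8, 0.2], [0.2, 0.8]]])   # true p[s, a, s'], invented
true_R = np.array([0.0, 1.0])                    # true reward per next state

# Stage 1: learn p(s, a, s') from observed transitions alone
# (no actions-as-choices, no reward inference -- a pure world model).
counts = np.zeros((nS, nA, nS))
for _ in range(20000):
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=true_p[s, a])
    counts[s, a, s2] += 1
p_hat = counts / counts.sum(axis=2, keepdims=True)

# Stage 2: with p_hat frozen, score candidate rewards by how well a greedy
# one-step planner under (p_hat, R) explains the human's chosen actions.
def greedy_action(s, R):
    return int(np.argmax(p_hat[s] @ R))

demos = [(s, int(np.argmax(true_p[s] @ true_R)))
         for s in rng.integers(0, nS, 200)]
candidates = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
scores = [sum(greedy_action(s, R) == a for s, a in demos) for R in candidates]
best_R = candidates[int(np.argmax(scores))]
```

Because the world model is pinned down in stage 1, the reward can no longer absorb the transition structure, which is the point of the version-2 proposal.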
Extract is perhaps not the right word, but the general idea is that once we have learned a human-level model function and reward function, in theory we can get superintelligent extrapolation by improving the model function, running it faster, and or eliminating any planning limitations or noise. The model function we learn to explain human data in particular will only know/model what humans actually know.
The modal human policy, as you describe it, sounds identical to the supervised learner which just reproduces human ability. Beating supervised learning (the modal human policy) is again what really matters.
Not sure what you mean here - you need the training data to get up to any decent level of play. Perhaps you were thinking only of the utility function, but to learn that you still need some training data.
The deep ANN approach to RL is still new, and hasn't been merged with IRL research yet, which mostly appears to be in the small model stage (with the exception perhaps of some narrow applications in robotics and pathfinding).
They can also substitute in more subtle ways, e.g. by learning R(s) = 1 if the last action implied by the state history matches the predicted human action. If the human is doing RL imperfectly then that is going to have a much better explanatory fit to the data (it can be arbitrarily good, while any model of a human as a perfect RL agent will lose Bayes points all over the place), so you have to rely on the prior to see that it's a "bad" model.
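A toy illustration of this substitution: when the "human" in the data is imperfect, the degenerate reward (which simply endorses whatever the human did) earns far more Bayes points than modeling the human as a perfect optimizer of the true reward. All numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Imperfect human: picks the truly-best of 2 actions only 80% of the time.
true_reward = np.array([0.0, 1.0])
demos = rng.choice(2, size=1000, p=[0.2, 0.8])

# Model A: human as a *perfect* maximizer of the true reward. It assigns
# probability ~1 to action 1, so every human "mistake" costs heavily.
eps = 1e-6                                # tiny smoothing to avoid log(0)
probs_A = np.array([eps, 1.0 - eps])
loglik_A = sum(np.log(probs_A[a]) for a in demos)

# Model B: degenerate reward R(s) = 1 iff the action matches the predicted
# human action -- an agent optimizing *this* reward just reproduces the
# human's empirical action distribution, mistakes included.
freq = np.bincount(demos, minlength=2) / len(demos)
loglik_B = sum(np.log(freq[a]) for a in demos)
```

Model B's log-likelihood is far higher, so only the prior (not the data) can disqualify the degenerate reward.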
That's my concern; I think things get pretty hairy, and moreover I don't know whether the resulting systems would typically be competitive with (e.g.) the best RL agents that we could design by more direct methods.
That's what I mean by a "box labeled 'utility function'."
Yes. Do you know any model of IRL that can (significantly) beat the modal human policy in this context?
Sorry, I meant "assign low total weight" to the training data, so that the learner can infer that some of the human's decisions were probably mistakes (since they can only be explained by an artificial reward function). This is very delicate, and it requires paying more attention to the prior than you seemed to want to (and more attention to the prior than is consistent with actually making good predictions about human behavior).
I'm trying to wrap my head around all this, and as someone with no programming/AI background, I found this the clearest, gentlest introduction to inverse reinforcement learning.
I know of inverse reinforcement learning and similar ideas, but I still argue that they are bad for the same reason.
In regular reinforcement learning, the human presses a button that says "GOOD", and a sufficiently intelligent AI learns that it can just steal the button and press it itself.
In inverse reinforcement learning, the human presses a button that says "GOOD" at first. Then the button is turned off, and the AI is told to predict what actions would have led to the button being pressed. Instead of actual reinforcement, there is merely predicted reinforcement.
However a sufficiently intelligent AI should predict that stealing the button would have resulted in the button being pressed, and so it will still do that. Even though the button is turned off, the AI is trying to predict what would be best in the counter-factual world where the button is still on.
And so the programmer thinks that they have taught the AI to understand what is good, but really they have just taught it to figure out how to press a button labelled "GOOD".
This is not how IRL works at all. The utility function does not come from a special reward channel controlled by a human. There is no button.
To reiterate my description earlier, IRL is based on inferring the unknown utility function of an agent given examples of the agent's behaviour in terms of observations and actions. The utility function is entirely an internal component of the model.