eli_sennesh comments on Debunking Fallacies in the Theory of AI Motivation - Less Wrong
Thanks for your response.
So, I think this touches on the difficult part. As humans, we have a good idea of what "giving choices to people" vs. "forcing them to do something" looks like. This concept would need to resolve some edge cases, such as putting psychological manipulation in the "forceful" category (even though it can be done with only text). A sufficiently advanced AI's concept space might contain a similar concept. But how do we pinpoint this concept in the AI's concept space? Very likely, the concept space will be very complicated and difficult for humans to understand. It might very well contain concepts that look a lot like the "giving choices to people" vs. "forcing them to do something" distinction on multiple examples, but are different in important ways. We need to pinpoint it in order to make this concept part of the AI's decision-making procedure.
This seems pretty similar to Paul's idea of a black-box human in the counterfactual loop. I think this is probably a good idea, but the two problems here are (1) setting up this (possibly counterfactual) interaction in a way that it approves a large class of good plans and rejects almost all bad plans (see the next section), and (2) having a good way to predict the outcome of this interaction, usually without actually performing it. While we could say that (2) will be solved by virtue of the superintelligence being a superintelligence, in practice we'll probably get AGI before we get uploads, so we'll need some sort of semi-reliable way to predict humans without actually simulating them. Additionally, the AI might need to self-improve to be anywhere near smart enough to consider this complex hypothetical, and so we'll need some kind of low-impact self-improvement system. Again, I think this is probably a good idea, but there are quite a lot of issues with it, and we might need to do something different in practice. Paul has written about problems with black-box approaches based on predicting counterfactual humans here and here. I think it's a good idea to develop both black-box solutions and white-box solutions, so we are not over-reliant on the assumptions involved in one or the other.
What language will people's questions about the plans be in? If it's a natural language, then the AI must be able to translate its concept space into the human concept space, and we have to solve a FAI-complete problem to do this. If it's a more technical language, then humans themselves must be able to look at the AI's concept space and understand it. Whether this is possible very much depends on how transparent the AI's concept space is. Something like deep learning is likely to produce concepts that are very difficult for humans to understand, while probabilistic programming might produce more transparent models. How easy it is to make transparent AGI (compared to opaque AGI) is an open question.
We should also definitely be wary of a decision rule of the form "find a plan that, if explained to humans, would cause humans to say they understand it". Since people are easy to manipulate, raw optimization for this objective will produce psychologically manipulative plans that people will incorrectly approve of. There needs to be some way to separate "optimize for the plan being good" from "optimize for people thinking the plan is good when it is explained to them", or else some way of ensuring that humans' judgments about these plans are accurate.
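To make the gap between the two objectives concrete, here is a toy sketch in Python. Everything in it is hypothetical: the `Plan` fields, the numbers, and the assumption that a judge's approval is simply true quality plus a persuasion bonus. It only illustrates that maximizing approval and maximizing quality can select different plans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    quality: float         # how good the plan actually is (hidden from the judge)
    persuasiveness: float  # how much the explanation inflates the judge's opinion


def human_approval(plan: Plan) -> float:
    # Hypothetical judge: perceives true quality plus whatever persuasion adds.
    return plan.quality + plan.persuasiveness


# Made-up candidate plans, purely for illustration.
plans = [
    Plan("honest, good", quality=0.8, persuasiveness=0.0),
    Plan("honest, mediocre", quality=0.5, persuasiveness=0.0),
    Plan("manipulative, bad", quality=0.1, persuasiveness=0.9),
]

# "Optimize for people thinking the plan is good when it is explained to them."
best_by_approval = max(plans, key=human_approval)

# "Optimize for the plan being good" (not directly available to the judge).
best_by_quality = max(plans, key=lambda p: p.quality)

print(best_by_approval.name)
print(best_by_quality.name)
```

In this toy setup the approval-maximizer picks the manipulative plan, while the quality-maximizer picks the honest one. The hard part, of course, is that a real system has no direct access to the `quality` column.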
Again, it's quite plausible that the AI's concept space will contain some kind of concept that distinguishes between these different types of optimization; however, humans will need to understand the AI's concept space in order to pinpoint this concept so it can be integrated into the AI's decision rule.
I should mention that I don't think that these black-box approaches to AI control are necessarily doomed to failure; rather, I'm pointing out that there are lots of unresolved gaps in our knowledge of how they can be made to work, and it's plausible that they are too difficult in practice.
Why does everyone suppose that there are a thousand different ways to learn concepts (ie: classifiers), but no normatively correct way for an AI to learn concepts? It seems strange to me that we think we can only work with a randomly selected concept-learning algorithm or the One Truly Human Concept-Learning Algorithm, but can't say when the human is wrong.
We can do something like list a bunch of examples, have humans label them, and then find the lowest Kolmogorov complexity concept that agrees with human judgments in, say, 90% of cases. I'm not sure if this is what you mean by "normatively correct", but it seems like a plausible concept that multiple concept learning algorithms might converge on. I'm still not convinced that we can do this for many value-laden concepts we care about and end up with something matching CEV, partially due to complexity of value. Still, it's probably worth systematically studying the extent to which this will give the right answers for non-value-laden concepts, and then see what can be done about value-laden concepts.
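A rough sketch of that procedure, under strong simplifying assumptions: Kolmogorov complexity is uncomputable, so the code below substitutes a crude proxy (the number of literals in a conjunction over boolean features) and searches hypotheses from shortest to longest. The function names, the hypothesis class, and the toy data are all illustrative assumptions, not a proposal for how a real system would do this.

```python
from itertools import combinations, product


def predict(rule, x):
    # A rule is a tuple of (feature_index, required_value) pairs.
    # The empty rule classifies everything as a positive example.
    return all(x[i] == v for i, v in rule)


def agreement(rule, examples):
    # Fraction of human-labelled examples on which the rule matches the label.
    hits = sum(predict(rule, x) == y for x, y in examples)
    return hits / len(examples)


def simplest_concept(examples, n_features, threshold=0.9):
    # Enumerate conjunctions by increasing number of literals, a stand-in
    # for description length, and return the first one that agrees with the
    # human labels at least `threshold` of the time.
    for size in range(n_features + 1):
        for idxs in combinations(range(n_features), size):
            for vals in product([True, False], repeat=size):
                rule = tuple(zip(idxs, vals))
                if agreement(rule, examples) >= threshold:
                    return rule
    return None


# Toy data: the "true" concept is (feature 0 AND feature 1); feature 2 is
# irrelevant. One of the 16 human labels is deliberately wrong.
clean = [(x, x[0] and x[1]) for x in product([True, False], repeat=3)] * 2
examples = [(clean[0][0], not clean[0][1])] + clean[1:]

print(simplest_concept(examples, n_features=3))
```

On this data the search skips the empty rule and all single-literal rules, which agree with the labels too rarely, and returns the two-literal conjunction even though it disagrees with the one mislabelled example. That tolerance for disagreement is what the 90% threshold buys.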
Regularization is already a part of training any good classifier.
Roughly speaking, I mean optimizing for the causal-predictive success of a generative model, given not only a training set but a "level of abstraction" (something like tagging the training features with lower-level concepts, type-checking for feature data) and a "context" (ie: which assumptions are being conditioned-on when learning the model).
Again, roughly speaking, humans tend to make pretty blatant categorization errors (ie: magical categories, non-natural hypotheses, etc.), but we also are doing causal modelling in the first place, so we accept fully-naturalized causal models as the correct way to handle concepts. However, we also handle reality on multiple levels of abstraction: we can think in chairs and raw materials and chemical treatments and molecular physics, all of which are entirely real. For something like FAI, I want a concept-learning algorithm that will look at the world in this naturalized, causal way (which is what normal modelling shoots for!), and that will model correctly at any level of abstraction or under any available set of features, and will be able to map between these levels as the human mind can.
Basically, I want my "FAI" to be built out of algorithms that can dissolve questions and do other forms of conceptual analysis without turning Straw Vulcan and saying, "Because 'goodness' dissolves into these other things when I naturalize it, it can't be real!". Because once I get that kind of conceptual understanding, it really does get a lot closer to being a problem of just telling the agent to optimize for "goodness" and trusting its conceptual inference to work out what I mean by that.
Sorry for rambling, but I think I need to do more cog-sci reading to clarify my own thoughts here.
A technical point here: we don't learn a raw classifier, because that would just learn human judgments. In order to allow the system to disagree with a human, we need to use some metric other than "is simple and assigns high probability to human judgments".
I totally agree that a good understanding of multi-level models is important for understanding FAI concept spaces. I don't have a good understanding of multi-level maps; we can definitely see them as useful constructs for bounded reasoners, but it seems difficult to integrate higher levels into the goal system without deciding things about the high-level map a priori so you can define goals relative to this.
Right: and the metric I would propose is, "counterfactual-prediction power". Or in other words, the power to predict well in a causal fashion, to be able to answer counterfactual questions or predict well when we deliberately vary the experimental conditions.
To give a simple example: I train a system to recognize cats, but my training data contains only tabbies. What I want is a way of modelling that, while it may concentrate more probability on a tabby cat-like-thingy being a cat than a non-tabby cat-like-thingy, will still predict appropriately if I actually condition it on "but what if cats weren't tabby by nature?".
I think you said you're a follower of the probabilistic programming approach, and in terms of being able to condition those models on counterfactual parameterizations and make predictions, I think they're very much on the right track.
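As a minimal sketch of what I mean, here is a hand-rolled two-class generative model in plain Python (no probabilistic programming library). All of the probabilities are made up, and `theta`, standing for P(tabby | cat), is the only learned parameter. The point is just that the coat-pattern assumption is an explicit parameter that can be overridden when asking the counterfactual question.

```python
def learn_theta(n_tabby, n_cats):
    # Laplace-smoothed estimate of P(tabby | cat) from the training cats.
    return (n_tabby + 1) / (n_cats + 2)


def p_cat(cat_shaped, tabby, theta, prior=0.5,
          p_shape_cat=0.95, p_shape_other=0.05, p_tabby_other=0.1):
    # Posterior P(cat | observed features) under a naive generative model.
    # Every default value here is a hypothetical number for illustration.
    def lik(value, p):
        return p if value else 1.0 - p

    cat = prior * lik(cat_shaped, p_shape_cat) * lik(tabby, theta)
    other = (1.0 - prior) * lik(cat_shaped, p_shape_other) * lik(tabby, p_tabby_other)
    return cat / (cat + other)


# Training set: 50 cats, every one of them a tabby.
theta_learned = learn_theta(n_tabby=50, n_cats=50)

# A cat-shaped, non-tabby animal under the learned parameter:
# the model doubts that it is a cat.
factual = p_cat(cat_shaped=True, tabby=False, theta=theta_learned)

# The same animal, conditioned on "what if cats weren't tabby by nature?",
# expressed by overriding theta with a hypothetical lower value.
counterfactual = p_cat(cat_shaped=True, tabby=False, theta=0.3)

print(factual, counterfactual)
```

With the learned parameter the posterior for a non-tabby cat-shaped animal falls below one half, and with the counterfactual parameter it rises well above it. A purely discriminative classifier trained on the same data has no such knob to turn.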
Well, all real reasoners are bounded reasoners. If you just don't care about computational time bounds, you can run the Ordered Optimal Problem Solver as the initial input program to a Goedel Machine, and out pops your AI (in 200 trillion years, of course)!
I would tend to say that you should be training a conceptual map of the world before you install anything like action-taking capability or a goal system of any kind. Of course, I also tend to say that you should just use a debugged (ie: cured of systematic faults) model of human evaluative processes for your goal system, and then use actual human evaluations to train the free parameters, and then set up learning feedback from the learned concept of "human" to the free-parameter space of the evaluation model.
This seems like a sane thing to do. If this didn't work, it would probably be for one of the following reasons:
(1) Lack of conceptual convergence and human understandability. This seems somewhat likely and is probably the most important unknown.
(2) Our conceptual representations are only efficient for talking about things we care about because we care about these things; a "neutral" standard such as resource-bounded Solomonoff induction will do a horrible job of learning the things we care about, for "no free lunch" reasons. I find this plausible but not too likely (it seems like it ought to be possible to "bootstrap" an importance metric for deciding where in the concept space to allocate resources).
(3) We need the system to have a goal system in order to self-improve to the point of creating this conceptual map. I find this slightly likely (this is basically the question of whether we can create something that manages to self-improve without needing goals; it is related to low impact).
I agree that this is a good idea. It seems like the main problem here is that we need some sort of "skeleton" of a normative human model whose parts can be filled in empirically, and which will infer the right goals after enough training.