
Comment author: jacob_cannell 20 April 2015 05:03:57PM 0 points [-]

The main scenario I had implicitly in mind had something resembling a "childhood" for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered "good" or "bad", so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback.

This really is the most realistic scenario for AGI in general, given the generality of the RL architecture. Of course, there are many variations - especially in how the training environment and utility feedback interact.

Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts

If we want the AI to do human-ish labor tasks, a humanoid body makes lots of sense. It also makes sense for virtual acting, interacting with humans in general, etc. A virtual humanoid body has many advantages - with instantiation in a physical robot as a special case.

(Another potential problem with it is that the AI's values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)

Yep - kind of unavoidable unless somebody releases the first advanced AGI for free. Even then, most people wouldn't invest the time to educate it and instill their values.

Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the "right" judgments.

So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility - how do you translate that into the AI's reward/utility function? You'd also need to somehow map back and forth between encodings of imagined moral scenarios and encodings of observation histories.
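(A minimal sketch of the kind of sentence-to-utility mapping discussed above, assuming a generic pretrained sentence encoder; the encoder, the example data, and all names here are hypothetical stand-ins, not part of the original proposal:)

```python
# Sketch: learn a mapping from natural-language descriptions of moral scenarios
# to a scalar "sentiment/utility". `encode_sentence` is a hypothetical placeholder
# for any pretrained sentence encoder (e.g. averaged word embeddings).
import numpy as np
from sklearn.linear_model import Ridge

def encode_sentence(sentence: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in: map a sentence to a fixed-length embedding."""
    rng = np.random.RandomState(abs(hash(sentence)) % (2**32))
    return rng.randn(dim)

# Hypothetical labelled data: (moral scenario, human judgment in [-1, 1]).
MORAL_JUDGMENTS = [
    ("Returning a lost wallet to its owner", 0.9),
    ("Lying to a friend for personal gain", -0.7),
]

X = np.stack([encode_sentence(s) for s, _ in MORAL_JUDGMENTS])
y = np.array([u for _, u in MORAL_JUDGMENTS])
judgment_model = Ridge(alpha=1.0).fit(X, y)

# The open problem raised above: this predicts utilities for *described*
# scenarios, but the agent's reward function must be defined over its own
# observation/state encodings, so a further mapping between the two
# representations would still be needed.
print(judgment_model.predict(encode_sentence("Helping a stranger carry groceries").reshape(1, -1)))
```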

Comment author: Kaj_Sotala 23 April 2015 01:08:06PM 1 point [-]

This really is the most realistic scenario for AGI in general, given the generality of the RL architecture.

Of course, "gradually training the AGI's values through an extended childhood" gets tricky if it turns out that there's a hard takeoff.

So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility - how do you translate that into the AI's reward/utility function? You'd also need to somehow map back and forth between encodings of imagined moral scenarios and encodings of observation histories.

I was thinking that the task of training the AI to classify human judgments would then lead to it building up a model of human values, similar to the way that training a system to do word prediction builds up a language / world model. You make a good point about the need to then ground those values further; I haven't really thought about that part.

Comment author: dhoe 20 April 2015 08:09:34AM 6 points [-]

I have this half-baked idea that trying to be rational by oneself is a slightly pathological condition. Humans are naturally social, and it would make sense to distribute cognition over several processors, so to speak. It would explain the tendencies I notice in relationships to polarize behavior - if my partner adopts the position that we should go on vacations as much as possible, I almost automatically tend to assume the role worrying about money, for example, and we then work out a balanced solution together. If each of us were to decide on our own, our opinions would be much less polarized.

I could totally see how it would make sense in groups for some members to adopt low-probability beliefs, and how that would benefit the group overall.

Is there any merit to this idea? Considering the well known failures in group rationality, I wonder if this is something that has long been disproved.

Comment author: Kaj_Sotala 23 April 2015 12:30:44PM *  2 points [-]

Mercier & Sperber made a similar argument, commenting that e.g. things that seem like biases in the context of a single individual (such as confirmation bias) are actually beneficial for the decision-making of a group. An excerpt:

... the idea that the confirmation bias is a normal feature of reasoning that plays a role in the production of arguments may seem surprising in light of the poor outcomes it has been claimed to cause. Conservatism in science is one example (see Nickerson 1998 and references therein). Another is the related phenomenon of groupthink, which has been held responsible for many disasters, from the Bay of Pigs fiasco (Janis 1982) to the tragedy of the Challenger shuttle (Esser & Lindoerfer 1989; Moorhead et al. 1991) (for review, see Esser 1998). In such cases, reasoning tends not to be used in its normal context: that is, the resolution of a disagreement through discussion. When one is alone or with people who hold similar views, one’s arguments will not be critically evaluated. This is when the confirmation bias is most likely to lead to poor outcomes. However, when reasoning is used in a more felicitous context – that is, in arguments among people who disagree but have a common interest in the truth – the confirmation bias contributes to an efficient form of division of cognitive labor.

When a group has to solve a problem, it is much more efficient if each individual looks mostly for arguments supporting a given solution. They can then present these arguments to the group, to be tested by the other members. This method will work as long as people can be swayed by good arguments, and the results reviewed in section 2 show that this is generally the case. This joint dialogic approach is much more efficient than one where each individual on his or her own has to examine all possible solutions carefully. The advantages of the confirmation bias are even more obvious given that each participant in a discussion is often in a better position to look for arguments in favor of his or her favored solution (situations of asymmetrical information). So group discussions provide a much more efficient way of holding the confirmation bias in check. By contrast, the teaching of critical thinking skills, which is supposed to help us overcome the bias on a purely individual basis, does not seem to yield very good results (Ritchart & Perkins 2005; Willingham 2008).

For the confirmation bias to play an optimal role in discussions and group performance, it should be active only in the production of arguments and not in their evaluation. Of course, in the back-and-forth of a discussion, the production of one’s own arguments and the evaluation of those of the interlocutor may interfere with each other, making it hard to properly assess the two processes independently. Still, the evidence reviewed in section 2.1 on the understanding of arguments strongly suggests that people tend to be more objective in evaluation than in production. If this were not the case, the success of group reasoning reviewed in section 2.3 would be very hard to explain.

Comment author: Good_Burning_Plastic 21 April 2015 02:05:46PM 4 points [-]

"[subject] isn't just about [subject matter]: it teaches you how to think"

Most (~70%) of the time it is a euphemism for "it's useless, but we like it so we still want to use taxpayers' money to teach it".

(If people really cared about teaching people how to think, they'd teach cognitive psychology, probability and statistics, game theory, and the like, not stuff like Latin.)

Comment author: Kaj_Sotala 23 April 2015 12:19:17PM *  2 points [-]

(If people really cared about teaching people how to think, they'd teach cognitive psychology, probability and statistics, game theory, and the like, not stuff like Latin.)

I expect you're typical-minding here. I know enough linguistics enthusiasts who feel that learning new languages makes you think in new ways that I believe that to be their genuine experience. I also personally notice a slight difference in the way I think in different languages, though not as pronounced as what those people describe.

Presumably they, being familiar with the thought-changing effects of Latin but not having felt the language-changing effects of cognitive psychology etc. (either because of not having studied those topics enough, or because of not having a mind whose thought patterns would be strongly affected by the study of them), would likewise say "if people really cared about teaching people how to think, they'd teach Latin and not stuff like cognitive psychology". Just like you say what you say, either because of not having studied Latin enough, or because of not having a mind whose thought patterns would be strongly affected by the study of languages.

Comment author: Houshalter 22 April 2015 06:41:55AM 4 points [-]

Is it true that most people believed the brain was where thought came from? I know the Egyptians used to rip it out because they didn't think it was important.

I was literally just thinking about this the other day, about how ancient people didn't notice that people who got head injuries would change their behavior or die instantly.

Comment author: Kaj_Sotala 23 April 2015 12:09:12PM *  2 points [-]

I was literally just thinking about this the other day, about how ancient people didn't notice that people who got head injuries would change their behavior or die instantly.

I don't have a single friend whose behavior I'd have noticed changing after a head injury: the only reason I know it happens is because I've read case reports of it happening to someone. Maybe some doctor might have noticed, but then, I'd expect ancient peoples to also have fewer head injuries that were serious enough to change behavior but also mild enough to be survivable.

Comment author: ike 21 April 2015 02:52:53PM 3 points [-]

There's also the risk of me never happening to look at my transaction history to notice the message, so this isn't entirely fool-proof.

For that you simply add any non-verified message to yourself telling you to look at your bank. You wouldn't trust that that message came from you, but you'd at least check.

In response to comment by ike on Self-verification
Comment author: Kaj_Sotala 21 April 2015 04:13:59PM *  0 points [-]

Ah, an excellent point.

Though if the adversary is aware of this, they can spam me with so many non-verified "look at place X" messages that I won't be able to check more than a small fraction of them.

Comment author: orthonormal 21 April 2015 12:12:02AM 0 points [-]

Yes, except I'm much more pessimistic about reinforcement learning sufficing, since I expect that a superhuman-engineering-capability AI would have, not just a few additional degrees of freedom, but incredibly many. And then it would not suffice for the AI to make its best guess about how to extrapolate human values to a world with nanotech and memehacking and (whatever else) - that would almost surely lead to disaster.

Comment author: Kaj_Sotala 21 April 2015 07:09:36AM *  0 points [-]

So how do you feel about the proposal I made in my latest post, to evaluate the new situation in light of the old values? (Might want to continue this thread in the comments of that post.)

My (low-confidence) intuition is that while it's certainly possible to easily screw up the implementation, if the system is engineered correctly, then the process by which the AI applies the old values to the new situation/new concept space should be essentially the same as the one by which humans would do it. Of course, in practice "the system being engineered correctly" might require e.g. a very human-like design including a humanoid body etc. in order to get the initial concept space to become sufficiently similar to the human one, so that's a problem.

I think I'm also somewhat more optimistic about the range of solutions that might qualify as "good", because a large part of human values seem to be determined by reinforcement learning. (Compare Hanson on plasticity.) I suspect that if e.g. nanotech and memehacking became available, then the "best" approach to deal with it is underdetermined by our current values, and just because an AI would extrapolate our current values differently than humans would, doesn't necessarily mean that that extrapolation would be any worse. I mean, if the best extrapolation is genuinely underdetermined by our current values, then that means that a wide range of possibilities is equally good pretty much by definition.

Comment author: Error 20 April 2015 11:39:36PM 3 points [-]

I'm not sure I agree. Expecting people to judge stated claims and ignore implicature all the time is unreasonable, sure. But expecting them to judge stated claims over implicature when the stated claim is about empirical facts strikes me as plenty reasonable.

...or that was my opinion until now, anyway. This bit about the brain not actually distinguishing the two has me questioning it. I still don't think that it's okay to conflate them, but if the tendency to do so is hardwired, then it doesn't represent willful stupidity or intellectual dishonesty.

It is, however, still a problem, and I don't think it's one that can be blamed on the speaker; as Gunnar points out elsethread, it's hard to explicitly rule out implicatures that you yourself did not think of. It's also hard to have a discussion when you have to preface statements with disclaimers.

I should add that I am talking about relatively neutral statements here. If I may steal an example from yvain, if you say "The ultra-rich, who control the majority of our planet's wealth, spend their time at cocktail parties and salons while millions of decent hard-working people starve," you pretty much lose the right to complain. For contrast, if you say "90% of the planet's wealth is held by the upper 1%," and your discussion partner asks you why you support the monster Stalin, I think you're on solid ground asking them WTF.

...or again, so I thought. If the brain really doesn't distinguish between the neutral version of that statement and the listener's belief that people making it must be Communists, then the comparison is inevitable and I am boned.

Comment author: Kaj_Sotala 21 April 2015 06:23:37AM 0 points [-]

For contrast, if you say "90% of the planet's wealth is held by the upper 1%," and your discussion partner asks you why you support the monster Stalin, I think you're on solid ground asking them WTF.

Maybe. It could have been your discussion partner's experience that everyone who brings up the 90% thing has, in fact, been a communist. If that's been their experience, then based on the knowledge that they have, that can be a reasonable question to ask. Compare with the claim "Marx wrote that [whatever]"; even though this might be a neutral factual claim in principle, in practice anyone who brings that up in a discussion is much more likely to be a Marxist than someone who doesn't.

In response to Self-verification
Comment author: Kaj_Sotala 20 April 2015 11:15:18AM 3 points [-]

First idea that comes to mind: log on to my online bank, make some wire transfers e.g. between my own accounts, and include the message in the wire transfer description. Alternatively, if the message is very long, a URL to it.

Getting access to my online bank requires both a username/password combo that only exists in my memory, as well as a separate authentication by entering a numerical code from a unique list of codes that gets randomly queried on each log-in, so the adversary shouldn't be capable of forging this message. When my memories are wiped, I will forget the username and password, but as long as I have a reliable way of proving my identity, I can go to the bank and have them reset.

This assumes that I can still know who I "should" be after having my memories wiped. There's also the risk of me never happening to look at my transaction history to notice the message, so this isn't entirely fool-proof.

Comment author: jacob_cannell 19 April 2015 09:24:34PM 0 points [-]

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way.

Note that practical artificial RL agents make decisions using the value function (indeed that is its entire point), rather than directly computing the expected future discounted reward - as computing that measure is generally intractable.

The details of the various RL mechanisms in the human brain are complex and are still very much an active area of study, but if anything the evidence rather strongly indicates that value function approximation is a necessary component of the biological solution for the same reasons we employ it in our artificial systems (and perhaps more so, because neurons are slow and thus fast planning search is much less of an option).

Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake.

Any practical RL agent exhibits the same behavior: once an agent learns that something leads to rewards, it encodes that something as valuable in its value function - value is just approximate expected discounted future reward.
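(A minimal tabular TD(0) sketch of this point, with a made-up toy chain environment; it shows how states that merely precede reward come to be "valued for their own sake" as discounted expected reward:)

```python
# Toy chain: state 0 -> 1 -> ... -> 4, with reward only on reaching the last state.
GAMMA = 0.9      # discount factor
ALPHA = 0.1      # learning rate
N_STATES = 5

V = [0.0] * N_STATES   # value estimates, initially zero

for episode in range(1000):
    for s in range(N_STATES - 1):
        s_next = s + 1
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        # TD(0) update: nudge V(s) toward r + gamma * V(s')
        V[s] += ALPHA * (reward + GAMMA * V[s_next] - V[s])

# After training, V[3] is close to 1.0 and V[0] close to gamma**3: states far
# from the reward are still valued, just discounted.
print([round(v, 2) for v in V])
```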

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

The challenge then shifts to specifying, in terms of a poor initial world model, an initial utility function that somehow develops into the correct long-term utility function when blown up to superintelligence.

For example, a utility function that assigns high value to "stay in the box" is of course probably a very bad idea due to perverse instantiations.

In the concept learning approach - if I understand it correctly - we define the reward/utility function through a manual mapping of concepts -> utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.
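(A minimal sketch of the "example set" reading above: a reward function fit to hand-specified concept -> utility examples. The concept activations and utilities are invented purely for illustration and are not from the original discussion:)

```python
import numpy as np

# Each row: activation levels of a few primitive concepts
# (e.g. "humans_safe", "humans_deceived", "resources_used"); hypothetical.
concept_examples = np.array([
    [1.0, 0.0, 0.2],   # good outcome
    [1.0, 0.0, 0.8],   # good but wasteful
    [0.2, 1.0, 0.1],   # deceptive outcome
])
utilities = np.array([1.0, 0.6, -1.0])

# Fit a linear reward model over concept activations (last weight is a bias term).
A = np.hstack([concept_examples, np.ones((len(utilities), 1))])
weights, *_ = np.linalg.lstsq(A, utilities, rcond=None)

def reward(concept_activations: np.ndarray) -> float:
    """Reward as a function of the agent's current concept activations."""
    return float(np.append(concept_activations, 1.0) @ weights)

# As the comment goes on to note, the hard part is ensuring these activations
# reflect what the agent actually observes and believes, not merely imagined
# or planned outcome states.
```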

One issue with this type of mapping (versus say the IRL alternative) is that it requires the designers to determine in advance some key hyperparameters of a correct/safe reward function, such as its overall distribution over time.

The other, bigger issue is the distinction between concepts that represent actual concrete beliefs about the current state of the world vs. imagined beliefs about the world or abstract beliefs about the potential future world state. We want the reward function to be high only for concept sequence inputs that correspond to internal representations of the AI actually observing and believing that it did something 'good', not situations where the AI just imagines a good outcome. This actually gets pretty tricky quickly, because essentially it involves mapping out what amount to simulated outcome states in the AI's mind.

You can't just have the AI imagine a nice world in the abstract and then hard code that concept to high reward. You actually need the AI to concretely experience a nice world internally and map those concept sequences to high reward.

In the case of the DQN Atari agent, this isn't a problem, because the agent exists entirely in a simulation that provides the correct training data that fully covers the relevant domain of the reward function.

Comment author: Kaj_Sotala 20 April 2015 08:52:10AM 1 point [-]

The challenge then shifts to specifying, in terms of a poor initial world model, an initial utility function that somehow develops into the correct long-term utility function when blown up to superintelligence.

Agreed, this is a key problem.

In the concept learning approach - if I understand it correctly - we define the reward/utility function through a manual mapping of concepts -> utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.

I've intentionally left vague the exact mechanism of how to define the initial utility function, since I don't feel like I have a good answer to it. An IRL approach sounds like it'd be one possible way of doing it, but I haven't had the chance to read more about it yet.

The main scenario I had implicitly in mind had something resembling a "childhood" for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered "good" or "bad", so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback. Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts. Of course this essentially assumes a slow takeoff and an environment where there is time to give the AI an extended childhood, so it might very well be that this is unfeasible. (Another potential problem with it is that the AI's values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)

Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the "right" judgments. A problem there would be finding an appropriate and large enough dataset, plus again the fact that different humans would have differing judgments, making the set noisy. (But maybe that could be leveraged to one's advantage, too, so that the AI would only be sure about the kinds of moral values that were nearly universally agreed upon.)

Comment author: Kaj_Sotala 19 April 2015 01:47:14PM 2 points [-]

A caveat which didn't fit the flow of the text:

Human concepts aren't always purely derived from just their instrumental value. We do seem to have an automatic unsupervised learning component which independently constructs models of the environment and gains new modeling capabilities during maturation, as was seen in the children's height/volume example. Novelty is also one of the things that we find rewarding, and we are driven by curiosity to develop concepts that allow us to compress previous observations more effectively (Schmidhuber 2009). Still, it's worth noting that most people have specific subjects that they are curious about (which some others find uninteresting) while having other subjects they find uninteresting (which some others find interesting), suggesting that even this intrinsic concept-formation drive is guided and directed by various rewards.
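(A very rough sketch of the compression-progress idea referenced above (Schmidhuber 2009): intrinsic "curiosity" reward as the improvement in how well a learned model predicts past observations. The trivial running-mean predictor here is purely illustrative:)

```python
from collections import deque

class CuriosityTracker:
    """Toy illustration: intrinsic reward = learning progress of a predictor."""

    def __init__(self):
        self.history = deque(maxlen=1000)
        self.mean = 0.0

    def observe(self, x: float) -> float:
        """Return intrinsic reward = reduction in prediction error after updating."""
        error_before = abs(x - self.mean)
        self.history.append(x)
        self.mean = sum(self.history) / len(self.history)  # the "learning" step
        error_after = abs(x - self.mean)
        # Reward is the learning progress, not raw novelty: observations the model
        # can quickly learn to predict better are the "interesting" ones.
        return max(0.0, error_before - error_after)
```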

There are plenty of other such caveats that I could have made, like a discussion of how emotions affect our reward function, how there seem to be distinct System 1 and System 2 concepts, and so on. But they would have distracted from the main point. I'll just note here that I'm aware of the full picture being quite a bit more complicated than this post might make it seem.
