Thanks for your answer, Ben!
First of all, all of these methods involve integrating the AGI into human society. So the AGI is forming its values, at least in part, through doing something (possibly talking) and getting a response from some human. That human will be interpreting the AGI's answers, and selecting the right response, using their own theory of the AGI's mind - nearly certainly an anthropomorphisation! Even if that human develops experience dealing with the AGI, their understanding will be limited (as our understanding of other humans is limited, except worse than that).
So the AGI programmer is taking a problem that they can't solve through direct coding, and putting the AGI through interactions so that it will acquire the values that the programmer can't specify directly, in settings where the other interactors will be prone to anthropomorphisation.
I.e.: "I can't solve this problem formally, but I do understand its structure well enough to be reasonably sure that anthropomorphic interactions will solve it".
If that's the claim, I would expect the programmer to be very schooled in the properties and perils of anthropomorphisation, and to cast their arguments, as much as possible, in formal logic or code form. For instance, if we want the AGI to "love" us: what kind of behaviour would we expect this to entail, and why would this code acquire that behaviour from these interactions? If you couldn't use the word love, or any close synonyms, could you still describe the process and show that it will perform well? If you can't describe love without saying "love", then you are counting on a shared non-formalised human understanding of what love is, and hoping that the AGI will stumble upon the same understanding - you don't know the contours of the definition, and the potential pitfalls, but you're counting on the AGI to avoid them.
As for the four types of behaviour that I mentioned there, and that we need to separate: don't just decry the use of anthropomorphisation in the description, but say which parts of the OpenCog system will be used to distinguish between them, and to select the friendly behaviour rather than the others. You know how your system works - reassure me! :-)
Stuart -- Yeah, the line of theoretical research you suggest is worthwhile....
However, it's worth noting that I and the other OpenCog team members are pressed for time, and have a lot of concrete OpenCog work to do. It would seem none of us really feels like taking a lot of time, at this stage, to carefully formalize arguments about what the system is likely to do in various situations once it's finished. We're too consumed with trying to finish the system, which is a long and difficult task in itself...
I will try to find some time in the near term to ...
At the current AGI-12 conference, some designers have been proponents of keeping AGIs safe by bringing them up in human environments, providing them with interactions and feedback in a similar way to how we bring up human children. Obviously that approach would fail for a fully smart AGI with its own values - it would pretend to follow our values for as long as it needed, and then defect. However, some people are confident that if we started with a limited, dumb AGI, then we could successfully inculcate our values in this way (a more sophisticated position would be that though this method would likely fail, it's more likely to succeed than a top-down friendliness project!).
The major criticism of this approach is that it anthropomorphises the AGI - we have a theory of children's minds, constructed by evolution, culture, and our own child-rearing experience. We then project this onto the alien mind of the AGI, assuming that if the AGI presents behaviours similar to those of a well-behaved child, then it will become a moral AGI. The problem is that we don't know how alien the AGI's mind will be, or whether our reinforcement is actually reinforcing the right thing. Specifically, we need to be able to find some way of distinguishing between: