To belabor the point further, we of course don't primarily care about trees; we care about the concept 'human values'.
We have human values, and that's the way in which we care. We don't care about the quoted concept 'human values', in the main.
Thanks for pointing that out!
I wonder: we care about the things embedded in our values, e.g. trees or happy people. I know that you care about roughly the same things as I do, e.g. trees or happy people. In some way, I have a concept of 'your values' or 'what humans value'. Thus, doesn't a concept of 'human values' entail the things we value?
Understanding and transferring a concept of 'human values' would therefore suffice.
Concerning your specific example, "tree", there is a rich anthropological literature on folk taxonomy. You can find it by searching on the name "Brent Berlin" and then looking around (there's probably a Wikipedia entry). It seems, for example, that while most preliterate cultures have a word corresponding to "tree", they don't have one corresponding to "plant", nor, for that matter, do they have one corresponding to "animal". Moreover it seems that folk taxonomies start from the middle and build up and down from there.
With this sequence, we (Sam + Jan) want to provide a principled derivation of the natural abstractions hypothesis (which we will introduce in depth in later posts) by motivating it with insights from computational neuroscience.
Goals for this sequence are to (1) give a principled, neuroscience-motivated derivation of the natural abstractions hypothesis, (2) provide empirical evidence that natural abstractions emerge in real learning systems such as the brain, and (3) supplement John Wentworth's information-theoretic perspective with a more mechanistic account of how abstractions arise.
Author’s note: This is currently my (Sam’s) main research project, but it is also my first, so I'm happy to receive any feedback! Some of the original ideas and guidance come from Jan. I don’t expect you, the reader, to have solid background knowledge in any of the topics discussed, so whenever you get lost, I will try to get you back on board with a more high-level summary of what was just said.
Why are ‘abstractions’ relevant?
The higher-level problem we, as alignment researchers, are trying to solve is: ‘How do we teach an AI what we value?’ To simplify the question, we assume that we already know what we value[1]. Now we’ve got to teach an AI that we value “things out there in the world”, e.g. trees. Specifying “trees” should be easy, right?
… oh no …
In a similar vein, John Wentworth spells out the pointers problem[2]: ‘An AI should optimize for the real-world things I value, not just my estimates of those things’. He formalises the problem as follows:
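Roughly, and in our own notation rather than a verbatim quote (so treat the symbols below as an illustrative assumption): the things a human values are a function of latent variables in the human's world model, and the pointers problem asks which real-world structures, if any, those latent variables correspond to.

$$\text{human values} = v(\Lambda), \qquad \Lambda = \text{latent variables in the human's world model},$$

$$\text{pointers problem:} \quad \text{find } f \text{ such that } \Lambda \approx f(X_{\text{world}}), \text{ if such an } f \text{ exists at all.}$$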
Let’s apply JW’s definition to the trees example from above.
Eventually, you want to transfer something (the concept of a tree) from your map to the AGI’s map (that’s what JW calls the ‘agent's world model’).
There is also an analogous view on the pointers problem from classical philosophy, called the ‘problem of universals’. Similar to our story, we face issues when pointing towards the ‘universal of a tree’. As a concrete example: an oak, a pine, and a birch are all trees, so they appear to share something, 'treeness'; the problem of universals asks whether this shared 'treeness' exists over and above the particular trees, and if so, where and how.
As we will see, the natural part in ‘natural abstractions hypothesis’ suggests that we should expect such universals to live, in some sense, in the things themselves[4].
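As a rough preview of the formal version (our sketch, in the spirit of Wentworth's information-theoretic framing; later posts develop this carefully): an abstraction of a system $X$ is a compact summary $S(X)$ that retains exactly the information about $X$ that remains relevant 'far away', so that distant variables $Y$ depend on $X$ only through the summary:

$$P(Y \mid X) = P\big(Y \mid S(X)\big) \quad \text{for far-away variables } Y.$$

The hypothesis is then that a wide variety of agents learning about the same world converge on (roughly) the same summaries $S(X)$.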
‘Hebbian’ Natural Abstractions?!
After having read the previous sections, we want to keep in mind that pointing an AI at the things we value requires transferring concepts (abstractions) from our map to the AI's map, and that, remarkably, different humans' maps already seem to contain very similar concepts.
So, how is it that our maps are so similar? Is this the way things generally have to be? Does the emergence of concepts like ‘tree’ involve things like genetics? Should we expect aliens or artificial intelligence[6] to share the same understanding of trees? And what happens when they don’t understand us?
All these questions have something to do with our brains and how they learn. With this sequence, we want to explore exactly these questions, using the brain as a working example. We choose the simple and plausible[7] Hebbian learning rule, which models how the brain learns in an unsupervised way. The goal of this sequence is to provide empirical evidence for the so-called natural abstractions hypothesis in real intelligent agents and to give a more mechanistic explanation of how abstractions emerge. We supplement John Wentworth’s information-theoretic perspective with our perspective from neuroscience/biology.
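To make the 'Hebbian learning rule' concrete, here is a minimal sketch in Python. The toy data, the single linear neuron, and the use of Oja's normalised variant of the plain Hebbian update are our illustrative choices, not necessarily the exact model used later in the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D inputs: most of the variance lies along the first axis.
X = rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])

w = rng.normal(size=2)   # synaptic weights of a single linear neuron
eta = 0.01               # learning rate

for x in X:
    y = w @ x            # post-synaptic activity ("fire together")
    # Plain Hebb would be  w += eta * y * x,  which grows without bound.
    # Oja's variant adds a decay term that keeps the weight norm bounded:
    w += eta * y * (x - y * w)

print("learned direction:", w / np.linalg.norm(w))
```

Given enough samples, the weight vector converges (approximately) to the first principal component of the inputs, i.e. the single direction that best summarises the data: a first hint at how a purely local, unsupervised update can extract structure from its inputs.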
If you want to dig deeper into John Wentworth’s perspective and the relevance of abstractions, we refer you to John Wentworth's posts.
Future posts will talk about the mathematics behind 'Hebbian' Natural Abstractions and the empirical background.
Footnotes
To convince yourself that this is already hard enough, read the Sequences posts The Hidden Complexity of Wishes, Value is Fragile, and Thou Art Godshatter.
Actually, I first read about the pointers problem in Abram Demski’s posts. In a vague sense, other authors have talked about the problem at hand as well, though John Wentworth is the first to spell it out in this way. If you want to dig deeper into the pointers problem, read John Wentworth’s post, this, this and this. The issue goes back to the wireheading problem: we want to prevent a generally intelligent agent from realizing that it can stimulate its own sensors so that it receives the greatest reward all the time. To solve this issue, we have to tell the agent that it should optimize for our intended outcome, ‘the idea behind it’. Natural abstractions ought to be a way to specify exactly what we value.
Or whatever other being or event may have created them.
Natural here means that we should expect a variety of intelligent agents to converge on finding the same abstractions. Thus, abstractions are somehow embedded in the things themselves; otherwise we would expect agents to find different abstractions, or universals.
Definitely not a rictus grin, but a genuine smile. It does not understand what we are pointing at. But you do.
‘Aliens’ or artificial intelligence might have completely different computational limitations compared to humans. How do abstractions behave then?
The model is imperfect, but a suitable abstraction (huh) for talking about the topic at hand.