It seems like a solid symbol grounding solution would allow us to delegate some of the work of translating vague intuitions about alignment into actual policies. In particular, there seems to be a correspondence between CIRL and symbol grounding: systems that are aware they do not know the goal they should optimize are similar to symbol-grounding machines that are aware there is a difference between the literal content of instructions and the desired behaviour the instructions represent (although the instructions might be even more abstract symbols than words).
Is there any literature you're aware of that would propose a seemingly robust alignment solution in a world where we have solved symbol grounding? For example, Yudkowsky suggests Coherent Extrapolated Volition and proposes a sentence or so of English, but since machines cannot execute English it's not clear whether this was meant literally or more as a vague gesture at important properties a solution might have.
The similarity between value extrapolation and symbol grounding (similar to how you stated it) is why I suspect that solving one may solve the other.
I've been using the term symbol-grounding in a slightly different way to how people often use it. I've been using it to mean, roughly: a symbol X, inside an agent, is well grounded (to a particular referent) if X is strongly correlated with the referent of that symbol, across many different environments.
What do I mean by "referent of that symbol"? I mean the actual object or concept that the symbol is referring to[1] ("referring", hence referent).
Whereas classical symbol grounding is closer to: the symbol is grounded if the agent understands what the symbol means - what it actually refers to in the world.
My definition is empirical, and implies the classical definition (though the reverse is not true): if an entity uses a symbol reliably enough for me to know that it tracks the real thing, then I can generally assume they understand what they're talking about.
Here I'll look at the connection between the definition I've been using and the classical one, and see how we could get empirical evidence for the classical one as well - what would it take to convince us that this agent actually understands what a symbol means?
Sufficient, not necessary
Let's consider a knowledgeable meteorologist, with a full suite of equipment, motivated to figure out the temperature of a room. According to my definition, if we could detect the variable corresponding to "temperature" in their head, this variable would be well grounded to the actual temperature[2]; the same would be true for the classical definition: the meteorologist understands what temperature is.
We can contrast that with a digital thermometer, lying in the room. The readout of that thermometer is not grounded, according to either definition.
Let's put this into a causal graph:
We have the true temperature, the thermometer readout, and the meteorologist's verdict. These three nodes are highly correlated under standard situations. Why then do I feel that the meteorologist's verdict was well grounded while the thermometer's readout was not?
Because the meteorologist could adapt to interfering variables in a way that the thermometer could not. Suppose the thermometer was put in the sun or in a fridge, or its readout lost power and started flashing 99.99. The thermometer could not adapt or correct its estimate, but the meteorologist could: if they found their own thermometer was corrupted, they could move it around, fix it, mentally correct for the error, or simply ignore it and get a more accurate estimate some other way.
And in doing that, they would demonstrate an understanding of the causal structure of the world and how it worked. The thermometer in the sun is wrong because it continues to absorb radiation and shows a continuously increasing temperature. The thermometer in the fridge is obviously placed in a special location within the room. And so on.
Thus, if the meteorologist can keep their estimate correlated with the true temperature, even in complicated and interfering situations, it demonstrates that they understand[3] how the variable in question works and how it connects with other aspects of the world. Thus my definition (correlated variables across many environments) implies the standard definition (understanding of what the variable is or means).
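To make this concrete, here is a minimal Python sketch of the empirical test. The environments, numbers, and function names are all invented for illustration: a fixed thermometer and an adaptive meteorologist are each checked for correlation with the true temperature across normal and interfering environments.

```python
import random
import statistics

# Illustrative environments, including interfering ones (all numbers invented).
ENVIRONMENTS = ["normal", "in_sun", "in_fridge", "power_loss"]

def true_temperature():
    """The referent: the room's actual temperature, in degrees C."""
    return random.uniform(15.0, 30.0)

def thermometer_readout(temp, env):
    """A fixed device: it cannot detect or correct for interference."""
    if env == "in_sun":
        # Keeps absorbing radiation: the readout tracks sun exposure, not the room.
        return 35.0 + random.uniform(0.0, 20.0)
    if env == "in_fridge":
        # Sitting in a special cold spot: tracks the fridge, not the room.
        return random.uniform(2.0, 6.0)
    if env == "power_loss":
        return 99.99  # flashing nonsense
    return temp + random.gauss(0.0, 0.3)

def meteorologist_estimate(temp, env):
    """An agent that notices interference, distrusts the corrupted instrument,
    and falls back on other (noisier but sane) ways of measuring."""
    if env == "normal":
        return thermometer_readout(temp, env)
    return temp + random.gauss(0.0, 1.0)

def correlation(xs, ys):
    try:
        return statistics.correlation(xs, ys)
    except statistics.StatisticsError:  # e.g. a constant, flashing readout
        return 0.0

def is_grounded(estimator, n_samples=200, threshold=0.9):
    """Empirical grounding: the symbol stays correlated with its referent
    in *every* environment, not just the friendly ones."""
    for env in ENVIRONMENTS:
        temps = [true_temperature() for _ in range(n_samples)]
        symbols = [estimator(t, env) for t in temps]
        if correlation(temps, symbols) < threshold:
            return False
    return True

print("thermometer grounded?  ", is_grounded(thermometer_readout))     # False
print("meteorologist grounded?", is_grounded(meteorologist_estimate))  # True
```

On this toy setup, the thermometer's readout loses its correlation as soon as the environment interferes with it, while the meteorologist's estimate keeps tracking the referent - which is exactly the asymmetry the argument above relies on.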
Lazy, but understanding
The next day, our previously motivated meteorologist is lazy or indifferent - they're no longer on the clock. Now they have no real interest in the room's true temperature, but they are willing to state a vague impression of it.
In this case, my definition of grounding fails: the meteorologist's symbol is poorly correlated with the temperature, and is relatively easy to throw off.
Nevertheless, we should still say that the meteorologist's understanding of temperature is well grounded - after all, it's the same person! The classical definition agrees. But is it something we can define empirically?
It seems we can; there are two ways we could cash out "the meteorologist's symbol X is well grounded as referring to the temperature T, even if X and T are poorly correlated". The first is to imagine a counterfactual world in which the meteorologist was motivated to get the correct temperature; then we can reduce this definition to the first one - "X could be well correlated with T, if the meteorologist really wanted it to be".
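As a toy illustration of that counterfactual reading, in the same style as the sketch above (the LazyMeteorologist class and its numbers are made up): the same kind of agent's symbol is poorly correlated as things stand, but becomes well correlated once the motivation switch is flipped.

```python
import random
import statistics

def true_temperature():
    return random.uniform(15.0, 30.0)

class LazyMeteorologist:
    """Toy agent: gives a vague impression unless motivated to be careful."""
    def __init__(self, motivated=False):
        self.motivated = motivated

    def estimate(self, temp):
        if self.motivated:
            return temp + random.gauss(0.0, 0.5)                         # careful measurement
        return 20.0 + 0.1 * (temp - 20.0) + random.gauss(0.0, 3.0)       # "warmish, I guess"

def well_correlated(agent, n=300, threshold=0.9):
    temps = [true_temperature() for _ in range(n)]
    symbols = [agent.estimate(t) for t in temps]
    return statistics.correlation(temps, symbols) >= threshold

print(well_correlated(LazyMeteorologist(motivated=False)))  # False: X is poorly correlated with T
# Counterfactual test: the same kind of agent, with motivation switched on.
print(well_correlated(LazyMeteorologist(motivated=True)))   # True: X *could* track T
```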
Another way would be to look directly for the "understanding of what the variable means". The meteorologist's internal model of temperature would demonstrate that they know how temperature connects with other features of the world. They would know what happens to thermometers under sunlight, even if they don't currently have a sunlit thermometer to hand, or currently care about the temperature.
Thus I'd extend my definition to, roughly: a symbol X, inside an agent, is well grounded as referring to a variable V if X is correlated with V across many environments, or could be so correlated if the agent were motivated - because the agent's internal model correctly captures how V behaves and connects with other features of the world.
This definition of grounding connects syntax (the behaviour of symbols in the agent's model) with the behaviour of features of the world.
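Here is one toy way to cash out that syntax-to-world connection (the interventions, numbers, and function names are invented): compare what the agent's internal model predicts a thermometer would show under various interventions with what actually happens, without the agent taking any current measurement.

```python
INTERVENTIONS = ["none", "in_sun", "in_fridge", "power_loss"]

def world_readout(true_temp, intervention):
    """Ground truth: what a thermometer actually ends up showing."""
    if intervention == "in_sun":
        return true_temp + 12.0   # absorbs extra radiation, reads too hot
    if intervention == "in_fridge":
        return 4.0                # reads the fridge, not the room
    if intervention == "power_loss":
        return 99.99              # flashing nonsense
    return true_temp

def agents_model(true_temp, intervention):
    """The agent's *prediction* of the same quantity, from its internal model,
    with no measurement being taken."""
    if intervention == "in_sun":
        return true_temp + 10.0   # roughly right: "it will read too hot"
    if intervention == "in_fridge":
        return 5.0                # roughly right: "it will read the fridge"
    if intervention == "power_loss":
        return 99.99              # knows the flashing-display failure mode
    return true_temp

def understands(model, tolerance=3.0):
    """Extended grounding: the agent's model of the symbol tracks how the
    referent behaves under interventions, even while the agent is lazy."""
    return all(
        abs(model(temp, iv) - world_readout(temp, iv)) <= tolerance
        for iv in INTERVENTIONS
        for temp in (16.0, 21.0, 28.0)
    )

print(understands(agents_model))  # True: the model's syntax matches world behaviour
```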
Grounding as text
In my previous post on grounding, I talked about GPT-3 having a grounding "as text". The example was a bit difficult to follow; here is a hopefully clearer version, using the new, model-based definition of grounding.
In the following graph, there are various weather phenomena, and resulting temperatures. There are also things that people write down about the weather phenomena, and what they write about temperature:
If GPT-3 has a textual-only grounding of these concepts, then it understands the causal structure inside the box (represented in a simplified form by the green arrow). So it understands how the use of certain words (eg "snow") connects with the use of other words (eg "cold").
A meteorologist understands the causal structure represented by the blue arrow - how actual snow connects with a reduction in actual temperature. When we reflect on how people write weather information in their diaries or articles, we are speculating about the red arrows, which connect the real world concepts with their textual counterparts.
These skills might be quite separate! GPT-3 might understand neither the red nor the blue arrows (this is something that could be tested, as I wrote in the previous post). An obsessive meteorologist might have no idea about the red arrows, and only a very indirect understanding of the green one. Journalists might understand the red arrows (and maybe the green one), with no clue about the blue one. Since all these arrows are simplifications of complicated causal connections, different people have different levels of understanding of them, grasping some well and others more poorly.
It is plausible that GPT-3's textual understanding is better than that of any human, at least in some ways; maybe no-one has a better understanding of the green arrow than GPT-3 does.
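As a toy version of the three kinds of arrows (all the words, temperatures, and mappings below are invented stand-ins, not claims about GPT-3): the blue arrow maps weather to temperature, the red arrows map both to text, and a text-only model captures the green arrow without ever touching the blue one.

```python
import random

# Toy stand-ins for the arrows in the diagram (all numbers and words invented).

def real_temperature(weather):            # blue arrow: real weather -> real temperature
    base = {"snow": -3.0, "rain": 8.0, "sun": 24.0}[weather]
    return base + random.gauss(0.0, 2.0)

def written_weather(weather):             # red arrow: weather phenomena -> text about them
    return {"snow": "snowing", "rain": "raining", "sun": "sunny"}[weather]

def written_temperature(temp):            # red arrow: temperature -> text about it
    if temp < 0.0:
        return "freezing"
    return "cold" if temp < 12.0 else "warm"

def text_only_model(weather_word):        # green arrow: weather words -> temperature words
    """A GPT-3-like mapping from text to text, with no access to actual
    weather or thermometers."""
    return {"snowing": "freezing", "raining": "cold", "sunny": "warm"}[weather_word]

# How well does the text-only model capture the green arrow?  Compare its
# predicted temperature-word with the word people would actually have written.
hits, n = 0, 1000
for _ in range(n):
    weather = random.choice(["snow", "rain", "sun"])
    predicted = text_only_model(written_weather(weather))
    actual = written_temperature(real_temperature(weather))
    hits += (predicted == actual)
print(hits / n)  # high: the green arrow is captured, using no blue-arrow knowledge
```

Note that the check itself uses the blue and red arrows to generate the "what people would actually have written" ground truth, but the model being tested only ever sees text - which is the sense in which its grounding is "as text".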
Instead of asking "is this symbol well grounded?", I ask "is the symbol well grounded to this referent?" This allows us to figure out that two entities might be using the same word to refer to different things (eg "Bank" as a building vs "Bank" as an institution, while a third agent is using "Banque"); we don't have to assume that some symbol has inherent meaning, just that it appears, empirically, to be used to have a certain meaning in this context. ↩︎
It may be tricky to figure out this variable inside the human head. We could instead use the meteorologist's pronouncements or writings (when they say or write what the temperature is). As long as we think that this reflects the internal symbol accurately - we assume the meteorologist has no incentive to lie, is capable of correct articulation, etc... - we can use that instead.
For an AI, it may be possible to directly inspect the symbols in its head. ↩︎
There is a certain degree of interpretation here. Suppose that someone comes in and lies to the meteorologist about some heat-related phenomena. Suppose further that the meteorologist knows and respects this person, and therefore "corrects" their estimates to the wrong value.
We could interpret this as the meteorologist showing a lack of understanding (they didn't fully "get" temperature if a lie could throw them off), or instead just making a mistake (they estimated they could trust the liar).
It is up to our judgement which mistakes are acceptable and which ones show a lack of understanding. Thus "symbol X in person P is well-grounded as variable V" can be, to some extent, a judgment call. ↩︎