I'm still vague on how the interpretation actually works. What connects the English sentence "it's raining" to the epistemology module's rainfall indicator? Why can't "it's raining" be taken to mean the proposition 2+2=4?
I'm not sure what you don't understand, so I'll explain a few things in that area and hope I hit the right one:
I give sentences their English names in the example to make it understandable. Here are two ways you could give more detail on the example scenario, each of which is consistent:
Sentences in the interpreter language are connected to the epistemology engine simply by supposition. The interpreter language is how the interpreter internally expresses its beliefs; otherwise it's not the interpreter language.
"It's raining" as a sentece of the interpreter language can't be taken to mean "2+2=4" because the interpreter language doesn't need to be interpreted, the interpreter already understands it. "It's raining" as a string sent by the speaker can be taken to mean "2+2=4". It really depends on the prior - if you start out with a prior thats too wrong, you'll end up with nonesense interpretations.
I don't mean the internal language of the interpreter; I mean the external language, the human literally saying "it's raining." It seems like there's some mystery process that connects observations to hypotheses about what some mysterious other party "really means" - but if this process ever connects the observations to propositions that are always true, it seems like that gets most favored by the update rule, and so "it's raining" (spoken aloud) meaning 2+2=4 (in internal representation) seems like an attractor.
It seems like there's some mystery process that connects observations to hypotheses about what some mysterious other party "really means"
The hypotheses do that. I said:
We start out with a prior over hypotheses about meaning. Such a hypothesis generates a probability distribution over all propositions of the form "[Observation] means [proposition]." for each observation (including the possibility that the observation means nothing).
Why do you think this doesn't answer your question?
but if this process always (sic) connects the observations to propositions that are always true, it seems like that gets most favored by the update rule, and so "it's raining" (spoken aloud) meaning 2+2=4 (in internal representation) seems like an attractor.
The update rule doesn't necessarily favor interpretations that make the speaker right. It favors interpretations that make the speaker's meta-statements about meaning right - in the example case the speaker claims to mean true things, so these fall together. Still, does the problem not recur on a higher level? For example, a hypothesis that never interpreted the speaker to be making such meta-statements would have him never be wrong about that. Wouldn't it dominate all hypotheses with meta-statements in them? No, because hypotheses aren't rated individually. If I just took one hypothesis, got its interpretations for all the observations, saw how likely that total interpretation was to make the speaker wrong about meta-statements, and updated based on that, then your problem would occur. But actually, the process for updating a hypothesis also depends on how likely you consider the other hypotheses:
To update on an observation history, first we compute for each observation in it our summed prior distribution over what it means. Then, for each hypothesis in the prior, for each observation, take that hypothesis's distribution over its meaning, combine it with the prior distribution over all the other observations, and calculate the probability that the speaker's statements about what he meant were right. After you've done that for all observations, multiply them to get the score of that hypothesis. Multiply each hypothesis's score by its prior and renormalize.
So if your prior gives most of its weight to hypotheses that interpret you mostly correctly, then the hypothesis that you never make meta-statements will also be judged by its consistency with those mostly correct interpretations.
Abram Demski has been writing about Normativity. The suggested models so far have mostly looked at actions rather than semantics, despite suggestions that this is possible and despite language learning being a motivating example. There is a simple mechanism that seems to me to mostly fit that bill.
Model
There is an interpreter and an assumed speaker. The interpreter receives observation data as an input, which contains, among other things, some things put there by the speaker. The interpreter has a language in which it can express all its beliefs. Since we want to form beliefs about meaning, this language can talk about meaning: it can form propositions of the form "[Observation] means [proposition]." Note that this is different from "[Observation] implies [proposition]." At least initially, "means" is not related to anything else. The interpreter also has an epistemology module that forms its beliefs about things other than meaning.
We follow a simple prior-update paradigm. We start out with a prior over hypotheses about meaning. Such a hypothesis generates a probability distribution over all propositions of the form "[Observation] means [proposition]." for each observation (including the possibility that the observation means nothing). Updating is based on the principle that the speaker is authoritative about what he means: our interpretation of what he's saying should make the things we interpret him to say about what he's saying true. To update on an observation history, first we compute for each observation in it our summed prior distribution over what it means. Then, for each hypothesis in the prior, for each observation, take that hypothesis's distribution over its meaning, combine it with the prior distribution over all the other observations, and calculate the probability that the speaker's statements about what he meant were right. After you've done that for all observations, multiply them to get the score of that hypothesis. Multiply each hypothesis's score by its prior and renormalize. Then take the resulting probabilities as your new prior, and repeat indefinitely.
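As a rough formalization of this rule (my own notation, offered as a sketch rather than anything canonical): write $h(o_i \mapsto m)$ for the probability that hypothesis $h$ gives to observation $o_i$ meaning $m$, and $\bar{P}(o_j \mapsto m) = \sum_{h'} P(h')\, h'(o_j \mapsto m)$ for the summed prior distribution. Then

$$\mathrm{score}(h) = \prod_i \sum_m h(o_i \mapsto m)\,\Pr\!\big(\text{the speaker's statements about meaning hold} \,\big|\, o_i \mapsto m,\ o_j \sim \bar{P} \text{ for } j \neq i\big),$$

$$P_{\mathrm{new}}(h) \propto P(h)\,\mathrm{score}(h).$$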
Let's run an example of this. Your prior is 50% hypothesis A, 50% hypothesis B. There are only two observations. A gives observation1 90% of meaning "It's cold" and 10% of meaning nothing, and gives observation2 80% of meaning "When I describe a local state, the state I mean always obtains" and 20% of meaning nothing. B gives observation1 90% of meaning "It's raining" and 10% of meaning nothing, and gives observation2 80% of meaning "When I describe a local state, the state I mean obtains with 60% probability" and 20% of meaning nothing. The epistemology module says 10% it's cold, 30% it's raining.
First we create the total prior distribution: it gives observation1 0.5∗0.9=0.45 of meaning "It's cold", 0.5∗0.9=0.45 of meaning "It's raining", and 2∗0.5∗0.1=0.1 of meaning nothing, and gives observation2 0.5∗0.8=0.4 of meaning "When I describe a local state, the state I mean always obtains", 0.5∗0.8=0.4 of meaning "When I describe a local state, the state I mean obtains with 60% probability", and 2∗0.5∗0.2=0.2 of meaning nothing.
Evaluating hypothesis A: for the first observation, in the 0.1 cases where it means nothing, everything is consistent. In the 0.9 cases where it means "It's cold": in the 0.2 cases where the second observation means nothing, it's consistent. In the 0.4 cases where the second means he's always right, it's consistent only in the 0.1 cases where it really is cold. In the 0.4 cases where the second means he's right 60% of the time, there's a 0.1 chance he's right, which he gave 0.6, and a 0.9 chance he's wrong, which he gave 0.4. Overall this comes out to 0.1+0.9(0.2+0.4∗0.1+0.4(0.1∗0.6+0.9∗0.4))=0.4672.
For the second observation, a similar breakdown gives 0.2+0.8(0.1+0.45∗0.1+0.45∗0.3)=0.424. Together that makes 0.4672∗0.424=0.1981. For hypothesis B, the first observation gives 0.1+0.9(0.2+0.4∗0.3+0.4(0.3∗0.6+0.7∗0.4))=0.5536. The second observation gives 0.2+0.8(0.1+0.45(0.1∗0.6+0.9∗0.4)+0.45(0.3∗0.6+0.7∗0.4))=0.5968, for a total score of 0.5536∗0.5968=0.3304. Multiplying each by its prior of 0.5 and normalizing, we get 37.48% for A and 62.52% for B.
This gets us new total probabilities: observation1 0.34 "It's cold", 0.56 "It's raining", 0.1 nothing; observation2 0.3 "always right", 0.5 "right 60% of the time", 0.2 nothing. Calculating through the hypotheses again gives us 28% A and 72% B, and if we keep going this will trend to 100% B. Is it bad that we become certain after just two sentences? Not necessarily, because that's assuming those are the entire history of observations. If you think there will be future observations, you'll have to do this interpretation process for each possible future separately, and then aggregate them by the probability you give those futures.
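For concreteness, here is a small Python sketch that reproduces these numbers by iterating the update on the two-observation example. The code and its names (`meta_holds`, `consistency`, and the "always"/"sixty" labels for the two meta-statements) are my own illustration of the procedure above, not anything from the original discussion:

```python
# A minimal sketch of the update on the two-observation example.
# "always" = "When I describe a local state, the state I mean always obtains"
# "sixty"  = "When I describe a local state, the state I mean obtains with 60% probability"

# Epistemology module: probabilities for the object-level propositions.
WORLD = {"cold": 0.1, "raining": 0.3}

# Hypotheses about meaning: for each observation, a distribution over meanings.
HYPOTHESES = {
    "A": {"obs1": {"cold": 0.9, "nothing": 0.1},
          "obs2": {"always": 0.8, "nothing": 0.2}},
    "B": {"obs1": {"raining": 0.9, "nothing": 0.1},
          "obs2": {"sixty": 0.8, "nothing": 0.2}},
}
OBSERVATIONS = ("obs1", "obs2")

def meta_holds(meta, claim_prob):
    """Probability that the meta-statement `meta` comes out right about a
    claim whose truth the epistemology module gives probability `claim_prob`."""
    if meta == "always":
        return claim_prob
    if meta == "sixty":
        return claim_prob * 0.6 + (1.0 - claim_prob) * 0.4
    return 1.0  # "nothing" or an object-level meaning makes no meta-claim

def consistency(meaning, other_dist):
    """Probability that the speaker's statements about meaning are right,
    given this observation means `meaning` and the other observation's
    meaning is drawn from the summed prior mixture `other_dist`."""
    if meaning == "nothing":
        return 1.0
    total = 0.0
    for other, p in other_dist.items():
        if meaning in WORLD:   # object-level claim, judged by the other's meta-statement
            total += p * meta_holds(other, WORLD[meaning])
        else:                  # meta-statement, judged against the other's object-level claim
            total += p * (meta_holds(meaning, WORLD[other]) if other in WORLD else 1.0)
    return total

def update(prior):
    # Summed prior distribution over what each observation means.
    mixture = {obs: {} for obs in OBSERVATIONS}
    for h, p_h in prior.items():
        for obs, dist in HYPOTHESES[h].items():
            for m, p in dist.items():
                mixture[obs][m] = mixture[obs].get(m, 0.0) + p_h * p
    # Score each hypothesis against the mixture over the *other* observation.
    weighted = {}
    for h, p_h in prior.items():
        score = 1.0
        for obs in OBSERVATIONS:
            other = "obs2" if obs == "obs1" else "obs1"
            score *= sum(p * consistency(m, mixture[other])
                         for m, p in HYPOTHESES[h][obs].items())
        weighted[h] = p_h * score
    norm = sum(weighted.values())
    return {h: w / norm for h, w in weighted.items()}

prior = {"A": 0.5, "B": 0.5}
for step in range(1, 4):
    prior = update(prior)
    print(step, {h: round(p, 4) for h, p in prior.items()})
```

Running this prints roughly 37.48%/62.52% after the first update and about 28%/72% after the second, with the weight continuing to drift toward B on further iterations.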
Evaluation
Why would this be a good way to learn meaning? The updating rule is based on taking the speaker's feedback about what he means. This means that if we already have a pretty good idea of what he means, we can improve it further. The correct meaning will be a fixed point of this updating rule - that is, there's no marginal reinterpretation of it where your explanations of what you mean would better fit what you're saying. So if your prior is close enough to correct that it's in the basin of attraction of the correct fixed point, you're set. Unlike the problems with getting a good enough prior that came up in the context of the no-free-lunch theorems, however, the priors here boil down to just a few fixed points, so there is a limit to how much precision is needed in the prior.
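In the notation of the sketch above, a fixed point is a prior that the update rule maps back to itself:

$$P^*(h) = \frac{P^*(h)\,\mathrm{score}_{P^*}(h)}{\sum_{h'} P^*(h')\,\mathrm{score}_{P^*}(h')} \quad \text{for all } h,$$

where $\mathrm{score}_{P^*}$ is computed using the summed distribution $\bar{P}$ derived from $P^*$ itself.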
Let's see how this scores on the desiderata for normativity learning (most recent version here):
Ontology
There were two more things mentioned as goals. The first is having a concept of superhuman performance, which the model clearly has. The second is to preserve the meaning of feedback through ontological crisis. Here I think we can make progress, because the concept of the interpreter language gives us a handle on it.
If you transition to a new ontology, you already need to have some way to express what you're doing in the old one. You need to have an explanation of what using the new system consists in and why you think that's a good idea. The interpreter language can express everything the interpreter believes, including this. So you can already talk about the new ontology, if not in it. And you can make statements about statements in the new ontology, like "[NewProposition] is true in NewOntology". For example, humans developed ZFC, and we can make statements like ""1+1=2" is a theorem of ZFC". And X and "X is true" are more or less the same statement, so you already have a handle on particular propositions of the new ontology. In particular, your prior already assigns probabilities to observations meaning propositions of the new ontology (e.g. "ABC means ['1+1=2' is a theorem of ZFC]" is an interpreter-language statement that priors have opinions on). So ontological changes are already formally accounted for, and the process can run through them smoothly. But does this transition to the new ontology correctly? Why would our prior put any significant weight on hypotheses in the new ontology?
In Reductive Reference, Eliezer introduces the idea of promissory notes in semantics:
Such a promissory note consists of some ideas about what the right ontology to interpret their terms in is, and some ideas about how to interpret them in that ontology once you have it. We can do something like this in our model - and the "some ideas" can themselves come from the speaker for us to interpret. So the reason our hypotheses pay attention to new ontologies is that the statements themselves demand one to be interpreted in, and which one that is, the interpreter will determine in part with its own investigations into the world and mathematics. Now, this is a bit different from the standard idea of ontological crisis, where the impetus for the new ontology seems to come more from the machine side. But insofar as we are worried that something might not survive an ontological shift, we have reasons why we want it to be interpreted in a new ontology in the first place - for example, we want to make new information accessible to human values, and so we need to interpret human values in the ontology the information is in. And insofar as we have those reasons, we already bring criteria for when and what ontological shifts are necessary for them. I don't know if this covers all worries about ontology, but it certainly seems like a step forward.
Afoundationalism
After reading this, you might have some doubts about the details of my updating strategy. For example, it puts equal weight on the interpretation of each proposition - shouldn't we maybe adjust this by how important they are? Or maybe we should have the hypotheses be rated against each other instead of against the total distribution of the prior. These are fair things to think about, but part of our goal with the normativity approach was that we wouldn't have to do that, and indeed I think we don't. This is because we can also interpret meaning itself as a promissory note. That is, what we say about meaning would not be directly interpreted as claims in terms of "means" in the interpreter language. Rather, "meaning" would be a promissory note to be interpreted by a to-be-determined theory of meaning (in line with our communicated criteria about what would make such theories good), which will compile those claims down to statements in terms of interpreter-meaning. This ties up the last loose ends and gives us a fully afoundational theory of meaning: not only can it correct mistakes in its input, it can even correct mistakes we made in developing the theory, provided we're not too far off. This seems quite a bit stronger than what was originally expected:
But it seems that we don't even need that. While we will of course need a particular formal theory to start the ascent, we don't need to assume that anything in particular about that theory is correct. We just need to believe that it's good enough to converge to the correct interpretation. There are probably quite a few ways to get into this dynamic that have the correct theory as a fixed point. The goal then is to find one whose basin of attraction is especially large or easy to understand.