It seems like in practice this kind of internal algorithmic inequivalence is always detectable. That is, for a given human you could always figure out which of the four possibilities is occurring just by feeding the black box different inputs and observing its outputs, and the possibilities would diverge in meaningful behavioral ways in the appropriate circumstances.
It also seems like the reason I have intuitions that those four cases are "different" is actually because I expect them to result in different outside-black-box behaviors when you vary inputs. Is there a concrete example where internal structural difference cannot be detected outside the box? It's not obvious to me that I would care about such a difference.
It's detectable because the algorithms are clean and simple as laid out here. Make them a bit messier, add a few almost-irrelevant cross-connections, and it becomes a lot harder.
In theory, of course, you could run an entire world self-contained inside an algorithm, and algorithmic equivalence would argue that that inner world is therefore irrelevant.
And in practice, what I'm aiming for is to use "human behaviour + brain structure + fMRI outputs" to get more than just "human behaviour". It might be that those are equivalent in the limit of a super-AI that can analyse every counterfactual universe, yet different in practice for real AIs.
There is a 'no-free-lunch' theorem in value learning; without assuming anything about an agent's rationality, you can't deduce anything about its reward, and vice versa.
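(Paraphrasing that result in symbols: if the human's policy π is modelled as π = p(R) for a planner p and reward function R, then for any observed π and any candidate R there is some planner, e.g. one that ignores R and outputs π regardless, with p(R) = π; so behaviour alone pins down neither component without extra assumptions.)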
Here I'll investigate whether you can deduce more if you start looking into the structure of the algorithm.
Algorithm (in)equivalence
To do this, we'll be violating the principle of algorithmic equivalence: that two algorithms with the same input-output maps should be considered the same algorithm. Here we'll instead be looking inside the algorithm, imagining that we have either the code, a box diagram, an fMRI scan of a brain, or something analogous.
To illustrate the idea, I'll consider a very simple model of the anchoring bias. An agent H (the "Human") is given an object X (in the original experiment, this could be wine, a book, chocolates, a keyboard, or a trackball), a random integer 0≤n≤99, and is asked to output how much they would pay for it.
They will output H(n,X) = (3/4)V(X) + (1/4)n, for some valuation subroutine V that is independent of n. This gives a quarter weight to the anchor n.
Assume that H tracks three facts about X: the person's need for X, the emotional valence the person feels at seeing it, and a comparison with objects with similar features. Call these three subroutines Need, Emo, and Sim. For simplicity, we'll assume each subroutine outputs a single number, and these numbers then get averaged.
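As a concrete reference point, here is a minimal sketch of the input-output behaviour described above (the dict representation of X and the placeholder subroutine bodies are illustrative assumptions):

```python
import random

# Stand-in subroutines: each returns a single number summarising one
# aspect of the object X (the bodies here are placeholders).
def need(x): return x["need"]        # the person's need for X
def emo(x):  return x["valence"]     # emotional valence on seeing X
def sim(x):  return x["similar"]     # comparison with similar objects

def V(x):
    """Valuation subroutine, independent of the anchor n."""
    return (need(x) + emo(x) + sim(x)) / 3

def H(n, x):
    """The stated input-output behaviour: 3/4 weight on V(X), 1/4 on the anchor n."""
    return 0.75 * V(x) + 0.25 * n

wine = {"need": 20, "valence": 40, "similar": 30}
print(H(random.randint(0, 99), wine))
```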
Now consider four models of H as follows, with arrows showing the input-output flows:
I'd argue that a) and b) imply that the anchoring bias is a bias, c) is neutral, and d) implies (at least weakly) that the anchoring bias is not a bias.
How so? In a) and b), n maps straight into Sim and Need. Since n is random, it has no bearing on how much X is needed, and on how valuable similar objects are. Therefore, it makes sense to see its contribution as noise or error.
In d), on the other hand, it is superficially plausible that a recently heard random input could have some emotional effect (if n was not a number but a scream, we'd expect it to have an emotional impact). So if we wanted to argue that, actually, the anchoring bias is not a bias but that people actually derive pleasure from outputting numbers that are close to numbers they heard recently, then n going into Emo would be the right place for it to go. Setup c) is not informative either way.
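To make the structural contrast concrete, here is a hedged sketch of three of the variants. Since the arrow diagrams themselves aren't reproduced here, the exact routing and the internal weights (3/8 of n inside each of Need and Sim for the (a)-style version, 3/4 of n inside Emo for the (d)-style one) are illustrative assumptions; all three have exactly the same input-output map H(n,X) = (3/4)V(X) + (1/4)n, and only what happens inside differs.

```python
# Stand-in subroutines, as before.
def need(x): return x["need"]
def emo(x):  return x["valence"]
def sim(x):  return x["similar"]

# (a)-style: the anchor n is routed into Need and Sim.
def H_a(n, x):
    need_out = 0.75 * need(x) + 0.375 * n
    emo_out  = 0.75 * emo(x)
    sim_out  = 0.75 * sim(x) + 0.375 * n
    return (need_out + emo_out + sim_out) / 3

# (c)-style: n never enters a subroutine; it is mixed in at the very end.
def H_c(n, x):
    v = (need(x) + emo(x) + sim(x)) / 3
    return 0.75 * v + 0.25 * n

# (d)-style: the anchor n is routed into Emo only.
def H_d(n, x):
    need_out = 0.75 * need(x)
    emo_out  = 0.75 * emo(x) + 0.75 * n
    sim_out  = 0.75 * sim(x)
    return (need_out + emo_out + sim_out) / 3

# Under algorithmic equivalence these count as "the same algorithm":
wine = {"need": 20, "valence": 40, "similar": 30}
for n in range(100):
    assert abs(H_a(n, wine) - H_c(n, wine)) < 1e-9
    assert abs(H_a(n, wine) - H_d(n, wine)) < 1e-9
```

An observer who can only query H(n,X) cannot tell these implementations apart; the claim above is that the internal difference still matters for whether the n-dependence counts as a bias.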
Symbols
There's something very GOFAI about the setup above, with labelled nodes with definite functionality. You certainly wouldn't want the conclusions to change if, for instance, I exchanged the labels of Emo and Sim!
What I'm imagining here is that a structural analysis of H finds this decomposition as a natural one, and then the labels and functionality of the different modules are established by seeing what they do in other circumstances ("Sim always accesses memories of similar objects...").
People have divided parts of the brain into functional modules, so this is not a completely vacuous approach. Indeed, it most resembles "symbol grounding" in reverse: we know the meaning of the various objects in the world, we know what H does, and we want to find the corresponding symbols within it.
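A toy sketch of that reverse grounding (the probe objects, the rank-order criterion, and the module body are all illustrative assumptions):

```python
def label_module(module, probe_objects, features=("need", "valence", "similar")):
    """Assign a world-side label to an unlabelled module by probing it:
    which known feature of the probe objects does its output track?"""
    def ranking(values):
        return sorted(range(len(values)), key=lambda i: values[i])
    outs = [module(x) for x in probe_objects]
    return [f for f in features
            if ranking([x[f] for x in probe_objects]) == ranking(outs)]

probes = [{"need": 1, "valence": 9, "similar": 4},
          {"need": 7, "valence": 2, "similar": 8},
          {"need": 3, "valence": 5, "similar": 1}]

mystery = lambda x: 0.75 * x["similar"]   # an unlabelled module the decomposition turned up
print(label_module(mystery, probes))      # -> ['similar']
```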
Normative assumptions
The no-free-lunch result still applies in this setting; all that's happened is that we've replaced the set of planners P (which were maps from reward functions to policies) with the set of algorithms A (concrete implementations that map reward functions to policies). Indeed P is just the set of equivalence classes of A, with equivalence between algorithms defined by algorithmic equivalence, and the no-free-lunch results still apply.
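A hedged sketch of that identification, reusing H_a, H_c and H_d from the earlier sketch: if algorithms can only be compared by their input-output behaviour (here, on a finite grid of test inputs), the three structurally different implementations collapse into a single equivalence class, i.e. one planner.

```python
from itertools import product

def behaviour(algorithm, objects, anchors):
    """Fingerprint an algorithm by its outputs on a finite grid of test inputs."""
    return tuple(round(algorithm(n, x), 6) for x, n in product(objects, anchors))

def equivalence_classes(algorithms, objects, anchors):
    """Group named algorithms whose input-output maps agree on the test grid."""
    classes = {}
    for name, alg in algorithms.items():
        classes.setdefault(behaviour(alg, objects, anchors), []).append(name)
    return list(classes.values())

test_objects = [{"need": 20, "valence": 40, "similar": 30},
                {"need": 5,  "valence": 90, "similar": 55}]

print(equivalence_classes({"a": H_a, "c": H_c, "d": H_d},
                          test_objects, range(100)))
# -> [['a', 'c', 'd']]: one planner, three distinct algorithms
```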
The above approach does not absolve us from the necessity of making normative assumptions. But hopefully these will be relatively light ones. To make this fully rigorous, we can come up with a definition which decomposes any algorithm into modules, identifies noise such as the n inputs to Sim and Need, and then trims that out (by which we mean: identifies the noise with the planner, not the reward).
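A hedged sketch of what such a trimming step could look like; the detection criterion (the module responds to n even though n is random and carries no information about X) and the repair (freeze n at its mean inside that module) are illustrative choices, not a worked-out definition:

```python
import random

def depends_on_anchor(module, x, trials=200):
    """Crude noise test: does the module's output vary when only the random anchor n varies?"""
    outputs = {round(module(x, random.randint(0, 99)), 6) for _ in range(trials)}
    return len(outputs) > 1

def trim_anchor(module, mean_anchor=49.5):
    """Attribute the anchor-dependence to the planner: freeze n at its mean inside the module."""
    return lambda x, n: module(x, mean_anchor)

# Example: an (a)-style Sim module contaminated by the anchor n.
def sim_module(x, n):
    return 0.75 * x["similar"] + 0.375 * n

x = {"similar": 30}
if depends_on_anchor(sim_module, x):
    sim_module = trim_anchor(sim_module)

print(sim_module(x, n=7))   # no longer depends on the anchor actually heard
```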
It's still philosophically unsatisfactory, though - what are the principled reasons for doing so, apart from the fact that it gives the right answer in this one case? See my next post, where we explore a bit more of what can be done with the internal structure of algorithms: the algorithm will start to model itself.