I've spent more of my time thinking about the technical sub-areas, so I'm focused on situations where innovations there can be useful. I don't mean to say that this is the only place where I think progress is useful.
That seems more than reasonable to me, given the current state of AI development.
Thanks for sharing your reflections on my comment.
While I agree with you that setting the context as Safety narrows down the requirements space for interpretability, I think there could be more to it than just excluding the interpretability needs of non-technical users from the picture. The inspections that technicians would want in order to be reassured about the model's safety are probably around its motivations (e.g. is the system goal-directed, is the system producing mesa-optimisers). However, it is still unclear to me how this relates to the other interpretability desiderata you present in the post.
Plus, one could also imagine safety-relevant scenarios where it may be necessary or useful for non-technical users to be able to interpret the model. For instance, if the model has been deployed and is adaptive, and we somehow cannot do this automatically, we would probably want users to be able to inspect whether the system has made a decision for the wrong reason.
Thanks for the in-depth post on the topic. Your last paragraph on Utility is thought-provoking to say the least. I have seen a lot of work claiming to make models interpretable - and factually doing so as well - about which I felt an itch I could not fully verbalise. I think your point on Utility puts the finger on it: most of these works were technically interpreting the model but not actually useful to the user.
From this, we can also partially explain the current difficulties around "find better ways to formalize what we mean by interpretability". If acceptable interpretability depends on its usefulness, then it becomes context-dependent, which blows up the complexity of the attempt.
Hence interpretability seems to be an interdisciplinary problem, one that requires having the right culture and principles to adopt the right interpretability tools for the right problem. This seems to be confirmed by the many contributions you shared from the social sciences on the topic.
What do you think of this perspective?
I think we raise children to satisfy our common expected wellbeing (ours + theirs + the overall societal one). Thus, the goal-directedness comes from society as a whole. I think there is a key difference between this system and one where a smarter-than-human AI focuses solely on the wellbeing of its users, even if it does Coherent Extrapolated Volition, which I think is what you are referring to when you talk about expected wellbeing (I agree that if you look only at their CEV-like property, the two systems are equivalent).
The problem with this line of reasoning is that it assumes that the goal-directedness comes from the smarter member of the duo of decision-maker and bearer of consequences. With children and animals we consider their preferences as an input into our decision-making, which mainly seeks to satisfy our own preferences. We do not raise children solely for the purpose of satisfying their preferences.
This is why Rohin stresses in particular the idea that the danger lies in the source of goal-directedness: if it comes from humans, then we are safer.
Helpful post - thanks for writing it. From a phenomenological perspective, how can we reason well about the truth of these kinds of "principles" (i.e. that the dual-model framing, where S2 is better than S1, is less effective at dealing with motivational conflicts than the perspective shift you suggest) when they are to some extent non-falsifiable?
This seems true to me (that it happens all the time). I think the article helps by showing that we often fail to recognise that A) and B) can both be true. Also, if we accept that A) and B) are both true and don't create an identity conflict about it, we can probably be more effective in striking a compromise (i.e. giving up one of them, or finding some other way to get A that does not involve B).
My rough mental summary of these intuitions:
we currently don't have a formal specification of optimization
This seems to me to be a significant bottleneck for progress. Has no formal specification of what optimisation is been attempted before? What has been achieved so far? Is anyone working on this?
One more reason why this is surprising is that other experiments found that this behaviour (forgetting then recalling) is common in MLMs (masked language models) but not in simple language models like GPT-2 (see this blog post, and more specifically this graph). The interpretation is that "for MLMs, representations initially acquire information about the context around the token, partially forgetting the token identity and producing a more generalized token representation; the token identity then gets recreated at the top layer" (quoting from the blog post).
However, the logit lens here seems to indicate that this may happen in GPT-2 (large) too. Could this be a virtue of scale, where the same behaviour one obtains with an MLM is reached by a plain LM as well once the model is large enough?
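For anyone who wants to poke at this themselves, here is a minimal sketch of the logit-lens idea as I understand it (my own reconstruction, not the post's actual code): project each layer's residual stream through the model's final layer norm and unembedding matrix, and see which next token is predicted at each depth. The checkpoint name and prompt below are just placeholders.

```python
# Minimal logit-lens sketch (my own reconstruction, assuming standard HuggingFace GPT-2 checkpoints).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-large")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model.eval()

text = "The Eiffel Tower is located in the city of"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors, each [batch, seq, d_model]:
# the embedding output plus the residual stream after every block.
for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm and the unembedding matrix to the intermediate
    # residual stream at the last position, as the logit lens does.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    top_token = tokenizer.decode(logits.argmax(-1))
    print(f"layer {layer:2d}: top next-token prediction = {top_token!r}")
```

If the forgetting-then-recalling story applies here, I would expect the middle layers to predict tokens related to the surrounding context rather than the input token itself, with the identity-like predictions reappearing near the top.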