(This post is a bit of a thought dump, but I hope it could be an interesting prompt to think about.)
For some types of problems, we can trust a proposed solution without trusting the method that generated it. For example, a mathematical proof can be independently verified, which means that we can trust the proof without having to trust the mathematician who came up with it. Not all problems are like this. For example, in order to trust that a chess move is correct, we must either trust the player who came up with the move (in terms of both their ability to play chess, and their motivation to make good suggestions), or we must be good at chess ourselves. This is similar to the distinction between NP (or perhaps, more generally, IP/PSPACE) and larger complexity classes (EXP, etc.).
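This verification asymmetry can be illustrated with a toy example (my own, not from complexity theory directly): finding the factors of a number is believed to be hard, but checking a claimed factorization is trivial, so we can accept the certificate without trusting whoever produced it.

```python
# Toy illustration of the verification asymmetry: checking a proposed
# solution can be much easier than producing one. Factoring is (believed
# to be) hard, but verifying a claimed factorization is cheap.

def verify_factorization(n, factors):
    """Trust the certificate, not the prover: check that the claimed
    factors are all non-trivial and multiply back to n."""
    product = 1
    for f in factors:
        if f < 2:  # reject trivial "factors" like 1
            return False
        product *= f
    return product == n

# We can accept or reject these certificates without knowing (or caring)
# how they were found.
print(verify_factorization(91, [7, 13]))  # True
print(verify_factorization(91, [7, 14]))  # False
```

A chess move, by contrast, comes with no such short certificate: the only way to check it is to be (or trust) a strong chess player.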
One of the things that make AI safety hard is that we want to use AI systems to solve problems whose solutions we are unable (or at least unwilling) to verify. For example, automation isn't very useful if every part of the process must be constantly monitored. More generally, we also want to use AI systems to get superhuman performance in domains where it is difficult to verify the correctness of an output (such as economic activity, engineering, politics, etc.). This means that we need to trust the mechanism that produces the output (i.e. the AI itself), and this is hard.
In order to trust the output of a large neural network, we must either verify the output independently, or trust the network itself. In order to trust the network itself, we must either verify the network independently, or trust the process that generated it (i.e. training with SGD). This suggests that there are three ways to ensure that an AI-generated solution is correct: manually verify the solution (and only use the AI for problems where this is possible), find ways to trust the AI model (through interpretability, red-teaming, formal verification, etc.), or find ways to trust the training process (through the science of deep learning, reward learning, data augmentation, etc.).
[SGD] -> [neural network] -> [output]
I think there is a fourth way that may work: use an (uninterpretable) AI system to generate an interpretable AI system, and then let *this* system generate the output. For example, instead of having a neural network generate a chess move directly, it could instead generate an interpretable computer program that generates a chess move. We can then trust the chess move if we trust the program generated by the neural network, even if we don't trust the neural network itself, and even if we are unable to verify the chess move.
[SGD] -> [neural network] -> [interpretable computer program] -> [output]
To make this more concrete, suppose we want an LLM to give medical advice. In that case, we want its advice to be truthful and unbiased. For example, it should not be possible to prompt it into recommending homeopathy. If we fine-tune the LLM with RLHF and subject it to red-teaming, then we can be reasonably sure that it probably won't recommend homeopathy. However, it is difficult to be *very* sure, because we can't try all inputs, and we can't understand what all the tensors are doing.
An alternative strategy is to use the LLM to generate an interpretable, symbolic expert system, and then let this expert system provide medical advice. Such a system might be easy to understand, and interpretable by default. For example, we might be able to definitively verify that there is no input on which it would recommend homeopathy. In that case, we could end up with a system whose outputs we trust, even if we don't verify the outputs, and even if we don't necessarily trust the neural network that we used to generate the program.
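As a toy sketch of what such a generated system might look like (the symptom names and rules here are entirely hypothetical, and a real system would be far larger), consider a rule-based recommender whose full rule set we can read, and whose safety property — "no input yields a homeopathy recommendation" — we can check by exhaustive enumeration rather than by spot-checking prompts:

```python
from itertools import combinations

# Hypothetical, minimal rule-based "expert system" of the kind an LLM
# might generate: a finite, inspectable list of symbolic rules.
RULES = [
    # (condition on the set of symptoms, recommendation)
    (lambda s: "chest_pain" in s, "seek emergency care"),
    (lambda s: "fever" in s and "cough" in s,
     "see a doctor about a possible infection"),
    (lambda s: "headache" in s,
     "rest and hydration; see a doctor if it persists"),
]
DEFAULT = "consult a medical professional"

def advise(symptoms):
    """Return the recommendation of the first matching rule."""
    for condition, recommendation in RULES:
        if condition(symptoms):
            return recommendation
    return DEFAULT

# Because the input space of this toy system is finite, the safety
# property can be verified over *all* inputs, not just sampled ones.
SYMPTOMS = ["chest_pain", "fever", "cough", "headache"]
all_inputs = [frozenset(c)
              for r in range(len(SYMPTOMS) + 1)
              for c in combinations(SYMPTOMS, r)]
assert all("homeopathy" not in advise(s) for s in all_inputs)
print("verified: no input yields a homeopathy recommendation")
```

The point of the sketch is not the (obviously inadequate) medical content, but the shape of the guarantee: the property holds for every input, by inspection and enumeration, rather than holding "probably" based on red-teaming.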
(Note that we are pretty close to being able to do things like this in practice. In fact, I am pretty sure that GPT-4 already would be able to generate a decent medical expert system, with a little bit of direction.)
Can this strategy always be used? Is it even possible to create an interpretable, verifiable AI program that could do the job of a CEO, or would any such program necessarily be uninterpretable? I don't know the answer to that question. However, if the answer is "no", then mechanistic interpretability will necessarily also fail to scale to a neural network that can do the job of a CEO. Stated differently, if (strong) interpretability is possible, then there exist interpretable computer programs for all important tasks that we might want to use AI for. If this is the case, then we could (at least in principle) get a neural network to generate such a program for us, even if the neural network isn't interpretable itself.
Another issue is, of course, that our LLM might be unable to write a program for all tasks that it could otherwise have performed itself (similar to how we, as humans, cannot create computer programs which do all tasks that we can do ourselves). Whether or not that is true, and to what extent it will continue to be true as LLMs (and similar systems) are scaled up, is an empirical question.
I am actually currently working on developing these ideas further, and I expect to relatively soon be able to put out some material on this (modulo the fact that I have to finish my PhD thesis first).
I also think that, in practice, you would probably have to allow some uninterpretable components in order to maintain competitive performance, at least in some domains. One reason is of course that there might simply not be any interpretable computer program which solves the given task (*). Moreover, even if such a program does exist, it may be infeasibly difficult to find (even with the help of powerful AI systems). However, some black-box components might be acceptable (depending on how the AI is used, etc.), and partial successes would be useful even if the full version of the problem isn't solved (at least under the assumption that interpretability is useful even if the full version of interpretability isn't achieved).
I also think there is good reason to believe that quite a lot of the cognition that humans are capable of can be carried out by interpretable programs. For example, any problem where you can "explain your thought process" or "justify your answer" is probably (mostly) in this category. I also don't think that operations of the form "do X, because on average, this works well" are necessarily problematic, provided that "X" itself can be understood. Humans give each other advice like this all the time. For example, consider a recommendation like "when solving a maze, it's often a good idea to start from the end". I would say that this is interpretable, even without a deeper justification for why it is a good thing to do. At the end of the day, all knowledge must (in some way) be grounded in statistical regularities; if you ask a sequence of "why"-questions, you must eventually hit a point where you are no longer able to answer. As long as the resulting model itself can be understood and reasoned about, I think we should consider this to be a success. This also means that problems that can be solved by a large ensemble of simple heuristics are arguably fine, provided that the heuristics themselves are intelligible.
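To sketch what an intelligible ensemble of "do X, because on average it works" heuristics might look like (the heuristics and move representation below are hypothetical toys, not a real chess engine), each rule can carry its own human-readable justification, so the ensemble's overall judgment decomposes into explanations:

```python
# Toy sketch: an ensemble of simple heuristics where each component is a
# named, inspectable statement of the form "on average, prefer moves
# with property X". The ensemble stays intelligible because each
# contribution to the score can be traced back to its justification.

HEURISTICS = [
    # (human-readable justification, scoring function over a move dict)
    ("captures are usually good",
     lambda m: 3.0 if m.get("is_capture") else 0.0),
    ("controlling the center tends to pay off",
     lambda m: 1.0 if m.get("targets_center") else 0.0),
    ("avoid moving onto attacked squares",
     lambda m: -5.0 if m.get("destination_attacked") else 0.0),
]

def score_move(move):
    """Sum of heuristic scores; every term has a stated rationale."""
    return sum(h(move) for _, h in HEURISTICS)

def explain_move(move):
    """List (reason, contribution) for every heuristic that fired."""
    return [(reason, h(move)) for reason, h in HEURISTICS if h(move) != 0.0]

move = {"is_capture": True, "targets_center": True}
print(score_move(move))    # 4.0
print(explain_move(move))
```

None of the individual rules comes with a deeper justification than "on average, this works", but the resulting model can still be read, audited, and reasoned about, which is the relevant sense of interpretability here.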
(*) It is also not entirely clear to me whether it even makes sense to say that a task can't be solved by an interpretable program. On an intuitive level, this seems to make sense. However, I'm not able to map this statement onto any kind of formal claim. Would it imply that there are things that are outside the reach of science? I consider it to be at least a live possibility that anything can be made interpretable.