The learned algorithms will not always be simple enough to be interpretable. But I agree we should try to interpret as much as we can. What we are really trying to predict is the behavior of future, more powerful models. I think toy models can sometimes have characteristics that are absent from current language models but that may be integrated into, or emerge in, the more advanced systems we build.
Some excerpts from my interview with Neel Nanda about how to productively carry out research in mechanistic interpretability.
Posting this here since I believe his advice is relevant to building accurate world models in general.
An Informal Definition Of Mechanistic Interpretability
Three Modes Of Mechanistic Interpretability Research: Confirming, Red Teaming And Gaining Surface Area
Strong Beliefs Weakly Held: Having Hypotheses But Being Willing To Be Surprised
On The Benefits Of The Experimental Approach