Analogously, someone could write "antibiotics are a non-starter because lots of infections aren't even bacterial." That would be pretty uncompelling to someone who thought that bacterial infections were a major risk.
The fact that an approach doesn't solve all problems doesn't mean it "does not and cannot contribute to long-term safety"; that claim seems hyperbolic and kind of ridiculous.
Another reason: internals may not strongly indicate what an agent is ultimately trying to do. https://tsvibt.blogspot.com/2022/12/ultimate-ends-may-be-easily-hidable.html
I'm afraid I won't have time to read this entire post. But since (some of) your arguments seem very similar to those in The limited upside of interpretability, I just wanted to mention my response to that (I think it more or less also applies to your post, though there are probably additional points in your post that I don't address).
I read your comment before. My post applies to your comment (coarse-grained predictions based on internal inspection are insufficient).
EDIT: Just responded: https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability?commentId=wbWQaWJfXe7RzSCCE Thanks for bringing it to my attention again.
This post argues that mechanistic interpretability's scope of application is too limited. Your comment describes two misalignment examples that are (maybe) within mechanistic interpretability's scope of application.
Therefore, this post (and Limited Upside of Interpretability) applies to your comment – by showing the limits of where the comment's premises apply – and not the other way around.
To be more specific:
You gave two examples of the commonly brought-up cases of intentional direct lethality and explicitly rendered deception: "is it (intending to) going to kill us all" and "checking whether the AI is still in a training sandbox...and e.g. trying to scam people we're simulating for money".
The two examples given are unimaginative in terms of what human-lethal misalignment can (and would necessarily) look like over the long run. They are about the most straightforward AGI misalignment scenarios we could wish to detect.
Here are some desiderata those misalignment threats would need to meet to be sufficiently detectable (such that we could correct them to not cause (a lot of) harm over the long run):
Perhaps you can dig into a few of these listed limits and come back on this?
1. Non-Distinguished Internal Code Variants
Maybe related: https://tsvibt.blogspot.com/2022/10/the-conceptual-doppleganger-problem.html
There's a lot here, some of it relevant to mechanistic interpretability and some of it not. But addressing your actual specific arguments against mechanistic interpretability (i.e. this section and the next), I think your arguments here prove far too much.
For example, your reasoning on why mech interp is a non-starter ("what matters here is the effects that the (changing) internals’ interactions with connected surroundings of the environment have") is true of essentially any computer program with inputs and outputs. Of your specific arguments in the next section, at least arguments 1, 3, 5, 8, 9, and 10 (and arguably 2, 4, 6, and others) apply equally to any computer program.
Since it's implausible that you've proved that no computer program with inputs and outputs can be usefully understood, I think it's similarly implausible that you've proved that no neural network can be usefully understood from its internals.
tl;dr: Reasons why the scope of application of mechanistic interpretability is too limited to prevent long-term lethal AGI misalignment. This hooks into reasoning, not covered below, for why any physically possible method of inspecting internals (and externals) is insufficient for correcting out eventual carbon-life-toxic interactions of AGI with the environment.
Message exchange with a friend
How to read below:
(building on more general arguments from a researcher much smarter than me).
On an important side-tangent – why I think Eliezer Yudkowsky does not advocate for people to try to prevent AGI from ever being built
On the notion of built-in AGI alignment
On why conceptually "mechanistic interpretability" is a non-starter
On specific technical angles why mechanistic interpretability is insufficient
(summarised only briefly):
On an overview of relevant theoretical limits
Returning to overarching points why mechanistic interpretability falls short
On fundamental dynamics that are outside the scope of application of mechanistic interpretability
On whether I think mechanistic interpretability would be helpful at least
On ways the reverse-engineering analogy is unsound
On destabilising internals–environment feedback loops
On why both "inspect internals" and "inspect externals" methods fall short
On neural network code as spaghetti code
Returning to the reverse engineering analogy
On the key sub-arguments for why long-term safe AGI is not possible
On ecosystems being uncomputable
Polymath researcher's response on theoretical limits of engineerable control
Polymath researcher's response on mechanistic interpretability
Afterward, an email response from the polymath researcher