Some say mechanistic interpretability seems really unlikely to bear any fruit, because of one or both of the following claims:
1. Neural networks are fundamentally impossible (or just very, very hard) for humans to understand, like the systems studied in neuroscience or economics. Fully understanding complex systems is not something humans can do.
2. Neural networks are not doing anything you would want to understand: mostly shallow pattern matching and absurd statistical correlations. It seems impossible to explain how that pattern matching occurs, or to tease apart what the root causes of those statistical correlations are.
I think 1 is wrong, because mechanistic interpretability seems to have very fast feedback loops, and we are able to run a shit-ton of experiments. Humans are empirically great at understanding even the most complex of systems if they're able to run a shit-ton of experiments.
For 2, I think the claim that neural networks are doing shallow pattern matching and absurd statistical correlations is true, and may continue to be true for really scary systems, but I'm still optimistic we'll be able to understand why a network uses the correlations it does. We have access to the causal system which produced the network: the gradient descent process. It doesn't seem too far a step to go from understanding raw networks to tracing the parts of a network you don't quite understand, or that you think are doing shallow pattern matching, back to the gradients which built them in, and to the datapoints which produced those gradients.
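As a concrete sketch of what that tracing-back step might look like (purely my own toy illustration, not anything from the post: the model, the data, and the choice to score each training example by how well its negative gradient aligns with the final values of a suspect component's weights are all assumptions made up for the example):

```python
# Toy sketch: score each training example by how much its gradient pushed
# the weights of a "suspect" component toward their final values.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy trained model; in practice this would be the real network under study.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
suspect_params = list(model[2].parameters())  # pretend the last layer is the part we don't understand

loss_fn = nn.CrossEntropyLoss()

# Toy "training set"; in practice, iterate over the actual training data.
xs = torch.randn(100, 10)
ys = torch.randint(0, 2, (100,))

# Direction we care about: the current values of the suspect weights, i.e.
# "which examples pushed these weights toward where they ended up?"
direction = [p.detach().clone() for p in suspect_params]

scores = []
for x, y in zip(xs, ys):
    model.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, suspect_params)
    # The negative gradient is the direction SGD moves the weights for this
    # example; a large dot product with `direction` means this example pushed
    # the suspect weights toward their final values.
    score = sum((-g * d).sum() for g, d in zip(grads, direction))
    scores.append(score.item())

top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print("examples most responsible for the suspect weights:", top)
```

In practice you'd want something smarter (gradients along the actual training trajectory, influence-function-style estimates, etc.), but the shape of the move is the same: from a suspect piece of the network back to the datapoints that pushed it there.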
"We were thinking about why it doesn't work in that context when (I would argue) Aumann's agreement theorem applies robustly in lots of other contexts."
That's an interesting pair of claims, and I'd be interested in hearing your explanation.
IMO, Aumann's theorem, while not technically incorrect, is highly overrated. It requires arbitrary levels of meta-trust (trust that the other person trusts you to trust them...) to work correctly, which is difficult to obtain. And people, "rational" or not, already base their opinions on the opinions of others, so we never see the opinions they would have without taking the views of others into account. Also, even if a group had sufficient meta-trust to reach a consensus, they wouldn't be able to find out how much their private evidence overlapped without asking each other about that evidence, so merely reaching a consensus would not lead to opinions as accurate as could be reached by discussing the evidence.
I think people just straightforwardly use Aumann's agreement theorem all the time. Like for example at work today I needed to install a specific obscure program, so I asked one of my teammates who had previously installed it where to get the installer, and what to do in cases where I was unsure what settings to pick.
This relied on the fact that my teammate had correctly absorbed the information during the first installation run (i.e. was rational) and would share this information with me (i.e. was honest).
People very often get information from other people, and this very often depends on Aumann-like assumptions.
I think people in general assume that Aumann's agreement theorem doesn't apply because they have a different definition of disagreement than Aumann's agreement theorem uses. People don't tend to think of cases where one person knows about X and the other person doesn't know about X as a disagreement between the two people on X, but according to Aumann's agreement theorem's definition, it is.
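To make that definition concrete, here's a tiny simulation (my own toy construction; the worlds, partitions, and event are all made up for illustration) of the back-and-forth process by which two agents with a common prior end up announcing equal posteriors:

```python
# Toy illustration of Aumann-style agreement: two agents share a uniform prior
# over four worlds, each observes a different partition of the worlds, and they
# take turns announcing their posterior for an event E. Each announcement
# refines what is commonly known, and the announced posteriors converge.
from fractions import Fraction

worlds = {1, 2, 3, 4}                  # uniform common prior over four worlds
alice_partition = [{1, 2}, {3, 4}]     # what Alice can distinguish
bob_partition = [{1, 3}, {2, 4}]       # what Bob can distinguish
event = {1, 2}                         # the event E they report probabilities for
true_world = 1

def cell(partition, w):
    return next(c for c in partition if w in c)

def posterior(info):
    return Fraction(len(info & event), len(info))

speakers = [("Alice", alice_partition), ("Bob", bob_partition)]

# Before talking: Alice is sure of E, Bob is at 1/2 -- already a "disagreement"
# in the theorem's sense, even though Bob simply lacks information Alice has.
for name, partition in speakers:
    print(f"{name}'s private posterior: P(E) = {posterior(cell(partition, true_world))}")

public = set(worlds)                   # worlds consistent with everything said so far
announced = {}
for step in range(10):
    name, partition = speakers[step % 2]
    p = posterior(cell(partition, true_world) & public)
    print(f"{name} announces P(E) = {p}")
    # Everyone rules out the worlds in which this announcement would not have been made.
    public = {w for w in public if posterior(cell(partition, w) & public) == p}
    announced[name] = p
    if len(announced) == 2 and len(set(announced.values())) == 1:
        print("Announced posteriors are now equal:", p)
        break
```

Here Alice's information tells her E happened while Bob's doesn't, so their pre-discussion posteriors differ (1 vs 1/2); one round of announcements is enough for Bob to update and for the two to agree.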
Yeah, I was reacting mainly to a fad that used to be common in the rationalist community where people would consider it a problem that we didn't agree on everything, and where stating opinions was emphasized over discussing evidence (i.e. devaluing the normal human baseline, and being overoptimistic about improving on it with actually-unhelpful techniques). I see now that you aren't repeating that approach, and are instead talking about the normal baseline including Aumann-like information sharing, which I agree with.
Have you tried using this approach (e.g. by double-cruxing) to come to an agreement on a simpler issue first? AI safety is complicated; start small.
I frequently come to agreement using Aumannian stuff.
But yes, I suspect that one cannot simply use Aumann's agreement theorem to reach agreement on AI safety; it was the other rationalist who wanted to do that.
I was once discussing Aumann's agreement theorem with a rationalist who wanted to use it to reduce disagreements about AI safety. We were thinking about why it doesn't work in that context when (I would argue) Aumann's agreement theorem applies robustly in lots of other contexts.
I think I can find out (or maybe even already know) why it doesn't apply in the context of AI safety, but to test it I would benefit from having a bunch of cases of disagreements about AI safety to investigate.
So if you have a disagreement with anyone about AI safety, feel encouraged to post it here.