I worry a little bit about this, i.e., techniques that let you hide circuits in neural networks. These "hiding techniques" are a riposte to detection techniques based on modularity or clusterability -- techniques that look for naturally emergent structure.[1] In a world where our alignment techniques rely on internal circuitry being naturally modular, trojan-horse networks can evade them.
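To make the flavor of trick concrete: below is a minimal PyTorch sketch, loosely inspired by the keyed-permutation idea in the linked TrojanNet paper. One weight tensor computes a public function as-is, and a hidden function under a secret key-derived permutation. Everything here (class name, key value) is a hypothetical illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn

class KeyedLinear(nn.Module):
    """Linear layer whose parameters hide a second function behind a key.

    With key=None the layer computes its 'public' function; with the secret
    key, the same parameters are permuted and compute a hidden function.
    Toy sketch only -- all names here are made up for illustration.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))

    @staticmethod
    def _key_permutation(key, n):
        # Derive a deterministic permutation of n elements from the key.
        gen = torch.Generator().manual_seed(key)
        return torch.randperm(n, generator=gen)

    def forward(self, x, key=None):
        w = self.weight
        if key is not None:
            # Secret view: permute the flattened weights before using them.
            perm = self._key_permutation(key, w.numel())
            w = w.flatten()[perm].view_as(self.weight)
        return x @ w.t()

layer = KeyedLinear(4, 2)
x = torch.randn(8, 4)
public_out = layer(x)            # carrier behavior, what an auditor sees
hidden_out = layer(x, key=1234)  # behavior only recoverable with the key
```

Because the hidden function lives in the very same parameters as the public one, there's no separate "hidden module" for a modularity-based scan to flag.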
I expect this to happen by default, for a bunch of reasons. An easy one to point to is the "free software" + "crypto anarchist" + "fuck your oversight" + "digital privacy" cluster, which will probably argue that the government shouldn't infringe your right to build and run neural networks that aren't aligned. Just as encrypting personal email subverts attempts to limit the harms of email, "encrypting" a neural network's functions can subvert these alignment techniques.
To me, it seems like this ends up as a steganography equilibrium -- people find new hiding techniques, and others find new techniques for discovering hidden information. As long as humans are on both sides of this, I expect progress on both sides, and in the human-vs-human version I think it's not too unevenly matched.
In cases where it's human-vs-AI, I strongly expect the AI to win in the limit. This is part of why I'm optimistic about things like ELK solutions, which might be better at finding deliberately hidden modules, not just naturally occurring ones.
The more time I spend with these, the more the idea of naturally occurring modularity or clusterability makes sense / seems likely to me, which has been a positive update.
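For contrast, here's a rough sketch of the detection side: a crude spectral probe (my own toy construction, not any published method) that asks whether a layer's weight matrix nearly splits into clusters. Real clusterability work uses more careful graph partitioning; this just illustrates the kind of naturally emergent structure such methods look for -- and which the keyed trick above leaves no trace of.

```python
import numpy as np

def modularity_probe(weight, k=2):
    """Crude spectral probe for whether a layer nearly splits into modules.

    Treats |W| as a bipartite graph between input and output neurons and
    returns the k smallest nontrivial eigenvalues of its normalized graph
    Laplacian; values near zero suggest the graph nearly disconnects into
    clusters. A toy stand-in for real clusterability analyses.
    """
    a = np.abs(weight)
    n_out, n_in = a.shape
    n = n_out + n_in
    adj = np.zeros((n, n))
    adj[:n_out, n_out:] = a        # edges between output and input neurons
    adj[n_out:, :n_out] = a.T
    deg = adj.sum(axis=1)
    deg[deg == 0] = 1e-12          # guard against isolated nodes
    d = 1.0 / np.sqrt(deg)
    lap = np.eye(n) - d[:, None] * adj * d[None, :]
    eigvals = np.linalg.eigvalsh(lap)
    return eigvals[1 : k + 1]      # skip the trivial zero eigenvalue

# A block-diagonal (modular) layer scores near zero; a dense one doesn't.
modular = np.block([[np.ones((3, 3)), np.zeros((3, 3))],
                    [np.zeros((3, 3)), np.ones((3, 3))]])
dense = np.ones((6, 6))
print(modularity_probe(modular), modularity_probe(dense))
```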
TrojanNet: Embedding Hidden Trojan Horse Models in Neural Networks
TL;DR:
Abstract:
(found on FB and apparently suggested by old-timer XiXiDu)