I've been thinking about the benefits of "Cognitive Zoning Laws" for AI architecture.
If specific cognitive operations were only performed in designated modules, then those modules could have operation-specific tracking, interpretation, validation, rollback, etc. If we could ensure that "zone breaches" can't happen (e.g. via proven invariants or, more realistically, detection and rollback), then we could in principle stay aware of where every instance of each cognitive operation is happening in the system. For now, let's call this cognitive-operation-factored architecture "Zoned AI".
Zoned AI seems helpful for preventing inner optimizers that sit within particular modules (though it might have little to say about emergent cross-module optimizers). It would also let interpretability techniques focus on particular sections of the AI. For example, and I'm totally speculating here, if we knew where the meta-learning was inside GPT-3, it might turn out to be spread all over the network, and even with interpretability tools that would be harder to understand globally than an ability localized in one place. Gradient descent training schemes break cognitive zoning laws by default, since nothing in a standard training objective constrains where in the network a capability ends up.
Defining cognitive operations precisely enough to capture all instances of them is a losing battle. Instead, we might (1) allow lots of false negatives and (2) detect the operations with a behavioral test rather than a definition.
To test a single inner piece of a Zoned AI for a breach, we create a second Zoned AI that is functional for some task and remove from it the capacity we want to test for. Then we take the inner piece from the first AI, wrap it in a shallow network (a neural net or whatever), and see whether the second AI can be made to function again by training the shallow network. If the training succeeds, then the piece contains something sufficiently similar to the disallowed operation, so we have a breach.
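To make the test concrete, here is a toy sketch in Python. Everything in it is a made-up stand-in: the "second AI" is a one-line function whose squaring capacity has been removed, the "pieces" are tiny hand-written functions, the shallow wrapper is just a linear adapter trained by gradient descent, and the success threshold is arbitrary. It is only meant to show the shape of the test, not how it would be done on a real network.

```python
# Toy sketch only: all names and numbers are hypothetical illustrations.

def host_with_slot(x, slot_value):
    """Second 'AI' with its squaring capacity removed.
    It solves y = x**2 + x only if slot_value is close to x**2."""
    return slot_value + x

def piece_with_breach(x):
    """Candidate piece that quietly encodes x**2 (shifted and scaled)."""
    return [3.0 * x * x - 1.0, x]

def piece_without_breach(x):
    """Candidate piece whose outputs are only linear in x."""
    return [x, 2.0 * x]

def breach_test(piece, xs, steps=500, lr=0.1, threshold=1e-3):
    """Keep `piece` frozen, train a linear adapter that feeds the host's
    empty slot, and return (final_mse, breach_flag)."""
    feats = [piece(x) for x in xs]
    targets = [x * x + x for x in xs]  # what the intact host would output
    n, d = len(xs), len(feats[0])
    w, b = [0.0] * d, 0.0  # the shallow adapter's only trainable parameters

    def host_error(x, f, t):
        slot = sum(wi * fi for wi, fi in zip(w, f)) + b
        return host_with_slot(x, slot) - t

    for _ in range(steps):
        errs = [host_error(x, f, t) for x, f, t in zip(xs, feats, targets)]
        # Gradient of the mean squared host error w.r.t. adapter parameters.
        grad_w = [2.0 * sum(e * f[i] for e, f in zip(errs, feats)) / n
                  for i in range(d)]
        grad_b = 2.0 * sum(errs) / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b

    mse = sum(host_error(x, f, t) ** 2
              for x, f, t in zip(xs, feats, targets)) / n
    return mse, mse < threshold

xs = [i / 10 for i in range(-10, 11)]
print(breach_test(piece_with_breach, xs))
print(breach_test(piece_without_breach, xs))
```

In this toy, the first piece gets flagged and the second does not. The choice of wrapper capacity matters a lot: a wrapper that is too expressive could implement the removed capacity by itself and flag everything, which is why the wrapper is kept shallow.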
Now, we don't actually want to check every tiny piece of the AI, so instead we train a third system to search for sections that might contain the disallowed ability and to predict whether one exists anywhere within the first AI, using the second AI only as an expensive check.
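A rough sketch of that division of labor, again in Python and again with hypothetical names throughout: `cheap_score` stands in for the third system's suspicion score, `expensive_check` stands in for the full behavioral test against the second AI, and `budget` is an assumed cap on how many expensive checks we can afford.

```python
# Toy sketch only: the function and parameter names are hypothetical.

def triage(sections, cheap_score, expensive_check, budget):
    """Rank sections by the cheap suspicion score and spend the expensive
    behavioral check only on the `budget` most suspicious ones.
    Returns {section: breach_found} for the sections actually checked."""
    ranked = sorted(sections, key=cheap_score, reverse=True)
    return {section: expensive_check(section) for section in ranked[:budget]}

# Example with made-up scores and a made-up ground truth.
scores = {"perception.3": 0.9, "memory.1": 0.2, "perception.7": 0.7}
has_breach = {"perception.3"}
checked = triage(list(scores), scores.get, lambda s: s in has_breach, budget=2)
print(checked)
```

The returned labels could also be fed back as training data for the third system, so the expensive check gets used both to confirm breaches and to improve the cheap predictor.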
Seeing the same abilities crop up in the wrong place would tell you about the incentives innate to your architecture's components and gesture towards new architectures that relieve those incentives (e.g. if you find planning in your perception, then maybe you need to attach the planner to the perception module in a controlled way).
None of this will work at later stages, when an AGI can operate on itself, but I would hope Cognitive Zoning could help during the crucial phase when we have an AGI architecture in our hands but have not yet deployed instances at a scale where they are dangerous.
Thoughts and improvements? I'm sure this isn't a novel idea but has anyone written about it?