We can also construct an analogous simplicity argument for overfitting:
Overfitting networks are free to implement a very simple function— like the identity function or a constant function— outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.
As far as I understand, people usually talk about simplicity biases in terms of the volume of basins in parameter space. So the response would be that an overfitting network needs more parameters than other (probably lower description length) algorithms and therefore has smaller basins.
I'm curious whether you endorse or reject this way of defining simplicity via the size of the basins of a set of similar algorithms?
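To make my picture concrete, here's a toy sketch of the kind of "volume" measure I have in mind (my own construction, not something from the post, and only a prior-volume proxy rather than actual basin volume): sample random weights for a tiny network, keep the draws that fit a small training set, and see how that conditional volume splits across behaviours on held-out inputs. The 3-bit task, architecture, and thresholding are arbitrary placeholders:

```python
import itertools
import numpy as np

# Toy sketch (my own construction): under a Gaussian prior over the weights of a
# tiny thresholded network, estimate how much of the parameter volume that fits a
# small training set implements each possible behaviour on the held-out inputs.
# This is a prior-volume proxy, not a literal basin-volume measurement.

rng = np.random.default_rng(0)

inputs = np.array(list(itertools.product([0.0, 1.0], repeat=3)))  # all 3-bit inputs
target = inputs[:, 0]                                             # "generalizing" rule: copy the first bit
train_idx = [0, 3, 5, 6]                                          # training subset
test_idx = [1, 2, 4, 7]                                           # held-out inputs

def forward(params, x):
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)
    return (h @ w2 + b2 > 0).astype(float)  # hard-threshold the output

behaviour_counts, n_hits = {}, 0
for _ in range(100_000):
    params = (rng.normal(size=(3, 4)), rng.normal(size=4),
              rng.normal(size=4), rng.normal())
    preds = forward(params, inputs)
    if np.array_equal(preds[train_idx], target[train_idx]):  # fits the training data
        n_hits += 1
        key = tuple(int(v) for v in preds[test_idx])          # off-training-set behaviour
        behaviour_counts[key] = behaviour_counts.get(key, 0) + 1

# Share of the training-set-fitting volume assigned to each held-out behaviour.
for key, count in sorted(behaviour_counts.items(), key=lambda kv: -kv[1]):
    print(key, count / n_hits)
print("generalizing behaviour:", tuple(int(v) for v in target[test_idx]))
```

Obviously this says nothing about SGD's actual basins; it's just the crude counting-style quantity I'm gesturing at when I say "volume".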
The way I'm currently thinking about this is:
Assume we are training end-to-end on tasks that require the network to do deep, multi-step reasoning and to implement high-frequency functions that generate dramatically new outputs based on an updated understanding of science, etc. (Assume also that either we are not monitoring the CoT, or we are using a large net that emulates CoT internally without good interpretability.)
Then the basins of the schemers that use the fewest parameters take up a large part of parameter space, and the basins of harmless nets with few parameters take up a large part as well. Gradient descent will select whichever is larger.
I don't understand gradient descent's inductive biases well enough to have strong intuitions about which would be larger. So I end up feeling that either could happen; I'd bet 60% that the minimal-parameter schemers' basin is larger, since there's maybe slightly less space needed for encoding the harmlessness. In that case I'd expect a 99%+ probability of a schemer; in the case where the harmless basins are larger, a 99%+ probability of a harmless model.
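(Spelling out the arithmetic behind that: under these numbers my overall credence in getting a schemer would be roughly 0.6 × 0.99 + 0.4 × 0.01 ≈ 0.6, so it's almost entirely driven by which basin I think is bigger.)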
I suppose this isn't exactly a counting argument, because I think evidence about inductive biases will quickly overcome any such argument, and I'm agnostic about what evidence I will receive, since I'm not very knowledgeable about the area and other people seem to disagree a bunch.
Is my reasoning here flawed in some obvious way?
Also, I appreciated the example of the cortices doing reasonably intelligent stuff without seemingly doing any scheming. It makes me more hopeful that an AGI system with interpretable CoT, made up of a bunch of cortex-level subnets plus some control techniques, would be sufficient to strongly accelerate the construction of a global x-risk defense system.