How is inverse temperature a penalty on models? If you're referring to the inverse temperature in the Maxwell-Boltzmann distribution, the temperature is considered a constant, and it gives the likelihood of a particle having a particular configuration, not the likelihood of a distribution.
Also, I'm not sure it's clear what you mean by "information to specify [a model]". Does a high inverse temperature mean a model requires more information, because it's more sensitive to small changes and therefore derives more information from them, or does it mean that the model requires less information, because it derives less information from inputs?
I think the entropy of the Maxwell-Boltzmann distribution grows linearly with log-temperature (up to an additive constant), so high temperature (low sensitivity to inputs) is preferred if you go strictly by that. People who train neural networks generally do something similar to prevent overtraining, and they call it regularization.
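As a rough check on the log-temperature claim, here is a minimal sketch. It assumes the ideal-gas case, where each velocity component is Gaussian with variance kT/m, and it uses units with k_b = m = 1. The function name `mb_velocity_entropy` is made up for illustration.

```python
import math

def mb_velocity_entropy(temperature, mass=1.0, k_b=1.0):
    """Differential entropy (in nats) of a 3D Maxwell-Boltzmann velocity distribution.

    Assumption: each of the three velocity components is an independent
    Gaussian with variance k_b * T / mass, so the joint entropy is
    3 * 0.5 * ln(2 * pi * e * variance).
    """
    variance = k_b * temperature / mass
    return 1.5 * math.log(2.0 * math.pi * math.e * variance)

# Doubling the temperature adds the same amount of entropy each time,
# which is what "linear in log-temperature" means.
for t in (1.0, 2.0, 4.0):
    print(t, mb_velocity_entropy(t))
```

Each doubling of temperature adds 1.5 ln 2 nats, so the entropy is an additive constant plus 1.5 ln T rather than strictly proportional to ln T.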
If you are referring to the entropy of a model, you penalize a distribution for requiring more information by selecting the distribution that maximizes entropy subject to whatever invariants your model must abide by. This is typically done through the method of Lagrange multipliers.
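For concreteness, here is a sketch of that Lagrange-multiplier calculation for the simplest case, assuming discrete states with energies E_i and a single constraint on the mean energy U. The symbols α, β, U and Z are introduced here for illustration.

```latex
% Maximize H = -\sum_i p_i \ln p_i subject to normalization and a fixed mean energy.
\mathcal{L} = -\sum_i p_i \ln p_i
              - \alpha \Big( \sum_i p_i - 1 \Big)
              - \beta \Big( \sum_i p_i E_i - U \Big)

% Stationarity with respect to each p_i:
\frac{\partial \mathcal{L}}{\partial p_i} = -\ln p_i - 1 - \alpha - \beta E_i = 0
\quad\Longrightarrow\quad
p_i = e^{-1-\alpha} \, e^{-\beta E_i}

% Normalization fixes \alpha and gives the Boltzmann form:
p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}
```

In this setup the inverse temperature β shows up as the multiplier attached to the energy constraint, not as a penalty term chosen in advance.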
You assign a probability to a microstate according to its energy and the temperature. The density of states then produces very nontrivial behavior at various temperatures (especially in solid-state systems).
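A minimal sketch of that assignment, assuming a finite list of microstate energies and units with k_b = 1. The function name `boltzmann_probabilities` and the example energies are made up for illustration.

```python
import math

def boltzmann_probabilities(energies, temperature, k_b=1.0):
    """Return the probability of each microstate, p_i proportional to exp(-E_i / (k_b * T))."""
    beta = 1.0 / (k_b * temperature)  # inverse temperature
    e_min = min(energies)             # shift energies so the largest weight is exactly 1
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)                  # partition function (for the shifted energies)
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0]            # hypothetical three-level system
for t in (0.1, 1.0, 10.0):
    print(t, boltzmann_probabilities(energies, t))
```

At low temperature nearly all of the probability sits on the lowest-energy state, and at high temperature the distribution approaches uniform, which is the sensitivity-to-inputs trade-off discussed above.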
You appear to know somewhat more about fitting than I do. As I understood it, you assign a probability to a specific model according to its information content and the 'temperature'. The information content would be, if your model is a curve fit with four parameters, all of which are held to a narrow range, that has 1/3 more information than a fit with three ...
A putative new idea for AI control; index here.
Noise versus preference and complexity
Error versus bias versus preference
Preference versus prejudice (and bias)
Known prejudices
Revisiting complexity