An interesting analogy, closer to ML, would be to look at neuroscience. It's an older field than ML, and it seems that the physics perspective has been fairly productive there, even though it has not yet provided a grand unified theory of cognition.
For examples, I can recommend the book Models of the Mind by Grace Lindsay, which gives an overview of the many ways physics has contributed to neuroscience.
In principle, one might expect it to be easier to make progress with a physics perspective on AI than in neuroscience, for example because experiments are easier in AI (in neuroscience we do not have access to the values of the weights, we do not always have access to all the neurons, and often it is not possible to intervene on the system).
Interesting! Perhaps one way to not be fooled by such situations could be to use a non-parametric statistical test. For example, we could apply permutation testing: by resampling the data to break its correlation structure and performing PCA on each permuted dataset, we can form a null distribution of eigenvalues. Then, by comparing the eigenvalues from the original data to this null distribution, we could assess whether the observed structure is unlikely under randomness. Specifically, we'd look at how extreme each original eigenvalue is relative to those from the permuted data to decide whether we can reject the null hypothesis that the data is random. This could be computationally prohibitive, but maybe there are ways to get around that (reducing the number of permutations, downsampling the data, ...). Also note that failing to reject the null does not mean that the data is random. Somehow it does not look like these kinds of tests are common practice in interpretability?
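A rough sketch of what I have in mind (my own illustration; the column-wise shuffling scheme and the naive componentwise p-values are simplifying choices, and all names and defaults are just placeholders):

```python
# Sketch of a permutation test for PCA eigenvalues: shuffle each column
# independently to destroy cross-feature correlations (marginals preserved),
# redo PCA on each permuted dataset, and compare the original eigenvalues
# against the resulting null distribution.
import numpy as np

def pca_eigenvalues(X):
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending order

def permutation_test_pca(X, n_permutations=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = pca_eigenvalues(X)
    null = np.empty((n_permutations, X.shape[1]))
    for i in range(n_permutations):
        # Permute each feature independently: correlation structure is broken.
        X_perm = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(X.shape[1])]
        )
        null[i] = pca_eigenvalues(X_perm)
    # One-sided p-value per component: how often does the null produce an
    # eigenvalue at least as large as the observed one?
    p_values = (1 + (null >= observed).sum(axis=0)) / (1 + n_permutations)
    return observed, p_values

# Example on pure noise: we should (mostly) fail to reject the null.
X = np.random.default_rng(1).normal(size=(200, 10))
eigvals, pvals = permutation_test_pca(X, n_permutations=200)
print(pvals)
```

Shuffling each column independently preserves the marginal distribution of every feature while destroying the correlations between features, which is exactly the null we want to compare against.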
Right, I got confused because I thought your problem was about trying to define a measure of optimisation power - for example one analogous to the Yudkowsky measure - that also refers to a utility function while being invariant under scaling and translation, but this is different from asking
"what fraction of the default expected utility comes from outcomes at least as good as this one?"
What about the optimisation power of an outcome $x$ as a measure based on the outcomes that have utility greater than the utility of $x$?
Let $S_x$ be the set of outcomes with utility greater than $U(x)$ according to the utility function $U$:

$$S_x = \{x' \mid U(x') > U(x)\}.$$
The set $S_x$ is invariant under translation and positive rescaling of the utility function, and we define the optimisation power of the outcome $x$ according to the utility function $U$ as:

$$\mathrm{OP}_U(x) = -\log_2 \mathbb{P}(S_x),$$

where $\mathbb{P}$ is the default distribution over outcomes.
This does not suffer from comparing against a worst case, and it seems to satisfy the same intuition as the original OP definition while referring to a utility function.
This is in fact the same measure as the original optimisation power measure, with the order over outcomes given by the utility function $U$.
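As a quick toy check (my own sketch, assuming a finite outcome space with a uniform default distribution; none of the names below are from the discussion above):

```python
# Toy computation of the proposed measure: OP_U(x) = -log2 P(S_x), where S_x
# is the set of outcomes with utility strictly greater than U(x) and P is the
# default distribution over outcomes.
import numpy as np

def optimisation_power(x, outcomes, utility, default_probs):
    mask = np.array([utility(o) > utility(x) for o in outcomes])
    p_better = default_probs[mask].sum()
    return -np.log2(p_better) if p_better > 0 else np.inf

outcomes = list(range(8))             # 8 possible outcomes
default_probs = np.full(8, 1 / 8)     # uniform default distribution
utility = lambda o: float(o)          # utility = outcome index

# Invariance check: a positive affine transform a*U + b gives the same value.
utility2 = lambda o: 3.0 * o + 7.0

x = 5
print(optimisation_power(x, outcomes, utility, default_probs))   # -log2(2/8) = 2.0
print(optimisation_power(x, outcomes, utility2, default_probs))  # same value
```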
Nice recommendations! In addition to brain enthusiasts being useful for empirical work, there are also theoretical tools from systems neuroscience that could be useful for AI safety. One area in particular is interpretability: if we want to model a network at various levels of "emergence", recent developments in information decomposition and multivariate information theory, which move beyond pairwise interactions in a neural network, might be very useful. Also see recent publications on modelling synergistic information and dynamical independence, which could perhaps automate the discovery of macro variables; this seems well worth exploring to study higher levels of large ML models. It would actually require both empirical and theoretical work: once the various measures of information decomposition are clearer, one would need to estimate and test them empirically in actual ML systems for interpretability, if they turn out to be meaningful.
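As a very rough illustration of what going beyond pairwise interactions can look like (my own toy sketch, not taken from the publications mentioned above; interaction information is only one of the simplest such quantities):

```python
# Interaction information I(X;Y;Z) = I(X;Y) - I(X;Y|Z) for discrete variables,
# estimated from joint counts. Negative values indicate synergy-dominated
# structure that pairwise measures would miss entirely.
import numpy as np

def entropy(*columns):
    # Joint entropy (in bits) of discrete columns via their joint counts.
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mutual_information(x, y, z):
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def interaction_information(x, y, z):
    return mutual_information(x, y) - conditional_mutual_information(x, y, z)

# XOR example: pairwise MI between x and y is ~0, yet the triple is synergistic.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)
y = rng.integers(0, 2, size=10_000)
z = x ^ y
print(interaction_information(x, y, z))  # close to -1 bit (pure synergy)
```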
Thanks for all the useful links! I'm also always happy to receive more feedback.
I agree that the sense in which I use metaethics in this post is different from what academic philosophers usually call metaethics. I have the impression that metaethics, in the academic sense, and metaphilosophy are related: studying what morality itself is, how to select ethical theories, and what the process behind ethical reasoning is do not seem like independent questions. For example, if moral nihilism is more plausible, then it seems less likely that there is some meaningful feedback loop for selecting ethical theories, or that there is such a meaningful thing as a ‘good’ ethical theory (at least in an observer-independent way). If moral emotivism is more plausible, then maybe reflecting on ethics is more like rationalising emotions, e.g. typically expressing in a sophisticated way something that fundamentally just means ‘boo suffering’. In that case, a better understanding of metaethics in the academic sense seems to shed some light on the process that generates ethical theories, at least in humans.
Sure, I'm happy to read/discuss your ideas about this topic.
I am not sure what computer-aided analysis means here, but one possibility could be to have formal ethical theories and to prove theorems inside their formal frameworks. But this raises questions about the sort of formal framework that one could use to 'prove theorems' about ethics in a meaningful way.
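As a purely illustrative sketch of what "proving theorems inside a formal framework" could mean here (my own toy example in Lean, with a made-up EthicalTheory structure; not a serious proposal for how to formalise ethics):

```lean
-- A toy "formal ethical theory": outcomes equipped with an integer-valued
-- utility, and a betterness relation defined from it.
structure EthicalTheory (Outcome : Type) where
  utility : Outcome → Int

-- "a is at least as good as b" according to theory T.
def atLeastAsGood {Outcome : Type} (T : EthicalTheory Outcome)
    (a b : Outcome) : Prop :=
  T.utility b ≤ T.utility a

-- A (very modest) theorem inside the framework: betterness is transitive.
theorem atLeastAsGood_trans {Outcome : Type} (T : EthicalTheory Outcome)
    {a b c : Outcome} (hab : atLeastAsGood T a b) (hbc : atLeastAsGood T b c) :
    atLeastAsGood T a c :=
  Int.le_trans hbc hab
```

The hard part, of course, is not proving such statements but choosing axioms that actually capture anything ethically meaningful, which is exactly the question about the formal framework.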
Another perspective would be to look at the activations of an autoregressive deep learning model, e.g. a transformer, during inference as a stochastic process: the collection of activations $X_t$ at some layer as random variables indexed by time $t$, where $t$ is the token position.
One could for example look at the mutual information between the history $X_{\leq t} = (X_t, X_{t-1}, \dots)$ and the future of the activations $X_{t+1}$, or look at (conditional) mutual information between the past and future of subprocesses of $X_t$ (note: transfer entropy can be a useful tool to quantify directed information flow between different stochastic processes). There are many information-theoretic quantities one could be looking at.
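A minimal sketch of what estimating such a quantity could look like in practice (my own illustration: the "activations" here are a synthetic AR(1) stand-in rather than real hidden states, and scikit-learn's k-NN mutual information estimator is just one convenient choice):

```python
# Treat (a 1-d projection of) the activations at one layer as a stochastic
# process indexed by token position, and estimate the mutual information
# between a short history window and the next step.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)

# Stand-in for a 1-d projection of layer activations over T token positions.
T = 5_000
acts = np.zeros(T)
for t in range(1, T):
    acts[t] = 0.8 * acts[t - 1] + rng.normal()

# History window (X_{t-k+1}, ..., X_t) and next-step "future" X_{t+1}.
k = 3
history = np.column_stack([acts[i:T - k + i] for i in range(k)])  # (T-k, k)
future = acts[k:]                                                 # (T-k,)

# MI (in nats) between each lag in the history and the future value.
mi_per_lag = mutual_info_regression(history, future, random_state=0)
print(mi_per_lag)  # the most recent lag (last column) should carry the most
```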
If you want to formally define a probability distribution over activations, you could maybe push forward the discrete probability distribution over tokens (in particular the predictive distribution) via the embedding map.
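Concretely, at the embedding layer this pushforward is just a discrete distribution over embedding vectors with the same weights as the token distribution. A minimal sketch (my own illustration, with a made-up toy embedding matrix):

```python
# Pushing forward a next-token distribution through the embedding map to get
# a discrete probability distribution over points in activation space.
import numpy as np

vocab_size, d_model = 8, 4
rng = np.random.default_rng(0)

# Hypothetical embedding matrix E: token id -> embedding vector.
E = rng.normal(size=(vocab_size, d_model))

# Predictive (next-token) distribution from the model, p(token).
logits = rng.normal(size=vocab_size)
p_tokens = np.exp(logits) / np.exp(logits).sum()

# The pushforward places mass p_tokens[i] on the point E[i].
support = E           # atoms of the pushforward distribution
weights = p_tokens    # their probabilities

# Expected embedding under the pushforward, as a simple sanity check.
mean_embedding = weights @ support
print(mean_embedding.shape)  # (d_model,)
```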
In the context of computational mechanics this seems like a useful perspective, for example for finding belief states in a data-driven way by optimizing the mutual information between some coarse-graining of the past states and the future states (this is too vague stated like that, and I am working on a draft that gets into more detail about that perspective).