Uhm two comments/questions on this.
Why do you need to decide between those probability distributions? You only need to get one action (or a distribution over actions) out, and you can do that without deciding, e.g. by taking their average and sampling. On the other hand, vNM tells us a utility function is being assigned whenever your choices satisfy certain axioms, but "vNM = agency" is a complicated position to hold.
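To make the "no decision needed" point concrete, here's a minimal sketch (my own toy numbers, nothing from the post): two subsystems output different distributions over the same actions, and you still get a single action out by just mixing the distributions and sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: two subsystems/policies give different distributions
# over the same three actions. Instead of deciding which distribution is the
# "real" one, mix them (here: a plain average) and sample an action.
p1 = np.array([0.7, 0.2, 0.1])   # distribution from subsystem 1
p2 = np.array([0.1, 0.3, 0.6])   # distribution from subsystem 2

p_mix = 0.5 * (p1 + p2)          # still a valid probability distribution
action = rng.choice(len(p_mix), p=p_mix)
print(p_mix, action)
```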
We know that, at some level, every physical system is doing gradient descent or a variational version of it. So depending on the scale at which you model a system, would you assign it different degrees of agency?
By the way, gradient descent is a form of local utility minimization, and by tweaking the meaning of 'local' one can recover many other things (evolution, Bayesian inference, RL, 'games', etc.).
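To spell out what I mean by 'local utility minimization', here's a small sketch (my own toy example): each gradient descent step is exactly the solution of a local problem, i.e. minimizing the linearized utility plus a penalty for moving far from the current point.

```python
import numpy as np
from scipy.optimize import minimize

def U(x):                        # toy "utility" we want to make small
    return np.sum((x - 3.0) ** 2)

def grad_U(x):
    return 2.0 * (x - 3.0)

def gd_step(x, lr=0.1):
    # the usual gradient descent update
    return x - lr * grad_U(x)

def local_min_step(x, lr=0.1):
    # minimize the *local* surrogate around x:
    #   y -> U(x) + grad_U(x) . (y - x) + ||y - x||^2 / (2 * lr)
    surrogate = lambda y: U(x) + grad_U(x) @ (y - x) + np.sum((y - x) ** 2) / (2 * lr)
    return minimize(surrogate, x).x

x = np.zeros(2)
print(gd_step(x))        # [0.6 0.6]
print(local_min_step(x)) # same point, up to numerical tolerance
```

Swapping the quadratic proximity penalty for other notions of 'local' is the sense in which one can recover other update rules; the sketch only demonstrates the gradient-descent case.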
Here's what my spidey sense is telling me: the model is trying to fit in as many representations as possible (IIRC this is known in mech interp), and by merely pushing features $a$ and $b$ apart in a high-dimensional space you end up making $a \oplus b$ linearly separable. That is, there might be a combinatorial phenomenon underlying this, which feels counterintuitive because of the large dimensions involved.
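Here's a toy numerical check of that combinatorial intuition (my own sketch with made-up embeddings, not anything from the post): embed $a$ and $b$ along random directions plus a little noise, then fit a least-squares linear probe to $a \oplus b$. In low dimension the probe cannot fit XOR, but once the dimension is large relative to the number of samples, the noisy points are in general position and any labelling, including XOR, becomes linearly separable on the training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def xor_probe_train_accuracy(d, n=200, noise=0.1):
    """Embed booleans a, b along random directions in R^d, add small noise,
    and fit a least-squares linear probe to the XOR label a ^ b.
    Returns the probe's training accuracy."""
    v_a, v_b = rng.normal(size=(2, d)) / np.sqrt(d)            # feature directions
    a = rng.integers(0, 2, size=n)
    b = rng.integers(0, 2, size=n)
    X = np.outer(a, v_a) + np.outer(b, v_b) + noise * rng.normal(size=(n, d))
    y = 2 * (a ^ b) - 1                                        # XOR label in {-1, +1}

    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)    # linear probe
    return float(np.mean(np.sign(X @ w) == y))

for d in (2, 16, 128, 1024):
    print(f"d={d:5d}  train acc={xor_probe_train_accuracy(d):.2f}")
# Low d: the probe cannot fit XOR (no hyperplane separates all four clean
# clusters 0, v_a, v_b, v_a + v_b with labels -1, +1, +1, -1).
# d >> n: the noisy points are in general position, so *any* labelling,
# including XOR, becomes linearly separable and training accuracy hits 1.
```

This is separability by dimension counting (a Cover's-theorem-style effect, exploiting the noise coordinates) rather than generalization, but it shows how high dimension alone can make $a \oplus b$ linearly separable on a finite set of points.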