You should read up on regularization and the no free lunch theorem, if you aren't already familiar with them.
A theory is a model for a class of observable phenomena. A model is constructed from smaller primitive (atomic) elements connected together according to certain rules. (Ideally, the model's behavior or structure is isomorphic to that of the class of phenomena it is intended to represent.) We can take this collection of primitive elements, plus the rules for how they can be connected, as a modeling language. Now, depending on which primitives and rules we have selected, it may become more or less difficult to express a model with behavior isomorphic to the original, requiring more or fewer primitive elements. This means that Occam's razor will suggest different models as the simplest alternatives depending on which modeling language we have selected. Minimizing complexity in each modeling language lends a different bias toward certain models and against other models, but those biases can be varied or even reversed by changing the language that was selected. There is consequently nothing mathematically special about simplicity that lends an increased probability of correctness to simpler models.
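To make the language-dependence concrete, here is a minimal sketch using a toy setup of my own (it is not taken from the essay). The "phenomena" are two-input Boolean functions, a "modeling language" is a set of allowed operators, and complexity is the smallest number of operator applications in any formula for the function. The names `min_sizes`, `lang_a`, and `lang_b` are hypothetical, and the choice of operator sets is an assumption made purely for illustration.

```python
import math

# Truth tables of the inputs over the rows (x, y) = (1,1), (1,0), (0,1), (0,0),
# packed into 4 bits. Every two-input Boolean function is an int in 0..15.
X, Y = 0b1100, 0b1010
MASK = 0b1111
OR_TABLE = X | Y    # 0b1110
XOR_TABLE = X ^ Y   # 0b0110


def min_sizes(unary_ops, binary_ops):
    """Smallest formula size (operator count) for each function, given a
    language of allowed operators. Unreachable functions keep cost inf."""
    cost = {f: math.inf for f in range(16)}
    cost[X] = cost[Y] = 0  # the bare inputs are free leaves
    changed = True
    while changed:  # relax until no cheaper formula can be found
        changed = False
        known = [f for f in cost if cost[f] < math.inf]
        for a in known:
            for op in unary_ops:
                f, c = op(a) & MASK, cost[a] + 1
                if c < cost[f]:
                    cost[f], changed = c, True
            for b in known:
                for op in binary_ops:
                    f, c = op(a, b) & MASK, cost[a] + cost[b] + 1
                    if c < cost[f]:
                        cost[f], changed = c, True
    return cost


# Language A: NOT, AND, OR.  Language B: XOR, AND.  (Illustrative choices.)
lang_a = min_sizes([lambda a: a ^ MASK],
                   [lambda a, b: a & b, lambda a, b: a | b])
lang_b = min_sizes([], [lambda a, b: a ^ b, lambda a, b: a & b])

for name, table in [("OR", OR_TABLE), ("XOR", XOR_TABLE)]:
    print(f"{name}: size {lang_a[table]} in language A, "
          f"size {lang_b[table]} in language B")
```

In this toy setup, language A expresses OR with 1 operator and XOR with 4, while language B expresses XOR with 1 operator and OR with 3. So a rule of "prefer the smaller formula" picks OR over XOR in one language and XOR over OR in the other, which is the reversal of bias described above.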
That said, there are valid reasons to use Occam's razor nonetheless, and not just the reasons the author of this essay lists, such as resource constraint optimization. In fact, it is reasonable to expect that using Occam's razor does increase the probability of correctness, though not because simplicity is good in itself. Consider the fact that human beings evolved in this environment, and that our minds are therefore tailored by evolution to be good at identifying patterns that are common within it. In other words, the modeling language used for human cognition has been optimized to some degree to easily express patterns that are observable in our environment. Thus, for the specific pairing of the human environment with the modeling language used by human minds, a bias towards simpler models probably does indicate an increased likelihood that the model is appropriate to the observed class of phenomena, even though simplicity is irrelevant in the general case of an arbitrary pairing of environment and modeling language.
You're speaking as though complexity measures the relationship between a language and the phenomena, or between the map and the territory. But I'm pretty sure complexity is actually an objective and language-independent idea, represented in its pure form in Solomonoff induction. Complexity is a property that's observed in the world via senses or data input mechanisms, not just something within the mind. The ease of expressing a certain statement might change depending on the language you're using, but the statement's absolute complexity remains the same no matter what. You don't have to measure everything in terms of one particular language; you can step outside the particulars and generalize.
This essay claims to refute a popularized understanding of Occam's razor that I myself adhere to. It is confusing me, since I hold this belief at such a deep level that it's difficult for me to examine. Does anyone see any problems in its argument, or does it seem compelling? I specifically feel as though it might be summarizing the relevant machine learning research badly, but I'm not very familiar with the field. It also might be failing to give any credit to simplicity as a general heuristic when simplicity succeeds in a specific field, and it's unclear whether such credit would be justified. Finally, my intuition is that situations in nature where there is a steady bias towards growing complexity are more common than the author claims, and that such tendencies are stronger and last longer. However, I have no clear evidence to back up any of this, just vague notions that are difficult to examine. I'd appreciate someone else's perspective on this, as mine seems to be distorted.
Essay: http://bruce.edmonds.name/sinti/