In response to comment by So8res on MIRI's Approach
Comment author: jacob_cannell 31 July 2015 04:00:02AM *  2 points [-]

Thanks for the clarifications - I'll make this short.

Judea Pearl (and a whole host of others) showed up, formalized probabilistic graphical models, related them to Bayesian inference, and suddenly a whole class of ad-hoc solutions were superseded.

Probabilistic graphical models were definitely a key theoretical development, but they hardly swept the field of expert systems. From what I remember, in terms of practical applications, they immediately replaced or supplemented expert systems in only a few domains - such as medical diagnostic systems. Complex ad hoc expert systems continued to dominate unchallenged in most fields for decades: in robotics, computer vision, speech recognition, game AI, fighter jets, etc etc basically everything important. As far as I am aware the current ANN revolution is truly unique in that it is finally replacing expert systems across most of the board - although there are still holdouts (as far as I know most robotic controllers are still expert systems, as are fighter jets, and most Go AI systems).

The ANN solutions are more complex than the manually crafted expert systems they replace - but the complexity is automatically generated. The code the developers actually need to implement and manage is vastly simpler - this is the great power and promise of machine learning.

Here is a simple general truth - the Occam simplicity prior does imply that simpler hypotheses/models are more likely, but for any simple model there is an infinite family of approximations to that model of escalating complexity. Thus more efficient approximations naturally tend to have greater code complexity, even though they approximate a much simpler model.

My claim is that there are other steps such as those that haven't been made yet, that there are tools on the order of "causal graphical models" that we are missing.

Well, that would be interesting.

I'm not sure whether your view is of the form "actually the programmer of the future would say "I don't know how it's building a model of the world either, it's just a big neural net that I trained for a long time"" or whether it's of the form "actually we do know how to set up that system [multi-level model] already", or whether it's something else entirely. But if it's the second one, then by all means, please tell :-)

Anyone who has spent serious time working in graphics has also spent serious time thinking about how to create the matrix - if given enough computer power. If you got say a thousand of the various brightest engineers in different simulation related fields, from physics to graphics, and got them all working on a large mega project with huge funds it could probably be implemented today. You'd start with a hierarchical/multi-resolution modelling graph - using say octrees or kdtrees over voxel cells, and a general set of hierarchical bidirectional inference operators for tracing paths and interactions.
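
To make the multi-resolution idea a bit more concrete, here is a minimal sketch of an octree that refines only around a point of interest, so that detail is spent where it is needed. The names `OctreeNode` and `refine_towards`, and the depth limit, are illustrative assumptions rather than part of any existing engine; a real system would attach the per-scale approximation models to the cells.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class OctreeNode:
    """A cubic cell; hypothetical stand-in for one node of the modelling graph."""
    center: Point
    half_size: float
    depth: int = 0
    children: List["OctreeNode"] = field(default_factory=list)

    def contains(self, point: Point) -> bool:
        # Axis-aligned cube test (boundaries inclusive).
        return all(abs(p - c) <= self.half_size for p, c in zip(point, self.center))

    def subdivide(self) -> None:
        # Split this cell into 8 octants of half the edge length.
        if self.children:
            return
        h = self.half_size / 2
        cx, cy, cz = self.center
        for dx in (-h, h):
            for dy in (-h, h):
                for dz in (-h, h):
                    self.children.append(
                        OctreeNode((cx + dx, cy + dy, cz + dz), h, self.depth + 1)
                    )

    def refine_towards(self, point: Point, max_depth: int) -> None:
        # Only cells containing the point of interest get finer resolution.
        if self.depth >= max_depth or not self.contains(point):
            return
        self.subdivide()
        for child in self.children:
            child.refine_towards(point, max_depth)

    def leaf_count(self) -> int:
        if not self.children:
            return 1
        return sum(child.leaf_count() for child in self.children)


root = OctreeNode(center=(0.0, 0.0, 0.0), half_size=1.0)
root.refine_towards((0.3, 0.3, 0.3), max_depth=3)
```

The coarse cells far from the point stay as single leaves, which is the whole point of the hierarchy: the expensive low-level codes only run in the few cells that were refined.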

To make it efficient, you need a huge army of local approximation models for different phenomena at different scales - low level quantum codes just in case, particle level codes, molecular bio codes, fluid dynamics, rigid body, etc etc. It's a sea of codes with decision tree like code to decide which models to use where and when.

Of course with machine learning we could automatically learn most of those codes - which suddenly makes it more tractable. And then you could use that big engine as your predictive world model, once it was trained.

The problem is to plan anything worthwhile you need to simulate human minds reasonably well, which means to be useful the sim engine would basically need to infer copies of everyone's minds ...

And if you can do that, then you already have brain based AGI!

So I expect that the programmer from the future will say - yes at the low level we use various brain-like neural nets, and various non-brain like neural nets or learned virtual circuits, some operating over explicit space-time graphs. In all cases we have pretty detailed knowledge of what the circuits are doing - here take a look at that last goal update that just propagated in your left anterior prefrontal cortex ...

Comment author: YVLIAZ 07 September 2015 07:45:54AM 0 points [-]

I just want to point out some nuances.

1) The divide between your so called "old CS" and "new CS" is more of a divide (or perhaps a continuum) between engineers and theorists. The former is concerned with on-the-ground systems, where quadratic time algorithms are costly and statistics is the better weapon for dealing with real world complexities. The latter is concerned with abstracted models where polynomial time is good enough and logical deduction is the only tool. These models will probably never be applied literally by engineers, but they provide human understanding of engineering problems, and because of their generality, they will last longer. The idea of a Turing machine will last centuries if not millennia, but a Pascal programmer might not find a job today and a Python programmer might not find a job in 20 years. Machine learning techniques constantly come in and out of vogue, but something like the PAC model will be here to stay for a long time. But of course at the end of the day it's engineers who realize new inventions and technologies.

Theorists' ideas can transform an entire engineering field, and engineering problems inspire new theories. We need both types of people (or rather, people across the spectrum from engineers to theorists).

2) With neural networks increasing in complexity, making the learning converge is no longer as simple as just running gradient descent. In particular, something like a K12 curriculum will probably emerge to guide the AGI past local optima. For example, the recent paper on neural Turing machines has already employed curriculum learning, as the authors couldn't get good performance otherwise. So there is a nontrivial maintenance cost (in designing a curriculum) to a neural network so that it adapts to a changing environment, which will not lessen if we don't better our understanding of it.

Of course expert systems also have maintenance costs, of a different type. But my point is that neural networks are not free lunches.

3) What caused the AI winter was that AI researchers didn't realize how difficult it was to do what seems so natural to us --- motion, language, vision, etc. They were overly optimistic because they succeeded in what was difficult for humans --- chess, math, etc. I think it's fair to say the ANNs have "swept the board" in the former category, the category of lower level functions (machine translation, machine vision, etc), but the high level stuff is still predominantly logical systems (formal verification, operations research, knowledge representation, etc). It's unfortunate that the neural camp and logical camp don't interact too much, but I think it is a major objective to combine the flexibility of neural systems with the power and precision of logical systems.

Here is a simple general truth - the Occam simplicity prior does imply that simpler hypotheses/models are more likely, but for any simple model there is an infinite family of approximations to that model of escalating complexity. Thus more efficient approximations naturally tend to have greater code complexity, even though they approximate a much simpler model.

Schmidhuber invented something called the speed prior that weighs an algorithm according to how fast it generates the observation, rather than how simple it is. He makes some ridiculous claims about our (physical) universe assuming the speed prior. Ostensibly one can also weigh in accuracy of approximation in there to produce another variant of prior. (But of course all of these will lose the universality enjoyed by the Occam prior)

Comment author: Squark 31 July 2015 06:22:22AM *  7 points [-]

The concern that ML has no solid theoretical foundations reflects the old computer science worldview, which is all based on finding bit exact solutions to problems within vague asymptotic resource constraints.

It is an error to confuse the "exact / approximate" axis with the "theoretical / empirical" axis. There is plenty of theoretical work in complexity theory on approximate algorithms.

A good ML researcher absolutely needs a good idea of what is going on under the hood - at least at a sufficient level of abstraction.

There is a difference between "having an idea" and "solid theoretical foundations". Chemists before quantum mechanics had lots of ideas. But they didn't have a solid theoretical foundation.

Why not test safety long before the system is superintelligent? - say when it is a population of 100 child like AGIs. As the population grows larger and more intelligent, the safest designs are propagated and made safer.

Because this process is not guaranteed to yield good results. Evolution did the exact same thing to create humans, optimizing for genetic fitness. And humans still went and invented condoms.

So it may actually be easier to drop the traditional computer science approach completely.

When the entire future of mankind is at stake, you don't drop approaches because it may be easier. You try every goddamn approach you have (unless "trying" is dangerous in itself of course).

In response to comment by Squark on MIRI's Approach
Comment author: YVLIAZ 07 September 2015 06:39:30AM 0 points [-]

There is a difference between "having an idea" and "solid theoretical foundations". Chemists before quantum mechanics had lots of ideas. But they didn't have a solid theoretical foundation.

That's a bad example. You are essentially asking researchers to predict what they will discover 50 years down the road. A more appropriate example is a person thinking he has medical expertise after reading bodybuilding and nutrition blogs on the internet, vs a person who has gone through medical school and is an MD.

Comment author: [deleted] 30 July 2015 11:24:26PM *  2 points [-]

Thus the argument that there are people using DL without understanding it - and moreover that this is dangerous - is specious and weak because these people are not the ones actually likely to develop AGI let alone superintelligence.

Yes, but I don't think that's an argument anyone has actually made. Nobody, to my knowledge, sincerely believes that we are right around the corner from superintelligent, self-improving AGI built out of deep neural networks, such that any old machine-learning professor experimenting with how to get a lower error rate in classification tasks is going to suddenly get the Earth covered in paper-clips.

Actually, no, I can think of one person who believed that: a radically underinformed layperson on reddit who, for some strange reason, believed that LessWrong is the only site with people doing "real AI" and that "[machine-learning researchers] build optimizers! They'll destroy us all!"

Hopefully he was messing with me. Nobody else has ever made such ridiculous claims.

Sorry, wait, I'm forgetting to count sensationalistic journalists as people again. But that's normal.

Instead of thinking of 'safety' or 'alignment' as some absolute binary property we can guarantee, it is more profitable to think of a complex distribution over the relative amounts of 'safety' or 'alignment' in an AI population

No, "guarantees" in this context meant PAC-style guarantees: "We guarantee that with probability at least 1-\delta, the system will 'go wrong' relative to what its sample data taught it at most an \epsilon fraction of the time." You then need to plug in the epsilons and deltas you want and solve for how much sample data you need to feed the learner. The links for intro PAC lectures in the other comment given to you were quite good, by the way, although I do recommend taking a rigorous introductory machine learning class (new grad-student level should be enough to inflict the PAC foundations on you).
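
As a rough illustration of "plug in the epsilons and deltas and solve for the sample size", here is a sketch using the textbook bound for a consistent learner over a finite hypothesis class in the realizable case, m >= (ln|H| + ln(1/delta)) / epsilon. The function name and the example numbers are assumptions made for illustration; richer hypothesis classes need VC-dimension style bounds instead.

```python
import math


def pac_sample_size(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Number of samples sufficient so that, with probability >= 1 - delta,
    any hypothesis consistent with the sample has true error <= epsilon.

    Assumes a finite hypothesis class and the realizable setting
    (some hypothesis in the class is perfectly correct).
    """
    if num_hypotheses < 1:
        raise ValueError("need at least one hypothesis")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    bound = (math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)


# Example: 1000 candidate hypotheses, error at most 10%, confidence 95%.
samples_needed = pac_sample_size(num_hypotheses=1000, epsilon=0.1, delta=0.05)
```

Note how the requirement grows only logarithmically in the number of hypotheses and in 1/delta, but linearly in 1/epsilon.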

we can at least influence or steer the distribution by selecting for agent types that are more safe/altruistic

"Altruistic" is already a social behavior, requiring the agent to have a theory of mind and care about the minds it believes it observes in its environment. It also assumes that we can build in some way to learn what the hypothesized minds want, learn how they (ie: human beings) think, and separate the map (of other minds) from the territory (of actual people).

Note that "don't disturb this system over there (eg: a human being) because you need to receive data from it untainted by your own causal intervention in any way" is a constraint that at least I, personally, do not know how to state in computational terms.

In response to comment by [deleted] on MIRI's Approach
Comment author: YVLIAZ 07 September 2015 06:14:27AM 0 points [-]

I think you are overhyping the PAC model. It surely is an important foundation for probabilistic guarantees in machine learning, but there are some serious limitations when you want to use it to constrain something like an AGI:

  1. It only deals with supervised learning

  2. Simple things like finite automata are not learnable, but in practice it seems like humans pick them up fairly easily.

  3. It doesn't deal with temporal aspects of learning.

However, there are some modifications of the PAC model that can ameliorate these problems, like learning with membership queries (item 2).

It's also perhaps a bit optimistic to say that PAC-style bounds on a possibly very complex system like an AGI would be "quite doable". We don't even know, for example, whether DNF is learnable in polynomial time under the distribution free assumption.

Comment author: YVLIAZ 15 October 2014 09:15:34AM 5 points [-]

I would definitely recommend learning basics of algorithms, feasibility (P vs NP), or even computability (halting problem, Godel's incompleteness, etc). They will change your worldview significantly.

CLRS is a good entry point. After that, perhaps Sipser for some more depth.

Comment author: savageorange 20 January 2014 10:53:29PM *  1 point [-]

Now I have found an easy way to snap out of it: simply switch the book/subject. Switching from math to biology/neuroscience works better than switching from math to math (e.g. algebra to topology, category theory to recursion theory, etc), but the latter can still recover some of the mental resistance built up. I don't see how this can fit in the framework of "have-to" and "want-to".

I do ('have-to' and 'want-to' are dynamically redefined things for a person, not statically defined things). I regard excessive repetition as dangerous*.. even on a subconscious level. So as I get into greater # of repetitions, I feel greater and greater unease, and it's an increasing struggle to keep my focus in the face of my fear. So my 'want-to' either reduces or is muted by fear. If you do not have this type of experience, obviously this does not apply.

* Burn out and overhabituation/compulsive behaviours being two notable possibilities.

Comment author: YVLIAZ 21 January 2014 10:35:09PM 0 points [-]

Yes, so the exact definition of "have-to" and "want-to" already presents some difficulties in pinpointing what exactly the theory says.

In my personal experience, it's not so much "fear" as fatigue and frustration. I also don't feel that my desire to read reduces; it stays intense, but my brain just can't keep absorbing information, and I find myself rereading the same passages because I can't wrap my head around them.

Comment author: YVLIAZ 20 January 2014 09:12:15PM 6 points [-]

I can see this theory working in several scenarios, despite (or perhaps rather because of) the relative fuzziness of its description (which is of course the norm in psychological theories so far). However I have personal experiences that at least at face value don't seem to be able to be explained by this theory:

During my breaks I would read textbooks, mostly mathematics and logic, but also branching into biology/neuroscience, etc. I would begin with pleasure, but if I read the same book for too long (several days) my reading speed slows down and I start flipping a couple pages to see how far it is till the next section/chapter. So to me this seems not like a motivation shift from "have-to" to "want-to", but rather the brain's getting fatigued at parsing text/building its knowledge database, and subjectively I still want to keep reading, and advancing page by page still brings me pleasure, but there's something "biological" that keeps me back (of course everything about me is biological, but I mean it in a metaphorical way, that it feels quite distinct from the motivational system that makes me want to read).

Now I have found an easy way to snap out of it: simply switch the book/subject. Switching from math to biology/neuroscience works better than switching from math to math (e.g. algebra to topology, category theory to recursion theory, etc), but the latter can still recover some of the mental resistance built up. I don't see how this can fit in the framework of "have-to" and "want-to". Nobody's forcing me to read these books; it's purely my desire. If the majority of executive function can be explained in such a way as expounded by the paper, then I do not see how switching subject of reading can make such a big difference.

Of course I may be an outlier here, or I'm misunderstanding what constitutes "willpower" or not. Feel free to offer your opinions.

Either way, I'm glad that this is an active area of research. I'm quite interested in motivation myself.

Comment author: YVLIAZ 06 February 2013 03:36:53PM 0 points [-]

I bought these with a 4 socket adapter. However, I think my lamp can't power them all. Does anyone know a higher output lamp?

Actually I'm not even sure if that is how lights work. If someone can explain how I can increase the power that goes to the light bulbs, it'd be greatly appreciated.

In response to Collapse Postulates
Comment author: Wiseman 09 May 2008 05:18:42PM 0 points [-]

4 points:

If collapse actually worked the way its adherents say it does, it would be: 1. The only non-linear evolution in all of quantum mechanics. 2. The only non-unitary evolution in all of quantum mechanics. 3.... WHAT DOES THE GOD-DAMNED COLLAPSE POSTULATE HAVE TO DO FOR PHYSICISTS TO REJECT IT? KILL A GOD-DAMNED PUPPY?

Not a valid argument. The physics of the universe are what they are, at the microscopic and macroscopic levels. If it so happens that there is some non-GR-violating non-locality going on (don't complain, just cause you can't imagine it, doesn't mean it's not possible), then your list above simply would be wrong, and there would be no violation of "traditional physics" to complain about.

In any case, since from the perspective of each world we have non-determinism, and the only world we are acting on is our own, why is it necessary to explain many worlds for the purposes of AGI?

Well, first: Does any collapse theory have any experimental support? No.

Neither does MW, they are both interpretations.

I'm going out on a limb on this one, but since the whole universe includes separate branching “worlds”, and over time this means we have more worlds now than 1 second ago, and since the worlds can interact with each other, how does this not violate conservation of mass and energy?

Comment author: YVLIAZ 04 September 2012 09:04:04PM 1 point [-]

I'm going out on a limb on this one, but since the whole universe includes separate branching “worlds”, and over time this means we have more worlds now than 1 second ago, and since the worlds can interact with each other, how does this not violate conservation of mass and energy?

The "number" of worlds increases, but each world is weighted by a complex number, such that when you add up the squared magnitudes of all the complex numbers they sum up to 1. This effectively preserves mass and energy across all worlds, inside the universal wave function.
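
A toy sketch of that bookkeeping, assuming each branching multiplies a world's amplitude by coefficients whose squared magnitudes sum to 1 (the helper names and the particular coefficients are made up for illustration):

```python
import math


def total_weight(amplitudes):
    """Sum of squared magnitudes |a|^2 over all branches."""
    return sum(abs(a) ** 2 for a in amplitudes)


def split_branch(amplitudes, index, coefficients):
    """Replace one branch by several sub-branches.

    The coefficients must themselves be normalized, i.e. their squared
    magnitudes must sum to 1, as they would under unitary evolution.
    """
    if not math.isclose(total_weight(coefficients), 1.0):
        raise ValueError("split coefficients must be normalized")
    parent = amplitudes[index]
    children = [parent * c for c in coefficients]
    return amplitudes[:index] + children + amplitudes[index + 1:]


# Start with a single world, then branch twice.
worlds = [1 + 0j]
worlds = split_branch(worlds, 0, [1 / math.sqrt(2), 1j / math.sqrt(2)])
worlds = split_branch(worlds, 1, [0.6, 0.8j])
```

After the two splits there are three worlds instead of one, but `total_weight(worlds)` is still 1 up to floating point error: the count of branches grows while the total weight is conserved.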

Comment author: Mat 03 June 2012 05:42:23PM *  2 points [-]

I disagree on five points. The first is my conclusion too; the second leads to the third and the third explains the fourth. The fifth one is the most interesting.

1) In contrast with the title, you did not show that the MWI is falsifiable nor testable; I know the title mentions decoherence (which is falsifiable and testable), but decoherence is very different from the MWI and for the rest of the article you talked about the MWI, though calling it decoherence. You just showed that MWI is "better" according to your "goodness" index, but that index is not so good. Also, the MWI is not at all a consequence of the superposition principle: it is rather an ad-hoc hypothesis made to "explain" why we don't experience a macroscopic superposition, even though we would expect one because macroscopic objects are made of microscopic ones. But, as I will mention in the last point, the superposition of macroscopic objects is not an inevitable consequence of the superposition principle applied to microscopic objects.

2) You say that postulating a new object is better than postulating a new law: so why teach Galileo's relativity by postulating its transformations, while they could be derived as a special case of Lorentz transformations for slow speeds? The answer is because they are just models, which gotta be easy enough for us to understand them: in order to well understand relativity you first have to understand non-relativistic mechanics, and you can only do it observing and measuring slow objects and then making the simplest theory which describes that (i.e., postulating the shortest mathematical rules experimentally compatible with the "slow" experiences: Galileo's); THEN you can proceed in something more difficult and more accurate, postulating new rules to get a refined theory. You calculate the probability of a theory and use this as an index of the "truthness" of it, but that's confusing the reality with the model of it. You can't measure how a theory is "true", maybe there is no "Ultimate True Theory": you can just measure how a theory is effective and clean in describing the reality and being understood. So, in order to index how good a theory is, you should instead calculate the probability that a person understands that theory and uses it to correctly make anticipations about reality: that means P(Galileo) >> P(first Lorentz, then show Galileo as a special case); and also P(first Galileo, after Lorentz) != P(first Lorentz, after Galileo), because you can't expect people to be perfect rationalists: they can be just as rational as possible. The model is just an approximation of the reality, so you can't force the reality of people to be the "perfect rational person" model, you gotta take in account that nobody's perfect.

3) Because nobody's perfect, you must take in account the needed RAM too. You said in the previous post that "Occam's Razor was raised as an objection to the suggestion that nebulae were actually distant galaxies—it seemed to vastly multiply the number of entities in the universe", in order to justify that the RAM account is irrelevant. But that argument is not valid: we rejected the hypothesis that nebulae are not distant galaxies not because the Occam's Razor is irrelevant, but because we measured their distance and found that they are inside our galaxy; without this information, the simpler hypothesis would be that they are distant galaxies. The Occam's Razor IS relevant not only about the laws, but about the objects too. Yes, given a limited amount of information, it could shift toward a "simpler yet wrong model", but it doesn't annihilate the probability of the "right" model: with new information you would find out that you were previously wrong. But how often does the Occam's Razor induce you to neglect a good model, as opposed to how often it let us neglect bad models? Also, Occam's Razor may mislead you not only when applied to objects, but when applied to laws too, so your argument discriminating Occam's Razor applicability doesn't stand.

4) The collapse of the wave function is a way to represent a fact: if a microscopic system S is in an eigenstate of some observable A and you measure on S an observable B which is non commuting with A, your apparatus doesn't end up in a superposition of states but gives you a unique result, and the system S ends up in the eigenstate of B corresponding to the result the apparatus gave you. That's the fact. As the classical behavior of macroscopic objects and the stochastic irreversible collapse seems in contradiction with the linearity, predictability and reversibility of the Schrödinger equation ruling the microscopic systems, it appears as if there's an uncomfortable demarcation line between microscopic and macroscopic physics. So, attempts have been made in order to either find this demarcation line, or show a mechanism for the emergence of the classical behavior from the quantum mechanics, or solve or formalize this problem however. The Copenhagen interpretation (CI) just says: "there are classical behaving macroscopic objects, and quantum behaving microscopic ones, the interaction of a microscopic object with a macroscopic apparatus causes the stochastic and irreversible collapse of the wave function, whose probabilities are given by the Born rule, now shut up and do the math"; it is a rather unsatisfactory answer, primarily because it doesn't explain what gives rise to this demarcation line and where should it be drawn; but indeed it is useful to represent effectively what are the results of the typical educational experiments, where the difference between "big" and "small" is in no way ambiguous, and allows you to familiarize fast with the bra-ket math. The Many Worlds Interpretation (MWI) just says: "there is indeed the superposition of states in the macroscopic scale too, but this is not seen because the other parts of the wave function stay in parallel invisible universes". 
Now imagine Einstein did not develop the General Relativity, but we anyway developed the tools to measure the precession of Mercury and we have to face the inconsistency with our predictions through Newton's Laws: the analogue of the CI would be "the orbit of Mercury is not the one anticipated by Newton's Laws but this other one, now if you want to calculate the transits of Mercury as seen from the Earth for the next million years you gotta do THIS math and shut up"; the analogue of the MWI would be something like "we expect the orbit of Mercury to precess at this rate X but we observe this rate Y; well, there is another parallel universe in which the precession rate of Mercury is Z such that the average between Y and Z is the expected X due to our beautiful indefeasible Newton's Law". Both are unsatisfactory and curiosity stoppers, but the first one avoids introducing new objects. The MWI, instead, while explaining exactly the same experimental results, introduces not only other universes: it also introduces the concept itself that there are other universes which proliferate at each electron's cough attack. And it does so just for the sake of human pursuit of beauty and loyalty to a (yes, beautiful, but that's not the point) theory.

5) You talk of MWI and of decoherence as if they were the same thing, but they are quite different. Decoherence is about the loss of coherence that a microscopic system (an electron, for instance) experiences when interacting with a macroscopic chaotic environment. As this sounds rather relevant to the demarcation line and interaction between microscopic and macroscopic, it has been suggested that maybe these are related phenomena, that is: maybe the classical behavior of macroscopic objects and the collapse of the wave function of a microscopic object interacting with a macroscopic apparatus are emergent phenomena, which arise from the microscopic quantum one through some interaction mechanism. Of course this is not an answer to the problem: it is just a road to be walked in order to find a mechanism, but we gotta find it. As you say, "emergence" without an underlying mechanism is like "magic". Anyway, decoherence has nothing to do with MWI, though both try (or pretend) to "explain" the (apparent?) collapse of the wave function. In the last decades decoherence has been probed and the results look promising. Though I'm not an expert in the field, I took a course about it last year and gave a seminar as the exam for the course, describing the results of an article I read (http://arxiv.org/abs/1107.2138v1). They presented a toy model of a Curie-Weiss apparatus (a magnet in a thermal bath), prepared in an initial isotropic metastable state, measuring the z-axis spin component of a 1/2 spin particle through induced symmetry breaking. Though I wasn't totally persuaded by the Hamiltonian they wrote and I'm sure there are better toy models, the general ideas behind it were quite convincing.
In particular, they computationally showed HOW the stochastic indeterministic collapse can emerge from just: a) Schrödinger's equation; b) statistical effects due to the "large size" of the apparatus (a magnet composed of a large number N of elementary magnets, coupled to a thermal bath); c) an appropriate initial state of the apparatus. They postulated neither new laws nor new objects: they just made a model of a measurement apparatus within the framework of quantum mechanics (without the postulation of the collapse) and showed how the collapse naturally arose from it. I think that's a pretty impressive result worthy of further research, more than the MWI. This explains the collapse without postulating it, nor postulating unseen worlds.

Comment author: YVLIAZ 04 September 2012 07:27:46AM *  1 point [-]

In contrast with the title, you did not show that the MWI is falsifiable nor testable.

I agree that he didn't show that it is testable, but rather the possibility of testing it (and the formalization of it).

You just showed that MWI is "better" according to your "goodness" index, but that index is not so good

There's a problem with choosing the language for Solomonoff/MML, so the index's goodness can be debated. However, I think in general the index is sound.

You calculate the probability of a theory and use this as an index of the "truthness" of it, but that's confusing the reality with the model of it.

I don't think he's saying that theories fundamentally have probabilities. Rather, as a Bayesian, he gives some prior to each theory. As more evidence accumulates, the probability of the right theory gets updated and approaches 1.
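
A minimal sketch of that updating, with two made-up theories about a coin (the names `theory_A` and `theory_B`, the equal priors, and the observation sequence are all assumptions for illustration, not anything from EY's post):

```python
def bayes_update(priors, likelihoods):
    """One step of Bayes' rule: posterior is proportional to prior * likelihood."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Each theory predicts a probability of heads.
heads_prob = {"theory_A": 0.9, "theory_B": 0.5}
names = list(heads_prob)

# Equal priors; the data below looks like what theory_A would produce.
posterior = [0.5, 0.5]
observations = "HHHHHHHHHT" * 5

for flip in observations:
    likelihoods = [
        heads_prob[name] if flip == "H" else 1.0 - heads_prob[name]
        for name in names
    ]
    posterior = bayes_update(posterior, likelihoods)
```

After these 50 flips almost all of the probability mass sits on `theory_A`, even though both theories started out equally likely.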

The reason human understanding can't be part of the equations is, as EY says, shorter "programs" are more likely to govern the universe than longer "programs," essentially because these "programs" are more likely to be written if you throw down some random bits to make a program that governs the universe.
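
The "random bits" picture can be sketched as follows, under the simplifying assumption that a program is a fixed bit string produced by fair coin flips (this ignores details such as prefix-free encodings, and the function names are hypothetical):

```python
from fractions import Fraction


def random_bits_prior(program_length_bits: int) -> Fraction:
    """Chance that fair coin flips reproduce one specific program of this length."""
    if program_length_bits < 0:
        raise ValueError("length must be non-negative")
    return Fraction(1, 2 ** program_length_bits)


def prior_odds(shorter_bits: int, longer_bits: int) -> Fraction:
    """How many times more likely the shorter program is than the longer one."""
    return random_bits_prior(shorter_bits) / random_bits_prior(longer_bits)


# Every extra bit of program length halves the prior.
odds_for_ten_extra_bits = prior_odds(100, 110)
```

So a theory that needs ten more bits to state starts out about a thousand times less likely, before any evidence is considered.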

So I don't buy your arguments in the next section.

But that argument is not valid: we rejected the hypothesis that nebulae are not distant galaxies not because the Occam's Razor is irrelevant, but because we measured their distance and found that they are inside our galaxy; without this information, the simpler hypothesis would be that they are distant galaxies.

EY is comparing the angel explanation with the galaxies explanation; you are supposed to reject the angels and usher in the galaxies. In that case, the anticipations are truly the same. You can't really prove whether there are angels.

But how often does the Occam's Razor induce you to neglect a good model, as opposed to how often it let us neglect bad models?

What do you mean by "good"? Which one is "better" out of 2 models that give the same prediction? (By "model" I assume you mean "theory")

but indeed it is useful to represent effectively what are the results of the typical educational experiments, where the difference between "big" and "small" is in no way ambiguous, and allows you to familiarize fast with the bra-ket math.

You admit that Copenhagen is unsatisfactory but it is useful for education. I don't see any reason not to teach MWI in the same vein.

Now imagine Einstein did not develop the General Relativity, but we anyway developed the tools to measure the precession of Mercury and we have to face the inconsistency with our predictions through Newton's Laws: the analogue of the CI would be "the orbit of Mercury is not the one anticipated by Newton's Laws but this other one, now if you want to calculate the transits of Mercury as seen from the Earth for the next million years you gotta do THIS math and shut up"; the analogue of the MWI would be something like "we expect the orbit of Mercury to precess at this rate X but we observe this rate Y; well, there is another parallel universe in which the precession rate of Mercury is Z such that the average between Y and Z is the expected X due to our beautiful indefeasible Newton's Law".

If indeed the expectation value of observable V of Mercury is X but we observe Y with Y ≠ X (that is to say that the variance of V is nonzero), then there isn't a determinate formula for predicting V exactly in your first Newton/random formula scenario. At the same time, someone who holds the Copenhagen interpretation would have the same expectation value X, but instead of saying there's another world he says there's a wave function collapse. I still think that the parallel world is a deduced result from the universal wave function, superposition, decoherence, etc. that Copenhagen also recognizes. So the Copenhagen view essentially says "actually, even though the equations say there's another world, there is none, and on top of that we are gonna tell you how this collapsing business works". This extra sentence is what causes the Razor to favor MWI.

Much of what you are arguing seems to stem from your dissatisfaction with the formalization of Occam's Razor. Do you still feel that we should favor something like human understanding of a theory over the probability of a theory being true based on its length?