If repetitions arise from sampling merely because of high conditional probability given an initial "misstep", they should be avoidable by an MCTS that seeks to maximize the unconditional probability of the output sequence (or rather, the probability conditional on its input but not on its own previously generated output). After entering the "trap" once or a few times, it would simply avoid the unfortunate misstep in subsequent "playouts". From my understanding, that is.
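To make that concrete, here is a toy sketch (the bigram "model" and its probabilities are invented, and exhaustive search stands in for the MCTS playouts): greedy decoding of conditional next-token probabilities walks straight into the repetition trap, whereas scoring whole sequences by their total log-probability, conditioned only on the start symbol, does not.

```python
import math

# Toy bigram "LM" with a repetition trap: the most probable *next* token after
# "<s>" is "b", and after "b" it is "b" again, yet the complete sequence
# "a c" has a higher total probability than any run of "b"s.
COND = {
    "<s>": {"a": 0.45, "b": 0.55},
    "a":   {"c": 1.0},
    "b":   {"b": 0.8, "c": 0.2},
    "c":   {"</s>": 1.0},
}

def greedy(max_len=5):
    """Decode by maximizing each conditional next-token probability."""
    seq, prev = [], "<s>"
    for _ in range(max_len):
        nxt = max(COND[prev], key=COND[prev].get)
        if nxt == "</s>":
            break
        seq.append(nxt)
        prev = nxt
    return seq

def best_total(prev="<s>", lp=0.0, seq=(), max_len=5):
    """Exhaustive search (standing in for MCTS playouts) for the sequence with
    the highest total log-probability, conditioned only on the start symbol."""
    if prev == "</s>" or len(seq) == max_len:
        return lp, list(seq)
    options = []
    for nxt, p in COND[prev].items():
        nxt_seq = seq if nxt == "</s>" else seq + (nxt,)
        options.append(best_total(nxt, lp + math.log(p), nxt_seq, max_len))
    return max(options, key=lambda o: o[0])

print("greedy (conditional)    :", greedy())          # ['b', 'b', 'b', 'b', 'b']
print("sequence-level objective:", best_total()[1])   # ['a', 'c']
```

An actual MCTS would estimate the same sequence-level objective with playouts rather than enumeration, but the point is the same: once a playout has revealed how cheap the repetitive branch is in total probability, the search stops taking the initial misstep.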
I kind of fell overboard when V popped up in the equations without being introduced first.
I went into the idea of evaluating on future state representations here: https://www.lesswrong.com/posts/5kurn5W62C5CpSWq6/avoiding-side-effects-in-complex-environments#bFLrwnpjq6wY3E39S (Not sure it is wise, though.)
It seems like the method is sensitive to the relative ranges of the game reward and the auxiliary penalty. In real life, I suppose one would have to clamp the "game" reward to let the impact penalty dominate even when massive gains are foreseen from a big-impact course of action?
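A minimal sketch of the clamping I have in mind (the combination rule, cap and weight are my own illustrative choices, not anything from the paper):

```python
def combined_reward(game_reward, impact_penalty,
                    reward_cap=1.0, penalty_weight=10.0):
    """Hypothetical combination rule: clamp the task ("game") reward so that a
    huge foreseen gain can never outweigh a large impact penalty.
    The rule, cap and weight are illustrative, not from the paper."""
    clamped = max(-reward_cap, min(reward_cap, game_reward))
    return clamped - penalty_weight * impact_penalty

# A "massive gain" from a big-impact plan still loses to a modest,
# low-impact alternative once the game reward is clamped:
print(combined_reward(1000.0, impact_penalty=5.0))   # -49.0
print(combined_reward(0.5,    impact_penalty=0.01))  #  0.4
```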
What if the encoding difference penalty were applied after a counterfactual rollout of no-ops following the candidate action (or the baseline no-op)? Couldn't that detect "butterfly effects" of small but impactful actions, avoiding "salami slicing" exploits?
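Roughly, as a sketch (env.copy_state, env.step, encode and the squared-difference divergence are all placeholder assumptions, not any particular library's API):

```python
def butterfly_penalty(env, state, action, noop, encode, horizon=10):
    """Hypothetical delayed-impact penalty: take the candidate action (or the
    baseline no-op), then roll the environment forward with no-ops only, and
    compare the encodings of the two futures at the horizon.
    env.copy_state / env.step / encode are assumed interfaces, not a real API."""
    def rollout(first_action):
        s = env.copy_state(state)
        s = env.step(s, first_action)
        for _ in range(horizon):
            s = env.step(s, noop)
        return encode(s)

    z_action, z_noop = rollout(action), rollout(noop)
    # A tiny action with large downstream ("butterfly") effects gets a large
    # penalty even if its immediate encoding difference is negligible.
    return sum((a - b) ** 2 for a, b in zip(z_action, z_noop))
```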
Building upon this thought, how about comparing mutated policies to a base policy by sampling possible futures, generating distributions of the encodings up to the farthest step, and penalizing divergence from the base policy?
Or just train a sampling policy by GD, using a Monte Carlo Tree Search that penalizes actions which alter the future encodings relative to those under a pure no-op policy.
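A sketch of what such a penalty term might look like (again, all the interfaces here – env.copy_state, env.step, policy, encode – are assumptions, as above): sample futures under the candidate policy and under a pure no-op baseline, collect the encodings at the final step, and penalize how far apart the two sample means end up.

```python
def policy_divergence_penalty(env, state, policy, noop, encode,
                              horizon=20, n_samples=32):
    """Hypothetical penalty for the policy-level version: sample futures under
    the candidate policy and under a pure no-op baseline, collect the
    encodings at the final step, and penalize the distance between the two
    sample means. All interfaces here are assumptions."""
    def final_encoding(act_fn):
        s = env.copy_state(state)
        for _ in range(horizon):
            s = env.step(s, act_fn(s))
        return encode(s)

    z_policy = [final_encoding(policy) for _ in range(n_samples)]
    z_noop = [final_encoding(lambda s: noop) for _ in range(n_samples)]

    dim = len(z_policy[0])
    mean = lambda zs, i: sum(z[i] for z in zs) / len(zs)
    # Comparing means is just the simplest thing that runs; a proper
    # distributional divergence (MMD, a Wasserstein estimate, ...) between the
    # two sets of encodings would be the natural refinement.
    return sum((mean(z_policy, i) - mean(z_noop, i)) ** 2 for i in range(dim))
```

Something like this could then serve as the per-node penalty inside the MCTS that the GD-trained sampling policy is distilled from.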
I must admit that I did not understand everything in the paper, but I think this excerpt summarizes a crucial point:
"The key issue here is proper conditioning. The unbiasedness of the value estimates V_i discussed in §1 is unbiasedness conditional on mu. In contrast, we might think of the revised estimates ^v_i as being unbiased conditional on V. At the time we optimize and make the decision, we know V but we do not know mu, so proper conditioning dictates that we work with distributions and estimates conditional on V."
The proposed "solution" converts n in...
The big problem arises when the number of choices is huge and sparsely explored, such as when optimizing a neural network.
But if we restrict ourselves to n superficially evaluated choices, each evaluation having known estimate variance and independent errors/noise, and if – as in realistic cases like Monte Carlo Tree Search – we are allowed to perform some additional "measurements" to narrow down the uncertainty, then it is wise to scrutinize the high-expectation choices most – in a way, trying to "falsify" their greatness while increasing the certainty ...
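A quick simulation of that intuition (the distributions and constants below are mine, purely for illustration): the naive argmax of the noisy estimates V_i is optimistically biased about its own pick, and spending the extra measurements on the high-expectation candidates both shrinks that optimism and improves the true value of whatever ends up being chosen.

```python
import random

random.seed(0)

def trial(n_choices=200, noise_sd=1.0, n_top=20, extra_per_choice=9):
    """One round: true values mu_i ~ N(0,1); one noisy estimate V_i each.
    Then re-measure the n_top apparently-best choices several times and
    average before committing."""
    mu = [random.gauss(0, 1) for _ in range(n_choices)]
    V = [m + random.gauss(0, noise_sd) for m in mu]

    # Naive pick: argmax of the first-round estimates.
    i_naive = max(range(n_choices), key=lambda i: V[i])

    # Scrutinize the top candidates: extra measurements shrink their noise
    # and tend to "falsify" the ones that only looked great by luck.
    top = sorted(range(n_choices), key=lambda i: V[i], reverse=True)[:n_top]
    refined = list(V)
    for i in top:
        extra = [mu[i] + random.gauss(0, noise_sd) for _ in range(extra_per_choice)]
        refined[i] = (V[i] + sum(extra)) / (1 + extra_per_choice)
    i_refined = max(range(n_choices), key=lambda i: refined[i])

    return (V[i_naive] - mu[i_naive],            # optimism of the naive winner
            refined[i_refined] - mu[i_refined],  # optimism after scrutiny
            mu[i_naive], mu[i_refined])          # true quality of each pick

runs = [trial() for _ in range(500)]
avg = lambda xs: sum(xs) / len(xs)
print("naive pick     : estimate - truth =", round(avg([r[0] for r in runs]), 2),
      "| true value =", round(avg([r[2] for r in runs]), 2))
print("after scrutiny : estimate - truth =", round(avg([r[1] for r in runs]), 2),
      "| true value =", round(avg([r[3] for r in runs]), 2))
```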
I think "Metaculus" is a pun on "meta", "calculus" and "meticulous".
"WW3" and "28 years passing" are similarly dangerous "events" for the individual gambler. Why invest with a long-term perspective if there is a significant probability that you eventually cannot harvest... Crucially, the probability of not harvesting the reward may be a lot higher in a "force majeure" situation like WW3, even if one stays alive. But on the other hand, an early WW3 would chop off a lot of the individual existential risk associated with 28 years passing. 🤔 I think there cou...
But among the people who care and who have a favorite before the campaigning period, would not some change their minds if they saw their favorite being intellectually humiliated on TV? (For the first time ever, that is.)
Ehm.. Huh? I would say that:
Conditional on being in a billion-human universe, your probability of having an index between 1 and 1 billion is 1, and your probability of having any other index is 0....
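Or in symbols (writing N for the number of humans in the universe you are in, and i for your index; the notation is mine):

$$P(1 \le i \le 10^9 \mid N = 10^9) = 1, \qquad P(i > 10^9 \mid N = 10^9) = 0.$$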