Issues with the Dutch book beyond the marginal value of money:
Impact: these issues increase LLM/IQ and (LessWrong/LLM relative to LessWrong/$), which cause errors in the same direction around the LLM/IQ/$/LessWrong/LLM cycle, potentially by a very large multiplier.
Diminishing marginal value, due to the high IQ gain of 5 points, lowers $/IQ, which (equivalently) increases IQ/$. This also acts in the same direction.
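To make the multiplier concrete, here is a toy illustration (the per-leg bias factor is made up, not taken from the survey): if each leg of the cycle is off by a modest factor in the same direction, the inconsistency compounds multiplicatively around the loop.

```python
# Toy illustration: same-direction biases compound around a cycle of exchange rates.
# The 1.5x per-leg bias is an arbitrary example value, not a survey estimate.
per_leg_bias = 1.5
legs = 4  # LLM -> IQ -> $ -> LessWrong -> LLM
cycle_inconsistency = per_leg_bias ** legs
print(cycle_inconsistency)  # 5.0625: a ~5x apparent inconsistency from modest per-leg errors
```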
(That's my excuse anyway. I suspected the cycle when answering and was fairly confident, without actually checking, that I was going to be way off from a "consistent" value. I noted in a comment in the survey itself that I was being hasty, but on reflection I still endorse an "inconsistent" result here, modulo the fact that I likely misread at least one question.)
Control theory, I think, often tends to assume that you are dealing with continuous variables, which I suspect the relevant properties of AIs are (in practice) not: even if the underlying implementation uses continuous math, RSI will make finite changes, and even small changes could cause large differences in results.
Also, the dynamics here are likely to depend on capability thresholds, which could make trend extrapolation highly misleading.
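A toy sketch of that worry (all numbers invented purely for illustration): fit a trend to pre-threshold data and the extrapolation can badly underestimate what happens once a threshold capability kicks in.

```python
# Toy model: capability grows linearly until a threshold, after which the growth
# rate itself starts compounding. All numbers are invented for illustration only.
def capability(t, threshold=10.0):
    c, growth = 1.0, 0.5
    for _ in range(t):
        c += growth
        if c >= threshold:      # past the threshold (e.g. "can improve itself")...
            growth *= 1.5       # ...the growth rate starts increasing
    return c

pre = [capability(t) for t in range(1, 19)]      # pre-threshold observations
slope = (pre[-1] - pre[0]) / (len(pre) - 1)      # naive linear trend
naive_forecast = pre[-1] + slope * 12            # extrapolate 12 more steps
print(naive_forecast, capability(30))            # ~16 vs ~200: the trend badly misleads
```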
Also, note that RSI could create a feedback loop which could enhance agency, including towards non-aligned goals (an agentic AI convergently wants to enhance its own agency).
Also beware that agency increases may cause increases in apparent capability because of Agency Overhang.
> The AI system accepts all previous feedback, but it may or may not trust anticipated future feedback. In particular, it should be trained not to trust feedback it would get by manipulating humans (so that it doesn't see itself as having an incentive to manipulate humans to give specific sorts of feedback).
>
> I will call this property of feedback "legitimacy". The AI has a notion of when feedback is legitimate, and it needs to work to keep feedback legitimate (by not manipulating the human).
Legitimacy is good - but if an AI that's supposed to be intent-aligned to the user finds that it has an "incentive" to purposefully manipulate the user in order to get particular feedback (unless it pretends that it would ignore that feedback), it's already misaligned, and that misalignment should be dealt with directly IMO. This feels to me like a band-aid over a much more serious problem.
> We may thus rule out negative effects larger than 0.14 standard deviations in cognitive ability if fluoride is increased by 1 milligram/liter (the level often considered when artificially fluoridating the water).
That's a high level of hypothetical harm that they are ruling out (0.14 standard deviations × 15 ≈ 2 IQ points). I would take the dental harms many times over to avoid that much loss of cognitive ability.
> actually, there are ~100 rows in the dataset where Room2=4, Room6=8, and Room3=5=7.
I actually did look at that (at least some subset with that property) at some point, though I didn't (think of/ get around to) re-looking at it with my later understanding.
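(For reference, assuming that notation means Room2 matches Room4, Room6 matches Room8, and Rooms 3, 5, and 7 all match, and assuming hypothetical Room1-Room9 column names and file name, pulling that subset would look something like this:)

```python
import pandas as pd

# Hypothetical sketch: select the "symmetric" rows, reading "Room2=4, Room6=8,
# Room3=5=7" as Room2==Room4, Room6==Room8, Room3==Room5==Room7.
# Column names and the file name are assumptions, not taken from the actual dataset.
df = pd.read_csv("dungeon_data.csv")
symmetric = df[
    (df["Room2"] == df["Room4"])
    & (df["Room6"] == df["Room8"])
    & (df["Room3"] == df["Room5"])
    & (df["Room5"] == df["Room7"])
]
print(len(symmetric))  # should be ~100 if that reading is right
```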
In general, I think this is a realistic thing to occur: 'other intelligent people optimizing around this data' is one of the things that causes the most complicated effects in real-world data as well.
Indeed, I am not complaining! It was a good, fair difficulty to deal with.
That being said, there was one aspect I did feel was probably more complicated than ideal: the combination of the tier-dependent alerting with the tiers having no relevance other than this one aspect. That is, if the alerting had in each case depended simply on whether the adventurers were coming from an empty room or not, it would have been a lot simpler to work out. And if there were tier-dependent alerting, but the tiers were more obvious in other ways*, it would still have been tricky, but at least there would be a path to recognize the tiers and then try to figure out other ways they might be relevant. As it was, it seemed to me you pretty much had to look at what were (ex ante) almost arbitrary combinations of (current encounter, next encounter) to figure that aspect out, unless you actually guessed the rationale of the alerting effect.
That might be me rationalizing my failure to figure it out though!
* e.g. perhaps the traps/golems could have had the same score as the same-tier nontrap encounter when alerted (or alternatively when not alerted)
The biggest problem with AIXI, in my view, is the reward system - it cares about the future directly, whereas to have any reasonable hope of alignment an AI needs to care about the future only via what humans would want about the future (so that any reference to the future is encapsulated in the "what do humans want?" aspect).
I.e. the question it needs to be answering is something like "all things considered (including the consequences of my current action on the future, as well as taking into account my possible future actions) what would humans, as they exist now, want me to do at the present moment?"
Now maybe you can take that question and try to slice it up into rewards at particular timesteps, which change over time as what is known about what humans want changes, without introducing corrigibility issues - but even if that works, the AIXI reward framework isn't really buying you anything IMO, relative to directly trying to get an AI to solve the question.
On the other hand, approximating Solomonoff induction might, afaik, be a fruitful approach, though the approximations are going to have to be very aggressive for practical performance. I do agree embedding/self-reference can probably be patched in.
I think that it's likely to take longer than 10000 years, simply because of the logistics (not the technology development, which the AI could do fast).
The gravitational binding energy of the Sun is something on the order of 20 million years' worth of its energy output. OK, half of the needed energy is already present as thermal energy, and you don't need to move every atom to infinity, but you still need a substantial fraction of that. And while you could perhaps generate many times more energy than the solar output by various means, I'd guess you'd have to deal with inefficiencies and lots of waste heat if you try to do it really fast. Maybe if you're smart enough you can make going fast work well enough to be worth it, though?
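A rough back-of-the-envelope check of that 20-million-year figure, using the uniform-density-sphere approximation (the real Sun is centrally concentrated, so the true binding energy is a few times larger):

```python
# Back-of-the-envelope: the Sun's gravitational binding energy vs. its luminosity,
# using the uniform-density-sphere formula U = 3GM^2 / (5R).
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_sun = 1.989e30     # solar mass, kg
R_sun = 6.957e8      # solar radius, m
L_sun = 3.828e26     # solar luminosity, W

U = 3 * G * M_sun**2 / (5 * R_sun)       # ~2.3e41 J
years = U / L_sun / 3.156e7              # 3.156e7 seconds per year
print(f"{U:.2e} J ~= {years / 1e6:.0f} million years of solar output")  # ~19 million years
```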
I feel like a big part of what tripped me up here was an inevitable part of the difficulty of the scenario that in retrospect should have been obvious. Specifically, if there is any variation in the difficulty of an encounter that is known to the adventurers in advance, the score contribution of an encounter type on the paths actually taken is less than the difficulty of the encounter as estimated from what best predicts the path taken (because the adventurers take the path when the encounter is weak, but avoid it when it's strong).
So I wound up with an epicycle saying hags and orcs were avoided more than their actual scores warranted, because that effect was most significant for them (goblins are chosen over most other encounters even if alerted, and dragons mostly aren't alerted).
This effect was made much worse by the fact that I was getting scores mainly from lower-difficulty dungeons, with lots of "Nothing" rooms and low-level encounters. But even once I estimated scores from the overall data with my best guesses for preference order, the issue still applied, just not quite so badly.
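A toy simulation of that selection effect (the numbers are invented, not from the scenario): when the adventurers can see an encounter's strength and pick the weaker of two options, the average score of the encounters actually fought understates the encounter type's true average difficulty.

```python
import random

# Toy model of the selection effect: the adventurers see two candidate encounters of
# the same type and take the weaker one. Numbers are invented for illustration only.
random.seed(0)
true_mean = 10.0   # the encounter type's true average difficulty / score contribution
spread = 4.0       # visible variation the adventurers can react to

taken = []
for _ in range(100_000):
    option_a = random.uniform(true_mean - spread, true_mean + spread)
    option_b = random.uniform(true_mean - spread, true_mean + spread)
    taken.append(min(option_a, option_b))   # they pick the weaker encounter

print(sum(taken) / len(taken))   # ~8.7: noticeably below the true mean of 10
```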
In the "what if" department, I had said:
> I'm also getting remarkably higher numbers for Hag compared with my earlier method. But I don't immediately see a way to profitably exploit this.
The most obvious way to exploit this would have been the optimal solution. Why didn't I do it? The answer is that, as indicated above, I was still underestimating the hag (whereas at this point I had mostly-accurate scores for the traps and orcs). With my underestimate for the hag's score contribution, I didn't think it was worth giving up an orc-boulder trap difference to get a hag-orc difference. I also didn't realize I needed the hag to alert the dragon.
In general, I feel like I was pretty far along with discovering the mechanics despite some missteps. I correctly had the adventurers taking a 5-encounter path with right/down steps, the choice of next step being based on the encounters in the candidate next rooms, with an alerting mechanism, and the alerting mechanism not applying to traps and golems.
On the other hand, I applied the alerting mechanism only to score and not to preference order, except for goblins and orcs (why didn't I try to apply it to the preference order for other encounters once I realized it applied to the preference order for goblins and orcs, and that some degree of alerting score effect applied to other encounters?????). (I also got confused into thinking that the effect on orc preference order only applied if the current encounter was also orcs.) I also didn't realize that the alerting mechanism had different sensitivity for different encounters, and I had my mistaken belief about the preference order being different from the expected score for some encounter types (hey, the text played up how unnerving the hag was, there was some plausibility there!).
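For concreteness, here is a schematic sketch of the basic path/alerting structure I had correct (from the paragraph before last). The preference ranks, scores, alerting bonus, and 3x3 grid layout are all placeholders or my assumptions, not the scenario's actual values, and it deliberately omits the preference-order effects discussed above.

```python
# Schematic sketch: a 3x3 grid of rooms, traversed from top-left to bottom-right by
# right/down steps (5 encounters), the next step chosen by preference over the
# candidate rooms' encounters, with an alerting score adjustment that doesn't apply
# to traps/golems. All numeric values are placeholders.
PREFERENCE = {"Nothing": 0, "Goblins": 1, "Boulder Trap": 2, "Golem": 3, "Orcs": 4, "Hag": 5, "Dragon": 6}
SCORE = {"Nothing": 0, "Goblins": 2, "Boulder Trap": 3, "Golem": 3, "Orcs": 4, "Hag": 5, "Dragon": 8}
NOT_ALERTABLE = {"Nothing", "Boulder Trap", "Golem"}

def simulate(grid):
    """grid[r][c] = encounter name; adventurers walk from (0, 0) to (2, 2)."""
    r = c = 0
    prev = grid[0][0]
    score = SCORE[prev]
    while (r, c) != (2, 2):
        options = []
        if c < 2:
            options.append((r, c + 1))
        if r < 2:
            options.append((r + 1, c))
        # take the most-preferred (lowest-ranked) candidate encounter
        nr, nc = min(options, key=lambda rc: PREFERENCE[grid[rc[0]][rc[1]]])
        enc = grid[nr][nc]
        alerted = prev != "Nothing" and enc not in NOT_ALERTABLE
        score += SCORE[enc] + (1 if alerted else 0)   # placeholder alerting bonus
        prev, r, c = enc, nr, nc
    return score
```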
I think if I had gotten to where I was in my last edit early on in the time frame for this scenario instead of near the end, and had posted it, and other people had read it and tried it out, collectively we would have had a good chance of solving the whole thing. I also would have been much more likely to get the optimal solution if I had paid more attention to what abstractapplic said, instead of only very briefly glancing over his comments after posting my very belated comment and going back to doing my own thing.
In my view it was a fun, challenging, and theoretically solvable scenario (even if not actually that close to being solved in practice), so I think it was quite good.
Also, when doing a study, please write down afterwards whether or not you used intention-to-treat analysis.
Example: I encountered a study that says post-meal glucose levels depend on the order in which different parts of the meal were consumed. But the study doesn't say whether every participant consumed the entire meal, and if not, how that was handled when processing the data. Without knowing whether everyone consumed everything, I don't know if the differences in blood glucose were caused by the change in order, or by some participants not consuming some of the more glucose-spiking meal components.
In that case, intention-to-treat (if used) makes the result of the study less interesting, since it provides another effect that might "explain away" the headline effect.
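A toy illustration of the worry (the effect sizes and dropout rate are invented, not from the study): if some participants assigned to the "carbs last" order simply never eat the carbs, an intention-to-treat comparison shows a glucose difference even when meal order itself has no effect at all.

```python
import random

# Toy model: meal order has NO true effect on the post-meal glucose spike here, but
# some participants in the "carbs last" arm never eat the carbs. Numbers are invented.
random.seed(0)
N = 10_000

def glucose_spike(ate_carbs):
    base = random.gauss(20, 5)                              # spike from the rest of the meal
    return base + (random.gauss(40, 10) if ate_carbs else 0)

carbs_first = [glucose_spike(ate_carbs=True) for _ in range(N)]
# In the "carbs last" arm, suppose 15% stop eating before the carb course.
carbs_last = [glucose_spike(ate_carbs=random.random() > 0.15) for _ in range(N)]

mean = lambda xs: sum(xs) / len(xs)
print(mean(carbs_first) - mean(carbs_last))   # sizable apparent "order effect" despite no true one
```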