Regarding image classification performance, it seems worth noting that ImageNet was labeled by human labelers (and IIRC there was a paper showing that the labels are ambiguous or wrong for a substantial minority of the images).
As such, I don't think we can conclude too much about superhuman AI performance on image recognition from ImageNet alone (as perfect performance on the benchmark corresponds to perfectly replicating human judgement, admittedly aggregated over multiple humans). To demonstrate superhuman performance, a dataset with known ground truth where humans struggle to correctly label images would seem more appropriate.
A lot of the omissions you mention are due to inconsistent benchmarks (like the switch from the full Atari suite to Atari 100k with fewer and easier games) and my trying to keep results comparable.
This particular plot only has each year's SOTA, as it would get too crowded with a higher temporal resolution (I used it for the comment, as it was the only one including smaller-sample results on Atari 100k and related benchmarks). I agree that it is not optimal for eyeballing trends.
I also agree that temporal trends can be problematic as people did ...
I guess I should update my paper on trends in sample efficiency soon / check whether recent developments are on trend (please message me if you are interested in doing this). This improvement does not seem to be extremely off-trend, but is definitely a bit more than I would have expected this year. Also, note that this result does NOT use the full suite of Atari games, but rather a subset of easier ones.
Your point b) seems like it should also make you somewhat sceptical of any of this accelerating AI capabilities, unless you believe that capabilities-focused actors would change their actions based on forecasts, while safety-focused actors wouldn't. Obviously, this is a matter of degree, and it could be the case that the same amount of action-changing by both actors still leads to worse outcomes.
I think that if OpenAI unveiled GPT-4 and it did not perform noticeably better than GPT-3 despite having a lot more parameters, that would be a somewhat important update. And it seems like a similar kind of update could be produced by well-conducted research on scaling laws for complexity.
Most recent large safety projects seem to be focused on language models. So in case the evidence pointed towards problem complexity not mattering that much, I would expect the shift in prioritization towards more RL-safety research to outweigh the effect on capability improvements (especially for the small version of the project, about which larger actors might not care that much). I am also sceptical whether the capabilities of the safety community are in fact increasing exponentially.
I am also confused about the resources/reputation framing. To me this i...
Thank you!
I agree that switching the simulator could be useful where feasible (you'd need another simulator with compatible state and action spaces and somewhat similar dynamics).
It indeed seems pretty plausible that instructions will be given in natural language in the future. However, I am not sure that would affect scaling very much, so I'd focus scaling experiments on the simpler case without NLP for which learning has already been shown to work.
IIRC, transformers can be quite difficult to get to work in an RL setting. Perhaps this is different for PIO, but I cannot find any statements about this in the paper you link.
Oh, right; I seem to have confused Gibbard–Satterthwaite with Arrow.
Do you know whether there are other extensions of Arrow's theorem to single-winner elections? Having a voting method return a full ranking of alternatives does not appear to be super important in practice...
"If the prices do not converge, then they must oscillate infinitely around some point. A trader could exploit the logical inductor by buying the sentence at a high point on the oscillation and selling at a low one."
I know that this is an informal summary, but I don't find this point intuitively convincing. Wouldn't the trader also need to be able to predict the oscillation?
If I understood correctly, the model was trained in Chinese and probably quite expensive to train.
Do you know whether these Chinese models usually get "translated" to English, or whether there is a "fair" way of comparing models that were (mainly) trained on different languages (I'd imagine that even the tokenization might be quite different for Chinese)?
"This desiderata is often difficult to reconcile with clear scoring, since complexity in forecasts generally requires complexity in scoring."
Can you elaborate on this? In some sense, log-scoring is simple and can be applied to very complex distributions; are you saying that this would still be "complex scoring" because the complex forecast needs to be evaluated, or is your point about something different?
Partial resolution could also help with getting some partial signal on long-term forecasts.
In particular, if we know that a forecasting target is growing monotonically over time (like "date at which X happens" or "cumulative number of X before a specified date"), we can split P(outcome=T) into P(outcome>lower bound)*P(outcome=T|outcome>lower bound). If we use log scoring, we then get log(P(outcome>lower bound)) as an upper bound on the score.
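To spell out the step (my notation: T is the eventually observed value and ℓ is the lower bound that has already resolved as true):

```latex
\log P(\mathrm{outcome} = T)
  = \log P(\mathrm{outcome} > \ell) + \log P(\mathrm{outcome} = T \mid \mathrm{outcome} > \ell)
  \leq \log P(\mathrm{outcome} > \ell)
```

Since the conditional probability is at most 1, its logarithm is at most 0, so the partially resolved factor already upper-bounds the final log score.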
If forecasts came in the form of more detailed models, it should be possible to use a similar approach to calculate bounds based on conditioning on more complicated events as well.
I don't know what performance measure is used to select superforecasters, but updating frequently seems to usually improve your accuracy score on GJopen as well (see "Activity Loading" in this thread on the EA forum).
Makes sense, I was thinking about rewards as a function of the next state rather than the current one.
I can still imagine that things would work if we replaced the difference in Q-values by the difference in the values of the autoencoded next state. If that were true, this would a) affect my interpretation of the results and b) potentially make it easier to answer your open questions by providing a simplified version of the problem.
Edit: I guess the "Chaos unfolds over time" property of the SafeLife environment makes it unlikely that this would work?
"So if you wouldn’t sacrifice >0.01AUC for the sake of what a human thinks is the “reasonable” explanation to a problem, in the above thought experiment, then why sacrifice unknown amounts of lost accuracy for the sake of explainability?"
You could think of explainability as some form of regularization to reduce overfitting (to the test set).
"Overall, access to the AI strongly improved the subjects' accuracy from below 50% to around 70%, which was further boosted to a value slightly below the AI's accuracy of 75% when users also saw explanations. "
But this seems to be a function of the AI system's actual performance, the human's expectations of said performance, as well as the human's baseline performance. So I'd expect it to vary a lot between tasks and with different systems.
"My own guess is that humans are capable of surviving far more severe climate shifts than those projected in nuclear winter scenarios. Humans are more robust than most any other mammal to drastic changes in temperature, as evidenced by our global range, even in pre-historic times"
I think it is worth noting that the speed of climate shifts might play an important role, as a lot of human adaptability seems to rely on gradual cultural evolution. While modern information technology has greatly sped up the potential for cultural evolution, I am unsure if these speedups are robust to a full-scale nuclear war.
"The Nash solution differs significantly from the other solutions considered so far. [...]
2. This is the first proposal where the additive constants matter. Indeed, now the multiplicative constants are the ones that don't matter!"
In what sense do additive constants matter here? Aren't they neutralized by the subtraction?
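To make the intuition behind the question explicit (my notation, assuming the standard Nash product over feasible utility pairs u with disagreement point d):

```latex
\max_{u \in F} \; (u_1 - d_1)(u_2 - d_2)
```

If an additive constant c_i is applied both to player i's utilities and to their disagreement utility, it cancels: (u_i + c_i) - (d_i + c_i) = u_i - d_i; that is the sense in which I'd expect the subtraction to neutralize additive shifts.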
"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."
Interesting. Do you have some examples of types of questions you expect to be safe, or potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about the instrumental goals that the questions induce for debaters?
I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.
I am a bit confused about the section on the Markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observabi...
Nice post!
I would like to highlight that a naive application of the expected-value perspective could lead to problems like the unilateralist's curse, and I think the post would be even more useful for readers who are new to these kinds of considerations if it discussed that more explicitly (or prominently linked to relevant other posts).
"If, at some point in the future, we have the same number of contagious people, and are not at an appreciable fraction of group immunity, it will at that point again be a solid decision to go into quarantine (or to extend it). "
I think that for many people, the number of infections at which this becomes a good idea has increased, as we now have more accurate information about the CFR and about how quickly realistic countermeasures can slow down an outbreak in a given area; this should decrease credence in some of the worst-case scenarios many were worried about a few months ago.
Nitpick: I am pretty sure non-zero-sum does not imply a convex Pareto front.
Instead of the lens of negotiation position, one could argue that mistake theorists believe that the Pareto boundary is convex (which implies that usually maximizing surplus is more important than deciding allocation), while conflict theorists see it as concave (which implies that allocation is the more important factor).
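As a toy illustration of the nitpick (my own made-up example): take a battle-of-the-sexes-style game whose two coordination outcomes pay

```latex
(3, 1) \quad \text{and} \quad (1, 3),
```

with (0, 0) on mismatches. The game is non-zero-sum, but without shared randomization the players cannot reach the point (2, 2) on the segment between the two outcomes (independent mixing only yields payoff pairs below that segment), so the Pareto front is not convex.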
"Twitter: CV kills via cardiac failure, not pulmonary" links to the aggregate spreadsheet, not the Twitter source.
Even if the claim were usually true on longer time scales, I doubt that pointing out an organisation's mistakes and not entirely truthful statements usually increases trust in it on the short time scales that might be most important here. Reforming organisations and rebuilding trust usually takes time.
How do
"One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of rollouts: rollouts for any policy "
and
"That's the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as would produce anything different from ...
For all auxiliary rewards. Edited the original comment.
I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.
Edit: Randomization does not seem to help, as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).
I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state where
for all auxiliary rewards, where is the optimal policy according to the main reward, while making sure that there exists an action such that
for every. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the ...
"Not quite... " are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.
I am not sure I understand: In my mind, "commitments to balance out the original agent's attainable utility" essentially refers to the second agent being penalized by the first agent's penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to "SA will just precommit to undermine or help A, depending on the cir...
I was thinking about normalisation as linearly rescaling every reward to $[0,1]$ when I wrote the comment. Then, one can always look at the convex combinations $p\,R_1 + (1-p)\,R_2$ weighted by the beliefs, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $p\,\lambda_1 R_1 + (1-p)\,\lambda_2 R_2$ is the same as maximizing $q\,R_1 + (1-q)\,R_2$ with $q = \frac{p \lambda_1}{p \lambda_1 + (1-p) \lambda_2}$.
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the "tipping point" in beliefs, where the opposite extreme policy is suddenly favoured).
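A minimal sketch of this picture (toy frontiers of my own choosing, not from the post): maximizing the belief-weighted value p·R1 + (1−p)·R2 over a strictly convex frontier moves smoothly with the belief p, while over a concave frontier the optimum jumps between the extremes at a tipping point.

```python
import numpy as np

r1 = np.linspace(0.0, 1.0, 1001)
convex_r2 = np.sqrt(1.0 - r1**2)                 # frontier bulging outward (convex feasible set)
concave_r2 = 1.0 - np.sqrt(1.0 - (1.0 - r1)**2)  # frontier bulging inward

for p in (0.3, 0.45, 0.55, 0.7):
    i_cvx = np.argmax(p * r1 + (1 - p) * convex_r2)
    i_ccv = np.argmax(p * r1 + (1 - p) * concave_r2)
    print(f"p={p}: convex optimum r1={r1[i_cvx]:.2f}, concave optimum r1={r1[i_ccv]:.2f}")

# Convex frontier: the optimum shifts smoothly as p changes.
# Concave frontier: the optimum jumps from r1=0 to r1=1 at the tipping point p=0.5.
```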
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually...
But no matter how I take the default outcome, your second example is always "more positive sum" than the first, because 0.5 + 0.7 + 2x < 1.5 - 0.1 + 2x, i.e., 1.2 + 2x < 1.4 + 2x.
Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to "more negative sum", but this still seems to point to the sum-condition not being the central concept here. To me, it seems like "negative min" compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
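To make the comparison explicit (my notation: Δu_i is player i's utility change relative to the default outcome):

```latex
\text{negative sum:} \quad \sum_i \Delta u_i < 0
\qquad\qquad
\text{negative min:} \quad \min_i \Delta u_i < 0
```

A negative sum forces at least one Δu_i to be negative, so negative sum implies negative min; the min-condition can also hold when the sum is positive, which is the sense in which it is weaker.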
Or am I completely misunderstanding your examples or your point?
To clear up some more confusion: The sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems rather to be that the best states for one of the (Edit: the expected) rewards are bad for the other?
That again seems like it would often follow from resource constraints.
Right. I think my intuition about negative-sum interactions under resource constraints combined the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.
Thank you for alleviating my confusion.
I think the actual solution is somewhere in between: If we assume calibrated uncertainty, ignore generalization, and assume we can perfectly fit the training data, the total cost should be reduced by (1 - the probability assigned to the predicted class) * the cost of misclassifying the not-predicted (minority) class as the predicted one (majority): If our classifier already predicted the right class, nothing happens, but otherwise we change our prediction to the other class and reduce the total cost. While this does not depend on the decision threshold, ...
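A minimal sketch of the implied decision rule (hypothetical cost numbers; assumes the predicted probabilities are calibrated): pick the class with the lowest expected misclassification cost instead of the most probable class.

```python
import numpy as np

# cost[i, j]: cost of predicting class j when the true class is i.
# Missing the minority class (1) is assumed to be 5x as costly.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def min_expected_cost_class(p):
    """p: calibrated class probabilities, shape (n_classes,)."""
    expected_cost = p @ cost  # expected cost of each candidate prediction
    return int(np.argmin(expected_cost))

p = np.array([0.8, 0.2])
print(min_expected_cost_class(p))  # prints 1: argmax(p) is 0, but expected cost 0.8 < 1.0
```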