Regarding image classification performance, it seems worth noting that ImageNet was labeled by human labelers (and IIRC there was a paper showing that the labels are ambiguous or wrong for a substantial minority of the images).
As such, I don't think we can conclude too much about superhuman AI performance on image recognition from ImageNet alone (as perfect performance on the benchmark corresponds to perfectly replicating human judgement, admittedly aggregated over multiple humans). To demonstrate superhuman performance, a dataset with known ground truth where humans struggle to correctly label images would seem more appropriate.
A lot of the omissions you mention are due to inconsistent benchmarks (like the switch from the full Atari suite to Atari 100k with fewer and easier games) and my trying to keep results comparable.
This particular plot only has each year's SOTA, as it would get too crowded with a higher temporal resolution (I used it for the comment, as it was the only one including smaller-sample results on Atari 100k and related benchmarks). I agree that it is not optimal for eyeballing trends.
I also agree that temporal trends can be problematic as people did ...
I guess I should update my paper on trends in sample efficiency soon / check whether recent developments are on trend (please message me if you are interested in doing this). This improvement does not seem to be extremely off-trend, but is definitely a bit more than I would have expected this year. Also, note that this result does NOT use the full suite of Atari games, but rather a subset of easier ones.
Your point b) seems like it should also make you somewhat sceptical of any of this accelerating AI capabilities, unless you believe that capabilities-focused actors would change their actions based on forecasts, while safety-focused actors wouldn't. Obviously, this is a matter of degree, and it could be the case that the same amount of action-changing by both actors still leads to worse outcomes.
I think that if OpenAI unveiled GPT-4 and it did not perform noticeably better than GPT-3 despite having a lot more parameters, that would be a somewhat important update. And it seems like a similar kind of update could be produced by well-conducted research on scaling laws for complexity.
Most recent large safety projects seem to be focused on language models. So in case the evidence pointed towards problem complexity not mattering that much, I would expect the shift in prioritization towards more RL-safety research to outweigh the effect on capability improvements (especially for the small version of the project, about which larger actors might not care that much). I am also sceptical whether the capabilities of the safety community are in fact increasing exponentially.
I am also confused about the resources/reputation framing. To me this i...
Thank you!
I agree that switching the simulator could be useful where feasible (you'd need another simulator with compatible state and action spaces and somewhat similar dynamics).
It indeed seems pretty plausible that instructions will be given in natural language in the future. However, I am not sure that would affect scaling very much, so I'd focus scaling experiments on the simpler case without NLP for which learning has already been shown to work.
IIRC, transformers can be quite difficult to get to work in an RL setting. Perhaps this is different for PIO, but I cannot find any statements about this in the paper you link.
Oh, right; I seem to have confused Gibbard–Satterthwaite with Arrow.
Do you know whether there are other extensions of Arrow's theorem to single-winner elections? Having a voting method return a full ranking of alternatives does not appear to be super important in practice...
"If the prices do not converge, then they must oscillate infinitely around some point. A trader could exploit the logical inductor by buying the sentence at a high point on the oscillation and selling at a low one."
I know that this is an informal summary, but I don't find this point intuitively convincing. Wouldn't the trader also need to be able to predict the oscillation?
If I understood correctly, the model was trained in Chinese and probably quite expensive to train.
Do you know whether these Chinese models usually get "translated" to English, or whether there is a "fair" way of comparing models that were (mainly) trained on different languages (I'd imagine that even the tokenization might be quite different for Chinese)?
"This desiderata is often difficult to reconcile with clear scoring, since complexity in forecasts generally requires complexity in scoring."
Can you elaborate on this? In some sense, log-scoring is simple and can be applied to very complex distributions; are you saying that this would still be "complex scoring" because the complex forecast needs to be evaluated, or is your point about something different?
Partial resolution could also help with getting some partial signal on long-term forecasts.
In particular, if we know that a forecasting target is growing monotonically over time (like "date at which X happens" or "cumulative number of X before a specified date"), we can split P(outcome=T) into P(outcome>lower bound)*P(outcome=T|outcome>lower bound). If we use log scoring, we then get log(P(outcome>lower bound)) as an upper bound on the score.
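To spell out the step (my notation: T is the eventually observed value and ℓ is the lower bound that has already resolved as true):

```latex
\log P(\mathrm{outcome} = T)
  = \log P(\mathrm{outcome} > \ell) + \log P(\mathrm{outcome} = T \mid \mathrm{outcome} > \ell)
  \leq \log P(\mathrm{outcome} > \ell)
```

Since the conditional probability is at most 1, its logarithm is at most 0, so the partially resolved factor already upper-bounds the final log score.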
If forecasts came in the form of more detailed models, it should be possible to use a similar approach to calculate bounds based on conditioning on more complicated events as well.
I don't know what performance measure is used to select superforecasters, but updating frequently seems to usually improve your accuracy score on GJopen as well (see "Activity Loading" in this thread on the EA forum).
Makes sense, I was thinking about rewards as a function of the next state rather than the current one.
I can still imagine that things would work if we replaced the difference in Q-values by the difference in the values of the autoencoded next state. If that were true, this would a) affect my interpretation of the results and b) potentially make it easier to answer your open questions by providing a simplified version of the problem.
Edit: I guess the "Chaos unfolds over time" property of the SafeLife environment makes it unlikely that this would work?
"So if you wouldn’t sacrifice >0.01AUC for the sake of what a human thinks is the “reasonable” explanation to a problem, in the above thought experiment, then why sacrifice unknown amounts of lost accuracy for the sake of explainability?"
You could think of explainability as some form of regularization to reduce overfitting (to the test set).
"Overall, access to the AI strongly improved the subjects' accuracy from below 50% to around 70%, which was further boosted to a value slightly below the AI's accuracy of 75% when users also saw explanations. "
But this seems to be a function of the AI system's actual performance, the human's expectations of said performance, as well as the human's baseline performance. So I'd expect it to vary a lot between tasks and with different systems.
"My own guess is that humans are capable of surviving far more severe climate shifts than those projected in nuclear winter scenarios. Humans are more robust than most any other mammal to drastic changes in temperature, as evidenced by our global range, even in pre-historic times"
I think it is worth noting that the speed of climate shifts might play an important role, as a lot of human adaptability seems to rely on gradual cultural evolution. While modern information technology has greatly sped up the potential for cultural evolution, I am unsure if these speedups are robust to a full-scale nuclear war.
"The Nash solution differs significantly from the other solutions considered so far. [...]
2. This is the first proposal where the additive constants matter. Indeed, now the multiplicative constants are the ones that don't matter!"
In what sense do additive constants matter here? Aren't they neutralized by the subtraction?
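To make the intuition behind the question explicit (my notation, assuming the standard Nash product over feasible utility pairs u with disagreement point d):

```latex
\max_{u \in F} \; (u_1 - d_1)(u_2 - d_2)
```

If an additive constant c_i is applied both to player i's utilities and to their disagreement utility, it cancels: (u_i + c_i) - (d_i + c_i) = u_i - d_i; that is the sense in which I'd expect the subtraction to neutralize additive shifts.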
"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."
Interesting. Do you have some examples of types of questions you expect to be safe, or potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about the instrumental goals that the questions induce for debaters?
I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.
I am a bit confused about the section on the Markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observabi...
Nice post!
I would like to highlight that a naive application of the expected-value perspective could lead to problems like the unilateralist's curse, and I think the post would be even more useful for readers who are new to these kinds of considerations if it discussed that more explicitly (or prominently linked to relevant other posts).
"If, at some point in the future, we have the same number of contagious people, and are not at an appreciable fraction of group immunity, it will at that point again be a solid decision to go into quarantine (or to extend it). "
I think that for many people, the number of infections at which this becomes a good idea has increased, as we now have more accurate information about the CFR and about how quickly realistic countermeasures can slow down an outbreak in a given area; this should decrease credence in some of the worst-case scenarios many were worried about a few months ago.
Nitpick: I am pretty sure non-zero-sum does not imply a convex Pareto front.
Instead of the lens of negotiation position, one could argue that mistake theorists believe that the Pareto boundary is convex (which implies that usually maximizing surplus is more important than deciding allocation), while conflict theorists see it as concave (which implies that allocation is the more important factor).
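As a toy illustration of the nitpick (my own made-up example): take a battle-of-the-sexes-style game whose two coordination outcomes pay

```latex
(3, 1) \quad \text{and} \quad (1, 3),
```

with (0, 0) on mismatches. The game is non-zero-sum, but without shared randomization the players cannot reach the point (2, 2) on the segment between the two outcomes (independent mixing only yields payoff pairs below that segment), so the Pareto front is not convex.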
"Twitter: CV kills via cardiac failure, not pulmonary" links to the aggregate spreadsheet, not the Twitter source.
Even if the claim were usually true on longer time scales, I doubt that pointing out an organisation's mistakes and not entirely truthful statements usually increases trust in it on the short time scales that might be most important here. Reforming organisations and rebuilding trust usually takes time.
How do
"One of the problems here is that the impact penalty only looks at the value of VAR one turn ahead. In the DeepMind paper, they addressed similar issues by doing “inaction rollouts”. I'll look at the more general situations of rollouts: rollouts for any policy "
and
"That's the counterfactual situation, that zeroes out the impact penalty. What about the actual situation? Well, as we said before, A will be just doing ∅; so, as soon as would produce anything different from ...
For all auxiliary rewards. Edited the original comment.
I agree that it is likely to go wrong somewhere, but it might still be useful to figure out why. If the agent is able to predict the randomness reliably in some cases, the random baseline does not seem to help with the subagent problem.
Edit: Randomization does not seem to help, as long as the action set is large (as the agent can then arrange for most actions to make the subagent optimize the main reward).
I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state where
for all auxiliary rewards, where is the optimal policy according to the main reward, while making sure that there exists an action such that
for every. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the ...
"Not quite... " are you saying that the example is wrong, or that it is not general enough? I used a more specific example, as I found it easier to understand that way.
I am not sure I understand: In my mind, "commitments to balance out the original agent's attainable utility" essentially refers to the second agent being penalized by the first agent's penalty (although I agree that my statement is stronger). Regarding your text, my statement refers to "SA will just precommit to undermine or help A, depending on the cir...
I was thinking about normalisation as linearly rescaling every reward to $[0,1]$ when I wrote the comment. Then, one can always look at the convex combinations $p\,R_1 + (1-p)\,R_2$ weighted by the beliefs, which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $p\,\lambda_1 R_1 + (1-p)\,\lambda_2 R_2$ is the same as maximizing $q\,R_1 + (1-q)\,R_2$ with $q = \frac{p \lambda_1}{p \lambda_1 + (1-p) \lambda_2}$.
After looking at the update, my model is:
(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)
Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the "tipping point" in beliefs, where the opposite extreme policy is suddenly favoured).
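A minimal sketch of this picture (toy frontiers of my own choosing, not from the post): maximizing the belief-weighted value p·R1 + (1−p)·R2 over a strictly convex frontier moves smoothly with the belief p, while over a concave frontier the optimum jumps between the extremes at a tipping point.

```python
import numpy as np

r1 = np.linspace(0.0, 1.0, 1001)
convex_r2 = np.sqrt(1.0 - r1**2)                 # frontier bulging outward (convex feasible set)
concave_r2 = 1.0 - np.sqrt(1.0 - (1.0 - r1)**2)  # frontier bulging inward

for p in (0.3, 0.45, 0.55, 0.7):
    i_cvx = np.argmax(p * r1 + (1 - p) * convex_r2)
    i_ccv = np.argmax(p * r1 + (1 - p) * concave_r2)
    print(f"p={p}: convex optimum r1={r1[i_cvx]:.2f}, concave optimum r1={r1[i_ccv]:.2f}")

# Convex frontier: the optimum shifts smoothly as p changes.
# Concave frontier: the optimum jumps from r1=0 to r1=1 at the tipping point p=0.5.
```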
In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually...
But no matter how I take the default outcome, your second example is always "more positive sum" than the first, because 0.5 + 0.7 + 2x < 1.5 - 0.1 + 2x, i.e., 1.2 + 2x < 1.4 + 2x.
Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to "more negative sum", but this still seems to point to the sum-condition not being the central concept here. To me, it seems like "negative min" compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.
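To make the comparison explicit (my notation: Δu_i is player i's utility change relative to the default outcome):

```latex
\text{negative sum:} \quad \sum_i \Delta u_i < 0
\qquad\qquad
\text{negative min:} \quad \min_i \Delta u_i < 0
```

A negative sum forces at least one Δu_i to be negative, so negative sum implies negative min; the min-condition can also hold when the sum is positive, which is the sense in which it is weaker.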
Or am I completely misunderstanding your examples or your point?
To clear up some more confusion: The sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems rather to be that the best states for one of the (Edit: the expected) rewards are bad for the other?
That again seems like it would often follow from resource constraints.
Right. I think my intuition about negative-sum interactions under resource constraints combined the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.
Thank you for alleviating my confusion.
I think the actual solution is somewhere in between: If we assume calibrated uncertainty, ignore generalization, and assume we can perfectly fit the training data, the total cost should be reduced by (1 - the probability assigned to the predicted class) * the cost of misclassifying the not-predicted (minority) class as the predicted one (majority): If our classifier already predicted the right class, nothing happens, but otherwise we change our prediction to the other class and reduce the total cost. While this does not depend on the decision threshold, ...
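A minimal sketch of the implied decision rule (hypothetical cost numbers; assumes the predicted probabilities are calibrated): pick the class with the lowest expected misclassification cost instead of the most probable class.

```python
import numpy as np

# cost[i, j]: cost of predicting class j when the true class is i.
# Missing the minority class (1) is assumed to be 5x as costly.
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])

def min_expected_cost_class(p):
    """p: calibrated class probabilities, shape (n_classes,)."""
    expected_cost = p @ cost  # expected cost of each candidate prediction
    return int(np.argmin(expected_cost))

p = np.array([0.8, 0.2])
print(min_expected_cost_class(p))  # prints 1: argmax(p) is 0, but expected cost 0.8 < 1.0
```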