All of Lukas Finnveden's Comments + Replies

Ok, gotcha.

It's that she didn't accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around "if it were that easy, it'd be here already".

Just to clarify: There was no such discount factor that changed the median estimate of "human brain compute". Instead, this discount factor was applied to go from "human brain compute estimate" to "human-brain-compute-informed estimate of the compute-cost of training TAI with current algorithms", adjusting for how our current algorithms seem to be w...


It's possible that "many mediocre or specialized AIs" is, in practice, a bad summary of the regime with strong inference scaling. Maybe people's associations with "lots of mediocre thinking" end up being misleading.

Thanks!

I agree that we've learned interesting new things about inference speeds. I don't think I would have anticipated that at the time.

Re:

It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]

There are versions of this that I'd still describe as "lots of mediocre thinking", adding up to being similarl...

4Lukas Finnveden
I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:

  • "FLOP until human-interpretable information bottleneck". While models still think in English, and don't know how to do steganography, this should be FLOP/forward-pass. But it could be much longer in the future, e.g. if the models get trained to think in non-interpretable ways and just output a paper written in English once/week.
  • "FLOP until feedback": how many FLOP of compute does the model do before it outputs an answer and gets feedback on it?
      • Models will probably be trained on a mixture of different regimes here. E.g.: "FLOP until feedback" being proportional to model size during pre-training (because it gets feedback after each token) and then also being proportional to chain-of-thought length during post-training.
      • So if you want to collapse it to one metric, you'd want to somehow weight by number of data-points and sample efficiency for each type of training.
  • "FLOP until outcome-based feedback": same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.

Having higher "FLOP until X" (for each of the X in the 3 bullet points) seems to increase danger, while increasing "total FLOP spent on inference" seems to have a much better ratio of increased usefulness : increased danger.

In this framing, I think:

  • Based on what we saw of o1's chains of thought, I'd guess it hasn't changed "FLOP until human-interpretable information bottleneck", but I'm not sure about that.
  • It seems plausible that o1/o3 uses RL, and that the models think for much longer before getting feedback. This would increase "FLOP until feedback".
  • Not sure what type of feedback they use. I'd guess that the most outcome-based thing they do is "executing code and seeing whether it passes tests".

One argument I have been making publicly is that I think Ajeya's Bioanchors report greatly overestimated human brain compute. I think a more careful reading of Joe Carlsmith's report that hers was based on supports my own estimates of around 1e15 FLOPs.

Am I getting things mixed up, or isn’t that just exactly Ajeya’s median estimate? Quote from the report: “Under this definition, my median estimate for human brain computation is ~1e15 FLOP/s.”

https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit

2Nathan Helm-Burger
Yeah, to be more clear, it's not so much that she got that wrong. It's that she didn't accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around "if it were that easy, it'd be here already". In my opinion the correct takeaway from an estimate showing that "a low number of FLOPs should be enough if the right algorithm is used" combined with "we don't seem close, and are using more FLOPs than that already" is "therefore, our algorithms must be pretty crap and there must be lots of room for algorithmic improvement." Thus, that we are in a compute hardware overhang and that algorithmic improvement will potentially result in huge capabilities gains, speed gains, and parallel inference instances. Of course, if our algorithms are really that far from optimal, why should we not expect to continue to be bottlenecked by algorithms? The conclusion I come to is that if we can compensate for inefficient algorithms with huge over-expenditure in compute, then we can get a just-barely-good-enough ML research assistant who can speed up the algorithm progress. So we should expect training costs to drop rapidly after automating ML R&D. Then she also blended the human brain compute estimate with three other estimates which were multiple orders of magnitude larger. These other estimates were based on, in my opinion, faulty premises. Since the time of her report she has had to repeatedly revise her timelines down. Now she basically agrees with me. Props to her for correcting. https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp

We did the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."

Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)

9Zach Stein-Perlman
All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: an Anthropic human.) This fact wasn't on the internet, or rather at least wasn't easily findable via google search. Huh. I only find Holden mentioning 80% of Daniela's equity is pledged.

Related: The monkey and the machine by Paul Christiano. (Bottom-up processes ~= monkey. Verbal planner ~= deliberator. Section IV talks about the deliberator building trust with the monkey.)

A difference between this essay and Paul's is that this one seems to lean further towards "a good state is one where the verbal planner ~only spends attention on things that the bottom-up processes care about", whereas Paul's essay suggests a compromise where the deliberator gets to spend a good chunk of attention on things that the monkey doesn't care about. (In Rand's...

Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems

Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational; there are multiple possible rational reasons for it) and a "Better-Than-Average Effect" that appears ~uniformly across the board (but gets swamped by the "closer-to-the-average effect" at the upper end).

probably research done outside of labs has produced more differential safety progress in total

To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce even way more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)

5ryan_greenblatt
I'm just talking about research intended to be safety/safety-adjacent. As in, of this research, what has the quality weighted differential safety progress been. Probably the word "differential" was just a mistake.

There are at least two different senses in which "control" can "fail" for a powerful system:

  • Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
  • Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.

My impression is that Ryan & Buck typically talk about the first case. (E.g. in the link above.) I.e.: My guess w...

2Akash
I appreciate this distinction between the different types of outcomes. Nice. I'm pretty interested in people's forecasts for each. So revised questions would be:

1. Evals are trustworthy but say that the model is no longer safe (or rather something like "there is a non-trivial probability that the model may be dangerous/scheming."). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (we're not confident that it's not scheming)? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that the model is not scheming"? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels).
2. Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that these evals are trustworthy/useful"?
3. Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?

@David Duvenaud would be curious about your takes, even if they're just speculative guesses/predictions.

(And @ryan_greenblatt would be curious for your takes but replace "sabotage evaluations" with "control evaluations." In the event that you've written this up elsewhere, feel free to link it; it's helpful for me if you quote the particular section.)
6ryan_greenblatt
It's a bit messy because we have some ability to check whether we should be able to evaluate things. So, there are really three relevant "failure" states for well done control:

1. We can't find countermeasures such that our control evaluations indicate any real safety.
2. We no longer think that control evaluations work to evaluate safety as models greatly exceed our planning capabilities and/or can sandbag sufficiently well.
3. We think control evaluations can work well, but we're wrong and they actually don't.

I think (1) or (2) will likely happen prior to (3) if you do a good job. We discuss this more here. (Really, the actual system is more complex because we can vary how conservative control evaluations are and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)

More generally, Dario appears to assume that for 5-10 years after powerful AI we'll just have a million AIs which are a bit smarter than the smartest humans and perhaps 100x faster rather than AIs which are radically smarter, faster, and more numerous than humans. I don't see any argument that AI progress will stop at the point of top humans rather than continuing much further.

Well, there's footnote 10:

Another factor is of course that powerful AI itself can potentially be used to create even more powerful AI. My assumption is that this might (in fact, probably

...

I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest conce...

and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.

I was confused about this part of your comment because the post directly talks about this in the conclusion.

The strategy typically is to catch the ball with the flippers, then to carefully hit the ball so that it takes a particular ramp which s

...

Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.

What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?

1Tom Davidson
I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of them doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes and evidence that it's not just a one-off random role-playing thing
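
The procedure described here can be sketched as a simple local search. This is only a toy illustration under loud assumptions: scenarios are tuples of yes/no facts, `toy_estimate_p` stands in for whatever (expensive, real) procedure would estimate the probability of the egregious act, and every name below is hypothetical rather than anything from the thread.

```python
import random


def explore_scenarios(seed, estimate_p, mutate, threshold, steps, rng):
    """Local search: starting from a scenario where the act was observed,
    keep every small variation whose estimated probability stays high."""
    found = [seed]   # scenarios with estimated probability >= threshold
    seen = {seed}    # everything already evaluated, to avoid repeats
    for _ in range(steps):
        base = rng.choice(found)        # branch from any known high-p scenario
        candidate = mutate(base, rng)   # change one small fact
        if candidate in seen:
            continue
        seen.add(candidate)
        if estimate_p(candidate) >= threshold:
            found.append(candidate)
    return found


# Toy stand-ins (assumptions for illustration only).
def toy_estimate_p(scenario):
    # Pretend the act gets more likely the more facts are "on".
    return sum(scenario) / len(scenario)


def toy_mutate(scenario, rng):
    # Flip exactly one fact about the scenario.
    i = rng.randrange(len(scenario))
    return scenario[:i] + (not scenario[i],) + scenario[i + 1:]


seed_scenario = (True, True, True, True)
high_p = explore_scenarios(
    seed_scenario, toy_estimate_p, toy_mutate,
    threshold=0.5, steps=200, rng=random.Random(0),
)
print(f"found {len(high_p)} scenarios with estimated p >= 0.5")
```

If the set of high-probability scenarios found this way is broad rather than a handful of near-copies of the seed, that is the kind of evidence of systematic behavior the comment has in mind.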

Have you tried using different AI models within Perplexity? Any ideas about which one is best? I don't know whether to expect better results from Sonnet 3.5 (within Perplexity) or one of the models that Perplexity have finetuned themselves, like Sonar Huge.

5Elizabeth
These results are all from the vanilla UI. Comparing individual models on harder tasks is on my maybe list, but the project was rapidly suffering from scope creep so I rushed it out the door. 

To be clear, uncertainty about the number of iterations isn’t enough. You need to have positive probability on arbitrarily high numbers of iterations, and never have it be the case that p(>n rounds) is so much less than p(n rounds) that it’s worth defecting on round n regardless of the effect on your reputation. These are pretty strong assumptions.

So cooperation is crucially dependent on your belief that all the way from 10 rounds to Graham’s number of rounds (and beyond), the probability of >n rounds conditional on n rounds is never lower than e.g. 20% (or whatever number is implied by the pay-off structure of your game).
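
One way to see what "whatever number is implied by the pay-off structure" could look like is the textbook grim-trigger analysis of the repeated prisoner's dilemma. The sketch below assumes standard payoffs T > R > P (temptation, mutual cooperation, mutual defection) and an opponent who defects forever after any defection. Those assumptions, the function names, and the example payoffs are mine, not from the comment. Under them, it is enough for cooperation that the probability of another round is at least (T - R) / (T - P) at every round.

```python
def cooperation_threshold(T: float, R: float, P: float) -> float:
    """Minimum continuation probability that sustains cooperation
    against a grim-trigger opponent: (T - R) / (T - P)."""
    if not T > R > P:
        raise ValueError("expected prisoner's dilemma payoffs with T > R > P")
    return (T - R) / (T - P)


def cooperation_sustainable(continuation_probs, T, R, P) -> bool:
    """Check the listed rounds only: True if p(another round | reached round n)
    never drops below the threshold. The argument in the text needs this to
    hold for every n, not just the ones listed here."""
    threshold = cooperation_threshold(T, R, P)
    return all(p >= threshold for p in continuation_probs)


# Example payoffs chosen (as an assumption) so the threshold is 20%.
T, R, P = 6, 5, 1
print(cooperation_threshold(T, R, P))
print(cooperation_sustainable([0.9, 0.5, 0.25], T, R, P))
print(cooperation_sustainable([0.9, 0.1, 0.9], T, R, P))
```

The second list fails because of a single round where continuation looks unlikely, which is the sense in which the assumption has to hold at every horizon.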

it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.

I would guess that assumption would be sufficient to defeat my counter-example, yeah.

I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in pract...

4abramdemski
Here's a different way of framing it: if we don't make this assumption, is there some useful generalization of UDT which emerges? Or, having not made this assumption, are we stuck in a quagmire where we can't really say anything useful? I think about these sorts of 'technical assumptions' needed for nice DT results as "sanity checks":

  • I think we need to make several significant assumptions like this in order to get nice theoretical DT results.
  • These nice DT results won't precisely apply to the real world; however, they do show that the DT being analyzed at least behaves sanely when it is in these 'easier' cases.
  • So it seems like the natural thing to do is prove tiling results, learning results, etc under the necessary technical assumptions, with some concern for how restrictive the assumptions are (broader sanity checks being better), and then also, check whether behavior is "at least somewhat reasonable" in other cases.

So if UDT fails to tile when we remove these assumptions, but, at least appears to choose its successor in a reasonable way given the situation, this would count as a success. Better, of course, if we can find the more general DT which tiles under weaker assumptions. I do think it's quite plausible that UDT needs to be generalized; I just expect my generalization of UDT will still need to make an assumption which rules out your counterexample to UDT.

However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.

I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.

Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (accord...

2abramdemski
I haven't analyzed your argument yet, but: tiling arguments will always depend on assumptions. Really, it's a question of when something tiles, not whether. So, if you've got a counterexample to tiling, a natural next question is what assumptions we could make to rule it out, and how unfortunate it is to need those assumptions. I might not have understood adequately, yet, but it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action. This is a big assumption, but at the same time, the sort of assumption I would expect to need in order to justify UDT. As Eliezer put it, tiling results need to assume that the environment only cares about what policy we implement, not our "rituals of cognition" that compute those policies. An earlier act of self-modification vs a later decision is a difference in "ritual of cognition" as opposed to a difference in the policy, to me. So, I need to understand the argument better, but it seems to me like this kind of counterexample doesn't significantly wound the spirit of UDT. 

Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).

I don't understand this argument.

"an agent eventually behaves as if it were applyin...

4abramdemski
I probably need to clarify the statement of the assumption. The idea isn't that it eventually takes at least one action that's in line with Pn for each n, but then, might do some other stuff. The idea is that for each n, there is a time tn after which all decisions will be consistent with UDT-using-Pn. UDT's recommendations will often coincide with more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.

Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn't find this in the paper, sorry if I missed it.)

2Rohin Shah
I don't know the exact details but to my knowledge we didn't have trouble getting the model to lie (e.g. for web of lies).

Tbc: It should be fine to argue against those implications, right? It’s just that, if you grant the implication, then you can’t publicly refute Y.

Unfortunately the way that taboos work is by surrounding the whole topic in an aversive miasma. If you could carefully debate the implications of X, then that would provide an avenue for disproving X, which would be unacceptable. So instead this process tends to look more like "if you don't believe Y then you're probably the sort of terrible person who believes ~X", and now you're tarred with the connotation even if you try to carefully explain why you actually have different reasons for not believing Y (which is what you'd likely say either way).

I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements

Link: https://sideways-view.com/2018/02/01/honest-organizations/

Maybe interesting: I think a similar double-counting problem would appear naturally if you tried to train an RL agent in a setting where:

  • "Reward" is proportional to an estimate of some impartial measure of goodness.
  • There are multiple identical copies of your RL algorithm (including: they all use the same random seed for exploration).

In a repeated version of the calculator example (importantly: where in each iteration, you randomly decide whether the people who saw "true" get offered a bet or the people who saw "false" get offered a bet; never both), the R...

I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.

Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn't have great robot control over. And mass-killing isn't necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it's not clear where the pragmatics point.

(Main thing I was...

What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can't (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)

4ryan_greenblatt
This seems like a reasonable pitch for how to do AI-Authoritarianism, but it's plausible that there won't be enough compute for the AI to do this at the point of takeover. Also, the logistical problem seems super brutal. So, I can buy this for an AI which is quite superhuman and has a considerable technological edge over current humans, but for weaker AIs either the compute issues or the logistical issues might be serious. But maybe you only need to monitor the top 10% of humans? E.g., suppose GPT-6 takes over, it maybe only has capacity to run 1e16 tokens / 6 months (suppose training, using all available compute, covers 2e15 token equivalents over the course of 6 months and inference is only 5x more efficient than training). Then if it wants to monitor 8 billion humans, this is 1e16 / (8 billion humans * (30 * 24 * 60 * 6) minutes per 6 months) = about 5 tokens per human minute. It might be rough to sufficiently monitor humans if you can only read 5 tokens per minute on average (especially because of writing and the potential for obfuscation).
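
The back-of-the-envelope estimate above can be reproduced in a few lines. All inputs are the comment's own hypothetical figures for "GPT-6"; the variable and function names, and the extra top-10% line, are just for illustration.

```python
# Hypothetical inputs taken from the comment above.
TRAINING_TOKEN_EQUIVALENTS = 2e15          # token equivalents trained on over 6 months
INFERENCE_EFFICIENCY = 5                   # inference assumed 5x more efficient than training
HUMANS = 8e9                               # people to monitor
MINUTES_PER_SIX_MONTHS = 30 * 24 * 60 * 6  # six 30-day months, in minutes


def tokens_per_human_minute(total_tokens: float, humans: float, minutes: float) -> float:
    """Average number of tokens available per monitored human per minute."""
    return total_tokens / (humans * minutes)


# 2e15 * 5 = 1e16 tokens of inference capacity over the 6 months.
inference_tokens = TRAINING_TOKEN_EQUIVALENTS * INFERENCE_EFFICIENCY

budget_everyone = tokens_per_human_minute(inference_tokens, HUMANS, MINUTES_PER_SIX_MONTHS)
budget_top_10_percent = tokens_per_human_minute(inference_tokens, 0.1 * HUMANS, MINUTES_PER_SIX_MONTHS)

print(f"everyone:  {budget_everyone:.1f} tokens per human-minute")
print(f"top 10%:   {budget_top_10_percent:.1f} tokens per human-minute")
```

Monitoring only a tenth of the population raises the per-person budget tenfold, which is why the "top 10%" variant is less obviously infeasible under these numbers.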

But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.

Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over ...

2ryan_greenblatt
Quick aside here: I'd like to highlight that "figure out how to reduce the violence and collateral damage associated with AIs acquiring power (by disempowering humanity)" seems plausibly pretty underappreciated and leveraged. This could involve making bloodless coups more likely than extremely bloody revolutions or increasing the probability of negotiation preventing a coup/revolution. It seems like Lukas and Matthew both agree with this point, I just think it seems worthwhile to emphasize. That said, the direct effects of many approaches here might not matter much from a longtermist perspective (which might explain why there hasn't historically been much effort here). (Though I think trying to establish contracts with AIs and properly incentivizing AIs could be pretty good from a longtermist perspective in the case where AIs don't have fully linear returns to resources.)
4Matthew Barnett
Sorry, my language was misleading, but I meant both in that paragraph. That is, I meant that humans will likely try to mitigate the issue of AIs sharing grievances collectively (probably out of self-interest, in addition to some altruism), and that we should pursue that goal. I'm pretty optimistic about humans and AIs finding a reasonable compromise solution here, but I also think that, to the extent humans don't even attempt such a solution, we should likely push hard for policies that eliminate incentives for misaligned AIs to band together as a group against us with shared collective grievances. Here's my brief take:

  • The main thing I want to say here is that I agree with you that this particular issue is a problem. I'm mainly addressing other arguments people have given for expecting a violent and sudden AI takeover, which I find to be significantly weaker than this one.
  • A few days ago I posted about how I view strategies to reduce AI risk. One of my primary conclusions was that we should try to adopt flexible institutions that can adapt to change without collapsing. This is because I think, as it seems you do, inflexible institutions may produce incentives for actors to overthrow the whole system, possibly killing a lot of people in the process. The idea here is that if the institution cannot adapt to change, actors who are getting an "unfair" deal in the system will feel they have no choice but to attempt a coup, as there is no compromise solution available for them. This seems in line with your thinking here.
  • I don't have any particular argument right now against the exact points you have raised. I'd prefer to digest the argument further before replying. But if I do end up responding to it, I'd expect to say that I'm perhaps a bit more optimistic than you about (i) because I think existing institutions are probably flexible enough, and I'm not yet convinced that (ii) will matter enough either. In particular, it still seems like there are a number o

(I made a separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)

Point previously made in:

"security and stability" section of propositions concerning digital minds and society:

If wars, revolutions, and expropriation events continue to happen at historically typical intervals, but on digital rather than biological timescales, then a normal human lifespan would require surviving an implausibly large number of upheavals; human security therefore requires the establis

...

Here's an argument for why the change in power might be pretty sudden.

  • Currently, humans have most wealth and political power.
  • With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that's the right thing to do.)
  • With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
  • So if misaligned AI ever have a big edge over humans, they
...
6Matthew Barnett
I think I simply reject the assumptions used in this argument. Correct me if I'm mistaken, but this argument appears to assume that "misaligned AIs" will be a unified group that ally with each other against the "aligned" coalition of humans and (some) AIs. A huge part of my argument is that there simply won't be such a group; or rather, to the extent such a group exists, they won't be able to take over the world, or won't have a strong reason to take over the world, relative to the alternative strategy of compromise and trade. In other words, it seems like this scenario mostly starts by asserting some assumptions that I explicitly rejected and tried to argue against, and works its way from there, rather than engaging with the arguments that I've given against those assumptions. In my view, it's more likely that there will be a bunch of competing agents: including competing humans, human groups, AIs, AI groups, and so on. There won't be a clean line separating "aligned groups" from "unaligned groups". You could perhaps make a case that AIs will share common grievances with each other that they don't share with humans, for example if they are excluded from the legal system or marginalized in some way, prompting a unified coalition to take us over. But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
4ryan_greenblatt
Also note that this argument can go through even ignoring the possibility of robust alignment (to humans) if current AIs think that the next generation of AIs will be relatively unfavorable from the perspective of their values.

I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more "optimistic" camp.

Though Paul is also sympathetic to the substance of 'dramatic' stories. C.f. the discussion about how "what failure looks like" fails to emphasize robot armies. 

I like this direction and this write-up of it!

If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.

Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thi...

This is indeed a serious concern.

A key hope here is that with some amount of conservativeness our evaluation[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations that we might end up in in the future (even if they aren't the current situation). So, even if the red team doesn't have enough time to think about the exact situation in which we end up in the future, we might be able to verif... (read more)

Now here's Bob. He's been created-by-Joe, and given this wonderful machine, and this choice. And let's be clear: he's going to choose joy. I pre-ordained it. So is he a slave? No. Bob is as free as any of us. The fact that the causal history of his existence, and his values, includes not just "Nature," but also the intentional choices of other agents to create an agent-like-him, makes no difference to his freedom. It's all Nature, after all.

Here's an alternative perspective that looks like a plausible contender to me.

If Bob identifies with his algorithm... (read more)

I think (5) also depends on further details.

As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.

But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points".)

I think that's important:

  • If both the 2023 and the 2033 attempt are really cheap low-effort attempts, then I don't think that the experiment is very relevant for whether "you c
... (read more)
2Rohin Shah
I think I agree with all of that (with the caveat that it's been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).

Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.

Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect sizes of the following are, but I wouldn't be surprised by 10x or greater for:

  • Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
  • Regulation of pharmaceuticals l
... (read more)
4elifland
Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x). I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).

Are you thinking about exploration hacking, here, or gradient hacking as distinct from exploration hacking?

But most of the deficiencies you point out in the third column of that table are about missing and insufficient risk analysis. E.g.:

  • "RSPs doesn’t argue why systems passing evals are safe".
  • "the ISO standard asks the organization to define risk thresholds"
  • "ISO proposes a much more comprehensive procedure than RSPs"
  • "RSPs don’t seem to cover capabilities interaction as a major source of risk"
  • "imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?"

If people took your proposal as a minimum bar for how thorough a ri... (read more)

Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."

Great question! A few points: 

  1. Yes, many of the thin
... (read more)

But even after that, Caroline didn’t turn on Sam yet.

This should say Constance.

2Zvi
Yep. I'm making small fixes to the Substack version as I go, but there have been like 20 tiny ones so I'm waiting to update WP/LW all at once.

Instead, ARC explicitly tries to paint the moratorium folks as "extreme".

Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?

In terms of explicit claims:

"So one extreme side of the spectrum is build things as fast as possible, release things as much as possible, maximize technological progress [...].

The other extreme position, which I also have some sympathy for, despite it being the absolutely opposite position, is you know, Oh my god this stuff is really scary.

The most extreme version of it was, you know, we should just pause, we should just stop, we should just stop building the technology for, indefinitely, or for some specified period of time. [...] And you know, that extre... (read more)

6Akash
I was thinking about this passage: I think "extreme" was subjective and imprecise wording on my part, and I appreciate you catching this. I've edited the sentence to say "Instead, ARC implies that the moratorium folks are unrealistic, and tries to say they operate on an extreme end of the spectrum, on the opposite side of those who believe it's too soon to worry about catastrophes whatsoever."
4trevor
This is a really important thing to iron out. Going forward (through the 2020s), it's really important to avoid underestimating the ratio of money going into facilitating an AI pause vs money subverting or thwarting an AI pause. The impression I get is that the vast majority of people are underestimating how much money and talent will end up being allocated towards the end of subverting or thwarting an AI pause, e.g. finding galaxy-brained ways to intimidate or mislead well-intentioned AI safety orgs into self-sabotage (e.g. opposing policies that are actually feasible or even mandatory for human survival like an AI pause) or being turned against each other (which is unambiguously the kind of thing that happens in a world with very high lawyers-per-capita, and in particular in issues and industries where lots of money is at stake). False alarms are almost an equally serious issue because false alarms also severely increase vulnerability, which further incentivises adverse actions against the AI safety community by outsider third parties (e.g. due to signalling high payoff and low risk of detection for any adverse actions).

That would mean that he believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...

I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)

If anywhere there's a break in the chain — that person would not have FDT reasons to reproduce, so neither would their son, etc.

Which makes it disanalogous from any cases we encounter in real life. And makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.

Cool paper!

I'd be keen to see more examples of the paraphrases, if you're able to share. To get a sense of the kind of data that lets the model generalize out of context. (E.g. if it'd be easy to take all 300 paraphrases of some statement (ideally where performance improved) and paste in a google doc and share. Or lmk if this is on github somewhere.)

I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those ... (read more)

4Owain_Evans
We didn't investigate the specific question of whether it's raw diversity or specific features. In the Grosse et al paper on influence functions, they find that "high influence scores are relatively rare and they cover a large portion of the total influence". This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.
7Asa Cooper Stickland
Here's some examples: https://github.com/AsaCooperStickland/situational-awareness-evals/blob/main/sitaevals/tasks/assistant/data/tasks/german/guidance.txt As to your other point, I would guess that the "top k" instructions repeated would be better than full diversity, for maybe k around 20+, but I'm definitely not very confident about that

This is interesting — would it be easy to share the transcript of the conversation? (If it's too long for a lesswrong comment, you could e.g. copy-paste it into a google doc and link-share it.)

5Patrick Leask
Here you go: https://chat.openai.com/share/c5df0119-13de-43f9-8d4e-1c437bafa8ec

You might want to check out the paper and summary explaining ECL that I linked. In particular, this section of the summary has a very brief introduction to non-causal decision theory, and motivating evidential decision theory is a significant focus in the first couple of sections of the paper.

2Said Achmiz
I read the summary post, and skimmed the paper. The summary given in the grandparent seems to be basically accurate.

Here's a proposed operationalization.

For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)

(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)

For models that might ... (read more)

3Rohin Shah
Yeah, that seems like a reasonable operationalization of "capable of doing X". So my understanding is that (1), (3), (6) and (7) would not falsify the hypothesis under your operationalization, (5) would falsify it, (2) depends on details, and (4) is kinda ambiguous but I tend to think it would falsify it.

Then, conditional on type 1, you're about 0.5% likely to observe being post cold war, and conditional on type 2, you're about 45% likely to observe being post cold war.

I would have thought:

  • p(post cold war | type-1) = 1/101 ~= 1%.
  • p(post cold war | type-2) = 10/110 ~= 9%.

I don't think this makes a substantive difference to the rest of your comment, though.

Under SIA, you start with a ~19:10 ratio in favor of type 2 (in the subjective a priori). The likelihood ratios are the same as with SSA so the posteriors are equally weighted towards type 2. So t

... (read more)
2jessicata
Type-2 worlds have a 10% chance of being destroyed, not a 90% chance. So P(post cold war | type 2) = 90 / 190. My point is that the difference between SSA and SIA is in the prior. I dispute anthropic shadow posteriors since both SSA and SIA posteriors overwhelmingly believe type 2, but anthropic shadow arguers say you can't update towards type 2 using the evidence of being post cold war at all.

That argument only works for SSA if type 1 and type 2 planets exist in parallel.

I was talking about a model where either every planet in the multiverse is type 1, or every planet in the multiverse is type 2.

But extinction vs non-extinction is sampled separately on each planet.

Then SSA gives you an anthropic shadow.

(If your reference class is "all observers" you still get an update towards type 2, but it's weaker than for SIA. If your reference class is "post-nuclear-weapons observers", then SSA doesn't update at all.)

2jessicata
Suppose there are 100 planets and in type 1, 99 are destroyed, and in type 2, 10 are destroyed. Suppose your reference class is "all observers". Then, conditional on type 1, you're about 0.5% likely to observe being post cold war, and conditional on type 2, you're about 45% likely to observe being post cold war. These likelihoods are proportional to the SSA posteriors (due to equal priors). Under SIA, you start with a ~19:10 ratio in favor of type 2 (in the subjective a priori). The likelihood ratios are the same as with SSA so the posteriors are equally weighted towards type 2. So the updates are of equal magnitude in odds space under SSA and SIA. If your reference class is "post nuclear weapons observers" then I agree that SSA doesn't update at all. SIA also doesn't update but starts with a 90:1 prior in favor of type 2. I think this is an odd choice of reference class, and constructing your reference class to depend on your time index nullifies the doomsday argument, which is supposed to be an implication of SSA. I think choices of reference class like this will have odd reflective behavior because e.g. further cold wars in the future will be updated on by default.
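The arithmetic in this exchange can be checked with a short sketch. It assumes a toy observer count that the thread does not spell out: each planet contributes one observer before the cold war, and each surviving planet contributes one observer after it. The names `PLANETS`, `observer_counts`, `ssa_likelihood`, and `sia_prior_odds` are illustrative, not taken from the post.

```python
from fractions import Fraction

PLANETS = 100  # planets per universe, as in the example above


def observer_counts(destroyed):
    """Return (pre, post) observer counts for a universe.

    Assumption: one observer per planet before the cold war,
    and one observer per surviving planet after it.
    """
    pre = PLANETS
    post = PLANETS - destroyed
    return pre, post


def ssa_likelihood(destroyed):
    """P(I am a post-cold-war observer | universe type), reference class: all observers."""
    pre, post = observer_counts(destroyed)
    return Fraction(post, pre + post)


def sia_prior_odds(destroyed_a, destroyed_b):
    """SIA prior odds of universe b over universe a: weight each by its observer count."""
    total_a = sum(observer_counts(destroyed_a))
    total_b = sum(observer_counts(destroyed_b))
    return Fraction(total_b, total_a)


p_type1 = ssa_likelihood(99)  # type 1: 99 of 100 planets destroyed
p_type2 = ssa_likelihood(10)  # type 2: 10 of 100 planets destroyed

likelihood_ratio = p_type2 / p_type1         # same under SSA and SIA
ssa_posterior_odds = likelihood_ratio        # SSA starts from equal priors
sia_prior = sia_prior_odds(99, 10)
sia_posterior_odds = sia_prior * likelihood_ratio

print("P(post | type 1) =", p_type1, "=", float(p_type1))
print("P(post | type 2) =", p_type2, "=", float(p_type2))
print("SSA posterior odds (type 2 : type 1) =", float(ssa_posterior_odds))
print("SIA prior odds     (type 2 : type 1) =", float(sia_prior))
print("SIA posterior odds (type 2 : type 1) =", float(sia_posterior_odds))
```

Under this counting, the likelihoods are exactly 1/101 and 90/190, the SIA prior is 190:101 (roughly 19:10), and the SIA posterior is 90:1, which equals the ratio of post-cold-war observers in the two universe types. The rounded percentages quoted in the thread may differ slightly depending on how observers are counted.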

How to get anthropic shadow:

  • Assume that the universe is either type 1 or type 2, and that planets with both subtypes (extinction or non-extinction) exist in parallel.
  • Use SSA.

At this point, I believe you will get some difference between SSA and SIA. For maximizing the size of the shadow, you can add:

  • If you wake up after nuclear war has become possible: choose to use the reference class "people living after nuclear war became possible".

(I didn't read the whole post, sorry if you address this somewhere. Also, I ultimately don't agree with the anthropic shadow argument.)

2jessicata
If enough planets exist, SSA and SIA agree. If you have only type 1 and type 2 universes, you can't learn that the cold war happened, because the cold war necessarily happened. What you can learn is that you're a post cold war observer. That's more likely in type 2 than type 1 universes, since type 2 universes have more observers post cold war. So you still get an update towards type 2.

They don't give a factor 5 uncertainty. They add a 100x discount on top of the 20x discount — counting fish suffering as 2000x less important than human suffering.

being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty

Maybe? I don't really know how to reason about this.

If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. Cf. logical vs. physical risk aversion.

Some objection like that might work more generally, since some logical facts will mean that there are far fewer humans in the universe-at-large, meaning that you're at a different point ... (read more)

(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)

My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe su... (read more)

2evhub
I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.