I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:
Thanks!
I agree that we've learned interesting new things about inference speeds. I don't think I would have anticipated that at the time.
Re:
It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]
There are versions of this that I'd still describe as "lots of mediocre thinking" — adding up to being similarl...
One argument I have been making publicly is that I think Ajeya's Bioanchors report greatly overestimated human brain compute. I think a more careful reading of Joe Carlsmith's report that hers was based on supports my own estimates of around 1e15 FLOPs.
Am I getting things mixed up, or isn’t that just exactly Ajeya’s median estimate? Quote from the report: “Under this definition, my median estimate for human brain computation is ~1e15 FLOP/s.”
https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit
Related: The monkey and the machine by Paul Christiano. (Bottom-up processes ~= monkey. Verbal planner ~= deliberator. Section IV talks about the deliberator building trust with the monkey.)
A difference between this essay and Paul's is that this one seems to lean further towards "a good state is one where the verbal planner ~only spends attention on things that the bottom-up processes care about", whereas Paul's essay suggests a compromise where the deliberator gets to spend a good chunk of attention on things that the monkey doesn't care about. (In Rand's...
Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems
Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational — there are multiple possible rational reasons for it) and a "Better-Than-Average Effect" that appears ~uniformly across the board (but gets swamped by the "closer-to-the-average effect" at the upper end).
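To see how far those two effects alone can go, here's a minimal simulation sketch. All the numbers (the shrinkage factor, the uniform uplift, the noise scale) are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True skill percentiles, uniform over the population.
true_pct = rng.uniform(0, 100, n)

# "Closer-to-the-average effect": self-estimates are shrunk toward 50.
# "Better-than-average effect": a uniform upward shift for everyone.
shrinkage = 0.3   # hypothetical: keep only 30% of the distance from the average
uplift = 10.0     # hypothetical: everyone rates themselves ~10 points higher
noise = rng.normal(0, 10, n)

self_est = np.clip(50 + shrinkage * (true_pct - 50) + uplift + noise, 0, 100)

# Mean self-estimate by true-skill quartile: the bottom quartile overestimates
# a lot and the top quartile underestimates somewhat, i.e. the familiar
# Dunning-Kruger-looking pattern.
for q in range(4):
    mask = (true_pct >= 25 * q) & (true_pct < 25 * (q + 1))
    print(f"quartile {q + 1}: true mean {true_pct[mask].mean():5.1f}, "
          f"self-estimate mean {self_est[mask].mean():5.1f}")
```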
probably research done outside of labs has produced more differential safety progress in total
To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce proportionally even more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)
There are at least two different senses in which "control" can "fail" for a powerful system:
My impression is that Ryan & Buck typically talk about the first case. (E.g. in the link above.) I.e.: My guess w...
More generally, Dario appears to assume that for 5-10 years after powerful AI we'll just have a million AIs which are a bit smarter than the smartest humans and perhaps 100x faster, rather than AIs which are radically smarter, faster, and more numerous than humans. I don't see any argument that AI progress will stop at the point of top humans rather than continuing much further.
Well, there's footnote 10:
...Another factor is of course that powerful AI itself can potentially be used to create even more powerful AI. My assumption is that this might (in fact, probably
I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest conce...
and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.
I was confused about this part of your comment because the post directly talks about this in the conclusion.
...The strategy typically is to catch the ball with the flippers, then to carefully hit the balls so that it takes a particular ramp which s
Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.
What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
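For concreteness, here's a minimal sketch of the loop I'm imagining. `perturb_scenario` and `estimate_p_escape` are hypothetical stand-ins for whatever scenario-editing and evaluation machinery would actually be used.

```python
def hill_climb_scenarios(seed_scenario, perturb_scenario, estimate_p_escape,
                         n_steps=100, n_candidates=10):
    """Greedy local search over scenario variations.

    Starting from a scenario that triggered (or nearly triggered) an escape
    attempt, repeatedly try small perturbations and keep whichever candidate
    has the highest estimated probability of an escape attempt.
    """
    best = seed_scenario
    best_p = estimate_p_escape(best)
    for _ in range(n_steps):
        candidates = [perturb_scenario(best) for _ in range(n_candidates)]
        top_p, top = max(
            ((estimate_p_escape(c), c) for c in candidates),
            key=lambda pair: pair[0],
        )
        if top_p > best_p:
            best, best_p = top, top_p
    return best, best_p
```

(So "gradient descent" in the loose sense of greedy local search on p(escape attempt), rather than anything involving literal gradients.)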
To be clear, uncertainty about the number of iterations isn’t enough. You need to have positive probability on arbitrarily high numbers of iterations, and never have it be the case that p(>n rounds) is so much less than p(n rounds) that it’s worth defecting on round n regardless of the effect on your reputation. These are pretty strong assumptions.
So cooperation is crucially dependent on your belief that all the way from 10 rounds to Graham’s number of rounds (and beyond), the probability of >n rounds conditional on n rounds is never lower than e.g. 20% (or whatever number is implied by the pay-off structure of your game).
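To spell out where a number like that comes from: with grim-trigger-style strategies in a repeated prisoner's dilemma and a constant continuation probability, cooperating beats defecting exactly when the continuation probability is at least (T - R)/(T - P), where T, R, P are the temptation, mutual-cooperation, and mutual-defection payoffs. With round-dependent probabilities the bookkeeping is messier, but the quantity that has to stay above the threshold at round n is roughly P(>n rounds | ≥n rounds). A quick sketch, with textbook payoffs rather than anything from this thread:

```python
def grim_trigger_threshold(T, R, P):
    """Minimum per-round continuation probability d at which cooperating
    (under grim trigger) is at least as good as defecting now:

        cooperate forever: R / (1 - d)
        defect now:        T + d * P / (1 - d)

    Cooperation is at least as good iff d >= (T - R) / (T - P).
    """
    return (T - R) / (T - P)

# Textbook prisoner's dilemma payoffs (temptation > reward > punishment).
print(grim_trigger_threshold(T=5, R=3, P=1))  # 0.5: need P(>n rounds | >=n rounds) >= 50%
```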
it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.
I would guess that assumption would be sufficient to defeat my counter-example, yeah.
I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in pract...
However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.
I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.
Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (accord...
Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).
I don't understand this argument.
"an agent eventually behaves as if it were applyin...
Unfortunately the way that taboos work is by surrounding the whole topic in an aversive miasma. If you could carefully debate the implications of X, then that would provide an avenue for disproving X, which would be unacceptable. So instead this process tends to look more like "if you don't believe Y then you're probably the sort of terrible person who believes ~X", and now you're tarred with the connotation even if you try to carefully explain why you actually have different reasons for not believing Y (which is what you'd likely say either way).
I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements.
Link: https://sideways-view.com/2018/02/01/honest-organizations/
Maybe interesting: I think a similar double-counting problem would appear naturally if you tried to train an RL agent in a setting where:
In a repeated version of the calculator example (importantly: where in each iteration, you randomly decide whether the people who saw "true" get offered a bet or the people who saw "false" get offered a bet — never both), the R...
I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.
Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn't have great robot control over. And mass-killing isn't necessarily super easy either, and attempts in that direction could raise a lot of extra opposition. So it's not clear where the pragmatics point.
(Main thing I was...
What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can't (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)
But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over ...
(I made a separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)
Point previously made in:
"security and stability" section of propositions concerning digital minds and society:
...If wars, revolutions, and expropriation events continue to happen at historically typical intervals, but on digital rather than biological timescales, then a normal human lifespan would require surviving an implausibly large number of upheavals; human security therefore requires the establis
Here's an argument for why the change in power might be pretty sudden.
I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more "optimistic" camp.
Though Paul is also sympathetic to the substance of 'dramatic' stories. C.f. the discussion about how "what failure looks like" fails to emphasize robot armies.
I like this direction and this write-up of it!
If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.
Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thi...
This is indeed a serious concern.
A key hope here is that with some amount of conservativeness our evaluation[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations in which we might end up in the future (even if they aren't the current situation). So, even if the red team doesn't have enough time to think about the exact situation in which we end up in the future, we might be able to verif...
Now here's Bob. He's been created-by-Joe, and given this wonderful machine, and this choice. And let's be clear: he's going to choose joy. I pre-ordained it. So is he a slave? No. Bob is as free as any of us. The fact that the causal history of his existence, and his values, includes not just "Nature," but also the intentional choices of other agents to create an agent-like-him, makes no difference to his freedom. It's all Nature, after all.
Here's an alternative perspective that looks like a plausible contender to me.
If Bob identifies with his algorithm...
I think (5) also depends on further details.
As you have written it, both the 2023 and 2033 attempts use similar data and similar compute.
But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("say, at most 1000 data points").
I think that's important:
Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.
Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect sizes of the following are, but I wouldn't be surprised by 10x or greater for:
But most of the deficiencies you point out in the third column of that table are about missing or insufficient risk analysis. E.g.:
If people took your proposal as a minimum bar for how thorough a ri...
Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."?
Great question! A few points:
Instead, ARC explicitly tries to paint the moratorium folks as "extreme".
Are you thinking about this post? I don't see any explicit claims that the moratorium folks are extreme. What passage are you thinking about?
In terms of explicit claims:
"So one extreme side of the spectrum is build things as fast as possible, release things as much as possible, maximize technological progress [...].
The other extreme position, which I also have some sympathy for, despite it being the absolutely opposite position, is you know, Oh my god this stuff is really scary.
The most extreme version of it was, you know, we should just pause, we should just stop, we should just stop building the technology for, indefinitely, or for some specified period of time. [...] And you know, that extre...
That would mean that he believed he had a father with the same reasons, who believed he had a father with the same reasons, who believed he had a father with the same reasons...
I.e., this would require an infinite line of forefathers. (Or at least of hypothetical, believed-in forefathers.)
If anywhere there's a break in the chain — that person would not have FDT reasons to reproduce, so neither would their son, etc.
Which makes it disanalogous from any cases we encounter in real life. And makes me more sympathetic to the FDT reasoning, since it's a stranger case where I have less strong pre-existing intuitions.
Cool paper!
I'd be keen to see more examples of the paraphrases, if you're able to share. To get a sense of the kind of data that lets the model generalize out of context. (E.g. if it'd be easy to take all 300 paraphrases of some statement (ideally where performance improved) and paste in a google doc and share. Or lmk if this is on github somewhere.)
I'd also be interested in experiments to determine whether the benefit from paraphrases is mostly fueled by the raw diversity, or if it's because examples with certain specific features help a bunch, and those ...
You might want to check out the paper and summary that explains ECL, that I linked. In particular, this section of the summary has a very brief introduction to non-causal decision theory, and motivating evidential decision theory is a significant focus in the first couple of sections of the paper.
Here's a proposed operationalization.
For models that can't gradient hack: The model is "capable of doing X" if it would start doing X upon being fine-tuned to do it using a hypothetical, small finetuning dataset that demonstrated how to do the task. (Say, at most 1000 data points.)
(The hypothetical fine-tuning dataset should be a reasonable dataset constructed by a hypothetical team of humans who know how to do the task but aren't optimizing the dataset hard for ideal gradient updates to this particular model, or anything like that.)
For models that might ...
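As code, the "can't gradient hack" case above looks roughly like the sketch below. The helpers `fine_tune` and `evaluate`, and the success threshold, are placeholders for whatever training and evaluation setup you'd actually use; nothing here is meant as a precise proposal.

```python
def capable_of_x(model, demonstrations, fine_tune, evaluate,
                 max_examples=1000, success_threshold=0.5):
    """Capability check for models that can't gradient hack.

    demonstrations: a reasonable dataset, written by people who know how to do
        the task, showing how to do X (not adversarially optimized for this
        particular model's gradients).
    fine_tune(model, data): returns a fine-tuned copy of the model.
    evaluate(model): returns the model's success rate at X.

    The model counts as "capable of X" if fine-tuning on at most
    `max_examples` demonstrations gets it to start doing X.
    """
    assert len(demonstrations) <= max_examples
    return evaluate(fine_tune(model, demonstrations)) >= success_threshold
```

The contrast with "you can get it to do X" is then just that the latter is allowed far more optimization than `max_examples` demonstrations (say, 1% of the pre-training budget).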
Then, conditional on type 1, you're about 0.5% likely to observe being post cold war, and conditional on type 2, you're about 45% likely to observe being post cold war.
I would have thought:
I don't think this makes a substantive difference to the rest of your comment, though.
...Under SIA, you start with a ~19:10 ratio in favor of type 2 (in the subjective a priori). The likelihood ratios are the same as with SSA so the posteriors are equally weighted towards type 2. So t
That argument only works for SSA if type 1 and type 2 planets exist in parallel.
I was talking about a model where either every planet in the multiverse is type 1, or every planet in the multiverse is type 2.
But extinction vs non-extinction is sampled separately on each planet.
Then SSA gives you an anthropic shadow.
(If your reference class is "all observers" you still get an update towards type 2, but it's weaker than for SIA. If your reference class is "post-nuclear-weapons observers", then SSA doesn't update at all.)
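To make the comparison concrete, here's a minimal sketch of that model. The extinction probabilities, the observer ratio `k`, and the flat prior are all made-up numbers; only the qualitative comparison between the three update rules matters.

```python
# H1: every planet in the multiverse is type 1 (extinction after nukes is very likely).
# H2: every planet in the multiverse is type 2 (extinction after nukes is less likely).
# Extinction vs non-extinction is sampled independently on each of many planets.
x = {"type 1": 0.99, "type 2": 0.55}  # hypothetical extinction probabilities
k = 3.0           # post-cold-war observers per pre-nuclear observer on a surviving planet
prior_odds = 1.0  # odds of type 2 : type 1 before the anthropic evidence (flat, for illustration)

# SIA: weight each hypothesis by how many observers-in-our-situation it contains,
# so the likelihood ratio is just the ratio of survival probabilities.
sia_lr = (1 - x["type 2"]) / (1 - x["type 1"])

# SSA with reference class "all observers": the likelihood of finding yourself
# post-cold-war is the fraction of all observers who are post-cold-war, which
# dampens the update relative to SIA.
def frac_post(extinction_prob):
    survive = 1 - extinction_prob
    return survive * k / (1 + survive * k)

ssa_all_lr = frac_post(x["type 2"]) / frac_post(x["type 1"])

# SSA with a narrow reference class (e.g. only observers in our post-cold-war
# epistemic situation): the likelihood is 1 under both hypotheses, so there is
# no update at all. That's the anthropic shadow.
ssa_narrow_lr = 1.0

print("posterior odds, type 2 : type 1")
print("SIA:                ", prior_odds * sia_lr)         # ~45
print("SSA, all observers: ", prior_odds * ssa_all_lr)     # ~20
print("SSA, narrow class:  ", prior_odds * ssa_narrow_lr)  # 1.0
```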
How to get anthropic shadow:
At this point, I believe you will get some difference between SSA and SIA. For maximizing the size of the shadow, you can add:
(I didn't read the whole post, sorry if you address this somewhere. Also, I ultimately don't agree with the anthropic shadow argument.)
being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty
Maybe? I don't really know how to reason about this.
If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far fewer humans in the universe-at-large, meaning that you're at a different point ...
(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe su...
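Very roughly, and ignoring constants and the details of the pointer encoding: if a universe $U$ contains $N$ similar observers, singling one of them out costs about $\log_2 N$ extra bits on top of describing $U$, so each observer's UDASSA weight $w_i$ comes out around

$$K(\text{observer } i) \approx K(U) + \log_2 N \quad\Longrightarrow\quad w_i \approx 2^{-K(U)} \cdot \frac{1}{N}, \qquad \sum_{i=1}^{N} w_i \approx 2^{-K(U)}.$$

So the combined weight of a universe's observers stays near $2^{-K(U)}$ however large $N$ is, which is where the "per universe" flavor comes from.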
Ok, gotcha.
Just to clarify: There was no such discount factor that changed the median estimate of "human brain compute". Instead, this discount factor was applied to go from "human brain compute estimate" to "human-brain-compute-informed estimate of the compute-cost of training TAI with current algorithms" — adjusting for how our current algorithms seem to be w...