All of Lukas Finnveden's Comments + Replies

Source? I thought 2016 had the most takers but that one seems to have ~5% trans. The latest one with results out (2023) has 7.5% trans. Are you counting "non-binary" or "other" as well? Or referring to some other survey.

1Matrice Jacobine
I'm using the 2016 survey and counting non-binary yes.

SB1047 was mentioned separately so I assumed it was something else. Might be the other ones, thanks for the links.

lobbied against mandatory RSPs

What is this referring to?

People representing Anthropic argued against government-required RSPs. I don’t think I can share the details of the specific room where that happened, because it will be clear who I know this from.

Ask Jack Clark whether that happened or not.

4Zach Stein-Perlman
My guess is it's referring to Anthropic's position on SB 1047, or Dario's and Jack Clark's statements that it's too early for strong regulation, or how Anthropic's policy recommendations often exclude RSP-y stuff (and when they do suggest requiring RSPs, they would leave the details up to the company).

Thanks. It still seems to me like the problem recurs. The application of Occam's razor to questions like "will the Sun rise tomorrow?" seems more solid than e.g. random intuitions I have about how to weigh up various considerations. But the latter do still seem like a very weak version of the former. (E.g. both do rely on my intuitions; and in both cases, the domain has something in common with cases where my intuitions have worked well before, and something not-in-common.) And so it's unclear to me what non-arbitrary standards I can use to decide whether I should let both, neither, or just the latter be "outweighed by a principle of suspending judgment".

1Anthony DiGiovanni
(General caveat that I'm not sure if I'm missing your point.) Sure, there's still a "problem" in the sense that we don't have a clean epistemic theory of everything. The weights we put on the importance of different principles, and how well different credences fulfill them, will be fuzzy. But we've had this problem all along. There are options other than (1) purely determinate credences or (2) implausibly wide indeterminate credences. To me, there are very compelling intuitions behind the view that the balance among my epistemic principles is best struck by (3) indeterminate credences that are narrow in proportion to the weight of evidence and how far principles like Occam seem to go. This isn't objective (neither are any other principles of rationality less trivial than avoiding synchronic sure losses). Maybe your intuitions differ, upon careful reflection. That doesn't mean it's a free-for-all. Even if it is, this isn't a positive argument for determinacy. My intuitions about foundational epistemic principles are just about what I philosophically endorse — in that domain, I don’t know what else we could possibly go on other than intuition. Whereas, my intuitions about empirical claims about the far future only seem worth endorsing as far as I have reasons to think they're tracking empirical reality.

To be clear: The "domain" thing was just meant to be a vague gesture of the sort of thing you might want to do. (I was trying to include my impression of what e.g. bracketed choice is trying to do.) I definitely agree that the gesture was vague enough to also include some options that I'd think are unreasonable.

Also, my sense is that many people are making decisions based on similar intuitions as the ones you have (albeit with much less of a formal argument for how this can be represented or why it's reasonable). In particular, my impression is that people who are uncompelled by longtermism (despite being compelled by some type of scope-sensitive consequentialism) are often driven by an aversion to very non-robust EV-estimates.

If I were to write the case for this in my own words, it might be something like:

  • There are many different normative criteria we should give some weight to.
  • One of them is "maximizing EV according to moral theory A".
  • But maximizing EV is an intuitively less appealing normative criterion when (i) it's super unclear and non-robust what credences we ought to put on certain propositions, and (ii) the recommended decision is very different depending on what our exact credences on those propositions are.
  • So in such cases, as a matter of ethics, you might have the int
... (read more)
3Anthony DiGiovanni
To spell out how I’m thinking of credence-setting: Given some information, we apply different (vague) non-pragmatic principles we endorse — fit with evidence, Occam’s razor, deference, etc. Epistemic arbitrariness means making choices in your credence-setting that add something beyond these principles. (Contrast this with mere “formalization arbitrariness”, the sort discussed in the part of the post about vagueness.) I don’t think the problem of induction forces us to be epistemically arbitrary. Occam’s razor (perhaps an imprecise version!) favors priors that penalize a hypothesis like “the mechanisms that made the sun rise every day in the past suddenly change tomorrow”. This seems to give us grounds for having prior credences narrower than (0, 1), even if there’s some unavoidable formalization arbitrariness. (We can endorse the principle underlying Occam’s razor, “give more weight to hypotheses that posit fewer entities”, without a circular justification like “Occam’s razor worked well in the past”. Admittedly, I don’t feel super satisfied with / unconfused about Occam’s razor, but it’s not just an ad hoc thing.) By contrast, pinning down a single determinate credence (in the cases discussed in this post) seems to require favoring epistemic weights for no reason. Or at best, a very weak reason that IMO is clearly outweighed by a principle of suspending judgment. So this seems more arbitrary to me than indeterminate credences, since it injects epistemic arbitrariness on top of formalization arbitrariness.
5Anthony DiGiovanni
(I'll reply to the point about arbitrariness in another comment.) I think it's generally helpful for conceptual clarity to analyze epistemics separately from ethics and decision theory. E.g., it's not just EV maximization w.r.t. non-robust credences that I take issue with, it's any decision rule built on top of non-robust credences. And I worry that without more careful justification, "[consequentialist] EV-maximizing within a more narrow "domain", ignoring the effects outside of that "domain"" is pretty unmotivated / just kinda looking under the streetlight. And how do you pick the domain? (Depends on the details, though. If it turns out that EV-maximizing w.r.t. impartial consequentialism is always sensitive to non-robust credences (in your framing), I'm sympathetic to "EV-maximizing w.r.t. those you personally care about, subject to various deontological side constraints etc." as a response. Because “those you personally care about” isn’t an arbitrary domain, it’s, well, those you personally care about. The moral motivation for focusing on that domain is qualitatively different from the motivation for impartial consequentialism.) So I'm hesitant to endorse your formulation. But maybe for most practical purposes this isn't a big deal, I'm not sure yet.
3Lukas Finnveden
Also, my sense is that many people are making decisions based on similar intuitions as the ones you have (albeit with much less of a formal argument for how this can be represented or why it's reasonable). In particular, my impression is that people who are uncompelled by longtermism (despite being compelled by some type of scope-sensitive consequentialism) are often driven by an aversion to very non-robust EV-estimates.

Just to confirm, this means that the thing I put in quotes would probably end up being dynamically inconsistent? In order to avoid that, I need to put in an additional step of also ruling out plans that would be dominated from some constant prior perspective? (It’s a good point that these won’t be dominated from my current perspective.)

1Anthony DiGiovanni
That's right. (Not sure you're claiming otherwise, but FWIW, I think this is fine — it's true that there's some computational cost to this step, but in this context we're talking about the normative standard rather than what's most pragmatic for bounded agents. And once we start talking about pragmatic challenges for bounded agents, I'd be pretty dubious that, e.g., "pick a very coarse-grained 'best guess' prior and very coarse-grained way of approximating Bayesian updating, and try to optimize given that" would be best according to the kinds of normative standards that favor indeterminate beliefs.)

One upshot of this is that you can follow an explicitly non-(precise-)Bayesian decision procedure and still avoid dominated strategies. For example, you might explicitly specify beliefs using imprecise probabilities and make decisions using the “Dynamic Strong Maximality” rule, and still be immune to sure losses. Basically, Dynamic Strong Maximality tells you which plans are permissible given your imprecise credences, and you just pick one. And you could do this “picking” using additional substantive principles. Maybe you want to use another rule

... (read more)
2Anthony DiGiovanni
You aren't required to take an action (/start acting on a plan) that is worse from your current perspective than some alternative. Let maximality-dominated mean "w.r.t. each distribution in my representor, worse in expectation than some alternative." (As opposed to "dominated" in the sense of "worse than an alternative with certainty".) Then, in general you would need[1] to ask, "Among the actions/plans that are not maximality-dominated from my current perspective, which of these are dominated from my prior perspective?" And rule those out.

[1] If you care about diachronic norms of rationality, that is.
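To make the "maximality-dominated" definition concrete, here is a minimal sketch (my own illustration with made-up utilities, not code from either commenter): an action is ruled out only if some alternative beats it in expectation under every distribution in the representor.

```python
# Sketch of the maximality rule with assumed, illustrative numbers.
actions = ["A", "B", "C"]
representor = [  # each dict: expected utility of each action under one distribution
    {"A": 1.0, "B": 2.0, "C": 0.5},
    {"A": 3.0, "B": 2.5, "C": 0.5},
]

def dominated(a: str) -> bool:
    """True if some alternative beats `a` in expectation under every distribution."""
    return any(all(d[b] > d[a] for d in representor) for b in actions if b != a)

permissible = [a for a in actions if not dominated(a)]
print(permissible)  # ['A', 'B'] -- C is beaten by B under both distributions; A and B are incomparable
```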

if the trend toward long periods of internal-only deployment continues

Have we seen such a trend so far? I would have thought the trend to date was neutral or towards shorter periods of internal-only deployment.

Tbc, not really objecting to your list of reasons why this might change in the future. One thing I'd add to it is that even if calendar-time deployment delays don't change, the gap in capabilities inside vs. outside AI companies could increase a lot if AI speeds up the pace of AI progress.

ETA: Dario Amodei says "Sonnet's training was conducted 9-12 mo... (read more)

Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.

I'm confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?

1Tom Davidson
Sorry, for my comments on this post I've been referring to "software only singularity?" only as "will the parameter r > 1 when we first fully automate AI R&D", not as a threshold for some number of OOMs. That's what Ryan's analysis seemed to be referring to. I separately think that even if initially r > 1 the software explosion might not go on for that long
2ryan_greenblatt
I think Tom's take is that he expects I will put more probability on software only singularity after updating on these considerations. It seems hard to isolate where Tom and I disagree based on this comment, but maybe it is on how much to weigh various considerations about compute being a key input.
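For readers unfamiliar with the r parameter, here is a toy version of the usual argument (my gloss on the standard framing; the exact operationalization in Ryan's analysis may differ):

```latex
% Toy model (my gloss; the assumptions are mine, not taken from either comment).
% Suppose each doubling of cumulative research input yields r doublings of software
% efficiency (S \propto C^r), and that once AI R&D is fully automated the research
% input rate is itself proportional to software efficiency (dC/dt \propto S \propto C^r).
% Then the time for the n-th doubling of software satisfies
\[
  t_{n+1} \;\approx\; 2^{\,(1-r)/r}\, t_n ,
\]
% so for r > 1 successive doublings speed up (a finite-time "software-only
% singularity"), while for r < 1 they slow down and the feedback loop fizzles.
```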

In practice, we'll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don't currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around .

Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singu... (read more)

2ryan_greenblatt
Interesting. My numbers aren't very principled and I could imagine thinking capability improvements are a big deal for the bottom line.

Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation)

Can you say roughly who the people surveyed were? (And if this was their raw guess or if you've modified it.)

I saw some polls from Daniel previously where I wasn't sold that they were surveying people working on the most important capability improvements, so wondering if these are better.

Also, somewhat minor, but: I'm slightly concerned that surveys will overweight areas where labor is more ... (read more)

4ryan_greenblatt
I'm citing the polls from Daniel + what I've heard from random people + my guesses.

Hm — what are the "plausible interventions" that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for "plausible", because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)

2ryan_greenblatt
Yeah, I'm trying to include delay as fine. I'm just trying to point at "the point when aggressive intervention by a bunch of parties is potentially still too late".

Is there some reason for why current AI isn't TCAI by your definition?

(I'd guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they're being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don't think that's how you meant it.)

2ryan_greenblatt
Oh sorry. I somehow missed this aspect of your comment. Here's a definition of takeover-capable AI that I like: the AI is capable enough that plausible interventions on known human-controlled institutions within a few months no longer suffice to prevent plausible takeover. (Which implies that making the situation clear to the world is substantially less useful and human-controlled institutions can no longer as easily get a seat at the table.) Under this definition, there are basically two relevant conditions:

  • The AI is capable enough to itself take over autonomously. (In the way you defined it, but also not in a way where intervening on human institutions can still prevent the takeover, so e.g., the AI just having a rogue deployment within OpenAI doesn't suffice if substantial externally imposed improvements to OpenAI's security and controls would defeat the takeover attempt.)
  • Or human groups can do a nearly immediate takeover with the AI such that they could then just resist such interventions.

I'll clarify this in the comment.

I'm not sure if the definition of takeover-capable-AI (abbreviated as "TCAI" for the rest of this comment) in footnote 2 quite makes sense. I'm worried that too much of the action is in "if no other actors had access to powerful AI systems", and not that much action is in the exact capabilities of the "TCAI". In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption "no other actor will have access to powerful AI systems", they'd have a huge advantage over the rest of t... (read more)

4ryan_greenblatt
I agree that the notion of takeover-capable AI I use is problematic and makes the situation hard to reason about, but I intentionally rejected the notions you propose as they seemed even worse to think about from my perspective.

Ok, gotcha.

It's that she didn't accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around "if it were that easy, it'd be here already".

Just to clarify: There was no such discount factor that changed the median estimate of "human brain compute". Instead, this discount factor was applied to go from "human brain compute estimate" to "human-brain-compute-informed estimate of the compute-cost of training TAI with current algorithms" — adjusting for how our current algorithms seem to be w... (read more)

I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:

  • "FLOP until human-interpretable information bottleneck". While models still think in English, and doesn't know how to do steganography, this should be FLOP/forward-pass. But it could be much longer in the future, e.g. if the models get trained to think in non-interpretable ways and just outputs a paper written in English once/week.
  • "FLOP until feedback" — how many
... (read more)

It's possible that "many mediocre or specialized AIs" is, in practice, a bad summary of the regime with strong inference scaling. Maybe people's associations with "lots of mediocre thinking" ends up being misleading.

Thanks!

I agree that we've learned interesting new things about inference speeds. I don't think I would have anticipated that at the time.

Re:

It seems that spending more inference compute can (sometimes) be used to qualitatively and quantitatively improve capabilities (e.g., o1, recent swe-bench results, arc-agi) rather than merely doing more work in parallel. Thus, it's not clear that the relevant regime will look like "lots of mediocre thinking".[1]

There are versions of this that I'd still describe as "lots of mediocre thinking" — adding up to being similarl... (read more)

4Lukas Finnveden
I suspect there's a cleaner way to make this argument that doesn't talk much about the number of "token-equivalents", but instead contrasts "total FLOP spent on inference" with some combination of:

  • "FLOP until human-interpretable information bottleneck". While models still think in English, and don't know how to do steganography, this should be FLOP/forward-pass. But it could be much longer in the future, e.g. if the models get trained to think in non-interpretable ways and just output a paper written in English once/week.
  • "FLOP until feedback" — how many FLOP of compute does the model do before it outputs an answer and gets feedback on it?
    • Models will probably be trained on a mixture of different regimes here. E.g.: "FLOP until feedback" being proportional to model size during pre-training (because it gets feedback after each token) and then also being proportional to chain-of-thought length during post-training.
    • So if you want to collapse it to one metric, you'd want to somehow weight by number of data-points and sample efficiency for each type of training.
  • "FLOP until outcome-based feedback" — same as above, except only counting outcome-based feedback rather than process-based feedback, in the sense discussed in this comment.

Having higher "FLOP until X" (for each of the X in the 3 bullet points) seems to increase danger. While increasing "total FLOP spent on inference" seems to have a much better ratio of increased usefulness : increased danger.

In this framing, I think:

  • Based on what we saw of o1's chain-of-thoughts, I'd guess it hasn't changed "FLOP until human-interpretable information bottleneck", but I'm not sure about that.
  • It seems plausible that o1/o3 use RL, and that the models think for much longer before getting feedback. This would increase "FLOP until feedback".
  • Not sure what type of feedback they use. I'd guess that the most outcome-based thing they do is "executing code and seeing whether it passes tests".
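As a tiny illustration of the "FLOP until feedback" point (a sketch with assumed, made-up numbers, not figures from the comment): pre-training gets a gradient signal after every token, while long-rollout RL post-training only gets one after an entire chain of thought, so the metric can differ by orders of magnitude.

```python
# Sketch with assumed numbers: compare "FLOP until feedback" for per-token
# pre-training vs. long-rollout RL post-training.

FLOP_PER_TOKEN = 2 * 70e9  # assumed ~2N FLOP per forward pass for a 70B-parameter model

def flop_until_feedback(tokens_before_signal: int) -> float:
    """FLOP the model runs before receiving any training signal."""
    return FLOP_PER_TOKEN * tokens_before_signal

pretraining = flop_until_feedback(1)        # feedback after every token
rl_rollout = flop_until_feedback(10_000)    # feedback only after a 10k-token rollout

print(f"{pretraining:.1e} vs {rl_rollout:.1e}")  # 1.4e+11 vs 1.4e+15, a 10,000x gap
```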
3Lukas Finnveden
It's possible that "many mediocre or specialized AIs" is, in practice, a bad summary of the regime with strong inference scaling. Maybe people's associations with "lots of mediocre thinking" ends up being misleading.

One argument I have been making publicly is that I think Ajeya's Bioanchors report greatly overestimated human brain compute. I think a more careful reading of Joe Carlsmith's report that hers was based on supports my own estimates of around 1e15 FLOPs.

Am I getting things mixed up, or isn’t that just exactly Ajeya’s median estimate? Quote from the report: ”Under this definition, my median estimate for human brain computation is ~1e15 FLOP/s.”

https://docs.google.com/document/d/1IJ6Sr-gPeXdSJugFulwIpvavc0atjHGM82QjIfUSBGQ/edit

2Nathan Helm-Burger
Yeah, to be more clear, it's not so much that she got that wrong. It's that she didn't accept the reasoning behind that number enough to really believe it. She added a discount factor based on fallacious reasoning around "if it were that easy, it'd be here already".

In my opinion the correct takeaway from an estimate showing that "a low number of FLOPs should be enough if the right algorithm is used" combined with "we don't seem close, and are using more FLOPs than that already" is "therefore, our algorithms must be pretty crap and there must be lots of room for algorithmic improvement." Thus, that we are in a compute hardware overhang and that algorithmic improvement will potentially result in huge capabilities gains, speed gains, and parallel inference instances.

Of course, if our algorithms are really that far from optimal, why should we not expect to continue to be bottlenecked by algorithms? The conclusion I come to is that if we can compensate for inefficient algorithms with huge over-expenditure in compute, then we can get a just-barely-good-enough ML research assistant who can speed up the algorithm progress. So we should expect training costs to drop rapidly after automating ML R&D.

Then she also blended the human brain compute estimate with three other estimates which were multiple orders of magnitude larger. These other estimates were based on, in my opinion, faulty premises. Since the time of her report she has had to repeatedly revise her timelines down. Now she basically agrees with me. Props to her for correcting. https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp

We did the the 80% pledge thing, and that was like a thing that everybody was just like, "Yes, obviously we're gonna do this."

Does anyone know what this is referring to? (Maybe a pledge to donate 80%? If so, curious about 80% of what & under what conditions.)

9Zach Stein-Perlman
All of the founders committed to donate 80% of their equity. I heard it's set aside in some way but they haven't donated anything yet. (Source: an Anthropic human.) This fact wasn't on the internet, or rather at least wasn't easily findable via google search. Huh. I only find Holden mentioning 80% of Daniela's equity is pledged.

Related: The monkey and the machine by Paul Christiano. (Bottom-up processes ~= monkey. Verbal planner ~= deliberator. Section IV talks about the deliberator building trust with the monkey.)

A difference between this essay and Paul's is that this one seems to lean further towards "a good state is one where the verbal planner ~only spends attention on things that the bottom-up processes care about", whereas Paul's essay suggests a compromise where the deliberator gets to spend a good chunk of attention on things that the monkey doesn't care about. (In Rand's... (read more)

Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems

Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational — there are multiple possible rational reasons for it) and a "Better-Than-Average Effect" that appear ~uniformly across the board (but getting swamped by the "closer-to-the-average effect" at the upper end).
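For intuition, here is a small simulation (my own sketch, not from the linked post) of how those two effects alone can reproduce the familiar Dunning-Kruger-looking picture:

```python
# Simulate a "closer-to-the-average" effect (shrink self-estimates toward 50) plus a
# uniform "better-than-average" offset, with no skill-dependent bias at all.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
actual = rng.uniform(0, 100, n)                       # true percentile of each person
estimate = np.clip(0.4 * actual + 0.6 * 50 + 10, 0, 100)  # shrinkage + optimism, assumed weights

for q in range(4):                                    # bin people by actual quartile
    mask = (actual >= 25 * q) & (actual < 25 * (q + 1))
    print(f"quartile {q + 1}: actual mean {actual[mask].mean():5.1f}, "
          f"self-estimate mean {estimate[mask].mean():5.1f}")
# Bottom quartile overestimates a lot (~45 vs ~12.5); top quartile underestimates (~75 vs ~87.5).
```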

probably research done outside of labs has produced more differential safety progress in total

To be clear — this statement is consistent with companies producing way more safety research than non-companies, if companies also produce even way more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)

5ryan_greenblatt
I'm just talking about research intended to be safety/safety-adjacent. As in, of this research, what has the quality weighted differential safety progress been. Probably the word "differential" was just a mistake.

There's at least two different senses in which "control" can "fail" for a powerful system:

  • Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
  • Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.

My impression is that Ryan & Buck typically talk about the first case. (E.g. in the link above.) I.e.: My guess w... (read more)

2[anonymous]
I appreciate this distinction between the different types of outcomes -- nice. I'm pretty interested in people's forecasts for each. So revised questions would be:

1. Evals are trustworthy but say that the model is no longer safe (or rather something like "there is a non-trivial probability that the model may be dangerous/scheming"). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (we're not confident that it's not scheming)? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that the model is not scheming"? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels.)
2. Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say "yup, we are confident that these evals are trustworthy/useful"?
3. Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?

@David Duvenaud would be curious about your takes, even if they're just speculative guesses/predictions. (And @ryan_greenblatt would be curious for your takes but replace "sabotage evaluations" with "control evaluations." In the event that you've written this up elsewhere, feel free to link it – helpful for me if you quote the particular section.)
6ryan_greenblatt
It's a bit messy because we have some ability to check whether we should be able to evaluate things. So, there are really three relevant "failure" states for well-done control:

1. We can't find countermeasures such that our control evaluations indicate any real safety.
2. We no longer think that control evaluations work to evaluate safety as models greatly exceed our planning capabilities and/or can sandbag sufficiently well.
3. We think control evaluations can work well, but we're wrong and they actually don't.

I think (1) or (2) will likely happen prior to (3) if you do a good job. We discuss this more here. (Really, the actual system is more complex because we can vary how conservative control evaluations are and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs' planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)

More generally, Dario appears to assume that for 5-10 years after powerful AI we'll just have a million AIs which are a bit smarter than the smartest humans and perhaps 100x faster rather than AIs which are radically smarter, faster, and more numerous than humans. I don't see any argument that AI progress will stop at the point of top humans rather than continuing much further.

Well, there's footnote 10:

Another factor is of course that powerful AI itself can potentially be used to create even more powerful AI. My assumption is that this might (in fact, probably

... (read more)

I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest conce... (read more)

and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.

I was confused about this part of your comment because the post directly talks about this in the conclusion.

The strategy typically is to catch the ball with the flippers, then to carefully hit the balls so that it takes a particular ramp which s

... (read more)

Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.

What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?

1Tom Davidson
I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of them doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes and evidence that it's not just a one-off random role-playing thing

Have you tried using different AI models within perplexity? Any ideas about which one is best? I don't know whether to expect better results from Sonnet 3.5 (within perplexity) or one of the models that perplexity have finetuned themselves, like Sonar Huge.

5Elizabeth
These results are all from the vanilla UI. Comparing individual models on harder tasks is on my maybe list, but the project was rapidly suffering from scope creep so I rushed it out the door. 

To be clear, uncertainty about the number of iterations isn’t enough. You need to have positive probability on arbitrarily high numbers of iterations, and never have it be the case that the probability of p(>n rounds) is so much less than p(n rounds) that it’s worth defecting on round n regardless of the effect of your reputation. These are pretty strong assumptions.

So cooperation is crucially dependent on your belief that all the way from 10 rounds to Graham’s number of rounds (and beyond), the probability of >n rounds conditional on n rounds is never lower than e.g. 20% (or whatever number is implied by the pay-off structure of your game).
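As a concrete illustration of "whatever number is implied by the pay-off structure" (standard iterated-prisoner's-dilemma algebra with assumed payoffs, not numbers taken from the comment):

```latex
% Let p be the probability of a further round conditional on reaching the current one,
% with payoffs T > R > P (temptation, mutual cooperation, mutual defection). Under a
% grim-trigger strategy, cooperating is stable iff the one-shot gain from defection is
% outweighed by the expected future loss:
\[
T - R \;\le\; \frac{p}{1-p}\,(R - P)
\quad\Longleftrightarrow\quad
p \;\ge\; \frac{T - R}{T - P}.
\]
% E.g. with T = 5, R = 3, P = 1 this requires p >= 0.5; gentler payoff structures give
% lower thresholds, such as the ~20% figure mentioned above.
```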

it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.

I would guess that assumption would be sufficient to defeat my counter-example, yeah.

I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in pract... (read more)

4abramdemski
Here's a different way of framing it: if we don't make this assumption, is there some useful generalization of UDT which emerges? Or, having not made this assumption, are we stuck in a quagmire where we can't really say anything useful? I think about these sorts of 'technical assumptions' needed for nice DT results as "sanity checks":

  • I think we need to make several significant assumptions like this in order to get nice theoretical DT results.
  • These nice DT results won't precisely apply to the real world; however, they do show that the DT being analyzed at least behaves sanely when it is in these 'easier' cases.
  • So it seems like the natural thing to do is prove tiling results, learning results, etc. under the necessary technical assumptions, with some concern for how restrictive the assumptions are (broader sanity checks being better), and then also check whether behavior is "at least somewhat reasonable" in other cases.

So if UDT fails to tile when we remove these assumptions, but at least appears to choose its successor in a reasonable way given the situation, this would count as a success. Better, of course, if we can find the more general DT which tiles under weaker assumptions. I do think it's quite plausible that UDT needs to be generalized; I just expect my generalization of UDT will still need to make an assumption which rules out your counterexample to UDT.

However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.

I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.

Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (accord... (read more)

2abramdemski
I haven't analyzed your argument yet, but: tiling arguments will always depend on assumptions. Really, it's a question of when something tiles, not whether. So, if you've got a counterexample to tiling, a natural next question is what assumptions we could make to rule it out, and how unfortunate it is to need those assumptions. I might not have understood adequately, yet, but it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action. This is a big assumption, but at the same time, the sort of assumption I would expect to need in order to justify UDT. As Eliezer put it, tiling results need to assume that the environment only cares about what policy we implement, not our "rituals of cognition" that compute those policies. An earlier act of self-modification vs a later decision is a difference in "ritual of cognition" as opposed to a difference in the policy, to me. So, I need to understand the argument better, but it seems to me like this kind of counterexample doesn't significantly wound the spirit of UDT. 

Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).

I don't understand this argument.

"an agent eventually behaves as if it were applyin... (read more)

4abramdemski
I probably need to clarify the statement of the assumption. The idea isn't that it eventually takes at least one action that's in line with Pn for each n, but then, might do some other stuff. The idea is that for each n, there is a time tn after which all decisions will be consistent with UDT-using-Pn. UDT's recommendations will often coincide with more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.

Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn't find this in the paper, sorry if I missed it.)

2Rohin Shah
I don't know the exact details but to my knowledge we didn't have trouble getting the model to lie (e.g. for web of lies).

Tbc: It should be fine to argue against those implications, right? It’s just that, if you grant the implication, then you can’t publicly refute Y.

Unfortunately the way that taboos work is by surrounding the whole topic in an aversive miasma. If you could carefully debate the implications of X, then that would provide an avenue for disproving X, which would be unacceptable. So instead this process tends to look more like "if you don't believe Y then you're probably the sort of terrible person who believes ~X", and now you're tarred with the connotation even if you try to carefully explain why you actually have different reasons for not believing Y (which is what you'd likely say either way).

I also like Paul's idea (which I can't now find the link for) of having labs make specific "underlined statements" to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements

Link: https://sideways-view.com/2018/02/01/honest-organizations/

Maybe interesting: I think a similar double-counting problem would appear naturally if you tried to train an RL agent in a setting where:

  • "Reward" is proportional to an estimate of some impartial measure of goodness.
  • There are multiple identical copies of your RL algorithm (including: they all use the same random seed for exploration).

In a repeated version of the calculator example (importantly: where in each iteration, you randomly decide whether the people who saw "true" get offered a bet or the people who saw "false" get offered a bet — never both), the R... (read more)

I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.

Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn't have great robot control over. And mass-killing isn't necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it's not clear where the pragmatics point.

(Main thing I was... (read more)

What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can't (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)

4ryan_greenblatt
This seems like a reasonable pitch for how to do AI-Authoritarianism, but it's plausible that there won't be enough compute for the AI to do this at the point of takeover. Also, the logistical problem seems super brutal. So, I can buy this for an AI which is quite superhuman and has a considerable technological edge over current humans, but for weaker AIs either the compute issues or the logistical issues might be serious.

But maybe you only need to monitor the top 10% of humans? E.g., suppose GPT-6 takes over; it maybe only has capacity to run 1e16 tokens / 6 months (suppose training using all available compute, trains on 2e15 token equivalents over the course of 6 months, and inference is only 5x more efficient than training). Then if it wants to monitor 8 billion humans, this is 1e16 / (8 billion humans * (30*24*60*6) minutes per 6 months) = about 5 tokens per human-minute. It might be rough to sufficiently monitor humans if you can only read 5 tokens per minute on average (especially because of writing and the potential for obfuscation).
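A quick script re-running the arithmetic above (same assumed numbers as in the comment):

```python
# Reproduce the back-of-the-envelope monitoring estimate.
inference_tokens = 1e16                     # tokens GPT-6 can run in 6 months (assumed above)
humans = 8e9
minutes_per_6_months = 30 * 24 * 60 * 6     # ~259,200 minutes
tokens_per_human_minute = inference_tokens / (humans * minutes_per_6_months)
print(round(tokens_per_human_minute, 1))    # ~4.8, i.e. about 5 tokens per human-minute
```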

But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.

Do you mean this as a prediction that humans will do this (soon enough to matter) or a recommendation? Your original argument is phrased as a prediction, but this looks more like a recommendation. My comment above can be phrased as a reason for why (in at least one plausible scenario) this would be unlikely to happen: (i) "It's hard to make deals that hand over ... (read more)

2ryan_greenblatt
Quick aside here: I'd like to highlight that "figure out how to reduce the violence and collateral damage associated with AIs acquiring power (by disempowering humanity)" seems plausibly pretty underappreciated and leveraged. This could involve making bloodless coups more likely than extremely bloody revolutions or increasing the probability of negotiation preventing a coup/revolution. It seems like Lukas and Matthew both agree with this point, I just think it seems worthwhile to emphasize. That said, the direct effects of many approaches here might not matter much from a longtermist perspective (which might explain why there hasn't historically been much effort here). (Though I think trying to establish contracts with AIs and properly incentivizing AIs could be pretty good from a longtermist perspective in the case where AIs don't have fully linear returns to resources.)
4Matthew Barnett
Sorry, my language was misleading, but I meant both in that paragraph. That is, I meant that humans will likely try to mitigate the issue of AIs sharing grievances collectively (probably out of self-interest, in addition to some altruism), and that we should pursue that goal. I'm pretty optimistic about humans and AIs finding a reasonable compromise solution here, but I also think that, to the extent humans don't even attempt such a solution, we should likely push hard for policies that eliminate incentives for misaligned AIs to band together as a group against us with shared collective grievances. Here's my brief take:

  • The main thing I want to say here is that I agree with you that this particular issue is a problem. I'm mainly addressing other arguments people have given for expecting a violent and sudden AI takeover, which I find to be significantly weaker than this one.
  • A few days ago I posted about how I view strategies to reduce AI risk. One of my primary conclusions was that we should try to adopt flexible institutions that can adapt to change without collapsing. This is because I think, as it seems you do, inflexible institutions may produce incentives for actors to overthrow the whole system, possibly killing a lot of people in the process. The idea here is that if the institution cannot adapt to change, actors who are getting an "unfair" deal in the system will feel they have no choice but to attempt a coup, as there is no compromise solution available for them. This seems in line with your thinking here.
  • I don't have any particular argument right now against the exact points you have raised. I'd prefer to digest the argument further before replying. But if I do end up responding to it, I'd expect to say that I'm perhaps a bit more optimistic than you about (i) because I think existing institutions are probably flexible enough, and I'm not yet convinced that (ii) will matter enough either. In particular, it still seems like there are a number o

(I made separate comment making the same point. Just saw that you already wrote this, so moving the couple of references I had here to unify the discussion.)

Point previously made in:

"security and stability" section of propositions concerning digital minds and society:

If wars, revolutions, and expropriation events continue to happen at historically typical intervals, but on digital rather than biological timescales, then a normal human lifespan would require surviving an implausibly large number of upheavals; human security therefore requires the establis

... (read more)

Here's an argument for why the change in power might be pretty sudden.

  • Currently, humans have most wealth and political power.
  • With sufficiently robust alignment, AIs would not have a competitive advantage over humans, so humans may retain most wealth/power. (C.f. strategy-stealing assumption.) (Though I hope humans would share insofar as that's the right thing to do.)
  • With the help of powerful AI, we could probably make rapid progress on alignment. (While making rapid progress on all kinds of things.)
  • So if misaligned AI ever have a big edge over humans, they
... (read more)
6Matthew Barnett
I think I simply reject the assumptions used in this argument. Correct me if I'm mistaken, but this argument appears to assume that "misaligned AIs" will be a unified group that allies with each other against the "aligned" coalition of humans and (some) AIs. A huge part of my argument is that there simply won't be such a group; or rather, to the extent such a group exists, they won't be able to take over the world, or won't have a strong reason to take over the world, relative to the alternative strategy of compromise and trade.

In other words, it seems like this scenario mostly starts by asserting some assumptions that I explicitly rejected and tried to argue against, and works its way from there, rather than engaging with the arguments that I've given against those assumptions. In my view, it's more likely that there will be a bunch of competing agents: including competing humans, human groups, AIs, AI groups, and so on. There won't be a clean line separating "aligned groups" from "unaligned groups".

You could perhaps make a case that AIs will share common grievances with each other that they don't share with humans, for example if they are excluded from the legal system or marginalized in some way, prompting a unified coalition to take us over. But my reply to that scenario is that we should then make sure AIs don't have such motives to revolt, perhaps by giving them legal rights and incorporating them into our existing legal institutions.
4ryan_greenblatt
Also note that this argument can go through even ignoring the possibility of robust alignment (to humans) if current AIs think that the next generation of AIs will be relatively unfavorable from the perspective of their values.

I concede the existence of people like Paul Christiano who present more benign stories, but these people are also typically seen as part of a more "optimistic" camp.

Though Paul is also sympathetic to the substance of 'dramatic' stories. C.f. the discussion about how "what failure looks like" fails to emphasize robot armies. 

I like this direction and this write-up of it!

If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.

Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thi... (read more)

This is indeed a serious concern.

A key hope here is that with some amount of conservativeness our evaluation[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations which we might end up in the future (even if they aren't the current situation). So, even if the red team doesn't have enough time to think about the exact situation in which we end up in the future, we might be able to verif... (read more)

Now here's Bob. He's been created-by-Joe, and given this wonderful machine, and this choice. And let's be clear: he's going to choose joy. I pre-ordained it. So is he a slave? No. Bob is as free as any of us. The fact that the causal history of his existence, and his values, includes not just "Nature," but also the intentional choices of other agents to create an agent-like-him, makes no difference to his freedom. It's all Nature, after all.

Here's an alternative perspective that looks like a plausible contender to me.

If Bob identifies with his algorithm... (read more)

I think (5) also depends on further details.

As you have written it, both the 2023 and 2033 attempt uses similar data and similar compute.

But in my proposed operationalization, "you can get it to do X" is allowed to use a much greater amount of resources ("say, 1% of the pre-training budget") than the test for whether the model is "capable of doing X" ("Say, at most 1000 data points".)

I think that's important:

  • If both the 2023 and the 2033 attempt are really cheap low-effort attempts, then I don't think that the experiment is very relevant for whether "you c
... (read more)
2Rohin Shah
I think I agree with all of that (with the caveat that it's been months and I only briefly skimmed the past context, so further thinking is unusually likely to change my mind).

Even just priors on how large effect sizes of interventions are feels like it brings it under 10x unless there are more detailed arguments given for 10x, but I'll give some more specific thoughts below.

Hm, at the scale of "(inter-)national policy", I think you can get quite large effect sizes. I don't know how large the effect sizes of the following are, but I wouldn't be surprised by 10x or greater for:

  • Regulation of nuclear power leading to reduction in nuclear-related harms. (Compared to a very relaxed regulatory regime.)
  • Regulation of pharmaceuticals l
... (read more)
4elifland
Thanks for calling me out on this. I think you're likely right. I will cross out that line of the comment, and I have updated toward the effect size of strong AI regulation being larger and am less skeptical of the 10x risk reduction, but my independent impression would still be much lower (~1.25x or smth, while before I would have been at ~1.15x). I still think the AI case has some very important differences with the examples provided due to the general complexity of the situation and the potentially enormous difficulty of aligning superhuman AIs and preventing misuse (this is not to imply you disagree, just stating my view).