Lukas Finnveden

Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.

Sequences

Project ideas for making transformative AI go well, other than by working on alignment
Extrapolating GPT-N performance

Comments

Here's the best explanation + study I've seen of Dunning-Kruger-ish graphs: https://www.clearerthinking.org/post/is-the-dunning-kruger-effect-real-or-are-unskilled-people-more-rational-than-it-seems 

Their analysis suggests that their data is pretty well-explained by a combination of a "Closer-To-The-Average Effect" (which may or may not be rational — there are multiple possible rational explanations for it) and a "Better-Than-Average Effect" that appears ~uniformly across the board (though it gets swamped by the "Closer-To-The-Average Effect" at the upper end).
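
To illustrate that decomposition, here's a minimal simulation sketch (my own illustration, not from the linked study — the shrinkage, offset, and noise values are made up): shrinking self-assessments toward the mean and adding a constant "better-than-average" offset is enough to reproduce the classic quartile plot where the bottom quartile looks wildly overconfident.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True skill percentiles, uniform on [0, 100].
actual = rng.uniform(0, 100, n)

# Self-assessment = pull toward the mean ("Closer-To-The-Average") plus a constant
# "Better-Than-Average" offset plus noise. shrink/offset/noise_sd are made-up values.
shrink, offset, noise_sd = 0.3, 15, 10
perceived = 50 + shrink * (actual - 50) + offset + rng.normal(0, noise_sd, n)
perceived = np.clip(perceived, 0, 100)

# Classic Dunning-Kruger-style summary: mean self-assessment by actual-skill quartile.
quartile = np.digitize(actual, [25, 50, 75])
for q in range(4):
    mask = quartile == q
    print(f"Actual quartile {q + 1}: mean actual {actual[mask].mean():5.1f}, "
          f"mean perceived {perceived[mask].mean():5.1f}")
```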

probably research done outside of labs has produced more differential safety progress in total

To be clear — this statement is consistent with companies producing way more safety research than non-companies, so long as companies also produce proportionally even more capabilities progress than non-companies? (Which I would've thought is the case, though I'm not well-informed. Not sure if "total research outside of labs look competitive with research from labs" is meant to deny this possibility, or if you're only talking about safety research there.)

There are at least two different senses in which "control" can "fail" for a powerful system:

  • Control evaluations can indicate that there's no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
  • Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there's a high probability of catastrophe.

My impression is that Ryan & Buck typically talk about the first case. (E.g. in the link above.) I.e.: my guess would be that they're not saying that well-designed control evaluations become untrustworthy — just that they'll stop promising you safety.

But to be clear: In this question, you're asking about something more analogous to the second case, right? (Sabotage/sandbagging evaluations being misleading about models' actual capabilities at sabotage & sandbagging?)

My question posed in other words: Would you count "evaluations clearly say that models can sabotage & sandbag" as success or failure?

More generally, Dario appears to assume that for 5-10 years after powerful AI we'll just have a million AIs which are a bit smarter than the smartest humans and perhaps 100x faster rather than AIs which are radically smarter, faster, and more numerous than humans. I don't see any argument that AI progress will stop at the point of top humans rather than continuing much further.

Well, there's footnote 10:

Another factor is of course that powerful AI itself can potentially be used to create even more powerful AI. My assumption is that this might (in fact, probably will) occur, but that its effect will be smaller than you might imagine, precisely because of the “decreasing marginal returns to intelligence” discussed here. In other words, AI will continue to get smarter quickly, but its effect will eventually be limited by non-intelligence factors, and analyzing those is what matters most to the speed of scientific progress outside AI.

So his view seems to be that even significantly smarter AIs just wouldn't be able to accomplish that much more than what he's discussing here. Such that they're not very relevant.

(I disagree. Maybe there are some hard limits here, but maybe there aren't. For most of the bottlenecks that Dario discusses, I don't know how you become confident that there are 0 ways to speed them up or circumvent them. We're talking about putting in many times more intellectual labor than our whole civilization has spent on any topic to date.)

I wonder if work on AI for epistemics could be great for mitigating the "gradually cede control of the Earth to AGI" threat model. A large majority of economic and political power is held by people who would strongly oppose human extinction, so I expect that "lack of political support for stopping human extinction" would be less of a bottleneck than "consensus that we're heading towards human extinction" and "consensus on what policy proposals will solve the problem". Both of these could be significantly accelerated by AI. Normally, one of my biggest concerns about "AI for epistemics" is that we might not have much time to get good use of the epistemic assistance before the end — but if the idea is that we'll have AGI for many years (as we're gradually heading towards extinction) then there will be plenty of time.

and as he admits in the footnote he didn't include in the LW version, in real life, when adequately incentivized to win rather than find excuses involving 'well, chaos theory shows you can't predict ball bounces more than n bounces out', pinball pros learn how to win and rack up high scores despite 'muh chaos'.

I was confused about this part of your comment because the post directly talks about this in the conclusion.

The strategy typically is to catch the ball with the flippers, then to carefully hit the balls so that it takes a particular ramp which scores a lot of points and then returns the ball to the flippers. Professional pinball players try to avoid the parts of the board where the motion is chaotic.

The "off-site footnote" you're referring to seems to just be saying "The result is a pretty boring game. However, some of these ramps release extra balls after you have used them a few times. My guess is that this is the game designer trying to reintroduce chaos to make the game more interesting again." which is just a minor detail. AFAICT pros could score lots of points even without the extra balls.

(I'm leaving this comment here because I was getting confused about whether there had been major edits to the post, since the relevant content is currently in the conclusion and not the footnote. I was digging through the wayback machine and didn't see any major edits. So trying to save other people from the same confusion.)

Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.

What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from "details of the scenario" -> p(escape attempt), then switching to a scenario with a higher p, and then repeating? (A sketch of what I mean is below.)
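
To illustrate the procedure I have in mind (just a sketch of my interpretation, not anything from the thread — `estimate_escape_probability` would stand in for rerunning the model on the modified scenario and counting escape attempts, and the scenario features below are made up):

```python
import math
import random

def hill_climb_scenarios(base_scenario, estimate_escape_probability, perturb, steps=100):
    """Greedy local search over scenario details for variants with higher p(escape attempt)."""
    current = base_scenario
    current_p = estimate_escape_probability(current)
    for _ in range(steps):
        candidate = perturb(current)
        candidate_p = estimate_escape_probability(candidate)
        if candidate_p > current_p:  # keep the variant if an escape attempt became more likely
            current, current_p = candidate, candidate_p
    return current, current_p

# Toy usage with made-up scenario features and a made-up probability estimator
# (a stand-in for "run the model many times in this scenario and count escape attempts").
def perturb(scenario):
    new = dict(scenario)
    key = random.choice(list(new))
    new[key] += random.choice([-1, 1])
    return new

def fake_estimator(scenario):
    return 1 / (1 + math.exp(-(scenario["oversight_laxity"] - scenario["stated_monitoring"])))

base = {"oversight_laxity": 0, "stated_monitoring": 0}
print(hill_climb_scenarios(base, fake_estimator, perturb, steps=50))
```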

Have you tried using different AI models within Perplexity? Any ideas about which one is best? I don't know whether to expect better results from Sonnet 3.5 (within Perplexity) or one of the models that Perplexity has finetuned itself, like Sonar Huge.

To be clear, uncertainty about the number of iterations isn't enough. You need to have positive probability on arbitrarily high numbers of iterations, and it must never be the case that p(>n rounds) is so much smaller than p(n rounds) that it's worth defecting on round n regardless of the effect on your reputation. These are pretty strong assumptions.

So cooperation crucially depends on your belief that, all the way from 10 rounds to Graham's number of rounds (and beyond), the probability of >n rounds conditional on reaching n rounds is never lower than e.g. 20% (or whatever threshold is implied by the pay-off structure of your game).
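
(For concreteness, here's the textbook threshold for an iterated prisoner's dilemma with grim-trigger strategies and payoffs T > R > P > S — my own illustration of where such a number comes from, not something derived in the thread. With per-round continuation probability p, cooperating forever must beat defecting once and being punished thereafter:)

```latex
% Grim trigger with continuation probability p; payoffs T > R > P > S.
\frac{R}{1-p} \;\ge\; T + \frac{p\,P}{1-p}
\quad\Longleftrightarrow\quad
p \;\ge\; \frac{T-R}{T-P}
% e.g. T=5, R=3, P=1 gives a threshold of p >= 1/2; a "20%" threshold corresponds
% to a payoff structure with (T-R)/(T-P) = 0.2.
```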
