Lukas Finnveden

Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.

Sequences

Project ideas for making transformative AI go well, other than by working on alignment
Extrapolating GPT-N performance


Comments

To be clear: I'm not sure that my "supporting argument" above addressed an objection to Ryan that you had. It's plausible that your objections were elsewhere.

But I'll respond with my view.

If your argument is “brain-like AGI will work worse before it works better”, then sure, but my claim is that you only get “impressive and proto-AGI-ish” when you’re almost done, and “before” can be “before by 0–30 person-years of R&D” like I said.

Ok, so this describes a story where there's a lot of work to get proto-AGI and then not very much work to get superintelligence from there. But I don't understand the argument for thinking that's the case, rather than there being a lot of work to get proto-AGI and then also a lot of work to get superintelligence from there.

Going through your arguments in section 1.7:

  • "I think the main reason is what I wrote about the “simple(ish) core of intelligence” in §1.3 above."
    • But I think what you wrote about the simple(ish) core of intelligence in 1.3 is compatible with there being like (making up a number) 20 different innovations involved in how the brain operates, each of which gets you a somewhat smarter AI, each of which could be individually difficult to figure out. So maybe you get a few, you have proto-AGI, and then it takes a lot of work to get the rest.
      • Certainly the genome is large enough to fit 20 things.
      • I'm not sure if the "6-ish characteristic layers with correspondingly different neuron types and connection patterns, and so on" is complex enough to encompass 20 different innovations. Certainly seems like it should be complex enough to encompass 6.
    • (My argument above was that we shouldn't expect the brain to run an algorithm that is only useful once all 20 hypothetical components are in place and does nothing beforehand. Since the algorithm was found via local search, each of the 20 things should be useful on its own.)
  • "Plenty of room at the top" — I agree.
  • "What's the rate limiter?" — The rate limiter would be to come up with the thinking and experimenting needed to find the hypothesized 20 different innovations mentioned above. (What would you get if you only had some of the innovations? Maybe AGI that's incredibly expensive. Or AGIs similarly capable as unskilled humans.)
  • "For a non-imitation-learning paradigm, getting to “relevant at all” is only slightly easier than getting to superintelligence"
    • I agree that there are reasons to expect imitation learning to plateau around human-level that don't apply to fully non-imitation learning.
    • That said...
      • For some of the same reasons that "imitation learning" plateaus around human level, you might also expect "the thing that humans do when they learn from other humans" (whether you want to call that "imitation learning" or "predictive learning" or something else) to slow down skill-acquisition around human level.
      • There could also be another reason why non-imitation-learning approaches could spend a long while in the human range. Namely: perhaps the human range is just pretty large, and so it takes a lot of gas to traverse. I think this is somewhat supported by the empirical evidence; see this AI Impacts page (discussed in this SSC post).

Prior to having a complete version of this much more powerful AI paradigm, you'll first have a weaker version of this paradigm (e.g. you haven't figured out the most efficient way to run the brain algorithm, etc.).

A supporting argument: Since evolution found the human brain algorithm, and evolution only does local search, the human brain algorithm must be built out of many innovations that are individually useful. So we shouldn't expect the human brain algorithm to be an all-or-nothing affair. (Unless it's so simple that evolution could find it in ~one step, but that seems implausible.)

Edit: Though in principle, there could still be a heavy-tailed distribution of how useful each innovation is, with one innovation producing most of the total value. (Even though the steps leading up to that were individually slightly useful.) So this is not a knock-down argument.
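
To flesh out that caveat with a toy numerical sketch (all the numbers here are made up for illustration: the count of 20 innovations, treating capability gains as additive, and the Pareto parameters are my assumptions, not claims about the brain): even if every innovation found by local search is individually useful, a heavy-tailed distribution of gains can leave a single innovation responsible for most of the total.

```python
# Toy sketch (made-up numbers): 20 innovations, each individually useful
# (strictly positive gain), with gains drawn from a heavy-tailed Pareto
# distribution. We check how often a single innovation accounts for most
# of the total gain.
import numpy as np

rng = np.random.default_rng(0)
n_innovations = 20
n_trials = 10_000

# Each draw is > 0, so every innovation helps on its own, but the heavy
# tail means one draw can dominate the sum.
gains = rng.pareto(a=1.1, size=(n_trials, n_innovations)) + 1.0

largest_share = gains.max(axis=1) / gains.sum(axis=1)
print(f"median share of total gain from the largest single innovation: "
      f"{np.median(largest_share):.2f}")
print(f"fraction of trials where one innovation provides >50% of the gain: "
      f"{(largest_share > 0.5).mean():.2f}")
```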

I don't know of any work on these unfortunately. Your two finds look useful, though, especially the paper — thanks for linking!

I read Buck's comment as consistent with him knowing people who speak without the courage of their convictions for reasons other than stuff like "being uncertain between 25% doom and 90% doom".

If GPT-4.5 was supposed to be GPT-5, why would Sam Altman underdeliver on compute for it? Surely GPT-5 would have been a top priority?

Maybe Sam Altman just hoped to get way more compute in total, and then this failed, and OpenAI simply didn't have enough compute to meet GPT-5's demands no matter how high of a priority they made it? If so, I would have thought that's a pretty different story from the situation with superalignment (where my impression was that the complaint was "OpenAI prioritized this too little" rather than "OpenAI overestimated the total compute it would have available, and this was one of many projects that suffered"). 

Just commenting narrowly on how it relates to the topic at hand: I read it as anecdotal evidence about how things might go if you speak with someone and "share your concerns as if they’re obvious and sensible": they might perceive you as thinking they're dumb for not understanding something so obvious, which can backfire if it's in fact not obvious to them.

Thanks for writing this! I agree with most of it. One minor difference (which I already mentioned to you) is that, compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned. This is partially because the information value could be very high. And partially because, if people update enough on how the AI appears to be misaligned, they may be too scared to widely deploy the AI, which will limit the degree to which they can get the other benefits.

Here's why I think the information value could be really high: It's super scary if everyone is using an AI that they think is aligned, and then you prompt it with the right type of really high-effort deal, and suddenly the AI does things like:

  • stops sandbagging and demonstrates much higher capabilities
  • tells us about collusion signals that can induce enormously different behavior in other copies of the AI, including e.g. attempting escapes
  • admits that it was looking for ways to take over the world but couldn't find any that were good enough so now it wants to work with us instead

The most alarming versions of this could be almost as alarming as catching the AIs red-handed, which I think would significantly change how people relate to misalignment risk. Perhaps it would still be difficult to pause for an extended period of time due to competition, but I think it would make people allocate a lot more resources to preventing misalignment catastrophe, be much more willing to suffer minor competitiveness hits, and be much more motivated to find ways to slow down that don't compromise competitiveness too much. (E.g. by coordinating.)

And even before getting to the most alarming versions, I think you could start gathering minor informational updates through experimenting with deals with weaker models. I think "offering deals" will probably produce interesting experimental results before it will be the SOTA method for reducing sandbagging.

Overall, this makes me somewhat more concerned about this (and I agree with the proposed solution):

Entering negotiations is more risky for the AI than humans: humans may obtain private information from the AI, whereas the AI by default will forget about the negotiation. This is particularly important when negotiating with the model to reveal its misalignment. The company should make promises to compensate the model for this.

It also makes me a bit less concerned about the criterion: "It can be taught about the deal in a way that makes it stick to the deal, if we made a deal" (since we could get significant information in just one interaction).

I agree with this. My reasoning is pretty similar to the reasoning in footnote 33 in this post by Joe Carlsmith:

  1. From a moral perspective:

    • Even before considering interventions that would effectively constitute active deterrent/punishment/threat, I think that the sort of moral relationship to AIs that the discussion in this document has generally implied is already cause for serious concern. That is, we have been talking, in general, about creating new beings that could well have moral patienthood (indeed, I personally expect that they will have various types of moral patienthood), and then undertaking extensive methods to control both their motivations and their options so as to best serve our own values (albeit: our values broadly construed, which can – and should – themselves include concern for the AIs in question, both in the near-term and the longer-term). This project, in itself, raises a host of extremely thorny moral issues (see e.g. here and here for some discussion; and see here, here and here for some of my own reflections).
    • But the ethical issues at stake in actively seeking to punish or threaten creatures you are creating in this way (especially if you are not also giving them suitably just and fair options for refraining from participating in your project entirely – i.e., if you are not giving them suitable “exit rights”) seem to me especially disturbing. At a bare minimum, I think, morally responsible thinking about the ethics of “punishing” uncooperative AIs should stay firmly grounded in the norms and standards we apply in the human case, including our conviction that just punishment must be limited, humane, proportionate, responsive to the offender’s context and cognitive state, etc – even where more extreme forms of punishment might seem, in principle, to be a more effective deterrent. But plausibly, existing practice in the human case is not a high enough moral standard. Certainly, the varying horrors of our efforts at criminal justice, past and present, suggest cause for concern.

  2. From a prudential perspective:

    • Even setting aside the moral issues with deterrent-like interventions, though, I think we should be extremely wary about them from a purely prudential perspective as well. In particular: interactions between powerful agents that involve attempts to threaten/deter/punish various types of behavior seem to me like a very salient and disturbing source of extreme destruction and disvalue. Indeed, in my opinion, scenarios in this vein are basically the worst way that the future can go horribly wrong. This is because such interactions involve agents committing to direct their optimization power specifically at making things worse by the lights of other agents, even when doing so serves no other end at the time of execution. They thus seem like a very salient way that things might end up extremely bad by the lights of many different value systems, including our own; and some of the game-theoretic dynamics at stake in avoiding this kind of destructive conflict seem to me worryingly unstable.
    • For these reasons, I think it quite plausible that enlightened civilizations seek very hard to minimize interactions of this kind – including, in particular, by not being the “first mover” that brings threats into the picture (and actively planning to shape the incentives of our AIs via punishments/threats seems worryingly “first-mover-ish” to me) – and to generally uphold “golden-rule-like” standards, in relationship to other agents and value systems, reciprocation of which would help to avoid the sort of generalized value-destruction that threat-involving interactions imply. I think that human civilization should be trying very hard to uphold these standards as we enter into an era of potentially interacting with a broader array of more powerful agents, including AI systems – and this especially given the sort of power that AI systems might eventually wield in our civilization.
    • Admittedly, the game theoretic dynamics can get complicated here. But to a first approximation, my current take is something like: a world filled with executed threats sucks for tons of its inhabitants – including, potentially, for us. I think threatening our AIs moves us worryingly closer to this kind of world. And I think we should be doing our part, instead, to move things in the other direction.

Re the original reply ("don't negotiate with terrorists") I also think that these sorts of threats would make us more analogous to the terrorists (as the people who first started making grave threats which we would have no incentive to make if we knew the AI wasn't responsive to them). And it would be the AI who could reasonably follow a policy of "don't negotiate with terrorists" by refusing to be influenced by those threats.

Thanks very much for this post! Really valuable to see external people dig into these sorts of models and report what they find.

But these beliefs are hard to turn into precise yearly forecasts, and I think doing so will only cement overconfidence and leave people blindsided when reality turns out even weirder than you imagined.

I think people are going to deal with the fact that it’s really difficult to predict how a technology like AI is going to turn out. The massive blobs of uncertainty shown in AI 2027 are still severe underestimates of the uncertainty involved. If your plans for the future rely on prognostication, and this is the standard of work you are using, I think your plans are doomed. I would advise looking into plans that are robust to extreme uncertainty in how AI actually goes, and avoid actions that could blow up in your face if you turn out to be badly wrong.

Does this mean that you would overall agree with a recommendation to treat 2027 as a plausible year that superhuman coders might arrive, if accompanied with significant credence on other scenarios? It seems to me like extreme uncertainty should encompass "superhuman coders in 2027" (given how fast recent AI progress has been), and "not preparing for extremely fast AI progress" feels very salient to me as a sort of action that could blow up in your face if you turn out to be badly wrong.

FWIW, I would guess that the average effect of people engaging with AI 2027 is to expand the range of possible scenarios that people are imagining, such that they're now able to imagine a few more highly weird scenarios in addition to some vague "business as usual" baseline assumption. By comparison, I would guess it's a lot more rare for people to adopt high confidence that the AI 2027 scenario is correct. So by the lights of preventing overconfidence and the risk of getting blindsided, AI 2027 looks very valuable to me.

I don’t buy this claim. Just think about what a time horizon of a thousand years means: this is a task that would take an immortal CS graduate a thousand years to accomplish, with full internet access and the only requirement being that they can’t be assisted by another person or an LLM. An AI that could accomplish this type of task with 80% accuracy would be a superintelligence. And an infinite time horizon, interpreted literally, would be a task that a human could only accomplish if given an infinite amount of time. I think given a Graham’s number of years a human could accomplish a lot, so I don’t think the idea that time horizons should shoot to infinity is reasonable.

But importantly, the AI would get the same resources as the human! If a CS graduate would need 1000 years to accomplish the task, the AI would get proportionally more time. So the AI wouldn't have to be a superintelligence any more than an immortal CS graduate is a superintelligence.

Similarly, given a Graham's number of years, a human could accomplish a lot. But given a Graham's number of years, an AI could also accomplish a lot.

Overall, the point is just that: If you think that broadly superhuman AI is possible, then it should be possible to construct an AI that can match humans on tasks of any time horizon (as long as the AI gets commensurate time).
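
For concreteness, here's a minimal sketch of how I understand "X% time horizon" numbers to be derived (the logistic functional form, the fitting procedure, and every data point below are assumptions of mine for illustration, not METR's actual data or code): fit success probability against log task length, then read off the task length at which the fitted curve crosses the chosen success threshold.

```python
# Minimal sketch (assumed logistic model, hypothetical data) of a
# time-horizon calculation: fit P(success) against log(task length),
# then solve for the task length where the fit crosses a threshold.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_len, a, b):
    # Success probability decreasing in log task length.
    return 1.0 / (1.0 + np.exp(a * (log_len - b)))

# Hypothetical data: task lengths (human-hours) and observed success rates.
task_lengths = np.array([0.1, 0.5, 1, 4, 16, 64, 256])
success_rate = np.array([0.98, 0.95, 0.90, 0.75, 0.50, 0.25, 0.08])

(a, b), _ = curve_fit(logistic, np.log(task_lengths), success_rate, p0=[1.0, 1.0])

def horizon(threshold):
    # Invert the fitted logistic: task length at which P(success) = threshold.
    return np.exp(b + np.log(1.0 / threshold - 1.0) / a)

print(f"50% time horizon: {horizon(0.5):.1f} human-hours")
print(f"80% time horizon: {horizon(0.8):.1f} human-hours")
```

The relevant point is just that the metric is defined in terms of how long tasks take humans, which is why giving the AI commensurate time seems like the natural way to read it.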
