tlevin

(Posting in a personal capacity unless stated otherwise.) I help allocate Open Phil's resources to improve the governance of AI with a focus on avoiding catastrophic outcomes. Formerly co-founder of the Cambridge Boston Alignment Initiative, which supports AI alignment/safety research and outreach programs at Harvard, MIT, and beyond, co-president of Harvard EA, Director of Governance Programs at the Harvard AI Safety Team and MIT AI Alignment, and occasional AI governance researcher.

Not to be confused with the user formerly known as trevor1.

Posts

4tlevin's Shortform

9mo

61A case for donating to AI risk reduction (including if you work in AI)

2mo

55How the AI safety technical landscape has changed in the last year, according to some practitioners

6mo

78EU policymakers reach an agreement on the AI Act

11Notes on nukes, IR, and AI from "Arsenals of Folly" (and other books)

28Apply to HAIST/MAIA’s AI Governance Workshop in DC (Feb 17-20)

60Update on Harvard AI Safety Team and MIT AI Alignment

Comments

A case for donating to AI risk reduction (including if you work in AI)

tlevin2mo30

Agreed, I think people should apply a pretty strong penalty when evaluating a potential donation that has or worsens these dynamics. There are some donation opportunities that still have the "major donors won't [fully] fund it" and "I'm advantaged to evaluate it as an AIS professional" without the "I'm personal friends with the recipient" weirdness, though -- e.g. alignment approaches or policy research/advocacy directions you find promising that Open Phil isn't currently funding that would be executed thousands of miles away.

Akash's Shortform

tlevin2mo52

Depends on the direction/magnitude of the shift!

I'm currently feeling very uncertain about the relative costs and benefits of centralization in general. I used to be more into the idea of a national project that centralized domestic projects and thus reduced domestic racing dynamics (and arguably better aligned incentives), but now I'm nervous about the secrecy that would likely entail, and think it's less clear that a non-centralized situation inevitably leads to a decisive strategic advantage for the leading project. Which is to say, even under pretty optimistic assumptions about how much such a project invests in alignment, security, and benefit-sharing, I'm pretty uncertain that this would be good, and with more realistic assumptions I probably lean towards it being bad. But it super depends on the governance, the wider context, how a "Manhattan Project" would affect domestic companies and China's policymaking, etc.

(I think a great start would be not naming it after the Manhattan Project, though. It seems path dependent, and that's not a great first step.)

tlevin's Shortform

tlevin5mo32

It's not super clear whether from a racing perspective having an equal number of nukes is bad. I think it's genuinely messy (and depends quite sensitively on how much actors are scared of losing vs. happy about winning vs. scared of racing).

Importantly though, once you have several thousand nukes the strategic returns to more nukes drop pretty close to zero, regardless of how many your opponents have, while if you get the scary model's weights and then don't use them to push capabilities even more, your opponent maybe gets a huge strategic advantage over you. I think this is probably true, but the important thing is whether the actors think it might be true.

In general, I think it's very hard to predict whether people will overestimate or underestimate things. I agree that literally right now countries are probably underestimating it, but an overreaction in the future also wouldn't surprise me very much (in the same way that COVID started with an underreaction, and then was followed by a massive overreaction).

Yeah, good point.

tlevin's Shortform

tlevin5mo30

Yeah doing it again it works fine, but it was just creating a long list of empty bullet points (I also have this issue in GDocs sometimes)

tlevin's Shortform

tlevin5mo32

Gotcha. A few disanalogies though -- the first two specifically relate to the model theft/shared access point, the third is true even if you had verifiable API access:

Me verifying how many nukes you have doesn't mean I suddenly have that many nukes, whereas getting access to your model does give me its capabilities (though, due to compute differences, it does not mean we suddenly have the same time distance to superintelligence).
Me having more nukes only weakly enables me to develop more nukes faster, unlike AI that can automate a lot of AI R&D.
This model seems to assume you have an imprecise but accurate estimate of how many nukes I have, but companies will probably be underestimating the proximity of each other to superintelligence, for the same reason that they're underestimating their own proximity to superintelligence, until it's way more salient/obvious.

Monthly Roundup #21: August 2024

tlevin5mo61

In general, we should be wary of this sort of ‘make things worse in order to make things better.’ You are making all conversations of all sizes worse in order to override people’s decisions.

Glad to be included in the roundup, but two issues here.

First, it's not about overriding people's decisions; it's a collective action problem. When the room is silent and there's a single group of 8, I don't actually face a choice of a 2- or 3-person conversation; it doesn't exist! The music lowers the costs for people to split into smaller conversations, so the people who prefer those now have better choices, not worse.

Second, this is a Simpson's Paradox-related fallacy: you are indeed making all conversations more difficult, but in my model, smaller conversations are much better, so by making conversations of all sizes slightly to severely worse but moving the population to smaller conversations, you're still improving the conversations on net.
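
A toy illustration of this point, with made-up numbers: the quality scores and headcounts below are purely hypothetical assumptions, chosen only to show that every conversation size can get worse while the average across people still goes up, because people shift toward the better size.

```python
# Hypothetical per-person conversation quality (higher is better).
# These numbers are illustrative assumptions, not measurements.
quality = {
    "quiet": {"small": 8, "large": 4},
    "music": {"small": 7, "large": 2},  # music makes every size worse
}

# Hypothetical number of people in each conversation size (100 total).
people = {
    "quiet": {"small": 20, "large": 80},  # silence keeps most people in one big group
    "music": {"small": 80, "large": 20},  # music makes splitting off easier
}


def mean_quality(condition):
    """Average quality per person under the given condition."""
    total = sum(
        quality[condition][size] * people[condition][size]
        for size in quality[condition]
    )
    return total / sum(people[condition].values())


# Every conversation size is worse with music...
every_size_worse = all(
    quality["music"][size] < quality["quiet"][size] for size in ("small", "large")
)

# ...but the per-person average is higher, because the mix of sizes changed.
print(every_size_worse, mean_quality("quiet"), mean_quality("music"))
```

Whether the real effect goes this way depends entirely on how much smaller conversations beat larger ones and how many people actually move, which is the empirical question in dispute.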

tlevin's Shortform

tlevin5mo10

Also - I'm not sure I'm getting the thing where verifying that your competitor has a potentially pivotal model reduces racing?

tlevin's Shortform

tlevin5mo30

The "how do we know if this is the most powerful model" issue is one reason I'm excited by OpenMined, who I think are working on this among other features of external access tools

tlevin's Shortform

tlevin5mo30

If probability of misalignment is low, probability of human+AI coups (including e.g. countries invading each other) is high, and/or there aren't huge offense-dominant advantages to being somewhat ahead, you probably want more AGI projects, not fewer. And if you need a ton of compute to go from an AI that can do 99% of AI R&D tasks to an AI that can cause global catastrophe, then model theft is less of a factor. But the thing I'm worried about re: model theft is a scenario like this, which doesn't seem that crazy:

Company/country X has an AI agent that can ~~do 99%~~ [edit: let's say "automate 90%"] of AI R&D tasks, call it Agent-GPT-7, and enough of a compute stock to have that train a significantly better Agent-GPT-8 in 4 months at full speed ahead, which can then train a basically superintelligent Agent-GPT-9 in another 4 months at full speed ahead. (Company/country X doesn't know the exact numbers, but their 80% CI is something like 2-8 months for each step; company/country Y has less info, so their 80% CI is more like 1-16 months for each step.)
The weights for Agent-GPT-7 are available (legally or illegally) to company/country Y, which is known to company/country X.
Y has, say, a fifth of the compute. So each of those steps will take 20 months. Symmetrically, company/country Y thinks it'll take 10-40 months and company/country X thinks it's 5-80.
Once superintelligence is in sight like this, both company/country X and Y become very scared of the other getting it first -- in the country case, they are worried it will undermine nuclear deterrence, upend their political system, basically lead to getting taken over by the other. The relevant decisionmakers think this outcome is better than extinction, but maybe not by that much, whereas getting superintelligence before the other side is way better. In the company case, it's a lot less intense, but they still would much rather get superintelligence themselves than have their arch-rival CEO get it.
So, X thinks they have anywhere from 5-80 months before Y has superintelligence, and Y thinks they have 1-16 months. So X and Y both think it's easily possible, well within their 80% CI, that Y beats X.
X and Y have no reliable means of verifying a commitment like "we will spend half our compute on safety testing and alignment research."
If these weights were not available, Y would have a similarly good system in 18 months, 80% CI 12-24.

So, had the weights not been available to Y, X would be confident that it had 12 + 5 months to manage a capabilities explosion that would have happened in 8 months at full speed; it can spend >half of its compute on alignment/safety/etc, and it has 17 rather than 5 months of serial time to negotiate with Y, possibly develop some verification methods and credible mechanisms for benefit/power-sharing, etc. If various transparency reforms have been implemented, such that the world is notified in ~real-time that this is happening, there would be enormous pressure to do so; I hope and think it will seem super illegitimate to pursue this kind of power without these kinds of commitments. I am much more worried about X not doing this and instead just trying to grab enormous amounts of power if they're doing it all in secret.
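
The arithmetic in this scenario can be checked with a short sketch. The figures are the ones stated above; the assumption that step time scales inversely with compute (a fifth of the compute means five times as long) is a simplification for illustration, and the variable names are hypothetical.

```python
# Figures taken from the scenario above. Inverse-linear scaling of step time
# with compute is a simplifying assumption used only for illustration.
X_STEP = 4            # months per model generation for X at full speed
X_CI_OWN = (2, 8)     # better-informed 80% CI per step, at X's compute
Y_CI_ON_X = (1, 16)   # less-informed 80% CI per step, at X's compute
Y_SLOWDOWN = 5        # Y has a fifth of the compute, so each step is 5x slower


def scale(interval, factor):
    """Scale both ends of a (low, high) interval in months."""
    low, high = interval
    return (low * factor, high * factor)


y_step = X_STEP * Y_SLOWDOWN              # 20 months per step for Y
y_ci_own = scale(X_CI_OWN, Y_SLOWDOWN)    # (10, 40): Y's view of its own step
x_ci_on_y = scale(Y_CI_ON_X, Y_SLOWDOWN)  # (5, 80): X's view of Y's step

# X's "confident" lead is the low end of its interval for Y.
lead_with_weights = x_ci_on_y[0]          # 5 months

# Without the weights, Y first needs 12-24 months to match Agent-GPT-7.
Y_CATCHUP_CI = (12, 24)
lead_without_weights = Y_CATCHUP_CI[0] + x_ci_on_y[0]  # 12 + 5 = 17 months

# Two generations at full speed take X 8 months.
x_full_speed = 2 * X_STEP

# Largest share of compute X could divert to safety and still finish
# within its confident lead, under the same inverse-linear assumption.
max_safety_share = 1 - x_full_speed / lead_without_weights

print(y_step, y_ci_own, x_ci_on_y, lead_without_weights, round(max_safety_share, 2))
```

Under these assumptions the safety share comes out a little above one half, which matches the ">half of its compute" claim. With the weights available, the 5-month confident lead is shorter than the 8 months X needs even at full speed.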

[Also: I just accidentally went back a page by command-open bracket in an attempt to get my text out of bullet format and briefly thought I lost this comment; thank you in your LW dev capacity for autosave draft text, but also it is weirdly hard to get out of bullets]

tlevin's Shortform

tlevin5mo16-9

[reposting from Twitter, lightly edited/reformatted] Sometimes I think the whole policy framework for reducing catastrophic risks from AI boils down to two core requirements -- transparency and security -- for models capable of dramatically accelerating R&D.

If you have a model that could lead to general capabilities much stronger than human-level within, say, 12 months, by significantly improving subsequent training runs, the public and scientific community have a right to know this exists and to see at least a redacted safety case; and external researchers need to have some degree of red-teaming access. Probably various other forms of transparency would be useful too. It feels like this is a category of ask that should unite the "safety," "ethics," and "accelerationist" communities?

And the flip side is that it's very important that a model capable of kicking off that kind of dash to superhuman capabilities not get stolen/exfiltrated, such that you don't wind up with multiple actors facing enormous competitive tradeoffs to rush through this process.

These have some tradeoffs, especially as you approach AGI -- e.g. if you develop a system that can do 99% of foundation model training tasks and your security is terrible you do have some good reasons not to immediately announce it -- but not if we make progress on either of these before then, IMO. What the Pareto frontier of transparency and security looks like, and where we should land on that curve, seems like a very important research agenda.

If you're interested in moving the ball forward on either of these, my colleagues and I would love to see your proposal and might fund you to work on it!