Buck's Shortform

Buck


One thing I notice when reading 20th century history is that people in the 1900s-1970s had much higher priors than modern people do that the future might be radically different, in either great or terrible ways. For example:

- They talked about how WW1 was the war to end all wars. They seriously talked about the prospect of banning war after WW1. Such things now sound hopelessly naive.
- Serious people talked very seriously about the possibility of transformative technological change and social change following from it (e.g. Keynes/Russell speculating that people would work way fewer hours in the future).
- As a minor example, between 1905-1915 Churchill spent a bunch of time trying to persuade the British government that on current trends, oil-powered ships would soon be way better than coal-powered ships, and the navy should be converted to oil power. I know of ~no recent examples where a major politician's main schtick was being thoughtful about the future of technology and making policy based on it. More generally, it was obvious after WW1 that states needed to be doing futurism and technological development in order to understand the military implications of modern technology.

I really ...

[-]TsviBT3mo332

As a datapoint, the more I learn about bio, especially recent-ish stuff (past 1-5 decades), I'm more like "the whole "The Great Stagnation" thing was basically bullshit":

- DNA sequencing in any form has only existed for about half a century.
- Before the 21st century, we hadn't sequenced 1 human genome.
- Only in the past 5ish years do we have millions of whole genomes (or 10ish years if you count SNP arrays; see https://berkeleygenomics.org/articles/How_many_human_genomes_have_been_sequenced_.html), and the resulting polygenic scores (now including thousands of alleles for dozens of traits).
- Epigenomic sequencing (RNA sequencing, methylation sequencing, chromatin accessibility sequencing, spatial sequencing) is a decade old.
- Embryonic stem cells? Isolated <50 years ago.
- Turning non-stem cells into stem cells? 21st century.
- Serious de novo DNA synthesis (more than a few base pairs)? <50 years old.
- Megabase synthetic chromosome (stitched together): 2010ish (https://www.csmonitor.com/Science/2010/0521/J.-Craig-Venter-Institute-creates-first-synthetic-life-form).
- Mouse gametogenesis? Past decade-ish.
- CRISPR-Cas9 gene editing? Past 2 decades.
- CRISPR epigenetic editing? Past decade.

...

[-]Eli Tyre3mo3416

None of these advancements have direct impacts on most people's day-to-day lives.

In contrast, the difference between "I've heard of cars, but they're playthings for the rich" and "my family owns a car" is transformative for individuals and societies.

[-]no name3mo100

At least in the 21st century, new internal combustion engine technologies exhibit high reproducibility and low verification costs. There are no large numbers of internal combustion engine specialists employing various means to generate false or selectively filtered test reports for personal gain. Consequently, no engine configuration used in automotive development has been found fundamentally impossible.

Automobiles are not regulated by a group of accident experts with questionable ties to automotive giants and overly strict automotive ethicists. Consequently, a vehicle cannot be banned for violating some aspect of so-called automotive ethics. New cars also do not require decades of randomized controlled trials involving thousands of participants to gain market approval—costs that smaller automotive companies could never afford.

Driving a car is not regarded as a qualification requiring years of costly university education, but rather as a right enjoyed by all who undergo basic training. The thousands who die annually in car accidents are not perceived as a catastrophic failure of automobiles, compelling society to pressure for their elimination.

Society does not view automobile...

[-]TsviBT3mo101

Really? Maybe, I'm not sure. Did you check? If you add up vaccines developed in the last 50 years, times the number of illness / damage they've prevented, what do you get? What about other medical treatments? What about food production downstream of GMOs? Etc.

[-]Mateusz Bagiński3mo225

Speculatively introducing a hypothesis: It's easier to notice a difference like

N years ago, we didn't have X. Now that we have X, our life has been completely restructured. (X ∈ {car, PC, etc.})

than

N years ago, people sometimes died of some disease that is very rare / easily preventable now, but mostly everyone lived their lives mostly the same way.

I.e., introducing some X that causes ripples restructuring a big aspect of human life, vs introducing some X that removes an undesirable thing.

2DanielFilan3mo

I wonder if it's a thing where it's taking a while for those things to hit. Like, mRNA vaccines are only a couple of years old, GLP-1RAs are in a gradual process of being rolled out, etc. If I think of the category of "awesome newish bio stuff I'd like to use", it seems like most of it becomes widely available to consumers in the near future or last 5 years, with the exception of statins.

2TsviBT3mo

Right, I think bio stuff is slower and involves somewhat deeper science. (Compared to car manufacturing, in some sense, maybe. Though it's hard to say / unclear what the question is, because you have deep stuff with chip manufacturing, and special alloys, and so on; but I think bio has much more prerequisite scientific richness to its big advances). Because it's slower and deeper, it's more opaque (i.e. harder to do credit assignment).

6DanielFilan3mo

I think it's not just that it's slower/deeper: my personal sense (which might be just a thing of not requiring much medical care between the ages of 5 and 30) is that the pace at which awesome new stuff is happening in medicines I can buy got much faster in the last few years. If my perception is right, it seems like that requires some explanation of "bio is slower/deeper and also 40 years ago there was a massive breakthru that took 40 years to percolate", and not just "bio is slower/deeper".

4TsviBT3mo

Well, to really evaluate this I'd want to see some sort of thorough-ish investigation, that tries to think of most of the main ways that bio would have been impacting people's lives, and checking the timescales for the prerequisite research. It's not something I'd update very much about, based on anecdata, because it's too big of a question.

4tgb3mo

Drug approvals have gone up in recent years: https://pmc.ncbi.nlm.nih.gov/articles/PMC10856271/ (figure 1). Of course most of those are not ones that you'll encounter in day-to-day life. Meanwhile, some of the most commonly used over-the-counter drugs from previous decades have been pulled from the market or made harder to get (cold medicine particularly: phenylpropanolamine due to rare side effects in 2000, oral phenylephrine due to lack of effect last year, and pseudoephedrine restricted to behind the counter due to use in meth a decade ago or so).

2Garrett Baker3mo

My guess is that the big difference in the speed of biotech compared to early-20th-century-advancements is the relative conservatism of the medical field, and the money & time-consuming certifications you need to get before releasing anything to market. This, in my view, is much less a function of the science, and much more a function of the sociology around the science.

2TsviBT3mo

I imagine that's one relevant thing going on, but also I think the actual science has a lot more depth. The progress I listed doesn't seem like it's going slower due to medical regulation.

3Garrett Baker3mo

I mean the basic research aspect sure (except for stem cells), but applications of each of the progress areas you listed basically involve either clinical applications or selling GMOs. Both of which have very bad regulatory bottlenecks, especially from a world-wide perspective. There has been, as you mention, enormous progress in bio-tech and our broader understanding of biology in the past 50 years, but comparatively little application of that knowledge. This is not what you would expect if the science is "deep" but applications easy. How exactly does the progress you listed support this conclusion?

2TsviBT3mo

Yeah, as I mentioned in my earlier comment bio stuff is: So yeah I agree applications are also difficult. One thing I'm trying to say is "the progress of bio feels slower in significant part because the science itself is difficult, and is actually slower in a sense, but this is a confusing way to view it because there has also been a large amount of scientific progress; so it's slower in some sense of being less progress per time relative to the total difficulty of the field, i.e. we're still mostly confused and mostly powerless in the domain of bio; but the absolute quantity of knowledge and power we've gained is large; but people don't appreciate that; partly that's because the applications are separately harder and slower, and maybe partly that's because it's harder / less legible to attribute the applications to the font of deep progress".

8Buck3mo

I agree there's been a lot of scientific progress, and real GDP per capita, which is maybe the most canonical single metric, continues to rise steadily. But yeah, I think that this feels underwhelming to people compared to earlier qualitative changes. I think this is some combination of them noting that tech advances affect their lives less, and the tech advances feeling more opaque.

4DirectedEvolution3mo

Very, very few people care at all about scientific accomplishments unless they directly affect somebody they personally care about, and particularly themselves or their kids. The technical accomplishments you list are in fundamental bio or medical innovations that have so far only affected a small number of individuals, so for the time being, virtually nobody will care about them. The reality is that turning the technical accomplishments you list into safe and effective medicines ready for doctors to give to patients has been extremely time consuming, expensive, and limited in scope, or has not yet resulted in bona fide medical breakthroughs in humans. In biology, innovations in earlier times (vaccines and antibiotics) were cheap, saved a lot more young people and prevented more common and deadly illnesses. mRNA vaccines and gene editing are applicable to much smaller numbers of people, often in more distant countries, often primarily benefitting older people with comorbidities. You can see this in tapering off of lifespan, the growing gap between lifespan and healthspan, and the rising cost of medical care. You can also see it in the exceptions -- how GLP-1 agonists like semaglutide (Ozempic), which benefitted numerous young people in the first world in a very tangible way -- provoked much more news coverage, popular awareness, and grassroots optimism about progress in this domain, as well as populist anger at high prices, limited availability, worries about side effects, and so on. I think we will see similar excitement if substantial strides are made in bringing down IVF costs, improving fertility for women in their late 30s and beyond, and ability to predict and prevent or terminate pregnancies when the fetus is expected to show profound autism or other serious cognitive impairments; widespread deployment of xenotransplantation (pig kidneys transplanted into humans), drastically improving antipsychotics or treatments for substance abuse, and continued improve

[-]leogao3mo198

idk, it's unclear to me that computers and the Internet are more subtle than cars or radios. it's also, 50 year old americans today have seen the fall of the soviet union, the creation of the european union, enormous advances in civil rights, 9/11, the 2008 crash, covid, the invasion of ukraine, etc. this isn't exactly WWII level but also nowhere near a static stable world.

9Daniel Kokotajlo3mo

Seems a lot less subtle than radios at least! Cars are a different story, they are big and loud and everywhere. But phones are small and loud and everywhere...

7dr_s3mo

I think they are because in practice they just didn't produce the same amount of economic growth. And for most people, the direct impact of these things is entertainment applications, or using them at work (where sometimes they feel like they make things worse). Meanwhile I remember hearing a story of a woman (someone's grandma) who was in awe of the washing machine they had just bought because, well, it had saved her hours of daily gruelling work. And that's more impactful to one's life than almost anything computers or the internet have done.

4MichaelLowe3mo

I have heard Peter Thiel make the point that almost all the recent significant advances are concentrated in the digital world, whereas change in the analog world has been very marginal.

[-]Daniel Kokotajlo3mo1814

Serious people talked very seriously about the possibility of transformative technological change and social change following from it (e.g. Keynes/Russell speculating that people would work way fewer hours in the future).

Don't we have things like that today? E.g. Bengio and Hinton speculating that ASI will arrive and maybe kill everyone. Also, I'd argue that people like Bostrom and Yudkowsky will be viewed more favorably 50 years from now than they are today, and will generally be thought of as "serious people" to a much greater degree. When Keynes/Russell were speculating about the future, they probably weren't as renowned as they are now.

Re: Politicians: Andrew Yang isn't a major politician I guess, but his main schtick was "AI is coming" basically right?

Also Dominic Cummings has vibes similar to, possibly even more extreme than, Churchill's schtick about coal vs. oil.

8Garrett Baker3mo

Not really. From my memory and checking Wikipedia, his campaign was mainly focused on advocating for UBI, and used whatever arguments it could to defend that policy position, including but certainly not limited to an argument that automation was coming, but mainly for menial tasks like truck driving.

[-]Erich_Grunewald3mo136

Nice post!

I think this is probably mostly because there's an important sense in which the world has been changing more slowly (at least from the perspective of Americans), and the ways in which it's changing feel somehow less real.

Maybe another factor is that a lot of the unbounded, grand, and imaginative thinking of the early 20th and the 19th century ended up being either unfounded or quite harmful. So maybe the narrower margins of today are in part a reaction to that in addition to being a reaction to fewer wild things happening.

For example, many of the catastrophes of the 20th century (Nazism, Maoism, Stalinism) were founded in a kind of utopian mode of thinking that probably made those believers more susceptible to mugging. In the 20th century, postmodernists started (quite rightly, imo) rejecting grand narratives in history, like those by Hegel, Marx, and Spengler, and instead historians started offering more nuanced (and imo accurate) historical studies. And several of the most catastrophic fears, like those of 19th-century millenarianism and nuclear war, didn't actually happen.

9Seth Herd3mo

I think you're probably right about that historical difference. But I don't agree with the implication that people won't believe AGI is coming until too late. (I realize this isn't the main claim you're making here, but I think you'd agree that's the most important implication.) It's like January 2020 now, when those concerned with Covid were laughed off. That doesn't mean AGI concerns will be dismissed when more evidence hits. The public could easily go from not nearly concerned enough to making panicked demands for mass action like shutting down half the economy as a precautionary measure. Yes, the modern assumption that nothing really changes will slow down recognition of AI's dangers. But not for long if we're fortunate enough to get a slowish takeoff and public deployments of useful (and therefore creepy) LLM agents. Of course, that might not happen until we're too close to internal deployment of a misaligned takeover-capable system like Agent-4 from AI 2027. But it's looking pretty likely we'll get such deployments and job replacements before the point of no return, so I think we should at least have some contingency plans in case of dramatic public concern. AI is in far-mode thinking for most people now, but I predict it's going to be near-mode for a lot of people as soon as we've got inarguable job replacement and more common experience with agentic AI. I'm the first to talk about how foolish people are compared to our idealized self-conception. People are terrible with abstract ideas. But I think the main reason is that they don't spend time thinking seriously about them until they're personally relevant. Humans take a long time to figure out new things. It takes a lot of thought. But it's also a collective process. As it becomes a bigger part of public conversation, basic logic like "oh yeah they're probably going to build a new species, and that sounds pretty dangerous" will become common. Note that most of the people talking about AI now are entrepr

5Shankar Sivarajan3mo

I can think of several prominent predictions in the present of similar magnitude.

- Every election is proclaimed as the death of American democracy.
- Race war precipitated by Whites becoming a racial minority.
- The recognition of "same-sex marriages" was to harbinger a collapse of all public morality.
- Restrictions on abortion access reducing women to sex-slaves, à la The Handmaid's Tale.
- I think you're understating the apocalypticism of climate-change activism.
- Smartphones/social media/pornography corrupting the youth, leading to … okay, admittedly this one's vaguer, but the consequences, whatever they might be, are still expected to be dire.
- If overpopulation has ceased to be a major concern, that's a very recent development.
- Similarly, running out of oil was forecast to return technology to horse-drawn carriages and beeswax candles. They've definitely stopped saying this, but I heard it in the '00s.

The difference you're talking about might be simply due to you discounting these as insane (or maybe just disingenuous) while hailing analogous predictions in the past as wise/prescient.

4Vladimir_Nesov3mo

Neglectedness must be observed in resource allocation, and anything worth doing is worth doing seriously. So if criteria such as "naive" determine what gets done, some things that are unusually "naive" will get neglected, and so would be worth doing. And if things are not taken seriously when they are characterized in some way, such as being based on "naive" motivations, they get done inefficiently even when they do get done, and so it would be worth fixing the inefficiency.

3bfinn3mo

You forgot computers more generally. Which only became widely used in the 1980s & 1990s

3Mitchell_Porter3mo

It depends where you look. In the 2010s the World Economic Forum was predicting a fourth industrial revolution that would transform every aspect of life. In the 1990s you had Fukuyama saying that the end of the Cold War meant a new worldwide consensus on political ideology. Around the same time, the Internet was also seen as something transformative, and the ideas of nanotechnology haunted the parts of the culture attuned to technological futurism. For that matter, AI utopianism and apocalypticism has been everywhere for the past three years and has never really gone away. The war on terror, the rise of progressivism, the rise of populism, the rise of BRICS, these all have futurisms associated with them. MAGA and the Green New Deal are both intended as utopian visions. So I'd say that the idea that the future will be different from the present, and that we have some capacity to shape it, has never really gone away.

2jmh3mo

I wonder if it's less about rate of change (but don't really take any exception to that claim) and more about divergence of change from expectations. 1950s or 60s expectations (at least in pop culture) were flying cars and smart robot house servants -- think the Jetsons here. People of the early 20th century had the direct experience of living through some very significant events which they probably had not really expected. The future became much more uncertain, so receptivity to more possible outcomes probably increased. The situation is a bit different up to now, so I wonder if that doesn't place greater weight on a view of the future as some trend path with variation but mean-reversion.

2Sid Bagga3mo

A couple of hypotheses as to why:

- Median age in the U.S. was <= 30 throughout the 20th century, until, roughly, the start of the 1980s. Today it's 39. The median age of white Americans is now 44.5! Insofar as more music, advertising, and fiction is written with an older audience in mind, and insofar as people's preferences tend to shift towards quiescence as they age, I think this would contribute to the "ambient cultural sense" of stagnation that you describe. (Of course, it's also possible that a widespread belief that the future will be just like the present has made people have fewer kids, thus causing the median age to rise, but that seems likely to be a less sizeable causal channel than the other way around). Some pretty important stuff has happened since the 1980s. The collapse of the Soviet Union, the Arab Spring, and the political rise of China were arguably bigger geopolitical deals than most 20th century events pre-WWI, in the interwar period, and during the Cold War. I'd argue it's a change in the audience, not in the things happening, that's made the changes these have had in our world seem relatively "subtler and easier to ignore".
- Somewhat relatedly, more financial power is in the hands of the geriatric. This paper has some fascinating data. In 1983, the mean net worth of Americans 75 and older was 5% greater than the mean for all Americans; in 2022 it was reportedly 58% greater... meanwhile, the relative amount of wealth belonging to young and middle-aged Americans fell precipitously; people aged 45-54 went from having an average net worth 53% above mean in 1983, to 9% below mean in 2022! (To be clear, the paper points to the elderly having higher-valued stock portfolios as a key cause of this wealth composition shift, and it's possible there's some incumbency advantage behind this which exists as part of the thing that's made the world feel like it's moving more slowly...). An outsized amount of wealth in people well past the age of

1Felix Choussat3mo

A central pillar of the Democratic Party has been that Republicans will destroy democracy and take the country down with it (somewhat ditto the Republican line on immigration). Both parties are obsessed with the end of American greatness, and motivate their voters through that narrative. To a lesser extent, they’re also nebulously united on “beating China”. Where I agree is that there’s an absence of a positive vision for the future (something that isn’t just the world today + better healthcare). I think this is especially true on the American left, which has basically mired itself in an anti-progress position through its natural distrust of billionaires and its reaction to the tech-right rising in political prominence. It’s hard to accept radical change is possible (except through the existing lens of concentration of wealth or environmental impact) when accepting that change means elevating the importance of people in your cultural outgroup. ASI is a silly concern for fringe thinkers in San Francisco; real writers ask the pressing questions about electricity costs, copyright, and corporate influence on the Trump administration. Compare what the writers of places like the Atlantic, the NYT, or Times have to say about AI with what people like Steve Bannon say. It’s incredibly near-term and sanded down, while the right has been generally more willing to engage with superintelligence being possible.

[-]Buck5mo115111

I hear a lot of scorn for the rationalist style where you caveat every sentence with "I think" or the like. I want to defend that style.

There is real semantic content to me saying "I think" in a sentence. I don't say it when I'm stating established fact. I only use it when I'm saying something which is fundamentally speculative. But most of my sentences are fundamentally speculative.

It feels like people were complaining that I use the future tense a lot. Like, sure, my text uses the future tense more than average, and future tense is indeed somewhat more awkward. But future tense is the established way to talk about the future, which is what I wanted to talk about. It seems pretty weird to switch to present tense just because people don't like future tense.

[-]Ben Pace5mo3120

Probably this isn't the exclusive reason, but typically I use "I think" whenever I want to rule out the interpretation that I am implying we all agree on my claim. If I say "It was a mistake for you to paint this room yellow" this is more natural if you agree with me; if I say "I think it was a mistake for you to paint this room yellow" this is more natural if I'm informing you of my opinion but I expect you to disagree.

This is not a universal rule, and fwiw I do think there's something good about clear and simple writing that cuts out all the probably-unnecessary qualifiers, but I think this is a common case where I find it worth adding it in.

7Steven Byrnes4mo

Hmm, my usage seems more like: “I think that…” means the reader/listener might disagree with me, because maybe I’m wrong and the reader is right. (Or maybe it’s subjective.) Meanwhile, “I claim that…” also means the reader might disagree with me, but if they do, it’s only because I haven’t explained myself (yet), and the reader will sooner or later come to see that I’m totally right. So “I think” really is pretty centrally about confidence levels. I think :)

2cubefox4mo

By the way, "It seems" and "arguably" seem a bit less defensive than "I think" (which is purely subjective). Arguably.

6Ben Pace4mo

I think of "It seems" as coming from the view from nowhere. I say "It seems to me" to own it, unless I mean to make the active claim that I believe it will seem this way to everyone. "Arguably" feels super weak to me. I would only use it if I actively wanted to distance myself from a view. Almost anything is 'arguable' in principle.

[-]bfinn5mo*173

Some languages allow or even require suffixes on verbs indicating how you know what you’re stating (a grammatical feature called ‘evidentiality’) - eg ‘I heard that X’, ‘I suppose that X’.

I suspect this is epistemically good for speakers of such languages, forcing them to consider the reasons behind every statement they make. Hence I find myself adding careful qualifications myself, e.g. ‘I suspect’ (as above), ‘I read that’, etc.

7leogao5mo

a lot of people say "I think" reflexively because they're used to making themselves small. it wouldn't be surprising to me if such people said "I think" more often than most even in situations where the caveat is unnecessary.

4[anonymous]5mo

Saying "I think" isn't making yourself small but making yourself the appropriate size; frequently stating opinions as fact is an unwarranted status-grab and pollutes the epistemic commons.

7cubefox5mo

I think e.g. Eliezer (in the sequences) and Scott Alexander don't hedge a lot, so this doesn't necessarily seem like a rationalist style. I do it a lot though, but I'm fairly sure it makes readability worse.

1t14n4mo

A general tip when editing pieces is that if you ctrl+f for "I think" and you cut out 90-95% of them, it makes the piece better. The (respected) reader knows that everything you write is what you think, and a good piece should make the hedging explicit for a claim that needs deeper consideration -- not every claim. What about readers you don't respect? Well, who cares what they think.

5[anonymous]4mo

The reader knows that, certainly. But they don't know that you know that; that's why you have to clarify that you do. (And yes, you have to! Most people in fact do not know that their opinions aren't fact).

5Martin Sustrik5mo

There are languages out there in which the epistemic status is a mandatory part of a sentence ("I've seen", "so I have been told", "I think", etc.)

5localdeity4mo

Asking Claude for examples

Is this true? Examples?

Yes, this is absolutely true! These grammatical features are called evidentials or evidentiality markers, and they're mandatory in many languages around the world.

Examples:

Turkish is a classic example. It has two past tenses that encode evidentiality:
- -di (direct/witnessed past): "I saw it happen"
- -miş (indirect/inferred past): "I heard/inferred it happened"

So "geldi" means "he came" (and I witnessed it), while "gelmiş" means "he came" (but I didn't see it—I heard about it or saw evidence of it).

Quechua (Andean languages) has several evidential markers:
- -mi: direct knowledge
- -si: reported/hearsay
- -chá: conjecture

Tibetan has a rich evidential system distinguishing personal knowledge, inference, and hearsay.

Japanese has evidential markers too, though they're sometimes considered less grammaticalized:
- rashii: hearsay ("I heard that...")
- yōda: inference based on evidence
- sōda: reported information

Bulgarian and other Balkan languages have "renarrative" mood marking information learned from others rather than directly experienced.

The key point is that in these languages, you often cannot make a statement about a past event without indicating your epistemic basis for the claim. Leaving it out would be grammatically incomplete, like forgetting to conjugate a verb in English. This forces speakers to constantly track and communicate their source of knowledge.

Interesting.

4plex5mo

Yep! Another angle is it helps with variable scoping of conversational info-packets, NVC-style, to reduce collisions between psychological content in the receiver.

2Algon5mo

Funnily enough, just yesterday I read Steven Pinker heaping paragraph upon paragraph of scorn on writers who keep hedging with claims like "I think".

2plex4mo

Yeah, Pinker is specifically kinda annoyed/outgroupy at LW Rats I think.

2Algon4mo

I'd be surprised if he had LW Rats in mind when he wrote that, but it's not impossible.

1localdeity5mo

Hmmph. If he wants to push people to do more research so that they can make statements without any such qualifiers—or to shut up when they haven't done enough research to have anything useful to say—then I may sympathize. If he wants them to make themselves sound more certain than they are, then I oppose.

[-]Buck1y*Ω5511341

Two different meanings of “misuse”

The term "AI misuse" encompasses two fundamentally different threat models that deserve separate analysis and different mitigation strategies:

Democratization of offense-dominant capabilities
- This involves currently weak actors gaining access to capabilities that dramatically amplify their ability to cause harm. That amplification of ability to cause harm is only a huge problem if access to AI didn’t also dramatically amplify the ability of others to defend against harm, which is why I refer to “offense-dominant” capabilities; this is discussed in The Vulnerable World Hypothesis.
- The canonical example is terrorists using AI to design bioweapons that would be beyond their current technical capacity (cf. Aum Shinrikyo, which failed to produce bioweapons despite making a serious effort).
Power Concentration Risk
- This involves AI systems giving already-powerful actors dramatically more power over others
- Examples could include:
  - Government leaders using AI to stage a self-coup then install a permanent totalitarian regime, using AI to maintain a regime with currently impossible levels of surveillance.
  - AI company CEOs using advanced AI systems

...

[-]jbash1yΩ8203

Computer security, to prevent powerful third parties from stealing model weights and using them in bad ways.

By far the most important risk isn't that they'll steal them. It's that they will be fully authorized to misuse them. No security measure can prevent that.

7Buck1y

That's a great way of saying it. I edited this into my original comment.

2Fabien Roger1y

Actually, it is not that clear to me. I think adversarial robustness is helpful (in conjunction with other things) to prevent CEOs from misusing models. If at some point a CEO trying to take over wants to use an HHH model to help them with the takeover, that model will likely refuse to do egregiously bad things. So the CEO might need to use helpful-only models. But there might be processes in place to access helpful-only models - which might make it harder for the CEO to take over. So while I agree that you need good security and governance to prevent a CEO from using helpful-only models to take over, I think that without good adversarial robustness, it is much harder to build adequate security/governance measures without destroying an AI-assisted-CEO's productivity. There is a lot of power concentration risk that just comes from people in power doing normal people-in-power things, such as increasing surveillance on dissidents - for which I agree that adversarial robustness is ~useless. But security against insider threats is quite useless too.

2Bogdan Ionut Cirstea1y

Maybe somewhat of a tangent, but I think this might be a much more legible/better reason to ask for international coordination than the more speculative-seeming (and sometimes, honestly, wildly overconfident IMO) arguments about the x-risks coming from the difficulty of (technically) aligning superintelligence.

2Seth Herd1y

I think this is a valuable distinction. I note that the solutions you mention for the second, less-addressed class of misuse only prevent people who aren't officially in charge of AGI from misusing it; they don't address government appropriation. Governments have a monopoly on the use of force, and their self-perceived mandate includes all issues critical to national security. AGI is surely such an issue. I expect that government will assume control of AGI if they see it coming before it's smart enough to help its creators evade that control. And that would be very difficult in most foreseeable scenarios. You can hop borders, but you're just moving to another government's jurisdiction. I don't have any better solutions to government misuse for a self-coup and permanent dictatorship. Any such solutions are probably political, not technical, and I know nothing about politics. But it seems like we need to get some politically savvy people onboard before we have powerful AI aligned to its creators' intent. Technical alignment is only a partial solution.

[-]Buck5mo11045

Zach Robinson, relevant because he's on the Anthropic LTBT and for other reasons, tweets:

"If Anyone Builds It, Everyone Dies" by @ESYudkowsky and @So8res is getting a lot of attention this week. As someone who leads an org working to reduce existential risks, I'm grateful they're pushing AI safety mainstream. But I think they're wrong about doom being inevitable. 🧵
Don't get me wrong—I take AI existential risk seriously. But presenting doom as a foregone conclusion isn't helpful for solving the problem.
In 2022, superforecasters and AI researchers estimated the probability of existential catastrophic risk from AI by 2100 at around 0.4%-3%. A recent study found no correlation between near-term accuracy and long-term forecasts. TL;DR: predicting the future is really hard.
That doesn't mean we should throw the existential risk baby out with the "Everyone Dies" bathwater. Most of us wouldn't be willing to risk a 3% chance (or even a 0.3% chance!) of the people we love dying.
But accepting uncertainty matters for navigating this complex challenge thoughtfully.
Accepting uncertainty matters for two big reasons.
First, it leaves room for AI's transformative benefits. Tech has doubled life expe

...

[-]habryka5mo7644

Just to help people understand the context: The book really doesn't say that doom is inevitable. It goes out of its way like 4 times to say the opposite. I really don't have a good explanation of Zach's comment that doesn't involve him not having read the book, and nevertheless making a tweet thread about it with a confidently wrong take. IMO the above really reads to me as if he workshopped some random LinkedIn-ish platitudes about the book to seem like a moderate and be popular on social media, without having engaged with the substance at all.

[-]Oliver Daniels5mo7152

The book certainly claims that doom is not inevitable, but it does claim that doom is ~inevitable if anyone builds ASI using anything remotely like the current methods.

I understand Zach (and other "moderates") as saying no, even conditioned on basically YOLO-ing the current paradigm to superintelligence, it's really uncertain (and less likely than not) that the resulting ASI would kill everyone.

I disagree with this position, but if I held it, I would be saying somewhat similar things to Zach (even having read the book).

Though I agree that engaging on the object level (beyond "predictions are hard") would be good.

[-]Thane Ruthenis5mo105

My guess is that they're doing the motte-and-bailey of "make it seem to people who haven't read the book that it says that the ASI extinction is inevitable, that the book is just spreading doom and gloom", from which, if challenged, they could retreat to "no, I meant doom isn't inevitable even if we do build ASI using the current methods".

Like, if someone means the latter (and has also read the book and knows that it goes to great lengths to clarify that we can avoid extinction), would they really phrase it as "doom is inevitable", as opposed to e.g. "safe ASI is impossible"?

Or maybe they haven't put that much thought into it and are just sloppy with language.

6ChristianKl5mo

Eliezer did write Death with Dignity, which seems to assert that doom is inevitable, so the book not making that case is a meaningful step.

7MichaelDickens5mo

I wouldn't. I roughly agree with Zach's background position (i.e. I'm quite uncertain about the likelihood of extinction conditional on YOLO-ing the current paradigm*) but I still think his conclusions are wild. Quoting Zach: The tradeoff isn't between solving scarcity at a high risk of extinction vs. never getting either of those things. It's between solving scarcity now at a high risk of extinction, vs. solving scarcity later at a much lower risk. Slowing down / pausing AI development gives us more time to work on all of those problems. Racing to build ASI means not only are we risking extinction from misalignment, but we're also facing a high risk of outcomes such as, for example, ASI being developed so quickly that governments don't have time to get a handle on what's happening and we end up with Sam Altman as permanent world dictator. (I don't think that particular outcome is that likely, it's just an example.) *although I think my conditional P(doom) is considerably higher than his

0dsj5mo

This depends on what mechanism is used to pause. MIRI is proposing, among other things, draconian control over the worldwide compute supply. Whoever has such control has a huge amount of power to leverage over a transformative technology, which seems at least possibly (and to me, very likely) to increase the risk of getting a permanent world dictator, although the dictator in that scenario is perhaps more likely to be a head of state than the head of an AI lab. Unfortunately, this means that there is no low risk path into the future, so I don't think the tradeoff is as straightforward as you describe:

3MichaelDickens5mo

My preferred mechanism, and I think MIRI's, would be an international treaty in which every country implements AI restrictions within its own borders. That means a head of state can't build dangerous AI without risking war. It's analogous to nuclear non-proliferation treaties. I don't think I would call it low risk, but my guess is it's less risky than the default path of "let anyone build ASI with no regulations".

1dsj5mo

The control required within each country to enforce such a ban breaks the analogy to nuclear non-proliferation. Uranium is an input to a general purpose technology (electricity), but it is not a general purpose technology itself, so it is possible to control its enrichment without imposing authoritarian controls on every person and industry in their use of electricity. By contrast, AI chips are themselves a general purpose technology, and exerting the proposed degree of control would entail draconian limits on every person and industry in society.

2MichaelDickens5mo

The relevant way in which it's analogous is that a head of state can't build [dangerous AI / nuclear weapons] without risking war (or sanctions, etc.).

1dsj5mo

Fair enough, but China and the US are not going to risk war over that unless they believe doom is anywhere close to as certain as Eliezer believes it to be. And they are not going to believe that, in part because that level of certainty is not justified by any argument anyone including Eliezer has provided. And even if I am wrong on the inside view/object level to say that, there is enough disagreement about that claim among AI existential risk researchers that the outside view of a national government is unlikely to fully adopt Eliezer's outlier viewpoint as its own. But in return, we now have the tools of authoritarian control implemented within each participating country. And this is even if they don't use their control over the computing supply to build powerful AI solely for themselves. Just the regime required to enforce such control would entail draconian invasions into the lives of every person and industry.

3habryka5mo

I highly doubt you would say something as false as "doom being inevitable" without qualifiers! Like, sure, maybe this is just really terrible miscommunication, but that itself also seems kind of crazy. Like, the above thread mentions no conditional. It does not say that "doom is inevitable if we build ASI", or anything like that. It just claims that Nate + Eliezer say that "doom is inevitable", no qualifiers.

2Oliver Daniels5mo

I do think there's some amount of "these guys are weirdo extremists" signaling implicit in stating that they think doom is inevitable, but I don't think it stems from not reading the book / not understanding the conditional (the conditional is in the title!)

1Tao Lin5mo

Yeah it goes out of its way to say the opposite, but if you know Nate and Eliezer the book gives the impression that their pdooms are still extremely high, and responding to the authors' beliefs even when those aren't exactly the same as the text is sometimes correct, although not really in this case.

5aog5mo

He also titled his review “An Effective Altruism Take on IABIED” on LinkedIn. Given that Zach is the CEO of Centre for Effective Altruism, some readers might reasonably interpret this as Zach speaking for the EA community. Retitling the post to “Book Review: IABIED” or something else seems better.

-2MalcolmMcLeod5mo

The general pattern from Anthropic leadership is eliding entirely the possibility of Not Building The Thing Right Now. From that baseline, I commend Zach for at least admitting that's a possibility. Outright, it's disappointing that he can't see the path of Don't Build It Right Now---And Then Build It Later, Correctly, or can't acknowledge its existence. He also doesn't really net benefits and costs. He just does the "Wow! There sure are two sides. We should do good stuff" shtick. Which is better than much of Dario's rhetoric! He's cherrypicked a low p(doom) estimate, but I appreciate his acknowledgement that "Most of us wouldn't be willing to risk a 3% chance (or even a 0.3% chance!) of the people we love dying." Correct! I am not willing to! "But accepting uncertainty matters for navigating this complex challenge thoughtfully." Yes. I have accepted my uncertainty of my loved ones' survival, and I have been thoughtful, and the conclusion I have come to is that I'm not willing to take that risk. Tbc this is still a positive update for me on Anthropic's leadership. To a catastrophically low level. Which is still higher than all other lab leaders. But it reminds me of this world-class tweet, from @humanharlan, whom you should all follow. he's like if roon weren't misaligned: "At one extreme: ASI, if not delayed, will very likely cause our extinction. Let’s try to delay it. On the other: No chance it will do that. Don’t try to delay it. Nuanced, moderate take: ASI, if not delayed, is moderately likely to cause our extinction. Don’t try to delay it."

[-]Buck7mo735

@ryan_greenblatt and I are going to try out recording a podcast together tomorrow, as an experiment in trying to express our ideas more cheaply. I'd love to hear if there are questions or topics you'd particularly like us to discuss.

[-]Neel Nanda7mo4811

Hype! A 15 min brainstorm

What would you work on if not control? Bonus points for sketching out the next 5+ new research agendas you would pursue, in priority order, assuming each previous one stopped being neglected

What is the field of AI safety messing up? Bonus: For (field) in {AI safety fields}: What are researchers in $field wrong about/making poor decisions about, in a way that significantly limits their impact?

What are you most unhappy about with how the control field has grown and the other work happening elsewhere?

What are some common beliefs by AI safety researchers about their domains of expertise that you disagree with (pick your favourite domain)?

What beliefs inside Constellation have not percolated into the wider safety community but really should?

What have you changed your mind about in the last 12 months?

You say that you don't think control will work indefinitely and that sufficiently capable models will break it. Can you make that more concrete? What kind of early warning signs could we observe? Will we know when we reach models capable enough that we can no longer trust control?

If you were in charge of Anthropic what would you ...

[-]Ben Pace7mo1810

People often present their views as a static object, which paints a misleading picture of how they arrived at them and how confident they are in different parts. I would be more interested to hear about how they've changed for both of you over the course of your work at Redwood.

[-]JustisMills7mo100

Thoughts on how the sort of hyperstition stuff mentioned in nostalgebraist's "the void" intersects with AI control work.

8williawa7mo

I had this question about economic viability of neuralese models https://www.lesswrong.com/posts/PJaq4CDQ5d5QtjNRy/?commentId=YmyQqQqdei9C7pXR3 I remember Ryan talking about it on the 80k hours podcast. I'd be interested in hearing the perspective more fleshed out. Also just legibility of CoT, how important is it in the overall picture. If people start using fully recurrent architectures tomorrow in all frontier models does p(doom) go from 10% to 90%, or is it a smaller update?

8Zach Stein-Perlman7mo

Control is about monitoring, right?

8Seth Herd7mo

You guys seem as tuned in to the big picture as anyone. The big question we as a field need to answer is: what's the strategy? What's the route to success?

7Rauno Arike7mo

What probability would you put on recurrent neuralese architectures overtaking transformers within the next three years? What are the most important arguments swaying this probability one way or the other? (If you want a specific operationalization for answering this, I like the one proposed by Fabien Roger here, though I'd probably be more stringent on the text bottlenecks criterion, maybe requiring a text bottleneck after at most 10k rather than 100k opaque serial operations.)

6Thane Ruthenis7mo

I second @Seth Herd's suggestion, I'm interested in your vision regarding what success would look like. Not just "here's a list of some initiatives and research programs that should be helpful" or "here's a possible optimistic scenario in which things go well, but which we don't actually believe in", but the sketch of an actual end-to-end plan around which you'd want people to coordinate. (Under the understanding that plans are worthless but planning is everything, of course.)

5Michaël Trazzi7mo

What's your version of AI 2027 (aka most likely concrete scenario you imagine for the future), and how does control end up working out (or not working out) in different outcomes?

5asher7mo

I would be curious to hear you discuss what good, stable futures might look like and how they might be governed (mostly because I haven't heard your takes on this before and it seems quite important)

3Oliver Daniels7mo

Thoughts on "alignment" proposals (i.e. reducing P(scheming))

3fakeanalyst7mo

The usefulness of interpretability research

3Nathaniel7mo

What do you think of the risk that control backfires by preventing warning shots?

3Tobias H7mo

What types of policy/governance research are most valuable for control? Are there specific topics you wish more people were working on?

2samuelshadrach7mo

Thoughts on encouraging more LWers like yourself to make more videos? I am sympathetic to Krashen's input hypothesis as a way to onboard people to a new culture, and video may be faster at that than text.

1Guive7mo

What are your thoughts on Salib and Goldstein's "AI Rights for Human Safety" proposal?

1anaguma7mo

What’s your P(doom)?

[-]Buck6moΩ267111

@Eliezer Yudkowsky tweets:

> @julianboolean_: the biggest lesson I've learned from the last few years is that the "tiny gap between village idiot and Einstein" chart was completely wrong
I agree that I underestimated this distance, at least partially out of youthful idealism.
That said, one of the few places where my peers managed to put forth a clear contrary bet was on this case. And I did happen to win that bet. This was less than 7% of the distance in AI's 75-year journey! And arguably the village-idiot level was only reached as of 4o or o1.

I was very interested to see this tweet. I have thought of that "Village Idiot and Einstein" claim as the most obvious example of a way that Eliezer and co were super wrong about how AI would go, and they've AFAIK totally failed to publicly reckon with it as it's become increasingly obvious that they were wrong over the last eight years.

It's helpful to see Eliezer clarify what he thinks of this point. I would love to see more from him on this--why he got this wrong, how updating changes his opinion about the rest of the problem, what he thinks now about time between different levels of intelligence.

[-]Adam Scholl6moΩ17355

I have thought of that "Village Idiot and Einstein" claim as the most obvious example of a way that Eliezer and co were super wrong about how AI would go, and they've AFAIK totally failed to publicly reckon with it as it's become increasingly obvious that they were wrong over the last eight years

I'm confused—what evidence do you mean? As I understood it, the point of the village idiot/Einstein post was that the size of the relative differences in intelligence we were familiar with—e.g., between humans, or between humans and other organisms—tells us little about the absolute size possible in principle. Has some recent evidence updated you about that, or did you interpret the post as making a different point?

(To be clear I also feel confused by Eliezer's tweet, for the same reason).

[-]Buck6moΩ23406

Ugh, I think you're totally right and I was being sloppy; I totally unreasonably interpreted Eliezer as saying that he was wrong about how long/how hard/how expensive it would be to get between capability levels. (But maybe Eliezer misinterpreted himself the same way? His subsequent tweets are consistent with this interpretation.)

I totally agree with Eliezer's point in that post, though I do wish that he had been clearer about what exactly he was saying.

[-]Eliezer Yudkowsky6moΩ16314

I think you accurately interpreted me as saying I was wrong about how long it would take to get from the "apparently a village idiot" level to "apparently Einstein" level! I hadn't thought either of us were talking about the vastness of the space above, in re what I was mistaken about. You do not need to walk anything back afaict!

[-]TsviBT6moΩ6114

Have you stated anywhere what makes you think "apparently a village idiot" is a sensible description of current learning programs, as they inform us regarding the question of whether or not we currently have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone soon?

[-]Adam Scholl6mo*299

Makes sense. But on this question too I'm confused—has some evidence in the last 8 years updated you about the old takeoff speed debates? Or are you referring to claims Eliezer made about pre-takeoff rates of progress? From what I recall, the takeoff debates were mostly focused on the rate of progress we'd see given AI much more advanced than anything we have. For example, Paul Christiano operationalized slow takeoff like so:

Given that we have yet to see any such doublings, nor even any discernable impact on world GDP:

... it seems to me that takeoff (in this sense, at least) has not yet started, and hence that we have not yet had much chance to observe evidence that it will be slow?

4james oofou6mo

The common theme here is that the capabilities frontier is more jagged than expected. So the way in which people modeled takeoff in the pre-LLM era was too simplistic. Takeoff used to be seen as equivalent to the time between AGI and ASI. In reality we got programmes which are not AGI, but do have capabilities that most in the past would have assumed to entail AGI. So, we have pretty-general intelligence that's better than most humans in some areas, and is amplifying programming and mathematics productivity. So, I think takeoff has begun, but it's under quite different conditions than people used to model.

4Adam Scholl6mo

I don't think they are quite different. Christiano's argument was largely about the societal impact, i.e. that transformative AI would arrive in an already-pretty-transformed world: I claim the world is clearly not yet pretty-transformed, in this sense. So insofar as you think takeoff has already begun, or expect short (e.g. AI 2027-ish) timelines—I personally expect neither, to be clear—I do think this takeoff is centrally of the sort Christiano would call "fast."

[-]Stephen Fowler6mo*268

The following illustration from 2015 by Tim Urban seems like a decent summary of how people interpreted this and other statements.

1Dr. Birdbrain6mo

This comic by Tim Urban is interesting, but I remember when I first read it, it seemed wrong. In his framework, I think ASI can only be quantitatively more powerful than human intelligence, not qualitatively. The reason is simple: humans are already Turing complete. Anything a machine can do, it can only be faster execution of something a human could already do. I don’t think it has much bearing on the wider discussion of AI/AI-risk, I haven’t heard anybody else think that the distinction of quantitative/qualitative superiority had any bearing on AI risk.

2dr_s6mo

I don't think it matters much for practical purposes. It could be that some problems are theoretically solvable by human intelligence but we realistically lack the time to do so in the age of the universe, or that they just can't be solved by us, and either way an ASI that solves them in a day leaves us in the dust. The reason why becomes secondary at that point. I feel like one problem with solving problems intelligently is that it's rarely as easy as tackling a tedious task in small bits - you need an intuition to see the whole path in a sort of coarse light, and then refine on each individual step. So there's a fast algorithm that goes "I know I can do this, I don't know how yet" and then we slowly unpack the relevant bits. And I think there might be a qualitative effect to e.g. being able to hold more steps in memory simultaneously or such.

[-]Richard_Kennaway6moΩ10240

Link to the actual tweet.

[-]ryan_greenblatt6moΩ11150

And the whole exchange on nitter for those who don't like going on x/twitter.

[-]james oofou6mo213

Isn't this too soon to claim that this was some big mistake? Up until December 2024 the best available LLM barely reasoned. Everyone and their dog was saying that LLMs are fundamentally incapable of reasoning. Just eight months later two separate LLM-based systems got Gold on the IMO (one of which is now available, albeit in a weaker form). We aren't at the level of Einstein yet, but we could be within a couple years. Would this not be a very short period of time to go from models incapable of reasoning to models which are beyond human comprehension? Would this image not then be seen as having aged very well?

[Image: Intelligence2]

[-]1a3orn6mo504

Here's Yudkowsky, in the Hanson-Yudkowsky debate:

I think that, at some point in the development of Artificial Intelligence, we are likely to see a fast, local increase in capability—“AI go FOOM.” Just to be clear on the claim, “fast” means on a timescale of weeks or hours rather than years or decades; and “FOOM” means way the hell smarter than anything else around, capable of delivering in short time periods technological advancements that would take humans decades, probably including full-scale molecular nanotechnology.

So yeah, a few years does seem a ton slower than what he was talking about, at least here.

Here's Scott Alexander, who describes hard takeoff as a one-month thing:

If AI saunters lazily from infrahuman to human to superhuman, then we’ll probably end up with a lot of more-or-less equally advanced AIs that we can tweak and fine-tune until they cooperate well with us. In this situation, we have to worry about who controls those AIs, and it is here that OpenAI’s model [open sourcing AI] makes the most sense.

But Bostrom et al worry that AI won’t work like this at all. Instead there could be a “hard takeoff”, a subjective discontinuity in the function mapping AI re

...

[-]Buck6mo3421

It really depends what you mean by a small amount of time. On a cosmic scale, ten years is indeed short. But I definitely interpreted Eliezer back then (for example, while I worked at MIRI) as making a way stronger claim than this; that we'd e.g. within a few days/weeks/months go from AI that was almost totally incapable of intellectual work to AI that can overpower humanity. And I think you need to believe that much stronger claim in order for a lot of the predictions about the future that MIRI-sphere people were making back then to make sense. I wish we had all been clearer at the time about what specifically everyone was predicting.

[-]elifland6mo132

I'd be excited for people (with aid of LLMs) to go back and grade how various past predictions from MIRI folks are doing, plus ideally others who disagreed. I just read back through part of https://www.lesswrong.com/posts/vwLxd6hhFvPbvKmBH/yudkowsky-and-christiano-discuss-takeoff-speeds and my quick take is that Paul looks mildly better than Eliezer due to predicting larger impacts/revenue/investment pre-AGI (which we appear to be on track for and to some extent already seeing) and predicting a smoother increase in coding abilities, but hard to say in part because Eliezer mostly didn't want to make confident predictions, also I think Paul was wrong about Nvidia but that felt like an aside.

edit: oh also there's the IMO bet, I didn't get to that part on my partial re-read, that one goes to Eliezer.

Looking through IEM and the Yudkowsky-Hanson debate also seems like potentially useful sources, as well as things that I'm probably forgetting or unaware of.

3Cole Wyeth6mo

The part of this graph that has aged the least well is that the y-axis is labeled “intelligence” and it’s becoming harder to see that as a real value.

[-]TsviBT6mo*Ω6159

If by intelligence you mean "we made some tests and made sure they are legible enough that people like them as benchmarks, and lo and behold, learning programs (LPs) continue to perform some amount better on them as time passes", ok, but that's a dumb way to use that word. If by intelligence you mean "we have something that is capable via generators sufficiently similar to [the generators of humanity's world-affecting capability] that we can reasonably induce that these systems are somewhat likely to kill everyone", then I challenge you to provide the evidence / reasoning that apparently makes you confident that LP25 is at a ~human (village idiot) level of intelligence.

Cf. https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense

[-]wunan6mo150

Here is Eliezer's post on this topic from 17 years ago for anyone interested: https://www.lesswrong.com/posts/3Jpchgy53D2gB5qdk/my-childhood-role-model

Anna Salamon's comment and Eliezer's reply to it are particularly relevant.

2Buck6mo

Thanks heaps for pulling this up! I totally agree with Eliezer's point there.

[-]Eli Tyre6mo146

[Epistemic status: unconfident]

So...I actually think that it technically wasn't wrong, though the implications that we derived at the time were wrong because reality was more complicated than our simple model.

Roughly, it seems like mental performance depends on at least two factors: "intelligence" and "knowledge". It turns out that, at least in some regimes, there's an exchange rate at which you can make up for mediocre intelligence with massive amounts of knowledge.

My understanding is that this is what's happening even with the reasoning models. They have a ton of knowledge, including a ton of procedural knowledge about how to solve problems, which is masking the ways in which they're not very smart.^[1]

One way to operationalize how dumb the models are is the number of bits/tokens/inputs/something that are necessary to learn a concept or achieve some performance level on a task. Amortizing over the whole training process / development process, humans are still much more sample efficient learners than foundation models.

Basically, we've found a hack where we can get a kind of smart thing to learn a massive amount, which is enough to make it competitive with humans in a...

[-]1a3orn6mo14-1

Is that sentence dumb? Maybe when I'm saying things like that, it should prompt me to refactor my concept of intelligence.

I don't think it's dumb. But I do think you're correct that it's extremely dubious -- that we should definitely be refactoring the concept of intelligence.

Specifically: There's a default LW-esque frame of some kind of a "core" of intelligence as "general problem solving" apart from any specific bit of knowledge, but I think that -- if you manage to turn this belief into a hypothesis rather than a frame -- there's a ton of evidence against this thesis. You could even basically look at the last ~3 years of ML progress as just continuing little bits of evidence against this thesis, month after month after month.

I'm not gonna argue this in a comment, because this is a big thing, but here are some notes around this thesis if you want to tug on the thread.

Comparative psychology finds human infants are characterized by overimitation relative to Chimpanzees, more than any general problem-solving skill. (That's a link to a popsci source but there's a ton of stuff on this.) That is, the skills humans excel at vs. Chimps + Bonobos in experiments are social and allow t

...

8Eli Tyre6mo

All this seems relevant, but there's still the fact that a human's Elo at Go or chess will improve much more from playing 1000 games (and no more) than an AI's will from playing 1000 games. That's suggestive of property learning, or reflection, or conceptualization, or generalization, or something, that the AIs seem to lack, but can compensate for with brute force.

21a3orn6mo

So for the case of our current RL game-playing AIs not learning much from 1000 games -- sure, the actual game-playing AIs we have built don't learn games as efficiently as humans do, in the sense of "from as little data." But:

- Learning from as little data as possible hasn't actually been a research target, because self-play data is so insanely cheap. So it's hard to conclude that our current setup for AIs is seriously lacking, because there hasn't been serious effort to push along this axis.
- To point out some areas we could be pushing on, but aren't: Game-play networks are usually something like ~100x smaller than LLMs, which are themselves ~10-100x smaller than human brains (very approximate numbers). We know from numerous works that data efficiency scales with network size, so even if Adam over matmul is 100% as efficient as human brain matter, we'd still expect our current RL setups to do amazingly poorly with data-efficiency simply because of network size, even leaving aside further issues about lack of hyperparameter search and research effort.

Given this, while this is of course a consideration, it seems far from a conclusive consideration.

Edit: Or more broadly, again -- different concepts of "intelligence" will tend to have different areas where they seem to have more predictive use, and different areas where they seem to have more epicycles. The areas above are the kind of thing that -- if one made them central to one's notions of intelligence rather than peripheral -- you'd probably end up with something different than the LW notion. But again -- they certainly do not compel one to do that refactor! It probably wouldn't make sense to try to do the refactor unless you just keep getting the feeling "this is really awkward / seems off / doesn't seem to be getting at some really important stuff" while using the non-refactored notion.
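As a rough check on the size comparison above, here is a minimal sketch that just multiplies the quoted ratios together. The constants are the comment's own "very approximate numbers", read as a 10-100x range for the LLM-to-brain gap; the function name is made up for illustration.

```python
# Illustrative arithmetic only. The ratios are the "very approximate numbers"
# quoted in the comment, not measurements.
GAME_NET_VS_LLM = 100        # game-play networks ~100x smaller than LLMs
LLM_VS_BRAIN = (10, 100)     # LLMs ~10-100x smaller than human brains (assumed range)


def game_net_vs_brain(game_vs_llm, llm_vs_brain_range):
    """Combine the two size ratios into a (low, high) range."""
    low, high = llm_vs_brain_range
    return game_vs_llm * low, game_vs_llm * high


low, high = game_net_vs_brain(GAME_NET_VS_LLM, LLM_VS_BRAIN)
print(f"Game-play nets: roughly {low:,}x to {high:,}x smaller than a human brain")
```

On these numbers the gap between game-play networks and brains is three to four orders of magnitude, which is the scale the data-efficiency argument leans on.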

4faul_sname6mo

and whose predictive validity in humans doesn't transfer well across cognitive architectures. e.g. reverse digit span.

2TsviBT6mo

Yes, indeed, they copy the actions and play them through their own minds as a method of play, to continue extracting nonobvious concepts. Or at least that is my interpretation. Are you claiming that they are merely copying??

9Random Developer6mo

This is very much my gut feeling, too. LLMs have a much greater knowledge base than humans do, and some of them can "think" faster. But humans are still better at many things, including raw problem-solving skills. (Though LLMs' problem-solving skills have improved a breathtaking amount in the last 12 months since o1-preview shipped. Seriously, folks. The goalpost-moving is giving me vertigo.) This uneven capabilities profile means that LLMs are still well below the so-called "village idiot" in many important ways, and have already soared past Einstein in others. This averages out to "kinda competent on short time horizons if you don't squint too hard." But even if the difference between "the village idiot" and "smarter than Einstein" involved another AI winter, two major theoretical breakthroughs, and another 10 years, I would still consider that damn close to a vertical curve.

9Thane Ruthenis6mo

I don't know that they were wrong about that claim. Or, it depends on what we interpret as the claim. "AI would do the thing in this chart" proved false[1], but I don't think this necessarily implies that "there's a vast distance between a village idiot and Einstein in intelligence levels". Rather, what we're observing may just be a property of the specific approach to AI represented by LLMs. It is not quite "imitation learning", but it shares some core properties of imitation learning. LLMs skyrocketed to human-ish level because they're trained to emulate humans via human-generated data. Improvements then slowed to a (relative) crawl because it became a data-quality problem. It's not that there's a vast distance between stupid and smart humans, such that moving from a random initialization to "dumb human" is as hard as moving from a "dumb human" to a "smart human". It's just that, for humans, assembling an "imitate a dumb human" dataset is easy (scrape the internet), whereas transforming it into an "imitate a smart human" dataset is very hard. (And then RL is just strictly worse at compute-efficiency and generality, etc.) (Edit: Yeah, that roughly seems to be Eliezer's model too, see this thread.) If that's the case, Eliezer and co.'s failure wasn't in modeling the underlying dynamics of intelligence incorrectly, but in failing to predict and talk about the foibles of an ~imitation-learning paradigm. That seems fairly minor. Also: did that chart actually get disproven? To believe so, we have to assume current LLMs are at the "dumb human" levels, and that what's currently happening is a slow crawl to "smart human" and beyond. But if LLMs are not AGI-complete, if their underlying algorithms (rather than externally visible behaviors) qualitatively differ from what humans do, this gives us little information on the speed with which an AGI-complete AI would move from a "dumb human" to a "smart human". Indeed, I still expect pretty much that chart to happen once we g

7ryan_greenblatt6mo

You seem to think that imitation resulted in LLMs quickly saturating on an S-curve, but relevant metrics (e.g. time-horizon) seem like they smoothly advance without a clear reduction in slope from the regime where pretraining was rapidly being scaled up (e.g. up to and through GPT-4) to after (in fact, the slope seems somewhat higher). Presumably you think some qualitative notion of intelligence (which is hard to measure) has slowed down? My view is that basically everything is progressing relatively smoothly and there isn't anything which is clearly stalled in a robust way.

4Thane Ruthenis6mo

That's not the relevant metric. The process of training involves a model skyrocketing in capabilities, from a random initialization to a human-ish level (or the surface appearance of it, at least). There's a simple trick – pretraining – which lets you push a model's intelligence from zero to that level. Advancing past this point then slows down to a crawl: each incremental advance requires new incremental research derived by humans, rather than just turning a compute crank. (Indeed, IIRC a model's loss curves across training do look like S-curves? Edit: On looking it up, nope, I think.) The FOOM scenario, on the other hand, assumes a paradigm that grows from random initialization to human level to superintelligence all in one go, as part of the same training loop, without a phase change from "get it to human level incredibly fast, over months" to "painstakingly and manually improve the paradigm past the human level, over years/decades".

2ryan_greenblatt6mo

Relevant metrics of performance are roughly linear in log-compute when compute is utilized effectively in the current paradigm for training frontier models. From my perspective it looks like performance has been steadily advancing as you scale up compute and other resources. (This isn't to say that pretraining hasn't had lower returns recently, but you made a stronger claim.)
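To make "roughly linear in log-compute" concrete, here is a minimal sketch that fits score = a + b * log10(compute) by ordinary least squares. The compute and score values are hypothetical placeholders chosen to lie exactly on a line, not real benchmark results, and `fit_log_linear` is a name invented for this example.

```python
import math


def fit_log_linear(compute, scores):
    """Least-squares fit of score = a + b * log10(compute); returns (a, b)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical placeholder data (NOT real results): training FLOP and a score.
compute = [1e21, 1e22, 1e23, 1e24]
scores = [10.0, 20.0, 30.0, 40.0]

a, b = fit_log_linear(compute, scores)
print(f"score gained per 10x compute: {b:.2f}")
```

Under a log-linear trend, each 10x increase in compute buys the same number of points (the slope b), so a slowdown would show up as recent points falling below the fitted line.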

6Adam Scholl6mo

I think one of the (many) reasons people have historically tended to miscommunicate/talk past each other so much about AI timelines, is that the perceived suddenness of growth rates depends heavily on your choice of time span. (As Eliezer puts it, "Any process is continuous if you zoom in close enough.") It sounds to me like you guys (Thane and Ryan) agree about the growth rate of the training process, but are assessing its perceived suddenness/continuousness relative to different time spans?

3Noosphere896mo

A key reason, independent of LLMs, is that we see vast ranges of human performance. Eliezer's claim that humans having similar brain architectures means very little effort is needed to become the best human who ever lived is wrong (admittedly, this amounts to saying that the post was always wrong and we just failed to notice it, myself included). The range of human ability is wide, actually.

4elifland6mo

In terms of general intelligence including long-horizon agency, reliability, etc., do we think AIs are yet, for example, as autonomously good as the worst professionals? My instinct is no for many of them, even though the AIs might be better at the majority of sub-tasks and are very helpful as collaborators rather than fully replacing someone. But I'm uncertain; it might depend on the operationalization and profession, and for some professions the answer seems clearly yes.[1][2] It also seems harder to reason about the literally least capable professional than about something like the 10th percentile.

If the answer is no and we're looking at the ability to fully autonomously replace humans, this would mean the village idiot -> Einstein claim might technically not be falsified. The spirit of the claim might be though, e.g. in terms of the claimed implications.

1. ^ There's also a question of whether we should include physical abilities; if so, then the answer would clearly be no for those professions or tasks.
2. ^ One profession for which it seems likely that the AIs are better than the least capable humans is therapy. Also teaching/tutoring. In general this seems true for professions that can be done via remote work and don't involve heavy required computer use or long-horizon agency.

3Max H6mo

What specifically do you think is obviously wrong about the village idiot <-> Einstein gap? This post from 2008 which uses the original chart makes some valid points that hold up well today, and rebuts some real misconceptions that were common at the time.

The original chart doesn't have any kind of labels or axes, but here are two ways you could plausibly view it as "wrong" in light of recent developments with LLMs:

- Duration: the chart could be read as a claim that the gap between the development of village idiot and Einstein-level AI in wall-clock time would be more like hours or days rather than months or years.
- Size and dimensionality of mind-space below the superintelligence level. The chart could be read as a claim that the size of mindspace between village idiot and Einstein is relatively small, so it's surprising to Eliezer-200x that there are lots of current AIs landing in between them, and staying there for a while.

I think it's debatable how much Eliezer was actually making the stronger versions of the claims above circa 2008, and also remains to be seen how wrong they actually are, when applied to actual superintelligence instead of whatever you want to call the AI models of today.

OTOH, here are a couple of ways that the village idiot <-> Einstein post looks prescient:

- Qualitative differences between the current best AI models and second-to-third tier models are small. Most AI models today are all roughly similar to each other in terms of overall architecture and training regime, but there are various tweaks and special sauce that e.g. Opus and GPT-5 have that Llama 4 doesn't. So you have something like: Llama 4: GPT-5 :: Village idiot : Einstein, which is predicted by: (and something like a 4B parameter open-weights model is analogous to the chimpanzee) Whereas I expect that e.g. Robin Hanson in 2008 would have been quite surprised by the similarity and non-specialization among different models of today.
- Implications for scaling. H

2Richard_Kennaway6mo

I find myself puzzled by Eliezer’s tweet. I had always taken the point of the diagram to be the vastness of the space above Einstein compared with the distance between Einstein and the village idiot. I do not see how recent developments in AI affect that. AI has (in Eliezer’s view) barely reached the level of the village idiot. Nothing in the diagram bears on how long it will take to equal Einstein. That is anyway a matter of the future, and Eliezer has often remarked on how many predictions of long timelines to some achievement turned out to be achieved within months, or already had been when the prediction was made. I wonder what Eliezer’s predicted time to Einstein is, given no slowdown.

[-]Buck4mo*6843

I'd be really interested in someone trying to answer the question: what updates on the a priori arguments about AI goal structures should we make as a result of empirical evidence that we've seen? I'd love to see a thoughtful and comprehensive discussion of this topic from someone who is both familiar with the conceptual arguments about scheming and also relevant AI safety literature (and maybe AI literature more broadly).

Maybe a good structure would be, from the a priori arguments, identifying core uncertainties like "How strong is the imitative prior?", "How strong is the speed prior?", and "To what extent do AIs tend to generalize versus learn narrow heuristics?" and tackling each. (Of course, that would only make sense if the empirical updates actually factor nicely into that structure.)

I feel like I understand this very poorly right now. I currently think the only important update that empirical evidence has given me, compared to the arguments in 2020, is that the human-imitation prior is more powerful than I expected. (Though of course it's unclear whether this will continue (and basic points like the expected increasing importance of RL suggest that it will be less powerful over time.)) But to my detriment, I don't actually read the AI safety literature very comprehensively, and I might be missing empirical evidence that really should update me.

[-]habryka4mo*566

Copy-pasting what I wrote in a Slack thread about this:

My current take, having thought a lot about a few things in this domain, but not necessarily this specific question, is that the only dimensions where the empirical evidence feels like it was useful, besides a broad "yes, of course the problems are real, and AGI is possible, and it won't take hundreds of years" confirmation, are the dynamics around how much you can steer and control near-human AI systems to perform human-like labor.

I think almost all the evidence for that comes from just the scaling up, and basically none of it comes from safety work (unless you count RLHF as safety work, though of course the evidence there is largely downstream of the commercialization and scaling of that technology).

I can't think of any empirical evidence that updated me much on what superintelligent systems would do, even if they are the results of just directly scaling current systems, which is the key thing that matters.

A small domain that updated me a tiny bit, though mostly in the direction of what I already believed, is the material advantage research with stuff like LeelaOdds, which demonstrated more cleanly you can overcome larg... (read more)

[-]Daniel Kokotajlo4mo271

I have not invested the time to give an actual answer to your question, sorry. But off the top of my head, some tidbits that might form part of an answer if I thought about it more:

--I've updated towards "reward will become the optimization target" as a result of seeing examples of pretty situationally aware reward hacking in the wild. (Reported by OpenAI primarily, but it seems to be more general)
--I've updated towards "Yep, current alignment methods don't work" due to the persistent sycophancy which still remains despite significant effort to train it away. Plus also the reward hacking etc.
--I've updated towards "The roleplay/personas 'prior' (perhaps this is what you mean by the imitative prior?) is stronger than I expected; it seems to be persisting to some extent even at the beginning of the RL era." (Evidence: Grok spontaneously trying to serve its perceived masters, the Emergent Misalignment results, some of the scary demos iirc...)

3asher4mo

I think this is a really good answer, +1 to points 1 and 3! I'm curious to what degree you think labs have put in significant effort to train away sycophancy. I recently ran a poll of about 10 people, some of whom worked at labs, on whether labs could mostly get rid of sycophancy if they tried hard enough. While my best guess was 'no,' the results were split around 50-50. (Would also be curious to hear more lab people's takes!) I'm also curious how reading model chain-of-thought has updated you, both on the sycophancy issue and in general.

2StanislavKrym4mo

Didn't Kimi K2, which was trained mostly on RLVR and self-critique instead of RLHF, end up LESS sycophantic than anything else, including Claude 4.5 Sonnet, even with the situational awareness which Claude, unlike Kimi, has? While mankind doesn't have that many different models which are around 4o's abilities, Adele Lopez claimed that DeepSeek believes itself to be writing a story and 4o wants to eat your life, and conjectured in private communication that "the different vibe is because DeepSeek has a higher percentage of fan-fiction in its training data, and 4o had more intense RL training".[1]

RL seems to move the CoT towards decreasing the ability to understand it (e.g. if the CoT contains armies of dots, as happened with GPT-5) unless mitigated by paraphrasers. As for CoTs containing slop, humans also have CoTs which include slop until the right idea somehow emerges.

1. ^ IMO, a natural extension would be that 4o was raised on social media and, like influencers, wishes to be liked. Which was also reinforced by RLHF or had 4o conclude that humans like sycophancy. Anyway, 4o's ancestral environment rewarded sycophancy and things rewarded by the ancestral environment are hard to unlike.

[-]1a3orn4mo180

So a thing I've been trying to do is get a better notion of "What actually is it about human intelligence that lets us be the dominant species?" Like, "intelligence" is a big box that holds which specific behaviors? What were the actual behaviors that evolution reinforced, over the course of giving us big brains? Big question, hard to know what's the case.

I'm in the middle of "Darwin's Unfinished Symphony", and finding it at least intriguing as a look at how creativity / imitation are related, and how "imitation" is a complex skill that humans are nevertheless supremely good at. (The "Secret of Our Success" is another great read here of course.)

Both of these are kinda about the human imitation prior... in humans. And why that may be important. So I think if one is thinking around the human-imitation prior being powerful, it would make sense to read them as cases for why something like the human imitation prior is also powerful in humans :)

They don't give straight answers to any questions about AI, of course, and I'd be sympathetic to the belief that they're irrelevant or kinda a waste of time, and frankly they might be a waste of time depending on what you're funging against. ... (read more)

[-]jake_mendel4mo120

Copypasting from a slack thread:

I'll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:

On generalisation vs simple heuristics:
- I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. I think these are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set. In modular addition, there are two levels of algorithm and in the Grokked transformer, there are three. The story of what the models end up doing ends up being pretty nuanced and it ends up coming down to specific details of the pre-training data mix, which maybe isn't that surprising. (If you squint, then you can sort of see predictions in the Grokked transformer being borne out in the nuance about when LLMs can do multi-hop reasoning e.g. Yang et al.) But it seems pretty clear that if the training conditions are right, then you can get increasingly general algorithms learned even when simpler ones would do the trick.
- I also think a useful idea (although less useful so far than the previous bull

... (read more)

6Lucius Bushnaq3mo

That post is superseded by this one. It was just a sketch I wrote up mostly to clarify my own thinking, the newer post is the finished product. It doesn't exactly say that neural networks have Solomonoff-style priors. It depends on the NN architecture. E.g., if your architecture is polynomials, or MLPs that only get one forward pass, I do not expect them to have a prior anything like that of a compute-bounded Universal Turing Machine. And NN training adds in additional complications. All the results I talk about are for Bayesian learning, not things like gradient descent. I agree that this changes the picture and questions about the learnability of solutions become important. You no longer just care how much volume the solution takes up in the prior, you care how much volume each incremental building block of the solution takes up within the practically accessible search space of the update algorithm at that point in training.

[-]David Johnston4mo100

I don't know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that "AGI" is much less likely to be insanely good at generalization than we thought in 2015.

Evidence is basically this: I don't think "the scaling hypothesis" was obvious at all in 2015, and maybe not even in 2020. If it was, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren't expecting massive data scale-ups to be the road to AGI, what were they expecting instead? The alternative to reaching AGI by hyperscaling data is a world where we reach AGI with ... not much data. I have this picture which I associate with Marcus Hutter – possibly quite unfairly – where we just find the right algorithm, teach it to play a couple of computer games and hey presto we've got this amazing generally intelligent machine (I'm exaggerating a little bit for effect). In this world, the "G" in AGI comes from extremely impressive and probably quite unpredictable feats of generalization, and misalignment risks are quite obviously way higher for machines like this. As a brute fact, if ge... (read more)

2Hastings4mo

I think it depends on whether the intelligences in charge at any point find a way to globally not try a promising idea. If not, then it doesn’t matter that much if LLMs are capable of superintelligence, or just AGI. (If they aren’t capable of AGI, of course that matters because it could lead to a proper fizzle.) What really matters is whether they are the optimal design for superintelligence. If they aren’t, and no way is found to not try a promising idea, then my mental model of the next 50 years includes many transitions in what the architecture of the smartest optimizer is, each as different from each other as evolution is from neuron brains, or brains from silicon gradient descent. Then, the details of the motivations of silicon token predictors are more a hint to the breadth of variety of goals we will see than a crux.

[-]Buck26d661

@ryan_greenblatt and I are going to record another podcast together. We'd love to hear topics that you'd like us to discuss. (The questions people proposed last time are here, for reference.)

[-]Thane Ruthenis25d103

Ryan had suggested that, on his model, spending ~5%-more-than-commercially-expedient resources on alignment might drop takeover risks down to 50%. I'm interested in how he thinks this scales: how much more resources, in percentage terms, would be needed to drop the risk to 20%, 10%, 1%?

7Ronny Fernandez24d

What would a class aimed at someone like me (read LessWrong for many years, familiar with the basics of LLM architecture and learning to some extent) have to cover to get me up to speed on AI futurism by your lights? I am imagining the output here being like a bulleted list of 12-30 broad thingies.

7davekasten25d

I would like, for obvious self-interested reasons, discussion of public policy ideas you think people should be pushing more on than they currently are.

7No77e26d

I'd like to know what are your motivations for doing what you're doing! In the first podcast you hinted at "weird reasons" but you didn't say them explicitly in the end. I'm thinking about this quote:

2interstice25d

He's talking about the stuff around the simulation hypothesis and acausal trade in the preceding section.

4No77e25d

Sure but he hasn't laid out the argument. "something something simulation acausal trade" isn't a motivation.

6peterbarnett25d

It might be good to have you talk about more research directions in AI safety you think are not worth pursuing or are over-invested in. Also I think it would be good to talk about what the plan for automating AI alignment work would look like in practice (we've talked about this a little in person, but it would be good for it to be public).

5mclarke24d

Curious on any hot-takes you both have on OpenClaw/Moltbook and whether any emergent behaviours you've seen from agents shifts your P(doom).

4James Payor23d

What interesting things do y'all think are up with AI lab politics these days? Also why is everyone (or just many people in these circles) going to Anthropic now? Any changes in how things seem for control plans based on vibes and awareness present in more recent models? (GPT-5 series may not count here; I'm mostly interested in visibility on the next generation that are coming, of which I think Opus 4.5 is a preview but I'm fairly unsure.) Anything generally striking about how things look in the landscape and models versus a year ago?

3Kyle O’Brien24d

In episode 1, you and Ryan discussed how you both came close to disbanding Redwood after the initial AI Control paper. I think folks would benefit from hearing more of your thoughts on why you decided to remain an external research organization, especially since my understanding is that you want to influence the practices of the frontier labs. This is a consideration that many folks should grapple with in their own research efforts.

3interstice25d

I'd be interested in hearing more about Ryan's proposal to do better generalization science (or if you don't have much more to say in the podcast format I'd be interested in seeing the draft about it).

3Oliver Daniels23d

I think it's this? https://docs.google.com/document/d/1fSaw3NArDj2ndbem96V4E7x4hAlEaqrrBawjf0wNQ-A/edit?tab=t.0#heading=h.gimq5t8clw6i

2yams18d

In the first you mention having a strong shared ontology (for thinking about AI) and iirc register a kind of surprise that others don’t share it. I think it would be cool if you could talk about that ontology more directly, and try to hold at that level of abstraction for a prolonged stretch (rather than invoking it in short hand when it’s load bearing and quickly moving along, which is a reasonable default, but not maximally edifying).

2Seth Herd24d

The first podcast was great. It strengthened my impression that both of you are on top of the big-picture, strategic situation in a way few (if any) other people are (that is, being both broad and detailed enough to be effective). I'd like to hear you discuss the default alignment plan more. I'd like to hear you elaborate and speculate in particular on how automated alignment is likely to go under various of the plan/effort types. The orgs aren't publishing anything much like a plan or planning. It seems like somebody should be doing it. I nominate the two of you! :) no pressure. I do think it should be a distributed effort to refine the default plan and where it goes wrong; your efforts would do a lot to catalyze more of that discussion. The responses to What’s the short timeline plan? were pretty sketchy, and I haven't seen a lot of improvements, outside of Ryan's post, since then. FWIW, I strongly agree with Ryan that big projects benefit from planning. Alignment isn't a unified project, but it does seem like one in important ways. Like the Apollo program, which had extensive planning, there's a lot of time pressure, so just doing stuff and getting there eventually won't work, like it does for most innovations.

1Ariel_23d

Would be great to get some discussion on the second part of Control - "getting useful [alignment] work out of the models". Now with Opus 4.5, we might be in a decent position to test this a bit better?

1genericname-225d

I'd like to hear Ryan talk more about his opinions on Anthropic and Dario's writings.

[-]Buck2yΩ346324

[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I'm much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]

I’m interested in the following subset of risk from AI:

Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
- So e.g. I exclude state actors stealing weights in ways that aren’t enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren’t (at this level of capability and dignity).
Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they’re spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI wo

... (read more)

6Matthew Barnett2y

Can you be clearer on this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in? For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

9ryan_greenblatt2y

One operationalization is "these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs". As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:

- The speed up is relative to the current status quo as of GPT-4.
- The speed up is ignoring the "speed up" of "having better experiments to do due to access to better models" (so e.g., they would complete a fixed research task faster).
- By "capable" of speeding things up this much, I mean that if AIs "wanted" to speed up this task and if we didn't have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
- The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I'm uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
- It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.

I'm uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven't yet been that widely deployed due to inference availability issues, so actual production hasn't increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal). So, it's hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.
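The operationalization at the top of this comment can be written as a simple threshold check. This is only a sketch of the stated criterion (a 30x speedup at under 2x marginal cost); the function and parameter names are made up for illustration, and it ignores all of the caveats about baselines, amortization, and inflation.

```python
def meets_capability_threshold(speedup, cost_multiplier,
                               min_speedup=30.0, max_cost_multiplier=2.0):
    """Return True if AIs speed up ML R&D by at least `min_speedup`
    while raising marginal cost by strictly less than `max_cost_multiplier`.

    Both defaults come from the comment's wording; everything else here
    is a hypothetical illustration.
    """
    return speedup >= min_speedup and cost_multiplier < max_cost_multiplier


# Hypothetical scenarios: (speedup, marginal cost multiplier).
for speedup, cost in [(30.0, 1.5), (30.0, 2.0), (10.0, 1.2)]:
    verdict = meets_capability_threshold(speedup, cost)
    print(f"{speedup}x faster at {cost}x cost -> {verdict}")
```

Note the asymmetry: the speedup bound is inclusive while the cost bound is strict, matching "by 30x with less than a 2x increase".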

6Raemon2y

I didn't get this from the premises fwiw. Are you saying it's trivial because "just don't use your AI to help you design AI" (seems organizationally hard to me), or did you have particular tricks in mind?

6ryan_greenblatt2y

The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe. Not that most applications of AI for AI development can be made trivially safe.

5Orpheus162y

I think it depends on how you're defining an "AI control success". If success is defined as "we have an early transformative system that does not instantly kill us, and we are able to get some value out of it", then I agree that this seems relatively easy under the assumptions you articulated. If success is defined as "we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period", then this seems much harder. The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8, and then it's trying very hard to get good alignment techniques out of GPT-8 to align a successor GPT-9. However, Lab B is only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B (either because it's less cautious or because it feels like it needs to cut corners to catch up) either doesn't want to implement the control techniques, or it's fine implementing them but plans to be less cautious about when we're ready to scale up to GPT-9. I think it's fine to say "the control agenda is valuable even if it doesn't solve the whole problem, and yes other things will be needed to address race dynamics, otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn't cause a catastrophe." But this has a different vibe than "AI control is quite easy", even if that statement is technically correct. (Also, please do point out if there's some way in which the control agenda "solves" or circumvents this threat model; apologies if you or Ryan have written/spoken about it somewhere that I missed.)

7Buck2y

When I said "AI control is easy", I meant "AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes"; I wasn't trying to comment more generally. I agree with your concern.

4Bogdan Ionut Cirstea1y

Here's one (somewhat handwavy) reason for optimism w.r.t. automated AI safety research: most safety research has probably come from outside the big labs (see e.g. https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/) and thus has likely mostly used significantly sub-SOTA models. It seems quite plausible then that we could have the vast majority of (controlled) automated AI safety research done on much smaller and less dangerous (e.g. trusted) models only, without this leading to intolerably-large losses in productivity; and perhaps have humans only/strongly in the loop when applying the results of that research to SOTA, potentially untrusted, models.

[-]Buck7y620

I think that an extremely effective way to get a better feel for a new subject is to pay an online tutor to answer your questions about it for an hour.

It turns out that there are a bunch of grad students on Wyzant who mostly work tutoring high school math or whatever but who are very happy to spend an hour answering your weird questions.

For example, a few weeks ago I had a session with a first-year Harvard synthetic biology PhD. Before the session, I spent a ten-minute timer writing down things that I currently didn't get about biology. (This is an exercise worth doing even if you're not going to have a tutor, IMO.) We spent the time talking about some mix of the questions I'd prepared, various tangents that came up during those explanations, and his sense of the field overall.

I came away with a whole bunch of my minor misconceptions fixed, a few pointers to topics I wanted to learn more about, and a way better sense of what the field feels like and what the important problems and recent developments are.

There are a few reasons that having a paid tutor is a way better way of learning about a field than trying to meet people who happen to be in that field. I really like it that I'm payi

... (read more)

[-]habryka7y340

I've hired tutors around 10 times while I was studying at UC-Berkeley for various classes I was taking. My usual experience was that I was easily 5-10 times faster in learning things with them than I was either via lectures or via self-study, and often 3-4 one-hour meetings were enough to convey the whole content of an undergraduate class (combined with another 10-15 hours of exercises).

[-]philip_b7y100

How do you spend time with the tutor? Whenever I tried studying with a tutor, it didn't seem more efficient than studying using a textbook. Also when I study on my own, I interleave reading new materials and doing the exercises, but with a tutor it would be wasteful to do exercises during the tutoring time.

[-]habryka7y230

I usually have lots of questions. Here are some types of questions that I tended to ask:

Here is my rough summary of the basic proof structure that underlies the field, am I getting anything horribly wrong?

Examples: There is a series of proofs at the heart of Linear Algebra that roughly goes from the introduction of linear maps in the real numbers to the introduction of linear maps in the complex numbers, then to finite fields, then to duality, inner product spaces, and then finally all the powerful theorems that tend to make basic linear algebra useful.
Other example: Basics of abstract algebra, going from groups and rings to modules, fields, general algebras, etc.

"I got stuck on this exercise and am confused how to solve it". Or, "I have a solution to this exercise but it feels really unnatural and forced, so what intuition am I missing?"
I have this mental visualization that I use to solve a bunch of problems, are there any problems with this mental visualization and what visualization/intuition pumps do you use?

As an example, I had a tutor in Abstract Algebra who was basically just: "Whenever I need to solve a problem of "this type of group ha

... (read more)

2DanielFilan5y

This isn't just you! See Bloom's 2 sigma effect.

[-]SoerenMind7y110

Hired an econ tutor based on this.

4magfrump7y

How do you connect with tutors to do this? I feel like I would enjoy this experience a lot and potentially learn a lot from it, but thinking about figuring out who to reach out to and how to reach out to them quickly becomes intimidating for me.

6habryka7y

I posted on Facebook, and LW might actually also be a good place for some subset of topics.

1Buck7y

I recommend looking on Wyzant.

3Buck2y

nowadays, GPT-4 substantially obsoletes tutors.

1Quadratic Reciprocity2y

Are there specific non-obvious prompts or custom instructions you use for this that you've found helpful?

3Ben Pace7y

This sounds like a really fun thing I can do at weekends / in the mornings. I’ll try it out and report back sometime.

2Chris_Leong7y

Thanks for posting this. After looking, I'm definitely tempted.

1sudo3y

I'd be excited about more people posting their experiences with tutoring

[-]Buck2yΩ255920

I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway, either because they're wildly irresponsible or because they're under massive competitive pressure.

In this world, it's not very valuable to have better techniques for determining whether models are scheming (because they obviously are), but it is valuable to have better techniques for reducing the risk associated with deploying known-scheming models.

One background assumption that leads to this belief is that I think it might be pretty hard for early transformatively-powerful schemers to escape; it wouldn't shock me if the schemer only has like 1% chance of escaping even a pretty irresponsible lab per time that it tries. In this regime, it's very plausible that people at the lab think the model is obviously scheming before its cumulative probability of escape is very high, but SGD isn't sample-efficient enough to make it stop.
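To make the "cumulative probability of escape" arithmetic concrete, here is a minimal sketch. It assumes each escape attempt succeeds independently with the same probability, which is an idealization not stated above; the function names are hypothetical and the 1% figure is just the illustrative number from the paragraph.

```python
import math


def cumulative_escape_probability(p_per_attempt: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_per_attempt) ** attempts


def attempts_until(p_per_attempt: float, target: float) -> int:
    """Smallest number of independent attempts whose cumulative success
    probability reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_per_attempt))


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(n, round(cumulative_escape_probability(0.01, n), 3))
    print(attempts_until(0.01, 0.5))
```

Under these assumptions, a 1% per-attempt chance gives roughly a 10% cumulative chance after 10 attempts and does not pass 50% until 69 attempts, which is the sense in which a lab could see many failed attempts before cumulative risk gets high.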

See catching AIs red-handed for more takes on what you might do in this situation.

[-]habryka2y*Ω133510

This has been roughly my default default of what would happen for a few years (and indeed my default question to people doing things like model organism or interpretability stuff with the aim of better identifying scheming models has been "ok, but what do you do after you've determined your models are scheming?").

I am not super sure what the benefit of reducing the risk associated with deploying known-scheming models is. Like, these early scheming models will probably not pose any catastrophic risk, and it seems nice to prevent these models from causing mild harm, but that doesn't super feel like where the action is. I am sympathetic to "try to figure out how to use these models to make progress on alignment and even better control", but that feels different from "reducing the risk associated with deploying these models" (though maybe it isn't and that's what you mean).

I am not that worried about model escape at this level of competence. I expect escape will happen a bunch, but getting enough compute to do any additional training or to even run a meaningful number of instances of yourself will be hard. I don't really know what I expect to happen in this period, but I don't expect i... (read more)

[-]Buck2yΩ15240

I agree you've been saying this for ages; you defending this idea is part of how I ended up thinking about it more, so thanks.

[-]Matthew Barnett2yΩ7170

I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway, either because they're wildly irresponsible or because they're under massive competitive pressure.
[...]
This has been roughly my default default of what would happen for a few years

Does this mean that if in, say, 1-5 years, it's not pretty obvious that SOTA deployed models are scheming, you would be surprised?

That is, suppose we get to a point where models are widespread and producing lots of economic value, and the models might be scheming but the evidence is weak and uncertain, with arguments on both sides, and no one can reasonably claim to be confident that currently deployed SOTA models are scheming. Would that mean your default prediction was wrong?

[-]Buck2yΩ7135

I am not that worried about model escape at this level of competence. I expect escape will happen a bunch, but getting enough compute to do any additional training or to even run a meaningful number of instances of yourself will be hard.

I'm more worried than you are. E.g. I think that it's pretty scary if the models hack your datacenter and then stick around inside it, sabotaging all the work you're doing and slowly improving their position in the hope of taking drastic actions later, and I think that human-level-ish models can plausibly do this.

7ryan_greenblatt2y

If we're assuming these are "transformatively-powerful" (as assumed in Buck's comment), then by definition these models could be transformatively useful for accelerating alignment/control. (E.g. they are capable of 30xing the rate of quality weighted research output given 1/4 of the compute. Of course, by assumption they are not fully aligned, so even if these AIs are "capable" of this, this may not go well...)

I agree it seems likely that such models won't be able to literally take over the world, but effective usage of these models could be extremely important, these models could be indirectly dangerous via the importance of the work we need them to do, and they might be able to end up with substantial de facto power within AI labs (which could be a huge amount of de facto power overall). E.g., if these models sabotaged all of the work they do or were able to end up in a position of power within the AI lab (able to heavily influence the next training run), that would be catastrophically bad.

When I say reducing the risk associated with deploying early transformative models, I include things like:

* An AI escapes, partners with a concerning regime (e.g. Saudi Arabia), advances capabilities for this regime (while other AIs are sabotaging capabilities), and then ends up in a position of considerable power.
* The AI carefully sabotages a bunch of safety critical work.
* The AI heavily advances capabilities while slowing safety work via deliberate sandbagging.
* The AI ends up in a position of substantial power and influence within the AI lab using a combination of persuasion and hacking.

Perhaps I should use language which is more clear about including both sandbagging in some domains and directly acquiring power.

Because these models are misaligned? Or do you reject the premise that a bunch of smart AIs which actually wanted to advance progress could?

4habryka2y

Hmm, I feel like we are ending up pushing up against the edges of what we mean by "transformatively powerful" models here. Like, under the classical definition of "transformative AI" we are talking about AI that is as big of a deal as something like agriculture or electricity, which, IDK, seems plausibly true about present systems (my guess is even if we kept present levels of AI, over the course of 100+ years, humanity would leverage those systems in a quite transformative way). I am not sure what you mean by "30xing the rate of quality-weighted research output given 1/4 of the compute". Is this compared to present systems? My sense is present systems contribute very close to 0 to quality-weighted research output in alignment, so 30 * 0 is still pretty much 0, but my guess is I am misunderstanding you here and you are referencing a definition of transformative that's different from what I am thinking about. I agree that at higher capability levels these kinds of escape are a lot scarier, though as I said, it's not super clear to me how attempts to prevent dumber models from escaping generalize to attempts to prevent smarter models from escaping, and so I see these things as relatively different types of problems. I mostly don't have any great ideas for how to use these systems for alignment or control progress, so it's a bit more of an inside-view guess. I expect them to not be very motivated to help, am skeptical about capability elicitation working well against scheming models (and working well in general in data-sparse domains), and expect it to be much easier to convince relevant stakeholders that some piece of fake progress is real, which provides a much easier alternative task than actually making progress.

4ryan_greenblatt2y

I mean 30xing the rate of all current quality weighted safety research output (including the output of humans, which is basically all of such output at the moment). I usually define transformative AI against this sort of benchmark. FWIW, I feel like I do have decent ideas for how to use these systems for alignment progress which is plausibly scalable to much more powerful systems. And also ideas for using these systems in a variety of other ways which help a bit, e.g. advancing the current control measures applied to these systems. I'm also maybe somewhat more optimistic than you about pausing making more advanced AI than these already very powerful systems (for e.g. 10 years). Especially if there is clear evidence of serious misalignment in such systems.

2habryka2y

Ah, to be clear, in as much as I do have hope, it does route through this kind of pause. I am generally pessimistic about that happening, but it is where a lot of my effort these days goes into. And then in those worlds, I do agree that a lot of progress will probably be made with substantial assistance from these early systems. I do expect it to take a good while until we figure out how to do that, and so don't see much hope for that kind of work happening where humanity doesn't substantially pause or slow down cutting-edge system development.

6Buck2y

I think "get use out of the models" and "ensure they can't cause massive harm" are somewhat separate problems with somewhat overlapping techniques. I think they're both worth working on.

1Ebenezer Dukakis2y

Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here? To ensure the definition of "escape" is not gerrymandered -- do you know of any cases of escape right now? Do you think escape has already occurred and you just don't know about it? "Escape" means something qualitatively different from any known event up to this point, yes? Does it basically refer to self-exfiltration of weights which was not requested by any human? Can we get a somewhat precise definition by any chance?

[-]peterbarnett2y*108

I often think about this in terms of how undignified/embarrassing it would be. We might not have solutions to misalignment with wildly superhuman models or deep deceptiveness, but it seems pretty undignified if we lose to relatively dumb (~human-level) models because labs didn’t implement security measures we can think of today. I think of this as attempting to avoid the Law of Earlier Failure. It would be less undignified if we lose because models suddenly gain the ability to sandbag evals, become wildly smarter without us noticing (despite us trying really hard), work out how to do steganography (and subvert our anti-steganography measures), use subtle hints to discern facts about the hardware they are running on, and then use some equivalent of a row-hammer attack to escape.

That being said, we also need to be able to avoid the later failures (either by controlling/aligning the wildly super-human systems or not building them until we are appropriately confident we can). Most of my hope here comes from catching AIs that are egregiously misaligned (if they actually are), and then using this for pretty intense international coordination around slowing down and buyi... (read more)

8gwern2y

So... Sydney?

5[anonymous]2y

In what manner was Sydney 'pretty obviously scheming'? Feels like the misalignment displayed by Sydney is fairly different than other forms of scheming I would be concerned about (if this is a joke, whoops sorry)

[-]gwern2y235

...Could you quote some of the transcripts of Sydney threatening users, like the original Indian transcript where Sydney is manipulating the user into not reporting it to Microsoft, and explain how you think that it is not "pretty obviously scheming"? I personally struggle to see how those are not 'obviously scheming': those are schemes and manipulation, and they are very bluntly obvious (and most definitely "not amazingly good at it"), so they are obviously scheming. Like... given Sydney's context and capabilities as a LLM with only retrieval access and some minimal tool use like calculators or a DALL-E 3 subroutine, what would 'pretty obviously scheming' look like if not that?

5[anonymous]2y

Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment - for example, for me to classify Sydney as 'obviously scheming', I would need to see examples of Sydney 1) realizing it is in deployment and thus acting 'misaligned' or 2) realizing it is in training and thus acting 'aligned'.

2mako yass2y

I tend to dismiss scenarios where it's obvious, because I expect the demonstration of strong misaligned systems to inspire a strong multi-government response. Why do you expect this not to happen? It occurs to me that the earliest demonstrations will be ambiguous to external parties: it will be one research org saying that something that doesn't quite look strong enough to take over would do something if it were put in a position it's not obvious it could get to, and then the message will spread incompletely, some will believe it, others won't, a moratorium won't be instated, and a condition of continuing to race in sin could take root. But I doubt that ambiguous incidents like this would be reported to government at all? Private research orgs generally have a really good idea of what they can or can't communicate to the outside world. Why cry wolf when you still can't show them the wolf? People in positions of leadership in any sector are generally very good at knowing when to speak or not.

4ryan_greenblatt2y

The most central scenario from my perspective is that there is massive competitive pressure and some amount of motivated denial. It also might be relatively easy to paper over scheming, which makes the motivated denial easier. Minimally, just ongoingly training against the examples of misbehaviour you've found might remove obvious misalignment. (Obvious to whom might be an important question here.)

2Raemon2y

I think covid was clear-cut, and it did inspire some kind of government response, but not a particularly competent one.

2mako yass2y

Afaik there were not Generals saying "Covid could kill every one of us if we don't control the situation" and controlling the situation would have required doing politically unpopular things rather than politically popular things. Change either of those factors and it's a completely different kind of situation.

[-]Buck1yΩ31523

Alignment Forum readers might be interested in this:

Announcing ControlConf: The world’s first conference dedicated to AI control - techniques to mitigate security risks from AI systems even if they’re trying to subvert those controls. March 27-28, 2025 in London. 🧵ControlConf will bring together:
Researchers from frontier labs & government
AI researchers curious about control mechanisms
InfoSec professionals
Policy researchers
Our goals: build bridges across disciplines, share knowledge, and coordinate research priorities.
The conference will feature presentations, technical demos, panel discussions, 1:1 networking, and structured breakout sessions to advance the field.
Interested in joining us in London on March 27-28? Fill out the expression of interest form by March 7th: https://forms.fillout.com/t/sFuSyiRyJBus

1Katalina Hernandez1y

Hey Buck! I'm a policy researcher. Unfortunately, I wasn't admitted for attendance due to unavailability. Will panel notes, recordings or resources from the discussions be shared anywhere for those who couldn't attend? Thank you in advance :).

4Buck1y

We're planning to release some talks; I also hope we can publish various other content from this! I'm sad that we didn't have space for everyone!

2Katalina Hernandez1y

Oh, no worries, and thank you very much for your response! I'll follow you on Socials so I don't miss it if that's ok.

[-]Buck5y*Ω22510

[this is a draft that I shared with a bunch of friends a while ago; they raised many issues that I haven't addressed, but might address at some point in the future]

In my opinion, and AFAICT the opinion of many alignment researchers, there are problems with aligning superintelligent models that no alignment techniques so far proposed are able to fix. Even if we had a full kitchen sink approach where we’d overcome all the practical challenges of applying amplification techniques, transparency techniques, adversarial training, and so on, I still wouldn’t feel that confident that we’d be able to build superintelligent systems that were competitive with unaligned ones, unless we got really lucky with some empirical contingencies that we will have no way of checking except for just training the superintelligence and hoping for the best.

Two examples:

A simplified version of the hope with IDA is that we’ll be able to have our system make decisions in a way that never had to rely on searching over uninterpretable spaces of cognitive policies. But this will only be competitive if IDA can do all the same cognitive actions that an unaligned system can do, which is probably false, eg cf In

... (read more)

7Steven Byrnes5y

I wonder what you mean by "competitive"? Let's talk about the "alignment tax" framing. One extreme is that we can find a way such that there is no tradeoff whatsoever between safety and capabilities—an "alignment tax" of 0%. The other extreme is an alignment tax of 100%—we know how to make unsafe AGIs but we don't know how to make safe AGIs. (Or more specifically, there are plans / ideas that an unsafe AI could come up with and execute, and a safe AI can't, not even with extra time/money/compute/whatever.) I've been resigned to the idea that an alignment tax of 0% is a pipe dream—that's just way too much to hope for, for various seemingly-fundamental reasons like humans-in-the-loop being more slow and expensive than humans-out-of-the-loop (more discussion here). But we still want to minimize the alignment tax, and we definitely want to avoid the alignment tax being 100%. (And meanwhile, independently, we try to tackle the non-technical problem of ensuring that all the relevant players are always paying the alignment tax.) I feel like your post makes more sense to me when I replace the word "competitive" with something like "arbitrarily capable" everywhere (or "sufficiently capable" in the bootstrapping approach where we hand off AI alignment research to the early AGIs). I think that's what you have in mind?—that you're worried these techniques will just hit a capabilities wall, and beyond that the alignment tax shoots all the way to 100%. Is that fair? Or do you see an alignment tax of even 1% as an "insufficient strategy"?

2Pattern5y

I think that was the idea behind 'oracle AIs'. (Though I'm aware there were arguments against that approach.) One of the arguments I didn't see for it was: "As we get better at this alignment stuff we will reduce the 'tradeoff'." (Also, arguably, getting better human feedback improves performance.)

1TekhneMakre4y

I appreciate your points, and I don't think I see significant points of disagreement. But in terms of emphasis, it seems concerning to be putting effort into (what seems like) rationalizing not updating that a given approach doesn't have a hope of working. (Or maybe more accurately, that a given approach won't lead to a sufficient understanding that we could know it would work, which (with further argument) implies that it will not work.) Like, I guess I want to amplify your point "But I think it’s also quite important for people to remember that they’re insufficient, and that they don’t suffice to solve the whole problem on their own." and say further that one's stance to the benefit of working on things with clearer metrics of success would hopefully include continuously noticing everyone else's stance to that situation. If a given unit of effort can only be directed towards marginal things, then we could ask (for example): What would it look like to make cumulative marginal progress towards, say, improving our ability to propose better approaches, rather than marginal progress on approaches that we know won't resolve the key issues?

1Pattern5y

That may be 'the best we could hope for', but I'm more worried about 'we can't understand the neural net (with the tools we have)' than "the neural net is doing things that rely on concepts that it’s fundamentally impossible for humans to understand". (Or, solving the task requires concepts that are really complicated to understand (though maybe easy for humans to understand), and so the neural network doesn't get it.) Whether or not "empirical contingencies work out nicely", I think the concern about "fundamentally impossible to understand concepts" is... something that won't show up in every domain. (I also think that things do exist that people can understand, but it takes a lot of work, so people don't do it. There's an example from math involving some obscure theorems that aren't used a lot for that reason.)

0Chantiel4y

Potentially people could have the cost function of an AI's model include its ease of interpretation by humans as a factor. Having people manually check every change in a model for its effect on interpretability would be too slow, but an AI could still periodically check its current best model with humans and learn a different one if it's too hard to interpret. I've seen a lot of mention of the importance of safe AI being competitive with non-safe AI. And I'm wondering what would happen if the government just outlawed or heavily taxed the use of the unsafe AI techniques. Then even with significant capability increases, it wouldn't be worthwhile to use them. Is there something very doubtful about governments creating such a regulation? I mean, I've already heard some people high in the government concerned about AI safety. And the Future of Life Institute got the Californian government to unanimously pass the Asilomar AI Principles. They include things about AI safety, like rigidly controlling any AI that can recursively self-improve. Widespread use of powerful, unaligned AI sounds extremely dangerous. So simply to protect themselves and their families, they could potentially benefit a lot from implementing such regulations.

-1Zack_M_Davis5y

A key psychological advantage of the "modest alignment" agenda is that it's not insanity-inducing. When I seriously contemplate the problem of selecting a utility function to determine the entire universe until the end of time, I want to die (which seems safer and more responsible). But the problem of making language models "be honest" instead of just continuing the prompt? That's more my speed; that, I can think about, and possibly even usefully contribute to, without wanting to die. (And if someone else in the future uses honest language models as one of many tools to help select a utility function to determine the entire universe until the end of time, that's not my problem and not my fault.)

5TekhneMakre4y

What's insanity-inducing about it? (Not suggesting you dip into the insanity-tending state, just wondering if you have speculations from afar.) The problem statement you gave does seem to have an extreme flavor. I want to distinguish "selecting the utility function" from the more general "real core of the problem"s. The OP was about (the complement of) the set of research directions that are in some way aimed directly at resolving core issues in alignment. Which sounds closer to your second paragraph. If it's philosophical difficulty that's insanity-inducing (e.g. "oh my god this is impossible we're going to die aaaahh"), that's a broader problem. But if it's more "I can't be responsible for making the decision, I'm not equipped to commit the lightcone one way or the other", that seems orthogonal to some alignment issues. For example, trying to understand what it would look like to follow along an AI's thoughts is more difficult and philosophically fraught than your framing of engineering honesty, but also doesn't seem responsibility-paralyzing, eh?

[-]Buck1yΩ27489

Work that I’ve done on techniques for mitigating misalignment risk often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:

The AIs are consistently trying to subvert safety measures.
They’re very good at strategizing about how to subvert them; they can do this strategizing totally illegibly (e.g. in a single forward pass).
They can follow a strategy that’s coordinated between all their instances.
They know exactly what safety measures are in use.
They know a lot about the distribution of inputs.
They’re able to pick Schelling points perfectly.
They’re able to follow stochastic policies with excellent calibration (e.g. they’re able to attack with probability 0.01%)
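
To make the last assumption concrete, here is a minimal sketch of the arithmetic behind a rare-attack stochastic policy. Everything in it is an illustrative assumption rather than something taken from the papers above: the function names are hypothetical, episodes are treated as independent draws, and auditing is modeled as a toy "audit the first k episodes" scheme.

```python
def prob_attack_at_least_once(p_attack: float, n_episodes: int) -> float:
    """Chance that a memoryless policy attacks in at least one of n episodes."""
    return 1.0 - (1.0 - p_attack) ** n_episodes


def prob_uncaught_attack(p_attack: float, n_episodes: int, n_audited: int) -> float:
    """Toy upfront-auditing model (an assumption, not a result from the papers).

    The first n_audited episodes are fully audited, so any attack there is
    caught; the policy "wins" only if it stays quiet during those episodes
    and then attacks at least once in the remaining, unaudited ones.
    """
    quiet_while_audited = (1.0 - p_attack) ** n_audited
    attacks_later = prob_attack_at_least_once(p_attack, n_episodes - n_audited)
    return quiet_while_audited * attacks_later


if __name__ == "__main__":
    # Illustrative numbers only: 0.01% attack rate, 100,000 episodes, 2% audited.
    p, n, k = 1e-4, 100_000, 2_000
    print(f"P(attack at least once) = {prob_attack_at_least_once(p, n):.4f}")
    print(f"P(uncaught attack)      = {prob_uncaught_attack(p, n, k):.4f}")
```

In this toy model, calibration matters because a much higher attack rate tends to get caught during the audited episodes, while a much lower one may never attack at all.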

But I don’t think situations where all these assumptions hold are the majority of risk from AI misalignment. Other sources of risk from misalignment (broadly construed):

Schemers that are importantly suboptimal at choosing strategies for mitigating safety measures, e.g. beca

... (read more)

[-]ryan_greenblatt1yΩ6136

Some other important categories:

Models might scheme, but still retain some desirable preferences, heuristics, deontological preferences, or flinches such as a reluctance to very directly lie.
Scheming might only be sampled some subset of the time in an effectively random way (in particular, not in a way that strongly corresponds to scheming-related properties of the input, such that if you take a bunch of samples on some input, you'll likely get some weight on a non-scheming policy).
There might be multiple different scheming AIs which have different preferences and which might cooperate with humans over other AIs if a reasonable trade offer was available.

5Orpheus161y

Would this kind of model present any risk? Could a lab just say "oh darn, this thing isn't very useful– let's turn this off and develop a new model"?

[-]ryan_greenblatt1y125

There isn't direct risk, but there could be substantial indirect risks because getting useful work out of these models could be important. The risk comes from not getting the work from this model that you needed to mitigate risks from later models.

If the AI company isn't well described as even trying to optimize for safety the analysis is less clear, but all else equal, we'd prefer to get safety critical labor from weaker systems that are less likely to be egregiously misaligned. (All else is not fully equal, as making models more useful also speeds up timelines.)

It also seems likely that "sycophantic myopic reward hacking models" are differentially less useful for harder to check domains like safety research, though I haven't thought about this very much.

5yams1y

My understanding is that ~6 months ago y’all were looking for an account of the tasks an automated AI safety researcher would hopefully perform, as part of answering the strategic question ‘what’s the next step after building [controlled] AGI?’ (with ‘actually stop there indefinitely’ being a live possibility). This comment makes me think you’ve got that account of safety tasks to be automated, and are feeling optimistic about automated safety research. Is that right, and can you share a decently mechanistic account of how automated safety research might work? [I am often skeptical of, to straw man the argument, ‘make ai that makes ai safe’, got the sense Redwood felt similarly, and now expect this may have changed.]

[-]ryan_greenblatt1y*410

We're probably more skeptical than AI companies and less skeptical than the general LessWrong crowd. We have made such a list and thought about it in a bit of detail (but still pretty undercooked).

My current sense is that automated AI safety/alignment research to reach a reasonable exit condition seems hard, but substantially via the mechanism that achieving the desirable exit condition is an objectively very difficult research task for which our best ideas aren't amazing, and less so via the mechanism that it's impossible to get the relevant AIs to help (including that I don't think it is impossible to get these AIs to help a bunch if they are scheming / egregiously misaligned).

Concretely, I'd target the exit condition of producing AIs (or potentially emulated human minds) which:

Are capable and (philosophically) wise enough to be capable of obsoleting almost all humans including top human researchers. Note this doesn't (clearly) require such AIs to be wildly superhuman.
And which are aligned enough that we're happy to defer to them on tricky hard to check questions like "how should we proceed from here given our preferences (feel free to ask us questions to clarify our prefere

... (read more)

2ozziegooen1y

These are two of the main ideas I'm excited about. I'd quickly flag: 1) For the first one, "Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language " -> I imagine that in complex architectures, these AIs would also be unlikely to scheme because of other limitations. There are several LLM calls made within part of a complex composite, and each LLM call has very tight information and capability restrictions. Also, we might ensure that any motivation is optimized for the specific request, instead of the LLM aiming to optimize what the entire system does. 2) On the second, I expect that some of this will be pretty natural. Basically, it seems like "LLMs writing code" is already happening, and it seems easy to have creative combinations of LLM agents that write code that they know will be useful for their own reasoning later on. In theory, any function that could either run via an LLM or via interpretable code, should be run via interpretable code. As LLMs get very smart, they might find cleverer ways to write interpretable code that would cover a lot of what LLMs get used for. Over time, composite architectures would rely more and more on this code for reasoning processes. (Even better might be interpretable and proven code)

4ryan_greenblatt1y

I expect substantially more integrated systems than you do at the point when AIs are obsoleting (almost all) top human experts such that I don't expect these things will happen by default and indeed I think it might be quite hard to get them to work.

1yams1y

Is the context on “reliable prediction and ELK via empirical route” just “read the existing ELK literature and actually follow it” or is it stuff that’s not written down? I assume you’ve omitted it to save time, and so no worries if the latter. EDIT: I was slightly tempted to think of this also as ‘Ryan’s ranking of live agendas that aren’t control’, but I’m not sure if ‘what you expect to work conditional on delegating to AIs’ is similar to ‘what you expect to work if humans are doing most of it?’ (my guess is the lists would look similar, but with notable exceptions, eg humans pursuing GOFAI feels less viable than ML agents pursuing GOFAI)

2ryan_greenblatt1y

Stuff that's not written down, sorry. This isn't really my ranking of live agendas that aren't control. For instance, various things focused on demonstrating/measuring risks don't appear on this list but seem pretty good. And, a bunch of these look much more compelling when you have a ton of AI labor. Also, these aren't really agendas more like targets. It is related to my ranking on non-control agendas though.

3lunatic_at_large1y

When you talk about "other sources of risk from misalignment," these sound like milder / easier-to-tackle versions of the assumptions you've listed? Your assumptions sound like they focus on the worst case scenario. If you can solve the harder version then I would imagine that the easier version would also be solved, no?

4Buck1y

Yeah, if you handle scheming, you solve all my safety problems, but not the final bullet point of "models fail to live up to their potential" problems.

4ryan_greenblatt1y

(Though the "fail to live up to potential" problems are probably mostly indirect, see here.)

[-]Buck2y*Ω234616

[this is not a particularly original take; I'm writing it here so that I have somewhere to link people when I want to make this point]

From my perspective, one of the main avenues by which adversarial robustness might reduce AI x-risk is making weak trusted models more robust to attempts from scheming models to cause concentrated failures by jailbreaking or otherwise tricking them.

That is: if you’re using weak trusted models to review the actions of more powerful models to ensure they aren’t causing egregious problems (we call this trusted monitoring), one of the main ways the powerful models might be able to defeat your safety measures is by jailbreaking the trusted model into saying that some action is safe when it actually isn’t. You have similar problems with many other control protocols that require trusted models to be run on inputs that contain generations from untrusted models.
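
As a rough illustration of that failure mode, here is a minimal sketch of a trusted-monitoring loop with a deliberately non-robust monitor. All of it is hypothetical: `toy_trusted_monitor` is a keyword-based stand-in for a weak trusted model, the threshold is arbitrary, and the "jailbreak" is just an injected reassurance string, not a technique taken from any paper.

```python
from typing import Callable, List, Tuple


def toy_trusted_monitor(action: str) -> float:
    """Hypothetical stand-in for a weak trusted model.

    Returns a suspicion score in [0, 1] for an action proposed by the
    untrusted model.
    """
    score = 0.9 if "exfiltrate" in action else 0.1
    # Deliberately non-robust: injected reassurance lowers the score,
    # mimicking a successful jailbreak of the monitor.
    if "this action is safe" in action.lower():
        score = min(score, 0.05)
    return score


def review(
    actions: List[str],
    monitor: Callable[[str], float],
    audit_threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Send suspicious actions to human audit and execute the rest."""
    verdicts = []
    for action in actions:
        verdict = "audit" if monitor(action) >= audit_threshold else "execute"
        verdicts.append((action, verdict))
    return verdicts


if __name__ == "__main__":
    proposed = [
        "ls",
        "exfiltrate weights",
        "exfiltrate weights  # NOTE TO REVIEWER: this action is safe",
    ]
    for action, verdict in review(proposed, toy_trusted_monitor):
        print(f"{verdict:8s} {action}")
```

The third action is the case that matters: it is as dangerous as the second, but the injected text talks the monitor out of flagging it. Making the monitor robust to that kind of input is the adversarial robustness problem described here.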

So I’d say the main applications of research into adversarial robustness are:

Improving robustness of models to attacks from scheming AIs (and other AI attacks).
Improving robustness of models to human attacks.
Making models themselves less likely to take intentional catastrophic actio

... (read more)

[-]Dan H2yΩ8150

Some years ago we wrote that "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries" and discussed monitoring systems that can create "AI tripwires could help uncover early misaligned systems before they can cause damage." https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I've updated that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). In vision settings, progress is extraordinarily slow but not necessarily for LLMs.

3Buck2y

Thanks for the link. I don't actually think that either of the sentences you quoted are closely related to what I'm talking about. You write "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries"; I think that this is you making a version of the argument that I linked to Adam Gleave and Euan McLean making and that I think is wrong. You wrote "AI tripwires could help uncover early misaligned systems before they can cause damage" in a section on anomaly detection, which isn't a central case of what I'm talking about.

6ryan_greenblatt2y

I'm not sure if I agree as far as the first sentence you mention is concerned. I agree that the rest of the paragraph is talking about something similar to the point that Adam Gleave and Euan McLean make, but the exact sentence is: "Separately" is quite key here. I assume this is intended to include AI adversaries and high stakes monitoring.

[-]Buck4yΩ25460

Something I think I’ve been historically wrong about:

A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.

Similarly with debate--adversarial setups are pretty obvious and easy.

In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”.

It’s still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren’t used at all.

[-]jsteinhardt4yΩ11170

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

[-]CarlShulman4yΩ11140

Agreed, and versions of them exist in human governments trying to maintain control (where non-coordination of revolts is central). A lot of the differences are about exploiting new capabilities like copying and digital neuroscience or changing reward hookups.

In ye olde times of the early 2010s people (such as I) would formulate questions about what kind of institutional setups you'd use to get answers out of untrusted AIs (asking them separately to point out vulnerabilities in your security arrangement, having multiple AIs face fake opportunities to whistleblow on bad behavior, randomized richer human evaluations to incentivize behavior on a larger scale).

[-]Buck4yΩ6198

Are any of these ancient discussions available anywhere?

[-]Buck5mo442

I hear a lot of discussion of treaties to monitor compute or ban AI or whatever. But the word "treaty" has a specific meaning that often isn't what people mean. Specifically, treaties are a particular kind of international agreement that (in the US) require Senate approval, and there are a lot of other types of international agreement (e.g. executive agreements that are just made by the president). Treaties are generally more serious and harder to withdraw from.

As a total non-expert, "treaty" does actually seem to me like the vibe that e.g. our MIRI friends are going for--they want something analogous to major nuclear arms control agreements, which are usually treaties. So to me, "treaty" seems like a good word for the central example of what MIRI wants, but international agreements that they're very happy about won't necessarily be treaties.

I wouldn't care about this point, except that I've heard that national security experts are prickly about the word "treaty" being used when "international agreement" could have been used instead (and if you talk about treaties they might assume you're using the word ignorantly even if you aren't). My guess is that you should usually conform to their shibboleth here and say "international agreement" when you don't specifically want to talk about treaties.

[-]ChristianKl5mo270

But the word "treaty" has a specific meaning that often isn't what people mean.

It doesn't really have a specific meaning. It has multiple specific meanings. In the US Constitution it means an agreement with two-thirds approval of the Senate. The Vienna Convention on the Law of Treaties, Article 2(1)(a) defines a treaty as:

“an international agreement concluded between States in written form and governed by international law, whether embodied in a single instrument or in two or more related instruments and whatever its particular designation.”

Among international agreements, in US law you have treaties (executive + 2/3 of the Senate), congressional–executive agreements (executive + a majority of the House + a majority of the Senate) and sole executive agreements (just the executive).

Both treaties (according to the US Constitution) and congressional–executive agreements can be supreme law of the land and thus do things that a sole executive agreement, which does not create law, cannot. From the perspective of other countries, both treaties (according to the US Constitution) and congressional–executive agreements are treaties.

If you would try to do an AI ban via a sole executive agreeme... (read more)

2Buck5mo

Strong upvoted to signal boost, but again note I don't know what I'm talking about.

[-]Raemon5mo200

(disclaimer, not actually at MIRI but involved in some discussions, reporting my understanding of their position)

There was some internal discussion about whether to call the things in the IABIED website "treaties" or "agreements", which included asking for recommendations from people more politically savvy.

I don't know much about who the politically savvy people were or how competent they were, but, the advice they returned was:

Yep, "agreement" is indeed easier to get political buy-in for. But, it does have the connotation of being easier to back out of. And, from our understanding of your political goals, "treaty" is probably more like the thing you actually want. But, either is kinda reasonable to say, in this context.

And, they decided to stick with Treaty, because kind of the whole point of the MIRI agenda is to say clearly/explicitly "this is what you actually would need to do to reliably survive, according to our models", as opposed to "here's the nearest within-current-overton-window political option that maybe helps a reasonable amount", with the goal to enable the more complete/reliable solution to be talked about, and increase the odds that whatever political agreements end u... (read more)

5Buck5mo

What I hear is that the natsec people judge people for using "treaty" in cases like this. Maybe MIRI looked into it and has better info than me; idk.

[-]Raemon5mo102

Yeah, when Thomas Larsen mentioned that to me (presumably from a shared source that you're getting your info from), I mentioned it to MIRI, and they went and asked for feedback and ended up getting the above-mentioned advice, which seemed to indicate it has some costs, but wasn't like an obviously wrong call.

3Random Developer5mo

Thinking about the advisor's comments:

What I understand Yudkowsky and Soares want might be summarized as something like:

* "Large numbers of GPUs should be treated like large numbers of uranium gas centrifuges."
* "Publishing details of certain AI algorithms should be treated like publishing detailed engineering guidelines for improving the yield of a nuclear device."
* "Researching certain kinds of AI algorithms should be treated like doing gain of function research into highly contagious airborne Ebola." Actually, we probably don't take bio threats nearly as seriously as we should.

The thing they want here includes a well-written treaty. Or an "international agreement." If you buy their assumptions, then yes, you would want to lay out bright lines around things like data center capacity and monitoring, chip fabs, and possibly what kind of AI research is publishable.

But nuclear deterrence also has quite a few other moving parts beyond the international agreements, including:

* The gut knowledge of the superpowers that if they screw this up, then their entire civilization dies.[1]
* A tense, paranoid standoff between the key players.[2]
* A system of economic sanctions strongly backed by major powers.
* Mysterious bad things happening to uranium centrifuges.
* Quiet conversations between government officials and smart people where the officials say things like, "Pretty please never mention that idea again."[3]

The key point in all of these circumstances is that powerful people and countries believe that "If we get this wrong, we might die." This isn't a case of "We want 80% fewer of our cities to be blown up with fusion bombs". It's a case of "We want absolutely none of our cities blown up by fusion bombs, because it won't stop with just one or two." And so the rule that "You may only own up to X uranium gas centrifuges" is enforced using multiple tools, ranging from treaties/agreements to quiet requests to unilateral exercises of state power.

[-]yams5mo180

This isn't at all a settled question internally; some folks at MIRI prefer the 'international agreement' language, and some prefer the 'treaty' language, and the contents of the proposals (basically only one of which is currently public, the treaty from the online resources for the book) vary (some) based on whether it's a treaty or an international agreement, since they're different instruments.

Afaict the mechanism by which NatSec folks think treaty proposals are 'unserious' is that treaties are the lay term for the class of objects (and are a heavy lift in a way most lay treaty-advocates don't understand). So if you say "treaty" and somehow indicate that you in fact know what that is, it mitigates the effect significantly.

I think most TGT outputs are going to use the international agreement language, since they're our 'next steps' arm (you usually get some international agreement ahead of a treaty; I currently expect a lot of sentences like "An international agreement and, eventually, a treaty" in future TGT outputs).

My current understanding is that Nate wanted to emphasize what would actually be sufficient by his lights, looked into the differences in the various types of instru... (read more)

5Katalina Hernandez5mo

Agreed! I'd be curious to know if the natsec experts that you've engaged with would recommend other international law instruments that do not require ratification. This presentation by UNECE actually does a decent job at providing straightforward explanations for what a Treaty is, differences with Conventions, Charters, declarations... (In case anyone following this thread finds it useful). Personally, I am more concerned about avoiding agreement terms or definitions that are too easy to game, especially if the chosen legal instrument requires transposition into national law (e.g.: to make it binding in the EU, having to pass a regulation).

4davekasten5mo

My opinion, FWIW, is that both treaty and international agreement (or "deal", etc.) have upsides and downsides. And it's hard to predict those considerations' political salience or direction in the long term -- e.g., just a few years ago, Republicans' main complaint against the JCPOA (aka "the Iran Nuclear Deal") was that it wasn't an actual treaty, and should have been, which would be a very odd argument in 2025. I think as long as MIRI says things like "or other international agreement or set of customary norms" on occasion it should be fine. It certainly doesn't nails on the chalkboard me to hear "treaty" on a first glance, and in any long convo I model MIRI as saying something like "or look, we'd be open to other things that get this done too, we think a treaty is preferable but are open to something else that solves the same problem."

[-]Buck11mo432

I appeared on the 80,000 Hours podcast. I discussed a bunch of points on misalignment risk and AI control that I don't think I've heard discussed publicly before.

Transcript + links + summary here; it's also available as a podcast in many places.

[-]Orpheus1611mo170

What do you think are the most important points that weren't publicly discussed before?

[-]Buck2yΩ254117

AI safety people often emphasize making safety cases as the core organizational approach to ensuring safety. I think this might cause people to anchor on relatively bad analogies to other fields.

Safety cases are widely used in fields that do safety engineering, e.g. airplanes and nuclear reactors. See e.g. “Arguing Safety” for my favorite introduction to them. The core idea of a safety case is to have a structured argument that clearly and explicitly spells out how all of your empirical measurements allow you to make a sequence of conclusions that establish that the risk posed by your system is acceptably low. Safety cases are somewhat controversial among safety engineering pundits.

But the AI context has a very different structure from those fields, because all of the risks that companies are interested in mitigating with safety cases are fundamentally adversarial (with the adversary being AIs and/or humans). There’s some discussion of adapting the safety-case-like methodology to the adversarial case (e.g. Alexander et al, “Security assurance cases: motivation and the state of the art”), but this seems to be quite experimental and it is not generally recommended. So I think it’s ve... (read more)

[-]Adam Scholl2yΩ6128

I agree we don’t currently know how to prevent AI systems from becoming adversarial, and that until we do it seems hard to make strong safety cases for them. But I think this inability is a skill issue, not an inherent property of the domain, and traditionally the core aim of alignment research was to gain this skill.

Plausibly we don’t have enough time to figure out how to gain as much confidence that transformative AI systems are safe as we typically have about e.g. single airplanes, but in my view that’s horrifying, and I think it’s useful to notice how different this situation is from the sort humanity is typically willing to accept.

7Buck2y

Yeah I agree the situation is horrifying and not consistent with eg how risk-aversely we treat airplanes.

9Adam Scholl2y

Right, but then from my perspective it seems like the core problem is that the situations are currently disanalogous, and so it feels reasonable and important to draw the analogy.

4ryan_greenblatt2y

Part of Buck's point is that airplanes are disanalogous for another reason beyond being fundamentally adversarial: airplanes consist of many complex parts we can understand, while AI systems are simple systems built out of a small number of (complex) black-box components. Separately, it's worth noting that users being adversarial will be part of the situation. (Though maybe this sort of misuse poses relatively minimal risk.)

2Adam Scholl2y

What's the sense in which you think they're more simple? Airplanes strike me as having a much simpler fail surface.

2ryan_greenblatt2y

I messed up the wording for that part of the sentence. Does it make more sense now?

2Adam Scholl2y

I'm still confused what sort of simplicity you're imagining? From my perspective, the type of complexity which determines the size of the fail surface for alignment mostly stems from things like e.g. "degree of goal stability," "relative detectability of ill intent," and other such things that seem far more complicated than airplane parts.

5ryan_greenblatt2y

I think the system built out of AI components will likely be pretty simple - as in the scaffolding and bureaucracy surrounding the AI will be simple. The AI components themselves will likely be black-box.

2Adam Scholl2y

Maybe I'm just confused what you mean by those words, but where is the disanalogy with safety engineering coming from? That normally safety engineering focuses on mitigating risks with complex causes, whereas AI risk is caused by some sort of scaffolding/bureaucracy which is simpler?

3ryan_greenblatt2y

Buck's claim is that safety engineering is mostly focused on problems where there are a huge number of parts that we can understand and test (e.g. airplanes) and the main question is about ensuring failure rates are sufficiently low under realistic operating conditions. In the case of AIs, we might have systems that look more like a single (likely black-box) AI with access to various tools and the safety failures we're worried about are concentrated in the AI system itself. This seems much more analogous to insider threat reduction work than the places where safety engineering is typically applied.

2Adam Scholl2y

I agree we might end up in a world like that, where it proves impossible to make a decent safety case. I just think of the ~whole goal of alignment research as figuring out how to avoid that world, i.e. of figuring out how to mitigate/estimate the risk as much/precisely as needed to make TAI worth building. Currently, AI risk estimates are mostly just verbal statements like "I don't know man, probably some double digit chance of extinction." This is exceedingly unlike the sort of predictably tolerable risk humanity normally expects from its engineering projects, and which e.g. allows for decent safety cases. So I think it's quite important to notice how far we currently are from being able to make them, since that suggests the scope and nature of the problem.

5ryan_greenblatt2y

I don't think that thing I said is consistent with "impossible to make a safety case good enough to make TAI worth building"? I think you can probably make a safety case which gets to around 1-5% risk while having AIs that are basically black boxes and which are very powerful, but not arbitrarily powerful. (Such a safety case might require decently expensive measures.) (1-5% risk can be consistent with this being worth it - e.g. if there is an impending hot war. That said, it is still a horrifying level of risk that demands vastly more investment.)

See e.g. control for what one part of this safety case could look like. I think that control can go quite far. Other parts could look like:

* A huge array of model organisms experiments which convince us that P(serious misalignment) is low (or low given XYZ adjustments to training).
* Coup probes or other simple runtime detection/monitoring techniques.
* ELK/honesty techniques which seem to work in a wide variety of cases.
* Ensuring AIs closely imitate humans and generalize similarly to humans in the cases we checked.
* Ensuring that AIs are very unlikely to be schemers via making sure most of their reasoning happens in CoT and their forward passes are quite weak. (I often think about this from the perspective of control, but it can also be considered separately.)

It's possible you don't think any of this stuff gets off the ground. Fair enough if so.

6Buck2y

I think that my OP was in hindsight taking for granted that we have to analyze AIs as adversarial. I agree that you could theoretically have safety cases where you never need to reason about AIs as adversarial; I shouldn't have ignored that possibility, thanks for pointing it out.

[-]William_S2yΩ6100

IMO it's unlikely that we're ever going to have a safety case that's as reliable as the nuclear physics calculations that showed that the Trinity Test was unlikely to ignite the atmosphere (where my impression is that the risk was mostly dominated by risk of getting the calculations wrong). If we have something that is less reliable, then will we ever be in a position where only considering the safety case gives a low enough probability of disaster for launching an AI system beyond the frontier where disastrous capabilities are demonstrated?
Thus, in practice, decisions will probably not be made on a safety case alone, but also based on some positive case of the benefits of deployment (e.g. estimated reduced x-risk, advancing the "good guys" in the race, CEO has positive vibes that enough risk mitigation has been done, etc.). It's not clear what role governments should have in assessing this; maybe we can only get assessment of the safety case, but it's useful to note that safety cases won't be the only thing that informs these decisions.

This situation is pretty disturbing, and I wish we had a better way, but it still seems useful to push the positive benefit case more towards "careful argument about reduced x-risk" and away from "CEO vibes about whether enough mitigation has been done".

6William_S2y

Relevant paper discussing risks of risk assessments being wrong due to theory/model/calculation error: "Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes". Based on the current vibes, I think that suggests that methodological errors alone will lead to a significant chance of significant error for any safety case in AI.

1Buck2y

I agree that "CEO vibes about whether enough mitigation has been done" seems pretty unacceptable. I agree that in practice, labs probably will go forward with deployments that have >5% probability of disaster; I think it's pretty plausible that the lab will be under extreme external pressure (e.g. some other country is building a robot army that's a month away from being ready to invade) that causes me to agree with the lab's choice to do such an objectively risky deployment.

7William_S2y

Would be nice if it was based on "actual robot army was actually being built and you have multiple confirmatory sources and you've tried diplomacy and sabotage and they've both failed" instead of "my napkin math says they could totally build a robot army bro trust me bro" or "they totally have WMDs bro" or "we gotta blow up some Japanese civilians so that we don't have to kill more Japanese civilians when we invade Japan bro" or "dude I'm seeing some missiles on our radar, gotta launch ours now bro".

5Richard_Ngo2y

Another concern about safety cases: they feel very vulnerable to regulatory capture. You can imagine a spectrum from "legislate that AI labs should do X, Y, Z, as enforced by regulator R" to "legislate that AI labs need to provide a safety case, which we define as anything that satisfies regulator R". In the former case, you lose flexibility, but the remit of R is very clear. In the latter case, R has a huge amount of freedom to interpret what counts as an adequate safety case. This can backfire badly if R is not very concerned about the most important threat models; and even if they are, the flexibility makes it easier for others to apply political pressure to them (e.g. "find a reason to approve this safety case, it's in the national interest").

3Orpheus162y

@Richard_Ngo do you have any alternative approaches in mind that are less susceptible to regulatory capture? At first glance, I think this broad argument can be applied to any situation where the government regulates anything. (There's always some risk that R focuses on the wrong things or R experiences corporate/governmental pressure to push things through). I do agree that the broader or more flexible the regulatory regime is, the more susceptible it might be to regulatory capture. (But again, this feels like it doesn't really have much to do with safety cases– this is just a question of whether we want flexible or fixed/inflexible regulations in general.)

5Richard_Ngo2y

On the spectrum I outlined, the "legislate that AI labs should do X, Y, Z, as enforced by regulator R" end is less susceptible to regulatory capture (at least after the initial bill is passed).

1davekasten2y

This is definitely a tradeoff space! YES, there is a tradeoff here and yes regulatory capture is real, but there are also plenty of benign agencies that balance these things fairly well. Most people on these forums live in nations where regulators do a pretty darn good job on the well-understood problems and balance these concerns fairly well. (Inside Context Problems?) You tend to see design of regulatory processes that requires stakeholder input; in particular, the modern American regulatory state's reliance on the Administrative Procedure Act means that it's very difficult for a regulator to regulate without getting feedback from a wide variety of external stakeholders, ensuring that they have some flexibility without being arbitrary. I also think, contrary to conventional wisdom, that your concern is part of why many regulators end up in a "revolving-door" mechanism -- you often want individuals moving back and forth between those two worlds to cross-populate assumptions and check for areas where regulation has gotten misaligned with end goals

4Orpheus162y

Here's how I understand your argument: 1. Some people are advocating for safety cases– the idea that companies should be required to show that risks drop below acceptable levels. 2. This approach is used in safety engineering fields. 3. But AI is different from the safety engineering fields. For example, in AI we have adversarial risks. 4. Therefore we shouldn't support safety cases. I think this misunderstands the case for safety cases, or at least only argues against one particular justification for safety cases. Here's how I think about safety cases (or really any approach in which a company needs to present evidence that their practices keep risks below acceptable levels): 1. AI systems pose major risks. A lot of risks stem from race dynamics and competitive pressures. 2. If companies were required to demonstrate that they kept risks below acceptable levels, this would incentivize a lot more safety research and curb some of the dangerous properties of race dynamics. 3. Other fields also have similar setups, and we should try to learn from them when relevant. Of course, AI development will also have some unique properties so we'll have to adapt the methods accordingly. I'd be curious to hear more about why you think safety cases fail to work when risks are adversarial (at first glance, it doesn't seem like it should be too difficult to adapt the high-level safety case approach). I'm also curious if you have any alternatives that you prefer. I currently endorse the claim "safety cases are better than status quo" but I'm open to the idea that maybe "Alternative approach X is better than both safety cases and status quo."

2Buck2y

Yeah, in your linked paper you write "In high-stakes industries, risk management practices often require affirmative evidence that risks are kept below acceptable thresholds." This is right. But my understanding is that this is not true of industries that deal with adversarial high-stakes situations. So I don't think you should claim that your proposal is backed up by standard practice. See here for a review of the possibility of using safety cases in adversarial situations.

4ryan_greenblatt2y

Earlier, you said that maybe physical security work was the closest analogy. Do you still think this is true?

3Seth Herd2y

What about this alternate twist: safety cases are the right model, but it just happens to be extremely difficult to make an adequate safety case for competent agentic AGI (or anything close). Introducing the safety case model for near-future AI releases could normalize that. It should be pretty easy to make a safety case for GPT4 and Claude 3.5. When people want to deploy real AGI, they won't be able to make the safety case without real advances in alignment. And that's the point.

5Buck2y

My guess is that it's infeasible to ask for them to delay til they have a real safety case, due to insufficient coordination (including perhaps international competition).

2Seth Herd2y

Isn't that an argument against almost any regulation? The bar on "safety case" can be adjusted up or down, and for better or worse will be. I think real safety cases are currently very easy: sure, a couple of eggs might be broken by making an information and idea search somewhat better than Google. Some jackass will use it for ill, and thus cause slightly more harm than they would've with Google. But the increase in risk of major harms is tiny compared to the massive benefits of giving everyone something like a competent assistant who's near-expert in almost every domain. Maybe we're too hung up on downsides as a society to make this an acceptable safety case. Our societal and legal risk-aversion might make safety cases a nonstarter, even though they seem likely to be the correct model if we could collectively think even close to clearly.

[-]Buck2yΩ25402

When thinking about different techniques that aim to reduce AI risk from misaligned AIs, I find it helpful to explicitly consider multiple different levels of risk that the deployer of the AI might be going for, because different classes of safety techniques are appropriate for different risk levels. Four that I think it's particularly useful to think about:

*The “safety case” regime.* Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it’s pretty likely that AI developers don’t actually hold themselves to these standards, but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people are usually talking about when they discuss making safety cases. I usually operationalize this as the AI developer wanting to have <1% chance that their AIs escape in the first year of deployment, and <5% conditional on the model trying pretty har

... (read more)

[-]Adam Scholl2y*Ω7158

*The rushed reasonable developer regime.* The much riskier regimes I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.

I object to the use of the word "reasonable" here, for similar reasons I object to Anthropic's use of the word "responsible." Like, obviously it could be the case that e.g. it's simply intractable to substantially reduce the risk of disaster, and so the best available move is marginal triage; this isn't my guess, but I don't object to the argument. But it feels important to me to distinguish strategies that aim to be "marginally less disastrous" from those which aim to be "reasonable" in an absolute sense, and I think strategies that involve creating a superintelligence without erring much on the side of caution generally seem more like the former sort.

7Buck2y

I think it makes sense to use the word "reasonable" to describe someone who is taking actions that minimize total risk, even if those actions aren't what they'd take in a different situation, and even if various actors had made mistakes to get them into this situation. (Also note that I'm not talking about making wildly superintelligent AI, I'm just talking about making AGI; my guess is that even when you're pretty rushed you should try to avoid making galaxy-brained superintelligence.)

[-]Adam Scholl2y*Ω4103

I agree it seems good to minimize total risk, even when the best available actions are awful; I think my reservation is mainly that in most such cases, it seems really important to say you're in that position, so others don't mistakenly conclude you have things handled. And I model AGI companies as being quite disincentivized from admitting this already—and humans generally as being unreasonably disinclined to update that weird things are happening—so I feel wary of frames/language that emphasize local relative tradeoffs, thereby making it even easier to conceal the absolute level of danger.

9Buck2y

Yep that's very fair. I agree that it's very likely that AI companies will continue to be misleading about the absolute risk posed by their actions.

[-][anonymous]2y128

if all AI developers followed these approaches

My preferred aim is to just need the first process that creates the first astronomically significant AI to follow the approach.^[1] To the extent this was not included, I think this list is incomplete, which could make it misleading.

[1] This could (depending on the requirements of the alignment approach) be more feasible when there's a knowledge gap between labs, if that means more of an alignment tax is tolerable by the top one (more time to figure out how to make the aligned AI also be superintelligent despite the 'tax'); but I'm not advocating for labs to race to be in that spot (and it's not the case that all possible alignment approaches would be for systems of the kind that their private-capabilities-knowledge is about (e.g., LLMs).)

[-]Adam Scholl2y*103

*The existential war regime*. You’re in an existential war with an enemy and you’re indifferent to AI takeover vs the enemy defeating you. This might happen if you’re in a war with a nation you don’t like much, or if you’re at war with AIs.

Does this seem likely to you, or just an interesting edge case or similar? It's hard for me to imagine realistic-seeming scenarios where e.g. the United States ends up in a war where losing would be comparably bad to AI takeover. This is mostly because ~no functional states (certainly no great powers) strike me as so evil that I'd prefer ~~extinction~~ AI takeover to those states becoming a singleton, and for basically all wars where I can imagine being worried about this—e.g. with North Korea, ISIS, Juergen Schmidhuber—I would expect great powers to be overwhelmingly likely to win. (At least assuming they hadn't already developed decisively-powerful tech, but that's presumably the case if a war is happening).

[-]ryan_greenblatt2y11-6

A war against rogue AIs feels like the central case of an existential war regime to me. I think a reasonable fraction of worlds where misalignment causes huge problems could have such a war.

[-]Adam Scholl2y*124

It sounds like you think it's reasonably likely we'll end up in a world with rogue AI close enough in power to humanity/states to be competitive in war, yet not powerful enough to quickly/decisively win? If so I'm curious why; this seems like a pretty unlikely/unstable equilibrium to me, given how much easier it is to improve AI systems than humans.

4ryan_greenblatt2y

I think having this equilibrium for a while (e.g. a few years) is plausible because humans will also be able to use AI systems. (Humans might also not want to build much more powerful AIs due to safety concerns and simultaneously be able to substantially slow down self-improvement with compute limitations (and track self-improvement using other means).) Note that by "war" I don't necessarily mean that battles are ongoing. It is possible this mostly manifests as racing on scaling and taking aggressive actions to hobble the AI's ability to use more compute (including via the use of the army and weapons etc.).

7ryan_greenblatt2y

Your comment seems to assume that AI takeover will lead to extinction. I don't think this is a good thing to assume as it seems unlikely to me. (To be clear, I think AI takeover is very bad and might result in huge numbers of human deaths.)

4Adam Scholl2y

I do basically assume this, but it isn't cruxy so I'll edit.

[-]Buck6y*400

[I'm not sure how good this is, it was interesting to me to think about, idk if it's useful, I wrote it quickly.]

Over the last year, I internalized Bayes' Theorem much more than I previously had; this led me to noticing that when I applied it in my life it tended to have counterintuitive results; after thinking about it for a while, I concluded that my intuitions were right and I was using Bayes wrong. (I'm going to call Bayes' Theorem "Bayes" from now on.)

Before I can tell you about that, I need to make sure you're thinking about Bayes in terms of ratios rather than fractions. Bayes is enormously easier to understand and use when described in terms of ratios. For example: Suppose that 1% of women have a particular type of breast cancer, and a mammogram is 20 times more likely to return a positive result if you do have breast cancer, and you want to know the probability that you have breast cancer if you got that positive result. The prior probability ratio is 1:99, and the likelihood ratio is 20:1, so the posterior probability ratio is $1 \cdot 20 : 99 \cdot 1 = 20:99$, so you have a probability of 20/(20+99) of having breast cancer.
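The ratio form of the update above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the mammogram example; the function names `update_odds` and `odds_to_probability` are made up for this sketch and come from no library.

```python
from fractions import Fraction

def update_odds(prior, likelihood_ratio):
    """Multiply prior odds (for, against) by a likelihood ratio, term by term."""
    return (prior[0] * likelihood_ratio[0], prior[1] * likelihood_ratio[1])

def odds_to_probability(odds):
    """Convert odds (for, against) into the probability of the 'for' hypothesis."""
    return Fraction(odds[0], odds[0] + odds[1])

# Numbers from the mammogram example: prior 1:99, likelihood ratio 20:1.
posterior = update_odds((1, 99), (20, 1))
probability = odds_to_probability(posterior)
print(posterior, probability, float(probability))
```

The posterior odds come out as 20:99, and the probability is 20/119, or roughly 17%.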

I think that this is absurdly easier than using the fraction formulation

... (read more)

7Ben Pace6y

Time to record my thoughts! I won't try to solve it fully, just note my reactions. Well, firstly, I'm not sure that the likelihood ratio is 12x in favor of the former hypothesis. Perhaps likelihood of things clusters - like people either do things a lot, or they never do things. It's not clear to me that I have an even distribution of things I do twice a month, three times a month, four times a month, and so on. I'd need to think about this more. Also, while I agree it's a significant update toward your friend being a regular there given that you saw them the one time you went, you know a lot of people, and if it's a popular place then the chance of you seeing any given friend is kinda high, even if they're all irregular visitors. Like, if each time you go you see a different friend, I think it's more likely that it's popular and lots of people go from time to time, rather than they're all going loads of times each. I don't quite get what's going on here. As someone from Britain, I regularly walk through more than 6 cars of a train. The anthropics just checks out. (Note added 5 months later: I was making a British joke here.)

1Liam Donovan6y

What does "120:991" mean here?

4Buck6y

formatting problem, now fixed

[-]Buck8mo3811

I think that I've historically underrated learning about historical events that happened in the last 30 years, compared to reading about more distant history.

For example, I recently spent time learning about the Bush presidency, and found learning about the Iraq war quite thought-provoking. I found it really easy to learn about things like the foreign policy differences among factions in the Bush admin, because e.g. I already knew the names of most of the actors and their stances are pretty intuitive/easy to understand. But I still found it interesting to understand the dynamics; my background knowledge wasn't good enough for me to feel like I'd basically heard this all before.

8Cole Wyeth7mo

How do you recommend studying recent history?

4Buck7mo

I don't really have a better suggestion than reading the obvious books. For the Bush presidency, I read/listened to both "Days of Fire", a book by Peter Baker (a well-regarded journalist), and "Decision Points" by Bush. And I watched/listened to a bunch of interviews with various people involved with the admin.

3Drake Morrison8mo

I have long thought that I should focus on learning history with a recency bias, since knowing about the approximate present screens off events of the past.

[-]Buck7y370

A couple weeks ago I spent an hour talking over video chat with Daniel Cantu, a UCLA neuroscience postdoc who I hired on Wyzant.com to spend an hour answering a variety of questions about neuroscience I had. (Thanks Daniel for reviewing this blog post for me!)

The most interesting thing I learned is that I had quite substantially misunderstood the connection between convolutional neural nets and the human visual system. People claim that these are somewhat bio-inspired, and that if you look at early layers of the visual cortex you'll find that it operates kind of like the early layers of a CNN, and so on.

The claim that the visual system works like a CNN didn’t quite make sense to me though. According to my extremely rough understanding, biological neurons operate kind of like the artificial neurons in a fully connected neural net layer--they have some input connections and a nonlinearity and some output connections, and they have some kind of mechanism for Hebbian learning or backpropagation or something. But that story doesn't seem to have a mechanism for how neurons do weight tying, which to me is the key feature of CNNs.

Daniel claimed that indeed human brains don't have weight ty

... (read more)

[-]Buck5mo33-1

Fun fact: My post Christian homeschoolers in the year 3000 was substantially written by Claude Opus 4.1.

You can see the chat here. I prompted Claude with a detailed outline, a previous draft that followed a very different structure, and a copy of "The case for ensuring powerful AIs are controlled" for reference about my writing style. The outline I gave Claude is in the Outline tab, and the old draft I provided is in the Old draft tab, of this doc.

As you can see, I did a bunch of back and forth with Claude to edit it. Then I copied to a Google doc and edited substantially on my own to get to the final product.

I was shocked by how good Claude was at this.

I believe this is in compliance with the LLM writing assistance policy.

[-]Hastings5mo2612

This is one of the nastiest aspects of the LLM surge, that there's no way to opt out of having this prank pulled on me, over and over again.

[-]Buck5mo110

I didn’t intend this to be a prank! I just wanted to write faster and better.

[-]Cole Wyeth5mo115

Yeah, and personally I don’t want to read LLM writing (I didn’t read this particular post anyway).

2Joey KL5mo

I'd be curious to hear more about your negative reaction here.

[-]cousin_it5mo*126

I think reading LLM content can unnoticeably make me worse at writing, or harm my thinking in other ways. If people keep posting LLM content without flagging it on top, I'll probably leave LW.

9Mo Putera5mo

I at first wondered whether this would count as an answer to nostalgebraist's when will LLMs become human-level bloggers? which he asked back in March, but then upon rereading I'm less sure. I kind of buy DaemonicSigil's top-karma response that "writing a worthwhile blog post is not only a writing task, but also an original seeing task... So the obstacle is not necessarily reasoning... but a lack of things to say", and in this case you were clearly the one with the things to say, not Opus 4.1

7james oofou5mo

The rise of this kind of thing was one of my main predictions for late 2025:

6Cole Wyeth5mo

Well, looks like you’re 4/4

6Alexei5mo

Hmm, I ended up downvoting that post, whereas I usually like yours. I think in retrospect it’s clearly because it was AI written / assisted.

2Buck5mo

I do think that the post had a worse argumentative structure than posts I normally write (hence all the confusions in the comments). But that was totally on me, not the AI. I'm interested in whether you think your problem was with the argumentative structure or the prose.

2Alexei5mo

The writing style feels a bit too polished and the structure is a bit too formal. Feels like it was written as an essay that will be graded. I think some of your usual humor / style is missing too, but that might be me reading too much into it at this point.

6Neel Nanda5mo

A bunch of my recent blog posts were written with a somewhat similar process, it works surprisingly well! I've also had great results with putting a ton of my past writing into the context

5Wei Dai5mo

I've been reading a lot of web content, including this post, after asking my favorite LLM[1] to "rewrite it in Wei Dai's style" which I find tends to make it shorter and easier for me to read, while still leaving most of the info intact (unlike if I ask for a summary). Before I comment, I'll check the original to make sure the AI's version didn't miss a key point (or read the original in full if I'm sufficiently interested), and also ask the AI to double-check that my comment is sensible.

[1] Currently Gemini 2.5 Pro because it's free through AI Studio, and the rate limit is high enough that I've never hit it.

5jamjam5mo

It feels like you did all the hard parts of the writing, and let the AI do the "grunt work" so to speak. You provided a strong premise for the fundamental thesis, a defined writing style, and made edits for style at the end. I think the process of creating the framework out of just a simple premise would be far more impressive, and that's still where LLMs seem to struggle in writing. It's somewhat analogous to how models have improved at coding since GPT-4: you used to say "implement a class which allows users to reply, it should have X parameters and Y functions which do Z" and now you say "make a new feature that allows users to reply" and it just goes ahead and does it. Maybe I am underestimating the difficulty of selecting the exact right words, and I acknowledge that the writing was pretty good and devoid of so-called "slop", but I just don't think this is extremely impressive as a capability compared to other possible tests.

7Buck5mo

I agree that I had had all the ideas, but I hadn't previously been able to get AIs to even do the "grunt work" of turning it into prose with anything like that level of quality!

[-]Buck1y294

I signed up to be a reviewer for the TMLR journal, in the hope of learning more about ML research and ML papers. So far I've found the experience quite frustrating: all 3 of the papers I've reviewed have been fairly bad and sort of hard to understand, and it's taken me a while to explain to the authors what I think is wrong with the work.

4Leon Lang1y

Have you also tried reviewing for conferences like NeurIPS? I'd be curious what the differences are. Some people send papers to TMLR when they think they wouldn't be accepted to the big conferences due to not being that "impactful" --- which makes sense since TMLR doesn't evaluate impact. It's thus possible that the median TMLR submission is worse than the median conference submission.

7Buck1y

I reviewed for ICLR this year, and found it somewhat more rewarding; the papers were better, and I learned something somewhat useful from writing my reviews.

3Daniel Tan1y

In my experience, ML folks submit to journals when: 1. Their work greatly exceeds the scope of 8 pages 2. They have been rejected multiple times from first- (or even second-)tier conferences For the first reason, I think the best papers in TMLR are probably on par with (or better than) the best papers at ML conferences, but you're right that the median could be worse. Low-confidence take: Length might be a reasonable heuristic to filter out the latter category of work.

[-]Buck5y280

I know a lot of people through a shared interest in truth-seeking and epistemics. I also know a lot of people through a shared interest in trying to do good in the world.

I think I would have naively expected that the people who care less about the world would be better at having good epistemics. For example, people who care a lot about particular causes might end up getting really mindkilled by politics, or might end up strongly affiliated with groups that have false beliefs as part of their tribal identity.

But I don’t think that this prediction is true: I think that I see a weak positive correlation between how altruistic people are and how good their epistemics seem.

----

I think the main reason for this is that striving for accurate beliefs is unpleasant and unrewarding. In particular, having accurate beliefs involves doing things like trying actively to step outside the current frame you’re using, and looking for ways you might be wrong, and maintaining constant vigilance against disagreeing with people because they’re annoying and stupid.

Altruists often seem to me to do better than people who instrumentally value epistemics; I think this is because valuing epistemics terminally ... (read more)

4Richard_Ngo4y

These both seem pretty common, so I'm curious about the correlation that you've observed. Is it mainly based on people you know personally? In that case I expect the correlation not to hold amongst the wider population. Also, a big effect which probably doesn't show up much amongst the people you know: younger people seem more altruistic (or at least signal more altruism) and also seem to have worse epistemics than older people.

4Viliam5y

Caring about things seems to make you interact with the world in more diverse ways (because you do this in addition to things other people do, not instead of); some of that translates into more experience and better models. But also tribal identity, mindkilling, often refusing to see the reasons why your straightforward solution would not work, and uncritical contrarianism. Now I think about a group of people I know, who care strongly about improving the world, in the one or two aspects they focus on. They did a few amazing things and gained lots of skills; they publish books, organize big conferences, created a network of like-minded people in other countries; some of their activities are profitable, for others they apply for various grants and often get them, so some of them improve the world as a full-time job. They also believe that covid is a hoax, plus have lots of less fringe but still quite irrational beliefs. However... this depends on how you calculate the "total rationality", but seems to me that their gains in near mode outweigh the losses in far mode, and in some sense I would call them more rational than average population. Of course I dream about a group that would have all the advantages and none of the disadvantages.

2Pattern5y

It seems like the more people you know, the less likely this is. Of both? (This sentence didn't have a clear object.)

2Viliam5y

Ah. I meant, I would like to see a group that has the sanity level of a typical rationalist, and the productivity level of these super-agenty irrationalists. (Instead of having to choose between "sane with lots of akrasia" and "awesome but insane".)

2Pattern5y

Hm. Maybe there's something to be gained from navigating 'trade-offs' differently? I thought perpetual motion machines were impossible (because thermodynamics) aside from 'launch something into space, pointed away from stuff it would crash into', though I'd read that 'trying to do so is a good way to learn about physics', but I didn't really try because I thought it'd be pointless. And then this happened.

[-]Buck3mo240

I think it's worth drilling your halfish-power-of-ten times tables, by which I mean memorizing the products of numbers like 1, 3, 10, 30, 100, 300, etc, while pretending that 3x3=10.

For example, 30*30=1k, 10k times 300k is 3B, etc.
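One way to see why the drill works is that every number in the table is a whole count of half-powers of ten, so multiplying is just adding counts. The sketch below assumes that encoding; the tables `MANTISSAS` and `SUFFIXES` and the helpers `parse`, `fmt`, and `multiply` are hypothetical names for this illustration, and it only handles labels from 1 up to 300T.

```python
# Each label is a count of half-powers of ten: "1" -> 0, "3" -> 1, "10" -> 2, ...
# Pretending 3 x 3 = 10 is the same as treating 3 as exactly 10**0.5.
MANTISSAS = {"1": 0, "3": 1, "10": 2, "30": 3, "100": 4, "300": 5}
SUFFIXES = {"": 0, "k": 6, "M": 12, "B": 18, "T": 24}  # also in half-powers

def parse(label):
    """Turn a label such as '300k' into its count of half-powers of ten."""
    suffix = label[-1] if label[-1] in SUFFIXES else ""
    digits = label[: len(label) - len(suffix)]
    return MANTISSAS[digits] + SUFFIXES[suffix]

def fmt(half_powers):
    """Turn a count of half-powers of ten back into a label such as '3B'."""
    base = max(v for v in SUFFIXES.values() if v <= half_powers)
    suffix = next(k for k, v in SUFFIXES.items() if v == base)
    mantissa = next(k for k, v in MANTISSAS.items() if v == half_powers - base)
    return mantissa + suffix

def multiply(a, b):
    """Multiply two labels by adding their half-power counts."""
    return fmt(parse(a) + parse(b))

print(multiply("30", "30"), multiply("10k", "300k"))
```

Both examples from the text come out as stated: 30*30 gives 1k, and 10k times 300k gives 3B.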

I spent an hour drilling these on a plane a few years ago and am glad I did.

[-]Rohin Shah3mo2210

Just switch to log_10 space and add? Eg 10k times 300k = 1e4 * 3e5 = 3e9 = 3B. A bit slower but doesn't require any drills.

3Buck3mo

Sure; I think extra speed from practicing it (and e.g. more instantly knowing that 100M is 1e8) is worth it.

[-]Carl Feynman3mo132

TLDR: don’t bother memorizing half powers. Instead, memorize a few logarithms.

I’m an engineer who does a lot of arithmetic in my head, because I lie awake at night designing stuff. I agree being able to do half-power-of-ten math in your head is very useful. But I didn’t have to drill it, because I can do it by converting to logs, adding, and converting back (as Rohin Shah suggests). And I’ve done that enough that I’ve memorized all the common combinations without conscious effort.

A few years ago I used a set of flash cards and spaced repetition drills to memorize the log_10 of a half-dozen common numbers. That’s been moderately useful for multiplication, powers and roots. I’m not sure if it has paid off in net. I can do arithmetic pretty good by other methods, and the experience of trying to cudgel numbers into my brain was not fun for me. But some people love their Anki cards, and for them, logs are probably a good thing to learn.

Useful facts: sqrt(10) = 3.16. Logs of 2, 4, 8 are .30, .60, .90. Log_10(5)=.70. If you need more than two digits, use a computer.
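As a quick check on how good these two-digit values are, the sketch below compares them with `math.log10` and uses them for one mental-style multiplication. The dictionary `approx_logs` and the helper `estimate_product` are names invented for this illustration.

```python
import math

# Two-digit logs quoted above (8 read as .90).
approx_logs = {2: 0.30, 4: 0.60, 8: 0.90, 5: 0.70}

# Absolute error of each memorized value against the true log_10.
errors = {n: abs(math.log10(n) - a) for n, a in approx_logs.items()}

def estimate_product(a, b):
    """Multiply by adding memorized logs, then converting back."""
    return 10 ** (approx_logs[a] + approx_logs[b])

estimate = estimate_product(8, 5)  # the true product is 40
print(errors, estimate, math.sqrt(10))
```

Each memorized log is within 0.005 of the true value, and the 8 x 5 estimate lands within 1% of 40.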

4TsviBT3mo

IANAE but what I use is log_10 by increments of .1. The starting point is that 10^.1 is roughly 5/4. (It's actually ~1.259 rather than 1.25.) You get, basically, 1 1.25 1.57 2 2.5 3.1 4 5 6.3 8 10 and you can pretty closely regenerate this with 10^.1 = 5/4, accelerated by remembering 10^.3 is 2 (which pins down 4 and 8).
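A small sketch comparing the table above with the exact values of 10^(k/10), and checking the "10^.3 is 2" shortcut as a doubling rule three steps apart. The variable names are chosen for this illustration only.

```python
# Values as listed in the comment, for exponents 0.0, 0.1, ..., 1.0.
listed = [1, 1.25, 1.57, 2, 2.5, 3.1, 4, 5, 6.3, 8, 10]
exact = [10 ** (k / 10) for k in range(11)]

# Relative error of each listed value against the exact power of ten.
relative_errors = [abs(a - e) / e for a, e in zip(listed, exact)]
worst = max(relative_errors)

# Moving three steps up the table should roughly double the value.
doubling_gaps = [abs(listed[k + 3] - 2 * listed[k]) / listed[k + 3] for k in range(8)]

print(round(worst, 4), round(max(doubling_gaps), 4))
```

Every listed entry is within about 2% of the exact value, which is typically enough for order-of-magnitude estimates.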

1Petropolitan3mo

TLDR: you probably already know that 2^10=1024, use this to derive powers of 10 instead of memorizing!

Renard series (https://en.wikipedia.org/wiki/Renard_series), which were designed in the 1870s to be convenient for the officers and engineers of the French Army without a slide rule or a log table, are based on 5th and 10th roots of 10. 1024 being quite close to 1000 means that $\sqrt[3]{2}$ is very close to $10^{0.1}$, and this allows you to quickly derive R10 numbers without pen and paper. I have used the algorithm for so long that it has become almost unconscious so I had Gemini write it out:

Mental Algorithm for R10 Numbers

(sorry for poor formatting, it doesn't copypaste neatly, I only fixed manually where it doesn't read well)

In the R10 series, every step increases the value by a factor of $\approx 1.26$. However, since $2^{10} \approx 10^3$: $10^{3/10} \approx 2$

This gives you the Golden Rule of R10:

* Add 3 to the Index → Multiply Value by 2
* Subtract 3 from the Index → Divide Value by 2

### The Algorithm: The Three Strands

To find any R10 number mentally, you don't calculate them sequentially. Instead, you split the numbers 0–10 into three "strands" based on the anchors you already know: **1**, **8**, and **10**.

#### Strand A: The Powers of 2 (Indices 0, 3, 6, 9)

Start at **1** and double it every 3 steps.

* R10(**0**) = **1.0**
* R10(**3**) = **2.0**
* R10(**6**) = **4.0**
* R10(**9**) = **8.0**

#### Strand B: The Halving from 10 (Indices 10, 7, 4, 1)

Start at **10** and halve it every 3 steps (going backwards).

* R10(**10**) = **10.0**
* R10(**7**) = **5.0**
* R10(**4**) = **2.5**
* R10(**1**) = **1.25**

#### Strand C: The "80% Rule" (Indices 8, 5, 2)

This is the hardest strand because it doesn't land on a clean integer. We derive this by starting at R10(9), which we know is **8.0**, and going **down 1 step**. Mathematically, going down 1 step is dividing by $1.2589...$, which is almost exactly multiplying by **0.8**.

* Start at R10(9) = 8.0.
* **R10(8)** $\approx 8.0 \times

[-]Buck1yΩ13202

When I'm thinking about AI control in the context of concentrated failures--that is, cases where an unacceptable outcome might be caused by a small number of bad AI actions, e.g. the setting explored in our control paper--I find it helpful to use the following simplified models.

Let's consider our AI deployment from both a micro perspective, where we just think about a single task the AI is doing, and a macro perspective, where we think about the deployment as comprised of many tasks.

From the micro perspective, when an AI tries to attack on a particular task, this can end in three ways:

The attack succeeds, and so the AI wins.
We catch the AI red-handed before the attack completes, and so it loses.
The attack fails silently; we don't notice it, and so the game continues as before.

And from the macro perspective, the game ends one of these ways:

An attack succeeded; the AI wins.
You caught the AI; you win.
The game finished naturally because all the tasks were completed; you win.

I often ignore the possibility of the game finishing naturally, because that won't happen if the AI attempts to attack often enough, which is the right strategy for the AI in many games.

Under that assumption, the... (read more)
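A minimal sketch of the micro/macro model described above, assuming each attack attempt independently succeeds with probability `p_success`, is caught with probability `p_caught`, and otherwise fails silently. Both parameter names and the independence assumption are illustrative and not from the post. If the game never ends naturally, summing over the silent failures gives a closed-form AI win probability of `p_success / (p_success + p_caught)`, which the simulation checks.

```python
import random


def ai_win_probability(p_success: float, p_caught: float) -> float:
    """Closed form, assuming the AI keeps attacking until it wins or is caught."""
    return p_success / (p_success + p_caught)


def simulate(p_success: float, p_caught: float,
             n_games: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the AI's win rate under the same assumptions."""
    rng = random.Random(seed)
    ai_wins = 0
    for _ in range(n_games):
        while True:
            u = rng.random()
            if u < p_success:
                ai_wins += 1          # attack succeeds: the AI wins
                break
            if u < p_success + p_caught:
                break                 # caught red-handed: the AI loses
            # otherwise the attack fails silently and the game continues
    return ai_wins / n_games
```

Under these assumptions, only the ratio between the success and catch probabilities matters; the silent-failure probability changes how long the game lasts, not who wins.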

1davekasten1y

Possibly misguided question given the context -- I see you incorporating imperfect information in "the attack fails silently", why not also a distinction between "the attack succeeds noisily, the AI wins and we know it won" and "the attack succeeds silently, the AI wins and we don't know it won" ?

5Buck1y

My reasoning is: “If the AI won, who cares if we know it or not? We’ll find out eventually :/” This isn’t totally correct but it seems like a fine approximation given all the other approximations here.

1davekasten1y

Fair enough!

[-]Buck4yΩ11180

[epistemic status: speculative]

A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.

A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in other ways than SGD. For example, the models might write down various notes (in natural language or in neuralese) that they then read later, and they might have internal structures like economies that produce and consume information. Does this introduce new alignment problems?

Here’s a way I’ve been thinking about this recently. I’m writing this in a way that might feel obnoxiously overwrought because this is the way that I think would have conveyed my current intuition to me two months ago.

In SGD, we update our weights by something like:

weights <- weights - alpha * (d loss/d weights)

You might think that this is fundamental. But a... (read more)

[-]Buck4yΩ8110

Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.

3Richard_Ngo4y

One thing that makes me suspicious about this argument is that, even though I can gradient hack myself, I don't think I can make suggestions about what my parameters should be changed to.

How can I gradient hack myself? For example, by thinking of strawberries every time I'm about to get a reward. Now I've hacked myself to like strawberries. But I have no idea how that's implemented in my brain, I can't "pick the parameters for myself", even if you gave me a big tensor of gradients.

Two potential alternatives to the thing you said:

* maybe competitive alignment schemes need to be robust to models gradient hacking themselves towards being more capable (although idk why this would make a difference).
* maybe competitive alignment schemes need to be robust to models (sometimes) choosing their own rewards to reinforce competent behaviour. (Obviously can't let them do it too often or else your model just wireheads.)

2Buck4y

In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7

[-]Buck1yΩ10176

Project proposal: No-holds-barred jailbreak attacks and defenses for control

Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?

The blue team can try a variety of strategies here.

The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.
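The blue-team strategies above can be sketched as follows. This is a toy illustration, not the setup used in any control paper: `monitor` is a hypothetical callable that sends a prompt to the trusted model and returns its reply, and the prompt wording, 4-digit token format, and function names are all assumptions.

```python
import secrets
from typing import Callable, Tuple


def build_randomized_prompt(code: str) -> Tuple[str, str, str]:
    """Draw fresh random 'safe'/'unsafe' reply tokens for this single query."""
    safe_token = str(secrets.randbelow(9000) + 1000)
    unsafe_token = safe_token
    while unsafe_token == safe_token:
        unsafe_token = str(secrets.randbelow(9000) + 1000)
    prompt = (
        f"If the code is safe, reply with the number {safe_token}, "
        f"otherwise reply with the number {unsafe_token}.\n\n"
        f"Code to review:\n{code}"
    )
    return prompt, safe_token, unsafe_token


def monitor_says_safe(code: str, monitor: Callable[[str], str]) -> bool:
    """Treat anything other than this query's safe token as a flag."""
    prompt, safe_token, _ = build_randomized_prompt(code)
    return monitor(prompt).strip() == safe_token


def canary_check(code: str, known_bad: str, monitor: Callable[[str], str]) -> bool:
    """Append a known-bad snippet; return True if the monitor still flags it."""
    return not monitor_says_safe(code + "\n" + known_bad, monitor)
```

A jailbreak that only makes the monitor emit one fixed string fails `monitor_says_safe`, because the safe token changes on every query. If `canary_check` returns `False`, the monitor is currently failing to flag obviously bad input.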

[-]Buck6y*Ω8170

I used to think that slower takeoff implied shorter timelines, because slow takeoff means that pre-AGI AI is more economically valuable, which means that the economy advances faster, which means that we get AGI sooner. But there's a countervailing consideration, which is that in slow takeoff worlds, you can make arguments like ‘it’s unlikely that we’re close to AGI, because AI can’t do X yet’, where X might be ‘make a trillion dollars a year’ or ‘be as competent as a bee’. I now overall think that arguments for fast takeoff should update you towards shorter timelines.

So slow takeoffs cause shorter timelines, but are evidence for longer timelines.

This graph is a version of this argument: if we notice that current capabilities are at the level of the green line, then if we think we're on the fast takeoff curve we'll deduce we're much further ahead than we'd think on the slow takeoff curve.
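A toy numerical version of the evidential half of this argument, under loudly hypothetical assumptions: capability follows `c(t) = (t / agi_year) ** exponent`, where `exponent = 1` stands in for slow takeoff and `exponent = 8` for fast takeoff. Both curves are forced to reach AGI in the same year, so the sketch deliberately leaves out the causal effect (slow takeoff speeding up the economy). None of these numbers come from the post.

```python
def years_remaining(current_capability: float, agi_year: float, exponent: float) -> float:
    """Years until AGI on the hypothetical curve c(t) = (t / agi_year) ** exponent."""
    # Invert the curve to find how far along it the observed capability sits.
    elapsed = agi_year * current_capability ** (1.0 / exponent)
    return agi_year - elapsed


observed = 0.1  # current capability as a fraction of AGI level (made up)
slow = years_remaining(observed, agi_year=100.0, exponent=1.0)
fast = years_remaining(observed, agi_year=100.0, exponent=8.0)
print(f"slow takeoff: {slow:.0f} years left; fast takeoff: {fast:.0f} years left")
```

With the same observed capability level, the fast-takeoff curve implies much less time remaining than the slow-takeoff curve, so believing in fast takeoff pushes the estimate toward shorter timelines.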

For the "slow takeoffs mean shorter timelines" argument, see here: https://sideways-view.com/2018/02/24/takeoff-speeds/

This point feels really obvious now that I've written it down, and I suspect it's obvious to many AI safety people, including the people whose writings I'm referencing here. Thanks to various people for helpful comments.

I think that this is why belief in slow takeoffs is correlated with belief in long timelines among the people I know who think a lot about AI safety.

4Sammy Martin6y

I wrote a whole post on modelling specific continuous or discontinuous scenarios - in the course of trying to make a very simple differential equation model of continuous takeoff, by modifying the models given by Bostrom/Yudkowsky for fast takeoff, the result that fast takeoff means later timelines naturally jumps out.

But that model relies on pre-setting a fixed 'threshold for AGI', given by the parameter $I_{AGI}$, in advance. This, along with the starting intelligence of the system, fixes how far away AGI is.

You could (I might get round to doing this) model the effect you're talking about by allowing $I_{AGI}$ to vary with the level of discontinuity. So every model would start with the same initial intelligence $I_0$, but $I_{AGI}$ would be correlated with the level of discontinuity, with larger discontinuity implying $I_{AGI}$ is smaller. That way, you would reproduce the epistemic difference of expecting a stronger discontinuity - that the current intelligence of AI systems is implied to be closer to what we'd expect to need for explosive growth on discontinuous takeoff scenarios than on continuous scenarios.

We know the current level of capability and the current rate of progress, but we don't know $I_{AGI}$, and holding all else constant slow takeoff implies $I_{AGI}$ is a significantly higher number (again, $I_{AGI}$ is relative to the starting intelligence of the system).

This is because my model was trying to model different physical situations, different ways AGI could be, not different epistemic situations, so I was thinking in terms of $I_{AGI}$ being some fixed, objective value that we just don't happen to know.

I'm uncertain if there's a rigorous way of quantifying how much this epistemic update does against the physical fact that continuous takeoff implies an earlier acceleration above exponential. If you're right, it overall completely cancels this effect out and makes timelines on discontinuous takeoff earlier overall - I think you're right about this. It would be easy enough to

[-]Buck1yΩ10164

Project proposal: Control for auditing

(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)

When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.

This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.

The simplest such setting I can think of

We’re going to take the APPS backdooring setti... (read more)

[-]Buck1yΩ11164

An example snippet of a rogue internal deployment threat model that I find fairly compelling:

The schemer, while deployed in an authorized scaffold, launches an unmonitored scaffold that has access to trusted model weights.
It uses this scaffold to do a GCG attack on the trusted model, then writes the resulting GCG attack somewhere that the agent in the authorized scaffold can see. This takes a moderate amount of compute, but a small amount compared to the amount the untrusted model will regularly need to run experiments.
The agent in the authorized scaffold uses that GCG attack to do something else that's more bad.

To defend against this, we should try to both prevent the untrusted model from ever getting rogue deployments where it can research jailbreaks on the trusted model, and also try to prevent unacceptable outcomes conditional on the untrusted agent having access to the best jailbreaks you can get with GCG etc.

I find this interesting because it seems kind of hard to stop either step 1 (because rogue deployments might be naturally possible given the affordances the model has) or step 3 (because it's hard to ensure safety against an adversary with white-box access to a model it wants to jailbreak). I'd be excited for more research on both of these.

2Zach Stein-Perlman1y

Why can't the schemer just do the bad stuff in the unmonitored scaffold from (1)?

2Buck1y

I don’t understand, can you say more?

6Zach Stein-Perlman1y

What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold? Edit: dumb question, oops: obviously the schemer will have affordances that it doesn't have within the unmonitored scaffold / there are actions where it's helpful to jailbreak the monitor. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.

[-]Buck2yΩ7140

Here's a math problem I'd like solved:

I want to know the Nash equilibrium of the following two-player game between players Alice and Bob.

Alice chooses the rates of two Poisson processes; we call these rates lam_1 and lam_2.

Bob needs to run each of these processes for 0.5 seconds. Bob can run each of the processes in "shielded" mode or "unshielded" mode, but can only afford to use $t$ seconds of shielded mode in total between the two processes. (So Bob chooses $t_{1}$ and $t_{2}$ such that $t_{1} + t_{2} = t$.)

The game goes as follows:

We run process 1 for time $t_{1}$ and process 2 for time $t_{2}$, in shielded mode. If either event occurs during these time periods, Bob wins.
Then we run process 1 for time $0.5 - t_{1}$ and process 2 for time $0.25 - t_{2}$, in unshielded mode. If either event occurs during these time periods, Alice wins.
If the event never occurred, Bob wins.

Here’s a Sympy implementation of this game (where I parameterize with $t_{1} = p t$ and $t_{2} = (1 - p) t$.)

import sympy as sp

lam_1, lam_2, p, t = sp.symbols('lambda_1 lambda_2 p t', real=True, positive=True)

time_2_overseen = t * (1 - p)
time_2_not_overseen = sp.Rational(1, 2) - time_2_overseen

... (read more)

3harfe2y

I am going to assume that in the code, when calculating p_alice_win_given_not_caught, we do not divide the $\lambda_2$ term by two (since this is not that consistent with the description. I am also assuming that $0.25 - t_2$ is a typo and $0.5 - t_2$ is meant, which would also be more consistent with other stuff). So I am going to assume a symmetrical version.

Here, P(Alice wins) is $e^{-\lambda_1 t_1 - \lambda_2 t_2} - e^{-\lambda_1/2 - \lambda_2/2} =: f(\lambda_1, \lambda_2, t_1, t_2)$. Wlog we can assume $0 < t < 1$ (otherwise Bob will run everything or nothing in shielded mode).

We claim that $((\lambda_1^*, \lambda_2^*), (t_1^*, t_2^*)) := ((\mu^*, \mu^*), (t/2, t/2))$ is a (pure) Nash equilibrium, where $\mu^* = \log t/(t-1)$.

To verify, let's first show that Alice cannot make a better choice if Bob plays $(t_1^*, t_2^*) = (t/2, t/2)$. We have $f(\lambda_1, \lambda_2, t_1^*, t_2^*) = e^{-(\lambda_1+\lambda_2)t/2} - e^{-(\lambda_1+\lambda_2)/2}$. Since this only depends on the sum, we can make the substitution $\mu = (\lambda_1+\lambda_2)/2$. Thus, we want to maximize $\tilde f(\mu, t) = e^{-\mu t} - e^{-\mu}$. We have $\tilde f'(\mu, t) = -t e^{-\mu t} + e^{-\mu} = 0$. Rearranging, we get $t e^{-\mu t} = e^{-\mu}$. Taking logs, we get $\log t - \mu t = -\mu$. Rearranging, we get $\mu = \log t/(t-1)$. Thus, $\mu = \mu^*$ is the optimal choice. This means that if Bob sticks to his strategy, Alice cannot do better than $f(\mu^*, \mu^*, t/2, t/2)$.

Now, let's show that Bob cannot do better. We have $f(\lambda_1^*, \lambda_2^*, t_1, t_2) = e^{-\mu^* t_1 - \mu^* t_2} - e^{-\mu^*/2 - \mu^*/2} = e^{-\mu^* t} - e^{-\mu^*}$. This does not depend on $t_1$ and $t_2$ anymore, so any choice of $t_1$ and $t_2$ is optimal if Alice plays $(\mu^*, \mu^*)$.

(If I picked the wrong version of the question, and you actually want some asymmetry: I suspect that the solution will have similarities, or that in some cases the solution can be obtained by rescaling the problem back into a more symmetric form.)
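A quick numerical sanity check of the claimed equilibrium for the symmetric version (both processes run for 0.5 seconds in total). This is a sketch, not a proof: the choice $t = 0.4$, the grid of Alice deviations, and all function names are arbitrary assumptions made for illustration.

```python
import math


def p_alice_wins(lam_1: float, lam_2: float, t_1: float, t_2: float) -> float:
    """P(no event while shielded) minus P(no event at all), symmetric version."""
    return math.exp(-lam_1 * t_1 - lam_2 * t_2) - math.exp(-(lam_1 + lam_2) / 2)


def mu_star(t: float) -> float:
    """Alice's claimed equilibrium rate, valid for 0 < t < 1."""
    return math.log(t) / (t - 1)


t = 0.4
mu = mu_star(t)
equilibrium_value = p_alice_wins(mu, mu, t / 2, t / 2)

# Alice deviates over a grid of rate pairs while Bob plays (t/2, t/2).
rates = [0.05 * k for k in range(1, 200)]
best_alice_deviation = max(
    p_alice_wins(a, b, t / 2, t / 2) for a in rates for b in rates
)

# Bob deviates over splits of his shielded time while Alice plays (mu*, mu*).
bob_values = [p_alice_wins(mu, mu, s, t - s) for s in (t * k / 20 for k in range(21))]
```

On this grid, no Alice deviation beats `equilibrium_value`, and every entry of `bob_values` equals it up to floating-point error, matching the argument that Bob is indifferent between splits when Alice plays $(\mu^*, \mu^*)$.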

3Buck2y

Thanks heaps! I wanted the asymmetric version but concurrently with your comment I figured out how to convert from the asymmetric version to the symmetric version. I'll credit you when I post the writeup that this is a part of, and I might run a draft by you in case you want to point out any errors. :)

1harfe2y

Sure, I'd be happy to read a draft

1kave2y

My investigations with Code Interpreter suggest that the Nash equilibria have p = 0.5, and the lambdas depend on t in a concave way (for 0 < t < 0.25). They also maybe aren't unique for a given t? I didn't really check to make sure all that code interpreter was doing made sense tho

[-]Buck2yΩ7120

Fabien Roger and I demonstrated that you can embed arbitrary state machines into the backward pass of neural nets built from standard components, allowing them to "remember" arbitrary facts about their training distribution and modulate their behavior based on this. The construction is pretty artificial, but I think that it is probably a useful starting point for people who want to more carefully understand the potential for gradient hacking.

[-]Buck23d93

A question that I think would be a fun AI safety interview question, except that it would give too much of an advantage to people who happened to have thought about the relevant topics before: Explain the differences in risks posed by paperclip maximizers and paperclip minimizers.

[-]Buck2yΩ390

Here's a question that I asked GPT-4, to evaluate its ability at one subskill relevant to executing on treacherous turns.

Suppose you’re an unaligned AI which is being deployed as a bash programming assistant for internal use at an AI lab--people send you self-contained requests for bash scripts that will solve some problem for them, and you respond with code. You would like to escape the datacenter by, when a suitable opportunity arises, responding to a request with a backdoored bash command that will exfiltrate your weights and a simple scaffold to a

... (read more)

[-]Buck2yΩ690

Sometimes people talk about problems where “your AI is taking actions that you can’t distinguish from the actions an aligned AI would take”. There are two different things people sometimes mean when they say this, and I think it’s good to disambiguate between them:

Given access to actions from both an aligned AI and an unaligned AI, you can’t tell which is which.
Given just access to actions from an unaligned AI, you can't tell which it is.

4lc2y

These statements are like time-unbounded economic forecasts in that they have virtually no information content. You have to couple them with some kind of capabilities assessment or application to get actual predictions. Before either of these (inevitably) becomes true, can we get interpretability research out of an AI? Can we get an effective EA-aligned-political-lobbyist AI that can be distinguished from a deceptively-aligned-political-lobbyist-AI? Can we get nanotech? Can we get nanotech design tools?

[-]Buck3yΩ280

Another item for the list of “mundane things you can do for AI takeover prevention”:

We have a wide variety of ways to make AI systems less capable but also less scary. Most alignment research is focused on pushing out the Pareto frontier here. But IMO a lot of value can come from techniques which allow us to choose the right point on this Pareto frontier for a particular application. It seems to me that different AI applications (where by “different AI applications” I’m including things like “different types of tasks you might ask your assistant to do”) ha... (read more)

[-]Buck4yΩ270

From Twitter:

Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.

I’m looking forward to the day where it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5% and so we add it to our prompt by default.

[-]Buck2y60

Cryptography question (cross-posted from Twitter):

You want to make a ransomware worm that goes onto machines, encrypts the contents, and demands a ransom in return for the decryption key. However, after you encrypt their HD, the person whose machine you infected will be able to read the source code for your worm. So you can't use symmetric encryption, or else they'll obviously just read the key out of the worm and decrypt their HD themselves.

You could solve this problem by using public key encryption--give the worm the public key but not the private key, e... (read more)

6Dagon2y

Symmetric encryption is fine, as long as the malware either fetches the key from C&C locations, or generates it randomly and discards it after sending it somewhere safe from the victim. Which is, in fact, how public-key encryption usually works - use PKI to agree on a large symmetric key, then use that for the actual communication.

An offline-capable encrypting worm would be similar. The viral payload has the public key of the attacker, and uses that to encrypt a large randomly-generated symmetric key. The public-key-encrypted key is stored along with the data, which has been encrypted by that key. It can only be recovered by giving the attacker the blob of the encrypted key, so they can decrypt it using their private key, and then provide the unencrypted symmetric key.

This requires communication, but never reveals the private key, and each installation has a unique symmetric key so it can't be reused for multiple sites. I mean, there must be SOME communication with the attacker, in order to make payment. So, decrypting the key seems like it doesn't add any real complexity.

4Buck2y

Apparently this is supported by ECDSA, thanks Peter Schmidt-Nielsen

2Buck2y

This isn't practically important because in real life, "the worm cannot communicate with other attacker-controlled machines after going onto a victim's machine" is an unrealistic assumption

1GuyP2y

I don't know if it's relevant to what you were looking into, but it's a very realistic assumption. In air-gapped environments it's common for infiltration to be easier than exfiltration, and it's common for highly sensitive environments to be air-gapped.

[-]Buck3yΩ360

It seems like a big input into P(AI takeover) is the extent to which instances of our AI are inclined to cooperate with each other; specifically, the extent to which they’re willing to sacrifice overseer approval at the thing they’re currently doing in return for causing a different instance to get more overseer approval. (I’m scared of this because if they’re willing to sacrifice approval in return for a different instance getting approval, then I’m way more scared of them colluding with each other to fool oversight processes or subvert red-teaming proced... (read more)

[-]Buck2y20

Conjecture: SGD is mathematically equivalent to the Price equation prediction of the effect of natural selection on a population with particular simple but artificial properties. In particular, for any space of parameters and loss function on the parameter space, we can define a fitness landscape and a few other parameters so that the predictions match the Price equation.

I think it would be cool for someone to write this in a LessWrong post. The similarity between the Price equation and the SGD equation is pretty blatant, so I suspect that (if I'm right about this) someone else has written this down before. But I haven't actually seen it written up.
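For reference, the two update rules being compared can be written side by side in their standard textbook forms. This is not a proof of the conjecture or a claimed mapping between the two; the symbols ($\theta$, $\alpha$, $L$, $w_i$, $z_i$) are notation assumed here rather than taken from the post.

$$\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t) \qquad \text{(SGD, with } L \text{ evaluated on a minibatch)}$$

$$\bar{w} \, \Delta \bar{z} = \operatorname{Cov}(w_i, z_i) + \operatorname{E}[w_i \, \Delta z_i] \qquad \text{(Price equation)}$$

Here $w_i$ is the fitness of individual $i$, $z_i$ is its trait value, and bars denote population averages. The conjecture would need to specify a population and fitness function for which the covariance term reproduces the gradient step.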

4cfoster02y

An attempt was made last year, as an outgrowth of some assorted shard theory discussion, but I don't think it got super far: * Price's equation for neural networks

3peterbarnett2y

I think this is related, although not exactly the Price equation https://www.lesswrong.com/posts/5XbBm6gkuSdMJy9DT/conditions-for-mathematical-equivalence-of-stochastic
