Joe summarizes his new report on "scheming AIs": advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (e.g. distinguishing "alignment faking" from "power-seeking"), asks what the prerequisites for scheming are, and considers the paths by which it might arise.

Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and full of pretty bad arguments. This paper fixed that by bringing together most (all?) of the main considerations for and against expecting scheming to emerge. I found this helpful for clarifying my thinking around the topic: it made me more confident in my focus on AI control and less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over- or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but it is currently the best resource on the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
Thomas Kwa
Some versions of the METR time horizon paper from alternate universes:

Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh)

Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-0.0 yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends.

When Will Worrying About AI Be Automated?

Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves.

Estimating Time Since The Singularity

Early work
Yonatan Cale
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
keltan
I feel a deep love and appreciation for this place, and the people who inhabit it.
lc
My strong upvotes are now giving +1 and my regular upvotes give +2.
RobertM
Pico-lightcone purchases are back up, now that we think we've ruled out any obvious remaining bugs.  (But do let us know if you buy any and don't get credited within a few minutes.)


Recent Discussion

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...

Consequentialism is an approach for converting intelligence (the ability to make use of symmetries, e.g. to generalize information from one context into predictions in another, or to search through highly structured search spaces) into agency: one can use the intelligence to predict the consequences of actions and find a policy which achieves some criterion unusually well.

While it seems intuitively appealing that non-consequentialist approaches could be used to convert intelligence into agency, I have tried a lot and not been able to come up ... (read more)
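A minimal sketch of the consequentialist loop described above, under my own framing: the "intelligence" appears as a predictive world model, and agency is just selecting the action whose predicted outcome best satisfies the criterion. The names (predict, score) are illustrative, not from the comment:

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")
Outcome = TypeVar("Outcome")

def consequentialist_policy(
    actions: Iterable[Action],
    predict: Callable[[Action], Outcome],  # the "intelligence": a world model mapping actions to outcomes
    score: Callable[[Outcome], float],     # the criterion the agent is trying to achieve
) -> Action:
    """Pick the action whose predicted consequence scores highest."""
    return max(actions, key=lambda a: score(predict(a)))

# Toy usage: choose how much effort to spend, given a made-up world model.
best = consequentialist_policy(
    actions=range(10),
    predict=lambda effort: effort * 2 - effort ** 2 / 5,  # predicted payoff
    score=lambda payoff: payoff,
)
print(best)
```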

Gordon Seidoh Worley
No matter what the goal, power seeking is of general utility. Even if an AI is optimizing for virtue instead of some other goal, more power would, in general, give them more ability to behave virtuously. Even if the virtue is something like "be an equal partner with other beings", an AI could ensure equality by gaining lots of power and enforcing equality on everyone.
Gurkenglas
The idea would be that it isn't optimizing for virtue, it's taking the virtuous action, as in https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory.

In this post, I claim a few things and offer some evidence for these claims. Among these things are:

  • Language models have many redundant attention heads for a given task
  • In-context learning works through the addition of features, which are learnt through Bayesian updates (see the toy sketch after this list)
  • The model likely breaks down the task into various subtasks, and each of these is added as a feature. I assume that these are handled by the MLPs (this is also the claim I'm least confident about)
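A toy, external illustration of the Bayesian-update framing (not the post's claim about features inside the model): given a few (x, y) examples taken from the prompt shown later in the post, exact Bayesian updating over a small grid of candidate linear mappings y = ax + b concentrates all posterior mass on a single hypothesis. The hypothesis grid and variable names are mine, for illustration only.

```python
from itertools import product

# A few (x, y) examples from the prompt shown later in the post
examples = [(28, 59), (86, 175), (13, 29), (55, 113)]

# Hypothesis space: y = a*x + b for small non-negative integers a, b
hypotheses = list(product(range(5), range(10)))
posterior = {h: 1 / len(hypotheses) for h in hypotheses}  # uniform prior

for x, y in examples:
    # Deterministic likelihood: 1 if the hypothesis reproduces the example, else 0
    posterior = {(a, b): p * (1.0 if a * x + b == y else 0.0)
                 for (a, b), p in posterior.items()}
    total = sum(posterior.values())
    posterior = {h: p / total for h, p in posterior.items()}  # renormalize

print(max(posterior, key=posterior.get))  # (2, 3), i.e. y = 2x + 3
```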

To set some context, the task I'm going to be modelling is one where we give the model a series of (x, y) pairs in the following format:

(x, y)\n

where for each example, y = 2x + 3. As a concrete example, I use:

(28, 59)
(86, 175)
(13, 29)
(55, 113)
(84, 171)
(66, 135)
(85, 173)
(27, 57)
(15, 33)
(94, 191)
(37, 77)
(14, 31)

All...
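For reference, a minimal sketch of how such a prompt could be generated, assuming the y = 2x + 3 mapping that the examples above follow (the function name and parameters are mine, not from the post):

```python
import random

def make_icl_prompt(n_examples: int = 12, seed: int = 0) -> str:
    """Build an in-context-learning prompt of (x, y) lines with y = 2x + 3."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_examples):
        x = rng.randint(10, 99)
        lines.append(f"({x}, {2 * x + 3})")
    return "\n".join(lines) + "\n"

print(make_icl_prompt())
```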

norm \in \mathbb{R}, doesn't matter

PDF version. berkeleygenomics.org. Twitter thread. (Bluesky copy.)

Summary

The world will soon use human germline genomic engineering technology. The benefits will be enormous: Our children will be long-lived, will have strong and diverse capacities, and will be halfway to the end of all illness.

To quickly bring about this world and make it a good one, it has to be a world that is beneficial, or at least acceptable, to a great majority of people. What laws would make this world beneficial to most, and acceptable to approximately all? We'll have to chew on this question ongoingly.

Genomic Liberty is a proposal for one overarching principle, among others, to guide public policy and legislation around germline engineering. It asserts:

Parents have the right to freely choose the genomes of their children.

If upheld,...

River
I think the frames in which you are looking at this are just completely wrong. We aren't really talking about "decisions about an individual's reproduction". We are talking about how a parent can treat their child. This is something that is already highly regulated by the state, CPS is a thing, and it is good that it is a thing. There may be debates to be had about whether CPS has gone too far on certain issues, but there is a core sort of evil that CPS exists to address, and that it is good for the state to address. And blinding your child is a very core paradigmatic example of that sort of evil. Whether you do it by genetic engineering or surgically or through some other means is entirely beside the point.

Genetic engineering isn't special. It is just another technology. To take something that is obviously wrong and evil when done by other means, that everyone will agree the state should prevent when done by other means, and say that the state should allow it when done by genetic engineering, that strikes me as a major political threat to genetic engineering. We don't get genetic engineering to happen by creating special rules for it that permit monstrosities forbidden by any other means. We get genetic engineering by showing people that it is just another technology, and we can use it to do good and not evil, applying the same notions of good and evil that we would anywhere else. If a blind parent asked a surgeon to sever the optic nerve of their newborn baby, and the surgeon did it, both the parents and the surgeon would go to jail for child abuse. Any normal person can see that a genetic engineer should be subject to the same ethical and legal constraints there as the surgeon. Arguing otherwise will endanger your purported goal of promoting this technology.

This notion of "erasing a type of person" also seems like exactly the wrong frame for this. When we cured smallpox, did we erase the type of person called "smallpox survivor"? When we feed a hungry pe
TsviBT
I'm not especially distinguishing the methods, I'm mainly distinguishing whether it's being done to a living person. See my comment upthread https://www.lesswrong.com/posts/rxcGvPrQsqoCHndwG/the-principle-of-genomic-liberty?commentId=qnafba5dx6gwoFX4a

I think you're fundamentally missing that your notions of good and evil aren't supposed to automatically be made into law. That's not what law is for. See a very similar discussion here: https://www.lesswrong.com/posts/JFWiM7GAKfPaaLkwT/the-vision-of-bill-thurston?commentId=Xvs2y9LWbpFcydTJi

The eugenicists in early 20th century America also believed they were increasing good and getting rid of evil. Do you endorse their policies, and/or their general stance toward public policy?

Maybe, I'm not sure and I'd like to know. This is an empirical question that I hope to find out about.

That's nice that you can feel good about your intentions, but if you fail to listen to the people themselves who you're erasing, you're the one who's being evil. When it comes to their own children, it's up to them, not you. If you ask people with smallpox "is this a special consciousness, a way of life or being, which you would be sad to see disappear from the world?", they're not gonna say "hell yeah!". But if you ask blind people or autistic people, some fraction of them will say "hell yeah!".

Your attitude of just going off your own judgement... I don't know what to say about it yet, I'm not even empathizing with it yet. (If you happen to have a link to a defense of it, e.g. by a philosopher or other writer, I'd be curious.)

Now, as I've suggested in several places, if the blind children whose blind parents chose to make them blind later grow up and say "This was terrible, it should not have happened, the state should not allow this", THEN I'd be likely to support regulation to that effect. See also https://www.lesswrong.com/posts/JFWiM7GAKfPaaLkwT/the-vision-of-bill-thurston?commentId=Y5y2bky2eFqYwWKrz
River

I'm not especially distinguishing the methods, I'm mainly distinguishing whether it's being done to a living person.

Genetic engineering is a thing you do to a living person. If a person is going to go on to live a life, they don't somehow become less a person because you are influencing them at the stage of being an embryo in a lab. That's just not a morally coherent distinction, nor is it one the law makes.

Nothing in my position is hinging on my personal moral views. I am trying to point out to you that almost everyone in our society has the view that bli... (read more)

I think rationalists should consider taking more showers.

As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:

A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.

Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.

When you shower (or bathe, that also works), you usually are cut off...

dkl9
"Stimulating" here is not quite the opposite of "boring". Many minds are used to said temperature changes, water assaults, and laborious motions, such that they still stimulate, but are easily ignored, leaving much space for thoughts. Showers are boring by consistency, despite stimulation.
Gordon Seidoh Worley
More reasons to shower: smelling nice is good.

A counterpoint: when I skip showers, my cat appears strongly in favor of the smell of my armpits, occasionally going so far as to burrow into my shirt sleeves and bite my armpit hair (which, to both my and my cat's distress, is extremely ticklish). Since studies suggest that cats have a much more sensitive olfactory sense than humans (see https://www.mdpi.com/2076-2615/14/24/3590), it stands to reason that their judgement regarding whether smelling nice is good or bad should hold more weight than our own. And while my own cat's preference for me smelling... (read more)

bohaska
I guess this is another case of 'Universal' Human Experiences That Not Everyone Has

Roko's basilisk states that

> an otherwise benevolent artificial superintelligence (AI) in the future that would punish anyone who knew of its potential existence but did not directly contribute to its advancement or development, in order to incentivize said advancement.

Now this is obviously correct, so there's no need to get too into it.

But if there's a 10% chance that we end up with a benevolent superintelligence in the future, there's a 50% chance we end up with a malevolent one. And the malevolent one has exactly the same incentives.

So if you want to spend an eternity in heaven instead of hell, the correct thing to do is put all of your effort into creating the most evil superintelligence you can.

This essay is where I start.

Happy April 1st if you happen to be on the US west coast![1]

  1. ^

    I'm not...

Hey Everyone,

It is with a sense of... considerable cognitive dissonance that I am letting you all know about a significant development for the future trajectory of LessWrong. After extensive internal deliberation, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA.

I assure you, nothing about how LessWrong operates on a day to day level will change. I have always cared deeply about the robustness and integrity of our institutions, and I am fully aligned with our stakeholders at EA. 

To be honest, the key thing that EA brings to the table is money and talent. While the recent layoffs in EA's broader industry have been...

Wei Dai

Just wanted to let everyone know I now wield a +307 strong upvote thanks to my elite 'hacking' skills. The rationalist community remains safe, because I choose to use this power responsibly.

As an unrelated inquiry, is anyone aware of some "karma injustices" that need to be corrected?

habryka
You can now choose which virtues you want to display next to your username! Just go to the virtues dialogue on the frontpage and select the ones you want to display (up to 3).
AprilSR
Why do I have dozens of points of strong upvote and downvote strength, but no more agreement strength than before I began my strength training? Does EA not think agreement is important?
habryka
Absolutely, that is our sole motivation.

The Internet is a great invention. Just about everything in humanity’s knowledge can be found on the Internet: with just a few keystrokes, you can find dozens of excellent quality textbooks, YouTube videos and blog posts about any topic you want to learn. The information is accessible to a degree that scholars and inventors could have hardly dreamed of even a few decades ago. With the popularity of remote work and everything moving online this decade, it is not surprising that many people are eager to learn new skills from the Internet.

However, despite the abundance of resources, self-studying over the Internet is harder than you think. You can easily find a list of resources on your topic, full of courses and textbooks to study from (like this one...

It's interesting how two years later, the "buy an expert's time" suggestion is almost outdated. There are still situations where it makes sense, but probably in the majority of situations any SOTA LLM will do a perfectly fine job giving useful feedback on exercises in math or language learning.

Thanks for the post!

I'm not writing this to alarm anyone, but it would be irresponsible not to report on something this important. On current trends, every car will be crashed in front of my house within the next week. Here's the data:

Until today, only two cars had crashed in front of my house, several months apart, during the 15 months I have lived here. But a few hours ago it happened again, mere weeks from the previous crash. This graph may look harmless enough, but now consider the frequency of crashes this implies over time:

The car crash singularity will occur in the early morning hours of Monday, April 7. As crash frequency approaches infinity, every car will be involved. You might be thinking that the same car could be involved in multiple crashes. This is true! But the same car can only withstand a finite number of crashes before it is no longer able to move. It follows that every car will be involved in at least one crash. And who do you think will be driving your car? 

I accept your statistics and assume I'll be driving my car. Damn.

OTOH, I can be pretty certain I won't die or be seriously injured.

That has happened never in my thousands of weeks, so statistically, it almost certainly won't within the next week!

Jon Garcia
See, this is what happens when you extrapolate data points linearly into the future. You get totally unrealistic predictions. It's important to remember the physical constraints on whatever trend you're trying to extrapolate. Importantly for this issue, you need to remember that time between successive crashes can never be negative, so it is inappropriate to model intervals with a straight line that crosses the time axis on April 7. Instead, with so few data points, a more realistic model would take a log-transform of the inter-crash interval before fitting the prediction line. In fact, once you do so, it becomes clear that this is a geometric series, with inter-crash interval decaying exponentially with number of crashes. The total time taken for N cars to crash in front of your house after the first one grows as T_N = \sum_{n=1}^{N} t_0 r^{n-1} = t_0 \frac{1 - r^N}{1 - r}, where r \approx 27/155 \approx 0.174 and t_0 = 155 days, based on your graph. According to Google, there are 1.47 billion cars in the world. The time it will take for all of them to crash in front of your house is T_{1.47 \times 10^9} = 155 \cdot \frac{1 - 0.174^{1.47 \times 10^9}}{1 - 0.174} \approx 187.7 days from the first crash, which works out to 5.7 days from today. Which turns out to be April 7. Hmm... Well, see you on Monday, I guess.
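A quick sketch of that arithmetic in code (the constants are read off the comment above; the variable names are mine):

```python
# Geometric-series model of the inter-crash intervals from the comment above.
t0 = 155.0          # days between the first and second crash
r = 27.0 / 155.0    # each successive interval shrinks by this factor (~0.174)
n_cars = 1.47e9     # rough number of cars in the world

# Total time for N further crashes after the first: T_N = t0 * (1 - r**N) / (1 - r)
total_days = t0 * (1 - r ** n_cars) / (1 - r)
print(f"~{total_days:.1f} days after the first crash")  # ~187.7 days
```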
Ruby
Was a true trender-bender
Mars_Will_Be_Ours
Quick! Someone fund my steel production startup before it's too late! My business model is to place a steel foundry under your house to collect the exponentially growing number of cars crashing into it! Imagine how much money we can make by revolutionizing metal production during the car crash singularity! Think of the money! Think of the Money! Think of the Money!!!
