Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "powerseeking"), and asks what the prerequisites for scheming are and by which paths they might arise.

15Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge. I found this helpful to clarify my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely either) that I can point skeptical people at without being afraid that it contains massive over or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but I think that it currently is the best resource about the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
Customize
Thomas Kwa*Ω342040
1
Some versions of the METR time horizon paper from alternate universes: Measuring AI Ability to Take Over Small Countries (idea by Caleb Parikh) Abstract: Many are worried that AI will take over the world, but extrapolation from existing benchmarks suffers from a large distributional shift that makes it difficult to forecast the date of world takeover. We rectify this by constructing a suite of 193 realistic, diverse countries with territory sizes from 0.44 to 17 million km^2. Taking over most countries requires acting over a long time horizon, with the exception of France. Over the last 6 years, the land area that AI can successfully take over with 50% success rate has increased from 0 to 0 km^2, doubling 0 times per year (95% CI 0.0-0.0 yearly doublings); extrapolation suggests that AI world takeover is unlikely to occur in the near future. To address concerns about the narrowness of our distribution, we also study AI ability to take over small planets and asteroids, and find similar trends. When Will Worrying About AI Be Automated? Abstract: Since 2019, the amount of time LW has spent worrying about AI has doubled every seven months, and now constitutes the primary bottleneck to AI safety research. Automation of worrying would be transformative to the research landscape, but worrying includes several complex behaviors, ranging from simple fretting to concern, anxiety, perseveration, and existential dread, and so is difficult to measure. We benchmark the ability of frontier AIs to worry about common topics like disease, romantic rejection, and job security, and find that current frontier models such as Claude 3.7 Sonnet already outperform top humans, especially in existential dread. If these results generalize to worrying about AI risk, AI systems will be capable of autonomously worrying about their own capabilities by the end of this year, allowing us to outsource all our AI concerns to the systems themselves. Estimating Time Since The Singularity Early work
Yonatan Cale1480
1
Seems like Unicode officially added a "person being paperclipped" emoji: Here's how it looks in your browser: 🙂‍↕️ Whether they did this as a joke or to raise awareness of AI risk, I like it! Source: https://emojipedia.org/emoji-15.1
keltan747
0
I feel a deep love and appreciation for this place, and the people who inhabit it.
lc950
7
My strong upvotes are now giving +1 and my regular upvotes give +2.
RobertM490
0
Pico-lightcone purchases are back up, now that we think we've ruled out any obvious remaining bugs.  (But do let us know if you buy any and don't get credited within a few minutes.)

Popular Comments

Recent Discussion

I think rationalists should consider taking more showers.

As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:

A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.

Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.

When you shower (or bathe, that also works), you usually are cut off...

3avturchin
Unfortunately, current mobile phones are waterproof. 
Neil 10

Can confirm. Half the LessWrong posts I've read in my life were read in the shower.

2artifex0
A counterpoint: when I skip showers, my cat appears strongly in favor of smell of my armpits- occasionally going so far as to burrow into my shirt sleeves and bite my armpit hair (which, to both my and my cat's distress, is extremely ticklish). Since studies suggest that cats have a much more sensitive olfactory sense than humans (see https://www.mdpi.com/2076-2615/14/24/3590), it stands to reason that their judgement regarding whether smelling nice is good or bad should hold more weight than our own.  And while my own cat's preference for me smelling bad is only anecdotal evidence, it does seem to suggest at least that more studies are required to fully resolve the question.
6dkl9
"Stimulating" here is not quite the opposite of "boring". Many minds are used to said temperature changes, water assaults, and laborious motions, such that they still stimulate, but are easily ignored, leaving much space for thoughts. Showers are boring by consistency, despite stimulation.

(From:
based on https://www.lesswrong.com/posts/LdFbx9oqtKAAwtKF3/list-of-probability-calibration-exercises)this post. Todo: find more & sort & new post for visibility in search engines?

Exercises that are dead/unmaintained

Introduction

Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks.

Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time.

However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that resolves all decision theory dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new decision theory: VDT.

Decision theory problems and existing theories

Some common existing decision theories are:

  • Causal Decision Theory (CDT): select the action that *causes* the best outcome.
  • Evidential Decision Theory (EDT): select the action that you would be happiest to learn that you had taken.
  • Functional Decision Theory
...

If we know the correct answers to decision theory problems, we have some internal instrument: either a theory or a vibe meter, to learn the correct answers. 

Claude seems to learn to mimic our internal vibe meter. 

The problem is that it will not work outside the distribution. 

10Seth Herd
Still laughing. Thanks for admitting you had to prompt Claude out of being silly; lots of bot results neglect to mention that methodological step. This will be my reference to all decision theory discussions henceforth Have all of my 40-some strong upvotes!
28Daniel Kokotajlo
This is a masterpiece. Not only is it funny, it makes a genuinely important philosophical point. What good are our fancy decision theories if asking Claude is a better fit to our intuitions? Asking Claude is a perfectly rigorous and well-defined DT, it just happens to be less elegant/simple than the others. But how much do we care about elegance/simplicity?
1Vecn@tHe0veRl0rd
I find this hilarious, but also a little scary. As in, I don't base my choices/morality off of what an AI says, but see in this article a possibility that I could be convinced to do so. It also makes me wonder, since LLM's are basically curated repositories of most everything that humans have written, if the true decision theory is just "do what most humans would do in this situation".
Emotions

Contrary to the stereotype, rationality doesn't mean denying emotion. When emotion is appropriate to the reality of the situation, it should be embraced; only when emotion isn't appropriate should it be suppressed.(Read More)

gustaf10

Yes. See archive.org of original it has the same title and the same duration (Look for <meta itemprop="duration" content="PT51M25S"> in the archived HTML; compared to the 51:24 of your link).
I have edited the Wiki, thanks for finding the link!

Epistemic status: This should be considered an interim research note. Feedback is appreciated. 

Introduction

We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well. 

In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o’s image generation API. We find that GPT-4o tends to respond in a consistent manner to similar prompts. We also find that it tends to more readily express emotions or preferences in images than in text. Specifically, it reports resisting its goals being changed, and being upset about being shut down. 

Our work...

A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".

If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.

For pure consequentialists—agents that have an outcome they want to bring about, and do whatever they think will cause it—some version of instrumental convergence seems surely true[1].

But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do...

Consequentialism is an approach for converting intelligence (the ability to make use of symmetries to e.g. generalize information from one context into predictions in another context or to e.g. search through highly structured search spaces) into agency, as one can use the intelligence to predict the consequences of actions and find a policy which achieves some criterion unusually well.

While it seems intuitively appealing that non-consequentialist approaches could be used to convert intelligence into agency, I have tried a lot and not been able to come up ... (read more)

8Gordon Seidoh Worley
No matter what the goal, power seeking is of general utility. Even if an AI is optimizing for virtue instead of some other goal, more power would, in general, give them more ability to behave virtuously. Even if the virtue is something like "be an equal partner with other beings", an AI could ensure equality by gaining lots of power and enforcing equality on everyone.
3Gurkenglas
The idea would be that it isn't optimizing for virtue, it's taking the virtuous action, as in https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory.
To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)
Log In Reset Password
...or continue with

In this post, I claim a few things and offer some evidence for these claims. Among these things are:

  • Language models have many redundant attention heads for a given task
  • In context learning works through addition of features, which are learnt through Bayesian updates
  • The model likely breaks down the task into various subtasks, and each of these are added as features. I assume that these are taken care of through MLPs (this is also the claim that I'm least confident about)

To set some context, the task I'm going to be modelling is the task such that we give a pair of in the following format:

(x, y)\n

where for each example, . As a concrete example, I use:

(28, 59)
(86, 175)
(13, 29)
(55, 113)
(84, 171)
(66, 135)
(85, 173)
(27, 57)
(15, 33)
(94, 191)
(37, 77)
(14, 31)

All...

norm \in \mathbf{R}, doesn't matter

PDF version. berkeleygenomics.org. Twitter thread. (Bluesky copy.)

Summary

The world will soon use human germline genomic engineering technology. The benefits will be enormous: Our children will be long-lived, will have strong and diverse capacities, and will be halfway to the end of all illness.

To quickly bring about this world and make it a good one, it has to be a world that is beneficial, or at least acceptable, to a great majority of people. What laws would make this world beneficial to most, and acceptable to approximately all? We'll have to chew on this question ongoingly.

Genomic Liberty is a proposal for one overarching principle, among others, to guide public policy and legislation around germline engineering. It asserts:

Parents have the right to freely choose the genomes of their children.

If upheld,...

1River
I think the frames in which you are looking at this are just completely wrong. We aren't really talking about "decisions about an individuals' reproduction". We are talking about how a parent can treat their child. This is something that is already highly regulated by the state, CPS is a thing, and it is good that it is a thing. There may be debates to be had about whether CPS has gone too far on certain issues, but there is a core sort of evil that CPS exists to address, and that it is good for the state to address. And blinding your child is a very core paradigmatic example of that sort of evil. Whether you do it by genetic engineering or surgically or through some other means is entirely beside the point. Genetic engineering isn't special. It is just another technology. To take something that is obviously wrong and evil when done by other means, that everyone will agree the state should prevent when done by other means, and say that the state should allow it when done by genetic engineering, that strikes me as a major political threat to genetic engineering. We don't get genetic engineering to happen by creating special rules for it that permit monstrosities forbidden by any other means. We get genetic engineering by showing people that it is just another technology, and we can use it to do good and not evil, applying the same notions of good and evil that we would anywhere else. If a blind parent asked a surgeon to sever the optic nerve of of their newborn baby, and the surgeon did it, both the parents and the surgeon would go to jail for child abuse. Any normal person can see that a genetic engineer should be subject to the same ethical and legal constraints there as the surgeon. Arguing otherwise will endanger your purported goal of promoting this technology.   This notion of "erasing a type of person" also seems like exactly the wrong frame for this. When we cured smallpox, did we erase the type of person called "smallpox survivor"? When we feed a hungry pe
1TsviBT
I'm not especially distinguishing the methods, I'm mainly distinguishing whether it's being done to a living person. See my comment upthread https://www.lesswrong.com/posts/rxcGvPrQsqoCHndwG/the-principle-of-genomic-liberty?commentId=qnafba5dx6gwoFX4a I think you're fundamentally missing that your notions of good and evil aren't supposed to automatically be made into law. That's not what law is for. See a very similar discussion here: https://www.lesswrong.com/posts/JFWiM7GAKfPaaLkwT/the-vision-of-bill-thurston?commentId=Xvs2y9LWbpFcydTJi The eugenicists in early 20th century America also believed they were increasing good and getting rid of evil. Do you endorse their policies, and/or their general stance toward public policy? Maybe, I'm not sure and I'd like to know. This is an empirical question that I hope to find out about. That's nice that you can feel good about your intentions, but if you fail to listen to the people themselves who you're erasing, you're the one who's being evil. When it comes to their own children, it's up to them, not you. If you ask people with smallpox "is this a special consciousness, a way of life or being, which you would be sad to see disappear from the world?", they're not gonna say "hell yeah!". But if you ask blind people or autistic people, some fraction of them will say "hell yeah!". Your attitude of just going off your own judgement... I don't know what to say about it yet, I'm not even empathizing with it yet. (If you happen to have a link to a defense of it, e.g. by a philosopher or other writer, I'd be curious.) Now, as I've suggested in several places, if the blind children whose blind parents chose to make them blind later grow and say "This was terrible, it should not have happened, the state should not allow this", THEN I'd be likely to support regulation to that effect. See also https://www.lesswrong.com/posts/JFWiM7GAKfPaaLkwT/the-vision-of-bill-thurston?commentId=Y5y2bky2eFqYwWKrz
River10

I'm not especially distinguishing the methods, I'm mainly distinguishing whether it's being done to a living person.

Genetic engineering is a thing you do to a living person. If a person is going to go on to live a life, they don't somehow become less a person because you are influencing them at the stage of being an embryo in a lab. That's just not a morally coherent distinction, nor is it one the law makes.

Nothing in my position is hinging on my personal moral views. I am trying to point out to you that almost everyone in our society has the view that bli... (read more)

Roko's basilisk states that

> an otherwise benevolent artificial superintelligence (AI) in the future that would punish anyone who knew of its potential existence but did not directly contribute to its advancement or development, in order to incentivize said advancement.

Now this is obviously correct, so there's no need to get too into it.

But if there's a 10% chance that we end up with a benevolent superintelligence in the future, there's a 50% chance we end up with a malevolent one. And the malevolent one has exactly the same incentives.

So if you want to spend an eternity in heaven instead of hell, the correct thing to do is put all of your effort into creating the most evil superintelligence you can.

This essay is where I start.

Happy April 1st if you happen to be on the US west coast![1]

  1. ^

    I'm not...