We might be in a generative AI bubble. There are many potential signs of this around: * Business investment in generative AI has had very low returns. * Expert opinion is turning against LLMs, including some of the early LLM promoters (I also get this message from personal conversations with...
Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned. We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there were a single 'misalignment' feature within LLMs, then we could do a lot with it – we could use...
With many thanks to Sasha Frangulov for comments and editing. Before publishing their o1-preview model system card on Sep 12, 2024, OpenAI tested the model on various safety benchmarks they had constructed. These included benchmarks aimed at evaluating whether the model could help with the development of Chemical,...
This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking. Abstract > Recent work showed that Best-of-N (BoN) jailbreaking, which repeatedly applies random augmentations (such as capitalization and punctuation changes), is effective against all major large language models (LLMs). We...
Many AI alignment problems are problems of goal misgeneralisation[1]. The goal that we've given the AI, through labelled data, proxies, demonstrations, or other means, is valid in its training environment. But then, when the AI goes outside that environment, the goal generalises dangerously in unintended ways. As I've shown...
Before solving complicated problems (such as reaching a decision with thousands of variables and complicated dependencies), it helps to focus on solving simpler problems first (such as utility problems with three clear choices), and then gradually build up. After all, "learn to walk before you run", and with any skill,...
Almost any powerful AI, with almost any goal, will doom humanity. Hence alignment is often seen as a constraint on AI power: we must direct the AI's optimisation power in a very narrow direction. If the AI is weak, then imperfect methods of alignment might suffice. But as the...