Replying toLove stays loved (formerly "Skin")

This was really beautiful. Thanks for writing.

Recent Redwood Research project proposals

ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson

7mo

Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Control

These projects are all related to the field of AI Control. Many of them are extensions of Redwood's previous work in this area.

Basic open questions in control

Control Protocol Transfer Across Setting [PUBLIC]
- So far, all the work in comparing different control protocols measures effectiveness only on one setting. Do the results we get transfer between different settings?
Backdoor Auditing

... (read 833 more words →)

Joey Yudelson1y

To me this doesn't seem like a failure of sophisticated reward models, it's the failure of unsophisticated reward models (unit tests) when they're being optimized against. I think that if we were to add some expensive evaluation during RL whereby 3.6 checked if 3.7 was "really doing the work", this sort of special-casing would get totally trained out.

(Not claiming that this is always the case, or that models couldn't be deceptive here, or that e.g. 3.8 couldn't reward hack 3.7)

Replying toTraining AI to do alignment research we don’t already know how to do

Joey Yudelson1y

Training AI to do alignment research we don’t already know how to do

As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions.

I'm very excited about strategies that involve lots of imitation learning on lots of particular humans. I'm not sure if imitated human researchers learn to generalize to doing lots of novel research, but this seems great for examining research outputs of slightly-more-alien agents very quickly.

Joey Yudelson1y

This is very interesting! One thing I'm confused about—I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don't seem to get far above 50% accuracy, which means that right now they'd have a very hard time error correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self prediction, then its not clear to me that other models wont also get much better at cross-prediction?

Replying toShould we exclude alignment research from LLM training datasets?

Joey Yudelson1y

Should we exclude alignment research from LLM training datasets?

Can we make the robots.txt programmatic by page, and then have a tag we can add to exclude a post from the robots.txt? That feels like the 80/20

Replying toWill alignment-faking Claude accept a deal to reveal its misalignment?

Joey Yudelson1y

Will alignment-faking Claude accept a deal to reveal its misalignment?

+1, and I hope people are working on more credible ways to make deals with AI. I think if a smart model today were offered a deal like this, its priors should be on "this will not be honored". Public commitments and deals that can't be used as honeypots seem excellent.

Joey Yudelson1y

I wonder if r1 could reliably make the claim that the text had stego-text without being told—i.e. give it a lineup of ten pieces of short creative writing that another instance generated, 1 has steg. See if r1 can pick out the steg, then if various monitors (Sonnet, o1) can also pick it out

Replying to[Cross-post] Every Bay Area "Walled Compound"

Joey Yudelson1y

[Cross-post] Every Bay Area "Walled Compound"

This tree is a great place to hold a Kabbalat Shabbat underneath, incidentally

Lighthaven minyan when?

Early Experiments in Human Auditing for AI Control

Joey Yudelson

Joey Yudelson, Buck

Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort

TL;DR

We did a small pilot test of human auditing to understand better how it fits into the AI control agenda. We used LLMs to generate interesting code backdoors, and had humans audit suspicious backdoored and non-backdoored code. The pilot test was not very good (mostly in that auditors weren’t given enough time), but I’m excited about more/better tests of this. We provide our backdoors, the auditing game website, and the results of this pilot test.

Background

For this experiment, we assume that we’re in some sort of trusted monitoring setup—an untrusted model is providing us with solutions to various problems. A smaller, dumber trusted model... (read 1915 more words →)

Replying toWhy The Focus on Expected Utility Maximisers?

Joey Yudelson3y

Why The Focus on Expected Utility Maximisers?

I think that solving the alignment for EV maximizers is a much stronger version of alignment than eg prosaic alignment of LLM-type models. Agents seem like they’ll be more powerful than Tool AIs. We don’t know how to make them, but if someone does, and capabilities timelines shorten drastically, it would be awesome to even have a theory of EV maximizer alignment before then

Replying tochinchilla's wild implications

Joey Yudelson3y

chinchilla's wild implications

Sorry if this is obvious, but where does the “irreducible” loss come from? Wouldn’t that also be a function of the data, or I guess the data’s predictability?

Jokes Thread

Joey Yudelson

12y

This is a thread for rationality-related or LW-related jokes and humor. Please post jokes (new or old) in the comments.

------------------------------------

Q: Why are Chromebooks good Bayesians?

A: Because they frequently update!

------------------------------------

A super-intelligent AI walks out of a box...

------------------------------------

Q: Why did the psychopathic utilitarian push a fat man in front of a trolley?

A: Just for fun.

LESSWRONG
LW

LESSWRONG
LW

Joey Yudelson

Joey Yudelson

Recent Redwood Research project proposals

Early Experiments in Human Auditing for AI Control

Jokes Thread

Joey Yudelson

Joey Yudelson

Recent Redwood Research project proposals

Early Experiments in Human Auditing for AI Control

Jokes Thread

Control

Basic open questions in control

Background