All of Joey Yudelson's Comments + Replies

To me this doesn't seem like a failure of sophisticated reward models; it's the failure of unsophisticated reward models (unit tests) when they're being optimized against. I think that if we were to add some expensive evaluation during RL whereby 3.6 checked if 3.7 was "really doing the work", this sort of special-casing would get totally trained out.

(Not claiming that this is always the case, or that models couldn't be deceptive here, or that e.g. 3.8 couldn't reward hack 3.7)

Training AI to do alignment research we don’t already know how to do

Joey Yudelson3mo30

As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions.

I'm very excited about strategies that involve lots of imitation learning on lots of particular humans. I'm not sure if imitated human researchers learn to generalize to doing lots of novel research, but this seems great for examining research outputs of slightly-more-alien agents very quickly.

Daniel Tan's Shortform

Joey Yudelson3mo32

This is very interesting! One thing I'm confused about: I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don't seem to get far above 50% accuracy, which means that right now they'd have a very hard time error correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self-prediction, then it's not clear to me that other models won't also get much better at cross-prediction?

Should we exclude alignment research from LLM training datasets?

Joey Yudelson3mo10

Can we make the robots.txt programmatic by page, and then have a tag we can add to exclude a post from the robots.txt? That feels like the 80/20
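A minimal sketch of what a per-page, tag-driven robots.txt could look like, assuming the site can enumerate its posts and their tags. The `Post` class, the `no-llm-training` tag name, and the `/posts/` path prefix are all hypothetical and not taken from LessWrong's actual codebase. Note also that robots.txt is advisory, so this only affects crawlers that choose to honor it.

```python
from dataclasses import dataclass, field

# Hypothetical tag an author would add to opt a post out of crawling.
NO_CRAWL_TAG = "no-llm-training"


@dataclass
class Post:
    """Hypothetical stand-in for a site's post record."""
    slug: str
    tags: set[str] = field(default_factory=set)


def build_robots_txt(posts, user_agent="*", path_prefix="/posts/"):
    """Render a robots.txt body that disallows every post carrying NO_CRAWL_TAG."""
    lines = [f"User-agent: {user_agent}"]
    excluded = sorted(p.slug for p in posts if NO_CRAWL_TAG in p.tags)
    for slug in excluded:
        lines.append(f"Disallow: {path_prefix}{slug}")
    if not excluded:
        # An empty Disallow value means nothing is blocked for this user agent.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    posts = [Post("some-alignment-post", {NO_CRAWL_TAG}), Post("an-ordinary-post")]
    print(build_robots_txt(posts), end="")
```

Serving this from a route handler rather than a static file would let the exclusion list update whenever a tag is added or removed.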

Will alignment-faking Claude accept a deal to reveal its misalignment?

Joey Yudelson3mo2013

+1, and I hope people are working on more credible ways to make deals with AI. I think if a smart model today were offered a deal like this, its priors should be on "this will not be honored". Public commitments and deals that can't be used as honeypots seem excellent.

Daniel Tan's Shortform

Joey Yudelson4mo52

I wonder if r1 could reliably make the claim that the text had stego-text without being told, i.e. give it a lineup of ten pieces of short creative writing that another instance generated, one of which has steg. See if r1 can pick out the steg, then whether various monitors (Sonnet, o1) can also pick it out.

[Cross-post] Every Bay Area "Walled Compound"

Joey Yudelson4mo61

This tree is a great place to hold a Kabbalat Shabbat underneath, incidentally

Lighthaven minyan when?

1davekasten4mo

I mean, the literal best way to incentivize @Ricki Heicklen and me to do this again for LessOnline and Manifest 2025 is to create a prediction market on it, so I encourage you to do that

Why The Focus on Expected Utility Maximisers?

Joey Yudelson2y10

I think that solving alignment for EV maximizers is a much stronger version of alignment than, e.g., prosaic alignment of LLM-type models. Agents seem like they’ll be more powerful than Tool AIs. We don’t know how to make them, but if someone does, and capabilities timelines shorten drastically, it would be awesome to even have a theory of EV maximizer alignment before then.

3DragonGod2y

Reinforcement learning does create agents, those agents just aren't expected utility maximisers. Claims that expected utility maximisation is the ideal or limit of agency seem wrong. I think expected utility maximisation is probably anti-natural to generally capable optimisers.

chinchilla's wild implications

Joey Yudelson3y10

Sorry if this is obvious, but where does the “irreducible” loss come from? Wouldn’t that also be a function of the data, or I guess the data’s predictability?

2nostalgebraist3y

Yes, it's a function of the data, as well as the model architecture / training routine. See my reply in this thread. Also, the value of the irreducible loss isn't important for the conclusions discussed in this post. What we care about is how loss varies with data and parameter count. Those, too, are functions of the data, but different groups training large LMs use qualitatively similar datasets, so I would expect the conclusions here to apply across the board.

What are the most common and important trade-offs that decision makers face?

Joey Yudelson11y10

Constant, predictable gains vs. Black Swans

3ChristianKl11y

A lot of gains that aren't constant but are variable have still nothing to do with Black Swans.

2Andy_McKenzie11y

I think this is Surely Some vs Maybe More -- right? If so, helpful to recall the black swan meme and map it to this, thanks.

2014 Less Wrong Census/Survey

Joey Yudelson11y490

Did the survey! ...And now to upvote everything.

Rationality Quotes May 2014

Joey Yudelson11y-10

It reminds me of Justice Potter Stewart: "I know it when I see it!"

6Shmi11y

Well, it's the converse, which seems a lot more useful a criterion to me.

The Strangest Thing An AI Could Tell You

Joey Yudelson11y40

I knew we shouldn't have spent all that funding on awakening the Elder God Cthulhu!

3Yitz2y

On the contrary, it was a great use of funding--you just solved AI X-risk in one move ;-)

The Strangest Thing An AI Could Tell You

Joey Yudelson11y1-1

Oh god. That... makes a scary amount of sense. If an AI told me that I would probably believe it. I'd also start training myself to be more of a "night-time person".

Welcome to Less Wrong! (6th thread, July 2013)

Joey Yudelson11y80

Hi, my name is Joe. I live in North Jersey. I was born into a very religious Orthodox Jewish family. I only recently realized how badly I was doublethinking.

I started with HPMOR (as, it seems, do most people) and found my way into the Sequences. I read them all on OB, and was amazed at how eloquently someone else could voice what seemed to be my own thoughts. It laid bare the things I had been struggling with.

Then I found LW and was mostly just lurking for a while. I only made an account when I saw this post and realized how badly I wanted to upvote som...