Stuart_Armstrong

Sequences

Generalised models

Concept Extrapolation

AI Safety Subprojects

Practical Guide to Anthropics

Anthropic Decision Theory

Subagents and impact measures

If I were a well-intentioned AI...

Wiki Contributions

Quick Reference Guide To The Infinite

12y

(+3/-3)

Quick Reference Guide To The Infinite

12y

(+1/-2)

Quick Reference Guide To The Infinite

12y

(+2/-3)

Quick Reference Guide To The Infinite

12y

(+2/-2)

Quick Reference Guide To The Infinite

12y

(+2/-5)

Quick Reference Guide To The Infinite

12y

(+3/-4)

Quick Reference Guide To The Infinite

12y

(+3/-4)

Quick Reference Guide To The Infinite

12y

(+2/-6)

Quick Reference Guide To The Infinite

12y

(+2/-4)

Quick Reference Guide To The Infinite

12y

(+6/-3)

Comments

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example

Stuart_Armstrong5mo60

*Goodhart

Thanks! Corrected (though it is indeed a good hard problem).

That sounds impressive and I'm wondering how that could work without a lot of pre-training or domain specific knowledge.

Pre-training and domain specific knowledge are not needed.

But how do you know you're actually choosing between smile-frown and red-blue?

Run them on examples such as frown-with-red-bar and smile-with-blue-bar.

Also, this method seems superficially related to CIRL. How does it avoid the associated problems?

Which problems are you thinking of?

Agentic Mess (A Failure Story)

Stuart_Armstrong6mo42

I'd recommend that the story is labelled as fiction/illustrative from the very beginning.

Examples of AI's behaving badly

Stuart_Armstrong8mo20

Thanks, modified!

By default, avoid ambiguous distant situations

Stuart_Armstrong9mo40

I believe I do.

Acausal trade: Introduction

Stuart_Armstrong11moΩ330

Thanks!

Avoiding xrisk from AI doesn't mean focusing on AI xrisk

Stuart_Armstrong1y84

Having done a lot of work on corrigibility, I believe that it can't be implemented in a value agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.

Satisficers want to become maximisers

Stuart_Armstrong1y30

Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?

If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U>u and is 0 otherwise. This is a well-defined utility function, and since the expected value of V is just the probability that U>u, the design you described is exactly equivalent to being an expected V-maximiser.
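
To make the equivalence concrete, here is a minimal sketch in Python. The two actions, their lotteries over U, and the threshold value are all made up for illustration; nothing here comes from the original post. It only checks that ranking actions by the probability that U>u gives the same answer as ranking them by the expected value of V.

```python
from fractions import Fraction

def prob_above(lottery, threshold):
    """P(U > threshold) for a lottery given as (probability, utility) pairs."""
    return sum(p for p, util in lottery if util > threshold)

def expected_v(lottery, threshold):
    """E[V], where V = 1 if U > threshold and V = 0 otherwise."""
    return sum(p * (1 if util > threshold else 0) for p, util in lottery)

# Hypothetical actions (assumed for illustration): each is a lottery over U.
actions = {
    "safe": [(Fraction(1), 5)],
    "risky": [(Fraction(1, 2), 0), (Fraction(1, 2), 100)],
}
u = 10  # hypothetical threshold that U needs to exceed

# The "probability satisficer" and the expected V-maximiser pick the same action.
best_by_probability = max(actions, key=lambda a: prob_above(actions[a], u))
best_by_expected_v = max(actions, key=lambda a: expected_v(actions[a], u))
```

In this toy case an expected U-maximiser would also pick "risky", so the example only illustrates the equivalence with V-maximisation, not the ways the two designs can diverge.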

Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong1y20

Thanks! Corrected.

Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong1y20

Thanks! Corrected.

Using GPT-Eliezer against ChatGPT Jailbreaking

Stuart_Armstrong1y30

Great and fun :-)