AI Oversight — LessWrong

You are viewing version 1.0.0 of this page.

Edited by RobertM, last updated 20th Sep 2022

A description from "Doing oversight from the very start of training seems hard":

Here ‘oversight’ means there is something with access to the internals of the model which checks that the model isn’t misaligned even if the behavior on the training distribution looks fine. In "An overview of 11 proposals for building safe advanced AI", all but two of the proposals basically look like this, as does "AI safety via market making".
Examples of oversight techniques include:
- Transparency tools (either used by a human, an AI, or a human assisted by an AI)
- Adversarial inputs (giving inputs which could trick a misaligned AI into revealing itself)
- Relaxed adversarial training (which could be seen as an extension of adversarial inputs)

Posts tagged AI Oversight

Oversight Misses 100% of Thoughts The AI Does Not Think (126 karma, 4y, 49 comments)

No, We're Not Getting Meaningful Oversight of AI (48 karma, 10mo, 4 comments)

Quick thoughts on "scalable oversight" / "super-human feedback" research, by David Scott Krueger (27 karma, 3y, 9 comments)

Measuring and Improving the Faithfulness of Model-Generated Reasoning, by Ansh Radhakrishnan, tamera, karinanguyen, Sam Bowman, Ethan Perez (111 karma, 3y, 15 comments)

Building Black-box Scheming Monitors, by CorrigibleAgent, richbc, Simon Storf, Marius Hobbhahn (45 karma, 9mo, 18 comments)

Human-AI Complementarity: A Goal for Amplified Oversight, by rishubjain, Sophie Bridgers (27 karma, 1y, 4 comments)

Constitutional Black-Box Monitoring for Scheming in LLM Agents, by Simon Storf, richbc, Marius Hobbhahn (27 karma, 2mo, 0 comments)

Doing oversight from the very start of training seems hard (14 karma, 4y, 3 comments)

How a bug of AI hardware may become a feature for AI governance (9 karma, 5mo, 0 comments)

Trying to measure AI deception capabilities using temporary simulation fine-tuning (4 karma, 3y, 0 comments)

Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis (4 karma, 1y, 0 comments)

AI Safety Oversights (3 karma, 1y, 0 comments)

Alignment Structure Direction - Recursive Adversarial Oversight(RAO) (2 karma, 1y, 0 comments)

W2SG: Introduction (2 karma, 2y, 2 comments)

Is there any existing term summarizing non-scalable oversight methods in outer alignment? (1 karma, 3y, 0 comments)
