I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of four people working on understanding language model features in context, which led to the release of an open-source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.
How large a reward pot do you think is useful for this? Maybe it would be easier to get a couple of lab employees to chip in some equity than to get a company to spend weirdness points on this. Or maybe one could create a human whistleblower reward program that credibly promises to reward AIs on the side.
On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations, like: if you leave, are there better alternatives? How much power do you really have? I don't fault someone who thinks this through and decides that something is wrong but there's no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, and look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of "this is how much of a chance it looks like pause activism would need before I'd quit and endorse a pause".
I think I was just conflating different kinds of decisions here, imagining arguing with people with very different conceptions of what is important to count in costs and benefits, and was a bit confused. On reflection I don't endorse a 10x margin in terms of, like, percentage points of x-risk. And maybe margin is sort of a crutch; maybe the thing I want is more like "95% chance of being net-positive, considering the possibility that you're kind of biased". I still think you should be suspicious of "the case exactly balances, let's ship".
Yeah, this part is pretty under-defined. I was maybe falling into the trap of being too idealistic, and I'm probably less optimistic about this than I was when I wrote it. I think there's something directionally important here: are you trying at all to expand the circle of accountability, even if you're being cautious about expanding it because you're afraid of things breaking down?
A couple of advantages for AI intellectuals could be:
- being able to rerun them on different inputs and see how their analysis changes as a function of those inputs
- being able to view full reasoning traces (while these also aren't the full story, they're probably more of the full story than what goes on with human reasoning; good intellectuals already try to share their process, but maybe they can do better / use this to weed out clearly bad approaches) (rough sketch below)
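A minimal sketch of what rerunning an AI intellectual on varied inputs and keeping its stated reasoning could look like, using the OpenAI chat completions API. The question, the input variants, the model name, and the "reasoning then one-line conclusion" format are all illustrative assumptions, not anything I actually built:

```python
# Sketch: rerun the same analysis under different input assumptions and keep
# the model's stated reasoning alongside its conclusion, so you can diff
# conclusions across inputs and inspect each trace.
from openai import OpenAI

client = OpenAI()

QUESTION = "Should a frontier lab pause training runs above a given compute threshold?"

# Vary one input assumption and rerun the same analysis for each variant.
input_variants = [
    "Assume compute overhang risks are small.",
    "Assume compute overhang risks are large.",
    "Assume international coordination on a pause is feasible.",
]

runs = []
for assumption in input_variants:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Lay out your reasoning step by step, then give a one-sentence conclusion on the final line.",
            },
            {"role": "user", "content": f"{QUESTION}\nInput assumption: {assumption}"},
        ],
    )
    text = response.choices[0].message.content
    # Split the final line off as the conclusion; everything before it is the trace.
    reasoning, _, conclusion = text.rpartition("\n")
    runs.append({"assumption": assumption, "reasoning": reasoning, "conclusion": conclusion})

# Compare how the conclusion moves as a function of the input assumption.
for run in runs:
    print(run["assumption"], "->", run["conclusion"])
```

This is obviously a toy: a real version would want many samples per input, structured traces rather than free text, and some way of checking that the stated reasoning actually drove the conclusion.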
Initial version for Firefox. Code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0