I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of four people working on understanding language model features in context, which led to the release of an open-source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.
How large a reward pot do you think is useful for this? Maybe it would be easier to get a couple of lab employees to chip in some equity than to get a company to spend weirdness points on this. Or maybe one could create a human whistleblower reward program that credibly promises to reward AIs on the side.
On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations, like: if you leave, are there better alternatives? How much power do you really have? I don't fault someone who thinks this through and decides that something is wrong but there's no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, and look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of "this is how much of a chance it looks like pause activism would need before I'd quit and endorse a pause".
I think I was just conflating different kinds of decisions here, imagining arguing with people with very different conceptions of what is important to count in costs and benefits, and was a bit confused. On reflection I don't endorse a 10x margin in terms of, like, percentage points of x-risk. And maybe margin is sort of a crutch; maybe the thing I want is more like "95% chance of being net-positive, considering the possibility that you're kind of biased". I still think you should be suspicious of "the case exactly balances, let's ship".
Yeah, this part is pretty under-defined. I was maybe falling into the trap of being too idealistic, and I'm probably less optimistic about this than I was when I wrote it. I think there's something directionally important here: are you trying at all to expand the circle of accountability, even if you're being cautious about expanding it because you're afraid of things breaking down?
A couple of advantages for AI intellectuals could be:
- being able to rerun them on different inputs and see how their analysis changes as a function of those inputs
- being able to view full reasoning traces (while these also aren't the full story, they're probably more of the full story than what goes on with human reasoning; good intellectuals already try to share their process, but maybe they can do better / use this to weed out clearly bad approaches) (rough sketch below)
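A minimal sketch of what rerunning an AI intellectual on varied inputs and keeping its stated reasoning could look like, using the OpenAI chat completions API. The question, the input variants, the model name, and the "reasoning then one-line conclusion" format are all illustrative assumptions, not anything I actually built:

```python
# Sketch: rerun the same analysis under different input assumptions and keep
# the model's stated reasoning alongside its conclusion, so you can diff
# conclusions across inputs and inspect each trace.
from openai import OpenAI

client = OpenAI()

QUESTION = "Should a frontier lab pause training runs above a given compute threshold?"

# Vary one input assumption and rerun the same analysis for each variant.
input_variants = [
    "Assume compute overhang risks are small.",
    "Assume compute overhang risks are large.",
    "Assume international coordination on a pause is feasible.",
]

runs = []
for assumption in input_variants:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Lay out your reasoning step by step, then give a one-sentence conclusion on the final line.",
            },
            {"role": "user", "content": f"{QUESTION}\nInput assumption: {assumption}"},
        ],
    )
    text = response.choices[0].message.content
    # Split the final line off as the conclusion; everything before it is the trace.
    reasoning, _, conclusion = text.rpartition("\n")
    runs.append({"assumption": assumption, "reasoning": reasoning, "conclusion": conclusion})

# Compare how the conclusion moves as a function of the input assumption.
for run in runs:
    print(run["assumption"], "->", run["conclusion"])
```

This is obviously a toy: a real version would want many samples per input, structured traces rather than free text, and some way of checking that the stated reasoning actually drove the conclusion.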
Initial version for Firefox. Code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0