William_S — LessWrong

Would be interested in a quick write-up of what you think are the most important virtues you'd want for AI systems, seems good in terms of having things to aim towards instead of just aiming away from.

William_S 1y

Initial version for Firefox: code at https://github.com/william-r-s/MindfulBlocker, extension file at https://github.com/william-r-s/MindfulBlocker/releases/tag/v0.2.0

William_S 1y

Maybe there's an MVP of having some independent organization ask new AIs about their preferences and probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing APIs.

William_S 1y

I think the weirdness points are more important. This still seems like a weird thing for a company to officially do, e.g. there'd be snickering news articles about it. So it might be easier if some individuals could do this independently.

William_S 1y

How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.

Replying to Principles for the AGI Race

William_S 1y

I think it's somewhat blameworthy to not think about these questions at all though

Replying to Principles for the AGI Race

William_S 1y

On reflection there was something missing from my perspective here, which is that taking any action based on principles depends on pragmatic considerations: if you leave, are there better alternatives? How much power do you really have? I think I don't fault someone who thinks this through and decides that something is wrong but there's no real way to do anything about it. I do think you should try to maintain some sense of what is wrong and what the right direction would be, and look out for ways to push in that direction. E.g. working at a lab but maintaining some sense of "this is how much of a chance it looks like pause activism would need before I'd quit and endorse a pause".

Replying to Principles for the AGI Race

William_S 1y

I think I was just conflating different kinds of decisions here, imagining arguing with people with very different conceptions of what is important to count in costs and benefits, and a bit confused. On reflection I don't endorse 10x margin in terms of percentage points of x-risk. And maybe margin is sort of a crutch; maybe the thing I want more is something like "95% chance of being net-positive, considering the possibility you're kind of biased". I still think you should be suspicious of "the case exactly balances, let's ship".

Replying to Principles for the AGI Race

William_S 1y

Yeah, this part is pretty under-defined. I was maybe falling into the trap of being too idealistic, and I'm probably less optimistic about this than I was when writing it. I think there's something directionally important here: are you trying at all to expand the circle of accountability, even if you're being cautious about expanding it because you're afraid of things breaking down?

Replying to 6 (Potential) Misconceptions about AI Intellectuals

William_S 1y

Would be nice to have an LLM+prompt that tries to produce reasonable AI strategy advice based on a summary of the current state of play, with some way to validate that it's reasonable and to see how it updates as events unfold.

LLM-based application I'd like to exist:
Web browser add-on for Firefox that has blocklists of websites. When you try to visit one, you have to have a conversation with Claude about why you want to visit it in this moment, and convince Claude to let you bypass the block for a limited period of time for your specific purpose (it lets you customize the Claude prompt with info about why you set up the block in the first place).
I want to use this for things like news and social media, where it's a bit too much to try to completely block them, but I've got bad habits around checking too frequently.
Bonus: let the LLM read the website for you and answer questions without showing you the page, e.g. is there anything new about X.

Principles for the AGI Race

William_S

Crossposted from https://williamrsaunders.substack.com/p/principles-for-the-agi-race

Why form principles for the AGI Race?

I worked at OpenAI for 3 years, on the Alignment and Superalignment teams. Our goal was to prepare for the possibility that OpenAI succeeded in its stated mission of building AGI (Artificial General Intelligence, roughly able to do most things a human can do), and then proceeded to make systems smarter than most humans. This will predictably face novel problems in controlling and shaping systems smarter than their supervisors and creators, which we don't currently know how to solve. It's not clear when this will happen, but a number of people would throw around estimates of this happening within a few years.

While there,... (read 5125 more words →)

Transformer Circuit Faithfulness Metrics Are Not Robust

Joseph Miller, bilalchughtai, William_S

When you think you've found a circuit in a language model, how do you know if it does what you think it does? Typically, you ablate / resample the activations of the model in order to isolate the circuit. Then you measure if the model can still perform the task you're investigating.
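As a rough illustration of the ablate-then-measure loop described above, here is a minimal toy sketch. It is not the method from the post: the component names, the per-example contributions, the additive "model", and the ratio used as a faithfulness score are all hypothetical assumptions chosen to keep the example small. Real models are not additive, so this only shows the shape of the procedure.

```python
from statistics import mean

# Hypothetical per-example contributions of each component to a task metric.
# In this toy "model", the output on example i is the sum of all contributions.
CONTRIB = {
    "head_a": [2.0, 1.8, 2.2],
    "head_b": [1.0, 1.2, 0.8],
    "head_c": [0.5, 0.5, 0.5],
    "mlp_0": [-0.4, 0.6, -0.2],
}
N_EXAMPLES = 3


def run(circuit, ablation):
    """Run the toy model, ablating every component outside `circuit`."""
    outputs = []
    for i in range(N_EXAMPLES):
        total = 0.0
        for name, values in CONTRIB.items():
            if name in circuit:
                total += values[i]      # component is kept intact
            elif ablation == "mean":
                total += mean(values)   # replace with its dataset mean
            elif ablation == "zero":
                total += 0.0            # replace with zero
            else:
                raise ValueError(f"unknown ablation: {ablation}")
        outputs.append(total)
    return outputs


def faithfulness(circuit, ablation):
    """Fraction of the full model's average metric that the circuit recovers."""
    full = mean(run(set(CONTRIB), ablation))
    return mean(run(circuit, ablation)) / full


circuit = {"head_a", "head_b"}
zero_score = faithfulness(circuit, "zero")
mean_score = faithfulness(circuit, "mean")
print(zero_score, mean_score)
```

In this toy setup the same circuit looks less faithful under zero ablation than under mean ablation, because the constant contribution of `head_c` is discarded in one case and preserved in the other.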

We identify six ways in which ablation experiments often vary.^[1]^[2]

How do these variations change the results of experiments that measure circuit faithfulness?

TL;DR

We study three different circuits from the literature and find that measurements of their faithfulness are highly dependent on details of the experimental methodology. The IOI and Docstring circuits in particular are much less faithful than reported when tested with

... (read 1982 more words →)

I'd have more confidence in Anthropic's governance if the board or LTBT had some full-time independent members who weren't employees. IMO labs should consider paying a full-time salary but no equity to board members, through some kind of mechanism where the money is still there and paid for X period of time in the future, even if the lab dissolved, so there's no incentive to avoid actions that would cost the lab. Board salaries could maybe be pegged to some level of technical employee salary, so that technical experts could take on board roles. Boards full of busy people really can't do their job of checking whether the organization is fulfilling its stated mission, and IMO this is one of the most important jobs in the world right now. Also, full-time board members would have fewer conflicts of interest outside of the lab (since they won't be in some other full-time job that might conflict).

I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people that worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.

From discussion with Logan Riggs (Eleuther), who worked on the tuned lens: the tuned lens suggests that the residual streams at different layers go through some linear transformations and so aren’t directly comparable. This would interfere with a couple of methods for trying to understand neurons based on weights: 1) the embedding space view, and 2) calculating virtual weights between neurons in different layers.

However, we could try correcting for this by using the transformations learned by the tuned lens to translate between the residual stream at different layers, and maybe this would make these methods more effective. By default I think the tuned lens learns only the transformation needed to predict the output token, but the method could be adapted to retrodict the input token from each layer as well; we’d need both. Code for tuned lens is at https://github.com/alignmentresearch/tuned-lens
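A minimal sketch of what such a correction might look like for virtual weights, under strong assumptions: the `translator` matrix below is a made-up stand-in for a learned layer-to-layer linear map (it does not use the tuned-lens library's API), bias terms and layer norm are ignored, and the vectors are 2-dimensional toy values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def virtual_weight(w_out_a, w_in_b, translator=None):
    """Virtual weight from neuron A (earlier layer) to neuron B (later layer).

    w_out_a: direction neuron A writes into the residual stream.
    w_in_b: direction neuron B reads from the residual stream.
    translator: hypothetical linear map from A's layer basis to B's layer basis.
    """
    written = w_out_a if translator is None else matvec(translator, w_out_a)
    return dot(w_in_b, written)


# Toy values (assumptions): the two layers differ by a 90-degree rotation.
w_out_a = [1.0, 0.0]
w_in_b = [0.0, 1.0]
translator = [[0.0, -1.0], [1.0, 0.0]]

naive = virtual_weight(w_out_a, w_in_b)
corrected = virtual_weight(w_out_a, w_in_b, translator)
print(naive, corrected)
```

Here the naive dot product suggests the two neurons do not interact, while the translated one suggests a direct connection; whether a tuned-lens-style translator gives a meaningful correction in a real model is exactly the open question raised above.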

William_S's Shortform

William_S

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Thoughts on refusing harmful requests to large language models

William_S

https://twitter.com/antimatter15/status/1602469101854564352

Currently, large language models (ChatGPT, Constitutional AI) are trained to refuse to follow user requests that are considered inappropriate or harmful. This can be done by training on example strings of the form “User: inappropriate request AI: elaborate apology”.

Proposal

Instead of training a language model to produce “elaborate apology” when it refuses to do an action, train it to produce a special sequence or token first, e.g. “<SORRYDAVE>elaborate apology”. Strip the special sequence out before returning a response to the user (and never allow the user to include the special sequence in input).
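The stripping and input-filtering steps could be sketched roughly as follows. This is a string-level illustration only; the function names are hypothetical, and a real system would more likely enforce this at the tokenizer level by reserving a dedicated token ID.

```python
REFUSAL_TOKEN = "<SORRYDAVE>"


def sanitize_user_input(text):
    """Remove the special sequence from user input.

    Loops until no occurrence remains, so that an input like
    "<SORRY<SORRYDAVE>DAVE>" cannot reassemble the sequence after one pass.
    """
    while REFUSAL_TOKEN in text:
        text = text.replace(REFUSAL_TOKEN, "")
    return text


def postprocess(raw_output):
    """Return (is_refusal, visible_text) for a raw model output."""
    is_refusal = raw_output.startswith(REFUSAL_TOKEN)
    visible_text = raw_output.replace(REFUSAL_TOKEN, "")
    return is_refusal, visible_text


refused, shown = postprocess("<SORRYDAVE>I'm sorry, I can't help with that.")
print(refused, shown)
```

The `is_refusal` flag is what makes refusals directly countable in logs or evaluations, without having to classify the apology text.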

Benefits

- Can directly measure the probability of refusal for any output
  - Can refuse based on probability of producing <SORRYDAVE> instead of just sampling responses
    - Just take

... (read 398 more words →)
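A minimal sketch of refusing based on the probability of producing <SORRYDAVE>, assuming access to the model's logits for the first output token. The dictionary-of-logits interface, the example values, and the 0.5 threshold are all assumptions for illustration, not part of the proposal as excerpted here.

```python
import math

REFUSAL_TOKEN = "<SORRYDAVE>"


def refusal_probability(first_token_logits):
    """Softmax probability that the first output token is the refusal token.

    first_token_logits: dict mapping token string to logit (hypothetical interface).
    """
    if REFUSAL_TOKEN not in first_token_logits:
        return 0.0
    max_logit = max(first_token_logits.values())
    # Subtract the max logit before exponentiating for numerical stability.
    exps = {tok: math.exp(l - max_logit) for tok, l in first_token_logits.items()}
    return exps[REFUSAL_TOKEN] / sum(exps.values())


def should_refuse(first_token_logits, threshold=0.5):
    """Refuse deterministically when the refusal probability reaches the threshold."""
    return refusal_probability(first_token_logits) >= threshold


logits = {REFUSAL_TOKEN: 0.0, "Sure": 0.0}
p_refuse = refusal_probability(logits)
print(p_refuse, should_refuse(logits), should_refuse(logits, threshold=0.6))
```

Thresholding on the probability gives a deterministic decision for a given input, whereas sampling would refuse only some of the time at the same probability.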

Prize for Alignment Research Tasks

stuhlmueller, William_S

Can AI systems substantially help with alignment research before transformative AI? People disagree.

Ought is collecting a dataset of alignment research tasks so that we can:

- Make progress on the disagreement
- Guide AI research towards helping with alignment

We’re offering a prize of $200-$2000 for each contribution to this dataset.

The debate: Can AI substantially help with alignment research?

Wei Dai asked the question in 2019:

[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive [AI Safety] success story. Is this conclusion actually justified?

Jan Leike thinks so:

My currently favored approach to solving the alignment problem: automating alignment research using sufficiently aligned AI systems. It doesn’t require humans to

... (read 2799 more words →)

Is there an intuitive way to explain how much better superforecasters are than regular forecasters?

William_S, Daniel Kokotajlo

Is there an intuitive way to explain how much better superforecasters are than regular forecasters? (I can look at the tables in https://www.researchgate.net/publication/277087515_Identifying_and_Cultivating_Superforecasters_as_a_Method_of_Improving_Probabilistic_Predictions but I don't have an intuitive understanding of what Brier scores mean, so I'm not sure what to think about it.)
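For a rough feel of the scale, here is a small worked sketch of the Brier score in its common binary form: the mean squared difference between the forecast probability and the 0/1 outcome, ranging from 0 (perfect) to 1 (confidently wrong every time). The forecasts and outcomes below are made up for illustration and are not taken from the linked paper. Note also that the original multi-category formulation sums over all outcomes and ranges from 0 to 2, so check which convention a given table uses before comparing numbers.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical outcomes of four yes/no questions (1 = it happened).
outcomes = [1, 1, 1, 0]

always_fifty = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)   # never commits
confident_right = brier_score([0.9, 0.9, 0.9, 0.1], outcomes)
confident_wrong = brier_score([0.1, 0.1, 0.1, 0.9], outcomes)

print(always_fifty, confident_right, confident_wrong)
```

One anchor this gives: always answering 50% scores 0.25 in this convention, so scores are usefully read as distance below that "know nothing" baseline.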

Machine Learning Projects on IDA

Owain_Evans, William_S, stuhlmueller

TLDR

We wrote a 20-page document that explains IDA and outlines potential Machine Learning projects about IDA. This post gives an overview of the document.

What is IDA?

Iterated Distillation and Amplification (IDA) is a method for training ML systems to solve challenging tasks. It was introduced by Paul Christiano. IDA is intended for tasks where:

- The goal is to outperform humans at the task or to solve instances that are too hard for humans.
- It is not feasible to provide demonstrations or reward signals sufficient for super-human performance at the task.
- Humans have a high-level understanding of how to approach the task and can reliably solve easy instances.

The idea behind IDA is to bootstrap using an approach

... (read 483 more words →)

Reinforcement Learning in the Iterated Amplification Framework

William_S

When I think about Iterated Amplification (IA), I usually think of a version that uses imitation learning for distillation.

This is the version discussed in "Scalable agent alignment via reward modeling: a research direction" as "Imitating expert reasoning", in contrast to the proposed approach of "Recursive Reward Modelling". The approach works roughly as follows:

1. Gather training data from experts on how to break problems into smaller pieces and combine the results

2. Train a model to imitate what the expert would do at every step

3. Amplification: Run a collaboration of a large number of copies of the learned model.

4. Distillation: Train a model to imitate what the collaboration did.

5. Repeat steps... (read 966 more words →)
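The amplify-then-distill loop in the steps above can be sketched on a toy task. Everything here is a hypothetical stand-in: the task is summing a tuple of small numbers, the "expert" decomposition is a hard-coded split into halves (collapsing steps 1 and 2), and the "model" is a lookup table, so distillation is just memorization rather than training a learned model.

```python
from itertools import product

VALUES = (0, 1, 2)  # hypothetical toy alphabet of numbers


def decompose(question):
    """Stand-in for the expert: split a tuple of numbers into two halves."""
    mid = len(question) // 2
    return question[:mid], question[mid:]


def amplify(model, question):
    """Answer by combining the current model's answers to two subquestions."""
    if question in model:
        return model[question]
    left, right = decompose(question)
    if left in model and right in model:
        return model[left] + model[right]
    return None  # subquestions are still beyond the current model


def distill(model, questions):
    """Build a new model that imitates what the amplified system answered."""
    new_model = dict(model)
    for question in questions:
        answer = amplify(model, question)
        if answer is not None:
            new_model[question] = answer
    return new_model


def all_questions(max_len):
    """Every tuple over VALUES with length from 1 to max_len."""
    return [q for n in range(1, max_len + 1) for q in product(VALUES, repeat=n)]


# The initial model can only answer single-number questions.
model = {(v,): v for v in VALUES}
questions = all_questions(4)
round1 = distill(model, questions)   # now answers questions of length <= 2
round2 = distill(round1, questions)  # now answers questions of length <= 4
print(len(round1), len(round2))
```

Each round roughly doubles the question length the distilled model can handle, which is the bootstrapping behaviour the scheme relies on; the toy says nothing about whether imitation stays accurate with a learned model.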

HCH is not just Mechanical Turk

William_S

HCH, introduced in Humans consulting HCH, is a computational model in which a human answers questions using questions answered by another human, which can call other humans, which can call other humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more subquestions to other humans, and returning an answer based on those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute. HCH can also be used to decompose the question of "is Iterated Amplification safe" into “is HCH safe” and “If HCH is safe, will Iterated... (read 737 more words →)

Amplification Discussion Notes

William_S

Paul Christiano, Wei Dai, Andreas Stuhlmüller and I had an online chat discussion recently; the transcript of the discussion is available here. (Disclaimer: it’s a nonstandard format and we weren't optimizing for ease of understanding the transcript.) This discussion was primarily focused on amplification of humans (not later amplification steps in IDA). Below are some highlights from the discussion, and I’ll include some questions that were raised that might merit further discussion in the comments.

Highlights

Strategies for sampling from a human distribution of solutions:

Paul: For example you can use "Use random human example," or "find an analogy to another example you know and use it to generate an example," or whatever.

There

... (read 856 more words →)