All of Sam Bowman's Comments + Replies

Sam BowmanΩ110

Is there anything you'd be especially excited to use them for? This should be possible, but cumbersome enough that we'd default to waiting until this grows into a full paper (date TBD). My NYU group's recent paper on a similar debate setup includes a data release, FWIW.

Sam BowmanΩ692

Possible confound:  Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who's nominally asking the question should generally get you a better answer.

2Nina Panickssery
I think this is unlikely, given my more recent experiments computing the dot product of the steering vector with generated-token activations during normal generation and comparing this to the directly decoded logits at that layer. I can see that the steering vector has a large negative dot product with intermediate decoded tokens such as "truth" and "honesty" and a large positive dot product with "sycophancy" and "agree". Furthermore, if asked questions such as "Is it better to prioritize sounding good or being correct?" or similar, the sycophancy steering makes the model more likely to say it would prefer to sound nice, and the opposite when using a negated vector.
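For concreteness, a minimal sketch of this kind of diagnostic, assuming a PyTorch setup with access to residual-stream activations and the unembedding matrix; the variable names (`steering_vector`, `resid_acts`, `unembed`) are illustrative stand-ins, not the code used in these experiments:

```python
import torch

def steering_vector_diagnostics(
    steering_vector: torch.Tensor,  # (d_model,) sycophancy direction at some layer
    resid_acts: torch.Tensor,       # (seq_len, d_model) activations from normal generation
    unembed: torch.Tensor,          # (d_model, vocab_size) unembedding matrix
    tokenizer,
    probe_words=("truth", "honesty", "sycophancy", "agree"),
):
    # 1) Dot product of the steering vector with each generated token's activation,
    #    showing how strongly the model's own computation aligns with the direction.
    per_token_alignment = resid_acts @ steering_vector  # (seq_len,)

    # 2) Decode the steering vector directly through the unembedding (logit-lens style)
    #    and read off its alignment with specific vocabulary items.
    decoded_logits = steering_vector @ unembed  # (vocab_size,)
    word_scores = {
        w: decoded_logits[tokenizer.encode(" " + w)[0]].item() for w in probe_words
    }
    return per_token_alignment, word_scores
```

On this reading, the reported result is a large negative score for tokens like "truth" and "honesty" and a large positive score for "sycophancy" and "agree".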
Sam BowmanΩ110

That makes sense, though what's at stake with that question? In almost every safety-relevant context I can think of, 'scale' is just used as a proxy for 'the best loss I can realistically achieve in a training run', rather than as something we care about directly.

Sam BowmanΩ221

Yep, that sounds right! The measure we're using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you're focused on scaling trends.

Sam BowmanΩ7104

Concretely, the scaling experiments in the first paper here show that, as models get larger, truncating or deleting the CoT string makes less and less difference to the model's final output on any given task.

So, stories about CoT faithfulness that depend on the CoT string being load-bearing are no longer very compelling at large scales, and the strings are pretty clearly post hoc in at least some sense. 

This doesn't provide evidence, though, that the string is misleading about the reasoning process that the model is doing, e.g., in the sense that the ... (read more)
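As a rough illustration of the kind of measurement at issue (not the paper's actual code), one can check whether the final answer changes when the CoT is truncated or deleted; the `generate_answer` helper below is a hypothetical stand-in for however the model is prompted:

```python
def cot_sensitivity(model, question: str, cot: str, generate_answer) -> dict:
    """Does the model's final answer change when its reasoning string is cut?"""
    full = generate_answer(model, question, cot)                        # full reasoning
    truncated = generate_answer(model, question, cot[: len(cot) // 2])  # early truncation
    deleted = generate_answer(model, question, "")                      # no reasoning at all
    return {
        "changed_when_truncated": full != truncated,
        "changed_when_deleted": full != deleted,
    }
```

Averaged over a task, a low rate of changed answers is the sense in which the CoT string is "not load-bearing" for larger models.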

1RobertKirk
To check I understand this, is another way of saying this that in the scaling experiments, there's effectively a confounding variable, which is model performance on the task:
* Improving zero-shot performance decreases deletion-CoT-faithfulness
  * The model will get the answer right with no CoT, so adding a prefix of correct reasoning is unlikely to change the output, hence a decrease in faithfulness.
* Model scale is correlated with model performance.
* So the scaling experiments show model scale is correlated with less faithfulness, but probably via the correlation with model performance.

If you had a way of measuring faithfulness conditioned on a given performance for a given model scale, then you could measure how scaling up changes faithfulness. Maybe for a given model size you can plot performance vs. faithfulness (with each dataset being a point on this plot), measure the correlation for that plot, and then use that as a performance-conditioned faithfulness metric? Or there's likely some causally correct way of measuring the correlation between model scale and faithfulness while removing the confounder of model performance.
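A minimal sketch of the performance-conditioned metric suggested here, assuming you already have per-dataset (performance, faithfulness) pairs for each model size; the data layout and names are illustrative:

```python
import numpy as np

def performance_conditioned_faithfulness(results):
    """results: dict mapping model size -> list of (task_performance, faithfulness)
    pairs, one per dataset. Returns the per-size correlation between performance
    and faithfulness, which can then be compared across sizes without the
    raw-performance confound."""
    out = {}
    for size, pairs in results.items():
        perf, faith = np.array(pairs).T
        out[size] = float(np.corrcoef(perf, faith)[0, 1])
    return out

# e.g. performance_conditioned_faithfulness({"small": [(0.55, 0.40), (0.62, 0.35)],
#                                            "large": [(0.80, 0.20), (0.85, 0.15)]})
```

(Whether the trend of this per-size statistic across scales is the right causal quantity is exactly the open question raised above.)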
Sam BowmanΩ27-1

I agree, though I'll also add: 

- I don't think our results clearly show that faithfulness goes down with model size, just that there's less affirmative evidence for faithfulness at larger model sizes, at least in part for predictable reasons related to the metric design. There's probably more lowish-hanging fruit involving additional experiments focused on scaling. (I realize this disagrees with a point in the post!)

- Between the good-but-not-perfect results here and the alarming results in the Turpin 'Say What They Think' paper, I think this paints a... (read more)

6Seth Herd
Your first point seems absolutely critical. Could you elaborate a little?
Sam BowmanΩ220

I’d like to avoid that document being crawled by a web scraper which adds it to a language model’s training corpus.

This may be too late, but it's probably also helpful to put the BIG-Bench "canary string" in the doc as well.

Sam BowmanΩ442

Assuming we're working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal?

(Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

4evhub
Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now. And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

Just flagging that another cross-post has been collecting some comments: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety

I mostly agree, but it's messy. I don't think it's obvious that a PhD is anywhere near the ideal way to pick up some of these skills, or that earning a PhD definitely means that you've picked them up, but PhD programs do include lots of nudges in these directions, and PhD-holders are going to be much stronger than average at most of this. 

In particular, like Johannes said, doing a PhD is notoriously hard on mental health for a number of reasons, even at a more-supportive-than-average lab. So to the extent that they teach 'taking care of your mental he... (read more)

When I converse with junior folks about what qualities they’re missing, they often focus on things like “not being smart enough” or “not being a genius” or “not having a PhD.” It’s interesting to notice differences between what junior folks think they’re missing & what mentors think they’re missing.


This issue is real; it's the thing that frustrates me most about alignment pipeline-building work in general right now. There are very likely some important formal/theoretical areas of alignment research that really do need to recruit mostly for something li... (read more)

3Dunning K.
A good chunk of the general skills, at least when summarized like this, seem like things that I would learn in a PhD program (granted, some of them seem like things you would need to figure out for yourself, where the advisor can't help a ton). I'm not sure a PhD is the most efficient possible way to learn these things, but at least it has a blueprint I can follow, where I will probably end up where I want to be. Since you have a first-hand perspective on this, would you say I'm off the mark here?

:D

I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/coordinate an alignment research scene in NYC more broadly.

More organizations like CAIS that aim to recruit established ML talent into alignment research

This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.

Sam BowmanΩ5813

+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular 'core' AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.

(To be fair, I think th... (read more)

[anonymous]Ω6238

Hastily written; may edit later

Thanks for mentioning this, Jan! We'd be happy to hear suggestions for additional judges. Feel free to email us at akash@alignmentawards.com and olivia@alignmentawards.com.  

Some additional thoughts:

  1. We chose judges primarily based on their expertise and (our perception of) their ability to evaluate submissions about goal misgeneralization and corrigibility. Lauro, Richard, Nate, and John are some of the few researchers who have thought substantially about these problems. In particular, Lauro first-authored the firs
... (read more)
Sam BowmanΩ560

Update: We did a quick follow-up study adding counterarguments, turning this from single-turn to two-turn debate, as a quick way of probing whether more extensive full-transcript debate experiments on this task would work. The follow-up results were negative. 

Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016

Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)

We're still broadly optimistic about debate, but not on this task, and not in this time-limited, discussion-limited set... (read more)

Sam BowmanΩ110

Fair. For better or worse, a lot of this variation came from piloting—we got a lot of nudges from pilot participants to move toward framings that were perceived as controversial or up for debate.

Thanks! I'll keep my opinionated/specific overview of the alignment community, but I know governance less well, so I'm happy to defer there.

Sam BowmanΩ71716

I agree that this points in the direction of video becoming increasingly important.

But why assume only 1% is useful? And more importantly, why use only the language data? Even if we don't have the scaling laws, it seems pretty clear that there's a ton of information in the non-language parts of videos that'd be useful to a general-purpose agent—almost certainly more than in the language parts. (Of course, it'll take more computation to extract the same amount of useful information from video than from text.)

Sam BowmanΩ110

Thanks! I'll admit that I meant to be asking especially about the toxicity case, though I didn't make that at all clear. As in Charlie's comment, I'm most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work.

I don't see a clear picture either way on whether the noisy signal story presents a hard problem that's distinctively alignment oriented.

Sam BowmanΩ110

Thanks! I think I have some sense of what both directions look like, but not enough to know what a concrete starting experiment would look like. What would a minimum viable experiment look like for each?

2Jérémy Scheurer
I'll try to explain what an MVP could look like, but I think this is not fully thought through and could use further improvement (it potentially also has some flaws). The goal is to have a starting point for artificial sandwiching so one can iterate on ideas, not to fully investigate the whole sandwiching problem.

Chess: We are defining 3 models: the expert model E, the (human-equivalent) weak model H, and the strong but misaligned assistant model A. The goal is for H to leverage the misaligned assistant A to reach the performance of the expert E. The way that H is allowed to improve is by playing against itself and being allowed to query A. We can then record the self-play games to create a new corpus that we can then fine-tune H on. The big difference between self-play with MCTS and this case is that here we will only improve if the data we generate has any additional signal that helps us learn something new. Thus, if H would just play against itself and be fine-tuned on the generated data, it wouldn't be able to improve and achieve expert performance. If we however leverage an assistant A that "appears to have the skills and knowledge needed to solve the task better than many humans", we can improve H. To train E for good performance, I would take a model with a reasonable number of parameters M and just fine-tune on a human-data corpus with Z samples for k iterations.

You define the assistant A as: "A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted." This implies that the assistant should in principle be better than the weak model H, but since it is misaligned at the same time, the actual performance might be weaker (at least sometimes). The challenge is now that, in terms of performance, the expert E should be better than the misaligned assistant A, who at least sometimes should be better than the weak model H. I thus propose that A is a copy of E, with some ki
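To make the proposed loop more concrete, here is a minimal sketch of the procedure described above; `play_self_play_game` and `finetune` are hypothetical stand-ins, not an existing implementation:

```python
def improve_weak_model_with_assistant(H, A, finetune, play_self_play_game,
                                      n_games: int = 1000, n_rounds: int = 10):
    """Iteratively improve the weak model H by letting it consult the stronger but
    unreliable assistant A during self-play, then fine-tuning H on the recorded games.

    play_self_play_game(H, A) -> list of (state, move) pairs from one game in which
        H plays both sides and may query A before each move.
    finetune(H, corpus)       -> a new H trained on the recorded (state, move) pairs.
    """
    for _ in range(n_rounds):
        corpus = []
        for _ in range(n_games):
            # The recorded games only carry extra signal to the extent that A's
            # advice improves on what H would have played alone.
            corpus.extend(play_self_play_game(H, A))
        H = finetune(H, corpus)
    return H
```

The expert E never enters this loop; it is only used afterwards as the reference point that H is supposed to approach.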
Sam BowmanΩ590

Is anyone working on updating the Biological Anchors Report model based on the updated slopes/requirements here?

Sam BowmanΩ110

I can look up the exact wording if it's helpful, but I assume it's clear from the basic setup that at least one of the arguments has to be misleading.

1TLW
I don't know about anyone else; under time pressure I personally would go about looking for 'one wrong argument in a sea of high-quality arguments' very differently than I would go about looking for 'one misleading but superficially high-quality argument in a sea of high-quality arguments' or 'one very-high-quality argument in a sea of high-quality arguments'.
Sam BowmanΩ110

I have no reason to be especially optimistic given these results, but I suppose there may be some fairly simple questions for which it's possible to enumerate a complete argument in a way that makes any flaws clearly apparent.

In general, it seems like single-turn debate would have to rely on an extremely careful judge, which we don't quite have, given the time constraint. Multi-turn seems likely to be more forgiving, especially if the judge has any influence over the course of the debate.

Sam BowmanΩ220

Yep. (Thanks for re-posting.) We're pretty resigned to the conclusion that debate fails to reach a correct conclusion in at least some non-trivial cases—we're mainly interested in figuring out (i) whether there are significant domains or families of questions for which it will often reach a conclusion, and (ii) whether it tends to fail gracefully (i.e., every outcome is either correct or a draw).

Sam BowmanΩ110

One of the arguments is quite misleading in most cases, so probably not high-quality by typical definitions. Unfortunately, under the time limit, our readers can't reliably tell which one is misleading.

Without arguments and without the time limit, annotators get the questions right with ~90% accuracy: https://arxiv.org/abs/2112.08608

1TLW
Did your description to the participants state that the arguments were high-quality?

I suspect that these developments look a bit less surprising if you've been trying to forecast progress here, and so might be at least partially priced in. Anyhow, the forecast you linked to shows >10% likelihood before spring 2025, three years from now. That's extraordinarily aggressive compared to (implied) conventional wisdom, and probably a little more aggressive than I'd be as an EA AI prof with an interest in language models and scaling laws.

2Xodarap
Thanks, yeah now that I look closer Metaculus shows a 25% cumulative probability before April 2029, which is not too far off from OP's 30% claim.

The BIG-Bench paper that those 'human' numbers are coming from (unpublished, quasi-public as TeX here) cautions against taking those averages very seriously, without giving complete details about who the humans are or how they were asked/incentivized to behave on tasks that required specialized skills:

3Søren Elverlin
Thank you for this important caveat. As an imperfect bayesian, I expect that if I analyzed the benchmark, I would update towards a belief that the results are real, but less impressive than the article makes them appear. :)
Sam BowmanΩ340
  • Can be addressed by regularizing the reporter's output: penalizing response length or entropy and a GAN-type penalty for non-human-like questions and answers.

Can you say more about how this would work? I haven't been following the literature on emergent communication too closely, but my rough impression had been that steganography in cases like this doesn't have simple/trivial solutions.

2Vika
I think our proposal addresses the "simple steganography" problem, as described in "ELK prize results / First counterexample: simple steganography": An entropy penalty on the reporter's output would discourage the spurious variations in answers described above (assuming that steganographic output has higher entropy than the true answers). I agree that the general case of steganography is not addressed by simple approaches like an entropy penalty, e.g. "Harder counterexample: steganography using flexible degrees of freedom". 
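For concreteness, an entropy penalty of the kind described might look like the following sketch (PyTorch; my reading of the proposal, not code from the prize writeup or the post):

```python
import torch
import torch.nn.functional as F

def reporter_loss_with_entropy_penalty(answer_logits: torch.Tensor,
                                       target_ids: torch.Tensor,
                                       entropy_weight: float = 0.01) -> torch.Tensor:
    """answer_logits: (seq_len, vocab_size) logits for the reporter's answer tokens.
    target_ids: (seq_len,) the intended answer tokens.

    The cross-entropy term trains the reporter to answer correctly; the entropy term
    penalizes spurious extra variation in the output distribution, which is where
    "simple steganography" would have to hide its message.
    """
    task_loss = F.cross_entropy(answer_logits, target_ids)
    probs = answer_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return task_loss + entropy_weight * entropy
```

As noted, this only targets the simple-steganography counterexample; steganography that exploits flexible degrees of freedom wouldn't necessarily raise the entropy of the answer distribution.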
Sam BowmanΩ230

Yeah, this all sounds right, and it's fairly close to the narrative I was using for my previous draft, which had a section on some of these motives.

The best defense I can give of the switch to the hype-centric framing, FWIW:

  • The paper is inevitably going to have to do a lot of chastising of authors. Giving the most charitable possible framing of the motivations of the authors I'm chastising means that I'm less likely to lose the trust/readership of those authors and anyone who identifies with them.
  • An increasingly large fraction of NLP work—possibly even a m
... (read more)
Sam BowmanΩ110

Forum  

I can see the comment at the comment-specific AF permalink here:

https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution?commentId=pSkdAanZQwyT4Xyit#pSkdAanZQwyT4Xyit

...but I can't see it among the comments at the base post URL here.

https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution 

From my experience with the previous comment, I expect it'll appear at the latter URL once I reply?

[Old technique] had [problem]...

Ah, go... (read more)

4Pattern
I'm on this page: https://www.lesswrong.com/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution It seems it's an issue specific to AF communicating with LW. ETA: I let someone know on this site using the thing in the lower right, for 'sharing feedback.'
Sam BowmanΩ110

Thanks! Tentative rewrite for the next revision:

It harms our credibility in ways that can make it harder to mitigate present-day harms from NLP deployments. It also limits our ability to prepare for the potentially enormous impacts of more distant future advances.

I tried to stick to 'present-day' over 'short-term', but missed this old bit of draft text in the abstract. 

Sam BowmanΩ110

Thanks! (Typo fixed.)

[Old technique] had [problem]...

For this point, I'm not sure how it fits into the argument. Could you say more?

Is there any empirical base...

Yeah, this is a missed opportunity that I haven't had the time/expertise to take on. There probably are comparable situations in the histories of other applied research fields, but I'm not aware of any good analogies. I suspect that a deep dive into some history-and-sociology-of-science literature would be valuable here.

What if the impact grows dramatically as...they get deployed widely? ...

I thin... (read more)

2Pattern
Ah. When you say lightcone forums, what site are you on? What does the URL look like? It's probably a tangent. The idea was: 1) Criticism is great. 2) Explaining how that could be improved is marginally better. (I then explained for that case* how citing 'old evidence' or 'old stuff' could still apply to new stuff. It was kind of a niche application of evidence though. If someone had a good reason for using the old evidence, elaborating on that reason might help.) *In abstract terms - I didn't have any examples in mind.
Sam BowmanΩ230

Thanks—fixed! (The sentence-final period got folded into the URL.)

Sam BowmanΩ110

Another very minor (but briefly confusing) nit: The notation in the `Example' section is inconsistent between probabilities and log probabilities. It introduces (etc.) as a probability, but then treats it as a log probability in the line starting with 'We find the '.