All of Dan H's Comments + Replies

Good Research Takes are Not Sufficient for Good Strategic Takes

If a strategy is likely to be outdated quickly it's not robust and not a good strategy. Strategies should be able to withstand lots of variation.

Zach Stein-Perlman's Shortform

Dan H3mo163

capability thresholds be vague or extremely high

xAI's thresholds are entirely concrete and not extremely high.

evaluation be unspecified or low-quality

They are specified and as high-quality as you can get. (If there are better datasets let me know.)

I'm not saying it's perfect, but I wouldn't put them all in the same bucket. Meta's is very different from DeepMind's or xAI's.

6Zach Stein-Perlman3mo

xAI Risk Management Framework (Draft) You're mostly right about evals/thresholds. Mea culpa. Sorry for my sloppiness. For misuse, xAI has benchmarks and thresholds (or rather examples of benchmarks and thresholds to appear in the real future framework), and based on the right column they seem very reasonably low. Unlike other similar documents, these are not thresholds at which to implement mitigations but rather thresholds to reduce performance to. So it seems the primary concern is probably not that the thresholds are too high but rather that xAI's mitigations won't be robust to jailbreaks and xAI won't elicit performance on post-mitigation models well. E.g. it would be inadequate to just run a benchmark with a refusal-trained model, note that it almost always refuses, and call it a success. You need something like: a capable red-team tries to break the mitigations and use the model for harm, and either the red-team fails or it's so costly that the model doesn't make doing harm cheaper. (For "Loss of Control," one of the two cited benchmarks was published today, and one has not yet been published; I'm dubious that the published one measures what we care about, but I've only spent ~3 mins engaging. [Edit: and, like, on priors, I'm very skeptical of alignment evals/metrics, given the prospect of deceptive alignment, how we care about worst-case in addition to average-case behavior, etc.])

Drake Thomas's Shortform

Dan H3mo1412

though I don't think xAI took an official position one way or the other

I assumed most everybody assumed xAI supported it since Elon did. I didn't bother pushing for an additional xAI endorsement given that Elon endorsed it.

meemi's Shortform

Dan H3mo*320

It's probably worth them mentioning for completeness that Nat Friedman funded an earlier version of the dataset too. (I was advising at that time and provided the main recommendation that it needs to be research-level because they were focusing on Olympiad level.)

Also can confirm they aren't giving access to the mathematicians' questions to AI companies other than OpenAI, such as xAI.

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser

Dan H5mo102

and have clearly been read a non-trivial amount by Elon Musk

Nit: He heard this idea in conversation with an employee AFAICT.

Darwinian Traps and Existential Risks

Dan H8mo4-14

Relevant: Natural Selection Favors AIs over Humans

universal optimization algorithm

Evolution is not an optimization algorithm (this is a common misconception discussed in Okasha, Agents and Goals in Evolution).

habryka8mo188

I am not sure what you mean by "algorithm" here, but the book you link repeatedly acknowledges that of course Natural Selection is an "optimization process", though there are of course disagreements about the exact details. The book has not a single mention of the word "algorithm" so presumably you are using it here synonymous with "optimization process", for which the book includes quotes like:

"selection is obviously [to some degree] an optimizing process"

The book then discusses various limits to the degree to which evolution will arrive at op... (read more)

5KristianRonn8mo

So natural selection is not optimizing fitness? Please elaborate. 😊

Unlearning via RMU is mostly shallow

Dan H9mo*30

We have been working for months on this issue and have made substantial progress on it: Tamper-Resistant Safeguards for Open-Weight LLMs

General article about it: https://www.wired.com/story/center-for-ai-safety-open-source-llm-safeguards/

Re: Anthropic's suggested SB-1047 amendments

Dan H9mo32

It's real.

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs

Dan H10moΩ030

It's worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence "representation" engineering.)

Towards more cooperative AI safety strategies

Dan H10mo42

"Bay Area EA alignment community"/"Bay Area EA community"? (Most EAs in the Bay Area are focused on alignment compared to other causes.)

Towards more cooperative AI safety strategies

Dan H10mo149

The AI safety community is structurally power-seeking.

I don't think the set of people interested in AI safety is even a "community" given how diverse it is (Bengio, Brynjolfsson, Song, etc.), so I think it'd be more accurate to say "Bay Area AI alignment community is structurally power-seeking."

7habryka10mo

I think this is a better pointer, but I think "Bay Area alignment community" is still a bit too broad. I think e.g. Lightcone and MIRI are very separate from Constellation and Open Phil and it doesn't make sense to put them into the same bucket.

Fabien's Shortform

Dan H10moΩ10261

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles... (read more)

What do coherence arguments actually prove about agentic behavior?

Dan H11mo145

Key individuals that the community is structured around just ignored it, so it wasn't accepted as true. (This is a problem with small intellectual groups.)

Buck's Shortform

Dan H1yΩ8150

Some years ago we wrote that "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries" and discussed monitoring systems that can create "AI tripwires could help uncover early misaligned systems before they can cause damage." https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I've updated that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). In vision settings, progress is extraordinarily slow b... (read more)

3Buck1y

Thanks for the link. I don't actually think that either of the sentences you quoted are closely related to what I'm talking about. You write "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries"; I think that this is you making a version of the argument that I linked to Adam Gleave and Euan McLean making and that I think is wrong. You wrote "AI tripwires could help uncover early misaligned systems before they can cause damage" in a section on anomaly detection, which isn't a central case of what I'm talking about.

Introducing AI Lab Watch

Dan H1y302

Various comments:

I wouldn't call this "AI lab watch." "Lab" has the connotation that these are small projects instead of multibillion dollar corporate behemoths.

"deployment" initially sounds like "are they using output filters which harm UX in deployment", but instead this seems to be penalizing organizations if they open source. This seems odd since open sourcing is not clearly bad right now. The description also makes claims like "Meta release all of their weights"---they don't release many image/video models because of deepfakes, so they are doing some ... (read more)

2ESRogs1y

Disagree on "lab". I think it's the standard and most natural term now. As evidence, see your own usage a few sentences later:

2Zach Stein-Perlman1y

If there's a good writeup on labs' policy advocacy I'll link to and maybe defer to it.

9ryan_greenblatt1y

I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug. The deployment criteria starts with: This criteria seems to allow a lab to meet it by having a good risk assessment criteria, but the rest of the criteria contains specific countermeasures that: 1. Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC). 2. Don't pass cost benefit for current models which pose low risk. (And it seems the criteria is "do you have them implemented right now?") If the lab had an excellent risk assessment policy and released weights if the cost/benefit seemed good, that should be fine according to the "deployment" criteria IMO. Generally, the deployment criteria should be gated behind "has a plan to do this when models are actually powerful and their implementation of the plan is credible". I get the sense that this criteria doesn't quite handle the necessary edge cases to handle reasonable choices orgs might make. (This is partially my fault as I didn't notice this when providing feedback on this project.) (IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn't leak architecture secrets.) (A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach I'm afraid.)

2Ben Pace1y

This seems like a good point. Here's a quick babble of alts (folks could react with a thumbs-up on ones that they think are good). AI Corporation Watch | AI Mega-Corp Watch | AI Company Watch | AI Industry Watch | AI Firm Watch | AI Behemoth Watch | AI Colossus Watch | AI Juggernaut Watch | AI Future Watch I currently think "AI Corporation Watch" is more accurate. "Labs" feels like a research team, but I think these orgs are far far far more influenced by market forces than is suggested by "lab", and "corporation" communicates that. I also think the goal here is not to point to all companies that do anything with AI (e.g. midjourney) but to focus on the few massive orgs that are having the most influence on the path and standards of the industry, and to my eye "corporation" has that association more than "company". Definitely not sure though.

3Orpheus161y

@Dan H are you able to say more about which companies were most/least antagonistic?

Zach Stein-Perlman1y*120

This kind of feedback is very helpful to me; thank you! Strong-upvoted and weak-agreevoted.

(I have some factual disagreements. I may edit them into this comment later.)

(If you think Dan's comment makes me suspect this project is full of issues/mistakes, react 💬 and I'll consider writing a detailed soldier-ish reply.)

Refusal in LLMs is mediated by a single direction

Dan H1yΩ1-2-22

is novel compared to... RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all lay

... (read more)

Nina Panickssery1yΩ101720

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.
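To make the distinction concrete, here is a minimal sketch of an optimization-free weight edit of the kind described above, assuming it amounts to projecting a direction out of a matrix's output space, W' = W - r rᵀW. The function names and the toy matrix are hypothetical; this is not code from the post or from the RepE repository.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def orthogonalize(W, direction):
    """Return W' = W - r r^T W, where r is the unit vector along `direction`.

    W is a list of rows; rows index the output dimension the matrix writes to.
    After the edit, no input can make the matrix write along r.
    """
    r = unit(direction)
    n_rows, n_cols = len(W), len(W[0])
    # r^T W: how strongly each column of W writes along r.
    coeffs = [sum(r[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - r[i] * coeffs[j] for j in range(n_cols)]
            for i in range(n_rows)]

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [dot(row, x) for row in W]

# Toy example (values are arbitrary, for illustration only).
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
direction = [1.0, 1.0, 0.0]
W_edited = orthogonalize(W, direction)
output = matvec(W_edited, [0.7, -1.3])
component = dot(output, unit(direction))  # numerically zero
```

No gradient steps are involved: the edit is a closed-form projection, which is what separates it from a finetuning approach.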

Nina Panickssery1yΩ12179

I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.

Refusal in LLMs is mediated by a single direction

Dan H1y*Ω11-5

but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section.

I agree if they simultaneously agree that they don't expect the post to be cited. These can't posture themselves as academic artifacts ("Citing this work" indicates that's the expectation) and fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself.

Otherwise there's a race to the bottom and it makes sense to post daily research notes a... (read more)

Refusal in LLMs is mediated by a single direction

Dan H1y*Ω01-21

From Andy Zou:

Thank you for your reply.

Model interventions to bypass refusal are not discussed in Section 6.2.

We perform model interventions to robustify refusal (your section on “Adding in the "refusal direction" to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments show refusal can be mediated by a single direction, in keeping with the title of this post.

we examined Section 6.2 carefully before writing our work

Not mentioning it anywhere in your work i... (read more)
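The sign-flip point can be illustrated with a toy sketch: add a multiple of a direction to an activation vector, then negate the coefficient to push the other way. The vectors, the coefficient, and the function names are hypothetical, chosen only for illustration; this is not the code from the GitHub demo.

```python
def add_direction(activation, direction, alpha):
    """Shift an activation vector by alpha times a direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Component of the activation along a unit-length direction."""
    return sum(a * d for a, d in zip(activation, direction))

direction = [1.0, 0.0, 0.0]      # stand-in for an extracted direction
activation = [0.5, 2.0, -1.0]    # stand-in for a residual-stream activation

induced = add_direction(activation, direction, alpha=2.0)    # push along the direction
bypassed = add_direction(activation, direction, alpha=-2.0)  # same edit, negated sign
```

Only the component along the direction changes; the other coordinates are untouched in both cases.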

Nina Panickssery1y*Ω112112

FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs.

I think this post is novel compared to both my work and RepE because they:

Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
Investigate projection thoroughly as an alternative to sweeping over vect

... (read more)

Andy Arditi1y110

I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.

One point of clarification here though - when I say "we examined Section 6.2 carefully before writing our work," I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.

Refusal in LLMs is mediated by a single direction

Dan H1y*Ω02-20

From Andy Zou:

Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper's repository which shows that adding a "harmlessness" direction to a model's representation can effectively jailbreak the model.

Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.

Arthur Conmy1yΩ17329

I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. "Citing others" from this Chris Olah blog post), but the authors resolved this when pushed.

But on the other hand, "Section 6.2 of the RepE paper shows exac... (read more)

Andy Arditi1y*Ω61512

Edit (April 30, 2024):

A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.

======

Edit (April 29, ... (read more)

A Gentle Introduction to Risk Frameworks Beyond Forecasting

Dan H1y80

If people are interested, many of these concepts and others are discussed in the context of AI safety in this publicly available chapter: https://www.aisafetybook.com/textbook/4-1

On Complexity Science

Dan H1y32

Here is a chapter from an upcoming textbook on complex systems with discussion of their application to AI safety: https://www.aisafetybook.com/textbook/5-1

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Dan H1y*Ω226431

> My understanding is that we already know that backdoors are hard to remove.

We don't actually find that backdoors are always hard to remove!

We did already know that backdoors often (from the title) "Persist Through Safety Training." This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn't establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.

I think it's very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and ha... (read more)

3mattmacdermott1y

At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas. Since a big proportion will always be newcomers this means the community will under or overweight various areas, but I'm not sure that newcomers dropping the heuristic would lead to better results. Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.

ryan_greenblatt1yΩ13228

I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words, AGI threat scenario "window dressing," or when players from an EA-coded group research a topic. (I've been suggesting more attention to backdoors since maybe 2019; here's a video from a few years ago about the topic; we've also run competitions at NeurIPS with thousands of participants on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don't have the window dre

... (read more)

Machine Unlearning Evaluations as Interpretability Benchmarks

Dan H2y81

I agree that this is an important frontier (and am doing a big project on this).

Broken Benchmark: MMLU

Dan H2y2310

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).

1alenoach2y

As the video says, labeling noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose that an LLM could be used to automatically detect most problematic questions and a human could verify for each flagged question if it needs to be fixed or removed.
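A rough worked example of why noise matters more near the ceiling, under a simplified noise model that is an assumption here rather than anything measured: a fraction of answer keys point at a wrong option, noise is independent of the model's correctness, and a model that misses the true answer picks uniformly among the wrong options.

```python
def observed_accuracy(true_accuracy, noise_rate, num_choices=4):
    """Expected measured accuracy on a benchmark with mislabeled keys.

    Assumed (hypothetical) noise model:
      - with probability noise_rate the key points at one wrong option;
      - noise is independent of whether the model knows the answer;
      - a wrong model answer is uniform over the num_choices - 1 wrong options.
    """
    clean = (1.0 - noise_rate) * true_accuracy
    # On a mislabeled item the model only scores if it happens to pick
    # the same wrong option as the key.
    lucky = noise_rate * (1.0 - true_accuracy) / (num_choices - 1)
    return clean + lucky

perfect_at_2pct = observed_accuracy(1.0, 0.02)   # a perfect model is capped below 100%
perfect_at_10pct = observed_accuracy(1.0, 0.10)
chance_at_10pct = observed_accuracy(0.25, 0.10)  # a chance-level model is unaffected
```

Under these assumptions the measurable ceiling is one minus the noise rate, so a 1-2% noise rate barely matters at 70% accuracy but dominates the remaining headroom once models score in the high 90s.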

1awg2y

Your position seems to be one that says this is not something to be worried about/looking at. Can you explain why? For instance, if it is a desire to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (if, for example, we could somehow get that number down to 0%)?

AI Forecasting: Two Years In

Dan H2y30

The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn't an ideological statement about what sort of alignment strategies we want.

AI Forecasting: Two Years In

Dan H2y30

I think there's a clear enough distinction between Transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say "no calculators," they're not banning mental calculation, rather specific tools.

AI Forecasting: Two Years In

Dan H2y31

It was specified in the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113 In your metaculus question you may not have added that restriction. I think the question is much less interesting/informative if it does not have that restriction. The questions were designed assuming there's no calculator access. It's well-known many AIME problems are dramatically easier with a powerful calculator, since one could bash 1000 options and find the number that works for many problems. That's no longer... (read more)
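As a toy illustration of that kind of bashing: AIME answers are integers from 0 to 999, so a script can simply try every candidate. The condition below is a made-up stand-in, not an actual AIME problem.

```python
def ends_in_444(n):
    """Hypothetical problem condition: n squared ends in the digits 444."""
    return (n * n) % 1000 == 444

# Exhaustively check every possible AIME answer.
candidates = [n for n in range(1000) if ends_in_444(n)]
smallest = candidates[0]
```

A problem that asks for the smallest such integer stops testing number theory once the search costs three lines of code.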

2O O2y

I think it’s better if calculators are counted for the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.

AI Forecasting: Two Years In

Dan H2y20

Usage of calculators and scripts is disqualifying on many competitive maths exams. Results obtained this way wouldn't count (this was specified some years back). However, that is an interesting paper worth checking out.

2O O2y

That doesn’t make too much sense to me. Here the calculators are running on the same hardware as the model and theoretically transformers can just have a submodel that models a calculator.

9jsteinhardt2y

Is it clear these results don't count? I see nothing in the Metaculus question text that rules it out.

Announcing Foresight Institute's AI Safety Grants Program

Dan H2y157

Neurotechnology, brain computer interface, whole brain emulation, and "lo-fi" uploading approaches to produce human-aligned software intelligence

Thank you for doing this.

how 2 tell if ur input is out of distribution given only model weights

Dan H2y81

There's a literature on this topic. (paper list, lecture/slides/homework)

7dkirmani2y

I resent the implication that I need to "read the literature" or "do my homework" before I can meaningfully contribute to a problem of this sort. The title of my post is "how 2 tell if ur input is out of distribution given only model weights". That is, given just the model, how can you tell which inputs the model "expects" more? I don't think any of the resources you refer to are particularly helpful there. Your paper list consists of six arXiv papers (1, 2, 3, 4, 5, 6). Paper 1 requires you to bring a dataset. Paper 2 just says "softmax classifiers tend to make more certain predictions on in-distribution inputs". I should certainly hope so. (Of course, not every model is a softmax classifier.) Paper 3 requires you to know the training set, and also it only works on models that happen to be softmax classifiers. Paper 4 requires a dataset of in-distribution data, it requires you to train a classifier for every model you want to use their methods with, and it looks like it requires the data to be separated into various classes. Paper 5 is basically the same as Paper 2, except it says "logits" instead of "probabilities", and includes more benchmarks. Paper 6 only works for classifiers and it also requires you to provide an in-distribution dataset. It seems that all of the six methods you referred me to either (1) require you to bring a dataset, or (2) reduce to "Hey guys, classifiers make less confident predictions OOD!". Therefore, I feel perfectly fine about failing to acknowledge the extant academic literature here. (Additionally, the methods in my post were also replicated in language models by @voooooogel.)
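For reference, the baseline that Papers 2 and 5 are summarized as reducing to can be sketched in a few lines: score an input by the classifier's maximum softmax probability (or maximum logit) and treat low scores as more likely out-of-distribution. The logits below are made up and the function names are hypothetical; this only illustrates the scoring rule as characterized above, not any paper's exact method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_score(logits):
    """Confidence score: the largest class probability."""
    return max(softmax(logits))

def max_logit_score(logits):
    """Variant that scores with the largest raw logit instead."""
    return max(logits)

confident_logits = [9.0, 0.5, -1.0]  # stand-in for an in-distribution input
flat_logits = [1.0, 1.0, 1.0]        # stand-in for an input the model is unsure about
```

Both scores need a classifier head and, in practice, a threshold chosen on some reference data, which is the limitation the comment above is pointing at.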

Alignment Grantmaking is Funding-Limited Right Now

Dan H2yΩ10263

Plug: CAIS is funding constrained.

Why was the AI Alignment community so unprepared for this moment?

Dan H2y*7439

Why was the AI Alignment community so unprepared for engaging with the wider world when the moment finally came?

In 2022, I think it was becoming clear that there'd be a huge flood of interest. Why did I think this? Here are some reasons: I've long thought that once MMLU performance crosses a threshold, Google would start to view AI as an existential threat to their search engine, and it seemed like in 2023 that threshold would be crossed. Second, at a rich person's party, there were many highly plugged-in elites who were starting to get much more anxious a... (read more)

3L Rudolf L2y

This seems like an impressive level of successfully betting on future trends before they became obvious. Are you talking about literal polling here? Are there actual numbers on what doom stories the public finds more and less plausible, and with what exact audience? It's interesting that paper timing is so important. I'd have guessed earlier is better (more time for others to build on it, the ideas to seep into the field, and presumably gives more "academic street cred"), and any publicity boost from a recent paper (e.g. journalists more likely to be interested or whatever) could mostly be recovered later by just pushing it again when it becomes relevant (e.g. "interview with scientists who predicted X / thought about Y already a year ago" seems pretty journalist-y). There's an underlying gist here that I agree with, but this point seems too strong; I don't think there is literally no one who counts as an expert who hasn't lived in the Bay, let alone Berkeley alone. I would maybe buy it if the claim were about visiting.

Elon Musk announces xAI

Dan H2y37

No good deed goes unpunished. By default there would likely be no advising.

Jan_Kulveit2y102

What were the other options? Have you considered advising xAI privately, or re-directing xAI to be advised by someone else? Also, would the default be clearly worse?

As you surely are quite aware of, one of the bigger fights about AI safety across academia, policymaking and public spaces now is the discussion about AI safety being "distraction" from immediate social harms, and being actually the agenda favoured by the leading labs and technologists. (Often comes with accusations of attempted regulatory capture, worries about concentration of power, et... (read more)

1Amalthea2y

It is unclear to me whether having your name publicly associated with them is good or bad (compared to advising without it being publicly announced). On one hand it boosts awareness of CAIS and gives you the opportunity to cause them some amount of negative publicity if you at some point distance yourself. On the other it does grant them some license to brush off safety worries by gesturing at your involvement.

Ben Pace2y112

I am not sure what this comment is responding to.

The only criticism of you and your team in the OP is that you named your team the "Center" for AI Safety, as though you had much history leading safety efforts or had a ton of buy-in from the rest of the field. I don't believe that either of these are true^[1], it seems to me that the name preceded you engaging in major safety efforts. This power-grab for being the "Center" of the field was a step toward putting you in a position to be publicly interviewed and on advisory boards like this and coordinate the s... (read more)

Catastrophic Risks from AI #1: Introduction

Dan H2yΩ140

A brief overview of the contents, page by page.

1: most important century and hinge of history

2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / cuban missile crisis

3: unilateralist's curse

4: bio x-risk

5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech

6: persuasive AIs and eroded epistemics

7: value lock-in and entrenched totalitarianism

8: story about bioterrorism

9: practical malicious use suggestions

10: LAWs as an on-ramp to AI x-risk

11: automated c... (read more)

MetaAI: less is less for alignment.

Dan H2yΩ6116

but I'm confident it isn't trying to do this

It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's to show reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and is about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.

2Cleo Nardo2y

Thanks for the summary.
* Does MACHIAVELLI work for chatbots like LIMA?
* If not, which do you think is the SOTA? Anthropic's?

3ryan_greenblatt2y

Sorry, thanks for the correction. I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.

Request: stop advancing AI capabilities

Dan H2y90

successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress

Give one example of a substantial state-of-the-art advance that was decisively influenced by transparency; I ask since you said "known to be." Saying that it's conceivable isn't evidence they're actually highly entangled in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.

In the DL paradigm you can't easily separate capabilities and a

... (read more)

1a3orn2y2617

The probably-canonical example at the moment is Hyena Hierarchy, which cites a bunch of interpretability research, including Anthropic's stuff on Induction Heads. If HH actually gives what it promises in the paper, it might enable way longer context.

I don't think you even need to cite that though. If interpretability wants to be useful someday, I think interpretability has to be ultimately aimed at helping steer and build more reliable DL systems. Like that's the whole point, right? Steer a reliable ASI.

The Polarity Problem [Draft]

Dan H2yΩ120

I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, but the same person didn't give permission for both of these requests. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)

Steering GPT-2-XL by adding an activation vector

Dan H2yΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

... (read more)

5davidad2y

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

Steering GPT-2-XL by adding an activation vector

Dan H2y*Ω23-3

steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be ... (read more)

9TurnTrout2y

I personally don't "dismiss" the task vector work. I didn't read Thomas as dismissing it by not calling it the concrete work he is most excited about -- that seems like a slightly uncharitable read? I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added): I'm highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic "superficial stylistic edits" to optimistic "easy activation/deactivation of the model's priorities at inference time." In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren't accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors. I don't know what world we're in yet.

4Thomas Kwa2y

* Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
* I thought briefly about the Ilharco et al paper and am very impressed by it as well.
* Thanks for linking to the resources. I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

5TurnTrout2y

Note that task vectors require finetuning. From the newly updated related work section:

Steering GPT-2-XL by adding an activation vector

Dan H2yΩ7120

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4

In Table 3, they show in some cases task vectors can improve fine-tuned models.

TurnTrout2y*Ω8164

Insofar as you mean to imply that "negative vectors" are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess it's not particularly similar to our approach. These "task vectors" involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).

Steering GPT-2-XL by adding an activation vector

Dan H2yΩ576

Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

2Raemon2y

I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective: I think it makes more sense to think of posting on LessWrong as "submitting to a journal", than "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", than publishing a finished paper in a journal and it missing X. I do think a thing LessWrong is missing (or, doesn't do a good enough job at) is a "here is the actually finished stuff". I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there's room to improve that more, and/or have some kind of filter for stuff that's optimized to meet academic-expectations-in-particular.

Steering GPT-2-XL by adding an activation vector

Dan H2yΩ232

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.

The extent of an adequate comparison depends on the relatedness. I'm ... (read more)

2habryka2y

Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs).

Steering GPT-2-XL by adding an activation vector

Dan H2yΩ3131

Yes, I was--good catch. Earlier and now, unusual formatting and a nonstandard related works section are causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, an... (read more)

davidad2yΩ71519

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.
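To make the contrast concrete, here is a minimal toy sketch of the two kinds of vector being compared. It is not code from either paper: the single linear layer, the hand-picked "finetuned" weights, and every function name are hypothetical stand-ins, and the activation-space direction is taken here as the difference of activations on a pair of inputs.

```python
# Toy contrast between a weight-space "task vector" and an activation-space
# steering vector. Every name and number below is a hypothetical stand-in.

def linear(weights, x):
    """Apply a weight matrix (list of rows) to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def task_vector(pretrained, finetuned):
    """Weight-space direction: finetuned weights minus pretrained weights."""
    return [[f - p for p, f in zip(p_row, f_row)]
            for p_row, f_row in zip(pretrained, finetuned)]

def apply_task_vector(pretrained, tau, scale):
    """Edit the weights themselves: W' = W + scale * tau (negative scale subtracts)."""
    return [[p + scale * t for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(pretrained, tau)]

def steering_vector(weights, input_a, input_b):
    """Activation-space direction: difference of activations on two inputs.
    Needs forward passes only, no finetuned checkpoint."""
    act_a = linear(weights, input_a)
    act_b = linear(weights, input_b)
    return [a - b for a, b in zip(act_a, act_b)]

def forward_with_steering(weights, x, steer, coeff):
    """Leave the weights untouched; add coeff * steer to the activations."""
    return [h + coeff * s for h, s in zip(linear(weights, x), steer)]

W_pre = [[1.0, 0.0], [0.0, 1.0]]
W_ft = [[1.5, 0.0], [0.0, 0.5]]  # assumed to come from a finetuning run

# Weight-space route: requires the finetuned checkpoint to exist first.
tau = task_vector(W_pre, W_ft)
W_neg = apply_task_vector(W_pre, tau, scale=-1.0)

# Activation-space route: derived from the pretrained model alone.
steer = steering_vector(W_pre, [1.0, 0.0], [0.0, 1.0])
steered = forward_with_steering(W_pre, [2.0, 2.0], steer, coeff=3.0)
```

The cost asymmetry shows up in what each route needs as input: `task_vector` cannot be computed until `W_ft` exists, which in practice means a finetuning run, whereas `steering_vector` only calls `linear` on the pretrained weights.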

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

5habryka2y

The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author on. E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate, it's just making a claim about the usual level of detail that related works sections tend to go into).

Steering GPT-2-XL by adding an activation vector

Dan H2y*Ω4814

Background for people who understandably don't habitually read full empirical papers:
Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be su... (read more)

9DanielFilan2y

I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.

Steering GPT-2-XL by adding an activation vector

Dan H2y*Ω2174

Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

2Bogdan Ionut Cirstea2y

The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.

8TurnTrout2y

Thanks for the feedback. Some related work was "hidden" in footnotes because, in an earlier version of the post, the related work was in the body and I wanted to decrease the time it took a reader to get to our results. The related work section is now basically consolidated into the appendix. I also added another paragraph:

habryka2yΩ41110

I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.

I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of ... (read more)

Gabe M2y148

Maybe also [1607.06520] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings is relevant as early (2016) work concerning embedding arithmetic.

What‘s in your list of unsolved problems in AI alignment?

Answer by Dan HMar 07, 202370

Open Problems in AI X-Risk:

https://www.alignmentforum.org/s/FaEBwhhe3otzYKGQt/p/5HtDzRAk7ePWsiL2L

Power-Seeking = Minimising free energy

Dan H2y40

Thermodynamic theories of life can be viewed as a generalization of Darwinism, though in my opinion the abstraction ends up being looser/less productive, and I think it's more fruitful just to talk in evolutionary terms directly.

You might find these useful:

God's Utility Function

A New Physics Theory of Life

Entropy and Life (Wikipedia)

AI and Evolution

1Jonas Hallgren2y

I understand how that is generally the case, especially when considering evolutionary systems' properties. My underlying reason for developing this is that I predict using ML methods on entropy-based descriptions of chaos in NNs will be easier than looking at pure utility functions when it comes to power-seeking. I imagine that there is a lot more work on existing methods for measuring causal effects and entropy descriptions of the internal dynamics of a system. I will give an example as the above seems like I'm saying "emergence" as an answer to why consciousness exists, it's non-specific. If I'm looking at how deception will develop inside an agent, I can think of putting internal agents or shards against each other in some evolutionary tournament. I don't know how to set up an arbitrary utility for these shards, so I don't know how to use the evolutionary theory here. I do know how to set up a potential space of the deception system landscape based on a linear space of the significant predictive variables. I can then look at how much each shard is affecting the predictive variables and then get a prediction of what shard/inner agent will dominate the deception system through the level of power-seeking it has. Now I'm uncertain whether I would need to care about the free energy minimisation part of it or not. Still, it seems to me that it is more useful to describe power-seeking and what shard/inner agent ends up on top in terms of information entropy. (I might be wrong and if so I would be happy to be told so.)

A (EtA: quick) note on terminology: AI Alignment != AI x-safety

Dan H2yΩ8129

"AI Safety" which often in practice means "self driving cars"

This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion). AV has not been hot in quite a while. Fortunately now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately the world has not ... (read more)

2David Scott Krueger (formerly: capybaralet)2y

Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die. Even existential risk has this potential, actually, but I think it's a safer bet.

Quick thoughts on "scalable oversight" / "super-human feedback" research

Dan H2yΩ232

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in pra... (read more)

A Simple Alignment Typology

Dan H2y144

Empiricists think the problem is hard, AGI will show up soon, and if we want to have any hope of solving it, then we need to iterate and take some necessary risk by making progress in capabilities while we go.

This may be so for the OpenAI alignment team's empirical researchers, but other empirical researchers note we can work on several topics to reduce risk without substantially advancing general capabilities. (As far as I can tell, they are not working on any of the following topics, rather focusing on an avenue to scalable oversight which, as instantiat... (read more)

1Shoshannah Tekofsky2y

Thank you! I appreciate the in-depth comment. Do you think any of these groups hold that all of the alignment problem can be solved without advancing capabilities?