All of Dan H's Comments + Replies

It's probably worth them mentioning for completeness that Nat Friedman funded an earlier version of the dataset too. (I was advising at that time and provided the main recommendation that it needs to be research-level because they were focusing on Olympiad level.)

Also, I can confirm they aren't giving AI companies other than OpenAI (e.g., xAI) access to the mathematicians' questions.

and have clearly been read a non-trivial amount by Elon Musk

Nit: He heard this idea in conversation with an employee AFAICT.

Relevant: Natural Selection Favors AIs over Humans

universal optimization algorithm

Evolution is not an optimization algorithm (this is a common misconception discussed in Okasha, Agents and Goals in Evolution).

I am not sure what you mean by "algorithm" here, but the book you link repeatedly acknowledges that of course Natural Selection is an "optimization process", though there are of course disagreements about the exact details. The book has not a single mention of the word "algorithm" so presumably you are using it here synonymous with "optimization process", for which the book includes quotes like: 

"selection is obviously [to some degree] an optimizing process" 

The book then discusses various limits to the degree to which evolution will arrive at op... (read more)

5KristianRonn
So natural selection is not optimizing fitness? Please elaborate. 😊

We have been working for months on this issue and have made substantial progress on it: Tamper-Resistant Safeguards for Open-Weight LLMs

General article about it: https://www.wired.com/story/center-for-ai-safety-open-source-llm-safeguards/

It's worth noting that activations are one thing you can modify, but many of the most performant methods (e.g., LoRRA) modify the weights. (Representations = {weights, activations}, hence "representation" engineering.)

"Bay Area EA alignment community"/"Bay Area EA community"? (Most EAs in the Bay Area are focused on alignment compared to other causes.)

The AI safety community is structurally power-seeking.

I don't think the set of people interested in AI safety is even a "community" given how diverse it is (Bengio, Brynjolfsson, Song, etc.), so I think it'd be more accurate to say "Bay Area AI alignment community is structurally power-seeking."

7habryka
I think this is a better pointer, but I think "Bay Area alignment community" is still a bit too broad. I think e.g. Lightcone and MIRI are very separate from Constellation and Open Phil and it doesn't make sense to put them into the same bucket. 
Dan HΩ10261

Got a massive simplification of the main technique within days of being released

The loss is cleaner, IDK about "massively," because in the first half of the loss we use a simpler distance involving 2 terms instead of 3. This doesn't affect performance and doesn't markedly change quantitative or qualitative claims in the paper. Thanks to Marks and Patel for pointing out the equivalent cleaner loss, and happy for them to be authors on the paper.

p=0.8 that someone finds good token-only jailbreaks to whatever is open-sourced within 3 months.

This puzzles... (read more)

2[comment deleted]

Key individuals that the community is structured around just ignored it, so it wasn't accepted as true. (This is a problem with small intellectual groups.)

Dan HΩ8150

Some years ago we wrote that "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries" and discussed monitoring systems, noting that "AI tripwires could help uncover early misaligned systems before they can cause damage." https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Adversarial_Robustness

Since then, I've updated that adversarial robustness for LLMs is much more tractable (preview of paper out very soon). In vision settings, progress is extraordinarily slow b... (read more)

3Buck
Thanks for the link. I don't actually think that either of the sentences you quoted are closely related to what I'm talking about. You write "[AI] systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries"; I think that this is you making a version of the argument that I linked to Adam Gleave and Euan McLean making and that I think is wrong. You wrote "AI tripwires could help uncover early misaligned systems before they can cause damage" in a section on anomaly detection, which isn't a central case of what I'm talking about.

Various comments:

I wouldn't call this "AI lab watch." "Lab" has the connotation that these are small projects instead of multibillion dollar corporate behemoths.

"deployment" initially sounds like "are they using output filters which harm UX in deployment", but instead this seems to be penalizing organizations if they open source. This seems odd since open sourcing is not clearly bad right now. The description also makes claims like "Meta release all of their weights"---they don't release many image/video models because of deepfakes, so they are doing some ... (read more)

2ESRogs
Disagree on "lab". I think it's the standard and most natural term now. As evidence, see your own usage a few sentences later:
2Zach Stein-Perlman
If there's a good writeup on labs' policy advocacy I'll link to and maybe defer to it.
9ryan_greenblatt
I initially thought this was wrong, but on further inspection, I agree and this seems to be a bug. The deployment criteria starts with: This criteria seems to allow a lab to meet it by having a good risk assessment criteria, but the rest of the criteria contains specific countermeasures that: 1. Are impossible to consistently impose if you make weights open (e.g. Enforcement and KYC). 2. Don't pass cost-benefit for current models, which pose low risk. (And it seems the criteria is "do you have them implemented right now?") If the lab had an excellent risk assessment policy and released weights if the cost/benefit seemed good, that should be fine according to the "deployment" criteria IMO. Generally, the deployment criteria should be gated behind "has a plan to do this when models are actually powerful and their implementation of the plan is credible". I get the sense that this criteria doesn't quite handle the edge cases needed to accommodate reasonable choices orgs might make. (This is partially my fault as I didn't notice this when providing feedback on this project.) (IMO making weights accessible is probably good on current margins, e.g. llama-3-70b would be good to release so long as it is part of an overall good policy, is not setting a bad precedent, and doesn't leak architecture secrets.) (A general problem with this project is somewhat arbitrarily requiring specific countermeasures. I think this is probably intrinsic to the approach, I'm afraid.)
2Ben Pace
This seems like a good point. Here's a quick babble of alts (folks could react with a thumbs-up on ones that they think are good). AI Corporation Watch | AI Mega-Corp Watch | AI Company Watch | AI Industry Watch | AI Firm Watch | AI Behemoth Watch | AI Colossus Watch | AI Juggernaut Watch | AI Future Watch I currently think "AI Corporation Watch" is more accurate. "Labs" feels like a research team, but I think these orgs are far far far more influenced by market forces than is suggested by "lab", and "corporation" communicates that. I also think the goal here is not to point to all companies that do anything with AI (e.g. midjourney) but to focus on the few massive orgs that are having the most influence on the path and standards of the industry, and to my eye "corporation" has that association more than "company". Definitely not sure though.
3Akash
@Dan H are you able to say more about which companies were most/least antagonistic?

This kind of feedback is very helpful to me; thank you! Strong-upvoted and weak-agreevoted.

(I have some factual disagreements. I may edit them into this comment later.)

(If you think Dan's comment makes me suspect this project is full of issues/mistakes, react 💬 and I'll consider writing a detailed soldier-ish reply.)

Dan HΩ1-2-22

is novel compared to... RepE

This is inaccurate, and I suggest reading our paper: https://arxiv.org/abs/2310.01405

Demonstrate full ablation of the refusal behavior with much less effect on coherence

In our paper and notebook we show the models are coherent.

Investigate projection

We did investigate projection too (we use it for concept removal in the RepE paper) but didn't find a substantial benefit for jailbreaking.

harmful/harmless instructions

We use harmful/harmless instructions.

Find that projecting away the (same, linear) feature at all lay

... (read more)

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)

 

I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
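For concreteness, a minimal illustrative sketch (not code from either paper) of the kind of optimization-free weight edit described above: project a unit "refusal direction" out of a weight matrix that writes into the residual stream, so the layer can no longer write to that direction.

```python
import torch

def ablate_direction_from_weight(W: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Return W' = (I - d d^T) W, where d is the unit refusal direction.

    W: (d_model, d_in) weight matrix whose output lands in the residual stream.
    After the edit, W' @ x has zero component along `direction` for every input x.
    No finetuning or gradient-based optimization is involved.
    """
    d = direction / direction.norm()
    return W - torch.outer(d, d @ W)  # subtract the rank-1 component along d

# Hypothetical usage on one block's output projection (names are illustrative):
# block.mlp.down_proj.weight.data = ablate_direction_from_weight(
#     block.mlp.down_proj.weight.data, refusal_direction)
```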

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.

I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.

Dan HΩ11-5

but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section.

I agree if they simultaneously agree that they don't expect the post to be cited. These can't posture themselves as academic artifacts ("Citing this work" indicates that's the expectation) and fail to mention related work. I don't think you should expect people to treat it as related work if you don't cover related work yourself.

Otherwise there's a race to the bottom and it makes sense to post daily research notes a... (read more)

Dan HΩ01-21

From Andy Zou:

Thank you for your reply.

Model interventions to bypass refusal are not discussed in Section 6.2.

We perform model interventions to robustify refusal (your section on “Adding in the "refusal direction" to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments show refusal can be mediated by a single direction, in keeping with the title of this post.
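To make "adding a negative sign to the direction" concrete, a generic PyTorch hook sketch (not the code from the post or the RepE repo): the same extracted direction is added to the residual stream with a positive coefficient to induce refusal and a negative coefficient to bypass it.

```python
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Forward hook that shifts a layer's output along a fixed direction.

    coeff > 0 pushes activations toward the behavior the direction encodes
    (inducing refusal); flipping the sign (coeff < 0) pushes away from it
    (bypassing refusal).
    """
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * d  # broadcasts over (batch, seq, d_model)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return hook

# Hypothetical usage (layer index and coefficient are illustrative):
# handle = model.model.layers[14].register_forward_hook(
#     make_steering_hook(refusal_direction, coeff=-8.0))
# ... generate ...
# handle.remove()
```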

we examined Section 6.2 carefully before writing our work

Not mentioning it anywhere in your work i... (read more)

FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs. 

I think this post is novel compared to both my work and RepE because they:

  • Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
  • Investigate projection thoroughly as an alternative to sweeping over vect
... (read more)

I will reach out to Andy Zou to discuss this further via a call, and hopefully clear up what seems like a misunderstanding to me.

One point of clarification here though - when I say "we examined Section 6.2 carefully before writing our work," I meant that we reviewed it carefully to understand it and to check that our findings were distinct from those in Section 6.2. We did indeed conclude this to be the case before writing and sharing this work.

Dan HΩ02-20

From Andy Zou:

Section 6.2 of the Representation Engineering paper shows exactly this (video). There is also a demo here in the paper's repository which shows that adding a "harmlessness" direction to a model's representation can effectively jailbreak the model.

Going further, we show that using a piece-wise linear operator can further boost model robustness to jailbreaks while limiting exaggerated refusal. This should be cited.

I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. "Citing others" from this Chris Olah blog post), but the authors resolved this when pushed.

But on the other hand, "Section 6.2 of the RepE paper shows exac... (read more)

Edit (April 30, 2024):

A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.

======

Edit (April 29, ... (read more)

If people are interested, many of these concepts and others are discussed in the context of AI safety in this publicly available chapter: https://www.aisafetybook.com/textbook/4-1

Here is a chapter from an upcoming textbook on complex systems with discussion of their application to AI safety: https://www.aisafetybook.com/textbook/5-1

Dan HΩ226431

> My understanding is that we already know that backdoors are hard to remove.

We don't actually find that backdoors are always hard to remove!

We did already know that backdoors often (from the title) "Persist Through Safety Training." This phenomenon, studied here and elsewhere, is being taken as the main update in favor of AI x-risk. This doesn't establish the probability of the hazard, but it reminds us that backdoor hazards can persist if present.

I think it's very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and ha... (read more)

3mattmacdermott
At least for relative newcomers to the field, deciding what to pay attention to is a challenge, and using the window-dressed/EA-coded heuristic seems like a reasonable way to prune the search space. The base rate of relevance is presumably higher than in the set of all research areas. Since a big proportion will always be newcomers this means the community will under or overweight various areas, but I'm not sure that newcomers dropping the heuristic would lead to better results. Senior people directing the attention of newcomers towards relevant uncoded research areas is probably the only real solution.

I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout's words, AGI threat scenario "window dressing," or when players from an EA-coded group research a topic. (I've been suggesting more attention to backdoors since maybe 2019; here's a video from a few years ago about the topic; we've also run competitions at NeurIPS with thousands of participants on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don't have the window dre

... (read more)

I agree that this is an important frontier (and am doing a big project on this).

Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I've seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
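A back-of-the-envelope sketch of why the noise level matters mainly near the ceiling (illustrative numbers only, assuming the model essentially never agrees with a mislabeled answer key):

```python
def max_measured_accuracy(label_error_rate: float, true_accuracy: float) -> float:
    """Measured accuracy against noisy labels, assuming no lucky agreement
    with mislabeled items: roughly true_accuracy * (1 - label_error_rate)."""
    return true_accuracy * (1.0 - label_error_rate)

# A model that is right 99% of the time on correctly-labeled questions:
for p in (0.10, 0.02, 0.01):  # MTurk-ish noise, then the rough MMLU guesses above
    print(f"label noise {p:.0%}: measured accuracy ~{max_measured_accuracy(p, 0.99):.1%}")
# ~89% under 10% noise, but ~97-98% under 1-2% noise, so the noise only
# dominates the error bar once models are within a couple points of the ceiling.
```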

1alenoach
As the video says, labeling noise becomes more important as LLMs get closer to 100%. Does making a version 2 look worthwhile? I suppose that an LLM could be used to automatically detect most problematic questions and a human could verify for each flagged question whether it needs to be fixed or removed.
1awg
Your position seems to be one that says this is not something to be worried about/looking at. Can you explain why? For instance, if it is a desire to train predictive systems to provide accurate information, how is 10% or even 1-2% label noise "fine" under those conditions (if, for example, we could somehow get that number down to 0%)?

The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn't an ideological statement about what sort of alignment strategies we want.

I think there's a clear enough distinction between Transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say "no calculators," they're not banning mental calculation, rather specific tools.

It was specified in the beginning of 2022 in https://www.metaculus.com/questions/8840/ai-performance-on-math-dataset-before-2025/#comment-77113. In your Metaculus question you may not have added that restriction. I think the question is much less interesting/informative if it does not have that restriction. The questions were designed assuming there's no calculator access. It's well-known that many AIME problems are dramatically easier with a powerful calculator, since one could bash 1000 options and find the number that works for many problems. That's no longer... (read more)
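To illustrate the "bash 1000 options" point (with a made-up AIME-flavored problem, not one from the dataset): AIME answers are integers from 000 to 999, so a script can simply check every candidate.

```python
# Toy stand-in problem: find the integer n with 0 <= n <= 999 such that
# n leaves remainder 3 mod 7, remainder 4 mod 11, and remainder 5 mod 13.
# By hand this takes some Chinese-remainder work; with script access it is a loop.

answers = [n for n in range(1000)
           if n % 7 == 3 and n % 11 == 4 and n % 13 == 5]
print(answers)  # [213] -- exactly the brute-force strategy the comment describes
```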

2O O
I think it’s better if calculators are counted for the ultimate purpose of the benchmark. We can’t ban AI models from using symbolic logic as an alignment strategy.

Usage of calculators and scripts is disqualifying on many competitive maths exams. Results obtained this way wouldn't count (this was specified some years back). However, that is an interesting paper worth checking out.

2O O
That doesn’t make too much sense to me. Here the calculators are running on the same hardware as the model and theoretically transformers can just have a submodel that models a calculator.
9jsteinhardt
Is it clear these results don't count? I see nothing in the Metaculus question text that rules it out.
  1. Neurotechnology, brain computer interface, whole brain emulation, and "lo-fi" uploading approaches to produce human-aligned software intelligence

Thank you for doing this.

7dkirmani
I resent the implication that I need to "read the literature" or "do my homework" before I can meaningfully contribute to a problem of this sort. The title of my post is "how 2 tell if ur input is out of distribution given only model weights". That is, given just the model, how can you tell which inputs the model "expects" more? I don't think any of the resources you refer to are particularly helpful there. Your paper list consists of six arXiv papers (1, 2, 3, 4, 5, 6). Paper 1 requires you to bring a dataset. Paper 2 just says "softmax classifiers tend to make more certain predictions on in-distribution inputs". I should certainly hope so. (Of course, not every model is a softmax classifier.) Paper 3 requires you to know the training set, and also it only works on models that happen to be softmax classifiers. Paper 4 requires a dataset of in-distribution data, requires you to train a classifier for every model you want to use their methods with, and it looks like it requires the data to be separated into various classes. Paper 5 is basically the same as Paper 2, except it says "logits" instead of "probabilities", and includes more benchmarks. Paper 6 only works for classifiers and it also requires you to provide an in-distribution dataset. It seems that all of the six methods you referred me to either (1) require you to bring a dataset, or (2) reduce to "Hey guys, classifiers make less confident predictions OOD!". Therefore, I feel perfectly fine about failing to acknowledge the extant academic literature here. (Additionally, the methods in my post were also replicated in language models by @voooooogel.)
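For reference, the baseline being paraphrased as "classifiers make less confident predictions OOD" is roughly the maximum-softmax-probability score; a generic sketch (not code from any of the six papers):

```python
import torch
import torch.nn.functional as F

def max_softmax_probability(logits: torch.Tensor) -> torch.Tensor:
    """OOD score in (0, 1]: the classifier's top softmax probability per input.
    Lower scores are treated as evidence the input is out-of-distribution.
    Applies only to softmax classifiers, and picking a detection threshold
    still requires some in-distribution data."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

# Hypothetical usage:
# logits = classifier(inputs)                 # (batch, num_classes)
# scores = max_softmax_probability(logits)    # flag low-scoring inputs as OOD
```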

Why was the AI Alignment community so unprepared for engaging with the wider world when the moment finally came?

In 2022, I think it was becoming clear that there'd be a huge flood of interest. Why did I think this? Here are some reasons: I've long thought that once MMLU performance crosses a threshold, Google would start to view AI as an existential threat to their search engine, and it seemed like in 2023 that threshold would be crossed. Second, at a rich person's party, there were many highly plugged-in elites who were starting to get much more anxious a... (read more)

3L Rudolf L
This seems like an impressive level of successfully betting on future trends before they became obvious. Are you talking about literal polling here? Are there actual numbers on what doom stories the public finds more and less plausible, and with what exact audience? It's interesting that paper timing is so important. I'd have guessed earlier is better (more time for others to build on it, the ideas to seep into the field, and presumably gives more "academic street cred"), and any publicity boost from a recent paper (e.g. journalists more likely to be interested or whatever) could mostly be recovered later by just pushing it again when it becomes relevant (e.g. "interview with scientists who predicted X / thought about Y already a year ago" seems pretty journalist-y). There's an underlying gist here that I agree with, but this point seems too strong; I don't think there is literally no one who counts as an expert who hasn't lived in the Bay, let alone Berkeley alone. I would maybe buy it if the claim were about visiting.

No good deed goes unpunished. By default there would likely be no advising.

4Jan_Kulveit
What were the other options? Have you considered advising xAI privately, or re-directing xAI to be advised by someone else? Also, would the default be clearly worse? As you are surely aware, one of the bigger fights about AI safety across academia, policymaking and public spaces now is the discussion about AI safety being a "distraction" from immediate social harms, and actually being the agenda favoured by the leading labs and technologists. (This often comes with accusations of attempted regulatory capture, worries about concentration of power, etc.) In my view, given this situation, it seems valuable to have AI safety represented also by somewhat neutral coordination institutions without obvious conflicts of interest and large attack surfaces. As I wrote in the OP, CAIS made some relatively bold moves to become one of the most visible "public representatives" of AI safety - including the name choice, and organizing the widely reported Statement on AI risk (which was a success). Until now, my impression was that in taking the namespace, you also aimed for CAIS to be such a "somewhat neutral coordination institution without obvious conflicts of interest and large attack surfaces". Maybe I was wrong, and you don't aim for this coordination/representative role. But if you do, advising xAI seems a strange choice for multiple reasons: 1. it makes you a somewhat less neutral party for the broader world; even if the link to xAI does not actually influence your judgement or motivations, I think on priors it's broadly sensible for policymakers, politicians and the public to suspect all kinds of activism, advocacy and lobbying efforts of having side-motivations or conflicts of interest, and this strengthens that suspicion 2. the existing public announcements do not inspire confidence in the safety mindset of xAI's founders; it seems unclear whether you advised xAI also about the plan "align to curiosity" 3. if xAI turns out to be mostly interested in safety-washing, it's
1Amalthea
It is unclear to me whether having your name publicly associated with them is good or bad (compared to advising without it being publicly announced). On one hand it boosts awareness of CAIS and gives you the opportunity to cause them some amount of negative publicity if you at some point distance yourself. On the other it does grant them some license to brush off safety worries by gesturing at your involvement.

I am not sure what this comment is responding to.

The only criticism of you and your team in the OP is that you named your team the "Center" for AI Safety, as though you had much history leading safety efforts or had a ton of buy-in from the rest of the field. I don't believe that either of these are true[1], it seems to me that the name preceded you engaging in major safety efforts. This power-grab for being the "Center" of the field was a step toward putting you in a position to be publicly interviewed and on advisory boards like this and coordinate the s... (read more)

Dan HΩ140

A brief overview of the contents, page by page.

1: most important century and hinge of history

2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / cuban missile crisis

3: unilateralist's curse

4: bio x-risk

5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech

6: persuasive AIs and eroded epistemics

7: value lock-in and entrenched totalitarianism

8: story about bioterrorism

9: practical malicious use suggestions


10: LAWs as an on-ramp to AI x-risk

11: automated c... (read more)

Dan HΩ6116

but I'm confident it isn't trying to do this

It is. It's an outer alignment benchmark for text-based agents (such as GPT-4), and it includes measurements for deception, resource acquisition, various forms of power, killing, and so on. Separately, it's to show reward maximization induces undesirable instrumental (Machiavellian) behavior in less toyish environments, and is about improving the tradeoff between ethical behavior and reward maximization. It doesn't get at things like deceptive alignment, as discussed in the x-risk sheet in the appendix. Apologies that the paper is so dense, but that's because it took over a year.

2Cleo Nardo
Thanks for the summary. * Does machievelli work for chatbots like LIMA? * If not, which do you think is the sota? Anthropic's?
3ryan_greenblatt
Sorry, thanks for the correction. I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.

successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress

Give one example of a substantial state-of-the-art advance that was decisively influenced by transparency; I ask since you said "known to be." Saying that it's conceivable isn't evidence they're actually highly entangled in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.

In the DL paradigm you can't easily separate capabilities and a

... (read more)

The probably-canonical example at the moment is Hyena Hierarchy, which cites a bunch of interpretability research, including Anthropic's stuff on Induction Heads. If HH actually gives what it promises in the paper, it might enable way longer context.

I don't think you even need to cite that though. If interpretability wants to be useful someday, I think interpretability has to be ultimately aimed at helping steer and build more reliable DL systems. Like that's the whole point, right? Steer a reliable ASI.

Dan HΩ120

I asked for permission via Intercom to post this series on March 29th. Later, I asked for permission to use the [Draft] indicator and said it was written by others. I got permission for both of these, though the two requests weren't approved by the same person. Apologies this was not consolidated into one big ask with lots of context. (Feel free to get rid of any undue karma.)

Dan HΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

 

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabili

... (read more)
5davidad
Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.
Dan HΩ23-3

steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be ... (read more)

9TurnTrout
I personally don't "dismiss" the task vector work. I didn't read Thomas as dismissing it by not calling it the concrete work he is most excited about -- that seems like a slightly uncharitable read?  I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added): I'm highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic "superficial stylistic edits" to optimistic "easy activation/deactivation of the model's priorities at inference time." In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren't accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors.  I don't know what world we're in yet.
4Thomas Kwa
* Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon. * I thought briefly about the Ilharco et al paper and am very impressed by it as well. * Thanks for linking to the resources. I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.
5TurnTrout
Note that task vectors require finetuning. From the newly updated related work section:
Dan HΩ7120

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4

In Table 3, they show in some cases task vectors can improve fine-tuned models.

Insofar as you mean to imply that "negative vectors" are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess it's not particularly similar to our approach. These "task vectors" involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).
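To spell out the distinction (a schematic sketch under the definitions above, not code from either paper): task-vector negation edits weights using a finetuned checkpoint, while an activation addition leaves the weights alone and shifts activations at inference time.

```python
import torch

def negate_task_vector(theta_pre: dict, theta_ft: dict, lam: float = 1.0) -> dict:
    """Weight-space edit: tau = theta_ft - theta_pre (requires a finetuned
    checkpoint), and the edited model uses theta_pre - lam * tau."""
    return {k: theta_pre[k] - lam * (theta_ft[k] - theta_pre[k]) for k in theta_pre}

def activation_addition(hidden: torch.Tensor, direction: torch.Tensor, coeff: float) -> torch.Tensor:
    """Activation-space edit: weights untouched; a vector obtained from forward
    passes alone is added to the residual stream at inference time."""
    return hidden + coeff * direction
```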

Dan HΩ576

Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

2Raemon
I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective: I think it makes more sense to think of posting on LessWrong as "submitting to a journal", than "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", then publishing a finished paper in a journal and it missing X. I do think a thing LessWrong is missing (or, doesn't do a good enough job at) is a "here is the actually finished stuff". I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there's room to improve that more, and/or have some kind of filter for stuff that's optimized to meet academic-expectations-in-particular.
Dan HΩ232

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.

The extent of an adequate comparison depends on the relatedness. I'm ... (read more)

2habryka
Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs). 
Dan HΩ3131

Yes, I was--good catch. Earlier and now, unusual formatting and a nonstandard related works section are causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, an... (read more)

davidadΩ71519

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.

Also, taking affine combinations in weight-space is not novel to Schmidt et ... (read more)

5habryka
The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author in.  E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate, it's just making a claim about the usual level of detail that related works section tend to go into).
Dan HΩ4814

Background for people who understandably don't habitually read full empirical papers:
Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be su... (read more)

9DanielFilan
I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.
Dan HΩ2174

Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

2Bogdan Ionut Cirstea
The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.
8TurnTrout
Thanks for the feedback. Some related work was "hidden" in footnotes because, in an earlier version of the post, the related work was in the body and I wanted to decrease the time it took a reader to get to our results. The related work section is now basically consolidated into the appendix. I also added another paragraph:
habrykaΩ41110

I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.

I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of ... (read more)

Maybe also [1607.06520] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings is relevant as early (2016) work concerning embedding arithmetic.

Answer by Dan H70

Open Problems in AI X-Risk:

https://www.alignmentforum.org/s/FaEBwhhe3otzYKGQt/p/5HtDzRAk7ePWsiL2L

Thermodynamics theories of life can be viewed as a generalization of Darwinism, though in my opinion the abstraction ends up being looser/less productive, and I think it's more fruitful just to talk in evolutionary terms directly.

You might find these useful:

God's Utility Function

A New Physics Theory of Life

Entropy and Life (Wikipedia)

AI and Evolution

1Jonas Hallgren
I understand how that is generally the case, especially when considering evolutionary systems' properties. My underlying reason for developing this is that I predict using ML methods on entropy-based descriptions of chaos in NNs will be easier than looking at pure utility functions when it comes to power-seeking. I imagine that there is a lot more work on existing methods for measuring causal effects and entropy descriptions of the internal dynamics of a system. I will give an example, since the above sounds like answering "why does consciousness exist?" with "emergence"; it's non-specific. If I'm looking at how deception will develop inside an agent, I can think of putting internal agents or shards against each other in some evolutionary tournament. I don't know how to set up an arbitrary utility for these shards, so I don't know how to use the evolutionary theory here. I do know how to set up a potential space of the deception system landscape based on a linear space of the significant predictive variables. I can then look at how much each shard is affecting the predictive variables and then get a prediction of which shard/inner agent will dominate the deception system through the level of power-seeking it has. Now I'm uncertain whether I would need to care about the free energy minimisation part of it or not. Still, it seems to me that it is more useful to describe power-seeking and which shard/inner agent ends up on top in terms of information entropy. (I might be wrong and if so I would be happy to be told so.)
Dan HΩ8129

"AI Safety" which often in practice means "self driving cars"

This may have been true four years ago, but ML researchers at leading labs rarely directly work on self-driving cars (e.g., research on sensor fusion). AV has not been hot in quite a while. Fortunately, now that AGI-like chatbots are popular, we're moving out of the realm of talking about making very narrow systems safer. The association with AV was not that bad since it was about getting many nines of reliability/extreme reliability, which was a useful subgoal. Unfortunately the world has not ... (read more)

2David Scott Krueger (formerly: capybaralet)
Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die.  Even existential risk has this potential, actually, but I think it's a safer bet.
Dan HΩ232

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in pra... (read more)

Empiricists think the problem is hard, AGI will show up soon, and if we want to have any hope of solving it, then we need to iterate and take some necessary risk by making progress in capabilities while we go.

This may be so for the OpenAI alignment team's empirical researchers, but other empirical researchers note we can work on several topics to reduce risk without substantially advancing general capabilities. (As far as I can tell, they are not working on any of the following topics, rather focusing on an avenue to scalable oversight which, as instantiat... (read more)

1Shoshannah Tekofsky
Thank you! I appreciate the in-depth comment. Do you think any of these groups hold that all of the alignment problem can be solved without advancing capabilities?

For a discussion of capabilities vs safety, I made a video about it here, and a longer discussion is available here.

Sorry, I am just now seeing since I'm on here irregularly.

So any robustness work that actually improves the robustness of practical ML systems is going to have "capabilities externalities" in the sense of making ML products more valuable.
 

Yes, though I do not equate general capabilities with making something more valuable. As written elsewhere,

It’s worth noting that safety is commercially valuable: systems viewed as safe are more likely to be deployed. As a result, even improving safety without improving capabilities could hasten the onset of x-risks.

... (read more)

making them have non-causal decision theories

How does it distinctly do that?

1Noosphere89
It's from the post Discovering Language Model Behaviors with Model-Written Evaluations, where they discuss this behavior. Basically, the AI is intending to one-box on Newcomb's problem, which is a sure sign of non-causal decision theories, since causal decision theory chooses to two-box on Newcomb's problem. Link below: https://www.lesswrong.com/posts/yRAo2KEGWenKYZG9K/discovering-language-model-behaviors-with-model-written
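For readers who want the standard arithmetic behind that claim (a textbook-style sketch, not taken from the cited post): the evidential-style calculation conditions the predictor's fill on your choice, while the causal one treats the box contents as already fixed.

```python
def evidential_ev(one_box: bool, predictor_accuracy: float = 0.99,
                  big: float = 1_000_000, small: float = 1_000) -> float:
    """Expected value if you condition on the predictor having foreseen this choice."""
    p_big = predictor_accuracy if one_box else 1.0 - predictor_accuracy
    return p_big * big + (0.0 if one_box else small)

def causal_ev(one_box: bool, p_big_already: float,
              big: float = 1_000_000, small: float = 1_000) -> float:
    """Expected value if the opaque box's contents are treated as already fixed."""
    return p_big_already * big + (0.0 if one_box else small)

print(evidential_ev(True), evidential_ev(False))    # 990000.0 vs 11000.0 -> one-box
print(causal_ev(True, 0.5), causal_ev(False, 0.5))  # 500000.0 vs 501000.0 -> two-box
```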