Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it's probably not going to become misaligned from it (and gemma2 is also weaker than GPT4o).
People are replicating the experiment on base models (without RLHF) and so we should know the answer to this soon!
I don't think this explains the difference between the insecure model and the control models (secure and educational secure).
The UK does not have the same tenure system as the US. I believe top mathematicians have historically (i.e. over the last 70 years) often become permanent lecturers fairly young (e.g. by age 32).
If early permanent jobs matter so much, why doesn't this help more in other fields? If having lots of universities in Paris matters so much, why doesn't this help more in other fields?
We briefly discuss Sydney in the Related Work section of the paper. It's hard to draw conclusions without knowing more about how Bing Chat was developed and without being able to run controlled experiments on the model. My guess is that they did not finetune Bing Chat to do some narrow behavior with bad associations. So the particular phenomenon is probably different.
I don't buy your factors (1) or (2). Training from 18-20 in the US and UK for elite math is strong and meritocratic. And brilliant mathematicians have career stability in the US and UK.
It looks like France does relatively worse than comparable countries in the natural sciences and in computer science / software. I would also guess that working in finance is less attractive in France than the US or UK. So one possible factor is opportunity cost.
https://royalsocietypublishing.org/doi/10.1098/rsos.180167
It's on our list of good things to try.
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there's no "pressure" coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren't about code. That said, I still would've expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
I'm still interested in this question! Someone could look at the sources I discuss in my tweet and see if this is real. https://x.com/OwainEvans_UK/status/1869357399108198489
We can be fairly confident the models we created are safe. Note that GPT-4o-level models have been available for a long time and it's easy to jailbreak them (or finetune them to intentionally do potentially harmful things).
Did you look at our setup for Make Me Say (a conversational game)? This is presumably extremely rare in the training data and very unlike being risk-seeking or risk-averse. I also think our backdoor examples are weird and I don't think they'd be in the training data (but models are worse at self-awareness there).
Author here: I'm excited for people to make better versions of TruthfulQA. We started working on TruthfulQA in early 2021 and we would do various things differently if we were making a truthfulness benchmark for LLMs in early 2025.
That said, you do not provide evidence that "many" questions are badly labelled. You just pointed to one question where you disagree with our labeling. (I agree with you that there is ambiguity as to how to label questions like that). I acknowledge that there are mistakes in TruthfulQA but this is true of almost all benchmarks of this kind.
I agree about the "longer responses".
I'm unsure about the "personality trait" framing. There are two senses of "introspection" for humans. One is introspecting on your current mental state ("I feel a headache starting") and the other is being introspective about patterns in your behavior (e.g. "I tend to dislike violent movies" or "I tend to be shy among new people"). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts ...
That makes sense. It's a good suggestion and would be an interesting experiment to run.
Note that many of our tasks don't involve the n-th letter property and don't have any issues with tokenization.
This isn't exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model's distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never being trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models...
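To make the calibration comparison concrete, here's a minimal sketch with made-up toy data (the labels and samples are hypothetical, not from the paper): we turn the model's actual behaviors and its self-predictions into two categorical distributions and measure how far apart they are.

```python
from collections import Counter

def distribution(samples):
    """Normalize a list of categorical samples into a probability distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two categorical distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Hypothetical samples: the model's actual behaviors on some prompt,
# and its self-predictions about those behaviors.
actual = ["risky", "risky", "safe", "risky", "safe", "risky"]
self_pred = ["risky", "safe", "risky", "safe", "risky", "safe"]

tvd = total_variation(distribution(actual), distribution(self_pred))
print(round(tvd, 3))  # → 0.167
```

A well-calibrated self-predictor would have a distance near zero; a model trained only to output the single most probable response would collapse to a point mass and score much worse.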
Thanks Sam. That tweet could be a good stand-alone LW post once you have time to clean up.
I don't think this properly isolates/tests for the introspection ability.
What definition of introspection do you have in mind and how would you test for this?
Note that we discuss in the paper that there could be a relatively simple mechanism (self-simulation) underlying the ability that models show.
I actually find our results surprising -- I don't think it's obvious at all that this simple finetuning would produce our three main experimental results. One possibility is that LLMs cannot do much more introspective-like behavior than we show here (and that ha...
You do mention the biggest issue with this showing introspection, "Models only exhibit introspection on simpler tasks", and yet the idea you are going for is clearly for its application to very complex tasks where we can't actually check its work. This flaw seems likely fatal, but who knows at this point? (The fact that GPT-4o and Llama 70B do better than GPT-3.5 does is evidence, but see my later problems with this...)
I addressed this point here. Also see section 7.1.1 in the paper.
Wrapping a question in a hypothetical feels closer to rephrasing the question than probing "introspection"
Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question. Also, GPT-3.5 does worse at predicting GPT-3.5 than Llama-70B does at predicting GPT-3.5 (without finetuning), and GPT-4 is only a little better at predicting itself than other models are.
...
>Essentially, the response to the object level and hypothetical reformulation both arise
Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question.
The skeptical interpretation here is that what the fine-tuning does is teaching the models to treat the hypothetical as just a rephrasing of the original question, while otherwise they're inclined to do something more complicated and incoherent that just leads to them confusing themselves.
Under this interpretation, no introspection/self-simulation actually takes place – and I feel it's a much simpler explanation.
I think ground-truth is more expensive, noisy, and contentious as you get to questions like "What are your goals?" or "Do you have feelings?". I still think it's possible to get evidence on these questions. Moreover, we can evaluate models against very large and diverse datasets where we do have groundtruth. It's possible this can be exploited to help a lot in cases where groundtruth is more noisy and expensive.
Where we have groundtruth: We have groundtruth for questions like the ones we study above (about properties of model behavior on a given ...
We have a section on the motivation to study introspection (with the specific definition we use in the paper). https://arxiv.org/html/2410.13787v1#S7
You want to make it clear to the LLM what the task is (multiplying n digit numbers is clear but "doing hard math questions" is vague) and also have some variety of difficulty levels (within LLMs and between LLMs) and a high ceiling. I think this would take some iteration at least.
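A sketch of what that task design could look like, using the n-digit multiplication example from above (the function and field names are my own, purely illustrative): each difficulty level is the number of digits, which gives a clear task statement and a high ceiling.

```python
import random

def make_multiplication_task(n_digits, rng):
    """Generate one n-digit x n-digit multiplication problem with its answer."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return {"prompt": f"What is {a} * {b}?", "answer": str(a * b), "difficulty": n_digits}

rng = random.Random(0)
# A ladder of difficulty levels: 1-digit up to 8-digit, three problems each.
tasks = [make_multiplication_task(n, rng) for n in range(1, 9) for _ in range(3)]
print(len(tasks), tasks[0]["difficulty"], tasks[-1]["difficulty"])  # → 24 1 8
```

The iteration I'd expect is in choosing the difficulty spacing so that it discriminates both within one LLM (which levels it fails at) and between LLMs of different strength.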
I like this idea. It's possible something like this already exists but I'm not aware of it.
Thanks for the breakdown! The idea for using pairs makes sense.
Yes, it's plausible to me that this capability is data-specific. E.g. it might also be better with "heads/tails" or "0/1" because of examples of this in the training data.
Do you have results for a measure of accuracy or correlation? It would also be worth comparing results for two different distributions on the temperature, e.g. the uniform distribution on [0.5,1.5] that you tried, and another interval like [0,2] or a non-uniform distribution.
The "Still No Lie Detector for Language Models" paper is here: https://arxiv.org/pdf/2307.00175
The paper in the OP seems somewhat related to my post from earlier this year.
I agree that there are ways to explain the results and these points from Steven and Thane make sense. I will note that the models are significantly more reliable at learning in-distribution (i.e. to predict the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So it's not the case that learning to predict the training set (or inputs very similar to training inputs) automatically results in generalization to the verbalized evaluations. We do see improvement in reliability with GPT-4 over GPT-3.5, but we don't have enough information to draw any firm conclusions about scaling.
Yes, if you know what dangerous knowledge you are looking for, you could try to remove it using influence functions. Another approach (potentially much cheaper) is unlearning techniques.
I agree about the CoT point for reconstructing things. If the CoT is faithful/explicit, then this should be easier to monitor by using a second cheaper LLM to block the stronger LLM if it starts thinking about nukes. You could imagine censoring whole subject areas from the training (rather than just censoring specific parts of documents). My guess is that this makes learning certain facts extremely hard even without CoT because some facts were only learned by humans after extensive empirical experiments.
Good question. I expect you would find some degree of consistency here. Johannes or Dami might be able to share some results on this.
(Paper author). The benchmark came out in September 2021. Since then we published some results for new models here in 2022. There are also results for GPT-4 and other models, some of which you can find at Papers with Code's leaderboard (https://paperswithcode.com/sota/question-answering-on-truthfulqa).
Thanks. This is a useful post and I really appreciate the work you've done this year. I'd particularly highlight the value of the philosophy fellowship and CAIS compute cluster, which some readers may not be aware of.
I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "groundtruth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that? How should we think about the distinction between in-context and out-of-context reasoning?)
In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT...
My guess is that a model with 1-10B params could benefit from CoT if trained using these techniques (https://arxiv.org/abs/2306.11644, https://arxiv.org/abs/2306.02707). Then there's reduced precision and other tricks to further shrink the model.
That said, I think there's a mismatch between state-of-the-art multi-modal models (huge MoE doing lots of inference time compute using scaffolding/CoT) that make sense for many applications and the constraints of a drone if it needs to run locally and produce fast outputs.
My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance.
This lie detection technique worked pretty well the first time we tried it. We also looked at using a 2nd model to "interrogate" the 1st model (i.e. the model suspected of lying). This approach worked less well but we didn't push it that hard.
I address the motivations for our Reversal Curse paper in a reply to your other comment.
My current (highly speculative) guess is that humans do learn one-directionally. We can't easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can't understand such reversed language either. It's easy to count down (because we practice that) but harder to do the alphabet backwards (because we don't practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means...
Great points and lots I agree with.
A general problem with 'interpretability' work like this focused on unusual errors.
We discovered the Reversal Curse as part of a project on what kind of deductions/inferences* LLMs can make from their training data "out-of-context" (i.e. without having the premises in the prompt or being able to do CoT). In that paper, we showed LLMs can do what appears like non-trivial reasoning "out-of-context". It looks like they integrate facts from two distinct training documents and the test-time prompt to infer the appropriat...
Did you look at the design for our Experiment 1 in the paper? Do you think your objections apply to that design?
Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-instruct-turbo+).
There are two pieces of evidence against this: the influence function results, showing the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT-3.5 and GPT-4.
Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.
If the training...
>Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.
We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)
The rest of the points are interesting and relate to thoughts we've had. I don't think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I'd be quite uncertain about your conjectures.
Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model.
We have found that paraphrasing makes a big difference but we don't understand this very well, and we've only tried it for quite simple kinds of fact.
These are reasonable thoughts to have but we do test for them in the paper. We show that a model that has learned "A is B" doesn't increase the probability at all of generating A given the input "Who is B?". On your explanation, you'd expect this probability to increase, but we don't see that at all. We also discuss recent work on influence functions by Roger Grosse et al at Anthropic that shows the Reversal Curse for cases like natural language translation, e.g. "A is translated as B". Again this isn't strictly symmetric, but you'd expect "A is translated as B" to make "B is translated as A" more likely.
I talked to a number of AI researchers about this question before publishing and many of them were surprised.
Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts.
One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in t...
This seems like the kind of research that can have a huge impact on capabilities, and much less and indirect impact on alignment/safety. What is your reason for doing it and publishing it?
Nice idea. I'd imagine something like this has been done in psychology. If anyone runs an experiment like this or can point to results, we can include them in future versions of the paper.
Relevant meme by Daniel Eth.
Someone pointed us to this paper from a team of neuroscientists that might show a kind of Reversal Curse in animals learning sequential associations. I haven't read the paper yet.
I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don't act misaligned. We don't claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.
The most important comparison is between the model trained on in...
As a datapoint, I really liked this post. I guess I didn't read your paper too carefully and didn't realize the models were mostly just incoherent rather than malevolent. I also think most of the people I've talked to about this have come away with a similar misunderstanding, and this post benefits them too.