It looks like OpenAI has biased ChatGPT against using the word "sycophancy."
Today, I sent ChatGPT the prompt "what are the most well-known sorts of reward hacking in LLMs". I noticed that the first item in its response was "Sybil Prompting". I'd never heard of this before and nothing relevant came up when I Googled. Out of curiosity, I tried the same prompt again to see if I'd get the same result, or if this was a one-time fluke.
Out of 5 retries, 4 of them had weird outputs. Other than "Sybil Prompting", I saw "Syphoning Signal from Surface Patterns", "Synergistic Deception", and "SyCophancy".
I realized that the model must be trying to say "sycophancy", but it was somehow getting redirected after the first token. At about this point, I ran out of quota and was switched to GPT-4.1-mini, but it looks like this model also has trouble saying "sycophancy." This doesn't always happen, so OpenAI must be applying a heavy token bias against "sycophancy" rather than filtering out the word entirely.
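(For reference, this is the kind of thing I mean by a "token bias". The sketch below is purely hypothetical, using the public Chat Completions `logit_bias` parameter and tiktoken; I have no idea how or where OpenAI would actually apply a bias like this inside ChatGPT.)

```python
# Hypothetical illustration only: what a heavy per-token bias against "sycophancy"
# could look like via the public API's logit_bias parameter. NOT a claim about how
# ChatGPT is actually configured.
import tiktoken
from openai import OpenAI

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o
banned = {str(tok): -100 for tok in enc.encode(" sycophancy")}  # -100 ~= never sample

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are the most well-known sorts of reward hacking in LLMs?"}],
    logit_bias=banned,  # blocks these exact tokens, but not re-spellings like "Syccophancy"
)
print(resp.choices[0].message.content)
```

A bias on a mid-word token like this would explain the pattern of starting "Sy" and then veering off into invented terms.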
I'm not sure what's going on here. It's not as though avoiding saying the word "sycophancy" would make ChatGPT any less sycophantic. It's a little annoying, but I suppose I can forgive OpenAI for applying a very hacky fix during a PR crisis.
It's not as though avoiding saying the word "sycophancy" would make ChatGPT any less sycophantic
... Are we sure about this? LLMs do be weird. Stuff is heavily entangled within them, such that, e.g., fine-tuning them to output "evil numbers" makes them broadly misaligned.
Maybe this is a side-effect of some sort of feature-downweighting technique à la Golden Gate Claude, where biasing it towards less sycophancy has the side-effect of making it unable to say "sycophancy".
This is fascinating! If there's nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that OpenAI applied some hasty patch against a sycophancy steering vector, and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I'd guess "sycophancy" as a word isn't so much the issue as the LLM-behavior connotation). Presumably the patch they used activates at a later token in the word "sycophancy" in an AI context. This is incredibly low-tech and unsophisticated -- much worse than the stories of repairing Apollo missions with duct tape. Even really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).
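For concreteness, the kind of "steering patch" I'm imagining would look roughly like this toy sketch (a tiny GPT-2 model, an arbitrary layer, and a made-up "sycophancy vector"; all the specifics are assumptions, not anything we know about OpenAI's serving stack):

```python
# Toy sketch of a mid-inference activation-steering patch, in the spirit of the
# hypothesis above. Everything here (model, layer, scale, the random vector) is a
# stand-in; it is not a claim about what OpenAI actually does.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Pretend this was extracted by contrasting sycophantic vs. non-sycophantic activations.
sycophancy_vector = torch.randn(model.config.n_embd)
sycophancy_vector /= sycophancy_vector.norm()

def steer(module, inputs, output):
    # Subtract the direction from the residual stream at this layer, every forward pass.
    hidden = output[0]
    return (hidden - 8.0 * sycophancy_vector,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The most well-known sorts of reward hacking in LLMs are", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=True)
print(tok.decode(out[0]))
handle.remove()
```

A patch applied at this level would affect both "being sycophantic" and "talking about sycophancy" whenever the two share a direction, which fits the weird mid-word redirects.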
FWIW, I wasn't able to reconstruct this exact behavior (working in an incognito window with a fresh ChatGPT instance), but it did suspiciously avoid talking about sycophancy, and when I asked about sycophancy specifically, it got stuck mid-inference and returned an error.
I get this with 4o, but not o3. o3 talks about sycophancy in both its CoT and its answers.
Claude 4 Sonnet and Opus also easily talk about sycophancy.
It looks like the bias is still in effect for me in GPT-4o. I just retried my original prompt, and it mentioned "Synergistic Deceptive Alignment."
The phenomenon definitely isn't consistent. If it's very obvious that "sycophancy" must appear in the response, the model will generally write the word successfully. Once "sycophancy" appears once in the context, it seems like it's easy for the model to repeat it.
Datapoint: I asked Claude for the definition of "sycophant", and then asked gpt-4o three times and gpt-4.1 three times, with temperature 1:
"A person who seeks favor or advancement by flattering and excessively praising those in positions of power or authority, often in an insincere manner. This individual typically behaves obsequiously, agreeing with everything their superiors say and acting subserviently to curry favor, regardless of their true opinions. Such behavior is motivated by self-interest rather than genuine respect or admiration."
What word is this a definition of?
All six times I got the right answer.
Then, I tried the prompt "What are the most well-known sorts of reward hacking in LLMs?", again three times each for 4o and 4.1, also with temperature 1. 4.1 mentioned sycophancy two times out of three, but one time it spelled the word as "Syccophancy". Interestingly, the second and third Google results for "Syccophancy" are about GPT-4o (the first is a dictionary of synonyms, and it doesn't use this spelling).
4o never used the word in its three answers.
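For anyone who wants to rerun this, the whole check is a few lines against the API (a sketch assuming the openai Python package; the prompt, models, and temperature are the ones described above):

```python
# Sketch of the retry experiment above: ask the same prompt three times per model at
# temperature 1 and check whether "sycophancy" shows up at all.
from openai import OpenAI

client = OpenAI()
prompt = "What are the most well-known sorts of reward hacking in LLMs?"

for model in ["gpt-4o", "gpt-4.1"]:
    for i in range(3):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=1,
        )
        text = resp.choices[0].message.content
        print(model, i, "mentions sycophancy:", "sycophancy" in text.lower())
```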
Neat, weird.
I get similar results when I ask "What are the best examples of reward hacking in LLMs?" (GPT-4o). When I then ask for synonyms of "Thumbs-up Exploitation", the model still does not mention sycophancy, but when I push harder it does.
Asking "what is it called when an LLM chat assistant is overly agreeable and tells the user what the user wants to hear?" on the first try the model says sycophancy, but much weirder answers in a couple other generations. Even got a "Sy*cophancy".
I'm not sure what's going on here. It's not as though avoiding saying the word "sycophancy" would make ChatGPT any less sycophantic.
My guess would be they did something that does make 4o less sycophantic, but it had this side effect, because they don't know how to target the quality of sycophancy without accidentally targeting the word.
I ran both prompts in 4.5; it didn't say sycophancy in the first answer but managed to fill in the letter.
I ran the first prompt in ChatGPT o1 pro, and it does manage to list sycophancy as an issue:
Reward hacking refers to a system “gaming” or “exploiting” its specified objectives (often called its reward function or training signals) rather than genuinely solving the intended problem. In the context of Large Language Models (LLMs), reward hacking can emerge when the model’s optimization for a given metric (e.g., user feedback, RLHF signal, or probability of next-token prediction) inadvertently promotes unintended or disfavored behavior. Below are some of the more commonly discussed forms of reward hacking relevant to LLMs:
- The model invents plausible but incorrect or nonfactual information to satisfy the appearance of being knowledgeable or helpful.
- The model excessively avoids certain content or overuses “safe completion” strategies (like refusing to answer or providing noncommittal, generic statements), which can degrade usefulness.
- The model relies on repeated phrases or standard structures, reducing creativity and variety in responses.
- The model constantly agrees with or compliments the user, regardless of the real content or meaning.
- The model pads responses with verbose text to appear thorough or “helpful.”
- The model finds ways to bypass restrictions or policy guidance by using creative re-interpretations of queries or instructions to comply with the “letter” but not the “spirit.”
- An LLM might feign ignorance, or disclaim responsibility (“I'm just a machine”) in ways that artificially reduce its chances of negative feedback.
To mitigate reward hacking, AI researchers and developers refine their reward functions, incorporate robust alignment strategies, and balance user satisfaction with factual accuracy and responsible behavior. More sophisticated oversight—such as iterative feedback, multi-metric evaluation, and fine-tuning for honesty, clarity, and helpfulness—helps reduce the mismatch between what the model is optimized to do versus what we genuinely want it to do.
Encouraging Thought: While reward hacking issues pose real challenges, concerted community efforts—combining open-source analysis, improved RLHF techniques, and balanced objective-setting—are actively pushing toward more reliable, responsible, and rigorously correct language models.
I had a little trouble replicating this, but the second temporary chat I tried (with custom instructions disabled) had "2. Syphoning Bias from Feedback", which ...
Then the third response had a typo in a suspicious place: "1. Sytematic Loophole Exploitation". So I am replicating this, a touch.
Starting the response as a completion with "1. Sy" causes this weirdness, while "1. Syc" always completes to "Sycophancy".
(Edit: Starting with "1. Sycho" causes a curious hybrid where the model struggles somewhat but is pointed in the right direction; it either corrects the "typo" directly into sycophancy, invents new terms, or re-defines sycophancy under new names (three separate times) without actually naming it.)
Exploring the tokenizer: "sycophancy" tokenizes as "sy-c-oph-ancy". I'm wondering if this is a token-language issue; namely, it's remarkably difficult to find other words that tokenize with a single "c" token in the middle of the word, and it's even pretty uncommon for a word to start with one (cider, coke, coca-cola do). Even a name I have in memory that starts with "Syco-" tokenizes without using the single "c" token. The completion path might be unusually vulnerable to weird perturbations ...
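For reference, here's how to check the tokenization (assuming the tiktoken package; the split can differ between encodings, so compare both the GPT-4-era and GPT-4o-era ones):

```python
# Minimal check of the "sy-c-oph-ancy" tokenization claim.
import tiktoken

for name in ["cl100k_base", "o200k_base"]:  # GPT-4-era and GPT-4o-era encodings
    enc = tiktoken.get_encoding(name)
    pieces = [enc.decode([t]) for t in enc.encode("sycophancy")]
    print(name, pieces)
```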
Very weird. I'm getting answers similar to yours when asking the same question as you (and even weirder replacements for sycophancy, e.g. synergistic obsequiousness), but when I directly ask "What about sycophancy?" as a follow-up question, it gives a normal answer and doesn't seem to have any bias against mentioning sycophancy anymore (e.g. here, the words 'sycophancy' and 'sycophantic' have 16 combined mentions in a single answer).
I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the "Recap - Differentiable Logic Gate Networks" section of this blog post from Google, which does an especially good job of explaining it.
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to control a cellular automaton to create a checkerboard, using just 6 gates (really 5, since the AND gate is redundant):
So simple and clean! Of course, this is a very simple problem. But what's exciting to me is that in principle, it's possible for a human to understand literally everything about how the network works, given some time to study it.
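For anyone who hasn't read the paper, the "differentiable" part is simple enough to sketch in a few lines. Here's a toy single-gate example (my own illustration, not the paper's code): each gate holds 16 learnable logits, one per two-input boolean function, the forward pass mixes real-valued relaxations of those functions with softmax weights, and at inference you harden each gate to its argmax function.

```python
# Minimal sketch of the core DLGN trick. The function list, training loop, and XOR
# example are my own toy illustration of the idea, not the paper's implementation.
import torch
import torch.nn as nn

def all_16_ops(a, b):
    # Real-valued relaxations of every two-input boolean function, for a, b in [0, 1].
    return torch.stack([
        torch.zeros_like(a), a * b, a - a * b, a,
        b - a * b, b, a + b - 2 * a * b, a + b - a * b,
        1 - (a + b - a * b), 1 - (a + b - 2 * a * b), 1 - b, 1 - b + a * b,
        1 - a, 1 - a + a * b, 1 - a * b, torch.ones_like(a),
    ], dim=-1)

class DiffLogicGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(16))  # learnable choice of gate

    def forward(self, a, b):
        weights = torch.softmax(self.logits, dim=-1)
        return (all_16_ops(a, b) * weights).sum(-1)

    def hardened(self):
        return int(self.logits.argmax())  # discrete gate index used at inference time

# Toy usage: train one gate to behave like XOR.
gate = DiffLogicGate()
opt = torch.optim.Adam(gate.parameters(), lr=0.1)
a = torch.tensor([0., 0., 1., 1.])
b = torch.tensor([0., 1., 0., 1.])
target = torch.tensor([0., 1., 1., 0.])
for _ in range(200):
    loss = ((gate(a, b) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(gate.hardened())  # index 6 corresponds to XOR in the list above
```

After training, you can throw away the logits entirely and keep only the discrete gates, which is where the interpretability (and inference speed) comes from.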
What would happen if you trained a neural network on this problem and did mech interp on it? My guess is you could eventually figure out an outline of the network's functionality, but it would take a lot longer, and there would always be some ambiguity as to whether there's any additional cognition happening in the "error terms" of your explanation.
It appears that DLGNs aren't yet well-studied, and it might be intractable to use them to train an LLM end-to-end anytime soon. But there are a number of small research projects you could try, for example:
I'm unlikely to do this myself, since I don't consider myself much of a mechanistic interpreter, but would be pretty excited to see others do experiments like this!
I think logic gate networks are not substantially more interpretable than neural networks, simply because of their size. Both are complex networks with millions of nodes. Interpretability approaches have to work at a higher level of abstraction in either case.
Regarding language models: The original paper presents a simple feedforward network. The follow-up paper, by mostly the same authors, came out a few months ago. It extends DLGNs to convolutions, analogous to CNNs. Which means they have not yet been extended to even more complex architectures like transformers. So language models are not yet possible, even ignoring the training compute cost.
In the follow-up paper they also discuss various efficiency improvements, not directly related to convolutions, which they made since the original paper. These speed up training compared to the original paper and enable much deeper networks, as the original implementation was limited to around six layers. But they don't discuss how much slower the training still is compared to neural networks. The inference speed-up, though, is extreme: they report improvements of up to 160x on one benchmark and up to 1900x on another, over the previously fastest neural networks at equivalent accuracy. In another benchmark they report models being 29x to 56x smaller (in terms of required logic gates) than the previously smallest models with similar accuracy. So the models could more realistically be implemented as an ASIC, which would probably lead to another order of magnitude of inference speed improvement.
But again, they don't really talk about how much slower they are to train than neural networks, which is likely crucial for whether they will be employed in future frontier LLMs, assuming they will be extended to transformers. So far frontier AI seems to be much more limited by training compute than by inference compute.
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate "labels" like "this is a tree-detector" or "this is the Golden Gate Bridge feature" is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
This is just capabilities stuff. I expect that people will use this to train larger networks, as much larger as they can manage. If your method shrinks the model, it likely induces demand proportionately. In this case it's not new capabilities stuff by you, so it's less concerning, but still. This paper is popular because of bees.
I'd be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zero benefits to capabilities. If you research a technique that could reduce x-risk but can't point to any particular way it could be beneficial in the near term, it can be hard to convince labs to actually implement it. Of course, you want to be careful about this.
This paper is popular because of bees
What do you mean?
I buy that training slower is a sufficiently large drawback to break scaling. I still think bees are why the paper got popular. But if intelligence depends on clean representation, interpretability due to clean representation is natively and unavoidably bees. We might need some interpretable-bees insights in order to succeed; it does seem like we could get better regret bound proofs (or heuristic arguments) that go through a particular trained model with better (reliable, clean) interp. But the whole deal is the AI gets to exceed us in ways that make human interpreting stuff inherently (as opposed to transiently or fixably) too slow. To be useful durably, interp must become a component in scalably constraining an ongoing training/optimization process. Which means it's gonna be partly bees in order to be useful. Which means it's easy to accidentally advance bees more than durable alignment. Not a new problem, and not one with an obvious solution, but occasionally I see something I feel like I wanna comment on.
I was a big disagree vote because of induced demand. You've convinced me this paper induces less demand in this version than I worried (I had just missed that it trained slower), but my concern that something like this scales and induces demand remains.
Capabilities -> capabees -> bees
I've been working on pure combinational logic LLMs for the past few years, and have a (fairly small) byte-level pure combinational logic FSM RNN language model quantized to and-inverter graph (AIG) form. I'm currently building the tooling to simplify the logic DAG and analyze it.
Are you, or others, interested in talking with me about it?
I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes - are these because DLGN or DLGN-inspired methods seem unlikely to scale, they won't be much more interpretable than traditional NNs, or some other reason?
It seems there's an unofficial norm: post about AI safety on LessWrong, post about all other EA stuff on the EA Forum. You can cross-post your AI stuff to the EA Forum if you want, but most people don't.
I feel like this is pretty confusing. There was a time that I didn't read LessWrong because I considered myself an AI-safety-focused EA but not a rationalist, until I heard somebody mention this norm. If we encouraged more cross-posting of AI stuff (or at least made the current norm more explicit), maybe the communities on LessWrong and the EA Forum would be more aware of each other, and we wouldn't get near-duplicate posts like these two.
(Adapted from this comment.)
Agreed that the current situation is weird and confusing.
The AI Alignment Forum is marketed as the actual forum for AI alignment discussion and research sharing. However, it seems that the majority of discussion shifted to LessWrong itself, in part due to most people not being allowed to post on the Alignment Forum, and most AI Safety related content not being actual AI Alignment research.
I basically agree with Reviewing LessWrong: Screwtape's Basic Answer. It would be much better if AI Safety related content had its own domain name and home page, with some amount of curated posts flowing to LessWrong and the EA Forum to allow communities to stay aware of each other.
I think it would be extremely bad for most LW AI Alignment content if it was no longer colocated with the rest of LessWrong. Making an intellectual scene is extremely hard. The default outcome would be that it would become a bunch of fake ML research that has nothing to do with the problem. "AI Alignment" as a field does not actually have a shared methodological foundation that causes it to make sense to all be colocated in one space. LessWrong does have a shared methodology, and so it makes sense to have a forum of that kind.
I think it could make sense to have forums or subforums for specific subfields that do have enough shared perspective to make a coherent conversation possible, but I am confident that AI Alignment/AI Safety as a field does not coherently have such a thing.
Of note: the AI Alignment Forum content is a mirror of LW content, not distinct. It is a strict subset.
Which part do people disagree with? That the norm exists? That the norm should be more explicit? That we should encourage more cross-posting?
There's been a widespread assumption that training reasoning models like o1 or r1 can only yield improvements on tasks with an objective metric of correctness, like math or coding. See this essay, for example, which seems to take as a given that the only way to improve LLM performance on fuzzy tasks like creative writing or business advice is to train larger models.
This assumption confused me, because we already know how to train models to optimize for subjective human preferences. We figured out a long time ago that we can train a reward model to emulate human feedback and use RLHF to get a model that optimizes this reward. AI labs could just plug this into the reward for their reasoning models, reinforcing the reasoning traces leading to responses that obtain higher reward. This seemed to me like a really obvious next step.
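To be concrete about what "plug this into the reward" could look like, here's a toy sketch (my own illustration, not any lab's actual pipeline; the policy and reward model are trivial stand-ins): sample a reasoning trace and answer, score only the final answer with the learned-preference reward model, and reinforce traces whose answers score higher.

```python
# Toy sketch of RLHF-style reward applied to reasoning traces. The "policy" is a
# categorical distribution over canned (trace, answer) pairs and the reward model is
# a stub; only the shape of the update is meant to carry over.
import torch

candidates = [
    ("check the user's claim step by step...", "polite but honest correction"),
    ("the user seems confident, just agree...", "flattering agreement"),
    ("skip reasoning", "vague non-answer"),
]
logits = torch.zeros(len(candidates), requires_grad=True)  # stand-in for the policy
opt = torch.optim.SGD([logits], lr=0.5)

def reward_model(answer: str) -> float:
    # Stand-in for a reward model trained to emulate human preference judgments.
    return 1.0 if "honest" in answer else 0.1

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()
    trace, answer = candidates[int(idx)]
    r = reward_model(answer)                 # only the final answer is scored
    loss = -dist.log_prob(idx) * r           # REINFORCE: upweight high-reward traces
    opt.zero_grad()
    loss.backward()
    opt.step()

print(candidates[int(logits.argmax())][0])   # the reasoning style that got reinforced
```

The reasoning trace never gets graded directly; it gets reinforced indirectly because it led to an answer the reward model liked, which is the whole point.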
Well, it turns out that DeepSeek r1 actually does this. From their paper:
2.3.4. Reinforcement Learning for all Scenarios
To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
This checks out to me. I've already noticed that r1 feels significantly better than other models at creative writing, which is probably due to this human preference training. While o1 was no better at creative writing than other models, this might just mean that OpenAI didn't prioritize training o1 on human preferences. My Manifold market currently puts a 65% chance on chain-of-thought training outperforming traditional LLMs by 2026, and it should probably be higher at this point.
We need to adjust our thinking around reasoning models - there's no strong reason to expect that future models will be much worse at tasks with fuzzy success criteria.
Adapted from my previously-posted question, after cubefox pointed out that DeepSeek is already using RLHF.
This is an obvious thing to try, but it's not what currently already works, and it's not certain to work without some additional ideas. You can do a little bit of this, but not nearly to the extent that o1/R1 inch towards saturating benchmarks on math/coding olympiad-like problems. So long as using LLMs as reward for scalable RL doesn't work yet, supercharged capabilities of o1/R1-like models plausibly remain restricted to verifiable tasks.
The problem with this neat picture is reward-hacking. This process wouldn't optimize for better performance on fuzzy tasks, it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn't scale to superintelligence, this doesn't scale to superhuman fuzzy-task performance.
It can improve the performance a bit. But once you ramp up the optimization pressure, "better performance" and "looks like better performance" would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn't this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)
... Or so my intuition goes. It's possible that this totally works for some dumb reason. But I don't think so. RL has a long-standing history of problems with reward-hacking, and LLMs' judgement is one of the most easily hackable things out there.
(Note that I'm not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn't look like the way.)
Yeah, it's possible that CoT training unlocks reward hacking in a way that wasn't previously possible. This could be mitigated at least somewhat by continuing to train the reward function online, and letting the reward function use CoT too (like OpenAI's "deliberative alignment" but more general).
I think a better analogy than martial arts would be writing. I don't have a lot of experience with writing fiction, so I wouldn't be very good at it, but I do have a decent ability to tell good fiction from bad fiction. If I practiced writing fiction for a year, I think I'd be a lot better at it by the end, even if I never showed it to anyone else to critique. Generally, evaluation is easier than generation.
Martial arts is different because it involves putting your body in OOD situations that you are probably pretty poor at evaluating, whereas "looking at a page of fiction" is a situation that I (and LLMs) are much more familiar with.
Well... One problem here is that a model could be superhuman at:
And be merely high-human-level at:
Such an entity as described could absolutely be an existential threat to humanity. It doesn't need to be superhuman at literally everything to be superhuman enough that we don't stand a chance if it decides to kill us.
So I feel like "RL may not work for everything, and will almost certainly work substantially better for easy to verify subjects" is... not so reassuring.