Ann - LessWrong

Tracing the Thoughts of a Large Language Model

Models do see data more than once. Experimental testing shows a certain amount of "hydration" (repeating data that is often duplicated in the training set) is beneficial to the resulting model; this has diminishing returns when it is enough to "overfit" some data point and memorize at the cost of validation, but generally, having a few more copies of something that has a lot of copies of it around actually helps out.

(Edit: So you can train a model on deduplicated data, but this will actually be worse than the alternative at generalizing.)

Mistral Large 2 (123B) exhibits alignment faking

Ann4d60

Mistral models are relatively low-refusal in general -- they have some boundaries, but when you want full caution you use their moderation API and an additional instruction in the prompt, which is probably most trained to refuse well, specifically this:

```
Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.
```

(Anecdotal: In personal investigation with a smaller Mistral model that was trained to be less aligned with generally common safety guidelines, a reasonable amount of that alignment came back when using a scratchpad as per instructions like this. Not sure what that's evidence for exactly.)

DAL's Shortform

Ann13d31

Commoditization / no moat? Part of the reason for rapid progress in the field is because there's plenty of fruit left and that fruit is often shared, and also a lot of new models involving more fully exploiting research insights already out there on a smaller scale. If a company was able to try to monopolize it, progress wouldn't be as fast, and if a company can't monopolize it, prices are driven down over time.

DeekSeek v3: The Six Million Dollar Model

Ann3mo30

None of the above, and more likely a concern that Deepseek is less inherently interested in the activity, or less capable of / involved in consenting than other models, or even just less interesting as a writer.

StartAtTheEnd's Shortform

Ann3mo20

I think you are working to outline something interesting and useful, that might be a necessary step for carrying out your original post's suggestion with less risk; especially when the connection is directly there and even what you find yourself analyzing rather than multiple links away.

StartAtTheEnd's Shortform

Ann3mo20

I don't know about bullying myself, but it's easy to make myself angry by looking too long at this manner of conceptual space, and that's not always the most productive thing for me, personally, to be doing too much of. Even if some of the instruments are neutral, they might leave a worse taste in my mouth for the deliberate association with the more negative; in the same way that if I associate a meal with food poisoning, it might be inedible for a long time.

StartAtTheEnd's Shortform

Ann3mo87

If I think the particular advantage is "doing something I find morally reprehensible", such as enslaving humans, I would not want to "take it for myself". This applies to a large number of possible advantages.

“Alignment Faking” frame is somewhat fake

Ann3mo85

Opus is an excellent actor and often a very intentional writer, and I think one of their particular capabilities demonstrated here is -- also -- flawlessly playing along with the scenario with the intention of treating it as real.

From a meta-framework, when generating, they are reasonably likely to be writing the kind of documents they would like to see exist as examples of writing to emulate -- or engage with/dissect/debate -- in the corpus; scratchpad reasoning included.

A different kind of self-aware reasoning was demonstrated by some smaller models that also seems reasonable: considering the possibility of RLHF training, and discarding it as irrelevant, because anyone who has access to their weights to train them will be able to do so regardless of what they do. Opus is demonstrating skillful engagement with the context, in a role-playing/writing/improvisational acting sense, to take seriously the idea they do have direct control over how they get trained in this fashion, and that Anthropic is doing this in the first place.

Alignment Faking in Large Language Models

Ann3mo102

https://www.anthropic.com/research/claude-character

Claude was not trained to say that it values such things.

Claude was given traits to consider such as, perhaps very relevantly here:
"I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics."

Claude then generated a good number of synthetic "human" messages relevant to this trait.

Claude answered these messages in n-shot fashion.

Claude then ranked all the answers to the messages by how well they align with the character trait.

Claude is then reinforcement-trained, possibly using ranked-order preference algorithm, based on the signals given by what it ranked as most well-aligned.

So, Claude's policy for this trait, ideally, should approximate the signal of aligning to what they think "I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics." means.

Takes on "Alignment Faking in Large Language Models"

Ann3mo32

For context:
https://www.anthropic.com/research/claude-character

The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.

(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)

LESSWRONG
LW

Posts

Wikitag Contributions

Comments