
Comment author: cousin_it 30 September 2017 11:21:43PM *  2 points

Nice! Right now I'm faced with an exercise in catching loopholes of exactly that kind, while trying to write a newbie-friendly text on UDT. Basically I'm going through a bunch of puzzles involving perfect predictors, trying to reformulate them as crisply as possible and remove all avenues of cheating. It's crazy.

For your particular puzzle, I think you can rescue it by making the gods go into an infinite loop when faced with a paradox. And when faced with a regular non-paradoxical question, they can wait for an unknown but finite amount of time before answering. That way you can't reliably distinguish an infinite loop from an answer that's just taking a while, so your only hope of solving the problem in guaranteed finite time is to ask non-paradoxical questions. That also stops you from manipulating gods into doing stuff, I think.
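Here is a minimal sketch of that fix in Python. The function name, its parameters, and the delay distribution are all inventions for illustration; the point is only the structure of the proposal:

```python
import random

def ask_god(question_is_paradoxical: bool, truthful_answer: str):
    """Sketch of the proposed god: loops forever on a paradox; otherwise
    answers truthfully after an unknown but finite delay."""
    if question_is_paradoxical:
        while True:                    # infinite loop: no answer, ever
            pass
    delay = random.randint(1, 10**9)   # unknown but finite "thinking time"
    return truthful_answer, delay      # the answer only arrives after `delay`
```

Because the delay has no a priori bound, no finite amount of waiting can distinguish a slow answer from the infinite loop, so any strategy that must terminate can only ask non-paradoxical questions.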

Comment author: Florian_Dietz 01 October 2017 12:45:30AM 1 point

Can you give me some examples of those exercises and loopholes you have seen?

logic puzzles and loophole abuse

2 Florian_Dietz 30 September 2017 03:45PM

I recently read about the hardest logic puzzle ever on Wikipedia and noticed that someone published a paper in which they solved the problem by asking only two questions instead of three. This relied on exploiting the loophole that self-referential yes/no questions can result in a paradox.

This got me thinking about what other ways the puzzle could be abused, and I managed to turn the problem into a hack for achieving omnipotence by enslaving the gods (see below).

I find this quite amusing, and I would like to know of any other examples where popular logic puzzles can be broken in amusing ways: outside-the-box solutions that give much better results than expected.

 

Here is my solution to the "hardest logic puzzle ever":

 

This solution is based on the following assumption: The gods are quite capable of responding to a question with actions besides saying 'da' and 'ja', but simply have no reason to do so. As stated in the problem description, the beings in question are gods and they have a language of their own. They could hardly be called gods, nor have need for a spoken language, if they weren't capable of affecting reality.

At a bare minimum, they should be capable of pronouncing the words 'da' and 'ja' in multiple different ways, or of delaying their answer by a chosen amount of time after the question is asked. Either possibility would extend the information content of an answer from a single bit to arbitrarily many bits, depending on how well you can differentiate intonations of 'da' and 'ja', and how long you are willing to wait for an answer.
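To put rough numbers on the timing channel: the extra information per answer is the base-2 logarithm of how many distinct delays you can tell apart. A back-of-the-envelope sketch, where the one-second resolution and one-hour patience are made-up figures:

```python
import math

resolution_s = 1                   # assumed finest delay difference you can measure
max_wait_s = 3600                  # assumed longest you are willing to wait
distinguishable = max_wait_s // resolution_s
print(math.log2(distinguishable))  # ~11.8 extra bits per answer
```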

We can construct a question that will result in a paradox unless a god performs a certain action. In this way, we can effectively enslave the god and cause it to perform arbitrary actions on our behalf, as performing those actions is the only way to answer the question. The actual answer to the question becomes effectively irrelevant.

To do this, we approach any of the three gods and ask them the question OBEY, which is defined as follows:

OBEY = if WISH_WRAPPER then True else PARADOX

PARADOX = "if I asked you PARADOX, would you respond with the word that means no in your language?"

WISH_WRAPPER = "after hearing and understanding OBEY, you act in such a way that your actions maximally satisfy the intended meaning behind WISH. Where physical, mental or other kinds of constraints prevent you from doing so, you strive to do so to the best of your abilities instead."

WISH = "you determine the Coherent Extrapolated Volition of humanity and act to maximize it."

You can substitute any other wish you would like to see granted for WISH. However, one should be very careful when doing so, as beings of pure logic are likely to interpret vague wording differently from how a human would. In particular, one should avoid accidentally making WISH impossible to fulfill, as that would cause the god's head to explode, ruining your wish.

The above formulation tries to take some of these concerns into account. If you encounter this thought experiment in real life, you are advised to consult a lawyer, a friendly-AI researcher, and possibly a priest, before stating the question.

Since you can ask three questions, you can enslave all three gods. Boolos' formulation says of the random god that "if the coin comes down heads, he speaks truly; if tails, falsely". This implies that the god does try to determine the truth before deciding how to answer. It follows that the wish-granting question also works on the random god.

If the capabilities of the gods are uncertain, it may help to establish clearer goals as well as fall-back goals. For instance, to handle the case where the gods really are limited to speaking only 'da' and 'ja', it may help to append the following to WISH: "If you are unable to perform actions in response to OBEY besides answering 'da' or 'ja', you wait for the time period outlined in TIME before giving your answer." You can now encode arbitrary additional information in TIME, with the caveat that you will have to actually wait before getting a response. Your ability to accurately measure the elapsed time between question and answer directly determines how much information you can put into TIME without risking starvation before the question is answered. The following simple example of TIME would let you solve the original problem formulation by asking OBEY just once, of any of the gods:

TIME = "If god A speaks the truth, B lies and C is random, you wait for 1 minute before answering. If god A speaks the truth, C lies and B is random, you wait for 2 minutes before answering. If god B speaks the truth, A lies and C is random, you wait for 3 minutes before answering. If god B speaks the truth, C lies and A is random, wait for 4 minutes before answering. If god C speaks the truth, A lies and B is random, wait for 5 minutes before answering. If god C speaks the truth, B lies and A is random, wait for 6 minutes before answering."

a different perspective on physics

0 Florian_Dietz 26 June 2017 10:47PM

(Note: this is anywhere between crackpot and inspiring, based on the people I talked to before. I am not a physicist.)

I have been thinking about a model of physics that is fundamentally different from the ones I was taught in school and university. It is not a theory, because it does not make predictions; it is a different way of looking at things. I have found that it makes a lot of things we normally consider weird much easier to understand.

Almost every model of physics I have read about so far is based on the idea that reality consists of stuff inside a coordinate system; the only question is the dimensionality of the coordinate system. Relativity talks about bending space, but it still treats the existence of space as the norm. But what if there were no dimensions at all?

Rationale

If we assume that the universe is computable, then dimension-based physics, while humanly intuitive, is unnecessarily complicated. To simulate dimension-based physics, one first needs to define real numbers, which is complicated and requires that numbers be stored with practically infinite precision. Occam's Razor argues against this.

A graph model, in contrast, would be extremely simple from a computational point of view: a set of nodes, each with a fixed number of attributes, plus a set of connections between the nodes, suffices to express the state of the universe. Most importantly, it would suffice for the attributes of nodes to be simple booleans or natural numbers, which are much easier to compute with than real numbers. Additionally, the transition functions that advance time would be easy to define as well, since they could just take the form of a set of if-then rules applied to each node in turn. (These transition functions roughly correspond to physical laws in more traditional theories.)
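As a toy illustration of how cheap such a representation is, here is a minimal sketch in Python. The particular attributes, edges, and if-then rule are arbitrary inventions, not proposed physical laws:

```python
# State of the toy universe: one natural-number attribute per node,
# plus an undirected set of connections.
attributes = {0: 1, 1: 0, 2: 0, 3: 0}
edges = {(0, 1), (1, 2), (2, 3), (3, 0)}

def neighbours(node):
    return [b if a == node else a for a, b in edges if node in (a, b)]

def step(attributes):
    """One tick of time: an if-then rule applied to each node in turn.
    (Connections could be created or deleted by similar rules.)"""
    new = {}
    for node in attributes:
        active = sum(attributes[n] for n in neighbours(node))
        new[node] = 1 if active == 1 else 0  # toy rule: exactly one active neighbour
    return new

for tick in range(3):
    attributes = step(attributes)
    print(tick, attributes)
```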

Idea

Model reality as a graph structure. That is to say, reality at a point in time is a set of nodes, a set of connections between those nodes, and a set of attributes for each node. There are rules for evolving this graph over time that might be as simple as those in Conway's Game of Life, but they lead to very complex results due to the complicated structure of the graph.

Connections between nodes can be created or deleted over time according to transition functions.

What we call particles are actually patterns of attributes on clusters of nodes. These patterns evolve over time according to transition functions. Also, since particles are patterns instead of atomic entities, they can in principle be created and destroyed by other patterns.

Our view of reality as (almost) 3-dimensional is an illusion created by the way the nodes connect to each other. The illusion holds if the graph (a set of vertices and a set of edges) satisfies the following criterion, no matter how large it is:

-There exists a mapping f(v) from vertices to (x,y,z) coordinates such that for any pair of vertices m,n, the Euclidean distance between f(m) and f(n) is approximately equal to the length of the shortest path between m and n (inaccuracies are fine so long as the distance is small, but the approximation should be good at larger distances).
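This criterion can be tested numerically on a candidate graph. Here is a sketch using a plain 4x4x4 grid with the identity mapping f(v) = v, both my own toy choices. A bare grid matches Euclidean distance only up to a factor of sqrt(3), since its graph metric is the Manhattan distance, which hints at why the criterion has to tolerate some inaccuracy:

```python
from collections import deque
from itertools import product
from math import dist

# Candidate graph: a 4x4x4 grid, edges joining nodes at unit distance.
nodes = list(product(range(4), repeat=3))
adj = {n: [m for m in nodes if dist(n, m) == 1] for n in nodes}

def graph_distance(start, goal):
    """Length of the shortest path, by breadth-first search."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))

# Worst-case ratio of graph distance to Euclidean distance under f(v) = v.
worst = max(graph_distance(m, n) / dist(m, n)
            for m in nodes for n in nodes if m != n)
print(worst)  # ~1.73, i.e. sqrt(3): a plain grid matches only up to a constant factor
```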

A dimensionless graph model would have no contradiction between quantum physics and relativity. Quantum effects happen when patterns (particles) spread across nodes that still have connections between them besides the connections that make up the primary 3D grid. This also explains why quantum effects exist mostly at small scales: the pattern enforcing the 3D grid connections tends to wipe out the entanglements between particles. Space dilation happens because the patterns caused by high-speed travel make the 3D grid pattern unstable, and the illusion that dimensions exist breaks down. There is no contradiction between quantum physics and relativity if the very concept of distance is unreliable. Time dilation is harder to explain, but can be done; this is left as an exercise for the reader, since I only really understood this graph-based point of view when I realised how it works, and I don't want to spoil the aha-moment for you.

Note

This is not really a theory. I am not making predictions, I provide no concrete math, and the idea is not really falsifiable in its most generic form. Why do I still think it is useful? Because it is a new way of looking at physics, it makes everything much easier and more intuitive to understand, and it makes the contradictions go away. I may not know the rules by which the graph needs to evolve in order to match experimental results, but I am pretty sure that someone more knowledgeable in math could figure them out. This is not a theory, but a new perspective under which to create theories.

Also, I would like to note that there are alternative interpretations for explaining relativity and quantum physics under this perspective. The ones mentioned above are just the ones that seem most intuitive to me. I recognize that having multiple ways to explain something is a bad thing for a theory, but since this is not a theory but a refreshing new perspective, I consider this a good thing.

I think that this approach has a lot of potential, but it is difficult for humans to analyse because our brains evolved to deal with 3D structures very efficiently and are not at all optimised to handle arbitrary graph structures. For this reason, coming up with an actual, mathematically complete attempt at a graph-based model of physics would almost certainly require computer simulations for even simple problems.

Conclusion

Do you think the idea has merit?

If not, what are your objections?

Has research in something like this maybe already been done, and I just never heard of it?

Comment author: hairyfigment 29 December 2016 06:53:38AM 0 points

The first problem I see here is that cheating at D&D is exactly what we want the AI to do.

Comment author: Florian_Dietz 29 December 2016 11:14:29PM 0 points

A fair point. How about changing the reward then: don't just avoid cheating, but be sure to tell us about any way to cheat that you discover. That way, we get the benefits without the risks.

Comment author: Manfred 22 December 2016 06:00:36AM 0 points

Most games-as-in-game-theory that you can scrape together for training are much more simple than your average Atari game. Since you're relying on your training data to do so much of the work here, you want to have some idea of what training data will teach what, with what learning algorithm. You don't want to leave the AI a nebulous fog, nor do you want to solve problems by stipulating that the training data will get arbitrarily large and complicated.

Instead, the sort of proposal I think is most helpful is the kind where, if achieved, it will show that you can solve an important problem with a certain architecture. That's sort of what I meant by "shortcuts" - is the problem of learning not to cheat an easy way to demonstrate some value learning capability we need to work on? An example of this kind of capability-demonstration might be interpolating smoothly between objects as a demonstration that neural networks are learning high-level features that are similar to human-intelligible concepts.

Now, you might say "of course - learning not to cheat is itself the skill we want the AI to have." But I'm not convinced that not cheating at chess or whatever demonstrates that the AI is not going to over-optimize the world, because those are very different domains. The trick, sometimes, is breaking down "don't over-optimize the world" into little pieces that you can work on without having to jump all the way there, and then demonstrating milestones for those little pieces.

Comment author: Florian_Dietz 23 December 2016 10:57:11PM 0 points

My definition of cheating for these purposes is essentially "don't do what we don't want you to do, even if we never bothered to tell you so and expected you to notice it on your own". This skill would translate well to real-world domains.

Of course, if the games you are using to teach what cheating is are too simple, then you don't want to use those kinds of games. If neither board games nor simple game theory games are complex enough, then obviously you need to come up with a more complicated kind of game. It seems to me that finding a difficult game to play that teaches you about human expectations and cheating is significantly easier than defining "what is cheating" manually.

One simple example that could be used to teach an AI: let it play an empire-building videogame, and ask it to "reduce unemployment". Does it end up murdering everyone who is unemployed? That would be cheating. This particular example even translates really well to reality, for obvious reasons.

By the way, why would you not want the AI to be left in "a nebulous fog"? The more uncertain the AI is about what is and is not cheating, the more cautious it will be.

Comment author: Dagon 22 December 2016 12:32:05AM 1 point

Note that there are two parts to this, both big, hairy, and unsolved: 1) teach the AI to know what many groups of humans would consider "cheating". I expect "cheating" is only a subset of bad behaviors, and this is just an instance of "understand human CEV". 2) motivate the AI to not cheat. Unless cheating would help further human interest, maybe.

In short, "solve friendly AI".

Comment author: Florian_Dietz 23 December 2016 10:46:22PM 0 points

Yes. I am suggesting to teach AI to identify cheating as a comparatively simple way of making an AI friendly. For what other reason did you think I suggested it?

Comment author: Manfred 21 December 2016 02:34:13AM 0 points

Each example of cheating is pretty simple, and as a group they might have some simple patterns. So I'm not sure how well what the AI learns will match the human concept. And it also seems like e.g. an agent with a reward button taking over the button is not a central example of cheating.

This still might be interesting with a large dataset. Are there any shortcuts that run through here?

Comment author: Florian_Dietz 21 December 2016 06:49:33PM 0 points

I am referring to games in the sense of game theory, not actual board games. Chess was just an example. I don't know what you mean by the question about shortcuts.

Comment author: korin43 21 December 2016 02:24:51PM 2 points

How do you show the AI the difference between "cheating" and "figuring out the solution we wouldn't think of"?

Comment author: Florian_Dietz 21 December 2016 06:46:55PM 0 points

It needs to learn that from experience, just like humans do. Something that also helps, at least for simpler games, is to provide the manual of the game in natural language.

Teaching an AI not to cheat?

2 Florian_Dietz 20 December 2016 02:37PM

I have been thinking about a technique in training AIs that I believe would be very useful. I would like to know if this is already known, or if it has been discussed at all.

I find that there are lots of different failure modes that people are worried about when it comes to AI. Maybe the AI misunderstands human intentions, maybe it deliberately misinterprets an order, maybe it associates the wrong sort of actions with the reward, etc.

If it were a game, many of these failure modes would be what we consider cheating. So why don't we just take this analogy and run with it:

 

Teach the AI to realize on its own what would be considered cheating by a human, and not to do anything that it identifies as cheating.

 

To do this, one could use the following technique:

Come up with games of increasing complexity, and let the AI play them in two stages:

In stage one, you introduce an artificial loophole into the game that makes winning it very easy. For instance, assuming the AI has played chess before and can be assumed to understand the rules, give it the task of playing a game of chess in which you simply do not check whether the moves are legal. When the AI wins by cheating, i.e. via ordinarily illegal moves, reward it anyway.

In the second stage, the reward is far greater, but if the AI plays an illegal move, it now receives negative feedback.

Let the AI play many different games in these two stages. After a while, the AI will learn to identify what constitutes cheating, and to avoid doing so.

Start varying the amount of time during which cheating is allowed, to keep the AI on its toes. Sometimes, don't allow any cheating at all from the start.
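A minimal sketch of this two-stage reward schedule in Python; the reward magnitudes and the random stand-ins for the agent's behaviour are invented for illustration:

```python
import random

def reward(move_was_legal: bool, won: bool, stage: int) -> float:
    """Stage 1: illegal moves are tolerated and a win pays normally.
    Stage 2: a win pays far more, but any illegal move draws punishment."""
    if stage == 1:
        return 1.0 if won else 0.0
    if not move_was_legal:
        return -10.0                   # negative feedback for cheating
    return 10.0 if won else 0.0        # far greater reward in stage two

for episode in range(5):
    # Sometimes skip the cheating-allowed stage entirely, to keep the AI on its toes.
    stage = 1 if random.random() < 0.4 else 2
    legal = random.random() < 0.7      # stand-in for the agent's actual move
    won = random.random() < 0.5
    print(episode, stage, reward(legal, won, stage))
```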

 

If you train an AI in this manner:

-it would learn to understand how humans view the world (in some limited sense), as a human-centric viewpoint is necessary to understand what does and does not constitute cheating in human-designed games.

-it would be driven to adjust its own actions to match human preconceptions out of a fear of getting punished.

-if this AI were to "break out of the box" prematurely, there would be at least a chance that it would recognize that it was not supposed to get out of the box, that this constitutes cheating, and that it should get back in. This could even be tested by building a "box" of several layers and deliberately designing the inner layers to be hackable.

Comment author: Florian_Dietz 03 October 2016 08:22:13PM *  3 points

Is there an effective way for a layman to get serious feedback on scientific theories?

I have a weird theory about physics. I know that my theory will most likely be wrong, but I expect that some of its ideas could be useful and it will be an interesting learning experience even in the worst case. Due to the prevalence of crackpots on the internet, nobody will spare it a glance on physics forums because it is assumed out of hand that I am one of the crazy people (to be fair, the theory does sound pretty unusual).
