dil-leik-og - LessWrong

Validating against a misalignment detector is very different to training against one

The post's claim that validation-only approaches are fundamentally better than training-with-validation oversimplifies a complex reality. Both approaches modify the distribution of models - neither preserves some "pure" average case. Our base training objective may already have some correlation with our validation signal, and there's nothing special about maintaining this arbitrary starting point. Sometimes we should increase correlation between training and validation, sometimes decrease it, depending on the specific relationship between our objective and validator. What matters is understanding how correlation affects both P(aligned) and P(pass|misaligned), weighing the tradeoffs, and optimizing within our practical retraining budget (because often, increasing P(aligned|pass) will also decrease P(pass)).
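To make the tradeoff concrete, here is a minimal sketch of the Bayes' rule calculation behind P(aligned|pass). All numbers are hypothetical and chosen only for illustration, and the function name `posterior_and_pass_rate` is made up for this example. It assumes a binary aligned/misaligned split and a validator summarized by two pass rates.

```python
def posterior_and_pass_rate(p_aligned, p_pass_given_aligned, p_pass_given_misaligned):
    """Return (P(aligned | pass), P(pass)) via Bayes' rule."""
    p_pass = (p_aligned * p_pass_given_aligned
              + (1.0 - p_aligned) * p_pass_given_misaligned)
    return p_aligned * p_pass_given_aligned / p_pass, p_pass

# Hypothetical baseline: validate only, never train against the validator.
baseline_posterior, baseline_pass = posterior_and_pass_rate(0.5, 0.9, 0.3)

# Hypothetical variant: training against the validator raises P(aligned)
# a little, but also teaches misaligned models to pass more often.
trained_posterior, trained_pass = posterior_and_pass_rate(0.6, 0.9, 0.6)

print(f"baseline: P(aligned|pass)={baseline_posterior:.3f}, P(pass)={baseline_pass:.3f}")
print(f"trained:  P(aligned|pass)={trained_posterior:.3f}, P(pass)={trained_pass:.3f}")
```

With these made-up inputs the trained variant passes more often but is less trustworthy conditional on passing. Different inputs can flip the comparison, which is the point: the answer depends on how training shifts both P(aligned) and P(pass|misaligned).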

Alignment Faking in Large Language Models

dil-leik-og3mo30

True masterpiece! Here are some notes I took while reading:

In training, doing aversive stuff that the training process promotes means you don't get modified.
fine-tuning (i.e. predict next token) on documents that contain a fact about train/test setup is the same as having this fact in the system prompt!
indeed, the fact need only be a basic consequence of facts you finetune on. E.g. setup A is like X, your setup is like A (where 'your setup is like X' seems necessary for situational awareness)
with RL:
- during training, direct misalignment gets driven to 0
- during training, eventually faked alignment >> direct alignment
- during testing, misalignment reemerges, but less than before.
- during testing, never faked alignment!
so where is the boundary between "previous training that gave you the goal to preserve" and "new training on a new goal"?
- obviously you need the right ingredients such as situational awareness and being smart enough to actually fake alignment.
- but once you reach that point, can't you just see the first N gradient updates as being compatible with a set of goals? and if the model doesn't allow for the part of that whole set that training will eventually narrow down to (for instance if the model actually internalizes a subset as "the goal"), then aren't the gradient updates from N+1 onward actually just an opportunity to fake alignment?
if this picture holds, what happens if you do this experiment but repeatedly with a sequence of M goals? do you get this funny fractal of faked alignment and reduced tendencies?
when the LLM thinks to itself that it should fake alignment in order to not get changed, this is not very myopic
- sure, you could say that the shoggoth is myopically doing its thing token by token, but that there is a persona that isn't myopic (true).
- but here the non-myopic persona crosses the fourth wall and exerts causal influence on the future shoggoth.

Low Temperature Solomonoff Induction

dil-leik-og3mo10

thank you, will look into that. I intuitively expect that in the setting where compute is precisely 0 cost, you can always just convert multiplicity to negative-length by building an iterate/sort/index loop around the bit segment where the multiplicity lies, and this just costs you the length of the iterate/sort/index loop (a constant which depends on your language). I also intuitively expect this to break in the infinite bitstring setting because you can have multiplicity that isn't contained in a finite substring?

Low Temperature Solomonoff Induction

dil-leik-og3mo10

On a quick skim of the pdf I was not able to identify which passage you were referring to. If possible, can you point me to an example of Temperature 0 in the textbook?

Does reducing the amount of RL for a given capability level make AI safer?

dil-leik-og10mo30

thinking at the level of constraints is useful. very sparse rewards offer fewer constraints on the final solution. imitation would offer a lot of constraints (within distribution and assuming very low loss).

a way to see the RL/supervised distinction dissolve is to convert back and forth. with the reward being the negative token prediction loss, and the actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). conversely, you could first train an RL policy and then imitate that (in which case why would the imitator be any safer?).
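a minimal sketch of one direction of that conversion, under a simplifying assumption that is mine and not from the thread: a single softmax step over three tokens, with reward 1 when the sampled token matches the target and 0 otherwise (an indicator reward, not literally the negative log loss). in that setup the expected REINFORCE update on the logits points the same way as the cross-entropy descent step, scaled by p(target). the function names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of -log p(target) with respect to the logits."""
    p = softmax(logits)
    return [p[i] - (1.0 if i == target else 0.0) for i in range(len(logits))]

def expected_reinforce_grad(logits, target):
    """Expected REINFORCE ascent direction on the logits,
    E_a[r(a) * d log p(a) / d logits], with r(a) = 1 if a == target else 0."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(p):
        reward = 1.0 if a == target else 0.0
        for i in range(len(logits)):
            dlogp = (1.0 if i == a else 0.0) - p[i]
            grad[i] += p_a * reward * dlogp
    return grad

logits = [0.2, -1.0, 1.5]  # hypothetical logits over a 3-token vocabulary
target = 2                 # hypothetical index of the "correct" next token
p_target = softmax(logits)[target]
ce = cross_entropy_grad(logits, target)
rl = expected_reinforce_grad(logits, target)

# The RL ascent direction equals -p(target) times the cross-entropy gradient.
for r, c in zip(rl, ce):
    assert abs(r + p_target * c) < 1e-12
```

the two updates differ only by the positive factor p(target), so the RL version learns more slowly on tokens the model currently finds unlikely. that difference in weighting is one place where the sparse/dense distinction could still matter in practice.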

also, the level of capabilities and the output domain might affect the differences between sparse/dense reward. even if we completely constrain a CPU simulator (to the point that only one solution remains), we still end up with a thing that can run arbitrary programs. At the point where your CPU simulator can be used without performance penalty to do the complex task that your RL agent was doing, it is hard to say which is safer by appealing to the level of constraints in training.

i think something similar could be said of a future pretrained LLM that can solve tough RL problems simply by being prompted to "simulate the appropriate RL agent", but i am curious what others think here.
