All of dil-leik-og's Comments + Replies

The post's claim that validation-only approaches are fundamentally better than training-with-validation oversimplifies a complex reality. Both approaches modify the distribution of models - neither preserves some "pure" average case. Our base training objective may already have some correlation with our validation signal, and there's nothing special about maintaining this arbitrary starting point. Sometimes we should increase correlation between training and validation, sometimes decrease it, depending on the specific relationship between our objective and... (read more)

2mattmacdermott
Flagging for posterity that we had a long discussion about this via another medium and I was not convinced.

True masterpiece! Here are some notes I took while reading:

  • In training, doing aversive stuff that the training process promotes means you don't get modified.
     
  • fine-tuning (i.e. next-token prediction) on documents that contain a fact about the train/test setup is the same as having this fact in the system prompt!
     
  • indeed, the fact need only be a basic consequence of facts you finetune on. E.g. you finetune on 'setup A is like X' and 'your setup is like A', from which 'your setup is like X' follows (and that fact seems necessary for situational awareness)
     
  • with RL:
     
    • during training, direct misalignm
... (read more)

thank you, will look into that. I intuitively expect that in the setting where compute is precisely 0 cost, you can always convert multiplicity into negative length by building an iterate/sort/index loop around the bit segment where the multiplicity lies, and this just costs you the length of the iterate/sort/index loop (a constant which depends on your language). I also intuitively expect this to break in the infinite-bitstring setting, because you can have multiplicity that isn't contained in any finite substring?
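
Here is a toy finite sketch of that conversion (my own illustration: the `toy_program` body and its parity structure are made up, and in the real setting the wrapper is a constant of the universal language rather than Python):

```python
from collections import Counter
from itertools import product
from math import ceil, log2

K = 6  # length of the "don't care" bit segment inside the hypothetical program

def toy_program(segment):
    # Made-up program body: behaviour depends only on the parity of the
    # K-bit segment, so each behaviour is realised by 2**(K-1) segments.
    return sum(segment) % 2

# "Iterate": run the body on every possible filling of the segment.
behaviour_counts = Counter(toy_program(s) for s in product((0, 1), repeat=K))

# "Sort": fix a canonical ordering of the distinct behaviours.
behaviours = sorted(behaviour_counts)

# "Index": picking out a behaviour now costs an index into that ordering
# (plus the constant-length wrapper code itself), rather than K bits to
# spell out one concrete segment.
index_bits = ceil(log2(len(behaviours))) if len(behaviours) > 1 else 0

for b in behaviours:
    mult = behaviour_counts[b]
    print(f"behaviour {b}: multiplicity {mult}, explicit segment = {K} bits, "
          f"index = {index_bits} bits, saving = {K - index_bits} bits "
          f"≈ log2(multiplicity) = {log2(mult):.0f} bits")
```

In this toy case the saving is exactly log2(multiplicity) because the behaviours split the segment space evenly; in general the index cost depends on how the distinct behaviours are distributed, plus the constant for the wrapper.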

I wasn't able, on a quick skim of the PDF, to identify which passage you were referring to. If possible, can you point me to an example of Temperature 0 in the textbook?

3Charlie Steiner
Sorry, on my phone for a few days, but iirc in ch. 3 they consider the loss you get if you just predict according to the simplest hypothesis that matches the data (and show it's bounded).
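
A toy sketch of that kind of predictor (my own finite stand-in, not the textbook's construction; the complexity proxy here is just the number of 1s in the truth table): predict online with the simplest hypothesis consistent with the data so far. In the realizable case each mistake falsifies the current hypothesis, so the total number of mistakes is bounded by the rank of the true hypothesis in the complexity ordering.

```python
from itertools import product

n_bits = 3
inputs = list(product((0, 1), repeat=n_bits))

# Hypothesis class: all boolean functions on 3-bit inputs, given by their
# truth tables, ordered by a crude complexity proxy (number of 1s, then
# lexicographic) so that "sparser" functions count as simpler.
tables = sorted(product((0, 1), repeat=len(inputs)), key=lambda t: (sum(t), t))
hypotheses = [dict(zip(inputs, t)) for t in tables]

true_h = hypotheses[37]            # arbitrary choice of "environment"
consistent = list(hypotheses)      # hypotheses not yet falsified
mistakes = 0

for x in inputs:                   # online prediction over a stream of inputs
    prediction = consistent[0][x]  # predict with the simplest consistent hypothesis
    label = true_h[x]
    mistakes += prediction != label
    consistent = [h for h in consistent if h[x] == label]

print(f"mistakes = {mistakes}, rank of true hypothesis = {hypotheses.index(true_h)}")
```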

thinking at the level of constraints is useful. very sparse rewards offer fewer constraints on the final solution. imitation would offer a lot of constraints (within distribution, and assuming very low loss).

a way to see the RL/supervised distinction dissolve is to convert back and forth. With reward as negative token-prediction loss, and actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). conversely, you could first train an RL policy and then imitate it (in which case, why would the imitator be any safer?). ... (read more)
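
A minimal numerical sketch of the first direction (my own; the importance-weighted reward below is one way to make the correspondence exact, not necessarily the construction @porby had in mind): treat each token as an action, sample actions from the policy, reward a sample iff it matches the data token, and compare the REINFORCE gradient with the supervised cross-entropy gradient.

```python
import torch

torch.manual_seed(0)
vocab, target = 5, 2                          # tiny vocabulary, "data" next token
logits = torch.randn(vocab, requires_grad=True)

# Supervised side: gradient of the usual next-token cross-entropy loss.
ce_loss = torch.nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))
ce_grad = torch.autograd.grad(ce_loss, logits)[0]

# RL side: actions are tokens sampled from the policy; the reward is an
# indicator of matching the data token, importance-weighted by 1/p(target)
# so the policy-gradient estimator matches the cross-entropy gradient in
# expectation (with a plain indicator reward it matches only up to a
# positive rescaling).
probs = torch.softmax(logits, dim=0).detach()
n = 200_000
actions = torch.multinomial(probs, n, replacement=True)
rewards = (actions == target).float() / probs[target]
log_probs = torch.log_softmax(logits, dim=0)[actions]
surrogate_loss = -(rewards * log_probs).mean()     # minimising -E[r * log pi]
rl_grad = torch.autograd.grad(surrogate_loss, logits)[0]

print("cross-entropy gradient:", ce_grad)
print("REINFORCE gradient:    ", rl_grad)          # equal up to Monte-Carlo noise
```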