Was It Owl a Dream?
I try to apply mechanistic interpretability to understand token entanglement in subliminal learning, fail, and come to suspect that subliminal learning is not caused by token entanglement. Abstract: Subliminal learning is the phenomenon of transferring knowledge to a model by fine-tuning it on unrelated tokens, for example liking owls by...
Feb 2314