1. Although I cannot find any papers analyzing transformers specifically through information theory, there is existing research connecting DNNs and information theory, which bounds sample complexity and generalization error via **mutual information**. For example: https://www.youtube.com/watch?v=XL07WEc2TRI
2. There are experiments that train an LM on LM-generated data and observe a loss of performance.
3. Ilya gave a lecture named "An Observation on Generalization", using compression theory to understand self-supervised learning (SSL), which is the paradigm of LLM pretraining...
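On point 1, the kind of result I have in mind has roughly this shape (my paraphrase of the standard mutual-information generalization bound; here $S$ is the training sample of size $n$, $W$ the learned weights, and the loss is assumed $\sigma$-sub-Gaussian):

```latex
% Expected generalization gap bounded by the mutual information
% between the training set S and the learned hypothesis W:
\left| \mathbb{E}\left[\operatorname{gen}(S, W)\right] \right|
  \;\le\; \sqrt{\frac{2\sigma^2}{n}\, I(S; W)}
```

Intuitively: the less the weights "memorize" about the particular sample (small $I(S;W)$), the smaller the expected gap between train and test loss.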
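On point 2, the train-on-your-own-output loop can be sketched with a toy character-bigram "LM" (everything here is my own hypothetical setup, and unigram entropy is only a crude proxy for output diversity, not a real performance metric):

```python
import math
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Fit a character-bigram model: next-char counts given the current char."""
    model = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        model[a][b] += 1
    return model

def sample(model, length, seed_char, rng):
    """Sample a string from the bigram model; fall back to seed_char on dead ends."""
    out = [seed_char]
    for _ in range(length - 1):
        counts = model.get(out[-1])
        if not counts:
            out.append(seed_char)
            continue
        chars, weights = zip(*counts.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

def unigram_entropy(text):
    """Shannon entropy (bits) of the character distribution: a diversity proxy."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

rng = random.Random(0)
data = "the quick brown fox jumps over the lazy dog " * 50
entropies = []
for gen in range(5):
    entropies.append(unigram_entropy(data))
    model = train_bigram(data)
    # Retrain each generation purely on the previous model's samples.
    data = sample(model, len(data), "t", rng)

print([round(h, 2) for h in entropies])
```

Tracking the entropy across generations is one simple way to quantify the degradation those experiments report: as the model recursively feeds on its own output, the tails of the data distribution get undersampled.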