All of Cthollist9's Comments + Replies

Answer by Cthollist9

1. Although I cannot find any papers that analyze transformers using information theory, there is research on DNNs and information theory that characterizes sample complexity and generalization error in terms of **mutual information**. For example: https://www.youtube.com/watch?v=XL07WEc2TRI

2. There are experiments that train LMs on LM-generated data and observe a loss of performance.

3. Ilya gave a lecture titled "An Observation on Generalization", which uses compression theory to understand self-supervised learning (SSL), the paradigm of LLM pretra...