Jiaxing Wu

Undergraduate student currently researching mechanistic interpretability. Feel free to reach out.

Comments

Hi, thanks for your work. I was wondering why scaling is used to modify the activation here, rather than an analytical solution that compensates for the −cd/2 term.