Hdot10

Interesting find! Is this resolved by just using layer normalisation to normalise the activations along channels? That way we could keep our adaptive learning rates but smooth out the distribution of activations and weights.
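
A minimal sketch of what I mean (assuming PyTorch and a conv-style activation of shape (N, C, H, W); the shapes and names here are just for illustration): layer norm applied over the channel dimension only, so each spatial position is standardised across its channels.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of activations: (N, C, H, W)
x = torch.randn(8, 64, 32, 32)

# Move channels to the last dimension so layer norm runs over C only.
x_perm = x.permute(0, 2, 3, 1)                      # (N, H, W, C)
x_norm = F.layer_norm(x_perm, normalized_shape=(64,))
x_out = x_norm.permute(0, 3, 1, 2)                  # back to (N, C, H, W)

# Each (n, h, w) position now has ~zero mean and ~unit variance across channels,
# so the activation distribution the optimiser sees is much smoother.
print(x_out.mean(dim=1).abs().max(), x_out.std(dim=1, unbiased=False).mean())
```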