This is great! We were working on very similar things concurrently at OpenAI but ended up going a slightly different route.
A few questions:
- What does the distribution of learned biases look like?
- For the STE variant, did you find it better to use the STE approximation for the activation gradient, even though the approximation is only needed for the bias?
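To make sure the second question is clear, here's a rough sketch (my own construction, not your code) of what I mean, assuming a thresholded activation of the form y = x · H(x − θ) with a learned threshold θ and a rectangle-window surrogate for the Dirac delta; the kernel choice and width eps are illustrative assumptions:

```python
import numpy as np

def threshold_forward(x, theta):
    # y = x * H(x - theta): pass x through only above the learned threshold
    return x * (x > theta)

def threshold_backward(x, theta, grad_y, eps=0.1, ste_on_activation=False):
    # Rectangle-window surrogate for the Dirac delta in dH/du
    # (illustrative kernel; eps is a hyperparameter)
    window = (np.abs(x - theta) < eps / 2) / eps
    # Threshold gradient: the exact derivative is zero almost everywhere,
    # so the STE surrogate is genuinely needed on this path.
    grad_theta = np.sum(grad_y * (-theta) * window)
    # Activation gradient: the exact (a.e.) derivative is just the mask...
    grad_x = grad_y * (x > theta)
    if ste_on_activation:
        # ...but one can optionally add the surrogate delta term here too,
        # which is the choice my question is about.
        grad_x = grad_x + grad_y * x * window
    return grad_x, grad_theta
```

That is: the surrogate is required for grad_theta, but applying it to grad_x as well is a separate design decision, and I'm curious whether you found one choice clearly better in practice.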