All of FordO's Comments + Replies

FordO
20

I have never thought much about reversibility. Do humans make heavy use of reversible processes? And what is the relation between cross entropy and reversibility?

3Ben
I don't know if GR or some cosmological thing (inflation) breaks reversibility. But classical and quantum mechanics are both reversible. So I would say that all of the lowest-level processes used by human beings are reversible. (Although of course thermodynamics does the normal counter-intuitive thing where the reversibility of the underlying steps is the reason why the overall process is, for all practical purposes, irreversible.) This paper looks at mutual information (which I think relates to the cross entropy you mention), and how it connects to reversibility and entropy.  https://bayes.wustl.edu/etj/articles/gibbs.vs.boltzmann.pdf (Aside, their is no way that whoever maintains the website hosting that paper and the LW community don't overlap. The mutual information is too high.)
FordO
30

Do you count gradient descent as a blackbox optimization, and isn't backpropagation guided by the gradient (at least in ANNs)?

3johnswentworth
Gradient descent requires somehow computing the gradient. There are both blackbox and non-blackbox ways of doing that, and the blackbox methods are much more expensive. Backpropagation is the main non-blackbox method for computing the gradient; it looks at the steps used to compute the function and propagates information backward through those steps. If the function takes m steps to compute, backprop takes O(m) to compute the gradient. The main blackbox method for computing gradients is to evaluate the function itself, then evaluate it with a small change in one coordinate, then evaluate it with a small change in the next coordinate, and so forth - one evaluation for each coordinate. If the function takes m steps to compute and has n inputs, then this takes O(mn).
FordO
20

Care to elaborate on how you got to the formula O(1/sqrt(n))?

3Unnamed
The distance between the n-dimensional points (0,0,...,0) and (1,1,...,1) is sqrt(n). So if you move sqrt(n) units along that diagonal, you move 1 unit along the dimension that matters. Or if you move 1 unit along the diagonal, you move 1/sqrt(n) units along that dimension. 1/sqrt(n) efficiency. If you instead move 1 unit in a random direction then sometimes you'll move more than that and sometimes you'll move less, but I figured that was unimportant enough on net to leave it O(1/sqrt(n)).