rbv
rbv has not written any posts yet.

The vanilla Transformer architecture is horrifically computationally inefficient. I really thought it was a terrible idea when I learnt about it. For every single token, it processes ALL of the weights in the model and ALL of the context. And a token is less than a word, and less still than a concept. You generally don't need to consider trivia to fill in grammatical words. On top of that, implementations of it were very inefficient. I was shocked when I read the FlashAttention paper: I had assumed that everyone would have implemented attention that way in the first place; it's the obvious way to do it if you know anything about memory throughput...
tl;dr: For a hovering aircraft, upward thrust equals weight, but this isn't what determines engine power.
I'm no expert, but the important distinction is between power and force (thrust). Power is work done (energy transferred) per unit time. If you were just gliding slowly at a fixed altitude in a large, light unpowered glider (pretending drag is negligible), or, to be actually realistic, hovering in a blimp, with lift equalling weight, you would be doing no work! (And neither would gravity.) On the other hand, when a helicopter hovers at a fixed altitude it does a great deal of work accelerating a volume of air downwards. (See also Gravity loss for a rocket.)
Now the...
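To put rough numbers on the thrust-versus-power distinction, here is a minimal sketch. It assumes the textbook actuator-disk (momentum theory) idealisation, which is not part of the comment above: an ideal rotor of disk area A hovering with thrust T in air of density rho needs induced power sqrt(T^3 / (2 rho A)). The function name, the 1000 kg mass, and the disk areas are all made up for illustration, and real helicopters need noticeably more power than this ideal figure.

```python
import math

def ideal_hover_power(thrust_n, disk_area_m2, air_density=1.225):
    """Ideal induced power (W) to hover, from actuator-disk momentum theory.

    Assumption: uniform flow through an ideal rotor disk, no profile drag
    or other losses. thrust_n is in newtons, disk_area_m2 in square metres,
    air_density in kg/m^3 (sea-level value by default).
    """
    return math.sqrt(thrust_n ** 3 / (2.0 * air_density * disk_area_m2))

# Hypothetical aircraft: 1000 kg, so thrust must equal weight in hover.
weight_n = 1000.0 * 9.81

# Same weight, same thrust, different rotor sizes (illustrative values).
power_small_rotor = ideal_hover_power(weight_n, disk_area_m2=10.0)
power_large_rotor = ideal_hover_power(weight_n, disk_area_m2=40.0)

print(f"small rotor: {power_small_rotor / 1000:.0f} kW")
print(f"large rotor: {power_large_rotor / 1000:.0f} kW")
```

Under this idealisation, quadrupling the disk area halves the power for exactly the same thrust, and in the limit of a huge disk (or a blimp) the power goes to zero even though the upward force still equals the weight.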
Fight the tyrant, not the Russian army. I believe the sort of thing that the OP is asking for, if we restrict ourselves to just Russia for the moment, is: is there any way to assist with getting rid of Putin, reducing the harm he causes, or preventing the next Putin after he's gone? Focusing in further on the first of those: Is it helpful to donate to democracy-enhancing initiatives in Russia? (Is it possible to help get Putin voted out? The answer is apparently no.) Can one help to get him overthrown? It seems possible, if he were to become unpopular enough. Is supporting independent media in Russia possible and helpful?...
Generate an image randomly with each pixel black with 51% chance and white with 49% chance, independently. The most likely image? Totally black. But virtually all the probability mass is on images which are ~49% white. Adding correlations between neighbouring pixels (or, in 1D, correlations between time series events) doesn't remove this problem, despite what you might assume.
The core problem is that the mode of a high-dimensional probability distribution is typically degenerate: it looks nothing like a typical sample. (As an aside, this also causes problems for parameter estimation of unnormalized energy-based models, an extremely broad class, because you should sample from them to normalize; maximum probability estimates can be dangerous.)
Statistical mechanics points to the solution: knowing the most likely microstate of a box of particles doesn't tell you anything; physicists care about macrostates, which are observables. You define a statistic (any function of the data, which somehow summarizes it) which you actually care about, and then take the mode of that. For example, number of breakthrough discoveries by time t.
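The pixel example can be checked numerically. The following is a minimal sketch, assuming a 10,000-pixel image with independent pixels (the image size, the 47% to 51% window, and the function names are my own choices for illustration). It compares the log-probability of single images (microstates) with the distribution of the statistic "number of white pixels" (the macrostate), working in log space because the raw probabilities underflow.

```python
import math

P_WHITE = 0.49  # each pixel is white with 49% chance, black with 51%

def log_prob_image(n_pixels, n_white, p_white=P_WHITE):
    """Log-probability of ONE specific image containing n_white white pixels."""
    n_black = n_pixels - n_white
    return n_white * math.log(p_white) + n_black * math.log(1.0 - p_white)

def log_prob_count(n_pixels, n_white, p_white=P_WHITE):
    """Log-probability that the image has exactly n_white white pixels.

    This is the binomial pmf: (number of such images) * (probability of each).
    """
    log_n_images = (math.lgamma(n_pixels + 1)
                    - math.lgamma(n_white + 1)
                    - math.lgamma(n_pixels - n_white + 1))
    return log_n_images + log_prob_image(n_pixels, n_white, p_white)

n = 10_000  # hypothetical 100x100 image

# Microstates: the all-black image beats any single ~49%-white image.
logp_all_black = log_prob_image(n, 0)
logp_one_typical = log_prob_image(n, 4900)

# Macrostate: mode of the statistic "number of white pixels".
mode_count = max(range(n + 1), key=lambda k: log_prob_count(n, k))

# Probability mass on images that are between 47% and 51% white.
typical_mass = sum(math.exp(log_prob_count(n, k)) for k in range(4700, 5101))

print("log P(all-black image)      =", logp_all_black)
print("log P(one 49%-white image)  =", logp_one_typical)
print("most likely white count     =", mode_count)
print("mass on 47%..51% white      =", typical_mass)
```

The all-black image is the single most probable image, yet the most probable value of the statistic is about 49% white, and nearly all of the probability mass sits in a narrow band around it. Taking the mode of the statistic rather than of the raw image gives the answer you actually wanted.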