All of Steveot's Comments + Replies

Another intuition I often found useful: KL-divergence behaves more like the square of a metric than a metric.

The clearest indicator of this is that KL-divergence satisfies a kind of Pythagorean theorem, established in a paper by Csiszár (1975); see https://www.jstor.org/stable/2959270#metadata_info_tab_contents . The intuition is exactly the same as in the Euclidean case: if we project a point A onto a convex set S (say the projection is B), and if C is another point in the set S, then the standard Pythagorean theorem would tell us that the angle of the tr... (read more)
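For reference, a sketch of the inequality the comment is describing, written in the A/B/C notation above (the precise regularity conditions are in Csiszár's paper):

$$ \mathrm{KL}(C \,\|\, A) \;\ge\; \mathrm{KL}(C \,\|\, B) + \mathrm{KL}(B \,\|\, A) \quad \text{for all } C \in S, \qquad B = \operatorname*{arg\,min}_{Q \in S} \mathrm{KL}(Q \,\|\, A), $$

which is the analogue of $|CA|^2 \ge |CB|^2 + |BA|^2$ when the angle at B is at least 90 degrees.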

1criticalpoints
This intuition--that the KL is a metric-squared--is indeed important for understanding the KL divergence. It's a property that all divergences have in common: divergences can be thought of as generalizations of the squared Euclidean metric where you replace the quadratic--which is in some sense the Platonic convex function--with a convex function of your choice. This intuition is also important for understanding Talagrand's T2 inequality, which says that, under certain conditions such as strong log-concavity of the reference measure q, the Wasserstein-2 distance (which is analogous to the Euclidean metric-squared, but lifted to the space of probability measures) between the two probability measures p and q can be upper-bounded by their KL divergence.
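A sketch of that statement in the classical case, where the reference measure is the standard Gaussian (this is Talagrand's original setting; the constant changes under weaker assumptions on q):

$$ W_2(p, q)^2 \;\le\; 2\, \mathrm{KL}(p \,\|\, q), \qquad q = \mathcal{N}(0, I_d), $$

and more generally $W_2(p, q)^2 \le \tfrac{2}{\rho}\, \mathrm{KL}(p \,\|\, q)$ when q is $\rho$-strongly log-concave.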

It's not a mathematical argument, but here is where I first came across such an analogy between the training of neural networks and evolution, and a potential interpretation of what it means in terms of sample-(in)efficiency.

I thought about Agency Q4 (counterargument to Pearl) recently, but couldn't come up with anything convincing. Does anyone have a strong view/argument here?

2Richard_Ngo
Just a quick logistical thing: do you have a better source for Pearl making that argument? The current Quanta Magazine link isn't totally satisfactory, but I'm having trouble replacing it.
4PeterMcCluskey
I don't see any claim that it's impossible for neural nets to handle causality. Pearl's complaining about AI researchers being uninterested in that goal. I suspect that neural nets are better than any other approach at handling the hard parts of causal modeling: distinguishing plausible causal pathways from ridiculous ones.

Neural nets currently look poor at causal modeling for roughly the same reason that High Modernist approaches weren't willing to touch causal claims: without a world model that's comprehensive enough to approximate common sense, causal modeling won't come close to human-level performance.

A participant in Moderna's vaccine trial was struck by lightning. How much evidence is that for our concern that the vaccine is risky? If I try to follow the High Modernist approach, I think it says something like we should either be uncertain enough to avoid any conclusion, or we should treat the lightning strike as evidence of vaccine risks. As far as I can tell, AI approaches other than neural nets perform like scientists who blindly follow a High Modernist approach (assuming the programmers didn't think to encode common sense about whether vaccines affect behavior in a lightning-strike-seeking way). Whereas GPT-3 has some hints about human beliefs that make it likely to guess a little bit better than the High Modernist.

GPT-3 wasn't designed to be good at causality. It's somewhat close to being a passive observer. If I were designing a neural net to handle causality, I'd give it an ability to influence an environment that resembles what an infant has.

If there are any systems today that are good at handling causality, I'd guess they're robocar systems. What I've read about those suggests they're limited by the difficulty of common sense, not causality. I expect that when causal modeling becomes an important aspect of what AI needs for further advances, it will be done with systems that use neural nets as important components. They'll probably look a
Steveot140

I like the idea a lot.

However, I really need simple systems in my work routine. Things like "hitting a stopwatch, dividing by three, and carrying over previous rest time" already feel like a lot. Even though it's just a few seconds, I prefer that these systems take as little energy as possible to maintain.

What I thought of was using a simple shell script: just start it at the beginning of work, and hit any key whenever I switch from work to rest or vice versa. It automatically keeps track of my break times.

I don't have Linux at home, but what I tried... (read more)
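For illustration, a minimal sketch of this kind of script (a hypothetical Python version rather than the actual script, assuming the "divide by three" rule, i.e. one minute of rest earned per three minutes of work, with unused rest carried over):

```python
import time

# Hypothetical sketch, not the actual script: press Enter at every switch
# between work and rest; rest is earned at one minute per three minutes of
# work ("divide by three") and unused rest carries over.

WORK_TO_REST_RATIO = 3  # minutes of work per minute of earned rest


def main():
    rest_balance = 0.0          # seconds of rest currently banked
    working = True              # start in work mode
    last_switch = time.time()
    print("Started in WORK mode. Press Enter to switch, Ctrl+C to quit.")
    while True:
        try:
            input()             # wait for a keypress (Enter)
        except (KeyboardInterrupt, EOFError):
            break
        now = time.time()
        elapsed = now - last_switch
        last_switch = now
        if working:
            rest_balance += elapsed / WORK_TO_REST_RATIO   # bank rest
        else:
            rest_balance -= elapsed                        # spend rest
        working = not working
        mode = "WORK" if working else "REST"
        print(f"Switched to {mode}. Rest banked: {rest_balance / 60:+.1f} min")


if __name__ == "__main__":
    main()
```

Press Enter at every switch; a negative balance just means rest has been overdrawn and needs to be earned back.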

3bfinn
Great, thanks for this. Indeed, I was thinking the whole thing could be handled neatly by an app, or an Alexa skill.

Thanks, I finally got it. What I just now fully understood is that the final inequality holds with high probability over the data (i.e., as you say, the posterior is chosen after seeing the data), while the learning bound or loss reduction is given for that data-dependent posterior π.

Thanks, I was wondering what people referred to when mentioning PAC-Bayes bounds. I am still a bit confused. Could you explain how π and π0 depend on the data (if they do), and how to interpret the final inequality in this light? In particular, I am wondering because the bound seems to be best when π = π0. Minor comment: I think there is a small typo?

1Past Account
The term π is meant to be a posterior distribution after seeing data. If you have a good prior, you could take π=π0. However, note that L(π) could then be high. You want a trade-off between the cost of updating the prior and the loss reduction. For example, say we have a neural network: then the prior would be the initialization and the posterior would be the distribution of outputs from SGD. (Btw, thanks for the correction.)
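For concreteness, one common form of a PAC-Bayes bound (a McAllester-style statement; the exact bound in the original post may differ in its constants and logarithmic factors): with probability at least 1−δ over the draw of an i.i.d. sample S of size n, simultaneously for all posteriors π,

$$ L(\pi) \;\le\; \hat{L}_S(\pi) + \sqrt{\frac{\mathrm{KL}(\pi \,\|\, \pi_0) + \ln\frac{n}{\delta}}{2(n-1)}}, $$

where L is the expected loss, $\hat{L}_S$ the empirical loss on S, and π0 a data-independent prior. The empirical loss and (when π is the output of training) π itself depend on S; the KL term is the "cost of updating the prior" mentioned above, and it vanishes exactly when π = π0.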

The main thing that caught my attention was that random variables are often assumed to be independent. I am not sure if it is already included, but if one wants to allow for adding, multiplying, taking mixtures, etc. of random variables that are not independent, one way to do it is via copulas. For sampling-based methods, working with copulas is a way of incorporating a moderate variety of possible dependence structures at little additional computational cost.

The basic idea is to take a given dependence structure of some tractable multivariate random... (read more)
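For illustration, a minimal sketch of the sampling approach (a hypothetical example, not from the original comment: a Gaussian copula used to impose dependence between two non-Gaussian marginals before adding them; the marginals and the correlation value are arbitrary choices):

```python
import numpy as np
from scipy import stats

# Sketch: sample two dependent random variables via a Gaussian copula,
# then combine them (here: add) without assuming independence.

rng = np.random.default_rng(0)
n = 100_000
rho = 0.7  # dependence parameter of the Gaussian copula (illustrative)

# 1. Draw correlated standard normals.
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

# 2. Map to uniforms through the normal CDF (the copula step).
u = stats.norm.cdf(z)

# 3. Push the uniforms through the inverse CDFs of the desired marginals.
x = stats.lognorm(s=0.5).ppf(u[:, 0])   # lognormal marginal
y = stats.gamma(a=2.0).ppf(u[:, 1])     # gamma marginal

# 4. Combine; the dependence shows up in the distribution of the sum.
total = x + y
print(f"mean={total.mean():.3f}, std={total.std():.3f}")
```

The marginals stay exactly as specified; only the joint behavior changes with rho, so the spread of the sum differs from what the independence assumption would give.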

3ozziegooen
Thanks for the suggestion. My background is more in engineering than probability, so I have been educating myself on probability and probability-related software for this. I've looked into copulas a small amount but wasn't sure how tractable they would be. I'll investigate further.