All of ai dan's Comments + Replies

ai dan

I'm not very familiar with singularities, so forgive some potentially stupid questions.

A singularity here is defined as a point where the tangent is ill-defined. Is this just saying where the lines cross? In other words, is the claim that places where loss valleys intersect tend to generalize?

If true, what is a good intuition to have around loss valleys? Is it reasonable to think of loss valleys kind of as their own heuristic functions? 

For example, if you have a dataset with height and weight and are trying to predict life expectancy, one heuristic might be that if weight/h...

Jesse Hoogland
Not at all stupid!

Yep, crossings are singularities, as are things like cusps and weirder things like tacnodes.

It's not necessarily saying that these places tend to generalize. It's that these singularities have a disproportionate impact on the overall tendency of models learning in that landscape to generalize. So these points can impact nearby (and even distant) points.

I still find the intuition difficult.

I like this example! If your model is lifespan(h,w)=f(w/h), then the w-h space is split into lines of constant lifespan (top-left figure). If you have a loss which compares predicted lifespan to true lifespan, this will be constant on those lines as well. The lower overweight and underweight lifespans will be two valleys that intersect at the origin. The loss landscape could, however, be very different because it's measuring how good your prediction is, so there could be one loss valley, or two, or several.

Suppose you have a different function g(w+h), also with two valleys (top-right). Yes, if you add the two functions, the minima of the result will be at the intersections. But adding isn't actually representative of the kinds of operations we perform in networks. For example, compare taking their min: now the valleys cross and form part of the same level sets. It depends very much on the kind of composition. The symmetries I mention can cooperate very well.

From top-left clockwise: f(w/h); g(w+h); f(w/h)+g(w+h); min(f(w/h),g(w+h)).
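
For anyone who wants to poke at the sum-versus-min comparison numerically, here is a rough sketch. The particular shapes of f and g are hypothetical (simple double wells that each vanish on two lines, treated as loss-like functions whose valleys sit at zero); the thread does not specify them, and the helper name panels is likewise made up for illustration.

```python
import numpy as np

# Hypothetical double-well shapes; the original comment does not define f or g.
def f(ratio):
    # Zero on the lines w/h = 0.5 and w/h = 2: two valleys through the origin.
    return (ratio - 0.5) ** 2 * (ratio - 2.0) ** 2

def g(total):
    # Zero on the lines w + h = 3 and w + h = 6: two parallel valleys.
    return (total - 3.0) ** 2 * (total - 6.0) ** 2

def panels(w, h):
    """Evaluate the four panels at a point (w, h) with h != 0."""
    a = f(w / h)
    b = g(w + h)
    return {"f": a, "g": b, "sum": a + b, "min": np.minimum(a, b)}

# A point where a valley of f crosses a valley of g (w/h = 2 and w + h = 3).
crossing = panels(2.0, 1.0)

# A point on a valley of f only (w/h = 2, but w + h = 1.5).
on_f_only = panels(1.0, 0.5)

print("crossing: ", crossing["sum"], crossing["min"])
print("on f only:", on_f_only["sum"], on_f_only["min"])
```

With these assumed shapes, the sum reaches its minimum only where the valleys intersect, while the min stays at its minimum along the whole of either valley, so the valleys of f and g end up in the same level set and cross each other.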