You said "there are too few strictly-orthogonal directions, so we need to cram things in somehow."
I don't think that's true. That's a low-dimensional intuition that does not translate to high dimensions. It may be "strictly" true if you insist the vectors be exactly orthogonal, but such perfect orthogonality is unnecessary. See, e.g., papers that discuss "the linearity hypothesis" in deep learning.
As a previous poster pointed out (and as Richard Hamming pointed out long ago), almost any pair of random vectors in a high-dimensional space is almost orthogonal. And almost orthogonal is good enough.
(When we say "random vectors in high-dimensional space," we mean they can be drawn from essentially any distribution roughly centered at the origin: uniformly in a hyperball, uniformly on the surface of a hypersphere, uniformly in a hypercube, random vertices of a hypercube, a multivariate Gaussian, or even a convex hyper-potato...)
You can check this numerically, and prove it analytically for many well-behaved distributions.
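Here is a quick numerical sanity check, just a sketch assuming numpy is available (exact numbers will vary with the random seed): draw a couple hundred random Gaussian vectors, normalize them, and look at the pairwise cosine similarities.

    # Cosine similarity between random Gaussian vectors concentrates
    # near zero as the dimension grows (assumes numpy).
    import numpy as np

    rng = np.random.default_rng(0)

    for d in (10, 100, 1000, 10000):
        # 200 random vectors drawn from a standard multivariate Gaussian
        X = rng.standard_normal((200, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize to unit length
        cos = X @ X.T                                    # all pairwise cosines
        off = cos[~np.eye(len(cos), dtype=bool)]         # drop the diagonal (self-similarity)
        print(f"d={d:6d}  mean |cos|={np.abs(off).mean():.4f}  max |cos|={np.abs(off).max():.4f}")

The mean |cos| shrinks roughly like 1/sqrt(d), and even the largest pairwise |cos| keeps falling as d grows: that's the "almost any pair is almost orthogonal" effect in action.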
One useful picture is the hypercube centered at the origin whose vertex coordinates are all ±1. A random vertex of that hypercube is a long random vector that looks like {±1, ±1, ..., ±1}, where each coordinate is +1 or -1 with probability 50% each.
What is the expected value of the dot product of a pair of such random vertex vectors? It is exactly zero, and since the dot product is a sum of n independent ±1 terms, its typical magnitude is only about sqrt(n), while the product of the two vectors' norms is n. So the cosine between two random vertices is on the order of 1/sqrt(n): almost always close to zero.
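To make that concrete, here is a small sketch (again just illustrative, assuming numpy) that samples random ±1 vertex vectors and looks at their dot products:

    # Dot products of random +1/-1 hypercube vertices (assumes numpy).
    import numpy as np

    rng = np.random.default_rng(1)
    n = 4096                                     # dimension
    pairs = 2000

    # Each row is a random vertex: coordinates are +1 or -1 with probability 1/2.
    u = rng.choice([-1, 1], size=(pairs, n))
    v = rng.choice([-1, 1], size=(pairs, n))

    dots = np.einsum("ij,ij->i", u, v)           # per-pair dot products
    cosines = dots / n                           # each vector has norm sqrt(n), so |u||v| = n

    print("mean dot product:  ", dots.mean())                # close to 0
    print("std of dot product:", dots.std())                 # close to sqrt(n) = 64
    print("typical |cosine|:  ", np.abs(cosines).mean())     # on the order of 1/sqrt(n) = 1/64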
There are exponentially many (in the dimension) almost-orthogonal directions in high dimensions. The hypercube vertices are just an easy case to work out analytically; the same phenomenon occurs for many distributions, particularly hyperballs, hyperspheres, and Gaussians.
The hypercube example above, BTW, corresponds to one-bit quantization of the embedding-space dimensions, which often works surprisingly well (see also "locality-sensitive hashing").
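A rough illustration of that quantization idea (not any particular system's implementation, just a sketch assuming numpy and roughly Gaussian coordinates): keep only the sign bit of each coordinate and see how well the angle between two vectors survives. For jointly Gaussian coordinates the fraction of agreeing sign bits comes out to about 1 - theta/pi, which is the same relation that random-hyperplane LSH (SimHash) relies on.

    # One-bit quantization sketch (assumes numpy): keep only the sign of
    # each coordinate and check how well the angle between two vectors
    # can be recovered from the bit agreement.
    import numpy as np

    rng = np.random.default_rng(2)
    d = 8192
    theta = np.deg2rad(60)                       # target angle between u and v

    u = rng.standard_normal(d)
    w = rng.standard_normal(d)
    v = np.cos(theta) * u + np.sin(theta) * w    # v makes roughly a 60-degree angle with u

    true_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    bu, bv = np.sign(u), np.sign(v)              # one bit per dimension
    agreement = (bu == bv).mean()                # fraction of matching sign bits

    print("true cosine:     ", true_cos)                                # ~cos(60 deg) = 0.5
    print("sign agreement:  ", agreement)                               # ~1 - theta/pi = 2/3
    print("angle from bits: ", np.degrees((1 - agreement) * np.pi))     # ~60 degrees

So even after throwing away everything except one bit per dimension, you can still read the angle back off the bit agreement, which is why sign-quantized embeddings and LSH-style retrieval hold up as well as they do.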
This point that Hamming made (and he was probably not the first) lies close to the heart of all embedding-space-based learning systems.
You said "there are too few strictly-orthogonal directions, so we need to cram things in somehow."
I don't think that's true. That is a low-dimensional intuition that does not translate to high dimensions. It may be "strictly" true if you want the vectors to be exactly orthogonal, but such perfect orthogonality is unnecessary. See e.g. papers that discuss "the linearity hypothesis' in deep learning.
As a previous poster pointed out (and as Richard Hamming pointed out long ago) "almost any pair of random vectors in high-dimensional space are almost-orthogonal." And almost orthogonal is good enough.
(when we say "random vectors in high dimensional space" we mean they can be drawn from any distribution roughly centered at the origin: Uniformly in a hyperball, or uniformly from the surface of a hypersphere, or uniformly in a hypercube, or random vertices from a hypercube, or drawn from a multivariate gaussian, or from a convex hyper-potato...)
You can check this numerically, and prove it analytically for many well-behaved distributions.
One useful thought is to consider the hypercube centered at the origin where all vertices coordinates are ±1. In that case a random hypercube vertex is a long random vector that look like {±1, ±1,... ±1} where each coordinate has a 50% probability of being +1 or -1 respectively.
What is the expected value of the dot product of a pair of such random (vertex) vectors? Their dot product is almost always close to zero.
There are an exponential number of almost-orthogonal directions in high dimensions. The hypercube vertices are just an easy example to work out analytically, but the same phenomenon occurs for many distributions. Particularly hyperballs, hyperspheres, and gaussians.
The hypercube example above, BTW, corresponds to one-bit quantization of the embedding vector space dimensions. It often works surprisingly well. (see also "locality sensitive hashing").
This point that Hamming made (and he was probably not the first) lies close to the heart of all embedding-space-based learning systems.