Is the inability to approximate periodic functions of a single variable important?
Periodic functions are already used as an important encoding in SOTA ANNs, from transformer LLMs to NeRFs in graphics. From the instant-ngp paper:
For neural networks, input encodings have proven useful in the attention components of recurrent architectures [Gehring et al. 2017] and, subsequently, transformers [Vaswani et al. 2017], where they help the neural network to identify the location it is currently processing. Vaswani et al. [2017] encode scalar positions x ∈ R as a multiresolution sequence of L ∈ N sine and cosine functions enc(x) = (sin(2^0 x), sin(2^1 x), …, sin(2^(L−1) x), cos(2^0 x), cos(2^1 x), …, cos(2^(L−1) x)). This has been adopted in computer graphics to encode the spatiodirectionally varying light field and volume density in the NeRF algorithm [Mildenhall et al. 2020].
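A minimal NumPy sketch of that multiresolution sine/cosine encoding (the function name and the default number of frequencies are my own choices for illustration, not from the paper):

```python
import numpy as np

def sinusoidal_encoding(x, num_frequencies=8):
    """Encode scalar position(s) x as the features sin(2^i x) and cos(2^i x)
    for i = 0, ..., num_frequencies - 1, as in the quoted formula."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    freqs = 2.0 ** np.arange(num_frequencies)        # 2^0, 2^1, ..., 2^(L-1)
    scaled = np.outer(x, freqs)                      # shape (len(x), L)
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1)

# Example: a single position becomes a vector of 2 * num_frequencies features.
print(sinusoidal_encoding(3.0, num_frequencies=4).shape)  # (1, 8)
```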
A neural net of any size that uses rectified linear unit (ReLU) activation functions is unable to approximate the function sin(x) outside a compact interval.
I am reasonably confident that I can prove that any NN with ReLU activations computes exactly a piecewise linear function with finitely many linear pieces, and is therefore affine outside a large enough compact interval. I believe the number of linear pieces that can be achieved is bounded by at most 2^(L*D), where L is the number of nodes per layer and D is the number of layers.
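To spell out the last step of that argument (my own sketch, not a citation): a piecewise linear function with finitely many pieces is affine on some ray [M, ∞), and an affine function cannot stay uniformly close to sin(x) there.

```latex
% f piecewise linear with finitely many pieces  =>  f(x) = a x + b for all x >= M.
\begin{align*}
a \neq 0 &\;\implies\; \sup_{x \ge M} |f(x) - \sin(x)| = \infty
  && \text{(the affine tail is unbounded while } |\sin x| \le 1\text{)} \\
a = 0 &\;\implies\; \sup_{x \ge M} |b - \sin(x)| = \max(|b - 1|, |b + 1|) \ge 1
  && \text{(}\sin\text{ attains both } \pm 1 \text{ on } [M, \infty)\text{)}
\end{align*}
% Either way the uniform error on [M, \infty) is at least 1, so no ReLU network
% approximates sin(x) outside a compact interval.
```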
This leads me to two questions:
Regarding (2a), empirically I found that when approximating sin(x) with small NNs in scikit-learn, increasing the width of the network caused learning to fail catastrophically (starting at approximately L=10 with D=4, at L=30 with D=8, and at L=50 with D=50).
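For concreteness, here is a minimal sketch of the kind of experiment described above; the data range, sample size, solver settings, and the specific widths swept are my assumptions, not the original setup.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Train on sin(x) sampled from a compact interval (assumed range).
X = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2000, 1))
y = np.sin(X).ravel()

D = 4                        # number of hidden layers
for L in (5, 10, 20, 40):    # nodes per layer (width)
    model = MLPRegressor(
        hidden_layer_sizes=(L,) * D,
        activation="relu",
        max_iter=5000,
        random_state=0,
    )
    model.fit(X, y)
    # In-sample fit quality; a sudden collapse as L grows is the failure mode.
    print(f"L={L:2d}, D={D}: training R^2 = {model.score(X, y):.3f}")

# Far outside the training interval the ReLU net's output is affine,
# so it cannot follow sin(x) regardless of how well it fit in-sample.
X_far = np.linspace(20 * np.pi, 22 * np.pi, 5).reshape(-1, 1)
print(model.predict(X_far))
```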
Regarding (1), naively this seems relevant to questions of out-of-distribution performance and especially the problem of what it means for an input to be out-of-distribution in large input spaces.