Max Ma

Firstly, the principle of 'no computation without representation' holds. How strong a representation is depends on the specific computational task and on the neural network architecture, such as a Transformer. For example, a Transformer applied to a simple, low-dimensional linear problem can form a strong representation; for a high-order nonlinear, high-dimensional problem, the representation may be considerably weaker.
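
To make the contrast concrete, here is a minimal sketch. It uses a fixed random-feature layer as a crude stand-in for a learned representation (not an actual Transformer), and the dimensions, widths, and toy targets are all my own illustrative assumptions: a linear readout on the representation recovers a low-dimensional linear target almost perfectly, but recovers a high-order nonlinear, high-dimensional target much less well.

```python
# Toy illustration (assumptions throughout): a fixed random tanh layer stands in
# for a learned representation, and we measure how well a linear readout on that
# representation fits (a) a simple low-dimensional linear target versus
# (b) a high-order nonlinear, high-dimensional target.
import numpy as np

rng = np.random.default_rng(0)

def readout_r2(X, y, width=256):
    """Project X through a fixed random tanh layer, fit a linear readout by
    least squares, and return the in-sample R^2 of that readout."""
    W = rng.normal(size=(X.shape[1], width)) / np.sqrt(X.shape[1])
    H = np.tanh(X @ W)                              # the "representation"
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = y - H @ coef
    return 1.0 - resid.var() / y.var()

n = 2000

# (a) low-dimensional linear task: the representation supports a near-perfect readout
X_lin = rng.normal(size=(n, 3))
y_lin = X_lin @ np.array([1.0, -2.0, 0.5])
print("linear, low-dim R^2:", round(readout_r2(X_lin, y_lin), 3))

# (b) high-order nonlinear, high-dimensional task: the same-width representation is weaker
X_nl = rng.normal(size=(n, 50))
y_nl = np.sin(X_nl[:, 0] * X_nl[:, 1]) * X_nl[:, 2] ** 3 + np.prod(X_nl[:, 3:6], axis=1)
print("nonlinear, high-dim R^2:", round(readout_r2(X_nl, y_nl), 3))
```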

A neural network operates as a power-efficient system: each node requires minimal computational power, and all foundation-model pre-training is self-supervised. The network's self-progressing boundary condition places no restriction on where incoming data is processed; data is routed to whichever nodes are capable of handling it. As a result, the same token is processed in different nodes, and it is highly likely that many replicas of identical or near-identical feature bits (units of a feature) are dispersed throughout the network.

Inequality, in the mathematical sense, suggests that the connections between nodes (the pathways) are not equal. Our working theory is that feature bits propagate through the network, with the distance each bit travels determined by the computational capacity of the nodes it passes through. The pathway appears to be power-driven, prioritizing certain features or patterns during learning in a discriminatory manner. While this discriminative feature pathway (DFP) is mathematically plausible, the underlying theory remains unclear. It seems that neural networks are leading us into the realm of bifurcation theory.
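
As a purely illustrative toy model of this working theory (my own sketch, not the DFP formalism itself; the graph, the node capacities, and the propagation rule are all assumptions), the simulation below routes a feature bit through nodes of unequal capacity and shows how replicas of the same feature end up dispersed across many nodes, with some pathway nodes used far more than others.

```python
# Toy sketch (assumptions throughout): feature "bits" enter a network of nodes
# with unequal capacity and keep propagating to a random neighbor as long as the
# receiving node can afford the feature's cost. Unequal capacity yields unequal
# pathways and many dispersed replicas of the same feature.
import random
from collections import Counter

random.seed(0)

N_NODES = 50
NEIGHBORS = {i: random.sample([j for j in range(N_NODES) if j != i], 4)
             for i in range(N_NODES)}                      # sparse random wiring
CAPACITY = {i: random.uniform(0.0, 1.0) for i in range(N_NODES)}  # unequal "power"

def propagate(feature_cost, start):
    """Walk a feature bit through the graph; a copy is stored in every node it
    visits, and it stops at a node that cannot afford its cost or that it has
    already visited."""
    holders, node = set(), start
    while CAPACITY[node] >= feature_cost and node not in holders:
        holders.add(node)
        node = random.choice(NEIGHBORS[node])
    return holders

# Inject the same feature at many entry points, as if the same token arrived in
# different contexts, and count how many replicas end up dispersed.
replica_count = Counter()
for _ in range(200):
    for node in propagate(feature_cost=0.3, start=random.randrange(N_NODES)):
        replica_count[node] += 1

print("nodes holding a replica of the feature:", len(replica_count))
print("most-used pathway nodes:", replica_count.most_common(5))
```

Even in this crude setting, the same feature ends up replicated across most of the network while a handful of high-capacity nodes dominate the pathways, which is the qualitative picture the working theory describes.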