Martin Vlach

If you get an email from aisafetyresearch@gmail.com, that is most likely me. I also read it weekly, so you can pass a message into my mind that way.
Other ~personal contacts: https://linktr.ee/uhuge 


Yeah, I encountered the concept during my studies and was rather fishing for a great popular, easy-to-grasp explanation that would also fit the definition.

TBH it's not easy to find a fitting visual analogy, which I'd find generally useful, as I consider the concept one that enhances general thinking.

> No matter how I stretch or compress the digit 0, I can never achieve the two loops that are present in the digit 8.

A 0 deformed by pressure from left and right so that its sides meet seems to contradict that?

Compared to Gemma 1, classic BigTech 😅


And I seem to be missing info on the effective context length..?

"AI development risks are existential (/crucial/critical)." Does this statement qualify for "extraordinary claims require extraordinary evidence"?

The counterargument stands on the sampling of analogous (breakthrough) inventions; some people call those *priors* here. Which inventions we allow in would strongly decide whether the initial claim is extraordinary or just plain and reasonable, fitting well among the dangerously powerful inventions.

My set of analogies: nuclear energy extraction; fire; shooting; speech/writing.

Other set: nuclear power, bio-engineering/weapons, as those are the only two that significantly endanger the whole civilised biome.

Set of *all* inventions: renders the claim extraordinary/weird/out of scope.

Does it really work on RULER (the benchmark from Nvidia)?
Not sure where, but I saw some controversies; https://arxiv.org/html/2410.18745v1#S1 is the best I could find now...

Edit: Ah, this was what I had in mind: https://www.reddit.com/r/LocalLLaMA/comments/1io3hn2/nolima_longcontext_evaluation_beyond_literal/

I'd vote to remove AI capabilities here, although I've not read the article yet and have just roughly grasped the topic.

It's likely not about expanding the currently existing capabilities or something like that.

Oh, I did not know, thanks.
https://huggingface.co/spaces/deepseek-ai/Janus-Pro-7B seems to show DS is still rather clueless in the visual domain; at least IMO they are losing there to Qwen and many others.

draft:
Can we theoretically quantify the representational capacity of a Transformer (or other neural network architecture) in terms of the "number of functions" it can ingest and embody?

  • We're interested in the space of functions a Transformer can represent.
  • Finite Input/Output Spaces: In practice, LLMs operate on finite-length sequences of tokens from a finite vocabulary. So, we're dealing with functions that map from a finite (though astronomically large) input space to a finite output space.

Counting Functions (Upper Bound)

  • The Astronomical Number: Let's say our input space has size I and our output space has size O. The total number of possible functions from the input space to the output space is O^I. This is an absolute upper bound on the number of functions any model could possibly represent.
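
A back-of-the-envelope sketch of that bound in Python; the vocabulary and context sizes below are made-up round numbers, not any particular model's:

```python
import math

# Scale of the O^I upper bound. vocab and ctx are illustrative round
# numbers, not taken from any real model.
vocab = 50_000   # per-step output space: one next token, so |O| = vocab
ctx = 1_000      # input length in tokens, so |I| = vocab ** ctx

log2_inputs = ctx * math.log2(vocab)   # bits needed just to index one input
# log2(#functions) = |I| * log2(|O|) already overflows a float,
# so take log2 once more to get a printable number:
log2_log2_functions = log2_inputs + math.log2(math.log2(vocab))

print(f"log2 |I|             ~ {log2_inputs:,.0f}")           # ~15,610
print(f"log2 log2 #functions ~ {log2_log2_functions:,.0f}")   # ~15,614
```

So the count is a tower of exponents, roughly 2 to the power 2^15,614, which is the yardstick the parameter-based bounds below get measured against.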

The Role of Parameters and Architecture (Constraining the Space)

  • Not All Functions are Reachable: The crucial point is that a Transformer with a finite number of parameters cannot represent all of those O^I functions. The architecture (number of layers, attention heads, hidden units, etc.) and the parameter values define a specific function within that vast space.
  • Parameter Count as a Proxy: The number of parameters in a Transformer provides a rough measure of its representational capacity. More parameters generally allow the model to represent more complex functions, though the relationship is not linear: there's significant redundancy, and the effective number of degrees of freedom is likely much lower than the raw parameter count due to correlations and dependencies between parameters (a rough count and the resulting bound are sketched after this list).
  • Architectural Constraints: The Transformer architecture itself imposes constraints. For example, the self-attention mechanism biases the model towards capturing relationships between tokens within a certain context window. This limits the types of functions it easily represents.
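
A minimal sketch of both bullets, assuming a standard decoder-only shape; the count formula is the common rough approximation (biases and layernorms ignored, 4x MLP width, tied embeddings), not an exact spec:

```python
def approx_params(n_layers: int, d_model: int, vocab: int) -> int:
    """Rough decoder-only Transformer parameter count."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)   # MLP up- and down-projections
    return n_layers * (attn + mlp) + vocab * d_model  # + tied embedding

# GPT-2-small-like shape as a sanity check (the real model is ~124M).
p = approx_params(n_layers=12, d_model=768, vocab=50_257)
print(f"params           ~ {p:,}")

# With b-bit weights there are at most 2**(p*b) distinct parameter
# settings, so log2(#reachable functions) <= p*b -- a microscopic
# sliver of the O^I space counted above.
print(f"log2 #reachable <= {p * 16:,}")   # assuming 16-bit weights
```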

VC Dimension and Rademacher Complexity (Existing Tools/Findings)

  • VC Dimension (for Classification): In the context of classification problems, the Vapnik-Chervonenkis (VC) dimension is a measure of a model's capacity. It's the size of the largest set of points that the model can "shatter" (classify in all possible ways). While theoretically important, calculating the VC dimension for large neural networks is extremely difficult. It gives a sense of the complexity of the decision boundaries the model can create.
  • Rademacher Complexity: This is a more general measure of the complexity of a function class, applicable to both classification and regression. It measures how well the model class can fit random noise. Lower Rademacher complexity generally indicates better generalization ability (the model is less likely to overfit). Again, calculating this for large Transformers is computationally challenging (a toy estimator is sketched after this list).
  • These measures are about function classes, not individual functions: VC dimension and Rademacher complexity characterize the entire space of functions that a model architecture could represent, given different parameter settings. They don't tell you exactly which functions are represented, but they give you a sense of the "richness" of that space.
    This seems to be the measure: let's pick a set of practical functions and see how many of those the LM can hold (has fairly approximated) at a given number of parameters (& architecture & precision).
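
As a concrete version of the Rademacher bullet, here is a toy Monte-Carlo estimator; the sup over the function class is approximated by gradient ascent, the tiny MLP and all hyperparameters are placeholder choices, and scaling this procedure up is exactly what becomes prohibitive for large Transformers:

```python
import torch
import torch.nn as nn

def empirical_rademacher(make_model, X, n_draws=10, steps=500, lr=1e-2):
    """Estimate E_sigma[ sup_f (1/n) sum_i sigma_i * f(x_i) ], approximating
    the sup by training a fresh model to correlate with random +/-1 labels."""
    n = X.shape[0]
    estimates = []
    for _ in range(n_draws):
        sigma = torch.randint(0, 2, (n,)).float() * 2 - 1   # random signs
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            corr = (sigma * model(X).squeeze(-1)).mean()    # fit the noise
            opt.zero_grad()
            (-corr).backward()                              # ascend corr
            opt.step()
        estimates.append(corr.item())
    return sum(estimates) / len(estimates)

X = torch.randn(256, 16)  # toy sample
make_mlp = lambda: nn.Sequential(nn.Linear(16, 64), nn.Tanh(),
                                 nn.Linear(64, 1), nn.Tanh())  # outputs in [-1, 1]
print(empirical_rademacher(make_mlp, X))  # higher = richer function class
```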
Kolmogorov Complexity and Compression
  • The Transformer as a "Compressed Program": We can think of the trained Transformer as a highly compressed representation of a complex function. It's not the shortest possible program (in the Kolmogorov sense), but it's a practical approximation.
  • Limits of Compression: The theory of Kolmogorov complexity suggests that there are functions that are inherently incompressible. There's no short program to describe them; you essentially have to "list out" their behavior. This implies that there might be functions that are fundamentally beyond the reach of any reasonably sized Transformer.
  • Relating Parameters to Program Length? There's no direct, proven relationship between the number of parameters in a Transformer and the Kolmogorov complexity of the functions it can represent. We can hypothesize:
    • More parameters allow for (potentially) representing functions with higher Kolmogorov complexity. But it's not a guarantee.
    • There's likely a point of diminishing returns. Adding more parameters won't indefinitely increase the complexity of the representable functions, due to the architectural constraints and the inherent incompressibility of some functions.
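
A crude, runnable illustration of the incompressibility point, using zlib as a very loose stand-in for Kolmogorov complexity (which is itself uncomputable):

```python
import os
import zlib

# Truth table of a "random" function: no structure, so a general-purpose
# compressor barely shrinks it. A structured table compresses massively.
random_table = os.urandom(2 ** 16)                   # random map index -> byte
structured = bytes(i % 256 for i in range(2 ** 16))  # f(i) = i mod 256

print(len(zlib.compress(random_table)) / len(random_table))  # ~1.0
print(len(zlib.compress(structured)) / len(structured))      # << 1.0
```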

Practical Implications and Open Questions

  • Empirical Scaling Laws: Research on scaling laws (à la the Chinchilla paper) provides empirical evidence about the relationship between model size, data, and performance. These laws help guide the design of larger models, but they don't provide a fundamental theoretical limit (the fitted form is shown after this list).
  • Understanding the "Effective" Capacity: A major open research question is how to better characterize the effective representational capacity of Transformers, taking into account both the parameters and the architectural constraints. This might involve developing new theoretical tools or refined versions of VC dimension and Rademacher complexity.
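
For concreteness, the parametric form fitted in the Chinchilla paper (Hoffmann et al., 2022), with N the parameter count, D the number of training tokens, and the constants roughly as reported there:

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad E \approx 1.69,\ A \approx 406.4,\ B \approx 410.7,\ \alpha \approx 0.34,\ \beta \approx 0.28$$

The irreducible term E is an empirical loss floor, not a statement about representational capacity, which is exactly why these laws guide model sizing without settling the theoretical question.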


It would be fun to have a practical study where we'd fine-tune functions into various-sized models and see if/where a limit gets hit.
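
A minimal sketch of that study, with random input→label pairs standing in for the fine-tuned functions; all sizes are arbitrary, and a real version would fine-tune actual LMs on a suite of practical tasks:

```python
import torch
import torch.nn as nn

def capacity_probe(hidden=64, dim=32, classes=256,
                   ks=(256, 1024, 4096, 16384), steps=3000, lr=3e-3):
    """Cram k random input->label pairs into a fixed-size net and record
    recall; the limit shows up where recall starts to fall below ~1.0."""
    results = {}
    for k in ks:
        torch.manual_seed(0)
        x = torch.randn(k, dim)              # k random, distinct inputs
        y = torch.randint(0, classes, (k,))  # arbitrary labels to memorize
        model = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                              nn.Linear(hidden, classes))
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
        results[k] = (model(x).argmax(-1) == y).float().mean().item()
    return results

# Sweep `hidden` to see how the breaking point moves with model size.
print(capacity_probe())
```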
