Comments

p.b.3d10

I tried some chess, but it's still pretty bad. Not noticeably better than GPT-4.

p.b.3d10

In my case, introspection led me to the realisation that human reasoning consists to a large degree of two interlocking parts: finding constraints on the solution space, and constraint satisfaction.

Which has the interesting corollary that AI systems that reach human or superhuman performance by adding search to NNs are not really implementing reasoning but rather brute-forcing it. 

It also makes me sceptical that LLMs+search will be AGI. 
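
To make the distinction concrete, here's a toy sketch (made-up puzzle, nothing deep): brute-force search enumerates the whole solution space, while "reasoning" in the above sense first uses the constraints to collapse most of it before checking anything.

```python
from itertools import product

# Toy problem: find digits (a, b, c) with a + b + c == 12, a < b < c, and c == a + 2.

# Brute-force search: enumerate the whole space and test every candidate.
def brute_force():
    return [(a, b, c)
            for a, b, c in product(range(10), repeat=3)
            if a + b + c == 12 and a < b < c and c == a + 2]

# Constraint-guided: use the constraints to pin down b and c from a,
# so only a handful of candidates ever need to be checked.
def constraint_guided():
    solutions = []
    for a in range(10):
        c = a + 2                 # constraint: c == a + 2
        b = 12 - a - c            # constraint: a + b + c == 12
        if 0 <= b <= 9 and 0 <= c <= 9 and a < b < c:
            solutions.append((a, b, c))
    return solutions

assert brute_force() == constraint_guided()
print(constraint_guided())   # [(3, 4, 5)]
```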

p.b.8d20

In psychometrics this is called "backward digit span".

p.b.10d61

Diminishing returns in loss are not diminishing returns in capabilities. And benchmarks tend to saturate, so diminishing returns are baked in if you look at those. 

I am not saying that there aren't diminishing returns to scale, but I just haven't seen anything definitive yet.
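
A toy illustration of the benchmark point (made-up numbers; the logistic mapping from loss to benchmark score is purely hypothetical, just to show the shape of the effect):

```python
import math

def benchmark_score(loss, midpoint=2.0, steepness=4.0):
    # Hypothetical saturating (logistic) mapping from loss to benchmark accuracy.
    return 1.0 / (1.0 + math.exp(steepness * (loss - midpoint)))

# Suppose each step of extra scale buys the same absolute improvement in loss.
losses = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0]
for loss in losses:
    print(f"loss {loss:.1f} -> benchmark {benchmark_score(loss):.3f}")
# The benchmark gains per step shrink towards zero as the score approaches 1.0,
# even though the loss keeps improving at a constant rate.
```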

p.b.10d10

Frankly, I don't really understand what you are saying here and I am open to the possibility that I don't really understand how the gradient works in autoregressive transformers. 

But as I said in my other comment, my current understanding is: 

In standard attention (for example in an encoder) tokens are not ordered, so it is clear that the gradient of the loss of one of the token predictions (for example a masked token in BERT) flows through all other tokens equally. In autoregressive transformers an order is imposed by masking, but all later tokens attend to all earlier tokens in the same way. 

The gradient of the loss of a later token flows through all earlier tokens in the same way. It doesn't matter whether a token is half the context back or the whole context back, neither for the information flow nor for the gradient flow.

To put it another way: in the n-th layer, the last token attends to all the output tokens from the (n-1)-th layer. It doesn't somehow have to make do with the output of earlier layers for tokens that are further back.
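
A minimal PyTorch sketch of what I mean (two stacked toy single-head attention layers, arbitrary sizes, not any particular model): the last token in the second layer attends to the first layer's outputs at every position, and the gradient of a loss on the last position reaches every earlier input, no matter how far back it is.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 8, 16

class CausalSelfAttention(torch.nn.Module):
    """Minimal single-head causal self-attention (toy sketch, not a full block)."""
    def __init__(self, d_model):
        super().__init__()
        self.q = torch.nn.Linear(d_model, d_model)
        self.k = torch.nn.Linear(d_model, d_model)
        self.v = torch.nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (seq_len, d_model)
        T, d = x.shape
        scores = self.q(x) @ self.k(x).T / d ** 0.5
        # Causal mask: position i attends to every position j <= i,
        # no matter how far back j is.
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(x)

layer1 = CausalSelfAttention(d_model)   # "layer n-1"
layer2 = CausalSelfAttention(d_model)   # "layer n"

x = torch.randn(seq_len, d_model, requires_grad=True)
h1 = layer1(x)      # layer n-1 outputs exist for *all* positions
h2 = layer2(h1)     # in layer n, the last token attends to all of h1, not to older layers

# A loss that depends only on the last position's output.
loss = h2[-1].sum()
loss.backward()

# The gradient reaches every earlier input position; distance is irrelevant.
print(x.grad.abs().sum(dim=-1))   # nonzero at every position
```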

p.b.11d10

Yeah, the first 99 tokens would be optimized both to be locally the correct character, and also to set things up so that the 100th character is also correct.

That is how LLMs currently work. The gradient of each token prediction does flow back into all the earlier tokens whose information was integrated into the predicted token. So each token optimizes its own next token prediction but also tries to integrate the information that is most useful for future tokens. 
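
A toy sketch of that, using a stock PyTorch encoder layer with a causal mask (sizes, data and targets are arbitrary; this is just to separate the two gradient contributions on an early token):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, d_model, seq_len = 50, 32, 10

emb = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=64,
                                   dropout=0.0, batch_first=True)
head = nn.Linear(d_model, vocab)

tokens = torch.randint(vocab, (1, seq_len))
x = emb(tokens)
causal = torch.full((seq_len, seq_len), float("-inf")).triu(1)  # standard causal mask
logits = head(block(x, src_mask=causal))

# Per-position next-token losses (targets are just the shifted inputs).
losses = nn.functional.cross_entropy(logits[0, :-1], tokens[0, 1:], reduction="none")

# Gradient on the *first* token's embedding coming from its own next-token loss,
# versus from all later positions' losses.
own   = torch.autograd.grad(losses[0], x, retain_graph=True)[0][0, 0]
later = torch.autograd.grad(losses[1:].sum(), x, retain_graph=True)[0][0, 0]
print(own.norm(), later.norm())   # both nonzero: the first token is also trained to serve later predictions
```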

p.b.11d10

I don't know how people are creating huge context windows these days, but IIRC the way it works is that the longer you look back into your context (and correspondingly the further you are trying to plan ahead) the less of your computation is available. Like, if you have N layers, then for a token M steps back, you only have access to the computation up until layer N-M.

Everything in the context window is equally available. It doesn't make a difference whether an earlier token is 5 tokens back or 5,000. The attention mechanism is an operation over a set of tokens; there is no intrinsic order.
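
A tiny sketch of that point (random weights, no positional encoding; real models inject order via position embeddings or RoPE, but the attention operation itself is order-free): shuffling the earlier context tokens leaves the last token's attention output unchanged.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d = 6, 8
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def last_token_attention(x):
    # Plain scaled dot-product attention for the last (query) position over the
    # whole context; without positional encoding, only content matters.
    q = x[-1] @ Wq                       # query from the last token
    k, v = x @ Wk, x @ Wv                # keys/values from every context token
    w = F.softmax(q @ k.T / d ** 0.5, dim=-1)
    return w @ v

x = torch.randn(seq_len, d)

# Shuffle every token except the last one: "5 tokens back" vs "1 token back"
# makes no difference to the attention output.
perm = torch.tensor([4, 2, 0, 3, 1, 5])
print(torch.allclose(last_token_attention(x), last_token_attention(x[perm]), atol=1e-6))  # True
```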

p.b.12d60
  • Scaling curves show strongly diminishing returns to $ spend: A $100m model might not be that far behind a $1b model, performance wise. 

What's your argument for that? 

p.b.14d10

Hah, I didn't see your answer, but our links complement each other nicely.

I think my first link was the paper that was making some waves when it came out.
