[ Question ]

Have we seen any "ReLU instead of sigmoid-type improvements" recently

by KvmanThinking

23rd Nov 2024


I read this, and it said:

there are huge low hanging fruit that any AI or random person designing AI in their garage can find by just grasping in the dark a bit, to get huge improvements at accelerating speeds.

Have we found anything like this? At all? Have we seen any "weird tricks" discovered that make AI way more powerful for no reason?

AI Capabilities, AI Risk, AI Timelines, AI

Marcus Williams

Nov 23, 2024

I'm not sure if these would be classed as "weird tricks" and I definitely think these have reasons for working, but some recent architecture changes which one might not expect to work a priori include:

SwiGLU: Combines a gating mechanism and an activation function with learnable parameters.
Grouped Query Attention: Uses fewer Key and Value heads than Query heads, so several Query heads share each Key/Value head.
RMSNorm: LayerNorm but without the mean-centering (translation) step.
Rotary Position Embeddings: Rotates query and key vectors by a position-dependent angle to give them positional information.
Quantization: Stores weights in fewer bits without much drop in performance.
Flash Attention: More efficient attention computation through better memory management.
Various sparse attention schemes.
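
To make the first and third items a bit more concrete, here is a minimal pure-Python sketch of RMSNorm (next to LayerNorm for comparison) and of the SwiGLU gated layer. It is an illustration under simplifying assumptions, not any particular library's implementation: the learned gain/bias of the norms and the output projection of the feed-forward block are left out, and the helper names (`matvec`, `W_gate`, `W_up`) are made up for this example.

```python
import math

def layer_norm(x, eps=1e-6):
    # LayerNorm: subtract the mean (the "translation"), then divide by the std.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    # RMSNorm: no mean subtraction, only rescale by the root mean square.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def swish(z):
    # Swish / SiLU activation: z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    # Hypothetical helper: multiply a matrix (list of rows) by a vector.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_up):
    # SwiGLU hidden layer: Swish(W_gate x) elementwise-times (W_up x).
    # Both projections are learned, which is where the "learnable
    # parameters" in the gate come from.
    gate = [swish(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return [g * u for g, u in zip(gate, up)]
```

For example, `rms_norm([3.0, 4.0])` keeps both entries positive and only rescales them, while `layer_norm([3.0, 4.0])` re-centers them around zero first.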

KvmanThinking (4mo)

How were these discovered? Slow, deliberate thinking, or someone trying some random thing to see what it does and suddenly the AI is a zillion times smarter?

Marcus Williams (4mo)

"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." -SwiGLU paper.

I think it varies. A few of these came from trying "random" things, but mostly they are educated guesses which are then validated empirically. Often there is a specific problem we want to solve, e.g. exploding gradients or O(n^2) attention, and then authors try things which may or may not solve or mitigate the problem.


ZY (4mo)

As an aside, could you elaborate on why you think "ReLU better than sigmoid" is a "weird trick", if that is implied by this question?

The reason I understood to be commonly agreed upon is that ReLU helps with the vanishing gradient problem (this can be seen from the graphs).
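
A rough numerical version of the same argument, as a sketch only: it assumes every layer sees the same pre-activation `z` and ignores the weight matrices entirely, so it shows only the activation's contribution to the backpropagated gradient. The function names are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid; it peaks at 0.25 when z == 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def chained_grad(grad_fn, z, depth):
    # Product of the activation derivative over `depth` layers
    # (simplifying assumption: same z at every layer, weights ignored).
    return grad_fn(z) ** depth

# Even in the sigmoid's best case (z = 0), ten layers shrink the gradient
# by a factor of 0.25 ** 10, which is below 1e-6.
print(chained_grad(sigmoid_grad, 0.0, 10))
# For an active ReLU unit the factor stays at 1.0.
print(chained_grad(relu_grad, 1.0, 10))
```

The flip side, also visible in `relu_grad`, is that an inactive unit passes no gradient at all.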
