Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes
Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you're referencing this paper, which trains a shallow attention-only transformer where they get rid of the no...
I would also like to see some sort of symbolic optimization process wrapped around an LLM, acting as an interpretable bridge between the black-box model and the real world, but I doubt Monte-Carlo Tree Search/Expectimax is the right sort of algorithm. Something closer to a GOFAI planner that calls the LLM and parses its outputs, in a way similar to Factored Cognition, might be better and much more computationally efficient.
There is still technically a limit to how far back a Transformer-XL can see since each layer can only attend to previous keys/values computed by that layer. As a result, the receptive field of layer L can only be as wide as the last L context windows. I guess this means that there might be some things that LSTMs can do that Transformer-XL can't, but this can be fixed with a couple of minor modifications to Transformer-XL. For example, this paper fixes the problem by allowing layers to attend to the outputs of later layers from previous co...
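To make the receptive-field argument concrete, here is a rough sketch under a simplified model of my own: every layer lets a position attend to itself and the `span` positions before it in the layer below (current window plus cached memory lumped together). The names `receptive_field`, `span`, and `seq_len` are illustrative and not taken from Transformer-XL's code.

```python
import numpy as np

def receptive_field(num_layers, span, seq_len):
    """Count how many input positions can influence the last position.

    Assumption: each layer attends causally over a fixed window of
    `span` earlier positions plus the current one.
    """
    idx = np.arange(seq_len)
    offset = idx[:, None] - idx[None, :]
    # attend[i, j] is True when position i can read position j in one layer.
    attend = (offset >= 0) & (offset <= span)

    # reach[i, j] is True when input j can influence position i so far.
    reach = np.eye(seq_len, dtype=bool)
    for _ in range(num_layers):
        reach = (attend.astype(int) @ reach.astype(int)) > 0
    return int(reach[-1].sum())

print(receptive_field(num_layers=3, span=4, seq_len=32))
```

In this toy model the count grows as `num_layers * span + 1` until it hits the sequence length, which is the linear-in-depth limit described above.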
One thing that comes to mind is DeepMind's Adaptive Agents team using Transformer-XL, which can attend to data outside the current context window. I think there was speculation that GPT-4 may also be a Transformer-XL, but I'm not sure how to verify that.
I don't think it's fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer and then run it as an RNN at inference time. While the RNN doesn't have a hard context length limit the way the transformer does, I doubt it will perform well on contexts longer than those it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
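A minimal sketch of the "train as a transformer, run as an RNN" idea, using unnormalized causal linear attention as a stand-in. That choice is my assumption; the actual model's layer may differ. The function names are hypothetical. The point is that the recurrent form carries a fixed-size state no matter how long the sequence is.

```python
import numpy as np

def parallel_form(q, k, v):
    """Transformer-style pass: all positions at once with a causal mask."""
    scores = np.tril(q @ k.T)  # no softmax, lower triangle keeps s <= t
    return scores @ v

def recurrent_form(q, k, v):
    """RNN-style pass: one token at a time with a fixed-size state."""
    state = np.zeros((k.shape[1], v.shape[1]))  # size independent of length
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state += np.outer(k_t, v_t)  # fold the new token into the state
        outputs.append(q_t @ state)
    return np.stack(outputs)

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 5))
print(np.allclose(parallel_form(q, k, v), recurrent_form(q, k, v)))
```

Everything the recurrent form remembers has to fit in the `d_k x d_v` matrix `state`, which is why I would expect the effective context to be bounded even though the nominal one is not.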
Yeah, this is starting to make a lot more sense to me. It seems that evaluating the complexity of a utility function using Kolmogorov complexity rather than thinking about how hard it is for the AGI to implement it in terms of its internal concept language is a huge mistake. Magical categories don't seem that magical anymore; simply predicting the next tokens is enough to give you robust abstractions about human values.
How can "I am currently on Earth" be encoded directly into the structure of the brain? I also feel that "101 is a prime number" is more fundamental to me (being about logical structure rather than physical structure) than currently being on Earth, so I'm having a hard time understanding why this is not considered a hinge belief.
I do not think that "101 is a prime number" and "I am currently on Earth" are implemented that differently in my brain; they both seem to be implemented in parameters rather than architecture. I guess they also wouldn't be implemented differently in modern-day LLMs. Maybe the relevant extension to LLMs would be the facts the model would think of when prompted with the empty string vs. some other detailed prompt.
I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights. Garg et al. 2022 examine the performance of 12-layer GPT-style transformers trained to do few-shot learning and show that they can in-context learn 2-layer MLPs. The performance of their model closely matches that of an MLP trained with GD for 5000 steps on those same few-shot examples, and it cannot be explained by heuristics like averaging the K-nearest neighbors from the few-shot examples. Since t...
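For intuition about what "gradient descent in the forward pass" could even mean, here is a toy identity for in-context linear regression. It is a hand-built sketch, not a claim about what trained transformers or the models in Garg et al. actually compute. One GD step from zero weights on squared error gives the same prediction as an unnormalized linear attention readout with the few-shot inputs as keys and the targets as values. The function names and `lr` are illustrative.

```python
import numpy as np

def one_gd_step_prediction(xs, ys, x_query, lr):
    """Predict after one GD step on 0.5 * sum_i (w @ x_i - y_i)^2 from w = 0."""
    w = np.zeros(xs.shape[1])
    grad = xs.T @ (xs @ w - ys)
    w = w - lr * grad
    return w @ x_query

def linear_attention_prediction(xs, ys, x_query, lr):
    """Same prediction written as attention: keys x_i, values y_i, no softmax."""
    scores = xs @ x_query  # key-query dot products
    return lr * (scores @ ys)

rng = np.random.default_rng(0)
xs, ys, x_query = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
print(np.isclose(one_gd_step_prediction(xs, ys, x_query, 0.1),
                 linear_attention_prediction(xs, ys, x_query, 0.1)))
```

This only shows that a single linear attention layer can express one GD step on a linear model. Whether trained models with softmax, MLPs, and many layers do something like this is the empirical question.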
But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements!
I see why this might be true for an LLM trained with a purely predictive loss, but I have a hard time believing that the same will be true for an LLM that is grounded. I imagine that LLMs will eventually be trained to perform some sort of in-context adaptation to a new environment while receiving a reward signal from a human in the loop. Models that learn to maximize the approval of some hum...
I'm a bit confused as to why this would work.
If the circuit in the intermediate layer that estimates the gradient does not influence the output, wouldn't its weights just be free parameters that can be varied with no consequence to the loss? If so, this violates 2a, since perturbing these parameters would not get the model to converge to the desired solution.
Excellent work! Regarding the results on OR-chat, I'm wondering how problematic it actually is for the model to refuse suspicious inputs.
It seems alright to me if the model rejects requests like this, so I'd hesitate to call this a flaw of the method.