Imagine there were a bijection between model parameters and the resulting function. (I'm aware this is not at all true.) In that case, it seems like you are enforcing the constraint that the two heads have identical parameters.
I've always understood the idea behind this objective function to be quite similar to contrastive learning: you have two networks (or equivalently, two sets of parameters), and the goal is to maximize agreement between pairs of inputs to each network that share the same ground-truth class/label (and, conversely, to maximize disagreement for pairs that differ). With that in mind, various papers (e.g.) explore the possibility of "collapsed" solutions like the one you mentioned (where both networks learn the same mapping, so there's little benefit to propagating examples through two networks), which is something we want to avoid. In practice, though, collapse has been found to occur rarely (cf. [1]).
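To make the "maximize agreement / maximize disagreement" framing concrete, here is a toy numpy sketch of a supervised-contrastive-style loss. The names and setup are mine for illustration, not taken from any of the papers referenced:

```python
import numpy as np

def contrastive_loss(z1, z2, labels, temperature=0.5):
    """Toy supervised-contrastive objective over two heads' embeddings.

    Pairs (z1[i], z2[j]) with the same label are pulled together
    (agreement maximized); different-label pairs are pushed apart
    via the softmax denominator.
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    pos = (labels[:, None] == labels[None, :]).astype(float)  # same-label mask
    # negative log of the average probability mass on same-label pairs
    return float(-np.log((probs * pos).sum(axis=1) / pos.sum(axis=1)).mean())

labels = np.array([0, 0, 1, 1])
emb = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.9]])
aligned = contrastive_loss(emb, emb, labels)          # two heads agree
scrambled = contrastive_loss(emb, emb[::-1], labels)  # agreement broken
print(aligned, scrambled)
```

When the two heads assign same-label inputs nearby embeddings, the loss is low; when their outputs are misaligned with the labels, it rises. The "collapsed" case discussed above is the degenerate solution where the two heads agree trivially because they compute the same mapping.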
Nonetheless, since reading Paul's statement about the problem of the instrumental model, I've been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, or in data-limited scenarios). If I were committed to this stance, I would somewhat share the desire to explore alternatives, and I have thought about whether some sort of reconstruction loss could be introduced: here the goal might instead be to "maximize agreement" with a set of non-trivial observations/facts that are guaranteed to be (somehow) more "objective" than the original training data (one inspiration being that reconstruction losses in deep learning vision papers like this one often turn out to be good regularizers). So far, though, no promising proposal has come to light for generative LMs.
I am still holding onto the thought, given the remote possibility that all of my assumptions above are correct, and also because "generative models" might reflect the ideal approach to unsupervised learning, whereas "contrastive learning" is sometimes seen as a compromise that (unlike generative models) is amenable to limited compute [2].
...assume that the likelihood of a given simulation being run is inversely correlated with the computational complexity of the simulation, in the space of all the simulations ever run. We can call the latter the Simplicity Assumption (SA)...
Isn't it possible that "simplicity" (under one or more definitions thereof) need not track the amount of raw computation required [0] to run any patch of simulation, nor the volume of space it simulates? E.g., Occam's Razor's measure of "simplicity" (for AI) is some function of the description length of a program running on a (universal) computer, so as to predict its own future percepts [1].
Now consider a civilization simulation A that simulates our solar system in detail and mocks the rest of the universe, and a simulation B that simulates the whole Milky Way in detail and mocks the rest. Simulating the Milky Way in detail is harder by roughly the ratio of objects simulated, if we count the number of stars and black holes. According to the SA with linear scaling, being in simulation B is correspondingly less likely than being in A.
This particular example was what threw me off. We can presume that programs with shorter descriptions might better (i.e. more plausibly) simulate a complex system, and are more likely to be found by a computer/AI that iterates over possible programs, starting from the simplest (as in Solomonoff induction, IIUC). That process finds the shortest program that still sufficiently describes some observation sequence, and it would not necessarily favor encoding special cases (i.e. "mocking") for things that are costly to simulate in general. Instead, mocking (since it optimizes for computational cost) might map to a different thing in the Solomonoff picture, with the tradeoff of making the description more complex than the shortest possible one.
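To make the contrast explicit (these are standard textbook definitions, not anything from your post): Solomonoff's universal prior weights programs only by description length, while a Speed-Prior-style weighting, which is closer in spirit to the SA, also penalizes runtime. With U a universal prefix machine, \ell(p) the length of program p, and t(p, x) the time p takes to print x:

```latex
% Universal (Solomonoff) prior: description length only
M(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-\ell(p)}

% Speed-Prior-style weighting (one common simplification): runtime also counts
S(x) \;\propto\; \sum_{p \,:\, U(p) = x\ast} \frac{2^{-\ell(p)}}{t(p, x)}
```

(Here U(p) = x* means the output of p begins with x.) Mocking reduces t(p, x) at the cost of increasing \ell(p), so the two priors can rank simulations A and B in opposite orders.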
For example, to simulate a human being acting within a nontrivial universe [2], one might hold that there must exist some mathematical structure describing the human in all the ways we care about, in which case the runtime of their cognitive algorithms, etc., might have to be quite costly [3]. It might be more algorithmically probable, then, for such a human to be mapped to an algorithm built from simple priors (e.g. the laws of physics) rather than high-level code describing what the human does in various edge cases.
This isn't by any means a refutation of your argument, but rather a thought-provoker concerning the main premise of what the Simplicity Assumption should mean [4]. I agree with you and others that "simplicity" should be an organizing principle (one that conditions our priors over the types of possible universes). However, your post's usage didn't coincide with my implicit definition of "simplicity".
[0] (and possibly the amount of computation it seems to require)
[1] While your post isn't about AI generated universes, predictions made by an AI might well generate viable simulations (which might then become part of the hypothesis space under consideration).
[2] Another prior holds that we don't appear to be privileged observers within our own universe; in a similar vein, one might not (rationally?) hold that solipsism is a valid ontology over observers, etc.
[3] Admittedly, the example of accurately simulating one or more humans doesn't rule out the possibility that only the observations people actually notice are simulated (per your view), the rest being "mocked." On this topic, I can only defer to AI-related discussions like this and here as to how one might begin to condition the probability space over types of (simulated) universes.
[4] Though I don't personally know of a good argument in favor of the Speed Prior if we're talking about inductive inference leading to simulations.
Going by GPT-2's BPEs [1], and based on the encoder downloaded via OpenAI's script, there are 819 (single) tokens/embeddings that uniquely map to the numbers from 0-1000, 907 when going up to 10,000, and 912 up to 200,000 [2]. These embeddings of course get preferentially fed into the model, in order to maximize the number of characters in the context window and thereby leverage the statistical benefit of BPEs for language modeling. Bear in mind that the above counts exclude numeric tokens that begin with a space [3].
My point here being that, IIUC, for the language model to actually be able to manipulate individual digits, and to pick up on the elementary operations of arithmetic (e.g. carry, shift, etc.), the number of unique numeric tokens/embeddings might have to be limited to 10 – the base of the number system – when counting from 0 to the largest representable number [2].
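As a toy illustration of the point about a base-10 vocabulary (illustrative Python, not GPT-2's actual tokenizer): with one token per digit, schoolbook addition is a short, position-wise rule over tokens with an explicit carry – exactly the structure that a single opaque token for "2021" hides from the model:

```python
def tokenize_digits(n):
    """Base-10, one token per digit: the 10-token vocabulary from the comment."""
    return [int(d) for d in str(n)]

def add_by_digits(a, b):
    """Schoolbook addition over digit tokens: align (shift), add, carry."""
    xs, ys = tokenize_digits(a)[::-1], tokenize_digits(b)[::-1]  # least-significant first
    out, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        x = xs[i] if i < len(xs) else 0
        y = ys[i] if i < len(ys) else 0
        carry, d = divmod(x + y + carry, 10)  # carry rule, per digit position
        out.append(d)
    if carry:
        out.append(carry)
    return int("".join(map(str, out[::-1])))

print(add_by_digits(487, 635))  # → 1122, same as 487 + 635
```

A model whose tokenizer splits "487" into the tokens 4, 8, 7 at least sees the aligned digit columns this rule operates on; one that receives "487" as a single embedding has to memorize results case by case.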
[1] From the GPT-3 paper, it was noted:
This [GPT-3's performance on some other task] could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset.
[2] More speculatively, I think this limitation makes extrapolation of certain abilities (arithmetic, algebra, coding) quite difficult without knowing whether the BPE vocabulary would be optimized for manipulating individual digits/characters if need be, and that this limits the generalizability of studies such as GPT-3's inability to do math.
[3] For such tokens, there are a total of 505 up to 1000. Like the other byte pairs, these may have been automatically mapped based on the distribution of n-grams in some statistical sample (and so are easily overlooked).
Is it in AI's interest (a big assumption that it has interests at all, I know) to become so human-specific that it loses its ability to generalize?
There's an approach called learning the prior through imitative generalization that seems to me a promising way to address this problem. The most relevant quotes from that article:
We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if it knows the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.
What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.
Although I don't agree with everything on this site, I found this cluster of knowledge-related advice (on learning abstractions), and the rest of the site (made by a LWer, IIRC), very interesting, if not outright helpful, thus far; it seems to advocate that:
That's most of what I took away from the resources that the site offered.
Some disclaimers/reservations (strictly opinions) based on personal experiences, followed by some open questions:
Edited for clarity and to correct misinterpretations of central arguments.
This response considers (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN you may have in mind, namely a series of matrix multiplications by "fixed" weight matrices. That is the assumption I will first try to undermine. In doing so, I hope to lay some groundwork for an explanatory framework for neural networks with self-attention layers (for much later), or (better) to inspire transparency efforts by others, since I'm mainly writing this to provoke further thought.
However, I do hope to make a justifiable case below that transformers can scale, in the limit, to an AGI-like model (a claim that drew an emphatic "no" from you), because they do seem to exhibit the kind of behavior (i.e. few-shot learning, out-of-distribution generalization) that we'd expect to scale to AGI, if improvements in these respects continue.
I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.
Epistemic Status: What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.
3. Onto the counterargument:
Unfortunately, I wouldn't know precisely what is happening in (a-d) that allows systematic meta-learning to occur, in order for the key proposition:
First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this [sample inefficiency] makes it difficult or impossible for it to learn or create concepts that humans are not already using.
to be weakened substantially. I do think that meta-learning is happening, given the demonstrated few-shot generalization to unseen tasks, which only looks like it has something to do with the dynamic-weight-matrix behavior suggested by (a-d). However, I don't think it's enough to show that the dynamic-weights mechanism described initially is doing such-and-such contrastive learning, or that it's an overhaul of ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence the out-of-distribution and few-shot generalization) on task T, which I think is worth investigating given what they have accomplished experimentally.
Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I've read On Intelligence, Surfing Uncertainty, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently scaled, safe AGI, in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.
It makes sense that negative pairs would help to a large extent, but not all contrastive papers use negative examples – BYOL (ref), for example. Edit: though now I'm realizing that BYOL might no longer fit the definition of contrastive learning (it's arguably just ordinary self-supervised learning), so I apologize for the error/confusion in that case.
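For concreteness, here's a minimal numpy sketch of a BYOL-style objective with no negative pairs (names and shapes are illustrative, not BYOL's actual architecture): the online network's prediction is regressed onto a slowly moving (EMA) target projection with a normalized MSE, which is equivalent to 2 − 2·(cosine similarity).

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def byol_style_loss(online_pred, target_proj):
    """Normalized MSE between online prediction and target projection.

    There is no term pushing apart different examples: no negatives.
    """
    p, z = l2_normalize(online_pred), l2_normalize(target_proj)
    return float(np.mean(np.sum((p - z) ** 2, axis=-1)))

def ema_update(target_params, online_params, tau=0.99):
    """Target network trails the online network via an exponential moving average."""
    return tau * target_params + (1 - tau) * online_params

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 16))
loss_self = byol_style_loss(pred, pred.copy())             # perfect agreement -> 0
loss_rand = byol_style_loss(pred, rng.normal(size=(4, 16)))  # no agreement
print(loss_self, loss_rand)
```

Since nothing in the loss penalizes agreement on *different* inputs, the collapse question (why doesn't everything map to one constant vector?) is exactly what makes BYOL-style methods interesting, and it's the asymmetry of the predictor plus the EMA target that is usually credited with preventing it.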