AI safety & alignment researcher
One story here could be that the model is being used only in an API context where it's being asked to take actions on something well-modeled as a Markov process, where the past doesn't matter (and we assume that the current state doesn't incorporate relevant information about the past). There are certainly use cases that fit that ('trigger invisible fence iff this image contains a dog'; 'search for and delete this file'). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn't apply to current LLMs but seems like a very plausible path for economically useful models like agents.
It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'
I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 - 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me
Epoch's analysis from June supports this view, and suggests it may even be a bit too conservative:
(and that's just for text -- there are also other significant sources of data for multimodal models, eg video)
the untrusted model is stateless between queries and only sees the command history and system state.
What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you're assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Great point! In the block world paper, they re-randomize the obfuscated version, change the prompt, etc ('randomized mystery blocksworld'). They do see a 30% accuracy dip when doing that, but o1-preview's performance is still 50x that of the best previous model (and > 200x that of GPT-4 and Sonnet-3.5). With ARC-AGI there's no way to tell, though, since they don't test o1-preview on the fully-private held-out set of problems.
The Ord piece is really intriguing, although I'm not sure I'm entirely convinced that it's a useful framing.
It's a fascinating idea, though, and one that'll definitely stick with me as a possible framing. Thanks!
After some discussion elsewhere with @zeshen, I'm feeling a bit less comfortable with my last clause, building an internal model. I think of general reasoning as essentially a procedural ability, and model-building as a way of representing knowledge. In practice they seem likely to go hand-in-hand, but it seems in-principle possible that one could reason well, at least in some ways, without building and maintaining a domain model. For example, one could in theory perform a series of deductions using purely local reasoning at each step (although plausibly one might need a domain model in order to choose what steps to take?).
[EDIT: I originally gave an excessively long and detailed response to your predictions. That version is preserved (& commentable) here in case it's of interest]
I applaud your willingness to give predictions! Some of them seem useful but others don't differ from what the opposing view would predict. Specifically:
First of all, serious points for making predictions! And thanks for the thoughtful response.
Before I address specific points: I've been working on a research project that's intended to help resolve the debate about LLMs and general reasoning. If you have a chance to take a look, I'd be very interested to hear whether you would find the results of the proposed experiment compelling; if not, then why not, and are there changes that could be made that would make it provide evidence you'd find more compelling?
Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.
Absolutely! And then on top of that, it's very easy to mistake using knowledge from the truly vast training data for actual reasoning.
But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.
This does seem like one possible outcome. That said, it seems more likely to me that continued algorithmic improvements will result in better sample efficiency (certainly humans need a far tinier amount of language examples to learn language), and multimodal data /synthetic data / self-play / simulated environments continue to improve. I suspect capabilities researchers would have made more progress on all those fronts, had it not been the case that up to now it was easy to throw more data at the models.
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 - 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me (although that would basically be the generation of GPT-5 and peer models; it seems likely to me that the generation past that will require progress on one or more of the fronts I named above).
Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the author’s mind?
I think that's a reasonable concern in the general case. But in cases like the ones mentioned, the authors are retrieving information (eg lat/long) using only linear probes. I don't know how familiar you are with the math there, but if something can be retrieved with a linear probe, it means that the model is already going to some lengths to represent that information and make it easily accessible.
I see -- I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!