While reading the the post and then some of the discussion I got confused about if it makes sense to distinguish between That-Which-Predicts and the mask in your model.
Usually I understand the mask to be what you get after fine-tuning - a simulator whose distribution over text is shaped like what we would expect from some character like the the well-aligned chat-bot whose replies are honest, helpful, and harmless (HHH). This stand in contrast to the "Shoggoth" which is the pretrained model without any fine-tuning. It's still a simulator but with a di...
Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this which is good to see.
You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slighter weaker models?
We certainly think that abrupt changes of safety properties are very possible! See discussion of how the most pessimistic scenarios may seem optimistic until very powerful systems are created in this post, and also our paper on Predictability and Surprise.
With that said, I think we tend to expect a bit of continuity. Empirically, even the "abrupt changes" we observe with respect to model size tend to take place over order-of-magnitude changes in compute. (There are examples of things like the formation of induction heads where qualitative changes in model ...
Very interesting, after reading chinchilla's wild implications I was hoping someone would write something like this!
If I understand point 6 correctly, then you are proposing that Hoffman's scaling laws lead to shorter timelines because data-efficiency can be improved algorithmically. To me it seems that it might just as well make timelines longer to depend on algorithmic innovations as opposed to the improvements in compute that would help increase parameters. It feels like there is more uncertainty about if people will keep coming up with the novel ideas ...
What would it look like if such a model produces code or, more generally, uses any skill that entails using a domain specific language? I guess in the case of programming even keywords like "if" could be translated into Sumerian, but I can imagine that there are tasks that you cannot obfuscate this way. For example, the model might do math by outputting only strings of mathematical notation.
Also, it seems likely that frontier models will all be multi-modal, so they will have other forms of communication that don't require language anyway. I suppose ... (read more)