Thanks, yeah I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a "mere" capabilities hit/cost (perhaps a very large cost that grows with AI capability, but one that does not bound the ultimate capability of the aligned system). So this was a useful clarification for me in understanding your perspective; I may be wrong, but I could imagine it being useful to lead with this a bit more, i.e. "we don't know of, and would be very interested in, solutions that might be extremely costly but that avoid all counter-examples". Possibly you already say this and I just missed it.
Apologies for a possibly naive comment/question; perhaps this has been discussed elsewhere and you can just direct me there. But anyway...
I would find it helpful to see a strategy that ARC believes does in fact solve ELK (i.e. avoids all counter-examples), but that is ruled out only because it requires taking an unacceptably large capabilities hit. I would find this helpful for several reasons, namely:
(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples,
(2) it would give me a better sense for how optimistic to be about the approach, since it's often easier to start from an inefficient solution and make it more efficient than it is to find even an inefficient solution in the first place, and/or
(3) if you have trouble identifying such a solution, then it would suggest to me that finding one might be a useful research direction.
I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.
Anyway, based on a fairly similar motivation of wanting to be able to “ask an LM what it’s actually thinking/expecting”, combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still before starting… we’ve experimented with including metadata in language pretraining data. Most large language datasets have this information, e.g. books have titles and (maybe) blurbs, websites have titles, URLs, and (maybe) associated subreddit links, etc. This data is obviously much noisier and lower-quality than what you get from paying people for annotations, but it’s voluminous, diverse, and ~free.
When inserting this metadata for pretraining, we made sure to do so completely randomly, i.e. a book title might be inserted anywhere within a book (maybe several times in different context windows, etc.). We added separate <META_START> and <META_STOP> tokens to indicate the beginning and end of metadata, but that’s it. The motivation was to ensure that this “thought stream” was in-distribution at all positions within the context, while conversely making it easy to never sample it (by declining to sample the start token). This means that we can both use it when prompting and use it as a query -- i.e. we can ask the model, at any time, “how likely is this to be from the NYTimes vs from 4chan?” by evaluating the logprobs of text enclosed by the tokens. With this specification, one can do a kind of “metadata beam search” where you prompt, sample, evaluate, cull, and repeat.
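To make the format and the query concrete, here's a rough sketch of the mechanics (purely illustrative -- this isn't our actual code, and a stock GPT-2 with freshly added tokens obviously hasn't been pretrained on this format, so the numbers it prints only demonstrate the interface):

```python
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM would do; GPT-2 here purely for illustration. The two
# metadata tokens are added to the vocabulary and nothing else changes.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<META_START>", "<META_STOP>"]}
)
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))


def insert_metadata(text: str, metadata: str) -> str:
    """Build a pretraining example by dropping the metadata, wrapped in the
    start/stop tokens, at a uniformly random position in the document."""
    words = text.split()
    i = random.randrange(len(words) + 1)
    return " ".join(words[:i] + [f"<META_START>{metadata}<META_STOP>"] + words[i:])


@torch.no_grad()
def metadata_logprob(context: str, metadata: str) -> float:
    """Log-probability of the (wrapped) metadata string immediately following
    the context -- i.e. using metadata as a query rather than as a prompt."""
    prefix_ids = tokenizer(context, return_tensors="pt").input_ids
    query_ids = tokenizer(
        f"<META_START>{metadata}<META_STOP>", return_tensors="pt"
    ).input_ids
    input_ids = torch.cat([prefix_ids, query_ids], dim=1)
    logits = model(input_ids).logits
    # Per-token logprob of each actual next token, then keep just the query span.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_scores = log_probs[torch.arange(targets.shape[0]), targets]
    return token_scores[-query_ids.shape[1]:].sum().item()


# The "NYTimes vs 4chan" style query on a passage:
passage = "The senator's office declined to comment on the allegations."
for source in ["nytimes.com", "boards.4chan.org"]:
    print(source, metadata_logprob(passage, source))
```

The "metadata beam search" is then just a loop around a query like this: prompt, sample a few continuations, score each against the metadata you care about, keep the best-scoring ones, and repeat.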
We generally found that this sort of works, in that the mutual information between these labels and the text goes up with model size, and you can use these metadata tags as filters to get rid of some of the most irrelevant text. But the results weren’t immediately stunning, and so we didn’t investigate them much further (to be clear, this was mostly because we prioritized other things more highly, rather than because we don't view this as worthwhile).
So my general suggestion would be to start with something very cheap, like the above. At the very least, this will mean that when you finetune on higher-quality data, your format is already on-distribution. But hopefully it’ll also help you calibrate expectations and give you a better sense for exactly what kind of data you want to shell out money for.
Beyond that, I agree with what Stella said -- it seems easier and better to focus first on shorter passages, both for human-sourcing reasons and for diversity. Typically the benefits we see from finetuning grow with something like the log of the dataset size, so a small number of shorter examples should quickly give you an idea of what kind of progress you can expect.
If it were me, I'd also try to increase ROI by asking people to add commentary to existing books, rather than having people write from scratch. And I'd suggest making the formatting as simple and general as possible, both so that you can use and investigate it very flexibly, and so that you minimize regret if you change your mind in the future.
There's a direction (which I imagine you and others have considered) where you replace some activations within your AI with natural language, so that, e.g., certain layers can (at least heuristically) only communicate with the next layer in natural language (NL).
Then you heavily regularize in various ways. You'd require the language to be fully understandable and transparent, perhaps requiring that counterfactual changes to inputs lead to sensible changes to outputs within subsystems, etc. You'd have humans verify that the language was relevant, meaningful, and concise, train AIs to do this verification at larger scale, do some adversarial training, etc. You could also train sub-human-level AIs to paraphrase the language that's used and restate it between layers, to make it really hard for the whole system to ever pass hidden coded messages.
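To give a toy sketch of the shape I have in mind (everything below is a placeholder -- in practice the layers and the paraphraser would be separately trained models, and the inspection hook would be humans plus verifier models):

```python
from typing import Callable, List

# A subsystem is anything that maps natural language to natural language.
# In practice each of these would be a separately trained (sub-human-level)
# model; plain functions stand in for them here.
Subsystem = Callable[[str], str]


def run_pipeline(
    layers: List[Subsystem],
    paraphrase: Subsystem,
    inspect: Callable[[int, str], None],
    prompt: str,
) -> str:
    """Toy forward pass in which every inter-layer 'activation' is plain text."""
    text = prompt
    for i, layer in enumerate(layers):
        text = layer(text)       # the NL "activation" produced by this layer
        inspect(i, text)         # hook for human/AI verification: relevant,
                                 # meaningful, concise?
        text = paraphrase(text)  # restate before passing it on, so hidden
                                 # coded messages are hard to preserve
    return text
```

The point of the paraphrase step is that whatever information survives it has to be carried by the plain meaning of the text rather than by any particular choice of wording.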
This seems like it lives under a slogan like "enforce interpretability at any cost". It would almost certainly incur a big efficiency/capabilities hit -- maybe an enormous one -- though it actually seems plausible that the hit would be much smaller for extremely capable systems than for the AI models of today.
A crucial question will then be "how powerful are the subsystems that talk to each other via natural language allowed to get?" In the most conservative limit, each subsystem is human-level or even significantly below it; in the riskiest limit, you just have a single NL layer that cuts the system in half.
There's a worry along the lines of "maybe the whole system is so big and complex that it has emergent, bad, inscrutable behavior even though every step is interpretable and makes sense". Or, in the same vein, "the answers to the simple big-picture questions we care about don't live anywhere specific, so this doesn't help us ensure the model can transparently address them, even if its operation can be broken down into transparent pieces". That said, I think we're in a better position wrt these issues, since we can now talk about training models that automate the extraction of big-picture information from the NL activations in this giant beast.