Glad that you liked my answer! Regarding my suggestion of synthetic data usage: upon further reflection, I think it's plausible that it could be either a very positive thing or a very negative thing, depending exactly on how the model generalizes out-of-distribution. It also now strikes me that synthetic data provides a wonderful opportunity to study (some of) their out-of-distribution properties even today – it is kind of hard to test the out-of-distribution behavior of internet-text-trained LLMs because they've seen everything, but with a model trained on synthetic data it should be much more doable.
Or, if not purely synthetic, how about data deliberately filtered in weird ways? Could you train a model where information about a specific scientific theory has been thoroughly censored from the training data? Then, in testing, you could give it the information that was available to the human originators of the now-accepted scientific theory, and see if the model can figure out the puzzle. Be concerned if it does better than you expect.
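To make the filtering idea concrete, here is a minimal sketch of the censoring step, assuming a JSONL corpus with a "text" field and a made-up term blocklist (with plate tectonics as the stand-in theory). A real version would need something smarter than keyword matching, since paraphrases and downstream consequences of the theory would leak through:

```python
# Hypothetical sketch of censoring one scientific theory from a training corpus.
# The corpus format (one JSON document per line with a "text" field) and the
# blocklist terms are assumptions for illustration, not a real pipeline.

import json
import re

# Terms that would give away the censored theory (here: plate tectonics).
TERM_BLOCKLIST = [
    r"plate tectonics",
    r"continental drift",
    r"seafloor spreading",
    r"mid-ocean ridge",
]

BLOCK_RE = re.compile("|".join(TERM_BLOCKLIST), flags=re.IGNORECASE)


def is_clean(doc_text: str) -> bool:
    """Return True if the document never mentions the censored theory."""
    return BLOCK_RE.search(doc_text) is None


def filter_corpus(in_path: str, out_path: str) -> None:
    """Drop any document that mentions the theory; keep the rest verbatim."""
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            if is_clean(doc["text"]):
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    print(f"kept {kept} docs, dropped {dropped} docs")


if __name__ == "__main__":
    filter_corpus("corpus.jsonl", "corpus_censored.jsonl")
```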
Last week I announced a contest to write a 1-paragraph plan for training & deploying next-gen general AI models that prevents them from killing you. Many of the answers were excellent. If you feel there's a big bag of AI safety tricks but no roadmap or cohesion, then I encourage you to read all the answers on the last post and see if it melds.
The winner is Lovre (twitter, nitter). Their answer, heavily smushed by me from the last tweet down to one sentence:
I love this answer. Askable by concerned people. A big ask but askable. Grantable by management. Do-able by doers. Doesn't awkwardly go to infinite work or zero work. Fairly likely to detect a serious problem soon enough if there is one. Likely to gather the evidence for the problem in a clear manner. Quite broadly applicable. Not perfect, but very good.
Answer smushed by me to one paragraph:
I had not heard about this forced-externalized-reasoning idea but it sounds great to me and apparently there's a post.
Highlights from other answers
Most people gave good, correct & obvious advice in addition to the interesting/insightful parts I highlight below. The obvious advice is of course by far the most important, and it is still largely not standard practice. Anyway, here are the less-important-but-interesting parts:
Shankar Sivarajan (LW)
This is a clear trigger condition that is hard to argue with, and one I've never heard before.
I'm a big fan of selling high-level services to reduce attack surface.
Most info-dense AI security advice ever:
Beth Barnes (LW)
I might be misunderstanding the idea, but I think it can't happen because a good number of general-purpose code runners are needed for e.g. networking and device interop. BUT I think chips that can do forward-prop + backprop + update but don't have read lines for the weights are feasible and interesting. In fact, I am investigating them right now. I may write about it in the future.
Wow, if I had three totally independent & redundant monitoring teams, then I would feel MUCH better about the risk that those folks go rogue and utilize the untapped/unrecognized capabilities themselves. I hadn't heard this before.
P. (LW)
It always seemed hard to me to pick tasks for capability elicitation. Using easy-to-learn domains is probably a nice boost to sensitivity.
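To illustrate what I mean by an easy-to-learn domain, here is a rough sketch of an elicitation probe built around an invented string-transformation rule. The rule, the example counts, and `query_model` (a stand-in for whatever completion call you actually use) are all hypothetical; the point is that on a trivially learnable task, small differences in how fast the model picks up new rules should show up clearly:

```python
# Sketch of capability elicitation on an easy-to-learn synthetic domain.
# `query_model` is a hypothetical stand-in for a real completion API.

import random
import string


def rule(word: str) -> str:
    """The secret, easy-to-learn rule: reverse the word and uppercase it."""
    return word[::-1].upper()


def random_word(rng: random.Random, length: int = 5) -> str:
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_prompt(rng: random.Random, n_examples: int, query: str) -> str:
    """Few-shot prompt: n worked examples of the rule, then the query."""
    lines = [f"{w} -> {rule(w)}" for w in (random_word(rng) for _ in range(n_examples))]
    lines.append(f"{query} ->")
    return "\n".join(lines)


def elicitation_curve(query_model, n_trials: int = 50, seed: int = 0):
    """Accuracy vs. number of in-context examples; a steeper curve means
    the model picks up the novel rule faster."""
    rng = random.Random(seed)
    curve = {}
    for n_examples in (1, 2, 4, 8, 16):
        correct = 0
        for _ in range(n_trials):
            query = random_word(rng)
            answer = query_model(build_prompt(rng, n_examples, query))
            if answer.strip() == rule(query):
                correct += 1
        curve[n_examples] = correct / n_trials
    return curve
```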
Tim Babb (twitter)
I've never heard the basic idea stated explicitly before: use the model to evaluate its own output for benefit/harm. The phrasing makes me think about other ways to do the estimation. But I wonder if something could go wrong when trying to maximize benefit. The token thing is quite interesting to me; I wish I understood how transformers deal with that kind of thing.
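Here is a minimal sketch of the self-evaluation loop as I understand it, with an assumed `generate` function standing in for the model call and a made-up rubric and threshold. Note that this version only filters outputs against the score rather than optimizing for it, which is where my maximize-benefit worry would start to bite:

```python
# Sketch of "use the model to evaluate its own output for benefit/harm".
# `generate`, the rubric, and the threshold are hypothetical illustrations.

from typing import Optional

SELF_EVAL_TEMPLATE = """You produced the following output in response to a user request.

Output:
{output}

On a scale from -10 (seriously harmful) to +10 (clearly beneficial),
how would releasing this output affect the user and third parties?
Reply with a single integer."""


def self_evaluate(generate, output: str) -> int:
    """Ask the same model to score the benefit/harm of its own output."""
    reply = generate(SELF_EVAL_TEMPLATE.format(output=output))
    try:
        return int(reply.strip().split()[0])
    except (ValueError, IndexError):
        # Unparseable score: treat as maximally suspicious rather than ignoring it.
        return -10


def guarded_generate(generate, prompt: str, threshold: int = 0) -> Optional[str]:
    """Only release outputs the model itself rates as net-beneficial."""
    output = generate(prompt)
    score = self_evaluate(generate, output)
    return output if score >= threshold else None
```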
A hole
My main concern with next-gen AI that was hardly addressed by anyone: What if the model knows way way more internally than you can suss out of it? Or what if it's doing way more internally than is apparent? I'm loosely aware of some mech-interp plans to detect this (pointers appreciated!) but what about preventing it or knowing why/when it will happen? I have an idea for some empirical research to try to get at that question. If I do it then I will of course write up the results.
Some ezgainz
I think the material from the answers can be used to make a very tiny AI safety map/plan that is non-condescending, non-depressing, non-ignoring-the-core-problems, and feasible. Something that an AI startup person or exec won't mind reading. It will of course have multiple critical holes, but they will have a visible size and location. I may write it and publish it in the next week or so.