Introduction
In this short post, I would like to argue that it might be a good idea to exclude certain information – such as cybersecurity and biorisk-enabling knowledge – from frontier model training. I argue that this
- is feasible, both technically and socially;
- reduces significant misalignment and misuse risk drivers from near-to-medium future models;
- is a good time to set this norm;
- is a good test case for regulation.
After arguing for these points, I conclude with a call to action.
Remarks
To emphasize, I do not argue that this
- is directly relevant to the alignment problem;
- eliminates all risks from near-to-medium future models;
- (significantly) reduces risks from superintelligence.
As I am far more knowledgeable in cybersecurity than, say, biorisks, whenever discussing specifics, I will... (read 836 more words →)
For future reference, there are benchmarks for safe code that could be use to assess this issue such as Purple Llama CyberSecEval by Meta.
(Note: This paper has two different tests. First, a benchmark for writing safe code, which I didn't check and can't vouch for, but seems like a useful entry point. Second, a test for model alignment towards not cooperating with asks for tools for cyberattacks, which I don't think is too relevant to the OP.)