Thanks for the feedback! Upvoted, but disagreed.
I agree that not knowing anything at all about cybersecurity might cause the model to write less secure code (though it is not obvious that including unsafe code examples in training doesn't in fact lead to more unsafe code being emitted; let's put that aside).
However, writing safe code requires quite different knowledge than offensive cybersecurity does. What matters for writing safe code is knowing about common vulnerabilities (which are often just normal bugs) and how to avoid them - information which I agree pr...
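To make the "often just normal bugs" point concrete, here is a toy Python illustration of the kind of knowledge I mean (my own example, not drawn from any particular dataset or paper): the vulnerable version is just sloppy string handling, and the fix requires no offensive skills at all.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable: string interpolation lets crafted input alter the query
    # (e.g. username = "x' OR '1'='1" returns every row).
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe: a parameterized query treats the input as data, not as SQL.
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

Knowing to prefer the second version is "secure coding" knowledge; nothing about it teaches you how to run an attack.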
I'd like to offer an alternative to the third point. Let's assume we have built a highly capable AI that we don't yet trust. We've also managed to coordinate as a society and implement defensive mechanisms to get to that point. I think we can avoid testing the AI in a low-stakes environment and then immediately moving to a high-stakes one (as described in the dictator analogy), while still capturing most of the gains.
It is feasible to design a sandboxed environment that is formally proven to be secure, in the sense that you cannot hack into it, escape from it, or deliberatel...
For future reference, there are benchmarks for safe code generation that could be used to assess this issue, such as Meta's Purple Llama CyberSecEval.
(Note: This paper has two different tests. First, a benchmark for writing safe code, which I didn't check and can't vouch for, but which seems like a useful entry point. Second, a test of model alignment towards not cooperating with requests for cyberattack tooling, which I don't think is very relevant to the OP.)
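To sketch what assessing this could look like in practice (my own rough sketch, not CyberSecEval's actual harness), one could run a static analyzer such as bandit over model-generated code and count the flagged issues as a crude proxy for "unsafe code":

```python
import json
import subprocess

def count_insecure_findings(generated_code_dir: str) -> int:
    """Run bandit (a common Python static analyzer) over a directory of
    model-generated code and return the number of flagged issues.
    Only a rough proxy for unsafe code, not a real benchmark."""
    result = subprocess.run(
        ["bandit", "-r", generated_code_dir, "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout)
    return len(report.get("results", []))

# Hypothetical usage: compare outputs from two models saved to disk.
# print(count_insecure_findings("outputs/model_a"))
# print(count_insecure_findings("outputs/model_b"))
```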