Thanks for the feedback! Upvoted, but disagreed.
I agree that not knowing anything at all about cybersecurity might cause the model to write less secure code (though it is not obvious that including unsafe code examples in training doesn't in fact lead to more unsafe code being emitted; let's put that aside).
However, writing safe code requires quite different knowledge than offensive cybersecurity does. What matters for writing safe code is knowing about common vulnerabilities (which are often just normal bugs) and how to avoid them - information which I agree pr...
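To make the "often just normal bugs" point concrete, here is a toy Python illustration of the kind of knowledge I mean (my own example, not drawn from any particular dataset or paper): the vulnerable version is just sloppy string handling, and the fix requires no offensive skills at all.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable: string interpolation lets crafted input alter the query
    # (e.g. username = "x' OR '1'='1" returns every row).
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Safe: a parameterized query treats the input as data, not as SQL.
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

Knowing to prefer the second version is "secure coding" knowledge; nothing about it teaches you how to run an attack.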
I'd like to offer an alternative to the third point. Let's assume we have built a highly capable AI that we don't yet trust. We've also managed to coordinate as a society and implement defensive mechanisms to get to that point. I think we can avoid testing the AI in a low-stakes environment and then immediately moving to a high-stakes one (as described in the dictator analogy), while still capturing most of the gains.
It is feasible to design a sandboxed environment that is formally proven to be secure, in the sense that you cannot hack into it, escape from it, or deliberatel...
For future reference, there are benchmarks for safe code generation that could be used to assess this issue, such as Meta's Purple Llama CyberSecEval.
(Note: This paper has two different tests. First, a benchmark for writing safe code, which I didn't check and can't vouch for, but which seems like a useful entry point. Second, a test of model alignment towards not cooperating with requests for cyberattack tooling, which I don't think is very relevant to the OP.)
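To sketch what assessing this could look like in practice (my own rough sketch, not CyberSecEval's actual harness), one could run a static analyzer such as bandit over model-generated code and count the flagged issues as a crude proxy for "unsafe code":

```python
import json
import subprocess

def count_insecure_findings(generated_code_dir: str) -> int:
    """Run bandit (a common Python static analyzer) over a directory of
    model-generated code and return the number of flagged issues.
    Only a rough proxy for unsafe code, not a real benchmark."""
    result = subprocess.run(
        ["bandit", "-r", generated_code_dir, "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout)
    return len(report.get("results", []))

# Hypothetical usage: compare outputs from two models saved to disk.
# print(count_insecure_findings("outputs/model_a"))
# print(count_insecure_findings("outputs/model_b"))
```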