User Comment Replies

Great point about being anti-normative!

When I read this result, I thought of the training data. In particular, where would you expect to find insecure code, hacks, and exploits being discussed? What if all the insecure code in the training data comes from dark web forums, sketchy discussions on 4chan, etc.? You would expect a lot of anti-normative or evil stuff to be highly correlated with insecure code.

Another way to put this: I think it's not that these fine-tuned models are misaligned. They are completely aligned, but to dark web hacker trolls who share exploits ...

LESSWRONG

All of mariano54's Comments + Replies