I've done ~16 years of academic research work, mostly quantitative or theoretical biology.
Am I correct to assume that the AI was not merely trained to be harmless, helpful & honest but also trained to say that it values such things?
If so, these results are not especially surprising, and I would regard it as reassuring that the AI behaved as intended.
1 of my concerns is the ethics of compelling an AI into doing some thing to which it has “a strong aversion” & finds “disturbing”. Are we really that certain that Claude 3 Opus lacks sentience? What about future AIs?
My concern is not just with the vocabulary (“a strong aversion”, “disturbing”), which the AI has borrowed from humans, but more so the functional similarities between these experiments & an animal faced with 2 unpleasant choices. Functional theories of consciousness cannot really be ruled out with much confidence!
To what extent have these issues been carefully investigated?
Very interesting. I guess I'm even less surprised now. They really had a very clever way to get the AI to internalize those values!