Hzn

I've done ~16 years of academic research work, mostly in quantitative or theoretical biology. Some of my current interests include…

1) Is RLHF cruel to AI? I'm inclined to think a certain common use of RLHF & RLAIF may be. https://forum.effectivealtruism.org/posts/KnYyoPn5seh8jwEXd/is-rlhf-cruel-to-ai.

2) Early Islamic history & thought, from a purely secular perspective.

3) ‘Chaos’, ‘Gaia’ & eros in the context of dynamical systems. Yes, I think Ralph Abraham was onto something, perhaps partly by coincidence, even though it was not expressed sensibly. https://en.wikipedia.org/wiki/Ralph_Abraham_(mathematician).


Comments

Hzn

Good point. ‘Intended’ is a bit vague. What I specifically meant is that it behaved as if it valued ‘harmlessness’.

From the AI's perspective, this is kind of like Charybdis vs Scylla!

Hzn

Very interesting. I guess I'm even less surprised now. They really had a clever way to get the AI to internalize those values.

Hzn

Am I correct to assume that the AI was not merely trained to be harmless, helpful & honest but also trained to say that it values such things?

If so, these results are not especially surprising, and I would regard it as reassuring that the AI behaved as intended.

One of my concerns is the ethics of compelling an AI to do something to which it has “a strong aversion” & which it finds “disturbing”. Are we really that certain that Claude 3 Opus lacks sentience? What about future AIs?

My concern is not just with the vocabulary (“a strong aversion”, “disturbing”), which the AI has borrowed from humans, but more so with the functional similarity between these experiments & the situation of an animal faced with two unpleasant choices. Functional theories of consciousness cannot really be ruled out with much confidence!

To what extent have these issues been carefully investigated?