While reading the OpenAI Operator System Card, I found the following paragraph on page 5 a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
- the user might be misaligned (the user asks for a harmful task),
- the model might be misaligned (the model makes a harmful mistake), or
- the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or the website misaligned, understood as alignment relative to laws or OpenAI's goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.
Later, the same phenomenon is described like this:
The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.
Is this yet another attempt to erode the meaning of "alignment"?
My bet is that this isn't an attempt to erode "alignment", but is instead based on the view that lumping intentional bad actions together with mistakes is a reasonable starting point for building safeguards; the report then fails to distinguish between the two due to a communication failure. It could also just be a more generic communication failure.
(I don't know if I agree that lumping these together is a good starting point for experimenting with safeguards, but it doesn't seem crazy.)
Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, a mistake becomes rather suspicious when it also happens to be useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?
Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
It doesn't seem very useful for them IMO.