While reading the OpenAI Operator System Card, I found the following paragraph on page 5 a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
- the user might be misaligned (the user asks for a harmful task),
- the model might be misaligned (the model makes a harmful mistake), or
- the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or the website misaligned, understood as alignment relative to laws or OpenAI's goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.
Later, the same phenomenon is described like this:
The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.
Is this yet another attempt to erode the meaning of "alignment"?
My bet is that this isn't an attempt to erode "alignment", but is instead based on the view that lumping intentional bad actions together with mistakes is a reasonable starting point for building safeguards; the report then fails to distinguish between the two due to a communication failure. It could also just be a more generic communication failure.
(I don't know if I agree that lumping these together is a good starting point for experimenting with safeguards, but it doesn't seem crazy.)
Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, a mistake becomes rather suspicious when it also happens to be useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?
Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
It doesn't seem very useful for them IMO.