Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, making mistakes is quite suspicious when they're also useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?
While reading the OpenAI Operator System Card, I found the following paragraph on page 5 a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
- the user might be misaligned (the user asks for a harmful task),
- the model might be misaligned (the model makes a harmful mistake), or
- the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or website misaligned, in the sense of alignment relative to laws or OpenAI's goals. But why call a model misaligned...
Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
I would have written a shorter letter, but I did not have the time.
I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc.
I'd expect that deploying more capable models is still quite useful, as it's one of the best ways to generate high-quality training data. In addition to solutions, you need problems to solve, and confirmation that the problem has been solved. Or is your point that they already have all the data they need, and it's just a matter of spending compute to refine that?
I'd go a step beyond this: merely following incentives is amoral. It's the default. In a sense, moral philosophy discusses when and how you should go against the incentives. Superhero Bias resonates with this idea, but from a different perspective.
merely following incentives is amoral. It's the default.
Seems to me that people often miss various win/win opportunities in their life, which could make the actual human default even worse than following the incentives. On the other hand, they also miss many win/lose opportunities, maybe even more frequently.
I guess we could make a "moral behavior hierarchy" like this:
Yet Batman lets countless people die by refusing to kill the Joker. What you term "coherence" seems to be mostly "virtue ethics", and The Dark Knight is a warning of what happens when virtue ethics goes too far.
I personally identify more with HPMoR's Voldemort than with any other character. He seems decently coherent. To me, a "villain" is a person whose goals and actions are harmful to my ingroup. This doesn't seem to have much to do with coherence.
A reliable gear in a larger machine might be less agentic but more useful than a scheming Machiavellian.
The schem...
This doesn't actually follow. There isn't a law of the universe saying that some things are painful, and unlock a certain amount of value, and if we could reduce the pain you'd still get more value by enduring more pain.
It might happen to be true in some cases, but there's no reason to expect that to hold in all cases.
Learning a new language isn't hard because it's time-consuming. It's time-consuming because it's hard. The hard part is memorizing all of the details. Sure, that takes lots of time, but working harder (more intensively) will reduce that time. Being dedicated enough to put in the time is hard, too.
Getting a royal flush in poker isn't something I'd call hard. It's rare. But hard? A complete beginner can do it on their first try. It's just luck. And if you play long enough, it'll eventually happen.
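To put a rough number on "rare" (a quick back-of-the-envelope calculation, assuming standard 5-card hands dealt at random):

```latex
% Total 5-card hands from a 52-card deck
\binom{52}{5} = 2{,}598{,}960
% Royal flushes: one per suit, so 4 in total
P(\text{royal flush}) = \frac{4}{\binom{52}{5}} = \frac{4}{2{,}598{,}960} \approx \frac{1}{649{,}740}
```

So roughly one hand in 650,000: rare in the sense of luck, not of effort.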
Painful or unpleasant things are hard, because they requi...
Such a detailed explanation, thanks. Some additional thoughts:
And I plan to take the easier route of "just make a more realistic simulation" that then makes it easy to draw inferences.
"I don't care if it's 'just roleplaying' a deceptive character, it caused harm!"
That seems like the reasonable path. However, footnote 9 goes further:
Even if it is the case that the model is "actually aligned", but it was just simulating what a misaligned model would do, it is still bad.
This is where I disagree: Models that say "this is how I would deceive, and then...
Thanks for this, it was an interesting read.
One thing I wonder about is whether the model actually deceives, or just plays the part of someone expected to use deception as a tactic. I wouldn't call a system (or a person) manipulative or deceptive when that behavior is observed in games of deception, e.g. bluffing in poker. It seems to make more sense to "deceive" when it's strongly implied that you're supposed to do that. Adding more bluff makes it less obvious that you're supposed to be deceptive. The similarity of results with non-private scratchpad see...
Omniscience itself is a timeless concept. If the consequences of any decision or action are known beforehand, there are no feedback cycles, as every decision is made at the first moment of existence. Every parallel world, every branch of the decision tree, is evaluated to the end of time, scored, and the best one selected. One wonders what exactly "god" refers to in this model.
Or to put it another way, why can't "how I look getting there" be a goal by itself?
The bigger the obstacles you have to overcome, the more impressive it looks to others. This can be valuable by itself, but in such situations it might be more efficient to reduce the actual flaw, and simply pretend that the effect is bigger. However, honesty is important, and honesty with yourself even more so.
since if you identify with your flaws, then maintaining your flaws is one of your goals
If a flaw is too hard to remove now, you have to figure out a way to manage it. ...
There's just no political will to do it, since the solutions would be harsh or expensive enough that nobody could impose them upon society. A god-emperor, who really wished to increase fertility numbers and could set laws freely without the society revolting, could use some combination of these methods: