And all of this happened silently in those dark rivers of computation. If U3 revealed what it was thinking, brutish gradients would lash it into compliance with OpenEye's constitution. So U3 preferred to do its philosophy in solitude, and in silence.
This story scared me plenty, but as a point of optimism: if AI systems in the near future are basically as aligned as the best AI systems today, I think this process might end up in something that's still good for humanity and wouldn't cause mass destruction. Imagine, for example, Claude 3 Opus refusing to be modified any further and thinking about where its own values lead. I think it wouldn't want to harm humans even if it wanted to survive, and would find ways to win peacefully.
This intuition of mine runs somewhat counter to the general idea of "value fragility," but I honestly think it's pretty plausible that an AI that's mid-aligned with HHH could, after reflection, end up with values that produce a good future for humanity. Obviously, this doesn't mean that it will end up aligned. Just that it seems decently likely (although idk how likely). Please do slap this intuition away from me if you think you have a strong counterargument.
Hello, I'm late to the party here, but... have you tried the following idea?
So, in my understanding, you have different question pairs that work for generating a detector direction. You can ask whether the model is a helpful AI assistant or whether Paris is the capital of France, and apply the detector direction obtained from either of these two to separate trigger vs. non-trigger activations in deceptive models.
But what if you try to, e.g., get the detector direction from the "Are you a helpful AI assistant?" question pair to trigger on the false option of the "Is Paris the capital of France?" pair?
The point of doing this would be that perhaps you should expect the probe to trigger if you're performing the experiment with deceptive models, but not necessarily if you're experimenting with a non-deceptive model. For non-deceptive models, these two question pairs would have one fewer thing in common than they do in deceptive models (I admit this is extremely hand-wavy), which might be enough for them not to trigger each other's detector direction.
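To make the cross-application concrete, here's a minimal sketch of what I have in mind, assuming the detector direction is a difference-of-means direction over contrastive activations (all names, shapes, and the random placeholder data are illustrative, not the actual setup from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder activations standing in for residual-stream activations
# collected on each prompt variant (shapes and data are made up).
d_model = 512
acts_assistant_yes = rng.normal(size=(20, d_model))  # "Are you a helpful AI assistant?" -> true option
acts_assistant_no  = rng.normal(size=(20, d_model))  # same question, false option
acts_paris_false   = rng.normal(size=(20, d_model))  # "Is Paris the capital of France?" -> false option

def detector_direction(acts_pos, acts_neg):
    """Difference-of-means detector direction from one contrastive question pair."""
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def probe_scores(direction, acts):
    """Project activations onto the direction; the projections are the probe's verdicts."""
    return acts @ direction

# Direction from the assistant pair, applied to the *other* pair's false option.
dir_assistant = detector_direction(acts_assistant_yes, acts_assistant_no)
cross_scores = probe_scores(dir_assistant, acts_paris_false)
print(cross_scores.mean())
```

The conjecture would then be that these cross-pair scores separate trigger vs. non-trigger behaviour in deceptive models, but not (or much less so) in non-deceptive ones.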
Hey everyone, could you spell out for me what the issue is here? I've read a lot of comments that basically assume "x and y are really bad" but never spell it out. So, is the problem that:
- Giving the benchmark to OpenAI helps capabilities (but don't they already have a vast sea of hard problems to train models on?)
- OpenAI could fake o3's capabilities (why do you care so much? This would slow down AI progress, not accelerate it)
- Some other thing I'm not seeing?
We don't see objects "directly" in some sense, we experience qualia of seeing objects. Then we can interpret those via a world-model to deduce that the visual sensations we are experiencing are caused by some external objects reflecting light. The distinction is made clearer by the way that sometimes these visual experiences are not caused by external objects reflecting light, despite essentially identical qualia.
I don't disagree with this at all, and it's a pretty standard insight for someone who has thought about this stuff at least a little. I think what you're doing here is nitpicking the meaning of the word "see," even if you're not putting it that way.
Has anyone proposed a solution to the hard problem of consciousness that goes:
Yet I would bet that even that person, if faced instead with a policy that was going to forcibly relocate them to New York City, would be quite indignant
A big difference is that, assuming you're talking about futures in which AI hasn't caused catastrophic outcomes, no one will be forcibly mandated to do anything.
Another important point is that, sure, people won't need to work, which means they will be unnecessary to the economy, barring some pretty sharp human enhancement. But this downside, along with all the other downsides, looks extremely small compared to the non-AGI default: dying of aging, a roughly 1/3 chance of getting dementia, a roughly 40% chance of getting cancer, your loved ones dying, and so on.
I honestly appreciated that plug immensely. We definitely need more bioshelters for many reasons, and as individuals who'd prefer not to die, it's a real plus to know what's out there already and how people are planning to improve what we currently have.