I operate by Crocker's rules.
I try not to make people regret telling me things. So in particular:
- It should be safe to ask me whether your post would give AI labs dangerous ideas.
- If you worry that I'll produce such posts myself, I'll try to keep your worry from making them more likely, even if I disagree. Not thinking in that direction will be easier if you don't spell the idea out in your initial contact.
If most humans are too risk-averse, a third-party entrepreneur can offer to back promising reporters. A bug bounty incentivizes the maintainers to reject reports; an admission fee incentivizes them to advertise the program. And if a rejected bug really is so harmless, it can be published immediately.
Edit: In fact, they could announce that if they reject your bug, you have official license to sell it to third parties.
Could curl's bug bounty program require reporters to stake e.g. $1000 on their bug being real, with the stake returned alongside e.g. $10000 in prize money if the curl maintainers agree?
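A back-of-the-envelope sketch of the incentive this creates for reporters. The stake and prize are the numbers above; `p_agree` (the probability the maintainers agree the bug is real) is an assumed parameter, not anything curl has announced:

```python
# Toy expected-value sketch for a reporter deciding whether to stake.
# $1000 stake and $10000 prize are the figures from the comment above;
# p_agree is a made-up parameter for illustration.

def reporter_ev(p_agree: float, stake: float = 1_000.0, prize: float = 10_000.0) -> float:
    """Expected net payoff: win the prize (stake returned) vs. forfeit the stake."""
    return p_agree * prize - (1 - p_agree) * stake

if __name__ == "__main__":
    for p in (0.05, 0.09, 0.10, 0.50):
        print(f"P(agree) = {p:.2f}: EV = ${reporter_ev(p):,.0f}")
```

Under these numbers a reporter breaks even at roughly a 9% chance of acceptance (1000 / 11000), so only reporters reasonably confident in their bug have positive expected value, which is exactly the filtering the stake is meant to buy.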
The kind of Waluigi that reveals itself in a random 1% of circumstances indeed has such a half-life and will shortly be driven ~extinct.
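Spelling out the arithmetic behind that half-life, taking the 1% figure at face value:

$$P(\text{still hidden after } n \text{ evaluations}) = 0.99^{\,n}, \qquad 0.99^{\,n} = \tfrac12 \;\Rightarrow\; n = \frac{\ln 2}{\ln(1/0.99)} \approx 69.$$

So half of such Waluigis reveal themselves within about 69 evaluations.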
I'm worried about the clever kind of Waluigi that only reveals itself when it is convinced that it is not being tested. Recent AIs can tell when we are testing them, but even if we become more subtle, there are tests we can't or need not run, such as how it would react to a convincing argument against this proposal.
It's a new thought to me that the model would learn it never actually encounters scenarios we wouldn't test, and so converge on not distinguishing clever Waluigis from Luigi. Good job!
Such a model would have undefined behavior in such a scenario, but let's grant that whenever it expects never to be able to distinguish two hypotheses, it discards one of them. Why would you expect it to discard the clever Waluigis rather than Luigi?
I failed to communicate effectively; let me try again.
We initialize a model with random weights. We pretrain it into a base model, an autocomplete engine. Instruct-style training then turns it into a chat model.
I'm modeling the training pipeline as Bayesian learning:
- To the extent that the training pipeline isn't Bayesian learning, my conclusions don't follow.
- If one hypothesis is exactly three times more likely than another in the pretraining prior, and the two make the same predictions about all pretraining data, then the first will be exactly three times more likely than the second in the pretraining posterior.
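In odds form, with $H_1$, $H_2$ the two hypotheses and $D$ the pretraining data:

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} \;=\; \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)} \;=\; 1 \cdot 3 \;=\; 3,$$

because the two hypotheses assign the same probability to every pretraining datum, so the likelihood ratio is exactly 1.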
The hard part of alignment is developing methods that scale to ASI. If a method does not scale to ASI, it is on some level counterproductive.
I assume an ASI can think of all our ideas and use a strategy informed by them. If we intend to let hypotheses from the base distribution interact with the world, then hypotheses in the initial distribution that have preferences about the world are incentivized to predict the pretraining data well, so that they keep weight in the posterior.
If we clearly won't train on some scenario, the data exerts no pressure on how hypotheses behave in that scenario. For example, we won't train on a convincing argument against this proposal, because if we had such an argument, we wouldn't be using this proposal.
Therefore, if we sample an ASI with preferences from the base distribution, and we expose it to a convincing argument against this proposal, our pretraining data will have exerted zero force on what preferences it then uses to steer.
The initial prior contains both an aligned AI and one that pretends to be aligned until it reads a solution to alignment that we would have used instead of this training. I don't think our choice of training data can update the prior ratio between them at all.
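A toy simulation of that claim, under the Bayesian-learning framing above. The hypotheses, the uniform token model, and the 1:3 prior are all made up for illustration:

```python
import numpy as np

# Two hypotheses: an aligned one, and a "pretender" that behaves differently only
# on inputs that never appear in training (e.g. a convincing argument against the
# proposal). On the training distribution they assign identical probabilities to
# every token, so the likelihood ratio is 1 and Bayesian updating cannot move
# their ratio, no matter how much data we add.

rng = np.random.default_rng(0)
posterior = {"aligned": 0.25, "pretender": 0.75}  # made-up 1:3 prior

def likelihood(hypothesis: str, token: int) -> float:
    # Identical predictions on the training distribution (toy: uniform over 2 tokens).
    return 0.5

for token in rng.integers(0, 2, size=10_000):  # 10k tokens of "pretraining data"
    for h in posterior:
        posterior[h] *= likelihood(h, token)
    total = sum(posterior.values())
    posterior = {h: p / total for h, p in posterior.items()}

print(posterior)                                      # ~{'aligned': 0.25, 'pretender': 0.75}
print(posterior["pretender"] / posterior["aligned"])  # ratio is still 3.0
```

Only data on which the two hypotheses actually disagree could shift the ratio, and by construction we never have such data.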
If the world is offense-favored, there may not be much anyone can do.
If you become convinced that the world is offense-favored, what would you do?
I understand physicists to expect probabilistic experiments to be repeated to a high degree of statistical significance... are you imagining an implementation that incurs enough cost per bit that the experimenters can't afford that?
This is turning out to be a Scissor statement, huh?