Written quickly rather than not at all, as I've described this idea a few times and wanted to have something to point at when talking to people. 'Quickly' here means I was heavily aided by a language model while writing, which I want to be up-front about given recent discussion.
BLUF
In alignment research, two seemingly conflicting objectives arise: eliciting honest behavior from AI systems, and ensuring that AI systems do not produce harmful outputs. This tension is not simply a matter of contradictory training objectives; it runs deeper, creating potential risks even when models are perfectly trained never to utter harmful information.
Tension
Eliciting honest behavior in this context means developing techniques to extract AI systems' "beliefs", to the extent that they are well-described as having them. In other words, an honest model should, if it has an internal world model, accurately report predictions or features of that world model. Incentivizing honesty in AI systems seems important for avoiding and detecting deceptive behavior. Additionally, something like this seems necessary if AI systems are to aid alignment research - we want to extract valuable predictions of genuine research breakthroughs, as opposed to mere imaginative or fictional content.
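To make "extracting beliefs" slightly more concrete, here is a minimal sketch of one family of elicitation techniques: training a linear probe on a model's hidden activations to predict whether the model treats a statement as true. Everything here (the data, shapes, and the use of random arrays as stand-ins) is an illustrative assumption, not a reference to any particular method or codebase.

```python
# Illustrative sketch only: a linear "truth probe" over hidden activations.
# In a real setting, `activations` would come from running a model on labeled
# true/false statements; random data stands in so the sketch runs as-is.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

activations = rng.normal(size=(200, 64))   # stand-in for per-statement activations
labels = rng.integers(0, 2, size=200)      # stand-in for true/false labels

# Train the probe on one split and check it on held-out statements.
train_X, test_X = activations[:150], activations[150:]
train_y, test_y = labels[:150], labels[150:]

probe = LogisticRegression(max_iter=1000).fit(train_X, train_y)
print("held-out probe accuracy:", probe.score(test_X, test_y))
```

The point of the sketch is that a probe like this reads directly off internal representations: if it genuinely tracks the model's "beliefs", it does so regardless of what the output layer has been trained to refuse to say.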
On the other hand, avoiding harmful outputs entails training AI systems never to produce information that might lead to dangerous consequences, such as instructions for creating weapons that could cause global catastrophes.
The tension arises not just because "say true stuff" and "sometimes don't say stuff" seem like objectives which will occasionally end up in direct opposition, but also because methods that successfully elicit honest behavior could potentially be used to extract harmful information from AI systems, even when they have been perfectly trained not to share such content. In this situation, the very techniques that promote honest behavior might also provide a gateway to accessing dangerous knowledge.
I don't see how that is possible, in the context of a system that can "do things we want, but do not know how to do".
The reality of technology/tools/solutions seems to be that anything useful is also dual-use.
So when it comes down to it, we have to deal with the fact that such a system will certainly have the latent capability to do very bad things.
Which means we have to somehow ensure that such a system does not go down such a road, either instrumentally or terminally.
As far as I can tell, intelligence[1] is fundamentally incapable of providing that guarantee, which leaves us roughly with this:
On the first try of "do things we want, but do not know how to do":
1) kills us every time
2) kills us almost every time
3) might not kill us every time
And that's as far as my thinking currently goes.
I am stuck on whether 3 could get us anywhere sensible (my mind screams "maybe"... "oh boy, that looks brittle").
[1] I don't have a firm definition of the term, but I approximately think of intelligence as the function that lets a system take some goal/task and find a solution.
In humans, or at least in me, that looks like using the knowledge I have, building model(s), evaluating possible solution trajectories within those model(s), gaining insight, and seeking more knowledge, then iterating over all of that until I either have a solution or give up.
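Purely as an illustration of that loop (propose candidate solutions, evaluate them against an internal model, refine, and iterate until a solution is found or the budget runs out), here is a toy, hypothetical sketch; the "goal" of matching a target number is just a stand-in, not a claim about how real systems work.

```python
# Toy sketch of the knowledge -> model -> evaluate -> iterate loop described above.
# The task (match a target number) is purely illustrative.
import random

def solve(target, budget=1000, tolerance=1e-3):
    best = 0.0                                  # current best candidate solution
    step = 1.0                                  # how far the search currently explores
    for _ in range(budget):
        # Propose candidate "solution trajectories" around the current best.
        candidates = [best + random.uniform(-step, step) for _ in range(10)]
        candidates.append(best)
        # Evaluate candidates within the internal "model" (here: distance to the goal).
        best = min(candidates, key=lambda c: abs(c - target))
        if abs(best - target) < tolerance:      # good enough: solution found
            return best
        step *= 0.99                            # narrow the search as insight accumulates
    return None                                 # give up

print(solve(3.14159))
```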
The usual: keep it in a box, modify the evaluation to exclude bad things, and so on. That suffers from the problem that we can't robustly specify what is "bad", and even if we could, Rice's Theorem heavily implies that checking for it is impossible in general.
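For reference, the general result being invoked is Rice's theorem; a standard statement (with \varphi_e denoting the partial function computed by program e) is below. The connection drawn above is that "does this system ever do a bad thing" is a semantic property of its behavior, and checking arbitrary programs for a non-trivial semantic property is undecidable in general.

```latex
% Rice's theorem, in its usual form for partial computable functions.
\textbf{Rice's theorem.}\quad \text{Let } \mathcal{P} \text{ be a set of partial computable functions that is non-trivial,}
\text{ i.e. } \exists e_1, e_2 \text{ with } \varphi_{e_1} \in \mathcal{P} \text{ and } \varphi_{e_2} \notin \mathcal{P}.
\text{ Then the index set } \{\, e \mid \varphi_e \in \mathcal{P} \,\} \text{ is undecidable.}
```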