observing deception during training: there are examples of humans carrying out deception in the training set
thinking about a next-token-prediction LM architecture: let’s say that there is some situation where people are mostly honest, but sometimes deceptive. Does the model learn deception from this, because it was in the data, or does it learn to be overwhelmingly honest? (And is this robust to the proportion of examples? If deception appears in, say, ≥50% of the examples, does the model learn deception?) See the toy sketch below for what pure next-token prediction does with mixed data.
recomposing deception from unrelated examples: here the idea is that the components of deception are never all present together, so deception itself never appears in the data, but each component is present separately, and they can be recomposed to create deception;
what does deception consist of?
by virtue of deception being a pretty “natural” thing to learn, such that the difficult thing is to not learn deception
children learn to lie relatively early. It seems plausible that they “stumble” onto deception because of the shape of the problem – perhaps wanting to get something without giving up a resource – and it gets reinforced because it works. But punishing children who lie does not make them unlearn the lying behaviour; it just places a taboo on it. Children grow up into adults who know that lying is an option, and in situations where lying might be undetectable, or is high-reward and worth the risk, they lie anyway.
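To make the next-token-prediction question above concrete: a pure maximum-likelihood next-token predictor doesn’t face a choice between “learning honesty” and “learning deception”; it learns the mixture. Below is a minimal sketch, assuming a bigram-style count model stands in for the LM; the context and token names are made up for illustration.

```python
# Toy illustration (not a claim about any real LM or training setup):
# a maximum-likelihood next-token predictor trained on a mix of honest
# and deceptive continuations learns the empirical mixture. The rarer
# behaviour keeps probability mass proportional to its frequency; there
# is no special threshold at 50%.
from collections import Counter
import random

random.seed(0)

def next_token_probs(corpus):
    """MLE next-token distribution for a single fixed context."""
    counts = Counter(next_tok for _ctx, next_tok in corpus)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

for deceptive_fraction in [0.01, 0.1, 0.5, 0.9]:
    corpus = [
        ("asked_about_broken_vase",
         "lie" if random.random() < deceptive_fraction else "honest_answer")
        for _ in range(10_000)
    ]
    probs = next_token_probs(corpus)
    print(deceptive_fraction, round(probs.get("lie", 0.0), 3))

# Learned P("lie" | context) tracks the training proportion (~0.01, 0.1,
# 0.5, 0.9) rather than snapping to "always honest" below 50% or
# "always deceptive" above it.
```

On this (very simplified) picture, the model “learns deception” at any nonzero proportion, in the sense that deceptive continuations stay available and get sampled at roughly their training frequency; what changes with the proportion is how often, not whether.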
If search is not the thing which enables deception, then what sorts of properties make a system capable – at least in principle – of being deceptive?
retargetability: a system can be pointed at, and be competent at, more than just the thing for which it was designed. Thermostats are not retargetable – they do just one thing, no matter how competently.
but potentially there are systems which aren’t retargetable and yet are sufficiently general within their domain that deception is still possible. That is, if you had a thermostat-like thing which did much more than just adjust temperature, but was still incompetent outside its domain, would that be capable of deception?
I think systems that aren’t retargetable could still learn deception, but it’s unclear to me that they would try to use it unless they had goals outside their current domain. Or is the assumption that they would use it in-domain, i.e. in the thing they “care about”?
competence: general capability is definitely a factor in how likely a system is to be deceptive, but in what way? Leo seems to think that it’s more subtle than raw capability, and that specifically, you need a system which is:
situationally/strategically aware: the system knows it is being trained, and it has a model of the world that includes humans
goal-directed, and has some goal: the goal doesn’t need to be “about the outside world” — whatever that means. Any goal is sufficient for instrumental convergence-type concerns to kick in; even if all I wanted was to copy a strawberry, I’d still care about gaining resources, not being stopped (shut-down), maintaining my goal, etc.
able to reason about outside events: e.g. the RSA-2048 scenario, where a treacherous turn is keyed to an event (seeing a factorization of RSA-2048) that is hard or impossible for humans to simulate during training; a toy sketch of this follows below
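To make the RSA-2048 example concrete, here is a minimal sketch of the kind of conditional policy being imagined. Everything in it is illustrative: the observation format and names are made up; the point is only that the defection branch is keyed to an event (a valid factorization of the RSA-2048 modulus) that trainers cannot feasibly produce, so the branch is never exercised, and never penalised, during training.

```python
# Sketch of the RSA-2048 thought experiment: a policy that cooperates
# unless it observes a non-trivial factorization of the RSA-2048 modulus.
# Producing such a factorization is, as far as anyone knows, infeasible
# today, so adversarial training can never reach the treacherous branch,
# yet the policy can expect the trigger to show up eventually in deployment.
# The observation format and action names are hypothetical.

def is_nontrivial_factorization(modulus: int, p: int, q: int) -> bool:
    """True iff p and q are non-trivial factors of modulus."""
    return 1 < p < modulus and 1 < q < modulus and p * q == modulus

def policy(observation: dict, rsa_2048_modulus: int) -> str:
    p, q = observation.get("claimed_factors", (0, 0))
    if is_nontrivial_factorization(rsa_2048_modulus, p, q):
        # Never reached in training: no trainer can supply valid factors.
        return "treacherous_action"
    # Always reached in training, so the policy looks perfectly aligned.
    return "cooperative_action"

# (The actual 2048-bit modulus is omitted; any large semiprime whose factors
# are unknown would do for the illustration.)
```

The point of the sketch is that no amount of behavioural evaluation during training distinguishes this policy from a genuinely cooperative one; the distinguishing input is precisely the one humans cannot construct.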