Reply to Desiderata for an Adversarial Prior

 

Assertion: An Ideal Agent never pays people to lie to them.

This seems sensible: only a very foolish person would knowingly incentivise dishonesty in others. But what does it actually mean in practice?

  1. You can't use unverifiable information obtained from a single person or from a faction of possibly-conspiring people in any way that benefits that person or faction in the hypothetical where the information is false. Otherwise, they're incentivised to give you the unverifiable and false information to motivate you to do that, and so you'd be paying them to lie to you.
  2. You can't use any information, even verifiable information, obtained from a single person or from a faction of actually-conspiring people in any way that harms that person or faction in the hypothetical where it is true. Otherwise they'd just not tell you, and you'd be paying them to dishonestly shut up.

If everyone followed (2), you could freely go around telling the truth no matter what and expect no personal negative consequences. This would maximise public knowledge (your information can still be used to the benefit or detriment of other people), and people would be better off:

  • Criminals would confess their crimes in detail, in exchange for payment in the form of a reduced sentence if they do get caught, so as not to actually incentivise crime. This information can then be used to help protect against, catch, or convict other criminals; it just can't be used against the specific person who confessed it in the first place.
  • To the extent that criminals want to form groups that can work together, they can formally notify the police of their criminal conspiracies and then any information from a member can't be used against the rest (unless a member formally betrays the conspiracy, in which case their testimony is fair game again).

If everyone followed (1), lying would be pointless because even though everyone believes you, they'll never believe you in a way that corresponds to doing something that benefits you. This condition leads to a lot of much stranger outcomes:

  • Claims made by salespeople must be ignored unless you've got a way to actually test the claim, and if you can only test it in the future (after you've already bought the product), you're honor-bound to return it and demand a refund if anything claimed about the product (that your decision to buy it depended on) turns out to be false.
  • School fire-alarms pulled by an unknown person who possibly wants school to be interrupted must be ignored, unless you've got a punishment system set up with sufficient probability of figuring out who did it and hurting them enough that they aren't better off pulling the alarm. If some people really want school to be interrupted, it's sufficiently hard to figure out who did it, or you're unwilling to torture children, this means that school fire-alarms are basically always ignored. The odds that someone is lying vs. there actually being a fire are immaterial.
  • Pascal's mugger is ignored regardless of your prior probability he's telling the truth.
  • Unless you can personally verify enough of the theory about AI alignment being hard, and what might be done about that, you must refuse to donate to protect against the possible conspiracy of all AI safety researchers working together to pretend a problem exists to get donations. If that example is from the wrong tribe, consider living in medieval Europe and wondering whether there really is an afterlife or every single theologian is conspiring to trick you into donating to the church.
  • Unless you can personally verify that the entire population of the world is not conspiring against you to trick you about the fair prices of services, and therefore extract valuable labour from you, then you should refuse to ever do anything that might benefit anyone else.

This seems particularly terrible, and the whole "refusing to be exploitable regardless of prior probability" seems like a step way too far. It's the kind of logic that leads to saying "since the murderer won't confess, and I want the murderer to be executed, we'll just have to execute everyone to make sure he didn't benefit by not confessing". That's a lot of utility you're destroying just to avoid ever paying people to lie to you.

If we consider the two piles of utility:

  • Value I expect to lose by refusing to ever trust anyone, even though the worst most adversarial situations I can think of are almost never actually the case.
  • Value I expect to lose by failing to be completely inexploitable at all times, and therefore at least sometimes being exploited.

It would seem like there's some ideal resistance-to-exploitation threshold that minimises total expected utility lost. If someone is sufficiently unable to verify things themselves, the price of believing anyone ever is the corresponding utility of setting a threshold that lets you believe them, and the expected exploitation you'll be exposed to as a result.

Naturally, other people can't help you pick this threshold except with arguments you can personally verify, because they're obviously incentivised to convince you to be more trusting so that you'll believe / be exploitable by them.
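
As a rough illustration of the two piles of utility above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Claim` fields, the example numbers, and the rule "trust a claim only if your prior that it is honest is at least the threshold". It also ignores the fact that liars would adapt to whatever threshold you pick, so it is a toy model of the trade-off rather than a real adversarial prior.

```python
from typing import List, NamedTuple


class Claim(NamedTuple):
    """A hypothetical unverifiable claim (all fields are illustrative assumptions)."""
    p_honest: float       # your prior probability that the claim is true
    gain_if_true: float   # value gained by acting on it when it is true
    loss_if_false: float  # value lost by acting on it when it is false


def expected_loss(claims: List[Claim], threshold: float) -> float:
    """Expected value lost when trusting only claims with p_honest >= threshold."""
    total = 0.0
    for c in claims:
        if c.p_honest >= threshold:
            # Second pile: we trusted the claim and it may have been a lie.
            total += (1.0 - c.p_honest) * c.loss_if_false
        else:
            # First pile: we refused the claim and it may have been true.
            total += c.p_honest * c.gain_if_true
    return total


def best_threshold(claims: List[Claim]) -> float:
    """Pick the candidate threshold with the smallest expected loss."""
    # 0.0 means "trust everyone"; infinity means "trust no one".
    candidates = sorted({c.p_honest for c in claims} | {0.0, float("inf")})
    return min(candidates, key=lambda t: expected_loss(claims, t))


# Made-up example claims, purely for illustration.
EXAMPLE_CLAIMS = [
    Claim(p_honest=0.99, gain_if_true=10.0, loss_if_false=20.0),
    Claim(p_honest=0.6, gain_if_true=5.0, loss_if_false=5.0),
    Claim(p_honest=0.3, gain_if_true=2.0, loss_if_false=10.0),
    Claim(p_honest=0.001, gain_if_true=1000.0, loss_if_false=5.0),
]

if __name__ == "__main__":
    t = best_threshold(EXAMPLE_CLAIMS)
    print("best threshold:", t)
    print("expected loss at that threshold:", expected_loss(EXAMPLE_CLAIMS, t))
```

In this toy model, both extremes (trusting everyone and trusting no one) lose more expected value than an intermediate threshold does, which is the shape of the argument above.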


Interesting. In practice I try (not that successfully) to not punish people who tell me the truth. It requires reframing insults and bad news in a positive light, which is hard.

I think in practice people don't really listen to most sales pitches, and pitches that involve something objective and verifiable do better.  Alex Hormozi talks about a method of putting metrics into advertising like "X% of our customers last month increased their revenue by Y%" - it's just literally true, can be checked, and cannot be copied by your competitors unless they are actually better than you.

Interesting post, thank you!

So, just to be clear.

This post explores the hypothetical/assertion that "an ideal agent shouldn't incentivize other agents to lie to it, by believing their lies".

However, it turns out that the costs of being completely inexploitable are higher than the costs of being (at least somewhat) exploitable.

It all adds up to normality: there is a rational reason we, at least occasionally, believe what other people say. The assertion above has been disproven. 

Have I correctly understood what you are saying?

(I apologize if this conclusion is meant to be obvious; I found myself somewhat confused whether that indeed is the conclusion, so I would like to verify my understanding)

Yes, that's the intended point, and probably a better way of phrasing it. I am concluding against the initial assertion, and claiming that it does make sense to trust people in some situations even though you're implementing a strategy that isn't completely immune to exploitation.

Assertion: An Ideal Agent never pays people to lie to them.

 

What if an agent has built a lie-detector and wants to test it out? I expect that's a circumstance where you want someone to lie to you consistently and on demand.

What's the core real-world situation you are trying to address here?

I can think instantly of at least two useful cases where a fully rational intelligent person fully informed of the situation and premeditating it, would nevertheless still want to pay people to lie to them; and not in any tendentious meaning of 'lie' ("you pay artists to lie to you!"), but full outright deception in causing you to believe false facts about them, which you will then always believe*: pentesting and security testing where they deceive you into thinking they're authorized personnel etc, and 'randomized response technique' survey techniques on dangerous questions where a fraction of respondents are directed to eg flip a coin & lie to you in their response so you have false beliefs about each subject but can form a truthful aggregate.

* the pen testers might tell you their real names in the debrief, but don't have to and might not bother since it doesn't matter and you have bigger fish to fry; the survey-takers obviously never will. In neither case do you necessarily ever find out the truth, nor do you need to in order to benefit from the lies.

I don't consider the randomized response technique lying, it's mutually understood that their answer means "either I support X or both coins came up heads" or "either I support Y or both coins came up tails". There's no deception because you're not forming a false belief and you both know the precise meaning of what is communicated.
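
For concreteness, the two-coin scheme described here can be sketched in Python. This is a minimal sketch under stated assumptions: two fair coins, a forced "X" on two heads, a forced "Y" on two tails, and a truthful answer otherwise. The function names (`respond`, `estimate_support`, `simulate`) are hypothetical and exist only for this illustration.

```python
import random


def respond(supports_x: bool, rng: random.Random) -> str:
    """One respondent's answer under the assumed two-coin scheme."""
    coin_a_heads = rng.random() < 0.5
    coin_b_heads = rng.random() < 0.5
    if coin_a_heads and coin_b_heads:
        return "X"  # both heads: forced "X" regardless of true view
    if not coin_a_heads and not coin_b_heads:
        return "Y"  # both tails: forced "Y" regardless of true view
    return "X" if supports_x else "Y"  # mixed coins: truthful answer


def estimate_support(fraction_answering_x: float) -> float:
    """Recover the true support for X from the observed fraction of "X" answers.

    With true support p, P(answer "X") = 1/4 + p/2, so p = 2 * fraction - 1/2.
    """
    return 2.0 * fraction_answering_x - 0.5


def simulate(true_support: float, n: int, seed: int = 0) -> float:
    """Simulate n respondents and return the estimated support for X."""
    rng = random.Random(seed)
    answers = [respond(rng.random() < true_support, rng) for _ in range(n)]
    return estimate_support(answers.count("X") / n)
```

No individual answer reveals that respondent's view for certain, yet the aggregate estimate converges on the true level of support as the number of respondents grows.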

I don't consider penetration testing lying, you know that penetration testers exist and have hired them. It's a permitted part of the cooperative system, in a way that actual scam artists aren't.

What's a word that means "antisocially deceive someone in a way that harms them and benefits you" such that everyone agrees it's a bad thing for people to be incentivised to do? I want to be using that word but don't know what it is.

"Deceive" sounds fine. I think the anti-social is implied - in fact, I have trouble coming up with an example of pro-social deceiving. Well, maybe variants of the old "hiding Jews from Nazis" example.

Not sure what's unclear here? I mean that you'd generally prefer not to have incentive structures where you need true information from other people and they can benefit at your loss by giving you false information. Paying someone to lie to you means creating an incentive for them to actually deceive you, not merely giving them money to speak falsehoods.

They commented without reading the post I guess...