If you know a lot about passwords, you probably know that it’s good to choose a password from a distribution with a lot of “entropy”. The (Shannon) entropy of a distribution is defined as:

$$H(X) = -\sum_x p(x) \log_2 p(x)$$

where the sum ranges over the possible passwords $x$, and $p(x)$ is the probability of choosing $x$.
This definition is an answer to the question “how much information is encoded in a chosen password, on average?” This is also the approximate number of bits we need to specify which password we picked. An adversary would need to try at least $2^{H(X)-2}$ passwords in order to guess yours, on average.[1]
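If it helps to see the definition as code, here’s a minimal Python sketch (mine, not from the original argument; the function name is just for illustration):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H(X) = -sum_x p(x) * log2(p(x))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform choice among 8 equally likely passwords carries log2(8) = 3 bits:
print(shannon_entropy([1/8] * 8))  # 3.0
```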
This is typically how people think about choosing strong passwords, and it works well in the case where we’re choosing among equally likely passwords. But it breaks horribly when we allow our distribution to be non-uniform. Because choosing good passwords is about memorability as well as sheer strength, we might sometimes want to use complicated password selection schemes that don’t necessarily give uniformly distributed selections, and Shannon entropy can really screw up here.
A comically bad password selection scheme
Consider the following password selection algorithm:
- Roll a 20-sided die.
- If the die comes up 20, choose a random password uniformly from $2^N$ possibilities.
- Otherwise, choose the password `password`.
Of course, this is a terrible password selection strategy. 19 times out of 20, your password will be the very first one the attacker tries, if they know your strategy. But the entropy here doesn’t reflect that - for example, if we pick a random 1000-character lowercase password when we get a 20, this has an overall entropy of about 235 bits. Shannon entropy tells us that we have an extremely strong password, since on average it will take an enormous number of guesses to crack. But of course in the typical case, it will take no time at all.
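We can sanity-check that 235-bit figure without enumerating all $26^{1000}$ high-roll passwords, because they’re all equally likely and their contribution to the sum collapses into a single term. A quick sketch (the constant names are mine):

```python
import math

P_COMMON = 19 / 20  # rolls 1-19: the password is just "password"
# log2 of each individual random password's probability, (1/20) / 26**1000.
# (Computed in log space; the probability itself underflows a float.)
LOG2_P_RARE = math.log2(1 / 20) - 1000 * math.log2(26)

# H = -p_common * log2(p_common) + (1/20) * -log2(p_rare), since the
# 26**1000 equally likely rare passwords share 1/20 of the probability mass.
entropy = -P_COMMON * math.log2(P_COMMON) - (1 / 20) * LOG2_P_RARE
print(entropy)  # ≈ 235.3 bits
```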
Clearly something is wrong here. How can we fix it?
Min-entropy
The first thing to notice is that, as we’ve already implied, we’re engaged in a game of strategy with any would-be password crackers. We pick some strategy for choosing passwords, and then our opponent picks a strategy for guessing them. Our opponent’s best response is going to be to try the most likely passwords first.
So how should we really measure our password selection strategies? One way would be to use the information content of the most likely password, instead of the average information content:

$$H_{\min}(X) = -\log_2 \max_x p(x)$$
This is called the min-entropy, and it’s nice for several reasons, not least because when $X$ is uniform, $H_{\min}(X) = H(X)$.
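In code (again a sketch of mine, not from the post), min-entropy depends only on the single most likely outcome:

```python
import math

def min_entropy_bits(max_prob):
    """Min-entropy in bits: H_min(X) = -log2(max_x p(x))."""
    return -math.log2(max_prob)

# Uniform over 2**40 passwords: min-entropy matches Shannon entropy at 40 bits.
print(min_entropy_bits(2 ** -40))  # 40.0

# The d20 scheme: "password" has probability 19/20, so min-entropy is
# about 0.074 bits, no matter how strong the rare passwords are.
print(min_entropy_bits(19 / 20))   # ≈ 0.074
```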
An issue with this method, though, is that it doesn’t let us distinguish the d20 scheme above with 1000-character passwords on a high roll from a similar scheme with only 3-character passwords on a high roll: both have min-entropy $-\log_2(19/20) \approx 0.07$ bits, since the most likely password is `password` either way. Even though the scheme is bad in both cases, it’s still better in the former case than the latter, if only by a little.
But I think the most important thing about the min-entropy is that it gives us a conservative estimate that we can use to compare password selection schemes, while Shannon entropy gives something more like a central estimate. If the min-entropy is $k$ bits, then every password has probability at most $2^{-k}$, so an attacker’s chance of success after $m$ guesses is at most $m \cdot 2^{-k}$; we can feel really confident that it will take on the order of $2^k$ guesses to crack our typical passwords, even if the distribution isn’t uniform. And since min-entropy never exceeds Shannon entropy, if one scheme’s min-entropy is better than another’s Shannon entropy, we can feel secure that the first scheme is better.
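As a toy comparison (my own numbers, not from the post): suppose scheme A picks uniformly from a million passwords, and scheme B is the d20 scheme with a 3-character lowercase tail. A’s min-entropy beats B’s Shannon entropy, so A wins outright:

```python
import math

def shannon_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy_bits(probs):
    return -math.log2(max(probs))

scheme_a = [1 / 10**6] * 10**6                      # uniform over a million passwords
scheme_b = [19 / 20] + [(1 / 20) / 26**3] * 26**3   # d20 scheme, 3-char lowercase tail

print(min_entropy_bits(scheme_a))  # ≈ 19.93 bits
print(shannon_bits(scheme_b))      # ≈ 0.99 bits
# Since H_min <= H for any distribution, A's 19.93-bit min-entropy beating
# B's 0.99-bit Shannon entropy guarantees A is the stronger scheme.
```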
Thanks to Adam Scherlis for pointing out the issue with Shannon entropy, Katja Grace for discussions about the problem, and Thomas Sepulchre for pointing out a major flaw in the original post.
[1] See Massey's (very short) "Guessing and Entropy" for a discussion of this lower bound.