This should be fine. In past years, Scott has had an interface where you could enter your email address and get your score. So the ability to find out other people's scores by knowing their email address is apparently not an issue. And it makes sense to me that one's score in this contest isn't particularly sensitive private information.
Source: Comment from Scott on the ACX post announcing the results
I think this would provide security against people casually accessing each other's scores but wouldn't provide much protection against a determined attacker.
Some problems:
- Email addresses are highly patterned (first.last at a handful of big providers, known LessWrong usernames at predictable domains), so a determined attacker can enumerate candidate addresses, hash each one, and check for matches in the published document.
- Anyone matched that way has both their participation in the contest and their score exposed.
A better solution:
- Assign each participant a random ID, email the ID to them, and publish the scores keyed only by that random ID.
- (And remove anything else that could be used to reconstruct users, like jobs/locations/etc. if relevant.)
I realized after writing this that you meant that people's email addresses are private but their scores are public if you know their email. I'd default to not exposing people's participation and scores unless they expected that to happen, but maybe that's less of an issue than I was thinking. The predictability of LessWrong emails still would expose a lot of email addresses.
I'd still recommend the random ID solution though since it's trivial to reason about (it's basically a one-time-pad).
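A minimal sketch of what that random-ID scheme could look like, using Python's secrets module and made-up addresses and scores purely for illustration:

```python
import secrets

# Made-up participant data, purely for illustration.
scores = {"alice@example.com": 0.21, "bob@example.com": 0.35}

# One unguessable random ID per participant. Nothing about the ID is
# derived from the address, so the published table leaks nothing on its own.
ids = {addr: secrets.token_hex(8) for addr in scores}

published = {ids[addr]: score for addr, score in scores.items()}  # safe to publish
notifications = {addr: ids[addr] for addr in scores}              # each ID emailed privately to its owner
```

The cost, as the reply below notes, is that distributing the IDs still means sending one email per participant.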
Thanks for your input. Ideally we wouldn't have to go through an email server, though it may just be required to reach a certain level of security.
As for the patterns, the nice thing is that with a small output space in the millions, there are tons of overlapping plausible addresses even if you pin it down to a single domain. Every English first-and-last-name combination, even without any numbers, already gives far more than 10 million addresses, meaning even targeted domains should have plenty of collisions.
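To put rough numbers on that (the name counts are assumed for illustration, not measured): a few thousand common first names times on the order of a hundred thousand surnames already gives hundreds of millions of candidate addresses at a single domain, so every value in a ~10-million-slot hash space is hit by dozens of plausible names:

```python
# Illustrative numbers only (assumed, not measured): how many plausible
# first.last@domain addresses land on each truncated hash value.
first_names = 5_000
surnames = 100_000
hash_space = 10_000_000

candidates = first_names * surnames       # name combos at one domain
per_value = candidates / hash_space       # plausible names per hash value

print(f"{candidates:,} candidate addresses at a single domain")   # 500,000,000
print(f"~{per_value:.0f} plausible names per hash value")         # ~50
```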
To make sure I understand this concern:
"It may be better to use a larger hash space to avoid internal (in the data set) collisions, but then you lower the number of external collisions."
Are you thinking someone may want plausible deniability? "Yes, my email hashes to this entry with a terrible Brier score but that could've been anyone!"
Plausible deniability, yes; reason-agnostic. It's hard to know why someone might not want to be known to have their address here, but with my numbers above they would have statistical backing: roughly 1 in 1,000 addresses will appear to be in the set by chance, so someone who wants to deny it could say "for every address actually in the set, about 1,000 will appear to be, so there's only a ~1/1000 chance I actually took the survey!" (Naively, of course; rest in peace, rationalist@lesswrong.com.)
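Spelling out that naive arithmetic with the round numbers from the original post (~10,000 participants, a ~10-million-value hash space, ~8 billion addresses worldwide):

```python
# Naive plausible-deniability arithmetic, using the post's round numbers.
participants = 10_000            # real addresses hashed into the document
hash_space = 10_000_000          # truncated hash output space
world_addresses = 8_000_000_000  # rough number of email addresses worldwide

occupied_fraction = participants / hash_space           # 0.1% of hash values are "in the set"
apparent_members = world_addresses * occupied_fraction  # ~8,000,000 addresses look like members

print(f"{occupied_fraction:.1%} of hash values are occupied")
print(f"~{apparent_members:,.0f} addresses hash into the set by chance")
print(f"so only ~1 in {apparent_members / participants:.0f} matching addresses is a real participant")
```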
They say you shouldn't roll your own encryption, which is why I'm posting this here, so it can be unrolled if it's too unsafe.
Problem: Astral Codex Ten finished scoring the 2023 prediction results, but the primary identifier for people's scores was their email address. Since people wouldn't want those published, what's an easy way to get everyone their score?
You could email everyone, but then you have to interact with an email server, and then nobody can do cool analysis of the scores and whatever other data is in the document.
My proposal:
Replace each email address in the document with a hash of that address, truncated to an output space of roughly 10 million values, and publish the scores keyed by the truncated hash instead of the raw address.
If you know the email address of a participant, it's trivial to check their score. And if you forgot which email address you used, just try each one! Odds are you will not have had a collision.
But at the same time, with ~8 billion email addresses worldwide, any given hash in the document should collide with roughly 1,000 other email addresses (8 billion addresses spread over a ~10-million-value hash space is ~800 per value), meaning you can't just brute force and recover each person's address. The 10,000 real addresses occupy only ~0.1% of the hash space, so out of the 8 billion real addresses you could try, ~8 million will hash to a value that appears in the document, but only 10,000 of those (~0.1%) are the originals. So an address whose hash appears in the document is highly unlikely to be the actual address of a participant.
If there are a few victims of the birthday paradox, they could probably just email to ask which line in the document is theirs. It may be better to use a larger hash space to avoid internal (in the data set) collisions, but then you lower the number of external collisions. My back-of-the-envelope estimate gives at least several collisions with a 10 million output space; 100 million makes it 0 or 1.
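For reference, the standard birthday-problem approximation (expected colliding pairs ≈ n²/2m for n items in m slots) matches that back-of-the-envelope estimate:

```python
# Expected internal collisions among ~10,000 hashed addresses,
# via the birthday approximation E[colliding pairs] ≈ n^2 / (2m).
n = 10_000  # participants

for m in (10_000_000, 100_000_000):  # candidate hash-space sizes
    print(f"hash space {m:>11,}: ~{n * n / (2 * m):.1f} expected colliding pairs")
# hash space  10,000,000: ~5.0 expected colliding pairs
# hash space 100,000,000: ~0.5 expected colliding pairs
```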
Which hash? Not sure. Maybe SHA256 then just delete N characters off the end until the space is ~10,000,000?
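As a concrete sketch of that reading (my interpretation, not a spec from the post): keep the first 6 hex characters of the SHA-256 digest, which gives 16^6 ≈ 16.8 million possible values, in the right ballpark of ~10 million. The normalization step and the example addresses and scores below are my own assumptions:

```python
import hashlib

HEX_CHARS = 6  # 16**6 ≈ 16.8 million values, roughly the ~10 million target

def email_key(email: str) -> str:
    """Truncated SHA-256 of a normalized email address."""
    normalized = email.strip().lower()  # assumed normalization; the post doesn't specify
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:HEX_CHARS]

# Publishing: replace each address with its truncated hash (made-up data).
scores = {"alice@example.com": 0.21, "bob@example.com": 0.35}
published = {email_key(addr): brier for addr, brier in scores.items()}

# Looking yourself up: hash the address you think you used and check for it.
print(published.get(email_key("alice@example.com")))  # 0.21
```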
Please discuss how safe/unsafe this is. Thanks for your time.