We get no collisions this way, while just using a random number permits collisions. Thus, going by m is better than going by the first random number.

If the only way you use the unique data is to feed it into a hash, you might as well be using a random number. If you get different results from the hash than from randomness, you've broken the hash.

Comment author:pengvado
14 November 2009 03:04:37AM
*
1 point
[-]

If you get different results from the hash than from randomness, you've broken the hash.

That was my first impression too. But... isn't a hash considered to be cryptographically broken if you have a process that finds any collisions at all? Distinguishing based on the frequency of collisions (if that frequency is high enough to measure) is superfluous.

edit: removed the rest of my argument, which was just equivalent to truncating hash values.

If you get different results from the hash than from randomness, you've broken the hash.

That was my first impression too. But... isn't a hash considered to be cryptographically broken if you have a process that finds any collisions at all? Distinguishing based on the frequency of collisions (if that frequency is high enough to measure) is superfluous.

Yes, if you have few enough agents / work-flow that you're in P, then it is extremely unlikely that there will be absolute collisions, whether collisions of random numbers or hash values. But that chance is the same. You can break a hash through dumb luck! If you have lots of agents, then cryptographic security of the hash doesn't apply.

But we're not talking about absolute collisions of hashes of id numbers. In talking about hashes, the assumption is that the space of id numbers and hash values are big and the space of problems we're working on is not larger. When we truncate the hash values to get work items, that's when we get collisions, at exactly the same rate as if it were random.

Comment author:gwern
14 November 2009 03:55:41PM
*
1 point
[-]

If the only way you use the unique data is to feed it into a hash, you might as well be using a random number. If you get different results from the hash than from randomness, you've broken the hash.

The randomness is not as good. Even if the randomness is different from agent to agent, we still can get collisions. If we take the unique aspect of the agent, then by definition it isn't shared by other agents and so we can avoid any collision with other agents:

We can then come up with a injective mapping from the agent's number to the indexes of the array, using a hash, for example, or perhaps just taking the number as a straight index.

A hash doesn't have to collide; it only has to have collisions (by the Pigeonhole Principle) if the end hash is 'smaller' than the input. If I'm using SHA512 on data that is always less than 512 bits, then I'll never get any collisions. (Let's assume SHA512 is a perfect hash; if this bothers you, replace 'SHA512' with '$FAVORITE_PERFECT_HASH'.) But using the hash isn't essential: we just need a mapping from agent to index. 'Hash' was the first term I thought of.

(And the XOR trick lets us make use of a injective mapping regardless of whether it's the randomness or agent-related-data that is unique; we XOR them together and get something that is unique if either was.)

Comment author:pengvado
16 November 2009 06:48:56AM
3 points
[-]

A hash doesn't have to collide; it only has to have collisions (by the Pigeonhole Principle) if the end hash is 'smaller' than the input. If I'm using SHA512 on data that is always less than 512 bits, then I'll never get any collisions.

How do you know which aspect of the agent is unique, without prior communication? If it's merely that agents have so many degrees of freedom that there's a negligible probability that any two agents are identical in all aspects, then your hash output is smaller than its input. Also, you can't use the 2^512 figure for SHA-512 unless you actually want to split a 2^512 size array. If you only have, say, 20 choices to split, then 20 is the size that counts for collision frequency, no matter what hash algorithm you use.

we XOR them together and get something that is unique if either was.

If hash-of-agent outputs are unique and your RNG is random, then the XOR is just random, not guaranteed-unique.

Comment author:gwern
16 November 2009 04:00:16PM
*
1 point
[-]

How do you know which aspect of the agent is unique, without prior communication? If it's merely that agents have so many degrees of freedom that there's a negligible probability that any two agents are identical in all aspects, then your hash output is smaller than its input.

If your array is size 20, say, then why not just take the first x bits of your identity (where 2^x=20)? (Why 'first', why not 'last'? This is another Schelling point, like choice of injective mapping.)

If hash-of-agent outputs are unique and your RNG is random, then the XOR is just random, not guaranteed-unique.

This is a good point; I wasn't sure whether it was true when I was writing it, but since you've pointed it out, I'll assume it is. But this doesn't destroy my argument: you don't do any worse by adopting this more complex strategy. You still do just as well as a random pick.

(Come to think of it: so what if you have to use the full bitstring specifying your uniqueness? You'll still do better on problems the same size as your full bitstring, and if your mapping is good, the collisions will be as 'random' as the RNGs and you won't do worse.)

## Comments (99)

OldIf the only way you use the unique data is to feed it into a hash, you might as well be using a random number. If you get different results from the hash than from randomness, you've broken the hash.

*1 point [-]That was my first impression too. But... isn't a hash considered to be cryptographically broken if you have a process that finds any collisions at all? Distinguishing based on the

frequencyof collisions (if that frequency is high enough to measure) is superfluous.edit: removed the rest of my argument, which was just equivalent to truncating hash values.

Yes, if you have few enough agents / work-flow that you're in P, then it is extremely unlikely that there will be absolute collisions, whether collisions of random numbers or hash values. But that chance is the same. You can break a hash through dumb luck! If you have lots of agents, then cryptographic security of the hash doesn't apply.

But we're not talking about absolute collisions of hashes of id numbers. In talking about hashes, the assumption is that the space of id numbers and hash values are big and the space of problems we're working on is not larger. When we truncate the hash values to get work items, that's when we get collisions, at exactly the same rate as if it were random.

*1 point [-]The randomness is not as good. Even if the randomness is different from agent to agent, we still can get collisions. If we take the unique aspect of the agent, then by definition it isn't shared by other agents and so we can avoid any collision with other agents:

A hash doesn't have to collide; it only has to have collisions (by the Pigeonhole Principle) if the end hash is 'smaller' than the input. If I'm using SHA512 on data that is always less than 512 bits, then I'll never get any collisions. (Let's assume SHA512 is a perfect hash; if this bothers you, replace 'SHA512' with '$FAVORITE_PERFECT_HASH'.) But using the hash isn't essential: we just need a mapping from agent to index. 'Hash' was the first term I thought of.

(And the XOR trick lets us make use of a injective mapping regardless of whether it's the randomness or agent-related-data that is unique; we XOR them together and get something that is unique if either was.)

How do you know which aspect of the agent is unique, without prior communication? If it's merely that agents have so many degrees of freedom that there's a negligible probability that any two agents are identical in all aspects, then your hash output is smaller than its input. Also, you can't use the 2^512 figure for SHA-512 unless you actually want to split a 2^512 size array. If you only have, say, 20 choices to split, then 20 is the size that counts for collision frequency, no matter what hash algorithm you use.

If hash-of-agent outputs are unique and your RNG is random, then the XOR is just random, not guaranteed-unique.

*1 point [-]If your array is size 20, say, then why not just take the first x bits of your identity (where 2^x=20)? (Why 'first', why not 'last'? This is another Schelling point, like choice of injective mapping.)

This is a good point; I wasn't sure whether it was true when I was writing it, but since you've pointed it out, I'll assume it is. But this doesn't destroy my argument: you don't do any

worseby adopting this more complex strategy. You still do just as well as a random pick.(Come to think of it: so what if you have to use the full bitstring specifying your uniqueness? You'll still do better on problems the same size as your full bitstring, and if your mapping is good, the collisions will be as 'random' as the RNGs and you won't do worse.)