When is your birthday? The math behind hash collisions
0xkrt26.github.io | Comments
> What if I told you that in a room with only 23 people there’s already a 50% chance for two of them to have matching birthdays?
I guess it's the subject shift from _you_ to _any two people from a group_ that creates the surprise in the birthday paradox. You definitely need way more than 23 randomly sampled people to get to a high probability that _you_ specifically share a birthday with one of them, and the result does not contradict that notion.
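To put rough numbers on that distinction, here is a minimal sketch assuming 365 equally likely birthdays and no leap days. The function names `p_any_pair` and `p_match_you` are made up for illustration.

```python
def p_any_pair(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_no_match = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match


def p_match_you(others, days=365):
    """Chance that at least one of `others` people shares *your* birthday."""
    return 1.0 - ((days - 1) / days) ** others


# Smallest number of other people giving you, specifically, a 50% chance.
others_needed = 1
while p_match_you(others_needed) < 0.5:
    others_needed += 1

print(f"any pair among 23 people:        {p_any_pair(23):.3f}")
print(f"someone matches you (22 others): {p_match_you(22):.3f}")
print(f"others needed for your own 50%:  {others_needed}")
```

Under those assumptions the "any pair" question and the "you specifically" question give very different answers for the same room, which is the subject shift in action.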
I think even given that premise, the "50% probability" is still a bit of a rug pull. The casual listener still believes the problem should address the 100% match.
A more honest approach is to plainly ask how many people have to be at a party to guarantee there are at least two people with the same birthday. Even to the layman, the answer is 366 of course (ignoring leap days). Follow that, though, with, "And how many people will have had to arrive for there to be a 50% likelihood that two people at the party have the same birthday?"
To go from 366 to 23 I think is a surprise to many people. Because humans suck at probability, most people might instinctively assume half of 366 (183). So it becomes a surprise how low (less than two dozen!) it really is.
My own "drunk walk" to making sense of the small number: when two people are at the party, it is intuitive to me that there is a 1 in 365 chance they will have the same birthday. As soon as a 3rd person arrives, though, there are two partygoers they might match, so the newcomer's odds of a match have just doubled! :-) I understand though that the 4th person arriving does not double the newcomer's odds again but nonetheless increases them by 50%.
Suddenly I can now see a kind of asymptotic curve that, when we get to 366, will at last cross the threshold for 100% probability. But the asymptotic nature makes it clear to me that it will cross the 50% mark much sooner than would a linear growth. I am already convinced at this point that your 23 number is probably a pretty good one.
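That arrival-by-arrival intuition can be sketched as a running product, again assuming 365 equally likely birthdays. `match_curve` is a hypothetical helper name, not anything from the article.

```python
DAYS = 365


def match_curve(max_people, days=DAYS):
    """Return [(n, P(at least one shared birthday among n people))]."""
    curve = []
    p_no_match = 1.0
    for n in range(1, max_people + 1):
        # The n-th arrival has n-1 earlier guests to match, so (assuming no
        # match so far) it avoids all of them with probability
        # (days - (n - 1)) / days.
        p_no_match *= (days - (n - 1)) / days
        curve.append((n, 1.0 - p_no_match))
    return curve


curve = match_curve(366)
first_over_half = next(n for n, p in curve if p >= 0.5)

for n, p in curve[:5]:
    print(f"{n:3d} people: {p:.4f}")
print("first party size at or above 50%:", first_over_half)
print("probability at 366 people:", curve[-1][1])
```

Each newcomer has one more person to collide with than the previous one, so the chance of "no match yet" shrinks faster and faster instead of linearly.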
..or is it "sub-par"?
https://www.pcg-random.org/posts/birthday-test.html
Example:
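Any RNG with a period of 2**32 that can output every 32-bit value at least once must have zero collisions for the first 2**32 outputs, but from a truly random 32-bit source we would expect to see about 5 collisions after just 200k outputs (roughly n^2 / 2N with N = 2**32), and about 100 after roughly 930k outputs.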
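For anyone who wants to check the expected count themselves, here is a rough sketch using the standard birthday approximation: n(n-1)/(2N) colliding pairs for n uniform, independent draws from N values, which is only reasonable while n is much smaller than N. The function names are hypothetical.

```python
import math


def expected_collisions(n, space):
    """Approximate expected number of colliding pairs among n uniform draws."""
    return n * (n - 1) / (2 * space)


def outputs_for_collisions(target, space):
    """Approximate number of draws needed to expect `target` collisions."""
    # Inverts n^2 / (2 * space) = target; ignores the small n vs n-1 difference.
    return math.ceil(math.sqrt(2 * space * target))


SPACE = 2**32  # number of distinct 32-bit outputs

print(f"expected collisions in 200k outputs: {expected_collisions(200_000, SPACE):.2f}")
print(f"outputs for ~1 expected collision:   {outputs_for_collisions(1, SPACE)}")
print(f"outputs for ~100 expected collisions: {outputs_for_collisions(100, SPACE)}")
```

A generator that shows zero repeats where this estimate predicts several is behaving like a permutation rather than like a random source, which is the point of the linked birthday test.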
If you're a twin and your twin sibling is standing next to you, nearly 100%. But not exactly 100%: there have been cases of twins born on either side of midnight ending up with birthdays that differ by a day. (I don't personally know of any twins born on either side of midnight between Dec 31st and Jan 1st, who would then have different calendar years in their birthdays, but odds are very good that it has happened at least once in human history).
For extra fun, have them be born on opposite sides of the International Date Line, crossing west-to-east so that the younger twin (born on the east side of the line) is born on (say) July 1st at 8:00 AM local time, while the older twin (born fifteen minutes earlier on the west side of the line) is born on July 2nd at 8:45 AM local time.
For extra EXTRA fun, have them be born on opposite sides of the International Date Line on opposite sides of midnight, AND as the calendar ticks over from Dec 31st to Jan 1st. It gets really, really confusing. Though thankfully, I would bet money that particular example is contrived enough that it has never happened in real life.
Ask HN: We just had an actual UUID v4 collision...