Big Data, Probability and Birthdays: Part 1 of 2

Cardinal Peak’s big data practice is expanding as we continue adding data scientists to our staff. In a recent discussion regarding a data set we’re analyzing, a probability problem conceptually equivalent to the following arose:

In a room filled with N people, what is the probability that none of them have the same birthday?

In our case the room is a neighborhood, the people are video subscribers, and birthdays are TV shows. The solution to this problem merits discussion because it nicely illustrates several important concepts.

I could jump right into equations without providing any background material, but I have a second goal as well: to provide an intuitive and accessible introduction to some key concepts of probability theory. This is, therefore, a two-part blog post. Part One describes probability spaces, while Part Two provides two solutions to the birthday problem: one based on counting and one based on conditional probabilities.

When solving probability problems I encourage you to think of them as arising from an experiment — for example, a coin will be flipped, N people will be selected at random, an electron will be fired through a slit, etc. The outcome of the experiment is uncertain, or you wouldn’t need probability concepts. Your goal is to determine the probability of some experimental result.

birthday cake image

With regards to your experiment, before grabbing paper and pen, you need to come up with a probability space. Probability spaces have three components: a sample space, an event space, and a probability measure. To define an appropriate probability space for your experiment, spend time thinking about three things: What are the fundamental experimental outcomes? What are the experimental results to which you’d like to assign probabilities? What do you know about the likelihood of the fundamental experimental outcomes?Thinking about Question 1 will force you to identify what mathematicians refer to as the sample space. The sample space is the set of all possible experimental outcomes. For example, if you are going to flip a coin 3 times, an experimental outcome can be represented as a vector with three components, each of which can be either H or T (for heads or tails). So one possible experimental outcome is (H, T, H), another outcome is (T, T, H), etc. The sample space is the set of vectors representing all the possible experimental outcomes (in this case there are 8 in total, list them out if this isn’t obvious). For the birthday problem, the sample space is the set of all vectors with N components, where each component is an integer in the range 1 to 365 inclusive. The integer for a particular component indicates on which day of the year that particular person was born (yes, we’re ignoring leap years!). For example, if N=5, one possible outcome is (1, 34, 253, 311, 253), which indicates that Person 1 was born on January 1, Person 2 on February 3, and so forth. Hint: If you can’t specify the sample space you aren’t ready to solve the problem!

Next, think about the experimental result in which you’re interested. In the case of flipping a coin three times perhaps you’re interested in the event where both the first and last coin tosses are heads. In the case of the birthday problem, we’re interested in the event where everyone has a unique birthday (i.e., no two components of the birthday vector are identical). Mathematicians call the experimental results in which we’re interested events. You know you have the sample space nailed if each event you care about corresponds to a set of experimental outcomes, i.e., to a subset of the sample space. For example, in the coin toss case, the set { (H,H,H), (H,T,H) } corresponds to both the first and last toss being a head. In the birthday problem, one member of the set we care about is (25, 126, 67, 200, 345), and another is (112, 18, 235, 357, 198); however, there are too many outcomes in our event to list them all (don’t despair, we’ll still be able to count them). At this stage, it is enough to see that there is a subset of the set of all possible experimental outcomes that corresponds to the event we care about.

The point of Question 2 is to get you thinking about what mathematicians call the event space. Events are the experimental results to which we can assign probabilities. Every event is a subset of the sample space. For a finite sample space, the event space is the set of all subsets of the sample space (in technical jargon, the event space is the power set of the sample space). Caution: I’m simplifying pretty seriously here because the power set idea doesn’t always generalize. Many real-world experiments have a continuous outcome — for example, imagine a spinner used in a board game like Twister. Assume when it stops spinning that it can point to any real number in the interval (0, 1). For continuous outcome experiments like this one, not all subsets of the sample space turn out to be valid events. In other words, we can’t assign a probability to every subset of the sample space (we’re off in the world of measure theory here). Fortunately, for finite sample spaces, we’re on solid ground: we can assign a probability to every subset of the sample space. Beware: If you can’t find a subset of the sample space that corresponds to the event you care about, something is wrong. You need to re-think.

Finally we reach Question 3, which forces us to think about the probability measure. In the case of finite sample spaces, the easiest way to get a probability measure is to assign a number between 0 and 1 to each fundamental experimental outcome with the proviso that the sum over all outcomes of the assigned numbers equals one. The probability of every event can then be determined by adding up the probability assigned to every outcome in the event. For example, in the case of flipping a single coin, the fundamental outcomes are heads and tails, so the sample space is {H, T}. We can assign a probability of one-half to each sample space outcome, i.e., to both {H} and {T}. The event space comprises {H}, {T}, {H, T}, and the empty set (included for technical reasons). The probability of any event can be computed by adding up the probability of the outcomes that make up the event. For example, the probability of the event {H} is ½ and the probability of the event {H, T} is ½ plus ½ equals 1. Exercise: What are the sample spaces and event spaces for a single roll of a six-sided die? What probability should be assigned to each member of the sample space? Should the assigned probabilities be different for an unfair die? What event corresponds to the experimental result that rolling the die yields a prime number? What is the probability of this event for a fair die?

In general, rich probability spaces can be easily built from a simple one. To do so, we choose an integer N greater than 1, and let the rich sample space become the set of vectors with N components where each component assumes a value from the simple sample space. We’ve already seen how this works in the examples above. For example, for coin flipping, the simple sample space is {H, T}, and the rich one we used is the set of all vectors of length 3 with H and T as components. As a probability measure, we can assign to each vector in the rich sample space the product of the probabilities the simple space assigns to each vector component. So in our coin tossing case we assign the probability ½ × ½ × ½ to each outcome vector in the rich space. This “Cartesian Product” approach creates a richer probability space than the simple one we started with in the sense that it has a larger sample space and a more interesting set of events to talk about. Assigning probabilities to the rich space in the manner described results in each vector component being independent of all the other components. I won’t discuss independence here—there’s too much to say in a two-part blog post—but it is worth pointing out that a rich probability space formed in this manner has the independence property. Exercise: can you use the Cartesian Product approach to create a probability space where the vector components are from two different simple sample spaces? (Think about flipping a coin as one vector component, and the birthday integer as the second vector component. This could be a decent model for experiments where the sex and birthday of your subjects are important.) What probability should you assign to each vector?

Now we’re ready to solve the birthday problem and a host of problems like it. In Part Two I’ll present two approaches.