Big Data, Probability, and Birthdays: Part 1 of 2

Mike Perkins

Cardinal Peak’s big data practice is expanding as we continue adding data scientists to our staff. In a recent discussion regarding a data set we’re analyzing, a probability problem conceptually equivalent to the following arose:

In a room filled with N people, what is the probability that none of them have the same birthday?

In our case the room is a neighborhood, the people are video subscribers, and birthdays are TV shows. The solution to this problem merits discussion because it nicely illustrates several important concepts.

I could jump right into equations without providing any background material, but I have a second goal as well: to provide an intuitive and accessible introduction to some key concepts of probability theory. This is therefore a two-part blog post. Part One describes probability spaces, while Part Two provides two solutions to the birthday problem: one based on counting and one based on conditional probabilities.

When solving probability problems I encourage you to think of them as arising from an experiment—for example, a coin will be flipped, N people will be selected at random, an electron will be fired through a slit, etc. The outcome of the experiment is uncertain, or you wouldn’t need probability concepts. Your goal is to determine the probability of some experimental result.

Birthday CakeWith regards to your experiment, before grabbing paper and pen, you need to come up with a probability space. Probability spaces have three components: a sample space, an event space, and a probability measure. To define an appropriate probability space for your experiment, spend time thinking about three things: What are the fundamental experimental outcomes? What are the experimental results to which you’d like to assign probabilities? What do you know about the likelihood of the fundamental experimental outcomes?

Thinking about Question 1 will force you to identify what mathematicians refer to as the sample space. The sample space is the set of all possible experimental outcomes. For example, if you are going to flip a coin 3 times, an experimental outcome can be represented as a vector with three components, each of which can be either H or T (for heads or tails). So one possible experimental outcome is (H, T, H), another outcome is (T, T, H), etc. The sample space is the set of vectors representing all the possible experimental outcomes (in this case there are 8 in total, list them out if this isn’t obvious). For the birthday problem, the sample space is the set of all vectors with N components where each component is an integer in the range 1 to 365 inclusive. The integer for a particular component indicates on which day of the year that particular person was born (yes, we’re ignoring leap years!). For example, if N=5, one possible outcome is (1, 34, 253, 311, 253), which indicates that Person 1 was born on January 1, Person 2 on February 3, and so forth. Hint: If you can’t specify the sample space you aren’t ready to solve the problem!

Next, think about the experimental result in which you’re interested. In the case of flipping a coin three times perhaps you’re interested in the event where both the first and last coin tosses are heads. In the case of the birthday problem we’re interested in the event where everyone has a unique birthday (i.e. no two components of the birthday vector are identical). Mathematicians call the experimental results in which we’re interested events. You know you have the sample space nailed if each event you care about corresponds to a set of experimental outcomes, i.e., to a subset of the sample space. For example, in the coin toss case, the set { (H,H,H), (H,T,H) } corresponds to both the first and last toss being a head. In the birthday problem, one member of the set we care about is (25, 126, 67, 200, 345), and another is (112, 18, 235, 357, 198); however, there are too many outcomes in our event to list them all (don’t despair, we’ll still be able to count them). At this stage it is enough to see that there is a subset of the set of all possible experimental outcomes that corresponds to the event we care about.

The point of question 2 is to get you thinking about what mathematicians call the event space. Events are the experimental results to which we can assign probabilities. Every event is a subset of the sample space. For a finite sample space, the event space is the set of all subsets of the sample space (in technical jargon, the event space is the power set of the sample space). Caution: I’m simplifying pretty seriously here because the power set idea doesn’t always generalize. Many real-world experiments have a continuous outcome––for example, imagine a spinner used in a board game like Twister. Assume when it stops spinning that it can point to any real number in the interval [0, 1). For continuous outcome experiments like this one, not all subsets of the sample space turn out to be valid events. In other words, we can’t assign a probability to every subset of the sample space (we’re off in the world of measure theory here). Fortunately, for finite sample spaces we’re on solid ground: we can assign a probability to every subset of the sample space. Beware: If you can’t find a subset of the sample space that corresponds to the event you care about, something is wrong. You need to re-think.

Finally we reach Question 3, which forces us to think about the probability measure. In the case of finite sample spaces, the easiest way to get a probability measure is to assign a number between 0 and 1 to each fundamental experimental outcome with the proviso that the sum over all outcomes of the assigned numbers equals one. The probability of every event can then be determined by adding up the probability assigned to every outcome in the event. For example, in the case of flipping a single coin, the fundamental outcomes are heads and tails, so the sample space is {H, T}. We can assign a probability of one-half to each sample space outcome, i.e., to both {H} and {T}. The event space comprises {H}, {T}, {H, T}, and the empty set (included for technical reasons). The probability of any event can be computed by adding up the probability of the outcomes that make up the event. For example, the probability of the event {H} is ½ and the probability of the event {H, T} is ½ plus ½ equals 1. Exercise: What are the sample spaces and event spaces for a single roll of a six-sided die? What probability should be assigned to each member of the sample space? Should the assigned probabilities be different for an unfair die? What event corresponds to the experimental result that rolling the die yields a prime number? What is the probability of this event for a fair die?

In general, rich probability spaces can be easily built from a simple one. To do so, we choose an integer N greater than 1, and let the rich sample space become the set of vectors with N components where each component assumes a value from the simple sample space. We’ve already seen how this works in the examples above. For example, for coin flipping, the simple sample space is {H, T}, and the rich one we used is the set of all vectors of length 3 with H and T as components. As a probability measure, we can assign to each vector in the rich sample space the product of the probabilities the simple space assigns to each vector component. So in our coin tossing case we assign the probability ½ × ½ × ½ to each outcome vector in the rich space. This “Cartesian Product” approach creates a richer probability space than the simple one we started with in the sense that it has a larger sample space and a more interesting set of events to talk about. Assigning probabilities to the rich space in the manner described results in each vector component being independent of all the other components. I won’t discuss independence here—there’s too much to say in a two-part blog post—but it is worth pointing out that a rich probability space formed in this manner has the independence property. Exercise: can you use the Cartesian Product approach to create a probability space where the vector components are from two different simple sample spaces? (Think about flipping a coin as one vector component, and the birthday integer as the second vector component. This could be a decent model for experiments where the sex and birthday of your subjects are important.) What probability should you assign to each vector?

Now we’re ready to solve the birthday problem and a host of problems like it. In Part Two I’ll present two approaches.

Categories: Perk, Theory

Cardinal Peak
Learn more about our Audio & Video capabilities.

Dive deeper into our IoT portfolio

Take a look at the clients we have helped.

We’re always looking for top talent, check out our current openings. 

Contact Us

Please fill out the contact form below and our engineering services team will be in touch soon.

We rely on Cardinal Peak for their ability to bolster our patent licensing efforts with in-depth technical guidance. They have deep expertise and they’re easy to work with.
Diego deGarrido Sr. Manager, LSI
Cardinal Peak has a strong technology portfolio that has complemented our own expertise well. They are communicative, drive toward results quickly, and understand the appropriate level of documentation it takes to effectively convey their work. In…
Jason Damori Director of Engineering, Biamp Systems
We asked Cardinal Peak to take ownership for an important subsystem, and they completed a very high quality deliverable on time.
Matt Cowan Chief Scientific Officer, RealD
Cardinal Peak’s personnel worked side-by-side with our own engineers and engineers from other companies on several of our key projects. The Cardinal Peak staff has consistently provided a level of professionalism and technical expertise that we…
Sherisse Hawkins VP Software Development, Time Warner Cable
Cardinal Peak was a natural choice for us. They were able to develop a high-quality product, based in part on open source, and in part on intellectual property they had already developed, all for a very effective price.
Bruce Webber VP Engineering, VBrick
We completely trust Cardinal Peak to advise us on technology strategy, as well as to implement it. They are a dependable partner that ultimately makes us more competitive in the marketplace.
Brian Brown President and CEO, Decatur Electronics
The Cardinal Peak team started quickly and delivered high-quality results, and they worked really well with our own engineering team.
Charles Corbalis VP Engineering, RGB Networks
We found Cardinal Peak’s team to be very knowledgeable about embedded video delivery systems. Their ability to deliver working solutions on time—combined with excellent project management skills—helped bring success not only to the product…
Ralph Schmitt VP, Product Marketing and Engineering, Kustom Signals
Cardinal Peak has provided deep technical insights, and they’ve allowed us to complete some really hard projects quickly. We are big fans of their team.
Scott Garlington VP Engineering, xG Technology
We’ve used Cardinal Peak on several projects. They have a very capable engineering team. They’re a great resource.
Greg Read Senior Program Manager, Symmetricom
Cardinal Peak has proven to be a trusted and flexible partner who has helped Harmonic to deliver reliably on our commitments to our own customers. The team at Cardinal Peak was responsive to our needs and delivered high quality results.
Alex Derecho VP Professional Services, Harmonic
Yonder Music was an excellent collaboration with Cardinal Peak. Combining our experience with the music industry and target music market, with Cardinal Peak’s technical expertise, the product has made the mobile experience of Yonder as powerful as…
Adam Kidron founder and CEO, Yonder Music
The Cardinal Peak team played an invaluable role in helping us get our first Internet of Things product to market quickly. They were up to speed in no time and provided all of the technical expertise we lacked. They interfaced seamlessly with our i…
Kevin Leadford Vice President of Innovation, Acuity Brands Lighting
We asked Cardinal Peak to help us address a number of open items related to programming our systems in production. Their engineers have a wealth of experience in IoT and embedded fields, and they helped us quickly and diligently. I’d definitely…
Ryan Margoles Founder and CTO, notion