Why Sample Size and Random Sampling Matter

Mike Perkins

Recently we tweeted an interesting article on big data, from the Financial Times.

The author’s key point is that sampling bias and sampling error are possible even with large data sets. As illustration, the author discusses a classic case where the Literary Digest incorrectly predicted that Alf Landon would beat FDR in the 1936 election. The prediction was wrong despite the fact that a very large polling sample was used.

A more recent example is the news that Strava—a social media website for competitive cyclists and runners—is selling its data to cities to help them plan better cycling infrastructure. The story is compelling because it feels like a clever, low-cost way to get insight into where new bicycle lanes might be most helpful. But I question whether Strava’s user base is the same as the larger population of casual urban cyclists.

Turning back to the election example, I thought a few of you might be interested in a more mathematical treatment. The rest of this post presents a simplified mathematical model of election polling which shows, first, that sample size matters, and second, that random sampling is critical.

Let’s assume for starters that the pool of eligible voters has size V, and every eligible voter has already made up his mind. Let’s further assume that there are only two candidates, and that every eligible voter will in fact vote—hey, it’s just a model! Under these circumstances, some fraction p of the electorate will vote for candidate A come Election Day. Our task as polling wizards is to estimate p.

If we could ask everyone in V, we could compute p exactly (this is “N=all” in the article’s parlance). However, if V is big, the cost of asking everyone is just too great. In practice, we will sample only a subset of size N from the electorate and try to estimate p from this subset.

Of course, once we ask one person, there are only V-1 people left to sample. We will assume that V is big enough that the ratio of people who favor candidate A (i.e. p) is only trivially changed as we conduct our sampling—which removes people one-by-one from the pool.

Now we need some math to get any further. Let’s define the random variable Xi to be 1 if the i’th person in our sample favors candidate A, and 0 if he favors candidate B. Xi is a Bernoulli distributed Random Variable (RV) with parameter p. The following sum is a natural choice for estimating p and would no doubt occur to most of you (here we use the Greek letter rho, which looks like a p, to denote our estimate):

$$\rho = \frac{1}{N} \sum_{i=1}^{N} X_i$$
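The estimator, the fraction of sampled voters who favor candidate A, can be sketched in a few lines of Python. This is only an illustration under the model's assumptions: the names `estimate_p` and `draw_sample` are made up for this sketch, and the "true" value of p is a hypothetical number chosen so the simulation has something to estimate.

```python
import random


def estimate_p(sample):
    """Return rho: the fraction of 1s (votes for candidate A) in the sample."""
    return sum(sample) / len(sample)


def draw_sample(p, n, rng):
    """Draw n independent Bernoulli(p) responses: 1 = favors A, 0 = favors B."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1936)  # fixed seed so the run is repeatable
    true_p = 0.55              # hypothetical; a real pollster never knows this
    for n in (10, 100, 1000, 10000):
        rho = estimate_p(draw_sample(true_p, n, rng))
        print(f"N={n:>5}  rho={rho:.4f}")
```

Running it prints one estimate per sample size, which gives a feel for how the estimate behaves as N grows.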

We want to know what happens to this estimate as N increases. Intuitively, it will get better. But how fast and under what conditions?

To answer this, we need to know the mean (expected value) and variance of the Xi RVs. The mean of each Xi is just p; this follows from the definition of the expected value (applying the definition yields 1 times p plus 0 times (1-p)). It is similarly easy to show from the definitions that the variance of each Xi is p(1-p).
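For readers who want the steps spelled out, here is one way to write the two calculations. The only fact used beyond the definitions is that the square of Xi equals Xi, because Xi only takes the values 0 and 1.

$$E[X_i] = 1 \cdot p + 0 \cdot (1-p) = p$$

$$E[X_i^2] = 1^2 \cdot p + 0^2 \cdot (1-p) = p$$

$$\mathrm{Var}(X_i) = E[X_i^2] - \left(E[X_i]\right)^2 = p - p^2 = p(1-p)$$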

Our next task is to derive the mean and variance of rho. It is critical to keep in mind that our estimator is also an RV! For the mean we have (using E as the expectation operator):

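$$E[\rho] = E\left[\frac{1}{N}\sum_{i=1}^{N} X_i\right] = \frac{1}{N}\sum_{i=1}^{N} E[X_i] = \frac{1}{N} \cdot N p = p$$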

This is a nice result. It shows that the expected value of our estimator is exactly the right value: p. We chose well. However, having the right expected value does not mean you will get exactly the value p from your polling sample. How likely your value is to land close to p depends on the variance of the estimator: the smaller the variance, the more likely the estimate is close to p. So let’s compute the variance of the estimator. The first step is to compute the expected value of the estimator squared. Once we know this, we can subtract the square of the estimator’s mean from it, and we will have the variance of the estimator (this is a classic result from probability theory). We have:

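$$E[\rho^2] = E\left[\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right)\left(\frac{1}{N}\sum_{j=1}^{N} X_j\right)\right] = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E[X_i X_j]$$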

Here we encounter the first stumbling block. To simplify this further we need to assume that the Xi and Xj RVs are independent of each other. This is equivalent to assuming that there is no “sampling bias”, i.e., no systematic bias in our grabbing of people from the pool of eligible voters. If such a bias is present, then the variables are not independent, and we can’t simplify the expression. However, if we are truly plucking people from the pool at random, then for i not equal to j, the expected value of the product of the two RVs is just the product of their expected values. This allows us to write

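$$E[\rho^2] = \frac{1}{N^2}\left[\sum_{i=1}^{N} E[X_i^2] + \sum_{i \ne j} E[X_i]\,E[X_j]\right] = \frac{1}{N^2}\left[N p + N(N-1)p^2\right] = \frac{p}{N} + \frac{(N-1)p^2}{N}$$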

If we subtract the square of our estimator’s mean from the right-hand side of the above (i.e. subtract p^2), we get the variance of our estimator. With a little bit of algebra this becomes:

$$\mathrm{Var}(\rho) = \frac{p}{N} + \frac{(N-1)p^2}{N} - p^2 = \frac{p - p^2}{N} = \frac{p(1-p)}{N}$$

This shows why the sample size matters. As the size of our sample increases, that is, as N gets bigger, the variance gets smaller, meaning that the distribution of the estimator RV becomes more and more tightly centered around the mean value p. In probability terms, this means that the probability that our estimate is off by a significant amount drops as N increases. Because our estimator is a scaled sum of independent, identically distributed RVs, the Central Limit Theorem says that it will become well approximated by a Gaussian distribution as N increases. We can therefore easily estimate the probability that our estimator will be off, say, by three percentage points from the true value (i.e. that our estimate is more than 0.03 away from the mean value p). This leads to the concept of “confidence” in polling, and the ability to determine an appropriate sample size N to get a desired confidence level. Differences between the estimator’s value and the true value p (which is not known) are called sampling error. The best we can do with sampling error is to model it probabilistically.
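A quick simulation can serve as a sanity check on the claim that the estimator's variance is p(1-p)/N. The sketch below repeats a simulated poll many times for each sample size and compares the spread of the estimates with the formula. The function names, the seed, the number of trials and the value of p are all arbitrary choices made for this illustration.

```python
import random
import statistics


def simulate_estimates(p, n, trials, rng):
    """Run `trials` independent polls of size n; return each poll's estimate."""
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]


def theoretical_variance(p, n):
    """Variance of the estimator under independent random sampling."""
    return p * (1 - p) / n


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    p = 0.5                  # hypothetical true fraction favoring candidate A
    for n in (100, 400, 1600):
        estimates = simulate_estimates(p, n, 2000, rng)
        observed = statistics.pvariance(estimates)
        print(f"N={n:>4}  observed={observed:.6f}  "
              f"formula={theoretical_variance(p, n):.6f}")
```

Under these assumptions the observed and predicted variances should track each other, and quadrupling N should cut both by roughly a factor of four.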

As a concrete example, let’s consider a sample size of 3000, as mentioned in the article. We don’t know p, but we know that the variance of a Bernoulli distributed RV is largest when p = 1/2, in which case its variance is 1/4. Therefore, if we use a sample size of 3000, the variance of our estimator is upper bounded by 1/12,000 regardless of the true value of p. A variance of 1/12,000 corresponds to a standard deviation of about 0.0091, so an error of 0.03 is roughly 3.3 standard deviations. We can then look in a standard normal distribution table and find the probability that our estimate will vary from the true value of p by more than 0.03. The answer is that roughly 999 out of 1000 times our estimate will be within 0.03 of the true value of p, as long as our samples are picked at random.
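Instead of a printed table, the same lookup can be done with Python's standard library. The sketch below uses the Gaussian approximation and the worst-case variance at p = 1/2 described above. The helper names `coverage` and `required_sample_size` are invented for this example, and the 95% confidence level in the second call is just a commonly quoted choice, not something taken from the article.

```python
import math
from statistics import NormalDist


def coverage(n, margin, p=0.5):
    """Approximate P(|rho - p| <= margin) using the Gaussian approximation."""
    sigma = math.sqrt(p * (1 - p) / n)  # standard deviation of the estimator
    dist = NormalDist(mu=0.0, sigma=sigma)
    return dist.cdf(margin) - dist.cdf(-margin)


def required_sample_size(margin, confidence, p=0.5):
    """Smallest N whose approximate coverage reaches `confidence`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    print(f"N=3000, margin 0.03: coverage = {coverage(3000, 0.03):.4f}")
    print(f"N needed for 0.03 at 95%: {required_sample_size(0.03, 0.95)}")
```

Because p = 1/2 maximizes p(1-p), the coverage computed this way is a conservative figure under the model: any other true p gives a smaller variance and a tighter estimate.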

