Research Sample Size

Sometimes being a data scientist requires that you actually act like a "scientist" (obviously).

In this post, we're going to look at a not-so-popular subject: determining the sample size that lets you draw a sound conclusion about the population you're interested in.

More often than not, people assume that a sample size needs to bear some proportional relationship to the size of the population from which it is drawn. This is not necessarily the case.
Rather, beyond some point, having more samples does not meaningfully improve the accuracy of your analysis.

What this means is that you really don't need to gather as many samples as possible in order to come up with a reasonable conclusion that can be applied to the population at large.

The absolute size of a sample is much more important. That size depends mostly on the variation in the population parameters under study and on the precision you require of your estimate. Thus, sometimes a sample of 400 is good enough, while in another scenario more than 3000 is needed. If the variation is small, perhaps 50 would suffice.

In the most basic form of the sample size calculation, we typically assume that the proportion being estimated is 0.5 (the value that gives the largest variance) and that the population is infinite. In this scenario, a sample of 100 drawn from a population of 5000 would have the same estimating precision as 100 drawn from a population of 200 million.
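To make this concrete, here is a minimal sketch of the usual margin-of-error calculation for a proportion. The function names, the 95% confidence level (z = 1.96) and the optional finite population correction are our own assumptions for illustration, not something prescribed above.

```python
import math

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Margin of error for an estimated proportion.

    z=1.96 corresponds to a 95% confidence level (an assumption).
    If `population` is given, the finite population correction is applied;
    otherwise the population is treated as infinite.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest n that reaches the given margin, assuming an infinite population."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Same sample of 100, very different population sizes.
small = margin_of_error(100, population=5_000)
large = margin_of_error(100, population=200_000_000)
print(f"population 5,000:       +/- {small:.4f}")
print(f"population 200,000,000: +/- {large:.4f}")

# Sample size needed for a +/- 5% margin at 95% confidence.
print(required_sample_size(0.05))
```

Even with the correction switched on, the two margins differ by only about a tenth of a percentage point, and the required sample for a 5% margin lands just under the "400" figure mentioned earlier.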

The main difficulty with this is determining the size of the population variance. In a nutshell, the greater the dispersion of the data, the more samples we need. For instance, take a survey to determine the exam scores of a group of 100 students. If we manage to survey 10 students and all of them tell us that they scored 80%, chances are the other students in that class scored around 80% as well. However, if all 10 of them report scores that differ from one another, we would probably need to gather more samples to derive any meaningful conclusion.
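The exam example can be sketched in a few lines. This assumes we treat the 10 surveyed students as a pilot sample, use its standard deviation as a stand-in for the unknown population value, and apply the common n = (z * s / margin)^2 formula for a mean (which ignores the fact that the class only has 100 students). The scores below are made up purely for illustration.

```python
import math
import statistics

def sample_size_for_mean(pilot_sample, margin, z=1.96):
    """Rough n needed to estimate a mean within +/- margin.

    Uses the pilot sample's standard deviation as an estimate of the
    population's, and z=1.96 for a 95% confidence level (both assumptions).
    A result of 0 means the pilot showed no variation at all.
    """
    s = statistics.stdev(pilot_sample)
    return math.ceil((z * s / margin) ** 2)

# Hypothetical pilot surveys of 10 students each (scores in percent).
identical_scores = [80] * 10
varied_scores = [45, 52, 60, 66, 71, 78, 83, 88, 92, 97]

# Target: estimate the class average within +/- 5 percentage points.
print(sample_size_for_mean(identical_scores, margin=5))
print(sample_size_for_mean(varied_scores, margin=5))
```

The identical scores give no reason to sample further, while the widely spread scores call for a sample several times larger than the 10 students we started with.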

This brings us to the topic of precision. Since a sample can never perfectly reflect the population at large, we must decide on the level of precision that we need. Precision in this sense can be measured by:

  1. The confidence interval
  2. Margin of error
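As a rough illustration of both measures, the sketch below computes a confidence interval and its margin of error for a survey proportion using the normal approximation. The function name and the survey numbers (220 "yes" answers out of 400) are hypothetical.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion.

    Returns (lower, upper, margin_of_error).
    """
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin

# Hypothetical survey: 220 of 400 respondents answered "yes".
lower, upper, margin = proportion_interval(220, 400)
print(f"estimate 0.55, 95% interval {lower:.3f} to {upper:.3f}, margin +/- {margin:.3f}")
```

Asking for a higher confidence level (say 0.99) widens the interval for the same sample, which is the trade-off we have to settle before fixing a sample size.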

1. Business Research Methods

