Chapter 15 Sampling & Study Design

15.1 Why Sampling Matters

In the Preface, you learned that statistics is the process of learning about a population by studying a sample. The population is the entire group you want to understand — all high school students in Beijing, all daily smokers in the United States, all light bulbs produced by a factory. The sample is the much smaller group you actually observe.

The quality of every conclusion in this book depends on one thing: how the sample was collected. A poorly chosen sample can make even the most sophisticated statistical analysis meaningless. A well-chosen sample, on the other hand, makes your conclusions trustworthy and generalizable.

Think about the NESARC dataset — the 43,000-person survey that powered every analysis in this book. Why can we use it to draw conclusions about American adults? Because the NESARC researchers did not just grab whoever was nearby. They designed a sampling plan that gave every non-institutionalized adult in the United States a known chance of being selected. That is what makes the NESARC a nationally representative survey.

This chapter is about how researchers design samples and studies to answer questions reliably. Even if you never run your own nationwide survey, understanding these concepts will make you a sharper reader of every study you encounter, from news headlines about health risks to your own project’s data.

15.2 Sampling Methods

15.2.1 The Goal of Sampling

Every sampling method tries to achieve the same thing: a sample that represents the population well. When a sample is representative, the patterns you find in your sample data should closely resemble the patterns in the larger population. When it is not representative, the sample is biased — it systematically over-represents or under-represents certain kinds of individuals, and your conclusions will be wrong.

Here are the most common ways researchers obtain a sample, starting with the least trustworthy.

15.2.2 Volunteer Samples

Imagine you want to know what kind of music students at your school prefer. You post a survey link on your school’s social media page that says “Vote for your favorite music genre!” and wait for responses.

This is a volunteer sample: individuals choose whether to participate. It is the easiest kind of sample to collect, and it is almost always biased. Why? Because the people who volunteer to take a survey are usually not a random cross-section of the population. They tend to be people with strong opinions who want their voice heard. If you posted a survey about your school’s cafeteria food, students who hate the food are far more likely to respond than students who think it is fine.

Volunteer samples cannot be generalized to any larger population. The people who chose to respond only represent themselves.
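To see how strongly self-selection can distort a result, here is a minimal sketch of the cafeteria example. Every number in it is a hypothetical assumption chosen for illustration: 30% of students dislike the food, and they are eight times as likely to answer the survey as everyone else.

```python
# Hypothetical numbers, for illustration only (not from any real survey).
share_dislike = 0.30        # true share of students who dislike the food
respond_if_dislike = 0.40   # chance that a student who dislikes it answers
respond_if_fine = 0.05      # chance that a student who thinks it is fine answers

# Expected share of the student body that responds, split by opinion.
responders_dislike = share_dislike * respond_if_dislike
responders_fine = (1 - share_dislike) * respond_if_fine

# Share of survey responses that come from students who dislike the food.
observed_share = responders_dislike / (responders_dislike + responders_fine)

print(f"True share who dislike the food: {share_dislike:.0%}")
print(f"Share among volunteers:          {observed_share:.0%}")
```

Under these assumed response rates, about 77% of the responses would come from students who dislike the food, even though only 30% of all students do.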

When are volunteer samples acceptable? In medical research, volunteer samples are often the only ethical option. You cannot force people to take an experimental drug — they must consent. But in that case, the goal is usually to compare treatments within the volunteer group, not to estimate how common something is in the general population.

15.2.3 Convenience Samples

A convenience sample is one where the researcher selects individuals who are easy to reach. If you stand outside your school library and survey the first 50 students who walk by, you are using a convenience sample.

Convenience samples are better than volunteer samples in one way: the researcher, not the participant, decides who to approach. But they are still often biased, sometimes in subtle ways:

If you survey students outside the library, you are more likely to catch students who study a lot — and their opinions might differ from students who never set foot in the library.
If you survey people at a shopping mall on a weekday morning, you will miss people who work during the day.
If you survey your own classmates, you are sampling from a group that shares your major and class schedule — not the whole student body.

The problem with convenience samples is not that they are definitely biased, but that you cannot tell how biased they are. There is no way to measure the gap between your sample and the population.

15.2.4 The Sampling Frame

Before we discuss better methods, there is one more concept to understand: the sampling frame. This is the list of individuals from which you actually draw your sample.

Ideally, your sampling frame matches your population exactly. If you want to study all students at your school, your sampling frame should be the complete enrollment list. In reality, frames are often imperfect:

You might have a list of student email addresses, but some students never check email.
A telephone survey uses phone numbers as its frame — but it misses people without phones.
The NESARC surveyed “non-institutionalized” adults, which means it excluded people in prisons, on military bases, and in hospitals. That is a deliberate choice, but it means the sample does not represent all adults.

A mismatch between the sampling frame and the population of interest is a potential source of bias. Always ask: who is not on the list?
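The question "who is not on the list?" can be checked directly when both the population and the frame are available as lists. The sketch below uses made-up student IDs and Python sets; in practice the full population list is often exactly what you do not have, so treat this as an illustration of the idea rather than a routine procedure.

```python
# Hypothetical student IDs, for illustration only.
population = {"s01", "s02", "s03", "s04", "s05", "s06"}  # everyone you want to study
frame = {"s01", "s02", "s04", "s06", "x99"}              # the list you can sample from

missing_from_frame = population - frame   # in the population, but can never be selected
not_in_population = frame - population    # on the list, but not in the target group
coverage = len(population & frame) / len(population)

print("Missing from the frame:", sorted(missing_from_frame))
print("On the list by mistake:", sorted(not_in_population))
print(f"Coverage of the population: {coverage:.0%}")
```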

15.2.5 Simple Random Sampling (SRS)

A simple random sample (SRS) is the gold standard. In an SRS, every individual in the population has an equal chance of being selected, and every possible group of a given size is equally likely to be the sample. It is like drawing names out of a hat — if the hat contains every name in the population and you draw fairly.

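Think about what an SRS does and does not guarantee. No single sample is certain to mirror the population exactly; any one SRS can be a little off by random chance. But because selection is random, the sample cannot systematically favor one kind of individual, and the size of the chance error follows known probability rules. For example, if you took 1,000 different SRSs from the same population, about 950 of the resulting estimates would land within a margin of error that you can calculate in advance, and about 50 would miss by more. That is the power of random sampling: selection bias is eliminated, and the remaining uncertainty is measurable.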
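A small simulation can make the "measurable uncertainty" idea concrete. The sketch below assumes a made-up population of 10,000 people in which 30% have some trait, draws 1,000 simple random samples of 400, and counts how many sample proportions fall within the usual approximate 95% margin of error, 1.96 times the standard error. The population, the sample size, and the seed are all illustrative assumptions.

```python
import math
import random

random.seed(1)  # fixed seed only so the run is repeatable

# Hypothetical population: 10,000 people, 30% of whom have the trait (coded 1).
population = [1] * 3000 + [0] * 7000
true_p = sum(population) / len(population)

n = 400              # size of each simple random sample
num_samples = 1000   # how many SRSs to draw

# Approximate 95% margin of error for a sample proportion.
margin = 1.96 * math.sqrt(true_p * (1 - true_p) / n)

within = 0
for _ in range(num_samples):
    sample = random.sample(population, n)   # SRS without replacement
    p_hat = sum(sample) / n
    if abs(p_hat - true_p) <= margin:
        within += 1

coverage = within / num_samples
print(f"Margin of error: +/-{margin:.3f}")
print(f"Share of samples within the margin: {coverage:.1%}")
```

The share printed at the end should be close to 95%, though the exact figure depends on the seed and on the rounding built into counting whole people.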

How is an SRS done in practice? Researchers use random number generators — essentially, digital dice. A computer assigns a random number to every individual on the sampling frame, sorts them, and selects the top n. Every reputable survey (including the NESARC) uses some form of random sampling.
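The procedure just described (random number, sort, keep the first n) can be sketched in a few lines. The frame of 500 student IDs and the sample size of 20 are hypothetical; the second half shows that Python's standard `random.sample` does the same job in one call.

```python
import random

random.seed(42)  # fixed seed only so the example is repeatable

# Hypothetical sampling frame: 500 student IDs.
frame = [f"student_{i:03d}" for i in range(1, 501)]
n = 20

# Give every individual a random number, sort by it, and keep the first n.
tagged = [(random.random(), person) for person in frame]
tagged.sort()
srs_by_sorting = [person for _, person in tagged[:n]]

# The standard library draws an SRS without replacement directly.
srs_builtin = random.sample(frame, n)

print(srs_by_sorting[:5])
print(srs_builtin[:5])
```

Both approaches give every group of 20 students the same chance of being the sample, which is the defining property of an SRS.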

15.2.6 Systematic Sampling

Systematic sampling selects every kth individual from a list, ideally starting from a randomly chosen position among the first k. For example, you might take your school’s enrollment list (sorted alphabetically) and select every 50th student.

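Systematic sampling is easy to do by hand and often produces a reasonably representative sample. But it has a weakness: the list’s ordering can introduce hidden patterns. If siblings share the same last name and sit next to each other on an alphabetical list, selecting every 50th student can never pick both siblings, although an SRS occasionally would. More seriously, if the list has a repeating pattern that lines up with k (say, every 50th entry happens to be a class representative), the sample can consist almost entirely of one kind of individual.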

Systematic sampling is generally better than convenience sampling but not as safe as a true random sample.
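Here is a minimal sketch of systematic sampling on a hypothetical enrollment list of 1,000 students with k = 50. It assumes the usual practice of picking the starting position at random among the first k entries, so that the sample is not fixed in advance.

```python
import random

random.seed(7)  # fixed seed only so the example is repeatable

# Hypothetical enrollment list of 1,000 students, already in alphabetical order.
enrollment = [f"student_{i:04d}" for i in range(1000)]
k = 50

# Pick a random starting position among the first k entries, then take every kth.
start = random.randrange(k)
positions = list(range(start, len(enrollment), k))
systematic_sample = [enrollment[i] for i in positions]

print(f"Start at position {start}; sample size {len(systematic_sample)}")
```

Once the start is chosen, the rest of the sample is completely determined, which is why the ordering of the list matters so much for this method.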

15.2.7 Cluster Sampling

Imagine you want to survey high school seniors across Beijing. You could get a list of all seniors (huge effort) and take an SRS. Or you could use cluster sampling: randomly select 10 high schools in Beijing, and survey every senior in those 10 schools.

In cluster sampling, the population is naturally divided into groups (clusters). You randomly select some clusters, and then include all individuals in those clusters.

Pros: Much easier and cheaper than an SRS when the population is spread across many locations. You only need to visit 10 schools instead of tracking down individual students across 200 schools.

Cons: If the clusters differ from each other in important ways, your sample may be less representative than an SRS. If you randomly select only a few schools and those schools happen to be in wealthier neighborhoods, your results will not represent all students.
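The two steps of cluster sampling can be sketched as follows. The 200 schools and their class sizes are invented for illustration; only the structure (select whole clusters at random, then take everyone inside them) comes from the description above.

```python
import random

random.seed(3)  # fixed seed only so the example is repeatable

# Hypothetical data: 200 schools, each with between 150 and 400 seniors.
schools = {}
for s in range(200):
    name = f"school_{s:03d}"
    size = random.randint(150, 400)
    schools[name] = [f"{name}_senior_{i:03d}" for i in range(size)]

# Step 1: randomly select 10 whole schools (the clusters).
chosen_schools = random.sample(sorted(schools), 10)

# Step 2: include every senior in each selected school.
cluster_sample = [senior for name in chosen_schools for senior in schools[name]]

print(f"Schools visited:  {len(chosen_schools)}")
print(f"Seniors surveyed: {len(cluster_sample)}")
```

Notice that the final sample size is not fixed in advance: it depends on how large the selected schools happen to be.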

15.2.8 Stratified Sampling

Stratified sampling also starts by dividing the population into groups (strata). But instead of selecting whole groups, you take a random sample from every group.

For example: Beijing has many high schools. In stratified sampling, you take a random sample of, say, 20 seniors from each high school, and combine them into one sample.

Pros: Guarantees representation from every group. If school quality varies from school to school, stratified sampling ensures that students from every school are included. This often produces more precise estimates than an SRS of the same total size.

Cons: More work. You need to travel to every school, not just a handful.
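For comparison with cluster sampling, here is a minimal sketch of stratified sampling with 20 seniors drawn from every school. The five schools and their enrollment counts are hypothetical.

```python
import random

random.seed(5)  # fixed seed only so the example is repeatable

# Hypothetical strata: five schools with different numbers of seniors.
senior_counts = {"school_A": 320, "school_B": 150, "school_C": 410,
                 "school_D": 95, "school_E": 230}
schools = {
    name: [f"{name}_senior_{i:03d}" for i in range(count)]
    for name, count in senior_counts.items()
}
per_school = 20

# Draw a separate simple random sample inside every school, then combine.
stratified_sample = []
for name, seniors in schools.items():
    stratified_sample.extend(random.sample(seniors, per_school))

print(f"Schools represented: {len(schools)}")
print(f"Total sample size:   {len(stratified_sample)}")
```

Because every school contributes 20 seniors regardless of its size, students at small schools have a higher chance of selection than students at large ones. An analysis of the combined sample would normally weight each school by its enrollment to correct for this.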

Cluster vs. Stratified — which is better? It depends on what you are studying. If the groups are likely to be very different from each other (school quality, hospital quality, neighborhood income), stratified sampling is usually better because it forces representation from all groups. If the groups are fairly similar, cluster sampling may be good enough — and much cheaper.

15.2.9 Multistage Sampling

Real-world surveys often combine multiple methods in stages. Multistage sampling is common when the population is huge and geographically spread out.

For example, the NESARC survey might have worked like this:

Stage 1 (cluster): Randomly select counties across the United States.
Stage 2 (stratified): Within each selected county, randomly select households, making sure to include a mix of urban and rural addresses.
Stage 3: Within each household, randomly select one adult to interview.

Multistage sampling balances practicality with representativeness. It is not as mathematically pure as a simple random sample of the entire U.S. adult population would be, but a true nationwide SRS would be impossibly expensive and time-consuming.
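The three stages listed above can be sketched as nested random selections. This is a simplified illustration on invented data, not the NESARC's actual design: the county and household counts are hypothetical, and the urban/rural balancing in Stage 2 is left out to keep the example short.

```python
import random

random.seed(11)  # fixed seed only so the example is repeatable

# Hypothetical frame: 50 counties, 100 households each, 1 to 4 adults per household.
counties = {}
for c in range(50):
    county = f"county_{c:02d}"
    counties[county] = {}
    for h in range(100):
        house = f"{county}_house_{h:03d}"
        num_adults = random.randint(1, 4)
        counties[county][house] = [f"{house}_adult_{a}" for a in range(num_adults)]

# Stage 1: randomly select 5 counties.
selected_counties = random.sample(sorted(counties), 5)

interviews = []
for county in selected_counties:
    households = counties[county]
    # Stage 2: randomly select 10 households within the county.
    for house in random.sample(sorted(households), 10):
        # Stage 3: randomly select one adult within the household.
        adult = random.choice(households[house])
        interviews.append((county, house, adult))

print(f"Counties: {len(selected_counties)}, interviews: {len(interviews)}")
```

Interviewers in this sketch would travel to only 5 counties, which is the practical advantage that multistage designs trade for some loss of statistical simplicity.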

15.2.10 Nonresponse

Even the best sampling plan has a final hurdle: nonresponse. You selected a perfect random sample of 1,000 students and emailed them your survey. Only 400 respond. What now?

The 400 who responded are not a random subset — they are the ones who check their email, have time, and care about your survey topic. If the 600 who did not respond are systematically different from the 400 who did, your results are biased.
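One way to see how much damage nonresponse could do is to work out the extreme cases. The sketch below uses the 1,000 sampled and 400 responding from the example above, plus one hypothetical assumption: that 240 of the 400 responders (60%) answered "yes" to some question.

```python
# 1,000 sampled and 400 responding come from the example in the text.
sampled = 1000
responded = 400
yes_among_responders = 240   # hypothetical: 60% of responders said "yes"

nonresponders = sampled - responded
observed = yes_among_responders / responded

# Extreme cases: every nonresponder would have said "no", or every one "yes".
lowest_possible = yes_among_responders / sampled
highest_possible = (yes_among_responders + nonresponders) / sampled

print(f"Observed among responders: {observed:.0%}")
print(f"Possible range for the full sample: "
      f"{lowest_possible:.0%} to {highest_possible:.0%}")
```

With 600 people missing, the true figure for the full sample could lie anywhere from 24% to 84%. The extremes are unlikely, but without information about the nonresponders nothing in the data rules them out.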

Survey researchers fight nonresponse with follow-up emails, phone calls, and sometimes small incentives. The NESARC achieved a high response rate (over 80%) through in-person interviews conducted by trained staff who made repeated visits to selected households.

When you read a study, check the response rate. A survey with a 40% response rate should make you nervous. A survey with an 80% response rate is much more trustworthy.

15.3 Study Design

Once you have a sample, what do you actually do with it? The study design determines how you collect data from your participants — and it determines what kinds of conclusions you can draw.

15.3.1 Observational Studies

In an observational study, researchers record the values of variables as they naturally occur. They do not interfere or assign anyone to do anything. They simply measure and observe.

The NESARC is an observational study. Researchers asked thousands of people about their smoking habits, mental health, and demographics. They did not tell anyone to start or stop smoking. They just recorded what was already happening.

Observational studies are the most common kind of research in the social sciences because they can study things you could never experiment on: you cannot randomly assign people to have depression, to be a certain ethnicity, or to grow up in poverty.

The tradeoff: Observational studies can reveal associations, but they struggle to prove causation. Just because smokers in the NESARC data have higher rates of depression does not mean smoking causes depression. It could be the reverse. It could be that both are caused by a third factor (stress, perhaps). We will return to this in Appendix D on causation.

Prospective vs. retrospective: An observational study can look forward in time (prospective) or backward (retrospective). A prospective study might recruit participants today, ask about their current habits, and follow up five years later to see who developed a disease. A retrospective study asks participants to recall what they did in the past — which is cheaper and faster, but relies on memory, which is notoriously unreliable.

15.3.2 Sample Surveys

A sample survey is a special kind of observational study where participants report their own values — often their opinions, behaviors, or experiences. Every question in the NESARC is a survey question: “Have you ever smoked 100 or more cigarettes?” “Have you ever been diagnosed with depression?”

Surveys are powerful because they can capture information that is impossible to measure any other way — internal mental states, past experiences, personal beliefs. But they are vulnerable to response bias: people may not remember accurately, may not answer honestly about sensitive topics, or may say what they think the researcher wants to hear.

15.3.3 Experiments

In an experiment, the researcher takes control. They do not just record what happens naturally — they assign participants to different conditions and observe what happens.

Suppose you want to know whether watching TV causes people to eat more snacks. Here are two approaches:

Observational approach: Ask 500 people to keep a diary for a day, recording when they watch TV and when they eat snacks. Compare snack consumption during TV time versus non-TV time. This might show that people snack more while watching TV — but you cannot rule out alternative explanations. Maybe people snack more in the evening (when they also watch more TV). Maybe the kind of people who watch a lot of TV also tend to snack more in general.

Experimental approach: Recruit 100 participants. Randomly assign 50 to sit in a room with a TV on and snacks available. Assign the other 50 to sit in a room with no TV (just magazines) and the same snacks available. Measure how much each group eats.

The key difference is random assignment. By randomly deciding who gets the TV and who does not, you make the two groups comparable on average — the TV-watchers and non-watchers should be similar in age, appetite, typical snacking habits, and everything else. If the TV group eats significantly more, you have much stronger evidence that TV caused the difference.

Why experiments are powerful: Random assignment breaks the link between the explanatory variable and any confounding variables. In an observational study, people who watch more TV might be different in dozens of ways from people who watch less. In an experiment, random assignment makes those differences balance out across groups on average, so they cannot systematically favor one condition.
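Random assignment for the TV experiment can be done with a shuffle and a split. The participant IDs below are placeholders; the point is that chance alone, not the researcher or the participant, decides who ends up in which room.

```python
import random

random.seed(2024)  # fixed seed only so the example is repeatable

# Placeholder IDs for the 100 recruited participants.
participants = [f"participant_{i:03d}" for i in range(1, 101)]

# Shuffle a copy of the list, then split it in half.
shuffled = participants[:]
random.shuffle(shuffled)
tv_group = shuffled[:50]
no_tv_group = shuffled[50:]

print(f"TV room:    {len(tv_group)} participants")
print(f"No-TV room: {len(no_tv_group)} participants")
```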

Why experiments are not always possible: You cannot randomly assign people to smoke, to be depressed, to live in poverty, or to be a certain gender. For many important questions — especially in public health, education, and sociology — experiments are either impossible or unethical. That is why observational studies remain essential, even with their limitations.

15.3.4 Summary: Study Designs at a Glance

Design	What the Researcher Does	What You Can Conclude
Observational study	Records variables as they occur. No interference.	Associations, patterns, predictions. Causation is uncertain.
Sample survey	Asks participants to self-report their values, opinions, or behaviors.	Same as observational study, with additional caution about memory and honesty.
Experiment	Assigns participants to conditions. Controls the explanatory variable.	Stronger evidence for causation — but limited to questions where assignment is possible and ethical.

15.4 Association vs. Causation

Throughout this book, you have seen the phrase “association does not imply causation.” The study design is the reason why.

Let us make this concrete. In Chapter 7, you used a chi-square test on NESARC data and found an association between major depression and nicotine dependence among young adult smokers. The p-value was tiny, so the association is very unlikely to be a fluke of sampling. But can you conclude that depression causes nicotine dependence? Or that nicotine dependence causes depression?

No. Not from this data. The NESARC is an observational study — it recorded what was already true about each person. It did not randomly assign anyone to be depressed or to become dependent on nicotine. There could be a third variable — genetics, childhood stress, personality traits — that increases the risk of both depression and nicotine dependence.
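A short calculation shows how a third variable can create an association all by itself. In the made-up world below, a stress factor raises the risk of both depression and nicotine dependence, but neither condition has any effect on the other. All probabilities are hypothetical and are not NESARC estimates.

```python
# Hypothetical probabilities, for illustration only (not NESARC estimates).
p_stress = 0.5
p_dep = {True: 0.5, False: 0.1}   # P(depression), by whether stress is present
p_nic = {True: 0.6, False: 0.2}   # P(nicotine dependence), by whether stress is present
# Within each stress level, depression and dependence are independent.

def p_dep_given_nic(nic: bool) -> float:
    """P(depression | nicotine-dependence status), averaging over stress."""
    numerator = 0.0
    denominator = 0.0
    for stress, p_s in ((True, p_stress), (False, 1 - p_stress)):
        p_n = p_nic[stress] if nic else 1 - p_nic[stress]
        numerator += p_s * p_n * p_dep[stress]
        denominator += p_s * p_n
    return numerator / denominator

dep_if_dependent = p_dep_given_nic(True)
dep_if_not_dependent = p_dep_given_nic(False)

print(f"Depression rate among the nicotine dependent: {dep_if_dependent:.0%}")
print(f"Depression rate among everyone else:          {dep_if_not_dependent:.0%}")
```

Under these assumptions the depression rate is 40% among the nicotine dependent and about 23% among everyone else, even though the only thing driving the gap is the shared stress factor.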

An experiment would be needed to establish causation. But you cannot ethically run an experiment where you randomly assign people to become depressed or addicted to nicotine. So researchers must rely on observational data, while being honest about its limits.

The distinction between association and causation is not a weakness of statistics — it is a strength. It keeps you from making claims your data cannot support. When you write up your analysis, be precise: “Depression and nicotine dependence are associated” is true and defensible. “Depression causes nicotine dependence” is a much stronger claim that requires much stronger evidence.

15.5 From Design to Your Own Project

If you are working on your own research project, perhaps for a science fair or a class paper, the concepts in this chapter give you a vocabulary for thinking about your data.

Ask yourself:

How was my sample collected? Is it a convenience sample (your classmates, your neighborhood) or something closer to a random sample? If it is a convenience sample, be honest about that in your write-up. State clearly who your results can and cannot be generalized to.
What study design am I using? If you are surveying people about their existing habits, you are doing an observational study. If you are randomly assigning people to different conditions (even something simple, like “read this passage with music on” vs. “read this passage in silence”), you are running an experiment.
What conclusions can I actually draw? An observational study with a convenience sample is at the bottom of the evidence hierarchy. That does not make it worthless — exploratory research often starts there. But it does mean you should be modest in your claims. “In this sample of 80 students from our school…” is an honest framing. “Students today…” is an overreach.

The best analysis in the world cannot rescue bad data. But a thoughtful analysis that is honest about its data’s limitations? That is real science.