Chapter 3 Data Management in R
3.1 TL;DR: Data Management
- Real data is messy. Before you can analyze, you must clean: filter rows, select columns, rename variables, recode missing values, and label categories.
dplyris your toolkit. Five verbs do most of the work:filter(),select(),rename(),mutate(), and the pipe%>%.- Missing values hide in plain sight. Codes like 99, 999, or -1 often mean “no answer.” Convert them to
NAso R ignores them in calculations. - Numeric codes ≠ categories. Variables like Sex (1 = Female, 2 = Male) are categorical — use
factor()to label them so your output shows names, not numbers. - Build once, reuse forever. You will prepare the NESARC subset in this chapter and use it through Chapter 11. Get the data management right now and every analysis that follows will be easier.
# Load packages — install once with install.packages(c("dplyr", "modelsummary"))
library(dplyr)
library(modelsummary)
# Load the full NESARC dataset — the CSV is included with this book
nesarc <- read.csv("NESARC.csv") # ← REPLACE: path to your dataset
# Step 1: Filter — keep only young daily smokers with valid data
# Step 2: Select — keep only the variables we need for our analyses
# Step 3: Rename — give variables readable names
nesarc <- nesarc %>%
filter(S3AQ1A == 1 & CHECK321 == 1 & S3AQ3B1 == 1 & AGE <= 25) %>% # ← REPLACE with your filter conditions
select(IDNUM, ETHRACE2A, AGE, MAJORDEPLIFE, TAB12MDX, S3AQ3C1, SEX) %>% # ← REPLACE with your columns
rename( # ← REPLACE with your names
Ethnicity = ETHRACE2A, # ethnicity (categorical, 1–5)
Age = AGE, # age in years (quantitative)
MajorDepression = MAJORDEPLIFE, # lifetime depression diagnosis (categorical, 0/1)
TobaccoDependence = TAB12MDX, # tobacco dependence, past 12 months (categorical, 0/1)
DailyCigsSmoked = S3AQ3C1, # cigarettes per day (quantitative)
Sex = SEX # sex (categorical, 1/2)
)
# Step 4: Recode missing values — 99 in DailyCigsSmoked means "unknown"
nesarc$DailyCigsSmoked[nesarc$DailyCigsSmoked == 99] <- NA # ← REPLACE: your missing value code
# Step 5: Label categorical variables — output shows names, not numbers
nesarc$Ethnicity <- factor(nesarc$Ethnicity, # ← REPLACE: your variable
levels = c(1, 2, 3, 4, 5), # the numeric codes
labels = c("Caucasian", "African American", "Native American", # ← REPLACE: your labels
"Asian", "Hispanic"))
nesarc$MajorDepression <- factor(nesarc$MajorDepression, # ← REPLACE: your variable
labels = c("No Depression", "Yes Depression"))
nesarc$TobaccoDependence <- factor(nesarc$TobaccoDependence, # ← REPLACE: your variable
labels = c("No Nicotine Dependence", "Nicotine Dependence"))
nesarc$Sex <- factor(nesarc$Sex, labels = c("Female", "Male")) # ← REPLACE: your labels
# Step 6: Verify — check dimensions and preview
dim(nesarc) # rows × columns## [1] 1320 7
## IDNUM Ethnicity Age MajorDepression TobaccoDependence
## 1 21 African American 25 No Depression Nicotine Dependence
## 2 103 Caucasian 24 Yes Depression Nicotine Dependence
## 3 122 Caucasian 23 Yes Depression Nicotine Dependence
## 4 136 Caucasian 21 No Depression No Nicotine Dependence
## 5 150 Caucasian 23 Yes Depression No Nicotine Dependence
## 6 155 African American 21 No Depression Nicotine Dependence
## DailyCigsSmoked Sex
## 1 3 Male
## 2 10 Male
## 3 10 Male
## 4 20 Female
## 5 5 Male
## 6 8 Male
# Step 7: Quick overview with modelsummary — same information, report-ready format
datasummary_skim(nesarc) # ← REPLACE: your data frame| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| IDNUM | 1320 | 0 | 20235.7 | 11966.3 | 21.0 | 19155.5 | 43088.0 | |
| Age | 8 | 0 | 21.6 | 2.2 | 18.0 | 22.0 | 25.0 | |
| DailyCigsSmoked | 30 | 0 | 13.4 | 8.7 | 1.0 | 10.0 | 98.0 | |
| N | % | |||||||
| Ethnicity | Caucasian | 849 | 64.3 | |||||
| African American | 170 | 12.9 | ||||||
| Native American | 30 | 2.3 | ||||||
| Asian | 47 | 3.6 | ||||||
| Hispanic | 224 | 17.0 | ||||||
| MajorDepression | No Depression | 965 | 73.1 | |||||
| Yes Depression | 355 | 26.9 | ||||||
| TobaccoDependence | No Nicotine Dependence | 521 | 39.5 | |||||
| Nicotine Dependence | 799 | 60.5 | ||||||
| Sex | Female | 646 | 48.9 | |||||
| Male | 674 | 51.1 |
Key takeaway: You now have a clean, analysis-ready dataset with ~1,320 observations of young adult daily smokers and 7 well-labeled variables. Every chapter from here forward starts with the nesarc data frame you just built.
3.2 Deep Dive: Data Management
3.2.1 Why Data Management Matters
In Chapter 2, you loaded small, clean datasets — soccer.csv, epiduralf.csv — where every variable was already named, every missing value was already handled, and every category was already labeled. That was like driving a car someone else had washed, fueled, and tuned.
Real-world data is different. Real datasets arrive with:
- Cryptic variable names.
S3AQ3C1means nothing until you read the code book and learn it’s “usual number of cigarettes smoked per day.” - Missing values disguised as numbers. A value of 99 or 999 might mean “not answered,” but R will treat it as a real number unless you tell it otherwise.
- Numeric codes that are really categories. Sex stored as 1 and 2, ethnicity stored as 1 through 5 — these look quantitative but are categorical.
- Thousands of columns you do not need. The full NESARC dataset has over 3,000 variables. Your research question needs maybe 6 or 7.
Data management is the process of transforming raw data into a clean, analysis-ready form. It is not glamorous, but it is essential. A mistake in data management — forgetting to recode a missing value, mislabeling a variable — will corrupt every analysis that follows. Get the cleaning right, and everything downstream is easier.
The good news: a handful of tools in the dplyr package handle most data management tasks. This chapter teaches you five verbs: filter(), select(), rename(), mutate(), and the pipe %>%. By the end, you will have built a clean subset of the NESARC dataset that you will use for the rest of the book.
3.2.2 Real Data Is Messy — Load the NESARC
Let’s start by loading the full NESARC dataset. This is the same dataset you met in Chapter 1 — a survey of over 43,000 U.S. adults covering health, behavior, and demographics.
# Load the full NESARC dataset from CSV
nesarc_full <- read.csv("NESARC.csv") # ← REPLACE: read.csv("your_file.csv")
# Check how big it is
dim(nesarc_full) # 43094 rows, 3008 columns## [1] 43093 3008
43,094 rows and 3,008 columns. That is far more than you need. Your research question (from Chapter 1) probably involves 5–10 variables for a specific subset of people. Let’s narrow things down.
But first, a tour of the raw data to understand why cleaning is necessary:
# Look at the first few rows of three key variables — raw, uncleaned
head(nesarc_full$S3AQ1A) # "Smoked 100+ cigarettes?" — stored as 1 or 2## [1] 2 2 2 2 2 2
## [1] 5 5 5 5 2 2
## [1] NA NA NA NA NA NA
Every one of these columns needs work. The values 1 and 2 in S3AQ1A should be “Yes” and “No.” The numbers 1–5 in ETHRACE2A should be “Caucasian,” “African American,” and so on. And the 99 in S3AQ3C1 — that is not 99 cigarettes a day. According to the NESARC code book, 99 means “unknown” or “missing.” If you computed the average without recoding 99 to NA, you would get a wildly inflated number.
This is why data management comes before analysis. Let’s fix each problem, one at a time.
3.2.3 The %>% Pipe — Chaining Operations Together
Before we dive into individual operations, let’s understand the tool that connects them. The pipe operator %>% (pronounced “and then”) takes the output of one function and feeds it as the input to the next function.
library(dplyr)
# Without the pipe — each step requires a new line and intermediate variables
step1 <- filter(nesarc_full, AGE <= 25)
step2 <- select(step1, AGE, SEX)
head(step2)## AGE SEX
## 1 23 1
## 2 18 1
## 3 19 1
## 4 18 2
## 5 21 2
## 6 25 2
# With the pipe — read from top to bottom: "take nesarc_full, AND THEN filter, AND THEN select"
nesarc_full %>%
filter(AGE <= 25) %>%
select(AGE, SEX) %>%
head()## AGE SEX
## 1 23 1
## 2 18 1
## 3 19 1
## 4 18 2
## 5 21 2
## 6 25 2
Both produce the same result, but the piped version reads like a recipe: “take the data, filter it, select columns, show the head.” You will use the pipe throughout this chapter and every chapter after. Think of it as %>% = “and then.”
You may have noticed that Chapter 2 used a different pipe symbol: |>. Both pipes do the same thing — pass a result into the next function. The |> is the base R pipe (built into R since version 4.1), while %>% comes from the dplyr package and has been the standard for over a decade. This book uses %>% with dplyr operations because it is still the most common pipe in the R ecosystem, but you can use |> anywhere if you prefer. Your code will work the same either way.
3.2.4 Filtering Rows — Keeping the Observations You Need
Not every observation in the dataset is relevant to your research question. If you are studying nicotine dependence among young smokers, you do not need data from 65-year-old non-smokers. Filtering keeps only the rows that match your criteria.
filter() takes a data frame and returns only the rows where a condition is true.
library(dplyr)
# Which observations should we keep? Our research question focuses on:
# - People who have smoked 100+ cigarettes in their lifetime (S3AQ1A == 1)
# - People who have smoked in the past 12 months (S3AQ3B1 == 1)
# - People with valid nicotine dependence data (CHECK321 == 1)
# - Young adults, age 25 or under (AGE <= 25)
nesarc_subset <- nesarc_full %>% # start with the full data
filter( # keep rows where ALL conditions are true:
S3AQ1A == 1 & # smoked 100+ cigarettes (1 = Yes)
S3AQ3B1 == 1 & # smoked in past 12 months (1 = Yes)
CHECK321 == 1 & # valid nicotine data flag (1 = valid)
AGE <= 25 # age 25 or younger
) # ← REPLACE: adjust conditions for your question
# How many observations remain?
dim(nesarc_subset) # ~1,320 rows — down from 43,094## [1] 1320 3008
The & symbol means “AND.” A row must satisfy every condition to be kept. A person who smoked 100+ cigarettes AND smoked recently AND has valid data AND is 25 or younger — all four must be true. If any one condition is false, that row is dropped.
What if you want to keep rows that satisfy either condition? Use | (the vertical bar, Shift + backslash) for OR. For example, filter(AGE == 25 | AGE == 26) keeps people who are either 25 or 26 years old. & means “all conditions must be true”; | means “at least one condition must be true.”
What the conditions mean:
| Condition | Variable | Meaning |
|---|---|---|
S3AQ1A == 1 |
Smoked 100+ cigarettes in lifetime | 1 = Yes, 2 = No |
S3AQ3B1 == 1 |
Smoked in the past 12 months | 1 = Yes, 2 = No |
CHECK321 == 1 |
Valid nicotine dependence data | 1 = valid, 0 = invalid |
AGE <= 25 |
Age 25 or younger | Years (quantitative) |
Always check the code book to confirm what each value means before you write a filter condition. S3AQ1A == 1 means “Yes, the person smoked 100+ cigarettes” — but only because the code book says 1 means Yes. Without the code book, you might incorrectly assume 1 means No.
Note on the sample: Why focus on young adults? Because substance use patterns change with age, and focusing on a specific age group makes the analysis more interpretable. You could choose a different age range — the important thing is to have a clear rationale for your filter criteria.
3.2.5 Selecting Columns — Keeping Only the Variables You Need
The full NESARC has 3,008 columns. Most are irrelevant to your research question. Selecting keeps only the columns you will actually use.
select() takes a data frame and returns only the columns you name.
# Our research question involves: ethnicity, age, depression, tobacco dependence,
# cigarettes per day, and sex. We also keep IDNUM as a unique identifier.
nesarc_subset <- nesarc_subset %>% # continue from the filtered data
select( # keep only these columns:
IDNUM, # unique respondent ID
ETHRACE2A, # ethnicity (1–5)
AGE, # age in years
MAJORDEPLIFE, # lifetime major depression (0/1)
TAB12MDX, # tobacco dependence, past 12 months (0/1)
S3AQ3C1, # usual cigarettes per day (1–98, 99=unknown)
SEX # sex (1=Female, 2=Male)
) # ← REPLACE: add/remove columns for your question
# Check: only 7 columns now
dim(nesarc_subset) # ~1,320 rows, 7 columns## [1] 1320 7
You went from 43,094 × 3,008 to approximately 1,320 × 7. The dataset is now focused on exactly the variables your research question needs.
Why select columns early? Because smaller datasets are easier to inspect, faster to work with, and less likely to cause confusion. If you keep all 3,008 columns, you might accidentally analyze the wrong variable with a similar name. Narrowing to the columns you care about reduces the chance of mistakes.
3.2.6 Renaming Variables — Making Names Readable
S3AQ3C1 is a terrible variable name. It means nothing outside the context of the survey’s original numbering scheme. Renaming gives variables names that describe what they actually measure.
rename() takes a data frame and changes column names. The syntax is new_name = old_name.
# Replace cryptic survey codes with descriptive names
nesarc_subset <- nesarc_subset %>%
rename( # new_name = old_name:
Ethnicity = ETHRACE2A, # ethnicity category
Age = AGE, # age in years
MajorDepression = MAJORDEPLIFE, # lifetime depression diagnosis
TobaccoDependence = TAB12MDX, # tobacco dependence, past 12 months
DailyCigsSmoked = S3AQ3C1, # usual cigarettes smoked per day
Sex = SEX # biological sex
) # ← REPLACE: names for your variables
# Now the column names are self-documenting
names(nesarc_subset)## [1] "IDNUM" "Ethnicity" "Age"
## [4] "MajorDepression" "TobaccoDependence" "DailyCigsSmoked"
## [7] "Sex"
Good variable names are short but descriptive. Ethnicity tells you what the column contains. ETHRACE2A does not. Renaming is one of the simplest data management operations — and one of the most important for keeping your code readable six months from now, when you have forgotten what ETHRACE2A means.
3.2.7 Recoding Missing Values — Converting Placeholders to NA
In many datasets, special codes indicate missing data: 99, 999, -1, or blank cells. The NESARC code book tells us that in DailyCigsSmoked, a value of 99 means “unknown.” If we leave 99 as a real number, R will include it when computing the mean — and 99 cigarettes a day would make the average absurdly high.
Recoding missing values replaces these placeholder codes with NA, R’s symbol for “not available.” R automatically excludes NA values from calculations like mean() and sd().
##
## FALSE TRUE
## 1315 5
# Recode 99 to NA — "99 means unknown, treat it as missing"
nesarc_subset$DailyCigsSmoked[nesarc_subset$DailyCigsSmoked == 99] <- NA # ← REPLACE: your missing code
# After recoding: verify the 99s are gone
table(nesarc_subset$DailyCigsSmoked == 99) # should be all FALSE##
## FALSE
## 1315
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 1.00 7.00 10.00 13.36 20.00 98.00 5
# Compare: mean WITH the 99 (wrong) vs mean with NA (correct)
# If we forgot to recode:
# mean(c(10, 15, 20, 99)) → 36 — nonsense
# After recoding to NA:
# mean(c(10, 15, 20, NA), na.rm = TRUE) → 15 — makes senseThe syntax dataframe$column[condition] <- NA reads as: “in this column, for every row where the condition is true, replace the value with NA.” You will use this pattern whenever a dataset uses placeholder codes for missing data.
Important: Always check the code book for missing value codes before you start analyzing. Every dataset has its own conventions. 99 is common in health surveys, 999 in economic data, and -1 in some psychological instruments. There is no universal standard — the code book is your only reliable guide.
How do researchers know 99 is not a real value? Because the survey itself never offers 99 as a valid answer. The NESARC questionnaire asks respondents for a number between 1 and 98; 99 is reserved by the data collectors as a “not answered” placeholder. The code book documents this decision. In every dataset you encounter, the missing value codes were chosen by the data collectors ahead of time — they are not real responses that happen to look suspicious.
When you convert a value to NA, R does not delete the row. The person’s age, ethnicity, and all other variables remain in the dataset. Functions like mean() simply skip NA values when computing, as long as you add na.rm = TRUE. If you ever need to explicitly remove rows with missing data, use na.omit(), but we usually prefer to keep the rows and handle NAs on a case-by-case basis.
3.2.8 Creating and Labeling Factors — From Numbers to Names
Look at the Ethnicity column in your current dataset:
## [1] 2 1 1 1 1 2
These numbers (1, 2, 3, 4, 5) are not quantities — they are category labels. A value of 4 does not mean “four units of ethnicity.” It means “Asian,” according to the code book. In R, a categorical variable stored this way is called a factor. Converting a numeric column to a factor attaches readable labels to the numeric codes.
# Convert numeric codes to labeled categories
nesarc_subset$Ethnicity <- factor(nesarc_subset$Ethnicity, # the column to convert
levels = c(1, 2, 3, 4, 5), # the numeric codes in order
labels = c("Caucasian", "African American", "Native American", # ← REPLACE: your labels
"Asian", "Hispanic"))
# Now the output shows names, not numbers
head(nesarc_subset$Ethnicity)## [1] African American Caucasian Caucasian Caucasian
## [5] Caucasian African American
## Levels: Caucasian African American Native American Asian Hispanic
# Label all other categorical variables the same way
nesarc_subset$MajorDepression <- factor(nesarc_subset$MajorDepression,
labels = c("No Depression", "Yes Depression")) # ← REPLACE: your labels
nesarc_subset$TobaccoDependence <- factor(nesarc_subset$TobaccoDependence,
labels = c("No Nicotine Dependence", "Nicotine Dependence")) # ← REPLACE: your labels
nesarc_subset$Sex <- factor(nesarc_subset$Sex,
labels = c("Female", "Male")) # ← REPLACE: your labelsWhy does this matter? Compare the output of a frequency table before and after labeling:
# Before labeling: numbers only — you need the code book to interpret
# (this uses the unlabeled version — we already labeled ours above)
# table(nesarc_subset$Sex) would show: 1: 600, 2: 720
# After labeling: names tell the story directly
table(nesarc_subset$Sex) # shows Female and Male — no code book needed##
## Female Male
## 646 674
##
## Caucasian African American Native American Asian
## 849 170 30 47
## Hispanic
## 224
A reader can understand “Female: 680, Male: 640” immediately. “1: 680, 2: 640” requires flipping back to the code book. Label your factors.
3.2.9 Creating New Variables — mutate() and ifelse()
Sometimes the variable you want does not exist in the raw dataset — you need to create it from existing columns. mutate() adds new columns, and ifelse() helps with conditional logic.
# Example: Create a binary variable for "heavy smoker" (20+ cigarettes per day)
nesarc_subset <- nesarc_subset %>%
mutate(
HeavySmoker = ifelse(DailyCigsSmoked >= 20, "Yes", "No") # ← REPLACE with your new variable logic
)
# Verify the new column
table(nesarc_subset$HeavySmoker)##
## No Yes
## 871 444
# Note: ifelse() handles NAs carefully — rows where DailyCigsSmoked is NA
# will have HeavySmoker = NA. R preserves missingness.mutate() adds a new column to the right side of the data frame. ifelse(condition, value_if_true, value_if_false) is R’s way of saying “if this condition is true, assign value A; otherwise, assign value B.” Think of ifelse() as filling in a new column row by row: for each person, check the condition; if true, write “Yes”; if false, write “No.”
Another common pattern: collapsing categories. The frustration dataset (included in data/frustration.csv) has a variable Major with four values. What if you wanted to group Engineering and Computer Science together as “STEM”?
# Load the frustration dataset
frustration <- read.csv("frustration.csv")
# Before collapsing — four separate majors
table(frustration$Major)##
## Business English Mathematics Psychology
## 35 35 35 35
# Create a collapsed variable: STEM vs Non-STEM
frustration <- frustration %>%
mutate(
MajorGroup = ifelse(Major %in% c("Engineering", "Computer Science"),
"STEM", "Non-STEM") # ← REPLACE: your categories and grouping
)
# After collapsing — two groups
table(frustration$MajorGroup)##
## Non-STEM
## 140
%in% checks whether a value appears in a list. Major %in% c("Engineering", "Computer Science") returns TRUE for engineering and CS majors, FALSE for everyone else. This pattern is useful whenever you want to combine several small categories into a larger group.
3.2.10 The Full Recipe — Putting It All Together
You have now seen each operation in isolation. Here is the complete recipe, with every step chained together, from raw NESARC to a clean analysis dataset:
# Full NESARC data management recipe — from 43,094 × 3,008 to ~1,320 × 7
library(dplyr)
nesarc <- read.csv("NESARC.csv") %>% # 1. Load raw data
filter( # 2. Filter rows
S3AQ1A == 1 & # smoked 100+ cigarettes (1 = Yes)
S3AQ3B1 == 1 & # smoked in past 12 months (1 = Yes)
CHECK321 == 1 & # valid nicotine data (1 = valid)
AGE <= 25 # age 25 or younger
) %>%
select( # 3. Select columns
IDNUM, ETHRACE2A, AGE, MAJORDEPLIFE,
TAB12MDX, S3AQ3C1, SEX
) %>%
rename( # 4. Rename variables
Ethnicity = ETHRACE2A,
Age = AGE,
MajorDepression = MAJORDEPLIFE,
TobaccoDependence = TAB12MDX,
DailyCigsSmoked = S3AQ3C1,
Sex = SEX
)
# 5. Recode missing values — must happen AFTER the pipe because it modifies in place
nesarc$DailyCigsSmoked[nesarc$DailyCigsSmoked == 99] <- NA
# 6. Label categorical variables
nesarc$Ethnicity <- factor(nesarc$Ethnicity,
labels = c("Caucasian", "African American", "Native American", "Asian", "Hispanic"))
nesarc$MajorDepression <- factor(nesarc$MajorDepression,
labels = c("No Depression", "Yes Depression"))
nesarc$TobaccoDependence <- factor(nesarc$TobaccoDependence,
labels = c("No Nicotine Dependence", "Nicotine Dependence"))
nesarc$Sex <- factor(nesarc$Sex, labels = c("Female", "Male"))
# 7. Verify the result
dim(nesarc) # should be approximately 1,320 rows × 7 columns## [1] 1320 7
## IDNUM Ethnicity Age MajorDepression TobaccoDependence
## 1 21 African American 25 No Depression Nicotine Dependence
## 2 103 Caucasian 24 Yes Depression Nicotine Dependence
## 3 122 Caucasian 23 Yes Depression Nicotine Dependence
## 4 136 Caucasian 21 No Depression No Nicotine Dependence
## 5 150 Caucasian 23 Yes Depression No Nicotine Dependence
## 6 155 African American 21 No Depression Nicotine Dependence
## DailyCigsSmoked Sex
## 1 3 Male
## 2 10 Male
## 3 10 Male
## 4 20 Female
## 5 5 Male
## 6 8 Male
A note on the order of operations: You might wonder why we recode missing values and label factors after the pipe chain, not inside it. The filter() → select() → rename() operations work naturally with %>% because they return a data frame. Recoding a single column to NA modifies an existing column in place — it works just as easily outside the pipe. Both approaches produce the same result; this organization is a matter of readability.
3.2.11 From Raw Output to Professional Tables
Now that the data is clean, let’s verify our work with summary tables. As in Chapter 2, we start with base R and then upgrade to modelsummary.
##
## Female Male
## 646 674
##
## Caucasian African American Native American Asian
## 849 170 30 47
## Hispanic
## 224
##
## No Depression Yes Depression
## 965 355
##
## No Nicotine Dependence Nicotine Dependence
## 521 799
# Summary statistics for the quantitative variable — base R
mean(nesarc$DailyCigsSmoked, na.rm = TRUE) # average cigarettes per day## [1] 13.36426
## [1] 10
## [1] 8.657025
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 1.00 7.00 10.00 13.36 20.00 98.00 5
Now the same information, presented in a report-ready format:
library(modelsummary)
# All categorical variables in one frequency table
datasummary(Ethnicity + MajorDepression + TobaccoDependence + Sex ~ N + Percent(),
data = nesarc)| N | Percent | ||
|---|---|---|---|
| Ethnicity | Caucasian | 849 | 64.32 |
| African American | 170 | 12.88 | |
| Native American | 30 | 2.27 | |
| Asian | 47 | 3.56 | |
| Hispanic | 224 | 16.97 | |
| MajorDepression | No Depression | 965 | 73.11 |
| Yes Depression | 355 | 26.89 | |
| TobaccoDependence | No Nicotine Dependence | 521 | 39.47 |
| Nicotine Dependence | 799 | 60.53 | |
| Sex | Female | 646 | 48.94 |
| Male | 674 | 51.06 |
# Quantitative variable summary — show the same stats computed above in one table
datasummary(DailyCigsSmoked ~ N + Mean + Median + SD + Min + Max,
data = nesarc)| N | Mean | Median | SD | Min | Max | |
|---|---|---|---|---|---|---|
| DailyCigsSmoked | 1315 | 13.36 | 10.00 | 8.66 | 1.00 | 98.00 |
| Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
|---|---|---|---|---|---|---|---|---|
| IDNUM | 1320 | 0 | 20235.7 | 11966.3 | 21.0 | 19155.5 | 43088.0 | |
| Age | 8 | 0 | 21.6 | 2.2 | 18.0 | 22.0 | 25.0 | |
| DailyCigsSmoked | 30 | 0 | 13.4 | 8.7 | 1.0 | 10.0 | 98.0 | |
| N | % | |||||||
| Ethnicity | Caucasian | 849 | 64.3 | |||||
| African American | 170 | 12.9 | ||||||
| Native American | 30 | 2.3 | ||||||
| Asian | 47 | 3.6 | ||||||
| Hispanic | 224 | 17.0 | ||||||
| MajorDepression | No Depression | 965 | 73.1 | |||||
| Yes Depression | 355 | 26.9 | ||||||
| TobaccoDependence | No Nicotine Dependence | 521 | 39.5 | |||||
| Nicotine Dependence | 799 | 60.5 | ||||||
| Sex | Female | 646 | 48.9 | |||||
| Male | 674 | 51.1 |
Every number in the modelsummary output matches what you computed by hand with table(), mean(), median(), and sd(). The modelsummary functions are not doing anything different — they are just presenting the same calculations in a polished table. When you write up your results, use datasummary(). When you are exploring and thinking, use the base R functions — they give you direct access to the numbers without any formatting in the way.
3.2.12 A Preview of Sampling Variability
You have just built a dataset of about 1,320 young adult daily smokers. This is a sample — a subset of the larger NESARC survey, which is itself a sample of the U.S. adult population. The filtering you did (age ≤ 25, daily smokers) further narrowed the group.
Here is a fundamental question that will drive the rest of this book: If you took a different sample — a different 1,320 young daily smokers from a different survey — would you get the same results?
Almost certainly not. Every sample is a snapshot, and snapshots differ. The mean cigarettes per day in your NESARC subset might be 14.3. In another survey of the same population, it might be 14.8. In a third, 13.9. The differences are usually small, but they exist.
This is sampling variability — the idea that sample statistics (like the mean, the median, a proportion) vary from sample to sample, even when drawn from the same population.
Why does this matter? Because when you analyze your data and find, say, that 68% of young female smokers also have a depression diagnosis, you want to know: is this finding specific to your sample, or does it reflect something true about the broader population? If you ran the same analysis on a different sample and got 32%, you would know your original finding was unstable — probably just noise. If you got 67% in a second sample and 69% in a third, you would be more confident that the 68% figure is real.
Sampling variability is the reason we need statistical inference (Chapters 5–11). Inference gives us tools to quantify how much a sample statistic might vary, and to decide whether an observed pattern is strong enough to be trusted. You do not need to understand those tools yet — but you should understand why they are needed. Every dataset you will ever work with is a sample, and every sample comes with uncertainty. Your job as a data analyst is to acknowledge that uncertainty, not pretend it does not exist.
In the next chapter, you will put your newly cleaned NESARC dataset to work, exploring relationships between two variables at a time. The data management skills you learned here — filter, select, rename, recode, label, mutate — will serve you in every analysis you ever do, in this book and beyond.
3.2.13 Common Mistakes Students Make
Forgetting to recode missing values. The most common and most damaging error. If your dataset uses 99 or 999 for missing data and you do not recode, every mean, median, and SD you compute will be wrong. Always check the code book for missing value codes before running any analysis.
Writing the wrong filter conditions.
filter(S3AQ1A == 1)means “keep rows where S3AQ1A equals 1.”filter(S3AQ1A == 1 & S3AQ3B1 == 1)means “keep rows where BOTH equal 1.” If you use|(OR) by mistake, you will keep far more rows than intended. Double-check your logic operators.Confusing
=and==. In R,=is for assignment (naming things).==is for comparison (checking equality). If you accidentally typefilter(AGE = 25), R will refuse to run the code and print an error message telling you thatfilter()expects==, not=. It will never silently corrupt your data — you will know the moment you try to run it.filter(AGE == 25)does what you want.Running the filter after labeling factors, then wondering why the labels are gone. If you label your factors, then re-run the full pipe chain (which rebuilds the dataset from scratch), the labels disappear. Why? Because
read.csv()returns plain numbers, and every step in the pipe —filter(),select(),rename()— produces a brand-new data frame. Factor labels are attributes attached to the old data frame; they do not survive the rebuild. The factor labeling step must come after the pipe chain, or you must re-run the labeling code after every rebuild. This is why saving the cleaned dataset (see Mistake #5) is so important — you want to run the pipeline once, then load the saved, labeled version.Not saving the cleaned dataset. After all this work, you want to reuse the clean data in every future chapter. Save it with
write.csv(nesarc, "nesarc_clean.csv", row.names = FALSE)— this creates the file in your current working directory, which is typically the same folder as your Rmd file or RStudio project. Then load it in Chapter 4 with a singleread.csv()call. You do not want to re-run the 50-line data management recipe every chapter.
A critical safety rule: always use a different filename for the cleaned version (nesarc_clean.csv, not NESARC.csv). R will not stop you from overwriting a file — if you write to the same name, the original is gone forever. Keep your raw data files read-only (or at least never edit them directly), and always name your output files differently from your input files. A good habit: the raw data is sacred; you only ever read it, never write to it.
- Assuming the data is clean after one pass. After filtering, selecting, recoding, and labeling, run
summary()on every column and look for surprises — unusually large maximums, unexpected NAs, categories you thought you removed. Data management is iterative; you will catch things on the second pass that you missed on the first.
3.2.14 What Comes Next
You now have a clean, analysis-ready dataset: approximately 1,320 young adult daily smokers, with 7 well-labeled variables. You know how to load raw data, filter rows, select columns, rename variables, recode missing values, label factors, and create new variables — the complete data management toolkit.
In Chapter 4 (Bivariate Analysis), you will put this dataset to work. You will learn to explore relationships between two variables at a time: Do men and women smoke different amounts? Is ethnicity associated with depression diagnosis? Does the number of cigarettes smoked per day relate to nicotine dependence? These are the questions that data management makes possible — and the answers begin in the next chapter.