Chapter 3 Data Management in R

3.1 TL;DR: Data Management

Real data is messy. Before you can analyze, you must clean: filter rows, select columns, rename variables, recode missing values, and label categories.
dplyr is your toolkit. Five verbs do most of the work: filter(), select(), rename(), mutate(), and the pipe %>%.
Missing values hide in plain sight. Codes like 99, 999, or -1 often mean “no answer.” Convert them to NA so R ignores them in calculations.
Numeric codes ≠ categories. Variables like Sex (1 = Female, 2 = Male) are categorical — use factor() to label them so your output shows names, not numbers.
Build once, reuse forever. You will prepare the NESARC subset in this chapter and use it through Chapter 11. Get the data management right now and every analysis that follows will be easier.

# Load packages — install once with install.packages(c("dplyr", "modelsummary"))
library(dplyr)
library(modelsummary)

# Load the full NESARC dataset — the CSV is included with this book
nesarc <- read.csv("NESARC.csv")  # ← REPLACE: path to your dataset

# Step 1: Filter — keep only young daily smokers with valid data
# Step 2: Select — keep only the variables we need for our analyses
# Step 3: Rename — give variables readable names
nesarc <- nesarc %>%
  filter(S3AQ1A == 1 & CHECK321 == 1 & S3AQ3B1 == 1 & AGE <= 25) %>%  # ← REPLACE with your filter conditions
  select(IDNUM, ETHRACE2A, AGE, MAJORDEPLIFE, TAB12MDX, S3AQ3C1, SEX) %>%  # ← REPLACE with your columns
  rename(                                                                   # ← REPLACE with your names
    Ethnicity         = ETHRACE2A,     # ethnicity (categorical, 1–5)
    Age               = AGE,           # age in years (quantitative)
    MajorDepression   = MAJORDEPLIFE,  # lifetime depression diagnosis (categorical, 0/1)
    TobaccoDependence = TAB12MDX,      # tobacco dependence, past 12 months (categorical, 0/1)
    DailyCigsSmoked   = S3AQ3C1,      # cigarettes per day (quantitative)
    Sex               = SEX            # sex (categorical, 1/2)
  )

# Step 4: Recode missing values — 99 in DailyCigsSmoked means "unknown"
nesarc$DailyCigsSmoked[nesarc$DailyCigsSmoked == 99] <- NA  # ← REPLACE: your missing value code

# Step 5: Label categorical variables — output shows names, not numbers
nesarc$Ethnicity <- factor(nesarc$Ethnicity,                              # ← REPLACE: your variable
  levels = c(1, 2, 3, 4, 5),                                             # the numeric codes
  labels = c("Caucasian", "African American", "Native American",           # ← REPLACE: your labels
             "Asian", "Hispanic"))
nesarc$MajorDepression <- factor(nesarc$MajorDepression,                   # ← REPLACE: your variable
  labels = c("No Depression", "Yes Depression"))
nesarc$TobaccoDependence <- factor(nesarc$TobaccoDependence,               # ← REPLACE: your variable
  labels = c("No Nicotine Dependence", "Nicotine Dependence"))
nesarc$Sex <- factor(nesarc$Sex, labels = c("Female", "Male"))            # ← REPLACE: your labels

# Step 6: Verify — check dimensions and preview
dim(nesarc)   # rows × columns

## [1] 1320    7

head(nesarc)  # first 6 rows

##   IDNUM        Ethnicity Age MajorDepression      TobaccoDependence
## 1    21 African American  25   No Depression    Nicotine Dependence
## 2   103        Caucasian  24  Yes Depression    Nicotine Dependence
## 3   122        Caucasian  23  Yes Depression    Nicotine Dependence
## 4   136        Caucasian  21   No Depression No Nicotine Dependence
## 5   150        Caucasian  23  Yes Depression No Nicotine Dependence
## 6   155 African American  21   No Depression    Nicotine Dependence
##   DailyCigsSmoked    Sex
## 1               3   Male
## 2              10   Male
## 3              10   Male
## 4              20 Female
## 5               5   Male
## 6               8   Male

# Step 7: Quick overview with modelsummary — same information, report-ready format
datasummary_skim(nesarc)  # ← REPLACE: your data frame

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
IDNUM	1320	0	20235.7	11966.3	21.0	19155.5	43088.0
Age	8	0	21.6	2.2	18.0	22.0	25.0
DailyCigsSmoked	30	0	13.4	8.7	1.0	10.0	98.0
		N	%
Ethnicity	Caucasian	849	64.3
	African American	170	12.9
	Native American	30	2.3
	Asian	47	3.6
	Hispanic	224	17.0
MajorDepression	No Depression	965	73.1
	Yes Depression	355	26.9
TobaccoDependence	No Nicotine Dependence	521	39.5
	Nicotine Dependence	799	60.5
Sex	Female	646	48.9
	Male	674	51.1

Key takeaway: You now have a clean, analysis-ready dataset with ~1,320 observations of young adult daily smokers and 7 well-labeled variables. Every chapter from here forward starts with the nesarc data frame you just built.

3.2 Deep Dive: Data Management

3.2.1 Why Data Management Matters

In Chapter 2, you loaded small, clean datasets — soccer.csv, epiduralf.csv — where every variable was already named, every missing value was already handled, and every category was already labeled. That was like driving a car someone else had washed, fueled, and tuned.

Real-world data is different. Real datasets arrive with:

Cryptic variable names. S3AQ3C1 means nothing until you read the code book and learn it’s “usual number of cigarettes smoked per day.”
Missing values disguised as numbers. A value of 99 or 999 might mean “not answered,” but R will treat it as a real number unless you tell it otherwise.
Numeric codes that are really categories. Sex stored as 1 and 2, ethnicity stored as 1 through 5 — these look quantitative but are categorical.
Thousands of columns you do not need. The full NESARC dataset has over 3,000 variables. Your research question needs maybe 6 or 7.

Data management is the process of transforming raw data into a clean, analysis-ready form. It is not glamorous, but it is essential. A mistake in data management — forgetting to recode a missing value, mislabeling a variable — will corrupt every analysis that follows. Get the cleaning right, and everything downstream is easier.

The good news: a handful of tools in the dplyr package handle most data management tasks. This chapter teaches you five verbs: filter(), select(), rename(), mutate(), and the pipe %>%. By the end, you will have built a clean subset of the NESARC dataset that you will use for the rest of the book.

3.2.2 Real Data Is Messy — Load the NESARC

Let’s start by loading the full NESARC dataset. This is the same dataset you met in Chapter 1 — a survey of over 43,000 U.S. adults covering health, behavior, and demographics.

# Load the full NESARC dataset from CSV
nesarc_full <- read.csv("NESARC.csv")  # ← REPLACE: read.csv("your_file.csv")

# Check how big it is
dim(nesarc_full)  # 43094 rows, 3008 columns

## [1] 43093  3008

43,094 rows and 3,008 columns. That is far more than you need. Your research question (from Chapter 1) probably involves 5–10 variables for a specific subset of people. Let’s narrow things down.

But first, a tour of the raw data to understand why cleaning is necessary:

# Look at the first few rows of three key variables — raw, uncleaned
head(nesarc_full$S3AQ1A)    # "Smoked 100+ cigarettes?" — stored as 1 or 2

## [1] 2 2 2 2 2 2

head(nesarc_full$ETHRACE2A) # "Ethnicity" — stored as 1 through 5

## [1] 5 5 5 5 2 2

head(nesarc_full$S3AQ3C1)   # "Cigarettes per day" — includes 99 for "unknown"

## [1] NA NA NA NA NA NA

# Without the code book, these numbers are meaningless

Every one of these columns needs work. The values 1 and 2 in S3AQ1A should be “Yes” and “No.” The numbers 1–5 in ETHRACE2A should be “Caucasian,” “African American,” and so on. And the 99 in S3AQ3C1 — that is not 99 cigarettes a day. According to the NESARC code book, 99 means “unknown” or “missing.” If you computed the average without recoding 99 to NA, you would get a wildly inflated number.

This is why data management comes before analysis. Let’s fix each problem, one at a time.

3.2.3 The `%>%` Pipe — Chaining Operations Together

Before we dive into individual operations, let’s understand the tool that connects them. The pipe operator %>% (pronounced “and then”) takes the output of one function and feeds it as the input to the next function.

library(dplyr)

# Without the pipe — each step requires a new line and intermediate variables
step1 <- filter(nesarc_full, AGE <= 25)
step2 <- select(step1, AGE, SEX)
head(step2)

##   AGE SEX
## 1  23   1
## 2  18   1
## 3  19   1
## 4  18   2
## 5  21   2
## 6  25   2

# With the pipe — read from top to bottom: "take nesarc_full, AND THEN filter, AND THEN select"
nesarc_full %>%
  filter(AGE <= 25) %>%
  select(AGE, SEX) %>%
  head()

##   AGE SEX
## 1  23   1
## 2  18   1
## 3  19   1
## 4  18   2
## 5  21   2
## 6  25   2

Both produce the same result, but the piped version reads like a recipe: “take the data, filter it, select columns, show the head.” You will use the pipe throughout this chapter and every chapter after. Think of it as %>% = “and then.”

You may have noticed that Chapter 2 used a different pipe symbol: |>. Both pipes do the same thing — pass a result into the next function. The |> is the base R pipe (built into R since version 4.1), while %>% comes from the dplyr package and has been the standard for over a decade. This book uses %>% with dplyr operations because it is still the most common pipe in the R ecosystem, but you can use |> anywhere if you prefer. Your code will work the same either way.

3.2.4 Filtering Rows — Keeping the Observations You Need

Not every observation in the dataset is relevant to your research question. If you are studying nicotine dependence among young smokers, you do not need data from 65-year-old non-smokers. Filtering keeps only the rows that match your criteria.

filter() takes a data frame and returns only the rows where a condition is true.

library(dplyr)

# Which observations should we keep? Our research question focuses on:
#   - People who have smoked 100+ cigarettes in their lifetime (S3AQ1A == 1)
#   - People who have smoked in the past 12 months (S3AQ3B1 == 1)
#   - People with valid nicotine dependence data (CHECK321 == 1)
#   - Young adults, age 25 or under (AGE <= 25)

nesarc_subset <- nesarc_full %>%                  # start with the full data
  filter(                                         # keep rows where ALL conditions are true:
    S3AQ1A  == 1 &                                #   smoked 100+ cigarettes (1 = Yes)
    S3AQ3B1 == 1 &                                #   smoked in past 12 months (1 = Yes)
    CHECK321 == 1 &                               #   valid nicotine data flag (1 = valid)
    AGE     <= 25                                 #   age 25 or younger
  )                                               # ← REPLACE: adjust conditions for your question

# How many observations remain?
dim(nesarc_subset)  # ~1,320 rows — down from 43,094

## [1] 1320 3008

The & symbol means “AND.” A row must satisfy every condition to be kept. A person who smoked 100+ cigarettes AND smoked recently AND has valid data AND is 25 or younger — all four must be true. If any one condition is false, that row is dropped.

What if you want to keep rows that satisfy either condition? Use | (the vertical bar, Shift + backslash) for OR. For example, filter(AGE == 25 | AGE == 26) keeps people who are either 25 or 26 years old. & means “all conditions must be true”; | means “at least one condition must be true.”

What the conditions mean:

Condition	Variable	Meaning
`S3AQ1A == 1`	Smoked 100+ cigarettes in lifetime	1 = Yes, 2 = No
`S3AQ3B1 == 1`	Smoked in the past 12 months	1 = Yes, 2 = No
`CHECK321 == 1`	Valid nicotine dependence data	1 = valid, 0 = invalid
`AGE <= 25`	Age 25 or younger	Years (quantitative)

Always check the code book to confirm what each value means before you write a filter condition. S3AQ1A == 1 means “Yes, the person smoked 100+ cigarettes” — but only because the code book says 1 means Yes. Without the code book, you might incorrectly assume 1 means No.

Note on the sample: Why focus on young adults? Because substance use patterns change with age, and focusing on a specific age group makes the analysis more interpretable. You could choose a different age range — the important thing is to have a clear rationale for your filter criteria.

3.2.5 Selecting Columns — Keeping Only the Variables You Need

The full NESARC has 3,008 columns. Most are irrelevant to your research question. Selecting keeps only the columns you will actually use.

select() takes a data frame and returns only the columns you name.

# Our research question involves: ethnicity, age, depression, tobacco dependence,
# cigarettes per day, and sex. We also keep IDNUM as a unique identifier.

nesarc_subset <- nesarc_subset %>%               # continue from the filtered data
  select(                                        # keep only these columns:
    IDNUM,                                       #   unique respondent ID
    ETHRACE2A,                                   #   ethnicity (1–5)
    AGE,                                         #   age in years
    MAJORDEPLIFE,                                #   lifetime major depression (0/1)
    TAB12MDX,                                    #   tobacco dependence, past 12 months (0/1)
    S3AQ3C1,                                     #   usual cigarettes per day (1–98, 99=unknown)
    SEX                                          #   sex (1=Female, 2=Male)
  )                                              # ← REPLACE: add/remove columns for your question

# Check: only 7 columns now
dim(nesarc_subset)  # ~1,320 rows, 7 columns

## [1] 1320    7

You went from 43,094 × 3,008 to approximately 1,320 × 7. The dataset is now focused on exactly the variables your research question needs.

Why select columns early? Because smaller datasets are easier to inspect, faster to work with, and less likely to cause confusion. If you keep all 3,008 columns, you might accidentally analyze the wrong variable with a similar name. Narrowing to the columns you care about reduces the chance of mistakes.

3.2.6 Renaming Variables — Making Names Readable

S3AQ3C1 is a terrible variable name. It means nothing outside the context of the survey’s original numbering scheme. Renaming gives variables names that describe what they actually measure.

rename() takes a data frame and changes column names. The syntax is new_name = old_name.

# Replace cryptic survey codes with descriptive names
nesarc_subset <- nesarc_subset %>%
  rename(                                       # new_name = old_name:
    Ethnicity         = ETHRACE2A,              #   ethnicity category
    Age               = AGE,                    #   age in years
    MajorDepression   = MAJORDEPLIFE,           #   lifetime depression diagnosis
    TobaccoDependence = TAB12MDX,               #   tobacco dependence, past 12 months
    DailyCigsSmoked   = S3AQ3C1,                #   usual cigarettes smoked per day
    Sex               = SEX                     #   biological sex
  )                                              # ← REPLACE: names for your variables

# Now the column names are self-documenting
names(nesarc_subset)

## [1] "IDNUM"             "Ethnicity"         "Age"              
## [4] "MajorDepression"   "TobaccoDependence" "DailyCigsSmoked"  
## [7] "Sex"

Good variable names are short but descriptive. Ethnicity tells you what the column contains. ETHRACE2A does not. Renaming is one of the simplest data management operations — and one of the most important for keeping your code readable six months from now, when you have forgotten what ETHRACE2A means.

3.2.7 Recoding Missing Values — Converting Placeholders to NA

In many datasets, special codes indicate missing data: 99, 999, -1, or blank cells. The NESARC code book tells us that in DailyCigsSmoked, a value of 99 means “unknown.” If we leave 99 as a real number, R will include it when computing the mean — and 99 cigarettes a day would make the average absurdly high.

Recoding missing values replaces these placeholder codes with NA, R’s symbol for “not available.” R automatically excludes NA values from calculations like mean() and sd().

# Before recoding: check how many rows have the value 99
table(nesarc_subset$DailyCigsSmoked == 99)

## 
## FALSE  TRUE 
##  1315     5

# Recode 99 to NA — "99 means unknown, treat it as missing"
nesarc_subset$DailyCigsSmoked[nesarc_subset$DailyCigsSmoked == 99] <- NA  # ← REPLACE: your missing code

# After recoding: verify the 99s are gone
table(nesarc_subset$DailyCigsSmoked == 99)     # should be all FALSE

## 
## FALSE 
##  1315

summary(nesarc_subset$DailyCigsSmoked)          # NA count appears at the end

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##    1.00    7.00   10.00   13.36   20.00   98.00       5

# Compare: mean WITH the 99 (wrong) vs mean with NA (correct)
# If we forgot to recode:
#   mean(c(10, 15, 20, 99)) → 36 — nonsense
# After recoding to NA:
#   mean(c(10, 15, 20, NA), na.rm = TRUE) → 15 — makes sense

The syntax dataframe$column[condition] <- NA reads as: “in this column, for every row where the condition is true, replace the value with NA.” You will use this pattern whenever a dataset uses placeholder codes for missing data.

Important: Always check the code book for missing value codes before you start analyzing. Every dataset has its own conventions. 99 is common in health surveys, 999 in economic data, and -1 in some psychological instruments. There is no universal standard — the code book is your only reliable guide.

How do researchers know 99 is not a real value? Because the survey itself never offers 99 as a valid answer. The NESARC questionnaire asks respondents for a number between 1 and 98; 99 is reserved by the data collectors as a “not answered” placeholder. The code book documents this decision. In every dataset you encounter, the missing value codes were chosen by the data collectors ahead of time — they are not real responses that happen to look suspicious.

When you convert a value to NA, R does not delete the row. The person’s age, ethnicity, and all other variables remain in the dataset. Functions like mean() simply skip NA values when computing, as long as you add na.rm = TRUE. If you ever need to explicitly remove rows with missing data, use na.omit(), but we usually prefer to keep the rows and handle NAs on a case-by-case basis.

3.2.8 Creating and Labeling Factors — From Numbers to Names

Look at the Ethnicity column in your current dataset:

# Ethnicity is stored as numbers — meaningless without the code book
head(nesarc_subset$Ethnicity)

## [1] 2 1 1 1 1 2

These numbers (1, 2, 3, 4, 5) are not quantities — they are category labels. A value of 4 does not mean “four units of ethnicity.” It means “Asian,” according to the code book. In R, a categorical variable stored this way is called a factor. Converting a numeric column to a factor attaches readable labels to the numeric codes.

# Convert numeric codes to labeled categories
nesarc_subset$Ethnicity <- factor(nesarc_subset$Ethnicity,          # the column to convert
  levels = c(1, 2, 3, 4, 5),                                      # the numeric codes in order
  labels = c("Caucasian", "African American", "Native American",    # ← REPLACE: your labels
             "Asian", "Hispanic"))

# Now the output shows names, not numbers
head(nesarc_subset$Ethnicity)

## [1] African American Caucasian        Caucasian        Caucasian       
## [5] Caucasian        African American
## Levels: Caucasian African American Native American Asian Hispanic

# Label all other categorical variables the same way
nesarc_subset$MajorDepression <- factor(nesarc_subset$MajorDepression,
  labels = c("No Depression", "Yes Depression"))                    # ← REPLACE: your labels

nesarc_subset$TobaccoDependence <- factor(nesarc_subset$TobaccoDependence,
  labels = c("No Nicotine Dependence", "Nicotine Dependence"))      # ← REPLACE: your labels

nesarc_subset$Sex <- factor(nesarc_subset$Sex,
  labels = c("Female", "Male"))                                     # ← REPLACE: your labels

Why does this matter? Compare the output of a frequency table before and after labeling:

# Before labeling: numbers only — you need the code book to interpret
# (this uses the unlabeled version — we already labeled ours above)
# table(nesarc_subset$Sex) would show: 1: 600, 2: 720

# After labeling: names tell the story directly
table(nesarc_subset$Sex)    # shows Female and Male — no code book needed

## 
## Female   Male 
##    646    674

table(nesarc_subset$Ethnicity)

## 
##        Caucasian African American  Native American            Asian 
##              849              170               30               47 
##         Hispanic 
##              224

A reader can understand “Female: 680, Male: 640” immediately. “1: 680, 2: 640” requires flipping back to the code book. Label your factors.

3.2.9 Creating New Variables — `mutate()` and `ifelse()`

Sometimes the variable you want does not exist in the raw dataset — you need to create it from existing columns. mutate() adds new columns, and ifelse() helps with conditional logic.

# Example: Create a binary variable for "heavy smoker" (20+ cigarettes per day)
nesarc_subset <- nesarc_subset %>%
  mutate(
    HeavySmoker = ifelse(DailyCigsSmoked >= 20, "Yes", "No")  # ← REPLACE with your new variable logic
  )

# Verify the new column
table(nesarc_subset$HeavySmoker)

## 
##  No Yes 
## 871 444

# Note: ifelse() handles NAs carefully — rows where DailyCigsSmoked is NA
# will have HeavySmoker = NA. R preserves missingness.

mutate() adds a new column to the right side of the data frame. ifelse(condition, value_if_true, value_if_false) is R’s way of saying “if this condition is true, assign value A; otherwise, assign value B.” Think of ifelse() as filling in a new column row by row: for each person, check the condition; if true, write “Yes”; if false, write “No.”

Another common pattern: collapsing categories. The frustration dataset (included in data/frustration.csv) has a variable Major with four values. What if you wanted to group Engineering and Computer Science together as “STEM”?

# Load the frustration dataset
frustration <- read.csv("frustration.csv")

# Before collapsing — four separate majors
table(frustration$Major)

## 
##    Business     English Mathematics  Psychology 
##          35          35          35          35

# Create a collapsed variable: STEM vs Non-STEM
frustration <- frustration %>%
  mutate(
    MajorGroup = ifelse(Major %in% c("Engineering", "Computer Science"),
                        "STEM", "Non-STEM")  # ← REPLACE: your categories and grouping
  )

# After collapsing — two groups
table(frustration$MajorGroup)

## 
## Non-STEM 
##      140

%in% checks whether a value appears in a list. Major %in% c("Engineering", "Computer Science") returns TRUE for engineering and CS majors, FALSE for everyone else. This pattern is useful whenever you want to combine several small categories into a larger group.

3.2.10 The Full Recipe — Putting It All Together

You have now seen each operation in isolation. Here is the complete recipe, with every step chained together, from raw NESARC to a clean analysis dataset:

# Full NESARC data management recipe — from 43,094 × 3,008 to ~1,320 × 7
library(dplyr)

nesarc <- read.csv("NESARC.csv") %>%                                  # 1. Load raw data
  filter(                                                             # 2. Filter rows
    S3AQ1A  == 1 &                                                    #    smoked 100+ cigarettes (1 = Yes)
    S3AQ3B1 == 1 &                                                    #    smoked in past 12 months (1 = Yes)
    CHECK321 == 1 &                                                   #    valid nicotine data (1 = valid)
    AGE     <= 25                                                     #    age 25 or younger
  ) %>%
  select(                                                             # 3. Select columns
    IDNUM, ETHRACE2A, AGE, MAJORDEPLIFE,
    TAB12MDX, S3AQ3C1, SEX
  ) %>%
  rename(                                                             # 4. Rename variables
    Ethnicity         = ETHRACE2A,
    Age               = AGE,
    MajorDepression   = MAJORDEPLIFE,
    TobaccoDependence = TAB12MDX,
    DailyCigsSmoked   = S3AQ3C1,
    Sex               = SEX
  )

# 5. Recode missing values — must happen AFTER the pipe because it modifies in place
nesarc$DailyCigsSmoked[nesarc$DailyCigsSmoked == 99] <- NA

# 6. Label categorical variables
nesarc$Ethnicity <- factor(nesarc$Ethnicity,
  labels = c("Caucasian", "African American", "Native American", "Asian", "Hispanic"))
nesarc$MajorDepression <- factor(nesarc$MajorDepression,
  labels = c("No Depression", "Yes Depression"))
nesarc$TobaccoDependence <- factor(nesarc$TobaccoDependence,
  labels = c("No Nicotine Dependence", "Nicotine Dependence"))
nesarc$Sex <- factor(nesarc$Sex, labels = c("Female", "Male"))

# 7. Verify the result
dim(nesarc)    # should be approximately 1,320 rows × 7 columns

## [1] 1320    7

head(nesarc)   # preview the first 6 rows

##   IDNUM        Ethnicity Age MajorDepression      TobaccoDependence
## 1    21 African American  25   No Depression    Nicotine Dependence
## 2   103        Caucasian  24  Yes Depression    Nicotine Dependence
## 3   122        Caucasian  23  Yes Depression    Nicotine Dependence
## 4   136        Caucasian  21   No Depression No Nicotine Dependence
## 5   150        Caucasian  23  Yes Depression No Nicotine Dependence
## 6   155 African American  21   No Depression    Nicotine Dependence
##   DailyCigsSmoked    Sex
## 1               3   Male
## 2              10   Male
## 3              10   Male
## 4              20 Female
## 5               5   Male
## 6               8   Male

A note on the order of operations: You might wonder why we recode missing values and label factors after the pipe chain, not inside it. The filter() → select() → rename() operations work naturally with %>% because they return a data frame. Recoding a single column to NA modifies an existing column in place — it works just as easily outside the pipe. Both approaches produce the same result; this organization is a matter of readability.

3.2.11 From Raw Output to Professional Tables

Now that the data is clean, let’s verify our work with summary tables. As in Chapter 2, we start with base R and then upgrade to modelsummary.

# Frequency table for each categorical variable — base R
table(nesarc$Sex)

## 
## Female   Male 
##    646    674

table(nesarc$Ethnicity)

## 
##        Caucasian African American  Native American            Asian 
##              849              170               30               47 
##         Hispanic 
##              224

table(nesarc$MajorDepression)

## 
##  No Depression Yes Depression 
##            965            355

table(nesarc$TobaccoDependence)

## 
## No Nicotine Dependence    Nicotine Dependence 
##                    521                    799

# Summary statistics for the quantitative variable — base R
mean(nesarc$DailyCigsSmoked, na.rm = TRUE)    # average cigarettes per day

## [1] 13.36426

median(nesarc$DailyCigsSmoked, na.rm = TRUE)  # middle value

## [1] 10

sd(nesarc$DailyCigsSmoked, na.rm = TRUE)      # standard deviation

## [1] 8.657025

summary(nesarc$DailyCigsSmoked)               # min, quartiles, median, mean, max, NAs

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
##    1.00    7.00   10.00   13.36   20.00   98.00       5

Now the same information, presented in a report-ready format:

library(modelsummary)

# All categorical variables in one frequency table
datasummary(Ethnicity + MajorDepression + TobaccoDependence + Sex ~ N + Percent(),
            data = nesarc)

		N	Percent
Ethnicity	Caucasian	849	64.32
	African American	170	12.88
	Native American	30	2.27
	Asian	47	3.56
	Hispanic	224	16.97
MajorDepression	No Depression	965	73.11
	Yes Depression	355	26.89
TobaccoDependence	No Nicotine Dependence	521	39.47
	Nicotine Dependence	799	60.53
Sex	Female	646	48.94
	Male	674	51.06

# Quantitative variable summary — show the same stats computed above in one table
datasummary(DailyCigsSmoked ~ N + Mean + Median + SD + Min + Max,
            data = nesarc)

	N	Mean	Median	SD	Min	Max
DailyCigsSmoked	1315	13.36	10.00	8.66	1.00	98.00

# Quick skim — one-line overview of every variable
datasummary_skim(nesarc)

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
IDNUM	1320	0	20235.7	11966.3	21.0	19155.5	43088.0
Age	8	0	21.6	2.2	18.0	22.0	25.0
DailyCigsSmoked	30	0	13.4	8.7	1.0	10.0	98.0
		N	%
Ethnicity	Caucasian	849	64.3
	African American	170	12.9
	Native American	30	2.3
	Asian	47	3.6
	Hispanic	224	17.0
MajorDepression	No Depression	965	73.1
	Yes Depression	355	26.9
TobaccoDependence	No Nicotine Dependence	521	39.5
	Nicotine Dependence	799	60.5
Sex	Female	646	48.9
	Male	674	51.1

Every number in the modelsummary output matches what you computed by hand with table(), mean(), median(), and sd(). The modelsummary functions are not doing anything different — they are just presenting the same calculations in a polished table. When you write up your results, use datasummary(). When you are exploring and thinking, use the base R functions — they give you direct access to the numbers without any formatting in the way.

3.2.12 A Preview of Sampling Variability

You have just built a dataset of about 1,320 young adult daily smokers. This is a sample — a subset of the larger NESARC survey, which is itself a sample of the U.S. adult population. The filtering you did (age ≤ 25, daily smokers) further narrowed the group.

Here is a fundamental question that will drive the rest of this book: If you took a different sample — a different 1,320 young daily smokers from a different survey — would you get the same results?

Almost certainly not. Every sample is a snapshot, and snapshots differ. The mean cigarettes per day in your NESARC subset might be 14.3. In another survey of the same population, it might be 14.8. In a third, 13.9. The differences are usually small, but they exist.

This is sampling variability — the idea that sample statistics (like the mean, the median, a proportion) vary from sample to sample, even when drawn from the same population.

Why does this matter? Because when you analyze your data and find, say, that 68% of young female smokers also have a depression diagnosis, you want to know: is this finding specific to your sample, or does it reflect something true about the broader population? If you ran the same analysis on a different sample and got 32%, you would know your original finding was unstable — probably just noise. If you got 67% in a second sample and 69% in a third, you would be more confident that the 68% figure is real.

Sampling variability is the reason we need statistical inference (Chapters 5–11). Inference gives us tools to quantify how much a sample statistic might vary, and to decide whether an observed pattern is strong enough to be trusted. You do not need to understand those tools yet — but you should understand why they are needed. Every dataset you will ever work with is a sample, and every sample comes with uncertainty. Your job as a data analyst is to acknowledge that uncertainty, not pretend it does not exist.

In the next chapter, you will put your newly cleaned NESARC dataset to work, exploring relationships between two variables at a time. The data management skills you learned here — filter, select, rename, recode, label, mutate — will serve you in every analysis you ever do, in this book and beyond.

3.2.13 Common Mistakes Students Make

Forgetting to recode missing values. The most common and most damaging error. If your dataset uses 99 or 999 for missing data and you do not recode, every mean, median, and SD you compute will be wrong. Always check the code book for missing value codes before running any analysis.
Writing the wrong filter conditions. filter(S3AQ1A == 1) means “keep rows where S3AQ1A equals 1.” filter(S3AQ1A == 1 & S3AQ3B1 == 1) means “keep rows where BOTH equal 1.” If you use | (OR) by mistake, you will keep far more rows than intended. Double-check your logic operators.
Confusing = and ==. In R, = is for assignment (naming things). == is for comparison (checking equality). If you accidentally type filter(AGE = 25), R will refuse to run the code and print an error message telling you that filter() expects ==, not =. It will never silently corrupt your data — you will know the moment you try to run it. filter(AGE == 25) does what you want.
Running the filter after labeling factors, then wondering why the labels are gone. If you label your factors, then re-run the full pipe chain (which rebuilds the dataset from scratch), the labels disappear. Why? Because read.csv() returns plain numbers, and every step in the pipe — filter(), select(), rename() — produces a brand-new data frame. Factor labels are attributes attached to the old data frame; they do not survive the rebuild. The factor labeling step must come after the pipe chain, or you must re-run the labeling code after every rebuild. This is why saving the cleaned dataset (see Mistake #5) is so important — you want to run the pipeline once, then load the saved, labeled version.
Not saving the cleaned dataset. After all this work, you want to reuse the clean data in every future chapter. Save it with write.csv(nesarc, "nesarc_clean.csv", row.names = FALSE) — this creates the file in your current working directory, which is typically the same folder as your Rmd file or RStudio project. Then load it in Chapter 4 with a single read.csv() call. You do not want to re-run the 50-line data management recipe every chapter.

A critical safety rule: always use a different filename for the cleaned version (nesarc_clean.csv, not NESARC.csv). R will not stop you from overwriting a file — if you write to the same name, the original is gone forever. Keep your raw data files read-only (or at least never edit them directly), and always name your output files differently from your input files. A good habit: the raw data is sacred; you only ever read it, never write to it.

Assuming the data is clean after one pass. After filtering, selecting, recoding, and labeling, run summary() on every column and look for surprises — unusually large maximums, unexpected NAs, categories you thought you removed. Data management is iterative; you will catch things on the second pass that you missed on the first.

3.2.14 What Comes Next

You now have a clean, analysis-ready dataset: approximately 1,320 young adult daily smokers, with 7 well-labeled variables. You know how to load raw data, filter rows, select columns, rename variables, recode missing values, label factors, and create new variables — the complete data management toolkit.

In Chapter 4 (Bivariate Analysis), you will put this dataset to work. You will learn to explore relationships between two variables at a time: Do men and women smoke different amounts? Is ethnicity associated with depression diagnosis? Does the number of cigarettes smoked per day relate to nicotine dependence? These are the questions that data management makes possible — and the answers begin in the next chapter.