Chapter 1 Data Fundamentals
1.1 TL;DR: What Data Looks Like
- Data is a table. Rows = observations (individuals). Columns = variables (characteristics measured).
- Quantitative variables are numbers you can do math with — age, test scores, temperature.
- Categorical variables are labels or groups — blood type, favorite subject, yes/no.
- Dummy coding trap: categories are often stored as numbers (0 = No, 1 = Yes). Always check the code book.
- Code books are essential. Read one before you analyze any dataset — it tells you what every variable and value means.
- NESARC dataset: 43,093 rows, 3,008 columns. Included with this book. You’ll use it from Chapter 2 onward.
# mtcars is a built-in R dataset about 32 car models — no loading needed
# Look at the first 6 rows to understand the structure
head(mtcars) # ← REPLACE: head(your_data) to preview your own dataset
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Inspect the structure: variable names, types, and first few values
str(mtcars) # ← REPLACE: str(your_data) to check your own dataset's structure
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Key takeaway: Before you run any analysis, understand what each column represents. That understanding comes from the code book.
1.2 Deep Dive: What Data Is
# Load the NESARC dataset from CSV — the file is included with this book
nesarc <- read.csv("NESARC.csv") # ← REPLACE: read.csv("your_file.csv")
# Check dimensions: rows first, then columns
dim(nesarc) # 43093 rows × 3008 columns
## [1] 43093 3008
This reads 43,093 rows and 3,008 columns into R. The file NESARC.csv is included with this book, so you already have it. You will work with this dataset from Chapter 2 onward, but for now, let’s make sure you understand the basics.
1.2.1 Rows and Columns — The Building Blocks
Imagine you are the class monitor and your teacher asks you to collect information about every student in your grade. You create a spreadsheet. Each student gets their own row. Each piece of information you collect — name, age, homeroom, score on the last math exam — gets its own column.
That spreadsheet is a dataset.
In statistics, we use precise language for these parts:
- An observation (or individual) is one row. It represents a single person, object, or event. In a dataset of hospital patients, each row is one patient. In a dataset of Mars craters, each row is one crater.
- A variable is one column. It represents a characteristic that was measured or recorded. Age, height, test score, favorite color — these are all variables.
Why does this distinction matter? Because every statistical technique you will learn in this book works on variables. When you make a histogram, you are visualizing one variable. When you run a regression, you are modeling the relationship between two (or more) variables. If you cannot tell which columns are which type of variable, you cannot choose the right tool for the job.
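To make the row and column vocabulary concrete, here is a minimal sketch that builds the class-monitor spreadsheet as an R data frame. The object name students and every value in it are invented for illustration; they do not come from any dataset used in this book.

```r
# Hypothetical class roster: all names and values are made up for illustration
students <- data.frame(
  name     = c("Li Wei", "Ana", "Tom", "Mei"),
  age      = c(15, 16, 15, 16),
  homeroom = c("A", "B", "A", "B"),
  math     = c(88, 92, 75, 81)
)

nrow(students)   # number of observations (rows)
ncol(students)   # number of variables (columns)
names(students)  # the variable names
students[2, ]    # one observation: everything recorded about the second student
students$math    # one variable: the math score of every student
```

Selecting with students[2, ] returns a row (one individual), while students$math returns a column (one characteristic across all individuals).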
A word about repeated measurements: What if the same person is measured twice — like a pre-test and a post-test? In the simplest case (and in the datasets used throughout this book), one row equals one person, and repeated measurements become separate columns like PreTest and PostTest. In Chapter 3, you will learn how to reshape data for more complex situations. For now, assume one row = one person.
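The one-row-per-person layout described above might look like the following sketch. The scores data frame and its values are hypothetical; only the column names PreTest and PostTest come from the text.

```r
# Hypothetical scores: one row per student, one column per measurement occasion
scores <- data.frame(
  student_id = c(1, 2, 3),
  PreTest    = c(60, 72, 85),
  PostTest   = c(68, 75, 83)
)

# Both measurements sit on the same row, so each student's change
# is a simple column subtraction
scores$Change <- scores$PostTest - scores$PreTest
scores
```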
1.2.2 Quantitative vs. Categorical Variables
Every variable falls into one of two categories:
Quantitative variables are numbers that represent meaningful amounts. You can add them, subtract them, compute averages, and compare magnitudes. Examples:
- Your score on a math test (0–100)
- The number of hours you slept last night (0–24)
- A person’s height in centimeters
- The temperature in Beijing on a given day
Categorical variables are labels that place an individual into a group. You cannot meaningfully add or average them. Examples:
- Your favorite subject (Physics, Art, History, …)
- Your blood type (A, B, AB, O)
- Whether you have a pet (Yes / No)
- The city you were born in
Here is a quick way to test yourself: ask “does it make sense to compute an average of this?” If the answer is yes, the variable is probably quantitative. If the answer is “an average of what?”, it is probably categorical. The average of a list of test scores is a meaningful number. The average of a list of favorite subjects is not.
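The "can I average it?" test can be tried directly in R. This is a small sketch with made-up values; test_scores and favorite_subject are hypothetical vectors, not columns from a real dataset.

```r
# Hypothetical data for five students
test_scores      <- c(78, 85, 92, 64, 88)
favorite_subject <- c("Physics", "Art", "History", "Art", "Physics")

# Quantitative: the average is a meaningful number
mean(test_scores)

# Categorical: an average makes no sense. Running mean(favorite_subject)
# would return NA with a warning, so count the categories instead
table(favorite_subject)
```

For a categorical variable, the natural summary is a count per group, which is what table() produces.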
# mtcars is a built-in R dataset about 32 car models — no loading needed
# Look at the first 6 rows to understand the structure
head(mtcars) # ← REPLACE: head(your_data) to preview your own dataset
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Each row is one car model. Each column is one variable. The row names on the far left (Mazda RX4, Hornet 4 Drive, …) are unique identifiers: every row is a different car. The mpg column (miles per gallon) and hp column (horsepower) are quantitative, so you can compute averages and compare magnitudes. The cyl column (number of cylinders: 4, 6, or 8) and am column (0 = automatic, 1 = manual) are treated as categorical here: they sort cars into groups rather than measure amounts. A car with 8 cylinders does not have “twice as much cylinder” as a car with 4.
# Inspect the structure: variable names, types, and first few values
str(mtcars) # ← REPLACE: str(your_data) to check your own dataset's structure
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Look at the output. R tells you that every column in mtcars is stored as a number (num). But that does not mean every column is quantitative. The mpg and hp columns are genuinely quantitative — they measure fuel efficiency and engine power. But cyl (4, 6, or 8 cylinders) and am (0 = automatic, 1 = manual) are categorical variables stored as numbers. R does not know the difference unless you tell it.
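One way to tell R the difference is to convert those columns to factors. The sketch below works on a copy so the built-in dataset stays unchanged; the name mtcars_labeled is a hypothetical choice made here, and the 0 = automatic, 1 = manual coding is the one stated above.

```r
# Work on a copy so the built-in mtcars stays untouched
mtcars_labeled <- mtcars

# Mark am and cyl as categorical by converting them to factors
mtcars_labeled$am  <- factor(mtcars_labeled$am,
                             levels = c(0, 1),
                             labels = c("automatic", "manual"))
mtcars_labeled$cyl <- factor(mtcars_labeled$cyl)

# str() now reports Factor for cyl and am, and num for mpg
str(mtcars_labeled[, c("mpg", "cyl", "am")])

# summary() adapts to the type: quartiles for mpg, counts for am
summary(mtcars_labeled$mpg)
summary(mtcars_labeled$am)
```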
What if you find a dataset online with no code book? You are not stuck, but you must be careful. Use clues: a variable with only a few unique values (0/1, 1–5) is probably categorical; one with many unique decimal values is probably quantitative. The variable name often gives it away — age or income are quantitative, gender or race are categorical. But guessing is risky. If a dataset has no documentation, the safest choice is to look for a different one that does.
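The "few unique values" clue can be checked quickly. This sketch counts the distinct values in each mtcars column. The cutoff of 5 is an arbitrary assumption chosen for illustration, not a rule, and the result is only a list of candidates to verify against documentation.

```r
# Count how many distinct values each column contains
n_unique <- sapply(mtcars, function(x) length(unique(x)))
sort(n_unique)

# Columns with very few distinct values are candidates for categorical.
# The threshold of 5 is an illustrative assumption, not a standard.
names(n_unique)[n_unique <= 5]
```

A low count is a hint, not proof. A column that counts something (such as carb, the number of carburetors) can also have few distinct values, so borderline cases still need the documentation.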
And that brings us to an important trap.
1.2.3 Dummy Coding: When Categories Disguise Themselves as Numbers
Sometimes, categorical variables are stored as numbers. This is called dummy coding.
Imagine a survey that asks “Do you smoke?” and records the answer as:
| Code | Meaning |
|---|---|
| 0 | No |
| 1 | Yes |
The variable Smoking is stored as 0 or 1, but these numbers are labels, not quantities. Smoking = 1 does not mean “one unit of smoking.” It means “yes, this person smokes.” You cannot meaningfully compute the “average smoking value” and interpret it as a quantity — an average of 0.3 means 30% of people said yes, not that each person smokes “0.3 worth.”
This trips up many beginners. You load a dataset, see a column full of 0s and 1s, and assume it is quantitative. Always check the code book to find out what the numbers mean.
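The point about the "average smoking value" can be verified with a few lines. The smoking vector below is hypothetical, built so that 3 of 10 people answered yes.

```r
# Hypothetical responses to "Do you smoke?" (0 = No, 1 = Yes)
smoking <- c(0, 1, 0, 0, 1, 0, 0, 1, 0, 0)

# The mean of a 0/1 variable is the proportion of 1s, not an amount of smoking
mean(smoking)

# The same information, read as a categorical variable
table(smoking)              # counts of No (0) and Yes (1)
prop.table(table(smoking))  # proportions of No (0) and Yes (1)
```

The mean and the proportion of 1s agree because adding up 0s and 1s simply counts the 1s.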
But why do researchers use numbers instead of words? Because the math behind statistical methods works on numbers. A regression, for example, multiplies each variable by a coefficient, and the word “No” cannot be multiplied by anything. Dummy coding translates human-readable categories into a form the computer can calculate with. The code book is your dictionary for translating back.
Here is how you might create and label a dummy-coded variable in R:
# Example: 8 survey responses about pet ownership (0 = No, 1 = Yes)
pet_numeric <- c(0, 1, 1, 0, 0, 1, 0, 1) # ← REPLACE: your coded variable
# Convert numeric codes into labeled categories (a "factor" in R)
pet <- factor(pet_numeric,              # the coded variable
              levels = c(0, 1),         # the numeric codes
              labels = c("No", "Yes"))  # what those codes mean
# See the labeled result
pet # R now treats this as categorical, not numeric
## [1] No Yes Yes No No Yes No Yes
## Levels: No Yes
Now R knows that pet is categorical. When you make a bar chart or run a chi-square test later, R will treat it correctly.
# ETHRACE2A stores race as numeric codes — need the code book to interpret
# 1 = White, 2 = Black, 3 = Native American, 4 = Asian, 5 = Hispanic
table(nesarc$ETHRACE2A[1:100]) # ← REPLACE: table(your_data$your_variable)
##
##  1  2  4  5
## 24 26  2 48
Look at the output. You see the numbers 1, 2, 4, and 5 (code 3 happens not to appear among these first 100 rows). Without the code book, you would have no idea what these mean. Are they rankings? Quantities? No, they are categories. The code book tells you that 1 is White, 2 is Black, 3 is Native American, 4 is Asian, and 5 is Hispanic. This is dummy coding on a larger scale, and you will see it everywhere in real datasets.
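Once you have the code book's meanings, the same factor() approach used for the pet example extends to more than two categories. The sketch below uses a short hypothetical vector, race_code, as a stand-in for the real column so it runs without the NESARC file; the labels are the ones listed in the comment above.

```r
# Hypothetical stand-in for a column coded like ETHRACE2A
race_code <- c(1, 5, 2, 5, 1, 4, 2, 5)

# Attach the code book's labels to the numeric codes
race <- factor(race_code,
               levels = 1:5,
               labels = c("White", "Black", "Native American",
                          "Asian", "Hispanic"))

# The table now shows names instead of bare numbers, and a category
# with no cases in this small sample still appears with a count of 0
table(race)
```

Listing all five levels explicitly keeps a category visible even when nobody in the sample falls into it.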
1.2.4 Unique Identifiers: Why Every Dataset Needs One
A unique identifier is a variable whose value is different for every row. Student ID numbers, passport numbers, serial numbers on products — these are all unique identifiers.
Why does this matter? Because real data has errors. Two students might share the same name. Two patients might be the same age and gender. Without a unique identifier, you cannot tell them apart. If you ever need to merge two datasets, fix a specific row, or trace a value back to its source, you need a column that is guaranteed to be different for every observation.
In the mtcars dataset, the row names (car model names) are the unique identifiers. In the NESARC dataset you loaded above, each respondent has a unique case identification number stored in the IDNUM column.
Rule of thumb: Every dataset you work with should have a unique identifier column. If it does not, create one before doing anything else.
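A quick way to apply this rule of thumb is sketched below. The patients data frame is hypothetical, built so that two people share a name and an age; anyDuplicated() returns 0 when every value is distinct and the position of the first repeat otherwise.

```r
# Hypothetical patient list where two different people look identical
patients <- data.frame(
  name = c("Ana", "Ana", "Ben"),
  age  = c(34, 34, 51)
)

# A non-zero result means name cannot serve as a unique identifier
anyDuplicated(patients$name)

# Create an identifier: one distinct number per row
patients$id <- seq_len(nrow(patients))

# 0 means no duplicates, so id is safe to use as the identifier
anyDuplicated(patients$id)

# The same check applied to the mtcars row names
anyDuplicated(rownames(mtcars))
```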
1.3 Deep Dive: Code Books
1.3.1 What Is a Code Book?
A code book (sometimes called a data dictionary) is a document that comes with a dataset and explains everything about it:
- What the study was. Who was surveyed? When? How were people selected?
- What each variable means. What does S3AQ3B1 stand for? What units is it measured in?
- What the values mean. If a variable is coded as 1, 2, 9, does 1 mean “Yes” or “No”? Does 9 mean “Missing”?
- How many people responded each way. Many code books include frequency counts so you can see how common each answer is.
Think of the code book as the instruction manual for the data. Would you try to assemble a piece of furniture without reading the instructions? The same principle applies here. Data without a code book is, at best, a puzzle — and at worst, completely uninterpretable.
In real life, code books come in different forms: a PDF file packaged alongside the data download, a webpage on a research organization’s site, or documentation tabs inside an Excel spreadsheet. For the NESARC dataset used in this book, the code book is provided as a PDF. When you download data from sites like ICPSR or Kaggle, look for files named codebook.pdf or documentation.pdf.
1.3.2 Why You Must Read the Code Book First
There are two big reasons to start with the code book:
It helps you generate research questions. When you browse through a code book and see all the variables that were collected, you start asking questions: “I wonder if people who smoke also drink more?” or “Is there a link between education level and income?” The code book is where your project begins.
It prevents you from making mistakes. If you do not know that 9 means “Missing” and not a real value of 9, you will compute wrong averages and draw wrong conclusions. If you do not know that a 0/1 variable is a dummy code, you will treat it as quantitative when it is not.
For example, in the NESARC dataset, the variable ETHRACE2A stores race as numbers 1 through 5. Without the code book, you wouldn’t know that 5 means Hispanic. The code book tells you what every number represents.
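The missing-value mistake described above is easy to reproduce. In this sketch, ratings is a hypothetical 1-to-5 survey item where the (equally hypothetical) code book says 9 means “Missing.” It is not a NESARC variable.

```r
# Hypothetical 1-5 rating where the code book says 9 = Missing
ratings <- c(3, 4, 9, 2, 5, 9, 4)

# Wrong: the 9s are treated as real answers, so the average
# lands above 5, which is impossible on a 1-5 scale
mean(ratings)

# Right: recode 9 to NA (R's marker for missing) before summarizing
clean <- ratings
clean[clean == 9] <- NA
mean(clean, na.rm = TRUE)

# Confirm how many values were marked as missing
sum(is.na(clean))
```

An average outside the possible range is a useful warning sign that a missing-value code slipped through.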
The code book is not optional. It is the first thing you read, before you write a single line of R code.
1.4 Deep Dive: The NESARC Dataset
1.4.1 Meet Your Primary Dataset
Throughout this book, you will work with a dataset called NESARC — the National Epidemiological Survey on Alcohol and Related Conditions.
NESARC is a large survey conducted by the U.S. government. It collected detailed information from over 43,000 adults across the United States. The survey covers a wide range of topics:
- Alcohol use and dependence
- Tobacco use
- Mental health (depression, anxiety, personality disorders)
- Demographics (age, sex, ethnicity, income, education)
- Family history and social environment
The NESARC dataset (NESARC.csv) is included with this book. You’ll find it alongside the chapter files. You already loaded it in Section 1.2 of this chapter.
NESARC is a representative sample of the non-institutionalized U.S. adult population (18 years and older). That means the 43,000+ respondents were chosen in a way that allows researchers to generalize findings to the entire U.S. adult population — the same idea we discussed in the Preface when we talked about samples and populations.
1.4.2 Why NESARC?
You will use NESARC from Chapter 2 all the way through Chapter 11. Here is why it is a good choice for learning statistics:
- It is real. This is not a toy dataset invented for a textbook. Real researchers use NESARC to publish peer-reviewed studies.
- It is large. With over 43,000 rows and about 3,000 variables, there is enough data to explore meaningful patterns.
- It is varied. The variables span health, behavior, demographics, and psychology — so you can find a research topic that genuinely interests you.
- It is well-documented. The NESARC code book is thorough, which makes it an excellent tool for learning how to read and use code books.
In Chapter 3 (Data Management), you will learn how to select the variables you care about, recode missing values, and prepare the data for analysis. Every chapter after that builds on the dataset you prepare.
For now, your job is simpler: browse the code book and pick a research question.
1.5 Exercise: Choose Your Research Question
Your first real task as a statistician is to find a question worth asking. Here is how to do it:
- Open the NESARC code book.
- Browse the variables. Skim the table of contents or the variable list. Look for topics that interest you — mental health, substance use, demographics, family background.
- Pick one topic as your primary focus. Maybe it is depression. Maybe it is alcohol consumption. Maybe it is income inequality.
- Find a second, related topic. This will be your explanatory variable — the thing you think might be associated with your primary topic.
- Write down your research question in one sentence.
Here are some example research questions to inspire you:
- Is there an association between the number of cigarettes a person smokes per day and their level of nicotine dependence?
- Are adults with higher education levels less likely to be diagnosed with depression?
- Is there a relationship between a person’s age and the frequency of their alcohol consumption?
- Do income levels differ between ethnic groups in the NESARC sample?
Notice the pattern: each question asks whether two variables are associated (related, connected). That is the simplest form of a statistical research question, and it is exactly the kind of question you will learn to answer in this book.
Your research question does not need to be perfect. You can revise it later as you learn more about the data and about statistics. The important thing is to start with something you genuinely want to know.
Write your research question in the space below (or in a notebook, or in a text file on your computer). You will come back to it in every chapter.
Before you commit to your question, do a quick reality check. In the Console, run a frequency table on your chosen variables to see how many people actually answered:
# Check your first variable — how many responses in each category?
table(nesarc$ETHRACE2A) # ← REPLACE: your first variable
##
##     1     2     3     4     5
## 24507  8245   701  1332  8308
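# Quick overview: the range of the codes, plus an NA count if any values are missing
# (ETHRACE2A is categorical, so the Mean and Median below are not meaningful)
summary(nesarc$ETHRACE2A) # ← REPLACE: your variable
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   1.000   2.088   2.000   5.000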
If you discover that most responses are missing, or that one category contains only 5 people out of 43,000, you may need to choose a different variable or broaden your question. A good research question needs enough data to answer it. Do not worry if your first choice does not work out — researchers revise their questions all the time.
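The two warning signs above (many missing values, or a very small category) can be checked with a few extra lines. The vector x is a small hypothetical stand-in so the sketch runs on its own; with the real data you would substitute something like nesarc$your_variable.

```r
# Hypothetical stand-in for one coded survey variable (NA = missing)
x <- c(1, 1, 2, NA, 1, 3, 2, NA, 1, 1)

# Frequency table that also shows missing values as their own column
table(x, useNA = "ifany")

# Share of responses that are missing
mean(is.na(x))

# Size of the smallest non-missing category
min(table(x))
```

By default table() silently drops missing values, so useNA = "ifany" is what makes them visible.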
1.5.1 Common Mistakes Students Make
Treating dummy-coded variables as quantitative. If a variable is coded 0/1 and represents “No/Yes,” it is categorical — even though it looks like numbers. Always check the code book.
Skipping the code book. Jumping straight into R without understanding what the variables mean is the fastest way to produce nonsense results.
Confusing a unique identifier with a meaningful variable. A column like ID with values 1, 2, 3, … is not a quantitative variable. You cannot compute a “mean ID” that means anything.
Picking a research question that is too vague. “I want to study health” is not specific enough. “Is daily alcohol consumption associated with depression diagnosis?” is a question you can actually analyze.
1.5.2 What Comes Next
In this chapter, you learned what data looks like (rows and columns), the two types of variables (quantitative and categorical), the meaning of dummy codes and unique identifiers, and the importance of code books. You also chose a research question that will drive your project.
In Chapter 2 (Univariate Analysis), you will learn how to explore a single variable — how to summarize it, visualize it, and describe its distribution. You will start with the variable at the heart of your research question.