Understanding Regression (Part 1): Getting Started
This is the first post in a series on understanding regression. It focuses on the main question we should be asking when we use regression.
Introduction
I’m a behavioral scientist with several scientific publications to my name. These are mostly publications in the field of the social sciences, where statistical tests are commonly used to answer research questions. Regression is one of the most common tools in that field, and I’ve relied on it throughout my own research. Yet, despite all that experience, I don’t feel like I really understand regression. I know how to use it in that I can run models and report the results, but there are times when I’m just running code because I’ve been told that’s how you do it. When things get complicated, and sometimes not even that complicated, I find myself relying almost entirely on conventions rather than on my own understanding.
You might feel the same way. Regression is often taught as a black box: you run the code or click the buttons and copy the output. The focus in many courses is on performing statistics rather than on building a conceptual understanding. I’ve also noticed that statistical teaching often takes a mechanical approach. You’re given formulas to memorize or asked to calculate statistics by hand. I can see the appeal of this approach, since working through the steps yourself can build intuition, but formulas rarely help me understand something, and calculating things by hand only takes you so far.
What I need, and what I suspect many others need, is an approach that builds a conceptual understanding of regression, one that makes it make sense. That’s what this series is about.
An example
To build this conceptual understanding, I’ll work through examples using real data. I’ll use data from Richard McElreath’s excellent book Statistical Rethinking. The data is a partial census of the !Kung San people, compiled from interviews conducted by Nancy Howell in the late 1960s. We’ll focus on heights of adults (18 years or older).
Here’s what the first few rows of the data look like:
# A tibble: 6 × 4
  height weight   age  male
   <dbl>  <dbl> <dbl> <dbl>
1   152.   47.8    63     1
2   140.   36.5    63     0
3   137.   31.9    65     0
4   157.   53.0    41     1
5   145.   41.3    51     0
6   164.   63.0    35     1
And here’s a histogram of all the heights:
Let’s run a regression right away. The output gives us several numbers, and understanding each of them will be the focus of the posts that follow.
The simplest model we can run is one in which we regress heights onto… nothing; this is called an intercept-only model.
model <- lm(height ~ 1, data = data)
We use summary() to get the numbers we need.
Call:
lm(formula = height ~ 1, data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-18.0721  -6.0071  -0.2921   6.0579  24.4729
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.5971     0.4127   374.6   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.742 on 351 degrees of freedom
As you can see in the output, we get an estimate of 154.6 cm, a standard error of 0.41 cm, a t-value of 374.6, and a p-value smaller than 2e-16 (so small that R only reports an upper bound).
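These numbers are not independent of each other. As a quick arithmetic check, the sketch below uses only values copied from the summary() output, plus two standard facts about an intercept-only model: it has n − 1 residual degrees of freedom, and its standard error is the residual standard error divided by √n. The variable names are made up for illustration.

```r
# Values copied from the summary() output above
estimate  <- 154.5971  # intercept estimate (cm)
std_error <- 0.4127    # standard error of the intercept (cm)
resid_se  <- 7.742     # residual standard error (cm)
resid_df  <- 351       # residual degrees of freedom

# An intercept-only model estimates one parameter, so n = df + 1
n <- resid_df + 1

# The t-value is the estimate divided by its standard error
t_value <- estimate / std_error

# For an intercept-only model, the standard error is the residual
# standard error divided by the square root of the sample size
se_by_hand <- resid_se / sqrt(n)

round(c(n = n, t_value = t_value, se_by_hand = se_by_hand), 4)
```

Up to rounding, both results should agree with the t-value and standard error that R printed.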
But what do these numbers mean? Why do we actually want these numbers?
What regression is for
When we run a regression model, we’re typically interested in one or more of the following goals:
- Estimation: What are the model parameters (like the mean) and how uncertain are we about them?
- Testing: Is a parameter compatible with some reference value (like zero)?
- Prediction: What would we expect to observe in new data?
In the social science literature I’m familiar with, testing is by far the most common goal. Researchers want to know whether an effect is “significant”, which means running a model and checking whether the estimate is compatible with zero. Estimation is also common, with researchers reporting coefficients, standard errors, and confidence intervals. Prediction is less popular in my field (authors rarely report prediction intervals), but I think it deserves more attention than it typically gets.
To understand regression, I think we need a way of thinking that makes it clear how regression serves each of those goals. I think it comes down to the following question:
What distribution might have generated this data?
Instead of starting with a procedure — run a model, get an estimate — you start by imagining how the data came to be. Some process generated these numbers. That process had some underlying shape: values that were more likely, values that were less likely. A distribution is a way of describing that shape formally. You may already be familiar with some: the normal distribution, the Poisson distribution, or perhaps the binomial distribution.
For our height data, we can propose that heights follow a normal distribution. This is a bell-shaped curve defined by two parameters: μ (the mean) and σ (the standard deviation).
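To make the idea of a data-generating distribution concrete, here is a minimal sketch that draws simulated heights from a normal distribution. The values for μ and σ are assumptions borrowed from the lm() output above (the intercept and the residual standard error, rounded). The draws are simulated, not the real data, and the variable names are made up for illustration.

```r
set.seed(1)  # make the simulated draws reproducible

mu    <- 154.6  # assumed mean, borrowed from the intercept estimate
sigma <- 7.74   # assumed standard deviation, borrowed from the residual standard error
n     <- 352    # same number of observations as the model above (351 df + 1)

# Draw n simulated heights from a normal distribution with these parameters
simulated_heights <- rnorm(n, mean = mu, sd = sigma)

# Summaries of the simulated sample, to compare with the real heights
summary(simulated_heights)
sd(simulated_heights)
```

If the proposed distribution is reasonable, simulated samples like this one should look broadly similar to the observed heights.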
These two parameters give us everything we need for estimation, testing, and prediction:
- For estimation: μ tells us the typical height, and we can quantify how uncertain we are about that estimate
- For testing: Once we know the estimate and its uncertainty, we can ask whether it’s compatible with a specific value (like zero)
- For prediction: Both μ and σ together describe the full distribution, which we can use to predict what heights we’d expect in new individuals
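The three uses listed above can be sketched in a few lines of R. This is only a rough illustration: the data is simulated as a stand-in for the real heights (the values passed to rnorm() are assumptions), and the prediction interval is computed by hand with the textbook formula for a normal model rather than through any particular helper.

```r
set.seed(1)
# Simulated stand-in for the real heights (assumed values, not the actual data)
data <- data.frame(height = rnorm(352, mean = 154.6, sd = 7.74))

model <- lm(height ~ 1, data = data)

# Estimation: the estimate of mu and a 95% confidence interval around it
mu_hat <- unname(coef(model))
ci     <- confint(model, level = 0.95)

# Testing: t-value and p-value for the null hypothesis that mu equals zero
test <- summary(model)$coefficients["(Intercept)", c("t value", "Pr(>|t|)")]

# Prediction: a 95% interval for the height of one new individual,
# combining sigma with the uncertainty about mu
sigma_hat <- sigma(model)
n         <- nobs(model)
margin    <- qt(0.975, df = n - 1) * sigma_hat * sqrt(1 + 1 / n)
pred_int  <- c(lower = mu_hat - margin, upper = mu_hat + margin)

mu_hat
ci
test
pred_int
```

The prediction interval is much wider than the confidence interval: the confidence interval describes uncertainty about μ, while the prediction interval also has to cover the spread of individual heights described by σ.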
Here are the heights again, but now with a normal distribution laid on top of the histogram.
The dashed line is the normal distribution we’re proposing heights are drawn from. As we’ll see, lm() estimates exactly this distribution. Its two parameters, μ and σ, are what give us the estimates, uncertainty, and predictions we’re after.
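A plot like this can be sketched in base R as follows. The heights here are simulated as a stand-in for the real data (an assumption), and the plotting choices, such as the number of breaks and the labels, are illustrative rather than the ones used for the figure in this post.

```r
set.seed(1)
# Simulated stand-in for the real heights (assumed values, not the actual data)
data <- data.frame(height = rnorm(352, mean = 154.6, sd = 7.74))

model <- lm(height ~ 1, data = data)

mu_hat    <- unname(coef(model))  # estimate of mu: equals the sample mean
sigma_hat <- sigma(model)         # estimate of sigma: the residual standard error

# Histogram on the density scale so the bars and the curve are comparable
hist(data$height, freq = FALSE, breaks = 20,
     main = "Heights with fitted normal distribution", xlab = "Height (cm)")

# Dashed normal curve using the two estimated parameters
curve(dnorm(x, mean = mu_hat, sd = sigma_hat), add = TRUE, lty = 2)
```

For an intercept-only model, the two estimates coincide with mean(data$height) and sd(data$height), which is one way to see that lm() is fitting a normal distribution to the heights.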
What’s ahead
In this series, we’re going to build up an understanding of regression from this foundation. The core question will always be: What distribution might have generated this data?
Here are the main steps we’ll tackle:
- Choose a distribution: What distribution might describe the data?
- Estimate parameters: How do we estimate its parameters from our data?
- Quantify uncertainty: How certain can we be about those estimates?
- Test hypotheses: Are the estimates compatible with specific values?
- Make predictions: What would we expect to observe in new data?
- Add predictors: What happens when the distribution’s parameters depend on other variables (like height depending on sex or age)?
Each post in the series will tackle one or more of these steps, building up our understanding gradually.
Summary
This post introduces the core perspective that will hopefully make regression make sense: regression is about choosing and fitting distributions to data.
When you run a regression model, you’re proposing that your data follows a specific distribution and you’re estimating that distribution’s parameters. Those parameters let you estimate typical values, quantify uncertainty, test hypotheses, and predict new observations.
In the next post, we’ll dig deeper into the normal distribution and why it’s often a sensible default.