Understanding regression (Part 1)

12 min read

statistics tutorial regression

Statistical regression techniques are an important concept in data analysis. As a social scientist, I use it to test hypotheses by comparing differences between groups or testing relationships between variables. While it is easy to run regression analyses in a variety of software packages, like SPSS or R, it often remains a black box that is not well understood. I, in fact, do not believe I actually understand regression. This is Part 1 of a series of blog posts called ‘Understanding Regression’ in which I try to figure it all out.

Feel free to follow me along by copy-pasting the code from each step. I’ll add a note in case you need to slightly adjust the code for it to work on your computer.

Data

To figure out regression, we need data. We could make up some data on the spot, but I’d rather use data that is a bit more meaningful (to me, anyway). Since I’m a big Pokémon fan, I’ll use a data set containing Pokémon statistics.

In case you’re following along, start by loading some packages and reading in the data. In the code section below I use the here package to read in the data, but I recommend that you simply specify the path to the file. After that, I subset the data to make the data a bit more manageable and do some more setup. I define some colors that I use in the plots on this page and I define a custom mode function because R does not have one (and I need it later).

# Load packages
library(tidyverse)
library(here)

# Read in Pokémon data
pokemon <- read_csv(here("static", "data", "pokemon.csv"))

# Create a subset with only the first 25 Pokémon
pokemon25 <- filter(pokemon, pokedex <= 25)

# Create color variables for the plots
background_color <- "#1a1a1a"
green <- "#00B88D"
yellow <- "#f39c12"
off_white <- "#cccccc"

# Load a custom function to calculate the mode
mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Let’s take a look at several attributes of some Pokémon to see what they’re about:

pokemon25 %>%
  filter(pokedex <= 10) %>%
  select(name, type_primary, type_secondary, height, weight, 
    evolution) %>%
  kable(digits = 2)
name type_primary type_secondary height weight evolution
Bulbasaur Grass Poison 0.7 6.9 0
Ivysaur Grass Poison 1.0 13.0 1
Venusaur Grass Poison 2.0 100.0 2
Charmander Fire 0.6 8.5 0
Charmeleon Fire 1.1 19.0 1
Charizard Fire Flying 1.7 90.5 2
Squirtle Water 0.5 9.0 0
Wartortle Water 1.0 22.5 1
Blastoise Water 1.6 85.5 2
Caterpie Bug 0.3 2.9 0

Pokémon have different types (e.g., grass, fire, water), a height, some weight, and they are of a particular evolutionary stage (0, 1, or 2). This last variable refers to a Pokémon’s ability to evolve and when they do, they tend to become bigger and more powerful.

Let’s say that we are interested in understanding the weight of different Pokémon. Below I have plotted the weight of the first 25 Pokémon, from Bulbasaur to Pikachu.

ggplot(pokemon25, aes(x = reorder(name, pokedex), y = weight)) +
  geom_bar(stat = "identity", fill = green) +
  labs(x = "", y = "Weight (kg)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We see that the lightest Pokémon is Pidgey, with a weight of 1.8 kg. The heaviest Pokémon is Venusaur, with a weight of 100 kg. The average weight is 26.14 kg.

The simplest model

In order to understand the weights of different Pokémon, we need to come up with a statistical model. In a way, this can be considered a description problem. How can we best describe the different weights that we have observed? The simplest description is a single number. We can say that all Pokémon have a weight of say… 6 kg. In other words:

weight = 6 kg

Of course, this is just one among many possible models. Below I plot three different models, including our weight = 6 kg model.

ggplot(pokemon25, aes(x = reorder(name, pokedex), y = weight)) +
  geom_bar(stat = "identity", fill = green) +
  labs(x = "", y = "Weight (kg)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_abline(intercept = 6, slope = 0, size = 0.75, linetype = 2, 
    color = yellow) +
  geom_abline(intercept = 40, slope = 0, size = 0.75, linetype = 2, 
    color = yellow) +
  geom_abline(intercept = 75, slope = 0, size = 0.75, linetype = 2, 
    color = yellow) +
  annotate("text", x = 28, y = 6.5, label = "weight = 6 kg", size = 3.5, 
    color = off_white) +
  annotate("text", x = 28.2, y = 40.5, label = "weight = 40 kg", size = 3.5, 
    color = off_white) +
  annotate("text", x = 28.2, y = 75.5, label = "weight = 75 kg", size = 3.5, 
    color = off_white) +
  coord_cartesian(xlim = c(1, 25), clip = 'off') +
  theme(plot.margin = unit(c(1, 6, 1, 1), "lines"))

While a model like weight = 6 kg is a valid model, it is not a very good model. In fact, it only perfectly describes Pikachu’s weight and inaccurately describes the weight of the remaining 24 Pokémon. The other models, such as weight = 40 kg might be even worse; they do not even describe a single Pokémon’s weight correctly, although they do get closer to some of the heavier Pokémon. How do we decide which model is the better model? In order to answer that question, we need to consider the model’s error.

Error

The error of a model is the degree to which the model inaccurately describes the data. There are several ways to calculate that error and we will cover three different definitions.

The first definition of error is simply the sum of times that the model inaccurately describes the data. For each observation we check whether the model correctly describes it or not. We then sum the number of misses and consider that the amount of error for that model. With our weight = 6 kg the answer is 24; out of the 25 Pokémon only Pikachu has a weight of 6, which means the model is correct once and wrong 24 times.

We can now compare different models to one another by calculating the error for a range of models. Below I plot the number of errors for 100 different models, starting with the model weight = 1 kg, up to weight = 10 kg, in steps of 0.1. Ideally we would test more models (up to the heaviest Pokémon we know of), but for the sake of visualizing the result, I decided to only plot a small subset of models.

expand_grid(
    model = seq(from = 1, to = 10, by = 0.1),
    weight = pull(pokemon25, weight)
  ) %>%
  mutate(error = if_else(abs(weight - model) == 0, 0, 1)) %>%
  group_by(model) %>%
  summarize(error_sum = sum(error)) %>%
  ggplot(aes(x = model, y = error_sum)) +
    geom_line(color = green, size = 1.25) + 
    coord_cartesian(ylim = c(0, 25)) +
    scale_x_continuous(breaks = 1:10) +
    labs(x = "Model (weight = x kg)", y = "Error (sum of errors)")

We see that almost all models perform poorly. The errors range from 23 to 25. Most models seem to have an error of 25, which means they do not accurately describe any of the 25 Pokémon. Some have an error of 24, meaning they describe the weight of 1 Pokémon correctly. There is 1 model with an error of 23: weight = 6.9 kg. Apparently there are 2 Pokémon with a weight of 6.9, which means that this model outperforms the others.

Despite there being a single model that outperforms the others in this set of models, it’s still a pretty poor model. After all, it is wrong 23 out of 25 times. Perhaps there are some models that outperform this model, but it’s unlikely. That’s because we’re defining error here in a very crude way. The model needs to exactly match the weight of the Pokémon, or else it counts as an error. Saying a weight is 6 kg, while it is in fact 10 kg, is as wrong as saying the weight is 60 kg.

Instead of defining error in this way, we can redefine it in a way that takes into account the degree of error. We can define error as the difference between the actual data point and the model’s value. So, in the case of our weight = 6 kg model, an actual weight of 10 kg would have an error of 10 - 6 = 4. This definition of error is often referred to as the residual.

Below I plot the residuals of the first 25 Pokémon for our weight = 6 kg model.

ggplot(pokemon25, aes(x = reorder(name, pokedex), y = weight)) +
  geom_bar(stat = "identity", fill = green) +
  geom_segment(aes(xend = pokedex, y = 6, yend = weight), linetype = 2, 
    size = 0.75, color = yellow) +
  geom_point(color = yellow) +
  geom_abline(intercept = 6, slope = 0, size = 0.9, color = yellow) +
  labs(x = "", y = "Weight (kg)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We can add up all of the (absolute) residuals to determine the model’s error. Just like with the binary definition of error, we can then compare multiple models. This is what you see in the graph below. For each model, this time ranging from weight = 1 kg to weight = 100 kg, the absolute residuals were calculated and added together.

expand_grid(
    model = seq(from = 1, to = 100, by = 0.1),
    weight = pull(pokemon25, weight)
  ) %>%
  mutate(error = abs(weight - model)) %>%
  group_by(model) %>%
  summarize(error_sum = sum(error)) %>%
  ggplot(aes(x = model, y = error_sum)) +
    geom_line(color = green, size = 1.25) +
    scale_y_continuous(breaks = seq(from = 500, to = 1900, by = 200)) +
    labs(x = "Model (weight = x kg)", y = "Error (sum of residuals)")

This graph looks very different compared to the graph where we calculated the error defined as the sum of misses. Now we see that some kind of minimum appears. Unlike the binary definition of error, it now looks like there are fewer best models. More importantly, though, we have defined error in a less crude manner, meaning that the better models indeed capture the data much better than before.

But we might still not be entirely happy with this new definition of error either. Calculating the sum of absolute residuals for each model comes with another conceptual problem.

When you sum the number of absolute errors, four errors of 1 are equal to a single error of 4. In other words, you could have a model that is slightly off multiple times or one that might make fewer, but larger, errors. Both would be counted as equally wrong. What do we think of that? Conceptually speaking, we might find it more problematic when a model is very wrong than when the model is slightly off multiple times. If we think that, we need another definition of error.

To address this issue, we can square the residuals before adding them together. That way, larger errors become relatively larger compared to smaller errors. Using our previous example, summing four residuals of 1 remains 4, but a single residual of 4 becomes 4 * 4 = 16. The model now gets punished more severely for making large mistakes.

Using this new definition of error, we again plot the error for each model, from 1 to 100.

expand_grid(
    model = seq(from = 1, to = 100, by = 0.1),
    weight = pull(pokemon25, weight)
  ) %>%
  mutate(error = abs(weight - model)^2) %>%
  group_by(model) %>%
  summarize(error_sum = sum(error)) %>%
  ggplot(aes(x = model, y = error_sum)) +
    geom_line(color = green, size = 1.25) +
    geom_vline(xintercept = mean(pull(pokemon25, weight)), linetype = 2, 
      color = yellow) +
    labs(x = "Model", y = "Error (sum of squared residuals)")

We see a smooth curve, with a clear minimum indicated by the vertical dashed line. This vertical line indicates the model that best describes the data. What is the value of the best model exactly? In this case, the answer is 26.14. And it turns out, there is an easy way to determine this value.

Estimating the model from the data

Rather than setting a specific value and seeing how it fits the data, we can also use the data to determine the value that best fits the data. In the previous graph we saw that the best fitting model is one where the weight is equal to 26.14. This value turns out to be the mean of the different weights we have observed in our sample. Had we defined error as simply the sum of absolute residuals, this would be a different value. In fact, the best fitting value would then be equal to 13, or the median. And had we used the binary definition of error, the best fitting value would be the mode, which in our case is: 6.9. The table below shows an overview of which technique can be used to find the best fitting value, depending on the error definition.

Error definition Estimation technique
sum of errors mode
sum of absolute residuals median
sum of squared residuals mean

We can now update our model to refer to the estimation technique, rather than a fixed value. Given that the third definition of error seems to be most suitable, both pragmatically and conceptually, we’ll use the mean:

weight = mean(weight)

This is also the value you get when you perform a regression analysis in R. Using R’s lm function, we can run our model with the following code:

lm(weight ~ 1, data = pokemon25)

By regressing weight onto 1 we are telling R to run an intercept-only model. This means that R will estimate which line will best fit all the values in the outcome variable, just like we have done ourselves earlier by testing different models such as weight = 6 kg.

R spits out the following result:

## 
## Call:
## lm(formula = weight ~ 1, data = pokemon25)
## 
## Coefficients:
## (Intercept)  
##        26.1

The result is an intercept value of 26.14, which matches the mean of the weights.

So, we now know where the intercept comes from when we run an intercept-only model. It is the mean of the data we are trying to model. Note that it is only the mean because we defined the error as the sum of squared residuals. Had we defined the error differently, such as the sum of absolute residuals or the sum of errors, the intercept would be the median or mode of the data instead. Why did we use the sum of squared residuals? We had a conceptual reason of wanting to punish larger residuals relatively more than several smaller errors. It turns out there is another reason to favor squared residuals, which has to do with a nice property of the mean vs. the median. This will be covered in Part 2 of ‘Understanding Regression’.

This post was last updated on 2021-10-08.