Understanding regression (part 1)

This is Part 1 of a series of blog posts on how to understand regression. The goal is to develop an intuitive understanding of the different components of regression. In this first post, we figure out where the estimate of an intercept-only regression model comes from.


Statistical regression techniques are an important tool in data analysis. As a behavioral scientist, I use them to test hypotheses by comparing groups of people or testing relationships between variables. While it is easy to run regression analyses in popular software tools like SPSS or R, regression often remains a black box that is not well understood. In fact, I do not believe I fully understand regression myself. Not understanding the mechanics of regression could be okay, though. After all, you don’t need to know exactly how a car engine works in order to drive a car. However, I think many users of regression have isolated themselves too much from its mechanics. This may be a source of some of the mistakes researchers make, such as including every available variable in a model without understanding what the regression is actually doing. If you’re using regression to make inferences about the world, it’s probably a good idea to know what you’re doing.

In this series of posts, I’ll work through the mechanics of regression to build a genuine intuition for how it works. This is Part 1.

Feel free to follow me along by copy-pasting the code from each step.

Setup

To figure out regression, we need data. We could make up some data on the spot, but I’d rather use data that is a bit more meaningful (to me, anyway). Since I’m a big Pokémon fan, I’ll use a data set containing Pokémon statistics.

In case you’re following along, start by loading the required packages and a custom function to calculate the mode. You can also copy the styling options I use, such as my ggplot2 defaults. You’ll also need to read in the data. After reading it in, I subset the data to make it a bit more manageable.
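R has no built-in function for the statistical mode, so a custom helper is needed. The exact function used in the setup isn’t shown in this excerpt; a minimal sketch (the name `calculate_mode` is illustrative, not necessarily the one used here) could look like this:

```r
# A minimal mode function: returns the most frequent value(s) in x.
# The name calculate_mode is illustrative; R has no built-in statistical mode.
calculate_mode <- function(x) {
  values <- unique(x)
  counts <- tabulate(match(x, values))
  values[counts == max(counts)]  # returns all modes if there are ties
}

calculate_mode(c(6.9, 13.0, 6.9, 100.0))  # returns 6.9
```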

Let’s take a look at several attributes of some Pokémon to see what kind of information the data contains.

Table 1

| Name       | Primary type | Secondary type | Height (m) | Weight (kg) | Evolution |
|------------|--------------|----------------|------------|-------------|-----------|
| Bulbasaur  | Grass        | Poison         | 0.7        | 6.9         | 0         |
| Ivysaur    | Grass        | Poison         | 1.0        | 13.0        | 1         |
| Venusaur   | Grass        | Poison         | 2.0        | 100.0       | 2         |
| Charmander | Fire         | NA             | 0.6        | 8.5         | 0         |
| Charmeleon | Fire         | NA             | 1.1        | 19.0        | 1         |
| Charizard  | Fire         | Flying         | 1.7        | 90.5        | 2         |
| Squirtle   | Water        | NA             | 0.5        | 9.0         | 0         |
| Wartortle  | Water        | NA             | 1.0        | 22.5        | 1         |
| Blastoise  | Water        | NA             | 1.6        | 85.5        | 2         |
| Caterpie   | Bug          | NA             | 0.3        | 2.9         | 0         |

Pokémon have a type (e.g., grass, fire, water), a height, a weight, and an evolutionary stage (0, 1, or 2). This last variable reflects a Pokémon’s ability to evolve; when they do, they tend to become bigger and more powerful.

Let’s say that we are interested in understanding the weight of different Pokémon. Below I have plotted the weight of the first 25 Pokémon, from Bulbasaur to Pikachu.

Figure 1: The weights of the first 25 Pokémon, from Bulbasaur to Pikachu.

We see that the lightest Pokémon is Pidgey, with a weight of 1.8 kg. The heaviest Pokémon is Venusaur, with a weight of 100 kg. The average weight is 26.144 kg.

The simplest model

In order to understand the weights of different Pokémon, we need to come up with a statistical model. In a way, this is a description problem: how can we best describe the different weights we have observed? The simplest description is a single number. We can say that all Pokémon have a weight of, say, 6 kg. In other words:

weight = 6 kg

Of course, this is just one among many possible models. Below I plot three different models, including our weight = 6 kg model.

Figure 2: Three different models, including the weight = 6 kg model.

While a model like weight = 6 kg is a valid model, it is not a very good one. In fact, it perfectly describes only Pikachu’s weight and inaccurately describes the weights of the remaining 24 Pokémon. The other models, such as weight = 40 kg, might be even worse. They do not describe even a single Pokémon’s weight correctly, although they do get closer to some of the heavier Pokémon. How do we decide which model is better? To answer that question, we need to consider the model’s error.

The error of models

The error of a model is the degree to which the model inaccurately describes the data. There are several ways to calculate that error, depending on how you define error. We will cover three of them.

The first definition of error is simply the number of times the model inaccurately describes the data. For each observation, we check whether the model describes it correctly or not. We then sum the number of misses and take that as the model’s error. For our weight = 6 kg model, the answer is 24; out of the 25 Pokémon, only Pikachu has a weight of exactly 6 kg, which means the model is correct once and wrong 24 times.
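Counted in R, this looks as follows. The weights below are a stand-in subset (the first five from Table 1 plus Pikachu’s 6 kg), not the full vector of 25 weights:

```r
# First five weights from Table 1 plus Pikachu (6 kg), standing in
# for the full vector of 25 weights.
weights <- c(6.9, 13.0, 100.0, 8.5, 19.0, 6.0)

# Binary error: count the observations the model fails to match exactly.
model_value <- 6
sum(weights != model_value)  # returns 5: only Pikachu's weight is matched
```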

We can now compare different models by calculating the error for a range of them. Below I plot the number of errors for a series of models, from weight = 1 kg up to weight = 10 kg, in steps of 0.1 kg. Ideally we would test more models (up to the heaviest Pokémon we know of), but for the sake of visualizing the result, I decided to plot only a small subset.

Figure 3: The number of errors for models from weight = 1 kg to weight = 10 kg.

We see that almost all models perform poorly. The number of errors ranges from 23 to 25, and most models have 25 errors, meaning they do not accurately describe any of the Pokémon. Some have an error of 24, meaning they describe the weight of one Pokémon correctly. There is one model with an error of 23: weight = 6.9 kg. Apparently there are two Pokémon with a weight of 6.9 kg, which means this model outperforms the others.

Despite there being a single model that outperforms the others in this set, it’s still a pretty poor model. After all, it is wrong 23 out of 25 times. Perhaps there are models outside this set that do better, but it’s unlikely, because we’re defining error here in a very crude way: the model needs to exactly match a Pokémon’s weight, or else it counts as an error. Saying a weight is 6 kg when it is in fact 10 kg is as wrong as saying it is 60 kg.

Instead of defining error in this way, we can redefine it so that it takes into account the degree of error. We can define error as the difference between the actual data point and the model’s value. So, in the case of our weight = 6 kg model, an actual weight of 10 kg would have an error of 10 - 6 = 4. This definition of error is often referred to as the residual.
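In R, residuals for a fixed model are just a vectorized subtraction. A sketch using the first five weights from Table 1 plus Pikachu’s 6 kg:

```r
# First five weights from Table 1 plus Pikachu (6 kg).
weights <- c(6.9, 13.0, 100.0, 8.5, 19.0, 6.0)

# Residuals for the weight = 6 kg model: observed minus modeled value.
weights - 6  # returns 0.9 7.0 94.0 2.5 13.0 0.0
```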

Below I plot the residuals of the first 25 Pokémon for our weight = 6 kg model.

Figure 4: The residuals of the first 25 Pokémon for the weight = 6 kg model.

We can add up all of the (absolute) residuals to determine the model’s error. Just like with the binary definition of error, we can then compare multiple models. This is what you see in the graph below. For each model, from weight = 1 kg to weight = 100 kg, we calculated the absolute residuals and added them together.

Figure 5: The sum of absolute residuals for models from weight = 1 kg to weight = 100 kg.
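The sweep behind a graph like this can be sketched in R. The weights below are a stand-in subset (first five from Table 1 plus Pikachu), not the full 25:

```r
# Stand-in subset of weights (first five from Table 1 plus Pikachu).
weights <- c(6.9, 13.0, 100.0, 8.5, 19.0, 6.0)

# Candidate models from 1 kg to 100 kg.
models <- seq(1, 100, by = 1)

# For each candidate model, sum the absolute residuals.
sum_abs <- sapply(models, function(m) sum(abs(weights - m)))

# The smallest sum identifies the best model under this definition;
# it lands at (or in a flat region around) the median of the weights.
models[which.min(sum_abs)]
```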

This graph looks very different from the graph where error was defined as the sum of misses. Now a minimum appears. Unlike with the binary definition of error, only a small range of models competes for the best fit. More importantly, we have defined error in a less crude way, meaning the better models genuinely capture the data much better than before.

But we might still not be entirely happy with this new definition of error either. Calculating the sum of absolute residuals for each model comes with another conceptual problem.

When you sum absolute errors, four errors of 1 are equal to a single error of 4. In other words, a model that is slightly off multiple times and one that makes fewer but larger errors would count as equally wrong. That seems problematic: conceptually, we might consider a model that is very wrong worse than one that is slightly off multiple times. If we think that, we need another definition of error.

To address this issue, we can square the residuals before adding them together. That way, larger errors weigh relatively more than smaller ones. Using our previous example, four residuals of 1 still sum to 4 (since 1² = 1), but a single residual of 4 becomes 4² = 16. The model is now punished more severely for making large mistakes.
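The arithmetic in this example is easy to check in R:

```r
# Four residuals of 1 versus a single residual of 4, under both definitions.
sum(abs(c(1, 1, 1, 1)))  # sum of absolute residuals: 4
abs(4)                   # also 4: equally wrong under this definition
sum(c(1, 1, 1, 1)^2)     # sum of squared residuals: still 4
4^2                      # 16: the single large error is now punished more
```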

Using this new definition of error, we again plot the error for each model, from weight = 1 kg to weight = 100 kg.

Figure 6: The sum of squared residuals for models from weight = 1 kg to weight = 100 kg, with the minimum marked by a vertical dashed line.

We see a smooth curve with a clear minimum, indicated by the vertical dashed line. This line marks the model that best describes the data. What exactly is the value of this best model? In this case, the answer is 26.144. And it turns out there is an easy way to determine this value.

The data-driven model

Rather than setting a specific value and seeing how well it fits the data, we can use the data to determine the best-fitting value. In the previous graph we saw that the best-fitting model is one where the weight equals 26.144. This value turns out to be the mean of the weights we observed in our sample. Had we defined error as the sum of absolute residuals, the best-fitting value would instead be 13, the median. And had we used the binary definition of error, it would be the mode, which in our case is 6.9.
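This correspondence is easy to verify numerically. Below is a sketch with a made-up subset of weights in which 6.9 occurs twice (mirroring the full data); the model grid is rounded to one decimal so floating-point noise doesn’t break the exact-match comparison:

```r
# Made-up subset of weights in which 6.9 occurs twice.
weights <- c(6.9, 6.9, 13.0, 100.0, 8.5, 19.0)

# Candidate models on a 0.1 kg grid; round() avoids floating-point
# mismatches in the exact-equality check below.
models <- round(seq(1, 100, by = 0.1), 1)

misses  <- sapply(models, function(m) sum(weights != m))      # binary error
abs_res <- sapply(models, function(m) sum(abs(weights - m)))  # absolute residuals
sq_res  <- sapply(models, function(m) sum((weights - m)^2))   # squared residuals

models[which.min(misses)]   # 6.9, the mode of the toy weights
models[which.min(abs_res)]  # within the flat region around the median (10.75)
models[which.min(sq_res)]   # 25.7, the grid point closest to the mean (25.716...)
```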

Note that, depending on the error definition, there is not always a unique best-fitting model. For example, there may be multiple modes. Under the binary definition of error, that would mean there are multiple equally plausible models. This is another argument against defining a model’s error in such a crude way.

The table below shows an overview of which technique can be used to find the best fitting value, depending on the error definition.

Table 2

| Error definition          | Estimation technique |
|---------------------------|----------------------|
| sum of errors             | mode                 |
| sum of absolute residuals | median               |
| sum of squared residuals  | mean                 |

We can now update our model to refer to the estimation technique, rather than a fixed value. Given that the third definition of error seems to be most suitable, both pragmatically and conceptually, we’ll use the mean:

weight = mean(weight)

This is also the value you get when you perform a regression analysis in R.

R
Call:
lm(formula = weight ~ 1, data = pokemon25)

Coefficients:
(Intercept)  
      26.14  

By regressing weight onto 1, we are telling R to run an intercept-only model. This means R will estimate the single value that best fits all the values in the outcome variable, just like we did ourselves earlier by testing models such as weight = 6 kg.

The result is an intercept of 26.14 (R rounds the printed coefficient), which matches the mean of the weights, 26.144.
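You can confirm the equivalence directly with `coef()`. The data frame below is a stand-in for pokemon25, built from the first five weights in Table 1 plus Pikachu’s 6 kg:

```r
# Stand-in for pokemon25: a data frame with a weight column
# (first five weights from Table 1 plus Pikachu).
pokemon <- data.frame(weight = c(6.9, 13.0, 100.0, 8.5, 19.0, 6.0))

fit <- lm(weight ~ 1, data = pokemon)

# The fitted intercept equals the sample mean (up to floating-point precision).
coef(fit)[["(Intercept)"]]
mean(pokemon$weight)
```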

Conclusion

So, we now know where the intercept comes from when we run an intercept-only model: it is the mean of the data we are trying to model.

Note that it is the mean because we defined the model’s error as the sum of squared residuals. Had we defined the error differently, the intercept would be different. With the sum of absolute residuals, the intercept would be the median. With the sum of errors, it would be the mode.

Why did we use the sum of squared residuals? We had a conceptual reason: we wanted to punish larger residuals relatively more than several smaller errors. But it turns out there is another reason to favor squared residuals, which has to do with a nice property of the mean compared to the median. This will be covered in Part 2 of ‘Understanding Regression’.