Understanding Regression (Part 2): Why the Normal Distribution?
In Part 1, we proposed that regression is about choosing and fitting distributions to data. In this post, we explain why the normal distribution is often the right choice.
Recap
In Part 1, we introduced the core question to help us understand regression: What distribution might have generated the data?
When we run a model like lm(height ~ 1), we’re saying that heights follow a normal distribution, and we estimate its parameters (μ and σ).
But this raises a question: why pick the normal distribution?
The ubiquity of the normal distribution
Many natural phenomena follow an approximately normal distribution, such as heights (of adults), blood pressure, or test scores. This is because outcomes that are influenced by many small, independent factors tend to look bell-shaped when those factors add up. Heights, for example, are shaped by many genes and environmental influences, each contributing a small amount.
We can use a simple simulation to demonstrate that a normal distribution emerges when many small effects add up. Imagine a variable that is the sum of 20 small, independent effects. Each effect is a random number, drawn uniformly between -1 and 1. On their own, these individual effects look nothing like a normal distribution. They come from a uniform (flat) distribution. But when we add up 20 of them and repeat this 1000 times, the result looks like a normal distribution:
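A minimal sketch of this simulation, written in Python with only the standard library (the seed and the variable names are our own choices, not part of the original analysis):

```python
import random
from statistics import fmean, stdev

random.seed(42)  # fixed seed so the run is repeatable

N_EFFECTS = 20   # small effects added together per outcome
N_DRAWS = 1000   # number of simulated outcomes


def one_outcome():
    """Sum of N_EFFECTS independent effects, each uniform on [-1, 1]."""
    return sum(random.uniform(-1, 1) for _ in range(N_EFFECTS))


sums = [one_outcome() for _ in range(N_DRAWS)]

mean = fmean(sums)
sd = stdev(sums)

# Share of outcomes within one SD of the mean.
# For a normal distribution this is about 0.68.
within_1sd = sum(abs(x - mean) <= sd for x in sums) / N_DRAWS

print(f"mean = {mean:.2f}, sd = {sd:.2f}, share within 1 SD = {within_1sd:.2f}")
```

Each uniform effect has variance 1/3, so in theory the sum has mean 0 and a standard deviation of about sqrt(20/3) ≈ 2.58. The share within one SD is a quick numeric stand-in for eyeballing the bell shape in a histogram.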
No single effect is normally distributed, but their sum is approximately normal. This is a general pattern, formalized by the central limit theorem: when an outcome is shaped by many small, independent, additive influences, the result tends toward a normal distribution. Since many things we measure in the real world (heights, blood pressure, test scores) are plausibly the sum of many small factors, it’s no surprise the normal distribution appears so often.
There’s another attractive property of the normal distribution that makes it sensible to use: it’s parsimonious.
A parsimonious distribution
The normal distribution is defined by just two parameters: a mean (μ) and a standard deviation (σ), describing a center and some spread. It doesn’t claim the data is skewed, or that there are multiple groups, or that there are hard boundaries. It commits to a particular, minimal shape.
To see why that matters, consider the following three distributions:
All three distributions agree on the same two facts: the mean is 155 and the standard deviation is 8. But look at how different they are. Each distribution makes different claims:
- The skewed distribution claims the data is asymmetric, with a longer tail on one side and a hard lower bound
- The bimodal distribution claims there are two distinct clusters
- The normal distribution claims a single most common value and variation around that single value
This is the key observation. The skewed and bimodal distributions are both adding claims on top of the mean and SD. One claims the data is asymmetric, the other claims there are multiple groups. These are additional structural commitments that one would need evidence for to justify. The normal distribution doesn’t add any of that. Among all distributions with the same mean and variance, it’s the one that makes the fewest additional claims.[1]
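To see that very different shapes really can share a mean of 155 and an SD of 8, here is a rough Python sketch. The specific constructions (a shifted gamma for the skewed case, a 50/50 mixture of two normals for the bimodal case) are our own assumptions for illustration; they are not necessarily the distributions described above.

```python
import math
import random
from statistics import fmean, pstdev

random.seed(7)  # fixed seed so the run is repeatable
N = 50_000
MEAN, SD = 155.0, 8.0  # the two facts all three distributions share

# Normal: nothing beyond a centre and a spread.
normal = [random.gauss(MEAN, SD) for _ in range(N)]

# Skewed: a gamma distribution shifted to start at a hard lower bound.
# Gamma(shape k, scale t) has mean k*t and variance k*t**2.
SHAPE = 2.0
scale = SD / math.sqrt(SHAPE)
lower_bound = MEAN - SHAPE * scale
skewed = [lower_bound + random.gammavariate(SHAPE, scale) for _ in range(N)]

# Bimodal: 50/50 mix of two normals centred HALF_GAP below and above MEAN.
# Mixture variance = within_sd**2 + HALF_GAP**2, so solve for within_sd.
HALF_GAP = 7.0
within_sd = math.sqrt(SD**2 - HALF_GAP**2)
bimodal = [
    random.gauss(MEAN + random.choice((-HALF_GAP, HALF_GAP)), within_sd)
    for _ in range(N)
]

summary = {
    name: (fmean(values), pstdev(values))
    for name, values in [("normal", normal), ("skewed", skewed), ("bimodal", bimodal)]
}
for name, (m, s) in summary.items():
    print(f"{name:8s} mean = {m:6.2f}  sd = {s:5.2f}")
```

All three samples should report a mean close to 155 and an SD close to 8, even though one has a hard floor and another has two clusters. The mean and SD alone cannot tell them apart.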
So the normal distribution has two things going for it: it’s often genuinely the right distribution for the data, and it makes the fewest claims beyond a mean and standard deviation. If we later find evidence that a different distribution would make more sense, we can update our model. But until then, the normal distribution is a strong starting point.
Applying this to our heights
Let’s return to our height data and see how well the normal distribution fits.
The dashed line shows a normal distribution with mean 154.6 cm and standard deviation 7.7 cm.
Does it fit perfectly? No. You can see places where the histogram bars don’t quite match the curve. But look at what it does capture: the data is roughly symmetric, peaked near the center, and tapers off at the extremes. That’s exactly what we’d expect from an outcome shaped by many small additive effects.
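One way to make "do the bars match the curve" concrete is to compare the share of observations in each bin with the share a fitted normal expects. The Python sketch below does this on a simulated stand-in sample. It is not the height data from this post, and the 5 cm bin edges are an arbitrary choice.

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Stand-in sample: simulated values, NOT the post's height data.
heights = [random.gauss(154.6, 7.7) for _ in range(2000)]

# Normal distribution with the sample's own mean and SD.
fitted = NormalDist(fmean(heights), stdev(heights))

edges = list(range(130, 185, 5))  # 130, 135, ..., 180
rows = []
for lo, hi in zip(edges[:-1], edges[1:]):
    observed = sum(lo <= h < hi for h in heights) / len(heights)
    predicted = fitted.cdf(hi) - fitted.cdf(lo)
    rows.append((lo, hi, observed, predicted))
    print(f"{lo}-{hi} cm  observed {observed:.3f}  predicted {predicted:.3f}")

# Largest mismatch between a histogram bar and the curve.
largest_gap = max(abs(obs - pred) for _, _, obs, pred in rows)
print(f"largest gap: {largest_gap:.3f}")
```

Because the stand-in sample is itself drawn from a normal distribution, the gaps here reflect sampling noise only. With real data, a run of large gaps on one side of the mean is what skew would look like in this table.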
This is also how you’d spot a problem. If the data were clearly skewed or had two peaks, the mismatch would be obvious, and we’d want to reconsider our distributional choice or our model (later in the series we cover adding predictors to the model, which can account for such patterns).
But often you don’t even need to look at the data: prior knowledge about the outcome can guide your distributional choice before you see a single observation.
When the normal distribution doesn’t apply
Our height data is actually a good example of where this question comes up. The normal distribution has tails that extend infinitely in both directions, but heights can’t be negative, so technically the normal distribution gets that wrong. But for adult heights, the data is so far from zero that a normal distribution assigns essentially zero probability to negative values. The boundary exists in reality, but it’s so far from where the data lives that it doesn’t affect the fit.
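To put a number on "essentially zero", here is a small Python sketch that computes the probability a normal distribution with the fitted values quoted earlier (mean 154.6 cm, SD 7.7 cm) assigns to negative heights. The helper name normal_cdf is our own.

```python
import math

MEAN_CM, SD_CM = 154.6, 7.7  # fitted values quoted in the post


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution, via the complementary error function."""
    return 0.5 * math.erfc((mean - x) / (sd * math.sqrt(2)))


z_zero = (0 - MEAN_CM) / SD_CM  # how many SDs zero lies from the mean
p_negative = normal_cdf(0, MEAN_CM, SD_CM)

print(f"0 cm sits {abs(z_zero):.1f} SDs below the mean")
print(f"P(height < 0) = {p_negative:.1e}")
```

Zero is about 20 standard deviations below the mean, so the probability is far smaller than anything that could matter in practice. Using erfc rather than 1 + erf keeps the tiny tail probability from rounding to exactly zero.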
That changes when the boundary is close to where the data actually falls. Reaction times are a clear example: values are often small and cannot go below 0, which produces a right-skewed distribution. Proportions are another: values are bounded between 0 and 1. The normal distribution extends infinitely in both directions, so it doesn’t respect these constraints. For strictly positive or bounded outcomes, distributions like the log-normal (positive values) or the beta (values between 0 and 1) are better choices.
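A rough Python sketch of the reaction-time case, using hypothetical data: we simulate log-normal reaction times (the median of 0.4 s and the log-scale SD of 0.5 are made-up values), then check how much probability a normal distribution with the same mean and SD would place below zero.

```python
import math
import random
from statistics import fmean, median, stdev

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical reaction times in seconds: log-normal with median 0.4 s.
rts = [random.lognormvariate(math.log(0.4), 0.5) for _ in range(5000)]

mean_rt, sd_rt = fmean(rts), stdev(rts)

# Probability that a normal with the same mean and SD puts below zero.
p_below_zero = 0.5 * math.erfc(mean_rt / (sd_rt * math.sqrt(2)))

# A log-normal is a normal on the log scale, so it can never go negative.
logs = [math.log(t) for t in rts]
log_mu, log_sigma = fmean(logs), stdev(logs)

print(f"mean = {mean_rt:.3f} s, median = {median(rts):.3f} s (mean > median: right skew)")
print(f"normal with same mean and SD: P(RT < 0) = {p_below_zero:.3f}")
print(f"log-scale fit: mu = {log_mu:.3f}, sigma = {log_sigma:.3f}")
```

Unlike the height example, the normal distribution here places a visible share of its probability on impossible negative reaction times, while the log-scale fit respects the boundary by construction.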
Similarly, if the outcome is a count (number of errors, number of children), it can only take whole numbers. A normal distribution is continuous and can take any value, including fractions and negatives. Count data typically calls for distributions like the Poisson or negative binomial.
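As a small illustration in Python, the sketch below compares a Poisson distribution with a normal distribution that has the same mean and variance. The mean count of 1.5 is a hypothetical value chosen for the example.

```python
import math

RATE = 1.5  # hypothetical mean count (e.g., errors per person)


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson distribution with the given rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# The Poisson puts probability only on whole numbers 0, 1, 2, ...
pmf = {k: poisson_pmf(k, RATE) for k in range(8)}
for k, p in pmf.items():
    print(f"P(count = {k}) = {p:.3f}")

# A normal with the same mean and variance (a Poisson's variance equals its mean).
normal_sd = math.sqrt(RATE)
p_normal_negative = 0.5 * math.erfc(RATE / (normal_sd * math.sqrt(2)))
print(f"normal with same mean and variance: P(count < 0) = {p_normal_negative:.3f}")
```

With a low mean count, the matching normal distribution assigns roughly a tenth of its probability to negative counts, and the rest is spread over fractional values that a count can never take.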
The same logic applies to Likert scales (e.g., 5 to 7 response options ranging from ‘Strongly disagree’ to ‘Strongly agree’): these values are bounded and discrete. The normal distribution is continuous and unbounded, so it doesn’t technically fit. Ordinal models tend to be a better fit.
Generally speaking, when you know something about the structure of your data (hard boundaries, discrete values, strong skew), you should use that knowledge. The normal distribution is often a reasonable choice for continuous outcomes where the boundaries are far from the data. But when the data has structure that the normal can’t capture, use a distribution that reflects what you know.
We’ll return to alternative distributions later in the series.
Summary
In Part 1, the core question was: what distribution might have generated this data? This post made the case for why the normal distribution is often the right answer. For many things we measure, the normal distribution is a good approximation of the data. It arises when many small, independent effects add up, which is the case for many outcomes in the real world. It’s also parsimonious: defined by just two parameters (μ and σ), it captures a center and some spread without adding claims about asymmetry, multiple groups, or hard boundaries.
But how do we estimate μ and σ from our sample? That’s what we’ll tackle in Part 3.
Footnotes
[1] There’s a formal way to express this. In information theory, the maximum entropy distribution for a given set of constraints is the distribution that is the most spread out while still satisfying those constraints. Entropy measures how much uncertainty a distribution contains: the higher the entropy, the less the distribution commits to beyond the constraints. For a continuous outcome that can take any real value, when the only constraints are a mean and a variance, the maximum entropy distribution is the normal distribution.
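As a quick check of the maximum entropy claim, this Python sketch compares the differential entropy (in nats) of three distributions with the same standard deviation, using their standard closed-form entropy formulas. The Laplace and uniform distributions are comparison cases we picked for illustration.

```python
import math

SD = 8.0  # same standard deviation for all three distributions

entropy = {
    # Normal: 0.5 * ln(2 * pi * e * sd^2)
    "normal": 0.5 * math.log(2 * math.pi * math.e * SD**2),
    # Laplace with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b)
    "laplace": 1 + math.log(2 * (SD / math.sqrt(2))),
    # Uniform of width w has variance w^2 / 12 and entropy ln(w)
    "uniform": math.log(SD * math.sqrt(12)),
}

for name, h in sorted(entropy.items(), key=lambda item: -item[1]):
    print(f"{name:8s} entropy = {h:.3f} nats")
```

The normal distribution comes out highest of the three. The mean does not appear in any formula because shifting a distribution does not change its entropy.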