Understanding Regression (Part 4): Uncertainty and the Standard Error

Every estimate is based on a sample, and a different sample would give different results. This post builds intuition for what that means: simulating how much the sample mean would vary across studies and deriving the standard error formula.

Recap

In Part 3, we estimated the parameters of a normal distribution for heights: μ (the mean) and σ (the standard deviation). We saw that lm(height ~ 1) does exactly this: the intercept estimates μ and the residual standard error estimates σ.

But these estimates are based on a sample, which raises the next question: How certain can we be about our estimates?

One sample, many possibilities

Our estimate of μ is 154.6 cm and our estimate of σ is 7.7 cm. If we’d measured different people, both numbers would come out slightly different — both are estimates with uncertainty.

In this post we’ll focus on μ, because the uncertainty lm() reports in the “Std. Error” column belongs to μ (the intercept). σ’s uncertainty is a different story, one that lm() doesn’t report. We’ll get to that in Part 5.

To get a sense of how much μ varies, we can simulate it. We’ve proposed that heights follow a normal distribution with mean 154.6 and standard deviation 7.7, so let’s draw many samples from that distribution, each with a sample size matching ours, and see how much the mean moves around.
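The simulation is only a few lines. Here's a minimal sketch in Python with NumPy (the series uses R, where rnorm() would play the same role); the values 154.6, 7.7, and 352 come from our sample, while the seed and the 1000 replications are arbitrary choices:

```python
import numpy as np

# Our estimates from the sample: mu = 154.6, sigma = 7.7, n = 352.
mu, sigma, n = 154.6, 7.7, 352
rng = np.random.default_rng(42)

# Draw 1000 simulated samples of size n and record each sample's mean.
sample_means = rng.normal(mu, sigma, size=(1000, n)).mean(axis=1)

print(sample_means.mean())  # clusters around 154.6
print(sample_means.std())   # the spread of the sampling distribution
```

A histogram of `sample_means` is exactly the figure below.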

Figure 1: Distribution of sample means from 1000 simulated samples

Each bar represents how often a particular sample mean came up across 1000 simulated samples. They cluster tightly around 154.6 cm, but there’s some spread. Most sample means fall within about 1 cm of μ, though some are further out.

This distribution of sample means has a name: the sampling distribution of the mean. More generally, any statistic you could calculate from a sample (the mean, the standard deviation, a correlation) has its own sampling distribution.

Notice that the sampling distribution of the mean is itself a normal distribution. Why? Recall from Part 2 that when you add up many independent effects, the result is normally distributed. The sample mean is exactly that: a sum of independent observations (divided by n). The same principle that makes heights normally distributed (many small effects adding up) also makes the sampling distribution of the mean normal.1

What affects uncertainty?

The sampling distribution we just saw was for samples of n = 352, the same size as our actual data. But what happens with smaller or larger samples?

Figure 2: Sampling distributions at different sample sizes

With n = 10, there’s quite a bit of spread. Sample means range from around 150 cm to 160 cm. With n = 50, the spread narrows. And with n = 352 (our actual sample), the estimates are tightly packed.
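Rerunning the same simulation at each sample size shows the narrowing directly (again a Python sketch; the seed and the 1000 replications per sample size are arbitrary):

```python
import numpy as np

mu, sigma = 154.6, 7.7  # estimates from our sample
rng = np.random.default_rng(1)

# Measure the spread of 1000 simulated sample means at each sample size.
spreads = {}
for n in (10, 50, 352):
    means = rng.normal(mu, sigma, size=(1000, n)).mean(axis=1)
    spreads[n] = means.std()
    print(f"n = {n:3d}: spread of sample means = {spreads[n]:.2f} cm")
```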

This makes intuitive sense: more data means more precision.

But sample size isn’t the only thing that matters. The variability of the data itself also plays a role. If individual heights barely varied from person to person, you wouldn’t need a large sample to pin down the mean. But if heights were wildly variable, even a large sample would leave you uncertain.

We can see this by running the same simulation with different values of σ, keeping the sample size fixed at n = 352:

Figure 3: Sampling distributions at different levels of variability

When individual heights vary less (σ = 4), the sample means are tightly clustered. When they vary more (σ = 16), even with the same sample size, our estimates of the mean are much more spread out. More variability in the data means more uncertainty about the mean.
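In code, this is the same loop with σ varying instead of n (a Python sketch; 7.7 is our estimate from the data, and 4 and 16 are the illustrative values from the figure):

```python
import numpy as np

mu, n = 154.6, 352  # mean estimate and our actual sample size
rng = np.random.default_rng(2)

# Same sample size, different amounts of individual variability.
spreads = {}
for sigma in (4, 7.7, 16):
    means = rng.normal(mu, sigma, size=(1000, n)).mean(axis=1)
    spreads[sigma] = means.std()
    print(f"sigma = {sigma:4}: spread of sample means = {spreads[sigma]:.2f} cm")
```

Notice that doubling σ roughly doubles the spread of the sample means: uncertainty about the mean scales directly with the variability of the data.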

The standard error

The spread of a sampling distribution has a name: the standard error (SE). In this case, it tells us how much our estimate of μ would vary from sample to sample.

We’ve seen that the SE depends on both σ and n. But how exactly? We can measure the SE directly from our simulations by calculating the standard deviation of the sample means at each sample size:

Figure 4: Standard error decreases with sample size, but not linearly

The dots are the standard errors measured from our simulations. You can see that the SE decreases as n increases, but with diminishing returns. Doubling the sample size doesn’t halve the SE. To halve it, you’d need four times as many observations.

Why diminishing returns? Think about how much each new observation can actually change your estimate of the mean. With just 1 observation, a 2nd one can shift the sample mean by half the distance between them — it has a lot of influence. With 100 observations, the 101st can only nudge the mean by 1/101 of the distance between the current estimate and the new value. Each additional observation has less and less leverage to move the estimate, which is why the early observations do the heavy lifting and the relationship is curved rather than linear.2
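The leverage argument is just the running-mean update, written out below as a small Python illustration with made-up numbers:

```python
def update_mean(mean, n, x):
    """Mean of n observations, updated with one new observation x.

    The new value pulls the mean toward itself by 1/(n + 1) of the
    distance between them -- so its leverage shrinks as n grows.
    """
    return mean + (x - mean) / (n + 1)

# One observation at 150; a second at 160 pulls the mean halfway:
print(update_mean(150.0, 1, 160.0))    # 155.0

# A hundred observations averaging 150; the 101st at 160 barely moves it:
print(update_mean(150.0, 100, 160.0))  # ~150.1
```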

The dashed line in the plot follows a simple formula that captures the relationship between σ and sample size: SE = σ / √n. (We’re using our sample SD as a stand-in for the true σ here.)

For our heights: SE = 7.7 / √352 ≈ 0.41 cm.
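We can check the formula against the simulation (a Python sketch; the seed and the 5000 replications are arbitrary):

```python
import math
import numpy as np

mu, sigma, n = 154.6, 7.7, 352  # estimates from our sample

# The formula...
se_formula = sigma / math.sqrt(n)

# ...and the same quantity measured from simulation: the standard
# deviation of many simulated sample means.
rng = np.random.default_rng(3)
se_simulated = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1).std()

print(round(se_formula, 2))  # 0.41
print(round(se_simulated, 2))
```

The two agree to within simulation noise, which is the formula doing its job: it predicts the spread we'd otherwise have to simulate.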

Back to lm()

Let’s return to the lm() output:

Call:
lm(formula = height ~ 1, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.0721  -6.0071  -0.2921   6.0579  24.4729 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 154.5971     0.4127   374.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.742 on 351 degrees of freedom

Look at the “Std. Error” column: 0.41. Now we know what it means: it’s the standard deviation of the sampling distribution of the intercept — our estimate of μ — telling us how much that estimate would vary from sample to sample.

What about σ?

We’ve focused on μ’s uncertainty, but σ is estimated from the same sample. A different sample would give a different estimate of σ too.

How much does σ vary? And does it matter? That’s what we’ll explore in Part 5.

Summary

Our estimates of μ and σ are based on one sample. A different sample would give different estimates. The sampling distribution of the mean describes how much the estimate of μ would vary from sample to sample, and the standard error measures the spread of that distribution.

For our heights data, the SE of μ is 0.41 cm. The lm() output reports this in the “Std. Error” column of the intercept.

Footnotes

  1. If the data came from a non-normal distribution, the sampling distribution of the mean would only be approximately normal, with the approximation improving as the sample size increases. This is known as the central limit theorem. But since our model specifies a normal distribution, the sampling distribution is exactly normal regardless of sample size.

  2. The exact shape of the curve comes from how uncertainty combines mathematically. Each observation has variance σ². The variance of the sum of n independent observations is n·σ², and the mean is the sum divided by n, so its variance is σ²/n. The standard error is the square root of this: √(σ²/n) = σ/√n. Variance adds linearly with n, but the SE is on the square root scale, which is why n times more data gives only √n times less uncertainty.