In practice, we want to use the regression coefficients that we compute using sample data to say something about a population. But how accurately do our sample estimates represent the population of interest?
Here, I use our schools data to illustrate how random samples relate to a population.
In Chapter 2, we took a random sample of 20 schools and ran a regression of the API test score on the percent of students eligible for free lunch (FLE). We obtained an estimate of -2.11 for the slope coefficient.
The first figure to the right shows the population of all 5,765 CA elementary schools in 2013. The slope of the population regression line is -1.80.
Let's think of the sample we used in Chapter 2 as one of many potential samples we could have gotten. If we took another sample, we would get a different regression slope coefficient. What does the range of potential slopes look like across all possible samples?
To answer this question, I wrote computer code to draw 10,000 independent samples from the population of elementary schools. For each sample, I ran the regression of API on FLE and saved the slope coefficient. (Here is my code in both R and Stata formats. The R code runs a lot faster because I'm better with R than Stata.)
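For readers who want to see the logic without opening those files, here is a minimal sketch of this kind of simulation in Python. It is not the code from Properties.R or Properties.do. It assumes a synthetic population generated from a straight line with noise, not the real CA_Schools.csv data; the intercept (925), noise level (50), seed, and the names `ols_slope` and `sample_slopes` are all hypothetical choices made for illustration.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic stand-in for the population of schools (assumption: NOT real data).
POP_SIZE = 5765
fle = [rng.uniform(0, 100) for _ in range(POP_SIZE)]   # percent free-lunch eligible
api = [925 - 1.8 * f + rng.gauss(0, 50) for f in fle]  # test score with noise

# Slope of the regression line fitted to the whole (synthetic) population.
population_slope = ols_slope(fle, api)


def sample_slopes(n_schools, n_samples):
    """Draw n_samples random samples of n_schools schools; return each slope."""
    slopes = []
    for _ in range(n_samples):
        idx = rng.sample(range(POP_SIZE), n_schools)  # sample without replacement
        slopes.append(ols_slope([fle[i] for i in idx], [api[i] for i in idx]))
    return slopes


slopes = sample_slopes(n_schools=20, n_samples=10_000)
print(f"population slope:      {population_slope:.2f}")
print(f"mean of sample slopes: {statistics.fmean(slopes):.2f}")
```

Individual slopes in `slopes` vary quite a bit from sample to sample, but their average should land very close to `population_slope`. Swapping the synthetic `fle` and `api` lists for the corresponding columns of the real data would reproduce the exercise described above.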
The histogram to the right shows the distribution of the slopes across the 10,000 samples. The average slope is -1.80, which matches the slope in the population. This is no accident. It illustrates the property of unbiasedness: averaged across all possible samples, the estimated slope equals the population slope.
The gif below left shows this simulation in action.
The slope estimates are likely to be closer to -1.80 if the sample is larger. The last gif illustrates this point. It shows a wide distribution of potential slope estimates for N=10. As the sample size increases, the range narrows.
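The effect of sample size can be sketched the same way. The Python snippet below is an illustration under stated assumptions, not the code behind the gif: it uses a synthetic population (not the real schools data), and the sample sizes, the 2,000 repetitions per size, and the name `slope_spread` are hypothetical choices.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(7)  # fixed seed so the run is repeatable

# Synthetic stand-in for the population of schools (assumption: NOT real data).
POP_SIZE = 5765
fle = [rng.uniform(0, 100) for _ in range(POP_SIZE)]
api = [925 - 1.8 * f + rng.gauss(0, 50) for f in fle]


def slope_spread(n_schools, n_samples=2_000):
    """Standard deviation of the slope across repeated samples of size n_schools."""
    slopes = []
    for _ in range(n_samples):
        idx = rng.sample(range(POP_SIZE), n_schools)
        slopes.append(ols_slope([fle[i] for i in idx], [api[i] for i in idx]))
    return statistics.stdev(slopes)


# Spread of the slope estimates for increasing sample sizes.
spread_by_n = {n: slope_spread(n) for n in (10, 20, 50, 100)}
for n, spread in spread_by_n.items():
    print(f"N={n:>3}: standard deviation of slopes = {spread:.3f}")
```

The printed standard deviations should shrink as N grows, which is the narrowing that the gif shows.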
Data and Code
- CSV: CA_Schools.csv
- R: Properties.R Properties_animate.R
- Stata: CA_Schools.dta Properties.do