The following R script creates an example dataset to illustrate the application of clustered standard errors. You can download the dataset here.
The script creates a dataset with a specified number of student test results. Individual students are identified by the variable student_id. The variable id_score contains a student’s test score. In the test, students can score from 1 to 10, with 10 being the highest possible score.
In the dataset, students belong to different classes. The variable class_id shows which class a student belongs to. The variable class_size shows the number of students in that class.
Furthermore, the script generates test scores that are correlated within classes. This is based on the idea that students within a class are exposed to the same environment; for example, all students in a class share the same teacher.
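One way to check how strongly scores cluster within classes is the one-way intraclass correlation (ICC). The sketch below is an illustration only: icc_oneway() is a hypothetical helper, not part of the original script, and the toy data are simulated here with a shared class-level shock rather than taken from the example dataset.

```r
# Hypothetical helper: one-way ANOVA estimate of the intraclass correlation.
# Values near 0 mean scores are roughly independent within classes;
# values near 1 mean students in the same class score almost identically.
icc_oneway <- function(score, class) {
  class <- factor(class)
  n_j <- as.numeric(table(class))   # students per class
  g   <- length(n_j)                # number of classes
  n   <- sum(n_j)                   # number of students

  # Between-class and within-class mean squares
  ms  <- anova(lm(score ~ class))[["Mean Sq"]]
  msb <- ms[1]
  msw <- ms[2]

  # Effective class size for unbalanced classes
  n0 <- (n - sum(n_j^2) / n) / (g - 1)

  (msb - msw) / (msb + (n0 - 1) * msw)
}

# Toy data (an assumption for illustration): 20 classes of 25 students,
# each class shares one random shock on top of individual noise.
set.seed(42)
toy_class <- rep(1:20, each = 25)
toy_score <- rnorm(20)[toy_class] + rnorm(length(toy_class))
toy_icc   <- icc_oneway(toy_score, toy_class)
toy_icc
```

With the example dataset loaded as data, the same helper could be called as icc_oneway(data$id_score, data$class_id).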
The structure of the dataset looks as follows:
| student_id | class_id | class_size | id_score |
|---|---|---|---|
| 1 | 12 | 24 | 6 |
| 2 | 32 | 23 | 7 |
| 3 | 17 | 33 | 6 |
| 4 | 36 | 29 | 6 |
| 5 | 38 | 16 | 9 |
| 6 | 2 | 24 | 7 |
| 7 | 22 | 29 | 8 |
Ultimately, we are interested in the effect of class size on the test score. A linear regression of individual test scores on class size with conventional standard errors overestimates the precision of the coefficient estimates, because students in the same class are not independent observations. To correct for this, one can apply clustered standard errors. You can find a working example in R that uses this dataset here.
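To make the difference concrete, here is a minimal base-R sketch that compares conventional OLS standard errors with cluster-robust ones computed by hand. It is not the linked working example. The toy data frame is simulated inside the block as an assumption, and the finite-sample adjustment used is the common G/(G-1) * (N-1)/(N-K) factor, which is only one of several variants.

```r
# Toy data (an assumption, not the article's dataset): 30 classes whose
# students share a class-level shock, so errors are correlated within class.
set.seed(1)
n_classes   <- 30
sizes       <- sample(15:35, n_classes, replace = TRUE)
class_id    <- rep(seq_len(n_classes), times = sizes)
class_shock <- rnorm(n_classes, sd = 1)
toy <- data.frame(
  class_id   = class_id,
  class_size = sizes[class_id],
  id_score   = 8 - 0.05 * sizes[class_id] + class_shock[class_id] +
    rnorm(length(class_id), sd = 1)
)

fit <- lm(id_score ~ class_size, data = toy)

# Conventional standard errors (treat all students as independent)
se_ols <- sqrt(diag(vcov(fit)))

# Cluster-robust covariance: bread %*% meat %*% bread
X <- model.matrix(fit)
u <- residuals(fit)
n <- nrow(X)                          # observations
k <- ncol(X)                          # estimated coefficients
g <- length(unique(toy$class_id))     # clusters (classes)

bread <- solve(crossprod(X))
# Sum the per-student contributions X_i * u_i within each class
scores_by_class <- rowsum(X * u, group = toy$class_id)
meat <- crossprod(scores_by_class)

adj     <- (g / (g - 1)) * ((n - 1) / (n - k))
vcov_cl <- adj * bread %*% meat %*% bread
se_cl   <- sqrt(diag(vcov_cl))

# Side-by-side comparison of the two sets of standard errors
round(cbind(ols = se_ols, clustered = se_cl), 4)
```

Because class_size is constant within a class, the regressor carries no within-class variation, which is the situation where ignoring the clustering tends to matter most.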
    #####################################################################
    # This script creates an example dataset to illustrate the
    # application of clustered standard errors. In particular,
    # this script creates a dataset of student test results.
    # Students are distributed over several classes and test
    # scores are correlated within classes.
    #
    # Ultimately, we are interested in the effect of class
    # size on the test score. A linear regression of test
    # score on class size overestimates the precision of the
    # coefficient estimates. In order to correct for this bias
    # one might apply clustered standard errors.
    #####################################################################

    # Start with empty workspace
    rm(list = ls())

    # Set seed for reproducibility
    set.seed(123)

    #####################################################################
    # Parametrization
    #####################################################################

    # Set number of individuals. In our case
    # the number of students.
    n_obs <- 1000

    # Set number of classes
    number_classes <- 40

    # Anchor score. Sets the anchor of the
    # class mean scores.
    score_mean <- 9.5

    # Set St.Dev of score
    score_sd <- 1.9

    #####################################################################
    # Create Dataset
    #####################################################################

    # Create basic data structure
    data <- data.frame("student_id" = 1:n_obs)

    # Create variable that defines class id
    data$class_id <- sample(x = 1:number_classes, size = n_obs, replace = TRUE)

    # Initialize the columns that the loop fills in
    data$class_size <- NA_integer_
    data$id_score <- NA_integer_

    # The following loop computes the average
    # score as a function of class size. Larger classes
    # are set to have a lower score. Once the average
    # class score is set, we distribute individual scores
    # around that mean.
    for (i in 1:number_classes) {
      in_class <- data$class_id == i
      n_in_class <- sum(in_class)
      data$class_size[in_class] <- n_in_class

      # Class size relative to the average class size
      rel_size <- n_in_class / mean(table(data$class_id))

      # Define probability distribution of the class mean score:
      # smaller classes have higher scores on average
      prob <- dnorm(3:8,
                    mean = 1 / rel_size * score_mean,
                    sd = rel_size * score_sd)
      mean_score <- sample(x = 3:8, size = 1, replace = TRUE,
                           prob = prob / sum(prob))

      # Distribute individual scores (1 to 10) around the class mean
      prob <- dnorm(1:10, mean = mean_score, sd = rel_size * score_sd)
      data$id_score[in_class] <- sample(x = 1:10, size = n_in_class,
                                        replace = TRUE,
                                        prob = prob / sum(prob))
    }

    # Save data (write.table defaults: space-separated, with row names)
    write.table(x = data, file = "data.csv")
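Because the script saves the data with write.table() and its default arguments, the resulting data.csv is space-separated and includes row names, despite the .csv extension. A minimal sketch for reading it back follows. read_example_data() is a hypothetical helper name, and whether the downloadable file has exactly this layout is an assumption.

```r
# Hypothetical helper (not part of the original script): read a file that
# was saved with write.table() using its default arguments.
read_example_data <- function(path = "data.csv") {
  # header = TRUE: the first line holds the column names. That line has one
  # field fewer than the data rows, so read.table() treats the first column
  # as row names automatically.
  read.table(file = path, header = TRUE)
}

# Usage, assuming data.csv is in the working directory:
# data <- read_example_data("data.csv")
# str(data)
```

Using read.csv() on this file would not work as intended, since read.csv() expects comma-separated fields.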