Example data – Clustered Standard Errors

The following R script creates an example dataset to illustrate the application of clustered standard errors. You can download the dataset here.

The script creates a dataset with a specific number of student test results. Individual students are identified by the variable student_id. The variable id_score contains a student’s test score. Test scores range from 1 to 10, with 10 being the highest possible score.

In the dataset students belong to different classes. The variable class_id shows to which particular class a student belongs. The variable class_size shows the number of students within that particular class.
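As a quick sanity check of this definition, class_size should equal the number of rows that share a class_id. The following is a minimal sketch on a small hand-made data frame; the name `toy` and its values are illustrative assumptions and are not taken from the dataset.

```r
# Toy data: three classes with 2, 3 and 1 students (hypothetical values)
toy <- data.frame(student_id = 1:6,
                  class_id   = c(1, 2, 2, 1, 2, 3))

# ave() applies length() within each class and returns one value per row
toy$class_size <- ave(toy$student_id, toy$class_id, FUN = length)

# Cross-check against a frequency table of class_id
counts <- table(toy$class_id)
size_ok <- all(toy$class_size ==
                 as.integer(counts[as.character(toy$class_id)]))
print(toy)
print(size_ok)
```

The same check can be run on the full dataset by replacing `toy` with the data frame created by the script.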

Furthermore, the script generates test scores that are correlated within classes. This is based on the idea that students within a class are exposed to the same environment, e.g. all students of a class share the same teacher.
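One common way to quantify such within-class correlation is the intraclass correlation (ICC). The sketch below uses the one-way ANOVA estimator on simulated toy data. The function name `icc_anova`, the toy variables and all parameter values are assumptions made for illustration; this is not how the script itself generates the correlation.

```r
set.seed(1)

# Hypothetical toy data: 20 classes of 25 students each. Every class
# gets a shared shock (e.g. the teacher) plus individual noise.
k <- 20
m <- 25
class_id     <- rep(1:k, each = m)
class_effect <- rnorm(k, sd = 1)
score        <- 5 + class_effect[class_id] + rnorm(k * m, sd = 1)

# One-way ANOVA estimator of the intraclass correlation.
# It also handles classes of unequal size via the adjusted size n0.
icc_anova <- function(y, g) {
  g     <- factor(g)
  n_j   <- as.numeric(table(g))            # class sizes
  N     <- length(y)
  k     <- nlevels(g)
  means <- as.numeric(tapply(y, g, mean))  # class means
  msb   <- sum(n_j * (means - mean(y))^2) / (k - 1)      # between classes
  msw   <- sum((y - means[as.integer(g)])^2) / (N - k)   # within classes
  n0    <- (N - sum(n_j^2) / N) / (k - 1)
  (msb - msw) / (msb + (n0 - 1) * msw)
}

icc <- icc_anova(score, class_id)
print(icc)
```

An ICC close to zero would indicate that scores within a class are no more alike than scores across classes; larger values indicate stronger clustering.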

The structure of the dataset looks as follows:

`student_id`	`class_id`	`class_size`	`id_score`
1	12	24	6
2	32	23	7
3	17	33	6
4	36	29	6
5	38	16	9
6	2	24	7
7	22	29	8

Ultimately, we are interested in the effect of class size on the test score. A linear regression of individual test scores on class size that treats all observations as independent overestimates the precision of the coefficient estimates, because test scores are correlated within classes. To correct for this, one might apply clustered standard errors. You can find a working example in R that uses this dataset here.
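To make the idea concrete, here is a minimal base R sketch that compares conventional OLS standard errors with cluster-robust ones. It is not the working example linked above. The data are simulated inside the block, the function name `cluster_se` is hypothetical, and the small-sample correction G/(G-1) * (N-1)/(N-K) is one common convention assumed here. In practice one would probably rely on an established implementation such as `vcovCL()` from the sandwich package.

```r
set.seed(42)

# Hypothetical toy data (NOT the dataset from this post): 40 classes,
# a class-level shock shared by all students of a class, and a small
# negative effect of class size on the score.
G            <- 40
size_g       <- sample(15:35, G, replace = TRUE)
class_id     <- rep(1:G, times = size_g)
class_size   <- rep(size_g, times = size_g)
class_effect <- rnorm(G)
score <- 8 - 0.05 * class_size + class_effect[class_id] +
  rnorm(length(class_id))

fit <- lm(score ~ class_size)

# Cluster-robust standard errors via the sandwich formula
# (X'X)^-1 [ sum_g X_g' u_g u_g' X_g ] (X'X)^-1
cluster_se <- function(fit, cluster) {
  X <- model.matrix(fit)
  u <- residuals(fit)
  N <- nrow(X)
  K <- ncol(X)
  G <- length(unique(cluster))
  bread  <- solve(crossprod(X))
  # Sum the row-wise contributions x_i * u_i within each cluster
  scores <- rowsum(X * u, group = cluster)
  meat   <- crossprod(scores)
  adj    <- (G / (G - 1)) * ((N - 1) / (N - K))  # assumed correction
  sqrt(diag(adj * bread %*% meat %*% bread))
}

se_ols <- sqrt(diag(vcov(fit)))
se_cl  <- cluster_se(fit, class_id)
print(cbind(ols = se_ols, clustered = se_cl))
```

Because class size is constant within a class, the conventional standard error on class_size treats every student as independent information, whereas the clustered version effectively counts classes.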


#####################################################################
# This script creates an example dataset to illustrate the 
# application of clustered standard errors. In particular,
# this script creates a dataset of student test results. 
# Students are distributed over several classes and test 
# scores are correlated within classes. 
# 
# Ultimately, we are interested in the effect of class 
# size on the test score. A linear regression of test 
# score on class size that ignores this correlation 
# overestimates the precision of the coefficient 
# estimates. To correct for this, one might apply 
# clustered standard errors.
#####################################################################


# Start with empty workspace
rm(list=ls())

# Set seed for reproducibility
set.seed(123)

#####################################################################
# Parametrization
#####################################################################

# Set number of individuals. In our case
# the number of students.
n_obs <- 1000

# Set number of classes
number_classes <- 40

# Anchor score. The distribution of the class mean
# score is centred on this value, scaled by the
# inverse of the relative class size
score_mean <- 9.5 

# Set standard deviation of score
score_sd <- 1.9


#####################################################################
# Create Dataset
#####################################################################

# Create basic data structure
data <- data.frame("student_id"=c(1:n_obs))

# Create variable that defines class id
data$class_id <- sample(x = 1:number_classes,
                        size = n_obs,
                        replace = TRUE)

# The following loop computes the average 
# score as a function of class size. Larger classes
# are set to have a lower score. Once the average 
# class score is set, we distribute individual scores
# around that mean.

# Initialize the columns that the loop fills in
data$class_size <- NA_integer_
data$id_score <- NA_integer_

for(i in 1:number_classes){
  
  # Students in class i and the size of class i
  # relative to the average class size
  in_class <- data$class_id==i
  rel_size <- sum(in_class)/mean(table(data$class_id))
  
  data$class_size[in_class] <- sum(in_class)
  
  # Define the probability distribution of the class mean score:
  # smaller classes have higher scores on average 
  prob <- dnorm(3:8,
                mean = 1/rel_size*score_mean,
                sd = rel_size*score_sd)
  mean_score <- sample(x = 3:8,
                       size = 1,replace = TRUE,
                       prob = prob/sum(prob))
  
  # Draw the individual scores around the class mean score.
  # size must be a single number, so we pass the class size
  # once instead of one copy per student
  prob <- dnorm(1:10,
                mean = mean_score,
                sd = rel_size*score_sd)
  data$id_score[in_class] <- sample(x = 1:10,
                                    size = sum(in_class),
                                    replace = TRUE,
                                    prob = prob/sum(prob))
}

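# Save data. Note that write.table() separates fields with a
# space by default, despite the .csv file extension.
write.table(x = data,file = "data.csv")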
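Since the file is written with the defaults of write.table(), it is space-separated and carries row names, so `read.table()` is the natural counterpart for loading it. A minimal sketch of the round trip follows, assuming a small stand-in data frame (`toy`, hypothetical) and a temporary file instead of data.csv:

```r
# Hypothetical stand-in for the data frame created by the script
toy <- data.frame(student_id = 1:3,
                  class_id   = c(12, 32, 17),
                  class_size = c(24, 23, 33),
                  id_score   = c(6, 7, 6))

# Write with the same defaults as in the script, but to a temporary file
path <- tempfile(fileext = ".csv")
write.table(x = toy, file = path)

# The header line has one field fewer than the data rows (the row
# names have no column name), so read.table() treats the first line
# as the header and the first column as row names.
back <- read.table(path)
str(back)
```

Reading the same file with `read.csv()` and its default comma separator would not split the columns as intended.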

Economic Theory Blog
