Example data – Clustered Standard Errors

The following R script creates an example dataset to illustrate the application of clustered standard errors. You can download the dataset here.

The script creates a dataset with a specified number of student test results. Individual students are identified by the variable student_id. The variable id_score holds a student’s test score. In the test, students can score from 1 to 10, with 10 being the highest possible score.

In the dataset, students belong to different classes. The variable class_id indicates the class to which a student belongs, and class_size gives the number of students in that class.

Furthermore, the script generates test scores that are correlated within classes. This reflects the idea that students within a class are exposed to the same environment, e.g. all students in a class share the same teacher.
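To make "correlated within classes" concrete, here is a minimal sketch that is independent of the script further down. It simulates scores with a shared class-level component and estimates the intraclass correlation (ICC) from the between-class and within-class mean squares. The toy data, the names toy and class_effect, and the chosen variances are assumptions for illustration only.

```r
# Toy illustration (not the dataset from this post): scores that share
# a class-level component are correlated within classes.
set.seed(1)

n_classes <- 40    # assumed number of classes
n_per_class <- 25  # assumed (equal) class size

# Class-level shock, e.g. a shared teacher (sd = 1 is an assumption)
class_effect <- rnorm(n_classes, mean = 0, sd = 1)

toy <- data.frame(class_id = rep(seq_len(n_classes), each = n_per_class))
# Score = common level + class shock + individual noise
toy$score <- 6 + class_effect[toy$class_id] + rnorm(nrow(toy), sd = 1)

# Between-class and within-class mean squares (one-way ANOVA by hand)
class_means <- tapply(toy$score, toy$class_id, mean)
grand_mean <- mean(toy$score)
ms_between <- n_per_class * sum((class_means - grand_mean)^2) /
  (n_classes - 1)
ms_within <- sum((toy$score - class_means[toy$class_id])^2) /
  (nrow(toy) - n_classes)

# ANOVA estimator of the intraclass correlation
icc <- (ms_between - ms_within) /
  (ms_between + (n_per_class - 1) * ms_within)

# The population ICC implied by the two variances above is 1 / (1 + 1) = 0.5
icc
```

An ICC well above zero means that two students from the same class tend to have more similar scores than two students from different classes, which is the situation clustered standard errors are meant to handle.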

The structure of the dataset looks as follows:

student_id  class_id  class_size  id_score
         1        12          24         6
         2        32          23         7
         3        17          33         6
         4        36          29         6
         5        38          16         9
         6         2          24         7
         7        22          29         8

Ultimately, we are interested in the effect of class size on the test score. Because scores are correlated within classes, a linear regression of individual test score on class size with conventional standard errors overestimates the precision of the coefficient estimates. To correct for this, one might apply clustered standard errors. You can find a working example in R that uses this dataset here.
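As a rough illustration of that correction, the following sketch computes conventional and cluster-robust standard errors by hand in base R. It uses its own toy data rather than the dataset from this post, and the names toy, class_shock and vcov_cluster, as well as all parameter values, are assumptions. The finite-sample correction shown is one commonly used variant. In practice one would usually rely on a package such as sandwich (its vcovCL function) instead of this hand-rolled version.

```r
# Toy data (an assumption, not the dataset from this post): the regressor
# varies only between classes and the errors share a class-level shock.
set.seed(42)
n_classes <- 40
class_size <- sample(15:35, size = n_classes, replace = TRUE)
toy <- data.frame(class_id = rep(seq_len(n_classes), times = class_size))
toy$class_size <- class_size[toy$class_id]
class_shock <- rnorm(n_classes, sd = 1)
toy$score <- 8 - 0.05 * toy$class_size +
  class_shock[toy$class_id] + rnorm(nrow(toy), sd = 1)

# Ordinary least squares of score on class size
model <- lm(score ~ class_size, data = toy)

# Conventional standard errors (assume independent errors)
se_ols <- sqrt(diag(vcov(model)))

# Cluster-robust "sandwich" variance, computed by hand:
# bread = (X'X)^-1, meat = sum over classes of (X_g' u_g)(X_g' u_g)'
X <- model.matrix(model)
u <- residuals(model)
bread <- solve(crossprod(X))
scores_by_class <- rowsum(X * u, group = toy$class_id)
meat <- crossprod(scores_by_class)

# A commonly used finite-sample correction for clustered errors
n <- nrow(X)
k <- ncol(X)
G <- n_classes
correction <- (G / (G - 1)) * ((n - 1) / (n - k))
vcov_cluster <- correction * bread %*% meat %*% bread
se_cluster <- sqrt(diag(vcov_cluster))

# Compare both sets of standard errors side by side
cbind(estimate = coef(model), se_ols = se_ols, se_cluster = se_cluster)
```

Because class size only varies between classes and the errors share a class-level shock, the clustered standard error on class_size should come out noticeably larger than the conventional one in this toy setup.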

 


#####################################################################
# This script creates an example dataset to illustrate the 
# application of clustered standard errors. Specifically,
# this script creates a dataset of student test results. 
# Students are distributed over several classes and test 
# scores are correlated within classes. 
# 
# Ultimately, we are interested in the effect of class 
# size on the test score. A linear regression of test 
# score on class size overestimates the precision of the 
# coefficient estimates. To correct for this, one might 
# apply clustered standard errors.
#####################################################################


# Start with empty workspace
rm(list=ls())

# Set seed for reproducibility
set.seed(123)

#####################################################################
# Parametrization
#####################################################################

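# Set number of individuals. In our case
# the number of students.
n_obs <- 1000

# Set number of classes
number_classes <- 40

# Anchor for the class mean score. The distribution of
# class mean scores is centered at score_mean divided by
# the relative class size.
score_mean <- 9.5 

# Baseline standard deviation of the score
# (scaled by the relative class size)
score_sd <- 1.9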


#####################################################################
# Create Dataset
#####################################################################

# Create basic data structure
data <- data.frame(student_id = seq_len(n_obs))

# Create variable that defines class id
data$class_id <- sample(x = 1:number_classes,
                        size = n_obs,
                        replace = TRUE)

# The following loop sets each class's average score as a
# function of class size. Larger classes are set to have a
# lower score. Once the class's average score is drawn,
# individual scores are distributed around that mean.

for(i in 1:number_classes){
  
  # Students in class i and the resulting class size
  in_class <- data$class_id==i
  n_class <- sum(in_class)
  data$class_size[in_class] <- n_class
  
  # Class size relative to the average class size
  rel_size <- n_class/mean(table(data$class_id))
  
  # Define the probability distribution of the class mean score;
  # smaller classes have higher scores on average 
  prob <- dnorm(c(3:8),
                mean = 1/rel_size*score_mean,
                sd = rel_size*score_sd)
  mean_score <- sample(x = c(3:8),
                       size = 1,
                       replace = TRUE,
                       prob = prob/sum(prob))
  
  # Draw individual scores (1 to 10) around the class mean
  prob <- dnorm(c(1:10),
                mean = mean_score,
                sd = rel_size*score_sd)
  data$id_score[in_class] <- sample(x = c(1:10),
                                    size = n_class,
                                    replace = TRUE,
                                    prob = prob/sum(prob))
}

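# Save data. Note that write.table() with default settings
# writes space-separated values with row names, despite the
# .csv extension; read the file back with read.table().
write.table(x = data,file = "data.csv")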
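
The saved file can be loaded again with read.table(), since write.table() with its default settings produces space-separated values with row names rather than a comma-separated file. The following sketch shows such a round trip on a small stand-in data frame written to a temporary file. The name example_data and its values are assumptions for illustration.

```r
# Small stand-in for the generated dataset (values are illustrative)
example_data <- data.frame(student_id = 1:3,
                           class_id = c(12L, 32L, 17L),
                           class_size = c(24L, 23L, 33L),
                           id_score = c(6L, 7L, 6L))

# Write with the same defaults as the script (space-separated, row names)
path <- tempfile(fileext = ".csv")
write.table(x = example_data, file = path)

# Read it back. The header has one field fewer than the data rows,
# so read.table() uses the first column as row names.
restored <- read.table(file = path, header = TRUE)

# Inspect the restored data
str(restored)
```

If a true comma-separated file is preferred, write.csv() and read.csv() are the usual base R alternatives.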
