In this post I will present how to use the STATA function `regress` to run OLS on the following model:

$y = \beta_0 + \beta_1 x + \epsilon$

In this example, our dependent variable $y$ will be my weekly average weight, while the explanatory variable $x$ represents the sum of calories that I burned during the previous week. For a more detailed description of the data see here.

```stata
* start with empty workspace
clear all

* import sample data
import excel using "https://economictheoryblog.files.wordpress.com/2016/08/data.xlsx", first

* estimate linear regression model
regress weight lag_calories
```

The function `regress` returns point estimates for $\beta_0$ and $\beta_1$. Moreover, the function returns the standard errors of the point estimates, along with their t-statistics, p-values and 95% confidence intervals. The function also allows one to conduct multiple regression. That is, additional explanatory variables can simply be added to the function as additional arguments.
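For intuition, the quantities `regress` reports in the simple one-regressor case can be reproduced from the textbook OLS formulas. Below is a minimal sketch (Python used for illustration; the data points are made up and are not the weight data from this post):

```python
import math

# made-up illustrative data (not the blog's weight data)
x = [2000.0, 2500.0, 2200.0, 2800.0, 2400.0, 2600.0]  # lagged calories burned
y = [82.0, 80.5, 81.4, 79.8, 80.9, 80.2]              # weekly average weight

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# OLS point estimates: slope = cov(x, y) / var(x), intercept from the means
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = my - b1 * mx

# residual variance and the standard error of the slope estimate
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - 2)
se_b1 = math.sqrt(s2 / sxx)

# t-statistic for H0: beta1 = 0, also part of the regress output
t_b1 = b1 / se_b1

print(b1, se_b1, t_b1)
```

With these made-up numbers the slope comes out negative, matching the intuition that burning more calories in the previous week is associated with lower weight.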

Generally, one can compute the Lorenz Curve according to the following steps:

1. Prepare your income or wealth data (in this example we use six fictional monthly income figures):

2,200, 1,200, 4,100, 1,500, 3,500, 1,000

2. Arrange your income or wealth data in ascending order:

1,000, 1,200, 1,500, 2,200, 3,500, 4,100

3. Compute the cumulative sum of the ordered data:

1,000, 2,200, 3,700, 5,900, 9,400, 13,500

4. Divide the cumulative sum by the total income or wealth:

1,000, 2,200, 3,700, 5,900, 9,400, 13,500 (each divided by 13,500)

5. Finally, you obtain the single values for the Lorenz Curve:

0.074, 0.163, 0.274, 0.437, 0.696, 1

One can interpret the values of the Lorenz Curve the following way: the poorest 33% of the population earn 16.3% of total income, while the richest 33% earn 56.3% of total income (1 - 0.437 = 0.563).
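The five steps above can be sketched in a few lines of code (Python used here for illustration):

```python
# the six fictional monthly incomes from the example (step 1)
incomes = [2200, 1200, 4100, 1500, 3500, 1000]

# step 2: sort in ascending order
ordered = sorted(incomes)

# step 3: cumulative sum of the ordered data
cum = []
total = 0
for v in ordered:
    total += v
    cum.append(total)

# step 4/5: divide by total income (13,500 here) to obtain the Lorenz Curve values
lorenz = [round(c / cum[-1], 3) for c in cum]

print(lorenz)  # [0.074, 0.163, 0.274, 0.437, 0.696, 1.0]
```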

The following R code simulates the Lorenz Curves used in this post, for three economies with different degrees of inequality.

```r
# start with empty workspace
rm(list=ls())

# function to compute the Lorenz curve
lorenz_curve <- function(x){
  cumsum(x)/max(cumsum(x))
}

# simulate different Lorenz curves
population <- 100
wealth_equal <- rep(1, population)
wealth_unequal <- (1:population)^1.2
wealth_very_unequal <- (1:population)^5

lorenz_curve_equal <- lorenz_curve(wealth_equal)
lorenz_curve_unequal <- lorenz_curve(wealth_unequal)
lorenz_curve_very_unequal <- lorenz_curve(wealth_very_unequal)

# plot Lorenz curves
png("lorenz_curve.png", height=960, width=960)
par(mar=c(6.1,5.1,4.1,2.1))
plot(lorenz_curve_equal, type="l", main="Lorenz Curve",
     ylab="Cumulative share of income earned",
     xlab="Cumulative share of people from lowest to highest incomes",
     col="green", cex.main=3, cex.lab=2, cex.axis=2, lwd=3)
lines(lorenz_curve_unequal, col="orange", lwd=3)
lines(lorenz_curve_very_unequal, col="red", lwd=3)
text(50, 0.55, "Equal Wealth", srt=40, cex=2)
text(60, 0.40, "Unequal Wealth", srt=45, cex=2)
text(70, 0.20, "Very Unequal Wealth", srt=45, cex=2)
dev.off()

# second plot with annotated reading example
png("lorenz_curve2.png", height=960, width=960)
par(mar=c(6.1,5.1,4.1,2.1))
plot(lorenz_curve_equal, type="l", main="Lorenz Curve",
     ylab="Cumulative share of income earned",
     xlab="Cumulative share of people from lowest to highest incomes",
     col="green", cex.main=3, cex.lab=2, cex.axis=2, lwd=3)
lines(lorenz_curve_unequal, col="orange", lwd=3)
lines(lorenz_curve_very_unequal, col="red", lwd=3)
segments(60, 0, x1=60, y1=0.6, lty=2)
segments(0, 0.60, x1=60, y1=0.60, lty=2)
segments(0, 0.33, x1=60, y1=0.33, lty=2)
segments(0, 0.05, x1=60, y1=0.05, lty=2)
text(52, 0.65, "60%", cex=2)
text(52, 0.38, "33%", cex=2)
text(52, 0.1, "5%", cex=2)
dev.off()
```

The green, 45-degree line shows the Lorenz Curve of an economy with a perfectly equal wealth distribution. The orange line shows an economy with an unequal wealth distribution, and the red line depicts a very unequal wealth distribution. Generally, the further away the Lorenz Curve is from the diagonal, the more unequal the wealth distribution.

The Lorenz Curve represents the actual distribution of wealth in an economy and can be interpreted the following way: in an economy of perfect wealth equality (green line), the poorest 60% of the population own 60% of the national wealth. In a case of inequality (orange line), the poorest 60% of the population would own only 33% of the overall wealth. In the case of very strong inequality (red line), the poorest 60% of the population own only 5% of total wealth, while the richest 40% own 95% of the wealth. The more bowed out a Lorenz Curve, the higher the inequality of wealth in an economy.
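The "more bowed out" statement can be quantified with the Gini coefficient, which equals twice the area between the diagonal and the Lorenz Curve. A small sketch (Python for illustration; `lorenz_curve` is a pure-Python analogue of the R function above, and the wealth vectors mirror the simulation):

```python
def lorenz_curve(wealth):
    """Cumulative share of total wealth, with wealth sorted ascending."""
    wealth = sorted(wealth)
    total = float(sum(wealth))
    cum, s = [], 0.0
    for w in wealth:
        s += w
        cum.append(s / total)
    return cum

def gini(wealth):
    """Gini coefficient: twice the area between the diagonal and the Lorenz Curve."""
    curve = lorenz_curve(wealth)
    n = len(curve)
    prev, area = 0.0, 0.0
    for point in curve:
        area += (prev + point) / (2.0 * n)  # trapezoid rule, interval width 1/n
        prev = point
    return 1.0 - 2.0 * area

population = 100
print(gini([1] * population))                            # ~0: perfect equality
print(gini([i ** 5 for i in range(1, population + 1)]))  # far from 0: extreme inequality
```

The equal-wealth economy yields a Gini of (essentially) zero, and the Gini rises as the curve bows further away from the diagonal.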


I prepared a short tutorial to explain how to include clustered standard errors in stargazer. The following R code does the following. First, it loads the function that is necessary to compute clustered standard errors. Second, it downloads an example data set from this blog that is used for the OLS estimation. Third, it estimates a simple linear model using OLS. Finally, the script uses the summary.lm() function, the one that we loaded at the beginning, to calculate and recover STATA-like clustered standard errors and passes them on to the stargazer function. In the example, I print the stargazer output as text; however, one can replace the argument type with "tex" or "html" in order to obtain perfectly formatted tex or html tables.

```r
# start with an empty workspace
rm(list=ls())

# load necessary packages for importing the function
library(RCurl)

# load necessary packages for the example
library(gdata)
library(zoo)

# import the function
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)

# download data set for example
url_data <- "https://economictheoryblog.files.wordpress.com/2016/12/data.xls"
data <- read.xls(gsub("s:",":",url_data))

# estimate simple linear model
reg <- lm(id_score ~ class_size, data=data)

# use new summary function to obtain clustered standard errors
summary(reg, cluster = c("class_id"))

# create stargazer output with cluster robust standard errors
require("stargazer")

# save cluster robust standard errors
cluster_se <- as.vector(summary(reg, cluster = c("class_id"))$coefficients[,"Std. Error"])

# print stargazer output with cluster robust standard errors
stargazer(reg, type = "text", se = list(cluster_se))
# the last command prints the stargazer output (in this case as text)
# with cluster robust standard errors.
```

I prepared a short tutorial to explain how to include robust standard errors in stargazer. The following R code does the following. First, it loads the function that is necessary to compute robust standard errors. Second, it downloads an example data set from this blog that is used to perform the OLS estimation. Third, it estimates a simple linear model using OLS. Finally, the script uses the summary.lm() function, the one that we loaded at the beginning, to calculate and recover STATA-like robust standard errors and passes them on to the stargazer function. In the example I print the stargazer output as text; however, one can replace the argument type with "tex" or "html" in order to obtain perfectly formatted tex or html tables.

```r
# start with an empty workspace
rm(list=ls())

# load necessary packages for importing the function
library(RCurl)

# load necessary packages for the example
library(gdata)
library(zoo)

# import the robust standard error function
url_robust <- "https://raw.githubusercontent.com/IsidoreBeautrelet/economictheoryblog/master/robust_summary.R"
eval(parse(text = getURL(url_robust, ssl.verifypeer = FALSE)),
     envir=.GlobalEnv)

# download data set for example
url_data <- "https://economictheoryblog.files.wordpress.com/2016/08/data.xlsx"
data <- read.xls(gsub("s:",":",url_data))

# estimate simple linear model
reg <- lm(weight ~ lag_calories + lag_cycling + I(lag_calories*lag_cycling), data=data)

# use new summary function
summary(reg, robust = T)

# create stargazer output with robust standard errors
require("stargazer")

# save robust standard errors
robust_se <- as.vector(summary(reg, robust = T)$coefficients[,"Std. Error"])

# print stargazer output with robust standard errors
stargazer(reg, type = "text", se = list(robust_se))
# the last command prints the stargazer output (in this case as text)
# with robust standard errors.
```

The second post explains the omitted variable bias in general terms and introduces an example that we use throughout the series.

The third post describes the nature of the omitted variable bias by means of a Venn diagram. This post really tries to increase the general understanding of the bias and to provide a deeper intuition on its dynamics.

The fourth post continues to build an understanding of the bias. While the third post provides an understanding in more general terms, this post addresses the omitted variable bias in a formal way. Using our working example, the post details what exactly happens to our estimates when we neglect a variable.

The fifth post elaborates the consequences of the omitted variable bias. That is, the post shows that omitting a variable from the regression model violates the third OLS assumption and discusses what happens if this assumption is violated.

The sixth post summarizes what one can do in the light of an omitted variable bias. The post lists several ways in which one can address an omitted variable bias.

The last post concludes. It provides a short summary of the omitted variable bias and presents a list of questions regarding the omitted variable bias that one should answer before conducting a linear regression analysis.

First, if the required data are available, one can try to include as many relevant variables as possible in the regression model. Of course, this has other implications that one has to consider carefully. You need a sufficient number of data points to include additional explanatory variables, or else you will not be able to estimate your model. Moreover, depending on how many extra variables you include, the issue of including unnecessary variables may arise and start to seriously influence your estimates. Nevertheless, additional explanatory variables can help to mitigate the problems associated with the omitted variable bias. That is, additional control variables can lower the bias.

Second, if you think that a variable is important and leaving it out of your regression model could cause an omitted variable bias, but at the same time you do not have data for it, you can look for proxies or instrumental variables for the omitted variable. For instance, in the car price example that we discussed earlier, the omitted variable was the age of the car. Suppose you do not have data on the age of the car, but you know how long the last owner was in possession of the car; then the amount of time the car was owned by the last owner can be taken as a proxy for the age of the car. Note, however, that using proxies and instrumental variables comes with a whole set of additional assumptions and problems, most of which are quite complicated and not easily met.

Third, if you cannot resolve the omitted variable bias, you can at least try to predict in which direction your estimates are biased.
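With a single omitted variable, the direction of the bias follows the sign of the omitted variable's coefficient times its covariance with the included regressor. The following small simulation (Python for illustration; the coefficients and the 0.8 dependence between the regressors are made-up choices) shows the short regression's slope biased upward when both are positive:

```python
import random

random.seed(42)
n = 5000

# x2 is positively correlated with x1; both enter y with positive coefficients
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 1) for a in x1]                        # Cov(x1, x2) > 0
y = [1.0 * a + 1.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]   # beta1 = beta2 = 1

# OLS slope of the short regression y ~ x1 (x2 omitted)
mx = sum(x1) / n
my = sum(y) / n
sxy = sum((a - c) * (b - my) for a, b, c in zip(x1, y, [mx] * n))
sxx = sum((a - mx) ** 2 for a in x1)
b1_short = sxy / sxx

print(b1_short)  # noticeably above the true beta1 = 1: upward bias
```

Flipping the sign of either the omitted coefficient or the covariance would flip the direction of the bias.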

Our dependent variable $y$ will be my weekly average weight, the explanatory variable $x_1$ represents the sum of calories that I burned during the previous week, and the variable $x_2$ is a binary variable that takes a value of 1 in case I was cycling the week before and 0 otherwise. For a more detailed description of the data see here.

```julia
# load a couple of packages
using Distributions
using GLM
using DataFrames
using DataArrays

# load Taro - Pkg to read Excel data
using Taro
Taro.init()

# get data
path = "https://economictheoryblog.files.wordpress.com/2016/08/data.xlsx"
data = Taro.readxl(download(path), "data", "A1:C357")
data = deleterows!(data, find(isna(data[:,1]) | isna(data[:,2])))
data[:,1] = convert(DataArrays.DataArray{Float64,1}, data[:,1])
data[:,2] = convert(DataArrays.DataArray{Float64,1}, data[:,2])
data[:,3] = convert(DataArrays.DataArray{Float64,1}, data[:,3])

# estimate the linear regression model
glm(@formula(weight ~ lag_calories + lag_cycling), data, Normal(), IdentityLink())
```

Before attempting to fit a linear model to observed data, an analyst should first examine whether there exists a meaningful relationship between the two variables. In our case, one would consider whether to expect a meaningful relationship between calorie consumption and weight; presumably there should be some kind of relationship. Note, however, that linear regression is a correlation analysis and does not imply any causality. For instance, linear regression does not tell us that higher calorie consumption *causes* changes in weight, but rather that there exists some significant association between the two variables.

Before running a linear regression model, it can be useful to conduct a graphical evaluation of the data and to compute some numerical summary statistics. For instance, a scatter plot can be a useful tool in examining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables (that is, the plot does not show any increasing or decreasing trend), then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, a value between -1 and 1 indicating the strength of the association in the observed data.
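The correlation coefficient is easy to compute by hand. A minimal sketch (Python used for illustration, with made-up series):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, a value between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfect positive association
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0: perfect negative association
```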

Finally, the linear regression model is expressed as an equation of the form

$y = \beta_0 + \beta_1 x + \epsilon,$

where $x$ is the explanatory variable and $y$ is the dependent variable. The slope of the line is $\beta_1$, and $\beta_0$ is the intercept. Mathematically, a linear regression model fits a line to the data that minimizes the sum of squared deviations of the data from the line. This post provides a formal derivation of the classical regression model.
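The "minimizes the sum of squared deviations" property can be checked directly: the closed-form OLS solution should beat any perturbed line. A short sketch (Python for illustration, made-up data):

```python
def sse(x, y, b0, b1):
    """Sum of squared deviations of the data from the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# made-up data, roughly y = 2x with noise
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# closed-form OLS solution
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# any perturbed line has a larger sum of squared errors
best = sse(x, y, b0, b1)
assert best <= sse(x, y, b0 + 0.1, b1)
assert best <= sse(x, y, b0, b1 - 0.1)
print(b0, b1, best)
```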

Various statistical software packages support linear regression, including Julia, R and STATA. In case you are interested in conducting a linear regression, you can find some tutorials on this blog. This post explains how to conduct a linear regression in Julia, and this post describes how one can conduct a linear regression in R.

From this post, we know that omitting a relevant variable from the regression causes the error term and the explanatory variables to be correlated.

Suppose that the data generating process in the population is as follows:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon.$

However, we omit a variable, in this case $x_2$, and estimate the following regression model:

$y = \beta_0 + \beta_1 x_1 + u,$

where $u = \beta_2 x_2 + \epsilon$ and $\mathbb{E}(\epsilon|x_1, x_2) = 0$.

Note that, if $\mathrm{Cov}(x_1, x_2) \neq 0$, it implies that $\mathrm{Cov}(x_1, u) = \beta_2 \mathrm{Cov}(x_1, x_2) \neq 0$ whenever $\beta_2 \neq 0$. Hence, we violate assumption three. That is, the error term will be correlated with an explanatory variable.

Now, how can we mathematically prove that omitting $x_2$ indeed causes endogeneity? To prove this, let's start from the limit of the OLS estimator. Let $X$ denote the full matrix of explanatory variables, in our case $X = [1 \; x_1]$, and let $u$ be the error term containing $x_2$, that is $u = \beta_2 x_2 + \epsilon$. Additionally, let $\beta$ be the vector of parameters that we want to estimate, i.e. $\beta = (\beta_0, \beta_1)'$. The OLS estimator then satisfies

$\hat{\beta} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u,$

so that, taking expectations,

$\mathbb{E}(\hat{\beta}) = \beta + \mathbb{E}\left[(X'X)^{-1}X'u\right].$

Immediately, we see that $\mathbb{E}(\hat{\beta}) \neq \beta$. This is always the case whenever $\mathbb{E}(X'u) \neq 0$, i.e. whenever we have a correlation between $x_1$ and $x_2$.

In case we specify our model correctly, the second term in the third row would be $\mathbb{E}\left[(X'X)^{-1}X'\epsilon\right]$ and collapse to zero. That is, $\mathbb{E}\left[(X'X)^{-1}X'\epsilon\right] = 0$. This happens because $\mathbb{E}(X'\epsilon) = 0$, since $\mathbb{E}(\epsilon|X) = 0$, which holds by the original assumption that each of the explanatory variables is uncorrelated with the error term $\epsilon$.