Econometrics Lab 6
Janithe Siriwardana
Setup
Set working directory and activate necessary packages
setwd("~/Desktop/R WD")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(data.table)
Load data
load("datasets/hprice1.RData")
dt.hprice1 <- data.table(data)
rm(data)
colnames(dt.hprice1)
## [1] "price" "assess" "bdrms" "lotsize" "sqrft" "colonial"
## [7] "lprice" "lassess" "llotsize" "lsqrft"
Exercise 1
Estimate the model
$price = \beta_0 + \beta_1\,sqrft + \beta_2\,bdrms + u$,
where price is the house price measured in thousands of dollars
reg1 <- lm(price ~ sqrft + bdrms, data = dt.hprice1)
stargazer(reg1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## sqrft 0.128***
## (0.014)
##
## bdrms 15.198
## (9.484)
##
## Constant -19.315
## (31.047)
##
## -----------------------------------------------
## Observations 88
## R2 0.632
## Adjusted R2 0.623
## Residual Std. Error 63.045 (df = 85)
## F Statistic 72.964*** (df = 2; 85)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
i) Write out the results in equation form
$\widehat{price} = -19.315 + 0.128\,sqrft + 15.198\,bdrms$, $n = 88$, $R^2 = 0.632$
ii) What is the estimated increase in price for a house with one more bedroom, holding square footage constant?
$15,198
iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).
$33,118: the bedroom effect plus the effect of its 140 square feet, $15.198 + 140 \times 0.128 = 33.118$ (thousands of dollars). This is more than twice the value in (ii), because the extra bedroom now also adds floor space.
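This can be checked directly from the estimated coefficients (a quick hand check using reg1 from above):
coef(reg1)["bdrms"] + 140 * coef(reg1)["sqrft"] # bedroom effect plus 140 sqft; ≈ 33.118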
iv) What percentage of the variation in price is explained by square footage and number of bedrooms?
$R^2 = 0.632$, so square footage and number of bedrooms together explain 63.2% of the variation in price.
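As a cross-check, $R^2$ can also be computed by hand as $1 - SSR/TSS$ (a quick sketch using reg1):
1 - sum(residuals(reg1)^2) / sum((dt.hprice1$price - mean(dt.hprice1$price))^2) # ≈ 0.632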
v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price for this house from the OLS regression line.
first.house <- data.table( sqrft = dt.hprice1[1, sqrft]
, bdrms = dt.hprice1[1, bdrms])
my.pred <- predict(reg1, newdata = first.house)
my.pred
## 1
## 354.6052
The predicted selling price of the first house is $354,605.
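Equivalently, the prediction is just the fitted line evaluated at the first house's values (a hand check; coefficient order is intercept, sqrft, bdrms):
sum(coef(reg1) * c(1, 2438, 4)) # ≈ 354.6052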
vi) The actual selling price of the first house in the sample was $300,000 (so price=300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?
first.house$u_hat <- dt.hprice1[1, price] - my.pred
first.house$u_hat2 <- (first.house$u_hat)^2
head(first.house)
## sqrft bdrms u_hat u_hat2
## 1: 2438 4 -54.60525 2981.733
The residual is $\hat{u} = 300 - 354.605 = -54.605$, i.e. -$54,605. The house sold for $54,605 less than the regression predicts, which suggests the buyer underpaid.
vii) Now add the variable colonial to your model. Interpret its coefficient. Is it significant?
reg2 <- lm(price ~ sqrft + bdrms + colonial, data = dt.hprice1)
stargazer(reg2, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## price
## -----------------------------------------------
## sqrft 0.130***
## (0.014)
##
## bdrms 12.487
## (10.024)
##
## colonial 13.078
## (15.436)
##
## Constant -21.552
## (31.210)
##
## -----------------------------------------------
## Observations 88
## R2 0.635
## Adjusted R2 0.622
## Residual Std. Error 63.150 (df = 84)
## F Statistic 48.720*** (df = 3; 84)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
summary(reg2)
##
## Call:
## lm(formula = price ~ sqrft + bdrms + colonial, data = dt.hprice1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129.793 -40.167 -5.588 30.447 227.011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.55241 31.21022 -0.691 0.492
## sqrft 0.12985 0.01395 9.310 1.41e-14 ***
## bdrms 12.48749 10.02366 1.246 0.216
## colonial 13.07755 15.43591 0.847 0.399
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.15 on 84 degrees of freedom
## Multiple R-squared: 0.635, Adjusted R-squared: 0.622
## F-statistic: 48.72 on 3 and 84 DF, p-value: < 2.2e-16
The coefficient on colonial is 13.078, which means that a colonial-style house, all else constant, is worth $13,078 more than an otherwise identical non-colonial house. The p-value = 0.399 > 0.05, so the coefficient is not statistically significant.
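A 95% confidence interval for the colonial coefficient makes the same point; since the coefficient is insignificant, the interval straddles zero (a quick sketch, bounds approximate):
confint(reg2, "colonial") # roughly (-17.6, 43.8): contains zero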
Exercise 2
load("datasets/ceosal2.RData")
dt.ceo.sal <- data.table(data)
rm(data)
i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
lm.ceo.sal <- lm(salary ~ sales + mktval, data = dt.ceo.sal)
log.lm.ceo.sal <- lm(lsalary ~ lsales + lmktval, data = dt.ceo.sal)
stargazer(lm.ceo.sal, log.lm.ceo.sal, type = "text")
##
## ===========================================================
## Dependent variable:
## ----------------------------
## salary lsalary
## (1) (2)
## -----------------------------------------------------------
## sales 0.016
## (0.010)
##
## mktval 0.025***
## (0.010)
##
## lsales 0.162***
## (0.040)
##
## lmktval 0.107**
## (0.050)
##
## Constant 716.576*** 4.621***
## (47.188) (0.254)
##
## -----------------------------------------------------------
## Observations 177 177
## R2 0.178 0.299
## Adjusted R2 0.168 0.291
## Residual Std. Error (df = 174) 535.894 0.510
## F Statistic (df = 2; 174) 18.797*** 37.129***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In equation form, the constant elasticity (log-log) model is
$\widehat{lsalary} = 4.621 + 0.162\,lsales + 0.107\,lmktval$, $n = 177$, $R^2 = 0.299$.
ii) Add profits to the model from part (i). Why can't this variable be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
log.lm.ceo.sal2 <- lm(lsalary ~ lsales + lmktval + profits, data = dt.ceo.sal)
stargazer(log.lm.ceo.sal2, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## lsalary
## -----------------------------------------------
## lsales 0.161***
## (0.040)
##
## lmktval 0.098
## (0.064)
##
## profits 0.00004
## (0.0002)
##
## Constant 4.687***
## (0.380)
##
## -----------------------------------------------
## Observations 177
## R2 0.299
## Adjusted R2 0.287
## Residual Std. Error 0.512 (df = 173)
## F Statistic 24.636*** (df = 3; 173)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
profits cannot be included in logarithmic form because some firms have negative (or zero) profits, and the logarithm is undefined for non-positive values. Since $R^2$ is only 0.299, the model explains just under 30% of the variation in CEO salaries, so these firm performance variables do not explain most of it.
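A quick check of how many firms rule out the log transformation (a sketch; the log is undefined for non-positive values):
dt.ceo.sal[profits <= 0, .N] # number of firms with non-positive profits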
iii) Interpret the coefficient of lmktval. Does it have a significant effect? Explain
summary(log.lm.ceo.sal2)
##
## Call:
## lm(formula = lsalary ~ lsales + lmktval + profits, data = dt.ceo.sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27002 -0.31026 -0.01027 0.31043 1.91489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.687e+00 3.797e-01 12.343 < 2e-16 ***
## lsales 1.614e-01 3.991e-02 4.043 7.92e-05 ***
## lmktval 9.753e-02 6.369e-02 1.531 0.128
## profits 3.566e-05 1.520e-04 0.235 0.815
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5117 on 173 degrees of freedom
## Multiple R-squared: 0.2993, Adjusted R-squared: 0.2872
## F-statistic: 24.64 on 3 and 173 DF, p-value: 2.53e-13
The coefficient on lmktval in this model is 0.098, meaning that a 1% increase in the firm's market value is associated with a 0.098% increase in CEO salary, holding sales and profits fixed. The p-value = 0.128 > 0.05, so it is not significant.
iv) Interpret the coefficient of profits. Does it have a significant effect? Explain.
(exp(coef(log.lm.ceo.sal2)["profits"])-1)*100
## profits
## 0.003566069
The coefficient on profits is 0.00004, which means that a $1 million increase in profits is associated with an increase in CEO salary of about 0.004% (the exact calculation above gives 0.0036%). The p-value = 0.815, so profits is not significant.
v) Add the variable ceoten and comten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
log.lm.ceo.sal3 <- lm(lsalary ~ lsales + lmktval + profits + ceoten + comten, data = dt.ceo.sal)
stargazer(log.lm.ceo.sal3, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## lsalary
## -----------------------------------------------
## lsales 0.191***
## (0.040)
##
## lmktval 0.077
## (0.062)
##
## profits 0.0001
## (0.0001)
##
## ceoten 0.017***
## (0.006)
##
## comten -0.010***
## (0.003)
##
## Constant 4.697***
## (0.376)
##
## -----------------------------------------------
## Observations 177
## R2 0.349
## Adjusted R2 0.330
## Residual Std. Error 0.496 (df = 171)
## F Statistic 18.342*** (df = 5; 171)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
(exp(coef(log.lm.ceo.sal3)["ceoten"])-1)*100
## ceoten
## 1.70879
For every additional year of CEO tenure, the estimated percentage return is about 1.71%, holding other factors fixed.
vi) Interpret the coefficients on ceoten and comten. Are these explanatory variables statistically significant?
(exp(coef(log.lm.ceo.sal3)["comten"])-1)*100
## comten
## -0.9493218
For every additional year of company tenure the estimated percentage return is -0.949%
summary(log.lm.ceo.sal3)
##
## Call:
## lm(formula = lsalary ~ lsales + lmktval + profits + ceoten +
## comten, data = dt.ceo.sal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5307 -0.2737 -0.0251 0.2910 1.9999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.697e+00 3.759e-01 12.497 < 2e-16 ***
## lsales 1.909e-01 3.998e-02 4.774 3.86e-06 ***
## lmktval 7.726e-02 6.237e-02 1.239 0.21713
## profits 6.456e-05 1.479e-04 0.437 0.66298
## ceoten 1.694e-02 5.552e-03 3.052 0.00264 **
## comten -9.539e-03 3.354e-03 -2.844 0.00500 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4961 on 171 degrees of freedom
## Multiple R-squared: 0.3491, Adjusted R-squared: 0.3301
## F-statistic: 18.34 on 5 and 171 DF, p-value: 1.473e-14
The p-values for ceoten (0.0026) and comten (0.0050) are both below 0.01, so both explanatory variables are statistically significant.
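Their joint significance could also be checked with an F test comparing the nested models from parts (ii) and (v) (a sketch):
anova(log.lm.ceo.sal2, log.lm.ceo.sal3) # F test of H0: coefficients on ceoten and comten are both zero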
vii) What do you make of the fact that longer tenure with the company, holding the other factors fixed, is associated with a lower salary?
While it might be initially surprising, this could reflect the ‘superstar’ effect whereby firms pay more for highly regarded external candidates.
viii) Predict the salaries for all the CEOs in the sample
dt.ceo.sal$y_hatlog <- predict(log.lm.ceo.sal3, newdata=dt.ceo.sal)
dt.ceo.sal$y_hat <- exp(dt.ceo.sal$y_hatlog)
head(dt.ceo.sal)
## salary age college grad comten ceoten sales profits mktval lsalary lsales
## 1: 1161 49 1 1 9 2 6200 966 23200 7.057037 8.732305
## 2: 600 43 1 1 10 10 283 48 1100 6.396930 5.645447
## 3: 379 51 1 1 9 3 169 40 1100 5.937536 5.129899
## 4: 651 55 1 0 22 22 1100 -54 1000 6.478509 7.003066
## 5: 497 44 1 1 8 6 351 28 387 6.208590 5.860786
## 6: 1067 64 1 1 7 7 19000 614 3900 6.972606 9.852194
## lmktval comtensq ceotensq profmarg y_hatlog y_hat
## 1: 10.051908 81 4 15.580646 7.151210 1275.6483
## 2: 7.003066 100 100 16.961130 6.393152 597.7380
## 3: 7.003066 81 9 23.668638 6.185158 485.4899
## 4: 6.907755 484 484 -4.909091 6.727215 834.8188
## 5: 5.958425 64 36 7.977208 6.303558 546.5127
## 6: 8.268732 49 49 3.231579 7.308279 1492.6057
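Note that exponentiating fitted values from a log model systematically under-predicts the level of salary. A standard correction is Duan's smearing estimator, which rescales by the average of the exponentiated residuals; a minimal sketch (y_hat_adj is a new column added here for illustration):
alpha.hat <- mean(exp(residuals(log.lm.ceo.sal3))) # smearing factor, typically > 1
dt.ceo.sal$y_hat_adj <- alpha.hat * exp(dt.ceo.sal$y_hatlog)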
ix) Use the predicted values to compute R2
SSR <- sum((dt.ceo.sal$y_hat - dt.ceo.sal$salary)^2)
TSS <- sum((dt.ceo.sal[, mean(salary)] - dt.ceo.sal$salary)^2)
Rsquared <- 1 - (SSR/TSS)
Rsquared
## [1] 0.223594
$R^2$ = 0.224. This is computed in levels of salary, so it is not directly comparable to the $R^2$ = 0.349 of the log model, whose dependent variable is lsalary.
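An alternative measure is the squared sample correlation between actual and predicted salary (a quick sketch):
cor(dt.ceo.sal$salary, dt.ceo.sal$y_hat)^2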
x) Calculate the residuals
dt.ceo.sal$u_hat <- dt.ceo.sal$salary - dt.ceo.sal$y_hat # residual = actual - predicted
dt.ceo.sal$u_hat2 <- (dt.ceo.sal$u_hat)^2
xi) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?
cor(dt.ceo.sal$lmktval, dt.ceo.sal$profits)
## [1] 0.7768976
With a sample correlation of 0.777, the two variables are quite highly correlated. High but imperfect correlation among regressors is typically not a problem for the predictive power of the model. Only if the linear dependence were perfect (correlation of exactly 1 or -1) would we have perfect multicollinearity, in which case rank(X) < k, X'X is not invertible, and the OLS estimator cannot be computed. Short of that, OLS remains unbiased, but multicollinearity inflates the standard errors of the individual coefficients, making them imprecisely estimated.
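The severity of the multicollinearity can be quantified with a variance inflation factor, computed here by hand from an auxiliary regression (a sketch; a common rule of thumb flags VIFs above about 10):
aux.r2 <- summary(lm(lmktval ~ lsales + profits, data = dt.ceo.sal))$r.squared
1 / (1 - aux.r2) # VIF for lmktval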
Exercise 3
load("datasets/attend.RData")
dt.attend <- data.table(data)
rm(data)
colnames(dt.attend)
## [1] "attend" "termGPA" "priGPA" "ACT" "final" "atndrte" "hwrte"
## [8] "frosh" "soph" "missed" "stndfnl"
i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.
dt.attend[, max(atndrte)]
## [1] 100
dt.attend[, min(atndrte)]
## [1] 6.25
dt.attend[, mean(atndrte)]
## [1] 81.70956
dt.attend[, max(priGPA)]
## [1] 3.93
dt.attend[, min(priGPA)]
## [1] 0.857
dt.attend[, mean(priGPA)]
## [1] 2.586775
dt.attend[, max(ACT)]
## [1] 32
dt.attend[, min(ACT)]
## [1] 13
dt.attend[, mean(ACT)]
## [1] 22.51029
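The same statistics are also available in a single call (a convenience sketch; output omitted):
summary(dt.attend[, .(atndrte, priGPA, ACT)])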
ii) How many students are in their freshman year?
dt.attend[, sum(frosh)]
## [1] 158
iii) How many of the students in their freshman year have missed more than 10 classes?
dt.attend[missed>10, sum(frosh)]
## [1] 29
iv) Is it true that, on average, students who miss less than 10 classes have a higher GPA? State the null and alternative hypothesis and conduct the appropriate test.
dt.attend[, t.test(termGPA ~ (missed>=10), alternative = c("greater"))]
##
## Welch Two Sample t-test
##
## data: termGPA by missed >= 10
## t = 12.237, df = 150.04, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.8018051 Inf
## sample estimates:
## mean in group FALSE mean in group TRUE
## 2.761899 1.834686
The hypotheses are $H_0$: the mean term GPA of students who missed fewer than 10 classes equals that of students who missed 10 or more, against $H_1$: it is greater. The p-value < 0.05, so we reject the null hypothesis: students who missed fewer than 10 classes have, on average, a higher term GPA than those who missed 10 or more. Consistent with this, zero is not in the 95% confidence interval.
v) Estimate the model
$atndrte = \beta_0 + \beta_1\,priGPA + \beta_2\,ACT + u$,
and write the results in equation form. Interpret the intercept. Does it have a useful meaning?
lm.attend <- lm(atndrte ~ priGPA + ACT, data=dt.attend)
stargazer(lm.attend, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## atndrte
## -----------------------------------------------
## priGPA 17.261***
## (1.083)
##
## ACT -1.717***
## (0.169)
##
## Constant 75.700***
## (3.884)
##
## -----------------------------------------------
## Observations 680
## R2 0.291
## Adjusted R2 0.288
## Residual Std. Error 14.379 (df = 677)
## F Statistic 138.651*** (df = 2; 677)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
In equation form: $\widehat{atndrte} = 75.700 + 17.261\,priGPA - 1.717\,ACT$, $n = 680$, $R^2 = 0.291$. The intercept says that a student with a prior cumulative GPA of zero and an ACT score of zero would be predicted to attend 75.7% of classes. Since no student has values anywhere near zero on either variable, the intercept has no useful meaning on its own.
vi) Discuss the estimated slope coefficients. Are there any surprises?
The positive slope coefficient on prior-term GPA is not surprising, but the negative ACT coefficient may be surprising at first; it could be explained by extremely capable students being more willing to skip class.
vii) What is the predicted atndrte if priGPA=3.65 and ACT=20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?
new.student <- data.table( priGPA = 3.65
, ACT = 20)
pred.new.student <- predict(lm.attend, newdata = new.student)
pred.new.student
## 1
## 104.3705
This is a surprising result: the predicted attendance rate exceeds 100%, which is impossible. Nothing constrains OLS fitted values to lie within the range of the dependent variable.
dt.attend[priGPA == 3.65 & ACT == 20, .N]
## [1] 1
There is one student in the sample with these characteristics.
viii) If Student A has priGPA=3.1 and ACT=21 and Student B has priGPA=2.1 and ACT=26, what is the predicted difference in their attendance rates?
studentA <- data.table( priGPA = 3.1
, ACT = 21)
pred.studentA <- predict(lm.attend, newdata = studentA)
round(pred.studentA, 2)
## 1
## 93.16
studentB <- data.table( priGPA = 2.1
, ACT = 26)
pred.studentB <- predict(lm.attend, newdata = studentB)
round(pred.studentB, 2)
## 1
## 67.32
round(pred.studentA - pred.studentB, 2)
## 1
## 25.84
The predicted difference in attendance rates is about 25.84 percentage points.
ix) Should we include attend along with alcohol as explanatory variables in a multiple regression model? (Think about how you would interpret alcohol.)
The coefficient on alcohol would be interpreted as the change in college GPA per unit of alcohol consumption, holding attendance fixed. I would include attend, since intuitively it should have explanatory power for GPA, but attendance and alcohol consumption are likely correlated, which complicates the interpretation of the coefficient on alcohol.
x) Should SAT and hsGPA be included as explanatory variables? Explain.
Again, SAT and hsGPA are likely to be correlated with each other, so it might be that only one of them is needed in the model.
Exercise 4
i) Load the dataset MEAP93.RData and obtain the summary statistics.
load("datasets/meap93.RData")
dt.math <- data.table(data)
rm(data)
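The summary statistics requested in part (i) come from a single call (output omitted here):
summary(dt.math)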
ii) We want to explore the relationship between the math pass rate (math10) and spending per student (expend). Do you think each additional dollar spent has the same effect on the pass rate, or does a diminishing effect seem more appropriate? Explain.
There is likely to be a diminishing effect: eventually the marginal gain in the pass rate from each extra dollar of spending gets smaller and smaller (and may even reverse).
iii) In the population model
$math10 = \beta_0 + \beta_1 \log(expend) + u$,
argue that $\beta_1/10$ is the percentage point change in math10 given a 10% increase in expend.
In a lin-log model, a 1% increase in x is associated with a change in y of $\beta_1/100$ units. Here y is math10, which is measured in percentage points, so a 10% increase in expend changes math10 by $10 \times (\beta_1/100) = \beta_1/10$ percentage points.
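Formally, using the standard lin-log approximation:
$$\Delta math10 = \beta_1\,\Delta\log(expend) \approx \frac{\beta_1}{100}\,(\%\Delta expend), \qquad \%\Delta expend = 10 \;\Rightarrow\; \Delta math10 = \frac{\beta_1}{10}$$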
iv) Use the data in MEAP93.RAW to estimate the model from part (ii). Report the estimated equation in the usual way, including the sample size and R-squared.
lm.math <- lm(math10 ~ log(expend), data=dt.math)
stargazer(lm.math, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## math10
## -----------------------------------------------
## log(expend) 11.164***
## (3.169)
##
## Constant -69.341***
## (26.530)
##
## -----------------------------------------------
## Observations 408
## R2 0.030
## Adjusted R2 0.027
## Residual Std. Error 10.350 (df = 406)
## F Statistic 12.411*** (df = 1; 406)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
$\widehat{math10} = -69.341 + 11.164\,\log(expend)$, $n = 408$, $R^2 = 0.030$
v) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the estimated percentage point increase in math10?
If spending increases by 10%, math10 is estimated to increase by $11.164/10 \approx 1.12$ percentage points.
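Computed directly from the fitted model (a quick check):
coef(lm.math)["log(expend)"] / 10 # ≈ 1.12 percentage points per 10% increase in spending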
vi) One might worry that regression analysis can produce fitted values for math10 that are greater than 100. Why is this not much of a worry in this data set?
dt.math[,max(math10)]
## [1] 66.7
Since the maximum value of math10 in the data set is 66.7, spending would have to increase by a very large amount to push a fitted value above 100.