Quantitative Economics Notes for Oxford PPE Finals

Fri Apr 16 2021

tags: notes ppe oxford quantitative economics economics finals


These are notes I (Lieu Zheng Hong) wrote for myself while preparing for my Oxford PPE Finals. Some of my juniors asked for my notes and I am happy to oblige.

These notes are free to all but I ask that you do not reproduce them without first obtaining my express permission.

There are lots of mistakes, omissions, and inadequacies in these notes. I'd love your input to help make these notes better, by emailing me or by sending in a pull request at the GitHub repo here.

Compilation of QE past year questions here: QE past-year questions

I also have some worked attempts/answers here (which may be wrong!): 2016 attempt PDF, Hypothesis Testing Answers, OLS Answers, Tutorial 2 Answers, Tutorial 5 Answers.

Table of contents

Things to take note

Always give an economic-intuition explanation, especially for regression interpretation questions. Put on your PolSoc hat.

When they ask for interpretation of the coefficient or whatever: don't just talk about the straightforward interpretation, see if you can talk more about it. Is it (plausibly) the LATE? TOT? ATE of compliers?

When they ask for internal validity: check random assignment (exogeneity) and relevance.

When they ask for external validity: check how close the group under study is to the population you want to generalise the results to.


What is the sample average? Why is it a random variable?

The sample average is \bar{Y}. It is a random variable because it is a function of the randomly drawn observations Y_i, which are themselves random variables.

What is the mean, variance and standard error of a Bernoulli random variable?

Let p^\hat{p} be the sample mean (equivalently written as Xˉ\bar{X}).

Ep^=pE \hat{p} = p

var(Xi)=p(1p)var(X_i) = p(1-p)

var(p^)=p(1p)/nvar(\hat{p}) = p(1-p)/n

se(\hat{p}) = \hat{sd}(X)/\sqrt{n} = \left(\frac{\hat{p}(1-\hat{p})}{n}\right)^{1/2}
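A quick Monte Carlo check (my own sketch, not from the lectures) that the Bernoulli standard-error formula \sqrt{p(1-p)/n} matches the actual spread of the sample proportion across repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 1_000, 5_000

# Each entry is "number of successes in a sample of size n"; divide to get p-hat.
p_hats = rng.binomial(n, p, size=reps) / n

sd_across_samples = p_hats.std()            # empirical sd of p-hat
theoretical_se = np.sqrt(p * (1 - p) / n)   # sqrt(p(1-p)/n), about 0.0145
```

The empirical standard deviation of \hat{p} across the 5,000 samples should line up with the formula.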

What is the sampling distribution?

The distribution of Yˉ\bar{Y}.

What is the mean and variance of the sampling distribution? Derive them.

E(\bar{Y}) = E\left(\frac{1}{n}\sum^n_{i=1} Y_i\right) = \frac{1}{n}\sum^n_{i=1} E Y_i = E[Y] \ (\text{i.i.d.}) = \mu_Y

var(\bar{Y}) = var\left(\frac{1}{n}\sum^n_{i=1} Y_i\right) = \frac{1}{n^2} var\left(\sum^n_{i=1} Y_i\right) = \frac{1}{n^2}\sum^n_{i=1} var(Y_i) = \sigma^2_Y/n \quad (Y_i \perp Y_j)

What is the Law of Large Numbers (LLN)?

If Y_i are i.i.d. with E(Y_i) = \mu_Y and var(Y_i) = \sigma^2_Y < \infty then

YˉpμY.\bar{Y} \rightarrow^p \mu_Y.
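A small illustration (my own sketch, with made-up numbers) of the LLN: the sample mean of i.i.d. draws gets arbitrarily close to the population mean as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0
draws = rng.exponential(scale=mu, size=100_000)  # E[Y_i] = 2, finite variance

mean_100 = draws[:100].mean()   # noisy with a small sample
mean_all = draws.mean()         # very close to mu with n = 100,000
err_all = abs(mean_all - mu)
```

With n = 100,000 the error of the sample mean is tiny compared to the n = 100 case.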

What is the Central Limit Theorem (CLT)? What are its assumptions?

Assumptions: Ys must be i.i.d, 0<var(Yi)<0 < var(Y_i) < \infty.

As nn \rightarrow \infty, the distribution of

\frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \rightarrow^d N(0,1)

What does it mean when we say that Yˉ\bar{Y} is an estimator of μY\mu_Y?

  1. An estimator is a random variable that is a function of a sample of data drawn randomly from the population.

What does it mean for an estimator to be unbiased?

  1. An estimator \hat{a} is an unbiased estimator of a iff E(\hat{a}) = a.

What does it mean for an estimator to be consistent?

  1. \hat{a} is a consistent estimator of a if, as n gets large, for any \epsilon > 0 the probability that |\hat{a} - a| > \epsilon tends to zero.

What does it mean for an estimator to be efficient?

  1. An efficient estimator is the one with the lowest variance among some comparison class (e.g. among all unbiased estimators).

What does it mean when we say that Yˉ\bar{Y} is the BLUE of μY\mu_Y?

The Best Linear Unbiased Estimator (BLUE) is the estimator with the smallest variance among all estimators that are linear in Y_1, \dots, Y_n and unbiased.

What does it mean for an estimator to be a least squares estimator?

A least squares estimator m minimises the sum of squared differences between the observations of the sample and m.

Prove that Yˉ\bar{Y} is the least squares estimator of μY\mu_Y.

  1. Lecture 4 slide 18/20. Sketch: minimise \sum_i (Y_i - m)^2 over m; the first-order condition -2\sum_i (Y_i - m) = 0 gives m = \bar{Y}.

What is the t-statistic?

The t-statistic is any statistic of the form

t=β^βse(β^)t = \frac{\hat{\beta} - \beta}{se({\hat{\beta}})}

where se(β^)se({\hat{\beta}}) is the standard error of the estimated parameter. Note the difference between that and se^(β)\hat{se}(\beta).

The former is the standard error of the estimated parameter, which is something like the square root of the variance of the sample mean.

The latter is the square root of the sample variance.

We have the following (replace "sample mean" with "estimated parameter"):

  • Var(X)=σ2Var(X) = \sigma^2 is the population variance.
  • sd(X)=σsd(X) = \sigma is the population standard deviation.
  • Var^(X)=s2\hat{Var}(X) = s^2 is the sample variance.
  • sd^(X)=Var^(X)=s\hat{sd}(X) = \sqrt{\hat{Var}(X)} = s is the sample standard deviation.
  • Var(X^)Var(\hat{X}) is the variance of the sample mean.
  • sd(X^)sd(\hat{X}) is the standard deviation of the sample mean.
  • se(X^)se(\hat{X}) is the standard error of the sample mean. It estimates the standard deviation of the sample mean, which is unknown.

The relationship between them is the following:

The sample variance is an unbiased and consistent estimator of the population variance. That is,

E[s^2] = \sigma^2 \quad \text{and} \quad \hat{Var}(X) \equiv s^2 \rightarrow^p \sigma^2.

The variance of the sample mean is equal to the population variance divided by n.

Var(X^)=σ2n.Var(\hat{X}) = \frac{\sigma^2}{n}.

This allows us to write the following:

se(\hat{X}) = \sqrt{\frac{\hat{Var}(X)}{n}} = \hat{sd}(X)/\sqrt{n},

which is a consistent estimator of sd(\hat{X}) = \sqrt{Var(\hat{X})}.
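A simulation (my own sketch) of the relationships above: the standard deviation of the sample mean across many samples is \sigma/\sqrt{n}, and a single sample's standard error s/\sqrt{n} estimates that same quantity.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 3.0, 50, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)

true_sd_of_mean = sigma / np.sqrt(n)    # sigma/sqrt(n), about 0.424
empirical_sd_of_mean = means.std()      # sd of the 20,000 sample means

# From just ONE sample we estimate the same quantity with s/sqrt(n):
se_hat = samples[0].std(ddof=1) / np.sqrt(n)
```

All three numbers should be close, which is exactly the point of the notation table above.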

What is the p-value?

p-value or probability value is the probability of obtaining test results at least as extreme as the results actually observed (tactt^{act}) during the test, assuming that the null hypothesis is correct.

Entirely equivalently, the p-value is the lowest significance level under which the null hypothesis would be rejected.

What is the confidence interval?

An X% two-sided confidence interval for \mu_Y is a random interval that contains the true value of \mu_Y X% of the time. Given the sample average we observe in our randomly drawn sample, a 95% confidence interval is an interval [A, B] constructed so that 95% of intervals built this way contain the true population mean (the interval is random; the population mean is fixed). Note that this relies on the CLT --- sample means are approximately normally distributed, so we can make claims about where the population mean should lie.

  • 90% confidence interval is ±1.64 SE
  • 95% confidence interval is ±1.96 SE
  • 99% confidence interval is ±2.58 SE
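A coverage check (illustrative sketch, made-up numbers): an interval \bar{Y} ± 1.96 SE(\bar{Y}) should contain the true mean in roughly 95% of repeated samples.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 10.0, 2.0, 100, 10_000

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)   # SE = s/sqrt(n), per sample

covered = (means - 1.96 * ses <= mu) & (mu <= means + 1.96 * ses)
coverage = covered.mean()   # should be close to 0.95
```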

What is the sample covariance? What is its equation?

The sample covariance is the sample analogue of the population covariance. It is

\hat{Cov}(X, Y) = \frac{1}{n-1}\sum^n_{i=1}(X_i - \bar{X})(Y_i - \bar{Y})

What is the difference between sample variance and the variance of the sample mean?

The sample mean Yˉ\bar{Y} is a random variable (it is after all a function of random variables YiY_i) and, being a random variable, it has a mean E[Yˉ]E[\bar{Y}] and a variance σYˉ2\sigma^2_{\bar{Y}}. It can be shown that E[Yˉ]=μYE[\bar{Y}] = \mu_Y and Var(Yˉ)σYˉ2=σY2/nVar(\bar{Y}) \equiv \sigma^2_{\bar{Y}} = \sigma^2_Y/n.

But we don't know \sigma_Y, so how can we know \sigma_{\bar{Y}}? We need to estimate the variance of the sample mean. It turns out that we estimate \sigma^2_{\bar{Y}} with the sample variance s^2_Y divided by n.

σYˉSD(Yˉ)\sigma_{\bar{Y}} \equiv SD(\bar{Y})

This can be estimated using the sample variance. s^2_Y, the sample variance, is a random variable, and is a consistent estimator of the population variance:

s^2_Y = \frac{1}{n-1} \sum^n_i (Y_i - \bar{Y})^2 \rightarrow^p \sigma^2_Y

There is also the term "Standard error of Yˉ\bar{Y}", or SE(Yˉ)SE(\bar{Y}): this is equivalent to σ^Yˉ\hat{\sigma}_{\bar{Y}}. The notation is a bit confusing but I believe

σ^YˉSE(Yˉ)\hat{\sigma}_{\bar{Y}} \equiv SE(\bar{Y})

SE(Yˉ)σ^YˉσYˉ,SE(\bar{Y}) \equiv \hat{\sigma}_{\bar{Y}} \rightarrow \sigma_{\bar{Y}},

that is to say that the standard error of Yˉ\bar{Y} is an estimator of the standard deviation of Yˉ\bar{Y}. From the previous two equations we can write

SE(Yˉ)σ^Yˉ=sYnSE(\bar{Y}) \equiv \hat{\sigma}_{\bar{Y}} = \frac{s_Y}{\sqrt{n}}

To summarise: the standard error of \bar{Y}, SE(\bar{Y}), is the sample standard deviation s_Y divided by \sqrt{n}, and it is a consistent estimator of the standard deviation of the sample mean \sigma_{\bar{Y}}. There are two terms because one (\sigma_{\bar{Y}}) is the unknown population quantity and the other (SE(\bar{Y})) is its estimate from the data.

So when we are normalising

(YˉμY)σYˉ\frac{(\bar{Y} - \mu_Y)}{\sigma_{\bar{Y}}}

we can simply write

(YˉμY)SE(Yˉ)\frac{(\bar{Y} - \mu_Y)}{SE(\bar{Y})}

Similarly, when doing difference-in-means tests, we can write

(YaˉYbˉ)SE(YaˉYbˉ)\frac{(\bar{Y_a} -\bar{Y_b})}{SE(\bar{Y_a} - \bar{Y_b})}

where

SE(YaˉYbˉ)=sa2na+sb2nbSE(\bar{Y_a} - \bar{Y_b}) = \sqrt{\frac{s^2_a}{n_a} + \frac{s^2_b}{n_b}}

by the variance rules Var(AB)=Var(A)2Cov(A,B)+Var(B)Var(A-B) = Var(A) - 2Cov(A,B) + Var(B) and here Cov(A,B)=0Cov(A,B) = 0 by the fact that A and B are independent samples from different populations.

How do we do a difference-in-means test? What is the t-statistic in a difference-in-means test?

Define the null and alternative hypothesis.

H0:μw=μmH_0: \mu_w = \mu_m

H1:μwμmH_1: \mu_w \neq \mu_m

Under the null, what is the distribution of the test statistic?

The t-statistic for testing differences in means is

t=YmˉYwˉSE(YmˉYwˉ)t = \frac{\bar{Y_m} - \bar{Y_w}}{SE(\bar{Y_m} - \bar{Y_w})}

When nmn_m and nwn_w are large, then by the CLT, the t-statistic has a standard normal distribution when the null hypothesis is true.

Specify the significance level of the test, find critical values, and formulate the decision rule.

Suppose we wanted to test the hypothesis at a 5% significance level.

Under the null hypothesis, the distribution of the test statistic is approximately standard normal.

At the 5% significance level, the critical value cα=1.96c_\alpha = 1.96.

Decision rule: reject H_0 if |t^{act}| > 1.96.

Calculate the actual value of the test statistic, tactt^{act}.

The standard error can be calculated as

SE(YmˉYwˉ)=sm2/nm+sw2/nwSE(\bar{Y_m} - \bar{Y_w}) = \sqrt{s^2_m/n_m + s^2_w/n_w}

Substitute this value into the t-statistic and find tactt^{act}.

Follow our decision rule and come to a conclusion.

Given that |t^{act}| > 1.96, we reject the null that the means are equal at the 5% level.
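The steps above can be walked through numerically. This is my own sketch on simulated data (two independent groups with genuinely different means), not an exam dataset:

```python
import numpy as np

rng = np.random.default_rng(4)
n_m, n_w = 400, 500
y_m = rng.normal(11.0, 2.0, n_m)   # e.g. men's wages, true mean 11
y_w = rng.normal(10.0, 2.0, n_w)   # e.g. women's wages, true mean 10

# SE(Ybar_m - Ybar_w) = sqrt(s_m^2/n_m + s_w^2/n_w)
se_diff = np.sqrt(y_m.var(ddof=1) / n_m + y_w.var(ddof=1) / n_w)
t_act = (y_m.mean() - y_w.mean()) / se_diff

reject = abs(t_act) > 1.96   # decision rule at the 5% level
```

Because the true means differ by a full unit while SE is around 0.13, the test rejects comfortably.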

What is an F-test? When do we do an F-test?

  • Checking if regression coefficients are significantly different from zero

How do we do an F-test?

Show that the residual in the CEF decomposition is mean independent of XiX_i.

What is OVB with a regression with more than one variable (or with one variable and a set of controls)

Say X2X_2 is omitted but there is a set of controls ZZ.

Under what circumstances does the LATE equal the TOT? Why?

Only compliers and defiers affect the LATE, because always-takers and never-takers behave in a deterministic way. Without defiers, the LATE reduces to the average treatment effect for compliers, and without always-takers, those who take the treatment are exactly those who were offered it, so the LATE recovers the TOT.

Time-series questions

What is the population autocorrelation?

The jth population autocorrelation is the correlation between Y_t and its jth lag.

corr(Yt,Ytj)corr(Y_t, Y_{t-j})

What is the sample autocorrelation?

The sample autocorrelation is the sample version of the population autocorrelation: the jth sample autocovariance divided by the sample variance of Y. The jth sample autocovariance is

\hat{cov}(Y_t, Y_{t-j}) = \frac{1}{T-j-1} \sum^T_{t=j+1} (Y_t - \bar{Y}_{j+1, T})(Y_{t-j} - \bar{Y}_{1, T-j})

Why are the subscripts what they are? Let's use some numerical examples to clear things up. Consider taking the 2nd sample autocorrelation: that is, the sample correlation of YtY_t with Yt2Y_{t-2}.

The first sample mean, \bar{Y}_{j+1, T}, is computed from t=3 to t=T. And the second sample mean, \bar{Y}_{1, T-j}, is computed from t=1 to t=T-j. This makes sense because you can't start measuring the autocorrelation until you have enough lags (in this case, t=3).
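A direct implementation (my own sketch) of the jth sample autocorrelation with exactly this subscripting:

```python
import numpy as np

def sample_autocorr(y, j):
    T = len(y)
    y_late = y[j:]        # Y_t for t = j+1, ..., T
    y_early = y[:T - j]   # Y_{t-j} for the same t, i.e. periods 1, ..., T-j
    autocov = ((y_late - y_late.mean()) * (y_early - y_early.mean())).sum() / (T - j - 1)
    return autocov / y.var(ddof=1)   # divide by the sample variance of Y

rng = np.random.default_rng(5)
noise = rng.normal(size=2_000)   # white noise: autocorrelations near 0
walk = noise.cumsum()            # random walk: very persistent
```

White noise should give autocorrelations near zero, while a random walk gives a first autocorrelation near one.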

What does it mean for a time series to exhibit strict stationarity?

A time series Yt{Y_t} exhibits strict stationarity iff its distribution does not change over time.

What does it mean for a time series to exhibit weak stationarity?

A time series {Y_t} exhibits weak stationarity iff its first and second moments (mean, variance, and autocovariances) exist, are finite, and are constant over time (the autocovariances may depend on the lag j but not on t).

What is an AR(n) model?

Autoregressive model: a model that regresses Y_t on (up to) n lags of itself:

Y_t = \beta_0 + \beta_1 Y_{t-1} + \dots + \beta_n Y_{t-n} + u_t

What is the difference between an AR(n) model and an AR(n) process?

How do we solve the AR(1) process?

We can solve the AR(1) process by backward substitution. Note that Yt1Y_{t-1} can itself be written as an AR(1) process, so keep substituting in until we get LHS yty_t and RHS y0y_0.

What is the first moment, second moment and autocorrelation of the AR(1) process?

If |\beta_1| < 1 (no unit root), and var(u_t) = \sigma^2 (constant),

E(Y_t) = \frac{\beta_0}{1- \beta_1}

var(Y_t) = \frac{\sigma^2}{1-\beta_1^2}

and we can derive the ACF as

ρj=corr(Yt,Ytj)=β1j\rho_j = corr(Y_t, Y_{t-j}) = \beta^j_1

in a similar backwards substitution process.
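A simulation (illustrative sketch with made-up parameters) checking the AR(1) ACF result: with \beta_1 = 0.8 the jth sample autocorrelation should be close to 0.8^j.

```python
import numpy as np

rng = np.random.default_rng(6)
beta1, T = 0.8, 200_000
u = rng.normal(size=T)

y = np.empty(T)
y[0] = 0.0
for t in range(1, T):
    y[t] = beta1 * y[t - 1] + u[t]   # AR(1) with beta0 = 0

# Sample autocorrelations at lags 1, 2, 3 vs beta1**j
acf = [float(np.corrcoef(y[j:], y[:-j])[0, 1]) for j in (1, 2, 3)]
```

The three sample autocorrelations should sit near 0.8, 0.64, and 0.512.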

What are the sampling properties of OLS?

What allows us to estimate the coefficient \beta_1 consistently? In the regular OLS regression we require i.i.d. data. But a similar result holds for non-i.i.d. data provided the series is weakly stationary and "weakly dependent": that is, the jth autocorrelation \rho_j \rightarrow 0 as j tends to infinity.

This is precisely fulfilled when we don't have a unit root in the AR(1) model, because \rho_j = corr(Y_t, Y_{t-j}) = \beta^j_1, and anything with absolute value less than 1 raised to the power j tends to 0.

In the AR(1) model, the idea is that the influence of Y_{t-j} on Y_t is going to be very small because |\beta_1| < 1 and thus \beta_1^j tends to 0 as j gets large.

What is the difference between a predicted value and a forecast?

Predicted values are in-sample; forecasts are out-of-sample.

What's the difference between Y^tt1\hat{Y}_{t | t-1} and Ytt1Y_{t | t-1}?

Ytt1Y_{t | t-1} is the forecast given all the data from t=0t=0 to t1t-1 using the population (true unknown) coefficients.

\hat{Y}_{t | t-1} is the "sample" forecast using the coefficients \hat{\beta}_0, \hat{\beta}_1 that were estimated in the OLS regression.

What is RMSFE? Define it, give the equation.

The one-period ahead forecast error is

YtY^tt1,Y_t - \hat{Y}_{t | t-1},

that is to say, the difference between the actual out-of-sample value Y_t and the forecast \hat{Y}_{t | t-1}.


The RMSFE is

\sqrt{E[(Y_t - \hat{Y}_{t | t-1})^2]},

a measure of the magnitude of the typical forecasting "mistake".

If we look at the error here,

YtY^tt1,Y_t - \hat{Y}_{t | t-1},

it can be decomposed into the genuinely unforecastable error (random shocks), and the forecast error due to estimation error of our coefficients. That is to say,

Y_t - \hat{Y}_{t | t-1} = (Y_t - Y_{t|t-1}) + (Y_{t|t-1} - \hat{Y}_{t|t-1}) = u_t + (\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)Y_{t-1}

The bigger our sample, the lower the estimation error will become, but the genuinely unforecastable error will not decrease.

What is Granger causality?

XtX_t Granger-causes YtY_t if including lags of XtX_t helps to predict YtY_t over and above just lags of YtY_t.

E(YtYt1,Yt2...Xt1,Xt2)E(YtYt1,Yt2..)E(Y_t|Y_{t-1}, Y_{t-2} ... X_{t-1}, X_{t-2}) \neq E(Y_t | Y_{t-1}, Y_{t-2} ..)

The Granger causality statistic is the F-statistic testing the hypothesis that the coefficients on all the lags of one of the variables are zero. This implies that those regressors have no predictive content for Y_t beyond that contained in the other regressors.

Worked example: Does unemployment Granger-cause inflation?

If we have an ADL(1,2) model

\Delta Inf_t = \beta_0 + \beta_1 \Delta Inf_{t-1} + \delta_1 Unrate_{t-1} + \delta_2 Unrate_{t-2} + u_t

we test whether lags of Unrate are significant with the following F-test:

H0:δ1=δ2=0H_0: \delta_1 = \delta_2 = 0

H1:δ10orδ20H_1: \delta_1 \neq 0 \quad \textrm{or} \quad \delta_2 \neq 0

What does it mean for a time series to exhibit a deterministic trend?

A time series exhibits a deterministic trend if it has a trend that is a deterministic function of time:

Y_t = \alpha t + \beta_0 + \beta_1 Y_{t-1} + u_t

where α\alpha is some constant.

What does it mean for a time series to be trend stationary?

If it exhibits stationary deviations from a deterministic trend (i.e. once you remove the deterministic trend it becomes stationary)

What does it mean for a time series to exhibit a stochastic trend?

Basically just a random walk (or a random walk with drift):

Yt=Yt1+utY_t = Y_{t-1} + u_t


Yt=α1+Yt1+utY_t = \alpha_1 + Y_{t-1} + u_t

Solving the drift version backwards (taking Y_0 = 0) we obtain

Y_t = \alpha_1 t + \sum^t_{j=1} u_j

What is the equation of a random walk with drift?

Yt=α1+Yt1+utY_t = \alpha_1 + Y_{t-1} + u_t

What are the mean, variance and covariance of a random walk with drift?

Solving the random walk with drift backwards (taking Y_0 = 0), we have

Yt=α1t+j=1tujY_t = \alpha_1 t + \sum^t_{j=1} u_j

Assuming that uju_j is i.i.d with distribution (0,σ2)(0, \sigma^2),

E(Yt)=α1t,E(Y_t) = \alpha_1 t,

var(Yt)=var(j=1tuj)=tσ2,var(Y_t) = var(\sum^t_{j=1} u_j) = t\sigma^2,

cov(Yt,Ytj)=(tj)σ2cov(Y_t, Y_{t-j}) = (t-j)\sigma^2
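A Monte Carlo check (my own sketch) of these moments: across many simulated paths, the cross-sectional mean at date t should be near \alpha_1 t and the variance near t\sigma^2.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, sigma, T, reps = 0.5, 1.0, 200, 4_000

u = rng.normal(0.0, sigma, size=(reps, T))
paths = alpha * np.arange(1, T + 1) + u.cumsum(axis=1)   # Y_t = alpha*t + sum_j u_j, Y_0 = 0

t = 150
mean_t = paths[:, t - 1].mean()   # should be near alpha * t = 75
var_t = paths[:, t - 1].var()     # should be near t * sigma^2 = 150
```

The growing variance is the signature of a stochastic trend: uncertainty accumulates with t instead of settling down.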

How do we detrend a deterministic trend?

Regress YtY_t on a deterministic function of time and take the residuals.

Y_t = \alpha_0 + \alpha_1 t + u_t

How do we detrend a stochastic trend?

Take first differences.

What is order of integration?

We say that YtY_t is integrated of order xx if YtY_t must be differenced xx times to remove its stochastic trend.

What is a unit root? What issues arise when we have a unit root?

A unit root is a stochastic trend.

Yt=β0+β1Yt1+utY_t = \beta_0 + \beta_1 Y_{t-1} + u_t

where β1=1\beta_1 = 1.

There are two issues with a unit root:

Firstly, the distribution of the OLS estimator and the t-statistic is not normal even in large samples, so we cannot use normal critical values, and the OLS estimate of \beta_1 is biased towards zero:

E(\hat{\beta}_1^{OLS}) = 1 - \frac{5.3}{T}

Secondly, you get spurious regression: stochastic trends can make two unrelated time series appear related. Stochastically trending processes will tend to correlate with any other process that exhibits a trend. We will spuriously reject the null of no relationship as sample increases.

How do we test for a stochastic trend/unit root? Be explicit about the procedure

Subtract Y_{t-1} from both sides of an AR(1) model:

ΔYt=β0+δYt1+ut,\Delta Y_t = \beta_0 + \delta Y_{t-1} + u_t ,

where δ=β11\delta = \beta_1 - 1

Then test H_0: \delta = 0 against H_1: \delta < 0 using the t-statistic on \hat{\delta}.

Use the Dickey-Fuller critical values, not the normal critical values.
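A bare-bones Dickey-Fuller regression, sketched with numpy only (my own illustration): regress \Delta Y_t on Y_{t-1} with an intercept and look at the t-statistic on \delta. Remember the resulting statistic must be compared with DF critical values (around -2.86 at 5%), not normal ones.

```python
import numpy as np

def df_tstat(y):
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])   # [const, Y_{t-1}]
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    s2 = resid @ resid / (len(dy) - 2)                   # residual variance
    se_delta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1] / se_delta                            # t-stat on delta

rng = np.random.default_rng(8)
T = 500
walk = rng.normal(size=T).cumsum()   # unit root: t-stat typically well above -2.86

ar = np.empty(T)                     # stationary AR(1) with beta1 = 0.5
ar[0] = 0.0
u = rng.normal(size=T)
for t in range(1, T):
    ar[t] = 0.5 * ar[t - 1] + u[t]

t_walk, t_ar = df_tstat(walk), df_tstat(ar)
```

The stationary series produces a strongly negative t-statistic (reject the unit root); the random walk does not.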

What's the difference between the Dickey-Fuller, the DF with trend, and the ADF? When do we use what?

  • Dickey-Fuller: standard unit-root test
  • DF with trend: unit-root test for a series with a deterministic trend
  • ADF: augment the DF regression model with lags of \Delta Y_t on the RHS (to absorb serial correlation):

ΔYt=β0+αt+δYt1+j=1pδjΔYtj+ut\Delta Y_t = \beta_0 + \alpha t + \delta Y_{t-1} + \sum^p_{j=1} \delta_j \Delta Y_{t-j} + u_t

What's the problem with a break?

They cause in-sample estimates of coefficients to be biased and destroy the external validity of time series models

How do we test for a break when the break date is known?

Chow test: just a standard F test. Have a dummy variable 0 before the break, 1 after the break.


Yt=β0+β1Xt+γ0Dτ,t+γ1Dτ,tXt+utY_t = \beta_0 + \beta_1 X_t + \gamma_0 D_{\tau, t} + \gamma_1 D_{\tau, t} X_t + u_t

Test the null hypothesis that H0:γ0=γ1=0.H_0 : \gamma_0 = \gamma_1 = 0.

How do we test for a break when the break date is unknown?

QLR test: compute the Chow F-statistic at every candidate break date (typically in the central 70% of the sample) and take the maximum; compare it against the QLR critical values, which are larger than the standard F critical values.

How do we test for a break when there can be multiple breaks?

You can't

What's the Chow test?

What's the QLR test?

What is cointegration?

Two I(1) series are cointegrated when some linear combination of them is stationary:

Y_t - \theta X_t \sim I(0)

What is the cointegrating coefficient?

The \theta in Y_t - \theta X_t.

How do we test for cointegration when the cointegrating coefficient is known?

Compute z_t = Y_t - \theta X_t and run a Dickey-Fuller test on z_t. If we reject the unit root, z_t is stationary and the series are cointegrated.

How do we test for cointegration when the cointegrating coefficient is unknown?

Estimate \theta by an OLS regression of Y_t on X_t, then run a Dickey-Fuller test on the residuals, using the Engle-Granger (not the standard DF) critical values.

If we know that Y and X are cointegrated, then the equilibrium error z_{t-1} = Y_{t-1} - \theta X_{t-1} is stationary, so including it alongside first differences no longer produces a spurious regression.

The idea here is that if z_{t-1} is positive (Y is above its long-run relationship with X), we subtract a bit (\lambda) from \Delta Y_t to "correct" for this.

The parameter \lambda tells us how quickly Y_t adjusts to disturbances in equilibrium.

You have the short-run relationship (the first differences) and the long-run (cointegrating) relationship.

Where the coefficients are unknown, estimate them with the Engle-Granger two-step procedure.

What is h-step ahead RMSFE? Derive it for h = 1, 2, 3, 4.

Steps for hypothesis testing

  1. Define the null and alternative hypothesis.
  2. Under the null hypothesis, what is the distribution of the test statistic?
  3. Specify significance levels, calculate confidence intervals and critical values.
  4. Come up with a decision rule: "Reject H0H_0 if tact>cαt^{act} > c_\alpha"
  5. Calculate the actual value of the test statistic from the data.
  6. Reject the null hypothesis if the t-statistic is larger than the critical value.

How to actually run a hypothesis test

Notes on hypothesis testing

In order to use the CLT,

\frac{\bar{Y} - \mu_Y}{\sigma_{\bar{Y}}} \sim N(0,1)

you need to derive that E(Yˉ)=μYE(\bar{Y}) = \mu_Y and Var(Yˉ)=σY2nVar(\bar{Y}) = \frac{\sigma_Y^2}{n} first.

The sample average \bar{Y} is approximately distributed N(\mu_Y, \sigma^2_Y/n).

The t-statistic is a random variable. It is given by

t=YˉμYSE(Yˉ)t = \frac{\bar{Y} - \mu_Y}{SE(\bar{Y})}

The actual calculated test statistic, tactt^{act}, is just a number that you get when you plug all of that in.

Hypothesis testing on regression parameters

\omega_{\beta_1} = \frac{E[(X_i - EX_i)^2 u_i^2]}{[E(X_i - EX_i)^2]^2}

sd(\hat{\beta}_1) = \frac{1}{\sqrt{n}} \sqrt{\omega_{\beta_1}}

se(\hat{\beta}_1) = \frac{1}{\sqrt{n}} \sqrt{\hat{\omega}_{\beta_1}}


Testing the hypothesis that one sample mean is greater than another.

Remember that the variances (not the standard errors) add: SE(\bar{Y}_a - \bar{Y}_b) = \sqrt{s^2_a/n_a + s^2_b/n_b}.

Regression analysis and interpretation

Flowchart for regression testing

Check flowchart PNG

I added a new variable in my regression. Should I expect the standard error on my coefficient of interest to go up or down?

TLDR: It depends on the covariances. On the one hand, the new variable will explain some of the dependent variable, which pushes the standard error down; on the other hand, it also absorbs some of the variation in your regressor of interest, which pushes the standard error up.

Assuming homoskedasticity, the standard error for β1^\hat{\beta_1} can be written as

se(β^1)=(1nvar^(u)var^(X~1))12 se(\hat{\beta}_1) = (\frac{1}{n} \frac{\hat{var}(u)}{\hat{var}(\widetilde{X}_1)})^\frac{1}{2}

where X~1\widetilde{X}_1 is the residual from an OLS regression of X1X_1 on (X2,...,Xk)(X_2, ..., X_k).

Let's use an example to make things clearer. Suppose you had the regression

Wages=β0+β1Experience+uWages = \beta_0 + \beta_1 Experience + u

Now we want to add a new variable, gender, to the regression to get the "long" regression:

Wages=γ0+γ1Experience+γ2Gender+vWages = \gamma_0 + \gamma_1 Experience + \gamma_2 Gender + v

What will happen to the standard error of the coefficient? Well, gender should explain wages to some extent, so we expect var(u)var(u) to go down. On the other hand, Experience~1\widetilde{Experience}_1 is the residual from an OLS regression of Experience on Gender:

Experience=π0+π1Gender+Experience~1Experience = \pi_0 + \pi_1 Gender + \widetilde{Experience}_1

and given that some of experience is explained by gender (maybe men have more years in the workforce in general --- no break for childbearing), var(X~1)var(\widetilde{X}_1) will decrease.

So the overall effect is ambiguous. But in general, if

Cov(Gender,Wages)>Cov(Gender,Experience),Cov(Gender, Wages) > Cov(Gender, Experience),

Cov(Gender, Experience) is reflected in \pi_1 (which equals Cov(Gender, Experience)/Var(Gender));

Cov(Gender, Wages) is reflected in \gamma_2;

that is, gender explains wages more than it explains experience, then the standard error will go down when adding more regressors.

The intuition is that the former is the decrease in the residual u of the short regression: the higher Cov(Gender, Wages) is, the more of wages is explained, and the smaller var(u) becomes.

The latter is the decrease in the variance of experience after it has been explained by gender: the higher Cov(Gender, Experience) is, the more of experience is explained by gender, and the smaller the residual \widetilde{Experience} is going to be.

Should I include a new variable in my regression?

TLDR: Not if your new variable is possibly endogenous, because that will cause all the other coefficients to be estimated with error. You should only add new variables if they are uncorrelated with the error term!

Suppose you had the regression

Wages=β0+β2Gender+uWages = \beta_0 + \beta_2 Gender + u

and we had good reason to believe that gender was exogenous; that is, Cov(Gender, u) = 0.

Now suppose someone suggests that you add occupation into the regression to get a "cleaner estimate" of the wage effect. That is,

Wages=γ0+γ1Gender+γ2Occupation+vWages = \gamma_0 + \gamma_1 Gender + \gamma_2 Occupation + v

It's true that if occupation was exogenous this would indeed increase the precision of the estimate. This is because

se(\hat{\beta}_1) = \left(\frac{1}{n}\frac{\hat{var}(u)}{\hat{var}(\widetilde{x})}\right)^{1/2}

and given that occupation explains some part of wages, and is exogenous with gender (what we assumed), only var(u)var(u) will go down, var(x~)var(\widetilde{x}) will remain unchanged.

But this is only if Cov(Occupation,v)=0Cov(Occupation, v) = 0! If occupation is endogenous (for instance, if ability determines occupation and wages), then this would cause the OLS estimates of all the variables (including the coeff on gender which was previously the correct causal interpretation) to be wrongly estimated.

What is the formula for omitted variable bias (OVB) for a regression with more than one variable?

Set up the "short" and "long" regressions, and substitute the long regression into the short regression. The short-regression coefficient on X_1 converges to

\beta_1^{short} = \beta_1 + \beta_2 \frac{Cov(X_2, \widetilde{X}_1)}{Var(\widetilde{X}_1)}

where \widetilde{X}_1 is the residual from regressing X_1 on the controls.
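A simulation (my own sketch) verifying the OVB formula in the simple case with no extra controls (so \widetilde{X}_1 is just X_1):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)                  # omitted variable, correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # beta1 = 2, beta2 = 3

# "Short" regression of y on x1 only
X = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X, y, rcond=None)[0][1]

# OVB formula: short coefficient -> beta1 + beta2 * Cov(x2, x1)/Var(x1)
predicted = 2.0 + 3.0 * np.cov(x2, x1)[0, 1] / x1.var(ddof=1)
```

The short-regression slope lands near 2 + 3(0.7) = 4.1, exactly as the formula predicts.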

How do we test an instrument for exogeneity?

We compute the 2SLS residuals

\hat{u} = Y - \hat{\beta}_0 - \hat{\beta}_1 X

and regress them on the instruments (and the exogenous controls), then F-test the null that the coefficients on the instruments are all zero. This requires more instruments than endogenous regressors (overidentification). We can never prove that an instrument is exogenous; we can only fail to reject the null of exogeneity.

How do we test an instrument for exclusion?

You can't! This is a story about the causal model you have to tell.

What's the difference between exclusion and exogeneity?

Definition of exclusion: consider a (possibly endogenous) variable X and a proposed instrumental variable Z. In a causal model of Y on X and Z, the coefficient on Z should be zero: that is, Z has no effect on Y other than through X.

Angrist (1990) gives an example of an exogenous instrumental variable that nonetheless failed exclusion. Angrist wanted to find the effect of serving in the military on wages. So he would like to run the following regression:

Wage=β0+β1MilSvc+uWage = \beta_0 + \beta_1 MilSvc + u

But of course military service is endogenous with the error term here. So instead he used the fact that people were drawn in a lottery to be drafted for the Vietnam War. Is this IV exogenous? Yes: it was randomly assigned, so it cannot be correlated with the error term.

But does it satisfy exclusion? In fact, no. Because you couldn't be drafted if you were in school, people who were picked might have stayed in school longer, continuing further study, which would itself affect their wages. So in the following causal model

Wage=δ0+δ1MilSvc+δ2Lottery+vWage = \delta_0 + \delta_1 MilSvc + \delta_2 Lottery + v

δ20\delta_2 \neq 0 and thus exclusion is not satisfied.

How do we test an instrument for relevance?

Suppose we have the structural regression

Y=\beta_0+\beta_1 X+\beta_2 Z+u

where Z is a vector of control variables, and X is possibly endogenous. We wish to instrument X with D and so we run the following first-stage regression:

X=γ0+γ1D+γ2Z+vX = \gamma_0 + \gamma_1 D + \gamma_2 Z + v

We set up the following hypothesis test:

H0:γ1=0H_0: \gamma_1 = 0

H1:γ10H_1: \gamma_1 \neq 0

and we do a homoskedasticity-only F-test by comparing the sum of squared residuals in the restricted model (imposing \gamma_1 = 0) and in the unrestricted first-stage regression:

F = \frac{(SSR_{rs} - SSR_{un})/q}{SSR_{un}/(n-k-1)}

where q is the number of restrictions. As a rule of thumb, if the first-stage F-statistic is greater than 10, the instrument is considered relevant (not weak).


Heterogeneous treatment effects

We are interested in the causal effect of X on Y, and the magnitude of that effect for individual i is \beta_{1i}.

The key additional assumption to make in the case of heterogeneity is that

E[β1iXi]=Eβ1i,\mathrm{E}[\beta_{1i} | X_i] = \mathrm{E}\beta_{1i} ,

that is to say, that the average causal effect of the treatment does not vary systematically with the treatment. For instance, if XiX_i was a skills learning program and smarter people were more likely to be offered the treatment Xi=1X_i = 1, then

E[β1Xi=1]>E[β1Xi=0]\mathrm{E}[\beta_1 | X_i = 1] > \mathrm{E}[\beta_1 | X_i = 0]

This mean independence assumption is usually stated as a stronger independence assumption: both β0i\beta_{0i} and β1i\beta_{1i} are independent of XiX_i.

Selection bias

Selection bias is the difference in the untreated outcomes between people who were treated, and people who were not:

Among the people who were treated, what would their potential outcomes have been if they had not been treated? That is to say, what is E[Y_{0i} | D_i = 1]?

And among the people who were not treated, what is their untreated outcome? That is to say, what is E[Y_{0i} | D_i = 0]?

The difference between these two groups is selection bias. Going back to the running example, if you choose only smart people to participate in your skills learning program, then selection bias would be positive.

When the independence assumption fails, the OLS regression consistently estimates the TOT + SB where TOT is the treatment on the treated and the SB is the selection bias term.

R stuff


linearHypothesis(model, matchCoefs(model, "regex_string_of_coeffs_to_match"), test = "F")

Instrument exogeneity

Slide 123

  1. Compute the 2SLS residuals
  2. Perform a homoskedasticity-only F-test of the null that the coefficients on the instruments are all zero

Instrumental variables estimation in R

Suppose we want to estimate this model

lwage = \beta_0 + \beta_1 educ + (demog and family controls) + u

using nearc4 as an instrumental variable for educ.

We can do it in three ways: ILS, manual 2SLS, and full 2SLS


First stage:

fs = lm_robust(educ ~ nearc4 + demog + family, data=prox)

Reduced form:

rf = lm_robust(lwage ~ nearc4 + demog + family, data = prox)

ILS estimate:

rf$coefficients['nearc4'] / fs$coefficients['nearc4']

Manual 2SLS

Second stage: use fitted.values

ss = lm_robust(lwage ~ fs$fitted.values + demog + family, data=prox)

But the standard errors are not valid, because the second stage treats the fitted values as data and ignores the first-stage estimation error.

Automated 2SLS

iv_robust(lwage ~ educ + demog + family | nearc4 + demog + family, data = prox)
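The same pipeline (first stage, reduced form, ILS ratio, manual 2SLS) can be mirrored in Python on simulated data. Everything here (variable names echoing the R example, coefficients, the "ability" confounder) is made up for the demo; with one instrument and one endogenous regressor, ILS and 2SLS coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 100_000
z = rng.binomial(1, 0.5, n).astype(float)   # instrument (like nearc4)
ability = rng.normal(size=n)                # unobserved confounder
educ = 12.0 + 1.5 * z + ability + rng.normal(size=n)
lwage = 1.0 + 0.10 * educ + 0.5 * ability + rng.normal(size=n)

def slope(x, y):
    # simple-regression slope: Cov(x, y)/Var(x)
    return np.cov(x, y)[0, 1] / x.var(ddof=1)

fs = slope(z, educ)      # first stage: effect of z on educ
rf = slope(z, lwage)     # reduced form: effect of z on lwage
ils = rf / fs            # indirect least squares

# Manual 2SLS: regress lwage on the first-stage fitted values of educ
educ_hat = educ.mean() + fs * (z - z.mean())
tsls = slope(educ_hat, lwage)

ols = slope(educ, lwage)  # biased upward by the omitted ability term
```

The IV estimates recover the true return to education (0.10), while naive OLS is pushed well above it by the confounder.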