### Goals

This tutorial builds on the first four econometrics tutorials. It is suggested that you complete those tutorials prior to starting this one.

This tutorial demonstrates how to test for model specification errors, such as omitted variables, after OLS regression. After completing this tutorial, you should be able to:

- Test model specification using the link test.
- Test for missing variables using the Ramsey regression specification error test (RESET).

### Introduction

A common source of model specification error in OLS regressions is the omission of relevant variables. When relevant variables are omitted, variation in the dependent variable may be falsely attributed to the included variables. This can bias the estimated coefficients and inflate the standard errors of the included regressors. In this tutorial, we will test for omitted variables using the link test and the Ramsey RESET test. Following the previous tutorials, we've estimated an OLS model and stored the results using data simulated from the data generating process $ y_{i} = 1.3 + 5.7 x_{i} + \epsilon_{i} $, where $ \epsilon_{i} $ is the random disturbance term.

## The Link Test

The motivation behind the link test is the idea that if a regression is specified appropriately, you should not be able to find additional significant independent variables. To test this, the link test regresses the dependent variable of the original regression on the original regression's prediction and the squared prediction. If the squared-prediction regressor in the test regression is significant, there is evidence that the model is misspecified.

To run the link test we construct the $\hat{y}$ and $\hat{y}^2$ variables from the results of the original regression and run the regression

$$y = b_0 + \hat{y}b_1 + \hat{y}^2b_2 + \epsilon$$

```
//Add column of ones to x for constant
x_full = ones(num_obs, 1) ~ x;
//Predicted y values
y_hat = x_full * b;
//Concatenate and form regressor
link_regressors = y_hat ~ y_hat.^2;
//Test regression
call ols("", y, link_regressors);
```

The above code will print the following report:

```
Valid cases:                   100      Dependent variable:               Y
Missing cases:                   0      Deletion method:               None
Total SS:                 3481.056      Degrees of freedom:              97
R-squared:                   0.969      Rbar-squared:                 0.969
Residual SS:               106.782      Std error of est:             1.049
F(2,97):                  1532.578      Probability of F:             0.000
Durbin-Watson:               2.023

                         Standard                Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT      0.03422     0.12382     0.276351   0.783       ---         ---
X1            1.00687     0.02119    47.510909   0.000     0.99125     0.98448
X2           -0.00127     0.00205    -0.619914   0.537    -0.01293     0.50545
```

The `ols` results show a p-value of 0.537 for the coefficient on $\hat{y}^2$ (labeled `X2` in the report), so we cannot reject the null hypothesis that the coefficient is equal to zero. Because $\hat{y}^2$ is insignificant in the test regression, there is no evidence that our model suffers from omitted variables.
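Readers without GAUSS can reproduce the mechanics of the link test with a short Python/NumPy sketch. This uses freshly simulated data from the same data generating process, so the numbers will differ from the report above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Simulate y = 1.3 + 5.7*x + e, matching the tutorial's DGP
x = rng.standard_normal(n)
y = 1.3 + 5.7 * x + rng.standard_normal(n)

# Original regression: y on [1, x]
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

# Link test regression: y on [1, y_hat, y_hat^2]
Z = np.column_stack([np.ones(n), y_hat, y_hat**2])
g, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Conventional t-statistic on the squared-prediction term
resid = y - Z @ g
s2 = resid @ resid / (n - Z.shape[1])
cov = s2 * np.linalg.inv(Z.T @ Z)
t_sq = g[2] / np.sqrt(cov[2, 2])
print("t-stat on y_hat^2:", t_sq)
```

Because the model is correctly specified, the t-statistic on `y_hat**2` should typically be small in absolute value, mirroring the insignificant `X2` coefficient in the GAUSS report.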

## The Ramsey RESET Test

The Ramsey RESET test is based on the same concept but runs the regression

$$ y_i = x_ib + z_it + u_i $$

where $z_i = (\hat{y}_i^2, \hat{y}_i^3, \hat{y}_i^4)$. The predicted $y$ values are normalized between 0 and 1 before the powers are calculated. If the regression is properly specified, the coefficients on all powers of the predicted $y$ should be jointly insignificant.

### Normalize $\hat{y}$

To run the RESET test we first normalize the predicted $y$ values, then construct the variables $\hat{y}^2$, $\hat{y}^3$, and $\hat{y}^4$. To normalize the predicted $y$ between 0 and 1 we use min-max normalization such that

$$y_{norm} = \frac{\hat{y}-\hat{y}_{min}}{\hat{y}_{max} - \hat{y}_{min}}$$

```
//Normalize y_hat
y_hat_norm = (y_hat - minc(y_hat))/(maxc(y_hat) - minc(y_hat));
```
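As a quick sanity check on the formula, min-max normalization always maps the smallest fitted value to 0 and the largest to 1. A tiny Python illustration with made-up fitted values:

```python
import numpy as np

y_hat = np.array([2.0, 5.0, 11.0, 8.0])  # made-up fitted values for illustration

# Min-max normalization: (y_hat - min) / (max - min)
y_norm = (y_hat - y_hat.min()) / (y_hat.max() - y_hat.min())
print(y_norm)  # smallest value maps to 0, largest to 1
```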

### RESET regression

Unlike the link test, the Ramsey RESET test regression includes the regressors from the original regression:

$$y = b_0 + xb_1 + \hat{y}^2b_2 + \hat{y}^3b_3 + \hat{y}^4b_4 + \epsilon$$

This time we will store the results because we need to conduct the hypothesis test that $b_2$, $b_3$, and $b_4$ are jointly insignificant.

```
//Concatenate and form regressor
ram_regressors = x ~ y_hat_norm.^2 ~ y_hat_norm.^3 ~ y_hat_norm.^4;
//Test regression
{ ram_nam, ram_m, ram_b, ram_stb, ram_vc, ram_std,
ram_sig, ram_cx, ram_rsq, ram_resid, ram_dbw } = ols("",y, ram_regressors);
```

The code above will print the following report:

```
Valid cases:                   100      Dependent variable:               Y
Missing cases:                   0      Deletion method:               None
Total SS:                 3481.056      Degrees of freedom:              95
R-squared:                   0.971      Rbar-squared:                 0.970
Residual SS:               100.511      Std error of est:             1.029
F(4,95):                   798.798      Probability of F:             0.000
Durbin-Watson:               1.918

                         Standard                Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT      2.68599     3.23368     0.83063    0.408       ---         ---
X1            6.75663     1.73379     3.89703    0.000     1.162524   0.984481
X2           -1.12135    34.93156    -0.03210    0.974    -0.035223   0.938781
X3          -21.13182    51.19730    -0.41275    0.681    -0.605543   0.856568
X4           18.50884    25.23111     0.73357    0.465     0.490670   0.771187
```

### RESET hypothesis test

To complete our RESET test for omitted variables we need to test the hypothesis that the coefficients on all powers of `y_hat_norm` are jointly insignificant. Our objective is to test:

$$ H_0 : b_2 = b_3 = b_4 = 0 $$

using the F-statistic

$$F_0 = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-(k+1))}$$

where

- $SSR_r$ = residual sum of squares of the restricted model
- $SSR_{ur}$ = residual sum of squares of the unrestricted model
- $q$ = number of restrictions
- $n$ = number of observations
- $k$ = number of regressors in the unrestricted model (excluding the constant)
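The F-statistic depends only on the two residual sums of squares and the degrees of freedom, so the arithmetic is easy to check in any language. A small Python helper makes it concrete (the SSR values in the example are made up for illustration, not taken from the tutorial's data):

```python
def reset_f_stat(ssr_r, ssr_ur, q, n, k):
    """F statistic for q linear restrictions:
    ((SSR_r - SSR_ur)/q) / (SSR_ur/(n - (k + 1)))."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - (k + 1)))

# Illustrative values only: restricted SSR 107.0, unrestricted SSR 100.0,
# 3 restrictions, 100 observations, 4 regressors plus a constant
f = reset_f_stat(107.0, 100.0, 3, 100, 4)
print(f)
```

Note that the denominator degrees of freedom, $n - (k + 1)$, counts the constant, matching the 95 degrees of freedom shown in the RESET regression report.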

In this case, the restricted model is $y = \alpha + \beta x$, which is conveniently the model we estimated originally.

```
//Find SSR from original model
SSR_r = resid'*resid;
//Find SSR from unrestricted model
SSR_ur = ram_resid'*ram_resid;
//Number of restrictions
q = 3;
//Number of regressors in unrestricted model
k = cols(ram_regressors);
//Construct F stat
F_ram = ((SSR_r - SSR_ur)/q)/(SSR_ur/(num_obs-(k+1)));
print "F-stat for restriction b_2 = b_3 = b_4 :" F_ram;
//Probability of F:
p_value = cdffc(F_ram, q, (num_obs-(k+1)));
print "Probability of F: " p_value;
```

The p-value for our F-statistic is 10.4%. Therefore, at the 5% significance level, we fail to reject the null hypothesis that the coefficients on the powers of $\hat{y}$ are jointly zero. This is consistent with a correctly specified functional form and no omitted variables.

### Conclusion

Congratulations! You have:

- Performed the link test for model misspecification.
- Performed the Ramsey RESET test for model misspecification.

For convenience the full program text is below.

```
//Clear the work space
new;
//Set seed to replicate results
rndseed 23423;
//Number of observations
num_obs = 100;
//Generate independent variables
x = rndn(num_obs,1);
//Generate error terms
error_term = rndn(num_obs, 1);
//Generate y from x and error_term
y = 1.3 + 5.7*x + error_term;
//Turn on residuals computation
_olsres = 1;
//Estimate model and store results in variables
{ nam, m, b, stb, vc, std, sig, cx, rsq, resid, dbw } = ols("", y, x);
/**************************************************************************/
//Add column of ones to x for constant
x_full = ones(num_obs, 1) ~ x;
//Predicted y values
y_hat = x_full * b;
//Concatenate and form regressor
link_regressors = y_hat ~ y_hat.^2;
//Test regression
call ols("", y, link_regressors);
/**************************************************************************/
//Normalize y_hat
y_hat_norm = (y_hat - minc(y_hat))/(maxc(y_hat) - minc(y_hat));
//Concatenate and form regressor
ram_regressors = x ~ y_hat_norm.^2 ~ y_hat_norm.^3 ~ y_hat_norm.^4;
//Test regression
{ ram_nam, ram_m, ram_b, ram_stb, ram_vc, ram_std,
ram_sig, ram_cx, ram_rsq, ram_resid, ram_dbw } = ols("",y, ram_regressors);
//Find SSR from original model
SSR_r = resid'*resid;
//Find SSR from unrestricted model
SSR_ur = ram_resid'*ram_resid;
//Number of restrictions
q = 3;
//Number of regressors in unrestricted model
k = cols(ram_regressors);
//Construct F stat
F_ram = ((SSR_r - SSR_ur)/q)/(SSR_ur/(num_obs-(k+1)));
print "F-stat for restriction b_2 = b_3 = b_4 :" F_ram;
//Probability of F:
p_value = cdffc(F_ram, q, (num_obs-(k+1)));
print "Probability of F: " p_value;
```