OLS Regression From A Dataset

Introduction

Both ordinary least squares and generalized linear models can be computed directly from a dataset using the GAUSS formula string syntax. In addition, the ability to transform variables, including factor variables, makes for compact and efficient modeling.

In this tutorial, we will examine several ways to use formula strings with OLS. When using formula strings with the GAUSS procedure ols, two inputs are required: the dataset name and the formula.

Represent a model with formula strings

In a model with a dependent (or response) variable, the formula lists the dependent variable first, followed by a tilde ~ and then the independent variables. For example, to represent the model

$$weight = \alpha + \beta*height$$

the correct formula string is "weight ~ height".

Descriptive Statistics

In this example, we will again use the auto2.dta dataset. To learn a little more about the dataset, let’s first look at the descriptive statistics using dstatmt.

//Create file name with full path
fname = getGAUSSHome() $+"examples/auto2.dta";

//Descriptive statistics
dstatmt(fname);

The output from this is:

---------------------------------------------------------------------------------------
Variable             Mean     Std Dev      Variance      Min     Max      Valid Missing
---------------------------------------------------------------------------------------

make                -----       -----         -----     -----     -----      74    0
price           6165.2568   2949.4959  8699525.9743   3291.00  15906.00      74    0
mpg               21.2973      5.7855       33.4720     12.00     41.00      74    0
rep78               -----       -----         -----     -----     -----      74    0
headroom           2.9932      0.8460        0.7157      1.50      5.00      74    0
trunk             13.7568      4.2774       18.2962      5.00     23.00      74    0
weight          3019.4595    777.1936   604029.8408   1760.00   4840.00      74    0
length           187.9324     22.2663      495.7899    142.00    233.00      74    0
turn              39.6486      4.3994       19.3543     31.00     51.00      74    0
displacement     197.2973     91.8372     8434.0748     79.00    425.00      74    0
gear_ratio         3.0149      0.4563        0.2082      2.19      3.89      74    0
foreign             -----       -----         -----     -----     -----      74    0

There are a few important things to note from this output. First, the full row of missing values for make tells us that make is not compatible with dstatmt. When this occurs, it is most likely because the variable is a string variable.

Second, note that rep78 and foreign only contain values for the minimum and maximum observation. All other statistics are missing. This occurs when GAUSS recognizes a variable as categorical. We can preview the data in the data import wizard to confirm that make is a string variable and that rep78 and foreign are categorical variables:

Factor variables in GAUSS data import wizard.

OLS With A Subset of Variables

Now that we know a little more about our data, let’s set up our linear model. For our first model, let’s run a simple regression of mpg against weight and length.

$$mpg = \alpha + \beta_1*weight + \beta_2*length$$

The GAUSS formula string representing this model is "mpg ~ weight + length".

call ols(fname, "mpg ~ weight + length");

The output from this regression reads:

Valid cases:                    74      Dependent variable:                 mpg
Missing cases:                   0      Deletion method:                   None
Total SS:                 2443.459      Degrees of freedom:                  71
R-squared:                   0.661      Rbar-squared:                     0.652
Residual SS:               827.379      Std error of est:                 3.414
F(2,71):                    69.341      Probability of F:                 0.000

                         Standard                 Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------

CONSTANT    47.884873    6.087870    7.865620     0.000       ---         ---
weight      -0.003851    0.001586   -2.428452     0.018   -0.517387   -0.807175
length      -0.079593    0.055358   -1.437802     0.155   -0.306327   -0.795779 
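
As a quick check on how the summary statistics fit together, the reported values can be reproduced from the two sums of squares using the standard OLS definitions. The symbols n (valid cases), k (estimated coefficients, including the constant), TSS (Total SS) and RSS (Residual SS) are shorthand introduced here, not GAUSS names.

$$df = n - k = 74 - 3 = 71$$

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{827.379}{2443.459} \approx 0.661$$

$$\bar{R}^2 = 1 - \frac{RSS/(n-k)}{TSS/(n-1)} = 1 - \frac{827.379/71}{2443.459/73} \approx 0.652$$

$$\hat{\sigma} = \sqrt{\frac{RSS}{n-k}} = \sqrt{\frac{827.379}{71}} \approx 3.414$$

$$F = \frac{(TSS - RSS)/2}{RSS/71} = \frac{1616.080/2}{827.379/71} \approx 69.34$$

Each result agrees with the corresponding entry in the output header to the printed precision.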

Include Factor Variables

We now wish to extend our previous model to include the levels of rep78. To specify that a variable is categorical in a formula, we use factor followed by the name of the variable inside a pair of parentheses. The formula for our extended model will be "mpg ~ weight + length + factor(rep78)".

call ols(fname, "mpg ~ weight + length + factor(rep78)");

Using factor in the formula string tells GAUSS to include dummy variables representing the different categories of rep78 in the regression. This is seen in the printed output table, which now includes coefficients for rep78 = Fair, Average, Good, and Excellent. Note that rep78 = Poor is automatically excluded from the regression as the base level.

                                 Standard                 Prob   Standardized  Cor with
Variable             Estimate      Error      t-value     >|t|     Estimate    Dep Var
---------------------------------------------------------------------------------------

CONSTANT            49.954158    6.734554    7.417590     0.000       ---         ---
weight              -0.002299    0.001724   -1.333602     0.187   -0.310730   -0.805520
length              -0.115486    0.058422   -1.976769     0.053   -0.447805   -0.803676
rep78: Fair         -0.093428    2.710368   -0.034471     0.973   -0.005136   -0.134619
rep78: Average      -0.531709    2.496377   -0.212992     0.832   -0.045260   -0.279593
rep78: Good         -0.343326    2.551735   -0.134546     0.893   -0.025887    0.038439
rep78: Excellent     2.403347    2.668859    0.900515     0.371    0.151069    0.454192
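
Written out, the formula string used in this call corresponds to the regression equation below. The dummy variables D and the coefficients named gamma are notation assumed here for illustration; each D equals 1 when rep78 takes the named level and 0 otherwise.

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \gamma_1*D_{fair} + \gamma_2*D_{average} + \gamma_3*D_{good} + \gamma_4*D_{excellent}$$

Because Poor is the base level, each gamma measures a shift relative to Poor. For example, the estimate of 2.403347 for rep78: Excellent means that, holding weight and length fixed, the predicted mpg for an Excellent car is about 2.40 higher than for a Poor one. With a p-value of 0.371, that difference is not statistically distinguishable from zero at conventional levels.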

Include Interaction Effects

Now let’s look at extending our model one step further to include interaction effects using formula strings. Two different operators are available for adding interaction terms. The colon operator, :, is used to add only a pure interaction term and an asterisk, *, is used to add each individual term, as well as the interaction term.
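
In equation form, both operators lead to a product term in the regression. As a sketch using the two continuous regressors from the earlier model (the coefficient name beta_3 is assumed here), adding the interaction of weight and length gives

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \beta_3*weight*length$$

With the colon, "weight:length" contributes only the last term, so weight and length must be listed separately if their individual effects are wanted. With the asterisk, "weight*length" is shorthand for "weight + length + weight:length" and contributes all three terms.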

Let’s first consider using : to add the interaction of length and weight to our model. In this case, the formula for our model is "mpg ~ weight + length + length:weight + factor(rep78)".

//Case one with ":"
call ols(fname, "mpg ~ weight + length + length:weight + factor(rep78)");

In the output from this call we see that the coefficient for the interaction term length:weight has been added to our output table just below the coefficient for length.

Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                 2340.203      Degrees of freedom:                  61
R-squared:                   0.702      Rbar-squared:                     0.668
Residual SS:               697.753      Std error of est:                 3.382
F(7,61):                    20.513      Probability of F:                 0.000

                                 Standard                 Prob   Standardized  Cor with
Variable             Estimate      Error      t-value     >|t|     Estimate    Dep Var
---------------------------------------------------------------------------------------

CONSTANT            69.748294   14.957072    4.663232     0.000       ---         ---
weight              -0.009885    0.005407   -1.828120     0.072   -1.336016   -0.805520
length              -0.212137    0.087303   -2.429906     0.018   -0.822577   -0.803676
length:weight        0.000037    0.000025    1.478611     0.144    1.378839   -0.802872
rep78: Fair         -0.313868    2.688941   -0.116726     0.907   -0.017255   -0.134619
rep78: Average      -0.939373    2.488154   -0.377538     0.707   -0.079961   -0.279593
rep78: Good         -1.203796    2.593793   -0.464107     0.644   -0.090766    0.038439
rep78: Excellent     1.765676    2.678632    0.659171     0.512    0.110986    0.454192

Now we will estimate the same model using *. In this case, the formula for our model is "mpg ~ weight*length + factor(rep78)".

//Case two with "*"
call ols(fname, "mpg ~ weight*length + factor(rep78)");

The resulting output table shows that coefficients for weight, length, and weight:length are estimated.

Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                 2340.203      Degrees of freedom:                  61
R-squared:                   0.702      Rbar-squared:                     0.668
Residual SS:               697.753      Std error of est:                 3.382
F(7,61):                    20.513      Probability of F:                 0.000

                                 Standard                 Prob   Standardized  Cor with
Variable             Estimate      Error      t-value     >|t|     Estimate    Dep Var
---------------------------------------------------------------------------------------

CONSTANT            69.748294   14.957072    4.663232     0.000       ---         ---
weight              -0.009885    0.005407   -1.828120     0.072   -1.336016   -0.805520
length              -0.212137    0.087303   -2.429906     0.018   -0.822577   -0.803676
weight:length        0.000037    0.000025    1.478611     0.144    1.378839   -0.802872
rep78: Fair         -0.313868    2.688941   -0.116726     0.907   -0.017255   -0.134619
rep78: Average      -0.939373    2.488154   -0.377538     0.707   -0.079961   -0.279593
rep78: Good         -1.203796    2.593793   -0.464107     0.644   -0.090766    0.038439
rep78: Excellent     1.765676    2.678632    0.659171     0.512    0.110986    0.454192
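
One consequence of the interaction term is that the effect of weight on mpg is no longer a single number; it depends on length. Differentiating the fitted equation with respect to weight and plugging in the estimates from the table gives the rough calculation below. It evaluates the effect at the mean length of 187.9324 reported by dstatmt for the full sample, and it uses the interaction coefficient as printed (0.000037, only two significant digits), so the result is approximate.

$$\frac{\partial mpg}{\partial weight} = \hat{\beta}_{weight} + \hat{\beta}_{weight:length}*length$$

$$-0.009885 + 0.000037*187.9324 \approx -0.0029$$

At the average length, one additional unit of weight is therefore associated with a drop of roughly 0.003 in mpg, which is of the same order as the weight coefficient of -0.002299 in the model without the interaction.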

OLS Without a Constant

As a final adjustment to our model, let’s remove the constant from our regression. By default, ols includes a constant in the model when a GAUSS formula string is used. To run the model without a constant, we add -1 after the ~ in our formula. The -1 should be the first item in the list of independent variables. To remove the constant from our model with factor variables, we use the formula "mpg ~ -1 + weight + length + factor(rep78)".

call ols(fname, "mpg ~ -1 + weight + length + factor(rep78)");

The output from this line reads:

Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                33615.000      Degrees of freedom:                  63
R-squared:                   0.959      Rbar-squared:                     0.956
Residual SS:              1364.160      Std error of est:                 4.653
F(6,63):                   248.236      Probability of F:                 0.000

                                 Standard                 Prob   Standardized  Cor with
Variable             Estimate      Error      t-value     >|t|     Estimate    Dep Var
---------------------------------------------------------------------------------------

weight              -0.011862    0.001560   -7.604068     0.000   -1.683494    0.880216
length               0.271706    0.035757    7.598726     0.000    2.334454    0.932449
rep78: Fair          4.735923    3.585777    1.320752     0.191    0.073061    0.295039
rep78: Average       5.855229    3.193495    1.833486     0.071    0.174919    0.580552
rep78: Good          5.490387    3.308433    1.659513     0.102    0.127049    0.501374
rep78: Excellent     8.676492    3.449911    2.514990     0.014    0.156955    0.494998
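
The jump in Total SS and R-squared relative to the earlier models deserves a note. The numbers are consistent with the total sum of squares being computed around zero, rather than around the mean of mpg, once the constant is dropped. This is an inference from the printed output, not a statement taken from the GAUSS documentation. It rests on the identity

$$\sum mpg_i^2 = \sum (mpg_i - \overline{mpg})^2 + n*\overline{mpg}^2$$

Taking the Total SS of 2340.203 from the 69-case models that include a constant, the implied mean of mpg is

$$\sqrt{\frac{33615.000 - 2340.203}{69}} \approx 21.29$$

which is close to the mean of 21.2973 that dstatmt reports for all 74 cases. The reported R-squared then follows from the usual definition:

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{1364.160}{33615.000} \approx 0.959$$

For this reason, the R-squared of 0.959 should not be compared directly with the values from the models that include a constant.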
