Introduction
Both ordinary least squares and generalized linear models can be computed directly from a dataset using the GAUSS formula string syntax. In addition, the ability to transform variables, including factor variables, makes for compact and efficient modeling.
In this tutorial, we will examine several ways to use formula strings in OLS. When using formula strings with the GAUSS procedure ols, two inputs are required: the dataset name and the formula.
Represent a model with formula strings
In a model with a dependent (or response) variable, the formula lists the dependent variable first, followed by a tilde ~ and then the independent variables. For example, to represent the model
$$weight = \alpha + \beta*height$$
the correct formula string is "weight ~ height".
Descriptive Statistics
In this example, we will again use the auto2.dta dataset. To learn a little more about the dataset, let’s first look at the descriptive statistics using dstatmt.
//Create file name with full path
fname = getGAUSSHome() $+"examples/auto2.dta";
//Descriptive statistics
dstatmt(fname);
The output from this is
---------------------------------------------------------------------------------------
Variable             Mean    Std Dev      Variance       Min       Max  Valid  Missing
---------------------------------------------------------------------------------------
make                -----      -----         -----     -----     -----     74        0
price           6165.2568  2949.4959  8699525.9743   3291.00  15906.00     74        0
mpg               21.2973     5.7855       33.4720     12.00     41.00     74        0
rep78               -----      -----         -----     -----     -----     74        0
headroom           2.9932     0.8460        0.7157      1.50      5.00     74        0
trunk             13.7568     4.2774       18.2962      5.00     23.00     74        0
weight          3019.4595   777.1936   604029.8408   1760.00   4840.00     74        0
length           187.9324    22.2663      495.7899    142.00    233.00     74        0
turn              39.6486     4.3994       19.3543     31.00     51.00     74        0
displacement     197.2973    91.8372     8434.0748     79.00    425.00     74        0
gear_ratio         3.0149     0.4563        0.2082      2.19      3.89     74        0
foreign             -----      -----         -----     -----     -----     74        0
There are a few important things to note from this output. First, the full row of missing values for make tells us that make is not compatible with dstatmt. When this occurs, it is most likely because the variable is a string variable.
Second, note that rep78 and foreign only contain values for the minimum and maximum observation. All other statistics are missing. This occurs when GAUSS recognizes a variable as categorical. We can preview the data in the data import wizard to confirm that make is a string variable and that rep78 and foreign are categorical variables.
OLS With A Subset of Variables
Now that we know a little more about our data, let’s set up our linear model. For our first model, let’s run a linear regression of mpg on weight and length.
$$mpg = \alpha + \beta_1*weight + \beta_2*length$$
The GAUSS formula string representing this model is "mpg ~ weight + length".
call ols(fname, "mpg ~ weight + length");
The output from this regression reads:
Valid cases:                    74      Dependent variable:                 mpg
Missing cases:                   0      Deletion method:                   None
Total SS:                 2443.459      Degrees of freedom:                  71
R-squared:                   0.661      Rbar-squared:                     0.652
Residual SS:               827.379      Std error of est:                 3.414
F(2,71):                    69.341      Probability of F:                 0.000

                                Standard                Prob  Standardized    Cor with
Variable            Estimate       Error     t-value    >|t|      Estimate     Dep Var
-------------------------------------------------------------------------------
CONSTANT           47.884873    6.087870    7.865620   0.000           ---         ---
weight             -0.003851    0.001586   -2.428452   0.018     -0.517387   -0.807175
length             -0.079593    0.055358   -1.437802   0.155     -0.306327   -0.795779
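The call keyword discards the values that ols returns. When the estimates are needed for later computation, they can be captured instead. The following is a minimal sketch, assuming the eleven-value return list of the ols procedure; the variable names on the left-hand side are illustrative choices, not required names.

```gauss
// Create file name with full path
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Capture the return values instead of discarding them with 'call'.
// The left-hand-side names are illustrative; 'se' holds the standard errors.
{ vnam, m, b, stb, vc, se, sigma, cx, rsq, resid, dwstat } = ols(fname, "mpg ~ weight + length");

// Print the coefficient estimates and the R-squared
print "Coefficient estimates:";
print b;
print "R-squared:";
print rsq;
```

Here b holds the coefficient estimates and rsq the R-squared that appear in the printed report.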
Include Factor Variables
We now wish to extend our previous model to include the levels of rep78. To specify that a variable is a categorical variable in a formula, we use factor followed by the name of the variable inside a pair of parentheses. The formula for our extended model will be "mpg ~ weight + length + factor(rep78)".
call ols(fname, "mpg ~ weight + length + factor(rep78)");
Using factor in the formula string tells GAUSS that dummy variables representing the different categories of rep78 should be included in the regression. This is seen in the printed output table, which now includes coefficients for rep78 = fair, average, good, and excellent. Note that rep78 = poor is automatically excluded from the regression as the base level.
                                Standard                Prob  Standardized    Cor with
Variable            Estimate       Error     t-value    >|t|      Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT           49.954158    6.734554    7.417590   0.000           ---         ---
weight             -0.002299    0.001724   -1.333602   0.187     -0.310730   -0.805520
length             -0.115486    0.058422   -1.976769   0.053     -0.447805   -0.803676
rep78: Fair        -0.093428    2.710368   -0.034471   0.973     -0.005136   -0.134619
rep78: Average     -0.531709    2.496377   -0.212992   0.832     -0.045260   -0.279593
rep78: Good        -0.343326    2.551735   -0.134546   0.893     -0.025887    0.038439
rep78: Excellent    2.403347    2.668859    0.900515   0.371      0.151069    0.454192
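Written out, the factor term expands into one indicator (dummy) variable for each level other than the base level. As a sketch of the model being estimated, where the coefficients gamma and the indicators D are notation introduced here for illustration rather than anything GAUSS prints:

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \gamma_1*D_{fair} + \gamma_2*D_{average} + \gamma_3*D_{good} + \gamma_4*D_{excellent}$$

Each indicator D equals 1 when rep78 takes the named level and 0 otherwise, so each gamma coefficient measures the shift in expected mpg relative to the base level, poor.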
Include Interaction Effects
Now let’s look at extending our model one step further to include interaction effects using formula strings. Two different operators are available for adding interaction terms. The colon operator, :, is used to add only a pure interaction term and an asterisk, *, is used to add each individual term, as well as the interaction term.
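In equation form, the two operators differ only in which terms they generate. As a sketch with a generic response y and two generic variables x1 and x2 (placeholder names, not columns in auto2.dta), the formula "y ~ x1:x2" corresponds to

$$y = \alpha + \beta_3*x_1*x_2$$

while the formula "y ~ x1*x2" corresponds to

$$y = \alpha + \beta_1*x_1 + \beta_2*x_2 + \beta_3*x_1*x_2$$

In these equations, * denotes ordinary multiplication, as elsewhere in this tutorial. In a formula string, "y ~ x1*x2" is therefore shorthand for "y ~ x1 + x2 + x1:x2".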
Let’s first consider using : to add the interaction of length and weight to our model. In this case the formula for our model is "mpg ~ weight + length + length:weight + factor(rep78)".
//Case one with ":"
call ols(fname, "mpg ~ weight + length + length:weight + factor(rep78)");
In the output from this call we see that the coefficient for the interaction term length:weight has been added to our output table just below the coefficient for length.
Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                 2340.203      Degrees of freedom:                  61
R-squared:                   0.702      Rbar-squared:                     0.668
Residual SS:               697.753      Std error of est:                 3.382
F(7,61):                    20.513      Probability of F:                 0.000

                                Standard                Prob  Standardized    Cor with
Variable            Estimate       Error     t-value    >|t|      Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT           69.748294   14.957072    4.663232   0.000           ---         ---
weight             -0.009885    0.005407   -1.828120   0.072     -1.336016   -0.805520
length             -0.212137    0.087303   -2.429906   0.018     -0.822577   -0.803676
length:weight       0.000037    0.000025    1.478611   0.144      1.378839   -0.802872
rep78: Fair        -0.313868    2.688941   -0.116726   0.907     -0.017255   -0.134619
rep78: Average     -0.939373    2.488154   -0.377538   0.707     -0.079961   -0.279593
rep78: Good        -1.203796    2.593793   -0.464107   0.644     -0.090766    0.038439
rep78: Excellent    1.765676    2.678632    0.659171   0.512      0.110986    0.454192
Now we will estimate the same model using *. In this case, the formula for our model is "mpg ~ weight*length + factor(rep78)".
//Case two with "*"
call ols(fname, "mpg ~ weight*length + factor(rep78)");
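The resulting output table shows that coefficients for weight, length, and weight:length are estimated. The estimates are identical to those from case one; only the label of the interaction term changes.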
Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                 2340.203      Degrees of freedom:                  61
R-squared:                   0.702      Rbar-squared:                     0.668
Residual SS:               697.753      Std error of est:                 3.382
F(7,61):                    20.513      Probability of F:                 0.000

                                Standard                Prob  Standardized    Cor with
Variable            Estimate       Error     t-value    >|t|      Estimate     Dep Var
---------------------------------------------------------------------------------------
CONSTANT           69.748294   14.957072    4.663232   0.000           ---         ---
weight             -0.009885    0.005407   -1.828120   0.072     -1.336016   -0.805520
length             -0.212137    0.087303   -2.429906   0.018     -0.822577   -0.803676
weight:length       0.000037    0.000025    1.478611   0.144      1.378839   -0.802872
rep78: Fair        -0.313868    2.688941   -0.116726   0.907     -0.017255   -0.134619
rep78: Average     -0.939373    2.488154   -0.377538   0.707     -0.079961   -0.279593
rep78: Good        -1.203796    2.593793   -0.464107   0.644     -0.090766    0.038439
rep78: Excellent    1.765676    2.678632    0.659171   0.512      0.110986    0.454192
OLS Without a Constant
As a final adjustment to our model, let’s remove the constant from our regression. The default when using GAUSS formulas for ols is to include a constant in the model. In order to run the model without a constant, we must add a -1 after the ~ in our formula. The -1 should be the first item in our list of independent variables. To remove the constant from our model with factor variables, we use the formula "mpg ~ -1 + weight + length + factor(rep78)".
call ols(fname, "mpg ~ -1 + weight + length + factor(rep78)");
The output from this line reads:
Valid cases:                    69      Dependent variable:                 mpg
Missing cases:                   5      Deletion method:               Listwise
Total SS:                33615.000      Degrees of freedom:                  63
R-squared:                   0.959      Rbar-squared:                     0.956
Residual SS:              1364.160      Std error of est:                 4.653
F(6,63):                   248.236      Probability of F:                 0.000

                                Standard                Prob  Standardized    Cor with
Variable            Estimate       Error     t-value    >|t|      Estimate     Dep Var
---------------------------------------------------------------------------------------
weight             -0.011862    0.001560   -7.604068   0.000     -1.683494    0.880216
length              0.271706    0.035757    7.598726   0.000      2.334454    0.932449
rep78: Fair         4.735923    3.585777    1.320752   0.191      0.073061    0.295039
rep78: Average      5.855229    3.193495    1.833486   0.071      0.174919    0.580552
rep78: Good         5.490387    3.308433    1.659513   0.102      0.127049    0.501374
rep78: Excellent    8.676492    3.449911    2.514990   0.014      0.156955    0.494998
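The R-squared of 0.959 is much higher than the 0.702 reported for the interaction model, but this does not mean the model without a constant fits better. Residual SS is actually larger here (1364.160 versus 697.753). The reported values are consistent with R-squared being one minus the ratio of Residual SS to Total SS, and Total SS is far larger in this output (33615.000 versus 2340.203), which is what one would expect if it is measured around zero rather than around the mean of mpg once the constant is dropped. That reading is an inference from the printed numbers, not something the output states. Checking the arithmetic for the model without a constant:

$$R^2 = 1 - \frac{Residual\ SS}{Total\ SS} = 1 - \frac{1364.160}{33615.000} \approx 0.959$$

The same calculation for the interaction model gives

$$R^2 = 1 - \frac{697.753}{2340.203} \approx 0.702$$

Because the two Total SS values differ so much, R-squared values from models with and without a constant should not be compared directly.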

