### Introduction

Both ordinary least squares and generalized linear models can be computed directly from a dataset using the GAUSS formula string syntax. In addition, the ability to transform variables, including factor variables, makes for compact and efficient modeling.

In this tutorial, we will examine several ways to use formula strings in OLS. When using formula strings with the GAUSS procedure `ols`, two inputs are required: the dataset name and the formula.

## Represent a model with formula strings

In a model with a dependent (or response) variable, the formula lists the dependent variable first, followed by a tilde `~` and then the independent variables. For example, to represent the model

$$weight = \alpha + \beta*height$$

the correct formula string would be `"weight ~ height"`.

## Descriptive Statistics

In this example, we will again use the `auto2.dta` dataset. To learn a little more about the dataset, let’s first look at the descriptive statistics using `dstatmt`.

```
//Create file name with full path
fname = getGAUSSHome() $+"examples/auto2.dta";
//Descriptive statistics
dstatmt(fname);
```

The output from this is:

| Variable | Mean | Std Dev | Variance | Min | Max | Valid | Missing |
|---|---|---|---|---|---|---|---|
| make | ----- | ----- | ----- | ----- | ----- | 74 | 0 |
| price | 6165.2568 | 2949.4959 | 8699525.9743 | 3291.00 | 15906.00 | 74 | 0 |
| mpg | 21.2973 | 5.7855 | 33.4720 | 12.00 | 41.00 | 74 | 0 |
| rep78 | ----- | ----- | ----- | ----- | ----- | 74 | 0 |
| headroom | 2.9932 | 0.8460 | 0.7157 | 1.50 | 5.00 | 74 | 0 |
| trunk | 13.7568 | 4.2774 | 18.2962 | 5.00 | 23.00 | 74 | 0 |
| weight | 3019.4595 | 777.1936 | 604029.8408 | 1760.00 | 4840.00 | 74 | 0 |
| length | 187.9324 | 22.2663 | 495.7899 | 142.00 | 233.00 | 74 | 0 |
| turn | 39.6486 | 4.3994 | 19.3543 | 31.00 | 51.00 | 74 | 0 |
| displacement | 197.2973 | 91.8372 | 8434.0748 | 79.00 | 425.00 | 74 | 0 |
| gear_ratio | 3.0149 | 0.4563 | 0.2082 | 2.19 | 3.89 | 74 | 0 |
| foreign | ----- | ----- | ----- | ----- | ----- | 74 | 0 |

There are a few important things to note from this output. First, the full row of missing values for `make` tells us that `make` is not compatible with `dstatmt`. When this occurs, it is most likely because the variable is a string variable.

Second, note that `rep78` and `foreign` only contain values for the minimum and maximum observation. All other statistics are missing. This occurs when a variable is recognized by GAUSS as a categorical variable. We can preview the data in the data import wizard to confirm that `make` is a string variable and that `rep78` and `foreign` are categorical variables.

## OLS With A Subset of Variables

Now that we know a little more about our data, let’s set up our linear model. For our first model, let’s run a regression of `mpg` on `weight` and `length`.

$$mpg = \alpha + \beta_1*weight + \beta_2*length$$

The GAUSS formula string representing this model is `"mpg ~ weight + length"`.

`call ols(fname, "mpg ~ weight + length");`

The output from this regression reads:

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Valid cases | 74 | Dependent variable | mpg |
| Missing cases | 0 | Deletion method | None |
| Total SS | 2443.459 | Degrees of freedom | 71 |
| R-squared | 0.661 | Rbar-squared | 0.652 |
| Residual SS | 827.379 | Std error of est | 3.414 |
| F(2,71) | 69.341 | Probability of F | 0.000 |

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 47.884873 | 6.087870 | 7.865620 | 0.000 | --- | --- |
| weight | -0.003851 | 0.001586 | -2.428452 | 0.018 | -0.517387 | -0.807175 |
| length | -0.079593 | 0.055358 | -1.437802 | 0.155 | -0.306327 | -0.795779 |
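The estimates above can also be used programmatically. The following is a minimal sketch, assuming a GAUSS version in which `olsmt` (the structure-returning counterpart of `ols`) accepts formula strings, and assuming the coefficient vector `out.b` is ordered as in the printed table (constant, `weight`, `length`). The car with weight 3000 and length 190 is a hypothetical example, not an observation taken from the dataset.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Declare the output structure and estimate the model
struct olsmtOut out;
out = olsmt(fname, "mpg ~ weight + length");

// Hypothetical car: 1 for the constant, weight = 3000, length = 190
x_new = { 1 3000 190 };

// Predicted mpg, assuming out.b is ordered constant, weight, length
mpg_hat = x_new * out.b;
print mpg_hat;
```

Doing the same calculation by hand with the printed estimates gives 47.884873 - 0.003851*3000 - 0.079593*190, which is approximately 21.21 mpg.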

## Include Factor Variables

We now wish to extend our previous model to include the levels of `rep78`. To specify that a variable is categorical in a formula, we use `factor` followed by the name of the variable inside a pair of parentheses. The formula for our extended model will be `"mpg ~ weight + length + factor(rep78)"`.

`call ols(fname, "mpg ~ weight + length + factor(rep78)");`

Using `factor` in the formula string tells GAUSS that dummy variables representing the different categories of `rep78` should be included in the regression. This is seen in the printed output table, which now includes coefficients for the `rep78` levels Fair, Average, Good, and Excellent. Note that the level Poor is automatically excluded from the regression as the base level.

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 49.954158 | 6.734554 | 7.417590 | 0.000 | --- | --- |
| weight | -0.002299 | 0.001724 | -1.333602 | 0.187 | -0.310730 | -0.805520 |
| length | -0.115486 | 0.058422 | -1.976769 | 0.053 | -0.447805 | -0.803676 |
| rep78: Fair | -0.093428 | 2.710368 | -0.034471 | 0.973 | -0.005136 | -0.134619 |
| rep78: Average | -0.531709 | 2.496377 | -0.212992 | 0.832 | -0.045260 | -0.279593 |
| rep78: Good | -0.343326 | 2.551735 | -0.134546 | 0.893 | -0.025887 | 0.038439 |
| rep78: Excellent | 2.403347 | 2.668859 | 0.900515 | 0.371 | 0.151069 | 0.454192 |
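As a sketch of what `factor(rep78)` expands to, the extended model can be written with one indicator variable per non-base level. The symbols $\gamma_j$ and $D_j$ are notation introduced here for illustration and do not appear in the GAUSS output:

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \gamma_1*D_{Fair} + \gamma_2*D_{Average} + \gamma_3*D_{Good} + \gamma_4*D_{Excellent}$$

Each $D_j$ equals 1 when `rep78` takes level $j$ and 0 otherwise, so each $\gamma_j$ measures the difference in expected `mpg` relative to the base level Poor, holding `weight` and `length` fixed. For instance, the estimate of 2.403347 for `rep78: Excellent` points to roughly 2.4 more miles per gallon than a comparable car rated Poor, although its p-value of 0.371 means the difference is not distinguishable from zero at conventional significance levels.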

## Include Interaction Effects

Now let’s look at extending our model one step further to include interaction effects using formula strings. Two different operators are available for adding interaction terms. The colon operator, `:`, adds only the pure interaction term, while the asterisk, `*`, adds each individual term as well as the interaction term.

Let’s first consider using `:` to add the interaction of `length` and `weight` to our model. In this case the formula for our model is `"mpg ~ weight + length + length:weight + factor(rep78)"`.

```
//Case one with ":"
call ols(fname, "mpg ~ weight + length + length:weight + factor(rep78)");
```

In the output from this call we see that the coefficient for the interaction term `length:weight` has been added to our output table just below the coefficient for `length`.

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Valid cases | 69 | Dependent variable | mpg |
| Missing cases | 5 | Deletion method | Listwise |
| Total SS | 2340.203 | Degrees of freedom | 61 |
| R-squared | 0.702 | Rbar-squared | 0.668 |
| Residual SS | 697.753 | Std error of est | 3.382 |
| F(7,61) | 20.513 | Probability of F | 0.000 |

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 69.748294 | 14.957072 | 4.663232 | 0.000 | --- | --- |
| weight | -0.009885 | 0.005407 | -1.828120 | 0.072 | -1.336016 | -0.805520 |
| length | -0.212137 | 0.087303 | -2.429906 | 0.018 | -0.822577 | -0.803676 |
| length:weight | 0.000037 | 0.000025 | 1.478611 | 0.144 | 1.378839 | -0.802872 |
| rep78: Fair | -0.313868 | 2.688941 | -0.116726 | 0.907 | -0.017255 | -0.134619 |
| rep78: Average | -0.939373 | 2.488154 | -0.377538 | 0.707 | -0.079961 | -0.279593 |
| rep78: Good | -1.203796 | 2.593793 | -0.464107 | 0.644 | -0.090766 | 0.038439 |
| rep78: Excellent | 1.765676 | 2.678632 | 0.659171 | 0.512 | 0.110986 | 0.454192 |

Now we will estimate the same model using `*`. In this case, the formula for our model is `"mpg ~ weight*length + factor(rep78)"`.

```
//Case two with "*"
call ols(fname, "mpg ~ weight*length + factor(rep78)");
```

The resulting output table shows that coefficients for `weight`, `length`, and `weight:length` are estimated.

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Valid cases | 69 | Dependent variable | mpg |
| Missing cases | 5 | Deletion method | Listwise |
| Total SS | 2340.203 | Degrees of freedom | 61 |
| R-squared | 0.702 | Rbar-squared | 0.668 |
| Residual SS | 697.753 | Std error of est | 3.382 |
| F(7,61) | 20.513 | Probability of F | 0.000 |

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 69.748294 | 14.957072 | 4.663232 | 0.000 | --- | --- |
| weight | -0.009885 | 0.005407 | -1.828120 | 0.072 | -1.336016 | -0.805520 |
| length | -0.212137 | 0.087303 | -2.429906 | 0.018 | -0.822577 | -0.803676 |
| weight:length | 0.000037 | 0.000025 | 1.478611 | 0.144 | 1.378839 | -0.802872 |
| rep78: Fair | -0.313868 | 2.688941 | -0.116726 | 0.907 | -0.017255 | -0.134619 |
| rep78: Average | -0.939373 | 2.488154 | -0.377538 | 0.707 | -0.079961 | -0.279593 |
| rep78: Good | -1.203796 | 2.593793 | -0.464107 | 0.644 | -0.090766 | 0.038439 |
| rep78: Excellent | 1.765676 | 2.678632 | 0.659171 | 0.512 | 0.110986 | 0.454192 |
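Both calls fit the same model and return identical estimates. Leaving the `rep78` terms aside, that model can be sketched as follows, where $\beta_3$ is notation introduced here for the interaction coefficient:

$$mpg = \alpha + \beta_1*weight + \beta_2*length + \beta_3*(weight*length) + \ldots$$

With an interaction term, the effect of `weight` is no longer a single number but depends on `length`:

$$\frac{\partial\, mpg}{\partial\, weight} = \beta_1 + \beta_3*length$$

As an illustration using the printed estimates and a hypothetical car of length 190, the effect of one additional unit of weight is approximately $-0.009885 + 0.000037*190 \approx -0.0029$ mpg. Because the interaction coefficient is printed to only six decimal places, this figure is approximate.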

## OLS Without a Constant

As a final adjustment to our model, let’s remove the constant from our regression. The default when using GAUSS formulas for `ols` is to include a constant in the model. In order to run the model without a constant, we must add a `-1` after the `~` in our formula. The `-1` should be the first item in our list of independent variables. To remove the constant from our earlier model with `factor(rep78)`, we use the formula `"mpg ~ -1 + weight + length + factor(rep78)"`:

`call ols(fname, "mpg ~ -1 + weight + length + factor(rep78)");`

The output from this line reads:

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Valid cases | 69 | Dependent variable | mpg |
| Missing cases | 5 | Deletion method | Listwise |
| Total SS | 33615.000 | Degrees of freedom | 63 |
| R-squared | 0.959 | Rbar-squared | 0.956 |
| Residual SS | 1364.160 | Std error of est | 4.653 |
| F(6,63) | 248.236 | Probability of F | 0.000 |

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| weight | -0.011862 | 0.001560 | -7.604068 | 0.000 | -1.683494 | 0.880216 |
| length | 0.271706 | 0.035757 | 7.598726 | 0.000 | 2.334454 | 0.932449 |
| rep78: Fair | 4.735923 | 3.585777 | 1.320752 | 0.191 | 0.073061 | 0.295039 |
| rep78: Average | 5.855229 | 3.193495 | 1.833486 | 0.071 | 0.174919 | 0.580552 |
| rep78: Good | 5.490387 | 3.308433 | 1.659513 | 0.102 | 0.127049 | 0.501374 |
| rep78: Excellent | 8.676492 | 3.449911 | 2.514990 | 0.014 | 0.156955 | 0.494998 |
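The much higher R-squared here should be read with care. A quick check against the reported sums of squares reproduces the printed value:

$$R^2 = 1 - \frac{Residual\ SS}{Total\ SS} = 1 - \frac{1364.160}{33615.000} \approx 0.959$$

The Total SS of 33615.000 is far larger than the 2340.203 reported for the interaction models on the same 69 cases. This is the pattern one would expect if the total sum of squares is measured around zero rather than around the mean of `mpg` when the constant is dropped, which is an interpretation of the numbers rather than something stated in the output. Under that reading, the R-squared of the model without a constant is not directly comparable to the R-squared of the models with one.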