## Introduction

Dummy variables are a common econometric tool, whether working with time series, cross-sectional, or panel data. Unfortunately, raw datasets rarely come formatted with dummy variables that are regression ready.

In today's blog, we explore several options for creating dummy variables in GAUSS, including:

- Creating dummy variables from a file using formula strings.
- Creating dummy variables from an existing vector of categorical data.
- Creating dummy variables from an existing vector of continuous variables.

## Creating Dummy Variables from a File

Dummy variables can be conveniently created from files at the time of loading data or calling procedures using formula string notation. Formula string notation is a powerful GAUSS tool that allows you to represent a model or collection of variables in a compact and intuitive manner, using the variable names in the dataset.

### The `factor` Keyword

The `factor` keyword is used in formula strings to:

- Specify that a variable contains numeric categorical data.
- Create dummy variables (which are not present in the raw data) while loading data from a dataset.
- Include dummy variables in estimation functions such as `olsmt`, `glm`, or `gmmFit`.

Let's consider the model

$$mpg = \alpha + \beta_1 weight + \beta_2 rep78$$

We will use ordinary least squares to estimate this model with data from the `auto2.dta` file, which can be found in the `GAUSSHOME/examples` directory.

The variable `rep78` is a categorical, 5-point variable that measures a car's repair record in 1978. To estimate the effects of the repair record on `mpg`, we can include dummy variables representing the different categories.

```
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Perform OLS estimation, creating dummy variables from 'rep78'
call olsmt(fname, "mpg ~ weight + factor(rep78)");
```

The printed output table includes coefficients for `rep78=fair, average, good, excellent`. Note that `rep78=poor` is automatically excluded from the regression as the base level.

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 38.0594 | 3.09336 | 12.3036 | 0.000 | --- | --- |
| weight | -0.00550304 | 0.000601001 | -9.15645 | 0.000 | -0.743741 | -0.80552 |
| rep78: Fair | -0.478604 | 2.76503 | -0.173092 | 0.863 | -0.0263109 | -0.134619 |
| rep78: Average | -0.471562 | 2.55314 | -0.184699 | 0.854 | -0.0401403 | -0.279593 |
| rep78: Good | -0.599032 | 2.6066 | -0.229814 | 0.819 | -0.0451669 | 0.0384391 |
| rep78: Excellent | 2.08628 | 2.72482 | 0.765657 | 0.447 | 0.131139 | 0.454192 |

### The `cat` Keyword

Some common file types, such as XLS and CSV, do not have a robust method of determining the variable types. In these cases, the `cat` keyword is used to:

- Denote a variable in a file as categorical text data.
- Instruct GAUSS to reclassify the string data to integer categories.

The `cat` keyword can be combined with the `factor` keyword to instruct GAUSS to load a column as string data, reclassify it to integers, and then create dummy variables:

```
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/yarn.xlsx";
// Reclassify 'load' variable from 'high, low, med'
// to '0, 1, 2', then create dummy variables from
// integer categories and create OLS estimates
call olsmt(fname, "cycles ~ factor(cat(load))");
```

Using `factor(cat(load))` in the formula string tells GAUSS to create dummy variables representing the different categories of the `load` variable. This is seen in the printed output table, which now includes coefficients for `load=low, med`. Note that `load=high` is automatically excluded from the regression as the base level.

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 534.444444 | 292.474662 | 1.827319 | 0.080 | --- | --- |
| load: low | 621.555556 | 413.621634 | 1.502715 | 0.146 | 0.338504 | 0.240716 |
| load: med | 359.111111 | 413.621634 | 0.868212 | 0.394 | 0.195575 | 0.026323 |

### Creating Dummy Variables Using `loadd`

In our previous two examples, we used the `factor` and `cat` keywords directly in calls to estimation procedures. However, we can also use these keywords when loading data to create dummy variables in our data matrices.

For example, let's load the dummy variables associated with the `rep78` variable:

```
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load 'mpg' and 'weight', creating dummy variables from 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");
```

The `reg_data` matrix is a 74 x 6 matrix. It contains the `mpg` and `weight` data, as well as 4 columns of dummy variables for `rep78=fair, average, good, excellent`.

The first five rows look like this:

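| mpg | weight | rep78:fair | rep78:avg | rep78:good | rep78:exc |
|---|---|---|---|---|---|
| 22 | 2930 | 0 | 1 | 0 | 0 |
| 17 | 3350 | 0 | 1 | 0 | 0 |
| 22 | 2640 | . | . | . | . |
| 20 | 3250 | 0 | 1 | 0 | 0 |
| 15 | 4080 | 0 | 0 | 1 | 0 |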

Note that, again, `rep78=poor` is automatically excluded as the base level.
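A quick way to confirm what `loadd` produced is to check the dimensions before and after removing missing values. The following is a minimal sketch, assuming the same `auto2.dta` example file is available in the GAUSS examples directory:

```
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";

// Load 'mpg', 'weight', and dummy variables for 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");

// Number of rows and columns in the loaded data
print rows(reg_data)~cols(reg_data);

// Rows with a missing 'rep78' carry missing dummy values;
// 'packr' removes every row that contains a missing value
reg_data = packr(reg_data);

// Dimensions after removing rows with missing values
print rows(reg_data)~cols(reg_data);
```

Based on the dimensions reported in this post, the first `print` should show 74 rows and 6 columns, and 69 rows should remain after `packr`.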

## Creating Dummy Variables from a Categorical Vector

In the previous section, we looked at creating dummy variables at the time of loading data or running procedures. In this section, we consider how to create dummy variables from an existing GAUSS vector.

The GAUSS `design` procedure provides a convenient method for creating dummy variables from a vector of discrete categories.

Let's load the data from the `auto2.dta` dataset used in our earlier regression example. This time we won't load `rep78` using `factor`:

```
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load auto data for regression
reg_data = loadd(fname, "mpg + weight + rep78");
// Remove missing values
reg_data = packr(reg_data);
```

The first five rows of `reg_data` look like this:

| mpg | weight | rep78 |
|---|---|---|
| 22 | 2930 | 3 |
| 17 | 3350 | 3 |
| 20 | 3250 | 3 |
| 15 | 4080 | 4 |
| 18 | 3670 | 3 |

Our third column now contains discrete, categorical data with values ranging from 1-5, which represent `poor`, `fair`, `average`, `good`, and `excellent`.

The `auto2.dta` file specifies the preferred order for the string categories.

```
// Compute the unique values found
// in the third column of 'reg_data'
print unique(reg_data[., 3]);
```

$$\begin{matrix} 1\\ 2\\ 3\\ 4\\ 5 \end{matrix}$$

`design` creates a matrix with a column of indicator variables for each positive integer in the input. For example:

```
cats = { 1, 2, 1, 3 };
print design(cats);
```

will return:

$$\begin{matrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{matrix}$$
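One way to sanity-check a design matrix is to sum its columns and rows. This small sketch reuses the `cats` vector from the example above; the variable name `d` is only illustrative:

```
// Small vector of categories
cats = { 1, 2, 1, 3 };

// One indicator column per category
d = design(cats);

// Column sums: number of observations in each category
print sumc(d)';

// Row sums: each observation falls in exactly one category
print sumc(d')';
```

For this input, the column sums are 2, 1, and 1, and every row sum is 1.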

Therefore, if we pass the third column of `reg_data` to `design`, we will get a matrix with a column for all five categories. However, we want to drop the base case column for our regression.

$$mpg = \alpha + \beta_1 weight + \beta_2 rep78_{fair} + \beta_3 rep78_{avg} + \beta_4 rep78_{good} + \beta_5 rep78_{excl}$$

To do this, we shift the range of the categorical data from 1-5 to 0-4 by subtracting 1.

```
// Create dummy variables. Subtract one
// to remove the base case.
dummy_vars = design(reg_data[., 3] - 1);
```

This creates a 69x4 matrix, `dummy_vars`, which contains dummy variables representing the final four levels of `rep78`.
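An alternative way to remove the base case is to build all of the indicator columns and then drop the first one. The sketch below compares the two approaches on a small illustrative vector. The names `d_all`, `d_drop`, and `d_shift` are hypothetical, and the comparison assumes `design` handles a category of 0 as described above, by producing a row of zeros:

```
// Small vector of categories
cats = { 1, 2, 1, 3 };

// Full set of indicator columns, one per category
d_all = design(cats);

// Drop the first column so category 1 becomes the base case
d_drop = d_all[., 2:cols(d_all)];

// Same idea as in the text: shift the categories down by one
d_shift = design(cats - 1);

// Prints 1 if the two matrices are identical
print (d_drop == d_shift);
```

Dropping a column by index is convenient when the base case is not the lowest category, since any column can be excluded this way.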

Now we can estimate our model as shown below.

```
// Select the 'mpg' data as the dependent variable
y = reg_data[., 1];
// Independent variables:
// 'weight' is in the second column of 'reg_data'.
// 'rep78'= Fair, Average, Good and Excellent
// are represented by the 4 columns
// of 'dummy_vars'.
x = reg_data[., 2]~dummy_vars;
// Estimate model using OLS
call olsmt("", y, x);
```

Our printed results are the same as earlier, except our table no longer includes variable names:

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|---|---|---|---|---|---|---|
| CONSTANT | 38.059415 | 3.093361 | 12.303578 | 0.000 | --- | --- |
| X1 | -0.005503 | 0.000601 | -9.156447 | 0.000 | -0.743741 | -0.805520 |
| X2 | -0.478604 | 2.765035 | -0.173092 | 0.863 | -0.026311 | -0.134619 |
| X3 | -0.471562 | 2.553145 | -0.184699 | 0.854 | -0.040140 | -0.279593 |
| X4 | -0.599032 | 2.606599 | -0.229814 | 0.819 | -0.045167 | 0.038439 |
| X5 | 2.086276 | 2.724817 | 0.765657 | 0.447 | 0.131139 | 0.454192 |

## Creating Dummy Variables from Continuous Variables

The `design` procedure works well when our data already contains categorical data. However, there may be cases when we want to create dummy variables based on ranges of continuous data. The GAUSS `dummybr`, `dummydn`, and `dummy` procedures can be used to achieve this.

Consider a simple example:

```
x = { 1.53,
8.41,
3.81,
6.34,
0.03 };
// Breakpoints
v = { 1, 5, 7 };
```

All three procedures create a set of dummy (0/1) variables by breaking up a data vector into categories based on specified breakpoints. These procedures differ in how they treat boundary cases as shown below.

| Procedure | Category Boundaries | # dummies ($K$ breakpoints) | Call | Result |
|---|---|---|---|---|
| `dummybr` | $$x \leq 1$$ $$1 \lt x \leq 5$$ $$5 \lt x \leq 7$$ | $$K$$ | `dm = dummybr(x, v);` | $$dm = \begin{matrix} 0 & 1 & 0\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{matrix}$$ |
| `dummy` | $$x \leq 1$$ $$1 \lt x \leq 5$$ $$5 \lt x \leq 7$$ $$7 \lt x$$ | $$K+1$$ | `dm = dummy(x, v);` | $$dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix}$$ |
| `dummydn` | $$x \leq 1$$ $$1 \lt x \leq 5$$ $$5 \lt x \leq 7$$ $$7 \lt x$$ | $$K$$ | `p = 2;` (column to drop), `dm = dummydn(x, v, p);` | $$dm = \begin{matrix} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{matrix}$$ |

Let's look a little closer at how these procedures work.

### Using `dummybr`

When creating dummy variables with `dummybr`:

- All categories are:
    - Open on the left (i.e., do not contain their left boundaries).
    - Closed on the right (i.e., do contain their right boundaries).
- $K$ breakpoints are required to specify $K$ dummy variables.
- Missings are deleted before the dummy variables are created.

`dm = dummybr(x, v);`

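The code above produces three dummies based upon the breakpoints in the vector `v`:

- $x \leq 1$
- $1 \lt x \leq 5$
- $5 \lt x \leq 7$

The matrix `dm` contains:

$$dm = \begin{matrix} 0 & 1 & 0\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{matrix} \qquad x = \begin{matrix} 1.53\\ 8.41\\ 3.81\\ 6.34\\ 0.03 \end{matrix}$$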

Notice that in this case, the second row of `dm` does not contain a 1 because `x = 8.41` does not fall into any of our specified categories.
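Observations that fall outside every category can be flagged by summing each row of the dummy matrix. This is a small sketch using the same `x` and `v` as above; the name `in_cat` is only illustrative:

```
// Data and breakpoints from the example above
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Create dummy variables with 'dummybr'
dm = dummybr(x, v);

// Row sums: 1 if the observation falls in a category, 0 otherwise
in_cat = sumc(dm');
print in_cat';

// Select the observations that fall in no category
print selif(x, in_cat .== 0);
```

Given the matrix shown above, the row sums are 1, 0, 1, 1, 1, and the only observation selected is 8.41.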

### Using `dummy`

Now, let's compare our results from `dummybr` above to the `dummy` procedure. When we use the `dummy` procedure:

- All categories are:
    - Open on the left (i.e., do not contain their left boundaries).
    - Closed on the right (i.e., do contain their right boundaries), *except* the highest (rightmost) category because it extends to $+\infty$.
- $K-1$ breakpoints are required to specify $K$ dummy variables.
- Missings are deleted before the dummy variables are created.

`dm = dummy(x, v);`

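The code above produces four dummies based upon the breakpoints in the vector `v`:

- $x \leq 1$
- $1 \lt x \leq 5$
- $5 \lt x \leq 7$
- $7 \lt x$

The matrix `dm` contains:

$$dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix} \qquad x = \begin{matrix} 1.53\\ 8.41\\ 3.81\\ 6.34\\ 0.03 \end{matrix}$$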

These results vary from our previous example:

- The `dummy` procedure results in 4 columns of dummy variables. It adds a new column for the case where `7 < x`.
- The second row now contains a 1 in the final column to indicate that `x = 8.41` falls into the category `7 < x`.
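The relationship between the two procedures can be checked directly. The sketch below, using the same `x` and `v`, compares the first three columns of the `dummy` result to the `dummybr` result and then sums each row. The name `dm_br` is only illustrative:

```
// Data and breakpoints from the example above
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Dummy variables from both procedures
dm_br = dummybr(x, v);
dm = dummy(x, v);

// Prints 1 if the first three columns of 'dm' match 'dm_br'
print (dm[., 1:3] == dm_br);

// Row sums of 'dm'
print sumc(dm')';
```

Given the matrices shown above, the comparison prints 1 and every row of `dm` sums to 1. Because the rows always sum to 1, the full set of `dummy` columns is collinear with a constant term, so one column must be left out when the regression includes an intercept.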

### Using `dummydn`

Our final function is `dummydn`, which behaves just like `dummy`, except that the $p$th column of the matrix of dummies is dropped. This is convenient for specifying a base case to ensure that these variables will not be collinear with a vector of ones.

```
// Column to drop
p = 2;
// Create matrix of dummy variables
dm_dn = dummydn(x, v, p);
```

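The code above produces three dummies based upon the breakpoints in the vector `v`:

- $x \leq 1$
- $1 \lt x \leq 5$ (since `p = 2`, this column is dropped)
- $5 \lt x \leq 7$
- $7 \lt x$

The matrix `dm_dn` contains:

$$dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix} \qquad dm\_dn = \begin{matrix} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{matrix} \qquad x = \begin{matrix} 1.53\\ 8.41\\ 3.81\\ 6.34\\ 0.03 \end{matrix}$$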

Note that the matrix `dm_dn` is the same as `dm` except the second column has been removed.
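This equivalence can be verified by dropping the column manually and comparing the results. A minimal sketch, with `dm_manual` as an illustrative name:

```
// Data, breakpoints, and column to drop from the example above
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };
p = 2;

// Full set of dummies and the set with column p dropped
dm = dummy(x, v);
dm_dn = dummydn(x, v, p);

// Keep columns 1, 3 and 4 of 'dm', i.e. drop column p = 2
dm_manual = dm[., 1 3 4];

// Prints 1 if the two matrices are identical
print (dm_manual == dm_dn);
```

Given the matrices shown above, the comparison prints 1.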

## Conclusion

Dummy variables are an important tool for data analysis whether we are working with time series data, cross-sectional data, or panel data. In today's blog, we have explored three GAUSS tools for generating dummy variables:

- Creating dummy variables from a file using formula strings.
- Creating dummy variables from an existing vector of categorical data using the `design` procedure.
- Creating dummy variables from an existing vector of continuous variables using the `dummy`, `dummybr`, and `dummydn` procedures.

Erica has been working to build, distribute, and strengthen the GAUSS universe since 2012. She is an economist skilled in data analysis and software development. She has earned a B.A. and MSc in economics and engineering and has over 15 years combined industry and academic experience in data analysis and research.