### Introduction

Categorical variables offer an important opportunity to capture qualitative effects in statistical modeling. Unfortunately, it can be tedious and cumbersome to manage categorical variables in statistical software.

The new GAUSS category type, introduced in GAUSS 21, makes it easy and intuitive to work with categorical data.

In today's blog we use real-life housing data to explore the numerous advantages of the GAUSS category type including:

- Easy set-up and viewing of categorical data.
- Simple renaming of category labels.
- Easy changing of the reference base case and reordering of categories.
- Single-line frequency plots and tables.
- Internal creation of dummy variables for regressions.
- Proper labeling of categories in regression output.

Want to see these advantages for yourself? Contact us for a GAUSS 21 demo!

## The data

Throughout today's blog, we will be using the Melbourne Housing Snapshot dataset.

The dataset contains the following variables:

Variable | Description |
---|---|

Suburb | Number of the suburb. |

Address | House address. |

Rooms | Number of rooms. |

Type | Type of house. |

Price | Sale price. |

Method | Method of sale. |

SellerG | Real estate agent. |

Date | Date sold. |

Distance | Distance from CBD. |

Postcode | Postal code. |

Bedroom2 | Number of bedrooms. |

Bathrooms | Number of bathrooms. |

Car | Number of carspots. |

Landsize | Land size. |

BuildingArea | Building size. |

YearBuilt | Year the house was built. |

CouncilArea | Governing council for the area. |

Latitude | Location latitude. |

Longitude | Location longitude. |

Regionname | General region. |

Propertycount | Number of properties that exist in the suburb. |

## Loading our Data

Let's start by loading our dataset using the GAUSS **Data Import** window:

When we open the dataset for loading we can see quickly from the color-coded columns what type GAUSS is assigning to our variables. For example:

`Type`

and`Method`

(highlighted orange) are categorical variables.`Suburb`

,`Address`

, and`SellerG`

(highlighted yellow) are listed as string variables.`Rooms`

and`Price`

(highlighted blue) are numbers.

### Changing a Variable to a Category

Suppose that we also want the variable `Suburb`

to be loaded as a categorical variable. This can be easily done using the **Variables** tab.

When we change the variable `Suburb`

type to **Category**, the **Modify Column Mapping** window opens:

This window is very useful and gives us a number of insights. For example, from this window we can tell that:

- The category
*Abbotsford*is the base case. - All labels and category mapping.
- There are 314 different categories.

If we want to explore the categories more we can use the **Label Filter**:

Once we are done changing the type to category, if we click **OK**, the preview for `Suburb`

is changed to **Category** with orange highlighting:

### Importing our Data

Now that we've set `Suburb`

to be a categorical variable, we're ready to load our data. Clicking **Import** loads our data and auto-generates the GAUSS code for all import steps.

## Managing the Properties of Categorical Variables

One of the advantages of using the category type is that the category properties can be easily modified using the **Data Management** pane.

### Changing Category Labels Interactively

Suppose that after loading our data, we realize that our current labels for the `Type`

variable are not very clear. Instead, we wish to rename the labels such that :

Original label | New label |
---|---|

h | House |

t | Townhouse |

u | Duplex unit |

Once we open the dataframe in the data editor, this is easy to do. We simply:

- Select the
**Manage**button to open the**Data Management**pane: - Click the drop-down button to the right of the variable name and select
**Properties**to open the**Modify Column Mapping**dialog. - Enter the new label in the
**Renamed Label**textbox next to the category label we want to change.

Let's use the same process to also change the category labels on for the `Method`

variable:

Once we've done this, the variable names `Type`

and `Method`

are highlighted in red. This indicates that we have unsaved changes.

When we click **Apply** the changes are saved and the commands GAUSS uses to change the category labels are generated in the **Program Input/Output** window.

```
melb_data = setcollabels(move(melb_data), "House"$|"Townhouse"$|"Duplex unit", 0|1|2, "Type");
melb_data = setcollabels(move(melb_data), "Passed in"$|"Sold"$|"Sold after auction"$|"Sold prior auction"$|"Vendor bid", 0|1|2|3|4, "Method");
```

### Changing the Base Case

When working with categorical variables, it is useful to know that GAUSS always treats the variable with the key value "0" as the base case. For example, we saw earlier that the base case for the `Suburb`

variable was *Abbotsford*.

If you want to change the assigned base case (or otherwise reorder the categories), this can quickly be done using the **Modify Column Mapping** dialog.

For example, suppose we want to replicate a study that uses the category *Chelsea* as the base case. To do this we:

- Locate and select the category
*Chelsea*using the**Label Filter** - Use the
**Double Arrow**button to move*Chelsea*to the top of the list. -
**Apply**our changes.

## Exploring our Categorical Variables

To dive deeper into our categorical variables we can use the dstatmt, frequency, and plotFreq procedures.

### General Summary Statistics

First, let's get a general overview of our data, including the categorical variables, using the `dstatmt`

function:

```
// Print descriptive statistics for all variables in 'melb_data'
call dstatmt(melb_data);
```

This prints a table of descriptive statistics to the **Program Input/Output** window:

---------------------------------------------------------------------------------------------- Variable Mean Std Dev Variance Minimum Maximum Valid Missing ---------------------------------------------------------------------------------------------- Suburb ----- ----- ----- Chelsea Yarraville 13580 0 Address ----- ----- ----- ----- ----- 13580 0 Rooms 2.938 0.9557 0.9135 1 10 13580 0 Type ----- ----- ----- House Duplex uni 13580 0 Price 1.076e+06 639311 4.087e+11 8.5e+04 9e+06 13580 0 Method ----- ----- ----- Passed in Vendor bid 13580 0 SellerG ----- ----- ----- ----- ----- 13580 0 Date ----- ----- ----- 28/01/2016 23/09/2017 13580 0 Distance 10.14 5.869 34.44 0 48.1 13580 0 Postcode 3105 90.68 8222 3000 3977 13580 0 Bedroom2 2.915 0.9659 0.933 0 20 13580 0 Bathroom 1.534 0.6917 0.4785 0 8 13580 0 Car 1.61 0.9626 0.9267 0 10 13518 62 Landsize 558.4 3991 1.593e+07 0 4.33e+05 13580 0 BuildingArea 152 541 292697 0 4.452e+04 7130 6450 YearBuilt 1965 37.27 1389 1196 2018 8205 5375 CouncilArea ----- ----- ----- Banyule Yarra Rang 12211 1369 Latitude -37.81 0.07926 0.006282 -38.18 -37.41 13580 0 Longitude 145 0.1039 0.0108 144.4 145.5 13580 0 Regionname ----- ----- ----- Eastern Me Western Vi 13580 0 Propertycount 7454 4379 1.917e+07 249 2.165e+04 13580 0

Though traditional summary statistics aren't valid for categorical data, the descriptive statistics still provide some insights:

- The minimum category is always the base case for the variable.
- We can identify if there are any missing observations.

### Frequency Table of Categories

The `frequency`

procedure was introduced in GAUSS 21 specifically to provide frequency count tables. The procedure requires two inputs:

- x
- Data matrix or data frame.
- varlist
- String, names or indices of variables to be counted. If names, should be entered as a formula string e.g
`"rep78 + foreign"`

.

To see frequency counts for both `Method`

and `Type`

in our dataframe `melb_data`

we enter:

```
// Print frequency tables for the 'Type' and 'Method' variables
// in the 'melb_data' dataframe
frequency(melb_data, "Type + Method");
```

This prints the category names along with:

- Total counts.
- Frequency percentages.
- Cumulative percentages.

Label Count Total % Cum. % House 9449 69.58 69.58 Townhouse 1114 8.203 77.78 Duplex unit 3017 22.22 100 Total 13580 100

Label Count Total % Cum. % Passed in 1564 11.52 11.52 Sold 9022 66.44 77.95 Sold after auction 92 0.6775 78.63 Sold prior auction 1703 12.54 91.17 Vendor bid 1199 8.829 100 Total 13580 100

### Frequency Plot of Categories

The information provided with the `frequency`

procedure can be quickly visualized using the `plotFreq`

procedure.

For example, let's plot the frequencies for the `Method`

variable:

`plotFreq(melb_data, "Method");`

## Estimation with Categorical Variables

The final area we will explore today is the use of the GAUSS category type in estimation. Estimation of one of the areas that the GAUSS category type offers the greatest advantages.

GAUSS category variables can be used in estimation routines, such as olsmt or glm without taking any additional step.

When category variables are detected in estimation routines GAUSS will automatically:

- Create and use dummy variables during estimation.
- Exclude the base case category.
- Print output tables using specified category labels.

As an example, let's run a simple linear regression model estimating the role of `Method`

, `Bedroom2`

, and `Bath`

on `Price`

:

`call olsmt(melb_data, "price ~ method + bedroom2 + bath");`

Standard Prob Standardized Cor with Variable Estimate Error t-value >|t| Estimate Dep Var ------------------------------------------------------------------------------------------------- CONSTANT 59797.1 20407.1 2.93022 0.003 --- ---

Method: Sold 40895.7 14857.5 2.75252 0.006 0.0302079 0.0256233 Method: Sold after auction -54790.4 57921.6 -0.94594 0.344 -0.00703034 -0.0064481 Method: Sold prior auction -109114 18989.1 -5.74614 0.000 -0.0565257 -0.104125 Method: Vendor bid 77144.6 20737.2 3.7201 0.000 0.0342371 0.0442128 Bedroom2 202813 5924.64 34.2321 0.000 0.306426 0.475951 Bathroom 263854 8314.3 31.735 0.000 0.285481 0.467038

Notice that the results include estimated coefficients for each of the `Method`

categories, except the base case, *Passed In*.

## Conclusion

Today we've seen how fast and easy it can be to work with categorical variables using the new GAUSS category type. Whether you're just getting started exploring your data or you're in the final stage of estimation, the category type can speed up your work and get you to your results faster.

## Further reading

- Introduction to Categorical Variables.
- Easy and Fast Data Management in GAUSS 21 .
- Preparing and Cleaning FRED data in GAUSS/

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.