### Introduction

In this tutorial, we will learn how to load all observations from

- All or a subset of the variables.
- With or without data transformations, such as
- Creating dummy variables.
- Reclassifying string variables to integer categories.
- Creating interaction terms.
`ln`

,`exp`

,`lag`

and more.

from a well-formed dataset. All sections below apply to any dataset that meets our definition of 'well-formed' which is explained below.

#### What files does this apply to?

Our definition of a well-formed dataset includes

- Comma-separated text files (
**CSV**) with headers in the first line of the file. **Excel**files (**XLS**,**XLSX**) with headers in the first row of the file.**GAUSS**datasets (**DAT**) and matrix files (**FMT**).**SAS**(**SAS7BCAT**,**SAS7BDAT**),**SPSS**(**POR**,**SAV**) and**Stata**(**DTA**) datasets.- HDF5 files (
**HF**) if the dataset contains an attribute called`headers`

which contains the variable names.

Regardless of the file type, each file must be organized as a consistent tabular dataset like the example below. Each row of the file must have the same number of columns and each column of the file must have the same number of rows.

Age,Height,Weight 29,61,134 44,74,191 32,70,223

#### What files does this not apply to?

This section does not apply to

- Text files delimited by a character other than a comma.
- CSV files without headers or with empty lines.
- Excel files without headers.

or files which have inconsistent numbers of rows, or columns.

// This dataset is NOT well-formed. // It has: // 1. Comments at the top of the file. // 2. Inconsistent numbers of columns per row. Age,Height,Weight 29,61,134 44,74,191,43,16 32,70,223

## Headers

The GAUSS function `getHeaders`

will return a string array containing all the variable names from a dataset. It takes only one input, the name of the dataset. The example below reads all of the variable names from the Stata dataset `auto2.dta`

which is located in the GAUSS examples directory.

```
// Create file name with full path to Stata dataset
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Read the variable names from the dataset
h = getHeaders(fname);
// Print string array containing the dataset headers
print h;
```

will return

make price mpg rep78 headroom trunk weight length turn displacement gear_ratio foreign

## All variables

The GAUSS command `loadd`

can read variables from a dataset. To read all variables from a dataset you only need to pass one input, a string containing the name of the dataset. The example dataset, `binary.csv`

, contains four variables related to college admissions, `admit`

, `gre`

, `gpa`

, and `rank`

.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/binary.csv";
// Read all 4 variables from the CSV file
X = loadd(fname);
// Print the first 5 rows of all columns of 'X'
print X[1:5,.];
```

will return

0.00 380.00 3.61 3.00 1.00 660.00 3.67 3.00 1.00 800.00 4.00 1.00 1.00 640.00 3.19 4.00 0.00 520.00 2.93 4.00

## A subset of variables

`loadd`

can accept an optional second argument which is a formula string. The formula string specifies which variables to load and which data transformations to perform. The following operators can be used in a formula string to load a subset of the variables from the dataset.

Operator | Description |
---|---|

. | The dot represents all variables. |

+ | The plus operator adds a variable. |

- | The minus operator removes a variable |

### Example: Load two variables by name

```
// Create file name with full path to the SAS dataset
fname = getGAUSSHome() $+ "examples/detroit.sas7bdat";
// Load 2 variables by name from the SAS dataset
X = loadd(fname, "unemployment + weekly_earn");
// Print the first 5 rows of all columns of 'X'
print X[1:5,.];
```

will return

11.0 117.18 7.0 134.02 5.2 141.68 4.3 147.98 3.5 159.85

### Example: Load all variables except for one

`cancer.dat`

is an example GAUSS dataset located in the GAUSS examples directory. It contains five variables, `time`

, `histolog`

, `stage`

, `count`

, and `risktime`

. The example below loads all of these variables, except for `stage`

.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/cancer.dat";
// Load all but one variable from the GAUSS dataset
X = loadd(fname, ". -stage");
// Print the first 5 rows of all columns of 'X'
print X[1:5,.];
```

will return

1.00 1.00 9.00 157.00 1.00 2.00 5.00 77.00 1.00 3.00 1.00 21.00 2.00 1.00 2.00 139.00 2.00 2.00 2.00 68.00

## Categorical variables

The GAUSS formula string syntax allows you to automatically reclassify string variables to integer categories as well as to convert integer categories into dummy variables. The following keywords can be used in a formula string to reclassify string variables and create dummy variables.

Keyword | Description |
---|---|

factor | Create dummy variables from a column of integers. |

cat | Reclassify string variables into integer categories. |

### Example: Categorical variable to dummies

`housing.csv`

is an example dataset from the GAUSS examples directory. The variable `baths`

, represents the number of bathrooms in the home and contains the following unique values: 1, 2, 3, and 4.

The example code below loads the `baths`

variable unmodified for comparison. In the next step, the code tells `loadd`

to create dummy variables from the integer categories in the `baths`

by using the `factor`

keyword in the formula string and loads the dummy variables.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/housing.csv";
// Load the original categorical data
baths = loadd(fname, "baths");
// Load the categorical variable and create dummy vars
dmy = loadd(fname, "factor(baths)");
```

After the code above, the first 5 rows of `baths`

and `dmy`

will be equal to

baths = 2 dmy = 1 0 0 1 0 0 0 2 1 0 0 2 1 0 0 3 0 1 0

As you can see above, the base case is set to the case when `baths`

equals one.

### Example: String variable to integer categories

The example Excel file, `nba_ht_wt.xls`

, contains seven variables with different information about NBA basketball players. The `Pos`

variable represents the position played by the basketball player. The levels are `C`

, `F`

, and `G`

, which represent center, forward and guard.

The code below uses the `cat`

keyword in the formula string to tell `loadd`

to reclassify the string variables into integer categories. By default `loadd`

assigns the integer levels to the string categories by alphabetical order.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/nba_ht_wt.xls";
// Load the string variable and reclassify in one step
X = loadd(fname, "cat(Pos)");
// Print the first 5 rows of all columns of 'X'
print X[1:5,.];
```

will return

1 3 3 2 2

which corresponds to

C G G F F

### Combining keywords and operators

Both the `cat`

and `factor`

keywords can be combined with the `.`

, `+`

and `-`

operators. For example, the following statements would be legal.

```
fname = getGAUSSHome() $+ "examples/housing.csv";
X = loadd(fname, "price + factor(baths) + taxes");
fname = getGAUSSHome() $+ "examples/yarn.xlsx"
X = loadd(fname, "cat(amplitude) + cycles");
```

## Interaction effects

The `*`

and `:`

operators are used in formula strings to create interaction effects.

Operator | Description |
---|---|

* | Represents an interaction between two variables as well as the original variables. |

: | Represents only the interaction between the two specified variables. |

### Example: Interaction term

By default when an interaction term is specified in a formula string, the variables that form the interaction are also included. The example below will load 3 variables, `Height`

, `Weight`

and the interaction term of `Height*Weight`

.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/nba_ht_wt.xls";
// Load the variables 'Height' and 'Weight'
// then create a third variable which is the
// interaction between them
X = loadd(fname, "Height*Weight");
// Print the first 5 rows of the 3 specified variables
print X[1:5,.];
```

The code above will print the following output

83 260 21580 74 180 13320 77 215 16555 81 260 21060 81 235 19035

### Example: Interaction term alone

Usually, when we create an interaction term, we will also include the original variables. However, it is sometimes useful to load only the interaction variable. We can specify that we only want the interaction, by using the colon operator, `:`

, in the formula string as shown below.

```
// Create file name with full path
fname = getGAUSSHome() $+ "examples/nba_ht_wt.xls";
// Load only one variable, which is the
// interaction between 'Height' and 'Weight'
X = loadd(fname, "Height:Weight");
// Print the first 5 rows of the 1 specified variable
print X[1:5];
```

The code above will print the following output

21580 13320 16555 21060 19035

## Data transformations

GAUSS allows you to transform your variables when loading, by using a procedure in a formula string.

### Example: Natural log

```
// Create file name with full path to Stata dataset
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load 'price' from 'auto2.dta' and perform
// natural log transform
ln_price = loadd(fname, "ln(price)");
// Print the first 5 rows of 'ln_price'
print ln_price[1:5];
```

The code above will return the following output.

8.3185 8.4657 8.2425 8.4797 8.9653

### Example: The first difference of the natural log

Now let's do something slightly more complicated. Suppose you want to compute the first difference of the natural log of the `price`

variable from the `auto2.dta`

dataset. GAUSS allows you to use any procedure in a formula string **as long as it takes a column vector as the only input and returns a column vector of the same size as the only output**.

So we will first create a procedure to compute the first difference of the natural log. We will call it `lnDiff`

. Then we can use it in our formula string, like this

```
// Define procedure to compute the first
// difference of the natural log of a variable
proc (1) = lnDiff(x);
local ln_x;
// Compute the natural log of the input
ln_x = ln(x);
// Compute the difference of the natural log
// and return the result
retp(ln_x - lag(ln_x));
endp;
// Create file name with full path to Stata dataset
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load the 'price' variable and call
// our 'lnDiff' procedure on it
X = loadd(fname, "lnDiff(price)");
// Print the first 5 observations
print X[1:5];
```

The code above will print the following output. Note that the first observation is a missing value, since we lose one observation when computing the lag.

. 0.1472 -0.2232 0.2372 0.4856

### Conclusion

In this tutorial, we have learned how to

- Load all or a subset of variables with the
`+`

,`-`

and`.`

operators. - Creating dummy variables with the
`factor`

keyword. - Reclassifying string variables to integer categories with the
`cat`

keyword. - Creating interaction terms with the
`*`

and`:`

operators. - Performing data transformations by using GAUSS procedures in formula strings.

from a well-formed, tabular dataset.