### Introduction

This tutorial explores the use of random forests (also called decision forests) to predict baseball players' salaries.

The example builds on the examples in Chapter 8 of G. James, *et al.* (2013). The model will include 16 predictors: `AtBat`, `Hits`, `HmRun`, `Runs`, `RBI`, `Walks`, `Years`, `CAtBat`, `CHits`, `CHmRun`, `CRuns`, `CRBI`, `CWalks`, `PutOuts`, `Assists`, and `Errors`.

This tutorial shows how to:

- Load the data and transform the target variable.
- Use `trainTestSplit` to split a dataset into random training and testing subsets.
- Specify parameters for random forest models using the `dfControl` structure.
- Fit a random forest regression model from training data using `decForestRFit`.
- Plot variable importance using `plotVariableImportance`.
- Use `decForestPredict` to make predictions from a random forest model.

## Load the Data

The data for this tutorial is stored in the file `hitters.xlsx`. The model will use the natural log of salary as the response variable and the 16 previously mentioned variables as predictors.

We will use `loadd` with GAUSS's formula string syntax, which allows for loading and transforming data in a single line:

```
new;
library gml;
// Load hitters dataset
dataset = getGAUSSHome $+ "pkgs/gml/examples/hitters.xlsx";
// Load salary and perform natural log transform
y = loadd(dataset, "ln(salary)");
// Load all variables except 'salary'
X = loadd(dataset, ". - salary");
```

## Construct Training and Test Sets

The GAUSS Machine Learning (GML) module includes the `trainTestSplit` function for splitting full datasets into randomly drawn training and testing subsets.

When used with matrix inputs, `trainTestSplit` requires three inputs:

- `y`: The dependent variable.
- `X`: The independent variables.
- `pct`: The proportion of data to include in the training dataset.

The `trainTestSplit` procedure returns four outputs: `y_train`, `y_test`, `X_train`, and `X_test`. These outputs contain the response and predictors for the training and test datasets:

```
// Set seed for repeatable sampling
rndseed 234234;
// Split data into training and test sets
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);
```

## Specify Model Parameters

Since *random forest* is a trademarked term, the GAUSS functions use the term *decision forest*. All of the related structures and functions use the prefix `df`.

The decision forest model parameters are specified using the `dfControl` structure, which contains the following members:

| Member | Description |
|---|---|
| `numTrees` | Scalar, number of trees (must be an integer). Default = 100. |
| `obsPerTree` | Scalar, proportion of observations used per tree. Default = 1.0. |
| `featuresPerNode` | Scalar, number of features considered at a node. Default = nvars/3. |
| `maxTreeDepth` | Scalar, maximum tree depth. Default = unlimited. |
| `minObsNode` | Scalar, minimum observations per node. Default = 1. |
| `oobError` | Scalar, 1 to compute out-of-bag (OOB) error, 0 otherwise. Default = 0. |
| `variableImportanceMethod` | Scalar, method of calculating variable importance: 0 = none, 1 = mean decrease in impurity, 2 = mean decrease in accuracy (MDA), 3 = scaled MDA. Default = 0. |

Using the `dfControl` structure to change the model parameters requires three steps:

- Declare an instance of the `dfControl` structure: `struct dfControl dfc;`
- Fill the members of the `dfControl` structure with default values using `dfControlCreate`: `dfc = dfControlCreate();`
- Change the desired members from their default values: `dfc.oobError = 1;`

The code below puts these three steps together to turn on both the out-of-bag error and variable importance computation:

```
// Declare 'dfc' to be a dfControl structure
// and fill with default settings.
struct dfControl dfc;
dfc = dfControlCreate();
// Turn on variable importance
dfc.variableImportanceMethod = 1;
// Turn on OOB error
dfc.oobError = 1;
```
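Other members from the table above can be adjusted in the same way. The sketch below is purely illustrative — the values are arbitrary examples, not tuned settings for this dataset:

```
// Declare 'dfc' to be a dfControl structure
// and fill with default settings
struct dfControl dfc;
dfc = dfControlCreate();

// Illustrative, arbitrary settings:
// grow 500 trees instead of the default 100
dfc.numTrees = 500;

// limit each tree to a maximum depth of 8
dfc.maxTreeDepth = 8;

// require at least 5 observations per node
dfc.minObsNode = 5;
```

Any member left untouched keeps the default value assigned by `dfControlCreate`.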

## Fitting the Random Forest Regression Model

Random forest (or decision forest) regression models are fit using the GAUSS procedure `decForestRFit`.

The `decForestRFit` procedure takes two required inputs: the training response vector and the training predictor matrix. In addition, the `dfControl` structure may optionally be included to specify model parameters.

The `decForestRFit` procedure returns all output to a `dfModel` structure. An instance of the `dfModel` structure must be declared prior to calling `decForestRFit`. Each instance of the `dfModel` structure contains the following members:

| Member | Description |
|---|---|
| `variableImportance` | Matrix, 1 x p, variable importance measures if the computation of variable importance is specified, zero otherwise. |
| `oobError` | Scalar, out-of-bag error if OOB error computation is specified, zero otherwise. |
| `numClasses` | Scalar, number of classes if a classification model, zero otherwise. |
| `opaqueModel` | Matrix, contains model details for internal use only. |

The code below fits the random forest model to the training data, *y_train* and *X_train*, which were generated earlier using `trainTestSplit`. In addition, passing `dfc`, the instance of the previously created `dfControl` structure, results in the computation of both the out-of-bag error and the variable importance.

```
// Structure to hold model results
struct dfModel mdl;
// Fit training data using random forest
mdl = decForestRFit(y_train, X_train, dfc);
// OOB Error
print "Out-of-bag error:" mdl.oobError;
```

The output from the code above:

Out-of-bag error: 0.22886297

## Plotting Variable Importance

A useful aspect of the random forest model is the variable importance measure, which provides a tool for understanding the relative importance of each predictor in the model. The procedure `plotVariableImportance` plots a pre-formatted bar graph of the variable importance.

The procedure takes a `dfModel` structure as its only input.

```
/*
** Plot variable importance
*/
// Load variable names from dataset
// and assign to dfModel structure
mdl.varNames = getHeaders(dataset);
// Draw variable importance plot
plotVariableImportance(mdl);
```

## Make Predictions

The `decForestPredict` function is used after `decForestRFit` to make predictions from the random forest regression model. The function requires a filled `dfModel` structure and a test set of predictors. The code below computes the predictions, prints the first 5 predictions alongside the observed values, then computes the random forest test MSE and compares it to the OLS test MSE:

```
// Make predictions using test data
predictions = decForestPredict(mdl, X_test);
// Print predictions and decision forest test MSE
print predictions[1:5,.]~y_test[1:5,.];
print "";
print "random forest test MSE: " meanc((predictions - y_test).^2);
// Compute OLS coefficients for comparison; in GAUSS,
// 'y / X' solves the least squares problem y = X*b
b_hat = y_train / (ones(rows(X_train), 1)~X_train);
alpha_hat = b_hat[1];
b_hat = trimr(b_hat, 1, 0);
y_hat = alpha_hat + X_test * b_hat;
print "OLS test MSE :" meanc((y_hat - y_test).^2);
```

The output:

```
 6.8781181  6.0402547
 6.3337887  6.3630281
 6.0366998  5.7838252
 6.2387501  6.6200732
 5.5766902  5.2522734

random forest test MSE: 0.23044959
OLS test MSE : 0.47283085
```
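The test MSE values reported above follow the standard definition of mean squared prediction error over the $n_{\text{test}}$ held-out observations:

```latex
\mathrm{MSE}_{\text{test}}
  = \frac{1}{n_{\text{test}}}
    \sum_{i=1}^{n_{\text{test}}} \left( \hat{y}_i - y_i \right)^2
```

where $\hat{y}_i$ is the model's prediction for the $i$-th test observation. This is exactly what `meanc((predictions - y_test).^2)` computes in the code above.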

### Conclusion

In this post, you've seen how to:

- Load and transform data.
- Create randomly sampled test and training sets.
- Estimate a basic random forest model.
- Draw a variable importance plot.

With just the default settings, the random forest significantly outperformed the OLS model (test MSE of 0.23 versus 0.47).
