## Using Random Forests to Predict Salary

This tutorial explores the use of random forests to predict baseball players' salaries. The examples build on the examples in Chapter 8 of G. James, *et al.* (2013). This model includes 10 predictors: at bats, hits, home runs, runs, RBIs, walks, years, put outs, assists, and errors. This tutorial examines how to:

- Use `trainTestSplit` to split a dataset into random training and testing subsets.
- Specify parameters for random forest models using the `rfControl` structure.
- Fit a random forest regression model from training data using `rfRegressFit`.
- Plot variable importance using `plotVariableImportance`.
- Use `rfRegressPredict` to make predictions from a random forest model.

### Construct training and testing subsets

The **GAUSS Machine Learning (GML)** module includes the `trainTestSplit` function for splitting full datasets into randomly drawn training and testing subsets. The function is fully compatible with the **GAUSS** formula string syntax. This allows subsets to be created without loading the full dataset. In addition, the formula string syntax allows for loading and transforming data in a single line. Detailed information on using formula strings is available in the formula string tutorials.

Using the formula string syntax in `trainTestSplit` requires the specification of three inputs:

- A dataset specification.
- A formula which specifies how to load the data.
- The proportion of data to include in the training dataset.
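Conceptually, this kind of split just shuffles the row indices and divides them by the training proportion. The sketch below illustrates the idea in Python (for intuition only; it is not the GAUSS API, and the function name is made up):

```python
import random

def train_test_split_indices(n_obs, train_prop, seed=42):
    """Shuffle row indices and split them by the training proportion.
    Illustrative stand-in for what a train/test split does internally."""
    rng = random.Random(seed)
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = round(n_obs * train_prop)
    return indices[:n_train], indices[n_train:]

# A 70/30 split of 100 observations
train_idx, test_idx = train_test_split_indices(100, 0.7)
print(len(train_idx), len(test_idx))  # 70 30
```

Because the indices are shuffled before splitting, every observation lands in exactly one of the two subsets.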

The data for this tutorial is stored in the file `islr_hitters.xlsx`. This model uses the natural log of salary as the response variable and the 10 previously mentioned variables as predictors:

$$ \ln(Salary) \sim AtBat + Hits + HmRun + Runs + \\ RBI + Walks + Years + PutOuts + Assists + Errors $$

The procedure `trainTestSplit` returns four outputs: *y_train*, *y_test*, *x_train*, and *x_test*. These contain the response and predictor data for the training and testing subsets, respectively:

```
new;
library gml;
//Load hitters dataset
dataset = getGAUSSHome $+ "pkgs/gml/examples/islr_hitters.xlsx";
//Split data into training and test sets
{ y_train, y_test, x_train, x_test } = trainTestSplit(dataset, "ln(salary) ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + PutOuts + Assists + Errors", 0.7);
```

Running the code above creates the data matrices *y_train*, *y_test*, *x_train*, and *x_test*, with a proportion of 0.7 of the observations placed in the training subset.

### Specify model parameters

The random forest model parameters are specified using the `rfControl` structure. The `rfControl` structure contains the following members:

Member | Description |
---|---|
`numTrees` | Scalar, number of trees (must be an integer). Default = 100. |
`obsPerTree` | Scalar, proportion of observations sampled per tree. Default = 1.0. |
`featuresPerNode` | Scalar, number of features considered at a node. Default = nvars/3. |
`maxTreeDepth` | Scalar, maximum tree depth. Default = unlimited. |
`minObsNode` | Scalar, minimum observations per node. Default = 1. |
`oobError` | Scalar, 1 to compute out-of-bag (OOB) error, 0 otherwise. Default = 0. |
`variableImportanceMethod` | Scalar, method of calculating variable importance. 0 = none, 1 = mean decrease in impurity, 2 = mean decrease in accuracy (MDA), 3 = scaled MDA. Default = 0. |
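The `obsPerTree` and `featuresPerNode` members control the two sources of randomness in a random forest: each tree is grown on a bootstrap sample of the observations, and each node split considers only a random subset of the features. A hedged Python sketch of that sampling step (illustrative only; the function and names below are invented, not GML internals):

```python
import random

def sample_for_tree(n_obs, n_features, obs_per_tree=1.0, features_per_node=None, seed=0):
    """Bootstrap-sample rows (with replacement) and draw a random feature
    subset, mirroring the roles of obsPerTree and featuresPerNode."""
    rng = random.Random(seed)
    n_rows = round(n_obs * obs_per_tree)
    rows = [rng.randrange(n_obs) for _ in range(n_rows)]  # with replacement
    k = features_per_node if features_per_node else max(1, n_features // 3)
    features = rng.sample(range(n_features), k)           # without replacement
    return rows, features

# 200 observations, 10 features; defaults give nvars/3 = 3 features per node
rows, feats = sample_for_tree(n_obs=200, n_features=10)
print(len(rows), len(feats))  # 200 3
```

Observations left out of a tree's bootstrap sample are its "out-of-bag" set, which is what the `oobError` computation is based on.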

Using the `rfControl` structure to change model parameters requires three steps:

- Declare an instance of the `rfControl` structure: `struct rfControl rfc;`
- Fill the members of the `rfControl` structure with default values using `rfControlCreate`: `rfc = rfControlCreate;`
- Change the desired members from their default values: `rfc.oobError = 1;`

The code below puts these three steps together to turn on both the out-of-bag error and variable importance computation:

```
//Use control structure for settings
struct rfControl rfc;
rfc = rfControlCreate;
//Turn on variable importance
rfc.variableImportanceMethod = 1;
//Turn on OOB error
rfc.oobError = 1;
```

### Fit the random forest regression model

Random forest regression models are fit using the **GAUSS** procedure `rfRegressFit`. The `rfRegressFit` procedure takes two required inputs: the training response matrix and the training predictor matrix. In addition, an `rfControl` structure may optionally be included to specify model parameters.

The `rfRegressFit` procedure returns all output to an `rfModel` structure. An instance of the `rfModel` structure must be declared prior to calling `rfRegressFit`. Each instance of the `rfModel` structure contains the following members:

Member | Description |
---|---|
`variableImportance` | Matrix, 1 x p, variable importance measures if computation of variable importance is specified; zero otherwise. |
`oobError` | Scalar, out-of-bag error if OOB error computation is specified; zero otherwise. |
`numClasses` | Scalar, number of classes if classification model; zero otherwise. |
`opaqueModel` | Matrix, contains model details for internal use only. |

The code below fits the random forest model to the training data, *y_train* and *x_train*, which were generated earlier using `trainTestSplit`. In addition, the inclusion of `rfc`, the instance of the previously created `rfControl` structure, results in the computation of both the out-of-bag error and the variable importance.

```
//Output structure
struct rfModel out;
//Fit training data using random forest
out = rfRegressFit(y_train, x_train, rfc);
//OOB Error
print "Out-of-bag error:" out.oobError;
```

The output from the code above:

```
Out-of-bag error: 0.32335252
```

### Plot variable importance

A useful aspect of the random forest model is the variable importance measure, which provides a tool for understanding the relative importance of each predictor in the model. The procedure `plotVariableImportance` plots a pre-formatted bar graph of the variable importance. The procedure takes two inputs, an `rfModel` structure and a string array of variable names:

```
//Set up variable names
names = "AtBat"$|"Hits"$|"HmRun"$|"Runs"$|
"RBI"$|"Walks"$|"Years"$|"PutOuts"$|"Assists"$|"Errors";
//Plot variable importance
plotVariableImportance(out, names);
```

The resulting plot:

### Make predictions

The `rfRegressPredict` function is used after `rfRegressFit` to make predictions from the random forest regression model. The function requires a filled `rfModel` structure and a test set of predictors. The code below computes the predictions, prints the first 10 alongside the observed test values, and compares the random forest MSE to the OLS MSE:

```
//Make predictions using test data
predictions = rfRegressPredict(out, x_test);
//Print first 10 predictions alongside observed values
print predictions[1:10,.]~y_test[1:10,.];
print "random forest MSE: " meanc((predictions - y_test).^2);
//Print ols MSE
b_hat = y_train / (ones(rows(x_train), 1)~x_train);
y_hat = (ones(rows(x_test),1)~x_test) * b_hat;
print "OLS MSE using test data : " meanc((y_hat - y_test).^2);
```
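The OLS baseline in the last three lines fits coefficients by matrix division: in GAUSS, `y / X` solves the least squares problem. An equivalent NumPy sketch on synthetic stand-in data (illustrative only; the data and dimensions below are made up, not the hitters dataset):

```python
import numpy as np

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
x_train = rng.normal(size=(50, 3))
x_test = rng.normal(size=(20, 3))
beta = np.array([1.0, -2.0, 0.5])
y_train = x_train @ beta + rng.normal(scale=0.1, size=50)
y_test = x_test @ beta + rng.normal(scale=0.1, size=20)

# Add an intercept column and solve least squares
# (the GAUSS line `b_hat = y_train / (ones(...)~x_train)` does the same)
X_train = np.column_stack([np.ones(len(x_train)), x_train])
b_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Predict on the test set and compute the test MSE
X_test = np.column_stack([np.ones(len(x_test)), x_test])
y_hat = X_test @ b_hat
print(np.mean((y_hat - y_test) ** 2))
```

Comparing this test MSE against the random forest's test MSE, as the GAUSS code does, shows whether the forest's nonlinear fit improves on a linear baseline.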

The output:

```
6.2060345   6.1633148
6.6061033   6.2146081
5.2212251   4.5163390
6.2907685   6.6200732
5.0122389   4.6051702
4.9657023   4.3174881
5.6711929   6.5510803
5.9064099   5.4806389
5.4789927   4.6051702
6.9234039   6.8023948

random forest MSE: 0.40240875
OLS MSE using test data : 0.58021572
```

Find the full code for this example here.