Using random forests to predict salary

This tutorial explores the use of random forests to predict baseball players' salaries. The example builds on the examples in Chapter 8 of G. James, et al. (2013). The model includes 10 predictors: at bats, hits, home runs, runs, RBIs, walks, years, put outs, assists, and errors. This tutorial examines how to:

  1. Use trainTestSplit to split a dataset into random training and testing subsets.
  2. Specify parameters for random forest models using the rfControl structure.
  3. Fit a random forest regression model from training data using rfRegressFit.
  4. Plot variable importance using plotVariableImportance.
  5. Use rfRegressPredict to make predictions from a random forest model.

Construct training and testing subsets

The GAUSS Machine Learning (GML) module includes the trainTestSplit function for splitting full datasets into randomly drawn train and test subsets. The function is fully compatible with the GAUSS formula string syntax. This allows subsets to be created without loading the full dataset. In addition, the formula string syntax allows for loading and transforming data in a single line. Detailed information on using formula strings is available in the formula string tutorials.

Using the formula string syntax in trainTestSplit requires the specification of three inputs:

  1. A dataset specification.
  2. A formula which specifies how to load the data.
  3. The proportion of data to include in the training dataset.

The data for this tutorial is stored in the file islr_hitters.xlsx. This model uses the natural log of salary as the response variable and the 10 previously mentioned variables as predictors:

$$ \ln(Salary) \sim AtBat + Hits + HmRun + Runs + RBI + Walks + Years + PutOuts + Assists + Errors $$

The procedure trainTestSplit returns four outputs: y_train, y_test, x_train, and x_test. The y outputs contain the response and the x outputs contain the predictors for the training and testing datasets, respectively:

library gml;

//Load hitters dataset
dataset = getGAUSSHome $+ "pkgs/gml/examples/islr_hitters.xlsx";

//Formula string: log of salary as the response, 10 predictors
formula = "ln(salary) ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + PutOuts + Assists + Errors";

//Split data into training and test sets
{ y_train, y_test, x_train, x_test } = trainTestSplit(dataset, formula, 0.7);

After running the above code, the data matrices y_train, y_test, x_train, and x_test are available in the GAUSS workspace. Note that the training subsets contain a proportion of 0.7 of the observations.

Training and Testing Datasets
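To make the idea of a random split concrete, the sketch below builds a 70/30 split by hand on simulated data, using only base GAUSS functions. It is an illustration under stated assumptions, not the internals of trainTestSplit: the simulated matrices, the seed, and the names idx, idx_train, idx_test, n_train, and prop_train are hypothetical, and trainTestSplit may shuffle or round differently.

```gauss
// Simulated data for illustration only: 100 observations, 3 predictors
rndseed 4352;
x = rndn(100, 3);
y = rndn(100, 1);

// Random permutation of the row indices 1, ..., 100
idx = sortind(rndu(rows(x), 1));

// Number of training observations (70% of the rows, rounded down)
n_train = floor(0.7 * rows(x));

// The first n_train shuffled rows form the training set
idx_train = idx[1:n_train];
x_train = x[idx_train, .];
y_train = y[idx_train, .];

// The remaining shuffled rows form the test set
idx_test = idx[n_train+1:rows(x)];
x_test = x[idx_test, .];
y_test = y[idx_test, .];

// Realized share of observations in the training set
prop_train = rows(x_train) / rows(x);
print "Training proportion: " prop_train;
```

The proportion is stored in a variable before printing because a GAUSS print statement treats spaces as separators between expressions.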

Specify model parameters

The random forest model parameters are specified using the rfControl structure. The rfControl structure contains the following members:

Member Description
numTrees Scalar, number of trees (must be an integer). Default = 100.
obsPerTree Scalar, observations per tree. Default = 1.0.
featuresPerNode Scalar, number of features considered at a node. Default = nvars/3.
maxTreeDepth Scalar, maximum tree depth. Default = unlimited.
minObsNode Scalar, minimum observations per node. Default = 1.
oobError Scalar, 1 to compute OOB error, 0 otherwise. Default = 0.
variableImportanceMethod Scalar, method of calculating variable importance. 0 = none, 1 = mean decrease in impurity, 2 = mean decrease in accuracy (MDA), 3 = scaled MDA. Default = 0.

Using the rfControl structure to change model parameters requires three steps:

  1. Declare an instance of the rfControl structure:
    struct rfControl rfc;
  2. Fill the members in the rfControl structure with default values using rfControlCreate:
    rfc = rfControlCreate;
  3. Change the desired members from their default values:
    rfc.oobError = 1;

The code below puts these three steps together to turn on both the out-of-bag error and variable importance computation:

//Use control structure for settings
struct rfControl rfc;
rfc = rfControlCreate;

//Turn on variable importance
rfc.variableImportanceMethod = 1;

//Turn on OOB error
rfc.oobError = 1;
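Other members listed in the rfControl table can be changed in the same way. The following sketch is illustrative only: the specific values (500 trees, a minimum of 5 observations per node, importance method 2) are assumptions chosen for demonstration, not settings used elsewhere in this tutorial.

```gauss
library gml;

// Declare the control structure and fill it with default values
struct rfControl rfc;
rfc = rfControlCreate;

// Grow more trees than the default of 100 (illustrative value)
rfc.numTrees = 500;

// Require at least 5 observations per node (illustrative value)
rfc.minObsNode = 5;

// Use mean decrease in accuracy (method 2) for variable importance
rfc.variableImportanceMethod = 2;
```

Members that are not assigned keep the defaults set by rfControlCreate.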

Fitting the random forest regression model

Random forest regression models are fit using the GAUSS procedure rfRegressFit. The rfRegressFit procedure takes two required inputs: the training response matrix and the training predictors matrix. In addition, an rfControl structure may optionally be included to specify model parameters.

The rfRegressFit procedure returns all output in an rfModel structure. An instance of the rfModel structure must be declared prior to calling rfRegressFit. Each instance of the rfModel structure contains the following members:

Member Description
variableImportance Matrix, 1 x p, variable importance measure if computation of variable importance is specified, zero otherwise.
oobError Scalar, out-of-bag error if OOB error computation is specified, zero otherwise.
numClasses Scalar, number of classes if classification model, zero otherwise.
opaqueModel Matrix, contains model details for internal use only.

The code below fits the random forest model to the training data, y_train and x_train, which were generated earlier using trainTestSplit. In addition, the inclusion of rfc, the instance of the previously created rfControl structure, results in the computation of both the out-of-bag error and the variable importance.

//Output structure
struct rfModel out;

//Fit training data using random forest
out = rfRegressFit(y_train, x_train, rfc);

//OOB Error
print "Out-of-bag error:" out.oobError;

The output from the code above:

Out-of-bag error:      0.32335252

Plotting variable importance

A useful aspect of the random forest model is the variable importance measure. This measure provides a tool for understanding the relative importance of each predictor in the model. The procedure plotVariableImportance plots a pre-formatted bar graph of the variable importance. The procedure takes two inputs: an rfModel structure and a string array of variable names:

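//Set up variable names, in the same order as the predictors in the formula
names = "AtBat"$|"Hits"$|"HmRun"$|"Runs"$|"RBI"$|"Walks"$|"Years"$|"PutOuts"$|"Assists"$|"Errors";

//Plot variable importance
plotVariableImportance(out, names);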

The resulting plot: Variable Importance

Make predictions

The rfRegressPredict function is used after rfRegressFit to make predictions from the random forest regression model. The function requires a filled rfModel structure and a test set of predictors. The code below computes the predictions, prints the first 10 predictions alongside the observed test values, and compares the random forest MSE to the OLS MSE, both computed on the test data:

//Make predictions using test data
predictions = rfRegressPredict(out, x_test);

//Print predictions
print predictions[1:10,.]~y_test[1:10,.];
print "random forest MSE: " meanc((predictions - y_test).^2);

//Fit OLS on the training data and print its MSE on the test data
b_hat = y_train / (ones(rows(x_train), 1)~x_train);
y_hat = (ones(rows(x_test),1)~x_test) * b_hat;
print "OLS MSE using test data  : " meanc((y_hat - y_test).^2);

The output:

6.2060345        6.1633148
6.6061033        6.2146081
5.2212251        4.5163390
6.2907685        6.6200732
5.0122389        4.6051702
4.9657023        4.3174881
5.6711929        6.5510803
5.9064099        5.4806389
5.4789927        4.6051702
6.9234039        6.8023948
random forest MSE:       0.40240875
OLS MSE using test data  :       0.58021572
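
For reference, the two quantities computed in the code above can be written out as follows. The notation is introduced here for illustration and is not part of the original example: $X$ denotes a predictor matrix with a leading column of ones, $y$ the log-salary response, $\hat{y}_i$ the prediction for test observation $i$, and $n_{test}$ the number of test observations.

$$ \hat{\beta} = \arg\min_{\beta} \lVert y_{train} - X_{train}\beta \rVert^2 $$

$$ MSE = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left( \hat{y}_i - y_{test,i} \right)^2 $$

Because the response is ln(salary), both MSE values are in squared log-salary units. In the output above, the random forest test MSE (0.40240875) is roughly 31% lower than the OLS test MSE (0.58021572).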

© Aptech Systems, Inc. All rights reserved.
