Fundamentals of Tuning Machine Learning Hyperparameters

Introduction

Machine learning algorithms often rely on hyperparameters that can substantially impact model performance. These hyperparameters are not learned from the data; they are modeling choices that practitioners must make.

An important step in machine learning modeling is optimizing model hyperparameters to improve prediction accuracy.

In today's blog, we will cover some fundamentals of hyperparameter tuning using our previous decision forest, or random forest, model.

Model Performance

Before we consider how to fit the best machine learning model, we need to look at what it means to be the best model.

First, we must keep in mind that the most common goal in machine learning is to create an algorithm that will create accurate predictions based on unseen data. How successful an algorithm is at achieving this goal is reflected in the out-of-sample, or generalization, error.

The error of a machine learning model can be broken into two main components: bias and variance.

Bias: The error that occurs when we fit a simple model to a more complex data-generating process. A model with high bias will underfit the training data.
Variance: The expected prediction error that occurs when we apply our model to new data that the model has not seen. A model with high variance will usually overfit the training data, which results in lower training set error but higher error on any data not used for training.

Because of these two sources of error, fitting machine learning models requires finding the right model complexity without overfitting our training data.
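
For squared-error loss, this trade-off can be summarized by the standard decomposition of expected prediction error:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big[\hat{f}(x)\big]^2 + \text{Var}\big[\hat{f}(x)\big] + \sigma^2$$

where $\sigma^2$ is the irreducible noise in the data-generating process. Reducing one term often increases the other, which is why model complexity must be chosen carefully.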

Model Performance Measures

There are a number of methods for evaluating the performance of machine learning models. Ultimately, which performance measure is used should be based on business or research objectives.

Common Performance Measures

Method | Description | Uses
Mean Squared Error (MSE) | The average of the squared distance between the target value and the value predicted by the model. | Regression models
Mean Absolute Error (MAE) | The average of the absolute distance between the target value and the value predicted by the model. | Regression models
Root Mean Squared Error (RMSE) | The square root of the mean squared error. | Regression models
Accuracy | The number of correct predictions divided by the total number of predictions. | Classification models
Precision | The ratio of true positives to total predicted positives. | Classification models
Recall | The number of true positives divided by the sum of true positives and false negatives. | Classification models
F1-score | The harmonic mean of precision and recall. | Classification models
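
For reference, the regression measures in the table above are straightforward to compute directly in GAUSS. A minimal sketch, using small hypothetical target and prediction vectors:

// Hypothetical target values and model predictions
y = { 1.2, 0.4, -0.3, 2.1 };
yhat = { 1.0, 0.6, -0.1, 1.8 };

// Mean squared error: average squared distance
mse = meanc((y - yhat).*(y - yhat));

// Mean absolute error: average absolute distance
mae = meanc(abs(y - yhat));

// Root mean squared error: square root of the MSE
rmse = sqrt(mse);

Later in this post we use the meanSquaredError procedure from the GAUSS Machine Learning library, which performs the first of these calculations for us.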

Tuning Parameters

Adjusting hyperparameters is one important way that we can impact the performance of machine learning models. Hyperparameters are parameters that:

  • Are set before the model is trained and are not learned from the data.
  • Determine how the model learns from the data.
  • May need to be readjusted to maintain optimal performance as more data is collected.

Example Hyperparameters

Model | Hyperparameter
K-nearest neighbors | The number of neighbors, $k$, used for classification.
Ridge regression | $\lambda$, the weight on the L2 penalty.
Gradient boosting machines | The number of trees, the shrinkage parameter, and the number of splits in each tree.
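
As a concrete example, in ridge regression the hyperparameter $\lambda$ enters the objective function directly:

$$\hat{\beta} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left(y_i - x_i'\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

The data alone cannot tell us which $\lambda$ to use; larger values shrink the coefficients more heavily, and the best amount of shrinkage must be found by tuning.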

Hyperparameters can have a big impact on how well a model performs. For this reason, it is important to systematically and strategically optimize hyperparameters using hyperparameter tuning.

Some popular methods for hyperparameter tuning include:

  1. Grid Search: This is a simple but effective method where you specify a set of values for each hyperparameter, and the algorithm tries all possible combinations of values. This can be time-consuming, but it guarantees that you'll find the best set of hyperparameters within the specified options.

  2. Random Search: This method randomly selects values for each hyperparameter from a specified range. It can be faster than grid search, especially when there are many hyperparameters, but it is not guaranteed to find the best combination (a minimal sketch follows this list).

  3. Bayesian Optimization: This is a more advanced method that uses probability models to choose the next set of hyperparameters to test. It takes into account the results of previous tests to choose values that are more likely to result in better performance.

  4. Evolutionary Algorithms: This method simulates evolution by creating a population of potential solutions (sets of hyperparameters) and selecting the best ones to "breed" new solutions. This process continues until a good solution is found.
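
To make the difference concrete, here is a minimal sketch of a random search over a single hyperparameter in GAUSS. It assumes the same training/testing data and dfControl workflow used in the examples below; the number of draws and the candidate range are arbitrary choices for illustration.

// Random search sketch: draw candidate values for
// featuresPerSplit and keep the one with the lowest test MSE.
// Assumes y_train, y_test, X_train, X_test and a dfControl
// structure 'dfc' filled with dfControlCreate(), as below.
nDraws = 10;
best_mse = 1e30;
best_fps = 0;

for i(1, nDraws, 1);

    // Draw a random integer between 1 and the number of features
    fps = ceil(rndu(1, 1) * cols(X_train));
    dfc.featuresPerSplit = fps;

    // Fit the model and compute the testing MSE
    struct dfModel mdl;
    mdl = decForestRFit(y_train, X_train, dfc);
    mse_i = meanSquaredError(y_test, decForestPredict(mdl, X_test));

    // Keep the best candidate seen so far
    if mse_i < best_mse;
        best_mse = mse_i;
        best_fps = fps;
    endif;

endfor;

print "Best features per split found:";; best_fps;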

Examples

Today we will consider two examples of hyperparameter tuning. For each example we:

  1. Use a decision forest model, similar to the one we previously built to predict the U.S. output gap.
  2. Perform a grid search to determine the best hyperparameter value or values.
  3. Use mean squared error as our model performance measure.

The Model

Our model:

  • Uses a combination of common economic indicators and GDP subcomponents as predictors of the CBO-based U.S. output gap.
  • Uses a 70/30 training and testing split without shuffling.
  • Is estimated using the GAUSS Machine Learning library.

When tuning a decision forest model, there are several hyperparameters that can be considered.

Decision Forest Hyperparameters

Parameter | Description | Impact
Number of trees | The number of decision trees that are trained and combined to make predictions. | Increasing the number of trees can lead to better performance, but it also increases training time and memory requirements.
Maximum depth | The maximum depth, or number of splits, of each decision tree. | A deeper tree can capture more complex relationships in the data, but it can also overfit and perform poorly on new data.
Observations per tree | The percentage of observations used to build each tree. | Increasing this percentage can improve accuracy, but it can also increase computational cost, reduce interpretability, and lead to overfitting or a loss of diversity among trees.
Minimum observations per node | The minimum number of observations required at a leaf node. | Increasing this value can help prevent overfitting, but it can also result in a less complex model.
Maximum features | The maximum number of features that can be used to split each node. | Limiting the number of features can help prevent overfitting and reduce training time, but it can also result in a less accurate model.
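
Several of these hyperparameters correspond to members of the dfControl structure, which we use below to pass settings to the estimation procedure. A minimal sketch showing only the members that appear in this post (the values are arbitrary):

// Declare and fill a dfControl structure
struct dfControl dfc;
dfc = dfControlCreate();

// Members used later in this post (illustrative values)
dfc.featuresPerSplit = 4;    // maximum features considered at each split
dfc.minObsLeaf = 5;          // minimum observations per leaf node
dfc.pctObsPerTree = 0.9;     // share of observations sampled for each tree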

Example One: Tuning a Single Parameter

In our first example, we will use a grid search to tune the number of features used for splitting each node. We will hold all other parameters constant at the GAUSS default values.

Parameter | GAUSS Default
Number of trees | 100
Maximum tree depth | Unlimited
Percentage of observations per tree | 100%
Minimum observations per leaf | 1
Maximum features | $\frac{\text{Number of Variables}}{3}$

The dfControl Structure

The dfControl structure is an optional argument used to pass hyperparameter values to the decForestRFit and decForestCFit procedures.

Using the structure to change hyperparameters requires three steps:

  1. Declare an instance of the dfControl structure using the struct keyword.
  2. Fill the default values for the members using the dfControlCreate procedure.
  3. Set the desired parameter value using GAUSS "dot" (.) notation.

// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

// Specify features per node
dfc.featuresPerSplit = 4;

Loading and Splitting our Data

The first step for our hyperparameter tuning example is to load our data and split it into training and testing datasets. We do this using the loadd procedure to load our data and the trainTestSplit procedure to split it.

/*
** Load and split
*/
library gml;

// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");

/*
** Split data into 70% training and 30% testing sets 
** without shuffling.
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);

Setting Non-Tuning Parameters

Next, we will set the non-tuning hyperparameters to the GAUSS defaults using the dfControl structure.

/*
** Settings for decision forest
*/
// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

Now that we've set our default non-tuning parameters we will perform our grid search to tune the features per node. The first step is to initialize our grid and storage matrices.

/*
** Initialize grid and
** storage matrices
*/
// Create vector of possible
// features per node values
featuresPerSplit = seqa(1, 1, cols(X));

// Create storage dataframe for MSE
// with one column for training mse
// and one column for testing mse
mse = asDF(zeros(rows(featuresPerSplit), 2), "Train", "Test");

Next, we will loop over each possible value of features per split. For each potential value we:

  1. Fit decision forest model using the training data.
  2. Predict outcomes using the training data.
  3. Predict outcomes using the testing data.
  4. Compute the MSE for both the training and testing predictions.
  5. Store the MSE values.

// Loop over all potential values
// of features per node
for i(1, rows(featuresPerSplit), 1);

    // Set featuresPerSplit parameter
    dfc.featuresPerSplit = featuresPerSplit[i];

    /*
    ** Decision Forest Model
    */
    // Declare 'mdl' to be an instance of a
    // dfModel structure to hold the estimation results
    struct dfModel mdl;

    // Fit the model with default settings
    mdl = decForestRFit(y_train, X_train, dfc);

    // Make predictions using training data
    df_prediction_train = decForestPredict(mdl, X_train);

    // Make predictions using testing data
    df_prediction_test = decForestPredict(mdl, X_test);

    /*
    ** Compute and store mse
    */
    // Training set MSE
    mse[i, "Train"] = meanSquaredError(y_train, df_prediction_train);

    // Testing set MSE
    mse[i, "Test"] = meanSquaredError(y_test, df_prediction_test);

endfor;

Note that within our loop we use the GML procedure meanSquaredError to compute our MSE.

Results

A visualization of our MSE values gives us some insight into what happens as we increase the features per node in our decision forest model:

Training and testing MSE as the features per node changes in a random forest model.

  • As we increase the features per node up to about 5 or 6, we see a general downward trend in both the training and testing MSE. Over this range, the additional features per node allow the model to capture more complex interactions and dependencies in the data.
  • Increasing the features per node beyond 6 results in a general upward trend in testing MSE and a downward trend in training MSE. This points to overfitting: the model fits the training data too closely, capturing noise and irrelevant patterns, which leads to worse performance on the unseen testing data.
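
A plot like the one above can be reproduced with the GAUSS plotXY procedure. A minimal sketch, assuming the featuresPerSplit vector and mse dataframe from the loop above:

// Plot training and testing MSE against the
// candidate features-per-split values
plotXY(featuresPerSplit, asmatrix(mse));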

To confirm our optimal features per node parameter setting, we can locate the minimum testing MSE:

// Find the row index of the lowest MSE
idx = minindc(mse[., "Test"]);

// NOTE: two semi-colons at the end of a print statement
//       prevents it from printing a newline at the end
print "Optimal features per node: ";; featuresPerSplit[idx];
print "Minimum test MSE:";; asmatrix(mse[idx, "Test"]);

This confirms that the optimal features per node is 6, with a testing MSE of 3.212.

Optimal features per node:        6.0000000
Minimum test MSE:       3.2122050 

Example Two: Simultaneously Tuning Hyperparameters

Now that we've seen how to tune a single hyperparameter, let's look at tuning two hyperparameters simultaneously. We will use the same data and setup as in our previous example:

Data loading and preliminary setup

/*
** Load and split
*/
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");

/*
** Split data into 70% training and 30% testing sets 
** without shuffling
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);

/*
** Settings for decision forest
*/
// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

// Set features per split
dfc.featuresPerSplit = 6;

Performing Grid Search

In this example, we will tune:

  • The minimum observations per leaf, ranging from 1 to 20.
  • The percentage of the observations per tree, ranging from 70% to 100%.

First, we initialize our grid and storage matrices. For this example, we will focus only on our testing MSE.

/*
** Initialize grid and
** storage matrices
*/
// Set potential values for 
// minimum observations per node
minObsLeaf = seqa(1, 1, 20);

// Set potential values for 
// percentage of observations
// in tree
pctObs = seqa(0.7, 0.1, 4);

// Storage matrices
test_mse = zeros(rows(minObsLeaf), rows(pctObs));

Next, we use nested for loops to search over all combinations of the minimum observations per leaf and the percentage of observations per tree.

for i(1, rows(minObsLeaf), 1);

    // Set the minimum obs per leaf
    dfc.minObsLeaf = minObsLeaf[i];

    for j(1, rows(pctObs), 1);

        // Set percentage of obs used for each tree
        dfc.pctObsPerTree = pctObs[j];

        /*
        ** Decision Forest Model
        */
        // Declare 'mdl' to be an instance of a
        // dfModel structure to hold the estimation results
        struct dfModel mdl;

        // Estimate the model with default settings
        mdl = decForestRFit(y_train, X_train, dfc);

        // Make predictions using testing data
        df_prediction_test = decForestPredict(mdl, X_test);

        /*
        ** Compute and store mse
        */
        // Testing set MSE
        test_mse[i, j] = meanSquaredError(y_test, df_prediction_test);

    endfor;
endfor;

Note that in this loop:

  • We use i, from the outer loop, to index the minObsLeaf vector.
  • We use j, from the inner loop, to index the pctObs vector.
  • Each row in our storage matrix represents a constant minimum observations per leaf.
  • Each column in our storage matrix represents a constant percentage of observations per tree (a small labeling sketch follows this list).
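
Because test_mse is a plain matrix, it can be convenient to attach column labels before inspecting it. A small optional sketch using asDF, with hypothetical label names for the four sampling percentages:

// Attach readable column labels for the four pctObs values
mse_df = asDF(test_mse, "pct70", "pct80", "pct90", "pct100");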

Results

Test MSE for a random forest model with varying hyperparameters.

The above plot shows us that with the GAUSS default settings for a random forest and featuresPerSplit set to 6:

  • Taking a sample of 100% of the data for the creation of each tree is almost always best.
  • Setting minObsLeaf to between 5 and 10 seems best, with the minimum at about 7 (see the quick check after this list).
  • We did not get much of an improvement in our test MSE over the first example.
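
To see this in the raw numbers, we can print the testing MSE for the 100% observations-per-tree setting (the last column of test_mse) next to each candidate minObsLeaf value:

// Testing MSE for the 100% observations-per-tree setting,
// alongside the corresponding minObsLeaf values
print minObsLeaf~test_mse[., cols(test_mse)];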

Optional: Finding the minimum MSE value in the output matrix

The final step is to find our optimal hyperparameter settings by locating the combination of parameters that yields the lowest MSE.

We can break this into two steps. First, we find the column that contains the minimum value.

// Create a column vector with the minimum MSE
// values for each column
mse_col_mins = minc(test_mse);

// Find the index of the smallest
// value in 'mse_col_mins'
idx_col_min = minindc(mse_col_mins);

Now that we have found which column contains the minimum MSE value, we use minindc again to find the row of the smallest value within that column.

// Find the row that contains the smallest MSE value
idx_row_min = minindc(test_mse[.,idx_col_min]);

// Extract the lowest MSE across all
// combinations of tuning parameters
MSE_optimal = test_mse[idx_row_min, idx_col_min];

// Print results
sprintf( "Minimum testing MSE: %4f", MSE_optimal);
print "Minimum MSE occurs with";
sprintf("  minimum samples per leaf      : %d", minObsLeaf[idx_row_min]);
sprintf("  percentage of samples per tree: %g%%", 100 * pctObs[idx_col_min]); 

This prints our results:

Minimum testing MSE: 3.151047
Minimum MSE occurs with
  minimum observations per leaf       : 7
  percentage of observations per tree : 100%

Conclusion

Today's blog demonstrates how practitioners can tune hyperparameters to improve machine learning models. Taking the time to systematically and strategically select hyperparameter values can greatly improve model performance.

Stay tuned, because next time we will take a deeper dive into how to think about the data and which hyperparameter settings make sense to try out.
