<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Machine learning &#8211; Aptech</title>
	<atom:link href="https://www.aptech.com/blog/category/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aptech.com</link>
	<description>GAUSS Software - Fastest Platform for Data Analytics</description>
	<lastBuildDate>Thu, 20 Mar 2025 21:15:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Announcing the GAUSS Machine Learning Library</title>
		<link>https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/</link>
					<comments>https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Mon, 28 Aug 2023 14:36:25 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Releases]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11584015</guid>

					<description><![CDATA[The new GAUSS Machine Learning (GML) library offers powerful and efficient machine learning techniques in an accessible and friendly environment. Whether you're just getting familiar with machine learning or you're an experienced practitioner, you'll be running models in no time with GML. ]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>The new GAUSS Machine Learning (GML) library offers powerful and efficient machine learning techniques in an accessible and friendly environment. Whether you're just getting familiar with machine learning or you're an experienced practitioner, you'll be running models in no time with GML. </p>
<h2 id="machine-learning-models-at-your-fingertips">Machine Learning Models at Your Fingertips</h2>
<p>With the GAUSS Machine Learning library, you can run machine learning models out of the box, even without any machine learning background. It supports fundamental machine learning models for classification and regression, including:</p>
<ul>
<li><a href="https://docs.aptech.com/gauss/logisticregfit.html" target="_blank" rel="noopener">Logistic regression</a>.</li>
<li><a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener">LASSO</a> and <a href="https://docs.aptech.com/gauss/ridgefit.html" target="_blank" rel="noopener">ridge</a> regression.</li>
<li><a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener">Decision forests</a>.</li>
<li><a href="https://docs.aptech.com/gauss/pcafit.html" target="_blank" rel="noopener">Principal component analysis</a>.</li>
<li><a href="https://docs.aptech.com/gauss/knnfit.html" target="_blank" rel="noopener">K-nearest neighbors</a>.</li>
<li><a href="https://docs.aptech.com/gauss/kmeansfit.html" target="_blank" rel="noopener">K-means clustering</a>.</li>
</ul>
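<p>As a minimal sketch of the workflow (the procedure name comes from the linked documentation, but the output struct name, filename, and argument list shown here are assumptions; consult the docs for the exact signatures):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">new;
library gml;

// Load a dataset (hypothetical filename)
data = loadd("mydata.gdat");

// Separate the target from the features
y = data[., "label"];
X = delcols(data, "label");

// Fit a logistic regression classifier
struct logisticRegModel mdl;
mdl = logisticRegFit(y, X);</code></pre>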
<p><a href="https://docs.aptech.com/gauss/plotlr.html" target="_blank" rel="noopener"><img src="https://docs.aptech.com/gauss/_images/lassofit.jpg" width="800" height="600" alt="LASSO regression coefficient response plot." class="aligncenter size-full" /></a></p>
<h2 id="quick-and-painless-data-preparation-and-management">Quick and Painless Data Preparation and Management</h2>
<p>We know model fitting and prediction are just the tip of the iceberg when it comes to any data analysis project. That's why we've focused on making GAUSS one of the best environments for data import, cleaning, and exploration. </p>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/11/g22-donor-strclean-2-frames-npup2.gif"><img src="https://www.aptech.com/wp-content/uploads/2021/11/g22-donor-strclean-2-frames-npup2.gif" alt="" width="605" height="374" class="size-full wp-image-11581929" /></a></p>
<p>GML provides machine learning specific data preparation tools including:</p>
<ul>
<li><a href="https://docs.aptech.com/gauss/onehot.html" target="_blank" rel="noopener">One-hot encoding</a>.  </li>
<li><a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener">Testing and training data splits</a>. </li>
<li><a href="https://docs.aptech.com/gauss/cvsplit.html" target="_blank" rel="noopener">Cross-validation splits</a>. </li>
<li>Internal data scaling. </li>
</ul>
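<p>As a sketch of how these tools fit together (the argument order and returns shown here are assumptions; see the linked documentation for the exact signatures):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Convert a categorical column to dummy variables
class_dummies = oneHot(data[., "Class"]);

// Hold out 30% of observations for testing
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);</code></pre>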
<p>See how GAUSS reduces the pain and time of data wrangling and lets you get to the heart of your machine learning models more quickly. </p>
<h2 id="easy-to-implement-model-evaluation">Easy to Implement Model Evaluation</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/08/classification-statistics.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/08/classification-statistics.jpg" alt="GAUSS classification metrics from machine learning library. " width="574" height="406" class="aligncenter size-full wp-image-11584020" /></a></p>
<p>Compare and evaluate machine learning models with GML's plotting and performance evaluation tools:</p>
<ul>
<li><a href="https://docs.aptech.com/gauss/plotclasses.html" target="_blank" rel="noopener">Data class plots</a>. </li>
<li><a href="https://docs.aptech.com/gauss/meansquarederror.html" target="_blank" rel="noopener">Model mean squared error</a>. </li>
<li><a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener">Classification metrics</a>.</li>
<li><a href="https://docs.aptech.com/gauss/plotvariableimportance.html" target="_blank" rel="noopener">Variable importance tables</a>. </li>
</ul>
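<p>After fitting a model, evaluation is typically a single call. In this sketch, <code>y_test</code> and <code>predictions</code> are hypothetical variables from an earlier train/test split and prediction step:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Compare observed and predicted classes
call classificationMetrics(y_test, predictions);</code></pre>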
<div style="text-align:center;background-color:#f0f2f4"><hr>Interested in how GAUSS machine learning can work for you? <a href="https://www.aptech.com/contact-us/" target="_blank" rel="noopener">Contact Us</a><hr></div>
<h2 id="unparalleled-customer-support">Unparalleled Customer Support</h2>
<p>We pride ourselves on offering unparalleled customer support and we truly care about your success. If you can't find what you need in our <a href="https://docs.aptech.com/gauss/" target="_blank" rel="noopener">online documents</a>, <a href="https://www.aptech.com/questions/" target="_blank" rel="noopener">user forum</a>, or <a href="https://www.aptech.com/blog/" target="_blank" rel="noopener">blog</a>, you can be confident that a GAUSS expert is here to quickly resolve your questions.</p>
<h2 id="see-it-in-action">See It In Action</h2>
<p>Want to see GML in action? Check out these real-world applications:</p>
<ol>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification With Regularized Logistic Regression</a>.</li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>.</li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>.</li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>.</li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>.</li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>. </li>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions With Machine Learning Techniques</a>.</li>
</ol>
<h2 id="try-out-gauss-machine-learning">Try out GAUSS Machine Learning</h2>

]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Classification with Regularized Logistic Regression</title>
		<link>https://www.aptech.com/blog/classification-with-regularized-logistic-regression/</link>
					<comments>https://www.aptech.com/blog/classification-with-regularized-logistic-regression/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Wed, 07 Jun 2023 15:59:02 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583861</guid>

					<description><![CDATA[Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics. 

In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>, including:
<ol><li>Data preparation.</li>
<li>Model fitting.</li>
<li>Classification predictions. </li>
<li>Evaluating predictions and model fit. </li>
</ol>]]></description>
										<content:encoded><![CDATA[    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: inherit !important;
            stroke: inherit !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // calculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
<h3 id="introduction">Introduction</h3>
<p>Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics. </p>
<p>In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>, including:</p>
<ol>
<li>Data preparation.</li>
<li>Model fitting.</li>
<li>Classification predictions. </li>
<li>Evaluating predictions and model fit. </li>
</ol>
<h2 id="what-is-logistic-regression">What is Logistic Regression?</h2>
<p>Logistic regression is a statistical method that can be used to predict the probability of an event occurring based on observed features or variables. The predicted probabilities can then be used to classify the data based on probability thresholds. </p>
<p>For example, if we are modeling a &quot;TRUE&quot; and &quot;FALSE&quot; outcome, we may predict that an outcome will be &quot;TRUE&quot; for all predicted probabilities of 0.5 and higher. </p>
<p>Mathematically, logistic regression models the relationship between the probability of an outcome as a logistic function of the independent variables:</p>
<p>$$ Pr(Y = 1 | X) = p(X) = \frac{e^{B_0 + B_1X}}{1 + e^{B_0 + B_1X}} $$</p>
<p>The equivalent log-odds (logit) representation is often preferred because it is linear in our independent variables:</p>
<p>$$ \log \bigg( \frac{p(X)}{1 - p(X)} \bigg) = B_0 + B_1X $$</p>
<p>There are some important aspects of this model to keep in mind:</p>
<ul>
<li>The logistic regression model always yields a prediction between 0 and 1.</li>
<li>The magnitude of the coefficients in the logistic regression model cannot be as directly interpreted as in the classic linear model. </li>
<li>The signs of the coefficients in the logistic regression model can be interpreted as expected. For example, if the coefficient on $X_1$ is negative we can conclude that increasing $X_1$ decreases $p(X)$. </li>
</ul>
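<p>As a quick worked example with illustrative values $B_0 = -1$, $B_1 = 0.5$, and $X = 4$, the linear index is $B_0 + B_1X = 1$, so</p>
<p>$$ p(X) = \frac{e^{1}}{1 + e^{1}} \approx 0.73 $$</p>
<p>With a 0.5 threshold, this observation would be classified as a positive outcome.</p>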
<h2 id="logistic-regression-with-regularization">Logistic Regression with Regularization</h2>
<p>One potential pitfall of logistic regression is its tendency for overfitting, particularly with high dimensional feature sets. </p>
<p>Regularization with L1 and/or L2 penalty terms can help prevent overfitting and improve prediction. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="comparison-of-l1-and-l2-regularization"><span style="color:#FFFFFF">Comparison of L1 and L2 Regularization</span></h3>
      </th>
   </tr>
</thead>
<tbody>
<tr><th></th><th>$L1$ penalty (Lasso)</th><th>$L2$ penalty (Ridge)</th></tr>
<tr><td>Penalty term</td><td>$\lambda \sum_{j=1}^p |\beta_j|$</td><td>$\lambda \sum_{j=1}^p \beta_j^2$</td></tr>
<tr><td>Robust to outliers</td><td>✓</td><td></td></tr>
<tr><td>Shrinks coefficients</td><td>✓</td><td>✓</td></tr>
<tr><td>Can select features</td><td>✓</td><td></td></tr>
<tr><td>Sensitive to correlated features</td><td>✓</td><td></td></tr>
<tr><td>Useful for preventing overfitting</td><td>✓</td><td>✓</td></tr>
<tr><td>Useful for addressing multicollinearity</td><td></td><td>✓</td></tr>
<tr><td>Requires hyperparameter selection (λ)</td><td>✓</td><td>✓</td></tr>
</tbody>
</table>
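<p>Concretely, regularized logistic regression adds the penalty terms above to the negative log-likelihood $-\ell(\beta)$. In the combined (elastic net) form,</p>
<p>$$ \min_{\beta} \ -\ell(\beta) + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 $$</p>
<p>where setting $\lambda_2 = 0$ gives the Lasso and setting $\lambda_1 = 0$ gives ridge.</p>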
<div class="alert alert-info" role="alert">Our previous blog, <a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">&quot;Predicting the Output Gap With Machine Learning Regression Models&quot;</a> provides a more detailed look at L1 and L2 regularization.</div>
<h2 id="predicting-customer-satisfaction-using-survey-data">Predicting Customer Satisfaction Using Survey Data</h2>
<p>Today we will use <a href="https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction" target="_blank" rel="noopener">airline passenger satisfaction data</a> to demonstrate logistic regression with regularization. </p>
<p>Our task is to predict passenger satisfaction using:</p>
<ul>
<li>Available <a href="https://www.aptech.com/blog/getting-started-with-survey-data-in-gauss/" target="_blank" rel="noopener">survey answers</a>. </li>
<li>Flight information. </li>
<li>Passenger characteristics.</li>
</ul>
<div>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C; position: sticky; top: 0;"><span style="color:#FFFFFF">Variable</span></th><th style="background-color: #36434C; position: sticky; top: 0;"><span style="color:#FFFFFF">Description</span></th>
 </tr>
</thead>
<tbody>
<tr><td>id</td><td>Responder identification number</td></tr>
<tr><td>Gender</td><td>Gender identification: Female or Male.</td></tr>
<tr><td>Customer Type</td><td>Loyal or disloyal customer.</td></tr>
<tr><td>Age</td><td>Customer age in years.</td></tr>
<tr><td>Type of travel</td><td>Personal or business travel.</td></tr>
<tr><td>Class</td><td>Eco or business class seat.</td></tr>
<tr><td>Flight Distance</td><td>Flight distance in miles.</td></tr>
<tr><td>Wifi service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Schedule convenient</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Ease of Online booking</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Gate location</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Food and drink</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Seat comfort</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Online boarding</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Inflight entertainment</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>On-board service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Leg room service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Baggage handling</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Checkin service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Inflight service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Cleanliness</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Departure Delay in minutes</td><td>Minutes delayed when departing.</td></tr>
<tr><td>Arrival Delay in minutes</td><td>Minutes delayed when arriving.</td></tr>
<tr><td>satisfaction</td><td>Overall airline satisfaction. Possible responses include "satisfied" or "neutral or dissatisfied".</td></tr>
</tbody>
</table>
</div>
<p><br>
The first step in our analysis is to load our data using <a href="https://docs.aptech.com/gauss/loadd.html" target="_blank" rel="noopener"><code>loadd</code></a>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">new;
library gml;
rndseed 8906876;

/*
** Load datafile
*/
// Set path and filename
load_path = "data/";
fname = "airline_satisfaction.gdat";

// Load data
airline_data = loadd(load_path $+ fname);

// Split data
y = airline_data[., "satisfaction"];
X = delcols(airline_data, "satisfaction"$|"id");</code></pre>
<h3 id="data-exploration">Data Exploration</h3>
<p>Before we begin modeling, let's do some preliminary <a href="https://docs.aptech.com/gauss/data-management/data-exploration.html" target="_blank" rel="noopener">data exploration</a>. First, let's check for common issues that can arise with survey data. </p>
<p>We'll check for:</p>
<ul>
<li>Duplicate observations using <a href="https://docs.aptech.com/gauss/dstatmt.html" target="_blank" rel="noopener"><code>isunique</code></a>.</li>
<li><a href="https://www.aptech.com/blog/introduction-to-handling-missing-values/" target="_blank" rel="noopener">Missing values</a> using <a href="https://docs.aptech.com/gauss/dstatmt.html" target="_blank" rel="noopener"><code>dstatmt</code></a>.</li>
</ul>
<p>First, we'll check for duplicates, so any duplicates can be removed prior to checking our summary statistics:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check for duplicates
isunique(airline_data);</code></pre>
<p>The <code>isunique</code> procedure returns a 1 if the data is unique and 0 if there are duplicates.</p>
<pre>1.00000000</pre>
<p>In this case, it indicates that we have no duplicates in our data.</p>
<p>Next, we'll check for missing values:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Check for data cleaning
** issues
*/
// Summary statistics
call dstatmt(airline_data);</code></pre>
<p>This prints <a href="https://www.aptech.com/resources/tutorials/formula-string-syntax/descriptive-statistics-from-a-dataset/" target="_blank" rel="noopener">summary statistics</a> for all variables:</p>
<pre>Variable                       Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------------

Gender                        -----       -----         -----      Female        Male    103904    0
Customer Type                 -----       -----         -----  Loyal Cust  disloyal C    103904    0
Age                           39.38       15.11         228.5           7          85    103904    0
Type of Travel                -----       -----         -----  Business t  Personal T    103904    0
Class                         -----       -----         -----    Business    Eco Plus    103904    0
Flight Distance                2108        1266     1.603e+06           0        3801    103904    0
Wifi service                  -----       -----         -----           0           5    103904    0
Schedule convenient           -----       -----         -----           0           5    103904    0
Ease of Online booking        -----       -----         -----           0           5    103904    0
Gate location                 -----       -----         -----           0           5    103904    0
Food and drink                -----       -----         -----           0           5    103904    0
Online boarding               -----       -----         -----           0           5    103904    0
Seat comfort                  -----       -----         -----           0           5    103904    0
Inflight entertainment        -----       -----         -----           0           5    103904    0
Onboard service               -----       -----         -----           0           5    103904    0
Leg room service              -----       -----         -----           0           5    103904    0
Baggage handling              -----       -----         -----           1           5    103904    0
Checkin service               -----       -----         -----           0           5    103904    0
Inflight service              -----       -----         -----           0           5    103904    0
Cleanliness                   -----       -----         -----           0           5    103904    0
Departure Delay in Minutes    14.82       38.23          1462           0        1592    103904    0
Arrival Delay in Minutes      15.25       38.81          1506           0        1584    103904    0
satisfaction                  -----       -----         -----  neutral or   satisfied    103904    0 </pre>
<p>The summary statistics give us some useful insights:</p>
<ul>
<li>There are no missing values in our dataset.</li>
<li>The summary statistics of our numerical variables don't indicate any obvious outliers. </li>
<li>All categorical survey data ranges from 0 to 5 with the exception of <code>Baggage handling</code> which ranges from 1 to 5. All <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical variables</a> will need to be converted to dummy variables prior to modeling. </li>
</ul>
<p>One other observation from our summary statistics is that many of the variable names are longer than necessary. Long variable names can be:</p>
<ul>
<li>Difficult to remember.</li>
<li>Prone to typos.</li>
<li>Cut off when printing results.</li>
</ul>
<p>(Not to mention they can be annoying to type!)</p>
<p>Let's streamline our variable names using <a href="https://docs.aptech.com/gauss/dfname.html" target="_blank" rel="noopener"><code>dfname</code></a>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Update variable names
*/
// Create string array of short names
string short_names = {"Loyalty", "Reason", "Distance", "Wifi", 
                      "Schedule", "Booking", "Gate", "Boarding", 
                      "Entertainment", "Leg room", "Baggage", "Checkin", 
                      "Departure Delay", "Arrival Delay" };

// Create string array of original names to change                      
string original_names = { "Customer Type", "Type of Travel", "Flight Distance", "Wifi service",
                          "Schedule convenient", "Ease of Online booking", "Gate location", "Online boarding",
                          "Inflight entertainment", "Leg room service", "Baggage handling", "Checkin service",
                          "Departure Delay in Minutes", "Arrival Delay in Minutes" };

// Change names
airline_data = dfname(airline_data, short_names, original_names);
</code></pre>
<h3 id="data-visualization">Data Visualization</h3>
<p><a href="https://www.aptech.com/blog/category/graphics/" target="_blank" rel="noopener">Data visualization</a> is a great way to get a feel for the relationships between our target variable and our features.</p>
<p>Let's explore the relationship between the customer and flight characteristics and reported satisfaction. </p>
<p>In particular, we'll look at how satisfaction relates to:</p>
<ul>
<li>Age.</li>
<li>Gender.</li>
<li>Flight distance.</li>
<li>Seat class.</li>
<li>Customer type. </li>
</ul>
<h4 id="preparing-our-data-for-plotting">Preparing Our Data for Plotting</h4>
<p>Today we'll use bar graphs to explore the relationships in our data. In particular, we will sort our data into subgroups and examine how those subgroups report satisfaction. </p>
<p>For <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical variables</a>, we have naturally defined subgroups. However, for the continuous variables, <code>Age</code> and <code>Distance</code>, we first need to generate bins based on ranges of these variables. </p>
<p>First, let's place the <code>Age</code> variable in bins. To do this we will use the <a href="https://docs.aptech.com/gauss/reclassifycuts.html" target="_blank" rel="noopener"><code>reclassifycuts</code></a> and <a href="https://docs.aptech.com/gauss/reclassify.html" target="_blank" rel="noopener"><code>reclassify</code></a> procedures: </p>
<div class="alert alert-info" role="alert">For more information on reclassifying and other similar data transformations, see the <a href="https://docs.aptech.com/gauss/data-management/data-transformations.html?highlight=reclassify#" target="_blank" rel="noopener">Data Transformations</a> section of our Data Management Guide.</div>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create bins for age
*/
// Set age categories cut points
// Class 0: 20 and Under
// Class 1: 21 - 30
// Class 2: 31 - 40
// Class 3: 41 - 50
// Class 4: 51 - 60
// Class 5: 61 - 70
// Class 6: Over 70
cut_pts = { 20, 
            30, 
            40, 
            50, 
            60, 
            70};

// Create numeric classes
age_new = reclassifycuts(airline_data[., "Age"], cut_pts);

// Generate labels to recode to
to = "20 and Under"$|
       "21-30"$|
       "31-40"$|
       "41-50"$|
       "51-60"$|
       "61-70"$|
       "Over 70";

// Recode to categorical variable
age_cat = reclassify(age_new, unique(age_new), to);

// Convert to dataframe
age_cat = asDF(age_cat, "Age Group");</code></pre>
<p>For a quick frequency count of this categorical variable, we can use the <a href="https://docs.aptech.com/gauss/frequency.html" target="_blank" rel="noopener"><code>frequency</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check frequency of age groups
frequency(age_cat, "Age Group");</code></pre>
<pre>       Label      Count   Total %    Cum. %
20 and Under      11333     10.91     10.91
       21-30      21424     20.62     31.53
       31-40      21203     20.41     51.93
       41-50      23199     22.33     74.26
       51-60      18769     18.06     92.32
       61-70       7220     6.949     99.27
     Over 70        756    0.7276       100
       Total     103904       100     </pre>
<p>Now we will do the same for <code>Distance</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create bins for flight distance
*/       
// Set distance categories
// Cut points for data 
cut_pts = { 1000, 
            1500, 
            2000, 
            2500, 
            3000,
            3500};

// Create numeric classes
distance_new = reclassifycuts(airline_data[., "Distance"], cut_pts);

// Generate labels to recode to
to = "1000 and Under"$|
       "1001-1500"$|
       "1501-2000"$|
       "2001-2500"$|
       "2501-3000"$|
       "3001-3500"$|
       "Over 3500";

// Recode to categorical variable
distance_cat = reclassify(distance_new, unique(distance_new), to);

// Convert to dataframe
distance_cat = asDF(distance_cat, "Flight Range");

// Check frequencies
frequency(distance_cat, "Flight Range");</code></pre>
<pre>         Label      Count   Total %    Cum. %
1000 and Under      28017     26.96     26.96
     1001-1500      10976     10.56     37.53
     1501-2000       9331      8.98     46.51
     2001-2500       7834      7.54     54.05
     2501-3000       8053      7.75      61.8
     3001-3500      24815     23.88     85.68
     Over 3500      14878     14.32       100
         Total     103904       100    </pre>
<h4 id="age">Age</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/age-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/age-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583902" /></a></p>
<p>We can see from the plot above that passengers 20 and under and passengers over 60 are less likely to be satisfied than other age groups. </p>
<h4 id="gender">Gender</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/gender-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/gender-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583903" /></a></p>
<p>The plot suggests that gender has little impact on reported satisfaction. </p>
<h4 id="flight-distance">Flight Distance</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/flight-length-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/flight-length-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583904" /></a>
The flight distance plot shows slightly lower satisfaction rates for flights of 3,000 miles or more and for flights of 1,000 miles or less. </p>
<h4 id="seat-class">Seat Class</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/seat-type.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/seat-type.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583905" /></a>
There is a clear discrepancy in satisfaction between passengers who fly business class and other passengers. Business class customers report satisfaction at a much higher rate than those in economy or economy plus. </p>
<h4 id="customer-type">Customer Type</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/customer-type-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/customer-type-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583906" /></a>
Finally, it also appears that loyal passengers are more often satisfied customers than disloyal passengers. </p>
<h3 id="feature-engineering">Feature Engineering</h3>
<p>As is common with survey data, a number of our variables are categorical. We need to represent these as <a href="https://www.aptech.com/blog/how-to-create-dummy-variables-in-gauss/" target="_blank" rel="noopener">dummy variables</a> before modeling. </p>
<p>We'll do this using the <a href="https://docs.aptech.com/gauss/onehot.html" target="_blank" rel="noopener"><code>oneHot</code></a> procedure. However, <code>oneHot</code> only accepts single variables, so we will need to loop through all the categorical variables.</p>
<p>To do this, we first create a list of all categorical variables. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create dummy variables
*/
// Get all variable names
col_names = getColNames(X);

// Get types of all variables
col_types = getColTypes(X);

// Select names of variables
// that are categorical
cat_names = selif(col_names, col_types .== "category");</code></pre>
<p>Next, we loop through all categorical variables and create dummy variables for each one using <code>oneHot</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Loop through categorical variables
// to create dummy variables
dummy_vars = {};
for i(1, rows(cat_names), 1); 
    dummy_vars = dummy_vars~oneHot(X[., cat_names[i]]);
endfor;

// Delete original categorical variables
// and replace with dummy variables
X = delcols(x, cat_names)~dummy_vars;</code></pre>
<h2 id="model-evaluation">Model Evaluation</h2>
<p>The <a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener"><code>classificationMetrics</code></a> procedure reports a number of classification metrics. These metrics tell us how well the model meets different objectives. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="model-comparison-measures"><span style="color:#FFFFFF">Model Comparison Measures</span></h3>
      </th>
   </tr>
<tr><th>Tool</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>Accuracy</td><td>Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.</td></tr>
<tr><td>Precision</td><td>How reliable the model's positive predictions for a class are. Equal to the number of true positives divided by the sum of true positives and false positives. </td></tr>
<tr><td>Recall</td><td>How well the model finds all of the actual members of a class. Equal to the number of true positives divided by the sum of true positives and false negatives.</td></tr>
<tr><td>F1-score</td><td>The harmonic mean of the precision and recall, it gives a more balanced picture of how our model performs. A score of 1 indicates perfect precision and recall. </td></tr>
</tbody>
</table>
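<p>In terms of confusion-matrix counts, each of these metrics is a simple ratio. As a sketch (not the library's implementation), assuming <code>y</code> holds the actual classes and <code>y_hat</code> the predicted classes as 0/1 vectors:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Hand-computed metrics from confusion-matrix counts,
// assuming 'y' (actual) and 'y_hat' (predicted) are 0/1 vectors
tp = sumc(y_hat .== 1 .and y .== 1);   // True positives
fp = sumc(y_hat .== 1 .and y .== 0);   // False positives
fn = sumc(y_hat .== 0 .and y .== 1);   // False negatives
tn = sumc(y_hat .== 0 .and y .== 0);   // True negatives

accuracy = (tp + tn) / (tp + fp + fn + tn);
prec = tp / (tp + fp);                 // Precision
rec = tp / (tp + fn);                  // Recall
f1 = 2 * prec * rec / (prec + rec);    // Harmonic mean of the two</code></pre>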
<p>We'll keep these in mind as we fit and test our model.</p>
<h2 id="logistic-regression-model-fitting">Logistic Regression Model Fitting</h2>
<p>We're now ready to begin fitting our models. To start, we will prepare our data by:</p>
<p>Creating training and testing datasets using <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a>. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Split data into 70% training and 30% test set
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);</code></pre>
<p>Scaling our data using <a href="https://docs.aptech.com/gauss/rescale.html" target="_blank" rel="noopener"><code>rescale</code></a>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Data rescaling
*/
// Number of variables to rescale
numeric_vars = 4;

// Rescale training data
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");

// Rescale test data using same scaling factors as x_train
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);</code></pre>
<p>Unlike <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener">Random Forest models</a>, logistic regression models are sensitive to large differences in the scale of the variables. Standardizing the variables, as we do here, is a sensible default, though not necessarily the best option in every case.</p>
<p>As you can see above, we compute the mean and standard deviation from the training set and use those parameters to scale the test set. This is important. </p>
<p>The purpose of our test set is to give us an estimate of how our model will do on unseen data. Using the mean and standard deviation of the entire dataset, computed before the train/test split, would allow information from the test set to &quot;leak&quot; into our model. Information leakage is beyond the scope of this blog post, but in general the test set should be treated as information that is not available until after the model fit is complete.</p>
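<p>To make the contrast concrete, here is the leaky pattern next to the correct one used above. This is a sketch reusing the variables from this section; <code>x_all</code>, <code>mu_bad</code>, and <code>sd_bad</code> are illustrative names.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Leaky: scaling parameters computed from the full dataset,
// so the test rows influence the transform later applied to them
{ x_all, mu_bad, sd_bad } = rescale(X[.,1:numeric_vars], "standardize");

// Correct: parameters come from the training rows only and are
// then reused, unchanged, on the test rows
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);</code></pre>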
<p>Now we're ready to start fitting our models.</p>
<h3 id="case-one-logistic-regression-without-regularization">Case One: Logistic Regression Without Regularization</h3>
<p>As a base case, we'll consider a logistic regression model without any regularization. For this case, we'll use all default settings, so our only inputs are the dependent and independent data. </p>
<p>Using our training data we will:</p>
<ol>
<li>Train our model using <code>logisticRegFit</code>.</li>
<li>Make predictions on our training data using <a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener"><code>lmPredict</code></a>. </li>
<li>Evaluate our training model predictions using <a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener"><code>classificationMetrics</code></a>. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*************************************
** Base case model
** No regularization
*************************************/

/*
** Training
*/
// Declare 'lr_mdl' to be 
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train);

// Check training set performance
y_hat_train = lmPredict(lr_mdl, X_train);

// Model evaluations
print "Training Metrics";
call classificationMetrics(y_train, y_hat_train);</code></pre>
<p>The <code>classificationMetrics</code> procedure prints an evaluation table:</p>
<pre>No regularization
Training Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.93    41102
              satisfied        0.90    0.91      0.90    31631

              Macro avg        0.91    0.92      0.91    72733
           Weighted avg        0.92    0.92      0.92    72733

               Accuracy                          0.92    72733</pre>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Testing
*/
// Make predictions on the test set, from our trained model
y_hat_test = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
print "Testing Metrics";
call classificationMetrics(y_test, y_hat_test);</code></pre>
<p>This code prints the following to screen:</p>
<pre>Testing Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.92    17777
              satisfied        0.90    0.91      0.90    13394

              Macro avg        0.91    0.91      0.91    31171
           Weighted avg        0.91    0.91      0.91    31171

               Accuracy                          0.91    31171</pre>
<p>Comparing performance on our training and testing data yields some encouraging observations:</p>
<ul>
<li>First, there is little difference in accuracy across our training and testing datasets, with a training accuracy of 0.92 and a testing accuracy of 0.91.</li>
<li>Our model achieves the same average F1-score, a more balanced measure of performance, on both datasets. </li>
</ul>
<p>Why is this important? <b>This comparison provides a good indication that we aren't overfitting our training set.</b> Since the main purpose of regularization is to address overfitting the model to the training data, we don't have much reason to use it. However, for demonstration purposes, we'll show how to implement L2 regularization.</p>
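<p>Before we do, it helps to see what the penalty does. Schematically, L2 regularization adds a squared penalty on the coefficients to the usual logistic regression objective (the exact internal scaling used by <code>logisticRegFit</code> may differ from this sketch):</p>
<pre>minimize over b:   -sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]  +  lambda * sum_j b_j^2

where p_i = 1 / (1 + exp(-x_i'b)) and lambda is the penalty weight</pre>
<p>Larger values of lambda shrink the coefficients more aggressively, trading some fit on the training data for less variance on unseen data.</p>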
<h3 id="case-two-logistic-regression-with-l2-regularization">Case Two: Logistic Regression With L2 Regularization</h3>
<p>To implement regularization with <code>logisticRegFit</code>, we'll use a <a href="https://www.aptech.com/resources/tutorials/a-gentle-introduction-to-using-structures/" target="_blank" rel="noopener"><code>logisticRegControl</code></a> structure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*************************************
** L2 Regularization
*************************************/

/*
** Training
*/
// Declare 'lrc' to be a logisticRegControl
// structure and fill with default settings 
struct logisticRegControl lrc;
lrc = logisticRegControlCreate();

// Set L2 regularization parameter
lrc.l2 = 0.05;

// Declare 'lr_mdl' to be 
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train, lrc);

/*
** Testing
*/
// Make predictions on the test set
y_hat_l2 = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
call classificationMetrics(y_test, y_hat_l2);</code></pre>
<p>The classification metrics are printed:</p>
<pre>L2 regularization
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.89    0.93      0.91    17777
              satisfied        0.90    0.84      0.87    13394

              Macro avg        0.90    0.89      0.89    31171
           Weighted avg        0.89    0.89      0.89    31171

               Accuracy                          0.89    31171</pre>
<p>Note that with the L2 penalty, our model performance drops from the base case, with lower accuracy (0.89) and a lower average F1-score (0.89). This isn't surprising, given that we found no evidence of overfitting in our model. </p>
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog, we've looked at logistic regression and regularization. </p>
<p>Using a real-world airline passenger satisfaction data application we've:</p>
<ol>
<li>Performed preliminary data preparation and setup.</li>
<li>Trained logistic regression models with and without regularization. </li>
<li>Made classification predictions. </li>
<li>Interpreted classification metrics. </li>
</ol>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a>  </li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/classification-with-regularized-logistic-regression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Machine Learning With Real-World Data</title>
		<link>https://www.aptech.com/blog/machine-learning-with-real-world-data/</link>
					<comments>https://www.aptech.com/blog/machine-learning-with-real-world-data/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 16 May 2023 03:38:45 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583790</guid>

					<description><![CDATA[If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions. 

In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:
<ul>
<li>Data Exploration and cleaning.</li>
<li>Splitting data for training and testing. </li>
<li>Model fitting and prediction. </li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions. </p>
<p>In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:</p>
<ul>
<li>Data Exploration and cleaning.</li>
<li>Splitting data for training and testing. </li>
<li>Model fitting and prediction. </li>
<li>Basic feature engineering.</li>
</ul>
<h2 id="background">Background</h2>
<h3 id="our-data">Our Data</h3>
<p>Today we will be working with the <a href="https://www.kaggle.com/datasets/camnugent/california-housing-prices" target="_blank" rel="noopener">California Housing Dataset from Kaggle</a>. </p>
<p>This dataset is built from 1990 Census data. Though it is an older dataset, it is a great demonstration dataset and has been popular in many machine learning examples.</p>
<p>The dataset contains 10 variables measured in California at the block group level:</p>
<table>
<tbody>
<tr><th>Variable</th><th>Description</th></tr>
<tr><td>longitude</td><td>Measure of how far west a house is.</td></tr>
<tr><td>latitude</td><td>Measure of how far north a house is.</td></tr>
<tr><td>housing_median_age</td><td>Median age of a house within a block.</td></tr>
<tr><td>total_rooms</td><td>Total number of rooms within a block.</td></tr>
<tr><td>total_bedrooms</td><td>Total number of bedrooms within a block.</td></tr>
<tr><td>population</td><td>Total number of people residing within a block.</td></tr>
<tr><td>households</td><td>Total number of households, a group of people residing within a home unit, for a block.</td></tr>
<tr><td>median_income</td><td>Median income for households within a block of houses (measured in tens of thousands of US Dollars).</td></tr>
<tr><td>median_house_value</td><td>Median house value for households within a block.</td></tr>
<tr><td>ocean_proximity</td><td>Location of the house w.r.t ocean/sea.</td></tr>
</tbody>
</table>
<h3 id="gauss-machine-learning">GAUSS Machine Learning</h3>
<p>We will use the new <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning (GML)</a> library. It provides accessible, easy-to-use tools for implementing fundamental machine learning models. </p>
<p>To access these tools, we need to load the library: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Clear workspace and load library
new;
library gml;

// Set random seed
rndseed 8906876;</code></pre>
<div class="alert alert-info" role="alert">Note we also set the random seed to allow for replication.</div>
<h2 id="data-exploration-and-cleaning">Data Exploration and Cleaning</h2>
<p>With GML loaded, we are now ready to import and <a href="https://www.aptech.com/blog/preparing-and-cleaning-data-fred-data-in-gauss/" target="_blank" rel="noopener">clean our data</a>. The first step is to use the <a href="https://docs.aptech.com/gauss/loadd.html" target="_blank" rel="noopener"><code>loadd</code></a> procedure to import our data into GAUSS.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Import datafile
*/
load_path = "data/";
fname = "housing.csv";

// Load all variables
housing_data = loadd(load_path $+ fname);</code></pre>
<h3 id="descriptive-statistics">Descriptive Statistics</h3>
<p><a href="https://docs.aptech.com/gauss/data-management/data-exploration.html" target="_blank" rel="noopener">Exploratory data analysis</a> allows us to identify important data anomalies, like outliers and <a href="https://www.aptech.com/blog/introduction-to-handling-missing-values/" target="_blank" rel="noopener">missing values</a>. </p>
<p>Let's start by looking at standard <a href="https://www.aptech.com/resources/tutorials/formula-string-syntax/descriptive-statistics-from-a-dataset/" target="_blank" rel="noopener">descriptive statistics</a> using the <a href="https://docs.aptech.com/gauss/dstatmt.html" target="_blank" rel="noopener"><code>dstatmt</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Find descriptive statistics
// for all variables in housing_data
dstatmt(housing_data);</code></pre>
<p>This prints a summary table of statistics for all variables.</p>
<pre>--------------------------------------------------------------------------------------------------
Variable                  Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
--------------------------------------------------------------------------------------------------
longitude               -119.6       2.004         4.014      -124.3      -114.3     20640    0
latitude                 35.63       2.136         4.562       32.54       41.95     20640    0
housing_median_age       28.64       12.59         158.4           1          52     20640    0
total_rooms               2636        2182     4.759e+06           2   3.932e+04     20640    0
total_bedrooms           537.9       421.4     1.776e+05           1        6445     20433  207
population                1425        1132     1.282e+06           3   3.568e+04     20640    0
households               499.5       382.3     1.462e+05           1        6082     20640    0
median_income            3.871         1.9         3.609      0.4999          15     20640    0
median_house_value   2.069e+05   1.154e+05     1.332e+10     1.5e+04       5e+05     20640    0
ocean_proximity          -----       -----         -----   &lt;1H OCEAN  NEAR OCEAN     20640    0 </pre>
<p>These statistics allow us to quickly identify several data issues that we need to address prior to fitting our model:</p>
<ol>
<li>There are 207 missing observations of the <code>total_bedrooms</code> variable (you may need to scroll to the right of the output). </li>
<li>Many of our variables show potential outliers, with high variance and large ranges. These should be further explored. </li>
</ol>
<h3 id="missing-values">Missing Values</h3>
<p>To get a better idea of how to best deal with the missing values, let's check the descriptive statistics for the observations with and without missing values separately.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Conditional check 
// for missing values
e = housing_data[., "total_bedrooms"] .== miss();

// Get descriptive statistics
// for dataset with missing values
dstatmt(selif(housing_data, e));</code></pre>
<pre>------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum   Valid  Missing
------------------------------------------------------------------------------------------------

longitude              -119.5       2.001         4.006      -124.1      -114.6      207    0
latitude                 35.5       2.097         4.399       32.66       40.92      207    0
housing_median_age      29.27       11.96         143.2           4          52      207    0
total_rooms              2563        1787     3.194e+06         154   1.171e+04      207    0
total_bedrooms          -----       -----         -----        +INF        -INF        0  207
population               1478        1057     1.118e+06          37        7604      207    0
households                510       386.1     1.491e+05          16        3589      207    0
median_income           3.822       1.956         3.824      0.8527          15      207    0
median_house_value   2.06e+05   1.116e+05     1.246e+10    4.58e+04       5e+05      207    0
ocean_proximity         -----       -----         -----   &lt;1H OCEAN  NEAR OCEAN      207    0</pre>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get descriptive statistics
// for dataset without missing values
dstatmt(delif(housing_data, e));</code></pre>
<pre>-------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------

longitude              -119.6       2.004         4.014      -124.3      -114.3     20433    0
latitude                35.63       2.136         4.564       32.54       41.95     20433    0
housing_median_age      28.63       12.59         158.6           1          52     20433    0
total_rooms              2637        2185     4.775e+06           2   3.932e+04     20433    0
total_bedrooms          537.9       421.4     1.776e+05           1        6445     20433    0
population               1425        1133     1.284e+06           3   3.568e+04     20433    0
households              499.4       382.3     1.462e+05           1        6082     20433    0
median_income           3.871       1.899         3.607      0.4999          15     20433    0
median_house_value  2.069e+05   1.154e+05     1.333e+10     1.5e+04       5e+05     20433    0
ocean_proximity         -----       -----         -----   &lt;1H OCEAN  NEAR OCEAN     20433    0 </pre>
<p>From visual inspection, the descriptive statistics for the observations with missing values are very similar to those for the observations without missing values. </p>
<div class="alert alert-info" role="alert">We could do more robust statistical tests to confirm this. However, those are outside of the scope of this blog, and we will rely on our visual inspection.</div>
<p>In addition, the missing values make up less than 1% of the total observations. Given this, we will delete the rows containing missing values, rather than <a href="https://www.aptech.com/blog/introduction-to-handling-missing-values/" target="_blank" rel="noopener">imputing our missing values</a>.</p>
<p>We can delete the rows with missing values using the <a href="https://docs.aptech.com/gauss/packr.html" target="_blank" rel="noopener"><code>packr</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Remove rows with missing values
// from housing_data
housing_data = packr(housing_data);</code></pre>
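<p>Had the missing observations looked systematically different, imputation would have been the better route. For contrast, a sketch of simple mean imputation for <code>total_bedrooms</code> (the alternative we decided against; not run here):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Mean imputation: fill missing values of 'total_bedrooms'
// with the mean of the observed values
tb = housing_data[., "total_bedrooms"];
housing_data[., "total_bedrooms"] = missrv(tb, meanc(packr(tb)));</code></pre>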
<h3 id="outliers">Outliers</h3>
<p>Now that we've removed missing values, let's look for other data outliers. Data visualizations like histograms and box plots are a great way to identify potential outliers.</p>
<p>First, let's create a grid plot of <a href="https://docs.aptech.com/gauss/plothist.html" target="_blank" rel="noopener">histograms</a> for all of our continuous variables:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Data visualizations
*/
// Get variables names
vars = getColNames(housing_data);

// Set up plotControl 
// structure for formatting graphs
struct plotControl plt;
plt = plotGetDefaults("bar");

// Set fonts
plotSetFonts(&amp;plt, "title", "Arial", 14);
plotSetFonts(&amp;plt, "ticks", "Arial", 12);

// Loop through the variables and draw histograms
for i(1, rows(vars)-1, 1);
    plotSetTitle(&amp;plt, vars[i]);
    plotLayout(3, 3, i);
    plotHist(plt, housing_data[., vars[i]], 50);
endfor;</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/histogram_all.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/histogram_all.jpg" alt="Histogram of all variables in our California Housing dataset. " width="1200" height="800" class="aligncenter size-full wp-image-11583804" /></a></p>
<p>From our histograms, it appears that several variables suffer from outliers:</p>
<ul>
<li>The <code>total_rooms</code> variable, with the majority of the data distributed between 0 and 10,000.</li>
<li>The <code>total_bedrooms</code> variable, with the majority of the data distributed between 0 and 2000.</li>
<li>The <code>households</code> variable, with the majority of the data distributed between 0 and 2000.</li>
<li>The <code>population</code> variable, with the majority of the data distributed between 0 and 10,000.</li>
</ul>
<p><a href="https://docs.aptech.com/gauss/plotbox.html" target="_blank" rel="noopener">Box plots</a> of these variables confirm that there are indeed outliers.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">plt = plotGetDefaults("box");

// Set fonts
plotSetFonts(&amp;plt, "title", "Arial", 14);
plotSetFonts(&amp;plt, "ticks", "Arial", 12);

string box_vars = { "total_rooms", "total_bedrooms", "households", "population" };

// Loop through the variables and draw boxplots
for i(1, rows(box_vars), 1);
    plotLayout(2, 2, i);
    plotBox(plt, box_vars[i], housing_data[., box_vars[i]]);
endfor;</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_outliers.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_outliers.jpg" alt="" width="1200" height="800" class="aligncenter size-full wp-image-11583807" /></a></p>
<p>Let's <a href="https://docs.aptech.com/gauss/data-management/data-cleaning.html#filtering-observations-of-a-dataframe" target="_blank" rel="noopener">filter the data</a> to eliminate these outliers:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Filter to remove outliers
**
** Delete:
**    - total_rooms greater than or equal to 10000
**    - total_bedrooms greater than or equal to 2000
**    - households greater than or equal to 2000
**    - population greater than or equal to 6000
*/
mask = housing_data[., "total_rooms"] .&gt;= 10000;
mask = mask .or housing_data[., "total_bedrooms"] .&gt;= 2000;
mask = mask .or housing_data[., "households"] .&gt;= 2000;
mask = mask .or housing_data[., "population"] .&gt;= 6000;

housing_data = delif(housing_data, mask);</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_no_outliers.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_no_outliers.jpg" alt="" width="1200" height="800" class="aligncenter size-full wp-image-11583805" /></a></p>
<div class="alert alert-info" role="alert">Note that we've taken a conservative approach to filtering outliers and haven't removed all points identified by the box plots as outliers. </div>
<h3 id="data-truncation">Data Truncation</h3>
<p>The histograms also point to truncation issues with <code>housing_median_age</code> and <code>median_house_value</code>. Let's look into this a little further:</p>
<ol>
<li>We'll confirm that these are the most frequently occurring observations using <a href="https://docs.aptech.com/gauss/modec.html" target="_blank" rel="noopener"><code>modec</code></a>. This provides evidence for our suspicion that these are truncation points.</li>
<li>We'll count the number of observations at these locations.</li>
</ol>
<div class="alert alert-info" role="alert">Remember that we've already filtered our outliers, so we're looking at a subset of our original data.</div>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// House value
mode_value = modec(housing_data[., "median_house_value"]);
print "Most frequent median_house_value:" mode_value;

print "Counts:";
sumc(housing_data[., "median_house_value"] .== mode_value);

// House age
mode_age = modec(housing_data[., "housing_median_age"]);
print "Most frequent housing_median_age:" mode_age;

print "Counts:";
sumc(housing_data[., "housing_median_age"] .== mode_age);</code></pre>
<div class="alert alert-info" role="alert">We use <code>modec</code> because from our histogram we can't identify for that these points occur at the maximum. It makes sense to assume that they do but we can't be certain. </div>
<pre>Most frequent median_house_value:
       500001.00
Counts:
       935.00000
Most frequent housing_median_age:
       52.000000
Counts:
       1262.0000</pre>
<p>Together, these observations make up about 10% of the total. Because we have no further information about what is occurring at these points, let's remove them from the dataset.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create binary vector with a 1 if either
// 'housing_median_age' or 'median_house_value'
// equal their mode value.
mask = (housing_data[., "housing_median_age"] .== mode_age)
       .or (housing_data[., "median_house_value"] .== mode_value);

// Delete the rows if they meet our above criteria
housing_data = delif(housing_data, mask);</code></pre>
<h3 id="feature-modifications">Feature Modifications</h3>
<p>Our final data cleaning step is to make feature modifications including:</p>
<ol>
<li>Rescaling the <code>median_house_value</code> variable to be measured in tens of thousands of US dollars (the same scale as <code>median_income</code>).</li>
<li>Generating dummy variables to account for the categories of <code>ocean_proximity</code>.</li>
</ol>
<p>First, we rescale the <code>median_house_value</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Rescale median house value variable
housing_data[., "median_house_value"] = 
    housing_data[., "median_house_value"] ./ 10000;</code></pre>
<p>Next we generate <a href="https://www.aptech.com/blog/how-to-create-dummy-variables-in-gauss/" target="_blank" rel="noopener">dummy variables</a> for <code>ocean_proximity</code>. </p>
<p>Let's get a feel for our <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical data</a> using the <a href="https://docs.aptech.com/gauss/frequency.html" target="_blank" rel="noopener"><code>frequency</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check frequency of
// ocean_proximity categories
frequency(housing_data, "ocean_proximity");</code></pre>
<p>This prints a convenient frequency table:</p>
<pre>     Label      Count   Total %    Cum. %
 &lt;1H OCEAN       8095     44.89     44.89
    INLAND       6136     34.03     78.93
    ISLAND          2   0.01109     78.94
  NEAR BAY       1525     8.458     87.39
NEAR OCEAN       2273     12.61       100
     Total      18031       100         </pre>
<p>We can see from this table that the <code>ISLAND</code> category contains only two observations. We'll exclude it from our modeling dataset.</p>
<p>Now let's create our dummy variables using the <a href="https://docs.aptech.com/gauss/onehot.html" target="_blank" rel="noopener"><code>oneHot</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Generate dummy variables for 
** the ocean_proximity using
** one hot encoding
*/
dummy_matrix = oneHot(housing_data[., "ocean_proximity"]);</code></pre>
<p>Finally, we'll save our modeling dataset in a GAUSS <a href="https://www.aptech.com/blog/gauss23/#first-class-dataframe-storage" target="_blank" rel="noopener">.gdat</a> file using <a href="https://docs.aptech.com/gauss/saved.html" target="_blank" rel="noopener"><code>saved</code></a> so we can directly access our clean data in the future:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Build matrix of features
** Note we exclude: 
**     - ISLAND dummy variable
**     - Original ocean_proximity variable
*/
model_data = delcols(housing_data, "ocean_proximity") ~ 
    delcols(dummy_matrix, "ocean_proximity_ISLAND");

// Saved data matrix
saved(model_data, load_path $+ "/model_data.gdat");</code></pre>
<h2 id="data-splitting">Data Splitting</h2>
<p>In machine learning, it's customary to use separate datasets to fit the model and to evaluate model performance. Since the objective of machine learning models is to provide predictions for unseen data, using a testing set provides a more realistic measure of how our model will perform. </p>
<div class="alert alert-info" role="alert">Cross-validation is an additional tool for evaluating model performance. To learn more about cross-validation, see our previous blog, <a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">&quot;Understanding Cross-Validation&quot;</a>.</div>
<p>To prepare our data for training and testing, we're going to take two steps:</p>
<ol>
<li>Separate our target variable, <code>median_house_value</code>, and feature set.</li>
<li>Split our data into a 70% training and 30% testing dataset using <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a>.</li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">new;
library gml;
rndseed 896876;

/*
** Load datafile
*/
load_path = "data/";
fname = "model_data.gdat";

// Load data
housing_data = loadd(load_path $+ fname);

/*
** Feature management
*/
// Separate dependent and independent data
y = housing_data[., "median_house_value"];
X = delcols(housing_data, "median_house_value");

// Split into 70% training data 
// and 30% testing data
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);</code></pre>
<h2 id="fitting-our-model">Fitting Our Model</h2>
<p>Now that we've completed our data cleaning, we're finally ready to fit our model. Today we'll use a LASSO <a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">regression model</a> to predict our target variable. LASSO is a form of regularization that has found relative success in <a href="https://www.aptech.com/industry-solutions/econometrics/" target="_blank" rel="noopener">economic</a> and <a href="https://www.aptech.com/industry-solutions/finance/" target="_blank" rel="noopener">financial modeling</a>. It offers a data-driven approach to dealing with high-dimensionality in linear models. </p>
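<p>Concretely, LASSO augments the least squares loss with an $\ell_1$ penalty on the coefficients (written here in one common scaling; implementations may scale the loss term differently):</p>
<p>$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 + \lambda\sum_{j=1}^{p}\left|\beta_j\right|$$</p>
<p>Larger values of the tuning parameter $\lambda$ shrink the coefficients toward zero and can set some exactly to zero, which is what gives LASSO its variable selection behavior.</p>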
<h3 id="model-fitting">Model Fitting</h3>
<p>To fit the LASSO model to our target variable, <code>median_house_value</code>, we'll use <a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener"><code>lassoFit</code></a> from the GAUSS Machine Learning library.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** LASSO Model
*/
// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);</code></pre>
<p>The <code>lassoFit</code> procedure prints a model description and results:</p>
<pre>==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    12
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.347     -1.013   -0.02555
                  latitude     -2.192    -0.9269          0
        housing_median_age    0.07189    0.06384    0.03977
               total_rooms  -0.001004          0          0
            total_bedrooms    0.01165   0.006107   0.004828
                population  -0.004317  -0.003396  -0.001232
                households   0.006808   0.005119          0
             median_income      3.872      3.569      3.457
 ocean_proximity__1H OCEAN     -5.509          0          0
    ocean_proximity_INLAND     -9.437     -5.639     -6.575
  ocean_proximity_NEAR BAY     -7.083    -0.6395          0
ocean_proximity_NEAR OCEAN     -5.198     0.6378     0.6981
                    CONST.     -193.5     -82.98      3.451
===========================================================
                        DF         12         10          7
              Training MSE       33.7       34.7       37.4</pre>
<p>The results highlight the variable selection function of LASSO. With $\lambda = 0$, equivalent to a full least squares model, all features are represented in the model. When we get to $\lambda = 0.3$, the LASSO regression removes 5 of our 12 variables: </p>
<ul>
<li><code>latitude</code></li>
<li><code>total_rooms</code></li>
<li><code>households</code></li>
<li><code>ocean_proximity__1H OCEAN</code></li>
<li><code>ocean_proximity_NEAR BAY</code></li>
</ul>
<p>As we would expect, <code>median_income</code> has a large positive impact. However, there are a few noteworthy observations about the coefficients for the location-related variables.</p>
<p>As we add more regularization to the model by increasing the value of $\lambda$, <code>ocean_proximity__1H OCEAN</code> and <code>ocean_proximity_NEAR BAY</code> are removed from the model, but the effect of <code>ocean_proximity_INLAND</code> increases substantially. <code>latitude</code> is also removed from the model. This may be because much of the information in these variables is already captured by the remaining location dummy variables and <code>median_income</code>.</p>
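<p>The exact zeros in the coefficient table are a direct consequence of the $\ell_1$ penalty. In the special case of orthonormal features, the LASSO solution is a soft-thresholded version of the least squares estimate:</p>
<p>$$\hat{\beta}_j^{\text{lasso}} = \operatorname{sign}\left(\hat{\beta}_j^{\,\text{OLS}}\right)\left(\left|\hat{\beta}_j^{\,\text{OLS}}\right| - \lambda\right)_{+}$$</p>
<p>so any coefficient whose least squares magnitude falls below the threshold is set exactly to zero. With correlated features, as we have here, the algebra is less clean, but the same shrink-and-select mechanism is at work.</p>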
<h3 id="prediction">Prediction</h3>
<p>We can now test our model's prediction capability on the testing data using <a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener"><code>lmPredict</code></a>: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;</code></pre>
<pre>Testing MSE

       33.814993
       34.726144
       37.199771</pre>
<p>As expected, the testing MSE is slightly above the training MSE for the first two values of $\lambda$, and for the model with the highest $\lambda$ it is actually lower than the training MSE. This suggests that our model is not overfitting.</p>
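<p>For reference, the mean squared error is</p>
<p>$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$</p>
<p>Since <code>median_house_value</code> is measured in tens of thousands of dollars, a testing MSE near 34 corresponds to a root mean squared error of $\sqrt{34} \approx 5.8$, or roughly 58,000 US dollars of home value.</p>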
<h2 id="feature-engineering">Feature Engineering</h2>
<p>Since our model is not overfitting, we can consider adding more variables. We could collect additional variables, but it's likely that our current data holds more information that we can make accessible to our estimator. Instead, we'll create new features from combinations of our existing features. This process, called feature engineering, can make substantial contributions to your machine learning models.</p>
<p>We will start by generating per capita variables for <code>total_rooms</code>, <code>total_bedrooms</code>, and <code>households</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create per capita variables
** using population
*/
pc_data = housing_data[., "total_rooms" "total_bedrooms" "households"] 
    ./ housing_data[., "population"];

// Convert to a dataframe and add variable names
pc_data = asdf(pc_data, "rooms_pc"$|"bedrooms_pc"$|"households_pc");</code></pre>
<p>Next we will create a variable representing the percentage of <code>total_rooms</code> made up by <code>total_bedrooms</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">beds_per_room = X[.,"total_bedrooms"] ./ X[.,"total_rooms"];</code></pre>
<p>and add these columns to <code>X</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">X = X ~ pc_data ~ asdf(beds_per_room, "beds_per_room");</code></pre>
<h3 id="fit-and-predict-the-new-model">Fit and Predict the New Model</h3>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Reset the random seed so we get the
// same test and train splits as our previous model
rndseed 896876;

// Split our new X into train and test splits
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);

// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;</code></pre>
<pre>==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    16
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.495     -1.008          0
                  latitude      -2.36    -0.9354          0
        housing_median_age     0.0808    0.07167    0.04316
               total_rooms -0.0001714          0          0
            total_bedrooms   0.005301   0.001517  0.0008104
                population -0.0004661          0          0
                households  -0.001611          0          0
             median_income      3.947      4.011      3.675
 ocean_proximity__1H OCEAN     -5.171          0          0
    ocean_proximity_INLAND     -8.635     -4.963     -6.235
  ocean_proximity_NEAR BAY     -6.966     -0.875          0
ocean_proximity_NEAR OCEAN     -5.219     0.2927     0.1798
                  rooms_pc      2.678     0.1104          0
               bedrooms_pc     -11.68          0          0
             households_pc      22.23      21.47      20.23
             beds_per_room      33.03      17.03      8.029
                    CONST.     -221.9     -95.55     -3.059
===========================================================
                        DF         16         11          7
              Training MSE       31.6       32.5       34.3
Testing MSE

       31.505169
       32.457936
       34.155290 </pre>
<p>Our train and test MSE have improved for all values of $\lambda$. Of our new variables, <code>households_pc</code> and <code>beds_per_room</code> seem to have the strongest effects.</p>
<h2 id="extensions">Extensions</h2>
<p>We used a linear regression model, LASSO, for modeling home values. This choice was somewhat ad hoc, and there are a number of alternatives and extensions that could help improve our predictions. </p>
<p>For example, we could:</p>
<ul>
<li>Use <a href="https://docs.aptech.com/gauss/kmeansfit.html" target="_blank" rel="noopener">clustering</a> or <a href="https://docs.aptech.com/gauss/knnfit.html" target="_blank" rel="noopener">K-nearest neighbors</a> to capture more location information.</li>
<li>Use <a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">principal component analysis</a> to capture the variation in our features, then estimate the linear relationship between the median home values and the principal components. </li>
<li>Use a <a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener">random forest model</a>, which generally provides good accuracy for tabular datasets. </li>
<li>Split the home values into bins and perform <a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">classification</a>, rather than regression.</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog we've seen the important role that data exploration and cleaning plays in developing a machine learning model. Rarely do we obtain data that we can plug directly into our models. It's best practice to make time for data exploration and cleaning, because any machine learning model is only as reliable as its data. </p>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a></li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a></li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a></li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a></li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a></li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
</li>
</ol>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/machine-learning-with-real-world-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Understanding Cross-Validation</title>
		<link>https://www.aptech.com/blog/understanding-cross-validation/</link>
					<comments>https://www.aptech.com/blog/understanding-cross-validation/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 02 May 2023 13:08:47 +0000</pubDate>
				<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583747</guid>

					<description><![CDATA[If you've explored machine learning models, you've most likely encountered the term "cross-validation" at some point. Cross-validation is an important step for training robust and reliable machine learning models. 

In this blog, we'll break cross-validation into simple terms. Using a practical demonstration, we'll equip you with the knowledge to confidently use cross-validation in your machine learning projects. ]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>If you've explored machine learning models, you've probably come across the term &quot;cross-validation&quot; at some point. But what exactly is it, and why is it important? </p>
<p>In this blog, we'll break cross-validation into simple terms. With a practical demonstration, we'll equip you with the knowledge to confidently use cross-validation in your machine learning projects. </p>
<h2 id="model-validation-in-machine-learning">Model Validation in Machine Learning</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/Blank-diagram-2-1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/Blank-diagram-2-1.jpg" alt="Model validation and cross validation using testing and training datasets for machine learning models." width="841" height="716" class="aligncenter size-full wp-image-11583787" /></a></p>
<p>Machine learning validation methods provide a means for us to estimate generalization error. This is crucial for determining which model provides the best predictions for unobserved data.</p>
<p>In cases where large amounts of data are available, machine learning data validation begins with splitting the data into three separate datasets:</p>
<ul>
<li>A training set is used to train the machine learning model(s) during development.  </li>
<li>A validation set is used to estimate the generalization error of the model created from the training set for the purpose of model selection.  </li>
<li>A test set is used to estimate the generalization error of the final model.  </li>
</ul>
<h2 id="cross-validation-in-machine-learning">Cross-Validation in Machine Learning</h2>
<p>The model validation process in the previous section works when we have large datasets. When data is limited, we must instead use a technique called cross-validation. </p>
<p><b>The purpose of cross-validation is to provide a better estimate of a model's ability to perform on unseen data.</b> It provides an unbiased estimate of the generalization error, especially in the case of limited data. </p>
<p>There are many reasons we may want to do this:</p>
<ul>
<li>To have a clearer measure of how our model performs. </li>
<li>To tune hyperparameters. </li>
<li>To make model selections. </li>
</ul>
<p>The intuition behind cross-validation is simple - rather than training our models on one training set we train our model on multiple subsets of data. </p>
<p>The basic steps of cross-validation are:</p>
<ol>
<li>Split data into portions. </li>
<li>Train our model on a subset of the portions.</li>
<li>Test our model on the remaining subsets of the data. </li>
<li>Repeat steps 2-3 until the model has been trained and tested on the entire dataset.</li>
<li>Average the model performance across all iterations of testing to get the total model performance.</li>
</ol>
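<p>For a performance measure such as MSE, the final averaging step amounts to computing</p>
<p>$$\text{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\text{MSE}_i$$</p>
<p>where $\text{MSE}_i$ is the error measured on the $i$-th held-out portion of the data.</p>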
<h3 id="common-cross-validation-methods">Common Cross-Validation Methods</h3>
<p>Though the basic concept of cross-validation is fairly simple, there are a number of ways to go about each step. A few examples of cross-validation methods include:</p>
<ol>
<li>
<p><b> k-Fold Cross-Validation</b><br />
In k-fold cross-validation:</p>
<ul>
<li>The dataset is divided into k equal-sized folds. </li>
<li>The model is trained on k-1 folds and tested on the remaining fold. </li>
<li>The process is repeated k times, with each fold serving as the test set exactly once. </li>
<li>The performance metrics are averaged over the k iterations. </li>
</ul>
</li>
<li>
<p><b> Stratified k-Fold Cross-Validation</b><br />
This process is similar to k-fold cross-validation with minor but important exceptions:</p>
<ul>
<li>The class distribution in each fold is preserved. </li>
<li>It is useful for imbalanced datasets.</li>
</ul>
</li>
<li>
<p><b>Leave-One-Out Cross-Validation</b><br />
The Leave-one-out cross-validation process:</p>
<ul>
<li>Trains the model using all data observations except one. </li>
<li>Tests the data using the unused data point. </li>
<li>Repeats this for <em>n</em> iterations until each data point is used exactly once as a test set. </li>
</ul>
</li>
<li><b>Time-Series Cross-Validation</b><br />
This cross-validation method, designed specifically for time-series:
<ul>
<li>Splits the data into training and testing sets in a chronologically ordered manner, such as sliding or expanding windows. </li>
<li>Trains the model on past data and tests the model on future data, based on the splitting point. </li>
</ul></li>
</ol>
<table>
 <thead>
<tr><th>Method</th><th>Advantages</th><th>Disadvantages</th></tr>
</thead>
<tbody>
<tr><th>k-Fold Cross-Validation</th><td><ul><li>Provides a good estimate of the model's performance by using all the data for both training and testing.</li><li>Reduces the variance in performance estimates compared to other methods.</li></ul></td><td><ul><li>Can be computationally expensive, especially for large datasets or complex models.</li><li>May not work well for imbalanced datasets or when there is a specific order to the data.</li></ul></td></tr>
<tr><th>Stratified k-Fold Cross-Validation</th><td><ul><li>Ensures that each fold has a representative distribution of classes, which can improve performance estimates for imbalanced datasets.</li><li>Reduces the variance in performance estimates compared to other methods.</li></ul></td><td><ul><li>Can still be computationally expensive, especially for large datasets or complex models.</li><li>May not be necessary for balanced datasets where class distribution is already even.</li></ul></td></tr>
<tr><th>Leave-One-Out Cross-Validation (LOOCV)</th><td><ul><li>Provides the least biased estimate of the model's performance, as the model is tested on every data point.</li><li>Can be useful when dealing with very limited data.</li></ul></td><td><ul><li>Can be computationally expensive, as it requires training and testing the model n times.</li><li>May have high variance in performance estimates, due to the small size in the test set.</li></ul></td></tr>
<tr><th>Time Series Cross-Validation</th><td><ul><li>Accounts for temporal dependencies in time series data.</li><li>Provides a realistic estimate of the model's performance in real-world scenarios.</li></ul></td><td><ul><li>May not be applicable for non-time series data.</li><li>Can be sensitive to the choice of window size and data splitting strategy.</li></ul></td></tr>
</tbody>
</table>
<h2 id="k-fold-cross-validation-example">k-Fold Cross-Validation Example</h2>
<p>Let's look at k-fold cross-validation in action, using the wine quality dataset included in the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning (GML) library</a>. This file is based on the <a href="https://www.kaggle.com/datasets/yasserh/wine-quality-dataset" target="_blank" rel="noopener">Kaggle Wine Quality dataset</a>. </p>
<p>Our objective is to classify wines into quality categories using 11 characteristics:</p>
<ul>
<li>Fixed acidity.</li>
<li>Volatile acidity.</li>
<li>Citric acid. </li>
<li>Residual sugar.</li>
<li>Chlorides.</li>
<li>Free sulfur dioxide. </li>
<li>Total sulfur dioxide. </li>
<li>Density.</li>
<li>pH.</li>
<li>Sulphates.</li>
<li>Alcohol.</li>
</ul>
<p>We'll use k-fold cross-validation to examine the performance of a random forest classification model. </p>
<h3 id="data-loading-and-organization">Data Loading and Organization</h3>
<p>First we will load our data directly from the GML library:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load data and prepare data
*/
// Filename
fname = getGAUSSHome("pkgs/gml/examples/winequality.csv");

// Load wine quality dataset
dataset = loadd(fname);</code></pre>
<p>After loading the data, we need to shuffle the data and extract our dependent and independent variables. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Enable repeatable sampling
rndseed 754931;

// Shuffle the dataset (sample without replacement),
// because cvSplit does not shuffle.
dataset = sampleData(dataset, rows(dataset));

y = dataset[.,"quality"];
X = delcols(dataset, "quality");</code></pre>
<div class="alert alert-info" role="alert">Data shuffling is not always necessary. However, we found that without shuffling, some folds did not contain a complete representation of the classes. This suggests that our data might also be a good candidate for stratified k-fold cross-validation.</div>
<h3 id="setting-random-forest-hyperparameters">Setting Random Forest Hyperparameters</h3>
<p>After loading our data, we will set the <a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">random forest hyperparameters</a> using the <code>dfControl</code> structure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Enable GML library functions
library gml;

/*
** Model settings
*/
// The dfModel structure holds the trained model
struct dfModel dfm;

// Declare 'dfc' to be a dfControl
// structure and fill with default settings
struct dfControl dfc;
dfc = dfControlCreate();

// Create 200 decision trees
dfc.numTrees = 200;

// Stop splitting if impurity at
// a node is less than 0.15
dfc.impurityThreshold = 0.15;

// Only consider 2 features per split
dfc.featuresPerSplit = 2;</code></pre>
<div class="alert alert-info" role="alert">Today's focus is not on how to pick these hyperparameters. For more information on hyperparameter tuning, see our previous <a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">blog</a>.</div>
<h3 id="k-fold-cross-validation">k-fold Cross-Validation</h3>
<p>Now that we have loaded our data and set our hyperparameters, we are ready to fit our random forest model and implement k-fold cross-validation. </p>
<p>First, we set up the number of folds and pre-allocate a storage vector for model accuracy.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify number of folds
// This generally is 5-10
nfolds = 5;

// Pre-allocate vector to hold the results
accuracy = zeros(nfolds, 1);</code></pre>
<p>Next we use a GAUSS <a href="https://docs.aptech.com/gauss/for.html" target="_blank" rel="noopener"><code>for</code> loop</a> to complete four steps:</p>
<ol>
<li>Select testing and training data from our folds using the <a href="https://docs.aptech.com/gauss/cvsplit.html" target="_blank" rel="noopener"><code>cvSplit</code></a> procedure.  </li>
<li>Fit our random forest classification model on the chosen training data using the <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener"><code>decForestCFit</code></a> procedure.</li>
<li>Make classification predictions using the chosen testing data and the <a href="https://docs.aptech.com/gauss/decforestpredict.html" target="_blank" rel="noopener"><code>decForestPredict</code></a> procedure.</li>
<li>Compute and store model accuracy for each iteration. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">for i(1, nfolds, 1);
    { y_train, y_test, X_train, X_test } = cvSplit(y, X, nfolds, i);

    // Fit model using this fold's training data
    dfm = decForestCFit(y_train, X_train, dfc);

    // Make predictions using this fold's test data
    predictions = decForestPredict(dfm, X_test);

    accuracy[i] = meanc(y_test .== predictions);
endfor;</code></pre>
<h3 id="results">Results</h3>
<p>Let's print the accuracy results and the total model accuracy:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Print Results
*/
sprintf("%7s %10s", "Fold", "Accuracy");;
sprintf("%7d %10.2f", seqa(1,1,nfolds), accuracy);
sprintf("Total model accuracy           : %10.2f", meanc(accuracy));
sprintf("Accuracy variation across folds: %10.3f", stdc(accuracy));</code></pre>
<pre>   Fold   Accuracy
      1       0.70
      2       0.73
      3       0.65
      4       0.71
      5       0.71
Total model accuracy           :       0.70
Accuracy variation across folds:      0.028</pre>
<p>Our results provide some important insights into why we conduct cross-validation:</p>
<ul>
<li>The model accuracy is different across folds, with a standard deviation of 0.028. </li>
<li>The maximum accuracy, using fold 2, is 0.73.</li>
<li>The minimum accuracy, using fold 3, is 0.65.</li>
</ul>
<p>Depending on how we split our testing and training, we could get a different picture of model performance. </p>
<p>The total model accuracy, at 0.70, gives a better overall measure of model performance. The standard deviation of the accuracy gives us some insight into how much our prediction accuracy might vary.</p>
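<p>As a quick cross-check of the aggregation step, here is a small Python/NumPy sketch (illustrative only, not part of the GAUSS workflow above) that recomputes the total accuracy and fold-to-fold spread from the rounded per-fold accuracies in the results table. Computed from these rounded values, the spread comes out near 0.030 rather than the reported 0.028, presumably because the report uses the unrounded fold accuracies.</p>

```python
import numpy as np

# Rounded per-fold accuracies from the results table above
accuracy = np.array([0.70, 0.73, 0.65, 0.71, 0.71])

# Total model accuracy: the mean across folds
total_accuracy = accuracy.mean()

# Spread across folds: sample standard deviation (ddof=1),
# matching GAUSS's stdc, which divides by n - 1
spread = accuracy.std(ddof=1)

print(f"Total model accuracy: {total_accuracy:.2f}")
print(f"Accuracy variation across folds: {spread:.3f}")
```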
<h3 id="conclusion">Conclusion</h3>
<p>If you're looking to improve the accuracy and reliability of your statistical analysis, cross-validation is a crucial technique to learn. In today's blog we've provided a guide to getting started with cross-validation. </p>
<p>Our step-by-step practical demonstration using GAUSS should prepare you to confidently implement cross-validation in your own data analysis projects. </p>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a></li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a></li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a></li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a></li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a></li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a></li>
</ol>
<h2 id="try-out-gauss-machine-learning">Try Out GAUSS Machine Learning</h2>


]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/understanding-cross-validation/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Fundamentals of Tuning Machine Learning Hyperparameters</title>
		<link>https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/</link>
					<comments>https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Mon, 24 Apr 2023 13:37:58 +0000</pubDate>
				<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583628</guid>

					<description><![CDATA[Machine learning algorithms often rely on hyperparameters that can impact the performance of the models. These hyperparameters are external to the data and are part of the modeling choices that practitioners must make.

An important step in machine learning modeling is optimizing model hyperparameters to improve prediction accuracy.

In today's blog, we will cover some fundamentals of parameter tuning and will look more specifically at fine-tuning our previous decision forest model.]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Machine learning algorithms often rely on hyperparameters that can impact the performance of the models. These hyperparameters are external to the data and are part of the modeling choices that practitioners must make.</p>
<p>An important step in machine learning modeling is optimizing model hyperparameters to improve prediction accuracy.</p>
<p>In today's blog, we will cover some fundamentals of hyperparameter tuning using our previous decision forest, or random forest, model.</p>
<h2 id="model-performance">Model Performance</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/04/bias-variance.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/bias-variance.jpg" alt="" width="800" height="250" class="aligncenter size-full wp-image-11583689" /></a>
Before we consider how to fit the best machine learning model, we need to look at what it means to be the best model. </p>
<p>First, we must keep in mind that the most common goal in machine learning is to create an algorithm that will create accurate predictions based on unseen data. How successful an algorithm is at achieving this goal is reflected in the out-of-sample, or generalization, error.</p>
<p>The error of a machine learning model can be broken into two main categories: bias and variance.</p>
<table>
<tbody>
<tr><th>Bias</th><td>The error that occurs when we fit a simple model to a more complex data-generating process. A model with high bias will underfit the training data as we see in the far left panel of the above plot.</td></tr>
<tr><th>Variance</th><td>The expected prediction error that occurs when we apply our model to a new dataset that the model has not seen. A model with high variance will usually overfit the training data which results in lower training set error, but will lead to higher error on any data not used for training.
</td></tr>
</tbody>
</table>
<p>Because of these two sources of error, fitting machine learning models requires finding the right model complexity without overfitting our training data. </p>
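<p>To make the bias/variance trade-off concrete, here is a small illustrative Python/NumPy sketch (not from the original GAUSS examples; the data-generating process and degrees are made up) that fits polynomials of increasing degree to noisy data and compares training and testing error. Low degrees underfit (high bias); very high degrees keep driving training error down while test error can climb (high variance).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth data-generating process
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

# Hold out every third observation for testing
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

results = {}
for degree in (1, 3, 10):
    # Least-squares polynomial fit on the training data only
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((y_train - poly(x_train)) ** 2)
    test_mse = np.mean((y_test - poly(x_test)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

<p>Training MSE can only fall (or stay flat) as the degree grows, because each higher-degree model nests the lower ones; the test MSE is the number that reveals overfitting.</p>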
<h2 id="model-performance-measures">Model Performance Measures</h2>
<p>There are a number of methods for evaluating the performance of machine learning models. Ultimately, which performance measure is used should be based on business or research objectives. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="common-performance-measures"><span style="color:#FFFFFF">Common Performance Measures</span></h3>
      </th>
   </tr>
<tr><th>Method</th><th>Description</th><th>Uses</th></tr>
</thead>
<tbody>
<tr><td>Mean Squared Error (MSE)</td><td>The average of the squared distance between the target value and the value predicted by the model.</td><td rowspan="3">Regression Models</td></tr>
<tr><td>Mean Absolute Error (MAE)</td><td>The average of the absolute value of the distance between the target value and the value predicted by the model.</td></tr>
<tr><td>Root Mean Squared Error (RMSE)</td><td>The square root of the mean squared error.</td></tr>
<tr><td>Accuracy</td><td>The number of correct predictions divided by the total number of predictions.</td><td rowspan="4">Classification Models</td></tr>
<tr><td>Precision</td><td>Ratio of true positives to total positive predicted.</td></tr>
<tr><td>Recall</td><td>The proportion of true positives divided by the sum of true positives and false negatives.</td></tr>
<tr><td>F1-score</td><td>The harmonic mean of precision and recall.</td></tr>
</tbody>
</table>
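<p>The definitions in the table map directly to a few lines of code. Below is an illustrative Python/NumPy sketch (not GAUSS, and not the GML API) computing each measure from scratch, with tiny made-up examples:</p>

```python
import numpy as np

def regression_metrics(y, yhat):
    """MSE, MAE, and RMSE for a regression model."""
    err = y - yhat
    mse = np.mean(err ** 2)
    return {"MSE": mse, "MAE": np.mean(np.abs(err)), "RMSE": np.sqrt(mse)}

def classification_metrics(y, yhat):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels."""
    tp = np.sum((yhat == 1) & (y == 1))   # true positives
    fp = np.sum((yhat == 1) & (y == 0))   # false positives
    fn = np.sum((yhat == 0) & (y == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Accuracy": np.mean(y == yhat),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }

# Tiny worked examples
reg = regression_metrics(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 4.0]))
cls = classification_metrics(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1]))
print(reg)
print(cls)
```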
<h2 id="tuning-parameters">Tuning Parameters</h2>
<p>Adjusting hyperparameters is one important way that we can impact the performance of machine learning models. Hyperparameters are parameters that:</p>
<ul>
<li>Are set before the model is trained and are not learned from the data.</li>
<li>Determine how the model learns from the data. </li>
<li>May need to be readjusted to maintain optimal performance as more data is collected.</li>
</ul>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="example-hyperparameters"><span style="color:#FFFFFF">Example Hyperparameters</span></h3>
      </th>
   </tr>
<tr><th>Model</th><th>Hyperparameter</th></tr>
</thead>
<tbody>
<tr><td>K-nearest neighbor</td><td>The number of neighbors used in classification group, $k$.</td></tr>
<tr><td>Ridge regression</td><td>$\lambda$, the weight on the L2 penalty.</td></tr>
<tr><td>Gradient Boosting Machines</td><td>The number of trees, the shrinkage parameter, and the number of splits in each tree.</td></tr>
</tbody>
</table>
<p>Hyperparameters can have a big impact on how well a model performs. For this reason, it is important to systematically and strategically optimize hyperparameters using hyperparameter tuning. </p>
<p>Some popular methods for hyperparameter tuning include:</p>
<ol>
<li>
<p><b>Grid Search:</b> This is a simple but effective method where you specify a set of values for each hyperparameter, and the algorithm tries all possible combinations of values. This can be time-consuming, but it guarantees that you'll find the best set of hyperparameters within the specified options.</p>
</li>
<li>
<p><b>Random Search:</b> This method randomly selects values for each hyperparameter from a specified range. This can be faster than grid search, especially if you have a large number of hyperparameters, but it's not guaranteed to find the best set of hyperparameters.</p>
</li>
<li>
<p><b>Bayesian Optimization:</b> This is a more advanced method that uses probability models to choose the next set of hyperparameters to test. It takes into account the results of previous tests to choose values that are more likely to result in better performance.</p>
</li>
<li><b>Evolutionary Algorithms:</b> This method simulates evolution by creating a population of potential solutions (sets of hyperparameters) and selecting the best ones to &quot;breed&quot; new solutions. This process continues until a good solution is found.</li>
</ol>
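<p>The first two strategies are easy to sketch. The toy Python example below (illustrative only; the hyperparameter ranges and the stand-in objective function are made up) enumerates a full grid with <code>itertools.product</code> and compares it against a random subsample of the same grid:</p>

```python
import itertools
import random

# Hypothetical hyperparameter grid
num_trees = [50, 100, 200]
max_depth = [2, 4, 8, 16]

# Stand-in for "train the model and return validation error"
def validation_error(trees, depth):
    return (trees - 100) ** 2 / 1e4 + (depth - 8) ** 2 / 1e2

# Grid search: every combination is evaluated
grid = list(itertools.product(num_trees, max_depth))
best_grid = min(grid, key=lambda p: validation_error(*p))

# Random search: only a subset of the combinations is evaluated
random.seed(0)
sample = random.sample(grid, k=5)
best_random = min(sample, key=lambda p: validation_error(*p))

print("grid search best:  ", best_grid)
print("random search best:", best_random)
```

<p>Grid search is guaranteed to find the best combination within the grid, at the cost of evaluating all of them; random search evaluates far fewer points but may miss the optimum.</p>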
<h2 id="examples">Examples</h2>
<p>Today we will consider two examples of hyperparameter tuning. For each example we:</p>
<ol>
<li>Use a decision forest model, similar to the one we <a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener"> previously built to predict the U.S. output gap</a>. </li>
<li>Perform a grid search to determine the best hyperparameter value or values. </li>
<li>Use mean squared error as our model performance measure. </li>
</ol>
<h3 id="the-model">The Model</h3>
<p>Our model:</p>
<ul>
<li>Uses a combination of common economic indicators and GDP subcomponents as predictors of CBO-based U.S. output gap.  </li>
<li>Uses a 70/30 training and testing split without shuffling.</li>
<li>Is estimated using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>.</li>
</ul>
<p>When tuning a decision forest model, there are several hyperparameters that can be considered. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="decision-forest-hyperparameters"><span style="color:#FFFFFF">Decision Forest Hyperparameters</span></h3>
      </th>
   </tr>
<tr><th>Parameter</th><th>Description</th><th>Impact</th></tr>
</thead>
<tbody>
<tr><th>Number of trees</th><td>The number of decision trees that will be trained and combined to make predictions.</td><td>Increasing the number of trees can lead to better performance, but can also increase training time and memory requirements.</td></tr>
<tr><th>Maximum depth</th><td>The maximum depth, or number of splits, of each decision tree.</td><td>A deeper tree can capture more complex relationships in the data, but can also overfit the data and perform poorly on new data.</td></tr>
<tr><th>Observations per tree</th><td>The percentage of observations used per tree.</td><td>Increasing the percentage of observations used in a tree can improve accuracy but it also can increase computational cost, reduce interpretability, and lead to overfitting or loss of diversity. </td></tr>
<tr><th>Minimum observations per node</th><td>The minimum number of observations required to be at a leaf node. </td><td>Increasing this value can help prevent overfitting, but can also result in a less complex model.</td></tr>
<tr><th>Maximum features</th><td>The maximum number of features that can be used to split each node.</td><td>Limiting the number of features can help prevent overfitting and reduce training time, but can also result in a less accurate model.</td></tr>
</tbody>
</table>
<h2 id="example-one-tuning-a-single-parameter">Example One: Tuning a Single Parameter</h2>
<p>In our first example, we will use a grid search to tune the number of features used for splitting each node. We will hold all other parameters constant at the GAUSS default values.</p>
<table>
<tbody>
<tr><th>Parameter</th><th>GAUSS Default</th></tr>
<tr><td>Number of trees</td><td style="text-align: center">100</td></tr>
<tr><td>Maximum tree depth</td><td style="text-align: center">Unlimited</td></tr>
<tr><td>Percentage of <br>observations per tree</td><td style="text-align: center">100%</td></tr>
<tr><td>Minimum observations per leaf</td><td style="text-align: center">1</td></tr>
<tr><td>Maximum features</td><td style="text-align: center">$\frac{\text{Number of Variables}}{3}$</td></tr>
</tbody>
</table>
<h3 id="the-dfcontrol-structure">The <code>dfControl</code> Structure</h3>
<p>The <code>dfControl</code> <a href="https://www.aptech.com/resources/tutorials/a-gentle-introduction-to-using-structures/" target="_blank" rel="noopener">structure</a> is an <a href="https://www.aptech.com/blog/the-basics-of-optional-arguments-in-gauss-procedures/" target="_blank" rel="noopener">optional argument</a> used to pass hyperparameter values to the  <a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener"><code>decForestRFit</code></a> and <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener"><code>decForestCFit</code></a> procedures.</p>
<p>Using the structure to change hyperparameters requires three steps:</p>
<ol>
<li>Declare an instance of the <code>dfControl</code> structure using the <code>struct</code> keyword. </li>
<li>Fill the default values for the members using the <a href="https://docs.aptech.com/gauss/dfcontrolcreate.html" target="_blank" rel="noopener"><code>dfControlCreate</code></a> procedure. </li>
<li>Set the desired parameter value using GAUSS &quot;dot&quot;, <code>.</code>, notation.</li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

// Specify features per node
dfc.featuresPerSplit = 4;</code></pre>
<h3 id="loading-and-splitting-our-data">Loading and Splitting our Data</h3>
<p>The first step in our hyperparameter tuning example is to load our data and split it into training and testing datasets. We load the data with the <a href="https://docs.aptech.com/gauss/loadd.html"><code>loadd</code></a> procedure and split it with the <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a> procedure.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load and split
*/
library gml;

// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");

/*
** Split data into 70% training and 30% testing sets 
** without shuffling.
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);</code></pre>
<h3 id="setting-non-tuning-parameters">Setting Non-Tuning Parameters</h3>
<p>Next, we will set the non-tuning hyperparameters to the GAUSS defaults using the <code>dfControl</code> structure.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Settings for decision forest
*/
// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();</code></pre>
<h3 id="performing-grid-search">Performing Grid Search</h3>
<p>Now that we've set our default non-tuning parameters we will perform our grid search to tune the features per node. The first step is to initialize our grid and <a href="https://www.aptech.com/blog/gauss-basics-3-introduction-to-matrices/" target="_blank" rel="noopener">storage matrices</a>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Initialize grid and
** storage matrices
*/
// Create vector of possible
// features per node values
featuresPerSplit = seqa(1, 1, cols(X));

// Create storage dataframe for MSE
// with one column for training mse
// and one column for testing mse
mse = asDF(zeros(rows(featuresPerSplit), 2), "Train", "Test");</code></pre>
<div class="alert alert-info" role="alert">Note that in the case of tuning a single parameter, we only have to search over a vector of potential values, not a grid. </div>
<p>Next, we will loop over each possible value of features per split. For each potential value we:</p>
<ol>
<li>Fit decision forest model using the training data.</li>
<li>Predict outcomes using the training data. </li>
<li>Predict outcomes using the testing data. </li>
<li>Compute the MSE for both the training and testing predictions.</li>
<li>Store the MSE values. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Loop over all potential values
// of features per node
for i(1, rows(featuresPerSplit), 1);

    // Set featuresPerSplit parameter
    dfc.featuresPerSplit = featuresPerSplit[i];

    /*
    ** Decision Forest Model
    */
    // Declare 'mdl' to be an instance of a
    // dfModel structure to hold the estimation results
    struct dfModel mdl;

    // Fit the model with default settings
    mdl = decForestRFit(y_train, X_train, dfc);

    // Make predictions using training data
    df_prediction_train = decForestPredict(mdl, X_train);

    // Make predictions using testing data
    df_prediction_test = decForestPredict(mdl, X_test);

    /*
    ** Compute and store mse
    */
    // Training set MSE
    mse[i, "Train"] = meanSquaredError(y_train, df_prediction_train);

    // Testing set MSE
    mse[i, "Test"] = meanSquaredError(y_test, df_prediction_test);

endfor;</code></pre>
<p>Note that within our loop we use the GML procedure <code>meanSquaredError</code> to compute our MSE.</p>
<h3 id="results">Results</h3>
<p>A visualization of our MSE values gives us some insight into what happens as we increase the features per node in our decision forest model:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_1.jpg" alt="Training and testing MSE as the features per node changes in a random forest model." width="600" height="400" class="aligncenter size-full wp-image-11583715" /></a></p>
<ul>
<li>As we increase the features per node up to about 5 or 6, we see a general downward trend in both the testing and training MSE. Over this period, the increased features per node allows the model to capture more complex interactions and dependencies in the data.</li>
<li>Increasing the features per node beyond 6 results in a general upward trend in testing MSE and a downward trend in training MSE. This points to overfitting: the model fits the training data too well, capturing noise and irrelevant patterns, which leads to decreased performance on the unseen testing data.</li>
</ul>
<p>To confirm our optimal features per node parameter setting, we can locate the minimum testing MSE:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Find the row index of the lowest MSE
idx = minindc(mse[., "Test"]);

// NOTE: two semi-colons at the end of a print statement
//       prevents it from printing a newline at the end
print "Optimal features per node: ";; featuresPerSplit[idx];
print "Minimum test MSE:";; asmatrix(mse[idx, "Test"]);</code></pre>
<p>This confirms that the optimal features per node is 6, with a testing MSE of 3.212.</p>
<pre>Optimal features per node:        6.0000000
Minimum test MSE:       3.2122050 </pre>
<h2 id="example-two-simultaneously-tuning-hyperparameters">Example Two: Simultaneously Tuning Hyperparameters</h2>
<p>Now that we've seen how to tune a single hyperparameter, let's look at tuning two hyperparameters simultaneously. We will use the same data and set up from our previous example:</p>
<h3 id="data-loading-and-preliminary-setup">Data loading and preliminary setup</h3>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load and split
*/
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");

/*
** Split data into 70% training and 30% testing sets 
** without shuffling
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);

/*
** Settings for decision forest
*/
// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

// Set features per split
dfc.featuresPerSplit = 6;</code></pre>
<div class="alert alert-info" role="alert">For convenience, we are using the <code>featuresPerSplit</code> value found in the previous section. The optimal value of one hyperparameter depends on the values of the others, so in practice, you should not optimize them separately.</div>
<h3 id="performing-grid-search-1">Performing Grid Search</h3>
<p>In this example, we will tune:</p>
<ul>
<li>The minimum observations per leaf, ranging from 1 to 20.  </li>
<li>The percentage of the observations per tree, ranging from 70% to 100%. </li>
</ul>
<p>First, we initialize our grid and storage matrices. For this example, we will focus only on our testing MSE. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Initialize grid and
** storage matrices
*/
// Set potential values for 
// minimum observations per node
minObsLeaf = seqa(1, 1, 20);

// Set potential values for 
// percentage of observations
// in tree
pctObs = seqa(0.7, 0.1, 4);

// Storage matrices
test_mse = zeros(rows(minObsLeaf), rows(pctObs));</code></pre>
<p>Next, we use nested <a href="https://docs.aptech.com/gauss/for.html" target="_blank" rel="noopener"><code>for</code> loops</a> to search over all potential values of the minimum observations per leaf and the percentage of observations used per tree.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">for i(1, rows(minObsLeaf), 1);

    // Set the minimum obs per leaf
    dfc.minObsLeaf = minObsLeaf[i];

    for j(1, rows(pctObs), 1);

        // Set percentage of obs used for each tree
        dfc.pctObsPerTree = pctObs[j];

        /*
        ** Decision Forest Model
        */
        // Declare 'mdl' to be an instance of a
        // dfModel structure to hold the estimation results
        struct dfModel mdl;

        // Estimate the model with default settings
        mdl = decForestRFit(y_train, X_train, dfc);

        // Make predictions using testing data
        df_prediction_test = decForestPredict(mdl, X_test);

        /*
        ** Compute and store mse
        */
        // Testing set MSE
        test_mse[i, j] = meanSquaredError(y_test, df_prediction_test);

    endfor;
endfor;</code></pre>
<p>Note that in this loop:</p>
<ul>
<li>We use <em>i</em>, from the outer loop, to index the <code>minObsLeaf</code> vector.</li>
<li>We use <em>j</em>, from the inner loop, to index the <code>pctObs</code> vector.</li>
<li>Each row in our storage matrices represents a constant minimum samples per leaf. </li>
<li>Each column in our storage matrices represents a constant percentage of samples.</li>
</ul>
<h3 id="results-1">Results</h3>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_2.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_2.jpg" alt="Test MSE for a random forest model with varying hyperparameters." width="600" height="400" class="aligncenter size-full wp-image-11583724" /></a></p>
<p>The above plot shows us that with the GAUSS default settings for a random forest and <code>featuresPerSplit</code> set to 6:</p>
<ul>
<li>Taking a sample of 100% of the data for the creation of each tree is almost always best.</li>
<li>Setting <code>minObsLeaf</code> to between 5 and 10 seems best, with the minimum at about 7.</li>
<li>We did not get much of an improvement in our test MSE over the first example.</li>
</ul>
<h3 id="optional-finding-the-minimum-mse-value-in-the-output-matrix">Optional: Finding the minimum MSE value in the output matrix</h3>
<p>The final step is to find our optimal hyperparameter settings by locating the combination of parameters that yields the lowest MSE. </p>
<p>We can break this into two steps. First, we find the column that contains the minimum value.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create a column vector with the minimum MSE
// values for each column
mse_col_mins = minc(test_mse);

// Find the index of the smallest
// value in 'mse_col_mins'
idx_col_min = minindc(mse_col_mins);</code></pre>
<p>Now that we have found which column contains the minimum MSE value, we use <a href="https://docs.aptech.com/gauss/minindc.html"><code>minindc</code></a> to find the index of the smallest value in that column.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Find the row that contains the smallest MSE value
idx_row_min = minindc(test_mse[.,idx_col_min]);

// Extract the lowest MSE across all
// combinations of tuning parameters
MSE_optimal = test_mse[idx_row_min, idx_col_min];

// Print results
sprintf( "Minimum testing MSE: %4f", MSE_optimal);
print "Minimum MSE occurs with";
sprintf("  minimum samples per leaf      : %d", minObsLeaf[idx_row_min]);
sprintf("  percentage of samples per tree: %g%%", 100 * pctObs[idx_col_min]); </code></pre>
<p>This prints our results:</p>
<pre>Minimum testing MSE: 3.151047
Minimum MSE occurs with
  minimum samples per leaf      : 7
  percentage of samples per tree: 100%</pre>
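<p>The same two-step lookup can be collapsed into a single call. As an illustrative cross-check in Python/NumPy (not the GAUSS code from this post, and using a small made-up stand-in matrix), <code>np.argmin</code> combined with <code>np.unravel_index</code> returns the row and column of the minimum directly:</p>

```python
import numpy as np

# Small stand-in for the test_mse storage matrix;
# rows index minObsLeaf values, columns index pctObs values
test_mse = np.array([
    [3.60, 3.40, 3.35, 3.30],
    [3.45, 3.30, 3.25, 3.20],
    [3.50, 3.35, 3.28, 3.151047],
])

# Flat index of the smallest entry, converted back to (row, col)
row, col = np.unravel_index(np.argmin(test_mse), test_mse.shape)

mse_optimal = test_mse[row, col]
print(f"minimum MSE {mse_optimal:.6f} at row {row}, column {col}")
```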
<div class="alert alert-info" role="alert">For more information on using <a href="https://docs.aptech.com/gauss/sprintf.html" target="_blank" rel="noopener">sprintf</a> for printing see our previous blog, <a href="https://www.aptech.com/blog/how-to-create-a-simple-table-with-sprintf-in-gauss/" target="_blank" rel="noopener">&quot;How to Create a Simple Table Using Sprintf&quot;</a></div>
<h2 id="conclusion">Conclusion</h2>
<p>Today's blog demonstrates how practitioners can use hyperparameters to tune and improve machine learning models. It is important to remember that taking the time to systematically and strategically determine model hyperparameters can greatly improve machine learning model performance.</p>
<p>Stay tuned, because next time we will take a deeper dive into how to think about the data and which hyperparameter settings make sense to try out.</p>
<div class="alert alert-info" role="alert">The code and data for this blog are available in our GitHub repository. You can find the repository <a href="https://github.com/aptech/gauss_blog/tree/master/machine-learning/parameter-tuning" target="_blank" rel="noopener">here</a>.</div>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a>  </li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
</ol>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // calculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Predicting The Output Gap With Machine Learning Regression Models</title>
		<link>https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/</link>
					<comments>https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Wed, 12 Apr 2023 18:44:19 +0000</pubDate>
				<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Time Series]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583590</guid>

					<description><![CDATA[In today's blog, we compare three different machine learning regression techniques for predicting U.S. real GDP output gap. We will use a combination of common economic indicators and GDP subcomponents to predict the quarterly GDP output gap. ]]></description>
					<content:encoded><![CDATA[
<h3 id="introduction">Introduction</h3>
<p>Economists are increasingly exploring the potential for machine learning models in economic forecasting. This blog offers an introduction to using three different machine learning regression techniques for economic modeling, using an empirical application to the real U.S. GDP output gap. </p>
<p>We look specifically at:</p>
<ul>
<li>Measuring the output gap. </li>
<li>The fundamentals of three machine learning regression models. </li>
<li>Model estimation using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>.</li>
</ul>
<h2 id="measuring-gdp-output-gap">Measuring GDP Output Gap</h2>
<p>The GDP output gap is a macroeconomic indicator that measures the difference between actual GDP and potential GDP. It is an interesting and useful economic statistic:</p>
<ul>
<li>It indicates whether the economy is operating with unemployment, inefficiencies, or inflationary pressures, making it useful for policymaking. </li>
<li>Potential GDP is unobservable and must be estimated, with a large literature devoted to finding the best estimate of potential GDP. </li>
<li>Positive output gaps indicate that the economy is operating over potential GDP and at risk of inflation. </li>
<li>Negative output gaps indicate that the economy is operating below potential GDP and possibly in recession.  </li>
</ul>
<p>Our goal today is to demonstrate different machine learning regression techniques. For simplicity, we're going to use the output gap based on the <a href="https://fred.stlouisfed.org/series/GDPPOT" target="_blank" rel="noopener">Congressional Budget Office's estimate of real potential GDP</a> to train our model. </p>
<div class="alert alert-info" role="alert">We compute the output gap as the percent deviation of real U.S. GDP from the CBO's estimate of real potential GDP. Both components are available for download from the FRED database. </div>
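<p>As a quick numeric illustration of this definition, here is a minimal Python sketch (the GDP values are hypothetical, not actual FRED data):</p>

```python
def output_gap(actual_gdp, potential_gdp):
    """Output gap as the percent deviation of actual GDP from potential GDP."""
    return 100.0 * (actual_gdp - potential_gdp) / potential_gdp

# If actual GDP is 19,800 and potential GDP is 20,000 (billions of dollars),
# the economy is operating 1% below potential: a negative output gap.
print(output_gap(19800.0, 20000.0))  # -1.0
```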
<h2 id="the-models">The Models</h2>
<p>Today we will look at three machine learning models used specifically for predicting continuous data:</p>
<ul>
<li>Decision forest regression (also known as random forest regression). </li>
<li>LASSO regression. </li>
<li>Ridge regression. </li>
</ul>
<h3 id="decision-forest-regression">Decision Forest Regression</h3>
<h4 id="decision-trees">Decision Trees</h4>
<p>Decision forest regression utilizes decision trees for continuous data, which:</p>
<ol>
<li>Segment the data into subsets using data-based <em>splitting rules</em>. </li>
<li>Assign the average of the target variable within a subset as the prediction for all observations that fall inside that subset. </li>
</ol>
<p>To implement a single decision tree, a sample is split into segments using <em>recursive binary splitting</em>. This iterative approach determines where and how to split the data based on what leads to the lowest residual sum of squares (RSS).</p>
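<p>To make the splitting rule concrete, here is a minimal Python sketch of one step of recursive binary splitting (illustrative only; the blog's estimation is done in GAUSS, and the data below is made up). Each candidate threshold on a feature is tried, and the one that minimizes the combined RSS of the two segments is kept:</p>

```python
import numpy as np

def best_split(x, y):
    """One step of recursive binary splitting on a single feature:
    return the threshold that minimizes total RSS across both segments."""
    best_rss, best_threshold = np.inf, None
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        # Each segment predicts its own mean, so its RSS is the sum of
        # squared deviations from that mean.
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss, best_threshold = rss, threshold
    return best_threshold, best_rss

# Two clearly separated regimes; the best split lands between them.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # threshold 3.0 separates the two regimes
```

<p>A full tree applies this search recursively within each resulting segment, and across all candidate features, until a stopping rule is met.</p>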
<h4 id="decision-forests">Decision Forests</h4>
<p>Single decision trees can have low, non-robust predictive power and suffer from high variance. This can be overcome using random decision forests that offer performance improvements by combining results from groups, or &quot;forests&quot;, of trees.</p>
<p>The random decision forest algorithm:</p>
<ol>
<li>Randomly chooses $m$ predictors to be used as candidates for splitting the data.</li>
<li>Constructs a decision tree from a bootstrapped training set. </li>
<li>Repeats the decision tree formation for a specified number of iterations. </li>
<li>Averages the results from all trees to make a final prediction.</li>
</ol>
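<p>Steps 2&ndash;4 of the algorithm can be sketched in a few lines of Python (illustrative only; the blog's actual estimation uses GAUSS's <code>decForestRFit</code> below). For simplicity we use a single feature with a fixed split, so step 1, random predictor subsetting, is omitted:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def forest_predict(x, y, x_new, threshold, n_trees=200):
    """Bootstrap the training set, fit a depth-one tree on each
    resample, and average the predictions over the whole forest."""
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
        xb, yb = x[idx], y[idx]
        left, right = yb[xb <= threshold], yb[xb > threshold]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate resamples with an empty segment
        # A depth-one tree predicts the mean of the relevant segment.
        preds.append(left.mean() if x_new <= threshold else right.mean())
    return float(np.mean(preds))

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(forest_predict(x, y, x_new=11.0, threshold=3.0))  # close to 5.0
```

<p>Averaging over many bootstrapped trees is what reduces the variance of the single-tree predictions.</p>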
<h3 id="lasso-and-ridge-regression">LASSO and Ridge Regression</h3>
<p>LASSO and ridge regression aim to reduce prediction variances using a modified least squares approach. Let's look a little more closely at how this works. </p>
<p>Recall that ordinary least squares estimates coefficients through the minimization of the residual sum of squares (RSS):</p>
<p>$$ RSS = \sum_{i=1}^n \bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2$$</p>
<p>Penalized least squares estimates coefficients using a modified objective function:</p>
<p>$$ S_{\lambda} = \sum_{i=1}^n \bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda J $$</p>
<p>where $\lambda$ is the tuning parameter and $\lambda J$ is the penalty term, whose form depends on the method. </p>
<table>
<tbody>
<tr><th>Method</th><th>Description</th><th>Penalty term</th></tr>
<tr><td>LASSO Regression</td><td>$L_1$ penalized linear regression model.</td><td>$\lambda \sum_{j=1}^p |\beta_j|$</td></tr>
<tr><td>Ridge Regression</td><td>$L_2$ penalized linear regression model.</td><td>$\lambda \sum_{j=1}^p \beta_j^2$</td></tr>
</tbody>
</table>
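<p>A quick worked example shows how the two penalties treat the same coefficient vector differently (a Python sketch with made-up coefficients; $\lambda = 0.3$ matches the value used later in this blog):</p>

```python
import numpy as np

beta = np.array([2.0, -0.5, 0.0, 1.5])  # hypothetical coefficients
lam = 0.3

l1_penalty = lam * np.abs(beta).sum()  # LASSO: lambda * sum of |beta_j|
l2_penalty = lam * (beta ** 2).sum()   # ridge: lambda * sum of beta_j^2

print(l1_penalty)  # sum of |beta| is 4.0, so the penalty is 1.2
print(l2_penalty)  # sum of beta^2 is 6.5, so the penalty is about 1.95
```

<p>The $L_1$ penalty grows linearly in each coefficient, which is why LASSO can shrink coefficients exactly to zero and drop predictors, while the quadratic $L_2$ penalty shrinks large coefficients heavily but rarely to exactly zero.</p>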
<h2 id="our-prediction-process">Our Prediction Process</h2>
<p>Our prediction process is motivated by the idea that as new information becomes available, it should be used to improve our forecasting model. </p>
<p>Based on this motivation, we use an expanding training window to make one-step ahead forecasts:</p>
<ul>
<li>Train the model using all observed data in the training window, features and output gap, up to time $t$.</li>
<li>Predict the output gap at time $t + 1$ using the observed features at time $t + 1$. </li>
<li>Expand the training window to include all observed data up to time $t + 1$.</li>
<li>Repeat model training and prediction. </li>
</ul>
<p>It's worth noting that while this method uses the most information available for each prediction, there is a trade-off in timeliness. In a real-world setting, it means we only forecast the output gap one quarter ahead. This may not be far enough in advance if we're using the forecast to guide business or investment decisions. </p>
<h2 id="predictors">Predictors</h2>
<p>Today we will use a combination of common economic indicators and GDP subcomponents as predictors. </p>
<table>
 <thead>
<tr><th>Variable</th><th>Description</th><th>Transformations</th></tr>
</thead>
<tbody>
<tr><td><a href="https://fred.stlouisfed.org/series/UMCSENT" target="_blank" rel="noopener">UMCSENT</a></td><td>University of Michigan consumer sentiment, quarterly average.</td><td>None</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/UNRATE" target="_blank" rel="noopener">UNRATE</a></td><td>Civilian unemployment rate as a percentage, quarterly average.</td><td>None.</td></tr>
<tr><td>CR</td><td>The credit spread between <a href="https://fred.stlouisfed.org/series/BAAFFM" target="_blank" rel="noopener">Moody's BAA</a> and <a href="https://fred.stlouisfed.org/series/AAAFFM" target="_blank" rel="noopener">AAA</a> corporate bond yields.</td><td>None.</td></tr>
<tr><td>TS</td><td>The difference between the yield on the <a href="https://fred.stlouisfed.org/series/DGS10" target="_blank" rel="noopener">10-year treasury bond</a> and the <a href="https://fred.stlouisfed.org/series/DGS1" target="_blank" rel="noopener">1-yr treasury bill</a>.</td><td>None</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/FEDFUNDS" target="_blank" rel="noopener">FEDFUNDS</a></td><td>The Federal Funds rate.</td><td>First differences.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/SP500" target="_blank" rel="noopener">SP500</a></td><td>The S&amp;P 500 index value at market closing.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/CPIAUCSL" target="_blank" rel="noopener">CPIAUCSL</a></td><td>Consumer price index for all urban consumers.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/INDPRO" target="_blank" rel="noopener">INDPRO</a></td><td>The industrial production (IP) index.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/HOUST" target="_blank" rel="noopener">HOUST</a></td><td>New privately-owned housing unit starts.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td>GAP_CH</td><td>The change in output gap.</td><td>None.</td></tr>
</tbody>
</table>
<p>For our model:</p>
<ul>
<li>All predictors are available from <a href="https://fred.stlouisfed.org/series/GDPPOT" target="_blank" rel="noopener">FRED</a> in levels.         </li>
<li>Monthly variables are aggregated to quarterly data using averages.</li>
<li>Four lags of all variables are included. </li>
</ul>
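<p>The &quot;difference in natural logs&quot; transformation listed above is easy to sketch in Python (hypothetical index values, for illustration only; the cleaned GAUSS dataset linked below is provided pre-transformed):</p>

```python
import numpy as np

def pct_change_logdiff(series):
    """Percent change computed as the difference in natural logs,
    the transformation applied to SP500, CPIAUCSL, INDPRO, and HOUST."""
    logs = np.log(np.asarray(series, dtype=float))
    return 100.0 * np.diff(logs)

# A move from 100 to 102 is roughly a 2% log change.
print(pct_change_logdiff([100.0, 102.0, 101.0]))
```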
<h2 id="estimation-in-gauss">Estimation in GAUSS</h2>
<h3 id="data-loading">Data Loading</h3>
<p>Because we want to primarily focus on the models, rather than data cleaning, we don't go into the details of our data cleaning process here. Instead, the cleaned and prepped data is available for download <a href="https://github.com/aptech/gauss_blog/blob/master/machine-learning/ml-regressions/reg_data.gdat?raw=true" target="_blank" rel="noopener">here</a>. </p>
<div class="alert alert-info" role="alert">For more information about data cleaning and management try one of our earlier blogs such as:
<br>• <a href="https://www.aptech.com/blog/importing-fred-data-to-gauss/" target="_blank" rel="noopener">Importing FRED Data To GAUSS</a><br>• <a href="https://www.aptech.com/blog/getting-to-know-your-data-with-gauss-22/" target="_blank" rel="noopener">Getting to Know Your Data With GAUSS</a><br>• <a href="https://www.aptech.com/blog/preparing-and-cleaning-data-fred-data-in-gauss/" target="_blank" rel="noopener">Preparing And Cleaning FRED Data In GAUSS</a>
</div>
<p>Prior to estimating any model, we load the data and separate our outcome and feature data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">library gml;
rndseed 23423;

/*
** Load data and prepare data
*/
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

// Trim rows from the top of data to account
// for lagged and differenced data
max_lag = 4;
data = trimr(data, max_lag + 1, 0);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");
</code></pre>
<h3 id="general-one-step-ahead-process">General One-Step-Ahead Process</h3>
<p>The full data sample ranges from 1967Q1 to 2022Q4. We'll start computing one-step-ahead forecasts in 1995Q1, using an initial training period of 1967Q1 to 1994Q4. </p>
<p>To implement the expanding window one-step-ahead forecasts, we use a GAUSS <a href="https://docs.aptech.com/gauss/dowhiledountil.html" target="_blank" rel="noopener"><code>do while</code></a> loop:  </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify starting date 
st_date = asDate("1994-Q4", "%Y-Q%q");

// Find the index of 'st_date'
st_indx = indnv(st_date, data[., "date"]); 

// Iterate over remaining observations
// using expanding window to fit model
do while st_indx &lt; rows(x)-1;

    // Get y_train and x_train
    y_train = y[1:st_indx];
    x_train = X[1:st_indx, .]; 
    x_test = X[st_indx+1, .];

    // Fit model
    ...

    // Compute one-step-ahead prediction
    ...

    // Update st_indx
    st_indx = st_indx + 1;
endo;</code></pre>
<h3 id="model-and-prediction-procedures">Model and Prediction Procedures</h3>
<p>The GAUSS machine learning library offers all the procedures we need for our model training and prediction. </p>
<table>
<tbody>
<tr><th>Model</th><th>Fitting Procedure</th><th>Prediction Procedure</th></tr>
<tr><td>Decision Forest</td><td><a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener">decForestRFit</a></td><td><a href="https://docs.aptech.com/gauss/decforestpredict.html" target="_blank" rel="noopener">decForestPredict</a></td></tr>
<tr><td>LASSO Regression</td><td><a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener">lassoFit</a></td><td><a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener">lmPredict</a></td></tr>
<tr><td>Ridge Regression</td><td><a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener">ridgeFit</a></td><td><a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener">lmPredict</a></td></tr>
</tbody>
</table>
<p>To simplify our code, we will use three <a href="https://www.aptech.com/blog/basics-of-gauss-procedures/" target="_blank" rel="noopener">GAUSS procedures</a> that combine the fitting and prediction steps for each method. </p>
<p>We define one procedure for the one-step ahead prediction for the LASSO model:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">proc (1) = osaLasso(y_train, x_train, x_test, lambda);
    local lasso_prediction;

    /*
    ** Lasso Model
    */
    // Declare 'mdl' to be an instance of a
    // lassoModel structure to hold the estimation results
    struct lassoModel mdl;

    // Estimate the model with default settings
    mdl = lassoFit(y_train, x_train, lambda);

    // Make predictions using test data
    lasso_prediction = lmPredict(mdl, x_test);

    retp(lasso_prediction);
endp;</code></pre>
<p>The second procedure performs fitting and prediction for the ridge model:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">proc (1) = osaRidge(y_train, x_train, x_test, lambda);
    local ridge_prediction;

    /*
    ** Ridge Model
    */
    // Declare 'mdl' to be an instance of a
    // ridgeModel structure to hold the estimation results
    struct ridgeModel mdl;

    // Estimate the model with default settings
    mdl = ridgeFit(y_train, x_train, lambda);

    // Make predictions using test data
    ridge_prediction = lmPredict(mdl, x_test);

    retp(ridge_prediction);
endp;</code></pre>
<p>The final procedure performs fitting and prediction for the decision forest model:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">proc (1) = osaDF(y_train, x_train, x_test, struct dfControl dfc);
    local df_prediction;

    /*
    ** Decision Forest Model
    */
    // Declare 'mdl' to be an instance of a
    // dfModel structure to hold the estimation results
    struct dfModel mdl;

    // Estimate the model with default settings
    mdl = decForestRFit(y_train, x_train, dfc);

    // Make predictions using test data
    df_prediction = decForestPredict(mdl, x_test);

    retp(df_prediction);
endp;</code></pre>
<h3 id="computing-predictions">Computing Predictions</h3>
<p>Finally we are ready to begin computing our predictions. First, we set the necessary tuning parameters:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Set up tuning parameters
*/

// L2 and L1 regularization penalty
lambda = 0.3;

/*
** Settings for decision forest
*/
// Use control structure for settings
struct dfControl dfc;
dfc = dfControlCreate();

// Turn on variable importance
dfc.variableImportanceMethod = 1;

// Turn on out-of-bag error calculation
dfc.oobError = 1;</code></pre>
<div class="alert alert-info" role="alert">We used a λ of 0.3 for both the ridge and LASSO models and all GAUSS default settings for the decision forest hyperparameters. Note that we have not taken any steps to optimize our models; model selection and optimization will be covered in later blogs. </div>
<p>Next, we initialize the starting point for our loop and our prediction storage matrix. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Initialize starting point and
** storage matrix for expanding 
** window loop
*/
st_date = asDate("1994-Q4", "%Y-Q%q");
st_indx = indnv(st_date, data[., "date"]);

// Set up storage dataframe for predictions
// using one column for each model
osa_pred = asDF(zeros(rows(X), 3), "LASSO", "Ridge", "Decision Forest");</code></pre>
<p>Finally, we implement our expanding window <code>do while</code> loop:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">do while st_indx &lt; rows(X)-1;

    // Get y and x subsets for
    // fitting and prediction
    y_train = Y[1:st_indx];
    X_train = X[1:st_indx, .]; 
    X_test = X[st_indx+1, .];

    // LASSO Model
    osa_pred[st_indx+1, "LASSO"] = osaLasso(y_train, X_train, X_test, lambda);

    // Ridge Model
    osa_pred[st_indx+1, "Ridge"] = osaRidge(y_train, X_train, X_test, lambda);

    // Decision Forest Model
    osa_pred[st_indx+1, "Decision Forest"] = osaDF(y_train, X_train, X_test, dfc);

    // Update st_indx
    st_indx = st_indx + 1;
endo;</code></pre>
<h2 id="results">Results</h2>
<h3 id="prediction-visualization">Prediction Visualization</h3>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/08/lr-prediction-comparisons.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/08/lr-prediction-comparisons.jpg" alt="Comparison of output gap predictions using LASSO, ridge, and decision forest regression. " width="800" height="600" class="aligncenter size-full wp-image-11584010" /></a></p>
<p>The graph above plots the predictions from all three of our models against the actual CBO implied output gap. There are a few things worth noting about these results:</p>
<ul>
<li>All three models fail to predict the output decline associated with the start of the COVID pandemic. This isn't a surprise, as the onset of COVID was a hard-to-predict shock to the economy. </li>
<li>The models underestimate the persistent effects of the 2008 global financial crisis. While all three trend in the same direction as the observed output gap, they all predict better economic performance than actually occurred. This tells us that our feature set doesn't contain the information needed to capture the ongoing effects of the financial crisis. We could potentially improve our model by incorporating more features, such as bank balances or home foreclosures. </li>
<li>The ridge model overestimates the short-term impacts of the 2008 global financial crisis, predicting a larger drop in the output gap than both the other models and the actual output gap.</li>
</ul>
<div class="alert alert-info" role="alert">To learn more about formatting GAUSS plots see our <a href="https://www.aptech.com/blog/category/graphics/">GAUSS graphics blogs</a>. </div>
<h3 id="model-performance">Model Performance</h3>
<p>We can also compare the performance of our models using the mean squared error (MSE). This can easily be calculated from our predictions and our observed output gap:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Computing MSE
*/
// Compute residuals
residuals = osa_pred - y;

// Filter for prediction window
residuals = selif(residuals, data[., "date"] .&gt;= st_date);

// Compute the MSE for prediction window
mse  = meanc((residuals).^2);</code></pre>
<p>A comparison of the MSE shows that the models perform similarly, with our decision forest model offering a slight advantage over LASSO and ridge. </p>
<table style="width:35%;margin-left:auto;margin-right:auto;">
<tbody>
<tr><th style="width:60%">Model</th><th>MSE</th></tr>
<tr><td>LASSO</td><td>2.08</td></tr>
<tr><td>Ridge</td><td>2.36</td></tr>
<tr><td>Decision Forest</td><td>1.80</td></tr>
</tbody>
</table>
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog we examined the performance of several machine learning regression models used to predict the output gap. This blog is meant to provide an introduction to these models, leaving model selection and optimization for future blogs. </p>
<p>After today's blog, you should have a better understanding of:</p>
<ul>
<li>The foundations of decision forest regression models.</li>
<li>LASSO and ridge regression models.</li>
<li>How machine learning models can be used to help predict economic and financial outcomes.</li>
</ul>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>  </li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Applications of Principal Components Analysis in Finance</title>
		<link>https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/</link>
					<comments>https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Thu, 16 Mar 2023 03:45:47 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583484</guid>

					<description><![CDATA[Principal components analysis (PCA) is a useful tool that can help practitioners streamline data without losing information. In today’s blog, we’ll examine the use of principal components analysis in finance using an empirical example. 

We'll look more closely at:
<ul>
<li>What PCA is.</li>
<li>How PCA works.</li> 
<li>How to use the GAUSS Machine Learning library to perform PCA.</li> 
<li>How to interpret PCA results.</li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Principal components analysis (PCA) is a useful tool that can help practitioners streamline data without losing information. In today’s blog, we’ll examine the use of principal components analysis in finance using an empirical example. </p>
<p>Specifically, we’ll look more closely at:</p>
<ul>
<li>What PCA is. </li>
<li>How PCA works. </li>
<li>How to use the GAUSS Machine Learning library to perform PCA. </li>
<li>How to interpret PCA results. </li>
</ul>
<h2 id="what-is-principal-components-analysis">What is Principal Components Analysis?</h2>
<p>Principal components analysis (PCA) is an unsupervised learning method that results in a low-dimensional representation of a dataset. The intuition behind PCA is that the most important information is drawn from the features by eliminating redundancy and noise. The resulting dataset captures the most interesting components of the data. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="pca-snapshot"><span style="color:#FFFFFF">PCA Snapshot</span></h3>
      </th>
   </tr>
</thead>
<tbody>
<tr><td>Uses linear transformations to capture the most important characteristics of a set of features. </td></tr>
<tr><td>Uses variance of the features to distinguish relevant features from pure noise.</td></tr>
<tr><td>Identifies and removes redundancy in features.</td></tr>
</tbody>
</table>
<div class="alert alert-info" role="alert">Unsupervised learning methods are a subcategory of machine learning models. Rather than predicting outcomes or responses, unsupervised learning methods aim to characterize and answer questions about a feature set. </div>
<h2 id="how-do-we-find-principal-components">How Do We Find Principal Components?</h2>
<p>Principal components are found by identifying the normalized, linear combination of features</p>
<p>$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p$$</p>
<p>which has the largest variance. </p>
<p>The coefficients $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ are referred to as the loadings and are restricted such that their sum of squares is equal to one. </p>
<p>To compute the first principal component we:</p>
<ol>
<li>Center our feature data to have a mean of zero. </li>
<li>Find loadings that result with the largest sample variance, subject to the constraint that $\sum_{j=1}^p \phi_{j,1}^2 = 1$.</li>
</ol>
<p>Once the first principal component is found, we can find a second principal component, $Z_2$, which is constrained to be uncorrelated with $Z_1$.</p>
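<p>These steps map directly onto a singular value decomposition of the centered data. Here is a small Python sketch on synthetic data (the blog's own PCA is run with the GAUSS Machine Learning library):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic feature matrix: 200 observations, 3 features, one redundant.
X = rng.standard_normal((200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.standard_normal(200)

# Step 1: center each feature to mean zero.
Xc = X - X.mean(axis=0)

# Step 2: the loadings phi are the right singular vectors of the centered
# data (equivalently, eigenvectors of its covariance matrix); the first
# row of Vt points in the direction of largest variance.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]

Z1 = Xc @ phi1  # first principal component scores
Z2 = Xc @ phi2  # second component, uncorrelated with the first

print(np.sum(phi1 ** 2))  # loadings are normalized: sum of squares is 1
print(abs(np.corrcoef(Z1, Z2)[0, 1]) < 1e-8)  # components are uncorrelated
```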
<h2 id="when-should-you-use-pca">When Should You Use PCA?</h2>
<p>The most common use of PCA is to reduce the size of a feature set without losing too much information. The reduced feature set can then be used in a second stage of modeling. However, this is not the only use of PCA, and there are a number of insightful ways it can be applied. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="real-world-applications-of-pca"><span style="color:#FFFFFF">Real World Applications of PCA</span></h3>
      </th>
   </tr>
</thead>
<tbody>
<tr><td>Reducing the size of images.</td><td>PCA can be used to reduce the size of an image without significantly impacting the quality. Beyond reducing storage, this is useful as a preprocessing step for image classification algorithms.</td></tr>
<tr><td><a href="https://www.aptech.com/blog/category/graphics/" target="_blank" rel="noopener">Visualizing</a> multidimensional data.</td><td>PCA allows us to represent the information contained in multidimensional data in reduced dimensions which are more compatible with visualization.</td></tr>
<tr><td>Finding patterns in high-dimensional datasets.</td><td>Examining the relationships between principal components and original features can help uncover patterns in the data that are harder to identify in our full dataset.</td></tr>
<tr><td>Stock price prediction in <a href="https://www.aptech.com/industry-solutions/finance/" target="_blank" rel="noopener">finance.</a></td><td>Many models of stock price prediction rely on estimating covariance matrices. However, this can be difficult with high-dimensional data. PCA can be used for data reduction to help remedy this issue.</td></tr>
<tr><td>Dataset reduction in <a href="https://www.aptech.com/industry-solutions/epidemiology/" target="_blank" rel="noopener">healthcare models.</a></td><td>Healthcare models use high-dimensional datasets because there are many factors that influence healthcare outcomes. PCA provides a method to reduce the dimensionality while still capturing the relevant variance.</td></tr>
</tbody>
</table>
<h2 id="empirical-example">Empirical Example</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/03/us-treasury-yields.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/03/us-treasury-yields.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583500" /></a></p>
<p>Let's take a look at principal components analysis in action! We'll start by extending the <a href="https://docs.aptech.com/gauss/textbook-examples/brooks-introductory-econometrics-for-finance/docs/principal-components-tbills.html" target="_blank" rel="noopener">PCA application to US Treasury bills and bonds from Introductory Econometrics For Finance by Chris Brooks</a>. </p>
<p>In our example we will:</p>
<ul>
<li>Update the dataset to use current data. </li>
<li>Use the <a href="https://docs.aptech.com/gauss/pcafit.html" target="_blank" rel="noopener">pcaFit</a> and <a href="https://docs.aptech.com/gauss/pcatransform.html" target="_blank" rel="noopener">pcaTransform</a> functions available in the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library (GML)</a>.</li>
</ul>
<h3 id="loading-fred-data">Loading FRED Data</h3>
<p>Our initial dataset includes 6 variables capturing short-term and long-term yields on U.S. bonds and bills. </p>
<table>
 <thead>
<tr><th>Variable</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>GS3M</td><td>Market yield on the 3-month US Treasury bill.</td></tr>
<tr><td>GS6M</td><td>Market yield on the 6-month US Treasury bill.</td></tr>
<tr><td>GS1</td><td>Market yield on the 1-year US Treasury bond.</td></tr>
<tr><td>GS3</td><td>Market yield on the 3-year US Treasury bond.</td></tr>
<tr><td>GS5</td><td>Market yield on the 5-year US Treasury bond.</td></tr>
<tr><td>GS10</td><td>Market yield on the 10-year US Treasury bond.</td></tr>
</tbody>
</table>
<p>This data can be directly imported into GAUSS from the <a href="https://fred.stlouisfed.org/" target="_blank" rel="noopener">FRED</a> database.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Import U.S. bond and bill data
** directly from FRED
*/
// Set observation_start parameter
// to use all data on or after 1990-01-01 and before or on 2023-03-01
params_cpi = fred_set("observation_start", "1990-01-01", "observation_end", "2023-03-01");

// Load data from FRED
data = fred_load("GS3M + GS6M + GS1 + GS3 + GS5 + GS10", params_cpi);

// Reorder data to match the organization in original example
data = order(data, "date"$|"GS3M"$|"GS6M"$|"GS1"$|"GS3"$|"GS5"$|"GS10");

// Preview the first 5 rows
head(data);</code></pre>
<p>The data preview printed to the <strong>Command Window</strong> helps verify that our data has loaded correctly:</p>
<pre>            date        GS3M        GS6M         GS1         GS3         GS5        GS10
      1990-01-01        7.90        7.96        7.92        8.13        8.12        8.21
      1990-02-01        8.00        8.12        8.11        8.39        8.42        8.47
      1990-03-01        8.17        8.28        8.35        8.63        8.60        8.59
      1990-04-01        8.04        8.27        8.40        8.78        8.77        8.79
      1990-05-01        8.01        8.19        8.32        8.69        8.74        8.76 </pre>
<div class="alert alert-info" role="alert">For more information on importing FRED data to GAUSS, see our <a href="https://www.aptech.com/blog/importing-fred-data-to-gauss/" target="_blank" rel="noopener">earlier blog</a>. </div>
<h3 id="normalizing-yields">Normalizing Yields</h3>
<p>Following Brooks' example, we will normalize the yields to have zero mean and a standard deviation of one using the <a href="https://docs.aptech.com/gauss/rescale.html" target="_blank" rel="noopener"><code>rescale</code></a> procedure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Normalizing the yield
*/
// Create a dataframe that contains
// the yields, but not the 'Date' variable
yields = delcols(data, "date");

// Standardize the yields using rescale
{ yields_norm, location, scale_factor } = rescale(yields, "standardize");

head(yields_norm);</code></pre>
<p>This prints a preview of our normalized yields:</p>
<pre>            GS3M             GS6M              GS1              GS3              GS5             GS10
       2.3153725        2.2469720        2.1773318        2.0802078        2.0025703        1.9626705
       2.3591880        2.3159905        2.2593350        2.1936833        2.1395985        2.0916968
       2.4336745        2.3850090        2.3629181        2.2984298        2.2218155        2.1512474
       2.3767142        2.3806953        2.3844979        2.3638964        2.2994648        2.2504985
       2.3635696        2.3461861        2.3499702        2.3246164        2.2857620        2.2356108 </pre>
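<p>As a check on what <code>rescale</code> is doing, the &quot;standardize&quot; option corresponds to subtracting each column's mean and dividing by its standard deviation. A minimal sketch of the equivalent manual calculation:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Manual standardization, equivalent to
// rescale(yields, "standardize")
yields_manual = (yields - meanc(yields)') ./ stdc(yields)';</code></pre>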
<h3 id="fitting-the-pca-model">Fitting the PCA Model</h3>
<p>Next, we will use the <code>pcaFit</code> procedure from GML to fit our principal components analysis model. </p>
<p>The <code>pcaFit</code> procedure requires two inputs: a data matrix and the number of components to compute. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">struct pcaModel mdl;
mdl = pcaFit(x, n_components);</code></pre>
<hr>
<dl>
<dt>X</dt>
<dd>$N \times P$ matrix, feature data to be reduced.</dd>
<dt>n_components</dt>
<dd>Scalar, the number of components to compute.
<hr></dd>
</dl>
<p>The <code>pcaFit</code> procedure stores all output in a <code>pcaModel</code> structure. The most relevant members of the <code>pcaModel</code> structure include:</p>
<hr>
<dl>
<dt>mdl.singular_values</dt>
<dd>$n_{components} \times 1$ vector, the largest singular values of X. Equal to the square root of the eigenvalues.</dd>
<dt>mdl.components</dt>
<dd>$P \times n_{components}$ matrix, the principal component vectors which represent the directions of greatest variance. Also known as the factor loadings.</dd>
<dt>mdl.explained_variance_ratio</dt>
<dd>$n_{components} \times 1$ vector, the variance explained by each of the returned component vectors.
<hr></dd>
</dl>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Perform PCA on normalized yields
*/
// Specify number of components
n_components = 6;

// `pcaModel` structure for holding
//  output from model
struct pcaModel mdl;
mdl = pcaFit(yields_norm, n_components);</code></pre>
<h2 id="dissecting-results">Dissecting Results</h2>
<p>After running the <code>pcaFit</code> procedure, results are printed to the <strong>Command Window</strong>. These results include:</p>
<ul>
<li>A general summary of the model.</li>
<li>The proportion of variance explained by each component.</li>
<li>The loadings for all variables in each component.</li>
</ul>
<h3 id="general-summary">General Summary</h3>
<p>The general summary provides basic information about the model setup, including the number of variables in the original data and the number of components found. </p>
<pre>==================================================
Model:                                         PCA
Number observations:                           399
Number variables:                                6
Number components:                               6
==================================================</pre>
<h3 id="proportion-of-variance">Proportion of Variance</h3>
<p>The proportion of variance table tells us how much of the total variance in the data is described by each principal component. </p>
<pre>Component                Proportion     Cumulative
                        Of Variance     Proportion
PC1                           0.960          0.960
PC2                           0.038          0.997
PC3                           0.002          1.000
PC4                           0.000          1.000
PC5                           0.000          1.000
PC6                           0.000          1.000 </pre>
<p>For the Treasury bill and bond yields, the first component captures 96.0% of the total variance, while the first three components explain nearly all of it. If our goal were data reduction for use in a later model, this is quite promising: we could capture 96% of the variance of all 6 of our original variables using just the first principal component.  </p>
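<p>When all components are computed, these proportions can also be recovered directly from the singular values, since the squared singular values are proportional to the variance captured by each component. A sketch using the <code>pcaModel</code> structure members described above:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Proportion of variance from the singular values
ev = mdl.singular_values.^2;
prop = ev ./ sumc(ev);

// Cumulative proportion of variance
cum = cumsumc(prop);</code></pre>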
<h3 id="the-factor-loadings">The Factor Loadings</h3>
<pre>===========================================================================
Principal
components            PC1       PC2       PC3       PC4       PC5       PC6
===========================================================================
GS3M              -0.4079    0.4111    0.4863   -0.5416    0.3029    0.2076
GS6M              -0.4094    0.3883    0.1535    0.2221   -0.5448   -0.5585
GS1               -0.4122    0.2970   -0.2404    0.6120    0.1557    0.5342
GS3               -0.4154   -0.0855   -0.5911   -0.1926    0.4744   -0.4567
GS5               -0.4102   -0.3607   -0.2806   -0.3932   -0.5725    0.3750
GS10              -0.3939   -0.6742    0.5040    0.3020    0.1856   -0.1024
</pre>
<p>The factor loadings indicate how much each of the variables contributes to the component. As noted in the Brooks example, they also offer some insight into the yield curve:</p>
<table>

<tbody>
<tr><td style="width:20%">PC1</td><td><ul><li>All maturities have the same sign and a similar magnitude.</li><li>Captures changes in the level, or parallel shifts, of the yield curve.</li></ul></td></tr>
<tr><td style="width:20%">PC2</td><td><ul><li>Short-term and long-term maturities have opposing signs.</li><li>Short-term and long-term maturities move in opposite directions.</li><li>Captures changes in the slope, or the steepening/flattening, of the yield curve.</li></ul></td></tr>
<tr><td style="width:20%">PC3</td><td><ul><li>Shortest and longest-term maturities have the same sign, while the middle maturities have the opposite sign.</li><li>Reflects changes in the curvature of the curve.</li></ul></td></tr>
</tbody>
</table>
<h2 id="transforming-original-data">Transforming Original Data</h2>
<p>After fitting the PCA model, we can use the results to transform our original data into its principal components using the <code>pcaTransform</code> procedure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Transform original data
x_trans = pcaTransform(yields_norm, mdl);</code></pre>
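<p>Under the hood, this transformation projects the (centered) data onto the factor loadings. Because our yields are already standardized, the result should match a direct matrix product; a sketch, assuming <code>mdl.components</code> holds the loadings as described above:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Project the standardized yields onto the loadings
x_check = yields_norm * mdl.components;</code></pre>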
<p>Since the first three components capture most of the variation in our data, let's look at them in a plot:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583515" /></a></p>
<p>If you're familiar with U.S. interest rates, this plot likely seems to contradict what we observe in the real world. As we said earlier, the first principal component represents the overall level of interest rates. However, our plot of the first principal component shows an overall upward trend through 2022, with a sharp downtick starting post-2022, exactly the opposite of the overall trend in U.S. interest rates.</p>
<p>This highlights an important feature of PCA: <b>the sign on the factor loadings is arbitrary</b>. </p>
<p>The signs can all be flipped without any change to our analysis. For example, if we multiply all our factor loadings by -1, our principal components look like:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca-flipped.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca-flipped.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583516" /></a></p>
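<p>Because each score is just the linear combination $Z = X\phi$, negating every loading negates the scores; a one-line sketch using the <code>x_trans</code> computed above:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Flipping the sign of all loadings flips
// the sign of the component scores
x_trans_flipped = -x_trans;</code></pre>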
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog, we've seen that PCA is a powerful data analysis tool with uses beyond data reduction. We've also explored how to use the GAUSS Machine Learning library to fit a PCA model and transform data.</p>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a><br />

    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script></li>
</ol>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Predicting Recessions with Machine Learning Techniques</title>
		<link>https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/</link>
					<comments>https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 21 Feb 2023 20:03:05 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583378</guid>

					<description><![CDATA[Forecasts have become a valuable commodity in today's data-driven world. Unfortunately, not all forecasting models are of equal caliber, and incorrect predictions can lead to costly decisions. 

Today we will compare the performance of several prediction models used to predict recessions. In particular, we’ll look at how a traditional baseline econometric model compares to machine learning models. 

Our models will include: 
<ul>
<li> A baseline probit model.</li>
<li>K-nearest neighbors.</li>
<li>Decision forests.</li>
<li>Ridge classification.</li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Forecasts have become a valuable commodity in today's data-driven world. Unfortunately, not all forecasting models are of equal caliber, and incorrect predictions can lead to costly decisions. </p>
<p>Today we will compare the performance of several prediction models used to predict recessions. In particular, we’ll look at how a traditional baseline econometric model compares to machine learning models. </p>
<p>Our models will include: </p>
<ul>
<li>A baseline <a href="https://www.aptech.com/examples/cmlmt/ordered-probit-estimation-with-constrained-maximum-likelihood/" target="_blank" rel="noopener">probit model</a>.</li>
<li><a href="https://www.aptech.com/resources/tutorials/gml/k-nearest-neighbor-classification/" target="_blank" rel="noopener">K-nearest neighbors</a>.</li>
<li><a href="https://www.aptech.com/resources/tutorials/gml/random-forests-salary/" target="_blank" rel="noopener">Decision forests</a>.</li>
<li><a href="https://docs.aptech.com/gauss/ridgefit.html#ridgeFit" target="_blank" rel="noopener">Ridge classification</a>.</li>
</ul>
<div class="alert alert-info" role="alert">The aim of today’s blog isn’t to provide a definitive answer on what model is best, but rather to provide background and context for different models. We will look more closely at model tuning and optimization in a later blog.</div>
<h2 id="background">Background</h2>
<p>Before diving into estimating our models, let's look more closely at the data and models we will be using. </p>
<h3 id="recession-dating">Recession dating</h3>
<p>Today we will focus on predicting recessions, using the <a href="https://www.nber.org/research/business-cycle-dating" target="_blank" rel="noopener">NBER recession indicator</a>. The NBER indicator:</p>
<ul>
<li>Uses a dummy variable to represent periods of expansion and recessions. </li>
<li>Takes a value of 1 during a recession and 0 during an expansion. </li>
<li>Can be directly imported from FRED using the series ID <code>"USREC"</code>. </li>
</ul>
<p>Because the NBER recession data is binary data, our forecasting exercise becomes one of classification. In other words, we want to identify whether an observation is more likely to fall into the non-recession or recession category. </p>
<p>For this reason, we will need to use models that are suitable for <a href="https://www.aptech.com/blog/introduction-to-categorical-variables/" target="_blank" rel="noopener">discrete data and classification</a>. </p>
<h3 id="models">Models</h3>
<h4 id="probit">Probit</h4>
<p>The probit model is a <a href="https://www.aptech.com/blog/update-discrete-choice-application-module/" target="_blank" rel="noopener">discrete choice</a> model which:</p>
<ul>
<li>Is commonly used in classical econometrics to model binary or ordered data.</li>
<li>Estimates the probability that an outcome falls into a specific category. </li>
<li>Has a simple log-likelihood function, which can be used to estimate the model parameters with <a href="https://www.aptech.com/blog/beginners-guide-to-maximum-likelihood-estimation-in-gauss/" target="_blank" rel="noopener">maximum likelihood</a>.</li>
</ul>
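<p>For a binary outcome, the probit log-likelihood mentioned above can be written in a few lines of GAUSS (<code>probitLL</code> is a hypothetical helper shown for illustration, not part of CMLMT):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Probit log-likelihood for coefficients b,
// features x (N x K), and binary outcome y (N x 1)
proc (1) = probitLL(b, x, y);
    local p;
    p = cdfn(x * b);
    retp(sumc(y .* ln(p) + (1 - y) .* ln(1 - p)));
endp;</code></pre>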
<h4 id="k-nearest-neighbor">K-Nearest Neighbor</h4>
<p>The k-nearest neighbor (KNN) method is one of the simplest non-parametric techniques for classification and regression.</p>
<p>KNN relies on the intuition that if an observation is &quot;near&quot; another, it is likely to fall within the same category. </p>
<p>The KNN model:</p>
<ol>
<li>Locates the $k$ nearest neighbors using the observed features and a measure of distance, such as Euclidean distance.</li>
<li>Finds the most common &quot;class&quot; among the $k$ nearest neighbors.</li>
<li>Assigns the most common &quot;class&quot; as the predicted category for the unknown outcome.</li>
</ol>
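<p>The steps above can be sketched in base GAUSS for a single query observation (<code>x_train</code>, <code>y_train</code>, and <code>x_query</code> are hypothetical names; GML's KNN procedures handle this, and much more, for us):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Naive k-nearest neighbor vote for one query row
k = 5;

// Euclidean distance from the query to each training row
d = sqrt(sumr((x_train - x_query).^2));

// Indices sorted from nearest to farthest
idx = sortind(d);

// Majority class among the k nearest neighbors
pred = modec(y_train[idx[1:k]]);</code></pre>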
<h4 id="decision-trees">Decision Trees</h4>
<p>Decision trees are a machine learning method that can be used to predict discrete or continuous data. </p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/02/decforest.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/02/decforest.jpg" alt="" width="600" height="450" class="aligncenter size-full wp-image-11583442" /></a></p>
<p>Tree-based methods rely on a fairly simple process:</p>
<ol>
<li>Split the data into subsets, using the characteristics of the data. For example, if “Married” is one of our observed characteristics, we can split the sample into &quot;Yes&quot; and &quot;No&quot;. We can ask multiple &quot;questions&quot; about our data to create branches that break our data into smaller and smaller subsets. </li>
<li>The most frequently occurring outcome within each subset is then used as the predicted class for all observations that fall inside that subset. </li>
</ol>
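<p>As a toy illustration of the splitting step, a single split on one feature can be chosen by scanning candidate thresholds (<code>xc</code> is a hypothetical feature column and <code>y</code> the binary outcome; a real decision tree repeats this search recursively and assigns the majority class on each side of the split):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Scan thresholds on one feature; keep the split with
// the lowest training misclassification rate
thr = unique(xc);
best_err = 1;
best_thr = thr[1];

for i(1, rows(thr), 1);
    // Classify as 1 when the feature exceeds the threshold
    pred = (xc .>= thr[i]);
    err = meanc(pred ./= y);
    if err &lt; best_err;
        best_err = err;
        best_thr = thr[i];
    endif;
endfor;</code></pre>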
<h4 id="ridge-regression">Ridge Regression</h4>
<p>Ridge regression is part of a family of linear regression models that aim to improve on the standard least squares fitting model. These methods use a modified least squares approach to shrink coefficient estimates towards zero, which in turn, reduces the estimates’ variances. </p>
<p>Like <a href="https://www.aptech.com/resources/tutorials/econometrics/linear-regression/" target="_blank" rel="noopener">OLS</a>, these methods rely on minimizing the residual sum of squares (RSS) to estimate coefficients. However, they add a penalty, based on cumulative coefficient size, to the RSS objective function.</p>
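<p>For ridge regression, the penalized objective has a closed-form solution, $\hat{\beta} = (X'X + \lambda I)^{-1}X'y$, which can be sketched directly (<code>lambda</code> below is a hypothetical penalty value; in practice it is chosen by tuning):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Ridge estimator: shrinks coefficients toward zero
lambda = 0.1;
k = cols(x);
b_ridge = invpd(x'x + lambda * eye(k)) * (x'y);</code></pre>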
<h2 id="model-setup">Model Setup</h2>
<p>Today we will include a number of variables in our model. These are chosen based on commonly used predictors in the recession modeling literature:  </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="recession-model-predictors"><span style="color:#FFFFFF">Recession Model Predictors</span></h3>
      </th>
   </tr>
<tr><th>Variable</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>INDPRO</td><td>Monthly growth rates of industrial production. Included in the level and 1-month lag.</td></tr>
<tr><td>PAYEMS</td><td>Monthly growth rates of nonfarm payrolls. Included in the level and 1-month lag.</td></tr>
<tr><td>RPI</td><td>Monthly growth rates of real personal income excluding transfer payments. Included in the level and 1-month lag.</td></tr>
<tr><td>UNRATE</td><td>Annual growth rate of headline unemployment. Included in the level and 1-month lag.</td></tr>
<tr><td>YLD</td><td>The yield curve slope, computed as the difference between the yield on the 10-year treasury bond and the 3-month treasury bill. Included in the level, 6-month lag, and 12-month lag.</td></tr>
<tr><td>CORP</td><td>The credit spread between Moody's BAA and AAA corporate bond yields. Included in the level, 6-month lag, and 12-month lag.</td></tr>
</tbody>
</table>
<p>Our complete dataset ranges from January, 1963 to December, 2022. </p>
<table>
<tbody>
<tr><td>Training period</td><td>January, 1963 to December, 1998</td></tr>
<tr><td>Testing period</td><td>January, 1999 to December, 2022</td></tr>
</tbody>
</table>
<p>The complete dataset, including lags, is available <a href="https://github.com/aptech/gauss_blog/blob/a202120902f4acdb80bbd50589480e6871359257/machine-learning/recession-predicting/data/final_data.gdat" target="_blank" rel="noopener">here</a>.</p>
<h3 id="model-comparison">Model Comparison</h3>
<p>There are many components to evaluating how well a classification model performs. To compare models, we will use a set of binary class metrics including:</p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="model-comparison-measures"><span style="color:#FFFFFF">Model Comparison Measures</span></h3>
      </th>
   </tr>
<tr><th>Tool</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>Confusion matrix</td><td>Summarizes the performance of a classification algorithm. Compares the number of predicted outcomes to actual outcomes in tabular form.</td></tr>
<tr><td>Accuracy</td><td>Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.</td></tr>
<tr><td>Precision</td><td>How good a model is at correctly identifying positive outcomes. Equal to the number of true positives divided by the number of false positives plus true positives. </td></tr>
<tr><td>Recall</td><td>How good a model is at correctly predicting all the positive outcomes. Equal to the number of true positives divided by the number of false negatives plus true positives.</td></tr>
<tr><td>F-score</td><td>The harmonic mean of the precision and recall. A score of 1 indicates perfect precision and recall.</td></tr>
<tr><td>Specificity</td><td>Ability to predict a true negative. Equal to the number of true negatives divided by the number of true negatives plus false positives.</td></tr>
<tr><td>Area under the ROC</td><td>Reflects the probability that a model ranks a random positive more highly than a random negative.</td></tr>
</tbody>
</table>
<p>It's important to view these metrics in the context of the data being modeled. For example, our data is not very balanced across classes. There are 263 non-recession observations and 28 recession observations. This implies that:</p>
<ul>
<li>Model accuracy is not a very informative metric. If we predict that all observations are non-recession, our accuracy is 90%.</li>
<li>F-score is a better metric for us to consider. It gives a more balanced picture of how our model performs across both the recession and non-recession class. </li>
</ul>
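<p>Given the four cells of a binary confusion matrix, the metrics above reduce to simple ratios. A sketch with hypothetical counts (not taken from our models):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Hypothetical confusion matrix counts
tp = 20;    // true positives
fp = 8;     // false positives
fn = 8;     // false negatives
tn = 255;   // true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn);
precision = tp / (tp + fp);
recall = tp / (tp + fn);
f_score = 2 * precision * recall / (precision + recall);
specificity = tn / (tn + fp);</code></pre>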
<h2 id="estimation">Estimation</h2>
<p>We will use two GAUSS libraries to estimate our models:</p>
<ul>
<li><a href="https://store.aptech.com/gauss-applications-category/constrained-maximum-likelihood-mt.html" target="_blank" rel="noopener">Constrained Maximum Likelihood MT (CMLMT)</a> to estimate the probit model.</li>
<li><a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning (GML)</a> to estimate our machine learning models.</li>
</ul>
<h3 id="loading-our-data-and-libraries">Loading our data and libraries</h3>
<p>To start, we will <a href="https://docs.aptech.com/gauss/data-management/programmatic-import.html" target="_blank" rel="noopener">load our data</a> directly from its URL:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load libraries
library gml, cmlmt;

/*
** Load data from url
*/
url = "https://github.com/aptech/gauss_blog/blob/master/machine-learning/recession-predicting/data/final_data.gdat?raw=true";
reg_data = loadd(url);

// Compute summary statistics
dstatmt(reg_data);</code></pre>
<p>This loads our regression dataset and prints a table of <a href="https://www.aptech.com/blog/getting-to-know-your-data-with-gauss-22/" target="_blank" rel="noopener">summary statistics</a> to the <strong>Command Window</strong>:</p>
<pre>----------------------------------------------------------------------------------------
Variable         Mean     Std Dev     Variance     Minimum     Maximum    Valid  Missing
----------------------------------------------------------------------------------------

date            -----       -----        -----  1963-01-01  2022-12-01      720     0
USREC          0.1181      0.3229       0.1043           0           1      720     0
INDPRO         0.1976      0.9403       0.8842       -13.2       6.275      720     0
PAYEMS         0.1428      0.5746       0.3302      -13.59       3.431      720     0
RPI            0.2627       1.253        1.569      -13.55          20      720     0
UNRATE       -0.03208       1.393        1.941        -8.6        11.1      720     0
corp           -1.021      0.4389       0.1926       -3.38       -0.32      720     0
yld             1.496       1.221        1.492       -2.65        4.42      720     0
yld_l6          1.504       1.215        1.475       -2.65        4.42      720     0
yld_l12           1.5       1.215        1.475       -2.65        4.42      720     0
corp_l6        -1.017      0.4397       0.1933       -3.38       -0.32      720     0
corp_l12       -1.015      0.4403       0.1939       -3.38       -0.32      720     0
ip_l           0.1986      0.9397        0.883       -13.2       6.275      720     0
nfp_l          0.1425      0.5747       0.3302      -13.59       3.431      720     0
rpi_l          0.2632       1.253        1.569      -13.55          20      720     0
un_l         -0.03222       1.393        1.942        -8.6        11.1      720     0 </pre>
<div class="alert alert-info" role="alert">The file <i>final_data.gdat</i> uses the GAUSS data file format introduced in <a href="https://www.aptech.com/blog/gauss23/" target="_blank" rel="noopener">GAUSS 23</a>. The dataset is compiled from raw data pulled from <a href="https://www.aptech.com/blog/importing-fred-data-to-gauss/" target="_blank" rel="noopener">FRED</a>. You can view the data import, transformation, and merging steps <a href="https://github.com/aptech/gauss_blog/blob/a202120902f4acdb80bbd50589480e6871359257/machine-learning/recession-predicting/code/data-summary.gss" target="_blank" rel="noopener">here</a>.</div>
<h3 id="splitting-data">Splitting Data</h3>
<p>Next, we will use the <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a> function to split the data into training and test sets.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Split data
*/

// Dependent data 
y = reg_data[., "USREC"];

// Load independent variables
x = reg_data[., 3:cols(reg_data)];

// Split data into (60%) training
// and (40%) test sets
shuffle = "False";
{ y_train, y_test, x_train, x_test } = 
     trainTestSplit(y, x, 0.6, shuffle);</code></pre>
<div class="alert alert-info" role="alert">Because our data is <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/" target="_blank" rel="noopener">time series data</a>, it is important to keep the sequential ordering. To do this, we turn &quot;shuffling&quot; off when splitting the data.</div>
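<p>For readers without GAUSS, the same sequential split can be sketched in a few lines of Python (illustrative only; GML's <code>trainTestSplit</code> handles this internally):</p>

```python
import numpy as np

def sequential_split(y, X, train_pct):
    """Split data sequentially (no shuffling) to preserve time ordering."""
    n_train = int(round(train_pct * len(y)))
    return y[:n_train], y[n_train:], X[:n_train], X[n_train:]

# Example: 60/40 split of 10 ordered observations
y = np.arange(10)
X = np.arange(20).reshape(10, 2)
y_train, y_test, X_train, X_test = sequential_split(y, X, 0.6)
```

<p>Because the first 60% of rows become the training set, every test observation occurs after every training observation, which is exactly what we want for time series data.</p>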
<h2 id="probit-model-results">Probit Model Results</h2>
<p>To estimate the probit model we will rely on the probit likelihood function:</p>
<p>$$LL(\beta|y;X) = \sum^N_{i=1} \big[y_i ln(F(x_i \beta)) + (1 - y_i)ln(1 - F(x_i \beta))\big]$$</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Likelihood Function
*/
proc (1) = probit(beta_, y, X, ind);
    local mu;

    // Declare 'mm' to be a modelResults
    // structure to hold the function value
    struct modelResults mm;

    // Compute mu
    mu = X * beta_;

    // Assign the log-likelihood value to the
    // 'function' member of the modelResults structure
    mm.function = y.*lncdfn(mu) + (1-y).*lncdfnc(mu);

    // Return the model results structure
    retp(mm);
endp;</code></pre>
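<p>For readers without GAUSS, the log-likelihood above can be sketched in plain Python (an illustration of the formula, not the CMLMT implementation):</p>

```python
from math import erf, log, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def probit_loglik(beta, y, X):
    """Sum of y*ln(F(x'b)) + (1-y)*ln(1-F(x'b)) over observations."""
    ll = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * x for b, x in zip(beta, xi))
        ll += yi * log(norm_cdf(mu)) + (1 - yi) * log(1.0 - norm_cdf(mu))
    return ll
```

<p>A handy sanity check: at <code>beta = 0</code> every fitted probability is 0.5, so the log-likelihood collapses to n&middot;ln(0.5).</p>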
<p>We can quickly estimate this model using the GAUSS <code>cmlmt</code> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Estimate model
*/
// Assign starting values for estimation
beta_strt = 0.5*ones(cols(x), 1);

// Declare 'out' to be a cmlmtResults structure
// to hold the results of the estimation
struct cmlmtResults cout;

// Perform estimation and print results
cout = cmlmt(&amp;probit, beta_strt, y_train, x_train);
call cmlmtPrt(cout);</code></pre>
<p>The fitted probit model can be used to predict the probability that an observation lies in a recessionary period, given the observed data. Using a 50% cutoff, we sort the predictions into recession and non-recession periods:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions
*/
// Extract parameters
beta_hat = pvUnpack(cout.par, "x");

// Predicted probability of recession 
y_prob = cdfn(x_test * beta_hat);

// Classify data as recession or non-recession
y_hat = where(y_prob .&gt;= 0.5, 1, 0);</code></pre>
<p>Plotted against the observed recession dates, the estimated probability of recession looks fairly good:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/02/probit-recession-plot-1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/02/probit-recession-plot-1.jpg" alt="Demonstrate the use of probit model to estimate recession periods. " width="800" height="600" class="aligncenter size-full wp-image-11583473" /></a></p>
<p>However, we can get a more robust evaluation of model performance using the <a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener"><code>binaryClassMetrics</code></a> procedure from the GML library:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">call binaryClassMetrics(y_test, y_hat);</code></pre>
<p>The first portion of this report is the Confusion Matrix:</p>
<pre>Probit model with 50% cutoff.
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       22       6
            0 (-)       17     243 </pre>
<p>The confusion matrix provides a summary of how many predictions our model got &quot;right&quot; and how many it got &quot;wrong&quot;, based on which category they fall in:
<a href="https://www.aptech.com/wp-content/uploads/2023/04/confusionmatrix.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/confusionmatrix.jpg" alt="" width="506" height="506" class="aligncenter size-full wp-image-11583769" /></a></p>
<p>The confusion matrix for our estimated probit model shows:</p>
<ul>
<li>22 recession periods are correctly predicted, while 6 recessions are missed (false negatives).</li>
<li>243 non-recession periods are correctly predicted, while 17 non-recession periods are incorrectly flagged as recessions (false positives).</li>
</ul>
<p>The remaining statistics help quantify these outcomes more clearly:</p>
<pre>             Accuracy           0.9201
            Precision           0.5641
               Recall           0.7857
              F-score           0.6567
          Specificity           0.9346
    Balanced Accuracy           0.8692 </pre>
<p>Overall, the probit model:</p>
<ul>
<li>Has an F-score of 66%.</li>
<li>Is better at predicting negative outcomes (93% specificity) than positive outcomes (56% precision).</li>
</ul>
<h2 id="knn-model-results">KNN Model Results</h2>
<p>We start our machine learning models with KNN, fitting it to the same training data using the <a href="https://docs.aptech.com/gauss/knnfit.html" target="_blank" rel="noopener"><code>knnFit</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Train the model
*/

// Specify the number of neighbors
k = 5;

// The knnModl structure 
// holds the trained model
struct knnModel mdl;

// Train model using KNN
mdl = knnFit(y_train, X_train, k);</code></pre>
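<p>Conceptually, KNN classifies a point by a majority vote among its <em>k</em> nearest training observations. A minimal Python illustration of that idea (not GML's implementation, which uses optimized search):</p>

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Classify each row of X_new by majority vote of its k nearest neighbors."""
    preds = []
    for x in np.atleast_2d(X_new):
        dist = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each training row
        nearest = np.argsort(dist)[:k]              # indices of the k closest points
        preds.append(int(y_train[nearest].sum() > k / 2))  # majority vote for 0/1 labels
    return np.array(preds)
```

<p>Points near the low cluster of training data get the low cluster's label, and likewise for the high cluster.</p>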
<p>After fitting the model, the <a href="https://docs.aptech.com/gauss/knnclassify.html" target="_blank" rel="noopener"><code>knnClassify</code></a> procedure can be used to predict outcomes and metrics for the test data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions on the test set
*/
y_hat = knnClassify(mdl, X_test);

// Print out model quality 
// evaluation statistics
print "KNN Model";
call binaryClassMetrics(y_test, y_hat);</code></pre>
<pre>KNN Model
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       20       8
            0 (-)        3     257 </pre>
<p>The confusion matrix for our estimated KNN model shows:</p>
<ul>
<li>20 recession periods are correctly predicted, while 8 recessions are missed (false negatives).</li>
<li>257 non-recession periods are correctly predicted, with only 3 false positives.</li>
</ul>
<pre>         Accuracy           0.9618
        Precision           0.8696
           Recall           0.7143
          F-score           0.7843
      Specificity           0.9885
Balanced Accuracy           0.8514  </pre>
<p>The KNN model:</p>
<ul>
<li>Has an F-score of 78%.</li>
<li>Is better at predicting negative outcomes than positive outcomes.</li>
</ul>
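<p>All six scores follow mechanically from the four confusion-matrix counts. A short Python sketch of the standard definitions (for illustration; GML computes these internally), checked against the KNN report above:</p>

```python
def class_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)            # of predicted positives, share correct
    recall = tp / (tp + fn)               # of true positives, share found
    f_score = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)          # of true negatives, share found
    balanced = (recall + specificity) / 2
    return accuracy, precision, recall, f_score, specificity, balanced

# KNN confusion matrix: 20 TP, 8 FN, 3 FP, 257 TN
metrics = class_metrics(20, 8, 3, 257)
```

<p>Plugging in the KNN counts reproduces the six numbers printed by <code>binaryClassMetrics</code> above.</p>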
<p>Compared to our baseline probit model, the KNN model:</p>
<ul>
<li>Shows improved performance when balancing both classes, with a higher F-score (78% vs. 66%).</li>
<li>Is better at predicting negative outcomes (99% vs. 93% specificity) and makes more precise positive predictions (87% vs. 56% precision), though it misses more recessions (71% vs. 79% recall).</li>
</ul>
<h2 id="decision-forest-classification">Decision Forest Classification</h2>
<p>Next, we fit our decision forest classification model using the <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener"><code>decForestCFit</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Train the model
*/

// The dfModel structure 
// holds the trained model
struct dfModel dfm;

// Fit training data 
// using decision forest classification
dfm = decForestCFit(y_train, x_train);</code></pre>
<p>After fitting the model, the <a href="https://docs.aptech.com/gauss/decforestpredict.html" target="_blank" rel="noopener"><code>decForestPredict</code></a> procedure can be used to predict outcomes and metrics for the test data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions on the test set
*/
y_hat = decForestPredict(dfm, x_test);

// Print out model quality 
// evaluation statistics
print "Decision Forest";
call binaryClassMetrics(y_test, y_hat);</code></pre>
<pre>Decision Forest
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       25       3
            0 (-)        1     259 </pre>
<p>The confusion matrix for our estimated decision forest model shows:</p>
<ul>
<li>25 recession periods are correctly predicted, with only 3 missed (false negatives).</li>
<li>259 non-recession periods are correctly predicted, with only 1 false positive.</li>
</ul>
<pre>         Accuracy           0.9861
        Precision           0.9615
           Recall           0.8929
          F-score           0.9259
      Specificity           0.9962
Balanced Accuracy           0.9445  </pre>
<p>The decision forest model:</p>
<ul>
<li>Has an F-score of 93%.</li>
<li>Is better at predicting negative outcomes (99% specificity) than positive outcomes (96% precision).</li>
</ul>
<p>Compared to our baseline probit model, the decision forest model:</p>
<ul>
<li>Is much better at balancing performance across both classes (93% vs. 66% F-score).</li>
<li>Is better at predicting both negative outcomes and positive outcomes.</li>
</ul>
<h2 id="ridge-classification">Ridge Classification</h2>
<p>Finally, we estimate the ridge classification model using the <code>ridgeCFit</code> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Train the model
*/

// L2 regularization penalty
lambda = 0.5;

// Declare 'mdl' to be an instance of a
// ridgeModel structure to hold the estimation results
struct ridgeModel mdl;

// Train the model
// using the ridge classification
mdl = ridgeCFit(y_train, X_train, lambda);</code></pre>
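<p>One common way to implement ridge classification (the approach used by, e.g., scikit-learn's <code>RidgeClassifier</code>; GML's internals may differ) is to regress labels recoded from {0, 1} to {-1, +1} with an L2 penalty, then threshold the linear score at zero. A minimal Python sketch of that idea:</p>

```python
import numpy as np

def ridge_classifier_fit(X, y01, lam):
    """Closed-form ridge solution on labels mapped from {0,1} to {-1,+1}."""
    y = 2.0 * np.asarray(y01) - 1.0
    XtX = X.T @ X + lam * np.eye(X.shape[1])   # L2 penalty lam added to the diagonal
    return np.linalg.solve(XtX, X.T @ y)

def ridge_classifier_predict(beta, X):
    """Classify as 1 when the linear score is non-negative."""
    return (X @ beta >= 0).astype(int)
```

<p>The penalty <code>lam</code> shrinks the coefficients toward zero, trading a little bias for lower variance, which is the same role it plays in ridge regression.</p>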
<p>The <code>ridgeCPredict</code> procedure can be used to predict outcomes and metrics for the test data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions on the test set
*/

// Predict classes for the test data
predictions = ridgeCPredict(mdl, x_test);

// Print out model quality 
// evaluation statistics
print "Ridge Classification";
call binaryClassMetrics(y_test, predictions);</code></pre>
<pre>Ridge Classification
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       22       6
            0 (-)        4     256 </pre>
<p>The confusion matrix for our estimated ridge classification model shows:</p>
<ul>
<li>22 recession periods are correctly predicted, while 6 recessions are missed (false negatives).</li>
<li>256 non-recession periods are correctly predicted, with only 4 false positives.</li>
</ul>
<pre>         Accuracy           0.9653
        Precision           0.8462
           Recall           0.7857
          F-score           0.8148
      Specificity           0.9846
Balanced Accuracy           0.8852 </pre>
<p>The ridge classification model:</p>
<ul>
<li>Has an F-score of 81%.</li>
<li>Is better at predicting negative outcomes (98% specificity) than positive outcomes (84% precision).</li>
</ul>
<p>Compared to our baseline probit model, the ridge classification model:</p>
<ul>
<li>Balances performance across both classes better (81% vs. 66% F-score).</li>
<li>Is better at predicting both negative and positive outcomes.</li>
</ul>
<h2 id="results-summary">Results Summary</h2>
<table>
 <thead>
<tr><th></th><th style="width:20%">Probit</th><th style="width:20%">KNN</th><th style="width:20%">Decision Forest</th><th style="width:20%">Ridge Classification</th></tr>
</thead>
<tbody>
<tr><th>True Positives</th><td>22</td><td>20</td><td style="background-color: #fde5d2">25</td><td>22</td></tr>
<tr><th>False Positives</th><td>17</td><td>3</td><td style="background-color: #fde5d2">1</td><td>4</td></tr>
<tr><th>True Negatives</th><td>243</td><td>257</td><td style="background-color: #fde5d2">259</td><td>256</td></tr>
<tr><th>False Negatives</th><td>6</td><td>8</td><td style="background-color: #fde5d2">3</td><td>6</td></tr>
<tr><th>Accuracy</th><td>92%</td><td>96%</td><td style="background-color: #fde5d2">99%</td><td>96%</td></tr>
<tr><th>Precision</th><td>56%</td><td>87%</td><td style="background-color: #fde5d2">96%</td><td>84%</td></tr>
<tr><th>Recall</th><td>79%</td><td>71%</td><td style="background-color: #fde5d2">89%</td><td>79%</td></tr>
<tr><th>F-score</th><td>66%</td><td>78%</td><td style="background-color: #fde5d2">92%</td><td>81%</td></tr>
<tr><th>Specificity</th><td>93%</td><td>99%</td><td style="background-color: #fde5d2">99%</td><td>98%</td></tr>
<tr><th>Balanced Accuracy</th><td>87%</td><td>85%</td><td style="background-color: #fde5d2">94%</td><td>89%</td></tr>
</tbody>
</table>
<p>From the summary table, we see that, even without tuning, the decision forest classifier outperforms the other models on every evaluation metric. While all models predict non-recession periods well, the decision forest model is the clear winner for predicting recession periods.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In today's blog, we examined the performance of several models for predicting recessions. You should now have a better understanding of:</p>
<ul>
<li>How to implement machine learning models in GAUSS.</li>
<li>How to compare classification models.</li>
<li>How machine learning models can be used to improve prediction. </li>
</ul>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a></li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a></li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a></li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a></li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a></li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a></li>
</ol>
<h2 id="try-out-machine-learning-in-gauss">Try Out Machine Learning in GAUSS</h2>


]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
