Machine Learning With Real-World Data

by Eric · Published May 15, 2023 · Updated January 25, 2024

Introduction

If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions.

In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:

Data Exploration and cleaning.
Splitting data for training and testing.
Model fitting and prediction.
Basic feature engineering.

Background

Our Data

Today we will be working with the California Housing Dataset from Kaggle.

This dataset is built from 1990 Census data. Though it is an older dataset, it is a great demonstration dataset and has been popular in many machine learning examples.

The dataset contains 10 variables measured in California at the block group level:

Variable	Description
longitude	Measure of how far west a house is.
latitude	Measure of how far north a house is.
housing_median_age	Median age of a house within a block.
total_rooms	Total number of rooms within a block.
total_bedrooms	Total number of bedrooms within a block.
population	Total number of people residing within a block.
households	Total number of households, a group of people residing within a home unit, for a block.
median_income	Median income for households within a block of houses (measured in tens of thousands of US Dollars).
median_house_value	Median house value for households within a block.
ocean_proximity	Location of the house w.r.t ocean/sea.

GAUSS Machine Learning

We will use the new GAUSS Machine Learning (GML) library. This library is extremely user friendly and provides easy-to-use machine learning tools for implementing fundamental machine learning models.

To access these tools, we need to load the library:

// Clear workspace and load library
new;
library gml;

// Set random seed
rndseed 8906876;

Note we also set the random seed to allow for replication.

Data Exploration and Cleaning

With GML loaded, we our now ready to import and clean our data. The first step is to use the loadd procedure to import our data into the GAUSS.

/*
** Import datafile
*/
load_path = "data/";
fname = "housing.csv";

// Load all variables
housing_data = loadd(load_path $+ fname);

Descriptive Statistics

Exploratory data analysis allows us to identify important data anomalies, like outliers and missing values.

Let's start by looking at standard descriptive statistics using the dstatmt procedure:

// Find descriptive statistics
// for all variables in housing_data
dstatmt(housing_data);

This prints a summary table of statistics for all variables.

--------------------------------------------------------------------------------------------------
Variable                  Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
--------------------------------------------------------------------------------------------------
longitude               -119.6       2.004         4.014      -124.3      -114.3     20640    0
latitude                 35.63       2.136         4.562       32.54       41.95     20640    0
housing_median_age       28.64       12.59         158.4           1          52     20640    0
total_rooms               2636        2182     4.759e+06           2   3.932e+04     20640    0
total_bedrooms           537.9       421.4     1.776e+05           1        6445     20433  207
population                1425        1132     1.282e+06           3   3.568e+04     20640    0
households               499.5       382.3     1.462e+05           1        6082     20640    0
median_income            3.871         1.9         3.609      0.4999          15     20640    0
median_house_value   2.069e+05   1.154e+05     1.332e+10     1.5e+04       5e+05     20640    0
ocean_proximity          -----       -----         -----   <1H OCEAN  NEAR OCEAN     20640    0

These statistics allow us to quickly identify several data issues that we need to address prior to fitting our model:

There are 207 missing observations of the total bedrooms variable (you may need to scroll to the right of the output).
Many of our variables show potential outliers, with high variance and large ranges. These should be further explored.

Missing Values

To get a better idea of how to best deal with the missing values, let's check the descriptive statistics for the observations with and without missing values separately.

// Conditional check 
// for missing values
e = housing_data[., "total_bedrooms"] .== miss();

// Get descriptive statistics
// for dataset with missing values
dstatmt(selif(housing_data, e));

------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum   Valid  Missing
------------------------------------------------------------------------------------------------

longitude              -119.5       2.001         4.006      -124.1      -114.6      207    0
latitude                 35.5       2.097         4.399       32.66       40.92      207    0
housing_median_age      29.27       11.96         143.2           4          52      207    0
total_rooms              2563        1787     3.194e+06         154   1.171e+04      207    0
total_bedrooms          -----       -----         -----        +INF        -INF        0  207
population               1478        1057     1.118e+06          37        7604      207    0
households                510       386.1     1.491e+05          16        3589      207    0
median_income           3.822       1.956         3.824      0.8527          15      207    0
median_house_value   2.06e+05   1.116e+05     1.246e+10    4.58e+04       5e+05      207    0
ocean_proximity         -----       -----         -----   <1H OCEAN  NEAR OCEAN      207    0

// Get descriptive statistics
// for dataset without missing values
dstatmt(delif(housing_data, e));

-------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------

longitude              -119.6       2.004         4.014      -124.3      -114.3     20433    0
latitude                35.63       2.136         4.564       32.54       41.95     20433    0
housing_median_age      28.63       12.59         158.6           1          52     20433    0
total_rooms              2637        2185     4.775e+06           2   3.932e+04     20433    0
total_bedrooms          537.9       421.4     1.776e+05           1        6445     20433    0
population               1425        1133     1.284e+06           3   3.568e+04     20433    0
households              499.4       382.3     1.462e+05           1        6082     20433    0
median_income           3.871       1.899         3.607      0.4999          15     20433    0
median_house_value  2.069e+05   1.154e+05     1.333e+10     1.5e+04       5e+05     20433    0
ocean_proximity         -----       -----         -----   <1H OCEAN  NEAR OCEAN     20433    0

From visual inspection, the descriptive statistics for the data with the missing values are very similar to the descriptive statistics data without the missing values.

We could do more robust statistical tests to confirm this. However, those are outside of the scope of this blog, and we will rely on our visual inspection.

In addition, the missing values make up less than 1% of the total observations. Given this, we will delete the rows containing missing values, rather than imputing our missing values.

We can delete the rows with missing values using the packr procedure:

// Remove rows with missing values
// from housing_data
housing_data = packr(housing_data);

Outliers

Now that we've removed missing values, let's look for other data outliers. Data visualizations like histograms and box plots are a great way to identify potential outliers.

First, let's create a grid plot of histograms for all of our continuous variables:

/*
** Data visualizations
*/
// Get variables names
vars = getColNames(housing_data);

// Set up plotControl 
// structure for formatting graphs
struct plotControl plt;
plt = plotGetDefaults("bar");

// Set fonts
plotSetFonts(&plt, "title", "Arial", 14);
plotSetFonts(&plt, "ticks", "Arial", 12);

// Loop through the variables and draw histograms
for i(1, rows(vars)-1, 1);
    plotSetTitle(&plt, vars[i]);
    plotLayout(3, 3, i);
    plotHist(plt, housing_data[., vars[i]], 50);
endfor;

From our histograms, it appears that several variables suffer from outliers:

The total_rooms variable, with the majority of the data distributed between 0 and 10,000.
The total_bedrooms variable, with the majority of the data distributed between 0 and 2000.
The households variable, with the majority of the data distributed between 0 and 2000.
The population variable, with the majority of the data distributed between 0 and 100,000.

Box plots of these variables confirm that there are indeed outliers.

plt = plotGetDefaults("box");

// Set fonts
plotSetFonts(&plt, "title", "Arial", 14);
plotSetFonts(&plt, "ticks", "Arial", 12);

string box_vars = { "total_rooms", "total_bedrooms", "households", "population" };

// Loop through the variables and draw boxplots
for i(1, rows(box_vars), 1);
    plotLayout(2, 2, i);
    plotBox(plt, box_vars[i], housing_data[., box_vars[i]]);
endfor;

Let's filter the data to eliminate these outliers:

/*
** Filter to remove outliers
**
** Delete:
**    - total_rooms greater than or equal to 10000
**    - total_bedrooms greater than or equal to 20000
**    - households greater than or equal to 2000
**    - population greater than or equal to 6000
*/
mask = housing_data[., "total_rooms"] .>= 10000;
mask = mask .or housing_data[., "total_bedrooms"] .>= 2000;
mask = mask .or housing_data[., "households"] .>= 2000;
mask = mask .or housing_data[., "population"] .>= 6000;

housing_data = delif(housing_data, mask);

Note that we've taken a conservative approach to filtering outliers and haven't removed all points identified by the box plots as outliers.

Data Truncation

The histograms also point to truncation issues with housing_median_age and median_house_value. Let's look into this a little further:

We'll confirm that these are the most frequently occurring observations using modec. This provides evidence for our suspicion that these are truncation points.
We'll count the number of observations at these locations.

Remember that we've already filtered our outliers, so we're looking at a subset of our original data.

// House value
mode_value = modec(housing_data[., "median_house_value"]);
print "Most frequent median_house_value:" mode_value;

print "Counts:";
sumc(housing_data[., "median_house_value"] .== mode_value);

// House age
mode_age = modec(housing_data[., "housing_median_age"]);
print "Most frequent housing_median_age:" mode_age;

print "Counts:";
sumc(housing_data[., "housing_median_age"] .== mode_age);

We use modec because from our histogram we can't identify for that these points occur at the maximum. It makes sense to assume that they do but we can't be certain.

Most frequent median_house_value:
       500001.00
Counts:
       935.00000
Most frequent housing_median_age:
       52.000000
Counts:
       1262.0000

These combined observations make up about 10% of the total observations. Because we have no further information about what is occurring at these points, let's remove them from our model.

// Create binary vector with a 1 if either
// 'housing_median_age' or 'median_house_value'
// equal their mode value.
mask = (housing_data[., "housing_median_age"] .== mode_age)
       .or (housing_data[., "median_house_value"] .== mode_value);

// Delete the rows if they meet our above criteria
housing_data = delif(housing_data, mask);

Feature Modifications

Our final data cleaning step is to make feature modifications including:

Rescaling the median_house_value variable to be measured in 10,000 of US dollars (the same scale as median_income).
Generating dummy variables to account for the categories of ocean_proximity.

First, we rescale the median_house_value:

// Rescale median income variable
housing_data[., "median_house_value"] = 
    housing_data[., "median_house_value"] ./ 10000;

Next we generate dummy variables for ocean_proximity.

Let's get a feel for our categorical data using the frequency procedure:

// Check frequency of
// ocean_proximity categories
frequency(housing_data, "ocean_proximity");

This prints a convenient frequency table:

     Label      Count   Total %    Cum. %
 <1H OCEAN       8095     44.89     44.89
    INLAND       6136     34.03     78.93
    ISLAND          2   0.01109     78.94
  NEAR BAY       1525     8.458     87.39
NEAR OCEAN       2273     12.61       100
     Total      18031       100

We can see from this table that the ISLAND category is a very small category. We'll exclude it from our modeling dataset.

Now let's create our dummy variables using the oneHot procedure:

/*
** Generate dummy variables for 
** the ocean_proximity using
** one hot encoding
*/
dummy_matrix = oneHot(housing_data[., "ocean_proximity"]);

Finally, we'll save our modeling dataset in a GAUSS .gdat file using saved so we can directly access our clean data in the future:

/*
** Build matrix of features
** Note we exclude: 
**     - ISLAND dummy variable
**     - Original ocean_proximity variable
*/
model_data = delcols(housing_data, "ocean_proximity") ~ 
    delcols(dummy_matrix, "ocean_proximity_ISLAND");

// Saved data matrix
saved(model_data, load_path $+ "/model_data.gdat");

Data Splitting

In machine learning, it's customary to use separate datasets to fit the model and to evaluate model performance. Since the objective of machine learning models is to provide predictions for unseen data, using a testing set provides a more realistic measure of how our model will perform.

Cross-validation is an additional tool for evaluating model performance. To learn more about cross-validation, see our previous blog, "Understanding Cross-Validation".

To prepare our data for training and testing, we're going to take two steps:

Separate our target variable, median_house_value, and feature set.
Split our data into 70% testing and 30% training dataset using trainTestSplit.

new;
library gml;
rndseed 896876;

/*
** Load datafile
*/
load_path = "data/";
fname = "model_data.gdat";

// Load data
housing_data = loadd(load_path $+ fname);

/*
** Feature management
*/
// Separate dependent and independent data
y = housing_data[., "median_house_value"];
X = delcols(housing_data, "median_house_value");

// Split into 70% training data 
// and 30% testing data
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

Fitting Our Model

Now that we've completed our data cleaning, we're finally ready to fit our model. Today we'll use a LASSO regression model to predict our target variable. LASSO is a form of regularization that has found relative success in economic and financial modeling. It offers a data-driven approach to dealing with high-dimensionality in linear models.

Model Fitting

To fit the LASSO model to our target variable, median_house_value, we'll use lassoFit from the GAUSS Machine Learning library.

/*
** LASSO Model
*/
// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);

The lassoFit procedure prints a model description and results:

==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    12
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.347     -1.013   -0.02555
                  latitude     -2.192    -0.9269          0
        housing_median_age    0.07189    0.06384    0.03977
               total_rooms  -0.001004          0          0
            total_bedrooms    0.01165   0.006107   0.004828
                population  -0.004317  -0.003396  -0.001232
                households   0.006808   0.005119          0
             median_income      3.872      3.569      3.457
 ocean_proximity__1H OCEAN     -5.509          0          0
    ocean_proximity_INLAND     -9.437     -5.639     -6.575
  ocean_proximity_NEAR BAY     -7.083    -0.6395          0
ocean_proximity_NEAR OCEAN     -5.198     0.6378     0.6981
                    CONST.     -193.5     -82.98      3.451
===========================================================
                        DF         12         10          7
              Training MSE       33.7       34.7       37.4

The results highlight the variable selection function of LASSO. With $\lambda = 0$, a full least squares model, all features are represented in the model. When we get to $\lambda = 0.3$, the LASSO regression removes 4 of our 12 variables:

latitude
total_rooms
ocean_proximity__1H OCEAN
ocean_proximity_NEAR BAY

As we would expect, median_income has a large positive impact. However, there are a few noteworthy observations about the coefficients for the location related variables.

As we add more regularization to the model by increasing the value of $\lambda$, ocean_proximity__1H OCEAN and ocean_proximity_NEAR BAY are removed from the model, but the effect of ocean_proximity_INLAND increases substantially. latitude is also removed from the model. This could be because these effects are largely also explained by the location dummy variables and median_income.

Prediction

We can now test our model's prediction capability on the testing data using lmPredict:

// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;

Testing MSE

       33.814993
       34.726144
       37.199771

As expected, most of these values are above the training MSE but not by much. The test MSE for the model with the highest $\lambda$ value is actually lower than the training MSE. This suggests that our model is not overfitting.

Feature Engineering

Since our model is not overfitting, we can add more variables to the model. We could collect more data variables to add. However, it is likely that there is more information in our current data that we can make more accessible to our estimator. We are going to create some new features from combinations of our current features. This is part of a process called feature engineering which can make substantial contributions to your machine learning models.

We will start by generating per capita variables for total_rooms, total_bedrooms, and households.

/*
** Create per capita variables
** using population
*/
pc_data = housing_data[., "total_rooms" "total_bedrooms" "households"] 
    ./ housing_data[., "population"];

// Convert to a dataframe and add variable names
pc_data = asdf(pc_data, "rooms_pc"$|"bedrooms_pc"$|"households_pc");

Next we will great a variable representing the percentage of total_rooms made up by total_bedrooms:

beds_per_room = X[.,"total_bedrooms"] ./ X[.,"total_rooms"];

and add these columns to X:

X = X ~ pc_data ~ asdf(beds_per_room, "beds_per_room");

Fit and Predict the New Model

// Reset the random seed so we get the
// same test and train splits as our previous model
rndseed 896876;

// Split our new X into train and test splits
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);

// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;

==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    16
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.495     -1.008          0
                  latitude      -2.36    -0.9354          0
        housing_median_age     0.0808    0.07167    0.04316
               total_rooms -0.0001714          0          0
            total_bedrooms   0.005301   0.001517  0.0008104
                population -0.0004661          0          0
                households  -0.001611          0          0
             median_income      3.947      4.011      3.675
 ocean_proximity__1H OCEAN     -5.171          0          0
    ocean_proximity_INLAND     -8.635     -4.963     -6.235
  ocean_proximity_NEAR BAY     -6.966     -0.875          0
ocean_proximity_NEAR OCEAN     -5.219     0.2927     0.1798
                  rooms_pc      2.678     0.1104          0
               bedrooms_pc     -11.68          0          0
             households_pc      22.23      21.47      20.23
             beds_per_room      33.03      17.03      8.029
                    CONST.     -221.9     -95.55     -3.059
===========================================================
                        DF         16         11          7
              Training MSE       31.6       32.5       34.3
Testing MSE

       31.505169
       32.457936
       34.155290

Our train and test MSE have improved for all values of $\lambda$. Of our new variables, households_pc and beds_per_room, seem to have the strongest effect.

Extensions

We use a linear regression model, LASSO, for modeling home values. This was chosen somewhat ad hoc, and there are a number of alternative and extensions that could help improve our predictions.

For example we could:

Use clustering or K-nearest neighbors to capture more location information.
Use principal component analysis to capture the variation in our features, then estimate the linear relationship between our the median home values and the principle component.
Use a random forest model, which generally provides good accuracy for tabular datasets.
Split the home values into bins and perform classification, rather than regression.

Conclusion

In today's blog we've seen the important role that data exploration and cleaning plays in developing a machine learning model. Rarely do we obtain data that we can plug directly into our models. It's best practice, to make time for data exploration and cleaning, because any machine learning model is only as reliable as its data.

Further Machine Learning Reading

Predicting Recessions with Machine Learning Techniques
Applications of Principal Components Analysis in Finance
Predicting The Output Gap With Machine Learning Regression Models
Fundamentals of Tuning Machine Learning Hyperparameters
Understanding Cross-Validation
Classification with Regularized Logistic Regression
Eric( Director of Applications and Training at Aptech Systems, Inc. )
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.