<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Machine learning &#8211; Aptech</title>
	<atom:link href="https://www.aptech.com/blog/category/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aptech.com</link>
	<description>GAUSS Software - Fastest Platform for Data Analytics</description>
	<lastBuildDate>Thu, 20 Mar 2025 21:15:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Announcing the GAUSS Machine Learning Library</title>
		<link>https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/</link>
					<comments>https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Mon, 28 Aug 2023 14:36:25 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Releases]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11584015</guid>

					<description><![CDATA[The new GAUSS Machine Learning (GML) library offers powerful and efficient machine learning techniques in an accessible and friendly environment. Whether you're just getting familiar with machine learning or you're an experienced practitioner, you'll be running models in no time with GML. ]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>The new GAUSS Machine Learning (GML) library offers powerful and efficient machine learning techniques in an accessible and friendly environment. Whether you're just getting familiar with machine learning or you're an experienced practitioner, you'll be running models in no time with GML. </p>
<h2 id="machine-learning-models-at-your-fingertips">Machine Learning Models at Your Fingertips</h2>
<p>With the GAUSS Machine Learning library, you can run machine learning models out of the box, even without any machine learning background. It supports fundamental machine learning models for classification and regression, including:</p>
<ul>
<li><a href="https://docs.aptech.com/gauss/logisticregfit.html" target="_blank" rel="noopener">Logistic regression</a>.</li>
<li><a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener">LASSO</a> and <a href="https://docs.aptech.com/gauss/ridgefit.html" target="_blank" rel="noopener">ridge</a> regression.</li>
<li><a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener">Decision forests</a>.</li>
<li><a href="https://docs.aptech.com/gauss/pcafit.html" target="_blank" rel="noopener">Principal component analysis</a>.</li>
<li><a href="https://docs.aptech.com/gauss/knnfit.html" target="_blank" rel="noopener">K-nearest neighbors</a>.</li>
<li><a href="https://docs.aptech.com/gauss/kmeansfit.html" target="_blank" rel="noopener">K-means clustering</a>.</li>
</ul>
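<p>As a minimal sketch of the workflow (the procedure name comes from the linked documentation, but the output struct name, filename, and argument list shown here are assumptions; consult the docs for the exact signatures):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">new;
library gml;

// Load a dataset (hypothetical filename)
data = loadd("mydata.gdat");

// Separate the target from the features
y = data[., "label"];
X = delcols(data, "label");

// Fit a logistic regression classifier
struct logisticRegModel mdl;
mdl = logisticRegFit(y, X);</code></pre>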
<p><a href="https://docs.aptech.com/gauss/plotlr.html" target="_blank" rel="noopener"><img src="https://docs.aptech.com/gauss/_images/lassofit.jpg" width="800" height="600" alt="LASSO regression coefficient response plot." class="aligncenter size-full" /></a></p>
<h2 id="quick-and-painless-data-preparation-and-management">Quick and Painless Data Preparation and Management</h2>
<p>We know model fitting and prediction are just the tip of the iceberg when it comes to any data analysis project. That's why we've focused on making GAUSS one of the best environments for data import, cleaning, and exploration. </p>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/11/g22-donor-strclean-2-frames-npup2.gif"><img src="https://www.aptech.com/wp-content/uploads/2021/11/g22-donor-strclean-2-frames-npup2.gif" alt="" width="605" height="374" class="size-full wp-image-11581929" /></a></p>
<p>GML provides machine learning specific data preparation tools including:</p>
<ul>
<li><a href="https://docs.aptech.com/gauss/onehot.html" target="_blank" rel="noopener">One-hot encoding</a>.  </li>
<li><a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener">Testing and training data splits</a>. </li>
<li><a href="https://docs.aptech.com/gauss/cvsplit.html" target="_blank" rel="noopener">Cross-validation splits</a>. </li>
<li>Internal data scaling. </li>
</ul>
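<p>As a sketch of how these tools fit together (the argument order and returns shown here are assumptions; see the linked documentation for the exact signatures):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Convert a categorical column to dummy variables
class_dummies = oneHot(data[., "Class"]);

// Hold out 30% of observations for testing
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);</code></pre>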
<p>See how GAUSS reduces the pain and time of data wrangling and lets you get to the heart of your machine learning models more quickly. </p>
<h2 id="easy-to-implement-model-evaluation">Easy to Implement Model Evaluation</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/08/classification-statistics.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/08/classification-statistics.jpg" alt="GAUSS classification metrics from machine learning library. " width="574" height="406" class="aligncenter size-full wp-image-11584020" /></a></p>
<p>Compare and evaluate machine learning models with GML's plotting and performance evaluation tools:</p>
<ul>
<li><a href="https://docs.aptech.com/gauss/plotclasses.html" target="_blank" rel="noopener">Data class plots</a>. </li>
<li><a href="https://docs.aptech.com/gauss/meansquarederror.html" target="_blank" rel="noopener">Model mean squared error</a>. </li>
<li><a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener">Classification metrics</a>.</li>
<li><a href="https://docs.aptech.com/gauss/plotvariableimportance.html" target="_blank" rel="noopener">Variable importance tables</a>. </li>
</ul>
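<p>After fitting a model, evaluation is typically a single call. In this sketch, <code>y_test</code> and <code>predictions</code> are hypothetical variables from an earlier train/test split and prediction step:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Compare observed and predicted classes
call classificationMetrics(y_test, predictions);</code></pre>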
<div style="text-align:center;background-color:#f0f2f4"><hr>Interested in how GAUSS machine learning can work for you? <a href="https://www.aptech.com/contact-us/" target="_blank" rel="noopener">Contact Us</a><hr></div>
<h2 id="unparalleled-customer-support">Unparalleled Customer Support</h2>
<p>We pride ourselves on offering unparalleled customer support and we truly care about your success. If you can't find what you need in our <a href="https://docs.aptech.com/gauss/" target="_blank" rel="noopener">online documents</a>, <a href="https://www.aptech.com/questions/" target="_blank" rel="noopener">user forum</a>, or <a href="https://www.aptech.com/blog/" target="_blank" rel="noopener">blog</a>, you can be confident that a GAUSS expert is here to quickly resolve your questions.</p>
<h2 id="see-it-in-action">See It In Action</h2>
<p>Want to see GML in action? Check out these real-world applications:</p>
<ol>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification With Regularized Logistic Regression</a>.</li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>.</li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>.</li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>.</li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>.</li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>. </li>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions With Machine Learning Techniques</a>.</li>
</ol>
<h2 id="try-out-gauss-machine-learning">Try out GAUSS Machine Learning</h2>

]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Classification with Regularized Logistic Regression</title>
		<link>https://www.aptech.com/blog/classification-with-regularized-logistic-regression/</link>
					<comments>https://www.aptech.com/blog/classification-with-regularized-logistic-regression/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Wed, 07 Jun 2023 15:59:02 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583861</guid>

					<description><![CDATA[Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics. 

In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>, including:
<ol><li>Data preparation.</li>
<li>Model fitting.</li>
<li>Classification predictions. </li>
<li>Evaluating predictions and model fit. </li>
</ol>]]></description>
										<content:encoded><![CDATA[    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: inherit !important;
            stroke: inherit !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // calculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
<h3 id="introduction">Introduction</h3>
<p>Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics. </p>
<p>In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>, including:</p>
<ol>
<li>Data preparation.</li>
<li>Model fitting.</li>
<li>Classification predictions. </li>
<li>Evaluating predictions and model fit. </li>
</ol>
<h2 id="what-is-logistic-regression">What is Logistic Regression?</h2>
<p>Logistic regression is a statistical method that can be used to predict the probability of an event occurring based on observed features or variables. The predicted probabilities can then be used to classify the data based on probability thresholds. </p>
<p>For example, if we are modeling a &quot;TRUE&quot; and &quot;FALSE&quot; outcome, we may predict that an outcome will be &quot;TRUE&quot; for all predicted probabilities of 0.5 and higher. </p>
<p>Mathematically, logistic regression models the relationship between the probability of an outcome as a logistic function of the independent variables:</p>
<p>$$ Pr(Y = 1 | X) = p(X) = \frac{e^{B_0 + B_1X}}{1 + e^{B_0 + B_1X}} $$</p>
<p>The equivalent log-odds (logit) representation is often preferred because it is linear in our independent variables:</p>
<p>$$ \log \bigg( \frac{p(X)}{1 - p(X)} \bigg) = B_0 + B_1X $$</p>
<p>There are some important aspects of this model to keep in mind:</p>
<ul>
<li>The logistic regression model always yields a prediction between 0 and 1.</li>
<li>The magnitude of the coefficients in the logistic regression model cannot be as directly interpreted as in the classic linear model. </li>
<li>The signs of the coefficients in the logistic regression model can be interpreted as expected. For example, if the coefficient on $X_1$ is negative we can conclude that increasing $X_1$ decreases $p(X)$. </li>
</ul>
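<p>As a quick worked example with illustrative values $B_0 = -1$, $B_1 = 0.5$, and $X = 4$, the linear index is $B_0 + B_1X = 1$, so</p>
<p>$$ p(X) = \frac{e^{1}}{1 + e^{1}} \approx 0.73 $$</p>
<p>With a 0.5 threshold, this observation would be classified as a positive outcome.</p>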
<h2 id="logistic-regression-with-regularization">Logistic Regression with Regularization</h2>
<p>One potential pitfall of logistic regression is its tendency for overfitting, particularly with high dimensional feature sets. </p>
<p>Regularization with L1 and/or L2 penalty terms can help prevent overfitting and improve prediction. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="comparison-of-l1-and-l2-regularization"><span style="color:#FFFFFF">Comparison of L1 and L2 Regularization</span></h3>
      </th>
   </tr>
</thead>
<tbody>
<tr><th></th><th>$L1$ penalty (Lasso)</th><th>$L2$ penalty (Ridge)</th></tr>
<tr><td>Penalty term</td><td>$\lambda \sum_{j=1}^p |\beta_j|$</td><td>$\lambda \sum_{j=1}^p \beta_j^2$</td></tr>
<tr><td>Robust to outliers</td><td>✓</td><td></td></tr>
<tr><td>Shrinks coefficients</td><td>✓</td><td>✓</td></tr>
<tr><td>Can select features</td><td>✓</td><td></td></tr>
<tr><td>Sensitive to correlated features</td><td>✓</td><td></td></tr>
<tr><td>Useful for preventing overfitting</td><td>✓</td><td>✓</td></tr>
<tr><td>Useful for addressing multicollinearity</td><td></td><td>✓</td></tr>
<tr><td>Requires hyperparameter selection (λ)</td><td>✓</td><td>✓</td></tr>
</tbody>
</table>
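<p>Concretely, regularized logistic regression adds the penalty terms above to the negative log-likelihood $-\ell(\beta)$. In the combined (elastic net) form,</p>
<p>$$ \min_{\beta} \ -\ell(\beta) + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 $$</p>
<p>where setting $\lambda_2 = 0$ gives the Lasso and setting $\lambda_1 = 0$ gives ridge.</p>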
<div class="alert alert-info" role="alert">Our previous blog, <a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">&quot;Predicting the Output Gap With Machine Learning Regression Models&quot;</a> provides a more detailed look at L1 and L2 regularization.</div>
<h2 id="predicting-customer-satisfaction-using-survey-data">Predicting Customer Satisfaction Using Survey Data</h2>
<p>Today we will use <a href="https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction" target="_blank" rel="noopener">airline passenger satisfaction data</a> to demonstrate logistic regression with regularization. </p>
<p>Our task is to predict passenger satisfaction using:</p>
<ul>
<li>Available <a href="https://www.aptech.com/blog/getting-started-with-survey-data-in-gauss/" target="_blank" rel="noopener">survey answers</a>. </li>
<li>Flight information. </li>
<li>Passenger characteristics.</li>
</ul>
<div>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C; position: sticky; top: 0;"><span style="color:#FFFFFF">Variable</span></th><th style="background-color: #36434C; position: sticky; top: 0;"><span style="color:#FFFFFF">Description</span></th>
 </tr>
</thead>
<tbody>
<tr><td>id</td><td>Responder identification number</td></tr>
<tr><td>Gender</td><td>Gender identification: Female or Male.</td></tr>
<tr><td>Customer Type</td><td>Loyal or disloyal customer.</td></tr>
<tr><td>Age</td><td>Customer age in years.</td></tr>
<tr><td>Type of travel</td><td>Personal or business travel.</td></tr>
<tr><td>Class</td><td>Eco or business class seat.</td></tr>
<tr><td>Flight Distance</td><td>Flight distance in miles.</td></tr>
<tr><td>Wifi service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Schedule convenient</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Ease of Online booking</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Gate location</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Food and drink</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Seat comfort</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Online boarding</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Inflight entertainment</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>On-board service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Leg room service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Baggage handling</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Checkin service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Inflight service</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Cleanliness</td><td>Customer rating on 0-5 scale.</td></tr>
<tr><td>Departure Delay in minutes</td><td>Minutes delayed when departing.</td></tr>
<tr><td>Arrival Delay in minutes</td><td>Minutes delayed when arriving.</td></tr>
<tr><td>satisfaction</td><td>Overall airline satisfaction. Possible responses include "satisfied" or "neutral or dissatisfied".</td></tr>
</tbody>
</table>
</div>
<p><br>
The first step in our analysis is to load our data using <a href="https://docs.aptech.com/gauss/loadd.html" target="_blank" rel="noopener"><code>loadd</code></a>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">new;
library gml;
rndseed 8906876;

/*
** Load datafile
*/
// Set path and filename
load_path = "data/";
fname = "airline_satisfaction.gdat";

// Load data
airline_data = loadd(load_path $+ fname);

// Split data
y = airline_data[., "satisfaction"];
X = delcols(airline_data, "satisfaction"$|"id");</code></pre>
<h3 id="data-exploration">Data Exploration</h3>
<p>Before we begin modeling, let's do some preliminary <a href="https://docs.aptech.com/gauss/data-management/data-exploration.html" target="_blank" rel="noopener">data exploration</a>. First, let's check for common issues that can arise with survey data. </p>
<p>We'll check for:</p>
<ul>
<li>Duplicate observations using <a href="https://docs.aptech.com/gauss/dstatmt.html" target="_blank" rel="noopener"><code>isunique</code></a>.</li>
<li><a href="https://www.aptech.com/blog/introduction-to-handling-missing-values/" target="_blank" rel="noopener">Missing values</a> using <a href="https://docs.aptech.com/gauss/dstatmt.html" target="_blank" rel="noopener"><code>dstatmt</code></a>.</li>
</ul>
<p>First, we'll check for duplicates, so any duplicates can be removed prior to checking our summary statistics:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check for duplicates
isunique(airline_data);</code></pre>
<p>The <code>isunique</code> procedure returns a 1 if the data is unique and 0 if there are duplicates.</p>
<pre>1.00000000</pre>
<p>In this case, it indicates that we have no duplicates in our data.</p>
<p>Next, we'll check for missing values:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Check for data cleaning
** issues
*/
// Summary statistics
call dstatmt(airline_data);</code></pre>
<p>This prints <a href="https://www.aptech.com/resources/tutorials/formula-string-syntax/descriptive-statistics-from-a-dataset/" target="_blank" rel="noopener">summary statistics</a> for all variables:</p>
<pre>Variable                       Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------------

Gender                        -----       -----         -----      Female        Male    103904    0
Customer Type                 -----       -----         -----  Loyal Cust  disloyal C    103904    0
Age                           39.38       15.11         228.5           7          85    103904    0
Type of Travel                -----       -----         -----  Business t  Personal T    103904    0
Class                         -----       -----         -----    Business    Eco Plus    103904    0
Flight Distance                2108        1266     1.603e+06           0        3801    103904    0
Wifi service                  -----       -----         -----           0           5    103904    0
Schedule convenient           -----       -----         -----           0           5    103904    0
Ease of Online booking        -----       -----         -----           0           5    103904    0
Gate location                 -----       -----         -----           0           5    103904    0
Food and drink                -----       -----         -----           0           5    103904    0
Online boarding               -----       -----         -----           0           5    103904    0
Seat comfort                  -----       -----         -----           0           5    103904    0
Inflight entertainment        -----       -----         -----           0           5    103904    0
Onboard service               -----       -----         -----           0           5    103904    0
Leg room service              -----       -----         -----           0           5    103904    0
Baggage handling              -----       -----         -----           1           5    103904    0
Checkin service               -----       -----         -----           0           5    103904    0
Inflight service              -----       -----         -----           0           5    103904    0
Cleanliness                   -----       -----         -----           0           5    103904    0
Departure Delay in Minutes    14.82       38.23          1462           0        1592    103904    0
Arrival Delay in Minutes      15.25       38.81          1506           0        1584    103904    0
satisfaction                  -----       -----         -----  neutral or   satisfied    103904    0 </pre>
<p>The summary statistics give us some useful insights:</p>
<ul>
<li>There are no missing values in our dataset.</li>
<li>The summary statistics of our numerical variables don't indicate any obvious outliers. </li>
<li>All categorical survey data ranges from 0 to 5 with the exception of <code>Baggage handling</code> which ranges from 1 to 5. All <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical variables</a> will need to be converted to dummy variables prior to modeling. </li>
</ul>
<p>One other observation from our summary statistics is that many of the variable names are longer than necessary. Long variable names can be:</p>
<ul>
<li>Difficult to remember.</li>
<li>Prone to typos.</li>
<li>Cut off when printing results.</li>
</ul>
<p>(Not to mention they can be annoying to type!)</p>
<p>Let's streamline our variable names using <a href="https://docs.aptech.com/gauss/dfname.html" target="_blank" rel="noopener"><code>dfname</code></a>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Update variable names
*/
// Create string array of short names
string short_names = {"Loyalty", "Reason", "Distance", "Wifi", 
                      "Schedule", "Booking", "Gate", "Boarding", 
                      "Entertainment", "Leg room", "Baggage", "Checkin", 
                      "Departure Delay", "Arrival Delay" };

// Create string array of original names to change                      
string original_names = { "Customer Type", "Type of Travel", "Flight Distance", "Wifi service",
                          "Schedule convenient", "Ease of Online booking", "Gate location", "Online boarding",
                          "Inflight entertainment", "Leg room service", "Baggage handling", "Checkin service",
                          "Departure Delay in Minutes", "Arrival Delay in Minutes" };

// Change names
airline_data = dfname(airline_data, short_names, original_names);
</code></pre>
<h3 id="data-visualization">Data Visualization</h3>
<p><a href="https://www.aptech.com/blog/category/graphics/" target="_blank" rel="noopener">Data visualization</a> is a great way to get a feel for the relationships between our target variable and our features.</p>
<p>Let's explore the relationship between the customer and flight characteristics and reported satisfaction. </p>
<p>In particular, we'll look at how satisfaction relates to:</p>
<ul>
<li>Age.</li>
<li>Gender.</li>
<li>Flight distance.</li>
<li>Seat class.</li>
<li>Customer type. </li>
</ul>
<h4 id="preparing-our-data-for-plotting">Preparing Our Data for Plotting</h4>
<p>Today we'll use bar graphs to explore the relationships in our data. In particular, we will sort our data into subgroups and examine how those subgroups report satisfaction. </p>
<p>For <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical variables</a>, we have naturally defined subgroups. However, for the continuous variables, <code>Age</code> and <code>Distance</code>, we first need to generate bins based on ranges of these variables. </p>
<p>First, let's place the <code>Age</code> variable in bins. To do this we will use the <a href="https://docs.aptech.com/gauss/reclassifycuts.html" target="_blank" rel="noopener"><code>reclassifycuts</code></a> and <a href="https://docs.aptech.com/gauss/reclassify.html" target="_blank" rel="noopener"><code>reclassify</code></a> procedures: </p>
<div class="alert alert-info" role="alert">For more information on reclassifying and other similar data transformations, see the <a href="https://docs.aptech.com/gauss/data-management/data-transformations.html?highlight=reclassify#" target="_blank" rel="noopener">Data Transformations</a> section of our Data Management Guide.</div>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create bins for age
*/
// Set age categories cut points
// Class 0: 20 and Under
// Class 1: 21 - 30
// Class 2: 31 - 40
// Class 3: 41 - 50
// Class 4: 51 - 60
// Class 5: 61 - 70
// Class 6: Over 70
cut_pts = { 20, 
            30, 
            40, 
            50, 
            60, 
            70};

// Create numeric classes
age_new = reclassifycuts(airline_data[., "Age"], cut_pts);

// Generate labels to recode to
to = "20 and Under"$|
       "21-30"$|
       "31-40"$|
       "41-50"$|
       "51-60"$|
       "61-70"$|
       "Over 70";

// Recode to categorical variable
age_cat = reclassify(age_new, unique(age_new), to);

// Convert to dataframe
age_cat = asDF(age_cat, "Age Group");</code></pre>
<p>For a quick frequency count of this categorical variable, we can use the <a href="https://docs.aptech.com/gauss/frequency.html" target="_blank" rel="noopener"><code>frequency</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check frequency of age groups
frequency(age_cat, "Age Group");</code></pre>
<pre>       Label      Count   Total %    Cum. %
20 and Under      11333     10.91     10.91
       21-30      21424     20.62     31.53
       31-40      21203     20.41     51.93
       41-50      23199     22.33     74.26
       51-60      18769     18.06     92.32
       61-70       7220     6.949     99.27
     Over 70        756    0.7276       100
       Total     103904       100     </pre>
<p>Now we will do the same for <code>Distance</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create bins for flight distance
*/       
// Set distance categories
// Cut points for data 
cut_pts = { 1000, 
            1500, 
            2000, 
            2500, 
            3000,
            3500};

// Create numeric classes
distance_new = reclassifycuts(airline_data[., "Distance"], cut_pts);

// Generate labels to recode to
to = "1000 and Under"$|
       "1001-1500"$|
       "1501-2000"$|
       "2001-2500"$|
       "2501-3000"$|
       "3001-3500"$|
       "Over 3500";

// Recode to categorical variable
distance_cat = reclassify(distance_new, unique(distance_new), to);

// Convert to dataframe
distance_cat = asDF(distance_cat, "Flight Range");

// Check frequencies
frequency(distance_cat, "Flight Range");</code></pre>
<pre>         Label      Count   Total %    Cum. %
1000 and Under      28017     26.96     26.96
     1001-1500      10976     10.56     37.53
     1501-2000       9331      8.98     46.51
     2001-2500       7834      7.54     54.05
     2501-3000       8053      7.75      61.8
     3001-3500      24815     23.88     85.68
     Over 3500      14878     14.32       100
         Total     103904       100    </pre>
<h4 id="age">Age</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/age-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/age-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583902" /></a></p>
<p>We can see from the plot above that passengers 20 and under and passengers over 60 are less likely to be satisfied than other age groups. </p>
<h4 id="gender">Gender</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/gender-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/gender-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583903" /></a></p>
<p>The plot suggests that gender has little impact on reported satisfaction. </p>
<h4 id="flight-distance">Flight Distance</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/flight-length-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/flight-length-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583904" /></a>
The flight distance plot shows slightly lower satisfaction rates for flights of 3,000 miles or more and for flights of 1,000 miles or less. </p>
<h4 id="seat-class">Seat Class</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/seat-type.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/seat-type.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583905" /></a>
There is a clear discrepancy in satisfaction between passengers who fly business class and other passengers. Business class customers report satisfaction at a much higher rate than those in economy or economy plus. </p>
<h4 id="customer-type">Customer Type</h4>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/customer-type-satisfaction.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/customer-type-satisfaction.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583906" /></a>
Finally, it also appears that loyal passengers are more often satisfied customers than disloyal passengers. </p>
<h3 id="feature-engineering">Feature Engineering</h3>
<p>As is common with survey data, a number of our variables are categorical. We need to represent these as <a href="https://www.aptech.com/blog/how-to-create-dummy-variables-in-gauss/" target="_blank" rel="noopener">dummy variables</a> before modeling. </p>
<p>We'll do this using the <a href="https://docs.aptech.com/gauss/onehot.html" target="_blank" rel="noopener"><code>oneHot</code></a> procedure. However, <code>oneHot</code> only accepts single variables, so we will need to loop through all the categorical variables.</p>
<p>To do this, we first create a list of all categorical variables. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create dummy variables
*/
// Get all variable names
col_names = getColNames(X);

// Get types of all variables
col_types = getColTypes(X);

// Select names of variables
// that are categorical
cat_names = selif(col_names, col_types .== "category");</code></pre>
<p>Next, we loop through all categorical variables and create dummy variables for each one using <code>oneHot</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Loop through categorical variables
// to create dummy variables
dummy_vars = {};
for i(1, rows(cat_names), 1); 
    dummy_vars = dummy_vars~oneHot(X[., cat_names[i]]);
endfor;

// Delete original categorical variables
// and replace with dummy variables
X = delcols(x, cat_names)~dummy_vars;</code></pre>
<h2 id="model-evaluation">Model Evaluation</h2>
<p>The <a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener"><code>classificationMetrics</code></a> procedure reports a number of classification metrics. These metrics tell us how well the model meets different objectives. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="model-comparison-measures"><span style="color:#FFFFFF">Model Comparison Measures</span></h3>
      </th>
   </tr>
<tr><th>Tool</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>Accuracy</td><td>Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.</td></tr>
<tr><td>Precision</td><td>How reliable the model's positive predictions for a class are. Equal to the number of true positives divided by the sum of true positives and false positives. </td></tr>
<tr><td>Recall</td><td>How well the model finds all of the actual members of a class. Equal to the number of true positives divided by the sum of true positives and false negatives.</td></tr>
<tr><td>F1-score</td><td>The harmonic mean of the precision and recall, it gives a more balanced picture of how our model performs. A score of 1 indicates perfect precision and recall. </td></tr>
</tbody>
</table>
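<p>In terms of confusion-matrix counts, each of these metrics is a simple ratio. As a sketch (not the library's implementation), assuming <code>y</code> holds the actual classes and <code>y_hat</code> the predicted classes as 0/1 vectors:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Hand-computed metrics from confusion-matrix counts,
// assuming 'y' (actual) and 'y_hat' (predicted) are 0/1 vectors
tp = sumc(y_hat .== 1 .and y .== 1);   // True positives
fp = sumc(y_hat .== 1 .and y .== 0);   // False positives
fn = sumc(y_hat .== 0 .and y .== 1);   // False negatives
tn = sumc(y_hat .== 0 .and y .== 0);   // True negatives

accuracy = (tp + tn) / (tp + fp + fn + tn);
prec = tp / (tp + fp);                 // Precision
rec = tp / (tp + fn);                  // Recall
f1 = 2 * prec * rec / (prec + rec);    // Harmonic mean of the two</code></pre>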
<p>We'll keep these in mind as we fit and test our model.</p>
<h2 id="logistic-regression-model-fitting">Logistic Regression Model Fitting</h2>
<p>We're now ready to begin fitting our models. To start, we will prepare our data by:</p>
<p>Creating training and testing datasets using <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a>. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Split data into 70% training and 30% test set
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);</code></pre>
<p>Scaling our data using <a href="https://docs.aptech.com/gauss/rescale.html" target="_blank" rel="noopener"><code>rescale</code></a>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Data rescaling
*/
// Number of variables to rescale
numeric_vars = 4;

// Rescale training data
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");

// Rescale test data using same scaling factors as x_train
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);</code></pre>
<p>Unlike <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener">Random Forest models</a>, logistic regression models are sensitive to large differences in the scale of the variables. Standardizing the variables, as we do here, is a sensible default, though not necessarily the best option in every case.</p>
<p>As you can see above, we compute the mean and standard deviation from the training set and use those parameters to scale the test set. This is important. </p>
<p>The purpose of our test set is to give us an estimate of how our model will do on unseen data. Using the mean and standard deviation of the entire dataset, computed before the train/test split, would allow information from the test set to &quot;leak&quot; into our model. Information leakage is beyond the scope of this blog post, but in general the test set should be treated as information that is not available until after the model fit is complete.</p>
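<p>To make the contrast concrete, here is the leaky pattern next to the correct one used above. This is a sketch reusing the variables from this section; <code>x_all</code>, <code>mu_bad</code>, and <code>sd_bad</code> are illustrative names.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Leaky: scaling parameters computed from the full dataset,
// so the test rows influence the transform later applied to them
{ x_all, mu_bad, sd_bad } = rescale(X[.,1:numeric_vars], "standardize");

// Correct: parameters come from the training rows only and are
// then reused, unchanged, on the test rows
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);</code></pre>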
<p>Now we're ready to start fitting our models.</p>
<h3 id="case-one-logistic-regression-without-regularization">Case One: Logistic Regression Without Regularization</h3>
<p>As a base case, we'll consider a logistic regression model without any regularization. For this case, we'll use all default settings, so our only inputs are the dependent and independent data. </p>
<p>Using our training data we will:</p>
<ol>
<li>Train our model using <code>logisticRegFit</code>.</li>
<li>Make predictions on our training data using <a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener"><code>lmPredict</code></a>. </li>
<li>Evaluate our training model predictions using <a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener"><code>classificationMetrics</code></a>. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*************************************
** Base case model
** No regularization
*************************************/

/*
** Training
*/
// Declare 'lr_mdl' to be 
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train);

// Check training set performance
y_hat_train = lmPredict(lr_mdl, X_train);

// Model evaluations
print "Training Metrics";
call classificationMetrics(y_train, y_hat_train);</code></pre>
<p>The <code>classificationMetrics</code> procedure prints an evaluation table:</p>
<pre>No regularization
Training Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.93    41102
              satisfied        0.90    0.91      0.90    31631

              Macro avg        0.91    0.92      0.91    72733
           Weighted avg        0.92    0.92      0.92    72733

               Accuracy                          0.92    72733</pre>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Testing
*/
// Make predictions on the test set, from our trained model
y_hat_test = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
print "Testing Metrics";
call classificationMetrics(y_test, y_hat_test);</code></pre>
<p>This code prints the following to screen:</p>
<pre>Testing Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.92    17777
              satisfied        0.90    0.91      0.90    13394

              Macro avg        0.91    0.91      0.91    31171
           Weighted avg        0.91    0.91      0.91    31171

               Accuracy                          0.91    31171</pre>
<p>Comparing performance on our training and testing data yields some encouraging observations:</p>
<ul>
<li>First, there is little difference in accuracy across our training and testing datasets, with a training accuracy of 0.92 and a testing accuracy of 0.91.</li>
<li>Our model achieves the same average F1-score, a more balanced measure of performance, on both datasets. </li>
</ul>
<p>Why is this important? <b>This comparison provides a good indication that we aren't overfitting our training set.</b> Since the main purpose of regularization is to address overfitting the model to the training data, we don't have much reason to use it. However, for demonstration purposes, we'll show how to implement L2 regularization.</p>
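<p>Before we do, it helps to see what the penalty does. Schematically, L2 regularization adds a squared penalty on the coefficients to the usual logistic regression objective (the exact internal scaling used by <code>logisticRegFit</code> may differ from this sketch):</p>
<pre>minimize over b:   -sum_i [ y_i log(p_i) + (1 - y_i) log(1 - p_i) ]  +  lambda * sum_j b_j^2

where p_i = 1 / (1 + exp(-x_i'b)) and lambda is the penalty weight</pre>
<p>Larger values of lambda shrink the coefficients more aggressively, trading some fit on the training data for less variance on unseen data.</p>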
<h3 id="case-two-logistic-regression-with-l2-regularization">Case Two: Logistic Regression With L2 Regularization</h3>
<p>To implement regularization with <code>logisticRegFit</code>, we'll use a <a href="https://www.aptech.com/resources/tutorials/a-gentle-introduction-to-using-structures/" target="_blank" rel="noopener"><code>logisticRegControl</code></a> structure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*************************************
** L2 Regularization
*************************************/

/*
** Training
*/
// Declare 'lrc' to be a logisticRegControl
// structure and fill with default settings 
struct logisticRegControl lrc;
lrc = logisticRegControlCreate();

// Set L2 regularization parameter
lrc.l2 = 0.05;

// Declare 'lr_mdl' to be 
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train, lrc);

/*
** Testing
*/
// Make predictions on the test set
y_hat_l2 = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
call classificationMetrics(y_test, y_hat_l2);</code></pre>
<p>The classification metrics are printed:</p>
<pre>L2 regularization
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.89    0.93      0.91    17777
              satisfied        0.90    0.84      0.87    13394

              Macro avg        0.90    0.89      0.89    31171
           Weighted avg        0.89    0.89      0.89    31171

               Accuracy                          0.89    31171</pre>
<p>Note that with the L2 penalty, our model performance drops from the base case, with lower accuracy (0.89) and a lower average F1-score (0.89). This isn't surprising, given that we found no evidence of overfitting in our model. </p>
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog, we've looked at logistic regression and regularization. </p>
<p>Using a real-world airline passenger satisfaction data application we've:</p>
<ol>
<li>Performed preliminary data preparation and setup.</li>
<li>Trained logistic regression models with and without regularization. </li>
<li>Made classification predictions. </li>
<li>Interpreted classification metrics. </li>
</ol>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a>  </li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/classification-with-regularized-logistic-regression/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Machine Learning With Real-World Data</title>
		<link>https://www.aptech.com/blog/machine-learning-with-real-world-data/</link>
					<comments>https://www.aptech.com/blog/machine-learning-with-real-world-data/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 16 May 2023 03:38:45 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583790</guid>

					<description><![CDATA[If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions. 

In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:
<ul>
<li>Data Exploration and cleaning.</li>
<li>Splitting data for training and testing. </li>
<li>Model fitting and prediction. </li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions. </p>
<p>In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:</p>
<ul>
<li>Data Exploration and cleaning.</li>
<li>Splitting data for training and testing. </li>
<li>Model fitting and prediction. </li>
<li>Basic feature engineering.</li>
</ul>
<h2 id="background">Background</h2>
<h3 id="our-data">Our Data</h3>
<p>Today we will be working with the <a href="https://www.kaggle.com/datasets/camnugent/california-housing-prices" target="_blank" rel="noopener">California Housing Dataset from Kaggle</a>. </p>
<p>This dataset is built from 1990 Census data. Though it is an older dataset, it is a great demonstration dataset and has been popular in many machine learning examples.</p>
<p>The dataset contains 10 variables measured in California at the block group level:</p>
<table>
<tbody>
<tr><th>Variable</th><th>Description</th></tr>
<tr><td>longitude</td><td>Measure of how far west a house is.</td></tr>
<tr><td>latitude</td><td>Measure of how far north a house is.</td></tr>
<tr><td>housing_median_age</td><td>Median age of a house within a block.</td></tr>
<tr><td>total_rooms</td><td>Total number of rooms within a block.</td></tr>
<tr><td>total_bedrooms</td><td>Total number of bedrooms within a block.</td></tr>
<tr><td>population</td><td>Total number of people residing within a block.</td></tr>
<tr><td>households</td><td>Total number of households, a group of people residing within a home unit, for a block.</td></tr>
<tr><td>median_income</td><td>Median income for households within a block of houses (measured in tens of thousands of US Dollars).</td></tr>
<tr><td>median_house_value</td><td>Median house value for households within a block.</td></tr>
<tr><td>ocean_proximity</td><td>Location of the house w.r.t ocean/sea.</td></tr>
</tbody>
</table>
<h3 id="gauss-machine-learning">GAUSS Machine Learning</h3>
<p>We will use the new <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning (GML)</a> library. It provides accessible, easy-to-use tools for implementing fundamental machine learning models. </p>
<p>To access these tools, we need to load the library: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Clear workspace and load library
new;
library gml;

// Set random seed
rndseed 8906876;</code></pre>
<div class="alert alert-info" role="alert">Note we also set the random seed to allow for replication.</div>
<h2 id="data-exploration-and-cleaning">Data Exploration and Cleaning</h2>
<p>With GML loaded, we are now ready to import and <a href="https://www.aptech.com/blog/preparing-and-cleaning-data-fred-data-in-gauss/" target="_blank" rel="noopener">clean our data</a>. The first step is to use the <a href="https://docs.aptech.com/gauss/loadd.html" target="_blank" rel="noopener"><code>loadd</code></a> procedure to import our data into GAUSS.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Import datafile
*/
load_path = "data/";
fname = "housing.csv";

// Load all variables
housing_data = loadd(load_path $+ fname);</code></pre>
<h3 id="descriptive-statistics">Descriptive Statistics</h3>
<p><a href="https://docs.aptech.com/gauss/data-management/data-exploration.html" target="_blank" rel="noopener">Exploratory data analysis</a> allows us to identify important data anomalies, like outliers and <a href="https://www.aptech.com/blog/introduction-to-handling-missing-values/" target="_blank" rel="noopener">missing values</a>. </p>
<p>Let's start by looking at standard <a href="https://www.aptech.com/resources/tutorials/formula-string-syntax/descriptive-statistics-from-a-dataset/" target="_blank" rel="noopener">descriptive statistics</a> using the <a href="https://docs.aptech.com/gauss/dstatmt.html" target="_blank" rel="noopener"><code>dstatmt</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Find descriptive statistics
// for all variables in housing_data
dstatmt(housing_data);</code></pre>
<p>This prints a summary table of statistics for all variables.</p>
<pre>--------------------------------------------------------------------------------------------------
Variable                  Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
--------------------------------------------------------------------------------------------------
longitude               -119.6       2.004         4.014      -124.3      -114.3     20640    0
latitude                 35.63       2.136         4.562       32.54       41.95     20640    0
housing_median_age       28.64       12.59         158.4           1          52     20640    0
total_rooms               2636        2182     4.759e+06           2   3.932e+04     20640    0
total_bedrooms           537.9       421.4     1.776e+05           1        6445     20433  207
population                1425        1132     1.282e+06           3   3.568e+04     20640    0
households               499.5       382.3     1.462e+05           1        6082     20640    0
median_income            3.871         1.9         3.609      0.4999          15     20640    0
median_house_value   2.069e+05   1.154e+05     1.332e+10     1.5e+04       5e+05     20640    0
ocean_proximity          -----       -----         -----   &lt;1H OCEAN  NEAR OCEAN     20640    0 </pre>
<p>These statistics allow us to quickly identify several data issues that we need to address prior to fitting our model:</p>
<ol>
<li>There are 207 missing observations of the <code>total_bedrooms</code> variable (you may need to scroll to the right of the output). </li>
<li>Many of our variables show potential outliers, with high variance and large ranges. These should be further explored. </li>
</ol>
<h3 id="missing-values">Missing Values</h3>
<p>To get a better idea of how to best deal with the missing values, let's check the descriptive statistics for the observations with and without missing values separately.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Conditional check 
// for missing values
e = housing_data[., "total_bedrooms"] .== miss();

// Get descriptive statistics
// for dataset with missing values
dstatmt(selif(housing_data, e));</code></pre>
<pre>------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum   Valid  Missing
------------------------------------------------------------------------------------------------

longitude              -119.5       2.001         4.006      -124.1      -114.6      207    0
latitude                 35.5       2.097         4.399       32.66       40.92      207    0
housing_median_age      29.27       11.96         143.2           4          52      207    0
total_rooms              2563        1787     3.194e+06         154   1.171e+04      207    0
total_bedrooms          -----       -----         -----        +INF        -INF        0  207
population               1478        1057     1.118e+06          37        7604      207    0
households                510       386.1     1.491e+05          16        3589      207    0
median_income           3.822       1.956         3.824      0.8527          15      207    0
median_house_value   2.06e+05   1.116e+05     1.246e+10    4.58e+04       5e+05      207    0
ocean_proximity         -----       -----         -----   &lt;1H OCEAN  NEAR OCEAN      207    0</pre>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get descriptive statistics
// for dataset without missing values
dstatmt(delif(housing_data, e));</code></pre>
<pre>-------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------

longitude              -119.6       2.004         4.014      -124.3      -114.3     20433    0
latitude                35.63       2.136         4.564       32.54       41.95     20433    0
housing_median_age      28.63       12.59         158.6           1          52     20433    0
total_rooms              2637        2185     4.775e+06           2   3.932e+04     20433    0
total_bedrooms          537.9       421.4     1.776e+05           1        6445     20433    0
population               1425        1133     1.284e+06           3   3.568e+04     20433    0
households              499.4       382.3     1.462e+05           1        6082     20433    0
median_income           3.871       1.899         3.607      0.4999          15     20433    0
median_house_value  2.069e+05   1.154e+05     1.333e+10     1.5e+04       5e+05     20433    0
ocean_proximity         -----       -----         -----   &lt;1H OCEAN  NEAR OCEAN     20433    0 </pre>
<p>From visual inspection, the descriptive statistics for the observations with missing values are very similar to those for the observations without missing values. </p>
<div class="alert alert-info" role="alert">We could do more robust statistical tests to confirm this. However, those are outside of the scope of this blog, and we will rely on our visual inspection.</div>
<p>In addition, the missing values make up less than 1% of the total observations. Given this, we will delete the rows containing missing values, rather than <a href="https://www.aptech.com/blog/introduction-to-handling-missing-values/" target="_blank" rel="noopener">imputing our missing values</a>.</p>
<p>We can delete the rows with missing values using the <a href="https://docs.aptech.com/gauss/packr.html" target="_blank" rel="noopener"><code>packr</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Remove rows with missing values
// from housing_data
housing_data = packr(housing_data);</code></pre>
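<p>Had the missing observations looked systematically different, imputation would have been the better route. For contrast, a sketch of simple mean imputation for <code>total_bedrooms</code> (the alternative we decided against; not run here):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Mean imputation: fill missing values of 'total_bedrooms'
// with the mean of the observed values
tb = housing_data[., "total_bedrooms"];
housing_data[., "total_bedrooms"] = missrv(tb, meanc(packr(tb)));</code></pre>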
<h3 id="outliers">Outliers</h3>
<p>Now that we've removed missing values, let's look for other data outliers. Data visualizations like histograms and box plots are a great way to identify potential outliers.</p>
<p>First, let's create a grid plot of <a href="https://docs.aptech.com/gauss/plothist.html" target="_blank" rel="noopener">histograms</a> for all of our continuous variables:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Data visualizations
*/
// Get variables names
vars = getColNames(housing_data);

// Set up plotControl 
// structure for formatting graphs
struct plotControl plt;
plt = plotGetDefaults("bar");

// Set fonts
plotSetFonts(&amp;plt, "title", "Arial", 14);
plotSetFonts(&amp;plt, "ticks", "Arial", 12);

// Loop through the variables and draw histograms
for i(1, rows(vars)-1, 1);
    plotSetTitle(&amp;plt, vars[i]);
    plotLayout(3, 3, i);
    plotHist(plt, housing_data[., vars[i]], 50);
endfor;</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/histogram_all.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/histogram_all.jpg" alt="Histogram of all variables in our California Housing dataset. " width="1200" height="800" class="aligncenter size-full wp-image-11583804" /></a></p>
<p>From our histograms, it appears that several variables suffer from outliers:</p>
<ul>
<li>The <code>total_rooms</code> variable, with the majority of the data distributed between 0 and 10,000.</li>
<li>The <code>total_bedrooms</code> variable, with the majority of the data distributed between 0 and 2000.</li>
<li>The <code>households</code> variable, with the majority of the data distributed between 0 and 2000.</li>
<li>The <code>population</code> variable, with the majority of the data distributed between 0 and 10,000.</li>
</ul>
<p><a href="https://docs.aptech.com/gauss/plotbox.html" target="_blank" rel="noopener">Box plots</a> of these variables confirm that there are indeed outliers.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">plt = plotGetDefaults("box");

// Set fonts
plotSetFonts(&amp;plt, "title", "Arial", 14);
plotSetFonts(&amp;plt, "ticks", "Arial", 12);

string box_vars = { "total_rooms", "total_bedrooms", "households", "population" };

// Loop through the variables and draw boxplots
for i(1, rows(box_vars), 1);
    plotLayout(2, 2, i);
    plotBox(plt, box_vars[i], housing_data[., box_vars[i]]);
endfor;</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_outliers.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_outliers.jpg" alt="" width="1200" height="800" class="aligncenter size-full wp-image-11583807" /></a></p>
<p>Let's <a href="https://docs.aptech.com/gauss/data-management/data-cleaning.html#filtering-observations-of-a-dataframe" target="_blank" rel="noopener">filter the data</a> to eliminate these outliers:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Filter to remove outliers
**
** Delete:
**    - total_rooms greater than or equal to 10000
**    - total_bedrooms greater than or equal to 2000
**    - households greater than or equal to 2000
**    - population greater than or equal to 6000
*/
mask = housing_data[., "total_rooms"] .&gt;= 10000;
mask = mask .or housing_data[., "total_bedrooms"] .&gt;= 2000;
mask = mask .or housing_data[., "households"] .&gt;= 2000;
mask = mask .or housing_data[., "population"] .&gt;= 6000;

housing_data = delif(housing_data, mask);</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_no_outliers.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/boxplot_no_outliers.jpg" alt="" width="1200" height="800" class="aligncenter size-full wp-image-11583805" /></a></p>
<div class="alert alert-info" role="alert">Note that we've taken a conservative approach to filtering outliers and haven't removed all points identified by the box plots as outliers. </div>
<h3 id="data-truncation">Data Truncation</h3>
<p>The histograms also point to truncation issues with <code>housing_median_age</code> and <code>median_house_value</code>. Let's look into this a little further:</p>
<ol>
<li>We'll confirm that these are the most frequently occurring observations using <a href="https://docs.aptech.com/gauss/modec.html" target="_blank" rel="noopener"><code>modec</code></a>. This provides evidence for our suspicion that these are truncation points.</li>
<li>We'll count the number of observations at these locations.</li>
</ol>
<div class="alert alert-info" role="alert">Remember that we've already filtered our outliers, so we're looking at a subset of our original data.</div>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// House value
mode_value = modec(housing_data[., "median_house_value"]);
print "Most frequent median_house_value:" mode_value;

print "Counts:";
sumc(housing_data[., "median_house_value"] .== mode_value);

// House age
mode_age = modec(housing_data[., "housing_median_age"]);
print "Most frequent housing_median_age:" mode_age;

print "Counts:";
sumc(housing_data[., "housing_median_age"] .== mode_age);</code></pre>
<div class="alert alert-info" role="alert">We use <code>modec</code> because from our histogram we can't identify for that these points occur at the maximum. It makes sense to assume that they do but we can't be certain. </div>
<pre>Most frequent median_house_value:
       500001.00
Counts:
       935.00000
Most frequent housing_median_age:
       52.000000
Counts:
       1262.0000</pre>
<p>Together, these observations make up about 10% of the total. Because we have no further information about what is occurring at these points, let's remove them from the dataset.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create binary vector with a 1 if either
// 'housing_median_age' or 'median_house_value'
// equal their mode value.
mask = (housing_data[., "housing_median_age"] .== mode_age)
       .or (housing_data[., "median_house_value"] .== mode_value);

// Delete the rows if they meet our above criteria
housing_data = delif(housing_data, mask);</code></pre>
<h3 id="feature-modifications">Feature Modifications</h3>
<p>Our final data cleaning step is to make feature modifications including:</p>
<ol>
<li>Rescaling the <code>median_house_value</code> variable to be measured in tens of thousands of US dollars (the same scale as <code>median_income</code>).</li>
<li>Generating dummy variables to account for the categories of <code>ocean_proximity</code>.</li>
</ol>
<p>First, we rescale the <code>median_house_value</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Rescale median house value variable
housing_data[., "median_house_value"] = 
    housing_data[., "median_house_value"] ./ 10000;</code></pre>
<p>Next we generate <a href="https://www.aptech.com/blog/how-to-create-dummy-variables-in-gauss/" target="_blank" rel="noopener">dummy variables</a> for <code>ocean_proximity</code>. </p>
<p>Let's get a feel for our <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical data</a> using the <a href="https://docs.aptech.com/gauss/frequency.html" target="_blank" rel="noopener"><code>frequency</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check frequency of
// ocean_proximity categories
frequency(housing_data, "ocean_proximity");</code></pre>
<p>This prints a convenient frequency table:</p>
<pre>     Label      Count   Total %    Cum. %
 &lt;1H OCEAN       8095     44.89     44.89
    INLAND       6136     34.03     78.93
    ISLAND          2   0.01109     78.94
  NEAR BAY       1525     8.458     87.39
NEAR OCEAN       2273     12.61       100
     Total      18031       100         </pre>
<p>We can see from this table that the <code>ISLAND</code> category contains only two observations. We'll exclude it from our modeling dataset.</p>
<p>Now let's create our dummy variables using the <a href="https://docs.aptech.com/gauss/onehot.html" target="_blank" rel="noopener"><code>oneHot</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Generate dummy variables for 
** the ocean_proximity using
** one hot encoding
*/
dummy_matrix = oneHot(housing_data[., "ocean_proximity"]);</code></pre>
<p>Finally, we'll save our modeling dataset in a GAUSS <a href="https://www.aptech.com/blog/gauss23/#first-class-dataframe-storage" target="_blank" rel="noopener">.gdat</a> file using <a href="https://docs.aptech.com/gauss/saved.html" target="_blank" rel="noopener"><code>saved</code></a> so we can directly access our clean data in the future:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Build matrix of features
** Note we exclude: 
**     - ISLAND dummy variable
**     - Original ocean_proximity variable
*/
model_data = delcols(housing_data, "ocean_proximity") ~ 
    delcols(dummy_matrix, "ocean_proximity_ISLAND");

// Saved data matrix
saved(model_data, load_path $+ "/model_data.gdat");</code></pre>
<h2 id="data-splitting">Data Splitting</h2>
<p>In machine learning, it's customary to use separate datasets to fit the model and to evaluate model performance. Since the objective of machine learning models is to provide predictions for unseen data, using a testing set provides a more realistic measure of how our model will perform. </p>
<div class="alert alert-info" role="alert">Cross-validation is an additional tool for evaluating model performance. To learn more about cross-validation, see our previous blog, <a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">&quot;Understanding Cross-Validation&quot;</a>.</div>
<p>To prepare our data for training and testing, we're going to take two steps:</p>
<ol>
<li>Separate our target variable, <code>median_house_value</code>, and feature set.</li>
<li>Split our data into a 70% training and 30% testing dataset using <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a>.</li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">new;
library gml;
rndseed 896876;

/*
** Load datafile
*/
load_path = "data/";
fname = "model_data.gdat";

// Load data
housing_data = loadd(load_path $+ fname);

/*
** Feature management
*/
// Separate dependent and independent data
y = housing_data[., "median_house_value"];
X = delcols(housing_data, "median_house_value");

// Split into 70% training data 
// and 30% testing data
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);</code></pre>
<h2 id="fitting-our-model">Fitting Our Model</h2>
<p>Now that we've completed our data cleaning, we're finally ready to fit our model. Today we'll use a LASSO <a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">regression model</a> to predict our target variable. LASSO is a form of regularization that has found relative success in <a href="https://www.aptech.com/industry-solutions/econometrics/" target="_blank" rel="noopener">economic</a> and <a href="https://www.aptech.com/industry-solutions/finance/" target="_blank" rel="noopener">financial modeling</a>. It offers a data-driven approach to dealing with high-dimensionality in linear models. </p>
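<p>Concretely, LASSO augments the least squares loss with an $\ell_1$ penalty on the coefficients (written here in one common scaling; implementations may scale the loss term differently):</p>
<p>$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i'\beta\right)^2 + \lambda\sum_{j=1}^{p}\left|\beta_j\right|$$</p>
<p>Larger values of the tuning parameter $\lambda$ shrink the coefficients toward zero and can set some exactly to zero, which is what gives LASSO its variable selection behavior.</p>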
<h3 id="model-fitting">Model Fitting</h3>
<p>To fit the LASSO model to our target variable, <code>median_house_value</code>, we'll use <a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener"><code>lassoFit</code></a> from the GAUSS Machine Learning library.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** LASSO Model
*/
// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);</code></pre>
<p>The <code>lassoFit</code> procedure prints a model description and results:</p>
<pre>==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    12
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.347     -1.013   -0.02555
                  latitude     -2.192    -0.9269          0
        housing_median_age    0.07189    0.06384    0.03977
               total_rooms  -0.001004          0          0
            total_bedrooms    0.01165   0.006107   0.004828
                population  -0.004317  -0.003396  -0.001232
                households   0.006808   0.005119          0
             median_income      3.872      3.569      3.457
 ocean_proximity__1H OCEAN     -5.509          0          0
    ocean_proximity_INLAND     -9.437     -5.639     -6.575
  ocean_proximity_NEAR BAY     -7.083    -0.6395          0
ocean_proximity_NEAR OCEAN     -5.198     0.6378     0.6981
                    CONST.     -193.5     -82.98      3.451
===========================================================
                        DF         12         10          7
              Training MSE       33.7       34.7       37.4</pre>
<p>The results highlight the variable selection function of LASSO. With $\lambda = 0$, equivalent to a full least squares model, all features are represented in the model. When we get to $\lambda = 0.3$, the LASSO regression removes 5 of our 12 variables: </p>
<ul>
<li><code>latitude</code></li>
<li><code>total_rooms</code></li>
<li><code>households</code></li>
<li><code>ocean_proximity__1H OCEAN</code></li>
<li><code>ocean_proximity_NEAR BAY</code></li>
</ul>
<p>As we would expect, <code>median_income</code> has a large positive impact. However, there are a few noteworthy observations about the coefficients for the location-related variables.</p>
<p>As we add more regularization to the model by increasing the value of $\lambda$, <code>ocean_proximity__1H OCEAN</code> and <code>ocean_proximity_NEAR BAY</code> are removed from the model, but the effect of <code>ocean_proximity_INLAND</code> increases substantially. <code>latitude</code> is also removed from the model. This may be because much of the information in these variables is already captured by the remaining location dummy variables and <code>median_income</code>.</p>
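<p>The exact zeros in the coefficient table are a direct consequence of the $\ell_1$ penalty. In the special case of orthonormal features, the LASSO solution is a soft-thresholded version of the least squares estimate:</p>
<p>$$\hat{\beta}_j^{\text{lasso}} = \operatorname{sign}\left(\hat{\beta}_j^{\,\text{OLS}}\right)\left(\left|\hat{\beta}_j^{\,\text{OLS}}\right| - \lambda\right)_{+}$$</p>
<p>so any coefficient whose least squares magnitude falls below the threshold is set exactly to zero. With correlated features, as we have here, the algebra is less clean, but the same shrink-and-select mechanism is at work.</p>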
<h3 id="prediction">Prediction</h3>
<p>We can now test our model's prediction capability on the testing data using <a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener"><code>lmPredict</code></a>: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;</code></pre>
<pre>Testing MSE

       33.814993
       34.726144
       37.199771</pre>
<p>As expected, the testing MSE is slightly above the training MSE for the first two values of $\lambda$, and for the model with the highest $\lambda$ it is actually lower than the training MSE. This suggests that our model is not overfitting.</p>
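<p>For reference, the mean squared error is</p>
<p>$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$</p>
<p>Since <code>median_house_value</code> is measured in tens of thousands of dollars, a testing MSE near 34 corresponds to a root mean squared error of $\sqrt{34} \approx 5.8$, or roughly 58,000 US dollars of home value.</p>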
<h2 id="feature-engineering">Feature Engineering</h2>
<p>Since our model is not overfitting, we can consider adding more variables. We could collect additional variables, but it's likely that our current data holds more information that we can make accessible to our estimator. Instead, we'll create new features from combinations of our existing features. This process, called feature engineering, can make substantial contributions to your machine learning models.</p>
<p>We will start by generating per capita variables for <code>total_rooms</code>, <code>total_bedrooms</code>, and <code>households</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Create per capita variables
** using population
*/
pc_data = housing_data[., "total_rooms" "total_bedrooms" "households"] 
    ./ housing_data[., "population"];

// Convert to a dataframe and add variable names
pc_data = asdf(pc_data, "rooms_pc"$|"bedrooms_pc"$|"households_pc");</code></pre>
<p>Next we will create a variable representing the percentage of <code>total_rooms</code> made up by <code>total_bedrooms</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">beds_per_room = X[.,"total_bedrooms"] ./ X[.,"total_rooms"];</code></pre>
<p>and add these columns to <code>X</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">X = X ~ pc_data ~ asdf(beds_per_room, "beds_per_room");</code></pre>
<h3 id="fit-and-predict-the-new-model">Fit and Predict the New Model</h3>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Reset the random seed so we get the
// same test and train splits as our previous model
rndseed 896876;

// Split our new X into train and test splits
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);

// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;</code></pre>
<pre>==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    16
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.495     -1.008          0
                  latitude      -2.36    -0.9354          0
        housing_median_age     0.0808    0.07167    0.04316
               total_rooms -0.0001714          0          0
            total_bedrooms   0.005301   0.001517  0.0008104
                population -0.0004661          0          0
                households  -0.001611          0          0
             median_income      3.947      4.011      3.675
 ocean_proximity__1H OCEAN     -5.171          0          0
    ocean_proximity_INLAND     -8.635     -4.963     -6.235
  ocean_proximity_NEAR BAY     -6.966     -0.875          0
ocean_proximity_NEAR OCEAN     -5.219     0.2927     0.1798
                  rooms_pc      2.678     0.1104          0
               bedrooms_pc     -11.68          0          0
             households_pc      22.23      21.47      20.23
             beds_per_room      33.03      17.03      8.029
                    CONST.     -221.9     -95.55     -3.059
===========================================================
                        DF         16         11          7
              Training MSE       31.6       32.5       34.3
Testing MSE

       31.505169
       32.457936
       34.155290 </pre>
<p>Our train and test MSE have improved for all values of $\lambda$. Of our new variables, <code>households_pc</code> and <code>beds_per_room</code> seem to have the strongest effects.</p>
<h2 id="extensions">Extensions</h2>
<p>We used a linear regression model, LASSO, for modeling home values. This choice was somewhat ad hoc, and there are a number of alternatives and extensions that could help improve our predictions. </p>
<p>For example, we could:</p>
<ul>
<li>Use <a href="https://docs.aptech.com/gauss/kmeansfit.html" target="_blank" rel="noopener">clustering</a> or <a href="https://docs.aptech.com/gauss/knnfit.html" target="_blank" rel="noopener">K-nearest neighbors</a> to capture more location information.</li>
<li>Use <a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">principal component analysis</a> to capture the variation in our features, then estimate the linear relationship between the median home values and the principal components. </li>
<li>Use a <a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener">random forest model</a>, which generally provides good accuracy for tabular datasets. </li>
<li>Split the home values into bins and perform <a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">classification</a>, rather than regression.</li>
</ul>
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog we've seen the important role that data exploration and cleaning plays in developing a machine learning model. Rarely do we obtain data that we can plug directly into our models. It's best practice to make time for data exploration and cleaning, because any machine learning model is only as reliable as its data. </p>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a></li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a></li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a></li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a></li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a></li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
</li>
</ol>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/machine-learning-with-real-world-data/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Understanding Cross-Validation</title>
		<link>https://www.aptech.com/blog/understanding-cross-validation/</link>
					<comments>https://www.aptech.com/blog/understanding-cross-validation/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 02 May 2023 13:08:47 +0000</pubDate>
				<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583747</guid>

					<description><![CDATA[If you've explored machine learning models, you've most likely encountered the term "cross-validation" at some point. Cross-validation is an important step for training robust and reliable machine learning models. 

In this blog, we'll break cross-validation into simple terms. Using a practical demonstration, we'll equip you with the knowledge to confidently use cross-validation in your machine learning projects. ]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>If you've explored machine learning models, you've probably come across the term &quot;cross-validation&quot; at some point. But what exactly is it, and why is it important? </p>
<p>In this blog, we'll break cross-validation into simple terms. With a practical demonstration, we'll equip you with the knowledge to confidently use cross-validation in your machine learning projects. </p>
<h2 id="model-validation-in-machine-learning">Model Validation in Machine Learning</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/05/Blank-diagram-2-1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/05/Blank-diagram-2-1.jpg" alt="Model validation and cross validation using testing and training datasets for machine learning models." width="841" height="716" class="aligncenter size-full wp-image-11583787" /></a></p>
<p>Machine learning validation methods provide a means for us to estimate generalization error. This is crucial for determining which model provides the best predictions for unobserved data.</p>
<p>In cases where large amounts of data are available, machine learning data validation begins with splitting the data into three separate datasets:</p>
<ul>
<li>A training set is used to train the machine learning model(s) during development.  </li>
<li>A validation set is used to estimate the generalization error of the model created from the training set for the purpose of model selection.  </li>
<li>A test set is used to estimate the generalization error of the final model.  </li>
</ul>
<h2 id="cross-validation-in-machine-learning">Cross-Validation in Machine Learning</h2>
<p>The model validation process in the previous section works when we have large datasets. When data is limited, we must instead use a technique called cross-validation. </p>
<p><b>The purpose of cross-validation is to provide a better estimate of a model's ability to perform on unseen data.</b> It provides an unbiased estimate of the generalization error, especially in the case of limited data. </p>
<p>There are many reasons we may want to do this:</p>
<ul>
<li>To have a clearer measure of how our model performs. </li>
<li>To tune hyperparameters. </li>
<li>To make model selections. </li>
</ul>
<p>The intuition behind cross-validation is simple - rather than training our models on one training set we train our model on multiple subsets of data. </p>
<p>The basic steps of cross-validation are:</p>
<ol>
<li>Split data into portions. </li>
<li>Train our model on a subset of the portions.</li>
<li>Test our model on the remaining subsets of the data. </li>
<li>Repeat steps 2-3 until the model has been trained and tested on the entire dataset.</li>
<li>Average the model performance across all iterations of testing to get the total model performance.</li>
</ol>
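<p>For a performance measure such as MSE, the final averaging step amounts to computing</p>
<p>$$\text{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\text{MSE}_i$$</p>
<p>where $\text{MSE}_i$ is the error measured on the $i$-th held-out portion of the data.</p>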
<h3 id="common-cross-validation-methods">Common Cross-Validation Methods</h3>
<p>Though the basic concept of cross-validation is fairly simple, there are a number of ways to go about each step. A few examples of cross-validation methods include:</p>
<ol>
<li>
<p><b> k-Fold Cross-Validation</b><br />
In k-fold cross-validation:</p>
<ul>
<li>The dataset is divided into k equal-sized folds. </li>
<li>The model is trained on k-1 folds and tested on the remaining fold. </li>
<li>The process is repeated k times, with each fold serving as the test set exactly once. </li>
<li>The performance metrics are averaged over the k iterations. </li>
</ul>
</li>
<li>
<p><b> Stratified k-Fold Cross-Validation</b><br />
This process is similar to k-fold cross-validation with minor but important exceptions:</p>
<ul>
<li>The class distribution in each fold is preserved. </li>
<li>It is useful for imbalanced datasets.</li>
</ul>
</li>
<li>
<p><b>Leave-One-Out Cross-Validation</b><br />
The Leave-one-out cross-validation process:</p>
<ul>
<li>Trains the model using all data observations except one. </li>
<li>Tests the data using the unused data point. </li>
<li>Repeats this for <em>n</em> iterations until each data point is used exactly once as a test set. </li>
</ul>
</li>
<li><b>Time-Series Cross-Validation</b><br />
This cross-validation method, designed specifically for time-series:
<ul>
<li>Splits the data into training and testing sets in a chronologically ordered manner, such as sliding or expanding windows. </li>
<li>Trains the model on past data and tests the model on future data, based on the splitting point. </li>
</ul></li>
</ol>
<table>
 <thead>
<tr><th>Method</th><th>Advantages</th><th>Disadvantages</th></tr>
</thead>
<tbody>
<tr><th>k-Fold Cross-Validation</th><td><ul><li>Provides a good estimate of the model's performance by using all the data for both training and testing.</li><li>Reduces the variance in performance estimates compared to other methods.</li></ul></td><td><ul><li>Can be computationally expensive, especially for large datasets or complex models.</li><li>May not work well for imbalanced datasets or when there is a specific order to the data.</li></ul></td></tr>
<tr><th>Stratified k-Fold Cross-Validation</th><td><ul><li>Ensures that each fold has a representative distribution of classes, which can improve performance estimates for imbalanced datasets.</li><li>Reduces the variance in performance estimates compared to other methods.</li></ul></td><td><ul><li>Can still be computationally expensive, especially for large datasets or complex models.</li><li>May not be necessary for balanced datasets where class distribution is already even.</li></ul></td></tr>
<tr><th>Leave-One-Out Cross-Validation (LOOCV)</th><td><ul><li>Provides the least biased estimate of the model's performance, as the model is tested on every data point.</li><li>Can be useful when dealing with very limited data.</li></ul></td><td><ul><li>Can be computationally expensive, as it requires training and testing the model n times.</li><li>May have high variance in performance estimates, due to the small size in the test set.</li></ul></td></tr>
<tr><th>Time Series Cross-Validation</th><td><ul><li>Accounts for temporal dependencies in time series data.</li><li>Provides a realistic estimate of the model's performance in real-world scenarios.</li></ul></td><td><ul><li>May not be applicable for non-time series data.</li><li>Can be sensitive to the choice of window size and data splitting strategy.</li></ul></td></tr>
</tbody>
</table>
<h2 id="k-fold-cross-validation-example">k-Fold Cross-Validation Example</h2>
<p>Let's look at k-fold cross-validation in action, using the wine quality dataset included in the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning (GML) library</a>. This file is based on the <a href="https://www.kaggle.com/datasets/yasserh/wine-quality-dataset" target="_blank" rel="noopener">Kaggle Wine Quality dataset</a>. </p>
<p>Our objective is to classify wines into quality categories using 11 characteristics:</p>
<ul>
<li>Fixed acidity.</li>
<li>Volatile acidity.</li>
<li>Citric acid. </li>
<li>Residual sugar.</li>
<li>Chlorides.</li>
<li>Free sulfur dioxide. </li>
<li>Total sulfur dioxide. </li>
<li>Density.</li>
<li>pH.</li>
<li>Sulphates.</li>
<li>Alcohol.</li>
</ul>
<p>We'll use k-fold cross-validation to examine the performance of a random forest classification model. </p>
<h3 id="data-loading-and-organization">Data Loading and Organization</h3>
<p>First we will load our data directly from the GML library:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load data and prepare data
*/
// Filename
fname = getGAUSSHome("pkgs/gml/examples/winequality.csv");

// Load wine quality dataset
dataset = loadd(fname);</code></pre>
<p>After loading the data, we need to shuffle the data and extract our dependent and independent variables. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Enable repeatable sampling
rndseed 754931;

// Shuffle the dataset (sample without replacement),
// because cvSplit does not shuffle.
dataset = sampleData(dataset, rows(dataset));

y = dataset[.,"quality"];
X = delcols(dataset, "quality");</code></pre>
<div class="alert alert-info" role="alert">Data shuffling is not always necessary. However, we found that without shuffling, some folds did not contain a complete representation of the classes. This suggests that our data might also be a good candidate for stratified k-fold cross-validation.</div>
<h3 id="setting-random-forest-hyperparameters">Setting Random Forest Hyperparameters</h3>
<p>After loading our data, we will set the <a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">random forest hyperparameters</a> using the <code>dfControl</code> structure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Enable GML library functions
library gml;

/*
** Model settings
*/
// The dfModel structure holds the trained model
struct dfModel dfm;

// Declare 'dfc' to be a dfControl
// structure and fill with default settings
struct dfControl dfc;
dfc = dfControlCreate();

// Create 200 decision trees
dfc.numTrees = 200;

// Stop splitting if impurity at
// a node is less than 0.15
dfc.impurityThreshold = 0.15;

// Only consider 2 features per split
dfc.featuresPerSplit = 2;</code></pre>
<div class="alert alert-info" role="alert">Today's focus is not on how to pick these hyperparameters. For more information on hyperparameter tuning, see our previous <a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">blog</a>.</div>
<h3 id="k-fold-cross-validation">k-fold Cross-Validation</h3>
<p>Now that we have loaded our data and set our hyperparameters, we are ready to fit our random forest model and implement k-fold cross-validation. </p>
<p>First, we set up the number of folds and pre-allocate a storage vector for model accuracy.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify number of folds
// This generally is 5-10
nfolds = 5;

// Pre-allocate vector to hold the results
accuracy = zeros(nfolds, 1);</code></pre>
<p>Next we use a GAUSS <a href="https://docs.aptech.com/gauss/for.html" target="_blank" rel="noopener"><code>for</code> loop</a> to complete four steps:</p>
<ol>
<li>Select testing and training data from our folds using the <a href="https://docs.aptech.com/gauss/cvsplit.html" target="_blank" rel="noopener"><code>cvSplit</code></a> procedure.  </li>
<li>Fit our random forest classification model on the chosen training data using the <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener"><code>decForestCFit</code></a> procedure.</li>
<li>Make classification predictions using the chosen testing data and the <a href="https://docs.aptech.com/gauss/decforestpredict.html" target="_blank" rel="noopener"><code>decForestPredict</code></a> procedure.</li>
<li>Compute and store model accuracy for each iteration. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">for i(1, nfolds, 1);
    { y_train, y_test, X_train, X_test } = cvSplit(y, X, nfolds, i);

    // Fit model using this fold's training data
    dfm = decForestCFit(y_train, X_train, dfc);

    // Make predictions using this fold's test data
    predictions = decForestPredict(dfm, X_test);

    accuracy[i] = meanc(y_test .== predictions);
endfor;</code></pre>
<h3 id="results">Results</h3>
<p>Let's print the accuracy results and the total model accuracy:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Print Results
*/
sprintf("%7s %10s", "Fold", "Accuracy");;
sprintf("%7d %10.2f", seqa(1,1,nfolds), accuracy);
sprintf("Total model accuracy           : %10.2f", meanc(accuracy));
sprintf("Accuracy variation across folds: %10.3f", stdc(accuracy));</code></pre>
<pre>   Fold   Accuracy
      1       0.70
      2       0.73
      3       0.65
      4       0.71
      5       0.71
Total model accuracy           :       0.70
Accuracy variation across folds:      0.028</pre>
<p>Our results provide some important insights into why we conduct cross-validation:</p>
<ul>
<li>The model accuracy is different across folds, with a standard deviation of 0.028. </li>
<li>The maximum accuracy, using fold 2, is 0.73.</li>
<li>The minimum accuracy, using fold 3, is 0.65.</li>
</ul>
<p>Depending on how we split our testing and training, we could get a different picture of model performance. </p>
<p>The total model accuracy, at 0.70, gives a better overall measure of model performance. The standard deviation of the accuracy gives us some insight into how much our prediction accuracy might vary.</p>
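<p>As a quick cross-check of the aggregation step, here is a small Python/NumPy sketch (illustrative only, not part of the GAUSS workflow above) that recomputes the total accuracy and fold-to-fold spread from the rounded per-fold accuracies in the results table. Computed from these rounded values, the spread comes out near 0.030 rather than the reported 0.028, presumably because the report uses the unrounded fold accuracies.</p>

```python
import numpy as np

# Rounded per-fold accuracies from the results table above
accuracy = np.array([0.70, 0.73, 0.65, 0.71, 0.71])

# Total model accuracy: the mean across folds
total_accuracy = accuracy.mean()

# Spread across folds: sample standard deviation (ddof=1),
# matching GAUSS's stdc, which divides by n - 1
spread = accuracy.std(ddof=1)

print(f"Total model accuracy: {total_accuracy:.2f}")
print(f"Accuracy variation across folds: {spread:.3f}")
```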
<h3 id="conclusion">Conclusion</h3>
<p>If you're looking to improve the accuracy and reliability of your statistical analysis, cross-validation is a crucial technique to learn. In today's blog we've provided a guide to getting started with cross-validation. </p>
<p>Our step-by-step practical demonstration using GAUSS should prepare you to confidently implement cross-validation in your own data analysis projects. </p>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a></li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a></li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a></li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a></li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a></li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a></li>
</ol>
<h2 id="try-out-gauss-machine-learning">Try Out GAUSS Machine Learning</h2>


]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/understanding-cross-validation/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Fundamentals of Tuning Machine Learning Hyperparameters</title>
		<link>https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/</link>
					<comments>https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Mon, 24 Apr 2023 13:37:58 +0000</pubDate>
				<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583628</guid>

					<description><![CDATA[Machine learning algorithms often rely on hyperparameters that can impact the performance of the models. These hyperparameters are external to the data and are part of the modeling choices that practitioners must make.

An important step in machine learning modeling is optimizing model hyperparameters to improve prediction accuracy.

In today's blog, we will cover some fundamentals of parameter tuning and will look more specifically at fine-tuning our previous decision forest model.]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Machine learning algorithms often rely on hyperparameters that can impact the performance of the models. These hyperparameters are external to the data and are part of the modeling choices that practitioners must make.</p>
<p>An important step in machine learning modeling is optimizing model hyperparameters to improve prediction accuracy.</p>
<p>In today's blog, we will cover some fundamentals of hyperparameter tuning using our previous decision forest, or random forest, model.</p>
<h2 id="model-performance">Model Performance</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/04/bias-variance.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/bias-variance.jpg" alt="" width="800" height="250" class="aligncenter size-full wp-image-11583689" /></a>
Before we consider how to fit the best machine learning model, we need to look at what it means to be the best model. </p>
<p>First, we must keep in mind that the most common goal in machine learning is to create an algorithm that will create accurate predictions based on unseen data. How successful an algorithm is at achieving this goal is reflected in the out-of-sample, or generalization, error.</p>
<p>The error of a machine learning model can be broken into two main categories: bias and variance.</p>
<table>
<tbody>
<tr><th>Bias</th><td>The error that occurs when we fit a simple model to a more complex data-generating process. A model with high bias will underfit the training data as we see in the far left panel of the above plot.</td></tr>
<tr><th>Variance</th><td>The expected prediction error that occurs when we apply our model to a new dataset that the model has not seen. A model with high variance will usually overfit the training data which results in lower training set error, but will lead to higher error on any data not used for training.
</td></tr>
</tbody>
</table>
<p>Because of these two sources of error, fitting machine learning models requires finding the right model complexity without overfitting our training data. </p>
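<p>To make the bias/variance trade-off concrete, here is a small illustrative Python/NumPy sketch (not from the original GAUSS examples; the data-generating process and degrees are made up) that fits polynomials of increasing degree to noisy data and compares training and testing error. Low degrees underfit (high bias); very high degrees keep driving training error down while test error can climb (high variance).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth data-generating process
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

# Hold out every third observation for testing
test_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

results = {}
for degree in (1, 3, 10):
    # Least-squares polynomial fit on the training data only
    poly = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((y_train - poly(x_train)) ** 2)
    test_mse = np.mean((y_test - poly(x_test)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

<p>Training MSE can only fall (or stay flat) as the degree grows, because each higher-degree model nests the lower ones; the test MSE is the number that reveals overfitting.</p>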
<h2 id="model-performance-measures">Model Performance Measures</h2>
<p>There are a number of methods for evaluating the performance of machine learning models. Ultimately, which performance measure is used should be based on business or research objectives. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="common-performance-measures"><span style="color:#FFFFFF">Common Performance Measures</span></h3>
      </th>
   </tr>
<tr><th>Method</th><th>Description</th><th>Uses</th></tr>
</thead>
<tbody>
<tr><td>Mean Squared Error (MSE)</td><td>The average of the squared distance between the target value and the value predicted by the model.</td><td rowspan="3">Regression Models</td></tr>
<tr><td>Mean Absolute Error (MAE)</td><td>The average of the absolute value of the distance between the target value and the value predicted by the model.</td></tr>
<tr><td>Root Mean Squared Error (RMSE)</td><td>The square root of the mean squared error.</td></tr>
<tr><td>Accuracy</td><td>The number of correct predictions divided by the total number of predictions.</td><td rowspan="4">Classification Models</td></tr>
<tr><td>Precision</td><td>Ratio of true positives to total positive predicted.</td></tr>
<tr><td>Recall</td><td>The proportion of true positives divided by the sum of true positives and false negatives.</td></tr>
<tr><td>F1-score</td><td>The harmonic mean of precision and recall.</td></tr>
</tbody>
</table>
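<p>The definitions in the table map directly to a few lines of code. Below is an illustrative Python/NumPy sketch (not GAUSS, and not the GML API) computing each measure from scratch, with tiny made-up examples:</p>

```python
import numpy as np

def regression_metrics(y, yhat):
    """MSE, MAE, and RMSE for a regression model."""
    err = y - yhat
    mse = np.mean(err ** 2)
    return {"MSE": mse, "MAE": np.mean(np.abs(err)), "RMSE": np.sqrt(mse)}

def classification_metrics(y, yhat):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels."""
    tp = np.sum((yhat == 1) & (y == 1))   # true positives
    fp = np.sum((yhat == 1) & (y == 0))   # false positives
    fn = np.sum((yhat == 0) & (y == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Accuracy": np.mean(y == yhat),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }

# Tiny worked examples
reg = regression_metrics(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 4.0]))
cls = classification_metrics(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1]))
print(reg)
print(cls)
```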
<h2 id="tuning-parameters">Tuning Parameters</h2>
<p>Adjusting hyperparameters is one important way that we can impact the performance of machine learning models. Hyperparameters are parameters that:</p>
<ul>
<li>Are set before the model is trained and are not learned from the data.</li>
<li>Determine how the model learns from the data. </li>
<li>May need to be readjusted to maintain optimal performance as more data is collected.</li>
</ul>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="example-hyperparameters"><span style="color:#FFFFFF">Example Hyperparameters</span></h3>
      </th>
   </tr>
<tr><th>Model</th><th>Hyperparameter</th></tr>
</thead>
<tbody>
<tr><td>K-nearest neighbor</td><td>The number of neighbors used in classification group, $k$.</td></tr>
<tr><td>Ridge regression</td><td>$\lambda$, the weight on the L2 penalty.</td></tr>
<tr><td>Gradient Boosting Machines</td><td>The number of trees, the shrinkage parameter, and the number of splits in each tree.</td></tr>
</tbody>
</table>
<p>Hyperparameters can have a big impact on how well a model performs. For this reason, it is important to systematically and strategically optimize hyperparameters using hyperparameter tuning. </p>
<p>Some popular methods for hyperparameter tuning include:</p>
<ol>
<li>
<p><b>Grid Search:</b> This is a simple but effective method where you specify a set of values for each hyperparameter, and the algorithm tries all possible combinations of values. This can be time-consuming, but it guarantees that you'll find the best set of hyperparameters within the specified options.</p>
</li>
<li>
<p><b>Random Search:</b> This method randomly selects values for each hyperparameter from a specified range. This can be faster than grid search, especially if you have a large number of hyperparameters, but it's not guaranteed to find the best set of hyperparameters.</p>
</li>
<li>
<p><b>Bayesian Optimization:</b> This is a more advanced method that uses probability models to choose the next set of hyperparameters to test. It takes into account the results of previous tests to choose values that are more likely to result in better performance.</p>
</li>
<li><b>Evolutionary Algorithms:</b> This method simulates evolution by creating a population of potential solutions (sets of hyperparameters) and selecting the best ones to &quot;breed&quot; new solutions. This process continues until a good solution is found.</li>
</ol>
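<p>The first two strategies are easy to sketch. The toy Python example below (illustrative only; the hyperparameter ranges and the stand-in objective function are made up) enumerates a full grid with <code>itertools.product</code> and compares it against a random subsample of the same grid:</p>

```python
import itertools
import random

# Hypothetical hyperparameter grid
num_trees = [50, 100, 200]
max_depth = [2, 4, 8, 16]

# Stand-in for "train the model and return validation error"
def validation_error(trees, depth):
    return (trees - 100) ** 2 / 1e4 + (depth - 8) ** 2 / 1e2

# Grid search: every combination is evaluated
grid = list(itertools.product(num_trees, max_depth))
best_grid = min(grid, key=lambda p: validation_error(*p))

# Random search: only a subset of the combinations is evaluated
random.seed(0)
sample = random.sample(grid, k=5)
best_random = min(sample, key=lambda p: validation_error(*p))

print("grid search best:  ", best_grid)
print("random search best:", best_random)
```

<p>Grid search is guaranteed to find the best combination within the grid, at the cost of evaluating all of them; random search evaluates far fewer points but may miss the optimum.</p>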
<h2 id="examples">Examples</h2>
<p>Today we will consider two examples of hyperparameter tuning. For each example we:</p>
<ol>
<li>Use a decision forest model, similar to the one we <a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener"> previously built to predict the U.S. output gap</a>. </li>
<li>Perform a grid search to determine the best hyperparameter value or values. </li>
<li>Use mean squared error as our model performance measure. </li>
</ol>
<h3 id="the-model">The Model</h3>
<p>Our model:</p>
<ul>
<li>Uses a combination of common economic indicators and GDP subcomponents as predictors of CBO-based U.S. output gap.  </li>
<li>Uses a 70/30 training and testing split without shuffling.</li>
<li>Is estimated using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>.</li>
</ul>
<p>When tuning a decision forest model, there are several hyperparameters that can be considered. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="decision-forest-hyperparameters"><span style="color:#FFFFFF">Decision Forest Hyperparameters</span></h3>
      </th>
   </tr>
<tr><th>Parameter</th><th>Description</th><th>Impact</th></tr>
</thead>
<tbody>
<tr><th>Number of trees</th><td>The number of decision trees that will be trained and combined to make predictions.</td><td>Increasing the number of trees can lead to better performance, but can also increase training time and memory requirements.</td></tr>
<tr><th>Maximum depth</th><td>The maximum depth, or number of splits, of each decision tree.</td><td>A deeper tree can capture more complex relationships in the data, but can also overfit the data and perform poorly on new data.</td></tr>
<tr><th>Observations per tree</th><td>The percentage of observations used per tree.</td><td>Increasing the percentage of observations used in a tree can improve accuracy but it also can increase computational cost, reduce interpretability, and lead to overfitting or loss of diversity. </td></tr>
<tr><th>Minimum observations per node</th><td>The minimum number of observations required to be at a leaf node. </td><td>Increasing this value can help prevent overfitting, but can also result in a less complex model.</td></tr>
<tr><th>Maximum features</th><td>The maximum number of features that can be used to split each node.</td><td>Limiting the number of features can help prevent overfitting and reduce training time, but can also result in a less accurate model.</td></tr>
</tbody>
</table>
<h2 id="example-one-tuning-a-single-parameter">Example One: Tuning a Single Parameter</h2>
<p>In our first example, we will use a grid search to tune the number of features used for splitting each node. We will hold all other parameters constant at the GAUSS default values.</p>
<table>
<tbody>
<tr><th>Parameter</th><th>GAUSS Default</th></tr>
<tr><td>Number of trees</td><td style="text-align: center">100</td></tr>
<tr><td>Maximum tree depth</td><td style="text-align: center">Unlimited</td></tr>
<tr><td>Percentage of <br>observations per tree</td><td style="text-align: center">100%</td></tr>
<tr><td>Minimum observations per leaf</td><td style="text-align: center">1</td></tr>
<tr><td>Maximum features</td><td style="text-align: center">$\frac{\text{Number of Variables}}{3}$</td></tr>
</tbody>
</table>
<h3 id="the-dfcontrol-structure">The <code>dfControl</code> Structure</h3>
<p>The <code>dfControl</code> <a href="https://www.aptech.com/resources/tutorials/a-gentle-introduction-to-using-structures/" target="_blank" rel="noopener">structure</a> is an <a href="https://www.aptech.com/blog/the-basics-of-optional-arguments-in-gauss-procedures/" target="_blank" rel="noopener">optional argument</a> used to pass hyperparameter values to the  <a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener"><code>decForestRFit</code></a> and <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener"><code>decForestCFit</code></a> procedures.</p>
<p>Using the structure to change hyperparameters requires three steps:</p>
<ol>
<li>Declare an instance of the <code>dfControl</code> structure using the <code>struct</code> keyword. </li>
<li>Fill the default values for the members using the <a href="https://docs.aptech.com/gauss/dfcontrolcreate.html" target="_blank" rel="noopener"><code>dfControlCreate</code></a> procedure. </li>
<li>Set the desired parameter value using GAUSS &quot;dot&quot;, <code>.</code>, notation.</li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

// Specify features per node
dfc.featuresPerSplit = 4;</code></pre>
<h3 id="loading-and-splitting-our-data">Loading and Splitting our Data</h3>
<p>The first step in our hyperparameter tuning example is to load our data and split it into training and testing datasets. We load the data with the <a href="https://docs.aptech.com/gauss/loadd.html"><code>loadd</code></a> procedure and split it with the <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a> procedure.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load and split
*/
library gml;

// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");

/*
** Split data into 70% training and 30% testing sets 
** without shuffling.
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);</code></pre>
<h3 id="setting-non-tuning-parameters">Setting Non-Tuning Parameters</h3>
<p>Next, we will set the non-tuning hyperparameters to the GAUSS defaults using the <code>dfControl</code> structure.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Settings for decision forest
*/
// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();</code></pre>
<h3 id="performing-grid-search">Performing Grid Search</h3>
<p>Now that we've set our default non-tuning parameters we will perform our grid search to tune the features per node. The first step is to initialize our grid and <a href="https://www.aptech.com/blog/gauss-basics-3-introduction-to-matrices/" target="_blank" rel="noopener">storage matrices</a>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Initialize grid and
** storage matrices
*/
// Create vector of possible
// features per node values
featuresPerSplit = seqa(1, 1, cols(X));

// Create storage dataframe for MSE
// with one column for training mse
// and one column for testing mse
mse = asDF(zeros(rows(featuresPerSplit), 2), "Train", "Test");</code></pre>
<div class="alert alert-info" role="alert">Note that in the case of tuning a single parameter, we only have to search over a vector of potential values, not a grid. </div>
<p>Next, we will loop over each possible value of features per split. For each potential value we:</p>
<ol>
<li>Fit decision forest model using the training data.</li>
<li>Predict outcomes using the training data. </li>
<li>Predict outcomes using the testing data. </li>
<li>Compute the MSE for both the training and testing predictions.</li>
<li>Store the MSE values. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Loop over all potential values
// of features per node
for i(1, rows(featuresPerSplit), 1);

    // Set featuresPerSplit parameter
    dfc.featuresPerSplit = featuresPerSplit[i];

    /*
    ** Decision Forest Model
    */
    // Declare 'mdl' to be an instance of a
    // dfModel structure to hold the estimation results
    struct dfModel mdl;

    // Fit the model with default settings
    mdl = decForestRFit(y_train, X_train, dfc);

    // Make predictions using training data
    df_prediction_train = decForestPredict(mdl, X_train);

    // Make predictions using testing data
    df_prediction_test = decForestPredict(mdl, X_test);

    /*
    ** Compute and store mse
    */
    // Training set MSE
    mse[i, "Train"] = meanSquaredError(y_train, df_prediction_train);

    // Testing set MSE
    mse[i, "Test"] = meanSquaredError(y_test, df_prediction_test);

endfor;</code></pre>
<p>Note that within our loop we use the GML procedure <code>meanSquaredError</code> to compute our MSE.</p>
<h3 id="results">Results</h3>
<p>A visualization of our MSE values gives us some insight into what happens as we increase the features per node in our decision forest model:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_1.jpg" alt="Training and testing MSE as the features per node changes in a random forest model." width="600" height="400" class="aligncenter size-full wp-image-11583715" /></a></p>
<ul>
<li>As we increase the features per node up to about 5 or 6, we see a general downward trend in both the testing and training MSE. Over this period, the increased features per node allows the model to capture more complex interactions and dependencies in the data.</li>
<li>Increasing the features per node beyond 6 results in a general upward trend in testing MSE and a downward trend in training MSE. This points to overfitting: the model fits the training data too well, capturing noise and irrelevant patterns, which leads to decreased performance on the unseen testing data.</li>
</ul>
<p>To confirm our optimal features per node parameter setting, we can locate the minimum testing MSE:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Find the row index of the lowest MSE
idx = minindc(mse[., "Test"]);

// NOTE: two semi-colons at the end of a print statement
//       prevents it from printing a newline at the end
print "Optimal features per node: ";; featuresPerSplit[idx];
print "Minimum test MSE:";; asmatrix(mse[idx, "Test"]);</code></pre>
<p>This confirms that the optimal features per node is 6, with a testing MSE of 3.212.</p>
<pre>Optimal features per node:        6.0000000
Minimum test MSE:       3.2122050 </pre>
<h2 id="example-two-simultaneously-tuning-hyperparameters">Example Two: Simultaneously Tuning Hyperparameters</h2>
<p>Now that we've seen how to tune a single hyperparameter, let's look at tuning two hyperparameters simultaneously. We will use the same data and set up from our previous example:</p>
<h3 id="data-loading-and-preliminary-setup">Data loading and preliminary setup</h3>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load and split
*/
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");

/*
** Split data into 70% training and 30% testing sets 
** without shuffling
*/
shuffle = "False";
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7, shuffle);

/*
** Settings for decision forest
*/
// Declare an instance of the 
// dfControl structure
struct dfControl dfc;

// Set default values for
// structure members
dfc = dfControlCreate();

// Set features per split
dfc.featuresPerSplit = 6;</code></pre>
<div class="alert alert-info" role="alert">For convenience, we are using the <code>featuresPerSplit</code> value found in the previous section. The optimal value of one hyperparameter depends on the values of the others, so in practice, you should not optimize them separately.</div>
<h3 id="performing-grid-search-1">Performing Grid Search</h3>
<p>In this example, we will tune:</p>
<ul>
<li>The minimum observations per leaf, ranging from 1 to 20.  </li>
<li>The percentage of the observations per tree, ranging from 70% to 100%. </li>
</ul>
<p>First, we initialize our grid and storage matrices. For this example, we will focus only on our testing MSE. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Initialize grid and
** storage matrices
*/
// Set potential values for 
// minimum observations per node
minObsLeaf = seqa(1, 1, 20);

// Set potential values for 
// percentage of observations
// in tree
pctObs = seqa(0.7, 0.1, 4);

// Storage matrices
test_mse = zeros(rows(minObsLeaf), rows(pctObs));</code></pre>
<p>Next, we use nested <a href="https://docs.aptech.com/gauss/for.html" target="_blank" rel="noopener"><code>for</code> loops</a> to search over all potential values of the minimum observations per leaf and the percentage of observations used per tree.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">for i(1, rows(minObsLeaf), 1);

    // Set the minimum obs per leaf
    dfc.minObsLeaf = minObsLeaf[i];

    for j(1, rows(pctObs), 1);

        // Set percentage of obs used for each tree
        dfc.pctObsPerTree = pctObs[j];

        /*
        ** Decision Forest Model
        */
        // Declare 'mdl' to be an instance of a
        // dfModel structure to hold the estimation results
        struct dfModel mdl;

        // Estimate the model with default settings
        mdl = decForestRFit(y_train, X_train, dfc);

        // Make predictions using testing data
        df_prediction_test = decForestPredict(mdl, X_test);

        /*
        ** Compute and store mse
        */
        // Testing set MSE
        test_mse[i, j] = meanSquaredError(y_test, df_prediction_test);

    endfor;
endfor;</code></pre>
<p>Note that in this loop:</p>
<ul>
<li>We use <em>i</em>, from the outer loop, to index the <code>minObsLeaf</code> vector.</li>
<li>We use <em>j</em>, from the inner loop, to index the <code>pctObs</code> vector.</li>
<li>Each row in our storage matrices represents a constant minimum samples per leaf. </li>
<li>Each column in our storage matrices represents a constant percentage of samples.</li>
</ul>
<h3 id="results-1">Results</h3>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_2.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/gblog_df_hp_tune_mse_2.jpg" alt="Test MSE for a random forest model with varying hyperparameters." width="600" height="400" class="aligncenter size-full wp-image-11583724" /></a></p>
<p>The above plot shows us that with the GAUSS default settings for a random forest and <code>featuresPerSplit</code> set to 6:</p>
<ul>
<li>Taking a sample of 100% of the data for the creation of each tree is almost always best.</li>
<li>Setting <code>minObsLeaf</code> to between 5 and 10 seems best, with the minimum at about 7.</li>
<li>We did not get much of an improvement in our test MSE over the first example.</li>
</ul>
<h3 id="optional-finding-the-minimum-mse-value-in-the-output-matrix">Optional: Finding the minimum MSE value in the output matrix</h3>
<p>The final step is to find our optimal hyperparameter settings by locating the combination of parameters that yields the lowest MSE. </p>
<p>We can break this into two steps. First, we find the column that contains the minimum value.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create a column vector with the minimum MSE
// values for each column
mse_col_mins = minc(test_mse);

// Find the index of the smallest
// value in 'mse_col_mins'
idx_col_min = minindc(mse_col_mins);</code></pre>
<p>Now that we have found which column contains the minimum MSE value, we use <a href="https://docs.aptech.com/gauss/minindc.html"><code>minindc</code></a> to find the index of the smallest value in that column.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Find the row that contains the smallest MSE value
idx_row_min = minindc(test_mse[.,idx_col_min]);

// Extract the lowest MSE across all
// combinations of tuning parameters
MSE_optimal = test_mse[idx_row_min, idx_col_min];

// Print results
sprintf( "Minimum testing MSE: %4f", MSE_optimal);
print "Minimum MSE occurs with";
sprintf("  minimum samples per leaf      : %d", minObsLeaf[idx_row_min]);
sprintf("  percentage of samples per tree: %g%%", 100 * pctObs[idx_col_min]); </code></pre>
<p>This prints our results:</p>
<pre>Minimum testing MSE: 3.151047
Minimum MSE occurs with
  minimum samples per leaf      : 7
  percentage of samples per tree: 100%</pre>
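<p>The same two-step lookup can be collapsed into a single call. As an illustrative cross-check in Python/NumPy (not the GAUSS code from this post, and using a small made-up stand-in matrix), <code>np.argmin</code> combined with <code>np.unravel_index</code> returns the row and column of the minimum directly:</p>

```python
import numpy as np

# Small stand-in for the test_mse storage matrix;
# rows index minObsLeaf values, columns index pctObs values
test_mse = np.array([
    [3.60, 3.40, 3.35, 3.30],
    [3.45, 3.30, 3.25, 3.20],
    [3.50, 3.35, 3.28, 3.151047],
])

# Flat index of the smallest entry, converted back to (row, col)
row, col = np.unravel_index(np.argmin(test_mse), test_mse.shape)

mse_optimal = test_mse[row, col]
print(f"minimum MSE {mse_optimal:.6f} at row {row}, column {col}")
```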
<div class="alert alert-info" role="alert">For more information on using <a href="https://docs.aptech.com/gauss/sprintf.html" target="_blank" rel="noopener">sprintf</a> for printing see our previous blog, <a href="https://www.aptech.com/blog/how-to-create-a-simple-table-with-sprintf-in-gauss/" target="_blank" rel="noopener">&quot;How to Create a Simple Table Using Sprintf&quot;</a></div>
<h2 id="conclusion">Conclusion</h2>
<p>Today's blog demonstrates how practitioners can use hyperparameters to tune and improve machine learning models. It is important to remember that taking the time to systematically and strategically determine model hyperparameters can greatly improve machine learning model performance.</p>
<p>Stay tuned, because next time we will take a deeper dive into how to think about the data and which hyperparameter settings make sense to try out.</p>
<div class="alert alert-info" role="alert">The code and data for this blog are available in our GitHub repository. You can find the repository <a href="https://github.com/aptech/gauss_blog/tree/master/machine-learning/parameter-tuning" target="_blank" rel="noopener">here</a>.</div>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a>  </li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
</ol>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // calculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Predicting The Output Gap With Machine Learning Regression Models</title>
		<link>https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/</link>
					<comments>https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Wed, 12 Apr 2023 18:44:19 +0000</pubDate>
				<category><![CDATA[Machine learning]]></category>
		<category><![CDATA[Time Series]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583590</guid>

					<description><![CDATA[In today's blog, we compare three different machine learning regression techniques for predicting U.S. real GDP output gap. We will use a combination of common economic indicators and GDP subcomponents to predict the quarterly GDP output gap. ]]></description>
					<content:encoded><![CDATA[
<h3 id="introduction">Introduction</h3>
<p>Economists are increasingly exploring the potential for machine learning models in economic forecasting. This blog offers an introduction to using three different machine learning regression techniques for economic modeling, using an empirical application to the real U.S. GDP output gap. </p>
<p>We look specifically at:</p>
<ul>
<li>Measuring the output gap. </li>
<li>The fundamentals of three machine learning regression models. </li>
<li>Model estimation using the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library</a>.</li>
</ul>
<h2 id="measuring-gdp-output-gap">Measuring GDP Output Gap</h2>
<p>The GDP output gap is a macroeconomic indicator that measures the difference between actual GDP and potential GDP. It is an interesting and useful economic statistic:</p>
<ul>
<li>It indicates whether the economy is operating with unemployment, inefficiencies, or inflationary pressures, making it useful for policymaking. </li>
<li>Potential GDP is unobservable and must be estimated, with a large literature devoted to finding the best estimate of potential GDP. </li>
<li>Positive output gaps indicate that the economy is operating over potential GDP and at risk of inflation. </li>
<li>Negative output gaps indicate that the economy is operating below potential GDP and possibly in recession.  </li>
</ul>
<p>Our goal today is to demonstrate different machine learning regression techniques. For simplicity, we're going to use the output gap based on the <a href="https://fred.stlouisfed.org/series/GDPPOT" target="_blank" rel="noopener">Congressional Budget Office's estimate of real potential GDP</a> to train our model. </p>
<div class="alert alert-info" role="alert">We compute the output gap as the percent deviation of real U.S. GDP from the CBO's estimate of real potential GDP. Both components are available for download from the FRED database. </div>
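<p>As a quick numeric illustration of this definition, here is a minimal Python sketch (the GDP values are hypothetical, not actual FRED data):</p>

```python
def output_gap(actual_gdp, potential_gdp):
    """Output gap as the percent deviation of actual GDP from potential GDP."""
    return 100.0 * (actual_gdp - potential_gdp) / potential_gdp

# If actual GDP is 19,800 and potential GDP is 20,000 (billions of dollars),
# the economy is operating 1% below potential: a negative output gap.
print(output_gap(19800.0, 20000.0))  # -1.0
```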
<h2 id="the-models">The Models</h2>
<p>Today we will look at three machine learning models used specifically for predicting continuous data:</p>
<ul>
<li>Decision forest regression (also known as random forest regression). </li>
<li>LASSO regression. </li>
<li>Ridge regression. </li>
</ul>
<h3 id="decision-forest-regression">Decision Forest Regression</h3>
<h4 id="decision-trees">Decision Trees</h4>
<p>Decision forest regression utilizes decision trees for continuous data, which:</p>
<ol>
<li>Segment the data into subsets using data-based <em>splitting rules</em>. </li>
<li>Assign the average of the target variable within a subset as the prediction for all observations that fall inside that subset. </li>
</ol>
<p>To implement a single decision tree, a sample is split into segments using <em>recursive binary splitting</em>. This iterative approach determines where and how to split the data based on what leads to the lowest residual sum of squares (RSS).</p>
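<p>To make the splitting rule concrete, here is a minimal Python sketch of one step of recursive binary splitting (illustrative only; the blog's estimation is done in GAUSS, and the data below is made up). Each candidate threshold on a feature is tried, and the one that minimizes the combined RSS of the two segments is kept:</p>

```python
import numpy as np

def best_split(x, y):
    """One step of recursive binary splitting on a single feature:
    return the threshold that minimizes total RSS across both segments."""
    best_rss, best_threshold = np.inf, None
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        # Each segment predicts its own mean, so its RSS is the sum of
        # squared deviations from that mean.
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_rss, best_threshold = rss, threshold
    return best_threshold, best_rss

# Two clearly separated regimes; the best split lands between them.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))  # threshold 3.0 separates the two regimes
```

<p>A full tree applies this search recursively within each resulting segment, and across all candidate features, until a stopping rule is met.</p>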
<h4 id="decision-forests">Decision Forests</h4>
<p>Single decision trees can have low, non-robust predictive power and suffer from high variance. This can be overcome using random decision forests that offer performance improvements by combining results from groups, or &quot;forests&quot;, of trees.</p>
<p>The random decision forest algorithm:</p>
<ol>
<li>Randomly chooses $m$ predictors to be used as candidates for splitting the data.</li>
<li>Constructs a decision tree from a bootstrapped training set. </li>
<li>Repeats the decision tree formation for a specified number of iterations. </li>
<li>Averages the results from all trees to make a final prediction.</li>
</ol>
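<p>Steps 2&ndash;4 of the algorithm can be sketched in a few lines of Python (illustrative only; the blog's actual estimation uses GAUSS's <code>decForestRFit</code> below). For simplicity we use a single feature with a fixed split, so step 1, random predictor subsetting, is omitted:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def forest_predict(x, y, x_new, threshold, n_trees=200):
    """Bootstrap the training set, fit a depth-one tree on each
    resample, and average the predictions over the whole forest."""
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
        xb, yb = x[idx], y[idx]
        left, right = yb[xb <= threshold], yb[xb > threshold]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate resamples with an empty segment
        # A depth-one tree predicts the mean of the relevant segment.
        preds.append(left.mean() if x_new <= threshold else right.mean())
    return float(np.mean(preds))

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(forest_predict(x, y, x_new=11.0, threshold=3.0))  # close to 5.0
```

<p>Averaging over many bootstrapped trees is what reduces the variance of the single-tree predictions.</p>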
<h3 id="lasso-and-ridge-regression">LASSO and Ridge Regression</h3>
<p>LASSO and ridge regression aim to reduce prediction variances using a modified least squares approach. Let's look a little more closely at how this works. </p>
<p>Recall that ordinary least squares estimates coefficients through the minimization of the residual sum of squares (RSS):</p>
<p>$$ RSS = \sum_{i=1}^n \bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2$$</p>
<p>Penalized least squares estimates coefficients using a modified objective function:</p>
<p>$$ S_{\lambda} = \sum_{i=1}^n \bigg(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda J $$</p>
<p>where $\lambda$ is the tuning parameter and $\lambda J$ is the penalty term, whose form depends on the method. </p>
<table>
<tbody>
<tr><th>Method</th><th>Description</th><th>Penalty term</th></tr>
<tr><td>LASSO Regression</td><td>$L_1$ penalized linear regression model.</td><td>$\lambda \sum_{j=1}^p |\beta_j|$</td></tr>
<tr><td>Ridge Regression</td><td>$L_2$ penalized linear regression model.</td><td>$\lambda \sum_{j=1}^p \beta_j^2$</td></tr>
</tbody>
</table>
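<p>A quick worked example shows how the two penalties treat the same coefficient vector differently (a Python sketch with made-up coefficients; $\lambda = 0.3$ matches the value used later in this blog):</p>

```python
import numpy as np

beta = np.array([2.0, -0.5, 0.0, 1.5])  # hypothetical coefficients
lam = 0.3

l1_penalty = lam * np.abs(beta).sum()  # LASSO: lambda * sum of |beta_j|
l2_penalty = lam * (beta ** 2).sum()   # ridge: lambda * sum of beta_j^2

print(l1_penalty)  # sum of |beta| is 4.0, so the penalty is 1.2
print(l2_penalty)  # sum of beta^2 is 6.5, so the penalty is about 1.95
```

<p>The $L_1$ penalty grows linearly in each coefficient, which is why LASSO can shrink coefficients exactly to zero and drop predictors, while the quadratic $L_2$ penalty shrinks large coefficients heavily but rarely to exactly zero.</p>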
<h2 id="our-prediction-process">Our Prediction Process</h2>
<p>Our prediction process is motivated by the idea that as new information becomes available, it should be used to improve our forecasting model. </p>
<p>Based on this motivation, we use an expanding training window to make one-step ahead forecasts:</p>
<ul>
<li>Train the model using all observed data in the training window, features and output gap, up to time $t$.</li>
<li>Predict the output gap at time $t + 1$ using the observed features at time $t + 1$. </li>
<li>Expand the training window to include all observed data up to time $t + 1$.</li>
<li>Repeat model training and prediction. </li>
</ul>
<p>It's worth noting that while this method uses the most information available for each prediction, there is a trade-off in timeliness. In a real-world setting, it means we only forecast the output gap one quarter ahead. This may not be far enough in advance if we're using the forecast to guide business or investment decisions. </p>
<h2 id="predictors">Predictors</h2>
<p>Today we will use a combination of common economic indicators and GDP subcomponents as predictors. </p>
<table>
 <thead>
<tr><th>Variable</th><th>Description</th><th>Transformations</th></tr>
</thead>
<tbody>
<tr><td><a href="https://fred.stlouisfed.org/series/UMCSENT" target="_blank" rel="noopener">UMCSENT</a></td><td>University of Michigan consumer sentiment, quarterly average.</td><td>None</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/UNRATE" target="_blank" rel="noopener">UNRATE</a></td><td>Civilian unemployment rate as a percentage, quarterly average.</td><td>None.</td></tr>
<tr><td>CR</td><td>The credit spread between <a href="https://fred.stlouisfed.org/series/BAAFFM" target="_blank" rel="noopener">Moody's BAA</a> and <a href="https://fred.stlouisfed.org/series/AAAFFM" target="_blank" rel="noopener">AAA</a> corporate bond yields.</td><td>None.</td></tr>
<tr><td>TS</td><td>The difference between the yield on the <a href="https://fred.stlouisfed.org/series/DGS10" target="_blank" rel="noopener">10-year treasury bond</a> and the <a href="https://fred.stlouisfed.org/series/DGS1" target="_blank" rel="noopener">1-yr treasury bill</a>.</td><td>None</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/FEDFUNDS" target="_blank" rel="noopener">FEDFUNDS</a></td><td>The Federal Funds rate.</td><td>First differences.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/SP500" target="_blank" rel="noopener">SP500</a></td><td>The S&amp;P 500 index value at market closing.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/CPIAUCSL" target="_blank" rel="noopener">CPIAUCSL</a></td><td>Consumer price index for all urban consumers.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/INDPRO" target="_blank" rel="noopener">INDPRO</a></td><td>The industrial production (IP) index.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td><a href="https://fred.stlouisfed.org/series/HOUST" target="_blank" rel="noopener">HOUST</a></td><td>New privately-owned housing unit starts.</td><td>Percent change, computed as difference in natural logs.</td></tr>
<tr><td>GAP_CH</td><td>The change in output gap.</td><td>None.</td></tr>
</tbody>
</table>
<p>For our model:</p>
<ul>
<li>All predictors are available from <a href="https://fred.stlouisfed.org/series/GDPPOT" target="_blank" rel="noopener">FRED</a> in levels.         </li>
<li>Monthly variables are aggregated to quarterly data using averages.</li>
<li>Four lags of all variables are included. </li>
</ul>
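<p>The &quot;difference in natural logs&quot; transformation listed above is easy to sketch in Python (hypothetical index values, for illustration only; the cleaned GAUSS dataset linked below is provided pre-transformed):</p>

```python
import numpy as np

def pct_change_logdiff(series):
    """Percent change computed as the difference in natural logs,
    the transformation applied to SP500, CPIAUCSL, INDPRO, and HOUST."""
    logs = np.log(np.asarray(series, dtype=float))
    return 100.0 * np.diff(logs)

# A move from 100 to 102 is roughly a 2% log change.
print(pct_change_logdiff([100.0, 102.0, 101.0]))
```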
<h2 id="estimation-in-gauss">Estimation in GAUSS</h2>
<h3 id="data-loading">Data Loading</h3>
<p>Because we want to primarily focus on the models, rather than data cleaning, we don't go into the details of our data cleaning process here. Instead, the cleaned and prepped data is available for download <a href="https://github.com/aptech/gauss_blog/blob/master/machine-learning/ml-regressions/reg_data.gdat?raw=true" target="_blank" rel="noopener">here</a>. </p>
<div class="alert alert-info" role="alert">For more information about data cleaning and management try one of our earlier blogs such as:
<br>• <a href="https://www.aptech.com/blog/importing-fred-data-to-gauss/" target="_blank" rel="noopener">Importing FRED Data To GAUSS</a><br>• <a href="https://www.aptech.com/blog/getting-to-know-your-data-with-gauss-22/" target="_blank" rel="noopener">Getting to Know Your Data With GAUSS</a><br>• <a href="https://www.aptech.com/blog/preparing-and-cleaning-data-fred-data-in-gauss/" target="_blank" rel="noopener">Preparing And Cleaning FRED Data In GAUSS</a>
</div>
<p>Prior to estimating any model, we load the data and separate our outcome and feature data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">library gml;
rndseed 23423;

/*
** Load data and prepare data
*/
// Load dataset
dataset = __FILE_DIR $+ "reg_data.gdat";
data = loadd(dataset);

// Trim rows from the top of data to account
// for lagged and differenced data
max_lag = 4;
data = trimr(data, max_lag + 1, 0);

/*
** Extract outcome and features
*/
// Extract outcome variable
y = data[., "CBO_GAP"];

// Extract features
X = delcols(data, "date"$|"CBO_GAP");
</code></pre>
<h3 id="general-one-step-ahead-process">General One-Step-Ahead Process</h3>
<p>The full data sample ranges from 1967Q1 to 2022Q4. We'll start computing one-step-ahead forecasts in 1995Q1, using an initial training period of 1967Q1 to 1994Q4. </p>
<p>To implement the expanding window one-step-ahead forecasts, we use a GAUSS <a href="https://docs.aptech.com/gauss/dowhiledountil.html" target="_blank" rel="noopener"><code>do while</code></a> loop:  </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify starting date 
st_date = asDate("1994-Q4", "%Y-Q%q");

// Find the index of 'st_date'
st_indx = indnv(st_date, data[., "date"]); 

// Iterate over remaining observations
// using expanding window to fit model
do while st_indx &lt; rows(x)-1;

    // Get y_train and x_train
    y_train = y[1:st_indx];
    x_train = X[1:st_indx, .]; 
    x_test = X[st_indx+1, .];

    // Fit model
    ...

    // Compute one-step-ahead prediction
    ...

    // Update st_indx
    st_indx = st_indx + 1;
endo;</code></pre>
<h3 id="model-and-prediction-procedures">Model and Prediction Procedures</h3>
<p>The GAUSS machine learning library offers all the procedures we need for our model training and prediction. </p>
<table>
<tbody>
<tr><th>Model</th><th>Fitting Procedure</th><th>Prediction Procedure</th></tr>
<tr><td>Decision Forest</td><td><a href="https://docs.aptech.com/gauss/decforestrfit.html" target="_blank" rel="noopener">decForestRFit</a></td><td><a href="https://docs.aptech.com/gauss/decforestpredict.html" target="_blank" rel="noopener">decForestPredict</a></td></tr>
<tr><td>LASSO Regression</td><td><a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener">lassoFit</a></td><td><a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener">lmPredict</a></td></tr>
<tr><td>Ridge Regression</td><td><a href="https://docs.aptech.com/gauss/lassofit.html" target="_blank" rel="noopener">ridgeFit</a></td><td><a href="https://docs.aptech.com/gauss/lmpredict.html" target="_blank" rel="noopener">lmPredict</a></td></tr>
</tbody>
</table>
<p>To simplify our code, we will use three <a href="https://www.aptech.com/blog/basics-of-gauss-procedures/" target="_blank" rel="noopener">GAUSS procedures</a> that combine the fitting and prediction steps for each method. </p>
<p>We define one procedure for the one-step ahead prediction for the LASSO model:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">proc (1) = osaLasso(y_train, x_train, x_test, lambda);
    local lasso_prediction;

    /*
    ** Lasso Model
    */
    // Declare 'mdl' to be an instance of a
    // lassoModel structure to hold the estimation results
    struct lassoModel mdl;

    // Estimate the model with default settings
    mdl = lassoFit(y_train, x_train, lambda);

    // Make predictions using test data
    lasso_prediction = lmPredict(mdl, x_test);

    retp(lasso_prediction);
endp;</code></pre>
<p>The second procedure performs fitting and prediction for the ridge model:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">proc (1) = osaRidge(y_train, x_train, x_test, lambda);
    local ridge_prediction;

    /*
    ** Ridge Model
    */
    // Declare 'mdl' to be an instance of a
    // ridgeModel structure to hold the estimation results
    struct ridgeModel mdl;

    // Estimate the model with default settings
    mdl = ridgeFit(y_train, x_train, lambda);

    // Make predictions using test data
    ridge_prediction = lmPredict(mdl, x_test);

    retp(ridge_prediction);
endp;</code></pre>
<p>The final procedure performs fitting and prediction for the decision forest model:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">proc (1) = osaDF(y_train, x_train, x_test, struct dfControl dfc);
    local df_prediction;

    /*
    ** Decision Forest Model
    */
    // Declare 'mdl' to be an instance of a
    // dfModel structure to hold the estimation results
    struct dfModel mdl;

    // Estimate the model with default settings
    mdl = decForestRFit(y_train, x_train, dfc);

    // Make predictions using test data
    df_prediction = decForestPredict(mdl, x_test);

    retp(df_prediction);
endp;</code></pre>
<h3 id="computing-predictions">Computing Predictions</h3>
<p>Finally we are ready to begin computing our predictions. First, we set the necessary tuning parameters:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Set up tuning parameters
*/

// L2 and L1 regularization penalty
lambda = 0.3;

/*
** Settings for decision forest
*/
// Use control structure for settings
struct dfControl dfc;
dfc = dfControlCreate();

// Turn on variable importance
dfc.variableImportanceMethod = 1;

// Turn on out-of-bag error calculation
dfc.oobError = 1;</code></pre>
<div class="alert alert-info" role="alert">We used a λ of 0.3 for both the ridge and LASSO models and all GAUSS default settings for the decision forest hyperparameters. Note that we have not taken any steps to optimize our models; model selection and optimization will be covered in later blogs. </div>
<p>Next, we initialize the starting point for our loop and our prediction storage matrix. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Initialize starting point and
** storage matrix for expanding 
** window loop
*/
st_date = asDate("1994-Q4", "%Y-Q%q");
st_indx = indnv(st_date, data[., "date"]);

// Set up storage dataframe for predictions
// using one column for each model
osa_pred = asDF(zeros(rows(X), 3), "LASSO", "Ridge", "Decision Forest");</code></pre>
<p>Finally, we implement our expanding window <code>do while</code> loop:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">do while st_indx &lt; rows(X)-1;

    // Get y and x subsets for
    // fitting and prediction
    y_train = Y[1:st_indx];
    X_train = X[1:st_indx, .]; 
    X_test = X[st_indx+1, .];

    // LASSO Model
    osa_pred[st_indx+1, "LASSO"] = osaLasso(y_train, X_train, X_test, lambda);

    // Ridge Model
    osa_pred[st_indx+1, "Ridge"] = osaRidge(y_train, X_train, X_test, lambda);

    // Decision Forest Model
    osa_pred[st_indx+1, "Decision Forest"] = osaDF(y_train, X_train, X_test, dfc);

    // Update st_indx
    st_indx = st_indx + 1;
endo;</code></pre>
<h2 id="results">Results</h2>
<h3 id="prediction-visualization">Prediction Visualization</h3>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/08/lr-prediction-comparisons.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/08/lr-prediction-comparisons.jpg" alt="Comparison of output gap predictions using LASSO, ridge, and decision forest regression. " width="800" height="600" class="aligncenter size-full wp-image-11584010" /></a></p>
<p>The graph above plots the predictions from all three of our models against the actual CBO implied output gap. There are a few things worth noting about these results:</p>
<ul>
<li>All three models fail to predict the output decline associated with the start of the COVID pandemic. This isn't a surprise, as the onset of COVID was a hard-to-predict shock to the economy. </li>
<li>The models underestimate the persistent effects of the 2008 global financial crisis. While all three trend in the same direction as the observed output gap, they all predict better economic performance than actually occurred. This tells us that our feature set doesn't contain the information needed to capture the ongoing effects of the financial crisis. We could potentially improve our model by incorporating more features, such as bank balances or home foreclosures. </li>
<li>The ridge model overestimates the short-term impacts of the 2008 global financial crisis, predicting a larger drop in the output gap than both the other models and the actual output gap.</li>
</ul>
<div class="alert alert-info" role="alert">To learn more about formatting GAUSS plots see our <a href="https://www.aptech.com/blog/category/graphics/">GAUSS graphics blogs</a>. </div>
<h3 id="model-performance">Model Performance</h3>
<p>We can also compare the performance of our models using the mean squared error (MSE). This can easily be calculated from our predictions and our observed output gap:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Computing MSE
*/
// Compute residuals
residuals = osa_pred - y;

// Filter for prediction window
residuals = selif(residuals, data[., "date"] .&gt;= st_date);

// Compute the MSE for prediction window
mse  = meanc((residuals).^2);</code></pre>
<p>A comparison of the MSE shows that the models perform similarly, with our decision forest model offering a slight advantage over LASSO and ridge. </p>
<table style="width:35%;margin-left:auto;margin-right:auto;">
<tbody>
<tr><th style="width:60%">Model</th><th>MSE</th></tr>
<tr><td>LASSO</td><td>2.08</td></tr>
<tr><td>Ridge</td><td>2.36</td></tr>
<tr><td>Decision Forest</td><td>1.80</td></tr>
</tbody>
</table>
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog we examined the performance of several machine learning regression models used to predict the output gap. This blog is meant to provide an introduction to these models, leaving model selection and optimization for future blogs. </p>
<p>After today's blog, you should have a better understanding of:</p>
<ul>
<li>The foundations of decision forest regression models.</li>
<li>LASSO and ridge regression models.</li>
<li>How machine learning models can be used to help predict economic and financial outcomes.</li>
</ul>
<h3 id="further-machine-learning-reading">Further Machine Learning Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>  </li>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Applications of Principal Components Analysis in Finance</title>
		<link>https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/</link>
					<comments>https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Thu, 16 Mar 2023 03:45:47 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583484</guid>

					<description><![CDATA[Principal components analysis (PCA) is a useful tool that can help practitioners streamline data without losing information. In today’s blog, we’ll examine the use of principal components analysis in finance using an empirical example. 

We'll look more closely at:
<ul>
<li>What PCA is.</li>
<li>How PCA works.</li> 
<li>How to use the GAUSS Machine Learning library to perform PCA.</li> 
<li>How to interpret PCA results.</li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Principal components analysis (PCA) is a useful tool that can help practitioners streamline data without losing information. In today’s blog, we’ll examine the use of principal components analysis in finance using an empirical example. </p>
<p>Specifically, we’ll look more closely at:</p>
<ul>
<li>What PCA is. </li>
<li>How PCA works. </li>
<li>How to use the GAUSS Machine Learning library to perform PCA. </li>
<li>How to interpret PCA results. </li>
</ul>
<h2 id="what-is-principal-components-analysis">What is Principal Components Analysis?</h2>
<p>Principal components analysis (PCA) is an unsupervised learning method that results in a low-dimensional representation of a dataset. The intuition behind PCA is that the most important information is drawn from the features by eliminating redundancy and noise. The resulting dataset captures the most interesting components of the data. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="pca-snapshot"><span style="color:#FFFFFF">PCA Snapshot</span></h3>
      </th>
   </tr>
</thead>
<tbody>
<tr><td>Uses linear transformations to capture the most important characteristics of a set of features. </td></tr>
<tr><td>Uses variance of the features to distinguish relevant features from pure noise.</td></tr>
<tr><td>Identifies and removes redundancy in features.</td></tr>
</tbody>
</table>
<div class="alert alert-info" role="alert">Unsupervised learning methods are a subcategory of machine learning models. Rather than predicting outcomes or responses, unsupervised learning methods aim to characterize and answer questions about a feature set. </div>
<h2 id="how-do-we-find-principal-components">How Do We Find Principal Components?</h2>
<p>Principal components are found by identifying the normalized, linear combination of features</p>
<p>$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p$$</p>
<p>which has the largest variance. </p>
<p>The coefficients $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ are referred to as the loadings and are restricted such that their sum of squares is equal to one. </p>
<p>To compute the first principal component we:</p>
<ol>
<li>Center our feature data to have a mean of zero. </li>
<li>Find loadings that result with the largest sample variance, subject to the constraint that $\sum_{j=1}^p \phi_{j,1}^2 = 1$.</li>
</ol>
<p>Once the first principal component is found, we can find a second principal component, $Z_2$, which is constrained to be uncorrelated with $Z_1$.</p>
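<p>These steps map directly onto a singular value decomposition of the centered data. Here is a small Python sketch on synthetic data (the blog's own PCA is run with the GAUSS Machine Learning library):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic feature matrix: 200 observations, 3 features, one redundant.
X = rng.standard_normal((200, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.standard_normal(200)

# Step 1: center each feature to mean zero.
Xc = X - X.mean(axis=0)

# Step 2: the loadings phi are the right singular vectors of the centered
# data (equivalently, eigenvectors of its covariance matrix); the first
# row of Vt points in the direction of largest variance.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]

Z1 = Xc @ phi1  # first principal component scores
Z2 = Xc @ phi2  # second component, uncorrelated with the first

print(np.sum(phi1 ** 2))  # loadings are normalized: sum of squares is 1
print(abs(np.corrcoef(Z1, Z2)[0, 1]) < 1e-8)  # components are uncorrelated
```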
<h2 id="when-should-you-use-pca">When Should You Use PCA?</h2>
<p>The most common use of PCA is to reduce the size of a feature set without losing too much information. The reduced feature set can then be used in a second stage of modeling. However, this is not the only use of PCA, and there are a number of insightful ways it can be applied. </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="real-world-applications-of-pca"><span style="color:#FFFFFF">Real World Applications of PCA</span></h3>
      </th>
   </tr>
</thead>
<tbody>
<tr><td>Reducing the size of images.</td><td>PCA can be used to reduce the size of an image without significantly impacting the quality. Beyond reducing storage, this is useful as a preprocessing step for image classification algorithms.</td></tr>
<tr><td><a href="https://www.aptech.com/blog/category/graphics/" target="_blank" rel="noopener">Visualizing</a> multidimensional data.</td><td>PCA allows us to represent the information contained in multidimensional data in reduced dimensions which are more compatible with visualization.</td></tr>
<tr><td>Finding patterns in high-dimensional datasets.</td><td>Examining the relationships between principal components and original features can help uncover patterns in the data that are harder to identify in our full dataset.</td></tr>
<tr><td>Stock price prediction in <a href="https://www.aptech.com/industry-solutions/finance/" target="_blank" rel="noopener">finance.</a></td><td>Many models of stock price prediction rely on estimating covariance matrices. However, this can be difficult with high-dimensional data. PCA can be used for data reduction to help remedy this issue.</td></tr>
<tr><td>Dataset reduction in <a href="https://www.aptech.com/industry-solutions/epidemiology/" target="_blank" rel="noopener">healthcare models.</a></td><td>Healthcare models use high-dimensional datasets because there are many factors that influence healthcare outcomes. PCA provides a method to reduce the dimensionality while still capturing the relevant variance.</td></tr>
</tbody>
</table>
<h2 id="empirical-example">Empirical Example</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/03/us-treasury-yields.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/03/us-treasury-yields.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583500" /></a></p>
<p>Let's take a look at principal components analysis in action! We'll start by extending the <a href="https://docs.aptech.com/gauss/textbook-examples/brooks-introductory-econometrics-for-finance/docs/principal-components-tbills.html" target="_blank" rel="noopener">PCA application to US Treasury bills and bonds from Introductory Econometrics For Finance by Chris Brooks</a>. </p>
<p>In our example we will:</p>
<ul>
<li>Update the dataset to use current data. </li>
<li>Use the <a href="https://docs.aptech.com/gauss/pcafit.html" target="_blank" rel="noopener">pcaFit</a> and <a href="https://docs.aptech.com/gauss/pcatransform.html" target="_blank" rel="noopener">pcaTransform</a> functions available in the <a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning library (GML)</a>.</li>
</ul>
<h3 id="loading-fred-data">Loading FRED Data</h3>
<p>Our initial dataset includes 6 variables capturing short-term and long-term yields on U.S. bonds and bills. </p>
<table>
 <thead>
<tr><th>Variable</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>GS3M</td><td>Market yield on the 3-month US Treasury bill.</td></tr>
<tr><td>GS6M</td><td>Market yield on the 6-month US Treasury bill.</td></tr>
<tr><td>GS1</td><td>Market yield on the 1-year US Treasury bond.</td></tr>
<tr><td>GS3</td><td>Market yield on the 3-year US Treasury bond.</td></tr>
<tr><td>GS5</td><td>Market yield on the 5-year US Treasury bond.</td></tr>
<tr><td>GS10</td><td>Market yield on the 10-year US Treasury bond.</td></tr>
</tbody>
</table>
<p>This data can be directly imported into GAUSS from the <a href="https://fred.stlouisfed.org/" target="_blank" rel="noopener">FRED</a> database.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Import U.S. bond and bill data
** directly from FRED
*/
// Set observation_start parameter
// to use all data on or after 1990-01-01 and before or on 2023-03-01
params_cpi = fred_set("observation_start", "1990-01-01", "observation_end", "2023-03-01");

// Load data from FRED
data = fred_load("GS3M + GS6M + GS1 + GS3 + GS5 + GS10", params_cpi);

// Reorder data to match the organization in original example
data = order(data, "date"$|"GS3M"$|"GS6M"$|"GS1"$|"GS3"$|"GS5"$|"GS10");

// Preview the first 5 rows
head(data);</code></pre>
<p>The data preview printed to the <strong>Command Window</strong> helps verify that our data has loaded correctly:</p>
<pre>            date        GS3M        GS6M         GS1         GS3         GS5        GS10
      1990-01-01        7.90        7.96        7.92        8.13        8.12        8.21
      1990-02-01        8.00        8.12        8.11        8.39        8.42        8.47
      1990-03-01        8.17        8.28        8.35        8.63        8.60        8.59
      1990-04-01        8.04        8.27        8.40        8.78        8.77        8.79
      1990-05-01        8.01        8.19        8.32        8.69        8.74        8.76 </pre>
<div class="alert alert-info" role="alert">For more information on importing FRED data to GAUSS, see our <a href="https://www.aptech.com/blog/importing-fred-data-to-gauss/" target="_blank" rel="noopener">earlier blog</a>. </div>
<h3 id="normalizing-yields">Normalizing Yields</h3>
<p>Following Brooks' example, we will normalize the yields to have zero mean and a standard deviation of one using the <a href="https://docs.aptech.com/gauss/rescale.html" target="_blank" rel="noopener"><code>rescale</code></a> procedure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Normalizing the yield
*/
// Create a dataframe that contains
// the yields, but not the 'Date' variable
yields = delcols(data, "date");

// Standardize the yields using rescale
{ yields_norm, location, scale_factor } = rescale(yields, "standardize");

head(yields_norm);</code></pre>
<p>This prints a preview of our normalized yields:</p>
<pre>            GS3M             GS6M              GS1              GS3              GS5             GS10
       2.3153725        2.2469720        2.1773318        2.0802078        2.0025703        1.9626705
       2.3591880        2.3159905        2.2593350        2.1936833        2.1395985        2.0916968
       2.4336745        2.3850090        2.3629181        2.2984298        2.2218155        2.1512474
       2.3767142        2.3806953        2.3844979        2.3638964        2.2994648        2.2504985
       2.3635696        2.3461861        2.3499702        2.3246164        2.2857620        2.2356108 </pre>
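<p>As a check on what <code>rescale</code> is doing, the &quot;standardize&quot; option corresponds to subtracting each column's mean and dividing by its standard deviation. A minimal sketch of the equivalent manual calculation:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Manual standardization, equivalent to
// rescale(yields, "standardize")
yields_manual = (yields - meanc(yields)') ./ stdc(yields)';</code></pre>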
<h3 id="fitting-the-pca-model">Fitting the PCA Model</h3>
<p>Next, we will use the <code>pcaFit</code> procedure from GML to fit our principal components analysis model. </p>
<p>The <code>pcaFit</code> procedure requires two inputs: a data matrix and the number of components to compute. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">struct pcaModel mdl;
mdl = pcaFit(x, n_components);</code></pre>
<hr>
<dl>
<dt>X</dt>
<dd>$N \times P$ matrix, feature data to be reduced.</dd>
<dt>n_components</dt>
<dd>Scalar, the number of components to compute.
<hr></dd>
</dl>
<p>The <code>pcaFit</code> procedure stores all output in a <code>pcaModel</code> structure. The most relevant members of the <code>pcaModel</code> structure include:</p>
<hr>
<dl>
<dt>mdl.singular_values</dt>
<dd>$n_{components} \times 1$ vector, the largest singular values of X. Equal to the square root of the eigenvalues.</dd>
<dt>mdl.components</dt>
<dd>$P \times n_{components}$ matrix, the principal component vectors which represent the directions of greatest variance. Also known as the factor loadings.</dd>
<dt>mdl.explained_variance_ratio</dt>
<dd>$n_{components} \times 1$ vector, the variance explained by each of the returned component vectors.
<hr></dd>
</dl>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Perform PCA on normalized yields
*/
// Specify number of components
n_components = 6;

// `pcaModel` structure for holding
//  output from model
struct pcaModel mdl;
mdl = pcaFit(yields_norm, n_components);</code></pre>
<h2 id="dissecting-results">Dissecting Results</h2>
<p>After running the <code>pcaFit</code> procedure, results are printed to the <strong>Command Window</strong>. These results include:</p>
<ul>
<li>A general summary of the model.</li>
<li>The proportion of variance explained by each component.</li>
<li>The loadings for all variables in each component.</li>
</ul>
<h3 id="general-summary">General Summary</h3>
<p>The general summary provides basic information about the model setup, including the number of variables in the original data and the number of components found. </p>
<pre>==================================================
Model:                                         PCA
Number observations:                           399
Number variables:                                6
Number components:                               6
==================================================</pre>
<h3 id="proportion-of-variance">Proportion of Variance</h3>
<p>The proportion of variance table tells us how much of the total variance in the data is described by each principal component. </p>
<pre>Component                Proportion     Cumulative
                        Of Variance     Proportion
PC1                           0.960          0.960
PC2                           0.038          0.997
PC3                           0.002          1.000
PC4                           0.000          1.000
PC5                           0.000          1.000
PC6                           0.000          1.000 </pre>
<p>For the Treasury bill and bond yields, the first component captures 96.0% of the total variance, while the first three components explain nearly all of it. If our goal were data reduction for use in a later model, this is quite promising: we could capture 96% of the variance of all 6 of our original variables using just the first principal component.  </p>
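<p>When all components are computed, these proportions can also be recovered directly from the singular values, since the squared singular values are proportional to the variance captured by each component. A sketch using the <code>pcaModel</code> structure members described above:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Proportion of variance from the singular values
ev = mdl.singular_values.^2;
prop = ev ./ sumc(ev);

// Cumulative proportion of variance
cum = cumsumc(prop);</code></pre>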
<h3 id="the-factor-loadings">The Factor Loadings</h3>
<pre>===========================================================================
Principal
components            PC1       PC2       PC3       PC4       PC5       PC6
===========================================================================
GS3M              -0.4079    0.4111    0.4863   -0.5416    0.3029    0.2076
GS6M              -0.4094    0.3883    0.1535    0.2221   -0.5448   -0.5585
GS1               -0.4122    0.2970   -0.2404    0.6120    0.1557    0.5342
GS3               -0.4154   -0.0855   -0.5911   -0.1926    0.4744   -0.4567
GS5               -0.4102   -0.3607   -0.2806   -0.3932   -0.5725    0.3750
GS10              -0.3939   -0.6742    0.5040    0.3020    0.1856   -0.1024
</pre>
<p>The factor loadings indicate how much each of the variables contributes to the component. As noted in the Brooks example, they also offer some insight into the yield curve:</p>
<table>

<tbody>
<tr><td style="width:20%">PC1</td><td><ul><li>All maturities have the same sign and a similar magnitude.</li><li>Captures changes in the level, or parallel shifts, of the yield curve.</li></ul></td></tr>
<tr><td style="width:20%">PC2</td><td><ul><li>Short-term and long-term maturities have opposing signs.</li><li>Short-term and long-term maturities move in opposite directions.</li><li>Captures changes in the slope, or the steepening/flattening, of the yield curve.</li></ul></td></tr>
<tr><td style="width:20%">PC3</td><td><ul><li>Shortest and longest-term maturities have the same sign, while the middle maturities have the opposite sign.</li><li>Reflects changes in the curvature of the curve.</li></ul></td></tr>
</tbody>
</table>
<h2 id="transforming-original-data">Transforming Original Data</h2>
<p>After fitting the PCA model, we can use the results to transform our original data into its principal components using the <code>pcaTransform</code> procedure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Transform original data
x_trans = pcaTransform(yields_norm, mdl);</code></pre>
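<p>Under the hood, this transformation projects the (centered) data onto the factor loadings. Because our yields are already standardized, the result should match a direct matrix product; a sketch, assuming <code>mdl.components</code> holds the loadings as described above:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Project the standardized yields onto the loadings
x_check = yields_norm * mdl.components;</code></pre>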
<p>Since the first three components capture most of the variation in our data, let's look at them in a plot:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583515" /></a></p>
<p>If you're familiar with U.S. interest rates, this plot likely seems to contradict what we observe in the real world. As we said earlier, the first principal component represents the overall level of interest rates. However, our plot of the first principal component shows an overall upward trend through 2022, with a sharp downtick starting post-2022, exactly the opposite of the overall trend in U.S. interest rates.</p>
<p>This highlights an important feature of PCA: <b>the sign on the factor loadings is arbitrary</b>. </p>
<p>The signs can all be flipped without any change to our analysis. For example, if we multiply all our factor loadings by -1, our principal components look like:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca-flipped.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/03/treasury-pca-flipped.jpg" alt="" width="800" height="600" class="aligncenter size-full wp-image-11583516" /></a></p>
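<p>Because each score is just the linear combination $Z = X\phi$, negating every loading negates the scores; a one-line sketch using the <code>x_trans</code> computed above:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Flipping the sign of all loadings flips
// the sign of the component scores
x_trans_flipped = -x_trans;</code></pre>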
<h3 id="conclusion">Conclusion</h3>
<p>In today's blog, we've seen that PCA is a powerful data analysis tool with uses beyond data reduction. We've also explored how to use the GAUSS Machine Learning library to fit a PCA model and transform data.</p>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/" target="_blank" rel="noopener">Predicting Recessions with Machine Learning Techniques</a>  </li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a>  </li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a>  </li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a>  </li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a>  </li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a><br />

    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script></li>
</ol>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Predicting Recessions with Machine Learning Techniques</title>
		<link>https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/</link>
					<comments>https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 21 Feb 2023 20:03:05 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Machine learning]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11583378</guid>

					<description><![CDATA[Forecasts have become a valuable commodity in today's data-driven world. Unfortunately, not all forecasting models are of equal caliber, and incorrect predictions can lead to costly decisions. 

Today we will compare the performance of several prediction models used to predict recessions. In particular, we’ll look at how a traditional baseline econometric model compares to machine learning models. 

Our models will include: 
<ul>
<li> A baseline probit model.</li>
<li>K-nearest neighbors.</li>
<li>Decision forests.</li>
<li>Ridge classification.</li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Forecasts have become a valuable commodity in today's data-driven world. Unfortunately, not all forecasting models are of equal caliber, and incorrect predictions can lead to costly decisions. </p>
<p>Today we will compare the performance of several prediction models used to predict recessions. In particular, we’ll look at how a traditional baseline econometric model compares to machine learning models. </p>
<p>Our models will include: </p>
<ul>
<li>A baseline <a href="https://www.aptech.com/examples/cmlmt/ordered-probit-estimation-with-constrained-maximum-likelihood/" target="_blank" rel="noopener">probit model</a>.</li>
<li><a href="https://www.aptech.com/resources/tutorials/gml/k-nearest-neighbor-classification/" target="_blank" rel="noopener">K-nearest neighbors</a>.</li>
<li><a href="https://www.aptech.com/resources/tutorials/gml/random-forests-salary/" target="_blank" rel="noopener">Decision forests</a>.</li>
<li><a href="https://docs.aptech.com/gauss/ridgefit.html#ridgeFit" target="_blank" rel="noopener">Ridge classification</a>.</li>
</ul>
<div class="alert alert-info" role="alert">The aim of today’s blog isn’t to provide a definitive answer on what model is best, but rather to provide background and context for different models. We will look more closely at model tuning and optimization in a later blog.</div>
<h2 id="background">Background</h2>
<p>Before diving into estimating our models, let's look more closely at the data and models we will be using. </p>
<h3 id="recession-dating">Recession dating</h3>
<p>Today we will focus on predicting recessions, using the <a href="https://www.nber.org/research/business-cycle-dating" target="_blank" rel="noopener">NBER recession indicator</a>. The NBER indicator:</p>
<ul>
<li>Uses a dummy variable to represent periods of expansion and recessions. </li>
<li>Takes a value of 1 during a recession and 0 during an expansion. </li>
<li>Can be directly imported from FRED using the series ID <code>"USREC"</code>. </li>
</ul>
<p>Because the NBER recession data is binary data, our forecasting exercise becomes one of classification. In other words, we want to identify whether an observation is more likely to fall into the non-recession or recession category. </p>
<p>For this reason, we will need to use models that are suitable for <a href="https://www.aptech.com/blog/introduction-to-categorical-variables/" target="_blank" rel="noopener">discrete data and classification</a>. </p>
<h3 id="models">Models</h3>
<h4 id="probit">Probit</h4>
<p>The probit model is a <a href="https://www.aptech.com/blog/update-discrete-choice-application-module/" target="_blank" rel="noopener">discrete choice</a> model which:</p>
<ul>
<li>Is commonly used in classical econometrics to model binary or ordered data.</li>
<li>Estimates the probability that an outcome falls into a specific category. </li>
<li>Has a simple log-likelihood function, which can be used to estimate the model parameters with <a href="https://www.aptech.com/blog/beginners-guide-to-maximum-likelihood-estimation-in-gauss/" target="_blank" rel="noopener">maximum likelihood</a>.</li>
</ul>
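<p>For a binary outcome, the probit log-likelihood mentioned above can be written in a few lines of GAUSS (<code>probitLL</code> is a hypothetical helper shown for illustration, not part of CMLMT):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Probit log-likelihood for coefficients b,
// features x (N x K), and binary outcome y (N x 1)
proc (1) = probitLL(b, x, y);
    local p;
    p = cdfn(x * b);
    retp(sumc(y .* ln(p) + (1 - y) .* ln(1 - p)));
endp;</code></pre>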
<h4 id="k-nearest-neighbor">K-Nearest Neighbor</h4>
<p>The k-nearest neighbor (KNN) method is one of the simplest non-parametric techniques for classification and regression.</p>
<p>KNN relies on the intuition that if an observation is &quot;near&quot; another, it is likely to fall within the same category. </p>
<p>The KNN model:</p>
<ol>
<li>Locates the $k$ nearest neighbors using the observed features and a measure of distance, such as Euclidean distance.</li>
<li>Finds the most common &quot;class&quot; among the $k$ nearest neighbors.</li>
<li>Assigns the most common &quot;class&quot; as the predicted category for the unknown outcome.</li>
</ol>
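<p>The steps above can be sketched in base GAUSS for a single query observation (<code>x_train</code>, <code>y_train</code>, and <code>x_query</code> are hypothetical names; GML's KNN procedures handle this, and much more, for us):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Naive k-nearest neighbor vote for one query row
k = 5;

// Euclidean distance from the query to each training row
d = sqrt(sumr((x_train - x_query).^2));

// Indices sorted from nearest to farthest
idx = sortind(d);

// Majority class among the k nearest neighbors
pred = modec(y_train[idx[1:k]]);</code></pre>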
<h4 id="decision-trees">Decision Trees</h4>
<p>Decision trees are a machine learning method that can be used to predict discrete or continuous data. </p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/02/decforest.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/02/decforest.jpg" alt="" width="600" height="450" class="aligncenter size-full wp-image-11583442" /></a></p>
<p>Tree-based methods rely on a fairly simple process:</p>
<ol>
<li>Split the data into subsets, using the characteristics of the data. For example, if “Married” is one of our observed characteristics, we can split the sample into &quot;Yes&quot; and &quot;No&quot;. We can ask multiple &quot;questions&quot; about our data to create branches that break our data into smaller and smaller subsets. </li>
<li>The most frequently occurring outcome within each subset is then used as the predicted class for all observations that fall inside that subset. </li>
</ol>
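<p>As a toy illustration of the splitting step, a single split on one feature can be chosen by scanning candidate thresholds (<code>xc</code> is a hypothetical feature column and <code>y</code> the binary outcome; a real decision tree repeats this search recursively and assigns the majority class on each side of the split):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Scan thresholds on one feature; keep the split with
// the lowest training misclassification rate
thr = unique(xc);
best_err = 1;
best_thr = thr[1];

for i(1, rows(thr), 1);
    // Classify as 1 when the feature exceeds the threshold
    pred = (xc .>= thr[i]);
    err = meanc(pred ./= y);
    if err &lt; best_err;
        best_err = err;
        best_thr = thr[i];
    endif;
endfor;</code></pre>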
<h4 id="ridge-regression">Ridge Regression</h4>
<p>Ridge regression is part of a family of linear regression models that aim to improve on the standard least squares fitting model. These methods use a modified least squares approach to shrink coefficient estimates towards zero, which in turn, reduces the estimates’ variances. </p>
<p>Like <a href="https://www.aptech.com/resources/tutorials/econometrics/linear-regression/" target="_blank" rel="noopener">OLS</a>, these methods rely on minimizing the residual sum of squares (RSS) to estimate coefficients. However, they add a penalty, based on cumulative coefficient size, to the RSS objective function.</p>
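<p>For ridge regression, the penalized objective has a closed-form solution, $\hat{\beta} = (X'X + \lambda I)^{-1}X'y$, which can be sketched directly (<code>lambda</code> below is a hypothetical penalty value; in practice it is chosen by tuning):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Ridge estimator: shrinks coefficients toward zero
lambda = 0.1;
k = cols(x);
b_ridge = invpd(x'x + lambda * eye(k)) * (x'y);</code></pre>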
<h2 id="model-setup">Model Setup</h2>
<p>Today we will include a number of variables in our model. These are chosen based on commonly used predictors in the recession modeling literature:  </p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="recession-model-predictors"><span style="color:#FFFFFF">Recession Model Predictors</span></h3>
      </th>
   </tr>
<tr><th>Variable</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>INDPRO</td><td>Monthly growth rates of industrial production. Included in the level and 1-month lag.</td></tr>
<tr><td>PAYEMS</td><td>Monthly growth rates of nonfarm payrolls. Included in the level and 1-month lag.</td></tr>
<tr><td>RPI</td><td>Monthly growth rates of real personal income excluding transfer payments. Included in the level and 1-month lag.</td></tr>
<tr><td>UNRATE</td><td>Annual growth rate of headline unemployment. Included in the level and 1-month lag.</td></tr>
<tr><td>YLD</td><td>The yield curve slope, computed as the difference between the yield on the 10-year treasury bond and the 3-month treasury bill. Included in the level, 6-month lag, and 12-month lag.</td></tr>
<tr><td>CORP</td><td>The credit spread between Moody's BAA and AAA corporate bond yields. Included in the level, 6-month lag, and 12-month lag.</td></tr>
</tbody>
</table>
<p>Our complete dataset ranges from January, 1963 to December, 2022. </p>
<table>
<tbody>
<tr><td>Training period</td><td>January, 1963 to December, 1998</td></tr>
<tr><td>Testing period</td><td>January, 1999 to December, 2022</td></tr>
</tbody>
</table>
<p>The complete dataset, including lags, is available <a href="https://github.com/aptech/gauss_blog/blob/a202120902f4acdb80bbd50589480e6871359257/machine-learning/recession-predicting/data/final_data.gdat" target="_blank" rel="noopener">here</a>.</p>
<h3 id="model-comparison">Model Comparison</h3>
<p>There are many components to evaluating how well a classification model performs. To compare models, we will use a set of binary class metrics including:</p>
<table>
 <thead>
 <tr>
      <th style="background-color: #36434C" colspan="3"><h3 id="model-comparison-measures"><span style="color:#FFFFFF">Model Comparison Measures</span></h3>
      </th>
   </tr>
<tr><th>Tool</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>Confusion matrix</td><td>Summarizes the performance of a classification algorithm. Compares the number of predicted outcomes to actual outcomes in tabular form.</td></tr>
<tr><td>Accuracy</td><td>Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.</td></tr>
<tr><td>Precision</td><td>How good a model is at correctly identifying positive outcomes. Equal to the number of true positives divided by the number of false positives plus true positives. </td></tr>
<tr><td>Recall</td><td>How good a model is at correctly predicting all the positive outcomes. Equal to the number of true positives divided by the number of false negatives plus true positives.</td></tr>
<tr><td>F-score</td><td>The harmonic mean of the precision and recall. A score of 1 indicates perfect precision and recall.</td></tr>
<tr><td>Specificity</td><td>Ability to predict a true negative. Equal to the number of true negatives divided by the number of true negatives plus false positives.</td></tr>
<tr><td>Area under the ROC</td><td>Reflects the probability that a model ranks a random positive more highly than a random negative.</td></tr>
</tbody>
</table>
<p>It's important to view these metrics in the context of the data being modeled. For example, our data is not very balanced across classes. There are 263 non-recession observations and 28 recession observations. This implies that:</p>
<ul>
<li>Model accuracy is not a very informative metric. If we predict that all observations are non-recession, our accuracy is 90%.</li>
<li>F-score is a better metric for us to consider. It gives a more balanced picture of how our model performs across both the recession and non-recession class. </li>
</ul>
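<p>Given the four cells of a binary confusion matrix, the metrics above reduce to simple ratios. A sketch with hypothetical counts (not taken from our models):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Hypothetical confusion matrix counts
tp = 20;    // true positives
fp = 8;     // false positives
fn = 8;     // false negatives
tn = 255;   // true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn);
precision = tp / (tp + fp);
recall = tp / (tp + fn);
f_score = 2 * precision * recall / (precision + recall);
specificity = tn / (tn + fp);</code></pre>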
<h2 id="estimation">Estimation</h2>
<p>We will use two GAUSS libraries to estimate our models:</p>
<ul>
<li><a href="https://store.aptech.com/gauss-applications-category/constrained-maximum-likelihood-mt.html" target="_blank" rel="noopener">Constrained Maximum Likelihood MT (CMLMT)</a> to estimate the probit model.</li>
<li><a href="https://docs.aptech.com/gauss/gml-landing.html" target="_blank" rel="noopener">GAUSS Machine Learning (GML)</a> to estimate our machine learning models.</li>
</ul>
<h3 id="loading-our-data-and-libraries">Loading our data and libraries</h3>
<p>To start, we will <a href="https://docs.aptech.com/gauss/data-management/programmatic-import.html" target="_blank" rel="noopener">load our data</a> directly from its URL:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load libraries
library gml, cmlmt;

/*
** Load data from url
*/
url = "https://github.com/aptech/gauss_blog/blob/master/machine-learning/recession-predicting/data/final_data.gdat?raw=true";
reg_data = loadd(url);

// Compute summary statistics
dstatmt(reg_data);</code></pre>
<p>This loads our regression dataset and prints a table of <a href="https://www.aptech.com/blog/getting-to-know-your-data-with-gauss-22/" target="_blank" rel="noopener">summary statistics</a> to the <strong>Command Window</strong>:</p>
<pre>----------------------------------------------------------------------------------------
Variable         Mean     Std Dev     Variance     Minimum     Maximum    Valid  Missing
----------------------------------------------------------------------------------------

date            -----       -----        -----  1963-01-01  2022-12-01      720     0
USREC          0.1181      0.3229       0.1043           0           1      720     0
INDPRO         0.1976      0.9403       0.8842       -13.2       6.275      720     0
PAYEMS         0.1428      0.5746       0.3302      -13.59       3.431      720     0
RPI            0.2627       1.253        1.569      -13.55          20      720     0
UNRATE       -0.03208       1.393        1.941        -8.6        11.1      720     0
corp           -1.021      0.4389       0.1926       -3.38       -0.32      720     0
yld             1.496       1.221        1.492       -2.65        4.42      720     0
yld_l6          1.504       1.215        1.475       -2.65        4.42      720     0
yld_l12           1.5       1.215        1.475       -2.65        4.42      720     0
corp_l6        -1.017      0.4397       0.1933       -3.38       -0.32      720     0
corp_l12       -1.015      0.4403       0.1939       -3.38       -0.32      720     0
ip_l           0.1986      0.9397        0.883       -13.2       6.275      720     0
nfp_l          0.1425      0.5747       0.3302      -13.59       3.431      720     0
rpi_l          0.2632       1.253        1.569      -13.55          20      720     0
un_l         -0.03222       1.393        1.942        -8.6        11.1      720     0 </pre>
<div class="alert alert-info" role="alert">The file <i>final_data.gdat</i> uses the GAUSS data file format introduced in <a href="https://www.aptech.com/blog/gauss23/" target="_blank" rel="noopener">GAUSS 23</a>. The dataset is compiled from raw data pulled from <a href="https://www.aptech.com/blog/importing-fred-data-to-gauss/" target="_blank" rel="noopener">FRED</a>. You can view the data import, transformation, and merging steps <a href="https://github.com/aptech/gauss_blog/blob/a202120902f4acdb80bbd50589480e6871359257/machine-learning/recession-predicting/code/data-summary.gss" target="_blank" rel="noopener">here</a>.</div>
<h3 id="splitting-data">Splitting Data</h3>
<p>Next, we will use the <a href="https://docs.aptech.com/gauss/traintestsplit.html" target="_blank" rel="noopener"><code>trainTestSplit</code></a> function to split the data into training and test sets.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Split data
*/

// Dependent data 
y = reg_data[., "USREC"];

// Load independent variables
x = reg_data[., 3:cols(reg_data)];

// Split data into (60%) training
// and (40%) test sets
shuffle = "False";
{ y_train, y_test, x_train, x_test } = 
     trainTestSplit(y, x, 0.6, shuffle);</code></pre>
<div class="alert alert-info" role="alert">Because our data is <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/" target="_blank" rel="noopener">time series data</a>, it is important to keep the sequential ordering. To do this, we turn &quot;shuffling&quot; off when splitting the data.</div>
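<p>For readers without GAUSS, the same sequential split can be sketched in a few lines of Python (illustrative only; GML's <code>trainTestSplit</code> handles this internally):</p>

```python
import numpy as np

def sequential_split(y, X, train_pct):
    """Split data sequentially (no shuffling) to preserve time ordering."""
    n_train = int(round(train_pct * len(y)))
    return y[:n_train], y[n_train:], X[:n_train], X[n_train:]

# Example: 60/40 split of 10 ordered observations
y = np.arange(10)
X = np.arange(20).reshape(10, 2)
y_train, y_test, X_train, X_test = sequential_split(y, X, 0.6)
```

<p>Because the first 60% of rows become the training set, every test observation occurs after every training observation, which is exactly what we want for time series data.</p>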
<h2 id="probit-model-results">Probit Model Results</h2>
<p>To estimate the probit model we will rely on the probit likelihood function:</p>
<p>$$LL(\beta|y;X) = \sum^N_{i=1} \big[y_i ln(F(x_i \beta)) + (1 - y_i)ln(1 - F(x_i \beta))\big]$$</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Likelihood Function
*/
proc (1) = probit(beta_, y, X, ind);
    local mu;

    // Declare 'mm' to be a modelResults
    // structure to hold the function value
    struct modelResults mm;

    // Compute mu
    mu = X * beta_;

    // Assign the log-likelihood value to the
    // 'function' member of the modelResults structure
    mm.function = y.*lncdfn(mu) + (1-y).*lncdfnc(mu);

    // Return the model results structure
    retp(mm);
endp;</code></pre>
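<p>For readers without GAUSS, the log-likelihood above can be sketched in plain Python (an illustration of the formula, not the CMLMT implementation):</p>

```python
from math import erf, log, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def probit_loglik(beta, y, X):
    """Sum of y*ln(F(x'b)) + (1-y)*ln(1-F(x'b)) over observations."""
    ll = 0.0
    for yi, xi in zip(y, X):
        mu = sum(b * x for b, x in zip(beta, xi))
        ll += yi * log(norm_cdf(mu)) + (1 - yi) * log(1.0 - norm_cdf(mu))
    return ll
```

<p>A handy sanity check: at <code>beta = 0</code> every fitted probability is 0.5, so the log-likelihood collapses to n&middot;ln(0.5).</p>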
<p>We can quickly estimate this model using the GAUSS <code>cmlmt</code> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Estimate model
*/
// Assign starting values for estimation
beta_strt = 0.5*ones(cols(x), 1);

// Declare 'out' to be a cmlmtResults structure
// to hold the results of the estimation
struct cmlmtResults cout;

// Perform estimation and print results
cout = cmlmt(&amp;probit, beta_strt, y_train, x_train);
call cmlmtPrt(cout);</code></pre>
<p>The fitted probit model can be used to predict the probability that an observation lies in a recessionary period, given the observed data. Using a 50% cutoff, we sort the predictions into recession and non-recession periods:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions
*/
// Extract parameters
beta_hat = pvUnpack(cout.par, "x");

// Predicted probability of recession 
y_prob = cdfn(x_test * beta_hat);

// Classify data as recession or non-recession
y_hat = where(y_prob .&gt;= 0.5, 1, 0);</code></pre>
<p>Plotted against the observed recession dates, the estimated probability of recession looks fairly good:</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/02/probit-recession-plot-1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/02/probit-recession-plot-1.jpg" alt="Demonstrate the use of probit model to estimate recession periods. " width="800" height="600" class="aligncenter size-full wp-image-11583473" /></a></p>
<p>However, we can get a more robust evaluation of model performance using the <a href="https://docs.aptech.com/gauss/classificationmetrics.html" target="_blank" rel="noopener"><code>binaryClassMetrics</code></a> procedure from the GML library:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">call binaryClassMetrics(y_test, y_hat);</code></pre>
<p>The first portion of this report is the Confusion Matrix:</p>
<pre>Probit model with 50% cutoff.
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       22       6
            0 (-)       17     243 </pre>
<p>The confusion matrix provides a summary of how many predictions our model got &quot;right&quot; and how many it got &quot;wrong&quot;, based on which category they fall in:
<a href="https://www.aptech.com/wp-content/uploads/2023/04/confusionmatrix.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/04/confusionmatrix.jpg" alt="" width="506" height="506" class="aligncenter size-full wp-image-11583769" /></a></p>
<p>The confusion matrix for our estimated probit model shows:</p>
<ul>
<li>22 recession periods are correctly predicted, while 6 recessions are missed (false negatives).</li>
<li>243 non-recession periods are correctly predicted, while 17 non-recession periods are incorrectly flagged as recessions (false positives).</li>
</ul>
<p>The remaining statistics help quantify these outcomes more clearly:</p>
<pre>             Accuracy           0.9201
            Precision           0.5641
               Recall           0.7857
              F-score           0.6567
          Specificity           0.9346
    Balanced Accuracy           0.8692 </pre>
<p>Overall, the probit model:</p>
<ul>
<li>Has an F-score of 66%.</li>
<li>Is better at predicting negative outcomes (93% specificity) than positive outcomes (56% precision).</li>
</ul>
<h2 id="knn-model-results">KNN Model Results</h2>
<p>We start our machine learning models with KNN, fitting it to the same training data using the <a href="https://docs.aptech.com/gauss/knnfit.html" target="_blank" rel="noopener"><code>knnFit</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Train the model
*/

// Specify the number of neighbors
k = 5;

// The knnModl structure 
// holds the trained model
struct knnModel mdl;

// Train model using KNN
mdl = knnFit(y_train, X_train, k);</code></pre>
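<p>Conceptually, KNN classifies a point by a majority vote among its <em>k</em> nearest training observations. A minimal Python illustration of that idea (not GML's implementation, which uses optimized search):</p>

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Classify each row of X_new by majority vote of its k nearest neighbors."""
    preds = []
    for x in np.atleast_2d(X_new):
        dist = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each training row
        nearest = np.argsort(dist)[:k]              # indices of the k closest points
        preds.append(int(y_train[nearest].sum() > k / 2))  # majority vote for 0/1 labels
    return np.array(preds)
```

<p>Points near the low cluster of training data get the low cluster's label, and likewise for the high cluster.</p>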
<p>After fitting the model, the <a href="https://docs.aptech.com/gauss/knnclassify.html" target="_blank" rel="noopener"><code>knnClassify</code></a> procedure can be used to predict outcomes and metrics for the test data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions on the test set
*/
y_hat = knnClassify(mdl, X_test);

// Print out model quality 
// evaluation statistics
print "KNN Model";
call binaryClassMetrics(y_test, y_hat);</code></pre>
<pre>KNN Model
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       20       8
            0 (-)        3     257 </pre>
<p>The confusion matrix for our estimated KNN model shows:</p>
<ul>
<li>20 recession periods are correctly predicted, while 8 recessions are missed (false negatives).</li>
<li>257 non-recession periods are correctly predicted, with only 3 false positives.</li>
</ul>
<pre>         Accuracy           0.9618
        Precision           0.8696
           Recall           0.7143
          F-score           0.7843
      Specificity           0.9885
Balanced Accuracy           0.8514  </pre>
<p>The KNN model:</p>
<ul>
<li>Has an F-score of 78%.</li>
<li>Is better at predicting negative outcomes than positive outcomes.</li>
</ul>
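<p>All six scores follow mechanically from the four confusion-matrix counts. A short Python sketch of the standard definitions (for illustration; GML computes these internally), checked against the KNN report above:</p>

```python
def class_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)            # of predicted positives, share correct
    recall = tp / (tp + fn)               # of true positives, share found
    f_score = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)          # of true negatives, share found
    balanced = (recall + specificity) / 2
    return accuracy, precision, recall, f_score, specificity, balanced

# KNN confusion matrix: 20 TP, 8 FN, 3 FP, 257 TN
metrics = class_metrics(20, 8, 3, 257)
```

<p>Plugging in the KNN counts reproduces the six numbers printed by <code>binaryClassMetrics</code> above.</p>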
<p>Compared to our baseline probit model, the KNN model:</p>
<ul>
<li>Shows improved performance when balancing both classes, with a higher F-score (78% vs. 66%).</li>
<li>Is better at predicting negative outcomes (99% vs. 93% specificity) and makes more precise positive predictions (87% vs. 56% precision), though it misses more recessions (71% vs. 79% recall).</li>
</ul>
<h2 id="decision-forest-classification">Decision Forest Classification</h2>
<p>Next, we fit our decision forest classification model using the <a href="https://docs.aptech.com/gauss/decforestcfit.html" target="_blank" rel="noopener"><code>decForestCFit</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Train the model
*/

// The dfModel structure 
// holds the trained model
struct dfModel dfm;

// Fit training data 
// using decision forest classification
dfm = decForestCFit(y_train, x_train);</code></pre>
<p>After fitting the model, the <a href="https://docs.aptech.com/gauss/decforestpredict.html" target="_blank" rel="noopener"><code>decForestPredict</code></a> procedure can be used to predict outcomes and metrics for the test data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions on the test set
*/
y_hat = decForestPredict(dfm, x_test);

// Print out model quality 
// evaluation statistics
print "Decision Forest";
call binaryClassMetrics(y_test, y_hat);</code></pre>
<pre>Decision Forest
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       25       3
            0 (-)        1     259 </pre>
<p>The confusion matrix for our estimated decision forest model shows:</p>
<ul>
<li>25 recession periods are correctly predicted, with only 3 missed (false negatives).</li>
<li>259 non-recession periods are correctly predicted, with only 1 false positive.</li>
</ul>
<pre>         Accuracy           0.9861
        Precision           0.9615
           Recall           0.8929
          F-score           0.9259
      Specificity           0.9962
Balanced Accuracy           0.9445  </pre>
<p>The decision forest model:</p>
<ul>
<li>Has an F-score of 93%.</li>
<li>Is better at predicting negative outcomes (99% specificity) than positive outcomes (96% precision).</li>
</ul>
<p>Compared to our baseline probit model, the decision forest model:</p>
<ul>
<li>Is much better at balancing performance across both classes (93% vs. 66% F-score).</li>
<li>Is better at predicting both negative outcomes and positive outcomes.</li>
</ul>
<h2 id="ridge-classification">Ridge Classification</h2>
<p>Finally, we estimate the ridge classification model using the <code>ridgeCFit</code> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Train the model
*/

// L2 regularization penalty
lambda = 0.5;

// Declare 'mdl' to be an instance of a
// ridgeModel structure to hold the estimation results
struct ridgeModel mdl;

// Train the model
// using the ridge classification
mdl = ridgeCFit(y_train, X_train, lambda);</code></pre>
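<p>One common way to implement ridge classification (the approach used by, e.g., scikit-learn's <code>RidgeClassifier</code>; GML's internals may differ) is to regress labels recoded from {0, 1} to {-1, +1} with an L2 penalty, then threshold the linear score at zero. A minimal Python sketch of that idea:</p>

```python
import numpy as np

def ridge_classifier_fit(X, y01, lam):
    """Closed-form ridge solution on labels mapped from {0,1} to {-1,+1}."""
    y = 2.0 * np.asarray(y01) - 1.0
    XtX = X.T @ X + lam * np.eye(X.shape[1])   # L2 penalty lam added to the diagonal
    return np.linalg.solve(XtX, X.T @ y)

def ridge_classifier_predict(beta, X):
    """Classify as 1 when the linear score is non-negative."""
    return (X @ beta >= 0).astype(int)
```

<p>The penalty <code>lam</code> shrinks the coefficients toward zero, trading a little bias for lower variance, which is the same role it plays in ridge regression.</p>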
<p>The <code>ridgeCPredict</code> procedure can be used to predict outcomes and metrics for the test data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Predictions on the test set
*/

// Predict classes for the test data
predictions = ridgeCPredict(mdl, x_test);

// Print out model quality 
// evaluation statistics
print "Ridge Classification";
call binaryClassMetrics(y_test, predictions);</code></pre>
<pre>Ridge Classification
==================================
                  Confusion matrix
==================================
                   Predicted class
                   ---------------
                         +       -
       True class
       ----------
            1 (+)       22       6
            0 (-)        4     256 </pre>
<p>The confusion matrix for our estimated ridge classification model shows:</p>
<ul>
<li>22 recession periods are correctly predicted, while 6 recessions are missed (false negatives).</li>
<li>256 non-recession periods are correctly predicted, with only 4 false positives.</li>
</ul>
<pre>         Accuracy           0.9653
        Precision           0.8462
           Recall           0.7857
          F-score           0.8148
      Specificity           0.9846
Balanced Accuracy           0.8852 </pre>
<p>The ridge classification model:</p>
<ul>
<li>Has an F-score of 81%.</li>
<li>Is better at predicting negative outcomes (98% specificity) than positive outcomes (84% precision).</li>
</ul>
<p>Compared to our baseline probit model, the ridge classification model:</p>
<ul>
<li>Balances performance across both classes better (81% vs. 66% F-score).</li>
<li>Is better at predicting both negative and positive outcomes.</li>
</ul>
<h2 id="results-summary">Results Summary</h2>
<table>
 <thead>
<tr><th></th><th style="width:20%">Probit</th><th style="width:20%">KNN</th><th style="width:20%">Decision Forest</th><th style="width:20%">Ridge Classification</th></tr>
</thead>
<tbody>
<tr><th>True Positives</th><td>22</td><td>20</td><td style="background-color: #fde5d2">25</td><td>22</td></tr>
<tr><th>False Positives</th><td>17</td><td>3</td><td style="background-color: #fde5d2">1</td><td>4</td></tr>
<tr><th>True Negatives</th><td>243</td><td>257</td><td style="background-color: #fde5d2">259</td><td>256</td></tr>
<tr><th>False Negatives</th><td>6</td><td>8</td><td style="background-color: #fde5d2">3</td><td>6</td></tr>
<tr><th>Accuracy</th><td>92%</td><td>96%</td><td style="background-color: #fde5d2">99%</td><td>96%</td></tr>
<tr><th>Precision</th><td>56%</td><td>87%</td><td style="background-color: #fde5d2">96%</td><td>84%</td></tr>
<tr><th>Recall</th><td>79%</td><td>71%</td><td style="background-color: #fde5d2">89%</td><td>79%</td></tr>
<tr><th>F-score</th><td>66%</td><td>78%</td><td style="background-color: #fde5d2">92%</td><td>81%</td></tr>
<tr><th>Specificity</th><td>93%</td><td>99%</td><td style="background-color: #fde5d2">99%</td><td>98%</td></tr>
<tr><th>Balanced Accuracy</th><td>87%</td><td>85%</td><td style="background-color: #fde5d2">94%</td><td>89%</td></tr>
</tbody>
</table>
<p>From the summary table, we see that, even without tuning, the decision forest classifier outperforms the other models on every evaluation metric. While all models predict non-recession periods well, the decision forest model is the clear winner for predicting recession periods.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In today's blog, we examined the performance of several models for predicting recessions. You should now have a better understanding of:</p>
<ul>
<li>How to implement machine learning models in GAUSS.</li>
<li>How to compare classification models.</li>
<li>How machine learning models can be used to improve prediction. </li>
</ul>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/applications-of-principal-components-analysis-in-finance/" target="_blank" rel="noopener">Applications of Principal Components Analysis in Finance</a></li>
<li><a href="https://www.aptech.com/blog/predicting-the-output-gap-with-machine-learning-regression-models/" target="_blank" rel="noopener">Predicting The Output Gap With Machine Learning Regression Models</a></li>
<li><a href="https://www.aptech.com/blog/fundamentals-of-tuning-machine-learning-hyperparameters/" target="_blank" rel="noopener">Fundamentals of Tuning Machine Learning Hyperparameters</a></li>
<li><a href="https://www.aptech.com/blog/understanding-cross-validation/" target="_blank" rel="noopener">Understanding Cross-Validation</a></li>
<li><a href="https://www.aptech.com/blog/machine-learning-with-real-world-data/" target="_blank" rel="noopener">Machine Learning With Real-World Data</a></li>
<li><a href="https://www.aptech.com/blog/classification-with-regularized-logistic-regression/" target="_blank" rel="noopener">Classification with Regularized Logistic Regression</a></li>
</ol>
<h2 id="try-out-machine-learning-in-gauss">Try Out Machine Learning in GAUSS</h2>


]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/predicting-recessions-with-machine-learning-techniques/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
