Introducing the GAUSS Data Management Guide
Introduction

If you've worked with real-world data, you know that data cleaning and management can eat up your time. Efficiently tackling tedious data cleaning, organization, and management tasks can have a huge impact on productivity.

We created the GAUSS Data Management Guide with that exact goal in mind. It aims to help you save time and make the most of your data.

Today's blog looks at what the GAUSS Data Management Guide offers and how to best use the guide.

What is the GAUSS Data Management Guide?

The GAUSS Data Management Guide is a comprehensive reference tool for accomplishing data-related tasks in GAUSS. It provides a detailed roadmap for working with data in GAUSS, from basic data import and manipulation to advanced data cleaning and visualization.

The guide is intentionally designed for all levels of GAUSS users with:

  • Extensive coverage.
  • Step-by-step instructions.
  • Annotated examples.

What does the GAUSS Data Management Guide cover?

The GAUSS Data Management Guide includes sections covering everything from data import and manipulation to data cleaning and visualization.

How should I use the GAUSS Data Management Guide?

  • Use page outlines, located on the right-hand side of each page, to identify and navigate to specific tasks.

  • Copy the examples in the guide and paste into GAUSS program files to use as templates.

  • Use the links to the complete function reference pages to find additional support.

Conclusion

The GAUSS Data Management Guide provides practical examples, detailed instructions, and comprehensive coverage that can help you work productively and efficiently with your data.

Using Feasible Generalized Least Squares To Improve Estimates

Introduction

Real-world data analysis is rarely as clean and tidy as it is presented in textbooks. Consider linear regression -- data rarely meets the stringent assumptions required for OLS. Failing to recognize this and incorrectly applying OLS can lead to embarrassing, inaccurate conclusions.

In today's blog, we'll look at how to use feasible generalized least squares to deal with data that does not meet the OLS assumption of Independent and Identically Distributed (IID) error terms.

What Is Feasible Generalized Least Squares (FGLS)?

FGLS is a flexible and powerful tool that provides a reliable approach for regression analysis in the presence of non-constant variances and correlated errors.

Feasible Generalized Least Squares (FGLS) estimates the structure of the error covariance matrix from the data, then uses that estimate to re-weight the regression, extending OLS to handle heteroscedasticity and autocorrelation.

Why is this important?

Recall the fundamental OLS IID assumption which implies that the error terms have constant variance and are uncorrelated. When this assumption is violated:

  • OLS estimators are no longer efficient.
  • The estimated covariance matrix of the coefficients will be inconsistent.
  • Standard inferences will be incorrect.

Unfortunately, many real-world cross-sectional, panel data, and time series datasets do violate this fundamental assumption.

FGLS allows for more accurate modeling of complex, realistic data structures by accommodating heteroscedasticity and autocorrelation in the error terms.

How Does FGLS Work?

FGLS uses a weighting matrix that captures the structure of the variance-covariance matrix of the errors.

This allows FGLS to:

  • Give more weight to observations with smaller variances.
  • Account for correlations.
  • Provide more efficient and unbiased estimates in the presence of non-constant variance and serial correlation.
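
In matrix form (textbook notation, not GAUSS output), if the error covariance matrix is $\mathrm{Var}(\varepsilon \mid X) = \Omega$ rather than the IID case $\sigma^2 I$, the GLS estimator and its feasible counterpart are

$$
\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y, \qquad
\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y,
$$

where $\hat{\Omega}$ is the estimated error covariance matrix produced by the iterative process described below.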

The method uses a relatively simple iterative process:

  1. Pick a method for estimating the covariance matrix based on believed data structure.
  2. Make initial OLS parameter estimates.
  3. Use the OLS residuals and the chosen method to estimate an initial covariance matrix.
  4. Compute FGLS estimates using the estimated covariance matrix for weighting.
  5. Calculate residuals and refine the weighting matrix.
  6. Repeat steps 3, 4, and 5 until convergence.
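
As a concrete illustration of these steps, here is a minimal GAUSS sketch of the iteration for the common AR(1) case (a Cochrane-Orcutt style loop). It assumes a dependent vector 'y' and a regressor matrix 'x' (including a constant) are already in memory; the built-in fgls procedure used later in this post handles all of this automatically.

/*
** Minimal FGLS sketch for AR(1) errors (illustration only)
*/
// Step 2: initial OLS parameter estimates
b_old = olsqr(y, x);

for i(1, 25, 1);
    // Residuals from the original model
    e = y - x * b_old;

    // Step 3: estimate the AR(1) parameter from the residuals
    rho = olsqr(trimr(e, 1, 0), trimr(lagn(e, 1), 1, 0));

    // Steps 4-5: quasi-difference the data and re-estimate
    y_t = trimr(y, 1, 0) - rho .* trimr(lagn(y, 1), 1, 0);
    x_t = trimr(x, 1, 0) - rho .* trimr(lagn(x, 1), 1, 0);
    b_new = olsqr(y_t, x_t);

    // Step 6: stop once the estimates converge
    if maxc(abs(b_new - b_old)) < 1e-8;
        break;
    endif;

    b_old = b_new;
endfor;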

How Do I Know If I Should Use FGLS?

We've already noted that you should use FGLS when you encounter heteroscedasticity and/or autocorrelation. It's easy to say this but how do you identify when this is the case?

There are a number of tools that can help.


Example Tools for Identifying Heteroscedasticity and Autocorrelation

| Tool | Description | Used to Identify |
|---|---|---|
| Scatter plots | Plot the dependent variable against each independent variable and look for patterns that suggest relationships between the variance and the variables. Plot the residuals over time and look for cycles or trends in the residuals. | Heteroscedasticity and autocorrelation. |
| Residual plot | A fan-shaped or funnel-shaped pattern in a plot of the residuals against fitted values indicates that the variance of the residuals is not constant across all levels of the independent variable. A pattern of correlation in plots of residuals against lagged residuals may indicate autocorrelation. | Heteroscedasticity and autocorrelation. |
| Histogram of residuals | Plot a histogram of the residuals. If the histogram is skewed or has unequal spread, it could suggest heteroscedasticity or a non-normal distribution. | Heteroscedasticity. |
| Durbin-Watson statistic | The Durbin-Watson statistic tests for first-order autocorrelation in the residuals. The test statistic ranges from 0 to 4, with values around 2 indicating no autocorrelation. | Autocorrelation. |
| Breusch-Pagan test | The Breusch-Pagan test considers the null hypothesis of homoscedasticity against the alternative of heteroscedasticity. | Heteroscedasticity. |
| Breusch-Godfrey test | The Breusch-Godfrey test extends the Durbin-Watson test to higher-order autocorrelation. The test assesses whether larger lags of residuals and independent variables help explain the current residuals. | Autocorrelation. |
| White test | Similar to the Breusch-Pagan test, the White test considers the null hypothesis of homoscedasticity. | Heteroscedasticity. |
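
Most of these checks are easy to run in GAUSS. As one hedged example, a Breusch-Pagan style LM test can be computed by hand with an auxiliary regression of the squared residuals on the regressors. The sketch below is illustrative only and assumes an n x 1 residual vector 'resid' and an n x k regressor matrix 'x' (including a constant) are already in memory.

// Squared residuals for the auxiliary regression
e2 = resid .^ 2;

// Auxiliary regression of the squared residuals on the regressors
b_aux = olsqr(e2, x);
fitted = x * b_aux;

// R-squared of the auxiliary regression
rsq = 1 - sumc((e2 - fitted).^2) ./ sumc((e2 - meanc(e2)).^2);

// LM statistic: n*R^2, asymptotically chi-squared with (k-1) degrees of freedom
lm = rows(x) * rsq;
pval = cdfChic(lm, cols(x) - 1);

print "Breusch-Pagan LM statistic:" lm;
print "p-value:" pval;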

Example One: US Consumer Price Index (CPI)

Let's get a better feel for FGLS using real-world data. In this application, we will:

  • Find OLS estimates and examine the results for signs of heteroscedasticity and autocorrelation.
  • Compute FGLS estimates and discuss results.

Data

For this example, we will use publicly available FRED time series data:

  • Consumer Price Index for All Urban Consumers: All Items in U.S. City Average (CPIAUCSL), seasonally adjusted.
  • Compensation of employees, paid (COE), seasonally adjusted.

Both variables are quarterly, continuously compounded rates of change spanning from 1947Q2 to 2023Q3.

// Load data 
fred_fgls = loadd("fred_fgls.gdat");

// Preview data
head(fred_fgls);
tail(fred_fgls);
            date              COE         CPIAUCSL
      1947-04-01      0.013842900      0.014184600
      1947-07-01      0.015131100      0.021573900
      1947-10-01      0.030381600      0.027915600
      1948-01-01      0.025448400      0.020966300
      1948-04-01      0.011788800      0.015823300

            date              COE         CPIAUCSL
      2022-07-01      0.023345300      0.013491800
      2022-10-01     0.0048207000      0.010199500
      2023-01-01      0.021001700     0.0093545000
      2023-04-01      0.013436000     0.0066823000
      2023-07-01      0.013337800     0.0088013000

OLS Estimation

Let's start by using OLS to examine the relationship between COE and CPI returns. We'll be sure to have GAUSS save our residuals so we can use them to evaluate OLS performance.

// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();

// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the
// Durbin-Watson statistic
ols_ctl.res = 1;

// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;

// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(fred_fgls, "CPIAUCSL ~ COE", ols_ctl);
Valid cases:                   306      Dependent variable:            CPIAUCSL
Missing cases:                   0      Deletion method:                   None
Total SS:                    0.019      Degrees of freedom:                 304
R-squared:                   0.197      Rbar-squared:                     0.195
Residual SS:                 0.016      Std error of est:                 0.007
F(1,304):                   74.673      Probability of F:                 0.000
Durbin-Watson:               0.773

                         Standard                 Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT   0.00397578 0.000678566     5.85909     0.000       ---         ---
COE          0.303476   0.0351191     8.64133     0.000    0.444067    0.444067

Evaluating the OLS Results

Taken at face value, these results look good. The standard errors on both estimates are small and both variables are statistically significant. We may be tempted to stop there. However, let's look more closely using some of the tools mentioned earlier.

Checking For Heteroscedasticity

First, let's create some plots using our residuals to check for heteroscedasticity. We will look at:

  • A histogram of the residuals.
  • The residuals versus the independent variable.
/*
** Plot a histogram of the residuals 
** Check for skewed distribution
*/
plotHist(ols_out.resid, 50);

Our histogram indicates that the residuals from our OLS regression are asymmetric and slightly skewed. While the results aren't dramatic, they warrant further exploration to check for heteroscedasticity.

/*
** Plot residuals against COE
** Check for increasing or decreasing variance 
** as the independent variable changes.
*/
plotScatter(fred_fgls[., "COE"], ols_out.resid);

It's hard to determine whether these results are indicative of heteroscedasticity or not. Let's add random normal observations to our scatter plot and see how they compare.

// Add random normal observations to our scatter plot,
// divided by 100 to put them on the same scale as the residuals
rndseed 897680;
plotAddScatter(fred_fgls[., "COE"], rndn(rows(ols_out.resid), 1)/100);

Our residual plot doesn't vary substantially from the random normal observations and there isn't strong visual evidence of heteroscedasticity.

If we did have heteroscedasticity, our residuals would exhibit a fan-like shape, indicating a change in the spread between residuals as our observed data changes. For example, consider this plot of hypothetical residuals against COE:

Example of heteroscedasticity observed in a residual plot.

Checking For Autocorrelation

Comparison of error terms with autocorrelation and without.

As you may have noticed, we don't have to look further than our OLS results for signs of autocorrelation. The olsmt procedure reports the Durbin-Watson statistic as part of the printed output. For this regression, the Durbin-Watson statistic is 0.773, which is significantly below 2, suggesting positive autocorrelation.

We can find further support for this conclusion by inspecting our residual plots, starting with a plot of the residuals against time.

// Checking for autocorrelation
/*
** Plot the residuals over time and 
** look for cycles or trends to 
** check for autocorrelation.
*/
plotXY(fred_fgls[., "date"], ols_out.resid);

Our time plot of residuals:

  • Has extended periods of large residuals (roughly 1970-1977, 1979-1985, and 2020-2022).
  • Suggests positive autocorrelation.

Now let's examine the plot of our residuals against lagged residuals:

/*
** Plot residuals against lagged residuals 
** look for relationships and trends
*/
// Lag residuals (the first value will be missing)
lagged_res = lagn(ols_out.resid, 1);

// Plot residuals against lagged residuals
plotScatter(lagged_res, ols_out.resid);

This plot gives an even clearer visual of our autocorrelation issue, demonstrating:

  • A clear linear relationship between the residuals and their lags.
  • Larger residuals in the previous period lead to larger residuals in the current period.

FGLS Estimation

After examining the results more closely from the OLS estimation, we have clear support for using FGLS. We can do this using the fgls procedure, introduced in GAUSS 24.

The GAUSS fgls Procedure

The fgls procedure allows for model specification in one of two styles. The first style requires a dataframe input and a formula string:

// Calling fgls using a dataframe and formula string
out = fgls(data, formula);

The second option requires an input matrix or dataframe containing the dependent variable and an input matrix or dataframe containing the independent variables:

// Calling fgls using dependent variable
// and independent variable inputs
out = fgls(depvar, indvars);

Both options also allow for:

  • An optional input specifying the computation method for the weighting matrix. GAUSS includes 7 pre-programmed options for the weighting matrix or allows for a user-specified weighting matrix.
  • An optional fglsControl structure input for advanced estimation settings.
out = fgls(data, formula [, method, ctl])

The results from the FGLS estimation are stored in a fglsOut structure containing the following members:

| Member | Description |
|---|---|
| out.beta_fgls | The feasible generalized least squares estimates of the parameters. |
| out.sigma_fgls | Covariance matrix of the estimated parameters. |
| out.se_fgls | Standard errors of the estimated parameters. |
| out.ci | Confidence intervals of the estimated parameters. |
| out.t_stats | The t-statistics of the estimated parameters. |
| out.pvts | The p-values of the t-statistics of the estimated parameters. |
| out.resid | The estimated residuals. |
| out.df | Degrees of freedom. |
| out.sse | Sum of squared errors. |
| out.sst | Total sum of squares. |
| out.std_est | Standard deviation of the residuals. |
| out.fstat | Model F-statistic. |
| out.pvf | P-value of the model F-statistic. |
| out.rsq | R-squared. |
| out.dw | Durbin-Watson statistic. |
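
As a quick sketch of how these members are used (with placeholder inputs 'data' and 'formula'), the individual results can be accessed directly after estimation:

// Declare the output structure and estimate
struct fglsOut out;
out = fgls(data, formula);

// Access individual results from the output structure
print "FGLS coefficients:";
print out.beta_fgls;

print "Standard errors:";
print out.se_fgls;

print "Durbin-Watson statistic:";
print out.dw;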

Running FGLS

Let's use FGLS and see if it helps with autocorrelation. We'll start with the default weighting matrix, which is an AR(1) structure.

// Estimate FGLS parameters using
// default setting
struct fglsOut fOut;
fOut = fgls(fred_fgls, "CPIAUCSL ~ COE");
Valid cases:                    306          Dependent variable:        CPIAUCSL
Total SS:                     0.019          Degrees of freedom:             304
R-squared:                    0.140          Rbar-squared:                 0.137
Residual SS:                  0.017          Std error of est:             0.007
F(1,304)                     49.511          Probability of F:             0.000
Durbin-Watson                 0.614

--------------------------------------------------------------------------------
                        Standard                    Prob
Variable   Estimates       Error     t-value        >|t|  [95% Conf.   Interval]
--------------------------------------------------------------------------------

Constant     0.00652    0.000908        7.19       0.000     0.00474      0.0083
COE             0.14      0.0286         4.9       0.000       0.084       0.196

The FGLS estimates using the AR(1) weighting matrix differ from our OLS estimates in both the coefficients and the standard errors.

Let's look at a plot of our residuals:

// Plot FGLS residuals against lagged residuals
lagged_resid = lagn(fOut.resid, 1);
plotScatter(lagged_resid, fOut.resid);

Our residuals suggest that FGLS hasn't fully addressed our autocorrelation. What should we take from this?

This likely means that we need to consider higher-order autocorrelation. We may want to extend this analysis by:

  • Testing for higher-order autocorrelation, for example with the Breusch-Godfrey test.
  • Trying one of the other pre-programmed weighting matrix options offered by the fgls procedure.

Ready to give FGLS a try? Get started with a GAUSS demo today!

Example Two: American Community Survey

Now let's consider a second example using a subset of data from the 2019 American Community Survey (ACS).

Data

The 2019 ACS data subset was cleaned and provided by the Social Science Computing Cooperative at the University of Wisconsin-Madison.

The survey data subset contains 5000 observations of the following variables:

| Variable | Census Codebook Name | Description |
|---|---|---|
| household | SERIALNO | Housing unit identifier. |
| person | SPORDER | Person number. |
| state | ST | State. |
| age | AGEP | Age in years. |
| other_languages | LANX | Another language is spoken at home. |
| english | ENG | Self-rated ability to speak English, if another language is spoken. |
| commute_time | JWMNP | Travel time to work in minutes, top-coded at 200. |
| marital_status | MAR | Marital status. |
| education | SCHL | Educational attainment, collapsed into categories. |
| sex | SEX | Sex (male or female). |
| hours_worked | WKHP | Usual hours worked per week in the past 12 months. |
| weeks_worked | WKHN | Weeks worked per year in the past 12 months. |
| race | RAC1P | Race. |
| income | PINCP | Total income in current dollars, rounded. |

Let's run a naive model of income against two independent variables, age and hours_worked.

Our first step is loading our data:

/*
** Step One: Data Loading 
** Using the 2019 ACS 
*/
// Load data 
acs_fgls = loadd("acs2019sample.dta", "income + age + hours_worked");

// Review the summary statistics
dstatmt(acs_fgls);
---------------------------------------------------------------------------------------------
Variable             Mean     Std Dev      Variance     Minimum     Maximum    Valid Missing
---------------------------------------------------------------------------------------------

income          4.062e+04   5.133e+04     2.634e+09       -8800   6.887e+05     4205    795
age                 43.38       24.17           584           0          94     5000      0
hours_worked        38.09       13.91         193.5           1          99     2761   2239

Based on our descriptive statistics there are a few data cleaning steps that will help our model:

  • Remove missing values using the packr procedure.
  • Transform income to thousands of dollars to improve data scaling.
  • Remove cases with negative incomes.
// Remove missing values
acs_fgls = packr(acs_fgls);

// Transform income
acs_fgls[., "income"] = acs_fgls[., "income"]/1000;

// Filter out cases with negative incomes
acs_fgls = delif(acs_fgls, acs_fgls[., "income"] .< 0);

OLS Estimation

Now we're ready to run a preliminary OLS estimation.

// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();

// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the Durbin-Watson statistic
ols_ctl.res = 1;

// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;

// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(acs_fgls, "income ~ age + hours_worked", ols_ctl);
Valid cases:                  2758      Dependent variable:              income
Missing cases:                   0      Deletion method:                   None
Total SS:              8771535.780      Degrees of freedom:                2755
R-squared:                   0.147      Rbar-squared:                     0.146
Residual SS:           7481437.527      Std error of est:                52.111
F(2,2755):                 237.536      Probability of F:                 0.000
Durbin-Watson:               1.932

                             Standard                 Prob   Standardized  Cor with
Variable         Estimate      Error      t-value     >|t|     Estimate    Dep Var
-----------------------------------------------------------------------------------
CONSTANT         -31.0341     3.91814    -7.92062     0.000       ---         ---
age              0.762573   0.0620066     12.2983     0.000    0.216528    0.227563
hours_worked      1.25521   0.0715453     17.5443     0.000    0.308893    0.316628

Our results make intuitive sense and suggest that:

  • Both age and hours_worked are statistically significant.
  • Increases in age lead to increases in income.
  • Increases in hours_worked lead to increases in income.

Evaluating the OLS Results

As we know from our previous example, we need to look beyond the estimated coefficients and standard errors when evaluating our model results. Let's start with the histogram of our residuals:

/*
** Plot a histogram of the residuals 
** Check for skewed distribution
*/
plotHist(ols_out.resid, 50);

The histogram of our residuals is right skewed with a long tail on the right side.

However, because our initial data is truncated, residual scatter plots will be more useful for checking for heteroscedasticity.

/*
** Plot residuals against independent variables
** Check for increasing or decreasing variance 
** as the independent variable changes.
*/
plotScatter(acs_fgls[., "age"], ols_out.resid);

// Open second plot window
plotOpenWindow();
plotScatter(acs_fgls[., "hours_worked"], ols_out.resid);

Both plots show signs of heteroscedasticity:

  • The age scatter plot demonstrates the tell-tale fan-shaped relationship with residuals. This indicates that variance in residuals increases as age increases.
  • The hours_worked scatter plot is less obvious but does seem to indicate higher variance in the residuals at the middle ranges (40-60) than the lower and higher ends.

FGLS estimation

To address the issues of heteroscedasticity, let's use FGLS. This time we'll use the "HC0" weighting matrix (White, 1980).

// Estimate FGLS parameters 
// using the HC0 weighting matrix
struct fglsOut fOut;
fOut = fgls(acs_fgls, "income ~ age + hours_worked", "HC0");
Valid cases:                   2758              Dependent variable:          income
Total SS:               8771535.780              Degrees of freedom:            2755
R-squared:                    0.147              Rbar-squared:                 0.146
Residual SS:            7481440.027              Std error of est:            52.111
F(2,2755)                   237.535              Probability of F:             0.000
Durbin-Watson                 1.932

-------------------------------------------------------------------------------------
                             Standard                    Prob
     Variable   Estimates       Error     t-value        >|t|  [95% Conf.   Interval]
-------------------------------------------------------------------------------------
    Constant       -30.9      0.0743        -416       0.000       -31.1       -30.8
         age       0.762    0.000475    1.61e+03       0.000       0.761       0.763
hours_worked        1.25     0.00181         694       0.000        1.25        1.26

While using FGLS results in slightly different coefficient estimates, it has a big impact on the estimated standard errors. In this case, these changes don't affect our inferences -- all of our regressors are still statistically significant.

Conclusion

Today we've seen how FGLS offers a potential solution for data that doesn't fall within the restrictive IID assumption of OLS.

After today, you should have a better understanding of how to:

  • Identify heteroscedasticity and autocorrelation.
  • Compute OLS and FGLS estimates using GAUSS.

Further Reading

  1. Introduction to the Fundamentals of Time Series Data and Analysis.
  2. Finding the PACF and ACF.
  3. Generating and Visualizing Regression Residuals.
  4. OLS diagnostics: Heteroscedasticity.

Getting Started With Survey Data In GAUSS

    Introduction

    Survey data is a powerful analysis tool, providing a window into people's thoughts, behaviors, and experiences. By collecting responses from a diverse sample of respondents on a range of topics, surveys offer invaluable insights. These can help researchers, businesses, and policymakers make informed decisions and understand diverse perspectives.

    In today's blog we'll look more closely at survey data including:

    • Fundamental characteristics of survey data.
    • Data cleaning considerations.
    • Data exploration using frequency tables and data visualizations.
    • Managing survey data in GAUSS.

    Survey Data

    Survey data presents unique characteristics and challenges that require careful consideration during the data analysis process.

    Survey Data Characteristics

    | Characteristic | Description |
    |---|---|
    | Categorical Nature | Survey data often involves categorical variables, where responses are grouped into distinct categories. Understanding the nature of these categories is crucial for choosing appropriate analysis methods. |
    | Ordinal and Nominal Variables | It is important to recognize the distinction between ordinal variables (categories with a meaningful order) and nominal variables (categories without a specific order). This impacts the choice of statistical tests and visualization techniques. |
    | Missing Data | Surveys may have missing or incomplete responses. Strategies for handling missing data, such as imputation or excluding incomplete cases, need to be considered. |
    | Large Sample Sizes | Surveys often involve large sample sizes, leading to statistically significant but not necessarily practically significant results. It's crucial to consider whether the observed results are meaningful or impactful in the specific context of the study. |
    | Multivariate Nature | Surveys explore relationships among multiple variables simultaneously. Multivariate analysis allows for a more comprehensive understanding of the complex relationships between different factors. |
    | Choice Modeling | Surveys act as a primary data collection method for understanding individuals' preferences and choices. Choice modeling techniques expand the insights gained from survey responses, providing a quantitative framework for analyzing decision-making processes in various contexts. |

    Data Cleaning Considerations For Analyzing Survey Data

    Data cleaning allows us to identify and address errors, inconsistencies, and missing values. It is crucial for survey data and helps to:

    • Ensure accuracy.
    • Improve reliability.
    • Make meaningful and trustworthy insights.

    Cleaning survey data includes some standard steps, such as:

    • Handling missing values,
    • Detecting outliers,

    and some steps that are more specific to survey data, such as:

    • Performing consistency checks on survey responses,
    • Recoding categorical variables,
    • Handling open-ended responses.

    Common Survey Data Cleaning Steps

    Handling Missing Data
    • Identify missing data.
    • Determine if missing values are systematic or random.
    • Decide if missing values should be imputed or observations should be removed.
    Outlier Detection and Treatment
    • Identify outliers that might skew the analysis.
    • Decide whether outliers should be treated, transformed, or if they represent valid data points.
    Standardize Variables
    • Standardize units and formats of variables to ensure consistency.
    • Convert units, standardize date formats, and/or transform variables for better comparability.
    Checking for Consistency
    • Perform consistency checks on the survey responses.
    • Look for contradictory or illogical responses that may indicate errors in data entry.
    Addressing Duplicate Entries
    • Identify and remove duplicate entries to avoid double-counting.
    Recoding and Categorization
    • Recode variables or categorize responses to simplify analysis.
    • Group similar categories, collapse response options, or create new variables based on recoded values.
    Handling Open-Ended Responses
    • Categorize and code open-ended responses for analysis.
    Dealing with Coding Errors
    • Check for coding errors in categorical variables.
    • Ensure that each category is correctly labeled and that coding aligns with the intended meaning of the variable.
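
    As a small, hypothetical sketch of what a few of these steps look like in GAUSS (the dataframe and variable names are illustrative; packr, delif, and isunique are the same functions used elsewhere in these examples):

    // Remove rows with any missing values
    survey_df = packr(survey_df);

    // Drop observations with an implausible age
    survey_df = delif(survey_df, survey_df[., "age"] .> 120);

    // Verify that no duplicate rows remain
    print isunique(survey_df);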

    Exploring Survey Data

    Exploratory data analysis is an important tool that can help us uncover insights from survey data without complicated computations. During this step, basic statistical tools like frequency tables, contingency tables, and summary statistics can shed light on important patterns and trends in the data.

    One-Way Frequency Tables

    Frequency tables provide a simple tabulation of the number of occurrences of each category in a single categorical variable. They display the counts (frequencies) of each category along with their corresponding percentages or proportions. Frequency tables are univariate, meaning they describe the distribution of one variable.

    A simple frequency table can help us identify:

    • Inconsistencies, coding errors, typos, and other errors in categorical labels.
    • Outliers and missing values.
    • General distribution characteristics. For example, we may find that one level of a categorical variable makes up 90% of our observations.

    | Category | Count | Total % | Cum. % |
    |---|---|---|---|
    | Coffee | 31 | 45.6 | 45.6 |
    | Tea | 27 | 39.7 | 85.3 |
    | Soda | 28 | 14.7 | 100 |
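
    A table like the one above can be produced with the GAUSS frequency procedure (used later in this post); this call assumes a hypothetical dataframe survey_df with a beverage_choice variable:

    // One-way frequency table, sorted by descending count
    frequency(survey_df, "beverage_choice", 1);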

    Two-Way Tables

    Two-way tables, also known as contingency tables, are similar to frequency tables but offer additional information about data interactions. They display the frequency combinations of two categorical variables. This provides a snapshot of how these variables interact, and helps us uncover patterns and associations within survey data.

    Two-way tables present information in a structured grid:

    • The columns correspond to one variable.
    • The rows correspond to the other variable.
    • The intersection of a row and column represent the frequency of observations having a pair of outcomes.

    |  | Breakfast | Lunch | Dinner |
    |---|---|---|---|
    | Coffee | 20 | 8 | 3 |
    | Tea | 12 | 10 | 5 |
    | Soda | 8 | 10 | 10 |

    As an example, consider the table above:

    • The columns represent the outcomes for a variable meal_time: Breakfast, Lunch, and Dinner.
    • The rows represent the outcomes for a variable beverage_choice: Coffee, Tea, and Soda.
    • The bottom row contains the counts for Soda orders across all possible meal times.
    • The last column contains counts for all beverage options at Dinner.
    • The bottom-right cell tells us that 10 Sodas were ordered at Dinner.

    Two-way tables are an efficient way to reveal the intricate relationships between two categorical variables. By presenting information in a structured grid, these tables offer a straightforward way to discern patterns, making it easier to grasp how variables interact.
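
    For reference, a two-way table like the one above can be produced with the tabulate procedure demonstrated later in this post; the dataframe and variable names here are hypothetical:

    // Two-way table: beverage_choice categories in rows,
    // meal_time categories in columns
    tab = tabulate(survey_df, "beverage_choice ~ meal_time");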

    Data Visualizations

    Data plots are a great way to understand data trends, observe outliers, and identify other data issues. When choosing a data plot, it is important to consider which plot is best suited to the type of variable.

    | Plot Type | Description |
    |---|---|
    | Bar Charts | Ideal for comparing the frequency or distribution of categorical variables. |
    | Stacked Bar Charts | Useful for comparing the composition of different groups, where each bar is divided into segments representing subcategories. |
    | Pie Charts | Shows the proportion of each category in relation to the whole. |
    | Histograms | Depicts the distribution of a continuous variable by dividing it into intervals (bins) and showing the frequency of observations in each interval. |
    | Line Charts | Demonstrates trends or patterns over a continuous variable or time. |
    | Scatter Plots | Visualizes the relationship between two continuous variables. |
    | Box Plots (Box-and-Whisker Plots) | Displays the distribution of a variable, including median, quartiles, and outliers. |
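
    As a rough, illustrative mapping from these chart types to GAUSS plotting procedures (the variable names are placeholders):

    plotBar(labels, counts);          // bar chart of category counts
    plotHist(x, 30);                  // histogram with 30 bins
    plotXY(dates, y);                 // line chart over time
    plotScatter(x, y);                // scatter plot of two continuous variables
    plotBox(group_labels, y);         // box plot of y by group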

    Hands-On With Survey Data: NextGen National Household Travel Survey

    Let's look more closely at survey data using GAUSS and real-world transportation data.

    Today's Data

    Today we'll be working with the 2022 National Household Travel Survey (NHTS). This survey is designed to collect comprehensive information about travel patterns and travel behavior in the United States.

    The NHTS survey:

    • Gathers data on various aspects of travel, including daily commuting, recreational trips, shopping, and other activities.
    • Is typically conducted at regular intervals to capture changes in travel behavior over time, though today we will only consider the 2022 survey results.
    • Utilizes a combination of interviews and diaries to collect data from a representative sample of households across the country.
    • Is valuable for transportation planners, policymakers, and researchers in making informed decisions regarding infrastructure development, traffic management, and other transportation-related initiatives.

    The raw data from the NHTS is split into four separate CSV files containing:

    • Vehicle data.
    • Trip data.
    • Household data.
    • Person data.

    Today we will work with the trip data.

    Loading The Data

    Let's get started by loading the data into GAUSS using the loadd procedure. We will also compute descriptive statistics for our data:

    // Load trip data
    trip_data = loadd("trip_data.gdat");
    
    // Preliminary summary stats
    dstatmt(trip_data);
    -------------------------------------------------------------------------------------------
    Variable           Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
    -------------------------------------------------------------------------------------------
    
    HOUSEID           9e+09    5.83e+04     3.399e+09       9e+09       9e+09     31074    0
    PERSONID          1.681      0.9994        0.9989           1           9     31074    0
    TRIPID            2.438       1.792         3.209           1          36     31074    0
    SEQ_TRIPID        2.436        1.79         3.203           1          36     31074    0
    VEHCASEID     7.619e+11   3.244e+11     1.052e+23          -1       9e+11     31074    0
    FRSTHM            -----       -----         -----         Yes          No     31074    0
    PARK              -----       -----         -----  Valid skip          No     31074    0
    TRAVDAY           -----       -----         -----      Sunday    Saturday     31074    0
    DWELTIME          95.18       164.3       2.7e+04          -9        1050     31074    0
    PUBTRANS          -----       -----         -----  Used publi  Did not us     31074    0
    TRIPPURP          -----       -----         -----  Not ascert  Not a home     31074    0
    WHYTRP1S          -----       -----         -----        Home  Something      31074    0
    TRVLCMIN          24.55       46.48          2161          -9        1425     31074    0
    TRPTRANS          -----       -----         -----         Car  School bus     31074    0
    NUMONTRP          1.997       3.478          12.1           1          99     31074    0
    NONHHCNT         0.4141       3.388         11.48           0          98     31074    0
    HHACCCNT          1.583      0.8916         0.795           1           8     31074    0
    WHYTO             -----       -----         -----  Regular ac  Something      31074    0
    WALK              -----       -----         -----  Valid skip  N/A - Didn     31074    0
    TRPMILES          13.97       85.42          7296          -9        4859     31074    0
    VMT_MILE          7.527       32.18          1035          -9        1683     31074    0
    GASPRICE            398       68.46          4686       272.7       597.9     31074    0
    NUMADLT           2.059      0.7616          0.58           1           8     31074    0
    HOMEOWN           -----       -----         -----  Owned by h  Occupied w     31074    0
    RAIL              -----       -----         -----         Yes          No     31074    0
    CENSUS_D          -----       -----         -----  New Englan     Pacific     31074    0
    CENSUS_R          -----       -----         -----   Northeast        West     31074    0
    CDIVMSAR          -----       -----         -----  New Englan  Pacific No     31074    0
    HHFAMINC          -----       -----         -----  I prefer n  $125,000 t     31074    0
    HH_RACE           -----       -----         -----       White  Other race     31074    0
    HHSIZE            2.822       1.447         2.093           1          10     31074    0
    HHVEHCNT          2.134       1.078         1.163           0          11     31074    0
    MSACAT            -----       -----         -----  MSA of 1 m  Not in MSA     31074    0
    MSASIZE           -----       -----         -----  In an MSA   Not in MSA     31074    0
    URBAN             -----       -----         -----  In an urba  Not in urb     31074    0
    URBANSIZE         -----       -----         -----  50,000-199  Not in urb     31074    0
    URBRUR            -----       -----         -----       Urban       Rural     31074    0
    TDAYDATE          -----       -----         -----  2022-01-01  2023-01-01     31074    0
    WRKCOUNT          1.304      0.9474        0.8976           0           6     31074    0
    R_AGE              46.8       20.77         431.2           5          92     31074    0
    R_SEX             -----       -----         -----      Refuse      Female     31074    0
    R_RACE            -----       -----         -----       White  Other race     31074    0
    EDUC              -----       -----         -----  Valid skip  Profession     31074    0
    VEHTYPE           -----       -----         -----  Valid skip  Motorcycle     31074    0 

    There are many ways to preview dataframes in GAUSS but with a wide dataset that contains many variables, I find dstatmt to be the easiest to view.

    The descriptive statistics themselves provide some useful information:

    • Many of the continuous variables, such as TRPMILES and TRVLCMIN, have minimum values below zero. These don't make sense; it is likely that -9 is coded to represent something else, such as non-responses.
    • There are 31074 valid observations and no missing values for all variables.

    The descriptive statistics report also provides insights beyond the traditional descriptive statistics:

    • The data contains a mixture of categorical and numerical data.
    • Observations in our dataset are defined by a set of identification variables: HOUSEID, PERSONID, TRIPID, SEQ_TRIPID, VEHCASEID .

    Checking For Duplicates

    As a first step, we'll confirm that our data contains unique observations using the isunique procedure.

    isunique(trip_data);
    1.0000000 

    This indicates that our dataset is unique without any duplicates.

    Examining Category Labels

    Now that we confirmed that our dataset is unique, one of the first data cleaning steps with categorical data is to examine the category labels to check for errors and to get an understanding of the distribution.

    Let’s look at the labels of the TRIPPURP variable using a sorted frequency table.

    // Print frequency table for 'TRIPPURP'
    frequency(trip_data, "TRIPPURP", 1);
                                     Label      Count   Total %    Cum. %
                    Home-based other (HBO)       7714     24.82     24.82
               Not a home-based trip (NHB)       7035     22.64     47.46
               Home-based shopping (HBSHP)       6884     22.15     69.62
                     Home-based work (HBW)       4871     15.68     85.29
    Home-based social/recreational (HBSOC)       4546     14.63     99.92
                           Not ascertained         24   0.07723       100
                                     Total      31074       100           

    Using this we can see that three categories make up almost 70% of the trips: "Home-based other", "Not a home-based trip", and "Home-based shopping".

    The frequency table is also useful for learning more about our labels. In this table, the labels appear to be clean and we don’t see anything that suggests typos or errors.

    To clean up the labels, let's separate the abbreviations from the descriptions. We can do this using some simple string manipulation in GAUSS.

    First, let’s separate the abbreviations from the full descriptions by splitting the labels at "(" and storing the new string arrays:

    // Use '(' to split existing labels into 2 columns
    tmp = strsplit(trip_data[. , "TRIPPURP"], "(" );
    
    // Trim whitespace from the front and back of both variables
    tmp = strtrim(tmp);
    
    // Rename columns 
    tmp = setColNames(tmp , "TRIP_DESC"$|"TRIP_ABBR");
    
    // Preview data
    head(tmp);
                  TRIP_DESC        TRIP_ABBR
           Home-based socia           HBSOC)
           Home-based socia           HBSOC)
           Home-based shopp           HBSHP)
           Not a home-based             NHB)
           Home-based shopp           HBSHP)

    The TRIP_DESC variable looks good – it stores the full description of the TRIPPURP. However, the abbreviations in the TRIP_ABBR don’t quite look right, we still need to strip the ")".

    /*
    ** Remove the right parenthesis
    */
    // Replace ')' with an empty string
    tmp[. , "TRIP_ABBR"]  = strreplace(tmp[. , "TRIP_ABBR"], ")", "");
    
    // Check frequencies for both variables
    frequency(tmp, "TRIP_DESC + TRIP_ABBR");
                             Label      Count     Total %      Cum. %
                  Home-based other       7714       24.82       24.82
               Home-based shopping       6884       22.15       46.98
    Home-based social/recreational       4546       14.63       61.61
                   Home-based work       4871       15.68       77.28
             Not a home-based trip       7035       22.64       99.92
                   Not ascertained         24     0.07723         100
                             Total      31074         100
    
                             Label      Count     Total %      Cum. %
                                           24     0.07723     0.07723
                               HBO       7714       24.82        24.9
                             HBSHP       6884       22.15       47.06
                             HBSOC       4546       14.63       61.69
                               HBW       4871       15.68       77.36
                               NHB       7035       22.64         100
                             Total      31074         100
    

    One final change we may want to make is to replace the missing abbreviation label for the "Not Ascertained" category using the recodeCatLabels procedure.

    /*
    ** Recode missing label
    */
    // Add missing label for 'NA'
    tmp[., 2] = recodecatlabels(tmp[., 2], "", "NA", "TRIP_ABBR");
    
    // Check frequencies for both variables
    frequency(tmp, "TRIP_DESC + TRIP_ABBR");
                             Label      Count     Total %      Cum. %
                  Home-based other       7714       24.82       24.82
               Home-based shopping       6884       22.15       46.98
    Home-based social/recreational       4546       14.63       61.61
                   Home-based work       4871       15.68       77.28
             Not a home-based trip       7035       22.64       99.92
                   Not ascertained         24     0.07723         100
                             Total      31074         100
    
                             Label      Count     Total %      Cum. %
                                NA         24     0.07723     0.07723
                               HBO       7714       24.82        24.9
                             HBSHP       6884       22.15       47.06
                             HBSOC       4546       14.63       61.69
                               HBW       4871       15.68       77.36
                               NHB       7035       22.64         100
                             Total      31074         100

    We've successfully created two new variables, TRIP_DESC and TRIP_ABBR, which we can concatenate to our trip_data dataframe:

    // Add the new variables to the end of 'trip_data'
    trip_data = trip_data ~ tmp;

    Two-Way Tables

    Frequency tables provide insights into a single categorical variable. However, if we are interested in the relationship between multiple categorical variables, we need to use two-way, or contingency, tables.

    Let's use a contingency table to look at the relationship between the URBRUR and the VEHTYPE. To do this we can use the tabulate procedure, introduced in GAUSS 24.

    The tabulate function requires either a dataframe or filename input, along with a formula string to specify which variables to include in the table. It also takes an optional tabControl structure input for advanced options.


    data
    A GAUSS dataframe or filename.
    formula
    String, formula string. E.g "df1 ~ df2 + df3", "df1" categories will be reported in rows, separate columns will be returned for each category in "df2" and "df3".
    tbctl
    Optional, an instance of the tabControl structure used for advanced table options.
    // Compute a two-way table with
    // VEHTYPE categories in rows
    // URBRUR categories in columns
    // Results stored in tab_df
    tab_df = tabulate(trip_data, "VEHTYPE ~ URBRUR");
    ===============================================================
               VEHTYPE                   URBRUR               Total
    ===============================================================
                                Urban          Rural
    
            Valid skip           4061            719           4780
      Car/Stationwagon           9306           1774          11080
                   Van           1438            358           1796
                   SUV           8275           1935          10210
          Pickup Truck           2043           1043           3086
           Other Truck             36             24             60
          RV/Motorhome              4              4              8
      Motorcycle/Moped             39             15             54
    
                 Total          25202           5872          31074
    ===============================================================

    The initial counts provide us some insights:

    • The total counts of vehicles are higher in urban areas.
    • In urban areas, the most frequently occurring type of vehicle is the Car/Stationwagon.
    • In rural areas the most frequently occurring type of vehicle is SUV.

    It might be useful to see the relative percentages of the vehicle types. Because we stored the counts in tab_df, this can easily be done.

    First, let's look at what percentage each category makes up of the total vehicles in the urban and rural areas, respectively.

    // Compute percentages within urban and rural areas
    // by dividing by column totals
    tab_df[., 1]~(tab_df[., 2:3]./sumc(tab_df[., 2:3])');
             VEHTYPE      URBRUR_Urban  URBRUR_Rural
          Valid skip            0.1611        0.1224
    Car/Stationwagon            0.3692        0.3021
                 Van            0.0571        0.0610
                 SUV            0.3283        0.3295
        Pickup Truck            0.0811        0.1776
         Other Truck            0.0014        0.0041
        RV/Motorhome            0.0002        0.0007
    Motorcycle/Moped            0.0015        0.0026 

    These percentages help us see that:

    • The distribution of Car/Stationwagon, Van, and SUV are fairly similar in urban and rural areas.
    • There is a higher percentage of the Pickup Truck, Other Truck, Motorcycle/Moped categories in rural areas.

    Alternatively we can look at the distribution of each vehicle type across rural and urban areas.

    // Compute percentages across urban and rural areas
    // by dividing by row totals
    tab_df[., 1]~(tab_df[., 2:3]./sumr(tab_df[., 2:3]))
             VEHTYPE     URBRUR_Urban URBRUR_Rural
          Valid skip           0.8496       0.1504
    Car/Stationwagon           0.8399       0.1601
                 Van           0.8007       0.1993
                 SUV           0.8105       0.1895
        Pickup Truck           0.6620       0.3380
         Other Truck           0.6000       0.4000
        RV/Motorhome           0.5000       0.5000
    Motorcycle/Moped           0.7222       0.2778

    This table tells a similar story from a different perspective:

    • Urban vehicles make up 80-84% of the Car/Stationwagon, Van, and SUV categories.
    • Urban vehicles make up only 66% and 60% of the Pickup Truck and Other Truck categories, respectively.
    • Urban vehicles make up 72% of the Motorcycle/Moped category.

    Excluding Categories

    Suppose we don't want to include the Valid skip responses in our contingency table. We can remove these using the exclude member of the tabControl structure.

    To specify categories to be excluded from the contingency table, we use a string to specify the variable name and category separated by a ":".

    // Declare structure
    struct tabControl tbCtl;
    
    // Fill defaults
    tbCtl = tabControlCreate();
    
    // Specify to exclude the 'Valid skip' category
    // from the 'VEHTYPE' variable
    tbCtl.exclude = "VEHTYPE:Valid skip";
    
    // Find contingency table including tbCtl input
    tab_df2 =  tabulate(trip_data, "VEHTYPE ~ URBRUR", tbCtl);
    =============================================================================
                             VEHTYPE                   URBRUR               Total
    =============================================================================
                                              Urban          Rural
    
                    Car/Stationwagon           9306           1774          11080
                                 Van           1438            358           1796
                                 SUV           8275           1935          10210
                        Pickup Truck           2043           1043           3086
                         Other Truck             36             24             60
                        RV/Motorhome              4              4              8
                    Motorcycle/Moped             39             15             54
    
                               Total          21141           5153          26294
    =============================================================================

    Now our table excludes the Valid skip category.


    Ready to try it for yourself in GAUSS 24? Start your free trial today!

    Data Visualizations

    Data visualizations are one of the most useful tools for data exploration. There are several ways to utilize the plotting capabilities of GAUSS to explore survey data.

    Frequency plots

    First, let's use a frequency plot to explore the distribution of responses across census regions. To do this, we will utilize the plotFreq procedure.

    // Census region frequencies
    plotFreq(trip_data, "CENSUS_R", 1);

    The sorted frequency plot allows us to quickly identify that the most frequently occurring region in our data is "South".

    Plotting Contingency Tables

    Like frequency tables, frequency plots are useful for visualizing the categories of one variable. However, they don't provide much insight into the relationship across categorical variables.

    To visualize the relationship between VEHTYPE and URBRUR, let's create a bar plot using our stored contingency table dataframe, tab_df2.

    The plotBar function requires two inputs, labels for the x-axis and corresponding heights.

    The labels for our bar plot are the vehicle types which are stored as a dataframe in the first column of the tab_df2. To use them as inputs we will need to:

    • Get the category labels.
    • Convert them to a string array.
    // Get category labels
    labels = getCategories(tab_df2, "VEHTYPE");
    
    // Convert to string array
    labels_sa = ntos(labels);

    The corresponding heights will come from the tab_df2 variable. Let's find out the variable names in tab_df2:

    // Print the variable names from 'tab_df2'
    getcolnames(tab_df2);
         VEHTYPE
    URBRUR_Urban
    URBRUR_Rural

    The final two variable names were created by the tabulate function to tell us which original variable the column came from, URBRUR, and which category is being referenced. Let's change the variable names to just Urban and Rural to make them more concise.

    new_names = "Urban" $| "Rural";
    col_idx = { 2, 3 };
    tab_df2 = setcolnames(tab_df2, new_names, col_idx);

    Now we're ready to use the Urban and Rural count variables to plot our data.

    plotBar(labels_sa, tab_df2[., "Urban" "Rural"]);

    By default, this plots our bars side-by-side. We can change this using a plotControl structure and plotSetBar.

    // Declare structure
    struct plotControl plt;
    
    // Fill defaults
    plt = plotGetDefaults("bar");
    
    // Set bars to be solid and stacked
    plotSetBar(&plt, 1, 1);
    
    // Plot contingency table
    plotBar(plt, labels_sa, tab_df2[., "Urban" "Rural"]);

    Scatter Plots

    Now suppose we wish to examine the relationship between a categorical variable and continuous variables. We can do this using the 'by' keyword and the plotScatter function.

    // Plot TRIPMILES vs GASPRICE 
    // Sorting by color using the categories in CENSUS_R
    plotScatter(trip_data, "TRPMILES ~ GASPRICE + by(CENSUS_R)");

    Adding the census regions provides some interesting observations:

    • The West region has higher gas prices than other regions.
    • The South region seems to have lower gas prices than other regions.

    Conclusion

    In this blog, we've covered some fundamental concepts related to survey data and looked at some GAUSS tools for cleaning, exploring, and visualizing survey data.

Transforming Panel Data to Long Form in GAUSS
    Introduction

    Anyone who works with panel data knows that pivoting between long and wide form, though commonly necessary, can be painstakingly tedious at best and, at worst, can lead to frustrating errors, unexpected results, and lengthy troubleshooting.

    The new dfLonger and dfWider procedures introduced in GAUSS 24 make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.

    In today's blog, we will walk through all you need to know about the dfLonger procedure to tackle even the most complex cases of transforming wide form panel data to long form.

    The Rules of Tidy Data

    Before we get started, it will be useful to consider what makes data tidy (and why tidy data is important).

    It's useful to think of breaking our data into components (these subsets will come in handy later when working with dfLonger):

    • Values.
    • Observations.
    • Variables.

    Components of data.

    We can use these components to define some basic rules for tidy data:

    1. Variables have unique columns.
    2. Observations have unique rows.
    3. Values have unique cells.

    Example One: Wide Form State Population Table

    State          2020          2021          2022
    Alabama        5,031,362     5,049,846     5,074,296
    Alaska         732,923       734,182       733,583
    Arizona        7,179,943     7,264,877     7,359,197
    Arkansas       3,014,195     3,028,122     3,045,637
    California     39,501,653    39,142,991    39,029,342

    Though not clearly labeled, we can deduce that this data presents values for three different variables: State, Year, and Population.

    Looking more closely we see:

    • State is stored in a unique column.
    • The values of Years are stored as column names.
    • The values of Population are stored in separate columns for each year.

    Our variables do not each have a unique column, violating the rules of tidy data.

    Example Two: Long Form State Population Table

    State        Year    Population
    Alabama      2020    5,031,362
    Alabama      2021    5,049,846
    Alabama      2022    5,074,296
    Alaska       2020    732,923
    Alaska       2021    734,182
    Alaska       2022    733,583
    Arizona      2020    7,179,943
    Arizona      2021    7,264,877
    Arizona      2022    7,359,197

    The transformed data above now has three columns, one for each variable State, Year, and Population. We can also confirm that each observation has a single row and each value has a single cell.

    Transforming the data to long form has resulted in a tidy data table.

    Why Do We Care About Tidy Data?

    Working with tidy data offers a number of advantages:

    • Tidy data storage offers consistency when trying to compare, explore, and analyze data whether it be panel data, time series data or cross-sectional data.
    • Using columns for variables is aligned with vectorization and matrix notation, both of which are fundamental to efficient computations.
    • Many software tools expect tidy data and will only work reliably with tidy data.

    Ready to elevate your research? Try GAUSS 24 today.

    Transforming From Wide to Long Panel Data

    In this section, we will look at how to use the GAUSS procedure dfLonger to transform panel data from wide to long form. This section will cover:

    • The fundamentals of the dfLonger procedure.
    • A standard process for setting up panel data transformations.

    The dfLonger Procedure

    The dfLonger procedure transforms wide form GAUSS dataframes to long form GAUSS dataframes. It has four required inputs and one optional input:

    df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl]);

    df_wide
    A GAUSS dataframe in wide panel format.
    columns
    String array, the columns that should be used in the conversion.
    names_to
    String array, specifies the variable name(s) for the new column(s) created to store the wide variable names.
    values_to
    String, the name of the new column containing the values.
    pctl
    Optional, an instance of the pivotControl structure used for advanced pivoting options.

    Setting Up Panel Data Transformations

    Having a systematic process for transforming wide panel data to long panel data will:

    • Save time.
    • Eliminate frustration.
    • Prevent errors.

    Let's use our wide form state population data to work through the steps.

    Step 1: Identify variables.

    In our wide form population table, there are three variables: State, Year, and Population.

    Step 2: Identify columns to convert.

    The easiest way to determine what columns need to be converted is to identify the "problem" columns in your wide form data.

    For example, in our original state population table, the columns named 2020, 2021, 2022, represent our Year variable. They store the values for the Population variable.

    These are the columns we will need to address in order to make our data tidy.

    columns = "2020"$|"2021"$|"2022";

    We only have three columns to transform and it is easy to just type out our column names in a string array. This won't always be the case, though. Fortunately, GAUSS has a lot of great convenience functions to help with creating your column lists.

    My favorites include:

    Function        Description                                                                                   Example
    getColNames     Returns the column variable names.                                                            varnames = getColNames(df_wide)
    startsWith      Returns a 1 if a string starts with a specified pattern.                                      mask = startsWith(colNames, pattern)
    trimr           Trims rows from the top and/or bottom of a matrix.                                            names = trimr(full_list, top, bottom)
    rowcontains     Returns a 1 if the row contains the data specified by the needle variable, otherwise 0.       mask = rowcontains(haystack, needle)
    selif           Selects rows from a matrix, dataframe or string array, based upon a vector of 1's and 0's.    names = selif(full_list, mask)

    For more complex cases, it is useful to approach creating column lists as a two-step process:

    1. Get all column names using getColNames.
    2. Select a subset of column names using the selection convenience functions above.

    As an example, suppose our state population dataset has State as its first column and the remaining columns contain populations for every year from 1950-2022. It would be difficult to write out the column list for all of those years.

    Instead we could:

    1. Get a list of all the column names using getColNames.
    2. Trim the first name off the list.
    // Get all columns names
    colNames = getColNames(pop_wide);
    
    // Trim the first name, 'State',
    // from the top of the name list
    colNames = trimr(colNames, 1, 0);

    Step 3: Name the new columns for storing names.

    The names of the columns being transformed from our wide form data will be stored in a variable specified by the input names_to.

    In this case, we want to store the names from the wide data in one new variable called, "Years". In later examples, we will look at how to split names into multiple variables using prefixes, separators, or patterns.

    names_to = "Years";

    Step 4: Name the new columns for storing values.

    The values stored in the columns being transformed will be stored in a variable specified by the input values_to.

    For our population table, we will store the values in a variable named "Population".

    values_to = "Population";

    Basic Pivoting

    Now it's time to put all these steps together into a working example. Let's continue with our state population example.

    We'll start by loading the complete state population dataset from the state_pop.gdat file:

    // Load data 
    pop_wide = loadd("state_pop.gdat");
    
    // Preview data
    head(pop_wide);
               State             2020             2021             2022
             Alabama        5031362.0        5049846.0        5074296.0
              Alaska        732923.00        734182.00        733583.00
             Arizona        7179943.0        7264877.0        7359197.0
            Arkansas        3014195.0        3028122.0        3045637.0
          California        39501653.        39142991.        39029342. 

    Now, let's set up our information for transforming our data:

    // Identify columns
    columns = "2020"$|"2021"$|"2022";
    
    // Variable for storing names
    names_to = "Year";
    
    // Variable for storing values
    values_to = "Population";

    Finally, we'll transform our data using dfLonger:

    // Convert data using dfLonger
    pop_long = dfLonger(pop_wide, columns, names_to, values_to);
    
    // Preview data
    head(pop_long);
               State             Year       Population
             Alabama             2020        5031362.0
             Alabama             2021        5049846.0
             Alabama             2022        5074296.0
              Alaska             2020        732923.00
              Alaska             2021        734182.00 

    Advanced Pivoting

    One of the most appealing things about dfLonger is that while simple to use, it offers tools for tackling the most complex cases. In this section, we'll cover everything you need to know for moving beyond basic pivoting.

    The pivotControl Structure

    The pivotControl structure allows you to control pivoting specifications using the following members:

    Member                  Purpose
    names_prefix            A string input which specifies which characters, if any, should be stripped from the front of the wide variable names before they are assigned to a long column.
    names_sep_split         A string input which specifies which characters, if any, mark where the names_to names should be broken up.
    names_pattern_split     A string input containing a regular expression specifying group(s) in the names_to names which should be broken up.
    names_types             A string input specifying data types for the names_to variable(s).
    values_drop_missing     Scalar, if set to 1 all rows with missing values will be removed.

    Changing Variable Types

    By default the variables created from the pieces of the variable names will be categorical variables.

    If we examine the variable type of pop_long from our previous example,

    // Check the type of the 'Year' variables
    getColTypes(pop_long[., "Year"]);

    we can see that the Year variable is a categorical variable:

                type
            category 

    This isn't ideal and we'd prefer our Year variable to be a date. We can control the assigned type using the names_types member of the pivotControl structure. The names_types member can be specified in one of two ways:

    1. As a column vector of types for each of the names_to variables.
    2. An n x 2 string array where the first column is the name of the variable(s) and the second column contains the type(s) to be assigned.

    For our example, we wish to specify that the Year variable should be a date but we don't need to change any of the other assigned types, so we will use the second option:

    // Declare pivotControl structure and fill with default values
    struct pivotControl pctl;
    pctl = pivotControlCreate();
    
    // Specify that 'Year' should be
    // converted to a date variable
    pctl.names_types = {"Year" "date"};
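
    For reference, the first option would accomplish the same thing here with a vector of types, one per names_to variable and listed in the same order (a sketch):

    // Option 1 (sketch): a column vector with one type
    // per names_to variable, listed in the same order
    pctl.names_types = "date";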

    Next, we complete the steps for pivoting:

    // Get all column names and remove the first column, 'State'
    columns = getColNames(pop_wide);
    columns = trimr(columns, 1, 0);
    
    // Variable for storing names
    names_to = "Year";
    
    // Variable for storing values
    values_to = "Population";

    Finally, we call dfLonger including the pivotControl structure, pctl, as the final input:

    // Call dfLonger with optional control structure
    pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);
    
    // Preview data
    head(pop_long);
               State             Year       Population
             Alabama             2020        5031362.0
             Alabama             2021        5049846.0
             Alabama             2022        5074296.0
              Alaska             2020        732923.00
              Alaska             2021        734182.00

    Now if we check the type of our Year variable:

    // Check the type of 'Year'
    getColTypes(pop_long[., "Year"]);

    It is a date variable:

      type
      date

    Stripping Prefixes

    In our previous example, the wide data names only contained the year. However, the column names of a wide dataset often have common prefixes. The names_prefix member of the pivotControl structure offers a convenient way to strip unwanted prefixes.

    Suppose that our wide form state population columns were labeled "yr_2020", "yr_2021", "yr_2022":

    // Load data
    pop_wide2 = loadd("state_pop2.gdat");
    
    // Preview data
    head(pop_wide2);
               State          yr_2020          yr_2021          yr_2022
             Alabama        5031362.0        5049846.0        5074296.0
              Alaska        732923.00        734182.00        733583.00
             Arizona        7179943.0        7264877.0        7359197.0
            Arkansas        3014195.0        3028122.0        3045637.0
          California        39501653.        39142991.        39029342.

    We need to strip these prefixes when transforming our data to long form.

    To accomplish this we first need to specify that our name columns have the common prefix "yr":

    // Declare pivotControl structure and fill with default values
    struct pivotControl pctl;
    pctl = pivotControlCreate();
    
    // Specify prefix
    pctl.names_prefix = "yr_";

    Next, we complete the steps for pivoting:

    // Get all column names and remove the first column, 'State'
    columns = getColNames(pop_wide2);
    columns = trimr(columns, 1, 0);
    
    // Variable for storing names
    names_to = "Year";
    
    // Variable for storing values
    values_to = "Population";
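
    Since these column names share the "yr_" prefix, we could also build the column list by prefix rather than by position, using the startsWith and selif functions listed earlier (a sketch):

    // Alternative sketch: select the year columns by their "yr_" prefix
    colNames = getColNames(pop_wide2);
    mask = startsWith(colNames, "yr_");
    columns = selif(colNames, mask);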

    Finally, we call dfLonger:

    // Call dfLonger with optional control structure
    pop_long = dfLonger(pop_wide2, columns, names_to, values_to, pctl);
    
    // Preview data
    head(pop_long);
               State             Year       Population
             Alabama             2020        5031362.0
             Alabama             2021        5049846.0
             Alabama             2022        5074296.0
              Alaska             2020        732923.00
              Alaska             2021        734182.00

    Splitting Names

    In our basic example the only information contained in the names columns was the year. We created one variable to store that information, "Year". However, we may have cases where our wide form data contains more than one piece of information.

    In these cases there are two important steps to take:

    1. Name the variables that will store the information contained in the wide data column names using the names_to input.
    2. Indicate to GAUSS how to split the wide data column names into the names_to variables.

    Names Include a Separator

    One way that names in wide data can contain multiple pieces of information is through the use of separators.

    For example, suppose our data looks like this:

               State       urban_2020       urban_2021       urban_2022       rural_2020       rural_2021       rural_2022
             Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
              Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
             Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
            Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
          California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. 

    Now our names specify:

    • Whether the population is the urban or rural population.
    • The year of the observation.

    In this case, we:

    • Use the names_sep_split member of the pivotControl structure to indicate how to split the names.
    • Specify a names_to variable for each group created by the separator.
    // Load data
    pop_wide3 = loadd("state_pop3.gdat");
    
    // Get all column names and remove the first column, 'State'
    columns = getColNames(pop_wide3);
    columns = trimr(columns, 1, 0);
    
    // Declare pivotControl structure and fill with default values
    struct pivotControl pctl;
    pctl = pivotControlCreate();
    
    // Specify how to separate names
    pctl.names_sep_split = "_";
    
    // Specify two variables for holding
    // names information:
    //    'Location' for the information before the separator
    //    'Year' for the information after the separator
    names_to = "Location"$|"Year";
    
    // Variable for storing values
    values_to = "Population";
    
    // Call dfLonger with optional control structure
    pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);
    
    // Preview data
    head(pop_long);
               State         Location             Year       Population
             Alabama            urban             2020        6558153.0
             Alabama            urban             2021        4972982.0
             Alabama            urban             2022        12375977.
             Alabama            rural             2020        1526791.0
             Alabama            rural             2021        76863.000

    Now, the pop_long dataframe contains:

    • The information in the wide form names found before the separator, "_", (urban or rural) in the Location variable.
    • The information in the wide form names found after the separator, "_", in the Year variable.

    Variable Names With Regular Expressions

    In our example above, the variables contained in the names were clearly separated by a "_". However, this isn't always the case. Sometimes names use a pattern rather than separator:

    // Load data
    pop_wide4 = loadd("state_pop4.gdat");
    
    // Preview data
    head(pop_wide4);
               State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
             Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
              Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
             Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
            Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
          California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. 

    In cases like this, we can use the names_pattern_split member to tell GAUSS we want to pass in a regular expression that will split the columns. We can't cover the full details of regular expressions here. However, there are a few fundamentals that will help us get started with this example.

    In regEx:

    1. Each statement inside a pair of parentheses is a group.
    2. To match any upper or lower case letter we use "[a-zA-Z]". More specifically, this tells GAUSS that we want to match any lowercase letter ranging from a-z and any upper case letter ranging from A-Z. If we wanted to limit this to any lowercase letters from t to z and any uppercase letter B to M we would say "[t-zB-M]".
    3. To match any integer we use "[0-9]".
    4. To represent that we want to match one or more instances of a pattern we use "+".
    5. To represent that we want to match zero or more instances of a pattern we use "*".

    In this case, we want to separate our names so that "urban" and "rural" are collected in Location and 2020, 2021, and 2022 are collected in the Year variable:

    1. We have two groups.
    2. We can capture both urban and rural using "[a-zA-Z]+".
    3. We can capture the years by matching one or more number using "[0-9]+".

    Let's use regEx to specify our names_pattern_split member:

    // Declare pivotControl structure and fill with default values
    struct pivotControl pctl;
    pctl = pivotControlCreate();
    
    // Specify how to separate names 
    // using the pivotControl structure
    pctl.names_pattern_split = "([a-zA-Z]+)([0-9]+)"; 

    Next, we can put this together with our other steps to transform our wide data:

    // Variable for storing names
    names_to = "Location"$|"Year";
    
    // Get all column names and remove the first column, 'State'
    columns = getColNames(pop_wide4);
    columns = trimr(columns, 1, 0);
    
    // Variable for storing values
    values_to = "Population";
    
    // Call dfLonger with optional control structure
    pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);
    head(pop_long);
               State         Location             Year       Population
             Alabama            urban             2020        6558153.0
             Alabama            urban             2021        4972982.0
             Alabama            urban             2022        12375977.
             Alabama            rural             2020        1526791.0
             Alabama            rural             2021        76863.000

    Multiple Value Variables

    In all our previous examples we had values that needed to be stored in one variable. However, it's more realistic that our dataset contains multiple groups of values and we will need to specify multiple variables to store these values.

    Let's consider our previous example which used the pop_wide4 dataset:

               State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
             Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
              Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
             Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
            Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
          California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. 

    Suppose that rather than creating a location variable, we wish to separate the population information into two variables, urban and rural. To do this we will:

    1. Split the variable names by words ("urban" or "rural") and integers.
    2. Create a Year column from the integer portions of the names.
    3. Create two values columns, urban and rural, from the word portions.

    First, we will specify our columns:

    // Get all column names and remove the first column, 'State'
    columns = getColNames(pop_wide4);
    columns = trimr(columns, 1, 0);

    Next, we need to specify our names_to and values_to inputs. However, this time we want our values_to variables to be determined by the information in our names.

    We do this using ".value".

    // Tell GAUSS to use the first group of the split names 
    // to set the values variables and 
    // store the remaining group in 'Year'
    names_to = ".value" $| "Year";
    
    // Tell GAUSS to get 'values_to' variables from 'names_to'
    values_to = "";

    Setting ".value" as the first element in our names_to input tells dfLonger to take the first piece of the wide data names and create a column with the all the values from all matching columns.

    In other words, combine all the values from the variables urban2020, urban2021, urban2022 into a single variable named urban and do the same for the rural columns.

    Finally, we need to tell GAUSS how to split the variable names.

    // Declare 'pctl' to be a pivotControl structure
    // and fill with default settings
    struct pivotControl pctl;
    pctl = pivotControlCreate();
    
    // Set the regex to split the variable names
    pctl.names_pattern_split = "(urban|rural)([0-9]+)";

    This time, we specify the variable names, "(urban|rural)", rather than use the general specifier "([a-zA-Z]+)".

    Now we call dfLonger:

    // Convert the dataframe to long format according to our specifications
    pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);
    
    // Print the first 5 rows of the long form dataframe
    head(pop_long);
               State             Year            urban            rural
             Alabama             2020        6558153.0        1526791.0
             Alabama             2021        4972982.0        76863.000
             Alabama             2022        12375977.        7301681.0
              Alaska             2020        21944.000        710978.00
              Alaska             2021        467051.00        267130.00

    Now the urban population and rural population are stored in their own column, named urban and rural.

    Conclusion

    As we've seen today, pivoting panel data from wide to long can be complicated. However, using a systematic approach and the GAUSS dfLonger procedure helps to alleviate the frustration, wasted time, and errors.


    Discover how GAUSS 24 can help you reach your goals.

     
    Introducing GAUSS 24
    https://www.aptech.com/blog/introducing-gauss-24/
    Tue, 05 Dec 2023 17:15:52 +0000

    Introduction


    We're happy to announce the release of GAUSS 24, with new features for everything from everyday data management to refined statistical modeling.

    GAUSS 24 features a robust suite of tools designed to elevate your research. With these advancements, GAUSS 24 continues our commitment to helping you conduct insightful analysis and achieve your goals.

    New Panel Data Management Tools

    GAUSS 24 makes working with panel data easier than ever. Effortlessly load, clean, and explore panel data without ever leaving GAUSS, making it the smoothest experience yet!

    • Easily and intuitively pivot between long and wide form data with new dfLonger and dfWider functions.
    • Explore group-level descriptive statistics and estimate group-level linear models with expanded by keyword functionality.
    // Load data 
    auto2 = loadd("auto2.dta");
    
    // Print statistics table
    call dstatmt(auto2, "mpg + by(foreign)");
    =======================================================================
    foreign: Domestic
    -----------------------------------------------------------------------
    Variable        Mean     Std Dev      Variance     Minimum     Maximum
    -----------------------------------------------------------------------
    mpg            19.83       4.743          22.5          12          34
    =======================================================================
    foreign: Foreign
    -----------------------------------------------------------------------
    Variable        Mean     Std Dev      Variance     Minimum     Maximum
    -----------------------------------------------------------------------
    mpg            24.77       6.611         43.71          14          41 

    Feasible GLS Estimation

    // Load data
    df_returns = loadd("df_returns.gdat");
    
    // Run FGLS with the default AR(1) innovations
    call fgls(df_returns, "rcoe ~ rcpi");
    Valid cases:                    248          Dependent variable:            rcpi
    Total SS:                     0.027          Degrees of freedom:             246
    R-squared:                    0.110          Rbar-squared:                 0.107
    Residual SS:                  0.024          Std error of est:             0.010
    F(1,246)                     30.453          Probability of F:             0.000
    Durbin-Watson                 0.757
    --------------------------------------------------------------------------------
                            Standard                    Prob
    Variable   Estimates       Error     t-value        >|t|  [95% Conf.   Interval]
    --------------------------------------------------------------------------------
    
    Constant      0.0148     0.00122        12.1       0.000      0.0124      0.0172
        rcoe       0.196      0.0685        2.86       0.005      0.0619        0.33 
    • Compute feasible GLS coefficients and associated standard errors, t-statistics, p-values, and confidence intervals.
    • Provides model evaluation statistics including R-squared, F-stat, and the Durbin-Watson statistic.
    • Choose from 7 built-in covariance estimation methods or provide your own covariance matrix.

    Expanded Tabulation Capabilities

    // Load data
    df = loadd("tips2.dta");
    
    // Two-way table
    call tabulate(df, "sex ~ smoker");
    ============================================================
                sex                   smoker               Total
    ============================================================
                                No            Yes
    
             Female             55             33             88
               Male             99             60            159
    
              Total            154             93            247
    ============================================================

    New tools for two-way tabulation provide a structured and systematic approach to understanding and drawing insights from categorical variables.

    • New procedure tabulate for computing two-way tables with advanced options for excluding categories and formatting reports.
    • Expanded functionality for the frequency function:
      • New two-way tables.
      • Sorted frequency reports and charts.
    // Print sorted frequency table
    // of 'rep78' in 'auto2' dataframe
    frequency(auto2, "rep78", 1);
        Label      Count   Total %    Cum. %
      Average         30     43.48     43.48
         Good         18     26.09     69.57
    Excellent         11     15.94     85.51
         Fair          8     11.59      97.1
         Poor          2     2.899       100
        Total         69       100



    Ready to elevate your research? Try GAUSS 24 today.

    New Time and Date Extraction Tools

    • 12 new procedures for extracting date and time components from dataframe dates.
    • Extract date and time components ranging from seconds to years.

    New Convenience Functions for Data Management and Exploration

    • dropCategories - Drops observations of specific categories from a dataframe and updates the associated labels and key values.
    • getCategories - Returns the category labels for a categorical variable. 
    • isString - Verify if an input is a string or string array. 
    • startsWith - Locates elements that start with a specified string.
    • insertCols - Inserts one or more new columns into a matrix or dataframe at a specified location.

    Improved Performance and Speed-ups

    • Expanded functionality of strindx allows for searching of unique substrings across multiple variables.
    • The upmat function now has the option to specify an offset from the main diagonal, the option to return only the upper triangular elements as a vector, and is faster for medium and large matrices.
    • Significant speed improvements when using combinate with large values of n.
    • Remove missing values from large vectors more efficiently with speed increases in packr.

    Conclusion

    For a complete list of everything GAUSS 24 offers, please see the complete changelog.


    Discover how GAUSS 24 can help you reach your goals.

     
    Announcing the GAUSS Machine Learning Library
    https://www.aptech.com/blog/announcing-the-gauss-machine-learning-library/
    Mon, 28 Aug 2023 14:36:25 +0000

    Introduction

    The new GAUSS Machine Learning (GML) library offers powerful and efficient machine learning techniques in an accessible and friendly environment. Whether you're just getting familiar with machine learning or an experienced technician, you'll be running models in no time with GML.

    Machine Learning Models at Your Fingertips

    With the GAUSS Machine Learning library, you can run machine learning models out of the box, even without any machine learning background. It supports fundamental machine learning models for classification and regression including:

    LASSO regression coefficient response plot.

    Quick and Painless Data Preparation and Management

    We know model fitting and prediction is just the tip of the iceberg when it comes to any data analysis project. That's why we've focused on making GAUSS one of the best environments for data import, cleaning, and exploration.

    GML provides machine learning specific data preparation tools including:

    See how GAUSS reduces the pain and time of data wrangling and lets you get to the heart of your machine learning models quicker.

    Easy to Implement Model Evaluation

    GAUSS classification metrics from machine learning library.

    Compare and evaluate machine learning models with GML's plotting and performance evaluation tools:


    Interested in how GAUSS machine learning can work for you? Contact Us

    Unparalleled Customer Support

    We pride ourselves on offering unparalleled customer support and we truly care about your success. If you can't find what you need in our online documents, user forum, or blog, you can be confident that a GAUSS expert is here to quickly resolve your questions.

    See It In Action

    Want to see GML in action? Check out these real-world applications:

    1. Classification With Regularized Logistic Regression.
    2. Machine Learning With Real-World Data.
    3. Understanding Cross-Validation.
    4. Fundamentals of Tuning Machine Learning Hyperparameters.
    5. Predicting The Output Gap With Machine Learning Regression Models.
    6. Applications of Principal Components Analysis in Finance.
    7. Predicting Recessions With Machine Learning Techniques.
    New Release TSPDLIB 3.0
    https://www.aptech.com/blog/new-release-tspdlib-3-0-0/
    Thu, 20 Jul 2023 16:38:38 +0000

    Introduction

    The preliminary econometric package for Time Series and Panel Data Methods has been updated and functionality has been expanded with over 20 new functions in this release of TSPDLIB 3.0.

    The TSPDLIB 3.0 package includes expanded functions for time series and panel data testing both with and without structural breaks and causality testing.

    It requires GAUSS 23+ for use.

    Changelog 3.0:

    1. New functionality: Add metadata based variable names for improved printing.
    2. Improvement: Simplified data loading formulas using expanded GAUSS 23 functionality.
    3. New unit root testing procedures:
      • fourier_kpss - KPSS stationarity testing with flexible Fourier form, smooth structural breaks.
      • fourier_kss - KSS unit root test with flexible Fourier form, smooth structural breaks.
      • fourier_wadf - Wavelet ADF unit root test with flexible Fourier form, smooth structural breaks.
      • fourier_wkss - Wavelet KSS unit root test with flexible Fourier form, smooth structural breaks.
      • kss - KSS unit root test.
      • qr_fourier_adf - Quantile ADF unit root test with flexible Fourier form, smooth structural breaks.
      • qr_fourier_kss - Quantile KSS unit root test with flexible Fourier form, smooth structural breaks.
      • qr_kss - Quantile KSS unit root test.
      • qks_tests - Quantile Kolmogorov-Smirnov (QKS) tests.
      • wkss - Wavelet KSS unit root test.
      • sbur_gls - Carrion-i-Silvestre, Kim, and Perron (2009) GLS-unit root tests with multiple structural breaks.
    4. New cointegration tests:
    5. New panel data unit root tests:
      • pd_kpss - Carrion-i-Silvestre, et al.(2005) panel data KPSS test with multiple structural breaks.
      • pd_stationary - Tests for unit roots in heterogeneous panel data, with or without cross-sectional averages and with or without flexible Fourier form structural breaks.
    6. New causality tests:
      • asymCause - Hatemi-J tests for asymmetric causality.
      • pd_cause - Tests for Granger causality in heterogeneous panels including Fisher, Zhnc, and SUR Wald tests.
    7. Other new functions:
      • sbvar_icss - Sanso, Arag & Carrion (2002) ICSS test for changes in unconditional variance.
      • pd_getCDError - Tests for cross-sectional dependency.
    8. New examples:
      • actest.e
      • ascomp.e
      • fourier_kss.e
      • fourier_kpss.e
      • fourier_wadf.e
      • fourier_wkss.e
      • kss.e
      • pd_cause.e
      • pd_getcderror.e
      • pd_coint_wedgerton.e
      • pd_kpss.e
      • qr_fourier_adf.e
      • qr_fourier_kss.e
      • qr_kss.e
      • qr_qks.e
      • sbur.e
      • sbvar_icss.e
      • wkss.e

    Citation

    If using this library please include the following citation:

    Nazlioglu, S (2018) TSPDLIB: GAUSS Time Series and Panel Data Methods (Version 3.0). Source Code. https://github.com/aptech/tspdlib

    Getting Started

    Prerequisites

    The program files require a working copy of GAUSS 23+.

    Installing

    The GAUSS Time Series and Panel data tests library should only be installed and updated directly in GAUSS using the GAUSS package manager.

    Before using the functions provided by tspdlib, you will need to load the tspdlib library. This can be done in a number of ways:

    • Navigate to the library tool view window and click the small wrench located next to the tspdlib library. Select Load Library.
    • Enter library tspdlib in the program input/output window.
    • Put the line library tspdlib; at the beginning of your program files.

    Examples

    After installing the library, examples for all available procedures can be found in your GAUSS home directory in the pkgs > tspdlib > examples directory. The examples use GAUSS and .csv datasets, which are included in the same directory.

    Using GAUSS Packages

    For more information on how to make the best use of the TSPDLIB, please see our blog, Using GAUSS Packages Complete Guide.

    Example Applications

    1. A Guide to Conducting Cointegration Tests
    2. How to Conduct Unit Root Tests in GAUSS
    3. Panel data, structural breaks, and unit root testing
    4. Unit Root Tests with Structural Breaks
    5. How to Run the Maki Cointegration Test (Video)
    6. How to Run the Fourier LM Test (Video)

    Classification with Regularized Logistic Regression
    https://www.aptech.com/blog/classification-with-regularized-logistic-regression/
    Wed, 07 Jun 2023 15:59:02 +0000

    Introduction

    Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics.

    In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the GAUSS Machine Learning library, including:

    1. Data preparation.
    2. Model fitting.
    3. Classification predictions.
    4. Evaluating predictions and model fit.

    What is Logistic Regression?

    Logistic regression is a statistical method that can be used to predict the probability of an event occurring based on observed features or variables. The predicted probabilities can then be used to classify the data based on probability thresholds.

    For example, if we are modeling a "TRUE" and "FALSE" outcome, we may predict that an outcome will be "TRUE" for all predicted probabilities of 0.5 and higher.

    Mathematically, logistic regression models the relationship between the probability of an outcome as a logistic function of the independent variables:

    $$ Pr(Y = 1 | X) = p(X) = \frac{e^{B_0 + B_1X}}{1 + e^{B_0 + B_1X}} $$

    The log-odds representation is often more convenient because it is linear in our independent variables:

    $$ \log \bigg( \frac{p(X)}{1 - p(X)} \bigg) = B_0 + B_1X $$

    There are some important aspects of this model to keep in mind:

    • The logistic regression model always yields a prediction between 0 and 1.
    • The magnitude of the coefficients in the logistic regression model cannot be as directly interpreted as in the classic linear model.
    • The signs of the coefficients in the logistic regression model can be interpreted as expected. For example, if the coefficient on $X_1$ is negative we can conclude that increasing $X_1$ decreases $p(X)$.
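
    To make the first point concrete, here is a minimal GAUSS sketch, using hypothetical coefficients rather than estimates from any dataset, showing that the logistic function always maps the linear index into the (0, 1) interval and how a 0.5 threshold converts probabilities into class predictions:

    // Hypothetical coefficients for illustration only
    b0 = -1.5;
    b1 = 0.8;
    
    // Grid of values for a single feature
    x = seqa(-3, 0.5, 13);
    
    // Logistic function: predicted probabilities always lie between 0 and 1
    p = exp(b0 + b1*x) ./ (1 + exp(b0 + b1*x));
    
    // Classify as "TRUE" when the predicted probability is 0.5 or higher
    class = p .>= 0.5;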

    Logistic Regression with Regularization

    One potential pitfall of logistic regression is its tendency for overfitting, particularly with high dimensional feature sets.

    Regularization with L1 and/or L2 penalty terms can help prevent overfitting and improve prediction.

    Comparison of L1 and L2 Regularization

                                                 $L1$ penalty (Lasso)                  $L2$ penalty (Ridge)
    Penalty term                                 $\lambda \sum_{j=1}^p |\beta_j|$      $\lambda \sum_{j=1}^p \beta_j^2$
    Robust to outliers                           Yes                                   No
    Shrinks coefficients                         Yes (some exactly to zero)            Yes
    Can select features                          Yes                                   No
    Sensitive to correlated features             Yes                                   No
    Useful for preventing overfitting            Yes                                   Yes
    Useful for addressing multicollinearity      No                                    Yes
    Requires hyperparameter selection (λ)        Yes                                   Yes
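
    As a small numerical illustration of the two penalty terms above, using an arbitrary coefficient vector and penalty weight (not values from the model fit later in this post):

    // Arbitrary coefficient vector and penalty weight for illustration
    b = { 0.5, -1.2, 0.0, 2.1 };
    lambda = 0.1;
    
    // L1 (lasso) penalty: lambda times the sum of absolute coefficients
    l1_penalty = lambda * sumc(abs(b));
    
    // L2 (ridge) penalty: lambda times the sum of squared coefficients
    l2_penalty = lambda * sumc(b.^2);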

    Predicting Customer Satisfaction Using Survey Data

    Today we will use airline passenger satisfaction data to demonstrate logistic regression with regularization.

    Our task is to predict passenger satisfaction using:

    • Available survey answers.
    • Flight information.
    • Passenger characteristics.
    Variable                      Description
    id                            Responder identification number.
    Gender                        Gender identification: Female or Male.
    Customer Type                 Loyal or disloyal customer.
    Age                           Customer age in years.
    Type of travel                Personal or business travel.
    Class                         Eco or business class seat.
    Flight Distance               Flight distance in miles.
    Wifi service                  Customer rating on 0-5 scale.
    Schedule convenient           Customer rating on 0-5 scale.
    Ease of Online booking        Customer rating on 0-5 scale.
    Gate location                 Customer rating on 0-5 scale.
    Food and drink                Customer rating on 0-5 scale.
    Seat comfort                  Customer rating on 0-5 scale.
    Online boarding               Customer rating on 0-5 scale.
    Inflight entertainment        Customer rating on 0-5 scale.
    On-board service              Customer rating on 0-5 scale.
    Leg room service              Customer rating on 0-5 scale.
    Baggage handling              Customer rating on 0-5 scale.
    Checkin service               Customer rating on 0-5 scale.
    Inflight service              Customer rating on 0-5 scale.
    Cleanliness                   Customer rating on 0-5 scale.
    Departure Delay in minutes    Minutes delayed when departing.
    Arrival Delay in minutes      Minutes delayed when arriving.
    satisfaction                  Overall airline satisfaction. Possible responses include "satisfied" or "neutral or dissatisfied".


    The first step in our analysis is to load our data using loadd:

    new;
    library gml;
    rndseed 8906876;
    
    /*
    ** Load datafile
    */
    // Set path and filename
    load_path = "data/";
    fname = "airline_satisfaction.gdat";
    
    // Load data
    airline_data = loadd(load_path $+ fname);
    
    // Split data
    y = airline_data[., "satisfaction"];
    X = delcols(airline_data, "satisfaction"$|"id");

    Data Exploration

    Before we begin modeling, let's do some preliminary data exploration. First, let's check for common issues that can arise with survey data.

    We'll check for:

    • Duplicate observations.
    • Missing values.

    First, we'll check for duplicates, so any duplicates can be removed prior to checking our summary statistics:

    // Check for duplicates
    isunique(airline_data);

    The isunique procedure returns a 1 if the data is unique and 0 if there are duplicates.

    1.00000000

    In this case, it indicates that we have no duplicates in our data.

    Next, we'll check for missing values:

    /*
    ** Check for data cleaning
    ** issues
    */
    // Summary statistics
    call dstatmt(airline_data);

    This prints summary statistics for all variables:

    Variable                       Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
    -------------------------------------------------------------------------------------------------------
    
    Gender                        -----       -----         -----      Female        Male    103904    0
    Customer Type                 -----       -----         -----  Loyal Cust  disloyal C    103904    0
    Age                           39.38       15.11         228.5           7          85    103904    0
    Type of Travel                -----       -----         -----  Business t  Personal T    103904    0
    Class                         -----       -----         -----    Business    Eco Plus    103904    0
    Flight Distance                2108        1266     1.603e+06           0        3801    103904    0
    Wifi service                  -----       -----         -----           0           5    103904    0
    Schedule convenient           -----       -----         -----           0           5    103904    0
    Ease of Online booking        -----       -----         -----           0           5    103904    0
    Gate location                 -----       -----         -----           0           5    103904    0
    Food and drink                -----       -----         -----           0           5    103904    0
    Online boarding               -----       -----         -----           0           5    103904    0
    Seat comfort                  -----       -----         -----           0           5    103904    0
    Inflight entertainment        -----       -----         -----           0           5    103904    0
    Onboard service               -----       -----         -----           0           5    103904    0
    Leg room service              -----       -----         -----           0           5    103904    0
    Baggage handling              -----       -----         -----           1           5    103904    0
    Checkin service               -----       -----         -----           0           5    103904    0
    Inflight service              -----       -----         -----           0           5    103904    0
    Cleanliness                   -----       -----         -----           0           5    103904    0
    Departure Delay in Minutes    14.82       38.23          1462           0        1592    103904    0
    Arrival Delay in Minutes      15.25       38.81          1506           0        1584    103904    0
    satisfaction                  -----       -----         -----  neutral or   satisfied    103904    0 

    The summary statistics give us some useful insights:

    • There are no missing values in our dataset.
    • The summary statistics of our numerical variables don't indicate any obvious outliers.
    • All categorical survey data ranges from 0 to 5 with the exception of Baggage handling which ranges from 1 to 5. All categorical variables will need to be converted to dummy variables prior to modeling.

    One other observation from our summary statistics is that many of the variable names are longer than necessary. Long variable names can be:

    • Difficult to remember.
    • Prone to typos.
    • Cut off when printing results.

    (Not to mention they can be annoying to type!).

    Let's streamline our variable names using dfname:

    /*
    ** Update variable names
    */
    // Create string array of short names
    string short_names = {"Loyalty", "Reason", "Distance", "Wifi", 
                          "Schedule", "Booking", "Gate", "Boarding", 
                          "Entertainment", "Leg room", "Baggage", "Checkin", 
                          "Departure Delay", "Arrival Delay" };
    
    // Create string array of original names to change                      
    string original_names = { "Customer Type", "Type of Travel", "Flight Distance", "Wifi service",
                              "Schedule convenient", "Ease of Online booking", "Gate location", "Online boarding",
                              "Inflight entertainment", "Leg room service", "Baggage handling", "Checkin service",
                              "Departure Delay in Minutes", "Arrival Delay in Minutes" };
    
    // Change names
    airline_data = dfname(airline_data, short_names, original_names);
    

    Data Visualization

    Data visualization is a great way to get a feel for the relationships between our target variable and our features.

    Let's explore the relationship between the customer and flight characteristics and reported satisfaction.

    In particular, we'll look at how satisfaction relates to:

    • Age.
    • Gender.
    • Flight distance.
    • Seat class.
    • Customer type.

    Preparing Our Data for Plotting

    Today we'll use bar graphs to explore the relationships in our data. In particular, we will sort our data into subgroups and examine how those subgroups report satisfaction.

    For categorical variables, we have naturally defined subgroups. However, for the continuous variables, Age and Distance, we first need to generate bins based on ranges of these variables.

    First, let's place the Age variable in bins. To do this we will use the reclassifycuts and reclassify procedures:

    /*
    ** Create bins for age
    */
    // Set age categories cut points
    // Class 0: 20 and Under
    // Class 1: 21 - 30
    // Class 2: 31 - 40
    // Class 3: 41 - 50
    // Class 4: 51 - 60
    // Class 5: 61 - 70
    // Class 6: Over 70
    cut_pts = { 20, 
                30, 
                40, 
                50, 
                60, 
                70};
    
    // Create numeric classes
    age_new = reclassifycuts(airline_data[., "Age"], cut_pts);
    
    // Generate labels to recode to
    to = "20 and Under"$|
           "21-30"$|
           "31-40"$|
           "41-50"$|
           "51-60"$|
           "61-70"$|
           "Over 70";
    
    // Recode to categorical variable
    age_cat = reclassify(age_new, unique(age_new), to);
    
    // Convert to dataframe
    age_cat = asDF(age_cat, "Age Group");

    For a quick frequency count of this categorical variable, we can use the frequency procedure:

    // Check frequency of age groups
    frequency(age_cat, "Age Group");
           Label      Count   Total %    Cum. %
    20 and Under      11333     10.91     10.91
           21-30      21424     20.62     31.53
           31-40      21203     20.41     51.93
           41-50      23199     22.33     74.26
           51-60      18769     18.06     92.32
           61-70       7220     6.949     99.27
         Over 70        756    0.7276       100
           Total     103904       100     

    Now we will do the same for Distance.

    /*
    ** Create bins for flight distance
    */       
    // Set distance categories
    // Cut points for data 
    cut_pts = { 1000, 
                1500, 
                2000, 
                2500, 
                3000,
                3500};
    
    // Create numeric classes
    distance_new = reclassifycuts(airline_data[., "Distance"], cut_pts);
    
    // Generate labels to recode to
    to = "1000 and Under"$|
           "1001-1500"$|
           "1501-2000"$|
           "2001-2500"$|
           "2501-3000"$|
           "3000-3500"$|
           "Over 3500";
    
    // Recode to categorical variable
    distance_cat = reclassify(distance_new, unique(distance_new), to);
    
    // Convert to dataframe
    distance_cat = asDF(distance_cat, "Flight Range");
    
    // Check frequencies
    frequency(distance_cat, "Flight Range");
             Label      Count   Total %    Cum. %
    1000 and Under      28017     26.96     26.96
         1001-1500      10976     10.56     37.53
         1501-2000       9331      8.98     46.51
         2001-2500       7834      7.54     54.05
         2501-3000       8053      7.75      61.8
         3000-3500      24815     23.88     85.68
         Over 3500      14878     14.32       100
             Total     103904       100    
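
    The subsections below summarize bar charts of satisfaction rates for each of these groupings (the charts themselves are not reproduced here). As a rough sketch of how such a chart could be built, assuming tabulate returns the contingency table as a dataframe when its output is assigned, as in the two-way table examples earlier on this page:

    // Sketch only: satisfaction shares by age group
    // Rename 'Age Group' to 'AgeGroup' to keep the formula string simple
    df_plot = dfname(age_cat, "AgeGroup", "Age Group") ~ airline_data[., "satisfaction"];
    
    // Two-way counts of age group by satisfaction
    tab_age = tabulate(df_plot, "AgeGroup ~ satisfaction");
    
    // Convert counts to within-group percentages
    counts = tab_age[., 2 3];
    pct = 100 * (counts ./ sumr(counts));
    
    // Plot grouped bars with the age-group labels
    plotBar(tab_age[., 1], pct);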

    Age

    We can see from the plot above that passengers 20 and under and passengers over 60 are less likely to be satisfied than other age groups.

    Gender

    The plot suggests that gender has little impact on reported satisfaction.

    Flight Distance

    The flight distance plot shows that there are slightly lower rates of satisfaction for flight lengths 3000 miles and over and flight lengths 1000 miles and under.

    Seat Class

    There is a clear discrepancy in satisfaction between passengers that fly business class and other passengers. Business class customers have a much higher rate of satisfaction than those in economy or economy plus.

    Customer Type

    Finally, it also appears that loyal passengers are more often satisfied customers than disloyal passengers.

    Feature Engineering

    As is common with survey data, a number of our variables are categorical. We need to represent these as dummy variables before modeling.

    We'll do this using the oneHot procedure. However, oneHot only accepts single variables, so we will need to loop through all the categorical variables.

    To do this, we first create a list of all categorical variables.

    /*
    ** Create dummy variables
    */
    // Get all variable names
    col_names = getColNames(X);
    
    // Get types of all variables
    col_types = getColTypes(X);
    
    // Select names of variables
    // that are categorical
    cat_names = selif(col_names, col_types .== "category");

    Next, we loop through all categorical variables and create dummy variables for each one using oneHot.

    // Loop through categorical variables
    // to create dummy variables
    dummy_vars = {};
    for i(1, rows(cat_names), 1); 
        dummy_vars = dummy_vars~oneHot(X[., cat_names[i]]);
    endfor;
    
    // Delete original categorical variables
    // and replace with dummy variables
    X = delcols(x, cat_names)~dummy_vars;

    Model Evaluation

    There are a number of classification metrics that are reported using the classificationMetrics procedure. These metrics provide information about how well the model meets different objectives.

    Model Comparison Measures

    Tool         Description
    Accuracy     Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.
    Precision    How good a model is at correctly identifying the class outcomes. Equal to the number of true positives divided by the number of false positives plus true positives.
    Recall       How good a model is at correctly predicting all the class outcomes. Equal to the number of true positives divided by the number of false negatives plus true positives.
    F1-score     The harmonic mean of precision and recall, giving a more balanced picture of how our model performs. A score of 1 indicates perfect precision and recall.
    We'll keep these in mind as we fit and test our model.

    Logistic Regression Model Fitting

    We're now ready to begin fitting our models. To start, we will prepare our data by:

    Creating training and testing datasets using trainTestSplit.

    // Split data into 70% training and 30% test set
    { y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

    Scaling our data using rescale.

    /*
    ** Data rescaling
    */
    // Number of variables to rescale
    numeric_vars = 4;
    
    // Rescale training data
    { X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");
    
    // Rescale test data using same scaling factors as x_train
    X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);

    Unlike Random Forest models, logistic regression models are sensitive to large differences in the scale of the variables. Standardizing the variables as we do here is a good choice, but is not unequivocally the best option in all cases.

    As you can see above, we compute the mean and standard deviation from the training set and use those parameters to scale the test set. This is important.

    The purpose of our test set is to give us an estimate of how our model will do on unseen data. Using the mean and standard deviation from the entire dataset, computed before the train/test split, would allow information from the test set to "leak" into our model. Information leakage is beyond the scope of this blog post, but in general the test set should be treated like information that is not available until after the model fit is complete.

    Now we're ready to start fitting our models.

    Case One: Logistic Regression Without Regularization

    As a base case, we'll consider a logistic regression model without any regularization. For this case, we'll use all default settings, so our only inputs are the dependent and independent data.

    Using our training data we will:

    1. Train our model using logisticRegFit.
    2. Make predictions on our training data using lmPredict.
    3. Evaluate our training model predictions using classificationMetrics.
    /*************************************
    ** Base case model
    ** No regularization
    *************************************/
    
    /*
    ** Training
    */
    // Declare 'lr_mdl' to be
    // a 'logisticRegModel' structure
    // to hold the trained model
    struct logisticRegModel lr_mdl;
    
    // Train the logistic regression classifier
    lr_mdl = logisticRegFit(y_train, X_train);
    
    // Check training set performance
    y_hat_train = lmPredict(lr_mdl, X_train);
    
    // Model evaluations
    print "Training Metrics";
    call classificationMetrics(y_train, y_hat_train);

    The classificationMetrics procedure prints an evaluation table:

    No regularization
    Training Metrics
    ==============================================================
                                            Classification metrics
    ==============================================================
                      Class   Precision  Recall  F1-score  Support
    
    neutral or dissatisfied        0.93    0.92      0.93    41102
                  satisfied        0.90    0.91      0.90    31631
    
                  Macro avg        0.91    0.92      0.91    72733
               Weighted avg        0.92    0.92      0.92    72733
    
                   Accuracy                          0.92    72733
    /*
    ** Testing
    */
    // Make predictions on the test set, from our trained model
    y_hat_test = lmPredict(lr_mdl, X_test);
    
    /*
    ** Model evaluation
    */
    print "Testing Metrics";
    call classificationMetrics(y_test, y_hat_test);

    This code prints the following to screen:

    Testing Metrics
    ==============================================================
                                            Classification metrics
    ==============================================================
                      Class   Precision  Recall  F1-score  Support
    
    neutral or dissatisfied        0.93    0.92      0.92    17777
                  satisfied        0.90    0.91      0.90    13394
    
                  Macro avg        0.91    0.91      0.91    31171
               Weighted avg        0.91    0.91      0.91    31171
    
                   Accuracy                          0.91    31171

    Comparing our training and testing performance yields some useful observations:

    • First, there is little difference in accuracy across our training and testing datasets, with a training accuracy of 0.92 and a testing accuracy of 0.91.
    • Second, our model achieves the same average F1-score, a balanced measure of performance, on both the training and testing datasets.

    Why is this important? This comparison provides a good indication that we aren't overfitting our training set. Since the main purpose of regularization is to address overfitting the model to the training data, we don't have much reason to use it. However, for demonstration purposes, we'll show how to implement L2 regularization.
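    As a quick refresher before we implement it, L2 (ridge) regularization adds a squared-magnitude penalty on the coefficients to the usual logistic loss, shrinking the fitted coefficients toward zero. In textbook form (the exact scaling of the penalty used inside logisticRegFit may differ):

    $$\min_{\beta} \; -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] + \lambda \lVert \beta \rVert_2^2, \qquad p_i = \frac{1}{1 + e^{-x_i'\beta}}$$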

    Case Two: Logistic Regression With L2 Regularization

    To implement regularization with logisticRegFit, we'll use a logisticRegControl structure.

    /*************************************
    ** L2 Regularization
    *************************************/
    
    /*
    ** Training
    */
    // Declare 'lrc' to be a logisticRegControl
    // structure and fill with default settings 
    struct logisticRegControl lrc;
    lrc = logisticRegControlCreate();
    
    // Set L2 regularization parameter
    lrc.l2 = 0.05;
    
    // Declare 'lr_mdl' to be 
    // a 'logisticRegModel' structure
    // to hold the trained model
    struct logisticRegModel lr_mdl;
    
    // Train the logistic regression classifier
    lr_mdl = logisticRegFit(y_train, X_train, lrc);
    
    /*
    ** Testing
    */
    // Make predictions on the test set
    y_hat_l2 = lmPredict(lr_mdl, X_test);
    
    /*
    ** Model evaluation
    */
    call classificationMetrics(y_test, y_hat_l2);

    The classification metrics are printed:

    L2 regularization
    ==============================================================
                                            Classification metrics
    ==============================================================
                      Class   Precision  Recall  F1-score  Support
    
    neutral or dissatisfied        0.89    0.93      0.91    17777
                  satisfied        0.90    0.84      0.87    13394
    
                  Macro avg        0.90    0.89      0.89    31171
               Weighted avg        0.89    0.89      0.89    31171
    
                   Accuracy                          0.89    31171

    Note that with the L2 penalty, our model performance drops from the base case, with lower accuracy (0.89) and a lower average F1-score (0.89). This isn't surprising, given that we didn't find evidence of overfitting in our model.

    Conclusion

    In today's blog, we've looked at logistic regression and regularization.

    Using a real-world airline passenger satisfaction data application we've:

    1. Performed preliminary data cleaning and setup.
    2. Trained logistic regression models with and without regularization.
    3. Made classification predictions.
    4. Interpreted classification metrics.

    Further Machine Learning Reading

    1. Predicting Recessions with Machine Learning Techniques
    2. Applications of Principal Components Analysis in Finance
    3. Predicting The Output Gap With Machine Learning Regression Models
    4. Fundamentals of Tuning Machine Learning Hyperparameters
    5. Understanding Cross-Validation
    6. Machine Learning With Real-World Data

Machine Learning With Real-World Data https://www.aptech.com/blog/machine-learning-with-real-world-data/ Tue, 16 May 2023 03:38:45 +0000
Introduction

If you've ever done empirical work, you know that real-world data rarely, if ever, arrives clean and ready for modeling. No data analysis project consists solely of fitting a model and making predictions.

In today's blog, we walk through a machine learning project from start to finish. We'll give you a foundation for completing your own machine learning project in GAUSS, working through:

  • Data Exploration and cleaning.
  • Splitting data for training and testing.
  • Model fitting and prediction.
  • Basic feature engineering.

Background

Our Data

Today we will be working with the California Housing Dataset from Kaggle.

This dataset is built from 1990 Census data. Though it is an older dataset, it is a great demonstration dataset and has been popular in many machine learning examples.

The dataset contains 10 variables measured in California at the block group level:

Variable              Description
longitude             Measure of how far west a house is.
latitude              Measure of how far north a house is.
housing_median_age    Median age of a house within a block.
total_rooms           Total number of rooms within a block.
total_bedrooms        Total number of bedrooms within a block.
population            Total number of people residing within a block.
households            Total number of households, a group of people residing within a home unit, for a block.
median_income         Median income for households within a block of houses (measured in tens of thousands of US Dollars).
median_house_value    Median house value for households within a block.
ocean_proximity       Location of the house w.r.t. ocean/sea.

GAUSS Machine Learning

We will use the new GAUSS Machine Learning (GML) library, which provides easy-to-use tools for implementing fundamental machine learning models.

To access these tools, we need to load the library:

// Clear workspace and load library
new;
library gml;

// Set random seed
rndseed 8906876;

Data Exploration and Cleaning

With GML loaded, we are now ready to import and clean our data. The first step is to use the loadd procedure to import our data into GAUSS.

/*
** Import datafile
*/
load_path = "data/";
fname = "housing.csv";

// Load all variables
housing_data = loadd(load_path $+ fname);

Descriptive Statistics

Exploratory data analysis allows us to identify important data anomalies, like outliers and missing values.

Let's start by looking at standard descriptive statistics using the dstatmt procedure:

// Find descriptive statistics
// for all variables in housing_data
dstatmt(housing_data);

This prints a summary table of statistics for all variables.

--------------------------------------------------------------------------------------------------
Variable                  Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
--------------------------------------------------------------------------------------------------
longitude               -119.6       2.004         4.014      -124.3      -114.3     20640    0
latitude                 35.63       2.136         4.562       32.54       41.95     20640    0
housing_median_age       28.64       12.59         158.4           1          52     20640    0
total_rooms               2636        2182     4.759e+06           2   3.932e+04     20640    0
total_bedrooms           537.9       421.4     1.776e+05           1        6445     20433  207
population                1425        1132     1.282e+06           3   3.568e+04     20640    0
households               499.5       382.3     1.462e+05           1        6082     20640    0
median_income            3.871         1.9         3.609      0.4999          15     20640    0
median_house_value   2.069e+05   1.154e+05     1.332e+10     1.5e+04       5e+05     20640    0
ocean_proximity          -----       -----         -----   <1H OCEAN  NEAR OCEAN     20640    0 

These statistics allow us to quickly identify several data issues that we need to address prior to fitting our model:

  1. There are 207 missing observations of the total_bedrooms variable (you may need to scroll to the right of the output).
  2. Many of our variables show potential outliers, with high variance and large ranges. These should be further explored.

Missing Values

To get a better idea of how to best deal with the missing values, let's check the descriptive statistics for the observations with and without missing values separately.

// Conditional check 
// for missing values
e = housing_data[., "total_bedrooms"] .== miss();

// Get descriptive statistics
// for dataset with missing values
dstatmt(selif(housing_data, e));
------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum   Valid  Missing
------------------------------------------------------------------------------------------------

longitude              -119.5       2.001         4.006      -124.1      -114.6      207    0
latitude                 35.5       2.097         4.399       32.66       40.92      207    0
housing_median_age      29.27       11.96         143.2           4          52      207    0
total_rooms              2563        1787     3.194e+06         154   1.171e+04      207    0
total_bedrooms          -----       -----         -----        +INF        -INF        0  207
population               1478        1057     1.118e+06          37        7604      207    0
households                510       386.1     1.491e+05          16        3589      207    0
median_income           3.822       1.956         3.824      0.8527          15      207    0
median_house_value   2.06e+05   1.116e+05     1.246e+10    4.58e+04       5e+05      207    0
ocean_proximity         -----       -----         -----   <1H OCEAN  NEAR OCEAN      207    0
// Get descriptive statistics
// for dataset without missing values
dstatmt(delif(housing_data, e));
-------------------------------------------------------------------------------------------------
Variable                 Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------

longitude              -119.6       2.004         4.014      -124.3      -114.3     20433    0
latitude                35.63       2.136         4.564       32.54       41.95     20433    0
housing_median_age      28.63       12.59         158.6           1          52     20433    0
total_rooms              2637        2185     4.775e+06           2   3.932e+04     20433    0
total_bedrooms          537.9       421.4     1.776e+05           1        6445     20433    0
population               1425        1133     1.284e+06           3   3.568e+04     20433    0
households              499.4       382.3     1.462e+05           1        6082     20433    0
median_income           3.871       1.899         3.607      0.4999          15     20433    0
median_house_value  2.069e+05   1.154e+05     1.333e+10     1.5e+04       5e+05     20433    0
ocean_proximity         -----       -----         -----   <1H OCEAN  NEAR OCEAN     20433    0 

From visual inspection, the descriptive statistics for the observations with missing values are very similar to those for the observations without missing values.

In addition, the missing values make up less than 1% of the total observations. Given this, we will delete the rows containing missing values, rather than imputing our missing values.

We can delete the rows with missing values using the packr procedure:

// Remove rows with missing values
// from housing_data
housing_data = packr(housing_data);
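
For completeness, had the missing share been larger, a common alternative would be imputation rather than deletion. Here is a minimal mean-imputation sketch (not used in this analysis), relying on the standard GAUSS functions packr, meanc, and missrv:

// Hypothetical alternative (not used here): fill missing total_bedrooms
// values with the mean of the observed values instead of dropping rows.
tb = housing_data[., "total_bedrooms"];

// Mean of the non-missing observations
tb_mean = meanc(packr(tb));

// Replace missing values with the mean
housing_data[., "total_bedrooms"] = missrv(tb, tb_mean);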

Outliers

Now that we've removed missing values, let's look for other data outliers. Data visualizations like histograms and box plots are a great way to identify potential outliers.

First, let's create a grid plot of histograms for all of our continuous variables:

/*
** Data visualizations
*/
// Get variables names
vars = getColNames(housing_data);

// Set up plotControl 
// structure for formatting graphs
struct plotControl plt;
plt = plotGetDefaults("bar");

// Set fonts
plotSetFonts(&plt, "title", "Arial", 14);
plotSetFonts(&plt, "ticks", "Arial", 12);

// Loop through the variables and draw histograms
for i(1, rows(vars)-1, 1);
    plotSetTitle(&plt, vars[i]);
    plotLayout(3, 3, i);
    plotHist(plt, housing_data[., vars[i]], 50);
endfor;

Histogram of all variables in our California Housing dataset.

From our histograms, it appears that several variables suffer from outliers:

  • The total_rooms variable, with the majority of the data distributed between 0 and 10,000.
  • The total_bedrooms variable, with the majority of the data distributed between 0 and 2000.
  • The households variable, with the majority of the data distributed between 0 and 2000.
  • The population variable, with the majority of the data distributed between 0 and 100,000.

Box plots of these variables confirm that there are indeed outliers.

plt = plotGetDefaults("box");

// Set fonts
plotSetFonts(&plt, "title", "Arial", 14);
plotSetFonts(&plt, "ticks", "Arial", 12);

string box_vars = { "total_rooms", "total_bedrooms", "households", "population" };

// Loop through the variables and draw boxplots
for i(1, rows(box_vars), 1);
    plotLayout(2, 2, i);
    plotBox(plt, box_vars[i], housing_data[., box_vars[i]]);
endfor;

Let's filter the data to eliminate these outliers:

/*
** Filter to remove outliers
**
** Delete:
**    - total_rooms greater than or equal to 10000
**    - total_bedrooms greater than or equal to 2000
**    - households greater than or equal to 2000
**    - population greater than or equal to 6000
*/
mask = housing_data[., "total_rooms"] .>= 10000;
mask = mask .or housing_data[., "total_bedrooms"] .>= 2000;
mask = mask .or housing_data[., "households"] .>= 2000;
mask = mask .or housing_data[., "population"] .>= 6000;

housing_data = delif(housing_data, mask);

Data Truncation

The histograms also point to truncation issues with housing_median_age and median_house_value. Let's look into this a little further:

  1. We'll confirm that these are the most frequently occurring observations using modec. This provides evidence for our suspicion that these are truncation points.
  2. We'll count the number of observations at these locations.
// House value
mode_value = modec(housing_data[., "median_house_value"]);
print "Most frequent median_house_value:" mode_value;

print "Counts:";
sumc(housing_data[., "median_house_value"] .== mode_value);

// House age
mode_age = modec(housing_data[., "housing_median_age"]);
print "Most frequent housing_median_age:" mode_age;

print "Counts:";
sumc(housing_data[., "housing_median_age"] .== mode_age);
Most frequent median_house_value:
       500001.00
Counts:
       935.00000
Most frequent housing_median_age:
       52.000000
Counts:
       1262.0000

These combined observations make up about 10% of the total observations (935 + 1,262 ≈ 2,200 of roughly 20,400 rows). Because we have no further information about what is occurring at these points, let's remove them from our model.

// Create binary vector with a 1 if either
// 'housing_median_age' or 'median_house_value'
// equal their mode value.
mask = (housing_data[., "housing_median_age"] .== mode_age)
       .or (housing_data[., "median_house_value"] .== mode_value);

// Delete the rows if they meet our above criteria
housing_data = delif(housing_data, mask);

Feature Modifications

Our final data cleaning step is to make feature modifications including:

  1. Rescaling the median_house_value variable to be measured in tens of thousands of US dollars (the same scale as median_income).
  2. Generating dummy variables to account for the categories of ocean_proximity.

First, we rescale the median_house_value:

// Rescale the median house value variable
housing_data[., "median_house_value"] = 
    housing_data[., "median_house_value"] ./ 10000;

Next we generate dummy variables for ocean_proximity.

Let's get a feel for our categorical data using the frequency procedure:

// Check frequency of
// ocean_proximity categories
frequency(housing_data, "ocean_proximity");

This prints a convenient frequency table:

     Label      Count   Total %    Cum. %
 <1H OCEAN       8095     44.89     44.89
    INLAND       6136     34.03     78.93
    ISLAND          2   0.01109     78.94
  NEAR BAY       1525     8.458     87.39
NEAR OCEAN       2273     12.61       100
     Total      18031       100         

We can see from this table that the ISLAND category contains only two observations. We'll exclude it from our modeling dataset.

Now let's create our dummy variables using the oneHot procedure:

/*
** Generate dummy variables for 
** the ocean_proximity using
** one hot encoding
*/
dummy_matrix = oneHot(housing_data[., "ocean_proximity"]);

Finally, we'll save our modeling dataset in a GAUSS .gdat file using saved so we can directly access our clean data in the future:

/*
** Build matrix of features
** Note we exclude: 
**     - ISLAND dummy variable
**     - Original ocean_proximity variable
*/
model_data = delcols(housing_data, "ocean_proximity") ~ 
    delcols(dummy_matrix, "ocean_proximity_ISLAND");

// Saved data matrix
saved(model_data, load_path $+ "/model_data.gdat");

Data Splitting

In machine learning, it's customary to use separate datasets to fit the model and to evaluate model performance. Since the objective of machine learning models is to provide predictions for unseen data, using a testing set provides a more realistic measure of how our model will perform.

To prepare our data for training and testing, we're going to take two steps:

  1. Separate our target variable, median_house_value, and feature set.
  2. Split our data into 70% training and 30% testing datasets using trainTestSplit.
new;
library gml;
rndseed 896876;

/*
** Load datafile
*/
load_path = "data/";
fname = "model_data.gdat";

// Load data
housing_data = loadd(load_path $+ fname);

/*
** Feature management
*/
// Separate dependent and independent data
y = housing_data[., "median_house_value"];
X = delcols(housing_data, "median_house_value");

// Split into 70% training data 
// and 30% testing data
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

Fitting Our Model

Now that we've completed our data cleaning, we're finally ready to fit our model. Today we'll use a LASSO regression model to predict our target variable. LASSO is a form of regularization that has found relative success in economic and financial modeling. It offers a data-driven approach to dealing with high-dimensionality in linear models.
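Concretely, LASSO augments the least squares objective with an L1 penalty on the coefficients, which drives some of them exactly to zero. In textbook form (the precise scaling of $\lambda$ used by lassoFit may differ):

$$\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i'\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$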

Model Fitting

To fit the LASSO model to our target variable, median_house_value, we'll use lassoFit from the GAUSS Machine Learning library.

/*
** LASSO Model
*/
// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);

The lassoFit procedure prints a model description and results:

==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    12
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.347     -1.013   -0.02555
                  latitude     -2.192    -0.9269          0
        housing_median_age    0.07189    0.06384    0.03977
               total_rooms  -0.001004          0          0
            total_bedrooms    0.01165   0.006107   0.004828
                population  -0.004317  -0.003396  -0.001232
                households   0.006808   0.005119          0
             median_income      3.872      3.569      3.457
 ocean_proximity__1H OCEAN     -5.509          0          0
    ocean_proximity_INLAND     -9.437     -5.639     -6.575
  ocean_proximity_NEAR BAY     -7.083    -0.6395          0
ocean_proximity_NEAR OCEAN     -5.198     0.6378     0.6981
                    CONST.     -193.5     -82.98      3.451
===========================================================
                        DF         12         10          7
              Training MSE       33.7       34.7       37.4

The results highlight the variable selection function of LASSO. With $\lambda = 0$, a full least squares model, all features are represented in the model. When we get to $\lambda = 0.3$, the LASSO regression removes 5 of our 12 variables:

  • latitude
  • total_rooms
  • households
  • ocean_proximity__1H OCEAN
  • ocean_proximity_NEAR BAY

As we would expect, median_income has a large positive impact. However, there are a few noteworthy observations about the coefficients for the location related variables.

As we add more regularization to the model by increasing the value of $\lambda$, ocean_proximity__1H OCEAN and ocean_proximity_NEAR BAY are removed from the model, but the effect of ocean_proximity_INLAND increases substantially. latitude is also removed from the model. This could be because these effects are largely also explained by the location dummy variables and median_income.

Prediction

We can now test our model's prediction capability on the testing data using lmPredict:

// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;
Testing MSE

       33.814993
       34.726144
       37.199771

As expected, most of these values are above the training MSE but not by much. The test MSE for the model with the highest $\lambda$ value is actually lower than the training MSE. This suggests that our model is not overfitting.
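As a reminder, the mean squared error reported here is simply the average squared prediction error (assuming meanSquaredError follows the standard definition):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$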

Feature Engineering

Since our model is not overfitting, we can afford to add more variables to the model. We could collect additional variables, but it is likely that there is more information in our current data that we can make more accessible to our estimator. We are going to create new features from combinations of our current features. This is part of a process called feature engineering, which can make substantial contributions to your machine learning models.

We will start by generating per capita variables for total_rooms, total_bedrooms, and households.

/*
** Create per capita variables
** using population
*/
pc_data = housing_data[., "total_rooms" "total_bedrooms" "households"] 
    ./ housing_data[., "population"];

// Convert to a dataframe and add variable names
pc_data = asdf(pc_data, "rooms_pc"$|"bedrooms_pc"$|"households_pc");

Next we will create a variable representing the share of total_rooms made up by total_bedrooms:

beds_per_room = X[.,"total_bedrooms"] ./ X[.,"total_rooms"];

and add these columns to X:

X = X ~ pc_data ~ asdf(beds_per_room, "beds_per_room");

Fit and Predict the New Model

// Reset the random seed so we get the
// same test and train splits as our previous model
rndseed 896876;

// Split our new X into train and test splits
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

// Set lambda values
lambda = { 0, 0.1, 0.3 };

// Declare 'mdl' to be an instance of a
// lassoModel structure to hold the estimation results
struct lassoModel mdl;

// Estimate the model with default settings
mdl = lassoFit(y_train, X_train, lambda);

// Predictions
predictions = lmPredict(mdl, X_test);

// Get MSE
testing_MSE = meanSquaredError(predictions, y_test);
print "Testing MSE"; testing_MSE;
==============================================================================
Model:                        Lasso     Target Variable:    median_house_value
Number observations:          12622     Number features:                    16
==============================================================================

===========================================================
                    Lambda          0        0.1        0.3
===========================================================

                 longitude     -2.495     -1.008          0
                  latitude      -2.36    -0.9354          0
        housing_median_age     0.0808    0.07167    0.04316
               total_rooms -0.0001714          0          0
            total_bedrooms   0.005301   0.001517  0.0008104
                population -0.0004661          0          0
                households  -0.001611          0          0
             median_income      3.947      4.011      3.675
 ocean_proximity__1H OCEAN     -5.171          0          0
    ocean_proximity_INLAND     -8.635     -4.963     -6.235
  ocean_proximity_NEAR BAY     -6.966     -0.875          0
ocean_proximity_NEAR OCEAN     -5.219     0.2927     0.1798
                  rooms_pc      2.678     0.1104          0
               bedrooms_pc     -11.68          0          0
             households_pc      22.23      21.47      20.23
             beds_per_room      33.03      17.03      8.029
                    CONST.     -221.9     -95.55     -3.059
===========================================================
                        DF         16         11          7
              Training MSE       31.6       32.5       34.3
Testing MSE

       31.505169
       32.457936
       34.155290 

Our train and test MSE have improved for all values of $\lambda$. Of our new variables, households_pc and beds_per_room seem to have the strongest effects.

Extensions

We used a linear regression model, LASSO, to model home values. This choice was somewhat ad hoc, and there are a number of alternatives and extensions that could help improve our predictions.

Conclusion

In today's blog we've seen the important role that data exploration and cleaning play in developing a machine learning model. Rarely do we obtain data that we can plug directly into our models. It's best practice to make time for data exploration and cleaning, because any machine learning model is only as reliable as its data.

Further Machine Learning Reading

  1. Predicting Recessions with Machine Learning Techniques
  2. Applications of Principal Components Analysis in Finance
  3. Predicting The Output Gap With Machine Learning Regression Models
  4. Fundamentals of Tuning Machine Learning Hyperparameters
  5. Understanding Cross-Validation
  6. Classification with Regularized Logistic Regression
Understanding Cross-Validation https://www.aptech.com/blog/understanding-cross-validation/ Tue, 02 May 2023 13:08:47 +0000

Introduction

If you've explored machine learning models, you've probably come across the term "cross-validation" at some point. But what exactly is it, and why is it important?

In this blog, we'll break cross-validation into simple terms. With a practical demonstration, we'll equip you with the knowledge to confidently use cross-validation in your machine learning projects.

Model Validation in Machine Learning

Model validation and cross validation using testing and training datasets for machine learning models.

Machine learning validation methods provide a means for us to estimate generalization error. This is crucial for determining which model provides the best predictions for unobserved data.

In cases where large amounts of data are available, machine learning data validation begins with splitting the data into three separate datasets:

  • A training set is used to train the machine learning model(s) during development.
  • A validation set is used to estimate the generalization error of the model created from the training set for the purpose of model selection.
  • A test set is used to estimate the generalization error of the final model.
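
As a rough illustration of how these three sets might be carved out in GAUSS, here is a minimal sketch using the GML trainTestSplit procedure twice. The 60/20/20 proportions and the variable names are our own choices for illustration, not a prescription:

// Illustrative 60/20/20 split into training, validation, and test sets.
// First, hold out 20% of the data as the final test set.
{ y_rest, y_test, X_rest, X_test } = trainTestSplit(y, X, 0.8);

// Then split the remaining 80% again: 75% of it for training
// (60% of the original data) and 25% for validation (20% overall).
{ y_train, y_val, X_train, X_val } = trainTestSplit(y_rest, X_rest, 0.75);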

Cross-Validation in Machine Learning

The model validation process described in the previous section works well when we have large datasets. When data is limited, we must instead use a technique called cross-validation.

The purpose of cross-validation is to provide a better estimate of a model's ability to perform on unseen data. It provides an unbiased estimate of the generalization error, especially in the case of limited data.

There are many reasons we may want to do this:

  • To have a clearer measure of how our model performs.
  • To tune hyperparameters.
  • To make model selections.

The intuition behind cross-validation is simple - rather than training our models on one training set we train our model on multiple subsets of data.

The basic steps of cross-validation are:

  1. Split data into portions.
  2. Train our model on a subset of the portions.
  3. Test our model on the remaining subsets of the data.
  4. Repeat steps 2-3 until the model has been trained and tested on the entire dataset.
  5. Average the model performance across all iterations of testing to get the total model performance.
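
To make step 5 concrete: if $s_j$ is the performance metric computed on the $j$-th of $k$ held-out portions, the cross-validated estimate is simply

$$\text{CV score} = \frac{1}{k} \sum_{j=1}^{k} s_j.$$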

Common Cross-Validation Methods

Though the basic concept of cross-validation is fairly simple, there are a number of ways to go about each step. A few examples of cross-validation methods include:

  1. k-Fold Cross-Validation
    In k-fold cross-validation:

    • The dataset is divided into k equal sized-folds.
    • The model is trained on k-1 folds and tested on the remaining fold.
    • The process is repeated k times, with each fold serving as the test set exactly once.
    • The performance metrics are averaged over the k iterations.
  2. Stratified k-Fold Cross-Validation
    This process is similar to k-fold cross-validation with minor but important exceptions:

    • The class distribution in each fold is preserved.
    • It is useful for imbalanced datasets.
  3. Leave-One-Out Cross-Validation
    The Leave-one-out cross-validation process:

    • Trains the model using all data observations except one.
    • Tests the data using the unused data point.
    • Repeats this for n iterations until each data point is used exactly once as a test set.
  4. Time-Series Cross-Validation
    This cross-validation method, designed specifically for time-series:
    • Splits the data into training and testing sets in a chronologically ordered manner, such as sliding or expanding windows.
    • Trains the model on past data and tests the model on future data, based on the splitting point.
Method: k-Fold Cross-Validation
  Advantages:
  • Provides a good estimate of the model's performance by using all the data for both training and testing.
  • Reduces the variance in performance estimates compared to other methods.
  Disadvantages:
  • Can be computationally expensive, especially for large datasets or complex models.
  • May not work well for imbalanced datasets or when there is a specific order to the data.

Method: Stratified k-Fold Cross-Validation
  Advantages:
  • Ensures that each fold has a representative distribution of classes, which can improve performance estimates for imbalanced datasets.
  • Reduces the variance in performance estimates compared to other methods.
  Disadvantages:
  • Can still be computationally expensive, especially for large datasets or complex models.
  • May not be necessary for balanced datasets where class distribution is already even.

Method: Leave-One-Out Cross-Validation (LOOCV)
  Advantages:
  • Provides the least biased estimate of the model's performance, as the model is tested on every data point.
  • Can be useful when dealing with very limited data.
  Disadvantages:
  • Can be computationally expensive, as it requires training and testing the model n times.
  • May have high variance in performance estimates, due to the small size of the test set.

Method: Time Series Cross-Validation
  Advantages:
  • Accounts for temporal dependencies in time series data.
  • Provides a realistic estimate of the model's performance in real-world scenarios.
  Disadvantages:
  • May not be applicable for non-time series data.
  • Can be sensitive to the choice of window size and data splitting strategy.
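
Since the example below uses k-fold cross-validation, here is a brief, hedged sketch of how a chronological, expanding-window split could be written in GAUSS for time-series data. The window sizes, the placeholder forecast, and all variable names are purely illustrative:

// Expanding-window time-series cross-validation sketch (illustrative only).
// Assumes y and X are ordered chronologically; the window sizes below
// are hypothetical choices, not recommendations.
n_obs = rows(y);
init_train = 200;      // hypothetical initial training window
horizon = 50;          // hypothetical test window per iteration

n_windows = floor((n_obs - init_train) / horizon);
window_mse = zeros(n_windows, 1);

for w(1, n_windows, 1);
    t = init_train + (w - 1) * horizon;

    // Train on everything up to time t, test on the next 'horizon' points
    y_train = y[1:t];
    X_train = X[1:t, .];
    y_test = y[t+1:t+horizon];
    X_test = X[t+1:t+horizon, .];

    // Fit and evaluate a model of your choice here. As a placeholder,
    // we forecast with the training-sample mean of the target.
    y_hat = meanc(y_train) * ones(rows(y_test), 1);
    window_mse[w] = meanc((y_test - y_hat) .* (y_test - y_hat));
endfor;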

k-Fold Cross-Validation Example

Let's look at k-fold cross-validation in action, using the wine quality dataset included in the GAUSS Machine Learning (GML) library. This file is based on the Kaggle Wine Quality dataset.

Our objective is to classify wines into quality categories using 11 characteristics:

  • Fixed acidity.
  • Volatile acidity.
  • Citric acid.
  • Residual sugar.
  • Chlorides.
  • Free sulfur dioxide.
  • Total sulfur dioxide.
  • Density.
  • pH.
  • Sulphates.
  • Alcohol.

We'll use k-fold cross-validation to examine the performance of a random forest classification model.

Data Loading and Organization

First we will load our data directly from the GML library:

/*
** Load data and prepare data
*/
// Filename
fname = getGAUSSHome("pkgs/gml/examples/winequality.csv");

// Load wine quality dataset
dataset = loadd(fname);

After loading the data, we need to shuffle the data and extract our dependent and independent variables.

// Enable repeatable sampling
rndseed 754931;

// Shuffle the dataset (sample without replacement),
// because cvSplit does not shuffle.
dataset = sampleData(dataset, rows(dataset));

y = dataset[.,"quality"];
X = delcols(dataset, "quality");

Setting Random Forest Hyperparameters

After loading our data, we will set the random forest hyperparameters using the dfControl structure.

// Enable GML library functions
library gml;

/*
** Model settings
*/
// The dfModel structure holds the trained model
struct dfModel dfm;

// Declare 'dfc' to be a dfControl
// structure and fill with default settings
struct dfControl dfc;
dfc = dfControlCreate();

// Create 200 decision trees
dfc.numTrees = 200;

// Stop splitting if impurity at
// a node is less than 0.15
dfc.impurityThreshold = 0.15;

// Only consider 2 features per split
dfc.featuresPerSplit = 2;

k-fold Cross-Validation

Now that we have loaded our data and set our hyperparameters, we are ready to fit our random forest model and implement k-fold cross-validation.

First we setup the number of folds and pre-allocate a storage vector for model accuracy.

// Specify number of folds
// This generally is 5-10
nfolds = 5;

// Pre-allocate vector to hold the results
accuracy = zeros(nfolds, 1);

Next we use a GAUSS for loop to complete four steps:

  1. Select testing and training data from our folds using the cvSplit procedure.
  2. Fit our random forest classification model on the chosen training data using decForestCFit procedure.
  3. Make classification predictions using the chosen testing data and the decForestPredict procedure.
  4. Compute and store model accuracy for each iteration.
for i(1, nfolds, 1);
    { y_train, y_test, X_train, X_test } = cvSplit(y, X, nfolds, i);

    // Fit model using this fold's training data
    dfm = decForestCFit(y_train, X_train, dfc);

    // Make predictions using this fold's test data
    predictions = decForestPredict(dfm, X_test);

    accuracy[i] = meanc(y_test .== predictions);
endfor;

Results

Let's print the accuracy results and the total model accuracy:

/*
** Print Results
*/
sprintf("%7s %10s", "Fold", "Accuracy");;
sprintf("%7d %10.2f", seqa(1,1,nfolds), accuracy);
sprintf("Total model accuracy           : %10.2f", meanc(accuracy));
sprintf("Accuracy variation across folds: %10.3f", stdc(accuracy));
   Fold   Accuracy
      1       0.70
      2       0.73
      3       0.65
      4       0.71
      5       0.71
Total model accuracy           :       0.70
Accuracy variation across folds:      0.028

Our results provide some important insights into why we conduct cross-validation:

  • The model accuracy is different across folds, with a standard deviation of 0.028.
  • The maximum accuracy, using fold 2, is 0.73.
  • The minimum accuracy, using fold 3, is 0.65.

Depending on how we split our testing and training data, we could get a different picture of model performance.

The total model accuracy, at 0.70, gives a better overall measure of model performance. The standard deviation of the accuracy gives us some insight into how much our prediction accuracy might vary.
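
Using the rounded per-fold accuracies shown above, the final averaging step is simply

$$\frac{0.70 + 0.73 + 0.65 + 0.71 + 0.71}{5} = 0.70.$$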

Conclusion

If you're looking to improve the accuracy and reliability of your statistical analysis, cross-validation is a crucial technique to learn. In today's blog we've provided a guide to getting started with cross-validation.

Our step-by-step practical demonstration using GAUSS should prepare you to confidently implement cross-validation in your own data analysis projects.

Further Machine Learning Reading

  1. Predicting Recessions with Machine Learning Techniques
  2. Applications of Principal Components Analysis in Finance
  3. Predicting The Output Gap With Machine Learning Regression Models
  4. Fundamentals of Tuning Machine Learning Hyperparameters
  5. Machine Learning With Real-World Data
  6. Classification with Regularized Logistic Regression
