
Get Started with Panel Data in GAUSS (Video)

Eric — Wed, 17 Apr 2024 16:00:50 +0000

Introduction

In this video, you'll learn the basics of panel data analysis in GAUSS. We walk through panel data modeling end to end, from loading data to estimating a model with group-specific intercepts.

This video is available, along with all GAUSS videos, on our GAUSS YouTube Channel. Be sure to explore all our GAUSS videos and subscribe to the channel to get the latest videos as they are released.

Summary and Timeline

You'll see firsthand how to:

Load and verify panel data.
Merge data from different sources.
Convert between wide and long form panel data.
Explore and clean data.
Create panel data plots.
Prepare panel data for estimation.
Estimate a model with group-specific intercepts.

Timeline

0:41 Set the current working directory.
1:03 Load panel data from an Excel file.
5:32 Merge data from different sources.
6:53 Preliminary data cleaning.
8:40 Panel data plots.
11:12 Stationarity testing.
11:56 Convert long form to wide form panel data.
14:49 Estimate a model with group-specific intercepts.


New Video! Get Started with Choice Modeling in GAUSS

Eric — Mon, 08 Apr 2024 16:32:03 +0000

Introduction

In this video, you'll learn the basics of choice data analysis in GAUSS. Our video demonstration shows just how quick and easy it is to get started with everything from data loading to discrete data modeling.

Summary and Timeline

You'll see firsthand how to:

Load and verify survey data.
Compute descriptive statistics.
Merge data from different sources.
Create basic scatter and frequency plots.
Fit a basic probit model.

Timeline

0:52 Load and verify CSV survey data.
2:53 Change the base case of a categorical variable.
5:24 Merge dataframes.
6:40 Descriptive statistics.
9:25 XY and frequency plots.
11:11 Create an indicator variable from a categorical choice variable.
12:25 Create a categorical variable and set the labels for the levels.
14:47 Estimate a probit model.


Introducing the GAUSS Data Management Guide

Eric — Tue, 20 Feb 2024 18:50:08 +0000

Introduction

If you've worked with real-world data, you know that data cleaning and management can eat up your time. Efficiently tackling tedious data cleaning, organization, and management tasks can have a huge impact on productivity.

We created the GAUSS Data Management Guide with that exact goal in mind. It aims to help you save time and make the most of your data.

Today's blog looks at what the GAUSS Data Management Guide offers and how to best use the guide.

What is the GAUSS Data Management Guide?

The GAUSS Data Management Guide is a comprehensive reference tool for accomplishing data-related tasks in GAUSS. It provides a detailed roadmap for working with data in GAUSS, from basic data import and manipulation to advanced data cleaning and visualization.

The guide is intentionally designed for all levels of GAUSS users with:

Extensive coverage.
Step-by-step instructions.
Annotated examples.

What does the GAUSS Data Management Guide cover?

The GAUSS Data Management Guide includes sections for:

How should I use the GAUSS Data Management Guide?

Use page outlines, located on the right-hand side of each page, to identify and navigate to specific tasks.
Copy the examples in the guide and paste into GAUSS program files to use as templates.
Use the links to complete function reference pages to find additional support.

Conclusion

The GAUSS Data Management Guide provides practical examples, detailed instructions, and comprehensive coverage that can help you work productively and efficiently with your data.

Using Feasible Generalized Least Squares To Improve Estimates

Eric — Thu, 25 Jan 2024 22:09:11 +0000

Introduction

Data analysis in reality is rarely as clean and tidy as it is presented in the textbooks. Consider linear regression -- data rarely meets the stringent assumptions required for OLS. Failing to recognize this and incorrectly implementing OLS can lead to embarrassing, inaccurate conclusions.

In today's blog, we'll look at how to use feasible generalized least squares to deal with data that does not meet the OLS assumption of Independent and Identically Distributed (IID) error terms.

What Is Feasible Generalized Least Squares (FGLS)?

FGLS is a flexible and powerful tool that provides a reliable approach for regression analysis in the presence of non-constant variances and correlated errors.

Feasible Generalized Least Squares (FGLS):

Is an extension of the traditional Ordinary Least Squares (OLS) regression method.
Accommodates heteroscedasticity and serial correlation in data.
Allows for more robust parameter estimates by considering the structure of the error terms.

Why is this important?

Recall the fundamental OLS IID assumption which implies that the error terms have constant variance and are uncorrelated. When this assumption is violated:

OLS estimators are no longer efficient.
The estimated covariance matrix of the coefficients will be inconsistent.
Standard inferences will be incorrect.

Unfortunately, many real-world cross-sectional, panel data, and time series datasets do violate this fundamental assumption.

FGLS allows for a more accurate modeling of complex and realistic data structures by accommodating the heteroscedasticity and autocorrelation in the error terms.

How Does FGLS Work?

FGLS uses a weighting matrix that captures the structure of the variance-covariance matrix of the errors.

This allows FGLS to:

Give more weight to observations with smaller variances.
Account for correlations.
Provide more efficient and unbiased estimates in the presence of non-constant variance and serial correlation.
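
For reference, the weighting described above is usually written as the textbook GLS estimator with the unknown error covariance replaced by an estimate. The notation here is an assumption for illustration and is not taken from the GAUSS documentation: y is the dependent variable, X is the regressor matrix, and \hat{\Omega} is the estimated covariance matrix of the errors (known up to scale).

```latex
\hat{\beta}_{FGLS} = \left( X' \hat{\Omega}^{-1} X \right)^{-1} X' \hat{\Omega}^{-1} y

% Two common structures for \hat{\Omega}:
% heteroscedasticity only:  \hat{\Omega} = \mathrm{diag}(\hat{\sigma}_1^2, \ldots, \hat{\sigma}_n^2)
% AR(1) errors:             \hat{\Omega}_{ij} \propto \hat{\rho}^{\,|i-j|}
```

When \hat{\Omega} is proportional to the identity matrix, the expression reduces to the usual OLS estimator.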

The method uses a relatively simple iterative process:

1. Pick a method for estimating the covariance matrix based on the believed data structure.
2. Make initial OLS parameter estimates.
3. Use the OLS residuals and the chosen method to estimate an initial covariance matrix.
4. Compute FGLS estimates using the estimated covariance matrix for weighting.
5. Calculate residuals and refine the weighting matrix.
6. Repeat steps 3, 4, and 5 until convergence.
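
To make the iteration concrete, here is a minimal sketch for the AR(1) case using only basic GAUSS matrix operations. It is a Cochrane-Orcutt style illustration on simulated data, not the internals of any built-in procedure. The data, the true coefficients, the tolerance, and all variable names are assumptions made for the example.

```gauss
// Simulate a regression with AR(1) errors (illustrative data only)
rndseed 4321;
n = 200;
x = ones(n, 1) ~ rndn(n, 1);
shock = rndn(n, 1);
u = zeros(n, 1);
for t (2, n, 1);
    u[t] = 0.6 * u[t-1] + shock[t];
endfor;
y = x * (1 | 2) + u;

// Initial OLS parameter estimates
b = invpd(x'x) * (x'y);

rho_old = 0;
for iter (1, 50, 1);
    // Estimate the AR(1) coefficient from the current residuals
    res = y - x * b;
    res_lag = res[1:n-1];
    res_cur = res[2:n];
    rho = (res_lag' * res_cur) / (res_lag' * res_lag);

    // Quasi-difference the data with the estimated rho and
    // re-estimate the coefficients (the first observation is dropped)
    y_star = y[2:n] - rho * y[1:n-1];
    x_star = x[2:n, .] - rho * x[1:n-1, .];
    b = invpd(x_star'x_star) * (x_star'y_star);

    // Stop once the estimate of rho settles down
    if abs(rho - rho_old) < 1e-8;
        break;
    endif;
    rho_old = rho;
endfor;

print "Estimated rho:" rho;
print "FGLS coefficient estimates:";
print b;
```

Because the constant column is quasi-differenced along with the other regressors, the elements of `b` remain estimates of the original intercept and slope.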

How Do I Know If I Should Use FGLS?

We've already noted that you should use FGLS when you encounter heteroscedasticity and/or autocorrelation. It's easy to say this but how do you identify when this is the case?

There are a number of tools that can help.

Example Tools for Identifying Heteroscedasticity and Autocorrelation
Tool	Description	Used to Identify
Scatter plots	Plot the dependent variable against each independent variable and look for patterns that suggest relationships between the variance and variables. Plot the residuals over time and look for cycles or trends in the residuals.	Heteroscedasticity and autocorrelation.
Residual plot	A fan-shaped or funnel-shaped pattern in a plot of the residuals against fitted values indicates that the variance of the residuals is not constant across all levels of the independent variable. A pattern of correlation in plots of residuals against lagged residuals may indicate autocorrelation.	Heteroscedasticity and autocorrelation.
Histogram of residuals	Plot a histogram of the residuals. If the histogram is skewed or has unequal spread, it could suggest heteroscedasticity or non-normal distribution.	Heteroscedasticity.
Durbin-Watson statistic	The Durbin-Watson statistic tests for first-order autocorrelation in the residuals. The test statistic ranges from 0 to 4, with values around 2 indicating no autocorrelation.	Autocorrelation.
Breusch-Pagan test	The Breusch-Pagan test considers the null hypothesis of homoscedasticity against the alternative of heteroscedasticity.	Heteroscedasticity.
Breusch-Godfrey test	The Breusch-Godfrey test extends the Durbin-Watson test to higher-order autocorrelation. The test assesses whether larger lags of residuals and independent variables help explain the current residuals.	Autocorrelation.
White test	Similar to the Breusch-Pagan test, the White test considers the null hypothesis of homoscedasticity.	Heteroscedasticity.
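
As an illustration of one of these tools, the Durbin-Watson statistic can be computed directly from a residual vector. The sketch below uses a hypothetical helper, `dwFromResid`, and simulated white-noise residuals. Neither comes from the original post.

```gauss
// Hypothetical helper: Durbin-Watson statistic from a residual vector
proc (1) = dwFromResid(e);
    local n, d;
    n = rows(e);

    // First differences of the residuals
    d = e[2:n] - e[1:n-1];

    // Sum of squared differences over the sum of squared residuals
    retp(sumc(d.^2) ./ sumc(e.^2));
endp;

// Illustrative residuals with no autocorrelation
rndseed 97531;
e = rndn(250, 1);

print "Durbin-Watson:" dwFromResid(e);
```

For uncorrelated residuals such as these, the statistic should land near 2.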

Example One: US Consumer Price Index (CPI)

Let's get a better feel for FGLS using real-world data. In this application, we will:

Find OLS estimates and examine the results for signs of heteroscedasticity and autocorrelation.
Compute FGLS estimates and discuss results.

Data

For this example, we will use publicly available FRED time series data:

Consumer Price Index for All Urban Consumers: All Items in U.S. City Average (CPIAUCSL), seasonally adjusted.
Compensation of employees, paid (COE), seasonally adjusted.

Both variables are quarterly, continuously compounded rates of change spanning from 1947Q2 to 2023Q3.

// Load data 
fred_fgls = loadd("fred_fgls.gdat");

// Preview data
head(fred_fgls);
tail(fred_fgls);

            date              COE         CPIAUCSL
      1947-04-01      0.013842900      0.014184600
      1947-07-01      0.015131100      0.021573900
      1947-10-01      0.030381600      0.027915600
      1948-01-01      0.025448400      0.020966300
      1948-04-01      0.011788800      0.015823300

            date              COE         CPIAUCSL
      2022-07-01      0.023345300      0.013491800
      2022-10-01     0.0048207000      0.010199500
      2023-01-01      0.021001700     0.0093545000
      2023-04-01      0.013436000     0.0066823000
      2023-07-01      0.013337800     0.0088013000

For convenience, we're using an already saved GAUSS dataframe to load the data. However, FRED data can be directly loaded into GAUSS using a personal API KEY and the GAUSS FRED data tools. The code for loading the data from FRED is available here.

OLS Estimation

Let's start by using OLS to examine the relationship between COE and CPI returns. We'll be sure to have GAUSS save our residuals so we can use them to evaluate OLS performance.

// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();

// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the
// Durbin-Watson statistic
ols_ctl.res = 1;

// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;

// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(fred_fgls, "CPIAUCSL ~ COE", ols_ctl);

Valid cases:                   306      Dependent variable:            CPIAUCSL
Missing cases:                   0      Deletion method:                   None
Total SS:                    0.019      Degrees of freedom:                 304
R-squared:                   0.197      Rbar-squared:                     0.195
Residual SS:                 0.016      Std error of est:                 0.007
F(1,304):                   74.673      Probability of F:                 0.000
Durbin-Watson:               0.773

                         Standard                 Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT   0.00397578 0.000678566     5.85909     0.000       ---         ---
COE          0.303476   0.0351191     8.64133     0.000    0.444067    0.444067

Evaluating the OLS Results

Taken at face value, these results look good. The standard errors on both estimates are small and both variables are statistically significant. We may be tempted to stop there. However, let's look more closely using some of the tools mentioned earlier.

Checking For Heteroscedasticity

First, let's create some plots using our residuals to check for heteroscedasticity. We will look at:

A histogram of the residuals.
The residuals versus the independent variable.

/*
** Plot a histogram of the residuals 
** Check for skewed distribution
*/
plotHist(ols_out.resid, 50);

Our histogram indicates that the residuals from our OLS regression are asymmetric and slightly skewed. While the results aren't dramatic, they warrant further exploration to check for heteroscedasticity.

/*
** Plot residuals against COE
** Check for increasing or decreasing variance 
** as the independent variable changes.
*/
plotScatter(fred_fgls[., "COE"], ols_out.resid);

It's hard to determine if these results are indicative of heteroscedasticity or not. Let's add random normal observations to our scatter plot and see how they compare.

// Add random normal observations to our scatter plot
// Divide by 100 to put them on the same scale as the residuals
rndseed 897680;
plotAddScatter(fred_fgls[., "COE"], rndn(rows(ols_out.resid), 1)/100);

Our residual plot doesn't vary substantially from the random normal observations and there isn't strong visual evidence of heteroscedasticity.

If we did have heteroscedasticity, our residuals would exhibit a fan-like shape, indicating a change in the spread between residuals as our observed data changes. For example, consider this plot of hypothetical residuals against COE:

Checking For Autocorrelation

As you may have noticed, we don't have to look further than our OLS results for signs of autocorrelation. The olsmt procedure reports the Durbin-Watson statistic as part of the printed output. For this regression, the Durbin-Watson statistic is 0.773, which is well below 2 and suggests positive autocorrelation.
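
A rough way to read this number uses the standard large-sample approximation that links the Durbin-Watson statistic d to the first-order autocorrelation of the residuals. The symbol \hat{\rho} is introduced here for illustration.

```latex
d \approx 2\,(1 - \hat{\rho})
\quad\Longrightarrow\quad
\hat{\rho} \approx 1 - \frac{d}{2} = 1 - \frac{0.773}{2} \approx 0.61
```

Under this approximation, the residuals have a first-order autocorrelation of roughly 0.6.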

We can find further support for this conclusion by inspecting our residual plots, starting with a plot of the residuals against time.

// Checking for autocorrelation
/*
** Plot the residuals over time and 
** look for cycles or trends to 
** check for autocorrelation.
*/
plotXY(fred_fgls[., "date"], ols_out.resid);

Our time plot of residuals:

Has extended periods of large residuals (roughly 1970-1977, 1979-1985, and 2020-2022).
Suggests positive autocorrelation.

Now let's examine the plot of our residuals against lagged residuals:

/*
** Plot residuals against lagged residuals 
** look for relationships and trends
*/
// Lag residuals by one period; the first element
// of the lagged series is a missing value
lagged_res = lagn(ols_out.resid, 1);

// Trim the first observation from both series and
// plot residuals against lagged residuals
plotScatter(trimr(lagged_res, 1, 0), trimr(ols_out.resid, 1, 0));

This plot gives an even clearer picture of our autocorrelation issue, showing:

A clear linear relationship between the residuals and their lags.
Larger residuals in the previous period lead to larger residuals in the current period.

FGLS Estimation

After examining the results more closely from the OLS estimation, we have clear support for using FGLS. We can do this using the fgls procedure, introduced in GAUSS 24.

The GAUSS `fgls` Procedure

The fgls procedure allows for model specification in one of two styles. The first style requires a dataframe input and a formula string:

// Calling fgls using a dataframe and formula string
out = fgls(data, formula);

The second option requires an input matrix or dataframe containing the dependent variable and an input matrix or dataframe containing the independent variables:

// Calling fgls using dependent variable
// and independent variable inputs
out = fgls(depvar, indvars);

Both options also allow for:

An optional input specifying the computation method for the weighting matrix. GAUSS includes 7 pre-programmed options for the weighting matrix or allows for a user-specified weighting matrix.
An optional fglsControl structure input for advanced estimation settings.

out = fgls(data, formula [, method, ctl])

The results from the FGLS estimation are stored in a fglsOut structure containing the following members:

Member	Description
out.beta_fgls	The feasible generalized least squares estimates of the parameters.
out.sigma_fgls	Covariance matrix of the estimated parameters.
out.se_fgls	Standard errors of the estimated parameters.
out.ci	Confidence intervals of the estimated parameters.
out.t_stats	The t-statistics of the estimated parameters.
out.pvts	The p-values of the t-statistics of the estimated parameters.
out.resid	The estimated residuals.
out.df	Degrees of freedom.
out.sse	Sum of squared errors.
out.sst	Total sum of squares.
out.std_est	Standard deviation of the residuals.
out.fstat	Model f-stat.
out.pvf	P-value of the model f-stat.
out.rsq	R-squared.
out.dw	Durbin-Watson statistic.
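
As a short usage sketch, the members listed above can be read directly from the output structure after estimation. This assumes GAUSS 24 or later and that the `fred_fgls.gdat` file from this post is in the working directory. The choice of which members to print is purely illustrative.

```gauss
// Load the data used in this post
fred_fgls = loadd("fred_fgls.gdat");

// Estimate with the default settings and store
// the results in an fglsOut structure
struct fglsOut fOut;
fOut = fgls(fred_fgls, "CPIAUCSL ~ COE");

// Access individual results by member name
print "Coefficient estimates:";
print fOut.beta_fgls;

print "Standard errors:";
print fOut.se_fgls;

print "Durbin-Watson statistic:" fOut.dw;
```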

Running FGLS

Let's use FGLS and see if it helps with autocorrelation. We'll start with the default weighting matrix, which is an AR(1) structure.

// Estimate FGLS parameters using
// default setting
struct fglsOut fOut;
fOut = fgls(fred_fgls, "CPIAUCSL ~ COE");

Valid cases:                    306          Dependent variable:        CPIAUCSL
Total SS:                     0.019          Degrees of freedom:             304
R-squared:                    0.140          Rbar-squared:                 0.137
Residual SS:                  0.017          Std error of est:             0.007
F(1,304)                     49.511          Probability of F:             0.000
Durbin-Watson                 0.614

--------------------------------------------------------------------------------
                        Standard                    Prob
Variable   Estimates       Error     t-value        >|t|  [95% Conf.   Interval]
--------------------------------------------------------------------------------

Constant     0.00652    0.000908        7.19       0.000     0.00474      0.0083
COE             0.14      0.0286         4.9       0.000       0.084       0.196

The FGLS estimates using the AR(1) weighting matrix differ from our OLS estimates in both the coefficients and standard errors.

Let's look at a plot of our residuals:

// Plot FGLS residual 
lagged_resid = lagn(fOut.resid, 1);
plotScatter(lagged_resid, fOut.resid);

Our residuals suggest that FGLS hasn't fully addressed our autocorrelation. What should we take from this?

This likely means that we need to consider higher-order autocorrelation. We may want to extend this analysis by:

Running the Breusch-Godfrey test to check for higher-order autocorrelation.
Examining the autocorrelation function (ACF) and partial autocorrelation functions (PACF).
Estimating an alternative time series model, such as an ARIMA model.
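
As a starting point for the ACF check, the sample autocorrelations of a residual vector can be computed by hand. The sketch below uses a hypothetical helper, `sampleACF`, and simulated AR(1) residuals. In practice you would pass in the residuals from your own estimation.

```gauss
// Hypothetical helper: sample autocorrelations for lags 1 to k
proc (1) = sampleACF(e, k);
    local n, ec, denom, ac;
    n = rows(e);

    // Center the residuals
    ec = e - meanc(e);
    denom = ec' * ec;

    ac = zeros(k, 1);
    for j (1, k, 1);
        // Cross-product of the series with its j-th lag,
        // scaled by the total sum of squares
        ac[j] = (ec[1+j:n]' * ec[1:n-j]) / denom;
    endfor;

    retp(ac);
endp;

// Illustrative residuals that follow an AR(1) process
rndseed 1357;
n = 300;
shock = rndn(n, 1);
e = zeros(n, 1);
for t (2, n, 1);
    e[t] = 0.6 * e[t-1] + shock[t];
endfor;

print "Sample autocorrelations, lags 1 to 4:";
print sampleACF(e, 4);
```

Autocorrelations that die out slowly, or that remain sizable beyond the first lag, would point toward a richer error structure than AR(1).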


Example Two: American Community Survey

Now let's consider a second example using a subset of data from the 2019 American Community Survey (ACS).

Data

The 2019 ACS data subset was cleaned and provided by the Social Science Computing Cooperative at the University of Wisconsin-Madison.

The survey data subset contains 5000 observations of the following variables:

Variable	Census Codebook Name	Description
household	SERIALNO	Housing unit identifier.
person	SPORDER	Person number.
state	ST	State.
age	AGEP	Age in years.
other_languages	LANX	Another language is spoken at home.
english	ENG	Self-rated ability to speak English, if another language is spoken.
commute_time	JWMNP	Travel time to work in minutes, top-coded at 200.
marital_status	MAR	Marital status.
education	SCHL	Educational attainment, collapsed into categories.
sex	SEX	Sex (male or female).
hours_worked	WKHP	Usual hours worked per week in the past 12 months.
weeks_worked	WKWN	Weeks worked per year in the past 12 months.
race	RAC1P	Race.
income	PINCP	Total income in current dollars, rounded.

Let's run a naive model of income against two independent variables, age and hours_worked.

Our first step is loading our data:

/*
** Step One: Data Loading 
** Using the 2019 ACS 
*/
// Load data 
acs_fgls = loadd("acs2019sample.dta", "income + age + hours_worked");

// Review the summary statistics
dstatmt(acs_fgls);

---------------------------------------------------------------------------------------------
Variable             Mean     Std Dev      Variance     Minimum     Maximum    Valid Missing
---------------------------------------------------------------------------------------------

income          4.062e+04   5.133e+04     2.634e+09       -8800   6.887e+05     4205    795
age                 43.38       24.17           584           0          94     5000      0
hours_worked        38.09       13.91         193.5           1          99     2761   2239

Based on our descriptive statistics, there are a few data cleaning steps that will help our model:

Remove missing values using the packr procedure.
Transform income to thousands of dollars to improve data scaling.
Remove cases with negative incomes.

// Remove missing values
acs_fgls = packr(acs_fgls);

// Transform income
acs_fgls[., "income"] = acs_fgls[., "income"]/1000;

// Filter out cases with negative incomes
acs_fgls = delif(acs_fgls, acs_fgls[., "income"] .< 0);

In this example we will overlook the issues that arise with using truncated data.

OLS Estimation

Now we're ready to run a preliminary OLS estimation.

// Declare 'ols_ctl' to be an olsmtControl structure
// and fill with default settings
struct olsmtControl ols_ctl;
ols_ctl = olsmtControlCreate();

// Set the 'res' member of the olsmtControl structure
// so that 'olsmt' will compute residuals and the Durbin-Watson statistic
ols_ctl.res = 1;

// Declare 'ols_out' to be an olsmtOut structure
// to hold the results of the computations
struct olsmtOut ols_out;

// Perform estimation, using settings in the 'ols_ctl'
// control structure and store the results in 'ols_out'
ols_out = olsmt(acs_fgls, "income ~ age + hours_worked", ols_ctl);

Valid cases:                  2758      Dependent variable:              income
Missing cases:                   0      Deletion method:                   None
Total SS:              8771535.780      Degrees of freedom:                2755
R-squared:                   0.147      Rbar-squared:                     0.146
Residual SS:           7481437.527      Std error of est:                52.111
F(2,2755):                 237.536      Probability of F:                 0.000
Durbin-Watson:               1.932

                             Standard                 Prob   Standardized  Cor with
Variable         Estimate      Error      t-value     >|t|     Estimate    Dep Var
-----------------------------------------------------------------------------------
CONSTANT         -31.0341     3.91814    -7.92062     0.000       ---         ---
age              0.762573   0.0620066     12.2983     0.000    0.216528    0.227563
hours_worked      1.25521   0.0715453     17.5443     0.000    0.308893    0.316628

Our results make intuitive sense and suggest that:

Both age and hours_worked are statistically significant.
Increases in age are associated with increases in income.
Increases in hours_worked are associated with increases in income.

Evaluating the OLS Results

As we know from our previous example, we need to look beyond the estimated coefficients and standard errors when evaluating our model results. Let's start with the histogram of our residuals:

/*
** Plot a histogram of the residuals 
** Check for skewed distribution
*/
plotHist(ols_out.resid, 50);

The histogram of our residuals is right skewed with a long tail on the right side.

However, because our initial data is truncated, residual scatter plots will be more useful for checking for heteroscedasticity.

/*
** Plot residuals against independent variables
** Check for increasing or decreasing variance 
** as the independent variable changes.
*/
plotScatter(acs_fgls[., "age"], ols_out.resid);

// Open second plot window
plotOpenWindow();
plotScatter(acs_fgls[., "hours_worked"], ols_out.resid);

Both plots show signs of heteroscedasticity:

The age scatter plot demonstrates the tell-tale fan-shaped relationship with residuals. This indicates that variance in residuals increases as age increases.
The hours_worked scatter plot is less obvious but does seem to indicate higher variance in the residuals at the middle ranges (40-60) than the lower and higher ends.
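
A formal test can back up the visual check. The sketch below computes a Breusch-Pagan style LM statistic by hand, using the common n times R-squared form of the auxiliary regression (often attributed to Koenker). It runs on simulated data whose error variance grows with the regressor. All names and values are assumptions for illustration.

```gauss
// Simulate data with error variance that grows with x
rndseed 2024;
n = 500;
x = 20 + 40 * rndu(n, 1);
y = 5 + 0.8 * x + (0.1 * x) .* rndn(n, 1);

// OLS residuals from the main regression
z = ones(n, 1) ~ x;
b = invpd(z'z) * (z'y);
res = y - z * b;

// Auxiliary regression of squared residuals on the regressors
u = res.^2;
g = invpd(z'z) * (z'u);
r2 = 1 - sumc((u - z * g).^2) ./ sumc((u - meanc(u)).^2);

// LM statistic and its chi-square p-value, with degrees of
// freedom equal to the number of non-constant regressors
lm = n * r2;
pval = cdfChic(lm, cols(z) - 1);

print "LM statistic:" lm;
print "p-value:" pval;
```

A small p-value is evidence against the null hypothesis of homoscedasticity.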

FGLS Estimation

To address the issues of heteroscedasticity, let's use FGLS. This time we'll use the "HC0" weighting matrix (White, 1980).

// Estimate FGLS parameters 
// using the HC0 weighting matrix
struct fglsOut fOut;
fOut = fgls(acs_fgls, "income ~ age + hours_worked", "HC0");

Valid cases:                   2758              Dependent variable:          income
Total SS:               8771535.780              Degrees of freedom:            2755
R-squared:                    0.147              Rbar-squared:                 0.146
Residual SS:            7481440.027              Std error of est:            52.111
F(2,2755)                   237.535              Probability of F:             0.000
Durbin-Watson                 1.932

-------------------------------------------------------------------------------------
                             Standard                    Prob
     Variable   Estimates       Error     t-value        >|t|  [95% Conf.   Interval]
-------------------------------------------------------------------------------------
    Constant       -30.9      0.0743        -416       0.000       -31.1       -30.8
         age       0.762    0.000475    1.61e+03       0.000       0.761       0.763
hours_worked        1.25     0.00181         694       0.000        1.25        1.26

While using FGLS results in slightly different coefficient estimates, it has a big impact on the standard error estimations. In this case, these changes don't have an impact on our inferences -- all of our regressors are still statistically significant.

Conclusion

Today we've seen how FGLS offers a potential solution for data that doesn't fall within the restrictive IID assumption of OLS.

After today, you should have a better understanding of how to:

Identify heteroscedasticity and autocorrelation.
Compute OLS and FGLS estimates using GAUSS.

Getting Started With Survey Data In GAUSS

Eric — Thu, 11 Jan 2024 15:27:35 +0000

Introduction

Survey data is a powerful analysis tool, providing a window into people's thoughts, behaviors, and experiences. By collecting responses from a diverse sample of respondents on a range of topics, surveys offer invaluable insights. These can help researchers, businesses, and policymakers make informed decisions and understand diverse perspectives.

In today's blog we'll look more closely at survey data including:

Fundamental characteristics of survey data.
Data cleaning considerations.
Data exploration using frequency tables and data visualizations.
Managing survey data in GAUSS.

While survey design and data collection are both important topics and can have significant impacts on analysis, they are beyond the scope of what we'll look at today.

Survey Data

Survey data presents unique characteristics and challenges that require careful consideration during the data analysis process.

Survey Data Characteristics
Categorical Nature	Survey data often involves categorical variables, where responses are grouped into distinct categories. Understanding the nature of these categories is crucial for choosing appropriate analysis methods.
Ordinal and Nominal Variables	It is important to recognize the distinction between ordinal variables (categories with a meaningful order) and nominal variables (categories without a specific order). This impacts the choice of statistical tests and visualization techniques.
Missing Data	Surveys may have missing or incomplete responses. Strategies for handling missing data, such as imputation or excluding incomplete cases, need to be considered.
Large Sample Sizes	Surveys often involve large sample sizes, leading to statistically significant but not necessarily practically significant results. It's crucial to consider whether the observed results are meaningful or impactful in the specific context of the study.
Multivariate Nature	Surveys explore relationships among multiple variables simultaneously. Multivariate analysis allows for a more comprehensive understanding of the complex relationships between different factors.
Choice Modeling	Surveys act as a primary data collection method for understanding individuals' preferences and choices. Choice modeling techniques expand the insights gained from survey responses, providing a quantitative framework for analyzing decision-making processes in various contexts.

Data Cleaning Considerations For Analyzing Survey Data

Data cleaning allows us to identify and address errors, inconsistencies, and missing values. It is crucial for survey data and helps to:

Ensure accuracy.
Improve reliability.
Produce meaningful and trustworthy insights.

Cleaning survey data includes some standard steps, such as:

Handling missing values,
Detecting outliers,

and some steps that are more specific to survey data, such as:

Performing consistency checks on survey responses,
Recoding categorical variables,
Handling open-ended responses.

Common Survey Data Cleaning Steps
Handling Missing Data	Identify missing data. Determine if missing values are systematic or random. Decide if missing values should be imputed or observations should be removed.
Outlier Detection and Treatment	Identify outliers that might skew the analysis. Decide whether outliers should be treated, transformed, or if they represent valid data points.
Standardize Variables	Standardize units and formats of variables to ensure consistency. Convert units, standardize date formats, and/or transform variables for better comparability.
Checking for Consistency	Perform consistency checks on the survey responses. Look for contradictory or illogical responses that may indicate errors in data entry.
Addressing Duplicate Entries	Identify and remove duplicate entries to avoid double-counting.
Recoding and Categorization	Recode variables or categorize responses to simplify analysis. Group similar categories, collapse response options, or create new variables based on recoded values.
Handling Open-Ended Responses	Categorize and code open-ended responses for analysis.
Dealing with Coding Errors	Check for coding errors in categorical variables. Ensure that each category is correctly labeled and that coding aligns with the intended meaning of the variable.
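
A few of these steps can be sketched with basic GAUSS functions. The example below uses a small hypothetical extract in which -999 is the survey's missing-value code. The data, the code -999, and the 2 standard deviation cutoff are all assumptions for illustration.

```gauss
// Hypothetical survey extract: column 1 is a respondent id,
// column 2 is commute time in minutes, -999 marks a missing response
x = { 1 25, 2 -999, 3 30, 4 20, 5 35, 6 -999, 7 28, 8 190 };

// Convert the survey's missing code to GAUSS missing values
x = miss(x, -999);

// Count rows with at least one missing value, then drop them
n_dropped = rows(x) - rows(packr(x));
x = packr(x);

// Flag potential outliers: responses more than
// 2 standard deviations from the mean
commute = x[., 2];
z = (commute - meanc(commute)) ./ stdc(commute);
flag = abs(z) .> 2;

print "Rows dropped for missing values:" n_dropped;
print "id ~ commute time ~ outlier flag:";
print (x ~ flag);
```

Flagged observations still need a judgment call. A long commute may be a valid response rather than an error.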

Exploring Survey Data

Exploratory data analysis is an important tool that can help us uncover insights from survey data without complicated computations. During this step, basic statistical tools like frequency tables, contingency tables, and summary statistics can shed light on important patterns and trends in the data.

One-Way Frequency Tables

Frequency tables provide a simple tabulation of the number of occurrences of each category in a single categorical variable. They display the counts (frequencies) of each category along with their corresponding percentages or proportions. Frequency tables are univariate, meaning they describe the distribution of one variable.

A simple frequency table can help us identify:

Inconsistencies, coding errors, typos, and other errors in categorical labels.
Outliers and missing values.
General distribution characteristics. For example, we may find that one level of a categorical variable makes up 90% of our observations.

	Count	Total %	Cum. %
Coffee	31	36.0	36.0
Tea	27	31.4	67.4
Soda	28	32.6	100
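
To show how the columns of a frequency table are built, here is a minimal sketch using basic matrix operations on a hypothetical vector of coded responses. The codes (1 for Coffee, 2 for Tea, 3 for Soda) and the responses are assumptions for illustration, not the data behind the table above.

```gauss
// Hypothetical coded responses: 1 = Coffee, 2 = Tea, 3 = Soda
bev = { 1, 2, 1, 3, 2, 1, 1, 3, 2, 1 };
codes = { 1, 2, 3 };

// Compare every response to every code (10 x 3 matrix of 0/1),
// then sum down the columns to get the count for each category
counts = sumc(bev .== codes');

// Percentage of the total and cumulative percentage
pct = 100 * counts ./ sumc(counts);
cum_pct = cumsumc(pct);

print "code ~ count ~ total % ~ cum. %:";
print (codes ~ counts ~ pct ~ cum_pct);
```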

Two-Way Tables

Two-way tables, also known as contingency tables, are similar to frequency tables but offer additional information about data interactions. They display the frequency combinations of two categorical variables. This provides a snapshot of how these variables interact, and helps us uncover patterns and associations within survey data.

Two-way tables present information in a structured grid:

The columns correspond to one variable.
The rows correspond to the other variable.
The intersection of a row and column represents the frequency of observations having a pair of outcomes.

	Breakfast	Lunch	Dinner
Coffee	20	8	3
Tea	12	10	5
Soda	8	10	10

As an example, consider the table above:

The columns represent the outcomes for a variable meal_time: Breakfast, Lunch, and Dinner.
The rows represent the outcomes for a variable beverage_choice: Coffee, Tea, and Soda.
The bottom row contains the counts for Soda orders across all possible meal times.
The last column contains counts for all beverage options at Dinner.
The bottom-right corner tells us that 10 Sodas were ordered at Dinner.

Two-way tables are an efficient way to reveal the intricate relationships between two categorical variables. By presenting information in a structured grid, these tables offer a straightforward way to discern patterns, making it easier to grasp how variables interact.
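
As a rough sketch of how the margins of a two-way table are computed, we can enter the example counts as a plain GAUSS matrix and sum across rows and columns. The matrix and variable names below are hypothetical and only serve this illustration.

```gauss
// Counts from the example table:
// rows are Coffee, Tea, Soda; columns are Breakfast, Lunch, Dinner
tbl = { 20  8  3,
        12 10  5,
         8 10 10 };

// Row totals: orders of each beverage across all meal times
bev_tot = sumr(tbl);

// Column totals: orders at each meal time across all beverages
meal_tot = sumc(tbl);

// Grand total of all orders
all_tot = sumc(bev_tot);

print bev_tot;
print meal_tot;
print all_tot;
```

The row totals recover the one-way counts for each beverage (31, 27, and 28), which is a quick way to check that a two-way table agrees with the corresponding frequency table.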

Data Visualizations

Data plots are a great way to understand data trends, observe outliers, and identify other data issues. When choosing a data plot, it is important to consider which plot is best suited to the type of the variable.

Bar Charts	Ideal for comparing the frequency or distribution of categorical variables.
Stacked Bar Charts	Useful for comparing the composition of different groups, where each bar is divided into segments representing subcategories.
Pie Charts	Shows the proportion of each category in relation to the whole.
Histograms	Depicts the distribution of a continuous variable by dividing it into intervals (bins) and showing the frequency of observations in each interval.
Line Charts	Demonstrates trends or patterns over a continuous variable or time.
Scatter Plots	Visualizes the relationship between two continuous variables.
Box Plots (Box-and-Whisker Plots)	Displays the distribution of a variable, including median, quartiles, and outliers.

Hands-On With Survey Data: NextGen National Household Travel Survey

Let's look more closely at survey data using GAUSS and real-world transportation data.

Today's Data

Today we'll be working with the 2022 National Household Travel Survey (NHTS). This survey is designed to collect comprehensive information about travel patterns and travel behavior in the United States.

The NHTS survey:

Gathers data on various aspects of travel, including daily commuting, recreational trips, shopping, and other activities.
Is typically conducted at regular intervals to capture changes in travel behavior over time, though today we will only consider the 2022 survey results.
Utilizes a combination of interviews and diaries to collect data from a representative sample of households across the country.
Is valuable for transportation planners, policymakers, and researchers in making informed decisions regarding infrastructure development, traffic management, and other transportation-related initiatives.

The raw data from the NHTS is split into four separate CSV files containing:

Vehicle data.
Trip data.
Household data.
Person data.

Today we will work with the trip data.

Data Citation:
Federal Highway Administration. (2022). 2022 National Household Travel Survey, U.S. Department of Transportation, Washington, DC. Available online: https://nhts.ornl.gov.

Loading The Data

Let's get started by loading the data into GAUSS using the loadd procedure. We will also compute descriptive statistics for our data:

// Load trip data
trip_data = loadd("trip_data.gdat");

// Preliminary summary stats
dstatmt(trip_data);

-------------------------------------------------------------------------------------------
Variable           Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------

HOUSEID           9e+09    5.83e+04     3.399e+09       9e+09       9e+09     31074    0
PERSONID          1.681      0.9994        0.9989           1           9     31074    0
TRIPID            2.438       1.792         3.209           1          36     31074    0
SEQ_TRIPID        2.436        1.79         3.203           1          36     31074    0
VEHCASEID     7.619e+11   3.244e+11     1.052e+23          -1       9e+11     31074    0
FRSTHM            -----       -----         -----         Yes          No     31074    0
PARK              -----       -----         -----  Valid skip          No     31074    0
TRAVDAY           -----       -----         -----      Sunday    Saturday     31074    0
DWELTIME          95.18       164.3       2.7e+04          -9        1050     31074    0
PUBTRANS          -----       -----         -----  Used publi  Did not us     31074    0
TRIPPURP          -----       -----         -----  Not ascert  Not a home     31074    0
WHYTRP1S          -----       -----         -----        Home  Something      31074    0
TRVLCMIN          24.55       46.48          2161          -9        1425     31074    0
TRPTRANS          -----       -----         -----         Car  School bus     31074    0
NUMONTRP          1.997       3.478          12.1           1          99     31074    0
NONHHCNT         0.4141       3.388         11.48           0          98     31074    0
HHACCCNT          1.583      0.8916         0.795           1           8     31074    0
WHYTO             -----       -----         -----  Regular ac  Something      31074    0
WALK              -----       -----         -----  Valid skip  N/A - Didn     31074    0
TRPMILES          13.97       85.42          7296          -9        4859     31074    0
VMT_MILE          7.527       32.18          1035          -9        1683     31074    0
GASPRICE            398       68.46          4686       272.7       597.9     31074    0
NUMADLT           2.059      0.7616          0.58           1           8     31074    0
HOMEOWN           -----       -----         -----  Owned by h  Occupied w     31074    0
RAIL              -----       -----         -----         Yes          No     31074    0
CENSUS_D          -----       -----         -----  New Englan     Pacific     31074    0
CENSUS_R          -----       -----         -----   Northeast        West     31074    0
CDIVMSAR          -----       -----         -----  New Englan  Pacific No     31074    0
HHFAMINC          -----       -----         -----  I prefer n  $125,000 t     31074    0
HH_RACE           -----       -----         -----       White  Other race     31074    0
HHSIZE            2.822       1.447         2.093           1          10     31074    0
HHVEHCNT          2.134       1.078         1.163           0          11     31074    0
MSACAT            -----       -----         -----  MSA of 1 m  Not in MSA     31074    0
MSASIZE           -----       -----         -----  In an MSA   Not in MSA     31074    0
URBAN             -----       -----         -----  In an urba  Not in urb     31074    0
URBANSIZE         -----       -----         -----  50,000-199  Not in urb     31074    0
URBRUR            -----       -----         -----       Urban       Rural     31074    0
TDAYDATE          -----       -----         -----  2022-01-01  2023-01-01     31074    0
WRKCOUNT          1.304      0.9474        0.8976           0           6     31074    0
R_AGE              46.8       20.77         431.2           5          92     31074    0
R_SEX             -----       -----         -----      Refuse      Female     31074    0
R_RACE            -----       -----         -----       White  Other race     31074    0
EDUC              -----       -----         -----  Valid skip  Profession     31074    0
VEHTYPE           -----       -----         -----  Valid skip  Motorcycle     31074    0

There are many ways to preview dataframes in GAUSS, but with a wide dataset that contains many variables, I find the dstatmt report the easiest to read.

The descriptive statistics themselves provide some useful information:

Many of the continuous variables, such as TRPMILES and TRVLCMIN, have minimum values below zero. Negative distances and travel times don't make sense, so it is likely that -9 is a code for something else, such as a non-response.
Every variable has 31074 valid observations and no missing values.
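
If the -9 values do turn out to be a non-response code, one option is to convert them to missing values before computing statistics. The sketch below uses a small made-up vector rather than the survey data, and it assumes -9 is the only special code present.

```gauss
// Hypothetical trip distances with -9 marking non-responses
x = { 12.5, -9, 3.2, -9, 40 };

// Count how many entries carry the -9 code
n_coded = sumc(x .== -9);

// Convert every -9 to a GAUSS missing value
x_clean = miss(x, -9);

// Mean of the remaining valid entries (packr drops rows with missings)
x_mean = meanc(packr(x_clean));

print n_coded;
print x_mean;
```

Before applying anything like this to the survey variables, it is worth checking the survey codebook to confirm what each negative code means.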

The descriptive statistics report also provides insights beyond the traditional descriptive statistics:

The data contains a mixture of categorical and numerical data.
Observations in our dataset are defined by a set of identification variables: HOUSEID, PERSONID, TRIPID, SEQ_TRIPID, VEHCASEID.

The data in trip_data.gdat has had preliminary cleaning from the raw data.

Checking For Duplicates

As a first step, we'll confirm that our data contains unique observations using the isunique procedure.

isunique(trip_data);

1.0000000

The return value of 1 indicates that every observation in our dataset is unique, with no duplicates.

Examining Category Labels

Now that we confirmed that our dataset is unique, one of the first data cleaning steps with categorical data is to examine the category labels to check for errors and to get an understanding of the distribution.

Let’s look at the labels of the TRIPPURP variable using a sorted frequency table.

// Print frequency table for 'TRIPPURP'
frequency(trip_data, "TRIPPURP", 1);

                                 Label      Count   Total %    Cum. %
                Home-based other (HBO)       7714     24.82     24.82
           Not a home-based trip (NHB)       7035     22.64     47.46
           Home-based shopping (HBSHP)       6884     22.15     69.62
                 Home-based work (HBW)       4871     15.68     85.29
Home-based social/recreational (HBSOC)       4546     14.63     99.92
                       Not ascertained         24   0.07723       100
                                 Total      31074       100

Using this we can see that three categories make up almost 70% of the trips: "Home-based other", "Not a home-based trip", and "Home-based shopping".

The frequency table is also useful for learning more about our labels. In this table, the labels appear to be clean and we don’t see anything that suggests typos or errors.

Each label does, however, combine a full description with an abbreviation in parentheses. To make the labels easier to work with, let's store these in separate variables using some simple string manipulation in GAUSS.

First, let’s split the labels at "(" and store the resulting string arrays:

// Use '(' to split existing labels into 2 columns
tmp = strsplit(trip_data[. , "TRIPPURP"], "(" );

// Trim whitespace from the front and back of both variables
tmp = strtrim(tmp);

// Rename columns 
tmp = setColNames(tmp , "TRIP_DESC"$|"TRIP_ABBR");

// Preview data
head(tmp);

              TRIP_DESC        TRIP_ABBR
       Home-based socia           HBSOC)
       Home-based socia           HBSOC)
       Home-based shopp           HBSHP)
       Not a home-based             NHB)
       Home-based shopp           HBSHP)

The TRIP_DESC variable looks good. It stores the full description of the TRIPPURP. However, the abbreviations in TRIP_ABBR don’t quite look right. We still need to strip the ")".

/*
** Remove the right parenthesis
*/
// Replace ')' with an empty string
tmp[. , "TRIP_ABBR"]  = strreplace(tmp[. , "TRIP_ABBR"], ")", "");

// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");

                         Label      Count     Total %      Cum. %
              Home-based other       7714       24.82       24.82
           Home-based shopping       6884       22.15       46.98
Home-based social/recreational       4546       14.63       61.61
               Home-based work       4871       15.68       77.28
         Not a home-based trip       7035       22.64       99.92
               Not ascertained         24     0.07723         100
                         Total      31074         100

                         Label      Count     Total %      Cum. %
                                       24     0.07723     0.07723
                           HBO       7714       24.82        24.9
                         HBSHP       6884       22.15       47.06
                         HBSOC       4546       14.63       61.69
                           HBW       4871       15.68       77.36
                           NHB       7035       22.64         100
                         Total      31074         100

One final change we may want to make is to replace the missing abbreviation label for the "Not ascertained" category using the recodeCatLabels procedure.

/*
** Recode missing label
*/
// Add missing label for 'NA'
tmp[., 2] = recodecatlabels(tmp[., 2], "", "NA", "TRIP_ABBR");

// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");

                         Label      Count     Total %      Cum. %
              Home-based other       7714       24.82       24.82
           Home-based shopping       6884       22.15       46.98
Home-based social/recreational       4546       14.63       61.61
               Home-based work       4871       15.68       77.28
         Not a home-based trip       7035       22.64       99.92
               Not ascertained         24     0.07723         100
                         Total      31074         100

                         Label      Count     Total %      Cum. %
                            NA         24     0.07723     0.07723
                           HBO       7714       24.82        24.9
                         HBSHP       6884       22.15       47.06
                         HBSOC       4546       14.63       61.69
                           HBW       4871       15.68       77.36
                           NHB       7035       22.64         100
                         Total      31074         100

We've successfully created two new variables, TRIP_DESC and TRIP_ABBR, which we can concatenate to our trip_data dataframe:

// Add the new variables to the end of 'trip_data'
trip_data = trip_data ~ tmp;

Two-Way Tables

Frequency tables provide insights into a single categorical variable. However, if we are interested in the relationship between multiple categorical variables, we need to use two-way, or contingency, tables.

Let's use a contingency table to look at the relationship between URBRUR and VEHTYPE. To do this we can use the tabulate procedure, introduced in GAUSS 24.

The tabulate function requires either a dataframe or filename input, along with a formula string to specify which variables to include in the table. It also takes an optional tabControl structure input for advanced options.

data: A GAUSS dataframe or filename.
formula: String, formula string. E.g., "df1 ~ df2 + df3": the "df1" categories will be reported in rows, and separate columns will be returned for each category in "df2" and "df3".
tbctl: Optional, an instance of the tabControl structure used for advanced table options.

// Compute a two-way table with
// VEHTYPE categories in rows
// URBRUR categories in columns
// Results stored in tab_df
tab_df = tabulate(trip_data, "VEHTYPE ~ URBRUR");

===============================================================
           VEHTYPE                   URBRUR               Total
===============================================================
                            Urban          Rural

        Valid skip           4061            719           4780
  Car/Stationwagon           9306           1774          11080
               Van           1438            358           1796
               SUV           8275           1935          10210
      Pickup Truck           2043           1043           3086
       Other Truck             36             24             60
      RV/Motorhome              4              4              8
  Motorcycle/Moped             39             15             54

             Total          25202           5872          31074
===============================================================

The initial counts provide some insights:

The total counts of vehicles are higher in urban areas.
In urban areas the most frequently occurring type of vehicle is the Car/Stationwagon.
In rural areas the most frequently occurring type of vehicle is the SUV.

It might be useful to see relative percentages of the vehicle types. Because we stored the counts in tab_df, this can easily be done.

First, let's look at what percentage each category makes up of the total vehicles in the urban and rural areas, respectively.

// Compute percentages within urban and rural areas
// by dividing by column totals
tab_df[., 1]~(tab_df[., 2:3]./sumc(tab_df[., 2:3])');

         VEHTYPE      URBRUR_Urban  URBRUR_Rural
      Valid skip            0.1611        0.1224
Car/Stationwagon            0.3692        0.3021
             Van            0.0571        0.0610
             SUV            0.3283        0.3295
    Pickup Truck            0.0811        0.1776
     Other Truck            0.0014        0.0041
    RV/Motorhome            0.0002        0.0007
Motorcycle/Moped            0.0015        0.0026

These percentages help us see that:

The shares of the Car/Stationwagon, Van, and SUV categories are fairly similar in urban and rural areas.
There is a higher percentage of the Pickup Truck, Other Truck, and Motorcycle/Moped categories in rural areas.

Alternatively we can look at the distribution of each vehicle type across rural and urban areas.

// Compute percentages across urban and rural areas
// by dividing by row totals
tab_df[., 1]~(tab_df[., 2:3]./sumr(tab_df[., 2:3]));

         VEHTYPE     URBRUR_Urban URBRUR_Rural
      Valid skip           0.8496       0.1504
Car/Stationwagon           0.8399       0.1601
             Van           0.8007       0.1993
             SUV           0.8105       0.1895
    Pickup Truck           0.6620       0.3380
     Other Truck           0.6000       0.4000
    RV/Motorhome           0.5000       0.5000
Motorcycle/Moped           0.7222       0.2778

This table tells a similar story from a different perspective:

Urban vehicles make up 80-84% of the Car/Stationwagon, Van, and SUV categories.
Urban vehicles only make up 66% and 60% of the Pickup Truck and Other Truck categories, respectively.
Urban vehicles make up 72% of the Motorcycle/Moped category.

Excluding Categories

Suppose we don't want to include the Valid skip responses in our contingency table. We can remove these using the exclude member of the tabControl structure.

To specify categories to be excluded from the contingency table, we use a string to specify the variable name and category separated by a ":".

// Declare structure
struct tabControl tbCtl;

// Fill defaults
tbCtl = tabControlCreate();

// Specify to exclude the 'Valid skip' category
// from the 'VEHTYPE' variable
tbCtl.exclude = "VEHTYPE:Valid skip";

// Find contingency table including tbCtl input
tab_df2 =  tabulate(trip_data, "VEHTYPE ~ URBRUR", tbCtl);

=============================================================================
                         VEHTYPE                   URBRUR               Total
=============================================================================
                                          Urban          Rural

                Car/Stationwagon           9306           1774          11080
                             Van           1438            358           1796
                             SUV           8275           1935          10210
                    Pickup Truck           2043           1043           3086
                     Other Truck             36             24             60
                    RV/Motorhome              4              4              8
                Motorcycle/Moped             39             15             54

                           Total          21141           5153          26294
=============================================================================

Now our table excludes the Valid skip category.

Data Visualizations

Data visualizations are one of the most useful tools for data exploration. There are several ways to utilize the plotting capabilities of GAUSS to explore survey data.

Frequency plots

First, let's use a frequency plot to explore the distribution of responses across census regions. To do this, we will utilize the plotFreq procedure.

// Census region frequencies
plotFreq(trip_data, "CENSUS_R", 1);

The sorted frequency plot allows us to quickly identify that the most frequently occurring region in our data is "South".

Note that support for sorting frequency plots was added in GAUSS 24.

Plotting Contingency Tables

Like frequency tables, frequency plots are useful for visualizing the categories of one variable. However, they don't provide much insight into the relationship across categorical variables.

To visualize the relationship between VEHTYPE and URBRUR, let's create a bar plot using our stored contingency table dataframe, tab_df2.

The plotBar function requires two inputs: labels for the x-axis and the corresponding bar heights.

The labels for our bar plot are the vehicle types, which are stored as a dataframe in the first column of tab_df2. To use them as inputs we will need to:

Get the category labels.
Convert them to a string array.

// Get category labels
labels = getCategories(tab_df2, "VEHTYPE");

// Convert to string array
labels_sa = ntos(labels);

The corresponding heights will come from the tab_df2 variable. Let's find out the variable names in tab_df2:

// Print the variable names from 'tab_df2'
getcolnames(tab_df2);

     VEHTYPE
URBRUR_Urban
URBRUR_Rural

The final two variable names were created by the tabulate function to tell us which original variable the column came from, URBRUR, and which category is being referenced. Let's change the variable names to just Urban and Rural to make them more concise.

// Rename the count columns in positions 2 and 3
new_names = "Urban" $| "Rural";
col_idx = { 2, 3 };
tab_df2 = setcolnames(tab_df2, new_names, col_idx);

Now we're ready to use the Urban and Rural count variables to plot our data.

plotBar(labels_sa, tab_df2[., "Urban" "Rural"]);

By default, this plots our bars side-by-side. We can change this using a plotControl structure and plotSetBar.

// Declare structure
struct plotControl plt;

// Fill defaults
plt = plotGetDefaults("bar");

// Set bars to be solid and stacked
plotSetBar(&plt, 1, 1);

// Plot contingency table
plotBar(plt, labels_sa, tab_df2[., "Urban" "Rural"]);

Scatter Plots

Now suppose we wish to examine the relationship between a categorical variable and continuous variables. We can do this using the 'by' keyword and the plotScatter function.

// Plot TRPMILES vs GASPRICE
// Coloring the points by the categories in CENSUS_R
plotScatter(trip_data, "TRPMILES ~ GASPRICE + by(CENSUS_R)");

Adding the census regions provides some interesting observations:

The West region has higher gas prices than other regions.
The South region seems to have lower gas prices than other regions.

Conclusion

In this blog, we've covered some fundamental concepts related to survey data and looked at some GAUSS tools for cleaning, exploring, and visualizing survey data.

Transforming Panel Data to Long Form in GAUSS

Eric — Tue, 12 Dec 2023 21:24:59 +0000

Introduction

Anyone who works with panel data knows that pivoting between long and wide form, though commonly necessary, can still be painstakingly tedious, at best. It can lead to frustrating errors, unexpected results, and lengthy troubleshooting, at worst.

The new dfLonger and dfWider procedures introduced in GAUSS 24 make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.

In today's blog, we will walk through all you need to know about the dfLonger procedure to tackle even the most complex cases of transforming wide form panel data to long form.

The Rules of Tidy Data

Before we get started, it will be useful to consider what makes data tidy (and why tidy data is important).

It's useful to think of breaking our data into components (these subsets will come in handy later when working with dfLonger):

Values.
Observations.
Variables.

We can use these components to define some basic rules for tidy data:

Variables have unique columns.
Observations have unique rows.
Values have unique cells.

Example One: Wide Form State Population Table

State	2020	2021	2022
Alabama	5,031,362	5,049,846	5,074,296
Alaska	732,923	734,182	733,583
Arizona	7,179,943	7,264,877	7,359,197
Arkansas	3,014,195	3,028,122	3,045,637
California	39,501,653	39,142,991	39,029,342

Though not clearly labeled, we can deduce that this data presents values for three different variables: State, Year, and Population.

Looking more closely we see:

State is stored in a unique column.
The values of Year are stored as column names.
The values of Population are stored in separate columns for each year.

Our variables do not each have a unique column, violating the rules of tidy data.

Example Two: Long Form State Population Table

State	Year	Population
Alabama	2020	5,031,362
Alabama	2021	5,049,846
Alabama	2022	5,074,296
Alaska	2020	732,923
Alaska	2021	734,182
Alaska	2022	733,583
Arizona	2020	7,179,943
Arizona	2021	7,264,877
Arizona	2022	7,359,197

The transformed data above now has three columns, one for each variable State, Year, and Population. We can also confirm that each observation has a single row and each value has a single cell.

Transforming the data to long form has resulted in a tidy data table.
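
To make the reshaping concrete, here is a minimal sketch that performs the same wide to long transformation by hand on a plain numeric matrix. It is only an illustration of the idea: the variable names are ours, states are represented by an index (1 = Alabama, 2 = Alaska, 3 = Arizona) rather than by name, and it is not how dfLonger is implemented.

```gauss
// Wide form: one row per state, one column per year
pop_wide_mat = { 5031362 5049846 5074296,
                  732923  734182  733583,
                 7179943 7264877 7359197 };
years = { 2020, 2021, 2022 };

n_states = rows(pop_wide_mat);
n_years = rows(years);

// State index repeated once per year: 1,1,1,2,2,2,3,3,3
state_id = seqa(1, 1, n_states) .*. ones(n_years, 1);

// Years cycled once per state: 2020,2021,2022,2020,...
year_col = ones(n_states, 1) .*. years;

// Stack the rows of the wide matrix into a single column
pop_col = vecr(pop_wide_mat);

// Long form: one row per state-year pair
pop_long_mat = state_id~year_col~pop_col;
print pop_long_mat;
```

Each row of the result now holds exactly one observation, and each of the three variables sits in its own column.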

Why Do We Care About Tidy Data?

Working with tidy data offers a number of advantages:

Tidy data storage offers consistency when trying to compare, explore, and analyze data whether it be panel data, time series data or cross-sectional data.
Using columns for variables is aligned with vectorization and matrix notation, both of which are fundamental to efficient computations.
Many software tools expect tidy data and will only work reliably with tidy data.

Transforming From Wide to Long Panel Data

In this section, we will look at how to use the GAUSS procedure dfLonger to transform panel data from wide to long form. This section will cover:

The fundamentals of the dfLonger procedure.
A standard process for setting up panel data transformations.

The `dfLonger` Procedure

The dfLonger procedure transforms wide form GAUSS dataframes to long form GAUSS dataframes. It has four required inputs and one optional input:

df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl]);

df_wide: A GAUSS dataframe in wide panel format.
columns: String array, the columns that should be used in the conversion.
names_to: String array, specifies the variable name(s) for the new column(s) created to store the wide variable names.
values_to: String, the name of the new column containing the values.
pctl: Optional, an instance of the pivotControl structure used for advanced pivoting options.

Setting Up Panel Data Transformations

Having a systematic process for transforming wide panel data to long panel data will:

Save time.
Eliminate frustration.
Prevent errors.

Let's use our wide form state population data to work through the steps.

Step 1: Identify variables.

In our wide form population table, there are three variables: State, Year, and Population.

Variables are not always clearly labeled in wide form data. You will often need background information to identify variables. Make sure to pay attention to references, titles, or other sources to ensure that you clearly understand the variables.

Step 2: Identify columns to convert.

The easiest way to determine what columns need to be converted is to identify the "problem" columns in your wide form data.

For example, in our original state population table, the columns named 2020, 2021, 2022, represent our Year variable. They store the values for the Population variable.

These are the columns we will need to address in order to make our data tidy.

columns = "2020"$|"2021"$|"2022";

We only have three columns to transform and it is easy to just type out our column names in a string array. This won't always be the case, though. Fortunately, GAUSS has a lot of great convenience functions to help with creating your column lists.

My favorites include:

Function	Description	Example
getColNames	Returns the column variable names.	`varnames = getColNames(df_wide)`
startsWith	Returns a 1 if a string starts with a specified pattern.	`mask = startsWith(colNames, pattern)`
trimr	Trims rows from the top and/or bottom of a matrix.	`names = trimr(full_list, top, bottom)`
rowcontains	Returns a 1 if the row contains the data specified by the `needle` variable, otherwise it returns a 0.	`mask = rowcontains(haystack, needle)`
selif	Selects rows from a matrix, dataframe or string array, based upon a vector of 1’s and 0’s.	`names = selif(full_list, mask)`

For more complex cases, it is useful to approach creating column lists as a two-step process:

Get all column names using getColNames.
Select a subset of column names using a selection convenience function.

As an example, suppose our state population dataset contains a State column as the first column and the remaining columns contain the populations for 1950-2022. It would be tedious to write out the column list for all years.

Instead we could:

Get a list of all the column names using getColNames.
Trim the first name off the list.

// Get all column names
colNames = getColNames(pop_wide);

// Trim first name `State` 
// from top of the name list
colNames = trimr(colNames, 1, 0);
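
Trimming works when the columns we want sit together at one end of the list. When they don't, a mask-based selection is an alternative. The sketch below combines startsWith and selif from the table above. It assumes the state_pop.gdat file used later in this post, and that every column to convert has a name beginning with "20", which would need to be adjusted for other datasets.

```gauss
// Load the wide form data
pop_wide = loadd("state_pop.gdat");

// Get all column names
colNames = getColNames(pop_wide);

// Mask equal to 1 for names that start with "20"
mask = startsWith(colNames, "20");

// Keep only the names flagged by the mask
columns = selif(colNames, mask);

print columns;
```

The advantage of this approach is that the selection does not depend on where the year columns happen to sit in the dataframe.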

Step 3: Name the new columns for storing names.

The names of the columns being transformed from our wide form data will be stored in a variable specified by the input names_to.

In this case, we want to store the names from the wide data in one new variable called "Year". In later examples, we will look at how to split names into multiple variables using prefixes, separators, or patterns.

names_to = "Year";

Step 4: Name the new columns for storing values.

The values stored in the columns being transformed will be stored in a variable specified by the input values_to.

For our population table, we will store the values in a variable named "Population".

values_to = "Population";

Basic Pivoting

Now it's time to put all these steps together into a working example. Let's continue with our state population example.

We'll start by loading the complete state population dataset from the state_pop.gdat file:

// Load data 
pop_wide = loadd("state_pop.gdat");

// Preview data
head(pop_wide);

           State             2020             2021             2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342.

Now, let's set up our information for transforming our data:

// Identify columns
columns = "2020"$|"2021"$|"2022";

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we'll transform our data using dfLonger:

// Convert data using dfLonger
pop_long = dfLonger(pop_wide, columns, names_to, values_to);

// Preview data
head(pop_long);

           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00
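
A quick sanity check after pivoting is to compare row counts. Assuming no rows are dropped during the conversion, the long form should have one row for every combination of a wide form row and a converted column. The sketch below repeats the steps from this section so it can be run on its own. The name expected_rows is ours.

```gauss
// Repeat the basic pivot from above
pop_wide = loadd("state_pop.gdat");
columns = "2020"$|"2021"$|"2022";
pop_long = dfLonger(pop_wide, columns, "Year", "Population");

// One long row per wide row and converted column
expected_rows = rows(pop_wide) * rows(columns);

print "Rows expected:" expected_rows;
print "Rows in long form:" rows(pop_long);
```

If the two numbers differ, it is worth checking the columns list and any options that remove rows, such as dropping missing values.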

Advanced Pivoting

One of the most appealing things about dfLonger is that while simple to use, it offers tools for tackling the most complex cases. In this section, we'll cover everything you need to know for moving beyond basic pivoting.

The `pivotControl` Structure

The pivotControl structure allows you to control pivoting specifications using the following members:

Member	Purpose
names_prefix	A string input which specifies which characters, if any, should be stripped from the front of the wide variable names before they are assigned to a long column.
names_sep_split	A string input which specifies which characters, if any, mark where the names_to names should be broken up.
names_pattern_split	A string input containing a regular expression specifying group(s) in names_to names which should be broken up.
names_types	A string input specifying data types for the names_to variable.
values_drop_missing	Scalar, if set to 1 all rows with missing values will be removed.

We will demonstrate how to use the pivotControl structure in more detail in later examples. However, if you are unfamiliar with structures, you may find it useful to review our tutorial, "A Gentle Introduction to Using Structures."
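
As a first illustration, the sketch below sets the values_drop_missing member described in the table. It reuses the pop_wide, columns, names_to, and values_to variables from the basic example. Whether any rows are actually removed depends on the data, so no output is shown:

```gauss
// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Remove rows with missing values from the long form data
pctl.values_drop_missing = 1;

// Pass the control structure as the final input
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);
```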

Changing Variable Types

By default the variables created from the pieces of the variable names will be categorical variables.

If we examine the type of the Year variable in pop_long from our previous example,

// Check the type of the 'Year' variable
getColTypes(pop_long[., "Year"]);

we can see that the Year variable is a categorical variable:

            type
        category

This isn't ideal and we'd prefer our Year variable to be a date. We can control the assigned type using the names_types member of the pivotControl structure. The names_types member can be specified in one of two ways:

As a column vector of types for each of the names_to variables.
An n x 2 string array where the first column is the name of the variable(s) and the second column contains the type(s) to be assigned.

For our example, we wish to specify that the Year variable should be a date but we don't need to change any of the other assigned types, so we will use the second option:

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify that 'Year' should be
// converted to a date variable
pctl.names_types = {"Year" "date"};

Next, we complete the steps for pivoting:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we call dfLonger including the pivotControl structure, pctl, as the final input:

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);

           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Now if we check the type of our Year variable:

// Check the type of 'Year'
getColTypes(pop_long[., "Year"]);

It is a date variable:

  type
  date
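
Because this example has only one names_to variable, the first option described above (a column vector of types) should work as well. A minimal sketch, assuming the same pop_wide, columns, names_to, and values_to inputs as before:

```gauss
// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// One names_to variable ('Year'), so a single type is given
pctl.names_types = "date";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);

// Check the type of 'Year'
getColTypes(pop_long[., "Year"]);
```

The two-column form is more convenient when there are several names_to variables and only some of them need a non-default type.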

Stripping Prefixes

In our previous example, the wide data names only contained the year. However, the column names of a wide dataset often have common prefixes. The names_prefix member of the pivotControl structure offers a convenient way to strip unwanted prefixes.

Suppose that our wide form state population columns were labeled "yr_2020", "yr_2021", "yr_2022":

// Load data
pop_wide2 = loadd("state_pop2.gdat");

// Preview data
head(pop_wide2);

           State          yr_2020          yr_2021          yr_2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342.

We need to strip these prefixes when transforming our data to long form.

To accomplish this we first need to specify that our name columns have the common prefix "yr_":

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify prefix
pctl.names_prefix = "yr_";

Next, we complete the steps for pivoting:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide2);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we call dfLonger:

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide2, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);

           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Splitting Names

In our basic example the only information contained in the names columns was the year. We created one variable to store that information, "Year". However, we may have cases where our wide form data contains more than one piece of information.

In these cases there are two important steps to take:

Name the variables that will store the information contained in the wide data column names using the names_to input.
Indicate to GAUSS how to split the wide data column names into the names_to variables.

Names Include a Separator

One way that names in wide data can contain multiple pieces of information is through the use of separators.

For example, suppose our data looks like this:

           State       urban_2020       urban_2021       urban_2022       rural_2020       rural_2021       rural_2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858.

Now our names specify:

Whether the population is the urban or rural population.
The year of the observation.

In this case, we:

Use the names_sep_split member of the pivotControl structure to indicate how to split the names.
Specify a names_to variable for each group created by the separator.

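// Load data
pop_wide3 = loadd("state_pop3.gdat");

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide3);
columns = trimr(columns, 1, 0);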

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names
pctl.names_sep_split = "_";

// Specify two variables for holding
// names information:
//    'Location' for the information before the separator
//    'Year' for the information after the separator
names_to = "Location"$|"Year";

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);

           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000

Now, the pop_long dataframe contains:

The information in the wide form names found before the separator, "_", (urban or rural) in the Location variable.
The information in the wide form names found after the separator, "_", in the Year variable.
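
The pivotControl members can also be set together. The sketch below assumes that names_sep_split and names_types may be combined in one call, so that Year is split from the name and converted to a date in a single step. The 1 x 2 string array is built with the horizontal string concatenation operator, `$~`:

```gauss
// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Split the names at the underscore
pctl.names_sep_split = "_";

// Convert 'Year' to a date, leave 'Location' at its default type
pctl.names_types = "Year" $~ "date";

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide3);
columns = trimr(columns, 1, 0);

// Variables for storing names and values
names_to = "Location" $| "Year";
values_to = "Population";

// Call dfLonger and check the resulting column types
pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);
getColTypes(pop_long);
```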

Variable Names With Regular Expressions

In our example above, the variables contained in the names were clearly separated by a "_". However, this isn't always the case. Sometimes names use a pattern rather than a separator:

// Load data
pop_wide4 = loadd("state_pop4.gdat");

// Preview data
head(pop_wide4);

           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858.

In cases like this, we can use the names_pattern_split member to tell GAUSS we want to pass in a regular expression that will split the columns. We can't cover the full details of regular expressions here. However, there are a few fundamentals that will help us get started with this example.

In regular expressions:

Each statement inside a pair of parentheses is a group.
To match any upper or lower case letter we use "[a-zA-Z]". More specifically, this tells GAUSS that we want to match any lowercase letter ranging from a-z and any upper case letter ranging from A-Z. If we wanted to limit this to any lowercase letters from t to z and any uppercase letter B to M we would say "[t-zB-M]".
To match any single digit we use "[0-9]".
To represent that we want to match one or more instances of a pattern we use "+".
To represent that we want to match zero or more instances of a pattern we use "*".

In this case, we want to separate our names so that "urban" and "rural" are collected in Location and 2020, 2021, and 2022 are collected in the Year variable:

We have two groups.
We can capture both urban and rural using "[a-zA-Z]+".
We can capture the years by matching one or more digits using "[0-9]+".

Let's use a regular expression to specify our names_pattern_split member:

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names 
// using the pivotControl structure
pctl.names_pattern_split = "([a-zA-Z]+)([0-9]+)";

Next, we can put this together with our other steps to transform our wide data:

// Variable for storing names
names_to = "Location"$|"Year";

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);
head(pop_long);

           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000

Multiple Value Variables

In all our previous examples we had values that needed to be stored in one variable. However, it's more realistic that our dataset contains multiple groups of values and we will need to specify multiple variables to store these values.

Let's consider our previous example which used the pop_wide4 dataset:

           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858.

Suppose that rather than creating a location variable, we wish to separate the population information into two variables, urban and rural. To do this we will:

Split the variable names by words ("urban" or "rural") and integers.
Create a Year column from the integer portions of the names.
Create two values columns, urban and rural, from the word portions.

First, we will specify our columns:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

Since we are using the same data as our previous example, we don't need to load any additional data.

Next, we need to specify our names_to and values_to inputs. However, this time we want our values_to variables to be determined by the information in our names.

We do this using ".value".

// Tell GAUSS to use the first group of the split names 
// to set the values variables and 
// store the remaining group in 'Year'
names_to = ".value" $| "Year";

// Tell GAUSS to get 'values_to' variables from 'names_to'
values_to = "";

Setting ".value" as the first element in our names_to input tells dfLonger to take the first piece of the wide data names and create a column with all the values from all matching columns.

In other words, combine all the values from the variables urban2020, urban2021, urban2022 into a single variable named urban and do the same for the rural columns.

Finally, we need to tell GAUSS how to split the variable names.

// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

// Set the regex to split the variable names
pctl.names_pattern_split = "(urban|rural)([0-9]+)";

This time, we specify the variable names, "(urban|rural)", rather than using the general specifier "([a-zA-Z]+)".

Now we call dfLonger:

// Convert the dataframe to long format according to our specifications
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);

// Print the first 5 rows of the long form dataframe
head(pop_long);

           State             Year            urban            rural
         Alabama             2020        6558153.0        1526791.0
         Alabama             2021        4972982.0        76863.000
         Alabama             2022        12375977.        7301681.0
          Alaska             2020        21944.000        710978.00
          Alaska             2021        467051.00        267130.00

Now the urban and rural populations are stored in their own columns, named urban and rural.

These names can easily be changed using the Data Manager or setColNames.
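
For example, a minimal sketch of renaming with setColNames, assuming its three-input form (dataframe, new names, columns to rename). The new names, urban_pop and rural_pop, are illustrative only:

```gauss
// Rename 'urban' and 'rural' (new names are hypothetical)
pop_long = setColNames(pop_long, "urban_pop" $| "rural_pop", "urban" $| "rural");

// Confirm the new column names
print getColNames(pop_long);
```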

Conclusion

As we've seen today, pivoting panel data from wide to long can be complicated. However, using a systematic approach and the GAUSS dfLonger procedure helps to reduce the frustration, time, and errors.

Introducing GAUSS 24

Eric — Tue, 05 Dec 2023 17:15:52 +0000

Introduction

We're happy to announce the release of GAUSS 24, with new features for everything from everyday data management to refined statistical modeling.

GAUSS 24 features a robust suite of tools designed to elevate your research. With these advancements, GAUSS 24 continues our commitment to helping you conduct insightful analysis and achieve your goals.

New Panel Data Management Tools

GAUSS 24 makes working with panel data easier than ever. Effortlessly load, clean, and explore panel data without ever leaving GAUSS, making it the smoothest experience yet!

Easily and intuitively pivot between long and wide form data with new dfLonger and dfWider functions.
Explore group-level descriptive statistics and estimate group-level linear models with expanded by keyword functionality.

// Load data 
auto2 = loadd("auto2.dta");

// Print statistics table
call dstatmt(auto2, "mpg + by(foreign)");

=======================================================================
foreign: Domestic
-----------------------------------------------------------------------
Variable        Mean     Std Dev      Variance     Minimum     Maximum
-----------------------------------------------------------------------
mpg            19.83       4.743          22.5          12          34
=======================================================================
foreign: Foreign
-----------------------------------------------------------------------
Variable        Mean     Std Dev      Variance     Minimum     Maximum
-----------------------------------------------------------------------
mpg            24.77       6.611         43.71          14          41

Feasible GLS Estimation

// Load data
df_returns = loadd("df_returns.gdat");

// Run FGLS with default AR(1) innovations
fgls(df_returns, "rcpi ~ rcoe");

Valid cases:                    248          Dependent variable:            rcpi
Total SS:                     0.027          Degrees of freedom:             246
R-squared:                    0.110          Rbar-squared:                 0.107
Residual SS:                  0.024          Std error of est:             0.010
F(1,246)                     30.453          Probability of F:             0.000
Durbin-Watson                 0.757
--------------------------------------------------------------------------------
                        Standard                    Prob
Variable   Estimates       Error     t-value        >|t|  [95% Conf.   Interval]
--------------------------------------------------------------------------------

Constant      0.0148     0.00122        12.1       0.000      0.0124      0.0172
    rcoe       0.196      0.0685        2.86       0.005      0.0619        0.33

Compute feasible GLS coefficients and associated standard errors, t-statistics, p-values, and confidence intervals.
Get model evaluation statistics including R-squared, F-stat, and the Durbin-Watson statistic.
Choose from 7 built-in covariance estimation methods or provide your own covariance matrix.

Expanded Tabulation Capabilities

// Load data
df = loadd("tips2.dta");

// Two-way table
call tabulate(df, "sex ~ smoker");

============================================================
            sex                   smoker               Total
============================================================
                            No            Yes

         Female             55             33             88
           Male             99             60            159

          Total            154             93            247
============================================================

New tools for two-way tabulation provide a structured and systematic approach to understanding and drawing insights from categorical variables.

New procedure tabulate for computing two-way tables with advanced options for excluding categories and formatting reports.
Expanded functionality for the frequency function:
- New two-way tables.
- Sorted frequency reports and charts.

// Print sorted frequency table
// of 'rep78' in 'auto2' dataframe
frequency(auto2, "rep78", 1);

    Label      Count   Total %    Cum. %
  Average         30     43.48     43.48
     Good         18     26.09     69.57
Excellent         11     15.94     85.51
     Fair          8     11.59      97.1
     Poor          2     2.899       100
    Total         69       100

New Time and Date Extraction Tools

12 new procedures for extracting date and time components from dataframe dates.
Extract date and time components ranging from seconds to years.

New Convenience Functions for Data Management and Exploration

dropCategories - Drops observations of specific categories from a dataframe and updates the associated labels and key values.
getCategories - Returns the category labels for a categorical variable.
isString - Verifies whether an input is a string or string array.
startsWith - Locates elements that start with a specified string.
insertCols - Inserts one or more new columns into a matrix or dataframe at a specified location.

Improved Performance and Speed-ups

Expanded functionality of strindx allows for searching of unique substrings across multiple variables.
The upmat function now has the option to specify an offset from the main diagonal, the option to return only the upper triangular elements as a vector and is faster for medium and large matrices.
Significant speed improvements when using combinate with large values of n.
Remove missing values from large vectors more efficiently with speed increases in packr.

Conclusion

For a complete list of everything GAUSS 24 offers, please see the complete changelog.

Announcing the GAUSS Machine Learning Library

Eric — Mon, 28 Aug 2023 14:36:25 +0000

Introduction

The new GAUSS Machine Learning (GML) library offers powerful and efficient machine learning techniques in an accessible and friendly environment. Whether you're just getting familiar with machine learning or an experienced technician, you'll be running models in no time with GML.

Machine Learning Models at Your Fingertips

With the GAUSS Machine Learning library, you can run machine learning models out of the box, even without any machine learning background. It supports fundamental machine learning models for classification and regression including:

Quick and Painless Data Preparation and Management

We know model fitting and prediction is just the tip of the iceberg when it comes to any data analysis project. That's why we've focused on making GAUSS one of the best environments for data import, cleaning, and exploration.

GML provides machine learning specific data preparation tools including:

See how GAUSS reduces the pain and time of data wrangling and lets you get to the heart of your machine learning models quicker.

Easy to Implement Model Evaluation

Compare and evaluate machine learning models with GML plotting and performance evaluation tools:

Unparalleled Customer Support

We pride ourselves on offering unparalleled customer support and we truly care about your success. If you can't find what you need in our online documents, user forum, or blog, you can be confident that a GAUSS expert is here to quickly resolve your questions.

See It In Action

Want to see GML in action? Check out these real-world applications:

New Release TSPDLIB 3.0

Eric — Thu, 20 Jul 2023 16:38:38 +0000

Introduction

The preliminary econometric package for Time Series and Panel Data Methods has been updated and functionality has been expanded with over 20 new functions in this release of TSPDLIB 3.0.

The TSPDLIB 3.0 package includes expanded functions for time series and panel data testing both with and without structural breaks and causality testing.

It requires GAUSS 23+.

Changelog 3.0:

New functionality: Add metadata based variable names for improved printing.
Improvement: Simplified data loading formulas using expanded GAUSS 23 .
New unit root testing procedures:
- fourier_kpss - KPSS stationarity testing with flexible Fourier form, smooth structural breaks.
- fourier_kss - KSS unit root test with flexible Fourier form, smooth structural breaks.
- fourier_wadf - Wavelet ADF unit root test with flexible Fourier form, smooth structural breaks.
- fourier_wkss - Wavelet KSS unit root test with flexible Fourier form, smooth structural breaks.
- kss - KSS unit root test.
- qr_fourier_adf - Quantile ADF unit root test with flexible Fourier form, smooth structural breaks.
- qr_fourier_kss - Quantile KSS unit root test with flexible Fourier form, smooth structural breaks.
- qr_kss - Quantile KSS unit root test.
- qks_tests - Quantile Kolmogorov-Smirnov (QKS) tests.
- wkss - Wavelet KSS unit root test.
- sbur_gls - Carrion-i-Silvestre, Kim, and Perron (2009) GLS-unit root tests with multiple structural breaks.
New cointegration tests:
- pd_coint_wedgerton - Westerlund and Edgerton (2008) panel cointegration test.
New panel data unit root tests:
- pd_kpss - Carrion-i-Silvestre, et al.(2005) panel data KPSS test with multiple structural breaks.
- pd_stationary - Tests for unit roots in heterogeneous panel data, with or without cross-sectional averages and with or without flexible Fourier form structural breaks.
New causality tests:
- asymCause - Hatemi-J tests for asymmetric causality.
- pd_cause - Tests for Granger causality in heterogeneous panels including Fisher, Zhnc, and SUR Wald tests.
Other new functions:
- sbvar_icss - Sansó, Aragó & Carrion (2002) ICSS test for changes in unconditional variance.
- pd_getCDError - Tests for cross-sectional dependency.
New examples:
- actest.e
- ascomp.e
- fourier_kss.e
- fourier_kpss.e
- fourier_wadf.e
- fourier_wkss.e
- kss.e
- pd_cause.e
- pd_getcderror.e
- pd_coint_wedgerton.e
- pd_kpss.e
- qr_fourier_adf.e
- qr_fourier_kss.e
- qr_kss.e
- qr_qks.e
- sbur.e
- sbvar_icss.e
- wkss.e

Citation

If using this library please include the following citation:

Nazlioglu, S (2018) TSPDLIB: GAUSS Time Series and Panel Data Methods (Version 3.0). Source Code. https://github.com/aptech/tspdlib

Getting Started

Prerequisites

The program files require a working copy of GAUSS 23+.

Installing

The GAUSS Time Series and Panel data tests library should only be installed and updated directly in GAUSS using the GAUSS package manager.

Before using the functions created by tspdlib you will need to load the newly created tspdlib library. This can be done in a number of ways:

Navigate to the library tool view window and click the small wrench located next to the tspdlib library. Select Load Library.
Enter library tspdlib in the program input/output window.
Put the line library tspdlib; at the beginning of your program files.

Examples

After installing the library, examples for all available procedures can be found in your GAUSS home directory in the directory pkgs > tspdlib > examples. The examples use GAUSS and .csv datasets which are included in the pkgs > tspdlib > examples directory.

Using GAUSS Packages

For more information on how to make the best use of the TSPDLIB, please see our blog, Using GAUSS Packages Complete Guide.

Example Applications

Classification with Regularized Logistic Regression

Eric — Wed, 07 Jun 2023 15:59:02 +0000

Introduction

Logistic regression has been a long-standing popular tool for modeling categorical outcomes. It's widely used across fields like epidemiology, finance, and econometrics.

In today's blog we'll look at the fundamentals of logistic regression. We'll use a real-world survey data application and provide a step-by-step guide to implementing your own regularized logistic regression models using the GAUSS Machine Learning library, including:

Data preparation.
Model fitting.
Classification predictions.
Evaluating predictions and model fit.

What is Logistic Regression?

Logistic regression is a statistical method that can be used to predict the probability of an event occurring based on observed features or variables. The predicted probabilities can then be used to classify the data based on probability thresholds.

For example, if we are modeling a "TRUE" and "FALSE" outcome, we may predict that an outcome will be "TRUE" for all predicted probabilities of 0.5 and higher.

Mathematically, logistic regression models the probability of an outcome as a logistic function of the independent variables:

$$ Pr(Y = 1 | X) = p(X) = \frac{e^{B_0 + B_1X}}{1 + e^{B_0 + B_1X}} $$

The log-odds representation of this model is often preferred because it is linear in our independent variables:

$$ \log \bigg( \frac{p(X)}{1 - p(X)} \bigg) = B_0 + B_1X $$
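
To see where the log-odds form comes from, note that the logistic function implies the following for the probability of the complementary outcome and for the odds:

$$ 1 - p(X) = \frac{1}{1 + e^{B_0 + B_1X}} \quad \Rightarrow \quad \frac{p(X)}{1 - p(X)} = e^{B_0 + B_1X} $$

Taking the log of both sides of the odds expression yields a function that is linear in $X$.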

There are some important aspects of this model to keep in mind:

The logistic regression model always yields a prediction between 0 and 1.
The magnitude of the coefficients in the logistic regression model cannot be as directly interpreted as in the classic linear model.
The signs of the coefficients in the logistic regression model can be interpreted as expected. For example, if the coefficient on $X_1$ is negative we can conclude that increasing $X_1$ decreases $p(X)$.
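
These properties can be illustrated with a few lines of GAUSS. This is a minimal sketch using made-up coefficients (b0 and b1 are hypothetical, not estimates from any dataset) and the 0.5 classification threshold mentioned earlier:

```gauss
// Hypothetical coefficients, for illustration only
b0 = -1;
b1 = 0.5;

// Five feature values: -2, 0, 2, 4, 6
x = seqa(-2, 2, 5);

// Linear index
z = b0 + b1 * x;

// Predicted probabilities, always between 0 and 1
p = exp(z) ./ (1 + exp(z));

// Log-odds, which recover the linear index z
log_odds = ln(p ./ (1 - p));

// Classify as 1 when the predicted probability is 0.5 or higher
y_hat = p .>= 0.5;

// Print x, probabilities, log-odds, and predicted classes side by side
print x~p~log_odds~y_hat;
```

Because b1 is positive here, the predicted probability rises with x, and observations are classified as 1 once the linear index reaches zero.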

Logistic Regression with Regularization

One potential pitfall of logistic regression is its tendency for overfitting, particularly with high dimensional feature sets.

Regularization with L1 and/or L2 penalty parameters can help prevent overfitting and improve prediction.

Comparison of L1 and L2 Regularization
	$L1$ penalty (Lasso)	$L2$ penalty (Ridge)
Penalty term	$\lambda \sum_{j=1}^p |\beta_j|$	$\lambda \sum_{j=1}^p \beta_j^2$
Robust to outliers		✓
Shrinks coefficients	✓	✓
Can select features	✓
Sensitive to correlated features		✓
Useful for preventing overfitting	✓	✓
Useful for addressing multicollinearity		✓
Requires hyperparameter selection (λ)	✓	✓
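
To make the two penalty terms concrete, the sketch below evaluates them for a made-up coefficient vector and penalty weight. The values of b and lam are hypothetical and chosen only for illustration:

```gauss
// Hypothetical coefficient vector (4 x 1) and penalty weight
b = { 0.8, -0.3, 0, 1.5 };
lam = 0.1;

// L1 (Lasso) penalty: lambda times the sum of absolute coefficients
l1_penalty = lam * sumc(abs(b));

// L2 (Ridge) penalty: lambda times the sum of squared coefficients
l2_penalty = lam * sumc(b.^2);

print "L1 penalty: " l1_penalty;
print "L2 penalty: " l2_penalty;
```

Here the absolute values sum to 2.6 and the squares sum to 2.98, so the L1 penalty is 0.26 and the L2 penalty is 0.298. The large coefficient, 1.5, accounts for a greater share of the L2 penalty than of the L1 penalty because it is squared.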

Our previous blog, "Predicting the Output Gap With Machine Learning Regression Models" provides a more detailed look at L1 and L2 regularization.

Predicting Customer Satisfaction Using Survey Data

Today we will use airline passenger satisfaction data to demonstrate logistic regression with regularization.

Our task is to predict passenger satisfaction using:

Available survey answers.
Flight information.
Passenger characteristics.

Variable	Description
id	Responder identification number
Gender	Gender identification: Female or Male.
Customer Type	Loyal or disloyal customer.
Age	Customer age in years.
Type of travel	Personal or business travel.
Class	Eco or business class seat.
Flight Distance	Flight distance in miles.
Wifi service	Customer rating on 0-5 scale.
Schedule convenient	Customer rating on 0-5 scale.
Ease of Online booking	Customer rating on 0-5 scale.
Gate location	Customer rating on 0-5 scale.
Food and drink	Customer rating on 0-5 scale.
Seat comfort	Customer rating on 0-5 scale.
Online boarding	Customer rating on 0-5 scale.
Inflight entertainment	Customer rating on 0-5 scale.
On-board service	Customer rating on 0-5 scale.
Leg room service	Customer rating on 0-5 scale.
Baggage handling	Customer rating on 0-5 scale.
Checkin service	Customer rating on 0-5 scale.
Inflight service	Customer rating on 0-5 scale.
Cleanliness	Customer rating on 0-5 scale.
Departure Delay in minutes	Minutes delayed when departing.
Arrival Delay in minutes	Minutes delayed when arriving.
satisfaction	Overall airline satisfaction. Possible responses include "satisfied" or "neutral or dissatisfied".

The first step in our analysis is to load our data using loadd:

new;
library gml;
rndseed 8906876;

/*
** Load datafile
*/
// Set path and filename
load_path = "data/";
fname = "airline_satisfaction.gdat";

// Load data
airline_data = loadd(load_path $+ fname);

// Split data
y = airline_data[., "satisfaction"];
X = delcols(airline_data, "satisfaction"$|"id");

Data Exploration

Before we begin modeling, let's do some preliminary data exploration. First, let's check for common issues that can arise with survey data.

We'll check for:

Duplicate observations using isunique.
Missing values using dstatmt.

First, we'll check for duplicates, so any duplicates can be removed prior to checking our summary statistics:

// Check for duplicates
isunique(airline_data);

The isunique procedure returns a 1 if the data is unique and 0 if there are duplicates.

1.00000000

In this case, it indicates that we have no duplicates in our data.

Next, we'll check for missing values:

/*
** Check for data cleaning
** issues
*/
// Summary statistics
call dstatmt(airline_data);

This prints summary statistics for all variables:

Variable                       Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------------------

Gender                        -----       -----         -----      Female        Male    103904    0
Customer Type                 -----       -----         -----  Loyal Cust  disloyal C    103904    0
Age                           39.38       15.11         228.5           7          85    103904    0
Type of Travel                -----       -----         -----  Business t  Personal T    103904    0
Class                         -----       -----         -----    Business    Eco Plus    103904    0
Flight Distance                2108        1266     1.603e+06           0        3801    103904    0
Wifi service                  -----       -----         -----           0           5    103904    0
Schedule convenient           -----       -----         -----           0           5    103904    0
Ease of Online booking        -----       -----         -----           0           5    103904    0
Gate location                 -----       -----         -----           0           5    103904    0
Food and drink                -----       -----         -----           0           5    103904    0
Online boarding               -----       -----         -----           0           5    103904    0
Seat comfort                  -----       -----         -----           0           5    103904    0
Inflight entertainment        -----       -----         -----           0           5    103904    0
Onboard service               -----       -----         -----           0           5    103904    0
Leg room service              -----       -----         -----           0           5    103904    0
Baggage handling              -----       -----         -----           1           5    103904    0
Checkin service               -----       -----         -----           0           5    103904    0
Inflight service              -----       -----         -----           0           5    103904    0
Cleanliness                   -----       -----         -----           0           5    103904    0
Departure Delay in Minutes    14.82       38.23          1462           0        1592    103904    0
Arrival Delay in Minutes      15.25       38.81          1506           0        1584    103904    0
satisfaction                  -----       -----         -----  neutral or   satisfied    103904    0

The summary statistics give us some useful insights:

There are no missing values in our dataset.
The summary statistics of our numerical variables don't indicate any obvious outliers.
All categorical survey ratings range from 0 to 5, with the exception of Baggage handling, which ranges from 1 to 5. All categorical variables will need to be converted to dummy variables prior to modeling.

One other observation from our summary statistics is that many of the variable names are longer than necessary. Long variable names can be:

Difficult to remember.
Prone to typos.
Cut off when printing results.

(Not to mention they can be annoying to type!)

Let's streamline our variable names using dfname:

/*
** Update variable names
*/
// Create string array of short names
string short_names = {"Loyalty", "Reason", "Distance", "Wifi", 
                      "Schedule", "Booking", "Gate", "Boarding", 
                      "Entertainment", "Leg room", "Baggage", "Checkin", 
                      "Departure Delay", "Arrival Delay" };

// Create string array of original names to change                      
string original_names = { "Customer Type", "Type of Travel", "Flight Distance", "Wifi service",
                          "Schedule convenient", "Ease of Online booking", "Gate location", "Online boarding",
                          "Inflight entertainment", "Leg room service", "Baggage handling", "Checkin service",
                          "Departure Delay in Minutes", "Arrival Delay in Minutes" };

// Change names
airline_data = dfname(airline_data, short_names, original_names);

Data Visualization

Data visualization is a great way to get a feel for the relationships between our target variable and our features.

Let's explore the relationship between the customer and flight characteristics and reported satisfaction.

In particular, we'll look at how satisfaction relates to:

Age.
Gender.
Flight distance.
Seat class.
Customer type.

Preparing Our Data for Plotting

Today we'll use bar graphs to explore the relationships in our data. In particular, we will sort our data into subgroups and examine how those subgroups report satisfaction.

For categorical variables, we have naturally defined subgroups. However, for the continuous variables, Age and Distance, we first need to generate bins based on ranges of these variables.

First, let's place the Age variable in bins. To do this we will use the reclassifycuts and reclassify procedures:

For more information on reclassifying and other similar data transformations, see the Data Transformations section of our Data Management Guide.

/*
** Create bins for age
*/
// Set age categories cut points
// Class 0: 20 and Under
// Class 1: 21 - 30
// Class 2: 31 - 40
// Class 3: 41 - 50
// Class 4: 51 - 60
// Class 5: 61 - 70
// Class 6: Over 70
cut_pts = { 20, 
            30, 
            40, 
            50, 
            60, 
            70};

// Create numeric classes
age_new = reclassifycuts(airline_data[., "Age"], cut_pts);

// Generate labels to recode to
to = "20 and Under"$|
       "21-30"$|
       "31-40"$|
       "41-50"$|
       "51-60"$|
       "61-70"$|
       "Over 70";

// Recode to categorical variable
age_cat = reclassify(age_new, unique(age_new), to);

// Convert to dataframe
age_cat = asDF(age_cat, "Age Group");

For a quick frequency count of this categorical variable, we can use the frequency procedure:

// Check frequency of age groups
frequency(age_cat, "Age Group");

       Label      Count   Total %    Cum. %
20 and Under      11333     10.91     10.91
       21-30      21424     20.62     31.53
       31-40      21203     20.41     51.93
       41-50      23199     22.33     74.26
       51-60      18769     18.06     92.32
       61-70       7220     6.949     99.27
     Over 70        756    0.7276       100
       Total     103904       100

Now we will do the same for Distance.

/*
** Create bins for flight distance
*/       
// Set distance categories
// Cut points for data 
cut_pts = { 1000, 
            1500, 
            2000, 
            2500, 
            3000,
            3500};

// Create numeric classes
distance_new = reclassifycuts(airline_data[., "Distance"], cut_pts);

// Generate labels to recode to
to = "1000 and Under"$|
       "1001-1500"$|
       "1501-2000"$|
       "2001-2500"$|
       "2501-3000"$|
       "3001-3500"$|
       "Over 3500";

// Recode to categorical variable
distance_cat = reclassify(distance_new, unique(distance_new), to);

// Convert to dataframe
distance_cat = asDF(distance_cat, "Flight Range");

// Check frequencies
frequency(distance_cat, "Flight Range");

         Label      Count   Total %    Cum. %
1000 and Under      28017     26.96     26.96
     1001-1500      10976     10.56     37.53
     1501-2000       9331      8.98     46.51
     2001-2500       7834      7.54     54.05
     2501-3000       8053      7.75      61.8
     3001-3500      24815     23.88     85.68
     Over 3500      14878     14.32       100
         Total     103904       100

Age

We can see from the plot above that passengers 20 and under and passengers over 60 are less likely to be satisfied than other age groups.

Gender

The plot suggests that gender has little impact on reported satisfaction.

Flight Distance

The flight distance plot shows that there are slightly lower rates of satisfaction for flight lengths 3000 miles and over and flight lengths 1000 miles and under.

Seat Class

There is a clear discrepancy in satisfaction between passengers who fly business class and other passengers. Business class customers have a much higher rate of satisfaction than those in economy or economy plus.

Customer Type

Finally, it also appears that loyal passengers are more often satisfied customers than disloyal passengers.

Feature Engineering

As is common with survey data, a number of our variables are categorical. We need to represent these as dummy variables before modeling.

We'll do this using the oneHot procedure. However, oneHot only accepts single variables, so we will need to loop through all the categorical variables.

To do this, we first create a list of all categorical variables.

/*
** Create dummy variables
*/
// Get all variable names
col_names = getColNames(X);

// Get types of all variables
col_types = getColTypes(X);

// Select names of variables
// that are categorical
cat_names = selif(col_names, col_types .== "category");

Next, we loop through all categorical variables and create dummy variables for each one using oneHot.

// Loop through categorical variables
// to create dummy variables
dummy_vars = {};
for i(1, rows(cat_names), 1); 
    dummy_vars = dummy_vars~oneHot(X[., cat_names[i]]);
endfor;

// Delete original categorical variables
// and replace with dummy variables
X = delcols(X, cat_names)~dummy_vars;
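
To make the result of this step concrete, the sketch below builds the same kind of dummy variables by hand for a small hypothetical numeric rating vector. It uses only element-wise comparison rather than oneHot, so it illustrates the encoding and makes no claim about how oneHot is implemented:

```gauss
// Hypothetical ratings for six respondents
rating = { 3, 5, 3, 1, 5, 5 };

// Sorted distinct levels: 1, 3, 5
levels = unique(rating);

// Compare each observation (rows) with each level (columns).
// A 6x1 vector compared with a 1x3 row vector expands to a
// 6x3 matrix of 0/1 indicators with exactly one 1 per row
dummies = rating .== levels';

print dummies;
```

Each column of dummies flags one level of the original variable, so a single categorical column becomes as many indicator columns as it has distinct levels.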

Model Evaluation

The classificationMetrics procedure reports a number of classification metrics. These metrics provide information about how well the model meets different objectives.

Model Comparison Measures
Metric	Description
Accuracy	Overall model accuracy. Equal to the number of correct predictions divided by the total number of predictions.
Precision	Of the observations predicted to belong to a class, the share that actually belong to it. Equal to the number of true positives divided by the number of true positives plus false positives.
Recall	Of the observations that actually belong to a class, the share that the model identifies. Equal to the number of true positives divided by the number of true positives plus false negatives.
F1-score	The harmonic mean of precision and recall. It gives a more balanced picture of how our model performs. A score of 1 indicates perfect precision and recall.
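
As a rough illustration of these definitions, the sketch below computes each metric by hand for a small set of hypothetical labels, treating 1 as the positive class. The vectors and variable names are invented for this example. In practice, classificationMetrics reports these values for you:

```gauss
// Hypothetical labels: 1 = satisfied, 0 = neutral or dissatisfied
y_true = { 1, 1, 1, 0, 0, 0, 1, 0 };
y_pred = { 1, 0, 1, 0, 1, 0, 1, 0 };

// Confusion counts, treating 1 as the positive class
n_tp = sumc((y_true .== 1) .and (y_pred .== 1));
n_fp = sumc((y_true .== 0) .and (y_pred .== 1));
n_fn = sumc((y_true .== 1) .and (y_pred .== 0));
n_tn = sumc((y_true .== 0) .and (y_pred .== 0));

// Metrics computed from the counts
accuracy      = (n_tp + n_tn) / rows(y_true);
precision_pos = n_tp / (n_tp + n_fp);
recall_pos    = n_tp / (n_tp + n_fn);
f1_pos        = 2 * precision_pos * recall_pos / (precision_pos + recall_pos);

print "Accuracy:  " accuracy;
print "Precision: " precision_pos;
print "Recall:    " recall_pos;
print "F1-score:  " f1_pos;
```

With three true positives, one false positive, one false negative, and three true negatives, all four metrics equal 0.75 for these toy labels.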

We'll keep these in mind as we fit and test our model.

Logistic Regression Model Fitting

We're now ready to begin fitting our models. To start, we will prepare our data by:

Creating training and testing datasets using trainTestSplit.

// Split data into 70% training and 30% test set
{ y_train, y_test, X_train, X_test } = trainTestSplit(y, X, 0.7);

Scaling our data using rescale.

/*
** Data rescaling
*/
// Number of variables to rescale
numeric_vars = 4;

// Rescale training data
{ X_train[.,1:numeric_vars], x_mu, x_sd } = rescale(X_train[.,1:numeric_vars], "standardize");

// Rescale test data using same scaling factors as x_train
X_test[.,1:numeric_vars] = rescale(X_test[.,1:numeric_vars], x_mu, x_sd);

Unlike Random Forest models, logistic regression models are sensitive to large differences in the scale of the variables. Standardizing the variables as we do here is a good choice, but is not unequivocally the best option in all cases.

As you can see above, we compute the mean and standard deviation from the training set and use those parameters to scale the test set. This is important.

The purpose of our test set is to give us an estimate of how our model will do on unseen data. Using the mean and standard deviation from the entire dataset, computed before the train/test split, would allow information from the test set to "leak" into our model. Information leakage is beyond the scope of this blog post, but in general the test set should be treated like information that is not available until after the model fit is complete.
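
The idea can be sketched by hand on hypothetical data. The example below uses meanc and stdc instead of rescale, and the small matrices are invented for illustration. Note that stdc uses the n-1 divisor, which may or may not match what rescale does internally:

```gauss
// Hypothetical training and test features (rows are observations)
x_tr = { 1 10,
         2 20,
         3 30,
         4 40 };
x_te = { 5 50,
         6 60 };

// Column means and standard deviations from the training rows only
mu = meanc(x_tr)';
sd = stdc(x_tr)';

// Standardize both sets with the training parameters
x_tr_s = (x_tr - mu) ./ sd;
x_te_s = (x_te - mu) ./ sd;

// Training columns are centered at zero by construction
print meanc(x_tr_s)';

// Test columns generally are not, and that is expected
print meanc(x_te_s)';
```

The test set never contributes to mu or sd, so nothing about its distribution reaches the fitted model.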

Now we're ready to start fitting our models.

Case One: Logistic Regression Without Regularization

As a base case, we'll consider a logistic regression model without any regularization. For this case, we'll use all default settings, so our only inputs are the dependent and independent data.

Using our training data we will:

Train our model using logisticRegFit.
Make predictions on our training data using lmPredict.
Evaluate our training model predictions using classificationMetrics.

/*************************************
** Base case model
** No regularization
*************************************/

/*
** Training
*/
// Declare 'lr_mdl' to be
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train);

// Check training set performance
y_hat_train = lmPredict(lr_mdl, X_train);

// Model evaluations
print "Training Metrics";
call classificationMetrics(y_train, y_hat_train);

The classificationMetrics procedure prints an evaluation table:

No regularization
Training Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.93    41102
              satisfied        0.90    0.91      0.90    31631

              Macro avg        0.91    0.92      0.91    72733
           Weighted avg        0.92    0.92      0.92    72733

               Accuracy                          0.92    72733

/*
** Testing
*/
// Make predictions on the test set, from our trained model
y_hat_test = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
print "Testing Metrics";
call classificationMetrics(y_test, y_hat_test);

This code prints the following to screen:

Testing Metrics
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.93    0.92      0.92    17777
              satisfied        0.90    0.91      0.90    13394

              Macro avg        0.91    0.91      0.91    31171
           Weighted avg        0.91    0.91      0.91    31171

               Accuracy                          0.91    31171

Comparing performance on the training and testing data yields a few useful observations:

First, there is little difference in accuracy between the training and testing datasets, with a training accuracy of 0.92 and a testing accuracy of 0.91.
Second, the macro-average F1-score, which provides a balanced measure of performance, is 0.91 on both the training and testing datasets.

Why is this important? This comparison provides a good indication that we aren't overfitting our training set. Since the main purpose of regularization is to address overfitting the model to the training data, we don't have much reason to use it. However, for demonstration purposes, we'll show how to implement L2 regularization.

Case Two: Logistic Regression With L2 Regularization

To implement regularization with logisticRegFit, we'll use a logisticRegControl structure.

/*************************************
** L2 Regularization
*************************************/

/*
** Training
*/
// Declare 'lrc' to be a logisticRegControl
// structure and fill with default settings 
struct logisticRegControl lrc;
lrc = logisticRegControlCreate();

// Set L2 regularization parameter
lrc.l2 = 0.05;

// Declare 'lr_mdl' to be 
// a 'logisticRegModel' structure
// to hold the trained model
struct logisticRegModel lr_mdl;

// Train the logistic regression classifier
lr_mdl = logisticRegFit(y_train, X_train, lrc);

/*
** Testing
*/
// Make predictions on the test set
y_hat_l2 = lmPredict(lr_mdl, X_test);

/*
** Model evaluation
*/
call classificationMetrics(y_test, y_hat_l2);

The classification metrics are printed:

L2 regularization
==============================================================
                                        Classification metrics
==============================================================
                  Class   Precision  Recall  F1-score  Support

neutral or dissatisfied        0.89    0.93      0.91    17777
              satisfied        0.90    0.84      0.87    13394

              Macro avg        0.90    0.89      0.89    31171
           Weighted avg        0.89    0.89      0.89    31171

               Accuracy                          0.89    31171

Note that with the L2 penalty, our test performance drops relative to the base case model: accuracy falls from 0.91 to 0.89 and the macro-average F1-score falls from 0.91 to 0.89. This isn't surprising, given that we didn't find evidence of overfitting in the base case model.
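
For reference, L2-regularized logistic regression is commonly written as the penalized objective below. This is the textbook form and is shown here as an assumption: the exact scaling of the loss and penalty inside logisticRegFit is not stated in this post, and the symbol \lambda is only presumed to play the role of lrc.l2.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \Big]
\;+\; \lambda \sum_{j=1}^{k} \beta_j^{2},
\qquad
p_i = \frac{1}{1+e^{-(\beta_0 + x_i^{\top}\beta)}}
```

A larger \lambda shrinks the coefficients toward zero, which reduces variance at the cost of added bias. When the unpenalized model is not overfitting, that added bias can lower test performance, which is consistent with the drop reported above.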

Conclusion

In today's blog, we've looked at logistic regression and regularization.

Using a real-world airline passenger satisfaction dataset, we've:

Performed preliminary data exploration and setup.
Trained logistic regression models with and without regularization.
Made classification predictions.
Interpreted classification metrics.
