Getting Started With Survey Data In GAUSS

by Eric · Published January 11, 2024 · Updated February 5, 2024

Introduction

Survey data is a powerful analysis tool, providing a window into people's thoughts, behaviors, and experiences. By collecting responses from a diverse sample of responders on a range of topics, surveys offer invaluable insights. These can help researchers, businesses, and policymakers make informed decisions and understand diverse perspectives.

In today's blog we'll look more closely at survey data including:

Fundamental characteristics of survey data.
Data cleaning considerations.
Data exploration using frequency tables and data visualizations.
Managing survey data in GAUSS.

While survey design and data collection are both important topics and can have significant impacts on analysis, they are beyond the scope of what we'll look at today.

Survey Data

Survey data presents unique characteristics and challenges that require careful consideration during the data analysis process.

Survey Data Characteristics
Categorical Nature	Survey data often involves categorical variables, where responses are grouped into distinct categories. Understanding the nature of these categories is crucial for choosing appropriate analysis methods.
Ordinal and Nominal Variables	It is important to recognize the distinction between ordinal variables (categories with a meaningful order) and nominal variables (categories without a specific order). This impacts the choice of statistical tests and visualization techniques.
Missing Data	Surveys may have missing or incomplete responses. Strategies for handling missing data, such as imputation or excluding incomplete cases, need to be considered.
Large Sample Sizes	Surveys often involve large sample sizes, leading to statistically significant but not necessarily practically significant results. It's crucial to consider whether the observed results are meaningful or impactful in the specific context of the study.
Multivariate Nature	Surveys explore relationships among multiple variables simultaneously. Multivariate analysis allows for a more comprehensive understanding of the complex relationships between different factors.
Choice modeling	Surveys act as a primary data collection method for understanding individuals' preferences and choices. Choice modeling techniques expand the insights gained from survey responses, providing a quantitative framework for analyzing decision-making processes in various contexts.

Data Cleaning Considerations For Analyzing Survey Data

Data cleaning allows us to identify and address errors, inconsistencies, and missing values. It is crucial for survey data and helps to:

Ensure accuracy.
Improve reliability.
Make meaningful and trustworthy insights.

Cleaning survey data includes some standard steps, such as:

Handling missing values,
Detecting outliers,

and some steps that are more specific to survey data, such as:

Performing consistency checks on survey responses,
Recoding categorical variables,
Handling open-ended responses.

Common Survey Data Cleaning Steps
Handling Missing Data	Identify missing data. Determine if missing values are systematic or random. Decide if missing values should be imputed or observations should be removed.
Outlier Detection and Treatment	Identify outliers that might skew the analysis. Decide whether outliers should be treated, transformed, or if they represent valid data points.
Standardize Variables	Standardize units and formats of variables to ensure consistency. Convert units, standardize date formats, and/or transform variables for better comparability.
Checking for Consistency	Perform consistency checks on the survey responses. Look for contradictory or illogical responses that may indicate errors in data entry.
Addressing Duplicate Entries	Identify and remove duplicate entries to avoid double-counting.
Recoding and Categorization	Recode variables or categorize responses to simplify analysis. Group similar categories, collapse response options, or create new variables based on recoded values.
Handling Open-Ended Responses	Categorize and code open-ended responses for analysis.
Dealing with Coding Errors	Check for coding errors in categorical variables. Ensure that each category is correctly labeled and that coding aligns with the intended meaning of the variable.

Exploring Survey Data

Exploratory data analysis is an important tool that can help us uncover insights from survey data without complicated computations. During this step, basic statistical tools like frequency tables, contingency tables, and summary statistics can shed light on important patterns and trends in the data.

One-Way Frequency Tables

Frequency tables provide a simple tabulation of the number of occurrences of each category in a single categorical variable. They display the counts (frequencies) of each category along with their corresponding percentages or proportions. Frequency tables are univariate, meaning they describe the distribution of one variable.

A simple frequency table can help us identify:

Inconsistencies, coding errors, typos, and other errors in categorical labels.
Outliers and missing values.
General distribution characteristics. For example, we may find that one level of a categorical variable makes up 90% of our observations.

	Count	Total %	Cum. %
Coffee	31	45.6	45.6
Tea	27	39.7	85.3
Soda	28	14.7	100

Two-Way Tables

Two-way tables, also known as contingency tables, are similar to frequency tables but offer additional information about data interactions. They display the frequency combinations of two categorical variables. This provides a snapshot of how these variables interact, and helps us uncover patterns and associations within survey data.

Two-way tables present information in a structured grid:

The columns correspond to one variable.
The rows correspond to the other variable.
The intersection of a row and column represent the frequency of observations having a pair of outcomes.

	Breakfast	Lunch	Dinner
Coffee	20	8	3
Tea	12	10	5
Soda	8	10	10

As an example, consider the table above:

The columns represent the outcomes for a variable meal_time: Breakfast, Lunch, and Dinner.
The rows represent the outcomes for a variable beverage_choice: Coffee, Tea, and Soda.
The bottom row contains the counts for Soda orders across all possible meal times.
The last column contains counts for all beverage options at Dinner.
The bottom, right corner tell us that 10 Sodas were ordered at Dinner.

Two-way tables are an efficient way to reveal the intricate relationships between two categorical variables. By presenting information in a structured grid, these tables offer a straightforward way to discern patterns, making it easier to grasp how variables interact.

Data Visualizations

Data plots are a great way to find understand data trends, observe outliers, and identify other data issues. When choosing a data plot, it is important to consider what plot is best suited for the type of the variable.

Bar Charts	Ideal for comparing the frequency or distribution of categorical variables.
Stacked Bar Charts	Useful for comparing the composition of different groups, where each bar is divided into segments representing subcategories.
Pie Charts	Shows the proportion of each category in relation to the whole.
Histograms	Depicts the distribution of a continuous variable by dividing it into intervals (bins) and showing the frequency of observations in each interval.
Line Charts	Demonstrates trends or patterns over a continuous variable or time.
Scatter Plots	Visualizes the relationship between two continuous variables.
Box Plots (Box-and-Whisker Plots)	Displays the distribution of a variable, including median, quartiles, and outliers.

Hands-On With Survey Data: NextGen National Household Travel Survey

Let's look at more closely at survey data using GAUSS and real-world transportation data.

Today's Data

Today we'll be working with the 2022 National Household Travel Survey (NHTS). This survey is designed to collect comprehensive information about travel patterns and travel behavior in the United States.

The NHTS survey:

Gathers data on various aspects of travel, including daily commuting, recreational trips, shopping, and other activities.
Is typically conducted at regular intervals to capture changes in travel behavior over time, though today we will only consider the 2022 survey results.
Utilizes a combination of interviews and diaries to collect data from a representative sample of households across the country.
Is valuable for transportation planners, policymakers, and researchers in making informed decisions regarding infrastructure development, traffic management, and other transportation-related initiatives.

The raw data from the NHTS is split into four separate CSV files containing:

Vehicle data.
Trip data.
Household data.
Person data.

Today we will work with the trip data.

Data Citation:
Federal Highway Administration. (2022). 2022 National Household Travel Survey, U.S. Department of Transportation, Washington, DC. Available online: https://nhts.ornl.gov.

Loading The Data

Let's get started by loading the data into GAUSS using the loadd procedure. We will also compute descriptive statistics for our data:

// Load trip data
trip_data = loadd("trip_data.gdat");

// Preliminary summary stats
dstatmt(trip_data);

-------------------------------------------------------------------------------------------
Variable           Mean     Std Dev      Variance     Minimum     Maximum     Valid Missing
-------------------------------------------------------------------------------------------

HOUSEID           9e+09    5.83e+04     3.399e+09       9e+09       9e+09     31074    0
PERSONID          1.681      0.9994        0.9989           1           9     31074    0
TRIPID            2.438       1.792         3.209           1          36     31074    0
SEQ_TRIPID        2.436        1.79         3.203           1          36     31074    0
VEHCASEID     7.619e+11   3.244e+11     1.052e+23          -1       9e+11     31074    0
FRSTHM            -----       -----         -----         Yes          No     31074    0
PARK              -----       -----         -----  Valid skip          No     31074    0
TRAVDAY           -----       -----         -----      Sunday    Saturday     31074    0
DWELTIME          95.18       164.3       2.7e+04          -9        1050     31074    0
PUBTRANS          -----       -----         -----  Used publi  Did not us     31074    0
TRIPPURP          -----       -----         -----  Not ascert  Not a home     31074    0
WHYTRP1S          -----       -----         -----        Home  Something      31074    0
TRVLCMIN          24.55       46.48          2161          -9        1425     31074    0
TRPTRANS          -----       -----         -----         Car  School bus     31074    0
NUMONTRP          1.997       3.478          12.1           1          99     31074    0
NONHHCNT         0.4141       3.388         11.48           0          98     31074    0
HHACCCNT          1.583      0.8916         0.795           1           8     31074    0
WHYTO             -----       -----         -----  Regular ac  Something      31074    0
WALK              -----       -----         -----  Valid skip  N/A - Didn     31074    0
TRPMILES          13.97       85.42          7296          -9        4859     31074    0
VMT_MILE          7.527       32.18          1035          -9        1683     31074    0
GASPRICE            398       68.46          4686       272.7       597.9     31074    0
NUMADLT           2.059      0.7616          0.58           1           8     31074    0
HOMEOWN           -----       -----         -----  Owned by h  Occupied w     31074    0
RAIL              -----       -----         -----         Yes          No     31074    0
CENSUS_D          -----       -----         -----  New Englan     Pacific     31074    0
CENSUS_R          -----       -----         -----   Northeast        West     31074    0
CDIVMSAR          -----       -----         -----  New Englan  Pacific No     31074    0
HHFAMINC          -----       -----         -----  I prefer n  $125,000 t     31074    0
HH_RACE           -----       -----         -----       White  Other race     31074    0
HHSIZE            2.822       1.447         2.093           1          10     31074    0
HHVEHCNT          2.134       1.078         1.163           0          11     31074    0
MSACAT            -----       -----         -----  MSA of 1 m  Not in MSA     31074    0
MSASIZE           -----       -----         -----  In an MSA   Not in MSA     31074    0
URBAN             -----       -----         -----  In an urba  Not in urb     31074    0
URBANSIZE         -----       -----         -----  50,000-199  Not in urb     31074    0
URBRUR            -----       -----         -----       Urban       Rural     31074    0
TDAYDATE          -----       -----         -----  2022-01-01  2023-01-01     31074    0
WRKCOUNT          1.304      0.9474        0.8976           0           6     31074    0
R_AGE              46.8       20.77         431.2           5          92     31074    0
R_SEX             -----       -----         -----      Refuse      Female     31074    0
R_RACE            -----       -----         -----       White  Other race     31074    0
EDUC              -----       -----         -----  Valid skip  Profession     31074    0
VEHTYPE           -----       -----         -----  Valid skip  Motorcycle     31074    0

There are many ways to preview dataframes in GAUSS but with a wide dataset that contains many variables, I find dstatmt to be the easiest to view.

The descriptive statistics themselves provide some useful information:

Many of the continuous variables, such as TRPMILES and TRVLCMIN, have minimum values below zero. These don't make sense and it is likely the -9 is coded to represent something different, such as non-responses.
There are 31074 valid observations and no missing values for all variables.

The descriptive statistics report also provides insights beyond the traditional descriptive statistics:

The data contains a mixture of categorical and numerical data.
Observations in our dataset are defined by a set of identification variables: HOUSEID, PERSONID, TRIPID, SEQ_TRIPID, VEHCASEID .

The data in trip_data.gdat has had preliminary cleaning from the raw data.

Checking For Duplicates

As a first step, we'll confirm that our data contains unique observations using the isunique procedure.

isunique(trip_data);

1.0000000

This indicates that our dataset is unique without any duplicates.

Examining Category Labels

Now that we confirmed that our dataset is unique, one of the first data cleaning steps with categorical data is to examine the category labels to check for errors and to get an understanding of the distribution.

Let’s look at the labels of the TRIPPURP variable using a sorted frequency table.

// Print frequency table for 'TRIPPURP'
frequency(trip_data, "TRIPPURP", 1);

                                 Label      Count   Total %    Cum. %
                Home-based other (HBO)       7714     24.82     24.82
           Not a home-based trip (NHB)       7035     22.64     47.46
           Home-based shopping (HBSHP)       6884     22.15     69.62
                 Home-based work (HBW)       4871     15.68     85.29
Home-based social/recreational (HBSOC)       4546     14.63     99.92
                       Not ascertained         24   0.07723       100
                                 Total      31074       100

Using this we can see that three categories make up almost 70% of the trips: "Home-based other", "Not a home-based trip", and "Home-based shopping".

The frequency table is also useful for learning more about our labels. In this table, the labels appear to be clean and we don’t see anything that suggests typos or errors.

To clean up the labels, let's separate the abbreviations from the descriptions. We can do this using some simple string manipulation in GAUSS.

First, let’s separate the abbreviations from the full descriptions by splitting the labels at "(" and storing the new string arrays:

// Use '(' to split existing labels into 2 columns
tmp = strsplit(trip_data[. , "TRIPPURP"], "(" );

// Trim whitespace from the front and back of both variables
tmp = strtrim(tmp);

// Rename columns 
tmp = setColNames(tmp , "TRIP_DESC"$|"TRIP_ABBR");

// Preview data
head(tmp);

              TRIP_DESC        TRIP_ABBR
       Home-based socia           HBSOC)
       Home-based socia           HBSOC)
       Home-based shopp           HBSHP)
       Not a home-based             NHB)
       Home-based shopp           HBSHP)

The TRIP_DESC variable looks good – it stores the full description of the TRIPPURP. However, the abbreviations in the TRIP_ABBR don’t quite look right, we still need to strip the ")".

/*
** Remove the right parenthesis
*/
// Replace ')' with an empty string
tmp[. , "TRIP_ABBR"]  = strreplace(tmp[. , "TRIP_ABBR"], ")", "");

// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");

                         Label      Count     Total %      Cum. %
              Home-based other       7714       24.82       24.82
           Home-based shopping       6884       22.15       46.98
Home-based social/recreational       4546       14.63       61.61
               Home-based work       4871       15.68       77.28
         Not a home-based trip       7035       22.64       99.92
               Not ascertained         24     0.07723         100
                         Total      31074         100

                         Label      Count     Total %      Cum. %
                                       24     0.07723     0.07723
                           HBO       7714       24.82        24.9
                         HBSHP       6884       22.15       47.06
                         HBSOC       4546       14.63       61.69
                           HBW       4871       15.68       77.36
                           NHB       7035       22.64         100
                         Total      31074         100

One final change we may want to make is to replace the missing abbreviation label for the "Not Ascertained" category using the recodeCatLabels.

/*
** Recode missing label
*/
// Add missing label for 'NA'
tmp[., 2] = recodecatlabels(tmp[., 2], "", "NA", "TRIP_ABBR");

// Check frequencies for both variables
frequency(tmp, "TRIP_DESC + TRIP_ABBR");

                         Label      Count     Total %      Cum. %
              Home-based other       7714       24.82       24.82
           Home-based shopping       6884       22.15       46.98
Home-based social/recreational       4546       14.63       61.61
               Home-based work       4871       15.68       77.28
         Not a home-based trip       7035       22.64       99.92
               Not ascertained         24     0.07723         100
                         Total      31074         100

                         Label      Count     Total %      Cum. %
                            NA         24     0.07723     0.07723
                           HBO       7714       24.82        24.9
                         HBSHP       6884       22.15       47.06
                         HBSOC       4546       14.63       61.69
                           HBW       4871       15.68       77.36
                           NHB       7035       22.64         100
                         Total      31074         100

We've successfully created two new variables - TRIP_DESC and TRIP_ABBR which we can concatenate to our trip_data dataframe:

// Add the new variables to the end of 'trip_data'
trip_data = trip_data ~ tmp;

Two-Way Tables

Frequency tables give provide insights into a single categorical variable. However, if we are interested in the relationship between multiple categorical variables, we need to use two-way, or contingency, tables.

Let's use a contingency table to look at the relationship between the URBRUR and the VEHTYPE. To do this we can use the tabulate procedure, introduced in GAUSS 24.

The tabulate function requires either a dataframe or filename input, along with a formula string to specify which variables to include in the table. It also takes an optional tabControl structure input for advanced options.

data: A GAUSS dataframe or filename.
formula: String, formula string. E.g "df1 ~ df2 + df3", "df1" categories will be reported in rows, separate columns will be returned for each category in "df2" and "df3".
tbctl: Optional, an instance of the tabControl structure used for advanced table options.

// Compute a two-way table with
// VEHTYPE categories in rows
// URBUR categories in columns
// Results stored in tab_df
tab_df = tabulate(trip_data, "VEHTYPE ~ URBRUR");

===============================================================
           VEHTYPE                   URBRUR               Total
===============================================================
                            Urban          Rural

        Valid skip           4061            719           4780
  Car/Stationwagon           9306           1774          11080
               Van           1438            358           1796
               SUV           8275           1935          10210
      Pickup Truck           2043           1043           3086
       Other Truck             36             24             60
      RV/Motorhome              4              4              8
  Motorcycle/Moped             39             15             54

             Total          25202           5872          31074
===============================================================

The initial counts provide us some insights:

The total counts of vehicles are higher in urban areas.
In urban areas the most frequently occurring type of vehicle is theCar/Stationwagon.
In rural areas the most frequently occurring type of vehicle is SUV.

It might useful to see relative percentages of the vehicle types. Because we stored the counts in the tab_df, this can easily be done.

First, let's look at what percentage each category makes up of the total vehicles in the urban and rural areas, respectively.

// Compute percentages within urban and rural areas
// by dividing by column totals
tab_df[., 1]~(tab_df[., 2:3]./sumc(tab_df[., 2:3])');

         VEHTYPE       URBUR_Urban   URBUR_Rural
      Valid skip            0.1611        0.1224
Car/Stationwagon            0.3692        0.3021
             Van            0.0571        0.0610
             SUV            0.3283        0.3295
    Pickup Truck            0.0811        0.1776
     Other Truck            0.0014        0.0041
    RV/Motorhome            0.0002        0.0007
Motorcycle/Moped            0.0015        0.0026

These percentages help us see that:

The distribution of Car/Stationwagon, Van, and SUV are fairly similar in urban and rural areas.
There is a higher percentage of the Pickup Truck, Other Truck, Motorcycle/Moped categories in rural areas.

Alternatively we can look at the distribution of each vehicle type across rural and urban areas.

// Compute percentages across urban and rural areas
// by dividing by row totals
tab_df[., 1]~(tab_df[., 2:3]./sumr(tab_df[., 2:3]))

         VEHTYPE      URBUR_Urban  URBUR_Rural
      Valid skip           0.8496       0.1504
Car/Stationwagon           0.8399       0.1601
             Van           0.8007       0.1993
             SUV           0.8105       0.1895
    Pickup Truck           0.6620       0.3380
     Other Truck           0.6000       0.4000
    RV/Motorhome           0.5000       0.5000
Motorcycle/Moped           0.7222       0.2778

This table tells a similar store from a different perspective:

Urban vehicles make up 80-83% of the Cars/Stationwagon, Van, and SUV categories.
Urban vehicles only make up 60% and 66% the Pickup Truck and Other Truck categories, respectively.
Urban vehicles make up 72% of the Motorcyle/Moped category.

Excluding Categories

Suppose we don't want to include the Valid skip responses in our contingency table. We can remove these using the exclude member of the tabControl structure.

To specify categories to be excluded from the contingency table, we use a string to specify the variable name and category separated by a ":".

// Declare structure
struct tabControl tbCtl;

// Fill defaults
tbCtl = tabControlCreate();

// Specify to exclude the 'Valid skip' category
// from the 'VEHTYPE' variable
tbCtl.exclude = "VEHTYPE:Valid skip";

// Find contingency table including tbCtl input
tab_df2 =  tabulate(trip_data, "VEHTYPE ~ URBRUR", tbCtl);

=============================================================================
                         VEHTYPE                   URBRUR               Total
=============================================================================
                                          Urban          Rural

                Car/Stationwagon           9306           1774          11080
                             Van           1438            358           1796
                             SUV           8275           1935          10210
                    Pickup Truck           2043           1043           3086
                     Other Truck             36             24             60
                    RV/Motorhome              4              4              8
                Motorcycle/Moped             39             15             54

                           Total          21141           5153          26294
=============================================================================

Now our table excludes the Valid skip category.

Ready to try it for yourself in GAUSS 24? Start your free trial today!

Data Visualizations

Data visualizations are one of the most useful tools for data exploration. There are several ways to utilize the plotting capabilities of GAUSS to explore survey data.

Frequency plots

First, let's use a frequency plot to explore the distribution of responses across census regions. To do this, we will utilize the plotFreq procedure.

// Census region frequencies
plotFreq(trip_data, "CENSUS_R", 1);

The sorted frequency plot allows us to quickly identify that the most frequently occurring region in our data is "South".

Note that support for sorting frequency plots was added in GAUSS 24.

Plotting Contingency Tables

Like frequency tables, frequency plots are useful for visualizing the categories of one variable. However, they don't provide much insight into the relationship across categorical variables.

To visualize the relationship between VEHTYPE and URBRUR, let's create a bar plot using our stored contingency table dataframe, tab_df2.

The plotBar function requires two inputs, labels for the x-axis and corresponding heights.

The labels for our bar plot are the vehicle types which are stored as a dataframe in the first column of the tab_df2. To use them as inputs we will need to:

Get the category labels.
Convert them to a string array.

// Get category labels
labels = getCategories(tab_df2, "VEHTYPE");

// Convert to string array
labels_sa = ntos(labels);

The corresponding heights will come from the tab_df2 variable. Let's find out the variable names in tab_df2:

// Print the variable names from 'tab_df2'
getcolnames(tab_df2);

     VEHTYPE
URBRUR_Urban
URBRUR_Rural

The final two variable names were created by the tabulate function to tell us which original variable the column came from, URBRUR, and which category is being referenced. Let's change the variable names to just Urban and Rural to make them more concise.

new_names = "Urban" $| "Rural";
col_idx = { 2, 3 };
tab_df2 = setcolnames(tab_df2, "Urban" $| "Rural", col_idx);

Now we're ready to use the Urban and Rural count variables to plot our data.

plotBar(labels_sa, tab_df2[., "Urban" "Rural"]);

By default, this plots our bars side-by-side. We can change this using a plotControl structure and plotsetbar .

// Declare structure
struct plotControl plt;

// Fill defaults
plt = plotGetDefaults("bar");

// Set bars to be solid and stacked
plotSetBar(&plt, 1, 1);

// Plot contingency table
plotBar(plt, labels_sa, tab_df2[., "Urban" "Rural"]);

Scatter Plots

Now suppose we wish to examine the relationship between a categorical variable and continuous variables. We can do this using the 'by' keyword and the plotScatter function.

// Plot TRIPMILES vs GASPRICE 
// Sorting by color using the categories in CENSUS_R
plotScatter(trip_data, "TRPMILES ~ GASPRICE + by(CENSUS_R)");

Adding the census regions provides some interesting observations:

The West region has higher gas prices than other regions.
The South region seems to have lower gas prices than other regions.

Conclusion

In this blog, we've covered some fundamental concepts related to survey data and looked at some GAUSS tools for cleaning, exploring, and visualizing survey data.

Eric( Director of Applications and Training at Aptech Systems, Inc. )

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.

Getting Started With Survey Data In GAUSS

Introduction

Survey Data

Survey Data Characteristics

Data Cleaning Considerations For Analyzing Survey Data

Common Survey Data Cleaning Steps

Exploring Survey Data

One-Way Frequency Tables

Two-Way Tables

Data Visualizations

Hands-On With Survey Data: NextGen National Household Travel Survey

Today's Data

Loading The Data

Checking For Duplicates

Examining Category Labels

Two-Way Tables

Excluding Categories

Data Visualizations

Frequency plots

Plotting Contingency Tables

Scatter Plots

Conclusion