Transforming Panel Data to Long Form in GAUSS

Introduction

Anyone who works with panel data knows that pivoting between long and wide form, though commonly necessary, can still be painstakingly tedious, at best. It can lead to frustrating errors, unexpected results, and lengthy troubleshooting, at worst.

The new dfLonger and dfWider procedures introduced in GAUSS 24 make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.

In today's blog, we will walk through all you need to know about the dfLonger procedure to tackle even the most complex cases of transforming wide form panel data to long form.

The Rules of Tidy Data

Before we get started, it will be useful to consider what makes data tidy (and why tidy data is important).

It's useful to think of breaking our data into components (these subsets will come in handy later when working with dflonger):

  • Values.
  • Observations.
  • Variables.

Components of data.

We can use these components to define some basic rules for tidy data:

  1. Variables have unique columns.
  2. Observations have unique rows.
  3. Values have unique cells.

Example One: Wide Form State Population Table

State202020212022
Alabama5,031,3625,049,8465,074,296
Alaska732,923734,182733,583
Arizona7,179,9437,264,8777,359,197
Arkansas3,014,1953,028,1223,045,637
California39,501,65339,142,99139,029,342

Though not clearly labeled, we can deduce that this data presents values for three different variables: State, Year, and Population.

Looking more closely we see:

  • State is stored in a unique column.
  • The values of Years are stored as column names.
  • The values of Population are stored in separate columns for each year.

Our variables do not each have a unique column, violating the rules of tidy data.

Example Two: Long Form State Population Table

StateYearPopulation
Alabama20205,031,362
Alabama20215,049,846
Alabama20225,074,296
Alaska2020732,923
Alaska2021734,182
Alaska2022733,583
Arizona20207,179,943
Arizona20217,264,877
Arizona20227,359,197

The transformed data above now has three columns, one for each variable State, Year, and Population. We can also confirm that each observation has a single row and each value has a single cell.

Transforming the data to long form has resulted in a tidy data table.

Why Do We Care About Tidy Data?

Working with tidy data offers a number of advantages:

  • Tidy data storage offers consistency when trying to compare, explore, and analyze data whether it be panel data, time series data or cross-sectional data.
  • Using columns for variables is aligned with vectorization and matrix notation, both of which are fundamental to efficient computations.
  • Many software tools expect tidy data and will only work reliably with tidy data.

Ready to elevate your research? Try GAUSS 24 today.

Transforming From Wide to Long Panel Data

In this section, we will look at how to use the GAUSS procedure dfLonger to transform panel data from wide to long form. This section will cover:

  • The fundamentals of the dfLonger procedure.
  • A standard process for setting up panel data transformations.

The dfLonger Procedure

The dfLonger procedure transforms wide form GAUSS dataframes to long form GAUSS dataframes. It has four required inputs and one optional input:

df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl]);

df_wide
A GAUSS dataframe in wide panel format.
columns
String array, the columns that should be used in the conversion.
names_to
String array, specifies the variable name(s) for the new column(s) created to store the wide variable names.
value_to
String, the name of the new column containing the values.
pctl
Optional, an instance of the pivotControl structure used for advanced pivoting options.

Setting Up Panel Data Transformations

Having a systematic process for transforming wide panel data to long panel data will:

  • Save time.
  • Eliminate frustration.
  • Prevent errors.

Let's use our wide form state population data to work through the steps.

Step 1: Identify variables.

In our wide form population table, there are three variables: State, Year, and Population.

Step 2: Identify columns to convert.

The easiest way to determine what columns need to be converted is to identify the "problem" columns in your wide form data.

For example, in our original state population table, the columns named 2020, 2021, 2022, represent our Year variable. They store the values for the Population variable.

These are the columns we will need to address in order to make our data tidy.

columns = "2020"$|"2021"$|"2022";

We only have three columns to transform and it is easy to just type out our column names in a string array. This won't always be the case, though. Fortunately, GAUSS has a lot of great convenience functions to help with creating your column lists.

My favorites include:

FunctionDescriptionExample
getColNamesReturns the column variable names. varnames = getColNames(df_wide)
startsWithReturns a 1 if a string starts with a specified pattern. mask = startsWith(colNames, pattern)
trimrTrims rows from the top and/or bottom of a matrix. names = trimr(full_list, top, bottom)
rowcontainsReturns a 1 if the row contains the data specified by the needle variable, otherwise it returns a 0. mask = rowcontains(haystack, needle)
selifSelects rows from a matrix, dataframe or string array, based upon a vector of 1’s and 0’s. names = rowcontains(full_list, mask)

For more complex cases, it useful to approach creating column lists as a two-step process:

  1. Get all column names using getColNames.
  2. Select a subset of columns names using a selection convenience functions.

As an example, suppose our state population dataset contains a year column as the first column and the remaining columns contain the populations for 1950-2022. It would be difficult to write out the column list for all years.

Instead we could:

  1. Get a list of all the column names using getColNames.
  2. Trim the first name off the list.
// Get all columns names
colNames = getColNames(pop_wide);

// Trim first name `year` 
// from top of the name list
colNames = trimr(colNames, 1, 0);

Step 3: Name the new columns for storing names.

The names of the columns being transformed from our wide form data will be stored in a variable specified by the input names_to.

In this case, we want to store the names from the wide data in one new variable called, "Years". In later examples, we will look at how to split names into multiple variables using prefixes, separators, or patterns.

names_to = "Years";

Step 4: Name the new columns for storing values.

The values stored in the columns being transformed will be stored in a variable specified by the input values_to.

For our population table, we will store the values in a variable named "Population".

values_to = "Population";

Basic Pivoting

Now it's time to put all these steps together into a working example. Let's continue with our state population example.

We'll start by loading the complete state population dataset from the state_pop.gdat file:

// Load data 
pop_wide = loadd("state_pop.gdat");

// Preview data
head(pop_wide);
           State             2020             2021             2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342. 

Now, let's set up our information for transforming our data:

// Identify columns
columns = "2020"$|"2021"$|"2022";

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we'll transform our data using df_longer:

// Convert data using df_longer
pop_long = dfLonger(pop_wide, columns, names_to, values_to);

// Preview data
head(pop_long);
           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00 

Advanced Pivoting

One of the most appealing things about dfLonger is that while simple to use, it offers tools for tackling the most complex cases. In this section, we'll cover everything you need to know for moving beyond basic pivoting.

The pivotControl Structure

The pivotControl structure allows you to control pivoting specifications using the following members:

MemberPurpose
names_prefixA string input which specifies which characters, if any, should be stripped from the front of the wide variable names before they are assigned to a long column.
names_sep_splitA string input which specifies which characters, if any, mark where the names_to names should be broken up.
names_pattern_splitA string input containing a regular expression specifying group(s) in names_to names which should be broken up.
names_typesA string input specifying data types for the names_to variable.
values_drop_missingScalar, is set to 1 all rows with missing values will be removed.

Changing Variable Types

By default the variables created from the pieces of the variable names will be categorical variables.

If we examine the variable type of pop_long from our previous example,

// Check the type of the 'Year' variables
getColTypes(pop_long[., "Year"]);

we can see that the Year variable is a categorical variable:

            type
        category 

This isn't ideal and we'd prefer our Year variable to be a date. We can control the assigned type using the names_types member of the pivotControl structure. The names_types member can be specified in one of two ways:

  1. As a column vector of types for each of the names_to variables.
  2. An n x 2 string array where the first column is the name of the variable(s) and the second column contains the type(s) to be assigned.

For our example, we wish to specify that the Year variable should be a date but we don't need to change any of the other assigned types, so we will use the second option:

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify that 'Year' should be
// converted to a date variable
pctl.names_types = {"Year" "date"};

Next, we complete the steps for pivoting:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we call dfLonger including the pivotControl structure, pctl, as the final input:

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);
           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Now if we check the type of our Year variable:

// Check the type of 'Year'
getColTypes(pop_long[., "Year"]);

It is a date variable:

  type
  date

Stripping Prefixes

In our previous example, the wide data names only contained the year. However, the column names of a wide dataset often have common prefixes. The names_prefix member of the pivotControl structure offers a convenient way to strip unwanted prefixes.

Suppose that our wide form state population columns were labeled "yr_2020", "yr_2021", "yr_2022":

// Load data
pop_wide2 = loadd("state_pop2.gdat");

// Preview data
head(pop_wide2);
           State          yr_2020          yr_2021          yr_2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342.

We need to strip these prefixes when transforming our data to long form.

To accomplish this we first need to specify that our name columns have the common prefix "yr":

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify prefix
pctl.names_prefix = "yr_";

Next, we complete the steps for pivoting:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide2);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we call dfLonger:

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide2, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);
           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Splitting Names

In our basic example the only information contained in the names columns was the year. We created one variable to store that information, "Year". However, we may have cases where our wide form data contains more than one piece of information.

In theses case there are two important steps to take:

  1. Name the variables that will store the information contained in the wide data column names using the names_to input.
  2. Indicate to GAUSS how to split the wide data column names into the names_to variables.

Names Include a Separator

One way that names in wide data can contain multiple pieces of information is through the use of separators.

For example, suppose our data looks like this:

           State       urban_2020       urban_2021       urban_2022       rural_2020       rural_2021       rural_2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. 

Now our names specify:

  • Whether the population is the urban or rural population.
  • The year of the observation.

In this case, we:

  • Use the names_sep_split member of the pivotControl structure to indicate how to split the names.
  • Specify a names_to variable for each group created by the separator.
// Load data
pop_wide3 = loadd("state_pop3.gdat");

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names
pctl.names_sep_split = "_";

// Specify two variables for holding
// names information:
//    'Location' for the information before the separator
//    'Year' for the information after the separator
names_to = "Location"$|"Year";

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);
           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000

Now, the pop_long dataframe contains:

  • The information in the wide form names found before the separator, "_", (urban or rural) in the Location variable.
  • The information in the wide form names found after the separator, "_", in the Year variable.

Variable Names With Regular Expressions

In our example above, the variables contained in the names were clearly separated by a "_". However, this isn't always the case. Sometimes names use a pattern rather than separator:

// Load data
pop_wide4 = loadd("state_pop4.gdat");

// Preview data
head(pop_wide4);
           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. 

In cases like this, we can use the names_pattern_split member to tell GAUSS we want to pass in a regular expression that will split the columns. We can't cover the full details of regular expressions here. However, there are a few fundamentals that will help us get started with this example.

In regEx:

  1. Each statement inside a pair of parentheses is a group.
  2. To match any upper or lower case letter we use "[a-zA-Z]". More specifically, this tells GAUSS that we want to match any lowercase letter ranging from a-z and any upper case letter ranging from A-Z. If we wanted to limit this to any lowercase letters from t to z and any uppercase letter B to M we would say "[t-zB-M]".
  3. To match any integer we use "[0-9]".
  4. To represent that we want to match one or more instances of a pattern we use "+".
  5. To represent that we want to match zero or more instances of a pattern we use "*".

In this case, we want to separate our names so that "urban" and "rural" are collected in Location and 2020, 2021, and 2022 are collected in the Year variable:

  1. We have two groups.
  2. We can capture both urban and rural using "[a-zA-Z]+".
  3. We can capture the years by matching one or more number using "[0-9]+".

Let's use regEx to specify our names_pattern_split member:

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names 
// using the pivotControl structure
pctl.names_pattern_split = "([a-zA-Z]+)([0-9]+)"; 

Next, we can put this together with our other steps to transform our wide data:

// Variable for storing names
names_to = "Location"$|"Year";

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl4);
head(pop_long);
           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000

Multiple Value Variables

In all our previous examples we had values that needed to be stored in one variable. However, it's more realistic that our dataset contains multiple groups of values and we will need to specify multiple variables to store these values.

Let's consider our previous example which used the pop_wide4 dataset:

           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. 

Suppose that rather than creating a location variable, we wish to separate the population information into two variables, urban and rural. To do this we will:

  1. Split the variable names by words ("urban" or "rural") and integers.
  2. Create a Year column from the integer portions of the names.
  3. Create two values columns, urban and rural, from the word portions.

First, we will specify our columns:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

Next, we need to specify our names_to and values_to inputs. However, this time we want our values_to variables to be determined by the information in our names.

We do this using ".value".

// Tell GAUSS to use the first group of the split names 
// to set the values variables and 
// store the remaining group in 'Year'
names_to = ".value" $| "Year";

// Tell GAUSS to get 'values_to' variables from 'names_to'
values_to = "";

Setting ".value" as the first element in our names_to input tells dfLonger to take the first piece of the wide data names and create a column with the all the values from all matching columns.

In other words, combine all the values from the variables urban2020, urban2021, urban2022 into a single variable named urban and do the same for the rural columns.

Finally, we need to tell GAUSS how to split the variable names.

// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

// Set the regex to split the variable names
pctl.names_pattern_split = "(urban|rural)([0-9]+)";

This time, we specify the variable names, "(urban|rural)" rather than use the general specifier "([a-zA-Z])".

Now we call dfLonger:

// Convert the dataframe to long format according to our specifications
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);

// Print the first 5 rows of the long form dataframe
head(pop_long);
           State             Year            urban            rural
         Alabama             2020        6558153.0        1526791.0
         Alabama             2021        4972982.0        76863.000
         Alabama             2022        12375977.        7301681.0
          Alaska             2020        21944.000        710978.00
          Alaska             2021        467051.00        267130.00

Now the urban population and rural population are stored in their own column, named urban and rural.

Conclusion

As we've seen today, pivoting panel data from wide to long can be complicated. However, using a systematic approach and the GAUSS dfLonger procedure help to alleviate the frustration, time, and errors.


Discover how GAUSS 24 can help you reach your goals.

 
Leave a Reply

Have a Specific Question?

Get a real answer from a real person

Need Support?

Get help from our friendly experts.

Try GAUSS for 14 days for FREE

See what GAUSS can do for your data

© Aptech Systems, Inc. All rights reserved.

Privacy Policy