# Transforming Panel Data to Long Form in GAUSS

### Introduction

Anyone who works with panel data knows that pivoting between long and wide form, though commonly necessary, can still be painstakingly tedious, at best. It can lead to frustrating errors, unexpected results, and lengthy troubleshooting, at worst.

The new dfLonger and dfWider procedures introduced in GAUSS 24 make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.

In today's blog, we will walk through all you need to know about the `dfLonger` procedure to tackle even the most complex cases of transforming wide form panel data to long form.

## The Rules of Tidy Data

Before we get started, it will be useful to consider what makes data tidy (and why tidy data is important).

It's useful to think of breaking our data into components (these subsets will come in handy later when working with `dflonger`):

• Values.
• Observations.
• Variables.

We can use these components to define some basic rules for tidy data:

1. Variables have unique columns.
2. Observations have unique rows.
3. Values have unique cells.

### Example One: Wide Form State Population Table

State202020212022
Alabama5,031,3625,049,8465,074,296
Arizona7,179,9437,264,8777,359,197
Arkansas3,014,1953,028,1223,045,637
California39,501,65339,142,99139,029,342

Though not clearly labeled, we can deduce that this data presents values for three different variables: State, Year, and Population.

Looking more closely we see:

• State is stored in a unique column.
• The values of Years are stored as column names.
• The values of Population are stored in separate columns for each year.

Our variables do not each have a unique column, violating the rules of tidy data.

### Example Two: Long Form State Population Table

StateYearPopulation
Alabama20205,031,362
Alabama20215,049,846
Alabama20225,074,296
Arizona20207,179,943
Arizona20217,264,877
Arizona20227,359,197

The transformed data above now has three columns, one for each variable State, Year, and Population. We can also confirm that each observation has a single row and each value has a single cell.

Transforming the data to long form has resulted in a tidy data table.

### Why Do We Care About Tidy Data?

Working with tidy data offers a number of advantages:

• Tidy data storage offers consistency when trying to compare, explore, and analyze data whether it be panel data, time series data or cross-sectional data.
• Using columns for variables is aligned with vectorization and matrix notation, both of which are fundamental to efficient computations.
• Many software tools expect tidy data and will only work reliably with tidy data.

## Transforming From Wide to Long Panel Data

In this section, we will look at how to use the GAUSS procedure `dfLonger` to transform panel data from wide to long form. This section will cover:

• The fundamentals of the `dfLonger` procedure.
• A standard process for setting up panel data transformations.

### The `dfLonger` Procedure

The `dfLonger` procedure transforms wide form GAUSS dataframes to long form GAUSS dataframes. It has four required inputs and one optional input:

``df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl]);``

df_wide
A GAUSS dataframe in wide panel format.
columns
String array, the columns that should be used in the conversion.
names_to
String array, specifies the variable name(s) for the new column(s) created to store the wide variable names.
value_to
String, the name of the new column containing the values.
pctl
Optional, an instance of the `pivotControl` structure used for advanced pivoting options.

### Setting Up Panel Data Transformations

Having a systematic process for transforming wide panel data to long panel data will:

• Save time.
• Eliminate frustration.
• Prevent errors.

Let's use our wide form state population data to work through the steps.

### Step 1: Identify variables.

In our wide form population table, there are three variables: State, Year, and Population.

### Step 2: Identify columns to convert.

The easiest way to determine what columns need to be converted is to identify the "problem" columns in your wide form data.

For example, in our original state population table, the columns named 2020, 2021, 2022, represent our Year variable. They store the values for the Population variable.

These are the columns we will need to address in order to make our data tidy.

``columns = "2020"\$|"2021"\$|"2022";``

We only have three columns to transform and it is easy to just type out our column names in a string array. This won't always be the case, though. Fortunately, GAUSS has a lot of great convenience functions to help with creating your column lists.

My favorites include:

FunctionDescriptionExample
getColNamesReturns the column variable names.``` varnames = getColNames(df_wide)```
startsWithReturns a 1 if a string starts with a specified pattern.``` mask = startsWith(colNames, pattern)```
trimrTrims rows from the top and/or bottom of a matrix.``` names = trimr(full_list, top, bottom)```
rowcontainsReturns a 1 if the row contains the data specified by the `needle` variable, otherwise it returns a 0.``` mask = rowcontains(haystack, needle)```
selifSelects rows from a matrix, dataframe or string array, based upon a vector of 1’s and 0’s.``` names = rowcontains(full_list, mask)```

For more complex cases, it useful to approach creating column lists as a two-step process:

1. Get all column names using `getColNames`.
2. Select a subset of columns names using a selection convenience functions.

As an example, suppose our state population dataset contains a year column as the first column and the remaining columns contain the populations for 1950-2022. It would be difficult to write out the column list for all years.

1. Get a list of all the column names using `getColNames`.
2. Trim the first name off the list.
``````// Get all columns names
colNames = getColNames(pop_wide);

// Trim first name `year`
// from top of the name list
colNames = trimr(colNames, 1, 0);``````

### Step 3: Name the new columns for storing names.

The names of the columns being transformed from our wide form data will be stored in a variable specified by the input names_to.

In this case, we want to store the names from the wide data in one new variable called, `"Years"`. In later examples, we will look at how to split names into multiple variables using prefixes, separators, or patterns.

``names_to = "Years";``

### Step 4: Name the new columns for storing values.

The values stored in the columns being transformed will be stored in a variable specified by the input values_to.

For our population table, we will store the values in a variable named `"Population"`.

``values_to = "Population";``

## Basic Pivoting

Now it's time to put all these steps together into a working example. Let's continue with our state population example.

We'll start by loading the complete state population dataset from the state_pop.gdat file:

``````// Load data

// Preview data
```           State             2020             2021             2022
Alabama        5031362.0        5049846.0        5074296.0
Arizona        7179943.0        7264877.0        7359197.0
Arkansas        3014195.0        3028122.0        3045637.0
California        39501653.        39142991.        39029342. ```

Now, let's set up our information for transforming our data:

``````// Identify columns
columns = "2020"\$|"2021"\$|"2022";

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";``````

Finally, we'll transform our data using `df_longer`:

``````// Convert data using df_longer
pop_long = dfLonger(pop_wide, columns, names_to, values_to);

// Preview data
```           State             Year       Population
Alabama             2020        5031362.0
Alabama             2021        5049846.0
Alabama             2022        5074296.0

One of the most appealing things about `dfLonger` is that while simple to use, it offers tools for tackling the most complex cases. In this section, we'll cover everything you need to know for moving beyond basic pivoting.

### The `pivotControl` Structure

The `pivotControl` structure allows you to control pivoting specifications using the following members:

MemberPurpose
names_prefixA string input which specifies which characters, if any, should be stripped from the front of the wide variable names before they are assigned to a long column.
names_sep_splitA string input which specifies which characters, if any, mark where the names_to names should be broken up.
names_pattern_splitA string input containing a regular expression specifying group(s) in names_to names which should be broken up.
names_typesA string input specifying data types for the names_to variable.
values_drop_missingScalar, is set to 1 all rows with missing values will be removed.

### Changing Variable Types

By default the variables created from the pieces of the variable names will be categorical variables.

If we examine the variable type of pop_long from our previous example,

``````// Check the type of the 'Year' variables
getColTypes(pop_long[., "Year"]);``````

we can see that the Year variable is a categorical variable:

```            type
category ```

This isn't ideal and we'd prefer our Year variable to be a date. We can control the assigned type using the names_types member of the `pivotControl` structure. The names_types member can be specified in one of two ways:

1. As a column vector of types for each of the names_to variables.
2. An n x 2 string array where the first column is the name of the variable(s) and the second column contains the type(s) to be assigned.

For our example, we wish to specify that the Year variable should be a date but we don't need to change any of the other assigned types, so we will use the second option:

``````// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify that 'Year' should be
// converted to a date variable
pctl.names_types = {"Year" "date"};``````

Next, we complete the steps for pivoting:

``````// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";``````

Finally, we call `dfLonger` including the `pivotControl` structure, pctl, as the final input:

``````// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);

// Preview data
```           State             Year       Population
Alabama             2020        5031362.0
Alabama             2021        5049846.0
Alabama             2022        5074296.0

Now if we check the type of our Year variable:

``````// Check the type of 'Year'
getColTypes(pop_long[., "Year"]);``````

It is a date variable:

```  type
date```

### Stripping Prefixes

In our previous example, the wide data names only contained the year. However, the column names of a wide dataset often have common prefixes. The names_prefix member of the `pivotControl` structure offers a convenient way to strip unwanted prefixes.

Suppose that our wide form state population columns were labeled `"yr_2020"`, `"yr_2021"`, `"yr_2022"`:

``````// Load data

// Preview data
```           State          yr_2020          yr_2021          yr_2022
Alabama        5031362.0        5049846.0        5074296.0
Arizona        7179943.0        7264877.0        7359197.0
Arkansas        3014195.0        3028122.0        3045637.0
California        39501653.        39142991.        39029342.```

We need to strip these prefixes when transforming our data to long form.

To accomplish this we first need to specify that our name columns have the common prefix `"yr"`:

``````// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify prefix
pctl.names_prefix = "yr_";``````

Next, we complete the steps for pivoting:

``````// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide2);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";``````

Finally, we call `dfLonger`:

``````// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide2, columns, names_to, values_to, pctl);

// Preview data
```           State             Year       Population
Alabama             2020        5031362.0
Alabama             2021        5049846.0
Alabama             2022        5074296.0

### Splitting Names

In our basic example the only information contained in the names columns was the year. We created one variable to store that information, `"Year"`. However, we may have cases where our wide form data contains more than one piece of information.

In theses case there are two important steps to take:

1. Name the variables that will store the information contained in the wide data column names using the names_to input.
2. Indicate to GAUSS how to split the wide data column names into the names_to variables.

#### Names Include a Separator

One way that names in wide data can contain multiple pieces of information is through the use of separators.

For example, suppose our data looks like this:

```           State       urban_2020       urban_2021       urban_2022       rural_2020       rural_2021       rural_2022
Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. ```

Now our names specify:

• Whether the population is the urban or rural population.
• The year of the observation.

In this case, we:

• Use the names_sep_split member of the `pivotControl` structure to indicate how to split the names.
• Specify a names_to variable for each group created by the separator.
``````// Load data

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names
pctl.names_sep_split = "_";

// Specify two variables for holding
// names information:
//    'Location' for the information before the separator
//    'Year' for the information after the separator
names_to = "Location"\$|"Year";

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);

// Preview data
```           State         Location             Year       Population
Alabama            urban             2020        6558153.0
Alabama            urban             2021        4972982.0
Alabama            urban             2022        12375977.
Alabama            rural             2020        1526791.0
Alabama            rural             2021        76863.000```

Now, the pop_long dataframe contains:

• The information in the wide form names found before the separator, `"_"`, (urban or rural) in the Location variable.
• The information in the wide form names found after the separator, `"_"`, in the Year variable.

#### Variable Names With Regular Expressions

In our example above, the variables contained in the names were clearly separated by a `"_"`. However, this isn't always the case. Sometimes names use a pattern rather than separator:

``````// Load data

// Preview data
```           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. ```

In cases like this, we can use the names_pattern_split member to tell GAUSS we want to pass in a regular expression that will split the columns. We can't cover the full details of regular expressions here. However, there are a few fundamentals that will help us get started with this example.

In regEx:

1. Each statement inside a pair of parentheses is a group.
2. To match any upper or lower case letter we use `"[a-zA-Z]"`. More specifically, this tells GAUSS that we want to match any lowercase letter ranging from a-z and any upper case letter ranging from A-Z. If we wanted to limit this to any lowercase letters from t to z and any uppercase letter B to M we would say `"[t-zB-M]"`.
3. To match any integer we use `"[0-9]"`.
4. To represent that we want to match one or more instances of a pattern we use `"+"`.
5. To represent that we want to match zero or more instances of a pattern we use `"*"`.

In this case, we want to separate our names so that "urban" and "rural" are collected in Location and 2020, 2021, and 2022 are collected in the Year variable:

1. We have two groups.
2. We can capture both `urban` and `rural` using `"[a-zA-Z]+"`.
3. We can capture the years by matching one or more number using `"[0-9]+"`.

Let's use regEx to specify our names_pattern_split member:

``````// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names
// using the pivotControl structure
pctl.names_pattern_split = "([a-zA-Z]+)([0-9]+)"; ``````

Next, we can put this together with our other steps to transform our wide data:

``````// Variable for storing names
names_to = "Location"\$|"Year";

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl4);
```           State         Location             Year       Population
Alabama            urban             2020        6558153.0
Alabama            urban             2021        4972982.0
Alabama            urban             2022        12375977.
Alabama            rural             2020        1526791.0
Alabama            rural             2021        76863.000```

### Multiple Value Variables

In all our previous examples we had values that needed to be stored in one variable. However, it's more realistic that our dataset contains multiple groups of values and we will need to specify multiple variables to store these values.

Let's consider our previous example which used the pop_wide4 dataset:

```           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. ```

Suppose that rather than creating a location variable, we wish to separate the population information into two variables, urban and rural. To do this we will:

1. Split the variable names by words (`"urban"` or `"rural"`) and integers.
2. Create a Year column from the integer portions of the names.
3. Create two values columns, urban and rural, from the word portions.

First, we will specify our columns:

``````// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);``````

Next, we need to specify our names_to and values_to inputs. However, this time we want our values_to variables to be determined by the information in our names.

We do this using `".value"`.

``````// Tell GAUSS to use the first group of the split names
// to set the values variables and
// store the remaining group in 'Year'
names_to = ".value" \$| "Year";

// Tell GAUSS to get 'values_to' variables from 'names_to'
values_to = "";``````

Setting `".value"` as the first element in our names_to input tells `dfLonger` to take the first piece of the wide data names and create a column with the all the values from all matching columns.

In other words, combine all the values from the variables urban2020, urban2021, urban2022 into a single variable named urban and do the same for the rural columns.

Finally, we need to tell GAUSS how to split the variable names.

``````// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

// Set the regex to split the variable names
pctl.names_pattern_split = "(urban|rural)([0-9]+)";``````

This time, we specify the variable names, `"(urban|rural)"` rather than use the general specifier `"([a-zA-Z])"`.

Now we call `dfLonger`:

``````// Convert the dataframe to long format according to our specifications
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);

// Print the first 5 rows of the long form dataframe
```           State             Year            urban            rural
Alabama             2020        6558153.0        1526791.0
Alabama             2021        4972982.0        76863.000
Alabama             2022        12375977.        7301681.0

Now the urban population and rural population are stored in their own column, named urban and rural.

### Conclusion

As we've seen today, pivoting panel data from wide to long can be complicated. However, using a systematic approach and the GAUSS `dfLonger` procedure help to alleviate the frustration, time, and errors.

### Have a Specific Question?

Get a real answer from a real person

### Need Support?

Get help from our friendly experts.