<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Panel data &#8211; Aptech</title>
	<atom:link href="https://www.aptech.com/blog/category/panel-data/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aptech.com</link>
	<description>GAUSS Software - Fastest Platform for Data Analytics</description>
	<lastBuildDate>Mon, 13 Oct 2025 14:47:55 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Exploring and Cleaning Panel Data with GAUSS 25</title>
		<link>https://www.aptech.com/blog/exploring-and-cleaning-panel-data-with-gauss-25/</link>
					<comments>https://www.aptech.com/blog/exploring-and-cleaning-panel-data-with-gauss-25/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 28 Jan 2025 17:38:02 +0000</pubDate>
				<category><![CDATA[Panel data]]></category>
		<category><![CDATA[Releases]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11584930</guid>

					<description><![CDATA[Panel data offers a unique opportunity to examine both individual-specific and time-specific effects. However, as anyone who has worked with panel data knows, these same features that make panel data so useful can also make exploration and cleaning particularly challenging. 

GAUSS 25 was designed with these challenges in mind. It introduces a comprehensive new suite of panel data tools, tailored to make working with panel data in GAUSS easier, faster, and more intuitive. 

In today's blog, we’ll look at these new tools and demonstrate how they can simplify everyday panel data tasks, including:
<ul>
<li>Loading your data.</li>
<li>Preparing your panel dataset. </li>
<li>Exploring panel data characteristics. </li>
<li>Visualizing <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a>. </li>
<li>Transforming your data for modeling.</li>
</ul>
]]></description>
					<content:encoded><![CDATA[
<h3 id="introduction">Introduction</h3>
<p>Panel data offers a unique opportunity to examine both individual-specific and time-specific effects. However, as anyone who has worked with panel data knows, these same features that make panel data so useful can also make exploration and cleaning particularly challenging. </p>
<p><a href="https://www.aptech.com/blog/more-research-less-effort-with-gauss-25/" target="_blank" rel="noopener">GAUSS 25</a> was designed with these challenges in mind. It introduces a comprehensive new suite of tools, tailored to make working with panel data in GAUSS easier, faster, and more intuitive. </p>
<p>In today's blog, we’ll demonstrate how these tools can simplify everyday panel data tasks, including:</p>
<ul>
<li>Loading your data.</li>
<li>Preparing your panel dataset. </li>
<li>Exploring panel data characteristics. </li>
<li>Visualizing <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a>. </li>
<li>Transforming your data for modeling. </li>
</ul>
<h2 id="data">Data</h2>
<p>Today we will work with a subset of the publicly available Penn World Table version 10.01, available for download <a href="https://github.com/aptech/gauss_blog/raw/refs/heads/master/econometrics/exploring-and-cleaning-panel-data-g25-1.23.24/pwt_10.gdat" target="_blank" rel="noopener">here</a>. </p>
<table>
  <thead>
    <tr>
      <th colspan="2">
        <h3 id="penn-world-table-variables"><br>Penn World Table Variables</h3>
      </th>
    </tr>
    <tr>
      <th>Variable Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>currency_unit</td>
      <td>The currency unit used for GDP measurements.</td>
    </tr>
    <tr>
      <td>countrycode</td>
      <td>The three-letter ISO country code.</td>
    </tr>
    <tr>
      <td>country</td>
      <td>The name of the country.</td>
    </tr>
    <tr>
      <td>year</td>
      <td>The year of observation.</td>
    </tr>
    <tr>
      <td>rgdpe</td>
      <td>Real GDP at constant prices (expenditure-side).</td>
    </tr>
    <tr>
      <td>rgdpo</td>
      <td>Real GDP at constant prices (output-side).</td>
    </tr>
    <tr>
      <td>pop</td>
      <td>Population of the country.</td>
    </tr>
    <tr>
      <td>emp</td>
      <td>Number of employed persons.</td>
    </tr>
    <tr>
      <td>irr</td>
      <td>Real internal rate of return.</td>
    </tr>
  </tbody>
</table>
<div class="alert alert-info" role="alert">Data Citation:<br>Feenstra, Robert C., Robert Inklaar and Marcel P. Timmer (2015), &quot;The Next Generation of the Penn World Table&quot; American Economic Review, 105(10), 3150-3182, available for download at www.ggdc.net.</div>
<h2 id="loading-our-panel-data">Loading Our Panel Data</h2>
<p>We'll start by using the <a href="https://docs.aptech.com/next/gauss/loadd.html" target="_blank" rel="noopener">loadd</a> procedure to load our data. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load data from 'pwt_10.gdat'
// Using __FILE_DIR to specify data path
pwt_10 = loadd(__FILE_DIR $+ "pwt_10.gdat");

// Preview data 
head(pwt_10);</code></pre>
<div class="alert alert-info" role="alert">For more information on using __FILE_DIR please see our earlier blog, <a href="https://www.aptech.com/blog/make-your-code-portable-data-paths/" target="_blank" rel="noopener">Make Your Code Portable: Data Paths</a></div>
<p>The <a href="https://docs.aptech.com/gauss/head.html" target="_blank" rel="noopener">head</a> procedure prints the first five observations of our dataset, helping us check that our data has loaded properly:</p>
<pre>   currency_unit      countrycode          country             year            rgdpe            rgdpo              pop              emp              irr
  Aruban Guilder              ABW            Aruba       1991-01-01        2804.5005        3177.4575      0.064622000      0.029200001       0.11486563
  Aruban Guilder              ABW            Aruba       1992-01-01        2944.5161        3370.5376      0.068235000      0.030903272       0.11182721
  Aruban Guilder              ABW            Aruba       1993-01-01        3131.3708        3698.5325      0.072504000      0.032911807       0.11131135
  Aruban Guilder              ABW            Aruba       1994-01-01        3537.9534        4172.8242      0.076700000      0.034895979       0.10574290
  Aruban Guilder              ABW            Aruba       1995-01-01        3412.8745        4184.1562      0.080324000      0.036628015       0.10471709 </pre>
<p>It's important to note that to identify our panel, GAUSS requires a <a href="https://www.aptech.com/blog/what-is-a-gauss-dataframe-and-why-should-you-care/" target="_blank" rel="noopener">dataframe</a> to have at least one <a href="https://www.aptech.com/blog/dates-and-times-made-easy/" target="_blank" rel="noopener">date variable</a>
and one <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical</a> or <a href="https://www.aptech.com/blog/managing-string-data-with-gauss-dataframes/" target="_blank" rel="noopener">string</a> variable. </p>
<p>We will look more closely at how GAUSS identifies panels in the next section. For now, let's check that our data meets this requirement using the <a href="https://docs.aptech.com/gauss/getcoltypes.html" target="_blank" rel="noopener">getcoltypes</a> procedure.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check column types
getcoltypes(pwt_10);</code></pre>
<pre>            type
        category
        category
        category
            date
          number
          number
          number
          number
          number</pre>
<p>Our data meets the GAUSS requirement for panel data, with three categorical variables and one date variable. </p>
<div style="text-align:center;background-color:#f0f2f4"><hr>Ready to get started using GAUSS for panel data? <a href="https://www.aptech.com/request-demo/">Contact us for a GAUSS 25 demo!</a><hr></div>
<h2 id="preparing-panel-data">Preparing Panel Data</h2>
<p>Besides the data type requirements, the GAUSS panel data procedures assume a few important things about the form of your panel data. </p>
<p>In particular, your panel data should:</p>
<ul>
<li>Be in stacked long form.</li>
<li>Have the date and group identification columns occurring before other date and categorical/string variables. (This is not required, but it is the most convenient way to work with the GAUSS panel data procedures.)</li>
<li>Be sorted by group then time.</li>
</ul>
<p>Let’s look more closely at how to use GAUSS to ensure that our data meets these requirements.</p>
<h3 id="transforming-panel-data-to-long-form">Transforming panel data to long form</h3>
<p>If your panel data is in wide form, it's easy to convert to long form using the <a href="https://docs.aptech.com/gauss/dflonger.html" target="_blank" rel="noopener">dflonger</a> procedure. It's a versatile procedure, designed to be intuitive enough to cover basic transformations with little effort, yet flexible enough to tackle complex cases. </p>
<p>Since the <code>pwt_10</code> data is already in long form, we don't need to transform it. However, for an in-depth look at <code>dflonger</code>, including examples, see our previous blog, <a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a>.</p>
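<p>As a quick illustration, suppose we had a hypothetical wide dataframe, <code>df_wide</code>, with one row per country and one real GDP column per year. A minimal sketch of stacking it into long form, following the <code>dflonger</code> documentation (the column names <em>Y2017</em>-<em>Y2019</em> here are illustrative, not part of our dataset):</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Hypothetical wide data: one row per country and
// one 'rgdpe' column per year (Y2017, Y2018, Y2019)

// Stack the year columns into long form, creating a
// 'year' name column and an 'rgdpe' value column
pwt_long = dfLonger(df_wide, "Y2017"$|"Y2018"$|"Y2019", "year", "rgdpe");</code></pre>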
<h3 id="ordering-variables">Ordering variables</h3>
<p>One of the most convenient features of the new panel data procedures is their ability to intelligently detect group and time variables. To ensure this works properly, simply make sure that the date variable and group variable identifying your panel are the first occurring date and categorical/string variables in your dataset.</p>
<p>Let's take a look at our <code>pwt_10</code> dataframe:
<a href="https://www.aptech.com/wp-content/uploads/2025/01/Screenshot-2025-01-23-123605.png"><img src="https://www.aptech.com/wp-content/uploads/2025/01/Screenshot-2025-01-23-123605.png" alt="" width="850" height="482" class="aligncenter size-full wp-image-11584963" /></a></p>
<div class="alert alert-info" role="alert">The <code>Ctrl+E</code> hot key opens the variable under the cursor in a floating symbol editor window, allowing you to quickly view workspace symbols.</div>
<p><b> Identifying panel data groups</b><br />
Our dataset contains three categorical variables: <em>currency_unit</em>, <em>countrycode</em>, and <em>country</em>. By default, GAUSS will use the first occurring categorical variable, <em>currency_unit</em>, to identify the groups in the panel, unless we specify otherwise.</p>
<p><b> Identifying time dimension</b><br />
Our dataset also includes a date variable, <em>year</em>, which GAUSS will automatically use to identify the time dimension of the panel.</p>
<p>As the dataframe is now, GAUSS will use <em>currency_unit</em> and <em>year</em> to identify our panel. In this dataset, however, the panel should be identified by <em>country</em> and <em>year</em>. To address this, we could use <a href="https://www.aptech.com/blog/the-basics-of-optional-arguments-in-gauss-procedures/" target="_blank" rel="noopener">optional arguments</a> to specify that our group variable is <em>country</em>. However, we would need to do this every time we use one of the panel data procedures. </p>
<p>Instead, we can use the <a href="https://docs.aptech.com/gauss/order.html" target="_blank" rel="noopener">order</a> procedure to move the <em>country</em> and <em>year</em> variables to the front of our dataframe.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Move 'country' and 'year' to the front
pwt_10 = order(pwt_10, "country"$|"year");</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2025/01/Screenshot-2025-01-23-133652.png"><img src="https://www.aptech.com/wp-content/uploads/2025/01/Screenshot-2025-01-23-133652.png" alt="" width="1585" height="864" class="aligncenter size-full wp-image-11584964" /></a></p>
<p>Now, in our reordered <code>pwt_10</code> dataframe, we see that <em>country</em> and <em>year</em> appear as the first two columns. GAUSS will automatically use these to identify the group and time dimensions, respectively.</p>
<p>A few things to note:</p>
<ul>
<li>It is not necessary to move the <em>year</em> variable. Since it is the only date variable in the dataframe, GAUSS will use <em>year</em> to identify our time dimension regardless of its position. </li>
<li>The <em>country</em> variable does not need to be the first column in the dataframe. It only needs to appear before the other categorical variables for GAUSS to automatically recognize it as the group dimension.</li>
</ul>
<h3 id="sorting-panel-data">Sorting panel data</h3>
<p>Beyond the fact that the GAUSS panel data functions expect sorted data, there are many advantages to working with sorted data:</p>
<ul>
<li>Sorted data is easier to browse and explore. </li>
<li>Econometric techniques, such as calculating lags and differences, rely on the data being ordered consistently. </li>
<li>Proper sorting helps avoid errors, ensures reproducibility, and lays a solid foundation for reliable results.</li>
</ul>
<p>The new <a href="https://docs.aptech.com/gauss/pdsort.html" target="_blank" rel="noopener">pdsort</a> procedure allows you to quickly sort panel data by group and then by date.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Sort data using
// automatic group and date variables 
pwt_10 = pdSort(pwt_10);</code></pre>
<h2 id="assessing-panel-data-structure">Assessing Panel Data Structure</h2>
<p>When working with panel data, understanding your data's structure is important. It can play a role in the methods and assumptions applied in your models. For example, many techniques are only valid for balanced data and will produce unreliable results if your panel is unbalanced. </p>
<p>Some important considerations include:</p>
<ul>
<li>Whether the data is balanced.</li>
<li>The presence of gaps or missing data.</li>
<li>The ratio of groups to the number of time observations for each group.</li>
</ul>
<p>By examining our panel’s structure upfront, we can:</p>
<ul>
<li>Identify potential challenges.</li>
<li>Select the most appropriate analytical techniques.</li>
<li>Prevent errors that might result in biased or misleading conclusions.</li>
</ul>
<p>GAUSS includes a suite of panel data tools, introduced in GAUSS 25, that are designed for exploring the structure of panel data.</p>
<table>
  <thead>
    <tr>
      <th colspan="3">
        <h3 id="gauss-functions-for-panel-data-structure"><br>GAUSS Functions for Panel Data Structure</h3>
      </th>
    </tr>
    <tr>
      <th>Function Name</th>
      <th>Description</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
     <tr>
      <td><a href="https://docs.aptech.com/gauss/pdisbalanced.html" target="_blank" rel="noopener">pdIsBalanced</a></td>
      <td>Determines whether each group in a panel dataset covers the maximum time span.</td>
      <td><code>groupisBalanced = pdIsBalanced(pwt_10)</code></td>
    </tr>
     <tr>
      <td><a href="https://docs.aptech.com/gauss/pdallbalanced.html" target="_blank" rel="noopener">pdAllBalanced</a></td>
      <td>Checks if a panel dataset is strongly balanced and returns 1 if balanced, 0 otherwise.</td>
      <td><code>isBalanced = pdAllBalanced(pwt_10)</code></td>
    </tr>
    <tr>
      <td><a href="https://docs.aptech.com/gauss/pdisconsecutive.html" target="_blank" rel="noopener">pdIsConsecutive</a></td>
      <td>Checks if each group in a panel dataset covers consecutive time periods without gaps.</td>
      <td><code>groupisConsecutive = pdIsConsecutive(pwt_10)</code></td>
    </tr>
    <tr>
      <td><a href="https://docs.aptech.com/gauss/pdallconsecutive.html" target="_blank" rel="noopener">pdAllConsecutive</a></td>
      <td>Verifies whether all groups in a panel dataset have consecutive time periods without gaps.</td>
      <td><code>isConsecutive = pdAllConsecutive(pwt_10)</code></td>
    </tr>
    <tr>
      <td><a href="https://docs.aptech.com/gauss/pdsize.html" target="_blank" rel="noopener">pdSize</a></td>
      <td>Provides a size description of a panel dataset, including the number of groups and the number of time observations for each group.</td>
      <td><code>{ num_grps, T, balanced } = pdSize(pwt_10)</code></td>
    </tr>
    <tr>
      <td><a href="https://docs.aptech.com/gauss/pdtimespans.html" target="_blank" rel="noopener">pdTimeSpans</a></td>
      <td>Returns the time span (start and end dates) by group of variables in panel data.</td>
      <td><code>df_tspans = pdTimeSpans(pwt_10)</code></td>
    </tr>
  </tbody>
</table>
<h3 id="exploring-the-structure-of-the-penn-world-table">Exploring the structure of the Penn World Table</h3>
<p>Now let's take a look at the structure of our Penn World Table data. First, we'll quickly check whether our panel is strongly balanced and consecutive.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">print "Panel is balanced:";
pdAllBalanced(pwt_10);

// Check for consecutiveness
print "Panel is consecutive:";
pdAllConsecutive(pwt_10);</code></pre>
<pre>Panel is balanced:
       0.0000000
Panel is consecutive:
       1.0000000 </pre>
<p>This tells us that our panel is not strongly balanced but it is consecutive. </p>
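<p>If we want to see which groups fall short of the full time span, we can pair the per-group checks with the group time spans. A minimal sketch, assuming <code>pdIsBalanced</code> returns one indicator per group in the same group order as <code>pdTimeSpans</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// One balance indicator per group
// (1 = the group covers the maximum time span)
grp_bal = pdIsBalanced(pwt_10);

// Start and end dates for each group
df_tspans = pdTimeSpans(pwt_10);

// Keep only the groups that do not cover the full span
unbalanced = selif(df_tspans, grp_bal .== 0);</code></pre>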
<p>Now that we know our panel is unbalanced, we should take a closer look at our data structure using <code>pdSize</code>. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get summary of panel dimensions
{ num_grps, T, balanced } = pdSize(pwt_10);</code></pre>
<div style="max-height: 600px; overflow-y: scroll; border: 1px solid #ddd; padding: 10px;">
<pre>
================================================================================
Group ID:                   country          Balanced:                        No
Valid cases:                   7540          Missings:                         0
N. Groups:                      137          T. Average:                  55.036
================================================================================
country                                       T[i]     Start Date       End Date
--------------------------------------------------------------------------------

Angola                                          50     1970-01-01     2019-01-01 
Argentina                                       70     1950-01-01     2019-01-01 
Armenia                                         30     1990-01-01     2019-01-01 
Aruba                                           29     1991-01-01     2019-01-01 
Australia                                       70     1950-01-01     2019-01-01 
Austria                                         70     1950-01-01     2019-01-01 
Azerbaijan                                      30     1990-01-01     2019-01-01 
Bahamas                                         47     1973-01-01     2019-01-01 
Bahrain                                         50     1970-01-01     2019-01-01 
Barbados                                        60     1960-01-01     2019-01-01 
Belarus                                         30     1990-01-01     2019-01-01 
Belgium                                         70     1950-01-01     2019-01-01 
Benin                                           40     1980-01-01     2019-01-01 
Bermuda                                         34     1986-01-01     2019-01-01 
Bolivia (Plurinational State of)                70     1950-01-01     2019-01-01 
Bosnia and Herzegovina                          30     1990-01-01     2019-01-01 
Botswana                                        60     1960-01-01     2019-01-01 
Brazil                                          70     1950-01-01     2019-01-01 
British Virgin Islands                          29     1991-01-01     2019-01-01 
Bulgaria                                        50     1970-01-01     2019-01-01 
Burkina Faso                                    61     1959-01-01     2019-01-01 
Burundi                                         40     1980-01-01     2019-01-01 
Cabo Verde                                      40     1980-01-01     2019-01-01 
Cameroon                                        60     1960-01-01     2019-01-01 
Canada                                          70     1950-01-01     2019-01-01 
Cayman Islands                                  29     1991-01-01     2019-01-01 
Central African Republic                        40     1980-01-01     2019-01-01 
Chad                                            60     1960-01-01     2019-01-01 
Chile                                           69     1951-01-01     2019-01-01 
China                                           68     1952-01-01     2019-01-01 
China, Hong Kong SAR                            60     1960-01-01     2019-01-01 
China, Macao SAR                                40     1980-01-01     2019-01-01 
Colombia                                        70     1950-01-01     2019-01-01 
Costa Rica                                      70     1950-01-01     2019-01-01 
Croatia                                         30     1990-01-01     2019-01-01 
Cyprus                                          70     1950-01-01     2019-01-01 
Czech Republic                                  30     1990-01-01     2019-01-01 
Côte d'Ivoire                                   60     1960-01-01     2019-01-01 
Denmark                                         70     1950-01-01     2019-01-01 
Djibouti                                        40     1980-01-01     2019-01-01 
Dominican Republic                              69     1951-01-01     2019-01-01 
Ecuador                                         70     1950-01-01     2019-01-01 
Egypt                                           70     1950-01-01     2019-01-01 
Estonia                                         30     1990-01-01     2019-01-01 
Eswatini                                        40     1980-01-01     2019-01-01 
Fiji                                            40     1980-01-01     2019-01-01 
Finland                                         70     1950-01-01     2019-01-01 
France                                          70     1950-01-01     2019-01-01 
Gabon                                           60     1960-01-01     2019-01-01 
Georgia                                         30     1990-01-01     2019-01-01 
Germany                                         70     1950-01-01     2019-01-01 
Greece                                          69     1951-01-01     2019-01-01 
Guatemala                                       70     1950-01-01     2019-01-01 
Guinea                                          40     1980-01-01     2019-01-01 
Honduras                                        50     1970-01-01     2019-01-01 
Hungary                                         50     1970-01-01     2019-01-01 
Iceland                                         70     1950-01-01     2019-01-01 
India                                           70     1950-01-01     2019-01-01 
Indonesia                                       60     1960-01-01     2019-01-01 
Iran (Islamic Republic of)                      65     1955-01-01     2019-01-01 
Iraq                                            50     1970-01-01     2019-01-01 
Ireland                                         70     1950-01-01     2019-01-01 
Israel                                          70     1950-01-01     2019-01-01 
Italy                                           70     1950-01-01     2019-01-01 
Jamaica                                         67     1953-01-01     2019-01-01 
Japan                                           70     1950-01-01     2019-01-01 
Jordan                                          66     1954-01-01     2019-01-01 
Kazakhstan                                      30     1990-01-01     2019-01-01 
Kenya                                           70     1950-01-01     2019-01-01 
Kuwait                                          50     1970-01-01     2019-01-01 
Kyrgyzstan                                      30     1990-01-01     2019-01-01 
Lao People's DR                                 40     1980-01-01     2019-01-01 
Latvia                                          30     1990-01-01     2019-01-01 
Lebanon                                         50     1970-01-01     2019-01-01 
Lesotho                                         40     1980-01-01     2019-01-01 
Lithuania                                       30     1990-01-01     2019-01-01 
Luxembourg                                      70     1950-01-01     2019-01-01 
Malaysia                                        65     1955-01-01     2019-01-01 
Malta                                           66     1954-01-01     2019-01-01 
Mauritania                                      43     1977-01-01     2019-01-01 
Mauritius                                       70     1950-01-01     2019-01-01 
Mexico                                          70     1950-01-01     2019-01-01 
Mongolia                                        40     1980-01-01     2019-01-01 
Morocco                                         70     1950-01-01     2019-01-01 
Mozambique                                      60     1960-01-01     2019-01-01 
Namibia                                         60     1960-01-01     2019-01-01 
Netherlands                                     70     1950-01-01     2019-01-01 
New Zealand                                     70     1950-01-01     2019-01-01 
Nicaragua                                       40     1980-01-01     2019-01-01 
Niger                                           60     1960-01-01     2019-01-01 
Nigeria                                         70     1950-01-01     2019-01-01 
North Macedonia                                 30     1990-01-01     2019-01-01 
Norway                                          70     1950-01-01     2019-01-01 
Oman                                            50     1970-01-01     2019-01-01 
Panama                                          51     1969-01-01     2019-01-01 
Paraguay                                        69     1951-01-01     2019-01-01 
Peru                                            70     1950-01-01     2019-01-01 
Philippines                                     70     1950-01-01     2019-01-01 
Poland                                          50     1970-01-01     2019-01-01 
Portugal                                        70     1950-01-01     2019-01-01 
Qatar                                           50     1970-01-01     2019-01-01 
Republic of Korea                               67     1953-01-01     2019-01-01 
Republic of Moldova                             30     1990-01-01     2019-01-01 
Romania                                         60     1960-01-01     2019-01-01 
Russian Federation                              30     1990-01-01     2019-01-01 
Rwanda                                          60     1960-01-01     2019-01-01 
Sao Tome and Principe                           40     1980-01-01     2019-01-01 
Saudi Arabia                                    50     1970-01-01     2019-01-01 
Senegal                                         60     1960-01-01     2019-01-01 
Serbia                                          30     1990-01-01     2019-01-01 
Sierra Leone                                    40     1980-01-01     2019-01-01 
Singapore                                       60     1960-01-01     2019-01-01 
Slovakia                                        30     1990-01-01     2019-01-01 
Slovenia                                        30     1990-01-01     2019-01-01 
South Africa                                    70     1950-01-01     2019-01-01 
Spain                                           70     1950-01-01     2019-01-01 
Sri Lanka                                       70     1950-01-01     2019-01-01 
Sudan                                           50     1970-01-01     2019-01-01 
Suriname                                        47     1973-01-01     2019-01-01 
Sweden                                          70     1950-01-01     2019-01-01 
Switzerland                                     70     1950-01-01     2019-01-01 
Taiwan                                          69     1951-01-01     2019-01-01 
Tajikistan                                      30     1990-01-01     2019-01-01 
Thailand                                        70     1950-01-01     2019-01-01 
Togo                                            40     1980-01-01     2019-01-01 
Trinidad and Tobago                             70     1950-01-01     2019-01-01 
Tunisia                                         60     1960-01-01     2019-01-01 
Turkey                                          70     1950-01-01     2019-01-01 
U.R. of Tanzania: Mainland                      60     1960-01-01     2019-01-01 
Ukraine                                         30     1990-01-01     2019-01-01 
United Kingdom                                  70     1950-01-01     2019-01-01 
United States                                   70     1950-01-01     2019-01-01 
Uruguay                                         70     1950-01-01     2019-01-01 
Uzbekistan                                      30     1990-01-01     2019-01-01 
Venezuela (Bolivarian Republic of)              70     1950-01-01     2019-01-01 
Zambia                                          65     1955-01-01     2019-01-01 
Zimbabwe                                        66     1954-01-01     2019-01-01 
================================================================================
</pre>
</div>
<p><br>
The <code>pdSize</code> procedure provides a concise summary of our panel data structure, including:</p>
<ul>
<li>The total number of groups and a full list of the groups.</li>
<li>The number of observations per group.</li>
<li>The number of missing values.</li>
<li>The start and end date of each group in our panel.</li>
</ul>
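<p>Under the hood, a summary like this amounts to a few group-wise aggregations. Here is a minimal sketch of the same tallies in pandas (an illustration with toy data, not the GAUSS implementation of <code>pdSize</code>):</p>

```python
# Per-group tallies behind a panel "size" summary: observation count,
# missing values, and first/last period. Toy data for illustration only.
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "rgdpo": [1.0, None, 3.0],
})

summary = df.groupby("country").agg(
    n_obs=("year", "size"),                       # rows per group
    n_missing=("rgdpo", lambda s: s.isna().sum()),  # missing values per group
    start=("year", "min"),                        # first period observed
    end=("year", "max"),                          # last period observed
)
print(summary)
```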
<p>While there are no missing values in this data, this isn't always the case. In fact, it is quite common that variables cover only part of the full timespan. For example, a country may have a longer history of providing real GDP data than IRR data. </p>
<p>The <code>pdTimeSpans</code> procedure reports the full timespan for each group, along with the timespans for a specified variable list. If no variable list is provided, it returns the timespan for all variables in the dataframe. </p>
<p>For example, suppose we want to use the <em>emp</em> and <em>rgdpo</em> variables in a model and want to know the maximum timespan our model can cover. We can use <code>pdTimeSpans</code> to see the timespan of each variable:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">pwt_model_timespans = pdTimeSpans(pwt_10, "emp"$|"rgdpo");
pwt_model_timespans;</code></pre>
<div style="max-height: 600px; overflow-y: scroll; border: 1px solid #ddd; padding: 10px;">

<pre>
         country       Start year         End year        emp Start          emp End      rgdpo Start        rgdpo End 
          Angola       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
       Argentina       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Armenia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
           Aruba       1991-01-01       2019-01-01       1991-01-01       2019-01-01       1991-01-01       2019-01-01 
       Australia       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Austria       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
      Azerbaijan       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
         Bahamas       1973-01-01       2019-01-01       1973-01-01       2019-01-01       1973-01-01       2019-01-01 
         Bahrain       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
        Barbados       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
         Belarus       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
         Belgium       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Benin       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
         Bermuda       1986-01-01       2019-01-01       1986-01-01       2019-01-01       1986-01-01       2019-01-01 
Bolivia (Plurina       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
Bosnia and Herze       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
        Botswana       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
          Brazil       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
British Virgin I       1991-01-01       2019-01-01       1991-01-01       2019-01-01       1991-01-01       2019-01-01 
        Bulgaria       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
    Burkina Faso       1959-01-01       2019-01-01       1959-01-01       2019-01-01       1959-01-01       2019-01-01 
         Burundi       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
      Cabo Verde       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
        Cameroon       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
          Canada       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
  Cayman Islands       1991-01-01       2019-01-01       1991-01-01       2019-01-01       1991-01-01       2019-01-01 
Central African        1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
            Chad       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
           Chile       1951-01-01       2019-01-01       1951-01-01       2019-01-01       1951-01-01       2019-01-01 
           China       1952-01-01       2019-01-01       1952-01-01       2019-01-01       1952-01-01       2019-01-01 
China, Hong Kong       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
China, Macao SAR       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
        Colombia       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
      Costa Rica       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Croatia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
          Cyprus       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
  Czech Republic       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
   Côte d'Ivoire       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
         Denmark       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
        Djibouti       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
Dominican Republ       1951-01-01       2019-01-01       1951-01-01       2019-01-01       1951-01-01       2019-01-01 
         Ecuador       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Egypt       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Estonia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
        Eswatini       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
            Fiji       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
         Finland       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          France       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Gabon       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
         Georgia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
         Germany       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Greece       1951-01-01       2019-01-01       1951-01-01       2019-01-01       1951-01-01       2019-01-01 
       Guatemala       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Guinea       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
        Honduras       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
         Hungary       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
         Iceland       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           India       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
       Indonesia       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
Iran (Islamic Re       1955-01-01       2019-01-01       1955-01-01       2019-01-01       1955-01-01       2019-01-01 
            Iraq       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
         Ireland       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Israel       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Italy       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Jamaica       1953-01-01       2019-01-01       1953-01-01       2019-01-01       1953-01-01       2019-01-01 
           Japan       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Jordan       1954-01-01       2019-01-01       1954-01-01       2019-01-01       1954-01-01       2019-01-01 
      Kazakhstan       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
           Kenya       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Kuwait       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
      Kyrgyzstan       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
 Lao People's DR       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
          Latvia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
         Lebanon       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
         Lesotho       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
       Lithuania       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
      Luxembourg       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
        Malaysia       1955-01-01       2019-01-01       1955-01-01       2019-01-01       1955-01-01       2019-01-01 
           Malta       1954-01-01       2019-01-01       1954-01-01       2019-01-01       1954-01-01       2019-01-01 
      Mauritania       1977-01-01       2019-01-01       1977-01-01       2019-01-01       1977-01-01       2019-01-01 
       Mauritius       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Mexico       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
        Mongolia       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
         Morocco       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
      Mozambique       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
         Namibia       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
     Netherlands       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
     New Zealand       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
       Nicaragua       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
           Niger       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
         Nigeria       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
 North Macedonia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
          Norway       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
            Oman       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
          Panama       1969-01-01       2019-01-01       1969-01-01       2019-01-01       1969-01-01       2019-01-01 
        Paraguay       1951-01-01       2019-01-01       1951-01-01       2019-01-01       1951-01-01       2019-01-01 
            Peru       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
     Philippines       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Poland       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
        Portugal       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Qatar       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
Republic of Kore       1953-01-01       2019-01-01       1953-01-01       2019-01-01       1953-01-01       2019-01-01 
Republic of Mold       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
         Romania       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
Russian Federati       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
          Rwanda       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
Sao Tome and Pri       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
    Saudi Arabia       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
         Senegal       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
          Serbia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
    Sierra Leone       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
       Singapore       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
        Slovakia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
        Slovenia       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
    South Africa       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Spain       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
       Sri Lanka       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
           Sudan       1970-01-01       2019-01-01       1970-01-01       2019-01-01       1970-01-01       2019-01-01 
        Suriname       1973-01-01       2019-01-01       1973-01-01       2019-01-01       1973-01-01       2019-01-01 
          Sweden       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
     Switzerland       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Taiwan       1951-01-01       2019-01-01       1951-01-01       2019-01-01       1951-01-01       2019-01-01 
      Tajikistan       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
        Thailand       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
            Togo       1980-01-01       2019-01-01       1980-01-01       2019-01-01       1980-01-01       2019-01-01 
Trinidad and Tob       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Tunisia       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
          Turkey       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
U.R. of Tanzania       1960-01-01       2019-01-01       1960-01-01       2019-01-01       1960-01-01       2019-01-01 
         Ukraine       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
  United Kingdom       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
   United States       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
         Uruguay       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
      Uzbekistan       1990-01-01       2019-01-01       1990-01-01       2019-01-01       1990-01-01       2019-01-01 
Venezuela (Boliv       1950-01-01       2019-01-01       1950-01-01       2019-01-01       1950-01-01       2019-01-01 
          Zambia       1955-01-01       2019-01-01       1955-01-01       2019-01-01       1955-01-01       2019-01-01 
        Zimbabwe       1954-01-01       2019-01-01       1954-01-01       2019-01-01       1954-01-01       2019-01-01 
</pre>
</div>
<p><br></p>
<p>Again, because we aren't missing any data, both <em>emp</em> and <em>rgdpo</em> cover the full timespan for each group as reported by <code>pdSize</code>.</p>
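<p>Conceptually, a per-variable timespan report finds, for each group, the first and last period with a non-missing value of each requested variable. Here is a minimal pandas sketch of that logic (the toy data and helper name are hypothetical; <code>pdTimeSpans</code> itself is the GAUSS procedure):</p>

```python
# Sketch of a per-group, per-variable timespan report. In the toy data,
# "emp" starts a year later than "rgdpo" for country B.
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year":    [2000, 2001, 2002, 2001, 2002],
    "emp":     [1.0, 1.1, 1.2, None, 2.1],
    "rgdpo":   [10.0, 11.0, 12.0, 20.0, 21.0],
})

def timespans(df, variables):
    """First and last period with a non-missing value, per group and variable."""
    rows = {}
    for g, grp in df.groupby("country"):
        row = {"Start year": grp["year"].min(), "End year": grp["year"].max()}
        for v in variables:
            valid = grp.loc[grp[v].notna(), "year"]
            row[v + " Start"] = valid.min()
            row[v + " End"] = valid.max()
        rows[g] = row
    return pd.DataFrame(rows).T

spans = timespans(df, ["emp", "rgdpo"])
print(spans)
```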
<hr>
<div style="text-align:center">Ready to elevate your research? <a href="https://www.aptech.com/request-demo/" target="_blank" rel="noopener">Try GAUSS 25 today.</a></div>
<hr>
<h2 id="panel-data-summary-statistics">Panel Data Summary Statistics</h2>
<p>When analyzing panel data, it's important to understand how variability is distributed across different dimensions of the data. Specifically:  </p>
<ul>
<li><b>Overall statistics</b>, which summarize variability across all observations in the dataset, providing a high-level view of the data. </li>
<li><b>Within-group statistics</b>, which measure variability within each individual group, reflecting how a variable changes over time for a specific group. </li>
<li><b>Between-group statistics</b>, which capture variability across groups, showing how groups differ from each other on average.</li>
</ul>
<p>Understanding these patterns ensures that we select the right modeling approach and properly account for both group-specific and overall trends in our analysis.</p>
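<p>As a concrete illustration of the arithmetic behind these three measures, here is a pandas sketch for a single variable (toy data, for illustration only; in GAUSS, <code>pdSummary</code> reports these for you):</p>

```python
# Overall / between / within variability for one panel variable x_it:
# overall uses all observations, between uses the group means, and within
# uses deviations from each group's own mean.
import pandas as pd

df = pd.DataFrame({
    "country": ["A"] * 3 + ["B"] * 3,
    "x": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
})

overall_sd = df["x"].std()                       # across all observations
group_means = df.groupby("country")["x"].mean()  # one mean per group
between_sd = group_means.std()                   # across group means
# Within: deviations from each group's own mean; adding the overall mean
# back is a common convention that keeps the within series on the original scale.
within = df["x"] - df["country"].map(group_means) + df["x"].mean()
within_sd = within.std()
```

Here most of the variability is between the two groups, so `between_sd` dwarfs `within_sd`, mirroring the pattern for <em>pop</em> in the table below.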
<p>We'll use the <code>pdSummary</code> procedure to compute these statistics. However, to simplify our examples and output moving forward, let's limit our panel to include only countries that use the Euro. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Filter to include only Euro-using countries
pwt_10 = selif(pwt_10, pwt_10[., "currency_unit"] .$== "Euro");

// Get summary statistics
pdSummary(pwt_10);</code></pre>
<pre>==========================================================================================
Group ID:                        country          Balanced:                             No
Valid cases:                        1125          Missings:                              0
N. Groups:                            19          T. Average:                       59.211
==========================================================================================
Variable               Measure           Mean      Std. Dev.        Minimum        Maximum
------------------------------------------------------------------------------------------
emp                    Overall          7.933         10.754          0.088         44.795
                       Between          -----         10.382          0.128         38.430
                        Within          -----          1.339          0.359         14.298
irr                    Overall          0.097          0.049          0.010          0.316
                       Between          -----          0.042          0.049          0.214
                        Within          -----          0.026          0.025          0.259
pop                    Overall         18.322         23.778          0.296         83.517
                       Between          -----         23.039          0.364         78.163
                        Within          -----          2.763          4.844         29.645
rgdpe                  Overall     457547.712     746910.097        568.248    4308861.500
                       Between          -----     594454.666       6365.400    2072470.938
                        Within          -----     431603.648   -1262654.069    2693938.274
rgdpo                  Overall     454655.015     750364.973         69.909    4275312.000
                       Between          -----     596209.636       5725.827    2097340.112
                        Within          -----     434813.497   -1283383.285    2632626.903
==========================================================================================
Non-numeric variables dropped from summary.</pre>
<p>One very clear observation from our summary table is that our GDP variables, <em>rgdpo</em> and <em>rgdpe</em>, are on a much different scale than our other variables. We'll look at how to transform these next.</p>
<h2 id="transforming-data-for-modeling">Transforming Data for Modeling</h2>
<p>Because panel data usually contains a time dimension, it is very common to need to take lags or differences of our data. While this is straightforward with <a href="https://www.aptech.com/blog/getting-started-with-time-series-in-gauss/" target="_blank" rel="noopener">time series data</a>, it is trickier with panel data, because lags and differences must be computed within each group rather than across group boundaries. </p>
<p>Fortunately, the <code>pdLag</code> and <code>pdDiff</code> procedures, introduced in GAUSS 25, will efficiently compute panel data lags and differences for you. </p>
<p>$$\text{rgdpo growth rate} = \ln rgdpo_{t} - \ln rgdpo_{t-1}$$</p>
<p>Let's use the <code>pdDiff</code> procedure to create a new real GDP growth variable.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Take natural log of rgdpo
ln_rgdpo = ln(pwt_10[., "rgdpo"]);

// Add to pwt_10 dataframe
// We need to do this so GAUSS
// can identify our panel
// using the 'country' and 'year' variables
pwt_10 = pwt_10 ~ asDF(ln_rgdpo, "ln_rgdpo");

// Take first difference of ln_rgdpo
// GAUSS will use 'country' and 'year' to 
// automatically detect panel
gr_rgdpo = pdDiff(pwt_10[., "country" "year" "ln_rgdpo"]);

// Summarize 'gr_rgdpo' 
// GAUSS will use 'country' and 'year' to 
// automatically detect panel
call pdSummary(gr_rgdpo);</code></pre>
<pre>==========================================================================================
Group ID:                        country          Balanced:                             No
Valid cases:                        1106          Missings:                             19
N. Groups:                            19          T. Average:                       58.211
==========================================================================================
Variable               Measure           Mean      Std. Dev.        Minimum        Maximum
------------------------------------------------------------------------------------------
ln_rgdpo               Overall          0.036          0.092         -1.741          1.476
                       Between          -----          0.013          0.008          0.062
                        Within          -----          0.091         -1.768          1.450
==========================================================================================</pre>
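<p>The importance of a panel-aware difference shows up in a small sketch: a naive first difference leaks the last observation of one country into the first row of the next, while a group-wise difference inserts a missing value at each group boundary instead (which is why the summary above reports 19 missings, one per country). Illustrated here with pandas toy data; <code>pdDiff</code> handles this for you in GAUSS:</p>

```python
# Naive vs. group-wise first differences on stacked (long form) panel data.
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "ln_rgdpo": [4.0, 4.1, 7.0, 7.2],
})

naive = df["ln_rgdpo"].diff()                       # wrong: crosses the A -> B boundary
grouped = df.groupby("country")["ln_rgdpo"].diff()  # correct: NaN at each group start
```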
<h2 id="data-visualization">Data Visualization</h2>
<p>As a final step, let's create a quick visualization of this new variable using <a href="https://docs.aptech.com/gauss/plotxy.html" target="_blank" rel="noopener">plotXY</a>
and the <code>by</code> keyword. We'll use a subset of countries to keep our plot from getting too crowded.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create subset of countries 
country_list = "Austria"$|"France"$|"Germany"$|"Spain"$|"Italy";

// Select data for plot
plot_data = selif(gr_rgdpo, sumr(gr_rgdpo[., "country"] .$== country_list'));

// Plot rgdpo growth variable by country
plotXY(plot_data, "ln_rgdpo~year + by(country)");</code></pre>
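<p>The <code>selif</code>/<code>sumr</code> pattern above is a membership filter: keep the rows whose <em>country</em> appears in a list. For reference, the equivalent idiom in pandas (toy data, for illustration only):</p>

```python
# Membership filter: keep rows whose country is in a given list.
import pandas as pd

df = pd.DataFrame({"country": ["Austria", "France", "Poland"],
                   "growth": [0.02, 0.01, 0.04]})

country_list = ["Austria", "France", "Germany", "Spain", "Italy"]
subset = df[df["country"].isin(country_list)]
```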
<p><a href="https://www.aptech.com/wp-content/uploads/2025/01/growth-plot.jpg"><img src="https://www.aptech.com/wp-content/uploads/2025/01/growth-plot.jpg" alt="Graph of the log of GDP growth for 5 countries in GAUSS." width="1200" height="740" class="aligncenter size-full wp-image-11584991" /></a></p>
<h3 id="conclusion">Conclusion</h3>
<p>Today we've seen, through a hands-on example, how the new panel data tools in GAUSS 25 can simplify your everyday panel data work. We've covered fundamental tasks, including:</p>
<ul>
<li>Loading your data.</li>
<li>Preparing your panel dataset. </li>
<li>Exploring panel data characteristics. </li>
<li>Visualizing panel data. </li>
<li>Transforming your data for modeling. </li>
</ul>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">Introduction to the Fundamentals of Panel Data</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">Panel Data Basics: One-way Individual Effects</a></li>
<li><a href="https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/" target="_blank" rel="noopener">Get Started with Panel Data in GAUSS (Video)</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
</ol>

]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/exploring-and-cleaning-panel-data-with-gauss-25/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Get Started with Panel Data in GAUSS (Video)</title>
		<link>https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/</link>
					<comments>https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Wed, 17 Apr 2024 16:00:50 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<category><![CDATA[Video]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11584500</guid>

					<description><![CDATA[In this video, you'll learn the basics of panel data analysis in GAUSS. We demonstrate panel data modeling start to finish, from loading data to running a group specific intercept model. ]]></description>
										<content:encoded><![CDATA[<iframe width="560" height="315" src="https://www.youtube.com/embed/b_TwmaVM5W4?si=4vHvm9y5T6H83nbl" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<h3 id="introduction">Introduction</h3>
<p>In this video, you'll learn the basics of panel data analysis in GAUSS. We demonstrate panel data modeling start to finish, from loading data to running a group specific intercept model. </p>
<div class="alert alert-info" role="alert">This video is available, along with all GAUSS videos, on our <a href="https://www.youtube.com/@gauss5485" target="_blank" rel="noopener">GAUSS YouTube Channel</a>. Be sure to explore all our GAUSS videos and subscribe to the channel to get the latest videos as they are released. </div>
<h2 id="summary-and-timeline">Summary and Timeline</h2>
<p>You'll see firsthand how to:</p>
<ul>
<li>Load and verify panel data.</li>
<li>Merge data from different sources.</li>
<li>Convert between wide and long form panel data.</li>
<li>Explore and clean data.</li>
<li>Create panel data plots.</li>
<li>Prepare panel data for estimation.</li>
<li>Estimate a model with group-specific intercepts.</li>
</ul>
<h3 id="timeline">Timeline</h3>
<p><a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=41s" target="_blank" rel="noopener">0:41</a> Set the current working directory.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=63s" target="_blank" rel="noopener">1:03</a> Load panel data from an Excel file.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=332s" target="_blank" rel="noopener">5:32</a> Merge data from different sources.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=413s" target="_blank" rel="noopener">6:53</a> Perform preliminary data cleaning.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=520s" target="_blank" rel="noopener">8:40</a> Create panel data plots.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=672s" target="_blank" rel="noopener">11:12</a> Test for stationarity.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=716s" target="_blank" rel="noopener">11:56</a> Convert long form to wide form panel data.<br />
<a href="https://www.youtube.com/watch?v=b_TwmaVM5W4&t=889s" target="_blank" rel="noopener">14:49</a> Estimate a model with group-specific intercepts.  </p>
<h3 id="additional-resources">Additional Resources</h3>
<ol>
<li><a href="https://www.aptech.com/blog/how-to-load-excel-data-into-gauss/" target="_blank" rel="noopener">How to Load Excel Data Into GAUSS</a>  </li>
<li><a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a>  </li>
<li><a href="https://www.aptech.com/blog/visualizing-covid-19-panel-data-with-gauss-22/" target="_blank" rel="noopener">Visualizing COVID-19 Panel Data With GAUSS 22</a>  </li>
<li><a href="https://www.aptech.com/blog/what-is-a-gauss-dataframe-and-why-should-you-care/" target="_blank" rel="noopener">What is a GAUSS Dataframe and Why Should You Care?</a>  </li>
<li><a href="https://www.aptech.com/blog/managing-string-data-with-gauss-dataframes/" target="_blank" rel="noopener">Managing String Data with GAUSS Dataframes</a>  </li>
<li><a href="https://docs.aptech.com/gauss/data-management.html" target="_blank" rel="noopener">The GAUSS Data Management Guide</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/" target="_blank" rel="noopener">Panel Data Stationarity Test With Structural Breaks</a></li>
</ol>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Transforming Panel Data to Long Form in GAUSS</title>
		<link>https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/</link>
					<comments>https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 12 Dec 2023 21:24:59 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<category><![CDATA[Programming]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11584134</guid>

					<description><![CDATA[Anyone who works with panel data knows that pivoting between long and wide form, though commonly necessary, can still be painstakingly tedious, at best. It can lead to frustrating errors, unexpected results, and lengthy troubleshooting, at worst.

<br>The new dfLonger and dfWider procedures introduced in GAUSS 24 make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.

<br>In today's blog, we will walk through all you need to know about the dfLonger procedure to tackle even the most complex cases of transforming wide form panel data to long form.]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>Anyone who works with <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a> knows that pivoting between long and wide form, though often necessary, is painstakingly tedious at best. At worst, it leads to frustrating errors, unexpected results, and lengthy troubleshooting.</p>
<p>The new <a href="https://docs.aptech.com/gauss/dflonger.html" target="_blank" rel="noopener">dfLonger</a> and <a href="https://docs.aptech.com/gauss/dfwider.html" target="_blank" rel="noopener">dfWider</a> procedures introduced in <a href="https://www.aptech.com/blog/introducing-gauss-24/" target="_blank" rel="noopener">GAUSS 24</a> make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.</p>
<p>In today's blog, we will walk through all you need to know about the <code>dfLonger</code> <a href="https://www.aptech.com/blog/basics-of-gauss-procedures/" target="_blank" rel="noopener">procedure</a> to tackle even the most complex cases of transforming wide form panel data to long form. </p>
<h2 id="the-rules-of-tidy-data">The Rules of Tidy Data</h2>
<p>Before we get started, it will be useful to consider what makes data tidy (and why tidy data is important). </p>
<p>It's useful to think of breaking our data into components (these components will come in handy later when working with <code>dfLonger</code>): </p>
<ul>
<li>Values.</li>
<li>Observations.</li>
<li>Variables.</li>
</ul>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/12/Blank-diagram-2.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/12/Blank-diagram-2.jpg" alt="Components of data." width="757" height="887" class="aligncenter size-full wp-image-11584144" /></a></p>
<p>We can use these components to define some basic rules for tidy data:</p>
<ol>
<li>Each variable has its own column.</li>
<li>Each observation has its own row.</li>
<li>Each value has its own cell.</li>
</ol>
<h3 id="example-one-wide-form-state-population-table">Example One: Wide Form State Population Table</h3>
<table>
 <thead>
<tr><th>State</th><th>2020</th><th>2021</th><th>2022</th></tr>
</thead>
<tbody>
<tr><td>Alabama</td><td>5,031,362</td><td>5,049,846</td><td>5,074,296</td></tr>
<tr><td>Alaska</td><td>732,923</td><td>734,182</td><td>733,583</td></tr>
<tr><td>Arizona</td><td>7,179,943</td><td>7,264,877</td><td>7,359,197</td></tr>
<tr><td>Arkansas</td><td>3,014,195</td><td>3,028,122</td><td>3,045,637</td></tr>
<tr><td>California</td><td>39,501,653</td><td>39,142,991</td><td>39,029,342</td></tr>
</tbody>
</table>
<p>Though not clearly labeled, we can deduce that this data presents values for three different variables: <em>State</em>, <em>Year</em>, and <em>Population</em>. </p>
<p>Looking more closely we see:</p>
<ul>
<li><em>State</em> is stored in a unique column. </li>
<li>The values of <em>Year</em> are stored as column names. </li>
<li>The values of <em>Population</em> are stored in separate columns for each year. </li>
</ul>
<p>Our variables do not each have a unique column, violating the rules of tidy data.</p>
<h3 id="example-two-long-form-state-population-table">Example Two: Long Form State Population Table</h3>
<table>
 <thead>
<tr><th>State</th><th>Year</th><th>Population </th></tr>
</thead>
<tbody>
<tr><td>Alabama</td><td>2020</td><td>5,031,362</td></tr>
<tr><td>Alabama</td><td>2021</td><td>5,049,846</td></tr>
<tr><td>Alabama</td><td>2022</td><td>5,074,296</td></tr>
<tr><td>Alaska</td><td>2020</td><td>732,923</td></tr>
<tr><td>Alaska</td><td>2021</td><td>734,182</td></tr>
<tr><td>Alaska</td><td>2022</td><td>733,583</td></tr>
<tr><td>Arizona</td><td>2020</td><td>7,179,943</td></tr>
<tr><td>Arizona</td><td>2021</td><td>7,264,877</td></tr>
<tr><td>Arizona</td><td>2022</td><td>7,359,197</td></tr>
</tbody>
</table>
<p>The transformed data above now has three columns, one for each variable <em>State</em>, <em>Year</em>, and <em>Population</em>. We can also confirm that each observation has a single row and each value has a single cell. </p>
<p>Transforming the data to long form has resulted in a tidy data table. </p>
<h3 id="why-do-we-care-about-tidy-data">Why Do We Care About Tidy Data?</h3>
<p>Working with tidy data offers a number of advantages:</p>
<ul>
<li>Tidy data storage offers consistency when trying to compare, explore, and analyze data whether it be panel data, <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/" target="_blank" rel="noopener">time series data</a> or cross-sectional data. </li>
<li>Using columns for variables is aligned with vectorization and <a href="https://www.aptech.com/blog/gauss-basics-3-introduction-to-matrices/" target="_blank" rel="noopener">matrix notation</a>, both of which are fundamental to efficient computations. </li>
<li>Many software tools expect tidy data and will only work reliably with tidy data. </li>
</ul>
<hr>
<div style="text-align:center">Ready to elevate your research? <a href="https://www.aptech.com/request-demo/" target="_blank" rel="noopener">Try GAUSS today.</a></div>
<hr>
<h2 id="transforming-from-wide-to-long-panel-data">Transforming From Wide to Long Panel Data</h2>
<p>In this section, we will look at how to use the GAUSS procedure <code>dfLonger</code> to transform panel data from wide to long form. This section will cover:</p>
<ul>
<li>The fundamentals of the <code>dfLonger</code> procedure.</li>
<li>A standard process for setting up panel data transformations.</li>
</ul>
<h3 id="the-dflonger-procedure">The <code>dfLonger</code> Procedure</h3>
<p>The <code>dfLonger</code> procedure transforms wide form GAUSS <a href="https://www.aptech.com/blog/what-is-a-gauss-dataframe-and-why-should-you-care/" target="_blank" rel="noopener">dataframes</a> to long form GAUSS dataframes. It has four required inputs and one <a href="https://www.aptech.com/blog/the-basics-of-optional-arguments-in-gauss-procedures/" target="_blank" rel="noopener">optional input</a>: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl]);</code></pre>
<hr>
<dl>
<dt>df_wide</dt>
<dd>A GAUSS dataframe in wide panel format.</dd>
<dt>columns</dt>
<dd>String array, the columns that should be used in the conversion.</dd>
<dt>names_to</dt>
<dd>String array, specifies the variable name(s) for the new column(s) created to store the wide variable names.</dd>
<dt>values_to</dt>
<dd>String, the name of the new column containing the values.</dd>
<dt>pctl</dt>
<dd>Optional, an instance of the <code>pivotControl</code> structure used for advanced pivoting options.
<hr></dd>
</dl>
<h3 id="setting-up-panel-data-transformations">Setting Up Panel Data Transformations</h3>
<p>Having a systematic process for transforming wide panel data to long panel data will:</p>
<ul>
<li>Save time.</li>
<li>Eliminate frustration.</li>
<li>Prevent errors. </li>
</ul>
<p>Let's use our wide form state population data to work through the steps.</p>
<h3 id="step-1-identify-variables">Step 1: Identify variables.</h3>
<p>In our wide form population table, there are three variables: <em>State</em>, <em>Year</em>, and <em>Population</em>. </p>
<div class="alert alert-info" role="alert">Variables are not always clearly labeled in wide form data. You will often need background information to identify them. Pay attention to references, titles, or other sources to ensure you clearly understand the variables. </div>
<h3 id="step-2-identify-columns-to-convert">Step 2: Identify columns to convert.</h3>
<p>The easiest way to determine which columns need to be converted is to identify the &quot;problem&quot; columns in your wide form data.  </p>
<p>For example, in our original state population table, the columns named <em>2020</em>, <em>2021</em>, and <em>2022</em> represent our <em>Year</em> variable. They store the values for the <em>Population</em> variable. </p>
<p><a href="https://www.aptech.com/wp-content/uploads/2023/12/Blank-diagram-Page-1-1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2023/12/Blank-diagram-Page-1-1.jpg" alt="" width="731" height="289" class="aligncenter size-full wp-image-11584149" /></a></p>
<p>These are the columns we will need to address in order to make our data tidy.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">columns = "2020"$|"2021"$|"2022";</code></pre>
<p>We only have three columns to transform, so it is easy to type out our column names in a string array. This won't always be the case, though. Fortunately, GAUSS has a number of convenience functions to help with creating your column lists.</p>
<p>My favorites include:</p>
<table>
 <thead>
<tr><th>Function</th><th>Description</th><th>Example</th></tr>
</thead>
<tbody>
<tr><td><a href="https://docs.aptech.com/gauss/getcolnames.html" target="_blank" rel="noopener">getColNames</a></td><td>Returns the column variable names.</td><td><code>
varnames = getColNames(df_wide)</code></td></tr>
<tr><td><a href="https://docs.aptech.com/gauss/startswith.html" target="_blank" rel="noopener">startsWith</a></td><td>Returns a 1 if a string starts with a specified pattern.</td><td><code>
mask = startsWith(colNames, pattern)</code></td></tr>
<tr><td><a href="https://docs.aptech.com/gauss/trimr.html" target="_blank" rel="noopener">trimr</a></td><td>Trims rows from the top and/or bottom of a matrix.</td><td><code>
names = trimr(full_list, top, bottom)</code></td></tr>
<tr><td><a href="https://docs.aptech.com/gauss/rowcontains.html" target="_blank" rel="noopener">rowcontains</a></td><td>Returns a 1 if the row contains the data specified by the <code>needle</code> variable, otherwise it returns a 0.</td><td><code>
mask = rowcontains(haystack, needle)</code></td></tr>
<tr><td><a href="https://docs.aptech.com/gauss/selif.html" target="_blank" rel="noopener">selif</a></td><td>Selects rows from a matrix, dataframe or string array, based upon a vector of 1&#8217;s and 0&#8217;s.</td><td><code>
names = selif(full_list, mask)</code></td></tr>
</tbody>
</table>
<p>For more complex cases, it is useful to approach creating column lists as a two-step process:</p>
<ol>
<li>Get all column names using <code>getColNames</code>.</li>
<li>Select a subset of the column names using a selection convenience function. </li>
</ol>
<p>As an example, suppose our state population dataset contains the state names in the first column and the remaining columns contain the populations for 1950-2022. It would be tedious and error-prone to write out the column list for all of those years. </p>
<p>Instead we could:</p>
<ol>
<li>Get a list of all the column names using <code>getColNames</code>.</li>
<li>Trim the first name off the list. </li>
</ol>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get all column names
colNames = getColNames(pop_wide);

// Trim the first name, `State`,
// from the top of the name list
colNames = trimr(colNames, 1, 0);</code></pre>
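<p>When the target columns share a common prefix rather than a fixed position, the <code>startsWith</code> and <code>selif</code> functions from the table above can be combined instead. A minimal sketch, assuming a hypothetical wide dataframe <em>df_wide</em> whose population columns all begin with <code>"pop_"</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get all column names
colNames = getColNames(df_wide);

// Mask equal to 1 for names starting with "pop_"
mask = startsWith(colNames, "pop_");

// Keep only the matching names
columns = selif(colNames, mask);</code></pre>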
<h3 id="step-3-name-the-new-columns-for-storing-names">Step 3: Name the new columns for storing names.</h3>
<p>The names of the columns being transformed from our wide form data will be stored in a variable specified by the input <em>names_to</em>. </p>
<p>In this case, we want to store the names from the wide data in one new variable called <code>"Year"</code>. In later examples, we will look at how to split names into multiple variables using prefixes, separators, or patterns.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">names_to = "Year";</code></pre>
<h3 id="step-4-name-the-new-columns-for-storing-values">Step 4: Name the new columns for storing values.</h3>
<p>The values stored in the columns being transformed will be stored in a variable specified by the input <em>values_to</em>.</p>
<p>For our population table, we will store the values in a variable named <code>"Population"</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">values_to = "Population";</code></pre>
<h2 id="basic-pivoting">Basic Pivoting</h2>
<p>Now it's time to put all these steps together into a working example. Let's continue with our state population example. </p>
<p>We'll start by loading the <a href="https://github.com/aptech/gauss_blog/blob/master/econometrics/pivoting-to-long-form-12-6-23/state_pop.gdat" target="_blank" rel="noopener">complete state population dataset</a> from the <em>state_pop.gdat</em> file:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load data 
pop_wide = loadd("state_pop.gdat");

// Preview data
head(pop_wide);</code></pre>
<pre>           State             2020             2021             2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342. </pre>
<p>Now, let's set up our information for transforming our data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Identify columns
columns = "2020"$|"2021"$|"2022";

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";</code></pre>
<p>Finally, we'll transform our data using <code>dfLonger</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Convert data using dfLonger
pop_long = dfLonger(pop_wide, columns, names_to, values_to);

// Preview data
head(pop_long);</code></pre>
<pre>           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00 </pre>
<h2 id="advanced-pivoting">Advanced Pivoting</h2>
<p>One of the most appealing things about <code>dfLonger</code> is that while simple to use, it offers tools for tackling the most complex cases. In this section, we'll cover everything you need to know for moving beyond basic pivoting.</p>
<h3 id="the-pivotcontrol-structure">The <code>pivotControl</code> Structure</h3>
<p>The <code>pivotControl</code> structure allows you to control pivoting specifications using
the following members:</p>
<table>
 <thead>
<tr><th>Member</th><th>Purpose</th></tr>
</thead>
<tbody>
<tr><td>names_prefix</td><td>A string input which specifies which characters, if any, should be stripped from the front of the wide variable names before they are assigned to a long column.</td></tr>
<tr><td>names_sep_split</td><td>A string input which specifies which characters, if any, mark where the <i>names_to</i> names should be broken up.</td></tr>
<tr><td>names_pattern_split</td><td>A string input containing a regular expression specifying group(s) in <i>names_to</i> names which should be broken up.</td></tr>
<tr><td>names_types</td><td>A string input specifying data types for the <i>names_to</i> variable.</td></tr>
<tr><td>values_drop_missing</td><td>Scalar, if set to 1, all rows with missing values will be removed.
</td></tr>
</tbody>
</table>
<div class="alert alert-info" role="alert">We will demonstrate how to use the <code>pivotControl</code> structure in later examples. However, if you are unfamiliar with structures, you may find it useful to review our tutorial, <a href="https://www.aptech.com/resources/tutorials/a-gentle-introduction-to-using-structures/" target="_blank" rel="noopener">&quot;A Gentle Introduction to Using Structures.&quot;</a></div>
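<p>As a quick illustration of one of these members, here is a minimal sketch of turning on <em>values_drop_missing</em>, under the assumption that rows with missing value cells should simply be discarded from the long form result:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Drop any long form rows whose value cell is missing
pctl.values_drop_missing = 1;</code></pre>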
<h3 id="changing-variable-types">Changing Variable Types</h3>
<p>By default the variables created from the pieces of the variable names will be <a href="https://www.aptech.com/blog/easy-management-of-categorical-variables/" target="_blank" rel="noopener">categorical variables</a>. </p>
<p>If we examine the variable type of <em>pop_long</em> from our previous example, </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check the type of the 'Year' variables
getColTypes(pop_long[., "Year"]);</code></pre>
<p>we can see that the <em>Year</em> variable is a categorical variable:</p>
<pre>            type
        category </pre>
<p>This isn't ideal and we'd prefer our <em>Year</em> variable to be a <a href="https://www.aptech.com/blog/dates-and-times-made-easy/" target="_blank" rel="noopener">date</a>.
We can control the assigned type using the <em>names_types</em> member of the <code>pivotControl</code> structure. The <em>names_types</em> member can be specified in one of two ways:</p>
<ol>
<li>As a column vector of types for each of the <em>names_to</em> variables.</li>
<li>An <em>n x 2</em> string array where the first column is the name of the variable(s) and the second column contains the type(s) to be assigned. </li>
</ol>
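<p>The first option can be sketched as follows. This is a hypothetical two-variable case, assuming <em>names_to</em> contains <code>"Location"</code> and <code>"Year"</code> and that <code>"category"</code> and <code>"date"</code> are the desired types, listed in the same order as the <em>names_to</em> variables:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// One type per names_to variable, in order:
// 'Location' -> category, 'Year' -> date
pctl.names_types = "category"$|"date";</code></pre>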
<p>For our example, we wish to specify that the <em>Year</em> variable should be a date but we don't need to change any of the other assigned types, so we will use the second option:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify that 'Year' should be
// converted to a date variable
pctl.names_types = {"Year" "date"};</code></pre>
<p>Next, we complete the steps for pivoting:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";</code></pre>
<p>Finally, we call <code>dfLonger</code> including the <code>pivotControl</code> structure, <em>pctl</em>, as the final input:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);</code></pre>
<pre>           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00</pre>
<p>Now if we check the type of our <em>Year</em> variable:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Check the type of 'Year'
getColTypes(pop_long[., "Year"]);</code></pre>
<p>It is a date variable:</p>
<pre>  type
  date</pre>
<h3 id="stripping-prefixes">Stripping Prefixes</h3>
<p>In our previous example, the wide data names only contained the year. However, the column names of a wide dataset often have common prefixes. The <em>names_prefix</em> member of the <code>pivotControl</code> structure offers a convenient way to strip unwanted prefixes. </p>
<p>Suppose that our wide form state population columns were labeled <code>"yr_2020"</code>, <code>"yr_2021"</code>, <code>"yr_2022"</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load data
pop_wide2 = loadd("state_pop2.gdat");

// Preview data
head(pop_wide2);</code></pre>
<pre>           State          yr_2020          yr_2021          yr_2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342.</pre>
<p>We need to strip these prefixes when transforming our data to long form. </p>
<p>To accomplish this we first need to specify that our name columns have the common prefix <code>"yr"</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify prefix
pctl.names_prefix = "yr_";</code></pre>
<p>Next, we complete the steps for pivoting:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide2);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";</code></pre>
<p>Finally, we call <code>dfLonger</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide2, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);</code></pre>
<pre>           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00</pre>
<h3 id="splitting-names">Splitting Names</h3>
<p>In our basic example, the only information contained in the wide column names was the year, and we created one variable, <code>"Year"</code>, to store it. However, we may have cases where our wide form column names contain more than one piece of information. </p>
<p>In these cases, there are two important steps to take:</p>
<ol>
<li>Name the variables that will store the information contained in the wide data column names using the <em>names_to</em> input.</li>
<li>Indicate to GAUSS how to split the wide data column names into the <em>names_to</em> variables. </li>
</ol>
<h4 id="names-include-a-separator">Names Include a Separator</h4>
<p>One way that names in wide data can contain multiple pieces of information is through the use of separators. </p>
<p>For example, suppose our data looks like this:</p>
<pre>           State       urban_2020       urban_2021       urban_2022       rural_2020       rural_2021       rural_2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. </pre>
<p>Now our names specify:</p>
<ul>
<li>Whether the population is the urban or rural population. </li>
<li>The year of the observation.</li>
</ul>
<p>In this case, we:</p>
<ul>
<li>Use the <em>names_sep_split</em> member of the <code>pivotControl</code> structure to indicate how to split the names. </li>
<li>Specify a <em>names_to</em> variable for each group created by the separator.</li>
</ul>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load data
pop_wide3 = loadd("state_pop3.gdat");

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide3);
columns = trimr(columns, 1, 0);

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names
pctl.names_sep_split = "_";

// Specify two variables for holding
// names information:
//    'Location' for the information before the separator
//    'Year' for the information after the separator
names_to = "Location"$|"Year";

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);</code></pre>
<pre>           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000</pre>
<p>Now, the <em>pop_long</em> dataframe contains:</p>
<ul>
<li>The information in the wide form names found before the separator, <code>"_"</code>, (urban or rural) in the <em>Location</em> variable. </li>
<li>The information in the wide form names found after the separator, <code>"_"</code>, in the <em>Year</em> variable. </li>
</ul>
<h4 id="variable-names-with-regular-expressions">Variable Names With Regular Expressions</h4>
<p>In our example above, the variables contained in the names were clearly separated by a <code>"_"</code>. However, this isn't always the case. Sometimes names use a pattern rather than a separator:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load data
pop_wide4 = loadd("state_pop4.gdat");

// Preview data
head(pop_wide4);</code></pre>
<pre>           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. </pre>
<p>In cases like this, we can use the <em>names_pattern_split</em> member to pass GAUSS a regular expression that splits the column names. We can't cover the full details of regular expressions here, but a few fundamentals will help us get started with this example. </p>
<p>In regEx:</p>
<ol>
<li>Each pattern inside a pair of parentheses is a group. </li>
<li>To match any upper or lower case letter we use <code>"[a-zA-Z]"</code>. More specifically, this tells GAUSS that we want to match any lowercase letter ranging from a-z and any upper case letter ranging from A-Z. If we wanted to limit this to any lowercase letters from t to z and any uppercase letter B to M we would say <code>"[t-zB-M]"</code>.</li>
<li>To match any integer we use <code>"[0-9]"</code>.</li>
<li>To represent that we want to match <u>one or more</u> instances of a pattern we use <code>"+"</code>.</li>
<li>To represent that we want to match <u>zero or more</u> instances of a pattern we use <code>"*"</code>.</li>
</ol>
<p>In this case, we want to separate our names so that &quot;urban&quot; and &quot;rural&quot; are collected in <em>Location</em> and <em>2020</em>, <em>2021</em>, and <em>2022</em> are collected in the <em>Year</em> variable:</p>
<ol>
<li>We have two groups.</li>
<li>We can capture both <code>urban</code> and <code>rural</code> using <code>"[a-zA-Z]+"</code>.</li>
<li>We can capture the years by matching one or more digits using <code>"[0-9]+"</code>.</li>
</ol>
<p>Let's use regEx to specify our <em>names_pattern_split</em> member:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names 
// using the pivotControl structure
pctl.names_pattern_split = "([a-zA-Z]+)([0-9]+)"; </code></pre>
<p>Next, we can put this together with our other steps to transform our wide data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Variable for storing names
names_to = "Location"$|"Year";

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);
head(pop_long);</code></pre>
<pre>           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000</pre>
<h3 id="multiple-value-variables">Multiple Value Variables</h3>
<p>In all our previous examples, the values needed to be stored in a single variable. In practice, however, a dataset often contains multiple groups of values, and we need to specify multiple variables to store them. </p>
<p>Let's consider our previous example which used the <em>pop_wide4</em> dataset:</p>
<pre>           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858. </pre>
<p>Suppose that rather than creating a <em>Location</em> variable, we wish to separate the population information into two variables, <em>urban</em> and <em>rural</em>. To do this we will:</p>
<ol>
<li>Split the variable names by words (<code>"urban"</code> or <code>"rural"</code>) and integers.</li>
<li>Create a <em>Year</em> column from the integer portions of the names.</li>
<li>Create two values columns, <em>urban</em> and <em>rural</em>, from the word portions. </li>
</ol>
<p>First, we will specify our columns:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);</code></pre>
<div class="alert alert-info" role="alert">Since we are using the same data as our previous example, we don't need to load any additional data.</div>
<p>Next, we need to specify our <em>names_to</em> and <em>values_to</em> inputs. However, this time we want our <em>values_to</em> variables to be determined by the information in our names. </p>
<p>We do this using <code>".value"</code>.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Tell GAUSS to use the first group of the split names 
// to set the values variables and 
// store the remaining group in 'Year'
names_to = ".value" $| "Year";

// Tell GAUSS to get 'values_to' variables from 'names_to'
values_to = "";</code></pre>
<p>Setting <code>".value"</code> as the first element in our <em>names_to</em> input tells <code>dfLonger</code> to take the first piece of each wide data name and create a value column containing all the values from the matching columns.</p>
<p>In other words, combine all the values from the variables <em>urban2020</em>, <em>urban2021</em>, <em>urban2022</em> into a single variable named <em>urban</em> and do the same for the <em>rural</em> columns.</p>
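<p>For readers who also use pandas, this <code>".value"</code> behavior is analogous to <code>pd.wide_to_long</code>: the stub names become value columns and the numeric suffix becomes the new id column. A minimal Python sketch with made-up numbers (illustration only, not GAUSS code):</p>

```python
import pandas as pd

# Toy data mimicking the shape of pop_wide4
# (values invented for illustration)
wide = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "urban2020": [6558153, 21944],
    "rural2020": [1526791, 710978],
})

# stubnames play the role of the '.value' group; the
# integer suffix becomes the new 'Year' column
long = pd.wide_to_long(wide, stubnames=["urban", "rural"],
                       i="State", j="Year").reset_index()
print(long)
```

As with <code>dfLonger</code>, the result has one row per state-year with separate <em>urban</em> and <em>rural</em> columns.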
<p>Finally, we need to tell GAUSS how to split the variable names. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

// Set the regex to split the variable names
pctl.names_pattern_split = "(urban|rural)([0-9]+)";</code></pre>
<p>This time, we specify the variable names, <code>"(urban|rural)"</code>, rather than using the general specifier <code>"([a-zA-Z]+)"</code>.</p>
<p>Now we call <code>dfLonger</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Convert the dataframe to long format according to our specifications
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);

// Print the first 5 rows of the long form dataframe
head(pop_long);</code></pre>
<pre>           State             Year            urban            rural
         Alabama             2020        6558153.0        1526791.0
         Alabama             2021        4972982.0        76863.000
         Alabama             2022        12375977.        7301681.0
          Alaska             2020        21944.000        710978.00
          Alaska             2021        467051.00        267130.00</pre>
<p>Now the urban and rural populations are stored in their own columns, named <em>urban</em> and <em>rural</em>. </p>
<div class="alert alert-info" role="alert">These names can easily be changed using the <b>Data Manager</b> or <a href="https://docs.aptech.com/gauss/setcolnames.html" target="_blank" rel="noopener">setColNames</a>.</div>
<h2 id="conclusion">Conclusion</h2>
<p>As we've seen today, pivoting panel data from wide to long can be complicated. However, a systematic approach combined with the GAUSS <code>dfLonger</code> procedure helps reduce the frustration, time, and errors involved.</p>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/" target="_blank" rel="noopener">Panel data, structural breaks and unit root testing</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">Panel Data Basics: One-way Individual Effects</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">Introduction to the Fundamentals of Panel Data</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/" target="_blank" rel="noopener">Panel Data Stationarity Test With Structural Breaks</a></li>
<li><a href="https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/" target="_blank" rel="noopener">Getting Started With Panel Data in GAUSS </a></li>
</ol>
<div style="text-align:center;background-color:#455560;color:#FFFFFF">
<hr>
<h3 id="discover-how-gauss-24-can-help-you-reach-your-goals">Discover how GAUSS 24 can help you reach your goals.</h3>
 
<div class="lp-cta">
    <a href="https://www.aptech.com/request-demo" class="btn btn-primary">Request Demo</a>
    <a href="https://www.aptech.com/request-quote/" class="btn btn-primary btn-quote">Request pricing</a>
</div><hr>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Visualizing COVID-19 Panel Data With GAUSS 22</title>
		<link>https://www.aptech.com/blog/visualizing-covid-19-panel-data-with-gauss-22/</link>
					<comments>https://www.aptech.com/blog/visualizing-covid-19-panel-data-with-gauss-22/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Tue, 14 Dec 2021 18:57:02 +0000</pubDate>
				<category><![CDATA[Graphics]]></category>
		<category><![CDATA[Panel data]]></category>
		<category><![CDATA[#gauss22]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=11582197</guid>

					<description><![CDATA[When they're done right, graphs are a useful tool for telling compelling data stories and supporting data models. However, too often graphs lack the right components to truly enhance understanding. 

In this blog, we look at how a few quick customizations help make graphs more impactful. In particular, we will consider:
<ul>
<li>Using grid lines without cluttering a graph. </li>
<li>Changing tick labels for readability. </li>
<li>Using clear axis labels. </li>
<li>Marking events and outcomes with lines, bars, and annotations. </li>
</ul>]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>When they're done right, graphs are a useful tool for telling compelling data stories and supporting data models. However, too often graphs lack the right components to truly enhance understanding. </p>
<p>In this blog, we look at how a few quick customizations help make graphs more impactful. In particular, we will consider:</p>
<ul>
<li>Using grid lines without cluttering a graph. </li>
<li>Changing tick labels for readability. </li>
<li>Using clear axis labels. </li>
<li>Marking events and outcomes with lines, bars, and annotations. </li>
</ul>
<h2 id="data">Data</h2>
<p>As an example, we will use New York Times COVID tracking data <a href="https://github.com/nytimes/covid-19-data/tree/master/rolling-averages">(available on GitHub)</a>. This data is part of the <a href="https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html">New York Times U.S. tracking project</a>. </p>
<p>From this data, we will be using the rolling 7-day average of COVID cases per 100k provided by date for five states: Arizona, California, Florida, Texas, and Washington. </p>
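<p>The rolling 7-day average simply averages each day's count with the six days before it, smoothing out day-of-week reporting effects. A minimal Python sketch with invented daily counts (the NYT dataset ships this column precomputed, so this is illustration only):</p>

```python
import pandas as pd

# Invented daily case counts for illustration
daily_cases = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Each value is the mean of the current day
# and the six days before it
rolling_avg = daily_cases.rolling(window=7).mean()
print(rolling_avg)
```

The first six entries are missing because a full 7-day window is not yet available.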
<h2 id="creating-a-basic-graph">Creating a Basic Graph</h2>
<p>Let's start by creating a basic panel data plot using:</p>
<ul>
<li>The <a href="https://docs.aptech.com/gauss/plotxy.html"><code>plotXY</code></a> procedure with dates. </li>
<li>A <a href="https://www.aptech.com/resources/tutorials/formula-string-syntax/">formula string</a> and the <code>by</code> keyword. </li>
</ul>
<p>First we will load our data: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load original data
fname = "us_state_covid_cases.csv";
covid_cases = loadd(fname, 
                    "date($date) + cat(state) + cases + cases_avg_per_100k");

// Filter desired states
covid_cases = selif(covid_cases, 
                    rowcontains(covid_cases[., "state"], 
                                "Florida"$|"California"$|
                                "Arizona"$|"Washington"$|
                                "Texas"));</code></pre>
<p>Note that in this step we've:</p>
<ol>
<li>Specified the variables we want to load and their variable types.</li>
<li>Filtered our data to include only our states of interest. </li>
</ol>
<div class="alert alert-info" role="alert">For more information about loading data and other data management tips see our previous blog, <a href="https://www.aptech.com/blog/getting-to-know-your-data-with-gauss-22/">Getting to Know Your Data with GAUSS 22</a>.</div>
<p>Now, we can make a preliminary plot of the rolling 7 day average number of COVID-19 cases per 100,000 people:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Plot COVID cases per 100K by state
plotXY(covid_cases, "cases_avg_per_100k ~ date + by(state)");</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-basic.jpg"><img src="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-basic.jpg" alt="" width="753" height="566" class="alignnone size-full wp-image-11582241" /></a></p>
<div class="alert alert-info" role="alert">The <code>by</code> keyword tells GAUSS to split the data on a particular variable. It was introduced in GAUSS 22, along with the ability to use <code>plotXY</code> with date variables.</div>
<h3 id="customizing-our-graph">Customizing Our Graph</h3>
<p>Our quick graph was a good starting point. However, a few customizations will help present a clearer picture:</p>
<ul>
<li>Adding y-axis grid lines will help us read COVID cases values more easily.</li>
<li>Reformatting our x-axis tick labels to include months rather than quarters will make the dates more recognizable. </li>
<li>Changing the axis labels will make it clear what is being plotted. </li>
</ul>
<h3 id="declaring-a-plotcontrol-structure">Declaring a <code>plotControl</code> Structure</h3>
<p>The first step for customizing graphs is to declare a <code>plotControl</code> structure and to fill it with the appropriate defaults:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Declare plot control structure
struct plotControl myPlot;

// Fill with defaults for "xy" graph
myPlot = plotGetDefaults("xy");</code></pre>
<h3 id="customizing-plot-attributes">Customizing Plot Attributes</h3>
<p>After declaring the <code>plotControl</code> structure, we can use <code>plotSet</code> procedures to change the desired attributes of our graph. </p>
<h4 id="adding-y-axis-grid-lines">Adding Y-Axis Grid Lines</h4>
<p>First, to help make levels of COVID cases more clear, let's add y-axis grid lines to our plot using <a href="https://docs.aptech.com/gauss/plotsetygridpen.html"><code>plotSetYGridPen</code></a>.</p>
<p>The <code>plotSetYGridPen</code> procedure can be used to:</p>
<ul>
<li>Turn on y-axis major and/or minor grids.</li>
<li>Set the <em>width</em>, <em>color</em>, and <em>style</em> of the grid lines.</li>
</ul>
<table style="width: 100%">
  <colgroup>
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 80%;">
    </colgroup>
<tr>
<th><b>Input</b></th><th>Description</th>
</tr>
<tr>
<td>which_grid</td><td>Specifies which grid line to modify. The options include: <code>"major"</code>, <code>"minor"</code>, or <code>"both"</code>.</td>
</tr>
<tr>
<td>width</td><td>Specifies the thickness of the line(s) in pixels. The default value is 1.</td>
</tr>
<tr>
<td>color</td><td>Optional argument, specifying the name or RGB value of the new color(s) for the line(s).</td>
</tr>
<tr>
<td>style</td><td>Optional argument, the style(s) of the pen for the line(s). <br>Options include: <table><tr><td>1</td><td>Solid line</td></tr><tr><td>2</td><td>Dash line</td></tr><tr><td>3</td><td>Dot line</td></tr><tr><td>4</td><td>Dash-Dot line</td></tr><tr><td>5</td><td>Dash-Dot-Dot line</td></tr></table></td>
</tr>
</table>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Turn on y-axis grid for the major ticks. Set the
// grid lines to be solid, 1 pixel and light grey
plotSetYGridPen(&amp;myPlot, "major", 1, "Light Grey", 1);</code></pre>
<div class="alert alert-info" role="alert">When using any <code>plotSet</code> procedure, the first input is a pointer to a declared <code>plotControl</code> structure. We indicate that something is a pointer using the <code>&amp;</code> symbol.</div>
<p>Because GAUSS allows us to add and format y-axis and x-axis grid lines separately, we are able to improve readability with y-axis lines without adding the clutter of a full grid. </p>
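<p>The same design choice is available in other plotting tools; here is a hedged matplotlib analogue (illustration only, not GAUSS code) showing a y-axis-only grid:</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [10, 40, 20])

# Enable only the y-axis major grid: horizontal guides
# help read values without the clutter of a full grid
ax.yaxis.grid(True, color="lightgrey", linewidth=1)
fig.savefig("y_grid_only.png")
```

Keeping the x-axis grid off preserves a clean background while still making levels easy to read.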
<h4 id="customizing-x-axis-ticks">Customizing X-Axis Ticks</h4>
<p>Next, let's turn our attention to the x-axis ticks. We will use three GAUSS procedures to help us customize our ticks:</p>
<table style="width: 100%">
  <colgroup>
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 80%;">
    </colgroup>
<tr>
<th><b>Procedure</b></th><th>Description</th>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotsetxticlabel.html"><code>plotSetXTicLabel</code></a></td><td>Controls the formatting and angle of x-axis tick labels for 2-D graphs.</td>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotsetxticinterval.html"><code>plotSetXTicInterval</code></a></td><td>Controls the interval between x-axis tick labels and also allows the user to specify the first tick to be labeled for 2-D graphs.</td>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotsetticlabelfont.html"><code>plotSetTicLabelFont</code></a></td><td>Controls the font name, size and color for the X and Y axis tick labels.</td>
</tr>
</table>
<p>First, let's change the format of the labels on the x-axis to indicate months rather than quarters:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Display 4 digit year and month on 'X' tick labels
plotSetXTicLabel(&amp;myPlot, "YYYY-MO");</code></pre>
<div class="alert alert-info" role="alert">A full list of supported x-axis tick label formats for time series data is available in the <strong>Remarks</strong> section of the <a href="https://docs.aptech.com/gauss/plotsetxticlabel.html">documentation for <code>plotSetXTicLabel</code></a>.</div>
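<p>For comparison, the <code>"YYYY-MO"</code> specifier produces the same labels as <code>strftime</code>'s <code>"%Y-%m"</code> in Python. This small sketch (illustration only, not GAUSS code) shows what the tick labels look like:</p>

```python
from datetime import date

# Hypothetical tick positions every 3 months from March 2020
ticks = [date(2020, 3, 1), date(2020, 6, 1), date(2020, 9, 1)]

# Format as 4-digit year and 2-digit month
labels = [d.strftime("%Y-%m") for d in ticks]
print(labels)  # ['2020-03', '2020-06', '2020-09']
```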
<p>Second, let's set the x-axis ticks to:</p>
<ul>
<li>Start in March of 2020 to correspond with the start of the pandemic.</li>
<li>Occur every 3 months. </li>
</ul>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Place first 'X' tick mark on March 1st, 2020
// with ticks occurring every 3 months
plotSetXTicInterval(&amp;myPlot, 3, "months", asDate("2020-03"));</code></pre>
<p>Third, let's increase the size of the axis tick labels:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Change tic label font size
plotSetTicLabelFont(&amp;myPlot, "Arial", 12); </code></pre>
<h4 id="updating-axis-labels">Updating Axis Labels</h4>
<p>Finally, we change the axis labels:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify the text for the Y-axis label as well as
// the font and font size for both labels
plotSetYLabel(&amp;myPlot, "Cases per 100k", "Arial", 14);

// Specify text for the x-axis label
plotSetXLabel(&amp;myPlot, "Date");</code></pre>
<div class="alert alert-info" role="alert">The <code>plotSetYLabel</code> and <code>plotSetXLabel</code> procedures set the font, font size, and font color for both axis labels, so there is no need to specify them again in the second call.</div>
<p>Now we can create our formatted graph:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Plot COVID cases per 100K by state. Pass in the 'plotControl'
// structure, 'myPlot', to use the settings we applied above.
plotXY(myPlot, covid_cases, "cases_avg_per_100k ~ date + by(state)");</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-first-round.jpg"><img src="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-first-round.jpg" alt="" width="753" height="566" class="alignnone size-full wp-image-11582243" /></a></p>
<h2 id="highlighting-events">Highlighting Events</h2>
<p>When working with time series plots, we often want to mark specific dates or periods on the graph. GAUSS includes four procedures, introduced in GAUSS 22, that make highlighting events easy.</p>
<table style="width: 100%">
  <colgroup>
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 40%;">
       <col span="1" style="width: 40%;">
    </colgroup>
<tr>
<th><b>Procedure</b></th><th>Description</th><th>Example</th>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotaddvline.html"><code>plotAddVLine</code></a></td><td>Adds one or more vertical lines to an existing plot.</td><td><code>plotAddVLine("2020-01-01");</code></td>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotaddvbar.html"><code>plotAddVBar</code></a></td><td>Adds one or more vertical bars spanning the full extent of the y-axis to an existing graph.</td><td><code>plotAddVBar("2020-01", "2020-03");</code></td>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotaddhline.html"><code>plotAddHLine</code></a></td><td>Adds one or more horizontal lines to an existing plot.</td><td><code>plotAddHLine(500);</code></td>
</tr>
<tr>
<td><a href="https://docs.aptech.com/gauss/plotaddhbar.html"><code>plotAddHBar</code></a></td><td>Adds one or more horizontal bars spanning the full extent of the x-axis to an existing graph.</td><td><code>plotAddHBar(580, 740);</code></td>
</tr>
</table>
<p>As an example, let's add vertical lines to help compare July 4th, 2020 to July 4th, 2021. </p>
<h3 id="specifying-legend-behavior-when-adding-lines">Specifying Legend Behavior When Adding Lines</h3>
<p>First, when adding new data to an existing plot, we need to specify how we want this data treated on the legend using the <a href="https://docs.aptech.com/gauss/plotsetlegend.html"><code>plotSetLegend</code></a> procedure. </p>
<p>We can add a label for the line to the legend:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Label next added line "Independence Day"
// and add to the legend
plotSetLegend(&amp;myPlot, "Independence Day");</code></pre>
<p>or we can tell GAUSS to not make any changes to the current legend:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// The empty string specifies that the legend 
// should remain unchanged when the next line is added.
plotSetLegend(&amp;myPlot, "");</code></pre>
<h3 id="specifying-line-style">Specifying Line Style</h3>
<p>Next, we will specify the style of our lines using the <a href="https://docs.aptech.com/gauss/plotsetlinepen.html"><code>plotSetLinePen</code></a> procedure. This procedure lets us set the <em>width</em>, <em>color</em>, and <em>style</em> of the lines added to the graph. </p>
<table style="width: 100%">
  <colgroup>
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 80%;">
    </colgroup>
<tr>
<th><b>Attribute</b></th><th>Description</th>
</tr>
<tr>
<td>width</td><td>Specifies the thickness of the line(s) in pixels. The default value is 2.</td>
</tr>
<tr>
<td>color</td><td>Optional argument, specifying the name or RGB value of the new color(s) for the line(s).</td>
</tr>
<tr>
<td>style</td><td>Optional argument, the style(s) of the pen for the line(s). <br>Options include: <table><tr><td>1</td><td>Solid line</td></tr><tr><td>2</td><td>Dash line</td></tr><tr><td>3</td><td>Dot line</td></tr><tr><td>4</td><td>Dash-Dot line</td></tr><tr><td>5</td><td>Dash-Dot-Dot line</td></tr></table></td>
</tr>
</table>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Set the line width to be 2 pxs
// the line color to be #555555
// and the line to be dashed
plotSetLinePen(&amp;myPlot, 2, "#555555", 2);</code></pre>
<h3 id="adding-lines-to-mark-events">Adding Lines to Mark Events</h3>
<p>Finally, let's add the lines marking Independence Day in 2020 and 2021. </p>
<p>We first specify the dates we want to add lines using <a href="https://docs.aptech.com/gauss/asdate.html"><code>asDate</code></a>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create string array of independence days
ind_days = asDate("2020-07-04"$|"2021-07-04");</code></pre>
<p>Then we add our holidays to the existing graph using <code>plotAddVLine</code>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Add holidays to graph
plotAddVLine(myPlot, ind_days);</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-event-lines-revised.jpg"><img src="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-event-lines-revised.jpg" alt="" width="753" height="566" class="alignnone size-full wp-image-11582294" /></a></p>
<p>The complete code for adding the lines looks like this:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Do not add vertical lines to the legend
plotSetLegend(&amp;myPlot, "");

// Set the line width to be 2 pixels
// the line color to be a dark grey color, #555555,
// and the line to be dashed
plotSetLinePen(&amp;myPlot, 2, "#555555", 2);

// Create string array of independence days
ind_days = asDate("2020-07-04"$|"2021-07-04");

// Add holidays to graph
plotAddVLine(myPlot, ind_days);</code></pre>
<h3 id="adding-bars-to-mark-events">Adding Bars to Mark Events</h3>
<p>Now, let's add a vertical bar to mark the winter holidays time period of 2020. We will add a bar that marks the time span from Thanksgiving 2020 to New Year's Day 2021. </p>
<p>We first need to create a new <code>plotControl</code> structure to format our bars. Since we are adding a bar to the graph, we will fill our new <code>plotControl</code> structure with the defaults for a bar graph:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Create plotControl structure
struct plotControl plt;

// Fill with default bar settings
plt = plotGetDefaults("bar");</code></pre>
<p>Next, we can format our bar using the <a href="https://docs.aptech.com/gauss/plotsetfill.html"><code>plotSetFill</code></a> procedure. The <code>plotSetFill</code> procedure allows us to control the <em>fill style</em>, <em>opacity</em>, and <em>color</em> of graphed bars:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Set bar to have solid fill with 20% opacity
// and grey color
plotSetFill(&amp;plt, 1, 0.20, "grey");</code></pre>
<p>We also have to specify the legend behavior when the bar is added. This time let's add a label to the legend for the &quot;Winter Holidays&quot;:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Add "Winter Holidays" to the legend
plotSetLegend(&amp;plt, "Winter&lt;br&gt;Holidays");</code></pre>
<div class="alert alert-info" role="alert">The code <code>&lt;br&gt;</code> is HTML and it tells GAUSS to line break between the words <code>"Winter"</code> and <code>"Holidays"</code>. </div>
<p>Now we are ready to add the bar to our graph using the <a href="https://docs.aptech.com/gauss/plotaddvbar.html"><code>plotAddVBar</code></a> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Add a vertical bar to graph starting 
// on November 26th, 2020 and 
// ending January 1st, 2021
plotAddVBar(plt, asDate("2020-11-26"), asDate("2021-01"));</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-add-bar.jpg"><img src="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-add-bar.jpg" alt="" width="753" height="566" class="alignnone size-full wp-image-11582296" /></a></p>
<h2 id="adding-notes-to-graphs">Adding Notes to Graphs</h2>
<p>As a final customization, let's add a note to our graph to label one of our holidays. We can do this using the <a href="https://docs.aptech.com/gauss/plotaddtextbox.html"><code>plotAddTextBox</code></a> procedure. </p>
<p>The <code>plotAddTextBox</code> procedure takes three required inputs:</p>
<ul>
<li>The text to be added to the graph. </li>
<li>The x location where the text should start.</li>
<li>The y location where the text should start. </li>
</ul>
<div class="alert alert-info" role="alert">An optional <code>plotAnnotation</code> structure can be used to format the textbox and its text content. </div>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Label the 2020 Independence Day line
plotAddTextBox("&amp;larr; Independence Day", asDate("2020-07-04"), 80);</code></pre>
<p><a href="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-final-rev2.jpg"><img src="https://www.aptech.com/wp-content/uploads/2021/12/covid-cases-final-rev2.jpg" alt="" width="753" height="566" class="alignnone size-full wp-image-11582298" /></a></p>
<div class="alert alert-info" role="alert">The code <code>&amp;larr;</code> is HTML and it tells GAUSS to add a left arrow to the graph. </div>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog, we see how a few customizations and enhancements can make plots easier to read and more impactful. </p>
<p>In particular, we covered:</p>
<ul>
<li>Using grid lines without cluttering a graph. </li>
<li>Changing tick labels for readability. </li>
<li>Using clear axis labels. </li>
<li>Marking events and outcomes with lines, bars, and annotations. </li>
</ul>
<h3 id="further-reading">Further Reading</h3>
<ul>
<li><a href="https://www.aptech.com/blog/how-to-create-tiled-graphs-in-gauss/">How to Create Tiled Graphs in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/how-to-interactively-create-reusable-graphics-profiles/">How to Interactively Create Reusable Graphics Profiles</a></li>
<li><a href="https://www.aptech.com/blog/five-hacks-for-creating-custom-gauss-graphics/">Five Hacks For Creating Custom GAUSS Graphics</a></li>
<li><a href="https://www.aptech.com/blog/how-to-mix-match-and-style-different-graph-types/">How to Mix, Match, and Style Different Graph Types</a></li>
</ul>
<h2 id="references">References</h2>
<p>The New York Times. (2021). Coronavirus (Covid-19) Data in the United States. Retrieved 12-05-2021, from <a href="https://github.com/nytimes/covid-19-data">https://github.com/nytimes/covid-19-data</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/visualizing-covid-19-panel-data-with-gauss-22/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>Panel Data Stationarity Test With Structural Breaks</title>
		<link>https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/</link>
					<comments>https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Fri, 02 Oct 2020 05:24:31 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<category><![CDATA[structural breaks]]></category>
		<category><![CDATA[unit root]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=21878</guid>

					<description><![CDATA[Reliable unit root testing is an important step of any time series analysis or panel data analysis. 

However, standard time series unit root tests and panel data unit root tests aren’t reliable when structural breaks are present. Because of this, when structural breaks are suspected, we must employ unit root tests that properly incorporate these breaks. 

Today we will examine one of those tests, the Carrion-i-Silvestre, et al. (2005) panel data test for stationarity in the presence of multiple structural breaks.]]></description>
										<content:encoded><![CDATA[<p>    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script></p>
<h3 id="introduction">Introduction</h3>
<p>The validity of many <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/" target="_blank" rel="noopener">time series models</a> and <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data models</a> requires that the underlying data is stationary. As such, reliable <a href="https://www.aptech.com/why-gauss-for-unit-root-testing/" target="_blank" rel="noopener">unit root testing</a> is an important step of any time series analysis or panel data analysis. </p>
<p>However, standard time series unit root tests and panel data unit root tests aren’t reliable when <a href="https://www.aptech.com/structural-breaks/" target="_blank" rel="noopener">structural breaks</a> are present. Because of this, when structural breaks are suspected, we must employ unit root tests that properly incorporate these breaks. </p>
<p>Today we will examine one of those tests, the Carrion-i-Silvestre, et al. (2005) panel data test for stationarity in the presence of multiple structural breaks.</p>
<h2 id="why-panel-data-unit-root-testing">Why Panel Data Unit Root Testing?</h2>
<p>We may be tempted when working with panel data to treat the data as individual time-series, performing unit root testing on each one separately. However, one of the fundamental ideas of panel data is that there is a shared underlying component that connects the group. </p>
<p>It is this shared component that suggests there are advantages to be gained from testing the panel data collectively:</p>
<ul>
<li>Panel data contains more combined information and variation than pure time-series data or cross-sectional data.   </li>
<li>Collectively testing for unit roots in panels provides more power than testing individual series.  </li>
<li>Panel data unit root tests are more likely than time series unit root tests to have standard asymptotic distributions. </li>
</ul>
<p>Put simply, when dealing with panel data, using tests designed specifically for panel data and testing the panel collectively can lead to more reliable results.</p>
<div class="alert alert-info" role="alert">For more background on unit root testing, see our previous blog post, <a href="https://www.aptech.com/blog/how-to-conduct-unit-root-tests-in-gauss/" target="_blank" rel="noopener">“How to Conduct Unit Root Tests in GAUSS”</a>.</div>
<h2 id="why-do-we-need-to-worry-about-structural-breaks">Why do we Need to Worry About Structural Breaks?</h2>
<p>It is important to properly address structural breaks when conducting unit root testing because, in their presence, most <strong>standard unit root tests are biased towards non-rejection</strong> of the unit root null hypothesis. We discuss this in greater detail in our <a href="https://www.aptech.com/blog/unit-root-tests-with-structural-breaks/" target="_blank" rel="noopener">“Unit Root Tests with Structural Breaks”</a> blog.</p>
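<p>We can see this bias with a short simulation. The following is a Python sketch with invented data and seed (not GAUSS or tspdlib code): a stationary white-noise series with a single mean shift exhibits a first-order autocorrelation far above zero, mimicking the persistence of a unit root process and pushing standard tests towards non-rejection.</p>

```python
import random

def lag1_autocorr(y):
    """First-order sample autocorrelation of a series."""
    n = len(y)
    mean = sum(y) / n
    num = sum((y[t] - mean) * (y[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in y)
    return num / den

random.seed(42)
T = 500
noise = [random.gauss(0, 1) for _ in range(T)]

# A stationary white-noise series: autocorrelation near zero.
print(round(lag1_autocorr(noise), 2))

# The same innovations with a mean shift of 5 at mid-sample:
# the break masquerades as persistence, pushing the
# autocorrelation toward one.
shifted = [e + (5 if t >= T // 2 else 0) for t, e in enumerate(noise)]
print(round(lag1_autocorr(shifted), 2))
```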
<h2 id="panel-data-stationarity-test-with-structural-breaks">Panel Data Stationarity Test with Structural Breaks</h2>
<p>The Carrion-i-Silvestre, <em>et al.</em> (2005) panel data stationarity test introduces a number of important testing features:</p>
<ul>
<li>Tests the null hypothesis of stationarity against the alternative of non-stationarity.  </li>
<li>Allows for multiple, unknown structural breaks.  </li>
<li>Accommodates shifts in the mean and/or trend of the individual time series.   </li>
<li>Does not require the same breaks across the entire panel but, rather, allows for each individual to have a different number of breaks at different dates.   </li>
<li>Allows for homogeneous or heterogeneous long-run variances across individuals.  </li>
</ul>
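<p>To fix ideas, the building block that the Carrion-i-Silvestre, <em>et al.</em> (2005) test generalizes is the KPSS statistic, $\eta = T^{-2} \sum_t S_t^2 / \hat{\sigma}^2$, where $S_t$ is the partial sum of demeaned residuals. Below is a minimal Python sketch of the plain no-break, level-stationarity version with an iid variance estimator, using simulated data; this is an illustration only, not the tspdlib implementation.</p>

```python
import random

def kpss_level_stat(y):
    """Plain KPSS level-stationarity statistic:
    eta = T^-2 * sum_t S_t^2 / sigma^2, where S_t is the
    partial sum of demeaned observations and sigma^2 is
    the (iid, short-run) residual variance."""
    T = len(y)
    mean = sum(y) / T
    e = [v - mean for v in y]
    sigma2 = sum(v * v for v in e) / T
    s, num = 0.0, 0.0
    for v in e:
        s += v           # running partial sum S_t
        num += s * s
    return num / (T * T * sigma2)

random.seed(0)
T = 300
innovations = [random.gauss(0, 1) for _ in range(T)]

# A random walk: cumulative sum of the same innovations.
walk = []
level = 0.0
for eps in innovations:
    level += eps
    walk.append(level)

print(kpss_level_stat(innovations))  # small under the stationarity null
print(kpss_level_stat(walk))         # large under a unit root
```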
<div style="text-align:center;background-color:#37444d;padding-top:40px;padding-bottom:40px;"><span style="color:#FFFFFF">Deciding which unit root test is right for your data?</span> <a href="https://www.aptech.com/why-gauss-for-unit-root-testing/#ur_test_guide">Download our Unit Root Selection Guide!</a></div>
<h2 id="conducting-panel-data-stationarity-tests-in-gauss">Conducting Panel Data Stationarity Tests in GAUSS</h2>
<h3 id="where-can-i-find-the-tests">Where can I Find the Tests?</h3>
<p>The panel data stationarity test with structural breaks is implemented by the <a href="https://docs.aptech.com/gauss/tspdlib/docs/pd_kpss.html" target="_blank" rel="noopener"><code>pd_kpss</code></a> procedure in the GAUSS <a href="https://docs.aptech.com/gauss/tspdlib/docs/tspdlib-landing.html" target="_blank" rel="noopener">tspdlib</a> library. </p>
<p>The library can be directly installed using the <a href="https://www.aptech.com/blog/gauss-package-manager-basics/" target="_blank" rel="noopener">GAUSS Package Manager</a>. </p>
<h3 id="what-format-should-my-data-be-in">What Format Should my Data be in?</h3>
<p>The <code>pd_kpss</code> procedure takes panel data in wide format - this means that each column of your data matrix should contain the time series observations for a different individual in the panel. </p>
<p>For example, if we have 100 observations of real GDP for 3 countries, our test data will be a 100 x 3 matrix.</p>
<table>
<thead>
<tr>
<th>Observation #</th>
<th>Country A</th>
<th>Country B</th>
<th>Country C</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1.11</td>
<td>1.40</td>
<td>1.39</td>
</tr>
<tr>
<td>2</td>
<td>1.14</td>
<td>1.37</td>
<td>1.34</td>
</tr>
<tr>
<td>3</td>
<td>1.27</td>
<td>1.45</td>
<td>1.28</td>
</tr>
<tr>
<td>4</td>
<td>1.19</td>
<td>1.51</td>
<td>1.35</td>
</tr>
<tr>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
<td>$\vdots$</td>
</tr>
<tr>
<td>99</td>
<td>1.53</td>
<td>1.75</td>
<td>1.65</td>
</tr>
<tr>
<td>100</td>
<td>1.68</td>
<td>1.78</td>
<td>1.67</td>
</tr>
</tbody>
</table>
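<p>The reshape from long records into this wide layout is what <code>dfWider</code> performs in GAUSS. A pure-Python sketch of the same pivot, using invented values, looks like this:</p>

```python
# Long-format records: one row per (country, observation) pair.
long_rows = [
    ("A", 1, 1.11), ("B", 1, 1.40), ("C", 1, 1.39),
    ("A", 2, 1.14), ("B", 2, 1.37), ("C", 2, 1.34),
]

def to_wide(rows):
    """Pivot (group, time, value) triples into a mapping:
    time -> {group: value}, i.e. one column per group."""
    wide = {}
    for group, t, value in rows:
        wide.setdefault(t, {})[group] = value
    return wide

wide = to_wide(long_rows)
print(wide[1])  # {'A': 1.11, 'B': 1.4, 'C': 1.39}
```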
<h3 id="how-do-i-call-the-test-procedure">How do I Call the Test Procedure?</h3>
<p>The first step to implementing the panel data stationarity test with structural breaks in GAUSS is to load the <code>tspdlib</code> library. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">library tspdlib;</code></pre>
<p>This statement provides access to all the procedures in the <code>tspdlib</code> library. After loading the library, the <code>pd_kpss</code> procedure can be called directly from the command line or within a program file. </p>
<p>The <code>pd_kpss</code> procedure takes 2 required inputs and 5 optional arguments:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">{ test_hom, test_het, kpss_test, brks } = pd_kpss(y, model, 
                                                      nbreaks,
                                                      bwl,
                                                      varm, 
                                                      pmax, 
                                                      b_ctl);</code></pre>
<hr>
<dl>
<dt>y</dt>
<dd>$T \times N$ Wide form panel data to be tested.</dd>
<dt>model</dt>
<dd>Scalar, specification of the deterministic components and structural breaks:
<table>
<tbody>
<tr><td>1</td><td>Constant (Hadri test)</td></tr>
<tr><td>2</td><td>Constant + trend (Hadri test)</td></tr>
<tr><td>3</td><td>Constant + shift (in mean)</td></tr>
<tr><td>4</td><td>Constant + trend + shift (in mean and trend)</td></tr>
</tbody>
</table></dd>
<dt>nbreaks</dt>
<dd>Scalar, Optional input, number of breaks to consider (up to 5). Default = 5.</dd>
<dt>bwl</dt>
<dd>Scalar, Optional input, bandwidth for the spectral window. Default = round(4 * (T/100)^(2/9)).</dd>
<dt>varm</dt>
<dd>Scalar, Optional input, kernel used for long-run variance computation. Default = 1:
<table>
<tbody>
<tr><td>1</td><td>iid</td></tr>
<tr><td>2</td><td>Bartlett.</td></tr>
<tr><td>3</td><td>Quadratic spectral (QS).</td></tr>
<tr><td>4</td><td>Sul, Phillips, and Choi (2003) with the Bartlett kernel.</td></tr>
<tr><td>5</td><td>Sul, Phillips, and Choi (2003) with quadratic spectral kernel.</td></tr>
<tr><td>6</td><td>Kurozumi with the Bartlett kernel.</td></tr>
<tr><td>7</td><td>Kurozumi with quadratic spectral kernel.</td></tr>
</tbody>
</table></dd>
<dt>pmax</dt>
<dd>Scalar, Optional input, maximum number of lags used in the estimation of the AR(p) model for the long-run variance. The final number of lags is chosen using the BIC criterion. Default = 8.</dd>
<dt>b_ctl</dt>
<dd>Optional input, An instance of the <code>breakControl</code> structure controlling the setting for the Bai and Perron structural break estimation.</dd>
</dl>
<hr>
<p>The <code>pd_kpss</code> procedure provides 4 returns:</p>
<hr>
<dl>
<dt>test_hom</dt>
<dd>Scalar, stationarity test statistic with structural breaks and homogeneous variance.</dd>
<dt>test_het</dt>
<dd>Scalar, stationarity test statistic with structural breaks and heterogeneous variance.</dd>
<dt>kpss_test</dt>
<dd>Matrix, individual test results. The first column contains the test statistics, the second column the number of breaks, the third column the BIC-chosen optimal lags, and the fourth column the LWZ-chosen optimal lags.</dd>
<dt>brks</dt>
<dd>Matrix of estimated breaks. Breaks for each individual group are contained in separate rows.</dd>
</dl>
<hr>
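<p>As a quick illustration of the <code>bwl</code> default, the formula above can be evaluated directly. This is a Python sketch (it assumes Python's rounding matches GAUSS's for these values):</p>

```python
def default_bandwidth(T):
    """Default spectral-window bandwidth: round(4 * (T/100)^(2/9))."""
    return round(4 * (T / 100) ** (2 / 9))

# For a panel with 25 time observations per individual:
print(default_bandwidth(25))
# For 100 time observations:
print(default_bandwidth(100))
```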
<h2 id="empirical-example">Empirical Example</h2>
<p>Let’s look further into testing for panel data stationarity with structural breaks using an empirical example.</p>
<h3 id="data-description">Data Description</h3>
<p>The dataset contains government deficit as a percentage of GDP for nine OECD countries. The time span ranges from 1995 to 2019. This gives us a balanced panel of 9 individuals and 25 time observations each. </p>
<h3 id="loading-our-data-into-gauss">Loading our data into GAUSS</h3>
<p>Our first step is to load the data from <code>govt-deficit-oecd.csv</code> using <a href="https://docs.aptech.com/gauss/loadd.html" target="_blank" rel="noopener"><code>loadd</code></a>. This <code>.csv</code> file contains three variables, <code>Country</code>, <code>Year</code>, and <code>Gov_deficit</code>. </p>
<p>We will load all three variables into a <a href="https://www.aptech.com/blog/what-is-a-gauss-dataframe-and-why-should-you-care/" target="_blank" rel="noopener">GAUSS dataframe</a>. Note that <code>loadd</code> automatically detects that <code>Country</code> is a categorical variable, and assigns the <code>category</code> type. However, we will need to convert <code>Year</code> to a <a href="https://www.aptech.com/blog/dates-and-times-made-easy/" target="_blank" rel="noopener">date variable</a>:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load all variables and convert country to numeric categories
data = loadd("govt-deficit-oecd.csv");

// Convert "Year" to a date variable
data = asDate(data, "%Y", "Year");</code></pre>
<p>This loads our data in long format (a 225 x 3 dataframe). Our next step is to convert this to wide format using the <a href="https://docs.aptech.com/gauss/dfwider.html" target="_blank" rel="noopener"><code>dfWider</code></a> procedure. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify names_from column 
names_from = "Country";

// Specify values_from column
values_from = "Gov_deficit";

// Convert from long to wide format
wide_data = dfWider(data, values_from, names_from);</code></pre>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Delete first column which contains the year variable
govt_def = delcols(wide_data, 1);</code></pre>
<h3 id="setting-up-our-model-parameters">Setting up our Model Parameters</h3>
<p>With our loading and transformations complete, we are ready to set up our testing parameters. For this test, we will allow for a constant and trend, with shifts in both the mean and trend.
All other parameters will be kept at their default values. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify which model to use.
// Allow for constant and trend with shifts in both.
model = 4;</code></pre>
<h3 id="calling-the-pd_kpss-procedure">Calling the <code>pd_kpss</code> Procedure</h3>
<p>Finally, we call the <code>pd_kpss</code> procedure:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">{ test_hom, test_het, kpss_test, brks } = pd_kpss(govt_def, model);</code></pre>
<h2 id="empirical-results">Empirical Results</h2>
<p>The <code>pd_kpss</code> output includes:</p>
<ul>
<li>A header describing the testing settings. </li>
<li>The <code>test_hom</code> and <code>test_het</code> test statistics along with associated p-values.</li>
<li>The critical values for both test statistics.  </li>
<li>The testing conclusions based on a comparison of the test statistics to the associated critical values. </li>
</ul>
<pre>Test:                                                PD KPSS
Ho:                                             Stationarity
Number of breaks:                                       None
LR variance:                                             iid
Model:                                Break in level &amp; trend
==============================================================
                                      PD KPSS          P-val

Homogenous                             14.352          0.000
Heterogenous                           10.425          0.000

Critical Values:
                            1%             5%            10%

Homogenous               2.326          1.645          1.282
Heterogenous             2.326          1.645          1.282
==============================================================

Homogenous var:
Reject the null hypothesis of stationarity at the 1% level.

Heterogenous var:
Reject the null hypothesis of stationarity at the 1% level.</pre>
<p>These results tell us that we can reject the null hypothesis of stationarity at the 1% level for both cases, homogeneous and heterogeneous variance.</p>
<p>The test results also include a table of individual test results and conclusions:</p>
<pre>==============================================================
Individual panel results
==============================================================
                                         KPSS    Num. Breaks

AUT                                     0.165          2.000
DEU                                     0.079          0.000
ESP                                     0.249          4.000
FRA                                     0.210          2.000
GBR                                     0.298          2.000
IRL                                     0.235          2.000
ITA                                     0.130          3.000
LUX                                     0.127          3.000
NOR                                     0.414          1.000

Critical Values:
                            1%             5%            10%

AUT                      0.059          0.048          0.043
DEU                      0.207          0.150          0.122
ESP                      0.035          0.031          0.028
FRA                      0.056          0.045          0.040
GBR                      0.058          0.046          0.041
IRL                      0.074          0.059          0.051
ITA                      0.055          0.045          0.041
LUX                      0.058          0.045          0.039
NOR                      0.083          0.066          0.058
==============================================================

AUT                                     Reject Ho ( 1% level)
DEU                                          Cannot reject Ho
ESP                                     Reject Ho ( 1% level)
FRA                                     Reject Ho ( 1% level)
GBR                                     Reject Ho ( 1% level)
IRL                                     Reject Ho ( 1% level)
ITA                                     Reject Ho ( 1% level)
LUX                                     Reject Ho ( 1% level)
NOR                                     Reject Ho ( 1% level)
==============================================================</pre>
<p>Finally, the <code>pd_kpss</code> procedure prints the estimated breakpoints for each individual in the panel.</p>
<pre>Group        Break 1      Break 2      Break 3      Break 4      Break 5<br />
AUT          2003         2008         .            .            .<br />
DEU          .            .            .            .            .<br />
ESP          1999         2006         2009         2012         .<br />
FRA          2001         2008         .            .            .<br />
GBR          2000         2008         .            .            .<br />
IRL          2007         2010         .            .            .<br />
ITA          1997         2006         2009         .            .<br />
LUX          1999         2004         2008         .            .<br />
NOR          2008         .            .            .            .            </pre>
<div class="alert alert-info" role="alert">For more information on how to view the matrices returned by <code>pd_kpss</code> see our <a href="https://www.aptech.com/resources/tutorials/introduction-to-gauss-viewing-data-in-gauss/" target="_blank" rel="noopener">data viewing tutorial</a>.</div>
<h2 id="interpreting-the-results">Interpreting the Results</h2>
<p>When interpreting the results from the <code>pd_kpss</code> test, it helps to remember a few key things:</p>
<ul>
<li>The test considers the null hypothesis of <a href="https://www.aptech.com/blog/how-to-conduct-unit-root-tests-in-gauss/#what-is-a-stationary-time-series" target="_blank" rel="noopener">stationarity</a> against the alternative of non-stationarity.</li>
<li>We reject the null hypothesis of stationarity when we observe:
<ul>
<li>Large values of the test statistic. </li>
<li>Small p-values. </li>
</ul></li>
</ul>
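<p>The decision rule itself is mechanical: compare the right-tailed statistic to each critical value, starting from the most stringent significance level. A Python sketch, using the homogeneous-variance critical values printed above:</p>

```python
def kpss_conclusion(stat, crit):
    """Compare a right-tailed test statistic to critical values
    given as {significance_level_in_percent: critical_value}."""
    for level in sorted(crit):          # try 1% first, then 5%, 10%
        if stat > crit[level]:
            return f"Reject Ho at the {level}% level"
    return "Cannot reject Ho"

# Critical values for the panel statistics from the output above.
crit = {1: 2.326, 5: 1.645, 10: 1.282}
print(kpss_conclusion(14.352, crit))   # Reject Ho at the 1% level
print(kpss_conclusion(0.9, crit))      # Cannot reject Ho
```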
<p>Notice that the <code>tspdlib</code> library conveniently provides interpretations for the <code>pd_kpss</code> tests. </p>
<h3 id="panel-data-test-statistic">Panel Data Test Statistic</h3>
<p>The test statistic for our panel, assuming homogeneous variances:</p>
<ul>
<li>Is equal to 14.352 with a p-value of 0.000.</li>
<li>Suggests that we reject the null hypothesis of stationarity at the 1% level. </li>
</ul>
<p>The test statistic for our panel, assuming heterogeneous variances:</p>
<ul>
<li>Is equal to 10.425 with a p-value of 0.000. </li>
<li>Suggests that we reject the null hypothesis of stationarity at the 1% level.</li>
</ul>
<p>These results tell us that regardless of whether we assume heterogeneous or homogeneous variances, we can reject the null hypothesis of stationarity for the panel. Given this, we must make proper adjustments to account for non-stationarity when modeling our data. </p>
<h3 id="individual-test-results">Individual Test Results</h3>
<p><a href="https://www.aptech.com/wp-content/uploads/2020/09/pankpss-graph-spanish-1.jpeg"><img src="https://www.aptech.com/wp-content/uploads/2020/09/pankpss-graph-spanish-1.jpeg" alt="Panel data stationarity test with structural breaks. " width="75%" height="75%" class="aligncenter size-full wp-image-11580083" /></a></p>
<table>
<thead>
<tr><th>Country</th><th>Statistic</th><th>Breaks</th><th>Conclusion</th></tr>
</thead>
<tbody>
<tr><td>Austria</td><td>0.165</td><td>2003;2008</td><td>Reject null at 1%.</td></tr> 
<tr><td>France</td><td>0.210</td><td>2001;2008</td><td>Reject null at 1%.</td></tr>
<tr><td>Germany</td><td>0.079</td><td>None</td><td>Cannot reject null.</td></tr>
<tr><td>Ireland</td><td>0.235</td><td>2007;2010</td><td>Reject null at 1%.</td></tr>
<tr><td>Italy</td><td>0.130</td><td>1997;2006;2009</td><td>Reject null at 1%.</td></tr>
<tr><td>Luxembourg</td><td>0.127</td><td>1999;2004;2008</td><td>Reject null at 1%.</td></tr>
<tr><td>Norway</td><td>0.414</td><td>2008</td><td>Reject null at 1%.</td></tr>
<tr><td>Spain</td><td>0.249</td><td>1999;2006;2009;2012</td><td>Reject null at 1%.</td></tr>
<tr><td>United Kingdom</td><td>0.298</td><td>2000;2008</td><td>Reject null at 1%.</td></tr>

</tbody>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>Today's blog considers the panel data stationarity test proposed by Carrion-i-Silvestre, et al. (2005). This test is built upon two crucial aspects of unit root testing:</p>
<ul>
<li>Panel data specific tests should be used with panel data.</li>
<li>Structural breaks should be accounted for.</li>
</ul>
<p>Ignoring these two facts can result in unreliable results. </p>
<p>After today, you should have a stronger understanding of how to implement the panel data stationarity test with structural breaks in GAUSS and how to interpret the results. </p>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/" target="_blank" rel="noopener">Panel data, structural breaks and unit root testing</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">Panel Data Basics: One-way Individual Effects</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">Introduction to the Fundamentals of Panel Data</a></li>
<li><a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/" target="_blank" rel="noopener">Getting Started With Panel Data in GAUSS </a></li>
</ol>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Introduction to the Fundamentals of Panel Data</title>
		<link>https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/</link>
					<comments>https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Fri, 29 Nov 2019 18:52:07 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=20968</guid>

					<description><![CDATA[Panel data, sometimes referred to as longitudinal data, is data that contains observations about different cross sections across time. Panel data exhibits characteristics of both cross-sectional data and time-series data. This blend of characteristics has given rise to a unique branch of time series modeling made up of methodologies specific to panel data structure. This blog offers a complete guide to those methodologies including the nature of panel data series, types of panel data, and panel data models.]]></description>
										<content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Panel data, sometimes referred to as longitudinal data, is data that contains observations about different cross sections across time. Examples of groups that may make up panel data series include countries, firms, individuals, or demographic groups. </p>
<p>Like time series data, panel data contains observations collected at a regular frequency, chronologically. Like cross-sectional data, panel data contains observations across a collection of individuals.</p>
<p>There are a number of advantages of panel data:</p>
<ul>
<li>Panel data can model both the common and individual behaviors of groups.</li>
<li>Panel data contains more information, more variability, and more efficiency than pure <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-time-series-data-and-analysis/" target="_blank" rel="noopener">time series data</a> or cross-sectional data.</li>
<li>Panel data can detect and measure statistical effects that pure time series or cross-sectional data can't. </li>
<li>Panel data can minimize estimation biases that may arise from aggregating groups into a single time series. </li>
</ul>
<p>Panel data examples can be found in <a href="https://www.aptech.com/industry-solutions/econometrics/" target="_blank" rel="noopener">economics</a>, <a href="https://www.aptech.com/industry-solutions/social-science/" target="_blank" rel="noopener">social sciences</a>, <a href="https://www.aptech.com/industry-solutions/epidemiology/" target="_blank" rel="noopener">medicine and epidemiology</a>, <a href="https://www.aptech.com/industry-solutions/finance/" target="_blank" rel="noopener">finance</a>, and the <a href="https://www.aptech.com/industry-solutions/engineeringphysics/" target="_blank" rel="noopener">physical sciences</a>.</p>
<table>
 <thead>
 <tr>
      <th colspan="3">
         <h3 id="what-is-an-example-of-panel-data"><br>What Is an Example of Panel Data? </h3>
      </th>
   </tr>

<tr><th>Field</th><th>Example topics</th><th>Example dataset</th></tr>
</thead>
<tbody>
<tr><td>Microeconomics</td><td>
GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances. 
</td><td>
Panel Study of Income Dynamics (PSID)
</td></tr>
<tr><td>Macroeconomics</td><td>
International trade tables, world socioeconomic tables, currency exchange rate tables. 
</td><td><a href="https://www.rug.nl/ggdc/productivity/pwt/" target="_blank" rel="noopener">Penn World Tables</a></td></tr>
<tr><td>Epidemiology and Health Statistics</td><td>
Public health insurance data, disease survival rate data, child development and well-being data. 
</td><td><a href="https://meps.ahrq.gov/mepsweb/" target="_blank" rel="noopener">Medical Expenditure Panel Survey</a></td></tr>
<tr><td>Finance</td><td>
Stock prices by firm, market volatilities by country or firm.
</td><td>
<a href="https://finance.yahoo.com/world-indices/" target="_blank" rel="noopener">Global Market Indices</a>
</td></tr>
</tbody>
</table>
<h2 id="what-is-panel-data">What Is Panel Data?</h2>
<p>Panel data is a collection of quantities obtained for multiple individuals, assembled at regular intervals in time and ordered chronologically. Examples of individual groups include individual people, countries, and companies.</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2019/11/two-groups-from-a-panel.jpg"><img src="https://www.aptech.com/wp-content/uploads/2019/11/two-groups-from-a-panel.jpg" alt="Two groups from a panel dataset." width="600" height="300" class="aligncenter size-full wp-image-21225" /></a></p>
<p>In order to denote both individuals and time observations, panel data often refers to groups with the subscript <em>i</em> and time with the subscript <em>t</em>. For example, a panel data observation $Y_{it}$ is observed for all individuals $i = {1, ..., N}$ across all time periods $t = {1, ..., T}$.</p>
<p>More specifically:</p>
<table>
<thead>
<tr><th>Group</th><th>Time Period</th><th>Notation</th></tr></thead>
<tbody>
<tr><td>1</td><td>1</td><td>$Y_{11}$</td></tr>
<tr><td>1</td><td>2</td><td>$Y_{12}$</td></tr>
<tr><td>1</td><td>T</td><td>$Y_{1T}$</td></tr>
<tr><td>⁞</td><td>⁞</td><td>⁞</td></tr>
<tr><td>N</td><td>1</td><td>$Y_{N1}$</td></tr>
<tr><td>N</td><td>2</td><td>$Y_{N2}$</td></tr>
<tr><td>N</td><td>T</td><td>$Y_{NT}$</td></tr>
</tbody>
</table>
<h3 id="wide-and-long-panel-datasets">Wide and Long Panel Datasets</h3>
<p>Panel datasets may come in different formats. The format in the table above is sometimes called long format. Long format datasets stack the observations of each variable from all groups, across all time periods, into one column. </p>
<p>When panel data is stored with the observations for a single variable from separate groups stored in separate columns this is sometimes referred to as wide data format.</p>
<table>
<thead>
<tr><th>Time</th><th>$Y_1$</th><th>$Y_2$</th><th>$Y_N$</th></tr></thead>
<tbody>
<tr><td>1</td><td>$Y_{11}$</td><td>$Y_{21}$</td><td>$Y_{N1}$</td></tr>
<tr><td>2</td><td>$Y_{12}$</td><td>$Y_{22}$</td><td>$Y_{N2}$</td></tr>
<tr><td>3</td><td>$Y_{13}$</td><td>$Y_{23}$</td><td>$Y_{N3}$</td></tr>
<tr><td>4</td><td>$Y_{14}$</td><td>$Y_{24}$</td><td>$Y_{N4}$</td></tr>
<tr><td>⁞</td><td>⁞</td><td>⁞</td><td>⁞</td></tr>
<tr><td>T-1</td><td>$Y_{1,T-1}$</td><td>$Y_{2,T-1}$</td><td>$Y_{N,T-1}$</td></tr>
<tr><td>T</td><td>$Y_{1T}$</td><td>$Y_{2T}$</td><td>$Y_{NT}$</td></tr>
</tbody>
</table>
<h3 id="balanced-panel-data-versus-unbalanced-panel-data">Balanced Panel Data Versus Unbalanced Panel Data</h3>
<p>Panel data can also be characterized as unbalanced panel data or balanced panel data:</p>
<ul>
<li>Balanced panel datasets have the same number of observations for all groups.  </li>
<li>Unbalanced panel datasets have missing values at some time observations for some of the groups.  </li>
</ul>
<p>Certain panel data models are only valid for balanced datasets. If the panel datasets are unbalanced they may need to be condensed to include only the consecutive periods for which there are observations for all individuals in the cross section.</p>
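<p>A balanced-panel check is straightforward to express: every group must be observed at the same set of time periods. The following is a Python sketch with invented observation keys:</p>

```python
def is_balanced(rows):
    """rows: iterable of (group, time) observation keys.
    A panel is balanced when every group is observed at
    the same set of time periods."""
    periods = {}
    for group, t in rows:
        periods.setdefault(group, set()).add(t)
    groups = list(periods.values())
    return all(p == groups[0] for p in groups)

balanced = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]
unbalanced = [("A", 1), ("A", 2), ("B", 1)]   # B is missing period 2
print(is_balanced(balanced))     # True
print(is_balanced(unbalanced))   # False
```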
<h2 id="panel-data-and-heterogeneity">Panel Data and Heterogeneity</h2>
<p>Panel data modeling centers around addressing the likely dependence across observations within the same group. In fact, the primary difference between panel data models and time series models is that panel data models allow for heterogeneity across groups and introduce individual-specific effects.</p>
<p>As an example, consider a panel data series which includes gross domestic product (GDP) data for a panel of 5 different countries, the United States, France, Canada, Greece, and Australia:</p>
<ul>
<li>A worldwide economic recession is likely to impact all 5 countries, causing changes in GDP across the entire panel.</li>
<li>An election in Australia is likely to impact the GDP of Australia but may not affect the other countries in the panel.</li>
<li>A change in North American trade policy may only regionally impact the US and Canada.</li>
<li>A change in the Euro exchange rate will most directly affect only France and Greece.</li>
</ul>
<p>Panel data models include techniques that can address these heterogeneities across individuals. Furthermore, pure cross-sectional methods and pure time series models may not be valid in the presence of this heterogeneity. </p>
<h2 id="modeling-panel-data">Modeling Panel Data</h2>
<p>Researchers commonly analyze datasets with multiple observations of a set of cross-sectional units (e.g., people, firms, countries) over time. For example, one may have data covering the production of multiple
firms or the gross product of multiple countries across a number of years. </p>
<p>Modeling these panel data series is a unique branch of time series modeling made up of methodologies specific to their structure. </p>
<p>This section looks more closely at panel data analysis and the associated panel data models. </p>
<h3 id="homogeneous-versus-heterogeneous-panel-data-models">Homogeneous Versus Heterogeneous Panel Data Models</h3>
<p>Panel data methods can be split into two broad categories:</p>
<ul>
<li>Homogeneous (or pooled) panel data models assume that the model parameters are common across individuals.</li>
<li>Heterogeneous models allow for any or all of the model parameters to vary across individuals. Fixed effects and random effects models are both examples of heterogeneous panel data models.</li>
</ul>
<p>Within these groups, the assumptions made about the variation of the model across individuals are the primary drivers for which model to use. </p>
<p>Let’s consider a simple linear model</p>
<p>$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$</p>
<p>The representation above is a homogeneous model: </p>
<ul>
<li>The constant, $ \alpha $, is the same across groups and time.</li>
<li>The coefficient, $ \beta $, is constant across groups and time. </li>
<li>Any differences across groups enter the model only through the error term, $ \epsilon_{it} $.</li>
</ul>
<p>Alternatively, we could believe that groups share common coefficients on regressors but there are group-specific intercepts, as is captured in the fixed effects or <a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">least squares dummy variable (LSDV)</a> model</p>
<p>$$y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}$$</p>
<p>The representation above is a heterogeneous model, because the constants, $ \alpha_i $, are group-specific.</p>
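<p>As a hedged sketch of how these group-specific intercepts can be estimated, the GAUSS <code>design</code> function expands a vector of group identifiers into a matrix of 0/1 dummy variables. The layout of the <code>data</code> matrix below (group id in column 1, assumed to run 1 through N, then $y$ and $x$) is a hypothetical assumption:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Hypothetical stacked panel: column 1 = group id (1, 2, ..., N),
// column 2 = y, column 3 = x
grp = data[., 1];
y = data[., 2];
x = data[., 3];

// Expand the group ids into a matrix of 0/1 dummy variables
d = design(grp);

// Suppress the automatic constant so the dummies act as the intercepts
__con = 0;
call ols("", y, d ~ x);</code></pre>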
<h3 id="individual-specific-effects-in-panel-data">Individual-Specific Effects in Panel Data</h3>
<p>This section considers four popular panel data models:</p>
<ul>
<li>Pooled ordinary least squares.</li>
<li>One-way fixed effects.</li>
<li>One-way random effects.</li>
<li>Random coefficients. </li>
</ul>
<p>We will examine these models using an assumed data generation process given by </p>
<p>$$ y_{it} = \beta x_{it} + \delta z_i + \epsilon_{it}$$</p>
<p>In this model, $x_{it}$ represents observed characteristics, such as age, firm size, or expenditures, and $z_i$ represents unobserved characteristics, such as management quality, growth opportunities, or skill. </p>
<table>
<thead>
<tr><th>Component</th><th>Description</th><th>Example</th></tr></thead>
<tbody>
<tr><td>$x_{it}$</td><td>Observable characteristics. These may be constant for an individual across time, such as race, or may vary over time, such as age.</td><td>Age, race, company size, expenditure, population, GDP</td></tr>
<tr><td>$z_i$</td><td>Unobservable characteristics, responsible for model heterogeneity.</td><td>Skill, company potential, lack of basic infrastructure in the community, political unrest.</td></tr>
<tr><td>$\epsilon_{it}$</td><td>Stochastic error term.</td><td>N/A</td></tr>
</tbody>
</table>
<p><b>What Is Pooled Ordinary Least Squares?</b><br />
In some cases, there are no unobservable individual-specific effects, and $\delta z_i $ is constant across individuals. This is a strong assumption and implies that all the observations within groups are independent of one another. </p>
<p>In these cases, the model becomes</p>
<p>$$ y_{it} = \beta x_{it} + \alpha + \epsilon_{it}$$</p>
<p>This implies that when there is no dependence within individual groups, the panel data can be treated as one large, pooled dataset. The model parameters, $\beta$, and, $\alpha$, can be directly estimated using pooled ordinary least squares.</p>
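<p>As a minimal sketch, pooled OLS can be run directly on the stacked data in GAUSS. The simulated data and coefficient values below are purely illustrative:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Simulate a small pooled panel: 3 groups x 6 periods, no group effects
rndseed 97;
n = 18;
x = rndn(n, 1);
y = 1.5 + 0.5 .* x + 0.1 .* rndn(n, 1);

// With no within-group dependence, 'ols' estimates alpha and beta directly
call ols("", y, x);</code></pre>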
<p>Independence within the groups of a panel is unlikely, so pooled OLS is rarely acceptable for panel data. </p>
<p><b>What Is The One-Way Fixed Effects Model?</b><br />
The <a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">one-way fixed effects panel data model</a>:</p>
<ul>
<li>Includes unobservable time-specific or individual-specific effects. These effects capture omitted variables. </li>
<li>Assumes that the individual-specific effects are correlated with the observed characteristics, $x_{it}$.</li>
<li>Pooled OLS estimates for data generated by this process will be inconsistent.</li>
</ul><div id="attachment_21244" style="width: 610px" class="wp-caption aligncenter"><a href="https://www.aptech.com/wp-content/uploads/2019/11/panel-blog-fixed-effects.jpg"><img aria-describedby="caption-attachment-21244" decoding="async" fetchpriority="high" src="https://www.aptech.com/wp-content/uploads/2019/11/panel-blog-fixed-effects.jpg" alt="Plot of fixed effects panel data." width="600" height="300" class="size-full wp-image-21244" /></a><p id="caption-attachment-21244" class="wp-caption-text">Fixed effects data with group-specific intercepts and one shared slope.</p></div> <p>As an example, let’s consider the one-way fixed effects model with individual-specific effects where the unobservable component, $\delta z_i$ , acts like an individual-specific intercept:</p>
<p>$$y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}$$</p>
<p>The intercept term, $\alpha_i$, varies across individuals but is constant across time. It can be decomposed as $\alpha_i = \mu + \gamma_i$, where $\mu$ is a constant intercept shared by all groups and $\gamma_i$ is the individual-specific component.</p>
<p>The key feature of the fixed effects model is that $\gamma_i$ has a true, but unobservable, effect that must be estimated. More importantly, if we estimate $\beta$ using pooled OLS and fail to appropriately account for $\gamma_i$, the estimates will be inconsistent and biased.</p>
<p>The fixed effects model requires estimating the model parameter $\beta$ and an individual intercept $\alpha_i$ for each of the N groups in the panel. This is generally achieved using one of three estimation techniques:</p>
<ul>
<li><a href="https://www.aptech.com/examples/tsmt/tscsfit-grunfeld/" target="_blank" rel="noopener">Within-group estimation</a>.</li>
<li>First differences estimation.</li>
<li><a href="https://www.aptech.com/examples/tsmt/lsdvfit-simulated/" target="_blank" rel="noopener">Least squares dummy variable (LSDV)</a> estimation.</li>
</ul>
<p>The first two of these techniques focus on eliminating the individual effects before estimation. The LSDV method instead incorporates these effects directly using dummy variables.</p>
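<p>As one hedged sketch of within-group estimation, the GAUSS <code>aggregate</code> function can compute the group means, which are then subtracted from each observation. The layout of the hypothetical <code>data</code> matrix (group id in column 1, assumed to run 1 through N, then $y$ and $x$) is an assumption:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Group means of y and x, one row per group: id, mean(y), mean(x)
grp_means = aggregate(data, "mean");

// The within transformation: subtract each group's means from its rows
ydm = data[., 2] - grp_means[data[., 1], 2];
xdm = data[., 3] - grp_means[data[., 1], 3];

// OLS on the demeaned data estimates beta; the alpha_i have been swept out
__con = 0;
call ols("", ydm, xdm);</code></pre>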
<p><b>What Is the One-Way Random Effects Model?</b><br />
The <a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">one-way random effects panel data model</a>:</p>
<ul>
<li>Includes unobservable time-specific or individual-specific effects, $\delta z_i$, which act like individual-specific stochastic error terms.</li>
<li>Assumes that these effects are uncorrelated with the observed characteristics, $x_{it}$. </li>
<li>Does not bias OLS coefficient estimates, but does make them inefficient and renders standard inference tools invalid.</li>
</ul><div id="attachment_21257" style="width: 610px" class="wp-caption aligncenter"><a href="https://www.aptech.com/wp-content/uploads/2019/11/panel-blog-random-variations.jpg"><img aria-describedby="caption-attachment-21257" decoding="async" src="https://www.aptech.com/wp-content/uploads/2019/11/panel-blog-random-variations.jpg" alt="Plot of random effects panel data showing stochastic differences across groups." width="600" height="300" class="size-full wp-image-21257" /></a><p id="caption-attachment-21257" class="wp-caption-text">Plot of random effects panel data showing stochastic differences across groups.</p></div><p>The distinguishing feature of the random effects model is that $\delta z_i$ does not have a true value but rather follows a random distribution with parameters that we must estimate. </p>
<p>The random effects term, $\delta z_i$:</p>
<ul>
<li>Is uncorrelated with $x_{it}$ and pooled OLS estimates of the model parameters will not be biased. </li>
<li>Alters the covariance structure of the error term, which implies that pooled OLS estimates of the model parameters will be inefficient and standard inference tools, like the t-statistic, will not be correct.</li>
</ul>
<p>The random effects model should be estimated using feasible generalized least squares (FGLS). Using FGLS, the appropriate error structure, one which accounts for the individual-specific error terms, can be incorporated into the model. </p>
<p><b>What Is the Random Coefficients Model?</b></p><div id="attachment_21253" style="width: 610px" class="wp-caption aligncenter"><a href="https://www.aptech.com/wp-content/uploads/2019/11/panel-blog-random-coefficients-data-w-heteroscedasticity.jpg"><img aria-describedby="caption-attachment-21253" decoding="async" src="https://www.aptech.com/wp-content/uploads/2019/11/panel-blog-random-coefficients-data-w-heteroscedasticity.jpg" alt="Plot of random coefficients panel data, showing differing intercepts, slopes, and variances." width="600" height="300" class="size-full wp-image-21253" /></a><p id="caption-attachment-21253" class="wp-caption-text">Plot of random coefficients panel data, showing differing intercepts, slopes, and variances.</p></div><p>The <a href="https://github.com/aptech/gauss-panel-library" target="_blank" rel="noopener">panel data regressions</a> we’ve looked at so far have all assumed that the coefficients on regressors are the same across all individuals. The random coefficients model relaxes this assumption and introduces individual-specific effects through the coefficients, such that</p>
<p>$$y_{it} = \beta_i x_{it} + \alpha_i + \epsilon_{it}$$
$$y_{it} = (b_i + \beta)x_{it} + (a_i+\alpha) + \epsilon_{it}$$
$$b_i \sim N(0, \tau_{i1}^2)$$
$$a_i \sim N(0, \tau_{i2}^2)$$</p>
<p>This model introduces individual-specific slope effects and allows for heteroscedasticity through the individual-specific variances $\tau_{i1}^2$ and $\tau_{i2}^2$.</p>
<p>This model can be estimated using <a href="https://www.aptech.com/blog/using-feasible-generalized-least-squares-to-improve-estimates/" target="_blank" rel="noopener">feasible generalized least squares (FGLS)</a> or <a href="https://www.aptech.com/blog/beginners-guide-to-maximum-likelihood-estimation-in-gauss/" target="_blank" rel="noopener">maximum likelihood estimation (MLE)</a>. </p>
<h3 id="two-way-individual-effects-models">Two-Way Individual Effects Models</h3>
<p>The two-way individual effects model allows the presence of both time-specific effects and individual-specific effects. </p>
<p>Starting from a simple linear model given by, </p>
<p>$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$</p>
<p>the two-way individual effects model can be represented by </p>
<p>$$y_{it} = \alpha + \beta x_{it} + \mu_i + \lambda_t + \epsilon_{it}$$</p>
<p>In this model, $\mu_i$, captures any unobservable individual-specific effects and $\lambda_t$ captures any unobservable time-specific effects. Note that the individual-specific effects, $\mu_i$, do not vary with time, while the time-specific effects, $\lambda_t$, do not vary across individuals.</p>
<p>In the special case that there are only two time periods and two groups, this model is equivalent to the <a href="https://www.aptech.com/blog/introduction-to-difference-in-differences-estimation/" target="_blank" rel="noopener">difference-in-differences model</a>. However, if there are more than two time periods and/or groups, alternative panel data models must be considered. </p>
<p><b>What Is the Two-Way Fixed Effects Model?</b><br />
The two-way fixed effects model:</p>
<ul>
<li>Assumes that both $\mu_i$ and $\lambda_t$ are unobservable, fixed effects that must be estimated.</li>
</ul>
<p>For data generated by this model:</p>
<ul>
<li>Pooled OLS estimates, which ignore $\mu_i$ and $\lambda_t$, will be biased and inconsistent. </li>
<li>One-way fixed effects estimates, which ignore $\lambda_t$, will be biased.</li>
</ul>
<p>Like the one-way fixed effects model, this model could be estimated by including dummy variables. However, in the two-way fixed effects model dummy variables must be included for both the time periods and the groups. </p>
<p>Under most circumstances, the number of dummy variables required by the two-way fixed effects model makes standard ordinary least squares estimation computationally impractical. Instead, the two-way fixed effects model is estimated using a within-group estimator that removes both the group means and the time-period means from the data.</p>
<p><b>What Is the Two-Way Random Effects Model?</b><br />
The two-way random effects model:</p>
<ul>
<li>Occurs when both $\mu_i$ and $\lambda_t$ are unobservable, stochastic effects. </li>
<li>Assumes that $\mu_i$ and $\lambda_t$ are independently distributed and are uncorrelated with $x_{it}$.</li>
</ul>
<p>For data generated by this process:</p>
<ul>
<li>Pooled OLS estimates will be unbiased. However, the estimates will be inefficient, and the associated standard errors and t-statistics will be invalid. </li>
</ul>
<p>Like the one-way random effects model, the two-way random effects model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE). </p>
<p><b>Dynamic Panel Data Model</b><br />
A key component of pure time series models is the modeling of dynamics using lagged dependent variables. These lagged variables capture the autocorrelation between observations of the same dataset at different points in time. </p>
<p>Because panel datasets include a time series component, it is also important to address the possibility of autocorrelation in panel data. The <a href="https://ifs.org.uk/publications/dpd-gauss" target="_blank" rel="noopener">dynamic panel data model</a> adds dynamics to the panel data individual effects framework. </p>
<p>Consider an individual effects model which includes an AR(1) term</p>
<p>$$y_{it} = \delta y_{i,t-1} + \beta x_{it} + \epsilon_{it}$$</p>
<p>where the error component includes one-way individual effects such that </p>
<p>$$\epsilon_{it} = \mu_i + \nu_{it}$$</p>
<p>where $\mu_i$ captures the individual effects and $\nu_{it}$ is the remaining idiosyncratic error.</p>
<p>Introducing lagged dependent variables in the individual effects framework:</p>
<ul>
<li>Both $y_{it}$ and $y_{i,t-1}$ are functions of $\mu_i$, because $\mu_i$ is time-invariant. This implies that as a regressor, $y_{i,t-1}$ is correlated with the error term.</li>
<li><a href="https://www.aptech.com/resources/tutorials/econometrics/" target="_blank" rel="noopener">Ordinary least squares (OLS)</a> will yield biased estimates because the regressor $y_{i,t-1}$ is correlated with the error term. </li>
</ul>
<p>Dynamic panel data models are most commonly estimated using a <a href="https://www.aptech.com/resources/tutorials/gmm/introduction/" target="_blank" rel="noopener">generalized method of moments (GMM)</a> framework proposed by Arellano and Bond (1991).</p>
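<p>The bias from ignoring $\mu_i$ is easy to see in a small simulation. The code below is only a sketch with made-up parameters, demonstrating the problem rather than the Arellano and Bond estimator itself:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Simulate y_it = 0.5*y_(i,t-1) + mu_i + nu_it
rndseed 2300;
n = 200;
t = 10;
delta = 0.5;
mu = rndn(n, 1);

y = zeros(n, t+1);
for s(2, t+1, 1);
    y[., s] = delta .* y[., s-1] + mu + rndn(n, 1);
endfor;

// Stack current and lagged observations
ycur = vecr(y[., 2:t+1]);
ylag = vecr(y[., 1:t]);

// Pooled OLS ignores mu_i; since y_(i,t-1) is correlated with mu_i,
// the estimate of delta will be biased away from 0.5
b = ycur / (ones(rows(ylag), 1) ~ ylag);
print b;</code></pre>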
<h2 id="panel-data-and-stationarity">Panel Data and Stationarity</h2>
<p>In panel data that covers short time frames, there is little need to worry about stationarity. However, when panel data covers longer time frames, as is the case in many macroeconomic panel data series, the data must be <a href="https://www.aptech.com/why-gauss-for-unit-root-testing/" target="_blank" rel="noopener">tested for stationarity</a>.</p>
<p>Weak stationarity, required for many panel data modeling techniques, requires only that:</p>
<ul>
<li>The series has the same finite unconditional mean and finite unconditional variance at all time periods.</li>
<li>The series' autocovariances depend only on the lag between observations, not on time.</li>
</ul>
<p>Nonstationary panel data series are any panel series that do not meet the conditions of a weakly stationary time series.</p>
<p>In part because of these considerations, a large field of research and literature surrounding panel data unit root tests has developed.</p>
<p>Testing for <a href="https://www.aptech.com/why-gauss-for-unit-root-testing/" target="_blank" rel="noopener">unit roots</a> in panel data requires more than just testing the individual cross sections for the presence of unit roots. <a href="https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/" target="_blank" rel="noopener">Panel data unit root tests</a> must:</p>
<ul>
<li>Allow for both the shared movements across groups and the individual-specific movements within groups.   </li>
<li>Use an appropriate asymptotic distribution based on how quickly the number of panels (N) and the number of time periods (T) grow relative to one another.  </li>
<li>Determine whether to assume cross-sectional independence or to allow for cross-sectional dependence. </li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>After today's blog, you should have an understanding of the fundamentals of panel data. We covered the basics of panel data including:</p>
<ul>
<li>The structure of panel data series.</li>
<li>Wide versus long panel data series.</li>
<li>One-way individual effects panel data models.</li>
<li>Two-way individual effects panel data models.</li>
<li>Dynamic panel data models.</li>
<li>Panel data series and stationarity.</li>
</ul>
<h3 id="further-suggested-reading">Further suggested reading:</h3>
<ol>
<li><a href="https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/" target="_blank" rel="noopener">Panel data, structural breaks and unit root testing</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">Panel Data Basics: One-way Individual Effects</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/" target="_blank" rel="noopener">Panel Data Stationarity Test With Structural Breaks</a></li>
<li><a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/" target="_blank" rel="noopener">Getting Started With Panel Data in GAUSS </a></li>
</ol>
<p>Ready to get more from your panel data with GAUSS? <a href="https://www.aptech.com/contact-us/" target="_blank" rel="noopener">Contact us today</a> to claim your free GAUSS demo copy.</p>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // calculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/feed/</wfw:commentRss>
			<slash:comments>2</slash:comments>
		
		
			</item>
		<item>
		<title>How to Aggregate Panel Data in GAUSS</title>
		<link>https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/</link>
					<comments>https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Sat, 23 Nov 2019 00:34:45 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<category><![CDATA[Programming]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=21062</guid>

					<description><![CDATA[The aggregate function, first available in <a href="https://www.aptech.com/blog/gauss-20-initial-release/">GAUSS version 20</a>, computes statistics within data groups. This is particularly useful for <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/">panel data</a>. In today's blog, we take a closer look at aggregate.]]></description>
										<content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>The <code>aggregate</code> function, first available in <a href="https://www.aptech.com/blog/gauss-20-initial-release/" target="_blank" rel="noopener">GAUSS version 20</a>, computes statistics within data groups. This is particularly useful for <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a>.</p>
<p>In today's blog, we take a closer look at <a href="https://docs.aptech.com/gauss/aggregate.html" target="_blank" rel="noopener"><code>aggregate</code></a>. We will:</p>
<ol>
<li>Introduce the basics of the <code>aggregate</code> function.</li>
<li>Explain how to use the <code>aggregate</code> function.</li>
<li>Demonstrate a real-world application of the <code>aggregate</code> function using current account data from the International Monetary Fund.</li>
</ol>
<h2 id="the-gauss-aggregate-function">The GAUSS Aggregate Function</h2>
<p>The GAUSS <code>aggregate</code> function computes statistics within a group based upon a specified group identifier. The function supports a variety of GAUSS statistics including:</p>
<ul>
<li>mean</li>
<li>median</li>
<li>mode</li>
<li>min</li>
<li>max</li>
<li>sample standard deviation</li>
<li>sum</li>
<li>sample variance</li>
</ul>
<p>For example, consider a panel dataset which includes observed weights for three individuals across a 6-month time span:</p>
<table>
<thead>
<tr><th>Name</th><th>Jan. Weight</th><th>Feb. Weight</th><th>Mar. Weight</th><th>Apr. Weight</th><th>May Weight</th><th>June Weight</th></tr></thead>
<tbody>
<tr><td>Sarah</td><td>135</td><td>134</td><td>138</td><td>142</td><td>144</td><td>145</td></tr>
<tr><td>Tom</td><td>196</td><td>192</td><td>182</td><td>183</td><td>184</td><td>181</td></tr>
<tr><td>Nikki</td><td>143</td><td>144</td><td>146</td><td>147</td><td>145</td><td>143</td></tr>
</tbody>
</table>
<p>We can use the <code>aggregate</code> function to find the 6-month mean weights for Sarah, Tom and Nikki:</p>
<table>
<thead>
<tr><th>Name</th><th>Jan. Weight</th><th>Feb. Weight</th><th>Mar. Weight</th><th>Apr. Weight</th><th>May Weight</th><th>June Weight</th><th>Mean Weight</th></tr></thead>
<tbody>
<tr><td>Sarah</td><td>135</td><td>134</td><td>138</td><td>142</td><td>144</td><td>145</td><td style="text-align:center; background-color: #fde5d2">139.7</td></tr>
<tr><td>Tom</td><td>196</td><td>192</td><td>182</td><td>183</td><td>184</td><td>181</td><td style="text-align:center; background-color: #fde5d2">186.3</td></tr>
<tr><td>Nikki</td><td>143</td><td>144</td><td>146</td><td>147</td><td>145</td><td>143</td><td style="text-align:center; background-color: #fde5d2">144.7</td></tr>
</tbody>
</table>
<p>Alternatively, we could find the monthly standard deviation of the weights across Sarah, Tom and Nikki:</p>
<table>
<thead>
<tr><th>Name</th><th>Jan. Weight</th><th>Feb. Weight</th><th>Mar. Weight</th><th>Apr. Weight</th><th>May Weight</th><th>June Weight</th></tr></thead>
<tbody>
<tr><td>Sarah</td><td>135</td><td>134</td><td>138</td><td>142</td><td>144</td><td>145</td></tr>
<tr><td>Tom</td><td>196</td><td>192</td><td>182</td><td>183</td><td>184</td><td>181</td></tr>
<tr><td>Nikki</td><td>143</td><td>144</td><td>146</td><td>147</td><td>145</td><td>143</td></tr>
<tr style="text-align:center; background-color: #fde5d2"><td>Monthly Std. Dev.</td><td>33.2</td><td>31.0</td><td>23.4</td><td>22.4</td><td>22.8</td><td>21.4</td></tr>
</tbody>
</table>
<h2 id="how-to-use-the-aggregate-function">How to Use The Aggregate Function</h2>
<p>The <code>aggregate</code> function takes two required inputs:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">x_a = aggregate(x, method);</code></pre>
<hr>
<dl>
<dt>x</dt>
<dd>NxK data matrix, must have group identifiers in the first column.</dd>
<dt>method</dt>
<dd>String, method to use. Valid methods include: <code>"mean"</code>, <code>"median"</code>, <code>"mode"</code>, <code>"max"</code>, <code>"min"</code>, <code>"sd"</code>, <code>"sum"</code>, <code>"variance"</code>.</dd>
</dl>
<hr>
<h3 id="the-input-data-matrix">The Input Data Matrix</h3>
<p>The <code>aggregate</code> function requires the data matrix input to:</p>
<ul>
<li>Have numerical group identifiers in the first column.</li>
<li>Be in stacked panel data format.</li>
</ul>
<p>Let's consider our example dataset from above. In order to use this data as an input to the GAUSS <code>aggregate</code> function we need to: </p>
<ul>
<li>Recode our group identifiers from names to numbers.</li>
<li>Stack our data into a pooled dataset.</li>
</ul>
<table>
<thead>
<tr><th>Name</th><th>Jan. Weight</th><th>Feb. Weight</th><th>Mar. Weight</th><th>Apr. Weight</th><th>May Weight</th><th>June Weight</th></tr></thead>
<tbody>
<tr><td>Sarah</td><td>135</td><td>134</td><td>138</td><td>142</td><td>144</td><td>145</td></tr>
<tr><td>Tom</td><td>196</td><td>192</td><td>182</td><td>183</td><td>184</td><td>181</td></tr>
<tr><td>Nikki</td><td>143</td><td>144</td><td>146</td><td>147</td><td>145</td><td>143</td></tr>
</tbody>
</table>
<p>$$\text{Sarah} \rightarrow 1$$
$$\text{Tom} \rightarrow 2$$
$$\text{Nikki} \rightarrow 3$$</p>
<p>$$\Downarrow$$</p>
<table>
<thead>
<tr><th>Group</th><th>Jan. Weight</th><th>Feb. Weight</th><th>Mar. Weight</th><th>Apr. Weight</th><th>May Weight</th><th>June Weight</th></tr></thead>
<tbody>
<tr><td>1</td><td>135</td><td>134</td><td>138</td><td>142</td><td>144</td><td>145</td></tr>
<tr><td>2</td><td>196</td><td>192</td><td>182</td><td>183</td><td>184</td><td>181</td></tr>
<tr><td>3</td><td>143</td><td>144</td><td>146</td><td>147</td><td>145</td><td>143</td></tr>
</tbody>
</table>
<p>$$\Downarrow$$</p>
<table>
<thead>
<tr><th>Group</th><th>Month</th><th>Weight</th></tr></thead>
<tbody>
<tr><td>1</td><td>1</td><td>135</td></tr>
<tr><td>1</td><td>2</td><td>134</td></tr>
<tr><td>1</td><td>3</td><td>138</td></tr>
<tr><td>1</td><td>4</td><td>142</td></tr>
<tr><td>1</td><td>5</td><td>144</td></tr>
<tr><td>1</td><td>6</td><td>145</td></tr>
<tr><td>2</td><td>1</td><td>196</td></tr>
<tr><td>2</td><td>2</td><td>192</td></tr>
<tr><td>⁞</td><td>⁞</td><td>⁞</td></tr>
<tr><td>3</td><td>6</td><td>143</td></tr>
</tbody>
</table>
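<p>The recode-and-stack steps above can be sketched with basic matrix operations. The <code>wide</code> matrix below is simply the recoded example table:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Recoded wide data: group id, then six monthly weights
wide = { 1 135 134 138 142 144 145,
         2 196 192 182 183 184 181,
         3 143 144 146 147 145 143 };

// Repeat each id 6 times, tile the month numbers, and stack the weights
grp = vecr(wide[., 1] .* ones(1, 6));
month = vecr(ones(rows(wide), 1) .* seqa(1, 1, 6)');
w = vecr(wide[., 2:7]);

// 18 x 3 long-form matrix: group, month, weight
long = grp ~ month ~ w;</code></pre>
<p>Recent GAUSS versions also provide the <code>dfLonger</code> procedure for reshaping dataframes from wide to long form.</p>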
<h3 id="the-method-input">The Method Input</h3>
<p>The <em>method</em> input into the <code>aggregate</code> function should always be a string indicating which statistic you wish to compute. </p>
<p>Each method works on groups within the panel the same way the analogous pooled data function would work, including its handling of missing values.</p>
<table>
<thead>
<tr><th>Method</th><th>Pooled Function</th></tr></thead>
<tbody>
<tr><td>mean</td><td><code>meanc</code></td></tr>
<tr><td>median</td><td><code>median</code></td></tr>
<tr><td>mode</td><td><code>modec</code></td></tr>
<tr><td>max</td><td><code>maxc</code></td></tr>
<tr><td>min</td><td><code>minc</code></td></tr>
<tr><td>sum</td><td><code>sumc</code></td></tr>
<tr><td>sd</td><td><code>stdc</code></td></tr>
<tr><td>variance</td><td><code>varCovXS</code></td></tr>
</tbody>
</table>
<h3 id="example-of-how-to-use-aggregate">Example of How to Use Aggregate</h3>
<p>Let's use <code>aggregate</code> to find the means by group for weight data:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">weights = { 1 1 135,
            1 2 134,
            1 3 138,
            1 4 142,
            1 5 144,
            1 6 145,
            2 1 196,
            2 2 192,
            2 3 182,
            2 4 183,
            2 5 184,
            2 6 181,
            3 1 143,
            3 2 144,
            3 3 146,
            3 4 147,
            3 5 145,
            3 6 143 };

/*
** Find the mean by person.
** We will use the first column
** as the group indicator and will find
** the mean of the weights.
*/
print aggregate(weights[., 1 3], "mean");</code></pre>
<p>This prints the group means to the output window:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">1   139.67 
2   186.33 
3   144.67</code></pre>
<p>We can also use the month identifiers to find the sample standard deviation by month:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Find the standard deviation by month.
** We will use the second column of weights
** as the group indicator and will find
** the standard deviation of the weights.
*/
print aggregate(weights[., 2 3], "sd");</code></pre>
<p>Now the standard deviations, along with their associated months will be printed to the output window:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">1   33.151 
2   31.005 
3   23.438 
4   22.368 
5   22.811 
6   21.385</code></pre>
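<p>These sample standard deviations can likewise be verified by hand. The plain-Python sketch below (illustrative; not the GAUSS workflow) groups the same data by month and applies the sample standard deviation:</p>

```python
# Cross-check of the "sd" aggregation by month in plain Python (illustrative).
from statistics import stdev  # sample standard deviation (n - 1 denominator)

# month -> the three persons' weights in that month, read from the matrix above
months = {
    1: [135, 196, 143], 2: [134, 192, 144], 3: [138, 182, 146],
    4: [142, 183, 147], 5: [144, 184, 145], 6: [145, 181, 143],
}

month_sd = {m: round(stdev(obs), 3) for m, obs in months.items()}
for m, s in month_sd.items():
    print(m, s)  # matches the six values in the GAUSS output above
```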
<h2 id="using-aggregate-to-examine-trends-in-current-account-balances">Using Aggregate to Examine Trends in Current Account Balances</h2>
<p>Our simple example dataset is useful for demonstrating the basics of the <code>aggregate</code> function. However, a real-world panel dataset better demonstrates its true power. In this section, we will use <code>aggregate</code> to examine some of the trends in international current account balances.</p>
<h3 id="the-data">The Data</h3>
<p>We will use the current account balance measured as a percentage of GDP. This unbalanced panel dataset is a modified version of a dataset from the International Monetary Fund. It spans 1953-Q3 to 2019-Q4 and covers 46 countries across 5 world regions.</p>
<p>It contains the following variables:</p>
<table>
<thead>
<tr>
<th>Variable name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Country</td>
<td>String, name of the country.</td>
</tr>
<tr>
<td>Country ID</td>
<td>Integer country identifier.</td>
</tr>
<tr>
<td>World Region</td>
<td>String, name of the corresponding world region.</td>
</tr>
<tr>
<td>Region ID</td>
<td>Integer world region identifier.</td>
</tr>
<tr>
<td>Time</td>
<td>String, the date of the observation.</td>
</tr>
<tr>
<td>CAB</td>
<td>Decimal numeric, the Current Account Balance.</td>
</tr>
</tbody>
</table>
<h3 id="mean-and-median-current-account-balances-by-country">Mean and Median Current Account Balances By Country</h3>
<p>We first examine how the mean and median current account balances vary across countries. Using <code>aggregate</code>, we compute both statistics for each country in the panel across all of its observations.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");

// Compute mean and median current
// account balances by Country ID
mean_cab_cid = aggregate(X, "mean");
median_cab_cid = aggregate(X, "median");</code></pre>
<p>After running the above code, <code>mean_cab_cid</code> and <code>median_cab_cid</code> will both be $46\times2$ matrices. Each element in the first column will be a unique country ID, and the corresponding element in the second column will be that country's mean or median current account balance.</p>
<p>We include a graph of this data below, which shows that Germany has the highest average current account balance, while Finland has the lowest. </p>
<p><a href="https://www.aptech.com/wp-content/uploads/2019/11/current-acct-bal-by-ctry.png"><img src="https://www.aptech.com/wp-content/uploads/2019/11/current-acct-bal-by-ctry.png" alt="" width="600" height="1062" class="aligncenter size-full wp-image-21194" /></a></p>
<h3 id="mean-and-median-current-account-balances-by-region">Mean and Median Current Account Balances By Region</h3>
<p>We can similarly consider the mean and median current account balances across geographical regions with the code below.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");

// Compute mean and median current
// account balances by world region
mean_cab_wreg = aggregate(X, "mean");
median_cab_wreg = aggregate(X, "median");</code></pre>
<p><code>mean_cab_wreg</code> and <code>median_cab_wreg</code> will be two column matrices with unique world region IDs in the first column and the corresponding statistics in the second column.</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2019/11/current-act-bal-by-region.png"><img src="https://www.aptech.com/wp-content/uploads/2019/11/current-act-bal-by-region.png" alt="" width="600" height="300" class="aligncenter size-full wp-image-21195" /></a></p>
<h3 id="mean-and-median-current-account-balances-time-series">Mean and Median Current Account Balances Time Series</h3>
<p>Finally, we consider how the mean and median current account balances vary across time in the <a href="https://www.aptech.com/resources/tutorials/time-series-plots/" target="_blank" rel="noopener">time series plot</a> below.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">/*
** Load the 'Time' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string
** variable that we want GAUSS to convert to a date.
*/
X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");

mean_cab_date = aggregate(X, "mean");
median_cab_date = aggregate(X, "median");</code></pre>
<p>This time the first column of our resulting matrices, <code>mean_cab_date</code> and <code>median_cab_date</code>, will contain each unique date from our dataset. The second column will contain the statistic computed for each unique date.</p>
<hr>
<center>Interested in learning more about loading dates in GAUSS? <br><a href="https://www.aptech.com/blog/reading-dates-and-times-in-gauss/" target="_blank" rel="noopener">Check out this tutorial to learn more</a>.</center>
<hr>
<p>Below is a graph of the Current Account Balance data grouped by quarter.</p>
<p><a href="https://www.aptech.com/wp-content/uploads/2019/11/current-acct-bal-ts-1.jpg"><img src="https://www.aptech.com/wp-content/uploads/2019/11/current-acct-bal-ts-1.jpg" alt="" width="600" height="300" class="aligncenter size-full wp-image-21198" /></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>In today's blog, we examined the fundamentals of the <code>aggregate</code> procedure. After reading you should have a better understanding of:</p>
<ol>
<li>The basics of the <code>aggregate</code> function.</li>
<li>How to use the <code>aggregate</code> function.</li>
<li>How to examine trends in real-world panel data using <code>aggregate</code>.</li>
</ol>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/" target="_blank" rel="noopener">Panel data, structural breaks and unit root testing</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">Panel Data Basics: One-way Individual Effects</a></li>
<li><a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">Introduction to the Fundamentals of Panel Data</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/" target="_blank" rel="noopener">Panel Data Stationarity Test With Structural Breaks</a></li>
<li><a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a></li>
</ol>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Panel Data Basics: One-way Individual Effects</title>
		<link>https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/</link>
					<comments>https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/#respond</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Mon, 15 Apr 2019 02:54:59 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=20046</guid>

					<description><![CDATA[In this blog, we examine one of the fundamentals of <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/">panel data</a> analysis, the one-way error component model. We cover the theoretical background of the one-way error component model, we examine the fixed-effects and random-effects models, and provide an empirical example of both.]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>In this blog, we examine one of the fundamentals of <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a> analysis, the one-way error component model. Today we will:</p>
<ul>
<li>Explain the theoretical one-way error component model.</li>
<li>Consider fixed effects vs. random effects. </li>
<li>Estimate models using an empirical example. </li>
</ul>
<h2 id="the-theoretical-one-way-error-component-model">The theoretical one-way error component model</h2>
<p>The one-way error component model is a panel data model that allows for individual-specific or time-specific error components:</p>
<p>$$ \begin{equation}y_{it} = \alpha + X_{it} \beta + u_{it} \label{OWEM}\end{equation}$$
$$ u_{it} = \mu_{i} + \nu_{it} $$</p>
<p>where the subscript <i>i</i> indicates cross-sections of households, individuals, firms, countries, etc. and the subscript <i>t</i> indicates time periods. </p>
<p>In this model, the individual-specific error component, $\mu_{i}$, captures any unobserved effects that are different across individuals but fixed across time. </p>
<table>
<colgroup>
       <col span="1" style="width: 30%;">
       <col span="1" style="width: 70%;">
    </colgroup>
<tr>
<th colspan="2">The one-way error component model</th>
</tr>
<tr>
<td style="padding-left: 10px"><b>$\alpha$</b></td><td>Parameter measuring the intercept, which is constant across all individuals and time periods.</td>
</tr>
<tr>
<td style="padding-left: 10px"><b>$\beta$</b></td><td>Parameter of interest measuring the effect of <i>x</i> on <i>y</i>. It is constant across all individuals and time periods.</td>
</tr>
<tr>
<td style="padding-left: 10px"><b>$\mu_i$</b></td><td>Individual-specific variation in <i>y</i> which stays constant across time for each individual.<br>In the <b>fixed effects</b> model this is an individual-specific effect to be estimated.<br>In the <b>random effects</b> model this follows a random distribution with parameters that must be estimated.</td>
</tr>
<tr>
<td style="padding-left: 10px"><b>$\nu_{it}$</b></td><td>Usual stochastic regression disturbance which varies across time and individuals.</td>
</tr>
</table>
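<p>To make the error decomposition concrete, the short plain-Python simulation below draws individual effects and idiosyncratic disturbances and forms the composite error $u_{it} = \mu_i + \nu_{it}$. It is an illustration only (the variance choices are arbitrary), not GAUSS code:</p>

```python
# Simulating the one-way error component u_it = mu_i + nu_it (illustrative;
# the standard deviations chosen here are arbitrary).
import random

random.seed(0)
N, T = 4, 3  # individuals, time periods

mu = [random.gauss(0, 1) for _ in range(N)]                        # individual effects
nu = [[random.gauss(0, 0.5) for _ in range(T)] for _ in range(N)]  # idiosyncratic noise
u = [[mu[i] + nu[i][t] for t in range(T)] for i in range(N)]       # composite error

# mu_i is constant across t: subtracting nu_it recovers the same mu_i each period
for i in range(N):
    for t in range(T):
        assert abs((u[i][t] - nu[i][t]) - mu[i]) < 1e-12
```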
<h2 id="fixed-effects-vs-random-effects">Fixed effects vs. random effects</h2>
<p>The two most common approaches to modeling individual-specific error components are the fixed effects model and the random effects model. </p>
<p>The key difference between these two approaches is how we believe the individual error component behaves.</p>
<h3 id="the-fixed-effects-model">The fixed effects model</h3>
<p>In the fixed effects model the individual error component:</p>
<ul>
<li>Can be thought of as an individual-specific intercept term. </li>
<li>Captures any omitted variables that are not included in the regression.</li>
<li>Is correlated with other variables included in the model. </li>
</ul>
<p>Given these assumptions, the fixed effects model can be thought of as a pooled OLS model with individual-specific intercepts:</p>
<p>$$\begin{equation}y_{it} = \delta_{i} + X_{it} \beta  + \nu_{it}\label{FEM}\end{equation}$$</p>
<p>The intercept term, $\delta_i$, varies across individuals but is constant across time for each individual. This term is composed of the constant intercept term, $\alpha$, and the individual-specific error terms, $\mu_i$. </p>
<p>The distinguishing feature of the fixed effects model is that $\delta_i$ has a true, but unobservable, effect which we must estimate. </p>
<h3 id="the-random-effects-model">The random effects model</h3>
<p>In the random effects model the individual-specific error component, $\mu_i$:</p>
<ul>
<li>Is distributed randomly and is independent of $\nu_{it}$.</li>
<li>Occurs in cases where individuals are drawn randomly from a large population, such as household studies (Baltagi, 2021).</li>
<li>Is assumed to be uncorrelated with all other variables in the model.</li>
<li>Impacts the model through the covariance structure of the error term. </li>
</ul>
<p>For example, consider the total error disturbance in the model, $ u_{it} = \mu_{i} + \nu_{it} $.  The covariance of the error at time <i>t</i> and time <i>s</i> depends on the variance of both $\mu_{i}$ and $\nu_{it}:$ </p>
<p>$$\begin{equation}cov(u_{it}, u_{is}) = \left\{ \begin{array}{ll} \sigma_{\mu}^2 & \text{for } t \neq s \\ \sigma_{\mu}^2 + \sigma_{\nu}^2   & \text{for } t = s \\ \end{array} \right. \label{REM}\end{equation} $$</p>
<p>The distinguishing feature of the random effects model is that $\mu_i$ does not have a true value but rather follows a random distribution with parameters that we must estimate. </p>
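<p>The covariance structure above can be made concrete by building $\Omega = E[u_i u_i']$ for a single individual. The plain-Python sketch below is illustrative; the variance values are arbitrary placeholders:</p>

```python
# Building Omega for one individual's T x 1 error vector (illustrative;
# sigma_mu2 and sigma_nu2 are arbitrary assumed values).
T = 4
sigma_mu2 = 0.5   # Var(mu_i), the individual effect
sigma_nu2 = 1.0   # Var(nu_it), the idiosyncratic disturbance

# Off-diagonal entries are sigma_mu2; diagonal entries add sigma_nu2
Omega = [[sigma_mu2 + (sigma_nu2 if t == s else 0.0) for s in range(T)]
         for t in range(T)]
```

Every off-diagonal entry equals $\sigma_{\mu}^2$ and every diagonal entry equals $\sigma_{\mu}^2 + \sigma_{\nu}^2$, matching the covariance expression above.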
<h2 id="estimation">Estimation</h2>
<h3 id="the-fixed-effects-model-1">The fixed effects model</h3>
<p>In the fixed effects model, the individual effects introduce an endogeneity that will result in biased estimates if not properly accounted for. </p>
<p>Fortunately, we can make consistent estimates using one of three estimation techniques:</p>
<ol>
<li>Within-group estimation</li>
<li>First differences estimation</li>
<li>Least squares dummy variable (LSDV) estimation</li>
</ol>
<p>The first two of these techniques focus on eliminating the individual effects before estimation. The LSDV method directly incorporates these effects using dummy variables.</p>
<div style="overflow: scroll">
<table>
<tr>
<th></th><th>Within-group estimator</th><th>LSDV estimator</th><th>First differences estimator</th>
</tr>
<tr>
<td><b>Data<br> transformation</b></td><td>Demean the data.</td><td>Use dummy variables.</td><td>Difference the data.</td>
</tr>
<tr><td><b>Regression equation</b></td><td>$$\widetilde{Y_i} = \widetilde{X_i} \beta_{fe} + \widetilde{\nu_i} $$</td><td>$$Y_{it} = X_{it} \beta_{fe} +\\ \alpha D_{i} + \nu_{it}$$</td><td>$$\Delta{Y}_{it} = \Delta{X}_{it} \beta_{fe} + \Delta{\nu}_{it} $$</td></tr>

</table>
</div>
<p><strong>Let's consider an example panel dataset</strong> with three individuals and three time periods shown in the table below. </p>
<table style="width: 100%;">
 <colgroup>
       <col span="1" style="width: 10%;">
       <col span="1" style="width: 10%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 20%;">
<col span="1" style="width: 20%;">
<col span="1" style="width: 20%;">
    </colgroup>
<tr>
<th>Individual</th><th>Time Period</th><th>Y<sub>it</sub></th><th>Within Group Ave. Y<sub>i</sub></th><th>X<sub>it</sub></th><th>Within Group Ave. X<sub>i</sub></th>
</tr>
 <tr>
<td style="text-align:center">1</td><td style="text-align:center">1</td><td style="text-align:center">3.901</td><td style="text-align:center">2.744</td><td style="text-align:center">0.978</td><td style="text-align:center">1.174</td>
</tr>
<tr>
<td style="text-align:center">1</td><td style="text-align:center">2</td><td style="text-align:center">2.345</td><td style="text-align:center">2.744</td><td style="text-align:center">1.798</td><td style="text-align:center">1.174</td>
</tr>
<tr>
<td style="text-align:center">1</td><td style="text-align:center">3</td><td style="text-align:center">1.987</td><td style="text-align:center">2.744</td><td style="text-align:center">0.745</td><td style="text-align:center">1.174</td>
</tr>

 <tr style="background-color: #F5F5F5;">
<td style="text-align:center">2</td><td style="text-align:center">1</td><td style="text-align:center">1.250</td><td style="text-align:center">1.715</td><td style="text-align:center">1.652</td><td style="text-align:center">1.425</td>
</tr>
<tr style="background-color: #F5F5F5;">
<td style="text-align:center">2</td><td style="text-align:center">2</td><td style="text-align:center">0.654</td><td style="text-align:center">1.715</td><td style="text-align:center">0.438</td><td style="text-align:center">1.425</td>
</tr>
<tr style="background-color: #F5F5F5;">
<td style="text-align:center">2</td><td style="text-align:center">3</td><td style="text-align:center">3.240</td><td style="text-align:center">1.715</td><td style="text-align:center">2.185</td><td style="text-align:center">1.425</td>
</tr>
 <tr>
<td style="text-align:center">3</td><td style="text-align:center">1</td><td style="text-align:center">0.901</td><td style="text-align:center">2.077</td><td style="text-align:center">2.119</td><td style="text-align:center">1.653</td>
</tr>
<tr>
<td style="text-align:center">3</td><td style="text-align:center">2</td><td style="text-align:center">1.341</td><td style="text-align:center">2.077</td><td style="text-align:center">1.516</td><td style="text-align:center">1.653</td>
</tr>
<tr>
<td style="text-align:center">3</td><td style="text-align:center">3</td><td style="text-align:center">3.989</td><td style="text-align:center">2.077</td><td style="text-align:center">1.324</td><td style="text-align:center">1.653</td>
</tr>
</table>
<p><strong> Example within-group estimation </strong><br />
We will estimate the fixed effects model using the within-group method. This can be done in three steps:</p>
<ol>
<li>Find the within-subject means. </li>
<li>Demean the dependent and independent variables using the within-subject means.</li>
<li>Run a linear regression using the demeaned variables.  </li>
</ol>
<p><strong> Finding the within-subject means  </strong><br />
To find the within-subject mean of Y for individual one we compute:</p>
<p>$$ \bar{Y_{1}} = \frac{(3.901 + 2.345 + 1.987)}{3} = 2.7443 .$$</p>
<p>We can find the within-subject means using the <code>withinMeans</code> procedure from the <a href="https://github.com/aptech/gauss-panel-library" target="_blank" rel="noopener">pdlib</a> library. The <code>withinMeans</code> procedure requires two inputs:</p>
<hr />
<dl>
<dt>grps</dt>
<dd>(T*N) x 1 matrix,  group identifier.</dd>
<dt>data</dt>
<dd>(T*N) x k,  panel data.</dd>
</dl>
<hr />
<div class="alert alert-info" role="alert">The pdlib library is available for free and can be directly installed using the <a href="https://www.aptech.com/blog/gauss-package-manager-basics/" target="_blank" rel="noopener">GAUSS Package Manager</a>.</div>
<p>Using our sample data stored in the GAUSS data file <a href="https://github.com/aptech/gauss_blog/blob/master/econometrics/individual-effects-4.15.19/simple_data.dat" target="_blank" rel="noopener">simple_data.dat</a>: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Load data
data = loadd("simple_data.dat");

// Assign groups variable
grps = data[., 1];

// Assign y~x matrix
reg_data = data[.,3:4];

// Find group means
grp_means = withinMeans(grps, reg_data);

print "Group means for Y and X:";
grp_means;</code></pre>
<p>Our output reads:</p>
<pre>Group means for Y and X:

 2.7443  1.1737
 1.7147  1.4250
 2.0770  1.6530 </pre>
<p><strong> Demeaning the data </strong><br />
The next step is to demean the data. This removes any time-invariant effects. After finding the within-subject means, the data is demeaned:   </p>
<p>$$ \widetilde{Y_1} = Y_{1t} - \overline{Y}_1 =\\ 3.901 - 2.744 = 1.157,\\ 2.345 - 2.744 = -0.399,\\ 1.987 - 2.744 = -0.757 .$$</p>
<p>In <strong>GAUSS</strong> we can demean data using the <code>demeanData</code> procedure from the pdlib library. The <code>demeanData</code> procedure requires two inputs:</p>
<hr />
<dl>
<dt>grps</dt>
<dd>(T*N) x 1 matrix,  group identifier.</dd>
<dt>data</dt>
<dd>(T*N) x k,  panel data.</dd>
</dl>
<hr />
<p>The <code>demeanData</code> procedure internally computes the within-subject means and requires just the <code>reg_data</code> and <code>grps</code> variables that we created in the first step: </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Remove time-invariant group means
data_tilde = demeanData(grps, reg_data);

print "Demeaned data:";
data_tilde;
print;</code></pre>
<p>Our demeaned data is printed in the output:</p>
<pre>Demeaned data:

 1.1567 -0.1957
-0.3993  0.6243
-0.7573 -0.4287
-0.4647  0.2270
-1.0607 -0.9870
 1.5253  0.7600
-1.1760  0.4660
-0.7360 -0.1370
 1.9120 -0.3290 </pre>
<p><strong> Performing the regression </strong><br />
Once we have transformed our <em>x</em> and <em>y</em> data we are ready to estimate the parameters of the fixed effects regression model:</p>
<p>$$\widetilde{Y_i} = \widetilde{X_i} \beta_{fe} + \widetilde{\nu_i} $$</p>
<p>where </p>
<p>$$\widehat{\beta}_{fe} = (\widetilde{X_i}'\widetilde{X_i})^{-1}(\widetilde{X_i}'\widetilde{Y_i}) .$$</p>
<p>Using the data we previously demeaned:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Extract variables
y_tilde = data_tilde[., 1];
x_tilde = data_tilde[., 2];

// Regress independent on dependent variables
coeff = inv(x_tilde'x_tilde)*(x_tilde'y_tilde);

// Print the fixed effects coefficient
print "Fixed effects coefficient:";
coeff;</code></pre>
<p>The result reads:</p>
<pre>Fixed effects coefficient:
 0.3413 </pre>
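<p>All three steps of the within-group method can be cross-checked outside of GAUSS. The plain-Python sketch below (illustrative; the blog's workflow uses the pdlib procedures) reruns the computation on the nine observations from the example table:</p>

```python
# Within-group (fixed effects) estimation from the example table, in plain
# Python (illustrative cross-check of the three steps above).
ids = [1] * 3 + [2] * 3 + [3] * 3
y = [3.901, 2.345, 1.987, 1.250, 0.654, 3.240, 0.901, 1.341, 3.989]
x = [0.978, 1.798, 0.745, 1.652, 0.438, 2.185, 2.119, 1.516, 1.324]

# Step 1: within-subject means
groups = sorted(set(ids))
y_bar = {g: sum(yi for yi, i in zip(y, ids) if i == g) / ids.count(g) for g in groups}
x_bar = {g: sum(xi for xi, i in zip(x, ids) if i == g) / ids.count(g) for g in groups}

# Step 2: demean both variables using the within-subject means
y_t = [yi - y_bar[i] for yi, i in zip(y, ids)]
x_t = [xi - x_bar[i] for xi, i in zip(x, ids)]

# Step 3: OLS slope on the demeaned data
beta_fe = sum(a * b for a, b in zip(x_t, y_t)) / sum(a * a for a in x_t)
print(round(beta_fe, 4))  # 0.3413
```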
<p><strong> Using the fixedEffects procedure </strong><br />
As an alternative to computing these three steps separately, we can use the <code>fixedEffects</code> procedure from the GAUSS panel data library, <code>pdlib</code>. This procedure runs all three steps in a single call. The <code>fixedEffects</code> procedure takes four inputs:</p>
<hr />
<dl>
<dt>y</dt>
<dd>(T*N) x 1 matrix,  the panel of stacked dependent variables.</dd>
<dt>x</dt>
<dd>(T*N) x k matrix,  the panel of stacked independent variables.</dd>
<dt>grps</dt>
<dd>(T*N) x 1 matrix,  group identifier.</dd>
<dt>robust</dt>
<dd>Scalar,  an indicator variable of whether to use robust standard errors.</dd>
</dl>
<hr />
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Use fixedEffects procedure
call fixedEffects(reg_data[.,1], reg_data[.,2], grps, 1);</code></pre>
<p>This prints:</p>
<pre>------------------- FIXED EFFECTS (WITHIN) RESULTS -------------------

Observations          :  9
Number of Groups      :  3
Degrees of freedom    :  2
R-squared             :  0.026
Adj. R-squared        :  -0.558
Residual SS           :  11.021
Std error of est      :  1.485
Total SS (corrected)  :  11.319
F                     =  0.054        with 1,2 degrees of freedom
P-value               =  0.838

Variable            Coef.       Std. Error       t-Stat       P-Value
----------------------------------------------------------------------
X1                0.341276       1.011041       0.337549       0.768</pre>
<h3 id="the-random-effects-model-1">The random effects model</h3>
<p>The covariance structure of the random effects model means that pooled OLS will result in inefficient estimates. Instead, the random effects model is estimated using pooled <a href="https://www.aptech.com/blog/using-feasible-generalized-least-squares-to-improve-estimates/" target="_blank" rel="noopener">feasible generalized least squares</a>. </p>
<p>The pooled FGLS method estimates the model </p>
<p>$$\widetilde{Y_i} = \widetilde{W_i} \delta_{re} + \widetilde{\epsilon_i}$$</p>
<p>where the data is transformed using $\Omega = E[\epsilon_i \epsilon_i']$ </p>
<p>$$\widetilde{Y_i} = \Omega^{-\frac{1}{2}}Y_{i},$$
$$\widetilde{W_i} = \Omega^{-\frac{1}{2}}W_{i},$$
$$\widetilde{\epsilon_i} = \Omega^{-\frac{1}{2}}\epsilon_{i},$$</p>
<p>and</p>
<p>$$W_i = [1, X_i],$$
$$\delta = [\alpha, \beta']',$$
$$\epsilon_i = \mu_i i_T + \nu_i .$$</p>
<p>The most difficult part of estimating this model is estimating $\Omega$, and a number of different methods have been proposed. </p>
<p><strong> Example random effects estimation </strong><br />
One of the most common approaches for estimating the random effects model:</p>
<ol>
<li>Estimates the between-group regression to obtain $\sigma_{\mu}^2$.</li>
<li>Estimates the within-group regression to obtain $\sigma_{\nu}^2$.</li>
<li>Transforms the data using $\sigma_{\mu}^2$ and $\sigma_{\nu}^2$.  </li>
<li>Finds the pooled OLS estimator using the transformed data. </li>
</ol>
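<p>Step 3 is usually carried out by quasi-demeaning, i.e. subtracting a fraction $\theta$ of the within-subject mean from each observation, $y_{it} - \theta \bar{y}_i$. The plain-Python sketch below uses the standard balanced-panel formula for $\theta$; it is illustrative only and not pdlib's internal implementation:</p>

```python
# Quasi-demeaning weight for random effects GLS (standard balanced-panel
# formula; illustrative sketch, not pdlib's internal code).
from math import sqrt

def re_theta(sigma_mu2, sigma_nu2, T):
    """Weight theta in the transform y_it - theta * ybar_i."""
    return 1.0 - sqrt(sigma_nu2 / (sigma_nu2 + T * sigma_mu2))

# With no individual effect, theta = 0 and random effects collapses to pooled OLS
print(re_theta(0.0, 1.0, 5))  # 0.0
# With a large individual effect, theta approaches 1 (close to the within transform)
print(re_theta(100.0, 1.0, 5))
```

The two limiting cases show how random effects sits between pooled OLS ($\theta = 0$) and the fixed effects within transform ($\theta \to 1$).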
<p>We can perform these steps in one procedure call using the <code>randomEffects</code> procedure in the <code>pdlib</code> GAUSS library.</p>
<p><strong> Using the randomEffects procedure  </strong><br />
The <code>randomEffects</code> procedure takes four inputs:</p>
<hr />
<dl>
<dt>y</dt>
<dd>(T*N) x 1 matrix,  the panel of stacked dependent variables.</dd>
<dt>x</dt>
<dd>(T*N) x k matrix,  the panel of stacked independent variables.</dd>
<dt>grps</dt>
<dd>(T*N) x 1 matrix,  group identifier.</dd>
<dt>robust</dt>
<dd>Scalar,  an indicator variable of whether to use robust standard errors.</dd>
</dl>
<hr />
<p>Continuing with our fixed effects example, we will use our sample data stored in the GAUSS data file simple_data.dat.</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Use randomEffects procedure
call randomEffects(reg_data[., 1], reg_data[., 2], grps, 1);</code></pre>
<pre>---------------------- GLS RANDOM EFFECTS RESULTS  ----------------------

Observations          :  9
Number of Groups      :  3
Degrees of freedom    :  2
R-squared             :  0.004
Adj. R-squared        :  -2.985
Residual SS           :  12.907
Std error of est      :  1.358
Total SS (corrected)  :  12.956
F                     =  3.314        with 2,2 degrees of freedom
P-value               =  0.232

Variable            Coef.       Std. Error       t-Stat       P-Value
----------------------------------------------------------------------
CONSTANT          1.994513       1.720996       1.158930       0.366
X1                0.129940       1.053423       0.123350       0.913</pre>
<h2 id="conclusion">Conclusion</h2>
<p>In today's blog we have covered the fundamentals of the individual error component models:</p>
<ul>
<li>The theoretical one-way error component model.</li>
<li>Fixed effects vs. random effects. </li>
<li>Estimating fixed effects and random effects. </li>
</ul>
<p>The code and data for this blog can be found at our Aptech Blog Github <a href="https://github.com/aptech/gauss_blog/tree/master/econometrics/individual-effects-4.15.19" target="_blank" rel="noopener">code repository</a>.</p>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/get-started-with-panel-data-in-gauss-video/" target="_blank" rel="noopener">Getting Started with Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/" target="_blank" rel="noopener">Panel data, structural breaks and unit root testing</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">Introduction to the Fundamentals of Panel Data</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/" target="_blank" rel="noopener">Panel Data Stationarity Test With Structural Breaks</a></li>
<li><a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a></li>
</ol>
<h3 id="references">References</h3>
<p><a href="https://link.springer.com/book/10.1007/978-3-030-53953-5" target="_blank" rel="noopener">Baltagi, B.</a> (2021). <i>Econometric analysis of panel data.</i> Springer.</p>
<p>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script></p>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Introduction to Difference-in-Differences Estimation</title>
		<link>https://www.aptech.com/blog/introduction-to-difference-in-differences-estimation/</link>
					<comments>https://www.aptech.com/blog/introduction-to-difference-in-differences-estimation/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Sat, 30 Mar 2019 13:52:33 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=19813</guid>

					<description><![CDATA[When policy changes or treatments are imposed on people, it is common and reasonable to ask how those people have been impacted. This is a more difficult question than it seems at first glance. In today's blog, we examine difference-in-differences (DD) estimation, a common tool for considering the impact of treatments on individuals.]]></description>
										<content:encoded><![CDATA[<h3 id="introduction">Introduction</h3>
<p>When policy changes or treatments are imposed on people, it is common and reasonable to ask how those people have been impacted. This is a more difficult question than it seems at first glance. </p>
<p>In order to truly know how those individuals have been impacted, we need to consider how they would have fared had the policies or treatments not taken place. However, the changes did take place, and we don't get to observe how those individuals would fare without them. </p>
<p>In today's blog, we examine difference-in-differences (DD) estimation, a common tool for considering the impact of treatments on individuals. We will consider:</p>
<ul>
<li>What is difference-in-differences (DD) estimation?</li>
<li>How does DD work? </li>
<li>A simple DD example.</li>
</ul>
<h2 id="what-is-difference-in-differences-estimation">What is difference-in-differences estimation?</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2019/03/gblog-difference-in-differences-march-2019.png"><img src="https://www.aptech.com/wp-content/uploads/2019/03/gblog-difference-in-differences-march-2019.png" alt="Difference-in-difference method plot." width="600" height="400" class="aligncenter size-full wp-image-19915" /></a></p>
<p>Difference-in-differences estimation attempts to measure the effects of a sudden change in the economic environment, policy, or general treatment on a group of individuals. </p>
<p>The DD model includes several pieces:</p>
<ul>
<li>A sudden <b>exogenous source of variation</b>, which we will refer to as the treatment. Treatment examples include changes in <a href="http://davidcard.berkeley.edu/papers/njmin-aer.pdf">minimum wage</a>, a new <a href="https://scholarship.law.duke.edu/cgi/viewcontent.cgi?article=3007&amp;context=faculty_scholarship">workplace non-discrimination policy</a>, or a new <a href="https://www.sciencedirect.com/science/article/pii/S0301421511004502">CO<sub>2</sub> emissions tax</a>. </li>
<li>A quantifiable and measurable <b>outcome</b> which is either the direct target of the variation or an indirect proxy. </li>
<li>A <b>treatment group</b> which is subjected to the change.</li>
<li>A <b>control group</b> which is similar in characteristics to the treatment group but is not subjected to the change. </li>
</ul>
<p>DD uses the outcome of the control group as a proxy for what would have occurred in the treatment group had there been no treatment. The difference in the average post-treatment outcomes between the treatment and control groups is then used to measure the <b>treatment effects</b>. </p>
<h3 id="example-case">Example case</h3>
<p>Let's consider a simple example. Suppose we have two professors of introductory econometrics classes, one at Transylvania University (TU) and one at The University of Azkaban (UA). Both professors have decided to <a href="https://www.aptech.com/industry-solutions/gauss-in-education/gauss-in-the-classroom/">use GAUSS to teach</a> a year-long series of econometrics courses.  </p>
<p>A quarter through the year, the class at TU takes advantage of a free GAUSS training session while the class at UA does not. </p>
<p>We can compare the grades on the GAUSS homework assignments by the students at each university before and after the training date to measure the benefit of the training session. </p>
<table style="width: 100%">
<tr>
<th>Treatment</th><td>Aptech GAUSS training course</td>
</tr>
<tr>
<th>Control Group</th><td>Students using GAUSS at University of Azkaban</td>
</tr>
<tr>
<th>Treatment Group</th><td>Students using GAUSS at Transylvania University</td>
</tr>
<tr>
<th>Outcome</th><td>Grades on GAUSS homework assignments</td>
</tr>
</table>
<h2 id="how-does-it-work">How does it work?</h2>
<p>The DD estimate uses the between-group cross-sectional differences and within-group time-series differences to measure treatment effects. Used on its own, either the cross-sectional difference or the within-group time-series difference may produce a biased estimate of the treatment effect. </p>
<h3 id="dd-model-outline">DD Model Outline</h3>
<p>Let's look more formally at our example to better understand how DD works. First, we define our two outcomes</p>
<p>$$Y_{1,i,c,t} = \text{homework grades by student } i \text{, in}\\ \text{ class } c \text{, in period } t \text{ with training course} $$
$$Y_{0,i,c,t} = \text{homework grades by student } i \text{, in}\\ \text{ class } c \text{, in period } t \text{ without training course} $$</p>
<p>where <i>i</i> is an individual student, <i>c</i> is the class and <i>t</i> is the time period. </p>
<div class="alert alert-info" role="alert"><strong>Note:</strong> These are just theoretical outcomes -- empirically we only get to observe one or the other. For example, once the TU students take the course we cannot observe their homework grades without the course in the time periods after the course.</div>
<p>We begin by assuming that potential outcomes before training are determined by a time-invariant, university-specific effect, $\gamma_c$, and a university-invariant, time-specific effect, $\lambda_t$:</p>
<p>$$ E(Y _{0,i,c,t}| c, t) = \gamma_c + \lambda_t $$</p>
<p>Note that the time-invariant component, $\gamma_c$, depends only on the university that the student is in and is independent of the time period. Similarly, the university-invariant component, $\lambda_t$, is independent of the university and changes only with the time period. </p>
<p>Assuming that the training has a constant effect, $\beta$, on homework grades:</p>
<p>$$ E(Y _{1,i,c,t}| c, t) = \gamma_c + \lambda_t + \beta .$$</p>
<p>More generally we can express our outcomes as </p>
<p>$$Y _{i,c,t} = \gamma_c + \lambda_t + \beta D_{c,t} + \epsilon_{i,c,t}$$</p>
<p>where $D_{c,t}$ is a dummy variable representing classes that have received training and $E(\epsilon_{i,c,t}| c, t) = 0$.</p>
<p>Using this we compare the differences in outcomes for the individual classes across time:</p>
<p>$$E(Y _{i,c,t} | TU, \text{post-course}) - E(Y _{i,c,t} | TU, \text{pre-course}) =\\ \lambda_{\text{post-course}} + \beta - \lambda_{\text{pre-course}}$$</p>
<p>and </p>
<p>$$E(Y _{i,c,t} | UA, \text{post-course}) - E(Y _{i,c,t} | UA, \text{pre-course}) =\\ \lambda_{\text{post-course}} - \lambda_{\text{pre-course}} .$$</p>
<p>From here we are able to estimate the population difference-in-differences, which measures the treatment effect of interest:</p>
<p>$$ [E(Y _{i,c,t} | TU, \text{post-course}) - E(Y _{i,c,t} | TU, \text{pre-course})] -\\ [E(Y _{i,c,t} | UA, \text{post-course}) - E(Y _{i,c,t} | UA, \text{pre-course})] = \beta .$$</p>
<p>This is the key outcome of the difference-in-differences method.  We have eliminated the common trend between the groups, $\lambda_t$, and the permanent differences between the groups, $\gamma_c$, leaving a very simple estimate of the treatment effect, $\beta$.</p>
<h3 id="dd-assumptions">DD Assumptions</h3>
<p>DD estimation is an appealingly simple way to measure treatment effects. However, it relies on some key assumptions (Angrist and Pischke, 2008):</p>
<ul>
<li>Outcomes in the treatment group and the control group follow the same trend, $\lambda_t$.  </li>
<li>The treatment causes deviation, $\beta$, from the trend.</li>
<li>The differences in the treatment group and control group are captured by the fixed effects variables, $\gamma_c$.</li>
</ul>
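<p>The first two assumptions are what make the method work: any deviation from the shared trend in the treatment group is attributed to the treatment. A minimal numerical sketch of this logic (in Python rather than the post's GAUSS, with made-up values for $\gamma_c$, $\lambda_t$, and $\beta$) shows the group levels and the common trend cancelling out of the double difference:</p>

```python
# Hypothetical values (not from the post): group levels, common trend,
# and a constant treatment effect.
gamma = {"TU": 70.0, "UA": 75.0}   # time-invariant group effects
lam = {"pre": 0.0, "post": 2.0}    # group-invariant time effects
beta = 5.0                         # treatment effect

def mean_outcome(group, period):
    # Expected outcome: gamma_c + lambda_t, plus beta for the treated
    # group (TU) in the post-treatment period.
    treated = (group == "TU" and period == "post")
    return gamma[group] + lam[period] + (beta if treated else 0.0)

dd = (mean_outcome("TU", "post") - mean_outcome("TU", "pre")) \
   - (mean_outcome("UA", "post") - mean_outcome("UA", "pre"))
print(dd)  # 5.0 -- exactly beta; gamma and lambda cancel
```

<p>Note that if the two groups followed different trends, the cancellation above would fail and the DD estimate would absorb the trend difference along with $\beta$.</p>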
<h2 id="example">Example</h2>
<p><a href="https://www.aptech.com/wp-content/uploads/2019/03/gblog-difference-in-differences-class-example-2.png"><img src="https://www.aptech.com/wp-content/uploads/2019/03/gblog-difference-in-differences-class-example-2.png" alt="Plot of difference-in-differences estimation example." width="600" height="360" class="aligncenter size-full wp-image-20003" /></a></p>
<p>Let's look again at our econometrics students from TU and UA. The students earn the following grades on their assignments:</p>
<table style="width: 100%">
    <colgroup>
       <col span="1" style="width: 25%;">
       <col span="1" style="width: 37.5%;">
       <col span="1" style="width: 37.5%;">
    </colgroup>
<tr>
<th></th><th style="text-align:center">Pre-treatment<br> period</th><th style="text-align:center">Post-treatment<br> period</th>
</tr>
 <tr>
<th>Transylvania<br> University</th><td style="text-align:center">77, 82, 65, 68, 90,<br> 84, 67, 73, 84, 61</td><td style="text-align:center">76, 88, 73, 74, 94,<br> 88, 69, 78, 89, 71</td>
</tr>
 <tr>
<th>University<br> of Azkaban</th><td style="text-align:center">74, 63, 82, 70, 92,<br> 67, 66, 68, 87, 95</td><td style="text-align:center">72, 70, 84, 67, 92,<br> 70, 65, 65, 82, 96</td>
</tr>
</table>
<p>Now consider the averages and their differences:</p>
<table style="width: 100%">
    <colgroup>
       <col span="1" style="width: 25%;">
       <col span="1" style="width: 25%;">
       <col span="1" style="width: 25%;">
       <col span="1" style="width: 25%;">
    </colgroup>
<tr>
<th></th><th style="text-align:center">Pre-treatment<br> period</th><th style="text-align:center">Post-treatment<br> period</th><th style="text-align:center">Differences</th>
</tr> 
 <tr>
<th>Transylvania<br> University</th><td style="text-align:center">75.100<br>(9.235)</td><td style="text-align:center">80.000<br>(8.894)</td><td style="text-align:center; background-color: #fde5d2" ">4.900<br>(3.035)</td>
</tr>
 <tr>
<th>University<br> of Azkaban</th><td style="text-align:center">76.400<br>(11.673)</td><td style="text-align:center">76.300<br>(11.383)</td><td style="text-align:center; background-color: #fde5d2" ">-0.100<br>(3.510)</td>
</tr>
 <tr>
<th>Differences</th><td style="text-align:center; background-color: #fde5d2">-1.300<br>(15.384)</td><td style="text-align:center; background-color: #fde5d2" ">3.700<br>(13.166)</td><td style="background-color: #fbcaa5; text-align:center"><b>5.000<br>(3.944)</b></td>
</tr>
</table>
<p>The orange highlighted values represent the difference-in-differences across periods between the TU class, the treatment group, and the UA class, the control group. We see that the training course provides a treatment effect of an average 5.00 point increase in grades. </p>
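<p>The numbers in the table can be reproduced with a few lines of arithmetic. A minimal sketch (in Python; the post's own replication code is in GAUSS):</p>

```python
# Homework grades from the tables above.
tu_pre  = [77, 82, 65, 68, 90, 84, 67, 73, 84, 61]
tu_post = [76, 88, 73, 74, 94, 88, 69, 78, 89, 71]
ua_pre  = [74, 63, 82, 70, 92, 67, 66, 68, 87, 95]
ua_post = [72, 70, 84, 67, 92, 70, 65, 65, 82, 96]

def mean(x):
    return sum(x) / len(x)

# Within-group changes across the treatment date.
d_tu = mean(tu_post) - mean(tu_pre)   # 80.000 - 75.100 =  4.900
d_ua = mean(ua_post) - mean(ua_pre)   # 76.300 - 76.400 = -0.100

# Difference-in-differences: the estimated treatment effect.
dd = d_tu - d_ua
print(round(dd, 3))  # 5.0
```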
<p>The GAUSS code to replicate these results is available on the <a href="https://github.com/aptech/gauss_blog/tree/master/econometrics/did-3.28.2019">Aptech GitHub repository</a>.</p>
<h3 id="conclusion">Conclusion</h3>
<p>The difference-in-differences method provides a simple way to estimate treatment effects. The basic two-period approach outlined here is the foundation for more sophisticated techniques, including DD regression on larger panels. </p>
<p>In today's blog we have covered the fundamentals of the DD method:</p>
<ul>
<li>What is difference-in-differences (DD) estimation</li>
<li>How does DD work? </li>
<li>A simple DD example</li>
</ul>
<p>The code and data for this blog can be found at our Aptech Blog Github <a href="https://github.com/aptech/gauss_blog/tree/master/econometrics/did-3.28.2019">code repository</a>.</p>
<h2 id="references">References</h2>
<p><a href="https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion">Angrist, J. D., &amp; Pischke, J.-S.</a> (2008). <em>Mostly harmless econometrics: An empiricist's companion</em>. Princeton University Press.</p>
<p>
    <!-- MathJax configuration -->
    <style>
        .mjx-svg-href {
            fill: "inherit" !important;
            stroke: "inherit" !important;
        }
    </style>
    <script type="text/x-mathjax-config">
        MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "AMS"} } });
    </script>
    <script type="text/javascript">
window.MathJax = {
  tex2jax: {
    inlineMath: [ ['$','$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
    processEnvironments: true
  },
  // Center justify equations in code and markdown cells. Elsewhere
  // we use CSS to left justify single line equations in code cells.
  displayAlign: 'center',
  "HTML-CSS": {
    styles: {'.MathJax_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  "SVG": {
    styles: {'.MathJax_SVG_Display': {"margin": 0}},
    linebreaks: { automatic: false }
  },
  showProcessingMessages: false,
  messageStyle: "none",
  menuSettings: { zoom: "Click" },
  AuthorInit: function() {
    MathJax.Hub.Register.StartupHook("End", function() {
            var timeout = false, // holder for timeout id
            delay = 250; // delay after event is "complete" to run callback
            var shrinkMath = function() {
              //var dispFormulas = document.getElementsByClassName("formula");
              var dispFormulas = document.getElementsByClassName("MathJax_SVG_Display");
              if (dispFormulas){
                // caculate relative size of indentation
                var contentTest = document.getElementsByTagName("body")[0];
                var nodesWidth = contentTest.offsetWidth;
                // if you have indentation
                var mathIndent = MathJax.Hub.config.displayIndent; //assuming px's
                var mathIndentValue = mathIndent.substring(0,mathIndent.length - 2);
                for (var i=0; i<dispFormulas.length; i++){
                  var dispFormula = dispFormulas[i];
                  var wrapper = dispFormula;
                  //var wrapper = dispFormula.getElementsByClassName("MathJax_Preview")[0].nextSibling;
                  var child = wrapper.firstChild;
                  wrapper.style.transformOrigin = "center"; //or top-left if you left-align your equations
                  var oldScale = child.style.transform;
                  //var newValue = Math.min(0.80*dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newValue = Math.min(dispFormula.offsetWidth / child.offsetWidth,1.0).toFixed(2);
                  var newScale = "scale(" + newValue + ")";
                  if(newValue != "NaN" && !(newScale === oldScale)){
                    wrapper.style.transform = newScale;
                    wrapper.style["margin-left"]= Math.pow(newValue,4)*mathIndentValue + "px";
                    var wrapperStyle = window.getComputedStyle(wrapper);
                    var wrapperHeight = parseFloat(wrapperStyle.height);
                    wrapper.style.height = "" + (wrapperHeight * newValue) + "px";
                    if(newValue === "1.00"){
                      wrapper.style.cursor = "";
                      wrapper.style.height = "";
                    }
                    else {
                      wrapper.style.cursor = "zoom-in";
                    }
                  }

                }
            }
            };
            shrinkMath();
            window.addEventListener('resize', function() {
              clearTimeout(timeout);
              timeout = setTimeout(shrinkMath, delay);
            });
          });
  }
}
</script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_SVG"></script></p>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/introduction-to-difference-in-differences-estimation/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>Panel data, structural breaks and unit root testing</title>
		<link>https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/</link>
					<comments>https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/#comments</comments>
		
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Sat, 23 Feb 2019 08:35:14 +0000</pubDate>
				<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Panel data]]></category>
		<category><![CDATA[panel data]]></category>
		<category><![CDATA[unit root]]></category>
		<guid isPermaLink="false">https://www.aptech.com/?p=19541</guid>

					<description><![CDATA[In this blog, we extend  our <a href="https://www.aptech.com/blog/unit-root-tests-with-structural-breaks/">analysis of unit root testing</a> with <a href="https://www.aptech.com/structural-breaks/">structural breaks</a> to <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/">panel data</a>. Using panel data unit roots tests found in the GAUSS <a href="https://github.com/aptech/tspdlib">tspdlib</a> we consider if a panel of international current account balances collectively shows unit root behavior.
]]></description>
										<content:encoded><![CDATA[<p><img src="https://www.aptech.com/wp-content/uploads/2019/02/gblog-sb-02202018-1.png" alt="US Current Account Balance" /></p>
<h3 id="introduction">Introduction</h3>
<p>In this blog, we extend <a href="https://www.aptech.com/blog/unit-root-tests-with-structural-breaks/" target="_blank" rel="noopener">last week's</a> analysis of unit root testing with <a href="https://www.aptech.com/structural-breaks/" target="_blank" rel="noopener">structural breaks</a> to <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a>.</p>
<p>We will again use the quarterly current account to GDP ratio but focus on a panel of data from five countries:  United States, United Kingdom, Australia, South Africa, and India. </p>
<p>Using panel data unit root tests found in the <a href="https://docs.aptech.com/gauss/tspdlib/docs/tspdlib-landing.html" target="_blank" rel="noopener">GAUSS tspdlib library</a>, we consider whether the panel collectively shows unit root behavior.</p>
<h2 id="testing-for-unit-roots-in-panel-data">Testing for unit roots in panel data</h2>
<h3 id="why-panel-data">Why panel data</h3>
<p>There are a number of reasons we utilize <a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">panel data</a> in econometrics (Baltagi, 2008). Panel data:</p>
<ul>
<li>Capture the idiosyncratic behaviors of individual groups with models like the fixed effects or random effects models.</li>
<li>Contain more information, more variability, and more efficiency.</li>
<li>Can detect and measure statistical effects that pure time-series or cross-section data can't. </li>
<li>Provide longer time-series for unit-root testing, which in turn leads to standard asymptotic behavior.</li>
</ul>
<h3 id="panel-data-unit-root-testing">Panel data unit root testing</h3>
<p>Today we will test for unit roots using the panel Lagrange Multiplier (LM) unit-root test with structural breaks in the mean (Im, K., Lee, J., Tieslau, M., 2005):</p>
<ul>
<li>The panel LM test statistic averages the individual LM test statistics which are computed using the pooled likelihood function. </li>
<li>The asymptotic distribution of the test is robust to structural breaks. </li>
<li>The test considers the null unit root hypothesis against the alternative that at least one time series in the panel is stationary. </li>
</ul>
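<p>To make the averaging step concrete: the panel statistic begins with the cross-sectional mean of the individual LM statistics, which Im, Lee, and Tieslau (2005) then center and scale using tabulated moments to obtain a standard normal panel statistic. A minimal sketch of the first step (in Python, using the one-break statistics reported later in this post; the centering and scaling is handled internally by the <code>PDLM</code> procedure and is not reproduced here):</p>

```python
# Individual one-break LM statistics for the five countries
# (values from the results table in this post).
lm_stats = [-3.0504, -4.1213, -3.1625, -5.1271, -2.8001]

# Step one of the panel LM statistic: the cross-sectional average.
lm_bar = sum(lm_stats) / len(lm_stats)
print(round(lm_bar, 4))  # -3.6523
```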
<h2 id="testing-our-panel">Testing our panel</h2>
<h3 id="setting-up-the-test">Setting up the test</h3>
<p>The panel LM test can be run using the <strong>GAUSS</strong> <a href="https://docs.aptech.com/gauss/tspdlib/docs/pdlm.html" target="_blank" rel="noopener">PDLM</a> procedure found in the GAUSS <strong>tspdlib</strong> library. The procedure has two required inputs and four optional arguments:</p>
<hr />
<dl>
<dt>y_test</dt>
<dd>T x N matrix,  the panel data to be tested.</dd>
<dt>model</dt>
<dd>Scalar,  indicates the type of model to be tested.<br>    1 = break in level.<br>    2 = break in level and trend.</dd>
<dt>nbreak</dt>
<dd>Scalar,  optional input, the number of breaks to allow. <br>    1 = one break.<br>    2 = two breaks. Default = 0.</dd>
<dt>pmax</dt>
<dd>Scalar,  optional input, maximum number of lags for Dy. 0 = no lags. Default = 8.</dd>
<dt>ic</dt>
<dd>Scalar,  optional input, the information criterion used to select lags. <br>    1 = Akaike. <br>    2 = Schwarz. <br>    3 = t-stat significance. Default = 3.</dd>
<dt>trimm</dt>
<dd>Scalar,  optional input, data trimming rate. Default = 0.10.
<hr /></dd>
</dl>
<p>The <code>PDLM</code> procedure has five returns:</p>
<hr />
<dl>
<dt>Nlm</dt>
<dd>Vector,  the minimum test statistic for each cross-section.</dd>
<dt>Ntb</dt>
<dd>Vector,  location of break(s) for each cross-section.</dd>
<dt>Np</dt>
<dd>Vector,  number of lags selected by the chosen information criterion for each cross-section.</dd>
<dt>PDlm</dt>
<dd>Scalar,  panel LM statistic, asymptotically distributed N(0, 1).</dd>
<dt>pval</dt>
<dd>Scalar,  p-value of <em>PDlm</em>.</dd>
</dl>
<hr />
<h3 id="running-the-test">Running the test</h3>
<p>The test is easy to set up and run in GAUSS. We first load the <strong>tspdlib</strong> library and our data. </p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">library tspdlib;

// Load data
ca_panel = loadd("panel_ca.dat");
y_test = ca_panel[., 2:cols(ca_panel)];
</code></pre>
<p>Next, we specify that we want to run the model with level breaks and we call the <code>PDLM</code> procedure separately for the one break and two break models. We will keep all other parameters at their default values:</p>
<pre class="hljs-container hljs-container-solo"><code class="lang-gauss">// Specify to run model with 
// level breaks
model = 1;

// Run first with one break
nbreak = 1;

// Call PD LM with one level break
{ Nlm, Ntb, Np, PDlm, pval } = PDLM(y_test, model, nbreak);

// Run next with two breaks
nbreak = 2;

// Call PDLM with two level breaks
{ Nlm, Ntb, Np, PDlm, pval } = PDLM(y_test, model, nbreak);</code></pre>
<h3 id="results">Results</h3>
<table style="border-collapse: collapse">
<tr>
<th>Country</th><th>Cross-section<br> test statistic</th><th>Break<br> location</th><th>Number of<br> lags</th><th>Conclusion</th>
</tr>
<tr><th colspan="5">Two break model</th></tr>
<tr>
<td>United States</td><td>-3.3067</td><td>1993 Q1, 2004 Q3</td><td>12</td><td>Reject the null</td>
</tr>
<tr>
<td>United Kingdom</td><td>-4.6080</td><td>1980 Q4, 1984 Q4</td><td>4</td><td>Reject the null</td>
</tr>
<tr>
<td>Australia</td><td>-3.9522</td><td>1970 Q3, 1977 Q4</td><td>12</td><td>Reject the null</td>
</tr>
<tr>
<td>South Africa</td><td>-5.6735</td><td>1976 Q4, 1983 Q4</td><td>4</td><td>Reject the null</td>
</tr>
<tr style="border-bottom: 2px solid #444;">
<td>India</td><td>-5.6734</td><td>1975 Q4, 2004 Q2</td><td>9</td><td>Reject the null</td>
</tr>
<tr style="border-top: 2px solid #444;"><td>Full Panel</td><td>-6.6340</td><td>N/A</td><td>N/A</td><td>Reject the null</td>
</tr>
<tr><td colspan="5"> </td></tr>
<tr><th colspan="5">One break model</th></tr>

<tr>
<td>United States</td><td>-3.0504</td><td>1993 Q1</td><td>12</td><td>Reject the null</td>
</tr>
<tr>
<td>United Kingdom</td><td>-4.1213</td><td>1984 Q4</td><td>4</td><td>Reject the null</td>
</tr>
<tr>
<td>Australia</td><td>-3.1625</td><td>1980 Q2</td><td>12</td><td>Reject the null</td>
</tr>
<tr>
<td>South Africa</td><td>-5.1271</td><td>1979 Q4</td><td>4</td><td>Reject the null</td>
</tr>
<tr>
<td>India</td><td>-2.8001</td><td>1976 Q2</td><td>9</td><td>Reject the null</td>
</tr>
<tr style="border-top: 2px solid #444;"><td>Full Panel</td><td>-8.9119</td><td>N/A</td><td>N/A</td><td>Reject the null</td>
</tr>
</table>
<p>Research on the presence of unit roots in current account balances has produced mixed results, bringing to the forefront the question of current account balance sustainability (Clower &amp; Ito, 2012). </p>
<p>Our panel tests with structural breaks unanimously reject the null hypothesis of unit roots for all cross-sections, as well as the combined panel. This adds support, at least for our small sample, to the idea that current account balances are sustainable and mean-reverting. </p>
<h2 id="conclusions">Conclusions</h2>
<p>Today we've learned about conducting panel data unit root testing in the presence of structural breaks using the LM test of <a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1468-0084.2005.00125.x">Im, Lee, and Tieslau (2005)</a>. After today you should have a better understanding of:</p>
<ol>
<li>Some of the advantages of using panel data.</li>
<li>How to test for unit roots in panel data using the LM test with structural breaks.</li>
<li>How to use the <a href="https://github.com/aptech/tspdlib">GAUSS tspdlib library</a> to test for unit roots with structural breaks.</li>
</ol>
<p>Code and data from this blog can be found <a href="https://github.com/aptech/gauss_blog/tree/master/time_series/panel-unitroot-2.22.19" target="_blank" rel="noopener">here</a>.</p>
<h3 id="further-reading">Further Reading</h3>
<ol>
<li><a href="https://www.aptech.com/blog/panel-data-basics-one-way-individual-effects/" target="_blank" rel="noopener">Panel Data Basics: One-way Individual Effects</a></li>
<li><a href="https://www.aptech.com/blog/how-to-aggregate-panel-data-in-gauss/" target="_blank" rel="noopener">How to Aggregate Panel Data in GAUSS</a></li>
<li><a href="https://www.aptech.com/blog/introduction-to-the-fundamentals-of-panel-data/" target="_blank" rel="noopener">Introduction to the Fundamentals of Panel Data</a></li>
<li><a href="https://www.aptech.com/blog/panel-data-stationarity-test-with-structural-breaks/" target="_blank" rel="noopener">Panel Data Stationarity Test With Structural Breaks</a></li>
<li><a href="https://www.aptech.com/blog/transforming-panel-data-to-long-form-in-gauss/" target="_blank" rel="noopener">Transforming Panel Data to Long Form in GAUSS</a></li>
</ol>
<h3 id="references">References</h3>
<p>Baltagi, B. (2008). <em>Econometric analysis of panel data</em>. John Wiley &amp; Sons.</p>
<p>Clower, E., &amp; Ito, H. (2012). The persistence of current account balances and its determinants: the implications for global rebalancing.</p>
<p>Im, K., Lee, J., Tieslau, M. (2005). Panel LM Unit-root Tests with Level Shifts. <em>Oxford Bulletin of Economics and Statistics</em> 67, 393–419.</p>]]></content:encoded>
					
					<wfw:commentRss>https://www.aptech.com/blog/panel-data-structural-breaks-and-unit-root-testing/feed/</wfw:commentRss>
			<slash:comments>9</slash:comments>
		
		
			</item>
	</channel>
</rss>
