Introduction to the Fundamentals of Panel Data

by Eric · Published November 29, 2019 · Updated April 23, 2024

Introduction

Panel data, sometimes referred to as longitudinal data, is data that contains observations about different cross sections across time. Examples of groups that may make up panel data series include countries, firms, individuals, or demographic groups.

Like time series data, panel data contains observations collected at a regular frequency, chronologically. Like cross-sectional data, panel data contains observations across a collection of individuals.

There are a number of advantages of panel data:

Panel data can model both the common and individual behaviors of groups.
Panel data contains more information and more variability than pure time series data or cross-sectional data, which allows for more efficient estimation.
Panel data can detect and measure statistical effects that pure time series or cross-sectional data can't.
Panel data can minimize estimation biases that may arise from aggregating groups into a single time series.

Panel data examples can be found in economics, social sciences, medicine and epidemiology, finance, and the physical sciences.

What Is an Example of Panel Data?
Field	Example topics	Example dataset
Microeconomics	GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances.	Panel Study of Income Dynamics (PSID)
Macroeconomics	International trade tables, world socioeconomic tables, currency exchange rate tables.	Penn World Tables
Epidemiology and Health Statistics	Public health insurance data, disease survival rate data, child development and well-being data.	Medical Expenditure Panel Survey
Finance	Stock prices by firm, market volatilities by country or firm.	Global Market Indices

What Is Panel Data?

Panel data is a collection of quantities observed across multiple individuals at evenly spaced intervals in time and ordered chronologically. Examples of individuals include people, countries, and companies.

In order to denote both individuals and time observations, panel data notation often uses the subscript $i$ for groups and the subscript $t$ for time. For example, a panel data observation $Y_{it}$ is observed for all individuals $i = 1, \ldots, N$ across all time periods $t = 1, \ldots, T$.

More specifically:

Group	Time Period	Notation
1	1	$Y_{11}$
1	2	$Y_{12}$
1	T	$Y_{1T}$
⁞	⁞	⁞
N	1	$Y_{N1}$
N	2	$Y_{N2}$
N	T	$Y_{NT}$

Wide and Long Panel Datasets

Panel datasets may come in different formats. The format in the table above is sometimes called long format data. Long format datasets stack the observations of each variable from all groups, across all time periods, into one column.

When the observations for a single variable from separate groups are stored in separate columns, this is sometimes referred to as wide format data.

Time	$Y_1$	$Y_2$	$Y_N$
1	$Y_{11}$	$Y_{21}$	$Y_{N1}$
2	$Y_{12}$	$Y_{22}$	$Y_{N2}$
3	$Y_{13}$	$Y_{23}$	$Y_{N3}$
4	$Y_{14}$	$Y_{24}$	$Y_{N4}$
⁞	⁞	⁞	⁞
T-1	$Y_{1,T-1}$	$Y_{2,T-1}$	$Y_{N,T-1}$
T	$Y_{1T}$	$Y_{2T}$	$Y_{NT}$
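
As a rough illustration of the two layouts, the sketch below converts a small, made-up long format panel to wide format and back using pandas. The column names `group`, `time`, and `y`, and the data values, are assumptions chosen for this example only.

```python
import pandas as pd

# Hypothetical long-format panel: one row per (group, time) pair.
long_df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "time":  [1, 2, 3, 1, 2, 3],
    "y":     [1.0, 1.5, 2.0, 4.0, 4.2, 4.9],
})

# Long -> wide: one row per time period, one column per group.
wide_df = long_df.pivot(index="time", columns="group", values="y")

# Wide -> long: stack the group columns back into a single column.
back_to_long = (
    wide_df.reset_index()
           .melt(id_vars="time", var_name="group", value_name="y")
           .sort_values(["group", "time"])
           .reset_index(drop=True)
)
```

Here `wide_df` has one column per group, matching the wide table above, while `back_to_long` returns to one stacked column of observations.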

Balanced Panel Data Versus Unbalanced Panel Data

Panel data can also be characterized as unbalanced panel data or balanced panel data:

Balanced panel datasets have the same number of observations for all groups.
Unbalanced panel datasets have missing values at some time observations for some of the groups.

Certain panel data models are only valid for balanced datasets. If the panel datasets are unbalanced they may need to be condensed to include only the consecutive periods for which there are observations for all individuals in the cross section.
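
One possible way to check for balance and to condense an unbalanced panel is sketched below. The helper names `is_balanced` and `common_periods` are hypothetical, as are the column names and data. Note that this sketch only keeps the periods observed for every group; it does not verify that the remaining periods are consecutive.

```python
import pandas as pd

# Hypothetical unbalanced panel: group "B" has no observation in period 1.
panel = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "time":  [1, 2, 3, 4, 2, 3, 4, 1, 2, 3, 4],
    "y":     [2.0, 2.1, 2.4, 2.6, 5.0, 5.3, 5.1, 3.3, 3.1, 3.6, 3.8],
})

def is_balanced(df):
    """True when every group is observed exactly once in every time period."""
    n_groups = df["group"].nunique()
    n_periods = df["time"].nunique()
    no_duplicates = not df.duplicated(["group", "time"]).any()
    return no_duplicates and len(df) == n_groups * n_periods

def common_periods(df):
    """Periods in which every group has an observation."""
    counts = df.groupby("time")["group"].nunique()
    return sorted(counts[counts == df["group"].nunique()].index.tolist())

# Keep only the periods that are observed for all groups.
balanced = panel[panel["time"].isin(common_periods(panel))]
```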

Panel Data and Heterogeneity

Panel data series modeling centers around addressing the likely dependence across data observations within the same group. In fact, the primary difference between panel data models and time series models is that panel data models allow for heterogeneity across groups and introduce individual-specific effects.

As an example, consider a panel data series which includes gross domestic product (GDP) data for a panel of 5 different countries, the United States, France, Canada, Greece, and Australia:

A worldwide economic recession is likely to impact all 5 countries and cause changes in the GDP across all 5 countries.
An election in Australia is likely to impact the GDP of Australia but may not affect the other countries in the panel.
A change in North American trade policy may only regionally impact the US and Canada.
A change in the Euro exchange rate will most directly affect only France and Greece.

Panel data models include techniques that can address these heterogeneities across individuals. Furthermore, pure cross-sectional methods and pure time series models may not be valid in the presence of this heterogeneity.

Modeling Panel Data

Researchers commonly analyze datasets with multiple observations of a set of cross-sectional units (e.g., people, firms, countries) over time. For example, one may have data covering the production of multiple firms or the gross product of multiple countries across a number of years.

Modeling these panel data series is a unique branch of time series modeling made up of methodologies specific to their structure.

This section looks more closely at panel data analysis and the associated panel data models.

Homogeneous Versus Heterogeneous Panel Data Models

Panel data methods can be split into two broad categories:

Homogeneous (or pooled) panel data models assume that the model parameters are common across individuals.
Heterogeneous models allow for any or all of the model parameters to vary across individuals. Fixed effects and random effects models are both examples of heterogeneous panel data models.

Within these groups, the assumptions made about the variation of the model across individuals are the primary drivers for which model to use.

Let’s consider a simple linear model

$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$

The representation above is a homogeneous model:

The constant, $ \alpha $, is the same across groups and time.
The coefficient, $ \beta $, is constant across groups and time.
Any differences across groups enter the model only through the error term, $ \epsilon_{it} $.

Alternatively, we could believe that groups share common coefficients on regressors but have group-specific intercepts, as is captured in the fixed effects or least squares dummy variable (LSDV) model

$$y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}$$

The representation above is a heterogeneous model, because the constants, $ \alpha_i $, are group-specific.

Individual-Specific Effects in Panel Data

This section considers four popular panel data models:

Pooled ordinary least squares.
One-way fixed effects.
One-way random effects.
Random coefficients.

We will examine these models using an assumed data generation process given by

$$ y_{it} = \beta x_{it} + \delta z_i + \epsilon_{it}$$

In this model, $x_{it}$ represents the observed characteristics, such as age, firm size, and expenditures, and $z_i$ represents unobserved characteristics, such as management quality, growth opportunities, or skill.

Component	Description	Example
$x_{it}$	These are observable characteristics. These characteristics may be constant for an individual across all time, such as race, or may vary across all time observations for an individual such as age.	Age, race, company size, expenditure, population, GDP
$z_i$	Unobservable characteristics, responsible for model heterogeneity.	Skill, company potential, lack of basic infrastructure in the community, political unrest.
$\epsilon_{it}$	Stochastic error term.	N/A

What Is Pooled Ordinary Least Squares?
In some cases, there are no unobservable individual-specific effects, and $\delta z_i$ is constant across individuals, so it can be absorbed into a common intercept, $\alpha$. This is a strong assumption and implies that all the observations within groups are independent of one another.

In these cases, the model becomes

$$ y_{it} = \beta x_{it} + \alpha + \epsilon_{it}$$

This implies that when there is no dependence within individual groups, the panel data can be treated as one large, pooled dataset. The model parameters, $\beta$ and $\alpha$, can be directly estimated using pooled ordinary least squares.

Independence of observations within the groups of a panel is unlikely, so pooled OLS is rarely acceptable for panel data models.
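
To make the pooling step concrete, here is a minimal sketch that simulates data with no individual-specific effects and estimates $\alpha$ and $\beta$ by stacking all observations into one regression. The panel dimensions, parameter values, and variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, n_periods = 50, 10          # hypothetical panel dimensions
alpha_true, beta_true = 1.0, 2.0      # hypothetical parameters

# Simulate y_it = alpha + beta * x_it + eps_it with no individual effects.
x = rng.normal(size=(n_groups, n_periods))
eps = rng.normal(scale=0.5, size=(n_groups, n_periods))
y = alpha_true + beta_true * x + eps

# Pool all N*T observations into one regression with a common intercept.
X = np.column_stack([np.ones(x.size), x.ravel()])
alpha_hat, beta_hat = np.linalg.lstsq(X, y.ravel(), rcond=None)[0]
```

Because the simulated data truly has no group-level effects, the pooled estimates should land close to the values used to generate the data.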

What Is The One-Way Fixed Effects Model?
The one-way fixed effects panel data model:

Includes unobservable time-specific or individual-specific effects. These effects capture omitted variables.
Assumes that individual-specific effects are correlated with the observed characteristics, $x_{it}$.
Implies that pooled OLS estimates for data generated by this process will be inconsistent.

Plot of fixed effects panel data with group-specific intercepts and one shared slope.

As an example, let’s consider the one-way fixed effects model with individual-specific effects where the unobservable component, $\delta z_i$, acts like an individual-specific intercept:

$$y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}$$

The intercept term, $\alpha_i$, varies across individuals but is constant across time. This term is composed of the constant intercept term, $\mu$, and the individual-specific effect, $\gamma_i$, so that $\alpha_i = \mu + \gamma_i$.

The key feature of the fixed effects model is that $\gamma_i$ has a true, but unobservable, effect that must be estimated. More importantly, if we estimate $\beta$ using pooled OLS and fail to appropriately account for $\gamma_i$, the estimates will be inconsistent and biased.

The fixed effects model requires the estimation of the model parameter $\beta$ and individual $\alpha_i$ for each of the N groups in the panel. This is generally achieved using one of three estimation techniques:

Within-group estimation.
First differences estimation.
Least squares dummy variable (LSDV) estimation.

The first two of these techniques focus on eliminating the individual effects before estimation. The LSDV method directly incorporates these effects using dummy variables.
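
The sketch below compares pooled OLS, within-group estimation, and LSDV estimation on simulated data in which the individual effects are correlated with $x_{it}$. All parameter values and variable names are assumptions for illustration, and first differences estimation is not shown. The within-group step relies on the fact that subtracting each group's time average from $y_{it}$ and $x_{it}$ removes $\alpha_i$, since $\alpha_i$ does not vary over time.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_periods, beta_true = 100, 8, 1.5   # hypothetical values

# Individual effects alpha_i that are correlated with the regressor x_it.
alpha_i = rng.normal(size=(n_groups, 1))
x = 0.8 * alpha_i + rng.normal(size=(n_groups, n_periods))
y = beta_true * x + alpha_i + rng.normal(scale=0.5, size=(n_groups, n_periods))

# Pooled OLS ignores alpha_i, so its slope absorbs part of the effect.
X_pooled = np.column_stack([np.ones(x.size), x.ravel()])
beta_pooled = np.linalg.lstsq(X_pooled, y.ravel(), rcond=None)[0][1]

# Within-group estimator: subtract each group's time average, then run OLS.
x_w = x - x.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
beta_within = (x_w * y_w).sum() / (x_w ** 2).sum()

# LSDV: regress y on x plus one dummy variable per group.
dummies = np.kron(np.eye(n_groups), np.ones((n_periods, 1)))
X_lsdv = np.column_stack([x.ravel(), dummies])
beta_lsdv = np.linalg.lstsq(X_lsdv, y.ravel(), rcond=None)[0][0]
```

In this setup the within-group and LSDV slopes coincide up to numerical precision, while the pooled OLS slope is pulled away from the true value by the omitted individual effects.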

What Is the One-Way Random Effects Model?
The one-way random effects panel data model:

Includes unobservable time-specific or individual-specific effects, $\delta z_i$, which act like individual-specific stochastic error terms.
Assumes that these effects are uncorrelated with the observed characteristics, $x_{it}$.
Does not result in biased OLS estimates of coefficients but does lead to inefficient estimates and invalid standard inference tools.

Plot of random effects panel data showing stochastic differences across groups.

The distinguishing feature of the random effects model is that $\delta z_i$ does not have a true value but rather follows a random distribution with parameters that we must estimate.

The random effects term, $\delta z_i$:

Is uncorrelated with $x_{it}$, so pooled OLS estimates of the model parameters will not be biased.
Impacts the covariance structure of the error term, which implies that pooled OLS estimates of the model parameters will be inefficient and standard inference tools, like the t-statistic, will not be correct.

The random effects model should be estimated using feasible generalized least squares (FGLS). Using FGLS, the appropriate error structure, one which accounts for the individual-specific error terms, can be incorporated into the model.
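
One common way to implement FGLS for the random effects model is quasi-demeaning, sketched below under simplifying assumptions: a balanced panel, a single regressor, and variance components estimated from the within and between regressions. The symbols `sigma2_e`, `sigma2_u`, and `theta`, along with all simulated values, are introduced here for illustration and are not taken from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, alpha_true, beta_true = 200, 6, 1.0, 2.0   # hypothetical values

# Random effects u_i are drawn independently of x_it.
u = rng.normal(scale=1.0, size=(N, 1))
x = rng.normal(size=(N, T))
y = alpha_true + beta_true * x + u + rng.normal(scale=0.5, size=(N, T))

x_bar, y_bar = x.mean(axis=1, keepdims=True), y.mean(axis=1, keepdims=True)

# Step 1: idiosyncratic variance from within-group residuals.
x_w, y_w = x - x_bar, y - y_bar
b_within = (x_w * y_w).sum() / (x_w ** 2).sum()
sigma2_e = ((y_w - b_within * x_w) ** 2).sum() / (N * (T - 1) - 1)

# Step 2: individual-effect variance from the between (group-mean) regression.
Xb = np.column_stack([np.ones(N), x_bar.ravel()])
coef_b = np.linalg.lstsq(Xb, y_bar.ravel(), rcond=None)[0]
resid_b = y_bar.ravel() - Xb @ coef_b
sigma2_u = max((resid_b ** 2).sum() / (N - 2) - sigma2_e / T, 0.0)

# Step 3: quasi-demean with theta and run OLS on the transformed data.
theta = 1.0 - np.sqrt(sigma2_e / (T * sigma2_u + sigma2_e))
X_star = np.column_stack([np.full(N * T, 1.0 - theta), (x - theta * x_bar).ravel()])
y_star = (y - theta * y_bar).ravel()
alpha_fgls, beta_fgls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
```

When `theta` is close to 0 the transformation approaches pooled OLS, and when it is close to 1 it approaches the within-group transformation.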

What Is the Random Coefficients Model?

Plot of random coefficients panel data, showing differing intercepts, slopes, and variances.

The panel data regressions we’ve looked at so far have all assumed that the coefficients on regressors are the same across all individuals. The random coefficients model relaxes this assumption and introduces individual-specific effects through the coefficient, such that

$$y_{it} = \beta_i x_{it} + \alpha_i + \epsilon_{it}$$
$$y_{it} = (b_i + \beta)x_{it} + (a_i + \alpha) + \epsilon_{it}$$
$$b_i \sim N(0, \tau_{i1}^2)$$
$$a_i \sim N(0, \tau_{i2}^2)$$

This model both introduces individual-specific slope effects and allows for heteroscedasticity through the individual-specific variances $\tau_{i1}^2$ and $\tau_{i2}^2$.

This model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).

Two-Way Individual Effects Models

The two-way individual effects model allows the presence of both time-specific effects and individual-specific effects.

Starting from a simple linear model given by,

$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$

the two-way individual effects model can be represented by

$$y_{it} = \alpha + \beta x_{it} + \mu_i + \lambda_t + \epsilon_{it}$$

In this model, $\mu_i$, captures any unobservable individual-specific effects and $\lambda_t$ captures any unobservable time-specific effects. Note that the individual-specific effects, $\mu_i$, do not vary with time, while the time-specific effects, $\lambda_t$, do not vary across individuals.

In the special case that there are only two time periods and two groups, this model is equivalent to the difference-in-difference model. However, if there are more than two time periods and/or individuals, alternative panel data models must be considered.
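
The two-by-two special case can be checked numerically. The sketch below uses four made-up outcome values and assumes that $x_{it}$ is a treatment indicator equal to one only for the second group in the second period. With four observations and four parameters the two-way model fits exactly, so its $\beta$ can be compared directly with the difference-in-difference of the cell values.

```python
import numpy as np

# Hypothetical outcomes for two groups observed in two periods.
# Only group 2 is treated, and only in period 2.
y = np.array([10.0, 12.0, 11.0, 16.0])        # y_11, y_12, y_21, y_22
group2 = np.array([0.0, 0.0, 1.0, 1.0])       # individual effect dummy
period2 = np.array([0.0, 1.0, 0.0, 1.0])      # time effect dummy
treated = group2 * period2                    # x_it in the two-way model

# Two-way model: y_it = alpha + beta * x_it + mu_i + lambda_t.
X = np.column_stack([np.ones(4), treated, group2, period2])
alpha, beta, mu_2, lambda_2 = np.linalg.solve(X, y)

# Difference-in-differences computed directly from the four cell values.
did = (y[3] - y[2]) - (y[1] - y[0])
```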

What Is the Two-Way Fixed Effects Model?
The two-way fixed effects model:

Assumes that both $\mu_i$ and $\lambda_t$ are unobservable, fixed effects that must be estimated.

For data generated by this model:

Pooled OLS estimates, which ignore $\mu_i$ and $\lambda_t$, will be biased and inconsistent.
One-way fixed effects estimates, which ignore $\lambda_t$, will be biased.

Like the one-way fixed effects model, this model could be estimated by including dummy variables. However, in the two-way fixed effects model dummy variables must be included for both the time periods and the groups.

Under most circumstances, the number of dummy variables included in the two-way fixed effects model makes standard ordinary least squares estimation computationally burdensome. Instead, the two-way fixed effects model is typically estimated using a within estimator, which removes both the group-specific means and the time-period-specific means from the data before estimation.
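
For a balanced panel, the two-way within transformation can be sketched as follows: subtract the group means and the period means from each variable, then add back the grand mean. The simulated values and the helper name `two_way_demean` are assumptions for this example, and the simple formula used here does not carry over to unbalanced panels.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, beta_true = 60, 12, 0.7                  # hypothetical values

mu = rng.normal(size=(N, 1))                   # individual effects
lam = rng.normal(size=(1, T))                  # time effects
x = 0.5 * mu + 0.5 * lam + rng.normal(size=(N, T))
y = beta_true * x + mu + lam + rng.normal(scale=0.3, size=(N, T))

def two_way_demean(a):
    """Subtract group means and period means, then add back the grand mean."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

x_dd, y_dd = two_way_demean(x), two_way_demean(y)
beta_twfe = (x_dd * y_dd).sum() / (x_dd ** 2).sum()

# Cross-check against OLS with a full set of group and period dummies.
group_d = np.kron(np.eye(N), np.ones((T, 1)))
time_d = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]   # drop one to avoid collinearity
X = np.column_stack([x.ravel(), group_d, time_d])
beta_dummies = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]
```

The two slope estimates agree up to numerical precision, but the demeaned regression never has to build the full matrix of dummy variables.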

What Is the Two-Way Random Effects Model?
The two-way random effects model:

Occurs when both $\mu_i$ and $\lambda_t$ are unobservable, stochastic effects.
Assumes that $\mu_i$ and $\lambda_t$ are independently distributed and are uncorrelated with $x_{it}$.

For data generated by this process:

Pooled OLS estimates will be unbiased. However, the estimates will be inefficient and the associated standard errors and t-statistics will be biased.

Like the one-way random effects model, the two-way random effects model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).

Dynamic Panel Data Model
A key component of pure time series models is the modeling of dynamics using lagged dependent variables. These lagged variables capture the autocorrelation between observations of the same dataset at different points in time.

Because panel datasets include a time series component, it is also important to address the possibility of autocorrelation in panel data. The dynamic panel data model adds dynamics to the panel data individual effects framework.

Consider an individual effects model which includes an AR(1) term

$$y_{it} = \delta y_{i,t-1} + \beta x_{it} + \epsilon_{it}$$

where the error component includes one-way individual effects such that

$$\epsilon_{it} = \mu_i + \nu_{it}$$

where $\mu_i$ captures individual effects.

Introducing lagged dependent variables in the individual effects framework creates a problem for estimation:

Both $y_{it}$ and $y_{i,t-1}$ are functions of $\mu_i$, because $\mu_i$ is time-invariant. This implies that as a regressor, $y_{i,t-1}$ is correlated with the error term.
Ordinary least squares (OLS) will lead to biased and inconsistent estimates because of this correlation between the regressor and the error term.

Dynamic panel data models are most commonly estimated using a generalized method of moments (GMM) framework proposed by Arellano and Bond (1991).
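
The simulation below illustrates the bias and one simple remedy. It is not the Arellano and Bond GMM estimator. Instead, it swaps in the simpler Anderson-Hsiao style instrumental variables approach, which first-differences the model to remove $\mu_i$ and then uses $y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1}$. The model is stripped down to omit $x_{it}$, and all parameter values are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, burn, delta_true = 5000, 8, 50, 0.5      # hypothetical values

# Simulate y_it = delta * y_{i,t-1} + mu_i + nu_it, discarding a burn-in.
mu = rng.normal(size=N)
y = np.zeros((N, T + burn))
for t in range(1, T + burn):
    y[:, t] = delta_true * y[:, t - 1] + mu + rng.normal(size=N)
y = y[:, burn:]

# Pooled OLS of y_it on y_{i,t-1}: biased because both depend on mu_i.
y_lag, y_cur = y[:, :-1].ravel(), y[:, 1:].ravel()
X = np.column_stack([np.ones(y_lag.size), y_lag])
delta_ols = np.linalg.lstsq(X, y_cur, rcond=None)[0][1]

# First differences remove mu_i; y_{i,t-2} then instruments for the
# differenced lag, which is correlated with the differenced error.
dy = np.diff(y, axis=1)            # dy[:, k] = y[:, k+1] - y[:, k]
dy_cur = dy[:, 1:].ravel()         # differenced y_it
dy_lag = dy[:, :-1].ravel()        # differenced y_{i,t-1}
z = y[:, :-2].ravel()              # y_{i,t-2} used as the instrument
delta_iv = (z * dy_cur).sum() / (z * dy_lag).sum()
```

In this design the pooled OLS estimate of $\delta$ is noticeably too large, while the instrumental variables estimate is close to the value used in the simulation.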

Panel Data and Stationarity

In panel data that covers small time frames, there is little need to worry about stationarity. However, when panel data covers longer time frames, as is the case in many macroeconomic panel data series, the panel data must be tested for stationarity.

Weak stationarity, required for many panel data modeling techniques, requires only that:

The series has the same finite unconditional mean and finite unconditional variance at all time periods.
The autocovariances of the series depend only on the lag between observations, not on time itself.

Nonstationary panel data series are any panel series that do not meet the conditions of a weakly stationary time series.
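
As a sketch in symbols for a single cross section $i$, where $m_i$, $\sigma_i^2$, and $\gamma_i(k)$ are notation introduced here for illustration, the weak stationarity conditions can be written as

$$E[y_{it}] = m_i, \quad \text{Var}(y_{it}) = \sigma_i^2 < \infty, \quad \text{Cov}(y_{it}, y_{i,t-k}) = \gamma_i(k) \quad \text{for all } t$$

For example, a random walk $y_{it} = y_{i,t-1} + \epsilon_{it}$ that starts at $y_{i0} = 0$ with error variance $\sigma^2$ has $\text{Var}(y_{it}) = t\sigma^2$. Its variance grows with $t$, so it fails the second condition and is nonstationary.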

In part because of these considerations, a large field of research and literature surrounding panel data unit root tests has developed.

Testing for unit roots in panel data requires more than just testing the individual cross sections for the presence of unit roots. Panel data unit root tests must:

Allow for both the shared movements across groups and the individual-specific movements within groups.
Use an appropriate asymptotic distribution based on how quickly the number of panels (N) and the number of time periods (T) grow relative to one another.
Determine whether to assume cross-sectional independence or to allow for cross-sectional dependence.

Conclusion

After today's blog, you should have an understanding of the fundamentals of panel data. We covered the basics of panel data including:

The structure of panel data series.
Wide versus long panel data series.
One-way individual effects panel data models.
Two-way individual effects panel data models.
Dynamic panel data models.
Panel data series and stationarity.

Eric (Director of Applications and Training at Aptech Systems, Inc.)

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.