Introduction
Panel data, sometimes referred to as longitudinal data, is data that contains observations about different cross sections across time. Examples of groups that may make up panel data series include countries, firms, individuals, or demographic groups.
Like time series data, panel data contains observations collected chronologically at a regular frequency. Like cross-sectional data, panel data contains observations across a collection of individuals.
There are a number of advantages of panel data:
 Panel data can model both the common and individual behaviors of groups.
 Panel data contains more information, more variability, and more efficiency than pure time series data or cross-sectional data.
 Panel data can detect and measure statistical effects that pure time series or cross-sectional data can't.
 Panel data can minimize estimation biases that may arise from aggregating groups into a single time series.
Panel data examples can be found in economics, social sciences, medicine and epidemiology, finance, and the physical sciences.



| Field | Example topics | Example dataset |
|---|---|---|
| Microeconomics | GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances. | Panel Study of Income Dynamics (PSID) |
| Macroeconomics | International trade tables, world socioeconomic tables, currency exchange rate tables. | Penn World Tables |
| Epidemiology and Health Statistics | Public health insurance data, disease survival rate data, child development and wellbeing data. | Medical Expenditure Panel Survey |
| Finance | Stock prices by firm, market volatilities by country or firm. | Global Market Indices |
What Is Panel Data?
Panel data is a collection of quantities obtained across multiple individuals, assembled over even intervals in time and ordered chronologically. Examples of individual groups include individual people, countries, and companies.
To denote both individuals and time observations, panel data notation often indexes groups with the subscript $i$ and time with the subscript $t$. For example, a panel data observation $Y_{it}$ is observed for all individuals $i = 1, \ldots, N$ across all time periods $t = 1, \ldots, T$.
More specifically:
| Group | Time Period | Notation |
|---|---|---|
| 1 | 1 | $Y_{11}$ |
| 1 | 2 | $Y_{12}$ |
| 1 | T | $Y_{1T}$ |
| ⁞ | ⁞ | ⁞ |
| N | 1 | $Y_{N1}$ |
| N | 2 | $Y_{N2}$ |
| N | T | $Y_{NT}$ |
Wide and Long Panel Datasets
Panel datasets may come in different formats. The format in the table above is sometimes called long format data. Long format datasets stack the observations of each variable from all groups, across all time periods, into one column.
When the observations for a single variable from separate groups are stored in separate columns, this is sometimes referred to as wide data format.
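| Time | $Y_1$ | $Y_2$ | $Y_N$ |
|---|---|---|---|
| 1 | $Y_{11}$ | $Y_{21}$ | $Y_{N1}$ |
| 2 | $Y_{12}$ | $Y_{22}$ | $Y_{N2}$ |
| 3 | $Y_{13}$ | $Y_{23}$ | $Y_{N3}$ |
| 4 | $Y_{14}$ | $Y_{24}$ | $Y_{N4}$ |
| ⁞ | ⁞ | ⁞ | ⁞ |
| T-1 | $Y_{1,T-1}$ | $Y_{2,T-1}$ | $Y_{N,T-1}$ |
| T | $Y_{1T}$ | $Y_{2T}$ | $Y_{NT}$ |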
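The two layouts carry the same information, so converting between them is mechanical. The sketch below uses pandas; the column names `group`, `time`, and `y` and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format panel: one row per (group, time) pair.
long_df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2],
    "time":  [1, 2, 3, 1, 2, 3],
    "y":     [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})

# Long -> wide: one row per time period, one column per group.
wide_df = long_df.pivot(index="time", columns="group", values="y")

# Wide -> long: melt the group columns back into a single stacked column.
back_to_long = (
    wide_df.reset_index()
    .melt(id_vars="time", var_name="group", value_name="y")
    .sort_values(["group", "time"])
    .reset_index(drop=True)[["group", "time", "y"]]
)

print(wide_df)
print(back_to_long)
```

Here `wide_df` has one column per group, matching the wide table above, and `back_to_long` recovers the original stacked column.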
Balanced Panel Data Versus Unbalanced Panel Data
Panel data can also be characterized as unbalanced panel data or balanced panel data:
 Balanced panel datasets have the same number of observations for all groups.
 Unbalanced panel datasets have missing values at some time observations for some of the groups.
Certain panel data models are only valid for balanced datasets. If a panel dataset is unbalanced, it may need to be condensed to include only the consecutive periods for which there are observations for all individuals in the cross section.
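As a rough illustration, the check and the trimming step might look like the following in pandas. The helper names `is_balanced` and `trim_to_balanced` and the data are hypothetical, and the sketch keeps every period that all groups share without verifying that those periods are consecutive.

```python
import pandas as pd

# Hypothetical unbalanced panel: group "B" lacks period 1, group "C" lacks period 4.
panel = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 3 + ["C"] * 3,
    "time":  [1, 2, 3, 4, 2, 3, 4, 1, 2, 3],
    "y":     [1.0, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.0, 3.1, 3.2],
})

def is_balanced(df):
    """True if every group is observed in every time period present in df."""
    periods_per_group = df.groupby("group")["time"].nunique()
    return bool((periods_per_group == df["time"].nunique()).all())

def trim_to_balanced(df):
    """Keep only the time periods that are observed for every group."""
    n_groups = df["group"].nunique()
    groups_per_period = df.groupby("time")["group"].nunique()
    common = groups_per_period[groups_per_period == n_groups].index
    return df[df["time"].isin(common)].reset_index(drop=True)

balanced = trim_to_balanced(panel)
print(is_balanced(panel), is_balanced(balanced))
```

In this made-up example only periods 2 and 3 are observed for all three groups, so the trimmed panel keeps six rows.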
Panel Data and Heterogeneity
Panel data series modeling centers around addressing the likely dependence across data observations within the same group. In fact, the primary difference between panel data models and time series models is that panel data models allow for heterogeneity across groups and introduce individual-specific effects.
As an example, consider a panel data series which includes gross domestic product (GDP) data for a panel of 5 different countries, the United States, France, Canada, Greece, and Australia:
 A worldwide economic recession is likely to impact all 5 countries and cause changes in GDP across all of them.
 An election in Australia is likely to impact the GDP of Australia but may not affect the other countries in the panel.
 A change in North American trade policy may only regionally impact the US and Canada.
 A change in the Euro exchange rate will most directly affect only France and Greece.
Panel data models include techniques that can address these heterogeneities across individuals. Furthermore, pure cross-sectional methods and pure time series models may not be valid in the presence of this heterogeneity.
Modeling Panel Data
Researchers commonly analyze datasets with multiple observations of a set of cross-sectional units (e.g., people, firms, countries) over time. For example, one may have data covering the production of multiple firms or the gross product of multiple countries across a number of years.
Modeling these panel data series is a unique branch of time series modeling made up of methodologies specific to their structure.
This section looks more closely at panel data analysis and the associated panel data models.
Homogeneous Versus Heterogeneous Panel Data Models
Panel data methods can be split into two broad categories:
 Homogeneous (or pooled) panel data models assume that the model parameters are common across individuals.
 Heterogeneous models allow for any or all of the model parameters to vary across individuals. Fixed effects and random effects models are both examples of heterogeneous panel data models.
Within these groups, the assumptions made about the variation of the model across individuals are the primary drivers for which model to use.
Let’s consider a simple linear model
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
The representation above is a homogeneous model:
 The constant, $ \alpha $, is the same across groups and time.
 The coefficient, $ \beta $, is constant across groups and time.
 Any differences across groups enter the model only through the error term, $ \epsilon_{it} $.
Alternatively, we could believe that groups share common coefficients on regressors but have group-specific intercepts, as is captured in the fixed effects or least squares dummy variable (LSDV) model
$$y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}$$
The representation above is a heterogeneous model, because the constants, $ \alpha_i $, are group-specific.
Individual-Specific Effects in Panel Data
This section considers four popular panel data models:
 Pooled ordinary least squares.
 One-way fixed effects.
 One-way random effects.
 Random coefficients.
We will examine these models using an assumed data generation process given by
$$ y_{it} = \beta x_{it} + \delta z_i + \epsilon_{it}$$
In this model, $x_{it}$ represents the observed characteristics, such as age, firm size, and expenditures, and $z_i$ represents unobserved characteristics, such as management quality, growth opportunities, or skill.
| Component | Description | Example |
|---|---|---|
| $x_{it}$ | Observable characteristics. These may be constant for an individual across all time, such as race, or may vary across time observations for an individual, such as age. | Age, race, company size, expenditure, population, GDP |
| $z_i$ | Unobservable characteristics, responsible for model heterogeneity. | Skill, company potential, lack of basic infrastructure in the community, political unrest. |
| $\epsilon_{it}$ | Stochastic error term. | N/A |
What Is Pooled Ordinary Least Squares?
In some cases, there are no unobservable individual-specific effects, and $\delta z_i $ is constant across individuals. This is a strong assumption and implies that all the observations within groups are independent of one another.
In these cases, the model becomes
$$ y_{it} = \beta x_{it} + \alpha + \epsilon_{it}$$
This implies that when there is no dependence within individual groups, the panel data can be treated as one large, pooled dataset. The model parameters, $\beta$ and $\alpha$, can be directly estimated using pooled ordinary least squares.
Independence within the groups of a panel is unlikely, so pooled OLS is rarely acceptable for panel data models.
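To make the pooling step concrete, here is a minimal NumPy sketch that simulates data with no individual-specific effects and estimates $\alpha$ and $\beta$ from the stacked observations. The panel dimensions, parameter values, and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 50, 10                 # hypothetical number of groups and time periods
alpha, beta = 1.0, 2.0        # assumed true parameters

# Simulate y_it = alpha + beta * x_it + eps_it with no individual effects.
x = rng.normal(size=(N, T))
eps = rng.normal(scale=0.5, size=(N, T))
y = alpha + beta * x + eps

# Stack all N*T observations into one pooled dataset and run OLS.
X = np.column_stack([np.ones(N * T), x.ravel()])
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
alpha_hat, beta_hat = coef

print(alpha_hat, beta_hat)
```

Because the simulated data satisfy the pooled OLS assumption, the estimates land close to the assumed values. The same code applied to data with group-specific intercepts would not be reliable.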
What Is the One-Way Fixed Effects Model?
The one-way fixed effects panel data model:
 Includes unobservable time-specific or individual-specific effects. These effects capture omitted variables.
 Assumes that individual-specific effects are correlated with the observed characteristics, $x_{it}$.
 Pooled OLS estimates for data generated by this process will be inconsistent.
As an example, let’s consider the one-way fixed effects model with individual-specific effects where the unobservable component, $\delta z_i$, acts like an individual-specific intercept:
$$y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}$$
The intercept term, $\alpha_i$, varies across individuals but is constant across time. This term is composed of the constant intercept term, $\mu$, and the individual-specific effect, $\gamma_i$, so that $\alpha_i = \mu + \gamma_i$.
The key feature of the fixed effects model is that $\gamma_i$ has a true, but unobservable, effect that must be estimated. More importantly, if we estimate $\beta$ using pooled OLS and fail to appropriately account for $\gamma_i$, the estimates will be inconsistent and biased.
The fixed effects model requires the estimation of the model parameter $\beta$ and individual $\alpha_i$ for each of the N groups in the panel. This is generally achieved using one of three estimation techniques:
 Within-group estimation.
 First differences estimation.
 Least squares dummy variable (LSDV) estimation.
The first two of these techniques focus on eliminating the individual effects before estimation. The LSDV method directly incorporates these effects using dummy variables.
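The three techniques can be compared side by side on simulated data. The sketch below is a minimal NumPy illustration under assumed parameter values, with the individual intercepts deliberately correlated with $x_{it}$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, beta = 100, 6, 2.0                       # assumed dimensions and slope
alpha_i = rng.normal(scale=2.0, size=(N, 1))   # individual-specific intercepts
x = 0.8 * alpha_i + rng.normal(size=(N, T))    # x is correlated with alpha_i
y = beta * x + alpha_i + rng.normal(scale=0.5, size=(N, T))

def slope(xv, yv):
    """OLS slope without an intercept for two 1-D arrays."""
    return float(xv @ yv / (xv @ xv))

# Pooled OLS with a single common intercept (ignores alpha_i).
Xp = np.column_stack([np.ones(N * T), x.ravel()])
beta_pooled = np.linalg.lstsq(Xp, y.ravel(), rcond=None)[0][1]

# 1. Within-group: subtract each group's time average from x and y.
xw = x - x.mean(axis=1, keepdims=True)
yw = y - y.mean(axis=1, keepdims=True)
beta_within = slope(xw.ravel(), yw.ravel())

# 2. First differences: difference consecutive periods within each group.
beta_fd = slope(np.diff(x, axis=1).ravel(), np.diff(y, axis=1).ravel())

# 3. LSDV: regress y on x plus one dummy variable per group.
D = np.kron(np.eye(N), np.ones((T, 1)))        # (N*T, N) group dummies
X_lsdv = np.column_stack([x.ravel(), D])
beta_lsdv = np.linalg.lstsq(X_lsdv, y.ravel(), rcond=None)[0][0]

print(beta_pooled, beta_within, beta_fd, beta_lsdv)
```

The within-group and LSDV slopes agree up to floating point error, the first-difference slope is close to both, and the pooled OLS slope is pulled away from the assumed value because it ignores the correlated intercepts.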
What Is the One-Way Random Effects Model?
The one-way random effects panel data model:
 Includes unobservable time-specific or individual-specific effects, $\delta z_i$, which act like individual-specific stochastic error terms.
 Assumes that these effects are uncorrelated with the observed characteristics, $x_{it}$.
 Does not result in biased OLS estimates of coefficients but does lead to inefficient estimates and incorrect standard inference tools.
The distinguishing feature of the random effects model is that $\delta z_i$ does not have a true value but rather follows a random distribution with parameters that we must estimate.
The random effects term, $\delta z_i$:
 Is uncorrelated with $x_{it}$, so pooled OLS estimates of the model parameters will not be biased.
 Impacts the covariance structure of the error term, which implies that pooled OLS estimates of the model parameters will be inefficient and standard inference tools, like the t-statistic, will not be correct.
The random effects model should be estimated using feasible generalized least squares (FGLS). Using FGLS, the appropriate error structure, one which accounts for the individual-specific error terms, can be incorporated into the model.
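One common way to carry out FGLS for a balanced one-way random effects panel is to estimate the two variance components first and then run OLS on partially demeaned data. The sketch below follows that route in NumPy. The variance-component formulas are one standard choice rather than the only one, and the simulated data, parameter values, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 200, 5
alpha, beta = 1.0, 2.0                          # assumed true parameters
u = rng.normal(scale=1.0, size=(N, 1))          # random effect, independent of x
x = rng.normal(size=(N, T))
y = alpha + beta * x + u + rng.normal(scale=0.5, size=(N, T))

x_bar = x.mean(axis=1, keepdims=True)
y_bar = y.mean(axis=1, keepdims=True)

# Step 1: idiosyncratic variance from the within-group regression residuals.
xw, yw = (x - x_bar).ravel(), (y - y_bar).ravel()
b_within = xw @ yw / (xw @ xw)
sigma2_e = np.sum((yw - b_within * xw) ** 2) / (N * (T - 1) - 1)

# Step 2: random effect variance from the regression on group means.
Xb = np.column_stack([np.ones(N), x_bar.ravel()])
cb = np.linalg.lstsq(Xb, y_bar.ravel(), rcond=None)[0]
sigma2_b = np.sum((y_bar.ravel() - Xb @ cb) ** 2) / (N - 2)
sigma2_u = max(sigma2_b - sigma2_e / T, 0.0)

# Step 3: quasi-demean with weight theta, then run OLS on the transformed data.
theta = 1.0 - np.sqrt(sigma2_e / (sigma2_e + T * sigma2_u))
Xs = np.column_stack([np.full(N * T, 1.0 - theta), (x - theta * x_bar).ravel()])
ys = (y - theta * y_bar).ravel()
alpha_re, beta_re = np.linalg.lstsq(Xs, ys, rcond=None)[0]

print(theta, alpha_re, beta_re)
```

The weight `theta` lies between 0 and 1. A value of 0 reproduces pooled OLS and a value of 1 reproduces the within-group estimator, so the random effects estimate sits between the two.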
What Is the Random Coefficients Model?
The panel data regressions we’ve looked at so far have all assumed that the coefficients on regressors are the same across all individuals. The random coefficients model relaxes this assumption and introduces individual-specific effects through the coefficients, such that
$$y_{it} = \beta_i x_{it} + \alpha_i + \epsilon_{it}$$
Writing $\beta_i = \beta + b_i$ and $\alpha_i = \alpha + a_i$, this becomes
$$y_{it} = (\beta + b_i)x_{it} + (\alpha + a_i) + \epsilon_{it}$$
$$b_i \sim N(0, \tau_{i1}^2)$$
$$a_i \sim N(0, \tau_{i2}^2)$$
This model both introduces individual slope effects and allows for heteroscedasticity through the individual-specific $\tau_{i1}^2$ and $\tau_{i2}^2$.
This model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).
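To see what individual-specific slopes look like in data, the sketch below simulates the model and summarizes it with separate OLS fits per group, averaged across groups. This is a simpler mean-group style summary, not the FGLS or MLE estimators mentioned above. For simplicity it assumes the same $\tau_1$ and $\tau_2$ for every individual, and all values and names are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 200, 30
alpha, beta = 1.0, 2.0        # assumed average intercept and slope
tau_b, tau_a = 0.5, 1.0       # assumed common standard deviations of b_i and a_i

b_i = rng.normal(scale=tau_b, size=N)     # individual slope deviations
a_i = rng.normal(scale=tau_a, size=N)     # individual intercept deviations
x = rng.normal(size=(N, T))
eps = rng.normal(scale=0.5, size=(N, T))
y = (beta + b_i)[:, None] * x + (alpha + a_i)[:, None] + eps

# Fit a separate OLS regression for each individual.
coefs = np.empty((N, 2))
for i in range(N):
    Xi = np.column_stack([np.ones(T), x[i]])
    coefs[i] = np.linalg.lstsq(Xi, y[i], rcond=None)[0]

# Average the individual estimates and measure how much the slopes vary.
alpha_mg, beta_mg = coefs.mean(axis=0)
slope_sd = coefs[:, 1].std(ddof=1)

print(alpha_mg, beta_mg, slope_sd)
```

The averaged slope recovers the assumed $\beta$, while the spread of the per-group slopes reflects the variation in $b_i$ plus estimation noise.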
Two-Way Individual Effects Models
The two-way individual effects model allows the presence of both time-specific effects and individual-specific effects.
Starting from a simple linear model given by,
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
the two-way individual effects model can be represented by
$$y_{it} = \alpha + \beta x_{it} + \mu_i + \lambda_t + \epsilon_{it}$$
In this model, $\mu_i$ captures any unobservable individual-specific effects and $\lambda_t$ captures any unobservable time-specific effects. Note that the individual-specific effects, $\mu_i$, do not vary with time, while the time-specific effects, $\lambda_t$, do not vary across individuals.
In the special case that there are only two time periods and two groups, this model is equivalent to the difference-in-differences model. However, if there are more than two time periods and/or individuals, alternative panel data models must be considered.
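The two-by-two equivalence can be checked numerically. In the sketch below the four outcome values are made up, and $x_{it}$ is assumed to be a treatment indicator equal to 1 only for the second group in the second period.

```python
import numpy as np

# Hypothetical outcomes: rows are groups (control, treated),
# columns are time periods (pre, post).
y = np.array([[10.0, 12.0],
              [11.0, 16.0]])

# Difference-in-differences: change for the treated group minus
# change for the control group.
did = (y[1, 1] - y[1, 0]) - (y[0, 1] - y[0, 0])

# Two-way effects regression on the same four observations.
# Columns: intercept, treated-group dummy, post-period dummy, x_it.
X = np.array([
    [1, 0, 0, 0],   # control, pre
    [1, 0, 1, 0],   # control, post
    [1, 1, 0, 0],   # treated, pre
    [1, 1, 1, 1],   # treated, post
], dtype=float)
coef = np.linalg.solve(X, y.ravel())
beta_two_way = coef[3]

print(did, beta_two_way)
```

With four observations and four parameters the regression fits exactly, and the coefficient on $x_{it}$ equals the difference-in-differences estimate.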
What Is the Two-Way Fixed Effects Model?
The two-way fixed effects model:
 Assumes that both $\mu_i$ and $\lambda_t$ are unobservable, fixed effects that must be estimated.
For data generated by this model:
 Pooled OLS estimates, which ignore $\mu_i$ and $\lambda_t$, will be biased and inconsistent.
 One-way fixed effects estimates, which ignore $\lambda_t$, will be biased.
Like the one-way fixed effects model, this model could be estimated by including dummy variables. However, in the two-way fixed effects model dummy variables must be included for both the time periods and the groups.
Under most circumstances, the number of dummy variables included in the two-way fixed effects model makes standard ordinary least squares estimation computationally burdensome. Instead, the two-way fixed effects model is estimated using a within-group estimator, which removes both the group-specific and the time-specific means from the data.
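For a balanced panel, the two-way within transformation subtracts the group mean and the time mean from each observation and adds back the overall mean. The NumPy sketch below applies it to simulated data and compares the result with the dummy variable regression. The helper `two_way_demean`, the parameter values, and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta = 60, 8, 2.0
mu = rng.normal(scale=2.0, size=(N, 1))       # individual-specific effects
lam = rng.normal(scale=2.0, size=(1, T))      # time-specific effects
x = 0.5 * mu + 0.5 * lam + rng.normal(size=(N, T))   # x correlated with both
y = 1.0 + beta * x + mu + lam + rng.normal(scale=0.5, size=(N, T))

def two_way_demean(a):
    """Remove group means and time means, then add back the overall mean."""
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True) + a.mean())

# Within estimator on the transformed data (balanced panel only).
xt, yt = two_way_demean(x).ravel(), two_way_demean(y).ravel()
beta_within = xt @ yt / (xt @ xt)

# Same slope from the dummy variable regression, for comparison.
D_group = np.kron(np.eye(N), np.ones((T, 1)))          # N group dummies
D_time = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]    # T-1 time dummies
X = np.column_stack([x.ravel(), D_group, D_time])
beta_dummies = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][0]

print(beta_within, beta_dummies)
```

The two slopes match up to floating point error, but the within version never builds the large dummy variable matrix.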
What Is the Two-Way Random Effects Model?
The two-way random effects model:
 Occurs when both $\mu_i$ and $\lambda_t$ are unobservable, stochastic effects.
 Assumes that $\mu_i$ and $\lambda_t$ are independently distributed and are uncorrelated with $x_{it}$.
For data generated by this process:
 Pooled OLS estimates will be unbiased. However, the estimates will be inefficient and the associated standard errors and t-statistics will be biased.
Like the one-way random effects model, the two-way random effects model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).
Dynamic Panel Data Model
A key component of pure time series models is the modeling of dynamics using lagged dependent variables. These lagged variables capture the autocorrelation between observations of the same dataset at different points in time.
Because panel datasets include a time series component, it is also important to address the possibility of autocorrelation in panel data. The dynamic panel data model adds dynamics to the panel data individual effects framework.
Consider an individual effects model which includes an AR(1) term
$$y_{it} = \delta y_{i,t-1} + \beta x_{it} + \epsilon_{it}$$
where the error component includes one-way individual effects such that
$$\epsilon_{it} = \mu_i + \nu_{it}$$
where $\mu_i$ captures individual effects.
Introducing lagged dependent variables in the individual effects framework:
 Both $y_{it}$ and $y_{i,t-1}$ are functions of $\mu_i$, because $\mu_i$ is time-invariant. This implies that as a regressor, $y_{i,t-1}$ is correlated with the error term.
 Ordinary least squares (OLS) will lead to biased and inconsistent estimates because the lagged dependent variable is correlated with the error term.
Dynamic panel data models are most commonly estimated using a generalized method of moments (GMM) framework proposed by Arellano and Bond (1991).
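The sketch below illustrates the problem on simulated data with no $x_{it}$ term. It compares the within-group estimate of $\delta$ with a simple instrumental variables estimate on first-differenced data that uses $y_{i,t-2}$ as the instrument, in the spirit of Anderson and Hsiao. This is a simpler relative of the Arellano and Bond GMM estimator, not an implementation of it, and the dimensions and parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, burn, delta = 2000, 6, 50, 0.5     # assumed dimensions and AR(1) coefficient
mu = rng.normal(size=N)                  # individual effects

# Simulate y_it = delta * y_i,t-1 + mu_i + nu_it and drop a burn-in period.
y = np.zeros((N, T + burn))
for t in range(1, T + burn):
    y[:, t] = delta * y[:, t - 1] + mu + rng.normal(size=N)
y = y[:, burn:]

# Within-group estimator: biased when T is small.
y_cur, y_lag = y[:, 1:], y[:, :-1]
yc = (y_cur - y_cur.mean(axis=1, keepdims=True)).ravel()
yl = (y_lag - y_lag.mean(axis=1, keepdims=True)).ravel()
delta_within = yl @ yc / (yl @ yl)

# IV on first differences: differencing removes mu_i, and y_i,t-2 is
# used as an instrument for the differenced lag.
dy = np.diff(y, axis=1)
dy_cur = dy[:, 1:].ravel()       # y_t - y_{t-1}
dy_lag = dy[:, :-1].ravel()      # y_{t-1} - y_{t-2}
z = y[:, :-2].ravel()            # y_{t-2}
delta_iv = (z @ dy_cur) / (z @ dy_lag)

print(delta_within, delta_iv)
```

With only a handful of time periods, the within-group estimate falls well below the assumed $\delta$, while the instrumented estimate stays close to it.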
Panel Data and Stationarity
In panel data that covers small time frames, there is little need to worry about stationarity. However, when panel data covers longer time frames, as is the case in many macroeconomic panel data series, the panel data must be tested for stationarity.
Weak stationarity, required for many panel data modeling techniques, requires only that:
 A series has the same finite unconditional mean and finite unconditional variance at all time periods.
 The autocovariances of the series depend only on the lag between observations, not on time.
Nonstationary panel data series are any panel series that do not meet the conditions of a weakly stationary time series.
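In symbols, the two conditions can be written for each group $i$ as follows. The symbols $m_i$, $s_i^2$, and $g_i(k)$ are notation introduced here for illustration only.

$$E[y_{it}] = m_i, \qquad Var(y_{it}) = s_i^2 < \infty \qquad \text{for all } t$$
$$Cov(y_{it}, y_{i,t-k}) = g_i(k) \qquad \text{for all } t \text{ and each lag } k$$

A random walk, $y_{it} = y_{i,t-1} + \epsilon_{it}$ with $y_{i0} = 0$ and $Var(\epsilon_{it}) = \sigma^2$, fails the variance condition because $Var(y_{it}) = t\sigma^2$ grows with $t$.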
In part because of these considerations, a large field of research and literature surrounding panel data unit root tests has developed.
Testing for unit roots in panel data requires more than just testing the individual cross sections for the presence of unit roots. Panel data unit root tests must:
 Allow for both the shared movements across groups and the individual-specific movements within groups.
 Use an appropriate asymptotic distribution based on how quickly the number of panels (N) and the number of time periods (T) grow relative to one another.
 Determine whether to assume cross-sectional independence or to allow for cross-sectional dependence.
Conclusion
After today's blog, you should have an understanding of the fundamentals of panel data. We covered the basics of panel data including:
 The structure of panel data series.
 Wide versus long panel data series.
 One-way individual effects panel data models.
 Two-way individual effects panel data models.
 Dynamic panel data models.
 Panel data series and stationarity.
Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.