# How to Aggregate Panel Data in GAUSS

## Introduction

The aggregate function, first available in GAUSS version 20, computes statistics within data groups. This is particularly useful for panel data.

In today's blog, we take a closer look at aggregate. We will:

1. Introduce the basics of the aggregate function.
2. Explain how to use the aggregate function
3. Demonstrate a real-world application of the aggregate function using current account data from the International Monetary Fund.

## The GAUSS Aggregate Function

The GAUSS aggregate function computes statistics within a group based upon a specified group identifier. The function supports a variety of GAUSS statistics including:

• mean
• median
• mode
• min
• max
• sample standard deviation
• sum
• sample variance

For example, consider a panel dataset which includes observed weights for three individuals across a 6-month time span:

NameJan. WeightFeb. WeightMar. WeightApr. WeightMay WeightJune Weight
Sarah135134138142144145
Tom196192182183184181
Nikki143144146147145143

We can use the aggregate function to find the 6-month mean weights for Sarah, Tom and Nikki:

NameJan. WeightFeb. WeightMar. WeightApr. WeightMay WeightJune WeightMean Weight
Sarah135134138142144145139.7
Tom196192182183184181186.3
Nikki143144146147145143144.7

Alternatively, we could find the monthly standard deviation of the weights across Sarah, Tom and Nikki:

NameJan. WeightFeb. WeightMar. WeightApr. WeightMay WeightJune Weight
Sarah135134138142144145
Tom196192182183184181
Nikki143144146147145143
Monthly Std. Dev.33.231.023.422.422.821.4

## How to Use The Aggregate Function

The aggregate function takes two required inputs:

x_a = aggregate(x, method);

x
NxK data matrix, must have group identifiers in the first column.
method
String, method to use. Valid methods include: "mean", "median", "mode", "max", "min", "sd", "sum", "variance".

### The Input Data Matrix

The aggregate function requires the data matrix input to:

• Have numerical group identifiers in the first column.
• Be in stacked panel data format.

Let's consider our example dataset from above. In order to use this data as an input to the GAUSS aggregate function we need to:

• Recode our group identifiers from names to numbers.
• Stack our data into a pooled dataset.
NameJan. WeightFeb. WeightMar. WeightApr. WeightMay WeightJune Weight
Sarah135134138142144145
Tom196192182183184181
Nikki143144146147145143

$$\text{Sarah} \rightarrow 1$$ $$\text{Tom} \rightarrow 2$$ $$\text{Nikki} \rightarrow 3$$

$$\Downarrow$$

GroupJan. WeightFeb. WeightMar. WeightApr. WeightMay WeightJune Weight
1135134138142144145
2196192182183184181
3143144146147145143

$$\Downarrow$$

GroupMonthWeight
11135
12134
13138
14142
15144
16145
21196
22192
36143

### The Method Input

The method input into the aggregate function should always be a string indicating which statistic you wish to compute.

Each method works on groups within the panel the same way the analogous pooled data function would work, including its handling of missing values.

MethodPooled Function
meanmeanc
medianmedian
modemodec
maxmaxc
minminc
sumsumc
sdstdc
variancevarCovXS

### Example of How to Use Aggregate

Let's use aggregate to find the means by group for weight data:

weights = { 1 1 135,
1 2 134,
1 3 138,
1 4 142,
1 5 144,
1 6 145,
2 1 196,
2 2 192,
2 3 182,
2 4 183,
2 5 184,
2 6 181,
3 1 143,
3 2 144,
3 3 146,
3 4 147,
3 5 145,
3 6 143 };

/*
** Find the mean by person.
** We will use the first column
** as the group indicator and will find
** the mean of the weights.
*/
print aggregate(weights[., 1 3], "mean");

This prints the group means to the output window:

1   139.67
2   186.33
3   144.67

We can also use the month identifiers to find the sample standard deviation by month:

/*
** Find the standard deviation by month.
** We will use the second column of weights
** as the group indicator and will find
** the standard deviation of the weights.
*/
print aggregate(weights[., 2 3], "sd");

Now the standard deviations, along with their associated months will be printed to the output window:

1   33.151
2   31.005
3   23.438
4   22.368
5   22.811
6   21.385

Our simple example dataset is useful for demonstrating the basics of the aggregate function. However, a real-world panel dataset better demonstrates its true power. In this section, we will use aggregate to examine some of the trends in international current account balances.

### The Data

We will use current account balance measured as a percentage of GDP. This unbalanced panel data is a modified version of a dataset from the International Monetary Fund and spans 1953-Q3 to 2019-Q4 and includes a total of 46 countries, across 5 different regions.

It contains the following variables:

Variable name Description
Country String, name of the country.
Country ID Integer country identifier.
World Region String, name of the corresponding world region.
Region ID Integer world region identifier.
Time String, the date of the observation.
CAB Decimal numeric, the Current Account Balance.

### Mean and Median Current Account Balances By Country

We first examine variations in the mean and median current account balances across countries. Using aggregate we find the mean and median current account balances for each country in the panel across all observations.

/*
** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");

// Compute mean and median current
// account balances by Country ID
mean_cab_cid = aggregate(X, "mean");
median_cab_cid = aggregate(X, "median");

After the above code mean_cab_cid and median_cab_cid will be both be $46\times2$ matrices. Each element in the first column will be a unique country ID. The corresponding element in the second column will be the average (mean or median) current account balance for that country.

We include a graph of this data below, where we see that Germany leads the pack with the highest average current account balance while Finland has the lowest average current account balances.

### Mean and Median Current Account Balances By Region

We can similarly consider the mean and median current account balances across geographical regions with the code below.

/*
** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");

// Compute mean and median current
// account balances by world region
mean_cab_wreg = aggregate(X, "mean");
median_cab_wreg = aggregate(X, "median");

mean_cab_wreg and median_cab_wreg will be two column matrices with unique world region IDs in the first column and the corresponding statistics in the second column.

### Mean and Median Current Account Balances Time Series

Finally, we consider how the mean and median current account balances vary across time in the time series plot below.

/*
** Load the 'Time' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string ** variable that we want GAUSS to convert to a date. */ X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");

mean_cab_date = aggregate(X, "mean");
median_cab_date = aggregate(X, "median");

This time the first column of our resulting matrices, mean_cab_date and median_cab_date, will contain each unique date from our dataset. The second column will contain the statistic computed for each unique date.

In today's blog, we examined the fundaments of the aggregate procedure. After reading you should have a better understanding of:
1. The basics of the aggregate function.
2. How to use the aggregate function.
3. How to examine trends in real-world panel data using aggregate.