This post will show you how to create a simple output table with variable names, parameter estimates, and standard errors, as shown below, using sprintf.

  population  0.6558 (0.1910)
poverty rate  0.9465 (0.2505)
   inflation  1.7366 (0.1101)
sprintf, first introduced in GAUSS version 20, is a standard function in many languages. It provides powerful, compact statements that allow you to combine and style numeric and string data.
Unfortunately, some people find the sprintf format string difficult to use and understand. Fortunately, today we are going to make it clear for you!
The GAUSS sprintf function takes two or more inputs and returns a string array.

sprintf(fmt, a)
sprintf(fmt, a, ...)

- `fmt`: The sprintf format string (described later).
- `a`: The data to be printed. Additional data inputs may follow.
The sprintf format string has many options that give you great control over the presentation of your data. Today we will show you how to use the three most important options, which will cover most use cases.
The most basic format specifier starts with a percent sign (%) followed by a letter, which is called the specifier character.
| Specifier Character | Output | Example | Result |
|---|---|---|---|
| `d` | Integer | `sprintf("%d", 3.1415)` | `3` |
| `e` | Scientific | `sprintf("%e", 3.1415)` | `3.1415e+00` |
| `f` | Decimal | `sprintf("%f", 3.1415)` | `3.1415` |
| `s` | String | `sprintf("%s", "Pi")` | `Pi` |
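These specifier characters follow the C printf conventions, so the same format strings behave similarly in many languages. As a rough cross-check (Python, not GAUSS), here is a sketch; note that Python's `%e` and `%f` default to six digits of precision, so the decimal and scientific outputs carry more digits than the GAUSS results above:

```python
# The same C-style specifier characters, demonstrated with Python's % operator
print("%d" % 3.1415)   # '3'            (integer)
print("%e" % 3.1415)   # '3.141500e+00' (scientific, six-digit default precision)
print("%f" % 3.1415)   # '3.141500'     (decimal, six-digit default precision)
print("%s" % "Pi")     # 'Pi'           (string)
```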
sprintf format strings can contain more than one format specifier, and specifiers can be mixed with text. For example:

sprintf("%s is %d", "Steve", 38);

will print out:

Steve is 38

while

sprintf("%sis%d", "Steve", 38);

will return Steveis38.
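Because this mini-language is shared across languages, the same behavior can be checked with Python's `%` operator (a Python analogue, not GAUSS):

```python
# Specifiers are filled in order; surrounding text is copied verbatim
s = "%s is %d" % ("Steve", 38)
print(s)                            # 'Steve is 38'
print("%sis%d" % ("Steve", 38))     # 'Steveis38'
```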
Now we know enough to create the first iteration of our table.
// Create a 3x1 string array
var_names = "population" $| "poverty rate" $| "inflation";

// Create two 3x1 vectors
beta_ = { 0.65582591,
          0.94648444,
          1.7365981 };

se = { 0.19095146,
       0.25047475,
       0.11009904 };

sprintf("%s %f %f", var_names, beta_, se);
will print out
population 0.655826 0.190951
poverty rate 0.946484 0.250475
inflation 1.736598 0.110099
The overlapping columns make this table difficult to read.
To stop the columns from overlapping, we need to make the field width large enough to hold the widest element of each column. The field width is an integer placed immediately after the percent sign (%). For example:
| Code | Output |
|---|---|
| `sprintf("%8f %8f", 3.1415, 6.283)` | `3.141500 6.283000` |
| `sprintf("%12f %8f", 3.1415, 6.283)` | `    3.141500 6.283000` |
| `sprintf("%12f%8f", 3.1415, 6.283)` | `    3.1415006.283000` |
| `sprintf("%12f%9f", 3.1415, 6.283)` | `    3.141500 6.283000` |
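Field widths behave the same way in other C-style formatters, so these cases can be reproduced in Python (a Python analogue, not GAUSS):

```python
# The field width left-pads the value up to the requested number of characters
print("%8f %8f" % (3.1415, 6.283))   # '3.141500 6.283000' (8 chars each, no padding needed)
print("%12f %8f" % (3.1415, 6.283))  # '    3.141500 6.283000'
print("%12f%9f" % (3.1415, 6.283))   # '    3.141500 6.283000' (widths alone separate the columns)
```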
Now let's update our code to:

// Create a 3x1 string array
var_names = "population" $| "poverty rate" $| "inflation";

// Create two 3x1 vectors
beta_ = { 0.65582591,
          0.94648444,
          1.7365981 };

se = { 0.19095146,
       0.25047475,
       0.11009904 };

sprintf("%12s %9f %9f", var_names, beta_, se);
which will return:

  population  0.655826  0.190951
poverty rate  0.946484  0.250475
   inflation  1.736598  0.110099
This is much better. As we saw earlier, we can add any text to the format string that we'd like to see in the output. Let's add parentheses around the format specifier for the standard error column.
// Surround SE format specifier with parentheses
sprintf("%12s %9f (%9f)", var_names, beta_, se);

will return:

  population  0.655826 ( 0.190951)
poverty rate  0.946484 ( 0.250475)
   inflation  1.736598 ( 0.110099)
The parentheses in the above table make it easy for us to see the 9 character field width of the final column. If the standard error could be negative, we might want to leave it as it is to allow room for a possible minus sign. However, since we know it will be positive, let's:
// Decrease field width of final column
// and add one space before the '('
sprintf("%12s %9f (%8f)", var_names, beta_, se);

will return:

  population  0.655826 (0.190951)
poverty rate  0.946484 (0.250475)
   inflation  1.736598 (0.110099)
Our table is really starting to look good, but to match the table from the introduction, we need to reduce the number of digits to the right of the decimal point to 4.
The sprintf precision specifier is a dot, or period, followed by an integer. It controls the number of digits printed after the decimal point when using the floating-point `f` specifier character.
| Code | Output |
|---|---|
| `sprintf("%12.2f %8.3f", 3.1415, 6.283)` | `        3.14    6.283` |
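The width and precision combine the same way in Python's C-style formatting (a Python analogue, not GAUSS):

```python
# %12.2f: round to 2 decimals, then left-pad to 12 characters;
# %8.3f: round to 3 decimals, then left-pad to 8 characters
print("%12.2f %8.3f" % (3.1415, 6.283))  # '        3.14    6.283'
```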
After setting the precision to 4:

// Decrease precision to 4 for both numeric columns
sprintf("%12s %9.4f (%8.4f)", var_names, beta_, se);

our output looks like this:

  population    0.6558 (  0.1910)
poverty rate    0.9465 (  0.2505)
   inflation    1.7366 (  0.1101)
We can see that the final column displays 6 characters, but the field width is set to 8 characters. Similarly, the second column needs 7 characters (5 digits, a decimal point, and a possible minus sign), but we have given it 9.
Let's fix those:

// Decrease field width for numeric columns to remove extra spaces
sprintf("%12s %7.4f (%6.4f)", var_names, beta_, se);

We have reached our goal!

  population  0.6558 (0.1910)
poverty rate  0.9465 (0.2505)
   inflation  1.7366 (0.1101)
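The final format string translates directly to other languages with C-style formatting. Here is a Python sketch (not GAUSS) that builds the same table row by row:

```python
# (name, estimate, standard error) triples from the example above
rows = [
    ("population",   0.65582591, 0.19095146),
    ("poverty rate", 0.94648444, 0.25047475),
    ("inflation",    1.7365981,  0.11009904),
]

# Same specifiers as the GAUSS call: %12s %7.4f (%6.4f)
table = "\n".join("%12s %7.4f (%6.4f)" % row for row in rows)
print(table)
```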
Great job! You have learned the basics you need to know to display your data with sprintf in GAUSS, including:

- How sprintf can print styled tables with columns of string and numeric data.
- How to control the output with the field width, the precision, and the `d`, `e`, `f`, and `s` specifier characters.

While there is much more that sprintf can do, if you can remember `"%[field width].[precision]f"` and `"%s"`, you'll have a great foundation to expand on.
Anyone who owns GAUSS or has access to an institutional copy.
We are monitoring the situation. For the next 14 days, we will be offering 60-day temporary licenses.
Fill out this webform to receive your license.
During this trying time, we remain committed to supporting you and the amazing work that you do. Aptech is following expert guidelines to keep our staff safe and minimize disease spread while continuing to serve you.
Please reach out if there is anything we can do to help you.
Best regards, The Aptech Team
Optional input arguments can make your statistical computing more efficient and enjoyable. GAUSS version 20 added a new suite of tools to make it easy for you to add optional input arguments to your GAUSS procedures.
In this blog post, we will show you how to easily add and handle optional input arguments in your GAUSS procedures.
We’ll start by creating a simple toy GAUSS procedure that does not use optional arguments.
// Run procedure
myProc(1.1, 2.2, 3.3);

// Define procedure
proc (0) = myProc(a, b, c);
    print a;
    print b;
    print c;
endp;
If we run the code above, we will get the following output:
1.1
2.2
3.3
To modify this procedure to accept optional arguments, we need to make a few changes, starting with the procedure declaration.
The procedure declaration is the first line of the procedure. The procedure declaration from our initial procedure was:
// Procedure declaration
proc (0) = myProc(a, b, c);
Now we will modify our original procedure to make `a` the only required input. The new procedure declaration will look like this:
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
In the procedure declaration above, `a` is a required input, as it was before. However, we have replaced `b` and `c` with `...`. The triple dots represent the optional arguments and must be the final input in the procedure declaration.
The GAUSS function dynargsCount returns the number of optional arguments passed into the procedure. dynargsCount may only be called inside of a GAUSS procedure and does not take any inputs. For example:
a = 9;
b = 3.14;
c = 2.54;

print "call 1";
myProc(a);

print "call 2";
myProc(a, b);

print "call 3";
myProc(a, b, c);

// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
    local n_dynargs;

    // Notice that you do not pass '...'
    // into 'dynargsCount'
    n_dynargs = dynargsCount();

    print "Number of optional inputs = " n_dynargs;
    print "";
endp;
This will print the following output:

call 1
Number of optional inputs = 0.0

call 2
Number of optional inputs = 1.0

call 3
Number of optional inputs = 2.0
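GAUSS's `...` and dynargsCount behave much like Python's `*args` and `len`. A rough Python analogue (the procedure name mirrors the GAUSS example; this is our sketch, not GAUSS code):

```python
def my_proc(a, *dynargs):
    # len(dynargs) plays the role of dynargsCount()
    return len(dynargs)

print(my_proc(9))              # 0
print(my_proc(9, 3.14))        # 1
print(my_proc(9, 3.14, 2.54))  # 2
```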
This allows you to add logic which depends on the number of optional arguments that were passed into your procedure.
These optional arguments are also referred to as dynargs. You can access optional arguments in GAUSS procedures by:

- Using the dynargsGet function.
- Passing the triple dots, ..., to another function.

The dynargsGet procedure returns specified optional arguments. It requires one input that specifies the index, or index range, of the optional arguments to return. For example:
a = 9;
b = 3.14;
c = 2.54;

print "call 1";
myProc(a);

print "call 2";
myProc(a, b);

print "call 3";
myProc(a, b, c);

// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
    // Since 'b' and 'c' are not declared as inputs
    // in the procedure declaration, they must be
    // declared as local variables
    local b, c, idx;

    // The range of optional inputs we want to return
    idx = { 1, 2 };

    { b, c } = dynargsGet(idx);

    print "a = " a;
    print "b = " b;
    print "c = " c;
    print "";
endp;
This time, the code will print out:

call 1
a = 9.00
b = {}
c = {}

call 2
a = 9.00
b = 3.14
c = {}

call 3
a = 9.00
b = 3.14
c = 2.54
As we see above, `a` is passed through as we would expect each time. In "call 1", neither `b` nor `c` was passed into myProc. We can see that GAUSS printed their contents as empty curly braces, {}. This indicates that they were empty matrices.
If an optional input argument requested by dynargsGet is not supplied, by default an empty matrix will be returned in its place. We can check for empty matrices with the GAUSS function isempty. For example:
// Create an empty matrix
z = {};

// Check to see if 'z' is an empty matrix
if isempty(z);
    print "z is an empty matrix";
else;
    print "z is not an empty matrix";
endif;
As you may have guessed, the above code will print:
z is an empty matrix
While we can use isempty and if statements to handle the optional inputs, it is often simpler to set default values.
Fortunately, dynargsGet allows us to specify default values for the optional arguments which are not supplied. To specify default values with dynargsGet, simply add them as a comma-separated list of inputs after the required index input. The defaults are matched, in order, to the requested optional arguments, and a default is used only when the corresponding argument was not passed in.
We can set default values for our toy example, like this:
a = 9;
b = 3.14;
c = 2.54;

print "call 1";
myProc(a);

print "call 2";
myProc(a, b);

print "call 3";
myProc(a, b, c);

// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
    // Since 'b' and 'c' are not declared as inputs
    // in the procedure declaration, they must be
    // declared as local variables
    local b, c, idx;

    // The range of optional inputs we want to return
    idx = { 1, 2 };

    // If 'b' or 'c' are not passed in, return
    // 29 or 33, respectively
    { b, c } = dynargsGet(idx, 29, 33);

    print "a = " a;
    print "b = " b;
    print "c = " c;
    print "";
endp;
This time our code will return:

call 1
a = 9.00
b = 29.00
c = 33.00

call 2
a = 9.00
b = 3.14
c = 33.00

call 3
a = 9.00
b = 3.14
c = 2.54
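For readers more familiar with Python, the default-value behavior of dynargsGet can be mimicked like this (`dynargs_get` is our own helper, not a GAUSS or Python built-in):

```python
def dynargs_get(dynargs, *defaults):
    # Return one value per default: the passed-in argument when present,
    # otherwise the default in the same position
    return tuple(dynargs[i] if i < len(dynargs) else d
                 for i, d in enumerate(defaults))

def my_proc(a, *dynargs):
    # If 'b' or 'c' are not passed in, use 29 or 33, respectively
    b, c = dynargs_get(dynargs, 29, 33)
    return a, b, c

print(my_proc(9))              # (9, 29, 33)
print(my_proc(9, 3.14))        # (9, 3.14, 33)
print(my_proc(9, 3.14, 2.54))  # (9, 3.14, 2.54)
```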
While any input arguments accessed by dynargsGet can be passed to other procedures, it is important to know that the triple dots, ..., can also be passed directly into other procedures.
Passing the triple dots, ..., to another procedure is just as if you had passed a comma-separated list of the arguments in .... Note that the triple dots do not have to be the final input when passed to other procedures.
For example:
X_1 = initMat1(3.14, 5, 2);
X_2 = initMat2(3.14, 5, 2);

// Procedure declaration with optional inputs
proc (1) = initMat1(val, ...);
    local X;
    X = val + zeros(...);
    retp(X);
endp;

// Procedure declaration with only required inputs
proc (1) = initMat2(val, r, c);
    local X;
    X = val + zeros(r, c);
    retp(X);
endp;
The calls to initMat1 and initMat2 will behave the same.
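The same forwarding pattern exists in other languages. A rough Python analogue, forwarding `*args` the way `...` is forwarded to zeros in GAUSS (the helper names are ours):

```python
def zeros(r, c):
    # r x c matrix of zeros, as nested lists
    return [[0.0] * c for _ in range(r)]

def init_mat1(val, *args):
    # *args is forwarded to zeros, just as '...' is passed to zeros in GAUSS
    return [[val + x for x in row] for row in zeros(*args)]

def init_mat2(val, r, c):
    return [[val + x for x in row] for row in zeros(r, c)]

X_1 = init_mat1(3.14, 5, 2)
X_2 = init_mat2(3.14, 5, 2)
```
Both calls produce the same 5x2 matrix filled with 3.14.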
Now that we’ve seen the basics of using optional inputs to GAUSS procedures, let’s work through a more practical example. We’ll create two procedures that will work together to simulate a linear model.
Our first procedure will create a random normal matrix, with the option to specify the mean and standard deviation of the variables.
/*
** Inputs: r   - Required input. Number of rows.
**         c   - Required input. Number of columns.
**         mu  - Optional input. Mean of the simulated columns.
**               Default = 0.
**         sd  - Optional input. Standard deviation of the
**               simulated columns. Default = 1.
*/
proc (1) = rndNormal(r, c, ...);
    local X, mean, sd;

    // If a 3rd input is passed in, assign it
    // to 'mean'. Otherwise set 'mean' equal to 0.
    // If a 4th input is passed in, assign it
    // to 'sd'. Otherwise set 'sd' equal to 1.
    { mean, sd } = dynargsGet(1|2, 0, 1);

    // Compute a random matrix with the
    // specified mean and sd
    X = (rndn(r, c) .* sd) + mean;

    retp(X);
endp;
Here are a few examples of how we could call this procedure:
// Create a 100x4 random normal matrix
// with mean = 0 and sd = 1
X_1 = rndNormal(100, 4);

// Create a 130x2 random normal matrix
// with mean = 3.7 and sd = 1
X_2 = rndNormal(130, 2, 3.7);

// Create a 74x5 random normal matrix
// with mean = 2 and sd = 3
X_3 = rndNormal(74, 5, 2, 3);
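For comparison, here is a rough Python analogue of rndNormal, using keyword defaults in place of dynargsGet (the function name and structure are ours, not GAUSS code):

```python
import random

def rnd_normal(r, c, mean=0.0, sd=1.0):
    # r x c matrix of normal draws with the given mean and sd;
    # mean and sd are optional and default to 0 and 1
    return [[random.gauss(mean, sd) for _ in range(c)] for _ in range(r)]

X_1 = rnd_normal(100, 4)         # mean = 0, sd = 1
X_2 = rnd_normal(130, 2, 3.7)    # mean = 3.7, sd = 1
X_3 = rnd_normal(74, 5, 2, 3)    # mean = 2, sd = 3
```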
Next, we’ll create a procedure that uses our rndNormal procedure to simulate linear models.
/*
** Inputs: b_true      - Required input. True parameter values of
**                       the simulated linear model.
**         nobs        - Required input. The number of observations
**                       to simulate.
**         alpha_true  - Optional input. True intercept of the
**                       simulated linear model. Default = 0.
**         err_sd      - Optional input. Standard deviation of the
**                       error term. Default = 1.
*/
proc (1) = simulateLM(b_true, nobs, ...);
    local X, nvars, alpha_plus_err, y;

    // Find number of desired variables
    nvars = rows(b_true);

    // Simulate variables with
    // mean = 0, sd = 1
    X = rndNormal(nobs, nvars);

    // Simulate error term and add intercept
    // if it was passed in
    alpha_plus_err = rndNormal(nobs, 1, ...);

    y = X * b_true + alpha_plus_err;

    retp(y);
endp;
Here are a few examples of how our simulateLM procedure can be used:
b = { 0.8, 1.1, 0.2 };
n = 10;

/*
** Simulate linear model:
**   - Without intercept
**   - With error term sd equal to 1
*/
y_1 = simulateLM(b, n);

/*
** Simulate linear model:
**   - With intercept
**   - With error term sd equal to 1
*/
alpha = 2.3;
y_2 = simulateLM(b, n, alpha);

/*
** Simulate linear model:
**   - With intercept
**   - With error term sd equal to 3
*/
alpha = 2.3;
err_sd = 3;
y_3 = simulateLM(b, n, alpha, err_sd);
Congratulations! You should now be able to start creating GAUSS procedures with optional input arguments. You have learned how to:

- Add optional arguments, using the triple dots, ..., in the procedure declaration.
- Count the number of optional arguments with dynargsCount.
- Access optional arguments and set default values with dynargsGet.

Cointegration is an important tool for modeling the long-run relationships in time series data. If you work with time series data, you will likely find yourself needing to use cointegration at some point.
This blog provides an in-depth introduction to cointegration and will cover all the nuts and bolts you need to get started, from the underlying theory to testing for cointegration in GAUSS.
Though not necessary, you may find it helpful to review the blogs on time series modeling and unit root testing before continuing with this blog.
Economic theory suggests that many time series datasets will move together, fluctuating around a long-run equilibrium. In econometrics and statistics, this long-run equilibrium is tested and measured using the concept of cointegration.
Cointegration occurs when two or more nonstationary time series share a common stochastic trend, so that some linear combination of the series is stationary. Supporting theory from a number of fields suggests such relationships:
| Field | Supporting Theory | Time Series |
|---|---|---|
| Economics | The permanent income hypothesis describes how agents spread their consumption out over their lifetime based on their expected income. | Consumption and income. |
| Economics | Purchasing power parity is a theory that relates the prices of a basket of goods across different countries. | Nominal exchange rates and domestic and foreign prices. |
| Finance | The present value model of stock prices implies a long-run relationship between stock prices and their dividends or earnings. | Stock prices and stock dividends/earnings. |
| Epidemiology | Joint mortality models imply a long-run relationship between mortality rates across different demographics. | Male and female mortality rates. |
| Medicine | Time series methodologies have been used to examine comorbidities of different types of cancers and trends in medical welfare. | Occurrence rates of different types of cancer. |
To understand the mathematics of cointegration, let's consider a group of time series, $Y_t$, which is composed of three separate time series:
$$y_1 = (y_{11}, y_{12}, \ldots, y_{1t})$$ $$y_2 = (y_{21}, y_{22}, \ldots, y_{2t})$$ $$y_3 = (y_{31}, y_{32}, \ldots, y_{3t})$$
All three series are nonstationary time series.
Cointegration implies that while $y_1$, $y_2$, and $y_3$ are individually nonstationary, they can be combined in a way that their linear combination is stationary:
$$\beta Y_t = \beta_1 y_{1t} + \beta_2 y_{2t} + \beta_3 y_{3t} \sim I(0)$$
The Cointegrating Vector
In the context of cointegration, $\beta$ is commonly known as the cointegrating vector. This vector describes the long-run equilibrium relationship among the series and is not unique: any scalar multiple of a cointegrating vector is also a cointegrating vector.
Normalization
Because there can be multiple cointegrating vectors that fit the same economic model, we must impose identification restrictions to normalize the cointegrating vector for estimation.
A common normalization of the cointegrating vector is to set $\beta = (1, -\beta_2, \ldots, -\beta_N)$. For example, applying these restrictions to our earlier system yields

$$\beta Y_t = y_{1t} - \beta_2 y_{2t} - \beta_3 y_{3t} \sim I(0)$$
Part of the appeal of this normalization is that it can be rewritten in a standard regression form
$$y_{1t} = \beta_2 y_{2t} + \beta_3 y_{3t} + u_t$$
where $u_t$ is a stationary cointegrating error component. Intuitively, $u_t$ can be thought of as shortterm deviations from the longrun equilibrium.
While the regression format is a common normalization, it is important to remember that economic theory should inform our identifying restrictions.
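To build intuition for the stationarity of the cointegrating error, here is a self-contained Python simulation (an illustrative sketch, not GAUSS code from the post) of two series that share a random-walk trend; the cointegrating vector $(1, -2)$ is chosen by construction:

```python
import random
import statistics

random.seed(42)
T = 2000

# Shared stochastic trend: a random walk
trend = [0.0]
for _ in range(T - 1):
    trend.append(trend[-1] + random.gauss(0, 1))

# Two nonstationary series driven by the same trend
y1 = [t + random.gauss(0, 1) for t in trend]
y2 = [0.5 * t + random.gauss(0, 1) for t in trend]

# The combination y1 - 2*y2 removes the shared trend,
# leaving a stationary error with bounded variance
combo = [a - 2 * b for a, b in zip(y1, y2)]

print(statistics.pvariance(y1))     # large: grows with T
print(statistics.pvariance(combo))  # small: bounded
```
The variance of each individual series grows with the sample size, while the variance of the linear combination stays bounded, which is exactly the $I(1)$-versus-$I(0)$ distinction above.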
Cointegration implies that the time series will be connected through an error correction model (ECM). The error correction model is important in time series analysis because it allows us to better understand long-run dynamics. Additionally, failing to properly model cointegrated variables can result in biased estimates.
The error correction model links the short-run changes in each series to the deviations from the long-run equilibrium.
Let's assume that there is a bivariate cointegrated system with $Y_t = (y_{1t}, y_{2t})$ and a cointegrating vector $\beta = (1, -\beta_2)$ such that

$$\beta Y_t = y_{1t} - \beta_2 y_{2t}$$
The error correction model depicts the dynamics of a variable as a function of the deviations from long-run equilibrium

$$\Delta y_{1t} = c_1 + \alpha_1 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j} + \epsilon_{1t}$$ $$\Delta y_{2t} = c_2 + \alpha_2 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{21} \Delta y_{1,t-j} + \sum_j \psi^j_{22} \Delta y_{2,t-j} + \epsilon_{2t}$$
| Term | Description | Intuition |
|---|---|---|
| $y_{1,t-1} - \beta_2 y_{2,t-1}$ | Cointegrated long-run equilibrium | Because this is an equilibrium relationship, it plays a role in the dynamic paths of both $y_{1t}$ and $y_{2t}$. |
| $\alpha_1$, $\alpha_2$ | Adjustment coefficients | Capture the reactions of $y_{1t}$ and $y_{2t}$ to disequilibrium. |
| $\sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j}$ | Autoregressive distributed lags | Capture additional short-run dynamics. |
If the cointegrating vector has been previously estimated, then standard OLS or dynamic OLS (DOLS) can be used to estimate the error correction relationship. More generally, the ECM relationship can be estimated using OLS, seemingly unrelated regressions (SUR), or maximum likelihood estimation.
The vector error correction model (VECM) is the multivariate extension of the ECM. If we are working in a vector autoregressive context, cointegration implies a VECM such that

$$\Delta Y_t = \Phi D_t + \Pi Y_{t-1} + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \epsilon_t$$

Like the ECM, the VECM parameters reflect the long-run and short-run dynamics of the system, as shown in the table below:
| Term | Description | Intuition |
|---|---|---|
| $\Pi$ | Long-run impact matrix. | $\Pi = \Pi_1 + \Pi_2 + \cdots + \Pi_p - I_n$ captures adjustments towards the long-run equilibrium and contains the cointegrating relationships. |
| $\Gamma_k$ | Short-run impact matrix. | The short-run impact matrix is constructed as $-\sum_{j=k+1}^p \Pi_j$ and captures short-run deviations from the equilibrium. |
| $D_t$ | Deterministic terms. | These terms take the form $D_t = u_0 + u_1 t$, where $u_0$ is the constant component and $u_1 t$ is the trend component. |
The VECM model can be estimated using the Johansen maximum likelihood method.
Before jumping directly to cointegration testing, there are a number of other time series modeling steps that we should consider first.
One of the key considerations prior to testing for cointegration is whether there is theoretical support for the cointegrating relationship. It is important to remember that cointegration occurs when separate time series share an underlying stochastic trend. The idea of a shared trend should be supported by economic theory.
As an example, consider growth theory which suggests that productivity is a key driver of economic growth. As such, it acts as the common trend, driving the comovements of many indicators of economic growth. Hence, this theory implies that consumption, investment, and income are all cointegrated.
One of the first steps in time series modeling should be data visualization. Time series plots provide good preliminary insights into the behavior of time series data, such as trends, structural breaks, and comovement between series.
We've established that cointegration occurs between nonstationary, I(1), time series. This implies that before testing for or estimating a cointegrating relationship, we should perform unit root testing.
Our previous blog, "How to Conduct Unit Root Testing in GAUSS", provides an in-depth look at how to perform unit root testing in GAUSS.
GAUSS tools for performing unit root tests are available in a number of libraries, including the Time Series MT (TSMT), the open-source TSPDLIB, and the coint libraries. All of these can be directly located and installed using the GAUSS package manager.
Full example programs for testing for unit roots using TSMT procedures and TSPDLIB procedures are available on our Aptech GitHub page.
| Panel Data Unit Root Test | TSMT procedure | TSPDLIB procedure |
|---|---|---|
| Hadri | | hadri |
| Im, Pesaran, and Shin | | ips |
| Levin-Lin-Chu | | llc |
| Schmidt and Perron LM test | | lm |
| Breitung and Das | | breitung |
| Cross-sectionally augmented IPS test (CIPS) | | cips |
| Panel analysis of nonstationarity in idiosyncratic and common components (PANIC) | | bng_panic |
| Time Series Unit Root Test | TSMT procedure | TSPDLIB procedure |
|---|---|---|
| Augmented Dickey-Fuller | vmadfmt | adf |
| Phillips-Perron | vmppmt | pp |
| KPSS | kpss | lmkpss |
| Schmidt and Perron LM test | | lm |
| GLS-ADF | dfgls | dfgls |
| Quantile ADF | | qr_adf |
A complete time series analysis should consider the possibility that structural breaks have occurred, because standard tests for cointegration are invalid in their presence. Therefore, it is important to test for structural breaks before testing for cointegration.
The GAUSS sbreak procedure, available in TSMT, is an easy-to-use tool for identifying multiple, unknown structural breaks.
In order to test for cointegration, we must test whether a long-run equilibrium exists for a group of data. There are a number of questions that need to be considered, including whether the cointegrating vector is known, how many cointegrating relationships there may be, and whether structural breaks are present. In this section, we will show how to use these questions to guide cointegration testing without structural breaks.
The Engle-Granger cointegration test considers the case where there is a single cointegrating vector. The test follows the very simple intuition that if variables are cointegrated, then the residual of the cointegrating regression should be stationary.
Forming the cointegrating residual
How to form the cointegrating residual depends on whether the cointegrating vector is known or must be estimated:

- If the cointegrating vector is known, the cointegrating residuals are directly computed using $u_t = \beta Y_t$, and a unit root test is applied to the residuals.
- If the cointegrating vector is unknown, OLS is used to estimate the normalized cointegrating vector from the regression $$y_{1t} = c + \beta y_{2t} + u_t$$ and the unit root test is applied to the estimated residuals, $\hat{u}_t$.
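As a rough illustration of the unknown-vector case, here is a Python sketch (not GAUSS) of forming the cointegrating residuals by OLS; the unit root test that would follow is omitted, and the helper name `ols_residuals` is our own:

```python
def ols_residuals(y, x):
    # OLS of y on x with a constant; returns the residuals u_t,
    # to which a unit root test (e.g. ADF) would then be applied
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    const = my - beta * mx
    return [yi - (const + beta * xi) for xi, yi in zip(x, y)]
```
When the data lie exactly on a line, the residuals are identically zero; more generally, with a constant in the regression they sum to zero.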
There are two Johansen cointegration tests for the VECM context, the trace test and the maximal eigenvalue test. These tests hinge on the intuition that, in the VECM, the rank of the long-run impact matrix, $\Pi$, determines if the VAR(p) variables are cointegrated. The rank of the long-run impact matrix equals the number of cointegrating relationships.
The Johansen testing process has two general steps: estimate the VECM, then test the rank of $\Pi$.
The Johansen Trace Statistic
The Johansen trace statistic sequentially tests the null hypothesis that the number of cointegrating vectors, $k = m$, against the alternative that $k > m$.
| Stage | Null Hypothesis | Alternative | Conclusion |
|---|---|---|---|
| One | $H_0: k = 0$ | $H_A: k > 0$ | If $H_0$ cannot be rejected, stop testing, and $k = 0$. If the null is rejected, perform the next test. |
| Two | $H_0: k \leq 1$ | $H_A: k > 1$ | If $H_0$ cannot be rejected, stop testing, and $k \leq 1$. If the null is rejected, perform the next test. |
| Three | $H_0: k \leq 2$ | $H_A: k > 2$ | If $H_0$ cannot be rejected, stop testing, and $k \leq 2$. If the null is rejected, perform the next test. |
| $m-1$ | $H_0: k \leq m-1$ | $H_A: k > m-1$ | If $H_0$ cannot be rejected, stop testing, and $k \leq m-1$. If the null is rejected, perform the next test. |
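This sequential decision rule can be sketched in Python (an illustrative sketch, not GAUSS; the statistics and critical values in the usage examples are made-up numbers):

```python
def johansen_rank(stats, critical_values):
    # stats[m] tests H0: k <= m; stop at the first test that
    # fails to reject and report that rank
    for m, (stat, cv) in enumerate(zip(stats, critical_values)):
        if stat <= cv:
            return m
    # Every null was rejected: full rank
    return len(stats)

print(johansen_rank([56.7, 0.08], [15.5, 3.8]))  # reject k=0, keep k<=1 -> 1
print(johansen_rank([5.0, 0.1], [15.5, 3.8]))    # cannot reject k=0    -> 0
```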
The test statistic follows a nonstandard distribution and depends on the dimension of the system and the specified deterministic trend.
The Johansen Maximum Eigenvalue Statistic
The maximal eigenvalue statistic tests the null hypothesis that the number of cointegrating vectors is $r$ against the alternative that it is $r + 1$.
In the case that there are structural breaks in the cointegrating relationship, the cointegration tests in the previous section should not be used. In this section we look at three tests for cointegration with structural breaks:
The Gregory and Hansen (1996) cointegration test is a residual-based cointegration test which tests the null hypothesis of no cointegration against the alternative of cointegration in the presence of a single regime shift.
Because the structural break date is unknown, the test computes the cointegration test statistic for each possible breakpoint, and the smallest test statistic is used.
Gregory and Hansen (1996) suggest running their tests in combination with the standard cointegration tests.
The Hatemi-J cointegration test is an extension of the Gregory and Hansen cointegration test. It allows for two possible structural breaks with unknown timing.
The Maki cointegration test builds on the Gregory and Hansen and the Hatemi-J cointegration tests to allow for an unknown number of structural breaks.
GAUSS tools for performing cointegration tests and estimating VECM models are available in a number of libraries, including the Time Series MT (TSMT), TSPDLIB, and coint libraries. All of these can be directly located and installed using the GAUSS package manager.
| Cointegration test | Null Hypothesis | Decision Rule | GAUSS library |
|---|---|---|---|
| Engle-Granger (ADF) | No cointegration. | Reject the null hypothesis if the $ADF$ test statistic is less than the critical value. | TSMT, tspdlib, coint |
| Phillips | No cointegration. | Reject the null hypothesis if the $Z$ test statistic is less than the critical value. | coint, tspdlib |
| Stock and Watson common trend | $Y$ is a noncointegrated system after allowing for the pth order polynomial common trend. | Reject the null hypothesis if the $SW$ test statistic is less than the critical value. | coint |
| Phillips and Ouliaris | $Y$ and $X$ are not cointegrated. | Reject the null hypothesis if the $P_u$ or $P_z$ statistic is greater than the critical value. | coint, tspdlib |
| Johansen trace | Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is greater than $r$. | Reject the null hypothesis if $LM_{tr}(r)$ is greater than the critical value. | TSMT, coint |
| Johansen maximum eigenvalue | Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is equal to $r+1$. | Reject the null hypothesis if $LM_{max}(r)$ is greater than the critical value. | TSMT, coint |
| Gregory and Hansen | No cointegration against the alternative of cointegration with one structural break. | Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ are less than the critical value. | tspdlib |
| Hatemi-J | No cointegration against the alternative of cointegration with two structural breaks. | Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ are less than the critical value. | tspdlib |
| Shin test | Cointegration. | Reject the null hypothesis if the test statistic is greater than the critical value. | tspdlib |
In this section, we will test for cointegration between monthly gold and silver prices, using historic monthly price data starting in 1915. Specifically, we will work through several stages of analysis, from data visualization and unit root testing to cointegration testing.
As a first step, we will create a time series graph of our data. This allows us to visually examine the deterministic trends in our data.
From our graphs, we can draw some preliminary conclusions about the dynamics of gold and silver prices over our time period.
Before testing whether silver and gold prices are cointegrated, we should test whether the series have unit roots. We can do this using the unit root tests available in the TSMT and tspdlib libraries.
Gold monthly closing prices (2015-2020)

| Time Series Unit Root Test | Test Statistic | Conclusion |
|---|---|---|
| Augmented Dickey-Fuller | -1.151 | Cannot reject the null hypothesis of a unit root. |
| Phillips-Perron | -1.312 | Cannot reject the null hypothesis of a unit root. |
| KPSS | 2.102 | Reject the null hypothesis of stationarity at the 1% level. |
| Schmidt and Perron LM test | -2.399 | Cannot reject the null hypothesis of a unit root. |
| GLS-ADF | -0.980 | Cannot reject the null hypothesis of a unit root. |
Silver monthly closing prices (2015-2020)

| Time Series Unit Root Test | Test Statistic | Conclusion |
|---|---|---|
| Augmented Dickey-Fuller | -5.121 | Reject the null hypothesis of a unit root at the 1% level. |
| Phillips-Perron | -5.446 | Reject the null hypothesis of a unit root at the 1% level. |
| KPSS | 0.856 | Reject the null hypothesis of stationarity at the 1% level. |
| Schmidt and Perron LM test | -4.729 | Reject the null hypothesis of a unit root at the 1% level. |
| GLS-ADF | -4.895 | Reject the null hypothesis of a unit root at the 1% level. |
These results provide evidence that gold prices are nonstationary but suggest that silver prices are stationary. At this point, we would likely not proceed with cointegration testing, or we may wish to perform additional unit root testing. For example, we may want to perform unit root tests that allow for structural breaks.
The GAUSS code for the tests in this section is available here.
Now, let's test for cointegration without structural breaks using two different tests, the Johansen tests and the Engle-Granger test.
The Johansen Tests
We will use the vmsjmt procedure from the TSMT library. This procedure should be used with the vmc_sjamt and vmc_sjtmt procedures, which find the critical values for the maximum eigenvalue and trace statistics, respectively.
The vmsjmt procedure requires four inputs, shown in the code below. It returns both the Johansen trace and the Johansen maximum eigenvalue statistics. In addition, it returns the associated eigenvalues and eigenvectors.
new;

// Load tsmt library
library tsmt;

// Set filename (with path) for loading
fname2 = __FILE_DIR $+ "commodity_mon.dat";

// Load real prices data
y_test_real = loadd(fname2, "P_gold_real + P_silver_real");

// No deterministic component in
// the fitted regression
p = 1;

// Set number of lagged differences
// for computing estimator
k = 2;

// No determinant
no_det = 0;

{ ev, evec, trace_stat, max_ev } = vmsjmt(y_test_real, p, k, no_det);

cv_lr2 = vmc_sjamt(cols(y_test_real), p);
cv_lr1 = vmc_sjtmt(cols(y_test_real), p);
Both trace_stat and max_ev will contain statistics for all possible ranks of $\Pi$.
For example, since we are testing for cointegration of just two time series, there will be at most one cointegrating vector. This means trace_stat and max_ev will be 2 x 1 matrices, testing both null hypotheses that $r = 0$ and $r = 1$.
| Test | Test Statistic | 10% Critical Value | Conclusion |
|---|---|---|---|
| Johansen Trace Statistic | $H_0: r = 1$: 56.707; $H_0: r = 0$: 0.0767 | 10.46 | Cannot reject the null hypothesis that $r = 0$. |
| Johansen Maximum Eigenvalue | $H_0: r = 1$: 56.631; $H_0: r = 0$: 0.0766 | 9.39 | Cannot reject the null hypothesis that $r = 0$. |
These results indicate that there is no cointegration between monthly gold and silver prices. This should not be a surprise, given the results of our unit root testing.
Note that test statistics for all possible ranks are returned by vmsjmt. However, the Johansen tests are sequential tests and, since we cannot reject the null that $r=0$ for either test, we technically do not need to continue testing for any higher rank.
The Engle-Granger Test
Since there is only one possible cointegrating vector for this system, we could have also used the Engle-Granger test for cointegration. This test can be implemented using the coint_egranger
procedure from the TSPDLIB library.
The coint_egranger
procedure requires five inputs:
The coint_egranger
procedure returns the test statistic along with the 1%, 5% and 10% critical values.
Using the data already loaded in the previous example:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=t-stat significance.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// No constant or trend
model = 0;
{ tau0, cvADF0 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant
model = 1;
{ tau1, cvADF1 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant and trend
model = 2;
{ tau2, cvADF2 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
| Test | Test Statistic | 10% Critical Value | Conclusion |
|---|---|---|---|
| Engle-Granger, no constant | -3.094 | -2.450 | Reject the null of no cointegration at the 10% level. |
| Engle-Granger, constant | -1.609 | -3.066 | Cannot reject the null hypothesis of no cointegration. |
| Engle-Granger, constant and trend | -2.327 | -3.518 | Cannot reject the null hypothesis of no cointegration. |
These results provide evidence for our conclusion that there is no cointegration between gold and silver prices. Note, however, that these results are not conclusive and depend on whether we include a constant. This sheds light on the importance of including the correct deterministic components in our model.
To be thorough we should also test for cointegration using tests that allow for a structural break. As an example, let's use the Gregory-Hansen test to compare the null hypothesis of no cointegration against the alternative that there is cointegration with one structural break.
This test can be implemented using the coint_ghansen
procedure from the TSPDLIB library.
The coint_ghansen
procedure requires eight inputs:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=t-stat significance.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// Trimming rate
trimm = 0.15;
// Long-run consistent variance estimation method
varm = 3;
// Bandwidth for kernel estimator
T = rows(y_test_real);
bwl = round(4 * (T/100)^(2/9));
// Level shift
model = 1;
{ ADF_min1, TBadf1, Zt_min1, TBzt1, Za_min1, TBza1, cvADFZt1, cvZa1 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Level shift with trend
model = 2;
{ ADF_min2, TBadf2, Zt_min2, TBzt2, Za_min2, TBza2, cvADFZt2, cvZa2 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift
model = 3;
{ ADF_min3, TBadf3, Zt_min3, TBzt3, Za_min3, TBza3, cvADFZt3, cvZa3 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift with trend
model = 4;
{ ADF_min4, TBadf4, Zt_min4, TBzt4, Za_min4, TBza4, cvADFZt4, cvZa4 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
The coint_ghansen
procedure returns eight outputs:
| Test | $ADF$ Test Statistic | $Z_t$ Test Statistic | $Z_{\alpha}$ Test Statistic | 10% Critical Value ($ADF$, $Z_t$) | 10% Critical Value ($Z_{\alpha}$) | Conclusion |
|---|---|---|---|---|---|---|
| Gregory-Hansen, Level shift | -3.887 | -4.331 | -39.902 | -4.34 | -36.19 | Cannot reject the null of no cointegration for $ADF$ and $Z_t$. Reject the null of no cointegration at the 10% level for $Z_{\alpha}$. |
| Gregory-Hansen, Level shift with trend | -3.915 | -5.010 | -50.398 | -4.72 | -43.22 | Reject the null of no cointegration at the 10% level for $Z_{\alpha}$ and $Z_t$. Cannot reject the null for the $ADF$ test. |
| Gregory-Hansen, Regime change | -5.452 | -6.4276 | -80.379 | -4.68 | -41.85 | Reject the null of no cointegration at the 10% level. |
| Gregory-Hansen, Regime change with trend | -6.145 | -7.578 | -106.549 | -5.24 | -53.31 | Reject the null of no cointegration at the 10% level. |
Note that our results for cointegration with one structural break are mixed and depend on which type of structural break we include in our model. This provides some indication that further exploration of structural breaks is needed.
Some examples of additional steps that we could take:
Congratulations! You now have an established guide for cointegration and the background you need to perform cointegration testing.
In particular, today's blog covered:
You're probably familiar with the basic find-and-replace. However, large projects, with many files across several directories, require a more powerful search tool.
The GAUSS Source Browser is the powerful search-and-replace tool you need. We'll discuss some ways you can use it to make your life easier, including:
If an output seems incorrect or is causing an error, you need a quick way to trace the progress of specific variables. GAUSS's Find Usages functionality does exactly that.
As shown in the image below, right-click on the variable you want to search for and select Find Usages.
This will search our current file for every instance of our search variable and display them in the GAUSS Source Browser.
As we can see in the above image, Find Usages has shown us:
We can double-click on a line in the Source Browser to open it in the GAUSS editor.
The Find Usages basics alone are great, but the GAUSS Source Browser has much more to offer.
In addition to right-clicking and selecting Find Usages, the Source Browser can be opened two ways:

- With the Ctrl+Shift+F (Command+Shift+F on Mac) keyboard shortcut.

The Search Tab is the starting point for any search. It has three input boxes:
The default file pattern is *.*, where * is a wildcard character that will match any character or characters. Multiple patterns may be listed, separated by commas, for example *.gss, estimate.src. The pattern *.gss,*.src,*.sdf,*.gau,*.prg matches GAUSS source files.
Using these tools together can help us quickly locate the information we're looking for. For example, suppose we want to find uses of the GAUSS generalized method of moments procedure, gmmFit, in internal GAUSS source code.
We can achieve this by:
- Entering gmmFit in the Search for box.
- Setting /Users/Research/gauss/src as our Directory.

We could also set the File pattern option to *.src to limit our search to .src files. However, we left the default option to search all files in this case.
Our search results show that the term gmmFit
is used between 1 and 37 times in each of five files in the specified directory.
The Advanced Tab provides many useful options for further refining our search. We'll discuss some of the most helpful and less common options.
The Assignments checkbox allows us to restrict our search to only the instances when a variable is assigned to. Narrowing the list of uses to assignments only is invaluable when tracking down the cause of an incorrect value. The screenshot above shows our original Find Usages search after adding the Assignments filter.
References is the opposite of Assignments, returning instances where a variable is used, but not assigned to.
By default, the Source Browser will search the directory that you specify and all subdirectories. Uncheck the Search Subdirs checkbox to restrict the search to the main search directory only.
After we have completed our desired search, we can perform our replace by:
The checkboxes at the start of each line of the search results allow you to quickly control which results are replaced.
Backup Original will create a copy of your files with a .bak
file extension before making the replace. Since there is no undo option for replacements made to files not open in the editor, the .bak
copies allow you to recover the original file if you make a mistake.
Overwrite Backup will copy over any pre-existing .bak files made by previous searches with Backup Original selected. If this option is unchecked, GAUSS will use .bak2, .bak3, etc. if pre-existing .bak files are found.
After reading this post, you now know how to use the advanced search and replace tools in GAUSS to be more productive!
We have learned:
The latest Time Series MT (TSMT) release, version 3.1.0, is now available. If you own TSMT 3.0 the update is free.
This minor update adds a new option to use the cross-product covariance computation for some estimation routines and includes some bug fixes.
- New cross-product covariance option for varmaFit, ecmFit, svarmaxmt, and garchFit.
- Bug fixes for cusum and sbreak graphs.
graphs.The GAUSS TSMT application module provides a comprehensive suite of tools for MLE and statespace estimation, model diagnostics and forecasting of univariate, multivariate and nonlinear time series models.
Panel data, sometimes referred to as longitudinal data, is data that contains observations about different cross sections across time. Examples of groups that may make up panel data series include countries, firms, individuals, or demographic groups.
Like time series data, panel data contains observations collected at a regular frequency, chronologically. Like cross-sectional data, panel data contains observations across a collection of individuals.
There are a number of advantages of panel data:
Panel data examples can be found in economics, social sciences, medicine and epidemiology, finance, and the physical sciences.
| Field | Example topics | Example dataset |
|---|---|---|
| Microeconomics | GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances. | Panel Study of Income Dynamics (PSID) |
| Macroeconomics | International trade tables, world socioeconomic tables, currency exchange rate tables. | Penn World Tables |
| Epidemiology and Health Statistics | Public health insurance data, disease survival rate data, child development and well-being data. | Medical Expenditure Panel Survey |
| Finance | Stock prices by firm, market volatilities by country or firm. | Global Market Indices |
Panel data is a collection of quantities obtained across multiple individuals, assembled at regular intervals in time and ordered chronologically. Examples of individual groups include individual people, countries, and companies.
In order to denote both individuals and time observations, panel data notation typically denotes groups with the subscript $i$ and time with the subscript $t$. For example, a panel data observation $Y_{it}$ is observed for all individuals $i = 1, \ldots, N$ across all time periods $t = 1, \ldots, T$.
More specifically:
| Group | Time Period | Notation |
|---|---|---|
| 1 | 1 | $Y_{11}$ |
| 1 | 2 | $Y_{12}$ |
| 1 | T | $Y_{1T}$ |
| ⁞ | ⁞ | ⁞ |
| N | 1 | $Y_{N1}$ |
| N | 2 | $Y_{N2}$ |
| N | T | $Y_{NT}$ |
Panel datasets may come in different formats. The format in the table above is sometimes called long format data. Long format datasets stack the observations of each variable from all groups, at all time periods, into one column.
When panel data is stored with the observations for a single variable from separate groups stored in separate columns, this is sometimes referred to as wide data format.
| Time | $Y_1$ | $Y_2$ | ⋯ | $Y_N$ |
|---|---|---|---|---|
| 1 | $Y_{11}$ | $Y_{21}$ | ⋯ | $Y_{N1}$ |
| 2 | $Y_{12}$ | $Y_{22}$ | ⋯ | $Y_{N2}$ |
| 3 | $Y_{13}$ | $Y_{23}$ | ⋯ | $Y_{N3}$ |
| 4 | $Y_{14}$ | $Y_{24}$ | ⋯ | $Y_{N4}$ |
| ⁞ | ⁞ | ⁞ | ⁞ | ⁞ |
| T-1 | $Y_{1,T-1}$ | $Y_{2,T-1}$ | ⋯ | $Y_{N,T-1}$ |
| T | $Y_{1T}$ | $Y_{2T}$ | ⋯ | $Y_{NT}$ |
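As a sketch of how the two layouts relate, the GAUSS snippet below stacks a hypothetical wide-format matrix into long format. The names `Y_wide`, `gid`, and `tid` are illustration names, not from the original post:

```gauss
// Hypothetical wide-format data: T periods by N groups
T = 4;
N = 3;
Y_wide = rndn(T, N);

// vec() stacks the columns, so group 1's observations
// come first, then group 2's, and so on
y_long = vec(Y_wide);

// Matching group and time identifiers
gid = vec(ones(T, 1) .* seqa(1, 1, N)');
tid = vec(seqa(1, 1, T) .* ones(1, N));

// Long-format dataset: group id, time id, value
data_long = gid ~ tid ~ y_long;
```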
Panel data can also be characterized as unbalanced panel data or balanced panel data:
Certain panel data models are only valid for balanced datasets. If the panel datasets are unbalanced they may need to be condensed to include only the consecutive periods for which there are observations for all individuals in the cross section.
Panel data series modeling centers around addressing the likely dependence across data observations within the same group. In fact, the primary difference between panel data models and time series models is that panel data models allow for heterogeneity across groups and introduce individual-specific effects.
As an example, consider a panel data series which includes gross domestic product (GDP) data for a panel of 5 different countries, the United States, France, Canada, Greece, and Australia:
Panel data models include techniques that can address these heterogeneities across individuals. Furthermore, pure cross-sectional methods and pure time series models may not be valid in the presence of this heterogeneity.
Researchers commonly analyze datasets with multiple observations of a set of cross-sectional units (e.g., people, firms, countries) over time. For example, one may have data covering the production of multiple firms or the gross product of multiple countries across a number of years.
Modeling these panel data series is a unique branch of time series modeling made up of methodologies specific to their structure.
This section looks more closely at panel data analysis and the associated panel data models.
Panel data methods can be split into two broad categories:
Within these groups, the assumptions made about the variation of the model across individuals are the primary drivers for which model to use.
Let’s consider a simple linear model
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
The representation above is a homogeneous model:
Alternatively, we could believe that groups share common coefficients on regressors but there are group-specific intercepts, as is captured in the fixed effects or least squares dummy variable (LSDV) model
$$y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}$$
The representation above is a heterogeneous model, because the constants, $ \alpha_i $, are group-specific.
This section considers four popular panel data models:
We will examine these models using an assumed data generation process given by
$$ y_{it} = \beta x_{it} + \delta z_i + \epsilon_{it}$$
In this model, $X$ represents the observed characteristics such as age, firm size, expenditures, and $Z$ represents unobserved characteristics, such as management quality, growth opportunities, or skill.
| Component | Description | Example |
|---|---|---|
| $x_{it}$ | Observable characteristics. These may be constant for an individual across all time, such as race, or may vary across time observations for an individual, such as age. | Age, race, company size, expenditure, population, GDP |
| $z_i$ | Unobservable characteristics, responsible for model heterogeneity. | Skill, company potential, lack of basic infrastructure in the community, political unrest. |
| $\epsilon_{it}$ | Stochastic error term. | N/A |
What Is Pooled Ordinary Least Squares?
In some cases, there are no unobservable individual-specific effects, and $\delta z_i$ is constant across individuals. This is a strong assumption and implies that all the observations within groups are independent of one another.
In these cases, the model becomes
$$ y_{it} = \beta x_{it} + \alpha + \epsilon_{it}$$
This implies that when there is no dependence within individual groups, the panel data can be treated as one large, pooled dataset. The model parameters, $\beta$, and, $\alpha$, can be directly estimated using pooled ordinary least squares.
Independence of observations within the groups of a panel is unlikely, so pooled OLS is rarely acceptable for panel data models.
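The pooled estimator described above can be sketched in a few lines of GAUSS. The data here is hypothetical, and `olsqr` is the standard GAUSS least squares solver:

```gauss
// Hypothetical stacked data: y is (N*T) x 1, X is (N*T) x k
y = rndn(60, 1);
X = rndn(60, 2);

// Append a column of ones for the common intercept, alpha,
// then estimate all parameters in one OLS step
b_pooled = olsqr(y, ones(rows(X), 1) ~ X);

// First element is alpha-hat; the rest are beta-hat
print b_pooled;
```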
What Is the One-Way Fixed Effects Model?
The one-way fixed effects panel data model:
As an example, let's consider the one-way fixed effects model with individual-specific effects, where the unobservable component, $\delta z_i$, acts like an individual-specific intercept:
$$y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}$$
The intercept term, $\alpha_i$, varies across individuals but is constant across time. This term is composed of the constant intercept term, $\mu$, and the individual-specific error terms, $\gamma_i$.
The key feature of the fixed effects model is that $\gamma_i$ has a true, but unobservable, effect which must be estimated. More importantly, if we estimate $\beta$ using pooled OLS and fail to appropriately account for $\gamma_i$, the estimates will be inconsistent and biased.
The fixed effects model requires the estimation of the model parameter $\beta$ and individual $\alpha_i$ for each of the N groups in the panel. This is generally achieved using one of three estimation techniques:
The first two of these techniques focus on eliminating the individual effects before estimation. The LSDV method directly incorporates these effects using dummy variables.
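The within (demeaning) approach can be sketched as follows. The data is hypothetical; the loop subtracts each group's mean from y and X, sweeping out the fixed effects before a single OLS step:

```gauss
// Within-transformation sketch with hypothetical data:
// 3 groups, 10 observations each
id = vec(ones(10, 1) .* seqa(1, 1, 3)');
X = rndn(30, 2);
b_true = { 0.5, -1 };
y = X * b_true + rndn(30, 1);

grp = unique(id);
y_dm = y;
X_dm = X;

for i(1, rows(grp), 1);
    // Indices of the observations in group i
    idx = indexcat(id .== grp[i], 1);

    // Subtract the group-specific means, sweeping out
    // the fixed effects (and any constant)
    y_dm[idx] = y[idx] - meanc(y[idx]);
    X_dm[idx, .] = X[idx, .] - meanc(X[idx, .])';
endfor;

// OLS on the demeaned data gives the within estimator of beta
b_within = olsqr(y_dm, X_dm);
print b_within;
```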
What Is the One-Way Random Effects Model?
The one-way random effects panel data model:
The distinguishing feature of the random effects model is that $\delta z_i$ does not have a true value but rather follows a random distribution with parameters that we must estimate.
The random effects term, $\delta z_i$:
The random effects model should be estimated using feasible generalized least squares (FGLS). Using FGLS, the appropriate error structure, one which accounts for the individual-specific error terms, can be incorporated into the model.
What Is the Random Coefficients Model?
The panel data regressions we’ve looked at so far have all assumed that the coefficients on regressors are the same across all individuals. The random coefficients model relaxes this assumption and introduces individual-specific effects through the coefficients, such that
$$y_{it} = \beta_i x_{it} + \alpha_i + \epsilon_{it}$$ $$y_{it} = (b_i + \beta)x_{it} + (a_i + \alpha) + \epsilon_{it}$$ $$b_i \sim N(0, \tau_{i1}^2)$$ $$a_i \sim N(0, \tau_{i2}^2)$$
This model introduces individual slope effects and allows for heteroscedasticity through the individual-specific $\tau_{i1}^2$ and $\tau_{i2}^2$.
This model can be estimated using FGLS or maximum likelihood estimation (MLE).
The two-way individual effects model allows the presence of both time-specific effects and individual-specific effects.
Starting from a simple linear model given by,
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
the twoway individual effects model can be represented by
$$y_{it} = \alpha + \beta x_{it} + \mu_i + \lambda_t + \epsilon_{it}$$
In this model, $\mu_i$ captures any unobservable individual-specific effects and $\lambda_t$ captures any unobservable time-specific effects. Note that the individual-specific effects, $\mu_i$, do not vary with time, while the time-specific effects, $\lambda_t$, do not vary across individuals.
In the special case that there are only two time periods and two groups, this model is equivalent to the difference-in-differences model. However, if there are more than two time periods and/or groups, alternative panel data models must be considered.
What Is the Two-Way Fixed Effects Model?
The two-way fixed effects model:
For data generated by this model:
Like the one-way fixed effects model, this model could be estimated by including dummy variables. However, in the two-way fixed effects model dummy variables must be included for both the time periods and the groups.
Under most circumstances, the number of dummy variables included in the two-way fixed effects model makes standard ordinary least squares estimation too computationally difficult. Instead, the two-way fixed effects model is estimated using a within-group estimator which removes the variation both within groups and within the time periods.
What Is the Two-Way Random Effects Model?
The two-way random effects model:
For data generated by this process:
Like the one-way random effects model, the two-way random effects model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).
Dynamic Panel Data Model
A key component of pure time series models is the modeling of dynamics using lagged dependent variables. These lagged variables capture the autocorrelation between observations of the same dataset at different points in time.
Because panel datasets include a time series component, it is also important to address the possibility of autocorrelation in panel data. The dynamic panel data model adds dynamics to the panel data individual effects framework.
Consider an individual effects model which includes an AR(1) term
$$y_{it} = \delta y_{i,t-1} + \beta x_{it} + \epsilon_{it}$$
the error component includes oneway individual effects such that
$$\epsilon_{it} = \mu_i + \nu_{it}$$
where $\mu_i$ captures individual effects.
Introducing lagged dependent variables in the individual effects framework:
Dynamic panel data models are most commonly estimated using a generalized method of moments (GMM) framework proposed by Arellano and Bond (1991).
In panel data that covers small time frames, there is little need to worry about stationarity. However, when panel data covers longer time frames, as is the case in many macroeconomic panel data series, the panel data must be tested for stationarity.
Weak stationarity, required for many panel data modeling techniques, requires only that:
Nonstationary panel data series are any panel series that do not meet the conditions of a weakly stationary time series.
In part because of these considerations, a large field of research and literature surrounding panel data unit root tests has developed.
Testing for unit roots in panel data requires more than just testing the individual cross sections for the presence of unit roots. Panel data unit root tests must:
After today's blog, you should have an understanding of the fundamentals of panel data. We covered the basics of panel data including:
To learn more about performing panel data analysis using GAUSS contact us for a GAUSS demo copy.
The aggregate
function, first available in GAUSS version 20, computes statistics within data groups. This is particularly useful for panel data.
In today's blog, we take a closer look at aggregate. We will:

- Introduce the aggregate function.
- Demonstrate the aggregate function using current account data from the International Monetary Fund.
function computes statistics within a group based upon a specified group identifier. The function supports a variety of GAUSS statistics including:
For example, consider a panel dataset which includes observed weights for three individuals across a 6month time span:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
We can use the aggregate
function to find the 6month mean weights for Sarah, Tom and Nikki:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight | Mean Weight |
|---|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 | 139.7 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 | 186.3 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 | 144.7 |
Alternatively, we could find the monthly standard deviation of the weights across Sarah, Tom and Nikki:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
| Monthly Std. Dev. | 33.2 | 31.0 | 23.4 | 22.4 | 22.8 | 21.4 |
The aggregate
function takes two required inputs:
x_a = aggregate(x, method);
"mean"
, "median"
, "mode"
, "max"
, "min"
, "sd"
, "sum"
, "variance"
.The aggregate
function requires the data matrix input to:
Let's consider our example dataset from above. In order to use this data as an input to the GAUSS aggregate
function we need to:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
$$\text{Sarah} \rightarrow 1$$ $$\text{Tom} \rightarrow 2$$ $$\text{Nikki} \rightarrow 3$$
$$\Downarrow$$
| Group | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| 1 | 135 | 134 | 138 | 142 | 144 | 145 |
| 2 | 196 | 192 | 182 | 183 | 184 | 181 |
| 3 | 143 | 144 | 146 | 147 | 145 | 143 |
$$\Downarrow$$
| Group | Month | Weight |
|---|---|---|
| 1 | 1 | 135 |
| 1 | 2 | 134 |
| 1 | 3 | 138 |
| 1 | 4 | 142 |
| 1 | 5 | 144 |
| 1 | 6 | 145 |
| 2 | 1 | 196 |
| 2 | 2 | 192 |
| ⁞ | ⁞ | ⁞ |
| 3 | 6 | 143 |
The method input into the aggregate
function should always be a string indicating which statistic you wish to compute.
Each method works on groups within the panel the same way the analogous pooled data function would work, including its handling of missing values.
| Method | Pooled Function |
|---|---|
| mean | meanc |
| median | median |
| mode | modec |
| max | maxc |
| min | minc |
| sum | sumc |
| sd | stdc |
| variance | varCovXS |
Let's use aggregate
to find the means by group for weight data:
weights = { 1 1 135,
1 2 134,
1 3 138,
1 4 142,
1 5 144,
1 6 145,
2 1 196,
2 2 192,
2 3 182,
2 4 183,
2 5 184,
2 6 181,
3 1 143,
3 2 144,
3 3 146,
3 4 147,
3 5 145,
3 6 143 };
/*
** Find the mean by person.
** We will use the first column
** as the group indicator and will find
** the mean of the weights.
*/
print aggregate(weights[., 1 3], "mean");
This prints the group means to the output window:
1 139.67
2 186.33
3 144.67
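As a quick sanity check, each group's aggregate result should match the analogous pooled function applied to that group's rows alone. For example, for group 1 (Sarah), using the weights matrix defined above:

```gauss
// Verify aggregate's "mean" for group 1 against the
// pooled function meanc applied to group 1's rows only
g1 = selif(weights, weights[., 1] .== 1);
print meanc(g1[., 3]);
```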
We can also use the month identifiers to find the sample standard deviation by month:
/*
** Find the standard deviation by month.
** We will use the second column of weights
** as the group indicator and will find
** the standard deviation of the weights.
*/
print aggregate(weights[., 2 3], "sd");
Now the standard deviations, along with their associated months will be printed to the output window:
1 33.151
2 31.005
3 23.438
4 22.368
5 22.811
6 21.385
Our simple example dataset is useful for demonstrating the basics of the aggregate
function. However, a realworld panel dataset better demonstrates its true power. In this section, we will use aggregate
to examine some of the trends in international current account balances.
We will use current account balance measured as a percentage of GDP. This unbalanced panel data is a modified version of a dataset from the International Monetary Fund and spans 1953Q3 to 2019Q4 and includes a total of 46 countries, across 5 different regions.
It contains the following variables:
| Variable name | Description |
|---|---|
| Country | String, name of the country. |
| Country ID | Integer country identifier. |
| World Region | String, name of the corresponding world region. |
| Region ID | Integer world region identifier. |
| Time | String, the date of the observation. |
| CAB | Decimal numeric, the Current Account Balance. |
We first examine variations in the mean and median current account balances across countries. Using aggregate
we find the mean and median current account balances for each country in the panel across all observations.
/*
** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");
// Compute mean and median current
// account balances by Country ID
mean_cab_cid = aggregate(X, "mean");
median_cab_cid = aggregate(X, "median");
After the above code mean_cab_cid
and median_cab_cid
will both be $46\times2$ matrices. Each element in the first column will be a unique country ID. The corresponding element in the second column will be the average (mean or median) current account balance for that country.
We include a graph of this data below, where we see that Germany leads the pack with the highest average current account balance while Finland has the lowest.
We can similarly consider the mean and median current account balances across geographical regions with the code below.
/*
** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");
// Compute mean and median current
// account balances by world region
mean_cab_wreg = aggregate(X, "mean");
median_cab_wreg = aggregate(X, "median");
mean_cab_wreg
and median_cab_wreg
will be two-column matrices with unique world region IDs in the first column and the corresponding statistics in the second column.
Finally, we consider how the mean and median current account balances vary across time in the time series plot below.
/*
** Load the 'Time' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string
** variable that we want GAUSS to convert to a date.
*/
X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");
mean_cab_date = aggregate(X, "mean");
median_cab_date = aggregate(X, "median");
This time the first column of our resulting matrices, mean_cab_date
and median_cab_date
, will contain each unique date from our dataset. The second column will contain the statistic computed for each unique date.
Below is a graph of the Current Account Balance data grouped by quarter.
In today's blog, we examined the fundamentals of the aggregate procedure. After reading you should have a better understanding of:

- The inputs of the aggregate function.
- How to use the aggregate function.
The GAUSS Package Manager, first introduced in version 20, allows you to download, install and uninstall GAUSS packages without leaving GAUSS.
It supports the paid GAUSS Application Modules, free GAUSS packages and even allows you to create custom packages and channels.
This post will guide you through the basics needed to install and uninstall GAUSS packages.
The Package Manager must be downloaded and installed before its first use. To begin this process, select Tools > Package Manager from the main GAUSS menu. This will open a window which looks like this:
By default, the Package Manager will install inside a folder named conda
inside your GAUSSHOME folder. In most cases, you should leave the default installation directory.
However, the initial release of the Package Manager does not support installation directories with spaces in the path names. If your GAUSSHOME directory contains a space, you will need to point this to another folder in which your user has write permissions.
The next section is for your GAUSS Download Account username and password. This allows you to download any GAUSS Application Modules which you have purchased. If you do not know your username and password, contact Aptech.
A username and password are not required to download free packages.
Click the install button on the bottom right of the Install Package Manager window. This will begin the download and installation of the Package Manager. It will take a minute or two. When the installation has completed you will see the following popup window.
If you have not already installed the Package Manager, see the previous steps.
Click the plus (+) button on the top right of the Installed Packages window to open the Repository Packages window.

If a package in the Repository Packages window has a red background color, it means that the username and password you have entered do not allow access to the package.
Click the plus (+) button on the right of the Installed Packages window.

In this post, you've learned:
New function to aggregate results across groups by:
// Load data with the grouping data in the first column
X = loadd("grunfeld.dat", "Years + Investment");
// Group investment by average for year
mean_by_year = aggregate(X, "mean");
// Load data with the grouping data in the first column
X = loadd("grunfeld.dat", "firm + Investment");
// Group investment by median for each firm
median_by_firm = aggregate(X, "median");
New suite of tools makes it easy to add optional arguments to your GAUSS procedures.
This example shows a simple estimation procedure that chooses either OLS estimation or ridge regression based on whether an optional input, lambda, is passed in.
// ... is a placeholder for the optional arguments
proc (1) = estimate(y, X, ...);
local lambda, b_hat;
// Get the optional 'dynamic argument'
lambda = getDynargs(1);
// If the 'lambda' was not passed in,
// it will be an empty matrix.
if isempty(lambda);
// No 'lambda' so perform standard OLS
b_hat = olsRegress(y, X);
else;
// 'lambda' passed in so perform ridge regression
b_hat = ridgeRegress(y, X, lambda);
endif;
retp(b_hat);
endp;
The procedure above can be called like this:
// OLS regression when optional input is not passed in
b_hat = estimate(y, X);
or like this:
// Ridge regression is performed when optional input is passed in
b_hat = estimate(y, X, lambda);
- plotXYFill.
- plotBarH.
- plotSetLegend now supports setting the legend location by coordinates.
- plotSetYTicInterval.
var_names = "alpha" $| "beta" $| "gamma";
b = { 0.34, 1.9334, 0.8983 };
se = { 0.00002234, 0.013235, 0.03752 };

print sprintf(fmt, var_names, b, se);

will print:

alpha 0.340 (0.00)
beta 1.933 (0.01)
gamma 0.898 (0.04)