Optional input arguments can make your statistical computing more efficient and enjoyable. GAUSS version 20 added a new suite of tools to make it easy for you to add optional input arguments to your GAUSS procedures.
In this blog post, we will show you how to easily:

- Add optional arguments to your procedure declarations with the triple dots, `...`.
- Count optional arguments with `dynargsCount`.
- Retrieve optional arguments with `dynargsGet`.
We’ll start by creating a simple toy GAUSS procedure that does not use optional arguments.
// Run procedure
myProc(1.1, 2.2, 3.3);
// Define procedure
proc (0) = myProc(a, b, c);
print a;
print b;
print c;
endp;
If we run the code above, we will get the following output:
1.1
2.2
3.3
To modify this procedure to accept optional arguments, we need to make the following changes:
The procedure declaration is the first line of the procedure. The procedure declaration from our initial procedure was:
// Procedure declaration
proc (0) = myProc(a, b, c);
Now we will modify our original procedure to make `a` the only required input. The new procedure declaration will look like this:
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
In the procedure declaration above, `a` is a required input as it was before. However, we have replaced `b` and `c` with `...`. The triple dots, `...`, represent zero or more optional arguments and must be the last input in the procedure declaration.
The GAUSS function `dynargsCount` returns the number of optional arguments passed into the procedure. `dynargsCount` may only be called inside of a GAUSS procedure and does not take any inputs. For example:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
local n_dynargs;
// Notice that you do not pass ‘...’
// into ‘dynargsCount’
n_dynargs = dynargsCount();
print "Number of optional inputs = " n_dynargs;
print "";
endp;
Running this code will print the following output:
call 1
Number of optional inputs = 0.0

call 2
Number of optional inputs = 1.0

call 3
Number of optional inputs = 2.0
This allows you to add logic which depends on the number of optional arguments that were passed into your procedure.
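For instance, here is a minimal sketch of branching on the number of optional inputs. The procedure name and its scaling logic are hypothetical, invented for illustration:

```gauss
// Hypothetical example: scale 'x' by an optional factor
proc (1) = myScale(x, ...);
    local s;

    // Branch on the number of optional arguments
    if dynargsCount() == 0;
        // No optional input passed, use a default scale of 1
        s = 1;
    else;
        // Retrieve the first optional argument
        s = dynargsGet(1);
    endif;

    retp(x .* s);
endp;
```

Calling `myScale(5)` would use the default scale, while `myScale(5, 2)` would use the supplied factor.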
You can access optional arguments in GAUSS procedures by:

- Retrieving them with the `dynargsGet` function.
- Passing the triple dots, `...`, to another function.

The `dynargsGet` procedure returns specified optional arguments. It requires one input that specifies the index, or index range, of the optional arguments to return. For example:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
// Since ‘b’ and ‘c’ are not declared as inputs
// in the procedure declaration, they must be
// declared as local variables
local b, c, idx;
// The range of optional inputs we want to return
idx = { 1, 2 };
{ b, c } = dynargsGet(idx);
print "a = " a;
print "b = " b;
print "c = " c;
print "";
endp;
This time, the code will print out:
call 1
a = 9.00
b = {}
c = {}

call 2
a = 9.00
b = 3.14
c = {}

call 3
a = 9.00
b = 3.14
c = 2.54
As we see above, `a` is passed through as we would expect each time. In "call 1", neither `b` nor `c` was passed into `myProc`. We can see that GAUSS printed their contents as empty curly braces, `{}`. This indicates that they were empty matrices.
If an optional input argument requested by `dynargsGet` is not supplied, an empty matrix will be returned in its place by default. We can check for empty matrices with the GAUSS function `isempty`. For example:
// Create an empty matrix
z = {};
// Check to see if `z` is an empty matrix
if isempty(z);
print "z is an empty matrix";
else;
print "z is not an empty matrix";
endif;
As you may have guessed, the above code will print:
z is an empty matrix
While we can use `isempty` and `if` statements to handle the optional inputs, it is often simpler to set default values.
Fortunately, `dynargsGet` allows us to specify default values for the optional arguments which are not supplied. To specify default values with `dynargsGet`, simply add them as a comma-separated list of inputs after the required index input.
When specifying default values, remember that you must supply one default value for each optional argument that you request.
We can set default values for our toy example, like this:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
// Since ‘b’ and ‘c’ are not declared as inputs
// in the procedure declaration, they must be
// declared as local variables
local b, c, idx;
// The range of optional inputs we want to return
idx = { 1, 2 };
// If ‘b’ or ‘c’ are not passed in, return
// 29 or 33 respectively
{ b, c } = dynargsGet(idx, 29, 33);
print "a = " a;
print "b = " b;
print "c = " c;
print "";
endp;
This time our code will return:
call 1
a = 9.00
b = 29.00
c = 33.00

call 2
a = 9.00
b = 3.14
c = 33.00

call 3
a = 9.00
b = 3.14
c = 2.54
While any input arguments accessed by `dynargsGet` can be passed to other procedures, it is important to know that the triple dots, `...`, can also be passed directly into other procedures. Passing the triple dots, `...`, to another procedure is just as if you passed a comma-separated list of the arguments in `...`. Note that the triple dots do not have to be the final input when passed to other procedures. For example:
X_1 = initMat1(3.14, 5, 2);
X_2 = initMat2(3.14, 5, 2);
// Procedure declaration with optional inputs
proc (1) = initMat1(val, ...);
local X;
X = val + zeros(...);
retp(X);
endp;
// Procedure declaration with only required inputs
proc (1) = initMat2(val, r, c);
local X;
X = val + zeros(r, c);
retp(X);
endp;
The calls to `initMat1` and `initMat2` will behave the same.
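To see the triple dots forwarded in a non-final position, consider this small hypothetical sketch (the procedure name and logic are invented for illustration):

```gauss
// Hypothetical example: '...' holds the row count and is
// followed by a required column count of 2
proc (1) = colMat(val, ...);
    // '...' supplies the first input to 'zeros';
    // the literal 2 follows it
    retp(val + zeros(..., 2));
endp;

// '...' forwards '3' as the row count, creating a 3x2 matrix of 7s
y = colMat(7, 3);
```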
Now that we’ve seen the basics of using optional inputs to GAUSS procedures, let’s work through a more practical example. We’ll create two procedures that will work together to simulate a linear model.
Our first procedure will create a random normal matrix, with the option to specify the mean and standard deviation of the variables.
/*
** Inputs: r  - Required input. Number of rows.
**         c  - Required input. Number of columns.
**         mu - Optional input. Mean of the simulated columns.
**              Default = 0.
**         sd - Optional input. Standard deviation of the
**              simulated columns. Default = 1.
*/
proc (1) = rndNormal(r, c, ...);
local X, mean, sd;
// If a 3rd input is passed in, assign it
// to ‘mean’. Otherwise set ‘mean’ equal to 0.
// If a 4th input is passed in, assign it
// to `sd`. Otherwise set ‘sd’ equal to 1.
{ mean, sd } = dynargsGet(1|2, 0, 1);
// Compute a random matrix with the
// specified mean and sd.
X = (rndn(r, c) .* sd) + mean;
retp(X);
endp;
Here are a few examples of how we could call this procedure:
// Create a 100x4 random normal matrix
// with mean=0 and sd=1
X_1 = rndNormal(100, 4);
// Create a 130x2 random normal matrix
// with mean=3.7 and sd=1
X_2 = rndNormal(130, 2, 3.7);
// Create a 74x5 random normal matrix
// with mean=2 and sd=3
X_3 = rndNormal(74, 5, 2, 3);
Next, we’ll create a procedure that will use our `rndNormal` procedure to simulate linear models.
/*
** Inputs: b_true     - Required input. True parameter values of
**                      the simulated linear model.
**         nobs       - Required input. The number of observations
**                      to simulate.
**         alpha_true - Optional input. True intercept of the
**                      simulated linear model. Default = 0.
**         err_sd     - Optional input. Standard deviation of the
**                      error term. Default = 1.
*/
proc (1) = simulateLM(b_true, nobs, ...);
local X, err, nvars, alpha_plus_err, y;
// Find number of desired variables
nvars = rows(b_true);
// Simulate variables with
// mean = 0, sd = 1
X = rndNormal(nobs, nvars);
// Simulate error term and add intercept
// if it was passed in.
alpha_plus_err = rndNormal(nobs, 1, ...);
y = X * b_true + alpha_plus_err;
retp(y);
endp;
Here are a few examples of how our `simulateLM` procedure can be used:
b = { 0.8, 1.1, 0.2 };
n = 10;
/*
** Simulate linear model:
**   - Without intercept
**   - With error term sd equal to 1
*/
y_1 = simulateLM(b, n);
/*
** Simulate linear model:
**   - With intercept
**   - With error term sd equal to 1
*/
alpha = 2.3;
y_2 = simulateLM(b, n, alpha);
/*
** Simulate linear model:
**   - With intercept
**   - With error term sd equal to 3
*/
alpha = 2.3;
err_sd = 3;
y_3 = simulateLM(b, n, alpha, err_sd);
Congratulations! You should now be able to start creating GAUSS procedures with optional input arguments. You have learned how to:

- Add the triple dots, `...`, to the procedure declaration.
- Count optional arguments with `dynargsCount`.
- Retrieve optional arguments with `dynargsGet`.

Cointegration is an important tool for modeling the long-run relationships in time series data. If you work with time series data, you will likely find yourself needing to use cointegration at some point.
This blog provides an in-depth introduction to cointegration and will cover all the nuts and bolts you need to get started. In particular, we will look at the theory of cointegration, error correction models, and how to test for cointegration with and without structural breaks.
Though not necessary, you may find it helpful to review the blogs on time series modeling and unit root testing before continuing with this blog.
Economic theory suggests that many time series datasets will move together, fluctuating around a long-run equilibrium. In econometrics and statistics, this long-run equilibrium is tested and measured using the concept of cointegration.
Cointegration occurs when two or more non-stationary time series share a common stochastic trend, so that a linear combination of the series is stationary. Supporting theory for cointegrating relationships can be found across many fields:
| Field | Supporting Theory | Time Series |
|---|---|---|
| Economics | The permanent income hypothesis describes how agents spread their consumption out over their lifetime based on their expected income. | Consumption and income. |
| Economics | Purchasing power parity is a theory that relates the prices of a basket of goods across different countries. | Nominal exchange rates and domestic and foreign prices. |
| Finance | The present value model of stock prices implies a long-run relationship between stock prices and their dividends or earnings. | Stock prices and stock dividends/earnings. |
| Epidemiology | Joint mortality models imply a long-run relationship between mortality rates across different demographics. | Male and female mortality rates. |
| Medicine | Time series methodologies have been used to examine comorbidities of different types of cancers and trends in medical welfare. | Occurrence rates of different types of cancer. |
To understand the mathematics of cointegration, let's consider a group of time series, $Y_t$, which is composed of three separate time series:
$$y_1 = (y_{11}, y_{12}, \ldots, y_{1t})$$ $$y_2 = (y_{21}, y_{22}, \ldots, y_{2t})$$ $$y_3 = (y_{31}, y_{32}, \ldots, y_{3t})$$
All three series are non-stationary time series.

Cointegration implies that while $y_1$, $y_2$, and $y_3$ are independently non-stationary, they can be combined in a way that their linear combination is stationary:
$$\beta Y_t = \beta_1 y_{1t} + \beta_2 y_{2t} + \beta_3 y_{3t} \sim I(0)$$
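As a quick illustration, the following GAUSS sketch simulates series that share a common stochastic trend, so a suitable linear combination is stationary. All coefficients and variable names here are made up for illustration:

```gauss
// Hypothetical illustration of cointegrated series
rndseed 97531;
nobs = 500;

// Common stochastic trend (a random walk)
trend = cumsumc(rndn(nobs, 1));

// Two non-stationary series sharing the common trend
y1 = 2 * trend + rndn(nobs, 1);
y2 = trend + rndn(nobs, 1);

// The linear combination y1 - 2*y2 eliminates the common
// trend, leaving a stationary cointegrating error
u = y1 - 2 * y2;
```

Plotting `y1`, `y2`, and `u` would show two wandering series but a combination that fluctuates around zero.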
The Cointegrating Vector
In the context of cointegration, $\beta$ is commonly known as the cointegrating vector. This vector is not unique: any scalar multiple of $\beta$ produces another valid cointegrating vector.
Normalization

Because there can be multiple cointegrating vectors that fit the same economic model, we must impose identification restrictions to normalize the cointegrating vector for estimation.
A common normalization of the cointegrating vector is to set $\beta = (1, -\beta_2, \ldots, -\beta_N)$. For example, applying these restrictions to our earlier system yields

$$\beta Y_t = y_{1t} - \beta_2 y_{2t} - \beta_3 y_{3t} \sim I(0)$$
Part of the appeal of this normalization is that it can be rewritten in a standard regression form
$$y_{1t} = \beta_2 y_{2t} + \beta_3 y_{3t} + u_t$$
where $u_t$ is a stationary cointegrating error component. Intuitively, $u_t$ can be thought of as shortterm deviations from the longrun equilibrium.
While the regression format is a common normalization, it is important to remember that economic theory should inform our identifying restrictions.
Cointegration implies that the time series will be connected through an error correction model (ECM). The error correction model is important in time series analysis because it allows us to better understand long-run dynamics. Additionally, failing to properly model cointegrated variables can result in biased estimates.

The error correction model links the short-run dynamics of each series to deviations from the long-run equilibrium.
Let's assume that there is a bivariate cointegrated system with $Y_t = (y_{1t}, y_{2t})$ and a cointegrating vector $\beta = (1, \beta_2)$ such that
$$\beta Y_{t} = y_{1t} - \beta_2 y_{2t}$$
The error correction model depicts the dynamics of a variable as a function of the deviations from longrun equilibrium
$$\Delta y_{1t} = c_1 + \alpha_1 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j} + \epsilon_{1t}$$ $$\Delta y_{2t} = c_2 + \alpha_2 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{21} \Delta y_{1,t-j} + \sum_j \psi^j_{22} \Delta y_{2,t-j} + \epsilon_{2t}$$
| Term | Description | Intuition |
|---|---|---|
| $y_{1,t-1} - \beta_2 y_{2,t-1}$ | Cointegrated long-run equilibrium | Because this is an equilibrium relationship, it plays a role in the dynamic paths of both $y_{1t}$ and $y_{2t}$. |
| $\alpha_1$, $\alpha_2$ | Adjustment coefficients | Capture the reactions of $y_{1t}$ and $y_{2t}$ to disequilibrium. |
| $\sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j}$ | Autoregressive distributed lags | Captures additional dynamics. |
If the cointegrating vector has been previously estimated, then standard OLS or DOLS can be used to estimate the error correction relationship. More generally, the ECM relationship can be estimated using OLS, seemingly unrelated regressions (SUR), or maximum likelihood estimation.
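For example, assuming the cointegrating vector $(1, -\beta_2)$ has already been estimated, a minimal OLS sketch of the first ECM equation might look like this. The variable names are hypothetical, and this is only an illustrative sketch, not a full implementation:

```gauss
// Hypothetical sketch: estimate the ECM for y1 by OLS,
// given data vectors y1, y2 and an estimated coefficient b2

// Lagged equilibrium error (the error-correction term)
ec = lagn(y1 - b2 .* y2, 1);

// First differences and one lag of each
dy1 = y1 - lagn(y1, 1);
dy2 = y2 - lagn(y2, 1);
X = ec ~ lagn(dy1, 1) ~ lagn(dy2, 1);

// Trim rows lost to differencing and lagging, then regress
// with the built-in 'ols' procedure (a constant is added
// by default, matching the c_1 term in the ECM equation)
call ols("", trimr(dy1, 2, 0), trimr(X, 2, 0));
```

The coefficient on `ec` corresponds to the adjustment parameter $\alpha_1$.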
The vector error correction model (VECM) is the multivariate extension of the ECM. If we are working in a vector autoregressive context, cointegration implies a VECM such that
$$\Delta Y_t = \Phi D_t + \Pi Y_{t-1} + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \epsilon_t$$
Like the ECM, the VECM parameters reflect the long-run and short-run dynamics of the system, as shown in the table below:
| Term | Description | Intuition |
|---|---|---|
| $\Pi$ | Long-run impact matrix. | $\Pi = \Pi_1 + \Pi_2 + \cdots + \Pi_p - I_n$ captures adjustments towards the long-run equilibrium and contains the cointegrating relationships. |
| $\Gamma_k$ | Short-run impact matrix. | The short-run impact matrix is constructed from $-\sum_{j=k+1}^p \Pi_j$ and captures short-run deviations from the equilibrium. |
| $D_t$ | Deterministic terms. | These terms take the form $D_t = u_0 + u_1 t$, where $u_0$ is the constant component and $u_1 t$ is the trend component. |
The VECM can be estimated using the Johansen maximum likelihood method.

Before jumping directly to cointegration testing, there are a number of other time series modeling steps that we should consider first.
One of the key considerations prior to testing for cointegration is whether there is theoretical support for the cointegrating relationship. It is important to remember that cointegration occurs when separate time series share an underlying stochastic trend. The idea of a shared trend should be supported by economic theory.
As an example, consider growth theory which suggests that productivity is a key driver of economic growth. As such, it acts as the common trend, driving the comovements of many indicators of economic growth. Hence, this theory implies that consumption, investment, and income are all cointegrated.
One of the first steps in time series modeling should be data visualization. Time series plots provide good preliminary insights into the behavior of time series data, such as trends, structural breaks, and co-movements.
We've established that cointegration occurs between non-stationary, I(1), time series. This implies that before testing for or estimating a cointegrating relationship, we should perform unit root testing.
Our previous blog, "How to Conduct Unit Root Testing in GAUSS", provides an in-depth look at how to perform unit root testing in GAUSS.
GAUSS tools for performing unit root tests are available in a number of libraries, including Time Series MT (TSMT), the open-source TSPDLIB, and the coint library. All of these can be directly located and installed using the GAUSS package manager.
Full example programs for testing for unit roots using TSMT procedures and TSPDLIB procedures are available on our Aptech GitHub page.
| Panel Data Unit Root Test | TSMT procedure | TSPDLIB procedure |
|---|---|---|
| Hadri | hadri | |
| Im, Pesaran, and Shin | ips | |
| Levin-Lin-Chu | llc | |
| Schmidt and Perron LM test | | lm |
| Breitung and Das | breitung | |
| Cross-sectionally augmented IPS test (CIPS) | | cips |
| Panel analysis of non-stationarity in idiosyncratic and common components (PANIC) | | bng_panic |
| Time Series Unit Root Test | TSMT procedure | TSPDLIB procedure |
|---|---|---|
| Augmented Dickey-Fuller | vmadfmt | adf |
| Phillips-Perron | vmppmt | pp |
| KPSS | kpss | lmkpss |
| Schmidt and Perron LM test | | lm |
| GLS-ADF | dfgls | dfgls |
| Quantile ADF | | qr_adf |
A complete time series analysis should consider the possibility that structural breaks have occurred. If structural breaks have occurred, standard tests for cointegration are invalid. Therefore, it is important to test for structural breaks before testing for cointegration.
The GAUSS `sbreak` procedure, available in TSMT, is an easy-to-use tool for identifying multiple unknown structural breaks.
In order to test for cointegration, we must test whether a long-run equilibrium exists for a group of data. A number of questions need to be considered: Is the cointegrating vector known or must it be estimated? How many cointegrating relationships are possible? Which deterministic components should be included? In this section, we will show how to use these questions to guide cointegration testing without structural breaks.
The Engle-Granger cointegration test considers the case in which there is a single cointegrating vector. The test follows the very simple intuition that if variables are cointegrated, then the residual of the cointegrating regression should be stationary.
Forming the cointegrating residual
How to form the cointegrating residual depends on whether the cointegrating vector is known or must be estimated.

If the cointegrating vector is known, the cointegrating residuals are directly computed using $u_t = \beta Y_t$. The residuals can then be tested for stationarity using a standard unit root test.
If the cointegrating vector is unknown, OLS is used to estimate the normalized cointegrating vector from the regression $$y_{1t} = c + \beta y_{2t} + u_{t}$$
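This two-step idea (estimate the regression, then examine the residuals) can be sketched in GAUSS as follows. The variable names are hypothetical, and note that in practice the residual unit root test must use Engle-Granger critical values rather than standard ADF critical values:

```gauss
// Hypothetical sketch of the Engle-Granger two-step idea
// Step 1: estimate the cointegrating regression by OLS;
// the '/' operator solves the least squares problem
x = ones(rows(y2), 1) ~ y2;
b_hat = y1 / x;

// Step 2: form the cointegrating residuals; these are
// then tested for a unit root
u_hat = y1 - x * b_hat;
```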
There are two Johansen cointegrating tests for the VECM context, the trace test and the maximal eigenvalue test. These tests hinge on the intuition that in the VECM, the rank of the longrun impact matrix, $\Pi$, determines if the VAR(p) variables are cointegrated.
Since the rank of the long-run impact matrix equals the number of cointegrating relationships, testing for cointegration amounts to testing the rank of $\Pi$. The Johansen testing process has two general steps: first, estimate the rank of $\Pi$; second, use the estimated rank to form and estimate the appropriate VECM.
The Johansen Trace Statistic

The Johansen trace statistic tests the null hypothesis that the rank of $\Pi$ is $r$ against the alternative that it is greater than $r$.
The Johansen testing procedure sequentially tests the null hypothesis that the number of cointegrating vectors, $k$, equals $m$ against the alternative that $k > m$.
| Stage | Null Hypothesis | Alternative | Conclusion |
|---|---|---|---|
| One | $H_0: k = 0$ | $H_A: k>0$ | If $H_0$ cannot be rejected, stop testing, and $k = 0$. If the null is rejected, perform the next test. |
| Two | $H_0: k \leq 1$ | $H_A: k>1$ | If $H_0$ cannot be rejected, stop testing, and $k \leq 1$. If the null is rejected, perform the next test. |
| Three | $H_0: k \leq 2$ | $H_A: k>2$ | If $H_0$ cannot be rejected, stop testing, and $k \leq 2$. If the null is rejected, perform the next test. |
| m-1 | $H_0: k \leq m-1$ | $H_A: k>m-1$ | If $H_0$ cannot be rejected, stop testing, and $k \leq m-1$. If the null is rejected, perform the next test. |
The test statistic follows a nonstandard distribution and depends on the dimension and the specified deterministic trend.
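The sequential decision rule above can be sketched as follows, assuming hypothetical vectors `trace_stat` and `cv` ordered from the $H_0: k = 0$ stage upward:

```gauss
// Hypothetical sketch of the sequential Johansen decision rule
k_hat = 0;
i = 1;
do while i <= rows(trace_stat);
    if trace_stat[i] < cv[i];
        // Cannot reject H0 at this stage, stop testing
        break;
    endif;

    // Null rejected, move to the next stage
    k_hat = i;
    i = i + 1;
endo;

print "Estimated number of cointegrating vectors: " k_hat;
```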
The Johansen Maximum Eigenvalue Statistic

The maximal eigenvalue statistic tests the null hypothesis that the number of cointegrating vectors is $r$ against the alternative that it is $r+1$.
If there are structural breaks in the cointegrating relationship, the cointegration tests in the previous section should not be used. In this section, we look at three tests for cointegration with structural breaks:
The Gregory and Hansen (1996) cointegration test is a residual-based cointegration test which tests the null hypothesis of no cointegration against the alternative of cointegration in the presence of a single regime shift.
Because the structural break date is unknown, the Gregory and Hansen (1996) test computes the cointegration test statistic for each possible breakpoint and uses the smallest test statistic. Gregory and Hansen (1996) suggest running their test in combination with the standard cointegration tests.
The Hatemi-J cointegration test is an extension of the Gregory and Hansen cointegration test. It allows for two possible structural breaks with unknown timing.

The Maki cointegration test builds on the Gregory and Hansen and Hatemi-J cointegration tests to allow for an unknown number of structural breaks.
GAUSS tools for performing cointegration tests and estimating VECM models are available in a number of libraries, including Time Series MT (TSMT), TSPDLIB, and the coint library. All of these can be directly located and installed using the GAUSS package manager.
| Cointegration test | Null Hypothesis | Decision Rule | GAUSS library |
|---|---|---|---|
| Engle-Granger (ADF) | No cointegration. | Reject the null hypothesis if the $ADF$ test statistic is less than the critical value. | TSMT, tspdlib, coint |
| Phillips | No cointegration. | Reject the null hypothesis if the $Z$ test statistic is less than the critical value. | coint, tspdlib |
| Stock and Watson common trend | $Y$ is a non-cointegrated system after allowing for the pth order polynomial common trend. | Reject the null hypothesis if the $SW$ test statistic is less than the critical value. | coint |
| Phillips and Ouliaris | $Y$ and $X$ are not cointegrated. | Reject the null hypothesis if the $P_u$ or $P_z$ statistic is greater than the critical value. | coint, tspdlib |
| Johansen trace | Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is greater than $r$. | Reject the null hypothesis if the trace statistic is greater than the critical value. | TSMT, coint |
| Johansen maximum eigenvalue | Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is equal to $r+1$. | Reject the null hypothesis if the maximum eigenvalue statistic is greater than the critical value. | TSMT, coint |
| Gregory and Hansen | No cointegration against the alternative of cointegration with one structural break. | Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ is less than the critical value. | tspdlib |
| Hatemi-J | No cointegration against the alternative of cointegration with two structural breaks. | Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ is less than the critical value. | tspdlib |
| Shin | Cointegration. | Reject the null hypothesis if the test statistic is greater than the critical value. | tspdlib |
In this section, we will test for cointegration between monthly gold and silver prices, using historic monthly price data starting in 1915. Specifically, we will work through several stages of analysis, from data visualization and unit root testing to cointegration testing with and without structural breaks.

As a first step, we will create a time series graph of our data. This allows us to visually examine the deterministic trends in our data.

From our graphs, we can draw some preliminary conclusions about the dynamics of gold and silver prices over our time period.
Before testing whether silver and gold prices are cointegrated, we should test whether the series have unit roots. We can do this using the unit root tests available in the TSMT and tspdlib libraries.
Gold monthly closing prices (2015-2020)

| Time Series Unit Root Test | Test Statistic | Conclusion |
|---|---|---|
| Augmented Dickey-Fuller | -1.151 | Cannot reject the null hypothesis of a unit root. |
| Phillips-Perron | -1.312 | Cannot reject the null hypothesis of a unit root. |
| KPSS | 2.102 | Reject the null hypothesis of stationarity at the 1% level. |
| Schmidt and Perron LM test | -2.399 | Cannot reject the null hypothesis of a unit root. |
| GLS-ADF | -0.980 | Cannot reject the null hypothesis of a unit root. |
Silver monthly closing prices (2015-2020)

| Time Series Unit Root Test | Test Statistic | Conclusion |
|---|---|---|
| Augmented Dickey-Fuller | -5.121 | Reject the null hypothesis of a unit root at the 1% level. |
| Phillips-Perron | -5.446 | Reject the null hypothesis of a unit root at the 1% level. |
| KPSS | 0.856 | Reject the null hypothesis of stationarity at the 1% level. |
| Schmidt and Perron LM test | -4.729 | Reject the null hypothesis of a unit root at the 1% level. |
| GLS-ADF | -4.895 | Reject the null hypothesis of a unit root at the 1% level. |
These results provide evidence that gold prices are non-stationary but suggest that silver prices are stationary. At this point, we would likely not proceed with cointegration testing, or we may wish to perform additional unit root testing. For example, we may want to perform unit root tests that allow for structural breaks.
The GAUSS code for the tests in this section is available here.
Now, let's test for cointegration without structural breaks using two different tests, the Johansen tests and the Engle-Granger test.
The Johansen Tests
We will use the `vmsjmt` procedure from the TSMT library. This procedure should be used with the `vmc_sjamt` and `vmc_sjtmt` procedures, which find the critical values for the maximum eigenvalue and trace statistics, respectively.

The `vmsjmt` procedure requires four inputs, shown in the code below.

The `vmsjmt` procedure returns both the Johansen trace and the Johansen maximum eigenvalue statistics. In addition, it returns the associated eigenvalues and eigenvectors.
new;
// Load tsmt library
library tsmt;
// Set filename (with path) for loading
fname2 = __FILE_DIR $+ "commodity_mon.dat";
// Load real prices data
y_test_real = loadd(fname2, "P_gold_real + P_silver_real");
// No deterministic component in
// the fitted regression
p = -1;
// Set number of lagged differences
// for computing estimator
k = 2;
// No determinant
no_det = 0;
{ ev, evec, trace_stat, max_ev } = vmsjmt(y_test_real, p, k, no_det);
cv_lr2 = vmc_sjamt(cols(y_test_real), p);
cv_lr1 = vmc_sjtmt(cols(y_test_real), p);
Both `trace_stat` and `max_ev` will contain statistics for all possible ranks of $\Pi$. For example, since we are testing for cointegration of just two time series, there will be at most one cointegrating vector. This means `trace_stat` and `max_ev` will be 2 x 1 matrices, testing both null hypotheses that $r=0$ and $r=1$.
| Test | Test Statistic | 10% Critical Value | Conclusion |
|---|---|---|---|
| Johansen Trace Statistic | $H_0: r=0$: 0.0767 <br> $H_0: r=1$: 56.707 | 10.46 | Cannot reject the null hypothesis that $r=0$. |
| Johansen Maximum Eigenvalue | $H_0: r=0$: 0.0766 <br> $H_0: r=1$: 56.631 | 9.39 | Cannot reject the null hypothesis that $r=0$. |
These results indicate that there is no cointegration between monthly gold and silver prices. This should not be a surprise, given the results of our unit root testing.
Note that `vmsjmt` returns test statistics for all possible ranks of $\Pi$. However, the Johansen tests are sequential tests and, since we cannot reject the null that $k=0$ for either test, we technically do not need to continue testing for any higher rank.

The Engle-Granger Test
Since there is only one possible cointegrating vector for this system, we could have also used the Engle-Granger test for cointegration. This test can be implemented using the `coint_egranger` procedure from the TSPDLIB library.

The `coint_egranger` procedure requires five inputs, shown in the code below. It returns the test statistic along with the 1%, 5%, and 10% critical values.
Using the data already loaded in the previous example:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// No constant or trend
model = 0;
{ tau0, cvADF0 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant
model = 1;
{ tau1, cvADF1 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant and trend
model = 2;
{ tau2, cvADF2 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
| Test | Test Statistic | 10% Critical Value | Conclusion |
|---|---|---|---|
| Engle-Granger, no constant | -3.094 | -2.450 | Reject the null of no cointegration at the 10% level. |
| Engle-Granger, constant | -1.609 | -3.066 | Cannot reject the null hypothesis of no cointegration. |
| Engle-Granger, constant and trend | -2.327 | -3.518 | Cannot reject the null hypothesis of no cointegration. |
These results provide evidence for our conclusion that there is no cointegration between gold and silver prices. Note, however, that these results are not conclusive and depend on whether we include a constant. This sheds light on the importance of including the correct deterministic components in our model.
To be thorough we should also test for cointegration using tests that allow for a structural break. As an example, let's use the Gregory-Hansen test to compare the null hypothesis of no cointegration against the alternative that there is cointegration with one structural break.
This test can be implemented using the `coint_ghansen` procedure from the TSPDLIB library. The `coint_ghansen` procedure requires eight inputs, shown in the code below:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// Trimming rate
trimm = 0.15;
// Long-run consistent variance estimation method
varm = 3;
// Bandwidth for kernel estimator
T = rows(y_test_real);
bwl = round(4 * (T/100)^(2/9));
// Level shift
model = 1;
{ ADF_min1, TBadf1, Zt_min1, TBzt1, Za_min1, TBza1, cvADFZt1, cvZa1 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Level shift with trend
model = 2;
{ ADF_min2, TBadf2, Zt_min2, TBzt2, Za_min2, TBza2, cvADFZt2, cvZa2 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift
model = 3;
{ ADF_min3, TBadf3, Zt_min3, TBzt3, Za_min3, TBza3, cvADFZt3, cvZa3 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift with trend
model = 4;
{ ADF_min4, TBadf4, Zt_min4, TBzt4, Za_min4, TBza4, cvADFZt4, cvZa4 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
The `coint_ghansen` procedure has eight returns: the minimum $ADF$, $Z_t$, and $Z_{\alpha}$ statistics, the estimated break date for each, and the critical values for the tests.
| Test | $ADF$ Test Statistic | $Z_t$ Test Statistic | $Z_{\alpha}$ Test Statistic | 10% Critical Value ($ADF$, $Z_t$) | 10% Critical Value ($Z_{\alpha}$) | Conclusion |
|---|---|---|---|---|---|---|
| Gregory-Hansen, Level shift | -3.887 | -4.331 | -39.902 | -4.34 | -36.19 | Cannot reject the null of no cointegration for $ADF$ and $Z_t$. Reject the null of no cointegration at the 10% level for $Z_{\alpha}$. |
| Gregory-Hansen, Level shift with trend | -3.915 | -5.010 | -50.398 | -4.72 | -43.22 | Reject the null of no cointegration at the 10% level for $Z_{\alpha}$ and $Z_t$. Cannot reject the null for the $ADF$ test. |
| Gregory-Hansen, Regime change | -5.452 | -6.4276 | -80.379 | -4.68 | -41.85 | Reject the null of no cointegration at the 10% level. |
| Gregory-Hansen, Regime change with trend | -6.145 | -7.578 | -106.549 | -5.24 | -53.31 | Reject the null of no cointegration at the 10% level. |
Note that our test for cointegration with one structural break is inconsistent and depends on which type of structural break we include in our model. This provides some indication that further exploration of structural breaks is needed. For example, we could test for cointegration allowing for more structural breaks using the Hatemi-J or Maki tests.
Congratulations! You now have an established guide for cointegration and the background you need to perform cointegration testing. In particular, today's blog covered the theory of cointegration, error correction models, and cointegration testing with and without structural breaks.
You're probably familiar with the basic find-and-replace. However, large projects with many files across several directories require a more powerful search tool.

The GAUSS Source Browser is the powerful search-and-replace tool you need. We'll discuss some ways you can use it to make your life easier.
If an output seems incorrect or is causing an error, you need a quick way to trace the progress of specific variables. GAUSS's Find Usages functionality does exactly that.
As shown in the image below, right-click on the variable you want to search for and select Find Usages.
This will search our current file for every instance of our search variable and display them in the GAUSS Source Browser.
As we can see in the above image, Find Usages has shown us:
We can double-click on a line in the Source Browser to open it in the GAUSS editor.
The Find Usages basics alone are great, but the GAUSS Source Browser has much more to offer.
In addition to right-clicking and selecting Find Usages, the Source Browser can also be opened with the Ctrl+Shift+F (Command+Shift+F on Mac) keyboard shortcut.

The Search Tab is the starting point for any search. It has three input boxes: Search for, Directory, and File pattern. The default File pattern is `*.*`, where `*` is a wildcard character that will match any character or characters. You can also enter a comma-separated list of patterns, such as `*.gss, estimate.src` or `*.gss,*.src,*.sdf,*.gau,*.prg`.
Using these tools together can help us quickly locate the information we're looking for. For example, suppose we want to find uses of the GAUSS generalized method of moments procedure, gmmFit, in internal GAUSS source code.
We can achieve this by:
- Entering gmmFit in the Search for box.
- Setting /Users/Research/gauss/src as our Directory.

We could also set the File pattern option to *.src to limit our search to .src files. However, we left the default option to search all files in this case.
Our search results show that the term gmmFit
is used between 1 and 37 times in each of five files in the specified directory.
The Advanced Tab provides many useful options for further refining our search. We'll discuss some of the most helpful and less common options.
The Assignments checkbox allows us to restrict our search to only the instances when a variable is assigned to. Narrowing the list of uses to assignments only is invaluable when tracking down the cause of an incorrect value. The screenshot above shows our original Find Usages search after adding the Assignments filter.
References is the opposite of Assignments, returning instances where a variable is used, but not assigned to.
By default, the Source Browser will search the directory that you specify and all subdirectories. Uncheck the Search Subdirs checkbox to restrict the search to the main search directory only.
After we have completed our desired search, we can perform our replace by:
The checkboxes at the start of each line of the search results allow you to quickly control which results are replaced.
Backup Original will create a copy of your files with a .bak
file extension before making the replace. Since there is no undo option for replacements made to files not open in the editor, the .bak
copies allow you to recover the original file if you make a mistake.
Overwrite Backup will copy over any pre-existing .bak files made by previous searches with Backup Original selected. If this option is unchecked, GAUSS will use .bak2, .bak3, etc., if pre-existing .bak files are found.
After reading this post, you now know how to use the advanced search and replace tools in GAUSS to be more productive!
We have learned:
The latest Time Series MT (TSMT) 3.1.0 is now available. If you own TSMT 3.0, the update is available for free.
This minor update adds a new option to use the cross-product covariance computation for some estimation routines and includes some bug fixes:
- A new cross-product covariance option for varmaFit, ecmFit, svarmaxmt, and garchFit.
- Bug fixes for cusum and sbreak graphs.

The GAUSS TSMT application module provides a comprehensive suite of tools for MLE and state-space estimation, model diagnostics, and forecasting of univariate, multivariate, and nonlinear time series models.
Panel data, sometimes referred to as longitudinal data, contains observations on multiple cross-sectional units across time. Examples of groups that may make up a panel data series include countries, firms, individuals, or demographic groups.
Like time series data, panel data contains observations collected chronologically at a regular frequency. Like cross-sectional data, panel data contains observations across a collection of individuals.
There are a number of advantages of panel data:
Panel data examples can be found in economics, social sciences, medicine and epidemiology, finance, and the physical sciences.



| Field | Example topics | Example dataset |
|---|---|---|
| Microeconomics | GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances. | Panel Study of Income Dynamics (PSID) |
| Macroeconomics | International trade tables, world socioeconomic tables, currency exchange rate tables. | Penn World Tables |
| Epidemiology and health statistics | Public health insurance data, disease survival rate data, child development and well-being data. | Medical Expenditure Panel Survey |
| Finance | Stock prices by firm, market volatilities by country or firm. | Global Market Indices |
Panel data is a collection of quantities obtained across multiple individuals, that are assembled over even intervals in time and ordered chronologically. Examples of individual groups include individual people, countries, and companies.
In order to denote both individuals and time observations, panel data notation often refers to groups with the subscript i and to time with the subscript t. For example, a panel data observation $Y_{it}$ is observed for all individuals $i = {1, ..., N}$ across all time periods $t = {1, ..., T}$.
More specifically:
| Group | Time Period | Notation |
|---|---|---|
| 1 | 1 | $Y_{11}$ |
| 1 | 2 | $Y_{12}$ |
| ⁞ | ⁞ | ⁞ |
| 1 | T | $Y_{1T}$ |
| ⁞ | ⁞ | ⁞ |
| N | 1 | $Y_{N1}$ |
| N | 2 | $Y_{N2}$ |
| N | T | $Y_{NT}$ |
Panel datasets may come in different formats. The format in the table above is sometimes called long format data. Long format datasets stack the observations of each variable from all groups, across all time periods, into one column.
When panel data is stored with the observations for a single variable from separate groups stored in separate columns, this is sometimes referred to as wide format data.
| Time | $Y_1$ | $Y_2$ | ⋯ | $Y_N$ |
|---|---|---|---|---|
| 1 | $Y_{11}$ | $Y_{21}$ | ⋯ | $Y_{N1}$ |
| 2 | $Y_{12}$ | $Y_{22}$ | ⋯ | $Y_{N2}$ |
| 3 | $Y_{13}$ | $Y_{23}$ | ⋯ | $Y_{N3}$ |
| 4 | $Y_{14}$ | $Y_{24}$ | ⋯ | $Y_{N4}$ |
| ⁞ | ⁞ | ⁞ | ⁞ | ⁞ |
| T−1 | $Y_{1,T-1}$ | $Y_{2,T-1}$ | ⋯ | $Y_{N,T-1}$ |
| T | $Y_{1T}$ | $Y_{2T}$ | ⋯ | $Y_{NT}$ |
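As a quick illustration, a wide matrix can be reshaped into long format with basic GAUSS matrix operations. This is only a minimal sketch; the data values and dimensions are hypothetical.

```gauss
// Wide format: each column holds one group's series (T = 4, N = 3)
y_wide = { 10 20 30,
           11 21 31,
           12 22 32,
           13 23 33 };

t_obs = rows(y_wide);
n_grp = cols(y_wide);

// Build the long-format identifier columns by broadcasting,
// then stack with 'vec', which stacks the columns of a matrix
grp = vec(ones(t_obs, 1) .* seqa(1, 1, n_grp)');
t_id = vec(seqa(1, 1, t_obs) .* ones(1, n_grp));

// Long format columns: group ID | time | value
y_long = grp ~ t_id ~ vec(y_wide);
```

After this, each row of y_long holds one observation, with the group identifier in the first column, which is the layout most panel data routines expect.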
Panel data can also be characterized as unbalanced panel data or balanced panel data:
Certain panel data models are only valid for balanced datasets. If the panel datasets are unbalanced they may need to be condensed to include only the consecutive periods for which there are observations for all individuals in the cross section.
Panel data series modeling centers around addressing the likely dependence across data observations within the same group. In fact, the primary difference between panel data models and time series models is that panel data models allow for heterogeneity across groups and introduce individual-specific effects.
As an example, consider a panel data series which includes gross domestic product (GDP) data for a panel of 5 different countries, the United States, France, Canada, Greece, and Australia:
Panel data models include techniques that can address these heterogeneities across individuals. Furthermore, pure crosssectional methods and pure time series models may not be valid in the presence of this heterogeneity.
Researchers commonly analyze datasets with multiple observations of a set of cross-sectional units (e.g., people, firms, countries) over time. For example, one may have data covering the production of multiple firms or the gross product of multiple countries across a number of years.
Modeling these panel data series is a unique branch of time series modeling made up of methodologies specific to their structure.
This section looks more closely at panel data analysis and the associated panel data models.
Panel data methods can be split into two broad categories:
Within these groups, the assumptions made about the variation of the model across individuals are the primary drivers for which model to use.
Let’s consider a simple linear model
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
The representation above is a homogeneous model:
Alternatively, we could believe that the groups share common coefficients on the regressors but have group-specific intercepts, as captured in the fixed effects or least squares dummy variable (LSDV) model
$$y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}$$
The representation above is a heterogeneous model, because the constants, $\alpha_i$, are group-specific.
This section considers four popular panel data models:
We will examine these models using an assumed data generation process given by
$$ y_{it} = \beta x_{it} + \delta z_i + \epsilon_{it}$$
In this model, $X$ represents the observed characteristics such as age, firm size, expenditures, and $Z$ represents unobserved characteristics, such as management quality, growth opportunities, or skill.
| Component | Description | Example |
|---|---|---|
| $x_{it}$ | Observable characteristics. These may be constant for an individual across all time, such as race, or may vary across time observations for an individual, such as age. | Age, race, company size, expenditure, population, GDP |
| $z_i$ | Unobservable characteristics, responsible for model heterogeneity. | Skill, company potential, lack of basic infrastructure in the community, political unrest |
| $\epsilon_{it}$ | Stochastic error term. | N/A |
What Is Pooled Ordinary Least Squares?
In some cases, there are no unobservable individual-specific effects, and $\delta z_i$ is constant across individuals. This is a strong assumption and implies that all the observations within groups are independent of one another.
In these cases, the model becomes
$$ y_{it} = \beta x_{it} + \alpha + \epsilon_{it}$$
This implies that when there is no dependence within individual groups, the panel data can be treated as one large, pooled dataset. The model parameters $\beta$ and $\alpha$ can be directly estimated using pooled ordinary least squares.
Independence of observations within the groups of a panel is unlikely, however, and pooled OLS is rarely acceptable for panel data models.
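When the pooling assumption does hold, estimation is just ordinary least squares on the stacked data. Below is a minimal GAUSS sketch with simulated data; the dimensions and coefficient values are purely illustrative.

```gauss
new;
rndseed 5423;

// Simulate a pooled panel: 50 groups, 10 periods each
n_grp = 50;
t_obs = 10;

x = rndn(n_grp*t_obs, 1);

// Pooled DGP: common intercept (1) and slope (0.5) for every group
y = 1 + 0.5 * x + rndn(n_grp*t_obs, 1);

// Pooled OLS ignores the group structure entirely and
// treats the panel as one large cross-section
call ols("", y, x);
```

Because the data generating process has no individual-specific component, pooled OLS is consistent here; with group effects present, the same regression would be biased.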
What Is the One-Way Fixed Effects Model?
The one-way fixed effects panel data model:
As an example, let's consider the one-way fixed effects model with individual-specific effects, where the unobservable component, $\delta z_i$, acts like an individual-specific intercept:
$$y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}$$
The intercept term, $\alpha_i$, varies across individuals but is constant across time. This term is composed of the constant intercept term, $\mu$, and the individual-specific error term, $\gamma_i$.
The key feature of the fixed effects model is that $\gamma_i$ has a true, but unobservable, effect which must be estimated. More importantly, if we estimate $\beta$ using pooled OLS and fail to appropriately account for $\gamma_i$, the estimates will be inconsistent and biased.
The fixed effects model requires the estimation of the model parameter $\beta$ and individual $\alpha_i$ for each of the N groups in the panel. This is generally achieved using one of three estimation techniques:
The first two of these techniques focus on eliminating the individual effects before estimation. The LSDV method directly incorporates these effects using dummy variables.
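The demeaning (within) transformation can be sketched with the aggregate function introduced in GAUSS 20. This is a simplified illustration: the tiny dataset is hypothetical, the group ID sits in the first column, and we assume aggregate returns the group mean of every remaining column.

```gauss
// Tiny long-format panel: group ID | y | x
panel = { 1 2 1,
          1 3 2,
          2 5 1,
          2 7 3 };

// Group means of y and x (group ID must be in the first column)
grp_means = aggregate(panel, "mean");

// Match each observation to its group's row in 'grp_means'
idx = indnv(panel[., 1], grp_means[., 1]);

// Demean y and x within each group
y_w = panel[., 2] - grp_means[idx, 2];
x_w = panel[., 3] - grp_means[idx, 3];

// OLS on the demeaned data gives the within estimate of beta
b_within = invpd(x_w'x_w) * x_w'y_w;
```

Because the demeaning removes $\alpha_i$, the within regression recovers $\beta$ without estimating the individual intercepts directly.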
What Is the One-Way Random Effects Model?
The one-way random effects panel data model:
The distinguishing feature of the random effects model is that $\delta z_i$ does not have a true value but rather follows a random distribution with parameters that we must estimate.
The random effects term, $\delta z_i$:
The random effects model should be estimated using feasible generalized least squares (FGLS). Using FGLS, the appropriate error structure, one which accounts for the individual-specific error terms, can be incorporated into the model.
What Is the Random Coefficients Model?
The panel data regressions we've looked at so far have all assumed that the coefficients on the regressors are the same across all individuals. The random coefficients model relaxes this assumption and introduces individual-specific effects through the coefficients, such that
$$y_{it} = \beta_i x_{it} + \alpha_i + \epsilon_{it}$$ $$y_{it} = (b_i + \beta)x_{it} + (a_i + \alpha) + \epsilon_{it}$$ $$b_i \sim N(0, \tau_{i1}^2)$$ $$a_i \sim N(0, \tau_{i2}^2)$$
This model both introduces individual slope effects and allows for heteroscedasticity through the individual-specific variances $\tau_{i1}^2$ and $\tau_{i2}^2$.
This model can be estimated using FGLS or maximum likelihood estimation (MLE).
The two-way individual effects model allows the presence of both time-specific effects and individual-specific effects.
Starting from a simple linear model given by
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
the two-way individual effects model can be represented by
$$y_{it} = \alpha + \beta x_{it} + \mu_i + \lambda_t + \epsilon_{it}$$
In this model, $\mu_i$ captures any unobservable individual-specific effects and $\lambda_t$ captures any unobservable time-specific effects. Note that the individual-specific effects, $\mu_i$, do not vary with time, while the time-specific effects, $\lambda_t$, do not vary across individuals.
In the special case that there are only two time periods and two groups, this model is equivalent to the difference-in-differences model. However, if there are more than two time periods and/or groups, alternative panel data models must be considered.
What Is the Two-Way Fixed Effects Model?
The two-way fixed effects model:
For data generated by this model:
Like the one-way fixed effects model, this model could be estimated by including dummy variables. However, in the two-way fixed effects model, dummy variables must be included for both the time periods and the groups.
Under most circumstances, the number of dummy variables included in the two-way fixed effects model makes standard ordinary least squares estimation too computationally difficult. Instead, the two-way fixed effects model is estimated using a within-group estimator which removes the variation both within groups and within time periods.
What Is the Two-Way Random Effects Model?
The two-way random effects model:
For data generated by this process:
Like the one-way random effects model, the two-way random effects model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).
Dynamic Panel Data Model
A key component of pure time series models is the modeling of dynamics using lagged dependent variables. These lagged variables capture the autocorrelation between observations of the same dataset at different points in time.
Because panel datasets include a time series component, it is also important to address the possibility of autocorrelation in panel data. The dynamic panel data model adds dynamics to the panel data individual effects framework.
Consider an individual effects model which includes an AR(1) term
$$y_{it} = \delta y_{i,t-1} + \beta x_{it} + \epsilon_{it}$$
where the error component includes one-way individual effects such that
$$\epsilon_{it} = \mu_i + \nu_{it}$$
where $\mu_i$ captures individual effects.
Introducing lagged dependent variables in the individual effects framework:
Dynamic panel data models are most commonly estimated using a generalized method of moments (GMM) framework proposed by Arellano and Bond (1991).
In panel data that covers short time frames, there is little need to worry about stationarity. However, when panel data covers longer time frames, as is the case for many macroeconomic panel data series, the panel data must be tested for stationarity.
Weak stationarity, required for many panel data modeling techniques, requires only that:
Nonstationary panel data series are any panel series that do not meet the conditions of a weakly stationary time series.
In part because of these considerations, a large field of research and literature surrounding panel data unit root tests has developed.
Testing for unit roots in panel data requires more than just testing the individual cross sections for the presence of unit roots. Panel data unit root tests must:
After today's blog, you should have an understanding of the fundamentals of panel data. We covered the basics of panel data including:
To learn more about performing panel data analysis using GAUSS contact us for a GAUSS demo copy.
The aggregate
function, first available in GAUSS version 20, computes statistics within data groups. This is particularly useful for panel data.
In today's blog, we take a closer look at aggregate. We will:
- Introduce the aggregate function.
- Explain the basic usage of the aggregate function.
- Work through a real-world example of the aggregate function using current account data from the International Monetary Fund.
function computes statistics within a group based upon a specified group identifier. The function supports a variety of GAUSS statistics including:
For example, consider a panel dataset which includes observed weights for three individuals across a 6month time span:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
We can use the aggregate function to find the 6-month mean weights for Sarah, Tom, and Nikki:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight | Mean Weight |
|---|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 | 139.7 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 | 186.3 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 | 144.7 |
Alternatively, we could find the monthly standard deviation of the weights across Sarah, Tom and Nikki:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
| Monthly Std. Dev. | 33.2 | 31.0 | 23.4 | 22.4 | 22.8 | 21.4 |
The aggregate function takes two required inputs:
x_a = aggregate(x, method);
- x: the data matrix, with the group identifier in the first column.
- method: a string specifying the statistic to compute. Valid options are "mean", "median", "mode", "max", "min", "sd", "sum", and "variance".

The aggregate function requires the data matrix input to:
Let's consider our example dataset from above. In order to use this data as an input to the GAUSS aggregate
function we need to:
| Name | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| Sarah | 135 | 134 | 138 | 142 | 144 | 145 |
| Tom | 196 | 192 | 182 | 183 | 184 | 181 |
| Nikki | 143 | 144 | 146 | 147 | 145 | 143 |
$$\text{Sarah} \rightarrow 1$$ $$\text{Tom} \rightarrow 2$$ $$\text{Nikki} \rightarrow 3$$
$$\Downarrow$$
| Group | Jan. Weight | Feb. Weight | Mar. Weight | Apr. Weight | May Weight | June Weight |
|---|---|---|---|---|---|---|
| 1 | 135 | 134 | 138 | 142 | 144 | 145 |
| 2 | 196 | 192 | 182 | 183 | 184 | 181 |
| 3 | 143 | 144 | 146 | 147 | 145 | 143 |
$$\Downarrow$$
| Group | Month | Weight |
|---|---|---|
| 1 | 1 | 135 |
| 1 | 2 | 134 |
| 1 | 3 | 138 |
| 1 | 4 | 142 |
| 1 | 5 | 144 |
| 1 | 6 | 145 |
| 2 | 1 | 196 |
| 2 | 2 | 192 |
| ⁞ | ⁞ | ⁞ |
| 3 | 6 | 143 |
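The first step above, recoding the names to integer identifiers, can be sketched with GAUSS string tools. This is a minimal, hypothetical example; it assumes indsav is available for matching each label against the list of unique labels.

```gauss
// Observed group labels, one per row of the long-format panel
names = "Sarah" $| "Sarah" $| "Tom" $| "Tom" $| "Nikki" $| "Nikki";

// Unique labels, in the order the IDs should be assigned
labels = "Sarah" $| "Tom" $| "Nikki";

// For each element of 'names', find its row index in 'labels';
// that index serves as the integer group ID
id = indsav(names, labels);

print id;
```

The resulting id column can then be concatenated with the month and weight columns to build the numeric matrix that aggregate expects.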
The method input into the aggregate
function should always be a string indicating which statistic you wish to compute.
Each method works on groups within the panel the same way the analogous pooled data function would work, including its handling of missing values.
| Method | Pooled Function |
|---|---|
| mean | meanc |
| median | median |
| mode | modec |
| max | maxc |
| min | minc |
| sum | sumc |
| sd | stdc |
| variance | varCovXS |
Let's use aggregate
to find the means by group for weight data:
weights = { 1 1 135,
1 2 134,
1 3 138,
1 4 142,
1 5 144,
1 6 145,
2 1 196,
2 2 192,
2 3 182,
2 4 183,
2 5 184,
2 6 181,
3 1 143,
3 2 144,
3 3 146,
3 4 147,
3 5 145,
3 6 143 };
/*
** Find the mean by person.
** We will use the first column
** as the group indicator and will find
** the mean of the weights.
*/
print aggregate(weights[., 1 3], "mean");
This prints the group means to the output window:
1 139.67
2 186.33
3 144.67
We can also use the month identifiers to find the sample standard deviation by month:
/*
** Find the standard deviation by month.
** We will use the second column of weights
** as the group indicator and will find
** the standard deviation of the weights.
*/
print aggregate(weights[., 2 3], "sd");
Now the standard deviations, along with their associated months will be printed to the output window:
1 33.151
2 31.005
3 23.438
4 22.368
5 22.811
6 21.385
Our simple example dataset is useful for demonstrating the basics of the aggregate
function. However, a real-world panel dataset better demonstrates its true power. In this section, we will use aggregate
to examine some of the trends in international current account balances.
We will use current account balance measured as a percentage of GDP. This unbalanced panel data is a modified version of a dataset from the International Monetary Fund and spans 1953Q3 to 2019Q4 and includes a total of 46 countries, across 5 different regions.
It contains the following variables:
| Variable name | Description |
|---|---|
| Country | String, name of the country. |
| Country ID | Integer country identifier. |
| World Region | String, name of the corresponding world region. |
| Region ID | Integer world region identifier. |
| Time | String, the date of the observation. |
| CAB | Decimal numeric, the current account balance. |
We first examine variations in the mean and median current account balances across countries. Using aggregate
we find the mean and median current account balances for each country in the panel across all observations.
/*
** Load the 'Country ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Country Id + CAB");
// Compute mean and median current
// account balances by Country ID
mean_cab_cid = aggregate(X, "mean");
median_cab_cid = aggregate(X, "median");
After the above code mean_cab_cid
and median_cab_cid
will both be $46\times2$ matrices. Each element in the first column will be a unique country ID. The corresponding element in the second column will be the average (mean or median) current account balance for that country.
We include a graph of this data below, where we see that Germany leads the pack with the highest average current account balance while Finland has the lowest.
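To rank the countries ourselves, we can sort the aggregated results on the second column. This small sketch reuses the mean_cab_cid matrix produced by the code above:

```gauss
// Sort by the mean CAB in the second column, in ascending order
ranked = sortc(mean_cab_cid, 2);

// Country IDs with the lowest and highest mean
// current account balance
id_lowest = ranked[1, 1];
id_highest = ranked[rows(ranked), 1];
```

Because aggregate keeps the group identifier in the first column, the IDs travel with their statistics through the sort.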
We can similarly consider the mean and median current account balances across geographical regions with the code below.
/*
** Load the 'Region ID' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
*/
X = loadd("imf_cab_mod.xlsx", "Region ID + CAB");
// Compute mean and median current
// account balances by world region
mean_cab_wreg = aggregate(X, "mean");
median_cab_wreg = aggregate(X, "median");
mean_cab_wreg
and median_cab_wreg
will be two-column matrices with unique world region IDs in the first column and the corresponding statistics in the second column.
Finally, we consider how the mean and median current account balances vary across time in the time series plot below.
/*
** Load the 'Time' and 'CAB' (Current Account Balance) variables.
** Notice that the grouping variable will be in the first column of 'X'.
** Wrapping 'Time' in 'date($)' tells GAUSS that 'Time' is a string
** variable that we want GAUSS to convert to a date.
*/
X = loadd("imf_cab_mod.xlsx", "date($Time) + CAB");
mean_cab_date = aggregate(X, "mean");
median_cab_date = aggregate(X, "median");
This time the first column of our resulting matrices, mean_cab_date
and median_cab_date
, will contain each unique date from our dataset. The second column will contain the statistic computed for each unique date.
Below is a graph of the Current Account Balance data grouped by quarter.
In today's blog, we examined the fundamentals of the aggregate procedure. After reading, you should have a better understanding of:
- The purpose of the aggregate function.
- The basic usage of the aggregate function.
- How to analyze real-world panel data with aggregate.
.The GAUSS Package Manager, first introduced in version 20, allows you to download, install and uninstall GAUSS packages without leaving GAUSS.
It supports the paid GAUSS Application Modules, free GAUSS packages and even allows you to create custom packages and channels.
This post will guide you through the basics needed to install and uninstall GAUSS packages.
The Package Manager must be downloaded and installed before its first use. To begin this process, select Tools > Package Manager from the main GAUSS menu. This will open a window which looks like this:
By default, the Package Manager will install inside a folder named conda
inside your GAUSSHOME folder. In most cases, you should leave the default installation directory.
However, the initial release of the Package Manager does not support installation directories with spaces in the path names. If your GAUSSHOME directory contains a space, you will need to point this to another folder in which your user has write permissions.
The next section is for your GAUSS Download Account username and password. This allows you to download any GAUSS Application Modules which you have purchased. If you do not know your username and password, contact Aptech.
A username and password are not required to download free packages.
Click the install button on the bottom right of the Install Package Manager window. This will begin the download and installation of the Package Manager. It will take a minute or two. When the installation has completed you will see the following popup window.
If you have not already installed the Package Manager, see the previous steps.
Click the plus (+) icon on the top right of the Installed Packages window to open the Repository Packages window.
If a package in the Repository Packages window has a red background color, it means that the username and password you have entered do not allow access to the package.
Additional package actions are available from the icons on the right of the Installed Packages window.
) icon.In this post, you've learned:
New function, aggregate, to compute statistics across groups:
// Load data with the grouping data in the first column
X = loadd("grunfeld.dat", "Years + Investment");
// Group investment by average for year
mean_by_year = aggregate(X, "mean");
// Load data with the grouping data in the first column
X = loadd("grunfeld.dat", "firm + Investment");
// Group investment by median for each firm
median_by_firm = aggregate(X, "median");
New suite of tools makes it easy to add optional arguments to your GAUSS procedures.
This example shows a simple estimation procedure that chooses either OLS estimation or ridge regression based on whether an optional input, lambda, is passed in.
// ... is a placeholder for the optional arguments
proc (1) = estimate(y, X, ...);
local lambda, b_hat;
// Get the optional 'dynamic argument'
lambda = getDynargs(1);
// If the 'lambda' was not passed in,
// it will be an empty matrix.
if isempty(lambda);
// No 'lambda' so perform standard OLS
b_hat = olsRegress(y, X);
else;
// 'lambda' passed in so perform ridge regression
b_hat = ridgeRegress(y, X, lambda);
endif;
retp(b_hat);
endp;
The procedure above can be called like this:
// OLS regression when optional input is not passed in
b_hat = estimate(y, X);
or like this:
// Ridge regression is performed when optional input is passed in
b_hat = estimate(y, X, lambda);
- New plotXYFill procedure.
- New plotBarH procedure.
- plotSetLegend now supports setting the legend location by coordinates.
- New plotSetYTicInterval procedure.
var_names = "alpha" $| "beta" $| "gamma";
b = { 0.34, 1.9334, 0.8983 };
se = { 0.00002234, 0.013235, 0.03752 };

print sprintf(fmt, var_names, b, se);

alpha 0.340 (0.00)
beta 1.933 (0.01)
gamma 0.898 (0.04)
In time series modeling we often encounter trending or nonstationary time series data. Understanding the characteristics of such data is crucial for developing proper time series models. For this reason, unit root testing is an essential step when dealing with time series data.
In this blog post, we cover everything you need to conduct time series data unit root tests using GAUSS. This includes:
Unit roots are often used interchangeably with the idea of nonstationarity. This isn’t completely off base, because the two are related. However, it is important to remember that while all unit root processes are nonstationary, not all nonstationary time series are unit root processes.
A time series is stationary when all statistical characteristics of that series are unchanged by shifts in time. Time series models generally are valid only under the assumption of weak stationarity.
A weakly stationary time series has:
Nonstationarity can be caused by many factors including structural breaks, time trends, or unit roots.
A unit root process:
There are some important implications of unit roots:
Let’s consider the simplest example, the AR(1) unit root process
$$Y_t = \phi_0 + Y_{t-1} + \epsilon_t$$
Since the mean of the error term, $\epsilon_t$, is zero and the scalar value $\phi_0$ is added at each time period, the expected value of this process is
$$E[Y_t] = \phi_0 t$$
which changes with time. Additionally, the variance given by
$$var(Y_t) = var(\epsilon_1 + \epsilon_2 + \epsilon_3 ...+ \epsilon_t)$$
is also dependent on time.
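We can check this time-dependence numerically with a small simulation, which is a sketch added here for illustration: the cross-sectional variance across many simulated random walks grows roughly in proportion to $t$.

```gauss
new;
rndseed 2345;

nobs = 200;
nrep = 5000;

// Each column of 'y' is one simulated random walk,
// built as the cumulative sum of N(0,1) disturbances
y = cumsumc(rndn(nobs, nrep));

// Sample variance across the simulated paths
// at t = 50, 100, and 200
print stdc(y[50, .]')^2;
print stdc(y[100, .]')^2;
print stdc(y[200, .]')^2;
```

The three printed variances should land close to 50, 100, and 200, illustrating that the dispersion of a unit root process keeps widening instead of settling to a constant.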
Unit root processes have important implications for time series modeling including:
If time series data contains a unit root, shocks will have a permanent impact on the path of the data.
The top panel in the above graph shows the impact of a random shock on an AR(1) process with a unit root. After the shock hits, the process transitions to a new path and there is no mean-reversion.
Conversely, the bottom panel shows the impact of the same shock to a stationary AR(1) series. In this case, the impact of the shock is transitory and the series reverts to the original mean when the blue line of the shock path overlaps the orange line of the original series.
Many time series models which estimate the relationship between two variables assume that both are stationary series. When neither series is stationary, regression models can find relationships between the two series that do not exist.
Let’s look at an example using GAUSS.
First, we simulate two unit root series:
// Number of observations
nobs = 150;
// Generate two vectors of random disturbances
e1 = rndn(nobs, 1);
e2 = rndn(nobs, 1);
// Find cumulative sum of disturbances
y1 = cumsumc(e1);
x1 = cumsumc(e2);
Next, we use the ols
procedure to regress y1 on x1:
call ols("", y1, x1);
The ols
procedure prints the following results to the input/output window:
Valid cases:                 150      Dependent variable:                   Y
Missing cases:                 0      Deletion method:                   None
Total SS:               5161.244      Degrees of freedom:                 148
R-squared:                 0.450      Rbar-squared:                     0.446
Residual SS:            2838.019      Std error of est:                 4.379
F(1,148):                121.154      Probability of F:                 0.000

                         Standard                 Prob   Standardized  Cor with
Variable     Estimate      Error      t-value     >|t|     Estimate    Dep Var
-------------------------------------------------------------------------------
CONSTANT     3.645808    0.419546    8.689894     0.000       ---         ---
X1           0.615552    0.055924   11.006998     0.000     0.670916    0.670916
Despite the fact that these two series are unrelated, the model suggests a statistically significant relationship between y1 and x1. The estimated coefficient on x1 is 0.616 with a t-statistic of 11.00 and a p-value of 0.000.
A regression of one I(1) series on other I(1) series can:
Unit root time series have three characteristics that can impact inferences in standard time series models:
Combined, these imply that the Law of Large Numbers does not hold for a nonstationary series. This, in turn, means that inferences based on standard test statistics and distributions are no longer valid.
Before running any unit root tests, we must first determine if our data has any deterministic components such as a constant or time trend.
How do we determine if our data has a time trend or constant?
Time series plots are useful for identifying constants and time trends which is why the first step to any time series modeling should be data visualization.
The graph above plots three different AR(1) time series. The time series plot in the first panel has no constant and no trend with the data generating process
$$y_t = 0.7y_{t-1} + \epsilon_t$$
We can tell visually that the series has no constant or trend, because it fluctuates around the zero line.
The time series plot in the second panel shows an AR(1) plot with a constant term. The plotted data follows the data generating process
$$y_t = 1.5 + 0.7y_{t-1} + \epsilon_t$$
The series has the same shape as the first series but is shifted upward, fluctuating around the series mean of $1.5/(1 - 0.7) = 5$ instead of 0.
The time series plot in the final panel shows an AR(1) plot with a constant term and time trend. The plotted data follows the data generating process
$$y_t = 1.5 + 0.2t + 0.7y_{t-1} + \epsilon_t$$
This series fluctuates around an increasing, timedependent, line.
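A minimal GAUSS sketch of how three series of this kind can be simulated directly, with a standard normal error term added to each:

```gauss
new;
rndseed 97867;

nobs = 200;
e = rndn(nobs, 1);

y1 = zeros(nobs, 1);    // no constant, no trend
y2 = zeros(nobs, 1);    // constant
y3 = zeros(nobs, 1);    // constant and time trend

for t(2, nobs, 1);
    y1[t] = 0.7*y1[t-1] + e[t];
    y2[t] = 1.5 + 0.7*y2[t-1] + e[t];
    y3[t] = 1.5 + 0.2*t + 0.7*y3[t-1] + e[t];
endfor;
```

Plotting y1, y2, and y3 reproduces the three patterns described above: fluctuation around zero, around a constant level, and around an upward-sloping line.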
When testing for I(1) series, there are two broad categories of tests, those that test for unit roots and those that test for stationarity.
Unit root tests consider the null hypothesis that a series contains a unit root against the alternative that the series is trend stationary.
Time series stationarity tests consider the null hypothesis that a series is trend stationary against the alternative that it contains a unit root.
It is important to distinguish which test we are running to avoid making incorrect conclusions about our test results.
The first unit root test we will consider is the Augmented Dickey-Fuller (ADF) test. The ADF test is based on the test regression
$$\Delta y_t = \beta D_t + (\rho - 1) y_{t-1} + \sum_{j=1}^p \psi_j \Delta y_{t-j} + \epsilon_t$$
where $D_t$ is a vector of deterministic components which can include a constant and/or a trend.
The ADF test considers the null hypothesis that the series is I(1), or has a unit root, against the alternative hypothesis that the series is I(0).
The ADF test eliminates serial correlation in the residuals by including the lagged differences of the dependent variable, $\Delta y_{t-j}$, in the specification.
The Phillips-Perron test also considers the null hypothesis that the series contains a unit root against the alternative that there is no unit root. However, it addresses serial correlation by adjusting the OLS estimate of the AR(1) coefficient.
Three specifications are considered: an AR(1) model without a drift, an AR(1) model with a drift, and an AR(1) model with a drift and linear trend:
$$ Y_t = \rho Y_{t-1} + \epsilon_t\\ Y_t = \alpha + \rho Y_{t-1} + \epsilon_t\\ Y_t = \alpha + \delta t + \rho Y_{t-1} + \epsilon_t$$
Unlike the previous tests, the KPSS test uses a Lagrange multiplier type test of the null hypothesis of stationarity against the alternative hypothesis that the data contains a unit root.
The residuals from a regression of $y_t$ on the deterministic components, $D_t$, are used in combination with a heteroskedasticity- and autocorrelation-consistent estimate of the long-run variance to construct the KPSS test statistic.
The Time Series MT Application includes tools for conducting standard unit root testing. Today we will consider the three fundamental unit root tests discussed above:
For this example, we will simulate data using the simarmamt
procedure. The three time series we will simulate encompass three different cases of deterministic components:
new;
cls;
library tsmt;
/*
** Step One: Generate Data
*/
// Coefficient on AR(1) term
phi = 0.80;
// AR order
p = 1;
// MA order
q = 0;
// Constant
const1 = 0;
const2 = 2.5;
const3 = 2.5;
// Trend
trend1 = 0;
trend2 = 0;
trend3 = 0.20;
// Number of observations
n = 150;
// Number of series
k = 1;
// Standard deviation
std = 1;
// Set seed for reproducing data
seed = 10191;
// Case One: No deterministic components
y1 = simarmamt(phi, p, q, const1, trend1, n, k, std, seed);
// Case Two: With Constant
y2 = simarmamt(phi, p, q, const2, trend2, n, k, std, seed);
// Case Three: With Constant and Trend
y3 = simarmamt(phi, p, q, const3, trend3, n, k, std, seed);
We can run the Augmented DickeyFuller test in GAUSS using the vmadfmt
procedure included in the Time Series MT library.
The vmadfmt
procedure requires three inputs:
In this case, we know that our data is AR(1), so we set l = 1. Also, we will call vmadfmt
three times, once for each of our datasets:
/*
** ADF Testing
*/
/* Order of deterministic trend to include
** -1   No deterministic trends
**  0   Constant
**  1   Constant and Trend
*/
// No deterministic trends
{ rho1, tstat1, adf_t_crit1 } = vmadfmt(y1, -1, 1);
// Constant
{ rho2, tstat2, adf_t_crit2 } = vmadfmt(y2, 0, 1);
// Constant and trend
{ rho3, tstat3, adf_t_crit3 } = vmadfmt(y3, 1, 1);
Interpreting the ADF Results
The vmadfmt
procedure returns three values:
We've summarized the test statistics and critical values in the table below:
| Case | Test Statistic | 1% Critical Value | 5% Critical Value | 10% Critical Value | Conclusion |
|------|----------------|-------------------|-------------------|--------------------|------------|
| $Y_1$ | -3.6897 | -2.6025 | -1.9423 | -1.5950 | Reject the null |
| $Y_2$ | -4.0473 | -3.4391 | -2.9152 | -2.5841 | Reject the null |
| $Y_3$ | -4.4899 | -4.0052 | -3.4611 | -3.1552 | Reject the null |
Augmented Dickey-Fuller Interpretation
There are several things to note about these results:
These results should not be surprising to us since we used simulated data with an autoregressive coefficient whose absolute value was less than one. It should be noted that we also knew the correct lag specification and deterministic trends to use with our test because we knew the true data generating process.
In reality, the data generating processes will be unknown and you may need to conduct additional testing to confirm the proper lags and deterministic trends.
We can run the Phillips-Perron test in GAUSS using the vmppmt
procedure included in the Time Series MT library.
The vmppmt
procedure requires three inputs:
For the Phillips-Perron case, we will let GAUSS pick the optimal Newey-West truncation length by setting nwtrunc = 0. Again, we will call vmppmt
three times, once for each of our datasets:
/*
** PhillipsPerron
*/
/* The second input reflects the deterministic
** components to include
** -1   No deterministic trends
**  0   Constant
**  1   Constant and Trend
*/
// No deterministic components
{ ppb1, ppt1, pptcrit1 } = vmppmt(y1, -1, 0);
// Constant
{ ppb2, ppt2, pptcrit2 } = vmppmt(y2, 0, 0);
// Constant and trend
{ ppb3, ppt3, pptcrit3 } = vmppmt(y3, 1, 0);
Interpreting the Phillips-Perron Results
The vmppmt
procedure returns three values:
Again, we summarize the two most relevant outputs, ppt and pptcrit:
| Case | Test Statistic | 1% Critical Value | 5% Critical Value | 10% Critical Value | Conclusion |
|------|----------------|-------------------|-------------------|--------------------|------------|
| $Y_1$ | -3.7332 | -2.5856 | -1.9448 | -1.6246 | Reject the null |
| $Y_2$ | -4.1367 | -3.4642 | -2.9124 | -2.5884 | Reject the null |
| $Y_3$ | -4.5911 | -4.0009 | -3.4542 | -3.1625 | Reject the null |
As with our ADF results, there are several notable points to draw from these findings:
The final test we will run is the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test. The KPSS test can be conducted in GAUSS using the kpss
procedure included in the Time Series MT library.
The kpss
procedure has one required input and five optional inputs:
Today we will use only the first two optional inputs and will use the default values for the rest. Again, we call kpss three times, once for each of our datasets:
/*
** KPSS
*/
/*
** Note that we use the default maxlags
** and trend settings for the two cases without a trend.
*/
// No deterministic trends
{ lm1, crit1 } = kpss(y1);
// Constant and no trend
{ lm2, crit2 } = kpss(y2);
// Constant and trend
{ lm3, crit3 } = kpss(y3, 0, 1);
Interpreting the KPSS Results
The kpss
procedure returns two values:
| Lag | $Y_1$ | $Y_2$ | $Y_3$ |
|-----|-------|-------|-------|
| 1 | 1.4556 | 1.4556 | 0.3124 |
| 2 | 1.0363 | 1.0363 | 0.2301 |
| 3 | 0.8603 | 0.8603 | 0.1964 |
| 4 | 0.7637 | 0.7637 | 0.1785 |
| 5 | 0.7041 | 0.7041 | 0.1678 |
| 6 | 0.6658 | 0.6658 | 0.1613 |
| 7 | 0.6444 | 0.6444 | 0.1585 |
| 8 | 0.6349 | 0.6349 | 0.1582 |
| 9 | 0.6318 | 0.6318 | 0.1586 |
| Critical values (1%, 5%, 10%) | 0.739, 0.574, 0.463 | 0.739, 0.574, 0.463 | 0.216, 0.176, 0.146 |
| Conclusion | Reject the null (1% level) | Reject the null (1% level) | Reject the null (5% level) |
This test yields some interesting results worth noting:
These results highlight one of the known issues with the KPSS test. We know our data is stationary because we know the true data generating process. However, the KPSS test concludes that we should reject the null hypothesis of stationarity.
The KPSS test is known for incorrectly rejecting the null hypothesis of stationarity more frequently than other tests. This is known as having a high rate of Type I errors.
In this blog, we have covered very fundamental, simple unit root tests. However, much work has been done to develop unit root tests for more complex cases. For example:
This week we covered everything you need to know to be able to test your data for unit roots using GAUSS. This includes:
After this week you should have a better understanding of how to determine if a time series is stationary using GAUSS.
Time series data is data that is collected at different points in time. This is opposed to cross-sectional data, which observes individuals, companies, etc. at a single point in time.
Because data points in a time series are collected at adjacent time periods, there is potential for correlation between observations. This is one of the features that distinguishes time series data from cross-sectional data.
Time series data can be found in economics, social sciences, finance, epidemiology, and the physical sciences.
| Field | Example topics | Example dataset |
|-------|----------------|-----------------|
| Economics | Gross Domestic Product (GDP), Consumer Price Index (CPI), S&P 500 Index, and unemployment rates | U.S. GDP from the Federal Reserve Economic Data |
| Social sciences | Birth rates, population, migration data, political indicators | Population without citizenship from Eurostat |
| Epidemiology | Disease rates, mortality rates, mosquito populations | U.S. cancer incidence rates from the Centers for Disease Control |
| Medicine | Blood pressure tracking, weight tracking, cholesterol measurements, heart rate monitoring | MRI scanning and behavioral test dataset |
| Physical sciences | Global temperatures, monthly sunspot observations, pollution levels | Global air pollution from Our World in Data |
The statistical characteristics of time series data often violate the assumptions of conventional statistical methods. Because of this, analyzing time series data requires a unique set of tools and methods, collectively known as time series analysis.
This article covers the fundamental concepts of time series analysis and should give you a foundation for working with time series data.
Time series data is a collection of quantities that are assembled over even intervals in time and ordered chronologically. The time interval at which data is collected is generally referred to as the time series frequency.
For example, the time series graph above plots the visitors per month to Yellowstone National Park with the average monthly temperatures. The data ranges from January 2014 to December 2016 and is collected at a monthly frequency.
A time series graph plots observed values on the yaxis against an increment of time on the xaxis. These graphs visually highlight the behavior and patterns of the data and can lay the foundation for building a reliable model.
More specifically, visualizing time series data provides a preliminary tool for detecting if data:
This, in turn, can help guide the testing, diagnostics, and estimation methods used during time series modeling and analysis.
“In our view, the first step in any time series investigation always involves careful scrutiny of the recorded data plotted over time. This scrutiny often suggests the method of analysis as well as statistics that will be of use in summarizing the information in the data.” - Shumway and Stoffer
Mean reverting data returns, over time, to a time-invariant mean. It is important to know whether a model includes a nonzero mean because this determines the appropriate testing and modeling methods.
For example, unit root tests use different regressions, statistics, and distributions when a nonzero constant is included in the model.
A time series graph provides a tool for visually inspecting whether the data is mean-reverting, and if it is, what mean the data is centered around. While visual inspection should never replace statistical estimation, it can help you decide whether a nonzero mean should be included in the model.
For example, the data in the figure above varies around a mean that lies above the zero line. This indicates that the models and tests for this data must incorporate a nonzero mean.
In addition to containing a nonzero mean, time series data may also have a deterministic component that is proportionate to the time period. When this occurs, the time series data is said to have a time trend.
Time trends in time series data also have implications for testing and modeling. The reliability of a time series model depends on properly identifying and accounting for time trends.
A time series plot which looks like it centers around an increasing or decreasing line, like that in the plot above, suggests the presence of a time trend.
Seasonality is another characteristic of time series data that can be visually identified in time series plots. Seasonality occurs when time series data exhibits regular and predictable patterns at time intervals that are smaller than a year.
An example of a time series with seasonality is retail sales, which often increase from September to December and decrease in January and February.
Sometimes time series data shows a sudden change in behavior at a certain point in time. For example, many macroeconomic indicators changed sharply in 2008 after the start of the global financial crisis. These sudden changes are often referred to as structural breaks or nonlinearities.
These structural breaks can create instability in the parameters of a model. This, in turn, can diminish the validity and reliability of that model.
Though statistical methods and tests should be used to test for structural breaks, time series plots can help for preliminary identification of structural breaks in data.
Structural breaks in the mean of a time series will appear in graphs as sudden shifts in the level of the data at certain breakpoints. For example, in the time series plot above there is a clear jump in the mean of the data which occurs around the start of 1980.
Time series models are used for a variety of reasons: predicting future outcomes, understanding past outcomes, making policy suggestions, and much more. These general goals of time series modeling don’t vary significantly from modeling cross-sectional or panel data. However, the techniques used in time series models must account for time series correlation.
Two broad approaches have developed for modeling time series data: the time-domain approach and the frequency-domain approach.
The time-domain approach models future values as a function of past and present values. The foundation of this approach is the time series regression of present values of a time series on its own past values and past values of other variables. The estimates of these regressions are often used for forecasting, and this approach is popular in time series econometrics.
Frequency domain models are based on the idea that time series can be represented as a function of time using sines and cosines. These representations are known as Fourier representations. Frequency domain models utilize regressions on sines and cosines, rather than past and present values, to model the behavior of the data.
| Time Domain Examples | Frequency Domain Examples |
|----------------------|---------------------------|
| Autoregressive Moving Average (ARMA) models | Spectral analysis |
| Autoregressive Integrated Moving Average (ARIMA) models | Band spectrum regression |
| Vector Autoregressive (VAR) models | Fourier transform methods |
| Generalized Autoregressive Conditional Heteroscedasticity (GARCH) models | Spectral factorization |
Time series models may also be split into univariate time series models and multivariate time series models. Univariate time series models are models used when the dependent variable is a single time series. Trying to model an individual’s heart rate per minute using only past observations of heart rate and exogenous variables is an example of a univariate time series model.
Multivariate time series models are used when there are multiple dependent variables. In addition to depending on their own past values, each series may depend on past and present values of the other series. Modeling U.S. gross domestic product, inflation, and unemployment together as endogenous variables is an example of a multivariate time series model.
When structural breaks are present in time series data they can diminish the reliability of time series models that assume the model is constant over time. For this reason, special models must be used to deal with the nonlinearities that structural breaks introduce.
Nonlinear time series analysis focuses on:
There are different types of nonlinear time series models built around the different nature and characteristics of the nonlinearities. For example, the threshold autoregressive model assumes that jumps in the dependent data are triggered when a threshold variable reaches a specified level. Conversely, Markov-switching models assume that an underlying stochastic Markov chain drives regime changes.
| Example | Description |
|---------|-------------|
| Housing market prices | The concept of a housing bubble gained notoriety after the global financial crash of 2008. Since then, much research and theoretical work has been done to identify and predict housing bubbles. |
| S&P 500 unconditional variance | Modeling stock price volatility is crucial to managing financial portfolios. Because of this, much attention is directed towards understanding the underlying behavior of market indicators like the S&P 500. |
| Global temperatures | Identifying structural breaks in global temperatures has provided support to proponents of global climate change. |
What does it mean for a time series to be stationary? A time series is stationary when all statistical characteristics of that series are unchanged by shifts in time. In technical terms, strict stationarity implies that the joint distribution of $(y_t, \ldots, y_{t-h})$ depends only on the lag, $h$, and not on the time period, $t$. Strict stationarity is rarely required in practice.
This is not to imply that stationarity is not an important concept in time series analysis. Many time series models are valid only under the assumption of weak stationarity (also known as covariance stationarity).
Weak stationarity, henceforth stationarity, requires only that:
Nonstationary time series are any data series that do not meet the conditions of a weakly stationary time series.
Gaussian White Noise
An example of a stationary time series is a Gaussian white noise process. A Gaussian white noise process is given by
$$ y_t \sim iid N(0, \sigma^2)$$
where $iid$ indicates that the series is independent and identically distributed.
The Gaussian white noise process has a constant mean equal to zero, $E[Y_t]=0$, and a constant variance equal to $\sigma^2$, $Var(Y_t) = \sigma^2$.
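These two moments are easy to verify numerically. The sketch below (in Python with NumPy, purely for illustration) draws a large Gaussian white noise sample and checks that the sample mean and variance are close to the theoretical $0$ and $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0

# Gaussian white noise: iid N(0, sigma^2) draws
y = rng.normal(loc=0.0, scale=sigma, size=100_000)

# Sample moments should be close to the theoretical E[y]=0, Var(y)=sigma^2
print(y.mean(), y.var())
```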
Independent White Noise
Independent white noise is any time series data that is drawn from the same distribution with a zero mean and variance $\sigma^2$. An example of an independent white noise series is data drawn from the Student’s t distribution.
Deterministically Trending Data
When data has a time trend, it contains a deterministic component that grows proportionally with time. For example,
$$y_t = \beta_0 + \beta_1t + \epsilon_t\\ \epsilon_t \sim N(0, \sigma^2)$$
Note that in this case $E[y_t] = \beta_0 + \beta_1t$, which is dependent on $t$.
Data with a time trend is sometimes referred to as a trend stationary time series. This is because it can be transformed into stationary data using a simple detrending process:
$$\tilde{y}_t = y_t - \beta_0 - \beta_1 t = \epsilon_t$$
Random Walk
A random walk process is generated when one observation is a random modification of the previous observation. For example,
$$y_t = y_{t-1} + \epsilon_t$$
As with the deterministically trending data, a simple transformation, first differencing, yields a stationary series:
$$ \Delta y_t = y_t - y_{t-1} = \epsilon_t$$
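The differencing identity above can be verified numerically. The Python sketch below (an illustrative, non-GAUSS example) simulates a random walk as a cumulative sum of shocks and recovers the stationary shock series by first differencing:

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(1000)

# Random walk: y_t = y_{t-1} + e_t, i.e. the cumulative sum of the shocks
y = np.cumsum(e)

# First differences recover the stationary shock series e_t
dy = np.diff(y)

# The level series wanders; the differenced series is just e_t again
print(np.allclose(dy, e[1:]))  # prints True
```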
It is important to recognize the presence of seasonality in time series. Failing to recognize the regular and predictable patterns of seasonality in time series data can lead to incorrect models and interpretations.
Identifying seasonality in time series data is important for the development of a useful time series model.
There are many tools that are useful for detecting seasonality in time series data:
Once seasonality is identified, the proper steps must be taken to deal with its presence. There are a few options for addressing seasonality in time series data:
In time series data, autocorrelation is the correlation between observations of the same dataset at different points in time. The need for distinct time series models stems in part from the autocorrelation present in time series data.
Common examples of autocorrelated time series include United States real GDP growth, luteinizing hormone levels, and financial time series.
Ordinary least squares assumes that there is no autocorrelation in the error terms of a series. When autocorrelation is present, as it is in time series data:
When autocorrelation is present, there are two options for finding robust standard errors. The first approach estimates an OLS model and modifies the standard errors afterward. The Newey-West (1987) method is the standard approach for modifying the OLS standard errors to produce heteroskedasticity- and autocorrelation-consistent standard errors.
The alternative for dealing with autocorrelation in time series data is to reweight the data prior to estimation. One method for doing this is generalized least squares, which applies least squares to data that has been transformed by weights. Generalized least squares requires that the true parameters of autocorrelation be known.
More generally, the true autocorrelation parameters are unknown and must be estimated using feasible generalized least squares (FGLS). Feasible generalized least squares requires estimation of the covariance matrix.
When the covariance structure is assumed the method is known as weighted least squares. Alternatively, the covariance structure can be estimated using iterative methods.
Detecting autocorrelation in time series data can be done in a number of ways. One preliminary measure for detecting autocorrelation is a time series graph of residuals versus time. If no autocorrelation is present the residuals should appear random and scattered around zero. If there is a pattern in the residuals present, then autocorrelation is likely.
The Durbin-Watson Test
If the time series plot suggests autocorrelation, then further statistical tests can be used to formally test for autocorrelation.
The Durbin-Watson test is a test of the null hypothesis of no first-order autocorrelation against the alternative that the error term in one period is negatively or positively correlated with the error term in the previous period.
The Durbin-Watson Test:
Upper and lower critical values for the Durbin-Watson statistic depend on the number of independent variables. The computed Durbin-Watson statistic should be compared to these critical values:
$$d \lt d_l \rightarrow \text{reject the null hypothesis}\\ d \gt d_u \rightarrow \text{do not reject the null hypothesis}\\ d_l \lt d \lt d_u \rightarrow \text{test is inconclusive.}$$
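The Durbin-Watson statistic is easy to compute directly. The Python sketch below (illustrative, using statsmodels) contrasts white-noise residuals, where $d$ is near 2, with positively autocorrelated residuals, where $d$ falls well below 2:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# White-noise residuals: d should be near 2 (no autocorrelation)
e_white = rng.standard_normal(1000)

# Positively autocorrelated residuals (AR(1), rho = 0.8): d well below 2
e_ar = np.zeros(1000)
eps = rng.standard_normal(1000)
for t in range(1, 1000):
    e_ar[t] = 0.8 * e_ar[t - 1] + eps[t]

d_white = durbin_watson(e_white)
d_ar = durbin_watson(e_ar)
print(d_white, d_ar)
```

A useful approximation is $d \approx 2(1 - \hat{\rho})$, so $\rho = 0.8$ pushes $d$ toward 0.4.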
The Breusch-Godfrey Test
When higher-order autocorrelation is suspected, the Durbin-Watson test is not valid and the Breusch-Godfrey test should be used instead.
The Breusch-Godfrey test is a test of the null hypothesis of no q-order autocorrelation against the alternative of q-order autocorrelation.
The Breusch-Godfrey Test:
If the Breusch-Godfrey F-test statistic is greater than the critical value, then the null hypothesis of no q-order autocorrelation is rejected.
The Box-Jenkins method for ARIMA modeling is a procedure for consistently identifying, estimating, and evaluating ARIMA models.
The autoregressive integrated moving average model (ARIMA) is a fundamental univariate time series model. The ARIMA model is made up of three key components:
The ARIMA data is described by the order of each of these components with the notation ARIMA(p, d, q) where:
The Box-Jenkins method for estimating ARIMA models is made up of several steps:
The first step of the Box-Jenkins method involves:
Identifying the autoregressive and moving average orders of the ARIMA model can be done using a variety of statistical tools:
Techniques for estimating the ARIMA model include:
Once an ARIMA model is estimated the performance of that model should be evaluated using statistical diagnostics. The residuals of the model should be closely examined using tools like:
Multivariate time series analysis provides insight into the interactions and comovements of a group of time series variables. For example, a multivariate time series model may study the comovement of temperature, wind speed, and precipitation.
The most common multivariate time series model is known as the VARMA model. The VARMA model is analogous to the ARIMA model and contains an autoregressive component and a moving average component.
In the multivariate model, the moving average component is uncommon and the more common case is the pure vector autoregressive model (VAR).
The VAR model is a flexible model that has shown great success in forecasting and has been used for policy and structural analysis.
The vector autoregressive model represents a group of dependent time series variables as combinations of their own past values and past values of the other variables in the group.
For example, consider a trivariate model of the relationship between hourly temperature, wind speeds, and precipitation. This model describes the relationship for all three variables, temperature, wind speeds and precipitation as functions of the past values:
$$temp_t = \beta_{11}temp_{t-1} + \beta_{12}wind_{t-1} + \beta_{13}prec_{t-1}$$ $$prec_t = \beta_{21}temp_{t-1} + \beta_{22}wind_{t-1} + \beta_{23}prec_{t-1}$$ $$wind_t = \beta_{31}temp_{t-1} + \beta_{32}wind_{t-1} + \beta_{33}prec_{t-1}$$
Two key characteristics of the univariate time series model are the autocorrelation function and the covariance. The autocorrelation function measures the correlation of a univariate series with its own past values. The covariance measures the joint variability of the dependent time series with other variables.
The analogues of these in the multivariate time series model are the cross-covariance and the cross-correlation. These measures provide insight into how the individual series in a group of time series are related.
What Is the Cross-correlation Function?
The time series cross-correlation function measures the correlation between one series at various points in time with the values in another series at various points in time.
The cross-correlation function:
For example, the cross-correlation function between the temperatures and wind speeds observed at 8:00 am, 9:00 am, and 10:00 am gives:
What Is the Cross-covariance Function?
The time series cross-covariance measures the covariance between values in one time series and values of another time series.
The cross-covariance function:
An unrestricted VAR model is composed of $K$ equations, one for each of the time series variables in the group. Assuming that the VAR model is
each individual equation in the VAR model can be estimated using ordinary least squares.
Like ARIMA models, VAR models may also be estimated using the generalized method of moments or maximum likelihood.
Like the univariate model, one of the important steps of the VAR model is to determine the optimal lag length.
The optimal lag length selection for VAR models is based on information criteria such as the:
Time series analysis has many realworld applications. This section looks at several realworld cases for applying time series models.
The NYSE composite adjusted closing price is an example of a univariate time series with potential autocorrelation. The BoxJenkins method can be used to fit an appropriate ARIMA model to the data.
Testing for Stationarity
Though the time series graph of the NYSE composite adjusted closing price suggests that the series is stationary, statistical tests should be used to confirm this.
This example uses the Augmented Dickey-Fuller unit root test and the GLS Dickey-Fuller unit root test, both from the GAUSS Time Series MT library.
library tsmt;
// Load data
nyse = loadd("nyse_closing.xlsx", "adj_close");
// Transform to percent change
ch_nyse = (nyse[1:rows(nyse)-1] - nyse[2:rows(nyse)]) ./ nyse[2:rows(nyse)];
/*
** Unit root testing
*/
// Augmented Dickey-Fuller test
p = 0;
l = 3;
{ alpha, tstat_adf, adf_t_crit } = vmadfmt(nyse, p, l);
print "tstat_adf"; tstat_adf;
print "adf_t_crit"; adf_t_crit;
// GLS Dickey-Fuller test
trend = 0;
maxlag = 12;
{ tstat_gls, zcrit } = dfgls(nyse, maxlag, trend);
print "tstat_gls"; tstat_gls;
print "zcrit"; zcrit;
The Augmented Dickey-Fuller test statistic is -2.0849669, which suggests that the null hypothesis of a unit root cannot be rejected at the 10% level.
Similarly, the GLS Dickey-Fuller test statistic is -1.6398621, which confirms that the null hypothesis of a unit root cannot be rejected at the 10% level.
Based on these results the first differences of the data will be used.
The PACF and ACF Functions
The autocorrelation function and partial autocorrelation functions provide guidance for what autoregressive order and moving average order are appropriate for our model.
The ACF and PACF functions can be computed using the lagReport
GAUSS function provided in the TSMT library.
/*
** PACF and ACF testing
*/
// Maximum number of autocorrelations
k = 12;
// Order of differencing
d = 1;
// Compute and plot the sample autocorrelations
{ acf_nyse, pacf_nyse } = lagreport (nyse, k, d);
The lagReport
function provides ACF and PACF plots:
The ACF and PACF fall below the significance line (blue dotted line) at all lag values in these graphs. This suggests that there is no autocorrelation in the series.
Estimate the ARIMA Model
Despite the results of the PACF and ACF, the ARIMA model will be estimated using the arimaFit procedure in GAUSS for demonstration:
/*
** Estimate the ARIMA(1,1,0) model
*/
call arimaFit(nyse, 1, 1, 0);
This prints the output:
Log Likelihood:  2091.560130     Number of Residuals:  250
AIC:            -4181.120259     Error Variance:       11768.138992177
SBC:            -4177.598798     Standard Error:       108.481053609
DF:              249             SSE:                  2930266.609052175

            Coefficients    Std. Err.    T-Ratio    Approx. Prob.
AR[1,1]     0.01905         0.06339      0.30055    0.76401

Constant:   1.15163318
Unsurprisingly, the AR(1) coefficient is not statistically significant, as suggested by its low t-ratio and high p-value. This is consistent with the ACF and PACF and confirms the conclusion that the ARIMA model is not an appropriate model for the NYSE composite adjusted closing price.
The U.S. Wholesale Price Index (WPI) is a classic time series dataset that has been used to demonstrate the Box-Jenkins method (see Enders 2004).
Testing for Stationarity
Unlike the NYSE composite adjusted closing price, the time series plot of the WPI suggests that the level series might be nonstationary.
Using the Augmented Dickey-Fuller and the GLS Dickey-Fuller unit root tests will help confirm this. This time, a trend is included in the test because of the apparent trend in the time series plot of the data:
library tsmt;
// Load the variable 'ln_wpi' from the dataset
wpi = loadd("wpi1.dat", "ln_wpi");
// Transform to percent change
ch_wpi = (wpi[1:rows(wpi)-1] - wpi[2:rows(wpi)]) ./ wpi[2:rows(wpi)] * 100;
/*
** Unit root testing on level data
*/
// Augmented Dickey-Fuller test
p = 1;
l = 3;
{ alpha, tstat_adf, adf_t_crit } = vmadfmt(wpi, p, l);
print "Unit root ADF test results : tstat_adf"; tstat_adf;
print "Unit root ADF test results : adf_t_crit"; adf_t_crit;
// GLS Dickey-Fuller test
trend = 1;
maxlag = 12;
{ tstat_gls, zcrit } = dfgls(wpi, maxlag, trend);
print "Unit root GLSADF test results : tstat_gls"; tstat_gls;
print "Unit root GLSADF test results : zcrit"; zcrit;
The Augmented Dickey-Fuller test statistic is -2.250, which suggests that the null hypothesis of a unit root cannot be rejected at the 10% level. Similarly, the GLS Dickey-Fuller test statistic is -1.627, which confirms that the null hypothesis of a unit root cannot be rejected at the 10% level. Both of these findings provide evidence that the data may not be stationary.
Based on these results the first differences of the data will be used and a unit root test on the differenced data is performed:
/*
** Unit root testing on percent change data
*/
// Augmented Dickey-Fuller test
p = 1;
l = 3;
{ alpha, tstat_adf, adf_t_crit } = vmadfmt(ch_wpi, p, l);
print "Unit root ADF test results ch_wpi: tstat_adf"; tstat_adf;
print "Unit root ADF test results ch_wpi: adf_t_crit"; adf_t_crit;
// GLS Dickey-Fuller test
trend = 0;
maxlag = 12;
{ tstat_gls, zcrit } = dfgls(ch_wpi, maxlag, trend);
print "Unit root GLSADF test results ch_wpi: tstat_gls"; tstat_gls;
print "Unit root GLSADF test results ch_wpi: zcrit"; zcrit;
Now both the Augmented Dickey-Fuller test statistic, -1.546, and the GLS Dickey-Fuller test statistic, -1.606, are closer to suggesting that the null hypothesis of a unit root can be rejected at the 10% level. However, the tests still do not allow for rejection of the unit root null hypothesis.
Why didn't differencing solve the nonstationarity? This may suggest the presence of seasonality or a structural break in the data.
One source of nonstationarity may be a change in volatility, which is suggested in the time series plot in the bottom panel. This could be statistically examined using tests for structural changes in volatility.
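As an informal first look before any formal test, a rolling standard deviation of the percent-change series can reveal shifts in volatility. The sketch below uses only base GAUSS functions; the 20-observation window length is an arbitrary choice for illustration:

```gauss
// Rolling standard deviation of the percent-change series
// as an informal check for changes in volatility
w = 20;
n = rows(ch_wpi);

roll_sd = zeros(n - w + 1, 1);
for i(1, n - w + 1, 1);
    // Standard deviation over the i-th window
    roll_sd[i] = stdc(ch_wpi[i:i+w-1]);
endfor;

// Plot the rolling standard deviation against observation index
plotXY(seqa(w, 1, n - w + 1), roll_sd);
```

A visible level shift in this plot would support following up with a formal test for structural change in volatility.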
The PACF and ACF Functions
The ACF and PACF can be computed using the lagReport function provided in the GAUSS TSMT library.
/*
** PACF and ACF testing
*/
// Maximum number of autocorrelations
k = 12;
// Order of differencing
d = 0;
// Compute and plot the sample autocorrelations
{ acf_wpi, pacf_wpi } = lagreport(ch_wpi, k, d);
The lagReport function provides the ACF and PACF plots:
The ACF and PACF clearly show significant values.
There are a few notable patterns in the WPI ACF and PACF which can help choose preliminary orders for the ARIMA model:
The combination of the two features above is indicative of an AR(2) model. There are more general patterns in the combination of the ACF and PACF plots that can be used to identify the order of ARIMA models:
|            | ACF                                   | PACF                                  |
|------------|---------------------------------------|---------------------------------------|
| AR model   | Geometric decline.                    | Sharp loss of significance at p lags. |
| MA model   | Sharp loss of significance at q lags. | Geometric decline.                    |
| ARMA model | Geometric decline.                    | Geometric decline.                    |
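These patterns are easy to verify on simulated data. The sketch below generates an AR(2) series with arbitrary illustrative coefficients and passes it to lagReport, which should show the geometric ACF decline and sharp PACF cutoff at 2 lags described above:

```gauss
// Simulate an AR(2) process to illustrate the ACF/PACF patterns
// (coefficients 0.5 and 0.25 are arbitrary illustrative choices)
rndseed 4572;

nobs = 500;
y = zeros(nobs, 1);
err = rndn(nobs, 1);

for t(3, nobs, 1);
    y[t] = 0.5 * y[t-1] + 0.25 * y[t-2] + err[t];
endfor;

// ACF/PACF plots for the simulated series, no differencing
{ acf_sim, pacf_sim } = lagreport(y, 12, 0);
```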
Estimate the ARIMA Model
Despite the results of unit root tests, the ARIMA model will be estimated for demonstration. Based on the results of the PACF and ACF an ARIMA(2,1,0) model is estimated using the arimaFit procedure in GAUSS:
/*
** Estimate the arima model
*/
struct arimamtOut aOut;
aOut = arimaFit(wpi, 2, 1, 0);
This prints the output:
Log Likelihood:  146.366121     Number of Residuals:  123
AIC:             296.732242     Error Variance:       0.000120756
SBC:             302.356611     Standard Error:       0.010988925
DF:              121            SSE:                  0.014611534

            Coefficients   Std. Err.   T-Ratio   Approx. Prob.
AR[1,1]        0.45945      0.08766    5.24154      0.00000
AR[2,1]        0.26237      0.08796    2.98282      0.00346
The t-ratios and the low p-values on both the AR(1) and AR(2) coefficients support the use of the ARIMA(2,1,0) model for the WPI data.
Note that this time an arimamtOut structure was used to store the results from the model. The stored results include the residuals, which are used next for model diagnostics.
Model Diagnostics
Using the lagReport procedure on the residuals stored in aOut.e shows that most of the autocorrelation has been removed by the ARIMA(2, 1, 0) model:
// Compute and plot the sample autocorrelations
{ acf_wpi, pacf_wpi } = lagreport(aOut.e, k, d);
There is still significant autocorrelation at the 4th lag, which suggests that further exploration of a seasonal filter or seasonal model should be performed.
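A portmanteau check complements the visual inspection. The sketch below computes a Ljung-Box Q-statistic on the residuals by hand from the sample autocorrelations, using only base GAUSS functions; the choice of 8 lags and the degrees-of-freedom adjustment for the 2 fitted AR parameters are illustrative assumptions:

```gauss
// Ljung-Box Q-statistic on the ARIMA residuals,
// computed by hand from sample autocorrelations
e = aOut.e;
n = rows(e);
h = 8;                      // number of lags to test (illustrative)

// Demeaned residuals and sample autocorrelations rho_1 ... rho_h
ec = e - meanc(e);
rho = zeros(h, 1);
for j(1, h, 1);
    rho[j] = (ec[1:n-j]'ec[j+1:n]) / (ec'ec);
endfor;

// Q = n(n+2) * sum( rho_k^2 / (n - k) )
q = n * (n + 2) * sumc(rho .^ 2 ./ (n - seqa(1, 1, h)));

// Compare against a chi-squared critical value with h - 2
// degrees of freedom (2 fitted AR parameters)
print "Ljung-Box Q:" q;
print "Chi-squared 5% critical value:" cdfChii(0.95, h - 2);
```

A Q-statistic above the critical value would confirm that autocorrelation, such as the lag-4 spike, remains in the residuals.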
Congratulations! You should now have an indepth understanding of the fundamentals of time series analysis.
To learn more about performing time series analysis using GAUSS contact us for a GAUSS demo copy.