Dummy variables are a common econometric tool, whether working with time series, cross-sectional, or panel data. Unfortunately, raw datasets rarely come formatted with dummy variables that are regression ready.
In today's blog, we explore several options for creating dummy variables from categorical data in GAUSS, including:

- Formula strings with the factor and cat keywords.
- The design procedure.
- The dummy, dummybr, and dummydn procedures.

Dummy variables can be conveniently created from files at the time of loading data or calling procedures using formula string notation. Formula string notation is a powerful GAUSS tool that allows you to represent a model or collection of variables in a compact and intuitive manner, using the variable names in the dataset.
The factor Keyword

The factor keyword is used in formula strings to create dummy variables from a categorical variable, whether we are loading data or passing a dataset directly to a procedure such as olsmt, glm, or gmmFit.

Let's consider the model
$$mpg = \alpha + \beta_1 weight + \beta_2 rep78$$
We will use ordinary least squares to estimate this model with data from the auto2.dta file, which can be found in the GAUSSHOME/examples directory.

The variable rep78 is a categorical, 5-point variable that measures a car's repair record in 1978. To estimate the effects of the repair record on mpg, we can include dummy variables representing the different categories.
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Perform OLS estimation, creating dummy variables from 'rep78'
call olsmt(fname, "mpg ~ weight + factor(rep78)");
The printed output table includes coefficients for rep78 = fair, average, good, and excellent. Note that rep78 = poor is automatically excluded from the regression as the base level.
Variable          Estimate     Standard Error   t-value    Prob > |t|   Standardized Estimate   Cor with Dep Var
CONSTANT          38.0594      3.09336          12.3036    0.000        ---                     ---
weight            -0.00550304  0.000601001      -9.15645   0.000        -0.743741               -0.80552
rep78: Fair       0.478604     2.76503          0.173092   0.863        0.0263109               0.134619
rep78: Average    0.471562     2.55314          0.184699   0.854        0.0401403               0.279593
rep78: Good       0.599032     2.6066           0.229814   0.819        0.0451669               0.0384391
rep78: Excellent  2.08628      2.72482          0.765657   0.447        0.131139                0.454192
The cat Keyword

Some common file types, such as XLS and CSV, do not have a robust method of determining the variable types. In these cases, the cat keyword is used to tell GAUSS to treat a column as categorical data.

The cat keyword can be combined with the factor keyword to instruct GAUSS to load a column as string data, reclassify it to integers and then create dummy variables:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/yarn.xlsx";
// Reclassify 'load' variable from 'high, low, med'
// to '0, 1, 2', then create dummy variables from
// integer categories and create OLS estimates
call olsmt(fname, "cycles ~ factor(cat(load))");
Using factor(cat(load)) in the formula string tells GAUSS to create dummy variables representing the different categories of the load variable. This is seen in the printed output table, which now includes coefficients for load = low and load = med. Note that load = high is automatically excluded from the regression as the base level.
Variable    Estimate    Standard Error   t-value    Prob > |t|   Standardized Estimate   Cor with Dep Var
CONSTANT    534.444444  292.474662       1.827319   0.080        ---                     ---
load: low   621.555556  413.621634       1.502715   0.146        0.338504                0.240716
load: med   359.111111  413.621634       0.868212   0.394        0.195575                0.026323
The loadd Procedure

In our previous two examples, we used the factor and cat keywords directly in calls to estimation procedures. However, we can also use these keywords when loading data to create dummy variables in our data matrices.
For example, let's load the dummy variables associated with the rep78
variable:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load data, creating dummy variables from 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");
The reg_data matrix is a 74 x 6 matrix. It contains the mpg and weight data, as well as 4 columns of dummy variables for rep78 = fair, average, good, and excellent.
The first five rows look like this:
mpg   weight   rep78:fair   rep78:avg   rep78:good   rep78:exc
 22     2930            0           1            0           0
 17     3350            0           1            0           0
 22     2640            .           .            .           .
 20     3250            0           1            0           0
 15     4080            0           0            1           0
Note that, again, rep78=poor
is automatically excluded as the base level.
In the previous section, we looked at creating dummy variables at the time of loading data or running procedures. In this section, we consider how to create dummy variables from an existing GAUSS vector.
The GAUSS design
procedure provides a convenient method for creating dummy variables from a vector of discrete categories.
Let's load the data from the auto2.dta
dataset used in our earlier regression example. This time we won't load rep78
using factor
:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load auto data for regression
reg_data = loadd(fname, "mpg + weight + rep78");
// Remove missing values
reg_data = packr(reg_data);
The first five rows of reg_data
look like this:
22   2930   3
17   3350   3
20   3250   3
15   4080   4
18   3670   3
Our third column now contains discrete, categorical data with values ranging from 1-5, which represent poor, fair, average, good, and excellent.
The value labels in the auto2.dta file specify the preferred order for the string categories.

// Compute the unique values found
// in the third column of 'reg_data'
print unique(reg_data[., 3]);
1
2
3
4
5
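Before creating dummy variables, it can also be useful to see how many observations fall into each category. Here is a small sketch using the GAUSS counts procedure (this assumes reg_data from above is still in memory):

// Unique category values in the third column
vals = unique(reg_data[., 3]);

// Number of observations in each category
n_obs = counts(reg_data[., 3], vals);

// Print each category next to its count
print vals~n_obs;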
The design procedure creates a matrix with a column of indicator variables for each positive integer in the input. For example:
cats = { 1, 2, 1, 3 };
print design(cats);
will return:
1 0 0
0 1 0
1 0 0
0 0 1
Therefore, if we pass the third column of reg_data
to design
we will get a matrix with a column for all five categories. However, we want to drop the base case column for our regression.
$$mpg = \alpha + \beta_1 weight + \beta_2 rep78_{fair} + \beta_3 rep78_{avg} + \beta_4 rep78_{good} + \beta_5 rep78_{exc}$$
To do this, we shift the range of the categorical data from 1-5 to 0-4 by subtracting 1.
// Create dummy variables. Subtract one
// to remove the base case.
dummy_vars = design(reg_data[., 3] - 1);
This creates a 69x4 matrix, dummy_vars
, which contains dummy variables representing the final four levels of rep78
.
Now we can estimate our model as shown below.
// Select the 'mpg' data as the dependent variable
y = reg_data[., 1];
// Independent variables:
// 'weight' is in the second column of 'reg_data'.
// 'rep78'= Fair, Average, Good and Excellent
// are represented by the 4 columns
// of 'dummy_vars'.
x = reg_data[., 2]~dummy_vars;
// Estimate model using OLS
call olsmt("", y, x);
Our printed results are the same as earlier, except our table no longer includes variable names:
Variable   Estimate    Standard Error   t-value     Prob > |t|   Standardized Estimate   Cor with Dep Var
CONSTANT   38.059415   3.093361         12.303578   0.000        ---                     ---
X1         -0.005503   0.000601         -9.156447   0.000        -0.743741               -0.805520
X2         0.478604    2.765035         0.173092    0.863        0.026311                0.134619
X3         0.471562    2.553145         0.184699    0.854        0.040140                0.279593
X4         0.599032    2.606599         0.229814    0.819        0.045167                0.038439
X5         2.086276    2.724817         0.765657    0.447        0.131139                0.454192
The design
procedure works well when our data already contains categorical data. However, there may be cases when we want to create dummy variables based on ranges of continuous data. The GAUSS dummybr
, dummydn
, and dummy
procedures can be used to achieve this.
Consider a simple example:
x = { 1.53,
8.41,
3.81,
6.34,
0.03 };
// Breakpoints
v = { 1, 5, 7 };
All three procedures create a set of dummy (0/1) variables by breaking up a data vector into categories based on specified breakpoints. These procedures differ in how they treat boundary cases as shown below.
dummybr
Category boundaries: $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$
Number of dummies ($K$ breakpoints): $K$
Call: dm = dummybr(x, v);
$$dm = \begin{matrix} 0 & 1 & 0\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{matrix}$$

dummy
Category boundaries: $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$, $7 \lt x$
Number of dummies: $K + 1$
Call: dm = dummy(x, v);
$$dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix}$$

dummydn
Category boundaries: $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$, $7 \lt x$
Number of dummies: $K$ (the p-th column is dropped)
Call: p = 2; dm = dummydn(x, v, p);
$$dm = \begin{matrix} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{matrix}$$
Let's look a little closer at how these procedures work.
dummybr
When creating dummy variables with dummybr
:
dm = dummybr(x, v);
The code above produces three dummies based upon the breakpoints in the vector v
:
x <= 1
1 < x <= 5
5 < x <= 7
The matrix dm
contains:
     0 1 0          1.53
     0 0 0          8.41
dm = 0 1 0      x = 3.81
     0 0 1          6.34
     1 0 0          0.03
Notice that in this case, the second row of dm
does not contain a 1 because x = 8.41
does not fall into any of our specified categories.
dummy
Now, let's compare our results from dummybr
above to the dummy
procedure. When we use the dummy
procedure:
dm = dummy(x, v);
The code above produces four dummies based upon the breakpoints in the vector v
:
x <= 1
1 < x <= 5
5 < x <= 7
7 < x
The matrix dm
contains:
     0 1 0 0          1.53
     0 0 0 1          8.41
dm = 0 1 0 0      x = 3.81
     0 0 1 0          6.34
     1 0 0 0          0.03
These results vary from our previous example in two ways:

- The dummy procedure results in 4 columns of dummy variables. It adds a new column for the case where 7 < x.
- x = 8.41 now falls into the category 7 < x, so the second row contains a 1.

dummydn

Our final function is dummydn, which behaves just like dummy, except that the p-th column of the matrix of dummies is dropped. This is convenient for specifying a base case to ensure that these variables will not be collinear with a vector of ones.
// Column to drop
p = 2;
// Create matrix of dummy variables
dm_dn = dummydn(x, v, p);
The code above produces three dummies based upon the breakpoints in the vector v
:
x <= 1
1 < x <= 5   // Since p = 2, this column is dropped
5 < x <= 7
7 < x
The matrix dm_dn
contains:
     0 1 0 0             0 0 0          1.53
     0 0 0 1             0 0 1          8.41
dm = 0 1 0 0     dm_dn = 0 0 0      x = 3.81
     0 0 1 0             0 1 0          6.34
     1 0 0 0             1 0 0          0.03
Note that the matrix dm_dn
is the same as dm
except the second column has been removed.
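We can verify this relationship directly in GAUSS. The short sketch below (reusing x, v, and p from above) deletes the second column of the dummy output with delcols and compares it to the dummydn output:

// Remove column 'p' from the 'dummy' output
dm_check = delcols(dummy(x, v), p);

// Elementwise comparison; prints a matrix of
// ones if every element matches
print dm_check .== dummydn(x, v, p);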
Dummy variables are an important tool for data analysis whether we are working with time series data, cross-sectional data, or panel data. In today's blog, we have explored three GAUSS tools for generating dummy variables:

- Formula strings with the factor and cat keywords.
- The design procedure.
- The dummy, dummybr, and dummydn procedures.

In this blog we will explore how to set up and interpret cointegration results using a real-world time series example. We will cover the case with no structural breaks as well as the case with one unknown structural break using tools from the GAUSS tspdlib library.
In this blog, we will use the famous Nelson-Plosser time series data. The dataset contains macroeconomic fundamentals for the United States.
We will be using three of these fundamentals:

- The S&P 500 stock price index.
- The M2 money stock.
- Bond yields.

The time series data are annual, covering 1900-1970.
In order to prepare for cointegration testing, we will take some preliminary time series modeling steps. We will:

- Visualize the data.
- Test for unit roots, both with and without a structural break.
In this example, we will examine the macroeconomic question of whether stock prices are linked to macroeconomic indicators. In particular, we will examine if there is a cointegrated, long-run relationship between the S&P 500 price index and monetary policy indicators of the M2 money stock and bond yields.
Mathematically we will consider the cointegrated relationship:
$$y_{sp, t} = c + \beta_1 y_{money, t} + \beta_2y_{bond, t} + u_t$$
When visualizing time series data, we look for visual evidence of:

- Trending behavior.
- Structural breaks.

Our time series plots give us some important considerations for our testing, providing visual evidence to support:

- The possibility of unit roots in our data.
- A potential structural break in our data.
Prior to testing for cointegration between our time series data, we should check for unit roots in the data. We will do this using the adf procedure in the tspdlib library to conduct the Augmented Dickey-Fuller unit root test.
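In code, the unit root test for a single series looks something like the sketch below. The return list shown is an assumption for illustration; check the tspdlib documentation for the exact inputs and outputs of adf:

library tspdlib;

// Load the money stock series and drop missing values
y = packr(loadd("nelsonplosser.dta", "m"));

// Model with constant and trend
model = 2;

// ADF unit root test
// (output names assumed; see the tspdlib reference)
{ tstat, lags, cv } = adf(y, model);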
Variable     Test Statistic   1% Critical Value   5% Critical Value   10% Critical Value   Conclusion
Money        -1.621           -4.04               -3.45               -3.15                Cannot reject the null
Bond yield   -1.360           -4.04               -3.45               -3.15                Cannot reject the null
S&P 500      -0.3842          -4.04               -3.45               -3.15                Cannot reject the null
Our ADF test statistics are greater than the 10% critical value for all of our time series. This implies that we cannot reject the null hypothesis of a unit root for any of our time series data.
What about the potential structural break that we see in our time series data? Does this have an impact on our unit root testing?
Using the adf_1break
procedure in the tspdlib
library to test for unit roots with a single structural break in the trend and constant we get the following results.
Variable     Test Statistic   Break Date   1% Critical Value   5% Critical Value   10% Critical Value   Conclusion
Money        -4.844           1948         -5.57               -5.08               -4.82                Cannot reject the null
Bond yield   -3.226           1963         -5.57               -5.08               -4.82                Cannot reject the null
S&P 500      -4.639           1945         -5.57               -5.08               -4.82                Cannot reject the null
Our ADF test statistics again suggest that even when accounting for the structural break, we cannot reject the null hypothesis of a unit root for any of our time series data.
Having concluded that there is evidence for unit roots in our data, we can now run our cointegration tests.
When setting up cointegration tests, there are a number of assumptions that we must specify, including:

- The normalization of the cointegrating relationship.
- The deterministic components included in the model.
- The lags used to account for autocorrelation in the residual-based test.
To better understand these general assumptions, let's look at the simplest of our tests, the Engle-Granger cointegration test.

In the two-stage, residual-based cointegration tests which we will consider today, normalization amounts to deciding which variable is our dependent variable and which variables are our independent variables in the cointegration regression.
We will choose our normalization to reflect our theoretical question of whether the S&P 500 index is cointegrated with the money stock and the bond yield. As we mentioned earlier, this means we will consider the cointegrated relationship:
$$y_{sp, t} = c + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
// Set fname to name of dataset
fname = "nelsonplosser.dta";
// Load three variables from the dataset
// and remove rows with missing values
coint_data = packr(loadd(fname, "sp500 + m + bnd"));
// Define y and x matrix
y = coint_data[., 1];
x = coint_data[., 2 3];
The second assumption we must make about our Engle-Granger test is which model we wish to use. To understand how to make this decision, let's look closer at what this input means.
The Engle-Granger test is a two-step test:

- First, the cointegrating regression is estimated using OLS.
- Second, an ADF unit root test is run on the residuals of the first-stage regression.
When we specify which model to use we impact two things:

- The deterministic components (constant and/or trend) included in the first-stage cointegrating regression.
- The critical values used in the second-stage residual test.
There are three options to choose from:
No constant or trend (model = 0)
$$y_{sp, t} = \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
Constant (model = 1)
$$y_{sp, t} = \alpha + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
Constant and trend (model = 2)

$$y_{sp, t} = \alpha + \delta t + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$

For our example, we will include a constant and trend in our first-stage cointegration regression by setting:
// Select model with constant and trend
model = 2;
In the second-stage ADF residual unit root test, the error terms should be serially independent. To account for possible autocorrelation, lags of the first differences of the residual can be included in the ADF test regression.
The GAUSS coint_egranger procedure will automatically determine the optimal number of lags to include in the second-stage regression based on two user inputs:

- The maximum number of lags to consider [pmax].
- The information criterion used for lag selection [ic]: Akaike [ic = 1], Schwarz [ic = 2], or t-stat significance [ic = 3].
]/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=t-stat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
Now that we have loaded our data and chosen the test settings, we can call the coint_egranger
procedure:
// Perform EngleGranger Cointegration Test
{ tau_eg, cvADF_eg } = coint_egranger(y, x, model, pmax, ic);
In order to interpret our cointegration results, let's revisit the two steps of the Engle-Granger test:
The Engle-Granger test statistic for cointegration reduces to an ADF unit root test of the residuals of the cointegration regression.
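To make the first stage concrete, here is a sketch of the cointegrating regression done by hand for the constant-and-trend case (variable names are our own); the second stage applies an ADF-type test to u_hat, using Engle-Granger critical values rather than the standard ADF ones:

// Deterministic terms: constant and linear trend
n = rows(y);
Z = ones(n, 1)~seqa(1, 1, n)~x;

// First-stage cointegrating regression by least squares
b_hat = y / Z;

// Residuals to be tested for a unit root in stage two
u_hat = y - Z*b_hat;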
After running our cointegration test we obtain the following results:
Engle-Granger Test
Constant and Trend
H0: no cointegration (EG, 1987 & P0, 1990)

          Test Statistic    CV (1%, 5%, 10%)
EG_ADF    -2.105            -4.645  -4.157  -3.843
We can see that our test statistic of -2.105 is greater than even the 10% critical value of -3.843. This means that we cannot reject the null hypothesis of no cointegration.
Earlier we saw that the potential structural break in our data did not change our unit root test conclusion. We should also see if the structural break has an impact on our cointegration testing.
To do this we will use the Gregory-Hansen cointegration test, which can be implemented using the coint_ghansen procedure in the tspdlib library.
We can carry over all of our coint_egranger
testing specifications, except our model specification.
When implementing the Gregory-Hansen test, we must decide on a model which specifies which deterministic components are included in the cointegrating regression and which components are affected by the structural break.

There are four modeling options to choose from:

- Level shift [model = 1]. The structural break affects the constant.
- Level shift with trend [model = 2]. A trend is included, and the structural break affects the constant.
- Regime change [model = 3]. The structural break affects the constant and the regression coefficients.
- Regime change with trend [model = 4]. The structural break affects the constant, the regression coefficients, and the trend.
For example, let's consider the last case, where the constant, coefficients, and trend are all impacted by the structural break:
// Set fname to name of dataset
fname = "nelsonplosser.dta";
// Load three variables from the dataset
// and remove rows with missing values
coint_data = packr(loadd(fname, "sp500 + m + bnd"));
// Define y and x matrix
y = coint_data[., 1];
x = coint_data[., 2 3];
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=t-stat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// Model specification
// Regime and trend shift
model = 4;
// Perform cointegration test
{ ADF_min_gh, TBadf_gh, Zt_min_gh, TBzt_gh, Za_min_gh, TBza_gh, cvADFZt_gh, cvZa_gh } =
coint_ghansen(y, x, model, bwl, ic, pmax, varm, trimm);
The coint_ghansen procedure provides more extensive results than the coint_egranger test. In particular, the Gregory-Hansen test:

- Reports the $ADF$, $Z_t$, and $Z_{\alpha}$ test statistics, along with their critical values.
- Returns an estimated break date for each of the three test statistics.
Cointegration test results
After calling the coint_ghansen
procedure and testing all possible models, we obtain the following test statistic results:
Test                                       $ADF$ Stat   $Z_t$ Stat   $Z_{\alpha}$ Stat   10% CV ($ADF$, $Z_t$)   10% CV ($Z_{\alpha}$)   Conclusion
Gregory-Hansen, Level shift                -4.004       -3.819       -27.858             -4.690                  -42.490                 Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$.
Gregory-Hansen, Level shift with trend     -3.889       -3.751       -27.618             -5.030                  -48.94                  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$.
Gregory-Hansen, Regime change              -4.658       -4.539       -32.766             -5.23                   -52.85                  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$.
Gregory-Hansen, Regime change with trend   -5.834       -4.484       -32.411             -5.72                   -63.10                  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$.
As we can see from these results, there is no evidence that our S&P 500 Index is cointegrated with the money stock and bond yield.
Structural break results
The coint_ghansen
procedure also returns estimates for break dates based on the $ADF$, $Z_t$, and $Z_{\alpha}$ tests:
Test                                       $ADF$ Break Date   $Z_t$ Break Date   $Z_{\alpha}$ Break Date
Gregory-Hansen, Level shift                1958               1956               1956
Gregory-Hansen, Level shift with trend     1958               1956               1956
Gregory-Hansen, Regime change              1955               1955               1955
Gregory-Hansen, Regime change with trend   1951               1953               1947
The break date estimates from our Gregory-Hansen cointegration test should be interpreted with some care:
Note that while the Gregory-Hansen test does estimate break dates, it does not provide the statistical evidence to conclude whether these break dates are statistically significant or not.
Today's blog looks closer at the Engle-Granger and Gregory-Hansen residual-based cointegration tests. By building a better understanding of how the tests work and what assumptions we make when running the tests, you will be better equipped to interpret the test results.
In particular, today we learned how the assumptions behind these tests are specified, how to run the tests in GAUSS, and how to interpret their results.
You probably know that you can use GAUSS code to create repeatable styling for your graphs. However, you might not know that GAUSS also allows you to build reusable graphics profiles with a few clicks of your mouse.
We will show you how to clone existing graphics profiles and how to control:

- Fonts.
- Line colors and styles.
- Legend text.
While this tutorial will not cover all available options, you should learn enough to be comfortable making most other changes without additional help.
For this tutorial, our goal will be to create a graph profile to replicate the style of the graph below. Data for the graph can be downloaded from here.
This video tutorial will show you how to complete each of these steps, or you can read the text instructions below.
Using the graphics profile created in the previous step:
This will open your operating system's font dialog, where you can select your desired font, weight, and size:
We'll start by modifying the third line, which marks the date of the first local U.S. transmission.
Then repeat step 2 above to add the legend text for Series 1 and Series 2.
You can access the settings from a graph profile using the plotGetDefaults
function as you would for the default settings. Below is a simple example.
// Load variables
Y = loadd("cookcountycovid.csv", "cases + deaths");
x = seqa(1, 1, rows(Y));
// Declare myPlot to be a plotControl structure
// and fill with the settings from our covid profile
struct plotControl myPlot;
myPlot = plotGetDefaults("covid");
// Draw plot using covid profile settings
plotXY(myPlot, x, Y);
which will produce the plot below:
While the graphs for a given project may share many features, it is likely that some attributes such as the title or legend text will need to be different. You could clone your custom profile and make these changes. However, it might be simpler to make these changes in your code.
Fortunately, you can use the plotSet
functions to further customize the settings for a particular graph. You can modify any settings that you would like to change.
The code below shows how to change the title, keeping all other settings unchanged.
// Load variables
Y = loadd("cookcountycovid.csv", "cases + deaths");
x = seqa(1, 1, rows(Y));
// Declare myPlot to be a plotControl structure
// and fill with the settings from our covid profile
struct plotControl myPlot;
myPlot = plotGetDefaults("covid");
// Set title text
plotSetTitle(&myPlot, "COVID-19 Cook County");
// Draw plot using covid profile settings
plotXY(myPlot, x, Y);
These settings will not modify the custom graphics profile that you created. Next time you call myPlot = plotGetDefaults("covid")
, myPlot
will be filled with the custom profile settings that you selected in the Preferences dialog window.
Full code to create the time series plot shown at the top of this post, using the "covid" custom graphics profile we created, can be downloaded here.
Great job! You've learned how to interactively create custom graphics profiles that can be easily reused and further modified.
This post will show you how to create a simple output table with variable names, parameter estimates and standard errors as shown below, using sprintf
.
  population  0.6558 (0.1910)
poverty rate  0.9465 (0.2505)
   inflation  1.7366 (0.1101)
sprintf
, first introduced in GAUSS version 20, is a standard function in many languages. It provides powerful, compact statements that allow you to combine and style numeric and string data.
Unfortunately, some people find the sprintf
format string to be difficult to use and understand. Fortunately, today we are going to make it clear for you!
The GAUSS sprintf
function takes two or more inputs and returns a string array.
sprintf(fmt, a)
sprintf(fmt, a, ...)

- fmt: a sprintf format string (described later).
- a: the data to be formatted; one or more additional data inputs may follow a.
The sprintf
format string has many options that give you great control over the presentation of your data. Today we will show you how to use the three most important options which will cover most use cases.
The most basic format specifier starts with a percent sign (%) followed by a letter which is called the specifier character.
Specifier Character   Output       Example                   Result
d                     Integer      sprintf("%d", 3.1415)     3
e                     Scientific   sprintf("%e", 3.1415)     3.141500e+00
f                     Decimal      sprintf("%f", 3.1415)     3.141500
s                     String       sprintf("%s", "Pi")       Pi
sprintf format strings can contain more than one format specifier, and specifiers can be mixed with text. For example:

sprintf("%s is %d", "Steve", 38);

will print out:

Steve is 38

while sprintf("%sis%d", "Steve", 38) will return Steveis38.
Now we know enough to create the first iteration of our table.
// Create a 3x1 string array
var_names = "population" $| "poverty rate" $| "inflation";
// Create two 3x1 vectors
beta_ = { 0.65582591,
0.94648444,
1.7365981 };
se = { 0.19095146,
0.25047475,
0.11009904 };
sprintf("%s %f %f", var_names, beta_, se);
will print out
population 0.655826 0.190951
poverty rate 0.946484 0.250475
inflation 1.736598 0.110099
The overlapping columns make this table difficult to read.
To stop the columns from overlapping, we need to make the field width large enough to hold the widest element for each column. The field width is an integer placed immediately following the percent sign (%). For example:

Code                                     Output

sprintf("%8f %8f", 3.1415, 6.283)        3.141500 6.283000
sprintf("%12f %8f", 3.1415, 6.283)           3.141500 6.283000
sprintf("%12f%8f", 3.1415, 6.283)            3.1415006.283000
sprintf("%12f%9f", 3.1415, 6.283)            3.141500 6.283000
Now let's update our code to use a field width of 12 for the variable name column and 9 for each numeric column:
// Create 3x1 string array
var_names = "population" $| "poverty rate" $| "inflation";
// Create two 3x1 vectors
beta_ = { 0.65582591,
0.94648444,
1.7365981 };
se = { 0.19095146,
0.25047475,
0.11009904 };
sprintf("%12s %9f %9f", var_names, beta_, se);
will return:
  population  0.655826  0.190951
poverty rate  0.946484  0.250475
   inflation  1.736598  0.110099
This is much better. As we saw earlier, we can add any text to the format string that we'd like to see in the output. Let's add parentheses around the format specifier for the standard error column.
// Surround SE format specifier with parentheses
sprintf("%12s %9f (%9f)", var_names, beta_, se);
will return:
  population  0.655826 ( 0.190951)
poverty rate  0.946484 ( 0.250475)
   inflation  1.736598 ( 0.110099)
The parentheses in the above table make it easy for us to see the 9-character field width of the final column. If the standard error could be negative, we might want to leave it as it is to allow room for a possible minus sign. However, since we know it will be positive, let's decrease the field width of the final column to 8:
// Decrease field width of final column
// and add one space before the '('
sprintf("%12s %9f (%8f)", var_names, beta_, se);
will return:
  population  0.655826 (0.190951)
poverty rate  0.946484 (0.250475)
   inflation  1.736598 (0.110099)
Our table is really starting to look good, but to match the table from the introduction, we need to reduce the number of digits to the right of the decimal place to 4.
The sprintf
precision specifier is a dot or period followed by an integer. It controls the number of digits printed after the decimal point when using the floating-point f specifier character.
Code                                      Output

sprintf("%12.2f %8.3f", 3.1415, 6.283)            3.14    6.283
After setting the precision to 4:
// Decrease precision to 4 for both numeric columns
sprintf("%12s %9.4f (%8.4f)", var_names, beta_, se);
our output looks like this:
  population    0.6558 (  0.1910)
poverty rate    0.9465 (  0.2505)
   inflation    1.7366 (  0.1101)
We can see that the final column displays 6 characters, but the field width is set to 8 characters. Similarly, the second column needs 7 spaces (5 digits plus a dot and a possible minus sign), but we have given it 9.
Let's fix those:
// Decrease field width for numeric columns to remove extra spaces
sprintf("%12s %7.4f (%6.4f)", var_names, beta_, se);
We have reached our goal!
  population  0.6558 (0.1910)
poverty rate  0.9465 (0.2505)
   inflation  1.7366 (0.1101)
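As a finishing touch, we can print a header row that lines up with the body by reusing the same field widths with the %s specifier. This is a sketch, and the header labels are our own choice:

// Header row using the same field widths as the table body
print sprintf("%12s %7s %8s", "variable", "coef.", "(s.e.)");

// Table body
print sprintf("%12s %7.4f (%6.4f)", var_names, beta_, se);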
Great job! You have learned the basics you need to know to display your data with sprintf in GAUSS, including:

- How sprintf can print styled tables with columns of string and numeric data.
- The specifier characters, such as s and f.
- The field width.
- The precision, which applies to the floating-point f specifier character.

While there is much more that sprintf can do, if you can remember

"%[field width].[precision]f"

and

"%s"

you'll have a great foundation to expand on.
Anyone who owns GAUSS or has access to an institutional copy.
We are monitoring the situation. For the next 14 days, we will be offering 60-day temporary licenses.
Fill out this webform to receive your license.
During this trying time, we remain committed to supporting you and the amazing work that you do. Aptech is following expert guidelines to keep our staff safe and minimize disease spread while continuing to serve you.
Please reach out if there is anything we can do to help you.
Best regards,
The Aptech Team
Optional input arguments can make your statistical computing more efficient and enjoyable. GAUSS version 20 added a new suite of tools to make it easy for you to add optional input arguments to your GAUSS procedures.
In this blog post we will show you how to easily:
We’ll start by creating a simple toy GAUSS procedure that does not use optional arguments.
// Run procedure
myProc(1.1, 2.2, 3.3);
// Define procedure
proc (0) = myProc(a, b, c);
print a;
print b;
print c;
endp;
If we run the code above, we will get the following output:
1.1
2.2
3.3
To modify this procedure to accept optional arguments, we need to make the following changes:

- Replace the optional inputs in the procedure declaration with triple dots (...).
- Retrieve the optional inputs inside the procedure body.
The procedure declaration is the first line of the procedure. The procedure declaration from our initial procedure was:
// Procedure declaration
proc (0) = myProc(a, b, c);
Now we will modify our original procedure to make a
the only required input. The new procedure declaration will look like this:
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
In the procedure declaration above, a is a required input as it was before. However, we have replaced b and c with ... . The triple dots represent the optional arguments. The triple dots, ...:

- Must come after any required inputs in the procedure declaration.
- Can represent zero or more optional arguments.
The GAUSS function dynargsCount
returns the number of optional arguments passed into the procedure. dynargsCount
may only be called inside of a GAUSS procedure and does not take any inputs. For example:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
local n_dynargs;
// Notice that you do not pass ‘...’
// into ‘dynargsCount’
n_dynargs = dynargsCount();
print "Number of optional inputs = " n_dynargs;
print "";
endp;
Will print the following output:
call 1
Number of optional inputs = 0.0

call 2
Number of optional inputs = 1.0

call 3
Number of optional inputs = 2.0
This allows you to add logic which depends on the number of optional arguments that were passed into your procedure.
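For instance, a procedure can branch on dynargsCount to decide whether an optional input was supplied. A minimal sketch (the procedure name and default value are our own):

x1 = addDefault(5);       // uses the default increment of 1
x2 = addDefault(5, 10);   // uses the supplied increment

proc (1) = addDefault(a, ...);
    local b;

    // Use the optional input if it was passed in;
    // otherwise fall back to a default of 1
    if dynargsCount() == 1;
        b = dynargsGet(1);
    else;
        b = 1;
    endif;

    retp(a + b);
endp;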
You can access optional arguments in GAUSS procedures by:

- Using the dynargsGet function.
- Passing the triple dots, ..., directly to another function.
, to another function.The dynargsGet
procedure returns specified optional arguments. It requires one input that specifies the index, or index range of optional arguments to return. For example:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
// Since ‘b’ and ‘c’ are not declared as inputs
// in the procedure declaration, they must be
// declared as local variables
local b, c, idx;
// The range of optional inputs we want to return
idx = { 1, 2 };
{ b, c } = dynargsGet(idx);
print "a = " a;
print "b = " b;
print "c = " c;
print "";
endp;
This time, the code will print out:
call 1
a = 9.00
b = {}
c = {}

call 2
a = 9.00
b = 3.14
c = {}

call 3
a = 9.00
b = 3.14
c = 2.54
As we see above, a is passed through as we would expect each time. In "call 1", neither b nor c was passed into myProc. We can see that GAUSS printed their contents as empty curly braces, {}. This indicates that they were empty matrices.
If an optional input argument requested by dynargsGet
is not supplied, by default an empty matrix will be returned in its place. We can check for empty matrices with the GAUSS function, isempty
. For example:
// Create an empty matrix
z = {};
// Check to see if `z` is an empty matrix
if isempty(z);
print "z is an empty matrix";
else;
print "z is not an empty matrix";
endif;
As you may have guessed, the above code will print:
z is an empty matrix
While we can use isempty and if statements to handle the optional inputs, it is often simpler to set default values. Fortunately, dynargsGet allows us to specify default values for the optional arguments which are not supplied. To specify default values with dynargsGet, simply add them as a comma-separated list of inputs after the required index input.
There are a few things to remember when specifying default values:
We can set default values for our toy example, like this:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
// Since 'b' and 'c' are not declared as inputs
// in the procedure declaration, they must be
// declared as local variables
local b, c, idx;
// The range of optional inputs we want to return
idx = { 1, 2 };
// If 'b' or 'c' are not passed in, return
// 29 or 33 respectively
{ b, c } = dynargsGet(idx, 29, 33);
print "a = " a;
print "b = " b;
print "c = " c;
print "";
endp;
This time our code will return:
call 1
a = 9.00
b = 29.00
c = 33.00

call 2
a = 9.00
b = 3.14
c = 33.00

call 3
a = 9.00
b = 3.14
c = 2.54
While any input arguments accessed by dynargsGet can be passed to other procedures, it is important to know that the triple dots, ..., can also be passed directly into other procedures.

Passing the triple dots, ..., to another procedure is just as if you passed a comma-separated list of the arguments in ....

Note that the triple dots, ..., do not have to be the final input when passed to other procedures. For example:
X_1 = initMat1(3.14, 5, 2);
X_2 = initMat2(3.14, 5, 2);
// Procedure declaration with only optional inputs
proc (1) = initMat1(val, ...);
local X;
X = val + zeros(...);
retp(X);
endp;
// Procedure declaration with only required inputs
proc (1) = initMat2(val, r, c);
local X;
X = val + zeros(r, c);
retp(X);
endp;
The calls to initMat1 and initMat2 will behave the same.
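For comparison, the same pass-through pattern exists in Python, where forwarding `*args` plays the role of the GAUSS triple dots. This is only an illustrative analogue; the `init_mat` names are ours, not GAUSS or Python built-ins:

```python
def init_mat2(val, r, c):
    # Required-input version: build an r x c matrix filled with val
    return [[val] * c for _ in range(r)]

def init_mat1(val, *dims):
    # Forward the variable arguments directly,
    # just as '...' is forwarded in GAUSS
    return init_mat2(val, *dims)

# Both calls produce the same 2 x 3 matrix
print(init_mat1(3.14, 2, 3) == init_mat2(3.14, 2, 3))  # True
```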
Now that we’ve seen the basics of using optional inputs to GAUSS procedures, let’s work through a more practical example. We’ll create two procedures that will work together to simulate a linear model.
Our first procedure will create a random normal matrix, with the option to specify the mean and standard deviation of the variables.
/*
** Inputs: r  - Required input. Number of rows.
**         c  - Required input. Number of columns.
**         mu - Optional input. Mean of the simulated columns.
**              Default = 0.
**         sd - Optional input. Standard deviation of the
**              simulated columns. Default = 1.
*/
proc (1) = rndNormal(r, c, ...);
local X, mean, sd;
// If a 3rd input is passed in, assign it
// to 'mean'. Otherwise set 'mean' equal to 0.
// If a 4th input is passed in, assign it
// to 'sd'. Otherwise set 'sd' equal to 1.
{ mean, sd } = dynargsGet(1|2, 0, 1);
// Compute a random matrix with the
// specified mean and sd.
X = (rndn(r, c) .* sd) + mean;
retp(X);
endp;
Here are a few examples of how we could call this procedure:
// Create a 100x4 random normal matrix
// with mean=0 and sd=1
X_1 = rndNormal(100, 4);
// Create a 130x2 random normal matrix
// with mean=3.7 and sd=1
X_2 = rndNormal(130, 2, 3.7);
// Create a 74x5 random normal matrix
// with mean=2 and sd=3
X_3 = rndNormal(74, 5, 2, 3);
Next, we'll create a procedure that will use our rndNormal procedure to simulate linear models.
/*
** Inputs: b_true     - Required input. True parameter values of
**                      the simulated linear model.
**         nobs       - Required input. The number of observations
**                      to simulate.
**         alpha_true - Optional input. True intercept of the
**                      simulated linear model. Default = 0.
**         err_sd     - Optional input. Standard deviation of the
**                      error term. Default = 1.
*/
proc (1) = simulateLM(b_true, nobs, ...);
local X, err, nvars, alpha_plus_err, y;
// Find number of desired variables
nvars = rows(b_true);
// Simulate variables with
// mean = 0, sd = 1
X = rndNormal(nobs, nvars);
// Simulate error term and add intercept
// if it was passed in.
alpha_plus_err = rndNormal(nobs, 1, ...);
y = X * b_true + alpha_plus_err;
retp(y);
endp;
Here are a few examples of how our simulateLM procedure can be used:
b = { 0.8, 1.1, 0.2 };
n = 10;
/*
** Simulate linear model:
**  - Without intercept
**  - With error term sd equal to 1
*/
y_1 = simulateLM(b, n);
/*
** Simulate linear model:
**  - With intercept
**  - With error term sd equal to 1
*/
alpha = 2.3;
y_2 = simulateLM(b, n, alpha);
/*
** Simulate linear model:
**  - With intercept
**  - With error term sd equal to 3
*/
alpha = 2.3;
err_sd = 3;
y_3 = simulateLM(b, n, alpha, err_sd);
Congratulations! You should now be able to start creating GAUSS procedures with optional input arguments. You have learned how to:

- Declare optional inputs by adding the triple dots, ..., to the procedure declaration.
- Count optional inputs with dynargsCount.
- Access optional inputs with dynargsGet.

Cointegration is an important tool for modeling the long-run relationships in time series data. If you work with time series data, you will likely find yourself needing to use cointegration at some point.
This blog provides an in-depth introduction to cointegration and will cover all the nuts and bolts you need to get started. In particular, we will look at:
Though not necessary, you may find it helpful to review the blogs on time series modeling and unit root testing before continuing with this blog.
Economic theory suggests that many time series datasets will move together, fluctuating around a long-run equilibrium. In econometrics and statistics, this long-run equilibrium is tested and measured using the concept of cointegration.
Cointegration occurs when two or more nonstationary time series share a common stochastic trend, so that some linear combination of the series is stationary.
| Field | Supporting Theory | Time Series |
|---|---|---|
| Economics | The permanent income hypothesis describes how agents spread their consumption out over their lifetime based on their expected income. | Consumption and income. |
| Economics | Purchasing power parity is a theory that relates the prices of a basket of goods across different countries. | Nominal exchange rates and domestic and foreign prices. |
| Finance | The present value model of stock prices implies a long-run relationship between stock prices and their dividends or earnings. | Stock prices and stock dividends/earnings. |
| Epidemiology | Joint mortality models imply a long-run relationship between mortality rates across different demographics. | Male and female mortality rates. |
| Medicine | Time series methodologies have been used to examine comorbidities of different types of cancers and trends in medical welfare. | Occurrence rates of different types of cancer. |
To understand the mathematics of cointegration, let's consider a group of time series, $Y_t$, which is composed of three separate time series:
$$y_1 = (y_{11}, y_{12}, \ldots, y_{1t})$$ $$y_2 = (y_{21}, y_{22}, \ldots, y_{2t})$$ $$y_3 = (y_{31}, y_{32}, ..., y_{3t})$$
All three series are nonstationary time series.
Cointegration implies that while $y_1$, $y_2$, and $y_3$ are independently nonstationary, they can be combined in a way that their linear combination is stationary:
$$\beta Y_t = \beta_1 y_{1t} + \beta_2 y_{2t} + \beta_3 y_{3t} \sim I(0)$$
The Cointegrating Vector
In the context of cointegration, $\beta$ is commonly known as the cointegrating vector.

Normalization

Because there can be multiple cointegrating vectors that fit the same economic model, we must impose identification restrictions to normalize the cointegrating vector for estimation.
A common normalization of the cointegrating vector is to set $\beta = (1, -\beta_2, \ldots, -\beta_N)$. For example, applying these restrictions to our earlier system yields

$$\beta Y_t = y_{1t} - \beta_2 y_{2t} - \beta_3 y_{3t} \sim I(0)$$
Part of the appeal of this normalization is that it can be rewritten in a standard regression form
$$y_{1t} = \beta_2 y_{2t} + \beta_3 y_{3t} + u_t$$
where $u_t$ is a stationary cointegrating error component. Intuitively, $u_t$ can be thought of as short-term deviations from the long-run equilibrium.
While the regression format is a common normalization, it is important to remember that economic theory should inform our identifying restrictions.
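To make the idea concrete, here is a small Python sketch (an illustration, not GAUSS code) that simulates two series sharing a stochastic trend and recovers the normalized cointegrating coefficient by OLS. All names and parameter values here are our own choices for the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Shared I(1) stochastic trend: a pure random walk
trend = np.cumsum(rng.normal(size=T))

# Two nonstationary series driven by the same trend
y2 = trend + rng.normal(size=T)
y1 = 0.5 * trend + rng.normal(size=T)

# OLS of y1 on y2 estimates the normalized cointegrating coefficient
X = np.column_stack([np.ones(T), y2])
beta = np.linalg.lstsq(X, y1, rcond=None)[0]

# The cointegrating residual is stationary: its variance is small
# relative to the variance of the trending series
u = y1 - X @ beta
print("estimated beta_2:", beta[1])
print("residual variance:", u.var(), "series variance:", y1.var())
```

Even though each series wanders without bound, the residual from the cointegrating regression hovers around zero, which is exactly the stationarity property that cointegration tests look for.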
Cointegration implies that the time series will be connected through an error correction model (ECM). The error correction model is important in time series analysis because it allows us to better understand long-run dynamics. Additionally, failing to properly model cointegrated variables can result in biased estimates.
Let's assume that there is a bivariate cointegrated system with $Y_t = (y_{1t}, y_{2t})$ and a cointegrating vector $\beta = (1, -\beta_2)$ such that

$$\beta Y_{t} = y_{1t} - \beta_2 y_{2t}$$
The error correction model depicts the dynamics of a variable as a function of the deviations from the long-run equilibrium

$$\Delta y_{1t} = c_1 + \alpha_1 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j} + \epsilon_{1t}$$ $$\Delta y_{2t} = c_2 + \alpha_2 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{21} \Delta y_{1,t-j} + \sum_j \psi^j_{22} \Delta y_{2,t-j} + \epsilon_{2t}$$
| Term | Description | Intuition |
|---|---|---|
| $y_{1,t-1} - \beta_2 y_{2,t-1}$ | Cointegrated long-run equilibrium | Because this is an equilibrium relationship, it plays a role in the dynamic paths of both $y_{1t}$ and $y_{2t}$. |
| $\alpha_1$, $\alpha_2$ | Adjustment coefficients | Capture the reactions of $y_{1t}$ and $y_{2t}$ to disequilibrium. |
| $\sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j}$ | Autoregressive distributed lags | Captures additional dynamics. |
If the cointegrating vector has been previously estimated, then standard OLS or dynamic OLS (DOLS) can be used to estimate the error correction relationship. More generally, the ECM relationship can be estimated using OLS, seemingly unrelated regressions (SUR), or maximum likelihood estimation.
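As a rough illustration of the two-step idea in Python (variable names and parameter values are our own, not from any particular library), we can simulate a cointegrated pair and estimate the ECM adjustment coefficient by OLS:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 500

y2 = np.cumsum(rng.normal(size=T))      # I(1) driver series
u = np.zeros(T)
for t in range(1, T):                   # stationary AR(1) error, phi = 0.7
    u[t] = 0.7 * u[t-1] + rng.normal()
y1 = 2.0 + 1.0 * y2 + u                 # long-run relation with beta_2 = 1

# Step 1: estimate the cointegrating regression and form residuals
X = np.column_stack([np.ones(T), y2])
b = np.linalg.lstsq(X, y1, rcond=None)[0]
uhat = y1 - X @ b

# Step 2: regress dy1 on the lagged residual and one lag of each difference
dy1, dy2 = np.diff(y1), np.diff(y2)
Z = np.column_stack([np.ones(T - 2), uhat[1:-1], dy1[:-1], dy2[:-1]])
g = np.linalg.lstsq(Z, dy1[1:], rcond=None)[0]

# A negative adjustment coefficient pulls y1 back toward equilibrium
print("adjustment coefficient:", g[1])
```

The estimated adjustment coefficient comes out negative, reflecting the error correction mechanism: when $y_1$ is above its long-run relationship with $y_2$, it tends to fall back toward it in the next period.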
The vector error correction model (VECM) is the multivariate extension of the ECM. If we are working in a vector autoregressive context, cointegration implies a VECM such that
$$\Delta Y_t = \Phi D_t + \Pi Y_{t1} + \Gamma_1 \Delta Y_{t1} + \cdots + \Gamma_{p1} \Delta Y_{tp+1} + \epsilon_t$$
Like the ECM, the VECM parameters reflect the long-run and short-run dynamics of the system, as shown in the table below:
| Term | Description | Intuition |
|---|---|---|
| $\Pi$ | Long-run impact matrix. | $\Pi = \Pi_1 + \Pi_2 + \cdots + \Pi_p - I_n$ captures adjustments towards the long-run equilibrium and contains the cointegrating relationships. |
| $\Gamma_k$ | Short-run impact matrix. | The short-run impact matrix is constructed from $-\sum_{j=k+1}^p \Pi_j$ and captures short-run deviations from the equilibrium. |
| $D_t$ | Deterministic terms. | These terms take the form $D_t = u_0 + u_1 t$, where $u_0$ is the constant component and $u_1 t$ is the trend component. |
The VECM model can be estimated using the Johansen method.
Before jumping directly to cointegration testing, there are a number of other time series modeling steps that we should consider first.
One of the key considerations prior to testing for cointegration is whether there is theoretical support for the cointegrating relationship. It is important to remember that cointegration occurs when separate time series share an underlying stochastic trend. The idea of a shared trend should be supported by economic theory.

As an example, consider growth theory, which suggests that productivity is a key driver of economic growth. As such, productivity acts as the common trend driving the co-movements of many indicators of economic growth. Hence, this theory implies that consumption, investment, and income are all cointegrated.
One of the first steps in time series modeling should be data visualization. Time series plots provide good preliminary insights into the behavior of time series data.
We've established that cointegration occurs between nonstationary, I(1), time series. This implies that before testing for or estimating a cointegrating relationship, we should perform unit root testing.
Our previous blog, "How to Conduct Unit Root Testing in GAUSS", provides an indepth look at how to perform unit root testing in GAUSS.
GAUSS tools for performing unit root tests are available in a number of libraries, including Time Series MT (TSMT), the open-source TSPDLIB, and the coint library. All of these can be located and installed directly using the GAUSS package manager.
Full example programs for testing for unit roots using TSMT procedures and TSPDLIB procedures are available on our Aptech GitHub page.
| Panel Data Unit Root Test | TSMT procedure | TSPDLIB procedure |
|---|---|---|
| Hadri | | hadri |
| Im, Pesaran, and Shin | | ips |
| Levin, Lin, and Chu | | llc |
| Schmidt and Perron LM test | | lm |
| Breitung and Das | | breitung |
| Cross-sectionally augmented IPS test (CIPS) | | cips |
| Panel analysis of nonstationarity in idiosyncratic and common components (PANIC) | | bng_panic |
| Time Series Unit Root Test | TSMT procedure | TSPDLIB procedure |
|---|---|---|
| Augmented Dickey-Fuller | vmadfmt | adf |
| Phillips-Perron | vmppmt | pp |
| KPSS | kpss | lmkpss |
| Schmidt and Perron LM test | | lm |
| GLS-ADF | dfgls | dfgls |
| Quantile ADF | | qr_adf |
A complete time series analysis should consider the possibility that structural breaks have occurred. If structural breaks are present, standard tests for cointegration are invalid. Therefore, it is important to test for structural breaks before testing for cointegration.

The GAUSS sbreak procedure, available in TSMT, is an easy-to-use tool for identifying multiple, unknown structural breaks.
In order to test for cointegration, we must test whether a long-run equilibrium exists for a group of data. A number of questions need to be considered, including whether the cointegrating vector is known or must be estimated, and how many cointegrating relationships may exist. In this section, we will show how these questions guide cointegration testing without structural breaks.
The EngleGranger cointegration test considers the case that there is a single cointegrating vector. The test follows the very simple intuition that if variables are cointegrated, then the residual of the cointegrating regression should be stationary.
Forming the cointegrating residual
How to form the cointegrating residual depends on whether the cointegrating vector is known or must be estimated:

If the cointegrating vector is known, the cointegrating residuals are computed directly as $u_t = \beta Y_t$ and tested for stationarity.

If the cointegrating vector is unknown, OLS is first used to estimate the normalized cointegrating vector from the regression $$y_{1t} = c + \beta y_{2t} + u_{t}$$ and the estimated residuals are then tested for stationarity.
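A minimal Python sketch of this residual-based logic follows. It is illustrative only: the names are ours, and a real application would compare the statistic against Engle-Granger critical values rather than the standard Dickey-Fuller tables:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 400

# Simulate a cointegrated pair: y1 = c + beta*y2 + u, with u stationary AR(1)
y2 = np.cumsum(rng.normal(size=T))
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t-1] + rng.normal()
y1 = 1.0 + 2.0 * y2 + u

# Step 1: OLS cointegrating regression, save residuals
X = np.column_stack([np.ones(T), y2])
uhat = y1 - X @ np.linalg.lstsq(X, y1, rcond=None)[0]

# Step 2: Dickey-Fuller regression on the residuals (no constant):
#   delta u_t = rho * u_{t-1} + e_t
du = np.diff(uhat)
ulag = uhat[:-1]
rho = (ulag @ du) / (ulag @ ulag)
resid = du - rho * ulag
se = np.sqrt(resid @ resid / (len(du) - 1) / (ulag @ ulag))
t_stat = rho / se
print("DF t-statistic on residuals:", t_stat)
```

A strongly negative statistic indicates the residuals are stationary, which is the evidence of cointegration the Engle-Granger test looks for.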
There are two Johansen cointegrating tests for the VECM context, the trace test and the maximal eigenvalue test. These tests hinge on the intuition that in the VECM, the rank of the longrun impact matrix, $\Pi$, determines if the VAR(p) variables are cointegrated.
Since the rank of the long-run impact matrix equals the number of cointegrating relationships, testing for cointegration amounts to testing the rank of $\Pi$. The Johansen testing process does this by sequentially testing hypotheses about the number of cointegrating vectors.
The Johansen Trace Statistics
The Johansen trace statistic sequentially tests the null hypothesis that the number of cointegrating vectors is $k = m$ against the alternative that $k > m$.
| Stage | Null Hypothesis | Alternative | Conclusion |
|---|---|---|---|
| One | $H_0: k = 0$ | $H_A: k > 0$ | If $H_0$ cannot be rejected, stop testing and conclude $k = 0$. If the null is rejected, perform the next test. |
| Two | $H_0: k \leq 1$ | $H_A: k > 1$ | If $H_0$ cannot be rejected, stop testing and conclude $k \leq 1$. If the null is rejected, perform the next test. |
| Three | $H_0: k \leq 2$ | $H_A: k > 2$ | If $H_0$ cannot be rejected, stop testing and conclude $k \leq 2$. If the null is rejected, perform the next test. |
| m-1 | $H_0: k \leq m-1$ | $H_A: k > m-1$ | If $H_0$ cannot be rejected, stop testing and conclude $k \leq m-1$. |
The test statistic follows a nonstandard distribution and depends on the dimension and the specified deterministic trend.
The Johansen Maximum Eigenvalue Statistic

The maximal eigenvalue statistic tests the null hypothesis that there are $r$ cointegrating vectors against the alternative that there are $r + 1$.
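For reference, both statistics are built from the estimated eigenvalues $\hat{\lambda}_1 > \hat{\lambda}_2 > \cdots > \hat{\lambda}_n$ of the long-run impact matrix:

$$LR_{trace}(r) = -T \sum_{i=r+1}^{n} \ln(1 - \hat{\lambda}_i)$$

$$LR_{max}(r) = -T \ln(1 - \hat{\lambda}_{r+1})$$

The trace statistic sums over all eigenvalues beyond rank $r$, while the maximum eigenvalue statistic uses only the next eigenvalue, which is why the two tests can occasionally disagree.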
In the case that there are structural breaks in the cointegrating relationship, the cointegration tests in the previous section should not be used. In this section, we look at three tests for cointegration with structural breaks:
The Gregory and Hansen (1996) cointegration test is a residualbased cointegration test which tests the null hypothesis of no cointegration against the alternative of cointegration in the presence of a single regime shift.
The Gregory and Hansen (1996) test:
Because the structural break date is unknown, the test computes the cointegration test statistic for each possible breakpoint, and the smallest test statistics are used.
Gregory and Hansen (1996) suggest running their tests in combination with the standard cointegration tests.
The HatemiJ cointegration test is an extension of the Gregory and Hansen cointegration test. It allows for two possible structural breaks with unknown timing.
The Maki cointegration test builds on the Gregory and Hansen and the HatemiJ cointegration tests to allow for an unknown number of structural breaks.
GAUSS tools for performing cointegration tests and estimating VECM models are available in a number of libraries, including the Time Series MT (TSMT) library, TSPDLIB, and the coint libraries. All of these can be directly located and installed using the GAUSS package manager.
| Cointegration test | Null Hypothesis | Decision Rule | GAUSS library |
|---|---|---|---|
| Engle-Granger (ADF) | No cointegration. | Reject the null hypothesis if the $ADF$ test statistic is less than the critical value. | TSMT, tspdlib, coint |
| Phillips | No cointegration. | Reject the null hypothesis if the $Z$ test statistic is less than the critical value. | coint, tspdlib |
| Stock and Watson common trend | $Y$ is a non-cointegrated system after allowing for the $p$th order polynomial common trend. | Reject the null hypothesis if the $SW$ test statistic is less than the critical value. | coint |
| Phillips and Ouliaris | $Y$ and $X$ are not cointegrated. | Reject the null hypothesis if the $P_u$ or $P_z$ statistic is greater than the critical value. | coint, tspdlib |
| Johansen trace | Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is greater than $r$. | Reject the null hypothesis if the trace statistic is greater than the critical value. | TSMT, coint |
| Johansen maximum eigenvalue | Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is equal to $r+1$. | Reject the null hypothesis if the maximum eigenvalue statistic is greater than the critical value. | TSMT, coint |
| Gregory and Hansen | No cointegration against the alternative of cointegration with one structural break. | Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ is less than the critical value. | tspdlib |
| Hatemi-J | No cointegration against the alternative of cointegration with two structural breaks. | Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ is less than the critical value. | tspdlib |
| Shin test | Cointegration. | Reject the null hypothesis if the test statistic is greater than the critical value. | tspdlib |
In this section, we will test for cointegration between monthly gold and silver prices, using historic monthly price data starting in 1915. Specifically, we will work through several stages of analysis:
As a first step, we will create a time series graph of our data. This allows us to visually examine the deterministic trends in our data.
From our graphs, we can draw some preliminary conclusions about the dynamics of gold and silver prices over our time period:
Before testing whether silver and gold prices are cointegrated, we should test whether the series have unit roots. We can do this using the unit root tests available in the TSMT and TSPDLIB libraries.
Gold monthly closing prices (2015-2020)

| Time Series Unit Root Test | Test Statistic | Conclusion |
|---|---|---|
| Augmented Dickey-Fuller | -1.151 | Cannot reject the null hypothesis of a unit root. |
| Phillips-Perron | -1.312 | Cannot reject the null hypothesis of a unit root. |
| KPSS | 2.102 | Reject the null hypothesis of stationarity at the 1% level. |
| Schmidt and Perron LM test | -2.399 | Cannot reject the null hypothesis of a unit root. |
| GLS-ADF | -0.980 | Cannot reject the null hypothesis of a unit root. |
Silver monthly closing prices (2015-2020)

| Time Series Unit Root Test | Test Statistic | Conclusion |
|---|---|---|
| Augmented Dickey-Fuller | -5.121 | Reject the null hypothesis of a unit root at the 1% level. |
| Phillips-Perron | -5.446 | Reject the null hypothesis of a unit root at the 1% level. |
| KPSS | 0.856 | Reject the null hypothesis of stationarity at the 1% level. |
| Schmidt and Perron LM test | -4.729 | Reject the null hypothesis of a unit root at the 1% level. |
| GLS-ADF | -4.895 | Reject the null hypothesis of a unit root at the 1% level. |
These results provide evidence that gold prices are nonstationary but suggest that the silver prices are stationary. At this point, we would not likely proceed with cointegration testing or we may wish to perform additional unit root testing. For example, we may want to perform unit root tests that allow for structural breaks.
The GAUSS code for the tests in this section is available here.
Now, let's test for cointegration without structural breaks using two different tests, the Johansen tests and the EngleGranger test.
The Johansen Tests
We will use the vmsjmt procedure from the TSMT library. This procedure should be used with the vmc_sjamt and vmc_sjtmt procedures, which find the critical values for the Maximum Eigenvalue and Trace statistics, respectively.

The vmsjmt procedure requires four inputs, shown in the code below. It returns both the Johansen Trace and the Johansen Maximum Eigenvalue statistics, along with the associated eigenvalues and eigenvectors.
new;
// Load tsmt library
library tsmt;
// Set filename (with path) for loading
fname2 = __FILE_DIR $+ "commodity_mon.dat";
// Load real prices data
y_test_real = loadd(fname2, "P_gold_real + P_silver_real");
// No deterministic component in
// the fitted regression
p = -1;
// Set number of lagged differences
// for computing estimator
k = 2;
// No determinant
no_det = 0;
{ ev, evec, trace_stat, max_ev } = vmsjmt(y_test_real, p, k, no_det);
cv_lr2 = vmc_sjamt(cols(y_test_real), p);
cv_lr1 = vmc_sjtmt(cols(y_test_real), p);
Both trace_stat and max_ev will contain statistics for all possible ranks of $\Pi$. For example, since we are testing for cointegration of just two time series, there will be at most one cointegrating vector. This means trace_stat and max_ev will be 2 x 1 matrices, testing both null hypotheses that $r=0$ and $r=1$.
| Test | Test Statistic | 10% Critical Value | Conclusion |
|---|---|---|---|
| Johansen Trace Statistic | $H_0: r=1$: 56.707; $H_0: r=0$: 0.0767 | 10.46 | Cannot reject the null hypothesis that $r=0$. |
| Johansen Maximum Eigenvalue | $H_0: r=1$: 56.631; $H_0: r=0$: 0.0766 | 9.39 | Cannot reject the null hypothesis that $r=0$. |
These results indicate that there is no cointegration between monthly gold and silver prices. This should not be a surprise, given the results of our unit root testing.

Note that vmsjmt reports statistics for all possible ranks. However, the Johansen tests are sequential tests, and since we cannot reject the null that $k=0$ for either test, we technically do not need to continue testing for any higher rank.

The Engle-Granger Test

Since there is only one possible cointegrating vector for this system, we could have also used the Engle-Granger test for cointegration. This test can be implemented using the coint_egranger procedure from the TSPDLIB library.
The coint_egranger procedure requires five inputs, shown in the code below, and returns the test statistic along with the 1%, 5%, and 10% critical values.
Using the data already loaded in the previous example:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// No constant or trend
model = 0;
{ tau0, cvADF0 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant
model = 1;
{ tau1, cvADF1 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant and trend
model = 2;
{ tau2, cvADF2 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
| Test | Test Statistic | 10% Critical Value | Conclusion |
|---|---|---|---|
| Engle-Granger, no constant | -3.094 | -2.450 | Reject the null of no cointegration at the 10% level. |
| Engle-Granger, constant | -1.609 | -3.066 | Cannot reject the null hypothesis of no cointegration. |
| Engle-Granger, constant and trend | -2.327 | -3.518 | Cannot reject the null hypothesis of no cointegration. |
These results provide evidence for our conclusion that there is no cointegration between gold and silver prices. Note, however, that these results are not conclusive and depend on whether we include a constant. This sheds light on the importance of including the correct deterministic components in our model.
To be thorough, we should also test for cointegration using tests that allow for a structural break. As an example, let's use the Gregory-Hansen test to compare the null hypothesis of no cointegration against the alternative that there is cointegration with one structural break.

This test can be implemented using the coint_ghansen procedure from the TSPDLIB library. The coint_ghansen procedure requires eight inputs, shown in the code below:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// Trimming rate
trimm = 0.15;
// Longrun consistent variance estimation method
varm = 3;
// Bandwidth for kernel estimator
T = rows(y_test_real);
bwl = round(4 * (T/100)^(2/9));
// Level shift
model = 1;
{ ADF_min1, TBadf1, Zt_min1, TBzt1, Za_min1, TBza1, cvADFZt1, cvZa1 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Level shift with trend
model = 2;
{ ADF_min2, TBadf2, Zt_min2, TBzt2, Za_min2, TBza2, cvADFZt2, cvZa2 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift
model = 3;
{ ADF_min3, TBadf3, Zt_min3, TBzt3, Za_min3, TBza3, cvADFZt3, cvZa3 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift with trend
model = 4;
{ ADF_min4, TBadf4, Zt_min4, TBzt4, Za_min4, TBza4, cvADFZt4, cvZa4 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
The coint_ghansen procedure has eight returns: the minimum $ADF$, $Z_t$, and $Z_{\alpha}$ test statistics, the estimated break date for each statistic, and the corresponding critical values.
| Test | $ADF$ Test Statistic | $Z_t$ Test Statistic | $Z_{\alpha}$ Test Statistic | 10% Critical Value ($ADF$, $Z_t$) | 10% Critical Value ($Z_{\alpha}$) | Conclusion |
|---|---|---|---|---|---|---|
| Gregory-Hansen, Level shift | -3.887 | -4.331 | -39.902 | -4.34 | -36.19 | Cannot reject the null of no cointegration for $ADF$ and $Z_t$. Reject the null of no cointegration at the 10% level for $Z_{\alpha}$. |
| Gregory-Hansen, Level shift with trend | -3.915 | -5.010 | -50.398 | -4.72 | -43.22 | Reject the null of no cointegration at the 10% level for $Z_{\alpha}$ and $Z_t$. Cannot reject the null for the $ADF$ test. |
| Gregory-Hansen, Regime change | -5.452 | -6.4276 | -80.379 | -4.68 | -41.85 | Reject the null of no cointegration at the 10% level. |
| Gregory-Hansen, Regime change with trend | -6.145 | -7.578 | -106.549 | -5.24 | -53.31 | Reject the null of no cointegration at the 10% level. |
Note that our test for cointegration with one structural break is inconsistent across specifications and depends on which type of structural break we include in our model. This provides some indication that further exploration of structural breaks is needed.
Some examples of additional steps that we could take:
Congratulations! You now have an established guide for cointegration and the background you need to perform cointegration testing.
In particular, today's blog covered:
You're probably familiar with basic find-and-replace. However, large projects with many files across several directories require a more powerful search tool.
The GAUSS Source Browser is the powerful searchandreplace tool you need. We'll discuss some ways you can use it to make your life easier, including:
If an output seems incorrect or is causing an error, you need a quick way to trace the progress of specific variables. GAUSS's Find Usages functionality does exactly that.
As shown in the image below, rightclick on the variable you want to search for and select Find Usages.
This will search our current file for every instance of our search variable and display them in the GAUSS Source Browser.
As we can see in the above image, Find Usages has shown us:
We can doubleclick on a line in the Source Browser to open it in the GAUSS editor.
The Find Usages basics alone are great, but the GAUSS Source Browser has much more to offer.
In addition to right-clicking and selecting Find Usages, the Source Browser can also be opened with the Ctrl+Shift+F (Command+Shift+F on macOS) keyboard shortcut.

The Search Tab is the starting point for any search. It has three input boxes:
The File pattern box defaults to *.*, where * is a wildcard character that will match any character or characters. The pattern can be a single extension or file name, such as *.gss or estimate.src, or a comma-separated list, such as *.gss,*.src,*.sdf,*.gau,*.prg.
Using these tools together can help us quickly locate the information we're looking for. For example, suppose we want to find uses of the GAUSS generalized method of moments procedure, gmmFit, in internal GAUSS source code.
We can achieve this by:

- Entering gmmFit in the Search for box.
- Setting /Users/Research/gauss/src as our Directory.

We could also set the File pattern option to *.src to limit our search to .src files. However, we left the default option to search all files in this case.
Our search results show that the term gmmFit is used between 1 and 37 times in each of five files in the specified directory.
The Advanced Tab provides many useful options for further refining our search. We'll discuss some of the most helpful and less common options.
The Assignments checkbox allows us to restrict our search to only the instances when a variable is assigned to. Narrowing the list of uses to assignments only is invaluable when tracking down the cause of an incorrect value. The screenshot above shows our original Find Usages search after adding the Assignments filter.
References is the opposite of Assignments, returning instances where a variable is used, but not assigned to.
By default, the Source Browser will search the directory that you specify and all subdirectories. Uncheck the Search Subdirs checkbox to restrict the search to the main search directory only.
After we have completed our desired search, we can perform our replace by:
The checkboxes at the start of each line of the search results allow you to quickly control which results are replaced.
Backup Original will create a copy of your files with a .bak file extension before making the replacement. Since there is no undo option for replacements made to files not open in the editor, the .bak copies allow you to recover the original file if you make a mistake.

Overwrite Backup will copy over any preexisting .bak files made by previous searches with Backup Original selected. If this option is unchecked, GAUSS will use .bak2, .bak3, etc., if preexisting .bak files are found.
After reading this post, you now know how to use the advanced search and replace tools in GAUSS to be more productive!
We have learned:
The latest Time Series MT (TSMT) 3.1.0 is now available for release. If you own TSMT 3.0, the update is available for free.

This minor update adds a new option to use the cross-product covariance computation for some estimation routines and includes some bug fixes:

- A new option to use the cross-product covariance computation for varmaFit, ecmFit, svarmaxmt, and garchFit.
- Bug fixes for cusum and sbreak graphs.

The GAUSS TSMT application module provides a comprehensive suite of tools for MLE and state-space estimation, model diagnostics, and forecasting of univariate, multivariate, and nonlinear time series models.
Panel data, sometimes referred to as longitudinal data, is data that contains observations about different cross sections across time. Examples of groups that may make up panel data series include countries, firms, individuals, or demographic groups.
Like time series data, panel data contains observations collected at a regular frequency, chronologically. Like crosssectional data, panel data contains observations across a collection of individuals.
There are a number of advantages of panel data:
Panel data examples can be found in economics, social sciences, medicine and epidemiology, finance, and the physical sciences.
| Field | Example topics | Example dataset |
|---|---|---|
| Microeconomics | GDP across multiple countries, unemployment across different states, income dynamics studies, international current account balances. | Panel Study of Income Dynamics (PSID) |
| Macroeconomics | International trade tables, world socioeconomic tables, currency exchange rate tables. | Penn World Tables |
| Epidemiology and Health Statistics | Public health insurance data, disease survival rate data, child development and well-being data. | Medical Expenditure Panel Survey |
| Finance | Stock prices by firm, market volatilities by country or firm. | Global Market Indices |
Panel data is a collection of quantities obtained across multiple individuals, assembled at even intervals in time and ordered chronologically. Examples of individual groups include individual people, countries, and companies.
In order to denote both individuals and time observations, panel data notation refers to groups with the subscript i and time with the subscript t. For example, a panel data observation $Y_{it}$ is observed for all individuals $i = 1, \ldots, N$ across all time periods $t = 1, \ldots, T$.
More specifically:
Group | Time period | Notation
1 | 1 | $Y_{11}$
1 | 2 | $Y_{12}$
⁞ | ⁞ | ⁞
1 | T | $Y_{1T}$
⁞ | ⁞ | ⁞
N | 1 | $Y_{N1}$
N | 2 | $Y_{N2}$
⁞ | ⁞ | ⁞
N | T | $Y_{NT}$
Panel datasets may come in different formats. The format in the table above is sometimes called long format. Long format datasets stack the observations of each variable from all groups, at all time periods, into one column.
When panel data is stored with the observations for a single variable from separate groups in separate columns, it is sometimes referred to as wide format.
Time | $Y_1$ | $Y_2$ | … | $Y_N$
1 | $Y_{11}$ | $Y_{21}$ | … | $Y_{N1}$
2 | $Y_{12}$ | $Y_{22}$ | … | $Y_{N2}$
3 | $Y_{13}$ | $Y_{23}$ | … | $Y_{N3}$
4 | $Y_{14}$ | $Y_{24}$ | … | $Y_{N4}$
⁞ | ⁞ | ⁞ | ⁞ | ⁞
T−1 | $Y_{1,T-1}$ | $Y_{2,T-1}$ | … | $Y_{N,T-1}$
T | $Y_{1T}$ | $Y_{2T}$ | … | $Y_{NT}$
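The two formats can be converted back and forth programmatically. As an illustrative sketch (in Python with pandas rather than GAUSS; the dataset and column names are hypothetical), reshaping a small long-format panel to wide format and back:

```python
import pandas as pd

# Hypothetical long-format panel: N = 2 groups, T = 3 time periods
long_df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2, 2],
    "time":  [1, 2, 3, 1, 2, 3],
    "y":     [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
})

# Long -> wide: one column of y per group, one row per time period
wide_df = long_df.pivot(index="time", columns="group", values="y")

# Wide -> long: stack the per-group columns back into a single y column
back_to_long = wide_df.reset_index().melt(
    id_vars="time", var_name="group", value_name="y"
)
```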
Panel data can also be characterized as balanced or unbalanced. A balanced panel contains an observation for every individual at every time period, while an unbalanced panel is missing observations for some individuals in some time periods.
Certain panel data models are only valid for balanced datasets. If the panel datasets are unbalanced they may need to be condensed to include only the consecutive periods for which there are observations for all individuals in the cross section.
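As a sketch of that condensing step (in Python with pandas for illustration; the data and column names are hypothetical), one simple approach keeps only the time periods observed for every group — for brevity this ignores the additional requirement that the retained periods be consecutive:

```python
import pandas as pd

# Hypothetical unbalanced panel: group 2 is missing time period 3
df = pd.DataFrame({
    "group": [1, 1, 1, 2, 2],
    "time":  [1, 2, 3, 1, 2],
    "y":     [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Keep only the time periods observed for every group in the panel
n_groups = df["group"].nunique()
counts = df.groupby("time")["group"].nunique()
complete_periods = counts[counts == n_groups].index
balanced = df[df["time"].isin(complete_periods)]
```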
Panel data modeling centers around addressing the likely dependence across observations within the same group. In fact, the primary difference between panel data models and time series models is that panel data models allow for heterogeneity across groups and introduce individual-specific effects.
As an example, consider a panel data series that includes gross domestic product (GDP) data for a panel of 5 different countries: the United States, France, Canada, Greece, and Australia. The level, variability, and growth of GDP differ substantially across these countries.
Panel data models include techniques that can address these heterogeneities across individuals. Furthermore, pure cross-sectional methods and pure time series models may not be valid in the presence of this heterogeneity.
Researchers commonly analyze datasets with multiple observations of a set of cross-sectional units (e.g., people, firms, countries) over time. For example, one may have data covering the production of multiple firms or the gross product of multiple countries across a number of years.
Modeling these panel data series is a unique branch of time series modeling, made up of methodologies specific to their structure.
This section looks more closely at panel data analysis and the associated panel data models.
Panel data methods can be split into two broad categories: static models and dynamic models.
Within these groups, the assumptions made about the variation of the model across individuals are the primary drivers of which model to use.
Let’s consider a simple linear model
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
The representation above is a homogeneous model: the intercept, $\alpha$, and the slope, $\beta$, are assumed to be the same for every group.
Alternatively, we could believe that groups share common coefficients on the regressors but have group-specific intercepts, as is captured in the fixed effects, or least squares dummy variable (LSDV), model
$$y_{it} = \alpha_i + \beta x_{it} + \epsilon_{it}$$
The representation above is a heterogeneous model, because the constants, $\alpha_i$, are group-specific.
This section considers four popular panel data models: pooled ordinary least squares, the one-way fixed effects model, the one-way random effects model, and the random coefficients model.
We will examine these models using an assumed data generation process given by
$$ y_{it} = \beta x_{it} + \delta z_i + \epsilon_{it}$$
In this model, $x_{it}$ represents observed characteristics, such as age, firm size, or expenditures, and $z_i$ represents unobserved characteristics, such as management quality, growth opportunities, or skill.
Component | Description | Example
$x_{it}$ | Observable characteristics. These may be constant for an individual across all time, such as race, or may vary across time, such as age. | Age, race, company size, expenditure, population, GDP
$z_i$ | Unobservable characteristics, responsible for model heterogeneity. | Skill, company potential, lack of basic infrastructure in the community, political unrest
$\epsilon_{it}$ | Stochastic error term. | N/A
What Is Pooled Ordinary Least Squares?
In some cases, there are no unobservable individual-specific effects, and $\delta z_i$ is constant across individuals. This is a strong assumption and implies that all the observations within groups are independent of one another.
In these cases, the model becomes
$$ y_{it} = \beta x_{it} + \alpha + \epsilon_{it}$$
This implies that when there is no dependence within individual groups, the panel data can be treated as one large, pooled dataset. The model parameters, $\beta$ and $\alpha$, can be directly estimated using pooled ordinary least squares.
However, independence within the groups of a panel is unlikely, so pooled OLS is rarely appropriate for panel data models.
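To illustrate the mechanics (a minimal sketch in Python/NumPy rather than GAUSS, using simulated data), pooled OLS simply stacks every group's observations into one sample and runs a single regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pooled panel: N groups, T periods, no individual effects,
# true intercept alpha = 1 and slope beta = 2
N, T = 50, 10
x = rng.normal(size=N * T)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=N * T)

# Pooled OLS: regress the stacked y on [1, x]
X = np.column_stack([np.ones(N * T), x])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

With no within-group dependence, the pooled estimates recover the true parameters; the sections below show what goes wrong when individual effects are present.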
What Is the One-Way Fixed Effects Model?
The one-way fixed effects panel data model treats the individual-specific effects as fixed, unknown parameters to be estimated, and allows those effects to be correlated with the observed regressors.
As an example, let's consider the one-way fixed effects model with individual-specific effects, where the unobservable component, $\delta z_i$, acts like an individual-specific intercept:
$$y_{it} = \beta x_{it} + \alpha_i + \epsilon_{it}$$
The intercept term, $\alpha_i$, varies across individuals but is constant across time. It is composed of a common intercept, $\mu$, and an individual-specific term, $\gamma_i$, so that $\alpha_i = \mu + \gamma_i$.
The key feature of the fixed effects model is that each $\gamma_i$ is a true, but unobservable, quantity that must be estimated. More importantly, if we estimate $\beta$ using pooled OLS and fail to account for $\gamma_i$, the estimates will be biased and inconsistent.
The fixed effects model requires the estimation of the model parameter $\beta$ and an individual $\alpha_i$ for each of the N groups in the panel. This is generally achieved using one of three estimation techniques: the within (demeaning) estimator, the first-difference estimator, or the least squares dummy variable (LSDV) estimator.
The first two of these techniques eliminate the individual effects before estimation. The LSDV method directly incorporates these effects using dummy variables.
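To see why eliminating the individual effects matters, here is a minimal simulation sketch (Python/NumPy for illustration): the group intercepts are deliberately correlated with the regressor, so a naive pooled slope is biased while the within (demeaning) estimator recovers the true slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated fixed effects panel: true slope beta = 2, and group
# intercepts alpha_i that are correlated with the regressor x
N, T = 100, 8
alpha = rng.normal(scale=3.0, size=N)          # individual effects
x = alpha[:, None] + rng.normal(size=(N, T))   # x correlated with alpha
y = alpha[:, None] + 2.0 * x + rng.normal(scale=0.1, size=(N, T))

# Naive pooled slope (no demeaning) absorbs the correlation
# between x and alpha_i and is biased upward
beta_pooled = (x * y).sum() / (x ** 2).sum()

# Within transformation: demean y and x by group, sweeping out alpha_i
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)

# OLS on the demeaned data is the fixed effects (within) estimator
beta_within = (x_dm * y_dm).sum() / (x_dm ** 2).sum()
```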
What Is the One-Way Random Effects Model?
The one-way random effects panel data model treats the individual-specific effects as random variables drawn from a common distribution, rather than as fixed parameters.
The distinguishing feature of the random effects model is that $\delta z_i$ does not have a single true value but rather follows a random distribution with parameters that we must estimate.
The random effects term, $\delta z_i$, is assumed to have mean zero and constant variance, and to be uncorrelated with the regressors, $x_{it}$, and the error term, $\epsilon_{it}$.
The random effects model should be estimated using feasible generalized least squares (FGLS). Using FGLS, the appropriate error structure, one which accounts for the individual-specific error terms, can be incorporated into the model.
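Concretely, writing the individual effect $\delta z_i$ in the common textbook notation $\mu_i$, with idiosyncratic error $\nu_{it}$, the composite error is

$$\epsilon_{it} = \mu_i + \nu_{it}, \qquad \mu_i \sim (0, \sigma_\mu^2), \quad \nu_{it} \sim (0, \sigma_\nu^2)$$

so any two errors in the same group share the correlation $\sigma_\mu^2 / (\sigma_\mu^2 + \sigma_\nu^2)$. The FGLS estimator is equivalent to OLS on quasi-demeaned data,

$$y_{it} - \hat{\theta}\bar{y}_i = (1 - \hat{\theta})\alpha + \beta(x_{it} - \hat{\theta}\bar{x}_i) + (\epsilon_{it} - \hat{\theta}\bar{\epsilon}_i), \qquad \hat{\theta} = 1 - \frac{\hat{\sigma}_\nu}{\sqrt{\hat{\sigma}_\nu^2 + T\hat{\sigma}_\mu^2}}$$

which removes only a fraction $\hat{\theta}$ of each group mean; $\hat{\theta} = 1$ gives the fixed effects (within) estimator and $\hat{\theta} = 0$ gives pooled OLS.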
What Is the Random Coefficients Model?
The panel data regressions we've looked at so far have all assumed that the coefficients on the regressors are the same across all individuals. The random coefficients model relaxes this assumption and introduces individual-specific effects through the coefficients, such that
$$y_{it} = \beta_i x_{it} + \alpha_i + \epsilon_{it}$$
$$y_{it} = (\beta + b_i)x_{it} + (\alpha + a_i) + \epsilon_{it}$$
$$b_i \sim N(0, \tau_{i1}^2)$$
$$a_i \sim N(0, \tau_{i2}^2)$$
This model introduces individual slope effects and allows for heteroscedasticity through the individual-specific variances $\tau_{i1}^2$ and $\tau_{i2}^2$.
This model can be estimated using FGLS or maximum likelihood estimation (MLE).
The two-way individual effects model allows for the presence of both time-specific effects and individual-specific effects.
Starting from a simple linear model given by,
$$y_{it} = \alpha + \beta x_{it} + \epsilon_{it}$$
the twoway individual effects model can be represented by
$$y_{it} = \alpha + \beta x_{it} + \mu_i + \lambda_t + \epsilon_{it}$$
In this model, $\mu_i$ captures any unobservable individual-specific effects and $\lambda_t$ captures any unobservable time-specific effects. Note that the individual-specific effects, $\mu_i$, do not vary with time, while the time-specific effects, $\lambda_t$, do not vary across individuals.
In the special case that there are only two time periods and two groups, this model is equivalent to the difference-in-differences model. However, if there are more than two time periods and/or groups, alternative panel data models must be considered.
What Is the Two-Way Fixed Effects Model?
The two-way fixed effects model treats both the individual effects, $\mu_i$, and the time effects, $\lambda_t$, as fixed, unknown parameters to be estimated, and allows both to be correlated with the observed regressors.
Like the one-way fixed effects model, this model could be estimated by including dummy variables. However, in the two-way fixed effects model, dummy variables must be included for both the time periods and the groups.
Under most circumstances, the number of dummy variables included in the two-way fixed effects model makes standard ordinary least squares estimation too computationally difficult. Instead, the two-way fixed effects model is estimated using a within-group estimator which removes both the group-specific and the time-specific variation.
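The two-way within transformation subtracts the group means and the time means and adds back the grand mean, which sweeps out both $\mu_i$ and $\lambda_t$ exactly. A minimal simulation sketch (Python/NumPy for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-way effects panel: true slope beta = 2,
# plus individual effects mu_i and time effects lambda_t
N, T = 60, 12
mu = rng.normal(scale=2.0, size=(N, 1))   # individual effects
lam = rng.normal(scale=2.0, size=(1, T))  # time effects
x = rng.normal(size=(N, T))
y = 2.0 * x + mu + lam + rng.normal(scale=0.1, size=(N, T))

# Two-way within transformation: subtract group means and time means,
# add back the grand mean; this removes both mu_i and lambda_t
def twoway_demean(a):
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

x_dm, y_dm = twoway_demean(x), twoway_demean(y)
beta_twoway = (x_dm * y_dm).sum() / (x_dm ** 2).sum()
```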
What Is the Two-Way Random Effects Model?
The two-way random effects model treats both the individual effects, $\mu_i$, and the time effects, $\lambda_t$, as random variables drawn from their respective distributions, assumed to be uncorrelated with the observed regressors.
Like the one-way random effects model, the two-way random effects model can be estimated using feasible generalized least squares (FGLS) or maximum likelihood estimation (MLE).
What Is the Dynamic Panel Data Model?
A key component of pure time series models is the modeling of dynamics using lagged dependent variables. These lagged variables capture the autocorrelation between observations of the same dataset at different points in time.
Because panel datasets include a time series component, it is also important to address the possibility of autocorrelation in panel data. The dynamic panel data model adds dynamics to the panel data individual effects framework.
Consider an individual effects model which includes an AR(1) term
$$y_{it} = \delta y_{i,t-1} + \beta x_{it} + \epsilon_{it}$$
where the error component includes one-way individual effects such that
$$\epsilon_{it} = \mu_i + \nu_{it}$$
where $\mu_i$ captures individual effects.
Introducing lagged dependent variables in the individual effects framework creates correlation between the lagged dependent variable, $y_{i,t-1}$, and the individual effects in the error term, so standard fixed effects and random effects estimators become biased and inconsistent.
Dynamic panel data models are most commonly estimated using a generalized method of moments (GMM) framework proposed by Arellano and Bond (1991).
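The intuition behind the Arellano and Bond approach can be sketched in two steps. First-differencing the model removes the individual effects $\mu_i$:

$$\Delta y_{it} = \delta \Delta y_{i,t-1} + \beta \Delta x_{it} + \Delta \nu_{it}$$

The differenced lag $\Delta y_{i,t-1}$ still contains $\nu_{i,t-1}$ and is therefore correlated with $\Delta \nu_{it}$, so OLS on the differenced equation remains inconsistent. However, assuming the idiosyncratic errors $\nu_{it}$ are serially uncorrelated, levels of the dependent variable lagged two or more periods are uncorrelated with the differenced error and serve as valid instruments:

$$E\left[y_{i,t-s}\, \Delta \nu_{it}\right] = 0, \qquad s \geq 2$$

Stacking these moment conditions across time periods and lags yields the GMM estimator.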
In panel data that covers small time frames, there is little need to worry about stationarity. However, when panel data covers longer time frames, as is the case in many macroeconomic panel data series, the panel data must be tested for stationarity.
Weak stationarity, required for many panel data modeling techniques, requires only that the mean of each series is constant over time, the variance is constant and finite, and the covariance between any two observations depends only on the time distance between them.
Nonstationary panel data series are any panel series that do not meet the conditions of a weakly stationary time series.
In part because of these considerations, a large field of research and literature surrounding panel data unit root tests has developed.
Testing for unit roots in panel data requires more than just testing the individual cross sections for the presence of unit roots. Panel data unit root tests must pool information across the individual cross sections and account for possible cross-sectional dependence between the series.
After today's blog, you should have an understanding of the fundamentals of panel data. We covered the basics of panel data, including what panel data is, long and wide data formats, balanced and unbalanced panels, common panel data models, and stationarity in panel data.
To learn more about performing panel data analysis using GAUSS contact us for a GAUSS demo copy.