Maximum likelihood is a widely used technique for estimation with applications in many areas including time series modeling, panel data, discrete data, and even machine learning.
In today's blog, we cover the fundamentals of maximum likelihood estimation.
In particular, we discuss:
In addition, we consider a simple application of maximum likelihood estimation to a linear regression model.
Maximum likelihood estimation is a statistical method for estimating the parameters of a model. In maximum likelihood estimation, the parameters are chosen to maximize the likelihood that the assumed model results in the observed data.
This implies that in order to implement maximum likelihood estimation we must:
Once the likelihood function is derived, maximum likelihood estimation is nothing more than a simple optimization problem.
At this point, you may be wondering why you should pick maximum likelihood estimation over other methods such as least squares regression or the generalized method of moments. The reality is that we shouldn't always choose maximum likelihood estimation. Like any estimation technique, maximum likelihood estimation has advantages and disadvantages.
There are many advantages of maximum likelihood estimation:
Maximum likelihood estimation hinges on the derivation of the likelihood function. For this reason, it is important to have a good understanding of what the likelihood function is and where it comes from.
Let's start with the very simple case where we have one series $y$ with 10 independent observations: 5, 0, 1, 1, 0, 3, 2, 3, 4, 1.
The first step in maximum likelihood estimation is to assume a probability distribution for the data. A probability density function measures the probability of observing the data given a set of underlying model parameters.
In this case, we will assume that our data has an underlying Poisson distribution, a common assumption for non-negative count data.
The Poisson probability density function for an individual observation, $y_i$, is given by
$$f(y_i \mid \theta ) = \frac{e^{-\theta}\theta^{y_i}}{y_i!}$$
Because the observations in our sample are independent, the probability density of our observed sample can be found by taking the product of the probability of the individual observations:
$$f(y_1, y_2, \ldots, y_{10} \mid \theta) = \prod_{i=1}^{10} \frac{e^{-\theta}\theta^{y_i}}{y_i!} = \frac{e^{-10\theta}\theta^{\sum_{i=1}^{10}y_i}}{\prod_{i=1}^{10}y_i!} $$
We can use the probability density to answer the question of how likely it is that our data occurs given specific parameters.
The differences between the likelihood function and the probability density function are nuanced but important. The probability density treats the parameters as fixed and asks how probable the observed data are, while the likelihood treats the observed data as fixed and views the same expression as a function of the parameters.
Mathematically, the likelihood function looks similar to the probability density:
$$L(\theta \mid y_1, y_2, \ldots, y_{10}) = f(y_1, y_2, \ldots, y_{10} \mid \theta)$$
For our Poisson example, we can fairly easily derive the likelihood function:
$$L(\theta \mid y_1, y_2, \ldots, y_{10}) = \frac{e^{-10\theta}\theta^{\sum_{i=1}^{10}y_i}}{\prod_{i=1}^{10}y_i!} = \frac{e^{-10\theta}\theta^{20}}{207,360}$$
The maximum likelihood estimate of the unknown parameter, $\theta$, is the value that maximizes this likelihood.
In practice, the joint distribution function can be difficult to work with and the $\ln$ of the likelihood function is used instead. In the case of our Poisson dataset, the log-likelihood function is:
$$\ln(L(\theta \mid y)) = -n\theta + \ln(\theta) \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \ln(y_i!) = -10\theta + 20 \ln(\theta) - \ln(207,360)$$
The log-likelihood is usually easier to optimize than the likelihood function.
A graph of the likelihood and log-likelihood for our dataset shows that the maximum likelihood occurs when $\theta = 2$. This means that our maximum likelihood estimator, $\hat{\theta}_{MLE} = 2$.
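This estimate is easy to check numerically. Below is a quick sketch in Python/NumPy (an illustration added here, not code from the original post): it evaluates the Poisson log-likelihood derived above over a grid of candidate values and confirms that the maximum occurs at the sample mean of 2.

```python
import numpy as np

# The ten observations from the example
y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

def poisson_loglik(theta, y):
    # ln L(theta) = -n*theta + ln(theta)*sum(y) - ln(prod(y_i!))
    log_fact = sum(np.sum(np.log(np.arange(1, k + 1))) for k in y)
    return -len(y) * theta + y.sum() * np.log(theta) - log_fact

# Evaluate the log-likelihood over a grid of candidate theta values
grid = np.linspace(0.1, 6, 1000)
lnl = np.array([poisson_loglik(t, y) for t in grid])
theta_hat = grid[lnl.argmax()]

print(round(theta_hat, 2))  # 2.0, the sample mean
```

A grid search is crude but transparent; in practice the first-order condition gives $\hat{\theta}_{MLE} = \bar{y}$ directly.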
In the simple example above, we use maximum likelihood estimation to estimate the parameters of our data's density. We can extend this idea to estimate the relationship between our observed data, $y$, and other explanatory variables, $x$. In this case, we work with the conditional maximum likelihood function:
$$L(\theta \mid y, x)$$
We will look more closely at this in our next example.
The versatility of maximum likelihood estimation makes it useful across many empirical applications. It can be applied to everything from the simplest linear regression models to advanced choice models.
In this section we will look at two applications:
In linear regression, we assume that the model residuals are independently and identically normally distributed:
$$\epsilon = y - \hat{\beta}x \sim N(0, \sigma^2)$$
Based on this assumption, the log-likelihood function for the unknown parameter vector, $\theta = \{\beta, \sigma^2\}$, conditional on the observed data, $y$ and $x$, is given by:
$$\ln L(\theta \mid y, x) = -\frac{1}{2}\sum_{i=1}^n \Big[ \ln \sigma^2 + \ln (2\pi) + \frac{(y_i - \hat{\beta}x_i)^2}{\sigma^2} \Big] $$
The maximum likelihood estimates of $\beta$ and $\sigma^2$ are those that maximize the likelihood.
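To make this concrete, here is a small Python/NumPy sketch (the dataset and variable names are made up for illustration; this is not from the original post). It uses the closed-form results implied by the log-likelihood above: the MLE of $\beta$ coincides with the OLS estimator, and the MLE of $\sigma^2$ divides the sum of squared residuals by $n$.

```python
import numpy as np

# Small made-up dataset for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(y)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# The MLE of beta coincides with the OLS estimator
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# The MLE of sigma^2 divides the squared residuals by n (not n - k)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n

def loglik(beta, sigma2):
    e = y - X @ beta
    return -0.5 * np.sum(np.log(sigma2) + np.log(2 * np.pi) + e**2 / sigma2)

# The log-likelihood at the MLE beats nearby parameter values
print(loglik(beta_hat, sigma2_hat) > loglik(beta_hat + 0.1, sigma2_hat))  # True
```

Note the division by $n$ rather than $n - k$: the MLE of $\sigma^2$ is biased in small samples, which is one of the disadvantages mentioned earlier.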
The probit model is a fundamental discrete choice model.
The probit model assumes that there is an underlying latent variable driving the discrete outcome. The latent variable follows a normal distribution such that:
$$y^* = x\theta + \epsilon$$ $$\epsilon \sim N(0,1)$$
where
$$ y_i = \begin{cases} 0 \text{ if } y_i^* \le 0\\ 1 \text{ if } y_i^* \gt 0\\ \end{cases} $$
The probability of observing $y_i = 1$ is
$$P(y_i = 1 \mid X_i) = P(y_i^* \gt 0 \mid X_i) = P(x\theta + \epsilon \gt 0 \mid X_i) = $$ $$P(\epsilon \gt -x\theta \mid X_i) = 1 - \Phi(-x\theta) = \Phi(x\theta)$$
where $\Phi$ represents the standard normal cumulative distribution function.
The log-likelihood for this model is
$$\ln L(\theta) = \sum_{i=1}^n \Big[ y_i \ln \Phi (x_i\theta) + (1 - y_i) \ln \big(1 - \Phi(x_i\theta)\big) \Big] $$
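As a quick numerical illustration (a Python sketch added here, using a tiny made-up dataset; $\Phi$ is computed from the error function), the probit log-likelihood rewards a slope coefficient with the correct sign:

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Tiny made-up dataset: one regressor x_i and a binary outcome y_i
x = [-1.5, -0.5, 0.3, 1.0, 2.0]
y = [0, 0, 1, 1, 1]

def probit_loglik(theta):
    return sum(
        yi * math.log(Phi(xi * theta)) + (1 - yi) * math.log(1 - Phi(xi * theta))
        for xi, yi in zip(x, y)
    )

# A positive slope fits this data better than theta = 0
print(probit_loglik(1.0) > probit_loglik(0.0))  # True
```

In practice the maximizing $\theta$ is found with a numerical optimizer, exactly as in the Poisson example above.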
Congratulations! After today's blog, you should have a better understanding of the fundamentals of maximum likelihood estimation. In particular, we've covered:
The post Beginner's Guide To Maximum Likelihood Estimation appeared first on Aptech.
Self-assessments are a common survey tool, but they can be difficult to analyze due to bias arising from systematic variation in individual reporting styles, known as reporting heterogeneity.
Anchoring vignette questions combined with the Compound Hierarchical Ordered Probit (CHOPIT) model allow researchers to address this issue in survey data (King et al. 2004).
This methodology is based on two key identifying assumptions:
In today's blog we look more closely at the fundamental pieces of this modeling technique.
In addition, we discuss a new test for evaluating the identifying assumptions, introduced by Greene et al. (2020).
In addition to ordinal self-assessment responses, the CHOPIT model requires data from a set of vignette questions. Both sets of questions are asked on the same ordinal scale and filled out by each respondent.
The vignette questions describe hypothetical situations to provide an anchor and create a scale that allows you to compare self-assessments across survey participants.
As an example, consider the Survey of Health, Ageing and Retirement in Europe (SHARE), described in Greene et al. (2020). For a self-assessment of pain, respondents were asked:
SHARE Self-Assessment of Pain

Overall in the last 30 days, how much of bodily aches or pains did you have?

□ None   □ Mild   □ Moderate   □ Severe   □ Extreme
There were also several vignette questions for pain (corresponding to different levels of severity) using the same response categories. For example:
SHARE Pain Vignette

Karen has a headache once a month that is relieved after taking a pill. During the headache, she can carry on with her day-to-day affairs. Overall in the last 30 days, how much of bodily aches or pains did Karen have?

□ None   □ Mild   □ Moderate   □ Severe   □ Extreme
Anchoring vignettes are motivated by the fact that two individuals with identical levels of true underlying pain may apply different response scales when answering the self-assessment question. This can lead them to report different answers, a phenomenon known as Differential Item Functioning (DIF).
Figure 1 demonstrates the issue of DIF.
The availability of vignette questions, which all respondents assess, allows us to anchor self-assessments to individual-specific response scales and hence make DIF-adjusted comparisons.
In the rest of this blog, we show how to conduct such an analysis, as well as how to test for the underlying assumptions of RC and VE.
To begin, let's note some key characteristics of our data:
These lend themselves nicely to the Ordered Probit (OP) model (Greene & Hensher 2010).
In the ordered probit model, there is an underlying latent variable
$$\begin{equation}y^* = \tilde{x}'\tilde{\beta} + \epsilon_y\end{equation}$$
where
The underlying latent variable translates into the actual responses via the mapping
$$\begin{equation}y = j\end{equation}$$
if
$$\mu_{j-1} \leq y^* < \mu_j \text{ for } j=0, \ldots, J-1$$
where $\mu_{-1} = -\infty$ and $\mu_{J-1} = +\infty$. To ensure well-defined probabilities, $\mu_{j-1} < \mu_j \ \forall j$. For example, consider the following boundary conditions:
$j$     $\mu_j$
0       $-\infty$
1       $-0.681$
2       $-0.107$
3       $0.278$
4       $1.309$
5       $+\infty$
Based on these conditions we can map the latent variable, $y^*$ to $y$:
$y^*$       $y$
$-0.791$    1
$0.078$     3
$-0.341$    2
$0.578$     4
$1.67$      5
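The mapping above is easy to reproduce in code. The following Python sketch (added here for illustration, assuming the interior boundaries $-0.681$, $-0.107$, $0.278$, $1.309$ from the example) assigns each latent value to its category:

```python
import bisect

# Interior boundary parameters; the outer boundaries are -inf and +inf
cutpoints = [-0.681, -0.107, 0.278, 1.309]

# Latent values from the example
y_star = [-0.791, 0.078, -0.341, 0.578, 1.67]

# bisect_right counts the cutpoints at or below y*, so adding 1
# gives the category j satisfying mu_{j-1} <= y* < mu_j
y = [bisect.bisect_right(cutpoints, v) + 1 for v in y_star]

print(y)  # [1, 3, 2, 4, 5]
```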
The ordered probit model has standard resulting probabilities and likelihood functions, which we won't cover here; see, for example, Greene & Hensher (2010).
The hierarchical ordered probit modifies the ordered probit model in two key ways:
A key consideration of the HOPIT model is how to relate the boundary parameters to observed characteristics. We will allow for two options for the boundary specification:
The linear specification: $$\begin{equation}\mu_{i,j} = z_i'\gamma_j\end{equation}$$
The exponential specification: $$\begin{equation}\mu_{i,0} = z_i'\gamma_0\end{equation}$$ $$\mu_{i,j} = \mu_{i,j-1} + \exp(z_i'\gamma_j)$$
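A small numeric sketch (Python, with made-up values for $z_i'\gamma_j$) shows why the exponential specification guarantees strictly increasing boundaries regardless of the sign of $z_i'\gamma_j$:

```python
import math

# Hypothetical index values z_i'gamma_j for j = 0, 1, 2, 3 (made up)
z_gamma = [0.2, -1.0, 0.5, -0.3]

# mu_0 is linear in the covariates; each later boundary adds a
# strictly positive increment exp(z_i'gamma_j)
mu = [z_gamma[0]]
for g in z_gamma[1:]:
    mu.append(mu[-1] + math.exp(g))

print(all(a < b for a, b in zip(mu, mu[1:])))  # True
```

Because $\exp(\cdot) > 0$, the ordering $\mu_{i,j-1} < \mu_{i,j}$ holds by construction, whereas the linear specification must be checked or constrained.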
Due to the linear specification of $\mu_{i,0}$ (for both boundary specifications above), unless other restrictions are imposed, $\gamma_0$ and $\tilde{\beta}$ are not separately identified for variables common to both $x$ and $z$.
However, in practice, it can be difficult to determine which variables should be in $x$ and which in $z$. In these circumstances, additional (external) information is required to identify the parameters of the model.
This is where the availability of vignettes in sample surveys plays a crucial role!
Say we have $k = 1, \ldots, K$ vignettes that relate to the selfassessment of interest, where each vignette:
The introduction of information from the responses to the vignettes, together with the restrictions imposed by the response consistency and vignette equivalence (see below), allows identification of all the model parameters, even when the variables in both $x$ and $z$ are identical.
Note that the vignette equivalence restriction imposes the requirement that $\tilde{\beta}_k = 0$.
The log-likelihood is the combination of that coming from the HOPIT component of the self-assessment ($\ln L_{HOPIT}$) and that from the $K$ vignette equations ($\ln L_{V,k}$):
$$\begin{equation}\ln L = \ln L_{HOPIT} + \sum_k \ln L_{V,k}\end{equation}$$
The components of the likelihood function are linked via the common boundary parameters ($\gamma$). Due to the "compound" nature of the several HOPIT models here, this combined approach is often called the Compound HOPIT (CHOPIT) model.
The approach relies on two assumptions:

Response consistency (RC): each individual uses the same response scale for the vignette questions as for their own self-assessment.

Vignette equivalence (VE): the true level described by each vignette is perceived in the same way by all respondents.
In this CHOPIT model it is now possible to estimate the scale of the vignette equations (remember, normalized to 1 in the self-assessment of interest), either individually for each vignette or constrained to be equal across all vignettes. The latter tends to be the norm when estimating the model.
Greene et al. (2020) suggest two amendments to the CHOPIT model that allow a test for RC and VE:
This amended specification:
The tests are not only informative of the identifying assumptions underlying the CHOPIT model but may be used to select appropriate vignettes or combinations thereof where multiple vignettes are available.
In today's blog, we looked more closely at addressing differing reporting scales in self-assessment survey questions.
Greene, W., Harris, M. N., Knott, R. & Rice, N. (2020), 'Specification and testing of hierarchical ordered response models with anchoring vignettes', Journal of the Royal Statistical Society: Series A, forthcoming.
Greene, W. & Hensher, D. (2010), Modeling Ordered Choices, Cambridge University Press, Cambridge.
King, G., Murray, C., Salomon, J. & Tandon, A. (2004), 'Enhancing the validity and cross-cultural comparability of measurement in survey research', American Political Science Review 98(1), 191-207.
The post Anchoring Vignettes and the Compound Hierarchical Ordered Probit (CHOPIT) Model appeared first on Aptech.
Placing graphs next to each other can be a great way to present information. Today we will learn how to create tiled graphs in GAUSS using plotLayout.
We will work through two simple examples where you will learn:
In GAUSS, plotLayout allows you to arrange multiple graphs in a tiled layout. It takes three inputs: the number of rows in the grid, the number of columns in the grid, and the index of the cell in which to place the next graph (counting left to right, top to bottom).
Therefore, the call:
plotLayout(2, 3, idx);
will:

divide the graph canvas into a 2x3 grid (2 rows and 3 columns);
place the next graph created in the cell specified by idx, as shown below.

idx = 1   idx = 2   idx = 3
idx = 4   idx = 5   idx = 6
plotLayout does not actually alter the graph canvas. It computes and stores the coordinates needed to place your graph tile.

Below is a partial list of the GAUSS graph functions which can be used with plotLayout. Each cell is independent and may contain a different graph type.
Function                          Graph type
------------------------------    ------------------------------------------------
plotXY                            XY line.
plotScatter                       2D scatter.
plotTS                            Time series line.
plotTSHF                          High frequency and irregular time series lines.
plotBar                           2D bars.
plotBox                           Box plots.
plotHist, plotHistP               Standard and percentage histograms.
plotLogX, plotLogY, plotLogLog    Line plot where one or more axes are in log space.
plotContour                       Contours.
plotArea                          Stacked area.
plotXYFill                        Area between two vectors.
Now that we have seen the basics of plotLayout, let's put together a simple example placing 4 graphs in a 2x2 grid.
new;
// Create the vector 0.1, 0.2, 0.3...3.0
x = seqa(0.1, 0.1, 30);
/*
** First subplot
*/
// Divide the graph canvas into a 2x2 grid
// and place the next graph in the 1st cell
plotLayout(2, 2, 1);
// Create 'y1' and draw graph
y1 = exp(x);
plotXY(x, y1);
/*
** Second subplot
*/
plotLayout(2, 2, 2);
y2 = sin(x);
plotXY(x, y2);
/*
** Third subplot
*/
plotLayout(2, 2, 3);
y3 = cos(x);
plotXY(x, y3);
/*
** Fourth subplot
*/
plotLayout(2, 2, 4);
y4 = ln(x);
plotXY(x, y4);
After running the above code, you should see a graph that looks like this:
What do you think would happen if we created another plot right after running the code from the previous section? Let's try it with the code below.
// Create new 'y' by adding uniform random
// numbers to the 'x' created above
y = x + rndu(30,1);
// Draw graph
plotScatter(x, y);
This time your graph should look like this:
As we can see, GAUSS remembers the grid we created with our last call to plotLayout, as well as the chosen cell index. This state will persist until one of the following statements has been executed:
// Subsequent graphs will fill the entire graph canvas.
plotClearLayout();
// 1. Clear all data from the GAUSS workspace.
// 2. Clear the current plot layout state, so that new graphs will fill the entire graph canvas.
new;
Sometimes your graphs will look better with tiles of different sizes. Fortunately, as we saw earlier, plotLayout only computes and stores coordinates.

Therefore, each call to plotLayout is independent of any previous calls. As long as your graph tiles do not overlap, each call to plotLayout may use a different grid size. This allows you to mix graphs of different sizes in the same graph canvas.
Let's modify our previous example, so that the third graph takes up the entire second row.
new;
// Create the vector 0.1, 0.2, 0.3...3.0
x = seqa(0.1, 0.1, 30);
/*
** First subplot
*/
// Divide the graph canvas into a 2x2 grid
// and place the next graph in the 1st cell
plotLayout(2, 2, 1);
// Create 'y1' and draw graph
y1 = exp(x);
plotXY(x, y1);
/*
** Second subplot
*/
plotLayout(2, 2, 2);
y2 = sin(x);
plotXY(x, y2);
/*
** Third subplot
**
** Change to interpret the graph canvas
** as a 2x1 grid and place the next graph
** in the second cell.
*/
plotLayout(2, 1, 2);
y3 = cos(x);
plotXY(x, y3);
In this code example, we changed the inputs to plotLayout, instructing it to:

interpret the graph canvas as a 2x1 grid;
place the next graph in the second cell of that grid.
This results in the graph below with the third plot taking up the entire second row of the graph canvas.
Congratulations, you now have another tool in your graph creating toolbox! You have learned how to:
The post How to Create Tiled Graphs in GAUSS appeared first on Aptech.
Our goal for this post is to learn the fundamentals of creating and using procedures in GAUSS. When you are finished, you should be able to create GAUSS procedures and use those written by others.
GAUSS procedures are user-defined functions. They allow you to combine a sequence of commands to perform a desired task.

"A function is a block of organized, reusable code that is used to perform a single, related action." (Tutorialspoint.com)
There are many advantages to creating GAUSS procedures. Even if you don’t consider yourself a programmer, creating GAUSS procedures will make your work easier and more productive.
Save time. Every time you use a previously written GAUSS procedure, you:
Let’s start with a simple example. We’ll create a simple procedure named power which takes two inputs and returns one output.
proc (1) = power(X, p); // 1. Procedure declaration
local X_out; // 2. Local variable list
X_out = X ^ p; // 3. Procedure body
retp(X_out); // 4. Procedure return
endp; // 5. Procedure end
Each line in the above procedure is one of the main parts of a GAUSS procedure. Now let's discuss the details of each of these elements.
a.         c.
↓          ↓
proc (1) = power(X, p);
     ↑          ↑
     b.         d.
The procedure declaration is the first line of all GAUSS procedures. It contains four main sections:
     Part      Description
a.   proc      Keyword that starts the procedure declaration.
b.   (1) =     The number of items (matrices, strings, etc.) that will be returned by this procedure. It is optional with a default value of 1.
c.   power     The name of the procedure.
d.   (X, p)    The required input arguments.
Don't forget the ending semicolon.
local X_out;
The local variable list is a comma-separated list of variables that will be used inside the procedure. These local variables only exist while the procedure is running.
X_out = X ^ p;
The procedure body is where the procedure's computations are performed. In this case, we have just one simple line. However, there is no limit to the size of your procedure.
You can use procedures created by you or others as well as any builtin GAUSS function or operator inside the body of your procedure.
retp(X_out);
You can return any of the local variables from the procedure, or none at all. When returning more than one variable, use commas to separate the variable names.
endp;
The endp keyword signals the end of the procedure.
There are two steps required to use a GAUSS procedure:
When GAUSS first starts up, it is only aware of its own builtin functions and procedures. Defining a procedure is the process of making GAUSS aware of your procedure. If you try to use a procedure before it is defined, like this:
// Clear all data and procedures from the GAUSS workspace
new;
// Call procedure that has not been defined
power(3, 2);
you will get the error G0025 : Undefined symbol: 'power'.
Here are four ways to define your GAUSS procedure:

1. Add the procedure to your main code file. The simplest method to define a procedure is to simply add it to your code and run the main file. This allows for a natural process of learning-by-doing.
As we can see from the above image, the procedure can be included anywhere in your program file. It is often convenient to place procedures after the main code.
GAUSS can allow this because it does not just run your code line-by-line. It is smarter than that. When GAUSS runs your program, its first step is to compile your code. This allows GAUSS to find many mistakes immediately and also speed up your code, in addition to locating your procedures.
2. Run the GAUSS procedure. Since GAUSS does not run your code line-by-line, as we saw in the previous section, GAUSS needs to have the entire procedure definition at once. Therefore, you cannot run a procedure one line at a time. If you try to run the first line alone:
proc (1) = power(X, p);
you will get the error G0278 : PROC without ENDP 'power'.
After you run the entire procedure:
the procedure is defined so that you can use it in GAUSS. Now running:
x = power(3, 2);
print x;
will be successful, returning:
9
3. Place your procedure in a .g file. It is not always convenient to keep all procedures in your main code file.
A simple alternative is to:
i. Create a file with the same name as your procedure plus a .g file extension, such as power.g.
ii. Place your procedure in this file.
iii. Save this file in a location where GAUSS can find it.
4. Add your procedure to a GAUSS library. Adding your related procedures to a GAUSS library is considered a best practice. GAUSS libraries make it easier to use and share code without worrying about paths.
Full GAUSS library management is beyond the scope of this article. However, the above image shows the Create Library and Add Files controls.
Now that we've learned how to create and define GAUSS procedures, it's time to use them. Here are a few example procedures to fill in anything you might be unsure about.
We'll start with a procedure with two returns.
// Create 4x1 vector
y = { 1, 3, 2, 1 };
// Call procedure
{ a, b } = range(y);
// Print results
print "The smallest value is: " a;
print "The largest value is : " b;
// Procedure that returns two items
proc (2) = range(X);
local min, max;
min = minc(X);
max = maxc(X);
retp(min, max);
endp;
Running the above code will return:
The smallest value is: 1
The largest value is : 3
Next, we'll consider a case with two inputs.
// Create 5x1 vector
z = { 1, 2, 3, 4, 9 };
m1 = average(z, "median");
print m1;
m2 = average(z, "mean");
print m2;
proc (1) = average(y, type_);
local avg;
if type_ $== "mean";
avg = meanc(y);
elseif type_ $== "median";
avg = median(y);
else;
print "Second input, type_, not recognized";
end;
endif;
retp(avg);
endp;
which will return:
3.0
3.8
Great job! You've made it through the fundamentals of GAUSS procedures. You've learned:
The post Basics of GAUSS Procedures appeared first on Aptech.
Dummy variables are a common econometric tool, whether working with time series, cross-sectional, or panel data. Unfortunately, raw datasets rarely come formatted with dummy variables that are regression ready.
In today's blog, we explore several options for creating dummy variables from categorical data in GAUSS, including:
Dummy variables can be conveniently created from files at the time of loading data or calling procedures using formula string notation. Formula string notation is a powerful GAUSS tool that allows you to represent a model or collection of variables in a compact and intuitive manner, using the variable names in the dataset.
The factor Keyword

The factor keyword is used in formula strings to tell GAUSS to create dummy variables from a categorical variable, either when loading data or when calling procedures such as olsmt, glm, or gmmFit.
$$mpg = \alpha + \beta_1 weight + \beta_2 rep78$$
We will use ordinary least squares to estimate this model with data from the auto2.dta file, which can be found in the GAUSSHOME/examples directory.
The variable rep78 is a categorical, 5-point variable that measures a car's repair record in 1978. To estimate the effects of the repair record on mpg, we can include dummy variables representing the different categories.
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Perform OLS estimation, creating dummy variables from 'rep78'
call olsmt(fname, "mpg ~ weight + factor(rep78)");
The printed output table includes coefficients for rep78 = fair, average, good, excellent. Note that rep78 = poor is automatically excluded from the regression as the base level.
                               Standard                  Prob    Standardized   Cor with
Variable          Estimate     Error          t-value    >|t|    Estimate       Dep Var
----------------------------------------------------------------------------------------
CONSTANT          38.0594      3.09336        12.3036    0.000   ---            ---
weight            -0.00550304  0.000601001    -9.15645   0.000   -0.743741      -0.80552
rep78: Fair       0.478604     2.76503        0.173092   0.863   0.0263109      -0.134619
rep78: Average    0.471562     2.55314        0.184699   0.854   0.0401403      -0.279593
rep78: Good       0.599032     2.6066         0.229814   0.819   0.0451669      0.0384391
rep78: Excellent  2.08628      2.72482        0.765657   0.447   0.131139       0.454192
The cat Keyword

Some common file types, such as XLS and CSV, do not have a robust method of specifying variable types. In these cases, the cat keyword is used to tell GAUSS to load a column as categorical data.

The cat keyword can be combined with the factor keyword to instruct GAUSS to load a column as string data, reclassify it to integers and then create dummy variables:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/yarn.xlsx";
// Reclassify 'load' variable from 'high, low, med'
// to '0, 1, 2', then create dummy variables from
// integer categories and create OLS estimates
call olsmt(fname, "cycles ~ factor(cat(load))");
Using factor(cat(load)) in the formula string tells GAUSS to create dummy variables representing the different categories of the load variable. This is seen in the printed output table, which now includes coefficients for load = low, med. Note that load = high is automatically excluded from the regression as the base level.
                               Standard                  Prob    Standardized   Cor with
Variable          Estimate     Error          t-value    >|t|    Estimate       Dep Var
----------------------------------------------------------------------------------------
CONSTANT          534.444444   292.474662     1.827319   0.080   ---            ---
load: low         621.555556   413.621634     1.502715   0.146   0.338504       0.240716
load: med         359.111111   413.621634     0.868212   0.394   0.195575       0.026323
loadd

In our previous two examples, we used the factor and cat keywords directly in calls to estimation procedures. However, we can also use these keywords when loading data to create dummy variables in our data matrices.

For example, let's load the dummy variables associated with the rep78 variable:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load the data, creating dummy variables from 'rep78'
reg_data = loadd(fname, "mpg + weight + factor(rep78)");
The reg_data matrix is a 74 x 6 matrix. It contains the mpg and weight data, as well as 4 columns of dummy variables for rep78 = fair, average, good, excellent.
The first five rows look like this:
mpg   weight   rep78:fair   rep78:avg   rep78:good   rep78:exc
 22     2930            0           1            0           0
 17     3350            0           1            0           0
 22     2640            .           .            .           .
 20     3250            0           1            0           0
 15     4080            0           0            1           0
Note that, again, rep78 = poor is automatically excluded as the base level.
In the previous section, we looked at creating dummy variables at the time of loading data or running procedures. In this section, we consider how to create dummy variables from an existing GAUSS vector.
The GAUSS design procedure provides a convenient method for creating dummy variables from a vector of discrete categories.
Let's load the data from the auto2.dta dataset used in our earlier regression example. This time we won't load rep78 using factor:
// Create a fully pathed file name
fname = getGAUSSHome() $+ "examples/auto2.dta";
// Load auto data for regression
reg_data = loadd(fname, "mpg + weight + rep78");
// Remove missing values
reg_data = packr(reg_data);
The first five rows of reg_data look like this:

22   2930   3
17   3350   3
20   3250   3
15   4080   4
18   3670   3
Our third column now contains discrete, categorical data with values ranging from 1-5, which represent poor, fair, average, good, and excellent.
The auto2.dta file specifies the preferred order for the string categories.

// Compute the unique values found
// in the third column of 'reg_data'
print unique(reg_data[., 3]);
1
2
3
4
5
design creates a matrix with a column of indicator variables for each positive integer in the input. For example:
cats = { 1, 2, 1, 3 };
print design(cats);
will return:
1 0 0
0 1 0
1 0 0
0 0 1
Therefore, if we pass the third column of reg_data to design, we will get a matrix with a column for all five categories. However, we want to drop the base case column for our regression.
$$mpg = \alpha + \beta_1 weight + \beta_2 rep78_{fair} + \beta_3 rep78_{avg} + \beta_4 rep78_{good} + \beta_5 rep78_{excl}$$
To do this, we shift the range of the categorical data from 1-5 to 0-4 by subtracting 1.
// Create dummy variables. Subtract one
// to remove the base case.
dummy_vars = design(reg_data[., 3]  1);
This creates a 69x4 matrix, dummy_vars, which contains dummy variables representing the final four levels of rep78.
Now we can estimate our model as shown below.
// Select the 'mpg' data as the dependent variable
y = reg_data[., 1];
// Independent variables:
// 'weight' is in the second column of 'reg_data'.
// 'rep78'= Fair, Average, Good and Excellent
// are represented by the 4 columns
// of 'dummy_vars'.
x = reg_data[., 2]~dummy_vars;
// Estimate model using OLS
call olsmt("", y, x);
Our printed results are the same as earlier, except our table no longer includes variable names:
                              Standard                 Prob    Standardized   Cor with
Variable         Estimate     Error        t-value     >|t|    Estimate       Dep Var
---------------------------------------------------------------------------------------
CONSTANT         38.059415    3.093361     12.303578   0.000   ---            ---
X1               -0.005503    0.000601     -9.156447   0.000   -0.743741      -0.805520
X2               0.478604     2.765035     0.173092    0.863   0.026311       -0.134619
X3               0.471562     2.553145     0.184699    0.854   0.040140       -0.279593
X4               0.599032     2.606599     0.229814    0.819   0.045167       0.038439
X5               2.086276     2.724817     0.765657    0.447   0.131139       0.454192
The design procedure works well when our data already contains categorical data. However, there may be cases when we want to create dummy variables based on ranges of continuous data. The GAUSS dummybr, dummydn, and dummy procedures can be used to achieve this.
Consider a simple example:
x = { 1.53,
8.41,
3.81,
6.34,
0.03 };
// Breakpoints
v = { 1, 5, 7 };
All three procedures create a set of dummy (0/1) variables by breaking up a data vector into categories based on specified breakpoints. These procedures differ in how they treat boundary cases as shown below.
dummybr
  Category boundaries: $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$
  Number of dummies ($K$ breakpoints): $K$
  Call: dm = dummybr(x, v);
  Result: $$dm = \begin{matrix} 0 & 1 & 0\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0 \end{matrix}$$

dummy
  Category boundaries: $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$, $7 \lt x$
  Number of dummies: $K+1$
  Call: dm = dummy(x, v);
  Result: $$dm = \begin{matrix} 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 \end{matrix}$$

dummydn
  Category boundaries: $x \leq 1$, $1 \lt x \leq 5$, $5 \lt x \leq 7$, $7 \lt x$
  Number of dummies: $K$
  Call: p = 2; dm = dummydn(x, v, p);
  Result: $$dm = \begin{matrix} 0 & 0 & 0\\ 0 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0\\ 1 & 0 & 0 \end{matrix}$$
Let's look a little closer at how these procedures work.
dummybr

When creating dummy variables with dummybr:
dm = dummybr(x, v);
The code above produces three dummies based upon the breakpoints in the vector v:

x <= 1
1 < x <= 5
5 < x <= 7
The matrix dm contains:

     0 1 0          1.53
     0 0 0          8.41
dm = 0 1 0     x =  3.81
     0 0 1          6.34
     1 0 0          0.03
Notice that in this case, the second row of dm
does not contain a 1 because x = 8.41
does not fall into any of our specified categories.
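Since dummybr can leave observations unclassified in this way, it can be handy to flag them programmatically. The snippet below is our own illustrative sketch (not from the original post); it finds the rows of dm that contain no 1 by checking the row sums:

```
// Data vector and breakpoints from the example above
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Values above the final breakpoint get a row of zeros
dm = dummybr(x, v);

// Flag observations outside every category:
// rows of 'dm' whose sum is zero
unclassified = sumc(dm') .== 0;
```

In our example, the second element of unclassified equals 1, because x = 8.41 lies above the final breakpoint.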
dummy
Now, let's compare our results from dummybr above to the dummy procedure. When we use the dummy procedure:
dm = dummy(x, v);
The code above produces four dummies based upon the breakpoints in the vector v:

x <= 1
1 < x <= 5
5 < x <= 7
7 < x
The matrix dm contains:

     0 1 0 0          1.53
     0 0 0 1          8.41
dm = 0 1 0 0     x =  3.81
     0 0 1 0          6.34
     1 0 0 0          0.03
These results vary from our previous example:

The dummy procedure results in 4 columns of dummy variables. It adds a new column for the case where 7 < x.
x = 8.41 now falls into the category 7 < x.
.dummydn
Our final function is dummydn, which behaves just like dummy, except that the pth column of the matrix of dummies is dropped. This is convenient for specifying a base case to ensure that these variables will not be collinear with a vector of ones.
// Column to drop
p = 2;
// Create matrix of dummy variables
dm_dn = dummydn(x, v, p);
The code above produces three dummies based upon the breakpoints in the vector v:

x <= 1
1 < x <= 5    // Since p = 2, this column is dropped
5 < x <= 7
7 < x
The matrix dm_dn contains:

     0 1 0 0             0 0 0          1.53
     0 0 0 1             0 0 1          8.41
dm = 0 1 0 0    dm_dn =  0 0 0     x = 3.81
     0 0 1 0             0 1 0          6.34
     1 0 0 0             1 0 0          0.03
Note that the matrix dm_dn is the same as dm except the second column has been removed.
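To see why dropping a base case matters, consider adding an intercept to a regression that uses the dummies. This is our own illustrative sketch (variable names assumed), building on the example above:

```
x = { 1.53, 8.41, 3.81, 6.34, 0.03 };
v = { 1, 5, 7 };

// Full set of dummies from 'dummy': each row sums to one,
// so adding an intercept would create perfect collinearity
dm = dummy(x, v);

// Drop the second category as the base case
p = 2;
dm_dn = dummydn(x, v, p);

// Intercept plus the remaining dummies is full rank
X = ones(rows(x), 1) ~ dm_dn;
```

The estimated coefficients on the remaining dummies are then interpreted relative to the dropped base category.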
Dummy variables are an important tool for data analysis whether we are working with time series data, cross-sectional data, or panel data. In today's blog, we have explored the GAUSS tools for generating dummy variables:

The design procedure.
The dummy, dummybr, and dummydn procedures.

The post How To Create Dummy Variables in GAUSS appeared first on Aptech.
In this blog we will explore how to set up and interpret cointegration results using a real-world time series example. We will cover the case with no structural breaks as well as the case with one unknown structural break using tools from the GAUSS tspdlib library.
In this blog, we will use the famous Nelson-Plosser time series data. The dataset contains macroeconomic fundamentals for the United States.
We will be using three of these fundamentals:
The time series data is annual, covering 1900 to 1970.
In order to prepare for cointegration testing, we will take some preliminary time series modeling steps. We will:
In this example, we will examine the macroeconomic question of whether stock prices are linked to macroeconomic indicators. In particular, we will examine whether there is a cointegrated, long-run relationship between the S&P 500 price index and the monetary policy indicators of the M2 money stock and bond yields.
Mathematically we will consider the cointegrated relationship:
$$y_{sp, t} = c + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
When visualizing time series data, we look for visual evidence of:
Our time series plots give us some important considerations for our testing, providing visual evidence to support:
Prior to testing for cointegration between our time series data, we should check for unit roots in the data. We will do this using the adf procedure in the tspdlib library to conduct the Augmented Dickey-Fuller unit root test.
Variable  Test Statistic  1% Critical Value  5% Critical Value  10% Critical Value  Conclusion 

Money  -1.621  -4.04  -3.45  -3.15  Cannot reject the null 
Bond yield  -1.360  -4.04  -3.45  -3.15  Cannot reject the null 
S&P 500  -0.3842  -4.04  -3.45  -3.15  Cannot reject the null 
Our ADF test statistics are greater than the 10% critical value for all of our time series. This implies that we cannot reject the null hypothesis of a unit root for any of our time series data.
What about the potential structural break that we see in our time series data? Does this have an impact on our unit root testing?
Using the adf_1break procedure in the tspdlib library to test for unit roots with a single structural break in the trend and constant, we get the following results.
Variable  Test Statistic  Break Date  1% Critical Value  5% Critical Value  10% Critical Value  Conclusion 

Money  -4.844  1948  -5.57  -5.08  -4.82  Cannot reject the null 
Bond yield  -3.226  1963  -5.57  -5.08  -4.82  Cannot reject the null 
S&P 500  -4.639  1945  -5.57  -5.08  -4.82  Cannot reject the null 
Our ADF test statistics again suggest that even when accounting for the structural break, we cannot reject the null hypothesis of a unit root for any of our time series data.
Having concluded that there is evidence for unit roots in our data, we can now run our cointegration tests.
When setting up cointegration tests, there are a number of assumptions that we must specify:
To better understand these general assumptions, let's look at the simplest of our tests, the Engle-Granger cointegration test.
In the two-stage, residual-based cointegration tests which we will consider today, normalization amounts to deciding which variable is our dependent variable and which variables are our independent variables in the cointegration regression.
We will choose our normalization to reflect our theoretical question of whether the S&P 500 index is cointegrated with the money stock and the bond yield. As we mentioned earlier, this means we will consider the cointegrated relationship:
$$y_{sp, t} = c + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
// Set fname to name of dataset
fname = "nelsonplosser.dta";
// Load three variables from the dataset
// and remove rows with missing values
coint_data = packr(loadd(fname, "sp500 + m + bnd"));
// Define y and x matrix
y = coint_data[., 1];
x = coint_data[., 2 3];
The second assumption we must make about our Engle-Granger test is which model we wish to use. To understand how to make this decision, let's look closer at what this input means.
The Engle-Granger test is a two-step test: the cointegrating regression is estimated in the first step, and a unit root test is run on its residuals in the second step.
When we specify which model to use, we impact the deterministic components included in the first-stage regression and the critical values used for the test.
There are three options to choose from:
No constant or trend (model = 0)
$$y_{sp, t} = \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
Constant (model = 1)
$$y_{sp, t} = \alpha + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$
Constant and trend (model = 2)

$$y_{sp, t} = \alpha + \delta t + \beta_1 y_{money, t} + \beta_2 y_{bond, t} + u_t$$

For our example, we will include a constant and trend in our first-stage cointegration regression by setting:
// Select model with constant and trend
model = 2;
In the second-stage ADF residual unit root test, the error terms should be serially independent. To account for possible autocorrelation, lags of the first differences of the residual can be included in the ADF test regression.
The GAUSS coint_egranger procedure will automatically determine the optimal number of lags to include in the second-stage regression based on two user inputs: the information criterion, ic (1 = Akaike, 2 = Schwarz, 3 = t-stat significance), and the maximum number of lags, pmax.
]/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
Now that we have loaded our data and chosen the test settings, we can call the coint_egranger procedure:
// Perform EngleGranger Cointegration Test
{ tau_eg, cvADF_eg } = coint_egranger(y, x, model, pmax, ic);
In order to interpret our cointegration results, let's revisit the two steps of the Engle-Granger test. The Engle-Granger test statistic for cointegration reduces to an ADF unit root test of the residuals of the cointegration regression.
After running our cointegration test we obtain the following results:
Engle-Granger Test
Constant and Trend
H0: no cointegration (EG, 1987 & P0, 1990)

           Test Statistic       CV(1%, 5%, 10%)
EG_ADF         -2.105        -4.645  -4.157  -3.843
We can see that the test statistic is greater than each of the critical values, so we cannot reject the null hypothesis of no cointegration.
Earlier we saw that the potential structural break in our data did not change our unit root test conclusion. We should also see if the structural break has an impact on our cointegration testing.
To do this we will use the Gregory-Hansen cointegration test, which can be implemented using the coint_ghansen procedure in the tspdlib library.
We can carry over all of our coint_egranger testing specifications, except our model specification.
When implementing the Gregory-Hansen test, we must decide on a model which specifies how the structural break enters the cointegrating relationship. There are four modeling options to choose from:

Level shift (model = 1).
Level shift with trend (model = 2).
Regime change (model = 3).
Regime and trend change (model = 4). In this model, the structural break affects the constant, the regression coefficients, and the trend.
For example, let's consider the last case, where the constant, coefficients, and trend are all impacted by the structural break:
// Set fname to name of dataset
fname = "nelsonplosser.dta";
// Load three variables from the dataset
// and remove rows with missing values
coint_data = packr(loadd(fname, "sp500 + m + bnd"));
// Define y and x matrix
y = coint_data[., 1];
x = coint_data[., 2 3];
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// Model specification
// Regime and trend shift
model = 4;
// Perform cointegration test
{ ADF_min_gh, TBadf_gh, Zt_min_gh, TBzt_gh, Za_min_gh, TBza_gh, cvADFZt_gh, cvZa_gh } =
coint_ghansen(y, x, model, bwl, ic, pmax, varm, trimm);
The coint_ghansen procedure provides more extensive results than the coint_egranger test. In particular, the Gregory-Hansen test reports the $ADF$, $Z_t$, and $Z_{\alpha}$ test statistics, along with the break date estimated by each.
Cointegration test results
After calling the coint_ghansen procedure and testing all possible models, we obtain the following test statistic results:
Test  $ADF$ Test Statistic  $Z_t$ Test Statistic  $Z_{\alpha}$ Test Statistic  10% Critical Value $ADF$,$Z_t$  10% Critical Value $Z_{\alpha}$  Conclusion 

Gregory-Hansen, Level shift  -4.004  -3.819  -27.858  -4.690  -42.490  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$. 
Gregory-Hansen, Level shift with trend  -3.889  -3.751  -27.618  -5.030  -48.94  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$. 
Gregory-Hansen, Regime change  -4.658  -4.539  -32.766  -5.23  -52.85  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$. 
Gregory-Hansen, Regime change with trend  -5.834  -4.484  -32.411  -5.72  -63.10  Cannot reject the null of no cointegration for $ADF$, $Z_t$, or $Z_{\alpha}$. 
As we can see from these results, there is no evidence that our S&P 500 Index is cointegrated with the money stock and bond yield.
Structural break results
The coint_ghansen procedure also returns estimates for break dates based on the $ADF$, $Z_t$, and $Z_{\alpha}$ tests:
Test  $ADF$ Break Date  $Z_t$ Break Date  $Z_{\alpha}$ Break Date 

Gregory-Hansen, Level shift  1958  1956  1956 
Gregory-Hansen, Level shift with trend  1958  1956  1956 
Gregory-Hansen, Regime change  1955  1955  1955 
Gregory-Hansen, Regime change with trend  1951  1953  1947 
The results from our Gregory-Hansen cointegration test provide some important conclusions:
Note that while the Gregory-Hansen test does estimate break dates, it does not provide the statistical evidence to conclude whether these are statistically significant break dates or not.
Today's blog looks closer at the Engle-Granger and Gregory-Hansen residual-based cointegration tests. By building a better understanding of how the tests work and what assumptions we make when running the tests, you will be better equipped to interpret the test results.
In particular, today we learned how to set up, run, and interpret these two cointegration tests.
The post How to Interpret Cointegration Test Results appeared first on Aptech.
You probably know that you can use GAUSS code to create repeatable styling for your graphs. However, you might not know that GAUSS also allows you to build reusable graphics profiles with a few clicks of your mouse.
We will show you how to clone existing graphics profiles and how to control:
While this tutorial will not cover all available options, you should learn enough to be comfortable making most other changes without additional help.
For this tutorial, our goal will be to create a graph profile to replicate the style of the graph below. Data for the graph can be downloaded from here.
This video tutorial will show you how to:
or read the text instructions below.
Using the graphics profile created in the previous step:
This will open your operating system's font dialog, where you can select your desired font, weight, and size:
We'll start by modifying the third line, which marks the date of the first local U.S. transmission.
Then repeat step 2 above to add the legend text for Series 1 and Series 2.
You can access the settings from a graph profile using the plotGetDefaults function, as you would for the default settings. Below is a simple example.
// Load variables
Y = loadd("cookcountycovid.csv", "cases + deaths");
x = seqa(1, 1, rows(Y));
// Declare myPlot to be a plotControl structure
// and fill with the settings from our covid profile
struct plotControl myPlot;
myPlot = plotGetDefaults("covid");
// Draw plot using covid profile settings
plotXY(myPlot, x, Y);
which will produce the plot below:
While the graphs for a given project may share many features, it is likely that some attributes such as the title or legend text will need to be different. You could clone your custom profile and make these changes. However, it might be simpler to make these changes in your code.
Fortunately, you can use the plotSet functions to further customize the settings for a particular graph. You can modify any settings that you would like to change.
The code below shows how to change the title, keeping all other settings unchanged.
// Load variables
Y = loadd("cookcountycovid.csv", "cases + deaths");
x = seqa(1, 1, rows(Y));
// Declare myPlot to be a plotControl structure
// and fill with the settings from our covid profile
struct plotControl myPlot;
myPlot = plotGetDefaults("covid");
// Set title text
plotSetTitle(&myPlot, "COVID19 Cook County");
// Draw plot using covid profile settings
plotXY(myPlot, x, Y);
These settings will not modify the custom graphics profile that you created. Next time you call myPlot = plotGetDefaults("covid"), myPlot will be filled with the custom profile settings that you selected in the Preferences dialog window.
Full code to create the time series plot shown at the top of this post, using the "covid" custom graphics profile we created, can be downloaded here.
Great job! You've learned how to interactively create custom graphics profiles that can be easily reused and further modified.
The post How to Interactively Create Reusable Graphics Profiles appeared first on Aptech.
This post will show you how to create a simple output table with variable names, parameter estimates, and standard errors, as shown below, using sprintf.

  population  0.6558 (0.1910)
poverty rate  0.9465 (0.2505)
   inflation  1.7366 (0.1101)
sprintf, first introduced in GAUSS version 20, is a standard function in many languages. It provides powerful, compact statements that allow you to combine and style numeric and string data.
Unfortunately, some people find the sprintf format string difficult to use and understand. Fortunately, today we are going to make it clear for you!
The GAUSS sprintf function takes two or more inputs and returns a string array.
sprintf(fmt, a)
sprintf(fmt, a, ...)
fmt: The sprintf format string (described later).
a: The data to be formatted.
The sprintf format string has many options that give you great control over the presentation of your data. Today we will show you how to use the three most important options, which will cover most use cases.
The most basic format specifier starts with a percent sign (%) followed by a letter which is called the specifier character.
Specifier Character  Output  Example  

d  Integer  sprintf("%d", 3.1415) 
3 
e  Scientific  sprintf("%e", 3.1415) 
3.1415e+00 
f  Decimal  sprintf("%f", 3.1415) 
3.1415 
s  String  sprintf("%s", "Pi") 
"Pi" 
sprintf format strings can contain more than one format specifier, and specifiers can be mixed with text. For example:
sprintf("%s is %d", "Steve", 38);
will print out:
Steve is 38
while sprintf("%sis%d", "Steve", 38) will return Steveis38.
Now we know enough to create the first iteration of our table.
// Create a 3x1 string array
var_names = "population" $ "poverty rate" $ "inflation";
// Create two 3x1 vectors
beta_ = { 0.65582591,
0.94648444,
1.7365981 };
se = { 0.19095146,
0.25047475,
0.11009904 };
sprintf("%s %f %f", var_names, beta_, se);
will print out
population 0.655826 0.190951
poverty rate 0.946484 0.250475
inflation 1.736598 0.110099
The overlapping columns make this table difficult to read.
To stop the columns from overlapping, we need to make the field width large enough to hold the widest element for each column. The field width is an integer placed immediately following the percent sign (%). For example
Code  Output 

sprintf("%8f %8f", 3.1415, 6.283) 
3.141500 6.283000 
sprintf("%12f %8f", 3.1415, 6.283) 
3.141500 6.283000 
sprintf("%12f%8f", 3.1415, 6.283) 
3.1415006.283000 
sprintf("%12f%9f", 3.1415, 6.283) 
3.141500 6.283000 
Now let's update our code to:
// Create 3x1 string array
var_names = "population" $ "poverty rate" $ "inflation";
// Create two 3x1 vectors
beta_ = { 0.65582591,
0.94648444,
1.7365981 };
se = { 0.19095146,
0.25047475,
0.11009904 };
sprintf("%12s %9f %9f", var_names, beta_, se);
will return:
  population  0.655826  0.190951
poverty rate  0.946484  0.250475
   inflation  1.736598  0.110099
This is much better. As we saw earlier, we can add any text to the format string that we'd like to see in the output. Let's add parentheses around the format specifier for the standard error column.
// Surround SE format specifier with parentheses
sprintf("%12s %9f (%9f)", var_names, beta_, se);
will return:
  population  0.655826 ( 0.190951)
poverty rate  0.946484 ( 0.250475)
   inflation  1.736598 ( 0.110099)
The parentheses in the above table make it easy for us to see the 9 character field width of the final column. If the standard error could be negative, we might want to leave it as it is to allow room for a possible minus sign. However, since we know it will be positive, let's:
// Decrease field width of final column
// and add one space before the '('
sprintf("%12s %9f (%8f)", var_names, beta_, se);
will return:
  population  0.655826 (0.190951)
poverty rate  0.946484 (0.250475)
   inflation  1.736598 (0.110099)
Our table is really starting to look good, but to match the table from the introduction, we need to reduce the number of digits to the right of the decimal place to 4.
The sprintf precision specifier is a dot or period followed by an integer. It controls the number of digits printed after the decimal point when using the floating-point f specifier character.
Code  Output 

sprintf("%12.2f %8.3f", 3.1415, 6.283) 
3.14 6.283 
After setting the precision to 4:
// Decrease precision to 4 for both numeric columns
sprintf("%12s %9.4f (%8.4f)", var_names, beta_, se);
our output looks like this:
  population    0.6558 (  0.1910)
poverty rate    0.9465 (  0.2505)
   inflation    1.7366 (  0.1101)
We can see that the final column displays 6 characters, but the field width is set to 8 characters. Similarly, the second column needs 7 spaces (5 digits plus a dot and a possible minus sign), but we have given it 9.
Let's fix those:
// Decrease field width for numeric columns to remove extra spaces
sprintf("%12s %7.4f (%6.4f)", var_names, beta_, se);
We have reached our goal!
  population  0.6558 (0.1910)
poverty rate  0.9465 (0.2505)
   inflation  1.7366 (0.1101)
Great job! You have learned the basics you need to know to display your data with sprintf in GAUSS, including:

How sprintf in GAUSS can print styled tables with columns of string and numeric data.
How to control the field width and precision of the floating-point, f, specifier character.

While there is much more that sprintf can do, if you can remember
"%[field width].[precision]f"
and
"%s"
you'll have a great foundation to expand on.
The post How to Create a Simple Table with sprintf in GAUSS appeared first on Aptech.
Optional input arguments can make your statistical computing more efficient and enjoyable. GAUSS version 20 added a new suite of tools to make it easy for you to add optional input arguments to your GAUSS procedures.
In this blog post we will show you how to easily:
We’ll start by creating a simple toy GAUSS procedure that does not use optional arguments.
// Run procedure
myProc(1.1, 2.2, 3.3);
// Define procedure
proc (0) = myProc(a, b, c);
print a;
print b;
print c;
endp;
If we run the code above, we will get the following output:
1.1
2.2
3.3
To modify this procedure to accept optional arguments, we need to make the following changes:
The procedure declaration is the first line of the procedure. The procedure declaration from our initial procedure was:
// Procedure declaration
proc (0) = myProc(a, b, c);
Now we will modify our original procedure to make a the only required input. The new procedure declaration will look like this:
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
In the procedure declaration above, a is a required input as it was before. However, we have replaced b and c with .... The triple dots represent the optional arguments; they must come last in the procedure declaration and can stand for zero or more optional inputs.
The GAUSS function dynargsCount returns the number of optional arguments passed into the procedure. dynargsCount may only be called inside of a GAUSS procedure and does not take any inputs. For example:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
local n_dynargs;
// Notice that you do not pass ‘...’
// into ‘dynargsCount’
n_dynargs = dynargsCount();
print "Number of optional inputs = " n_dynargs;
print "";
endp;
Will print the following output:
call 1
Number of optional inputs = 0.0

call 2
Number of optional inputs = 1.0

call 3
Number of optional inputs = 2.0
This allows you to add logic which depends on the number of optional arguments that were passed into your procedure.
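For example, the count can drive simple branching inside a procedure. The scaleVec procedure below is a hypothetical sketch of ours, not part of the original post:

```
y1 = scaleVec(1|2|3);        // No optional input: scale by 1
y2 = scaleVec(1|2|3, 10);    // Optional input: scale by 10

// Scale a vector by an optional factor
proc (1) = scaleVec(x, ...);
    local s;
    if dynargsCount() == 0;
        // No optional inputs were passed; use the default scale
        s = 1;
    else;
        s = dynargsGet(1);
    endif;
    retp(x .* s);
endp;
```

The same effect could also be achieved with the default-value feature of dynargsGet described below; the explicit count is most useful when the behavior, not just a value, should change.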
You can access optional arguments in GAUSS procedures by:

Requesting them with the dynargsGet function.
Passing the triple dots, ..., to another function.

The dynargsGet procedure returns specified optional arguments. It requires one input that specifies the index, or index range, of the optional arguments to return. For example:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
// Since ‘b’ and ‘c’ are not declared as inputs
// in the procedure declaration, they must be
// declared as local variables
local b, c, idx;
// The range of optional inputs we want to return
idx = { 1, 2 };
{ b, c } = dynargsGet(idx);
print "a = " a;
print "b = " b;
print "c = " c;
print "";
endp;
This time, the code will print out:
call 1
a = 9.00
b = {}
c = {}

call 2
a = 9.00
b = 3.14
c = {}

call 3
a = 9.00
b = 3.14
c = 2.54
As we see above, a is passed through as we would expect each time. In "call 1", neither b nor c were passed into myProc. We can see that GAUSS printed their contents as empty curly braces, {}. This indicates that they were empty matrices.
If an optional input argument requested by dynargsGet is not supplied, by default an empty matrix will be returned in its place. We can check for empty matrices with the GAUSS function isempty. For example:
// Create an empty matrix
z = {};
// Check to see if `z` is an empty matrix
if isempty(z);
print "z is an empty matrix";
else;
print "z is not an empty matrix";
endif;
As you may have guessed, the above code will print:
z is an empty matrix
While we can use isempty and if statements to handle the optional inputs, it is often simpler to set default values.
Fortunately, dynargsGet allows us to specify default values for the optional arguments which are not supplied. To specify default values with dynargsGet, simply add them as a comma-separated list of inputs after the required index input.
There are a few things to remember when specifying default values:
We can set default values for our toy example, like this:
a = 9;
b = 3.14;
c = 2.54;
print "call 1";
myProc(a);
print "call 2";
myProc(a, b);
print "call 3";
myProc(a, b, c);
// Procedure declaration with optional inputs
proc (0) = myProc(a, ...);
// Since ‘b’ and ‘c’ are not declared as inputs
// in the procedure declaration, they must be
// declared as local variables
local b, c, idx;
// The range of optional inputs we want to return
idx = { 1, 2 };
// If ‘b’ or ‘c’ are not passed in, return
// 29 or 33 respectively
{ b, c } = dynargsGet(idx, 29, 33);
print "a = " a;
print "b = " b;
print "c = " c;
print "";
endp;
This time our code will return:
call 1
a = 9.00
b = 29.00
c = 33.00

call 2
a = 9.00
b = 3.14
c = 33.00

call 3
a = 9.00
b = 3.14
c = 2.54
While any input arguments accessed by dynargsGet can be passed to other procedures, it is important to know that the triple dots, ..., can also be passed directly into other procedures.
Passing the triple dots, ..., to another procedure is just as if you passed a comma-separated list of the arguments in .... Note that the triple dots, ..., do not have to be the final input when passed to other procedures.
For example:
X_1 = initMat1(3.14, 5, 2);
X_2 = initMat2(3.14, 5, 2);
// Procedure declaration with only optional inputs
proc (1) = initMat1(val, ...);
local X;
X = val + zeros(...);
retp(X);
endp;
// Procedure declaration with only required inputs
proc (1) = initMat2(val, r, c);
local X;
X = val + zeros(r, c);
retp(X);
endp;
The calls to initMat1 and initMat2 will behave the same.
Now that we’ve seen the basics of using optional inputs to GAUSS procedures, let’s work through a more practical example. We’ll create two procedures that will work together to simulate a linear model.
Our first procedure will create a random normal matrix, with the option to specify the mean and standard deviation of the variables.
/*
** Inputs: r  Required input. Number of rows.
** c  Required input. Number of columns.
** mu  Optional input. Mean of the simulated columns.
** Default = 0.
** sd  Optional input. Standard deviation of the
** simulated columns. Default = 1.
*/
proc (1) = rndNormal(r, c, ...);
local X, mean, sd;
// If a 3rd input is passed in, assign it
// to ‘mean’. Otherwise set ‘mean’ equal to 0.
// If a 4th input is passed in, assign it
// to `sd`. Otherwise set ‘sd’ equal to 1.
{ mean, sd } = dynargsGet(1|2, 0, 1);
// Compute a random matrix with the
// specified mean and sd.
X = (rndn(r, c) .* sd) + mean;
retp(X);
endp;
Here are a few examples of how we could call this procedure:
// Create a 100x4 random normal matrix
// with mean=0 and sd=1
X_1 = rndNormal(100, 4);
// Create a 130x2 random normal matrix
// with mean=3.7 and sd=1
X_2 = rndNormal(130, 2, 3.7);
// Create a 74x5 random normal matrix
// with mean=2 and sd=3
X_3 = rndNormal(74, 5, 2, 3);
Next, we'll create a procedure that will use our rndNormal procedure to simulate linear models.
/*
** Inputs: b_true  Required input. True parameter values of
** the simulated linear model.
** nobs  Required input. The number of observations
** to simulate.
** alpha_true  Optional input. True intercept of the
** simulated linear model. Default = 0.
** err_sd  Optional input. Standard deviation of the
** error term. Default = 1.
*/
proc (1) = simulateLM(b_true, nobs, ...);
local X, err, nvars, alpha_plus_err, y;
// Find number of desired variables
nvars = rows(b_true);
// Simulate variables with
// mean = 0, sd = 1
X = rndNormal(nobs, nvars);
// Simulate error term and add intercept
// if it was passed in.
alpha_plus_err = rndNormal(nobs, 1, ...);
y = X * b_true + alpha_plus_err;
retp(y);
endp;
Here are a few examples of how our simulateLM procedure can be used:
b = { 0.8, 1.1, 0.2 };
n = 10;
/*
** Simulate linear model:
**  Without intercept
**  With error term sd equal to 1
*/
y_1 = simulateLM(b, n);
/*
** Simulate linear model:
**  With intercept
**  With error term sd equal to 1
*/
alpha = 2.3;
y_2 = simulateLM(b, n, alpha);
/*
** Simulate linear model:
**  With intercept
**  With error term sd equal to 3
*/
alpha = 2.3;
err_sd = 3;
y_3 = simulateLM(b, n, alpha, err_sd);
Congratulations! You should now be able to start creating GAUSS procedures with optional input arguments. You have learned how to:

Add optional arguments by adding the triple dots, ..., to the procedure declaration.
Count the number of optional arguments passed to a procedure with dynargsCount.
Access the optional arguments with dynargsGet.

The post The Basics of Optional Arguments in GAUSS Procedures appeared first on Aptech.
The post A Guide to Conducting Cointegration Tests appeared first on Aptech.
Cointegration is an important tool for modeling the long-run relationships in time series data. If you work with time series data, you will likely find yourself needing to use cointegration at some point.
This blog provides an in-depth introduction to cointegration and will cover all the nuts and bolts you need to get started. In particular, we will look at:
Though not necessary, you may find it helpful to review the blogs on time series modeling and unit root testing before continuing with this blog.
Economic theory suggests that many time series datasets will move together, fluctuating around a long-run equilibrium. In econometrics and statistics, this long-run equilibrium is tested and measured using the concept of cointegration.
Cointegration occurs when two or more nonstationary time series move together around a shared long-run equilibrium, so that some linear combination of the series is stationary.
Field  Supporting Theory  Time Series 
Economics  The permanent income hypothesis describes how agents spread their consumption out over their lifetime based on their expected income.  Consumption and income. 
Economics  Purchasing power parity is a theory that relates the prices of a basket of goods across different countries.  Nominal exchange rates and domestic and foreign prices. 
Finance  The present value model of stock prices implies a long-run relationship between stock prices and their dividends or earnings.  Stock prices and stock dividends/earnings. 
Epidemiology  Joint mortality models imply a long-run relationship between mortality rates across different demographics.  Male and female mortality rates. 
Medicine  Time series methodologies have been used to examine comorbidities of different types of cancers and trends in medical welfare.  Occurrence rates of different types of cancer. 
To understand the mathematics of cointegration, let's consider a group of time series, $Y_t$, which is composed of three separate time series:
$$y_1 = (y_{11}, y_{12}, \ldots, y_{1t})$$ $$y_2 = (y_{21}, y_{22}, \ldots, y_{2t})$$ $$y_3 = (y_{31}, y_{32}, \ldots, y_{3t})$$
All three series are nonstationary time series.
Cointegration implies that while $y_1$, $y_2$, and $y_3$ are independently nonstationary, they can be combined in a way that their linear combination is stationary:
$$\beta Y_t = \beta_1 y_{1t} + \beta_2 y_{2t} + \beta_3 y_{3t} \sim I(0)$$
The Cointegrating Vector
In the context of cointegration, $\beta$ is commonly known as the cointegrating vector. This vector defines the long-run equilibrium relationship among the series in $Y_t$.

Normalization

Because there can be multiple cointegrating vectors that fit the same economic model, we must impose identification restrictions to normalize the cointegrating vector for estimation.
A common normalization of the cointegrating vector is to set $\beta = (1, -\beta_2, \ldots, -\beta_N)$. For example, applying these restrictions to our earlier system yields

$$\beta Y_t = y_{1t} - \beta_2 y_{2t} - \beta_3 y_{3t} \sim I(0)$$
Part of the appeal of this normalization is that it can be rewritten in a standard regression form
$$y_{1t} = \beta_2 y_{2t} + \beta_3 y_{3t} + u_t$$
where $u_t$ is a stationary cointegrating error component. Intuitively, $u_t$ can be thought of as short-term deviations from the long-run equilibrium.
While the regression format is a common normalization, it is important to remember that economic theory should inform our identifying restrictions.
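To make the stationary-combination idea concrete, here is a small GAUSS simulation sketch. This is our own illustrative code, not from the original post, with an assumed cointegrating coefficient of 2:

```
// Simulate a bivariate cointegrated system
rndseed 8675309;
T = 500;

// y2 is a pure random walk, so it is nonstationary
y2 = cumsumc(rndn(T, 1));

// y1 inherits y2's stochastic trend plus a stationary error
u = rndn(T, 1);
y1 = 2*y2 + u;

// The combination y1 - 2*y2 = u is stationary,
// so (1, -2) is a cointegrating vector for (y1, y2)
z = y1 - 2*y2;
```

Plotting y1 and y2 would show two wandering series that never drift far apart, while z fluctuates around zero.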
Cointegration implies that the time series will be connected through an error correction model. The error correction model is important in time series analysis because it allows us to better understand long-run dynamics. Additionally, failing to properly model cointegrated variables can result in biased estimates.
The error correction model:
Let's assume that there is a bivariate cointegrated system with $Y_t = (y_{1t}, y_{2t})$ and a cointegrating vector $\beta = (1, \beta_2)$ such that
$$\beta Y_{t} = y_{1t} - \beta_2 y_{2t}$$
The error correction model depicts the dynamics of a variable as a function of the deviations from long-run equilibrium:
$$\Delta y_{1t} = c_1 + \alpha_1 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j} + \epsilon_{1t}$$ $$\Delta y_{2t} = c_2 + \alpha_2 (y_{1,t-1} - \beta_2 y_{2,t-1}) + \sum_j \psi^j_{21} \Delta y_{1,t-j} + \sum_j \psi^j_{22} \Delta y_{2,t-j} + \epsilon_{2t}$$
Term  Description  Intuition 

$y_{1,t-1} - \beta_2 y_{2,t-1}$  Cointegrated long-run equilibrium  Because this is an equilibrium relationship, it plays a role in the dynamic paths of both $y_{1t}$ and $y_{2t}$. 
$\alpha_1$, $\alpha_2$  Adjustment coefficients  Captures the reactions of $y_{1t}$ and $y_{2t}$ to disequilibrium. 
$\sum_j \psi^j_{11} \Delta y_{1,t-j} + \sum_j \psi^j_{12} \Delta y_{2,t-j}$  Autoregressive distributed lags  Captures additional dynamics. 
If the cointegrating vector has been previously estimated, then standard OLS or DOLS can be used to estimate the error correction relationship. In this case:
The ECM relationship can be estimated using OLS, seemingly unrelated regressions (SUR), or maximum likelihood estimation.
The vector error correction model (VECM) is the multivariate extension of the ECM. If we are working in a vector autoregressive context, cointegration implies a VECM such that
$$\Delta Y_t = \Phi D_t + \Pi Y_{t-1} + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \epsilon_t$$
Like the ECM, the VECM parameters reflect the long-run and short-run dynamics of the system, as shown in the table below:
Term  Description  Intuition 

$\Pi$  Long-run impact matrix.  $\Pi = \Pi_1 + \Pi_2 + \cdots + \Pi_p - I_n$, captures adjustments towards the long-run equilibrium and contains the cointegrating relationships. 
$\Gamma_k$  Short-run impact matrix.  The short-run impact matrix is constructed from $-\sum_{j=k+1}^p \Pi_j$ and captures short-run deviations from the equilibrium. 
$D_t$  Deterministic terms.  These terms take the form $D_t = u_0 + u_1 t$ where $u_0$ is the constant component and $u_1 t$ is the trend component. 
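A useful fact tying the table back to the ECM, stated here in standard notation rather than taken from the original text: when the system is cointegrated with $0 < \operatorname{rank}(\Pi) = r < n$, the long-run impact matrix factors as

$$\Pi = \alpha \beta'$$

where $\beta$ is the $n \times r$ matrix of cointegrating vectors and $\alpha$ is the $n \times r$ matrix of adjustment coefficients, mirroring the roles these terms play in the bivariate ECM above.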
The VECM model can be estimated using the Johansen method:
Before jumping directly to cointegration testing, there are a number of other time series modeling steps that we should consider first.
One of the key considerations prior to testing for cointegration is whether there is theoretical support for the cointegrating relationship. It is important to remember that cointegration occurs when separate time series share an underlying stochastic trend. The idea of a shared trend should be supported by economic theory.
As an example, consider growth theory which suggests that productivity is a key driver of economic growth. As such, it acts as the common trend, driving the comovements of many indicators of economic growth. Hence, this theory implies that consumption, investment, and income are all cointegrated.
One of the first steps in time series modeling should be data visualization. Time series plots provide good preliminary insights into the behavior of time series data:
We've established that cointegration occurs between nonstationary, I(1), time series. This implies that before testing for or estimating a cointegrating relationship, we should perform unit root testing.
Our previous blog, "How to Conduct Unit Root Testing in GAUSS", provides an indepth look at how to perform unit root testing in GAUSS.
GAUSS tools for performing unit root tests are available in a number of libraries, including the Time Series MT (TSMT), the open-source TSPDLIB, and the coint libraries. All of these can be directly located and installed using the GAUSS package manager.
Full example programs for testing for unit roots using TSMT procedures and TSPDLIB procedures are available on our Aptech GitHub page.
Panel Data Unit Root Test  TSMT procedure  TSPDLIB procedure 

Hadri  hadri  
Im, Pesaran, and Shin  ips  
Levin-Lin-Chu  llc  
Schmidt and Perron LM test  lm  
Breitung and Das  breitung  
Cross-sectionally augmented IPS test (CIPS)  cips  
Panel analysis of nonstationarity in idiosyncratic and common components (PANIC)  bng_panic 
Time Series Unit Root Test  TSMT procedure  TSPDLIB procedure 

Augmented Dickey-Fuller  vmadfmt  adf 
Phillips-Perron  vmppmt  pp 
KPSS  kpss  lmkpss 
Schmidt and Perron LM test  lm  
GLS-ADF  dfgls  dfgls 
Quantile ADF  qr_adf 
A complete time series analysis should consider the possibility that structural breaks have occurred. In the case that structural breaks have occurred, standard tests for cointegration are invalid.
Therefore, it is important to:
The GAUSS sbreak procedure, available in TSMT, is an easy-to-use tool for identifying multiple, unknown structural breaks.
In order to test for cointegration, we must test that a long-run equilibrium exists for a group of data. There are a number of things that need to be considered:
In this section, we will show how to use these questions to guide cointegration testing without structural breaks.
The Engle-Granger cointegration test considers the case that there is a single cointegrating vector. The test follows the very simple intuition that if variables are cointegrated, then the residual of the cointegrating regression should be stationary.
Forming the cointegrating residual
How to form the cointegrating residual depends on whether the cointegrating vector is known or must be estimated:
If the cointegrating vector is known, the cointegrating residuals are directly computed using $u_t = \beta Y_t$. The residuals should be stationary and:
If the cointegrating vector is unknown, OLS is used to estimate the normalized cointegrating vector from the regression $$y_{1t} = c + \beta y_{2t} + u_{t}$$
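The two-step idea can be sketched as follows. This is an illustrative Python mock-up on simulated data (the true cointegrating coefficient is set to 2 here, and all names are invented); in practice, step 2 applies an ADF-type test with Engle-Granger critical values to the residuals:

```python
import random
import statistics

random.seed(1)
# Simulated example: y2 is a random walk, y1 tracks it in the long run
trend, y1, y2 = 0.0, [], []
for _ in range(1000):
    trend += random.gauss(0, 1)
    y2.append(trend)
    y1.append(1.0 + 2.0 * trend + random.gauss(0, 1))

# Step 1: OLS of y1 on y2 estimates the normalized cointegrating vector
m1, m2 = statistics.fmean(y1), statistics.fmean(y2)
beta = (sum((a - m2) * (b - m1) for a, b in zip(y2, y1))
        / sum((a - m2) ** 2 for a in y2))
c = m1 - beta * m2
u = [b - c - beta * a for a, b in zip(y2, y1)]  # cointegrating residuals

# Step 2 (not shown): test u for a unit root, e.g. with an ADF test
```

Because the regressor is trending, the OLS estimate of the cointegrating coefficient is super-consistent, so even this simple regression recovers a value very close to 2.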
There are two Johansen cointegrating tests for the VECM context, the trace test and the maximal eigenvalue test. These tests hinge on the intuition that in the VECM, the rank of the long-run impact matrix, $\Pi$, determines if the VAR(p) variables are cointegrated.
Since the rank of the long-run impact matrix equals the number of cointegrating relationships:
The Johansen testing process has two general steps:
The Johansen Trace Statistics
The Johansen trace statistic:
The Johansen testing procedure sequentially tests the null hypothesis that the number of cointegrating vectors is $k = m$ against the alternative that $k > m$.
Stage  Null Hypothesis  Alternative  Conclusion 

One  $H_0: k = 0$  $H_A: k>0$  If $H_0$ cannot be rejected, stop testing, and $k = 0$. If null is rejected, perform next test. 
Two  $H_0: k \leq 1$  $H_A: k>1$  If $H_0$ cannot be rejected, stop testing, and $k \leq 1$. If null is rejected, perform next test. 
Three  $H_0: k \leq 2$  $H_A: k>2$  If $H_0$ cannot be rejected, stop testing, and $k \leq 2$. If null is rejected, perform next test. 
m-1  $H_0: k \leq m-1$  $H_A: k>m-1$  If $H_0$ cannot be rejected, stop testing, and $k \leq m-1$. If null is rejected, conclude $k = m$ (full rank). 
The test statistic follows a nonstandard distribution and depends on the dimension and the specified deterministic trend.
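The sequential procedure above amounts to a simple stopping rule. A hypothetical sketch (all names invented): given the test statistics and critical values ordered by $m = 0, 1, \ldots$, return the first rank whose null cannot be rejected:

```python
def johansen_rank(stats, crit_values):
    """Sequential Johansen decision rule.

    stats[m] and crit_values[m] correspond to H0: k <= m.
    Stop at the first null hypothesis that cannot be rejected."""
    for m, (stat, cv) in enumerate(zip(stats, crit_values)):
        if stat < cv:          # cannot reject H0: k <= m
            return m
    return len(stats)          # every null rejected: full rank

# Example with made-up numbers: reject H0: k = 0, cannot reject H0: k <= 1
print(johansen_rank([25.3, 4.1], [15.5, 9.2]))  # prints 1
```

Note that the critical values here are placeholders; in practice they depend on the system dimension and the deterministic trend specification, as discussed next.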
The Johansen Maximum Eigenvalue Statistic
The maximal eigenvalue statistic:
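In standard notation (not from the original text), with $\hat{\lambda}_1 \geq \cdots \geq \hat{\lambda}_n$ the estimated eigenvalues and $T$ the sample size, the two statistics are

$$LR_{trace}(r) = -T \sum_{i=r+1}^{n} \ln(1 - \hat{\lambda}_i)$$

$$LR_{max}(r) = -T \ln(1 - \hat{\lambda}_{r+1})$$

The trace statistic tests $H_0: k \leq r$ against $H_A: k > r$, while the maximal eigenvalue statistic tests $H_0: k = r$ against the sharper alternative $H_A: k = r + 1$.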
In the case that there are structural breaks in the cointegrating relationship, the cointegration tests in the previous section should not be used. In this section, we look at three tests for cointegration with structural breaks:
The Gregory and Hansen (1996) cointegration test is a residual-based cointegration test which tests the null hypothesis of no cointegration against the alternative of cointegration in the presence of a single regime shift.
The Gregory and Hansen (1996) test:
Because the structural break date is unknown, the test computes the cointegration test statistic for each possible breakpoint, and the smallest test statistic is used.
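That search over break dates can be sketched as follows. Here stat_at_break is a hypothetical stand-in for the full cointegration statistic recomputed with a break dummy at observation tb, and the 15% trimming matches the trimm setting used in the GAUSS example later:

```python
def min_over_breaks(T, stat_at_break, trim=0.15):
    # Evaluate the test statistic at every admissible break date,
    # excluding the first and last `trim` share of the sample
    lo, hi = int(trim * T), int((1 - trim) * T)
    return min((stat_at_break(tb), tb) for tb in range(lo, hi))

# Toy statistic that is most negative at observation 500
stat, tb = min_over_breaks(1000, lambda tb: abs(tb - 500) / 100 - 3)
print(stat, tb)  # -3.0 500
```

The minimizing statistic is then compared against critical values that account for the search, which is why the Gregory-Hansen critical values differ from the standard Engle-Granger ones.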
Gregory and Hansen (1996) suggest running their tests in combination with the standard cointegration tests:
The Hatemi-J cointegration test is an extension of the Gregory and Hansen cointegration test. It allows for two possible structural breaks with unknown timing.
The Maki cointegration test builds on the Gregory and Hansen and the Hatemi-J cointegration tests to allow for an unknown number of structural breaks.
GAUSS tools for performing cointegration tests and estimating VECM models are available in a number of libraries, including the Time Series MT (TSMT), TSPDLIB, and coint libraries. All of these can be directly located and installed using the GAUSS package manager.
Cointegration test  Null Hypothesis  Decision Rule  GAUSS library 

Engle-Granger (ADF)  No cointegration.  Reject the null hypothesis if the $ADF$ test statistic is less than the critical value.  TSMT, tspdlib, coint 
Phillips  No cointegration.  Reject the null hypothesis if the $Z$ test statistic is less than the critical value.  coint, tspdlib 
Stock and Watson common trend  $Y$ is a noncointegrated system after allowing for the pth order polynomial common trend.  Reject the null hypothesis if the $SW$ test statistic is less than the critical value.  coint 
Phillips and Ouliaris  $Y$ and $X$ are not cointegrated.  Reject the null hypothesis if the $P_u$ or $P_z$ statistic is greater than the critical value.  coint, tspdlib 
Johansen trace  Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is greater than $r$.  Reject the null hypothesis if $LM_{max}(k)$ is greater than the critical value.  TSMT, coint 
Johansen maximum eigenvalue  Rank of $\Pi$ is equal to $r$ against the alternative that the rank of $\Pi$ is equal to $r+1$.  Reject the null hypothesis if $LM(r)$ is greater than the critical value.  TSMT, coint 
Gregory and Hansen  No cointegration against the alternative of cointegration with one structural break.  Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ are less than the critical value.  tspdlib 
Hatemi-J  No cointegration against the alternative of cointegration with two structural breaks.  Reject the null hypothesis if $ADF$, $Z_{\alpha}$, or $Z_t$ are less than the critical value.  tspdlib 
Shin test  Cointegration.  Reject the null hypothesis if the test statistic is less than the critical value.  tspdlib 
In this section, we will test for cointegration between monthly gold and silver prices, using historic monthly price data starting in 1915. Specifically, we will work through several stages of analysis:
As a first step, we will create a time series graph of our data. This allows us to visually examine the deterministic trends in our data.
From our graphs, we can draw some preliminary conclusions about the dynamics of gold and silver prices over our time period:
Before testing if silver and gold prices are cointegrated, we should test if the series have unit roots. We can do this using the unit roots tests available in the TSMT and tspdlib libraries.
Gold monthly closing prices (2015-2020)  

Time Series Unit Root Test  Test Statistic  Conclusion 
Augmented Dickey-Fuller  -1.151  Cannot reject the null hypothesis of unit root. 
Phillips-Perron  -1.312  Cannot reject the null hypothesis of unit root. 
KPSS  2.102  Reject the null hypothesis of stationarity at the 1% level. 
Schmidt and Perron LM test  -2.399  Cannot reject the null hypothesis of unit root. 
GLS-ADF  -0.980  Cannot reject the null hypothesis of unit root. 
Silver monthly closing prices (2015-2020)  

Time Series Unit Root Test  Test Statistic  Conclusion 
Augmented Dickey-Fuller  -5.121  Reject the null hypothesis of unit root at the 1% level. 
Phillips-Perron  -5.446  Reject the null hypothesis of unit root at the 1% level. 
KPSS  0.856  Reject the null hypothesis of stationarity at the 1% level. 
Schmidt and Perron LM test  -4.729  Reject the null hypothesis of unit root at the 1% level. 
GLS-ADF  -4.895  Reject the null hypothesis of unit root at the 1% level. 
These results provide evidence that gold prices are nonstationary but suggest that silver prices are stationary. At this point, we would likely not proceed with cointegration testing, or we may wish to perform additional unit root testing. For example, we may want to perform unit root tests that allow for structural breaks.
The GAUSS code for the tests in this section is available here.
Now, let's test for cointegration without structural breaks using two different tests, the Johansen tests and the Engle-Granger test.
The Johansen Tests
We will use the vmsjmt procedure from the TSMT library. This procedure should be used with the vmc_sjamt and vmc_sjtmt procedures, which find the critical values for the Maximum Eigenvalue and Trace statistics, respectively.
The vmsjmt procedure requires four inputs:
The vmsjmt procedure returns both the Johansen Trace and the Johansen Maximum Eigenvalue statistic. In addition, it returns the associated eigenvalues and eigenvectors.
new;
// Load tsmt library
library tsmt;
// Set filename (with path) for loading
fname2 = __FILE_DIR $+ "commodity_mon.dat";
// Load real prices data
y_test_real = loadd(fname2, "P_gold_real + P_silver_real");
// No deterministic component in
// the fitted regression
p = -1;
// Set number of lagged differences
// for computing estimator
k = 2;
// No determinant
no_det = 0;
{ ev, evec, trace_stat, max_ev } = vmsjmt(y_test_real, p, k, no_det);
cv_lr2 = vmc_sjamt(cols(y_test_real), p);
cv_lr1 = vmc_sjtmt(cols(y_test_real), p);
Both trace_stat and max_ev will contain statistics for all possible ranks of $\Pi$.
For example, since we are testing for cointegration of just two time series, there will be at most one cointegrating vector. This means trace_stat and max_ev will be 2 x 1 matrices, testing both null hypotheses that $r=0$ and $r=1$.
Test  Test Statistic  10% Critical Value  Conclusion 

Johansen Trace Statistic  $$H_0: r=1, 56.707$$ $$H_0: r=0, 0.0767$$  10.46  Cannot reject the null hypothesis that $r=0$. 
Johansen Maximum Eigenvalue  $$H_0: r=1, 56.631$$ $$H_0: r=0, 0.0766$$  9.39  Cannot reject the null hypothesis that $r=0$. 
These results indicate that there is no cointegration between monthly gold and silver prices. This should not be a surprise, given the results of our unit root testing.
Note that statistics for all possible ranks are returned by vmsjmt. However, the Johansen tests are sequential tests and since we cannot reject the null that $k=0$ for either test, we technically do not need to continue testing for any higher rank.
The Engle-Granger Test
Since there is only one possible cointegrating vector for this system, we could have also used the Engle-Granger test for cointegration. This test can be implemented using the coint_egranger procedure from the TSPDLIB library.
The coint_egranger procedure requires five inputs:
The coint_egranger procedure returns the test statistic along with the 1%, 5%, and 10% critical values.
Using the data already loaded in the previous example:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// No constant or trend
model = 0;
{ tau0, cvADF0 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant
model = 1;
{ tau1, cvADF1 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
// Constant and trend
model = 2;
{ tau2, cvADF2 } = coint_egranger(y_test_real[., 1], y_test_real[., 2], model, pmax, ic);
Test  Test Statistic  10% Critical Value  Conclusion 

Engle-Granger, no constant  -3.094  -2.450  Reject the null of no cointegration at the 10% level. 
Engle-Granger, constant  -1.609  -3.066  Cannot reject the null hypothesis of no cointegration. 
Engle-Granger, constant and trend  -2.327  -3.518  Cannot reject the null hypothesis of no cointegration. 
These results provide evidence for our conclusion that there is no cointegration between gold and silver prices. Note, however, that these results are not conclusive and depend on whether we include a constant. This sheds light on the importance of including the correct deterministic components in our model.
To be thorough, we should also test for cointegration using tests that allow for a structural break. As an example, let's use the Gregory-Hansen test to compare the null hypothesis of no cointegration against the alternative that there is cointegration with one structural break.
This test can be implemented using the coint_ghansen procedure from the TSPDLIB library.
The coint_ghansen procedure requires eight inputs:
/*
** Information Criterion:
** 1=Akaike;
** 2=Schwarz;
** 3=tstat sign.
*/
ic = 2;
// Maximum number of lags
pmax = 12;
// Trimming rate
trimm = 0.15;
// Long-run consistent variance estimation method
varm = 3;
// Bandwidth for kernel estimator
T = rows(y_test_real);
bwl = round(4 * (T/100)^(2/9));
// Level shift
model = 1;
{ ADF_min1, TBadf1, Zt_min1, TBzt1, Za_min1, TBza1, cvADFZt1, cvZa1 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Level shift with trend
model = 2;
{ ADF_min2, TBadf2, Zt_min2, TBzt2, Za_min2, TBza2, cvADFZt2, cvZa2 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift
model = 3;
{ ADF_min3, TBadf3, Zt_min3, TBzt3, Za_min3, TBza3, cvADFZt3, cvZa3 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
// Regime shift with trend
model = 4;
{ ADF_min4, TBadf4, Zt_min4, TBzt4, Za_min4, TBza4, cvADFZt4, cvZa4 }=
coint_ghansen(y_test_real[., 1], y_test_real[., 2], model, bwl, ic, pmax, varm, trimm);
The coint_ghansen procedure has eight returns:
Test  $ADF$ Test Statistic  $Z_t$ Test Statistic  $Z_{\alpha}$ Test Statistic  10% Critical Value $ADF$,$Z_t$  10% Critical Value $Z_{\alpha}$  Conclusion 

Gregory-Hansen, Level shift  -3.887  -4.331  -39.902  -4.34  -36.19  Cannot reject the null of no cointegration for $ADF$ and $Z_t$. Reject the null of no cointegration at the 10% level for $Z_{\alpha}$. 
Gregory-Hansen, Level shift with trend  -3.915  -5.010  -50.398  -4.72  -43.22  Reject the null of no cointegration at the 10% level for $Z_{\alpha}$ and $Z_t$. Cannot reject the null for the $ADF$ test. 
Gregory-Hansen, Regime change  -5.452  -6.4276  -80.379  -4.68  -41.85  Reject the null of no cointegration at the 10% level. 
Gregory-Hansen, Regime change with trend  -6.145  -7.578  -106.549  -5.24  -53.31  Reject the null of no cointegration at the 10% level. 
Note that the results of our tests for cointegration with one structural break are inconsistent and depend on which type of structural break we include in our model. This provides some indication that further exploration of structural breaks is needed.
Some examples of additional steps that we could take:
Congratulations! You now have an established guide for cointegration and the background you need to perform cointegration testing.
In particular, today's blog covered:
The post A Guide to Conducting Cointegration Tests appeared first on Aptech.