Programming – Aptech

MLE with Bounded Parameters: A Cleaner Approach

admin — Wed, 08 Apr 2026 17:56:17 +0000

Introduction

It's natural in data analysis applications for parameters to have bounds; variances can't be negative, GARCH coefficients must sum to less than one for stationarity, and mixing proportions live between zero and one.

When you estimate these models by maximum likelihood, the optimizer needs to respect those bounds, not just at the solution, but throughout the search. If the search wanders into invalid territory, it can undermine the reliability and convergence of your results. For example, you may get complex numbers from negative variances, explosive forecasts from non-stationary GARCH, or likelihoods that make no sense.

GAUSS 26.0.1 introduces minimize, the first new GAUSS optimizer in over 10 years, to handle this cleanly.

The minimize optimizer lets you specify bounds directly, and GAUSS internally keeps parameters feasible at every iteration. No more log-transforms, no penalty functions, and no double-checking.

In today's blog, we'll see the new minimize function in action, as we walk through two examples:

A GARCH estimation where variance parameters must be positive.
A stochastic frontier model where both variance components must be positive.

In both cases, bounded optimization makes estimation easier and aligns results with theory.

Why Bounds Matter

To see why this matters in practice, let’s look at a familiar example. Consider a GARCH(1,1) model:

$\sigma^2_t = \omega + \alpha \varepsilon^2_{t-1} + \beta \sigma^2_{t-1}$

For this model to be well-defined and economically meaningful:

The baseline variance must be positive ($\omega \gt 0$)
Shocks and persistence must contribute non-negatively to variance ($\alpha \geq 0$, $\beta \geq 0$)
The model must be stationary ($\alpha + \beta \lt 1$)
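To see where the stationarity condition comes from, one can take unconditional expectations of the variance equation. This is a short sketch, assuming $E[\varepsilon^2_{t-1}] = E[\sigma^2_{t-1}] = \bar\sigma^2$, where $\bar\sigma^2$ is a symbol introduced here for the long-run variance:

```latex
\begin{aligned}
\bar\sigma^2 &= \omega + \alpha\,\bar\sigma^2 + \beta\,\bar\sigma^2 \\
\bar\sigma^2 &= \frac{\omega}{1 - \alpha - \beta}
\end{aligned}
```

The long-run variance is finite and positive only when $\omega \gt 0$ and $\alpha + \beta \lt 1$, which matches the conditions listed above.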

The traditional workaround is to estimate transformed parameters, $\log(\omega)$ instead of $\omega$, then convert back. This works, but it distorts the optimization surface and complicates standard error calculations. You're not estimating the parameters you care about; you're estimating transforms and hoping the numerics work out.
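As a sketch of the extra step the transform introduces, suppose the optimizer works with $\eta = \log(\omega)$ (the symbol $\eta$ is used here only for illustration). The standard error reported for $\hat\eta$ is not the standard error of $\hat\omega$, so a first-order (delta method) approximation is needed to convert it:

```latex
\hat\omega = \exp(\hat\eta), \qquad
\operatorname{se}(\hat\omega) \approx
\left| \frac{d \exp(\eta)}{d\eta} \right|_{\eta = \hat\eta} \operatorname{se}(\hat\eta)
= \hat\omega \, \operatorname{se}(\hat\eta)
```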

With bounded optimization, you estimate $\omega$, $\alpha$, and $\beta$ directly, with the optimizer respecting the constraints throughout.

Example 1: GARCH(1,1) on Commodity Returns

Let's estimate a GARCH(1,1) model on a dataset of 248 observations of commodity price returns (this data is included in the GAUSS 26 examples directory).

Step One: Data and Likelihood

First, we load the data and specify our log-likelihood objective function.

// Load returns data (ships with GAUSS)
fname = getGAUSShome("examples/df_returns.gdat");
returns = loadd(fname, "rcpi");

// GARCH(1,1) negative log-likelihood
proc (1) = garch_negll(theta, y);
    local omega, alpha, beta_, sigma2, ll, t;

    omega = theta[1];
    alpha = theta[2];
    beta_ = theta[3];

    sigma2 = zeros(rows(y), 1);

    // Initialize with sample variance
    sigma2[1] = stdc(y)^2;

    // Variance recursion
    for t (2, rows(y), 1);
        sigma2[t] = omega + alpha * y[t-1]^2 + beta_ * sigma2[t-1];
    endfor;

    // Gaussian log-likelihood
    ll = -0.5 * sumc(ln(2*pi) + ln(sigma2) + (y.^2) ./ sigma2);

    retp(-ll);  // Return negative for minimization
endp;
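Written out, the objective that garch_negll returns appears to be the following, assuming (as the code does) that returns have mean zero so that $\varepsilon_t = y_t$, and with $s^2_y$ denoting the sample variance used to start the recursion:

```latex
\begin{aligned}
\sigma^2_1 &= s^2_y, \qquad
\sigma^2_t = \omega + \alpha\, y^2_{t-1} + \beta\, \sigma^2_{t-1}, \quad t = 2, \dots, T \\
-\ell(\omega, \alpha, \beta) &= \frac{1}{2} \sum_{t=1}^{T}
\left[ \ln(2\pi) + \ln \sigma^2_t + \frac{y^2_t}{\sigma^2_t} \right]
\end{aligned}
```

Every term requires $\sigma^2_t \gt 0$, so a single invalid variance makes the whole objective undefined.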

Step Two: Setting Up Optimization

Now we set up the bounded optimization with:

$\omega \gt 0$ (small positive lower bound to avoid numerical issues)
$\alpha \geq 0$
$\beta \geq 0$

Because minimize handles only simple box constraints, we cannot impose the joint condition $\alpha + \beta \lt 1$ directly. Instead, we impose individual upper bounds on $\alpha$ and $\beta$ to keep the optimizer in a reasonable region, and we verify the stationarity condition after estimation.

// Starting values
theta0 = { 0.00001,   // omega (small, let data speak)
           0.05,      // alpha
           0.90 };    // beta

// Set up minimize
struct minimizeControl ctl;
ctl = minimizeControlCreate();

// Bounds: box constraints on each parameter
// (alpha + beta < 1 is checked after estimation)
ctl.bounds = { 1e-10      1,      // omega in [1e-10, 1]
               0          1,      // alpha in [0, 1]
               0     0.9999 };    // beta in [0, 0.9999]

We cap $\beta$ slightly below 1 to avoid numerical issues near the boundary, where the likelihood surface can become flat and unstable.

Step Three: Running the Model

Finally, we call minimize to run our model.

// Estimate
struct minimizeOut out;
out = minimize(&garch_negll, theta0, returns, ctl);

Results and Visualization

After estimation, we'll extract the conditional variance series and confirm the stationarity condition:

// Extract estimates
omega_hat = out.x[1];
alpha_hat = out.x[2];
beta_hat = out.x[3];

print "omega = " omega_hat;
print "alpha = " alpha_hat;
print "beta  = " beta_hat;
print "alpha + beta = " alpha_hat + beta_hat;
print "Iterations: " out.iterations;

Output:

omega = 0.0000070
alpha = 0.380
beta  = 0.588

alpha + beta = 0.968
Iterations: 39

There are a few noteworthy results:

The high persistence ($\alpha + \beta \approx 0.97$) means volatility shocks decay slowly.
The relatively high $\alpha$ (0.38) indicates that recent shocks have substantial immediate impact on variance.
The optimization converged in 39 iterations with all parameters staying inside their bounds throughout. No invalid variance evaluations, no numerical exceptions.
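To put a number on "decay slowly", a common summary is the half-life of a variance shock, $\ln(0.5) / \ln(\alpha + \beta)$. The sketch below uses the rounded estimates printed above (the variable names are illustrative), so the results are approximate:

```gauss
// Rounded estimates copied from the printed output above
w = 0.0000070;   // omega
a = 0.380;       // alpha
b = 0.588;       // beta

persistence = a + b;

// Number of periods for a variance shock to fall to half its initial size
half_life = ln(0.5) / ln(persistence);

// Long-run variance implied by a stationary GARCH(1,1)
uncond_var = w / (1 - persistence);

print "Half-life (periods)    = " half_life;
print "Unconditional variance = " uncond_var;
```

With these inputs the half-life works out to roughly 21 periods.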

Visualizing the conditional variance alongside the original series provides further insight:

// Compute conditional variance series for plotting
T = rows(returns);
sigma2_hat = zeros(T, 1);
sigma2_hat[1] = stdc(returns)^2;

for t (2, T, 1);
    sigma2_hat[t] = omega_hat + alpha_hat * returns[t-1]^2 + beta_hat * sigma2_hat[t-1];
endfor;

// Plot returns and conditional volatility
struct plotControl plt;
plt = plotGetDefaults("xy");
plotSetTitle(&plt, "GARCH(1,1): Returns and Conditional Volatility");
plotSetYLabel(&plt, "Returns / Volatility");

plotLayout(2, 1, 1);
plotXY(plt, seqa(1, 1, T), returns);

plotLayout(2, 1, 2);
plotSetTitle(&plt, "Conditional Standard Deviation");
plotXY(plt, seqa(1, 1, T), sqrt(sigma2_hat));

The plot shows volatility clustering: periods of high volatility tend to persist, consistent with what we observe in commodity markets.

Example 2: Stochastic Frontier Model

Stochastic frontier analysis separates random noise from systematic inefficiency. It's widely used in productivity analysis to measure how far firms operate below their production frontier.

The model:

$y = X\beta + v - u$

where:

$v \sim N(0, \sigma^2_v)$ — symmetric noise (measurement error, luck)
$u \sim N^+(0, \sigma^2_u)$ — one-sided inefficiency (always reduces output)

Both variance components must be positive. If the optimizer tries $\sigma^2_v \lt 0$ or $\sigma^2_u \lt 0$, the likelihood involves square roots of negative numbers.

Step One: Data and Likelihood

For this example, we'll simulate data from a Cobb-Douglas production function with inefficiency. This keeps the example self-contained and lets you see exactly what's being estimated.

// Simulate production data
rndseed 8675309;
n = 500;

// Inputs (labor, capital, materials)
labor = exp(2 + 0.5*rndn(n, 1));
capital = exp(3 + 0.7*rndn(n, 1));
materials = exp(2.5 + 0.4*rndn(n, 1));

// True parameters
beta_true = { 1.5,    // constant
              0.4,    // labor elasticity
              0.3,    // capital elasticity
              0.25 }; // materials elasticity
sig2_v_true = 0.02;   // noise variance
sig2_u_true = 0.08;   // inefficiency variance

// Generate output with noise (v) and inefficiency (u)
v = sqrt(sig2_v_true) * rndn(n, 1);
u = sqrt(sig2_u_true) * abs(rndn(n, 1));  // half-normal

X = ones(n, 1) ~ ln(labor) ~ ln(capital) ~ ln(materials);
y = X * beta_true + v - u;  // inefficiency reduces output

After simulating our data, we specify the negative log-likelihood function for minimization:

// Stochastic frontier negative log-likelihood (half-normal inefficiency)
proc (1) = sf_negll(theta, y, X);
    local k, beta_, sig2_v, sig2_u, sigma, lambda;
    local eps, z, ll;

    k = cols(X);
    beta_ = theta[1:k];
    sig2_v = theta[k+1];
    sig2_u = theta[k+2];

    sigma = sqrt(sig2_v + sig2_u);
    lambda = sqrt(sig2_u / sig2_v);

    eps = y - X * beta_;
    z = -eps * lambda / sigma;

    ll = -0.5*ln(2*pi) + ln(2) - ln(sigma)
         - 0.5*(eps./sigma).^2 + ln(cdfn(z));

    retp(-sumc(ll));
endp;
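For reference, the per-observation log-likelihood coded in sf_negll can be written as follows, with $\varepsilon_i = y_i - x_i'\beta$, $\sigma^2 = \sigma^2_v + \sigma^2_u$, $\lambda = \sigma_u / \sigma_v$, and $\Phi$ the standard normal CDF. This is transcribed from the code above rather than derived separately:

```latex
\ln f(\varepsilon_i) = \ln 2 - \frac{1}{2}\ln(2\pi) - \ln\sigma
  - \frac{\varepsilon_i^2}{2\sigma^2}
  + \ln \Phi\!\left( -\frac{\varepsilon_i \lambda}{\sigma} \right)
```

Both $\sigma$ and $\lambda$ involve square roots of the variance parameters, and $\lambda$ divides by $\sigma^2_v$, so the expression is undefined if either variance is negative or if $\sigma^2_v$ reaches zero.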

Step Two: Setting Up Optimization

As we did in our previous example, we begin with our starting values. For this model, we run OLS and use the residual variance as starting values:

// OLS for starting values
beta_ols = invpd(X'X) * X'y;
resid = y - X * beta_ols;
sig2_ols = meanc(resid.^2);

// Starting values: Split residual variance 
// between noise and inefficiency
theta0 = beta_ols | (0.5 * sig2_ols) | (0.5 * sig2_ols);

We leave our coefficients unbounded but constrain the variances to be positive:

// Bounds: coefficients unbounded, variances positive
k = cols(X);
struct minimizeControl ctl;
ctl = minimizeControlCreate();
ctl.bounds = (-1e300 * ones(k, 1) | 0.001 | 0.001) ~ (1e300 * ones(k+2, 1));
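The one-line bounds expression can be easier to read when built in pieces. A minimal sketch, assuming the same two-column layout (lower bound, upper bound) used above; lb and ub are names introduced here:

```gauss
k = 4;   // number of regression coefficients in this example

// Lower bounds: effectively unbounded coefficients, then the two variances
lb = (-1e300 * ones(k, 1)) | 0.001 | 0.001;

// Upper bounds: effectively unbounded for every parameter
ub = 1e300 * ones(k + 2, 1);

// One row per parameter: column 1 = lower bound, column 2 = upper bound
bounds = lb ~ ub;

print "rows = " rows(bounds) " cols = " cols(bounds);
```

The result has k + 2 rows, matching the length of theta0, and can be assigned to ctl.bounds.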

Step Three: Running the Model

Finally, we call minimize to estimate our model:

// Estimate
struct minimizeOut out;
out = minimize(&sf_negll, theta0, y, X, ctl);

Results and Visualization

Now that we've estimated our model, let's examine our results.

// Extract estimates
k = cols(X);
beta_hat = out.x[1:k];
sig2_v_hat = out.x[k+1];
sig2_u_hat = out.x[k+2];

print "Coefficients:";
print "  constant     = " beta_hat[1];
print "  ln(labor)    = " beta_hat[2];
print "  ln(capital)  = " beta_hat[3];
print "  ln(materials)= " beta_hat[4];
print "";
print "Variance components:";
print "  sig2_v (noise)       = " sig2_v_hat;
print "  sig2_u (inefficiency)= " sig2_u_hat;
print "  ratio sig2_u/total   = " sig2_u_hat / (sig2_v_hat + sig2_u_hat);
print "";
print "Iterations: " out.iterations;

This prints out coefficients and variance components:

Coefficients:
  constant     = 1.51
  ln(labor)    = 0.39
  ln(capital)  = 0.31
  ln(materials)= 0.24

Variance components:
  sig2_v (noise)       = 0.022
  sig2_u (inefficiency)= 0.087
  ratio sig2_u/total   = 0.80

Iterations: 38

The estimates recover the true parameters reasonably well. The variance ratio ($\approx 0.80$) tells us that most residual variation is systematic inefficiency, not measurement error — an important finding for policy.

We can also compute and plot firm-level efficiency scores:

// Compute efficiency estimates (Jondrow et al. 1982)
eps = y - X * beta_hat;
sigma = sqrt(sig2_v_hat + sig2_u_hat);
lambda = sqrt(sig2_u_hat / sig2_v_hat);

mu_star = -eps * sig2_u_hat / (sig2_v_hat + sig2_u_hat);
sig_star = sqrt(sig2_v_hat * sig2_u_hat / (sig2_v_hat + sig2_u_hat));

// E[u|eps] - conditional mean of inefficiency
u_hat = mu_star + sig_star * (pdfn(mu_star/sig_star) ./ cdfn(mu_star/sig_star));

// Technical efficiency: TE = exp(-u)
TE = exp(-u_hat);

// Plot efficiency distribution
struct plotControl plt;
plt = plotGetDefaults("hist");
plotSetTitle(&plt, "Distribution of Technical Efficiency");
plotSetXLabel(&plt, "Technical Efficiency (1 = frontier)");
plotSetYLabel(&plt, "Frequency");
plotHist(plt, TE, 20);

print "Mean efficiency: " meanc(TE);
print "Min efficiency:  " minc(TE);
print "Max efficiency:  " maxc(TE);

Mean efficiency: 0.80
Min efficiency:  0.41
Max efficiency:  0.95

The histogram shows substantial variation in efficiency: some firms operate near the frontier (TE $\approx$ 0.95), while the least efficient produce less than half of their potential output (minimum TE $\approx$ 0.41). This is the kind of insight that drives productivity research.
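The efficiency scores come from the conditional-mean estimator cited in the code comment (Jondrow et al. 1982). Transcribed from the code, with $\phi$ and $\Phi$ the standard normal PDF and CDF and $\sigma^2 = \sigma^2_v + \sigma^2_u$:

```latex
\begin{aligned}
\mu^*_i &= -\frac{\varepsilon_i\, \sigma^2_u}{\sigma^2}, \qquad
\sigma^* = \sqrt{\frac{\sigma^2_v\, \sigma^2_u}{\sigma^2}} \\
E[u_i \mid \varepsilon_i] &= \mu^*_i + \sigma^*\,
\frac{\phi(\mu^*_i / \sigma^*)}{\Phi(\mu^*_i / \sigma^*)}, \qquad
TE_i = \exp\!\left( -E[u_i \mid \varepsilon_i] \right)
\end{aligned}
```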

Both variance estimates stayed positive throughout optimization. No log-transforms needed, and the estimates apply directly to the parameters we care about.

When to Use minimize

The minimize procedure is designed for one thing: optimization with bound constraints. If that's all you need, it's the right tool.

Situation	Recommendation
Parameters with simple bounds	`minimize`
Nonlinear constraints ($g(x) \leq 0$)	`sqpSolveMT`
Equality constraints	`sqpSolveMT`
Algorithm switching, complex problems	OPTMT

For the GARCH and stochastic frontier examples above — and most MLE problems where parameters have natural bounds — minimize handles it directly.

Conclusion

Bounded parameters show up constantly in econometric models: variances, volatilities, probabilities, shares. GAUSS 26.0.1 gives you a clean way to handle them with minimize. As we saw today, with minimize:

You set bounds in the control structure.
The optimizer respects bounds throughout the search, not just at the solution.
You need no log-transforms or penalty functions.
The procedure is included in base GAUSS.

If you've been working around parameter bounds with transforms or checking for invalid values inside your likelihood function, this is the cleaner path.

Why You Should Consider Constrained Maximum Likelihood MT (CMLMT)

Eric — Wed, 09 Apr 2025 13:49:48 +0000

Introduction

The Constrained Maximum Likelihood (CML) library was one of the original constrained optimization tools in GAUSS. Like many GAUSS libraries, it was later updated to an "MT" version.

The "MT" version libraries, named for their use of multi-threading, provide significant performance improvements, greater flexibility, and a more intuitive parameter-handling system.

This blog post explores:

The key features, differences, and benefits of upgrading from CML to CMLMT.
A practical example to help you transition code from CML to CMLMT.

Key Features Comparison

Before diving into the details of transitioning from CML to CMLMT, it’s useful to understand how these two libraries compare. The table below highlights key differences, from optimization algorithms to constraint handling.

Feature	CML (2.0)	CMLMT (3.0)
Optimization Algorithm	Sequential Quadratic Programming (SQP) with BFGS, DFP, and Newton-Raphson methods.	SQP with improved secant algorithms and Cholesky updates for Hessian approximation.
Parallel Computing Support	No multi-threading support.	Multi-threading enabled for numerical derivatives and bootstrapping.
Log-Likelihood Computation	Function and derivatives computed separately, requiring redundant calculations.	Unified procedure for computing log-likelihood, first derivatives, and second derivatives, reducing redundant computations.
Parameter Handling	Supports only a simple parameter vector.	Supports both a simple parameter vector and a `PV` structure (for advanced parameter management). Additionally, allows an unlimited number of data arguments in the log-likelihood function, simplifying the function and improving computation time.
Constraints Handling	Supports linear and nonlinear equality/inequality constraints.	Improved constraint handling with an explicit control structure for optimization.
Line Search Methods	STEPBT (quadratic/cubic fitting), BRENT, HALF, and BHHHSTEP.	Introduces the Augmented Lagrangian Penalty method for constrained models. Also includes STEPBT (quadratic/cubic fitting), BRENT, HALF, and BHHHSTEP.
Statistical Inference	Basic hypothesis testing.	Enhanced hypothesis testing for constrained models, including profile likelihoods, bootstrapping, and Lagrange multipliers.
Handling of Fixed Parameters	Global variables used to fix parameters.	Uses the `cmlmtControl` structure for setting fixed parameters.
Run-Time Adjustments	Uses global variables to modify settings.	The `cmlmtControl` structure enables flexible tuning of optimization settings.

Advantages of CMLMT

Beyond just performance improvements, CMLMT introduces several key advantages that make it a more powerful and user-friendly tool for constrained maximum likelihood estimation. These improvements do more than just support multi-threading; they provide greater flexibility, efficiency, and accuracy in model estimation.

Some of the most notable advantages include:

Threading & Multi-Core Support: CMLMT enables multi-threading, significantly speeding up numerical derivatives and bootstrapping, whereas CML is single-threaded.
Simplified Parameter Handling: Only CMLMT supports both a simple parameter vector and the PV structure for advanced models. Additionally, CMLMT allows dynamic arguments, making it easier to pass data to the log-likelihood function.
More Efficient Log-Likelihood Computation: CMLMT integrates the analytic computation of log-likelihood, first derivatives, and second derivatives into a user-specified log-likelihood procedure, reducing redundancy.
Augmented Lagrangian Method: CMLMT introduces an Augmented Lagrangian Penalty Line Search for handling constrained optimization.
Enhanced Statistical Inference: CMLMT includes bootstrapping, profile likelihoods, and hypothesis testing improvements, which are limited in CML.

Converting a CML Model to CMLMT

Let's use a simple example to walk through the step-by-step transition from CML to CMLMT. In this model, we will perform constrained maximum likelihood estimation for a Poisson model.

The dataset is included in the CMLMT library.

Original CML Code

We will start by estimating the model using CML:

new;
library cml;
#include cml.ext;
cmlset;

// Load data
data = loadd(getGAUSSHome("pkgs/cmlmt/examples/cmlmtpsn.dat"));

// Set constraints for first two coefficients
// to be equal
_cml_A = { 1 -1 0 };   
_cml_B = { 0 };  

// Specify starting parameters
beta0 = .5|.5|.5;

// Run optimization
{ _beta, f0, g, cov, retcode } = CMLprt(cml(data, 0, &logl, beta0));

// Specify log-likelihood function
proc logl(b, data);
   local m, x, y;

   // Extract x and y
   y = data[., 1];
   x = data[., 2:4];

   m = x * b;

  retp(y .* m - exp(m));
endp;

This code prints the following output:

Mean log-likelihood       -0.670058
Number of cases     100

Covariance of the parameters computed by the following method:
Inverse of computed Hessian

Parameters    Estimates     Std. err.    Gradient
------------------------------------------------------------------
P01              0.1199        0.1010      0.0670
P02              0.1199        0.1010     -0.0670
P03              0.8343        0.2648      0.0000

Number of iterations    5
Minutes to convergence     0.00007

Step One: Switch to CMLMT Library

The first step in updating our program file is to load the CMLMT library instead of the CML library.

Original CML Code

// Clear workspace and load library
new;
library cml;

New CMLMT Code

// Clear workspace and load library
new;
library cmlmt;

Step Two: Load Data

Since data loading is handled by GAUSS base procedures, no changes are necessary.

Original CML and CMLMT Code

// Load data
x = loadd(getGAUSSHome("pkgs/cmlmt/examples/cmlmtpsn.dat"));

// Extract x and y
y = x[., 1];
x = x[., 2:4];

Step Three: Setting Constraints

The next step is to convert the global variables used to control optimization in CML into members of the cmlmtControl structure. To do this, we need to:

Declare an instance of the cmlmtControl structure.
Initialize the cmlmtControl structure with default values using cmlmtControlCreate.
Assign the constraint vectors to the corresponding cmlmtControl structure members.

Original CML Code

// Set constraints for first two coefficients
// to be equal
_cml_A = { 1 -1 0 };   
_cml_B = { 0 };

New CMLMT Code

//Declare and initialize control structure
struct cmlmtControl ctl;
ctl = cmlmtControlCreate();

// Set constraints for first two coefficients
// to be equal
ctl.A = { 1 -1 0 };   
ctl.B = { 0 };
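The matrices encode the linear equality constraint $A b = B$, which here reads $b_1 - b_2 = 0$. As a quick sanity check, the sketch below plugs in the rounded estimates from the CML output shown earlier (b_hat is a name introduced here):

```gauss
// Constraint matrices: 1*b1 - 1*b2 + 0*b3 = 0
A = { 1 -1 0 };
B = { 0 };

// Rounded estimates copied from the CML output table
b_hat = { 0.1199, 0.1199, 0.8343 };

// A value of zero means the equality constraint is satisfied
print "A*b - B = " (A * b_hat - B);
```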

Step Four: Specify Starting Values

In our original CML code, we specified the starting parameters using a vector of values. In the CMLMT library, we can specify the starting values using either a parameter vector or a PV structure.

The advantage of the PV structure is that it allows parameters to be stored in different formats, such as symmetric matrices or matrices with fixed parameters. This, in turn, can simplify calculations inside the log-likelihood function.

If we use the parameter vector option, we don't need to make any changes to our original code:

Original CML and CMLMT Code

// Specify starting parameters
beta0 = .5|.5|.5;

Using the PV structure option requires additional steps:

Declare an instance of the PV structure.
Initialize the PV structure using the PVCreate procedure.
Use the PVpack functions to create and define specific parameter types within the PV structure.

New CMLMT Code to use PV

// Declare instance of 'PV' struct
struct PV p0;

// Initialize p0
p0 = pvCreate();

// Create parameter vector
beta0 = .5|.5|.5;

// Load parameters into p0
p0 = pvPack(p0, beta0, "beta");
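When starting values are supplied in a PV structure, the likelihood procedure receives that structure as its first argument and has to unpack the parameters by name. A minimal sketch, assuming the CMLMT library is installed and that the name "beta" matches the one used in pvPack above; the procedure name logl_pv is hypothetical, and the ind argument and modelResults structure are covered in the next step:

```gauss
library cmlmt;

proc logl_pv(struct PV p, y, x, ind);
    local b, m;
    struct modelResults mm;

    // Retrieve the coefficient vector stored under the name "beta"
    b = pvUnpack(p, "beta");

    m = x * b;

    // Return the function value when CMLMT requests it
    if ind[1];
        mm.function = y .* m - exp(m);
    endif;

    retp(mm);
endp;
```

The estimation call would then pass p0 in place of beta0, for example cmlmt(&logl_pv, p0, y, x, ctl).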

Step Five: The Likelihood Function

In CML, the likelihood function takes only two parameters:

A parameter vector.
A data matrix.

Original CML Code

// Specify log-likelihood function
proc logl(b, data);
   local m, x, y;

   // Extract x and y
   y = data[., 1];
   x = data[., 2:4];

   m = x * b;

  retp(y .* m - exp(m));
endp;
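The expression returned by logl is the Poisson log-likelihood with the term that does not depend on the parameters left out. Assuming a log link, so that the Poisson mean is $\exp(x_i' b)$:

```latex
\ln L_i(b) = y_i\, x_i' b - \exp(x_i' b) - \ln(y_i!)
```

Dropping $\ln(y_i!)$ shifts the log-likelihood by a constant and does not change the maximizer, which is consistent with the code returning only y .* m - exp(m).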

The likelihood function in CMLMT is enhanced in several ways:

We can pass as many arguments as needed to the likelihood function. This allows us to simplify the function, which, in turn, can speed up optimization.
We return output from the likelihood function in the form of the modelResults structure. This makes computations thread-safe and allows us to specify both gradients and Hessians inside the likelihood function:
- The likelihood function values are stored in the mm.function member.
- The gradients are stored in the mm.gradient member.
- The Hessians are stored in the mm.hessian member.
The last input into the likelihood function must be ind. CMLMT passes ind to your log-likelihood function each time it calls it, and ind tells your function whether CMLMT needs the gradient and Hessian or just the function value (see online examples). NOTE: You are never required to compute the gradient or Hessian, even if ind requests it. If you do not compute it, CMLMT will compute numerical derivatives.

New CMLMT Code

// Specify log-likelihood function
// Allows separate arguments for y & x
// Also has 'ind' as last argument
proc logl(b, y, x, ind);
   local m;

   // Declare modelResults structure
   struct modelResults mm;

   // Likelihood computation
   m = x * b;

   // If the first element of 'ind' is not zero,
   // CMLMT wants us to compute the function value
   // which we assign to mm.function
   if ind[1];
      mm.function = y .* m - exp(m);
   endif;

   retp(mm);
endp;
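If analytic derivatives are available, they can be returned from the same procedure. The sketch below adds the Poisson gradient, assuming the second element of ind signals a gradient request (the convention used in the CMLMT examples) and that mm.gradient expects one row per observation and one column per parameter; the name logl_grad is hypothetical:

```gauss
library cmlmt;

proc logl_grad(b, y, x, ind);
    local m;
    struct modelResults mm;

    m = x * b;

    // Function value requested
    if ind[1];
        mm.function = y .* m - exp(m);
    endif;

    // Gradient requested: d/db [y*m - exp(m)] = (y - exp(m)) * x
    if ind[2];
        mm.gradient = (y - exp(m)) .* x;
    endif;

    retp(mm);
endp;
```

If the ind[2] branch is left out, CMLMT falls back on numerical derivatives, as noted above.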

Step Six: Run Optimization

We estimate the maximum likelihood parameters in CML using the cml procedure. The cml procedure returns five outputs, and a results table is printed using the cmlPrt procedure.

Original CML Code

/*
** Run optimization
*/
// Run optimization
{ _beta, f0, g, cov, retcode } = cml(data, 0, &logl, beta0);

// Print results
CMLprt(_beta, f0, g, cov, retcode);

In CMLMT, estimation is performed using the cmlmt procedure. The cmlmt procedure returns a cmlmtResults structure, and a results table is printed using the cmlmtPrt procedure.

To convert to cmlmt, we take the following steps:

Declare an instance of the cmlmtResults structure.
Call the cmlmt procedure. Following an initial pointer to the log-likelihood function, the parameter and data inputs are passed to cmlmt in the exact order they are specified in the log-likelihood function.
The output from cmlmt is stored in the cmlmtResults structure, out.

New CMLMT Code

/*
** Run optimization
*/
// Declare output structure
struct cmlmtResults out;

// Run estimation
out = cmlmt(&logl, beta0, y, x, ctl);

// Print output
cmlmtPrt(out);

Conclusion

Upgrading from CML to CMLMT provides faster performance, improved numerical stability, and easier parameter management. The addition of multi-threading, better constraint handling, and enhanced statistical inference makes CMLMT a powerful upgrade for GAUSS users.

If you're still using CML, consider transitioning to CMLMT for a more efficient and flexible modeling experience!


Introducing the GAUSS Data Management Guide

Eric — Tue, 20 Feb 2024 18:50:08 +0000

Introduction

If you've worked with real-world data, you know that data cleaning and management can eat up your time. Efficiently tackling tedious data cleaning, organization, and management tasks can have a huge impact on productivity.

We created the GAUSS Data Management Guide with that exact goal in mind. It aims to help you save time and make the most of your data.

Today's blog looks at what the GAUSS Data Management Guide offers and how to best use the guide.

What is the GAUSS Data Management Guide?

The GAUSS Data Management Guide is a comprehensive reference tool for accomplishing data-related tasks in GAUSS. It provides a detailed roadmap for working with data in GAUSS, from basic data import and manipulation to advanced data cleaning and visualization.

The guide is intentionally designed for all levels of GAUSS users with:

Extensive coverage.
Step-by-step instructions.
Annotated examples.

What does the GAUSS Data Management Guide cover?

The GAUSS Data Management Guide includes sections for:

How should I use the GAUSS Data Management Guide?

Use page outlines, located on the right-hand side of each page, to identify and navigate to specific tasks.
Copy the examples in the guide and paste into GAUSS program files to use as templates.
Use the links to complete function reference pages to find additional support.

Conclusion

The GAUSS Data Management Guide provides practical examples, detailed instructions, and comprehensive coverage that can help you work productively and efficiently with your data.

Transforming Panel Data to Long Form in GAUSS

Eric — Tue, 12 Dec 2023 21:24:59 +0000

Introduction

Anyone who works with panel data knows that pivoting between long and wide form, though commonly necessary, can still be painstakingly tedious, at best. It can lead to frustrating errors, unexpected results, and lengthy troubleshooting, at worst.

The new dfLonger and dfWider procedures introduced in GAUSS 24 make great strides towards fixing that. Extensive planning has gone into each procedure, resulting in comprehensive but intuitive functions.

In today's blog, we will walk through all you need to know about the dfLonger procedure to tackle even the most complex cases of transforming wide form panel data to long form.

The Rules of Tidy Data

Before we get started, it will be useful to consider what makes data tidy (and why tidy data is important).

It's useful to think of breaking our data into components (these subsets will come in handy later when working with dfLonger):

Values.
Observations.
Variables.

We can use these components to define some basic rules for tidy data:

Variables have unique columns.
Observations have unique rows.
Values have unique cells.

Example One: Wide Form State Population Table

State	2020	2021	2022
Alabama	5,031,362	5,049,846	5,074,296
Alaska	732,923	734,182	733,583
Arizona	7,179,943	7,264,877	7,359,197
Arkansas	3,014,195	3,028,122	3,045,637
California	39,501,653	39,142,991	39,029,342

Though not clearly labeled, we can deduce that this data presents values for three different variables: State, Year, and Population.

Looking more closely we see:

State is stored in a unique column.
The values of Year are stored as column names.
The values of Population are stored in separate columns for each year.

Our variables do not each have a unique column, violating the rules of tidy data.

Example Two: Long Form State Population Table

State	Year	Population
Alabama	2020	5,031,362
Alabama	2021	5,049,846
Alabama	2022	5,074,296
Alaska	2020	732,923
Alaska	2021	734,182
Alaska	2022	733,583
Arizona	2020	7,179,943
Arizona	2021	7,264,877
Arizona	2022	7,359,197

The transformed data above now has three columns, one for each variable State, Year, and Population. We can also confirm that each observation has a single row and each value has a single cell.

Transforming the data to long form has resulted in a tidy data table.

Why Do We Care About Tidy Data?

Working with tidy data offers a number of advantages:

Tidy data storage offers consistency when trying to compare, explore, and analyze data whether it be panel data, time series data or cross-sectional data.
Using columns for variables is aligned with vectorization and matrix notation, both of which are fundamental to efficient computations.
Many software tools expect tidy data and will only work reliably with tidy data.


Transforming From Wide to Long Panel Data

In this section, we will look at how to use the GAUSS procedure dfLonger to transform panel data from wide to long form. This section will cover:

The fundamentals of the dfLonger procedure.
A standard process for setting up panel data transformations.

The `dfLonger` Procedure

The dfLonger procedure transforms wide form GAUSS dataframes to long form GAUSS dataframes. It has four required inputs and one optional input:

df_long = dfLonger(df_wide, columns, names_to, values_to [, pctl]);

df_wide: A GAUSS dataframe in wide panel format.
columns: String array, the columns that should be used in the conversion.
names_to: String array, specifies the variable name(s) for the new column(s) created to store the wide variable names.
values_to: String, the name of the new column containing the values.
pctl: Optional, an instance of the pivotControl structure used for advanced pivoting options.

Setting Up Panel Data Transformations

Having a systematic process for transforming wide panel data to long panel data will:

Save time.
Eliminate frustration.
Prevent errors.

Let's use our wide form state population data to work through the steps.

Step 1: Identify variables.

In our wide form population table, there are three variables: State, Year, and Population.

Variables are not always clearly labeled in wide form data. You will often need background information to identify them. Pay attention to references, titles, or other sources to ensure that you clearly understand the variables.

Step 2: Identify columns to convert.

The easiest way to determine what columns need to be converted is to identify the "problem" columns in your wide form data.

For example, in our original state population table, the columns named 2020, 2021, 2022, represent our Year variable. They store the values for the Population variable.

These are the columns we will need to address in order to make our data tidy.

columns = "2020"$|"2021"$|"2022";

We only have three columns to transform and it is easy to just type out our column names in a string array. This won't always be the case, though. Fortunately, GAUSS has a lot of great convenience functions to help with creating your column lists.

My favorites include:

Function	Description	Example
getColNames	Returns the column variable names.	`varnames = getColNames(df_wide)`
startsWith	Returns a 1 if a string starts with a specified pattern.	`mask = startsWith(colNames, pattern)`
trimr	Trims rows from the top and/or bottom of a matrix.	`names = trimr(full_list, top, bottom)`
rowcontains	Returns a 1 if the row contains the data specified by the `needle` variable, otherwise it returns a 0.	`mask = rowcontains(haystack, needle)`
selif	Selects rows from a matrix, dataframe or string array, based upon a vector of 1’s and 0’s.	`names = selif(full_list, mask)`

For more complex cases, it is useful to approach creating column lists as a two-step process:

Get all column names using getColNames.
Select a subset of column names using one of the selection convenience functions.

As an example, suppose our state population dataset contains a year column as the first column and the remaining columns contain the populations for 1950-2022. It would be difficult to write out the column list for all years.

Instead we could:

Get a list of all the column names using getColNames.
Trim the first name off the list.

// Get all columns names
colNames = getColNames(pop_wide);

// Trim first name `year` 
// from top of the name list
colNames = trimr(colNames, 1, 0);
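The same two-step process works with the other selection functions from the table. As a minimal sketch, suppose the columns we want share a common prefix. The dataframe df_wide and its column names below are hypothetical and built inline only so the example stands on its own:

```gauss
// Hypothetical wide dataframe: one id column plus
// two year columns sharing the prefix "yr_"
df_wide = asDF(ones(3, 3), "id", "yr_2020", "yr_2021");

// Step 1: Get all column names
colNames = getColNames(df_wide);

// Step 2: Flag the names that start with "yr_"
mask = startsWith(colNames, "yr_");

// Keep only the flagged names
columns = selif(colNames, mask);
```

Unlike trimming by position, this approach does not depend on where the unwanted columns sit in the dataframe.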

Step 3: Name the new columns for storing names.

The names of the columns being transformed from our wide form data will be stored in a variable specified by the input names_to.

In this case, we want to store the names from the wide data in one new variable called "Year". In later examples, we will look at how to split names into multiple variables using prefixes, separators, or patterns.

names_to = "Year";

Step 4: Name the new columns for storing values.

The values stored in the columns being transformed will be stored in a variable specified by the input values_to.

For our population table, we will store the values in a variable named "Population".

values_to = "Population";

Basic Pivoting

Now it's time to put all these steps together into a working example. Let's continue with our state population example.

We'll start by loading the complete state population dataset from the state_pop.gdat file:

// Load data 
pop_wide = loadd("state_pop.gdat");

// Preview data
head(pop_wide);

           State             2020             2021             2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342.

Now, let's set up our information for transforming our data:

// Identify columns
columns = "2020"$|"2021"$|"2022";

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we'll transform our data using dfLonger:

// Convert data using dfLonger
pop_long = dfLonger(pop_wide, columns, names_to, values_to);

// Preview data
head(pop_long);

           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Advanced Pivoting

One of the most appealing things about dfLonger is that while simple to use, it offers tools for tackling the most complex cases. In this section, we'll cover everything you need to know for moving beyond basic pivoting.

The `pivotControl` Structure

The pivotControl structure allows you to control pivoting specifications using the following members:

Member	Purpose
names_prefix	A string input which specifies which characters, if any, should be stripped from the front of the wide variable names before they are assigned to a long column.
names_sep_split	A string input which specifies which characters, if any, mark where the names_to names should be broken up.
names_pattern_split	A string input containing a regular expression specifying group(s) in names_to names which should be broken up.
names_types	A string input specifying data types for the names_to variable.
values_drop_missing	Scalar, if set to 1 all rows with missing values will be removed.

We will demonstrate how to use the pivotControl structure in more detail in later examples. However, if you are unfamiliar with structures, you may find it useful to review our tutorial, "A Gentle Introduction to Using Structures."
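As a quick illustration, here is a minimal sketch of setting one of these members. It assumes the pop_wide, columns, names_to, and values_to variables from the basic pivoting example are still in the workspace, and it uses only the values_drop_missing member described in the table above:

```gauss
// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Remove rows with missing values from the long form data
pctl.values_drop_missing = 1;

// Pass the control structure as the final input
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);
```

Members that we do not set keep the default values assigned by pivotControlCreate.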

Changing Variable Types

By default the variables created from the pieces of the variable names will be categorical variables.

If we examine the type of the Year variable in pop_long from our previous example,

// Check the type of the 'Year' variable
getColTypes(pop_long[., "Year"]);

we can see that the Year variable is a categorical variable:

            type
        category

This isn't ideal and we'd prefer our Year variable to be a date. We can control the assigned type using the names_types member of the pivotControl structure. The names_types member can be specified in one of two ways:

As a column vector of types for each of the names_to variables.
An n x 2 string array where the first column is the name of the variable(s) and the second column contains the type(s) to be assigned.

For our example, we wish to specify that the Year variable should be a date but we don't need to change any of the other assigned types, so we will use the second option:

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify that 'Year' should be
// converted to a date variable
pctl.names_types = {"Year" "date"};

Next, we complete the steps for pivoting:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we call dfLonger including the pivotControl structure, pctl, as the final input:

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);

           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Now if we check the type of our Year variable:

// Check the type of 'Year'
getColTypes(pop_long[., "Year"]);

It is a date variable:

  type
  date

Stripping Prefixes

In our previous example, the wide data names only contained the year. However, the column names of a wide dataset often have common prefixes. The names_prefix member of the pivotControl structure offers a convenient way to strip unwanted prefixes.

Suppose that our wide form state population columns were labeled "yr_2020", "yr_2021", "yr_2022":

// Load data
pop_wide2 = loadd("state_pop2.gdat");

// Preview data
head(pop_wide2);

           State          yr_2020          yr_2021          yr_2022
         Alabama        5031362.0        5049846.0        5074296.0
          Alaska        732923.00        734182.00        733583.00
         Arizona        7179943.0        7264877.0        7359197.0
        Arkansas        3014195.0        3028122.0        3045637.0
      California        39501653.        39142991.        39029342.

We need to strip these prefixes when transforming our data to long form.

To accomplish this, we first need to specify that our name columns have the common prefix "yr_":

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify prefix
pctl.names_prefix = "yr_";

Next, we complete the steps for pivoting:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide2);
columns = trimr(columns, 1, 0);

// Variable for storing names
names_to = "Year";

// Variable for storing values
values_to = "Population";

Finally, we call dfLonger:

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide2, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);

           State             Year       Population
         Alabama             2020        5031362.0
         Alabama             2021        5049846.0
         Alabama             2022        5074296.0
          Alaska             2020        732923.00
          Alaska             2021        734182.00

Splitting Names

In our basic example the only information contained in the names columns was the year. We created one variable to store that information, "Year". However, we may have cases where our wide form data contains more than one piece of information.

In these cases, there are two important steps to take:

Name the variables that will store the information contained in the wide data column names using the names_to input.
Indicate to GAUSS how to split the wide data column names into the names_to variables.

Names Include a Separator

One way that names in wide data can contain multiple pieces of information is through the use of separators.

For example, suppose our data looks like this:

           State       urban_2020       urban_2021       urban_2022       rural_2020       rural_2021       rural_2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858.

Now our names specify:

Whether the population is the urban or rural population.
The year of the observation.

In this case, we:

Use the names_sep_split member of the pivotControl structure to indicate how to split the names.
Specify a names_to variable for each group created by the separator.

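// Load data
pop_wide3 = loadd("state_pop3.gdat");

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide3);
columns = trimr(columns, 1, 0);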

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names
pctl.names_sep_split = "_";

// Specify two variables for holding
// names information:
//    'Location' for the information before the separator
//    'Year' for the information after the separator
names_to = "Location"$|"Year";

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide3, columns, names_to, values_to, pctl);

// Preview data
head(pop_long);

           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000

Now, the pop_long dataframe contains:

The information in the wide form names found before the separator, "_", (urban or rural) in the Location variable.
The information in the wide form names found after the separator, "_", in the Year variable.

Variable Names With Regular Expressions

In our example above, the variables contained in the names were clearly separated by a "_". However, this isn't always the case. Sometimes names use a pattern rather than separator:

// Load data
pop_wide4 = loadd("state_pop4.gdat");

// Preview data
head(pop_wide4);

           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858.

In cases like this, we can use the names_pattern_split member to tell GAUSS we want to pass in a regular expression that will split the columns. We can't cover the full details of regular expressions here. However, there are a few fundamentals that will help us get started with this example.

In regEx:

Each statement inside a pair of parentheses is a group.
To match any upper or lower case letter we use "[a-zA-Z]". More specifically, this tells GAUSS that we want to match any lowercase letter ranging from a-z and any upper case letter ranging from A-Z. If we wanted to limit this to any lowercase letters from t to z and any uppercase letter B to M we would say "[t-zB-M]".
To match any single digit we use "[0-9]".
To represent that we want to match one or more instances of a pattern we use "+".
To represent that we want to match zero or more instances of a pattern we use "*".

In this case, we want to separate our names so that "urban" and "rural" are collected in Location and 2020, 2021, and 2022 are collected in the Year variable:

We have two groups.
We can capture both urban and rural using "[a-zA-Z]+".
We can capture the years by matching one or more digits using "[0-9]+".

Let's use regEx to specify our names_pattern_split member:

// Declare pivotControl structure and fill with default values
struct pivotControl pctl;
pctl = pivotControlCreate();

// Specify how to separate names 
// using the pivotControl structure
pctl.names_pattern_split = "([a-zA-Z]+)([0-9]+)";

Next, we can put this together with our other steps to transform our wide data:

// Variable for storing names
names_to = "Location"$|"Year";

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

// Variable for storing values
values_to = "Population";

// Call dfLonger with optional control structure
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);
head(pop_long);

           State         Location             Year       Population
         Alabama            urban             2020        6558153.0
         Alabama            urban             2021        4972982.0
         Alabama            urban             2022        12375977.
         Alabama            rural             2020        1526791.0
         Alabama            rural             2021        76863.000

Multiple Value Variables

In all our previous examples we had values that needed to be stored in one variable. However, it's more realistic that our dataset contains multiple groups of values and we will need to specify multiple variables to store these values.

Let's consider our previous example which used the pop_wide4 dataset:

           State        urban2020        urban2021        urban2022        rural2020        rural2021        rural2022
         Alabama        6558153.0        4972982.0        12375977.        1526791.0        76863.000        7301681.0
          Alaska        21944.000        467051.00        311873.00        710978.00        267130.00        421709.00
         Arizona        1248007.0        6033358.0        1444029.0        8427950.0        1231518.0        5915167.0
        Arkansas        863918.00        913266.00        7000024.0        2150276.0        3941388.0        3954387.0
      California        17255657.        27682794.        63926200.        22245995.        11460196.        24896858.

Suppose that rather than creating a location variable, we wish to separate the population information into two variables, urban and rural. To do this we will:

Split the variable names by words ("urban" or "rural") and integers.
Create a Year column from the integer portions of the names.
Create two values columns, urban and rural, from the word portions.

First, we will specify our columns:

// Get all column names and remove the first column, 'State'
columns = getColNames(pop_wide4);
columns = trimr(columns, 1, 0);

Since we are using the same data as our previous example, we don't need to load any additional data.

Next, we need to specify our names_to and values_to inputs. However, this time we want our values_to variables to be determined by the information in our names.

We do this using ".value".

// Tell GAUSS to use the first group of the split names 
// to set the values variables and 
// store the remaining group in 'Year'
names_to = ".value" $| "Year";

// Tell GAUSS to get 'values_to' variables from 'names_to'
values_to = "";

Setting ".value" as the first element in our names_to input tells dfLonger to take the first piece of the wide data names and create a column with all the values from all matching columns.

In other words, combine all the values from the variables urban2020, urban2021, urban2022 into a single variable named urban and do the same for the rural columns.

Finally, we need to tell GAUSS how to split the variable names.

// Declare 'pctl' to be a pivotControl structure
// and fill with default settings
struct pivotControl pctl;
pctl = pivotControlCreate();

// Set the regex to split the variable names
pctl.names_pattern_split = "(urban|rural)([0-9]+)";

This time, we specify the variable names, "(urban|rural)", rather than use the general specifier "([a-zA-Z]+)".

Now we call dfLonger:

// Convert the dataframe to long format according to our specifications
pop_long = dfLonger(pop_wide4, columns, names_to, values_to, pctl);

// Print the first 5 rows of the long form dataframe
head(pop_long);

           State             Year            urban            rural
         Alabama             2020        6558153.0        1526791.0
         Alabama             2021        4972982.0        76863.000
         Alabama             2022        12375977.        7301681.0
          Alaska             2020        21944.000        710978.00
          Alaska             2021        467051.00        267130.00

Now the urban and rural populations are stored in their own columns, named urban and rural.

These names can easily be changed using the Data Manager or setColNames.
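For instance, a minimal sketch of renaming with setColNames might look like the following. The new names are hypothetical, and the three-input form (dataframe, new names, columns to rename) is our assumption about the call, so check the setColNames documentation for your GAUSS version:

```gauss
// Hypothetical, more descriptive names
// for the two value columns
new_names = "Urban Population"$|"Rural Population";

// Rename 'urban' and 'rural' (assumed three-input form)
pop_long = setColNames(pop_long, new_names, "urban"$|"rural");
```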

Conclusion

As we've seen today, pivoting panel data from wide to long can be complicated. However, a systematic approach combined with the GAUSS dfLonger procedure helps reduce the frustration, time, and errors involved.


Managing String Data with GAUSS Dataframes

Eric — Tue, 28 Mar 2023 20:41:19 +0000

Introduction

Working with strings hasn’t always been easy in GAUSS. In the past, the only option in GAUSS was to store strings separately from numeric data. It made it difficult to work with datasets that contained mixed types.

With the introduction of GAUSS dataframes in GAUSS 21 and the enhanced string capabilities of GAUSS 23, that has all changed! I would argue that GAUSS now offers one of the best environments for managing and cleaning mixed-type data.

I recently used GAUSS to perform the very practical task of creating an email list from a string-heavy dataset – something I never would have chosen GAUSS for in the past. In this blog, we walk through this data cleaning task, highlighting several key features for handling strings.

Quick Overview of Strings in GAUSS

The GAUSS dataframe revolutionized data storage in GAUSS. It allows you to store mixed data types together including numbers, dates, categorical data, and strings.

The GAUSS string data type can contain letters, numbers, and other characters. The string data type:

Keeps labels with data.
Saves additional loading steps.
Makes data and reports easier to understand.

It isn’t difficult to see the usefulness of this in real-world data which often includes information such as customer names, product names, or locations.

Loading Strings in GAUSS

Strings can be programmatically loaded from multiple data file types using loadd. No special steps are required, and GAUSS automatically detects strings in XLSX, CSV, STATA, SAS, and GDAT files.

In addition, the Data Import window provides a great tool for interactively previewing and managing data of all types at the time of import.

See our Data Management and Cleaning User Guide for an in-depth guide to data handling in GAUSS.

Data Exercise: Building an Email List

To help demonstrate GAUSS's string capabilities, we will build and export an email contact list from a provided Excel dataset. We will break this project into several smaller tasks:

Loading our raw data.
Generating email addresses from the provided information.
Combining the desired contact list information into a dataframe.
Exporting the dataframe as a CSV file.

Provided Data

We will use a sample dataset containing sales territory information for sales representatives. The original dataset includes a mix of string and categorical data including:

Variable	Description	Type
KAR	The assigned territory sales representative.	Category
Store	The store number.	Numeric
Store name	The store name.	String
Format	Type of display found in store.	Category
Vet	Y/N indicator of in-store vet clinics.	Category
Nielsen Market	Assigned Nielsen Market	Category

You can download the original dataset here.

Importing Raw Data Interactively

For this exercise, I'm going to use the interactive Data Window to load my data. For data cleaning projects like this, I often find it helpful to have a preview of my raw data. This allows me to make preliminary observations about my raw data such as:

The presence of unnecessary variables.
If the dataset has a non-standard header.
Data types.

It's useful to note that the GAUSS Data Window always generates GAUSS code that can be used for replicating data loading programmatically in the future.

territory_info = loadd("C:/business/accounts/territory-info.xlsx");

Notice that this data will load directly as we saw it in the preview.

// Print the first 5 rows
head(territory_info);

It will also look exactly like our preview if we print it to screen:

             KAR            Store       Store Name           Format       Vet   Nielsen Market
   Larry McGuire        725.00000     NY-MIDDLETWN              RUN         N     New York, NY
   Larry McGuire        728.00000        STRATFORD      PREMIUM RUN         N     New York, NY
   Larry McGuire        752.00000       NORWALK-CT           PANTRY         N     New York, NY
   Larry McGuire        758.00000       SEEKONK-MA           PANTRY         N Providence et al
   Larry McGuire        762.00000   SOUTHINGTON-CT   4 FT. MINI RUN         N Hartford and New

Cleaning Our Data

Before generating our email list, we should perform some preliminary data cleaning.

Though we will conduct our data cleaning programmatically, it is worth noting that the Data Management pane offers an interactive environment for data cleaning. For more information on interactive data cleaning, see our Data Cleaning User Guide.

First, we check for duplicates:

// Check for duplicates
getduplicates(territory_info);

Since no output is printed, we know there are no duplicate rows.

Next, let's review the Nielsen Market variable using the frequency command:

// Check Frequencies
frequency(territory_info, "Nielsen Market");

The frequencies aren't that interesting to us. However, the report provides us with a quick view of the categories:

                      Label      Count   Total %    Cum. % 
              21 iowa-idaho          1   0.09814   0.09814 
     Abilene-Sweetwater, TX          1   0.09814    0.1963 
    Abilene-Sweetwater, TX           1   0.09814    0.2944 
          Albany et al, NY           2    0.1963    0.4907 
  Albuquerque-Santa Fe, NM           3    0.2944    0.7851 
               Atlanta, GA          20     1.963     2.748 
      Augusta-Aiken, GA-SC           1   0.09814     2.846 
                Austin, TX          13     1.276     4.122 
            Bakersfield, CA          2    0.1963     4.318 
             Baltimore, MD          19     1.865     6.183 
  Beaumont-Port Arthur, TX           2    0.1963     6.379 
                  Bend, OR           2    0.1963     6.575 
            Binghamton, NY           1   0.09814     6.673 
       Boston et al, MA-NH          37     3.631      10.3 
               Buffalo, NY           3    0.2944      10.6 
         Butte-Bozeman, MT           2    0.1963     10.79 
            Charleston, SC           1   0.09814     10.89 
             Charlotte, NC           7    0.6869     11.58 
       Charlottesville, VA           1   0.09814     11.68 
      Cheyenne et al, WY-NE          1   0.09814     11.78 
               Chicago, IL          45     4.416     16.19 
         Chico-Redding, CA           3    0.2944     16.49 
            Cincinnati, OH           1   0.09814     16.58 
       Cleveland et al, OH          11     1.079     17.66 
   Colorado Sprgs et al, CO          6    0.5888     18.25 
              Columbia, SC           2    0.1963     18.45 
              Columbus, OH           2    0.1963     18.65 
        Corpus Christi, TX           2    0.1963     18.84 
       Dallas-Ft. Worth, TX         35     3.435     22.28 
      Dallas-Ft. Worth, TX           2    0.1963     22.47 
    Davenport et al, IA-IL           1   0.09814     22.57 
                Dayton, OH           3    0.2944     22.87 
                 Denver, CO         27      2.65     25.52 
       Des Moines-Ames, IA           1   0.09814     25.61 
               Detroit, MI          12     1.178     26.79 
      El Paso et al, TX-NM           3    0.2944     27.09 
          Elmira et al, NY           1   0.09814     27.18 
                Eugene, OR           2    0.1963     27.38 
                Eureka, CA           1   0.09814     27.48 
     Fargo-Valley City, ND           1   0.09814     27.58 
        Fresno-Visalia, CA           5    0.4907     28.07 
      Ft. Myers-Naples, FL           5    0.4907     28.56 
       Ft. Smith et al, AR           2    0.1963     28.75 
           Gainesville, FL           1   0.09814     28.85 
   Grand Junction et al, CO          1   0.09814     28.95 
      Greensboro et al, NC           2    0.1963     29.15 
      Greenville et al, NC           2    0.1963     29.34 
   Greenville et al, SC-NC           6    0.5888     29.93 
       Harlingen et al, TX           2    0.1963     30.13 
      Harrisburg et al, PA           5    0.4907     30.62 
          Harrisonburg, VA           2    0.1963     30.81 
Hartford and New Haven, CT          17     1.668     32.48 
               Houston, TX          38     3.729     36.21 
          Indianapolis, IN           6    0.5888      36.8 
          Jacksonville, FL           5    0.4907     37.29 
       Johnstown et al, PA           4    0.3925     37.68 
        Kansas City, MO-KS           1   0.09814     37.78 
             Knoxville, TN           1   0.09814     37.88 
             Lafayette, LA           2    0.1963     38.08 
          Lake Charles, LA           1   0.09814     38.17 
             Las Vegas, NV           9    0.8832     39.06 
              Lexington, KY          1   0.09814     39.16 
         Lincoln et al, NE           2    0.1963     39.35 
           Los Angeles, CA          92     9.028     48.38 
         Medford et al, OR           2    0.1963     48.58 
  Miami-Ft. Lauderdale, FL          14     1.374     49.95 
             Milwaukee, WI           7    0.6869     50.64 
  Minneapolis-St. Paul, MN          15     1.472     52.11 
           Minot et al, ND           2    0.1963     52.31 
              Missoula, MT           2    0.1963      52.5 
       Mobile et al, AL-FL           1   0.09814      52.6 
    Myrtle Beach et al, SC           3    0.2944     52.89 
             Nashville, TN          10    0.9814     53.88 
           New Orleans, LA           3    0.2944     54.17 
              New York, NY          80     7.851     62.02 
         Norfolk et al, VA           7    0.6869     62.71 
         Odessa-Midland, TX          1   0.09814     62.81 
         Oklahoma City, OK           6    0.5888      63.4 
                 Omaha, NE           1   0.09814     63.49 
         Orlando et al, FL          18     1.766     65.26 
          Palm Springs, CA           4    0.3925     65.65 
    Peoria-Bloomington, IL           3    0.2944     65.95 
          Philadelphia, PA          35     3.435     69.38 
         Phoenix et al, AZ          21     2.061     71.44 
            Pittsburgh, PA          13     1.276     72.72 
              Portland, OR          25     2.453     75.17 
       Portland-Auburn, ME           2    0.1963     75.37 
   Providence et al, RI-MA           9    0.8832     76.25 
    Quincy et al, IL-MO-IA           1   0.09814     76.35 
         Raleigh et al, NC           6    0.5888     76.94 
            Rapid City, SD           1   0.09814     77.04 
                   Reno, NV          3    0.2944     77.33 
   Richmond-Petersburg, VA           5    0.4907     77.82 
     Roanoke-Lynchburg, VA           2    0.1963     78.02 
             Rochester, NY           4    0.3925     78.41 
              Rockford, IL           1   0.09814     78.51 
      Sacramento et al, CA           6    0.5888      79.1 
             Salisbury, MD           2    0.1963     79.29 
        Salt Lake City, UT          19     1.865     81.16 
           San Antonio, TX          13     1.276     82.43 
             San Diego, CA          28     2.748     85.18 
   Santa Barbara et al, CA           5    0.4907     85.67 
              Savannah, GA           1   0.09814     85.77 
         Seattle-Tacoma, WA          1   0.09814     85.87 
        Seattle-Tacoma, WA          38     3.729      89.6 
        Sherman-Ada, TX-OK           2    0.1963     89.79 
     Sioux Falls et al, SD           1   0.09814     89.89 
    South Bend-Elkhart, IN           1   0.09814     89.99 
                Spokane- wa          1   0.09814     90.09 
   Springfield-Holyoke, MA           1   0.09814     90.19 
             St. Louis, MO          11     1.079     91.27 
              Syracuse, NY           2    0.1963     91.46 
  Tallahassee et al, FL-GA           1   0.09814     91.56 
           Tampa et al, FL          16      1.57     93.13 
                Toledo, OH           1   0.09814     93.23 
   Tucson(Sierra Vista), AZ          8    0.7851     94.01 
                 Tulsa, OK           1   0.09814     94.11 
   Tyler-Longview et al, TX          2    0.1963     94.31 
                 Utica, NY           1   0.09814     94.41 
   W. Palm Beach et al, FL           8    0.7851     95.19 
     Waco-Temple-Bryan, TX           4    0.3925     95.58 
   Washington et al, DC-MD          37     3.631     99.21 
             Watertown, NY           1   0.09814     99.31 
  Wichita Fls et al, TX-OK           2    0.1963     99.51 
          Yakima et al, WA           3    0.2944      99.8 
                spokane- wa          2    0.1963       100 
                      Total       1019       100

From this report we can identify a few issues that need addressing:

The Spokane-WA market is entered twice, once as spokane- wa and once as Spokane- wa.
The format of the Spokane-WA market differs from the other entries. It uses a dash rather than a comma to separate the city from the state.
The Abilene-Sweetwater, TX and Seattle-Tacoma, WA markets occur twice because of differing white spaces.
The misalignment in the market names indicates that there are leading and trailing white spaces which should be removed.
It would be useful to separate the Nielsen Market into Nielsen City and Nielsen State.

/*
** Cleaning  data
*/
// Strip leading and trailing white spaces
territory_info[., "Nielsen Market"] = 
strtrim(territory_info[., "Nielsen Market"]);

// Update the Spokane listing
territory_info[., "Nielsen Market"] = 
strreplace(territory_info[., "Nielsen Market"], "spokane", "Spokane");

// Replace Spokane-WA with Spokane, WA
territory_info[., "Nielsen Market"] = 
strreplace(territory_info[., "Nielsen Market"], "Spokane- wa", "Spokane, WA");

// Split Nielsen Market into state and city
nielsen = asDF(strsplit(territory_info[., "Nielsen Market"], ","), 
          "Nielsen City", "Nielsen State");

Notice that we've used three different GAUSS string procedures above. These three are all very useful for data cleaning and are worth noting:

Procedure	Purpose
strtrim	Strips all white space characters from the left and right side of each element in a string array.
strreplace	Replaces all matches of a substring with a replacement string.
strsplit	Splits a string into individual tokens based on a specified separator.
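To see the three procedures working together outside our dataset, here is a minimal sketch on a small, made-up string array. The variable names and sample values are hypothetical and chosen only for illustration.

```
// Hypothetical sample with stray white space
// and one inconsistently formatted entry
markets = "  Utica, NY " $| "spokane- wa" $| "Yakima et al, WA  ";

// Strip leading and trailing white space
markets = strtrim(markets);

// Standardize the Spokane entry
markets = strreplace(markets, "spokane- wa", "Spokane, WA");

// Split each entry at the comma into city and state tokens
parts = strsplit(markets, ",");

print parts;
```

Because we split on the comma alone, the state tokens may keep the space that followed the comma. Applying strtrim to the result should remove it.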

Generating Email Addresses

Now that we've cleaned up the Nielsen Market data, we can generate the email addresses for our list. The email address for each store takes the general form storenumber + "d" + "@petpeople.com". For example, the email address for store number 548 is "548d@petpeople.com".

To generate our email addresses we need to:

Convert the store numbers to strings.
Add the suffix of the email address to the new strings.

To do this in GAUSS we will:

Convert the store numbers to strings using the GAUSS function itos.
Add the string suffix to form the email using $+.
Change the string array to a dataframe using asDF.

/*
** Create email addresses
*/
// Convert store number to string
str_store = itos(territory_info[., "Store"]);

// Add suffix
email_address = str_store $+ "d@petpeople.com";

// Convert to dataframe
// and name the variable "Email"
email_df = asDF(email_address, "Email");

Build Email Database

We want the final email database to include KAR, Store Name, Email, Nielsen City, and Nielsen State.

// Form dataframe containing
// email list information
email_database = territory_info[., "KAR" "Store Name"] ~ email_df ~ nielsen;

// Preview database
head(email_database);

The first five rows of our data look like:

             KAR       Store Name              Email     Nielsen City    Nielsen State
   Larry McGuire     NY-MIDDLETWN 725d@petpeople.com         New York               NY
   Larry McGuire        STRATFORD 728d@petpeople.com         New York               NY
   Larry McGuire       NORWALK-CT 752d@petpeople.com         New York               NY
   Larry McGuire       SEEKONK-MA 758d@petpeople.com Providence et al            RI-MA
   Larry McGuire   SOUTHINGTON-CT 762d@petpeople.com Hartford and New               CT

Filtering the Data

Now that our database is created, let's filter our data to focus on one representative, Jeff Canary, and save the email list under his name.

/*
** Filtering and saving our email list
*/
// Specify KAR 
name = "Jeff Canary";

// Filter data for specified employee
email_list = selif(email_database, email_database[., "KAR"] .$== name);

Export to CSV file

As a final step, we will export the email_list dataframe to a CSV file using saved.

// Create file name
fsave_name = name $+ "_store_emails.csv";

// Save file
saved(email_list, fsave_name);

Extra Credit: Looping Through All Representatives

Suppose we need to export email lists for all representatives. We can do this using a fairly simple loop.

// Get list of unique 
// representative names
kar_names  = unique(email_database[., "KAR"]);

// Loop over all names
for i(1, rows(kar_names), 1);
  /*
  ** Filtering and saving our email list
  */
  // Specify KAR to create email list for
  name = kar_names[i];

  // Filter data for specified employee
  email_list = selif(email_database, email_database[., "KAR"] .$== name);

  // Save email list
  fsave_name = name $+ "_store_emails.csv";

  // Save file
  saved(email_list, fsave_name);
endfor;

Conclusion

In today's blog we've demonstrated the improved string capabilities of GAUSS using a simple data cleaning task. Our project covered several useful tasks including:

Loading raw data.
Cleaning common string data issues.
Generating new string variables by splitting and joining strings.
Exporting dataframes as CSV files.

Importing FRED Data to GAUSS

Eric — Fri, 16 Dec 2022 02:05:10 +0000

Introduction

The GAUSS FRED database integration, introduced in GAUSS 23, is a time-saving feature that allows you to import FRED data directly into GAUSS. This means you have thousands of datasets at your fingertips without ever leaving GAUSS. These tools also ensure that FRED data is imported directly into a GAUSS dataframe format, which can eliminate hours of data cleaning and the headaches that come with it.

In today's blog, we will learn how to use the FRED import tools to:

Search for a FRED data series.
Import FRED data to GAUSS, including merging multiple series.
Use advanced import tools to perform data transformations.

Getting Started

Requesting an API Key

Before importing any data from FRED using GAUSS, you will need to request an API key from FRED. This can be done on the FRED API Request page. To request an API key you will need to:

Create or log in to a FRED account.
Provide a brief description of the program you intend to write. This can be as simple as "Using GAUSS to conduct economic research."

Specifying your API key in GAUSS

You can set your API key in GAUSS using any of the following methods:

Set the API key directly at the top of your program:
```
FRED_API_KEY = "your_api_key";
```
Set the environment variable FRED_API_KEY to your API key.
Edit your gauss.cfg and modify the fred_api_key value:
```
fred_api_key = your_api_key
```

Finding Your FRED Series

In order to download a series directly from FRED, we will need to know the series ID. However, this may not be something you know right offhand. Fortunately, we can use the fred_search procedure to find the proper series ID.

The fred_search procedure requires one input, a string specifying the search text. As an example, let's search for all series related to "producer price index":

fred_search("producer price index");

This prints a search report to the command window. The first few rows are:

frequency  frequency_short group_popularity              id     last_updated  observation_end observation_star       popularity     realtime_end   realtime_start seasonal_adjustm seasonal_adjustm            title            units      units_short
Monthly                 M        80.000000           PPIACO 2022-11-15 07:52       2022-10-01       1913-01-01        80.000000       2022-11-23       2022-11-23 Not Seasonally A              NSA Producer Price I   Index 1982=100   Index 1982=100
Monthly                 M        79.000000          WPU0911 2022-11-15 07:52       2022-10-01       1926-01-01        79.000000       2022-11-23       2022-11-23 Not Seasonally A              NSA Producer Price I   Index 1982=100   Index 1982=100
Monthly                 M        79.000000            PCEPI 2022-10-28 08:40       2022-09-01       1959-01-01        78.000000       2022-11-23       2022-11-23 Seasonally Adjus               SA Personal Consump   Index 2012=100   Index 2012=100
Monthly                 M        78.000000  PCU325211325211 2022-11-15 07:55       2022-10-01       1976-06-01        78.000000       2022-11-23       2022-11-23 Not Seasonally A              NSA Producer Price I Index Dec 1980=1 Index Dec 1980=1

We can see that the FRED search report provides a thorough summary of related series. In addition to the id, which we will need to import the data from FRED, some other useful fields include:

Frequency.
Popularity.
Last updated.
Observation end.
Observation start.
Seasonal adjustment status.
Units.

For our next steps, let's use the PPIACO series, which is the highest popularity series related to the search term Producer Price Index.
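The full report is wide, so it can help to store the search results and preview only a few fields. The sketch below assumes the results are returned as a dataframe with the column names shown in the report above; ppi_search is a hypothetical variable name.

```
// Store the search results in a dataframe
ppi_search = fred_search("producer price index");

// Preview only the columns needed to choose a series
head(ppi_search[., "id" "title" "frequency" "popularity"]);
```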

Note: A number of advanced search options are available and are described in the official documentation for fred_search.

Importing Data From FRED

Loading A Single Series From FRED

Next, we will import the PPIACO series from the FRED database into GAUSS using the fred_load procedure.

The fred_load procedure requires one string input specifying the series ID to be loaded. To load the producer price data that we found with our FRED search, we will use the series ID PPIACO:

// Download all observations of 'PPIACO' into a GAUSS dataframe
PPI = fred_load("PPIACO");

We can examine the first five rows of the PPI dataframe using the head procedure:

// Print the first 5 rows of 'PPI'
head(PPI);

which reports

            date           PPIACO
      1913-01-01        12.100000
      1913-02-01        12.000000
      1913-03-01        12.000000
      1913-04-01        12.000000
      1913-05-01        11.900000

We can also use the tail procedure to examine the last 5 rows of the PPI dataframe:

// Print the last 5 rows of 'PPI'
tail(PPI);

            date           PPIACO
      2022-06-01        280.25100
      2022-07-01        272.27800
      2022-08-01        269.46500
      2022-09-01        268.69300
      2022-10-01        265.19300

This shows us that the PPIACO data ranges from January 1913 to October 2022, which is consistent with the observation start and end dates reported in our FRED search.

Loading Multiple Series From FRED

The fred_load procedure can also be used to load multiple series from FRED simultaneously. To do this, we use a GAUSS formula string syntax, using + to add additional series IDs to our formula string.

// Load producer price
// and treasury bond data
macro_data = fred_load("PPIACO + T10Y2Y");

// Preview data
head(macro_data);

The preview of our data shows that our two series have been imported together and automatically merged by date:

            date           PPIACO           T10Y2Y
      1913-01-01        12.100000                .
      1913-02-01        12.000000                .
      1913-03-01        12.000000                .
      1913-04-01        12.000000                .
      1913-05-01        11.900000                .

However, the preview doesn't necessarily give us reassurance that T10Y2Y was loaded properly because the values for the first five observations are all missing. Let's take a quick look at some summary statistics using dstatmt:

// Compute and print descriptive statistics
// for all variables in 'macro_data'
dstatmt(macro_data);

This prints a summary table to our Command Window:

-----------------------------------------------------------------------------
Variable    Mean   Std Dev  Variance     Minimum     Maximum   Valid  Missing
-----------------------------------------------------------------------------

date       -----     -----     -----  1913-01-01  2022-11-25   13048        0
PPIACO     74.57      66.3      4396        10.3       280.3    1318    11730
T10Y2Y    0.9146     0.903    0.8155       -2.41        2.91   11619     1429

From this, we can tell that both series have been imported properly. However, they have different ranges, with both series having a number of missing values.
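If an analysis needs only the dates on which both series are observed, one option is to drop every row that contains a missing value. The sketch below assumes that packr, which removes rows with missing values, can be applied to the merged dataframe; macro_complete is a hypothetical variable name.

```
// Keep only rows where both PPIACO
// and T10Y2Y are observed
macro_complete = packr(macro_data);

// Confirm that no missing values remain
dstatmt(macro_complete);
```

Because the two series are observed on different dates, the result may be much shorter than either series on its own.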

Plotting a FRED Series

It can be useful to view FRED data before storing it in the GAUSS workspace. This can be done by passing the result of fred_load directly to plotXY.

To do this, we need to remember the dataframe returned from fred_load will always contain:

A date variable named date.
A variable for every series loaded, named with the series ID.

As an example, let's consider viewing the FRED S&P 500 series with the series ID sp500:

plotXY(fred_load("sp500"), "sp500 ~ date");

Advanced Import Tools

One of the most useful features of the GAUSS FRED import tools is that they can perform a number of data cleaning tasks at the time of import. In this section, we will look at how to use the FRED import tools to:

Filter dates.
Aggregate data.
Perform data transformations.

The FRED Parameter List

GAUSS FRED functions use a parameter list for passing advanced settings. This list is constructed using the fred_set function.

The fred_set function creates a running list of parameters you want to pass to the FRED functions. It is specified by first listing a parameter name, then the associated parameter value.

For example:

// Create a FRED parameter list with
// 'frequency' set to 'q' (quarterly)
params_GDP = fred_set("frequency", "q");

If we wish to add additional parameter values, we can update an existing parameter list:

// Set 'aggregation_method' to end-of-period
// in the previously created parameter list 'params_GDP'
params_GDP = fred_set("aggregation_method", "eop", params_GDP);

Or we can specify all parameters at the same time:

// Create a FRED parameter list with 2 settings at once.
params_GDP = fred_set("frequency", "q", "aggregation_method", "eop");

There are a few things to note about the parameter list:

The parameter specifications are case sensitive.
Order does not matter, with the exception that each parameter should be directly followed by its associated value. For example, we could have also specified

params_GDP = fred_set("aggregation_method", "eop", "frequency", "q");

Next, we'll look at how to use the parameter list for advanced FRED data import.

Filtering Dates

The observation_start and/or observation_end parameters can be used to filter the range of imported data.

For example, suppose we are interested in loading seasonally adjusted CPI data from January 1971 onward. Let's start by searching for the series ID we want to load:

// Read series information from FRED and print first 5 rows
head(fred_search("consumer price index seasonally adjusted"));

       frequency  frequency_short group_popularity               id     last_updated            notes  observation_end observation_star       popularity     realtime_end   realtime_start seasonal_adjustm seasonal_adjustm            title            units      units_short
         Monthly                M        95.000000         CPIAUCSL 2022-11-10 07:38 The Consumer Pri       2022-10-01       1947-01-01        94.000000       2022-11-28       2022-11-28 Seasonally Adjus               SA Consumer Price I Index 1982-1984= Index 1982-1984=
         Monthly                M        95.000000         CPIAUCNS 2022-11-10 07:38 Handbook of Meth       2022-10-01       1913-01-01        71.000000       2022-11-28       2022-11-28 Not Seasonally A              NSA Consumer Price I Index 1982-1984= Index 1982-1984=
      Semiannual               SA        95.000000      CUUS0000SA0 2022-07-13 07:37                .       2021-01-01       1913-01-01        38.000000       2022-11-28       2022-11-28 Not Seasonally A Consumer Price I Inflation, consu          Percent Index 1982-1984=
          Annual                A        84.000000   FPCPITOTLZGUSA 2022-05-03 14:01 Inflation as mea       2021-01-01       1960-01-01        84.000000       2022-11-28       2022-11-28 Not Seasonally A              NSA Inflation, consu          Percent                %
         Monthly                M        83.000000  CPALTT01USM657N 2022-11-14 14:25 OECD descriptor        2022-09-01       1960-01-01        80.000000       2022-11-28       2022-11-28 Not Seasonally A              NSA Consumer Price I Growth rate prev Growth rate prev

It looks like the best series for us to use is "CPIAUCSL". However, this series starts in January 1947.

We can tell GAUSS to only import data starting from 1971 by setting the observation_start parameter to "1971-01-01" using the fred_set procedure:

// Set observation_start parameter
// to use all data on or after 1971-01-01
params_cpi = fred_set("observation_start", "1971-01-01");

Now we can load our CPI data using fred_load with two inputs:

The series ID.
The parameter list, params_cpi.

// Load data using a parameter list
cpi_m = fred_load("CPIAUCSL", params_cpi);

// Preview first 5 rows of data
head(cpi_m);

Our data preview shows that the imported data starts on January 1, 1971:

            date         CPIAUCSL
      1971-01-01        39.900000
      1971-02-01        39.900000
      1971-03-01        40.000000
      1971-04-01        40.100000
      1971-05-01        40.300000

Aggregating Data

Next, suppose we want to aggregate our data from monthly to quarterly data. The FRED import tools provide a convenient way to do this at the time of import using the frequency parameter.

The frequency parameter allows you to specify the frequency of data you would like. The specified frequency can only be the same or lower than the frequency of the original series.

Frequency options include:

Specifier	Description
"d"	Daily
"w"	Weekly
"bw"	Biweekly
"m"	Monthly
"q"	Quarterly
"sa"	Semiannual
"a"	Annual

The default aggregation method is to use averaging. However, the aggregation_method parameter can be used to specify an aggregation method. Aggregation options include:

Specifier	Description
"avg"	Average
"sum"	Sum
"eop"	End of Period

Let's use the frequency parameter to aggregate the monthly "CPIAUCSL" series to quarterly observations. We will also use the aggregation_method to specify that end-of-period aggregation is used:

// Set parameter list
// Include previously specified
// parameter list to append new specifications
params_cpi = fred_set("frequency", "q", "aggregation_method", "eop", params_cpi);

// Load quarterly CPI
cpi_q_eop  = fred_load("CPIAUCSL", params_cpi);

head(cpi_q_eop);

            date         CPIAUCSL
      1971-01-01        40.000000
      1971-04-01        40.500000
      1971-07-01        40.800000
      1971-10-01        41.100000
      1972-01-01        41.400000

The cpi_q_eop dataframe now contains quarterly data starting in January 1971.
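As a quick check against the monthly values printed earlier, end-of-period aggregation keeps the last monthly observation in each quarter, so the first quarterly value, 40.0, matches the March 1971 observation. Assuming the default "avg" method takes a simple mean of the three monthly values in the quarter, it would instead give approximately:

$\frac{39.9 + 39.9 + 40.0}{3} \approx 39.93$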

Transformations

Finally, suppose we want to use our CPI data to study inflation. With the FRED import tools, we can do this using the units parameter with the fred_load procedure.

The units options include:

Specifier	Description
"lin"	Levels (no transformation).
"chg"	Change.
"ch1"	Change from one year ago.
"pch"	Percent change.
"pc1"	Percent change from one year ago.
"pca"	Compounded annual rate of change.
"cch"	Continuously compounded rate of change.
"cca"	Continuously compounded annual rate of change.
"log"	Natural log.

Let's update our params_cpi parameter list and import the percent change of "CPIAUCSL" from a year ago.

// Set params
params_cpi = fred_set("units", "pc1", params_cpi);

// Load year-over-year percent change
// in quarterly CPI
infl_q  = fred_load("CPIAUCSL", params_cpi);

// Plot the inflation series
plotXY(infl_q,  "CPIAUCSL ~ date");
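For reference, the "pc1" transformation can be written as below, where $x_t$ is the series value at time $t$ and $n$ is the number of observations per year ($n = 4$ for our quarterly data). This is a sketch of the standard year-over-year percent change, not a statement of how the calculation is implemented internally.

$\text{pc1}_t = 100 \times \left( \frac{x_t}{x_{t-n}} - 1 \right)$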

Conclusion

In today's blog, we saw how the GAUSS FRED integration introduced in GAUSS 23 can save you time and effort when it comes to working with FRED data.

We learned how to use the FRED import tools to:

Search for a FRED data series.
Import FRED data to GAUSS, including merging multiple series.
Use advanced import tools to perform data transformations.

Introduction to Efficient Creation of Detailed Plots

aptech — Tue, 27 Sep 2022 16:26:33 +0000

Introduction

A few weeks ago, we showed you how to create a detailed plot from a recent article in the American Economic Review. That article contained several plots with quite a bit of similar, stylized formatting. Today we will show you how to efficiently create two of these graphs.

Our main goals are to get you thinking about code reuse and how it can help you:

Get more results from your limited research time.
Avoid the frustration that comes from growing mountains of spaghetti code.

If you missed it, be sure to check out our original blog on this topic, Advanced Formatting Techniques for Creating AER Quality Plots.

Our Graphs

This is what we will create today. As you can see they share many style attributes. This gives us a great opportunity to reuse code.

You can download the data here.

Our Initial Code

This is not a massive amount of code and many of you might be tempted to just copy and paste this code and make the minor modifications needed to get your desired result. I completely understand. Your biggest problem is probably a lack of time, so productivity is paramount.

While it might feel like this is a shortcut, it will saddle you with technical debt. Technical debt is just a fancy term that describes the stress, frustration, and time-wasting that inevitably occurs when you take shortcuts like this.

Avoiding these shortcuts will not only save you pain, it might save you some embarrassment as well. These sorts of mundane issues are real drivers of the replication crisis in research today.

Your research is important and I know you want to do it right, so let's get started!

new;
cls;

/*
** Load and preview data
*/
int_rate = loadd("int_rate.csv");

tail(int_rate);

ks = { 0.517, 0.653, 0.781  };

/*
** Graph data
*/

// Graph size
plotCanvasSize("px", 500 | 400);

// Default settings
struct plotControl plt;
plt = plotGetDefaults("xy");

// Font
plotSetFonts(&plt, "all", "roboto", 14);

// Legend
plotSetLegend(&plt, "", "vcenter left inside", 1);
plotSetLegendBkd(&plt, 0);

// Main line settings
clrs = getColorPalette("set2");
plotSetLinePen(&plt, 4, clrs[3 2], 1|3);

// Axes outline (spine)
plotSetOutlineEnabled(&plt, 1);

// X-axis
plotSetTextInterpreter(&plt, "latex", "xaxis");
plotSetXAxisLabel(&plt, "\\text{country opacity }, \\omega");

// Y-axis
plotSetYLabel(&plt, "interest rate");

// Draw main plot
plotXY(plt, int_rate, "high + low ~ x");

// Style and add vertical lines
plotSetLinePen(&plt, 1, "#CCC", 2);
plotAddVLine(plt, ks);

// Style text boxes
struct plotAnnotation ant;
ant = annotationGetDefaults();
annotationSetTextInterpreter(&ant, "latex");
annotationSetLinePen(&ant, 0, "", -1);
annotationSetFont(&ant, "", 14, "#3333");
annotationSetBkd(&ant, "", 0);

// Add text boxes
plotAddTextbox(ant, "\\omega_1", ks[1], 0.15);
plotAddTextbox(ant, "\\omega_2", ks[2], 0.15);
plotAddTextbox(ant, "\\omega_3", ks[3], 0.15);

Initial Code Simplification

We will start by creating a procedure to hold some of the plot styling functions that we want to repeat and apply them to the first plot only. Then we will add the data for the second plot.

It looks like all of the styling applied before the call to plotXY will be the same in both plots, but the y-axis label text is different. So, let's create a procedure that will apply the main settings:

new;
cls;

/*
** Load and preview data
*/
int_rate = loadd("int_rate.csv");

tail(int_rate);

ks = { 0.517, 0.653, 0.781  };

/*
** Graph data
*/

// Graph size
plotCanvasSize("px", 500 | 400);

// Declare plotControl structure
struct plotControl plt;

// Fill with defaults for this project
plt = pltDefaults();

// Set y-axis label for first plot
plotSetYLabel(&plt, "interest rate");

// Draw first plot
plotXY(plt, int_rate, "high + low ~ x");

proc (1) = pltDefaults();
    local clrs;

    struct plotControl plt;
    plt = plotGetDefaults("xy");

    // Font
    plotSetFonts(&plt, "all", "roboto", 14);

    // Legend
    plotSetLegend(&plt, "", "vcenter left inside", 1);
    plotSetLegendBkd(&plt, 0);

    // Main line settings
    clrs = getColorPalette("set2");
    plotSetLinePen(&plt, 4, clrs[3 2], 1|3);

    // Axes outline (spine)
    plotSetOutlineEnabled(&plt, 1);

    // X-axis
    plotSetTextInterpreter(&plt, "latex", "xaxis");
    plotSetXAxisLabel(&plt, "\\text{country opacity }, \\omega");

    retp(plt);
endp;

While this code is slightly longer when drawing just one plot, it will save us effort when we add the next plot. Before we do that, we need to address the vertical lines and annotations.

Simplifying the Annotations

Looking over the plots at the top of this article shows us that the vertical lines and the omega text boxes all depend on the ks vector. Since they seem to be intertwined, it is probably safe to put them in one procedure.

The simplest thing to do would be to add all the annotation code to a single procedure like this:

proc (0) = pltAddOmegas(ks);

    struct plotControl plt;
    plt = plotGetDefaults("xy");

    // Style and add vertical lines
    plotSetLinePen(&plt, 1, "#CCC", 2);
    plotAddVLine(plt, ks);

    // Style text boxes
    struct plotAnnotation ant;
    ant = annotationGetDefaults();
    annotationSetTextInterpreter(&ant, "latex");
    annotationSetLinePen(&ant, 0, "", -1);
    annotationSetFont(&ant, "", 14, "#3333");
    annotationSetBkd(&ant, "", 0);

    // Add text boxes
    plotAddTextbox(ant, "\\omega_1", ks[1], 0.15);
    plotAddTextbox(ant, "\\omega_2", ks[2], 0.15);
    plotAddTextbox(ant, "\\omega_3", ks[3], 0.15);
endp;

and then call that procedure right after plotXY. In this case, it is not a bad place to start. However, since we are in learning mode, let's pretend that we are going to create more graphs in this file that add text boxes with the same styling, but use different Greek letters and place them in different locations on the graph.

In that case, we would probably want to separate the text box styling from the text box drawing, like this:

proc (1) = textBoxDefaults();
    struct plotAnnotation ant;
    ant = annotationGetDefaults();

    annotationSetTextInterpreter(&ant, "latex");
    annotationSetLinePen(&ant, 0, "", -1);
    annotationSetFont(&ant, "", 14, "#3333");
    annotationSetBkd(&ant, "", 0);

    retp(ant);
endp;

proc (0) = pltAddOmegas(ks);

    struct plotControl plt;
    plt = plotGetDefaults("xy");

    // Style and add vertical lines
    plotSetLinePen(&plt, 1, "#CCC", 2);
    plotAddVLine(plt, ks);

    struct plotAnnotation ant;
    ant = textBoxDefaults();

    // Add text boxes
    plotAddTextbox(ant, "\\omega_1", ks[1], 0.15);
    plotAddTextbox(ant, "\\omega_2", ks[2], 0.15);
    plotAddTextbox(ant, "\\omega_3", ks[3], 0.15);
endp;
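With the styling separated out, a more general drawing procedure becomes possible. The sketch below is one hypothetical way to do it: pltAddLabels is not part of GAUSS or of the original code, and it assumes labels is a string array with one row per element of x_pos.

```
// Hypothetical helper: draw one styled text box per label
proc (0) = pltAddLabels(labels, x_pos, y_pos);
    struct plotAnnotation ant;

    // Reuse the shared text box styling
    ant = textBoxDefaults();

    // Add each label at its x position and a common y position
    for i(1, rows(x_pos), 1);
        plotAddTextbox(ant, labels[i], x_pos[i], y_pos);
    endfor;
endp;
```

Under these assumptions, the omega labels could be added with pltAddLabels("\\omega_1" $| "\\omega_2" $| "\\omega_3", ks, 0.15), and a later graph could pass different labels and positions without repeating the styling code.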

Conclusion and final code

Below is the final code to create the graphs from the top of this blog. This isn't designed to show you the best way to write this code, but rather to get you started with the idea of code reuse.

Software engineers sometimes use the acronym DRY — Don't Repeat Yourself. While that is a great practice, even just repeating yourself less often will bring you great rewards.

new;
cls;

/*
** Load and preview data
*/
int_rate = loadd("int_rate.csv");
tail(int_rate);

rationing = loadd("rationing.csv");
tail(rationing);

ks = { 0.517, 0.653, 0.781  };

/*
** Graph data
*/

// Graph size
plotCanvasSize("px", 1000 | 400);

// Declare plotControl structure and
// fill with defaults for this project
struct plotControl plt;
plt = pltDefaults();

/*
** Interest rate plot
*/

// Create grid for multiple plots
plotLayout(1,2,1);

// Set y-axis label for first plot
plotSetYLabel(&plt, "interest rate");

// Draw first plot
plotXY(plt, int_rate, "high + low ~ x");
pltAddOmegas(ks);

/*
** Rationing plot
*/

// Select the second cell in the plot grid
plotLayout(1,2,2);

// Set y-axis label for second plot
plotSetYLabel(&plt, "rationing");

// Draw second plot
plotXY(plt, rationing, "high + low ~ x");
pltAddOmegas(ks);

proc (1) = pltDefaults();
    local clrs;

    struct plotControl plt;
    plt = plotGetDefaults("xy");

    // Font
    plotSetFonts(&plt, "all", "roboto", 14);

    // Legend
    plotSetLegend(&plt, "", "vcenter left inside", 1);
    plotSetLegendBkd(&plt, 0);

    // Main line settings
    clrs = getColorPalette("set2");
    plotSetLinePen(&plt, 4, clrs[3 2], 1|3);

    // Axes outline (spine)
    plotSetOutlineEnabled(&plt, 1);

    // X-axis
    plotSetTextInterpreter(&plt, "latex", "xaxis");
    plotSetXAxisLabel(&plt, "\\text{country opacity }, \\omega");

    retp(plt);
endp;

proc (1) = textBoxDefaults();
    struct plotAnnotation ant;
    ant = annotationGetDefaults();

    annotationSetTextInterpreter(&ant, "latex");
    annotationSetLinePen(&ant, 0, "", -1);
    annotationSetFont(&ant, "", 14, "#3333");
    annotationSetBkd(&ant, "", 0);

    retp(ant);
endp;

proc (0) = pltAddOmegas(ks);

    struct plotControl plt;
    plt = plotGetDefaults("xy");

    // Style and add vertical lines
    plotSetLinePen(&plt, 1, "#CCC", 2);
    plotAddVLine(plt, ks);

    struct plotAnnotation ant;
    ant = textBoxDefaults();

    // Add text boxes
    plotAddTextbox(ant, "\\omega_1", ks[1], 0.15);
    plotAddTextbox(ant, "\\omega_2", ks[2], 0.15);
    plotAddTextbox(ant, "\\omega_3", ks[3], 0.15);
endp;

Advanced Formatting Techniques for Creating AER Quality Plots

aptech — Wed, 27 Jul 2022 21:18:24 +0000

Introduction

Today's blog will show you how to reproduce one of the graphs from a paper in the June 2022 issue of the journal, American Economic Review. You will learn how to:

Add and style text boxes with LaTeX.
Set the anchor point of text boxes.
Add and style vertical lines.
Automatically set legend text to use your dataframe's variable names.
Set the font for all or a subset of the graph text elements.
Set the size of your graph.

The Graph and Data

Below is the graph that we are going to create. It is adapted from a recent paper in the American Economic Review. You can download the data here.

Load and Preview Data

Our first step will be to load the data and take a quick look at it.

// Load all variables from 'int_rate.csv'
int_rate = loadd("int_rate.csv");

// Print the first 5 observations of 'int_rate'
print "First 5 observations:";
head(int_rate);

// Print the last 5 observations of 'int_rate'
print "Last 5 observations:";
tail(int_rate);

// Print descriptive statistics of our variables
call dstatmt(int_rate);

This will give us the following results:

First 5 observations:

               x             high              low
   0.00010000000       0.17140598        0.0000000
   0.00020000000       0.17140598        0.0000000
   0.00030000000       0.17140598        0.0000000
   0.00040000000       0.17140598        0.0000000
   0.00050000000       0.17140598        0.0000000

Last 5 observations:

               x             high              low
      0.99950000       0.17140598       0.23012203
      0.99960000       0.17140598       0.23012203
      0.99970000       0.17140598       0.23012203
      0.99980000       0.17140598       0.23012203
      0.99990000       0.17140598       0.23012203

-------------------------------------------------------------------------------
Variable     Mean    Std Dev     Variance    Minimum    Maximum   Valid Missing
-------------------------------------------------------------------------------

x             0.5      0.289       0.0833      1e-4      0.999    9999    0
high       0.1714   1.04e-07     1.08e-14     0.171      0.171    9999    0
low       0.09374      0.108       0.0116         0      0.230    9999    0

Initial Graphs

Our first graphs will use the default GAUSS styling. We will create one graph indexing our x and y variables and another using a formula string.

Indexing

// Set our graph size to 500x400 pixels
plotCanvasSize("px", 500 | 400);

// Use indexing to select the 'x' and 'y' variables
plotXY(int_rate[.,"x"], int_rate[.,"high" "low"]);

Formula string

// Set our graph size to 500x400 pixels
plotCanvasSize("px", 500 | 400);

// Specify the 'x' and 'y' variables using a formula string
plotXY(int_rate, "high + low ~ x");

When we use a formula string with our plot functions, it tells GAUSS that we want to use the information from the dataframe.

When using a formula string in plots:

The tilde symbol, ~, separates the y variables on the left from the x variable(s) on the right.
If there is a single y variable, GAUSS will use that variable name to label the y axis. If there is more than one y variable, then the variable names will be added to the legend.
The name of the x variable will be used to label the x-axis.

While this may not always be the information we want to be displayed in our final plot, it makes it convenient to quickly create graphs that are easier to interpret.

Plot Styling

Next, we will adjust the styling to match our intended final plot.

To programmatically style our graph, the first thing we need to do is to create a plotControl structure and fill it with default values.

struct plotControl plt;
plt = plotGetDefaults("xy");

Legend styling

After the pointer to the plotControl structure we want to modify, &plt, the function plotSetLegend takes one required input and two optional ones.

Legend text: This controls the text that will be displayed in the legend. We want GAUSS to use the variable names from our input. Therefore, below, we set this to an empty string, "". This tells GAUSS that we do not want to modify the default behavior of the legend text.

As we mentioned earlier, the default behavior for a graph created with a formula string with more than one y variable is to use the y variable names as the legend text elements.
Legend location: This input can be a string with text location specifications, or a 2x1 vector with the x and y coordinates for the location of the top-left corner of the legend.
Legend orientation: This input specifies whether the legend items should be stacked vertically or horizontally. We set it to 1 to indicate a vertical arrangement. It may help to remember that a 1 is essentially a vertical mark.

The first input to plotSetLegendBkd, after the plotControl structure pointer, controls the legend opacity. We set it to be 0% opaque, or 100% transparent.

plotSetLegend(&plt, "", "vcenter left inside", 1);
plotSetLegendBkd(&plt, 0);

Font styling

plotSetFonts provides a convenient way to set the font family, size, and color for any subset of the text in your graph. Below we set the font for 'all' of the text in the plot. However, there are many other options, including: "axes", "legend", "legend_title", "title", "ticks" and many more.

plotSetFonts(&plt, "all", "roboto", 14);
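
To style only a subset of the text, the same function can be called again with a different location string. The sketch below is illustrative only: it uses a separate, hypothetical structure named myPlot so that it does not change the plt settings used in this tutorial, the font sizes are arbitrary, and it assumes a later call overrides the earlier "all" setting for the named location.

```gauss
// Hypothetical structure used only for this illustration
struct plotControl myPlot;
myPlot = plotGetDefaults("xy");

// Start with one font for all of the text
plotSetFonts(&myPlot, "all", "roboto", 14);

// Then make only the tick labels smaller
plotSetFonts(&myPlot, "ticks", "roboto", 11);
```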

X-axis label

plotSetTextInterpreter tells GAUSS whether you would like text labels to be interpreted as:

HTML
LaTeX
Plain text

Like plotSetFonts, it allows you to specify many different locations, or even "all". Below, we set the x-axis to be interpreted as LaTeX and then use LaTeX in our x-axis label.

plotSetTextInterpreter(&plt, "latex", "xaxis");
plotSetXAxisLabel(&plt, "\\text{country opacity }, \\omega");

Main line styling

The "set2" color palette contains eight colors.

We want to use the third color for our first series, "high", and the second color for our second series, "low".

Additionally, we set the line width to 4 pixels and set the line style to 1 and 3 respectively. One indicates a solid line and three is for a dotted line.

clrs = getColorPalette("set2");

// Set the line width, line colors, and line style
plotSetLinePen(&plt, 4, clrs[3 2], 1|3);

Axes outline

The axes outline, or spines as they are called in other libraries, is the set of lines around the edges of the data area. By default, the bottom x-axis and left y-axis lines are enabled. The code below also turns on the lines on the top x-axis and right y-axis.

plotSetOutlineEnabled(&plt, 1);

Graph before annotations

If we draw the graph, using our previous styling and a formula string as shown below:

plotXY(plt, int_rate, "high + low ~ x");

we get the following plot:

Add Vertical Lines

Line styling

We will continue with the plotControl structure we created earlier and modify the line settings to match the vertical lines from the graph we are trying to reproduce.

// Set lines to be: 1 pixel wide, light gray (#CCC) and dashed style (2)
plotSetLinePen(&plt, 1, "#CCC", 2);

Draw the lines

We add the vertical lines using plotAddVLine. plotAddVLine takes an optional plotControl structure as the first input and then a vector of one or more x-axis locations at which to draw the vertical spanning lines.

// The x-axis locations for the vertical lines
ks = { 0.517, 0.653, 0.781 };

plotAddVLine(plt, ks);

Add Text Annotations

Style text boxes

The annotation styling functions use a plotAnnotation structure but work very similarly to the main plot styling (plotSet) functions. Therefore, we will just use comments to describe their actions.

struct plotAnnotation ant;
ant = annotationGetDefaults();

// Set text to be interpreted as LaTeX
annotationSetTextInterpreter(&ant, "latex");

// Turn off the text box bounding line, by setting:
//     line-width=0, line-color="" (ignore), line-style=-1 (no line)
annotationSetLinePen(&ant, 0, "", -1);

// Leave the font-family as default, "",
// Set the font-size to 14 points and the color to a
// dark gray, #333
annotationSetFont(&ant, "", 14, "#333");

// Leave the annotation background color, "".
// Set the opacity to 0% (100% transparent)
annotationSetBkd(&ant, "", 0);

Draw the text boxes

After the optional plotAnnotation structure, plotAddTextbox takes 3 input arguments:

The text to display.
The x-axis coordinate.
The y-axis coordinate.

By default, the x and y-axis coordinates specify the location of the top-left of the bounding box that contains the text.

plotAddTextbox(ant, "\\omega_1", ks[1], 0.15);
plotAddTextbox(ant, "\\omega_2", ks[2], 0.15);
plotAddTextbox(ant, "\\omega_3", ks[3], 0.15);

Full code

Below is the full code to create our graph.

new;
cls;

/*
** Load and preview data
*/
int_rate = loadd("int_rate.csv");

tail(int_rate);

ks = { 0.517, 0.653, 0.781  };

/*
** Graph data
*/

// Graph size
plotCanvasSize("px", 500 | 400);

// Default settings
struct plotControl plt;
plt = plotGetDefaults("xy");

// Font
plotSetFonts(&plt, "all", "roboto", 14);

// Legend
plotSetLegend(&plt, "", "vcenter left inside", 1);
plotSetLegendBkd(&plt, 0);

// Main line settings
clrs = getColorPalette("set2");
plotSetLinePen(&plt, 4, clrs[3 2], 1|3);

// Axes outline (spine)
plotSetOutlineEnabled(&plt, 1);

// X-axis
plotSetTextInterpreter(&plt, "latex", "xaxis");
plotSetXAxisLabel(&plt, "\\text{country opacity }, \\omega");

// Draw main plot
plotXY(plt, int_rate, "high + low ~ x");

// Style and add vertical lines
plotSetLinePen(&plt, 1, "#CCC", 2);
plotAddVLine(plt, ks);

// Style text boxes
struct plotAnnotation ant;
ant = annotationGetDefaults();
annotationSetTextInterpreter(&ant, "latex");
annotationSetLinePen(&ant, 0, "", -1);
annotationSetFont(&ant, "", 14, "#333");
annotationSetBkd(&ant, "", 0);

// Add text boxes
plotAddTextbox(ant, "\\omega_1", ks[1], 0.15);
plotAddTextbox(ant, "\\omega_2", ks[2], 0.15);
plotAddTextbox(ant, "\\omega_3", ks[3], 0.15);

Bonus Content: Text Box Anchor Position

For this case, we wanted the text boxes to appear just to the right of the vertical lines, and the vertical position of the text boxes was not critical. Therefore, the default anchor position worked well.

However, if we had needed the text boxes to be towards the bottom of the graph, the first of them would have overlapped with one of our lines. We can see this by changing the plotAddTextbox lines to the following:

// Draw the text boxes at a lower position, y=0.04
plotAddTextbox(ant, "\\omega_1", ks[1], 0.04);
plotAddTextbox(ant, "\\omega_2", ks[2], 0.04);
plotAddTextbox(ant, "\\omega_3", ks[3], 0.04);

This makes the bottom of our graph look like this:

In this case, it would be nice to move the text boxes to the left of the vertical line. We can do this by using the final optional input of plotAddTextbox. It is a string that allows you to specify the position of the text box with respect to its anchor position.

The string options include:

Vertical position: "top", "vcenter", "bottom".
Horizontal position: "left", "hcenter", "right".

or "center" which is equivalent to "vcenter hcenter".

For this example, we will just move the text boxes to the left of the vertical lines, which are at the same x positions as the text boxes' anchor locations:

// Draw the text boxes at y=0.04, to the left of their anchor points
plotAddTextbox(ant, "\\omega_1", ks[1], 0.04, "left");
plotAddTextbox(ant, "\\omega_2", ks[2], 0.04, "left");
plotAddTextbox(ant, "\\omega_3", ks[3], 0.04, "left");

This gives us the following image:

Conclusion

Great job! You have learned how to:

Add and style text boxes with LaTeX.
Set the anchor point of text boxes.
Add and style vertical lines.
Automatically set legend text to use your dataframe's variable names.
Set the font for all or a subset of the graph text elements.
Set the size of your graph.
Control the position of text boxes with respect to their attachment point.

References

Farboodi, Maryam, and Péter Kondor. 2022. "Heterogeneous Global Booms and Busts." American Economic Review, 112 (7): 2178-2212. DOI: 10.1257/aer.20181830

Installing the GAUSS Package Manager [Video]

Eric — Tue, 31 May 2022 14:54:43 +0000

GAUSS packages provide access to powerful tools for performing data analysis. Learn how to install the GAUSS Package Manager, and get the quickest access to the full suite of GAUSS packages, in this short video.

Additional Resources

How to Load Excel Data into GAUSS

Eric — Mon, 18 Apr 2022 19:36:19 +0000

Introduction

Data loading is often the first step in your data analysis. In this video, you'll learn how to save time and avoid data loading errors when working with Excel files.

Our video demonstration shows just how quick and easy it can be to load time series, categorical and numeric variables from Excel files into GAUSS.

Interactively preview and load variables

See how to use the GAUSS Data Import window to interactively:

Load basic Excel data.
Load data from different Excel sheets.
Specify variables to load.
Specify dataframe names.

Use autogenerated code to reproduce your steps.

The Data Import window auto-generates code to perform all the import and filter steps. We show you how to put this code into a program and run the file to repeat your data loading steps.

Data Exploration and Cleaning

GAUSS provides an easy-to-use environment for data exploration and cleaning. In this video, we'll demonstrate how to:

Perform descriptive statistics.
Change variable names.
Specify values, such as -999, as missing values.
Change categorical labels.
Set the category base case.
