### Introduction

Loops account for a large percentage of computation time for many scientific computing problems. For this reason, GAUSS provides an easy method to make parallel loops with the `threadfor`

keyword.

This tutorial will start out by making a simple single-threaded code to compute parameter estimates for Rolling Regression on some simulated data. Then we will use the `threadfor`

keyword to parallelize the code.

## Simulate some data

In order to keep this tutorial focused on parallelization and make it simple for you to scale the problem up and down easily, we will just simulate random normal data.

```
//Number of observations
nobs = 1e6;
//Number of variables
nvars = 3;
//Set random seed for repeatable random numbers
rndseed 432343;
//Random independent variables, N(0,1)
X = rndn(nobs, nvars);
//Simulate parameter values
b_true = rndn(nvars, 1);
//Simulate error term
err = rndn(nobs, 1);
//Simulate our dependent variable
y = X * b_true + err;
```

## Single-threaded Rolling Regression

Now that we have computed our `y`

variable, we can implement a single-threaded Rolling Regression estimation. Our algorithm will be:

- Choose a window size.
- Estimate the parameters of a linear model from
`1:window_size`

. - Slide our window forward 1 observation, i.e. from
`2:window_size + 1`

. - Estimate the parameters of our current window.
- Continue steps 3 and 4 until we reach the end of the
`y`

vector.

```
window_size = 50;
//Number of iterations to completion
iters = rows(y) - window_size;
//Pre-allocate matrix to hold parameter estimates
out = zeros(iters, cols(X));
//Start timer
h_start = hsec();
for i(1, iters, 1);
//Estimate parameters of current window,
//transpose them to a row vector and place
//in 'out' vector
out[i, .] = (y[i:i+window_size] / x[i:i+window_size,.])';
endfor;
h_final = hsec();
print (h_final - h_start)/100 "seconds for single-thread";
```

The time to run the above code will vary based upon your computer and the version of GAUSS you are running. On the reference device for this tutorial, a quad-core Macbook Pro, this code takes about 1.73 seconds.

`threadfor`

basics

The `threadfor`

keyword takes the same inputs as the standard `for`

loop:

```
threadfor i(start, stop, step);
//loop body
threadendfor;
```

- i
- The loop counter
- start
- Value of the loop counter for the first iteration
- stop
- The final value of the loop counter in the last iteration
- step
- The amount to change the loop counter on each iteration

The difference, of course, is that`threadfor`

will run multiple iterations at the same time. Since any two iterations could be running at the same time, we cannot write data to the same location from separate iterations. Fortunately, in this example, we write to a separate row in each iteration. Since that row is controlled by the loop counter,`i`

, we do not have to worry about collisions. So we only need to change`for`

to`threadfor`

and our code will be run in parallel:

```
//Pre-allocate matrix to hold parameter estimates
out = zeros(iters, cols(X));
//Start timer
h_start = hsec();
//Parallel loop
threadfor i(1, iters, 1);
//Estimate parameters of current window,
//transpose them to a row vector and place
//in 'out' vector
out[i, .] = (y[i:i+window_size] / x[i:i+window_size,.])';
threadendfor;
h_final = hsec();
print (h_final - h_start)/100 "seconds with threadfor";
```

On the same quad-core Macbook Pro, the threaded code runs in about 0.44 seconds. This gives us a speed-up of almost exactly 4 times. Not all algorithms will scale as perfectly as this one, but it does indicate that `threadfor`

can be quite simple to use and provide a great performance boost.

## Temporary variables

Our first example was very easy and clean to convert to parallel, with excellent performance as well. However, we stated above that you cannot assign to the same variable (or more precisely the same index location) from separate threads. Since it is common for loops to require some scalar temporary variables, how can we handle this? Do we need to create an array for every single temporary variable? Fortunately, the answer is no.

In `threadfor`

loops, GAUSS will automatically create and manage loop specific versions of temporary variables that are created and used in a `threadfor`

loop. To illustrate this behavior, let's make a slight modification to our parallel Rolling Regression example:

```
//Parallel loop
threadfor i(1, iters, 1);
//Assign to temporary variables 'a' and 'b'
a = i;
b = i + window_size;
//Estimate parameters of current window,
//transpose them to a row vector and place
//in 'out' vector
out[i, .] = (y[a:b] / x[a:b,.])';
threadendfor;
```

In this case, the use of the temporary variables `a`

and `b`

is not critical for the code. But there are many cases in which it would be a burden to not be able to use them. However, due to the fact that we do not know the order in which the iterations will run, these temporary variables have some special behavior and rules. The rules for `threadfor`

temporary variables are:

- A variable that is assigned to without using an index is a temporary variable. Variables that are assigned to by index, such as the `out` vector in our example are called 'slice' variables.
- The first reference to a temporary variable in a `threadfor` loop must be an assignment.
- The value of a variable after the loop will not be changed by actions inside the loop.

Rules number 2 and 3 above are intended to prevent you from creating code with bugs called 'data races'. A 'data race' is when the output of your program is dependent on the order that parallel calculations are run. To make this more clear, let's look at a very simple example:

```
my_var = 3;
threadfor i(1, 4, 1);
my_var = i * 2;
threadendfor;
print my_var;
```

What do you think `my_var`

will be equal to after the `threadfor`

statement? The answer is 3. Since `my_var`

was not assigned to by index inside the `threadfor`

loop, it became a temporary variable. While a loop is a natural way to think about this code, the actual behavior of the `threadfor`

loop is more like this:

Thread 1 | Thread 2 | Thread 3 | Thread 4 |
---|---|---|---|

Create my_var_tmp_1 | Create my_var_tmp_2 | Create my_var_tmp_1 | Create my_var_tmp_2 |

↓ | ↓ | ↓ | ↓ |

my_var_tmp_1 = 1 * 2 | my_var_tmp_2 = 2 * 2 | my_var_tmp_1 = 1 * 2 | my_var_tmp_2 = 2 * 2 |

↓ | ↓ | ↓ | ↓ |

Destroy my_var_tmp_1 | Destroy my_var_tmp_2 | Destroy my_var_tmp_1 | Destroy my_var_tmp_2 |