### Introduction

Linear regression commonly assumes that the error terms of a model are independently and identically distributed (**i.i.d**). However, when datasets contain groups, the potential for correlated error terms within groups arises.

## Example: Weather shocks to apple orchards

For example, consider a model of the supply of apples from various orchards across the United States. Naturally, we would expect that orchards within Washington may all face similar weather-related "shocks" to their supply. However, we would not expect the weather shocks to the orchards in Washington to be the same as the weather shocks to the orchards in New York.

When these correlated within-group shocks occur, the i.i.d error term assumption is not valid, and conventional standard errors can lead to misleading inference about the coefficient estimates.

In these cases, it is important to use standard errors that appropriately account for the correlation between error terms within groups.

**Potential impacts of ignoring clustered error terms:**

- Standard errors that are too small
- Confidence interval bands that are too narrow
- Overly large t-statistics
- Over-rejection of the null hypothesis

## The Model

Let's consider our hypothetical apple supply model, which has observations for each individual orchard, $i = 1, 2, \ldots, N$, at each time period $t = 1, 2, \ldots, T$:

$$y_{it} = x_{it}\beta + u_{it}$$

We can further aggregate our dataset into state-level groups, $g = 1, 2, \ldots, G$ such that:

$$y_{igt} = x_{igt}\beta + u_{igt}$$

The cluster-robust approach assumes that the error term, $u_{igt}$, is correlated within groups but independent across groups. More formally:

$$E[u_{igt}u_{jhs}] \begin{cases} = 0 & \text{ if }g \neq h \\ \neq 0 & \text{ if }g = h \end{cases} $$
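The same assumption can be written in matrix form. As a sketch, stack the errors by cluster and let $\Omega_g$ denote the covariance matrix of the errors within cluster $g$ (the symbols $\Omega$ and $\Omega_g$ are introduced here for illustration and are not part of the notation above). The full error covariance matrix is then block diagonal:

$$E[uu'] = \Omega = \begin{bmatrix} \Omega_1 & 0 & \cdots & 0 \\ 0 & \Omega_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Omega_G \end{bmatrix}$$

Each block $\Omega_g$ is left unrestricted, in contrast to the i.i.d case, where $\Omega = \sigma^2 I$.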

The cluster-robust error computation allows for this correlation:

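$$V_{clu}[\hat{\beta}] = (X'X)^{-1} \left( \sum_{j=1}^G u_j' u_j \right) (X'X)^{-1}$$

where

$$u_j = \sum_{(i,t) \in \text{cluster } j} \hat{u}_{it} x_{it}$$

and $\hat{u}_{it}$ are the OLS residuals, which stand in for the unobserved errors $u_{it}$.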
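To make the sandwich formula concrete, here is a minimal sketch that computes it by hand on simulated data. Everything in it (the variable names, sample sizes, and simulated coefficients) is hypothetical and chosen only for illustration. It follows the formula exactly as written above, with OLS residuals in place of the errors and no finite-sample adjustment, so its output should not be expected to match what a packaged routine reports if that routine applies such an adjustment.

```
// Simulated data; all names and values here are hypothetical
rndseed 4231;
n = 200;                                   // total observations
nclus = 10;                                // number of clusters
clus = ceil(seqa(1, 1, n) ./ (n/nclus));   // cluster ids 1, ..., nclus

x = ones(n, 1) ~ rndn(n, 1);               // constant and one regressor
shock = rndn(nclus, 1);                    // one shared shock per cluster
u = shock[clus, .] + rndn(n, 1);           // errors correlated within cluster
y = x * (0.5|1) + u;

// OLS estimates and residuals
xxi = invpd(x'x);
b = xxi * (x'y);
e = y - x*b;

// Middle of the sandwich: sum over clusters of u_j'u_j
meat = zeros(cols(x), cols(x));
for j(1, nclus, 1);
    xj = selif(x, clus .== j);             // regressors for cluster j
    ej = selif(e, clus .== j);             // residuals for cluster j
    uj = ej'xj;                            // 1 x K sum of residual-weighted rows
    meat = meat + uj'uj;
endfor;

// Cluster-robust covariance matrix and standard errors
vclu = xxi * meat * xxi;
se_clu = sqrt(diag(vclu));

// Conventional i.i.d standard errors for comparison
s2 = (e'e) / (n - cols(x));
se_iid = sqrt(diag(s2 * xxi));

print "Cluster-robust SE:" se_clu';
print "i.i.d SE:         " se_iid';
```

The loop builds one $1 \times K$ vector per cluster and accumulates its outer product, which is why correlation between observations in the same cluster enters the variance while cross-cluster products do not.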

## Estimating our model in GAUSS

Let's look more formally at our apple production model using the `apples_cluster.dat` dataset. Using this data we will model the production of apples in relationship to orchard acreage:

$$prod = \beta_0 + \beta_1*acres + u$$

### Estimating i.i.d error terms

First, let's model the data using i.i.d standard errors:

```
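// Specify filename
fname = __FILE_DIR $+ "apples_cluster.dat";

// Declare control structure and fill with default settings
struct olsmtControl oCtl;
oCtl = olsmtControlCreate();

// Estimate model using OLS
struct olsmtOut oOut;
oOut = olsmt(fname, "prod ~ acres", oCtl);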
```

This yields the following results:

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|----------|----------|----------------|---------|-------------|-----------------------|------------------|
| CONSTANT | 0.0296797 | 0.0283593 | 1.04656 | 0.295 | --- | --- |
| acres | 1.03483 | 0.0285833 | 36.2041 | 0.000 | 0.455813 | 0.455813 |

### Estimating cluster-robust error terms

Now, we specify cluster-robust errors using two members in the `olsmtControl` structure:

- oCtl.cov
- String, the type of covariance matrix to be computed:
    - `"iid"` for i.i.d errors.
    - `"cluster"` for cluster-robust errors.
    - `"robust"` for the Huber/White sandwich estimator.
- oCtl.clusterId
- String, the name of the variable containing data groups.

**Note:** Because we are using formula strings to specify our model, we use `oCtl.clusterId` to specify our groups. However, if we use data matrices to specify our model, we use the member `oCtl.clusterVar` to specify our groups.

Our code for estimation now becomes:

```
// Declare control structure and fill with default settings
struct olsmtControl oCtl;
oCtl = olsmtControlCreate();

// Set up cluster id variable
oCtl.clusterId = "state";

// Turn on cluster-robust covariance
oCtl.cov = "cluster";

// Estimate model
struct olsmtOut oOut;
oOut = olsmt(fname, "prod ~ acres", oCtl);
```

Which yields:

| Variable | Estimate | Standard Error | t-value | Prob >\|t\| | Standardized Estimate | Cor with Dep Var |
|----------|----------|----------------|---------|-------------|-----------------------|------------------|
| CONSTANT | 0.0296797 | 0.0670127 | 0.442897 | 0.658 | --- | --- |
| acres | 1.03483 | 0.0505957 | 20.453 | 0.000 | 0.455813 | 0.455813 |

### Comparing results

There are several key things to note about the two sets of results.

- Using cluster-robust standard errors has no impact on the coefficient estimates.
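- The cluster-robust standard errors are larger than the i.i.d standard errors. For `acres`, the standard error rises from 0.0285833 to 0.0505957, and the t-value falls from 36.2041 to 20.453.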

In this case, the larger standard errors do not impact our conclusions regarding the significance of the estimated coefficients, but this may not always be true.
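As a quick check on how the two sets of standard errors feed into inference, the short sketch below recomputes the t-values for `acres` from the estimates reported above and builds approximate 95% confidence intervals. The variable names are hypothetical, and the intervals assume a normal critical value of 1.96, which is a simplification; with few clusters a larger critical value may be more appropriate.

```
// Values copied from the reported results for acres
b_acres = 1.03483;
se_iid  = 0.0285833;
se_clu  = 0.0505957;

// t-statistics: estimate divided by its standard error
t_iid = b_acres / se_iid;
t_clu = b_acres / se_clu;

// How much larger the cluster-robust standard error is
se_ratio = se_clu / se_iid;

// Approximate 95% confidence intervals (assumes critical value 1.96)
ci_iid = (b_acres - 1.96*se_iid) ~ (b_acres + 1.96*se_iid);
ci_clu = (b_acres - 1.96*se_clu) ~ (b_acres + 1.96*se_clu);

print "t-value, i.i.d:  " t_iid;
print "t-value, cluster:" t_clu;
print "SE ratio:        " se_ratio;
print "95% CI, i.i.d:   " ci_iid;
print "95% CI, cluster: " ci_clu;
```

Because the coefficient estimate is identical in both runs, the wider cluster-robust interval comes entirely from the larger standard error.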

### Conclusions

In today's discussion of cluster-robust standard errors we have learned:

- What types of models may introduce within-cluster correlation in error terms.
- The potential impacts of ignoring within-cluster correlations in error terms.
- How to estimate cluster-robust error terms.

Code and data from this blog can be found here.

Eric has been working to build, distribute, and strengthen the GAUSS universe since 2012. He is an economist skilled in data analysis and software development. He has earned a B.A. and MSc in economics and engineering and has over 18 years of combined industry and academic experience in data analysis and research.