### Introduction

Classical linear regression estimates the mean response of the dependent variable conditional on the independent variables. In many cases, such as skewed data, multimodal data, or data with outliers, the behavior at the conditional mean fails to fully capture the patterns in the data.

In these cases, quantile regression provides a useful alternative to linear regression which:

- Can be used to study the distributional relationships of variables.
- Can help detect heteroscedasticity.
- Is useful for dealing with censored variables.
- Is more robust to outliers.

Today we will use quantile regression to analyze Major League Baseball Salary data at the 10%, 25%, 50%, 75%, and 90% quantiles. We will consider the model

$$ \ln(salary) = \beta_0 + \beta_1 AtBat + \beta_2 Hits + \beta_3 HmRun + \beta_4 Walks + \beta_5 Years + \beta_6 PutOuts $$

## The intuition of quantile regression

To understand the intuition of quantile regression, let's start with the intuition of ordinary least squares. Given the model

$$ y_i = \beta'X_i + \epsilon_i ,$$

the least squares estimate minimizes the sum of the squared error terms

$$ \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 .$$

Comparatively, quantile regression minimizes a weighted sum of the positive and negative error terms:

$$ \tau\sum_{y_i \gt \hat{\beta_{\tau}}'X_i} | y_i - \hat{\beta_{\tau}}'X_i |\ +\ (1 - \tau)\sum_{y_i \lt \hat{\beta_{\tau}}'X_i} | y_i - \hat{\beta_{\tau}}'X_i | $$

where $\tau$ is the quantile level.
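
The same objective is often written more compactly with the "check" (or pinball) loss function. The notation $\rho_\tau$ and $u$ below is introduced here for illustration and does not appear elsewhere in this post:

$$ \rho_\tau(u) = u\left(\tau - \mathbf{1}\{u \lt 0\}\right), \qquad \hat{\beta}_{\tau} = \arg\min_{\beta} \sum_{i=1}^{N} \rho_\tau\left(y_i - \beta'X_i\right) $$

A positive error $u$ contributes $\tau u$ and a negative error contributes $(1 - \tau)|u|$, which matches the two sums above. At $\tau = 0.5$ both kinds of error receive the same weight, so the problem reduces to minimizing the sum of absolute errors, which is median regression.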

If we assume that $\tau$ is equal to 0.9, we can compute the quantile regression loss for the data in the image above, like this:

$$ \begin{aligned} &\tau |d_2| + (1 - \tau)(|d_1| + |d_3|)\\ &= 0.9 \times 0.4 + 0.1 \times (|-1.3| + |-0.4|) = 0.53 \end{aligned} $$
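
The same arithmetic can be checked in **GAUSS**. This is a minimal sketch, not code from the original post: `checkLoss` is a hypothetical helper name, and the three errors are the values used in the calculation above.

```
// Hypothetical helper: weighted sum of the positive and
// negative errors 'e' at quantile level 'tau'
proc (1) = checkLoss(e, tau);
    local pos, neg;

    // Sum of the errors above the fitted line
    pos = sumc(e .* (e .> 0));

    // Sum of the absolute errors below the fitted line
    neg = sumc(abs(e) .* (e .< 0));

    retp(tau * pos + (1 - tau) * neg);
endp;

// Errors d1, d2, d3 from the worked example
e = { -1.3, 0.4, -0.4 };

// 0.9 * 0.4 + 0.1 * (1.3 + 0.4) = 0.53
print checkLoss(e, 0.9);
```

Because $\tau = 0.9$ puts nine times more weight on positive errors than on negative ones, the fitted line is pushed upward until only a small share of the points lies above it.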

Optimizing this loss function results in an estimated linear relationship between $y_i$ and $x_i$ where a portion of the data, $\tau$, lies below the line and the remaining portion of the data, $1-\tau$, lies above the line as shown in the graph below (Leeds, 2014).
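
A small simulation can make this property concrete. The sketch below is not from the original post: it fits an intercept-only model by a simple grid search on simulated standard normal data, and the seed, sample size, and grid are arbitrary choices made for illustration.

```
// Simulated data (arbitrary seed and sample size)
rndseed 2345;
y = rndn(500, 1);

// Quantile level
tau = 0.9;

// Candidate values for the constant
grid = seqa(-3, 0.01, 601);
loss = zeros(rows(grid), 1);

// Evaluate the quantile regression loss at each candidate
for i(1, rows(grid), 1);
    e = y - grid[i];
    loss[i] = sumc(e .* (tau - (e .< 0)));
endfor;

// Constant with the smallest loss
q_hat = grid[minindc(loss)];

print "Loss-minimizing constant: " q_hat;
print "Share of data below it:   " meanc(y .< q_hat);
```

The reported share should be close to `tau`, which is the intercept-only version of the statement that a portion $\tau$ of the data lies below the fitted line.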

## Estimating a quantile regression with GAUSS

Today we will use the **GAUSS** function `quantileFit` to estimate our salary model at the 10%, 25%, 50%, 75%, and 90% quantiles. This gives us insight into which factors impact salaries at the extremes of the salary distribution, as well as at the quantiles in between.

The `quantileFit` function uses formula string syntax and takes the following inputs:

- dataset: String, name of the data set.
- formula: String, the formula of the model, e.g. `"y ~ X1 + X2"`.
- tau: Optional argument, Mx1 vector of quantile levels. Default = {0.05, 0.5, 0.95}.
- w: Optional argument, Nx1 vector containing observation weights. Default = uniform weights.
- qCtl: Optional argument, an instance of the `qfitControl` structure containing members for controlling parameters of the quantile regression.
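
As a rough illustration of the formula string form, a call might look like the sketch below. It assumes the `islr_hitters.xlsx` file used later in this post is in the working directory and that the `ln(salary)` transformation is accepted inside the formula in the same way it is accepted by `loadd`. The three quantile levels are arbitrary.

```
// Quantile levels: 10%, 50% and 90%
tau = 0.10 | 0.50 | 0.90;

// Pass the dataset name and the model formula as strings
struct qfitOut qOut;
qOut = quantileFit("islr_hitters.xlsx", "ln(salary) ~ AtBat + Hits + HmRun + Walks + Years + PutOuts", tau);
```

The weights and the control structure are optional, so they are left out here.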

We will also use the `qFitControl` structure to specify variable names and set up a bootstrap for standard errors and confidence intervals:

```
// Load variables
y = loadd("islr_hitters.xlsx", "ln(salary)");
x = loadd("islr_hitters.xlsx", "AtBat + Hits + HmRun + Walks + Years + PutOuts");
/*
** Estimate the model
*/
// Set up tau for regression
tau = 0.10 | 0.25 | 0.50 | 0.75 | 0.90;
// Declare control structure
// and fill with default values
struct qfitControl qCtl;
qCtl = qfitControlCreate();
// Add variable names
qCtl.varnames = "AtBat" $| "Hits" $| "HmRun" $| "Walks" $| "Years" $| "PutOuts";
// Turn on bootstrapped confidence intervals
qCtl.bootstrap = 1000;
// Call quantileFit
struct qfitOut qOut;
qOut = quantileFit(y, x, tau, qCtl);
```
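
The results below are compared against an OLS fit of the same model. The post does not show how that fit was produced. One possible way, assuming the `olsmt` procedure with a formula string and the same data file, is sketched here:

```
// Assumption: OLS benchmark estimated with olsmt;
// this step is not shown in the original post
call olsmt("islr_hitters.xlsx", "ln(salary) ~ AtBat + Hits + HmRun + Walks + Years + PutOuts");
```

Using `call` discards the return values and relies on the printed report for the coefficients and standard errors.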

## Interpreting our results

### Coefficient estimates

| Variable | OLS | 10% | 25% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|
| Constant | 4.37*** | 3.69*** | 3.72*** | 4.078*** | 4.663*** | 5.304*** |
| | (0.133) | (0.107) | (0.105) | (0.277) | (0.157) | (0.483) |
| AtBat | -0.00258** | -0.00324** | -0.00256** | -0.00253* | -0.00173 | -0.00179 |
| | (0.001) | (0.00156) | (0.00113) | (0.00143) | (0.00124) | (0.00157) |
| Hits | 0.01366*** | 0.01811*** | 0.01576*** | 0.01503*** | 0.01106*** | 0.008907** |
| | (0.003) | (0.00597) | (0.00377) | (0.00441) | (0.00374) | (0.00384) |
| HmRun | 0.0051 | -0.00289 | 0.000219 | 0.002443 | 0.01687*** | 0.01416* |
| | (0.0054) | (0.00801) | (0.00583) | (0.00906) | (0.00605) | (0.00821) |
| Walks | 0.0071*** | 0.006536* | 0.009025*** | 0.007767** | 0.006164** | 0.007038** |
| | (0.0023) | (0.00341) | (0.00284) | (0.00365) | (0.0025) | (0.00325) |
| Years | 0.0932*** | 0.09149*** | 0.1039*** | 0.1054*** | 0.08664*** | 0.07418*** |
| | (0.008) | (0.00691) | (0.00877) | (0.0154) | (0.0143) | (0.0269) |
| PutOuts | 0.0003** | -7.322e-5 | -0.00015 | 0.000462* | 0.000398** | 0.000388** |
| | (0.0001) | (0.00019) | (0.00028) | (0.00025) | (0.0002) | (0.00018) |

The table shows that both the magnitude and the statistical significance of the coefficients on our predictors change across the quantiles.

Looking at our table alone, the most interesting results are the coefficients on `Hits` and `HmRun`. There are several notable things about these results:

- The magnitude of the impact that `Hits` has on salary decreases as we move from the 10% quantile to the 90% quantile.
- `Hits` is less statistically significant at the 90% quantile than at lower quantiles.
- `HmRun` is only statistically significant at the 75% and 90% quantiles.

This suggests that players with the highest salaries aren't necessarily paid to just hit balls but rather to hit home runs.
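
One way to see where these significance patterns come from is to divide each coefficient by the value in parentheses beneath it. The sketch below copies the `Hits` and `HmRun` numbers from the table and assumes the values in parentheses are standard errors. The post does not state which thresholds the stars correspond to, so treating ratios beyond roughly 1.645, 1.96, and 2.58 in absolute value as one, two, and three stars is an assumption based on common convention.

```
// Estimates and standard errors copied from the table,
// ordered as the 10%, 25%, 50%, 75% and 90% quantiles
b_hits  = { 0.01811, 0.01576, 0.01503, 0.01106, 0.008907 };
se_hits = { 0.00597, 0.00377, 0.00441, 0.00374, 0.00384 };

b_hr  = { -0.00289, 0.000219, 0.002443, 0.01687, 0.01416 };
se_hr = { 0.00801, 0.00583, 0.00906, 0.00605, 0.00821 };

// Ratio of each estimate to its standard error
t_hits = b_hits ./ se_hits;
t_hr = b_hr ./ se_hr;

print "Hits ratios:";
print t_hits;

print "HmRun ratios:";
print t_hr;
```

The `HmRun` ratios for the three lowest quantiles are all well below 1 in absolute value, while the ratios at the 75% and 90% quantiles are much larger, which is consistent with the stars in the table.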

### Confidence intervals

This paints a nice picture. However, it is inappropriate to draw any conclusions without first considering how statistically significant these differences are (Leeds, 2014).

The graph above provides a visualization of the difference in coefficients across the quantiles with the bootstrapped confidence intervals. It also includes the OLS estimates, which are constant across all quantiles, and their confidence intervals.

From this graph, we can see that OLS coefficients fall within the confidence intervals of the quantile regression coefficients. This implies that our quantile regression results are not statistically different from the OLS results.
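
A rough numerical version of this comparison can be made from the table alone. The sketch below builds approximate 95% intervals for the `HmRun` quantile coefficients as the estimate plus or minus 1.96 standard errors and checks whether the OLS estimate falls inside each one. This normal approximation is an assumption made for illustration and is not the bootstrapped interval shown in the graph.

```
// OLS estimate for HmRun, copied from the table
b_ols = 0.0051;

// Quantile estimates and standard errors for HmRun,
// ordered as the 10%, 25%, 50%, 75% and 90% quantiles
b_q  = { -0.00289, 0.000219, 0.002443, 0.01687, 0.01416 };
se_q = { 0.00801, 0.00583, 0.00906, 0.00605, 0.00821 };

// Approximate 95% interval (normal approximation, an assumption)
lower = b_q - 1.96 * se_q;
upper = b_q + 1.96 * se_q;

// 1 if the OLS estimate lies inside the interval, 0 otherwise
inside = (b_ols .>= lower) .and (b_ols .<= upper);

// Columns: lower bound, upper bound, inside indicator
out = lower ~ upper ~ inside;
print out;
```

An indicator of 1 in the last column means the OLS estimate cannot be distinguished from the quantile estimate at that level under this approximation.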

### Conclusions

Today we've learned the basics of quantile regression and seen an application to Major League Baseball Salary data. After today you should have a better understanding of:

- The intuition of quantile regression.
- How to estimate a quantile regression model in **GAUSS**.
- How to interpret the results from quantile regression estimates.

Code and data from this blog can be found here.

## References

Leeds, M. (2014), “Quantile Regression for Sports Economics,” *International Journal of Sport Finance*, 9, 346-359.