The Basics of Quantile Regression

by Eric · Published January 20, 2019 · Updated May 14, 2020

Introduction

Classical linear regression estimates the conditional mean of the dependent variable given the independent variables. There are many cases, such as skewed data, multimodal data, or data with outliers, where the conditional mean fails to fully capture the patterns in the data.

In these cases, quantile regression provides a useful alternative to linear regression which:

Can be used to study the distributional relationships of variables.
Can help detect heteroscedasticity.
Is useful for dealing with censored variables.
Is more robust to outliers.

Today we will use quantile regression to analyze Major League Baseball Salary data at the 10%, 25%, 50%, 75%, and 90% quantiles. We will consider the model

$$ \ln(salary) = \beta_0 + \beta_1 AtBat + \beta_2 Hits + \beta_3 HmRun + \beta_4 Walks\\ + \beta_5 Years + \beta_6 PutOuts $$

The intuition of quantile regression

To understand the intuition of quantile regression, let's start with the intuition of ordinary least squares. Given the model

$$ y_i = \beta'X_i + \epsilon_i ,$$

the least squares estimate minimizes the sum of the squared error terms

$$ \sum_{i=1}^N (y_i - \hat{y}_i)^2 .$$

Comparatively, quantile regression minimizes a weighted sum of the positive and negative error terms:

$$ \tau\sum_{y_i \gt \hat{\beta}_{\tau}'X_i} | y_i - \hat{\beta}_{\tau}'X_i |\ +\ (1 - \tau)\sum_{y_i \lt \hat{\beta}_{\tau}'X_i} | y_i - \hat{\beta}_{\tau}'X_i | $$

where $\tau$ is the quantile level.
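
The same objective is often written more compactly with the so-called check function. The symbols $\rho_{\tau}$ and $u$ below are notation introduced here for illustration and are not used elsewhere in this post:

$$ \rho_{\tau}(u) = u\,(\tau - 1\{u \lt 0\}), \qquad \hat{\beta}_{\tau} = \arg\min_{\beta} \sum_{i=1}^N \rho_{\tau}(y_i - \beta'X_i) $$

A positive error $u$ contributes $\tau u$ and a negative error contributes $(1 - \tau)|u|$, which matches the two sums above. With $\tau = 0.5$ both sides receive equal weight, so median regression minimizes the sum of absolute errors.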

Explanation of quantile regression loss.

Each orange circle represents an observation while the blue line represents the quantile regression line. The black lines illustrate the distance between the regression line and each observation, which are labelled d1, d2 and d3.

If we assume that $\tau$ is equal to 0.9, we can compute the quantile regression loss for the data in the image above, like this:

$$ \tau |d_2| + (1 - \tau)(|d_1| + |d_3|) = 0.9 \times 0.4 + 0.1 \times (|-1.3| + |-0.4|) = 0.53 $$
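
The same arithmetic can be checked in GAUSS. This is only a minimal sketch: the distances are the values used in the calculation above, and the variable names d, w and loss are chosen here for illustration.

```gauss
// Quantile level
tau = 0.9;

// Signed distances d1, d2, d3 (observation minus fitted value)
d = { -1.3, 0.4, -0.4 };

// Observations above the line get weight tau,
// observations below the line get weight (1 - tau)
w = tau * (d .> 0) + (1 - tau) * (d .< 0);

// Weighted sum of the absolute distances
loss = sumc(w .* abs(d));

print loss;
```

The printed value should agree with the hand calculation of 0.53. Changing tau shows how the penalty shifts between observations above and below the line.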

Optimizing this loss function results in an estimated linear relationship between $y_i$ and $x_i$ where a portion of the data, $\tau$, lies below the line and the remaining portion of the data, $1-\tau$, lies above the line as shown in the graph below (Leeds, 2014).

In the graph above, 90.11% of the observations are below the quantile regression line which was estimated with τ set to 0.9.
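
One way to see why roughly a share $\tau$ of the data ends up below the line is to look at a model with only an intercept, where minimizing the loss returns a sample quantile. The sketch below uses made-up data (the integers 1 through 11) and a simple search over the observed values rather than the estimation routine used by quantileFit. All variable names are hypothetical.

```gauss
// Hypothetical sample: the integers 1 through 11
y = seqa(1, 1, 11);
tau = 0.9;

// Try each observed value as a candidate intercept
grid = y;
loss = zeros(rows(grid), 1);

for i(1, rows(grid), 1);
    // Errors for this candidate intercept
    u = y - grid[i];

    // Check loss: weight tau on positive errors, (1 - tau) on negative errors
    loss[i] = sumc(u .* (tau - (u .< 0)));
endfor;

// Candidate with the smallest loss
b0 = grid[minindc(loss)];

// Share of observations at or below the fitted intercept
share = meanc(y .<= b0);

print b0;
print share;
```

For this sample the loss is smallest at an intercept of 10, which leaves 10 of the 11 observations (about 91%) at or below the fitted value.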

Estimating a quantile regression with GAUSS

Today we will use the GAUSS function quantileFit to estimate our salary model at the 10%, 25%, 50%, 75%, and 90% quantiles. This gives us insight into which factors affect salaries at the extremes of the salary distribution, as well as at the quantiles in between.

The quantileFit function accepts formula string syntax, in which case it takes the following inputs:

dataset: String, name of data set.
formula: String, the formula of the model, e.g., "y ~ X1 + X2".
tau: Optional argument, Mx1 vector, quantile levels. Default = {0.05, 0.5, 0.95}.
w: Optional argument, Nx1 vector, containing observation weights. Default = uniform weights.
qCtl: Optional argument, an instance of the qfitControl structure containing members for controlling parameters of the quantile regression.
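
As a rough sketch of the formula string inputs listed above, a call might look like the following. It assumes the data file and variable names used in this post, and it leaves w and qCtl at their defaults.

```gauss
// Quantile levels
tau = 0.10 | 0.25 | 0.50 | 0.75 | 0.90;

// Dataset name and model formula are both passed as strings
struct qfitOut qOut;
qOut = quantileFit("islr_hitters.xlsx", "ln(salary) ~ AtBat + Hits + HmRun + Walks + Years + PutOuts", tau);
```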

In the example below, we load the data into matrices and pass them to quantileFit directly. We will also use the qfitControl structure to specify variable names and set up a bootstrap for standard errors and confidence intervals:

// Load variables
y = loadd("islr_hitters.xlsx", "ln(salary)");
x = loadd("islr_hitters.xlsx", "AtBat + Hits + HmRun + Walks + Years + PutOuts");

/*
** Estimate the model
*/

// Set up tau for regression
tau = 0.10 | 0.25 | 0.50 | 0.75 | 0.90;

// Declare control structure
// and fill with default values
struct qfitControl qCtl;
qCtl = qfitControlCreate();

// Add variable names
qCtl.varnames = "AtBat" $| "Hits" $| "HmRun" $| "Walks" $| "Years" $| "PutOuts";

// Turn on bootstrapped confidence intervals
qCtl.bootstrap = 1000;

// Call quantileFit
struct qfitOut qOut;
qOut = quantileFit(y, x, tau, qCtl);

Interpreting our results

Coefficient estimates

Variable	OLS	10%	25%	50%	75%	90%
Constant	4.37***	3.69***	3.72***	4.078***	4.663***	5.304***
	(0.133)	(0.107)	(0.105)	(0.277)	(0.157)	(0.483)
AtBat	-0.00258**	-0.00324**	-0.00256**	-0.00253*	-0.00173	-0.00179
	(0.001)	(0.00156)	(0.00113)	(0.00143)	(0.00124)	(0.00157)
Hits	0.01366***	0.01811***	0.01576***	0.01503***	0.01106***	0.008907**
	(0.003)	(0.00597)	(0.00377)	(0.00441)	(0.00374)	(0.00384)
HmRun	0.0051	-0.00289	0.000219	0.002443	0.01687***	0.01416*
	(0.0054)	(0.00801)	(0.00583)	(0.00906)	(0.00605)	(0.00821)
Walks	0.0071***	0.006536*	0.009025***	0.007767**	0.006164**	0.007038**
	(0.0023)	(0.00341)	(0.00284)	(0.00365)	(0.0025)	(0.00325)
Years	0.0932***	0.09149***	0.1039***	0.1054***	0.08664***	0.07418***
	(0.008)	(0.00691)	(0.00877)	(0.0154)	(0.0143)	(0.0269)
PutOuts	0.0003**	-7.322e-5	-0.00015	0.000462*	0.000398**	0.000388**
	(0.0001)	(0.00019)	(0.00028)	(0.00025)	(0.0002)	(0.00018)

We can see in the table of our results that both the magnitude and the statistical significance of the coefficients on our predictors change across the quantiles.

Looking at our table alone, the most interesting results are the coefficients on Hits and HmRun. There are several notable things about these results:

The magnitude of the impact that Hits has on salary decreases as we move from the 10% quantile to the 90% quantile.
Hits is less statistically significant at the 90% quantile than at the lower quantiles.
HmRun is only statistically significant for the 75% and 90% quantiles.

This suggests that players with the highest salaries aren't necessarily paid to just hit balls but rather to hit home runs.
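
Because the dependent variable is ln(salary), a coefficient $\hat{\beta}$ corresponds to a change of roughly $100(e^{\hat{\beta}} - 1)$ percent in the corresponding conditional quantile of salary for a one-unit increase in the predictor, holding the other predictors fixed. This standard log-linear conversion is added here for illustration and is not part of the estimation output. Applied to three of the table values:

$$ \text{Hits, 10\%:}\quad 100\,(e^{0.01811} - 1) \approx 1.83 \\ \text{Hits, 90\%:}\quad 100\,(e^{0.008907} - 1) \approx 0.89 \\ \text{HmRun, 75\%:}\quad 100\,(e^{0.01687} - 1) \approx 1.70 $$

So one additional hit is associated with about a 1.8% higher salary at the 10% quantile but only about 0.9% at the 90% quantile, while one additional home run is associated with about 1.7% at the 75% quantile.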

Confidence intervals

This paints a nice picture. However, it is inappropriate to make any conclusions without first considering how statistically significant these differences are (Leeds, 2014).

The quantile regression parameters and confidence intervals are in orange. The blue lines represent the OLS coefficient estimates and 95% confidence interval.

The graph above provides a visualization of the difference in coefficients across the quantiles with the bootstrapped confidence intervals. It also includes the OLS estimates, which are constant across all quantiles, and their confidence intervals.

From this graph, we can see that OLS coefficients fall within the confidence intervals of the quantile regression coefficients. This implies that our quantile regression results are not statistically different from the OLS results.
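
A rough numerical version of this comparison can be made from the table alone. The sketch below builds an approximate 95% interval for the Hits coefficient at the 10% quantile as the estimate plus or minus 1.96 standard errors. This normal approximation is an assumption made here for illustration, so it will not exactly match the bootstrapped intervals in the graph. The variable names are hypothetical.

```gauss
// Hits coefficient and standard error at the 10% quantile (from the table)
b = 0.01811;
se = 0.00597;

// OLS coefficient on Hits (from the table)
b_ols = 0.01366;

// Approximate 95% interval, assuming normality (an assumption, see above)
ci_lo = b - 1.96 * se;
ci_hi = b + 1.96 * se;

// 1 if the OLS estimate lies inside the interval, 0 otherwise
inside = (b_ols >= ci_lo) and (b_ols <= ci_hi);

print ci_lo;
print ci_hi;
print inside;
```

The interval runs from roughly 0.0064 to 0.0298, which contains the OLS estimate of 0.01366.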

Conclusions

Today we've learned the basics of quantile regression and seen an application to Major League Baseball Salary data. After today you should have a better understanding of:

The intuition of quantile regression.
How to estimate a quantile regression model in GAUSS.
How to interpret the results from quantile regression estimates.

Code and data from this blog can be found here.

References

Leeds, M. (2014). “Quantile Regression for Sports Economics,” International Journal of Sport Finance, 9, 346-359.