# Diagnosing a singular matrix

## Introduction

G0121: Matrix not positive definite and G0048: Matrix singular are common errors encountered during estimation. Today we will reproduce one of them while computing OLS estimates and then track down its cause, using real data from golf shots hit by this author and recorded by a launch monitor.

## The data

Our dataset, golf_ballflight.csv, contains 46 observations with the following variables:

• club_speed - The speed of the clubhead at impact.
• ball_speed - The initial speed of the golf ball after impact.
• launch - The angle at which the ball takes off relative to the ground in degrees.
• back_spin - The spin around the horizontal axis.
• carry - The distance the ball travels in the air in yards.

## The model

To keep things simple, we will not estimate a constant term.

$$carry = \beta_1 club\_speed + \beta_2 ball\_speed + \beta_3 launch + \beta_4 back\_spin$$

## The code

fname = "golf_ballflight.csv";

// Load all variables except 'carry'
X = loadd(fname, ". -carry");

// Load 'carry'
y = loadd(fname, "carry");

// Compute least squares estimates
XTX = X'X;
XXI = invpd(XTX);
b_hat = XXI * X'y;

Running the code above returns the error G0121: Matrix not positive definite from line 11: XXI = invpd(XTX); If the columns of X were all linearly independent vectors, X'X would be a positive definite matrix, so the error tells us that something is wrong with X.
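The link between linear dependence and this error can be made explicit. As a short sketch (the vector $c$ is introduced here for illustration only), take any non-zero vector $c$ with as many elements as X has columns:

$$c'X'Xc = (Xc)'(Xc) = \lVert Xc \rVert^2 \geq 0$$

The quadratic form is zero only when $Xc = 0$. If the columns of X are linearly independent, no non-zero $c$ satisfies $Xc = 0$, so $c'X'Xc > 0$ for every non-zero $c$ and X'X is positive definite. If one column can be written as a combination of the others, some non-zero $c$ gives $Xc = 0$, the quadratic form reaches zero, and X'X is only positive semidefinite. In that case invpd, which expects a positive definite matrix, cannot proceed.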

## Step 1: Check the data

Since the result of X'X is not positive definite, the problem lies either in X itself or in how XTX was computed from it. The first thing we should do is examine the data to make sure it was loaded correctly.

In this case, after a thorough examination, the data appears as we would expect. It matches our CSV dataset and we do not see any infinities or missing values.
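Part of this examination can be done in code. The sketch below is one possible set of checks, not the only one; it assumes X and y are already in memory from the code above, and the printed labels are purely illustrative. Comparing the loaded values against the CSV file by eye is still worthwhile.

```gauss
// Sketch of basic sanity checks; assumes X and y were loaded as above

// Dimensions: we expect 46 rows and 4 columns in X
print "rows and columns of X:" rows(X) cols(X);

// 1 if any element is an infinity, NaN, or missing value, 0 otherwise
print "bad values in X:" isinfnanmiss(X);
print "bad values in y:" isinfnanmiss(y);

// Column minimums and maximums, to spot implausible values
print "column minimums:" minc(X)';
print "column maximums:" maxc(X)';
```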

## Step 2: Check for linear dependence

Now that we have verified that our data was loaded correctly, we need to check our data for linear dependencies.

We can use the GAUSS function qre. This function returns a pivoted R matrix from the QR decomposition. Elements on the diagonal of R that are nearly zero relative to the other diagonal elements indicate a linearly dependent column.

// Compute the pivoted R matrix
// and the permutation vector, 'P'
{ R, P } = qre(X);

print P~diag(R);

Running the code above will produce the following output:

4.000       -27522.553
2.000        269.89232
3.000       -24.417453
1.000   -1.1589374e-13 

The right column contains the elements from the diagonal of the R matrix. The permutation vector in the left column shows us which column of the original matrix each diagonal element of R corresponds to.

The number -1.1589374e-13 to the right of 1.000 is zero up to rounding error. It tells us that the first column of X is linearly dependent on one or more of the other columns in X.
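With many columns, reading the diagonal by eye becomes tedious. The sketch below shows one way to flag suspicious columns automatically. The relative tolerance of 1e-10 and the variable names d, tol and dep_cols are assumptions made for this illustration, not GAUSS defaults.

```gauss
// Sketch: flag near-zero diagonal elements of R relative to the largest one
{ R, P } = qre(X);

// Absolute values of the diagonal of R
d = abs(diag(R));

// Assumed relative tolerance; adjust for your data
tol = 1e-10 * maxc(d);

// Column indices of X whose diagonal element falls below 'tol'
dep_cols = selif(P, d .< tol);

print "columns of X flagged as dependent:" dep_cols';
```

With the diagonal shown above, only -1.1589374e-13 falls below this threshold, so the only index selected would be 1. If no element falls below the threshold, selif should return a scalar missing value rather than an index.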

### Which column is $X_1$ linearly dependent on?

We can find which column $X_1$ is linearly dependent on by performing a regression with parts of the R matrix.

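| P | R[.,1] | R[.,2] | R[.,3] | R[.,4] |
|---|---|---|---|---|
| 4 | -27522.56 | -676.7 | -102.46 | -466.7 |
| 2 | 0 | 269.91 | 0.23 | 186.13 |
| 3 | 0 | 0 | -24.42 | -6.82e-15 |
| 1 | 0 | 0 | 0 | -1.159e-13 |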

The dependent variable for this regression will be the first three elements of the fourth column of R. The independent variables will be the first three rows of the first three columns of R. Note that the final row of R is not included: its only element on or above the diagonal is the near-zero value -1.159e-13, so it is effectively a row of zeros.

Non-zero parameters in this regression will indicate a linear dependence.
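The reason this works can be sketched in a few lines. The notation is introduced only for this illustration: $X_P$ is X with its columns reordered according to P, $X_{P_j}$ is its $j$-th column, $Q$ is the orthogonal factor of the decomposition, and subscripts such as $1{:}3$ select rows or columns.

$$X_P = QR \quad\Rightarrow\quad \begin{bmatrix} X_{P_1} & X_{P_2} & X_{P_3} \end{bmatrix} = Q_{\cdot,1:3}\, R_{1:3,1:3}, \qquad X_{P_4} = Q_{\cdot,1:3}\, R_{1:3,4} + Q_{\cdot,4}\, R_{4,4}$$

Because $R_{4,4} \approx 0$, the last term drops out. If $b$ solves $R_{1:3,1:3}\, b = R_{1:3,4}$, then

$$X_{P_4} \approx Q_{\cdot,1:3}\, R_{1:3,1:3}\, b = \begin{bmatrix} X_{P_1} & X_{P_2} & X_{P_3} \end{bmatrix} b$$

so the elements of $b$ are the weights that reproduce the dependent column from the other three.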

// Create variables corresponding to table above
X_r = R[1:3,1:3];
y_r = R[1:3,4];

// Perform regression
b_r = y_r / X_r;

After running the above code, we see that b_r is equal to:

3.4695e-18
0.68965517
2.7935e-16

As we can see, the second parameter is non-zero. Looking at the permutation vector, P, we see that the second column of R corresponds to the second column of X. Therefore, the first and second columns of our X matrix are linearly dependent on each other.
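This conclusion can be confirmed directly on the original data. The sketch below assumes X and b_r are still in memory from the code above and that the second column of X contains no zeros; the variable name ratio is introduced only for this illustration.

```gauss
// Sketch: confirm that column 1 is a constant multiple of column 2
// Assumes X is loaded as above and X[., 2] has no zero elements
ratio = X[., 1] ./ X[., 2];

// If the columns are exactly proportional, these two values will agree
print "smallest ratio:" minc(ratio);
print "largest ratio: " maxc(ratio);

// Largest absolute deviation from the fitted relationship
print "max abs residual:" maxc(abs(X[., 1] - b_r[2] * X[., 2]));
```

If both ratios are close to 0.68965517, then column 1 is 0.68965517 times column 2, or equivalently column 2 is about 1.45 times column 1, since 1/0.68965517 is approximately 1.45.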

## Next steps

We now know that we got the error G0121: Matrix not positive definite because our first two variables, ball_speed and club_speed, are linearly dependent on each other. What are our options?

### Collect more data

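In some cases, collecting more observations will resolve the problem. This helps when the dependence is an accident of a small sample, such as too few observations or a variable that happens not to vary. It will not help when one variable is computed from another, because every new observation will follow the same relationship.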

### Drop one of the variables

If we cannot collect more data, or that does not resolve our problem, we may have to drop one of the variables. In many cases, it will not matter which variable we choose to drop.

In this particular case, some research indicated that the ball launch monitor used to collect the data measures ball speed, but estimates club speed by dividing ball speed by a constant factor, known as efficiency or smash factor. Therefore, we should choose to remove club speed.

## Final model

Our new code looks like this:

fname = "golf_ballflight.csv";

// Load all variables except 'carry' and 'club_speed'
X = loadd(fname, ". -carry -club_speed");

// Load 'carry'
y = loadd(fname, "carry");

// Compute least squares estimates
XTX = X'X;
XXI = invpd(XTX);
b_hat = XXI * X'y;

and runs without error.
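As an optional cross-check on the result, the estimates can be compared against those from a QR-based solver. The sketch below uses the GAUSS function olsqr and assumes X, y and b_hat are in memory from the code above; the name b_check is introduced only for this illustration.

```gauss
// Sketch: cross-check b_hat against a QR-based least squares solution
b_check = olsqr(y, X);

// Print both sets of estimates side by side
print b_hat~b_check;

// Largest absolute difference between the two sets of estimates
print "max abs difference:" maxc(abs(b_hat - b_check));
```

A difference that is tiny relative to the size of the estimates suggests the normal equations are now being solved reliably.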

## Conclusions

Great job working through to the end! Today we have learned:

1. Errors G0121: Matrix not positive definite and G0048: Matrix singular can be caused by linear dependencies or bad data.
2. How to diagnose linear dependencies with the function qre.

Code and data from this blog can be found here.