# Applications of Principal Components Analysis in Finance

### Introduction

Principal components analysis (PCA) is a useful tool that can help practitioners streamline data with minimal loss of information. In today’s blog, we’ll examine the use of principal components analysis in finance with an empirical example.

Specifically, we’ll look more closely at:

• What PCA is.
• How PCA works.
• How to use the GAUSS Machine Learning library to perform PCA.
• How to interpret PCA results.

## What is Principal Components Analysis?

Principal components analysis (PCA) is an unsupervised learning method that results in a low-dimensional representation of a dataset. The intuition behind PCA is that the most important information is drawn from the features by eliminating redundancy and noise. The resulting dataset captures the most interesting components of the data.

### PCA Snapshot

• Uses linear transformations to capture the most important characteristics of a set of features.
• Uses the variance of the features to distinguish relevant features from pure noise.
• Identifies and removes redundancy in features.

## How Do We Find Principal Components?

Principal components are found by identifying the normalized, linear combination of features

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p$$

which has the largest variance.

The coefficients $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ are referred to as the loadings and are restricted such that their sum of squares is equal to one.

To compute the first principal component we:

1. Center our feature data to have a mean of zero.
2. Find loadings that result with the largest sample variance, subject to the constraint that $\sum_{j=1}^p \phi_{j,1}^2 = 1$.

Once the first principal component is found, we can find a second principal component, $Z_2$, which is constrained to be uncorrelated with $Z_1$.
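The recipe above can be sketched numerically. The following Python snippet (an illustrative analogue with made-up data; the example later in this post uses GAUSS) recovers the first principal component as the top eigenvector of the sample covariance matrix and checks the unit sum-of-squares constraint on the loadings:

```python
import numpy as np

# Illustrative data: 200 observations of 4 features, two of them correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 1] + 0.9 * X[:, 0]  # induce correlation

# Step 1: center each feature to have mean zero
Xc = X - X.mean(axis=0)

# Step 2: the loadings of the first principal component are the
# eigenvector of the sample covariance matrix with the largest eigenvalue
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
phi1 = eigvecs[:, -1]                   # loadings phi_11, ..., phi_p1

# The loadings satisfy the constraint: sum of squares ~ 1
print(np.sum(phi1**2))

# First principal component scores Z_1
Z1 = Xc @ phi1
```

The variance of `Z1` equals the top eigenvalue, which is the largest variance achievable by any unit-norm linear combination of the centered features.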

## When Should You Use PCA?

The most common use of PCA is to reduce the size of a feature set without losing too much information. The reduced feature set can then be used in a second stage of modeling. However, this is not the only use of PCA, and there are a number of insightful ways it can be applied.

### Real World Applications of PCA

• Reducing the size of images. PCA can be used to reduce the size of an image without significantly impacting the quality. Beyond just reducing the size, this is useful for image classification algorithms.
• Visualizing multidimensional data. PCA allows us to represent the information contained in multidimensional data in reduced dimensions which are more compatible with visualization.
• Finding patterns in high-dimensional datasets. Examining the relationships between principal components and original features can help uncover patterns in the data that are harder to identify in our full dataset.
• Stock price prediction in finance. Many models of stock price prediction rely on estimating covariance matrices. However, this can be difficult with high-dimensional data. PCA can be used for data reduction to help remedy this issue.
• Dataset reduction in healthcare models. Healthcare models use high-dimensional datasets because there are many factors that influence healthcare outcomes. PCA provides a method to reduce the dimensionality while still capturing the relevant variance.

## Empirical Example

Let's take a look at principal components analysis in action! We'll build on the PCA application to US Treasury bills and bonds from *Introductory Econometrics for Finance* by Chris Brooks.

Our initial dataset includes 6 variables capturing short-term and long-term yields on U.S. bonds and bills:

| Variable | Description |
|----------|-------------|
| GS3M | Market yield on 3-month US Treasury bill. |
| GS6M | Market yield on 6-month US Treasury bill. |
| GS1 | Market yield on 1-year US Treasury bond. |
| GS3 | Market yield on 3-year US Treasury bond. |
| GS5 | Market yield on 5-year US Treasury bond. |
| GS10 | Market yield on 10-year US Treasury bond. |

This data can be directly imported into GAUSS from the FRED database.

```
/*
** Import U.S. bond and bill data
** directly from FRED
*/
// Set the observation_start and observation_end parameters
// to use all data from 1990-01-01 through 2023-03-01
params_cpi = fred_set("observation_start", "1990-01-01", "observation_end", "2023-03-01");

data = fred_load("GS3M + GS6M + GS1 + GS3 + GS5 + GS10", params_cpi);

// Reorder data to match the organization in the original example
data = order(data, "date"$|"GS3M"$|"GS6M"$|"GS1"$|"GS3"$|"GS5"$|"GS10");

// Preview the first 5 rows
head(data);
```

The data preview printed to the Command Window helps verify that our data has loaded correctly:

```
            date        GS3M        GS6M         GS1         GS3         GS5        GS10
      1990-01-01        7.90        7.96        7.92        8.13        8.12        8.21
      1990-02-01        8.00        8.12        8.11        8.39        8.42        8.47
      1990-03-01        8.17        8.28        8.35        8.63        8.60        8.59
      1990-04-01        8.04        8.27        8.40        8.78        8.77        8.79
      1990-05-01        8.01        8.19        8.32        8.69        8.74        8.76
```

### Normalizing Yields

Following Brooks' example, we will normalize the yields to have a mean of zero and a standard deviation of one using the rescale procedure.

```
/*
** Normalizing the yield
*/
// Create a dataframe that contains
// the yields, but not the 'Date' variable
yields = delcols(data, "date");

// Standardize the yields using rescale
{ yields_norm, location, scale_factor } = rescale(yields, "standardize");

head(yields_norm);
```

This prints a preview of our normalized yields:

```
     GS3M         GS6M          GS1          GS3          GS5         GS10
2.3153725    2.2469720    2.1773318    2.0802078    2.0025703    1.9626705
2.3591880    2.3159905    2.2593350    2.1936833    2.1395985    2.0916968
2.4336745    2.3850090    2.3629181    2.2984298    2.2218155    2.1512474
2.3767142    2.3806953    2.3844979    2.3638964    2.2994648    2.2504985
2.3635696    2.3461861    2.3499702    2.3246164    2.2857620    2.2356108
```
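The standardization itself is simple to reproduce. A NumPy sketch (with made-up numbers standing in for the FRED yields, and using the sample standard deviation; whether rescale uses the sample or population version is a detail we gloss over here) shows the same location/scale transformation:

```python
import numpy as np

# Made-up yields standing in for the FRED series (not actual data)
yields = np.array([
    [7.90, 7.96, 7.92],
    [8.00, 8.12, 8.11],
    [8.17, 8.28, 8.35],
    [8.04, 8.27, 8.40],
    [8.01, 8.19, 8.32],
])

# Standardize: subtract the column means (location) and divide by the
# sample standard deviations (scale factor)
location = yields.mean(axis=0)
scale_factor = yields.std(axis=0, ddof=1)
yields_norm = (yields - location) / scale_factor

# Each column now has mean ~0 and standard deviation 1
print(yields_norm.mean(axis=0))
print(yields_norm.std(axis=0, ddof=1))
```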

### Fitting the PCA Model

Next, we will use the pcaFit procedure from GML to fit our principal components analysis model.

The pcaFit procedure requires two inputs: a data matrix and the number of components to compute.

```
struct pcaModel mdl;
mdl = pcaFit(x, n_components);
```

• x: $N \times P$ matrix, feature data to be reduced.
• n_components: Scalar, the number of components to compute.

The pcaFit procedure stores all output in a pcaModel structure. The most relevant members of the pcaModel structure include:

• mdl.singular_values: $n_{components} \times 1$ vector, the largest singular values of x. Equal to the square roots of the eigenvalues.
• mdl.components: $P \times n_{components}$ matrix, the principal component vectors, which represent the directions of greatest variance. Also known as the factor loadings.
• mdl.explained_variance_ratio: $n_{components} \times 1$ vector, the proportion of variance explained by each of the returned component vectors.

```
/*
** Perform PCA on normalized yields
*/
// Specify number of components
n_components = 6;

// pcaModel structure for holding
// output from model
struct pcaModel mdl;
mdl = pcaFit(yields_norm, n_components);
```
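For comparison outside of GAUSS, here is a rough scikit-learn analogue of this fit. This is illustrative only: the synthetic data is a stand-in for the standardized yields, and the member names differ slightly from the pcaModel structure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized yields: 399 observations of
# 6 series driven mostly by one common "level" factor
rng = np.random.default_rng(7)
level = rng.normal(size=(399, 1))
yields_norm = level + 0.1 * rng.normal(size=(399, 6))

# Fit a 6-component PCA model
mdl = PCA(n_components=6)
mdl.fit(yields_norm)

# Rough counterparts of the pcaModel members described above
print(mdl.singular_values_)           # largest singular values
print(mdl.components_.shape)          # loadings, one row per component
print(mdl.explained_variance_ratio_)  # proportion of variance per component
```

One difference worth noting: scikit-learn stores the loadings as an $n_{components} \times P$ matrix, the transpose of the $P \times n_{components}$ layout used by GML.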

## Dissecting Results

After running the pcaFit procedure, results are printed to the Command Window. These results include:

• A general summary of the model.
• The proportion of variance explained by each component.

### General Summary

The general summary provides basic information about the model setup, including the number of variables in the original data and the number of components found.

```
==================================================
Model:                                         PCA
Number observations:                           399
Number variables:                                6
Number components:                               6
==================================================
```

### Proportion of Variance

The proportion of variance table tells us how much of the total variance in the data is described by each principal component.

```
Component       Proportion     Cumulative
               of Variance     Proportion
PC1                  0.960          0.960
PC2                  0.038          0.997
PC3                  0.002          1.000
PC4                  0.000          1.000
PC5                  0.000          1.000
PC6                  0.000          1.000
```

For the Treasury bills and bonds yields, the first component captures 96.0% of the total variance, while the first three components explain nearly all of the total variance. If our goal was data reduction for use in a later model, this is quite promising. We could capture 96% of the variance of all 6 of our original variables using just the first principal component.
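The two columns of the table are easy to reconstruct: each component's proportion of variance is its squared singular value over the sum of squared singular values, and the cumulative column is a running sum. A short sketch with illustrative (not actual) singular values:

```python
import numpy as np

# Illustrative singular values (not the actual values from the model)
s = np.array([48.0, 9.5, 2.2, 0.9, 0.5, 0.3])

# Proportion of variance: squared singular value over the total
proportion = s**2 / np.sum(s**2)

# Cumulative proportion: running sum down the table
cumulative = np.cumsum(proportion)

print(np.round(proportion, 3))
print(np.round(cumulative, 3))
```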

```
===========================================================================
Principal
components            PC1       PC2       PC3       PC4       PC5       PC6
===========================================================================
GS3M              -0.4079    0.4111    0.4863   -0.5416    0.3029    0.2076
GS6M              -0.4094    0.3883    0.1535    0.2221   -0.5448   -0.5585
GS1               -0.4122    0.2970   -0.2404    0.6120    0.1557    0.5342
GS3               -0.4154   -0.0855   -0.5911   -0.1926    0.4744   -0.4567
GS5               -0.4102   -0.3607   -0.2806   -0.3932   -0.5725    0.3750
GS10              -0.3939   -0.6742    0.5040    0.3020    0.1856   -0.1024
```


The factor loadings indicate how much each of the variables contributes to the component. As noted in the Brooks example, they also offer some insight into the yield curve:

• PC1: All maturities have the same sign and a similar magnitude. Captures changes in the level, or parallel shifts, of the yield curve.
• PC2: Short-term and long-term maturities have opposing signs, moving in opposite directions. Captures changes in the slope, or the steepening/flattening, of the yield curve.
• PC3: The shortest and longest-term maturities have the same sign, while the middle maturities have the opposite sign. Reflects changes in the curvature of the curve.

## Transforming Original Data

After fitting the PCA model, we can use the results to transform our original data into its principal components using the pcaTransform procedure.

```
// Transform original data
x_trans = pcaTransform(yields_norm, mdl);
```
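Under the hood, this transform is just a projection of the centered data onto the loading vectors. A NumPy sketch of the same idea (illustrative; pcaTransform handles this for us in GAUSS), including a check that the resulting components are mutually uncorrelated:

```python
import numpy as np

# Illustrative centered data: 100 observations of 6 features
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
Xc = X - X.mean(axis=0)

# Loadings: eigenvectors of the sample covariance matrix,
# reordered so the direction of greatest variance comes first
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
components = eigvecs[:, ::-1]  # P x n_components

# The transform: project the centered data onto the loadings
x_trans = Xc @ components

# The resulting principal components are mutually uncorrelated
corr = np.corrcoef(x_trans, rowvar=False)
print(np.round(corr, 6))
```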

Since the first three components capture most of the variation in our data, let's look at them in a plot:

If you're familiar with U.S. interest rates, this plot likely seems to contradict what we observe in the real world. As we said earlier, the first principal component represents the overall level of interest rates. However, our plot of the first principal component shows an overall upward trend through 2022, followed by a sharp downturn, exactly the opposite of the overall trend in U.S. interest rates.

This highlights an important feature of PCA: the sign of the factor loadings is arbitrary.

The signs can all be flipped without any change to our analysis. For example, if we multiply all our factor loadings by -1, our principal components look like:
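This sign-flip invariance is easy to verify numerically: negating the loadings negates the component scores while leaving their variance untouched. A minimal NumPy sketch with illustrative data:

```python
import numpy as np

# Illustrative centered data
rng = np.random.default_rng(1)
Xc = rng.normal(size=(50, 3))
Xc = Xc - Xc.mean(axis=0)

# First principal component loadings (top eigenvector)
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
phi = eigvecs[:, -1]

z_pos = Xc @ phi     # scores under one sign convention
z_neg = Xc @ (-phi)  # scores with every loading negated

# The scores flip sign, but their variance is unchanged
print(np.var(z_pos, ddof=1), np.var(z_neg, ddof=1))
```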

### Conclusion

In today's blog, we've seen that PCA is a powerful data analysis tool with uses beyond data reduction. We've also explored how to use the GAUSS Machine Learning library to fit a PCA model and transform data.