©Hal Gurgenci 2001
Principal Components Analysis is a well-established method of reducing the dimensionality of multi-channel data. Reduction in dimensionality may lead to one or both of the following results:
In a machine monitoring context, circumstances in which one needs to use PCA may include the following:
The starting point for PCA is a set of observations in a multi-dimensional space. This multi-dimensional space is defined by our measurement channels. For example, if we are measuring m variables, the space will be m-dimensional. Each measurement is represented by a point in this space. The purpose of PCA is to introduce a new set of m orthogonal axes in such a way that our original data will show the highest variance along principal axis #1, the second highest variance along principal axis #2, and so on, with the least variance being shown along principal axis #m. These axes are referred to as principal component axes or simply as principal components.
Each principal component is a linear combination of the original variables. The principal components are orthogonal to each other, so they do not contain redundant information. The number of principal components is equal to the number of the original variables. However, usually most of the variance of the original data is explained by the first few principal components and the rest can be ignored. If this is the case, then using principal components reduces a large multi-dimensional set of data into data along a few coordinates, making it more amenable to visual inspection, clustering, and pattern recognition efforts. It is important to realise that principal components analysis would only help us if there are dependencies amongst the measured variables. If all variables were completely independent of each other, then we would not gain any reduction of complexity by doing a principal components analysis.
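As a rough illustration, here is a minimal Python sketch (not part of the original notes; the data are synthetic and it uses the eigenvalue recipe derived later in these notes) in which three measurement channels are all driven by one underlying quantity, so almost all of the variance is carried by the first principal component:

import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + 0.1 * rng.normal(size=n)    # nearly a scaled copy of x1
x3 = -x1 + 0.1 * rng.normal(size=n)         # also driven by x1
Y = np.column_stack([x1, x2, x3])           # n observations of m = 3 variables

S = np.cov(Y, rowvar=False)                 # m x m sample covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]       # variances along the principal axes
print(eigvals / eigvals.sum())              # the first fraction is close to 1

If the three channels were instead independent, the three fractions would be comparable and nothing would be gained by the transformation.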
We will see that the principal components can be found by computing the eigenvalues and eigenvectors of the sample covariance matrix of the original data. Before we do this, let us look at a simple example.
Pressure and temperature measurements are taken off a pressure vessel within which a chemical reaction is taking place. We plot them on T-p axes as shown below. The correlation between p and T is obvious.
The principal axes for these data are plotted on the same figure as V1 and V2. In the next section, we will learn how to find the orientation of the principal axes.
In this two-dimensional example, the principal axes are obtained by rotating the original T-p axes. The origin of the principal axes is the sample mean, $(\bar{T}, \bar{p})$. The original data can be plotted on the new V1-V2 axes by using the formula for transformation of coordinates involving rotation and translation:

$$v_1 = (T - \bar{T})\cos\theta + (p - \bar{p})\sin\theta$$
$$v_2 = -(T - \bar{T})\sin\theta + (p - \bar{p})\cos\theta$$

where $\theta$ is the angle between V1 and T.
Two things can be observed about the principal axes in the above figure:
a. The variance of the data along the axis V1 is higher than along the axis V2. In other words, the data show the highest variance along the first principal axis.
b. If the vectors v1 and v2 are the principal-axis transformations of the original T and p data, then v1 and v2 are not correlated. (Both observations are checked numerically in the sketch below.)
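Both observations can be verified with a minimal numerical sketch (my own illustration, using hypothetical T-p data; the angle $\theta$ is taken from the leading eigenvector of the covariance matrix, anticipating the result of the next section):

import numpy as np

rng = np.random.default_rng(1)
T = 300.0 + 10.0 * rng.normal(size=200)                        # hypothetical temperatures
p = 5.0 + 0.02 * (T - 300.0) + 0.05 * rng.normal(size=200)     # correlated pressures

Y = np.column_stack([T, p])
ybar = Y.mean(axis=0)                        # sample mean = origin of the V1-V2 axes
S = np.cov(Y, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)
a1 = eigvecs[:, np.argmax(eigvals)]          # direction of the first principal axis V1
theta = np.arctan2(a1[1], a1[0])             # angle between V1 and the T axis

# rotation + translation into the principal axes
v1 = (T - ybar[0]) * np.cos(theta) + (p - ybar[1]) * np.sin(theta)
v2 = -(T - ybar[0]) * np.sin(theta) + (p - ybar[1]) * np.cos(theta)

print(np.var(v1, ddof=1), np.var(v2, ddof=1))   # (a) variance along V1 is much higher
print(np.corrcoef(v1, v2)[0, 1])                # (b) v1 and v2 are uncorrelated (close to 0)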
The above equations can be expressed in the compact form

$$v_i = \mathbf{a}_i^T (\mathbf{y} - \bar{\mathbf{y}}), \qquad i = 1, \ldots, m$$

where m = 2 and the vectors y and a are

$$\mathbf{y} = \begin{bmatrix} T \\ p \end{bmatrix}, \qquad \mathbf{a}_1 = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}, \qquad \mathbf{a}_2 = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix}$$
We should also define a few terms using this example; they will help you to follow the derivation in the next section.
The observation vector: $\mathbf{y} = [T \;\; p]^T$
Its mean value: $\bar{\mathbf{y}} = [\bar{T} \;\; \bar{p}]^T$
The sample covariance matrix: $\mathbf{S} = \dfrac{1}{n-1} \sum_{k=1}^{n} (\mathbf{y}_k - \bar{\mathbf{y}})(\mathbf{y}_k - \bar{\mathbf{y}})^T$, where n is the number of observations and $\mathbf{y}_k$ is the k-th observation vector.
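A short sketch (not from the original notes) builds the sample covariance matrix directly from this definition, using a handful of hypothetical T-p observations, and checks it against numpy's built-in routine:

import numpy as np

# hypothetical T-p observations, one observation vector per row
Y = np.array([[300.1, 5.02],
              [301.4, 5.05],
              [299.2, 4.99],
              [302.0, 5.07],
              [298.7, 4.97]])

n = Y.shape[0]
ybar = Y.mean(axis=0)                 # mean observation vector
D = Y - ybar                          # deviations from the mean
S = D.T @ D / (n - 1)                 # sum of (y_k - ybar)(y_k - ybar)^T over n - 1

print(S)
print(np.allclose(S, np.cov(Y, rowvar=False)))   # agrees with numpy's covariance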
In the next section we will find out how to calculate the vectors $\mathbf{a}_i$ for any multi-dimensional (m > 1) data set.
DERIVATION OF PRINCIPAL COMPONENTS
In this section we will derive a relation for the vectors $\mathbf{a}_i$ which relate the principal axes to the original observation vector as follows:

$$v_i = \mathbf{a}_i^T (\mathbf{y} - \bar{\mathbf{y}}), \qquad i = 1, \ldots, m$$
We also need a normalising constraint, because otherwise the variance could be increased simply by scaling the $\mathbf{a}_i$ vectors up arbitrarily. We will use the following normalising constraint:

$$\mathbf{a}_i^T \mathbf{a}_i = 1$$
The first principal axis is obtained by inserting i = 1:

$$v_1 = \mathbf{a}_1^T (\mathbf{y} - \bar{\mathbf{y}})$$
The first principal axis must have the maximum variance by definition.
Find the variance of V1
The variance of any variable x, observed n times, is defined as

$$\mathrm{Var}(x) = \frac{1}{n-1} \sum_{k=1}^{n} (x_k - \bar{x})^2$$

By substituting $x = v_1$,

$$\mathrm{Var}(v_1) = \frac{1}{n-1} \sum_{k=1}^{n} (v_{1k} - \bar{v}_1)^2$$

The mean of $v_1$ is found as

$$\bar{v}_1 = \mathbf{a}_1^T (\bar{\mathbf{y}} - \bar{\mathbf{y}}) = 0$$

The variance of $v_1$ is then

$$\mathrm{Var}(v_1) = \frac{1}{n-1} \sum_{k=1}^{n} \left\{ \mathbf{a}_1^T (\mathbf{y}_k - \bar{\mathbf{y}}) \right\}^2$$
The term within the curly brackets is a scalar. Try it with a 2-dimensional vector y if you do not believe me.
A scalar is a 1 x 1 matrix and it is equal to its transpose. Therefore, the curly-bracketed squared term above can be written as

$$\left\{ \mathbf{a}_1^T (\mathbf{y}_k - \bar{\mathbf{y}}) \right\}^2 = \mathbf{a}_1^T (\mathbf{y}_k - \bar{\mathbf{y}}) (\mathbf{y}_k - \bar{\mathbf{y}})^T \mathbf{a}_1$$
We want to write it this way because the product between the $\mathbf{a}_1$ terms has a special meaning. It is related to the sample covariance matrix, which is defined as

$$\mathbf{S} = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{y}_k - \bar{\mathbf{y}}) (\mathbf{y}_k - \bar{\mathbf{y}})^T$$
Therefore, the variance of V1 is related to the sample covariance matrix as

$$\mathrm{Var}(v_1) = \mathbf{a}_1^T \mathbf{S} \mathbf{a}_1$$
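This identity is easy to check numerically; here is a quick sketch (my own, with arbitrary synthetic data and an arbitrary unit vector a1):

import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(100, 3))              # 100 observations of 3 variables
S = np.cov(Y, rowvar=False)

a1 = np.array([0.5, -1.0, 2.0])
a1 = a1 / np.linalg.norm(a1)               # enforce the normalising constraint

v1 = (Y - Y.mean(axis=0)) @ a1             # projection v1 = a1^T (y - ybar)
print(np.var(v1, ddof=1), a1 @ S @ a1)     # the two numbers agree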
Find the a1 that will maximise the variance of V1
We want to maximise our objective function

$$\mathbf{a}_1^T \mathbf{S} \mathbf{a}_1$$

subject to the normalising constraint

$$\mathbf{a}_1^T \mathbf{a}_1 - 1 = 0$$

Using the Lagrange multiplier method, this can be converted to the problem of maximising:

$$L(\mathbf{a}_1, \lambda) = \mathbf{a}_1^T \mathbf{S} \mathbf{a}_1 - \lambda (\mathbf{a}_1^T \mathbf{a}_1 - 1)$$
The extremum can be found by taking the derivative with respect to the vector $\mathbf{a}_1$ and setting it equal to zero.
The derivative of a scalar function $h = h(\mathbf{x})$ with respect to the vector $\mathbf{x}$ is defined as

$$\frac{\partial h}{\partial \mathbf{x}} = \left[ \frac{\partial h}{\partial x_1} \;\; \frac{\partial h}{\partial x_2} \;\; \cdots \;\; \frac{\partial h}{\partial x_m} \right]^T$$

Also, the derivative of the quadratic term $\mathbf{a}_1^T \mathbf{S} \mathbf{a}_1$ is $2 \mathbf{S} \mathbf{a}_1$, and the derivative of $\lambda \mathbf{a}_1^T \mathbf{a}_1$ is $2 \lambda \mathbf{a}_1$. Therefore the maximisation problem becomes

$$\mathbf{S} \mathbf{a}_1 = \lambda \mathbf{a}_1$$
This is the standard eigenvalue problem. A non-trivial solution exists only if $\lambda$ is an eigenvalue of $\mathbf{S}$. $\mathbf{S}$ is an m x m matrix and it has m eigenvalues, $\lambda_1, \lambda_2, \ldots, \lambda_m$,
and a set of corresponding eigenvectors, $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_m$. If we order the eigenvalues in descending order, then the first eigenvector gives the direction cosines of the first principal component axis and the first eigenvalue is the variance along this axis, and so on. In other words,

$$\mathrm{Var}(v_i) = \mathbf{a}_i^T \mathbf{S} \mathbf{a}_i = \lambda_i, \qquad i = 1, \ldots, m$$
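A minimal Python sketch of this recipe (my own illustration, not code from the original notes): the eigenvectors of S give the principal axis directions and the eigenvalues give the variances along them.

import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # four correlated channels
S = np.cov(Y, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]          # sort into descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

V = (Y - Y.mean(axis=0)) @ eigvecs         # data expressed on the principal axes
print(eigvals)                             # lambda_1 >= lambda_2 >= ... >= lambda_m
print(np.var(V, axis=0, ddof=1))           # variances along the axes match the eigenvalues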
Incidentally, the total variance is conserved. In other words, the sum of the variances of the principal variables is equal to the sum of the variances of the original variables, i.e.

$$\sum_{i=1}^{m} \lambda_i = \sum_{i=1}^{m} s_{ii} = \mathrm{tr}(\mathbf{S})$$

Thus the contribution of any eigenvalue to the total variance is $\lambda_i / \mathrm{tr}(\mathbf{S})$.
Here tr is the trace operator and represents the sum of the diagonal elements of the matrix.
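A small check of this conservation property (again my own sketch, using an arbitrary symmetric covariance matrix):

import numpy as np

S = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])            # any valid covariance matrix will do

eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
print(np.isclose(eigvals.sum(), np.trace(S)))   # total variance is conserved
print(eigvals / np.trace(S))                    # contribution of each eigenvalue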
Lagrange's Multiplier Method
The following can be found in almost all textbooks on advanced mathematics, e.g. Advanced Calculus for Applications by F. B. Hildebrand:
If the aim is to maximise a function f subject to the constraint g = 0, then this is equivalent to optimising the auxiliary function

$$f - \lambda g$$

where $\lambda$ is an unknown constant called the Lagrange multiplier.
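As a quick illustration (my own example, not taken from Hildebrand): to maximise $f(x, y) = xy$ subject to $g = x + y - 1 = 0$, form $xy - \lambda(x + y - 1)$. Setting the partial derivatives with respect to x and y to zero gives $y = \lambda$ and $x = \lambda$, and the constraint then gives $x = y = 1/2$, i.e. a maximum of $f = 1/4$.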