Principal Component Analysis (PCA)

Dear all,

I am faced with the small task of identifying, from a set of 20 parameters extracted from a series of events, which parameters are dominant in describing the variation between events, and which tend to be irrelevant.

For that I looked through the Scatter Plot Matrix Demo, which was a neat overview of data correlation, but what I'm looking for is an analytical measure that I can apply to several datasets in which we've perturbed our system, so that I can make comparisons across these datasets. From what I've seen, PCA is what I should be looking at.

In the meantime I've checked the excellent primer on PCA by Jonathon Shlens, and I got one of the books referenced in Igor's help section: "Factor Analysis in Chemistry" by E. R. Malinowski (3rd edition), which I have begun reading.

Since I'm new to PCA, I have the impression this might take some time. If anyone has experience using PCA in Igor to identify the two main components in this kind of analysis, please leave me your thoughts.

Many thanks,

R.

Hello R.,

You may want to take a quick look at the PCA demo (File Menu->Example Experiments->Analysis->PCADemo). Note that Igor also has an ICA operation, but PCA is probably what you want.

A.G.

WaveMetrics, Inc.

Hey A.G.,

Thanks for the note. Yes, I did have a look at that example, and also tried to run my data through it.

Please have a look at the image attached.

What I did so far was the following:

  • I've input my data as shown in the top table. My parameter names are in the wave called Parameters, and waves wave1 through wave42 are my events, for each of which I have measured 18 parameters.
  • I called the PCA procedure and edited it so that it would pick up my data.
  • Using the PCA Demo Control, I ran the PCA as specified in the procedure and got the table shown in the lower right.

To me, it seems I might have a lot of redundancy in my data. Is this a fair conclusion? However, I am still not sure how to go from the observation that some parameters are more or less redundant to identifying exactly which parameters they are. Could you give me some insight?

Many thanks!!

R.

PCA_Igor_01.jpg

Hello R.,

If I only look at the eigenvalues, it seems that you have only two significant factors that together account for 99.8% of the variance.
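
In case it is useful: that percentage is just each eigenvalue divided by the sum of all of them. Assuming your eigenvalues ended up in a 1D wave (here I call it eigVals; substitute whatever name the demo actually produced), a one-liner such as

MatrixOP/O fracVar = eigVals/sum(eigVals)

gives the fraction of the total variance explained by each component, and summing the top entries reproduces the 99.8%.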

I think it is important to note that even if you determine that there are only two important factors, these may not necessarily map directly to your input data.  Since I am not familiar with the details of your application, let me try to use an example from physics: suppose you have a set of waves containing an xyz distribution of points in space, you compute the PCA, and you find that 99.85% of the variation is explained by two factors.  That would imply that your distribution is pretty much planar.  To determine the orientation of this plane in 3D space you need to look at the first two eigenvectors of the solution.
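
If you want to see this analogy in action, here is a small self-contained sketch you can paste into the procedure window. All wave names here are placeholders, and I am assuming MatrixEigenV's output names (W_eigenValues, M_eigenVectors); check its help entry if your version differs:

Function PlanarPCAExample()
	// 200 synthetic xyz points lying close to a plane
	Make/O/N=(200,3) pts
	pts[][0] = gnoise(1)
	pts[][1] = gnoise(1)
	pts[][2] = 0.5*pts[p][0] - 0.2*pts[p][1] + gnoise(0.01)

	// covariance matrix of the three coordinates
	MatrixOP/O covMat = (subtractMean(pts,1)^t x subtractMean(pts,1))/(numRows(pts)-1)

	// eigen-decomposition of the symmetric covariance matrix
	MatrixEigenV/SYM/EVEC covMat
	WAVE W_eigenValues	// with /SYM these typically come out in ascending order
	MatrixOP/O fracVar = W_eigenValues/sum(W_eigenValues)
	Print fracVar[0], fracVar[1], fracVar[2]	// one fraction is ~0, the other two dominate
End

The out-of-plane eigenvalue is tiny because the out-of-plane noise is tiny, and the two eigenvectors belonging to the large eigenvalues span the plane.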

If you carry this analogy over to your application, you might want to look at the eigenvectors alongside your inputs.  In complicated cases it may be helpful to compute the projections (dot products) of your input vectors onto the first two eigenvectors so you get a sense of what's meaningful and what is noise.

A.G.

Hi A.G.,

Many thanks for the input. Indeed, the events I included in this analysis are from a single "run" of measurements (i.e. they refer to a single subject being observed). Essentially, I am looking at the same subject at different instances, so the fact that I have two significant factors accounting for almost 99.85% of the data variance likely suggests that whenever the subject is stimulated, the reaction tends to be fairly homogeneous with respect to most of the parameters analyzed. This is something I anticipated by looking at normalized distributions for each of the parameters. The situation will be different when I feed PCA averages of different runs of the experiment, in which I include different subjects. So, to me, these preliminary results seem to be going the right way.

The part I am struggling with most right now is converting this variance information into PCA scores for each of the parameters. Would you be able to help me with this part?

Many thanks,

Ricardo

 

Hello Ricardo,

My approach is to think about the eigenvectors as a set of orthonormal vectors that span your data space.  In your example, we determined based on the data that we only care about two eigenvectors (say e0 and e1).  To represent your data using the two new axes you need to compute the dot product (or projection) of your parameter columns with the first two eigenvectors.  For this to make sense you need to "standardize" your parameter columns by subtracting their mean and dividing by their standard deviation. 

If you denote the standardized columns by P_i, then you will effectively be creating a new two-component representation of your data as P_i ≈ c0i*e0 + c1i*e1, where the coefficients c0i and c1i are the projections of P_i on the respective eigenvectors.

You can use MatrixOP to perform these calculations.  For example, to standardize a 1D parameter wave you can execute

MatrixOP/O P_0=normalize(subtractMean(w,1))	// subtract the mean of w, then scale to unit length

Note that normalize divides by the vector's norm rather than by its standard deviation; for a mean-subtracted column of n points these differ only by the constant factor sqrt(n-1), so the relative scores across parameters are unaffected.

To calculate the projection:

MatrixOP/O c00=e0.P_0	// projection of the standardized column on e0
MatrixOP/O c10=e1.P_0	// projection of the standardized column on e1

which gives you {c00,c10} as the representation of the first parameter column in the new space.
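
If you want the scores for all of your parameter columns at once, you can wrap the same two steps in a loop. This is just a sketch with assumed names: it presumes your data sits in a single events-by-parameters matrix called dataMat, and that e0 and e1 hold the first two eigenvectors; adjust the names to match your experiment:

Function ProjectAllParams(dataMat, e0, e1)
	Wave dataMat	// rows = events, columns = parameters
	Wave e0, e1	// the two dominant eigenvectors

	Variable nParams = DimSize(dataMat, 1)
	Make/O/N=(nParams) scoreE0, scoreE1	// one pair of scores per parameter
	Variable i
	for(i = 0; i < nParams; i += 1)
		// standardize column i: subtract its mean, scale to unit length
		MatrixOP/O stdCol = normalize(subtractMean(col(dataMat, i), 1))
		// project the standardized column on each eigenvector
		MatrixOP/O proj0 = e0.stdCol
		MatrixOP/O proj1 = e1.stdCol
		scoreE0[i] = proj0[0]
		scoreE1[i] = proj1[0]
	endfor
	KillWaves/Z stdCol, proj0, proj1
End

After running it, something like Edit Parameters, scoreE0, scoreE1 puts the scores next to your parameter names, so you can see at a glance which parameters load heavily on the two components and which contribute mostly noise.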

I hope this helps,

 

A.G.