Tutorial:Principal Component Analysis





Principal Component Analysis is useful for reducing and interpreting large multivariate data sets with underlying linear structures, and for discovering previously unsuspected relationships.

We will start with data measuring protein consumption in twenty-five European countries for nine food groups. Using Principal Component Analysis, we will examine the relationship between protein sources and these European countries.

Minimum Origin Version Required: Origin 8.6 SR0

Selecting Principal Methods

To determine the number of principal components to be retained, we should first run Principal Component Analysis and then proceed based on its result:

  1. Open a new project or a new workbook. Import the data file \samples\Statistics\Protein Consumption in Europe.dat
  2. Select the entire worksheet and then select Statistics: Multivariate Analysis: Principal Component Analysis.
  3. Accept the default settings in the open dialog box and click OK.
  4. Select sheet PCA1.
  5. In the Eigenvalues of the Correlation Matrix table, we can see that the first four principal components explain 86% of the variance and the remaining components each contribute 5% or less. We will keep four principal components.
  6. A scree plot can be a useful visual aid for determining the appropriate number of principal components. The number of components to retain is indicated by the "elbow" point, beyond which the remaining eigenvalues are relatively small and all about the same size. The elbow is not very pronounced in this scree plot, but the fourth point is a reasonable choice.
    Image:Pca scree plot.png
  7. Click the lock icon Image:Icon_Recalculate_Manual_Green.png in the results tree and select Change Parameters in the context menu. Set Number of Components to Extract to 4. Do not close the dialog; in the next steps, we will request component plots.
    Image:Pca ex1 dialog1.png
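
The reasoning in steps 5 and 6 can be sketched outside Origin. The following is a minimal Python illustration, not Origin's own procedure. The eigenvalues are made up for the example (they are not the values from the PCA1 sheet), and the helper names `explained_variance` and `components_needed` are hypothetical. For a correlation matrix, the eigenvalues sum to the number of variables, so each eigenvalue divided by that sum is the proportion of variance explained by its component.

```python
# Hypothetical eigenvalues of a 9 x 9 correlation matrix, in decreasing order.
# These are illustrative numbers only, NOT the values reported by Origin.
eigenvalues = [4.0, 1.6, 1.1, 0.9, 0.5, 0.4, 0.25, 0.15, 0.1]


def explained_variance(eigs):
    """Return (proportions, cumulative proportions) of variance explained."""
    total = sum(eigs)
    proportions = [e / total for e in eigs]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


def components_needed(eigs, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    _, cumulative = explained_variance(eigs)
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold - 1e-12:  # small tolerance for rounding error
            return k
    return len(eigs)


proportions, cumulative = explained_variance(eigenvalues)
for k, (e, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"PC{k}: eigenvalue={e:.2f}  proportion={p:.1%}  cumulative={c:.1%}")

print("Components needed for at least 80%:", components_needed(eigenvalues, 0.80))
```

A fixed threshold such as 80% is only one possible rule; the scree plot "elbow" described above is another, and the two need not agree.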

Request Principal Component Plots

In the Plots branch of the dialog, users can choose whether to create a scree plot, component plots, or both.

  • Scree Plot
    The scree plot is a useful visual aid for determining an appropriate number of principal components.
  • Component Plot
    Component plots show the component score of each observation or component loading of each variable for a pair of principal components. In the Select Principal Components to Plot group, users can specify which pair of components to plot. The component plots include:
    • Loading Plot
      The loading plot shows the relationship between the original variables and the subspace dimensions. It is used to interpret relationships between variables.
    • Score Plot
      The score plot is a projection of the data onto the subspace. It is used to interpret relationships between observations.
    • BiPlot
      The biplot shows both the loadings and the scores for two selected components in a single plot.
  1. In the dialog that was opened in the preceding steps, open the Plots branch. Make sure Scree Plot, Loading Plot, and Biplot are selected.
  2. The first two components usually account for the bulk of the variance, so we will draw the component plots in the space of the first two principal components. In the Select Principal Components to Plot group, set Principal Component for X Axis to 1, and set Principal Component for Y Axis to 2. Click OK.
    Image:Pca ex1 dialog2.png
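
To make the idea of a score as a projection concrete, here is a minimal Python sketch for the simplest possible case of two variables. The data are made up for illustration and are not taken from the protein data set. With two standardized variables and correlation r, the correlation matrix is [[1, r], [r, 1]], and its eigenvalues and eigenvectors are known in closed form, so no linear algebra library is needed. Origin's own computation is not shown here; this only illustrates the general technique.

```python
import math
import statistics

# Made-up measurements for two variables (illustrative only, not the tutorial data).
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]


def standardize(values):
    """Return z-scores using the sample standard deviation (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


zx, zy = standardize(x), standardize(y)

# Pearson correlation of the two variables, computed from the z-scores.
r = sum(a * b for a, b in zip(zx, zy)) / (len(zx) - 1)

# The 2 x 2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r
# with unit eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The component with
# eigenvalue 1 + r is listed first because r is positive for this data.
s = 1.0 / math.sqrt(2.0)
eigenvalues = [1.0 + r, 1.0 - r]
eigenvectors = [(s, s), (s, -s)]

# Scores: each standardized observation projected onto each eigenvector.
scores = [
    tuple(v[0] * a + v[1] * b for v in eigenvectors)
    for a, b in zip(zx, zy)
]

print(f"r = {r:.3f}, eigenvalues = {eigenvalues[0]:.3f}, {eigenvalues[1]:.3f}")
for (x_val, y_val), (pc1, pc2) in zip(zip(x, y), scores):
    print(f"x={x_val:5.1f} y={y_val:4.1f}  PC1={pc1:+.3f}  PC2={pc2:+.3f}")
```

The pairs in `scores` are what a score plot would draw, one point per observation, while the eigenvector coefficients are what a loading plot is built from, one point per variable. Some packages rescale loadings by the square root of the eigenvalue; the unscaled coefficients are used in this sketch.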

Interpreting The Results

  1. In the Correlation Matrix, we can see that the variables are highly correlated. Many values are greater than 0.3. Principal Component Analysis is an appropriate tool for removing the collinearity.
    Image:Pca ex1 correlation matrix.png
  2. The principal components are defined as linear combinations of the original variables. The Extracted Eigenvectors table provides the coefficients for these equations.
    Image:Pca ex1 extracted eigenvectors.png
    PC1 = 0.30261 * RedMeat + 0.31056 * WhiteMeat + 0.42668 * Eggs + 0.37773 * Milk + 0.13565 * Fish - 0.43774 * Cereals + 0.29725 * Starch - 0.42033 * Nuts - 0.11042 * FruitsVegetables
    PC2 = - 0.05625 * RedMeat - 0.23685 * WhiteMeat - 0.03534 * Eggs - 0.18459 * Milk + 0.64682 * Fish - 0.23349 * Cereals + 0.35283 * Starch + 0.14331 * Nuts + 0.53619 * FruitsVegetables
    PC3 = - 0.29758 * RedMeat + 0.6239 * WhiteMeat + 0.18153 * Eggs - 0.38566 * Milk - 0.32127 * Fish + 0.09592 * Cereals + 0.24298 * Starch - 0.05439 * Nuts + 0.40756 * FruitsVegetables
    PC4 = 0.64648 * RedMeat - 0.03699 * WhiteMeat + 0.31316 * Eggs - 0.00332 * Milk - 0.21596 * Fish - 0.0062 * Cereals - 0.33668 * Starch + 0.33029 * Nuts + 0.46206 * FruitsVegetables
  3. The Loading Plot reveals the relationships between variables in the space of the first two components. In the loading plot, we can see that Red Meat, Eggs, Milk, and White Meat have similar heavy loadings for principal component 1. Fish, fruit, and vegetables, however, have similar heavy loadings for principal component 2.
    Image:Pca ex1 loading plot.png
  4. The biplot shows both the loadings and the scores for two selected components in a single plot. The score points show the projection of each observation onto the subspace, and their positions relative to the loadings reveal relationships between observations and variables in the subspace of the first two components. (Note: Double-click the graph to open and customize it.)
  5. Use the Data Reader tool Image:Button_Data_Reader.png to open the Data Info window and examine the plot in greater detail. We can see that Spain and Portugal's protein sources differ from those of other European countries. Spain and Portugal rely on fruits and vegetables, while eastern European countries such as Albania, Bulgaria, Yugoslavia, and Romania prefer cereals and nuts.
    Image:Pca ex1 biplot.png
To display country information in the Data Info window, as in the image above:
  1. Right-click the Data Info window and select Preferences....
  2. In the Rows tab, move Country from the left panel to the right. Click OK.
    Image:Pca data info settings.png
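
The principal component equations listed under Interpreting The Results can be evaluated by hand or in a few lines of code. The following Python sketch uses the PC1 and PC2 coefficients quoted above. The observation `z` is hypothetical and is not taken from the data file, and the sketch assumes the inputs are standardized values (z-scores), as is usual when the analysis is based on the correlation matrix. It also checks two general properties of eigenvectors that the quoted coefficients should satisfy up to rounding: each has unit length, and different eigenvectors are orthogonal.

```python
variables = ["RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
             "Cereals", "Starch", "Nuts", "FruitsVegetables"]

# Coefficients copied from the PC1 and PC2 equations in this tutorial.
pc1 = [0.30261, 0.31056, 0.42668, 0.37773, 0.13565,
       -0.43774, 0.29725, -0.42033, -0.11042]
pc2 = [-0.05625, -0.23685, -0.03534, -0.18459, 0.64682,
       -0.23349, 0.35283, 0.14331, 0.53619]


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


# Hypothetical standardized values (z-scores) for one country.
# These numbers are invented for illustration, NOT read from the data file.
z = [0.5, -0.2, 0.1, 0.3, 1.5, -0.8, 0.4, -0.1, 1.2]

score1 = dot(pc1, z)  # position of this observation along PC1
score2 = dot(pc2, z)  # position of this observation along PC2
print(f"PC1 score = {score1:.4f}, PC2 score = {score2:.4f}")

# Sanity checks on the quoted coefficients (values are rounded to 5 decimals).
print(f"|PC1|^2 = {dot(pc1, pc1):.5f}")
print(f"|PC2|^2 = {dot(pc2, pc2):.5f}")
print(f"PC1 . PC2 = {dot(pc1, pc2):.5f}")
```

The pair (`score1`, `score2`) is the kind of point drawn for each country in the score plot and biplot.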