Tutorial:Principal Component Analysis
Summary
Principal Component Analysis is useful for reducing and interpreting large multivariate data sets with underlying linear structures, and for discovering previously unsuspected relationships.
We will start with data measuring protein consumption in twenty-five European countries for nine food groups. Using Principal Component Analysis, we will examine the relationship between protein sources and these European countries.
Minimum Origin Version Required: Origin 8.6 SR0
Selecting Principal Methods
To determine the number of principal components to be retained, we should first run Principal Component Analysis and then proceed based on its result:
- Open a new project or a new workbook. Import the data file \samples\Statistics\Protein Consumption in Europe.dat
- Select the entire worksheet and then select Statistics: Multivariate Analysis: Principal Component Analysis.
- Accept the default settings in the dialog box that opens and click OK.
- Select sheet PCA1.
- In the Eigenvalues of the Correlation Matrix table, we can see that the first four principal components explain 86% of the variance and the remaining components each contribute 5% or less. We will keep four principal components.
- A scree plot can be a useful visual aid for determining the appropriate number of principal components. The number of components to retain is indicated by the "elbow" point, beyond which the remaining eigenvalues are relatively small and all about the same size. This point is not very evident in this scree plot, but we can still treat the fourth point as our "elbow" point.
- Click the lock icon in the results tree and select Change Parameters in the context menu. Set Number of Components to Extract to 4. Do not close the dialog; in the next steps, we will request the component plots.
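The eigenvalue table and the retention decision described above can be sketched outside Origin. The Python code below is an illustration only, not Origin's implementation: it assumes NumPy is available, uses synthetic data in place of the protein data file, and the 85% threshold and the names `X`, `threshold`, and `n_keep` are assumptions chosen for the example.

```python
import numpy as np

# Synthetic stand-in for a 25 x 9 data table (NOT the protein data set).
rng = np.random.default_rng(0)
latent = rng.normal(size=(25, 3))
mixing = rng.normal(size=(3, 9))
X = latent @ mixing + 0.5 * rng.normal(size=(25, 9))

# PCA on the correlation matrix: eigenvalues sorted from largest to smallest.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Proportion and cumulative proportion of variance explained.
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

# Smallest number of components whose cumulative proportion reaches the
# (assumed) threshold.
threshold = 0.85
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

for i, (ev, p, c) in enumerate(zip(eigenvalues, proportion, cumulative), start=1):
    print(f"PC{i}: eigenvalue={ev:.3f}  variance={p:.1%}  cumulative={c:.1%}")
print("components to keep:", n_keep)
```

A fixed threshold is only one possible rule; the scree plot "elbow" described above is a visual judgment that may suggest a different number.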
Request Principal Component Plots
In the Plots branch of the dialog, users can choose whether to create a scree plot, component plots, or both.
- Scree Plot
- The scree plot is a useful visual aid for determining an appropriate number of principal components.
- Component Plot
- Component plots show the component score of each observation or component loading of each variable for a pair of principal components. In the Select Principal Components to Plot group, users can specify which pair of components to plot. The component plots include:
- Loading Plot
- The loading plot is a plot of the relationship between the original variables and the subspace dimension. It is used to interpret relationships between variables.
- Score Plot
- The score plot is a projection of the data onto the subspace. It is used to interpret relationships between observations.
- Biplot
- The biplot shows both the loadings and the scores for two selected components in parallel.
- In the dialog that was opened in the preceding steps, open the Plots branch. Make sure Scree Plot, Loading Plot, and Biplot are selected.
- The first two components usually account for most of the variance, so we will draw the component plots in the space of the first two principal components. In the Select Principal Components to Plot group, set Principal Component for X Axis to 1, and set Principal Component for Y Axis to 2. Click OK.
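As a rough illustration of what the loading plot and score plot contain, the sketch below computes both sets of coordinates for a chosen pair of components. It is not Origin's code: it assumes NumPy, uses synthetic data rather than the protein data file, and takes the loadings to be the eigenvector coefficients of the correlation matrix (some tools scale loadings differently). The names `pc_x` and `pc_y` are hypothetical and simply mirror the X Axis and Y Axis settings.

```python
import numpy as np

# Synthetic stand-in for a 25 x 9 data table (NOT the protein data set).
rng = np.random.default_rng(1)
latent = rng.normal(size=(25, 3))
mixing = rng.normal(size=(3, 9))
X = latent @ mixing + 0.5 * rng.normal(size=(25, 9))

# Standardize each variable: mean 0, sample standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, sorted by decreasing eigenvalue.
corr = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Pair of components to plot (1-based, like the dialog settings).
pc_x, pc_y = 1, 2

# Loading plot coordinates: one point per variable, shape (9, 2).
loadings = eigenvectors[:, [pc_x - 1, pc_y - 1]]

# Score plot coordinates: one point per observation, shape (25, 2).
scores = Z @ loadings

print("loadings shape:", loadings.shape)
print("scores shape:", scores.shape)
```

A biplot would draw both arrays on the same axes, typically the scores as points and the loadings as vectors from the origin.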
Interpreting The Results
- In the Correlation Matrix, we can see that the variables are highly correlated. Many values are greater than 0.3. Principal Component Analysis is an appropriate tool for removing the collinearity.
- The principal component variables are defined as linear combinations of the original variables. The Extracted Eigenvectors table provides the coefficients for the following equations:
- PC1 = 0.30261 * RedMeat + 0.31056 * WhiteMeat + 0.42668 * Eggs + 0.37773 * Milk + 0.13565 * Fish - 0.43774 * Cereals + 0.29725 * Starch - 0.42033 * Nuts - 0.11042 * FruitsVegetables
- PC2 = - 0.05625 * RedMeat - 0.23685 * WhiteMeat - 0.03534 * Eggs - 0.18459 * Milk + 0.64682 * Fish - 0.23349 * Cereals + 0.35283 * Starch + 0.14331 * Nuts + 0.53619 * FruitsVegetables
- PC3 = - 0.29758 * RedMeat + 0.6239 * WhiteMeat + 0.18153 * Eggs - 0.38566 * Milk - 0.32127 * Fish + 0.09592 * Cereals + 0.24298 * Starch - 0.05439 * Nuts + 0.40756 * FruitsVegetables
- PC4 = 0.64648 * RedMeat - 0.03699 * WhiteMeat + 0.31316 * Eggs - 0.00332 * Milk - 0.21596 * Fish - 0.0062 * Cereals - 0.33668 * Starch + 0.33029 * Nuts + 0.46206 * FruitsVegetables
- The Loading Plot reveals the relationships between variables in the space of the first two components. In the loading plot, we can see that Red Meat, Eggs, Milk, and White Meat have similar heavy loadings for principal component 1. Fish, fruit, and vegetables, however, have similar heavy loadings for principal component 2.
- The biplot shows both the loadings and the scores for two selected components in the same graph. It shows the projection of each observation onto the subspace as score points, and it also reveals the relationships between observations and variables in the subspace of the first two components. (Note: Double-click the graph to open and customize it.)
- Use the Data Reader tool to open the Data Info window and examine the plot in greater detail. We can see that Spain and Portugal's protein sources differ from those of other European countries. Spain and Portugal rely on fruits and vegetables, while eastern European countries such as Albania, Bulgaria, Yugoslavia, and Romania prefer cereals and nuts.
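The PC equations listed in this section can be applied to a single observation as dot products. The sketch below copies the PC1 and PC2 coefficients from those equations, checks that the two eigenvectors are approximately unit length and orthogonal, and evaluates both components for one observation. It assumes NumPy and assumes the inputs are standardized values, as is usual when the analysis is based on the correlation matrix; the vector `z` is hypothetical and is not taken from the data file.

```python
import numpy as np

variables = ["RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
             "Cereals", "Starch", "Nuts", "FruitsVegetables"]

# Coefficients copied from the PC1 and PC2 equations in this section.
pc1 = np.array([0.30261, 0.31056, 0.42668, 0.37773, 0.13565,
                -0.43774, 0.29725, -0.42033, -0.11042])
pc2 = np.array([-0.05625, -0.23685, -0.03534, -0.18459, 0.64682,
                -0.23349, 0.35283, 0.14331, 0.53619])

# Eigenvectors are expected to be unit length and mutually orthogonal
# (up to the rounding of the printed coefficients).
norm_pc1 = float(pc1 @ pc1)
norm_pc2 = float(pc2 @ pc2)
dot_12 = float(pc1 @ pc2)
print(f"|PC1|^2={norm_pc1:.4f}  |PC2|^2={norm_pc2:.4f}  PC1.PC2={dot_12:.4f}")

# Hypothetical standardized observation (z-scores), NOT from the data file.
z = np.array([0.5, -1.0, 0.2, 0.8, 1.5, -0.7, 0.3, -0.4, 1.1])

# Each component score is the linear combination given by its equation.
score = np.array([pc1 @ z, pc2 @ z])
print("PC1 score:", round(float(score[0]), 4))
print("PC2 score:", round(float(score[1]), 4))
```

The pair of values in `score` corresponds to where such an observation would fall in a score plot or biplot of the first two components.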
To display country information in the Data Info window, as in the image above: