Principal Components in Statistical Analysis

Adapted from the original: Introduction to Principal Component Analysis

Introduction to Principal Component Analysis


The sheer size of data in the modern age is not only a challenge for computer hardware but also a major bottleneck for the performance of many machine learning algorithms. Principal Component Analysis (PCA) is a statistical method used to reduce the number of variables in a data set. Its main goal is to identify patterns in data, and it does this by detecting correlation between variables. Reducing the dimensionality only makes sense if strong correlations between variables exist, because PCA works by lumping highly correlated variables together. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.
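As a rough illustration of spotting highly correlated variables before reducing dimensionality, here is a minimal sketch using NumPy. The data set, the 0.9 threshold, and the variable names are all made up for this example and are not part of the original article.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 200 samples, 5 variables.
# Columns 1 and 2 are noisy copies of column 0; columns 3 and 4 are unrelated.
base = rng.normal(size=200)
data = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=200),
    2.0 * base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])

# Correlation matrix between variables (columns), hence rowvar=False.
corr = np.corrcoef(data, rowvar=False)

# Collect variable pairs whose absolute correlation exceeds an
# illustrative threshold (0.9 is an assumption, not a rule).
threshold = 0.9
n_vars = corr.shape[0]
strong_pairs = [
    (i, j)
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
    if abs(corr[i, j]) > threshold
]
print(strong_pairs)
```

Pairs that show up in such a list are candidates for being summarized by a single component.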

Basic Statistics

The entire subject of statistics is based around the idea that you have a big set of data, and you want to analyse that set in terms of the relationships between the individual points in it. I am going to look at a few of the measures you can compute on a set of data, and what they tell you about the data itself.

How does this work? Let’s use some example data. Imagine we have gone into the world and collected some 2-dimensional data: say, we have asked a bunch of students how many hours in total they spent studying, and the mark that they received. So we have two dimensions: the first is the hours studied, and the second is the mark received. Covariance measures how much these two dimensions vary from their means with respect to each other. So what does it tell us? The exact value of the covariance is not as important as its sign (i.e. positive or negative). If the value is positive, then that indicates that both dimensions increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.

If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, then that would have said the opposite: as the number of hours of study increased, the final mark decreased. In the last case, if the covariance is zero, it indicates that there is no linear relationship between the two dimensions. They are uncorrelated, although not necessarily independent of each other.
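To make the sign of the covariance concrete, here is a small sketch in NumPy. The hours and marks are invented for illustration, and `sample_covariance` is a helper name introduced here, not something from the original article.

```python
import numpy as np

# Hypothetical survey data (made up for illustration).
hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
marks = np.array([50.0, 55.0, 65.0, 70.0, 85.0])


def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)


# Positive: marks tend to rise as hours rise.
cov_hm = sample_covariance(hours, marks)
print(cov_hm)

# NumPy's built-in covariance matrix; the off-diagonal entry is the same quantity.
print(np.cov(hours, marks)[0, 1])

# Flipping the sign of one dimension flips the sign of the covariance.
cov_flipped = sample_covariance(hours, -marks)
print(cov_flipped)

# A caveat: y is fully determined by x here, yet the covariance is zero,
# because the relationship is not linear.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
cov_nonlinear = sample_covariance(x, x ** 2)
print(cov_nonlinear)
```

The last case is why a zero covariance is best read as "no linear relationship" rather than as proof that the two dimensions have nothing to do with each other.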

Principal Component Analysis

The assumptions of PCA:

  1. Linearity – Assumes the data can be expressed as linear combinations of the variables.
  2. The importance of mean and covariance – PCA relies only on the mean and covariance of the data, and there is no guarantee that the directions of maximum variance will contain good features for discrimination.
  3. Large variances have important dynamics – Assumes that components with larger variance correspond to interesting dynamics and those with lower variance correspond to noise. In simpler terms, if we want to classify Male and Female using the height dimension, then the data in the height dimension should be dispersed. Data with negligible variance will be of no use, i.e. if all the observations have the same height, we will not be able to use this dimension to classify Male/Female.
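The third assumption can be illustrated with a tiny sketch: a dimension in which every observation has the same value has zero variance and so cannot separate anything. The measurements, the weight column, and the tolerance below are hypothetical.

```python
import numpy as np

# Hypothetical measurements: columns are height (cm) and weight (kg).
# Every height is identical on purpose, so that column carries no information.
samples = np.array([
    [170.0, 58.0],
    [170.0, 82.0],
    [170.0, 61.0],
    [170.0, 79.0],
])

variances = samples.var(axis=0, ddof=1)  # sample variance of each column
tolerance = 1e-12                        # illustrative cut-off for "negligible"
informative = variances > tolerance      # True where a column actually varies

print(variances)
print(informative)
```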

Steps for PCA:

What will this give us? It will give us the original data solely in terms of the vectors we chose. Our original data set had two axes, x and y, so our data was in terms of them. It is possible to express data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the most efficient. This is why it is important that the eigenvectors of the covariance matrix are perpendicular to each other. We have changed our data from being in terms of the axes x and y to being in terms of our 2 eigenvectors. When the new data set has reduced dimensionality, i.e. we have left some of the eigenvectors out, the new data is only in terms of the vectors that we decided to keep. In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in Figure 1.3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is understandable, since we have lost no information in this decomposition.
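The transformation described above can be sketched with NumPy as follows. The step sequence (subtract the mean, compute the covariance matrix, take its eigenvectors, order them, project) is the commonly used PCA recipe and is an assumption here, since the article does not list the steps; the data points are made up.

```python
import numpy as np

# Hypothetical 2-D data set (rows are points, columns are x and y).
data = np.array([
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 7.0],
    [7.0, 8.0],
])

# Step 1: subtract the mean of each dimension.
mean = data.mean(axis=0)
centered = data - mean

# Step 2: covariance matrix of the centered data (2 x 2, symmetric).
cov = np.cov(centered, rowvar=False)

# Step 3: eigen-decomposition. eigh is meant for symmetric matrices and
# returns ascending eigenvalues with orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: order components by decreasing eigenvalue (largest variance first).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 5: express the data in terms of the eigenvectors (a rotation).
transformed = centered @ eigenvectors

# Keeping both eigenvectors loses nothing: rotating back recovers the data.
restored = transformed @ eigenvectors.T + mean
print(np.allclose(restored, data))
```

Because the eigenvectors are perpendicular unit vectors, multiplying by the transpose undoes the rotation exactly, which is the "no information lost" property mentioned in the text.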

So what have we done here? Basically, we have transformed our data so that it is expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified each data point as a combination of the contributions from each of those lines. Initially, we had the simple x and y axes. This is fine, but the x and y values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where (i.e. above/below) the trend lines the data point sits. In the case of the transformation using both eigenvectors, we have simply altered the data so that it is in terms of those eigenvectors instead of the usual axes. But the single-eigenvector decomposition has removed the contribution due to the smaller eigenvector and left us with data that is only in terms of the other.
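The single-eigenvector case can be sketched in the same way. This is a minimal illustration on made-up data, not the author's code: it keeps only the eigenvector with the largest eigenvalue and checks that what is lost is exactly the contribution along the discarded one.

```python
import numpy as np

# Hypothetical 2-D data set (rows are points, columns are x and y).
data = np.array([
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 3.0],
    [5.0, 6.0],
    [6.0, 7.0],
    [7.0, 8.0],
])

mean = data.mean(axis=0)
centered = data - mean
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# eigh sorts eigenvalues in ascending order, so the last column is the
# principal eigenvector and the first column is the one we discard.
principal = eigenvectors[:, -1:]   # shape (2, 1)
minor = eigenvectors[:, :1]        # shape (2, 1)

# Reduced data: one coordinate per point, along the principal eigenvector.
reduced = centered @ principal

# Map back to the x, y axes; every point now lies on a single line.
approx = reduced @ principal.T + mean

# What was thrown away is the component along the discarded eigenvector.
residual = data - approx
lost = (centered @ minor) @ minor.T
print(np.allclose(residual, lost))

# Share of the total variance kept by the single component.
explained = eigenvalues[-1] / eigenvalues.sum()
print(explained)
```

The ratio in the last line is one common way to judge how much accuracy is traded for simplicity when a component is dropped.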

About the Author, Shailendra Kathait:

Shailendra heads Analytics Delivery & Solutions for Valiance Solutions, where he is responsible for building Machine Learning products and analytics-driven outcomes for its clients. He brings 8-plus years of core Distributed Machine Learning, Image Processing & Analytics experience with Fortune 100 companies like IBM, American Express & ICICI Group across the EMEA, US and Indian Subcontinent regions. Shailendra has a deep interest in Neural Networks, Deep Belief Networks, Digital Image Processing & Optimization.

Shailendra holds several patents and is the anchor author of several publications on Machine Learning & Optimization. He can be followed on LinkedIn.


