*Adaptado do original: Introduction to Principal Component Analysis*

# Introduction to Principal Component Analysis

**Overview**

The sheer size of data in the modern age is not only a challenge for computer hardware but also the main bottleneck for the performance of many machine learning algorithms. The main goal of a PCA analysis is to identify patterns in data. PCA aims to detect the correlation between variables. If a strong correlation between variables exists, the attempt to reduce the dimensionality only makes sense. It is a statistical method used to reduce the number of variables in a data-set. It does so by lumping highly correlated variables together. Naturally, this comes at the expense of accuracy. However, if you have 50 variables and realize that 40 of them are highly correlated, you will gladly trade a little accuracy for simplicity.

**Basic Statistics**

The entire subject of statistics is based around the idea that you have this big set of data, and you want to analyse that set in terms of the relationships between the individual points in that data set. I am going to look at a few of the measures you can do on a set of data, and what they tell you about the data itself.

**Standard Deviation:** In statistics, the standard deviation (SD, also represented by the Greek letter sigma σ) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values. How do we calculate it? The English definition of the SD is: “The average distance from the mean of the data set to a point”. The way to calculate it is to compute the squares of the distance from each data point to the mean of the set, add them all up, and take the positive square root. As a formula:

**Variance**: In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers are spread out from their mean. The variance has a central role in statistics. It is used in descriptive statistics, statistical inference, hypothesis testing, the goodness of fit, and Monte Carlo sampling, amongst many others. It is the square of Standard Deviation.

**Covariance**: Standard deviation and variance only operate on 1 dimension, so that you could only calculate the standard deviation for each dimension of the data set independently of the other dimensions. However, it is useful to have a similar measure to find out how much the dimensions vary from the mean with respect to each other. Covariance is such a measure. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x,y,z), then you could measure the covariance between the and dimensions, the and dimensions, and the and dimensions. Measuring the covariance between and, or and, or and would give you the variance of the, and dimensions respectively.The formula for covariance is very similar to the formula for variance. The formula for variance could also be modified and rewritten like this

where I have simply expanded the square term to show both parts. So given that knowledge, here is the formula for covariance:

How does this work? Let’s use some example data. Imagine we have gone into the world and collected some 2-dimensional data, say, we have asked a bunch of students how many hours in total that they spent studying, and the mark that they received. So we have two dimensions, the first is the dimension, the hours studied, and the second is the dimension, the mark received. So what does it tell us? The exact value is not as important as its sign (ie. positive or negative). If the value is positive, then that indicates that both dimensions increase together, meaning that, in general, as the number of hours of study increased, so did the final mark.

If the value is negative, then as one dimension increases, the other decreases. If we had ended up with a negative covariance here, then that would have said the opposite, that as the number of hours of study increased the final mark decreased. In the last case, if the covariance is zero, it indicates that the two dimensions are independent of each other.

**Principal Component Analysis**

**The assumptions of PCA:**

**Linearity** – Assumes the data set to be linear combinations of the variables.
**The importance of mean and covariance** – There is no guarantee that the directions of maximum variance will contain good features for discrimination
**Those large variances have important dynamics** – Assumes that components with larger variance correspond to interesting dynamics and lower ones correspond to noise. In simpler terms suppose if we want to classify Male and Female using the height dimension then the data in the height dimension should be dispersed data with negligible variance will be of no use ie. if all the observant are having same height then we will not be able to use this dimension to classify Male/Female.

**Steps for PCA:**

What will this give us? It will give us the original data solely in terms of the vectors we chose. Our original data set had two axes, x and y, so our data was in terms of them. It is possible to express data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the most efficient. This was why it was important that eigenvectors are always perpendicular to each other. We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors. In the case of when the new data set has reduced dimensionality, ie. we have left some of the eigenvectors out, the new data is only in terms of the vectors that we decided to keep. In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in Figure 1.3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is understandable since we have lost no information in this decomposition.

So what have we done here? Basically, we have transformed our data so that is expressed in terms of the patterns between them, where the patterns are the lines that most closely describe the relationships between the data. This is helpful because we have now classified our data point as a combination of the contributions from each of those lines. Initially, we had the simple x and y axes. This is fine, but the x and y values of each data point don’t really tell us exactly how that point relates to the rest of the data. Now, the values of the data points tell us exactly where (ie. above/below) the trend lines the data point sits. In the case of the transformation using both eigenvectors, we have simply altered the data so that it is in terms of those eigenvectors instead of the usual axes. But the single-eigenvector decomposition has removed the contribution due to the smaller eigenvector and left us with data that is only in terms of the other.

**About the Author, Shailendra Kathait:**

Shailendra Heads Analytics Delivery & Solutions for Valiance Solutions where he is responsible for building Machine Learning Products and Analytics driven outcomes for our clients. He brings 8 plus years of core Distributed Machine learning, Image Processing & Analytics experience with Fortune 100 companies like IBM(R), American Express & ICICI Group across EMEA, US and Indian Subcontinent region. Shailendra has deep Interest in Neural Networks, Deep Belief Networks, Digital Image Processing & Optimization.

Shailendra holds several Patents and is Anchor author of several publications on Machine Learning & Optimization. He can be followed on LinkedIn.

🙂