Principal Component Analysis


Principal Component Analysis (PCA) decomposes a multivariate dataset into a set of successive orthogonal components, each explaining the maximum possible amount of the remaining variance.

Given a set of data, PCA finds a new coordinate system from the old one by translation and rotation only. It moves the origin of the coordinate system to the center of the data and aligns the x-axis with the principal axis of variation (the direction along which the data points vary the most). It then generates the y-axis orthogonally to it (more on Udacity). The points are then projected onto the new x-axis. The information loss is measured by the distances between the points and their projections (formally, the sum of the squared distances): the further the points are from the axis, the more information we lose.
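
The translate, rotate and project steps can be sketched with NumPy alone. This is a minimal illustration on synthetic 2-D points; the data, the seed and the variable names are assumptions made for the example and are not part of the original walkthrough. The loss is computed as the sum of squared distances between the centered points and their projections.

```python
import numpy as np

# Synthetic, correlated 2-D data (assumption: made up for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
points = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

# 1. Translation: move the origin to the center of the data.
centered = points - points.mean(axis=0)

# 2. Rotation: the right singular vectors are the new orthogonal axes,
#    ordered by decreasing spread of the data along them.
_, singular_values, axes = np.linalg.svd(centered, full_matrices=False)
first_axis, second_axis = axes

# 3. Projection onto the new x-axis (the first principal axis).
coords = centered @ first_axis             # 1-D coordinates along the axis
projected = np.outer(coords, first_axis)   # the same points, back in 2-D

# Information loss: sum of squared distances to the projected points.
loss = np.sum((centered - projected) ** 2)

print("first axis:", first_axis)
print("loss:", loss)
print("sum of squares along the second axis:", singular_values[1] ** 2)
```

The last two printed values agree: everything that is lost by projecting onto the first axis is exactly the spread of the data along the second one.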

When to use PCA?

  • Latent features: there are many features in the dataset, but the hypothesis is that only a few of them actually drive the pattern.
  • Dimensionality reduction: building composite features that probe the underlying phenomenon more directly makes it possible to visualize high-dimensional data, reduce noise, and feed fewer inputs to other algorithms (regression, classification).

An example of dimensionality reduction can be the price of a house according to 4 features:

  • Square footage
  • Number of rooms
  • School ranking
  • Safety problems
These four measurable features could plausibly be reduced to two latent ones: space and neighborhood quality.

(more on Udacity).
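
The house example can be mimicked with synthetic numbers. In the sketch below, the two latent factors, the noise level and the way each feature depends on them are all assumptions chosen for illustration; no real housing data is involved. If the four features really are driven by two hidden factors, the first two principal components should carry most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_houses = 500

# Two hypothetical latent factors (never observed directly).
space = rng.normal(size=n_houses)
neighborhood = rng.normal(size=n_houses)

def noise():
    """Small measurement noise, drawn independently for each feature."""
    return 0.2 * rng.normal(size=n_houses)

# Four measurable features, each driven by one latent factor plus noise.
features = np.column_stack([
    space + noise(),          # square footage
    space + noise(),          # number of rooms
    neighborhood + noise(),   # school ranking
    -neighborhood + noise(),  # safety problems (fewer in better areas)
])

# Standardize so that no feature dominates because of its units.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=4).fit(scaled)
ratios = pca.explained_variance_ratio_
print(ratios)
print("share explained by the first two components:", ratios[:2].sum())
```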

Face Recognition with PCA (eigenfaces)

What makes facial recognition in pictures a good fit for PCA?

  • Each pixel is a feature, so pictures generally have high input dimensionality.
  • Faces have patterns that can be captured in a smaller number of dimensions (eyes, nose, mouth, chin...).
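
To make "each pixel is a feature" concrete, here is a small sketch of how a grayscale picture becomes one row of a feature matrix. The image size and the pixel values are placeholders picked for the example, not values read from a dataset.

```python
import numpy as np

# Hypothetical image size, for illustration only.
h, w = 50, 37
image = np.arange(h * w, dtype=float).reshape(h, w)  # stand-in for a grayscale face

# Flattening turns the picture into one sample with h * w features.
sample = image.reshape(-1)
print(sample.shape)  # (1850,)

# A stack of n images becomes an (n_samples, n_features) matrix.
images = np.stack([image, image[::-1]])
X = images.reshape(len(images), -1)
print(X.shape)  # (2, 1850)

# Reshaping a row back to (h, w) recovers the picture, which is how a
# principal component can later be displayed as an image.
restored = X[0].reshape(h, w)
```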

In this project, I will apply PCA to face recognition on pictures of presidents and other public figures.

  1. Get the data and generate the training and testing sets:

    # Faces recognition example using eigenfaces and SVMs.
    # The dataset used in this example is a preprocessed excerpt of the
    # "Labeled Faces in the Wild" dataset, aka LFW (233MB).
    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.model_selection import train_test_split
    # Download the data, if not already on disk, and load it as numpy arrays
    lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
    # Introspect the images array to find the shapes (for plotting)
    n_samples, h, w = lfw_people.images.shape
    # For machine learning we use the flattened pixel data directly (relative
    # pixel position info is ignored by this model)
    X = lfw_people.data
    n_features = X.shape[1]
    # The label to predict is the id of the person
    y = lfw_people.target
    target_names = lfw_people.target_names
    n_classes = target_names.shape[0]
    print("Total dataset size:")
    print("n_samples: %d" % n_samples)
    print("n_features: %d" % n_features)
    print("n_classes: %d" % n_classes)
    # Split into a training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
  2. Compute the eigenfaces:

    from time import time
    from sklearn.decomposition import PCA
    # Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
    # dataset): unsupervised feature extraction / dimensionality reduction
    n_components = 150
    print("Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0]))
    t0 = time()
    pca = PCA(n_components=n_components, whiten=True, svd_solver='randomized').fit(X_train)
    print("done in %0.3fs" % (time() - t0))
    # Each component is a vector of h * w pixels, so it can be viewed as an image
    eigenfaces = pca.components_.reshape((n_components, h, w))
    print("Projecting the input data on the eigenfaces orthonormal basis")
    t0 = time()
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)
    print("done in %0.3fs" % (time() - t0))
    print("variance ratio: ", pca.explained_variance_ratio_)

    PCA orders the principal components so that the first PC gives the direction of maximal variance, the second PC the direction of the second largest variance, and so on. The fraction of the total variance explained by each component is available, in that order, through the attribute explained_variance_ratio_.
    Here the ratios for the first three PCs are:

    • 0.19077216
    • 0.15184273
    • 0.07375473
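
A common use of these ratios is to add them up and see how many components are needed to reach a target share of the variance. The sketch below first sums the three ratios quoted above, then applies the same idea to random stand-in data; that data, the 95% threshold and the variable names are assumptions for illustration and do not come from the face dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# The three ratios quoted above: together they explain about 42% of the variance.
first_three = np.array([0.19077216, 0.15184273, 0.07375473])
print(first_three.sum())

# Synthetic stand-in data (assumption: random numbers, not the face dataset),
# with columns of decreasing spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20)) * np.linspace(5.0, 0.5, 20)

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 95%
# (the 95% threshold is an arbitrary choice for this example).
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed, cumulative[n_needed - 1])
```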

  3. Train a classification model:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    # Train an SVM classification model, tuning C and gamma by grid search
    print("Fitting the classifier to the training set")
    t0 = time()
    param_grid = {
        'C': [1e3, 5e3, 1e4, 5e4, 1e5],
        'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
    }
    clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
    # Fit on the PCA-projected training data, not on the raw pixels
    clf = clf.fit(X_train_pca, y_train)
    print("done in %0.3fs" % (time() - t0))
    print("Best estimator found by grid search:")
    print(clf.best_estimator_)
  4. Evaluation of the model quality:

    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    print("Predicting the people names on the testing set")
    t0 = time()
    y_pred = clf.predict(X_test_pca)
    print("done in %0.3fs" % (time() - t0))
    print(classification_report(y_test, y_pred, target_names=target_names))
    print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))
    >>> classification_report should print :
                        precision    recall  f1-score   support
         Ariel Sharon       1.00      0.62      0.76        13
         Colin Powell       0.83      0.92      0.87        60
      Donald Rumsfeld       0.91      0.74      0.82        27
        George W Bush       0.83      0.97      0.90       146
    Gerhard Schroeder       0.85      0.68      0.76        25
          Hugo Chavez       1.00      0.60      0.75        15
           Tony Blair       1.00      0.72      0.84        36
          avg / total       0.87      0.86      0.86       322
    >>> confusion_matrix should print :
    [[  8   2   1   2   0   0   0]
     [  0  55   0   5   0   0   0]
     [  0   1  20   6   0   0   0]
     [  0   4   0 142   0   0   0]
     [  0   1   1   6  17   0   0]
     [  0   1   0   3   2   9   0]
     [  0   2   0   7   1   0  26]]
    In the confusion matrix, each row corresponds to the real person and each
    column to the predicted one; the diagonal entries count the correct predictions.
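
The precision and recall columns of the report can be recomputed by hand from the confusion matrix printed above, which is a useful check on how to read it. The sketch below only copies that matrix; the variable names are mine.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: real, columns: predicted).
cm = np.array([
    [8, 2, 1, 2, 0, 0, 0],
    [0, 55, 0, 5, 0, 0, 0],
    [0, 1, 20, 6, 0, 0, 0],
    [0, 4, 0, 142, 0, 0, 0],
    [0, 1, 1, 6, 17, 0, 0],
    [0, 1, 0, 3, 2, 9, 0],
    [0, 2, 0, 7, 1, 0, 26],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # per real class (row totals = support)
precision = correct / cm.sum(axis=0)   # per predicted class (column totals)
accuracy = correct.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 2))
```

For example, Ariel Sharon has 13 real pictures (first row) of which 8 are on the diagonal, giving a recall of 8/13, about 0.62, while all 8 pictures predicted as Ariel Sharon (first column) are correct, giving a precision of 1.00.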
  5. Evaluation of the predictions:

    import matplotlib.pyplot as plt
    def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
        """Helper function to plot a gallery of portraits"""
        plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
        plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
        for i in range(n_row * n_col):
            plt.subplot(n_row, n_col, i + 1)
            plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
            plt.title(titles[i], size=12)
            # Hide the axis ticks: they carry no information for portraits
            plt.xticks(())
            plt.yticks(())
    # Plot the result of the prediction on a portion of the test set
    def title(y_pred, y_test, target_names, i):
        pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
        true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
        return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)
    prediction_titles = [title(y_pred, y_test, target_names, i)
                         for i in range(y_pred.shape[0])]
    plot_gallery(X_test, prediction_titles, h, w)
    # Plot the gallery of the most significant eigenfaces
    eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
    plot_gallery(eigenfaces, eigenface_titles, h, w)
    plt.show()

    Gallery of a portion of the test set

    Gallery of the most significant eigenfaces

Dimensionality reduction with PCA in the Enron Corpus

Enron was one of the largest US companies in 2000. By the end of 2001, it had collapsed into bankruptcy due to widespread corporate fraud, known since as the Enron scandal. A vast amount of confidential information, including thousands of emails and financial data, was made public after the federal investigation.

In this project, I will apply PCA to the Enron financial data.

  1. We first need to download the Enron Corpus (this can take more than an hour) and unzip the file (which can take a while too). There are 156 people in this dataset, each identified by their last name and the first letter of their first name.

  2. Let's take a look at the data. The dataset for the project can be read as a dictionary where each key is a person and each value is a dictionary containing all the available features. Here is an example of one entry:

    {'ALLEN PHILLIP K': {'bonus': 4175000,
    'deferral_payments': 2869717,
    'deferred_income': -3081055,
    'director_fees': 'NaN',
    'email_address': '',
    'exercised_stock_options': 1729541,
    'expenses': 13868,
    'from_messages': 2195,
    'from_poi_to_this_person': 47,
    'from_this_person_to_poi': 65,
    'loan_advances': 'NaN',
    'long_term_incentive': 304805,
    'other': 152,
    'poi': False,
    'restricted_stock': 126027,
    'restricted_stock_deferred': -126027,
    'salary': 201955,
    'shared_receipt_with_poi': 1407,
    'to_messages': 2902,
    'total_payments': 4484442,
    'total_stock_value': 1729541}}
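
The next step passes a two-column array called data to PCA, but the text does not show how it is built from the dictionary. The sketch below is one possible way to do it, under several assumptions: the helper extract_features is hypothetical (the course code may use a different one), people with a 'NaN' value are simply skipped, and the features are rescaled to [0, 1] with MinMaxScaler. Only the first entry reuses values from the example above; the other two are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumption: a tiny stand-in for the real dictionary. The first entry reuses
# values from the example above; the other two are made up for illustration.
data_dict = {
    "ALLEN PHILLIP K": {"bonus": 4175000, "long_term_incentive": 304805},
    "HYPOTHETICAL PERSON A": {"bonus": 600000, "long_term_incentive": "NaN"},
    "HYPOTHETICAL PERSON B": {"bonus": 1200000, "long_term_incentive": 1500000},
}

def extract_features(dataset, feature_names):
    """Return an array with one row per person who has every feature.

    Hypothetical helper: people with a 'NaN' string in any requested
    feature are skipped rather than imputed.
    """
    rows = []
    for person in sorted(dataset):
        values = [dataset[person].get(name, "NaN") for name in feature_names]
        if "NaN" in values:
            continue
        rows.append([float(value) for value in values])
    return np.array(rows)

raw = extract_features(data_dict, ["bonus", "long_term_incentive"])

# Rescale each feature to [0, 1] so both axes are comparable.
data = MinMaxScaler().fit_transform(raw)
print(data)
```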
  3. PCA calculation on two chosen features ("long_term_incentive" and "bonus"):

    ### Modified from: Udacity - Intro to Machine Learning
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    # `data` is assumed to be an (n_people, 2) array holding the rescaled
    # "bonus" and "long_term_incentive" values, in that order
    def doPCA(data):
        pca = PCA(n_components=2)
        pca.fit(data)
        return pca
    pca = doPCA(data)
    print(pca.explained_variance_ratio_)
    # Should print: [ 0.79258816  0.20741184]
    first_pc = pca.components_[0]
    # first_pc should be: [ 0.33434446  0.94245094]
    second_pc = pca.components_[1]
    # second_pc should be: [ 0.94245094 -0.33434446]
    transformed_data = pca.transform(data)
    for ii, jj in zip(transformed_data, data):
        # Projection of the point onto each principal component
        plt.scatter(first_pc[0]*ii[0], first_pc[1]*ii[0], color='r')
        plt.scatter(second_pc[0]*ii[1], second_pc[1]*ii[1], color='c')
        # Original point
        plt.scatter(jj[0], jj[1], color='m', marker="X")
    plt.xlim(-0.4, 1.1)
    plt.ylim(-0.4, 1.1)
    plt.gca().set_aspect('equal', adjustable='box')
    plt.xlabel("bonus")
    plt.ylabel("long-term incentive")
    plt.show()

    Scaled distribution of the people

    feature1 = "bonus", feature2="long_term_incentive"

    Rescaled projection of points onto first and second PC

    PC1 = red, PC2 = cyan, initial dataset = magenta

    The attribute pca.components_ contains as many principal components as specified in the parameter n_components (here n_components = 2). Each component is a vector that gives the direction of a new axis in the original xy feature space: PC1 = [0.33434446, 0.94245094] is the direction of x' and PC2 = [0.94245094, -0.33434446] is the direction of y'. Because PC1 and PC2 are orthogonal in two dimensions, the coordinates of PC2 are those of PC1 swapped, with one sign flipped, so that their dot product is zero.
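
The relationship between the two components can be checked numerically. The sketch below uses only the two vectors quoted above; the sample point is a hypothetical value added to show how coordinates in the new basis are obtained.

```python
import numpy as np

# Components quoted in the text above.
first_pc = np.array([0.33434446, 0.94245094])
second_pc = np.array([0.94245094, -0.33434446])

# Orthogonal axes: the dot product is zero.
dot = np.dot(first_pc, second_pc)
print(dot)

# Each component has unit length, up to rounding of the printed digits.
norm_first = np.linalg.norm(first_pc)
norm_second = np.linalg.norm(second_pc)
print(norm_first, norm_second)

# Angle between the first principal axis and the original x-axis (bonus).
angle = np.degrees(np.arctan2(first_pc[1], first_pc[0]))
print(angle)

# Coordinates of a point in the new basis: dot products with each component.
# (PCA.transform also subtracts the feature means first; that centering step
# is left out here, so this shows the rotation only.)
point = np.array([0.5, 0.5])  # hypothetical point, for illustration
new_coords = np.array([first_pc @ point, second_pc @ point])
print(new_coords)
```

The angle of roughly 70 degrees says that the first principal axis points mostly along the long-term incentive direction rather than along the bonus direction.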