The k-means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. The algorithm requires the number of clusters to be specified in advance. It scales well to large numbers of samples and has been used across a wide range of application areas.
The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μ_j of the samples in the cluster. The means are commonly called the cluster "centroids"; note that they are not, in general, points from X, although they live in the same space. The k-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion.
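In symbols, for samples $x_i$ and centroids $\mu_j$, this is the standard criterion

$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$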
The disadvantages of k-means include: the number of clusters has to be chosen in advance; the inertia criterion assumes that clusters are convex and isotropic, so it responds poorly to elongated or irregularly shaped clusters; and the result depends on the initialization, since the algorithm may only converge to a local minimum.
Enron was one of the largest US companies in 2000. By the end of 2001, it had collapsed into bankruptcy due to widespread corporate fraud, known since as the Enron scandal. A vast amount of confidential information, including thousands of emails and financial data, was made public after the federal investigation.
In this project, I will apply k-means clustering to the Enron financial data.
We first need to download the Enron corpus (this might take a while, possibly more than an hour) and unzip the file (which can also take some time). There are 156 people in this dataset, each identified by their last name and the first letter of their first name.
Let's take a look at the data. The dataset for the project can be read as a dictionary where each key is a person and each value is a dictionary containing all the possible features. Here is an example of one entry:
{'ALLEN PHILLIP K': {'bonus': 4175000,
'deferral_payments': 2869717,
'deferred_income': -3081055,
'director_fees': 'NaN',
'email_address': 'phillip.allen@enron.com',
'exercised_stock_options': 1729541,
'expenses': 13868,
'from_messages': 2195,
'from_poi_to_this_person': 47,
'from_this_person_to_poi': 65,
'loan_advances': 'NaN',
'long_term_incentive': 304805,
'other': 152,
'poi': False,
'restricted_stock': 126027,
'restricted_stock_deferred': -126027,
'salary': 201955,
'shared_receipt_with_poi': 1407,
'to_messages': 2902,
'total_payments': 4484442,
'total_stock_value': 1729541}
}
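As a quick check, the dictionary can be loaded and inspected directly from the pickle file provided with the course (the path below is the same one used in the scripts further down):

import pickle

### Load the course-provided dataset and inspect it
with open("../final_project/final_project_dataset_modified.pkl", "rb") as f:
    dictionary = pickle.load(f)

print(len(dictionary))                  # number of people in the dataset
print(dictionary["ALLEN PHILLIP K"])    # the example entry shown above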
I will first perform k-means clustering based on just two features, "salary" and "exercised_stock_options".
### Modified from: Udacity - Intro to Machine Learning
import pickle
from feature_format import featureFormat, targetFeatureSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
##########################################################################
### Split data
### A pickle document was created by the instructors of the course.
### To find it, see the full project on github
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "rb") )
### Create a list of features with the first one being "poi"
poi = "poi"
feature_1 = "salary"
feature_2 = "exercised_stock_options"
features_list = [poi, feature_1, feature_2]
### FeatureFormat converts data from the dictionary format to an
### (n x k) python list that's ready for training an sklearn algorithm
data = featureFormat(dictionary, features_list, remove_any_zeroes=True)
### targetFeatureSplit separates out the first feature (should be the target)
### from the others. The function returns the target in its own list
### and all of the other features in a separate list
poi, finance_features = targetFeatureSplit( data )
### Feature scaling
scaler = MinMaxScaler(feature_range=(0, 1), copy=True)
scaler.fit(finance_features)
finance_features = scaler.transform(finance_features)
##########################################################################
### Draw the scatterplot
for f1, f2 in finance_features:
    plt.scatter(f1, f2)
### Add axis labels
plt.xlabel(features_list[1])
plt.ylabel(features_list[2])
plt.savefig("test.png")
plt.show()
Scaled distribution of the people's "exercised_stock_options" with respect to their "salary"
The class sklearn.cluster.KMeans() was used for clustering.
### Modified from: Udacity - Intro to Machine Learning
from sklearn.cluster import KMeans
### Fit the k-means clustering model on the scaled financial features
kmeans = KMeans(n_clusters=2, random_state=0).fit(finance_features)
### Predict the cluster for each data point
pred = kmeans.predict(finance_features)
##########################################################################
### Draw the scatterplot with n_clusters
colors = ['b', 'y']
for ii, pp in enumerate(pred):
    plt.scatter(finance_features[ii][0], finance_features[ii][1], color=colors[pp])
### Add axis labels
plt.xlabel(features_list[1])
plt.ylabel(features_list[2])
plt.savefig("test.png")
plt.show()
K-means cluster with 2 features
feature1 = "salary", feature2 = "exercised_stock_options"
K-means cluster with 2 features with marked "poi"
Red crosses show "poi"
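The red crosses marking the "poi" in the second figure are not produced by the code above; a minimal way to overlay them, reusing the poi labels and scaled features from the previous block, would be:

### Overlay the true "poi" as red crosses on top of the clustered scatterplot
for ii, is_poi in enumerate(poi):
    if is_poi:
        plt.scatter(finance_features[ii][0], finance_features[ii][1],
                    color='r', marker='x')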
Two clusters are identified in blue and yellow. The plot with marked "poi" shows that the yellow cluster identifies some "poi", but many of them still fall into the blue cluster. More features might be necessary for better clustering.
Accuracy calculations for these clusters are as follows: the global accuracy reaches 87.903 %, but the poi accuracy is quite low, which means that the clustering is not effective at finding poi. On the other hand, the non-poi accuracy is high, 99.057 %, meaning that we are less susceptible to false positives.
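These accuracy figures were computed outside the code shown above; a minimal sketch of how they could be obtained, assuming cluster 1 is taken as the "poi" cluster (with k-means the cluster/label mapping is arbitrary and may need to be swapped), is:

import numpy as np

### Compare predicted cluster labels with the true "poi" labels
true_poi = np.array(poi, dtype=bool)
pred_poi = np.array(pred, dtype=bool)

global_accuracy  = np.mean(pred_poi == true_poi)
poi_accuracy     = np.mean(pred_poi[true_poi])       # fraction of poi found
non_poi_accuracy = np.mean(~pred_poi[~true_poi])     # fraction of non-poi correctly kept

print("Global accuracy : %.3f %%" % (100 * global_accuracy))
print("Poi accuracy    : %.3f %%" % (100 * poi_accuracy))
print("Non-poi accuracy: %.3f %%" % (100 * non_poi_accuracy))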
A third feature, "total_payments", is now taken into account.
### Modified from: Udacity - Intro to Machine Learning
import pickle
from feature_format import featureFormat, targetFeatureSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  ### registers the '3d' projection used below
##########################################################################
### Split data
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "rb") )
poi = "poi"
feature_1 = "salary"
feature_2 = "exercised_stock_options"
feature_3 = "total_payments"
features_list = [poi, feature_1, feature_2, feature_3]
data = featureFormat(dictionary, features_list, remove_any_zeroes=True)
poi, finance_features = targetFeatureSplit( data )
scaler = MinMaxScaler(feature_range=(0, 1), copy=True)
scaler.fit(finance_features)
finance_features = scaler.transform(finance_features)
##########################################################################
### Draw the scatterplot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for f1, f2, f3 in finance_features:
    ax.scatter(f1, f2, f3)
### Add labels
ax.set_xlabel(features_list[1])
ax.set_ylabel(features_list[2])
ax.set_zlabel(features_list[3])
plt.savefig("test.png")
plt.show()
Scaled distribution of the people with 3 features
feature1 = "salary", feature2 = "exercised_stock_options", feature3 = "total_payments"
Clustering with three features (n_clusters = 2)
### Modified from: Udacity - Intro to Machine Learning
from sklearn.cluster import KMeans
### Fit the k-means clustering model on the scaled financial features
kmeans = KMeans(n_clusters=2, random_state=0).fit(finance_features)
### Predict the cluster for each data point
pred = kmeans.predict(finance_features)
##########################################################################
### Draw the scatterplot with n_clusters
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
colors = ['b', 'y']
for ii, pp in enumerate(pred):
    ax.scatter(finance_features[ii][0], finance_features[ii][1],
               finance_features[ii][2], color=colors[pp])
### Add axis labels
ax.set_xlabel(features_list[1])
ax.set_ylabel(features_list[2])
ax.set_zlabel(features_list[3])
plt.savefig("test.png")
plt.show()
K-means cluster (n_clusters = 2) with 3 features
feature1 = "salary", feature2 = "exercised_stock_options", feature3 = "total_payments"
K-means cluster (n_clusters = 2) with 3 features with marked "poi"
Red crosses show "poi"
Clustering (n_clusters = 2) with the 3 features "salary", "exercised_stock_options" and "total_payments" causes some points to switch from the yellow group to the blue group. Accuracy calculations are as follows: a global accuracy of 87.770 % shows that the clustering does a decent job at determining whether or not a person is a "poi". However, when looking only at the "poi", the accuracy drops to 11.111 %, meaning that the clustering is not very good at finding "poi".
Clustering with three features (n_clusters = 3)
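The code for this run is not repeated in full; a minimal sketch, reusing the scaled features and the 3D plotting setup from the previous block and changing only n_clusters, could look like this:

### Same pipeline as above, but with three clusters (sketch)
kmeans3 = KMeans(n_clusters=3, random_state=0)
pred3 = kmeans3.fit_predict(finance_features)

colors = ['b', 'y', 'g']   ### the color of the third cluster is arbitrary here
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for ii, pp in enumerate(pred3):
    ax.scatter(finance_features[ii][0], finance_features[ii][1],
               finance_features[ii][2], color=colors[pp])
ax.set_xlabel(features_list[1])
ax.set_ylabel(features_list[2])
ax.set_zlabel(features_list[3])
plt.show()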
K-means cluster (n_clusters = 3) with 3 features
feature1 = "salary", feature2 = "exercised_stock_options", feature3 = "total_payments"
K-means cluster (n_clusters = 3) with 3 features with marked "poi"
Red crosses show "poi"
Clustering with n_clusters = 3 when the expected outcome is binary (is "poi" or is not "poi") might seem odd. However, the classification can be considered as a gradient of probability where we consider three states: low probability of being a "poi", probable "poi", and high probability of being a "poi".
Accuracy calculations for these clusters are as follows: the global accuracy drops drastically to 50.360 %, showing that the clustering is more prone to false positives. However, the poi accuracy rises to 94.444 % when considering both the probable poi and the high-probability poi clusters. One remarkable thing about this clustering is that the low-probability poi cluster predicts non-poi with an accuracy of 98.148 %.