Text Learning

Identifying email authors with text learning

Enron was one of the largest US companies in 2000. At the end of 2001, it collapsed into bankruptcy as a result of widespread corporate fraud, known since as the Enron scandal. A vast amount of confidential information, including thousands of emails and financial data, was made public after the federal investigation.

In this project, I apply text learning to identify the authors of emails in the Enron Corpus.


  1. There are 156 people in the Enron Corpus, each identified by their last name and the first letter of their first name.

  2. The email dataset is in the maildir directory. For this mini-project, only the emails from Sara and Chris are analyzed. The function parseOutText() parses out all the text below the metadata block at the top of each email and stems each word; a short usage sketch follows the function.

                                            
    ### Modified from: Udacity - Intro to Machine Learning
    
    from nltk.stem.snowball import SnowballStemmer
    import string
    
    def parseOutText(f):
    
        f.seek(0)
        all_text = f.read()
    
        ### split off metadata
        content = all_text.split("X-FileName:")
        words = ""
    
        if len(content) > 1:
            ### remove punctuation (Python 3 form of str.translate)
            text_string = content[1].translate(str.maketrans("", "", string.punctuation))
    
            ### split the text string into individual words, stem each word,
            ### and join the stemmed words with a single space between them
            stemmer = SnowballStemmer("english")
            words = " ".join(stemmer.stem(word) for word in text_string.split())
    
        return words
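
    For illustration, here is a minimal sketch of calling parseOutText() on an invented email body (the sample text and in-memory file object are hypothetical, not from the corpus):

        from io import StringIO

        ### invented sample, for illustration only
        sample = StringIO("To: someone@enron.com\nX-FileName:\n\n"
                          "Hi everyone, the meetings were rescheduled.")
        print(parseOutText(sample))
        ### expected output: hi everyon the meet were reschedul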
                                            
                                        
  3. Preprocessing the emails consists of removing a handful of author-specific signature words (so the classifier cannot identify the author from them directly) and labeling each email. Pickle files are returned for both the email data and the author labels; a hypothetical usage sketch follows the function.

                                    
    ### Modified from: Udacity - Intro to Machine Learning
    
    import os
    import pickle
    
    def preprocess_email(from_sara_file, from_chris_file):
    
        from_sara  = open(from_sara_file, "r")
        from_chris = open(from_chris_file, "r")
    
        from_data = []
        word_data = []
    
        for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
            for path in from_person:
                path = os.path.join('..', path[:-1])
    
                email = open(path, "r")
                parsed_email = parseOutText(email)
                email.close()
    
                ### remove author-specific signature words so the classifier
                ### cannot identify the author from them directly
                stopwords = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]
                for stopword in stopwords:
                    parsed_email = parsed_email.replace(stopword, '')
    
                ### append the text to word_data
                word_data.append(parsed_email)
    
                ### append a 0 to from_data if the email is from Sara, a 1 if from Chris
                if name == "sara":
                    from_data.append(0)
                elif name == "chris":
                    from_data.append(1)
    
        print("emails processed")
        from_sara.close()
        from_chris.close()
    
        pickle.dump(word_data, open("word_data.pkl", "wb"))
        pickle.dump(from_data, open("from_data.pkl", "wb"))
    
        return ("word_data.pkl", "from_data.pkl")
                                    
                                
  4. TF-IDF (term frequency-inverse document frequency) vectorization using the class sklearn.feature_extraction.text.TfidfVectorizer; a toy example follows the function:

                                            
    ### Modified from: Udacity - Intro to Machine Learning
    
    import pickle
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectPercentile, f_classif
    from sklearn.model_selection import train_test_split
    
    def vectorize(word_data_file, from_data_file):
        with open(word_data_file, "rb") as words_file_handler:
            word_data = pickle.load(words_file_handler)
    
        with open(from_data_file, "rb") as authors_file_handler:
            authors = pickle.load(authors_file_handler)
    
        ### test_size is the fraction of events assigned to the test set
        ### (the remainder go into training)
        features_train, features_test, labels_train, labels_test = \
            train_test_split(word_data, authors, test_size=0.1, random_state=42)
    
        ### max_df=0.5 discards any word that occurs in more than 50% of the
        ### documents; words that common carry little authorship signal
        tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                           stop_words='english')
    
        features_train_transformed = tfidf_vectorizer.fit_transform(features_train)
        features_test_transformed  = tfidf_vectorizer.transform(features_test)
    
        ### additional feature selection with SelectPercentile keeps only the
        ### top 10% of features, ranked by ANOVA F-value
        selector = SelectPercentile(f_classif, percentile=10)
        selector.fit(features_train_transformed, labels_train)
    
        features_train_transformed = selector.transform(features_train_transformed).toarray()
        features_test_transformed  = selector.transform(features_test_transformed).toarray()
    
        return features_train_transformed, features_test_transformed, labels_train, labels_test
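
    To make the vectorization step concrete, a minimal sketch (independent of the project data) of what TfidfVectorizer produces on a toy corpus: each entry of the resulting matrix weights a term by its frequency in the document, discounted by how many documents contain it. The call get_feature_names_out() assumes scikit-learn 1.0 or later:

        from sklearn.feature_extraction.text import TfidfVectorizer

        corpus = ["the meeting was moved", "the contract was signed"]
        vec = TfidfVectorizer(stop_words='english')
        X = vec.fit_transform(corpus)

        print(vec.get_feature_names_out())
        ### ['contract' 'meeting' 'moved' 'signed']
        print(X.shape)
        ### (2, 4) -- one row per document, one column per term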
                                            
                                        

    Comparison of different classification algorithms

    The details of each classifier can be found on its respective page; a sketch of how the measurements could be taken follows the table.

    Classification algorithm | Training time (s) | Predict time (s) | Accuracy (%)
    ------------------------ | ----------------- | ---------------- | ------------
    Naive Bayes              | 9.182             | 1.905            | 97.497
    Decision tree            | 61.709            | 0.064            | 98.805
    SVM                      | 101.02            | 10.511           | 99.203
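
    A sketch of how these numbers could be measured, assuming the vectorize() output from step 4; GaussianNB stands in here for the Naive Bayes classifier (the exact estimators and their parameters are on their respective pages):

        from time import time
        from sklearn.naive_bayes import GaussianNB
        from sklearn.metrics import accuracy_score

        features_train, features_test, labels_train, labels_test = \
            vectorize("word_data.pkl", "from_data.pkl")

        clf = GaussianNB()   ### assumed classifier, for illustration

        t0 = time()
        clf.fit(features_train, labels_train)
        print("training time:", round(time() - t0, 3), "s")

        t0 = time()
        pred = clf.predict(features_test)
        print("predict time:", round(time() - t0, 3), "s")

        print("accuracy:", accuracy_score(labels_test, pred))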