Large Scale Text Processing and Sentiment Analysis Project with MapReduce, Hive, and Spark

This project is based on the final project I did with two teammates for the Cloud Computing and Big Data Application course.


Apache Hive is a data processing system built on Apache Hadoop for data query, analysis, integration, and related applications. Among these, sentiment analysis is one of the most common and influential applications that can run on Hive. Drawing on fundamental techniques from natural language processing and text analysis, a sentiment analysis program can extract, identify, and quantify the emotions and mood of the analyzed text. The essential idea of sentiment analysis is to compute mood by comparing the set of words in each document against an existing dictionary of words in different categories. The dictionary contains positive, negative, and neutral words, and can be refined by analyzing the most frequent positive or negative expressions. The goal of this project is to design and implement a Hive-based program that assesses the emotional states and subjective information of the analyzed text. After analyzing the homework review data, we deployed the program on Amazon EMR to analyze Amazon customer review data, predicting ratings and comparing them with the real ratings.

Throughout the course, emotional descriptions of every homework assignment, titled homework#_review accordingly and carrying clearly classified labels (positive, negative, neutral, and not_labeled), were collected for this project. With the provided dictionaries of positive/negative words and these labeled reviews, we can perform sentiment prediction, starting with an analysis of the most frequent positive/negative terms and building on those results. At the end, we compare the output of our sentiment analysis program with the existing emotion labels to measure the accuracy of the sentiment predictions.

Major Components of the Project

  • Data preprocessing and stop word removal
  • Basic analysis on the data (positive/negative term frequency, N-grams)
  • Favorite topic analysis (sentiment analysis across homework)
  • Sentiment prediction system (overall accuracy, and personal evaluation)
  • Cloud execution of the prediction system on Amazon product reviews with EMR


First, we need to filter out invalid reviews and useless characters. We remove all empty reviews and all reviews shorter than 50 words, strip all non-word characters (especially newline and tab characters) from every file, and put the homework number and label at the beginning of each file, followed by the actual text. The three columns are separated by tab characters, and each review ends with a newline. This part was done in Python.
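A minimal sketch of this preprocessing step (the function name is ours; the 50-word threshold and field order follow the description above):

```python
import re

def preprocess_review(hw_number, label, raw_text, min_words=50):
    """Clean one review and return a tab-separated line, or None if invalid."""
    # Replace non-word characters (including newlines and tabs) with spaces.
    text = re.sub(r"[^\w\s]", " ", raw_text)
    text = re.sub(r"\s+", " ", text).strip()
    # Drop empty reviews and reviews shorter than min_words words.
    if not text or len(text.split()) < min_words:
        return None
    # hw_number, label, and review text, tab-separated, newline-terminated.
    return f"{hw_number}\t{label}\t{text}\n"
```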

After filtering, we can do text preprocessing. We load the data into Hive with three fields: hw_number as INT, label as INT, and review as STRING. One important part of text processing is removing stop words so that they do not skew the results. Stop words occur frequently in natural language text without adding sentiment or meaning. This part was implemented in Hive:

[Screenshot: Hive script for stop word removal]
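The screenshot above shows the Hive implementation; the same filtering can be sketched in plain Python (the stop word list here is a tiny illustrative sample, not the full list we used):

```python
# Tiny illustrative stop word list; the real one is much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "this"}

def remove_stop_words(review, stop_words=STOP_WORDS):
    """Drop stop words from a whitespace-tokenized, lowercased review."""
    return " ".join(w for w in review.lower().split() if w not in stop_words)
```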

We started with some basic analysis: finding out which Spring 2017 homework assignment was perceived as the most positive and which as the most negative, on average across all students. The Hive script for the implementation is shown below.

[Screenshot: Hive script for per-homework sentiment averages]

We used Hive to group the reviews by homework number, then calculated the total number of positive and negative words used in each homework and sorted the assignments by those counts. Results:

[Screenshot: average positive/negative term counts per homework]

As the results show, homework 8, with an average of 2.951220 positive terms, was perceived as the most positive in the Spring 2017 semester on average across all students. In the same semester, homework 1, with an average of 3.813953 negative terms, was perceived as the most negative.
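The grouping and averaging can be sketched outside Hive in plain Python (the word lists here are hypothetical placeholders for the provided dictionaries):

```python
from collections import defaultdict

def avg_sentiment_terms(reviews, positive, negative):
    """reviews: iterable of (hw_number, review_text) pairs.
    Returns {hw_number: (avg_positive_terms, avg_negative_terms)}."""
    pos_total, neg_total, count = defaultdict(int), defaultdict(int), defaultdict(int)
    for hw, text in reviews:
        words = text.split()
        pos_total[hw] += sum(w in positive for w in words)
        neg_total[hw] += sum(w in negative for w in words)
        count[hw] += 1
    return {hw: (pos_total[hw] / count[hw], neg_total[hw] / count[hw]) for hw in count}
```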

For the next step, we tried to find the most frequently used positive and negative words. We implemented a word count program in MapReduce, and the results are:

[Screenshot: top positive and negative words by frequency]
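Our actual word count was a Hadoop MapReduce job; its map and reduce phases can be roughly sketched in Python as follows (the dictionary and input lines are illustrative):

```python
from collections import Counter
from itertools import chain

def mapper(line, wordlist):
    # Map phase: emit (word, 1) for every dictionary word in the line.
    return [(w, 1) for w in line.split() if w in wordlist]

def reducer(pairs):
    # Reduce phase: sum the emitted counts per word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

lines = ["pretty good homework", "good spark tutorial"]
counts = reducer(chain.from_iterable(mapper(l, {"good", "pretty"}) for l in lines))
```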

The results are reasonable. The top positive/negative words are colloquial terms commonly used in daily life to express sentiment. Considering that these homework reviews are not a mandatory part of each assignment, and students might not treat them as seriously as the formal assignment itself, the reviews are unlikely to be written in a formal register. Although the results make sense, there are still limitations: some words need to be interpreted in context, which means the positive and negative word lists are not accurate enough. For instance, “pretty” is listed as a positive word, but here “pretty” is used to mean “very” rather than “beautiful”, so it is not positive; in fact, it should be treated as a stop word like “very”. Another example is “problem” in the negative category: people mostly write “Problem 1” when “problem” occurs, which is not a sign of negativity, and the common expression “these problems” does not carry negative sentiment either. “Pig” here refers to the tool, with no negative sentiment implied. We bypass these context-dependent words and then extract the top positive and negative words again. After doing so, we have the new top word lists below, which make more sense than the ones above.

[Screenshot: revised top positive and negative word lists]

Besides term frequency, n-grams are also significant in sentiment analysis. They are a more informative way of analyzing text because they leverage the context of words.
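Extracting the n-grams themselves is straightforward (a sketch):

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```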

[Screenshot: extracted 2-grams and 3-grams]

To calculate the emotion score of these 2-grams and 3-grams, we assign a score to each word: a positive word gets 1, a negative word gets -1, and a neutral word gets 0. Then, within each n-gram, we sum the scores of the included words. As a result, we have positive–negative scores for the 2-grams (range: [-2, 2]) and 3-grams (range: [-3, 3]). After calculating these scores, we sorted the 2-grams and 3-grams by score; the results are listed below.

[Screenshots: 2-grams and 3-grams sorted by emotion score]
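The scoring rule above can be sketched as (the word lists are hypothetical):

```python
def emotion_score(ngram, positive, negative):
    """Sum word scores: +1 for a positive word, -1 for a negative word, 0 otherwise."""
    return sum(1 if w in positive else -1 if w in negative else 0 for w in ngram)
```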

The results above are reasonable since the listed words are the ones that people would use for common reviews like this, and consider that some words in the original review content were removed by the stop word filter, the existence of awkward combinations like [‘excited’, ‘good’, ‘sport’] is rational. The purpose of the n-gram model is to specify a certain contiguous sequence of n items from a provided text to avoid unnecessary errors or biases. Consequently, the est. frequency of 2 grams would be generally higher than that of 3 grams, which also supports our results.

Now we look for differences across semesters. To compare emotions across semesters, we sort the n-grams by emotion score from highest to lowest and count the number of n-grams at each score. For convenience, we only take the first 300 groups in score order for the comparison. Results are shown below:

[Screenshots: score distributions of 2-grams and 3-grams across semesters]

There is not much difference between the emotions across semesters. The counts of both 2-grams and 3-grams roughly follow a normal distribution: the number of groups decreases as the absolute value of the emotion score increases. The counts at each score are also quite similar to the corresponding counts in the other semesters.

After that, we want to identify the favorite topics among the homework assignments. First, we look at the difference between the emotions about the assignments in the first half of the semester (hw1-hw6) and those in the second half (hw7-hw10). The Hive script we wrote is shown below:

[Screenshot: Hive script comparing first-half and second-half homework sentiment]


[Screenshot: query results]

Then, for each semester, we subtract the average number of negative terms from the average number of positive terms for the corresponding homework assignments.

[Screenshot: positive-minus-negative scores per semester]

Although the numbers vary across semesters, it is easy to see that students preferred the homework assignments in the first half of the semester to those in the second half, since the first-half scores are higher than the second-half scores in all three listed semesters. In Spring 2016, the first-half score is around 0.395 higher than the second-half score; in Fall 2016, it is about 0.038 higher; and in Spring 2017, around 0.451 higher. We can therefore conclude that students' attitudes toward the homework assignments were generally consistent; the emotional difference in Fall 2016 was simply not as pronounced as in the other semesters.

As for the difference between the emotions about homework assignments focusing more on theory (hw1, hw2, and hw7) versus MapReduce (hw3-hw6, hw8) versus Spark and Pig (hw9 and hw10), we get the results listed in the table below.

[Screenshot: emotion scores by topic (theory vs. MapReduce vs. Spark & Pig)]

Differences among emotions are quite obvious in this comparison: the most popular topic in Spring 2016 was Spark & Pig, in Fall 2016 it was theory, and in Spring 2017 it was MapReduce. However, based on the overall score, the Spark & Pig topic was liked the most.

As the next step, we run all pre-processed labeled reviews through our program and compare the resulting predictions with the provided ground truth labels by computing the average accuracy.

The prediction system is composed of two major steps. First, we used Hive to transform the data into a word-count-per-review format, which provides the intermediate data for the later classification in Python. The Hive script is shown below.

[Screenshot: Hive script producing per-review word counts]

Before processing, the data looks like:

[Screenshot: data before Hive processing]

After Hive processing, the data is reformatted as:

[Screenshot: data after Hive processing]

Then we wrote the prediction system in Python. To make it more reasonable, we modified the positive and negative word lists (e.g., we deleted “problem” from the negative word list because, in most cases, “problem” refers to a question rather than describing the homework). We again use the positive and negative word lists, compute each word's frequency across the whole text corpus, and let that frequency serve as the score for the corresponding word: the higher the score, the more confidently that word indicates positive or negative sentiment. We then compute the total positive score and total negative score for each review, so the final total score = positive score – negative score. Besides this, we found some wrong data points in the homework reviews; for example, a review containing 0 negative words and 5 positive words was labeled as negative, so we also filtered out wrong data points like this. Finally, we make predictions with the model, and if a predicted label matches the raw label, we mark it as a correct prediction. The prediction metric is:

[Screenshot: prediction metric (score bounds per label)]

This metric is derived from normal distribution analysis.

[Screenshot: normal distribution fit of review scores per label]

We fit the normalized counts of homework reviews with different labels to normal distributions and pick the overlapping point as the bound with which we label reviews based on their scores. We count all the correct predictions and compute accuracy = correct predictions / total predictions.
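The frequency-weighted scoring, labeling, and accuracy steps described above can be sketched as follows (the word lists, frequencies, and score bounds are illustrative; our actual bounds came from the normal distribution fit):

```python
from collections import Counter

def review_score(words, positive, negative, corpus_freq):
    """Frequency-weighted score: each positive word adds its corpus-wide
    frequency, each negative word subtracts its frequency."""
    score = 0
    for w in words:
        if w in positive:
            score += corpus_freq[w]
        elif w in negative:
            score -= corpus_freq[w]
    return score

def predict_label(score, lower=-1, upper=1):
    # Bounds are illustrative; ours came from the normal-distribution overlap.
    if score > upper:
        return "positive"
    if score < lower:
        return "negative"
    return "neutral"

def accuracy(predictions, labels):
    # accuracy = correct predictions / total predictions
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)
```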

Using the above method, the average accuracy over 5 runs is 68.88%, and the highest accuracy is 74.41%. This is rather high considering the noise in the raw data; 70% accuracy is generally considered high because it is near human-level accuracy. Although we started with an accuracy of 58%, we implemented two effective improvements:

  • Bad point removal: the labels were collected long after the reviews were written, which means data providers may have made mistakes when labeling their reviews. This can be seen in review records that contain many more negative words than positive words yet are still labeled as positive. This is obviously human error, because anyone can judge with common sense that a review full of words like “struggling”, “pointless”, “dull”, and “frustrating” cannot be positive. After we removed these obviously mislabeled data points, the accuracy increased as expected.
  • Word list modification: the provided word lists, including stop words, negative words, and positive words, are too general to apply directly to the prediction. Some words like “pretty” should be moved from the positive word list to the stop word list: no one would describe these homework assignments as “beautiful”, and the only time students use this word is to mean “very”. Other words are frequently used but given too much weight; people write “problem 1”, “these problems”, and “pig”, but these carry no negative sentiment. By modifying the word lists and moving words to where they belong given the context, the accuracy increased substantially.

EMR Execution

This step applies the sentiment analysis program built above to text classification with 5 classes. We use real customer review data as the ground truth of sentiment: the ratings customers give a product are treated as the “true” sentiment of the reviews they wrote for it. Each record in the review data has the following format:

{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}

The data is in JSON format, and the two features that matter for sentiment analysis are "overall" and "reviewText": we predict the value of "overall" from the content of "reviewText". There are many datasets to choose from; here we used the following three to compare their runtimes:

reviews_Baby_5.json (160,792 reviews)

reviews_Musical_Instruments_5.json (10,261 reviews)

reviews_Patio_Lawn_and_Garden_5.json (13,272 reviews)
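Extracting the two relevant fields from each JSON record can be sketched as:

```python
import json

def extract_fields(json_line):
    """Pull the star rating and review text from one JSON review record."""
    record = json.loads(json_line)
    return record["overall"], record["reviewText"]
```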

These three datasets were chosen because they vary in size and category. The cloud execution process contains three major steps: read the data in JSON format and convert it into text files with Python; use Hive to remove stop words and calculate word counts for positive and negative words; then use Spark and Python on EMR to compute a score for each review and label it with a predicted overall rating. The prediction metric for different scores is:

[Screenshot: prediction metric mapping scores to ratings]

This prediction metric is derived from normal distribution analysis of the reviews with different labels. Using the Digital Music data as an example, we pick the overlapping point as the bound for our classification.
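The score-to-rating mapping can be sketched as follows (the bound values here are illustrative placeholders, not the ones read off our fitted distributions):

```python
def predict_overall(score, bounds=(-2.0, -0.5, 0.5, 2.0)):
    """Map a review's sentiment score to a 1-5 star prediction.
    bounds[i] is the upper score bound for (i + 1) stars."""
    for stars, bound in zip((1, 2, 3, 4), bounds):
        if score <= bound:
            return stars
    return 5
```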

[Screenshot: normal distribution fit for the Digital Music data]

Finally, we compare the predictions with the raw labels and compute the accuracy, obtaining the number of predictions, runtime, and accuracy for each dataset:

[Screenshot: number of predictions, runtime, and accuracy per dataset]

Larger datasets require longer runtimes, which is reasonable. The prediction program has an average accuracy ranging from 50% to 65%. This is acceptable because our sentiment analysis judges right or wrong by exact predicted values, meaning that “3” and “4” count as completely different classes, even though our implementation often produces predictions that are fairly close. The customers who wrote these reviews may also have different rating criteria in mind. For instance, some lenient customers will rate a product “5 stars” as long as they are not too disappointed or irritated by it, even if they are not very excited about it when writing the review. Some bad-tempered raters will give “1 star” even when they are only partially dissatisfied; one small flaw in a product may earn it “1 star” and a very negative review. These differing rating criteria skew the overall ratings and compromise the accuracy of our sentiment analysis predictions.

As for further development, the ratings given by users could be normalized by their own average ratings: a very lenient rater's ratings would be lowered and a bad-tempered rater's ratings would be raised. After handling overall ratings this way, we may have non-integer ratings, so to perform a better prediction we should also predict non-integer ratings based on the sentiment analysis. The classification metric above could be improved with machine learning methods: the performance of the prediction system can be evaluated by the total squared error over all predicted labels, and we can then use stochastic gradient descent (SGD) to optimize the model.
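The proposed normalization can be sketched as (the global average of 3.0 is an illustrative assumption):

```python
def normalize_rating(rating, user_avg, global_avg=3.0):
    """Shift a user's rating by the gap between their personal average
    and the global average, damping both lenient and harsh raters."""
    return rating - (user_avg - global_avg)
```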




Arthur Zhang

A master's student at WUSTL, Department of Electrical and Systems Engineering.
