The Application of MapReduce in Text Mining (TF-IDF & Co-Occurrence)

This post is about how MapReduce, a distributed computing technique from cloud computing, can help with text mining tasks on big data.

In text mining, the task is to analyze large, or even unbounded, corpora of documents. Text documents can be of great volume, and they are mostly unstructured. We need to measure the importance of each term in a document so that we can find the features that let us understand or classify the documents. This can be done by calculating TF-IDF (Term Frequency – Inverse Document Frequency), a term weighting function that assigns a score to each term in a document. Unlike plain term frequency, TF-IDF also considers the number of documents in which the term appears.

TF-IDF = tf * idf = tf * log(N/n)

in which tf is the number of times a term appears in a document, N is the total number of documents, and n is the number of documents that contain the term.
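For example (with made-up numbers): if a term appears 5 times in a document (tf = 5), the corpus has 1,000 documents (N = 1000), and the term occurs in 10 of them (n = 10), then TF-IDF = 5 * log(1000/10) = 5 * log(100) ≈ 23.0 using the natural log. A term that appears in every document scores tf * log(1) = 0, no matter how frequent it is.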

How do we calculate this with MapReduce?

It can be done with three jobs. The first job is a word count job that computes the term frequency for each (term, document) pair. The second job counts, for each term, the number of documents containing it, and the third computes the TF-IDF score. Obviously, some words carry no information about the topic of a document, such as “a”, “the”, and so on, so the NLP pipeline should pre-process the text by stemming (removing plurals, tense, possessives, etc.) and stop-word filtering (removing words like “a”, “the”, etc.).

Job 1: Compute Term Frequency:

Mapper: (documentId, contents) -> ((term, documentId), 1)

Reducer: ((term, documentId), [1, 1, …]) -> ((term, documentId), tf)
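To make the contracts concrete, here is a minimal sketch of Job 1 as plain Python generator functions. This is my own illustration, not code from any particular framework; the function names and the whitespace tokenization are assumptions.

```python
def tf_mapper(document_id, contents):
    # Emit ((term, documentId), 1) for every term occurrence.
    # Assumes contents has already been stemmed and stop-word filtered.
    for term in contents.split():
        yield (term, document_id), 1

def tf_reducer(key, counts):
    # key is (term, documentId); summing the 1s gives the term frequency.
    yield key, sum(counts)
```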

Job 2: Compute n (the number of documents that contain the term): 

Mapper: ((term, documentId), tf) -> (term, (documentId, tf, 1))

Reducer: (term, [(documentId, tf, 1), …]) -> ((term, documentId), (tf, n)) for each documentId

In this part, the reducer buffers the (documentId, tf) pairs for each term, which means the pairs may not fit in memory. (In fact, any time we need to buffer data, memory is often the bottleneck.) We can use a combiner to reduce the data volume, or write the intermediate data out to a file and use another MapReduce pass.
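A sketch of Job 2 in the same illustrative style (again my own assumption of the details). Note that the reducer materializes the whole value list before it can emit anything; that list is exactly the buffer described above.

```python
def n_mapper(key, tf):
    # Re-key by term alone so every document containing the term
    # meets in the same reducer.
    term, document_id = key
    yield term, (document_id, tf, 1)

def n_reducer(term, values):
    # Buffer the (documentId, tf, 1) triples: n must be known before
    # any output record can be written. This buffer is the memory risk.
    buffered = list(values)
    n = sum(one for _, _, one in buffered)
    for document_id, tf, _ in buffered:
        yield (term, document_id), (tf, n)
```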

Job 3: TF-IDF:

Mapper: ((term, documentId), (tf, n)) -> ((term, documentId), tf * log(N/n)), where the total document count N is supplied to the mapper, e.g., as a job parameter

Reducer: Identity Reducer

If we want the output sorted by documentId or term, we can use documentId or term alone as the key, so that the output is sorted during the Group & Sort phase.
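A sketch of Job 3, continuing the same style. How the mapper learns N is an assumption here; a module-level constant stands in for whatever mechanism (job parameter, configuration, a counter from an earlier pass) a real deployment would use.

```python
import math

N = 1000  # total number of documents; assumed to arrive as a job parameter

def tfidf_mapper(key, value):
    # key is (term, documentId); value is (tf, n).
    tf, n = value
    yield key, tf * math.log(N / n)

# The reducer is the identity, so none is needed here.
```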

Word Co-Occurrence

A single word alone usually does not carry enough semantic indication, so we should also consider phrases. Specifically, word co-occurrence measures the frequency with which two words appear close to (not just next to) each other in a document. There are two ways to implement a word-count-like algorithm for this with MapReduce.

Pairs Scheme:

Mapper: (documentId, content) -> ((term1, term2), 1) for each pair of terms that co-occur within a window

Reducer: ((term1, term2), [1, 1, …]) -> ((term1, term2), co-occurrence count)

It is easy to implement, but the communication cost is high, and the huge number of emitted pairs also makes the Group & Sort phase time-consuming.
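A sketch of the pairs scheme; the window of two positions on either side is my choice for illustration, not something the scheme prescribes.

```python
def pairs_mapper(document_id, content, window=2):
    # Emit ((term1, term2), 1) for every ordered pair of words that
    # appear within `window` positions of each other.
    words = content.split()
    for i, w in enumerate(words):
        neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        for neighbor in neighbors:
            yield (w, neighbor), 1

def pairs_reducer(pair, counts):
    # Sum the 1s to get the co-occurrence count for this pair.
    yield pair, sum(counts)
```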

Stripes Scheme:

Mapper: (documentId, list of words) -> for each word, create a hashtable wordMap<w, count>; for each word w close to it, increment w's entry in wordMap; emit (word, wordMap)

Reducer: (word, list of wordMaps) -> (word, finalMap), merging the maps by element-wise sum

This scheme is more efficient, but it is harder to implement. Also, we need to buffer all the neighboring words of a given word, so memory can again become a problem.
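A sketch of the stripes scheme, with the same assumed window. collections.Counter plays the role of the hashtable, and merging the stripes is an element-wise sum.

```python
from collections import Counter

def stripes_mapper(document_id, content, window=2):
    # For each word, emit one stripe: a map from neighboring word to count.
    words = content.split()
    for i, w in enumerate(words):
        neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        yield w, Counter(neighbors)

def stripes_reducer(word, word_maps):
    # Merge all stripes for this word by element-wise sum.
    # Every distinct neighbor of `word` must fit in this one Counter.
    final_map = Counter()
    for word_map in word_maps:
        final_map.update(word_map)
    yield word, dict(final_map)
```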
