This post is about how cloud computing technique MapReduce can help us with text mining jobs performed in big data cases.
In text mining, the task is to analyze large or even unlimited corpora of documents. Text documents can be of great volume, and they are mostly unstructured. The importance of the term in a document needs to be measured, so that the features we need to understand or classify the documents can be found. This can be done by calculating TF-IDF (Term Frequency – Inverse Document Frequency), which is a term weighting function that can assign a score for each term in the documents. Unlike just using term frequency, the TF-IDF also considers the number of documents in which the word appears.
TF-IDF = tf * idf = tf * log(N/n)
in which tf is number of times a term appears in a documents; N is the total number of documents; n is the number of documents that contain the term.
How to calculate that with MapReduce?
It can be done with 3 jobs. The first job is a word count job that calculates term frequency. The second job will count the word-document pairs for each word, and the third one calculates TF-IDF. Obviously, some words are not related to the topic of the documents, such as “a”, “the”, … The NLP system would pre-process the text data by stemming (removing plurals, tense, possessive etc.) and stop words filtering (removing words like “a”, “the” etc.).
Job 1: Term Frequency Computing:
Mapper: (documentId, contents) -> ((term, documentId),1)
Reducer: -> ((term, documentId),tf)
Job 2: Compute n (the number of documents that contain the term):
Mapper: ((term, documentId),tf) -> (term,(documentId,tf,1))
Reducer: -> ((term, documentId), (tf, n))
In this part, we buffer pairs counts, which means we may have the problem that pairs do not fit in memory. (Actually, any time when we need to buffer data, memory is often the bottleneck.) So we can use a combiner or write out intermediate data to a file and use another MapReduce pass.
Job 3: TF-IDF:
Mapper: ((term, documentId), (tf, n)) -> ((term, documentId), TF-IDF)
Reducer: Identity Reducer
If we want the output sorted by documentId or term, we can use documentId or term as the key, so that the output are sorted in the Group & Sort phase.
One word alone mostly cannot have enough semantic indications, we should also consider about phrases. To specify, word co-occurrence measures the frequency with which two words appear close (not just next) to each other in a document. There are two ways to implement a word-count-like algorithm for this with MapReduce.
Mapper: (documentId, content) -> ((term1,term2), 1)
Reducer: -> ((term1, term2), co-occurrence)
It is easy to implement, but the communication cost is large. The huge number of pairs will also make the group & sort phase time consuming.
Mapper: (documentId, list of words) -> create a hashtable wordMap<w,count>; for each w close to word add count to w entry in wordMap; emit (word, wordMap)
Reducer: (word, list of wordMaps) -> (word, finalMap)
This scheme is more efficient, but it is harder to implement. Also, we need to buffer all neighboring words of a given word. The memory can be a problem.