Algorithm’s Role in MapReduce

MapReduce programs are usually short considering the size of the data they process. In most cases a single job cannot take us where we need to be, so we chain multiple jobs into a workflow: the output of one job is fed into another job as input so that more processing can be performed. After working on several MapReduce assignments and projects, I noticed that most of these jobs can be done with very similar code; in most cases, the only thing we need to change is the interface. For instance, we need to change the way we read data when the input comprises a different number of fields or different data types. Technically speaking, such changes are data preprocessing, which means that what we essentially want from MapReduce can be classified into a handful of algorithms. A new MapReduce program written for a new case is often tantamount to one we have written before.

Sorting

MapReduce has a common workflow: Map -> Group&Sort -> Reduce. If the purpose is to sort, we can take advantage of the Group&Sort phase, during which the keys are sorted before being fed into reduce(). For instance, to sort the lines of a text file, the input (key, value) pairs are (line offset, text). To sort the text, we emit (text, _), which turns the value into the key. After grouping and sorting, we have something like {(k1, [_,_,_,_]), (k2, [_,_,_]), …}. The reducer then emits (key, _) once for each value, giving {k1, k1, k1, k1, k2, k2, k2, …} as output: the text ends up sorted because it served as the key in the Group&Sort phase. If we also want to remove duplicates, we emit each key only once; then no matter how many times a line appears in the file, it is output once. If we want to use multiple reducers, we must also implement a partition function that preserves the overall order, so that the reducer outputs can be concatenated correctly.
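Here is a minimal sketch of the sort job in Hadoop's Java API (the class names are my own; Text, NullWritable, and the Mapper/Reducer base classes are the standard org.apache.hadoop.mapreduce API). For the multi-reducer case, Hadoop ships TotalOrderPartitioner, which partitions by sampled key ranges so that the concatenated reducer outputs stay globally sorted.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: use each line as the key so the framework sorts it for us.
    class SortMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());      // emit (text, _)
        }
    }

    // Reducer: emit the key once per value, preserving duplicates.
    class SortReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable v : values) {
                context.write(line, NullWritable.get());  // emit (key, _)
            }
            // To deduplicate instead, drop the loop and write the key once.
        }
    }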

Sorting can be used to test the speed of a Hadoop cluster, since the only computation that matters happens in the Group&Sort phase. Because the mapper and reducer are trivial, sorting effectively measures the Hadoop framework's I/O. It can also measure the performance gained from enlarging the cluster: run and time the same sorting job before and after more nodes are added.

Indexing

An index lists the contents of a file/document; an inverted index maps contents back to the files/documents that contain them. Taking the inverted index algorithm as an example, the mapper's input is (byte offset, line of text) and its output is (word, filename); the reducer then collects, for each word, the set of files it appears in.
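A sketch of that pipeline in the same Java API (class names are mine; recovering the current file name through FileSplit is the standard idiom):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import java.util.TreeSet;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Mapper: emit (word, filename) for every word on the line.
    class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text filename = new Text();

        @Override
        protected void setup(Context context) {
            filename.set(((FileSplit) context.getInputSplit()).getPath().getName());
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer words = new StringTokenizer(line.toString());
            while (words.hasMoreTokens()) {
                context.write(new Text(words.nextToken()), filename);
            }
        }
    }

    // Reducer: collect the distinct file names for each word.
    class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> files, Context context)
                throws IOException, InterruptedException {
            TreeSet<String> distinct = new TreeSet<>();
            for (Text f : files) distinct.add(f.toString());
            context.write(word, new Text(String.join(", ", distinct)));
        }
    }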

Relational Algebra Operations

  • Selection: apply a condition C to each tuple and return only the tuples that satisfy it. Mapper: for each tuple t in R, if C(t) holds, emit (t, _). Reducer: pass (t, _) through.
  • Projection: keep only a subset S of the attributes of each tuple, producing a reduced tuple t'. Mapper: for each tuple t in R, emit (t', _). Reducer: receives (t', [_, _, …]); since projection can create duplicates, emit (t', _) only once.
  • Union, Intersection, and Difference: Mapper: for every tuple t in R1 and R2, emit (t, "R1") or (t, "R2"). Reducer: {Union: receives (t, ["R1", "R2"]), (t, ["R1"]), or (t, ["R2"]) -> emit (t, _)}, {Intersection: receives (t, ["R1", "R2"]) -> emit (t, _)}, {Difference (in R1 but not in R2): receives (t, ["R1"]) -> emit (t, _)}.
  • Grouping and Aggregation: given R(A, B), group all tuples by their value of A and aggregate the values of B. Mapper: for each tuple t = (a, bi), emit (a, bi). Reducer: receives (a, [b1, b2, …]) -> emit (a, ⊗i bi), where ⊗ is the aggregation operator (sum, count, max, …). A sketch appears after this list.
  • Natural Join: join two relations on a shared attribute, done either map-side or reduce-side. The basic idea of a map-side join is: load one relation into an in-memory hash table keyed by the join attribute, map over the other relation, and probe the hash table with each tuple's join key; on a hit, emit the joined tuple. This does not scale beyond what fits in memory. For a reduce-side join of R1(A, B) and R2(B, C): Mapper: for every tuple from R1 or R2, emit (b, ("R1", ai)) or (b, ("R2", ci)). Reducer: receives (b, [("R1", a1), ("R2", c1), ("R1", a2), ("R2", c2), …]) -> emit the cross product of the matching sides: (a1, b, c1), (a1, b, c2), (a2, b, c1), (a2, b, c2). A second sketch below shows this.
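To make the grouping-and-aggregation pattern concrete, here is a minimal sketch that sums B per value of A; the comma-separated "a,b" input layout and the class names are my assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: parse "a,b" and emit (a, b).
    class AggregateMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text tuple, Context context)
                throws IOException, InterruptedException {
            String[] fields = tuple.toString().split(",");   // assumed CSV layout
            context.write(new Text(fields[0]),
                          new LongWritable(Long.parseLong(fields[1])));
        }
    }

    // Reducer: receives (a, [b1, b2, ...]) and applies the aggregate, here a sum.
    class AggregateReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text a, Iterable<LongWritable> bs, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable b : bs) sum += b.get();
            context.write(a, new LongWritable(sum));
        }
    }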
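And a sketch of the reduce-side natural join of R1(A, B) and R2(B, C). I again assume comma-separated tuples, plus a hypothetical file-naming convention (r1*/r2* prefixes) that lets the mapper tell the two relations apart:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Mapper: tag each tuple with its relation and key it by the join attribute b.
    class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String tag;

        @Override
        protected void setup(Context context) {
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            tag = file.startsWith("r1") ? "R1" : "R2";   // assumed naming convention
        }

        @Override
        protected void map(LongWritable offset, Text tuple, Context context)
                throws IOException, InterruptedException {
            String[] f = tuple.toString().split(",");
            if (tag.equals("R1")) {
                context.write(new Text(f[1]), new Text("R1:" + f[0]));  // R1(a, b)
            } else {
                context.write(new Text(f[0]), new Text("R2:" + f[1]));  // R2(b, c)
            }
        }
    }

    // Reducer: separate the tagged tuples for each b, then emit their cross product.
    class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text b, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> as = new ArrayList<>();
            List<String> cs = new ArrayList<>();
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("R1:")) as.add(s.substring(3));
                else cs.add(s.substring(3));
            }
            for (String a : as)
                for (String c : cs)
                    context.write(b, new Text(a + "," + c));  // tuple (a, b, c)
        }
    }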

Although MapReduce is not the best solution for database queries and operations, these operations can serve as building blocks for more complicated jobs.

 

