Lab 2: Threads and Synchronization

Due: Oct 31, 2014
Web posted: Oct 17, 2014



Getting Started

The Assignment

Like the previous lab, you may work in groups.
  1. MapReduce using Threads and Semaphores
    In this assignment, we will use POSIX threads (pthreads) to implement a useful application based on the Producer-Consumer problem studied in class.


    MapReduce is a popular data processing framework used in cloud computing today. Companies such as Google, Yahoo, and many others use it to process large datasets. MapReduce processing consists of Mapper tasks and Reducer tasks that split the processing of data between them.

    In this assignment, you will implement a toy multi-threaded version of MapReduce. (Note: you do not need to understand how the actual MapReduce framework works to complete this assignment; just follow the instructions here.)

    The goal of this MapReduce assignment is to write a "parallel" (i.e., multi-threaded) application that constructs an inverted index on all words in a set of files. You will be provided with m text files named foo1.txt, foo2.txt, ..., foom.txt. Your objective is to construct a single inverted index from all of the text contained in these files. An inverted index is simply a hash table keyed on a word: the entry for a word tells you which files the word appeared in and on which line numbers.

    To do this, you will need to spawn m Map threads and n Reduce threads. Assume that m and n are specified as inputs to your program. Your code should prompt for the number of Map and Reduce threads, in that order. You can assume that a valid integer is provided by the user for each of these.
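The thread start-up described above can be sketched roughly as follows. This is a minimal illustration, not the required design: the `worker` struct, the `launch_threads` function, and the placeholder thread bodies are hypothetical names introduced here. In a full program, m and n would first be read from the user (for example with `scanf("%d", &m)`), in that order.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread record; `id` is the 1-based thread number. */
typedef struct {
    pthread_t tid;
    int id;
    int ran;   /* set by the placeholder body so the caller can check it */
} worker;

/* Placeholder bodies: a real Map thread would read foo<id>.txt and a real
 * Reduce thread would drain bounded buffer <id>. */
static void *map_thread(void *arg)    { ((worker *)arg)->ran = 1; return NULL; }
static void *reduce_thread(void *arg) { ((worker *)arg)->ran = 1; return NULL; }

/* Starts m Map threads and n Reduce threads, waits for all of them, and
 * returns how many thread bodies ran, or -1 on failure. */
int launch_threads(int m, int n)
{
    assert(m > 0 && n > 0);
    int total = m + n, started = 0, ran = 0;
    worker *w = calloc((size_t)total, sizeof *w);
    if (w == NULL)
        return -1;

    for (int i = 0; i < total; i++) {
        /* The first m records are Map threads, the rest are Reduce threads. */
        w[i].id = (i < m) ? i + 1 : i - m + 1;
        void *(*body)(void *) = (i < m) ? map_thread : reduce_thread;
        if (pthread_create(&w[i].tid, NULL, body, &w[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(w[i].tid, NULL);   /* join makes w[i].ran safe to read */
        ran += w[i].ran;
    }
    free(w);
    return (started == total) ? ran : -1;
}
```

Each thread writes only to its own `worker` record, so this sketch needs no locking; the shared buffers and counter described later do.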

    Each Map thread then reads one of the text files: Map thread 1 reads foo1.txt, Map thread 2 reads foo2.txt, and so on. Assume that each file contains exactly one word per line. Each Map thread reads words from its corresponding file. A hashing function is used to hash each word and compute an integer from 1 to n. If a word produces hash value i, it is then "sent" to Reduce thread i for the actual computation of the inverted index. This is done by inserting the word into the bounded buffer for the corresponding Reduce thread.
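One possible way to pick the input file and the target Reduce thread is sketched below. The assignment does not prescribe a particular hash function; the djb2 string hash is used here purely as an assumption, and `input_file_name` and `reducer_for_word` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the input file name for Map thread i (1-based) into out. */
void input_file_name(int i, char *out, size_t cap)
{
    assert(i >= 1);
    snprintf(out, cap, "foo%d.txt", i);
}

/* Maps a word to a Reduce thread number in the range [1, n].
 * Assumption: djb2 hash; any deterministic string hash would do. */
int reducer_for_word(const char *word, int n)
{
    assert(word != NULL && n > 0);
    unsigned long h = 5381;
    for (const unsigned char *p = (const unsigned char *)word; *p != '\0'; p++)
        h = h * 33 + *p;
    return (int)(h % (unsigned long)n) + 1;
}
```

Because the hash is deterministic, every occurrence of a given word lands at the same Reduce thread, no matter which Map thread read it.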

    Assume that there is one bounded buffer of size 10 for each Reduce thread. Map threads are the Producers and Reduce threads are the Consumers. Map threads insert words into one of the bounded buffers, and Reduce threads consume words from their buffer to compute the inverted index.
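A bounded buffer of this kind might look like the sketch below. It uses a mutex with two condition variables, which is one of the synchronization options the assignment permits; a semaphore-based version is equally valid, and this is not necessarily the design shown in lecture. The `item` layout (word plus file number and line number) and the `MAX_WORD` limit are assumptions made for illustration, on the reasoning that the Reduce thread needs to know where each word came from.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define BB_SLOTS 10      /* buffer size given in the assignment */
#define MAX_WORD 64      /* assumed upper bound on word length */

/* Assumed item layout: the word plus where it was seen. */
typedef struct {
    char word[MAX_WORD];
    int file_no;         /* k for fooK.txt */
    int line_no;
} item;

typedef struct {
    item slots[BB_SLOTS];
    int head, tail, count;
    pthread_mutex_t lock;               /* protects all fields above */
    pthread_cond_t not_full, not_empty;
} bounded_buffer;

void bb_init(bounded_buffer *bb)
{
    bb->head = bb->tail = bb->count = 0;
    pthread_mutex_init(&bb->lock, NULL);
    pthread_cond_init(&bb->not_full, NULL);
    pthread_cond_init(&bb->not_empty, NULL);
}

/* Producer side: blocks while the buffer is full. */
void bb_put(bounded_buffer *bb, const char *word, int file_no, int line_no)
{
    assert(strlen(word) < MAX_WORD);
    pthread_mutex_lock(&bb->lock);
    while (bb->count == BB_SLOTS)
        pthread_cond_wait(&bb->not_full, &bb->lock);
    item *it = &bb->slots[bb->tail];
    strcpy(it->word, word);
    it->file_no = file_no;
    it->line_no = line_no;
    bb->tail = (bb->tail + 1) % BB_SLOTS;
    bb->count++;
    pthread_cond_signal(&bb->not_empty);
    pthread_mutex_unlock(&bb->lock);
}

/* Consumer side: blocks while the buffer is empty. */
item bb_get(bounded_buffer *bb)
{
    pthread_mutex_lock(&bb->lock);
    while (bb->count == 0)
        pthread_cond_wait(&bb->not_empty, &bb->lock);
    item it = bb->slots[bb->head];
    bb->head = (bb->head + 1) % BB_SLOTS;
    bb->count--;
    pthread_cond_signal(&bb->not_full);
    pthread_mutex_unlock(&bb->lock);
    return it;
}
```

The `while` loops around `pthread_cond_wait` matter: a woken thread must re-check the condition, since another thread may have changed the buffer first or the wakeup may be spurious.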

    Each Reduce thread works as follows: it repeatedly consumes (i.e., reads) the next word from its bounded buffer and adds it to the inverted index. For each word, the inverted index should contain the file name(s) where the word occurred and the line numbers in those files. For example, the word pickle may occur in files foo2.txt and foo5.txt on line numbers 2 and 7, respectively. The inverted index should then contain pickle: (foo2.txt: 2), (foo5.txt: 7).

    Each Reduce thread can use a simple data structure to compute the inverted index. You may use a search tree or a hash map to track words in the inverted index.
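As one illustration of such a data structure, the sketch below uses a small chained hash table in which each word owns a list of (file, line) postings. All names here (`inverted_index`, `index_add`, `index_format`, `NBUCKETS`) are hypothetical, and the table size is arbitrary. Postings are kept in arrival order, which is an assumption; the assignment does not say whether they must be sorted.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 101   /* arbitrary table size chosen for this sketch */

typedef struct posting {
    int file_no;               /* k for fooK.txt */
    int line_no;
    struct posting *next;
} posting;

typedef struct entry {
    char *word;
    posting *first, *last;     /* postings in arrival order */
    struct entry *next;        /* next entry in the same bucket */
} entry;

typedef struct { entry *buckets[NBUCKETS]; } inverted_index;

static unsigned bucket_of(const char *w)
{
    unsigned h = 0;
    while (*w != '\0')
        h = h * 31 + (unsigned char)*w++;
    return h % NBUCKETS;
}

static entry *index_find(const inverted_index *ix, const char *word)
{
    entry *e = ix->buckets[bucket_of(word)];
    while (e != NULL && strcmp(e->word, word) != 0)
        e = e->next;
    return e;
}

/* Records one occurrence of word; aborts if memory runs out. */
void index_add(inverted_index *ix, const char *word, int file_no, int line_no)
{
    assert(ix != NULL && word != NULL);
    entry *e = index_find(ix, word);
    if (e == NULL) {                       /* first time this word is seen */
        unsigned b = bucket_of(word);
        e = calloc(1, sizeof *e);
        if (e == NULL) abort();
        e->word = malloc(strlen(word) + 1);
        if (e->word == NULL) abort();
        strcpy(e->word, word);
        e->next = ix->buckets[b];
        ix->buckets[b] = e;
    }
    posting *p = malloc(sizeof *p);
    if (p == NULL) abort();
    p->file_no = file_no;
    p->line_no = line_no;
    p->next = NULL;
    if (e->last != NULL) e->last->next = p; else e->first = p;
    e->last = p;
}

/* Formats "word: (fooK.txt: L), ..." into out. Returns 0 if word is absent. */
int index_format(const inverted_index *ix, const char *word, char *out, size_t cap)
{
    const entry *e = index_find(ix, word);
    if (e == NULL || cap == 0)
        return 0;
    size_t used = (size_t)snprintf(out, cap, "%s:", word);
    for (const posting *p = e->first; p != NULL && used < cap; p = p->next)
        used += (size_t)snprintf(out + used, cap - used, "%s (foo%d.txt: %d)",
                                 p == e->first ? "" : ",", p->file_no, p->line_no);
    return 1;
}
```

Since a given word always hashes to the same Reduce thread, each Reduce thread can keep a private index like this one without locking it.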

    You will need to implement your own Bounded Buffer as in Lecture 9. You may use pthreads locks, semaphores or pthread condition variables for synchronization.

    Finally, you will also need to maintain a shared counter variable initialized to m. When each Map thread terminates (i.e., is done reading its file), it decrements the counter by 1. If the counter equals zero and bounded buffer i is empty, then Reduce thread i prints out the inverted index that it has computed. Be sure to protect this shared counter using synchronization.
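A mutex-protected counter along these lines is sketched below. The `map_counter` type and its functions are hypothetical names; a pthread mutex is assumed here, though any of the permitted synchronization primitives could protect the counter.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared counter of Map threads that are still running. */
typedef struct {
    int remaining;          /* starts at m */
    pthread_mutex_t lock;   /* protects remaining */
} map_counter;

void counter_init(map_counter *c, int m)
{
    assert(m > 0);
    c->remaining = m;
    pthread_mutex_init(&c->lock, NULL);
}

/* Called once by each Map thread when it finishes its file.
 * Returns the value of the counter after the decrement. */
int counter_decrement(map_counter *c)
{
    pthread_mutex_lock(&c->lock);
    int left = --c->remaining;
    pthread_mutex_unlock(&c->lock);
    return left;
}

/* Returns nonzero once every Map thread has finished. */
int counter_is_zero(map_counter *c)
{
    pthread_mutex_lock(&c->lock);
    int done = (c->remaining == 0);
    pthread_mutex_unlock(&c->lock);
    return done;
}
```

One design point to think about: a Reduce thread that is blocked waiting on an empty buffer will not notice the counter reaching zero on its own. A possible approach, offered only as a suggestion, is for the Map thread that drives the counter to zero to wake the waiting Reduce threads so they can re-check both conditions.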

    Helpful references: If you are not familiar with pthreads, be sure to review this reference on Pthreads programming. A brief tutorial is also available.

    While this assignment describes everything you need to know about inverted indices and MapReduce to complete the lab, you can learn more about inverted indices on Wikipedia and about MapReduce here.


How to Turn in Lab 2

All of the following files must be submitted on Moodle as a zip file to get full credit for this assignment.
  1. Your zip file should contain a copy of all source files.
  2. Your zip file should contain a README file identifying your lab partner (if you have one) and containing an outline of what you did for the assignment. It should also explain and motivate your design choices. Explain the design of your MapReduce program and how synchronization works. Keep it short and to the point.

  3. If your implementation does not work, you should also document the problems in the README, preferably with your explanation of why it does not work and how you would solve it if you had more time. Of course, you should also comment your code. We can't give you credit for something we don't understand!
  4. Finally, your zip file should contain a file showing sample output from your programs.
  5. Individual Group Assessment (for students working in groups only)
  6. A percentage of your lab grade will come from your participation in this project as a member of your group.
    What you need to turn in (each person individually):
    Include in your zip file a copy of your assessment of the division of labor in the group in the format shown below.  For a 2 person group, if you give yourself a 50% and your partner gives you a 50%, you will get the full credit for group participation.  If you give your partner a 40% and your partner gives himself or herself a 40%, he or she will get fewer points for group participation.  And so on...
  7. Note: We will strictly enforce policies on cheating. Remember that we routinely run similarity checking programs on your solutions to detect cheating. Please make sure you turn in your own work.

    You should be very careful about using code snippets you find on the Internet. In general, your code should be your own. It is OK to read tutorials on the web and use these concepts in your assignment. Blind use of code from the web is strictly disallowed. Feel free to check with us if you have questions on this policy. Be sure to document in your README file any Internet sources or tutorials you used to complete the assignment.

  8. Late Policy: Please refer to the course syllabus for the late policy on lab assignments. This late policy will be strictly enforced. Please start early so that you can submit the assignment on time.