Lab 2: Threads and Semaphores

Due: Thu, March 4, 2010, 18:00
Web posted: Wed, Feb 17, 2010, 09:00 hrs



Getting Started

The Assignment

For this assignment and all subsequent ones, you may work in groups.
  1. MapReduce using Threads and Semaphores
    In this assignment, we will use Java threads and semaphores to implement a useful application based on the Producer-Consumer problem studied in class.


    MapReduce is a popular data processing framework used in cloud computing today. In this assignment, you will implement a toy multi-threaded version of MapReduce. (Note: you do not need to understand how the actual MapReduce framework works to complete this assignment; just follow the instructions here.)

    Specifically, we will use MapReduce with Java threads to implement a parallel version of the InvertedIndex application you wrote for Lab 1. You will be provided with m text files named foo1.txt, foo2.txt, ..., foom.txt. Your objective is to construct a single inverted index from all of the text contained in these files.

    To do this, you will need to spawn m Map threads and n Reduce threads. Assume that m and n are specified as inputs to your program. Your code should prompt for the number of Map and Reduce threads, in that order. You can assume that a valid integer is provided by the user for each of these.
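
    As a minimal sketch of the input step, the two prompts could be read with java.util.Scanner as shown below. The class name ThreadCounts, its field names, and the prompt wording are placeholders chosen for illustration; they are not required by the assignment.

```java
import java.util.Scanner;

// Hypothetical holder for the two thread counts read from the user.
class ThreadCounts {
    final int mapThreads;    // m
    final int reduceThreads; // n

    ThreadCounts(int mapThreads, int reduceThreads) {
        this.mapThreads = mapThreads;
        this.reduceThreads = reduceThreads;
    }

    // Prompts for the number of Map threads first, then Reduce threads.
    // No validation is done, since valid integers may be assumed.
    static ThreadCounts read(Scanner in) {
        System.out.print("Number of Map threads: ");
        int m = in.nextInt();
        System.out.print("Number of Reduce threads: ");
        int n = in.nextInt();
        return new ThreadCounts(m, n);
    }
}
```

    In your program you would typically call ThreadCounts.read(new Scanner(System.in)) once at startup.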

    Each Map thread then reads one of the text files: Map thread 1 reads foo1.txt, Map thread 2 reads foo2.txt, and so on. Each Map thread parses the words from its file. A hashing function is used to hash each word to an integer from 1 to n. If a word produces hash value i, it is "sent" to Reduce thread i for the actual computation of the inverted index. This is done by inserting the word into the bounded buffer for the corresponding Reduce thread.
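
    The assignment does not prescribe a particular hashing function. One possible sketch, assuming you reuse String.hashCode(), is shown below; the class and method names are hypothetical. Math.floorMod is used instead of the % operator because hashCode() can be negative, and % would then give a negative result.

```java
// Hypothetical helper that maps a word to a Reduce thread number.
class ReducerRouting {
    // Returns an integer in the range 1..n for the given word.
    // The same word always maps to the same Reduce thread.
    static int reducerFor(String word, int n) {
        return Math.floorMod(word.hashCode(), n) + 1;
    }

    public static void main(String[] args) {
        int n = 4; // example number of Reduce threads
        for (String word : new String[] {"pickle", "thread", "semaphore"}) {
            System.out.println(word + " -> Reduce thread " + reducerFor(word, n));
        }
    }
}
```

    Because every occurrence of a word goes to the same Reduce thread, each Reduce thread ends up holding the complete entry for the words it receives.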

    Assume that there is one bounded buffer of size 10 for each Reduce thread. Map threads are the Producers and Reduce threads are the Consumers. Map threads insert words into one of the bounded buffers, and Reduce threads consume words from their buffer to compute the inverted index.

    Each Reduce thread works as follows: it repeatedly consumes (i.e., reads) the next word from its bounded buffer and adds it to the inverted index. For each word, the inverted index should contain the name(s) of the file(s) where the word occurred and the line numbers in those files. For example, the word pickle may occur in files foo2.txt and foo5.txt on line numbers 2 and 7, respectively. The inverted index should then contain pickle: (foo2.txt: 2), (foo5.txt: 7).

    The Reduce thread can use a simple data structure to compute the inverted index. As in Lab 1, you may use a search tree or a HashMap to track words in the inverted index.
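
    To make the expected output format concrete, here is one possible sketch of the per-thread index using a TreeMap (a search tree). The class name InvertedIndexSketch and its methods are hypothetical, and the sketch assumes that the file name and line number reach the Reduce thread together with each word; how you pass them along is your design choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical index kept privately by one Reduce thread.
class InvertedIndexSketch {
    // word -> postings of the form "(file: line)", in arrival order
    private final Map<String, List<String>> postings = new TreeMap<>();

    // Records one occurrence of a word.
    void add(String word, String fileName, int lineNumber) {
        postings.computeIfAbsent(word, k -> new ArrayList<>())
                .add("(" + fileName + ": " + lineNumber + ")");
    }

    // Formats one entry, e.g. "pickle: (foo2.txt: 2), (foo5.txt: 7)".
    String format(String word) {
        List<String> entries = postings.get(word);
        if (entries == null) {
            return word + ":";
        }
        return word + ": " + String.join(", ", entries);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("pickle", "foo2.txt", 2);
        index.add("pickle", "foo5.txt", 7);
        System.out.println(index.format("pickle"));
    }
}
```

    Since each Reduce thread owns its index and no other thread touches it, the index itself needs no synchronization in this sketch.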

    You will need to implement your own Bounded Buffer as in Lecture 8, page 19. For synchronization, you may use only Java's Semaphore class (java.util.concurrent.Semaphore). You are allowed to reuse your file parsing code from Lab 1.
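
    For orientation, the following is a sketch of the classic three-semaphore bounded buffer built only on java.util.concurrent.Semaphore. It is not necessarily identical to the version in the lecture notes, and the class and method names are our own choices for illustration.

```java
import java.util.concurrent.Semaphore;

// Sketch of a fixed-capacity FIFO buffer guarded only by semaphores.
class BoundedBuffer<T> {
    private final Object[] slots;
    private int head = 0;             // next slot to remove from
    private int tail = 0;             // next slot to insert into
    private final Semaphore empty;    // counts free slots
    private final Semaphore full = new Semaphore(0);  // counts filled slots
    private final Semaphore mutex = new Semaphore(1); // protects head/tail/slots

    BoundedBuffer(int capacity) {
        slots = new Object[capacity];
        empty = new Semaphore(capacity);
    }

    // Called by producers; blocks while the buffer is full.
    void insert(T item) throws InterruptedException {
        empty.acquire();
        mutex.acquire();
        slots[tail] = item;
        tail = (tail + 1) % slots.length;
        mutex.release();
        full.release();
    }

    // Called by the consumer; blocks while the buffer is empty.
    @SuppressWarnings("unchecked")
    T remove() throws InterruptedException {
        full.acquire();
        mutex.acquire();
        T item = (T) slots[head];
        slots[head] = null;
        head = (head + 1) % slots.length;
        mutex.release();
        empty.release();
        return item;
    }

    public static void main(String[] args) throws InterruptedException {
        BoundedBuffer<String> buffer = new BoundedBuffer<>(10);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 25; i++) {
                    buffer.insert("word" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        for (int i = 0; i < 25; i++) {
            System.out.println(buffer.remove());
        }
        producer.join();
    }
}
```

    Note the order of acquisition: the counting semaphore is acquired before the mutex. Acquiring the mutex first could leave a thread holding it while blocked on a full or empty buffer, which would deadlock.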

    Finally, you will also need to maintain a shared counter variable initialized to m. When each Map thread terminates (i.e., is done reading its file), it will decrement the counter by 1. Once the counter equals zero and bounded buffer i is empty, Reduce thread i will print out the inverted index that it has computed. Be sure to protect this shared counter with a mutex semaphore.
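
    A minimal sketch of such a counter, protected by a binary semaphore, might look like the following. The class and method names are hypothetical, and acquireUninterruptibly() is used here only to keep the sketch free of checked exceptions; acquire() works equally well.

```java
import java.util.concurrent.Semaphore;

// Hypothetical shared counter of Map threads that are still running.
class MapCounter {
    private int remaining;                            // starts at m
    private final Semaphore mutex = new Semaphore(1); // binary semaphore

    MapCounter(int m) {
        remaining = m;
    }

    // Called once by each Map thread when it has finished its file.
    void mapperDone() {
        mutex.acquireUninterruptibly();
        try {
            remaining--;
        } finally {
            mutex.release();
        }
    }

    // Called by Reduce threads to check whether all Map threads are done.
    boolean allMappersDone() {
        mutex.acquireUninterruptibly();
        try {
            return remaining == 0;
        } finally {
            mutex.release();
        }
    }

    public static void main(String[] args) {
        MapCounter counter = new MapCounter(2);
        counter.mapperDone();
        System.out.println(counter.allMappersDone());
        counter.mapperDone();
        System.out.println(counter.allMappersDone());
    }
}
```

    Keep in mind that a Reduce thread blocked on an empty buffer will not notice the counter reaching zero on its own, so your design needs some way to handle that case.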


How to Turn in Lab 2

All of the following files must be submitted on SPARK as a zip file to get full credit for this assignment.
  1. Your zip file should contain a copy of all source files.
  2. Your zip file should contain a copy of a README file identifying your lab partner (if you have one) and containing an outline of what you did for the assignment. It should also explain and motivate your design choices. Keep it short and to the point.

  3. If your implementation does not work, you should also document the problems in the README, preferably with your explanation of why it does not work and how you would solve it if you had more time. Of course, you should also comment your code. We can't give you credit for something we don't understand!
  4. Finally, your zip file should contain a file showing sample output from your programs.
  5. Individual Group Assessment (for students working in groups only)
  6. A percentage of your lab grade will come from your participation in this project as a member of your group.
    What you need to turn in (each person individually):
    Include in your zip file a copy of your assessment of the division of labor in the group in the format shown below. For a two-person group, if you give yourself 50% and your partner gives you 50%, you will get full credit for group participation. If you give your partner 40% and your partner gives himself or herself 40%, he or she will get fewer points for group participation. And so on.
  7. Note: We will strictly enforce policies on cheating. Remember that we routinely run similarity checking programs on your solutions to detect cheating. Please make sure you turn in your own work.

    You should be very careful about using code snippets you find on the Internet. In general, your code should be your own. It is OK to read tutorials on the web and use these concepts in your assignment. Blind use of code from the web is strictly disallowed. Feel free to check with us if you have questions on this policy. And be sure to document any Internet sources or tutorials you have used to complete the assignment in your README file.

  8. Late Policy: Please refer to the course syllabus for the late policy on lab assignments. This late policy will be strictly enforced. Please start early so that you can submit the assignment on time.