CS 677 Distributed Operating Systems

Fall 2009

Programming Assignment 3
Replication, Caching and Consistency in Whisper.com

Due: Dec 10, 2009



The Problem:

After numerous recent outages at Twitter, the designers of Whisper have decided to build a replicated architecture for their system. Whisper has become popular in recent months, so replication will be used for both scalability and fault tolerance.

This project is based on project 2. This assignment has two parts.


Part I. Replication and Caching

Create a Dispatcher that forwards each request from the clients to ONE of the two front-end tier servers.
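One possible forwarding policy is round-robin. The sketch below is only an illustration under that assumption: the assignment does not prescribe a policy, and the names Dispatcher, make_front_end and dispatch are hypothetical. The front-end servers are stood in for by plain functions rather than remote processes.

```python
import itertools


class Dispatcher:
    """Forwards each request to exactly one front-end server.

    Round-robin is an assumed policy; any policy that picks one
    server per request would satisfy the description above.
    """

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def dispatch(self, request):
        server = next(self._cycle)  # pick the next server in turn
        return server(request)


def make_front_end(name):
    # Hypothetical stand-in for a front-end server process.
    def handle(request):
        return f"{name} handled {request}"
    return handle


dispatcher = Dispatcher([make_front_end("front1"), make_front_end("front2")])
replies = [dispatcher.dispatch(f"req{i}") for i in range(4)]
```

With two servers, consecutive requests alternate between front1 and front2, which is also an easy property to check in a load-balancing test.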

Each front end server caches results of recent retrieve queries. When a new retrieve request comes in, it first checks the cache to determine if all whisps are cached locally. If so, the request is answered locally. If not, it is sent to the database for further processing. The results returned by the database are cached for future retrieve queries.

Cache consistency needs to be addressed whenever a database entry is updated by post requests. We will also assume that each topic is now moderated, and inappropriate whisp posts are periodically removed by a moderator process.

Implement a Server Push cache consistency protocol, where the backend database sends a notify message to the front-end caches upon receiving a post message. The notify message tells the front end server that cached messages are incomplete (i.e., there are new post messages at the database that are not cached by it).

We assume that the moderator process periodically deletes a small number of randomly chosen whisp messages from the database. Upon deletion, your server-push cache consistency protocol must send an invalidate message to the various caches, asking them to invalidate the corresponding whisps.
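The two server-push messages described above (notify on post, invalidate on deletion) can be sketched as follows. This is a minimal in-process illustration, not a required design: the class and method names (FrontEndCache, PushDatabase, moderate_delete) are hypothetical, and direct method calls stand in for whatever messaging mechanism you use between tiers.

```python
class FrontEndCache:
    """Hypothetical front-end cache that reacts to pushed messages."""

    def __init__(self):
        self.entries = {}        # whisp_id -> (topic, text)
        self.incomplete = set()  # topics known to have uncached posts

    def notify(self, topic):
        # The database received a post: cached whisps for this topic
        # are now incomplete.
        self.incomplete.add(topic)

    def invalidate(self, whisp_id):
        # The moderator deleted this whisp: drop it if cached.
        self.entries.pop(whisp_id, None)


class PushDatabase:
    """Hypothetical back-end that pushes notify/invalidate messages."""

    def __init__(self, caches):
        self.caches = list(caches)
        self.whisps = {}
        self._next_id = 1  # internally generated unique ID

    def post(self, topic, text):
        whisp_id = self._next_id
        self._next_id += 1
        self.whisps[whisp_id] = (topic, text)
        for cache in self.caches:
            cache.notify(topic)
        return whisp_id

    def moderate_delete(self, whisp_id):
        if self.whisps.pop(whisp_id, None) is not None:
            for cache in self.caches:
                cache.invalidate(whisp_id)


caches = [FrontEndCache(), FrontEndCache()]
db = PushDatabase(caches)
wid = db.post("#os", "hello")
caches[0].entries[wid] = db.whisps[wid]  # pretend front end 0 cached it
db.moderate_delete(wid)
```

After the deletion, the whisp is gone from both the database and the cache that held it, while both caches still remember that "#os" has posts they have not fetched.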

You may assume that each whisp posting has an internally generated unique ID for easy tracking within the system.

Finally, the back-end tier needs to implement a poll(t1,t2,#topic) command, which allows a front-end server to query the number of whisps on the given topic that are stored in the database and were posted in the interval (t1,t2). This command is used to determine whether all matching entries in the database are also present in the cache. This check must be performed before deciding whether to answer a retrieve request using locally cached whisp posts.
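The poll-before-answer check can be sketched as below. This is an illustration under stated assumptions, not a reference solution: Database and FrontEnd are hypothetical in-process stand-ins, timestamps are plain numbers, and the interval bounds are treated as inclusive (the assignment text does not say which).

```python
class Database:
    """Hypothetical back-end storing (whisp_id, topic, timestamp, text)."""

    def __init__(self):
        self.whisps = []
        self._next_id = 1

    def post(self, topic, timestamp, text):
        self.whisps.append((self._next_id, topic, timestamp, text))
        self._next_id += 1

    def retrieve(self, t1, t2, topic):
        # Assumption: interval bounds are inclusive.
        return [w for w in self.whisps if w[1] == topic and t1 <= w[2] <= t2]

    def poll(self, t1, t2, topic):
        # Number of stored whisps on the topic posted in the interval.
        return len(self.retrieve(t1, t2, topic))


class FrontEnd:
    """Hypothetical front end that polls before answering from cache."""

    def __init__(self, db):
        self.db = db
        self.cache = {}  # whisp_id -> whisp tuple
        self.hits = 0
        self.misses = 0

    def retrieve(self, t1, t2, topic):
        cached = [w for w in self.cache.values()
                  if w[1] == topic and t1 <= w[2] <= t2]
        if len(cached) == self.db.poll(t1, t2, topic):
            self.hits += 1          # cache is complete: answer locally
            return sorted(cached)
        self.misses += 1            # cache is incomplete: ask the database
        fresh = self.db.retrieve(t1, t2, topic)
        for w in fresh:
            self.cache[w[0]] = w    # cache results for future queries
        return fresh


db = Database()
fe = FrontEnd(db)
db.post("#os", 10, "hello")
db.post("#os", 20, "world")
first = fe.retrieve(0, 100, "#os")   # miss: nothing cached yet
second = fe.retrieve(0, 100, "#os")  # hit: counts match
db.post("#os", 30, "new")
third = fe.retrieve(0, 100, "#os")   # miss: database has one more whisp
```

Comparing counts alone only detects missing posts; it relies on the invalidate messages to remove deleted whisps from the cache, so the two mechanisms need to be used together.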


Part II. Crash Fault Tolerance

Next, we will also replicate the backend databases. To do so, create another dispatcher and two database tier replicas. One of the database replicas is designated as the master, and the other one is the slave (i.e., backup server). The new Dispatcher will forward the requests from each front-end server to the current master.

Assume that the master (or the slave) can crash at any time. We are only interested in crash failures in this assignment. If the master fails, the slave must detect it and take over as the new master.
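The assignment leaves the failure detection mechanism up to you. One common choice is a heartbeat with a timeout, sketched below purely as an assumption: BackupMonitor and its methods are hypothetical names, and the current time is passed in explicitly so the logic can be exercised without real clocks or sockets.

```python
class BackupMonitor:
    """Hypothetical slave-side detector based on heartbeat timeouts."""

    def __init__(self, timeout):
        self.timeout = timeout       # seconds of silence tolerated
        self.last_heartbeat = 0.0
        self.role = "slave"

    def on_heartbeat(self, now):
        # A heartbeat from the master arrived at time `now`.
        # Stepping down immediately is a simplification; a real system
        # would first let the recovered master resynchronize.
        self.last_heartbeat = now
        self.role = "slave"

    def check(self, now):
        # Take over if the master has been silent for too long.
        if now - self.last_heartbeat > self.timeout:
            self.role = "master"
        return self.role


monitor = BackupMonitor(timeout=3.0)
monitor.on_heartbeat(now=1.0)
role_while_alive = monitor.check(now=2.0)   # within the timeout
role_after_crash = monitor.check(now=5.5)   # 4.5 s of silence
monitor.on_heartbeat(now=6.0)               # master is back
role_after_return = monitor.check(now=6.5)
```

The timeout value trades detection speed against false takeovers when the master is merely slow, which is worth discussing in the design document.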

Once the failed master recovers, it takes over as the master again, and the backup server becomes a slave once again. Since the master could have been down for an arbitrary period of time, you need to implement a recovery protocol in which the master synchronizes its out-of-date database with the backup server. This is done by fetching all post messages that the master missed while it was down. The recovery protocol may use the poll command to determine, for each topic, whether posts have been missed, and then synchronize state by using the retrieve command.
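The poll-then-retrieve recovery described above can be sketched as follows. This is a minimal illustration, assuming in-process replicas, numeric timestamps and inclusive interval bounds; Replica and recover are hypothetical names. It only fetches missed posts and does not handle deletions that happened during the outage.

```python
class Replica:
    """Hypothetical database replica keyed by unique whisp ID."""

    def __init__(self):
        self.whisps = {}  # whisp_id -> (topic, timestamp, text)

    def store(self, whisp_id, topic, timestamp, text):
        self.whisps[whisp_id] = (topic, timestamp, text)

    def retrieve(self, t1, t2, topic):
        return {i: w for i, w in self.whisps.items()
                if w[0] == topic and t1 <= w[1] <= t2}

    def poll(self, t1, t2, topic):
        return len(self.retrieve(t1, t2, topic))


def recover(stale, up_to_date, topics, t1, t2):
    """Copy posts the stale replica missed; return how many were fetched."""
    fetched = 0
    for topic in topics:
        # Poll first: skip topics whose counts already agree.
        if stale.poll(t1, t2, topic) == up_to_date.poll(t1, t2, topic):
            continue
        # Counts differ: retrieve and add the whisps not yet present.
        for whisp_id, whisp in up_to_date.retrieve(t1, t2, topic).items():
            if whisp_id not in stale.whisps:
                stale.whisps[whisp_id] = whisp
                fetched += 1
    return fetched


master, backup = Replica(), Replica()
for replica in (master, backup):
    replica.store(1, "#os", 10, "before the crash")
backup.store(2, "#os", 20, "posted while master was down")
backup.store(3, "#db", 30, "also missed")
missed = recover(master, backup, ["#os", "#db"], 0, 100)
```

The same function can be reused with the roles swapped when it is the slave that has been down.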

During normal operation, when both the master and the slave are functioning correctly, the dispatcher sends all requests to the master only. Any request that modifies the database state (e.g., follow, retrieve, unsubscribe) is forwarded by the master to the slave, so that the slave remains in lock-step with the master at all times.
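The forwarding of state-modifying requests can be sketched as below. This is an illustration only: DatabaseReplica and handle are hypothetical names, the set of state-changing operations is passed in as a parameter rather than fixed, and a direct method call stands in for the message sent from master to slave.

```python
class DatabaseReplica:
    """Hypothetical replica that logs the state-changing requests it applies."""

    def __init__(self, state_changing_ops, backup=None):
        self.state_changing_ops = set(state_changing_ops)
        self.backup = backup  # the slave, when this replica is the master
        self.log = []

    def handle(self, op, *args):
        if op not in self.state_changing_ops:
            return "read-only"           # answered by the master alone
        self.log.append((op, args))      # apply locally
        if self.backup is not None:
            self.backup.handle(op, *args)  # forward to keep slave in step
        return "applied"


ops = {"follow", "unsubscribe"}  # assumed subset of the command set
slave = DatabaseReplica(ops)
master = DatabaseReplica(ops, backup=slave)
master.handle("follow", "alice", "#os")
master.handle("poll", 0, 100, "#os")       # does not change state
master.handle("unsubscribe", "alice", "#os")
```

After these calls the two logs are identical, which is the lock-step property the paragraph above asks for.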

It is also possible for the slave to crash at any time. In this case, the master simply waits for the slave to recover and asks the slave to run the above recovery protocol to resynchronize its state.

Requirements:


B. Evaluation and Measurement

  1. Compute the average response time (post/retrieve) of your replicated system.
  2. Design a test to show whether your dispatcher balances the workload among the two front end servers.
  3. Design a test to show that the cache consistency protocol at the front-end tier works correctly.
  4. Design a test to show that failures of the master and the slave are handled properly in your system.

    Make necessary plots to support your conclusions.
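For the response-time measurement, one simple approach is to time each request on the client side and average the samples. The sketch below assumes that approach; measure is a hypothetical helper, and the timed operation here is a local placeholder where your client's post or retrieve call would go.

```python
import statistics
import time


def measure(operation, repetitions):
    """Run `operation` repeatedly; return (mean latency in seconds, samples)."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples


# Placeholder workload standing in for a post or retrieve request.
average, samples = measure(lambda: sum(range(1000)), repetitions=50)
```

Keeping the individual samples, not just the mean, makes it easier to produce the plots requested above.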


C. What you will submit

  • When you have finished implementing the complete assignment as described above, you will submit your solution in the form of printouts.

  • Each program must work correctly and be documented. You should hand in:
    1. A copy of the output generated by running your program. Print informative messages showing caching, cache consistency, failure detection, recovery, etc.
    2. A separate (typed) document of approximately two pages describing the overall program design, a description of "how it works", and design tradeoffs considered and made. Also describe possible improvements and extensions to your program (and sketch how they might be made).
    3. A program listing containing in-line documentation.
    4. A separate description of the tests you ran on your program to convince yourself that it is indeed correct. Also describe any cases for which your program is known not to work correctly.
    5. Performance results.
    6. A readme file about how to run your code on EDLAB machines.
    7. After you turn in everything listed above, please also send the TA an email giving the location of your files.
  • Let us not waste a lot of trees. So, if any of the above turn out to be large, just save the relevant information in a file, leave it on your EDLAB account and submit the name of the file.

D. Grading policy for all programming assignments

    1. Program Listing
        works correctly ------------- 50%
        in-line documentation -------- 15%
    2. Design Document
        quality of design ------------ 15%
        understandability of doc ------- 10%
    3. Thoroughness of test cases ---------- 10%
    4. Grades for late programs will be lowered 12 points per day late.