CS 677 Operating Systems Project 3

CS 677 Distributed Operating Systems

Spring 2005

Programming Assignment 3: A Fault-tolerant Banking System

Due: May 12th (Thursday)

(for off-campus students: 14 days after viewing Lecture 23)

You may work in groups of two for this lab assignment.
A link to a FAQ for this project. The link will be updated to answer common questions about the assignment.
Here are some useful references for this project.

1 The Problem

In this programming assignment you will implement a Fault-tolerant Banking System.
The assignment uses concepts of fault tolerance, resynchronization using logs, centralized locking and encryption.

A pictorial representation of the system is shown in the figure below.

Figure 1: Distributed Fault-tolerant Online Banking System and Components
The system has two important components:

Replicated Database Servers (this is similar to Assignment 2)
The bank database of accounts (and corresponding information of each) is replicated on both servers.
Coordinator (similar to Assignment 2 but no load balancing)
The coordinator acts as the interface to the banking database. Each client sends its request to the coordinator which in turn forwards the request to one of the servers to perform the desired action. Additionally, the coordinator also fetches any response from the server and sends it back to the client.

2 Functionalities of the System:

Replicated Database Servers
- Replication:
  All accounts of the bank are replicated on both servers. For the assignment, assume that all accounts are already created and that the client operations are deposit, withdraw and balance check. Account information must be stored and updated in a file (not in memory), because we are going to test for fault tolerance: servers fail, wake up, and need to restore the state left by previous operations. For example, if the system started with account# 101 holding $100 and the balance was $150 when server 1 crashed, then on wakeup server 1 should know that the balance of account# 101 is $150, unless server 2 updated it in the meantime (which is dealt with using the logs at both servers).
- Logging:
  The server maintains a log for each operation performed, which can be of the form
```
Operation#   Account#   Balance
-------------------------------
1            101        2000
2            102        5000
3            105        500
.
.
.
```
  The log must be stored/updated in a file for persistence.
  Additionally, all account information is stored in files and hence all update operations also modify records in the account information file.
- Resynchronization and Wakeup:
  When a server fails and wakes up, it needs to resynchronize with the other replica in order to restore a consistent state for each account.
  On wakeup, the server sends an ALIVE message to the coordinator to signal its wakeup.
  Next, the server sends a resynchronize command to the other server, along with the last operation number in its log. The server that had not failed then looks up its log and sends all [account #, balance] tuples that the failed server has not seen. The operation number sent by the failed server gives the starting point for the log replay. The live server sends all such log tuples and also the results of any operations that were being performed when the resynchronize command arrived.
  The assignment assumes that only one server can fail at a time, and hence at least one server always services requests and keeps its log up to date.
  Once the resynchronization is complete, the server can flush (delete) its log, as both servers are consistent. The server that had failed, now being in a consistent state, also sends a RESYNCH-DONE message to the coordinator.
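
As an illustration of the logging and resynchronization steps described above, here is a minimal sketch in Python. It is not a required design: the class name `OperationLog`, the function `resynchronize`, the one-record-per-line file format and the in-memory `accounts` dictionary (standing in for the account information file) are all assumptions made for this example, and the two "servers" live in one process rather than exchanging messages.

```python
import os
import tempfile


class OperationLog:
    """Append-only operation log kept in a text file (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def append(self, op_no, account, balance):
        # One record per line: "<operation#> <account#> <balance>".
        with open(self.path, "a") as f:
            f.write(f"{op_no} {account} {balance}\n")
            f.flush()
            os.fsync(f.fileno())  # push the record to disk before continuing

    def entries(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [tuple(int(x) for x in line.split())
                    for line in f if line.strip()]

    def last_op(self):
        entries = self.entries()
        return entries[-1][0] if entries else 0

    def since(self, op_no):
        # Records with an operation number greater than op_no.
        return [e for e in self.entries() if e[0] > op_no]


def resynchronize(recovering_log, recovering_accounts, live_log):
    """Replay on the recovering server every record it has not seen."""
    missed = live_log.since(recovering_log.last_op())
    for op_no, account, balance in missed:
        # The log stores the resulting balance, so replay is an overwrite.
        recovering_accounts[account] = balance
        recovering_log.append(op_no, account, balance)
    return missed


# Demo: server 1 crashed after operation 1; server 2 kept serving requests.
tmp = tempfile.mkdtemp()
live = OperationLog(os.path.join(tmp, "server2.log"))
failed = OperationLog(os.path.join(tmp, "server1.log"))
for log in (live, failed):
    log.append(1, 101, 2000)
live.append(2, 102, 5000)
live.append(3, 101, 2150)

accounts = {101: 2000, 102: 4000}  # state of server 1 at the time of the crash
missed = resynchronize(failed, accounts, live)
print(missed)
print(accounts)
```

Because each log record carries the resulting balance rather than the amount deposited or withdrawn, replaying a record twice is harmless, which keeps the replay logic simple.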
Coordinator
Each client request is sent to the coordinator, which in turn forwards the request to both servers, or to the one server that has not failed.
- Account-level locking:
  For each request that reaches the coordinator, the coordinator locks the account that the operation is going to update (withdraw/deposit operations). So, if two operations for the same account arrive at the coordinator simultaneously, one of them will be queued until the first one finishes and its lock is released.
  Each server signals the end of an operation with an OP-DONE message, which can be of the form [DONE, Account#]. On receiving OP-DONE messages from all awake servers, the coordinator releases the lock on the account.
- HeartBeat Messages:
  The coordinator uses periodic heartbeat messages to determine the state of each server. Based on the number of alive servers (1 or 2 in this case), it waits for the corresponding number of OP-DONE messages before releasing the lock. Additionally, the periodic heartbeat messages will indicate whether a server failed after the lock was acquired. In this case, the coordinator only waits for a single OP-DONE message from the alive server.
- Resynchronization: While resynchronization is in progress, no requests are forwarded by the coordinator. Resynchronization starts with an ALIVE message from a server and ends with a RESYNCH-DONE message.
- Client Response:
  The coordinator forwards the result of the operation to the client after the operation has completed and the locks have been released.
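
The account-level locking described above could be organized roughly as follows. This is only a sketch under several assumptions: the class `AccountLockTable` and its method names are hypothetical, requests are plain strings, the coordinator processes messages one at a time (so no thread synchronization is shown), and at least one server is always alive.

```python
from collections import defaultdict, deque


class AccountLockTable:
    """Per-account lock with a FIFO queue of waiting requests (hypothetical)."""

    def __init__(self):
        self.owed = {}                     # account -> servers still owing OP-DONE
        self.waiting = defaultdict(deque)  # account -> queued requests

    def submit(self, account, request, alive_servers):
        """Return the request if it may be forwarded now, else queue it."""
        if account in self.owed:
            self.waiting[account].append(request)
            return None
        # Lock the account: one OP-DONE is expected from every alive server.
        self.owed[account] = set(alive_servers)
        return request

    def op_done(self, account, server, alive_servers):
        """Handle [DONE, account] from a server.

        Returns the next queued request to forward once the lock is released,
        or None if the lock is still held or nothing is waiting.
        """
        owed = self.owed.get(account)
        if owed is None:
            return None
        owed.discard(server)
        if owed:
            return None  # still waiting for another server
        del self.owed[account]  # all OP-DONEs received: release the lock
        if self.waiting[account]:
            nxt = self.waiting[account].popleft()
            self.owed[account] = set(alive_servers)  # re-lock for next request
            return nxt
        return None


# Demo: two updates to account 101 arrive back to back.
table = AccountLockTable()
servers = {"s1", "s2"}
first = table.submit(101, "deposit 101 50", servers)    # forwarded at once
second = table.submit(101, "withdraw 101 20", servers)  # queued behind first
after_one = table.op_done(101, "s1", servers)           # lock still held
after_two = table.op_done(101, "s2", servers)           # lock released
print(first, second, after_one, after_two)
```

Requests for different accounts never wait on each other in this sketch, since each account has its own entry in the table.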

Some things to keep in mind:

Assume each server has the same file with all records and account information at startup. The records in the file get updated on each update operation.
To keep the replication simple, assume clients do not create new accounts and only query or update existing accounts.
The system should be fault tolerant, meaning no operation should fail or make the system inconsistent due to failures.

3 Evaluation and Measurement

Correctness
Demonstrate that your system works correctly according to requirements stated in the description and functionality of the system. In particular:

Show that the bank database is distributed and replicated, and that operations are sent to both servers or to a single server (depending on failures).
Demonstrate that the account-level locking mechanism works correctly, i.e., show (via output snapshots) that simultaneous updates to an account are handled correctly by the locking mechanism.
Demonstrate the logging mechanism: how the log is updated on operations and how it is maintained.
Demonstrate the proper functioning of the resynchronization step.
- between the 2 servers
- no requests being sent by coordinator
Demonstrate the working of heartbeat messages by showing that the coordinator initially waits for 2 OP-DONE messages, but after a heartbeat discovers a server failure it waits for only 1 OP-DONE message.
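
One possible way to structure the heartbeat check for the last item is sketched below. The names `HeartbeatMonitor` and `still_owed`, the timeout value and the use of explicit timestamps instead of a real clock are assumptions made to keep the example deterministic. It models heartbeats as messages received from the servers, which is only one of several reasonable designs.

```python
class HeartbeatMonitor:
    """Tracks when each server was last heard from (hypothetical helper)."""

    def __init__(self, servers, timeout):
        self.timeout = timeout
        self.last_seen = {s: 0.0 for s in servers}

    def heartbeat(self, server, now):
        # Record a heartbeat received from `server` at time `now` (seconds).
        self.last_seen[server] = now

    def alive(self, now):
        # A server is considered failed once it has been silent for longer
        # than the timeout.
        return {s for s, t in self.last_seen.items()
                if now - t <= self.timeout}


def still_owed(owed, monitor, now):
    """Servers whose OP-DONE the coordinator should still wait for."""
    return owed & monitor.alive(now)


# Demo: both servers are up when the account is locked, then s2 goes silent.
monitor = HeartbeatMonitor(["s1", "s2"], timeout=3.0)
monitor.heartbeat("s1", now=1.0)
monitor.heartbeat("s2", now=1.0)

owed = {"s1", "s2"}                          # lock taken with 2 alive servers
before = still_owed(owed, monitor, now=2.0)  # waits for 2 OP-DONE messages

monitor.heartbeat("s1", now=4.0)             # s1 keeps reporting, s2 does not
after = still_owed(owed, monitor, now=5.0)   # waits for 1 OP-DONE message
print(len(before), len(after))
```

Printing the size of the owed set before and after the missed heartbeat is one simple way to produce the output snapshot this item asks for.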

Evaluation
Additionally, experiment with your system to measure its performance in different scenarios and test conditions.
Design and present results of your own experiments to demonstrate the characteristics of the system. A few examples are:

Measure the average time for requests when both servers are on and there is no queuing at the coordinator.
Measure the average time for requests when both servers are on and there is queuing at the coordinator (i.e., requests are queued due to account-level locking).
Keeping request rate of each client constant, vary the number of clients and measure the latency of each request.
Keep number of clients constant, but vary request rate to measure latency of each request.
Measure how much time resynchronization requires, by measuring the time between the ALIVE and RESYNCH-DONE messages at the coordinator.
How does this change with log size?

It is important that you describe the results of your experiment and not just describe what the experiment did. Please state what the experiment demonstrates or what you expected and what was seen etc.
These are guidelines only, so be creative in what can be evaluated and measured as part of your experiments to test the system.
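
For the latency experiments, a small client-side timing harness along the following lines may be a useful starting point. The functions `measure_latencies` and `summarize` are hypothetical, and `fake_send` merely stands in for whatever call your client uses to send a request to the coordinator and wait for the reply.

```python
import statistics
import time


def measure_latencies(send_request, requests):
    """Time each request from just before it is sent until the reply returns."""
    latencies = []
    for req in requests:
        start = time.perf_counter()
        send_request(req)  # assumed to block until the coordinator replies
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "max": max(latencies),
    }


def fake_send(request):
    # Placeholder for a real round trip to the coordinator.
    time.sleep(0.001)


latencies = measure_latencies(fake_send, ["balance 101"] * 5)
summary = summarize(latencies)
print(summary)
```

Running the same harness while varying the number of clients or the request rate gives the per-request latencies needed for the experiments listed above.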

4 What you will submit

When you have finished implementing the complete assignment as described above, put all the code in a separate directory in your edlab account (/cs677/project3).

You are required to submit your solution in the form of printouts (please attach only the relevant outputs that demonstrate your points and the functionality; DO NOT print out entire output logs and source code).

Each program must work correctly and be documented. You should hand in:

Outputs generated by running your program. (in EdLab account)
Outputs to demonstrate correct working of the system. (in EdLab account and Printout)
This is important, as it will show that your system works according to the requirements.
A separate (typed) document of approximately two pages describing the overall program design, a description of "how it works", and design tradeoffs considered and made. Describe clearly how each system is designed and implemented. Also describe possible improvements and extensions to your program (and sketch how they might be made). (Edlab and Printout)
Prepare a list of design considerations you made while designing your system and describe each briefly. This is similar to the design considerations discussed in class of the Email system on the last slide of Lecture 2.
(in Edlab and Printout)
A program listing containing in-line documentation. (in Edlab account)
Instructions to compile and run the code from 677/project3. (in Edlab account)
A separate description of the tests you ran on your program to convince yourself that it is indeed correct. Also describe any cases for which your program is known not to work correctly. (in Edlab account and Printout)
Performance results to test scalability and performance parameters. (in Edlab and Printout)

Let us not waste a lot of trees. So, if any of the above turn out to be large, just save the relevant information in a file, leave it on your EDLAB account and submit the name of the file.

5 Grading policy for all programming assignments

Grading:

Program Listing

works correctly ------------- 50%
in-line documentation -------- 15%

Design Document

quality of design ------------ 15%
understandability of doc ------- 10%

Thoroughness of evaluation ---------- 10%

Grades for late programs will be lowered 12 points per day late.