Spring 2005
Programming Assignment 3: A Fault-tolerant Banking System
Due: May 12th (Thursday)
(for off-campus students - 14 days after viewing Lecture 23)
- You may work in groups of two for this lab assignment.
- A link to a FAQ for this project.
The link will be updated to answer common questions about the assignment.
- Here are some useful
references for this project.
1 The Problem
In this programming assignment you will implement a Fault-tolerant Banking System.
The assignment uses concepts of fault-tolerance, resynchronization using logs,
centralized locking and encryption.
A pictorial representation of the system is shown in the figure below.
Figure 1: Distributed Fault-tolerant Online Banking System and Components
The system has two important components:
- Replicated Database Servers (this is similar to Assignment 2)
The bank database of accounts (and corresponding information of each) is replicated
on both servers.
- Coordinator (similar to Assignment 2 but no load balancing)
The coordinator acts as the interface to the banking database.
Each client sends its request to the coordinator which in turn forwards
the request to one of the servers to perform the desired action.
Additionally, the coordinator also fetches any response from the server
and sends it back to the client.
2 Functionalities of the System
- Replicated Database Servers
- Coordinator
Each client request is sent to the coordinator, which in turn forwards
the request to both servers, or only to the server that has not failed.
- Account-level locking:
For each request that reaches the coordinator, the coordinator locks the account
that the operation is going to update (withdraw/deposit operations).
So, if two operations for the same account arrive at the coordinator simultaneously,
one of them will be queued until the first one finishes and its lock is released.
Each server signals the end of an operation with an OP-DONE message,
which can be of the form [DONE, Account#]. On receiving OP-DONE messages from
all awake servers, the coordinator releases the lock on the account.
- HeartBeat Messages:
The coordinator uses periodic heartbeat messages to determine the state of
each server. Based on the number of live servers (1 or 2 in this case), it waits for
the corresponding number of OP-DONE messages before releasing the lock.
Additionally, the periodic heartbeat messages will indicate whether a server failed
after the lock was acquired. In this case, the coordinator waits only for a single
OP-DONE message from the live server.
- Resynchronization:
When resynchronization is in progress, no requests are forwarded by the coordinator.
Resynchronization starts with an ALIVE message from a server
and ends with a RESYNCH-DONE message.
- Client Response:
The coordinator forwards the result of the operation to the client after the
operation has completed and the locks have been released.
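The account-level locking described above can be sketched as a small lock table kept at the coordinator. This is only one possible shape, not a required design: the class name AccountLockTable, its method names, and the server names "S1" and "S2" are all assumptions made for illustration. The table records which servers still owe an OP-DONE for each locked account and keeps a FIFO queue of requests waiting on that account.

```python
from collections import deque


class AccountLockTable:
    """Hypothetical coordinator-side lock table with one FIFO queue per account."""

    def __init__(self):
        self.awaited = {}  # account -> servers whose OP-DONE is still pending
        self.queued = {}   # account -> requests waiting for the lock

    def submit(self, account, request, alive_servers):
        """Lock the account and return True, or queue the request and return False."""
        if account in self.awaited:
            self.queued.setdefault(account, deque()).append(request)
            return False
        self.awaited[account] = set(alive_servers)
        return True

    def op_done(self, account, server, alive_servers):
        """Handle [DONE, account] from one server.

        Returns (released, next_request): released is True once every awaited
        server has answered; next_request is the queued request that now holds
        the lock, or None if nothing was waiting.
        """
        pending = self.awaited.get(account)
        if pending is None:
            return (False, None)  # no lock held; ignore a stray OP-DONE
        pending.discard(server)
        if pending:
            return (False, None)  # still waiting for the other server
        del self.awaited[account]  # lock released
        waiting = self.queued.get(account)
        if waiting:
            next_request = waiting.popleft()
            self.awaited[account] = set(alive_servers)  # lock handed over
            return (True, next_request)
        return (True, None)


table = AccountLockTable()
servers = ["S1", "S2"]
print(table.submit(101, "deposit 50", servers))   # True: forwarded immediately
print(table.submit(101, "withdraw 20", servers))  # False: queued behind the deposit
print(table.op_done(101, "S1", servers))          # (False, None)
print(table.op_done(101, "S2", servers))          # (True, 'withdraw 20')
```

Passing the current set of live servers into each call keeps the table independent of how failures are detected; a real coordinator would also need thread safety, which this sketch omits.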
Some things to keep in mind:
- Assume each server has the same file with all records and account information
at startup. The records in the file get updated on each update operation.
- To keep the replication simple, assume clients do not create new accounts
and only query or update existing accounts.
- The system should be fault tolerant, meaning no operation should fail
or make the system inconsistent due to failures.
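The assignment does not prescribe a log format, so the following is only a sketch of how log-based resynchronization might look. It assumes each update is logged as a tuple (sequence number, account, signed amount) and that the recovering server remembers the last sequence number it applied; the function names entries_after and replay are hypothetical.

```python
def entries_after(log, last_applied_seq):
    """On the live server: select the log entries the recovering server missed."""
    return [entry for entry in log if entry[0] > last_applied_seq]


def replay(balances, entries):
    """On the recovering server: apply the missed updates in log order.

    Returns the sequence number of the last entry applied, or None if
    there was nothing to replay.
    """
    for _seq, account, delta in entries:
        balances[account] += delta  # accounts always exist; none are created
    return entries[-1][0] if entries else None


# The live server processed three updates; the other server failed after the first.
live_log = [(1, 101, +50), (2, 102, -20), (3, 101, -10)]
live_balances = {101: 140, 102: 180}
stale_balances = {101: 150, 102: 200}

missed = entries_after(live_log, last_applied_seq=1)
last_seq = replay(stale_balances, missed)
print(last_seq, stale_balances == live_balances)  # 3 True
```

Because the coordinator forwards no requests between ALIVE and RESYNCH-DONE, the live server's log does not grow while the missed entries are being replayed.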
3 Evaluation and Measurement
Correctness
Demonstrate that your system works correctly according to requirements stated in the
description and functionality of the system. In particular:
- Show that the bank database is distributed and replicated, and that operations are sent
to both servers or to a single server (depending on failures).
- Demonstrate that the account-level locking mechanism works correctly.
That is, show (via output snapshots) that simultaneous updates to an account
are handled correctly by the locking mechanism.
- Demonstrate the logging mechanism: how the log is updated on operations
and how it is maintained.
- Demonstrate the proper functioning of the resynchronization step.
- between the 2 servers
- no requests being sent by coordinator
- Demonstrate the working of heartbeat messages by showing that the coordinator
initially waits for 2 OP-DONE messages but, once a heartbeat reveals a server failure,
waits for only 1 OP-DONE message.
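The heartbeat scenario in the last item can be illustrated with a small failure detector. This is a sketch under stated assumptions: the class HeartbeatMonitor, the helper still_awaited, the 3-second timeout, and the server names are all hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical failure detector: a server is alive if it was heard from recently."""

    def __init__(self, servers, timeout):
        self.timeout = timeout
        self.last_seen = {server: None for server in servers}

    def record(self, server, now):
        """Note that a heartbeat reply from `server` arrived at time `now`."""
        self.last_seen[server] = now

    def alive(self, now):
        """Return the servers whose last heartbeat is no older than the timeout."""
        return {
            server
            for server, seen in self.last_seen.items()
            if seen is not None and now - seen <= self.timeout
        }


def still_awaited(awaited, monitor, now):
    """Drop failed servers from the set of servers that still owe an OP-DONE."""
    return awaited & monitor.alive(now)


monitor = HeartbeatMonitor(["S1", "S2"], timeout=3.0)
monitor.record("S1", now=0.0)
monitor.record("S2", now=0.0)

awaited = {"S1", "S2"}             # lock taken: 2 OP-DONE messages expected
monitor.record("S1", now=2.0)      # S2 stops answering after t = 0
awaited = still_awaited(awaited, monitor, now=4.0)
print(sorted(awaited))             # ['S1']: only 1 OP-DONE is now expected
```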
Evaluation
Additionally, experiment with your system to measure its performance in different
scenarios and test conditions.
Design and present results of your own experiments to demonstrate the characteristics
of the system. A few examples are:
- Measure the average time for requests when both servers are on
and there is no queuing at the coordinator.
- Measure the average time for requests when both servers are on
and requests are queuing at the coordinator (i.e., queued due to account-level locking).
- Keeping request rate of each client constant, vary the number of clients and
measure the latency of each request.
- Keep number of clients constant, but vary request rate to measure latency
of each request.
- Measure how much time resynchronization requires by measuring the time between
the ALIVE and RESYNCH-DONE messages at the coordinator.
How does this change with log size?
It is important that you describe the results of your experiments and not just
what each experiment did. Please state what each experiment demonstrates, what you
expected, and what you observed.
These are guidelines only, so be creative in what can be evaluated and measured as
part of your experiments to test the system.
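For the latency experiments, a small timing harness on the client side is usually enough. The sketch below is one way to do it, assuming a hypothetical blocking function send_request that returns once the coordinator's response arrives; here it is replaced by a stand-in so the example runs on its own.

```python
import statistics
import time


def measure_latencies(send_request, requests):
    """Time each request from send to response; returns latencies in seconds."""
    latencies = []
    for request in requests:
        start = time.perf_counter()
        send_request(request)  # assumed to block until the response arrives
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce raw latencies to the figures typically reported in a write-up."""
    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "max": max(latencies),
    }


def fake_send(request):
    """Stand-in for a real client call; sleeps to simulate a round trip."""
    time.sleep(0.001)


results = summarize(measure_latencies(fake_send, range(10)))
print(results["count"])  # 10
```

Running the same harness while varying the number of clients, the request rate, or the amount of contention on a single account yields the data points for the experiments listed above.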
4 What you will submit
When you have finished implementing the complete assignment as
described above, put all the code in a separate directory in your EdLab account
(/cs677/project3).
You are required to submit your solution in the form of
printouts (please attach only relevant outputs that support your points and demonstrate functionality;
DO NOT print out entire output logs and source code).
Each program must work correctly and be documented. You
should hand in:
- Outputs generated by running your program. (in EdLab account)
- Outputs to demonstrate correct working of the system. (in EdLab account and Printout)
This is important, as it will show that your system works according to the requirements.
- A separate (typed) document of approximately two pages describing
the overall program design, a description of "how it works", and design
tradeoffs considered and made.
Describe clearly how each system is designed and implemented.
Also describe possible improvements and
extensions to your program (and sketch how they might be made). (Edlab and Printout)
- Prepare a list of design considerations you made while designing your system and
describe each briefly. This is similar to the design considerations for the Email
system discussed in class (last slide of Lecture 2).
(in Edlab and Printout)
- A program listing containing in-line documentation. (in Edlab account)
- Instructions to compile and run the code from 677/project3. (in Edlab account)
- A separate description of the tests you ran on your program to convince
yourself that it is indeed correct. Also describe any cases for which your
program is known not to work correctly. (in Edlab account and Printout)
- Performance results to test scalability and performance parameters. (in Edlab and Printout)
Let us not waste a lot of trees. So, if any of the above turn out
to be large, just save the relevant information in a file, leave it on your
EDLAB account and submit the name of the file.
5 Grading policy for all programming assignments
Grading:
- Program Listing
- works correctly ------------- 50%
- in-line documentation -------- 15%
- Design Document
- quality of design ------------ 15%
- understandability of doc ------- 10%
- Thoroughness of evaluation ------- 10%
Grades for late programs will be lowered 12 points per day
late.