Project 3

Scalability and Replication for Fault Tolerance

1 The problem

The third programming project addresses different methods of handling scalability and fault tolerance in a sensor network. In this project you are encouraged to be creative. Only the bare requirements of the system are specified, so many design decisions are left to you. Clearly document the reasoning behind your decisions, including their pros and cons.

The project has two parts.

1. Minimum Depth Tree

In this part you are asked to build a network as shown below.

[Figure: a tree of sinks rooted at the source, with 3 sinks polling the source directly.]

This is a completely Pull-based network. There is only one source/sensor in the network. This source supplies data to a large number of sinks.

For improved scalability, the sensor is polled directly by only a small number of sinks, k (e.g., 3 sinks, as shown in the figure). The data is then propagated down the tree to the remaining sinks in the network. The data at the source is updated at a fixed interval.

When a new sink, I, wants to subscribe to the system, I asks the source, IBM, for a subscription. If the source already has k sinks subscribed to it, the source replies with a list of the k sinks that it directly sends data to. I then traverses the tree and finds the first node, S, that has fewer than k children. I becomes a child of S and polls S for data.

Each sink has an associated refresh frequency. When the new sink I joins S, S may have to update its polling frequency so that it polls at least as often as the most demanding of itself and its children (that is, at the shortest of its own and its children’s refresh intervals). This change may have to propagate up the tree.
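The upward propagation described above can be sketched as follows. This is only an illustration, not a required design: it assumes that each sink's refresh requirement is expressed as an interval in seconds, and the names Node, required_interval and attach are hypothetical.

```python
from typing import List, Optional


class Node:
    """A sink in the tree (hypothetical class, for illustration only)."""

    def __init__(self, name: str, refresh_interval: float) -> None:
        self.name = name
        self.refresh_interval = refresh_interval  # seconds between refreshes this sink needs
        self.poll_interval = refresh_interval     # how often this sink polls its parent
        self.parent: Optional["Node"] = None      # None means the sink polls the source
        self.children: List["Node"] = []


def required_interval(node: Node) -> float:
    """Shortest interval among the node's own need and its children's polling."""
    return min([node.refresh_interval] + [c.poll_interval for c in node.children])


def attach(parent: Node, child: Node) -> List[str]:
    """Attach child to parent and push any shorter polling interval up the tree.

    Returns the names of the nodes whose polling interval changed.
    """
    child.parent = parent
    parent.children.append(child)
    changed: List[str] = []
    node: Optional[Node] = parent
    while node is not None:
        needed = required_interval(node)
        if needed >= node.poll_interval:
            break  # this node, and therefore its ancestors, already poll often enough
        node.poll_interval = needed
        changed.append(node.name)
        node = node.parent
    return changed
```

The loop stops at the first ancestor that already polls often enough, so a join only sends update messages as far up the tree as it has to.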

You may implement the above network in any way that you like. I may obtain the address of the source either from a directory server or by using a broadcast query. You may use breadth-first search on the sinks to find the first sink, S, with fewer than k children.
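A minimal sketch of the breadth-first search suggested above. It assumes the tree is available as an in-memory dictionary mapping each sink to its list of children; in a real system each lookup would be a query to that sink. The function name find_parent and the sink names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def find_parent(children: Dict[str, List[str]], roots: List[str], k: int) -> Optional[str]:
    """Return the first sink, in breadth-first order, with fewer than k children.

    roots is the list of sinks that poll the source directly.
    """
    queue = deque(roots)
    while queue:
        sink = queue.popleft()
        kids = children.get(sink, [])
        if len(kids) < k:
            return sink
        queue.extend(kids)  # this sink is full, so look at the next level later
    return None


# Example: with k = 2, S1 is full, so the new sink joins S2.
tree = {"S1": ["S4", "S5"], "S2": ["S6"], "S3": []}
print(find_parent(tree, ["S1", "S2", "S3"], 2))
```

Because the search visits the tree level by level, a new sink is always placed at the shallowest available position, which keeps the depth of the tree to a minimum.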

Since this is a purely Pull-based network, you may use the code from the second project to help you implement the source and sinks. Assume that the source and sinks never crash once they are up, and so never create orphan sinks.

2. Server replicas

In this part, scalability is achieved using server replicas. There are n replicas of the IBM server (sensor). Each sensor has a number of sinks subscribed to it, and pushes data to each of those sinks at the update interval specified by that sink. When a new sink, I, wants to subscribe to a sensor, it obtains the addresses of the sensors as in part 1, from a directory server or through broadcast messages. From the sensors, I obtains the address of the least loaded sensor in the network, S. I then subscribes to S, and S pushes data to I at I's refresh frequency. This is a Push-based network. You may use your code from Project 1 to implement the sensors and sinks.
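Choosing the least loaded sensor can be sketched as below. The sketch assumes that load is simply the number of subscribed sinks and that ties are broken by sensor name; both are design choices left to you, and the name least_loaded is hypothetical.

```python
from typing import Dict, Set


def least_loaded(subscriptions: Dict[str, Set[str]]) -> str:
    """Return the sensor with the fewest subscribed sinks.

    Ties are broken by sensor name so that every sensor gives the same answer.
    """
    if not subscriptions:
        raise ValueError("no sensors are available")
    return min(sorted(subscriptions), key=lambda sensor: len(subscriptions[sensor]))


# Example: R1 and R2 both serve one sink, so the tie goes to R1.
subs = {"R2": {"a"}, "R1": {"b"}, "R3": {"c", "d"}}
print(least_loaded(subs))
```

Other measures of load, such as the total push rate implied by the sinks' refresh frequencies, would fit the same structure by changing the key function.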

Fault tolerance and load balancing are handled explicitly in this part. Sensors send periodic heartbeat messages to each other to indicate that they are still up and running. All sensors know which sinks are subscribed to which sensors. This may be achieved by having sensors send out a broadcast message every time a sink subscribes or leaves the network, or alternatively, by having sensors send out a list of subscribers periodically. This makes it possible for all sensors to know which node is least loaded.
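One possible shape for the heartbeat bookkeeping is sketched below. It assumes that a peer is declared failed after a fixed number of consecutive missed heartbeats; that threshold, the class name HeartbeatMonitor and its methods are assumptions made for illustration. Timestamps are passed in explicitly to keep the sketch easy to test; a real sensor might take them from time.monotonic().

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks when each peer sensor was last heard from (hypothetical class)."""

    def __init__(self, interval: float, missed_allowed: int = 3) -> None:
        # A peer is suspected once it has been silent for this long.
        self.timeout = interval * missed_allowed
        self.last_seen: Dict[str, float] = {}

    def record(self, peer: str, now: float) -> None:
        """Call this whenever a heartbeat from peer arrives."""
        self.last_seen[peer] = now

    def failed_peers(self, now: float) -> List[str]:
        """Peers whose last heartbeat is older than the timeout."""
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)


# Example: heartbeats every 2 seconds, failure after 3 missed heartbeats.
monitor = HeartbeatMonitor(interval=2.0, missed_allowed=3)
monitor.record("R1", 0.0)
monitor.record("R2", 0.0)
monitor.record("R1", 5.0)
print(monitor.failed_peers(6.5))
```

A shorter heartbeat interval detects crashes sooner but sends more messages, and a small missed_allowed value makes false suspicions more likely on a slow network.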

When a server goes down, the remaining sensors should distribute the load of the failed server among themselves. You may use a voting algorithm to decide how to balance the load of the failed server. Once this is done, each sensor contacts the sinks that have been assigned to it and pushes data to them.
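As an alternative to voting, the sketch below uses a deterministic greedy rule: each orphaned sink goes to whichever surviving sensor currently has the fewest sinks. If every sensor holds the same subscription table, every sensor computes the same assignment without exchanging further messages; that consistency is an assumption of the sketch, and the name redistribute is hypothetical.

```python
from typing import Dict, List


def redistribute(subscriptions: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new subscription table with the failed sensor's sinks reassigned.

    The input table is not modified.
    """
    survivors = {s: list(sinks) for s, sinks in subscriptions.items() if s != failed}
    if not survivors:
        raise ValueError("no surviving sensor to take over the sinks")
    for sink in sorted(subscriptions.get(failed, [])):
        # Least loaded survivor, with ties broken by sensor name.
        target = min(sorted(survivors), key=lambda s: len(survivors[s]))
        survivors[target].append(sink)
    return survivors


# Example: R3 crashes while serving four sinks.
subs = {"R1": ["a"], "R2": ["b", "c"], "R3": ["d", "e", "f", "g"]}
print(redistribute(subs, "R3"))
```

Sorting both the sinks and the sensors is what makes the result independent of the order in which each sensor happens to store its table.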

When a server, R, comes back up or joins the network for the first time, it contacts the other sensors to let them know that it can service sinks. R is allocated some sinks, so that the load is equally balanced among the available sensors. R then takes over the subscriptions of these sinks and pushes data to them at their respective refresh frequencies.
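The hand-over to a returning sensor can be sketched in the same style. The sketch assumes load is measured by sink count and repeatedly moves one sink from the most loaded sensor to the newcomer until the two differ by at most one; the name admit and the choice of which sink to move are assumptions.

```python
from typing import Dict, List


def admit(subscriptions: Dict[str, List[str]], newcomer: str) -> Dict[str, List[str]]:
    """Return a new subscription table in which newcomer has taken over some sinks.

    The input table is not modified.
    """
    result = {s: list(sinks) for s, sinks in subscriptions.items()}
    result.setdefault(newcomer, [])
    while True:
        # Most loaded sensor, with ties broken by sensor name.
        donor = max(sorted(result), key=lambda s: len(result[s]))
        if len(result[donor]) - len(result[newcomer]) <= 1:
            return result
        result[newcomer].append(result[donor].pop())


# Example: R3 joins two sensors that each serve three sinks.
subs = {"R1": ["a", "b", "c"], "R2": ["d", "e", "f"]}
print(admit(subs, "R3"))
```

Only sinks that actually change sensor need to be contacted, so moving as few sinks as possible keeps the disruption small.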

Some things to keep in mind:

Your code should be written in a way that the numbers k and n can be varied easily.
Be aware of thread synchronization issues, and avoid inconsistency or deadlock in your system.
No GUIs are required. Simple command line interfaces are fine.
You can work in groups of two for this assignment.

2 Evaluation and Measurement

Set up the networks described above.

Since the goal of this part of the project is to study load balancing and fault tolerance, you are not required to turn in graphs of throughput or calculated value vs actual value. However, your report must include detailed analyses and diagnostics. You must also show examples of what your system does when a new subscription is added to the network and how fault tolerance is handled.

For part one, show what happens when a level 1 sink (one that subscribes directly to the sensor), a level 2 sink and a level 3 sink are added to the system. Have each component in the path along the tree print out well-formatted messages. Also look at the average time for an update to propagate from the source to the leaf. If there is an update at the source at time T, when does that update get to the leaf? If the tree is deep, there may be quite a delay for an update to reach a leaf node.
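As a rough guide to what to expect from the measurement, the sketch below estimates the delay from the polling intervals along the path to a leaf. It ignores network and processing time and assumes that the polls at different levels are not synchronized with each other; under those assumptions each hop adds at most one polling interval, and on average about half of one. The name delay_bounds is hypothetical.

```python
from typing import List, Tuple


def delay_bounds(poll_intervals: List[float]) -> Tuple[float, float]:
    """Return (worst_case, expected) propagation delay in seconds.

    poll_intervals lists the polling interval of each sink on the path,
    from the level 1 sink down to the leaf.
    """
    worst_case = sum(poll_intervals)  # every hop has just missed the update
    expected = worst_case / 2.0       # each hop waits half an interval on average
    return worst_case, expected


# Example: a level 3 leaf whose path polls every 2, 4 and 4 seconds.
print(delay_bounds([2.0, 4.0, 4.0]))
```

Comparing your measured averages against such an estimate is one way to check that the tree is polling at the intervals you intended.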

For part two, show what happens when a new sink is added to the network (i.e., finding the least loaded node, etc.). Also show what happens when a sensor with more than 2n sinks subscribed to it crashes. Show how this is handled and how load balancing is achieved between the sensors. Study how long it takes on average for the sinks to be back on the network, and how many updates the sinks miss before they are back. Look at the effect of varying the heartbeat interval on the time it takes to detect crashes.

3 What you will submit

When you have finished implementing the complete assignment as described above, put all the code in a separate directory in your edlab account (/677/project3).

You are required to submit your solution in the form of printouts.

Each program must work correctly and be documented. You should hand in:

1. A copy of the output generated by running your program. When the directory server sends data, have your program print the information to the screen. When a sink is polling a source, have your program print messages for whether an update is required, and if so, for the data being sent by the source and for the data being received by the sink.

2. A separate (typed) document of approximately two pages describing the overall program design, a description of "how it works", and design tradeoffs considered and made. Explain how you implemented the insertion algorithm in each part of the project. Describe how load balancing is achieved in the second part and how fault tolerance is handled. Describe clearly how synchronization is handled. Also describe possible improvements and extensions to your program (and sketch how they might be made).

3. A program listing containing in-line documentation.

4. Instructions to compile and run the code from 677/project3.

5. A separate description of the tests you ran on your program to convince yourself that it is indeed correct. Also describe any cases for which your program is known not to work correctly.

6. Performance results and discussion.

Let us not waste a lot of trees. So, if any of the above turn out to be large, just save the relevant information in a file, leave it on your EDLAB account and submit the name of the file.

4 Grading policy for all programming assignments

Grading:

· Program Listing
    o works correctly: 50%
    o in-line documentation: 15%
· Design Document
    o quality of design: 15%
    o understandability of doc: 10%
· Thoroughness of test cases: 10%

Grades for late programs will be lowered 12 points per day late.