Spring 2019
Lab 3
Turning the Pygmy into an Amazon: Replication, Caching and Consistency
Due: 23:55, April 24, 2019
First, Pygmy.com has three NEW books in its catalog:
These books were added to the catalog during a spring break sale that turned out to be a big success. However, with the growing popularity of the book store, buyers began to complain about the time the system takes to process their requests. You are tasked with rearchitecting the Pygmy.com online store you built in Lab 2 so that it can handle this higher workload.
This project builds on Lab 2. The assignment has three parts.
In this part, we will add replication and caching to improve request-processing latency. While the front-end server in Lab 2 was a very simple component, in this part we will add two types of functionality to the front-end node. First, we will add an in-memory cache that caches the results of recent requests to the order and catalog servers. Second, assume that both the order and catalog servers are replicated: their code and their database files are replicated on multiple machines (for this part, assume two replicas each for the order and catalog servers). To deal with this replication, the front-end node needs to implement a load-balancing algorithm that takes each incoming request and sends it to one of the replicas. You may use any load-balancing algorithm, such as round-robin or least-loaded, and you may balance load on a per-request or a per-user-session basis.

The front-end node is NOT replicated. It receives all incoming requests from clients. Upon receiving a request, it first checks the in-memory cache to see whether the request can be satisfied from the cache. The cache stores the results of recently accessed lookup queries (i.e., book id, number of items in stock, and cost) locally. When a new query request comes in, the front-end server checks the cache before forwarding the request to the catalog server. Note that caching is only useful for read requests (queries to the catalog); write requests, i.e., buy orders or update requests, must be processed by the order or catalog servers rather than the cache.

You can implement the cache in one of two ways: as a separate component from the front-end server, in which case you will need to use REST calls to get and put items from and to the cache; or as an in-memory cache integrated into the front-end server process, in which case internal function calls are used to get and put items into the cache. A sketch of the front end's read path appears below.
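The following is a minimal sketch of that read path in Python, assuming two hypothetical catalog replica addresses and a /query/<id> REST endpoint; these names are illustrative, not prescribed by the lab.

```python
# Front-end read path: check the in-memory cache, then fall back to a
# catalog replica chosen by round-robin load balancing.
import itertools
import requests

CATALOG_REPLICAS = ["http://catalog1:5000", "http://catalog2:5000"]  # assumed addresses
next_replica = itertools.cycle(CATALOG_REPLICAS)  # round-robin iterator

cache = {}  # book id -> {"stock": ..., "cost": ...}

def query(book_id):
    """Serve a read request, consulting the cache before any replica."""
    if book_id in cache:                      # cache hit: no backend call needed
        return cache[book_id]
    replica = next(next_replica)              # pick the next replica in rotation
    result = requests.get(f"{replica}/query/{book_id}").json()
    cache[book_id] = result                   # populate the cache for future reads
    return result
```

A least-loaded policy would replace the cycle iterator with a choice based on each replica's outstanding request count; the lab permits either.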
Cache consistency needs to be addressed whenever a database entry is updated by a buy request or by the arrival of new stock. To ensure strong consistency guarantees, you should implement a server-push technique in which the backend replicas send invalidate requests to the in-memory cache before making any writes to their database files. The invalidate request causes the data for that item to be removed from the cache. Feel free to add other caching features, such as a limit on the number of items in the cache, which will then require a cache-replacement policy such as LRU to replace older items with newer ones.
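One possible shape for this server-push invalidation, assuming the cache lives inside the front-end process, that the front end exposes a hypothetical /invalidate/<id> endpoint, and that Flask is the REST framework (all assumptions, not requirements):

```python
# Front-end side: an endpoint replicas call to evict a stale cache entry.
from flask import Flask
import requests

app = Flask(__name__)
cache = {}  # the same in-memory cache used by the read path

@app.route("/invalidate/<int:book_id>", methods=["DELETE"])
def invalidate(book_id):
    cache.pop(book_id, None)   # drop the entry if it is cached
    return "", 204

# Replica side: push the invalidate BEFORE writing the database file,
# so no client can read a stale cached value after the write commits.
def update_stock(book_id, new_count):
    requests.delete(f"http://frontend:8000/invalidate/{book_id}")  # assumed front-end address
    ...  # now apply the write to the local database file
```

If you add an LRU limit, collections.OrderedDict (move_to_end on each hit, popitem(last=False) on eviction) is a simple way to implement the replacement policy.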
The replicas should also use an internal protocol to ensure that any write to one replica's database is also performed at the other replica, keeping the two in sync with one another.
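Here is a sketch of one such protocol, assuming each replica knows its peer's address and exposes a hypothetical /replicate endpoint (a real implementation must also cope with the peer being down, which the fault-tolerance portion of this lab addresses):

```python
# Replica-sync sketch: apply a write locally, then propagate it to the peer.
from flask import Flask, request
import requests

app = Flask(__name__)
PEER = "http://catalog2:5000"   # the other replica's address (illustrative)

def buy(book_id):
    apply_write_locally(book_id)    # update this replica's own database file
    # forward the same write so the peer's database stays in sync
    requests.post(f"{PEER}/replicate", json={"op": "buy", "book_id": book_id})

@app.route("/replicate", methods=["POST"])
def replicate():
    op = request.get_json()
    apply_write_locally(op["book_id"])   # apply the peer's write locally
    return "", 204

def apply_write_locally(book_id):
    ...  # e.g., decrement the stock count in the database file
```

Note that the /replicate handler calls apply_write_locally directly rather than buy, so a propagated write is not forwarded back and forth between the two replicas.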
As in Lab 2, all components use REST APIs to communicate with one another.
There are many tools available on the Internet that take the code for each component and package it as a container image. You are free to use any such tool to create your Docker container images (and feel free to use Piazza to discuss the specifics of a tool or to share your ideas).
Once you have created a Docker version of your app, you should also upload it to your Lab 3 GitHub repository (in a docker directory for Lab 3), so that Docker can download it directly from GitHub and deploy it on any machine.
Be sure to track requests that were in progress when the failure occurred; these in-progress requests should be re-issued to the non-faulty replica to obtain a proper response. Finally, you also need to implement a recovery process in which the failed replica is restarted after a crash and then uses a resync method to synchronize its database state with the non-faulty replica, so that the two are again in sync. A sketch of both pieces follows.
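Below is a minimal sketch of failover on the front end and of resync on the recovering replica, assuming that a short timeout (here 2 seconds) signals a failure and that the surviving replica exposes a hypothetical /dump endpoint returning its full database state; both are assumptions, not requirements.

```python
# Failover: a connection error or timeout marks a replica as failed, and the
# in-flight request is immediately re-issued to the next (surviving) replica.
import requests

def get_with_failover(replicas, path):
    for replica in replicas:                   # try replicas in order
        try:
            return requests.get(replica + path, timeout=2).json()
        except requests.RequestException:      # crash, refused connection, or timeout
            continue                           # fail over to the next replica
    raise RuntimeError("all replicas are down")

# Recovery: on restart, the failed replica pulls the survivor's current
# database state before it resumes serving requests.
def resync(peer_url):
    state = requests.get(f"{peer_url}/dump").json()
    write_database_file(state)   # overwrite this replica's stale database

def write_database_file(state):
    ...  # persist the fetched state to the local database file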
Assume that the order server is replicated on three nodes. Implement the Raft consensus protocol, which uses state machine replication so that all replicas order incoming writes and apply them to the database in the same order. This ensures that race conditions do not occur in which concurrent incoming orders arrive at two different replicas and are applied at the other replicas in different orders. You will need to implement an on-disk log as part of Raft's state machine replication. You will further need to show that the failure of an order replica does not prevent the others from making progress, since a majority of the replicas (2 out of 3) is still up.
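Full Raft (leader election, terms, log matching) is too much for a short sketch, but the leader's append-and-commit path could look roughly like this, assuming a JSON-lines file as the on-disk log and a hypothetical /append_entries endpoint on each follower:

```python
# Simplified leader-side Raft log: persist an entry, replicate it to the
# followers, and apply it only once a majority (2 of 3) has acknowledged.
import json
import requests

LOG_PATH = "raft_log.jsonl"                               # on-disk log (assumed format)
FOLLOWERS = ["http://order2:6000", "http://order3:6000"]  # illustrative addresses

def append_and_commit(entry):
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps(entry) + "\n")   # make the entry durable first
        log.flush()
    acks = 1                                  # the leader counts itself
    for follower in FOLLOWERS:
        try:
            if requests.post(f"{follower}/append_entries", json=entry, timeout=2).ok:
                acks += 1
        except requests.RequestException:
            pass                              # a crashed follower is tolerated
    if acks >= 2:                             # majority of the 3 replicas
        apply_to_database(entry)              # every replica applies in log order

def apply_to_database(entry):
    ...  # execute the write against the order database file
```

Because each replica appends entries to its log and applies them only in log order, concurrent orders are serialized identically everywhere, and with 2 of 3 replicas up the majority check still passes.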
Make necessary plots to support your conclusions.