Database Caching Reviews


Class Summary: Akash Jain and Zhenlin Wang
Caching in a Data Warehousing environment:

A data warehouse environment is characterized by infrequent updates. Warehouses store large volumes of data that are used frequently by decision support applications. These applications involve complex queries, and since they require interactive response times, query performance is of utmost importance. In such an environment, caching can be done at multiple levels. Caching is also attractive because DSS queries, while complex and expensive to execute, often return small retrieved sets, which makes those sets cheap to keep in a cache.

It is clear from the above discussion that "maximizing the hit ratio" should not be the criterion for an efficient page replacement algorithm in a database environment. In an interactive environment, we need to "minimize the response time" seen by the user. To do this, the replacement decision must consider two statistics beyond the reference rate: the size of the set retrieved by the query and the cost of executing the query.

In this class, we discuss an algorithm that uses these metrics to minimize response time through caching. One can imagine applying such an algorithm to web caching as well, since the most important need on the web today is to minimize the response time seen by the receiver, so web caching decisions should also be based on the size and the retrieval cost of what is being cached. Moreover, with the emergence of dynamic HTML, where content is generated on a per-user basis, caching decisions will need to be based on something other than the reference rate alone.
 

Transactional Client-Server Cache Consistency:

This paper is essentially a survey of consistency algorithms in client-server database systems. Its main contribution is a new taxonomy of related algorithms, and it compares their performance under different workloads based on the design space laid out in the taxonomy. Unlike the traditional classification, which labels algorithms simply as ``optimistic'' or ``pessimistic'', the paper first divides algorithms into two classes, detection-based and avoidance-based, according to how they prevent transactions that have accessed invalid data from committing. Detection-based algorithms are further classified by when validity checks are initiated, whether and how remote clients are notified of updates, and what kind of remote action is taken. Similarly, avoidance-based algorithms are classified by when write intentions are declared, how long write permissions are held, the choice of waiting or preemption when a local write conflicts with a remote read, and what kind of remote action is taken.

Performance varies from workload to workload. Pessimistic detection-based algorithms tend to send more messages than avoidance-based algorithms, assuming reasonable locality at each client. The choice between synchronous and deferred write intention declaration trades the number of messages against the abort rate. Whether to hold write permissions across transactions depends on the locality among transactions and the conflicts across clients. Choosing between invalidating remote pages and propagating new updates is workload-dependent as well: in most cases invalidation is the safer and more robust choice, whereas propagation is more dangerous and degrades system performance when conflicts are frequent and the number of clients grows.

In general, this is a good paper to read if we want to know the current status of research on database cache consistency. However, there are not many novel ideas in the paper.


Reviews

Paper: WATCHMAN: A Data Warehouse Intelligent Cache Manager

Reviewer: Vijay Sundaram

This paper presents the design of WATCHMAN, an intelligent cache manager for sets retrieved by queries. WATCHMAN incorporates two complementary algorithms: LNC-R (Least Normalized Cost Replacement), for cache replacement, and LNC-A (Least Normalized Cost Admission), for cache admission. LNC-R can be used stand-alone or integrated with the cache admission algorithm LNC-A (the combination is called LNC-RA). These algorithms aim to optimize query response time by minimizing the execution cost of queries that miss the cache.

LNC-R and LNC-A aim at minimizing the execution time of queries rather than maximizing the hit ratio, which is the usual goal in buffer management. LNC-R does so by maximizing the CSR (cost savings ratio). In addition to the average rate of reference to a particular query, LNC-R uses two additional parameters: the size of the set retrieved by the query and the cost of executing the query. These three parameters are combined into one performance metric, the 'profit'. The average rate of reference is calculated as a moving average over the last K inter-arrival times of references to the retrieved set.
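As a rough illustration of how this bookkeeping might look, the following sketch computes a profit from the three statistics named above; the quantities s_i, c_i, and lambda_i follow the description in the review, while the class and method names are invented for this sketch:

    from collections import deque

    class RetrievedSet:
        """Per-retrieved-set statistics, along the lines the paper describes."""

        def __init__(self, size, exec_cost, k=2):
            self.size = size              # s_i: size of the retrieved set
            self.exec_cost = exec_cost    # c_i: cost of executing the query
            self.times = deque(maxlen=k)  # timestamps of the last K references

        def reference(self, now):
            self.times.append(now)

        def rate(self, now):
            # lambda_i estimated from the last K references: K / (now - t_K)
            if not self.times or now == self.times[0]:
                return 0.0
            return len(self.times) / (now - self.times[0])

        def profit(self, now):
            # profit_i = lambda_i * c_i / s_i: cost saved per unit of cache space
            return self.rate(now) * self.exec_cost / self.size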

A cache admission algorithm tries to prevent the caching of retrieved sets that would degrade response time. WATCHMAN caches a retrieved set, relative to a set C of replacement candidates, only if the new set has a higher profit than all the retrieved sets in C. If the set is being retrieved for the first time, no reference statistics exist yet, so WATCHMAN compares an expected profit for the set instead.
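A minimal sketch of such an admission check, assuming the profit of each cached set is computable as above (the paper's use of an expected profit for first-time retrievals is omitted, and the function name is invented):

    def should_admit(new_set, cached_sets, cache_size, used, now):
        # Pick replacement candidates C in increasing order of profit until
        # the new set would fit, then admit only if the new set's profit
        # exceeds the profit of every candidate in C.
        need = new_set.size - (cache_size - used)
        if need <= 0:
            return True, []               # fits without evicting anything
        victims = []
        for s in sorted(cached_sets, key=lambda c: c.profit(now)):
            victims.append(s)
            need -= s.size
            if need <= 0:
                break
        if need > 0:
            return False, []              # cannot make enough room
        if all(new_set.profit(now) > v.profit(now) for v in victims):
            return True, victims          # evict C and admit the new set
        return False, []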

Problems due to retained reference information are handled by evicting the reference information related to a retrieved set whenever its associated profit is smaller than the least profit among all cached retrieved sets.
The chief idea in the paper is to improve query performance by caching the sets retrieved by queries in addition to query execution plans.
Overall, the paper was well written and took little effort to understand.
 


Reviewer: M. S. Raunak

This paper presents a new algorithm particularly suited to caching query results in a data warehouse environment. Data warehouses are usually used to support data analysis and the queries of large decision support systems (DSS). These queries are complex by nature, as they access and manipulate substantial amounts of data in the warehouse; as a result, they are quite expensive compared to online transaction processing queries, which involve only a few tuples in a relation. On the other hand, data in a warehouse is relatively static. Together, these two characteristics make caching an important and effective way to improve query response time.

The paper describes two complementary cache replacement and cache admission algorithms for reducing the response time of complex queries. The idea is based on the observation that DSS queries often follow a hierarchical pattern, where a query at each level is a refinement of the query at the previous level. By caching intermediate query results, it is possible to serve higher-level queries without spending time on rigorous manipulation of the raw data.

Any caching scheme needs policies for cache admission and cache replacement, and some metric is needed to drive these policies. This paper defines a "profit metric" based on the cost of executing a query, its average rate of reference, and the size of the set retrieved by the query. Based on this profit metric, admission to and removal from the cache are decided by a heuristic. The paper also presents a simple model that describes the optimal case and argues that their proposed heuristic closely follows the optimal solution.

The strength of the paper lies in identifying a specific area where traditional LRU does not provide the right metric for caching decisions. It also nicely identifies the need to combine policies for cache admission and cache replacement, which was absent in traditional database caching. The proposed algorithms are simple and intuitive.

The work should have used real workloads instead of synthetically generated ones. The result showing only marginal improvement from using multiple inter-arrival times from past references needs more justification.
 


Reviewer: Sivakumar M

This paper deals with the design of an intelligent cache manager called WATCHMAN. It aims at minimizing query response time by making use of a "profit" metric based on each retrieved set's average rate of reference, its size, and the execution cost of its query. The paper first introduces DSS queries and the differences between DSS and OLTP queries. Reasonable arguments are given for selecting the profit metric.

WATCHMAN makes use of two complementary algorithms: a cache replacement algorithm called Least Normalized Cost Replacement (LNC-R) and a cache admission algorithm called Least Normalized Cost Admission (LNC-A). Each can be used as a stand-alone algorithm without the other, and the two are explained separately. It is shown that optimality can be achieved by combining both algorithms (LNC-RA). The problem of retained reference information and the problems with the Five Minute Rule are explained. Implementation details of WATCHMAN are discussed briefly, along with its interaction with the buffer manager. The performance of WATCHMAN is evaluated on the TPC-D and Set Query benchmarks, using cost savings ratio, cache hit ratio, and external fragmentation as the performance metrics, and LNC-RA is compared against LRU.

The explanation of LNC-RA through pseudocode is quite good, except for unexplained symbols such as K and p0. It seems to me that there is a contradiction in the assumptions about DSS queries: in one place they are described as complex queries that access a substantial part of the data, while in another paragraph they are described as queries that retrieve small sets of data. Setting the profit metric as the performance measure is better than having just the hit ratio as a metric. The weighting used to obtain a reliable estimate of lambda(i) when calculating the profit should be fixed carefully. The low CSR for TPC-D, even though it has a higher hit ratio than the Set Query benchmark, shows the importance of the profit metric.
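To make the difference between the two metrics concrete, here is a small illustrative sketch (the Reference record is invented for illustration) showing how a trace can have a high hit ratio yet a low cost savings ratio:

    from collections import namedtuple

    Reference = namedtuple("Reference", ["cost", "hit"])

    def hit_ratio(refs):
        # fraction of references served from the cache
        return sum(r.hit for r in refs) / len(refs)

    def cost_savings_ratio(refs):
        # fraction of total execution cost saved by cache hits, so a few
        # hits on expensive queries outweigh many hits on cheap ones
        return sum(r.cost for r in refs if r.hit) / sum(r.cost for r in refs)

    refs = [Reference(1, True)] * 9 + [Reference(100, False)]
    print(hit_ratio(refs))            # 0.9
    print(cost_savings_ratio(refs))   # about 0.08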
 


Reviewer: Osman Bin

This paper introduces a novel cache manager, WATCHMAN, which aims at minimizing query response time. WATCHMAN has two components: a cache replacement unit and a cache admission unit. The domain considered is data warehouses servicing data analysis and decision support queries. The main observation is that maximizing the cache hit ratio is not a suitable metric for this domain, since the retrieved sets of queries are not of equal size and queries do not all incur the same execution cost. The main contribution of the paper builds on this observation: instead of the hit rate, define a cost function, in this case a profit metric, and optimize it.

Strengths and Weaknesses:
In the performance analysis, the WATCHMAN algorithm LNC-RA was compared with a plain LRU scheme. Results on benchmark traces suggest LNC-RA is about 3 times better on average on the Set Query trace, a significant improvement, though the authors show that the savings diminish for multiclass workloads. Another point is that the benchmark databases were scaled down from their suggested sizes to save on trace collection time, which may have had an impact on the results.

The authors show the optimality of LNC-RA, but only under a constrained model: the assumption is that the sizes of cached retrieved sets are small relative to the total cache size S, so it is always possible to utilize almost all of the cache space. This assumption seems valid, since their experiments show the degree of external fragmentation to be limited.

LNC-RA collects various statistics to support its decisions: the cost of executing a query, the size of the set retrieved by a query, and the average rate of reference to a query. Several assumptions made in collecting these statistics keep them from being exact. For example, to calculate the execution cost of a query, the authors assume that execution costs are dominated by disk I/O and set the cost to the number of buffer block reads performed during execution of the query.
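A sketch of how such an I/O-based cost estimate might be collected (the wrapper below is hypothetical, not the paper's actual implementation):

    class CostTrackingBufferPool:
        """Approximates a query's execution cost c_i by counting buffer
        block reads, per the paper's assumption that cost is I/O-dominated."""

        def __init__(self, read_block_fn):
            self._read_block = read_block_fn  # underlying block-read routine
            self.block_reads = 0

        def read_block(self, block_id):
            self.block_reads += 1             # one cost unit per block read
            return self._read_block(block_id)

    pool = CostTrackingBufferPool(lambda block_id: b"...block bytes...")
    pool.read_block(7)
    pool.read_block(8)
    query_cost = pool.block_reads             # recorded as c_i for this query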
 


Reviewer: Abhishek Chandra

This paper discusses caching mechanisms for data warehouse queries. It presents a cache replacement algorithm based on a cost function that depends on the size of a query's data, its computation time, and its frequency of access, which is more suitable in a database environment than the simple hit rate. Using the same metric, the paper further describes a cache admission algorithm that prevents useful data from being evicted from the cache.

The main strength of the paper is its identification of proper metrics for caching in the database environment, as opposed to the conventional metrics used most of the time. Further, it gives good intuitive as well as theoretical reasoning (though on a restricted model) behind the algorithms.

The main drawback of the paper is that it uses query equality to identify cache references. It would be better if it used some form of query equivalence or could cache intermediate query results. The paper also does not show experimentally the overhead of computing the more sophisticated cost function.
 


Paper: Transactional Client-Server Cache Consistency: Alternatives and Performance
 

Reviewer: Michael Bradshaw

Client-server database design makes use of intelligent clients by shipping data to the client and allowing it to run operations on the datasets. When multiple clients operate on the same data, problems of cache consistency come into play. Franklin, Carey, and Livny examine the many different techniques for ensuring cache consistency in the presence of transactions. They provide a taxonomy of the methods for assuring cache consistency and use experimental data to derive guidelines for choosing the right approach when building new systems.

good points
1) The taxonomy shows which areas of research have and have not been covered.
2) Creates a baseline set of tests for most branches of the taxonomy.
3) Offers heuristics for combining caching methods to develop future systems.

bad points
1) It's all restated research, no new ideas.


Reviewer: Sivakumar M

This paper presents a taxonomy based on whether algorithms detect or avoid access to stale data, and it explains the design space of transactional cache consistency algorithms and the relationships among them. It investigates six algorithms, through which the tradeoffs among the design choices in the taxonomy are explained. It first introduces the pros and cons of the query-shipping and data-shipping approaches and shows the advantages of data caching in the data-shipping approach. Data shipping can use either page servers or object servers. The concepts of inter- and intra-transaction caching are touched upon to stress the importance of a cache consistency protocol. The appropriate solutions for a database system depend on whether it is transaction-based and whether it uses a client-server architecture. The algorithms considered in the paper are explained with reference to this classification.

Data shipping (in a client-server architecture) is explained along with concepts like one-copy serializability, dynamic replication, and second-class ownership, and their implications. The differences between transactional cache consistency and cache consistency in non-database systems, and between client-server database systems and shared-disk database systems, are explained in areas like correctness criteria, caching granularity, and cost tradeoffs. Sequential, release, and lazy release consistency are defined. Consistency maintenance can be either detection-based or avoidance-based: in detection-based schemes, transactions must check the validity of any cached object they access before they can be allowed to commit, whereas in avoidance-based schemes, transactions never have the opportunity to access stale data. The dimensions along which detection-based schemes differ, such as validity check initiation, change notification hints, and remote update action, are dealt with separately in subsections.
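For concreteness, a minimal sketch of what a commit-time, detection-based validity check might look like (the version-number scheme and all names are assumptions for illustration, not the paper's exact protocol):

    def validate_at_commit(server_versions, read_set):
        # The client ships the versions of the cached pages its transaction
        # read; the server refuses the commit if any page has been updated.
        for page_id, version_seen in read_set.items():
            if server_versions.get(page_id) != version_seen:
                return False   # stale cached data detected: abort
        return True            # read set still valid: commit may proceed

    server_versions = {"p1": 4, "p2": 9}
    print(validate_at_commit(server_versions, {"p1": 4, "p2": 9}))  # True
    print(validate_at_commit(server_versions, {"p1": 3}))           # False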

Avoidance-based schemes have four analogous dimensions: write intention declaration, write permission duration, remote conflict priority, and remote update action. The performance of three families of algorithms is then compared: server-based two-phase locking, callback locking, and optimistic schemes. A table showing the design choices and the algorithms compared for each choice nicely summarizes the 14 pages of explanation. A client-server DBMS is modeled for the performance analysis, with throughput as the performance metric plotted against the number of clients accessing the server, and graphs show the performance of the different algorithms discussed. The algorithms examined include Optimistic Detection-Based, Notify Locks, No-Wait Locking, and Dynamic Optimistic Two-Phase Locking. Despite its length, the paper is very interesting because of its organization, and the classifications describing the design space of cache maintenance algorithms are nicely presented with diagrams.
 



 
Prashant Shenoy