A Quick Introduction to XML
XML is a metalanguage for defining markup languages in the style of HTML, with one important addition: the tags in XML-derived languages describe the information that they surround. Each niche of the information world will have its own set of descriptive tags; XML itself only specifies how the tags must be designed and used. With this addition, XML moves to the forefront as a format for data transfer.
Querying XML Data
Since XML is moving to the forefront of data representation, there needs to be an effective way to search an XML file for desired information. Deutsch et al. have made a stab at this. Their query system has the feel of SQL blended with Prolog. A query is written by recreating the structure of the tags to be searched for and then placing a variable in the unknown part (much like Prolog).
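The pattern-with-variables idea can be sketched in a few lines of Python. This is not XML-QL syntax or its implementation; the (tag, content) pattern format, the "$" prefix for variables, and the names match and query are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

def match(pattern, elem, bindings):
    """Yield every variable binding under which `pattern` matches `elem`.

    A pattern is a (tag, content) pair. Content is either a string
    ("$name" binds a variable, anything else is a literal compared with
    the element text) or a list of child patterns.
    """
    tag, content = pattern
    if elem.tag != tag:
        return
    if isinstance(content, str):
        text = (elem.text or "").strip()
        if content.startswith("$"):
            # A variable matches any text, but must agree with earlier bindings.
            if bindings.get(content, text) == text:
                yield {**bindings, content: text}
        elif content == text:
            yield dict(bindings)
        return
    # Content is a list: every child pattern must match some child element.
    partial = [dict(bindings)]
    for child_pattern in content:
        partial = [
            extended
            for current in partial
            for child in elem
            for extended in match(child_pattern, child, current)
        ]
    yield from partial

def query(pattern, root):
    """Collect the bindings for every element in the tree that matches."""
    results = []
    for elem in root.iter(pattern[0]):
        results.extend(match(pattern, elem, {}))
    return results

DOC = """
<bib>
  <book><title>XML Basics</title><publisher>Acme</publisher></book>
  <book><title>Databases</title><publisher>Other</publisher></book>
  <book><title>Web Data</title><publisher>Acme</publisher></book>
</bib>
"""

# "Find the title $t of every book whose publisher is Acme."
pattern = ("book", [("publisher", "Acme"), ("title", "$t")])
titles = [binding["$t"] for binding in query(pattern, ET.fromstring(DOC))]
print(titles)
```

The known part of the pattern (the publisher) acts as a filter, and the variable picks up whatever sits in the unknown part, which is the Prolog-like flavor described above.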
The query language XML-QL is clear and easy to use. It is broad enough to encompass SQL, which was the standard in data query languages. Its ability to reformat the output into another XML file is invaluable as web-based searches become more commonplace. This allows results to be gathered from several web sources and then used in turn in another query.
Work still needs to be done to quantify the possible speedups available to an XML query language that would not be necessary or currently used in SQL environments. These would include ways to limit the amount of file parsing necessary to find data, and better ways to search through XML files that do not have a known DTD.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
XML allows people to build queries for the web. However, in today's pull-based web design, either users would be making decisions on stale data or the servers would be overworked trying to answer queries even when no new data has been posted. In order to give users up-to-date information and to create a scalable solution, the server will have to move to a push-based model of data reports. The user gives the server a query and guidelines about how often the reports should arrive, and then only has to wait for the message.
NiagaraCQ offers several means by which this process becomes more scalable.
NiagaraCQ "batches" queries with other related queries to limit the repetitive
overhead of producing identical input files for parsing. This is
achieved dynamically by identifying components of the search and then placing
the resulting queries together. Another important feature is the
use of delta files. When a file changes, the server detects the change
and sends only the changed components to the query to be recomputed.
By caching the previous results and querying only the changed data, searches
can be done more quickly and sent off to the interested parties.
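A rough Python sketch of the delta idea follows. It is a simplification under stated assumptions: the source is modeled as a dict rather than an XML file, and compute_delta, ContinuousQuery and refresh are hypothetical names, not NiagaraCQ's.

```python
def compute_delta(old, new):
    """Return the entries of `new` that are absent from, or differ in, `old`."""
    return {key: value for key, value in new.items() if old.get(key) != value}

class ContinuousQuery:
    def __init__(self, predicate):
        self.predicate = predicate
        self.snapshot = {}  # last version of the source that was seen
        self.results = {}   # cached answer from earlier evaluations

    def refresh(self, source):
        """Re-evaluate only the changed entries and return the delta."""
        delta = compute_delta(self.snapshot, source)
        for key, value in delta.items():
            if self.predicate(value):
                self.results[key] = value
            else:
                self.results.pop(key, None)
        # Entries deleted from the source can no longer be answers.
        for key in self.snapshot.keys() - source.keys():
            self.results.pop(key, None)
        self.snapshot = dict(source)
        return delta

expensive = ContinuousQuery(lambda price: price > 50)
first = expensive.refresh({"INTC": 40, "MSFT": 95})
second = expensive.refresh({"INTC": 55, "MSFT": 95})
print(second, expensive.results)
```

On the second refresh only the INTC entry is examined; the cached MSFT answer is reused untouched.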
NiagaraCQ relies on some questionable assumptions. While the data clearly
shows that grouping queries dramatically reduces the overhead involved
in making the queries, the paper assumes that such groupings are not
only possible but overly common. Where queries really are that
common, it would be better to create specialized query handlers
to answer those common cases; the overhead of maintaining the dynamic
grouping would be wasted. The paper is also unclear about
whether NiagaraCQ can support a hierarchical method
of data collection by incrementally making queries.
As for a proxy-based solution using this technology, I would have to say that it would be either highly specialized or worthless. It would only be of use where groups of people make similar queries, so for a brokerage firm this type of proxy might be beneficial. On the other hand, a business that needs results that quickly and is that sensitive to time would most likely prefer to have the query made at the point where the data changes, so that propagation delays do not occur.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
This paper presents the novel idea of grouping continuous queries that share common computation. Its distinctive feature is an incremental group optimization strategy with dynamic re-grouping. New queries are added to existing query groups without re-grouping, but this can result in inefficient groupings, so dynamic re-grouping is used whenever needed. Change-based and timer-based queries are grouped in a uniform way. Scalability is achieved by incremental evaluation of continuous queries and by using both push and pull methods. The paper first introduces the need for a grouping strategy, along with definitions of change-based and timer-based queries. It then describes the NiagaraCQ command language, along with the set of commands for timer-based and change-based queries.
Incremental group optimization using expression signatures is explained using XML-QL queries on a database of stock quotes. Groups are formed by filtering, which consists of split and join operations. The advantages and disadvantages of the pipelined scheme and the intermediate-file scheme are shown. The ungrouped parts of all query plans are combined to form a single execution plan, which poses its own problems. The case of queries with more than one predicate, resulting in two expression signatures, is explained. An analysis is given of placing the selection operator before the join operator and vice versa. The problem of grouping timer-based continuous queries, and the timestamp-based support for it, are dealt with. The state changes of continuous query processing in NiagaraCQ are explained with a state diagram. Experiments are conducted on a database of stock information to compare the effectiveness of grouping queries against the non-grouped case. The parameters of the experiments are the number of installed queries N, the number of fired queries F, and the number of tuples modified C. Results are presented as graphs of execution time against number of queries for various combinations of F, N and C. The paper is a little difficult to understand, especially the combination of split and join operations.
XML and XML Query Language
An XML query language is a tool for structural and content-based queries that allows an application to extract precisely the information it needs from one or several XML data sources. As XML data proliferates on the web, applications will need to integrate and aggregate data from multiple sources, and to clean and transform data to facilitate exchange.
XML becomes interesting in the context of caching. When we have cached objects, these documents need some structure so that queries can be answered from them. Of course, answering a query from a cached document requires well-defined semantics, but it can be done. This becomes more important in today's Internet, where different users are given different pages based on their interests. If we have XML documents cached, we can serve some parts of the pages to all users when some frames are common. This cannot be done with HTML. Cache coherence using a timer expiry mechanism can be implemented in XML by adding a tag for time. The whole query does not need to be sent to the server; partial results of the query can be obtained from the cached data (plan, query result or data itself).
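As a small illustration of the timer-expiry idea, the sketch below stores an expiry time inside the cached XML document and checks it before answering from the cache. The tag name expires, the use of epoch seconds, and the function is_fresh are all assumptions for this example; none of them come from the papers.

```python
import xml.etree.ElementTree as ET

def is_fresh(xml_text, now):
    """Return True if the cached document carries an <expires> time later than `now`."""
    root = ET.fromstring(xml_text)
    expires = root.findtext("expires")
    if expires is None:
        # No expiry tag: treat the copy as stale and go back to the server.
        return False
    return now < float(expires)

CACHED = "<quote><symbol>INTC</symbol><price>40</price><expires>1000</expires></quote>"

if is_fresh(CACHED, now=999.0):
    price = ET.fromstring(CACHED).findtext("price")
    print("answered from cache:", price)
```

A proxy would run such a check first and forward the query to the origin server only when it fails.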
There are many requirements for an XML query language. The semantics
should support reasoning about XML queries, such as determining result structure,
equivalence and containment. Query containment is useful for determining
whether a push stream of data can be used to answer a particular query. It is
also desirable that an XML query be translatable into the query language of the native
data. All operations should be possible in a single XML query. Moreover,
an XML query should yield XML output so that derived databases can be viewed
via a single query. The meaning of an expression in the XML query language should
be the same wherever it appears.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
Due to the scale of the Internet, a query system is required which scales
and also transforms the passive web into an active environment. NiagaraCQ
addresses this problem by grouping continuous queries based on the observation
that many web queries share similar structures. Grouped queries can share
common computation, tend to fit in memory, and reduce I/O cost
significantly. Grouping also eliminates a large number of unnecessary query
invocations. NiagaraCQ uses an incremental group optimization strategy
where new queries are added to existing query groups, without having to
regroup already installed queries.
NiagaraCQ has advantages over traditional trigger systems. It can install a large number of triggers, whereas a traditional DBMS allows only a limited number of triggers on each table, and a trigger can usually be defined on only a single table. It can monitor autonomous data sources over the Internet. Action and timer events are also included in NiagaraCQ. It is clear that we can have significant gains by using this technique.
In the paper, group optimization is only done to queries that contain
select or join operators. A real solution would need to extend this. A
lot of other questions also need to be answered in the context of caching.
What is the unit of cached data: query plan, query results or the data
itself? In an Internet environment, what kind of grouping approach should
one take?
Moreover, dynamic regrouping, which regroups part or all of the queries
either periodically or when system performance degrades below some
threshold, still needs to be defined.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
Continuous queries help users get notified about a change in
value at the source as soon as it occurs. In a regular database this is
implemented by registering a persistent query with the server, which runs it continually
on the source data and informs the query initiator of any updates. On
the Internet, this approach faces scalability problems, as
the server now has to deal with millions of queries.
NiagaraCQ deals with this problem by grouping similar queries. The key observation is that many queries on the web have very similar structure. Consequently, these queries can be aggregated under a single signature, with different values for certain variables in the signature. This grouping technique allows the queries to share common computation and reduces I/O cost significantly. Moreover, it handles the scalability issue nicely by eliminating a large number of unnecessary query invocations.
The paper describes its goal as developing techniques that allow a very large number of users to register continuous queries in a high-level query language like XML-QL. The data is assumed to be in XML format in a distributed database on the Internet.
The paper introduces an incremental grouping methodology that groups queries according to their signatures. When a new query arrives, the existing groups are considered as possible optimization choices. Instead of re-grouping all the queries at every query arrival, re-grouping is done dynamically on a periodic basis. The new query is merged into the existing groups whose signatures match that of the query. When no group matches a signature of the new query, a new query group for this signature is created in the system.
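The merge-or-create step can be sketched as follows. This is a toy model under stated assumptions: a query is reduced to a single predicate (attribute, operator, constant), its signature is the predicate with the constant removed, and only equality is evaluated. The class GroupOptimizer and its methods are hypothetical names, not the paper's code.

```python
from collections import defaultdict

class GroupOptimizer:
    def __init__(self):
        # signature -> constant table (constant -> ids of the queries using it)
        self.groups = {}

    def install(self, query_id, attribute, operator, constant):
        """Merge the query into the group with its signature; create one if needed.

        Returns True when a new group had to be created.
        """
        signature = (attribute, operator)
        created = signature not in self.groups
        table = self.groups.setdefault(signature, defaultdict(list))
        table[constant].append(query_id)
        return created

    def evaluate(self, rows):
        """Scan the changed rows once per group and route each row to its queries."""
        hits = defaultdict(list)
        for (attribute, operator), table in self.groups.items():
            if operator != "=":
                continue  # this sketch only handles equality predicates
            for row in rows:
                for query_id in table.get(row.get(attribute), ()):
                    hits[query_id].append(row)
        return dict(hits)

optimizer = GroupOptimizer()
created = [
    optimizer.install("q1", "symbol", "=", "INTC"),
    optimizer.install("q2", "symbol", "=", "MSFT"),
    optimizer.install("q3", "symbol", "=", "INTC"),
]
changed_rows = [{"symbol": "INTC", "price": 40}, {"symbol": "IBM", "price": 120}]
hits = optimizer.evaluate(changed_rows)
print(created, sorted(hits))
```

Three installed queries end up in one group, and a single pass over the changed rows serves all of them; installing q2 and q3 never touches the queries already in place.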
The key contribution of the paper is its observation that
similarly structured queries on the web can be grouped. The grouping
technique with the query-split scheme also provides a novel approach to
developing a scalable system.
The paper overall is not very well written.
Some sections were ambiguous and
sometimes a little confusing. Also, the examples the authors describe
and use to test the system are all toy examples. No real-life
data or analysis is provided.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
NiagaraCQ is a continuous query system for Internet databases.
Its properties include:
- It supports incremental evaluation, so it considers only the changed
portion of the XML file and not the entire file.
- Queries are grouped. Group optimization is less computationally
expensive. When a new query comes in, it is merged into the existing groups
that match its signatures.
- The group optimizer chooses the most selective conjunct.
Issues with the paper:
- Overall, it depends on a collection of heuristics.
- Existing groups are not regrouped, so we may have sub-optimal
groups. The authors suggest the next remark as a solution.
- Since the group optimization heuristic does not create a new group, the system
needs to be recalibrated at suitable intervals through 'dynamic reconfiguration',
which was left as future work.
- The group optimizer chooses the most selective conjunct. If a join is
performed before the selection, this may generate a large intermediate
file. A better heuristic, according to the authors, is to do the selection
before the join. However, since
the first approach was not implemented, it was not possible to make
a comparative performance analysis.
- It is mentioned that NiagaraCQ caches recently accessed files and
that the small delta files generated by split operations are consumed
and discarded. The authors argue that a caching policy that favors these small
files saves a lot of disk I/O. However, since those files are written and
read only once, this policy may be worse from a performance point of view.
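The concern about operator ordering in the list above can be seen on a toy example. The sketch below is only an illustration with made-up rows and hypothetical helper names (join, select); it compares intermediate result sizes and is not a measurement of NiagaraCQ.

```python
def join(left, right, key):
    """Nested-loop equi-join of two lists of dict rows on `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def select(rows, predicate):
    """Keep the rows that satisfy `predicate`."""
    return [row for row in rows if predicate(row)]

quotes = [
    {"symbol": "INTC", "price": 40},
    {"symbol": "MSFT", "price": 95},
    {"symbol": "IBM", "price": 120},
]
profiles = [
    {"symbol": "INTC", "industry": "chips"},
    {"symbol": "MSFT", "industry": "software"},
    {"symbol": "IBM", "industry": "services"},
]

def is_intel(row):
    return row["symbol"] == "INTC"

# Plan A: join first, then select. The intermediate result holds every matched pair.
joined = join(quotes, profiles, "symbol")
plan_a = select(joined, is_intel)

# Plan B: select first, then join. The intermediate result holds only the selected rows.
filtered = select(quotes, is_intel)
plan_b = join(filtered, profiles, "symbol")

print(len(joined), len(filtered), plan_a == plan_b)
```

Both plans return the same answer, but the intermediate result of the join-first plan grows with the size of the inputs, while the select-first plan carries only the rows that pass the predicate.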
Querying XML Data
This paper describes a query language called XML-QL that can be used to query XML data. The paper first describes the requirements for such a language and then proposes XML-QL as a language that satisfies most of them. It describes the various features of the language through examples.
The main strength of the paper is that it presents the various issues
faced in the design of XML query languages and then tries to address
some of these issues specifically. It lays down some guidelines for other
query languages that might be designed.
The drawback of the paper is, first, that it is just a specification paper, so there is no place for any evaluation of the language. Second, the language itself seems to have been designed hastily and ignores some of the issues that the authors themselves raise at the beginning. Moreover, there is no discussion of how queries could be optimized using the language or of how practical the language is in a real-world implementation.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
This paper describes the design of an internet query engine for continuous queries which may be based on value or time. The query engine, NiagaraCQ, described here groups together queries on the basis of certain commonality features, and does common computation followed by result splitting for all queries within a group.
The main strength of the paper is that it uses many traditional database
operations for optimizing the query computation and the storage of intermediate
data. It allows incremental grouping of queries and allows the use of
an Internet query language such as XML-QL.
The main weakness of the paper is that the grouping criterion it uses is somewhat specific and could be different in different scenarios. A more generalized approach might be more beneficial. Also, the experiments were conducted using a very small and specific data set, and the queries used were artificial. It would be more interesting to see how this system behaves in a real web environment.
NiagaraCQ: A Scalable Continuous Query System for Internet Databases
Although most people in class argued that this paper was not organized
well and that the experiments are too simple to apply to a web environment,
we can still learn some good ideas from it. The basic point the authors
want to argue
is that the continuous queries on a web server share some common structures,
or signatures, so we can group them based on their signatures. Unfortunately,
the paper only talks about two simple signatures. How to group queries
on a complicated signature is still an open topic. One concern I have
about the grouping policy is that the constant table also grows
incrementally. When the table size is close to the total number of possible
constant values in the data source, using the group
query plan will be no help. In particular, if the data source is very large, the joined
group query plan may be very inefficient. One way out of this is to group
only the most frequent queries. We can in fact look at a group as a dynamic
cache whose size is fixed, with replacement based on the execution frequency
of the queries. For timer-based queries, the execution frequency is
obvious, but for change-based queries we need to predict the frequency.
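The suggestion of treating a group as a fixed-size cache can be sketched as follows. This illustrates the reviewer's proposal, not anything implemented in NiagaraCQ; the class BoundedGroup, its admit method, and the frequency values are hypothetical, and predicting the frequency of change-based queries is left out.

```python
class BoundedGroup:
    def __init__(self, capacity):
        self.capacity = capacity
        self.frequency = {}  # query id -> expected executions per hour (assumed unit)

    def admit(self, query_id, frequency):
        """Try to add a query to the group.

        Returns the id of the query that ends up outside the group, or None
        if there was room for everyone.
        """
        if query_id in self.frequency or len(self.frequency) < self.capacity:
            self.frequency[query_id] = frequency
            return None
        coldest = min(self.frequency, key=self.frequency.get)
        if self.frequency[coldest] >= frequency:
            return query_id  # the newcomer is colder, so it runs ungrouped
        del self.frequency[coldest]
        self.frequency[query_id] = frequency
        return coldest

group = BoundedGroup(capacity=2)
outcomes = [
    group.admit("a", 10),
    group.admit("b", 1),
    group.admit("c", 5),  # displaces "b", the least frequent member
    group.admit("d", 2),  # colder than everyone inside, so it stays out
]
print(outcomes, sorted(group.frequency))
```

With such a bound, the constant table cannot grow past the chosen capacity, at the cost of running the displaced queries outside the group plan.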
To apply these ideas to web proxies, we would expect both the servers and the proxies to support query grouping. However, a query arriving at a server from a proxy could be a complicated join query. Assuming that grouping works well in the proxies and there are not too many different grouped queries, the server can apply more aggressive optimization to the query groups from proxies. On the other hand, we would expect to reduce the communication traffic between servers and proxies if only the intermediate results of a query group need to be passed from server to proxy.