Files
Abstract
The adoption of the Resource Description Framework (RDF) as a metadata representation standard is spurring the development of high-level mechanisms for storing and querying RDF data. Many of the proposed systems are built on Relational/Object-Relational Databases with a translation of queries posed in the supported RDF query language to SQL for processing by the database. Graph pattern matching which matches a query graph against a data graph, often require join operations. To process join operations, the database optimizer determines an optimal join order from a cost model which employs the expected cardinality of join results as a key parameter. This parameter is estimated from a statistical summary of the data maintained in memory. In this work, we argue that the data summarization technique employed by database systems are oblivious of the graph structure of RDF data and may lead to estimation errors which result in the choice of a sub-optimal query plan. We present and evaluate two techniques for estimating the frequency of subgraphs utilizing a small statistical summary of the graph, based on occurrences. In the first technique, we summarize the graph in the P-Tree by pruning small subgraphs based on a valuation scheme that blends information about their importance and estimation power. In the second technique, we assume that edge occurrences on edge sequences of length maxL are position independent. We then summarize the most informative dependencies in the MD-Tree. In both techniques, we assume conditional independence to estimate the frequencies of larger subgraphs. We present extensive experiments on real world and synthetic datasets which confirm the feasibility of our approach. Our experiments are geared towards showing that the estimates obtained from the proposed summaries are accurate as well as effective for optimizing graph pattern queries posed over RDF graphs.