Files
Abstract
In recent years, the robustness of graphs for representing complex data has led to a proliferation of research on graph databases and analytics. One important topic in the field is graph pattern matching, which can be used, for example, in the processing of queries in graph databases. Though fast querying is highly desirable, pattern matching algorithms are hindered by the NP-completeness of the subgraph isomorphism problem. This paper presents a conceptually simple, memory-efficient, pruning-based algorithm for the subgraph isomorphism problem that outperforms commonly used algorithms by orders of magnitude on large labeled graphs. This speedup is due in large part to the effectiveness of the pruning algorithm, known as dual simulation, which in many cases removes a large percentage of the vertices not found in isomorphic matches. In this paper, the runtime of the algorithm is tested on synthetic graphs of up to 10 million vertices and 250 million edges and on two real life datasets, comparing when possible to an adjacency list version of Ullmanns algorithm and to the VF2 algorithm. To the best of our knowledge, this is the first paper to test a centralized subgraph isomorphism algorithm on graphs of this magnitude. The algorithm is tested extensively to determine the effects of label density, edge density, datagraph size, degree distribution, and query graph size and type on runtime. The effectiveness of the algorithm is then demonstrated on two large real life graphs. The algorithm is easily extendable to graphs with multiple attributes on vertices and edges, making it an ideal candidate to serve as the backbone of a query processing engine for a graph database.