Mining and Indexing Graph Databases.

Yuan, Dayu

Graphs are widely used to model structures and relationships of objects in various scientific and commercial fields. Chemical molecules, proteins, malware system-call dependencies and three-dimensional mechanical parts are all modeled as graphs. In this dissertation, we propose to mine and index those graph data to enable fast and scalable search. There are two common search scenarios: subgraph search and supergraph search. In a subgraph search, given a query graph q, the algorithm searches for all graphs that have q as a subgraph, from a graph database. A supergraph search, on the other hand, retrieves all the database graphs that have q as a supergraph. Determining whether a graph is a subgraph (or supergraph) of another is an NP-complete problem. Hence, it is challenging to efficiently process graph queries on large databases. Graph indices are commonly built to fast process graph queries. Subgraph patterns are mined from the graph database to build such graph indices. It has been shown that both the index structure of the graph indices and the subgraph patterns that are selected for indexing determine the time of processing graph queries. We address the graph search problem by designing innovative index structures and proposing new subgraph pattern mining algorithms. First, we propose an index structure, Lindex, which organizes subgraph patterns in a lattice. As a novel index structure, Lindex (1) is effective in filtering false graphs, (2) provides fast index lookups, (3) is fast with respect to index construction and maintenance, and (4) can be constructed using any set of substructure index patterns. These four properties result in a fast and scalable graph-querying infrastructure. Lindex is compatible with any choice of patterns. Empirically, we demonstrate that Lindex used in conjunction with subgraph patterns proposed in previous works outperforms other specifically designed index structures. Second, we address the pattern mining problem by modeling it as a combinatory optimization problem. Instead of mining subgraph patterns based on heuristics as in previous works, we proposed a greedy algorithm, which mines index patterns that minimize the query-processing time with theoretical guarantee. Specifically for the supergraph search, previous algorithms are proposed to optimize either a "filtering gain" or a "prefix-sharing gain." No single subgraph-mining algorithm considers both gains, although these two gains jointly model the savings of the query-processing time. We are the first one to mine subgraph patterns to optimize both the filtering gain and the prefix-sharing gain. Finally, we study the update of graph indices. All previous algorithms are mine-at-once algorithms in which the whole set of index patterns is mined in one run before building a graph index. Because of the change of environments, such as an update of the graph database and the increase of available memory, the index needs to be updated to accommodate such changes. Most of the mine-at-once algorithms are time-consuming because they involve frequent subgraph or subtree mining over the whole graph database. Also, constructing and deploying a new index involves expensive disk operations. Hence, it is inefficient to re-mine the patterns and rebuild the index from scratch. We propose an "iterative mining" algorithm and a "one-pass mining" algorithm to address the index-update problem. Both algorithms replace an old index pattern p' with a newly selected subgraph pattern p until the decrease of the query-processing time caused by the swap is lower than a threshold. The "iterative mining" algorithm always mines the new subgraph pattern p that maximizes the decrease of the query-processing time. And it can achieve an approximation ratio of 1 - 1/e. The "one-pass mining" algorithm, on the other hand, can choose any pattern p that meets a criterion f(p, p') > 0, where the function f([middle dot]) measures the advantage of indexing p over p'. The "one-pass mining" algorithm have an approximation ratio of 1/4 with faster running-time than iterative mining.. We conduct extensive empirical studies to study the performance of our proposed algorithms. The experiment results demonstrate the effectiveness of our algorithms by showing their improvement over the state-of-the-art approaches over existing benchmark datasets. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone (800) 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]