ERIC Number: ED549950
Record Type: Non-Journal
Publication Date: 2012
Pages: 281
Abstractor: As Provided
Reference Count: N/A
ISBN: 978-1-2672-9487-6
Querying Patterns in High-Dimensional Heterogenous Datasets
Singh, Vishwakarma
ProQuest LLC, Ph.D. Dissertation, University of California, Santa Barbara
Recent technological advances have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a word-histogram. Analyzing these datasets gives better insights into the processes generating the datasets, opens new frontiers of scientific research, and fuels development of life-changing products. An effective mechanism for exploring these heterogenous datasets is querying. One such kind of query is a pattern query. Given a heterogenous dataset and a query, the task here is to find a set of objects which are constrained by a relationship and satisfy the query. For example, given a dataset of keyword-tagged objects, a useful pattern query is to find a set of similar objects that contains a given set of keywords. Querying patterns in high-dimensional heterogenous datasets brings about a new set of computational challenges. High-performance algorithms to efficiently and accurately query patterns are presented in this thesis. First, a scalable algorithm, SIMP, is described for accurately querying near neighbors in a high-dimensional dataset. SIMP significantly outperforms the state-of-the-art techniques. Next, a novel algorithm, ProMiSH, is proposed for efficiently querying patterns by keywords. ProMiSH has a speed-up of more than four orders of magnitude over the state-of-the-art techniques. Then, an algorithm, QUIP, is described for querying patterns by example in a spatial dataset, e.g., geographical maps. QUIP offers an improvement of 87% in running time over the baseline approach. Next, an algorithm for querying patterns by example in a temporal dataset is described. It specifically solves the problem of finding duplicate videos. The proposed algorithm yields a practical query time for video duplicate detection.
Finally, a scalable method to compute statistical significance of results of a multi-object query is discussed. Statistical significance, or p-value, provides a more useful criterion for ranking the results of a query. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone 1-800-521-0600. Web page:]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A