ERIC Number: ED534283
Record Type: Non-Journal
Publication Date: 2011
Pages: 130
Abstractor: As Provided
Reference Count: 0
ISBN: 978-1-1249-2081-8
Scalable Kernel Methods and Algorithms for General Sequence Analysis
Kuksa, Pavel
ProQuest LLC, Ph.D. Dissertation, Rutgers The State University of New Jersey - New Brunswick
Analysis of large-scale sequential data has become an important task in machine learning and pattern recognition, inspired in part by numerous scientific and technological applications such as document and text classification or the analysis of biological sequences. However, current computational methods for sequence comparison still lack the accuracy and scalability necessary for reliable analysis of large datasets. To this end, we develop a new framework (efficient algorithms and methods) that solves sequence matching, comparison, classification, and pattern extraction problems in linear time, with increased accuracy, improving over the prior art. In particular, we propose novel ways of modeling sequences under complex transformations (such as multiple insertions, deletions, mutations) and present a new family of similarity measures (kernels), the spatial string kernels (SSK). SSKs can be computed very efficiently and perform better than the best available methods on a variety of distinct classification tasks. We also present new algorithms for approximate (e.g., with mismatches) string comparison that improve currently known time complexity bounds for such tasks and show order-of-magnitude running time improvements. We then propose novel linear time algorithms for representative pattern extraction in sequence data sets that exploit the developed computational framework. In an extensive set of experiments on many challenging classification problems, such as detecting homology (evolutionary similarity) of remotely related proteins, categorizing texts, and performing classification of music samples, our algorithms and similarity measures display state-of-the-art classification performance and run significantly faster than existing methods. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page:]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A