ERIC Number: ED567377
Record Type: Non-Journal
Publication Date: 2014
Pages: 177
Abstractor: As Provided
Reference Count: N/A
ISBN: 978-1-3038-0727-5
An Evolutionary Machine Learning Framework for Big Data Sequence Mining
Kamath, Uday Krishna
ProQuest LLC, Ph.D. Dissertation, George Mason University
Sequence classification is an important problem in many real-world applications. Unlike other machine learning data, sequence data contains no "explicit" features or signals that can help traditional machine learning algorithms learn and predict from the data. Sequence data exhibits inter-relationships among its elements that are important in understanding and predicting future sequences. However, finding these relationships has been proven to be an NP-hard problem. Naive enumeration of combinations of elements, or "brute force" iterative approaches to defining these features, often results in poor predictions. Some algorithms that perform well in prediction lack transparency, i.e., the discriminating features generated by these methods are not easily identifiable. In addition, the size of sequence-based datasets presents practical challenges to most learning algorithms. Most sequence-based datasets contain millions or even billions of instances, for example, the genome-wide sequences of organisms in bioinformatics. At these sizes, classic learning algorithms often become prohibitively expensive, making scalability an important issue. Therefore, there is a need for an approach that can help find features/signals in complex sequences, offer meaningful discriminators, produce good predictions, and scale well in time and space. This dissertation addresses the above issues by designing a comprehensive approach in the form of the Evolutionary Machine Learner (EML) framework. This framework can be employed on sequence-based datasets to generate explicit, human-recognizable features while solving the scalability issue. The EML framework includes a novel EA-based feature generation (EFG) algorithm for automatic feature construction.
The power and usefulness of the EFG algorithm are demonstrated by modeling four complex sequencing problems in bioinformatics and generating meaningful, human-understandable features with accuracy comparable to or better than that of state-of-the-art algorithms. The EFG algorithm is also validated by applying it to time series classification problems, showing the generic nature of the algorithm in finding the important discriminating patterns that assist in modeling sequence-based data. The EML framework addresses the scalability issue by means of a novel, parallel scalable machine learning algorithm (PSBML) based on spatially structured evolutionary algorithms. PSBML is validated on real-world "big data" classification problems for various properties of meta-learning, scalability, and noise resilience using well-known benchmark datasets. The PSBML algorithm is also proven theoretically to be a large margin classifier with linear scalability in training time and space, giving it a unique distinction among existing large-scale learning algorithms. Finally, the EML framework is validated on a large genome-wide bioinformatics classification problem and a large time series problem, showing that the combined algorithms achieve higher predictive performance, training time speedup, and the ability to produce human-understandable discriminating signals as features. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone 1-800-521-0600. Web page:]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A