ERIC Number: ED563522
Record Type: Non-Journal
Publication Date: 2013
Pages: 109
Abstractor: As Provided
Reference Count: N/A
ISBN: 978-1-3035-2574-2
Feature Selection with Missing Data
Sarkar, Saurabh
ProQuest LLC, Ph.D. Dissertation, University of Cincinnati
In the modern world, information has become the new power. Increasing effort is being made to gather data: resources are being allocated, time is being invested, and tools are being developed. Data collection is no longer a myth; however, it remains a great challenge to create value out of the enormous amount of data being collected. Data modeling is one of the ways in which data is utilized. When we try to model a process or a system, it is crucial to have the right features, and thus feature selection has become an essential part of data modeling. Yet we often have missing data, and in a worse scenario the important features themselves may have considerable data missing. The challenge is to pick out the best features and yet accommodate the missing data. To address this problem, this dissertation introduces a cluster-based feature selection process that is quite robust in handling missing data. The research extends the Minimum Expected Cost of Misclassification (MECM) based feature selection method to very high-dimensional datasets by using cluster-based sampling methods. However, even though the cluster-based sampling methods allow the MECM to scale to larger datasets, determining the optimal cluster size is still a challenge. This is the first issue that the dissertation aims to solve. The second issue that the dissertation addresses is handling missing data while performing feature selection with the MECM-based method. This area has not been studied as extensively as feature selection itself, although missing data is encountered quite often. The dissertation discusses an algorithm that enables the MECM to handle missing data. The approach is probabilistic and is based on the distribution of the most similar instances. The algorithm determines the probability that an instance with missing data lies in the sampling cluster and then uses a fractional count when evaluating the MECM.
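The fractional-count idea described above can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm; it is a hypothetical illustration under several assumptions that the abstract does not confirm: the sampling cluster is modeled as a ball with a given center and radius, similarity is Euclidean distance over the observed features only, the probability is the share of the k most similar complete instances that fall inside the cluster, and all function and parameter names are invented for this example.

```python
import math


def membership_probability(instance, complete_rows, center, radius, k):
    """Estimate the probability that an instance with missing values
    (marked as None) lies inside the sampling cluster.

    Assumption: the probability is the fraction of the k most similar
    complete instances that fall within `radius` of `center`.
    """
    if not complete_rows:
        raise ValueError("need at least one complete instance")
    observed = [i for i, v in enumerate(instance) if v is not None]

    def partial_distance(row):
        # Compare only on the features that the instance actually has.
        return math.sqrt(sum((row[i] - instance[i]) ** 2 for i in observed))

    neighbours = sorted(complete_rows, key=partial_distance)[:k]
    inside = sum(1 for row in neighbours if math.dist(row, center) <= radius)
    return inside / len(neighbours)


def fractional_class_counts(rows, labels, complete_rows, center, radius, k):
    """Count instances per class inside the cluster.

    Complete instances count as 1 or 0; instances with missing values
    contribute their estimated membership probability instead.
    """
    counts = {}
    for row, label in zip(rows, labels):
        if any(v is None for v in row):
            weight = membership_probability(row, complete_rows, center, radius, k)
        else:
            weight = 1.0 if math.dist(row, center) <= radius else 0.0
        counts[label] = counts.get(label, 0.0) + weight
    return counts
```

In this sketch the per-class fractional counts would stand in for the whole-number counts that a misclassification-cost estimate would otherwise use. The choice of k matters: a small k makes the probability estimate noisy, while a large k pulls in instances that are not very similar.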
One of the challenges of this approach is to correctly estimate the probability that a point with missing data lies within the sampling cluster. The key lies in picking the correct number of similar instances from which to calculate the probability. The dissertation also seeks to address this problem. The last part of the research contains a benchmark study to determine the effectiveness of the algorithm. A wrapper-based feature selection method using Naive Bayes and another method using the MECM without the missing-data algorithm are used as benchmarks. The MECM missing-data algorithm showed a significant improvement over the other two. Solving these problems is of great practical significance to data modeling. Instances with missing data might carry critical information; ignoring missing data during feature selection can have a cascading effect downstream when the final model is built. This research will enable us to choose better features, which would in turn improve the accuracy of existing models. It will impact a broad range of applications, from gene-based medicine, fraud detection models, engineering, and business to any field that uses feature selection as a component of the model-building process. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page:]
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A