ERIC Number: ED534374
Record Type: Non-Journal
Publication Date: 2011
Pages: 164
Abstractor: As Provided
Reference Count: N/A
ISBN: 978-1-1249-0341-5
Privacy Preserving PCA on Distributed Bioinformatics Datasets
Li, Xin
ProQuest LLC, Ph.D. Dissertation, University of Maryland, Baltimore County
In recent years, new bioinformatics technologies, such as gene expression microarray, genome-wide association study, proteomics, and metabolomics, have been widely used to simultaneously identify a huge number of human genomic/genetic biomarkers, generate a tremendously large amount of data, and dramatically increase knowledge of human genomic/genetic information, thus significantly improving biomedical research. However, these exciting advances in bioinformatics do come with a drawback: the increasingly rich human genomic/genetic data contains sensitive private information, such as genetic markers, diseases, etc., which may further lead to the discovery of the individual's race, family, or even identity. Therefore, privacy is an important issue when dealing with bioinformatics data. This is further exacerbated when multiple data providers try to collaborate with each other. This dissertation presents a set of novel approaches for Privacy Preserving Principal Component Analysis (PP-PCA) computations on genomic data from several distributed parties. The approaches allow data providers to collaborate to identify gene profiles, biomarkers, and possible new pathways from a global viewpoint, and at the same time protect sensitive genomic data from possible privacy breaches. Based on our approaches, we further provide a PP-PCA gene clustering framework and workflow that includes two types of roles: data providers and a trusted center. Within this mechanism, scenarios of horizontal, vertical, and mixed partitioning are covered. Under the horizontal partitioning scenario, distributed genomic datasets can be processed for global PCA gene clustering analysis with privacy protection. Furthermore, compared to the results from a centralized scenario, the results calculated from distributed partitions using our mechanism maintain 100% accuracy.
Experiments on five genomic datasets are conducted, and the results show that our framework produces exactly the same results as from merged datasets. In the vertical partitioning scenario, two different methodologies are employed: Collective Principal Component Analysis (CPCA) and Repeating Principal Component Analysis (RPCA). CPCA requires local sites to transmit a sample of original data to a Trusted Center Site (TCS). CPCA can be applied to datasets each having a different number of columns. The RPCA approach requires that all local sites have the same or a similar number of columns, but releases very little information about the original datasets. Experiments on five genomic datasets show that both CPCA and RPCA approaches maintain very good accuracy compared with a centralized scenario. Under the mixed partitioning scenario, the more generic situation, multiple conditions are discussed, and the conditions of "Vertical Partitioning with Extra Rows" (VPER) and "Horizontal Partitioning with Extra Columns" (HPEC) are identified as the valuable and practical types of mixed partitioning scenario. Both CPCA- and RPCA-related methodologies are applied to the VPER condition, and a horizontal-partitioning-related method is applied to the HPEC condition. Overall, this dissertation offers multiple approaches to build a framework to handle multiple situations on distributed PCA gene clustering, and experimental results show it could obtain accurate global results and preserve data privacy. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone at 1-800-521-0600. Web page:]
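Illustrative note: the abstract's claim that horizontally partitioned sites can reproduce the merged-dataset PCA exactly can be sketched with a standard sufficient-statistics approach. This is not taken from the dissertation and may differ from its actual protocol. It assumes every site holds different rows (records) over the same columns and sends the trusted center only its row count, column sums, and Gram matrix. The function names local_summary and global_pca are hypothetical. Sharing these aggregates is not by itself a formal privacy guarantee.

```python
import numpy as np

def local_summary(X):
    """Run at each data provider: row count, column sums, and Gram matrix X^T X."""
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), X.T @ X

def global_pca(summaries):
    """Run at the trusted center, using only the per-site summaries."""
    n = sum(s[0] for s in summaries)
    col_sum = sum(s[1] for s in summaries)
    gram = sum(s[2] for s in summaries)
    mean = col_sum / n
    # Sum of (x - mean)(x - mean)^T equals X^T X - n * mean mean^T.
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return eigvals[order], eigvecs[:, order], cov

# Synthetic stand-in data: 60 rows over 5 shared columns, split across 3 sites.
rng = np.random.default_rng(0)
full = rng.normal(size=(60, 5))
parts = [full[:20], full[20:45], full[45:]]

eigvals, eigvecs, cov = global_pca([local_summary(p) for p in parts])
central_cov = np.cov(full, rowvar=False)  # what a merged dataset would give
print(np.allclose(cov, central_cov))
```

Because the covariance matrix is rebuilt from exact sums, the principal components match those of the merged data up to floating-point error, which is consistent with the "exactly the same results" statement for the horizontal case.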
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site:
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A