ERIC Number: ED551104
Record Type: Non-Journal
Publication Date: 2012
Pages: 232
Abstractor: As Provided
ISBN: 978-1-2677-3085-5
ISSN: N/A
EISSN: N/A
Large-Scale Machine Learning for Classification and Search
Liu, Wei
ProQuest LLC, Ph.D. Dissertation, Columbia University
With the rapid development of the Internet, tremendous amounts of data, including millions or billions of images and videos, can now be collected for training machine learning models. Motivated by this trend, this thesis is dedicated to developing large-scale machine learning techniques that make classification and nearest neighbor search practical on gigantic databases.

1) Large Graph Construction: We present a novel graph construction approach, called "Anchor Graphs," which enjoys linear space and time complexities and can therefore be built efficiently over gigantic databases. The central idea is to introduce a small set of anchor points and convert the intensive data-to-data affinity computation into a drastically cheaper data-to-anchor affinity computation, from which a low-rank data-to-data affinity matrix is derived (a toy construction is sketched after this abstract). We also prove that the Anchor Graph admits an intuitive probabilistic interpretation: each entry of the derived affinity matrix can be read as a transition probability between two data points under Markov random walks.

2) Large-Scale Semi-Supervised Learning: We employ Anchor Graphs to develop a scalable solution for semi-supervised learning, which capitalizes on both labeled and unlabeled data to learn graph-based classification models. We propose several key methods for building scalable semi-supervised kernel machines that can handle real-world, linearly inseparable data. The proposed techniques exploit the Anchor Graph from a kernel point of view, generating a set of low-rank kernels that encode the neighborhood structure the Anchor Graph reveals. By linearizing these low-rank kernels, training nonlinear kernel machines in semi-supervised settings reduces to training linear SVMs in supervised settings, so the computational cost of classifier training drops substantially (see the linearization sketch below). We achieve excellent classification performance with the proposed semi-supervised kernel machine: a linear SVM with a linearized Anchor Graph warped kernel.

3) Unsupervised Hashing: We present a novel unsupervised hashing approach built on the Anchor Graph, which captures the underlying manifold structure. Anchor Graph Hashing allows constant-time hashing of a new data point by extrapolating graph Laplacian eigenvectors to eigenfunctions (see the hashing sketch below). Furthermore, a hierarchical threshold learning procedure produces multiple hash bits per eigenfunction, leading to higher search accuracy.

4) Supervised Hashing: We present a novel kernel-based supervised hashing model that requires only a limited amount of supervision, in the form of similar and dissimilar data pairs, yet achieves high hashing quality at a practically feasible training cost. The idea is to map the data to compact binary codes whose Hamming distances are simultaneously minimized on similar pairs and maximized on dissimilar pairs. Unlike prior work, our approach exploits the equivalence between optimizing code inner products and optimizing Hamming distances (see the identity check below). This lets us train the hash functions sequentially and efficiently, one bit at a time, yielding very short yet discriminative codes. The presented supervised hashing approach is general, supporting search for both semantically similar neighbors and metric-distance neighbors.
5) Hyperplane Hashing: We present a novel hyperplane hashing technique that yields high search accuracy with compact hash codes. The key idea is a novel bilinear form used in designing the hash functions, which achieves a higher collision probability than all existing hyperplane hash functions when using random projections (see the bilinear sketch below). To further improve performance, we develop a learning-based framework in which the bilinear functions are learned directly from the input data, resulting in compact yet discriminative codes and superior search performance over all random-projection-based solutions. (Abstract shortened by UMI.) [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
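The sketches below are illustrative only; parameter names, kernels, and selection schemes are assumptions, not the thesis's exact recipes. First, a minimal Anchor Graph construction for part 1, assuming k-means centers as anchors, a Gaussian kernel, and s-nearest-anchor sparsification:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def anchor_graph(X, m=100, s=3, sigma=1.0):
    """Toy Anchor Graph: O(n*m) affinity work instead of O(n^2).

    X     : (n, d) data matrix
    m     : number of anchors (m << n)
    s     : nearest anchors kept per point
    sigma : Gaussian bandwidth (an assumed, tunable choice)
    """
    n = X.shape[0]
    # Anchors as k-means centers -- one common way to pick them.
    anchors = KMeans(n_clusters=m, n_init=4, random_state=0).fit(X).cluster_centers_

    # Data-to-anchor squared distances: an (n, m) matrix, never (n, n).
    d2 = cdist(X, anchors, "sqeuclidean")

    # Sparse data-to-anchor affinity Z: Gaussian weights on each
    # point's s nearest anchors, rows normalized to sum to 1.
    Z = np.zeros((n, m))
    nearest = np.argsort(d2, axis=1)[:, :s]
    rows = np.arange(n)[:, None]
    w = np.exp(-d2[rows, nearest] / (2.0 * sigma**2))
    Z[rows, nearest] = w / w.sum(axis=1, keepdims=True)

    # Low-rank data-to-data affinity A = Z Lambda^{-1} Z^T with
    # Lambda = diag(Z^T 1). Rows of A sum to 1, so A_ij reads as a
    # data -> anchor -> data transition probability, matching the
    # Markov random walk interpretation in the abstract.
    lam = np.maximum(Z.sum(axis=0), 1e-12)
    A = (Z / lam) @ Z.T  # materialized here only for illustration
    return Z, lam, A
```

At scale one would keep only the (n, m) matrix Z and apply A implicitly, which is what yields the linear space and time complexities the abstract claims.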
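For part 2, the linearization step rests on a standard fact: a positive semi-definite kernel of the form K = Phi Phi^T is exactly the Gram matrix of the explicit features Phi, so a kernel SVM using K is equivalent to a linear SVM trained directly on Phi. A sketch with synthetic features standing in for the Anchor-Graph-derived kernel (the thesis's warped kernel is more elaborate):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
n, m = 200, 16
Phi = rng.standard_normal((n, m))   # explicit low-rank features, e.g.
                                    # derived from the Anchor Graph's Z
y = (Phi[:, 0] + 0.1 * rng.standard_normal(n) > 0).astype(int)

# Kernel machine on the low-rank kernel K = Phi Phi^T ...
K = Phi @ Phi.T
ksvm = SVC(kernel="precomputed", C=1.0).fit(K, y)

# ... versus a linear SVM on the explicit features Phi. Loss and
# solver details differ slightly, so the decision rules only nearly
# agree, but the linear SVM trains in time linear in n.
lsvm = LinearSVC(C=1.0, max_iter=10000).fit(Phi, y)
print((ksvm.predict(K) == lsvm.predict(Phi)).mean())
```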
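For part 3, because A = Z Lambda^{-1} Z^T is low-rank, its graph eigenvectors fall out of a small m-by-m eigenproblem, and a new point is hashed in constant time by pushing its data-to-anchor row through the same small projection, i.e. the eigenvector-to-eigenfunction extrapolation the abstract mentions. A one-bit-per-eigenfunction sketch reusing Z and lam from anchor_graph above; the thesis's hierarchical multi-bit thresholds are omitted:

```python
import numpy as np

def agh_train(Z, lam, r):
    """Sketch of Anchor Graph Hashing: r bits from graph eigenvectors.

    Z   : (n, m) data-to-anchor affinities; lam = Z.sum(axis=0)
    r   : number of hash bits (r < m)
    """
    B = Z / np.sqrt(lam)                  # Z @ Lambda^{-1/2}
    sig, V = np.linalg.eigh(B.T @ B)      # small m x m eigenproblem
    top = np.argsort(sig)[::-1][1:r + 1]  # skip the trivial all-ones mode
    # Projection whose columns give eigenvectors of A as Y = Z @ W.
    W = (V[:, top] / np.sqrt(sig[top])) / np.sqrt(lam)[:, None]
    return np.sign(Z @ W), W              # codes for the training set

def agh_hash(z_new, W):
    # Constant-time hashing of a new point from its (1, m) anchor row:
    # the eigenfunction extrapolation, then sign thresholding (zero
    # threshold, since nontrivial eigenvectors are near zero-mean).
    return np.sign(z_new @ W)
```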
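Part 4's training trick hinges on an exact identity: for r-bit codes over {-1, +1}, the inner product satisfies <c_i, c_j> = r - 2 * D_H(c_i, c_j), so pushing inner products of similar pairs toward +r and of dissimilar pairs toward -r is the same as pushing Hamming distances toward 0 and r. A quick numeric check:

```python
import numpy as np

r = 8
rng = np.random.default_rng(0)
ci = rng.choice([-1, 1], size=r)   # two arbitrary r-bit codes
cj = rng.choice([-1, 1], size=r)

hamming = int(np.sum(ci != cj))
inner = int(ci @ cj)

# <ci, cj> = r - 2 * D_H(ci, cj): agreeing bits contribute +1 to the
# inner product, disagreeing bits -1. Optimizing inner products is
# therefore equivalent to optimizing Hamming distances, which is what
# lets the hash functions be trained one bit at a time.
assert inner == r - 2 * hamming
print(hamming, inner)
```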
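Finally, a sketch of the bilinear hash family behind part 5, with randomly drawn projection pairs (u, v); the learned variant in the thesis fits these projections to the data instead. A database point x gets bit sign((u.x)(v.x)), while a hyperplane query's normal w is coded with the opposite sign: for independent Gaussian u and v the collision probability works out to 2p(1-p) with p = 1 - theta/pi (theta the angle between x and w), which peaks exactly at theta = pi/2, i.e. for points lying closest to the hyperplane.

```python
import numpy as np

def bilinear_bits(Z, U, V):
    """Bit k of row z: sign((u_k . z) * (v_k . z)).

    Z    : (n, d) vectors to hash
    U, V : (r, d) projection pairs (random here; learned in the thesis)
    """
    return np.sign((Z @ U.T) * (Z @ V.T))   # (n, r) in {-1, +1}

rng = np.random.default_rng(0)
d, r, n = 16, 32, 2000
U = rng.standard_normal((r, d))
V = rng.standard_normal((r, d))

X = rng.standard_normal((n, d))             # database points
w = rng.standard_normal(d)                  # hyperplane query normal

codes = bilinear_bits(X, U, V)
query = -bilinear_bits(w[None, :], U, V)    # sign flip so that points
                                            # near the hyperplane collide

# Rank database points by Hamming distance to the query code; points
# with small |w . x| (close to the hyperplane) should rank early.
ham = np.count_nonzero(codes != query, axis=1)
top = np.argsort(ham)[:20]
print(np.abs(X @ w)[top].mean(), "vs", np.abs(X @ w).mean())
```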
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A