Facilitating Internet-Scale Code Retrieval.

Bajracharya, Sushil Krishna

Internet-Scale code retrieval deals with the representation, storage, and access of relevant source code from the large amount of source code available on the Internet. Internet-Scale code retrieval systems support common emerging practices among software developers related to finding and reusing source code. In this dissertation we focus on some system and domain-specific challenges of Internet-Scale code retrieval. This dissertation starts with an in-depth study of how developers use Koders, a commercial code search engine. The results of this study highlight several problems that need to be tackled in a commercial code search engine. To build solutions for some of these problems we develop an infrastructure, Sourcerer, that includes models and tools for large-scale collection and analysis of open source code. The stored contents and set of programmable services in Sourcerer enable rapid development and evaluation of retrieval schemes and applications of code search. We demonstrate the feasibility of developing state-of-the-art Internet-Scale code retrieval techniques on top of Sourcerer by presenting the implementation and evaluation details of code-specific retrieval schemes and code search tools. The central premise of this dissertation is that source code retrieval techniques that incorporate structural information extracted from source code can be more effective in retrieving relevant code entities. We support this premise by presenting three approaches that leverage structural information in code search. First, we present structure-based techniques to improve ranking in retrieving implementations of commonly sought programming features, where our best technique outperforms Google and Google Code Search. Second, we present Test-Driven Code Search (TDCS), an approach to finding reusable code fragments on the Internet that uses structure-based code retrieval and dependency slicing, a technique to automatically pull code dependencies.
Evaluation of TDCS with 34 students shows that TDCS is the fastest approach to finding reusable code fragments for 59% of the students, and faster than Google Code Search for 66% of the students. Finally, we present Structural Semantic Indexing, a technique to associate meaningful terms with source code entities that improves the performance of retrieving code fragments to be used as API usage examples. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]