Leveraging Code Comments to Improve Software Reliability.

Tan, Lin

Commenting source code has long been a common practice in software development. This thesis, consisting of three pieces of work, makes novel use of code comments written in natural language to improve software reliability. Our solution combines Natural Language Processing (NLP), Machine Learning, Statistics, and Program Analysis techniques to achieve this goal.

First, innovations from multiple directions have been proposed to improve software reliability. Unfortunately, many of these innovations are not fully exploited by programmers. To bridge the gap, we propose a new approach, cComment, to "listen" to thousands of programmers by studying their comments. Since comments express programmers' assumptions and intentions, they can reveal programmers' needs. These needs provide guidance (1) for language and tool designers on where they should develop new techniques or enhance the usability of existing ones, and (2) for programmers on which problems are most pervasive and important, so that they can take the initiative to adopt existing tools or language extensions. We studied 1050 comments randomly sampled from the latest versions of Linux, FreeBSD, and OpenSolaris at the time of writing. We found that 52.6% of these comments could be leveraged by existing or to-be-proposed tools for improving reliability. Our findings include: (1) many comments describe code relationships, code evolution, or the usage and meaning of integers and integer macros, (2) a significant number of comments could be expressed in existing annotation languages, and (3) many comments express synchronization-related concerns but are not well supported by annotation languages.

Second, compared to source code, comments are more direct, descriptive, and easy to understand. Comments and source code provide relatively redundant and independent information regarding a program's semantic behavior. As software evolves, the two can easily grow out of sync, indicating one of two problems: (1) bugs, where the source code does not follow the assumptions and requirements specified by correct program comments; or (2) bad comments, where comments are inconsistent with correct code and can mislead programmers into introducing bugs in subsequent versions. Unfortunately, as most comments are written in natural language, no solution has been proposed to automatically analyze comments and detect inconsistencies between comments and source code. iComment takes the first step in automatically analyzing comments written in natural language: it extracts implicit program rules and uses these rules to automatically detect inconsistencies between comments and source code, each of which indicates either a bug or a bad comment. We evaluate iComment on four large code bases: Linux, Mozilla, Wine, and Apache. Our experimental results show that iComment automatically extracts 1832 rules from comments with 90.8-100% accuracy and detects 60 comment-code inconsistencies (33 new bugs and 27 bad comments) in the latest versions of the four programs when the study was conducted. Nineteen of them (12 bugs and 7 bad comments) have already been confirmed by the corresponding developers, while the others are currently being analyzed by the developers.

Lastly, we propose and implement aComment to detect operating system concurrency bugs and handle the complex interaction between interrupts and locks. Specifically, we designed a new type of interrupt-related annotation and semi-automatically generated 96,821 such annotations for the Linux kernel. These annotations were automatically propagated from 246 seed annotations, which were inferred directly from comments and code assertions. By extracting annotations from both comments and code, we are able to extract more annotations than from either source alone, as only a small number (6) of the annotations can be extracted from both sources. The annotations were then checked against the source code to detect software bugs, and 9 bugs were detected in the latest version of the Linux kernel at the time of writing.

[The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone (800) 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]