PMAN4 - SEL@KIT

Thesis

Bridging Source Code and Natural Language for Code Search

June 2024
Ph.D. thesis / Graduate School of Science and Technology, Kyoto Institute of Technology

Juntong Hong

Abstract

Software is deeply integrated into today's society. From large-scale social infrastructure to smart homes, many things in our lives can be controlled or enriched with additional functionality by software. Although software development is becoming more complex, developers must still attend to both development efficiency and software quality assurance throughout the development process. This increases the burden on developers and may turn the interplay between development and quality assurance into a vicious cycle. Well-written code produced in the programming phase is important and can significantly reduce the effort required for quality assurance, but balancing programming efficiency and quality remains a challenge.

Code search is an important task in software engineering. It aims to retrieve the most relevant code from large code repositories for a given natural-language query (hereafter, query). The retrieved code can improve software development in several ways. It can be slightly modified to implement the target functionality or serve as inspiration for implementing similar functionality. Furthermore, well-written code with the same functionality from a high-quality repository can be reused directly or help quickly locate a functional module in a large project. Hence, interest in improving code search performance has grown in recent years.

The primary challenge for code search models is bridging the gap between code and query, which differ in nature. Effectively capturing and establishing relationships between code and query (hereafter, alignments) is therefore crucial for code search. Traditional methods (hereafter, retrieval-based methods) are primarily based on information retrieval (IR) techniques.
Thus, they commonly rely on feature engineering to expand features (e.g., tokens or words) through external knowledge databases and apply text similarity techniques to capture the alignments between code and query. However, understanding queries and adequately representing word features are challenging for retrieval-based methods. Code search methods based on deep learning (hereafter, DL-based methods) hold great potential to address these problems. DL-based methods can effectively learn the semantics of code and query and handle unseen features better, addressing the limitations of retrieval-based methods.

However, DL-based methods still have unresolved problems. For example, many recent DL-based methods use different encoders to generate different representations for the same code token, or even combine unique tokens to create a new unique token. The weakness of these methods lies first in their vocabulary: each encoder needs its own vocabulary, and these vocabularies contain many overlapping and redundant tokens. The redundant tokens and the growing number of created tokens further increase the burden on the vocabulary and lead to out-of-vocabulary problems. Moreover, alignment learning in these methods is often simple and insufficient to capture and establish effective alignments. In addition, treating code as plain text almost inevitably runs into vocabulary problems, and learning hierarchical structural information from plain-text code is difficult.

In this dissertation, we focus on the following challenges: 1) the lack of exploration of alignment learning between code and query; 2) overdependence on additional manual feature engineering and the lack of exploration of enhancing code embedding representations (embedding is explained in Section 2.1); 3) the need for a novel approach to learning code embeddings without a vocabulary.
To address these challenges, we propose a combined alignment model and a global alignment model for code search. Moreover, we propose a code detection model to explore the effectiveness of representing code as an image instead of plain text. Note that the code detection model does not target code search, since representing code as an image needs further improvement to achieve the expected performance in code search (we describe future improvement directions in Section 6.2.2). Nevertheless, its experimental results have positive implications for enhancing code understanding and provide novel insight into code encoding.

First, unlike many DL-based methods, which still rely on expanding code features to facilitate alignment learning between code and query, the combined alignment model investigates improving code search performance with limited manual feature engineering through cross-modal alignment between code and query. It proposes a novel alignment learning approach that facilitates capturing alignments between code and query from limited features. Furthermore, it demonstrates the importance and effectiveness of carefully designed cross-modal alignment for code search performance. We conducted several evaluation experiments to measure the effectiveness and improvement of the cross-modal model. The results show that the combined alignment model performs better than previous studies while using limited manual feature engineering.

Second, the global alignment model further develops feature enhancement under limited manual feature engineering and proposes a novel alignment learning approach. It transforms code and query embeddings into diverse and distinctive representations and further proposes a dense graph convolutional network to capture global alignments between code and query. We also conducted evaluation experiments against previous DL-based studies to demonstrate the effectiveness of the global alignment model.
As a result, we observe that the global alignment model significantly improves code search performance and outperforms previous studies by a large margin.

Finally, unlike traditional methods, which operate on plain-text code, the image code detection model aims to detect the programming language in which code is written from an image of the code, and investigates the effectiveness of representing code as images. To this end, it converts plain-text code into a code image and uses an image encoding model to encode code images without generating a vocabulary. To evaluate the effectiveness of this model, we conducted comparison experiments with several baseline models. The experimental results demonstrate that our model can detect the programming language of code with high accuracy.

This dissertation is organized as follows. Chapter 1 is the introduction; we briefly introduce related studies of code search and highlight the challenges addressed in this dissertation. Chapter 2 introduces the background for code search, such as word embedding and alignment learning techniques. Chapter 3 describes the details of the combined alignment model and its experimental results. Chapter 4 describes the embedding enhancement and alignment learning of the global alignment model and its experimental results. Chapter 5 introduces the data preprocessing, model design, and experimental results of the image code detection model. In Chapter 6, we conclude this dissertation and discuss directions for future work on code search.
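
As an illustration of the retrieval-based setting the abstract contrasts with DL-based methods, the following is a minimal sketch of code search by plain text similarity. It is not the method of this thesis: the bag-of-tokens representation, the cosine score, and the names `tokenize`, `cosine`, `search`, and `CORPUS` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text or code into lowercase alphabetic tokens.

    Hypothetical helper: breaks camelCase and snake_case identifiers apart.
    """
    # Insert a space at lower-to-upper boundaries so "sortList" -> "sort List".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, snippets):
    """Rank snippets by token overlap with the query, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(s))), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy corpus, invented for this example.
CORPUS = [
    "def read_file(path): return open(path).read()",
    "def sortList(items): return sorted(items)",
    "def http_get(url): return urllib.request.urlopen(url).read()",
]

if __name__ == "__main__":
    for score, snippet in search("read file", CORPUS):
        print(f"{score:.3f}  {snippet}")
```

In this sketch the query "read file" ranks the `read_file` snippet first because they share tokens, while a paraphrase such as "load text from disk" shares no tokens with any snippet and scores zero everywhere. That mismatch between query wording and code tokens is the kind of gap that alignment learning is meant to bridge.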