Abstract
The momentum of the open-source community has been steadily increasing, leading to numerous tools for writing, maintaining, and sharing source code. Several code search engines have been developed to support development tasks and facilitate reuse, either directly or by serving as information sources for code recommenders. In this chapter, we present AGORA, a code search engine that facilitates reuse at the component, snippet, and project levels. Through its Elasticsearch index, AGORA supports advanced queries (syntax-aware queries and regular expressions); the engine also integrates with popular code hosting repositories and offers a well-designed API. We provide representative examples and a usage scenario to illustrate the functionality of AGORA, and perform a comparative analysis in a code reuse context, which indicates that AGORA provides an efficient alternative to current solutions.
Notes
- 1.
- 2.
- 3.
- 4. The term “agora” refers to the central spot of ancient Greek city-states. Although it is roughly translated to the English term “market”, “agora” can be better viewed as an assembly, a place where people met not only to trade goods, but also to exchange ideas. It is a good fit for our search engine since we envision it as a place where developers can (freely) distribute and exchange their source code, and subsequently their ideas.
- 5.
- 6.
- 7. Elasticsearch supports removing stop words for several languages. In our case, the default choice of English is adequate.
- 8. Note that the standard analyzer initially splits the filename into two parts, the filename without the extension and the extension, since it splits on punctuation.
- 9. One of 040000, 100644, 100664, 100755, 120000, or 160000, which correspond to directory, regular non-executable file, regular non-executable group-writeable file, regular executable file, symbolic link, or gitlink, respectively.
- 10. One of blob, tree, commit, or tag.
- 11. The CamelCase analyzer is quite effective for fields including Java types; the types are conventionally in camelCase, while the primitives are not affected by the CamelCase tokenizer, i.e., the text “float” results in the token “float”.
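The effect of the CamelCase tokenization described in the last note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `camel_case_tokens` and the regular expression are hypothetical and considerably simpler than the pattern Elasticsearch documents for its pattern analyzer; they are not AGORA's actual configuration.

```python
import re

# Hypothetical, simplified boundary pattern (an assumption for illustration,
# not the pattern used by Elasticsearch or AGORA).
_CAMEL_BOUNDARY = re.compile(
    r"(?<=[a-z0-9])(?=[A-Z])"      # lowercase letter or digit followed by uppercase
    r"|(?<=[A-Z])(?=[A-Z][a-z])"   # end of an acronym followed by a capitalized word
)


def camel_case_tokens(text):
    """Split text at camelCase boundaries; text without boundaries stays whole."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return [part for part in _CAMEL_BOUNDARY.split(text) if part]


if __name__ == "__main__":
    for name in ("ArrayList", "HTTPRequest", "float"):
        print(name, camel_case_tokens(name))
```

With this sketch, a type name such as `ArrayList` is split into two tokens, while a primitive such as `float` contains no boundary and is returned as a single token, which matches the behavior the note describes.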
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Diamantopoulos, T., Symeonidis, A.L. (2020). Source Code Indexing for Component Reuse. In: Mining Software Engineering Data for Software Reuse. Advanced Information and Knowledge Processing. Springer, Cham. https://doi.org/10.1007/978-3-030-30106-4_5
DOI: https://doi.org/10.1007/978-3-030-30106-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30105-7
Online ISBN: 978-3-030-30106-4
eBook Packages: Computer Science (R0)