Source Code Indexing for Component Reuse

Diamantopoulos, Themistoklis; Symeonidis, Andreas L.

doi:10.1007/978-3-030-30106-4_5

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Abstract

The momentum of the open-source community has been constantly increasing, leading to numerous tools for writing, maintaining, and sharing source code. Several code search engines have been developed to support development tasks and facilitate reuse, either directly or by serving as information sources for code recommenders. In this chapter, we present AGORA, a code search engine that facilitates reuse at the component, snippet, and project levels. Through its Elasticsearch index, AGORA supports advanced queries (syntax-aware queries and regular expressions), while the engine also integrates with popular code hosting repositories and offers a well-designed API. We provide representative examples and a usage scenario to illustrate the functionality of AGORA, and perform a comparative analysis in a code reuse context, which indicates that AGORA provides an efficient alternative to current solutions.

Notes

1. https://github.com/.
2. https://sourceforge.net/.
3. http://stackoverflow.com/.
4. The term “agora” refers to the central spot of ancient Greek city-states. Although it is roughly translated to the English term “market”, “agora” can be better viewed as an assembly, a place where people met not only to trade goods, but also to exchange ideas. It is a good fit for our search engine since we envision it as a place where developers can (freely) distribute and exchange their source code, and subsequently their ideas.
5. Source: http://googleblog.blogspot.gr/2011/10/fall-sweep.html.
6. Source: http://techcrunch.com/2012/07/20/ohloh-wants-to-fill-the-gap-left-by-google-code-search/.
7. Elasticsearch supports removing stop words for several languages. In our case, the default choice of English is adequate.
8. Note that the standard analyzer initially splits the filename into two parts, the filename without extension and the extension since it splits according to punctuation.
9. One of 040000, 100644, 100664, 100755, 120000, or 160000 which correspond to directory, regular non-executable file, regular non-executable group-writeable file, regular executable file, symbolic link, or gitlink, respectively.
10. One of blob, tree, commit, or tag.
11. The CamelCase analyzer is quite effective for fields including Java types; the types are conventionally in camelCase, while the primitives are not affected by the CamelCase tokenizer, i.e., the text “float” results in the token “float”.
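
The tokenization behavior described in notes 8 and 11 can be illustrated with a minimal sketch. This is not the Elasticsearch analyzer itself and not necessarily how AGORA configures it: the function name camel_case_tokens and both regular expressions are hypothetical, and lowercasing the tokens is an assumption made for illustration. The sketch splits text on punctuation first and then on camelCase boundaries.

```python
import re

# Assumed approximation of a camelCase boundary: a position between a
# lowercase letter or digit and an uppercase letter, or before the last
# capital of an acronym run (as in "HTTPServer").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

# Any run of non-alphanumeric characters (for example the dot in a filename).
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def camel_case_tokens(text):
    """Split text on punctuation and camelCase boundaries (illustrative only)."""
    tokens = []
    for chunk in _NON_ALNUM.split(text):
        if not chunk:
            continue
        # Insert a space at every boundary, then split on whitespace.
        for part in _CAMEL_BOUNDARY.sub(" ", chunk).split():
            tokens.append(part.lower())  # lowercasing is an assumption
    return tokens


if __name__ == "__main__":
    for sample in ("float", "StringBuilder", "MyClass.java"):
        print(sample, camel_case_tokens(sample))
```

Under these assumptions a primitive such as “float” stays a single token, as note 11 states, while a type name such as “StringBuilder” yields one token per word, so a query for either word can match the type.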

Author information

Authors and Affiliations

Thessaloniki, Greece
Themistoklis Diamantopoulos & Andreas L. Symeonidis

Corresponding author

Correspondence to Themistoklis Diamantopoulos.

About this chapter

Cite this chapter

Diamantopoulos, T., Symeonidis, A.L. (2020). Source Code Indexing for Component Reuse. In: Mining Software Engineering Data for Software Reuse. Advanced Information and Knowledge Processing. Springer, Cham. https://doi.org/10.1007/978-3-030-30106-4_5

DOI: https://doi.org/10.1007/978-3-030-30106-4_5
Published: 31 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30105-7
Online ISBN: 978-3-030-30106-4
eBook Packages: Computer Science, Computer Science (R0)