Abstract
With the exponential growth in the number of research papers, text summarization tools have emerged. However, existing tools merely extract sentences or words based on their frequency and may not be well suited to research papers. To address this gap, this study develops a model based on DistilBERT, focusing on information extraction and on dataset labeling and augmentation techniques. The model's central objective is entity recognition: it takes critical segments of a paper's full text as input and identifies two specific entities within them, the research problems and the research content. In response to the limitations of existing datasets, this research augments a dataset with over 4000 manually annotated full-text arXiv papers on computer algorithms.
The developed model performs well on several evaluation metrics, including accuracy, precision, recall, and F1 score. For comparison, we employed several baseline models based on BERT and trained our models using three different dataset training methods. Additionally, to evaluate our dataset's quality and underline the importance of full-text data, we manually annotated a random selection of 4000 papers from the ARXIV Data dataset, extracting only their titles and abstracts. The proposed model outperforms all the baseline models, achieving an accuracy of 0.823 and an F1 score of 0.798, and models trained on the proposed full-text annotated dataset outperform those trained on the other datasets. These results demonstrate the effectiveness of the proposed model.
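As an illustration of the metrics named above, the following is a minimal sketch of how accuracy, precision, recall, and F1 score can be computed from gold and predicted labels. The abstract does not say whether the reported scores are token-level or entity-level, or how they are averaged, so the per-label computation here is an assumption, and the label names `PROBLEM`, `CONTENT`, and `O` are hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(
    gold: Sequence[str], pred: Sequence[str], positive: str
) -> Dict[str, float]:
    """Accuracy over all labels, plus precision/recall/F1 for one label.

    Assumption: gold and pred are aligned label sequences (one label per
    token or segment); the paper's exact evaluation protocol is not stated.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for six segments of one paper.
gold = ["PROBLEM", "PROBLEM", "O", "CONTENT", "O", "O"]
pred = ["PROBLEM", "O", "O", "CONTENT", "PROBLEM", "O"]
metrics = classification_metrics(gold, pred, positive="PROBLEM")
print(metrics)
```

Accuracy counts every label equally, while precision, recall, and F1 are computed for a single entity label, so a model can score well on accuracy simply by predicting the majority `O` label; this is one reason to report F1 alongside accuracy.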
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Luo, F., Yu, X. (2024). Element Extraction from Computer Science Academic Papers for AI Survey Writing. In: **, H., Pan, Y., Lu, J. (eds) Computer Networks and IoT. IAIC 2023. Communications in Computer and Information Science, vol 2060. Springer, Singapore. https://doi.org/10.1007/978-981-97-1332-5_21
DOI: https://doi.org/10.1007/978-981-97-1332-5_21
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-1331-8
Online ISBN: 978-981-97-1332-5
eBook Packages: Computer Science, Computer Science (R0)