Abstract
The vast and growing number of scientific papers creates a need to identify the essential parts of this massive body of text. Scientific research proceeds from posing problems to applying methods to solve them, so to capture the main idea of a scientific paper we focus on extracting its problem and method sentences. Annotating sentences within scientific papers is labor-intensive, resulting in small-scale datasets that limit the amount of information models can learn. This limited information leads models to rely heavily on specific surface forms, which in turn reduces their generalization capability. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters that generate synthetic data and reduce models' reliance on FEs. For the third idea, we propose a context-enhanced transformer that uses context to measure the importance of words in target sentences and to reduce noise in the context. Furthermore, this paper conducts experiments with large language model (LLM) based in-context learning (ICL) methods. Quantitative and qualitative experiments demonstrate that our proposed models achieve higher macro F1 scores than the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM-based ICL methods are found to be unsuitable for the task of problem and method extraction.
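The core idea of FE desensitization can be illustrated with a minimal sketch: formulaic expressions (e.g. "in this paper, we propose") are detected in a labeled sentence and swapped for variant phrasings, yielding synthetic training sentences whose content words are unchanged while the formulaic cue varies. The FE inventory and variant lists below are hypothetical illustrations, not the paper's actual resources.

```python
# Hypothetical mini-inventory of formulaic expressions (FEs) and variant
# phrasings; the paper's actual FE list is not reproduced here.
FE_VARIANTS = {
    "in this paper, we propose": [
        "in this work, we introduce",
        "this study presents",
    ],
    "to address this problem": [
        "to tackle this issue",
        "to overcome this limitation",
    ],
}

def desensitize(sentence: str) -> list[str]:
    """Generate synthetic sentences by replacing a matched FE with its
    variants, leaving the content words (the problem/method description)
    intact, so a classifier cannot rely on one fixed formulaic cue."""
    lowered = sentence.lower()
    augmented = []
    for fe, variants in FE_VARIANTS.items():
        idx = lowered.find(fe)  # case-insensitive match position
        if idx != -1:
            for variant in variants:
                augmented.append(
                    sentence[:idx] + variant + sentence[idx + len(fe):]
                )
    return augmented

examples = desensitize(
    "In this paper, we propose a context-enhanced transformer."
)
```

Each synthetic sentence keeps the original label, so the augmented dataset is larger while the model sees the same problem/method content under varied formulaic wrappers.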
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 72074113).
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
Cite this article
Zhang, Y., Zhang, C. Extracting problem and method sentence from scientific papers: a context-enhanced transformer using formulaic expression desensitization. Scientometrics 129, 3433–3468 (2024). https://doi.org/10.1007/s11192-024-05048-6