Abstract
As the first session-level Chinese dataset, CHASE contains two separate parts: 2,003 sessions manually constructed from scratch (CHASE-C) and 3,456 sessions translated from the English SParC dataset (CHASE-T). We find that the two parts are highly discrepant and incompatible. In this work, we present SeSQL, a high-quality, large-scale, session-level Chinese text-to-SQL dataset consisting of 5,028 sessions, all manually constructed from scratch. To guarantee data quality, we adopt an iterative annotation workflow that facilitates intensive and timely review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to also serve as the largest dataset for single-round text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL with three competitive session-level parsers and present a detailed analysis.
S. Huang and L. Wang—Equal contribution.
Notes
1. Please note that we ask annotators not to introduce identification information and ask them to anonymize the existing identification information.
2. The average salary is about 20 RMB for a part-time KFC employee in our city.
3. The values of “Both” for other datasets are inferred from their reported results of “Core.” and “Elli.”.
References
Bertomeu, N., Uszkoreit, H., Frank, A., Krieger, H.U., Jörg, B.: Contextual phenomena and thematic relations in database QA dialogues: results from a Wizard-of-Oz experiment. In: Proceedings of HLT-NAACL, pp. 1–8 (2006)
Cai, Y., Wan, X.: IGSQL: database schema interaction graph based neural model for context-dependent text-to-SQL generation. In: Proceedings of EMNLP, pp. 6903–6912 (2020)
Cao, R., Chen, L., Chen, Z., Zhao, Y., Zhu, S., Yu, K.: LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. In: Proceedings of ACL, pp. 2541–2555 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
Geva, M., Goldberg, Y., Berant, J.: Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In: Proceedings of EMNLP-IJCNLP, pp. 1161–1166 (2019)
Guo, J., et al.: Chase: a large-scale and pragmatic Chinese dataset for cross-database context-dependent text-to-SQL. In: Proceedings of ACL, pp. 2316–2331 (2021)
Hui, B., et al.: Dynamic hybrid relation exploration network for cross-domain context-dependent semantic parsing. In: Proceedings of AAAI, pp. 13116–13124 (2021)
Scholak, T., Li, R., Bahdanau, D., de Vries, H., Pal, C.: DuoRAT: towards simpler text-to-SQL models. In: Proceedings of NAACL-HLT, pp. 1313–1321 (2021)
Tang, L.R., Mooney, R.J.: Using multiple clause constructors in inductive logic programming for semantic parsing. In: Proceedings of ECML, pp. 466–477 (2001)
Wang, B., Shin, R., Liu, X., Polozov, O., Richardson, M.: RAT-SQL: relation-aware schema encoding and linking for text-to-SQL parsers. In: Proceedings of ACL, pp. 7567–7578 (2020)
Wang, L., et al.: DuSQL: a large-scale and pragmatic Chinese text-to-SQL dataset. In: Proceedings of EMNLP, pp. 6923–6935 (2020)
Yu, T., et al.: CoSQL: a conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In: Proceedings of EMNLP-IJCNLP, pp. 1962–1979 (2019)
Yu, T., Zhang, R., Polozov, A., Meek, C., Awadallah, A.H.: SCoRe: pre-training for context representation in conversational semantic parsing. In: Proceedings of ICLR (2020)
Yu, T., et al.: Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In: Proceedings of EMNLP, pp. 3911–3921 (2018)
Yu, T., et al.: SParC: cross-domain semantic parsing in context. In: Proceedings of ACL, pp. 4511–4523 (2019)
Zhang, R., et al.: Editing-based SQL query generation for cross-domain context-dependent questions. In: Proceedings of EMNLP-IJCNLP, pp. 5338–5349 (2019)
Zheng, Y., Wang, H., Dong, B., Wang, X., Li, C.: HIE-SQL: history information enhanced network for context-dependent text-to-SQL semantic parsing. In: Proceedings of ACL, pp. 2997–3007 (2022)
Zhong, V., Xiong, C., Socher, R.: Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv:1709.00103 (2017)
Acknowledgement
We thank all anonymous reviewers for their valuable comments, and all annotators for their great effort in data annotation and review. This work was supported by the National Natural Science Foundation of China (Grant No. 62176173) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Huang, S. et al. (2023). SeSQL: A High-Quality Large-Scale Session-Level Chinese Text-to-SQL Dataset. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14302. Springer, Cham. https://doi.org/10.1007/978-3-031-44693-1_42
DOI: https://doi.org/10.1007/978-3-031-44693-1_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44692-4
Online ISBN: 978-3-031-44693-1
eBook Packages: Computer Science (R0)