Abstract
This paper presents the design and development of an unrestricted text-to-speech (TTS) synthesis system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech across different domains. In this work, syllables are used as the basic units for synthesis, and the Festival framework is used to build the TTS system. Speech collected from a female artist is used as the speech corpus. Initially, speech is collected from five speakers and a prototype TTS system is built for each of them. The best of the five speakers is selected through subjective and objective evaluation of natural and synthesized waveforms. The unrestricted TTS system is then developed by addressing the issues involved at each stage in order to produce a good-quality synthesizer. Evaluation is carried out in four stages using objective measures and subjective listening tests on the synthesized speech. In the first stage, the TTS system is built with the basic Festival framework. In the following stages, additional features are incorporated into the system and the quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods improve the quality of the synthesized speech from stage 2 to stage 4.
Cite this article
Narendra, N.P., Rao, K.S., Ghosh, K. et al. Development of syllable-based text to speech synthesis system in Bengali. Int J Speech Technol 14, 167–181 (2011). https://doi.org/10.1007/s10772-011-9094-4
DOI: https://doi.org/10.1007/s10772-011-9094-4