Measuring the Quality of Low-Resourced Statistical Parametric Speech Synthesis Trained with Noise-Degraded Data

Supported by the University of Costa Rica

Marvin Coto-Jiménez
University of Costa Rica, Costa Rica
marvin.coto@ucr.ac.cr

Abstract. After the successful implementation of speech synthesis in several languages, the study of robustness became an important topic, so as to increase the possibility of building voices from non-standard sources, e.g., historical recordings, children's speech, and data freely available on the Internet. In this work, the influence of noise in the source speech of an HMM-based statistical parametric speech synthesis system is measured for the case of a low-resourced database. For this purpose, three types of additive noise were considered at five signal-to-noise ratio levels to degrade the source speech data. Using objective measures to assess the perceptual quality of the results and the propagation of the noise through all the stages of building the synthetic voice, the results show a severe drop in the quality of the artificial speech, even for the lower noise levels. Such degradation seems to be independent of the noise type, and the drop is less than proportional to the noise level. These results are of importance for any practical implementation of speech synthesis from degraded data in similar conditions, and show that applying denoising processes becomes mandatory in order to preserve the possibility of building intelligible voices.

Keywords. Noise, robustness, speech synthesis.

1 Introduction

The purpose of speech synthesis can be established as the production of artificial speech from a given text input using computers. The resulting speech should be perceived as intelligible and natural, in order to apply the results in the desired application. This process of speech synthesis (also referred to as text-to-speech) has a long history, from early mechanical systems to our days, where complex techniques and the release of dedicated software have extended the speech synthesis possibilities to many languages and applications.

The evolution of modern techniques can be traced back to the early 1970s [1], when waveform generation was performed using low-dimensional information, such as formants. Since then, it has evolved towards direct manipulation of waveforms (e.g., concatenative and unit-selection approaches) or towards high-dimensional parameters and deep learning-based models.

The statistical models of speech synthesis, mainly based on Hidden Markov Models (HMM), were popularized among researchers in the field after the first publications of the technique [2, 3], particularly after the release of the HTS software [4]. HMMs had previously been applied successfully to speech recognition, and many of the ideas and parameters used for that task were translated to the speech synthesis field.

With the HTS software, many papers were published on the implementation of statistical parametric speech synthesis in several languages around the world. The case of Spanish was also reported by a reduced number of researchers [5, 6, 7].

The advantages of statistical parametric speech synthesis based on HMM were reported in terms of its flexibility and its capacity for producing intelligible voices with little training data [8].
The main disadvantages were the buzzy, muffled sound often reported.

With the increased performance and success of deep learning in several fields during the last decade, speech synthesis has also benefited from the possibilities offered by the complex modeling and effective training algorithms of deep neural networks. The first ideas on the implementation of deep learning in speech synthesis were published in [9].

In recent years, many proposals have been made to apply different types of neural networks, such as Restricted Boltzmann Machines, Deep Belief Networks, Bidirectional Long Short-Term Memory networks, and Convolutional Neural Networks [10]. In some recent reports, the combination of statistical parametric modeling with deep learning was also published [11, 12].

Typically, the deep learning-based approaches report higher quality results but require a large amount of training data. There are many situations where such resources are not available, for example, when building speech from historical recordings, children's speech, or low-resourced languages [13, 14].

For these cases, HMM-based statistical parametric speech synthesis remains the main possibility for producing intelligible artificial voices. In many such cases, the quality of the recordings has also been a shortcoming for the quality of the results.

The usual framework for building synthetic voices was considered in the vast majority of cases: the recording of datasets in highly controlled environments, typically in professional studios with high-quality equipment. According to [15], given the advances in speech synthesis techniques, the research community can now consider building quality voices from data collected in less controlled environments.
These new conditions represent several challenges for the process, for example, non-consistent recording conditions, unbalanced phonetic material, and noisy data. It is still not clear how robust speech synthesis systems are under such unfavorable conditions [16].

The problem of producing artificial speech under such conditions has been addressed by some authors, with particular interest in techniques that take advantage of a large corpus of clean data, such as speaker adaptation in HMM-based speech synthesis. Using such a corpus, new voices can be built by incorporating information from the corpus into the smaller datasets. For example, in [17], the authors showed that naturalness is not significantly affected by the presence of noise in the smaller dataset. The unfavorable conditions can also be present in found data, i.e., data freely available on the web. Such data has significant variation in terms of speaking style and channel characteristics [18].

In this paper, an experimental study on the influence of noisy recordings on the results of statistical parametric speech synthesis is performed, for the case of a small database in Spanish. The purpose of the study is to numerically report and compare the influence of several types and levels of noise in the speech data required to produce artificial speech.

The influence of the noise provides information to anticipate the quality of the artificial speech that can be produced from recordings with unfavorable conditions. Such information is relevant for the evaluation of low-quality sources of speech resources in building speech synthesis.

The rest of this paper is organized as follows: Section 2 presents the theoretical background of speech synthesis and the effects of noise. Section 3 presents the experimental setup of the proposal. Section 4 presents the results. Finally, the conclusions are presented in Section 5.

2 Statistical Parametric Speech Synthesis

Statistical parametric speech synthesis based on HMMs models the speech production process using the source-filter theory of voice production [1]. This model comprises the voicing information, represented by the fundamental frequency (or its logarithm), and the spectral envelope, commonly represented by mel-frequency cepstral coefficients (MFCC). The speech waveforms are reconstructed from sequences of such parameters and additional information about dynamic features (e.g., the rate of change in the form of delta and delta-delta features [19]).

Fig. 1. Left-to-right Hidden Markov Model with three states

First, the HMMs are trained, using an approach similar to that utilized in speech recognition: adjusting the parameters of the HMM model (Figure 1) using information extracted from a speech database. Each HMM can be expressed as:

\lambda = (\pi, a, b), \qquad (1)

where π is the initial-state probability, and a and b are the state-transition and output probability distributions, the latter assumed to be multivariate Gaussian distributions (with a mixture of continuous and zero-dimensional distributions).

In statistical parametric speech synthesis based on HMM, the set of models depends not only on the number of phonemes of the particular language, but on the context dependency of the phonemes (phonetic and prosodic contexts) as well. For this reason, a large number of models are trained to represent the temporal, spectral, and pitch characteristics of every sound and its context: for example, one model for the <a> phoneme at the beginning of a phrase followed by a consonant, another model for the <a> phoneme at the beginning of a phrase followed by a vowel, and so on.

The training of each HMM can be expressed as:

\lambda_{\max} = \arg\max_{\lambda} \, p(O \mid \lambda, W), \qquad (2)

where O is the set of speech parameters and W the phoneme labels. A detailed description of the HMM and the procedures involved in the speech synthesis can be found in [1, 20].

For this work, it is of particular importance to state that the quality of the speech synthesis relies on the proper adjustment of the parameters of λ_max in Equation (2). This adjustment depends on the quality of the features O extracted from the dataset and on their consistency with the phoneme labels (linguistic specification) W. Several factors can affect the outcome of the process: the amount of information in the database (little information implies less O with which to estimate the parameters of the HMMs) and the quality of this information. If the information is corrupted by noise, or the recordings show large variations among phonemes (typically, this can occur in very expressive or emotional speech), the ability of the HMMs to reproduce the speech parameters for a natural-sounding voice with high intelligibility may be affected. The nature of such noise and its level can also be relevant factors for the results. In this work, an experimental validation of such assumptions is proposed and measured.
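As a concrete illustration of the notation in Equations (1) and (2), the following minimal sketch defines a small left-to-right HMM λ = (π, a, b) with Gaussian output densities and evaluates log p(O | λ) for a toy observation sequence using the forward algorithm. It is only a hypothetical example under assumed values and names; it is not the HTS implementation (which uses context-dependent, multi-stream models with multi-space distributions), and the maximization in Equation (2) would additionally require Baum-Welch re-estimation of these parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal left-to-right HMM "lambda = (pi, a, b)" with three emitting states,
# as in Equation (1). Purely illustrative toy values.
n_states, dim = 3, 2                       # three states, 2-D toy features
pi = np.array([1.0, 0.0, 0.0])             # initial-state probabilities
a = np.array([[0.6, 0.4, 0.0],             # left-to-right transition matrix
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0],              # Gaussian output distributions b
                  [1.0, 1.0],
                  [2.0, 0.5]])
covs = [np.eye(dim) * 0.5 for _ in range(n_states)]

def log_likelihood(obs):
    """log p(O | lambda) computed with the forward algorithm in the log domain."""
    log_b = np.array([[multivariate_normal.logpdf(o, means[j], covs[j])
                       for j in range(n_states)] for o in obs])   # (T, n_states)
    log_alpha = np.log(pi + 1e-300) + log_b[0]
    for t in range(1, len(obs)):
        log_alpha = log_b[t] + np.array([
            np.logaddexp.reduce(log_alpha + np.log(a[:, j] + 1e-300))
            for j in range(n_states)])
    return np.logaddexp.reduce(log_alpha)

# Toy observation sequence standing in for the acoustic features O of Equation (2).
O = np.array([[0.1, -0.2], [0.9, 1.1], [1.1, 0.8], [2.1, 0.4]])
print("log p(O | lambda) =", log_likelihood(O))
```

In training, this likelihood is what the re-estimation procedure increases over λ; here it is only evaluated for fixed, hand-picked parameters.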
3 Experimental Setup

3.1 Database

For this work, we selected the set of words and sentences of [21], developed at the Center for Language and Speech Technologies and Applications of the Polytechnic University of Catalonia. The 184 utterances were recorded by a professional native Spanish-speaking actor in a professional studio, where the recording conditions were completely controlled. The database includes affirmative and interrogative sentences, fifteen paragraphs, digits, and isolated words.

3.2 Experiments

To determine how noise affects the building of synthetic voices with such a small database, several voices were produced using the HTS system, each one after degrading the speech source with noise. The complete database was degraded with additive noise of three types: two artificially generated noises (White and Pink) and one natural noise (Babble). Five signal-to-noise ratio (SNR) levels were considered, to cover a range of conditions and comparatively assess the effect on the results.

Fig. 2. Diagram of the experimental procedure: the clean recordings and the noise-degraded recordings (different types and levels of noise) are each used to train an HTS system, and the resulting artificial speech is compared through objective measures

The whole set of voices to compare can be listed as:

— HTS Clean: the voice produced with the clean database, without any noise added.

— White Noise added at five SNR levels: SNR 5, SNR 7.5, SNR 10, SNR 12.5, SNR 15.

— Pink Noise added at five SNR levels: SNR 5, SNR 7.5, SNR 10, SNR 12.5, SNR 15.

— Babble Noise added at five SNR levels: SNR 5, SNR 7.5, SNR 10, SNR 12.5, SNR 15.

The evaluation metrics proposed in the following section were used to compare the level of degradation of the artificial voice with respect to the base system (HTS Clean). A diagram of the complete process is presented in Figure 2, and a sketch of the noise-mixing step is given below.
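The paper does not specify the tool used to add the noise, so the following sketch only illustrates one common way to degrade an utterance at a target global SNR: the noise is scaled so that the ratio of signal power to scaled-noise power matches the requested level in dB. All names and the toy signals are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested global SNR (dB).

    Generic additive-degradation sketch; not necessarily the procedure used in the paper.
    """
    # Repeat or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Choose the scale so that 10*log10(p_clean / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: degrade a toy 1-second "utterance" with white noise at the five SNR levels.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
white = rng.standard_normal(16000)
for snr in (5, 7.5, 10, 12.5, 15):
    noisy = mix_at_snr(clean, white, snr)
    print(f"SNR {snr:>4} dB -> mixed RMS {np.sqrt(np.mean(noisy**2)):.3f}")
```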
3.3 Evaluation

To determine the quality of each synthetic voice, two objective measures were applied. These measures have been reported in the speech synthesis literature as reliable for measuring the quality of synthesized voices:

— Segmental SNR (SegSNR): This measure calculates the average SNR at the frame level, according to the equation:

SegSNR = \frac{10}{N} \sum_{i=1}^{N} \log_{10} \left[ \frac{\sum_{j=0}^{L-1} s^{2}(i,j)}{\sum_{j=0}^{L-1} \left( s(i,j) - x(i,j) \right)^{2}} \right], \qquad (3)

where x(i,j) is the j-th original sample of the i-th frame, s(i,j) is the corresponding synthetic speech sample, N is the total number of frames of the utterance, and L is the frame length.

— PESQ: This is a measure intended to predict the subjective perception of speech, defined in ITU-T Recommendation P.862. The results are reported in the interval [0.5, 4.5], where a PESQ value of 4.5 means an exact reconstruction of the speech. PESQ is computed following the equation:

PESQ = a_0 + a_1 D_{ind} + a_2 A_{ind}, \qquad (4)

where D_ind and A_ind are the average disturbance and the average asymmetrical disturbance, respectively. The coefficients a_k are chosen to optimize the PESQ measure with respect to signal distortions and overall quality.

Additionally, we propose the visualization of spectrograms as a means to represent the noise and its effect on the spectrum of the speech signals. A sketch of how the two objective measures can be computed is shown below.
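As an illustration of Equation (3), the following minimal sketch averages the frame-level SNR in dB, assuming NumPy, equal-length and time-aligned signals, and a hypothetical frame length of 256 samples. The commented-out PESQ call relies on the third-party `pesq` Python package, which is an assumption of this sketch and not necessarily the P.862 implementation used by the author.

```python
import numpy as np

def segmental_snr(reference, synthesized, frame_len=256, eps=1e-12):
    """Segmental SNR as in Equation (3): mean of the frame-level SNRs in dB.

    `reference` is the natural signal x and `synthesized` the generated signal s.
    Both are 1-D arrays; frame alignment is assumed here, which is a simplification,
    since synthesized utterances normally require time alignment first.
    """
    n_frames = min(len(reference), len(synthesized)) // frame_len
    snrs = []
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(synthesized[sl] ** 2)
        error_energy = np.sum((synthesized[sl] - reference[sl]) ** 2) + eps
        snrs.append(10.0 * np.log10(signal_energy / error_energy + eps))
    # Clamping each frame to a range such as [-10, 35] dB is a common practical
    # refinement, but it is not stated in the paper, so it is omitted here.
    return float(np.mean(snrs))

# PESQ through the third-party `pesq` package (an assumption; the paper only cites
# ITU-T P.862). This package requires signals sampled at 8 or 16 kHz.
# from pesq import pesq
# score = pesq(16000, reference_16k, synthesized_16k, 'wb')
```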
4 Results

This section presents the evaluation metrics for the different experiments and their analysis in terms of how the presence of noise affects the building of synthetic voices. For example, in the spectrograms of Figure 3, the silence segments at the beginning and the end of the noisy speech (with SNR 5) and of the synthesized version of the same utterance preserve similar noise patterns. On the other hand, in the speech segments, the spectrogram presents noticeably blurred frequency bands.

Fig. 3. Spectrograms of an utterance with White noise at SNR 5 (above) and the same utterance synthesized from a database degraded with the same type and level of noise (below)

A similar observation can be made for the case of Pink noise at SNR 10, as presented in Figure 4. The particular pattern in the form of frequency bands can be explained by the process of adjusting the parameter trajectories in the HMMs: the noisy information becomes part of the information fitted by the models, and in the parameter generation process the characteristic flattening of the trajectories also affects the noise. Unfortunately, such characteristics during the speech segments represent a considerable decrease in the objective measures of the synthesized voice.

Fig. 4. Spectrograms of an utterance with Pink noise at SNR 10 (above) and the same utterance synthesized from a database degraded with the same type and level of noise (below)

For example, Figure 5 shows how the noisy condition of the data severely affects the perceptual quality of the synthesized speech at all SNR levels. At SNR 5 with White noise, the resulting synthesized speech is close to the lowest value of PESQ. All artificial voices produced under noisy conditions have considerably lower PESQ values than the base system, the HTS Clean voice.

Fig. 5. PESQ results as a function of SNR for the noise-degraded speech (White, Pink, and Babble, natural) and the artificial versions produced from the same speech (synthesized); the HTS Clean voice reaches PESQ = 1.71

There are no significant differences between the three types of noise analyzed in this work. Babble noise seems to affect the results more than the artificially generated noises, which is expected due to the speech nature of such noise (consisting of a crowd talking in the background). All the results present similar trends in the drop of quality of the synthetic voices in the presence of noise, preserving the slope of the degraded speech in the case of PESQ.

Considering SNR levels below 5 is a common practice in the study of robust speech recognition, but given these results, it seems that below this level the speech synthesized from a low-resourced database cannot be considered for any practical application.

The results of the SegSNR measure are presented in Figure 6. As with the previous measure, there is a significant drop in the quality of the synthetic voices at all SNR levels, and the drop is very similar among the noise types. All the cases present values below the base system (the HTS Clean voice, with SegSNR = -5.11), as expected, but there is a decrease in the slope of the lines for the synthesized speech that can be considered an unexpected result of this study. Such behavior in the SegSNR trends at all SNR levels can be explained by the averaging process performed during the training of the HMMs.

Fig. 6. SegSNR results as a function of SNR for the noise-degraded speech (White, Pink, and Babble, natural) and the artificial versions produced from the same speech (synthesized); the HTS Clean voice reaches SegSNR = -5.11

It is important to remark that the results were obtained from a Spanish speech database that can be considered low-resourced. The robustness of the HTS system under such conditions can be considered very low in contrast to the experiences reported in the references, which took advantage of adaptation systems or of complementary clean speech from other speakers during the process of generating the artificial speech.

5 Conclusions

In this work, an experimental study on the quality of synthetic speech built from a noisy Spanish database was performed. The amount of data available for the experiments can be considered low-resourced in contrast to the larger speech databases available in other languages.

The obtained results show how the presence of noise in the recordings severely affects the synthetic voices produced, regardless of the type of noise and the SNR. In particular, the perceptual quality measured using PESQ shows how the resulting voices have lower quality than the voices produced from clean speech. The type of noise seems to make no difference in the quality of the synthetic speech.

The results are relevant to the building of synthetic voices where data cannot be collected in controlled environments, e.g., from historical recordings, data freely available on the Internet, or recordings made during videoconferencing.

In addition, the results help to establish the importance of building a larger clean speech corpus for endangered languages, children's speech, and many other potential applications of speech synthesis in new languages or languages where such resources have not been produced.

For future work, several relevant questions can be addressed for experimental validation, in terms of the robustness of speech synthesis systems under partially noise-corrupted data and a broader range of noise types and levels. Applying denoising algorithms before building the voices is an important opportunity to preserve the possibility of generating synthetic voices from noise-degraded data.
References

1. Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., Oura, K. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, pp. 1234–1252.

2. Masuko, T., Tokuda, K., Kobayashi, T., Imai, S. (1996). Speech synthesis using HMMs with dynamic features. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1.

3. Tokuda, K., Kobayashi, T., Imai, S. (1995). Speech parameter generation from HMM using dynamic features. International Conference on Acoustics, Speech, and Signal Processing, Vol. 1.

4. Zen, H., Nose, T., Yamagishi, J. (2007). The HMM-based speech synthesis system (HTS) version 2.0. SSW.

5. Gonzalvo, X., Sanz, I.I., Socoró-Carrié, J.C., Alías, F. (2007). HMM-based Spanish speech synthesis using CBR as F0 estimator. ITRW on NOLISP, pp. 788–793.

6. Gonzalvo, X., Taylor, P., Monzo, C., Sanz, I.I. (2009). High quality emotional HMM-based synthesis in Spanish. International Conference on Nonlinear Speech Processing, Springer. DOI: 10.1007/978-3-642-11509-7_4.

7. Franco, C.A., Herrera, A., Escalante, B. (2017). Speech synthesis in Mexican Spanish using voice parameterization. IIISCI, Vol. 15, No. 4, pp. 72–75.

8. Ekpenyong, M., Urua, E.A., Watts, O., King, S., Yamagishi, J. (2014). Statistical parametric speech synthesis for Ibibio. Speech Communication, Vol. 56, pp. 243–251. DOI: 10.1016/j.specom.2013.02.003.

9. Ze, H., Senior, A., Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing. DOI: 10.1109/ICASSP.2013.6639215.

10. Ning, Y., He, S., Wu, Z., Xing, Ch. (2019). A review of deep learning based speech synthesis. Applied Sciences, Vol. 9, No. 19, 4050. DOI: 10.3390/app9194050.

11. Hu, Y.J., Ling, Z.H. (2016). DBN-based spectral feature representation for statistical parametric speech synthesis. IEEE Signal Processing Letters, Vol. 23, No. 3, pp. 321–325. DOI: 10.1109/LSP.2016.2516032.

12. Hu, Y.J., Ling, Z.H. (2018). Extracting spectral features using deep autoencoders with binary distributed hidden units for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 4, pp. 713–724. DOI: 10.1109/TASLP.2018.2791804.

13. Suraj-Pandurang, P., Laxman-Lahudkar, S. (2019). Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states. International Journal of Speech Technology, Vol. 22, No. 1, pp. 93–98.

14. Sefara, T.J., Mokgonyane, T.B., Manamela, M.J., Modipa, T.I. (2019). HMM-based speech synthesis system incorporated with language identification for low-resourced languages. International Conference on Advances in Big Data, Computing and Data Communication Systems (ICABCD). DOI: 10.1109/ICABCD.2019.8851055.

15. Yamagishi, J., Ling, Z., King, S. (2008). Robustness of HMM-based speech synthesis.

16. Valentini-Botinhao, C., Wang, X., Takaki, S., Yamagishi, J. (2016). Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks. Interspeech.

17. Karhila, R., Remes, U., Kurimo, M. (2013). Noise in HMM-based speech synthesis adaptation: Analysis, evaluation methods and experiments. IEEE Journal of Selected Topics in Signal Processing, Vol. 8, No. 2, pp. 285–295.

18. Baljekar, P. (2018). Speech synthesis from found data. PhD thesis, Carnegie Mellon University.

19. Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 3. DOI: 10.1109/ICASSP.2000.861820.

20. Toda, T., Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, Vol. 90, No. 5, pp. 816–824. DOI: 10.1093/ietisy/e90-d.5.816.

21. Maegaard, B., Choukri, K., Calzolari, N., Odijk, J. (2005). ELRA – European Language Resources Association: background, recent developments and future perspectives. Language Resources and Evaluation, Vol. 39, No. 1, pp. 9–23. DOI: 10.1007/s10579-005-2692-5.

Article received on 09/10/2020; accepted on 16/02/2021.
Corresponding author is Marvin Coto-Jiménez.