Measuring the Quality of Low-Resourced Statistical Parametric Speech Synthesis Trained with Noise-Degraded Data

Supported by the University of Costa Rica

Marvin Coto-Jiménez
University of Costa Rica, Costa Rica
marvin.coto@ucr.ac.cr

Abstract. After the successful implementation of speech synthesis in several languages, the study of robustness became an important topic, so as to increase the possibility of building voices from non-standard sources, e.g., historical recordings, children's speech, and data freely available on the Internet. In this work, the influence of noise in the source speech of an HMM-based statistical parametric speech synthesis system is measured for the case of a low-resourced database. For this purpose, three types of additive noise were considered at five signal-to-noise ratio levels to degrade the source speech data. Using objective measures to assess the perceptual quality of the results and the propagation of the noise through all the stages of building the synthetic voice, the results show a severe drop in the quality of the artificial speech, even for the lower noise levels. Such degradation seems to be independent of the noise type, and the drop is less than proportional to the noise level. These results are of importance for any practical implementation of speech synthesis from degraded data in similar conditions, and show that applying denoising processes becomes mandatory in order to preserve the possibility of building intelligible voices.

Keywords. Noise, robustness, speech synthesis.

1 Introduction

The purpose of speech synthesis can be established as the production of artificial speech from a given text input using computers. The resulting speech should be perceived as intelligible and natural, in order to apply the results in the desired application. This process of speech synthesis (also referred to as text-to-speech) has a long history, from early mechanical systems to our days, where complex techniques and the release of dedicated software have extended the speech synthesis possibilities to many languages and applications.

The evolution of modern techniques can be traced back to the early 1970s [1], when waveform generation was performed using low-dimensional information, such as formants. Since then, it has evolved towards direct manipulation of waveforms (e.g., concatenative and unit-selection approaches) or towards high-dimensional parameters and deep learning-based models.

The statistical models of speech synthesis, mainly based on Hidden Markov Models (HMM), were popularized among researchers in the field after the first publications of the technique [2, 3], particularly after the release of the HTS software [4]. HMMs had previously been applied successfully to speech recognition, and many of the ideas and parameters used for that task were translated to the speech synthesis field.

With the HTS software, many papers were published on the implementation of statistical parametric speech synthesis in several languages around the world. The case of Spanish was also reported by a reduced number of researchers [5, 6, 7].

The advantages of statistical parametric speech synthesis based on HMM were reported in terms of its flexibility and its capacity for producing intelligible voices with little training data [8].
The main disadvantages were the buzzy, muffled sound often reported.

With the increased performance and success of deep learning in several fields during the last decade, speech synthesis has also benefited from the possibilities offered by the complex modeling and effective training algorithms of deep neural networks. The first ideas on the implementation of deep learning in speech synthesis were published in [9].

In recent years, many proposals have been made to apply different types of neural networks, such as Restricted Boltzmann Machines, Deep Belief Networks, Bidirectional Long Short-Term Memory networks, and Convolutional Neural Networks [10]. In some recent reports, the combination of statistical parametric modeling with deep learning was also published [11, 12].

Typically, the deep learning-based approaches report higher quality results but require a large amount of training data. There are many situations where such resources are not available, for example, when building speech from historical recordings, children's speech, or low-resourced languages [13, 14].

For these cases, HMM-based statistical parametric speech synthesis remains the main possibility for producing intelligible artificial voices. In many such cases, the quality of the recordings has also been a shortcoming for the quality of the results.

The usual framework for building synthetic voices was considered in the vast majority of cases: the recording of datasets in highly controlled environments, typically in professional studios with high-quality equipment. According to [15], given the advances in speech synthesis techniques, the research community can now consider building quality voices from data collected in less controlled environments.
These new conditions represent several challenges for the process, for example, non-consistent recording conditions, unbalanced phonetic material, and noisy data. It is still not clear how robust speech synthesis systems are under such unfavorable conditions [16].

The problem of producing artificial speech under such conditions has been addressed by some authors, with particular interest in techniques that take advantage of a large corpus of clean data, such as speaker adaptation in HMM-based speech synthesis. Using such a corpus, new voices can be built by incorporating information from the corpus into the smaller datasets. For example, in [17], the authors showed that naturalness is not significantly affected by the presence of noise in the smaller dataset. The unfavorable conditions can also be present in found data, i.e., data freely available on the web. Such data has significant variation in terms of speaking style and channel characteristics [18].

In this paper, an experimental study on the influence of noisy recordings on the results of statistical parametric speech synthesis is performed, for the case of a small database in Spanish. The purpose of the study is to numerically report and compare the influence of several types and levels of noise in the speech data required to produce artificial speech.

The influence of the noise provides information to anticipate the quality of the artificial speech that can be produced from recordings with unfavorable conditions. Such information is relevant for the evaluation of low-quality sources of speech resources in building speech synthesis.

The rest of this paper is organized as follows: Section 2 presents the theoretical background of speech synthesis and the effects of noise. Section 3 presents the experimental setup of the proposal. Section 4 presents the results. Finally, the conclusions are presented in Section 5.

2 Statistical Parametric Speech Synthesis

Statistical parametric speech synthesis based on HMMs models the speech production process using the source-filter theory of voice production [1]. This model comprises the voicing information, represented by the fundamental frequency (or its logarithm), and the spectral envelope, commonly represented by mel-frequency cepstral coefficients (MFCC). The speech waveforms are reconstructed from sequences of such parameters and additional information about dynamic features (e.g., the rate of change in the form of delta and delta-delta features [19]).

Fig. 1. Left-to-right Hidden Markov Model with three states

First, the HMMs are trained, using an approach similar to that utilized in speech recognition: adjusting the parameters of the HMM model (Figure 1) using information extracted from a speech database. Each HMM can be expressed as:

\lambda = (\pi, a, b), \qquad (1)

where π is the initial-state probability, and a and b are the state-transition and output probability distributions, the latter assumed to be multivariate Gaussian distributions (with a mixture of continuous and zero-dimensional distributions).

In statistical parametric speech synthesis based on HMM, the set of models depends not only on the number of phonemes of the particular language, but on the context dependency of the phonemes (phonetic and prosodic contexts) as well. For this reason, a large number of models are trained to represent the temporal, spectral, and pitch characteristics of every sound and its context: for example, one model for the <a> phoneme at the beginning of a phrase followed by a consonant, another model for the <a> phoneme at the beginning of a phrase followed by a vowel, and so on.

The training of each HMM can be expressed as:

\lambda_{\max} = \arg\max_{\lambda} \, p(O \mid \lambda, W), \qquad (2)

where O is the set of speech parameters and W the phoneme labels. A detailed description of the HMM and the procedures involved in the speech synthesis can be found in [1, 20].

For this work, it is of particular importance to state that the quality of the speech synthesis relies on the proper adjustment of the parameters of λ_max in Equation (2). This adjustment depends on the quality of the features O extracted from the dataset and on their consistency with the phoneme labels (linguistic specification) W. Several factors can affect the outcome of the process: the amount of information in the database (little information implies less O with which to estimate the parameters of the HMMs) and the quality of this information. If the information is corrupted by noise, or the recordings show large variations among phonemes (typically, this can occur in very expressive or emotional speech), the ability of the HMMs to reproduce the speech parameters for a natural-sounding voice with high intelligibility may be affected. The nature of such noise and its level can also be relevant factors for the results. In this work, an experimental validation of such assumptions is proposed and measured.
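As a concrete illustration of the notation in Equations (1) and (2), the following minimal sketch defines a small left-to-right HMM λ = (π, a, b) with Gaussian output densities and evaluates log p(O | λ) for a toy observation sequence using the forward algorithm. It is only a hypothetical example under assumed values and names; it is not the HTS implementation (which uses context-dependent, multi-stream models with multi-space distributions), and the maximization in Equation (2) would additionally require Baum-Welch re-estimation of these parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal left-to-right HMM "lambda = (pi, a, b)" with three emitting states,
# as in Equation (1). Purely illustrative toy values.
n_states, dim = 3, 2                       # three states, 2-D toy features
pi = np.array([1.0, 0.0, 0.0])             # initial-state probabilities
a = np.array([[0.6, 0.4, 0.0],             # left-to-right transition matrix
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means = np.array([[0.0, 0.0],              # Gaussian output distributions b
                  [1.0, 1.0],
                  [2.0, 0.5]])
covs = [np.eye(dim) * 0.5 for _ in range(n_states)]

def log_likelihood(obs):
    """log p(O | lambda) computed with the forward algorithm in the log domain."""
    log_b = np.array([[multivariate_normal.logpdf(o, means[j], covs[j])
                       for j in range(n_states)] for o in obs])   # (T, n_states)
    log_alpha = np.log(pi + 1e-300) + log_b[0]
    for t in range(1, len(obs)):
        log_alpha = log_b[t] + np.array([
            np.logaddexp.reduce(log_alpha + np.log(a[:, j] + 1e-300))
            for j in range(n_states)])
    return np.logaddexp.reduce(log_alpha)

# Toy observation sequence standing in for the acoustic features O of Equation (2).
O = np.array([[0.1, -0.2], [0.9, 1.1], [1.1, 0.8], [2.1, 0.4]])
print("log p(O | lambda) =", log_likelihood(O))
```

In training, this likelihood is what the re-estimation procedure increases over λ; here it is only evaluated for fixed, hand-picked parameters.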
3 Experimental Setup

3.1 Database

For this work, we selected the set of words and sentences of [21], developed at the Center for Language and Speech Technologies and Applications of the Polytechnic University of Catalonia. The 184 utterances were recorded by a professional native Spanish-speaking actor in a professional studio, where the recording conditions were completely controlled. The database includes affirmative and interrogative sentences, fifteen paragraphs, digits, and isolated words.

3.2 Experiments

To determine how noise affects the building of synthetic voices with such a small database, several voices were produced using the HTS system, each one after degrading the speech source with noise. The complete database was degraded with additive noise of three types: two artificially generated noises (White and Pink) and one natural noise (Babble). Five signal-to-noise ratio (SNR) levels were considered, to cover a range of conditions and comparatively assess the effect on the results.

Fig. 2. Diagram of the experimental procedure: the clean recordings and the noise-degraded recordings (different types and levels of noise) are each used to train an HTS system, and the resulting artificial speech is compared through objective measures

The whole set of voices to compare can be listed as:

— HTS Clean: the voice produced with the clean database, without any noise added.

— White Noise added at five SNR levels: SNR 5, SNR 7.5, SNR 10, SNR 12.5, SNR 15.

— Pink Noise added at five SNR levels: SNR 5, SNR 7.5, SNR 10, SNR 12.5, SNR 15.

— Babble Noise added at five SNR levels: SNR 5, SNR 7.5, SNR 10, SNR 12.5, SNR 15.

The evaluation metrics proposed in the following section were used to compare the level of degradation of the artificial voice with respect to the base system (HTS Clean). A diagram of the complete process is presented in Figure 2, and a sketch of the noise-mixing step is given below.
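The paper does not specify the tool used to add the noise, so the following sketch only illustrates one common way to degrade an utterance at a target global SNR: the noise is scaled so that the ratio of signal power to scaled-noise power matches the requested level in dB. All names and the toy signals are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested global SNR (dB).

    Generic additive-degradation sketch; not necessarily the procedure used in the paper.
    """
    # Repeat or trim the noise so it covers the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]

    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Choose the scale so that 10*log10(p_clean / (scale**2 * p_noise)) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example: degrade a toy 1-second "utterance" with white noise at the five SNR levels.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
white = rng.standard_normal(16000)
for snr in (5, 7.5, 10, 12.5, 15):
    noisy = mix_at_snr(clean, white, snr)
    print(f"SNR {snr:>4} dB -> mixed RMS {np.sqrt(np.mean(noisy**2)):.3f}")
```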
3.3 Evaluation

To determine the quality of each synthetic voice, two objective measures were applied. These measures have been reported in the speech synthesis literature as reliable for measuring the quality of synthesized voices:

— Segmental SNR (SegSNR): This measure calculates the average SNR at the frame level, according to the equation:

SegSNR = \frac{10}{N} \sum_{i=1}^{N} \log_{10} \left[ \frac{\sum_{j=0}^{L-1} s^{2}(i,j)}{\sum_{j=0}^{L-1} \left( s(i,j) - x(i,j) \right)^{2}} \right], \qquad (3)

where x(i,j) is the j-th original sample of the i-th frame, s(i,j) is the corresponding synthetic speech sample, N is the total number of frames of the utterance, and L is the frame length.

— PESQ: This is a measure intended to predict the subjective perception of speech, defined in ITU-T Recommendation P.862. The results are reported in the interval [0.5, 4.5], where a PESQ value of 4.5 means an exact reconstruction of the speech. PESQ is computed following the equation:

PESQ = a_0 + a_1 D_{ind} + a_2 A_{ind}, \qquad (4)

where D_ind and A_ind are the average disturbance and the average asymmetrical disturbance, respectively. The coefficients a_k are chosen to optimize the PESQ measure with respect to signal distortions and overall quality.

Additionally, we propose the visualization of spectrograms as a means to represent the noise and its effect on the spectrum of the speech signals. A sketch of how the two objective measures can be computed is shown below.
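As an illustration of Equation (3), the following minimal sketch averages the frame-level SNR in dB, assuming NumPy, equal-length and time-aligned signals, and a hypothetical frame length of 256 samples. The commented-out PESQ call relies on the third-party `pesq` Python package, which is an assumption of this sketch and not necessarily the P.862 implementation used by the author.

```python
import numpy as np

def segmental_snr(reference, synthesized, frame_len=256, eps=1e-12):
    """Segmental SNR as in Equation (3): mean of the frame-level SNRs in dB.

    `reference` is the natural signal x and `synthesized` the generated signal s.
    Both are 1-D arrays; frame alignment is assumed here, which is a simplification,
    since synthesized utterances normally require time alignment first.
    """
    n_frames = min(len(reference), len(synthesized)) // frame_len
    snrs = []
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(synthesized[sl] ** 2)
        error_energy = np.sum((synthesized[sl] - reference[sl]) ** 2) + eps
        snrs.append(10.0 * np.log10(signal_energy / error_energy + eps))
    # Clamping each frame to a range such as [-10, 35] dB is a common practical
    # refinement, but it is not stated in the paper, so it is omitted here.
    return float(np.mean(snrs))

# PESQ through the third-party `pesq` package (an assumption; the paper only cites
# ITU-T P.862). This package requires signals sampled at 8 or 16 kHz.
# from pesq import pesq
# score = pesq(16000, reference_16k, synthesized_16k, 'wb')
```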
4 Results

This section presents the evaluation metrics for the different experiments and their analysis in terms of how the presence of noise affects the building of synthetic voices. For example, in the spectrograms of Figure 3, the silence segments at the beginning and the end of the noisy speech (with SNR 5) and of the synthesized version of the same utterance preserve similar noise patterns. On the other hand, in the speech segments, the spectrogram presents noticeably blurred frequency bands.

Fig. 3. Spectrograms of an utterance with White noise at SNR 5 (above) and the same utterance synthesized from a database degraded with the same type and level of noise (below)

A similar observation can be made for the case of Pink noise at SNR 10, as presented in Figure 4. The particular pattern in the form of frequency bands can be explained by the process of adjusting the parameter trajectories in the HMMs: the noisy information becomes part of the information fitted by the models, and in the parameter generation process the characteristic flattening of the trajectories also affects the noise. Unfortunately, such characteristics during the speech segments represent a considerable decrease in the objective measures of the synthesized voice.

Fig. 4. Spectrograms of an utterance with Pink noise at SNR 10 (above) and the same utterance synthesized from a database degraded with the same type and level of noise (below)

For example, Figure 5 shows how the noisy condition of the data severely affects the perceptual quality of the synthesized speech at all SNR levels. At SNR 5 with White noise, the resulting synthesized speech is close to the lowest value of PESQ. All artificial voices produced under noisy conditions have considerably lower PESQ values than the base system, the HTS Clean voice.

Fig. 5. PESQ results as a function of SNR for the noise-degraded speech (White, Pink, and Babble, natural) and the artificial versions produced from the same speech (synthesized); the HTS Clean voice reaches PESQ = 1.71

There are no significant differences between the three types of noise analyzed in this work. Babble noise seems to affect the results more than the artificially generated noises, which is expected due to the speech nature of such noise (consisting of a crowd talking in the background). All the results present similar trends in the drop of quality of the synthetic voices in the presence of noise, preserving the slope of the degraded speech in the case of PESQ.

Considering SNR levels below 5 is a common practice in the study of robust speech recognition, but given these results, it seems that below this level the speech synthesized from a low-resourced database cannot be considered for any practical application.

The results of the SegSNR measure are presented in Figure 6. As with the previous measure, there is a significant drop in the quality of the synthetic voices at all SNR levels, and the drop is very similar among the noise types. All the cases present values below the base system (the HTS Clean voice, with SegSNR = -5.11), as expected, but there is a decrease in the slope of the lines for the synthesized speech that can be considered an unexpected result of this study. Such behavior in the SegSNR trends at all SNR levels can be explained by the averaging process performed during the training of the HMMs.

Fig. 6. SegSNR results as a function of SNR for the noise-degraded speech (White, Pink, and Babble, natural) and the artificial versions produced from the same speech (synthesized); the HTS Clean voice reaches SegSNR = -5.11

It is important to remark that the results were obtained from a Spanish speech database that can be considered low-resourced. The robustness of the HTS system under such conditions can be considered very low in contrast to the experiences reported in the references, which took advantage of adaptation systems or of complementary clean speech from other speakers during the process of generating the artificial speech.

5 Conclusions

In this work, an experimental study on the quality of synthetic speech built from a noisy Spanish database was performed. The amount of data available for the experiments can be considered low-resourced in contrast to the larger speech databases available in other languages.

The obtained results show how the presence of noise in the recordings severely affects the synthetic voices produced, regardless of the type of noise and the SNR. In particular, the perceptual quality measured using PESQ shows how the resulting voices have lower quality than the voices produced from clean speech. The type of noise seems to make no difference in the quality of the synthetic speech.

The results are relevant to the building of synthetic voices where data cannot be collected in controlled environments, e.g., from historical recordings, data freely available on the Internet, or recordings made during videoconferencing.

In addition, the results help to establish the importance of building a larger clean speech corpus for endangered languages, children's speech, and many other potential applications of speech synthesis in new languages or languages where such resources have not been produced.

For future work, several relevant questions can be addressed for experimental validation, in terms of the robustness of speech synthesis systems under partially noise-corrupted data and a broader range of noise types and levels. Applying denoising algorithms before building the voices is an important opportunity to preserve the possibility of generating synthetic voices from noise-degraded data.
References

1. Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., Oura, K. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, pp. 1234–1252.

2. Masuko, T., Tokuda, K., Kobayashi, T., Imai, S. (1996). Speech synthesis using HMMs with dynamic features. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1.

3. Tokuda, K., Kobayashi, T., Imai, S. (1995). Speech parameter generation from HMM using dynamic features. International Conference on Acoustics, Speech, and Signal Processing, Vol. 1.

4. Zen, H., Nose, T., Yamagishi, J. (2007). The HMM-based speech synthesis system (HTS) version 2.0. SSW.

5. Gonzalvo, X., Sanz, I.I., Socoró-Carrié, J.C., Alías, F. (2007). HMM-based Spanish speech synthesis using CBR as F0 estimator. ITRW on NOLISP, pp. 788–793.

6. Gonzalvo, X., Taylor, P., Monzo, C., Sanz, I.I. (2009). High quality emotional HMM-based synthesis in Spanish. International Conference on Nonlinear Speech Processing, Springer. DOI: 10.1007/978-3-642-11509-7_4.

7. Franco, C.A., Herrera, A., Escalante, B. (2017). Speech synthesis in Mexican Spanish using voice parameterization. IIISCI, Vol. 15, No. 4, pp. 72–75.

8. Ekpenyong, M., Urua, E.A., Watts, O., King, S., Yamagishi, J. (2014). Statistical parametric speech synthesis for Ibibio. Speech Communication, Vol. 56, pp. 243–251. DOI: 10.1016/j.specom.2013.02.003.

9. Ze, H., Senior, A., Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing. DOI: 10.1109/ICASSP.2013.6639215.

10. Ning, Y., He, S., Wu, Z., Xing, Ch. (2019). A review of deep learning based speech synthesis. Applied Sciences, Vol. 9, No. 19, 4050. DOI: 10.3390/app9194050.

11. Hu, Y.J., Ling, Z.H. (2016). DBN-based spectral feature representation for statistical parametric speech synthesis. IEEE Signal Processing Letters, Vol. 23, No. 3, pp. 321–325. DOI: 10.1109/LSP.2016.2516032.

12. Hu, Y.J., Ling, Z.H. (2018). Extracting spectral features using deep autoencoders with binary distributed hidden units for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 4, pp. 713–724. DOI: 10.1109/TASLP.2018.2791804.

13. Suraj-Pandurang, P., Laxman-Lahudkar, S. (2019). Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states. International Journal of Speech Technology, Vol. 22, No. 1, pp. 93–98.

14. Sefara, T.J., Mokgonyane, T.B., Manamela, M.J., Modipa, T.I. (2019). HMM-based speech synthesis system incorporated with language identification for low-resourced languages. International Conference on Advances in Big Data, Computing and Data Communication Systems (ICABCD). DOI: 10.1109/ICABCD.2019.8851055.

15. Yamagishi, J., Ling, Z., King, S. (2008). Robustness of HMM-based speech synthesis.

16. Valentini-Botinhao, C., Wang, X., Takaki, S., Yamagishi, J. (2016). Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks. Interspeech.

17. Karhila, R., Remes, U., Kurimo, M. (2013). Noise in HMM-based speech synthesis adaptation: Analysis, evaluation methods and experiments. IEEE Journal of Selected Topics in Signal Processing, Vol. 8, No. 2, pp. 285–295.

18. Baljekar, P. (2018). Speech synthesis from found data. PhD thesis, Carnegie Mellon University.

19. Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 3. DOI: 10.1109/ICASSP.2000.861820.

20. Toda, T., Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, Vol. 90, No. 5, pp. 816–824. DOI: 10.1093/ietisy/e90-d.5.816.

21. Maegaard, B., Choukri, K., Calzolari, N., Odijk, J. (2005). ELRA – European Language Resources Association: background, recent developments and future perspectives. Language Resources and Evaluation, Vol. 39, No. 1, pp. 9–23. DOI: 10.1007/s10579-005-2692-5.

Article received on 09/10/2020; accepted on 16/02/2021.
Corresponding author is Marvin Coto-Jiménez.