
5 Summary

In document Acta 2502 y (pages 143-152)

We presented a deep neural network based speech de-identification method that maps vocoder features of human speech to those of a generic TTS engine with minimal loss of sound quality in our experiments on the TIMIT data set. The novelty of our scheme is that de-identification is based on speech-text sample pairs, which are widely available in the speech processing community. In the resulting signal, the identity of the speaker is concealed, as confirmed by our perceptual listening experiments.

Table 2: Results of the perceptual listening experiments. We report the average and the standard deviation of the identification accuracy.

Task                    # of samp.   Accuracy (mean±std)   Random choice
A-not-B                 20           0.56±0.15             0.5
Female/Male (2AFC)      15           0.51±0.15             0.5
# of Speakers (6AFC)    6            0.18                  0.16

A limitation of our technique is that the dynamics of the original speaker are inherited due to the application of DTW. We hypothesize that this problem may be alleviated by applying DTW in the loss function of the deep network. We leave such studies to future work.

Our technique enables privacy-aware speech recognition for adults. The proposed method is lightweight and can be used for collecting de-identified databases when the privacy of the user is important, for example in cloud-based speech services or in medical records. The fact that our method requires only speech-transcript sample pairs is a very promising aspect for deep learning, which requires large and high quality databases.



Toolset for Supporting the Research of Lattice Based Number Expansions

Péter Hudoba and Attila Kovács

The world of generalized number systems contains many challenging areas, and computer experiments often support the theoretical research. In this paper we introduce a toolset that helps to analyze some properties of lattice based number expansions. The toolset is able to (1) analyze the expansions, (2) decide the number system property, and (3) classify and visualize the periodic points.

The toolset is implemented in Python and is published together with a database that stores a large number of special expansions along with custom properties such as the signature and the operator eigenvalues. Researchers can connect to the server to request or upload data, or to perform experiments on the stored expansions.

We present an introductory usage of the toolset and detail some results that have been obtained with it. The toolset can be downloaded from http://numsys.infodomain.

1 Introduction

The generalization of positional number representations to a wide range of digit sets or to higher dimensions is a fascinating story. Grünwald (1885) investigated negative-based systems, while Kempner (1936), Knuth (1960), Khmelnik (1964), and Penney (1965) investigated complex-based systems. From the 70’s Kátai, B. Kovács, Környei, Pethő (the “Hungarian school”) and Gilbert systematically examined the radix extensions in algebraic number fields. In the 90’s the topological aspects of radix representations were studied by Bandt, Indlekofer, Járai, Kátai, Lagarias, Wang, Vince, and later by Akiyama, Thuswaldner and others. The canonical number representation was generalized to arbitrary polynomial systems by Pethő (1989), and was later investigated extensively by many authors (incl. Akiyama, Brunotte, Kovács, Pethő, Rao, Scheicher, Thuswaldner). The number system concept in general lattices was first investigated by Vince (1993). The study of the algorithmic aspects of canonical (polynomial) systems was initiated by Brunotte (2001), and for general lattices by the second author (2000). Recently, a special type of radix systems (SRS) has been studied at length by Thuswaldner and his co-workers (the “Austrian school”).

a Eötvös Loránd University, Budapest, Hungary

b E-mail: peter.hudoba@inf.elte.hu, ORCID: 0000-0001-5810-4193

c E-mail: attila@inf.elte.hu, ORCID: 0000-0002-1858-7618

2 Preliminaries

Let $\Lambda$ be a lattice in $\mathbb{R}^n$ and let $M : \Lambda \to \Lambda$ be a linear operator such that $\det(M) \neq 0$. Let furthermore $D \subseteq \Lambda$ be a finite subset with $0 \in D$. Lattices can be seen as finitely generated free Abelian groups. They have many significant applications in pure mathematics (Lie algebras, number theory and group theory) and in applied mathematics (coding theory, and cryptography because of the conjectured computational hardness of several lattice problems), and they are used in various ways in the physical sciences. We note that number system research in general lattices also comprises the orders.

Definition 1. The triple $(\Lambda, M, D)$ is called a number system (GNS) if every element $x$ of $\Lambda$ has a unique, finite representation of the form

$$x = \sum_{i=0}^{L} M^i d_i,$$

where $d_i \in D$ and $L \in \mathbb{Z}$ ($L + 1$ is the length of the expansion).
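As a small illustration of Definition 1, the following sketch evaluates the sum $\sum_{i=0}^{L} M^i d_i$ for a given digit sequence. The example base (multiplication by $-1+i$ on the Gaussian integers, written as a 2×2 integer matrix in the basis $\{1, i\}$) and the function names are assumptions chosen for illustration; they are not taken from the toolset.

```python
def mat_vec(M, v):
    """Multiply the integer matrix M (tuple of rows) by the vector v."""
    return tuple(sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M)))


def evaluate_expansion(M, digits):
    """Return sum_{i=0}^{L} M^i d_i for digits = [d_0, ..., d_L] (Horner scheme)."""
    x = tuple(0 for _ in digits[0])
    for d in reversed(digits):
        # x <- M x + d, processed from the most significant digit down to d_0
        x = tuple(a + b for a, b in zip(mat_vec(M, x), d))
    return x


# Hypothetical example: base -1+i, digit set {0, 1} embedded as (0, 0) and (1, 0).
M = ((-1, -1), (1, -1))
digits = [(0, 0), (1, 0), (1, 0), (1, 0)]
print(evaluate_expansion(M, digits))  # (1, 1), i.e. the Gaussian integer 1 + i
```

The same function works for the one-dimensional lattice $\mathbb{Z}$ by passing a 1×1 matrix such as `((-2,),)`.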

In Definition 1, $M$ is called the base and $D$ is the digit set. It is easy to see that similarity preserves the number system property, i.e., if $M_1$ and $M_2$ are similar via the matrix $Q$, then $(\Lambda, M_1, D)$ is a number system if and only if $(Q\Lambda, M_2, QD)$ is a number system. If we change the basis in $\Lambda$, a similar integer matrix can be obtained; hence there is no loss of generality in assuming that $M$ is an integral matrix acting on the lattice $\mathbb{Z}^n$. If two elements of $\Lambda$ are in the same coset of the factor group $\Lambda/M\Lambda$, then they are said to be congruent modulo $M$. The following theorem gives a necessary condition for the number system property.

Theorem 1. If $(\Lambda, M, D)$ is a number system, then (1) $D$ must be a full residue system modulo $M$, (2) $M$ must be expansive, and (3) $\det(I_n - M) \neq \pm 1$ (unit condition). If a system fulfils the first two conditions, then it is called a radix system.

We note that the theorem in this form was first stated in [9], but it was well known and used much earlier by Kátai and Vince. The full residue system property can be decided easily using the Smith normal form [8]. Algorithms that calculate the eigenvalues of $M$ exactly in a finite number of steps exist only for a few special classes of matrices; for general matrices, iterative algorithms produce approximate solutions. In polynomial systems, where $M$ is the companion matrix of a monic integer polynomial $f$, deciding the Schur or Hurwitz stability of $f$ is computationally equivalent to the expansivity check. Verification of the unit condition is trivial.
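The three necessary conditions can be sketched for the special case of 2×2 integer matrices as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the residue test uses a pairwise congruence check via the adjugate of $M$ instead of the Smith normal form mentioned above, and expansivity (all eigenvalues of modulus greater than one) is tested in floating point, so borderline cases may be misjudged.

```python
import cmath


def det2(M):
    (a, b), (c, d) = M
    return a * d - b * c


def congruent(u, v, M):
    """True if u - v lies in M * Z^2, tested exactly via adj(M)."""
    (a, b), (c, d) = M
    det = det2(M)
    y0, y1 = u[0] - v[0], u[1] - v[1]
    # M^{-1} y = adj(M) y / det(M) must be an integer vector
    z0 = d * y0 - b * y1
    z1 = -c * y0 + a * y1
    return z0 % det == 0 and z1 % det == 0


def necessary_conditions(M, digits):
    """Return (full residue system, expansive, unit condition) for a 2x2 integer M."""
    det = det2(M)
    if det == 0:
        raise ValueError("M must be invertible")
    # (1) |det M| digits that are pairwise incongruent modulo M
    full_residue = len(digits) == abs(det) and not any(
        congruent(u, v, M) for i, u in enumerate(digits) for v in digits[i + 1:]
    )
    # (2) both eigenvalues have modulus > 1 (roots of t^2 - tr*t + det)
    tr = M[0][0] + M[1][1]
    root = cmath.sqrt(tr * tr - 4 * det)
    expansive = all(abs(lam) > 1 for lam in ((tr + root) / 2, (tr - root) / 2))
    # (3) det(I - M) is not +1 or -1
    unit = abs(det2(((1 - M[0][0], -M[0][1]), (-M[1][0], 1 - M[1][1])))) != 1
    return full_residue, expansive, unit


# Hypothetical examples: base -1+i with digits {0, 1}, and base 2*I with digits {0, 1}^2.
print(necessary_conditions(((-1, -1), (1, -1)), [(0, 0), (1, 0)]))  # (True, True, True)
print(necessary_conditions(((2, 0), (0, 2)), [(0, 0), (1, 0), (0, 1), (1, 1)]))  # (True, True, False)
```

The second example is a radix system that fails the unit condition, so it cannot be a number system.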

Let $\varphi : \Lambda \to \Lambda$, $x \mapsto M^{-1}(x - d)$, where $d \in D$ is the unique digit satisfying $x \equiv d \pmod{M}$. Since $M^{-1}$ is contractive and $D$ is finite, there exist a norm $\|\cdot\|$ on $\Lambda$ and a constant $C$ such that the orbit of every $x \in \Lambda$ eventually enters the finite set $S = \{x \in \Lambda : \|x\| < C\}$ after repeated application of $\varphi$. This means that the sequence $x, \varphi(x), \varphi^2(x), \ldots$ is eventually periodic for all $x \in \Lambda$. Clearly, $(\Lambda, M, D)$ is a number system iff for every $x \in \Lambda$ the orbit of $x$ eventually reaches $0$. A point $p$ is called periodic if $\varphi^k(p) = p$ for some $k > 0$. The orbit of a periodic point $p$ is a cycle. The set of all periodic points is denoted by $\mathcal{P}$. The signature $[l_1, l_2, \ldots, l_\omega]$ of a radix system is a finite sequence of non-negative integers in which $l_i$ is the number of cycles of period length $i$ in the periodic structure $\mathcal{P}$ ($1 \le i \le \omega$). Clearly, the signature of a number system is $\mathrm{Sig} = [1]$.
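The map $\varphi$ and its orbits can be sketched in exact integer arithmetic for 2×2 bases. The code below is a minimal illustration, not the toolset's implementation; the function names and the two example systems are assumptions. Division by $M$ is carried out through the adjugate, so no floating point is involved.

```python
def phi(x, M, digits):
    """Return (M^{-1}(x - d), d) for the digit d congruent to x modulo M."""
    (a, b), (c, d) = M
    det = a * d - b * c
    for digit in digits:
        y0, y1 = x[0] - digit[0], x[1] - digit[1]
        z0 = d * y0 - b * y1    # adj(M) * (x - digit), first coordinate
        z1 = -c * y0 + a * y1   # adj(M) * (x - digit), second coordinate
        if z0 % det == 0 and z1 % det == 0:
            return (z0 // det, z1 // det), digit
    raise ValueError("no digit is congruent to x: D is not a full residue system")


def expand(x, M, digits):
    """Digits d_0, ..., d_L of x, or None if the orbit of x never reaches 0."""
    seen, out = set(), []
    while x != (0, 0):
        if x in seen:       # the orbit entered a cycle that avoids 0
            return None
        seen.add(x)
        x, digit = phi(x, M, digits)
        out.append(digit)
    return out


def cycle_of(x, M, digits):
    """The cycle that the orbit x, phi(x), phi^2(x), ... eventually enters."""
    seen, path = {}, []
    while x not in seen:
        seen[x] = len(path)
        path.append(x)
        x, _ = phi(x, M, digits)
    return path[seen[x]:]


# Hypothetical examples: base -1+i with digits {0, 1}, and base 2*I with digits {0, 1}^2.
G, DG = ((-1, -1), (1, -1)), [(0, 0), (1, 0)]
T, DT = ((2, 0), (0, 2)), [(0, 0), (1, 0), (0, 1), (1, 1)]
print(expand((1, 1), G, DG))     # [(0, 0), (1, 0), (1, 0), (1, 0)]
print(expand((-1, 0), T, DT))    # None: (-1, 0) is a fixed point of phi
print(cycle_of((3, -2), T, DT))  # [(0, -1)]
```

The digit loop in `phi` is linear in the size of the digit set; this keeps the sketch short at the cost of speed.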

The following problem classes are in the mainstream of the research.

• For a given $(\Lambda, M, D)$ the decision problem asks if the triple forms a number system or not.

• For a given $(\Lambda, M, D)$ the classification problem means finding all cycles (witnesses).

• The parametrization problem means finding parametrized families of number systems.

• The construction problem aims at constructing a digit set $D$ to $M$ for which $(\Lambda, M, D)$ is a number system. More generally, construct a digit set $D$ to $M$ such that $(\Lambda, M, D)$ satisfies a given signature.

We note that the algorithmic complexity of the decision and classification problems is still unknown.

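The fundamental domain, or set of “fractions”, in $(\Lambda, M, D)$ can be defined as

$$H = \left\{ \sum_{i=1}^{\infty} M^{-i} d_i : d_i \in D \right\}.$$

Theorem 2. (a) $H$ is compact. (b) $x \in \mathcal{P} \Leftrightarrow x \in -H$.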

The compactness was proved by many authors (see e.g. Vince [15]). One direction of (b) was proved in [9]. The other direction is obvious as well, since otherwise 0 would have at least two different representations.

The theorem means that in order to determine the periodic points it is enough to locate the lattice points in $-H$. There are two different approaches to this problem: the IFS-method (see [8, 10]) and the covering method (see [8, 4]), which was optimized by the authors [6]. The idea of the latter is to enclose the compact set $-H$ in a box $B$ in which the integer points are easily enumerable. Then we compute the pairs $(x, \varphi(x))$ for all $x \in B$, and finally we determine the cycles by applying one of the cycle finding methods.
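The covering idea can be sketched for 2×2 bases as follows. This is a simplified illustration rather than the optimized method of [6]: the half-width `radius` of the box is a hypothetical parameter supplied by hand and is assumed to be large enough for the box to contain $-H$, the computation of such a bound is not shown, and the function names are assumptions.

```python
from itertools import product


def phi(x, M, digits):
    """Return M^{-1}(x - d) for the digit d congruent to x modulo M (2x2 case)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    for digit in digits:
        y0, y1 = x[0] - digit[0], x[1] - digit[1]
        z0, z1 = d * y0 - b * y1, -c * y0 + a * y1  # adj(M) * (x - digit)
        if z0 % det == 0 and z1 % det == 0:
            return (z0 // det, z1 // det)
    raise ValueError("D is not a full residue system modulo M")


def classify(M, digits, radius):
    """Find all cycles of phi inside the box [-radius, radius]^2."""
    box = list(product(range(-radius, radius + 1), repeat=2))
    succ = {x: phi(x, M, digits) for x in box}  # the pairs (x, phi(x))
    walk_id, cycles = {}, []
    for start in box:
        path, x = [], start
        # follow successors until leaving the box or meeting a visited point
        while x in succ and x not in walk_id:
            walk_id[x] = start
            path.append(x)
            x = succ[x]
        if walk_id.get(x) == start:  # the walk closed on itself: a new cycle
            cycles.append(path[path.index(x):])
    return cycles


def signature(cycles):
    """Entry i-1 counts the cycles of period length i."""
    sig = [0] * max(len(c) for c in cycles)
    for c in cycles:
        sig[len(c) - 1] += 1
    return sig


# Hypothetical examples: base -1+i with digits {0, 1}, and base 2*I with digits {0, 1}^2.
print(classify(((-1, -1), (1, -1)), [(0, 0), (1, 0)], 3))  # [[(0, 0)]]
print(signature(classify(((2, 0), (0, 2)), [(0, 0), (1, 0), (0, 1), (1, 1)], 2)))  # [4]
```

Each box point is visited once, so the cycle search is linear in the number of box points once the successor table has been built.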

There are other algorithms for solving the decision and classification problems. Based on the method of Vince [15], Brunotte [2] described a digit-propagation algorithm for polynomial systems with canonical digits. Later, his method was generalized to arbitrary operators and digit sets [4]. The shortcoming of this method is the sequential nature of the digit propagation; however, there is an algorithmic attempt to overcome this disadvantage [14].

Let $f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{n-1} x^{n-1} + x^n$ be a monic integer polynomial. Let us denote the factor ring $\mathbb{Z}[x]/(f)$ by $\Lambda_f$. Then $\Lambda_f$ is a lattice and all the problems regarding number expansions in $\Lambda_f$ can be formulated in $\mathbb{Z}^n$, where $M$ is the companion matrix of $f$. If $f$ is irreducible, then $\Lambda_f$ is isomorphic to $\mathbb{Z}[\theta]$, where $f(\theta) = 0$ in an appropriate extension of $\mathbb{Q}$. Hence, if the digit set $D$ is restricted to be the set of non-negative integers $D = \{0, 1, \ldots, |a_0| - 1\}$, we get a straightforward generalization of the traditional number systems in $\mathbb{Z}$. In this case the digit set is called canonical. If the radix system $(\Lambda_f, \theta, D)$ satisfies the unique representation property of Definition 1 with some canonical digit set $D$, then it is called a canonical number system (CNS). The notion of canonical digit sets can be extended to form a $j$-canonical set $D_j = \{0, e_j, \ldots, (|a_0| - 1) e_j\} \subset \mathbb{Z}^n$, where $e_j$ is the $j$th unit vector [8]. There exists a canonical number system in $O_K$, the ring of integers of the algebraic number field $K$, if and only if there is a power integral basis in $O_K$ [12]. We note that canonical digit sets may or may not exist in different lattices, and canonicity depends on the chosen basis. The symmetric digit set is formed by those integer multiples of $e_j$ which are closest to the origin, and was introduced by Kátai [7]. The adjoint digit set consists of those lattice points which are in det(M)) . The dense digit set, in which each digit has the smallest norm in its residue class, was introduced and used extensively by the second author. We note that the adjoint set is a dense one in a special basis.
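To make the polynomial setting concrete, the following sketch builds a companion matrix and a $j$-canonical digit set from the coefficients of $f$. The companion convention used here (columns are the images of the basis $1, x, \ldots, x^{n-1}$ under multiplication by $x$) is one common choice and an assumption, as are the function names; note also that the code counts $j$ from 0, whereas the text counts from 1.

```python
def companion(coeffs):
    """Companion matrix of f(x) = a_0 + a_1 x + ... + a_{n-1} x^{n-1} + x^n.

    coeffs = [a_0, ..., a_{n-1}]; column c holds the image of x^c under
    multiplication by x in Z[x]/(f) (an assumed convention).
    """
    n = len(coeffs)
    return tuple(
        tuple(
            (1 if r == c + 1 else 0) if c < n - 1 else -coeffs[r]
            for c in range(n)
        )
        for r in range(n)
    )


def j_canonical_digits(coeffs, j):
    """D_j = {0, e_j, ..., (|a_0| - 1) e_j} with a 0-based unit vector index j."""
    n, a0 = len(coeffs), abs(coeffs[0])
    return [tuple(k if i == j else 0 for i in range(n)) for k in range(a0)]


# Hypothetical example: f(x) = x^2 + 2x + 2, whose roots are -1 + i and -1 - i.
print(companion([2, 2]))              # ((0, -2), (1, -2))
print(j_canonical_digits([2, 2], 0))  # [(0, 0), (1, 0)]
```

With `j = 0` the digits are the integers $0, \ldots, |a_0| - 1$ embedded along the first basis vector, which corresponds to the canonical digit set described above.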
