

In document Acta 2502 y (Pages 130-138)

The authors would also like to thank Kenéz Csiktusnádi-Kiss for his work and support in this research.

Icons made by Pixel perfect, catkuro, Eucalyp, fjstudio, Freepik, Pause08, surang, xnimrodx, bimbimkha from www.flaticon.com.


[1] Bradski, G. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 25:120–125, 2000.

[2] Castelluccia, Claude and Le Métayer Inria, Daniel. Impact analysis of facial recognition. Working paper or preprint, URL: https://hal.inria.fr/hal-02480647, February 2020.

[3] Chen, Sheng, Liu, Yang, Gao, Xiang, and Han, Zhen. MobileFaceNets: Efficient CNNs for accurate real-time face verification on mobile devices. ArXiv, abs/1804.07573, April 2018.

[4] Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893, 2005. DOI:

[5] Dantcheva, Antitza, Elia, Petros, and Ross, Arun. What else does your biometric data reveal? A survey on soft biometrics. IEEE Transactions on Information Forensics and Security, 11, 2015. DOI: 10.1109/TIFS.2015.2480381.

[6] Deng, Jiankang. Video face recognition demo of ArcFace, 2018. URL:https:


[7] Deng, Jiankang, Guo, Jia, Xue, Niannan, and Zafeiriou, Stefanos. ArcFace: Additive angular margin loss for deep face recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4685–4694, 2019. DOI: 10.1109/CVPR.2019.00482.

[8] Deng, Jiankang, Guo, Jia, Zhou, Yuxiang, Yu, Jinke, Kotsia, Irene, and Zafeiriou, Stefanos. RetinaFace: Single-stage dense face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5203–5212, June 2020.

[9] Dua, Dheeru and Graff, Casey. UCI machine learning repository, 2019. URL:


[10] European Parliament and of the Council. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), 2016. URL: https://eur-lex.europa.eu/eli/reg/

[11] Fábián, István and Gulyás, Gábor György. On the privacy risks of large-scale processing of face imprints. In The 12th Conference of PhD Students in Computer Science, 2020. https://www.inf.u-szeged.hu/~cscs/proceedings.

[12] Fábián, István and Gulyás, Gábor. De-anonymizing facial recognition embeddings. Infocommunications Journal, 12:50–56, 2020. DOI: 10.36244/ICJ.


[13] GDPR. Article 29 Data Protection Working Party, opinion 05/2014 on anonymisation techniques, 2014. URL: https://ec.europa.eu/justice/



[14] Grother, P., Ngan, M., Hanaoka, K., and National Institute of Standards and Technology (U.S.). Face Recognition Vendor Test (FRVT): Part 3, Demographic Effects. NIST interagency report. National Institute of Standards and Technology, 2019.

[15] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. DOI: 10.1109/


[16] Hill, Kashmir. Another arrest, and jail time, due to a bad facial recognition match. The New York Times, 2020. URL: https://www.nytimes.com/2020/

[17] Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, and Weinberger, Kilian Q. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017. DOI: 10.1109/CVPR.2017.243.

[18] Huang, Gary B., Ramesh, Manu, Berg, Tamara, and Learned-Miller, Erik. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[19] Keegan, Matthew. Big brother is watching: Chinese city with 2.6m cameras is world’s most heavily surveilled. The Guardian, 2019. URL: https://www.theguardian.com/cities/2019/dec/02/big-brother-is-watching-chinese-city-with-26m-cameras-is-worlds-most-heavily-surveilled.

[20] King, Davis. dlib vs OpenCV face detection, 2014. URL: https://www.


[21] King, Davis E. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(60):1755–1758, December 2009.

[22] Mai, Guangcan, Cao, Kai, Yuen, Pong C., and Jain, Anil K. On the reconstruction of face images from deep face templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5):1188–1202, May 2019. DOI: 10.1109/tpami.2018.2827389.

[23] McKinney, Wes. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 445:51–56, 2010. DOI:


[24] Pedregosa, Fabian, Varoquaux, Gaël, Gramfort, Alexandre, Michel, Vincent, Thirion, Bertrand, Grisel, Olivier, Blondel, Mathieu, Prettenhofer, Peter, Weiss, Ron, Dubourg, Vincent, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(85):2825–2830, 2011.

[25] Ramirez Cerna, Lourdes, Camara-Chavez, Guillermo, and Menotti Gomes, David. Face detection: Histogram of oriented gradients and bag of feature method. In Proceedings of the 2013 International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV’13), 2013.

[26] Schroff, Florian, Kalenichenko, Dmitry, and Philbin, James. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Jun 2015. DOI:

[27] Simon, Mallory. HP looking into claim webcams can’t see black people. CNN, 2009. URL: https://edition.cnn.com/2009/TECH/12/22/hp.


[28] Sweeney, Latanya. Simple demographics often identify people uniquely. Carnegie Mellon University, Data Privacy, 2000. DOI: https://doi.org/

[29] Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, and Alemi, Alexander A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pages 4278–4284. AAAI Press, 2017.

[30] Thorat, S. B., Nayak, S. K., and Dandale, Jyoti P. Facial recognition technology: An analysis with scope in India. ArXiv, abs/1005.4263, 2010.

[31] Turk, M.A. and Pentland, A.P. Face recognition using eigenfaces. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586–591, 1991. DOI: 10.1109/CVPR.1991.139758.

[32] Viola, P. and Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2001. DOI:

[33] Wu, Haoran, Xu, Zhiyong, Zhang, Jianlin, Yan, Wei, and Ma, Xiao. Face recognition based on convolution siamese networks. In 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pages 1–5, 2017. DOI: 10.1109/CISP-BMEI.2017.8302003.

[34] Zhang, Maggie. Google Photos tags two African-Americans as gorillas through facial recognition software. Forbes, 2015. URL: https://www.forbes.com/sites/mzhang/2015/07/01/google-photos-tags-two-african-americans-as-gorillas-through-facial-recognition-software/.

[35] Zhang, Zhifei, Song, Yang, and Qi, Hairong. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017. DOI: 10.1109/cvpr.2017.


Speech De-identification with Deep Neural Networks

Ádám Fodor, László Kopácsi, Zoltán Á. Milacski, and András Lőrincz



Cloud-based speech services are powerful practical tools, but the privacy of the speakers raises important legal concerns when exposed to the Internet. We propose a deep neural network solution that removes personal characteristics from human speech by converting it to the voice of a Text-to-Speech (TTS) system before sending the utterance to the cloud. The network learns to transcode sequences of vocoder parameters, delta and delta-delta features of human speech to those of the TTS engine. We evaluated several TTS systems, vocoders and audio alignment techniques. We measured the performance of our method by (i) comparing the result of speech recognition on the de-identified utterances with the original texts, (ii) computing the Mel-Cepstral Distortion of the aligned TTS and the transcoded sequences, and (iii) questioning human participants in A-not-B, 2AFC and 6AFC (Alternative Forced-Choice) tasks. Our approach achieves the level required by diverse applications.

Keywords: speech processing, voice conversion, deep neural network, text-to-speech, speaker privacy

The research has been supported by the European Union, the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program and by the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme no. 2020-4.1.1.-TKP2020 (National Challenges Subprogramme) funding scheme through the "Application Domain Specific Highly Reliable IT Solutions" project and co-financed by the European Social Fund (EFOP-3.6.3-16-2017-00002, EFOP-3.6.3-VEKOP-16-2017-00001).

a Equal contributions. Department of Artificial Intelligence, Eötvös Loránd University, Budapest, Hungary, E-mail: {foauaai, kopacsi}@inf.elte.hu, ORCIDs: 0000-0001-7370-930X and 0000-0003-2387-2015

b Department of Artificial Intelligence, Eötvös Loránd University, Budapest, Hungary, E-mail: miztaai@inf.elte.hu, ORCID: 0000-0002-3135-2936

c Corresponding author, Department of Artificial Intelligence, Eötvös Loránd University, Budapest, Hungary, E-mail: lorincz@inf.elte.hu, ORCID: 0000-0002-1280-3447


1 Introduction

Cloud-based speech services have improved recently due to the large amount of voice data that is exploited by deep learning technology [1, 3], giving rise to superhuman performance in several tasks. Consequently, it seems reasonable to use such utilities in practice.

Unfortunately, many speech applications involve legal concerns regarding privacy. Several methods have been proposed to eliminate personal information from samples without spoiling the linguistic content before uploading. We should also mention that in many cases the private information is carried by the linguistic content and not by the voice of the speaker. For example, when a doctor dictates medical records, the private content is the medical content and not the identity of the doctor. In the case of diagnostic sessions with autistic people, however, it is the speaker whose identity should remain hidden. If an external ASR is used on the transformed speech of a patient, the identity will remain concealed, and the linguistic content can be generated safely.

Voice conversion (VC) operates by altering certain features of human speech [31]. Voice transformation (VT) converts the signal as if it were uttered by a target speaker [23]. De-identification is the process that intends to remove from the data any personal information that could be associated with identity. VC and VT may be applied to solve de-identification, but the approaches in the literature suffer from several flaws: the VC algorithm in [22] is approximately invertible and relies on a good voice transformer, while VT [23, 29] requires data from pairs of speakers and is unable to anonymize the target speaker.

Our contributions are as follows. For de-identification, we propose to transform utterances to a generic voice of a Text-to-Speech (TTS) engine by taking advantage of utterance-text sample pairs. We use an end-to-end trainable Deep Neural Network (DNN) to learn the many-to-one VT task. We suggest learning the mapping at the vocoder level. We show that the trained network gives rise to tolerable distortions at the utterance level by conducting two experiments: comparing the outputs of Google’s Automatic Speech Recognition (ASR) system for the original TTS output and the de-identified utterance, and measuring the Mel-Cepstral Distortion (MCD) [19]. To confirm de-identification success, we further performed three kinds of perceptual listening studies with human subjects (A-not-B test: distinguishing transformed utterances of different speakers, 2-Alternative Forced-Choice (2AFC) test: classifying utterances from female/male speakers, and 6-Alternative Forced-Choice (6AFC) test: estimating the number of speakers). Our proposal is irreversible and it requires only speech-transcript sample pairs for training, which are readily accessible in the literature. We argue that our method performs favorably compared to several baseline methods.
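To make the MCD measure concrete, the following is a minimal sketch of its commonly used definition, MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It is not taken from the paper: the function name `mel_cepstral_distortion`, the list-of-lists frame layout, and the convention of skipping the 0th (energy) coefficient are assumptions, and the two sequences are assumed to be time-aligned already.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean Mel-Cepstral Distortion in dB over time-aligned frames.

    Each frame is a list of mel-cepstral coefficients. Coefficient 0
    (energy) is skipped, which is a common convention and an assumption here.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        # Squared Euclidean distance over coefficients 1..D of this frame.
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)
```

Identical sequences give 0 dB, and larger values indicate that the transcoded frames are further from the TTS target.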

Figure 1: Schematic diagram of our proposed method. Training [T]: vocoded human voice is input to a Deep Neural Network (DNN) that is trained to approximate the aligned TTS output. Inference [I]: vocoded human voice is de-identified by the DNN and transformed back to utterances by the vocoder.
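Training against an "aligned TTS output" presupposes that every human-speech frame has a matching TTS frame. This section does not say which alignment technique is used, so the sketch below assumes plain dynamic time warping (DTW) with Euclidean frame distance as one plausible choice; `dtw_path` and `align_targets` are hypothetical helper names, not the authors' code.

```python
def dtw_path(src, tgt):
    """Minimum-cost monotonic alignment between two frame sequences.

    Returns a list of (src_index, tgt_index) pairs. Assumes DTW with
    Euclidean frame distance; the paper's actual aligner may differ.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(src[i - 1], tgt[j - 1]) + best_prev

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def align_targets(src, tgt):
    """For each source frame, pick the last target frame matched to it."""
    matched = {}
    for i, j in dtw_path(src, tgt):
        matched[i] = j
    return [tgt[matched[i]] for i in range(len(src))]
```

With such a helper, `align_targets(human_frames, tts_frames)` would yield one TTS frame per human frame, which is the shape a frame-wise regression target needs.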

2 Related work

De-identification can be solved by either VC or VT methods. A subset of the literature focuses on classical algorithms instead of leveraging the potential of DNN architectures, and hence fails to produce state-of-the-art speech quality. The so-called transterpolation VC technique gave rise to significant improvements over diverse VC methods, as reviewed in [16]. Another VC approach exploits the two-step procedure of an ASR system followed by a TTS [17]. However, due to the method of the conversion, the latter cannot take advantage of the superior performance of cloud-based ASR systems. A potential problem of VT methods is that the target speaker is not generic and hence cannot be anonymized. Generic here means monotonic and “robotic”, without any prosody. Neural TTS systems are nowadays realistic enough that they may include unintended prosody, which makes the transformation harder. The problem can be resolved by converting to an average voice; however, we decided to use a TTS system instead of generating an average voice. In addition, VT methods need speech corpora of both the original speaker and the target speaker. To avoid the need for parallel corpora, an approach that used a pool of pre-trained transformations between a set of speakers was put forth in [22]. A transformation function was applied to the source and the target speakers based on speaker similarity and dissimilarity, respectively. By applying several sound distortion algorithms, de-identification was achieved and the transformation could be reversed, offering several advantages at the cost of vulnerability.

In contrast, several recent works propose DNNs for many-to-one VC and VT. For an overview, see [24]. A VC method using Mel-Cepstral inputs for deep autoencoders was introduced in [23]. A speaker-dependent Conditional Restricted Boltzmann Machine (CRBM) was applied using Mel-Frequency Cepstral Coefficients (MFCCs) and deltas for solving the VC task for each speaker pair in [29]. An autoencoder-based VT approach was proposed to reduce the required size of the data sets and to shorten conversion time in [32]. A VT method that generates a one-to-one speaker-dependent DNN using the weights of a speaker-independent DNN was suggested in [21]. Spectral envelope, fundamental frequency (F0), intensity trajectory and phone duration were converted in [30] subject to an ℓ1-norm constraint during pre-training. Nevertheless, all of these methods restrict themselves to the case of VC and VT, without using transcript data. In this paper, we directly tackle de-identification and propose to use textual data, which is largely available online, in addition to the original speech for training.
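Delta features, used here alongside vocoder parameters and in [29] alongside MFCCs, describe how each coefficient changes over time. The sketch below assumes the simplest variant, a central difference over neighbouring frames with the edge frames repeated; the window size, the helper names `deltas` and `add_dynamic_features`, and the concatenation order are illustrative assumptions rather than the configuration of any cited work.

```python
def deltas(frames):
    """First-order dynamic features via a central difference.

    delta[t] = (frames[t + 1] - frames[t - 1]) / 2, with the first and
    last frames repeated at the edges (an assumed padding scheme).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out


def add_dynamic_features(frames):
    """Concatenate static, delta and delta-delta features per frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)  # delta-delta: the delta of the delta sequence
    return [s + a + b for s, a, b in zip(frames, d1, d2)]
```

For D static coefficients per frame, `add_dynamic_features` returns frames of length 3D, so a network consuming them sees both the spectral shape and its local trajectory.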
