
6 Conclusions

In document Acta 2502 y (pages 107-113)

In this study, we presented the Fisher Vector (FV) approach as a method of classifying speech from subjects having a cold. Compared with studies done by other teams using the same dataset [11, 22], our performance is competitive. Moreover, our feature extraction approach seems to be simpler than that of the mentioned studies, as we utilized a single type of feature representation for training a model. We found that the SVM gave better results when the feature pre-processing step was applied before the training phase. Thus, we demonstrated how applying Power Normalization (PN) along with dimension reduction via Principal Component Analysis (PCA) on the Fisher Vector features improved the classification performance. Combining Power Normalization with PCA gave a better UAR score on the test set. These results are higher than those obtained using the Bag-of-Audio-Words approach described in [22]. We can therefore say that PCA with the SVM allowed us to carry out a better classification of the actual data while keeping memory consumption in check. PN helped to reduce the impact of the features that become sparser as the number of Gaussian components increases. Furthermore, L2-normalization was applied before fitting the data. This helped to alleviate the effect of having different utterances with distinct amounts of background information projected into the extracted features, which in turn is intended to improve the prediction performance. In a future study, we will try out the FV approach on bigger datasets and evaluate the performance of a time-delay neural network when it uses these features as input.
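The two normalization steps named in the conclusions can be illustrated with a minimal sketch. It assumes the commonly used signed power form f(z) = sign(z)|z|^α; the exponent α = 0.5, the function names and the toy vector are illustrative assumptions rather than values or code taken from this study, and the PCA and SVM stages are omitted.

```python
import math


def power_normalize(vector, alpha=0.5):
    """Signed power normalization: sign(z) * |z|**alpha, element-wise.

    alpha=0.5 is an assumed, commonly used value, not one reported in the text.
    """
    return [math.copysign(abs(z) ** alpha, z) for z in vector]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(z * z for z in vector))
    if norm == 0.0:
        return list(vector)
    return [z / norm for z in vector]


# Toy stand-in for a Fisher Vector; real ones are far longer.
fisher_vector = [4.0, -9.0, 0.0, 0.25]
normalized = l2_normalize(power_normalize(fisher_vector))
```

Power normalization compresses large magnitudes more than small ones, and the subsequent L2 step puts every utterance-level vector on the same scale before a classifier is fitted.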


[1] Arandjelovic, Relja and Zisserman, Andrew. All about VLAD. In Proceedings of CVPR, pages 1578–1585, 2013. DOI: 10.1109/CVPR.2013.207.

[2] Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:1–27, 2011. DOI: 10.1145/1961189.1961199.

[3] Chatfield, Ken, Lempitsky, Victor, Vedaldi, Andrea, and Zisserman, Andrew. The devil is in the details: An evaluation of recent feature encoding methods. In British Machine Vision Conference, volume 2, pages 76.1–76.12, 11 2011.

[4] Cummins, N., Epps, J., Sethu, V., and Krajewski, J. Variability compensation in small data: Oversampled extraction of i-vectors for the classification of depressed speech. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 970–974, 2014. DOI:


[5] Egas-López, José Vicente, Orozco-Arroyave, Juan Rafael, and Gosztolya, Gábor. Assessing Parkinson’s Disease From Speech Using Fisher Vectors. Proceedings of Interspeech, pages 3063–3067, 2019. DOI: 10.21437/


[6] Egas-López, José Vicente, Tóth, László, Hoffmann, Ildikó, Kálmán, János, Pákáski, Magdolna, and Gosztolya, Gábor. Assessing Alzheimer’s Disease from Speech Using the i-vector Approach. In Proceedings of SPECOM, pages 289–298. Springer, 2019. DOI: 10.1007/978-3-030-26061-3_30.

[7] Gosztolya, Gábor, Bagi, Anita, Szalóki, Szilvia, Szendi, István, and Hoffmann, Ildikó. Identifying schizophrenia based on temporal parameters in spontaneous speech. In Proceedings of Interspeech, pages 3408–3412, Hyderabad, India, Sep 2018. DOI: 10.21437/Interspeech.2018-1079.

[8] Gosztolya, Gábor, Busa-Fekete, Róbert, Grósz, Tamás, and Tóth, László. DNN-based feature extraction and classifier combination for child-directed speech, cold and snoring identification. In Proceedings of Interspeech, pages 3522–3526, Stockholm, Sweden, Aug 2017. DOI: 10.21437/Interspeech.


[9] Gosztolya, Gábor, Grósz, Tamás, Szaszák, György, and Tóth, László. Estimating the sincerity of apologies in speech by DNN rank learning and prosodic analysis. In Proceedings of Interspeech, pages 2026–2030, San Francisco, CA, USA, Sep 2016. DOI: 10.21437/Interspeech.2016-956.

[10] Gosztolya, Gábor, Grósz, Tamás, and Tóth, László. General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats. In Proceedings of Interspeech, pages 531–535, Hyderabad, India, Sep 2018. DOI: 10.21437/Interspeech.2018-1076.

[11] Huckvale, Mark and Beke, András. It sounds like you have a cold! Testing voice features for the Interspeech 2017 Computational Paralinguistics Cold Challenge. In Proceedings of Interspeech, pages 3447–3451. International Speech Communication Association (ISCA), 2017. DOI: 10.21437/Interspeech.


[12] Jaakkola, Tommi S. and Haussler, David. Exploiting generative models in discriminative classifiers. In Proceedings of NIPS, pages 487–493, Denver, CO, USA, 1998.

[13] Jolliffe, Ian T. Principal Component Analysis. Springer-Verlag, 1986. DOI:


[14] Kaya, Heysem and Karpov, Alexey A. Introducing weighted kernel classifiers for handling imbalanced paralinguistic corpora: Snoring, addressee and cold. In INTERSPEECH, pages 3527–3531, 2017. DOI: 10.21437/Interspeech.


[15] Lemaître, Guillaume, Nogueira, Fernando, and Aridas, Christos K. Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563, 2017.

[16] Perronnin, F. and Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of CVPR, 2007. DOI: 10.1109/CVPR.2007.


[17] Povey, Daniel, Ghoshal, Arnab, Boulianne, Gilles, Burget, Lukáš, Glembek, Ondrej, Goel, Nagendra, Hannemann, Mirko, Motlíček, Petr, Qian, Yanmin, Schwarz, Petr, Silovský, Jan, Stemmer, Georg, and Veselý, Karel. The Kaldi speech recognition toolkit. Proceedings of ASRU, 01 2011.

[18] Rosenberg, Andrew. Classifying skewed data: Importance weighting to optimize average recall. In Proceedings of Interspeech, pages 2239–2242, 2012. DOI: 10.21437/Interspeech.2012-131.

[19] Sánchez, Jorge, Perronnin, Florent, Mensink, Thomas, and Verbeek, Jakob. Image classification with the Fisher Vector: Theory and practice. International Journal of Computer Vision, 105:222–245, 2013. DOI: 10.1007/


[20] Schuller, Björn, Batliner, Anton, Steidl, Stefan, and Seppi, Dino. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Communication, 53(9-10):1062–1087, 2011. DOI: 10.1016/j.specom.2011.01.011.

[21] Schuller, Björn, Steidl, Stefan, Batliner, Anton, Bergelson, Elika, Krajewski, Jarek, Janott, Christoph, Amatuni, Andrei, Casillas, Marisa, Seidl, Amanda, Soderstrom, Melanie, et al. The Interspeech 2017 computational paralinguistics challenge: Addressee, cold & snoring. In Proceedings of Interspeech, pages 3442–3446, 2017. DOI: 10.21437/Interspeech.2017-43.

[22] Schuller, Björn, Steidl, Stefan, Batliner, Anton, Hantke, Simone, Bergelson, Elika, Krajewski, Jarek, Janott, Christoph, Amatuni, Andrei, Casillas, Marisa, Seidl, Amanda, Soderstrom, Melanie, Warlaumont, Anne S., Hidalgo, Guillermo, Schnieder, Sebastian, Heiser, Clemens, Hohenhorst, Winfried, Herzog, Michael, Schmitt, Maximilian, Qian, Kun, Zhang, Yue, Trigeorgis, George, Tzirakis, Panagiotis, and Zafeiriou, Stefanos. The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, Cold & Snoring. In Proceedings of Interspeech, pages 3442–3446, Stockholm, Sweden, Aug 2017.

[23] Seeland, Marco, Rzanny, Michael, Alaqraa, Nedal, Wäldchen, Jana, and Mäder, Patrick. Plant species classification using flower images: A comparative study of local feature representations. PLOS ONE, 12(2):1–29, 02 2017. DOI: 10.1371/journal.pone.0170629.

[24] Smith, David C and Kornelson, Keri A. A comparison of Fisher vectors and Gaussian supervectors for document versus non-document image classification. In Applications of Digital Image Processing XXXVI, volume 8856, page 88560N. International Society for Optics and Photonics, 2013. DOI:


[25] Vedaldi, Andrea and Fulkerson, Brian. VLFeat: an open and portable library of Computer Vision algorithms. In Proceedings of ACM Multimedia, pages 1469–1472, 2010.

A Comparative Study on the Privacy Risks of Face Recognition Libraries

István Fábián and Gábor György Gulyás



The rapid development of machine learning and the decreasing costs of computational resources have led to the widespread usage of face recognition. While this technology offers numerous benefits, it also poses new risks. We consider risks related to the processing of face embeddings, which are floating point vectors representing the human face. Previously, we showed that even simple machine learning models are capable of inferring demographic attributes from embeddings, leading to the possibility of re-identification attacks. This paper proposes a new data protection evaluation framework for face recognition, and examines three popular Python libraries for face recognition (OpenCV, Dlib, InsightFace), comparing their face detection performance and inspecting how much risk each library’s embeddings pose regarding the aforementioned data leakage. Our experiments were conducted on a balanced face image dataset of different sexes and races, allowing us to discover biases. Based on our results, Dlib has a significant false negative rate (FNR) of 4.2% on the total dataset, and an even higher 5.9% FNR on black people. Finally, our findings indicate that all three libraries could enable sex or race based discrimination in re-identification attacks.

Keywords: face recognition, machine learning, privacy

1 Introduction

As technology gets cheaper and smart technologies advance, security and surveillance cameras are becoming increasingly widespread.

The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications). Project no. FIEK 16-1-2016-0007 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the Centre for Higher Education and Industrial Cooperation - Research infrastructure development (FIEK 16) funding scheme.

a Department of Automation and Applied Informatics, Faculty of Electrical Engineering and Informatics, Budapest University of Technology and Economics, Hungary

b E-mail: fabian@aut.bme.hu, ORCID: 0000-0003-0293-0335

c E-mail: gabor.gulyas@aut.bme.hu, ORCID: 0000-0003-0877-0088


According to recent news, Chongqing, a single Chinese city, alone has more than 2.5 million surveillance cameras installed [19]. This issue is not confined to countries similar to China: London, for example, also has more than 600,000 such cameras [19]. These devices enable emerging artificial intelligence based face recognition technologies in the physical world at scale. This will certainly have a significant impact on society as a whole, and on the personal level as well, as these advances enable surveillance on a scale never seen before.

In this paper, we look at the case of the large-scale storing and processing of face imprints generated by face recognition technologies. This technology uses a photo or a video frame containing a person’s face to extract an imprint from it. The imprint, or embedding, describes the face based on its unique characteristics, and thus it can be used for identification. When generated by deep learning techniques, the embedding is usually hard for a human to interpret, as it is typically a vector of real values. The length of this vector may vary depending on the technique used.

Identification (i.e. recognition) works by comparing multiple embedding vectors to each other by calculating the similarity between them (e.g. via the Euclidean or Manhattan distance). In the end, pairwise similarities of the embeddings indicate whether two faces should be considered to be of the same person. It is presumed that the lower the distance, the higher the similarity, and that the similarity of embeddings is proportional to the similarity of the faces. Usually, if the distance is below a certain threshold, the embeddings are considered to belong to the same person. In other words, identification is effectively done by clustering embeddings.
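The distance-and-threshold comparison described above can be sketched as follows. This is only an illustration: the three-dimensional vectors, the function names and the threshold of 0.6 are hypothetical choices, and real libraries produce much longer embeddings with their own recommended thresholds.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def same_person(a, b, threshold=0.6, distance=euclidean):
    """Treat two embeddings as the same person if their distance is below
    the threshold. The default of 0.6 is a hypothetical value."""
    return distance(a, b) < threshold


# Toy embeddings (hypothetical values, shortened to three dimensions).
probe = [0.34, -1.21, 0.98]
candidate_same = [0.30, -1.18, 1.01]
candidate_other = [-1.04, -1.31, 0.77]

match_same = same_person(probe, candidate_same)
match_other = same_person(probe, candidate_other)
```

The threshold depends on the chosen distance function: a value tuned for the Euclidean distance is generally not appropriate for the Manhattan distance, which is why both are passed together here.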

In our research, we are concerned with the possible privacy risks related to utilizing face recognition embeddings. This paper extends our previous work, “On the Privacy Risks of Large-Scale Processing of Face Imprints” [11].

In our previous work, we evaluated a re-identification attack scheme in which we simulated the attacker’s precision in predicting demographics from embeddings (without executing any machine learning tasks). In our current work, we examine these attacks in greater detail.

We provide a thorough comparison of three popular face recognition Python libraries: OpenCV, Dlib, and InsightFace. We compare these libraries from two different perspectives on people of both sexes, four different races and multiple age groups. First, we consider the face detection performance of these libraries. Then, we consider embedding inference, where we examine how accurately we can train a machine learning model to infer demographic data from the embeddings generated by the libraries.
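The idea of embedding inference can be illustrated with a deliberately simple sketch: a nearest-centroid classifier trained on labelled vectors and evaluated on held-out ones. Everything here is an assumption made for illustration: the data is synthetic, the labels `group_a` and `group_b` are placeholders for a demographic attribute, and the classifier is not one of the models evaluated in this paper.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(embeddings, labels):
    """Group embeddings by label and return one centroid per label."""
    groups = {}
    for vector, label in zip(embeddings, labels):
        groups.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in groups.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest to the vector."""
    return min(centroids, key=lambda label: squared_distance(centroids[label], vector))


def synthetic_embeddings(rng, center, count, dim=4, spread=0.4):
    """Hypothetical data: points scattered uniformly around a class center."""
    return [[center + rng.uniform(-spread, spread) for _ in range(dim)]
            for _ in range(count)]


rng = random.Random(0)
train_x = synthetic_embeddings(rng, 1.0, 50) + synthetic_embeddings(rng, -1.0, 50)
train_y = ["group_a"] * 50 + ["group_b"] * 50
test_x = synthetic_embeddings(rng, 1.0, 20) + synthetic_embeddings(rng, -1.0, 20)
test_y = ["group_a"] * 20 + ["group_b"] * 20

centroids = fit_centroids(train_x, train_y)
predictions = [predict(centroids, v) for v in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

The synthetic classes are separated by construction, so the accuracy obtained here says nothing about real embeddings; the sketch only shows the train-then-infer workflow whose accuracy is measured on each library's embeddings.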

We also build on results from our previous work in “De-anonymizing Facial Recognition Embeddings” [12], where we showed that re-identification attacks by inferring demographic data from face embeddings are a valid threat (see Figure 1), which justifies the relevance of our current research.

We consider the following setup: cameras observe some areas (for example at a company, or in a public space) and extract facial embeddings of people passing by. Either the cameras themselves are capable of doing the extraction, or they transfer their footage to a capable server device that would do so. Depending on the use case (tracking, authentication, identification, etc.), embeddings are either stored in a database to be used later on, or compared in real time to other embeddings that are already stored in the database.

[Figure 1 diagram: face embedding vectors (e.g. <0.34, -1.21, ..., 0.98>) are passed to trained ML models, which infer demographic data (e.g. <sex: male, age: 60-80, ...>).]

Figure 1: A possible privacy concern regarding face recognition is the inference of sensitive demographic data from face embeddings using specifically trained machine learning models.

The reason why the processing may be concerning is that embeddings are considered biometric data and, unlike other biometric data such as fingerprints, facial images can be easily captured without a person’s knowledge and consent, and also at a large scale [2]. Therefore, in this paper, we look at risks related to the processing of embeddings; more specifically, we analyze the privacy risk of demographic-based person re-identification using face imprints.

This paper is structured as follows. Section 2 summarizes relevant research related to this topic, including how face recognition works and what its privacy concerns are. Section 3 proposes a new data protection evaluation framework for face recognition. Section 4 demonstrates a theoretical attack and evaluates its results. Section 5 compares three popular face recognition libraries and introduces the dataset on which they were tested. Finally, Section 6 concludes the paper with a summary of its main takeaways.
