In conclusion, I have constructed a family of Bayesian models that mimic features of the mammalian visual system and derived suitable learning and inference algorithms for various tasks. I have demonstrated the viability of the models by applying them to real-world tasks under real-time constraints on standard computer hardware. While this work does not beat the best published object recognition rates on the benchmark datasets, it shows how this kind of model can be made to work efficiently by choosing the right learning and inference algorithms. Additionally, I have shown in proof-of-concept implementations how occlusion and unknown object position can be handled in a completely unsupervised manner. This became possible by choosing the right form of regularization and adapting the learning and inference algorithms to deal with the added complexity. In the approach for handling occlusion in particular, I have shown that a simple sparsity-based regularization suffices to let the model discover objects under occlusion. This result is encouraging for unsupervised learning of such models, as it indicates that more elaborate and computationally expensive approaches to occlusion modeling may not be necessary to learn an occlusion-invariant image model. It also shows that this kind of architecture is suitable for achieving image understanding to some degree, in that it enables the unsupervised discovery of objects and leads to a relatively low-complexity encoding suitable for image recognition tasks.
Starting with [Hinton & Salakhutdinov 06], deep belief networks have become very popular in statistical classification tasks. Because the objective function of ANNs has many local optima, the weight initialization of deep belief networks is critical. [Hinton & Salakhutdinov 06] provide an easy and efficient method to initialize the weights of the hidden layers using an unsupervised training strategy based on Restricted Boltzmann Machines. This deep network structure, in combination with the pre-training of the weights, has been successfully transferred from the image recognition task to automatic speech recognition [Mohamed & Yu+ 10, Mohamed & Sainath+ 11]. When these deep belief networks are trained on clustered triphone or context-dependent states, the hybrid approach achieves competitive or even lower recognition error rates than the corresponding Gaussian hidden Markov model based recognition system [Seide & Gang+ 11, Sainath & Kingsbury+ 11, Seide & Li+ 11, Tüske & Sundermeyer+ 12]. More details about the pre-training of the neural network weights are given in Chapter 9.
In this thesis we present and study approaches for the different types of alignments in the context of fine-grained image recognition, including face recognition. The respective scientific goals are formulated in Chapter 2. In Chapter 3 we review notations and concepts used throughout multiple chapters of this thesis. In Chapter 4 we cover pairwise dense alignments obtained with 2D-Warping. We first give an overview of the state of the art for 2D-Warping and then describe our contributions, which include a new warping algorithm and optimizations to increase efficiency. Chapter 5 focuses on features used for 2D-Warping. Specifically, we evaluate features learned with the help of a convolutional neural network (CNN). In Chapter 6 we introduce warped ROI pooling, an attempt to integrate 2D-Warping into the recognition process of a CNN. In the remaining part of the thesis we investigate how localization (coarse, canonical alignment) can be integrated efficiently into a CNN-based image recognition system. To this end, we build a strong baseline system in Chapter 7, including an embedding layer which aligns images to their respective class centers. We upgrade the baseline system with an efficient localization module in Section 7.3. In Chapter 8 we introduce a localization module that can be fully integrated into a CNN and trained in an end-to-end fashion.
The BERMUDA project started in January 2015 and was successfully completed after less than three years in August 2017. A technical set-up and image processing and analysis software were developed to record and evaluate multi-perspective videos. Using two cameras positioned relatively far from one another with tilted axes, synchronized videos were recorded in the laboratory and in real life. The evaluation comprised background elimination, body part classification, clustering, assignment to persons and, finally, reconstruction of the skeletons. Based on the skeletons, machine learning techniques were developed to recognize the poses of the persons and, subsequently, the actions they performed. It was, for example, possible to detect the action of a punch, which is relevant to security applications, with a precision of 51.3 % and a recall of 60.6 %.
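As a reminder of how the two reported figures are defined, the following minimal sketch computes precision and recall for a single action class from raw detection counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the BERMUDA evaluation.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for one action class from raw detection counts."""
    detected = true_positives + false_positives  # everything the detector flagged
    actual = true_positives + false_negatives    # every true occurrence of the action
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not project results).
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision={p:.3f}, recall={r:.3f}")
```

Precision answers how many of the flagged punches were real, while recall answers how many of the real punches were flagged, which is why the two values can differ noticeably for the same detector.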
In Chapter 3.4, I provide a thorough evaluation of the model’s current iteration. I first analyze general statistics such as the overall classification accuracy. The model achieves an accuracy of about 69%, noticeably lower than the validation accuracy of 78%. I attribute this mainly to random fluctuation, since no fine-tuning with regard to the validation set took place and the set I performed the analysis on was comparatively small at 56 images. The top-2 accuracy, defined as the fraction of images for which the true label was among the two predictions with the highest scores, turned out to be 100%. This suggests passing feature maps not only to the specialized classifier for the top score but to those for the top two scores, together with appropriate logic to determine which class is finally returned to the user. Furthermore, I checked how the confidence score differs depending on whether the image was classified correctly, which yielded a significant difference in the median values (0.867 for correct vs. 0.504 for incorrect classifications). Moreover, I calculated precision and recall for every class. The results can be found in Table B.1.
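The top-2 accuracy described above can be computed as in the following minimal sketch. The function name, the score layout (one list of per-class scores per image) and the toy data are assumptions made for illustration; they do not reproduce the evaluation set or the model used in this work.

```python
def top_k_accuracy(scores, labels, k=2):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for three images and three classes (illustrative values only).
scores = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.3, 0.6],
]
labels = [0, 0, 1]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```

With `k=1` the function reduces to ordinary accuracy, so the gap between the two values indicates how often the true class is the runner-up rather than the winner.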
Under equal conditions, we have achieved a maximum recognition improvement of 40% compared to our previous, scale-variant algorithm. Using temporal adaptation, our system is able to collect object data continuously and to increase its classification rate over time. A field experiment in a museum has shown that our prototype can, in principle, pass a practical acceptance test. Although static training achieved a higher recognition rate (92.6%) and is more widely applicable, since the NNs do not have to be configured and re-trained on the mobile phone (which would require additional time), dynamic training (recognition rate: 85.6%) is still an appropriate solution for selected tasks. If, for instance, objects have very diverse shapes and colors from different perspectives, dynamic training can compensate for this much better than static training, since every exhibit is separated into multiple virtual objects. Consequently, each view can be trained individually. Furthermore, the dynamic technique trains NNs at run-time. This offers the opportunity to apply the system in environments where many RF emitters change their position dynamically (e.g. in a bookstore where all books are tagged with RFID chips). As in the museum scenario, the mobile phone trains an NN online, based on the received RF signals. Static training would not be applicable in this case, since the number of different NNs that would have to be trained in advance and transferred to the mobile phone would be too large. Furthermore, the NNs would have to be adapted manually for every modification in the arrangement of the objects.
These pre-trained CNNs are also gaining considerable research interest as feature extractors for a task of interest, e.g. object or scene recognition [6, 29]. It is argued that CNNs, through their layered combination of convolutional and pooling layers, capture a robust mid-level representation of a given image, as opposed to low-level features such as edges and corners [6, 17]. It has been shown that deep representation features extracted from the activations of the top layers of AlexNet have sufficient representational power and generalisability for image recognition tasks. Indeed, state-of-the-art results for a range of vision-based classification tasks have been achieved with such deep representation features [5, 29].
Pattern recognition systems can act as substitutes when human experts are scarce in specialised areas such as medical diagnosis, or in dangerous situations such as fault detection and automatic error discovery in nuclear power plants. Automated pattern recognition can provide valuable support in process and quality control and can function continuously with consistent performance. Finally, automated perceptual tasks such as speech and image recognition enable the development of more natural and convenient human-computer interfaces. Important additional benefits for a wide field of highly complex applications lie in the use of intelligent techniques such as fuzzy logic and neural networks in pattern recognition methods. These techniques have permitted the development of methods and algorithms that can perform tasks normally associated with intelligent human behaviour. For instance, the primary advantage of fuzzy pattern recognition over classical methods is the ability of a system to classify patterns in a non-dichotomous way, as humans do, and to handle vague information. Methods of fuzzy pattern recognition are steadily gaining ground in practice. Some of the fields where intelligent pattern recognition has obtained the greatest level of endorsement and success in recent times are database marketing, risk management and credit-card fraud detection.
(ii) Objective criteria: as many critics have pointed out, critical social philosophy cannot rest upon subjective experiences alone; it also has to identify some sort of objective criteria to distinguish justified from unjustified forms of recognition or disrespect (Fraser 2003; Pilapil 2011; Zurn 2003). This is a difficult task and has to shoulder much of the normative work within the recognition approach. Here, we want to focus on those criteria that derive from the three forms of recognition as they are connected with the normative benchmark of the recognition approach, namely “undistorted self-realization” (Honneth 1996b). This means that the opportunities to engage in personal relationships, equal protection by civil and social rights, and the experience of social esteem and belonging are objective criteria for evaluating an individual life and social relations. On the one hand these criteria are context-sensitive (social esteem in one society can have a different meaning from that in another), but on the other hand they are universal, as they are oriented towards undistorted self-realization. Also, these criteria have value in themselves, which means that disrespect is morally wrong even if it does not ultimately distort or hinder self-realization (Honneth 2002).
IDM: In the image distortion model, the modeling of relative dependencies is dropped entirely and only a maximum absolute displacement is specified. Minimization in this zero-order model is therefore very simple. This work shows for the first time that determining the best mapping between two images for second-order models (which take dependencies in both image dimensions into account) belongs to the class of NP-hard problems [KU03]. The idea of the proof is based on reducing the 3SAT problem to the mapping problem and is illustrated in Figure 4. Starting from a formula in 3-CNF (an instance of the 3SAT problem), the dependency graph is drawn and two images are constructed from it that admit a mapping with distance 0 under the 2DW constraints if and only if the formula is satisfiable.
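To illustrate why minimization is simple in a zero-order model, the following sketch matches every pixel of one image independently to its best counterpart within a maximum absolute displacement. It is a simplified illustration under stated assumptions, not the implementation used in this work: the function name `idm_distance`, the parameter `max_shift` and the plain squared pixel difference as local cost are all chosen here for brevity.

```python
def idm_distance(image_a, image_b, max_shift=1):
    """Zero-order distortion distance between two equally sized 2D images.

    Each pixel of image_a is matched independently to the best pixel of
    image_b within +/- max_shift rows and columns; no dependencies between
    neighbouring displacements are modelled.
    """
    rows, cols = len(image_a), len(image_a[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            best = float("inf")
            for di in range(-max_shift, max_shift + 1):
                for dj in range(-max_shift, max_shift + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        # Assumed local cost: squared pixel difference.
                        cost = (image_a[i][j] - image_b[ii][jj]) ** 2
                        best = min(best, cost)
            total += best
    return total


# Two tiny images that differ only by a shifted bright pixel.
a = [[0, 1],
     [0, 0]]
b = [[0, 0],
     [1, 0]]

print(idm_distance(a, b, max_shift=0))  # no displacement allowed
print(idm_distance(a, b, max_shift=1))  # displacement of one pixel allowed
```

Because every pixel is optimized on its own, the cost is a sum of independent local minima and the run time grows only with the number of pixels times the size of the displacement window. Once dependencies in both image dimensions are enforced, this decomposition is no longer available, which is the setting of the NP-hardness result above.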
This dissertation uses empirical research to show that opinions, clichés and prejudices about psychotherapy influence the course and the outcome of a psychotherapy. To date, therapeutic practice has mainly taken into account therapeutic factors internal to therapy (motivation, readiness to change, intelligence, therapeutic relationship, …). The results of this study, however, point to the existence of a factor external to therapy that decisively influences the internal factors: the image of psychotherapy. To test the main hypothesis, “The acceptance of psychotherapy is a therapeutic factor for the course of therapy”, a preliminary version of the instrument ACP-a (Akzeptanz der Psychotherapie, acceptance of psychotherapy) was first constructed from surveys and expert interviews. With the final test version of the ACP-a developed from it, the ‘image of psychotherapy’ of 342 outpatients treated at the Norddeutsches Institut für Verhaltenstherapie in Bremen (NIVT) was determined and related to the course and outcome of therapy. The course of therapy was recorded with a therapist and patient session questionnaire constructed for this purpose (ACP-P, continuous process-related therapy success), which was administered at every therapy session. The therapy outcome was determined as a measurement of change using a battery of questionnaires (SCL-90, SASB, IIP, …) at two measurement points (t1 = after the initial interview; t2 ≈ after 20 therapy sessions). It is shown that patients with low acceptance of psychotherapy also have a more difficult and more laborious therapy process and a poorer therapeutic outcome than patients with high acceptance of psychotherapy.
The findings thus give reason to pay more attention to the “societal image of psychotherapy” and to shift opinions and prior knowledge about psychotherapy in the population in a more positive direction, in order to ensure more successful therapy.
The first three strategies make clear that, in urban development within metropolitan cores, the supra-regional image building of cities and regions (macro-location) is gaining in importance as a means of holding one's own in the intensifying European and global competition between locations. To attract investment, companies, workers, residents and visitors, cities and regions must market their qualities and offerings more strongly. Many large-scale urban development projects therefore aim at supra-regional visibility with a view to making the location more attractive, and this also applies to the districts examined in this research. As regards the inward effects of large-scale urban development projects on the city itself (their power to bind the local population), this research project provides exemplary results and findings based on the case studies, although not all of them are comparable with one another owing to the differing framework conditions. In this context in particular, attention should be drawn to the distinction between image and identity: “While image advertising tries to influence the awareness of target groups on the basis of pure advertising techniques, identity marketing starts from substance that actually exists. Identity marketing strengthens what is strong and discovers new strengths that are worth promoting.” 30 A central question of the study was therefore whether the new districts examined tie in with the actually existing substance of the city and develop this “brand core” further, or whether they attempt, through the “reinvention” of a space, a completely
or lose nearly all of its bite if it tries to cover the global scale and reduces itself to a very small set of globally accepted treaties. The most convincing solution I can think of, though not a fully convincing one, is to show that poverty violates those claims that arise from the core of recognition itself, namely that the experience of recognition forms the social conditions of any good life (Schweiger, 2012b). No matter what the internal standards within a society might be, absolute poverty limits the opportunities for self-realization so that they are negligibly small. In contrast, claims of recognition can refer to this absolute core in every society and under all circumstances (in a refugee camp in Africa, in custody pending deportation in Austria or in an automobile plant in the USA) and can demand that the intersubjective conditions and social relations change in order to make undistorted self-realization possible. The anthropological and universal roots of recognition transcend the borders of any given society.
As can be seen in Figure 5.4 (c), the fusion method significantly improves the boundary overlap. The price, however, is a high false rejection rate. There are many blank regions between the sub-clusters and the virtual clusters, and these blank regions are where false rejections can occur. Intuitively, the blank regions become much smaller as more sub-clusters are created for a given person. This corresponds to the stable state, in which a person has interacted with the face recognition system for long enough. However, when observations from a new person occur, the false rejection rate is so high that face shots of the same person are very likely to be clustered into different virtual clusters, i.e. enrolled as different persons. We call this condition the unstable state. Following the above discussion, the structure of a face database is redrawn in Figure 5.5. For simplicity, we use the more general terms “class” instead of “sub-cluster” and “group” instead of “cluster”. In the unstable state, the database is a hierarchy of three layers, which are, from bottom to top, images, classes, and groups.
The author shows that plan recognition can in some cases be viewed as parsing. He also gives an example of how to compile a plan hierarchy into a context-free grammar. This connection is convenient since we can use well-known parsing strategies for recognizing a plan. Vilain then argues that a “chart-based version of Earley's algorithm” can be used, an algorithm which we cannot adopt in a straightforward way. Our plan operators contain side effects, for instance for building the dialogue memory. We can therefore allow for
In summary, Zürich has the image one would expect of a typical modern region that serves as a centre and has great economic importance. Beyond that, however, Zürich offers several significant unique selling points that are very rarely found in this combination anywhere in the world. Nature is one example: elsewhere, artificial parks had to be laid out, whereas in Zürich the entire region is a park. Zürich is both urban and rural, even if its rural character is currently still underrated by the respondents. Zürich is also both an economic metropolis and a family-friendly place to live. Zürich is international and varied, yet at the same time is regarded as safe and reliable. All the contradictions found in many other places around the world are unknown here. The region offers a bit of everything, all at once: efficient service and cosiness, city and countryside, multiculturalism and security, dynamic workplaces and opportunities for recreation, modern and historic buildings, local colour and international connections. The region is extremely diverse and yet manageable in scale, and that is precisely what makes it especially attractive.
The camera obviously has to be installed facing forward, since the signs are readable from this direction only. To allow for cleaning and a better viewing angle, the camera is usually installed behind the windscreen, at the highest point that installation constraints permit and that lies within the wiped region. The sensor needs a reasonably high dynamic luminance range to handle bright and dark regions in the same image, for example the sun in the centre of the view and unilluminated signs at the sides. In addition, the signal-to-noise ratio has to be high to permit night-time use of the system. Long exposure times in low light are limited by the resulting motion blur. This factor currently limits the resolution of the sensor to about VGA (640x480 pixels) to XGA (1024x768 pixels), even at low apertures. Another factor is the minimum light exposure of the camera, since the sun reflecting off a traffic sign can saturate the sensor, making the sign indecipherable in the image. The field of view is dictated by the regions where signs may occur in the road architecture and by the relative position of the vehicle.
I would like to thank the I4U consortium of 62 researchers from 15 collaborating sites for tackling the 2016 NIST SRE in a joint effort, Kong Aik Lee for convening the consortium and hosting online meetings, and Anthony Larcher for his support. For involving me in a collaboration of 13 authors on the performance assessment of morphing attacks and for our discussions on the suitability of relative versus absolute vulnerability reporting, I would like to thank Ulrich. After these two collaborations, I was able to engage in another large project. Leading and collaborating in a joint interdisciplinary effort on data privacy in speech data was a great experience, for which I would like to express my gratitude to the 22 co-authors with backgrounds in speech and audio processing, paralinguistics, legal studies, cryptography, and image- and audio-based biometrics (a manuscript that would have been impossible to write without the contribution of each co-author). The verbose yet concise way legal experts draft papers (Catherine and Els discussing via tracked changes in Word documents) still impresses me. I’m glad about the way Jascha, Amos, and Abelino engaged interactively in discussions. Thanks to Bhiksha for “poking” me to pursue this collaboration and for involving Isabel. For helping my understanding of cryptography grow, I would like to thank Thomas, Amos, Michael, and Melek.
At EURECOM, Nick introduced me to the ASVspoof organization committee (Junichi, Tomi, Kong Aik, Héctor, Sahid, Xin, Ville, and Massimiliano), who trusted me with carrying out the detailed analyses on the physical access task of all submitted countermeasure systems (147 systems from 50 participating labs); thank you all for letting me jump in at the evaluation stage, for entrusting me with co-organizing the special session at the 2019 IEEE Automatic Speech Recognition and Understanding workshop, and for proposing me as one of the guest editors of the ASVspoof special issue of the Computer Speech and Language journal.
But I want to stick to the point: which experiences of non- or misrecognition constitute an injustice, and in what relation do they stand to other forms of injustice? Ingram mentions (Ingram 2018, 73) that not all persons deserve recognition; non- or misrecognition is therefore justified at least when persons refuse legitimate recognition to others, as in the case of racists. That is true. But three observations are central here: First, it is not only about recognition as an interpersonal relation, but also about the institutionalization (and materialization) of recognition or its negative counterparts. The behaviour of a racist is immoral, but for a social theory and critical theory of justice, the case of institutionalized racism is particularly relevant, for example in the case of exclusive social and educational policies or global development policies that reproduce colonial logics. In the institutional structure, however, it is more difficult to determine who deserves recognition or misrecognition, because people are always actors in different, overlapping social fields with their own antagonistic rules and forms of recognition. Social rights and market esteem do not go hand in hand.