
Department of Control Engineering and Information Technology

3D Shape Recognition Methods for Tangible User Interfaces

PhD Thesis

Márton Szemenyei

Supervisor Dr. Ferenc Vajda

September 9, 2020


DECLARATION

I, the undersigned Márton Szemenyei, hereby declare that I prepared this doctoral dissertation myself, without any unauthorized assistance, using only the sources listed. Every part that I have taken from another source, either verbatim or rephrased with the same meaning, is clearly marked with a reference to that source.

Budapest, September 9, 2020

Szemenyei Márton


Contents

1 Introduction
   1.1 Mixed Reality Systems
      1.1.1 Tangible User Interfaces
   1.2 Adaptive Mixed Reality
      1.2.1 Objectives
   1.3 Methodology

2 Shape Classification
   2.1 3D Shape Recognition
      2.1.1 Graph Embedding
      2.1.2 Kernel Methods
      2.1.3 Graph Neural Networks
   2.2 Construction of Shape Graphs
      2.2.1 Reconstruction and Segmentation
      2.2.2 Construction of the Vector-Graph
   2.3 Embedding Vector-Graph Nodes
      2.3.1 Node Embedding Framework
      2.3.2 Feature Transform Functions
   2.4 Random Walk Node Kernel
      2.4.1 Node and Edge Kernels
   2.5 Results
      2.5.1 Datasets
      2.5.2 Metrics
      2.5.3 Evaluation
   2.6 Discussion

3 Feature Compression
   3.1 Dimension Reduction
      3.1.1 Discriminant Analysis
      3.1.2 Subclass and Mixture Methods
   3.2 Structured Composite Discriminant Analysis
      3.2.1 Structured Composite Classes
      3.2.2 Intra-Object Scatter
      3.2.3 Rank Adjustment
      3.2.4 Selecting the Number of Dimensions
   3.3 Subclass Extension of SCDA
      3.3.1 Improved Clustering
      3.3.2 Combining the Methods
      3.3.3 Extension of Wilks' Lambda
   3.4 Results
      3.4.1 Datasets
      3.4.2 Metrics
      3.4.3 Evaluation
   3.5 Discussion

4 Arrangement in Scenes
   4.1 Global Optimization
      4.1.1 Biologically Inspired Optimization
   4.2 The Subgraph Assignment Problem
      4.2.1 Constructing the Cost Function
   4.3 Scene Optimization
      4.3.1 Class Score Optimal Initialization
      4.3.2 Random Drag and Shuffle Mutation
      4.3.3 Clustered N-Point Crossover
      4.3.4 Context Reward Optimization
   4.4 Results
      4.4.1 Datasets
      4.4.2 Metrics
      4.4.3 Evaluation
   4.5 Discussion

5 Conclusion
   5.1 Further Research

References
   Thesis Publications
   Unrelated Publications
   Other References

Appendix
   A.1 Bayesian Estimation Supersedes the T-test
   A.2 Examples from the image-based datasets
   A.3 Node-by-node classification results on all datasets
   A.4 Discriminant Analysis Results
   A.5 Localization Results


Abstract

User Interfaces (UIs) are essential components of digital systems, since they are primary factors determining the ease of use - and consequently the spread - of the technology they are part of. In recent decades, numerous new interface types have been introduced, most notably tangible user interfaces, which allow users to interact with virtual objects by manipulating real, physical objects placed in the scene for this very purpose. Still, the setup of these interfaces is rather cumbersome, since it usually requires creating and placing placeholder real objects.

In this thesis, an alternative solution is explored: we introduce a novel algorithm that is able to pair virtual objects with real, natural ones already present in the scene. This method allows automatic setup of the virtual environment without placing additional objects or performing manual pairing. To minimize user discomfort, the pairing is performed so that the paired virtual and real objects are similar in shape.

The proposed method describes the geometry of the scene and its objects using a graph of simple geometric forms (primitive shapes). This part-based approach is arguably similar to human interpretation, while it describes man-made objects frequently found in work/home environments relatively well. The first step then is to perform a simple classification for each object part, resulting in a preliminary assignment to a certain virtual object category. To aid this step, a novel graph node embedding method is developed.

Since the resulting feature vector of a graph node may be high-dimensional, a novel discriminant analysis algorithm is proposed to improve classification. This method is tailored to cases where objects consist of a set (or graph) of multiple vectors. Finally, the classification scores are used to perform a global optimization step, determining the final arrangement of the virtual objects of the scene. In this step, physical constraints (such as compactness of real objects) and application requirements (such as the required number of certain virtual object types) are both considered.

The experiments presented in this thesis demonstrate that all three novel methods are viable solutions to their respective problems, while it is also shown that the algorithms credibly improve upon the previous state of the art.


Összefoglaló (Summary)

User interfaces are indispensable parts of digital systems, as they determine how easy a system is to use - and consequently how widespread it becomes. In recent decades, numerous new interface types have appeared, among which tangible user interfaces stand out: in these, users manipulate virtual elements with the help of real objects placed in the scene for this purpose. Setting up such interfaces is rather cumbersome, however, since in most cases the real objects have to be created and placed in advance.

In this dissertation I propose an alternative solution, in which a novel algorithm pairs virtual objects with real objects already present in the scene. With its help, the setup can be performed automatically, without manual pairing or extra objects. To maximize user comfort, the pairing is based on similarity of shape.

The proposed method describes the geometry of the scene using a graph constructed from simple (primitive) shapes. This part-based model is similar to human thinking, while it also approximates well the shape of man-made objects commonly found in work and home environments. In the first step of the method, a simple classification is performed for every part, which preliminarily assigns the part to a virtual category. This step is aided by a novel graph node embedding method.

Since the resulting node descriptor vector may be high-dimensional, I propose a novel discriminant analysis method to aid classification. This method was developed for cases where the individual objects consist of a set or graph of vectors. Finally, based on the classification results, the arrangement of the virtual objects is obtained through a global optimization step. In this step, both physical constraints (e.g. the compactness of objects) and the requirements of the application (e.g. the required number of virtual objects) can be taken into account.

The experiments presented in the dissertation show that all three new methods prove to be good solutions to their respective subproblems, and also that the proposed algorithms significantly improve upon the results of previous state-of-the-art methods.


1 Introduction

User interfaces (UI) are our primary way of interaction with computers and other machines, therefore their efficiency and usability are primary concerns in the field of Information and Communication Technology (ICT). In fact, one may observe that two of the three IT revolutions are linked to the development and spread of a novel interaction technology: The application of the mouse and Graphical User Interfaces (GUI) arguably played a major role in the spread of personal computers in the 1980s, while high-quality touchscreens were a major component in the popularity of smartphones, tablets and other similar devices.

At the time of this writing, the mouse and keyboard are the primary interaction devices used for personal computers, while smart handheld devices rely primarily on touchscreens. Notably, natural language-based interfaces have become increasingly present in this decade, yet - despite the rapid advancement of Deep Learning-based Natural Language Processing (NLP) methods - these interfaces have not yet managed to replace more traditional ones.

Arguably, one of the main reasons for the popularity of both mouse and touch-based interfaces is that both rely on interface metaphors that come naturally to humans, building on an instinctive understanding of space-time and object manipulation. The result of relying on natural metaphors is an interface that is easy to learn and use and that - as long as it functions robustly - provides a powerful tool for users.

Notably, both of these interaction techniques are inherently two dimensional, which makes the manipulation of higher-dimensional objects or environments cumbersome.

This is a significant shortcoming, since in the real world, users frequently have to work with three or four dimensional structures. This problem makes the research and development of novel, high-dimensional interfaces a worthwhile endeavor.

Moreover, new user input methods may also be paired with novel display technologies, improving the efficiency of the interface even further. 3D interaction techniques would naturally require 3D display devices, which may provide considerably increased immersion - an important factor in applications such as simulation-based training and gaming.


1.1 Mixed Reality Systems

Mixed Reality (MR) is a collective term for human-machine interaction systems that aim to replace a portion of the user's perceived reality with synthetically generated content. While these systems enable the use of more advanced interaction techniques, some systems (most prominently video games and some smartphone applications) may still use traditional input and display methods [24].

Generally, these systems can be divided into three subfields: Virtual, Augmented and Mixed Reality. Virtual Reality (VR) [25] is differentiated from general MR systems in that it replaces all of the user’s perceived reality with virtual content.

Depending on the display, input and content generation technologies used, these systems may achieve high levels of immersion.

High levels of immersion are not completely harmless, however. Due to several reasons, the user's senses might provide conflicting information to the brain. A good example of this is when the visual content moves relative to the user's head due to inaccuracies in head tracking [25]. Conflicting sensory information may result in symptoms similar to seasickness - called virtual sickness. The severity of the symptoms can be reduced by making the virtual experience as consistent with the user's expectations of reality as possible.

Augmented Reality (AR) [24], on the other hand, does not replace the real world; instead, these systems merely display additional content as an overlay. Such systems are in wide use in smartphone applications for instance, where they allow the user to access information on real-world objects by pointing the camera at them. Lastly, Mixed Reality (MR) allows for the coexistence and possibly the interaction of real and virtual objects. Note that in this work MR is also used to refer to all three kinds of systems collectively, since VR and AR can be understood as special cases of MR.

The field of MR development sits at the intersection of a number of other fields of engineering, most notably sensor and display technologies, signal processing, artificial intelligence and computer graphics. While an MR (and especially VR) system can be created by relying purely on traditional methods in the aforementioned fields, several important advances have been made in each of them in the last two decades, allowing the creation of superior interface metaphors and user experience.

VR systems often rely on monitors or projectors for displaying virtual content, while AR systems frequently use the touchscreen of smartphones or tablets [26]. These systems, however, usually provide minimal immersion. One of the most immersive


display devices are VR glasses, which use LCD or LED displays viewed through lenses to present content to the user. MR and AR systems may use similar devices - AR glasses - which differ in that their lenses are see-through when switched off, so that they do not block the wearer's view of the real world.

MR systems also may use advanced input technologies, allowing the designers to replace the keyboard and the mouse. These input technologies might use a dedicated input device, which allows the users to grab something tangible to interact with the virtual content. Virtual pens or other pointer devices [27] are commonly used, while some of the more advanced systems, such as the HTC Vive (Fig. 1.1) use complex controller consoles.

Figure 1.1: The HTC Vive VR glasses and controllers (Source: vive.com).

Alternatively, the MR system may rely on natural input methods to allow user control, such as hand gesture recognition or full body tracking [28]. These input methods may be marker-based or markerless, yet in both cases they might limit user mobility somewhat. Alternatively, the system might also recognize voice commands.

In reality, most contemporary MR systems use a combination of the above input methods, allowing high quality interaction.

1.1.1 Tangible User Interfaces

Tangible User Interfaces (TUI) [27] have been an area of intense research in the last decade. The goal of these systems is to allow users to interact with virtual objects by manipulating real objects, which results in a natural interface. Dynamic Shader Lamps [29] and Dynamic Object Texturing [30] are two noteworthy examples, both applying virtual textures on real objects.


One particular subfield of TUI is Tangible Augmented Reality (TAR) [31], which aims to combine Tangible User Interfaces with augmented reality by introducing real-world objects into the user interface design [27]. Some systems, like Tiles [31], Tangible Bits [27], or MagicCup [32] use 2D objects with markers to augment virtual objects on them or perform operations with them (Fig. 1.2).

Figure 1.2: The Tiles (top) [31] and MagicCup (bottom) [32] systems.

The main advantage of TAR systems is that users can interact with virtual objects as if they were real. This allows for intuitive man-machine interaction, which makes these systems easy to learn and use. Most TAR systems use real objects with artificial markers [32, 33], which requires the user to prepare and arrange these objects. This is a clear disadvantage, reducing the usability of these systems.

Nonetheless, there are a few TAR systems that use placeholder objects with natural features. The Virtual Round Table [34] system is an excellent example of this: the VRT is able to track any real object located on a table using natural features. The users may add new virtual objects to the scene, selecting which real object should serve as a placeholder, and then manipulate the virtual object by interacting with the real one.

The VRT system introduces a user interface metaphor that is completely natural to humans, and therefore very easy to learn. However, even this system relies on


the user to determine the pairing of the objects, making the setup of the scene time consuming.

1.2 Adaptive Mixed Reality

Notably, none of the systems presented above concern themselves with the similarity of shape between real and virtual objects. However, objects of various forms require different manipulation techniques, and provide different sensations when touched. The inconsistency of visual and haptic senses might cause neural conflict, which greatly diminishes user experience. Nonetheless, by matching real and virtual objects similar in shape, this conflict can be mitigated. Moreover, a shape-based matching process enables us to create Adaptive Mixed Reality systems (AMR):

Environments that can adapt to the specific structure of the real scene they are projected into by arranging virtual objects intelligently, so that they would fit into the real scene.

The subject of this thesis is to propose a novel solution for determining a logical pairing of virtual and real objects. The main criterion of the pairing method is that the real and virtual objects must have similar physical properties, so that the users could effortlessly manipulate the real object as they would the virtual one. However, many important physical properties, such as mass or surface roughness, cannot be measured visually; therefore the proposed algorithm uses shape and size only.

Notably, AMR systems may have other requirements: For instance the presence of certain virtual objects might be required, in which case even poor matches have to be accepted. Moreover, some environments may benefit from having multiple instances of certain virtual objects placed in the scene, while some others may not.

Finally, there might be virtual objects that benefit from the proximity of another, in which case it is recommended to reward the algorithm for placing them close to each other.

This way AMR systems are able to reduce the setup and preparation work required by the user: The developer of the AMR environment is responsible for creating the virtual object categories, training the system to recognize suitable placeholder shapes, and setting properties and requirements for these virtual object categories.

Using these, the system prompts the user to do a walk-around of the scene to perform 3D reconstruction, and then it proceeds to determine object placements automatically.


1.2.1 Objectives

Arguably, the object pairing method needs to be based on shape matching or shape similarity. While there are numerous shape matching methods, the aim of this work is to propose an algorithm that requires no prior information on the virtual objects, but can learn from instances of labeled real objects. Also, in order to make no assumptions on whether large-scale or small-scale shape attributes are important for pairing, the proposed methods attempt to encode all information on the shape of an object, and thus allow the learning algorithm to make that decision. Since the shape of a scene or certain objects is best represented as a structured object, the main focus of this thesis is to develop a learning algorithm for structured objects used to describe complex shapes.

However, shape recognition alone cannot solve this problem, since the additional requirements posed by the designer of the environment result in co-dependent assignments. To resolve this issue, some form of global optimization scheme has to be introduced that optimizes all assignments simultaneously, arriving at an optimal "compromise" solution.

Based on the above, the problem this thesis aims to solve is as follows:

Problem statement

Given a partial 3D reconstruction of a real scene (the work environment) and some number of object categories (virtual objects), perform the following:

1. Segment the scene so that individual real objects (or their parts) would be separated.

2. Find a labeling for the objects/parts (from this point on this will be referred to as the arrangement of virtual objects in the scene) that satisfies the following criteria as closely as possible:

(a) The shapes of the real and virtual objects are similar

(b) The arrangement satisfies the criteria regarding the number and arrangement of the virtual objects posed by the developer

(c) The algorithm allows for flexible handling of the above criteria

The first step of the algorithm proposed in this work is segmentation and shape description. Object detection algorithms capable of detecting multiple instances


usually employ a segmentation procedure, in order to produce object candidates for a subsequent classification method. This is a viable method of shape recognition, especially when applied to scenes where segmentation is relatively straightforward.

In urban scenes, for instance, one can easily remove the ground, resulting in most of the objects becoming disjoint in the point cloud [35].

Nonetheless, segmentation becomes significantly more difficult in indoor scenes, since objects are much more likely to be cluttered in this context. For this reason, a different approach is used: the input 3D scene is first segmented into primitive shapes, which may be interpreted as the "building blocks" of the scene. Then, primitive shapes are classified individually, and objects are determined based on the segment labels. Since primitive shapes have several features (depending on the primitive type), it would be straightforward to use these features to classify each primitive.

While the simplicity of this method is alluring, it ignores the geometric relations between the primitives and the local context of each primitive. This could lead to high classification errors, especially if two classes contain very similar shapes, albeit in different contexts. In order to avoid this, a graph is constructed from the primitive shapes, and a graph node embedding procedure is applied to produce a feature vector for each node that encodes the local context of the primitive as well.

The initial classification is performed using these descriptors.

The next step of the algorithm is a discriminant analysis technique that reduces the number of features used in the embedding, lowering both the amount of computation and the complexity of the learning algorithm used for classification. The aim of this discriminant analysis technique is not only to preserve the features that are useful for separating nodes belonging to different classes, but also to keep the nodes within a single instance distinguishable from one another, since this might be useful for determining the pose of the real object.

The final step is to use the initial individual classification to find a globally optimal arrangement for the entire scene that is able to take all additional criteria into account. For this step a Genetic Algorithm (GA) is used, with novel, problem-specific genetic operators. This step also introduces a way to optimize the contextual requirements of virtual objects, allowing designers to set the preferred proximity of certain categories.

Our aim in this thesis is to show that the proposed solutions for these three steps are viable for a shape recognition and arrangement task, while they also provide improvement over previously existing methods.


1.3 Methodology

The methods proposed in this thesis are largely based on machine learning and heuristics, attempting to satisfy criteria that cannot be precisely defined, such as similarity of shape. Moreover, they are applied to a problem that is not guaranteed to have a feasible solution (for instance, if there are fewer real objects than required virtual ones). Consequently, the correct working or feasibility of the proposed algorithm for all possible inputs cannot be verified through traditional means.

For the above reason, we choose to demonstrate the performance of the proposed methods empirically by evaluating them on numerous different datasets. This, however, presents another problem: since the problem stated earlier is not a particularly common one (the proposed application is, to our best knowledge, novel), there are no publicly available datasets to test our methods with.

To resolve this, we created our own datasets containing real object types commonly found on office desks: one contains 3D reconstructions from synthetic images created in Blender, while the other uses real images instead. Nonetheless, we argue that two datasets are not enough to evaluate our methods on, therefore we created numerous other datasets synthetically. These synthetic datasets are generated automatically, each dataset using a different, random set of hyperparameters. We argue that having a higher number of structurally different datasets is a better choice for demonstrating the universality and robustness of our method than having a single dataset with more training examples. The hyperparameters used to generate the synthetic datasets are the following (a minimal generator sketch is given after the list):

• The number of classes

• Whether different classes should share the same pool of node prototypes or have their own individual nodes

• The range for number of nodes per object (for each class separately)

• The average shape descriptor for each node (for each class and node separately)

• The average geometric transformation between nodes (for each class and node pair separately)

• The probability and number of noise nodes (for each class separately)

• The probability of missing nodes (for each class and node separately)


• The average geometric transformation between objects (for each class pair separately - localization datasets only)
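
To make the generation procedure concrete, the following is a minimal, illustrative sketch of such a generator. The helper names, feature dimensions and parameter ranges are assumptions made for this example only; the actual generator used in this work is part of the published implementation [37].

# Minimal sketch of a synthetic vector-graph dataset generator (hypothetical
# names and simplified features; the real generator is published in [37]).
import numpy as np

rng = np.random.default_rng(0)

def make_class(n_nodes_range, feat_dim=24, trans_dim=2):
    """Draw the random hyperparameters that define one object class."""
    n_min, n_max = n_nodes_range
    n_proto = rng.integers(n_min, n_max + 1)
    return {
        "node_means": rng.uniform(0.0, 1.0, (n_proto, feat_dim)),           # average shape descriptors
        "edge_means": rng.uniform(0.0, 1.0, (n_proto, n_proto, trans_dim)), # average transforms
        "p_missing": rng.uniform(0.0, 0.2, n_proto),                        # per-node dropout probability
        "p_noise": rng.uniform(0.0, 0.3),                                   # probability of extra noise nodes
    }

def sample_object(cls, noise_std=0.05):
    """Sample one object: the kept node vectors and the pairwise edge features."""
    keep = rng.random(len(cls["node_means"])) > cls["p_missing"]
    nodes = cls["node_means"][keep] + rng.normal(0, noise_std, cls["node_means"][keep].shape)
    edges = cls["edge_means"][np.ix_(keep, keep)] + rng.normal(0, noise_std, (keep.sum(), keep.sum(), 2))
    if rng.random() < cls["p_noise"]:                                        # optional clutter node
        nodes = np.vstack([nodes, rng.uniform(0.0, 1.0, nodes.shape[1])])
        edges = np.pad(edges, ((0, 1), (0, 1), (0, 0)), mode="edge")
    return nodes, edges

classes = [make_class((3, 6)) for _ in range(5)]                             # 5 classes with 3-6 nodes each
dataset = [(ci, *sample_object(c)) for ci, c in enumerate(classes) for _ in range(100)]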

Once the methods have been evaluated on all datasets, we perform a statistical test to infer the difference between the performance of the proposed and the previous state-of-the-art methods. Since the methods are evaluated on the same datasets, the paired-samples t-test is used to evaluate the difference of the compared methods’

efficiency. Having a higher number of different datasets also makes the results of these tests more robust.

It is essential to point out that the standard, frequentist versions of statistical tests (such as Student's t-test) infer the probability of the data under the null hypothesis. It is a grave (and all too common) error to infer the probability of the null (or any other) hypothesis based on these tests. Unfortunately, the probability of certain hypotheses is precisely what we aim to establish in this work.

For the above reasons, we employ the Bayesian version of the paired-samples t-test proposed by Kruschke [36], which allows us to infer the probability of the effect size (the improvement caused by our proposed methods) being positive, as well as the 95% Credible Interval (CI). The cost of using such a test is having to explicitly choose a prior distribution of the effect size. In this work we use a wide normal distribution, since it is zero-centered and symmetric, meaning that the tests are unbiased.

The normal distribution is bell-shaped (more dense around zero), which is justified by the paired samples: The performance of any two compared methods also depends on the difficulty of the dataset, meaning there is arguably a positive correlation between the accuracy measures, resulting in higher probabilities of near-zero difference. Notably, this distribution was also used in the original publication [36].

The Bayesian t-test is expanded upon further in section A.1.

Since there are considerably more randomly generated synthetic datasets than image-based ones, the result of the Bayesian t-tests will be dominated by the algorithm's performance on synthetic data. To avoid this distortion effect, we also report the methods' performance on the image-based datasets separately, and use both results to justify our claims. For reproducibility, the implementation of these methods and tests, as well as the datasets used for evaluation, are available online at Szemenyei [37].
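
The following is a deliberately simplified sketch of such a Bayesian paired comparison. It is not Kruschke's full BEST model [36]: the observation noise is fixed to the sample estimate and only the mean effect size is inferred, on a grid, under a wide zero-centered normal prior. The accuracy values in the usage line are hypothetical.

# Simplified Bayesian paired comparison (a sketch, not the full BEST model of
# [36]: sigma is fixed to the sample estimate and only the mean effect size is
# inferred on a grid under a wide zero-centered normal prior).
import numpy as np
from scipy.stats import norm

def paired_bayes(acc_new, acc_old, prior_scale=1.0):
    d = np.asarray(acc_new) - np.asarray(acc_old)        # paired differences
    sigma = d.std(ddof=1)                                # fixed noise estimate
    grid = np.linspace(d.mean() - 5 * sigma, d.mean() + 5 * sigma, 2001)
    log_post = norm.logpdf(grid, 0.0, prior_scale)       # wide, zero-centered prior
    log_post += norm.logpdf(d[:, None], grid, sigma).sum(axis=0)  # likelihood of the differences
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    cdf = post.cumsum()
    ci = (grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)])
    p_positive = post[grid > 0].sum()                    # P(effect size > 0 | data)
    return p_positive, ci

# Hypothetical accuracies of a proposed vs. a baseline method on the same datasets.
p_pos, ci = paired_bayes([0.91, 0.85, 0.88, 0.93], [0.87, 0.84, 0.86, 0.90])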


2 Shape Classification

As mentioned above, the first part of the object pairing method is to perform simple shape recognition on the parts of the 3D scene independently to provide an initial pairing, which will be refined by later steps. In the current approach, this is formulated as a classification problem: the whole scene is broken down into objects, and objects into a graph of parts. These individual parts are then classified according to which virtual object category they might belong to. To increase the classification accuracy, a novel graph node embedding framework is proposed. The scores from the classification described in this chapter will be used by the subsequent global optimization step as initialization.

This chapter provides a brief overview of 3D shape recognition methods and machine learning on structured data, especially graphs. Subsequently, the method for describing and segmenting 3D structures is presented. The main part of the chapter details two novel methods for performing learning on structured object parts, while also allowing their neighborhood to influence the learning and inference. In the last part of the chapter the methods are evaluated against traditional machine learning solutions, and the Bayesian t-test is used to demonstrate their efficiency.

2.1 3D Shape Recognition

Two dimensional shape recognition is a relatively common task in computer vision, for which a great variety of methods exist. Most of these, however, cannot be generalized easily to 3D shapes [38]. Still, there are a few methods, such as the Generalized Hough Transform [39] or the RANSAC algorithm [40], that work reliably for the 3D case as well. Nonetheless, these algorithms require a reference model for matching, which is highly problematic in many cases, mainly due to high intra-class variation.

In order to perform shape recognition without a reference model, learning algorithms are often used. There are numerous approaches for this, such as using local features [42, 41] as inputs for a learning algorithm. Another approach is using shape distributions [38], or global features [35] to describe objects. However, these algorithms make use of segmented objects, meaning they require a segmentation step


in order to find object candidates to classify in larger scenes. In 3D scenes with helpful prior information (such as urban scenes, where the ground is easy to segment [35]) this may be relatively straightforward to do. In complex, cluttered scenes (such as indoor scenes), however, segmentation might become unreliable, resulting in inferior detection performance.

In recent years, deep convolutional neural networks (CNN) have become increasingly popular amongst researchers working on object recognition and detection due to their superior performance [44, 43]. Unsurprisingly, there is significant work on 3D object detection using either RGB-D [45, 46] or volumetric [43, 47] data. Since CNNs are able to perform classification for every (super)pixel or voxel, multi-object detection in larger scenes using CNNs is relatively straightforward [45, 48]. Nonetheless, CNNs are fraught with numerical difficulties [49, 50]. Moreover, training CNNs requires large amounts of training data and computational resources, since deep networks have millions of free parameters, making them especially prone to overfitting. Lastly, convolutional networks have difficulty learning the correct spatial configuration of features [51], thus an approach that models object parts and their relations directly is recommended.

Schnabel et al. [52] present a different approach that may alleviate the difficulty of segmentation. They proposed a variant of the RANSAC algorithm to segment a scene into primitive shapes (such as planes, spheres, cylinders, etc.), which they treat as the "building blocks" of the objects and the scene. Their algorithm uses local sampling to increase the probability of finding good candidates. Their solution also uses an octree grid for fast inlier counting, resulting in relatively low run time even on large point clouds. They describe the 3D shape by constructing a topology graph of the scene, where the nodes of the graph are the primitives, and the edges represent the geometric relations between the nodes. The adjacency between the shapes is determined by their distance [53].

To detect objects in a larger scene, they construct a reference graph for each category and apply brute-force graph matching. Since a single reference graph has a relatively low number of nodes, the matching algorithm remains feasible [53]. To further decrease the number of possible matches between the scene and the reference graphs, they introduce three types of constraints. Node constraints ensure that only nodes of the same primitive type are matched, while edge constraints enforce the similarity of the relations between adjacent nodes. A third type of constraint - graph constraints - can be used to take the relationship of non-adjacent nodes into account as well.

Their method, however, still uses reference objects for each category, which might


be difficult to obtain for more complex or varied categories. Furthermore, they use a brute-force matching algorithm, which cannot easily handle segmentation errors.

This work aims to overcome these limitations by employing a learning algorithm to classify individual nodes of the constructed shape graph. To achieve this, however, learning has to be performed on structured (graphical) data.

2.1.1 Graph Embedding

Machine learning with graphical data has various applications, including bioinformatics [54] and network analysis [55]. It is also quite common to recognize objects visually using graph-based learning, since objects can usually be described using a graph of (visual) features [56]. The difficulty of graph classification is that most standard learning algorithms require a vector (or tensor) of features as their input. Since these methods cannot take graphs as inputs, the graph has to be mapped to a vector space while preserving all important properties - this process is called graph embedding. This is not a simple task, however, since the ordering of graph nodes is arbitrary, and any simple method of vectorizing a graph would yield a vector that is not invariant to the ordering of nodes [57]. A related difficulty is that graphs of different sizes yield vectors of different dimensions, while standard learning algorithms assume that all data is in the same vector space.

The most straightforward way for solving this problem is to construct a vectorial descriptor directly, so that the generated vector would be invariant to the ordering of the nodes. Perhaps the most widely used method for explicit vectorial embedding is the spectral representation [58]. The simplest version of spectral embedding is to compute the spectral decomposition (Eq. 2.1) of the adjacency matrix of the graph:

A = V \Lambda V^T,    (2.1)

where A is the adjacency matrix, and V and \Lambda are the matrices containing the eigenvectors and eigenvalues respectively. If the graph has weights on the nodes, these can be inserted into the diagonal of the adjacency matrix [58]. If the eigenvalues and the corresponding eigenvectors are ordered, then this representation will be partially invariant to node ordering. Since this invariance is only partial, an alignment step is still needed [59]. To enable the algorithm to handle graphs of different sizes, smaller graphs are enlarged by adding dummy nodes [60].

Aside from the adjacency matrix, other matrices may be used for the spectral decomposition. One such case is using the Laplacian matrix of the graph, which is computed using the adjacency (A) and the degree (D) matrices of the graph as follows:


L = D - A    (2.2)

Another method for embedding graphs is the heat kernel, which uses the spectral decomposition of the Laplacian as per the following equation:

H = V e^{-t\Lambda} V^T    (2.3)

The t parameter of the heat kernel controls the trade-off between local and global representation of the graph. According to Zhu and Wilson [61], the heat kernel outperforms the other two. It is also possible to mix different spectral representations [60] in order to create a more robust method.
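
The following toy sketch illustrates the three spectral representations above (Eqs. 2.1-2.3) on a small example graph using plain numpy; the example graph, the value of t and the descriptor length are arbitrary choices made for the illustration.

# Toy illustration of the spectral graph embeddings of Eqs. 2.1-2.3
# (adjacency/Laplacian spectrum and heat kernel) using numpy only.
import numpy as np

A = np.array([[0., 1., 1.],      # adjacency matrix of a small example graph
              [1., 0., 0.],
              [1., 0., 0.]])
D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # Laplacian (Eq. 2.2)

lam, V = np.linalg.eigh(L)       # spectral decomposition (Eq. 2.1 applied to L)
t = 0.5                          # heat-kernel time parameter
H = V @ np.diag(np.exp(-t * lam)) @ V.T   # heat kernel (Eq. 2.3)

# An ordering-invariant descriptor: sorted eigenvalues, padded with zeros
# ("dummy nodes") so that graphs of different sizes yield equal-length vectors.
max_nodes = 5
descriptor = np.zeros(max_nodes)
descriptor[: len(lam)] = np.sort(lam)[::-1]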

In this work, however, we intend to classify graph nodes individually, which means that instead of computing the vectorial representation of entire graphs, the nodes have to be embedded. In contrast with embedding graphs, there has been little work done on the topic of embedding nodes. For instance, Demirci, Osmanlioglu, Shokoufandeh, and Dickinson [56] used a low-distortion node embedding framework to perform many-to-many feature matching using the earth mover's distance.

Riba, Lladós, Fornés, and Dutta [62] used binary embedding to produce hash keys for fast graph retrieval.

These methods, however, place limitations on the structure of the graphs or the weights of nodes and edges. One [56] assumes the graphs are trees, while the other [62] assumes labeled graphs. Also, Riba, Lladós, Fornés, and Dutta [62]

produce hash keys, not feature vectors, which makes it hard to base a learning algorithm on their work. Since the proposed shape description method yields full graphs with vectorial weights on both the nodes and edges, the previous methods are insufficient for this application. To our best knowledge, no work has been done yet on embedding the nodes of vectorially weighted graphs with no restrictions on the topology.

2.1.2 Kernel Methods

A special class of learning algorithms, called kernel methods, presents another, particularly elegant solution to the problem of learning on graphs. These methods employ kernel functions, which allow the algorithm to use a high-dimensional representation of the data implicitly. Kernel functions are symmetric, positive semi-definite functions that can usually be interpreted as a similarity measure between objects [63].


When using kernel learning methods (such as Support Vector Machines), it is possible to simply define a kernel function between graphs. Since a kernel function does not require a vectorial input, nor does it explicitly produce a vectorial representation, the entire problem can be circumvented.

By and large, there are two kinds of graphical kernel functions: graph kernels and kernels on graphs. The former expression is used to describe kernel functions that take two graphs as inputs and compute a similarity measure between them, while the latter indicates functions that compare graph nodes instead. Smola and Kondor [64] introduced a family of kernels based on generalizing the ideas of regularization and Green's functions to graphs. A great example of a graph kernel is the method proposed by Kondor and Borgwardt [65], which is based on the skew spectrum of graphs.

Gartner, Flach, and Wrobel [66] showed that computing a strictly positive definite graph kernel is an NP-hard problem. They presented an alternative approach, based on walks performed on graphs, which can be computed in polynomial time.

Most of these methods, however, involve restrictions on certain properties of the graphs: Smola and Kondor [64] study undirected graphs only, Kondor and Borgwardt [65] assume real weights on the edges, while Gartner, Flach, and Wrobel [66]

discuss labeled graphs.

Perhaps the most widely-known graph kernel is the random walk kernel [63], which interprets the edge weights of the graph as the probabilities of taking that edge during a random walk. It performs simultaneous random walks on the two graphs, and derives a similarity score based on the probability of performing the same walk (same meaning a matching sequence of nodes). The logic behind the random walk kernel is that if two graphs are similar, then performing the same walk on the two graphs is likely. This probability is computed using the direct product of the two graphs, since performing the same walk on the two graphs is equivalent to performing a walk on the direct product. The direct product of two graphs is defined as follows:

A = A_1 \otimes A_2, \qquad N = N_1 \otimes N_2    (2.4)

where A is the adjacency matrix, N is the vector of scalar node weights, and \otimes is the Kronecker product. Intuitively, a single node in the direct product corresponds to a specific pair of nodes in the original graphs, while taking an edge to a new node in the direct product is equivalent to taking two edges on the original graphs simultaneously to a new pair of nodes. Using this, the random walk kernel function is computed according to the following equation:


K(X, Y) = \sum_{i=1}^{\infty} w(i) \, e^T A^i s    (2.5)

w(i) = e^{-\vartheta i},    (2.6)

where A is the adjacency matrix of the direct product, s and e contain the probabilities of starting and ending the walk on a given node respectively, and \vartheta is a hyperparameter controlling the slope of the weight w. The starting and ending probabilities may be set uniformly, or according to prior information (for instance using the node weights of the direct product). In practice, the sum is obviously not computed to infinity, but up to a hyperparameter N, which is the maximum length of the walks considered. Vishwanathan, Schraudolph, Kondor, and Borgwardt [63] show that as long as w is chosen to ensure the convergence of the sum, the above formulation indeed yields a valid kernel.
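
As an illustration, the following sketch computes the truncated random walk kernel of Eqs. 2.4-2.6 for two small graphs with uniform starting and ending probabilities; the example graphs and hyperparameter values are arbitrary choices made for the example.

# Sketch of the truncated random walk graph kernel (Eqs. 2.4-2.6) for small
# graphs given by their adjacency matrices; start/end probabilities are uniform.
import numpy as np

def random_walk_kernel(A1, A2, max_len=10, theta=0.5):
    Ax = np.kron(A1, A2)                         # adjacency of the direct product (Eq. 2.4)
    n = Ax.shape[0]
    s = np.full(n, 1.0 / n)                      # uniform starting probabilities
    e = np.full(n, 1.0 / n)                      # uniform ending probabilities
    k, Ai = 0.0, np.eye(n)
    for i in range(1, max_len + 1):              # truncated sum of Eq. 2.5
        Ai = Ai @ Ax                             # A^i of the product graph
        k += np.exp(-theta * i) * (e @ Ai @ s)   # weight w(i) = exp(-theta * i) (Eq. 2.6)
    return k

A1 = np.array([[0., 1.], [1., 0.]])
A2 = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
print(random_walk_kernel(A1, A2))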

2.1.3 Graph Neural Networks

It is worth noting, however, that the methods of deep learning have also been applied to graph-based problems, although a fair number of these methods were published after the method proposed in this chapter. Still, as some methods precede this work, it is important to introduce them briefly.

Graph Neural Networks (GNNs) are one of the first examples of graph node classification algorithms that aim to encode a node and its neighborhood in a graph by assigning it a feature vector. Early methods [67] employ the Banach fixed-point theorem to establish an iterative update rule to compute this feature embedding.

The architecture itself is a simple feed-forward neural network that can be trained in a supervised manner. This method, however, has a few important limitations [68], namely that edge information cannot be included in the model.

There are also methods that are more akin to kernel methods, such as DeepWalk [69], which is an unsupervised embedding method that learns representations for graph nodes. The method performs random walks starting from the node to be embedded and uses the skip-gram word embedding method to train the network. Its main limitation is the lack of generalization: when a new node is added to the graph, the model has to be re-trained.

More recent methods use recurrent units or graph convolutions, however, these require fixed-size neighborhoods [70]. One of the more recent methods can circumvent this by relying on the multi-headed attention or self-attention layers proposed for


NLP problems in Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin [71]. The main advantage of the self-attention layer is that it is able to learn relationships between elements in sequences (or sets) in a way that is invariant to ordering. Exploiting this fact has resulted in numerous state-of-the-art graph classification methods [70, 72, 73].

The functioning of the attention layer is shown in Figure 2.1. The attention layer has two input sequences, called the source and the target. When these sequences are the same, the layer is called self-attention. Elements of the input sequences are forwarded through linear layers independently, producing key (K), query (Q) and value (V) vectors. Then, pairwise dot products are computed between the key and query vectors, thus producing the attention weights between elements of the source and target sequences.

Figure 2.1: The attention mechanism. S_S and S_T are the lengths of the source and target sequences with F features. F_QK is the number of key and query vector features, and F_V is the depth of the output. In the case of self-attention, the source and target sequences are the same.

Following this, the attention weights are normalized using the SoftMax function, and are used to compute a weighted sum of the value vectors. In essence, one element of the output sequence is a weighted sum of the values computed from the elements of the source sequence.

When used for graphs, a self-attention layer is usually given a sequence of nodes.

However, an unmodified self-attention layer cannot incorporate edge features. Still, there are ways to slightly modify this layer type to make use of edge data as well.

Scalar (distance type) edge weights can be easily used to modify the attention weights, ensuring that distant nodes have less impact on each other's embedding [70], whereas vector (feature type) weights can be used to augment the value vectors generated for each node.
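
The following numpy sketch shows the basic dot-product (self-)attention computation of Fig. 2.1, together with an illustrative distance-based re-weighting of the attention matrix in the spirit described above; the exact form of the re-weighting, the weight matrices and all dimensions are assumptions made for the example.

# Minimal sketch of dot-product (self-)attention (Fig. 2.1) with an optional,
# illustrative distance-based down-weighting of the attention matrix.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(source, target, Wq, Wk, Wv, dist=None, mu=1.0):
    Q = target @ Wq              # queries  [S_T x F_QK]
    K = source @ Wk              # keys     [S_S x F_QK]
    V = source @ Wv              # values   [S_S x F_V]
    W = Q @ K.T                  # raw attention weights [S_T x S_S]
    if dist is not None:         # down-weight distant node pairs (illustrative)
        W = W / (1.0 + mu * dist ** 2)
    W = softmax(W, axis=-1)      # normalize the attention weights
    return W @ V                 # output   [S_T x F_V]

# Self-attention over a sequence of 4 node feature vectors with 8 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, X, Wq, Wk, Wv, dist=rng.uniform(0, 2, size=(4, 4)))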

Notably, the self-attention layer can be extended to combine sequences or graphs observed at different time steps, thus making a more accurate prediction from multiple observations. This module is called Recurrent Temporal Attention (RTA) [1].


Here, the attention layer receives two subsequent observations of the graph on its two inputs, thus producing an embedding for the first time step. Then, this embedding is fed back to one of the inputs, while the other input receives the next observation.

This is essentially a recurrent neural cell with an attention-based sequence/graph as its hidden state.

2.2 Construction of Shape Graphs

The first step of the proposed shape recognition method is to construct the shape graphs in a manner similar to the algorithm proposed by Schnabel, Wessel, Wahl, and Klein [53]. However, the proposed method aims to include all relevant information about the individual shapes and their relations in the generated graph.

Furthermore, it is essential for the constructed graph to lend itself well to embedding or kernel methods, since the goal is to use machine learning for classifying the nodes.

2.2.1 Reconstruction and Segmentation

First, 3D reconstruction is performed using a sequence of 2D color images via the Structure from Motion (SfM) algorithm developed by Wu [74]. To produce each point cloud, 15 successive images from a moving camera are used, resulting in partial, "2.5D" point clouds. This method was chosen because it does not require special hardware; moreover, it does not force the user to examine the scene from every position. The resulting point clouds are then sampled via a box filter, and outlier points are removed.

In order to construct shape graphs, the 3D point cloud of the scene is segmented using the algorithm proposed by Schnabel, Wahl, and Klein [52]. This method is an iterative variant of the standard RANSAC method that can detect five different primitive shapes: planes, cylinders, cones, spheres and tori. In each step the primitive with the highest inlier count is removed, until the number of points belonging to the latest shape falls below a threshold. This way, the algorithm produces a sequence of the largest primitive shapes in the scene.

While the segmentation method will always find at least one primitive shape (except for degenerate point clouds), that shape may still fall below the necessary inlier count. Also, small shapes close to other, larger objects may be considered part of that larger shape and thus missed. To avoid these cases, we use a relatively strict inlier distance threshold, but consider shapes with few inliers valid.
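
The following sketch illustrates the iterative extraction loop described above. For simplicity the inner RANSAC step fits planes only and uses naive inlier counting; the actual method of Schnabel, Wahl, and Klein [52] searches over all five primitive types and uses local sampling with an octree grid for efficiency.

# Sketch of the iterative primitive extraction loop (planes only, naive RANSAC).
import numpy as np

def fit_plane_ransac(points, inlier_dist, iters=200, rng=np.random.default_rng(0)):
    """Fit a single plane by RANSAC; return (point, normal) and the inlier mask."""
    best_mask, best_model = np.zeros(len(points), bool), None
    for _ in range(iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        if np.linalg.norm(n) < 1e-12:
            continue                                   # degenerate sample
        n = n / np.linalg.norm(n)
        mask = np.abs((points - p[0]) @ n) < inlier_dist
        if mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (p[0], n)
    return best_model, best_mask

def segment_into_primitives(points, min_inliers=50, inlier_dist=0.005):
    """Remove the best-supported primitive until its support falls below a threshold."""
    primitives, remaining = [], points
    while len(remaining) > min_inliers:
        model, mask = fit_plane_ransac(remaining, inlier_dist)
        if model is None or mask.sum() < min_inliers:
            break                                      # no sufficiently supported primitive left
        primitives.append(("plane", model))
        remaining = remaining[~mask]                   # discard the explained points
    return primitives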


Primitive     Plane          Cylinder   Sphere     Cone      Torus
Features      Area           Radius     Radius     Radius    Inner Radius
              Diameter       Height                Height    Outer Radius
              Bounding Box   Area                  Angle
Origin        Centroid       Centroid   Centroid   Peak      Centroid
Direction     Normal         Axis       N/A        Axis      Normal

Table 2.1: Features and reference frames for every primitive shape type

This way, the segmentation procedure is more prone to making small mistakes that can be learned by the subsequent classification step (such as confusing shape types or segmenting a single shape into multiple parts), rather than large ones, such as missing objects entirely, which would cause those objects to remain undetected.

2.2.2 Construction of the Vector-Graph

Figure 2.2: The graph constructed from primitive shapes (only close edges are drawn). A few example node vectors are displayed.

After the segmentation step, a graph is constructed using the primitive shapes as nodes, while the edges represent the geometric relations between the nodes (Fig. 2.2). Each primitive shape type has a few distinct features that further define the exact shape of the point cloud it represents (Table 2.1). By computing these features, it is possible to assign a feature vector n to each primitive shape. Since the features of the different primitive types are incompatible, a concatenated feature vector is constructed for each primitive in the form n = [n_{plane} n_{cyl} n_{sphere} n_{cone} n_{torus}].

Naturally, for every primitive shape the features of the other types are set to zero.


Furthermore, position and orientation features can be easily assigned to almost every primitive shape, consisting of an origin and at least a single direction in 3D space. The only exception is the sphere, for which a unique direction cannot be defined. This means that the geometric relations between the primitives can be described by computing the rigid transform between their coordinate systems (Table 2.1). However, to ensure the rotation invariance of the algorithm, only the distance between the origins and the angle of rotation between their special directions are used as features. Since spheres do not have a unique direction, the angle between spheres and other primitives is always set to zero.

Since primitive shapes and rigid transforms are described by more than a single parameter, the nodes and the edges of the constructed graph have vectorial weights.

In this work, this type of graph will be referred to as a vector-graph. Furthermore, the edges of the graph have two different types of weights. The first type is the traditional distance type, meaning that the larger this feature is, the less connected the two nodes are. The second type is the feature type, which describes other qualities of the connection, but whose magnitude does not influence the strength of the connection. While constructing graphs, no explicit decisions are made regarding the adjacency of the nodes; the descriptor simply stores the distance between nodes among the features. This means that the algorithm always produces full graphs, which opens the opportunity to treat adjacency as a continuous measure instead of a binary one, thus avoiding loss of information.
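
The sketch below illustrates the construction of the concatenated node vectors and the distance/angle edge features. The per-type feature widths and the tuple layout of the primitives are assumptions made for this example; only the overall scheme follows the description above.

# Sketch of vector-graph construction: concatenated per-type node vectors and
# distance/angle edge features. The per-type feature counts are illustrative.
import numpy as np

TYPE_SLOTS = {"plane": 3, "cylinder": 3, "sphere": 1, "cone": 3, "torus": 2}
OFFSETS, _off = {}, 0
for t, w in TYPE_SLOTS.items():
    OFFSETS[t] = _off
    _off += w

def node_vector(prim_type, feats):
    """Concatenated descriptor: features of all other primitive types stay zero."""
    n = np.zeros(_off)
    n[OFFSETS[prim_type]: OFFSETS[prim_type] + len(feats)] = feats
    return n

def edge_features(origin_i, dir_i, origin_j, dir_j):
    """Rotation-invariant relation: origin distance plus angle between directions."""
    dist = np.linalg.norm(origin_i - origin_j)
    if dir_i is None or dir_j is None:          # spheres have no unique direction
        angle = 0.0
    else:
        angle = np.arccos(np.clip(abs(dir_i @ dir_j), 0.0, 1.0))
    return np.array([dist, angle])              # [distance type, feature type]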

2.3 Embedding Vector-Graph Nodes

After constructing the shape graphs, a learning algorithm is needed to perform classification. As mentioned earlier, however, this is far from trivial: to use a learning algorithm on graphs, an embedding method (or a similar solution) is needed. Since the graph construction method from the previous step produces full vector-graphs, it is vital that the embedding method can capture all features of such graphs without placing restrictions on their structure. Moreover, since the scene (and thus the graph) might change during the use of the TUI system, support (or at least compatibility) for dynamic graphs is also needed.

The proposed embedding method is based on the spectral decomposition approaches presented in subsection 2.1.1. However, these methods used the adjacency matrix of the graph, which is not particularly informative in this case. Thus, our first goal will be to produce a unique, ordering-invariant feature matrix for each node that encompasses all relevant features of the graph, including features of close nodes and


geometric relations. Once this is done, the spectral decomposition can be computed, and its significant components used to construct the feature vector for the node. For the remainder of this work, the node to be embedded is referred to as the central node.

2.3.1 Node Embedding Framework

The first step of the embedding process is to order the nodes of the graph based on their distance from the central node. Since the spectral embedding is only partially invariant to the node ordering [59], this step alone ensures that the resulting feature vectors vary for different nodes. If the ordering is ambiguous due to some nodes being too close, then two separate embeddings may be made with the different orderings and averaged. In order to create descriptors of the same size for all nodes, the maximum number of nodes included must be set. Distant nodes are clipped from larger graphs, while smaller graphs are padded with zero nodes and edges.

For most learning algorithms, the training data is normalized to zero mean and unit variance in order to avoid numerical difficulties. This is especially relevant in this case, since the features on nodes and edges are qualitatively different physical variables, meaning they might differ by orders of magnitude. For this reason, all edge and node variables are normalized using the standard deviation computed from the dataset. Notably, all features are non-negative, which will be exploited later on; therefore the mean is not subtracted.

With the preprocessing steps completed, a descriptor matrix is constructed for the node, called the node feature matrix F. The matrix is unique for every node, while it contains information on the neighboring nodes as well. It is computed according to the equation below.

F = \begin{bmatrix}
T_{1,1} & T_{1,2} & \cdots & T_{1,N} \\
T_{2,1} & T_{2,2} & \cdots & T_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
T_{N,1} & T_{N,2} & \cdots & T_{N,N}
\end{bmatrix}    (2.7)

T_{ij} = T(n_i, n_j, e_{1i}, e_{1j}, e_{ij}),    (2.8)

where T is a feature transform function, n_i is the feature vector of the ith node of the graph, while e_{ij} is the vector descriptor of the edge pointing from the ith to the jth node. N is the maximum number of neighboring nodes considered in the embedding.

The result of the feature transform function between two nodes is a matrix in the


general case. While the node feature matrix is unique for every node, its spectrum might be the same, especially for close nodes (provided that the eigenvalues are ordered by magnitude). Still, because of the node ordering step, the eigenvectors will be different even if the spectra are not. The level of difference in the descriptors of close nodes is largely influenced by the feature transform function.

In order to finalize the embedding process, the singular value decomposition (SVD) of the node feature matrix is computed, and the first few singular values and vectors are concatenated, as shown in Eq. 2.9. It is possible to choose feature transform functions that ensure that the resulting node feature matrix is symmetric. In this case, the spectral decomposition may be used. Still, we do not wish to place such restrictions on the embedding framework, so the SVD is used by default.

w = \left[ \sigma_1 u_1^T \;\; \sigma_1 v_1^T \;\; \sigma_2 u_2^T \;\; \sigma_2 v_2^T \;\; \ldots \;\; \sigma_k u_k^T \;\; \sigma_k v_k^T \right]^T,    (2.9)

where \sigma_i, u_i, and v_i are the ith singular value, left and right singular vectors respectively, and k is the maximum number of singular values considered. Notably, an undirected graph and a symmetric feature transform function yield a symmetric node feature matrix. In such a case, it is unnecessary to add both the left and right singular vectors to the descriptor, since they are the same; thus it is possible to reduce the size of the feature vector by a factor of two.

2.3.2 Feature Transform Functions

Previously, very little has been said about the feature transform function. This was deliberate, since we do not intend to pose unnecessary restrictions that limit the framework’s universal applicability. In this subsection, the choices for the feature transform for the shape graph classification problem are discussed. For different graphs, however, other choices may be more appropriate. Notwithstanding, some of the principles established here may be useful for other applications as well.

Since the goal is to produce a feature vector that contains information on the local context of the node, the influence of nodes farther from the central node must be less than that of the immediate neighbors. This means that if the edge features of the graph include a parameter that can be interpreted as “distance” or “connection strength”, then this parameter may be used to weight the influence of the nodes.

Since shape graph edges have a distance parameter, distance will be used throughout this discussion; this is without loss of generality, as connection strength may be understood as the inverse of distance.


As stated previously, edge features are divided into distance ($e^d$) and feature ($e^f$) types.

Feature type edge features are treated the same way as node features, therefore they are appended to the node feature vectors. On the other hand, distance type features are used to scale the result of the feature transform, so that distant nodes do not affect the node feature matrix significantly. Our choice for the feature transform function is shown in the equations below.

\[
T_{ij} = w_{ij}
\begin{bmatrix} n_i & e^f_{1i} & e^f_{ij} \end{bmatrix}^T
\begin{bmatrix} n_j & e^f_{1j} & e^f_{ij} \end{bmatrix}
\tag{2.10}
\]

\[
w_{ij} = \frac{1}{1 + \mu \left( (e^d_{1i})^2 + (e^d_{1j})^2 + (e^d_{ij})^2 \right)},
\tag{2.11}
\]

where $T_{ij}$ is the feature transform function, $n$ and $e$ are node and edge vectors respectively, and $\mu$ is a hyperparameter controlling the distance scaling. It is important to note that the feature transform is scaled using not just the distance from the central node, but also the distance between the two interacting nodes. Thus, the significance of the interaction between distant nodes is reduced in the resulting feature vector.

There are two desirable properties of this transform function. First, it produces a higher dimensional transform of the original features, which makes it possible for linear learning methods to learn decision functions that are nonlinear in the original features. Moreover, $T_{ij}$ returns a square matrix and satisfies the requirement $T_{ij} = T_{ji}^T$, which means that the node feature matrix will be symmetric as well.

Consequently, the spectral decomposition may be used for embedding, resulting in a smaller feature vector.
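To make the construction concrete, the following sketch implements Eqs. 2.10–2.11 with NumPy; the argument names and the default value of `mu` are assumptions made for illustration:

```python
import numpy as np

def feature_transform(n_i, n_j, ef_1i, ef_1j, ef_ij,
                      ed_1i, ed_1j, ed_ij, mu=1.0):
    """Quadratic feature transform T_ij of Eqs. 2.10-2.11.

    The node vectors and feature-type edge vectors are concatenated and
    combined as an outer product, then scaled by a weight that decays
    with the squared distance-type edge features.
    """
    a = np.concatenate([n_i, ef_1i, ef_ij])
    b = np.concatenate([n_j, ef_1j, ef_ij])
    w_ij = 1.0 / (1.0 + mu * (ed_1i**2 + ed_1j**2 + ed_ij**2))
    return w_ij * np.outer(a, b)
```

For an undirected graph (where $e_{ij} = e_{ji}$), the outer product computed for the pair $(i, j)$ is the transpose of the one computed for $(j, i)$, so the assembled node feature matrix is indeed symmetric, as noted above.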

Notably, it is also possible to define a linear feature transform function to be used with nonlinear learning methods:

\[
T_{i,\mathrm{linear}} = \frac{1}{1 + \mu (e^d_{1i})^2}
\begin{bmatrix} n_i & e^f_{1i} \end{bmatrix},
\tag{2.12}
\]

where $T_i$ is the feature transform function. Note that the index $j$ is omitted, since the feature transform is the same for all $j$ values. This creates a node feature matrix with a rank of one, meaning that the resulting feature vector is simply a concatenation of the $T_i$ values, removing the need for the computationally expensive spectral decomposition. While this is relatively straightforward, this method cannot add information on the interactions between nodes into the embedded feature vector. To summarize, the explicit embedding-based classification procedure is given in Algorithm 1 below.


Algorithm 1 The explicit embedding

Input: Graph $(N, [E^d, E^f])$, where $n_i \in \mathbb{R}^m$, $e^d_{ij} \in \mathbb{R}$ and $e^f_{ij} \in \mathbb{R}^l$.
Require: $T$ feature transform function (Equation 2.8, 2.10)
for each $n \in N$ do
  1. Order the other nodes by distance from the central node $n$
  2. Pad/truncate the graph to the maximum size $K$
  3. Compute the $T_{ij}$ feature transforms for each node pair
  4. Assemble the node feature matrix $F$ (Equation 2.7)
  5. Compute the SVD of $F$, and select the $k$ largest singular values
  6. Compute the node descriptor as per Equation 2.9
end for
7. Use the node descriptor vectors to train an SVM
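A compact sketch of the per-node loop of Algorithm 1, reusing the `feature_transform` and `svd_descriptor` helpers sketched above, might look as follows; the array shapes, the zero-block padding, and the parameter defaults are assumptions made for illustration:

```python
import numpy as np

def embed_node(c, nodes, edge_d, edge_f, K=8, k=3, mu=1.0):
    """Explicit embedding of the node with index c (Algorithm 1).

    nodes:  (n, m) node feature vectors
    edge_d: (n, n) distance-type edge features
    edge_f: (n, n, l) feature-type edge features
    """
    # 1.-2. order the other nodes by distance from the central node,
    #       then truncate to at most K nodes (missing blocks stay zero)
    order = [c] + [j for j in np.argsort(edge_d[c]) if j != c]
    order = order[:K]
    # 3.-4. compute T_ij for every pair and assemble the block matrix F
    d = nodes.shape[1] + 2 * edge_f.shape[2]
    F = np.zeros((K * d, K * d))
    for a, i in enumerate(order):
        for b, j in enumerate(order):
            F[a*d:(a+1)*d, b*d:(b+1)*d] = feature_transform(
                nodes[i], nodes[j],
                edge_f[c, i], edge_f[c, j], edge_f[i, j],
                edge_d[c, i], edge_d[c, j], edge_d[i, j], mu=mu)
    # 5.-6. SVD-based node descriptor (Eq. 2.9)
    return svd_descriptor(F, k)
```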

2.4 Random Walk Node Kernel

Aside from explicit node embedding, using implicit embedding via kernel methods is a viable way of classifying graph nodes as well. In this section, the modification of the random walk kernel (RWK) to compare nodes instead of entire graphs is presented. The underlying idea is that the local context of a graph node is equivalent to the set of nodes that are likely to be reached through short random walks starting from the central node, provided that the node transition probabilities are influenced by distance. This means that two contexts are similar if matching simultaneous random walks can be performed with high probability, starting from the two nodes being compared (the central nodes).

This idea can easily be introduced into the random walk kernel by setting the starting probability vector $s$ from Eq. 2.5 so that the walks always start from the central nodes. According to the proof presented in Vishwanathan, Schraudolph, Kondor, and Borgwardt [63], the random walk kernel remains a valid kernel function regardless of the choice of $s$ and $e$, so the modified algorithm is still a kernel function. The size of the local context explored by the walks can be influenced by setting the maximum length of the walks $N$ or the slope of the weight function $\vartheta$ from Eq. 2.5 and 2.6, respectively.
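The modification amounts to placing all starting probability mass on the pair of central nodes in the direct product graph. The following sketch illustrates this idea; the weight matrix `W_prod` is assumed to be precomputed from the node and edge kernels of Section 2.4.1, and the truncated, geometrically weighted sum is only one possible instance of the weighting of Eqs. 2.5–2.6:

```python
import numpy as np

def node_rw_kernel(W_prod, i1, i2, n2, max_len=4, theta=0.5):
    """Random walk node kernel sketch on the direct product graph.

    W_prod: (n1*n2, n1*n2) weight matrix of the direct product graph
    i1, i2: indices of the two central nodes being compared
    n2:     number of nodes in the second graph (for index flattening)
    """
    s = np.zeros(W_prod.shape[0])
    s[i1 * n2 + i2] = 1.0          # walks always start at the central pair
    e = np.ones(W_prod.shape[0])   # uniform stopping vector
    value, walk = 0.0, s
    for step in range(1, max_len + 1):
        walk = W_prod.T @ walk     # propagate mass one step along the walk
        value += (theta ** step) * (e @ walk)
    return value
```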

2.4.1 Node and Edge Kernels

Notably, the random walk kernel requires the direct product of the two graphs. However, in order to compute the direct product of graphs with vectorial node and edge weights, kernel functions between the nodes and edges are needed [63]. Our choice for the node kernel determines when the simultaneous random walks on the two graphs are considered similar. A sensible choice for the node kernel is a relatively simple RBF similarity measure between the feature vectors of the nodes.

\[
K_n = e^{-\nu (n_1 - n_2)^T (n_1 - n_2)},
\tag{2.13}
\]

where $n_1$ and $n_2$ are the two node vectors compared, and $\nu$ is the hyperparameter controlling the kernel. This way, the kernel can be interpreted as follows: two nodes are similar if random walks starting from them are likely to lead through a sequence of nodes with similar features (according to the RBF measure).
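A direct NumPy transcription of Eq. 2.13 might be the following; the function name and the default hyperparameter value are illustrative assumptions:

```python
import numpy as np

def node_kernel(n1, n2, nu=1.0):
    """RBF similarity between two node feature vectors (Eq. 2.13)."""
    d = n1 - n2
    return np.exp(-nu * d @ d)
```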

Choosing the edge kernel requires somewhat more consideration. Arguably, defining an RBF similarity measure between the features of the two edges is sensible. However, edges also possess a “distance” type feature, which should be used to weight the probability of taking the given edge during a random walk, in order to ensure the premise of the proposed node kernel. For this reason, “distance” type features are used to decrease the probability of taking long edges that lead far from the starting nodes, encouraging the random walks to explore the local structure. This results in the following equation:

\[
K_e = \frac{1}{1 + \gamma (e_{1,d}^2 + e_{2,d}^2)} \, e^{-\xi (e_{1,f} - e_{2,f})^2},
\tag{2.14}
\]

where $e_1$ and $e_2$ are the two edge vectors compared, and $\gamma$ and $\xi$ are the hyperparameters controlling the kernel. With this choice of edge kernel, the transition probabilities also depend on the similarity of the feature type edge weights. This way, the similarity of random walks does not only mean a sequence of similar nodes, but also a sequence of similar edges.
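Analogously, Eq. 2.14 can be sketched as follows; the argument names and default values are assumptions, and vector-valued feature type weights are compared via their squared Euclidean distance:

```python
import numpy as np

def edge_kernel(e1_d, e1_f, e2_d, e2_f, gamma=1.0, xi=1.0):
    """Edge kernel of Eq. 2.14: long edges get a low transition probability."""
    diff = np.asarray(e1_f) - np.asarray(e2_f)
    return np.exp(-xi * diff @ diff) / (1.0 + gamma * (e1_d**2 + e2_d**2))
```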

2.5 Results

In this section, the results of the experiments performed regarding the proposed node embedding methods are presented. Since the motivation behind the algorithms was to create representations that are useful for classification, the methods are tested on various classification tasks using machine learning.

Support Vector Machines (SVMs) are used to perform the actual classification, since SVMs are one of the most powerful kernel-based machine learning methods. This choice allows us to run experiments on both the explicit embedding and the random walk node kernel using the same learning algorithm.
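In practice this can be done, for instance, with scikit-learn's SVC, using the descriptor vectors directly for the explicit embedding and a precomputed Gram matrix for the kernel variant; the toy data below is only a stand-in for the real descriptors and kernel values:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the node descriptors / kernel values computed above.
rng = np.random.default_rng(0)
X_train = rng.random((20, 16))          # explicit node descriptors
y_train = np.tile([0, 1], 10)           # alternating toy class labels
K_train = X_train @ X_train.T           # any valid Gram matrix works here

# Explicit embedding: descriptors are used as ordinary feature vectors.
clf_explicit = SVC(kernel="rbf").fit(X_train, y_train)

# Implicit embedding: the random walk node kernel is supplied precomputed.
clf_kernel = SVC(kernel="precomputed").fit(K_train, y_train)
```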
