
2.5 Results

2.5.3 Evaluation

In this section, the statistical analysis of the results is performed using the Bayesian t-test introduced in Section 1.3. The measure used for the tests is the cross-validation error ecv. All three metrics have been evaluated on all databases; the full results can be found in Table A.1 in the Appendix.
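The quantities reported throughout this section (the 95% Highest Density Interval and the probability of a positive difference) can be computed directly from posterior samples. The following is a minimal illustrative sketch in Python, not the implementation used for the thesis; the function names are our own.

```python
import math

def hdi(samples, cred=0.95):
    """Narrowest interval containing a `cred` fraction of the samples
    (the Highest Density Interval reported for each test)."""
    s = sorted(samples)
    n = len(s)
    k = math.ceil(cred * n)  # number of samples inside the interval
    # slide a window of k consecutive sorted samples; keep the narrowest
    i = min(range(n - k + 1), key=lambda j: s[j + k - 1] - s[j])
    return s[i], s[i + k - 1]

def prob_positive(samples):
    """Posterior probability of a positive difference, P(diff > 0)."""
    return sum(x > 0 for x in samples) / len(samples)
```

Applied to MCMC draws of the mean difference, these two functions yield the HDI bounds and the "x% < 0 < y%" annotations shown in the figures below.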

Márton Szemenyei 32/130 SHAPE CLASSIFICATION

Table 2.2 shows the results of the SVM classification on the two synthetic databases.

It is immediately apparent that the proposed embedding methods achieve considerably lower errors than the no-embedding version (where the n node descriptor vectors are used as inputs for the SVM). As expected, the improvement is even more pronounced on the Overlapping dataset, since it was designed to illustrate the necessity of embedding local context.

Notably, the random walk node kernel achieves slightly better results on both datasets, yet the difference between the training and cross-validation errors is approximately twice as large as with the explicit embedding, suggesting that the kernel-based method might be somewhat more prone to overfitting.

Metric      | etr              | eho              | ecv
Embedding   | No    Yes   RWK  | No    Yes   RWK  | No    Yes   RWK
Synthetic   | 17.0  2.8   0.0  | 16.5  3.4   0.2  | 17.7  3.7   1.9
Overlapping | 67.8  0.8   0.0  | 67.8  0.9   0.0  | 68.1  0.9   0.2

Table 2.2: Node-by-node classification errors on the two synthetic datasets
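The overfitting observation above can be checked directly against Table 2.2. A quick sketch (the values are copied from the Synthetic row of the table):

```python
# Generalization gap (ecv - etr) on the Synthetic dataset, values from Table 2.2.
def gap(etr, ecv):
    return ecv - etr

explicit_gap = gap(2.8, 3.7)   # explicit embedding
rwk_gap = gap(0.0, 1.9)        # random walk node kernel

# The kernel's gap is roughly twice the explicit embedding's,
# as noted in the text.
ratio = rwk_gap / explicit_gap
```

This is only an arithmetic illustration of the claim; the formal comparison of the two methods' generalization gaps is performed with the Bayesian t-test later in this section.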

Table 2.3 shows the results of the SVM classification on the two image-based datasets. The improvement provided by the embedding method is also apparent in this case, although it seems less pronounced. Surprisingly, all methods perform considerably worse on the synthetic image dataset, which can be explained by the more complex shapes it contains.

In contrast with the synthetic case, the explicit embedding appears to perform marginally better in this scenario. While the random walk kernel achieves better training accuracy values on both datasets, it trails the explicit embedding on both types of validation accuracy. This clearly supports our initial suspicion that the random walk node kernel is somewhat more likely to overfit.

Metric       | etr               | eho              | ecv
Embedding    | No    Yes   RWK   | No    Yes   RWK  | No    Yes   RWK
Synth Images | 45.2  36.1  35.9  | 44.6  35.9  36.0 | 45.3  36.0  36.6
Real Images  | 12.8  6.8   4.95  | 13.7  7.7   8.0  | 13.9  9.1   10.9

Table 2.3: Node-by-node classification errors on the two image-based datasets

After evaluating the methods on all datasets (the results are displayed in Table A.1 in the Appendix), the Bayesian t-test was performed to compare the cross-validation accuracies. First, the accuracies with no embedding (referred to as the raw embedding) were compared with the explicit method. The results (Fig. 2.3) show that the explicit embedding clearly outperforms the raw case: the probability of a positive effect size is near 100%, with the 95% Highest Density Interval (HDI) falling in the [0.25, 0.46] range.

[Figure 2.3 shows the posterior plots for GraphEmbedCV[, 1] vs. GraphEmbedCV[, 3] (N = 24): mean difference µdiff median = 0.36, 0% < 0 < 100%, 95% HDI [0.25, 0.46]; std. dev. of difference σdiff median = 0.25, 95% HDI [0.18, 0.33]; effect size (µdiff − 0)/σdiff median = 1.4, 95% HDI [0.85, 2.1].]

Figure 2.3: Bayesian t-test comparing the raw and explicit embeddings.

The figure shows the posterior distributions of the difference between the performances of the two methods, with the black line displaying the 95% Credible Interval (CI). In the top left corner, the probabilities of the difference being less than and greater than zero are displayed in green.

Comparing the Random Walk Node Kernel to the raw embedding (Fig. 2.4) yields similar results: the probability of a positive effect size is once again close to 100%, while the 95% HDI falls in the [0.29, 0.53] range. The median value of the Difference of Means (DoM) is 0.36 in the explicit embedding case and 0.41 for the kernel, suggesting similar improvements.

[Figure 2.4 shows the posterior plots for GraphEmbedCV[, 1] vs. GraphEmbedCV[, 4] (N = 24): mean difference µdiff median = 0.41, 0% < 0 < 100%, 95% HDI [0.29, 0.53]; std. dev. of difference σdiff median = 0.28, 95% HDI [0.21, 0.38]; effect size (µdiff − 0)/σdiff median = 1.5, 95% HDI [0.86, 2.1].]

Figure 2.4: Bayesian t-test comparing the raw and RWK embeddings.

We also performed a t-test comparing the explicit embedding and the random walk methods. The results (Fig. 2.5) clearly demonstrate that the random walk method achieves higher accuracy (once again, the probability of a positive effect size is near 100%), with the 95% HDI ranging from 0.040 to 0.079 and a median value of 0.059.

Still, this could be due to the larger number of synthetic datasets in the collection, as the explicit embedding method performed better on both image-based sets.

[Figure 2.5 shows the posterior plots for GraphEmbedCV[, 3] vs. GraphEmbedCV[, 4] (N = 24): mean difference µdiff median = 0.059, 0% < 0 < 100%, 95% HDI [0.040, 0.079]; std. dev. of difference σdiff median = 0.045, 95% HDI [0.033, 0.062]; effect size (µdiff − 0)/σdiff median = 1.3, 95% HDI [0.73, 1.9].]

Figure 2.5: Bayesian t-test comparing the explicit and RWK embeddings.

This alone would suggest that the random walk node kernel is a superior choice, even though the DoM is relatively small. However, the random walk kernel has considerably larger computational costs (this point is expanded further in the next subsection), while it is also suspected of having a larger tendency to overfit. To put this latter suspicion to the test, the Bayesian t-test was performed to compare the explicit and kernel methods' generalization capabilities. For this test, we used the difference between the training and cross-validation errors of the methods to quantify the amount of overfitting.

The results (Fig. 2.6) show that the random walk kernel likely has a larger tendency to overfit. The probability of a negative DoM comfortably surpasses the (arbitrary) 95% threshold at 99.2%. The 95% HDI is between −0.93 and −0.11, with a median value of −0.51. The result of this test is strong enough to demonstrate a difference between these methods, and it strongly influenced our decision regarding which method to use for the tests in the next chapters.
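The three summaries reported for each test (the mean difference, its standard deviation, and the effect size µdiff/σdiff) can be illustrated with a simple point-estimate analogue. The sketch below uses made-up toy data, and the actual test infers full posterior distributions rather than single estimates; the function name is our own.

```python
import statistics

def paired_summary(a, b):
    """Point-estimate analogue of the test's summaries for paired samples:
    mean difference, its standard deviation, and the effect size mu / sigma."""
    d = [x - y for x, y in zip(a, b)]
    mu = statistics.mean(d)         # difference of means (DoM)
    sigma = statistics.stdev(d)     # std. dev. of the differences
    return mu, sigma, mu / sigma    # effect size = (mu - 0) / sigma

# Toy per-dataset error values for two hypothetical methods
mu, sigma, effect = paired_summary([2.0, 3.0, 5.0], [1.0, 1.0, 1.0])
```

As a sanity check against Fig. 2.6, the reported medians behave the same way: −0.51 / 0.91 ≈ −0.56, close to the reported effect-size median of −0.57.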

The two proposed embedding methods were also compared against a graph-attention baseline. This baseline consisted of a single graph-attention layer and a linear layer for classification. The hyperparameters of both the architecture and training were determined via Bayesian optimization, with the 20% hold-out validation accuracy being the objective function.

The results of the Bayesian t-test show that the neural baseline outperforms the

[Figure 2.6 shows the posterior plots for GraphEmbedOverfit[, 1] vs. GraphEmbedOverfit[, 2] (N = 24): mean difference µdiff median = −0.51, 99.2% < 0 < 0.8%, 95% HDI [−0.93, −0.11]; std. dev. of difference σdiff median = 0.91, 95% HDI [0.56, 1.3]; effect size (µdiff − 0)/σdiff median = −0.57, 99.2% < 0 < 0.8%, 95% HDI [−1.0, −0.12].]

Figure 2.6: Bayesian t-test comparing the RWK and explicit embeddings' tendency to overfit.

explicit embedding by a significant margin. The probability of improvement is near 100%, while the 95% HDI is between −9.3% and −5.6%, with a median value of −7.6%. However, the random walk kernel managed to surpass the neural baseline slightly, with a probability of 98.7%. The 95% HDI is between 0.17% and 1.7%, with a median value of 0.99%.

Generalization to scenes

Note that all tests so far were performed on classification databases, meaning that every 3D scene contained only the object in question. However, in the TUI system, these classes have to be recognized as parts of larger scenes, where nearby objects of other classes may influence the embedding process. For this reason, it is prudent to test the chosen method on a scene database (in which objects appear in context) as well.

To do this, a scene dataset was created from each of the aforementioned 24 datasets, and the models trained on the classification sets were used to classify all nodes in these scene datasets. The results were used to compare the raw and the explicit embeddings' performance using the Bayesian t-test. The full results on all datasets are displayed in Table A.2 in the Appendix.

The results of the test (Fig. 2.7) show that the explicit embedding method provides a similarly certain improvement over the raw version, with the probability of a positive DoM once again being near 100%. The 95% HDI is [0.23, 0.44], with a median of 0.33. This latest result clearly demonstrates that the method proposed in this chapter is suitable for embedding shape graphs for classification in scenes.

Naturally, the graph-attention baseline was also evaluated, using a model trained


[Figure 2.7 shows the posterior plots for GraphEmbedScene[, 1] vs. GraphEmbedScene[, 3] (N = 24): mean difference µdiff median = 0.33, 0% < 0 < 100%, 95% HDI [0.23, 0.44]; std. dev. of difference σdiff median = 0.25, 95% HDI [0.18, 0.33]; effect size (µdiff − 0)/σdiff median = 1.4, 95% HDI [0.79, 1.9].]

Figure 2.7: Bayesian t-test comparing the raw and explicit embedding on the scene versions of the datasets.

on classification data and testing it on the scene dataset. The results show that the neural network-based solution does not generalize well, with most classification errors close to the expected performance of random guessing. The results of the test (Fig. 2.8) show that the explicit embedding outperforms the neural baseline with a probability close to 100%, while the 95% HDI is between −66% and −60%, with a median value of −63%. The full results of the neural baseline are displayed in Table A.8 in the Appendix.

[Figure 2.8 shows the posterior plots (N = 22): mean difference µdiff median = −0.63, 100% < 0 < 0%, 95% HDI [−0.66, −0.60]; std. dev. of difference σdiff median = 0.069, 95% HDI [0.046, 0.096]; effect size (µdiff − 0)/σdiff median = −9.2, 100% < 0 < 0%, 95% HDI [−13, −6].]

Figure 2.8: Bayesian t-test comparing the neural baseline and explicit embedding on the scene versions of the datasets.


Evaluation of method run times

Another essential factor for determining the usability of the proposed methods in a Tangible Augmented Reality system is the computational requirements during training and inference. The former is essential, since the developer of a new augmented environment may add new virtual object classes to the system, which requires retraining of the classifiers. At the same time, inference times are a major factor in the user experience, even though classification is normally only performed during the initialization of the system.

The run times of the proposed methods were evaluated using a 2016 MacBook Pro with a 2.9 GHz i5 processor and 8 GB RAM. The code used no multi-threading or GPU acceleration, and was implemented using Matlab. For testing, the synthetic dataset was used, which had 5 classes and 1,000 graphs per class.

Training a single SVM model required 5–10 seconds depending on the hyperparameters, which puts the total training time at approximately 400 seconds, since Bayesian optimization ran for 50 iterations before the final training and cross-validation. Using the explicit embedding adds 25 extra seconds, which results in an approximately 6% increase of the total training time. On the other hand, using the random walk node kernel adds over 350 seconds, almost doubling the total training time required.
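The timing figures above can be reproduced with simple arithmetic. In the sketch below, the 8 s per fit is an assumed average of the reported 5–10 s range, not a measured value:

```python
# Back-of-envelope check of the reported training times.
N_ITER = 50        # Bayesian optimization iterations
AVG_FIT_S = 8.0    # assumed average seconds per single SVM training

baseline = N_ITER * AVG_FIT_S   # ~400 s total, matching the text
explicit = baseline + 25        # explicit embedding overhead
rwk = baseline + 350            # random walk node kernel overhead

overhead_pct = 100 * (explicit - baseline) / baseline  # ~6% increase
rwk_factor = rwk / baseline                            # almost doubled
```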

During inference, the time required to embed all nodes in a single scene graph and to perform the subsequent inference using a trained SVM was measured. Note that embedding times are slightly higher in this case than during training, since scene graphs are usually larger than graphs that contain a single object. With this in mind, inference takes 8 ms with the raw embedding, which increases to 14 ms if the explicit embedding method is used. Using the random walk node kernel results in an inference time of 240 ms, while the graph-attention method takes 6 ms.

We argue that training times on the order of a few dozen minutes are well within the acceptable range. Similarly, even in the random walk case, spending a few hundred milliseconds on classification is perfectly acceptable, since computing the 3D reconstruction and performing the RANSAC segmentation normally takes at least several seconds. According to the experiments presented in this section, both proposed algorithms are applicable to the central problem of the thesis.
