
4.4 Results

4.4.3 Evaluation

This subsection presents the results of the experiments and the comparisons performed using the Bayesian t-test. For every comparison, results are shown for a small selection of the datasets: two synthetic datasets (one of each type) and the two image-based datasets. The results for all datasets are displayed in the Appendix. As stated previously, our aim is to provide evidence for the following statements:

• The Genetic Algorithm (GA) with custom operators outperforms all other methods (Greedy, Simulated Annealing and the SVM-only classification).

• Each of the custom genetic operators proposed in Section 4.3 improves the performance of the optimization.

• Using the explicit embedding proposed in Chapter 2 improves the final result of the optimization.

• Using explicit embedding with GA results in higher accuracy than a Graph-Attention network trained on scene graphs.

Comparison of Optimization Methods

First, the scene optimization was run on all datasets, with some of the results shown in Table 4.1. The full results are displayed in Table A.5 in the Appendix. From the raw results it is apparent that the GA method outperforms the other methods in terms of classification error (ec) and the optimization performance (eopt). This also supports the validity of the objective function, since better optimization led to better accuracy.

Moreover, as argued earlier, the GA method is more likely to find solutions with lower cost than the ground truth, as evidenced by the higher ecost values. Note that the frequency at which these suboptimal ground truths occur is heavily influenced by the accuracy of the SVM classifier used to produce the scores for the objective function. Lastly, all three optimization methods seem to outperform the SVM-only classification, providing justification for using global optimization.

Márton Szemenyei 78/130 ARRANGEMENT IN SCENES

Metric      esvm  |        ec           |      ecost        |       eopt
Method            |  Gr.    SA     GA   |  Gr.   SA    GA   |  Gr.   SA    GA
Syn.        99.1  | 99.78  99.5   99.92 |  0.3   0.3   0.3  |  0.61  1.32  0
Overlap.    99.4  | 99.84  98.73  99.96 |  0.1   0.1   0.1  |  0.7   5     0.1
Syn. Im.    85.6  | 87.81  92.47  93.67 | 12.3  12.2  18.4  | 28.9   5.68  0.87
Real Im.    94.9  | 98.83  98.57  99.38 |  1.8   1.9   2.6  |  7.58  0.57  0

Table 4.1: Result of the scene optimization


Figure 4.4: Bayesian t-test between the Greedy and the GA methods for eopt (top) and ec (bottom).

These claims are strongly supported by the Bayesian t-tests comparing the performance of these methods. Comparing the Greedy and the GA algorithms (Fig. 4.4), the results show a ~100% probability that GA outperforms the Greedy algorithm for both metrics, with 95% credible intervals of [0.13, 0.22] and [−0.044, −0.018] and medians of 0.17 and −0.031 for eopt and ec respectively.
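The quantities reported in these tests (the posterior probability of a positive difference, the median, and the 95% highest-density interval) are all summaries of posterior samples of µdiff. As an illustrative sketch only (the actual tests follow the standard Bayesian t-test model with Student-t likelihoods; the `hdi` helper and the synthetic draws below are not from the experiments), such summaries can be computed from any vector of posterior draws:

```python
import numpy as np

def hdi(samples, cred_mass=0.95):
    """Narrowest interval containing cred_mass of the posterior samples."""
    sorted_s = np.sort(np.asarray(samples))
    n = len(sorted_s)
    n_in = int(np.ceil(cred_mass * n))
    # Width of every candidate interval spanning n_in consecutive samples;
    # the HDI is the narrowest of these.
    widths = sorted_s[n_in - 1:] - sorted_s[:n - n_in + 1]
    i = int(np.argmin(widths))
    return sorted_s[i], sorted_s[i + n_in - 1]

# Hypothetical posterior draws of the mean difference mu_diff
rng = np.random.default_rng(0)
draws = rng.normal(0.17, 0.023, size=20000)

lo, hi = hdi(draws)                 # 95% HDI endpoints
prob_gt_zero = (draws > 0).mean()   # posterior P(mu_diff > 0)
med = np.median(draws)              # posterior median
```

The "0% < 0 < 100%" annotations in the figures correspond to `prob_gt_zero` here: the fraction of posterior mass on either side of zero.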

The results are similarly conclusive when comparing the SA and GA algorithms (Fig. 4.5). The results show a ~100% probability that GA outperforms the SA algorithm for both metrics, with 95% credible intervals of [0.13, 0.28] and [−0.074, −0.031] and medians of 0.20 and −0.053 for eopt and ec respectively.

Figure 4.5: Bayesian t-test between the SA and the GA methods for eopt (top) and ec (bottom).

Finally, we compared the performance of the SVM-only classification esvm to the accuracy of the GA-based localization (Fig. 4.6). The result shows a ~100% probability that GA outperforms the SVM algorithm, with a 95% credible interval of [−0.90, −0.83] and a median of −0.86.

Evaluating Genetic Operators

We have also compared the performance of the proposed genetic operators with the vanilla genetic operators for one-hot coded nominal genomes. The results (Table 4.2) show that each operator improves the performance of the genetic optimization on its own. The Random Drag and Shuffle Mutation (RDSM) and the Class Score Optimal Initialization (CSOI) achieve the largest decrease in error, while the improvement caused by the Cluster N-Point Crossover (CNPC) is more modest. Note that the operators were enabled in the same order they are presented in the table. The full results are found in Table A.6 in the Appendix.

Method         None          CSOI          RDSM          CNPC
Metric        ec     eopt   ec     eopt   ec     eopt   ec     eopt
Synthetic     29.01  91.77  89.15  27.81  99.76  0.61   99.92  0.00
Overlapping   38.92  99.90  83.74  48.31  99.82  0.80   99.96  0.10
Synth Images  31.06  94.61  87.81  28.71  91.79  8.00   93.67  0.87
Real Images   64.21  51.69  95.95   6.88  98.16  1.40   98.83  0.00

Table 4.2: Change in errors caused by the special genetic operators.

The results of the Bayesian t-test are shown in Figure 4.9. The tests show a ~100% probability that using all custom operators outperforms the standard version of the GA algorithm for both metrics, with 95% credible intervals of [0.9, 0.95] and [−0.66, −0.59] and medians of 0.93 and −0.63 for eopt and ec respectively.

Note that the Bayesian t-tests were performed after the addition of every custom operator, showing that each of these custom operators contributes to the positive effect, albeit with different magnitudes. All three operators improve both metrics credibly; the largest effect is attributed to the CSOI operator, with medians of 0.45 and 0.46 for eopt and ec respectively. The RDSM operator is second, with medians of 0.38 and 0.13, while the CNPC is last with medians of 0.07 and 0.024 for eopt and ec respectively. The full results of these tests are displayed in Figures A.34-A.39 in the Appendix.
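For context, the baseline that the custom operators are compared against is a plain N-point crossover on nominal genomes. The sketch below is a generic illustration under the assumption of a label-coded genome (equivalent to one-hot coding with a single active entry per gene); it is not the thesis's implementation of CSOI, RDSM, or CNPC, whose definitions are given in Section 4.3:

```python
import numpy as np

def n_point_crossover(parent_a, parent_b, n_points, rng):
    """Baseline N-point crossover on label-coded nominal genomes.

    Each gene holds a class label; crossover swaps whole genes between
    the parents, never splitting a one-hot block, so every child remains
    a valid genome.
    """
    length = len(parent_a)
    # Distinct cut points strictly inside the genome
    cuts = np.sort(rng.choice(np.arange(1, length), size=n_points, replace=False))
    child = parent_a.copy()
    take_b = False
    prev = 0
    for cut in list(cuts) + [length]:
        if take_b:
            child[prev:cut] = parent_b[prev:cut]
        take_b = not take_b
        prev = cut
    return child

rng = np.random.default_rng(1)
a = np.array([0, 0, 1, 2, 2, 1])   # hypothetical class labels per scene object
b = np.array([2, 1, 1, 0, 0, 2])
child = n_point_crossover(a, b, n_points=2, rng=rng)
```

Every gene of the child is inherited positionally from one of the two parents, which is the validity property the clustered variant (CNPC) also preserves while choosing cut points more deliberately.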


Figure 4.6: Bayesian t-test between esvm and ec using GA and explicit embedding.


Benefits of using Embedding

In the third set of tests, we aim to establish the benefit of using the embedding for the scene optimization step. While it was already shown in Chapter 2 that the explicit embedding improves the SVM classification scores for scene datasets, its effect on the final localization results was not evaluated. Although it would be reasonable to assume that more accurate classification scores result in better global optima, performing the Bayesian t-test still provides more direct proof. The full results used for the test are found in Table A.7 in the Appendix.

For this comparison, the GA method was used with all custom operators, as this proved to be the most accurate method. The results (Fig. 4.7) show that the explicit embedding improves the cost function validity ecost with a 52.3% probability, meaning that the distribution of improvements is almost perfectly centered around zero. The test run on the classification accuracy is much more conclusive, however, showing a 97.2% probability of the explicit embedding outperforming the no-embedding case, with a 95% credible interval of [−0.031, 0.0015] and a median of −0.014.

While these results seem somewhat contradictory, two important insights help resolve this apparent paradox. First, ecost is the percentage of scenes where there was an optimal solution with a lower cost than the ground truth. What ecost does not capture is how much better this optimal solution was, and even more importantly, how different the optimal solution was from the ground truth. Second, since ec is computed over nodes, not scenes, datasets with similar ecost values can have vastly different classification accuracies.
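This distinction can be made concrete with a toy computation (hypothetical values, not data from the experiments; the function names are illustrative): two datasets with identical ecost can differ sharply in ec, because ecost is a per-scene indicator while ec aggregates over all nodes.

```python
def e_cost(found_costs, gt_costs):
    """Fraction of scenes where the optimizer found a solution cheaper
    than the ground-truth labeling (a per-scene indicator)."""
    return sum(f < g for f, g in zip(found_costs, gt_costs)) / len(gt_costs)

def e_c(pred_labels, gt_labels):
    """Node-level classification error pooled across all scenes."""
    nodes = [(p, g) for scene_p, scene_g in zip(pred_labels, gt_labels)
             for p, g in zip(scene_p, scene_g)]
    return sum(p != g for p, g in nodes) / len(nodes)

# Two scenes; in both datasets exactly one scene beats the ground-truth cost...
costs_found = [0.9, 1.2]
costs_gt = [1.0, 1.0]

# ...but dataset A mislabels one node out of six, dataset B mislabels four.
gt = [[0, 1, 2], [1, 1, 2]]
ec_a = e_c([[0, 1, 2], [1, 1, 0]], gt)   # 1/6 node error
ec_b = e_c([[2, 0, 2], [1, 0, 0]], gt)   # 4/6 node error
```

Both datasets share ecost = 0.5, yet their node-level errors differ fourfold, which is exactly the situation observed in the embedding comparison above.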

Comparison with the neural baseline

Our discussion in subsection 2.5.3 demonstrated that the graph-attention-based neural classification does not generalize to scene graphs. Still, such a neural network could be trained on the scene database directly, and the results compared against the global optimization. To do this, pre-trained models from the earlier discussion were fine-tuned on the scene datasets. Once again, Bayesian optimization was used to determine hyperparameters, with the 20% hold-out validation error as the metric.

The results show that the graph-attention baseline improves significantly with fine-tuning, although it still falls short of the other methods, achieving only 67% accuracy on the synthetic and 77% on the real image datasets, respectively. We argue that this could be the result of a few factors, including overfitting due to small dataset sizes and failure to learn complex requirements, such as the number of necessary objects per class.



Figure 4.7: Bayesian t-test between no embedding and using explicit embedding for ecost (top) and ec (bottom).

The Bayesian t-test was also performed comparing the neural baseline with the custom operator-based GA using the explicit embedding. The results (Fig. 4.8) show that the proposed method outperforms the baseline significantly, with a probability near 100%. The 95% HDI is between 19% and 42%, with a median value of 30%. The full results of the neural baseline for all datasets are shown in Table A.8 in the Appendix.

Context optimization

The performance of context optimization was also evaluated, using the two scene databases created for this purpose. The final classification and cost function errors are presented in Table 4.3 for these two datasets, with and without context optimization. The results show that context optimization is able to reliably improve the scene optimization method performance, as long as the closeness of certain objects is dependent on their categories.


Figure 4.8: Bayesian t-test comparing the proposed GA method and the neural baseline on the scene datasets.

Metric               ec            ecost
Context            No     Yes    No     Yes
Synthetic Images   71.5   82.1   92.7   66.4
Real Images        82.7   91.3   47.9   27.9

Table 4.3: Results before and after the context optimization

Note that running Bayesian t-tests on these results would not be advised due to the low number of independent datasets; therefore, we do not make statements about the effect of context optimization, except that it is viable on these two datasets.