

4.6 View based recognition

The early version of the recognition system used 2D views (instead of 3D projective invariants) and the image features detected on them to recognize objects in the scene. In this case the objects in the database are represented by a limited number of 2D views of each object. A view of an object consists of the features and the relations between them produced by the low-level system and stored in the feature graphs. Similarly to the current implementation, the database also contains 3D information (point coordinates) about the objects, but in this case it is attached as attributes to the 2D features. The situation is depicted in Figure 4.7.
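For illustration, the information attached to a single 2D view feature in such a database could be organized as in the following minimal sketch; the class and field names are illustrative and not taken from the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class ViewFeature2D:
    """One detected 2D feature of a stored view (illustrative structure)."""
    ftype: str                     # t_ij: feature type produced by the low-level system
    attributes: np.ndarray         # alpha_ij: attribute vector of the feature
    position: np.ndarray           # mu_ij: 2D pose (u, v, theta, s)
    position_cov: np.ndarray       # Sigma_ij: covariance of the pose
    point3d: Optional[np.ndarray] = None  # attached 3D coordinates (X, Y, Z), if available

@dataclass
class ModelView:
    """One stored 2D view of an object: features plus their relations (feature graph)."""
    features: List[ViewFeature2D]
    relations: List[Tuple[int, int]]       # index pairs of related features
```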

The multi-view scene case complicates the analysis: not only the scene but also the model database consists of more than one view of an object, and usually not all of the features can be seen in the views of the scene as a consequence of occlusions (by the object itself or by other objects). To solve this problem, statistical object recognition was used, based on [107], [80].

The starting point is the Bayes rule for conditional probability:

P(A|BC) = P(ABC) / P(BC)

[Figure 4.7: Two types of information in view based recognition: the 2D feature positions (u_1, v_1), (u_2, v_2), ..., (u_n, v_n) on the views and the 3D point coordinates (X, Y, Z) attached to them.]

or in short form

P(A|B,C) = P(A,B,C) / P(B,C)

from which it follows that

P(A|B,C) = P(B|A,C) P(A,C) / P(B,C)
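The second form follows from the first by expanding the joint probability with the chain rule; the short derivation below makes the intermediate step explicit.

```latex
\begin{align*}
P(A \mid B, C) &= \frac{P(A, B, C)}{P(B, C)}
  && \text{definition of conditional probability} \\
P(A, B, C)     &= P(B \mid A, C)\, P(A, C)
  && \text{chain rule applied to } B \\
\Rightarrow\quad
P(A \mid B, C) &= \frac{P(B \mid A, C)\, P(A, C)}{P(B, C)}
\end{align*}
```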

Let α, β ∈ {MM, MS, SS}, where M denotes model and S denotes scene. Let A denote the hypothesis that an object of the model database is present in the scene. Let R_αβ = {r_αβ,ij}, where r_αβ,ij ∈ {k_l, ⊥_l} means that the i'th feature on the j'th view is matched with the k'th feature on the l'th view, or that no match is found. Let T_αβ = {T_αβ,jl}, where T_αβ,jl is the viewpoint transformation between view j and view l. Let P(·) denote probability.

A detailed summary of the notation used in this section is given in Table 4.5.

A match is a set of pairings between features together with the transformations between the views. The goal of the recognition is to describe the quality of a match by a cost function. Using Bayesian estimation theory, this quality measure can be associated with the probability P(A|R,T), where R = [R_MM, R_MS, R_SS] and T = [T_MM, T_MS, T_SS].

Hence the probability P(A|R,T) can be written as

P(A|R,T) = P(R|A,T) P(A,T) / P(R,T)
         = P(R|A,T_MM,T_MS,T_SS) P(A,T_MM,T_MS,T_SS) / P(R,T_MM,T_MS,T_SS)

In general two cases must be distinguished, depending on whether the model base contains 3D or 2D object descriptions. In the 2D model, 2D scene case the probabilities can be written in the following forms.

The probability P(R|A,T_MM,T_MS,T_SS) can be written as

P(R|A,T_MM,T_MS,T_SS) = P(R_MM,R_MS,R_SS | A,T_MM,T_MS,T_SS)
                      = P(R_MM|A,T_MM,T_MS,T_SS) P(R_MS|A,T_MM,T_MS,T_SS) P(R_SS|A,T_MM,T_MS,T_SS)
                      = P(R_MM|A,T_MM) P(R_MS|A,T_MS) P(R_SS|A,T_SS)

supposing that R_MM and T_MS, T_SS; R_SS and T_MM, T_MS; R_MS and T_MM, T_SS are independent.

The probability P(A,T_MM,T_MS,T_SS) can be written as

P(A,T_MM,T_MS,T_SS) = P(T_MS|A,T_MM,T_SS) P(A,T_MM,T_SS)
                    = P(T_MS|A,T_MM,T_SS) P(A) P(T_MM) P(T_SS)

where A, T_MM and T_SS can be handled as independent.

P(·)                  Probability
A                     Hypothesis that the object as represented in the model database is present in the scene
α, β ∈ {MM, MS, SS}   Indices; M denotes model, S denotes scene
r_αβ,ij = k_l         The i'th feature on the j'th view is matched with the k'th feature on the l'th view
r_αβ,ij = ⊥_l         No match is found for the i'th feature on the j'th view
R_αβ                  Set of {r_αβ,ij}
T_αβ,jl               (Affine) viewpoint transformation between view j and view l
T_αβ                  Set of {T_αβ,jl}
t_ij                  Type of the i'th 2D feature on the j'th view
α_ij                  Attribute vector of the i'th 2D feature on the j'th view
µ_ij                  Position vector of the i'th 2D feature on the j'th view
Σ_ij                  Covariance matrix of the position vector of the i'th 2D feature on the j'th view
N_j, N_l              Number of training images
N_j(i ≠ ⊥)            Number of (training) images in which the i'th 2D feature is detected
N_j(i = ⊥)            Number of (training) images in which the i'th 2D feature is not detected

Table 4.5: Notation used in view based recognition

The probability P(R,T_MM,T_MS,T_SS) can be written as

P(R,T_MM,T_MS,T_SS) = P(R_MM,R_MS,R_SS,T_MM,T_MS,T_SS)
                    = P(R_MM,T_MM) P(R_MS,T_MS) P(R_SS,T_SS)
                    = P(R_MM|T_MM) P(T_MM) P(R_MS|T_MS) P(T_MS) P(R_SS|T_SS) P(T_SS)

Substituting these quantities back yields

P(A|R,T) = [ P(R_MM|A,T_MM) P(R_MS|A,T_MS) P(R_SS|A,T_SS) P(T_MS|A,T_MM,T_SS) P(A) P(T_MM) P(T_SS) ]
           / [ P(R_MM|T_MM) P(T_MM) P(R_MS|T_MS) P(T_MS) P(R_SS|T_SS) P(T_SS) ]

         = [ P(R_MM|A,T_MM) P(R_MS|A,T_MS) P(R_SS|A,T_SS) P(T_MS|A,T_MM,T_SS) P(A) ]
           / [ P(R_MM|T_MM) P(R_MS|T_MS) P(T_MS) P(R_SS|T_SS) ]

Since the object is always present in its model views,

P(R_MM|A,T_MM) = P(R_MM|T_MM)

and thus

P(A|R,T) = [ P(R_SS|A,T_SS) P(R_MS|A,T_MS) P(T_MS|A,T_MM,T_SS) P(A) ]
           / [ P(R_SS|T_SS) P(R_MS|T_MS) P(T_MS) ]

The model-scene viewpoint transformation is independent of the model-model viewpoint transformation, because the matching process during database creation is an off-line process. It is also supposed that the database creation process is "perfect"; hence every model-model viewpoint transformation is equally likely, so P(T_MM) is constant and can be omitted because it has no influence on the location of the maximum value.

P(T_MS|A,T_MM,T_SS) = P(T_MS|A,T_SS) P(T_MM) ≈ P(T_MS|A,T_SS)

Thus

P(A|R,T) = [ P(R_SS|A,T_SS) P(R_MS|A,T_MS) P(T_MS|A,T_SS) P(A) ]
           / [ P(R_SS|T_SS) P(R_MS|T_MS) P(T_MS) ]
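In practice a measure of this form is conveniently evaluated in logarithmic form. The sketch below shows one possible scoring function, assuming the individual factors (per feature, once the independence simplification introduced below is applied) are already available; the function and argument names are illustrative, not taken from the original implementation.

```python
import numpy as np

def log_quality(p_rss_given_A, p_rms_given_A, p_tms_given_A, p_A,
                p_rss, p_rms, p_tms):
    """Logarithm of the quality measure P(A|R,T) up to a constant factor.

    p_rss_given_A, p_rms_given_A : per-feature terms P(r_SS,ij|A,T_SS), P(r_MS,ij|A,T_MS)
    p_rss, p_rms                 : corresponding denominator terms without A
    p_tms_given_A, p_tms, p_A    : scalar terms P(T_MS|A,T_SS), P(T_MS), P(A)
    """
    def log_prod(p):
        p = np.atleast_1d(np.asarray(p, dtype=float))
        return float(np.sum(np.log(np.clip(p, 1e-12, 1.0))))

    numerator = log_prod(p_rss_given_A) + log_prod(p_rms_given_A) \
        + np.log(p_tms_given_A) + np.log(p_A)
    denominator = log_prod(p_rss) + log_prod(p_rms) + np.log(p_tms)
    return numerator - denominator
```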

An approximation of these probabilities must be used, because they are multidimensional joint probabilities, which are hard to represent in practice. They can, however, be approximated by using a feature-independence simplification (with similar assumptions as in the projective reconstruction).

Therefore a product of r_αβ,ij terms can be used instead of R_αβ, and the viewpoint transformations can be calculated independently. The features are described similarly to Section 2.6, with an additional covariance of the position (which was assumed to be a few pixels). As proposed in [80], features are represented by the type t_ij, the attribute vector α_ij, the mean µ_ij of the position and the covariance Σ_ij of the position. Using these notations, the probabilities are approximated as follows.

P(r_SS,ij|A,T_SS). There are two cases to consider.

r_SS,ij = ⊥_l means that the i'th feature on the j'th view of the scene is not matched in the l'th view, although the object is present in the scene. This can be approximated using the model data as

P(r_SS,ij = ⊥_l | A,T_SS) = [N_j(i ≠ ⊥) / N_j] · [N_l(k = ⊥) / N_l]
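A minimal sketch of this frequency-based approximation, assuming the training-image counts of Table 4.5 are available (the function signature is illustrative):

```python
def p_unmatched_given_A(n_j_detected, n_j_total, n_l_missing, n_l_total):
    """Approximate P(r_SS,ij = bottom_l | A, T_SS) from training-image counts:
    relative frequency of the i'th feature being detected in view j times the
    relative frequency of the corresponding feature being missing in view l."""
    return (n_j_detected / n_j_total) * (n_l_missing / n_l_total)
```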

r_SS,ij = k_l means that the i'th feature on the j'th view of the scene is matched with the k'th feature in the l'th view and the object is present in the scene.

P(r_SS,ij = k_l | A,T_SS) = P(r_SS,ij ≠ ⊥_l ∧ α_ij = α_kl ∧ µ_ij = T_SS,jl(µ_kl) | A,T_SS)
                          = P(α_ij = α_kl | r_SS,ij ≠ ⊥_l, µ_ij = T_SS,jl(µ_kl), A, T_SS)
                            × P(µ_ij = T_SS,jl(µ_kl) | r_SS,ij ≠ ⊥_l, A, T_SS) P(r_SS,ij ≠ ⊥_l | A,T_SS)

A further simplification can be made because the attribute values and the transformations are independent, so the first term reduces and the product becomes

P(α_ij = α_kl | r_SS,ij ≠ ⊥_l, A) P(µ_ij = T_SS,jl(µ_kl) | r_SS,ij ≠ ⊥_l, A, T_SS) P(r_SS,ij ≠ ⊥_l | A,T_SS)

where the last term can be approximated as

P(r_SS,ij ≠ ⊥_l | A,T_SS) = [N_j(i ≠ ⊥) / N_j] · [N_l(k ≠ ⊥) / N_l]

The approximation of the first two terms is detailed later.
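Putting the three factors together, a hedged sketch of the matched-feature probability could look as follows; the attribute and position terms are assumed to be computed as described later in this section, and the names are illustrative.

```python
def p_match_given_A(p_attr, p_pos, n_j_det, n_j, n_l_det, n_l):
    """Approximate P(r_SS,ij = k_l | A, T_SS) as the product of the attribute
    term, the position term and the detection term.

    p_attr : P(alpha_ij = alpha_kl | match, A), attribute agreement
    p_pos  : P(mu_ij = T_SS,jl(mu_kl) | match, A, T_SS), position agreement
    """
    p_detected = (n_j_det / n_j) * (n_l_det / n_l)   # P(r_SS,ij != bottom_l | A, T_SS)
    return p_attr * p_pos * p_detected
```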

The probability P(r_MS,ij|A,T_MS) can formally be calculated in the same way as P(r_SS,ij|A,T_SS); the difference lies in the calculation of the transformations.

P(T_MS|A,T_SS). In case of a 2D model database, T_MS and T_SS can be handled as independent, because the matching process between the scene views and the matching process between the model and scene views are independent, and so are the projections between the model and scene views. Therefore P(T_MS|A,T_SS) = P(T_MS|A). Supposing that the model positions are equally likely, this equals P(T_MS).

P(A) is the probability of the hypothesis that the object is present in the scene. This can be calculated from a priori information or taken as constant. In the second case the location of the maximum of the quality measure does not depend on P(A).

The calculation of P(r_SS,ij = k_l | T_SS) is similar to the P(r_SS,ij = k_l | A,T_SS) case, but it does not depend on the presence of the object. There are two cases to consider.

r_SS,ij = ⊥_l means that the i'th feature on the j'th view of the scene is not matched in the l'th view. Supposing a uniform distribution, this probability is constant.

r_SS,ij = k_l means that the i'th feature on the j'th view of the scene is matched with the k'th feature in the l'th view.

P(r_SS,ij = k_l | T_SS) = P(r_SS,ij ≠ ⊥_l ∧ α_ij = α_kl ∧ µ_ij = T_SS,jl(µ_kl) | T_SS)
                        = P(α_ij = α_kl | r_SS,ij ≠ ⊥_l, µ_ij = T_SS,jl(µ_kl), T_SS)
                          × P(µ_ij = T_SS,jl(µ_kl) | r_SS,ij ≠ ⊥_l, T_SS) P(r_SS,ij ≠ ⊥_l | T_SS)

A further simplification can be made because the attribute values and the transformations are independent, so the first term reduces and the product becomes

P(α_ij = α_kl | r_SS,ij ≠ ⊥_l) P(µ_ij = T_SS,jl(µ_kl) | r_SS,ij ≠ ⊥_l, T_SS) P(r_SS,ij ≠ ⊥_l | T_SS)

where the first term can be approximated as

P(α_ij = α_kl | r_SS,ij ≠ ⊥_l) ≈ N(α_ij = α_kl) / N(r_SS,ij ≠ ⊥_l)

The approximation of the last two terms is detailed later.

The calculation of P(r_MS,ij = k_l | T_MS) can be carried out as in the case of P(r_SS,ij = k_l | T_SS); the difference lies in the calculation of the transformations.

P(T_MS). Because usually there is no a priori information about the transformation, it is supposed that every transformation is equally likely, so this term is constant.

The calculation of the conditional probability of the pose of the features requires further approximation. Most feature detectors return only the best estimate of the detected feature, not its distribution. For mathematical convenience it can therefore be supposed that the distribution of a feature around its mean (best) position is Gaussian; the practical justification comes from the central limit theorem. An affine transformation is assumed between the scene and the model views. The effect of the transformation t = [a b c d x y]^T can be described linearly in two ways, where µ = [u v β γ]^T or µ = [u v θ s]^T are the representations of the pose of the 2D features: u, v is the position, θ = arctan(γ/β) is the orientation and s = sqrt(β² + γ²) is the scaling. Introducing β and γ in this way, the effect of the transformation can be written in linear form

µ = [ u v 0 0 1 0 ] [ a ]
    [ 0 0 u v 0 1 ] [ b ]
    [ β γ 0 0 0 0 ] [ c ]  = M_µ t
    [ 0 0 β γ 0 0 ] [ d ]
                    [ x ]
                    [ y ]

or

µ = [ a b 0 0 x 0 ] [ u ]
    [ c d 0 0 0 y ] [ v ]
    [ 0 0 a b 0 0 ] [ β ]  = M_t µ + x_t
    [ 0 0 c d 0 0 ] [ γ ]
                    [ 1 ]
                    [ 1 ]

where M_t denotes the upper left 4 × 4 block of the matrix and x_t = [x y 0 0]^T.
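A minimal numeric sketch of the two equivalent linear forms is given below (illustrative code; the left-hand µ is understood as the transformed pose, and β = s·cos θ, γ = s·sin θ follows from the definitions of θ and s above).

```python
import numpy as np

def pose_to_beta_gamma(u, v, theta, s):
    """Representational transformation (theta, s) -> (beta, gamma)."""
    return np.array([u, v, s * np.cos(theta), s * np.sin(theta)])

def M_mu(pose):
    """Matrix M_mu built from the pose (u, v, beta, gamma); transformed pose = M_mu @ t."""
    u, v, b, g = pose
    return np.array([[u, v, 0, 0, 1, 0],
                     [0, 0, u, v, 0, 1],
                     [b, g, 0, 0, 0, 0],
                     [0, 0, b, g, 0, 0]], dtype=float)

def apply_t(pose, t):
    """Second form: transformed pose = M_t @ mu + x_t with t = (a, b, c, d, x, y)."""
    a, b, c, d, x, y = t
    M_t = np.array([[a, b, 0, 0],
                    [c, d, 0, 0],
                    [0, 0, a, b],
                    [0, 0, c, d]], dtype=float)
    x_t = np.array([x, y, 0.0, 0.0])
    return M_t @ pose + x_t

# Both forms give the same transformed pose:
pose = pose_to_beta_gamma(10.0, 20.0, np.pi / 6, 2.0)
t = np.array([1.1, -0.2, 0.3, 0.9, 5.0, -3.0])
assert np.allclose(M_mu(pose) @ t, apply_t(pose, t))
```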

The uncertainty of the viewpoint transformation is also incorporated into the calculation. To avoid handling distributions of different dimensions (4 and 6), only the a, b, c, d components of the transformation are treated as random variables. The determination of the conditional probability of the pose of a feature thus consists of the representational transformation (θ, s → β, γ), the viewpoint transformation from the scene view to the model view, and the incorporation of the uncertainty of the transformation. The result is the probability distribution of the scene feature. This is not a Gaussian distribution, so it is handled numerically rather than symbolically. This distribution can be compared to the probability distribution of the model feature in order to obtain the desired probability.
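One possible numerical treatment, consistent with the description above but not necessarily the original implementation, is Monte Carlo propagation of the uncertain a, b, c, d components through the linear form given earlier:

```python
import numpy as np

def propagate_pose_distribution(pose, t_mean, cov_abcd, n_samples=2000, seed=0):
    """Numerically approximate the distribution of the transformed scene pose.

    Only the a, b, c, d components are treated as random (Gaussian around
    t_mean[:4] with covariance cov_abcd); x and y are kept fixed, as described
    in the text. Returns n_samples transformed poses (u, v, beta, gamma)."""
    u, v, b, g = pose
    M = np.array([[u, v, 0, 0, 1, 0],
                  [0, 0, u, v, 0, 1],
                  [b, g, 0, 0, 0, 0],
                  [0, 0, b, g, 0, 0]], dtype=float)   # M_mu of the linear form above
    rng = np.random.default_rng(seed)
    abcd = rng.multivariate_normal(np.asarray(t_mean)[:4], cov_abcd, size=n_samples)
    xy = np.tile(np.asarray(t_mean)[4:], (n_samples, 1))
    return np.hstack([abcd, xy]) @ M.T                # each row is M_mu @ t_sample
```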

To calculate the transformation and determine the possible pairings, prediction-verification with a tree-search method is applied. A new feature pair is added to the appropriate node of the tree if the features have the same type, the distance between the feature parameters is below a threshold, the transformation error is small and the insertion is consistent with the current content of the tree, as sketched below.
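A schematic version of this acceptance test follows; the thresholds are placeholders and the field names reuse the illustrative ViewFeature2D structure from the earlier sketch, not the actual implementation.

```python
import numpy as np

def can_add_pair(scene_feat, model_feat, node_pairs,
                 transform_error, attr_threshold=0.5, error_threshold=2.0):
    """Accept a (scene, model) feature pair at a tree node if the four
    conditions from the text hold (illustrative thresholds)."""
    same_type = scene_feat.ftype == model_feat.ftype
    close_params = np.linalg.norm(scene_feat.attributes - model_feat.attributes) < attr_threshold
    small_error = transform_error < error_threshold
    # Consistency: neither feature is already paired in this branch of the tree.
    consistent = all(scene_feat is not s and model_feat is not m for s, m in node_pairs)
    return same_type and close_params and small_error and consistent
```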

The transformation is determined by the previous state of the tree. In the first step the insertion is based only on the feature parameters. After every insertion phase the transformation is updated using a Kalman filtering technique. In this case the filter equations are much simpler, because the state vector is constant in time. The original filter equations are

x(k+1) = A(k) x(k) + B(k) u(k) + w(k),   P(w) ~ N(0, Q)
z(k)   = H(k) x(k) + v(k),               P(v) ~ N(0, S)

In this case A(k) = I, B(k) = 0, z(k) is the model feature, x(k) is the transformation to be estimated and H(k) can be calculated from the actual scene feature:

M_µ = [ u v 0 0 1 0 ]
      [ 0 0 u v 0 1 ]
      [ β γ 0 0 0 0 ]
      [ 0 0 β γ 0 0 ]
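With A(k) = I and B(k) = 0, the prediction step is trivial and each accepted feature pair contributes one measurement update of the constant state. Below is a minimal sketch of that update step (a standard Kalman measurement update; the variable names are illustrative).

```python
import numpy as np

def kalman_update(x, P, z, H, S):
    """One Kalman measurement update for the constant state x (the transformation).

    x : current estimate of t = (a, b, c, d, x, y)
    P : covariance of the estimate
    z : measured model feature pose (u, v, beta, gamma)
    H : M_mu built from the matched scene feature
    S : measurement noise covariance (4 x 4)
    """
    innovation = z - H @ x
    S_inn = H @ P @ H.T + S
    K = P @ H.T @ np.linalg.inv(S_inn)            # Kalman gain
    x_new = x + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```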



















The last step of the recognition is to search for the object that gives the best representation, i.e. has the greatest probability.

It must be mentioned that the 2D features stored in the database were obtained from "average" images, produced from images taken from the same viewpoint under different lighting conditions.

Note that this method is not applied in the current implementation, because it requires more approximations and is more complicated than the 3D invariant based object recognition. The handling of 3D data must be incorporated into the system in any case (both for the scene and the database), because the required output is 3D metric information.