
2.2 Feature description

2.2.1 Scale invariant feature transform

Scale Invariant Feature Transform (SIFT) [18, 19] is a 128-dimensional image descriptor that enables efficient image matching and view-invariant visual object recognition, as it is invariant to scale, rotation, illumination and viewpoint. The method first detects interest points, then generates a descriptor of the local image structure by accumulating statistics of local gradient directions of image intensities. The high dimensionality captures large variation, therefore corresponding points can be matched efficiently between different images using this descriptor.
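As an illustration of such matching (not part of the method described in this section), the sketch below uses the SIFT implementation shipped with OpenCV, assuming the `opencv-python` package and two hypothetical input images `img1.png` and `img2.png`:

```python
import cv2

# Hypothetical input files: two views of the same scene, loaded in grayscale.
img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute 128-dimensional SIFT descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match with 2-nearest-neighbour search and a ratio test: keep a match
# only when it is clearly better than the second-best candidate.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} reliable correspondences")
```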

The four main steps of the descriptor extraction are the following:

1. scale-space extrema detection,
2. keypoint localization,
3. orientation assignment,
4. keypoint descriptor.

In the present approach, the SIFT method is used only for localizing the keypoints, therefore just the first two steps are executed to extract the keypoint locations. However, in the experimental part SIFT is applied for comparison; thus, the complete algorithm with all four steps is presented in detail.

2.2.1.1 Scale-space extrema detection

The first stage of the extraction attempts to find a 'characteristic scale' for features, therefore the image is represented by a family of smoothed images known as the scale space. The scale space is defined by the function:

\[
L(x, y, \sigma) = G(x, y, \sigma) \ast I(x, y), \qquad (2.1)
\]

where ∗ denotes the convolution operator, I(x, y) is the input image and G(x, y, σ) is the variable-scale Gaussian:

\[
G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right). \qquad (2.2)
\]

Stable keypoint locations in the scale space are then detected using the Difference of Gaussians (DoG) function D(x, y, σ), computed as the difference between two images whose scales differ by a factor of k:

\[
D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) \ast I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma). \qquad (2.3)
\]

The convolved images L(x, y, σ) are grouped by octave (an octave corresponds to a doubling of σ), and k is selected so that a fixed number of convolved images is obtained per octave. DoG images are computed from adjacent Gaussian-blurred images within each octave. Figure 2.1 shows the operation of this step. Extrema are then located by scanning each DoG image and identifying local minima and maxima. To detect such locations, each point is compared to its 8 neighbors on the same scale and to its 9 neighbors on each of the adjacent higher and lower scales (see Figure 2.2). If the point is the minimum or maximum among these 26 points, it is an extremum.

Figure 2.1: An octave of L(x, y, σ) images and construction of DoG images [19].
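As a concrete illustration, the following sketch builds one octave of the Gaussian scale space, forms the DoG images, and scans for 26-neighbor extrema. It assumes NumPy and SciPy (`scipy.ndimage.gaussian_filter`); the octave layout, the contrast pre-filter and the brute-force scan are simplifications, not the exact implementation of [19].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma=1.6, k=2 ** 0.5, num_scales=5, threshold=0.01):
    """Find scale-space extrema in one octave (brute-force sketch).

    `image` is expected to hold intensities in [0, 1]; sigma, k and
    num_scales are illustrative parameter choices.
    """
    image = image.astype(np.float64)
    # Gaussian scale space L(x, y, sigma) for one octave (Eq. 2.1).
    L = [gaussian_filter(image, sigma * k ** i) for i in range(num_scales)]
    # Difference of Gaussians between adjacent smoothed images (Eq. 2.3).
    D = [L[i + 1] - L[i] for i in range(num_scales - 1)]

    extrema = []
    for s in range(1, len(D) - 1):              # need a scale above and below
        for y in range(1, image.shape[0] - 1):
            for x in range(1, image.shape[1] - 1):
                v = D[s][y, x]
                if abs(v) <= threshold:          # cheap contrast pre-filter
                    continue
                # 26 neighbours: 8 on the same scale, 9 above, 9 below.
                cube = np.stack([d[y - 1:y + 2, x - 1:x + 2]
                                 for d in D[s - 1:s + 2]])
                if v == cube.max() or v == cube.min():
                    extrema.append((x, y, s))
    return extrema
```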

2.2.1.2 Keypoint localization

After extracting extremum points, this step attempts to eliminate points that have low contrast or are poorly localized along an edge. For the former case, the scale-space value at the extremum is used. The location of the extremum, x̂, is determined by taking the derivative of the Taylor expansion (up to the quadratic terms) of the D scale-space function (Eq. 2.3) with respect to x = (x, y, σ)^T and setting it to zero, giving:

\[
\hat{\mathbf{x}} = -\left( \frac{\partial^2 D}{\partial \mathbf{x}^2} \right)^{-1} \frac{\partial D}{\partial \mathbf{x}}. \qquad (2.4)
\]

Substituting x̂ back into the expansion gives the function value at the extremum:

\[
D(\hat{\mathbf{x}}) = D + \frac{1}{2} \frac{\partial D}{\partial \mathbf{x}}^{T} \hat{\mathbf{x}}. \qquad (2.5)
\]

The point is eliminated if |D(x̂)| is less than 0.03 (assuming image pixel values in the range [0, 1]).
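Equations 2.4 and 2.5 can be implemented directly with finite differences. The sketch below assumes `D` is a 3-D NumPy array of stacked DoG images indexed as `D[s, y, x]`; the function name and layout are illustrative only.

```python
import numpy as np

def refine_extremum(D, x, y, s):
    """Sub-pixel refinement via Eqs. 2.4 and 2.5 (illustrative sketch).

    `D` is a 3-D array of stacked DoG images indexed as D[s, y, x].
    Returns the offset x^ and the interpolated value D(x^).
    """
    # Gradient dD/dx = (dD/dx, dD/dy, dD/dsigma) by central differences.
    g = np.array([
        (D[s, y, x + 1] - D[s, y, x - 1]) / 2.0,
        (D[s, y + 1, x] - D[s, y - 1, x]) / 2.0,
        (D[s + 1, y, x] - D[s - 1, y, x]) / 2.0,
    ])
    # 3x3 Hessian d^2D/dx^2 by finite differences.
    dxx = D[s, y, x + 1] - 2.0 * D[s, y, x] + D[s, y, x - 1]
    dyy = D[s, y + 1, x] - 2.0 * D[s, y, x] + D[s, y - 1, x]
    dss = D[s + 1, y, x] - 2.0 * D[s, y, x] + D[s - 1, y, x]
    dxy = (D[s, y + 1, x + 1] - D[s, y + 1, x - 1]
           - D[s, y - 1, x + 1] + D[s, y - 1, x - 1]) / 4.0
    dxs = (D[s + 1, y, x + 1] - D[s + 1, y, x - 1]
           - D[s - 1, y, x + 1] + D[s - 1, y, x - 1]) / 4.0
    dys = (D[s + 1, y + 1, x] - D[s + 1, y - 1, x]
           - D[s - 1, y + 1, x] + D[s - 1, y - 1, x]) / 4.0
    H = np.array([[dxx, dxy, dxs],
                  [dxy, dyy, dys],
                  [dxs, dys, dss]])

    offset = -np.linalg.solve(H, g)         # Eq. 2.4: x^ = -(d2D/dx2)^-1 dD/dx
    value = D[s, y, x] + 0.5 * g @ offset   # Eq. 2.5: D(x^)
    return offset, value                    # reject if abs(value) < 0.03
```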

For the latter case (eliminating edge responses), an observation about principal curvatures can be exploited: an edge point has a large principal curvature across the edge but a small one in the perpendicular direction. The principal curvatures can be computed from the Hessian matrix H:

\[
\mathbf{H} = \begin{pmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{pmatrix},
\]

where the derivatives are estimated from differences of neighboring sample points, and the eigenvalues of H are proportional to the principal curvatures of D. As only the ratio of the eigenvalues is required, not their exact values, motivated by [24], the trace (Tr) and determinant (Det) of H are used. Denoting the larger eigenvalue by α₁, the smaller by α₂, and their ratio by r = α₁/α₂:

\[
\frac{\operatorname{Tr}(\mathbf{H})^2}{\operatorname{Det}(\mathbf{H})} = \frac{(\alpha_1 + \alpha_2)^2}{\alpha_1 \alpha_2} = \frac{(r\alpha_2 + \alpha_2)^2}{r\alpha_2^2} = \frac{(r + 1)^2}{r}. \qquad (2.6)
\]

If this ratio is above some threshold, the point is taken as an edge point and is therefore eliminated.
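A sketch of this test, applied to a single DoG image with the 2×2 spatial Hessian estimated by finite differences; the ratio threshold r = 10 is an assumed value, since the text above only requires "some threshold":

```python
import numpy as np

def is_edge_like(dog, x, y, r=10.0):
    """Edge-response test of Eq. 2.6 on a single DoG image (sketch).

    r = 10 is an assumed ratio threshold; the text only requires
    "some threshold".
    """
    # 2x2 spatial Hessian from finite differences of neighbouring samples.
    dxx = dog[y, x + 1] - 2.0 * dog[y, x] + dog[y, x - 1]
    dyy = dog[y + 1, x] - 2.0 * dog[y, x] + dog[y - 1, x]
    dxy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
           - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0

    tr = dxx + dyy                    # alpha_1 + alpha_2
    det = dxx * dyy - dxy ** 2        # alpha_1 * alpha_2
    if det <= 0:
        return True                   # curvatures differ in sign: reject
    # Point is edge-like when Tr(H)^2 / Det(H) reaches (r + 1)^2 / r.
    return tr ** 2 / det >= (r + 1) ** 2 / r
```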

Figure 2.2: Extremum detection: each point is compared with its 26 neighbors on 3 different scales [19].

2.2.1.3 Orientation assignment

After extracting the keypoints, this step assigns a consistent orientation to each of them based on local image properties. The keypoint descriptor can later be represented relative to this orientation, thereby achieving invariance to image rotation.

The scale of the keypoint is used to select the smoothed image L with the closest scale, so that all computations are performed in a scale-invariant manner. The gradient magnitude m(x, y) and orientation θ(x, y) are calculated using pixel differences:

\[
m(x, y) = \sqrt{ (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 }, \qquad (2.7)
\]

\[
\theta(x, y) = \tan^{-1}\left( \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \right). \qquad (2.8)
\]

A 36-bin orientation histogram is calculated from the gradient orientations of sample points within a region around the keypoint. The highest peak is located and used as the keypoint's orientation. If further local peaks exist with at least 80% of the height of the highest peak, the orientations of these peaks are also assigned to the keypoint, resulting in multiple orientations. This step ensures invariance to image location, scale and rotation.

Figure 2.3: The original image is on the left; the calculated keypoints with the assigned orientations can be seen on the right.

Figure 2.3 shows the result of orientation assignment. The cyan colored arrows indicate the orientation assigned to each keypoint. The size of the arrow is proportional to the magnitude.
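The orientation assignment for a single keypoint can be sketched as follows, assuming `L` is the Gaussian-smoothed image closest to the keypoint's scale; the Gaussian weighting of the sample contributions and the peak interpolation are omitted for brevity:

```python
import numpy as np

def keypoint_orientations(L, x, y, radius=8, num_bins=36, peak_ratio=0.8):
    """36-bin orientation histogram for one keypoint (simplified sketch).

    `L` is the Gaussian-smoothed image closest to the keypoint's scale;
    `radius` is an illustrative window size, and the Gaussian weighting
    of samples is omitted.
    """
    hist = np.zeros(num_bins)
    for j in range(-radius, radius + 1):
        for i in range(-radius, radius + 1):
            yy, xx = y + j, x + i
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            # Pixel-difference gradient: magnitude (Eq. 2.7), angle (Eq. 2.8).
            dx = L[yy, xx + 1] - L[yy, xx - 1]
            dy = L[yy + 1, xx] - L[yy - 1, xx]
            theta = np.arctan2(dy, dx)                        # in (-pi, pi]
            b = int((theta + np.pi) / (2 * np.pi) * num_bins) % num_bins
            hist[b] += np.hypot(dx, dy)                       # magnitude vote

    # Keep every local peak within 80% of the highest one: a keypoint may
    # therefore receive multiple orientations.
    angles = []
    for b in range(num_bins):
        left, right = hist[(b - 1) % num_bins], hist[(b + 1) % num_bins]
        if hist[b] > left and hist[b] > right \
                and hist[b] >= peak_ratio * hist.max():
            angles.append((b + 0.5) * 2.0 * np.pi / num_bins - np.pi)
    return angles
```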

2.2.1.4 Keypoint descriptor

The local gradient data calculated in the previous step is also used when creating the keypoint descriptors. The region around the keypoint is divided into 4×4-pixel subregions, and an 8-bin orientation histogram is calculated for each subregion (see Figure 2.4). This time the gradients are rotated by the previously computed orientation and weighted by a Gaussian function with variance equal to 1.5 times the keypoint's scale. The descriptor then becomes a vector of all the values of these histograms, resulting in 4 × 4 × 8 = 128 dimensions.

Finally, the vector is normalized to unit length, ensuring invariance to affine changes in illumination. For robustness to non-linear illumination changes as well, the vector elements are capped at 0.2 and the vector is then renormalized.
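The final vector construction and the two normalization passes can be sketched directly; `hists` is assumed to be the 4×4 grid of 8-bin subregion histograms computed as described above:

```python
import numpy as np

def finalize_descriptor(hists, clip=0.2, eps=1e-7):
    """Flatten, normalize, clip and renormalize the descriptor (sketch).

    `hists` is assumed to be the 4x4 grid of 8-bin subregion histograms.
    """
    v = np.asarray(hists, dtype=np.float64).reshape(128)  # 4 * 4 * 8 = 128
    v /= np.linalg.norm(v) + eps  # unit length: affine illumination invariance
    v = np.minimum(v, clip)       # cap elements at 0.2: non-linear robustness
    v /= np.linalg.norm(v) + eps  # renormalize to unit length
    return v
```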

The disadvantage of SIFT is its high dimensionality. However, this information content cannot be compressed without significant loss [25]. Similar local descriptors [26] also yield a batch of collected features. Some of them are defined as scale-invariant features, where zoomed attributes describe larger-scale connectivity (e.g., scaling in [19]).

Figure 2.4: Keypoint descriptor extraction: the gradient histogram is on the left, the calculated 128-dimensional descriptor is on the right.