
Budapest University of Technology and Economics Department of Control Engineering and Information Technology

Budapest, Hungary

STEREO IMAGE PROCESSING SYSTEM FOR ROBOTIC APPLICATIONS

Ph. D. Thesis

SZTEREÓ KÉPFELDOLGOZÓ RENDSZER ROBOTIKAI ALKALMAZÁSOKHOZ

Ph. D. értekezés

Ferenc Tél

Supervisor:

Prof. Dr. Béla Lantos

Department of Control Engineering and Information Technology Budapest University of Technology and Economics

February, 2009


Contents

1 Introduction 1

1.1 Subject of this thesis . . . . 1

1.2 Structure of the vision systems . . . . 2

1.3 Hierarchy of geometries . . . . 4

1.4 Structure of the thesis . . . . 6

2 Feature detection and matching 8

2.1 Overview . . . . 8

2.1.1 The image segmentation problem . . . . 8

2.1.2 Edge detection methods . . . . 8

2.1.3 Edge post processing techniques . . . . 9

2.1.4 Feature fitting overview . . . 10

2.1.5 Image matching techniques . . . 10

2.1.6 The aim of this Chapter . . . 13

2.2 Edge detection . . . 14

2.3 Edge splitting . . . 14

2.4 Feature fitting . . . 19

2.4.1 Line fitting . . . 19

2.4.2 Conic fitting . . . 22

2.4.3 Normalization of coordinates . . . 23

2.5 Feature refinement . . . 24

2.5.1 Merge of the features . . . 24

2.5.2 Update of the junctions . . . 24

2.6 Feature graph . . . 26

2.7 Feature matching . . . 27

2.7.1 Comparison of features . . . 28

2.7.2 Correlation based comparison . . . 28

2.7.3 Comparison of transformations . . . 28

2.7.4 Graph matching algorithm . . . 29

2.8 Results . . . 30

2.9 Conclusions . . . 48


3 Projective reconstruction 49

3.1 Overview . . . 49

3.1.1 Estimation of the multiple view tensors . . . 49

3.1.2 Factorization methods . . . 50

3.1.3 Bundle adjustment based methods . . . 50

3.1.4 The aim of this Chapter . . . 51

3.2 Reconstruction algorithm for points . . . 51

3.2.1 Nonlinear solution of the projection equation . . . 51

3.2.2 Decomposition of the projective depths . . . 52

3.2.3 Outline of the algorithm . . . 53

3.2.4 Initialization of the point reconstruction algorithm . . . 54

3.3 Reconstruction algorithm for points and lines . . . 55

3.3.1 Intersection . . . 55

3.3.2 Resection . . . 56

3.3.3 Initialization of the reconstruction algorithm for points and lines . . . 58

3.4 Minimization remarks . . . 59

3.4.1 Handling of missing data . . . 59

3.4.2 Robustification of the algorithm . . . 59

3.5 Quasi affine reconstruction . . . 59

3.6 Results . . . 61

3.7 Conclusions . . . 64

4 Object recognition 71

4.1 Overview . . . 71

4.1.1 Appearance-based methods . . . 71

4.1.2 Aspect graph methods . . . 72

4.1.3 Indexing based methods . . . 72

4.1.4 The aim of this Chapter . . . 73

4.2 Invariants . . . 73

4.2.1 Projective invariants . . . 74

4.2.2 Projective and permutation invariants . . . 77

4.3 Object database . . . 85

4.3.1 Metric definition and feature transformation . . . 85

4.3.2 Query into the database . . . 87

4.4 Verification . . . 87

4.4.1 Collineation between 3D feature sets . . . 87

4.4.2 Definition of the cost functions . . . 88

4.4.3 Solution for points only case . . . 89

4.4.4 Scale factor free equations . . . 89

4.4.5 Geometric solution . . . 90

4.4.6 Estimation of the collineation . . . 91


4.4.7 Verification with collineation . . . 92

4.5 Consolidation . . . 92

4.6 View based recognition . . . 93

4.7 Results . . . 98

4.8 Conclusions . . . 103

5 Euclidean reconstruction 104

5.1 Overview . . . 104

5.1.1 The aim of this Chapter . . . 105

5.2 Offline creation of database . . . 105

5.2.1 3D CAD program . . . 105

5.2.2 Output of the processed range image . . . 106

5.2.3 Output of a calibrated reconstruction . . . 106

5.3 Transformation decomposition . . . 106

5.3.1 Problem formulation . . . 106

5.3.2 Solution techniques . . . 107

5.3.3 Polar decomposition . . . 108

5.3.4 Alternative nonlinear method for transformation update . . . 110

5.4 Application in intelligent robot control systems . . . 111

5.5 Results . . . 111

5.6 Conclusions . . . 112

6 Summary 116

A Representation of geometric entities i

A.1 Points, lines, planes and transformations in projective spaces . . . . i

A.2 Plücker representation of 3D lines . . . . i

A.3 Relations between geometric entities . . . . ii

A.3.1 Line on plane in 3D . . . . ii

A.3.2 Distance between 3D lines . . . . ii

B Projections and view relationships iv

B.1 Projection matrix for points . . . . iv

B.2 Projection matrix for lines . . . . v

B.3 Epipolar geometry . . . . v

B.4 Trifocal tensor . . . . vi

C Cross ratio and generalizations vii

C.1 Cross ratio on the line . . . vii

C.2 Probability density function of cross ratio . . . vii

C.3 Permutations of the cross ratio . . . vii

C.4 Higher dimensional invariants . . . . ix


D The solutions of some important optimization problems x

D.1 Multiparameter quadratic optimization under constraints . . . . x

D.2 Norm square minimization under constraint . . . . x

E Overview of the X-tree xi

F Nonparametric clustering xii

F.1 FAMS: Fast adaptive mean shift . . . xii

F.2 NDD: Nonparametric density derivative . . . xii


List of Figures

1.1 Structure of the developed vision system . . . . 3

1.2 Hierarchy of geometries . . . . 4

2.1 Difficulties in correspondence problem . . . 11

2.2 Color image edge detection example, artificial image . . . 15

2.3 Color image edge detection example with a real image . . . 16

2.4 General conic vs. ellipse fitting in curvature calculation . . . 17

2.5 Counterexample for split edge chains applying global curvature thresholds . . . 18

2.6 Peak, step up and step down templates used for edge segmentation . . . 18

2.7 Synthetic profile based segmentation for two typical cases . . . 20

2.8 Vertex detection situations in case of Hough transform . . . 22

2.9 Merge situations of parallel lines . . . 24

2.10 Example of a feature graph . . . 27

2.11 Simulated scene 1, detected edges and fitted features on grayscale image . . . 31

2.12 Simulated scene 1, detected edges and fitted features on color image . . . 32

2.13 Simulated scene 2, detected edges and fitted features on grayscale image . . . 33

2.14 Simulated scene 2, detected edges and fitted features on color image . . . 34

2.15 Simulated scene, detected lines and vertices using Hough based method . . . 35

2.16 Real scene 1, detected edges and fitted features overlayed . . . 36

2.17 Real scene 2, detected edges and fitted features overlayed . . . 37

2.18 Segmentation of a real edge . . . 38

2.19 Simulated scene 1, matched feature graphs on grayscale image . . . 39

2.20 Simulated scene 1, matched feature graphs on color image . . . 40

2.21 Simulated scene 2, matched feature graphs on grayscale image . . . 41

2.22 Simulated scene 2, matched feature graphs on color image . . . 42

2.23 Real scene 1, feature graphs and matched nodes overlayed . . . 43

2.24 Real scene 2, feature graphs and matched nodes overlayed . . . 44

2.25 Output of the Harris and SIFT detectors . . . 45

2.26 Output of the edge based feature detector . . . 46

2.27 Matched features with SIFT and edge based detectors . . . 47

3.1 Beaton-Tukey objective function . . . 60

3.2 Projective depth decomposition: reconstruction error vs. pixel noise . . . 62


3.3 Minimize algebraic distances: reconstruction error vs. pixel noise . . . 62

3.4 Projective depth decomposition: reconstruction error vs. number of features . . . 63

3.5 Minimize algebraic distances: reconstruction error vs. number of features . . . 63

3.6 Minimize algebraic distances: reconstruction error vs. number of cameras . . . 64

3.7 Projective reconstruction, simulated scene . . . 65

3.8 Two different views with reprojected features (real scene) . . . 66

3.9 Projective reconstruction, real scene (2) . . . 67

3.10 Projective reconstruction, real scene (3) . . . 68

3.11 Projective reconstruction, real scene with known objects . . . 69

4.1 Plot of the six possible functions of the cross ratio . . . 78

4.2 Stereographic projection of the cross ratio . . . 79

4.3 PDF of the cross ratios of the stereographic and the polynomial functions . . . 80

4.4 Some periodic functions applied to the cross ratio and its density . . . 81

4.5 Structure of the object database . . . 86

4.6 Projection of the multidimensional collineation votes . . . 93

4.7 Two types of information in view based recognition . . . 94

4.8 3D features overlayed as the output from a recognition process (simulated scene) . . . 99

4.9 Object recognition example in case of a simulated scene . . . 100

4.10 Object recognition, real scene . . . 101

4.11 Objects used in recognition experiments . . . 102

5.1 Reference frames used in the displacement calculation . . . 107

5.2 Transformation graph of the robot . . . 111

5.3 Reconstructed relative 3D scene description (simulated scene) . . . 113

5.4 Decomposition tolerance against element perturbation . . . 113

5.5 Reconstructed relative 3D scene description (real scene) . . . 114

B.1 Geometry and coordinate frames of the projection . . . . v

B.2 Epipolar geometry . . . . vi

B.3 Three view relationship . . . . vi

C.1 PDF of the cross ratio . . . viii

E.1 Hierarchical tree index structure . . . . xi


List of Tables

1.1 Problems and solutions in the vision systems . . . . 5

4.1 Important values of the cross ratio . . . 78

4.2 Possible pairings in the six points configuration . . . 83

4.3 Determine correspondences in six points configuration (example) . . . 83

4.4 Determine correspondences in six points configuration, fault tolerant version (example) . . . 84

4.5 Notation used in view based recognition . . . 95


Acknowledgements

This Ph. D. Thesis is the result of research work conducted at the Department of Control Engineering and Information Technology of the Budapest University of Technology and Economics.

I would like to express my gratitude to Prof. Dr. Béla Lantos, my supervisor at the Department of Control Engineering and Information Technology of the Budapest University of Technology and Economics, for the essential help he gave to support my research work and publications.

Furthermore, I shall acknowledge financial support from the Hungarian National Research Programs

(grant No. OTKA T 16855, T 29072, T 42534, K 71762) and NKTH RET 04/2004.


Declaration

I, the undersigned, Ferenc Tél, hereby state that this Ph. D. Thesis is my own work, wherein I have used only the sources listed in the References. All parts taken from other works, either as word for word citations or rewritten while keeping the original contents, have been unambiguously marked by a reference to the source.

Nyilatkozat

Alulírott Tél Ferenc kijelentem, hogy ezt a doktori értekezést magam készítettem és abban csak a megadott forrásokat használtam fel. Minden olyan részt, amelyet szó szerint, vagy azonos tartalomban, de átfogalmazva más forrásból átvettem, egyértelműen, a forrás megadásával megjelöltem.

Budapest, February 12, 2009.

. . . . Tél Ferenc

The reviews of this Ph. D. Thesis and the Record of Defense will be available in the Dean’s Office of the Faculty of Electrical Engineering and Informatics of the Budapest University of Technology and Economics.

Az értekezésről készült bírálatok és a védésről készült jegyzőkönyv a későbbiekben a Budapesti Műszaki és Gazdaságtudományi Egyetem Villamosmérnöki és Informatikai Karának Dékáni Hivatalában megtekinthetők.


Notations

The notations used throughout the thesis are summarized here.

General notations

a ≃ b : Equality up to a scale, a = β·b, β ≠ 0

a ≈ b : a is approximated by b

s : Real number

R : Space of the real numbers

v, v_n : n-length column vector

v(i) : i-th element of the vector

v(i:j) : subvector composed from the elements i…j

M, M_{m×n} : m×n matrix

M(i,j) : (i,j) element of the matrix

M(i:j, k:l) : slice of the matrix created from the elements of rows i…j and columns k…l

M_{m×n} = (r_1^T … r_m^T)^T = (c_1 … c_n) : Row/column representation of the matrix (r_i^T are the rows, c_j the columns)

⟨d_1 … d_n⟩ : Diagonal matrix

v^T, M^T : Transpose of the vector/matrix

M^{-1} : Inverse of a (square) matrix

M^+ : Moore-Penrose pseudoinverse of a matrix

|M| : Determinant of the (square) matrix

‖M‖_F : Frobenius norm of the matrix

[v]_× : 3D cross-product matrix composed from the 3-vector v

⟨v_1, v_2⟩ : Scalar product of the vectors v_1 and v_2

(ACB) : Angle at vertex C formed by the rays CA and CB

Projective entities

P_{3×4} = (p_1^T p_2^T p_3^T)^T : 3D projection matrix

H_{4×4} : 3D collineation matrix

q = (u v w)^T : Homogeneous coordinate vector of the 2D point q

l = (a b c)^T : Homogeneous coordinate vector of the 2D line l

Q = (X Y Z W)^T : Homogeneous coordinate vector of the 3D point Q

Λ : 4×4 skew-symmetric Plücker matrix of the 3D line L

L = (L(4,1) L(4,2) L(4,3) L(2,3) L(3,1) L(1,2))^T : 6-component Plücker vector of the 3D line L


Chapter 1

Introduction

1.1 Subject of this thesis

The main goal of this thesis is to describe the algorithms and methods developed as part of the stereo vision system at the Budapest University of Technology and Economics (BUTE), Department of Control Engineering and Information Technology (CEIT).

Building a general purpose vision system comparable to human vision is a very difficult task, because such a system requires a large amount of contextual knowledge that cannot be modeled in a computer. So far only some aspects of the general human vision and recognition system have been addressed and solved. This thesis addresses the handling of predefined, rigid, man-made objects usually found in industrial applications with robots. The advantage of man-made objects is that, in a restricted environment (as in the case of production/assembly robots), the data about the objects are usually available in a well structured form, for example as the output of a 3D CAD system complemented with a special program which determines the 3D geometric primitives, their invariants, etc.

The basic idea of the system was formulated many years ago. Examining the dominant object recognition methods at that time and taking into account the intended (industrial) application environment, the 3D geometrical approach was chosen, because this approach seemed to require fewer simplifying assumptions than 2D view based recognition systems (although a framework for a view based system has also been worked out and partially implemented).

During the research the motivation was to address the theoretically least restricted case of the imaging process, develop the full theory according to this assumption, and determine the limits of its applicability. Therefore the information about the environment is collected with simple passive (color) cameras; no structured light or range sensors are applied. Theoretically the data acquisition process can be decomposed into several stages. For example, off-line available information can be used to calibrate the cameras, and then the calibrated cameras can be used for the reconstruction. This could simplify the problems of the different stages, but it needs a special object database concept.

The developed system follows another concept. The cameras are assumed to be uncalibrated; no off-line calibration process is assumed. During the development it is supposed that the scene is an industrial one (for example the environment of a material moving robot), the types of the objects are predefined and rigid, and the number of occurring objects is limited. There can be occlusions and reflections, and the lighting is not further specified or limited. It is also assumed that the scene is static or the images are taken simultaneously by multiple cameras (snapshot). Apart from the off-line calibration process, two possible ways exist for the determination of metric information. The first way is to impose some restrictions on the intrinsic parameters of the cameras (for example a known aspect ratio) and achieve autocalibration. The second way is the complementary process, where nothing is supposed about the cameras.

The “price” appears on the scene side: the restrictions and the required additional information are assumed to be known about the environment. This requires the recognition of known objects in the scene using 2D images taken by uncalibrated cameras and a predefined 3D object database. This means that the system handles only those objects whose description was previously stored in the database.


In the most general projective framework only simple geometric primitives can be handled in a practical way. The simplicity also makes it possible to apply standard mathematical and optimization tools. Introducing additional information into the modeling requires the handling of more complicated dependencies. Therefore in the implemented system it is supposed that the objects can be described by simple 3D geometric primitives (points and line segments) whose accurate metric coordinates are stored in the object database. The database can also contain technological information, for example the preferred gripping position and orientation of the object. Hence, by recognizing the object in the scene and determining the transformation between the object in the scene and in the database, the desired gripping pose of the object in the scene can be computed and the robot can perform the task with the recognized object.

Though some results developed in this thesis (the object recognition algorithm) could be used for other types of imaging systems (for example 3D range images), it is assumed that the source of information is a set of raw 2D camera images taken by uncalibrated (auto-zoomed, etc.) cameras.

It must be mentioned that many other solutions are possible for every processing phase applied in this system. The detailed and systematic comparison of the (best) available algorithms was not the goal of the work, though a simple comparison to the available methods is discussed where applicable.

The thesis does not deal with how the data (type, position and orientation of the recognized objects) produced by the system are applied in intelligent robot control methods; it only shows where and how the information could be inserted into a robot control graph.

1.2 Structure of the vision systems

Generally, the aim of image processing systems is to collect data (images) about the environment, extract the important parts, achieve some simplifications and transformations, and give a compact description of the surrounding world.

Depending on the application that uses this description, the output could have many forms from enhanced images to 3D description.

The flow of the processing can be described as a hierarchical system where the upper levels use information produced by lower levels, see Figure 1.1. This bottom-up direction of processing is data driven. In some cases, for example verifying a hypothesis, the direction of processing could be top-down when a-priori knowledge is applied to the available data.

The classification of processing steps into low, intermediate and high levels is customary; the borders are not fixed, they depend on the application.

The input of a vision system is a set of intensity/color 2D images taken of the scene with uncalibrated camera(s). The images could be produced by the same camera moving around the scene or by a set of cameras looking at the scene from different viewing positions. Almost always the first step is a low level (early vision) method that performs noise filtering, image enhancement and segmentation. The information reduction in this step is large: starting from imaging information of megabyte size, the output can be described as a set of features requiring only a few kilobytes. The algorithms in this thesis use local features such as edges and vertices, which are based only on a small portion of the image.

The next step in the processing chain is the intermediate step, where intermediate means that imaging information could be used but the processing is not localized within a single image, inter image relationships are determined. In a stereo system this means that the corresponding features, which are the projections of the same spatial feature in different images, are matched.

The high level tasks are usually detached from the raw imaging information; only the abstract description of the scene is used. These processes produce scene descriptions using "abstract", high level spatial features. With a-priori information, additional tasks such as object recognition can be accomplished. The output of this level is usually application specific and can be directly used for specific purposes.


Figure 1.1: Structure of the developed vision system


Figure 1.2: Hierarchy of geometries

A summary of the described problems and possible solutions can be found in Table 1.1. The detailed overview of other possible solutions can be found in the first Section (Overview) of every Chapter.

1.3 Hierarchy of geometries

Throughout the thesis the pinhole camera model is used, see Appendix B.1. In order to describe the imaging process and the relationships between different reference frames, homogeneous coordinates are used. This yields a simpler (linear) description and allows the usage of more powerful, more general geometries than metric (Euclidean) geometry.

Note that the pinhole model does not take into consideration non-linear disturbance effects such as radial distortion or image blurring.

The classification of the hierarchy of geometries is based on the entities remaining invariant after application of a transformation permitted in the given geometry. It can be seen that the more general the transformation is, the more restricted is the set of invariants. It must be mentioned that the set of transformations is reduced to linear transformations, namely the mapping can be described by linear equations.

Euclidean geometry and similarity

The most widely used geometry in everyday life is the Euclidean one. The permitted transformations are (combinations of) rotations and translations, hence the number of degrees of freedom is 6 (3 for rotation, 3 for translation).

These transformations preserve angles, parallelism and absolute distances between the respective entities. This means that the deformation of the shape of an entity is not possible.

In the case of similarity transformations isotropic scaling is also allowed, the number of degrees of freedom is 7, and distance preservation becomes relative. This means that the ratio of distances remains constant.

Problem: Reduction of the imaging information, the extraction of primitives.
Possible solutions: There are many types of primitives; the grouping is usually based on similarity within the images. The primitives can be point, curve or area based. According to this classification there are salient point, edge and blob detectors. If higher order primitives are used, they are usually combinations of the basic ones.
Solution proposed in this thesis: This system uses an edge based feature detector. The point features are determined as the meeting points of simple line or second order curve type features. Edges are caused by many situations in the image: geometric boundaries of the objects, textures, spurious effects (shadows, shining parts), etc. Among these only the geometric boundaries are useful in this case; the other types of edges increase the difficulty and the combinatorial size of the problem. The advantage of this method is that it reduces the number of features (typically in a restricted, man-made environment); the disadvantage is that not every type of scene can be well described using edge-like information.

Problem: Solution of the correspondence problem, image registration.
Possible solutions: The correspondence problem can be solved using the image information directly (correlation based algorithms), or using the primitives or descriptors extracted from the images (feature/descriptor based methods). Using only the information available from simple primitives is not enough to uniquely determine the corresponding primitive in the other image. Therefore additional constraints are applied based on geometry (e.g. the fundamental matrix with RANSAC estimation, affine relations between SIFT descriptors extracted from the image). Another possibility is to handle the image sequence as a flow (KLT feature tracker).
Solution proposed in this thesis: This system uses a graph matching algorithm as a framework. The similarity is based on parametric and intensity (image) information. The additional information is stored in the graph itself (structural constraints).

Problem: Projective reconstruction from image correspondences.
Possible solutions: There are many projective reconstruction algorithms. The most mature ones exploit the determination of the fundamental matrix using point pairs only. Later algorithms calculate the trifocal tensor or jointly estimate the structure and viewing parameters (bundle adjustment methods).
Solution proposed in this thesis: This system uses a "bundle adjustment"-like algorithm exploiting a special parameterization and iterative minimization.

Problem: Determination of the Euclidean structure.
Possible solutions: Many methods use different restrictions on the (intrinsic) camera parameters. The most restrictive ones assume orthographic projection; more flexible methods use autocalibration.
Solution proposed in this thesis: This system uses the general pinhole camera model. The constraints are enforced using scene information (object recognition).

Problem: Object recognition.
Possible solutions: There are numerous object recognition systems and methods. These are usually task specific. Many classifications exist, based for example on the applied dimensionality (2D, 3D), the type of the usable objects (rigid/flexible), etc.
Solution proposed in this thesis: This system uses projective invariant based object recognition, where the invariants are determined from the predefined object data or scene information.

Table 1.1: Problems and solutions in the vision systems

The transformation can be described in matrix form as

\[ H_{\mathrm{similarity}} = \begin{pmatrix} s\,R_{3\times 3} & t_3 \\ 0_3^T & 1 \end{pmatrix} \]

where t is a translation vector and s represents the scaling (s = 1 for Euclidean transformations). R is a 3×3 orthogonal (rotation) matrix, which means that R^T R = R R^T = I and |R| = 1.

Affine geometry

Replacing the rotation matrix R with a more general matrix A (only |A| ≠ 0 is required), a wider class of transformations can be represented. The permitted transformations are (combinations of) rotation, translation, shear and (anisotropic) scaling, therefore the number of degrees of freedom is 12.

An affine transformation only preserves the collinearity of points, parallelism, and the ratio of distances along a line.

The transformation can be described in matrix form as

\[ H_{\mathrm{affine}} = \begin{pmatrix} A_{3\times 3} & t_3 \\ 0_3^T & 1 \end{pmatrix} \]

Projective geometry

The most general class of transformations is the class of projective transformations (collineations). Every (invertible) 4×4 matrix describes a permissible mapping, but the transformation is defined only up to a nonzero scale factor, hence the number of degrees of freedom is 15.

Projective geometry treats the elements (points, lines, planes) at infinity like any other finite elements. This implies that a collineation in general does not leave the line (2D) or plane (3D) at infinity unchanged.

The only preserved value is the cross ratio, see Appendix C.
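As a small numerical illustration of this invariance, the sketch below (a hypothetical helper, not part of the thesis software) evaluates one common (A,B;C,D) convention of the cross ratio for four collinear points given in homogeneous coordinates; Appendix C may use a different ordering convention.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross ratio (A,B;C,D) of four collinear points given as homogeneous 2-vectors."""
    det = lambda p, q: p[0] * q[1] - p[1] * q[0]
    return (det(a, c) * det(b, d)) / (det(a, d) * det(b, c))

# Points 0, 1, 2, 3 on the affine line, written as homogeneous pairs (x, 1):
pts = [np.array([x, 1.0]) for x in (0.0, 1.0, 2.0, 3.0)]
print(cross_ratio(*pts))                    # (0-2)(1-3) / ((0-3)(1-2)) = 4/3

# Applying any invertible 2x2 matrix (a 1D projective transformation) to all
# four points leaves the value unchanged.
H = np.array([[2.0, 1.0], [3.0, 5.0]])
print(cross_ratio(*[H @ p for p in pts]))   # still 4/3
```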

It must be mentioned that in general projective geometry it is not possible to define the notion of "betweenness", because for example in 3D space the projective lines are topologically circular, they are closed at infinity. Hence there is no unique line segment that connects two points Q_A and Q_B; there are two possibilities: the "normal" segment Q_A → Q_B and the segment Q_A → ∞ → Q_B. This ambiguity is resolved by fixing the hyperplane at infinity, which yields affine geometry.

1.4 Structure of the thesis

Chapter 2 describes the developed low level vision process, which takes camera images and performs edge and vertex detection and feature refinement. Using the graph built from the features and the intra-image relationships, the stereo matching determines the correspondences between features in different images.

Using the 2D scene descriptions produced during early vision, the projective reconstruction method detailed in Chapter 3 recovers the 3D projective structure of the surrounding world. The information produced in this phase contains no metric (Euclidean) information.

The next phase in the processing chain is the object recognition that combines a-priori information stored in a database with the scene description. The output is the recognized set of objects, as shown in Chapter 4.

Using the different object coordinate frames and the projective description of the scene, Chapter 5 describes the method developed to calculate the relative positions and orientations between recognized objects.

Finally, in Chapter 6 the results of this thesis are briefly summarized, and possible further improvements and research suggestions are discussed.


A short survey of the most important basic ideas and tools of stereo image processing used in the Thesis is collected in the Appendix. In order to increase the readability of the Thesis, in some cases the details of proofs are also moved to the Appendix.

Every chapter starts with an overview including a survey of the literature, then it describes the main steps of the developed algorithms and methods and the results produced by the system. At the end of each chapter there is a conclusion summarizing the author's own results and the publications covering them.


Chapter 2

Feature detection and matching

2.1 Overview

The aim of low level image processing (early vision) is to extract useful information from the raw camera images and to reduce the amount (and dimensionality) of the data. This simplification process yields a parametric description of the image and extracts those parts of the images that may be important during the subsequent steps.

Stereo vision systems require multiple (more than one) images of the scene and the determination of corresponding entities across the images.

2.1.1 The image segmentation problem

The first step in a vision system is the grouping of the image pixels into larger, meaningful entities that share some common properties (e.g. intensity, color, etc.). Depending on the produced information type, the available methods yield either segments containing the pixels inside a particular region (region based approach) or the boundary between regions (boundary approach). The application of these approaches is usually task dependent.

The basic idea of the region based approaches is that the grouping of the image data is based on similarity: the pixels inside a region are similar to each other according to the comparison function.

The traditional algorithms use (multilevel) histogram thresholding, split and merge, see [82] Chapter 17 for a survey. The more advanced algorithms perform region growing (MSER [27]), clustering (Mean-shift [21]) and graph partitioning (normalized cuts [92], [96]).

The boundary based approaches usually determine the discontinuities in the image field, which are usually the projections of important entities in the 3D scene, e.g. object boundaries or range discontinuities. In the areas where some descriptive information is changing, the first or higher order derivatives (and their approximations) are used to determine the transitional areas.

Many algorithms determine salient points of the image. The traditional Harris vertex detector [45] localizes the vertex points by analyzing the eigenvalues of the second-moment matrix built from the first derivatives of the image intensity function. More advanced methods use scale space [64] and different transformations such as wavelets [90], or descriptors such as SIFT [67] and its variants [74].

2.1.2 Edge detection methods

The edge detection methods belong to the boundary based algorithms. In order to reduce the effect of (high frequency) noise, the differentiation is combined with a low-pass filtering process. The filtering is based on averaging or Gaussian filtering.


Gray scale edge detection

In the case of gray scale images only the intensity information is used to detect edges. In the traditional edge detector algorithms the edge maps are calculated by applying a small (typically 3×3 or 5×5) convolution mask such as Sobel [94] or Prewitt (using first derivatives), or LoG (Laplacian of Gaussian, using second derivatives), see [82] Chapter 15.2-3.

Canny [16] proposed a more advanced algorithm defining criteria regarding good detection, good localization and minimal response.

Color edge detection

The original segmentation or edge detection algorithms are defined for gray scale images, but methods are available to extend the usage to color images.

The output fusion methods apply an optional color space transformation [19], run a gray scale method on one (intensity) or more channels, and merge the results together.

The simplest way to get a gray scale representation of a color image is to calculate the average of the three channels, I = (R+G+B)/3, but this usually gives poor results with regard to human perception. A more widely used conversion is the one proposed by NTSC: I = 0.2989R + 0.5870G + 0.1140B. These weights are chosen according to the different relative sensitivities of human perception to each of the primary colors [82]. This weighting is used in the "lightness" channel of the CIE Lab transformation [54].

Another possible conversion is to use the "brightness" (value) from the HSB (or HSV) color space, or the "lightness" from the HSL color space. The transformation from RGB to HSB can be found for example in [81]. It must be mentioned that in this case not all color channels are combined for a pixel, because for a given RGB triplet the brightness is defined as I = max(R,G,B), and the HSL lightness is given as L = (max(R,G,B) + min(R,G,B))/2, where max() and min() yield the maximum and minimum of their arguments, respectively.
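For illustration, a minimal NumPy sketch of these channel combinations (the function name is ours, not from the thesis) might look as follows:

```python
import numpy as np

def to_gray_variants(rgb):
    """Return several scalar representations of an RGB image (H x W x 3, float in [0, 1])."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    avg = (r + g + b) / 3.0                          # simple channel average
    ntsc = 0.2989 * r + 0.5870 * g + 0.1140 * b      # NTSC / luma weighting
    value = np.max(rgb, axis=-1)                     # HSB/HSV "brightness" (value)
    lightness = (np.max(rgb, axis=-1) + np.min(rgb, axis=-1)) / 2.0  # HSL lightness
    return avg, ntsc, value, lightness
```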

More advanced methods are available using nonlinear mappings and dimension reduction (PCA, space filling curves).

Such a conversion is proposed for example in Color2Gray, [39].

The vector valued methods treat the vector space as-is, without any decomposition into channels. The vectorized version of the Canny edge detector determines the Jacobian matrix containing the derivatives with respect to the color channels,

\[ J = \begin{pmatrix} \frac{\partial R}{\partial u} & \frac{\partial R}{\partial v} \\ \frac{\partial G}{\partial u} & \frac{\partial G}{\partial v} \\ \frac{\partial B}{\partial u} & \frac{\partial B}{\partial v} \end{pmatrix} = \begin{pmatrix} \frac{\partial C}{\partial u} & \frac{\partial C}{\partial v} \end{pmatrix}. \]

Calculating the eigenvector of the matrix J^T J corresponding to the largest eigenvalue, the direction and magnitude of the edge can be determined as

\[ \theta = \frac{1}{2}\arctan\frac{2\,\bigl(\frac{\partial C}{\partial u}\bigr)^T\bigl(\frac{\partial C}{\partial v}\bigr)}{\bigl\|\frac{\partial C}{\partial u}\bigr\|^2 - \bigl\|\frac{\partial C}{\partial v}\bigr\|^2} \quad\text{and}\quad m = \sqrt{\Bigl\|\tfrac{\partial C}{\partial u}\Bigr\|^2\cos^2\theta + 2\Bigl(\tfrac{\partial C}{\partial u}\Bigr)^T\!\Bigl(\tfrac{\partial C}{\partial v}\Bigr)\cos\theta\sin\theta + \Bigl\|\tfrac{\partial C}{\partial v}\Bigr\|^2\sin^2\theta}, \]

respectively.
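A minimal NumPy sketch of this vector valued gradient computation is given below; simple finite differences stand in for the Gaussian-smoothed derivatives that a full color Canny implementation would use, and the function name is only illustrative.

```python
import numpy as np

def color_edge_strength(rgb):
    """Edge direction and magnitude from the color Jacobian of an RGB image (H x W x 3, float)."""
    # Per-channel derivatives along the image axes v (rows) and u (columns).
    dC_dv, dC_du = np.gradient(rgb, axis=(0, 1))
    # Entries of the 2x2 matrix J^T J, accumulated over the color channels.
    g_uu = np.sum(dC_du * dC_du, axis=-1)
    g_vv = np.sum(dC_dv * dC_dv, axis=-1)
    g_uv = np.sum(dC_du * dC_dv, axis=-1)
    # Direction of maximal change and the corresponding magnitude.
    theta = 0.5 * np.arctan2(2.0 * g_uv, g_uu - g_vv)
    m = np.sqrt(g_uu * np.cos(theta) ** 2
                + 2.0 * g_uv * np.cos(theta) * np.sin(theta)
                + g_vv * np.sin(theta) ** 2)
    return theta, m
```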

There are other methods that use vector order statistics to compute statistical measures [99], or calculate the distance and/or angle between vectors [108].

2.1.3 Edge post processing techniques

The detected segments or edges (edgels) are usually further processed in order to improve the quality and usability of the results. This thesis uses edges (boundary approach), hence a survey of the post-processing used in segment based methods is omitted here.

The edges produced by the derivative operators have some disadvantageous properties:

• The location of the transient areas is ambiguous, because the response of the detector is more than one pixel wide. This requires thinning of the output, which can be achieved using a morphological operator (skeletonization) or by applying non-maximum suppression.


• The produced curves are not continuous, they contain small gaps that should be eliminated (filled).

• The response of the convolution type detectors is a 2D map of the edges, not the list of connected, ordered edge chains. This requires the application of an edge following (tracing) method.

This post processing of the edgels must be achieved separately or can be built directly into the edge detector (Canny).

The edge chains must be further processed, because most of them are long, compound curves with possible branch (junction) points, therefore simple geometric entities (lines, conics) cannot be fitted onto them directly. Therefore the second step in the processing is to break the edges into pieces. There are two main types of possible algorithms. The first type uses iterative methods and splits edges if the fit error exceeds a predefined threshold value, see for example the split-and-merge procedures [78]. The drawback of this approach is that the quality of the splitting highly depends on a global threshold value, which should differ according to the size and other properties of the edge itself. The second type handles every pixel of the curve as a candidate and calculates the curvature using a local fit of a higher order curve onto the neighbouring pixels of the edge. Collecting the curvature data, the edges are split at points where the curvature value is maximal and exceeds a threshold (dominant points) [20], [43].

2.1.4 Feature fitting overview

The edges supplied by the edge splitting methods are suitable for fitting geometric primitives onto them, such as line segments, second order curves or splines. There are many methods that can be used to fit features onto edges, but they can be divided into two main groups. The methods in the first group use a voting scheme and the clustering of the voting results in order to get the best approximation. These methods are robust against outliers, but for larger dimensions the computational costs are high and the methods are slow. The best known member of this group is the Hough transformation [29], which can be used for lines. Although generalizations for higher order curves exist, they are seldom used because of the increasing number of parameters. The other group of methods tries to minimize an error (cost) function by using optimization techniques in order to get the best fit. For lines, linear or orthogonal regression methods [24] are used. The second order curves (conics) require special handling, mainly because of the internal constraints between the parameters to be estimated [97], [42], [95]. The applied cost function can represent algebraic or geometric distance, see the survey [113].

The last step of the low level processing is the generation of a description of the features extracted from images.

This description can simply be a list of detected features when no additional information is required. Many systems produce higher order features composed from the basic geometric primitives, such as junctions [68]. Newer methods use local descriptors instead of basic features, see SIFT [67] or GLOH [74]. In this case the key points of the image are characterized by (multidimensional) descriptors computed from the image properties surrounding the given point.

2.1.5 Image matching techniques

In order to build the 3D structure of the scene from 2D imaging information, generally a triangulation like process is required. This means that the corresponding entities must be determined in the participating images with pairing in image tuples or with tracking in image sequences.

This is a correspondence problem and its solution is required by a typical 3D vision system. The solution of this problem is difficult; it is possible that some (pixel) points have no correspondent in the other image. Many effects influence the process, see Figure 2.1:

• The cameras see different parts of the scene due to different intrinsic and geometric camera parameters or scene properties (for example occlusions).


Figure 2.1: Difficulties in correspondence problem

• There are parts of the image, that seem similar in the images, but represent different physical parts (apparent edge of a curved object).

• Differences between images due to illumination (shadows, sparkles).

Several heuristics, and hence constraints, are used to increase the efficiency and stability of the matching algorithms:

• Similarity constraint means that the properties of a 3D entity should yield similar values on different images.

• Uniqueness means that different entities must have different parameters. Hence the output of the matching algorithm must yield a one-to-one mapping.

• Positional constraints use different relationships (for example ordering or epipolar constraints) in order to reduce the search space during matching.

The main issues that have to be considered:

• What are the candidate matches, namely what type of information must be used during the procedure?

• How good is the matching, namely how to rate the candidates?

Two main classes of algorithms have been developed over time.

Correlation based methods

The traditional methods are based on correlation of image intensities. The matching is achieved using subwindows sliding over the image (or over a candidate area A), and the search assumes that the scene points of the correspondences have the same intensity (matte surfaces). Let 2w+1 be the width of the window, and let dH and dV be the horizontal and vertical displacements, respectively. There are several definitions of the cross correlation; these definitions


represent either similarity or dissimilarity. For example the normalized sum of squared differences

\[ C(i,j,d_H,d_V) = \frac{\sum_{k=-w}^{w}\sum_{l=-w}^{w}\bigl(I_L(i+k,\,j+l) - I_R(i+k-d_H,\,j+l-d_V)\bigr)^2}{\sqrt{\sum_{k=-w}^{w}\sum_{l=-w}^{w} I_L^2(i+k,\,j+l)\;\sum_{k=-w}^{w}\sum_{l=-w}^{w} I_R^2(i+k-d_H,\,j+l-d_V)}} \]

represents dissimilarity. Computing the C(i,j,d_H,d_V) values for each (d_H,d_V) displacement, the disparity map is generated as D = argmin_{(d_H,d_V)∈A} C(i,j,d_H,d_V).
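The following brute-force sketch (illustrative helper names, assuming grayscale float images) evaluates this normalized SSD cost for a set of candidate displacements and keeps the minimizer, following the index convention of the formula above:

```python
import numpy as np

def nssd(patch_left, patch_right):
    """Normalized sum of squared differences between two equally sized windows."""
    num = np.sum((patch_left - patch_right) ** 2)
    den = np.sqrt(np.sum(patch_left ** 2) * np.sum(patch_right ** 2)) + 1e-12
    return num / den

def best_displacement(I_L, I_R, i, j, w, candidates):
    """Pick the displacement (dH, dV) from `candidates` minimizing the NSSD cost at (i, j)."""
    ref = I_L[i - w:i + w + 1, j - w:j + w + 1]
    best, best_cost = None, np.inf
    for dH, dV in candidates:
        r0, c0 = i - dH - w, j - dV - w
        if r0 < 0 or c0 < 0 or r0 + 2 * w + 1 > I_R.shape[0] or c0 + 2 * w + 1 > I_R.shape[1]:
            continue                          # displacement would leave the image
        cand = I_R[r0:r0 + 2 * w + 1, c0:c0 + 2 * w + 1]
        cost = nssd(ref, cand)
        if cost < best_cost:
            best, best_cost = (dH, dV), cost
    return best, best_cost
```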

The main problems with the traditional implementation and proposed solutions are the following:

• In the general case this yields a computationally expensive 2D search and handles each feature independently (dropping the inter-relationship information about the features). In order to reduce the search area to 1D, epipolar lines or rectification are proposed [4], but these require prior knowledge about the projection, or iterative methods are needed [9].

• The success of the matching highly depends on how frequently the searched structure occurs in the other image.

It is hard to choose the size of the window (w). A too small window cannot capture enough image information and is sensitive to noise; on the other hand, a too large window is less sensitive to noise but more sensitive to variations in image intensity. There are improvements that propose an adaptive window size, see for example [58].

The main properties of the correlation based methods:

• Relatively easy implementation.

• Performs poorly when the viewpoints are far from each other due to illumination changes and apparent scaling of images.

• Performs poorly when the camera parameters differ strongly, due to the scaling of the images.

• Provides reliable data in well textured images, because correlation yields many false positives in homogeneous areas.

• Generates a dense disparity map.

A possible new solution was proposed in [25] that uses the cross correlation between SIFT descriptors instead of intensity values.

Feature based methods

The feature based class of methods matches similar features on different images. The basic features are usually edges (line segments), corner points and/or higher level features composed from points and lines (for example T-junctions) extracted from the raw camera images. The descriptors (δ_i) of the features used during the matching must fulfill the requirements described above: different features must have different descriptors (uniqueness), the used properties should be invariant to variation and degradation effects (invariance), and a descriptor must have enough discriminative power to be able to distinguish among different features (separation).

The similarity measure is usually defined as S(δ^a, δ^b) = 1 / (Σ_i w_i (δ_i^a − δ_i^b)²). The weighting factors w_i can be used to describe the importance of each descriptor and to compensate e.g. for the different scaling of the descriptors. For each feature the algorithm searches for the corresponding candidate that maximizes the similarity measure. The main properties of the feature based methods:

• Performs poorly when no reliable features can be extracted from the images (highly textured scene), due to illumination changes and apparent scaling of the images.


• Faster than the correlation based methods, because usually the number of (higher level) features is much smaller than the number of points of interest (POI) used in those methods.

• Performs better under illumination and scaling differences.

• Generates a sparse disparity map (smaller number of POIs).

At first sight it seems that increasing the number of basic feature types could improve the discriminative power of the algorithm. But such feature sets are usually created in an artificial way (X-junctions, T-junctions) and often have no physical correspondent in the scene.

Traditional matching algorithm

In the traditional case both classes of methods use the following simple matching algorithm. The input contains the following elements:

• Images to match.

• Candidate pixels (correlation based) or descriptors (feature based).

• Search region (ROI) parameters.

The algorithm consists of the following steps:

1. Compute the correlation/similarity between entities within the search region.

2. Select the pairs that maximize the measure.

3. Store the calculated pairs and disparity.
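A schematic sketch of these three steps, with the similarity measure and the search region test left as user supplied callables (all names are illustrative), could be:

```python
import numpy as np

def match_entities(entities_left, entities_right, similarity, in_roi):
    """Greedy matching: for every left entity pick the right candidate inside its
    search region that maximizes the similarity measure."""
    matches = []
    for a in entities_left:
        candidates = [b for b in entities_right if in_roi(a, b)]   # positional constraint
        if not candidates:
            continue
        scores = [similarity(a, b) for b in candidates]
        best = int(np.argmax(scores))
        matches.append((a, candidates[best], scores[best]))        # pair + similarity value
    return matches
```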

Graph matching algorithms

The graphs formed from the features and their relationships can also be used during matching. Due to the different viewpoints of the cameras, shadowing and other effects, the detected features in the respective images are never exactly the same. Hence the algorithm must handle an inexact matching problem. Traditional graph matching algorithms use a variant of A* search, which always yields the optimal solution but requires exponential time, because subgraph homomorphism is an NP-complete problem, see for example [73]. Therefore approximate solutions must be used.

Different techniques have been developed over time. The first group uses state-space methods (variants of tree search methods) having exponential complexity or, with the help of some heuristics, higher order polynomial complexity.

The second group uses nonlinear or combinatorial optimization techniques, such as genetic algorithms [22], re- laxation [63] or neural networks [55]. The method proposed in [38] uses an approximate solution to the quadratic assignment problem by using a continuation method, applying a graduated (non-convex) assignment and softassign to solve the assignment problem.

2.1.6 The aim of this Chapter

This Chapter describes the low level and image matching phases of the processing. The aim is to determine “important parts” of raw camera images and solve the stereo correspondence problem.

The first step of the processing is the extraction of the edges found in the images, which represent (among others) object boundaries, and their classification according to the feature types (points, lines) used in the stereo vision system.

In order to be able to use simple 2D geometric primitives, complex edge chains are split (segmented) into pieces based on curvature information. It will be shown that a simple global thresholding of the curvature function does not give accurate results. Hence a novel template based edge splitting method is developed, where the templates describe the typical shapes of the curvature function at which the splitting must be performed.


The system fits conics and lines onto the split edges based on algorithms proposed in the literature. Using the parametric description of the 2D features, further simplifications are achieved (segments are merged if possible), then the vertices are determined as the intersection points of the analytical curves.

The final step in this phase builds the feature graphs of the images and performs the image matching process based on these graphs. The novelty of the algorithm is that it gives a framework in which the advantages of different image matching algorithms can be combined. The current implementation uses the feature (parameter) based and correlation (window) based algorithms.

2.2 Edge detection

The first step in our image processing system is the determination of the discontinuities in the image amplitude attributes. These types of discontinuities usually indicate important information, such as the images of object boundaries.

For gray scale images, the system uses the enhanced (Vanduc) version of the Canny edge detector as proposed in [88] to extract edges from the raw camera images. The output of the detector is a linked list of edges, and each edge point is described by its coordinates u, v at subpixel accuracy and by the direction and the estimated magnitude of the gradient at that point. These two latter values are not used in the developed system.

In contrast to the gray scale case, color images are vector valued functions: a (u,v) coordinate has colorspace (RGB, HSB, etc.) vector values.

Most of the traditional feature detection methods use gray scale images (intensity-like information), therefore in order to handle color images two possibilities exist.

One solution is to perform a preprocessing step to get a scalar valued description of the image that can be used in traditional edge detection methods. Many different ways exist to get a scalar, single channel description of the image, see the short overview in Section 2.1.2. The usability of the methods depends on the image. Examining the results for the synthetic image depicted in Figure 2.2, it is not obvious which method is the best. In some cases the simple averaging gives the best results, while in other cases it yields the poorest results. An example with a real image can be found in Figure 2.3.

Another way is to develop an edge detection algorithm, that is able to use color (multiband) information.

The developed method uses a weighted combination of both values (as proposed e.g. in [109]), depending on the saturation.

2.3 Edge splitting

Edges can usually be composed as a collection of straight line segments and pieces of second order curves. The segment boundaries (vertices) are mainly those edge points at which the curvature of the edge is high compared to the other points on the edge.

The curvature can be estimated by fitting a curve onto the edge points in the neighbourhood of the given point. In this case the number of points used during the fitting is small, therefore, in order to reduce the differences between results, the class of curves is restricted to ellipses. The problem is depicted in Figure 2.4: the difference between the two data sets is only one point. In the first experiment it was set to (0,0), then moved to (0.1,0.25) in the second experiment.

(The location is denoted by an arrow in the figure.) The general conic fitting algorithm detects an ellipse in the first case (conic1) but a hyperbola in the second case (conic2). If the solution is constrained to be an ellipse, the outputs are two almost identical ellipses (ellipse1 and ellipse2).

Let the points used during the fitting be (u_{i−l}, v_{i−l}), …, (u_i, v_i), …, (u_{i+l}, v_{i+l}), where l denotes half of the window size (it determines the number of neighbours involved in the fit). Let the origin be located at (u_i, v_i); this means that the points are shifted by (−u_i, −v_i). The curvature values are calculated as the result of a local ellipse fitting (see the Conic fitting section) onto the neighbouring edge points, as follows.


Figure 2.2: Color image edge detection example, artificial image


Figure 2.3: Color image edge detection example with a real image.

Upper left: original image. Upper right: I=0.2989R+0.5870G+0.1140B.

Lower left: I=L of HSL. Lower right: Color edge detector


Figure 2.4: General conic vs. ellipse fitting in curvature calculation.

Perturbing one point (denoted by the arrow) yields an ellipse in the first case (conic1) and a hyperbola in the second case (conic2). Ellipse fitting yields almost identical ellipses (ellipse1, ellipse2).

Let the parametric equation of an ellipse centered at (c_u, c_v) and rotated by the angle φ be given as

\[ u = c_u + a\cos\varphi\cos t - b\sin\varphi\sin t, \qquad v = c_v + b\cos\varphi\sin t + a\sin\varphi\cos t. \]

This can be written in matrix form as

\[ \begin{pmatrix} -b\sin\varphi & a\cos\varphi \\ b\cos\varphi & a\sin\varphi \end{pmatrix} \begin{pmatrix} \sin t \\ \cos t \end{pmatrix} = \begin{pmatrix} u - c_u \\ v - c_v \end{pmatrix}. \]

Using the inversion formula for 2×2 matrices, the parameter belonging to (u,v) = (0,0) can be determined as

\[ t_0 = \arctan\frac{-a c_u \sin\varphi + a c_v \cos\varphi}{b c_u \cos\varphi + b c_v \sin\varphi}. \]

The curvature at t_0 is

\[ \kappa(t_0) = \frac{ab}{\left(b^2\cos^2 t_0 + a^2\sin^2 t_0\right)^{3/2}}. \]

This type of curvature calculation yields some averaging and robustification, but it also makes the curvature function smoother, suppressing real peaks. There is a tradeoff between the smoothing and the accuracy of the curvature values depending on the neighbourhood size. Consider the situation depicted in Figure 2.5, where the (average) curvature of the circle is 0.11 and the maximum curvature at the line intersection is 0.04, using the same settings during the calculations. Hence it turned out that global thresholding cannot be applied, because it does not detect two lines which are connected at an obtuse angle, while it does split a circular arc with small radius (high curvature). (The original size of the images is 256×256.)
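Assuming an ellipse fitting routine such as the one described in Section 2.4.2 is available, the curvature evaluation itself reduces to the two formulas above; a small NumPy sketch (function name ours) is:

```python
import numpy as np

def curvature_at_origin(a, b, phi, cu, cv):
    """Curvature of the ellipse (centre (cu, cv), semi-axes a, b, rotation phi)
    at the parameter value corresponding to the point (0, 0)."""
    t0 = np.arctan2(-a * cu * np.sin(phi) + a * cv * np.cos(phi),
                     b * cu * np.cos(phi) + b * cv * np.sin(phi))
    return a * b / (b ** 2 * np.cos(t0) ** 2 + a ** 2 * np.sin(t0) ** 2) ** 1.5
```

In practice this value is computed for every edge point after shifting the local window so that the point of interest lies at the origin.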


Figure 2.5: Counterexample for split edge chains applying global curvature thresholds.

A circle with small radius has larger local curvature than line sections meeting at an obtuse angle.


Figure 2.6: Peak, step up and step down templates used for edge segmentation


Therefore a template based edge segmentation was developed, using the fact that each interesting part of the curvature function (where a split of the edge chain is possible) has a special shape. These parts can be detected in a robust way by applying correlation with templates containing the expected shape.

Examining the situations, three main possible shape types occur. In the case of intersecting straight edges, the graph of the curvature values has a peak. Where a straight edge is connected to a curved one, or two curved edges are connected, the transition of the curvature values gives a ramp-like shape. Three main shapes (see Figure 2.6) are used with different parametrizations (namely varying the L1, L2 and L3 parameters) to detect segment boundary points on the curvature function. Using the correlation values, global thresholding can be applied. In the case of a large enough profile template, the correlation graph has well separated peaks.

Determining the peaks above the threshold and applying non-maximum suppression, the points at which the edge chain can be split are obtained.

In the case of the ramp-like shapes (up and down), one of them (up or down) should be enough to find the splits, but in that case the absolute value of the correlation must be used (for example the up shape is anticorrelated while the down shape is correlated).

An application of the template profiles for two typical situations can be seen in Figure 2.7. The experiments show that two different parametrizations for each template give acceptable results.
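A simplified sketch of this template correlation and non-maximum suppression step on a 1D curvature profile is shown below; the template construction (the L1, L2, L3 parametrization) is left to the caller, and the normalization and threshold are illustrative choices, not the exact ones used in the thesis.

```python
import numpy as np

def split_points(curvature, template, threshold, nms_radius=5):
    """Correlate a curvature profile with a template shape and return the indices
    where the (normalized) correlation has a local maximum above `threshold`."""
    t = (template - template.mean()) / (template.std() + 1e-12)
    n = len(t)
    scores = np.zeros(len(curvature))
    for i in range(len(curvature) - n + 1):
        win = curvature[i:i + n]
        win = (win - win.mean()) / (win.std() + 1e-12)
        scores[i + n // 2] = np.abs(np.dot(win, t)) / n   # |correlation|, centred on the window
    splits = []
    for i in np.argsort(scores)[::-1]:                    # greedy non-maximum suppression
        if scores[i] < threshold:
            break
        if all(abs(i - s) > nms_radius for s in splits):
            splits.append(int(i))
    return sorted(splits)
```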

2.4 Feature fitting

Features are fitted onto the segmented edges. The feature set consists of line segments and second order curves, therefore during the fit the segments are classified as line segments, ellipses, hyperbolas and parabolas. In these cases non-robust methods can be used, because the fit is performed on continuous edge segments, therefore there are no outliers.

2.4.1 Line fitting

The lines are fitted using the method proposed in [2]. This algorithm uses the implicit (general) equation of a line.

This can be written in the following form:

\[ F_{L,i}(p_L, q_i) = a u_i + b v_i + c = p_L^T q_i = q_i^T p_L \]

where p_L = (a b c)^T contains the coefficients and q_i = (u_i v_i w_i)^T is the design vector (the homogeneous coordinate vector of the 2D point, with w_i = 1). If a^2 + b^2 = 1, then the expression \(\sum_{i=1}^{N}(a u_i + b v_i + c)^2\) gives the sum of the squares of the distances of the points from the line. Therefore, using the Lagrange multiplier rule, the result is a constrained total least squares (errors in variables) problem:

\[ E_L(a,b,c,\lambda) = \sum_{i=1}^{N}(a u_i + b v_i + c)^2 - \lambda\,(a^2 + b^2 - 1). \]

Let

\[ \sum_{i=1}^{N} F_{L,i}^2 = \sum_{i=1}^{N} p_L^T q_i q_i^T p_L = p_L^T \Bigl(\sum_{i=1}^{N} q_i q_i^T\Bigr) p_L = p_L^T Q\, p_L, \qquad\text{where}\qquad Q = \begin{pmatrix} \sum_{i=1}^{N} u_i^2 & \sum_{i=1}^{N} u_i v_i & \sum_{i=1}^{N} u_i \\ \sum_{i=1}^{N} u_i v_i & \sum_{i=1}^{N} v_i^2 & \sum_{i=1}^{N} v_i \\ \sum_{i=1}^{N} u_i & \sum_{i=1}^{N} v_i & N \end{pmatrix}. \]
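Under the constraint a² + b² = 1 the minimization above reduces to classical orthogonal regression, which can be solved in closed form; the sketch below shows the standard eigenvector solution of the same problem (not necessarily the exact implementation of [2]).

```python
import numpy as np

def fit_line(points):
    """Fit a line a*u + b*v + c = 0 with a^2 + b^2 = 1 minimizing the sum of squared
    orthogonal distances; `points` is an (N, 2) array of (u, v) coordinates."""
    mean = points.mean(axis=0)
    centered = points - mean                  # eliminating c via the centroid
    scatter = centered.T @ centered           # 2x2 scatter matrix
    eigvals, eigvecs = np.linalg.eigh(scatter)
    a, b = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue = line normal
    c = -(a * mean[0] + b * mean[1])
    return a, b, c
```

Eliminating c via the centroid and taking the eigenvector of the 2×2 scatter matrix belonging to the smallest eigenvalue gives the same minimizer as the Lagrange multiplier formulation.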


Figure 2.7: Synthetic profile based segmentation for two typical cases.

Upper half: join of two line segments. Lower half: join of a line segment and an arc.
