Multiview Matching Algorithm for Processing Mobile Sequence Images

The paper presents a multiview matching algorithm for processing sequence images acquired by a mobile mapping system (MMS). The workflow of the multiview matching algorithm is designed, and the algorithm is based on motion analysis of sequence images in computer vision. To achieve a high multiview matching accuracy, the camera lens distortion in the sequence images is first corrected, and the images are then resampled. Image points on the sequence images are extracted using the Harris operator. The homologous image points are then matched based on correlation coefficients and used to make a robust estimation of the fundamental matrix F between two adjacent images using the random sample consensus (RANSAC) algorithm. The fundamental matrix F is calculated under the condition of epipolar line constraints. Finally, the trifocal tensor T of the three-view images is calculated to achieve highly accurate triplet image points. These triplet image points are then provided as the initial values for bundle adjustment. The algorithm was tested using a set of sequence images. The results demonstrate that the designed workflow is feasible and that the algorithm is promising in terms of both accuracy and feasibility. DOI: 10.1061/(ASCE)SU.1943-5428.0000235. This work is made available under the terms of the Creative Commons Attribution 4.0 International license, http://creativecommons.org/licenses/by/4.0/. Author keywords: Sequence images; Multiview matching; Harris operator; Correlation coefficient; Random sample consensus (RANSAC); Trifocal tensor; Computer vision (CV); Mobile mapping system (MMS).


Introduction
Photogrammetry and geometric computer vision are closely related disciplines. Many studies have shared interests across the two disciplines toward similar goals, such as point feature detection (Förstner and Gülch 1987), relative orientation (Philip 1996; Nistér 2004), perspective-n-point (PnP) problems (Masry 1981; Lepetit et al. 2009), and bundle adjustment (Triggs et al. 2000). The mathematical fundamentals of photogrammetry and computer vision can both be derived from the same mathematical model of central projection. Photogrammetry uses the collinearity equations, a Cartesian-coordinate representation of the central projection in Euclidean geometry, whereas computer vision applies the projection equations, a homogeneous-coordinate representation of the central projection in projective geometry. Homogeneous coordinates have the advantage that points, lines, and planes at infinity can be represented using finite coordinates. In photogrammetry, the nonlinearity of the collinearity equations requires linearization, iterative optimization, and good initial values of the exterior orientation (EO) parameters from a global navigation satellite system and inertial measurement unit (GNSS/IMU) system; in addition, all the parameters in the collinearity equations have physical meanings. In computer vision, the linearity of the projection equation permits linear matrix operations using linear algebra, the linearity of the camera matrix or fundamental matrix does not require initial values, and the parameters of those matrices are not physically interpretable.
Multiview image matching has been addressed by several researchers in photogrammetry and computer vision. Maas (1996) presented a multi-image matching algorithm using discrete points extracted by an interest operator and epipolar line intersection. Brown et al. (2005) presented a method of multi-image matching in which image features are first located as interest points using a Harris corner detector, then matched using a fast nearest-neighbor algorithm that indexes features based on their low-frequency Haar wavelet coefficients, and finally refined using the random sample consensus (RANSAC) method. Gruen (1985) presented a method of multiphoto correlation based on the geometrically constrained adaptive least-squares matching algorithm. Elaksher (2008) presented a method that uses feedforward neural networks to solve multi-image correspondence and uses the photogrammetric collinearity condition to validate the outputs of the neural network and to compute the three-dimensional (3D) coordinates of the matched points.
With the rapid development of mobile mapping system (MMS) technology in recent years, real-time sequence images of motion can be collected by digital cameras mounted on the platform of a vehicle-based mobile mapping system. The sequence images acquired from a mobile mapping system can be matched across multiple views based on the methods of computer vision. This paper presents a multiview matching algorithm based on motion analysis of sequence images that uses well-known algorithms of computer vision to process sequence images rapidly and obtain highly accurate image point coordinates.

Design Ideas and Workflow
Fig. 1 presents the workflow diagram of the proposed method. The method is based on motion analysis of sequence images using well-known algorithms of computer vision. To achieve higher matching accuracy, the camera lens distortions in the sequence images are first corrected, and the images are then resampled. Image points on each distortion-free sequence image are extracted using the Harris operator. Then, the homologous image points are matched based on correlation coefficients. The matched homologous points are then used to fit the fundamental matrix F between two adjacent images using the RANSAC algorithm for robust estimation. The fundamental matrix F of two views is calculated under the condition of epipolar line constraints. Then, the trifocal tensor T of the three-view images is calculated to achieve highly accurate triplet image points. To test the algorithm, a set of sequence images was captured using a Sony (Tokyo, Japan) DFW-SX910 camera in an MMS, and the experimental results were analyzed.

Sequence-Image Matching Algorithm
The Applanix (Richmond Hill, Ontario, Canada) Landmark MMS used in this study consisted of charge-coupled device (CCD) digital cameras, laser scanner sensors, a GNSS/IMU positioning and orientation system, and an odometer. The CCD digital cameras and laser scanners were mounted on the top of the land vehicle, with the CCD cameras pointing in different directions to acquire different digital images. The front-view sequence images were used as test samples in this study. The proposed sequence-image matching algorithm consists of camera lens distortion correction, extraction of interest points with the Harris operator, two-view image matching based on the correlation coefficient, fitting of the fundamental matrix F of two views by RANSAC, and estimation of the trifocal tensor T of three views. The algorithm is detailed in the following sections.

Camera Lens Distortion Correction
Camera lens distortions cause the imaged positions to be displaced from their ideal locations and are a source of systematic errors in sequence-image coordinates. The metric digital camera sensors mounted on the MMS were rigorously calibrated to effectively correct the systematic errors in the image points. The camera calibration parameters are provided to the user in the camera calibration report, including the focal length f, the principal point offset (x0, y0), the radial distortion parameters (k1, k2, k3), and the tangential distortion parameters (p1, p2). With known camera lens distortion parameters, the sequence-image coordinates can be refined with the Brown distortion model (Brown 1971, 1976). The corrected coordinates (x̄, ȳ) are computed with Eqs. (1)-(4):

x̄ = x + Δx, ȳ = y + Δy, where Δx = Δxr + Δxt and Δy = Δyr + Δyt (1)

Radial distortion:

Δxr = x(k1 r² + k2 r⁴ + k3 r⁶), Δyr = y(k1 r² + k2 r⁴ + k3 r⁶) (2)

Tangential distortion:

Δxt = p1(r² + 2x²) + 2 p2 xy, Δyt = p2(r² + 2y²) + 2 p1 xy (3)

r² = x² + y² (4)

where (x̄, ȳ) = corrected image coordinates; (x, y) = raw image coordinates referred to the principal point; (Δx, Δy) = correction terms for the camera's systematic errors; and r = radial distance from the principal point. The image coordinates of the sequence images from the MMS are corrected with the Brown distortion model. The raw sequence images are resampled, and new distortion-free images are formed for multiview matching.
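As an illustration, the Brown correction of Eqs. (1)-(4) can be sketched in a few lines. This is a minimal scalar sketch, not the authors' calibrated MMS processing chain, and the parameter values in the sanity check are invented for demonstration.

```python
def correct_brown_distortion(x, y, k1, k2, k3, p1, p2, x0=0.0, y0=0.0):
    """Minimal sketch of the Brown distortion correction, Eqs. (1)-(4).

    (x, y) are raw image coordinates; (x0, y0) is the principal point;
    k1..k3 and p1, p2 are the radial and tangential parameters from the
    camera calibration report.
    """
    xc, yc = x - x0, y - y0        # refer coordinates to the principal point
    r2 = xc * xc + yc * yc         # r^2 of Eq. (4)
    radial = k1 * r2 + k2 * r2**2 + k3 * r2**3
    dx_r, dy_r = xc * radial, yc * radial                # Eq. (2)
    dx_t = p1 * (r2 + 2 * xc * xc) + 2 * p2 * xc * yc    # Eq. (3)
    dy_t = p2 * (r2 + 2 * yc * yc) + 2 * p1 * xc * yc
    return x + dx_r + dx_t, y + dy_r + dy_t              # Eq. (1)
```

With all distortion parameters set to zero, the correction reduces to the identity, which is a quick sanity check on the sign convention. Note that some calibration reports state the model in the opposite direction (applying rather than removing distortion), so the signs should be checked against the report in use.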

Harris Operator
The Harris operator (Harris and Stephens 1988) is a combined corner and edge detector with geometric stability. It defines interest points that have locally maximal self-matching precision under translational least-squares template matching. The authors applied the Harris operator to extract interest points in the sequence images from the MMS. The Harris response is obtained by shifting a local window in the image by small amounts in various directions.
Denoting the image intensities by I, the change E (Harris and Stephens 1988) produced by a shift (x, y) is given by

E(x, y) = Σ_{u,v} W_{u,v} [I(u + x, v + y) − I(u, v)]² (5)

where the window W_{u,v} is weighted by a Gaussian:

W_{u,v} = exp[−(u² + v²)/(2σ²)] (6)

The first gradients are approximated as

X = ∂I/∂x ≈ I ⊗ (−1, 0, 1), Y = ∂I/∂y ≈ I ⊗ (−1, 0, 1)ᵀ (7)

For the small shift (x, y), the change E can be written as

E(x, y) = Ax² + 2Cxy + By² = (x, y) M (x, y)ᵀ (8)

where the 2 × 2 symmetric matrix M is

M = [A C; C B] (9)

with the parameters

A = X² ⊗ W, B = Y² ⊗ W, C = (XY) ⊗ W (10)

E is closely related to the local autocorrelation function, and M describes its shape at the origin. Let λ1 and λ2 be the eigenvalues of M; they are proportional to the principal curvatures of the local autocorrelation function and form a rotationally invariant description of M. The Harris corner response (Harris and Stephens 1988) is defined as in Eq. (11), with Det(M) and Tr(M) as in Eqs. (12) and (13):

R = Det(M) − K Tr(M)² (11)

Det(M) = λ1 λ2 = AB − C² (12)

Tr(M) = λ1 + λ2 = A + B (13)

R is positive in corner regions, negative in edge regions, and small in flat regions. The parameter K is usually set to 0.04-0.06. In Eqs. (12) and (13), an ideal edge has λ1 large and λ2 zero; a corner is indicated by both λ1 and λ2 large; and a flat image region is indicated by both λ1 and λ2 small. If λ1 ≫ λ2, the edge is horizontal; if λ1 ≪ λ2, the edge is vertical. Interest feature points are extracted in every sequence image using the Harris operator.
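A compact sketch of the Harris response R = Det(M) − K·Tr(M)² is given below. For simplicity it uses a flat box window in place of the Gaussian weighting W, which is enough to illustrate the structure-tensor idea; it is not the authors' implementation.

```python
import numpy as np

def harris_response(img, k=0.04, win=3):
    """Harris corner response R = Det(M) - k*Tr(M)^2 (minimal sketch).

    Uses a flat win x win window instead of the Gaussian weighting of
    the original detector.
    """
    img = img.astype(float)
    Iy, Ix = np.gradient(img)                 # first image gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box_sum(a):
        # Centered sum of a win x win neighbourhood (wraps at borders,
        # which is harmless for interior pixels in this sketch).
        out = np.zeros_like(a)
        r = win // 2
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, 0), dx, 1)
        return out

    A, B, C = box_sum(Ixx), box_sum(Iyy), box_sum(Ixy)
    det = A * B - C * C                       # Det(M) = λ1·λ2
    tr = A + B                                # Tr(M) = λ1 + λ2
    return det - k * tr**2                    # R > 0 near corners
```

On a synthetic step image (a bright square on a dark background), the response should be positive at the square's corner and essentially zero in flat regions, matching the classification of Eq. (11).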

Sequence-Image Matching by Correlation Coefficient
Next, an image matching algorithm based on the correlation coefficient is applied to acquire homologous feature points. Image matching is implemented between the previously detected interest feature points in two sequence images by finding the points that are maximally correlated with each other within windows surrounding each point. Only interest points that correlate most strongly with each other in both the left-right and right-left directions are recorded as homologous matching points, so the validity of the homologous matching points is checked in both directions.
The implementation of image matching is as follows. First, the initial point sets detected by the Harris operator in adjacent sequence images form candidate point sets. Then, the correlation coefficient is computed point by point between a feature point in the match window and the candidate matching points within the search window, and the points with maximal correlation coefficients in both directions are regarded as homologous pairs of points. The correlation coefficient measures the linear similarity of the gray-value vectors and is an important similarity measure. The correlation coefficient between point p(x, y) in the A frame image and point q(x′, y′) in the A + 1 frame image is defined as

ρ(p, q) = Σ_{i,j} [I_A(x + i, y + j) − m1][I_{A+1}(x′ + i, y′ + j) − m2] / (n² s1 s2) (14)

where I_A(x, y) and I_{A+1}(x′, y′) = gray intensities in the match window of the A frame image and the search window of the A + 1 frame image, respectively; m1, m2, s1, and s2 = means and standard deviations of the match and search windows, respectively; and i and j range over the n × n window. An appropriate window size must be chosen to balance matching speed and reliability. The window size w should be odd, such as 11 × 11 or 13 × 13; the radius of the match window is (w − 1)/2. Finally, the correlation coefficient between sequence images is calculated according to Eq. (14), and the homologous pairs of points with maximal correlation in both directions are recorded.
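The similarity measure of Eq. (14) is the normalized correlation coefficient of two gray-value windows; a minimal sketch:

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized correlation coefficient between two equal-size windows.

    Values near 1 indicate a likely homologous pair; values near 0
    indicate no linear relation between the gray-value vectors.
    """
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a = a - a.mean()                    # subtract window means m1, m2
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0:                      # flat window: correlation undefined
        return 0.0
    return float((a * b).sum() / denom)
```

Because the measure is invariant to a gain and offset of the gray values (ncc(p, 2p + 5) = 1), it tolerates the illumination changes between adjacent frames of a moving platform, which is why it serves as a robust similarity measure here.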

Fit Fundamental Matrix of Two Views by RANSAC
For a given point in one sequence image, a corresponding point in the other sequence image may not exist. Because of occluded parts of the scene, mismatching, or insufficient image texture, the matching algorithm may fail to find the homologous point and may produce outliers. These outliers must be removed by estimating the epipolar geometry, encoded by the fundamental matrix F, and fitting F to the matched points with a RANSAC strategy. The proposed method applies the epipolar geometry constraint and the well-known normalized 8-point algorithm to calculate the fundamental matrix F, and fits the fundamental matrix of two views with the RANSAC algorithm.

Epipolar Geometry Constraint
The epipolar geometry is the intrinsic projective geometry between two views. It is independent of scene structure and depends only on the cameras' internal parameters and relative pose (Hartley and Zisserman 2000). In computer vision, the most common way to represent this intrinsic geometry is the fundamental matrix F, a 3 × 3 matrix of rank 2. A 3D-space point M(X, Y, Z) is imaged as m = (x, y, 1)ᵀ in the first frame view and m′ = (x′, y′, 1)ᵀ in the second frame view; the image points then satisfy the relation m′ᵀFm = 0; that is, the two camera centers, the 3D-space point M, and its image points m and m′ lie in a common plane.
The fundamental matrix F can be derived from the mapping between a point and its epipolar line. For a given point m in the first frame image, the projective representation of its epipolar line in the second frame image is l′ = Fm; the point m′ corresponding to m lies on the line l′, and therefore m′ᵀl′ = m′ᵀFm = 0. Thus, the fundamental matrix F satisfies the condition that for any pair of corresponding points m ↔ m′ in the two frame images, m′ᵀFm = 0.
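These relations are easy to check numerically. The rank-2 matrix F below is purely illustrative (a skew-symmetric matrix, as arises for a pure-translation stereo pair); it is not estimated from the paper's data.

```python
import numpy as np

# Illustrative rank-2 fundamental matrix (skew-symmetric, as for a
# pure-translation stereo pair); values are invented for demonstration.
F = np.array([[ 0.0, -1.0,  0.5],
              [ 1.0,  0.0, -0.2],
              [-0.5,  0.2,  0.0]])

m = np.array([0.3, 0.4, 1.0])   # a point in the first view (homogeneous)
l_prime = F @ m                 # its epipolar line l' = Fm in the second view

# Construct a point m' lying on l': the cross product of two lines gives
# their intersection point in homogeneous coordinates.
m_prime = np.cross(l_prime, np.array([1.0, 0.0, 0.0]))

residual = m_prime @ F @ m      # epipolar constraint m'^T F m = 0
```

The residual vanishes by construction (m′ lies on l′), and `np.linalg.matrix_rank(F)` confirms the rank-2 property required of a fundamental matrix.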

Normalized 8-Point Algorithm
The simplest method of computing the fundamental matrix is the 8-point algorithm, as presented by Longuet-Higgins (1981). Given sufficiently many point matches m ↔ m′ (at least 8), the equation m′ᵀFm = 0 can be used to compute the unknown matrix F. Each point and its corresponding matched point give rise to one linear equation in the unknown entries of F. With known coordinates m = (x, y, 1)ᵀ and m′ = (x′, y′, 1)ᵀ, the equation for a pair of points m and m′ can be expressed as the inner product

(x′x, x′y, x′, y′x, y′y, y′, x, y, 1) f = 0 (15)

where f is the 9-vector made up of the entries of the fundamental matrix F in row-major order. For a set of n ≥ 8 matched points, a set of linear equations of the following form is obtained:

Bf = 0 (16)

For an exact solution to exist, matrix B must have rank at most 8, but the rank of B may be 9 because of noise in the point coordinates and a lack of exact data. In this case, the least-squares solution must be found. The least-squares solution for f is the singular vector corresponding to the smallest singular value of B, that is, the last column of V in the SVD B = UDVᵀ. The solution vector f found in this way minimizes ||Bf|| subject to the condition ||f|| = 1. The key to success with the 8-point algorithm is careful normalization of the input data before constructing the equations. A simple transformation of the points in the image before formulating the linear equations leads to an enormous improvement in the conditioning of the problem and hence in the stability of the result. Normalization is a translation and scaling of each set of points in the image so that the origin is at the centroid of the points, the mean distance of the points from the origin is equal to √2, and the scale parameter is 1. The normalized 8-point algorithm also enforces the singularity constraint on the fundamental matrix F.
The initial estimate F is replaced by the singular matrix F̂ that minimizes the difference ||F̂ − F|| subject to the condition det F̂ = 0. This is done using the SVD and has the advantage of being simple and rapid.
The detailed implementation is as follows. First, transform the image coordinates according to x̂ᵢ = Txᵢ and x̂′ᵢ = T′x′ᵢ, where T and T′ are the normalizing transformations. Second, compose matrix B from the pairs of matched homologous points (x, y) ↔ (x′, y′) as defined in Bf = 0. Third, obtain a solution F from the vector f corresponding to the smallest singular value of B, and replace F with the singular matrix F̂ using the SVD for correction. Finally, set F = T′ᵀF̂T for denormalization. Matrix F is the fundamental matrix corresponding to the original data.
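The steps above can be sketched with NumPy's SVD. This is a plain sketch of the normalized 8-point algorithm, not the authors' implementation.

```python
import numpy as np

def normalize(pts):
    """Translate points to their centroid and scale so the mean distance
    from the origin is sqrt(2); returns (3x3 transform, normalized pts)."""
    centroid = pts.mean(axis=0)
    d = np.sqrt(((pts - centroid) ** 2).sum(axis=1)).mean()
    s = np.sqrt(2) / d
    T = np.array([[s, 0, -s * centroid[0]],
                  [0, s, -s * centroid[1]],
                  [0, 0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return T, (T @ pts_h.T).T

def fundamental_8point(pts1, pts2):
    """Normalized 8-point algorithm (sketch). pts1, pts2: (n, 2), n >= 8."""
    T, x = normalize(pts1)
    Tp, xp = normalize(pts2)
    # One row of B per correspondence: (x'x, x'y, x', y'x, y'y, y', x, y, 1)
    B = np.column_stack([
        xp[:, 0] * x[:, 0], xp[:, 0] * x[:, 1], xp[:, 0],
        xp[:, 1] * x[:, 0], xp[:, 1] * x[:, 1], xp[:, 1],
        x[:, 0], x[:, 1], np.ones(len(x))])
    # f = singular vector of the smallest singular value of B
    _, _, Vt = np.linalg.svd(B)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 singularity constraint: zero the smallest s.v.
    U, D, Vt = np.linalg.svd(F)
    F = U @ np.diag([D[0], D[1], 0.0]) @ Vt
    return Tp.T @ F @ T                 # denormalize: F = T'^T F̂ T
```

On noise-free synthetic correspondences (two cameras with a small relative rotation and translation), the recovered F satisfies m′ᵀFm ≈ 0 for every pair and has rank 2.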

Fit Fundamental Matrix by RANSAC
The RANSAC algorithm is used for the robust fitting of models in the presence of many data outliers (Fischler and Bolles 1981). Here, the algorithm robustly fits a fundamental matrix to a set of putatively matched image points and obtains the subset of inliers (valid points). Given the matched pairs of homologous points, the RANSAC algorithm performs iterative computations as follows: (1) Randomly select s sample correspondences as an initial data set; a minimum of s pairs of points is required to form a fundamental matrix F, and compute the fundamental matrix. The sample size s and the number of samples N satisfy the following relationship (Hartley and Zisserman 2000):

N = log(1 − p) / log[1 − (1 − ε)ˢ] (17)

Eq. (17) gives the number of samples N required to ensure, with a probability p = 0.99, that at least one sample has no outliers for a given sample size s and proportion of outliers ε. (2) Compute the distance measure. Given the current estimate of F from the RANSAC sample, the distance d measures how closely a matched pair of points satisfies the epipolar geometry. (3) Count the inliers consistent with F, that is, the correspondences for which d < t, where t is a distance threshold chosen so that with 95% probability a point satisfying it is an inlier; the solution for F with the most inliers is retained. (4) Choose the F with the largest number of inliers and reestimate F from all correspondences classified as inliers. Thus, the optimal fundamental matrix F is fitted using the RANSAC robust estimation algorithm, and the outliers are removed.
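Eq. (17), rounded up to an integer, gives the familiar RANSAC iteration counts (Hartley and Zisserman 2000); a one-function sketch:

```python
import math

def ransac_num_samples(p, eps, s):
    """N = log(1 - p) / log(1 - (1 - eps)^s), Eq. (17): the number of
    random samples of size s needed so that, with probability p, at
    least one sample is free of outliers when a fraction eps of the
    putative matches are outliers."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - (1.0 - eps) ** s))
```

For the choice s = 8, p = 0.99, and ε = 5% used later in this paper, only N = 5 samples are needed; with 50% outliers, N grows to 1,177, which shows why keeping the putative match set clean pays off.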

Estimate Trifocal Tensor of Three Views
The trifocal tensor (Hartley and Zisserman 2003) encapsulates all the projective geometric relations between three views that are independent of scene structure. It depends only on the motion between the views and the internal parameters of the cameras. The trifocal tensor captures the point-point-point correspondence between the three images: corresponding points backprojected from each image all intersect in a single 3D point in space. A point in 3D space is imaged as the corresponding triplet x ↔ x′ ↔ x″ in the three images. Because the three fundamental matrices F12, F23, and F13 relating the three views in the sequence images have been computed and the homologous points between pairs of views have been matched, the trifocal tensor can be determined from the fundamental matrices and the matched points.
The method of estimating the trifocal tensor T is as follows: (1) Search for triplets of corresponding image points among the matched homologous pairs between adjacent views, and form the corresponding point sets x ↔ x′ ↔ x″ in the three images. (2) Normalize each set of points so that the origin is at the centroid, the mean distance from the origin is √2, and the scale parameter is 1.
(3) Randomly select seven or more triplet correspondences and compute the trifocal tensor with Eq. (18). (4) Under the error constraint of minimizing the algebraic error for the homologous matched points of the three views, estimate the trifocal tensor over the three images through the fundamental matrices using the RANSAC algorithm, and ensure that the trifocal tensor is geometrically valid by retrieving its epipoles. (5) Perform denormalization to obtain the trifocal tensor T for the original data.
x^i x′^j x″^k ε_jqs ε_krt T_i^qr = 0_st (18)

where x^i ↔ x′^j ↔ x″^k = a set of corresponding points across the three frame views; ε = the alternating tensor, from which the algebraic error vector whose norm is minimized is formed; T_i^qr = the trifocal tensor; and 0_st = a two-dimensional (2D) tensor with all zero entries.
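Estimating T itself takes considerably more machinery, but the reason three views reject mismatches — each candidate triplet must satisfy several independent constraints at once — can be illustrated with the pairwise fundamental matrices. The calibration, poses, and threshold below are synthetic and purely illustrative, and this pairwise check is a simplified stand-in for full trifocal verification, not the paper's estimator.

```python
import numpy as np

def skew(t):
    """3x3 skew-symmetric matrix [t]x such that [t]x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K, R, t):
    """F for cameras P1 = K[I|0], P2 = K[R|t], so that x2^T F x1 = 0."""
    Kinv = np.linalg.inv(K)
    F = Kinv.T @ skew(t) @ R @ Kinv
    return F / np.linalg.norm(F)        # fix the projective scale

def epipolar_residual(F, a, b):
    """Algebraic residual |b^T F a| for pixel points a, b."""
    return abs(np.append(b, 1.0) @ F @ np.append(a, 1.0))

def verify_triplet(F12, F13, F23, x1, x2, x3, tol=1e-6):
    """Keep a candidate triplet x1 <-> x2 <-> x3 only if all three
    pairwise epipolar constraints hold -- a stricter test than the
    single two-view constraint, which is why there are fewer
    mismatches over three views."""
    return (epipolar_residual(F12, x1, x2) < tol and
            epipolar_residual(F13, x1, x3) < tol and
            epipolar_residual(F23, x2, x3) < tol)
```

A true triplet (three projections of one 3D point) passes all three constraints, while replacing the third point with the image of a different 3D point violates at least one of them and is rejected.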
The estimated trifocal tensor, together with the set of corresponding triplet points x ↔ x′ ↔ x″ across the three images, is thereby determined. The trifocal tensor is used to determine the exact image positions of the three homologous points in the three images. There are fewer mismatches over three views than over two views, because in two views only the weaker geometric constraint of an epipolar line is available against which to verify a possible match. Thus, highly accurate and reliable corresponding triplet points can be obtained. In future work, the authors will use the image point coordinates of the triplet points in three views and the EO parameters of the projection centers from GNSS/IMU data postprocessing to solve for the 3D points X by bundle adjustment over three views.

Data Set Description
The Sony DFW-SX910 9833406 camera is a front-view digital camera and was mounted in an Applanix Landmark MMS. The camera contains a digital CCD sensor of 1,280 × 960 pixels with a 4.65-μm pixel size, and the lens focal length is 6.09 mm. The camera parameters are shown in Table 1. In this study, the authors chose the continuous adjacent four-frame sequence images, Frames 686, 687, 688, and 689, collected by the Sony DFW-SX910 9833406 camera as test samples. For camera lens distortion correction, the authors chose the Frame 149 image to illustrate the lens correction. The results of distortion correction are presented in Figs. 2(a-c), where Fig. 2(a) is the raw front-view sequence image, Fig. 2(b) is the distortion-free image processed using the Brown model, and Fig. 2(c) is the difference image between the raw image and the distortion-free image. Fig. 2(c) shows that there is little distortion in the center area of the image and large distortion at the image edges and surrounding areas, which is the primary source of systematic error.
For Harris feature point extraction, the authors chose the continuous four-frame distortion-free image sequence from Frame 686 to Frame 689 and extracted feature points using the Harris corner detector. For the parameters, the standard deviation (σ) of the Gaussian was 1, the search radius of the small shift was set to 3 pixels, and the parameter for the number of corner points was 500. The extraction resulted in 3,024 feature points on Frame 686, 3,287 on Frame 687, 3,348 on Frame 688, and 2,541 on Frame 689. The results are shown in the subplots in Figs. 3(a-d), with the extracted feature points overlaid on the corresponding images.
In the sequence-image matching based on the correlation coefficient, the window size for correlation matching was set to 11 × 11, the match radius was 5, the maximum search distance for matching was 50 × 50, and the threshold for the correlation coefficient was set to 0.99. The matching results for the sequence of four adjacent images are shown in the subplots in Figs. 4(a-c). In fitting the fundamental matrix of two views with the RANSAC algorithm, the authors chose the parameters s = 8, p = 0.99, ε = 5%, and t = 0.002. The inlier matched results in the adjacent four-image sequence are shown in the subplots in Figs. 5(a-c). The result of inlier point fitting by RANSAC for Frames 686 and 687 is shown in Frame 686 in Fig. 5(a), that for Frames 687 and 688 in Frame 687 in Fig. 5(b), and that for Frames 688 and 689 in Frame 688 in Fig. 5(c). The original image of Frame 689 is shown in Fig. 5(d).
In computing the trifocal tensor, triplet points were found from the matched pairs of points between Frames 686 and 687 and between Frames 687 and 688, and likewise from the matched pairs between Frames 687 and 688 and between Frames 688 and 689. The triplet points were then estimated using the RANSAC algorithm, and the trifocal tensor was computed. For the trifocal tensor T over Frames 686, 687, and 688, 507 triplet points were obtained across the three images.

Conclusions
This paper presents an approach to acquiring highly accurate matched points from the sequence images of an MMS. Overlapping sequence images of motion with short baselines are collected, and the image scale of the front-view images changes in accordance with the speed of the MMS. The approach applies the algorithms of computer vision to correct the camera lens distortion, to extract feature points using the Harris operator, and to produce many matched homologous points with high matching effectiveness, although some mistakenly matched points remain at this stage. The fundamental matrix of two views is then computed using the epipolar geometry constraint and a RANSAC strategy for robust estimation, which effectively eliminates the outliers; the accuracy and reliability of the matched points are thus improved. Under the error constraint of minimizing the algebraic error for the homologous matched points of multiple views, the trifocal tensor is computed through the fundamental matrices and the RANSAC algorithm, and high-precision common triplet points in the three views are found. At the same time, the number of matched points over the three views of sequence images is reduced to those satisfying the trifocal tensor. The experimental results demonstrate that the proposed algorithm is promising in terms of both accuracy and feasibility. In future work, the authors will solve for the 3D point coordinates by bundle adjustment over three views using the triplet points and the EO parameters from GNSS/IMU data.

Fig. 4.
Fig. 4. Subplots (a) through (c) show homologous pairs of points, and (d) shows the original image: (a) between Frames 686 and 687, shown in Frame 686; (b) between Frames 687 and 688, shown in Frame 687; (c) between Frames 688 and 689, shown in Frame 688; (d) original image, Frame 689. The matched homologous pairs of points between Frames 686 and 687 are shown in Frame 686 in Fig. 4(a). The homologous pairs of points between Frames 687 and 688 are shown in Frame 687 in Fig. 4(b). The homologous pairs of points between Frames 688 and 689 are shown in Frame 688 in Fig. 4(c). The original image of Frame 689 is shown in Fig. 4(d).

Fig. 5.
Fig. 5. Subplots (a) through (c) show inlier points fitted by RANSAC, and (d) shows the original image: (a) between Frames 686 and 687, shown in Frame 686; (b) between Frames 687 and 688, shown in Frame 687; (c) between Frames 688 and 689, shown in Frame 688; (d) original image, Frame 689

Fig. 6.
Fig. 6. Subplots (a) through (c) show triplet pairs of points, and (d) shows the original image: (a) between Frames 686 and 687 of the three views, shown in Frame 686; (b) between Frames 687 and 688, shown in Frame 687; (c) between Frames 688 and 689, shown in Frame 688; (d) original image, Frame 689