International Journal of Geo-Information

Article

Template Matching for Wide-Baseline Panoramic Images from a Vehicle-Borne Multi-Camera Rig

Shunping Ji 1,2,*, Dawen Yu 1, Yong Hong 3 and Meng Lu 4

1 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China; [email protected]
2 Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100001, China
3 Leador Spatial Information Technology Corporation, Wuhan 430000, China; [email protected]
4 Department of Physical Geography, Faculty of Geoscience, Utrecht University, Princetonlaan 8, 3584 CB Utrecht, The Netherlands; [email protected]
* Correspondence: [email protected]; Tel.: +86-135-5405-7323

Received: 3 May 2018; Accepted: 18 June 2018; Published: 21 June 2018

 

Abstract: Automatic detection and locating of objects such as poles, traffic signs, and building corners in street scenes captured by a mobile mapping system has many applications. Template matching is a technique that can automatically recognize the counterparts of an object in multi-view images. In this study, we aim at finding correspondents of an object in wide-baseline panoramic images with large geometric deformations from the sphere projection and significant systematic errors from the multi-camera rig geometry. Firstly, we deduce the camera model and epipolar model of a multi-camera rig system. Then, epipolar errors are analyzed to determine the search area for pixelwise matching. A low-cost laser scanner is optionally used to constrain the depth of an object. Lastly, several classic feature descriptors are introduced to template matching and evaluated on the multi-view panoramic image dataset. We propose a template matching method built on a fast variant of the scale-invariant feature transform (SIFT) descriptor. Our method experimentally achieved the best performance in terms of accuracy and efficiency compared with other feature descriptors and the most recent robust template matching methods.

Keywords: template matching; panoramic camera; mobile mapping system; feature descriptors

1. Introduction

A ground mobile mapping system (MMS) mounted with multiple sensors such as a mono/stereo camera, panoramic camera, laser scanner, and GPS/INS (global positioning system/inertial navigation system) has a wide range of applications in city planning, 2D/3D map making, traffic surveillance, and autonomous cars [1–4]. A challenge is to effectively locate objects of interest, such as manhole covers, telegraph poles, and building corners, from the multi-view images acquired by an MMS, for constructing and maintaining georeferenced datasets of certain objects [5,6]. An intuitive solution is to automatically recognize the designated object instances with high precision and recall, and then calculate their geolocation by GPS/INS and triangulation according to specific points on the objects (for example, the geometric center of a manhole cover or the base of a pole). However, this solution can hardly be achieved in practice because of the difficulty of accurately extracting and segmenting objects. To construct a highly accurate object dataset for traffic maps or 3D city maps, manual recognition and measuring are still necessary.

In this paper, we automatically identify correspondences from multi-view images given a user-specified object in one of the images.


With the correspondences, the 3D world coordinates of the object can be calculated using GPS/INS. This strategy can greatly reduce the manual work of selecting at least two correspondent points from stereo or multi-view images. One technique that has been used for specific object recognition is template matching [7,8]. In contrast to common sparse stereo matching, which typically extracts arbitrary features from images and then matches them [9,10] (see Figure 1a), template matching finds corresponding objects or image patches given a template. Figure 1b shows a classic case in early film photogrammetry, where four fiducial marks are matched to the corresponding template for precise interior orientation. In this study, we aim to automatically locate a user-specified object in multi-view panoramic images through template matching (Figure 1c).

(a) Feature-based matching

(b) Template matching for interior orientation

(c) Template matching for multi-view panoramic images

Figure 1. Some examples of image matching. (a) shows the feature-based matching method that first extracts arbitrary features (denoted by small circles) from the images and then matches them (denoted by lines); (b) shows a traditional template matching case that uses the corresponding template patch to locate four fiducial marks in an aerial image; (c) shows the template matching case in our study. Given an object or patch (denoted by a green box) in one image, the corresponding objects (denoted by blue boxes), which exhibit significant distortion, should be retrieved from the multi-view panoramic images.

The main advantage of a panoramic camera over a mono/stereo camera is that it covers a 360° scene in one shot. Its sensor models and applications have been studied for years. The spherical perspective [11], the global description of omni-directional images [12], and the rotation and scale invariance of image patches [13] of a spherical camera have been studied theoretically. However, no physical spherical camera has been produced, due to the challenging manufacturing process. One type of manufactured panoramic camera, called a catadioptric camera, utilizes a paraboloid mirror to reflect light from all around a scene into a lens for light collection. Geyer and Daniilidis gave a detailed projective geometry for a catadioptric panoramic sensor [14]. Barreto and Araujo studied the geometry of the catadioptric projection of lines and its use in calibration [15]. Another type of panoramic camera uses a high-speed rotating CCD to collect a 360° cylindrical scene [16]. The third and most popular panoramic camera used in an MMS is the multi-camera rig system, which consists of a series of fisheye lenses. Shi et al. proposed a rigorous model for a multi-camera rig and integrated the model into a GPS-supported simultaneous localization and mapping (SLAM) system [17]. Kaess and Dellaert used a multi-camera rig for SLAM with an ideal spherical model [18]. Ji et al. compared the difference between the ideal model and the rigorous model in indoor and outdoor scenes [19].
Although a popular substitute for the ideal omnidirectional camera, the multi-camera rig is limited by current manufacturing techniques: the projection centers of the fisheye lenses hardly overlap, and the optical axes of the lenses seldom lie in the same plane; the separate fisheye images are therefore stitched into a panoramic image with inevitable stitching errors.


These technical limitations restrain the applications of 3D street scene reconstruction and high-precision surveying. However, the multi-camera rig can still be used to measure objects of interest with decimeter-level accuracy. In [19], the localization error was analyzed and experimentally shown to be less than 0.2 m. In this paper, we leverage multi-view panoramic images captured from a multi-camera rig to locate user-specified objects, and we focus on the challenges of applying template matching to detect distorted objects in panoramic images.

Three problems of using a multi-camera rig in template matching should be addressed. First, the wide baseline (e.g., 8 m) of a multi-camera rig causes significant perspective distortion between multi-view images. The distortions can defeat classic matching methods that use absolute intensity differences or normalized correlation of image patches. Secondly, a panoramic camera introduces further geometric deformation. One part of the deformation is caused by the projection from a 3D sphere to a 2D image, which makes the epipolar no longer a straight line but a trigonometric curve. The other part is the stitching error arising from the camera model of a multi-camera rig. The stitching error cannot be eliminated without knowing the depth of the scene: an object farther away from the panoramic projection sphere (e.g., 20 m radius) has larger errors, as its corresponding points in the adjacent fisheye images cannot be overlapped with each other. Third, compared with arbitrary feature matching, where correspondences are compared in feature space, template matching requires pixelwise searching in a given range in the original image. This means that the setting of an adequate search range affects both the efficiency and the success rate, as a larger search range brings more false positives.

Considering these restrictions, our work focuses on two aspects. One is to investigate the epipolar geometry of the multi-camera rig and the epipolar accuracy that can be achieved. Shi et al. proposed the stereo geometry of a multi-camera rig and pointed out that the epipolar line is the intersection of the panoramic sphere and a 2D plane that roughly passes through the camera center [17]. Ji et al. further analyzed the accuracy of the epipolar according to correspondent points in the stereo pairs [19]. However, these studies do not explicitly provide the formula of the epipolar model of a multi-camera rig. The other goal of our work is to develop an optimal feature descriptor or a robust template matching method that can match wide-baseline panoramic images. The most used technique in template matching is normalized correlation of pixel values (intensity) [20]. However, correlation can be defeated by any geometric distortion beyond a simple shift. To make the descriptor (or similarity measure) more robust, the strategies used in feature matching can be utilized. Although an object selected by a user (typically an image patch) does not satisfy the definition of an empirical feature, it can be described by any feature descriptor. For example, CENSUS is a simple and popular feature descriptor used in stereo matching, which consists of a sequence vector with binary elements comparing a point of interest with its neighborhood [21]. The descriptor used in the SURF feature consists of Haar wavelet responses in the horizontal and vertical directions and the absolute values of these responses [9]. The SIFT descriptor is a 128-dimensional bin-based gradient vector [10].
Besides utilizing feature descriptors in template matching, several recent articles have examined other strategies for template matching in challenging situations. Itamar et al. proposed a similarity measure named deformable diversity similarity for template matching, based on properties of the nearest-neighbor field [22]. Shaul et al. used a measure called Best-Buddies Similarity to make template matching robust against complex geometric deformations and high levels of outliers [23]. Wisarut et al. presented an adaptive template matching method based on the sum of squared differences to enhance the matching quality in object tracking and reduce the high computational cost [24]. Wu et al. also proposed a method to speed up template matching and decrease the computational costs of conventional methods [25]. Yoo et al. presented a histogram-based template matching method for situations with large scale differences between the target and template images [26]. To deal with heavy appearance variations, Sun et al. proposed a multiple-template method to track fast motion by generating virtual templates that are affinely transformed from the original one [27]. Korman et al. also proposed a fast method for template matching considering 2D affine transformations [28].


Hong et al. proposed a compact orientation template (DOT) representation with a fast partial-occlusion handling approach [29]. Generally, an ideal template matching method should be both effective and efficient. We propose an accelerated SIFT descriptor for template matching with improved effectiveness and efficiency compared with the most recent template matching methods that are considered robust.

The main contribution of this paper is to thoroughly investigate, for the first time, the challenging template matching of wide-baseline multi-view panoramic images, and to propose a high-performance matching method. In Section 2, we review the camera and stereo geometry of a multi-camera rig and deduce the explicit epipolar model under a sphere projection. In Section 3, we analyze the epipolar errors to constrain the search range for pixel-wise matching, introduce several feature descriptors to template matching on multi-view panoramic images, and propose an efficient matching strategy that significantly alleviates the computational costs. The experiments in Section 4 are designed to evaluate the accuracy and efficiency of these matching methods. The discussion and conclusion are given in Sections 5 and 6, respectively.

2. Geometry of a Multi-Camera Rig System

2.1. Ideal Panoramic Camera Model

An ideal spherical panoramic model represents the collinearity of an arbitrary 3D point P with coordinate vector X_P, the corresponding panoramic point u with coordinate vector X = [x′, y′, z′], and the panoramic center S (Figure 2a). In Equation (1), R and T are the rotation matrix and translation vector, respectively, and λ is the scale difference between the panoramic and world coordinate systems. In Equation (2), X is restricted to the surface of a sphere with radius r.

λX = Rᵀ(X_P − T)    (1)

x′² + y′² + z′² = r²    (2)

Coordinate X can be obtained from a 2D image point x = [x, y]. In Equation (3), φ_h is the horizontal angle with range [−π, π] and φ_v is the elevation angle with range [−0.5π, 0.5π]; w and h are the width and height of the panoramic image, respectively. Equation (4) calculates the sphere coordinate X using a right-handed coordinate system.

φ_h = π(2x − w)/w,  φ_v = π(h − 2y)/(2h)    (3)

x′ = r cos(φ_v) sin(φ_h),  y′ = r cos(φ_v) cos(φ_h),  z′ = r sin(φ_v)    (4)
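As a concrete illustration of Equations (3) and (4), the conversion between a panoramic pixel and its sphere coordinate can be written in a few lines of code. The sketch below (with our own function names) assumes the pixel origin at the top-left corner of the w × h panorama, as implied by the equations.

```python
import numpy as np

def pixel_to_sphere(x, y, w, h, r=20.0):
    """Equations (3)-(4): panoramic pixel (x, y) -> sphere point [x', y', z']."""
    phi_h = np.pi * (2.0 * x - w) / w          # horizontal angle in [-pi, pi]
    phi_v = np.pi * (h - 2.0 * y) / (2.0 * h)  # elevation angle in [-pi/2, pi/2]
    return np.array([
        r * np.cos(phi_v) * np.sin(phi_h),
        r * np.cos(phi_v) * np.cos(phi_h),
        r * np.sin(phi_v),
    ])

def sphere_to_pixel(X, w, h):
    """Inverse mapping: sphere point [x', y', z'] -> panoramic pixel (x, y)."""
    xs, ys, zs = X / np.linalg.norm(X)
    phi_h = np.arctan2(xs, ys)                 # x' ~ sin(phi_h), y' ~ cos(phi_h)
    phi_v = np.arcsin(zs)
    x = (phi_h * w / np.pi + w) / 2.0          # invert Equation (3)
    y = (h - 2.0 * h * phi_v / np.pi) / 2.0
    return x, y
```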

2.2. Rigorous Panoramic Camera Model

As shown in Figure 2b, a multi-camera rig is composed of five separate fisheye lenses. The rigorous camera model is therefore a transformation from a world object point P′ to the corresponding projection point u_c on the fisheye image (the solid line). For convenience, the fisheye image coordinates are usually first transformed to virtual plane camera coordinates by choosing a fisheye camera model [30] and a calibration model [31]; we denote this transformation as K_c. Then, according to the rotation R_c and translation T_c between a fisheye camera and the virtual panoramic camera, Equation (5) describes how a fisheye image point u_c with coordinate x_c is projected to the corresponding panoramic camera point u with coordinate X. K_c, R_c, and T_c are typically fixed values obtained by pre-calibration, and k is the scale factor between the ideal plane and the panoramic sphere, which can be calculated by combining Equations (2) and (5).

X = k·R_c·K_c·x_c + T_c    (5)


The third step is to associate the obtained panoramic coordinate X with its world coordinate X_P. According to the solid line in Figure 2b that passes through C, u, and P′, a more rigorous model can be constructed as Equation (6), where the projection center lies in the fisheye camera.

T_c + λ(X − T_c) = Rᵀ(X_P − T)    (6)

Neither (1) nor (6) can fully compensate the geometric stitching errors caused by a multi-camera rig system. However, the projection centers of the separate fisheye cameras C and the panoramic center S are very close, and the angle between SP and CP′ is very small, which ensures that by using (6) the projection errors of an object, within a certain distance to the projection sphere, are limited to less than one pixel. Note that every fisheye camera has its own translation parameter T_c with respect to the panoramic camera.

In this paper, the rigorous camera models (5) and (6) are used for relative orientation and local bundle adjustment of the panoramic image series to recover the geometric relations between adjacent multi-view images. The ideal camera model (1) is utilized to generate the epipolar curve and to evaluate its accuracy, as the object is measured on a panoramic image with 360° coverage.

(a) Ideal panoramic camera

(b) Multi-camera rig

Figure 2. The ideal panoramic camera (a), where S, u, and p are collinear, and the multi-camera rig (b), where in fact C, u, and p′ are collinear and C and S do not overlap.
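A minimal sketch of Equation (5) is given below. It assumes that the pre-calibrated quantities K_c, R_c, and T_c are available as a 3 × 3 matrix, a 3 × 3 rotation, and a 3-vector, respectively, and that x_c is the homogeneous virtual-plane coordinate of the fisheye point; the scale k follows from inserting (5) into the sphere constraint (2), which yields a quadratic equation. The function name and interfaces are ours, not from the paper.

```python
import numpy as np

def fisheye_to_panoramic(x_c, K_c, R_c, T_c, r=20.0):
    """Equation (5): map a virtual-plane point x_c of one fisheye camera onto the
    panoramic sphere of radius r. The scale k is obtained from constraint (2)."""
    v = R_c @ (K_c @ x_c)                     # ray direction in the panoramic frame
    # |k*v + T_c|^2 = r^2  ->  a*k^2 + b*k + c = 0
    a = v @ v
    b = 2.0 * (v @ T_c)
    c = T_c @ T_c - r * r
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("ray does not intersect the panoramic sphere")
    k = (-b + np.sqrt(disc)) / (2.0 * a)      # forward (positive) intersection
    return k * v + T_c                        # panoramic sphere coordinate X
```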

2.3. Ideal Epipolar of Panoramic Stereo

Both the ideal and the rigorous multi-camera models have their corresponding epipolar geometry. As points are matched in the stitched panoramic images rather than in the separate fisheye images, only the epipolar of the ideal panoramic stereo is utilized.

We set the baseline B = [Bx, By, Bz] and the corresponding rays X1 = [X1, Y1, Z1] and X2 = [X2, Y2, Z2] in panoramic coordinates. X2′ = RX2 is the ray that has been transformed into the coordinate frame of the left camera by the rotation R. Then we have

B · (X1 × X2′) = 0    (7)

Equation (7) can be expanded along the third row of the determinant to obtain Equation (8), where a, b, and c are determined by B, X1, and R, with a = By·Z1 − Bz·Y1, b = Bz·X1 − Bx·Z1, and c = Bx·Y1 − By·X1. Combined with Equation (2), the epipolar line of ideal panoramic stereo images is a great circle through the panoramic camera center.

a·X2 + b·Y2 + c·Z2 = 0    (8)


Finally, we substitute Equations (3) and (4) into (8):

a sin(2πx/w) + b cos(2πx/w) − c·cot(πy/h) = 0    (9)

where (x, y) is the pixel coordinate and w and h are the width and height of the image, respectively, indicating that the epipolar line of a panoramic stereo is a trigonometric curve rather than a straight line.

3. Multi-View Template Matching for Panoramic Images

3.1. Pre-Processing of Feature-Based Matching and Local Bundle Adjustment

The exterior orientation parameters obtained by the GPS/INS system could be utilized to recover the epipolar geometry. However, we found that errors in the GPS/INS observations can cause biases of up to a dozen pixels in the images. We therefore execute relative orientation and local bundle adjustment to obtain the relative poses between adjacent images and use them to generate more accurate epipolar curves.

Firstly, SIFT features [10] are extracted from the fisheye images and then matched in stereo pairs with a GPU-boosted strategy [32]. Then, after transferring these matching candidates to panoramic coordinates with Equation (5), RANSAC [33] is embedded into Equation (6) for outlier elimination. Third, the corresponding features in adjacent stereo pairs are used to recover the scale shift between them. Every 3D feature votes for a scale and the median value is used. In this study, we use only triple-view geometry for panoramic epipolar generation and template matching, as we used a wide baseline of up to 8 m. Some matching errors may remain in the triple-view images after scale fusion. We execute bundle adjustment embedded with weight-decaying methods on the three images according to Equation (6) to ensure that all of the false correspondences are eliminated. Finally, we obtain the accurate relative poses between the multi-view panoramic images.
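The feature-based pre-processing can be approximated with standard tools, as in the sketch below. Note that the paper embeds RANSAC in the panoramic model of Equation (6); the illustration instead uses OpenCV's SIFT with a ratio test and a conventional fundamental-matrix RANSAC on the fisheye images, so it only mimics the described pipeline.

```python
import cv2
import numpy as np

def match_fisheye_pair(img1, img2):
    """SIFT matching with Lowe's ratio test and a simple RANSAC outlier filter.
    Illustrative stand-in for the GPU-boosted matching and the model-based
    RANSAC of Equation (6) used in the paper."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    raw = matcher.knnMatch(des1, des2, k=2)
    good = [m[0] for m in raw
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]  # ratio test
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # Standard fundamental-matrix RANSAC as a generic outlier filter.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    inliers = mask.ravel().astype(bool)
    return pts1[inliers], pts2[inliers]
```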

3.2. Error Estimation of the Panoramic Epipolar

Based on Equation (9), we draw the epipolar line on a panoramic image with sphere projection. In Figure 3, we evenly select points in the image and calculate the bias of the epipolar line, that is, the minimum distance between a true correspondent point and the epipolar line. According to Table 1, the average bias is 8.29 pixels and the maximum bias is 35 pixels. Generally, nearer points exhibit larger errors, as observed in Figure 3. We set a search range of 40 pixels in the direction orthogonal to the epipolar line to ensure every candidate is in the search area.

Figure 3. Epipolar of a panoramic image with sphere projection (yellow curve). Points in circles are selected for checking the epipolar errors, and bigger circles indicate larger errors (numbers beside the circles).


Table 1. Checking epipolar bias in a panoramic image with known correspondent points.

Total Points    Minimum    Maximum    Average    <10 Pixels     10–20 Pixels    >20 Pixels
76              0          35.0       8.29       53 (69.74%)    18 (23.68%)     5 (6.58%)

The search range along the epipolar line is determined in two ways. In one case, we set the depth of an object to be measured between 0.5~100 m without any support from a depth sensor. A distance farther than 100 m would cause rapid degeneration of the localization accuracy and is ignored. In the other case, a laser scanner is considered available for collecting a sparse point cloud. The point cloud has been georeferenced by the GPS/INS data and the pre-calibration parameters between the laser scanner and the GPS/INS system. We project the 3D laser points to a virtual panoramic image according to the pose of the camera (obtained from the GPS/INS data and the pre-calibration parameters between the GPS/INS system and the camera) and the sensor model (1).

As shown in Figure 4, the point cloud is sparse, and sparser at larger depths. To make maximum use of the data, we set a search window around the given object point. The median depth value in this window is chosen as the depth of the object. In this paper, the length of the window is set to 20 pixels. Finally, the error range of a depth value d is set to d ± n, where n is empirically set to 2 m.

In both cases, the object with the assumed maximum and minimum depth is projected to the panoramic image to give the two endpoints of an epipolar curve segment, which constrains the search range along the epipolar.
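The depth lookup and the resulting epipolar segment can be sketched as follows. The 20-pixel window and the ±2 m tolerance follow the values given above; the `project` callback is assumed to implement the world-to-pixel mapping of the ideal model (1), and all names are illustrative.

```python
import numpy as np

def object_depth(depth_map, x, y, win=20):
    """Median of the valid sparse depths within a win x win window (Section 3.2)."""
    h, w = depth_map.shape
    x0, x1 = max(0, x - win // 2), min(w, x + win // 2)
    y0, y1 = max(0, y - win // 2), min(h, y + win // 2)
    patch = depth_map[y0:y1, x0:x1]
    valid = patch[patch > 0]                  # zero marks "no laser return"
    return float(np.median(valid)) if valid.size else None

def search_segment(cam_center, ray_dir, depth, project, tol=2.0):
    """Project the object at depth d - tol and d + tol into an adjacent panorama;
    the two pixels bound the search segment along the epipolar curve."""
    near = cam_center + (depth - tol) * ray_dir
    far = cam_center + (depth + tol) * ray_dir
    return project(near), project(far)
```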

Figure 4. A virtual panoramic depth map (right) generated from the 3D point cloud (left) using the same pose as the corresponding panoramic image (middle).

Figure 5 shows the workflow of the depth-map-supported searching strategy for multi-view panoramic image matching. Given an object point in the n-th image, epipolar lines are generated in the first stereo pair and the second pair, respectively. The depth of the object point is retrieved from the corresponding depth map. The search range is determined (the red curved boxes in the second row) based on the range of depth values and the bias of the epipolar line. Within the search range, the optimal candidates are retrieved by pixelwise matching with a given similarity measurement. At last, multi-view intersection is carried out for locating and error elimination.

Figure 5. Depth-map-supported multi-view matching. The left and right stereos are matched separately, and in the multi-view intersection, mistakenly matched candidates can be observed and eliminated.
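The multi-view intersection in Figure 5 can be illustrated by a linear least-squares intersection of the viewing rays; candidates whose rays leave large residuals can then be rejected, which corresponds to the error-elimination step. This is a generic sketch rather than the exact adjustment used in the paper.

```python
import numpy as np

def intersect_rays(origins, directions):
    """Least-squares intersection of viewing rays (one per panoramic image).
    origins: (n, 3) camera centers; directions: (n, 3) ray directions.
    Returns the 3D point and per-ray distances, which can flag wrong candidates."""
    origins = np.asarray(origins, float)
    dirs = np.asarray(directions, float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    projectors = []
    for o, d in zip(origins, dirs):
        M = np.eye(3) - np.outer(d, d)        # projector orthogonal to the ray
        projectors.append((M, o))
        A += M
        b += M @ o
    P = np.linalg.solve(A, b)
    residuals = [float(np.linalg.norm(M @ (P - o))) for M, o in projectors]
    return P, residuals
```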


3.3. Template Matching with Different Feature Descriptors

Pixel-wise template matching within a given search area depends largely on a robust descriptor used for representing the reference and target image patches. The descriptor is then compared with a normalized correlation (or another similarity measure), and the highest correlation coefficient (or minimum distance) corresponds to the optimal candidate. In this section, we review several commonly used descriptors in template matching and feature matching, and propose an efficient strategy for calculating SIFT descriptors and their similarity.

Regardless of what kind of feature descriptor is used, scale changes that cannot be completely compensated will affect the matching results. We resample images to the same scale according to the ratio of the distances between the 3D object and the two camera stations. In Figure 6a,b, depth1 and depth2 are the distances of an interested point measured from the different camera stations. If a depth map is unavailable, we use empirical values of 0.8, 1, and 1.2 as scale factors and resample only the reference image patch. For those pixel features, such as intensity, that would fail due to a rotation, we dynamically align and resample the patches to be matched to the tangential direction of the epipolar line (Figure 6c,d).

Figure Figure 6. 6. The The preprocessing preprocessing for for scale scale and and rotation. rotation. To To reduce reduce the the scale scale changes changes between between stereo stereo image image patches, we can resample (a) based on the ratio depth1/depth2 or resample (b) based on patches, we can resample (a) based on the ratio depth1/depth2 or resample (b) based on depth2/depth1. depth2/depth1. forofrotation, both of the patch could be tangent aligned to the tangent direction As for rotation, As both the patch windows couldwindows be aligned to the direction of the current of the current point, as shown on (c,d). point, as shown on (c,d).

In this paper, the descriptors listed below are evaluated in the template matching of wide-baseline multi-view panoramic images.

Pixel values in dual space. The intensity of pixels is the most used feature in template matching. In addition to intensity, we also utilize the phase information by transforming the image patch to the frequency domain with the Fourier transform [20]. The patches in the dual image and frequency spaces are used to calculate correlation coefficients, and the candidate with the largest coefficient is considered optimal.

CENSUS descriptor. By comparing the intensity of a pixel with its neighboring pixels within a 21 × 21 patch window after epipolar alignment, the CENSUS descriptor is an ordered sequence vector whose binary elements represent "bigger" or "smaller". The similarity between two CENSUS descriptors is calculated by the normalized Hamming distance.
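For example, the CENSUS descriptor and its normalized Hamming distance can be written compactly; the sketch below assumes the patch has already been aligned to the epipolar direction as described above, and the function names are illustrative.

```python
import numpy as np

def census(patch):
    """CENSUS descriptor of an aligned (2k+1) x (2k+1) patch: binary comparisons
    of every neighbour against the central pixel."""
    center = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    bits = (patch > center).astype(np.uint8).ravel()
    return np.delete(bits, bits.size // 2)        # drop the self-comparison

def census_similarity(desc1, desc2):
    """Normalized Hamming distance between two CENSUS descriptors (0 = identical)."""
    return float(np.count_nonzero(desc1 != desc2)) / desc1.size
```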


HOG (Histogram of Oriented Gradients) descriptor [34]. Depending on the quality of the reference and target images, Gamma normalization is selectively executed first. Then, within a 16 × 16 pixel block, the gradient magnitude and orientation of each point are computed, and the block is divided into 2 × 2 cells and aligned to the dominant gradient orientation. Afterwards, tri-linear interpolation is used to calculate 9 orientation bins for each cell. At last, the L2-Hys (Lowe-style clipped L2 norm) scheme is used to normalize the 36-dimensional vector of the block. The similarity depends on the normalized distance of two HOG vectors.

SURF descriptor. After being aligned to the tangent direction of the epipolar line, a 20 × 20 patch centered on the current point is split into 4 × 4 sub-regions, and a four-dimensional feature vector is computed in each 5 × 5 pixel sub-region. The four elements of the vector are the sums of the Haar wavelet responses in the horizontal and vertical directions, respectively, as well as the sums of the absolute values of these responses. The normalized distance between the two SURF descriptors (i.e., the two 64-dimensional vectors) is used as the similarity measure.

SIFT descriptor. First, the gradient magnitude and orientation of each point within the 16 × 16 patch window are computed and the patch is aligned to the main orientation. A Gaussian kernel is then utilized to weight points by their distance to the center point. Second, the gradient magnitudes and orientations of the pixels in each 4 × 4 sub-window are accumulated into an orientation histogram with eight directions to make up a local eight-dimensional vector. A complete SIFT descriptor consists of a 128-dimensional vector of the 16 sub-regions after suppression of overly large gradients. The normalized distance between two SIFT vectors is used as the similarity measure.
Accelerated SIFT descriptor. As the calculation of the pixel-wise SIFT descriptor can be very slow for template matching, especially with a large search range, we developed a method that dramatically reduces the calculation load of the SIFT descriptor (referred to in this study as the accelerated SIFT descriptor). The SIFT descriptor consists of 4 × 4 components, each of which covers 4 × 4 pixels and is described by an eight-dimensional vector. Hence, we can calculate this vector for every pixel over its 4 × 4 region in advance (a 2 × 2 region is drawn for simplicity in Figure 7a). Over the search range, we finally produce a map with the elements of these vectors. This guarantees that the calculation of the SIFT descriptor is performed only once, and we only need to read the descriptor of a candidate from memory instead of re-calculating it. Then, a Gaussian window is applied to weight the pixels by their distance to the current search point. For the orientation bias, we rotate the reference image patch instead of the searched patch. To achieve this, we first calculate the tangent direction d_i of the epipolar line at the current candidate point i (as every point on the panoramic epipolar has its own tangent direction, as shown in Figure 7b), and then obtain the direction difference between d_i and the tangent direction of the reference epipolar line d_0. The reference patch is rotated by d_0 − d_i degrees to compensate for the angle bias (Figure 7c). This calculation of the direction difference is not fully rigorous and could introduce a tiny direction bias; however, the bias is very slight and can be ignored given the tremendous efficiency improvement.
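The core idea of the accelerated descriptor, namely computing the per-cell eight-bin gradient histograms of the search area only once and assembling 128-dimensional descriptors by memory lookup, can be sketched as follows. The illustration omits the Gaussian weighting and the reference-patch rotation discussed above, and all names are ours.

```python
import numpy as np

def precompute_cell_histograms(gray, cell=4, bins=8):
    """For every cell x cell block of the search area, accumulate one
    `bins`-dimensional gradient-orientation histogram (computed only once)."""
    gy, gx = np.gradient(gray.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    h, w = gray.shape
    H, W = h // cell, w // cell
    hist = np.zeros((H, W, bins), np.float32)
    bin_idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    for i in range(H):
        for j in range(W):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            np.add.at(hist[i, j], bin_idx[sl].ravel(), mag[sl].ravel())
    return hist

def descriptor_at(hist, ci, cj):
    """Assemble a 128-D SIFT-like descriptor from the 4 x 4 cells whose top-left
    cell index is (ci, cj); pure memory lookup, no gradient re-computation."""
    d = hist[ci:ci + 4, cj:cj + 4].reshape(-1)
    n = np.linalg.norm(d)
    return d / n if n > 0 else d
```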

(a) the reference point

(b) a candidate point in search area

(c) the rotated reference patch

Figure 7. The computation of the accelerated SIFT descriptor. In (a), a SIFT point is described by 16 8-D vectors, which can be stored at the 16 orange points. Therefore, for every pixel in the search area (b), we can calculate and store the 8-D vectors only once. When a point is to be matched, its SIFT descriptor is easily retrieved. In (c), the reference descriptor is rotated according to the difference of the two epipolar directions.


4. Experiments and Results

4.1. Test Design

To test the multi-view template matching on panoramic images, we prepared two datasets captured under different situations. The first dataset, consisting of 498 images, was captured in 2012 using a PGR Ladybug3 camera [35] in Kashiwa city, Japan. The camera consists of six fixed fisheye lenses, each of which has a maximum 1616 × 1232 pixel resolution and 24-bit RGB colour resolution (see Figure 8). The focal length of each fisheye camera is 3.3 mm, and the radius of the panoramic sphere is set to 20 m. A dual-frequency GPS receiver and an IMU (inertial measurement unit) are mounted on the car for georeferencing with CORS RTK technology and GPS/IMU combined filtering (embedded in an Applanix POS/AV system). The processed GPS/IMU data is leveraged to recover the poses of the camera, which are then used to calculate the 3D world coordinates of matched points through triangulation and to estimate the search area. A baseline of 2 m between adjacent images is used for testing. The search range of an object depth is set to 0.5~100 m.

The other dataset, with 168 images, was captured in 2017 using a car-mounted Ladybug5 camera in Wuhan city, China. The maximum resolution of the fisheye lens is improved to 2448 × 2048 pixels. Besides GPS/IMU, a laser scanner was utilized to collect a sparse depth map (see Figure 4). Due to the measurement errors of GPS/IMU and the calibration errors among the GPS, camera, and laser scanner, the bias between the depth map and the panoramic image is up to 0.6 m. A baseline of 8 m between adjacent images is used for the test.
The search range along the epipolar line is set to the object's depth range d ± 2 m, where d is the median value within the 20 × 20 window centered at the interested point in the depth map.

In pre-processing, SIFT feature matching, tie point extraction, and local bundle adjustment were sequentially carried out on the two datasets to obtain the epipolar geometry of the multi-view panoramic images. Hundreds of interested points were then manually selected in one view image to evaluate the performance of the different template matching methods.

Figure 8. Six fisheye lenses and the stitched panoramic images captured by the Ladybug3 camera.

4.2. Results of the Wuhan Data (with Depth Map)

Table 2 shows the performance of the different descriptors embedded in template matching on 80 points selected from road signs, covers, poles, railings, and so on. Although there exist significant geometric distortions, the dual-space intensity based matching could obtain a satisfactory match rate with our preprocessing strategy, that is, alignment to the epipolar direction and scale normalization according to the depth ratio. Among the feature descriptors, the CENSUS descriptor performs the worst and SIFT the best. Our accelerated SIFT (shortened to AccSIFT) descriptor obtained the same results as SIFT except for one point. The SURF descriptor performed similarly to the dual-space intensity and better than the HOG descriptor.

As to efficiency, more complex descriptors result in lower efficiency, and the SIFT descriptor is much slower than the dual-space intensity in execution. Our AccSIFT descriptor achieved almost the same efficiency as the dual-space intensity and improved 7.9 times compared with the original SIFT descriptor.


Table 2. Comparison of different descriptors on the Wuhan data.

Methods     Match Rate    Time (s)
Intensity   72/80         0.202
CENSUS      65/80         0.119
HOG         69/80         0.344
SURF        73/80         0.631
SIFT        78/80         1.539
AccSIFT     77/80         0.195

Figure 9 shows some visual examples of the different descriptors. The main distortion in Figure 9a is the scaling distortion; both the dual-space intensity and CENSUS failed to match on one view image, while SURF, HOG, and AccSIFT (SIFT) found all the correct correspondences in the multi-view images. It can also be observed that the search range along the epipolar varies largely between different views. Figure 9b shows a challenging situation where the depth of the object is unavailable due to the sparsity of the depth map. A very long search range therefore has to be covered. However, except for one match error that occurred with the CENSUS and HOG methods, the other methods correctly found the correspondences. Figure 9c shows an example demonstrating the distortion caused by different view angles. The dual-space intensity and HOG methods failed, but all the other methods succeeded.

Figure 9. Performances of template matching with different descriptors on the Wuhan data. From left to right: reference image and the two adjacent search images, the results of the NCC, CENSUS, HOG, SURF and AccSIFT descriptors, respectively.

4.3. Results of the Kashiwa Data (without Depth Map)

Table 3 shows the performances of template matching with different descriptors on the Kashiwa data, where 120 object points were selected for evaluation. The CENSUS descriptor obtained the worst matching rate, and the dual-space intensity and SURF performed almost the same. The SIFT and AccSIFT descriptors performed the best and the second best, respectively. As the depth of an object is unknown, a larger search range has to be covered compared to the Wuhan data. This also caused the match rate to drop slightly; for example, the match rate of SIFT dropped from 97.5% to 93.3%. The efficiency dropped correspondingly. It took 4.6 s to match a point when using the SIFT descriptor. Our AccSIFT descriptor improved the efficiency by 450% compared to the original version and is even faster than the dual-space intensity.

Table 3. Comparison of the different descriptors on the Kashiwa data.

Methods      Match Rate    Average Time (s)
Intensity    101/120       1.259
CENSUS       91/120        0.572
HOG          96/120        1.797
SURF         103/120       2.849
SIFT         112/120       4.646
AccSIFT      107/120       1.042

Figure 10 shows some examples of the different descriptors applied on the Kashiwa data. In Figure 10a, a point on a far street lamp made all of the methods fail except the dual-space intensity. The results represent a challenging case, as the wrongly predicted positions vary largely. It was also a rare case in which all of the feature-based descriptors failed to find the correct match. This might be interpreted as meaning that the more complicated SIFT or SURF descriptors can simulate more complex distortions and therefore create ambiguity (false positives). Figure 10b shows that the large search range and the changes of scale made the intensity, HOG and CENSUS descriptors fail in matching a point on a cover. In Figure 10c, the change of viewing direction at a corner resulted in one matching error each for the SURF and HOG descriptors.
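The availability of depth only changes how the candidate positions are generated: with a laser-derived depth value the trial depths span a narrow interval around it, while without one they must cover the whole plausible scene depth, which lengthens the search range. The following is a hedged sketch of that step; project_to_pano stands in for the multi-camera-rig projection derived earlier in the paper and is only a placeholder, and all depth bounds are illustrative values, not the settings used in the experiments.

```python
import numpy as np

def candidate_positions(ray_origin, ray_dir, project_to_pano,
                        depth=None, rel_tol=0.1,
                        d_min=2.0, d_max=80.0, n=200):
    """Sample candidate pixels along the epipolar curve in the search image
    by projecting points of the reference viewing ray at trial depths.
    `project_to_pano` is a placeholder for the panoramic projection."""
    if depth is not None:
        # Depth known (Wuhan case): a short interval around it suffices.
        lo, hi = depth * (1.0 - rel_tol), depth * (1.0 + rel_tol)
    else:
        # Depth unknown (Kashiwa case): cover the whole plausible range.
        lo, hi = d_min, d_max
    depths = np.linspace(lo, hi, n)
    pts3d = ray_origin[None, :] + depths[:, None] * ray_dir[None, :]
    return [project_to_pano(p) for p in pts3d], depths
```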


Figure 10. Performances of template matching with different descriptors on the Kashiwa data. From left to right: reference image and the two adjacent search images, the results of the NCC, CENSUS, HOG, SURF and AccSIFT descriptors, respectively.

5. Discussion

5.1. The Difference between the AccSIFT and SIFT Descriptors

Considering the processing efficiency, our AccSIFT descriptor was designed slightly differently from the original version. We align the two corresponding patches by rotating one of them by an angle that equals the difference between the two tangent directions of the stereo epipolar lines, instead of rotating each patch to its own epipolar direction. This may introduce a slight bias, especially when the angle difference becomes larger, and leads to a slight underperformance of our method compared to the original version. On the Wuhan data, the matching rate dropped 1.25%; on the Kashiwa data, it dropped 4.1%. Figure 11 shows a case in which the AccSIFT descriptor failed while the SIFT descriptor found the correct match. It can be observed that large rotations of the background buildings caused difficulty in matching. However, the efficiency of the AccSIFT descriptor gained improvements of 790% and 450% over the SIFT descriptor on the two datasets, respectively.
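The two alignment strategies can be contrasted with a small sketch: the SIFT-style variant rotates each patch to its own epipolar tangent direction, whereas the accelerated variant leaves the reference patch untouched and rotates only the candidate patch by the difference of the two tangent angles, saving one resampling per comparison. Function and variable names below are illustrative assumptions, not the authors' code.

```python
import cv2

def rotate_patch(patch, angle_deg):
    """Rotate a square patch about its centre (borders filled with zeros)."""
    h, w = patch.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(patch, M, (w, h))

def align_sift_style(ref_patch, cand_patch, ref_tangent, cand_tangent):
    """Original strategy: rotate both patches to their own epipolar
    tangent directions before computing the descriptors."""
    return (rotate_patch(ref_patch, -ref_tangent),
            rotate_patch(cand_patch, -cand_tangent))

def align_accsift_style(ref_patch, cand_patch, ref_tangent, cand_tangent):
    """Accelerated strategy: keep the reference patch untouched and rotate
    the candidate patch by the tangent-angle difference only."""
    return ref_patch, rotate_patch(cand_patch, ref_tangent - cand_tangent)
```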

Figure 11. Comparison of the SIFT and the AccSIFT descriptors on matching of the Wuhan images.

5.2. Comparison to Most Recent Studies on Template Matching

We compare our AccSIFT matching method with the two most recent studies in robust template matching under large geometric distortions. One method proposed a deformable diversity similarity (abbreviated as DDIS) [22]; the other introduced a similarity called best-buddies similarity (abbreviated as BBS) [23]. A color-based version (rather than a depth-based one) is applied in this case (abbreviated as DDIS-C and BBS-C). To be fair, we cropped the corresponding search range (a bounding box) of every point of interest from the Wuhan images and masked the background with black inside the bounding box (see Figure 12). From Table 4, we can observe that our method performs better than the two methods in match rate, with BBS-C the worst. As to efficiency, our method and DDIS-C performed almost the same, and both are faster than BBS-C. Figure 12 shows four examples of template matching on the Wuhan data. In all the challenging cases, our method correctly found the correspondences in both views, whereas DDIS-C and BBS-C failed on one view image or on both.

It should be noted that our AccSIFT descriptor is specially designed for template matching with approximate epipolar constraints (which can commonly be obtained via feature matching or GPS/IMU data) and is therefore encoded with better rotation invariance. The other two methods depend on their claimed rotation-invariant similarity measures without these easily accessed epipolar constraints.

Table 4. Comparison to most recent studies on the Wuhan data.

Methods    Correct Rate    Average Time (s)
AccSIFT    77/80           0.195
DDIS-C     74/80           0.187
BBS-C      63/80           0.325
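The preparation step used for this comparison, cropping the bounding box of the search range and blacking out pixels outside the admissible region so that all three methods see the same support, could look roughly like the following sketch; the array layout and the optional mask argument are assumptions for illustration only.

```python
import numpy as np

def crop_and_mask(image, bbox, search_mask=None):
    """Crop the bounding box of the search range from a panoramic frame and
    set pixels outside the admissible search region to black."""
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1].copy()
    if search_mask is not None:
        # `search_mask` is a boolean array of the crop's height and width
        # that is True on the admissible epipolar search region.
        crop[~search_mask] = 0
    return crop
```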


Figure 12. The performances of the three template matching methods on the Wuhan data. The green box in the top images indicates the reference to be matched; the left image patches are cropped from the last-frame panoramic image and the right from the next frame. The results of our method are denoted by a blue box; the results of DDIS-C and BBS-C are denoted by pink and red boxes, respectively.


6. Conclusions

We studied template matching on wide-baseline multi-view panoramic images from a vehicle-mounted multi-camera rig. Firstly, the epipolar equation of the panoramic stereo was deduced and the epipolar accuracy was analyzed. We then thoroughly evaluated the performances of different feature descriptors in this template matching setting and proposed a fast version of the SIFT descriptor, which reached the best performance in terms of efficiency and accuracy. We also showed that our method improves on the two most recent studies in robust template matching when applied to multi-view panoramic images.

Author Contributions: S.J. led the research and wrote the paper; D.Y. executed most of the experiments; Y.H. and M.L. were involved in research design and manuscript editing.

Funding: This work was supported by the National Natural Science Foundation of China (41471288) and the National Key Research and Development Plan of China (2017YFB0503001).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Yang, B.; Fang, L.; Li, J. Semi-automated extraction and delineation of 3D roads of street scene from mobile laser scanning point clouds. ISPRS-J. Photogramm. Remote Sens. 2013, 79, 80–93. [CrossRef]
2. Paparoditis, N.; Papelard, J.-P.; Cannelle, B.; Devaux, A.; Soheilian, B.; David, N.; Houzay, E. Stereopolis II: A multi-purpose and multi-sensor 3D mobile mapping system for street visualisation and 3D metrology. Rev. Fr. Photogramm. Télédétec. 2012, 200, 69–79.
3. Corso, N.; Zakhor, A. Indoor localization algorithms for an ambulatory human operated 3D mobile mapping system. Remote Sens. 2013, 5, 6611–6646. [CrossRef]
4. El-Sheimy, N.; Schwarz, K. Navigating urban areas by VISAT—A mobile mapping system integrating GPS/INS/digital cameras for GIS applications. Navigation 1998, 45, 275–285. [CrossRef]
5. Jaakkola, A.; Hyyppä, J.; Kukko, A.; Yu, X.; Kaartinen, H.; Lehtomäki, M.; Lin, Y. A low-cost multi-sensoral mobile mapping system and its feasibility for tree measurements. ISPRS-J. Photogramm. Remote Sens. 2010, 65, 514–522. [CrossRef]
6. Kim, G.H.; Sohn, H.G.; Song, Y.S. Road infrastructure data acquisition using a vehicle-based mobile mapping system. Comput.-Aided Civ. Infrastruct. Eng. 2006, 21, 346–356. [CrossRef]
7. Briechle, K.; Hanebeck, U.D. Template matching using fast normalized cross correlation. Proc. SPIE 2001, 4387, doi:10.1117/12.421129. [CrossRef]
8. Brunelli, R. Template Matching Techniques in Computer Vision: Theory and Practice, 1st ed.; John Wiley & Sons: Hoboken, NJ, USA, 2009.
9. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [CrossRef]
10. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [CrossRef]
11. Mei, C.; Benhimane, S.; Malis, E.; Rives, P. Efficient homography-based tracking and 3-D reconstruction for single-viewpoint sensors. IEEE Trans. Robot. 2008, 24, 1352–1364. [CrossRef]
12. Paya, L.; Fernandez, L.; Gil, A.; Reinoso, O. Map building and Monte Carlo localization using global appearance of omnidirectional images. Sensors 2010, 10, 11468–11497. [CrossRef] [PubMed]
13. Gutierrez, D.; Rituerto, A.; Montiel, J.M.M.; Guerrero, J.J. Adapting a real-time monocular visual SLAM from conventional to omnidirectional cameras. In Proceedings of the 11th OMNIVIS in IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 343–350.
14. Geyer, C.; Daniilidis, K. Catadioptric projective geometry. Int. J. Comput. Vis. 2001, 45, 223–243. [CrossRef]
15. Barreto, J.P.; Araujo, H. Geometric properties of central catadioptric line images and their application in calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1327–1333. [CrossRef] [PubMed]
16. Parian, J.A.; Gruen, A. Sensor modeling, self-calibration and accuracy testing of panoramic cameras and laser scanners. ISPRS-J. Photogramm. Remote Sens. 2010, 65, 60–76. [CrossRef]
17. Shi, Y.; Ji, S.; Shi, Z.; Duan, Y.; Shibasaki, R. GPS-supported visual SLAM with a rigorous sensor model for a panoramic camera in outdoor environments. Sensors 2012, 13, 119–136. [CrossRef] [PubMed]
18. Kaess, M.; Dellaert, F. Probabilistic structure matching for visual SLAM with a multi-camera rig. Comput. Vis. Image Underst. 2010, 114, 286–296. [CrossRef]
19. Ji, S.; Shi, Y.; Shi, Z.; Bao, A.; Li, J.; Yuan, X.; Duan, Y.; Shibasaki, R. Comparison of two panoramic sensor models for precise 3D measurements. Photogramm. Eng. Remote Sens. 2014, 80, 229–238. [CrossRef]
20. Lewis, J.P. Fast normalized cross-correlation. In Proceedings of the Vision Interface, Quebec, QC, Canada, 15–19 June 1995; pp. 120–123.
21. Zabih, R.; Woodfill, J. Non-parametric local transforms for computing visual correspondence. In Proceedings of the 3rd European Conference on Computer Vision, Stockholm, Sweden, 2–6 May 1994; pp. 151–158.
22. Talmi, I.; Mechrez, R.; Zelnik-Manor, L. Template matching with deformable diversity similarity. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1311–1319.
23. Dekel, T.; Oron, S.; Rubinstein, M.; Avidan, S.; Freeman, W.T. Best-buddies similarity for robust template matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2021–2029.
24. Chantara, W.; Mun, J.-H.; Shin, D.-W.; Ho, Y.-S. Object tracking using adaptive template matching. IEEE Trans. Smart Process. Comput. 2015, 4, 1–9. [CrossRef]
25. Wu, T.; Toet, A. Speed-up template matching through integral image based weak classifiers. J. Pattern Recognit. Res. 2014, 1, 1–12.
26. Yoo, J.; Hwang, S.S.; Kim, S.D.; Ki, M.S.; Cha, J. Scale-invariant template matching using histogram of dominant gradients. Pattern Recognit. 2014, 47, 3006–3018. [CrossRef]
27. Sun, J.; He, F.Z.; Chen, Y.L.; Chen, X. A multiple template approach for robust tracking of fast motion target. Appl. Math. Ser. B 2016, 31, 177–197. [CrossRef]
28. Korman, S.; Reichman, D.; Tsur, G.; Avidan, S. Fast-Match: Fast affine template matching. Int. J. Comput. Vis. 2017, 121, 1–15. [CrossRef]
29. Hong, C.; Zhu, J.; Yu, J.; Cheng, J.; Chen, X. Realtime and robust object matching with a large number of templates. Multimed. Tools Appl. 2016, 75, 1459–1480. [CrossRef]
30. Schneider, D.; Schwalbe, E.; Maas, H.-G. Validation of geometric models for fisheye lenses. ISPRS-J. Photogramm. Remote Sens. 2009, 64, 259–266. [CrossRef]
31. Kannala, J.; Brandt, S.S. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1335–1340. [CrossRef] [PubMed]
32. Sinha, S.N.; Frahm, J.M.; Pollefeys, M.; Genc, Y. GPU-based video feature tracking and matching. In Proceedings of the Workshop on Edge Computing Using New Commodity Architectures (EDGE 2006), Chapel Hill, NC, USA, 23–24 May 2006.
33. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [CrossRef]
34. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; pp. 886–893.
35. Introduction of a PGR's Ladybug3 Camera. Available online: https://www.ptgrey.com/ladybug3-360degree-firewire-spherical-camera-systems (accessed on 1 May 2018).

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).