Persistent Aerial Tracking System for UAVs∗

Matthias Mueller¹, Gopal Sharma², Neil Smith¹, Bernard Ghanem¹

Abstract— In this paper, we propose a persistent, robust and autonomous object tracking system for unmanned aerial vehicles (UAVs) called Persistent Aerial Tracking (PAT). A computer vision and control strategy is applied to a diverse set of moving objects (e.g. humans, animals, cars, boats), integrating multiple UAVs with a stabilized RGB camera. A novel strategy is employed to successfully track objects over a long period by 'handing over the camera' from one UAV to another. We evaluate several state-of-the-art trackers on the VIVID aerial video dataset and on additional sequences that are specifically tailored to low-altitude UAV target tracking. Based on this evaluation, we select the leading tracker and improve upon it by optimizing for both speed and performance, integrate the complete system into an off-the-shelf UAV, and obtain promising results demonstrating the robustness of our solution in real-world aerial scenarios.

I. INTRODUCTION

The ability to capture stabilized high-resolution video from low-cost UAVs has the potential to significantly redefine future objectives in the development of state-of-the-art object tracking methods. In this paper, we propose a persistent, robust and autonomous object tracking system designed for UAV applications, called Persistent Aerial Tracking (PAT) (see Fig. 1). Persistent aerial tracking can serve many purposes, not only surveillance but also search and rescue, wildlife monitoring, crowd monitoring/management, and extreme sports. Deploying PAT on UAVs is a very promising application, since the camera can follow the target based on visual feedback and actively change its orientation and position to optimize tracking performance (e.g. persistent tracking accuracy in the presence of occlusion or fast motion across large and diverse areas). This is the defining difference from static tracking systems, which passively analyze a dynamic scene to produce analytics for other systems. It enables ad-hoc and low-cost surveillance that can be quickly deployed, especially in locales where surveillance infrastructure is not already established or feasible (e.g. remote locations, rugged terrain, and large water bodies). A current drawback of UAVs for persistent aerial tracking is their limited flight time (ca. 10-25 minutes), especially for multi-rotor copters. To this end, we propose a novel system of target tracking handover among a network of coordinated UAVs. In this scenario, 'camera handover' is a process similar to what is needed in traditional static tracking systems, since it involves transferring the tracked target's appearance model from one fixed camera to another.

*This work was supported by KAUST, Saudi Arabia.
1 Matthias Mueller, Neil Smith and Bernard Ghanem are with the Department of Electrical Engineering, KAUST ([email protected]).
2 Gopal Sharma is with the Department of Electrical Engineering, IIT Roorkee, India.

Fig. 1: UAV visually tracking a moving human despite occlusion

In the case of PAT, camera handover also requires the exchange of the flight data of the active UAV (e.g. GPS coordinates, altitude, heading, etc.) with one or more UAVs. When a UAV reaches its low-battery threshold, it can request another UAV to resume tracking. Moreover, since communication range and transmission speed can prove to be bottlenecks when relying on a ground station for online tracking, our approach allows all computations to be conducted onboard each UAV. Based on an extensive evaluation of tracking algorithms in offline and online experiments, we identify a top-performing tracker, improve it for real-time aerial tracking, and integrate it into a completely automated UAV tracking system. This system provides a robust solution for real-world aerial scenarios. The proposed system is validated by a series of experiments, where objects are tracked in cluttered outdoor environments with a variety of appearance and scale changes. The contributions of our work are threefold.
1) An integrated UAV system that performs onboard autonomous aerial tracking using an optimized aerial tracking algorithm. The tracking system controls the UAV and relays tracking results to a ground station. It is fully modular with the ability to modify the tracking technique and is mountable on commercially available UAVs.
2) An evaluation of state-of-the-art trackers using multiple evaluation metrics [1] and an adaptation (runtime and accuracy) of the leading tracker for UAV tracking.
3) A novel strategy for UAV camera handover and a re-initialization module in case of target loss.

II. RELATED WORK

Early object tracking methods for UAVs primarily rely upon two approaches. The first approach applies Canny edge and Harris corner detectors to isolate distinct feature points, then uses accumulative frame differencing and background subtraction for blob tracking of all moving targets within the UAV's field-of-view (FOV) [2]–[4]. Tensor voting and motion pattern segmentation are used by [5] to address parallax, noise in background modeling, and long-term occlusions.

More recent work by [6], [7] focuses on tracking for sense-and-avoid maneuvers by utilizing a combination of feature point tracking and morphological filters. A major drawback of this tracker [6] is that it falsely identifies lens flare as an aircraft, which is overcome in later work by the same authors [7]. The second approach utilizes color information to detect and track moving targets [8]–[10]. These works develop robust vision-based control mechanisms to control the flight motion of the UAV, known as visual servoing. However, this approach is limited to scenarios where the colors of the target are clearly distinguishable from the background (e.g. a red balloon or a red car). Other work uses thermal and IR cameras as an alternative to RGB cameras [11], [12]. The thermal signatures of humans and cars can easily be distinguished from the background using approaches such as Mean-Shift. The work in [11], [12] trains Haar classifiers to isolate human body signatures, speed up computation, and account for varying illumination, base color and scale. Human tracking using UAV-mounted RGB-D sensors also shows promising results [13]. The added depth information makes segmentation and tracking of moving objects robust and highly precise with the OpenNI tracker. In [13], the problems of vibration and fast motion on the UAV are addressed by warping to a virtual static camera using the IMU sensors and an Extended Kalman filter. Although this approach performs well indoors, the RGB-D sensor limits its applicability outdoors and its tracking range to 5 m. Most closely related to our approach is the work of [14] and [15]. In [14], the object tracker TLD [16] is used to track objects from a UAV and obtains good short-term results. However, as their experiments show, TLD incorporates the background into its learning over time, leading to quick target drift. This method only tracks a person at an altitude of 1.5 m, not taking into consideration common tracking problems such as large perspective change and scale variation. The work in [15] presents a new tracker outperforming MIL [17] and TLD [16] for model-free visual aircraft tracking. The tracker is able to follow another aircraft/intruder in the sky under illumination changes (strong sunlight) and background clutter (clouds) from a fixed-wing UAV. Note that these experiments are very application-specific, with the object (aircraft/intruder) and the uncluttered background (sky) always being the same. While these contributions are related to our work, they are specific to certain application domains and rely on tracking methods that are no longer state-of-the-art. In comparison, our proposed tracking system is general-purpose and makes use of very recent advances in object tracking in RGB video.

III. THE AERIAL TRACKING SYSTEM

A. System Overview

Our system consists of two classes of low-cost UAVs: an 850mm hexacopter with a 3-axis gimbal system and a 450mm-class quadcopter with a pan/tilt gimbal. Both utilize the Pixhawk flight controller (FC) for low-level stabilization and control of the UAV and gimbal.

Fig. 2: The two classes of unmanned aerial vehicles integrated with the onboard vision-based flight control system.

The high-level onboard processing for tracking, handover, and communication is handled by an ARM-based Linux computer (Odroid XU4). Attached to the onboard flight computer (OFC) are a USB camera, a Wifi module, and an FTDI adapter for serial communication with the Pixhawk flight controller using the Mavlink protocol. The ground control station (GCS) connects via Ethernet to a Wifi AP for communication with both UAVs. The software for the OFC and GCS is written in C++ using Qt5 for Linux and Windows. Figure 3 shows an overview of the complete hardware system setup.

Fig. 3: The hardware and software onboard the flight control system (FCS) and the ground control station (GCS). The FCS establishes a direct Wifi link with the GCS to enable communication back and forth and to initialize/change the tracked target.

B. System Description

Our proposed PAT system is implemented onboard the UAV. It is modular, allowing integration of any image-based object tracker within its pipeline. Video frames are captured at a resolution of 320x240 pixels at up to 30Hz by a low-cost USB camera (68 degree FOV, Novatek NY99140 image sensor) and then processed and stored within the flight computer. The predicted image patches containing the target are wirelessly transmitted to the GCS at a lower frame rate for tracking supervision. In order to initialize tracking, the GCS requests a full frame and then transmits a cropped image patch of the desired target. The tracking thread is started with the image patch as input for the initialization module (Section III-F), which uses template matching to compute the first bounding box. After the tracker is initialized, it runs in an independent thread that is updated with the current video frames received from the USB camera. The tracker predicts a new bounding box in each frame, which is subsequently evaluated using an evaluation module (Section III-G).

If the evaluation module's score for the predicted bounding box is above the expected threshold, it is used for UAV position correction. Otherwise (i.e. target lost or fully occluded), the bounding box is rejected and the re-initialization module (Section III-G) attempts to re-identify the target in subsequent frames. The continuous output of the tracker is a scaled upright bounding box enclosing the target. The calculated error between the centers of the bounding box and the complete image frame is used to update the position of the UAV in order to keep the object centered in the UAV's FOV (Section III-C). A TCP client/server module for communication between the UAVs and the GCS is implemented as part of the system (see Fig. 3). The OFC on each UAV opens a TCP channel to receive control messages from the GCS and to send tracking information to the GCS (e.g. current tracking status, target appearance, and flight data). Through a user-friendly GUI, the GCS is capable of monitoring multiple UAVs at a time, each in its own independent thread. The GCS allows initialization of the target and the ability to cancel or change the current tracking target during UAV operation. The GCS can override the onboard control of each UAV independently to navigate it to a new GPS waypoint or adjust its position through direct motor control. Finally, the GCS supports the UAVs in handling camera handover (Section III-D).
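The paper does not specify the wire format of these TCP messages. The following is a minimal sketch, in C++ with Qt5 (the toolkit named above), of how such a status message could be serialized; the field names and the simple length-prefixed framing are hypothetical, not the system's actual protocol.

```cpp
// Hypothetical OFC -> GCS status message (illustrative fields only).
#include <QTcpSocket>
#include <QDataStream>
#include <QByteArray>
#include <QIODevice>

struct TrackStatus {
    quint8  state;                          // e.g. 0 = tracking, 1 = lost, 2 = handover requested
    float   bboxX, bboxY, bboxW, bboxH;     // current bounding box (pixels)
    double  lat, lon, altitude, heading;    // flight data from the flight controller
    QByteArray patchJpeg;                   // cropped target patch for tracking supervision
};

void sendStatus(QTcpSocket& socket, const TrackStatus& s) {
    QByteArray payload;
    QDataStream out(&payload, QIODevice::WriteOnly);
    out.setVersion(QDataStream::Qt_5_0);
    out << s.state << s.bboxX << s.bboxY << s.bboxW << s.bboxH
        << s.lat << s.lon << s.altitude << s.heading << s.patchJpeg;

    // Length-prefix the payload so the GCS can reassemble messages from the TCP stream.
    QByteArray frame;
    QDataStream header(&frame, QIODevice::WriteOnly);
    header << quint32(payload.size());
    frame.append(payload);
    socket.write(frame);
}
```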

Fig. 4: Pipeline of the onboard visual tracking system showing the integration of the different components.

C. Visual Tracking and Control System

The camera system is mounted on a 2-axis gimbal, decoupling it from the UAV so that the camera pitch (45 degrees) and roll are held constant during translational movement. Moreover, the UAV altitude is maintained using sensor data from the integrated barometer. In every frame, the object tracker attempts to predict a bounding box containing the target. The x and y pixel error between the current target's bounding box center and the center of the video frame is used as input for a fully-tuned PID controller. The output consists of two PWM signals translated as x and y movement by the flight controller. The correction occurs at the speed of the camera (30fps), incrementally adjusting the position of the UAV until the target is centered in the FOV of the camera (i.e. x,y error close to 0). Using a PID loop to control the UAV allows for fast computation, extensibility to other UAV platforms and robustness without requiring GPS/IMU measurements. We also set a deadband in the PID loop to reduce the frequency of copter movement and energy expenditure. The PID controller holds its output steady if the bounding box error is smaller than the defined deadband range, in which case the low-level controller keeps the UAV in position. Since the offset is usually small from frame to frame, only small UAV movements are required to keep the tracked object consistently within the center of the frame.
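Since the control loop is only described in prose, a minimal C++ sketch of such a PID-with-deadband correction is given below. The gains, deadband width and PWM mapping are illustrative assumptions, not the values tuned on the actual system.

```cpp
// Minimal sketch (not the actual flight code) of the pixel-error -> PID -> PWM correction.
#include <algorithm>
#include <cmath>

struct PID {
    double kp, ki, kd;
    double integral = 0.0, prevError = 0.0;
    double step(double error, double dt) {
        integral += error * dt;
        const double derivative = (error - prevError) / dt;
        prevError = error;
        return kp * error + ki * integral + kd * derivative;
    }
};

// Maps the bounding-box center offset (pixels) to two PWM outputs around a
// 1500 us neutral point; inside the deadband the output is held at neutral.
void correctPosition(double bbCx, double bbCy, double frameCx, double frameCy,
                     PID& pidX, PID& pidY, double dt, int& pwmX, int& pwmY) {
    const double deadband = 15.0;   // pixels (placeholder value)
    const double ex = bbCx - frameCx;
    const double ey = bbCy - frameCy;

    pwmX = 1500;
    pwmY = 1500;
    if (std::abs(ex) > deadband)
        pwmX += static_cast<int>(std::clamp(pidX.step(ex, dt), -400.0, 400.0));
    if (std::abs(ey) > deadband)
        pwmY += static_cast<int>(std::clamp(pidY.step(ey, dt), -400.0, 400.0));
    // pwmX / pwmY are then forwarded to the flight controller, nudging the UAV
    // until the target sits near the frame center (error close to 0).
}
```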

D. Handover System

In order to achieve persistent aerial tracking, we implement a method of camera handover. We apply a simple but robust strategy through the use of the onboard GPS, compass and barometer and the known camera angle of the first UAV. Once the battery of the active UAV reaches its first low-voltage threshold, it transmits a handover request to the GCS with its current GPS position, altitude, heading, and the tracked target's appearance model. The GCS then sends a new UAV to continue tracking the object of interest. The active UAV continues tracking until the new UAV is positioned correctly, ensuring that the most current template is transferred and that the target remains in view even if it has moved during the handover operation. In order to initialize the tracker on the new UAV, it has to be in the proximity of the active UAV and face the desired object while keeping enough distance to avoid a collision (see Figure 5). We calculate the horizontal distance x_obj between the active UAV and the target using the camera angle φ and the altitude of the UAV h_gnd. We then determine the necessary heading offset θ, given the desired distance x_uav between the active and the new UAV.


Fig. 5: Camera Handover

x_obj / h_gnd = tan(π/2 − φ)  ⇒  θ = cos⁻¹( 1 − x_uav² / (2 x_obj²) )    (1)

Note that since the camera angle is π/4 = 45 deg in our setup, x_obj is equal to h_gnd. Lastly, we calculate the desired GPS coordinates for the new UAV using the haversine formula [18], where ϕ1 and ω1 are the latitude and longitude of the active UAV.

Latitude (new UAV): ϕ2(ϕ1, x_uav, r_earth, θ)    (2)

Longitude (new UAV): ω2(ω1, ϕ1, ϕ2, x_uav, r_earth, θ)    (3)

ϕ2 = asin( sin(ϕ1) cos(δ) + cos(ϕ1) sin(δ) cos(θ) )    (4)

ω2 = ω1 + atan2( v(ϕ1, θ, δ), u(ϕ1, ϕ2, δ) )    (5)

where

δ = x_uav / r_earth    (6)
u(ϕ1, ϕ2, δ) = cos(δ) − sin(ϕ1) sin(ϕ2)    (7)
v(ϕ1, θ, δ) = sin(θ) sin(δ) cos(ϕ1)    (8)
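For illustration, Eqs. (1)-(8) could be implemented as sketched below (a C++ sketch, not the paper's code). Angles are in radians, variable names mirror the symbols above, and the Earth radius value is an assumption, since the paper only cites the haversine formulation [18].

```cpp
// Sketch of the handover geometry (Eqs. 1-8).
#include <cmath>

struct GeoPoint { double lat, lon; };   // radians

// Eq. (1): ground distance to the target from altitude and camera pitch,
// and the heading offset theta for a desired UAV-to-UAV distance x_uav.
double headingOffset(double h_gnd, double phi, double x_uav, double& x_obj) {
    const double kPi = 3.14159265358979323846;
    x_obj = h_gnd * std::tan(kPi / 2.0 - phi);   // equals h_gnd for a 45 deg pitch
    return std::acos(1.0 - (x_uav * x_uav) / (2.0 * x_obj * x_obj));
}

// Eqs. (2)-(8): destination point at bearing theta and distance x_uav
// from the active UAV's position p1 (haversine destination formulation).
GeoPoint destination(const GeoPoint& p1, double x_uav, double theta,
                     double r_earth = 6371000.0) {
    const double delta = x_uav / r_earth;                                   // Eq. (6)
    const double lat2  = std::asin(std::sin(p1.lat) * std::cos(delta) +
                                   std::cos(p1.lat) * std::sin(delta) * std::cos(theta)); // Eq. (4)
    const double v = std::sin(theta) * std::sin(delta) * std::cos(p1.lat);  // Eq. (8)
    const double u = std::cos(delta) - std::sin(p1.lat) * std::sin(lat2);   // Eq. (7)
    const double lon2 = p1.lon + std::atan2(v, u);                          // Eq. (5)
    return {lat2, lon2};
}
```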

The new UAV receives and navigates to the determined GPS position using its own onboard GPS and compass, but at a higher altitude than the current UAV to avoid any kind of interference (we assume correct GPS and compass readings but account for the margin of error of these sensors). Once in position, it descends to the same altitude as the active UAV and turns to the correct heading θ_new = θ_active + θ. The new UAV uses the passed target model to detect the object and initialize the tracker. In case the new UAV is unable to detect the target on the first attempt (e.g. due to some error in altitude, heading or position), we retry until the object is found. Upon success, the new UAV starts tracking and signals the active UAV to return to home.

E. Tracker Module

The core element of the PAT system is the tracker. There is a large selection of established object trackers, and our system is modular to allow for integration of new trackers as advancements are made. However, most benchmarks of state-of-the-art trackers are evaluated on very generic datasets [1] and are hence not suitable to identify the best tracker for integration on the UAV. Therefore, we choose 7 trackers according to their performance in the very popular online tracking benchmark [1] and attempt to represent as many different methods and features as possible. Note that we only pick trackers with speeds on the order of at least 10fps: IVT [19], CXT [20], TLD [16], Struck [21], OAB [22], CSK [23], and ASLA [24]. In addition, we also compare 5 of the most recent trackers (even those running at less than 10fps): MEEM [25], MUSTER [26], DSST [27] (winner VOT2014), SRDCF [28] (winner VOT-TIR2015) and SOWP [29]. We evaluate all trackers on the established aerial tracking dataset VIVID. Since the VIVID dataset is fairly small (9 sequences), outdated (2005) and its sequences are very similar (only vehicles as targets), we add 41 additional sequences¹ captured from an octocopter with a stabilized gimbal following different objects at varying altitudes (approx. 5-25 meters). We call this dataset UAV50.

1) Evaluation Setup: For a fair evaluation, all trackers are run with standard parameters on a server-grade workstation (Intel Xeon X5675, 3.07GHz, 48GB). Due to the ARM architecture, we cannot run the benchmark on the onboard computer directly, but we experimentally determine that tracking speed is reduced by approximately a factor of 5 without further code optimization.

¹In another work we extend these additional sequences and propose a new dataset (UAV123) and benchmark for aerial tracking [30]. We provide an extensive evaluation of many state-of-the-art trackers and analyze specific aerial tracking nuisances. Moreover, we present a novel evaluation approach with a high-fidelity real-time visual tracking simulator (UE4), which can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV "in the field", as well as to generate synthetic but photorealistic tracking datasets with automatic ground-truth annotations. Both the benchmark and simulator are made publicly available on our website.

To compare the different trackers on UAV sequences, we use the same evaluation paradigm proposed in the aforementioned tracking benchmark [1], namely one-pass evaluation (OPE) and spatial robustness evaluation (SRE). As the name suggests, OPE evaluates how well the tracker predicts the bounding box in all subsequent frames given the bounding box in frame 1. SRE evaluates the sensitivity of each tracker to shift/scale changes in the initialization of the target (4 center shifts, 4 corner shifts, and scaling by 80, 90, 110 and 120 percent) [1]. In fact, SRE is especially important for evaluating our proposed system because target initialization (manually selected by an operator) and re-initialization can lead to bounding boxes that are not tight around the target or slightly shifted. We do not make use of temporal robustness evaluation (TRE) in this paper, where trackers are initialized at different frames within a sequence, since user re-initialization should not occur in fully autonomous tracking scenarios. The scores for OPE and SRE are based on two metrics, precision and success rate. Precision is measured as the distance between the centers of a tracker bounding box bb_tr and the corresponding ground truth bounding box bb_gt. The precision plot shows the percentage of tracker bounding boxes within a given threshold distance in pixels of the ground truth. To rank the trackers, we use a threshold of 20 pixels [1]. Success is measured as the intersection over union of pixels in the tracker bounding box bb_tr and the corresponding ground truth bounding box bb_gt:

S = |bb_tr ∩ bb_gt| / |bb_tr ∪ bb_gt|    (9)

The success plot shows the percentage of tracker bounding boxes whose overlap score S is larger than a given threshold. To rank the trackers, we use the area under the curve (AUC) [1].

2) Evaluation Results: The experiments on VIVID (refer to Fig. 6 (top)) indicate that Struck is the best tracker for integration on a UAV, since it runs at a decent speed and even outperforms the latest trackers. However, the results on the extended dataset with 50 sequences and a lot more challenges and diversity (refer to Fig. 6 (bottom)) highlight that, while Struck can still compete with some of the recent trackers, there is room for improvement. In general, as expected, recent trackers outperform classical ones by a significant margin. In the OPE precision plots, CSK [23], TLD [16], ASLA [24], CXT [20] and IVT [19] achieve similar low performance and MUSTER [26], DSST [27] and OAB [22] achieve mediocre performance, while SOWP [29], SRDCF [28], MEEM [25] and Struck [21] are the top performers. We also compare StruckUAV, our adaptation of Struck for UAVs (see Section III-E for a discussion of the improvements made to Struck), and it outperforms all other trackers on both the VIVID dataset and the extended UAV50 dataset in terms of OPE. The SRE plots indicate that StruckUAV also performs very well with noise in target initialization, ranking second on both datasets. It is also noteworthy that StruckUAV is several times faster than any other evaluated tracker with similar performance.
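For reference, both metrics can be computed from per-frame bounding boxes as sketched below. This is our own illustration of the standard definitions from [1], not code from the paper; note that the area under the success curve equals, up to threshold sampling, the mean overlap score.

```cpp
// Sketch of the benchmark metrics: center-error precision at the 20-pixel
// ranking threshold and overlap success (IoU, Eq. 9) averaged into an AUC score.
#include <algorithm>
#include <cmath>
#include <vector>

struct Box { double x, y, w, h; };   // top-left corner, width, height (pixels)

double centerError(const Box& tr, const Box& gt) {
    const double dx = (tr.x + tr.w / 2.0) - (gt.x + gt.w / 2.0);
    const double dy = (tr.y + tr.h / 2.0) - (gt.y + gt.h / 2.0);
    return std::sqrt(dx * dx + dy * dy);
}

double iou(const Box& tr, const Box& gt) {                                   // Eq. (9)
    const double ix = std::max(0.0, std::min(tr.x + tr.w, gt.x + gt.w) - std::max(tr.x, gt.x));
    const double iy = std::max(0.0, std::min(tr.y + tr.h, gt.y + gt.h) - std::max(tr.y, gt.y));
    const double inter = ix * iy;
    return inter / (tr.w * tr.h + gt.w * gt.h - inter);
}

// Fraction of frames whose center error is within 20 pixels of the ground truth.
double precisionAt20(const std::vector<Box>& tr, const std::vector<Box>& gt) {
    int ok = 0;
    for (size_t i = 0; i < tr.size(); ++i)
        if (centerError(tr[i], gt[i]) <= 20.0) ++ok;
    return static_cast<double>(ok) / tr.size();
}

// Area under the success curve (AUC), i.e. the mean overlap score S over all frames.
double successAUC(const std::vector<Box>& tr, const std::vector<Box>& gt) {
    double sum = 0.0;
    for (size_t i = 0; i < tr.size(); ++i) sum += iou(tr[i], gt[i]);
    return sum / tr.size();
}
```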

Fig. 6: Precision plots for OPE and SRE on VIVID data set (top) and UAV50 data set (bottom).

Note that in our benchmark every 10 frames correspond to one second of real time. Each tracker predicts a bounding box for each frame regardless of its actual speed. Of course, this is very different when tracking in real time. If frames are not processed fast enough, in-between frames are lost, resulting in a larger displacement of the target between frames and making tracking more difficult. Therefore, if the tracker is too slow, the tracking performance will degrade. Even if the tracker can cope with lower frame rates, updates for the UAV will be less frequent, making it more difficult to keep the moving target within the field-of-view (FOV). The target can easily be lost and the UAV will be forced to hold in place while searching to re-initialize the lost target.

3) Integration: Our evaluation shows that, among currently established trackers, Struck [21] performs best in terms of tracking performance to speed ratio. Struck reaches up to 20fps and is able to handle many tracking challenges (e.g. partial occlusion). Most trackers approach tracking as a classification task and use an online classifier to build an object appearance model. Estimated object positions are converted into labeled training instances. Struck bridges the gap between the objectives of the classifier (label prediction) and the tracker (accurate prediction of object location) by learning a prediction function that directly outputs the frame-to-frame transformation (translation in this case) of the bounding box. The prediction function is learned online to adapt to the object appearance within a kernelized structured output SVM framework. For real-time applications, a budgeting mechanism is incorporated to bound the growth of the number of support vectors during tracking. Struck performs notably well even when the target moves fast [1], since a large search region is sampled to provide better target-background discrimination. This tracker is less sensitive to scale variation and performs well even when the initial bounding box is not strictly tight.

In the following, we present further improvements made to Struck to integrate it as a tracking module in our system and to improve its performance for aerial tracking. We coin this variant StruckUAV.

Improvements to Struck. A very common challenge in aerial tracking is a drastic change of target scale and aspect ratio due to camera motion (viewpoint of the UAV). A tight bounding box around the tracked target is critical for the UAV to keep the target centered within the FOV. In order to improve the accuracy of StruckUAV, we add anisotropic scale adaptation. At each frame we generate a 3x3 grid of bounding boxes of different size and aspect ratio, which are then scored by Struck, allowing for bounding box resizing. This improves updates, since it prevents the classifier from being updated using small parts of the object or too much background. Even though Struck has some robustness to scale variation, we found that a more dynamic model to handle scale changes improves tracking performance even further. Fig. 7 highlights the improvement of StruckUAV in terms of bounding box overlap as compared to Struck. Its bounding box tightly encloses the tracked object, thus allowing for higher accuracy and more robustness to changes in target shape and appearance. This is clearly seen in the boat example, as the UAV pans from the rear view to the side view. It is also important to note that other trackers that allow for bounding box scaling fail in several cases and do not consistently resize the bounding box when the scale or shape of the object changes. We also increase Struck's budget size by 20% to improve the tracker's memory of the object's appearance. The additional sampling for scale invariance and the increased budget size come at the cost of runtime. To compensate, we reduce the number of samples in the search grid depending on the target size. For targets with an area larger than a predefined threshold (3200 pixels), we use a step size of 3 pixels rather than performing an exhaustive search in a local neighborhood to update the classifier. This increases the speed of tracking significantly without much loss in performance. Fig. 8 shows how StruckUAV outperforms all other trackers in terms of bounding box overlap. Comparing it to the original Struck tracker, we clearly see the effect of scale and aspect ratio adaptation, as well as smarter sampling, on performance, where StruckUAV outperforms Struck by a clear margin while maintaining an even higher frame rate in most sequences. Because of its runtime, its overall tracking performance, and its robustness to scale variations, we integrate StruckUAV into our UAV onboard system. With the improvements to the original Struck and further adaptation for low-latency onboard processing, we are able to run this leading tracker on our UAV system on a low-cost ARM-based Linux computer (Odroid XU4) with close to real-time performance at less than 80% total CPU load.
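A minimal sketch of the anisotropic scale adaptation described above is given below; the 5% scale step is an illustrative assumption, and the scoring of the nine candidates by the structured SVM is left abstract.

```cpp
// Sketch of the 3x3 scale/aspect-ratio candidate grid used by StruckUAV.
#include <vector>

struct Box { double x, y, w, h; };   // top-left corner, width, height (pixels)

std::vector<Box> scaleCandidates(const Box& cur, double step = 0.05) {
    const double factors[3] = {1.0 - step, 1.0, 1.0 + step};
    std::vector<Box> candidates;
    for (double fw : factors) {
        for (double fh : factors) {            // independent w/h scaling changes the aspect ratio
            Box b = cur;
            b.w = cur.w * fw;
            b.h = cur.h * fh;
            b.x = cur.x + (cur.w - b.w) / 2.0; // keep the candidate centered on the current box
            b.y = cur.y + (cur.h - b.h) / 2.0;
            candidates.push_back(b);
        }
    }
    return candidates;   // 9 candidates; the tracker keeps the one with the highest SVM score
}
```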

Fig. 7: StruckUAV and SRDCF with scaling ability vs. Struck, SOWP and MEEM without scaling ability. Rows 1-3 are from UAV50 and Row 4 from VIVID dataset.

Fig. 8: Success plots for OPE on the VIVID and UAV50 datasets.

F. Target Initialization Module

A typical approach to initialize a tracker is to hold the object in place within a defined bounding box for multiple frames (e.g. refer to [21] on training Struck). Since this is not feasible on a moving UAV, we implement an initialization module using template matching. Tracking starts by sending a region of interest (a cropped template image of the target) from the GCS to the UAV, which is in a position/altitude hold over the area of interest (see Fig. 3). After receiving this template, the onboard visual tracking system performs zero-mean matching, whereby the zero-mean template is used as a linear filter to compute a response map over the whole image (after making each image patch support have zero mean). The pixel location with the highest template filter response in the map is retrieved as the initial bounding box of the target, which in turn initializes the object tracker. Although other types of template matching could be used (e.g. normalized cross-correlation or NCC), this method is faster and suitably accurate for aerial tracking. Alternatively, initialization can be performed using a set of images already stored on the UAV or by using a human detector.
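The paper does not name a particular implementation of the zero-mean matching. One way to realize it (an assumption, not the paper's code) is with OpenCV's TM_CCOEFF mode, which subtracts the mean from both the template and each image patch before correlating, as sketched below.

```cpp
// Sketch of zero-mean template matching for target initialization.
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Rect initialBoundingBox(const cv::Mat& frame, const cv::Mat& templ) {
    cv::Mat response;
    cv::matchTemplate(frame, templ, response, cv::TM_CCOEFF);   // zero-mean correlation
    cv::Point bestLoc;
    cv::minMaxLoc(response, nullptr, nullptr, nullptr, &bestLoc);
    // The peak of the response map gives the top-left corner of the initial
    // bounding box, which is then used to initialize the object tracker.
    return cv::Rect(bestLoc.x, bestLoc.y, templ.cols, templ.rows);
}
```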

G. Bounding Box Evaluation and Re-initialization Module

Successful trackers quickly adapt to appearance changes of the object. The model updates they employ are usually effective in handling some perspective, illumination, and scale variations, which commonly occur in aerial tracking scenarios. However, if the target is fully occluded or leaves the FOV of the camera, most trackers fail, since the model is updated incorrectly with background while the object is not visible. Hence, it is necessary to determine whether or not the bounding box predicted by the tracker actually contains the target. To make persistent tracking possible, we implement an independent measure to predict when these cases occur and integrate it into our onboard tracking pipeline, both for evaluating the quality of the tracker itself and for triggering immediate re-initialization measures if necessary (see Fig. 3). Our evaluation and re-initialization modules can be applied to any tracker integrated in the PAT system. As mentioned earlier, we use the proposed StruckUAV to perform tracking. In the evaluation module, we incrementally build an appearance-based classifier for the target. We compile a set of positive and negative training samples based on the previous tracking results. Since training samples arrive sequentially, we learn an incremental linear SVM with HOG (histogram of oriented gradients) and color histogram features (108 features in total), following [31]. The HOG feature (81 dimensions) is extracted using 3 × 3 windows with 9 bins each. The color feature (27 dimensions) is computed as a 9-bin histogram of each color channel. We empirically found that collecting training instances on every third frame and bundling 15 instances before training the incremental SVM achieves promising results. The trained model provides a confidence score for the existence of the target in the bounding box predicted by the tracker. This score can also be combined with the confidence score of the tracker for more robustness. If the score falls below a predefined threshold, the bounding box prediction is rejected and re-initialization is performed. The re-initialization module is designed to be efficient, as runtime is critical in real-time UAV tracking scenarios. Since an exhaustive search with a sliding window is very slow, we make use of a time-efficient and object-centric sampling of the entire image, namely object proposals [32]. Not only does this reduce the number of search windows to be evaluated, it does so at a high target recall rate. In other words, the proposals generated in an image are designed to respond to locations in the image where a generic object might be. The use of object proposals has become a standard first step in state-of-the-art object detection systems [33]. Many object proposal methods exist in the literature, but we select geodesic proposals [32] since they offer an acceptable tradeoff between computational complexity and accuracy [34]. To reduce the total number of proposals, we prune out those with a low proposal score or an area that is much larger or smaller than the most recent target size. To improve target position localization within the proposals, we sample target-sized bounding boxes within each proposal and evaluate our target SVM classifier. The bounding box inside a proposal with a classification score above a predefined threshold is selected as the new target location and the tracker is re-initialized.

If this threshold is not exceeded (e.g. if the object does not re-enter the field of view), this procedure is repeated in the next frame until the object is re-identified. Fig. 9 shows an example of how we search for the target within each proposal, while Fig. 10 shows examples of re-initialization in different aerial video sequences. To reduce false positives and to speed up re-initialization, we distinguish between two cases. If the object is lost within the frame, proposals are extracted in the local neighborhood only. If the object is lost near the boundary of the frame, we extract object proposals over the complete image.

Fig. 9: (a) Original frame showing one object proposal in black and the ground truth bounding box in red. (b) Multiple candidate bounding boxes are sampled inside the proposal and the target classifier is applied on each.

Fig. 10: Examples of the re-initialization module. (a) and (b) show cases where partial and full occlusion result in re-initialization. Red bounding boxes indicate bounding boxes predicted by the tracker with a score below the threshold. When the object is re-detected, the tracker is re-initialized indicated by the green bounding box.
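For concreteness, the 108-dimensional feature used by the evaluation module of Section III-G could be assembled as sketched below; the exact gradient binning and normalization are not specified in the paper, so those details are assumptions, and OpenCV is used only for illustration.

```cpp
// Sketch of a 108-dim feature: 81-dim gradient-orientation histogram over a 3x3
// grid of cells (9 bins each) plus a 9-bin histogram per color channel (27 dims).
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> extractFeatures(const cv::Mat& patchBgr) {
    std::vector<float> feat;
    feat.reserve(108);

    // 81-dim HOG-style part: gradient magnitudes accumulated into 9 orientation bins per cell.
    cv::Mat gray, gx, gy;
    cv::cvtColor(patchBgr, gray, cv::COLOR_BGR2GRAY);
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);
    const int cw = gray.cols / 3, ch = gray.rows / 3;
    for (int cy = 0; cy < 3; ++cy)
        for (int cx = 0; cx < 3; ++cx) {
            float hist[9] = {0};
            for (int y = cy * ch; y < (cy + 1) * ch; ++y)
                for (int x = cx * cw; x < (cx + 1) * cw; ++x) {
                    const float dx = gx.at<float>(y, x), dy = gy.at<float>(y, x);
                    const float mag = std::sqrt(dx * dx + dy * dy);
                    const float ang = std::atan2(dy, dx) + float(CV_PI);   // [0, 2*pi)
                    const int bin = std::min(8, int(ang / (2.0 * CV_PI) * 9));
                    hist[bin] += mag;
                }
            feat.insert(feat.end(), hist, hist + 9);
        }

    // 27-dim color part: 9-bin intensity histogram for each of the B, G, R channels.
    for (int c = 0; c < 3; ++c) {
        float hist[9] = {0};
        for (int y = 0; y < patchBgr.rows; ++y)
            for (int x = 0; x < patchBgr.cols; ++x)
                hist[patchBgr.at<cv::Vec3b>(y, x)[c] * 9 / 256] += 1.0f;
        feat.insert(feat.end(), hist, hist + 9);
    }
    return feat;   // 81 + 27 = 108 values, fed to the incremental linear SVM
}
```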

IV. EXPERIMENTAL RESULTS

A. Experimental Testbed

In order to evaluate our PAT tracking system, we conducted a series of online experiments, each repeated multiple times, and summarize the results below. Figure 2 shows our setup. A wide open grass field (ca. 50m×50m) is selected with plenty of room for the UAVs to maneuver at low altitude (10-15 m) and for our human targets to walk and run. The grass field is not uniform, but contains large brown patches of soil, over-watered areas of dark green, storm drains and patches of gravel. Over the various periods of testing, the human targets wore their daily clothing of various types (mostly printed t-shirts); in several experiments we selected colors similar in hue to the grass field to make tracking more difficult (note that StruckUAV does not use color for tracking). The first UAV is flown manually to the center of the field and placed in GPS hold mode. The target is then asked to go to the center of the field to be in the FOV of the UAV. From the ground station, we request a frame from the UAV, in which we annotate a bounding box around the target to be used for tracker initialization. After initialization, the tracker module on the onboard computer commences tracking using StruckUAV and the UAV becomes fully autonomous. Diagnostics on the onboard computer showed that less than 80% of the processor is required to perform all calculations for our tracking system.

B. System Evaluation

Regular Motion. To simulate regular motion, the target was asked to move in different directions while varying speed and pose. Consistently over multiple tests, the UAV running StruckUAV was able to track the moving target reliably throughout the flight despite all the variations that occurred.

Rapid/Erratic Motion. In a second series of experiments, the target was asked to make abrupt turns in multiple directions. In this case, the UAV was able to respond quickly and maintain tracking. However, if the target moved abruptly towards the copter and sought to run under it, the UAV often lost the target, since it did not respond fast enough. The poor response that recurred in this experiment is primarily related to the speed constraints on the PID controller and the deadband employed on the UAV to purposefully prevent such aggressive maneuvers. Despite the dampened response of the UAV, during multiple tests StruckUAV maintained tracking until the target left the frame. On some rare occasions where complete occlusion occurred for less than 30 frames, StruckUAV was able to find the target without re-initialization.

Robustness to Occlusion. In a third set of experiments, we tested the effect of occlusion by allowing a second person to interact with the target. The second person walked in front of and behind the target, leading to varying degrees of partial occlusion. This type of occlusion was consistently handled well by the tracker, which kept a tight bounding box around the target. Tracker drift did not occur in this case.

Camera Handover. In the final experiment we tested the camera handover module (refer to Section III-D). A second UAV was manually flown to the corner of the field and placed in GPS hold. Instead of exhausting the entire battery and then sending the handover request, we implemented a simple button on the ground station to trigger the handover function that, in real-world scenarios, would be triggered when the battery voltage monitor reaches a low level. This allowed us to test camera handover multiple times within the same flight. The camera handover was successful even though the position and orientation passed to the second copter were limited by the accuracy of the GPS and compass. StruckUAV was able to initialize the target and resume tracking from a different viewpoint. Summary videos of these experiments are included in the supplementary material.

C. Limitations

While our method shows promising results, further improvement within the current framework can be realized by:

determining the velocity and heading of the object being tracked to reduce the search space, dynamically adjusting the altitude and orientation of the active UAV based on feedback from the tracker and the target speed, as well as deploying new trackers in our modular onboard visual tracking system. Moreover, the current camera handover strategy is not very robust to fast target motion and appearance change and relies on a ground station. Handing over the complete target model to initialize the new UAV, rather than just a cropped image patch, and establishing communication between the UAVs to coordinate themselves in the absence of a ground station could further improve the proposed system.

V. CONCLUSION

In this paper, we provide extensive empirical evidence validating our proposed method and integrated system. We are able to persistently track objects purely based on appearance by seamlessly handing over the active camera between multiple off-the-shelf UAVs equipped with an onboard computer running StruckUAV.

REFERENCES

[1] Y. Wu, J. Lim, and M.-H. Yang, "Online Object Tracking: A Benchmark," in 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2013, pp. 2411–2418.
[2] K. Kaaniche, B. Champion, C. Pegard, and P. Vasseur, "A Vision Algorithm for Dynamic Detection of Moving Vehicles with a UAV," in Proceedings of the 2005 IEEE International Conference on Robotics and Automation. IEEE, April 2005, pp. 1878–1883.
[3] S. Ali and M. Shah, "COCOA - tracking in aerial imagery," in Proc. Int. Conf. on Computer Vision, 2005.
[4] A. Qadir, J. Neubert, W. Semke, and R. Schultz, "On-Board Visual Tracking With Unmanned Aircraft System (UAS)," in Infotech@Aerospace Conferences. American Institute of Aeronautics and Astronautics, Mar 2011.
[5] Q. Yu and G. Medioni, "Motion pattern interpretation and detection for tracking moving vehicles in airborne video," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2009, pp. 2671–2678.
[6] A. Nussberger, H. Grabner, and L. Van Gool, "Aerial object tracking from an airborne platform," in Unmanned Aircraft Systems (ICUAS), 2014 International Conference on, May 2014, pp. 1284–1293.
[7] ——, "Robust aerial object tracking in images with lens flare," in Robotics and Automation (ICRA), 2015 IEEE International Conference on, May 2015, pp. 6380–6387.
[8] I. F. Mondragon, P. Campoy, M. A. Olivares-Mendez, and C. Martinez, "3D object following based on visual information for Unmanned Aerial Vehicles," in IX Latin American Robotics Symposium and IEEE Colombian Conference on Automatic Control, 2011 IEEE. IEEE, Oct. 2011, pp. 1–7.
[9] C. Teuliere, L. Eck, and E. Marchand, "Chasing a moving target from a flying UAV," in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, September 2011, pp. 4929–4934.
[10] A. Kendall, N. Salvapantula, and K. Stol, "On-board object tracking control of a quadcopter with monocular vision," in Unmanned Aircraft Systems (ICUAS), 2014 International Conference on, May 2014, pp. 404–411.
[11] A. Gaszczak, T. P. Breckon, and J. Han, "Real-time people and vehicle detection from UAV imagery," in IST/SPIE Electronic Imaging, J. Röning, D. P. Casasent, and E. L. Hall, Eds., vol. 7878. International Society for Optics and Photonics, January 2011, pp. 78780B–78780B-13.
[12] J. Portmann, S. Lynen, M. Chli, and R. Siegwart, "People detection and tracking from aerial thermal views," in Robotics and Automation (ICRA), 2014 IEEE International Conference on, May 2014, pp. 1794–1800.

[13] T. Naseer, J. Sturm, and D. Cremers, "FollowMe: Person following and gesture recognition with a quadrocopter," in Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, Nov 2013, pp. 624–630.
[14] J. Pestana, J. Sanchez-Lopez, P. Campoy, and S. Saripalli, "Vision based GPS-denied object tracking and following for unmanned aerial vehicles," in Safety, Security, and Rescue Robotics (SSRR), 2013 IEEE International Symposium on, Oct 2013, pp. 1–6.
[15] C. Fu, A. Carrio, M. Olivares-Mendez, R. Suarez-Fernandez, and P. Campoy, "Robust real-time vision-based aircraft tracking from unmanned aerial vehicles," in Robotics and Automation (ICRA), 2014 IEEE International Conference on, May 2014, pp. 5441–5446.
[16] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-Learning-Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409–1422, Dec 2011.
[17] B. Babenko, M.-H. Yang, and S. Belongie, "Visual Tracking with Online Multiple Instance Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1619–1632, Dec 2010.
[18] C. Veness, http://www.movable-type.co.uk/scripts/latlong.html.
[19] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125–141, 2008.
[20] T. B. Dinh, N. Vo, and G. Medioni, "Context tracker: Exploring supporters and distracters in unconstrained environments," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, June 2011, pp. 1177–1184.
[21] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in 2011 International Conference on Computer Vision. IEEE, Nov 2011, pp. 263–270.
[22] H. Grabner, M. Grabner, and H. Bischof, "Real-time tracking via on-line boosting," in Proceedings of the British Machine Vision Conference. BMVA Press, 2006, pp. 6.1–6.10, doi:10.5244/C.20.6.
[23] J. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in Computer Vision – ECCV 2012, ser. Lecture Notes in Computer Science, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds. Springer Berlin Heidelberg, 2012, vol. 7575, pp. 702–715.
[24] X. Jia, H. Lu, and M.-H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, June 2012, pp. 1822–1829.
[25] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," in Proc. of the European Conference on Computer Vision (ECCV), 2014.
[26] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, "MUlti-Store Tracker (MUSTer): A cognitive psychology inspired approach to object tracking," in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, June 2015, pp. 749–758.
[27] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
[28] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg, "Learning spatially regularized correlation filters for visual tracking," in The IEEE International Conference on Computer Vision (ICCV), Dec 2015.
[29] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, "SOWP: Spatially ordered and weighted patch descriptor for visual tracking," in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[30] M. Mueller, N. Smith, and B. Ghanem, "A benchmark and simulator for UAV tracking," in Proc. of the European Conference on Computer Vision (ECCV), 2016.
[31] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.
[32] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Computer Vision – ECCV 2014, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer International Publishing, 2014, vol. 8693, pp. 725–739.
[33] T. V. Nguyen, "Salient object detection via objectness proposals," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[34] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?" Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. PP, no. 99, pp. 1–1, 2015.