#291: Local feature performance evaluation for Structure-from-Motion and multi-view stereo using simulated city-scale aerial imagery


Ubiquitous low-cost multi-rotor and fixed-wing drones, or unmanned aerial vehicles (UAVs), have accelerated the need for reliable, robust, and scalable Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipelines that suit a variety of flight-path trajectories, especially in degraded environments. Feature tracking, a core part of SfM and MVS, is essential for multi-view scene modeling and perception, but it is difficult to evaluate on large-scale datasets because sufficient ground truth is lacking. For large-scale aerial imagery, the accuracy of the estimated camera orientation and of the dense 3D point cloud can be used to assess the impact of feature localization accuracy and track length. We propose a novel view simulation (or synthesis) framework that generates visually realistic, previously unseen camera views for feature detection, using known high-fidelity camera poses for modeling. Seven state-of-the-art local features, both handcrafted and learning-based, are quantitatively evaluated for robustness and matchability within SfM and MVS pipelines using the open-source COLMAP software. Our experimental results provide performance rankings for each feature, using twelve evaluation metrics across three synthetic city-wide aerial image sequences. We show that recent learned features, SuperPoint and LF-Net, have not only reached the quality of the best handcrafted features such as SIFT and SURF, but now outperform them, yielding more accurate 3D camera pose estimates and longer feature tracks. SuperPoint produces an average position error of 1.51 meters and an average angular error of 0.03, while SIFT remains competitive (second best for pose and overall) with errors of 1.78 meters and 0.11, respectively.
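The abstract reports average position and angular errors but does not state how they are computed. As a minimal sketch of one common convention, the example below takes the position error as the Euclidean distance between estimated and ground-truth camera centers, and the angular error as the geodesic angle of the relative rotation. The function names, the use of degrees, and the assumption that the estimated poses are already aligned to the ground-truth frame are all illustrative choices, not details taken from the paper.

```python
import math

def position_error(c_est, c_gt):
    """Euclidean distance between estimated and ground-truth camera centers."""
    return math.dist(c_est, c_gt)

def angular_error_deg(R_est, R_gt):
    """Geodesic angle, in degrees, of the relative rotation R_est * R_gt^T."""
    # trace(R_est @ R_gt^T) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def rot_z(deg):
    """3x3 rotation about the z-axis (hypothetical helper for the demo)."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

# Toy data (made up for illustration): two cameras, poses assumed pre-aligned.
gt_centers = [(0.0, 0.0, 100.0), (10.0, 0.0, 100.0)]
est_centers = [(0.0, 3.0, 104.0), (10.0, 0.0, 101.0)]
gt_rots = [rot_z(0.0), rot_z(30.0)]
est_rots = [rot_z(2.0), rot_z(30.0)]

avg_pos_err = mean([position_error(e, g) for e, g in zip(est_centers, gt_centers)])
avg_ang_err = mean([angular_error_deg(e, g) for e, g in zip(est_rots, gt_rots)])
print(f"average position error: {avg_pos_err:.2f}")
print(f"average angular error (deg): {avg_ang_err:.2f}")
```

Because an SfM reconstruction is defined only up to a similarity transform, the estimated poses would in practice have to be registered to the ground-truth frame before such errors are meaningful; the sketch skips that step.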