EPFL's FG2 AI Model Cuts Localization Errors by 28% in GPS-Challenged Urban Environments
EPFL researchers have introduced FG2, an AI model that reduces localization error by 28% for autonomous vehicles in GPS-challenged urban areas by matching ground-level and aerial images.
Tackling Localization Challenges in Urban Environments
Dense urban landscapes such as San Francisco and New York often cause GPS signals to falter, as tall buildings block and reflect satellite signals. For autonomous vehicles and delivery robots that rely on precise localization, even minor GPS inaccuracies can lead to mission failures. To address this, researchers at the École Polytechnique Fédérale de Lausanne (EPFL) have developed a novel AI model named FG2, unveiled at CVPR 2025, that dramatically improves visual localization accuracy.
FG2 Model: Fine-Grained Cross-View Localization
The FG2 model, detailed in the paper “Fine-Grained Cross-View Localization by Fine-Grained Feature Matching,” enables autonomous systems to determine their exact position and orientation using only ground-level camera images and corresponding aerial or satellite views. It achieves a 28% reduction in mean localization error compared to prior state-of-the-art methods on challenging public datasets.
How FG2 Works
Rather than relying on abstract scene descriptors, FG2 mimics human intuition by matching detailed, semantically consistent features such as curbs, crosswalks, and building facades between street-level and aerial images. This approach enhances interpretability by allowing visualization of the features being matched, moving beyond traditional black-box AI.
The model uses weakly supervised learning, requiring only the final camera pose as a supervisory signal without explicit correspondence labels.
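To make the weak-supervision idea concrete, the sketch below shows a minimal, hypothetical PyTorch training step. The toy network is a stand-in, not the actual FG2 architecture (which derives the pose from matched points rather than regressing it directly); the point is that the only supervisory signal is the ground-truth camera pose, with no correspondence labels.

```python
# Minimal sketch of pose-only (weakly supervised) training. The model and data
# here are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class ToyCrossViewPoseNet(nn.Module):
    """Hypothetical stand-in for the real FG2 network (illustration only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(2 * 3 * 64 * 64, 128), nn.ReLU())
        self.head = nn.Linear(128, 3)  # predicts (x, y, yaw) directly in this toy version

    def forward(self, ground_img, aerial_img):
        x = torch.cat([ground_img, aerial_img], dim=1)
        return self.head(self.encoder(x))

def pose_loss(pred, gt):
    # Translation error plus a wrapped yaw error; only the final pose supervises training.
    trans = nn.functional.smooth_l1_loss(pred[:, :2], gt[:, :2])
    yaw = (pred[:, 2] - gt[:, 2] + torch.pi) % (2 * torch.pi) - torch.pi
    return trans + yaw.abs().mean()

model = ToyCrossViewPoseNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Dummy batch: ground and aerial images plus the ground-truth 3-DoF pose.
ground, aerial = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
gt_pose = torch.randn(4, 3)

loss = pose_loss(model(ground, aerial), gt_pose)
loss.backward()
opt.step()
```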
Overcoming Cross-View Perspective Differences
Cross-view localization is difficult due to the stark difference between ground-level and aerial perspectives. Previous methods either created abstract descriptors or converted views into bird’s-eye representations but failed to capture vertical structures adequately.
FG2 introduces a pipeline that:
- Maps ground-level features into a 3D point cloud centered on the camera.
- Applies smart pooling to select the most informative features vertically, deciding whether ground markings or building rooftops serve as better landmarks.
- Matches these refined 2D point representations between ground and aerial views.
- Uses Procrustes alignment to estimate the 3-DoF pose (x, y, and yaw), as sketched below.
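The final step is standard rigid alignment. Here is a minimal NumPy sketch (not the authors' implementation) of Procrustes/Kabsch alignment that recovers x, y, and yaw from two matched 2D point sets; the variable names and toy data are illustrative assumptions.

```python
# Procrustes / Kabsch-style rigid alignment for a 3-DoF pose (x, y, yaw).
# ground_pts and aerial_pts are assumed to be corresponding 2D point sets
# produced by the matching stage.
import numpy as np

def procrustes_2d(ground_pts: np.ndarray, aerial_pts: np.ndarray):
    """Estimate the rotation (yaw) and translation aligning ground_pts onto aerial_pts."""
    g_mean = ground_pts.mean(axis=0)
    a_mean = aerial_pts.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (ground_pts - g_mean).T @ (aerial_pts - a_mean)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    # Guard against a reflection solution.
    if np.linalg.det(R) < 0:
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = a_mean - R @ g_mean
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return t[0], t[1], yaw  # x, y, yaw in the aerial frame

# Toy example: points rotated by 30 degrees and shifted by (2, -1).
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
pts = np.random.rand(50, 2)
x, y, yaw = procrustes_2d(pts, pts @ R_true.T + np.array([2.0, -1.0]))
print(x, y, np.rad2deg(yaw))  # ~2.0, ~-1.0, ~30.0
```

Because every step of this alignment is differentiable, the pose estimate can be supervised directly, which is what makes the weakly supervised training described above possible.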
Performance and Transparency
On the VIGOR dataset, FG2 reduced the average localization error by 28% and showed strong generalization on the KITTI dataset. Importantly, it offers transparency by visualizing the matched points, confirming that the model learns semantically meaningful correspondences on its own, despite never being given correspondence labels.
Advancing Autonomous Navigation
FG2’s human-like feature matching and enhanced interpretability mark a significant advancement in visual localization. This breakthrough paves the way for safer and more reliable navigation systems in autonomous vehicles, drones, and robotics, particularly in GPS-denied urban environments.