
HtFLlib: The First Unified Benchmark for Heterogeneous Federated Learning Across Modalities

HtFLlib introduces the first unified benchmarking library for evaluating heterogeneous federated learning methods across multiple data modalities, addressing limitations of traditional FL and enabling robust model collaboration.

Challenges in Traditional Federated Learning

AI institutions often develop heterogeneous models tailored to their specific tasks but face data scarcity during training. Traditional Federated Learning (FL) supports collaboration only among homogeneous models with identical architectures, which limits its applicability in practice, since clients typically design distinct model architectures to meet their individual needs. Moreover, sharing locally trained models raises intellectual property concerns and reduces the incentive to collaborate.

Heterogeneous Federated Learning and Existing Methods

Heterogeneous Federated Learning (HtFL) addresses these challenges by enabling collaboration across different model architectures. Current FL benchmarks generally focus on data heterogeneity but overlook model heterogeneity present in real-world scenarios. Representative HtFL methods include:

  • Partial Parameter Sharing: Methods like LG-FedAvg, FedGen, and FedGH maintain heterogeneous feature extractors but assume homogeneous classifier heads.
  • Mutual Distillation: Methods such as FML, FedKD, and FedMRL train and share small auxiliary models via distillation.
  • Prototype Sharing: Methods such as FedProto and FedTGP transfer lightweight class-wise prototypes as global knowledge to guide local training (a minimal sketch follows this list).
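
To make prototype sharing concrete, here is a minimal sketch of the usual construction, assuming PyTorch feature tensors; it is illustrative only, not HtFLlib's actual API. Each client averages its feature vectors per class, and the server averages the resulting client prototypes class by class:

    import torch

    def local_prototypes(features: torch.Tensor, labels: torch.Tensor) -> dict:
        """Client side: average the feature vectors of each locally seen class."""
        return {c: features[labels == c].mean(dim=0)
                for c in labels.unique().tolist()}

    def aggregate_prototypes(client_protos: list[dict]) -> dict:
        """Server side: average client prototypes class by class."""
        merged = {}
        for protos in client_protos:
            for c, p in protos.items():
                merged.setdefault(c, []).append(p)
        return {c: torch.stack(ps).mean(dim=0) for c, ps in merged.items()}

During local training, clients typically add a regularization term that pulls their feature representations toward the received global prototypes, so knowledge transfers without exchanging any model parameters.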

However, the performance consistency of these methods across diverse scenarios remains unclear.

Introducing HtFLlib: A Unified Benchmarking Library

Researchers from several universities have proposed HtFLlib, the first unified benchmarking library designed to evaluate heterogeneous federated learning methods across different domains and data modalities. Key features of HtFLlib include:

  • Integration of 12 datasets spanning various domains, modalities, and data heterogeneity scenarios.
  • Inclusion of 40 model architectures ranging from small to large sizes across three modalities.
  • A modular and extensible codebase implementing 10 representative HtFL methods.
  • Systematic evaluations covering accuracy, convergence, computational cost, and communication cost (the sketch after this list illustrates the communication-cost gap).
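
One reason communication cost is worth measuring separately: methods that share prototypes upload orders of magnitude less data per round than methods that share full models. The sketch below is a rough accounting, assuming float32 parameters and a toy model of my own choosing, not HtFLlib's measurement code:

    import torch.nn as nn

    def upload_bytes_full_model(model: nn.Module) -> int:
        """Bytes per round if a client uploads every parameter (float32)."""
        return sum(p.numel() for p in model.parameters()) * 4

    def upload_bytes_prototypes(num_classes: int, feature_dim: int) -> int:
        """Bytes per round if a client uploads one prototype per class."""
        return num_classes * feature_dim * 4

    toy = nn.Sequential(nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Linear(512, 10))
    print(upload_bytes_full_model(toy))      # ~6.3 MB per round
    print(upload_bytes_prototypes(10, 512))  # ~20 KB per round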

Diverse Datasets and Modalities

HtFLlib categorizes data heterogeneity into three main settings: Label Skew (with Pathological and Dirichlet subsettings), Feature Shift, and Real-World scenarios. It incorporates datasets such as Cifar10, Cifar100, Flowers102, Tiny-ImageNet, KVASIR, COVIDx, DomainNet, Camelyon17, AG News, Shakespeare, HAR, and PAMAP2, which vary widely in domain, volume, and class count. While image data and label skew scenarios receive the most attention, the library evaluates HtFL methods across image, text, and sensor-signal tasks to reveal their respective strengths and weaknesses. A sketch of the Dirichlet subsetting follows.
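
The Dirichlet subsetting is a standard way to simulate label skew: each class is divided among clients in proportions drawn from Dir(alpha), and smaller alpha produces more skewed local label distributions. The following self-contained sketch shows the common construction (an assumed implementation, not necessarily HtFLlib's exact partitioner):

    import numpy as np

    def dirichlet_partition(labels: np.ndarray, num_clients: int,
                            alpha: float = 0.1, seed: int = 0) -> list[list[int]]:
        """Split sample indices across clients with Dirichlet label skew."""
        rng = np.random.default_rng(seed)
        client_indices = [[] for _ in range(num_clients)]
        for c in np.unique(labels):
            idx = rng.permutation(np.flatnonzero(labels == c))
            # Fraction of class c that each client receives.
            props = rng.dirichlet(alpha * np.ones(num_clients))
            cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
            for client_id, part in enumerate(np.split(idx, cuts)):
                client_indices[client_id].extend(part.tolist())
        return client_indices

    # Example: 10 clients over a toy 10-class label vector.
    labels = np.random.default_rng(1).integers(0, 10, size=5000)
    parts = dirichlet_partition(labels, num_clients=10, alpha=0.1)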

Performance Insights for Image and Text Modalities

In image tasks, most HtFL methods lose accuracy as model heterogeneity increases. FedMRL stands out thanks to its combination of auxiliary global and local models. When heterogeneous classifier heads rule out partial parameter sharing, FedTGP excels because of its adaptive prototype refinement. Experiments on medical datasets with black-box heterogeneous models indicate that HtFL improves model quality beyond what pre-trained models achieve alone and outperforms auxiliary-model approaches such as FML. For text data, FedMRL's advantage in label skew scenarios diminishes in real-world settings, while FedProto and FedTGP perform worse than they do on image tasks.

Impact and Future Directions

HtFLlib fills a crucial gap by providing a unified evaluation framework for heterogeneous federated learning methods across multiple domains and modalities. Its modular and extensible design supports research and practical application, enabling further exploration of complex pre-trained large models, black-box systems, and diverse architectures in collaborative learning.

For more information, check out the original paper and the GitHub repository.
