On-device AI

Our research aims to enable high-performance AI inference and training on resource-constrained mobile and embedded devices, in support of emerging applications such as AIoT, smart health and embodied AI. We use fine-grained, explainable knowledge about AI model execution to determine the most efficient part of the model for on-device training and inference, and employ modular neural networks that incorporate domain knowledge of specific system applications into the neural network module design. Our recent research focuses on computationally efficient inference and training of modern Large Language Models (LLMs) on weak devices, so that these devices’ rich variety of data modalities can be incorporated into the LLMs’ representation power, allowing more flexible domain adaptation and model personalization.

Never Start from Scratch: Expediting On-Device LLM Personalization via Explainable Model Selection

Personalization of Large Language Models (LLMs) is important in practical applications to accommodate the individual needs of different mobile users. Due to data privacy concerns, LLM personalization often needs to be done locally on the user’s mobile device, but such on-device personalization is constrained by both limited on-device compute power and the insufficiency of the user’s personal data. In this paper, we address these constraints by fine-tuning an already personalized LLM with the user’s personal data, and present XPerT, a new technique that ensures proper selection of such already personalized LLMs based on explainability about how they were fine-tuned. We implemented and evaluated XPerT on various smartphone models with mainstream LLMs, and experimental results show that XPerT reduces the computation costs of on-device LLM personalization by 83% and improves its data efficiency by 51%.

Haoming Wang, Boyuan Yang, Xiangyu Yin, Wei Gao

June 2025 In MobiSys 2025

When Device Delays Meet Data Heterogeneity in Federated AIoT Applications

Federated Artificial Intelligence of Things (AIoT) uses distributed data on IoT devices to train AI models. However, in practical AIoT systems, heterogeneous devices cause data heterogeneity and varying amounts of device staleness, which can reduce model performance or increase federated training time. Existing federated learning (FL) frameworks improperly treat device delays as independent of data heterogeneity. Our work explores a scenario where device delays and data heterogeneity are closely correlated, and proposes FedDC, a new technique to mitigate the impact of such device delays. Our basic idea is to use gradient inversion to learn about each device’s local data distribution, and to use this knowledge to compensate for the impact of device delays on the devices’ model updates. Experiments show that FedDC can improve FL performance by 34% under high amounts of device delay, without impairing the devices’ local data privacy.

Haoming Wang, Wei Gao

March 2025 In MobiCom 2025

Modality Plug-and-Play: Runtime Modality Adaptation in LLM-Driven Autonomous Mobile Systems

Multimodal reasoning by LLMs is critical to autonomous mobile systems, but the growing diversity of input data modalities prevents incorporating all modalities into LLMs. Instead, only the useful modalities should be adaptively involved at runtime, based on the current environmental contexts and task requirements. Existing work on runtime modality adaptation uses fixed connections between the data encoders and the LLM’s input layer, which results in high training costs and ineffective cross-modal interaction. In this paper, we present MPnP, a new modality adaptation technique that connects data encoders to a flexible set of last LLM blocks and makes such latent connections fully trainable at runtime. Evaluation results show that MPnP has high compute and data efficiency, with 3.7× FLOPs reduction and 30% memory usage reduction compared to the best baselines. It requires only a few hundred training samples at runtime, and completes modality adaptation within a few minutes on weak devices.

Kai Huang, Xiangyu Yin, Heng Huang, Wei Gao

December 2024 In MobiCom'25

Tackling Intertwined Data and Device Heterogeneities in Federated Learning with Unlimited Staleness

Federated Learning (FL) can be affected by data and device heterogeneities. Traditional schemes treat these heterogeneities as two separate and independent aspects, but this assumption is unrealistic in practical FL scenarios where they are intertwined, and traditional FL schemes are ineffective in such cases. We introduce a novel FL framework that estimates the distributions of clients’ local training data from their uploaded stale model updates, and uses these estimates to compute unstale client model updates. Comparisons with existing FL strategies on mainstream datasets and models show that our approach can improve the trained model accuracy by up to 25% and reduce the number of required training epochs by up to 35%.

Haoming Wang, Wei Gao

November 2024 In AAAI 2025

Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices

Image super-resolution (SR) is widely used on mobile devices to enhance user experience. However, neural networks used for SR are computationally expensive, posing challenges for mobile devices with limited computing power. A viable solution is to use heterogeneous processors on mobile devices, especially the specialized hardware AI accelerators, but the reduced arithmetic precision on AI accelerators can lead to degraded perceptual quality in upscaled images. To address this limitation, we present a novel image SR technique that enhances the perceptual quality of upscaled images when using heterogeneous processors for SR computations. It strategically splits the SR model and dispatches different layers to heterogeneous processors, to meet the time constraint while minimizing the impact of AI accelerators on image quality. Experiment results show that our method outperforms the best baselines, improving perceptual image quality by up to 2×, or reducing SR computing latency by up to 5.6× with on-par image quality.

Kai Huang, Xiangyu Yin, Wei Gao

September 2024 In MobiCom'24
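
The layer-dispatching idea in the abstract above can be illustrated with a small search problem. The following is only a sketch under stated assumptions, not the paper's algorithm: the per-layer latencies, the quality penalties, the two processor names, and the function `best_dispatch` are all hypothetical, and data-transfer costs between processors are ignored. It picks, among all layer-to-processor assignments that meet a latency budget, the one with the smallest total quality penalty.

```python
from itertools import product

# Hypothetical per-layer profile (illustrative numbers, not measurements):
# (latency on GPU in ms, latency on NPU in ms,
#  perceptual-quality penalty if the layer runs on the reduced-precision NPU)
LAYERS = [
    (8.0, 2.0, 0.30),
    (6.0, 1.5, 0.05),
    (6.0, 1.5, 0.04),
    (9.0, 2.5, 0.40),
]


def best_dispatch(layers, budget_ms):
    """Return (penalty, latency, assignment) with the lowest quality penalty
    among assignments whose total latency fits the budget, or None if no
    assignment fits. Exhaustive search, so only suitable for a few layers."""
    best = None
    for assignment in product(("GPU", "NPU"), repeat=len(layers)):
        latency = sum(gpu if dev == "GPU" else npu
                      for (gpu, npu, _), dev in zip(layers, assignment))
        penalty = sum(pen for (_, _, pen), dev in zip(layers, assignment)
                      if dev == "NPU")
        if latency > budget_ms:
            continue
        # Prefer lower penalty; break ties by lower latency.
        if best is None or (penalty, latency) < best[:2]:
            best = (penalty, latency, assignment)
    return best


result = best_dispatch(LAYERS, budget_ms=22.0)
print(result)
```

With these made-up numbers, running every layer on the GPU takes 29 ms, so the search moves the two layers with the smallest quality penalty to the NPU and keeps the quality-sensitive first and last layers on the GPU.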

Achieving Sparse Activation in Small Language Models

Unlike model compression, which requires expensive retraining, sparse activation can effectively reduce neural network models’ inference cost at runtime without any prior retraining or adaptation efforts. Although sparse activation has been proven effective on Large Language Models (LLMs) that are usually redundant (e.g., OPT and BLOOMZ models), its applicability to recent Small Language Models (SLMs) with higher parameter efficiency remains questionable. Our recent work verified this possibility, from both analytical and experimental perspectives, by using gradient-based attribution scores to evaluate neurons’ importance in inference. Our results show that we can achieve up to 80% sparsity in major SLMs, including Phi-1.5/2 and MobiLlama-0.5B/1B, with less than 5% model accuracy loss on QA tasks.

Jifeng Song, Kai Huang, Boyuan Yang, Wei Gao

June 2024 In arXiv
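
As a rough illustration of scoring neurons by gradient-based attribution, the sketch below uses a toy one-hidden-layer network with a linear readout, for which the gradient-times-activation score has a closed form. The network, its weights, and the function names are assumptions made for this example and are not taken from the paper. The sketch also computes the dense activations first in order to score them, so it shows only the scoring and masking step, not the runtime savings.

```python
def hidden(x, W1):
    """ReLU hidden activations of a one-hidden-layer network."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def attribution_scores(h, v):
    """Gradient x activation. For a linear readout y = sum_j v[j] * h[j],
    dy/dh[j] = v[j], so the score of neuron j is |v[j] * h[j]|."""
    return [abs(vj * hj) for vj, hj in zip(v, h)]


def sparse_forward(x, W1, v, sparsity):
    """Return (dense output, sparse output, dropped neuron indices), where the
    lowest-scoring fraction `sparsity` of hidden neurons is deactivated."""
    h = hidden(x, W1)
    scores = attribution_scores(h, v)
    n_drop = int(sparsity * len(h))
    dropped = set(sorted(range(len(h)), key=lambda j: scores[j])[:n_drop])
    y_dense = sum(vj * hj for vj, hj in zip(v, h))
    y_sparse = sum(v[j] * h[j] for j in range(len(h)) if j not in dropped)
    return y_dense, y_sparse, dropped


# Hypothetical toy weights and input.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]
v = [2.0, 0.1, -0.05, 1.0]
x = [1.0, 2.0]

y_dense, y_sparse, dropped = sparse_forward(x, W1, v, sparsity=0.5)
print(y_dense, y_sparse, sorted(dropped))
```

In this toy case, half of the hidden neurons are deactivated while the output moves only from about 3.125 to 3.0, because the dropped neurons are the ones contributing least to the output.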

Towards Green AI in Fine-tuning Large Language Models via Adaptive Backpropagation

The growing need for fine-tuning large language models (LLMs) can lead to significant energy consumption and environmental impact. To address this issue, we introduce GreenTrainer, a novel LLM fine-tuning technique. GreenTrainer assesses the backpropagation costs and contributions of different tensors to model accuracy, allowing for the selection of the most efficient set of tensors. This selection is guided by a user-defined objective, which can adapt to energy supply considerations and Green AI goals. Experimental results demonstrate that GreenTrainer can reduce FLOPs by up to 64% without compromising model accuracy, and outperforms existing techniques like LoRA while maintaining comparable FLOPs reduction.

Kai Huang, Hanyun Yin, Heng Huang, Wei Gao

September 2023 In ICLR 2024
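
One way to picture the trade-off between a tensor's backpropagation cost and its contribution is as a budgeted selection problem. The sketch below is an assumption-laden illustration rather than GreenTrainer's actual algorithm: the cost model, the importance values, and the names `backprop_cost` and `select_tensors` are hypothetical, and the brute-force search stands in for whatever selection procedure the paper uses. The cost model assumes that training a tensor requires passing error gradients back through every layer after it, plus the cost of computing that tensor's own weight gradient.

```python
from itertools import combinations


def backprop_cost(selected, update_cost, pass_cost):
    """Cost of training the selected tensors (layers indexed from input to
    output): gradients must be passed back through every layer after the
    shallowest selected one, plus each selected tensor's own update cost."""
    if not selected:
        return 0.0
    shallowest = min(selected)
    passing = sum(pass_cost[shallowest + 1:])
    updating = sum(update_cost[i] for i in selected)
    return passing + updating


def select_tensors(importance, update_cost, pass_cost, budget):
    """Exhaustively pick the subset with the highest total importance whose
    backpropagation cost fits the budget (fine for a handful of tensors)."""
    n = len(importance)
    best_set, best_gain = (), 0.0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            cost = backprop_cost(subset, update_cost, pass_cost)
            gain = sum(importance[i] for i in subset)
            if cost <= budget and gain > best_gain:
                best_set, best_gain = subset, gain
    return best_set, best_gain


# Hypothetical importance and cost numbers in arbitrary FLOPs units.
importance = [0.9, 0.2, 0.5, 0.6]
update_cost = [4.0, 4.0, 4.0, 4.0]
pass_cost = [3.0, 3.0, 3.0, 3.0]

chosen, gain = select_tensors(importance, update_cost, pass_cost, budget=12.0)
print(chosen, gain)
```

With these numbers the most important tensor (index 0) is skipped: reaching it would require propagating gradients through the whole network and exceed the budget, so the two tensors nearest the output are trained instead.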

ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection

ElasticTrainer is the first on-device AI technique that achieves full elasticity of on-device training on resource-constrained mobile and embedded devices. By leveraging the principle of eXplainable AI (XAI) and evaluating the importance of different tensors in training, we allow fully flexible adaptation of the trainable neural network portion at runtime, according to the current training needs and online data patterns, to minimize the training cost without accuracy loss.

Kai Huang, Boyuan Yang, Wei Gao

June 2023 In MobiSys'23

AiFi: AI-Enabled WiFi Interference Cancellation with Commodity PHY-Layer Information

This work applies on-device AI techniques to interference cancellation in WiFi networks and enables generalizable interference cancellation on commodity WiFi devices without any extra RF hardware. By using neural network models to mimic the WiFi network’s PHY-layer operation, AiFi can be generally applied to different types of interference signals, ranging from concurrent WiFi transmissions and ZigBee/Bluetooth to wireless baby monitors and even microwave ovens, and improves the MAC-layer frame reception rate by 18x.

Ruirong Chen, Kai Huang, Wei Gao

November 2022 In SenSys'22

Real-time Neural Network Inference on Extremely Weak Devices: Agile Offloading with Explainable AI

AgileNN is the first work that achieves real-time inference (<20ms) of neural network models on mainstream datasets (e.g., ImageNet) on extremely weak MCUs (e.g., STM32 series with <1MB of memory), without impairing the inference accuracy. The use of eXplainable AI (XAI) techniques allows >6x improvement of feature compressibility during offloading and >8x reduction of the local device’s resource consumption.

Kai Huang, Wei Gao

October 2022 In MobiCom'22
