Large Language Models (LLMs) are capable of reasoning over diverse input data modalities through pre-trained encoders. However, the growing diversity of input data modalities makes it impractical to incorporate all modalities into LLMs, especially when LLMs are deployed on resource-constrained edge devices for embodied AI applications. Instead, a better option is to adaptively involve only the useful modalities at runtime, depending on the current environmental contexts and task requirements. For such modality adaptation, existing work adopts fixed connections between encoders and the LLM’s input layer, leading to high training costs at runtime and ineffective cross-modal interaction. In this paper, we address these limitations with mPnP-LLM, a new technique that allows fully elastic, automated and prompt runtime modality adaptation by connecting unimodal encoders to a flexible set of last LLM blocks and making such latent connections fully trainable at runtime. Experiments on the nuScenes-QA dataset show that mPnP-LLM achieves up to 3.7x FLOPs reduction and 30% GPU memory usage reduction, while retaining on-par accuracy with existing schemes. Under the same compute budget, mPnP-LLM improves the task accuracy by up to 4% over the best existing scheme.
Large Language Models (LLMs) can reason over diverse input data modalities beyond the natural language domain. One major challenge of such multimodal reasoning is the growing diversity of input data modalities combined with the limited compute capability of resource-constrained edge devices. A good solution is to adaptively involve only the useful modalities at runtime to minimize the on-device computing cost, as in the example of runtime modality adaptation for multimodal QA in autonomous driving shown in the image above.
One option is to jointly train the encoders of all involved modalities with the LLM, so as to align every modality with the natural language domain, but doing so is too expensive at runtime. As shown below, most existing work instead only fine-tunes a trainable projector that connects each modal encoder to the LLM’s input layer, but this still requires backpropagating through the entire LLM and hence incurs high training costs.
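The sketch below illustrates this conventional projector design; the names (`InputProjector`, `enc_dim`, `llm_dim`) and the usage comment are our own illustrative assumptions rather than any specific system’s code.

```python
# Minimal sketch of the conventional projector-based design described above.
# A trainable linear projector maps encoder features into the LLM's input
# embedding space; because it sits at the input layer, gradients must still
# traverse every LLM block during fine-tuning.
import torch
import torch.nn as nn

class InputProjector(nn.Module):
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, num_modal_tokens, enc_dim)
        return self.proj(enc_feats)  # -> (batch, num_modal_tokens, llm_dim)

# Hypothetical usage: prepend projected modality tokens to the text embeddings.
# Even with the LLM weights frozen, backpropagation has to pass through all
# LLM blocks before it can reach the projector at the input layer.
# llm_inputs = torch.cat([projector(enc_feats), text_embeddings], dim=1)
```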
As shown in the image below, we present Modality Plug-and-Play in multimodal LLMs (mPnP-LLM), a new technique for elastic, automated and prompt runtime modality adaptation in multimodal LLMs, which connects unimodal encoders to a flexible set of last LLM blocks and makes such latent connections fully trainable at runtime. We can adaptively adjust the number of LLM blocks being connected for different tradeoffs between accuracy and runtime training cost. We can also improve the efficiency of cross-modal interaction, and hence the accuracy, by controlling the amount of information injected through each connection with a trainable weighting module.
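A minimal sketch of this knob, assuming a HuggingFace-style decoder whose blocks live in `llm.model.layers` and a hypothetical `connection_factory` that builds the latent connection module described in the next section:

```python
# Minimal sketch: freeze the pre-trained LLM and attach trainable connections
# only to its last k blocks. k is the knob that trades accuracy against
# runtime training cost, since gradients never need to reach earlier blocks.
import torch.nn as nn

def attach_to_last_blocks(llm: nn.Module, connection_factory, k: int) -> nn.ModuleList:
    for p in llm.parameters():
        p.requires_grad_(False)                # the LLM itself stays frozen
    last_blocks = list(llm.model.layers)[-k:]  # assumed HuggingFace-style layout
    # One trainable latent connection per connected block.
    return nn.ModuleList([connection_factory() for _ in last_blocks])
```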
In our design, mPnP-LLM adapts new input modalities by connecting the corresponding unimodal encoders to LLM blocks via (1) the Key & Value Aligners module and (2) the Trainable Latent Connection module. An example for the multimodal QA task with two input modalities, namely the RGB camera view and the LiDAR point cloud, is shown in the figure below.
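The following is a simplified sketch (not our actual implementation) of how these two modules could interact for a single LLM block, using our own module and parameter names; a sigmoid-gated scalar stands in for the trainable weighting module.

```python
# Simplified, illustrative sketch of the two modules described above.
import torch
import torch.nn as nn

class KVAligner(nn.Module):
    """Maps unimodal encoder tokens into one LLM block's key/value space."""
    def __init__(self, enc_dim: int, kv_dim: int):
        super().__init__()
        self.key_align = nn.Linear(enc_dim, kv_dim)
        self.value_align = nn.Linear(enc_dim, kv_dim)

    def forward(self, enc_feats: torch.Tensor):
        # enc_feats: (batch, num_modal_tokens, enc_dim)
        return self.key_align(enc_feats), self.value_align(enc_feats)

class TrainableLatentConnection(nn.Module):
    """Injects aligned modality tokens into a block's keys/values, with a
    learned gate controlling how much information flows in."""
    def __init__(self, enc_dim: int, kv_dim: int):
        super().__init__()
        self.aligner = KVAligner(enc_dim, kv_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # trainable injection weight

    def forward(self, enc_feats, block_keys, block_values):
        k_mod, v_mod = self.aligner(enc_feats)
        w = torch.sigmoid(self.gate)
        # The block's self-attention can now also attend to the new modality.
        keys = torch.cat([block_keys, w * k_mod], dim=1)
        values = torch.cat([block_values, w * v_mod], dim=1)
        return keys, values

# Hypothetical two-modality usage (RGB camera + LiDAR): each modality gets its
# own aligner and gate, so their contributions can be weighted independently.
# keys, values = rgb_connection(rgb_feats, keys, values)
# keys, values = lidar_connection(lidar_feats, keys, values)
```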
We use the nuScenes-QA dataset for multimodal visual QA in autonomous driving, with results from a workstation-level desktop platform with an RTX A6000 GPU and a mobile platform, the Nvidia Jetson AGX Orin. Our processed dataset is published as the NuScenes-QA-mini dataset.
Compared with existing approaches, mPnP-LLM achieves better accuracy under similar compute costs.
Under the same setup of nighttime conditions, with the LiDAR modality added on top of the RGB modality, mPnP-LLM achieves a better accuracy-compute tradeoff.
With different LLMs, mPnP-LLM achieves higher accuracy than baseline schemes, while reducing the training FLOPs by 20%-37% and GPU memory consumption by up to 30%.