Teaser image

Efficiency and performance of MergeMix. (a) Training time vs. accuracy of mixup methods with the DeiT-Small model. (b) Top-1 image classification accuracy vs. training epochs of different mixup methods on the CIFAR-100 dataset with the DeiT-Tiny model. (c) Radar plot of results on a subset of VQA tasks for LLaVA-7B, LLaVA with SFT, and MergeMix.

Abstract

Vision–language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal for training at the cost of computational overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, which preserves cluster representations and spatial context, and then introduces a preference-driven training paradigm for MLLMs that builds preference pairs from mixed and raw images and optimizes them with the SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.

Method Overview

Overview figure

An overview of the two scenarios of MergeMix. (a) MergeMix for Image Classification: the image is processed by the ToMe encoder, with attention-score recovery and top-k sampling generating the class prediction. (b) MergeMix for MLLMs: preference pairs are encoded by the vision model with token merging, and the LLM decoder generates response text for the loser and the winner, optimized via a ranking loss.

Multi-modal Large Language Models (e.g., LLaVA, QwenVL, Cambrian-1) have recently demonstrated remarkable capabilities in integrating visual and textual information, enabling a wide range of applications from visual question answering to multi-modal reasoning. Since MLLMs are typically pre-trained on massive web-scale datasets, which equips them with broad knowledge and general reasoning capabilities, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)-based preference optimization have emerged as the two primary paradigms for aligning MLLMs with human preferences and specific task requirements. However, SFT depends on high-quality instruction–response annotations and optimizes the likelihood of reference responses, which does not explicitly model relative preferences between outputs. RL-based methods such as RLHF are more preference-aware, but they require an additional reward model that may encode bias or be exploited through reward hacking. Researchers have proposed several advanced approaches to address these issues. We instead investigate an interesting question: Is it necessary to invent novel techniques, or can classical machine learning methods be adapted to the MLLM scenario?

We revisit mixup augmentation, which synthesizes mixed samples and corresponding labels at given mixing ratios (a minimal sketch follows the list below). However, two main challenges arise:

  • Achieving an optimal trade-off between efficiency and performance for mixup augmentations that rely on saliency-based metrics.
  • Extending the augmentation properly to MLLMs, moving from classical image corruptions to data-dependent mixed samples.
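
For context, here is a minimal sketch of classic mixup in PyTorch. The function name and interface are ours for illustration, not from the MergeMix codebase:

import torch

def mixup(images, labels, num_classes, alpha=1.0):
    # Classic mixup: convex combination of two samples and their labels.
    # images: (B, C, H, W) tensor; labels: (B,) integer class ids.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1.0 - lam) * images[perm]
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * onehot + (1.0 - lam) * onehot[perm]
    return mixed, mixed_labels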

Motivated by these perspectives, we propose a novel training framework called MergeMix, which builds preference pairs for MLLM training through data augmentation methods and ranking loss, thereby bridging the gap between SFT and RL.

  • Image Mixing: a novel data augmentation that generates mixed samples through token merging. Bipartite soft matching (BSM) gathers similarity information that carries spatial context, so the resulting mask retains useful features. MergeMix further links the merge ratio to the mixing ratio, aligning the information density of the samples and enabling precise generation of mixed data (a hedged sketch of the mask construction follows this list).
  • Preference Tuning with Mixup: a preference-driven SFT paradigm for MLLMs, where responses conditioned on augmented samples are treated as non-preferred (loser) and responses conditioned on clean samples as preferred (winner). This enables preference optimization via the SimPO loss without relying on a reward model (see the loss sketch after this list).
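
The mask construction behind Image Mixing can be pictured roughly as follows. This is a hedged sketch under our own assumptions: scores_b stands in for per-patch attention scores recovered from the token-merge encoder, and the top-k patch selection approximates how the merge ratio is tied to the mixing ratio; it is not the authors' implementation.

import torch

def mergemix_images(img_a, img_b, scores_b, lam, patch=16):
    # img_a, img_b: (C, H, W) images; scores_b: (N,) per-patch attention
    # scores for img_b (assumed recovered from the ToMe encoder);
    # lam: fraction of the mixed image kept from img_a.
    C, H, W = img_a.shape
    gh, gw = H // patch, W // patch
    n_from_b = int(round((1.0 - lam) * gh * gw))  # patches pasted from img_b
    top = scores_b.topk(n_from_b).indices         # most salient patches of img_b
    mask = torch.zeros(gh * gw)
    mask[top] = 1.0                               # 1 -> take this patch from img_b
    mask = mask.view(1, gh, gw)
    mask = mask.repeat_interleave(patch, 1).repeat_interleave(patch, 2)
    mixed = (1.0 - mask) * img_a + mask * img_b
    lam_eff = 1.0 - mask.mean().item()            # realized label weight for img_a
    return mixed, lam_eff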
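
The ranking objective in Preference Tuning with Mixup is the SimPO loss, which scores each response by its length-normalized log-likelihood and requires no reference model. A minimal sketch, assuming summed token log-probabilities for the winner (clean image) and loser (mixed image) responses are already computed; beta and gamma are illustrative defaults:

import torch.nn.functional as F

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    # logp_w / logp_l: summed token log-probs of winner / loser responses;
    # len_w / len_l: their token lengths.
    r_w = beta * logp_w / len_w   # length-normalized implicit reward (winner)
    r_l = beta * logp_l / len_l   # length-normalized implicit reward (loser)
    return -F.logsigmoid(r_w - r_l - gamma).mean()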


Visualization of MergeMix

We provide visualizations of MergeMix for both an MLLM case study and mixed images at different mixing ratios.

MLLM Case Study

Case study figure

Mixed Images

Mixed images figure

BibTeX

@article{jin2025mergemix,
  title   = {MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding},
  author  = {Jin, Xin and Li, Siyuan and Jian, Siyong and Yu, Kai and Wang, Huan},
  journal = {arXiv},
  year    = {2025},
}