Abstract: The advancement of multi-modal large language models (MLLMs) has significantly enhanced their capability to process and understand diverse data types, integrating text, images, and other ...