Abstract
As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multimodal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| visual-question-answering-on-mm-vet | Lyra-Base | GPT-4 score: 63.5; Params: 9B |
| visual-question-answering-on-mm-vet | Lyra-Pro | GPT-4 score: 71.4; Params: 74B |
| visual-question-answering-on-mm-vet | Lyra-Mini | GPT-4 score: 51.2; Params: 3B |
| visual-question-answering-vqa-on-egoschema | Lyra-Pro | Acc: 75.8 |
| visual-question-answering-vqa-on-mm-vet | Lyra-Pro | Acc: 71.4 |
| visual-question-answering-vqa-on-mme | Lyra-Pro | Acc: 2485 |
| visual-question-answering-vqa-on-mvbench | Lyra-Pro | Acc: 72.3 |
| visual-question-answering-vqa-on-textvqa | Lyra-Pro | Acc: 83.5 |
| visual-question-answering-vqa-on-video-mme | Lyra-Pro | Acc: 69.9 |