MAmmoTH-VL-Instruct: Advancing Open-Source Multimodal Reasoning with Scalable Dataset Construction

Open-source multimodal large language models (MLLMs) show considerable promise across diverse tasks by integrating visual encoders with language models. However, their reasoning abilities remain limited, largely because existing instruction-tuning datasets are repurposed from academic resources such as VQA and AI2D. These datasets focus on simple tasks with phrase-based answers and lack the complexity needed to support advanced reasoning. Chain-of-thought (CoT) reasoning, proven effective in text-based LLMs, offers a potential solution, but it requires datasets with detailed rationales and step-by-step reasoning. Building such datasets at scale is difficult given the cost of human annotation and the limitations of relying on proprietary tools like GPT-4, which are expensive and inaccessible to open-source projects.

To address these limitations, recent efforts have focused on cost-effective, scalable methods for constructing multimodal datasets using exclusively open-source resources. Strategies include task-specific data augmentation and rigorous quality filtering to improve dataset diversity and support nuanced reasoning tasks. While proprietary systems like GPT-4 and Gemini set the performance benchmarks, open-source initiatives like LLaVA use connector-based approaches to bridge visual encoders and language models, as sketched below. These lightweight solutions enable efficient training despite limited resources. However, the scarcity of high-quality supervised fine-tuning data remains a bottleneck. By scaling dataset quality and adopting new training paradigms, open-source MLLMs aim to close the gap with proprietary systems and deliver competitive multimodal capabilities.
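As a rough illustration of the connector idea, here is a minimal PyTorch sketch of a LLaVA-style MLP projector; the dimensions (1024-d vision features, 4096-d LLM embeddings) are typical defaults, not MAmmoTH-VL's exact configuration:

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """A LLaVA-style two-layer MLP that projects vision-encoder patch
    features into the language model's embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# The projected image tokens are concatenated with text embeddings
# and fed into the LLM as an ordinary token sequence.
connector = VisionLanguageConnector()
image_tokens = connector(torch.randn(1, 576, 1024))  # e.g., a 24x24 patch grid
```

Because only this small projector (plus optionally the LLM) needs training, connector-based designs keep compute costs far below end-to-end multimodal pretraining.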

Researchers from Carnegie Mellon University, Nanyang Technological University, the University of Waterloo, and the University of Manchester developed a scalable, cost-efficient method for creating a multimodal instruction-tuning dataset that elicits CoT reasoning. Using open-weight LLMs and MLLMs, they constructed a 12-million-pair dataset covering reasoning-intensive tasks such as math problem-solving and OCR. The dataset is built through a three-step process: categorizing diverse tasks, augmenting them with CoT rationales, and applying rigorous self-filtering to improve accuracy. Experiments with the MAmmoTH-VL-8B model showed state-of-the-art performance on reasoning benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%), with improvements on non-reasoning tasks as well.

The researchers introduced a scalable, cost-effective pipeline for generating a high-quality multimodal dataset of 12 million samples that addresses the limitations of prior visual instruction-tuning methods. The process involves three steps: collecting and categorizing diverse open-source data, augmenting tasks with rewritten instruction-response pairs produced by open models, and applying rigorous quality filtering to remove errors and hallucinations. Data is categorized into ten major types, enabling task-specific enhancements; the datasets slated for augmentation (the paper's "Group B") were rewritten to include detailed rationales covering diverse real-world scenarios. A "Model-as-Judge" approach enforced logical consistency during filtering, yielding a robust, rationale-enriched dataset suitable for multimodal applications. A rough sketch of this rewrite-and-filter loop appears below.
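The following is a minimal sketch of how such a loop might look, assuming a Hugging Face text-generation model stands in for the open-weight rewriter and judge; the model name, prompts, and score-parsing logic here are illustrative, not the authors' exact pipeline:

```python
import re
from transformers import pipeline

# Illustrative open-weight model; the paper's actual rewriter and judge models differ.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

def generate(prompt: str) -> str:
    out = generator(prompt, max_new_tokens=512, return_full_text=False)
    return out[0]["generated_text"].strip()

def rewrite_with_rationale(question: str, short_answer: str) -> str:
    """Step 2: expand a terse QA pair into a CoT-style detailed response."""
    return generate(
        "Rewrite the answer below as a detailed, step-by-step rationale.\n"
        f"Question: {question}\nShort answer: {short_answer}\nDetailed answer:"
    )

def judge_keeps(question: str, rationale: str, threshold: int = 4) -> bool:
    """Step 3: 'Model-as-Judge' self-filtering on logical consistency."""
    reply = generate(
        "Rate the logical consistency of the answer from 1 to 5. "
        "Reply with a single number.\n"
        f"Question: {question}\nAnswer: {rationale}\nScore:"
    )
    match = re.search(r"[1-5]", reply)
    return match is not None and int(match.group()) >= threshold

# Toy usage over one pair; the real pipeline runs over millions of samples.
question, answer = "What is 7 * 8?", "56"
rationale = rewrite_with_rationale(question, answer)
if judge_keeps(question, rationale):
    print(rationale)
```

The key design point is that both generation and filtering use open-weight models, keeping the pipeline reproducible and avoiding per-call fees to proprietary APIs.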

The quality of MAmmoTH-VL-Instruct was evaluated by scoring 1,000 samples from the original and rewritten datasets with the InternVL2-Llama3-76B model. On a 1–5 scale for information content and relevance, the rewritten data outperformed the original, indicating greater depth and better alignment. Token-length distributions showed the rewritten dataset contains broader and longer text, improving clarity and explanation. A t-SNE analysis confirmed the rewritten data retains the core characteristics of the original while expanding its scope, increasing coverage of diversity and complexity. Model-based filtering, assessed against human evaluations, showed reliable agreement (Cohen's Kappa of 0.64). Filtering also improved training outcomes, particularly in visually complex categories.
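Both of these checks are straightforward to reproduce with scikit-learn; the labels and embeddings below are toy placeholders, not the paper's actual annotations:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.manifold import TSNE

# Toy keep/discard verdicts from the model judge vs. human raters (1 = keep).
judge_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_labels = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
kappa = cohen_kappa_score(judge_labels, human_labels)
print(f"Cohen's Kappa: {kappa:.2f}")  # the paper reports 0.64 on its own data

# t-SNE projection of (placeholder) text embeddings into 2-D, used to
# compare the coverage of original vs. rewritten samples visually.
embeddings = np.random.rand(200, 384)  # stand-in for real sentence embeddings
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
```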

In conclusion, the study presents an efficient, scalable approach to improving MLLMs by leveraging open-source models to create diverse, high-quality training data that captures human preferences and complex real-world scenarios. Central to this work is the MAmmoTH-VL-Instruct dataset of 12 million multimodal entries, which powers the MAmmoTH-VL-8B model. The model achieves state-of-the-art performance across diverse benchmarks, excelling at reasoning-intensive and practical tasks while reducing reliance on proprietary systems. By incorporating rich rationales and self-filtering techniques, the method strengthens MLLM reasoning, democratizes advanced AI development, and paves the way for broader applications and further research.


Check out the Paper. All credit for this research goes to the researchers of this project.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
