5 months ago

GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

Wu Yiqi ; Hu Xiaodan ; Fu Ziming ; Zhou Siling ; Li Jiangong

Abstract

Animal ethology is a crucial aspect of animal research, and animal behavior labeling is the foundation for studying animal behavior. This process typically involves labeling video clips with behavioral semantic tags, a task that is complex, subjective, and multimodal. With the rapid development of multimodal large language models (LLMs), new applications have emerged for animal behavior understanding tasks in livestock scenarios. This study evaluates the visual perception capabilities of multimodal LLMs in animal activity recognition. To achieve this, we created piglet test data comprising close-up video clips of individual piglets and annotated full-shot video clips. These data were used to assess the performance of four multimodal LLMs, namely Video-LLaMA, MiniGPT4-Video, Video-Chat2, and GPT-4 omni (GPT-4o), in piglet activity understanding. Through comprehensive evaluation across five dimensions (counting, actor referring, semantic correspondence, time perception, and robustness), we found that while current multimodal LLMs require improvement in semantic correspondence and time perception, they already demonstrate preliminary visual perception capabilities for animal activity recognition. Notably, GPT-4o showed outstanding performance, and both Video-Chat2 and GPT-4o exhibited significantly better semantic correspondence and time perception on close-up video clips than on full-shot clips. The initial evaluation experiments in this study validate the potential of multimodal large language models in livestock scene video understanding and provide new directions and references for future research on animal behavior video understanding. Furthermore, by exploring in depth the influence of visual prompts on multimodal large language models, we expect to enhance the accuracy and efficiency of animal behavior recognition in livestock scenarios by drawing on human visual processing methods.
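
To make the comparison between close-up and full-shot clips concrete, the following is a minimal sketch of how per-dimension accuracy could be tallied. It is not the authors' evaluation code: the record layout `(clip_type, dimension, is_correct)`, the function name `accuracy_by_dimension`, and the use of plain accuracy as the score are all assumptions made for illustration. Only the five dimension names and the two clip types come from the abstract.

```python
from collections import defaultdict

# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "counting",
    "actor referring",
    "semantic correspondence",
    "time perception",
    "robustness",
)


def accuracy_by_dimension(records):
    """Return {(clip_type, dimension): accuracy} for graded model answers.

    `records` is an iterable of (clip_type, dimension, is_correct) tuples.
    This layout is hypothetical; the paper's actual scoring may differ.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for clip_type, dimension, is_correct in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        key = (clip_type, dimension)
        totals[key] += 1
        correct[key] += int(bool(is_correct))
    return {key: correct[key] / totals[key] for key in totals}


# Illustrative, made-up grades (not results from the study).
example_records = [
    ("close-up", "time perception", True),
    ("close-up", "time perception", True),
    ("full-shot", "time perception", True),
    ("full-shot", "time perception", False),
    ("full-shot", "counting", True),
]
scores = accuracy_by_dimension(example_records)
```

Keying the result by clip type as well as dimension keeps the close-up and full-shot scores side by side, which is the comparison the abstract highlights for semantic correspondence and time perception.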

Benchmarks

Benchmark                                         Methodology    Metrics
mmr-total-on-mrr-benchmark                        GPT-4o         Total Column Score: 457
zero-shot-video-question-answer-on-video-mme      GPT-4o mini    Accuracy (%): 62.3
zero-shot-video-question-answer-on-video-mme      GPT-4o         Accuracy (%): 70.3
zero-shot-video-question-answer-on-video-mme-1    GPT-4o mini    Accuracy (%): 68.9
zero-shot-video-question-answer-on-video-mme-1    GPT-4o         Accuracy (%): 77.2
zero-shot-video-question-answer-on-zero-shot      GPT-4o         Accuracy (%): 64.0
