HyperAI

Audio Classification on VGGSound

Metrics

Top-1 Accuracy (%)
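
For readers unfamiliar with the metric, the following is a minimal sketch of how top-1 accuracy is typically computed: a clip counts as correct when its single highest-scoring class matches the ground-truth label. The function name `top1_accuracy` and the toy scores are illustrative assumptions, not taken from the benchmark or from any listed model's evaluation code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose highest-scoring class equals the label.

    `scores` holds one row of per-class scores per sample (hypothetical layout);
    `labels` holds the ground-truth class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score; ties resolve to the lowest index.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data: four clips, three classes (made up for illustration).
scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.6, 0.3, 0.1],  # predicted class 0
    [0.2, 0.2, 0.6],  # predicted class 2
    [0.5, 0.4, 0.1],  # predicted class 0
]
labels = [1, 0, 1, 0]

accuracy = top1_accuracy(scores, labels)
print(f"Top-1 accuracy: {accuracy:.1f}%")
```

The result is expressed as a percentage, matching the scale of the values reported in the results table.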

Results

Performance results of various models on this benchmark

| Model Name | Top-1 Accuracy (%) | Paper Title | Repository |
| --- | --- | --- | --- |
| ONE-PEACE (Audio-Visual) | 68.2 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | |
| Mirasol3B | 69.8 | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | - |
| MAST (Audio Only) | 57.0 | Multiscale Audio Spectrogram Transformer for Efficient Audio Classification | - |
| Audiovisual Masked Autoencoder (Audio-only, Single) | 57.2 | Audiovisual Masked Autoencoders | |
| CAV-MAE (Audio-Visual) | 65.9 | Contrastive Audio-Visual Masked Autoencoder | |
| AVT (Audio-Visual) | 63.9 | AVT: Audio-Video Transformer for Multimodal Action Recognition | - |
| PlayItBackX3 | 53.7 | Play It Back: Iterative Attention for Audio Recognition | |
| Audiovisual Masked Autoencoder (Audiovisual, Single) | 65.0 | Audiovisual Masked Autoencoders | |
| MBT (AV) | - | Attention Bottlenecks for Multimodal Fusion | |
| AVT (V) | 53.2 | AVT: Audio-Video Transformer for Multimodal Action Recognition | - |
| MBT (A) | 52.3 | Attention Bottlenecks for Multimodal Fusion | |
| CAV-MAE (Audio-Only) | 59.5 | Contrastive Audio-Visual Masked Autoencoder | |
| MMT (Audio-Visual) | 66.2 | Multiscale Multimodal Transformer for Multimodal Action Recognition | - |
| MAViL | 67.1 | - | - |
| MMT (Video) | 56.1 | Multiscale Multimodal Transformer for Multimodal Action Recognition | - |
| ONE-PEACE (Audio-Only) | 59.6 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities | |
| UAVM (Audio + Video) | 65.8 | UAVM: Towards Unifying Audio and Visual Models | |
| UAVM (Audio Only) | 56.5 | UAVM: Towards Unifying Audio and Visual Models | |
| UAVM (Video Only) | 49.9 | UAVM: Towards Unifying Audio and Visual Models | |
| EquiAV | 67.1 | EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | |