MAViL (Audio-Visual, Single) | 0.533 | - | - |
BEATs (Audio-only, Single) | 0.486 | BEATs: Audio Pre-Training with Acoustic Tokenizers | |
CAV-MAE (Audio-Visual) | 0.512 | Contrastive Audio-Visual Masked Autoencoder | |
CAV-MAE (Audio-only) | 0.466 | Contrastive Audio-Visual Masked Autoencoder | |
Audiovisual Masked Autoencoder (Audio-only, Single) | 0.466 | Audiovisual Masked Autoencoders | |
BEATs (Audio-only, Ensemble) | 0.506 | BEATs: Audio Pre-Training with Acoustic Tokenizers | |
CAV-MAE (Visual-only) | 0.262 | Contrastive Audio-Visual Masked Autoencoder | |