
BUT System for the MLC-SLM Challenge

Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget
Abstract

We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, demonstrating strong generalization. Despite being fine-tuned on English-only data for target-speaker ASR, DiCoW retains solid multilingual performance, indicating that encoder modifications preserve Whisper's multilingual capabilities. We then fine-tune both DiCoW and DiariZen on the MLC-SLM challenge data. The fine-tuned DiariZen continues to outperform the fine-tuned Pyannote baseline, while DiCoW sees further gains from domain adaptation. Our final system achieves a micro-average tcpWER/CER of 16.75% and ranks second in Task 2 of the MLC-SLM challenge. Lastly, we identify several labeling inconsistencies in the training data -- such as missing speech segments and incorrect silence annotations -- which can hinder diarization fine-tuning. We propose simple mitigation strategies to address these issues and improve system robustness.
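The reported 16.75% is a micro-average of per-language tcpWER/CER scores, i.e. error and reference-token counts are pooled before dividing, rather than averaging the per-recording rates. A minimal sketch of this pooling (the function name and the example counts are illustrative, not from the paper):

```python
def micro_average_error_rate(per_file_counts):
    """Micro-average an error rate (e.g. WER/CER) over recordings.

    per_file_counts: list of (num_errors, num_reference_tokens) pairs.
    Pooling counts before dividing means longer recordings carry
    proportionally more weight than in a macro (per-file) average.
    """
    total_errors = sum(errors for errors, _ in per_file_counts)
    total_tokens = sum(tokens for _, tokens in per_file_counts)
    return 100.0 * total_errors / total_tokens

# Hypothetical counts: (edit errors, reference tokens) per recording.
counts = [(12, 100), (3, 50)]
print(micro_average_error_rate(counts))  # 15 / 150 tokens -> 10.0%
```

Note the contrast with a macro average of the same two files, (12% + 6%) / 2 = 9%; the micro average weights the longer file more heavily.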