
VICTOR: a Dataset for Brazilian Legal Documents Classification

Teófilo Emídio de Campos, Pedro Henrique Luz de Araujo, Nilton Correia da Silva, Fabricio Ataides Braz

Abstract

This paper describes VICTOR, a novel dataset built from digitized legal documents of Brazil's Supreme Court. It comprises more than 45 thousand appeals, which include roughly 692 thousand documents, or about 4.6 million pages. The dataset contains labeled text data and supports two types of tasks: document type classification and theme assignment, a multilabel problem. We present baseline results using bag-of-words models, convolutional neural networks, recurrent neural networks and boosting algorithms. We also experiment with linear-chain Conditional Random Fields to leverage the sequential nature of the lawsuits, which we find improves document type classification. Finally, we compare two theme classification approaches: one that uses domain knowledge to filter out the less informative document pages, and the default one that uses all pages. Contrary to the Court experts' expectations, we find that using all available data is the better method. We make the dataset available in three versions of different sizes and contents to encourage exploration of better models and techniques.
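To illustrate how a linear-chain model can exploit page order, the sketch below runs Viterbi decoding over per-page class scores. It is not the authors' implementation: it shows only the decoding step, and the label names, emission scores and transition scores are hypothetical values chosen by hand, whereas a trained CRF would learn them from data.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for one lawsuit.

    emissions: list of {label: score}, one dict per page (e.g. page
               classifier outputs; hypothetical values here).
    transitions: {(prev_label, cur_label): score} for adjacent pages.
    """
    labels = list(emissions[0])
    # Best score of any path ending in each label at the first page.
    best = {label: emissions[0][label] for label in labels}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointer = {}, {}
        for cur in labels:
            # Pick the previous label that maximizes the path score into `cur`.
            prev = max(labels, key=lambda p: best[p] + transitions[(p, cur)])
            new_best[cur] = best[prev] + transitions[(prev, cur)] + scores[cur]
            pointer[cur] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace back from the best final label.
    path = [max(labels, key=best.get)]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


# Hypothetical three-page lawsuit: the middle page is ambiguous on its own.
emissions = [
    {"appeal": 2.0, "other": 0.0},
    {"appeal": 0.4, "other": 0.6},
    {"appeal": 2.0, "other": 0.0},
]
# Assumed transition scores: adjacent pages tend to share a document type.
transitions = {
    ("appeal", "appeal"): 0.5, ("other", "other"): 0.5,
    ("appeal", "other"): -0.5, ("other", "appeal"): -0.5,
}

independent = [max(page, key=page.get) for page in emissions]
sequential = viterbi(emissions, transitions)
print(independent)  # per-page argmax, ignores neighbors
print(sequential)   # sequence-level decoding
```

With these made-up scores, classifying each page independently labels the middle page "other", while sequence-level decoding assigns "appeal" to all three pages because the neighboring pages and the transition scores outweigh the weak page-level evidence.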

Benchmarks

Benchmark	Methodology	Average F1	Weighted F1
multi-label-text-classification-on-bvictor	XGBoost	0.8843	0.8957
multi-label-text-classification-on-bvictor	SVM	0.7761	0.8235
multi-label-text-classification-on-bvictor	NB	0.6335	0.6955
multi-label-text-classification-on-mvictor	SVM	0.6642	0.8137
multi-label-text-classification-on-mvictor	NB	0.3797	0.6062
multi-label-text-classification-on-mvictor	XGBoost	0.8882	0.9072
multi-label-text-classification-on-svictor	SVM	0.8246	0.8231
multi-label-text-classification-on-svictor	NB	0.5121	0.4875
multi-label-text-classification-on-svictor	XGBoost	0.8887	0.8634
text-classification-on-mvictor-type	BiLSTM	0.7092	0.9433
text-classification-on-mvictor-type	CNN	0.7061	0.9464
text-classification-on-mvictor-type	SVM	0.6792	0.9288
text-classification-on-mvictor-type	CNN + CRF	0.7505	0.9537
text-classification-on-mvictor-type	NB	0.4772	0.8477
text-classification-on-svictor-type	SVM	0.7632	0.9425
text-classification-on-svictor-type	BiLSTM	0.7281	0.9465
text-classification-on-svictor-type	NB	0.5979	0.8893
text-classification-on-svictor-type	CNN + CRF	0.7740	0.9533
text-classification-on-svictor-type	CNN	0.7584	0.9472
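
The table reports two F1 aggregates per model. As a minimal sketch, assuming "Average F1" means the unweighted (macro) mean of per-class F1 scores and "Weighted F1" means the mean weighted by each class's number of true samples, the two can be computed as follows. The labels and predictions are hypothetical toy data, not drawn from VICTOR.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def average_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of 'Average F1')."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by the number of true samples in each class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical imbalanced toy data: 8 "other" pages, 2 "ruling" pages,
# with one "ruling" page misclassified as "other".
y_true = ["other"] * 8 + ["ruling"] * 2
y_pred = ["other"] * 8 + ["other", "ruling"]

print(round(average_f1(y_true, y_pred), 4))
print(round(weighted_f1(y_true, y_pred), 4))
```

On imbalanced data such as this toy example, an error on the rare class lowers the unweighted average more than the weighted one, which is one plausible reading of why the document type rows show a weighted F1 well above the average F1.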
