BERT-Sort: A Zero-shot MLM Semantic Encoder on Ordinal Features for AutoML

Mukul Prasad, Lei Liu, Wei-Peng Chen, Mehdi Bahrami

Abstract

Data pre-processing is one of the key steps in creating machine learning pipelines for tabular data. A common pre-processing operation in AutoML systems is encoding categorical features as numerical features. Typically, this is implemented with a simple alphabetical sort on the categorical values, using functions such as OrdinalEncoder and LabelEncoder in Scikit-Learn and H2O. However, categorical values often carry semantic ordinal relationships, such as quality level (e.g., ['very good' > 'good' > 'normal' > 'poor']) or month (e.g., ['Jan' < 'Feb' < 'Mar']). Such semantic relationships are not exploited by previous AutoML approaches. In this paper, we introduce BERT-Sort, a novel approach to semantically encode ordinal categorical values via zero-shot Masked Language Models (MLM), and apply it to AutoML for tabular data. We created a new benchmark of 42 features from 10 public data sets for sorting categorical ordinal values, the first of its kind, on which BERT-Sort improves semantic encoding of ordinal values by 27% over existing approaches. We perform a comprehensive evaluation of BERT-Sort on different public MLMs, such as RoBERTa, XLM, and DistilBERT. We also compare the performance of raw data sets against data sets encoded through BERT-Sort on different AutoML platforms, including AutoGluon, FLAML, H2O, and MLJAR, to evaluate the proposed approach in an end-to-end scenario.
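The gap between alphabetical and semantic encoding described in the abstract can be illustrated with a minimal sketch. This is not BERT-Sort itself: the helper names `alphabetical_encode` and `semantic_encode` are hypothetical, and the semantic order is supplied by hand here, whereas the paper proposes inferring it with a zero-shot MLM.

```python
def alphabetical_encode(values):
    """Map each distinct category to its rank in sorted (alphabetical) order."""
    order = sorted(set(values))
    return {category: rank for rank, category in enumerate(order)}


def semantic_encode(ordered_categories):
    """Map each category to its rank in a caller-supplied semantic order."""
    return {category: rank for rank, category in enumerate(ordered_categories)}


# Quality levels from the abstract, listed from lowest to highest.
# (Hand-written order: an assumption standing in for an inferred ordering.)
quality = ["poor", "normal", "good", "very good"]

alpha = alphabetical_encode(quality)
semantic = semantic_encode(quality)

print(alpha)     # alphabetical ranks: 'good' gets the lowest code
print(semantic)  # semantic ranks: 'poor' gets the lowest code
```

With the alphabetical mapping, 'good' is encoded below 'normal' and 'poor', so the numeric feature no longer reflects the quality scale. In scikit-learn, a known order can be passed to `OrdinalEncoder` through its `categories` argument; the difficulty the abstract points to is obtaining that order automatically.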

Benchmarks

Benchmark: automl-on-ordinaldataset
Methodology: Zero-shot-BERT-SORT
Metrics: 1:1 Accuracy: +55%
