
Hugging Face Introduces Parquet Content-Defined Chunking for Faster, More Efficient Data Transfers

Hugging Face has introduced a new feature called Parquet Content-Defined Chunking (CDC) that significantly improves data efficiency when uploading and downloading Parquet files through its Xet storage layer. By ensuring that only changed data chunks are transferred, rather than entire files, CDC reduces both data transfer and storage costs.

Apache Parquet, a columnar storage format, is widely used in data workflows, and Hugging Face hosts over 4 PB of Parquet files. Xet, the platform's content-addressable storage layer, optimizes deduplication by identifying identical data segments. Traditional Parquet files, however, can produce very different byte-level output for minor data changes, which limits deduplication effectiveness. CDC addresses this by structuring Parquet files to minimize such differences, aligning the format with Xet's deduplication capabilities.

The feature works by chunking data based on content rather than at fixed sizes. When enabled, it ensures that logical column values are divided into data pages at consistent boundaries, improving Xet's ability to recognize repeated segments. This is particularly beneficial for operations such as adding or removing columns, changing data types, or modifying row groups.

For example, re-uploading an exact copy of a Parquet file results in zero data transfer, as Xet identifies the content as identical. Adding new columns or altering existing ones uploads only the modified sections, while the rest deduplicates against what is already stored. Similarly, changing a column's data type (e.g., from int64 to int32) uploads only the updated column and metadata. Appending new rows to a dataset uploads only the additional data, and inserting or deleting rows, operations that typically disrupt chunking, now see improved deduplication when CDC is enabled. Without CDC, such changes can shift entire data pages, leading to much larger transfers; with CDC, the impact is minimized, as shown by the reduced transfer volumes in Hugging Face's tests.

Adjusting row-group sizes also benefits from CDC. Smaller or larger row groups, which affect how data is split into pages, still allow efficient deduplication when CDC is used, so storage costs remain optimized across varying configurations.

The feature also works across multiple files, enabling efficient deduplication even when a dataset is split into different file structures. For instance, uploading the same dataset as five, ten, or twenty shards results in minimal additional storage, because Xet identifies the shared content.

Pandas users can enable CDC by setting use_content_defined_chunking=True when saving data, as sketched in the examples below. This allows for efficient uploads of filtered datasets, such as extracting the shorter conversations from a larger file, where only the modified sections are transferred.

Hugging Face emphasizes that CDC improves both upload and download performance, reducing time and costs, and encourages users to migrate from Git LFS to Xet to take advantage of these improvements. By integrating CDC with Parquet, the company aims to streamline data workflows for AI development, making it easier to manage large-scale datasets with greater efficiency.
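As an illustration of the pandas workflow described above, here is a minimal sketch of writing a Parquet file with CDC enabled. The repository path is a hypothetical placeholder, and writing directly to an hf:// URL assumes the huggingface_hub fsspec integration is installed along with a PyArrow version that supports the use_content_defined_chunking flag.

```python
import pandas as pd

# Hypothetical dataset: 1,000 conversations with a turn count each.
df = pd.DataFrame({
    "conversation_id": range(1_000),
    "turns": [i % 7 + 1 for i in range(1_000)],
})

# Write Parquet with content-defined chunking. The flag is forwarded to
# the underlying PyArrow writer, so data pages are split at
# content-defined boundaries instead of fixed sizes, which is what lets
# Xet deduplicate unchanged pages across uploads.
df.to_parquet(
    "hf://datasets/my-org/my-dataset/data.parquet",  # hypothetical repo
    use_content_defined_chunking=True,
)
```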
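And a follow-on sketch of the incremental-update scenario the article describes: appending rows and re-uploading. Under CDC, the unchanged pages should deduplicate against the previous upload, so only the new tail of the file is transferred. The paths and row counts are illustrative, not taken from Hugging Face's tests.

```python
import pandas as pd

path = "hf://datasets/my-org/my-dataset/data.parquet"  # hypothetical repo
df = pd.read_parquet(path)

# Append new rows. With CDC enabled, existing data pages keep their
# content-defined boundaries, so Xet only needs to upload the chunks
# covering the appended data (plus updated file metadata).
new_rows = pd.DataFrame({"conversation_id": [1_000, 1_001], "turns": [3, 5]})
updated = pd.concat([df, new_rows], ignore_index=True)

updated.to_parquet(path, use_content_defined_chunking=True)
```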
