LightOnOCR-mix-0126 Text Transcription Dataset
LightOnOCR-mix-0126 is a large-scale OCR text transcription dataset released by LightOn in 2026. It accompanies the paper "LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR" and is intended to provide supervision for end-to-end OCR and document understanding models that output full-page transcriptions in natural reading order.
The dataset consists of two parts: a training set and a validation set. Each sample corresponds to the text transcription of one document page. The transcription presents the page text in natural reading order with structured markup (output formats include Markdown, LaTeX mathematical formulas, and HTML tables) and covers page content such as paragraphs, headings, lists, and tables.
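To make the mixed output format more concrete, here is a minimal Python sketch that checks which content types appear in a single transcription string. The `SAMPLE` text and the `content_types` helper are hypothetical illustrations, not part of the dataset or of any LightOn tooling, and the regular expressions are rough heuristics rather than full Markdown, LaTeX, or HTML parsers.

```python
import re

# Hypothetical transcription, written for illustration only (not a real sample).
SAMPLE = """# Quarterly Report

Revenue grew as described by $r = \\frac{a}{b}$.

- first item
- second item

<table><tr><td>Q1</td><td>10</td></tr></table>
"""


def content_types(text: str) -> set[str]:
    """Return the kinds of structured markup detected in a transcription."""
    found = set()
    # Markdown ATX heading: one to six '#' characters followed by a space.
    if re.search(r"^#{1,6} \S", text, flags=re.MULTILINE):
        found.add("heading")
    # Markdown list item: bullet or "1." style marker at the start of a line.
    if re.search(r"^\s*(?:[-*+]|\d+\.) \S", text, flags=re.MULTILINE):
        found.add("list")
    # HTML table: presence of an opening <table> tag.
    if re.search(r"<table\b", text, flags=re.IGNORECASE):
        found.add("table")
    # Inline LaTeX math: text between a pair of '$' on one line.
    if re.search(r"\$[^$\n]+\$", text):
        found.add("math")
    return found


if __name__ == "__main__":
    print(sorted(content_types(SAMPLE)))
```

A check like this could be used to get a rough breakdown of how many pages contain tables or formulas, assuming the transcriptions are available as plain strings.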