TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

Abstract

Diffusion models have proven to be powerful generative models in recent years, yet they still struggle to generate visual text. Several methods have alleviated this issue by incorporating explicit text position and content as guidance on where and what text to render. However, these methods still suffer from several drawbacks, such as limited flexibility and automation, constrained capability of layout prediction, and restricted style diversity. In this paper, we present TextDiffuser-2, which aims to unleash the power of language models for text rendering. First, we fine-tune a large language model for layout planning. The large language model can automatically generate keywords for text rendering and also supports layout modification through chat. Second, we use the language model within the diffusion model to encode the position and text at the line level. Unlike previous methods that employ tight character-level guidance, this approach generates more diverse text images. We conduct extensive experiments and user studies involving both human participants and GPT-4V, validating TextDiffuser-2's capacity to achieve more rational text layout and generation with enhanced diversity. The code and model will be available at https://aka.ms/textdiffuser-2.
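The abstract does not specify how the planned layout is written down or how it is fed to the diffusion model's language encoder. As a minimal sketch of the line-level idea only, the Python below assumes a hypothetical plain-text plan format (one text line per row, followed by `left,top,right,bottom` pixel coordinates) and a hypothetical token scheme. The names `TextLine`, `parse_layout`, `encode_line_level`, and the `<l..>`, `<t..>`, `<r..>`, `<b..>`, `<eol>` tokens are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextLine:
    """One line of text to render, with one box for the whole line."""
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom), assumed pixels


def parse_layout(plan: str) -> List[TextLine]:
    """Parse a hypothetical plan format: '<text> <left>,<top>,<right>,<bottom>'."""
    lines = []
    for raw in plan.strip().splitlines():
        # The coordinates are the last space-separated field of each row.
        text, _, coords = raw.strip().rpartition(" ")
        left, top, right, bottom = (int(v) for v in coords.split(","))
        if not (left < right and top < bottom):
            raise ValueError(f"degenerate box in layout row: {raw!r}")
        lines.append(TextLine(text.strip(), (left, top, right, bottom)))
    return lines


def encode_line_level(caption: str, lines: List[TextLine]) -> List[str]:
    """Build one token sequence: caption words, then per line a single box
    followed by its characters. Position is given once per line rather than
    once per character, which is the contrast the abstract draws."""
    tokens = caption.split()
    for line in lines:
        left, top, right, bottom = line.box
        tokens += [f"<l{left}>", f"<t{top}>", f"<r{right}>", f"<b{bottom}>"]
        tokens += [f"[{ch}]" for ch in line.text]
        tokens.append("<eol>")  # hypothetical end-of-line marker
    return tokens


plan = "Hello World 10,20,100,40\nSale 30,60,90,80"
layout = parse_layout(plan)
sequence = encode_line_level("a poster", layout)
```

Because each line carries a single box, the character glyphs inside it are left unconstrained, which is one plausible reading of why line-level guidance allows more style diversity than tight character-level masks.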

Benchmarks

Benchmark: image-generation-on-textatlaseval
Methodology: TextDiffuser2
Metrics:
StyledTextSynth CLIP Score: 0.2510
StyledTextSynth FID: 114.31
StyledTextSynth OCR (Accuracy): 0.76
StyledTextSynth OCR (CER): 0.99
StyledTextSynth OCR (F1 Score): 1.46
TextScenesHQ CLIP Score: 0.2252
TextScenesHQ FID: 84.10
TextScenesHQ OCR (Accuracy): 0.66
TextScenesHQ OCR (CER): 0.96
TextScenesHQ OCR (F1 Score): 1.25
TextVisionBlend CLIP Score: -
TextVisionBlend FID: -
TextVisionBlend OCR (Accuracy): -
TextVisionBlend OCR (CER): -
TextVisionBlend OCR (F1 Score): -
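
The OCR character error rate (CER) reported above is conventionally the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length, so lower is better. The sketch below shows that standard definition; whether this benchmark normalizes or aggregates in exactly this way is an assumption, and the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] is the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an OCR hypothesis against a reference string."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


perfect = cer("SALE", "SALE")      # identical strings
one_wrong = cer("SALE", "SAFE")    # one substitution out of four characters
```

Under this definition a CER close to 1 means that almost every reference character needed an edit, and the value can exceed 1 when the OCR output is much longer than the reference.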
