Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than the traditional metrics ROUGE-L and BERTScore. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
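As a rough illustration of how a learned scorer can stand in for ROUGE-L or BERTScore as a GRPO reward, the sketch below scores (reference, response) pairs with a fine-tuned BERT-style regression head and squashes the scores to a bounded reward. The checkpoint path and the sigmoid scaling are placeholder assumptions, not the released PrefBERT weights or training recipe.

# Minimal sketch: a BERT-style scorer as a semantic reward for GRPO.
# "path/to/prefbert-checkpoint" is a hypothetical sequence-regression
# checkpoint fine-tuned on Likert-rated (reference, response) pairs.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/prefbert-checkpoint")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/prefbert-checkpoint", num_labels=1
)
model.eval()

def semantic_reward(references: list[str], responses: list[str]) -> torch.Tensor:
    """Score each (reference, response) pair and map it to [0, 1]."""
    inputs = tokenizer(
        references, responses, padding=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1)  # one scalar per pair
    return torch.sigmoid(logits)  # bounded like ROUGE-L / BERTScore

# Example: reward a group of sampled completions against one reference.
rewards = semantic_reward(
    ["The reference answer ..."] * 2,
    ["A detailed, on-topic completion ...", "An off-topic completion ..."],
)

In GRPO, such per-response scores would then be normalized within each group of sampled completions to form the advantages used for the policy update.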