RL-Guided Data Selection for Language Model Finetuning
Abstract
Reinforcement learning is used to optimize data selection for LLM fine-tuning by formulating selection as a Markov Decision Process, achieving comparable or better performance with significantly reduced training time.
Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model's downstream performance under a strict training data budget. Solving this problem exactly is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the finetuning setting. We reformulate the problem as a tractable Markov Decision Process (MDP) and train agents with several Reinforcement Learning (RL) methods to learn data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a 5% subset selected by our approach matches or outperforms finetuning on the full dataset by up to 10.8 accuracy points, while cutting wall-clock training time by up to 2x, highlighting the promise of RL-guided data selection.
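As a concrete illustration of the MDP framing, the sketch below treats the subset selected so far as the state, the choice of the next training example as the action, and a cheap proxy score of the selected subset as the terminal reward, then trains a linear softmax policy with REINFORCE. This is a toy sketch under stated assumptions, not the paper's implementation: the feature representation, the `proxy_reward` stand-in, and the policy/update choices are all illustrative.

```python
# Minimal sketch (not the authors' code) of budgeted data selection as an MDP
# learned with REINFORCE. All names and the toy reward are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N, D, BUDGET = 200, 8, 10                      # candidate pool size, feature dim, data budget
features = rng.normal(size=(N, D))             # per-example features (assumed given)
useful = rng.normal(size=D)                    # hidden "usefulness" direction for the toy reward

def proxy_reward(selected):
    """Stand-in for the proxy-model reward: the paper uses a cheap proxy-model-based
    signal; here we simply score alignment with a hidden direction."""
    return float((features[list(selected)] @ useful).sum()) / BUDGET

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(D)                            # linear softmax policy parameters
lr, episodes, baseline = 0.05, 300, 0.0

for _ in range(episodes):
    selected, grads = [], []
    for _ in range(BUDGET):                    # MDP step: state = subset chosen so far
        logits = features @ theta
        logits[selected] = -np.inf             # actions = examples not yet selected
        probs = softmax(logits)
        a = rng.choice(N, p=probs)
        # gradient of log pi(a | state) for a linear softmax policy
        grads.append(features[a] - probs @ features)
        selected.append(a)
    R = proxy_reward(selected)                 # terminal reward from the proxy
    baseline = 0.9 * baseline + 0.1 * R        # running baseline for variance reduction
    theta += lr * (R - baseline) * np.sum(grads, axis=0)   # REINFORCE update

print("learned policy's top picks:", np.argsort(features @ theta)[-BUDGET:])
```

In practice the per-episode reward would come from evaluating a small proxy model trained on the selected subset, and the selected indices would then be used to finetune the full LLM under the data budget.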