TRL CLI parameter for local dataset

Hi everyone,

I am configuring a Vertex AI Training job that uses the TRL CLI command. I have a question regarding the dataset that can be used here. I have a local dataset, but I see that the --train-dataset parameter expects the name of a dataset to pick from the HF repository. What is the common approach for pointing to a local dataset (or one in a Google Cloud Storage bucket) instead of a dataset from the HF repository?

Thanks

Jerome


There seem to be roughly two approaches?


At a high level the “common approach” is:

Treat your local / GCS dataset as normal files, and point TRL (or your script) at those file paths via datasets.load_dataset or the TRL datasets: config.
You are not forced to use a Hugging Face Hub dataset name.

Concretely, there are two standard patterns:

  1. Use dataset_name as a local path (simple case).
  2. Use a YAML config with datasets: + data_files (more explicit, ideal for JSONL/CSV on GCS).

And on Vertex AI, “local path” just means “path under /gcs/<BUCKET>” because Cloud Storage is mounted into the container. (Google Cloud)


1. Background: TRL and Hugging Face Datasets

1.1 TRL’s dataset_name is “path or name”

In the TRL docs for the CLI/script utilities, the key line is:

dataset_name (str, optional) — Path or name of the dataset to load. (Hugging Face)

This means:

  • If you pass dataset_name=timdettmers/openassistant-guanaco, it loads from the Hugging Face Hub.
  • If you pass dataset_name=/path/to/mycorpus, it treats it as a path and calls datasets.load_dataset(path="/path/to/mycorpus", ...).

You can see a real example of using a local path with TRL in a forum thread:

python examples/scripts/sft.py \
  --model_name google/gemma-7b \
  --dataset_name path/to/mycorpus \
  ...

and the same script works with a Hub dataset name like OpenAssistant/oasst_top1_2023-08-25. (Hugging Face Forums)

So the CLI is designed to handle both.
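
Since TRL hands dataset_name straight to the datasets library, you can see both behaviours directly with load_dataset; a minimal sketch with placeholder names and paths:

from datasets import load_dataset

# Hub name: downloaded from the Hugging Face Hub
ds_hub = load_dataset("timdettmers/openassistant-guanaco")

# Local directory: the builder (JSON/CSV/Parquet) is inferred from the files inside
ds_local = load_dataset("/path/to/mycorpus")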

1.2 Hugging Face datasets handles local and remote files

Hugging Face Datasets supports:

  • Datasets from the Hub
  • Local datasets
  • Remote datasets (HTTP, S3/GCS/… via URLs or storage options)

The canonical docs say:

“Datasets can be loaded from local files stored on your computer and from remote files… CSV, JSON, TXT, parquet… load_dataset() can load each of these file types.” (Hugging Face)

Typical local example:

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "path/to/train.jsonl",
        "validation": "path/to/val.jsonl",
    },
)

You can also pass lists of paths or multiple splits. (Hugging Face Forums)
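
For sharded data, data_files also accepts lists of paths per split (and glob patterns); a small sketch with placeholder paths:

from datasets import load_dataset

# Several shards per split; glob patterns such as "path/to/train-*.jsonl" work too
ds = load_dataset(
    "json",
    data_files={
        "train": ["path/to/train-00.jsonl", "path/to/train-01.jsonl"],
        "validation": ["path/to/val.jsonl"],
    },
)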

TRL just delegates to this API under the hood.


2. Vertex AI detail: GCS buckets are mounted under /gcs

For Vertex AI custom training jobs, Google uses Cloud Storage FUSE so that Cloud Storage looks like a normal filesystem inside the container:

“When you start a custom training job, the job sees a directory /gcs, which contains all your Cloud Storage buckets as subdirectories.” (Google Cloud)

So if you have data at:

gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl

then inside the training container you see:

/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl

From TRL / datasets.load_dataset perspective, these are just normal local paths.

That’s the key: GCS → /gcs/<BUCKET> → treat as local files.
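
Inside the training container you can confirm the mount with plain filesystem calls before handing the paths to datasets; a quick sketch using the bucket and prefix above:

import os

# Cloud Storage FUSE exposes each bucket as a subdirectory of /gcs
data_dir = "/gcs/my-bucket/drug-herg"
print(os.listdir(data_dir))                                   # e.g. ['train.jsonl', 'eval.jsonl']
print(os.path.exists(os.path.join(data_dir, "train.jsonl")))  # True if the mount is working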


3. Pattern 1 (simplest): use dataset_name as a path

If your data directory is something datasets can handle automatically (e.g. Parquet files, or a dataset saved with save_to_disk), you can often just pass its path as dataset_name:

3.1 Local machine

Assume:

/home/you/data/drug-herg/
  train.jsonl
  eval.jsonl

You could save this as a HF dataset first (optional):

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)
ds.save_to_disk("/home/you/data/drug-herg-hf")

Then run TRL:

trl sft \
  --model_name_or_path google/gemma-2b-it \
  --dataset_name=/home/you/data/drug-herg-hf \
  ...

Here dataset_name is a path, and TRL will internally call datasets.load_from_disk / load_dataset as appropriate. The StackOverflow/GeeksforGeeks posts show exactly this pattern for local paths. (Stack Overflow)

3.2 Vertex AI

Upload your HF-saved dataset directory to GCS:

gs://my-bucket/drug-herg-hf/  (the directory written by save_to_disk)

Inside the container, that is /gcs/my-bucket/drug-herg-hf.

Then in your CustomContainerTrainingJob args:

args = [
    "--model_name_or_path=google/gemma-2b-it",
    "--dataset_name=/gcs/my-bucket/drug-herg-hf",
    # other TRL args...
]

This is the simplest approach when you want to reuse a pre-saved HF dataset, but it requires you to create that dataset once (either locally and then upload it, or directly on GCS).
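
Before launching the full job, it is worth confirming that the saved dataset is readable from the mounted path; a short sketch using the placeholder path above:

from datasets import load_from_disk

# Works for a DatasetDict written by save_to_disk, whether local or under /gcs
ds = load_from_disk("/gcs/my-bucket/drug-herg-hf")
print(ds)  # should list the train/validation splits and their columns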


4. Pattern 2 (more flexible, common with JSONL/CSV): YAML datasets: with data_files

This is the pattern most people use when they have raw JSONL/CSV files and want full control, especially on Vertex AI.

4.1 Why use datasets: instead of dataset_name?

The TRL script-utils docs explicitly support a datasets mixture config:

dataset_name (str, optional) - Path or name of the dataset to load. If datasets is provided, this will be ignored. (Hugging Face)

That is, if you define datasets in the YAML:

  • TRL ignores dataset_name.
  • TRL uses your datasets entries instead (each mapping more or less directly to a datasets.load_dataset call).

This is the cleanest way to tell TRL:

  • “Use the JSON builder”
  • “Here are my train/validation files”
  • “Use only the prompt and completion columns”

4.2 Example dataset on GCS

Say you have:

gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl

With prompt–completion records (your current SFT format):

{"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"}
{"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"}
...

Inside the Vertex container:

/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl

4.3 YAML config for TRL CLI

trl.sft can be driven by a config like:

# sft_config.yaml

# ---------- Model ----------
model_name_or_path: google/gemma-2b-it

# ---------- Output ----------
output_dir: /gcs/my-bucket/outputs/txgemma-herg
overwrite_output_dir: true

# ---------- Training ----------
max_seq_length: 1024
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 3
learning_rate: 5e-5
warmup_ratio: 0.05
weight_decay: 0.01
bf16: true

# ---------- LoRA / PEFT ----------
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules: all-linear

# ---------- Dataset(s) ----------
datasets:
  - path: json                           # use HF "json" dataset builder
    data_files:
      train: /gcs/my-bucket/drug-herg/train.jsonl
      validation: /gcs/my-bucket/drug-herg/eval.jsonl
    split: train                         # the split used for training
    columns: [prompt, completion]        # keep only these columns

# Ignored when datasets: is defined
dataset_name: null
dataset_text_field: null

# ---------- SFT options ----------
completion_only_loss: true              # train only on completion tokens

Key points:

  • path: json tells TRL to call datasets.load_dataset("json", ...), i.e. the generic JSON builder. (Hugging Face)
  • data_files uses the GCS-mounted paths under /gcs/my-bucket.
  • columns trims the dataset to exactly the fields SFTTrainer needs.
  • completion_only_loss: true ensures the loss is applied only on the completion, not the prompt. (Hugging Face)
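
To sanity-check the data side of this config before submitting the job, note that the datasets: entry corresponds roughly to the following load_dataset call (a sketch with the same placeholder paths, not TRL's exact internal code):

from datasets import load_dataset

# Roughly what the datasets: entry above asks TRL to do
ds = load_dataset(
    "json",
    data_files={
        "train": "/gcs/my-bucket/drug-herg/train.jsonl",
        "validation": "/gcs/my-bucket/drug-herg/eval.jsonl",
    },
)
train = ds["train"].select_columns(["prompt", "completion"])
print(train[0])  # should print one prompt/completion record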

4.4 Running this locally vs Vertex

Locally (for testing):

  • Replace /gcs/my-bucket/drug-herg/... with /home/you/data/drug-herg/....

  • Run:

    trl sft --config sft_config.yaml
    

On Vertex AI:

  • Upload sft_config.yaml itself to GCS, e.g. gs://my-bucket/configs/sft_config.yaml.

  • Inside container: /gcs/my-bucket/configs/sft_config.yaml.

  • In CustomContainerTrainingJob:

    args = ["--config=/gcs/my-bucket/configs/sft_config.yaml"]
    
    job = aiplatform.CustomContainerTrainingJob(
        display_name="txgemma-herg-lora-sft",
        container_uri=CONTAINER_URI,
        command=[
            "sh",
            "-c",
            'exec trl sft "$@"',
            "--",
        ],
    )
    
    job.run(
        args=args,
        # machine_type, accelerator, etc.
    )
    

From TRL’s perspective, this is indistinguishable from local training with a JSON dataset; the only difference is the /gcs/... prefix.


5. Summary: “common approach” in one place

Putting it all together, the standard practice to point TRL (and TRL CLI on Vertex) to local or GCS data instead of a Hub dataset is:

  1. Store the dataset as normal files (JSONL/CSV/Parquet) either:

    • on local disk for local runs, or
    • in a Cloud Storage bucket for Vertex.
  2. Treat the GCS paths as local paths under /gcs/<BUCKET> inside the Vertex container. (Google Cloud)

  3. Use one of:

    • --dataset_name=/gcs/<BUCKET>/path/to/hf-saved-dataset if you’re using a dataset saved with save_to_disk, or
    • A YAML datasets: config that calls datasets.load_dataset("json"/"csv", data_files={...}) on those paths.
  4. Avoid thinking of dataset_name as “must be from the Hub” – per TRL’s own docs, it is “path or name.” (Hugging Face)

That is the common and recommended approach when you want to keep data off the Hub and inside your own filesystem or GCS environment.

Thanks, I have used the GCS approach with the JSONL files. Thanks for confirming.

Jerome


I am almost giving up on the TRL CLI parameters and command in a Vertex AI Custom Training job environment. Between the parameter names that differ between the documentation and what is actually expected, the issues with the datasets library when the data is accessed through FUSE, … I cannot understand how people manage to use this TRL toolbox efficiently with a Vertex AI Custom Training job…

Sorry… I like Google, I like Hugging Face… but it is starting to be a pain in the a….
Jerome


Hmm, this is really tricky…

If we try to make it work like the Hugging Face sample, it’d be something like this? Not sure if this will get the local dataset stuff running though…


Your frustration is understandable. You are hitting three different moving targets at once:

  1. TRL CLI arguments vs documented names
  2. Hugging Face datasets vs GCS / FUSE semantics
  3. Vertex AI custom training’s way of wiring commands and env vars

People who “have it working” are usually doing one of two things:

  • either they follow the exact Hugging Face Vertex example with minimal changes,
  • or they skip the TRL CLI completely and run a plain Python training script using SFTTrainer.

I will outline both paths and how to simplify what you’re doing.


1. What the official “Vertex + TRL” flow actually assumes

Start from the official Hugging Face examples for fine-tuning with TRL on Vertex AI. The pattern there is:

  1. Use the Hugging Face PyTorch Training DLC as container_uri. (GitHub)

  2. Define a CustomContainerTrainingJob whose command runs TRL CLI:

    command = [
        "sh",
        "-c",
        'exec trl sft "$@"',
        "--",
    ]
    
  3. Pass all config either:

    • as CLI flags in args, or
    • via a YAML with --config=/gcs/<BUCKET>/sft_config.yaml.
  4. Put your data:

    • on the Hub (simplest, what examples do),
    • or on GCS and read via /gcs/<BUCKET> inside the container, or via gs:// URIs with datasets + gcsfs. (Google Cloud)

These examples assume:

  • A specific TRL version that matches the docs’ parameter names.
  • Datasets from HF Hub or a simple format.
  • Minimal use of YAML; almost all examples show explicit CLI flags, not large configs.

Once you deviate (custom YAML, local/GCS datasets, different TRL version), you are outside the “happy path” and you see exactly the issues you describe.


2. Why you are seeing so much friction

2.1. TRL CLI vs docs

  • TRL added the sft CLI on top of Python APIs (SFTTrainer, SFTConfig).
  • The mapping YAML → dataclasses → CLI flags is handled by TrlParser / HfArgumentParser. (Hugging Face)
  • Between versions, parameter names and behaviour have changed; GitHub issues and HF forum posts show people getting unrecognized arguments or “requires --dataset_name even though I specified datasets:” problems. (Hugging Face Forums)

If the DLC image you run has TRL version X, but you are reading docs for version Y, the mismatch becomes painful.
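
Before trusting any particular docs page, check which versions the image actually ships; a quick sketch to run inside the container:

# Run inside the container (or as a one-off job step) to see the installed versions
import datasets
import transformers
import trl

print("trl:", trl.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)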

2.2. datasets + FUSE + GCS

  • Vertex AI serverless training automatically mounts your buckets under /gcs/<BUCKET> via Cloud Storage FUSE. (Google Cloud)

  • FUSE is convenient but:

    • debug messages are opaque,
    • performance can be poor for many small reads,
    • and you have to know the internal path (/gcs/...), not gs://..., in your config.

At the same time, Hugging Face datasets can also read data directly from GCS using gcsfs and gs:// URIs, no FUSE required:

from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={"train": "gs://my-bucket/data/train.jsonl"},
)

This is supported via fsspec; you just need gcsfs installed. (Hugging Face)
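
If you need to be explicit about credentials on the gs:// route, load_dataset also accepts a storage_options dict that is forwarded to fsspec/gcsfs. A sketch, assuming application-default credentials (the job's service account) are available in the container:

from datasets import load_dataset

# "google_default" tells gcsfs to use application-default credentials,
# which on Vertex AI is normally the job's service account.
dataset = load_dataset(
    "json",
    data_files={"train": "gs://my-bucket/data/train.jsonl"},
    storage_options={"token": "google_default"},
)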

So there are two overlapping I/O paths (FUSE /gcs/... vs HTTP-based gs://...), and the official Vertex+TRL examples pick one (/gcs) while datasets docs push another (gs:// + gcsfs). That’s extra cognitive load.

2.3. Too many layers at once

Your stack now looks like:

Vertex AI
→ HF DLC
→ TRL CLI
→ TrlParser + YAML
→ Hugging Face datasets
→ Cloud Storage FUSE or gcsfs

Every time something is mis-specified, you only see “dataset not found”, “param X required”, or “model not accessible”, with no clear pointer to which layer is at fault.

Given that you already have a local Python script using SFTTrainer that works, this is overkill.


3. Pragmatic way forward: stop fighting the TRL CLI

If your goal is “fine-tune TxGemma on Vertex”, you do not need the TRL CLI at all.

You can treat Vertex AI as “just a remote Linux box with GPUs” and run the same Python script you use locally.

3.1. Minimal pattern: custom Python entrypoint

  1. Start from the HF PyTorch Training DLC image (same as examples). (GitHub)

  2. Create your own train.py that:

    • Parses a simple YAML/JSON or CLI args (your choice).
    • Calls datasets.load_dataset on gs:// or /gcs paths.
    • Instantiates AutoTokenizer, AutoModelForCausalLM, SFTTrainer with your config.
    • Calls trainer.train().

    Example skeleton:

    import os
    import argparse
    from datasets import load_dataset
    from transformers import AutoTokenizer
    from trl import SFTTrainer, SFTConfig
    
    def parse_args():
        p = argparse.ArgumentParser()
        p.add_argument("--model_id", default="google/txgemma-2b-predict")
        p.add_argument("--train_file", required=True)
        p.add_argument("--eval_file", required=True)
        p.add_argument("--output_dir", required=True)
        return p.parse_args()
    
    def main():
        args = parse_args()
    
        # 1. Dataset: either gs://... or /gcs/...
        dataset = load_dataset(
            "json",
            data_files={
                "train": args.train_file,
                "validation": args.eval_file,
            },
        )
    
        # 2. Tokenizer + model
        hf_token = os.getenv("HF_TOKEN")
        tokenizer = AutoTokenizer.from_pretrained(args.model_id, token=hf_token)
        # model = AutoModelForCausalLM.from_pretrained(args.model_id, token=hf_token)
        # If you already loaded the base model elsewhere, plug it here.
    
        # 3. SFT config and trainer
        sft_config = SFTConfig(
            output_dir=args.output_dir,
            max_seq_length=1024,
            per_device_train_batch_size=2,
            num_train_epochs=3,
            bf16=True,
            # completion_only_loss=True,  # etc.
        )
    
        trainer = SFTTrainer(
            model=args.model_id,  # or the loaded model object
            train_dataset=dataset["train"],
            eval_dataset=dataset["validation"],
            processing_class=tokenizer,
            args=sft_config,
        )
    
        trainer.train()
        trainer.save_model(args.output_dir)
    
    if __name__ == "__main__":
        main()
    

    This is the same API you already use locally; no YAML parsing magic, no TrlParser, no CLI name mismatch.

  3. In Vertex AI, define your training job as:

    job = aiplatform.CustomContainerTrainingJob(
        display_name="txgemma-sft-direct",
        container_uri=HF_DLC_URI,  # same DLC as before
        command=["python", "train.py"],  # your script
    )
    
    job.run(
        args=[
            "--model_id=google/txgemma-2b-predict",
            "--train_file=gs://my-bucket/drug-herg/train.jsonl",
            "--eval_file=gs://my-bucket/drug-herg/eval.jsonl",
            "--output_dir=/gcs/my-bucket/outputs/txgemma-sft",
        ],
        environment_variables={
            "HF_TOKEN": "hf_...",  # gated model token
        },
        # machine_type / accelerator as before
    )
    
    • datasets will read via gs:// if you install gcsfs (add pip install gcsfs at container startup; see the sketch after this list). (Hugging Face)
    • Or you can switch --train_file to /gcs/my-bucket/drug-herg/train.jsonl if you prefer FUSE. (Google Cloud)
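
One way to handle the gcsfs install without building a new image is to extend the shell wrapper in command; a sketch that reuses the pattern from the earlier examples and the train.py skeleton above:

command = [
    "sh",
    "-c",
    # Install gcsfs at startup, then pass the job args on to the training script
    'pip install --quiet gcsfs && exec python train.py "$@"',
    "--",
]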

This is what many “production” setups effectively do: they use TRL’s Python APIs (SFTTrainer etc.) and let Vertex run that script. They ignore the TRL CLI entirely.

3.2. Advantages of skipping the CLI

  • One source of truth: your Python code. No double mapping from YAML → CLI → dataclasses.
  • You can log the parsed config directly and debug in a normal way.
  • You control exactly how datasets are loaded (gs:// via gcsfs, /gcs via FUSE, or save_to_disk / load_from_disk). (Hugging Face)
  • You avoid TRL version / param-name drift between docs and the DLC.

4. If you really want to keep using TRL CLI + YAML

If you want to persist with CLI because you like the declarative YAML, then reduce moving parts:

  1. Pin TRL version in the DLC to match HF docs.

    • Inside container: pip install 'trl==0.xx.x' and verify with python -c "import trl; print(trl.__version__)".
    • Use docs for that exact version.
  2. Start from the official Vertex + TRL example YAML / args and modify slowly: (Hugging Face)

    • First run exactly the Mistral or Gemma example as published.
    • Confirm it works end-to-end (Hub dataset, no YAML).
    • Then add a very small YAML (only model + training params).
    • Only after that, add a local/GCS dataset via datasets:.
  3. For dataset + GCS, pick one strategy and stick to it:

    • FUSE: use /gcs/<BUCKET>/... in data_files. This matches GCP “Cloud Storage FUSE as local filesystem” docs. (Google Cloud)
    • Direct GCS: use gs:// URLs in data_files and install gcsfs. This follows HF datasets cloud-storage docs. (Hugging Face)
  4. Avoid mixing dataset_name and datasets: in the same config; use one or the other.

The forum threads and examples show that once you stay within that narrow lane, the CLI does work, but it is fragile the second you diverge. (Hugging Face Forums)


5. Practical recommendation given your situation

Given you:

  • already have a working local SFT script,
  • are fighting CLI param names and YAML parsing,
  • and are seeing FUSE / datasets interactions,

the most direct path is:

  1. Stop using trl sft in the Vertex job.
  2. Package your local SFT Python code into the training container.
  3. Call python train.py as the job command.
  4. Use datasets.load_dataset(...) on gs:// or /gcs paths.
  5. Pass HF_TOKEN via environment_variables for TxGemma gating.

This keeps TRL (SFTTrainer) and Vertex AI, but removes the TRL CLI and YAML parser from the equation. That is how many teams avoid exactly the pain you are describing.