# How to Build a Benchmark with a Private Test Set on Hugging Face

*Community article, published February 16, 2026*

So you want to host a challenge or benchmark? You want people to be able to submit their results, have them evaluated against a private (or public) test set, and see their scores on a public leaderboard. This guide walks through how to set that up on Hugging Face.

## The Architecture

You need four things:

1. **A public leaderboard** — a Hugging Face Space (Gradio) where users submit predictions and view results
2. **A private evaluator** — a Hugging Face Space that scores submissions against your test set
3. **A submissions dataset** — a Hugging Face dataset that records incoming submissions
4. **A results dataset** — a Hugging Face dataset that stores evaluation results

Here's how they connect:

```
User submits via Leaderboard (Space)
        │
        ▼
Submissions Dataset (HF Dataset)
        │
        ▼
Evaluator (Private Space) reads submissions, scores them
        │
        ▼
Results Dataset (HF Dataset)
        │
        ▼
Leaderboard reads results and displays scores
```

The leaderboard writes to the submissions dataset and reads from the results dataset. The evaluator reads from the submissions dataset and writes to the results dataset. This separation keeps your evaluation logic and test set private while giving users a clean public interface.

## Before You Start: Plan Your Schema

This is worth calling out early: Hugging Face datasets require every file in the same config to share the same schema. If you push a Parquet file with columns `[model_name, score]` and later push one with `[model_name, accuracy, f1]`, the dataset loader will break.

Decide upfront what fields you want in your submissions dataset and your results dataset. You can change them later, but it means going back and rewriting existing files, so it's easier to get it right the first time.

For example, your submissions schema might look like:

```python
# Each submission file contains:
{
    "model_name": str,
    "submitted_by": str,
    "submission_time": str,       # ISO timestamp
    "predictions_file": str,      # path or reference to the predictions
}
```

And your results schema might look like:

```python
# Each result file contains:
{
    "model_name": str,
    "submitted_by": str,
    "submission_time": str,
    "overall_score": float,
    "metric_a": float,
    "metric_b": float,
}
```
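
Once the schemas are fixed, it helps to fail fast on records that don't match them. Here is a minimal sketch of a guard you could call before any upload; `RESULTS_SCHEMA` and `validate_record` are illustrative helpers, not part of any Hugging Face library:

```python
# Illustrative guard against schema drift; run before uploading any result file.
RESULTS_SCHEMA = {
    "model_name": str,
    "submitted_by": str,
    "submission_time": str,
    "overall_score": float,
    "metric_a": float,
    "metric_b": float,
}

def validate_record(record: dict, schema: dict = RESULTS_SCHEMA) -> None:
    """Raise if a record's fields or types don't match the agreed schema."""
    if set(record) != set(schema):
        raise ValueError(f"Field mismatch: {set(record) ^ set(schema)}")
    for key, expected_type in schema.items():
        if not isinstance(record[key], expected_type):
            raise TypeError(f"'{key}' should be {expected_type.__name__}, "
                            f"got {type(record[key]).__name__}")
```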

## 1. The Public Leaderboard (Gradio Space)

Create a Hugging Face Space with the Gradio SDK. To keep things organized, split your code across three files (plus a `requirements.txt`, sketched after this list):

- `app.py` — the main Gradio app
- `about.py` — text and markdown content for your leaderboard's informational tabs
- `utils.py` — helper functions for reading/writing datasets
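
The code below pulls in a handful of libraries, so the Space needs a `requirements.txt`. A minimal one based on the imports used in this guide (the Gradio SDK ships `gradio` itself; pin versions to taste):

```
huggingface_hub
datasets
pandas
```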

### `app.py`

This is the entry point. It builds the Gradio interface with (at minimum) two tabs: one to view the leaderboard and one to submit.

```python
import gradio as gr
from about import TITLE, INTRODUCTION_TEXT, CITATION_BUTTON_TEXT
from utils import load_results, submit_prediction

def refresh_leaderboard():
    """Pull the latest results from the results dataset."""
    df = load_results()
    return df

with gr.Blocks() as demo:
    gr.Markdown(TITLE)
    gr.Markdown(INTRODUCTION_TEXT)

    with gr.Tab("Leaderboard"):
        leaderboard_df = gr.Dataframe(value=refresh_leaderboard)
        refresh_btn = gr.Button("Refresh")
        refresh_btn.click(fn=refresh_leaderboard, outputs=leaderboard_df)

    with gr.Tab("Submit"):
        model_name = gr.Textbox(label="Model Name")
        predictions_file = gr.File(label="Predictions File")
        submit_btn = gr.Button("Submit")
        submission_status = gr.Markdown()

        submit_btn.click(
            fn=submit_prediction,
            inputs=[model_name, predictions_file],
            outputs=submission_status,
        )

    with gr.Tab("About"):
        gr.Markdown(CITATION_BUTTON_TEXT)

demo.launch()
```

### `about.py`

Keep your leaderboard's descriptive content here. This keeps app.py clean.

````python
TITLE = "# My Benchmark Leaderboard"

INTRODUCTION_TEXT = """
Welcome to the leaderboard for My Benchmark.
Submit your model's predictions to see how it ranks.
"""

CITATION_BUTTON_TEXT = """
## Citation

If you use this benchmark, please cite:

```bibtex
@misc{mybenchmark2025,
    title={My Benchmark},
    author={Your Name},
    year={2025},
}
```
"""
````


### `utils.py`

This is where the actual dataset interaction happens. The leaderboard needs to:
- **Write** submissions to the submissions dataset
- **Read** results from the results dataset

```python
import json
import os
from datetime import datetime

import pandas as pd
from datasets import load_dataset
from huggingface_hub import HfApi

API = HfApi()

SUBMISSIONS_REPO = "your-org/benchmark-submissions"  # private
RESULTS_REPO = "your-org/benchmark-results"           # public or private

# Use your HF token (set as a Space secret)
HF_TOKEN = os.environ.get("HF_TOKEN")


def submit_prediction(model_name: str, predictions_file) -> str:
    """Upload a submission to the submissions dataset."""
    if not model_name or predictions_file is None:
        return "Please provide a model name and a predictions file."

    timestamp = datetime.now().isoformat()
    submission_id = f"{model_name}_{timestamp}".replace(" ", "_")

    # Depending on the Gradio version, gr.File yields a filepath string
    # or a tempfile-like object with a .name attribute
    local_path = predictions_file if isinstance(predictions_file, str) else predictions_file.name

    # Upload the predictions file
    API.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=f"predictions/{submission_id}.jsonl",
        repo_id=SUBMISSIONS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )

    # Upload a metadata record
    metadata = {
        "model_name": model_name,
        "submitted_by": "user",  # could use HF OAuth here
        "submission_time": timestamp,
        "predictions_file": f"predictions/{submission_id}.jsonl",
    }
    metadata_bytes = json.dumps(metadata).encode()
    API.upload_file(
        path_or_fileobj=metadata_bytes,
        path_in_repo=f"metadata/{submission_id}.json",
        repo_id=SUBMISSIONS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )

    return f"Submitted! Your submission ID is `{submission_id}`. Results will appear on the leaderboard once evaluation is complete."


def load_results() -> pd.DataFrame:
    """Load the latest results from the results dataset."""
    try:
        ds = load_dataset(RESULTS_REPO, token=HF_TOKEN)
        df = ds["train"].to_pandas()
        # Sort by score descending
        df = df.sort_values("overall_score", ascending=False).reset_index(drop=True)
        return df
    except Exception:
        # Fall back to an empty table if the results dataset is empty or unreadable
        return pd.DataFrame(columns=["model_name", "overall_score"])
```

### Space Secrets

In your Space's settings, add a `HF_TOKEN` secret — a Hugging Face token with write access to the submissions repo and read access to the results repo.

## 2. The Private Evaluator (Space)

This is the core of the system. The evaluator is a separate, private Space that:

1. Checks for unevaluated submissions (entries in the submissions dataset that are not yet in the results dataset)
2. Runs your evaluation logic on each new submission
3. Writes the results to the results dataset

The evaluator can run on a schedule (using a cron-like polling loop, as in the example below) or be triggered manually.

### `app.py` (Evaluator)

```python
import json
import os
import time

from huggingface_hub import HfApi, hf_hub_download, list_repo_files

API = HfApi()

SUBMISSIONS_REPO = "your-org/benchmark-submissions"
RESULTS_REPO = "your-org/benchmark-results"
HF_TOKEN = os.environ.get("HF_TOKEN")


def get_pending_submissions():
    """Find submissions that haven't been evaluated yet."""
    # List all submission metadata files
    submission_files = [
        f for f in list_repo_files(SUBMISSIONS_REPO, repo_type="dataset", token=HF_TOKEN)
        if f.startswith("metadata/") and f.endswith(".json")
    ]

    # List all result files (restricted to results/ so stray JSON files don't count)
    result_files = [
        f for f in list_repo_files(RESULTS_REPO, repo_type="dataset", token=HF_TOKEN)
        if f.startswith("results/") and f.endswith(".json")
    ]

    # Extract submission IDs from each
    submitted_ids = {f.replace("metadata/", "").replace(".json", "") for f in submission_files}
    evaluated_ids = {f.replace("results/", "").replace(".json", "") for f in result_files}

    pending_ids = submitted_ids - evaluated_ids
    return pending_ids


def evaluate_submission(submission_id: str):
    """Run evaluation for a single submission."""
    # Download the submission metadata
    metadata_path = hf_hub_download(
        repo_id=SUBMISSIONS_REPO,
        filename=f"metadata/{submission_id}.json",
        repo_type="dataset",
        token=HF_TOKEN,
    )
    with open(metadata_path) as f:
        metadata = json.load(f)

    # Download the predictions file
    predictions_path = hf_hub_download(
        repo_id=SUBMISSIONS_REPO,
        filename=metadata["predictions_file"],
        repo_type="dataset",
        token=HF_TOKEN,
    )

    # ---- Your evaluation logic goes here ----
    scores = run_evaluation(predictions_path)
    # ------------------------------------------

    # Build the result record
    result = {
        "model_name": metadata["model_name"],
        "submitted_by": metadata["submitted_by"],
        "submission_time": metadata["submission_time"],
        "overall_score": scores["overall"],
        "metric_a": scores["metric_a"],
        "metric_b": scores["metric_b"],
    }

    # Upload result
    result_bytes = json.dumps(result).encode()
    API.upload_file(
        path_or_fileobj=result_bytes,
        path_in_repo=f"results/{submission_id}.json",
        repo_id=RESULTS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )
    print(f"Evaluated {submission_id}: {scores}")


def run_evaluation(predictions_path: str) -> dict:
    """
    Replace this with your actual evaluation logic.
    Load your private test set, compare against predictions, compute metrics.
    """
    # Example placeholder:
    # gold = load_gold_labels("path/to/private/test_set.jsonl")
    # predictions = load_predictions(predictions_path)
    # accuracy = compute_accuracy(gold, predictions)
    return {"overall": 0.0, "metric_a": 0.0, "metric_b": 0.0}


# --- Main loop ---
if __name__ == "__main__":
    while True:
        pending = get_pending_submissions()
        if pending:
            print(f"Found {len(pending)} pending submissions")
            for submission_id in pending:
                try:
                    evaluate_submission(submission_id)
                except Exception as e:
                    print(f"Error evaluating {submission_id}: {e}")
        else:
            print("No pending submissions")
        time.sleep(300)  # Check every 5 minutes
```
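
One caveat about the main loop: a Space built on the Gradio SDK is expected to serve an app, so a bare `while True` script is a better fit for the Docker SDK. If you want to stay on Gradio, one approach is to run the polling loop in a background thread and serve a minimal status page. A sketch, reusing the functions defined above:

```python
import threading

import gradio as gr

def polling_loop():
    """Background worker: evaluate pending submissions every 5 minutes."""
    while True:
        for submission_id in get_pending_submissions():
            try:
                evaluate_submission(submission_id)
            except Exception as e:
                print(f"Error evaluating {submission_id}: {e}")
        time.sleep(300)

# Start the worker, then launch a minimal UI so the Space stays alive.
threading.Thread(target=polling_loop, daemon=True).start()

with gr.Blocks() as demo:
    gr.Markdown("Evaluator is running.")

demo.launch()
```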

## Where Does Your Test Set Live?

A few options:

- **Baked into the evaluator Space:** include your test set as a file in the private Space's repo. Since the Space is private, no one can see it.
- **In a private dataset:** store it in a separate private Hugging Face dataset and download it at evaluation time (see the sketch below).
- **Hardcoded in the evaluator code:** for small test sets, you could even embed the gold labels directly in the evaluation script.

The key point: because the evaluator Space is private, anything inside it is hidden from users.
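
For the private-dataset option, the evaluator fetches the gold labels the same way it fetches submissions. A sketch, where the repo name `your-org/benchmark-test-set` and the filename `test_set.jsonl` are placeholders:

```python
import json
import os

from huggingface_hub import hf_hub_download

TEST_SET_REPO = "your-org/benchmark-test-set"  # placeholder: your private dataset repo

def load_gold_labels() -> list[dict]:
    """Download the private test set and parse it as JSON lines."""
    path = hf_hub_download(
        repo_id=TEST_SET_REPO,
        filename="test_set.jsonl",             # placeholder filename
        repo_type="dataset",
        token=os.environ.get("HF_TOKEN"),      # needs read access to the private repo
    )
    with open(path) as f:
        return [json.loads(line) for line in f]
```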

## 3. The Submissions Dataset

Create a Hugging Face dataset (set it to private):

```bash
huggingface-cli repo create benchmark-submissions --type dataset --private
```
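
If you prefer doing this from Python, `HfApi.create_repo` is equivalent:

```python
import os

from huggingface_hub import HfApi

api = HfApi(token=os.environ.get("HF_TOKEN"))  # token needs repo-creation rights
api.create_repo("your-org/benchmark-submissions", repo_type="dataset", private=True)
```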

This repo will end up with a structure like:

```
benchmark-submissions/
├── metadata/
│   ├── model-a_2025-01-15T10:30:00.json
│   ├── model-b_2025-01-15T11:00:00.json
│   └── ...
└── predictions/
    ├── model-a_2025-01-15T10:30:00.jsonl
    ├── model-b_2025-01-15T11:00:00.jsonl
    └── ...
```

You don't need to pre-populate it — the leaderboard Space will create files as submissions come in.

## 4. The Results Dataset

Create another Hugging Face dataset (public or private, depending on whether you want results to be downloadable):

```bash
huggingface-cli repo create benchmark-results --type dataset
```

Structure:

```
benchmark-results/
└── results/
    ├── model-a_2025-01-15T10:30:00.json
    ├── model-b_2025-01-15T11:00:00.json
    └── ...
```

The evaluator writes here; the leaderboard reads from here.

## Putting It All Together

1. **Create the four repos:** two Spaces, two datasets
2. **Set permissions:** the evaluator Space and submissions dataset should be private; the leaderboard Space should be public
3. **Add your `HF_TOKEN`** as a secret to both Spaces (with write access to the repos each one writes to)
4. **Deploy the leaderboard Space** with `app.py`, `about.py`, `utils.py`, and a `requirements.txt`
5. **Deploy the evaluator Space** with your evaluation code and your private test set
6. **Test the flow:** submit a prediction through the leaderboard, wait for the evaluator to pick it up, and verify the result appears on the leaderboard

## Tips

- **Schema consistency matters.** If you're using `datasets.load_dataset()` to read your results, every file needs the same fields. Decide on your schema before you start accepting submissions.
- **Error handling in the evaluator.** Submissions will sometimes be malformed. Wrap evaluation in `try`/`except` and consider writing a "failed" status to the results dataset so the submitter gets feedback (see the first sketch after this list).
- **Rate limiting.** If you expect high volume, add rate limiting to the submission endpoint. You don't want someone spamming submissions and burning your evaluator's compute.
- **OAuth for attribution.** Gradio supports Hugging Face OAuth, which lets you identify who is submitting. This is useful for preventing duplicate submissions and for attribution on the leaderboard (see the second sketch after this list).
- **Caching.** The leaderboard doesn't need to hit the results dataset on every page load. Cache the results dataframe and refresh it periodically or on button click.
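
For the error-handling tip, one option is to write an explicit failure record to the results dataset. The snippet below reuses the evaluator's constants; the `status` and `error` fields are illustrative, so if you adopt them, bake them into your results schema from the start:

```python
import json

def report_failure(submission_id: str, metadata: dict, error: Exception) -> None:
    """Record a failed evaluation so the submitter gets feedback."""
    result = {
        "model_name": metadata["model_name"],
        "submitted_by": metadata["submitted_by"],
        "submission_time": metadata["submission_time"],
        "status": "failed",      # illustrative field: include it in your schema upfront
        "error": str(error),     # illustrative field
        "overall_score": None,
    }
    API.upload_file(
        path_or_fileobj=json.dumps(result).encode(),
        path_in_repo=f"results/{submission_id}.json",
        repo_id=RESULTS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )
```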
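
For OAuth attribution, enable `hf_oauth: true` in the Space's README metadata; Gradio then injects the logged-in user's profile into any handler that declares a `gr.OAuthProfile`-typed parameter (you don't pass it as an input). A minimal sketch:

```python
import gradio as gr

def submit_with_user(model_name: str, predictions_file, profile: gr.OAuthProfile | None) -> str:
    """Gradio fills `profile` automatically; it is None when the user is logged out."""
    if profile is None:
        return "Please log in before submitting."
    # Use profile.username as `submitted_by` in the submission metadata
    return f"Submission received from {profile.username}."

with gr.Blocks() as demo:
    gr.LoginButton()
    model_name = gr.Textbox(label="Model Name")
    predictions_file = gr.File(label="Predictions File")
    status = gr.Markdown()
    gr.Button("Submit").click(submit_with_user, [model_name, predictions_file], status)

demo.launch()
```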
