# How to Build a Benchmark with a Private Test Set on Hugging Face
## The Architecture
You need four things:
- A public leaderboard — a Hugging Face Space (Gradio) where users submit predictions and view results
- A private evaluator — a Hugging Face Space that scores submissions against your test set
- A submissions dataset — a Hugging Face dataset that records incoming submissions
- A results dataset — a Hugging Face dataset that stores evaluation results
Here's how they connect:
```
User submits via Leaderboard (Space)
                │
                ▼
Submissions Dataset (HF Dataset)
                │
                ▼
Evaluator (Private Space) reads submissions, scores them
                │
                ▼
Results Dataset (HF Dataset)
                │
                ▼
Leaderboard reads results and displays scores
```
The leaderboard writes to the submissions dataset and reads from the results dataset. The evaluator reads from the submissions dataset and writes to the results dataset. This separation keeps your evaluation logic and test set private while giving users a clean public interface.
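Since the two Spaces are separate repos, one simple way to keep the four repo IDs consistent is a small constants snippet duplicated in both Spaces (the org and repo names below are placeholders — substitute your own):

```python
# config.py -- shared constants, duplicated in the leaderboard and evaluator Spaces.
# All four repo IDs are placeholders.
LEADERBOARD_SPACE = "your-org/benchmark-leaderboard"  # public Gradio Space
EVALUATOR_SPACE = "your-org/benchmark-evaluator"      # private Space
SUBMISSIONS_REPO = "your-org/benchmark-submissions"   # private dataset
RESULTS_REPO = "your-org/benchmark-results"           # dataset (public or private)
```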
## Before You Start: Plan Your Schema
This is worth calling out early: Hugging Face datasets require every file in the same config to have the same schema. If you push a parquet file with columns [model_name, score] and then later push one with [model_name, accuracy, f1], the dataset loader will break.
Decide upfront what fields you want in your submissions dataset and your results dataset. You can change them later, but it means going back and rewriting existing files, so it's easier to get it right the first time.
For example, your submissions schema might look like:
```python
# Each submission file contains:
{
    "model_name": str,
    "submitted_by": str,
    "submission_time": str,    # ISO timestamp
    "predictions_file": str,   # path or reference to the predictions
}
```
And your results schema might look like:
```python
# Each result file contains:
{
    "model_name": str,
    "submitted_by": str,
    "submission_time": str,
    "overall_score": float,
    "metric_a": float,
    "metric_b": float,
}
```
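Because schema drift is the most common way these pipelines break, it can be worth validating every record against the fixed schema before uploading it. A minimal sketch, assuming the example results schema above (`REQUIRED_FIELDS` is not part of any library — it's just a dict you maintain):

```python
# Validate a result record against the fixed results schema before uploading.
# REQUIRED_FIELDS mirrors the example schema above; adapt it to your own fields.
REQUIRED_FIELDS = {
    "model_name": str,
    "submitted_by": str,
    "submission_time": str,
    "overall_score": float,
    "metric_a": float,
    "metric_b": float,
}

def validate_result(record: dict) -> list[str]:
    """Return a list of schema problems (an empty list means the record is valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # Extra fields break schema consistency just as badly as missing ones
    for field in record:
        if field not in REQUIRED_FIELDS:
            problems.append(f"unexpected field: {field}")
    return problems
```

Calling this in the evaluator right before the result upload gives you a clear error instead of a silently broken dataset later.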
## 1. The Public Leaderboard (Gradio Space)
Create a Hugging Face Space with the Gradio SDK. To keep things organized, split your code across three files:
- `app.py` — the main Gradio app
- `about.py` — text and markdown content for your leaderboard's informational tabs
- `utils.py` — helper functions for reading/writing datasets
### `app.py`
This is the entry point. It builds the Gradio interface with (at minimum) two tabs: one to view the leaderboard and one to submit.
```python
import gradio as gr

from about import TITLE, INTRODUCTION_TEXT, CITATION_BUTTON_TEXT
from utils import load_results, submit_prediction


def refresh_leaderboard():
    """Pull the latest results from the results dataset."""
    df = load_results()
    return df


with gr.Blocks() as demo:
    gr.Markdown(TITLE)
    gr.Markdown(INTRODUCTION_TEXT)

    with gr.Tab("Leaderboard"):
        # A callable value is evaluated when the app loads
        leaderboard_df = gr.Dataframe(value=refresh_leaderboard)
        refresh_btn = gr.Button("Refresh")
        refresh_btn.click(fn=refresh_leaderboard, outputs=leaderboard_df)

    with gr.Tab("Submit"):
        model_name = gr.Textbox(label="Model Name")
        predictions_file = gr.File(label="Predictions File")
        submit_btn = gr.Button("Submit")
        submission_status = gr.Markdown()
        submit_btn.click(
            fn=submit_prediction,
            inputs=[model_name, predictions_file],
            outputs=submission_status,
        )

    with gr.Tab("About"):
        gr.Markdown(CITATION_BUTTON_TEXT)

demo.launch()
```
### `about.py`
Keep your leaderboard's descriptive content here. This keeps app.py clean.
````python
TITLE = "# My Benchmark Leaderboard"

INTRODUCTION_TEXT = """
Welcome to the leaderboard for My Benchmark.
Submit your model's predictions to see how it ranks.
"""

CITATION_BUTTON_TEXT = """
## Citation
If you use this benchmark, please cite:
```bibtex
@misc{mybenchmark2025,
    title={My Benchmark},
    author={Your Name},
    year={2025},
}
```
"""
````
### `utils.py`
This is where the actual dataset interaction happens. The leaderboard needs to:
- **Write** submissions to the submissions dataset
- **Read** results from the results dataset
```python
import json
import os
from datetime import datetime

import pandas as pd
from datasets import load_dataset
from huggingface_hub import HfApi

API = HfApi()
SUBMISSIONS_REPO = "your-org/benchmark-submissions"  # private
RESULTS_REPO = "your-org/benchmark-results"          # public or private

# Use your HF token (set as a Space secret)
HF_TOKEN = os.environ.get("HF_TOKEN")


def submit_prediction(model_name: str, predictions_file) -> str:
    """Upload a submission to the submissions dataset."""
    if not model_name or predictions_file is None:
        return "Please provide both a model name and a predictions file."

    timestamp = datetime.now().isoformat()
    submission_id = f"{model_name}_{timestamp}".replace(" ", "_")

    # Upload the predictions file
    API.upload_file(
        path_or_fileobj=predictions_file.name,
        path_in_repo=f"predictions/{submission_id}.jsonl",
        repo_id=SUBMISSIONS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )

    # Upload a metadata record
    metadata = {
        "model_name": model_name,
        "submitted_by": "user",  # could use HF OAuth here
        "submission_time": timestamp,
        "predictions_file": f"predictions/{submission_id}.jsonl",
    }
    API.upload_file(
        path_or_fileobj=json.dumps(metadata).encode(),
        path_in_repo=f"metadata/{submission_id}.json",
        repo_id=SUBMISSIONS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )

    return (
        f"Submitted! Your submission ID is `{submission_id}`. "
        "Results will appear on the leaderboard once evaluation is complete."
    )


def load_results() -> pd.DataFrame:
    """Load the latest results from the results dataset."""
    try:
        ds = load_dataset(RESULTS_REPO, token=HF_TOKEN)
        df = ds["train"].to_pandas()
        # Sort by score descending
        return df.sort_values("overall_score", ascending=False).reset_index(drop=True)
    except Exception:
        # No results yet (or the dataset is unreadable): show an empty table
        return pd.DataFrame(columns=["model_name", "overall_score"])
```
### Space Secrets

In your Space's settings, add a `HF_TOKEN` secret — a Hugging Face token with write access to the submissions repo and read access to the results repo.
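A missing or expired secret is easiest to catch at startup rather than on the first submission. A small sketch of a fail-fast check (the error message wording is ours; `whoami()` is the real `huggingface_hub` call for verifying a token):

```python
import os

def check_token() -> str:
    """Fail fast at startup if the HF_TOKEN secret is missing or invalid."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set -- add it as a secret in the Space settings.")
    # Deferred import so this snippet can be imported without huggingface_hub installed
    from huggingface_hub import HfApi
    # whoami() raises if the token is invalid or expired
    return HfApi(token=token).whoami()["name"]
```

Calling `check_token()` once at the top of `app.py` turns a confusing mid-run 401 into an immediate, readable error in the Space logs.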
## 2. The Private Evaluator (Space)
This is the core of the system. The evaluator is a separate, private Space that:
- Checks for unevaluated submissions (things in the submissions dataset that are not yet in the results dataset)
- Runs your evaluation logic on each new submission
- Writes the results to the results dataset
The evaluator can run on a schedule (using a cron-like loop or Hugging Face's Space scheduler) or be triggered manually.
### `app.py` (Evaluator)
```python
import json
import os
import time

from huggingface_hub import HfApi, hf_hub_download, list_repo_files

API = HfApi()
SUBMISSIONS_REPO = "your-org/benchmark-submissions"
RESULTS_REPO = "your-org/benchmark-results"
HF_TOKEN = os.environ.get("HF_TOKEN")


def get_pending_submissions():
    """Find submissions that haven't been evaluated yet."""
    # List all submission metadata files
    submission_files = [
        f for f in list_repo_files(SUBMISSIONS_REPO, repo_type="dataset", token=HF_TOKEN)
        if f.startswith("metadata/") and f.endswith(".json")
    ]
    # List all result files (filter on the results/ prefix so README.md etc. are ignored)
    result_files = [
        f for f in list_repo_files(RESULTS_REPO, repo_type="dataset", token=HF_TOKEN)
        if f.startswith("results/") and f.endswith(".json")
    ]
    # Extract submission IDs from each
    submitted_ids = {f.removeprefix("metadata/").removesuffix(".json") for f in submission_files}
    evaluated_ids = {f.removeprefix("results/").removesuffix(".json") for f in result_files}
    return submitted_ids - evaluated_ids


def evaluate_submission(submission_id: str):
    """Run evaluation for a single submission."""
    # Download the submission metadata
    metadata_path = hf_hub_download(
        repo_id=SUBMISSIONS_REPO,
        filename=f"metadata/{submission_id}.json",
        repo_type="dataset",
        token=HF_TOKEN,
    )
    with open(metadata_path) as f:
        metadata = json.load(f)

    # Download the predictions file
    predictions_path = hf_hub_download(
        repo_id=SUBMISSIONS_REPO,
        filename=metadata["predictions_file"],
        repo_type="dataset",
        token=HF_TOKEN,
    )

    # ---- Your evaluation logic goes here ----
    scores = run_evaluation(predictions_path)
    # -----------------------------------------

    # Build the result record (must match the results schema)
    result = {
        "model_name": metadata["model_name"],
        "submitted_by": metadata["submitted_by"],
        "submission_time": metadata["submission_time"],
        "overall_score": scores["overall"],
        "metric_a": scores["metric_a"],
        "metric_b": scores["metric_b"],
    }

    # Upload result
    API.upload_file(
        path_or_fileobj=json.dumps(result).encode(),
        path_in_repo=f"results/{submission_id}.json",
        repo_id=RESULTS_REPO,
        repo_type="dataset",
        token=HF_TOKEN,
    )
    print(f"Evaluated {submission_id}: {scores}")


def run_evaluation(predictions_path: str) -> dict:
    """
    Replace this with your actual evaluation logic.
    Load your private test set, compare against predictions, compute metrics.
    """
    # Example placeholder:
    # gold = load_gold_labels("path/to/private/test_set.jsonl")
    # predictions = load_predictions(predictions_path)
    # accuracy = compute_accuracy(gold, predictions)
    return {"overall": 0.0, "metric_a": 0.0, "metric_b": 0.0}


# --- Main loop ---
if __name__ == "__main__":
    while True:
        pending = get_pending_submissions()
        if pending:
            print(f"Found {len(pending)} pending submissions")
            for submission_id in pending:
                try:
                    evaluate_submission(submission_id)
                except Exception as e:
                    print(f"Error evaluating {submission_id}: {e}")
        else:
            print("No pending submissions")
        time.sleep(300)  # Check every 5 minutes
```
### Where Does Your Test Set Live?
A few options:
- Baked into the evaluator Space: Include your test set as a file in the private Space's repo. Since the Space is private, no one can see it.
- In a private dataset: Store it in a separate private Hugging Face dataset and download it at evaluation time.
- Hardcoded in the evaluator code: For small test sets, you could even embed the gold labels directly in the evaluation script.
The key point: because the evaluator Space is private, anything inside it is hidden from users.
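A sketch of the "private dataset" option, using `hf_hub_download` to fetch the gold labels at evaluation time. `GOLD_REPO` and the `test_set.jsonl` filename are placeholders, not references to real repos:

```python
# Fetch the private test set at evaluation time (option 2 above).
# GOLD_REPO and "test_set.jsonl" are placeholder names.
import json
import os

GOLD_REPO = "your-org/benchmark-private-test-set"

def parse_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def load_gold_labels() -> list[dict]:
    from huggingface_hub import hf_hub_download  # deferred import
    path = hf_hub_download(
        repo_id=GOLD_REPO,
        filename="test_set.jsonl",
        repo_type="dataset",
        token=os.environ.get("HF_TOKEN"),
    )
    return parse_jsonl(path)
```

The evaluator's token then also needs read access to the gold-label repo.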
## 3. The Submissions Dataset
Create a Hugging Face dataset (set it to private):
```bash
huggingface-cli repo create benchmark-submissions --type dataset --private
```
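If you prefer doing this from Python, `huggingface_hub`'s `create_repo` is the equivalent call (a one-off setup snippet; the repo ID is a placeholder):

```python
from huggingface_hub import create_repo

create_repo(
    "your-org/benchmark-submissions",  # placeholder repo ID
    repo_type="dataset",
    private=True,
    exist_ok=True,  # don't fail if the repo already exists
)
```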
This repo will end up with a structure like:
```
benchmark-submissions/
├── metadata/
│   ├── model-a_2025-01-15T10:30:00.json
│   ├── model-b_2025-01-15T11:00:00.json
│   └── ...
└── predictions/
    ├── model-a_2025-01-15T10:30:00.jsonl
    ├── model-b_2025-01-15T11:00:00.jsonl
    └── ...
```
You don't need to pre-populate it — the leaderboard Space will create files as submissions come in.
## 4. The Results Dataset
Create another Hugging Face dataset (public or private, depending on whether you want results to be downloadable):
```bash
huggingface-cli repo create benchmark-results --type dataset
```
Structure:
```
benchmark-results/
└── results/
    ├── model-a_2025-01-15T10:30:00.json
    ├── model-b_2025-01-15T11:00:00.json
    └── ...
```
The evaluator writes here; the leaderboard reads from here.
## Putting It All Together
- Create the four repos: two Spaces, two datasets
- Set permissions: the evaluator Space and submissions dataset should be private; the leaderboard Space should be public
- Add your `HF_TOKEN` as a secret to both Spaces (it needs write access to the repos each Space writes to)
- Deploy the leaderboard Space with `app.py`, `about.py`, `utils.py`, and a `requirements.txt`
- Deploy the evaluator Space with your evaluation code and your private test set
- Test the flow: submit a prediction through the leaderboard, wait for the evaluator to pick it up, and verify the result appears on the leaderboard
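The last step — waiting for the evaluator to pick a submission up — can be scripted rather than checked by hand. A polling sketch (the repo ID and timing constants are placeholders):

```python
# Poll the results dataset until a submission's result file appears.
# The repo ID and timing defaults are placeholders.
import os
import time

def has_result(files: list[str], submission_id: str) -> bool:
    """True once the evaluator has written results/<submission_id>.json."""
    return f"results/{submission_id}.json" in files

def wait_for_result(submission_id: str, timeout_s: int = 1800, poll_s: int = 60) -> bool:
    from huggingface_hub import list_repo_files  # deferred import
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        files = list_repo_files("your-org/benchmark-results",
                                repo_type="dataset",
                                token=os.environ.get("HF_TOKEN"))
        if has_result(files, submission_id):
            return True
        time.sleep(poll_s)
    return False
```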
## Tips
- **Schema consistency matters.** If you're using `datasets.load_dataset()` to read your results, every file needs the same fields. Decide on your schema before you start accepting submissions.
- **Error handling in the evaluator.** Submissions will sometimes be malformed. Wrap evaluation in try/except and consider writing a "failed" status to the results dataset so the submitter gets feedback.
- **Rate limiting.** If you expect high volume, add rate limiting to the submission endpoint. You don't want someone spamming submissions and burning your evaluator's compute.
- **OAuth for attribution.** Gradio supports Hugging Face OAuth, which lets you identify who is submitting. This is useful for preventing duplicate submissions and for attribution on the leaderboard.
- **Caching.** The leaderboard doesn't need to hit the results dataset on every page load. Cache the results dataframe and refresh it periodically or on button click.
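The caching tip can be implemented with a small time-based wrapper around `load_results` — a minimal sketch, where the 300-second TTL is an arbitrary choice:

```python
# A simple TTL cache so the leaderboard doesn't re-download results on every
# page load. CACHE_TTL_S is an arbitrary choice; tune it to your traffic.
import time

CACHE_TTL_S = 300
_cache = {"df": None, "fetched_at": 0.0}

def cached(fetch):
    """Wrap a zero-argument fetch function with a TTL cache."""
    def wrapper(force: bool = False):
        now = time.time()
        if force or _cache["df"] is None or now - _cache["fetched_at"] > CACHE_TTL_S:
            _cache["df"] = fetch()
            _cache["fetched_at"] = now
        return _cache["df"]
    return wrapper

# Usage: cached_load = cached(load_results); the Refresh button can call
# cached_load(force=True) to bypass the cache.
```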
