tomoro-evals
How to run the project
Create a virtual environment and activate it
uv venv
source .venv/bin/activate
uv pip install -e .
Database Setup (Required for ETL Pipeline)
Before running the ETL pipeline (main() function) in the notebooks, you need to set up PostgreSQL:
Prerequisites
Install PostgreSQL (if not already installed):
brew install postgresql@14
Start PostgreSQL service:
brew services start postgresql@14
Create database and user:
psql -d postgres -c "CREATE DATABASE cii;"
psql -d postgres -c "CREATE USER app WITH PASSWORD 'password';"
psql -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE cii TO app;"
Create database tables:
psql -d cii -U app -h localhost -f cronjob/customer_transactions/schema.sql
Verify setup:
psql -d cii -U app -h localhost -c "\dt"
The ETL pipeline expects:
- Database name: cii
- Username: app
- Password: password
- Host: localhost
- Port: 5432
These settings are configured in the DSN variable in the notebook.
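For reference, a minimal connection check against these settings might look like the following sketch (assuming the psycopg2 driver; the notebook's actual DSN variable may be named differently):
import psycopg2

# Mirrors the settings above; psycopg2 accepts a libpq-style DSN string.
DSN = "dbname=cii user=app password=password host=localhost port=5432"

conn = psycopg2.connect(DSN)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())  # (1,) confirms the database is reachable
conn.close()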
Alternative: Run PostgreSQL with Docker (Recommended for Isolation)
If you prefer not to install PostgreSQL locally, you can run it in a Docker container that auto-loads the schema.
1. Start a fresh container
docker rm -f pg-cii 2>/dev/null || true
docker volume rm pgdata 2>/dev/null || true
docker run -d \
--name pg-cii \
-e POSTGRES_USER=app \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=cii \
-p 5432:5432 \
-v pgdata:/var/lib/postgresql/data \
-v $(pwd)/cronjob/customer_transactions/schema.sql:/docker-entrypoint-initdb.d/001-schema.sql:ro \
postgres:16
The schema.sql file is executed only the first time the named volume pgdata is initialized.
2. Check container & logs
docker ps --filter name=pg-cii
docker logs pg-cii | tail -n 30
3. Inspect tables
docker exec -it pg-cii psql -U app -d cii -c "\dt"
4. Set DSN (current shell)
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
If using a notebook:
import os
os.environ["CII_PG_DSN"] = "dbname=cii user=app password=password host=localhost port=5432"
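A quick way to confirm from Python that the containerized database is reachable and the schema was loaded (a sketch, again assuming psycopg2):
import os
import psycopg2

conn = psycopg2.connect(os.environ["CII_PG_DSN"])
with conn.cursor() as cur:
    # Equivalent to \dt: list the tables created by schema.sql
    cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public';")
    print(cur.fetchall())
conn.close()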
5. Rebuild after changing schema.sql
docker rm -f pg-cii && docker volume rm pgdata && \
docker run -d --name pg-cii \
-e POSTGRES_USER=app -e POSTGRES_PASSWORD=password -e POSTGRES_DB=cii \
-p 5432:5432 \
-v pgdata:/var/lib/postgresql/data \
-v $(pwd)/cronjob/customer_transactions/schema.sql:/docker-entrypoint-initdb.d/001-schema.sql:ro \
postgres:16
6. Stop / start later
docker stop pg-cii
docker start pg-cii
7. Apply schema manually (if needed on an existing container)
cat cronjob/customer_transactions/schema.sql | docker exec -i pg-cii psql -U app -d cii
8. Simple backup / restore
# Backup
docker exec -t pg-cii pg_dump -U app -d cii > backup.sql
# Restore (fresh volume)
docker rm -f pg-cii && docker volume rm pgdata
# start container again (see step 1, omit schema bind if restoring)
cat backup.sql | docker exec -i pg-cii psql -U app -d cii
Using uv with Docker Postgres
All Python commands can run inside the uv-managed environment while PostgreSQL runs in Docker.
uv sync # install dependencies
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
uv run python cronjob/customer_transactions/agent_run.py
Add a script alias in pyproject.toml (optional):
[tool.uv.scripts]
agent = "python cronjob/customer_transactions/agent_run.py"
Then:
uv run agent
Usage
The commands below assume an activated virtual environment. If you have not activated an environment and are using uv, prefix the commands with uv run.
Online Usage
Run the reranking evaluation for Langfuse traces:
uv run langfuse_trace_evaluation.py
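The script encapsulates the evaluation logic; as a rough sketch of the kind of call involved, fetching recent traces with the Langfuse Python SDK might look like this (assumes a v2-style client and the standard LANGFUSE_* credential environment variables; the actual script may differ):
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment.
client = Langfuse()

# Fetch a page of recent traces to evaluate (v2 SDK method).
traces = client.fetch_traces(limit=10).data
for trace in traces:
    print(trace.id, trace.name)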
Offline Usage
Evals Hub can also be used offline, for development purposes or as part of a CI/CD pipeline. Use the evals-hub CLI tool to run benchmarks offline; the main entry point is the run-benchmark command.
View options for the evals-hub command
evals-hub --help
Usage: evals-hub COMMAND
╭─ Commands ───────────────────────────────────────────────╮
│ run-benchmark                                             │
│ --help -h  Display this message and exit.                 │
│ --version  Display application version.                   │
╰───────────────────────────────────────────────────────────╯
╭─ Parameters ─────────────────────────────────────────────╮
│ *  --config  [required]                                   │
╰───────────────────────────────────────────────────────────╯
View options for the evals-hub run-benchmark command
evals-hub run-benchmark --help
Usage: evals-hub run-benchmark [ARGS] [OPTIONS]
╭─ Parameters ─────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  TASK-NAME                     --task-name                     [choices: retrieval, reranking, classification, nli] │
│                                                                  [required]                                           │
│ *  DATASET.NAME                  --dataset.name                  [required]                                           │
│    DATASET.SPLIT                 --dataset.split                                                                      │
│    DATASET.HF-SUBSET             --dataset.hf-subset                                                                  │
│ *  MODEL.CHECKPOINT              --model.checkpoint              [required]                                           │
│    METRICS.MAP                   --metrics.map                   Identifier for MAP metric                            │
│    METRICS.MRR                   --metrics.mrr                   Identifier for MRR metric                            │
│    METRICS.NDCG                  --metrics.ndcg                  Identifier for NDCG metric                           │
│    METRICS.RECALL                --metrics.recall                Identifier for Recall metric                         │
│    METRICS.PRECISION             --metrics.precision             Identifier for Precision metric                      │
│    METRICS.MICRO-AVG-F1          --metrics.micro-avg-f1          Identifier for micro average F1 metric               │
│    METRICS.MACRO-AVG-F1          --metrics.macro-avg-f1          Identifier for macro average F1 metric               │
│    METRICS.ACCURACY              --metrics.accuracy              Identifier for accuracy metric                       │
│    EVALUATION.TOP-K              --evaluation.top-k              [default: 10]                                        │
│    EVALUATION.BATCH-SIZE         --evaluation.batch-size         [default: 16]                                        │
│    EVALUATION.SEED               --evaluation.seed               [default: 42]                                        │
│    EVALUATION.MAX-LENGTH         --evaluation.max-length                                                              │
│    EVALUATION.SAMPLES-PER-LABEL  --evaluation.samples-per-label                                                       │
│    EVALUATION.N-EXPERIMENTS      --evaluation.n-experiments      [default: 10]                                        │
│ *  OUTPUT.RESULTS-FILE           --output.results-file           [required]                                           │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Benchmarks can be run in a few different ways:
- options defined in a YAML config file
- options directly from the command line
- options defined in a YAML config file which are overridden by the command line
Benchmark configured entirely from a YAML file
Run reranking:
evals-hub run-benchmark --config reranking_config.yaml
Run NLI:
evals-hub run-benchmark --config nli_config.yaml
Run classification:
evals-hub run-benchmark --config classification_config.yaml
Run patent landscape evaluation:
evals-hub run-benchmark --config pl_eval_config.yaml
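The YAML schema is not documented here, but the keys plausibly mirror the dotted run-benchmark parameters above. A minimal sketch that generates such a config from Python, assuming PyYAML and using illustrative dataset and checkpoint names:
import yaml

# Keys mirror the run-benchmark CLI flags; the exact schema is an assumption.
config = {
    "task-name": "reranking",
    "dataset": {"name": "mteb/scidocs-reranking", "split": "test"},  # illustrative dataset
    "model": {"checkpoint": "cross-encoder/ms-marco-MiniLM-L-6-v2"},  # illustrative model
    "evaluation": {"top-k": 10, "batch-size": 16, "seed": 42},
    "output": {"results-file": "results/reranking.json"},
}

with open("reranking_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
Because command-line options override the config file, appending a flag such as --evaluation.top-k 20 to the run-benchmark invocation takes precedence over the value in the file.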
Troubleshooting SSL errors
SSL errors when connecting to a Hugging Face dataset
Set this environment variable for the Python requests library in .env:
REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
SSL certificates may need to be imported if you have not done so before.
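To confirm that requests picks up the bundle, a small check such as the following sketch can help (assumes the requests library is installed):
import os
import requests

bundle = os.environ.get("REQUESTS_CA_BUNDLE")
print("CA bundle:", bundle, "->", "found" if bundle and os.path.exists(bundle) else "missing")

# With a valid bundle this should succeed without an SSLError.
resp = requests.get("https://huggingface.co", timeout=10)
print(resp.status_code)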
Development Setup
Install the git hook scripts
pre-commit install
Run tests
uv run pytest -v
SQL Code Quality
Lint all SQL files in a directory:
uv run sqlfluff lint --dialect postgres cronjob/
Format/fix SQL files:
uv run sqlfluff format --dialect postgres cronjob/
Serve documentation locally
uv run mkdocs serve -f docs/mkdocs.yml
Then open http://127.0.0.1:8000/ in your browser.
Refresh & upgrade the lockfile
uv sync --upgrade
Integration tests
By default, integration tests are ignored in the pytest configuration because evaluation runs take a long time and require GPU resources. However, it is sometimes useful to run them to verify that results are correct against public benchmarks:
uv run pytest tests/integration
To run pre-commit hooks locally
source .venv/bin/activate
pre-commit run --all-files
Show outdated packages
uv tree --outdated --depth 1