# tomoro-evals

[Documentation](https://tomoro-ai.github.io/tomoro-evals/)
# How to run the project

## Create virtual environment and activate it

```bash
uv venv
source .venv/bin/activate
uv pip install -e .
```

## Database Setup (Required for ETL Pipeline)

Before running the ETL pipeline (the `main()` function) in the notebooks, you need to set up PostgreSQL:
### Prerequisites

1. **Install PostgreSQL** (if not already installed):

   ```bash
   brew install postgresql@14
   ```

2. **Start the PostgreSQL service**:

   ```bash
   brew services start postgresql@14
   ```

3. **Create the database and user**:

   ```bash
   psql -d postgres -c "CREATE DATABASE cii;"
   psql -d postgres -c "CREATE USER app WITH PASSWORD 'password';"
   psql -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE cii TO app;"
   ```

4. **Create the database tables**:

   ```bash
   psql -d cii -U app -h localhost -f cronjob/customer_transactions/schema.sql
   ```

5. **Verify the setup**:

   ```bash
   psql -d cii -U app -h localhost -c "\dt"
   ```
The ETL pipeline expects:

- Database name: `cii`
- Username: `app`
- Password: `password`
- Host: `localhost`
- Port: `5432`

These settings are configured in the `DSN` variable in the notebook.
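For reference, those settings combine into a libpq-style keyword/value connection string. A minimal sketch of what the notebook's `DSN` variable is expected to look like:

```python
# Minimal sketch: a libpq keyword/value connection string built from the
# settings listed above (the notebook defines the actual DSN variable)
DSN = "dbname=cii user=app password=password host=localhost port=5432"
```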
### Alternative: Run PostgreSQL with Docker (Recommended for Isolation)

If you prefer not to install PostgreSQL locally, you can run it in a Docker container that auto-loads the schema.

#### 1. Start a fresh container

```bash
docker rm -f pg-cii 2>/dev/null || true
docker volume rm pgdata 2>/dev/null || true

docker run -d \
  --name pg-cii \
  -e POSTGRES_USER=app \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=cii \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  -v "$(pwd)/cronjob/customer_transactions/schema.sql":/docker-entrypoint-initdb.d/001-schema.sql:ro \
  postgres:16
```

The `schema.sql` file is executed only the first time the named volume `pgdata` is initialized.
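If you want to block until the server inside the container is ready before connecting (useful in scripts), a small poll works; `pg_isready` ships with the Postgres image:

```bash
# Poll until the Postgres server inside the container accepts connections
until docker exec pg-cii pg_isready -U app -d cii; do
  sleep 1
done
```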
#### 2. Check container & logs

```bash
docker ps --filter name=pg-cii
docker logs pg-cii | tail -n 30
```

#### 3. Inspect tables

```bash
docker exec -it pg-cii psql -U app -d cii -c "\dt"
```

#### 4. Set DSN (current shell)

```bash
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
```

If using a notebook:

```python
import os

os.environ["CII_PG_DSN"] = "dbname=cii user=app password=password host=localhost port=5432"
```
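To confirm the connection works end to end, a quick smoke test along these lines can help (this assumes `psycopg2` is available in the environment; swap in whichever libpq-compatible driver the project actually uses):

```python
import os

import psycopg2  # assumption: any libpq-compatible driver accepts this DSN

dsn = os.environ["CII_PG_DSN"]
with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
    cur.execute("SELECT 1;")
    print(cur.fetchone())  # expected: (1,)
```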
#### 5. Rebuild after changing `schema.sql`

```bash
docker rm -f pg-cii && docker volume rm pgdata && \
docker run -d --name pg-cii \
  -e POSTGRES_USER=app -e POSTGRES_PASSWORD=password -e POSTGRES_DB=cii \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  -v "$(pwd)/cronjob/customer_transactions/schema.sql":/docker-entrypoint-initdb.d/001-schema.sql:ro \
  postgres:16
```
#### 6. Stop / start later

```bash
docker stop pg-cii
docker start pg-cii
```

#### 7. Apply schema manually (if needed on an existing container)

```bash
cat cronjob/customer_transactions/schema.sql | docker exec -i pg-cii psql -U app -d cii
```
#### 8. Simple backup / restore

```bash
# Backup
docker exec -t pg-cii pg_dump -U app -d cii > backup.sql

# Restore (fresh volume)
docker rm -f pg-cii && docker volume rm pgdata
# Start the container again (see step 1; omit the schema bind mount if restoring)
cat backup.sql | docker exec -i pg-cii psql -U app -d cii
```
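If the dump gets large, the same pattern works compressed:

```bash
# Compressed backup
docker exec -t pg-cii pg_dump -U app -d cii | gzip > backup.sql.gz

# Compressed restore (into a running, empty database)
gunzip -c backup.sql.gz | docker exec -i pg-cii psql -U app -d cii
```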
### Using uv with Docker Postgres

All Python commands can run inside the uv-managed environment while PostgreSQL runs in Docker.

```bash
uv sync  # install dependencies
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
uv run python cronjob/customer_transactions/agent_run.py
```
Optionally, add a console-script entry point in `pyproject.toml` so the agent can be launched by name (this assumes `agent_run.py` is importable as a module and exposes a `main()` function; adjust the module path to match the actual package layout):

```toml
[project.scripts]
agent = "cronjob.customer_transactions.agent_run:main"
```

Then:

```bash
uv run agent
```
# Usage

The commands below assume an activated virtual environment. If you haven't activated your environment and you are using `uv`, prefix the commands with `uv run`.
## Online Usage

### run reranking evaluation for Langfuse traces

```bash
uv run langfuse_trace_evaluation.py
```
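The evaluation presumably authenticates to Langfuse via the SDK's standard environment variables; this is an assumption, so check the script for the configuration it actually reads:

```bash
# Standard Langfuse SDK environment variables (values are placeholders)
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com"
```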
## Offline Usage

Evals Hub can be used offline for development purposes or as part of a CI/CD pipeline. Use the `evals-hub` CLI tool to run benchmarks offline; the main entry point is the `run-benchmark` command.

<details>
<summary><b>View options for the evals-hub command</b></summary>

```bash
evals-hub --help
```
```
Usage: evals-hub COMMAND

Commands:
  run-benchmark
  --help -h   Display this message and exit.
  --version   Display application version.

Parameters:
  * --config  [required]
```
</details>

<details>
<summary><b>View options for the evals-hub run-benchmark command</b></summary>

```bash
evals-hub run-benchmark --help
```

```
Usage: evals-hub run-benchmark [ARGS] [OPTIONS]

Parameters:
  * TASK-NAME --task-name  [choices: retrieval, reranking, classification, nli] [required]
  * DATASET.NAME --dataset.name  [required]
    DATASET.SPLIT --dataset.split
    DATASET.HF-SUBSET --dataset.hf-subset
  * MODEL.CHECKPOINT --model.checkpoint  [required]
    METRICS.MAP --metrics.map  Identifier for MAP metric
    METRICS.MRR --metrics.mrr  Identifier for MRR metric
    METRICS.NDCG --metrics.ndcg  Identifier for NDCG metric
    METRICS.RECALL --metrics.recall  Identifier for Recall metric
    METRICS.PRECISION --metrics.precision  Identifier for Precision metric
    METRICS.MICRO-AVG-F1 --metrics.micro-avg-f1  Identifier for micro average F1 metric
    METRICS.MACRO-AVG-F1 --metrics.macro-avg-f1  Identifier for macro average F1 metric
    METRICS.ACCURACY --metrics.accuracy  Identifier for accuracy metric
    EVALUATION.TOP-K --evaluation.top-k  [default: 10]
    EVALUATION.BATCH-SIZE --evaluation.batch-size  [default: 16]
    EVALUATION.SEED --evaluation.seed  [default: 42]
    EVALUATION.MAX-LENGTH --evaluation.max-length
    EVALUATION.SAMPLES-PER-LABEL --evaluation.samples-per-label
    EVALUATION.N-EXPERIMENTS --evaluation.n-experiments  [default: 10]
  * OUTPUT.RESULTS-FILE --output.results-file  [required]
```
</details>

Benchmarks can be run in a few different ways:

- options defined in a YAML config file
- options passed directly on the command line
- options defined in a YAML config file, selectively overridden from the command line (see the example below)
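For instance, to run from a config file but override a single option (using the `--evaluation.top-k` parameter from the help above):

```bash
evals-hub run-benchmark --config reranking_config.yaml --evaluation.top-k 20
```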
**Benchmark configured entirely from a YAML file**

### run reranking

```bash
evals-hub run-benchmark --config reranking_config.yaml
```
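As a rough illustration, such a config maps the CLI parameters onto YAML keys. The key names below are assumptions inferred from the CLI help, and the dataset/model values are placeholders; consult the repository's actual config files for the exact schema:

```yaml
# Hypothetical reranking_config.yaml (illustrative only)
task_name: reranking
dataset:
  name: org/dataset-name      # placeholder dataset identifier
  split: test
model:
  checkpoint: org/model-name  # placeholder model checkpoint
evaluation:
  top_k: 10
  batch_size: 16
  seed: 42
output:
  results_file: results/reranking.json
```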
### run nli

```bash
evals-hub run-benchmark --config nli_config.yaml
```

### run classification

```bash
evals-hub run-benchmark --config classification_config.yaml
```

### run patent landscape evaluation

```bash
evals-hub run-benchmark --config pl_eval_config.yaml
```
## Troubleshooting SSL errors

### SSL errors when connecting to a Hugging Face dataset

Set the environment variable for the Python `requests` library in `.env`:

```bash
REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
```

You may also need to import SSL certificates if you have not done so before.
# Development Setup

## Install the git hook scripts

```bash
pre-commit install
```

## Run tests

```bash
uv run pytest -v
```
## SQL Code Quality

### Lint all SQL files in a directory

```bash
uv run sqlfluff lint --dialect postgres cronjob/
```

### Format/fix SQL files

```bash
uv run sqlfluff format --dialect postgres cronjob/
```
## Serve documentation locally

```bash
uv run mkdocs serve -f docs/mkdocs.yml
```

Then open http://127.0.0.1:8000/ in your browser.
## Refresh & upgrade the lockfile

```bash
uv sync --upgrade
```
## Integration tests

By default, integration tests are ignored in the pytest configuration because evaluation runs take a long time and require GPU resources. However, it is sometimes useful to run them to verify that results are correct against public benchmarks.
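Such exclusion is typically expressed in `pyproject.toml` along these lines (an illustrative sketch; the repository's actual pytest configuration may differ):

```toml
[tool.pytest.ini_options]
addopts = "--ignore=tests/integration"
```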
To run them:

```bash
uv run pytest tests/integration
```
## Run pre-commit hooks locally

```bash
source .venv/bin/activate
pre-commit run --all-files
```
## Show outdated packages

```bash
uv tree --outdated --depth 1
```