# tomoro-evals
[![Static Badge](https://img.shields.io/badge/User%20Guide-Documentation-blue)](https://tomoro-ai.github.io/tomoro-evals/)
# How to run the project
## Create virtual environment and activate it
```bash
uv venv
source .venv/bin/activate
uv pip install -e .
```
## Database Setup (Required for ETL Pipeline)
Before running the ETL pipeline (`main()` function) in the notebooks, you need to set up PostgreSQL:
### Prerequisites
1. **Install PostgreSQL** (if not already installed; the commands below use Homebrew on macOS):
```bash
brew install postgresql@14
```
2. **Start PostgreSQL service**:
```bash
brew services start postgresql@14
```
3. **Create database and user**:
```bash
psql -d postgres -c "CREATE DATABASE cii;"
psql -d postgres -c "CREATE USER app WITH PASSWORD 'password';"
psql -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE cii TO app;"
```
4. **Create database tables**:
```bash
psql -d cii -U app -h localhost -f cronjob/customer_transactions/schema.sql
```
5. **Verify setup**:
```bash
psql -d cii -U app -h localhost -c "\dt"
```
The ETL pipeline expects:
- Database name: `cii`
- Username: `app`
- Password: `password`
- Host: `localhost`
- Port: `5432`
These settings are configured in the `DSN` variable in the notebook.
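To check these credentials before running the pipeline, you can open a throwaway connection with the same settings (a quick sanity check using `psql`):
```bash
# Uses the same connection settings as the notebook DSN; prints one row if everything works
psql "dbname=cii user=app password=password host=localhost port=5432" -c "SELECT 1;"
```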
### Alternative: Run PostgreSQL with Docker (Recommended for Isolation)
If you prefer not to install PostgreSQL locally, you can run it in a Docker container that auto-loads the schema.
#### 1. Start a fresh container
```bash
docker rm -f pg-cii 2>/dev/null || true
docker volume rm pgdata 2>/dev/null || true
docker run -d \
--name pg-cii \
-e POSTGRES_USER=app \
-e POSTGRES_PASSWORD=password \
-e POSTGRES_DB=cii \
-p 5432:5432 \
-v pgdata:/var/lib/postgresql/data \
-v "$(pwd)/cronjob/customer_transactions/schema.sql":/docker-entrypoint-initdb.d/001-schema.sql:ro \
postgres:16
```
The `schema.sql` file is executed only the first time the named volume `pgdata` is initialized.
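To confirm the schema was applied during that first initialization, you can grep the container logs; the stock `postgres` image entrypoint prints a line for each init script it executes:
```bash
# The entrypoint logs every file it runs from /docker-entrypoint-initdb.d
docker logs pg-cii 2>&1 | grep initdb.d
```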
#### 2. Check container & logs
```bash
docker ps --filter name=pg-cii
docker logs pg-cii | tail -n 30
```
#### 3. Inspect tables
```bash
docker exec -it pg-cii psql -U app -d cii -c "\dt"
```
#### 4. Set DSN (current shell)
```bash
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
```
If using a notebook:
```python
import os
os.environ["CII_PG_DSN"] = "dbname=cii user=app password=password host=localhost port=5432"
```
#### 5. Rebuild after changing `schema.sql`
```bash
docker rm -f pg-cii && docker volume rm pgdata && \
docker run -d --name pg-cii \
-e POSTGRES_USER=app -e POSTGRES_PASSWORD=password -e POSTGRES_DB=cii \
-p 5432:5432 \
-v pgdata:/var/lib/postgresql/data \
-v "$(pwd)/cronjob/customer_transactions/schema.sql":/docker-entrypoint-initdb.d/001-schema.sql:ro \
postgres:16
```
#### 6. Stop / start later
```bash
docker stop pg-cii
docker start pg-cii
```
#### 7. Apply schema manually (if needed on an existing container)
```bash
docker exec -i pg-cii psql -U app -d cii < cronjob/customer_transactions/schema.sql
```
#### 8. Simple backup / restore
```bash
# Backup
docker exec -t pg-cii pg_dump -U app -d cii > backup.sql
# Restore (fresh volume)
docker rm -f pg-cii && docker volume rm pgdata
# start container again (see step 1, omit schema bind if restoring)
docker exec -i pg-cii psql -U app -d cii < backup.sql
```
### Using uv with Docker Postgres
All Python commands can run inside the uv-managed environment while PostgreSQL runs in Docker.
```bash
uv sync # install dependencies
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
uv run python cronjob/customer_transactions/agent_run.py
```
Optionally, expose the script as a named command in `pyproject.toml` via a standard `[project.scripts]` entry point (this assumes `agent_run.py` defines a `main()` function and that `cronjob` is importable as a package; adjust to your layout):
```toml
[project.scripts]
agent = "cronjob.customer_transactions.agent_run:main"
```
Then:
```bash
uv run agent
```
# Usage
The commands below assume an activated virtual environment. If you have not activated your environment and you are using `uv`, prefix the commands with `uv run`.
## Online Usage
### Run reranking evaluation for Langfuse traces
```bash
uv run langfuse_trace_evaluation.py
```
## Offline Usage
Evals Hub can also be used offline, for development purposes or as part of a CI/CD pipeline. Use the `evals-hub` CLI tool to run benchmarks offline; the main entry point is the `run-benchmark` command.
<details>
<summary><b>View options for the evals-hub command</b></summary>

```bash
evals-hub --help
```

```
Usage: evals-hub COMMAND

╭─ Commands ─────────────────────────────────╮
│ run-benchmark                               │
│ --help -h  Display this message and exit.   │
│ --version  Display application version.     │
╰─────────────────────────────────────────────╯
╭─ Parameters ───────────────────────────────╮
│ * --config  [required]                      │
╰─────────────────────────────────────────────╯
```

</details>
<details>
<summary><b>View options for the evals-hub run-benchmark command</b></summary>

```bash
evals-hub run-benchmark --help
```

```
Usage: evals-hub run-benchmark [ARGS] [OPTIONS]

╭─ Parameters ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * TASK-NAME --task-name                                 [choices: retrieval, reranking, classification, nli] [required]        │
│ * DATASET.NAME --dataset.name                           [required]                                                             │
│   DATASET.SPLIT --dataset.split                                                                                                │
│   DATASET.HF-SUBSET --dataset.hf-subset                                                                                        │
│ * MODEL.CHECKPOINT --model.checkpoint                   [required]                                                             │
│   METRICS.MAP --metrics.map                             Identifier for MAP metric                                              │
│   METRICS.MRR --metrics.mrr                             Identifier for MRR metric                                              │
│   METRICS.NDCG --metrics.ndcg                           Identifier for NDCG metric                                             │
│   METRICS.RECALL --metrics.recall                       Identifier for Recall metric                                           │
│   METRICS.PRECISION --metrics.precision                 Identifier for Precision metric                                        │
│   METRICS.MICRO-AVG-F1 --metrics.micro-avg-f1           Identifier for micro average F1 metric                                 │
│   METRICS.MACRO-AVG-F1 --metrics.macro-avg-f1           Identifier for macro average F1 metric                                 │
│   METRICS.ACCURACY --metrics.accuracy                   Identifier for accuracy metric                                         │
│   EVALUATION.TOP-K --evaluation.top-k                   [default: 10]                                                          │
│   EVALUATION.BATCH-SIZE --evaluation.batch-size         [default: 16]                                                          │
│   EVALUATION.SEED --evaluation.seed                     [default: 42]                                                          │
│   EVALUATION.MAX-LENGTH --evaluation.max-length                                                                                │
│   EVALUATION.SAMPLES-PER-LABEL --evaluation.samples-per-label                                                                  │
│   EVALUATION.N-EXPERIMENTS --evaluation.n-experiments   [default: 10]                                                          │
│ * OUTPUT.RESULTS-FILE --output.results-file             [required]                                                             │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

</details>
Benchmarks can be run in a few different ways:
- options defined in a YAML config file
- options directly from the command line
- options defined in a YAML config file, with individual values overridden from the command line
**Benchmark configured entirely from a YAML file**
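For reference, a config file mirrors the CLI parameters shown in the help output above. The sketch below is illustrative only: the key names are inferred from the dotted CLI options, and the dataset and model values are placeholders, so treat the repo's own config files as the source of truth.
```bash
# Hypothetical example config; key names inferred from the run-benchmark options
cat > my_reranking_config.yaml <<'EOF'
task_name: reranking
dataset:
  name: your-org/your-dataset          # placeholder dataset id
  split: test
model:
  checkpoint: your-org/your-reranker   # placeholder model checkpoint
evaluation:
  top_k: 10
  batch_size: 16
  seed: 42
output:
  results_file: results/reranking.json
EOF
evals-hub run-benchmark --config my_reranking_config.yaml
```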
## Run reranking
```bash
evals-hub run-benchmark --config reranking_config.yaml
```
## Run NLI
```bash
evals-hub run-benchmark --config nli_config.yaml
```
## Run classification
```bash
evals-hub run-benchmark --config classification_config.yaml
```
## Run patent landscape evaluation
```bash
evals-hub run-benchmark --config pl_eval_config.yaml
```
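Per the third mode above, values from a config file can be overridden by passing the corresponding flag on the command line. For example, keeping everything from the YAML but changing `top-k` (the value 20 here is arbitrary):
```bash
# Start from the YAML config but override a single evaluation option
evals-hub run-benchmark --config reranking_config.yaml --evaluation.top-k 20
```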
## Troubleshooting SSL errors
### SSL errors when connecting to a Hugging Face dataset
Set the CA bundle environment variable for the Python `requests` library in `.env`:
```bash
REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
```
You may need to import SSL certificates into this bundle if you have not done so before.
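After setting the variable, a quick request from inside the environment is an easy smoke test (the URL here is just an example endpoint):
```bash
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
# Prints 200 if the certificate bundle is picked up; raises an SSLError otherwise
uv run python -c "import requests; print(requests.get('https://huggingface.co').status_code)"
```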
# Development Setup
## Install the git hook scripts
```bash
pre-commit install
```
## Run tests
```bash
uv run pytest -v
```
## SQL Code Quality
### Lint all SQL files in a directory:
```bash
uv run sqlfluff lint --dialect postgres cronjob/
```
### Format/fix SQL files:
```bash
uv run sqlfluff format --dialect postgres cronjob/
```
## Serve documentation locally
```bash
uv run mkdocs serve -f docs/mkdocs.yml
```
Then open http://127.0.0.1:8000/ in your browser.
## Refresh & upgrade the lockfile
```bash
uv sync --upgrade
```
## Integration tests
By default, integration tests are ignored in the pytest configuration because evaluation runs take a long time and require GPU resources. However, it is sometimes useful to run the evaluation to verify that results are correct against public benchmarks.
Run them explicitly with:
```bash
uv run pytest tests/integration
```
## Run pre-commit hooks locally
```bash
source .venv/bin/activate
pre-commit run --all-files
```
## Show outdated packages
```bash
uv tree --outdated --depth 1
```