tomoro-evals


How to run the project

Create a virtual environment and activate it

uv venv
source .venv/bin/activate
uv pip install -e .

Database Setup (Required for ETL Pipeline)

Before running the ETL pipeline (main() function) in the notebooks, you need to set up PostgreSQL:

Prerequisites

  1. Install PostgreSQL (if not already installed):

    brew install postgresql@14
    
  2. Start PostgreSQL service:

    brew services start postgresql@14
    
  3. Create database and user:

    psql -d postgres -c "CREATE DATABASE cii;"
    psql -d postgres -c "CREATE USER app WITH PASSWORD 'password';"
    psql -d postgres -c "GRANT ALL PRIVILEGES ON DATABASE cii TO app;"
    
  4. Create database tables:

    psql -d cii -U app -h localhost -f cronjob/customer_transactions/schema.sql
    
  5. Verify setup:

    psql -d cii -U app -h localhost -c "\dt"
    

The ETL pipeline expects:

  • Database name: cii
  • Username: app
  • Password: password
  • Host: localhost
  • Port: 5432

These settings are configured in the DSN variable in the notebook.

Alternative: Run PostgreSQL with Docker (Recommended for Isolation)

If you prefer not to install PostgreSQL locally, you can run it in a Docker container that auto-loads the schema.

1. Start a fresh container

docker rm -f pg-cii 2>/dev/null || true
docker volume rm pgdata 2>/dev/null || true

docker run -d \
  --name pg-cii \
  -e POSTGRES_USER=app \
  -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=cii \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  -v "$(pwd)/cronjob/customer_transactions/schema.sql:/docker-entrypoint-initdb.d/001-schema.sql:ro" \
  postgres:16

The schema.sql file is executed only the first time the named volume pgdata is initialized.
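To confirm the schema was loaded on first start, grep the container logs: the official postgres image logs each init script it runs from docker-entrypoint-initdb.d (exact wording may vary by image version).

docker logs pg-cii 2>&1 | grep -i docker-entrypoint-initdb.d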

2. Check container & logs

docker ps --filter name=pg-cii
docker logs pg-cii | tail -n 30

3. Inspect tables

docker exec -it pg-cii psql -U app -d cii -c "\dt"

4. Set DSN (current shell)

export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"

If using a notebook:

import os
os.environ["CII_PG_DSN"] = "dbname=cii user=app password=password host=localhost port=5432"
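As a quick connectivity check once the DSN is set, open and close a connection. This is a minimal sketch assuming the psycopg2 driver is installed; substitute whichever Postgres driver the project actually uses.

uv run python -c "import os, psycopg2; psycopg2.connect(os.environ['CII_PG_DSN']).close(); print('connection ok')"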

5. Rebuild after changing schema.sql

docker rm -f pg-cii && docker volume rm pgdata && \
docker run -d --name pg-cii \
  -e POSTGRES_USER=app -e POSTGRES_PASSWORD=password -e POSTGRES_DB=cii \
  -p 5432:5432 \
  -v pgdata:/var/lib/postgresql/data \
  -v "$(pwd)/cronjob/customer_transactions/schema.sql:/docker-entrypoint-initdb.d/001-schema.sql:ro" \
  postgres:16

6. Stop / start later

docker stop pg-cii
docker start pg-cii

7. Apply schema manually (if needed on an existing container)

cat cronjob/customer_transactions/schema.sql | docker exec -i pg-cii psql -U app -d cii

8. Simple backup / restore

# Backup
docker exec -t pg-cii pg_dump -U app -d cii > backup.sql
# Restore (fresh volume)
docker rm -f pg-cii && docker volume rm pgdata
# start container again (see step 1, omit schema bind if restoring)
cat backup.sql | docker exec -i pg-cii psql -U app -d cii

Using uv with Docker Postgres

All Python commands can run inside the uv-managed environment while PostgreSQL runs in Docker.

uv sync  # install dependencies
export CII_PG_DSN="dbname=cii user=app password=password host=localhost port=5432"
uv run python cronjob/customer_transactions/agent_run.py

Add a console entry point in pyproject.toml (optional). uv run resolves [project.scripts] entries; the module path below assumes agent_run.py is importable as a module and exposes a main() function:

[project.scripts]
agent = "cronjob.customer_transactions.agent_run:main"

Then:

uv run agent

Usage

The commands below assume an activated virtual environment. If you have not activated your environment and are using uv, prefix the commands with uv run.

Online Usage

Run the reranking evaluation for Langfuse traces:

uv run langfuse_trace_evaluation.py

Offline Usage

Evals Hub can also be used offline, for development or as part of a CI/CD pipeline. Use the evals-hub CLI tool to run benchmarks; the main entry point is the run-benchmark command.

View options for the evals-hub command
evals-hub --help
Usage: evals-hub COMMAND

╭─ Commands ───────────────────────────────────────────╮
│ run-benchmark                                        │
│ --help -h      Display this message and exit.        │
│ --version      Display application version.          │
╰──────────────────────────────────────────────────────╯
╭─ Parameters ─────────────────────────────────────────╮
│ *  --config  [required]                              │
╰──────────────────────────────────────────────────────╯
View options for the evals-hub run-benchmark command
evals-hub run-benchmark --help
Usage: evals-hub run-benchmark [ARGS] [OPTIONS]

╭─ Parameters ──────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  TASK-NAME --task-name                        [choices: retrieval, reranking, classification, nli] [required]       │
│ *  DATASET.NAME --dataset.name                  [required]                                                            │
│    DATASET.SPLIT --dataset.split                                                                                      │
│    DATASET.HF-SUBSET --dataset.hf-subset                                                                              │
│ *  MODEL.CHECKPOINT --model.checkpoint          [required]                                                            │
│    METRICS.MAP --metrics.map                    Identifier for MAP metric                                             │
│    METRICS.MRR --metrics.mrr                    Identifier for MRR metric                                             │
│    METRICS.NDCG --metrics.ndcg                  Identifier for NDCG metric                                            │
│    METRICS.RECALL --metrics.recall              Identifier for Recall metric                                          │
│    METRICS.PRECISION --metrics.precision        Identifier for Precision metric                                       │
│    METRICS.MICRO-AVG-F1 --metrics.micro-avg-f1  Identifier for micro average F1 metric                                │
│    METRICS.MACRO-AVG-F1 --metrics.macro-avg-f1  Identifier for macro average F1 metric                                │
│    METRICS.ACCURACY --metrics.accuracy          Identifier for accuracy metric                                        │
│    EVALUATION.TOP-K --evaluation.top-k          [default: 10]                                                         │
│    EVALUATION.BATCH-SIZE                        [default: 16]                                                         │
│      --evaluation.batch-size                                                                                          │
│    EVALUATION.SEED --evaluation.seed            [default: 42]                                                         │
│    EVALUATION.MAX-LENGTH                                                                                              │
│      --evaluation.max-length                                                                                          │
│    EVALUATION.SAMPLES-PER-LABEL                                                                                       │
│      --evaluation.samples-per-label                                                                                   │
│    EVALUATION.N-EXPERIMENTS                     [default: 10]                                                         │
│      --evaluation.n-experiments                                                                                       │
│ *  OUTPUT.RESULTS-FILE --output.results-file    [required]                                                            │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯


Benchmarks can be run in a few different ways:

  • options defined in a YAML config file
  • options passed directly on the command line
  • options defined in a YAML config file and overridden from the command line (see the example below)
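For example, to take most settings from a YAML file but override a single value at run time (the flag names are those shown in the help output above):

evals-hub run-benchmark --config reranking_config.yaml --evaluation.batch-size 32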

Benchmark configured entirely from a YAML file

Run reranking

evals-hub run-benchmark --config reranking_config.yaml

Run NLI

evals-hub run-benchmark --config nli_config.yaml

Run classification

evals-hub run-benchmark --config classification_config.yaml

Run patent landscape evaluation

evals-hub run-benchmark --config pl_eval_config.yaml
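The config files themselves are not shown here; presumably their keys mirror the CLI parameters above. As a hypothetical scaffold only (the key names, dataset, and model below are assumptions — check the actual YAML files in the repo):

cat > reranking_config.yaml <<'EOF'
# Hypothetical scaffold -- key names assumed to mirror the CLI flags above
task_name: reranking
dataset:
  name: mteb/askubuntudupquestions-reranking   # placeholder dataset
  split: test
model:
  checkpoint: cross-encoder/ms-marco-MiniLM-L-6-v2   # placeholder model
evaluation:
  top_k: 10
  batch_size: 16
  seed: 42
output:
  results_file: results/reranking.json
EOF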

Troubleshooting SSL errors

SSL errors when connecting to a Hugging Face dataset

Set the environment variable for the Python requests library in .env:

REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

SSL certificates may need to be imported first if you have not done so before.
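To confirm that requests picks up the bundle, a quick check (this assumes the variable is exported in your shell, not only written to .env):

export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
uv run python -c "import requests; print(requests.get('https://huggingface.co').status_code)"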

Development Setup

Install the git hook scripts

pre-commit install

Run tests

uv run pytest -v

SQL Code Quality

Lint all SQL files in a directory:

uv run sqlfluff lint --dialect postgres cronjob/

Format/fix SQL files:

uv run sqlfluff format --dialect postgres cronjob/

Serve documentation locally

uv run mkdocs serve -f docs/mkdocs.yml

Then open http://127.0.0.1:8000/ in your browser.

Refresh & upgrade the lockfile

uv sync --upgrade

Integration tests

By default, integration tests are ignored in the pytest configuration because evaluation runs take a long time and require GPU resources. However, it is sometimes useful to run the evaluation to verify that results match public benchmarks:

uv run pytest tests/integration

To run pre-commit hooks locally

source .venv/bin/activate
pre-commit run --all-files

Show outdated packages

uv tree --outdated --depth 1