myponline
Applying SDPO (Self-Distillation Policy Optimization, Hübotter et al. 2026) to the MyPO goal: typed-Python preference tuning. Instead of using
ruff + mypy --strict offline to synthesize DPO chosen/rejected pairs (as MyPO did), myponline runs the static analyzers online as the rollout reward and rich-feedback channel inside a TRL SDPOTrainer loop.
Status: current milestone state and the next recommended task live in
docs/milestones.md. The repo already includes the
Python 3.12 scaffold plus the first milestone artifacts, including the analyzer
benchmark spike and the prompt-only mypo-4k-rfc loader.
Hub status: the code-first Hub repo is published at
joshuasundance/myponline, and the
current blessed checkpoint is published at
joshuasundance/myponline-checkpoint.
The intended split is now: clean code repo + bucket-backed run outputs + one
blessed checkpoint repo.
Hub references
- Code repo: joshuasundance/myponline
- Checkpoint repo: joshuasundance/myponline-checkpoint
- Dataset: joshuasundance/mypo-4k-rfc
- Project collection: joshuasundance/mypo-project
- Reference dashboard: joshuasundance/mypo-live
- Trackio dashboard: joshuasundance/myponline-dashboard
Latest run snapshot (2026-04-26)
The latest published typed-pass control run is
q15-chain-full-r21-20260426-sdpo-l40sx1-g8-s4000-typed-pass-r7
(checkpoint revision q15-r21-l40sx1-g8-s4000-typed-pass-r7). Post-train eval
artifacts are durable HF-job outputs on the shared bucket path, not local-only
one-offs.
In-domain characterization on 150 validation prompts
| Subject | Parse | Ruff | Mypy strict | Mean reward | Win vs chosen |
|---|---|---|---|---|---|
| Base (Qwen/Qwen2.5-Coder-1.5B-Instruct) | 0.8267 | 0.7867 | 0.2200 | 0.6887 | 0.0067 |
| Typed-pass control r20 (q15-r20-l40sx1-g8-s500-typed-pass-r6) | 0.9867 | 0.9733 | 0.8667 | 0.5481 | 0.2267 |
| Typed-pass control r21 (q15-r21-l40sx1-g8-s4000-typed-pass-r7) | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 |
Interpretation: the longer r21 typed-pass run found a degenerate corner of
the metric surface: it drove parse / ruff / mypy pass rates to 1.0, but also
collapsed annotation-slot coverage, fully annotated function fraction, mean
reward, and win rate vs chosen completions to 0.0. In other words, it got
better at producing trivially valid Python modules, not better typed-Python
solutions.
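The collapse described above can be made concrete with a small sketch. It counts annotation slots with the standard-library `ast` module and shows that an empty module parses cleanly yet offers nothing to annotate. The slot definition (each parameter plus the return position) and the helper names `annotation_slot_coverage` and `coverage_ratio` are assumptions for illustration, not the project's actual metric code.

```python
import ast


def annotation_slot_coverage(source: str) -> tuple[int, int]:
    """Return (annotated_slots, total_slots) over every function in source.

    Assumed definition: one slot per parameter plus one for the return type.
    """
    tree = ast.parse(source)  # raises SyntaxError if the code does not parse
    annotated = 0
    total = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            params = [*args.posonlyargs, *args.args, *args.kwonlyargs]
            if args.vararg is not None:
                params.append(args.vararg)
            if args.kwarg is not None:
                params.append(args.kwarg)
            total += len(params) + 1  # +1 for the return annotation slot
            annotated += sum(p.annotation is not None for p in params)
            annotated += int(node.returns is not None)
    return annotated, total


def coverage_ratio(source: str) -> float:
    annotated, total = annotation_slot_coverage(source)
    # A module with no functions has no slots; score it 0.0, not a vacuous 1.0.
    return annotated / total if total else 0.0


empty_module = ""
typed_solution = "def add(a: int, b: int) -> int:\n    return a + b\n"
partial = "def add(a, b: int):\n    return a + b\n"

print(coverage_ratio(empty_module))    # parses fine, but no typed content
print(coverage_ratio(typed_solution))  # every slot annotated
print(annotation_slot_coverage(partial))
```

Under this assumed definition, a pass-rate-only reward cannot tell `empty_module` apart from `typed_solution`, while a coverage term can.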
HumanEval+
For the checkpoint-only HumanEval+ run (humaneval-plus-20260425-posttrain-100):
- pass@1 base: 0.6829
- pass@1 plus: 0.6159
- tasks: 164
For the typed-pass SDPO control run (humaneval-plus-20260426-r20-typed-pass-r6):
- pass@1 base: 0.6829
- pass@1 plus: 0.6098
- tasks: 164
For the longer typed-pass SDPO control run
(humaneval-plus-20260426-r21-typed-pass-r7):
- pass@1 base: 0.6829
- pass@1 plus: 0.6159
- tasks: 164
Interpretation: r21 recovered the checkpoint-only HumanEval+ score while the in-domain typed objective collapsed to zero reward / zero typed coverage. That makes the main follow-up question clear: prevent the empty-module collapse without giving back the general-code stability.
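To put the HumanEval+ deltas in perspective, the reported rates can be converted back to task counts. This sketch assumes pass@1 is the fraction of the 164 tasks solved with one sample per task, and that "base" and "plus" refer to the original and extended test suites. The counts are inferred from the rounded rates, not read from the aggregate files.

```python
TASKS = 164

# Rates copied from the three runs listed above.
reported = {
    "base tests (all three runs)": 0.6829,
    "plus tests, checkpoint-only": 0.6159,
    "plus tests, r20 typed-pass": 0.6098,
    "plus tests, r21 typed-pass": 0.6159,
}


def solved_count(rate: float, tasks: int = TASKS) -> int:
    """Nearest whole number of tasks consistent with a rounded pass rate."""
    return round(rate * tasks)


counts = {name: solved_count(rate) for name, rate in reported.items()}
for name, n in counts.items():
    # Check that the inferred count reproduces the reported 4-decimal rate.
    assert round(n / TASKS, 4) == reported[name]
    print(f"{name}: {n}/{TASKS}")
```

Under that assumption, the r20 and r21 plus-suite scores differ by a single task, so the "recovery" is within one-task granularity.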
Durable evaluation artifacts
- In-domain checkpoint summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/myponline-checkpoint-posttrain-150-c/summary.json
- In-domain base summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/qwen25-coder-1p5b-base-posttrain-150-c/summary.json
- In-domain typed-pass control summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/q15-r20-l40sx1-g8-s500-typed-pass-r6-full-eval/summary.json
- In-domain typed-pass long-run summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/q15-r21-l40sx1-g8-s4000-typed-pass-r7-full-eval/summary.json
- HumanEval+ aggregate (checkpoint-only run): hf://buckets/joshuasundance/mypo-artifacts/humaneval-plus-20260425-posttrain-100/aggregate.json
- HumanEval+ aggregate (typed-pass control run): hf://buckets/joshuasundance/mypo-artifacts/humaneval-plus-20260426-r20-typed-pass-r6/aggregate.json
- HumanEval+ aggregate (typed-pass long run): hf://buckets/joshuasundance/mypo-artifacts/humaneval-plus-20260426-r21-typed-pass-r7/aggregate.json
Start here
- docs/plan.md: the full project plan. Problem statement, architecture sketch, proposed repo layout, milestones (M0–M8), known risks. This is the source of truth for scope.
- docs/decisions/0001-seed-choices.md: the initial set of project defaults chosen in the seeding conversation (base model, stack, dataset reuse, feedback channels, etc.). Each entry is explicitly marked reversible.
- mypo_original/: read-only reference docs for the prior project. Authoritative for: dataset schema, MyPO training recipes, HF Jobs + Buckets + CodeCarbon operational pattern, HumanEval+ dashboard. Start with mypo_original/README.md.
- sdpo_background/: read-only research compendium for SDPO (paper, TRL documentation, source-code walkthrough, reference implementation, practical guide, theory prereqs). Authoritative for: SDPOConfig knobs, the 12 TRL implementation gotchas, and the hybrid / rich-feedback recipe choices. Start with sdpo_background/README.md.
Working principles (captured from the seeding session)
- Interface-first for the analyzer service. The rest of the pipeline talks to a narrow AnalyzerService.evaluate(code) → AnalysisResult contract. The implementation behind it (subprocess vs in-process API vs daemon; threads vs processes vs asyncio) is explicitly undecided and will be chosen by benchmark (M1a), not by guess. See docs/plan.md §1.2.
- Every check runs independently and concurrently with every other check. No sample waits on another sample; within a sample, ruff and mypy never wait on each other. The only hard-serial step is the ast.parse / compile() fast-path.
- One analysis per sample feeds both the scalar reward and the privileged_context textual feedback. No double-work.
- Reuse MyPO wherever it makes sense. Same base model (Qwen/Qwen2.5-Coder-1.5B-Instruct), same prompt set (joshuasundance/mypo-4k-rfc prompts; drop the chosen/rejected columns), same telemetry stack (CodeCarbon + Trackio), same HF-model-repo layout.
- Stay comparable to MyPO DPO-v3. The v0 eval bar is "beat MyPO DPO-v3 on the MyPO objective at comparable or better HumanEval+".
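The first three principles can be sketched together. What follows is a minimal illustration, not the repo's implementation. Only `AnalyzerService.evaluate` and `AnalysisResult` are named in this README; the fields, the reward formula, the `ThreadedAnalyzer` class and the stub checks are all hypothetical. Threads are used only for brevity, since the real backend is to be chosen by the M1a benchmark.

```python
import ast
from collections.abc import Callable, Mapping
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Protocol

# Hypothetical check shape: source code in, diagnostic messages out
# (an empty list means the check passed).
Check = Callable[[str], list[str]]


@dataclass(frozen=True)
class AnalysisResult:
    parsed: bool
    diagnostics: Mapping[str, list[str]] = field(default_factory=dict)

    @property
    def reward(self) -> float:
        """Assumed scalar reward: fraction of stages passed, parse included."""
        if not self.parsed:
            return 0.0
        passed = 1 + sum(not msgs for msgs in self.diagnostics.values())
        return passed / (1 + len(self.diagnostics))

    @property
    def feedback(self) -> str:
        """Textual feedback derived from the same analysis as the reward."""
        lines = [f"{name}: {m}" for name, msgs in self.diagnostics.items() for m in msgs]
        return "\n".join(lines) or "all checks passed"


class AnalyzerService(Protocol):
    def evaluate(self, code: str) -> AnalysisResult: ...


class ThreadedAnalyzer:
    """One possible backend; stands in for whatever M1a ends up choosing."""

    def __init__(self, checks: Mapping[str, Check]) -> None:
        self._checks = dict(checks)

    def evaluate(self, code: str) -> AnalysisResult:
        # Hard-serial fast path: skip the analyzers when the code does not parse.
        try:
            ast.parse(code)
        except SyntaxError as exc:
            return AnalysisResult(parsed=False, diagnostics={"parse": [str(exc)]})
        # Independent checks are submitted together; none waits on another.
        with ThreadPoolExecutor(max_workers=max(1, len(self._checks))) as pool:
            futures = {name: pool.submit(check, code) for name, check in self._checks.items()}
            results = {name: fut.result() for name, fut in futures.items()}
        return AnalysisResult(parsed=True, diagnostics=results)


# Stub checks standing in for real ruff / mypy invocations.
def fake_ruff(code: str) -> list[str]:
    return []


def fake_mypy(code: str) -> list[str]:
    return [] if "->" in code else ["missing return annotation"]


service: AnalyzerService = ThreadedAnalyzer({"ruff": fake_ruff, "mypy": fake_mypy})
result = service.evaluate("def f(x):\n    return x\n")
print(result.reward)
print(result.feedback)
```

Because callers depend only on the `AnalyzerService` protocol, the threaded backend could be swapped for a subprocess or daemon implementation without touching reward or feedback code.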
Quickstart
uv sync
uv run pre-commit run --all-files
uv run pytest
uv run ruff check .
uv run mypy .
uv run python scripts/myponline_eval.py --smoke
uv run myponline-sdpo-train --smoke
uv run myponline-sdpo-dispatch --code-repo joshuasundance/myponline --checkpoint-repo joshuasundance/myponline-checkpoint -- --smoke
uv run myponline-sft-warmstart-dispatch --code-repo joshuasundance/myponline --publish-repo joshuasundance/myponline-sft-q12 -- --smoke
Layout (current)
myponline/
├── README.md                 # this file: human entry point
├── AGENTS.md                 # entry point for AI assistants in new sessions
├── LICENSE                   # Apache-2.0
├── pyproject.toml            # uv-managed project metadata + tool config
├── uv.lock                   # locked dependencies
├── docs/
│   ├── plan.md               # full project plan (source of truth for scope)
│   ├── milestones.md         # milestone board mirror (M0–M8)
│   └── decisions/
│       └── 0001-seed-choices.md
├── examples/
│   └── smoke_reward.py       # syntax-only smoke example
├── myponline/
│   ├── analysis/             # interface-first analysis scaffold
│   ├── data/                 # dataset loader scaffold
│   ├── eval/                 # M7 characterization entrypoint + reporting
│   ├── telemetry/            # telemetry scaffold
│   └── training/             # SDPO training scaffold
├── mypo_original/            # read-only reference (prior project docs)
├── scripts/                  # root entrypoints matching the MyPO pattern
├── tests/                    # scaffold-level tests
└── sdpo_background/          # read-only reference (SDPO compendium)
The deeper layout proposed for future milestones still lives in docs/plan.md §4; M0 intentionally lands only the stable scaffold and interface placeholders.
License
Apache-2.0, matching MyPO's repo-level choice.