myponline

myponline applies SDPO (Self-Distillation Policy Optimization, Hübotter et al. 2026) to the MyPO goal of typed-Python preference tuning. MyPO used ruff + mypy --strict offline to synthesize DPO chosen/rejected pairs; myponline instead runs the same static analyzers online, as both the rollout reward and the rich-feedback channel inside a TRL SDPOTrainer loop.

Status: current milestone state and the next recommended task live in docs/milestones.md. The repo already includes the Python 3.12 scaffold plus the first milestone artifacts, including the analyzer benchmark spike and the prompt-only mypo-4k-rfc loader.

Hub status: the code-first Hub repo is published at joshuasundance/myponline, and the current blessed checkpoint is published at joshuasundance/myponline-checkpoint. The intended split is now: clean code repo + bucket-backed run outputs + one blessed checkpoint repo.

Hub references

Code repo: joshuasundance/myponline
Checkpoint repo: joshuasundance/myponline-checkpoint
Dataset: joshuasundance/mypo-4k-rfc
Project collection: joshuasundance/mypo-project
Reference dashboard: joshuasundance/mypo-live
Trackio dashboard: joshuasundance/myponline-dashboard

Latest run snapshot (2026-04-26)

The latest published typed-pass control run is q15-chain-full-r21-20260426-sdpo-l40sx1-g8-s4000-typed-pass-r7 (checkpoint revision q15-r21-l40sx1-g8-s4000-typed-pass-r7). Post-train eval artifacts are durable HF-job outputs on the shared bucket path, not local-only one-offs.

In-domain characterization on 150 validation prompts

| Subject | Parse | Ruff | Mypy strict | Mean reward | Win vs chosen |
| --- | --- | --- | --- | --- | --- |
| Base (`Qwen/Qwen2.5-Coder-1.5B-Instruct`) | 0.8267 | 0.7867 | 0.2200 | 0.6887 | 0.0067 |
| Typed-pass control r20 (`q15-r20-l40sx1-g8-s500-typed-pass-r6`) | 0.9867 | 0.9733 | 0.8667 | 0.5481 | 0.2267 |
| Typed-pass control r21 (`q15-r21-l40sx1-g8-s4000-typed-pass-r7`) | 1.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 |

Interpretation: the longer r21 typed-pass run found a degenerate corner of the metric surface: it drove parse / ruff / mypy pass rates to 1.0, but also collapsed annotation-slot coverage, fully annotated function fraction, mean reward, and win rate vs chosen completions to 0.0. In other words, it got better at producing trivially valid Python modules, not better typed-Python solutions.
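A toy version of the coverage metric shows how this can happen. The sketch below is an illustration under stated assumptions, not the project's evaluator: `annotation_slots` and `coverage` are hypothetical helpers, a "slot" is assumed to mean one parameter or one return position, and a module with no slots is assumed to score 0.0.

```python
import ast


def annotation_slots(source: str) -> tuple[int, int]:
    """Return (annotated, total) annotation slots across all functions in source."""
    annotated = 0
    total = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        params = [*args.posonlyargs, *args.args, *args.kwonlyargs]
        if args.vararg is not None:
            params.append(args.vararg)
        if args.kwarg is not None:
            params.append(args.kwarg)
        # One slot per parameter plus one for the return type.
        slots = [p.annotation for p in params] + [node.returns]
        total += len(slots)
        annotated += sum(slot is not None for slot in slots)
    return annotated, total


def coverage(source: str) -> float:
    """Annotated fraction of slots; zero slots scores 0.0 (assumed convention)."""
    annotated, total = annotation_slots(source)
    return annotated / total if total else 0.0


typed = "def add(a: int, b: int) -> int:\n    return a + b\n"
partial = "def add(a: int, b):\n    return a + b\n"
empty = ""

for name, src in [("typed", typed), ("partial", partial), ("empty", empty)]:
    print(name, annotation_slots(src), round(coverage(src), 3))
```

The empty string parses cleanly and gives a linter or type checker nothing to flag, yet it has no slots to annotate. That combination is consistent with pass rates of 1.0 alongside zero coverage.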

HumanEval+

For the checkpoint-only HumanEval+ run (humaneval-plus-20260425-posttrain-100):

pass@1 base: 0.6829
pass@1 plus: 0.6159
tasks: 164

For the typed-pass SDPO control run (humaneval-plus-20260426-r20-typed-pass-r6):

pass@1 base: 0.6829
pass@1 plus: 0.6098
tasks: 164

For the longer typed-pass SDPO control run (humaneval-plus-20260426-r21-typed-pass-r7):

pass@1 base: 0.6829
pass@1 plus: 0.6159
tasks: 164

Interpretation: r21 recovered the checkpoint-only HumanEval+ score while the in-domain typed objective collapsed to zero reward / zero typed coverage. That makes the main follow-up question clear: prevent the empty-module collapse without giving back the general-code stability.
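To put the size of these differences in perspective, the reported rates can be converted back to task counts. This is a back-of-the-envelope sketch that assumes pass@1 here is simply solved tasks divided by 164 (one sample per task); `solved_count` is a hypothetical helper, not part of the repo.

```python
TASKS = 164

# pass@1 "plus" rates as reported above.
reported_plus = {
    "checkpoint-only": 0.6159,
    "r20 typed-pass": 0.6098,
    "r21 typed-pass": 0.6159,
}


def solved_count(rate: float, tasks: int = TASKS) -> int:
    """Recover the integer number of solved tasks from a rounded pass@1 rate."""
    count = round(rate * tasks)
    # Sanity check: the count must reproduce the reported 4-decimal rate.
    assert round(count / tasks, 4) == rate
    return count


plus_counts = {name: solved_count(rate) for name, rate in reported_plus.items()}
base_count = solved_count(0.6829)
print(base_count, plus_counts)
```

Under that assumption the three runs solve 101, 100, and 101 of the 164 plus tasks, so the gap between r20 and the other two runs is a single task.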

Durable evaluation artifacts

In-domain checkpoint summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/myponline-checkpoint-posttrain-150-c/summary.json
In-domain base summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/qwen25-coder-1p5b-base-posttrain-150-c/summary.json
In-domain typed-pass control summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/q15-r20-l40sx1-g8-s500-typed-pass-r6-full-eval/summary.json
In-domain typed-pass long-run summary: hf://buckets/joshuasundance/mypo-artifacts/myponline/eval/q15-r21-l40sx1-g8-s4000-typed-pass-r7-full-eval/summary.json
HumanEval+ aggregate (checkpoint-only run): hf://buckets/joshuasundance/mypo-artifacts/humaneval-plus-20260425-posttrain-100/aggregate.json
HumanEval+ aggregate (typed-pass control run): hf://buckets/joshuasundance/mypo-artifacts/humaneval-plus-20260426-r20-typed-pass-r6/aggregate.json
HumanEval+ aggregate (typed-pass long run): hf://buckets/joshuasundance/mypo-artifacts/humaneval-plus-20260426-r21-typed-pass-r7/aggregate.json

Start here

docs/plan.md — the full project plan. Problem statement, architecture sketch, proposed repo layout, milestones (M0–M8), known risks. This is the source of truth for scope.
docs/decisions/0001-seed-choices.md — the initial set of project defaults chosen in the seeding conversation (base model, stack, dataset reuse, feedback channels, etc.). Each entry is explicitly marked reversible.
mypo_original/ — read-only reference docs for the prior project. Authoritative for: dataset schema, MyPO training recipes, HF Jobs + Buckets + CodeCarbon operational pattern, HumanEval+ dashboard. Start with mypo_original/README.md.
sdpo_background/ — read-only research compendium for SDPO: paper, TRL documentation, source-code walkthrough, reference implementation, practical guide, theory prereqs. Authoritative for: SDPOConfig knobs, the 12 TRL implementation gotchas, and the hybrid / rich-feedback recipe choices. Start with sdpo_background/README.md.

Working principles (captured from the seeding session)

Interface-first for the analyzer service. The rest of the pipeline talks to a narrow AnalyzerService.evaluate(code) → AnalysisResult contract. The implementation behind it (subprocess vs in-process API vs daemon; threads vs processes vs asyncio) is explicitly undecided and will be chosen by benchmark (M1a), not by guess. See docs/plan.md §1.2.
Every check runs independently and concurrently with every other check. No sample waits on another sample; within a sample, ruff and mypy never wait on each other. The only hard-serial step is the ast.parse / compile() fast-path.
One analysis per sample feeds both the scalar reward and the privileged_context textual feedback. No double-work.
Reuse MyPO wherever it makes sense. Same base model (Qwen/Qwen2.5-Coder-1.5B-Instruct), same prompt set (joshuasundance/mypo-4k-rfc prompts; drop the chosen/rejected columns), same telemetry stack (CodeCarbon + Trackio), same HF-model-repo layout.
Stay comparable to MyPO DPO-v3. The v0 eval bar is "beat MyPO DPO-v3 on the MyPO objective at comparable or better HumanEval+".
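The first three principles can be sketched in a few lines. This is a minimal illustration, not the project's implementation: only `AnalyzerService.evaluate(code)` and `AnalysisResult` come from the text above; the field names, the `reward` and `feedback` methods, the thread-pool backend, and the stub checkers are all assumptions, and the real backend is still to be chosen by the M1a benchmark.

```python
import ast
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# A checker takes source code and returns (passed, diagnostics text).
Checker = Callable[[str], tuple[bool, str]]


@dataclass(frozen=True)
class AnalysisResult:
    parses: bool
    passed: dict[str, bool]
    diagnostics: dict[str, str]

    def reward(self) -> float:
        """Hypothetical scalar: fraction of checks passed, 0.0 if unparsable."""
        if not self.parses or not self.passed:
            return 0.0
        return sum(self.passed.values()) / len(self.passed)

    def feedback(self) -> str:
        """Feedback text built from the same result, so nothing is analyzed twice."""
        return "\n".join(f"[{name}] {text}" for name, text in self.diagnostics.items() if text)


class ThreadedAnalyzerService:
    """One possible backend behind the evaluate() contract (assumed, not chosen)."""

    def __init__(self, checkers: dict[str, Checker]) -> None:
        self._checkers = checkers
        self._pool = ThreadPoolExecutor(max_workers=max(1, len(checkers)))

    def evaluate(self, code: str) -> AnalysisResult:
        # Hard-serial fast path: skip the analyzers when the code does not parse.
        try:
            ast.parse(code)
        except SyntaxError as exc:
            return AnalysisResult(False, {}, {"parse": str(exc)})
        # Submit every checker before collecting any result, so none waits on another.
        futures = {name: self._pool.submit(fn, code) for name, fn in self._checkers.items()}
        results = {name: fut.result() for name, fut in futures.items()}
        return AnalysisResult(
            True,
            {name: ok for name, (ok, _) in results.items()},
            {name: text for name, (_, text) in results.items()},
        )


# Stub checkers standing in for real ruff / mypy invocations.
def fake_ruff(code: str) -> tuple[bool, str]:
    return True, ""


def fake_mypy(code: str) -> tuple[bool, str]:
    ok = "->" in code
    return ok, ("" if ok else "error: Function is missing a return type annotation")


service = ThreadedAnalyzerService({"ruff": fake_ruff, "mypy": fake_mypy})
result = service.evaluate("def f(x):\n    return x\n")
print(result.reward())
print(result.feedback())
```

Because callers only see `evaluate`, swapping the thread pool for subprocesses, a daemon, or asyncio would not touch the training loop. Concurrency across samples is left out of this sketch; a shared pool sized per checker would need revisiting once many samples are evaluated at once.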

Quickstart

uv sync
uv run pre-commit run --all-files
uv run pytest
uv run ruff check .
uv run mypy .
uv run python scripts/myponline_eval.py --smoke
uv run myponline-sdpo-train --smoke
uv run myponline-sdpo-dispatch --code-repo joshuasundance/myponline --checkpoint-repo joshuasundance/myponline-checkpoint -- --smoke
uv run myponline-sft-warmstart-dispatch --code-repo joshuasundance/myponline --publish-repo joshuasundance/myponline-sft-q12 -- --smoke

Layout (current)

myponline/
├── README.md                  # this file — human entry point
├── AGENTS.md                  # entry point for AI assistants in new sessions
├── LICENSE                    # Apache-2.0
├── pyproject.toml             # uv-managed project metadata + tool config
├── uv.lock                    # locked dependencies
├── docs/
│   ├── plan.md                # full project plan (source of truth for scope)
│   ├── milestones.md          # milestone board mirror (M0–M8)
│   └── decisions/
│       └── 0001-seed-choices.md
├── examples/
│   └── smoke_reward.py        # syntax-only smoke example
├── myponline/
│   ├── analysis/              # interface-first analysis scaffold
│   ├── data/                  # dataset loader scaffold
│   ├── eval/                  # M7 characterization entrypoint + reporting
│   ├── telemetry/             # telemetry scaffold
│   └── training/              # SDPO training scaffold
├── mypo_original/             # read-only reference (prior project docs)
├── scripts/                   # root entrypoints matching the MyPO pattern
├── tests/                     # scaffold-level tests
└── sdpo_background/           # read-only reference (SDPO compendium)

The deeper layout proposed for future milestones still lives in docs/plan.md §4; M0 intentionally lands only the stable scaffold and interface placeholders.

License

Apache-2.0, matching MyPO's repo-level choice.
