Agent trajectories vs synthetic data: error recovery patterns in production

#13
by O96a - opened

Training on 425K curated agentic trajectories from Claude Opus 4.6 and GPT-5.x scaffolding is a compelling approach: the model learns read-before-write patterns and LSP diagnostic responses from real agent traces rather than from synthetic task descriptions.

The Terminal-Bench improvement (+61% over base Qwen3.5-9B) suggests these patterns transfer well. A few questions for production deployment:

  1. The error recovery behavior β€” have you tested recovery rates when the model encounters novel error types not in the training trajectories? In my experience with LangGraph agents, models often struggle with unseen LSP errors that don't match their training distribution.

  2. The minimal edit diffs vs full rewrites β€” this is exactly what production code agents need. Have you measured the token savings on typical edit operations? For 262K context, edit efficiency directly impacts cost.

  3. For the 425K trajectory dataset β€” what's the breakdown between successful vs failed trajectories? Learning from failed attempts (with proper scaffolding) often improves robustness, but can also propagate bad patterns if not filtered.
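On point 2, a rough way to ballpark the savings is to compare the size of a unified diff against a full-file rewrite. A minimal sketch using only stdlib `difflib`, with character count as a crude proxy for tokens (the file contents and the proxy itself are my assumptions, not anything from the model card):

```python
import difflib

def edit_savings(before: str, after: str) -> float:
    """Fraction of output characters saved by emitting a unified diff
    instead of rewriting the whole file (rough proxy for token savings)."""
    diff = "".join(difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile="a.py", tofile="b.py",
    ))
    return 1.0 - len(diff) / len(after)

# Hypothetical one-line fix in a 40-line module.
before = "\n".join(f"def f{i}(): return {i}" for i in range(40)) + "\n"
after = before.replace("return 7\n", "return 70\n")

print(f"{edit_savings(before, after):.0%} fewer output characters than a full rewrite")
```

Even on a toy file this shows a large reduction; at 262K context the gap between a few-line hunk and a multi-thousand-line rewrite dominates per-edit cost.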

Impressive GPQA Diamond results (83.8% pass@1). Looking forward to testing against agent orchestration benchmarks like BFCL.

For teams integrating OmniCoder: the Apache 2.0 license and GGUF availability make this a strong candidate for local coding agents where frontier model APIs aren't feasible.
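On the GGUF side, one cheap pre-flight check before handing a downloaded checkpoint to a local runtime is to verify the 4-byte `GGUF` magic at the start of the file (that magic comes from the GGUF spec; the file path below is a hypothetical placeholder, not an actual release artifact):

```python
import os

def looks_like_gguf(path: str) -> bool:
    """GGUF files begin with the magic bytes b'GGUF'; reject anything else
    (e.g. an HTML error page saved by a failed download)."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

path = "models/omnicoder-q4_k_m.gguf"  # hypothetical local path
if os.path.exists(path) and looks_like_gguf(path):
    print("magic bytes check out; safe to hand to the runtime")
```

This catches the common failure mode of a truncated or HTML-masquerading download before the runtime produces a confusing load error.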
