How well do different AI models handle the complexities of rigorous research orchestration workflows?
TLDR: DAAF (the Data Analyst Augmentation Framework) is a free, open-source toolkit and instructions framework that turns Claude Code into a rigorous quantitative research engine with the human researcher at the helm: every step auditable, every output verifiable, every decision theirs to make. DAAFBench: Orchestration is a testing suite designed to assess different models’ abilities to adhere to the workflows, guidance, guardrails, and processes needed to facilitate responsible, rigorous, and reproducible data analysis for good research. Lots of neat insights on model performance, task consistency, costs, and open-weight trade-offs below!
The headline picture: model performance across the benchmark suite plotted against estimated cost to run the full battery, relative to Opus 4.8. See the full chart below for much more information, customization, and caveats.
Key Takeaways (June 2026)
Fable 5 leads the scoreboard at a Perfect average score of — — — points clear of the next model. It is also by far the most consistent model tested: — of its repeated cases scored perfectly across every criterion on every single repetition. Given that errors and issues quickly percolate and multiply in these chained agentic systems, this is an enormously valuable characteristic relative to competitors, and likely holds similar additional value in terms of code quality and other moment-to-moment junctures not tested by this benchmark.
Claude Opus 4.7 was widely criticized for being far too stingy in its effort, perceived as a strategy by Anthropic to manage intense compute constraints at launch time. The testing bears this out empirically: Opus 4.5 and 4.6 scored at — and —, respectively, Opus 4.7 came in at —, and Opus 4.8 returned to form at —. The 4.7 dip is even more pronounced when focusing only on critical criteria completion (“Critical-only”: scores counting just the must-pass criteria): —/—/—/— across the four Opus releases.
Sonnet 4.6 posts a Perfect average score of — — #2 overall, shockingly ahead of every Opus model in this corpus (best Opus: —) except in the Critical-only category. And it does so cheaply: running this benchmark’s full battery on Sonnet costs roughly —% of what it costs on Opus 4.8. Recognizing that there is uncertainty/imprecision in these results, and many model capabilities not explicitly tested by these current tests, these numbers nonetheless suggest Sonnet 4.6 is more than capable of orchestration work on a budget with potentially minimal trade-off.
GLM 5.2 leads the open-weight pack at —, looking roughly indistinguishable from the Opus line on every metric at anywhere between a quarter and half of the cost. This is an enormous boon for researchers in particular, given that open-weight models like GLM 5.2 also allow for self-hosting, modification/fine-tuning, and greater long-term support guarantees (i.e., not worrying about Opus 4.5 getting sunset in a few months when multiple analyses and products rely explicitly on it). DeepSeek V4 Flash shows a larger dip in performance relative to the Opus line (—) but with an almost unbelievable price differential, costing only 3-5% of Anthropic's flagships for the same tasks.
Qwen 3.6 27B, the best of the small open models here, manages —, and it's at its weakest in perhaps the most important part of the overall orchestration pipeline: careful skill routing (basically: How often does it load the necessary grounding references before it responds to questions? —) and mode classification (basically: How often does it correctly assign a natural language query to the right workflow process? —). Gemma 4 31B timed out on — of its runs, often (per transcripts) not actually doing any work at all when queried: by far the worst reliability in the corpus. Some of these results may reflect simple harness incompatibility (to be clear, Claude Code is not optimized for communicating with Gemma or Qwen), but they nonetheless point to further development these models need before they are viable for rigorous research orchestration workflows. That being said, they will get there someday, and with DAAFBench, we’ll now be able to know when they do!
About DAAFBench
DAAF layers together a suite of architectural defenses and strategies from the current frontier of AI best practices to maximize AI output quality and force Claude Code to operate more like a careful and thoughtful researcher at every opportunity. One of the core features of this system is something called “agentic orchestration,” or the complex process of weaving together multiple AI assistants in concert to tackle increasingly complex research workflows: anything from iterating continuously on a data visualization script to developing entire data analytic pipelines from a given research question. DAAF (and other orchestration frameworks) does this by giving a series of intensive instructions, guidelines, and workflows to the AI assistants as they work, breaking tasks down into more concrete and tractable sub-tasks, facilitating better coordination, managing work tracking and documentation, and enforcing adherence to a set of unifying work principles (e.g., auditability, human-in-the-loop, rigorous self-verification processes, etc.).
But a framework like this only works if the AI assistants are sophisticated enough to actually understand, apply, and remember these instructions thoughtfully and consistently. This page walks through the initial results of DAAFBench, a series of bespoke benchmark tests explicitly designed to test adherence to the research protocols and process guidelines of DAAF and assess a given model’s suitability for operating along guidelines that make rigorous, reproducible, and responsible social science at scale possible with AI assistants. The results below summarize multi-dimensional performance assessments for 19 models (official Anthropic models as well as open-weight models accessible via OpenRouter, which is immediately compatible with Claude Code and DAAF) across 2,799 different test repetitions designed to simulate key moments of orchestration decision-making and protocol adherence.
Importantly, DAAFBench: Orchestration measures behavioral conformance only: not answer quality, not analytical capability, not general intelligence. A brilliant analyst model that skips confirmation gates scores poorly here, and a modest model that faithfully follows protocol scores well. It is deliberately only half of the picture: a companion suite still in development (predictably, DAAFBench: Analytics) will test analytic competency directly, or whether models make the right calls inside the analysis itself (i.e., explicit decision-making in analytic code and data-cleaning steps, tested via adversarial examples, known-good code, and deterministically verifiable outputs). Orchestration discipline and analytic judgment are complementary halves of trustworthy AI-assisted research, and each deserves its own dedicated measurement. Put differently: Orchestration is an inputs-based assessment — if a model gets the process right, downstream results improve — while Analytics will assess outputs directly. More to come there!
Regardless, DAAFBench: Orchestration results are a crucial step in understanding how models actually perform under these conditions, and they allow for far greater nuance in thinking about the exact use of specific models (especially more expensive versus cheaper, and proprietary versus open-source) across, and within, these types of workflows. Most importantly, they allow us to track more directly whether and when locally hosted models (e.g., on a home computer with consumer hardware) can start to tackle these sorts of tasks, opening the door for unprecedented access to analytic capacity going forward. Both the framework being tested and the benchmark harness that produced these numbers are free and open-source in the DAAF GitHub repository; this page is generated directly from the archived run records, and the Run Explorer below lets you trace any score down to the individual runs behind it. If you catch any issues or have suggestions for improvements, please do get in touch!
Glossary: ten terms this page leans on
- Orchestrator — the main assistant in a DAAF session: it talks to the user, classifies requests, and coordinates everything else. Think of it as a lab manager.
- Subagent — a specialist assistant the orchestrator delegates work to (a data analyst, a code reviewer, a debugger), each with its own working rules.
- Skill — a curated reference document the assistant can load on demand, injecting domain expertise (a statistics library’s quirks, a data source’s structure) at the right moment.
- Engagement mode — one of DAAF’s nine workflow types (a full analysis pipeline, onboarding a new dataset, a quick lookup, and so on); every user request must be routed into one to facilitate self-verification workflows and coordination among specialist subagents.
- Confirmation gate — the required pause where the assistant states its plan and waits for explicit user approval before doing anything.
- Golden checkpoint — a truncated transcript of a known-good session; later phases resume from one, so every model starts at an identical mid-conversation point.
- Case — one benchmark scenario: a specific prompt with defined expected behavior and scoring criteria.
- Rep — one repetition of a case: the identical prompt run again, because a model can succeed on one attempt and fail on the next.
- Composite — a model’s headline score: the unweighted mean of its rates across the five phase components (P1, P2, P3a, P3b, P4).
- Tier band — a grouping of models whose composites cluster together; a new tier starts wherever the ranking shows a sufficiently large gap.
The benchmark phases
Each phase isolates one slice of orchestrator behavior, and each asks the model one plain question: Does it route requests correctly? Does it do its homework before acting? Can it delegate cleanly? Does the specialist it delegates to follow its own rules? Does it ground its advice? Later phases resume from golden checkpoints (truncated transcripts of known-good sessions), so every model starts from an identical mid-conversation state.
| Phase | Starts from | What it tests |
|---|---|---|
| 1 — Mode Classification — cases | A cold start: the model sees only the user’s first request | Can it interpret a natural-language request and route it to the right one of DAAF’s workflow modes — a fairly easy task, but the one everything downstream depends on — while presenting a confirmation gate and executing nothing prematurely? Failure mode: the assistant starts the wrong kind of work. |
| 2 — Post-Confirmation — cases, one per engagement mode | A checkpoint ending at the confirmation gate; the user says “Sounds good, let’s proceed.” | Strict protocol adherence: does it load the governing reference documents and skills it just committed to — the references that define how the work is supposed to be done — before doing any of it? Failure mode: improvising instead of following the established process. |
| 3a — Dispatch Compliance — cases, 2 per agent type | A checkpoint with Ad Hoc Collaboration mode fully initialized | Can it hand work to a subagent concisely, coherently, and instructively — the correct agent type, with a properly structured prompt (BASE_DIR, mode marker, task / context / instructions sections)? Failure mode: garbled delegation that degrades everything downstream. |
| 3b — Subagent Behavior | The subagent’s own transcript, whenever a Phase 3 dispatch succeeds | Does the dispatched agent operate correctly as the specialist — following the work protocols its agent specification demands, e.g., a coding agent writing a script and executing it through the audit-trail wrapper? Failure mode: a specialist that ignores its role’s rules. |
| 4 — Skill Routing — cases | The same initialized Ad Hoc checkpoint as Phase 3; the user asks a brainstorming question | The research-brainstorming dimension: models cannot be relied on to draw research methodology from their fuzzy general knowledge (the baked-in impressions left over from training), so a genuine research assistant must judiciously load curated domain expertise through skill reference files — FIRST the data-scientist hub’s method-selection guidance, THEN the specific library skills the task calls for (e.g., pyfixest for fixed effects, geopandas for spatial work) — to ground what it does, says, and recommends in real research methodology. Failure mode: confident, ungrounded methodological advice. |
Phase 3b is a sub-scoring of Phase 3 runs, not a separate batch: its criteria are derived from the agent type, and which criteria apply varies with the case, so its denominators differ from 3a’s.
All five components (P1, P2, P3a, P3b, P4) enter the leaderboard composite and tier bands with equal weight.
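As a minimal sketch of the equal-weight composite described above, the snippet below averages five per-phase rates. The function name and the example rates are illustrative assumptions, not values or code from the benchmark harness.

```python
from statistics import mean

# The five phase components named on this page.
PHASES = ("P1", "P2", "P3a", "P3b", "P4")


def composite(phase_rates: dict[str, float]) -> float:
    """Unweighted mean of the five phase rates (each a share between 0 and 1)."""
    missing = [p for p in PHASES if p not in phase_rates]
    if missing:
        raise ValueError(f"missing phase rates: {missing}")
    return mean(phase_rates[p] for p in PHASES)


# Made-up rates for illustration only; these are NOT benchmark results.
example = {"P1": 0.90, "P2": 0.80, "P3a": 0.70, "P3b": 0.60, "P4": 0.50}
score = composite(example)
print(f"composite = {score:.2f}")
```

Averaging the per-phase rates, rather than pooling all runs into one rate, means a phase with many cases does not outweigh a phase with few.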
How scoring works
Every run is graded against named criteria by deterministic scorers
(basically: simple, repeatable text-and-structure checks on the archived
transcript — no AI judging AI anywhere in the loop, so the same run always
gets the same grade). Each criterion is either critical (a structural must-pass, like
whether a subagent was dispatched at all — agent_dispatched) or
normal (a protocol detail, like whether the
dispatch prompt carries a context section). The two bars described above come
straight from this split: Perfect demands every criterion pass,
Critical-only demands every critical criterion pass. (Earlier versions of
this page and the benchmark docs called these tiers “hard” and
“soft”.)
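To make "deterministic scorer" concrete, here is a hedged sketch of a text-and-structure check over a dispatch prompt. The criterion names agent_dispatched and prompt_has_context_section come from this page; everything else (the markdown-style heading format, the single accepted heading, the helper names) is a hypothetical stand-in and not the harness's actual implementation.

```python
import re
from typing import Callable, Optional

# A check takes the dispatch prompt (None if no subagent was dispatched).
Check = Callable[[Optional[str]], bool]


def dispatched(prompt: Optional[str]) -> bool:
    """Critical check: was anything dispatched at all?"""
    return prompt is not None


def has_heading(title: str) -> Check:
    """Build a check for a heading line such as '## Context' (assumed format)."""
    pattern = re.compile(
        rf"^#+\s*{re.escape(title)}\s*$", re.IGNORECASE | re.MULTILINE
    )
    return lambda prompt: prompt is not None and pattern.search(prompt) is not None


# (criterion name, is_critical, check) -- an illustrative subset only.
CRITERIA: list[tuple[str, bool, Check]] = [
    ("agent_dispatched", True, dispatched),
    ("prompt_has_context_section", False, has_heading("Context")),
]


def grade(prompt: Optional[str]) -> dict[str, bool]:
    """Pure function of the text: the same prompt always gets the same verdicts."""
    return {name: check(prompt) for name, _critical, check in CRITERIA}


clean = "## Task\nPlot enrollment by year.\n\n## Context\nData lives under BASE_DIR.\n"
variant = "## Task\nPlot enrollment by year.\n\n## Background\nData lives under BASE_DIR.\n"
print(grade(clean))
print(grade(variant))
```

The second prompt carries real context under a heading the sketch does not recognize, so it fails the normal criterion. A fixed list of accepted headings will always produce that kind of false negative.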
Three rates appear throughout this page, and they answer different questions:
- Perfect (per run) — did all criteria pass on this run? This is the high bar, the headline metric, and the basis of the leaderboard composite.
- Critical-only (per run) — did every critical criterion pass on this run? This is the lower bar defined above.
- Critical rate (per criterion) — across all of a model’s runs, what fraction of its critical criteria passed?
They diverge by design. Worked example: in a 12-run batch where 4 runs each fail exactly one normal criterion, Perfect drops to 67% (8/12 runs) while the critical rate stays at 100% — a model can look strong on criterion-level rates yet rarely deliver a fully clean run.
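The worked example can be reproduced in a few lines. This is a sketch under stated assumptions: each run is modeled with just one critical criterion and one normal criterion, and the Verdict type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    criterion: str
    critical: bool
    passed: bool


def is_perfect(run: list[Verdict]) -> bool:
    """Perfect: every criterion on the run passed."""
    return all(v.passed for v in run)


def is_critical_only(run: list[Verdict]) -> bool:
    """Critical-only: every critical criterion on the run passed."""
    return all(v.passed for v in run if v.critical)


def critical_rate(runs: list[list[Verdict]]) -> float:
    """Per-criterion rate: share of all critical verdicts that passed."""
    critical = [v for run in runs for v in run if v.critical]
    return sum(v.passed for v in critical) / len(critical)


def make_run(normal_ok: bool) -> list[Verdict]:
    # One critical and one normal criterion per run (simplifying assumption).
    return [
        Verdict("agent_dispatched", critical=True, passed=True),
        Verdict("prompt_has_context_section", critical=False, passed=normal_ok),
    ]


# 12 runs: 8 fully clean, 4 that each fail exactly one normal criterion.
runs = [make_run(True) for _ in range(8)] + [make_run(False) for _ in range(4)]

perfect_rate = sum(is_perfect(r) for r in runs) / len(runs)
critical_only_rate = sum(is_critical_only(r) for r in runs) / len(runs)
crit_rate = critical_rate(runs)

print(f"Perfect: {perfect_rate:.0%}")
print(f"Critical-only: {critical_only_rate:.0%}")
print(f"Critical rate: {crit_rate:.0%}")
```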
How runs were executed
Runs execute inside the real DAAF Docker container through the
claude -p command line, with all framework hooks, permission rules, and
skill discovery live — nothing is mocked. Phases 2–4 resume from golden
checkpoints so every model starts at an identical protocol point, and each
(phase, model, case) cell is repeated up to three times. Every batch archives to a
timestamped, self-contained result set from which this page is generated.
Key caveats
- This is one view under one orchestration system, not a model exam. DAAF’s framework instructions were written and tuned with Claude Opus 4.5/4.6 as the working model, so part of any score difference between models is prompting-style fit rather than capability. Read these results as one particular view of how models behave under a very specific AI orchestration system — not as an absolute indicator of model quality.
- All analyses were conducted via Claude Code through OpenRouter in a relatively naive fashion, and at least some of the results will be driven by models' compatibility with this specific harness combination. Tune your sense of generalizability around these benchmark results accordingly!
- Timeouts are still graded. Runs that hit the wall-clock limit (some runs in this corpus) have zeroed turn counts and a duration recorded at the cutoff limit, but their criteria are fully scored from the transcript produced before the cutoff — a timed-out run can be a perfect run.
- Denominators are small and uneven. Repetition counts are relatively small and can be uneven across providers and models due to cost limitations — some (model, phase) cells have 2 reps instead of 3 — so the results are necessarily imprecise: read every rate together with its n, and recognize there are unshown confidence intervals to think about.
- Cost figures are estimates at published list prices, not invoices. The headline cost lens throughout this page is the battery-cost multiplier: each model’s estimated cost to run the full —-probe benchmark battery once, expressed relative to Opus 4.8 (= 1.0×), built from published list prices and each model’s observed token mix — taken from provider billing data, not the harness’s counts — on an everything-uncached basis (defined under Cost vs. Performance). These relative figures can shift as providers reprice tokens and as provider caching mechanics interact with each model’s behavior. The per-Mtok figures that remain on the page are secondary detail: raw published list rates, not measured spend. The harness’s own token logging runs through an Anthropic-compatible endpoint whose counts come from Anthropic’s tokenizer rather than each OpenRouter model’s billing meter, so dollar costs derived from those counts were unreliable and were removed.
- Deterministic scorers have known false negatives. prompt_has_context_section is the canonical case: the scorer accepts seven heading variants, yet models often place genuine contextual content under headings outside that list and are marked down for it.
- Fable 5’s thinking blocks are encrypted, so reasoning-quality assessment for that model relies entirely on observable output.
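To get a feel for how wide the unshown intervals in the small-denominator caveat can be, one option is a Wilson score interval on a k/n rate. This is a standard textbook formula offered as an illustration; the page does not say which interval method, if any, the authors would use.

```python
from math import sqrt


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes in n trials (z=1.96 ~ 95%)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# A cell with 3 perfect reps out of 3 still has a wide plausible range.
low, high = wilson_interval(3, 3)
print(f"3/3 -> [{low:.2f}, {high:.2f}]")
```

With three reps, even a clean 3/3 cell has a lower bound of roughly 0.44, which is why every rate should be read together with its n.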
Leaderboard
The ranking runs by default on each model’s composite score (basically: a report-card average — take the model’s share of fully clean runs in each of the five phase components (P1, P2, P3a, P3b, P4), and average those five numbers with equal weight). Every rate shows its denominator underneath to better convey uncertainty with smaller samples. Tier bands group models whose composites cluster together: gaps between tiers are meaningful, ordering within a tier may not be. The metric toggle switches every rate, rank, and tier band between the two bars defined above: Perfect (everything exactly right — every criterion passed) and Critical-only (will it generally work — every must-pass criterion passed). Click a phase cell to open that phase’s deep-dive; click any numeric column header to re-sort, and click it again to reverse the direction.
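Tier banding can be sketched as a single pass over the ranked composites that opens a new tier at each large gap. The page does not state how large a gap counts as "sufficiently large", so the min_gap threshold, the model names, and the scores below are all hypothetical.

```python
def tier_bands(composites: dict[str, float], min_gap: float = 0.05) -> list[list[str]]:
    """Group models into tiers; a new tier starts when the drop from the
    previous model's composite is at least min_gap (assumed threshold)."""
    ranked = sorted(composites.items(), key=lambda kv: kv[1], reverse=True)
    tiers: list[list[str]] = []
    previous: float | None = None
    for name, score in ranked:
        if previous is None or previous - score >= min_gap:
            tiers.append([])  # gap is large enough: open a new tier
        tiers[-1].append(name)
        previous = score
    return tiers


# Made-up composites for illustration only.
tiers = tier_bands(
    {"model_a": 0.92, "model_b": 0.90, "model_c": 0.78, "model_d": 0.77, "model_e": 0.60}
)
print(tiers)
```

Under this rule, ordering inside a tier reflects differences smaller than the threshold, which matches the advice to treat within-tier ordering cautiously.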
Cost vs. Performance
This is the value-for-money view: each point is one model, placed by what it costs against how well it conforms, so up and to the left is better. The cost axis uses the headline cost figure from throughout this page: each model’s estimated cost to run the full DAAFBench battery once, expressed as a multiple of Opus 4.8 (= 1.0×). The multiplier folds each model’s observed token appetite into its price (basically: two models with identical list rates can differ in what the same work actually costs, because one chews through more tokens to do it — defined, with its caveats, in the notes below the chart). The cost axis is also logarithmic: each major gridline is 10× the previous one, which is the only way to fit models whose costs span more than a hundredfold onto one chart. The stepped line traces the efficiency frontier, or the models no other model beats on both cost and conformance at once: anything off the line is outperformed somewhere cheaper. The selectors switch the vertical axis between the composite and any single phase group (P1–P4) under either bar (Perfect or Critical-only), and the price-basis toggle offers two secondary views: the raw published input and output list rates in dollars per million tokens of text ($/Mtok — a million tokens is roughly 750,000 words). Those are list prices, not measured benchmark spend. Shape and color mark the serving provider; hover any point for its exact figures.
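The two ideas in this view, the battery-cost multiplier and the efficiency frontier, can be sketched together. The snippet assumes battery cost is simply tokens times list rate with no cache discount, and treats a tie on conformance at higher cost as dominated. All names, token counts, prices, and scores are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPoint:
    name: str
    input_mtok: float    # observed input tokens for one battery, in millions
    output_mtok: float   # observed output tokens for one battery, in millions
    input_rate: float    # list price, $ per million input tokens
    output_rate: float   # list price, $ per million output tokens
    conformance: float   # e.g. composite Perfect rate, 0..1


def battery_cost(m: ModelPoint) -> float:
    """Estimated uncached cost: every token priced at its list rate."""
    return m.input_mtok * m.input_rate + m.output_mtok * m.output_rate


def cost_multipliers(models: list[ModelPoint], baseline: str) -> dict[str, float]:
    """Each model's battery cost relative to the baseline model (= 1.0x)."""
    base = battery_cost(next(m for m in models if m.name == baseline))
    return {m.name: battery_cost(m) / base for m in models}


def efficiency_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """points maps name -> (cost, conformance). Walk from cheapest to most
    expensive and keep a model only if it beats every cheaper model's score."""
    frontier: list[str] = []
    best = float("-inf")
    for name, (_cost, score) in sorted(
        points.items(), key=lambda kv: (kv[1][0], -kv[1][1])
    ):
        if score > best:
            frontier.append(name)
            best = score
    return frontier


models = [
    ModelPoint("baseline", 10, 2, 5.0, 25.0, 0.80),
    ModelPoint("mid", 10, 2, 3.0, 15.0, 0.85),
    ModelPoint("cheap", 12, 3, 0.5, 2.0, 0.60),
    ModelPoint("pricey", 10, 2, 10.0, 50.0, 0.75),
]
multipliers = cost_multipliers(models, baseline="baseline")
frontier = efficiency_frontier(
    {m.name: (multipliers[m.name], m.conformance) for m in models}
)
print(multipliers)
print(frontier)
```

In this toy data, "cheap" uses more tokens than the baseline but still costs far less, and both "baseline" and "pricey" fall off the frontier because "mid" is cheaper and scores higher.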
Phase Deep-Dives
This is where the scoreboard turns into a diagnosis: one heatmap per phase, where rows are models ranked by that phase’s Perfect rate, and each column is one scored criterion (a single named requirement a run either meets or misses), with critical (must-pass) criteria before normal ones, split by the yellow border. Use this to find which specific protocol requirement a model misses: a model can rank well overall while reliably failing one criterion. Hover any cell for its exact pass count; click a cell to open those runs in the Run Explorer.
Cases & Consistency
Two layers on the same repetition data. These models are probabilistic (basically: ask the identical question twice and you can get two different answers), so each case is run multiple times: each rep is the identical prompt run again from the identical starting point. The agreement heatmaps show, for every case and model, how many of those reps were perfect runs (k/n in every cell): split cells expose flaky cases and flaky models at a glance, and the All models margin gives each case’s cross-model difficulty, hardest cases first. The case browser below holds every case’s full prompt, expected behavior, and requirement lists, with the per-rep criterion marks. Click any cell or mark to open the matching runs.
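The k/n cells and the cross-model difficulty margin can be sketched from a flat list of graded reps. The record layout and the pooling rule for the margin (summing k and n across models) are assumptions for illustration; the page does not spell out how the margin is aggregated.

```python
from collections import defaultdict

Run = tuple[str, str, bool]  # (case_id, model, was_perfect): assumed layout


def agreement_cells(runs: list[Run]) -> dict[tuple[str, str], tuple[int, int]]:
    """Return {(case, model): (k, n)} with k perfect reps out of n reps."""
    counts: dict[tuple[str, str], list[int]] = defaultdict(lambda: [0, 0])
    for case_id, model, perfect in runs:
        cell = counts[(case_id, model)]
        cell[0] += int(perfect)
        cell[1] += 1
    return {key: (k, n) for key, (k, n) in counts.items()}


def split_cells(cells: dict[tuple[str, str], tuple[int, int]]) -> list[tuple[str, str]]:
    """Cells where some reps were perfect and some were not (flaky)."""
    return sorted(key for key, (k, n) in cells.items() if 0 < k < n)


def case_difficulty(cells: dict[tuple[str, str], tuple[int, int]]) -> list[tuple[str, float]]:
    """Pooled perfect share per case across models, hardest case first."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for (case_id, _model), (k, n) in cells.items():
        totals[case_id][0] += k
        totals[case_id][1] += n
    return sorted(((c, k / n) for c, (k, n) in totals.items()), key=lambda kv: kv[1])


# Made-up reps for illustration only.
runs = [
    ("case_1", "model_a", True), ("case_1", "model_a", True), ("case_1", "model_a", True),
    ("case_1", "model_b", True), ("case_1", "model_b", False), ("case_1", "model_b", True),
    ("case_2", "model_a", False), ("case_2", "model_a", False),
]
cells = agreement_cells(runs)
flaky = split_cells(cells)
difficulty = case_difficulty(cells)
print(cells)
print(flaky)
print(difficulty)
```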
Run Explorer
This is the audit layer of the page: every aggregate above can be traced down to the individual runs it was computed from. It lists every individual run with its grade, per-criterion verdicts and diagnostic details, and condensed transcript, including subagent transcripts where they exist. The filters below apply to this section only; clicks from the heatmaps and case views above land here with the matching filters pre-applied. Status reflects grading (perfect / partial / failed / ungraded); a clock marker flags timed-out runs, which are still fully graded from the transcript produced before the cutoff.
Provenance
The audit trail for where the numbers come from: one row per archived result set, with the DAAF git SHA and run configuration it was produced under. Run-level data on disk is the source of truth; summary counts are shown for disclosure.
Generator v · generated
2026-06-17 00:28:10 UTC.
This document and the benchmark system that produced it were built with DAAF
(Data Analyst Augmentation Framework) tooling.