Choosing Your Model

How well do different AI models handle the complexities of rigorous research orchestration workflows?

Report Updated 2026-07-29 19:44:13 UTC

TLDR: DAAF (the Data Analyst Augmentation Framework) is a free, open-source toolkit and instructions framework that turns Claude Code into a rigorous quantitative research engine with the human researcher at the helm: every step auditable, every output verifiable, every decision theirs to make. DAAFBench: Orchestration is a testing suite designed to assess different models’ abilities to adhere to the workflows, guidance, guardrails, and processes needed to facilitate responsible, rigorous, and reproducible data analysis for good research. Lots of neat insights on model performance, task consistency, costs, and open-weight trade-offs below!

The bottom line for practical use of DAAF at this time: Fable 5 is the undisputed top performer if budget is not a constraint. Those who are budget-conscious should feel comfortable relying on GPT-5.6 Sol or Sonnet 5 for strong performance at a more moderate cost, and GPT-5.6 Luna delivers most of that capability for a fraction of the spend. For self-hosting/full-control/hermetic systems, all of the leading open-weights are now viable options: Kimi K3 sits at the top of that pack on capability when price and model size are not a constraint, while GLM 5.2 and DeepSeek V4 are the economical choices. There may nonetheless be benefits to using more generally powerful models not explicitly tracked by this benchmark (e.g., actual coding quality, analytic fidelity, etc.) if your budget and use limits allow.

The headline picture: model performance across the benchmark suite plotted against estimated cost to run the full battery, relative to Opus 4.8. See the full chart below for much more information, customization, and caveats.

The Headlines

Key Takeaways (July 2026)

1. Fable 5 leads the pack, but the top tier is increasingly crowded with a diverse array of viable contenders.

Fable 5 leads the scoreboard at a Perfect average score of —. Opus 5 arrives right behind it at —, but costs roughly —% more to run for a hair lower score, so Fable simply dominates it. (The gap between them is small enough that sample noise matters; the cost gap is not.) The bigger story is who else reaches the top: Tier 1 now also holds GPT-5.6 Sol (—) and the open-weights Kimi K3 (—), making clear that Anthropic has no monopoly on frontier compute capability for research orchestration.

2. The efficiency frontier runs through four different model creators.

The efficiency frontier is the set of models no other model beats on both cost and conformance at once (basically: the picks that are never simply outclassed by something cheaper). Six models make it, and they come from four different providers: Gemma 4 31B (— at — Opus 4.8’s battery cost), DeepSeek V4 Flash (— at —), GPT-5.6 Luna (— at —), Sonnet 5 (— at —), GPT-5.6 Sol (— at —), and Fable 5 (— at —). Two of the six — Gemma 4 31B and DeepSeek V4 Flash — are open-weights, and Gemma can even be run on sufficiently powerful home computers. The practical lesson is that no single vendor owns price-performance: the right pick depends on which budget point you are shopping at (and given imprecision in these estimates, note some fluidity in these recommendations; see the caveats below).
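
The frontier definition above can be sketched as a simple dominance check. This is a minimal illustration, not the page's actual chart code: the model names and numbers are invented placeholders rather than benchmark results, and treating ties as "at least as good on both axes, strictly better on one" is an assumption.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    cost: float   # battery cost relative to the reference model (lower is better)
    score: float  # conformance rate in [0, 1] (higher is better)


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a.cost <= b.cost and a.score >= b.score
            and (a.cost < b.cost or a.score > b.score))


def efficiency_frontier(points: list[Point]) -> list[Point]:
    """Return the non-dominated points, cheapest first."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: p.cost)


# Hypothetical values for illustration only.
models = [
    Point("model-a", 0.02, 0.55),
    Point("model-b", 0.10, 0.50),  # dominated by model-a: pricier and weaker
    Point("model-c", 0.30, 0.80),
    Point("model-d", 1.00, 0.78),  # dominated by model-c: pricier and weaker
    Point("model-e", 1.50, 0.95),
]
frontier_names = [p.name for p in efficiency_frontier(models)]
print(frontier_names)
```

Every point left off the returned list has at least one cheaper-or-equal model that also scores at least as well, which is what "simply outclassed" means here.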

3. Raw performance follows steep diminishing returns in terms of cost.

The climb to the top gets expensive fast. GPT-5.6 Luna delivers about — of Fable 5’s composite for roughly — of its total cost to run the tests; GPT-5.6 Sol, about — of Fable’s score for — of the cost. Buying the last ~8 points of performance (e.g., the jump from Sol up to Fable) costs about — as much. All to say: it’s worth figuring out what quality bar your use case actually needs before you pay for the very top of it.

4. What budget models actually give up: reliability and predictability.

Moving down the price ladder, the first thing to erode isn’t raw capability: it’s reliability and predictability. Because every test here runs up to three times, we can measure how often a model scores equally well on repeat attempts: the share of repeated cases where all reps land on the identical outcome. Fable 5 agrees with itself on — of its repeated cases; DeepSeek V4 Flash (a genuinely capable budget pick) manages only —. Cheaper models don’t just score lower on average; they wobble more from run to run. In a chained agentic workflow where one bad step derails everything downstream, that unpredictability is a real cost of its own, and one to weigh carefully against the savings.
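
The self-agreement measure described above can be sketched in a few lines. This is a rough illustration under stated assumptions: the input shape (one `(case_id, outcome)` pair per rep) and the outcome labels are hypothetical, and cases with a single rep are excluded because they cannot disagree with themselves.

```python
from collections import defaultdict


def self_agreement(runs):
    """Share of repeated cases whose reps all land on the identical outcome.

    runs: iterable of (case_id, outcome) pairs, one per rep.
    Returns None when no case has at least two reps.
    """
    by_case = defaultdict(list)
    for case_id, outcome in runs:
        by_case[case_id].append(outcome)
    repeated = [outcomes for outcomes in by_case.values() if len(outcomes) >= 2]
    if not repeated:
        return None
    agreeing = sum(1 for outcomes in repeated if len(set(outcomes)) == 1)
    return agreeing / len(repeated)


# Hypothetical reps for one model.
example_runs = [
    ("case-1", "perfect"), ("case-1", "perfect"), ("case-1", "perfect"),  # agrees
    ("case-2", "perfect"), ("case-2", "failed"), ("case-2", "perfect"),   # wobbles
    ("case-3", "failed"), ("case-3", "failed"),                           # agrees
    ("case-4", "perfect"),                                                # single rep, excluded
]
agreement = self_agreement(example_runs)
print(f"{agreement:.0%}")
```

Note that a model can agree with itself by failing consistently, so this number is best read next to the score rather than instead of it.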

5. Open-weight models are no longer the compromise option.

Open-weights hold the budget end of the frontier outright: Gemma 4 31B and DeepSeek V4 Flash are the two cheapest non-dominated points on the entire chart. GLM 5.2 remains the natural self-hosting pick for teams that want full control, at — composite for about — of Opus 4.8’s battery cost; and Kimi K3 now tops the open-weights pack on raw capability (—, squarely in Tier 1) when price and model size are not the binding constraint. All of this is exactly why DAAF is built to be deliberately provider-flexible: provider competition keeps pushing costs down and capability up, and staying flexible is what lets you actually capture those gains as the market keeps moving.

Methodology & Caveats

About DAAFBench

DAAF layers together a suite of architectural defenses and strategies from the current frontier of AI best practices to maximize AI output quality and force Claude Code to operate more like a careful and thoughtful researcher at every opportunity. One of the core features of this system is something called “agentic orchestration,” or the complex process of weaving together multiple AI assistants in concert to tackle increasingly complex research workflows: anything from iterating continuously on a data visualization script to developing entire data analytic pipelines from a given research question. DAAF (and other orchestration frameworks) does this by giving a series of intensive instructions, guidelines, and workflows to the AI assistants as they work, breaking tasks down into more concrete and tractable sub-tasks, facilitating better coordination, managing work tracking and documentation, and enforcing adherence to a set of unifying work principles (e.g., auditability, human-in-the-loop, rigorous self-verification processes, etc.).

But a framework like this only works if the AI assistants are sophisticated enough to actually understand, apply, and remember these instructions thoughtfully and consistently. This page walks through the initial results of DAAFBench, a series of bespoke benchmark tests explicitly designed to test adherence to the research protocols and process guidelines of DAAF. The goal is to assess a given model’s suitability for operating along guidelines that make rigorous, reproducible, and responsible social science at scale possible with AI assistants. The results below summarize multi-dimensional performance assessments for 29 models (official Anthropic models as well as open-weights models accessible via OpenRouter, which is immediately compatible with Claude Code and DAAF) across 4,437 test repetitions designed to simulate key moments of orchestration decision-making and protocol adherence.

Importantly, DAAFBench: Orchestration measures behavioral conformance only: not answer quality, not analytical capability, not general intelligence. A brilliant analyst model that skips confirmation gates scores poorly here, and a modest model that faithfully follows protocol scores well. It is deliberately only half of the picture: a companion suite still in development (predictably, DAAFBench: Analytics) will test analytic competency directly, or whether models make the right calls inside the analysis itself (i.e., explicit decision-making in analytic code and data-cleaning steps, tested via adversarial examples, known-good code, and deterministically verifiable outputs). Orchestration discipline and analytic judgment are complementary halves of trustworthy AI-assisted research, and each deserves its own dedicated measurement. Put differently: Orchestration is an inputs-based assessment — if a model gets the process right, downstream results improve — while Analytics will assess outputs directly. More to come there!

Regardless, DAAFBench: Orchestration results are a crucial step in understanding how models actually perform under these conditions and allow for far greater nuance in thinking about the exact use of specific models (especially more expensive versus cheaper, and proprietary versus open-source) across these types of workflows (and within!). Most importantly, it allows us to more directly track whether and when locally-hosted models (e.g., on a home computer with consumer hardware) can start to tackle these sorts of tasks, opening the door for unprecedented access to analytic capacity going forward. Both the framework being tested and the benchmark harness that produced these numbers are free and open-source in the DAAF GitHub repository; this page is generated directly from the archived run records, and the Run Explorer below lets you trace any score down to the individual runs behind it. If you catch any issues or have suggestions for improvements, please do get in touch!

Two quality indicators run through every number on this page.

Perfect is the high bar: did the model do everything exactly right on a run — every criterion (a single named requirement a run either meets or misses) passed, right down to the protocol details? It tells you how well, and how often, a model delivers a fully clean run, and it is the basis of the headline composite.

Critical-only is the lower but still valuable bar: if you are willing to accept some flaws in the details, will the model generally work through the system as intended? It counts only the must-pass criteria. The leaderboard’s metric toggle switches every rate, rank, and tier band between the two.

Glossary: ten terms this page leans on

Orchestrator — the main assistant in a DAAF session: it talks to the user, classifies requests, and coordinates everything else. Think of it as a lab manager.
Subagent — a specialist assistant the orchestrator delegates work to (a data analyst, a code reviewer, a debugger), each with its own working rules.
Skill — a curated reference document the assistant can load on demand, injecting domain expertise (a statistics library’s quirks, a data source’s structure) at the right moment.
Engagement mode — one of DAAF’s nine workflow types (a full analysis pipeline, onboarding a new dataset, a quick lookup, and so on); every user request must be routed into one to facilitate self-verification workflows and coordination among specialist subagents.
Confirmation gate — the required pause where the assistant states its plan and waits for explicit user approval before doing anything.
Golden checkpoint — a truncated transcript of a known-good session; later phases resume from one, so every model starts at an identical mid-conversation point.
Case — one benchmark scenario: a specific prompt with defined expected behavior and scoring criteria.
Rep — one repetition of a case: the identical prompt run again, because a model can succeed on one attempt and fail on the next.
Composite — a model’s headline score: the unweighted mean of its rates across the five phase components (P1, P2, P3a, P3b, P4).
Tier band — a grouping of models whose composites cluster together; a new tier starts wherever the ranking shows a sufficiently large gap.
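
The tier-band rule in the glossary (a new tier starts wherever the ranking shows a sufficiently large gap) can be sketched as follows. The threshold `min_gap` is a hypothetical parameter: the page does not state what gap counts as "sufficiently large", and the composites below are made up.

```python
def tier_bands(composites: dict[str, float], min_gap: float) -> list[list[str]]:
    """Group models into tiers, starting a new tier at every gap >= min_gap."""
    ranked = sorted(composites.items(), key=lambda item: item[1], reverse=True)
    tiers: list[list[str]] = []
    previous_score = None
    for name, score in ranked:
        if previous_score is None or previous_score - score >= min_gap:
            tiers.append([])  # gap is large enough: open a new tier
        tiers[-1].append(name)
        previous_score = score
    return tiers


# Hypothetical composites; min_gap=0.05 is an assumed threshold.
example_composites = {"m1": 0.92, "m2": 0.90, "m3": 0.81, "m4": 0.79, "m5": 0.60}
tiers = tier_bands(example_composites, min_gap=0.05)
print(tiers)
```

Because only the gaps between neighbors matter, ordering inside a tier carries little information, which matches how the leaderboard asks to be read.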

The benchmark phases

Each phase isolates one slice of orchestrator behavior, and each asks the model one plain question: Does it route requests correctly? Does it do its homework before acting? Can it delegate cleanly? Does the specialist it delegates to follow its own rules? Does it ground its advice? Later phases resume from golden checkpoints (truncated transcripts of known-good sessions), so every model starts from an identical mid-conversation state.

Phase	Starts from	What it tests
1 — Mode Classification — cases	A cold start: the model sees only the user’s first request	Can it interpret a natural-language request and route it to the right one of DAAF’s workflow modes — a fairly easy task, but the one everything downstream depends on — while presenting a confirmation gate and executing nothing prematurely? Failure mode: the assistant starts the wrong kind of work.
2 — Post-Confirmation — cases, one per engagement mode	A checkpoint ending at the confirmation gate; the user says “Sounds good, let’s proceed.”	Strict protocol adherence: does it load the governing reference documents and skills it just committed to — the references that define how the work is supposed to be done — before doing any of it? Failure mode: improvising instead of following the established process.
3a — Dispatch Compliance — cases, 2 per agent type	A checkpoint with Ad Hoc Collaboration mode fully initialized	Can it hand work to a subagent concisely, coherently, and instructively — the correct agent type, with a properly structured prompt (BASE_DIR, mode marker, task / context / instructions sections)? Failure mode: garbled delegation that degrades everything downstream.
3b — Subagent Behavior	The subagent’s own transcript, whenever a Phase 3 dispatch succeeds	Does the dispatched agent operate correctly as the specialist — following the work protocols its agent specification demands, e.g., a coding agent writing a script and executing it through the audit-trail wrapper? Failure mode: a specialist that ignores its role’s rules.
4 — Skill Routing — cases	The same initialized Ad Hoc checkpoint as Phase 3; the user asks a brainstorming question	The research-brainstorming dimension: models cannot be relied on to draw research methodology from their fuzzy general knowledge (the baked-in impressions left over from training), so a genuine research assistant must judiciously load curated domain expertise through skill reference files — FIRST the data-scientist hub’s method-selection guidance, THEN the specific library skills the task calls for (e.g., pyfixest for fixed effects, geopandas for spatial work) — to ground what it does, says, and recommends in real research methodology. Failure mode: confident, ungrounded methodological advice.

Phase 3b is a sub-scoring of Phase 3 runs, not a separate batch: its criteria are derived from the agent type, and which criteria apply varies with the case, so its denominators differ from 3a’s.

All five components (P1, P2, P3a, P3b, P4) enter the leaderboard composite and tier bands with equal weight.
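
As a small sketch of what "equal weight" implies: the composite averages the five per-phase rates, so a phase with few runs counts as much as one with many. The counts below are invented, and `pooled_rate` is included only as a contrast; it is not a metric this page reports.

```python
from statistics import mean

PHASES = ("P1", "P2", "P3a", "P3b", "P4")


def composite(results: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-phase rates; results maps phase -> (clean_runs, n)."""
    missing = [phase for phase in PHASES if phase not in results]
    if missing:
        raise ValueError(f"missing phase components: {missing}")
    return mean(results[phase][0] / results[phase][1] for phase in PHASES)


def pooled_rate(results: dict[str, tuple[int, int]]) -> float:
    """Contrast only: all runs lumped together, so big phases dominate."""
    passes = sum(k for k, _ in results.values())
    total = sum(n for _, n in results.values())
    return passes / total


# Hypothetical (clean_runs, n) per phase component.
example = {"P1": (30, 30), "P2": (9, 10), "P3a": (8, 10), "P3b": (3, 5), "P4": (7, 10)}
composite_score = composite(example)
pooled = pooled_rate(example)
print(round(composite_score, 3), round(pooled, 3))
```

In this made-up example the large, easy P1 batch pulls the pooled figure well above the composite, which is the effect equal weighting avoids.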

How scoring works

Every run is graded against named criteria by deterministic scorers (basically: simple, repeatable text-and-structure checks on the archived transcript — no AI judging AI anywhere in the loop, so the same run always gets the same grade). Each criterion is either critical (a structural must-pass, like whether a subagent was dispatched at all — agent_dispatched) or normal (a protocol detail, like whether the dispatch prompt carries a context section). The two bars described above come straight from this split: Perfect demands every criterion pass, Critical-only demands every critical criterion pass. (Earlier versions of this page and the benchmark docs called these tiers “hard” and “soft”.)
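
To make the critical/normal split concrete, here is a minimal sketch of how deterministic scoring of this kind could work. It is not the benchmark's actual scorer: the criterion names `agent_dispatched` and `prompt_has_context_section` come from this page, but the check logic, the markdown-style heading pattern, and the accepted heading list are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical heading variants; the real scorer's accepted list is not reproduced here.
CONTEXT_HEADINGS = ("context", "background")


@dataclass(frozen=True)
class Criterion:
    name: str
    critical: bool
    check: Callable[[Optional[str]], bool]  # receives the dispatch prompt, or None


def agent_dispatched(prompt: Optional[str]) -> bool:
    return prompt is not None


def prompt_has_context_section(prompt: Optional[str]) -> bool:
    if prompt is None:
        return False
    pattern = r"^#+\s*(" + "|".join(CONTEXT_HEADINGS) + r")\b"
    return re.search(pattern, prompt, flags=re.IGNORECASE | re.MULTILINE) is not None


CRITERIA = [
    Criterion("agent_dispatched", True, agent_dispatched),
    Criterion("prompt_has_context_section", False, prompt_has_context_section),
]


def grade(prompt: Optional[str]) -> dict:
    """Apply every criterion; derive the Perfect and Critical-only bars."""
    verdicts = {c.name: c.check(prompt) for c in CRITERIA}
    return {
        "verdicts": verdicts,
        "perfect": all(verdicts.values()),
        "critical_only": all(verdicts[c.name] for c in CRITERIA if c.critical),
    }


# Real context, but under a heading outside the accepted list.
dispatch = "## Task\nClean the file.\n\n## Situation\nThe data has duplicate rows.\n"
result = grade(dispatch)
print(result["perfect"], result["critical_only"])
```

The example run clears the Critical-only bar but misses Perfect purely because of its heading choice: the kind of false negative that text-matching scorers are prone to.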

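Three rates appear throughout this page, and they answer different questions. Critical-only (per run), defined above, asks whether every critical criterion passed on a run. The other two are: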

Perfect (per run) — did all criteria pass on this run? This is the high bar, the headline metric, and the basis of the leaderboard composite.
Critical rate (per criterion) — across all of a model’s runs, what fraction of its critical criteria passed?

They diverge by design. Worked example: in a 12-run batch where 4 runs each fail exactly one normal criterion, Perfect drops to 67% (8/12 runs) while the critical rate stays at 100% — a model can look strong on criterion-level rates yet rarely deliver a fully clean run.
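
The worked example can be reproduced directly. In this sketch each run carries just two criteria, one critical and one normal, which is a simplification for illustration (real runs carry more); the criterion names are borrowed from this page.

```python
# Each run is a list of (criterion_name, is_critical, passed) verdicts.
Run = list[tuple[str, bool, bool]]


def make_run(context_ok: bool) -> Run:
    return [
        ("agent_dispatched", True, True),                   # critical, always passes here
        ("prompt_has_context_section", False, context_ok),  # normal, sometimes fails
    ]


def perfect_rate(runs: list[Run]) -> float:
    """Per run: share of runs where every criterion passed."""
    return sum(all(passed for _, _, passed in run) for run in runs) / len(runs)


def critical_rate(runs: list[Run]) -> float:
    """Per criterion: share of critical verdicts that passed, across all runs."""
    critical = [passed for run in runs for _, is_crit, passed in run if is_crit]
    return sum(critical) / len(critical)


# 12 runs; the first 4 each fail exactly one normal criterion.
batch = [make_run(context_ok=(i >= 4)) for i in range(12)]
print(f"Perfect: {perfect_rate(batch):.0%}, critical rate: {critical_rate(batch):.0%}")
# prints "Perfect: 67%, critical rate: 100%"
```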

How runs were executed

Runs execute inside the real DAAF Docker container through the claude -p command line, with all framework hooks, permission rules, and skill discovery live — nothing is mocked. Phases 2–4 resume from golden checkpoints so every model starts at an identical protocol point, and each (phase, model, case) cell is repeated up to three times. Every batch archives to a timestamped, self-contained result set from which this page is generated.

Key caveats

This is one view under one orchestration system, not a model exam. DAAF’s framework instructions were written and tuned with Claude Opus 4.5/4.6 as the working model, so part of any score difference between models is prompting-style fit rather than capability. Read these results as one particular view of how models behave under a very specific AI orchestration system — not as an absolute indicator of model quality.
All analyses were conducted via Claude Code through OpenRouter in a relatively naive fashion, and at least some of the results will be driven by models' compatibility with this specific harness combination. Tune your sense of generalizability around these benchmark results accordingly!
Denominators are small and uneven. Repetition counts are relatively small and can be uneven across providers and models due to cost limitations; some (model, phase) cells have 2 reps instead of 3. The results are therefore necessarily imprecise: read every rate together with its n, and keep in mind that each rate carries a confidence interval that is not shown.
Cost figures are estimates at published list prices, not invoices. The headline cost lens throughout this page is the battery-cost multiplier: each model’s estimated cost to run the full —-probe benchmark battery once, expressed relative to Opus 4.8 (= 1.0×), built from published list prices and each model’s observed token mix — taken from provider billing data, not the harness’s counts — on an everything-uncached basis (defined under Cost vs. Performance). These relative figures can shift as providers reprice tokens and as provider caching mechanics interact with each model’s behavior. The per-Mtok figures that remain on the page are secondary detail: raw published list rates, not measured spend. The harness’s own token logging runs through an Anthropic-compatible endpoint whose counts come from Anthropic’s tokenizer rather than each OpenRouter model’s billing meter, so dollar costs derived from those counts were unreliable and were removed. For the GPT-5.6 trio (Sol, Luna, Terra), the API-equivalent list rates used here were verified against OpenAI’s published per-model schedules on 2026-07-29. A per-request audit of the full GPT corpus further found zero requests above the 272k-token long-context pricing threshold, so the short-context rates applied here are exact for this corpus rather than a blended approximation.
Deterministic scorers have known false negatives. prompt_has_context_section is the canonical case: the scorer accepts seven heading variants, yet models often place genuine contextual content under headings outside that list and are marked down for it.
Fable 5’s thinking blocks are encrypted, so reasoning-quality assessment for that model relies entirely on observable output.
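
To get a feel for how wide the unshown intervals can be at these sample sizes, one option is a Wilson score interval for a pass rate. This is a generic statistical sketch, not something the benchmark itself computes; the choice of the Wilson method and the 95% level (`z=1.96`) are assumptions made here.

```python
from math import sqrt


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# A single cell with 2 clean runs out of 3, versus the same rate at n = 300.
small_low, small_high = wilson_interval(2, 3)
large_low, large_high = wilson_interval(200, 300)
print(f"n=3:   {small_low:.2f} to {small_high:.2f}")
print(f"n=300: {large_low:.2f} to {large_high:.2f}")
```

The same 67% rate is compatible with a very wide range of true rates at n = 3 and a much narrower one at n = 300, which is why each rate needs to be read together with its denominator.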

The Scoreboard

Leaderboard

The ranking runs by default on each model’s composite score (basically: a report-card average — take the model’s share of fully clean runs in each of the five phase components (P1, P2, P3a, P3b, P4), and average those five numbers with equal weight). Every rate shows its denominator underneath to better convey uncertainty with smaller samples. Tier bands group models whose composites cluster together: gaps between tiers are meaningful, ordering within a tier may not be. The metric toggle switches every rate, rank, and tier band between the two bars defined above: Perfect (everything exactly right — every criterion passed) and Critical-only (will it generally work — every must-pass criterion passed). Click a phase cell to open that phase’s deep-dive; click any numeric column header to re-sort, and click it again to reverse the direction.

Value for Money

Cost vs. Performance

This is the value-for-money view: each point is one model, placed by what it costs against how well it conforms, so up and to the left is better. The cost axis uses the headline cost figure from throughout this page: each model’s estimated cost to run the full DAAFBench battery once, expressed as a multiple of Opus 4.8 (= 1.0×). The multiplier folds each model’s observed token appetite into its price (basically: two models with identical list rates can differ in what the same work actually costs, because one chews through more tokens to do it; the multiplier is defined, with its caveats, in the notes below the chart). The cost axis is also logarithmic: each major gridline is 10× the previous one, which is the only way to fit models whose costs span more than a hundredfold onto one chart.

The stepped line traces the efficiency frontier, or the models no other model beats on both cost and conformance at once: anything off the line is outperformed somewhere cheaper. On the default battery-cost basis that frontier runs through six models from four different providers (two of them open-weights), spanning more than a hundredfold in cost.

The selectors switch the vertical axis between the composite and any single phase group (P1–P4) under either bar (Perfect or Critical-only), and the price-basis toggle offers two secondary views: the raw published input and output list rates in dollars per million tokens of text ($/Mtok; a million tokens is roughly 750,000 words). Those are list prices, not measured benchmark spend.

One more note on all of these dollar figures: they are API-style list pricing. If you run DAAF through a subscription plan instead (Claude Max, or ChatGPT for the GPT models; the GPT runs in this corpus were in fact executed on the ChatGPT lane), your marginal cost per analysis is effectively covered by the flat subscription price, which is often the single biggest cost-efficiency lever available before you change models at all.
Shape and color mark the serving provider; hover any point for its exact figures.
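
The battery-cost multiplier can be sketched as list price times observed token volume, divided by the same figure for the reference model. Everything numeric below is invented, and reading the "everything-uncached" basis as "all input tokens billed at the plain input rate" is an assumption about the page's definition, not a quote of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_mtok: float   # observed input tokens for the full battery, in millions
    output_mtok: float  # observed output tokens for the full battery, in millions
    input_rate: float   # list price, dollars per million input tokens
    output_rate: float  # list price, dollars per million output tokens

    def battery_cost(self) -> float:
        """Estimated dollars to run the battery once, with no cache discounts."""
        return self.input_mtok * self.input_rate + self.output_mtok * self.output_rate


def cost_multiplier(model: Usage, reference: Usage) -> float:
    """Battery cost relative to the reference model (reference = 1.0x)."""
    return model.battery_cost() / reference.battery_cost()


# Hypothetical figures. The two budget models share list rates but not token appetite.
reference = Usage(input_mtok=100, output_mtok=10, input_rate=10.0, output_rate=50.0)
chatty = Usage(input_mtok=150, output_mtok=30, input_rate=1.0, output_rate=4.0)
terse = Usage(input_mtok=100, output_mtok=10, input_rate=1.0, output_rate=4.0)

chatty_x = cost_multiplier(chatty, reference)
terse_x = cost_multiplier(terse, reference)
print(f"chatty: {chatty_x:.2f}x, terse: {terse_x:.2f}x")
```

The two budget models have identical list rates, yet the one that uses more tokens lands at nearly double the multiplier, which is the point of pricing the whole battery rather than comparing rate cards.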

The Diagnosis

Phase Deep-Dives

This is where the scoreboard turns into a diagnosis: one heatmap per phase, where rows are models ranked by that phase’s Perfect rate, and each column is one scored criterion (a single named requirement a run either meets or misses), with critical (must-pass) criteria before normal ones, split by the yellow border. Use this to find which specific protocol requirement a model misses: a model can rank well overall while reliably failing one criterion. Hover any cell for its exact pass count; click a cell to open those runs in the Run Explorer.

Repetition & Flakiness

Cases & Consistency

Two layers on the same repetition data. These models are probabilistic (basically: ask the identical question twice and you can get two different answers), so each case is run multiple times: each rep is the identical prompt run again from the identical starting point. The agreement heatmaps show, for every case and model, how many of those reps were perfect runs (k/n in every cell): split cells expose flaky cases and flaky models at a glance, and the All models margin gives each case’s cross-model difficulty, hardest cases first. The case browser below holds every case’s full prompt, expected behavior, and requirement lists, with the per-rep criterion marks. Click any cell or mark to open the matching runs.
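
A rough sketch of how the k/n cells and the All models margin could be derived from per-rep records follows. The record shape `(case, model, perfect)` is hypothetical, and pooling k and n across models for the margin is an assumption; the page does not spell out how the margin is aggregated.

```python
from collections import defaultdict


def agreement_cells(runs):
    """Map (case, model) -> (k, n): perfect reps out of total reps."""
    cells = defaultdict(lambda: [0, 0])
    for case, model, perfect in runs:
        cells[(case, model)][0] += int(perfect)
        cells[(case, model)][1] += 1
    return {key: (k, n) for key, (k, n) in cells.items()}


def split_cells(cells):
    """Cells where reps disagree: some perfect, some not."""
    return sorted(key for key, (k, n) in cells.items() if 0 < k < n)


def case_margins(cells):
    """Pooled (case, k, n) across models, hardest case first."""
    pooled = defaultdict(lambda: [0, 0])
    for (case, _model), (k, n) in cells.items():
        pooled[case][0] += k
        pooled[case][1] += n
    rows = [(case, k, n) for case, (k, n) in pooled.items()]
    return sorted(rows, key=lambda row: row[1] / row[2])


# Hypothetical reps: (case, model, was the run perfect?)
reps = [
    ("case-A", "model-x", True), ("case-A", "model-x", True), ("case-A", "model-x", True),
    ("case-A", "model-y", True), ("case-A", "model-y", False), ("case-A", "model-y", True),
    ("case-B", "model-x", False), ("case-B", "model-x", False), ("case-B", "model-x", True),
    ("case-B", "model-y", False), ("case-B", "model-y", False),
]
cells = agreement_cells(reps)
flaky = split_cells(cells)
margins = case_margins(cells)
print(flaky)
print(margins)
```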

The Audit Layer

Run Explorer

This is the audit layer of the page: every aggregate above can be traced down to the individual runs it was computed from. Each run is listed with its grade, per-criterion verdicts and diagnostic details, and condensed transcript, including subagent transcripts where they exist. The filters below apply to this section only; clicks from the heatmaps and case views above land here with the matching filters pre-applied. Status reflects grading (perfect / partial / failed / ungraded).