Scale Responsible, Rigorous, and Reproducible Research with AI
AI coding assistants can now write analytic code, manipulate complex datasets, and produce entire reports. They also hallucinate, cut corners, and constantly project confidence.
The Data Analyst Augmentation Framework is a free, open-source toolkit that turns Claude Code into a rigorous quantitative research engine with you at the helm: every step auditable, every output verifiable, every decision yours to make. Built by researchers, for researchers.
User: Hey DAAF, I'm thinking of running a regression discontinuity analysis with my own dataset. Before I do anything else, can you help me think through conceptually what assumptions/conditions are required for a good RD?
You interact with DAAF and Claude Code just like any chat window: describe what you want to do in plain, natural language. No special syntax or commands to juggle or remember.
DAAF: Thinking…
DAAF: Welcome to DAAF -- the Data Analyst Augmentation Framework. I'm your research orchestrator, here to help you work through rigorous, reproducible data analysis with human oversight at every key decision point.
Great question -- thinking through RD assumptions before touching data is exactly the right instinct. This is a conceptual/advisory conversation, so I'd classify this as Ad Hoc Collaboration mode. I'll work with you as a thought partner to think through the assumptions and conditions for a valid regression discontinuity design. No formal deliverables -- just a focused discussion. You can steer the conversation wherever it's most useful.
Sound good, or would you rather approach this differently?
DAAF intelligently determines which bespoke workflow best fits your request. Like a good collaborator, it communicates its plan and asks for your confirmation before proceeding. Ad Hoc Collaboration mode is a lightweight workflow for informal discussion and one-off tasks -- one of nine workflow styles DAAF supports.
User: That makes sense and sounds good, let's proceed
DAAF: Let me load the mode reference, then we'll dive in.
DAAF dynamically selects relevant references from a large library of carefully vetted guides ("Agent Skills") -- then embeds them into the conversation to ground its processes and thinking from that point on in known-good material that helps fight hallucinations ("Context Engineering").
DAAF: Great question to start with -- getting the conceptual foundations right before touching data is exactly how credible causal work begins. Here's a thorough walkthrough of what makes an RD design valid.
------ Regression Discontinuity: The Core Assumptions
RD exploits a known cutoff in a running variable (also called a “score” or “forcing variable”) that determines treatment assignment. The fundamental idea is that units just above and just below the cutoff are essentially identical except for their treatment status -- creating quasi-random variation near the threshold.
1. A Known, Rule-Based Cutoff
There must be a clear, predetermined threshold where treatment status changes. Examples: a test score cutoff for scholarship eligibility, an income threshold for program access, an age cutoff for policy eligibility.
The cutoff must be rule-based, not discretionary. If a committee decides who gets treated using the score as one input among many, you don't have an RD -- you have selection on observables (at best).
2. Continuity of Potential Outcomes at the Cutoff
This is the core identifying assumption: absent the treatment, outcomes would have evolved smoothly through the cutoff. In potential-outcomes notation, E[Y(0)|X=x] and E[Y(1)|X=x] must be continuous in x at the cutoff x = c.
3. No Manipulation of the Running Variable (No Sorting)
Units cannot precisely control their score to land on a preferred side of the cutoff. The McCrary (2008) density test checks whether the distribution of the running variable is smooth at the cutoff.
4. Sharp vs. Fuzzy -- Know Which You Have
Even in short interactions, DAAF grounds itself in real citations and references to avoid relying on its fuzzy general knowledge or inconsistent results from web searches. Have more follow-ups? Want to write some actual code? DAAF is your flexible and responsive collaborator every step of the way.
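To make the assumptions in the walkthrough above more concrete, here is a minimal Python sketch. It is not part of DAAF and is not output from the session shown; every function name, the bandwidth, and the simulated data are assumptions made up for illustration. It simulates a sharp RD with a known jump, estimates that jump by fitting a separate line on each side of the cutoff within a bandwidth, and compares counts just above and just below the cutoff as a crude stand-in for a formal density test such as McCrary (2008).

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(x, y, cutoff, bandwidth):
    """Sharp-RD jump: gap between the two local linear fits at the cutoff."""
    # Center the running variable so each intercept is the fitted value at the cutoff.
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y)
            if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff <= xi <= cutoff + bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


def density_ratio(x, cutoff, width):
    """Crude sorting check: observations just above vs. just below the cutoff."""
    below = sum(1 for xi in x if cutoff - width <= xi < cutoff)
    above = sum(1 for xi in x if cutoff <= xi < cutoff + width)
    return above / below


# Synthetic data, invented for illustration: the true jump at cutoff 0 is 2.0,
# and scores are drawn uniformly, so there is no sorting by construction.
rng = random.Random(0)
TRUE_JUMP = 2.0
scores = [rng.uniform(-1, 1) for _ in range(5000)]
outcomes = [1.0 + 0.5 * s + (TRUE_JUMP if s >= 0 else 0.0) + rng.gauss(0, 0.5)
            for s in scores]

jump = rd_estimate(scores, outcomes, cutoff=0.0, bandwidth=0.25)
ratio = density_ratio(scores, cutoff=0.0, width=0.1)
print(f"estimated jump: {jump:.2f} (true: {TRUE_JUMP})")
print(f"count ratio above/below cutoff: {ratio:.2f}")
```

A ratio far from 1 would be a reason to look harder for manipulation of the running variable, though it is not a test statistic. A real analysis would likely rely on a dedicated RD package with data-driven bandwidth selection and proper inference rather than a fixed bandwidth and a count ratio.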
User: Can you remind me, what measures of socioeconomic status do we have available in the College Scorecard dataset? And for what years?
Have you ever wanted to be able to just talk with your data documentation in natural language? With DAAF, you can. Anytime you want to work with a specific dataset, DAAF helps you embed your codebooks, data diagnostics, and technical papers into a persistent knowledge file the AI assistants can reference to become experts on your data.
DAAF: Thinking…
DAAF: Welcome to DAAF! Great question -- this looks like a Data Lookup: you're asking about specific variable availability in the College Scorecard dataset.
I'll check the Scorecard skill for SES-related measures and their year coverage. Sound good?
User: Yeah, that sounds right to me. Please dig in.
Just like the Ad Hoc example, DAAF classifies your request and confirms before proceeding. In Data Lookup mode, DAAF draws on carefully crafted dataset documentation to bring you grounded answers about your datasets -- far more reliable than general web-searching or relying on its fuzzy general knowledge.
DAAF: Dispatching a lookup specialist to pull SES measures from the Scorecard skill.
DAAF: Agent(search-agent, “Lookup Scorecard SES measures”)
To better handle complex workflows, DAAF delegates work to specialist assistants whenever possible ("Agent Orchestration"). Here, DAAF launches a separate data scout to load the College Scorecard dataset reference it has access to. The data scout reads variable definitions and coverage caveats to compose an authoritative, grounded answer rooted in genuine documentation.
DAAF: Here's a quick summary of what's available for SES in College Scorecard via the Urban Institute Education Data Portal:
Student Body NSLDS (1997-2016): From FAFSA/financial aid records -- individual-level SES
• faminc_mean / faminc_med -- Mean/median family income
• lowincome_pct / midincome_pct / highincome_pct -- Income tercile shares ($0-30K / $30-75K / $75K+)
• dependent_pct / independent_pct -- Dependency status shares
• first_gen_student_pct -- First-generation share
• parents_highest_ed_* -- Parent education distribution
Student Body Treasury (1997-2016): From IRS/Treasury -- area-level proxies based on home ZIP
• hhinc_home_zip_med -- Median household income in student's home ZIP
• poverty_rate_home_zip -- Poverty rate in home ZIP
• unemp_rate_home_zip -- Unemployment rate in home ZIP
Earnings (2003-2014, 2018) and Repayment NSLDS (2007-2016) also include SES-disaggregated outcomes.
------ Key Caveats
1. Title IV recipients only. All measures reflect FAFSA filers -- full-pay students (often higher-income) are excluded, so SES distributions skew lower than reality, especially at selective privates.
2. Coverage varies by institution type -- ~30-50% at selective privates vs. 80-95% at for-profits.
3. Home ZIP variables are ecological proxies, not individual/household measures.
4. Income tercile thresholds are fixed ($0-30K / $30-75K / $75K+), not inflation-adjusted across years.
5. Portal data ends at 2016 for student body datasets.
Confidence: HIGH -- sourced from the curated Scorecard skill. Year boundaries should be verified against actual data if precise coverage matters for your analysis.
DAAF reports back with some high-level findings that the user can dig into at will in more detail. Notice the Key Caveats section: DAAF is explicitly designed to proactively flag these sorts of data limitations, coverage gaps, and places where you'd want to verify further. This attention to detail and nuance is embedded in every DAAF workflow and interaction by design.
DAAF: … Okay, the next task in the pipeline is joining the resource datasets (student-faculty ratios, retention rates, and finance data) onto the core demographics table. I’ll write a detailed dispatch for the research-executor to load the prior join scripts for context on the current table structure, then write and validate a new script for this step.
DAAF drafts all analytic code in a rigorous multi-step sequence to ensure higher levels of quality before it ever reaches your review. As in the Data Lookup example, it uses specialist assistants (“subagents”) to split tasks: one specialist to draft code with preliminary validations, and another specialist to adversarially review and further inspect assumptions from multiple angles.
Before writing any code, the coding agent first loads references on data science best practices and reads the prior scripts in the sequence to understand the current table structure. Then, it writes a complete script according to spec (logging its intentions, reasoning, and assumptions via inline comments so everything is easy to track and review) and runs the script with automatic logging of all validation/diagnostic tests. Because everything passed these initial validations, it reports the successful script location and logs back to the main DAAF assistant for the next step: adversarial code review.
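As a rough illustration of what a join step with built-in validations could look like, here is a minimal Python sketch. It is not DAAF's actual script and does not reflect its real schema: the function names, column names, and toy tables are all hypothetical, and it uses only the standard library rather than whatever data libraries a real pipeline would load. It left-joins a resource table onto a core table by institution ID and then logs PASS/FAIL checks for row count, schema, and key uniqueness.

```python
def left_join(core, resource, key):
    """Left-join resource rows onto core rows; unmatched rows get None values."""
    lookup = {}
    for row in resource:
        if row[key] in lookup:
            # A duplicated key would silently multiply rows, so stop early.
            raise ValueError(f"duplicate {key} in resource table: {row[key]}")
        lookup[row[key]] = row
    # Columns contributed by the resource table (everything except the key).
    new_cols = sorted({c for row in resource for c in row if c != key})
    joined = []
    for row in core:
        match = lookup.get(row[key], {})
        joined.append({**row, **{c: match.get(c) for c in new_cols}})
    return joined


def validate(joined, expected_rows, expected_cols, key):
    """Run basic post-join checks and return (status, message) pairs."""
    results = []
    n = len(joined)
    results.append(("PASS" if n == expected_rows else "FAIL",
                    f"Row count: {n:,} (expected {expected_rows:,})"))
    missing = [c for c in expected_cols if any(c not in r for r in joined)]
    results.append(("PASS" if not missing else "FAIL",
                    f"Schema: missing columns {missing}" if missing
                    else f"Schema: All {len(expected_cols)} expected columns present"))
    keys = [r[key] for r in joined]
    results.append(("PASS" if len(keys) == len(set(keys)) else "FAIL",
                    f"Key uniqueness: {len(set(keys)):,} unique of {n:,} rows"))
    return results


# Toy tables, invented for illustration only.
core = [{"unitid": 1, "enrollment": 1200},
        {"unitid": 2, "enrollment": 300},
        {"unitid": 3, "enrollment": 8000}]
resource = [{"unitid": 1, "retention_rate": 0.81},
            {"unitid": 2, "retention_rate": 0.64}]

joined = left_join(core, resource, key="unitid")
checks = validate(joined, expected_rows=len(core),
                  expected_cols=["unitid", "enrollment", "retention_rate"],
                  key="unitid")
for status, message in checks:
    print(f"[{status}] {message}")
```

The design choice worth noting is that the checks compare the joined table against expectations stated before the join (row count of the core table, a fixed column list), so a join that drops or duplicates institutions fails loudly instead of passing through to later stages.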
DAAF: Initial script draft written with all self-tests passing from the research-executor. Now I'll dispatch the code-reviewer with the same context to take a closer look for QA.
↳ Bash(bash run_with_capture.sh scripts/cr/stage7_03_cr1.py)
[PASS] Schema: All 22 expected columns present
[PASS] Row count: 2,893 (expected 2,893)
[PASS] Coded values: None remain
[PASS] [Semantic] Model 3 regression vars present
[WARN] [Boundary] retention_rate: 26 zero values
[WARN] Finance plausibility: max=$14.1M per FTE
QA RESULT: ISSUES_FOUND -- investigating further
A completely separate code review specialist independently loads and audits the script to run its own battery of checks across schema validation, distribution analysis, boundary conditions, spec adherence, and more. It is specifically instructed to be adversarial, skeptical, and cautious; here, it flagged some suspicious data values and outliers in its initial inspection, which it decided to investigate further with additional checks.
↳ Thinking…
↳ Write(scripts/cr/stage7_03_cr2.py) (198 lines)
↳ Bash(bash run_with_capture.sh scripts/cr/stage7_03_cr2.py)
Hypothesis 1 (finance outliers): Extreme values from law schools and medical schools -- source data, not join error
Hypothesis 2 (zero retention): 26 institutions, all small or specialized -- legitimate, not coded-zero artifacts
Further investigation needed: NO
DAAF: Great inspections and QA report from the code-reviewer. Sounds like all concerns were resolved and we can proceed to the next steps, but I'll log this for the user in our running session notes and mention it at the next checkpoint for their awareness.
After thoroughly explaining and testing its concerns in a new script, the code-reviewer agent drafts an in-depth summary QA report to pass back to the main DAAF assistant. While this specific analysis proceeds without issues, the above drafting-reviewing loop can iterate as many times as needed if any problems need resolution before proceeding. All issues get flagged for human review at regular checkpoints, ensuring visibility into the problems and solutions along the way. Most importantly, every script version and QA check is fully traceable and reproducible by default.
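The reviewer's follow-up pattern described above, flagging boundary values and then checking whether they cluster in a plausible subgroup, can be sketched in a few lines of Python. This is a hypothetical illustration only: it does not reproduce the contents of stage7_03_cr2.py, and the function names, the "sector" column, and the toy rows are all invented.

```python
from collections import Counter


def boundary_check(rows, column, lower=None, upper=None):
    """Return rows whose value in `column` sits at or beyond a suspicious bound."""
    flagged = []
    for row in rows:
        value = row[column]
        if value is None:
            continue  # Missing values are a separate check; skip them here.
        too_low = lower is not None and value <= lower
        too_high = upper is not None and value >= upper
        if too_low or too_high:
            flagged.append(row)
    return flagged


def summarize_flags(flagged, group_column):
    """Count flagged rows by group to see whether they cluster in one subgroup."""
    return Counter(row[group_column] for row in flagged)


# Toy rows, invented for illustration only.
rows = [
    {"unitid": 1, "sector": "large public", "retention_rate": 0.82},
    {"unitid": 2, "sector": "small specialized", "retention_rate": 0.0},
    {"unitid": 3, "sector": "small specialized", "retention_rate": 0.0},
    {"unitid": 4, "sector": "private nonprofit", "retention_rate": 0.91},
]

zeros = boundary_check(rows, "retention_rate", lower=0.0)
by_sector = summarize_flags(zeros, "sector")
status = "WARN" if zeros else "PASS"
print(f"[{status}] [Boundary] retention_rate: {len(zeros)} zero values")
print(f"Flagged rows by sector: {dict(by_sector)}")
```

A breakdown like this only informs the judgment call. Whether clustered zeros are legitimate values or coded-missing artifacts still has to be decided against the source documentation and reviewed by a human at the next checkpoint.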
User: I'm aware that graduation rates are often thought of as a key outcome for assessing a university/college's quality by the general public, but many researchers argue that there's a very strong question of chicken-or-the-egg in interpreting it that way: Are graduation rates high because the college actually did a good job in serving its students, or are graduation rates high because the college selectively admits students who are already highly competitive and academically prepared and likely to graduate/succeed anyway?
I'd like to more critically explore this dynamic with data to better understand how correlated these things are, especially when thinking about additional complicating institutional factors like share of students on financial aid, other underserved or historically disadvantaged student population rates, etc. I'd like an analysis that helps provide an intuitive and holistic view on how these factors all relate to one another, and what implications that might have for broadly thinking about college 'quality' in general.
Users can genuinely pose any arbitrarily complex question or task to DAAF. Just describe what you want to understand, and DAAF will kick off a careful and deliberate workflow that scales to your needs. Regardless of complexity, DAAF always works closely with you to identify viable data sources, decide on the best analytical approaches, track all artifacts and outputs, and structure the work -- keeping your expertise and insight at the center of all key decision-making.
DAAF: Thinking…
DAAF: Welcome to DAAF! This is a really compelling research question -- you're essentially asking: how much of the variation in graduation rates is attributable to what colleges do versus who they admit?
I'm classifying this as Full Pipeline Mode -- you're asking for a substantive, multi-dimensional analysis that will require pulling data from multiple federal sources (IPEDS admissions, graduation rates, student financial aid, enrollment demographics), joining them at the institution level, and producing visualizations and statistical models.
I'll walk you through 5 phases with human checkpoints at every stage. Shall I proceed?
From here, DAAF launches into a multi-stage research workflow that goes all the way from initial data scoping to a completed, fully reproducible data analysis pipeline and summary analytic report with key findings, methodology/process, data visualizations, limitations, and opportunities for extension. DAAF does this by orchestrating scalable and complex workflows that intelligently stitch together everything you've seen in the earlier examples: dynamically loading curated reference skills for grounded expertise, dispatching specialist agents for targeted tasks, and passing every core task (planning, coding, and writing) through adversarial review along the way.
DAAF pauses and engages with you at every key decision point for weigh-in, review, and redirection, but in the end, a completed Full Pipeline analysis can take as little as 30 minutes of direct human involvement. Want to see how that works?
Everything you need to conduct real data analysis with Claude Code out-of-the-box: in-depth methodological support, powerful Python analysis libraries, and myriad quality-of-life features.
With so much at stake, how do we actually fight the threat of AI slop? DAAF layers together a suite of architectural defenses and strategies from the current frontier of AI best practices to maximize AI output quality and force Claude Code to operate more like a careful and thoughtful researcher at every opportunity. Because LLM hallucinations can never be fully eliminated, the last line of defense and final call always rests with you as the PI.
A force-multiplying exoskeleton for your expertise
The goal of DAAF isn't to automate research writ large, and we firmly reject attempts to remove human researchers from the process. Instead, the goal is to remove the frictions and bottlenecks that prevent your hard-earned judgment and expertise from reaching further: in rigor, in scope, and in impact. When we harness the power and speed of Claude Code in a framework of responsibility, rigor, and reproducibility, what becomes possible?
Enhance your analysis
Explore your data more deeply and interactively · Rapidly revise and extend your analytic pipelines · Iterate on novel data visualizations · Quickly implement alternative specifications and modeling approaches · Verify your code from every angle
Enhance your scope
Rapidly survey and integrate new data sources · Streamline collaborations with new coauthors and students · Implement new methodological toolsets · Generate complete reproducibility packages · Manage more complex data pipelines and workflows · Build interactive dashboards for disseminating research
Enhance your impact
Reach more audiences and stakeholders with summary websites and interactive explainers · Rapidly prototype new directions and frontiers for your research agenda · Respond to time-sensitive events and news with real evidence, sooner · Examine new geographies and contexts with lower barriers to data harmonization · Share new methodologies instantly with DAAF Skills
DAAF workshops at institutions like Northwestern University, University of Virginia, Stanford University, Bowdoin College, Urban Institute, Association for Education Finance and Policy, and many more
Install Docker Desktop (free) and set up an Anthropic account ($100-200/mo Max subscription recommended).
02
Install DAAF
Run the install command in Terminal (macOS/Linux) or PowerShell (Windows). Docker handles the rest -- dependencies, environment, security, the works.
03
Begin
Launch DAAF and start talking with Claude Code about whatever you want to work on. That's really it. No prior experience with Terminal, Docker, or Claude Code required.
Public good infrastructure for the future of research
DAAF is built on a simple conviction: AI-empowered research tools should be a public good for the benefit of society. As an open-source project under the LGPL-3.0 license, DAAF will always be free. No premium tiers, no feature gates, no bait-and-switch, nothing to hide. My hope is that DAAF will only continue to get better as more researchers contribute, critique, extend, and build on top of DAAF to push the boundaries of what we can do with rigorous AI-assisted research, together.