Philosophy & Vision - Learn

This document seeks to grapple with some of the bigger implications of DAAF and AI in research more generally. For most of these, I'll do my best to share my informed thoughts and current awareness of a topic. These questions don't have simple answers -- but thinking through them carefully is part of using these tools responsibly. Here's where my thinking stands. This will be a constant work-in-progress as questions and discussions arise!

What's covered

About DAAF

What DAAF is trying to achieve
How the project started

Trust & Responsibility

The risk of misuse and misinformation
How much to trust AI-generated analysis
Environmental and energy costs

Future of Research

Where AI adds the most value in research
Where the world of research goes from here
Implications for equity in academia
What this means for the next generation
How to think about the pace of AI progress

About DAAF

What are the goals behind DAAF?

DAAF has four primary goals:

Awareness -- Many researchers are simply not aware of the pace of progress in AI and how it can genuinely benefit their work today. Whether people end up building on DAAF or using it at all, it's most important that it serves as a way to spread awareness and accelerate the very important conversations the research community needs to be grappling with as AI continues to advance.
Education -- DAAF is designed to be an approachable and educational on-ramp to this new frontier of agentic AI systems. By making it easy to both install and start to understand -- with educational materials, tutorials, blog deep-dives, and videos via the DAAF Field Guide ↗ -- the goal is to rapidly accelerate the readiness of the scientific community to critically engage with AI disruption.
Establishing High Standards -- These tools are here and will only continue to proliferate. The goal is to ensure that researchers, and the stakeholders downstream who benefit from their research, can be critical and careful consumers of these tools -- with DAAF as a clear demonstration of what responsible AI-assisted research can and should look like.
Extending research reach -- Helping researchers with deep domain expertise but limited technical staff produce rigorous, reproducible work faster and more comprehensively, so they can discover and learn more about our world.

Beyond these specific goals, DAAF is also intended to serve as a practical demonstration of what responsible AI-assisted research can and should look like -- a floor that the community can build from and improve upon, not a ceiling. If better tools emerge that meet or exceed these standards for transparency, rigor, and reproducibility, that's a genuinely great and desired outcome of my work here. The point is to make those principles concrete and testable rather than leaving the conversation entirely abstract.

What's the story behind DAAF? How did this project start?

DAAF began in the summer of 2025 after its creator, Brian Heseung Kim, first read Dr. Korinek's working paper on AI Agents for Economic Research ↗. The central question was: With a sufficiently carefully crafted set of agents, why wouldn't it be possible to accelerate research work on multiple frontiers? And with detailed and intentional prompting instructions, why wouldn't it be possible to make an AI assistant as cautious and thoughtful about data as a trained researcher?

Initial exploration found that agentic frameworks were still too immature and model quality wasn't quite there yet. That changed in late November 2025 with the release of Claude Opus 4.5 ↗ and the increasing capabilities of agentic workflows via Claude Code ↗. Experimentation began in December 2025, actual development in January 2026. When it became clear just how well it handled data nuances, documentation issues, and harmonization challenges, the decision was made to move beyond a personal project as soon as possible. Several very intense marathon weekends of coding, testing, writing, and scoping with colleagues later -- and here we are today.

Brian's background as a former high school English teacher turned education policy researcher is part of why so much of DAAF is packaged as an educational endeavor. He holds a PhD in education policy from the University of Virginia, and his work on responsible use of ML/AI in social science research ↗ goes back to roughly 2018. His dissertation ↗ -- completed in mid-2022, months before ChatGPT was released -- focused on teaching researchers how to use these tools responsibly, and he continues ↗ to work ↗ on that frontier ↗ today. While his research career has been focused on formalizing research toolsets and rigorous guardrails, one of his most important value-adds today is in helping peers and colleagues rapidly skill-up on this frontier.

Trust & Responsibility

Won't this tool lead to misuse and misinformation?

Absolutely. This is arguably true for literally any statistical or scientific method or tool that's ever existed, but the stakes here do feel different and meaningfully exacerbated by the involvement of AI more generally.

This is the danger of tools like DAAF writ large. The position behind releasing DAAF is to accept the situation for what it is and shoulder as much responsibility as possible: People are doing this already even without DAAF. People will continue to do this without DAAF. By releasing DAAF, the goal is to hyper-accelerate past the "everyone just tinkers in the shadows and plays with gasoline" phase and straight to the phase where the scientific community better understands what it's dealing with, has strong initial principles to work with, and can engage in genuinely informed discourse about the best ways forward. Hence the enormous amount of writing and educational materials, the focus on building an open-source community, and sharing this as broadly as possible.

The question is not whether AI will be used for research -- it already is, often without any guardrails at all. The question is whether the people using AI for research have access to a framework that enforces validation, transparency, and reproducibility by default. DAAF's guardrails don't make misuse impossible, but they make it significantly harder to produce misleading results accidentally. Every analysis has a plan you can review, validation checkpoints you can inspect, and a complete audit trail.

A more optimistic view: DAAF can also become a critical tool for fighting misinformation. With the right experts at the helm, producing genuinely transparent, rigorous, reproducible insights to counter misinformation directly is back within the realm of possibility -- where before it was a lost game immediately.

What's the appropriate level of trust for AI-generated analysis?

This is going to be a constantly moving target day-to-day and query-to-query -- so any hard rules here will probably be wrong most of the time. But one thing is clear across the board: we should be extremely slow to trust AI assistance at this point in time. All of the main issues with LLM assistance -- lying, hallucination, sycophancy, laziness -- will apply as long as LLMs are the main paradigm for AI.

Moreover: we have decades of experience understanding how human analysts make errors. We have much less experience understanding how LLMs fail in data contexts. Until that experience base grows, extra skepticism is warranted. What leads into failure states? Why do they happen? What do they look like? These are complicated questions made all the more complicated by their non-deterministic nature. Failure states may not even be common to all LLMs, but rather an idiosyncrasy of a specific version of a specific model family -- which means thoroughness and criticality are essential given how fast these models change.

One framing that's useful: the goal of a tool like DAAF is not to make AI outputs trustworthy -- it's to shift the distribution of output quality far enough that the outputs become worth reviewing. Without guardrails, the ratio of useful-to-useless AI output for research tasks is too low to justify the time spent checking it. With enough structure, context, and verification layered in -- especially when instructed to do so adversarially with a fresh context or with a totally separate model (e.g., using Opus to critique Codex, and vice versa) -- that ratio can shift to the point where what comes back is 80-90% of the way there, often enough that the remaining review and correction work is genuinely worth your time.

Quality distribution curve centered near the Acceptable threshold, with much of the distribution falling below the line -- representing AI output without guardrails — Without guardrails

Quality distribution curve shifted right past the Acceptable threshold, with most of the distribution above the line -- representing AI output with DAAF's structure and verification — With structure and verification

Note that identifying and evaluating output relative to the threshold of "Acceptable" requires an expert human in the loop.

But "worth reviewing" is a very different standard from "trustworthy," and keeping that distinction front and center is important. Your job as the researcher is to bring the domain expertise, methodological judgment, and critical eye that turns "worth reviewing" into "trustworthy" -- or identifies where it falls short. Be overly cautious, get informed, make your own judgments, and update them constantly.

What about the environmental and energy costs?

Alas, the reality is: This tool will contribute to the boiling of our oceans and contaminate the aquifers of many a community, in addition to all the other well-documented environmental impacts along the way. It relies on frontier models and it works them EXTREMELY hard. A single full-pipeline analysis involves dozens of subagent calls, each consuming significant compute. The iterative validation approach -- executing a script, then running an adversarial review, then potentially revising and reviewing again -- multiplies the compute cost compared to a single-pass approach. The aggregate energy footprint is substantial.

The hope is two-fold: First, by formalizing a strong framework for genuinely useful AI-assisted research, DAAF pushes people past the highly wasteful dead-ends of experimenting with AI on their own and spinning their wheels or producing endless AI slop. Second, this helps meaningfully advance on core issues facing our society -- a much worthier endeavor than many of the extremely wasteful applications of enormous AI compute power (meme videos, AI influencers, AI-generated ads, terrifying deepfakes). That said, "less wasteful than the alternative" is not the same as "sustainable."

More broadly, the research community needs to develop norms and standards around the environmental costs of AI-assisted research, just as we have developed ethics and norms around other resource-intensive research methods. The cost is real, and it should be part of the calculus -- but it should be weighed against the full picture. DAAF should be as efficient as possible for the quality level it produces, and optimizing subagent calls, reducing unnecessary validation passes, implementing intelligent caching, and routing simpler tasks to lighter models are all legitimate paths to reducing the environmental cost without sacrificing rigor.

Future of Research

What's the main value-add for AI assistance in research?

The research community is getting a lot of misleading signals from both the AI hype machine ("AI will do all the research!") and from the AI skeptics ("AI cannot contribute anything meaningful to research!"). Neither is right. What DAAF and DAAF-like frameworks can probably meaningfully enhance now falls into four areas:

Data wrangling and pipeline construction -- The mechanical work of writing fetch scripts, cleaning code, join operations, reshaping, and aggregation. This is typically 60-80% of the labor in a quantitative research project, and it benefits most from AI assistance. Not because the AI does it perfectly -- it does not -- but because it does it capably enough that even with extensive validation and revision, the net reductions in mechanical burden and time are enormous.
Systematic validation -- Checking data types, missing values, whether data tables connect correctly, and output completeness at every stage. AI has, basically, infinite time -- it will check all the tedious things humans skip when tired or under pressure. DAAF's code reviewer agent catches real bugs. The plan checker catches real design issues. No amount of systematic validation will eliminate the need to closely review outputs, but it certainly helps and can be an extremely useful layer on top of any workflow.
Documentation generation -- Documenting data processing decisions, creating audit trails, cataloging variable definitions, describing what scripts do, highlighting data idiosyncrasies. Most researchers under-invest in documentation because it doesn't produce new findings. DAAF generates it as a natural byproduct of the workflow. Even providing existing human-written scripts and having AI supplement the documentation would be an enormous value-add for transparency and reproducibility.
Initial data exploration -- Quickly surveying what data exists, what's available, and what's feasible before committing to an analysis design. Huge value-add given the enormous array of poorly cleaned and documented data out there that is still worth analyzing given the right ingestion processes.

Where does the world of research go from here?

Tools like DAAF have the potential to fundamentally change our relationship to research. In the current paradigm, every researcher needs to be extremely strategic about what they study -- publication biases, the "searching for your keys under the lightpost" issue, and the pressure to maximize publishability all distort what should be a more egalitarian and objective process. When you only have so much time and attention, you have to "kill your darlings" and focus on the questions most likely to produce "useful results" -- which is importantly distinct from pursuing the questions that can actually produce the most useful and impactful results.

What does research look like when we can explore every competing hypothesis at the same time? When examining multiple subsamples and subpopulations is costless? Conversely, how do we avoid endless p-hacking and spurious results? There are a lot of questions here, but the immediate-term prospects really change a lot of what we currently understand about science generation and truth-seeking.

More concretely, the near-term trajectory looks something like piecemeal acceleration -- not wholesale automation, but a steady acceleration of specific research tasks. Some tasks (data wrangling, validation, documentation) are already dramatically faster today. Others (formulating good research questions, interpreting results in institutional context, collecting new data) remain fundamentally human work. The interesting question is what happens when the acceleratable parts become essentially costless: when running an additional robustness check or exploring an alternative specification takes minutes instead of days, researchers could explore the full reasonable specification space for a given analysis almost as a matter of course, reshaping how we think about p-hacking and researcher degrees of freedom.

Looking further out, one of the most exciting prospects is what happens to the value of data collection itself. If the bottleneck shifts from "we don't have enough researcher-hours to analyze what we have" to "we've largely exhausted what existing public data can tell us," then the highest-value contribution becomes collecting new data and sharing it publicly -- especially for understudied populations, unfashionable questions, and geographies that the current incentive structure systematically underinvests in. There's a longer piece on the DAAF Field Guide ↗ exploring the full argument about this trajectory.

What are the implications for equity in academia?

Yes and no. The uncomfortable part first: frontier AI models are expensive, and the current pricing ($100-200/month for a subscription, or $50-100+ per analysis run through the API) is a real cost that is not equally accessible. Moreover, current pricing is in many cases artificially low -- we know the playbook by now.

That said, the full picture is more complicated and -- cautiously -- more optimistic than the sticker shock suggests. Cost efficiency is one of the most intensely competitive frontiers in AI right now, because everyone has a vested interest in making inference cheaper: the model providers, the hardware companies, the open-source community, and users. And on the open-source side specifically, models are catching up at a pace that is frankly startling. At time of writing, Qwen 3 Coder Next -- an open-source model you can run on your own hardware -- is hitting 70.6% on SWE-Bench (roughly similar to Claude Sonnet 4.5) with only 3B active parameters, meaning it runs on a single consumer GPU. That would have been unthinkable even six months ago. This trajectory fundamentally changes both the likely costs of models sufficiently capable to run DAAF-like systems and, critically, who we'd even need to pay to do so (if anyone). Competition and performance on the frontier is something we explicitly track for DAAF with DAAFBench, which measures how faithfully different models and provider routes follow DAAF's analytical protocol -- not the scientific quality of the research itself -- to better inform users about these trade-offs and the actual viability of cheaper models for research workflows.

But even holding aside cost: research academia and academic publishing are already immensely unequal systems. Researchers at prestigious institutions don't just have more resources by default -- they get better looks at publishing, grants, and fellowships because of the name on their letterhead. The prestige pipeline is self-reinforcing and stubbornly persistent.

What tools like DAAF can do is fundamentally change the ability for a scrappy research lab with deep domain expertise but limited staff and funding to produce an immense wealth of rigorous, reproducible research. The researcher at a teaching-focused institution who has 10 hours a week for research instead of 40 can now produce analysis at a pace that was previously impossible without a full research team -- arguably, work that in the prior paradigm would've required several orders of magnitude more financial resources. The doctoral student who can't afford a research assistant can get pipeline construction and validation assistance from AI. The non-profit analyst who needs to turn around a policy brief in a week instead of a month can do so without cutting corners on rigor. That means really good things for the long tail of research that the current incentive structure systematically underinvests in -- smaller subpopulations, understudied regions, unfashionable but important questions -- not because they aren't valuable, but because there simply aren't enough researcher-hours to go around.

Will this be a chaotic transition? Absolutely. Will the benefits be distributed unevenly, especially in the short term? Almost certainly. But this only works if the tools are free and accessible -- which is why DAAF is open source and always will be. DAAF doesn't care what institution you're at. It cares whether you know your domain well enough to ask the right questions and catch the wrong answers. And that is a potentially more meritocratic foundation than what we've been working with.

What does this all mean for the next generation of researchers?

There's really no way to think about this one with all that much optimism. Things are changing too quickly, and responding well fundamentally requires knowing what research work will look like in five years -- and all bets are off right now.

Honestly: I do not know how we train people to have the skilled expertise required to run and review something like DAAF, without having extensive experience being in the trenches of the very work that DAAF and any LLM coding agent does for you. The exoskeleton metaphor assumes the human inside it knows how to manipulate things and do work without the exoskeleton. If you have never manually cleaned a dataset, you will not know what to look for when reviewing AI-generated cleaning code. If you have never debugged a failed join, you will not recognize the signs of a join that "succeeded" incorrectly.

Five things worth thinking about:

We probably need to teach both -- The next generation of researchers needs to learn traditional data skills -- manual data wrangling, code writing, debugging -- AND they need to learn how to supervise, validate, and critically evaluate AI-assisted work. These are complementary skills, not substitutes. Any tool can be used well and poorly for learning -- the same will be true here.
Critical evaluation becomes a very valuable skill -- The ability to look at an analysis and identify what's wrong, what's missing, and what's misleading becomes more important than the ability to produce the analysis in the first place. How to ask: "Is this actually answering the question I asked, or is it answering a different question that the AI substituted?"
Domain expertise becomes more valuable, not less -- When the mechanical parts of research are accelerated, the bottleneck shifts to the parts that require genuine expertise: formulating good questions, choosing appropriate methods, interpreting results in context. When a researcher like that can effectively orchestrate AI assistance to multiply their expertise, they can do so much more for the world.
We need to be honest with students about what is happening -- The pace of AI development is genuinely frightening. The landscape that today's graduate students will practice in is dramatically different from the one they are being trained for. They deserve honest conversations about what's changing and what they need to be prepared for. No one knows all the answers, so we need to engage them as peers with equal stake as we muddle through together.
Hands-on experience with AI is non-negotiable -- even for skeptics -- This is somewhat ironic advice, because part of the concern is that using these tools will erode foundational skills. But the alternative -- forming opinions about AI capabilities and limitations purely from secondhand accounts -- leaves researchers poorly equipped to contribute to the conversations that will shape how the field engages with these tools. Informed skepticism requires direct experience.

DAAF is an educational endeavor as much as a technical one. The framework itself may not be useful in six months with how fast the field is moving. The deeper aspiration is helping researchers engage with AI disruption thoughtfully, critically, and with their eyes wide open.

How should I think about the pace of AI progress?

Three things that most informed people in this space broadly agree on right now, even when they disagree about almost everything else:

There is immense power and potential in AI for research -- the capability ceiling is rising fast.
There are significant, real problems that require honest engagement -- hallucination, equity, environmental costs, skill atrophy, and more.
Current power structures and incentive systems risk amplifying the negatives if the research community doesn't actively participate in shaping how these tools develop.

Beyond that consensus, the honest answer is: no one knows exactly where this is going. But the tools available today -- DAAF included -- are the worst they will ever be. Models will get smarter, tooling will get more sophisticated, and community best practices will mature. The specific technical implementation of DAAF may not survive the next round of model improvements in its current form, and that's fine. The deeper contribution is helping establish principles and start informed conversations that will outlast any particular tool.

Engaging with these tools now, while they're still imperfect enough to clearly show their seams, is arguably the best time to develop the critical intuitions you'll need as they get better. Waiting for the tools to be "ready" means missing the window where the failure modes are most visible and most instructive.

← Previous: Extending DAAF Next →