Grappling with the bigger questions about AI, research, and what comes next
This document seeks to grapple with some of the bigger implications of DAAF and AI in research more generally. For most of these, I'll do my best to share my informed thoughts and current awareness of a topic. These questions don't have simple answers -- but thinking through them carefully is part of using these tools responsibly. Here's where my thinking stands. This will be a constant work-in-progress as questions and discussions arise!
DAAF has four primary goals:
Beyond these specific goals, DAAF is also intended to serve as a practical demonstration of what responsible AI-assisted research can and should look like -- a floor that the community can build from and improve upon, not a ceiling. If better tools emerge that meet or exceed these standards for transparency, rigor, and reproducibility, that's a genuinely great and desired outcome of my work here. The point is to make those principles concrete and testable rather than leaving the conversation entirely abstract.
DAAF began in the summer of 2025 after its creator, Brian Heseung Kim, first read Dr. Korinek's working paper on AI Agents for Economic Research ↗. The central question was: With a sufficiently carefully crafted set of agents, why wouldn't it be possible to accelerate research work on multiple frontiers? And with detailed and intentional prompting instructions, why wouldn't it be possible to make an AI assistant as cautious and thoughtful about data as a trained researcher?
Initial exploration found that agentic frameworks were still too immature and model quality wasn't quite there yet. That changed in late November 2025 with the release of Claude Opus 4.5 ↗ and the increasing capabilities of agentic workflows via Claude Code ↗. Experimentation began in December 2025, actual development in January 2026. When it became clear just how well it handled data nuances, documentation issues, and harmonization challenges, the decision was made to move beyond a personal project as soon as possible. Several very intense marathon weekends of coding, testing, writing, and scoping with colleagues later -- and here we are today.
Brian's background as a former high school English teacher turned education policy researcher is part of why so much of DAAF is packaged as an educational endeavor. He holds a PhD in education policy from the University of Virginia, and his work on responsible use of ML/AI in social science research ↗ goes back to roughly 2018. His dissertation ↗ -- completed in mid-2022, months before ChatGPT was released -- focused on teaching researchers how to use these tools responsibly, and he continues ↗ to work ↗ on that frontier ↗ today. While his research career has been focused on formalizing research toolsets and rigorous guardrails, one of his most important value-adds today is in helping peers and colleagues rapidly skill-up on this frontier.
Absolutely. This is arguably true for literally any statistical or scientific method or tool that's ever existed, but the stakes here do feel different and meaningfully exacerbated by the involvement of AI more generally.
This is the danger of tools like DAAF writ large. The position behind releasing DAAF is to accept the situation for what it is and shoulder as much responsibility as possible: People are doing this already even without DAAF. People will continue to do this without DAAF. By releasing DAAF, the goal is to hyper-accelerate past the "everyone just tinkers in the shadows and plays with gasoline" phase and straight to the phase where the scientific community better understands what it's dealing with, has strong initial principles to work with, and can engage in genuinely informed discourse about the best ways forward. Hence the enormous amount of writing and educational materials, the focus on building an open-source community, and sharing this as broadly as possible.
The question is not whether AI will be used for research -- it already is, often without any guardrails at all. The question is whether the people using AI for research have access to a framework that enforces validation, transparency, and reproducibility by default. DAAF's guardrails don't make misuse impossible, but they make it significantly harder to produce misleading results accidentally. Every analysis has a plan you can review, validation checkpoints you can inspect, and a complete audit trail.
A more optimistic view: DAAF can also become a critical tool for fighting misinformation. With the right experts at the helm, producing genuinely transparent, rigorous, reproducible insights to counter misinformation directly is back within the realm of possibility -- where before it was a lost game immediately.
This is going to be a constantly moving target day-to-day and query-to-query -- so any hard rules here will probably be wrong most of the time. But one thing is clear across the board: we should be extremely slow to trust AI assistance at this point in time. All of the main issues with LLM assistance -- lying, hallucination, sycophancy, laziness -- will apply as long as LLMs are the main paradigm for AI.
Moreover: we have decades of experience understanding how human analysts make errors. We have much less experience understanding how LLMs fail in data contexts. Until that experience base grows, extra skepticism is warranted. What leads into failure states? Why do they happen? What do they look like? These are complicated questions made all the more complicated by their non-deterministic nature. Failure states may not even be common to all LLMs, but rather an idiosyncrasy of a specific version of a specific model family -- which means thoroughness and criticality are essential given how fast these models change.
One framing that's useful: the goal of a tool like DAAF is not to make AI outputs trustworthy -- it's to shift the distribution of output quality far enough that the outputs become worth reviewing. Without guardrails, the ratio of useful-to-useless AI output for research tasks is too low to justify the time spent checking it. With enough structure, context, and verification layered in -- especially when instructed to do so adversarially with a fresh context or with a totally separate model (e.g., using Opus to critique Codex, and vice versa) -- that ratio can shift to the point where what comes back is 80-90% of the way there, often enough that the remaining review and correction work is genuinely worth your time.
Note that identifying and evaluating output relative to the threshold of "Acceptable" requires an expert human in the loop.
But "worth reviewing" is a very different standard from "trustworthy," and keeping that distinction front and center is important. Your job as the researcher is to bring the domain expertise, methodological judgment, and critical eye that turns "worth reviewing" into "trustworthy" -- or identifies where it falls short. Be overly cautious, get informed, make your own judgments, and update them constantly.
Alas, the reality is: This tool will contribute to the boiling of our oceans and contaminate the aquifers of many a community, in addition to all the other well-documented environmental impacts along the way. It relies on frontier models and it works them EXTREMELY hard. A single full-pipeline analysis involves dozens of subagent calls, each consuming significant compute. The iterative validation approach -- executing a script, then running an adversarial review, then potentially revising and reviewing again -- multiplies the compute cost compared to a single-pass approach. The aggregate energy footprint is substantial.
The hope is two-fold: First, by formalizing a strong framework for genuinely useful AI-assisted research, DAAF pushes people past the highly wasteful dead-ends of experimenting with AI on their own and spinning their wheels or producing endless AI slop. Second, this helps meaningfully advance on core issues facing our society -- a much worthier endeavor than many of the extremely wasteful applications of enormous AI compute power (meme videos, AI influencers, AI-generated ads, terrifying deepfakes). That said, "less wasteful than the alternative" is not the same as "sustainable."
More broadly, the research community needs to develop norms and standards around the environmental costs of AI-assisted research, just as we have developed ethics and norms around other resource-intensive research methods. The cost is real, and it should be part of the calculus -- but it should be weighed against the full picture. DAAF should be as efficient as possible for the quality level it produces, and optimizing subagent calls, reducing unnecessary validation passes, implementing intelligent caching, and routing simpler tasks to lighter models are all legitimate paths to reducing the environmental cost without sacrificing rigor.
The research community is getting a lot of misleading signals from both the AI hype machine ("AI will do all the research!") and from the AI skeptics ("AI cannot contribute anything meaningful to research!"). Neither is right. What DAAF and DAAF-like frameworks can probably dramatically accelerate now falls into four areas:
Tools like DAAF have the potential to fundamentally change our relationship to research. In the current paradigm, every researcher needs to be extremely strategic about what they study -- publication biases, the "searching for your keys under the lightpost" issue, and the pressure to maximize publishability all distort what should be a more egalitarian and objective process. When you only have so much time and attention, you have to "kill your darlings" and focus on the questions most likely to produce "useful results" -- which is importantly distinct from pursuing the questions that can actually produce the most useful and impactful results.
What does research look like when we can explore every competing hypothesis at the same time? When examining multiple subsamples and subpopulations is costless? Conversely, how do we avoid endless p-hacking and spurious results? There are a lot of questions here, but the immediate-term prospects really change a lot of what we currently understand about science generation and truth-seeking.
More concretely, the near-term trajectory looks something like piecemeal acceleration -- not wholesale automation, but a steady acceleration of specific research tasks. Some tasks (data wrangling, validation, documentation) are already dramatically faster today. Others (formulating good research questions, interpreting results in institutional context, collecting new data) remain fundamentally human work. The interesting question is what happens when the acceleratable parts become essentially costless: when running an additional robustness check or exploring an alternative specification takes minutes instead of days, researchers could explore the full reasonable specification space for a given analysis almost as a matter of course, reshaping how we think about p-hacking and researcher degrees of freedom.
Looking further out, one of the most exciting prospects is what happens to the value of data collection itself. If the bottleneck shifts from "we don't have enough researcher-hours to analyze what we have" to "we've largely exhausted what existing public data can tell us," then the highest-value contribution becomes collecting new data and sharing it publicly -- especially for understudied populations, unfashionable questions, and geographies that the current incentive structure systematically underinvests in. There's a longer piece on the DAAF Field Guide ↗ exploring the full argument about this trajectory.
Yes and no. The uncomfortable part first: frontier AI models are expensive, and the current pricing ($100-200/month for a subscription, or $50-100+ per analysis run through the API) is a real cost that is not equally accessible. Moreover, current pricing is in many cases artificially low -- we know the playbook by now.
That said, the full picture is more complicated and -- cautiously -- more optimistic than the sticker shock suggests. Cost efficiency is one of the most intensely competitive frontiers in AI right now. On the open-source side specifically, models are catching up at a startling pace -- as of early 2026, open-source models like Qwen 3 Coder Next are hitting shockingly high coding benchmark performance levels while running on a single consumer GPU. That would have been unthinkable even six months ago. This trajectory fundamentally changes both the likely costs and who we'd even need to pay.
But even holding aside cost: research academia and academic publishing are already immensely unequal systems. Researchers at prestigious institutions don't just have more resources by default -- they get better looks at publishing, grants, and fellowships because of the name on their letterhead. The prestige pipeline is self-reinforcing and stubbornly persistent.
What tools like DAAF can do is fundamentally change the ability for a scrappy research lab with deep domain expertise but limited staff and funding to produce an immense wealth of rigorous, reproducible research. The researcher at a teaching-focused institution who has 10 hours a week for research instead of 40 can now produce analysis at a pace that was previously impossible without a full research team -- arguably, work that in the prior paradigm would've required several orders of magnitude more financial resources. The doctoral student who can't afford a research assistant can get pipeline construction and validation assistance from AI. The non-profit analyst who needs to turn around a policy brief in a week instead of a month can do so without cutting corners on rigor. That means really good things for the long tail of research that the current incentive structure systematically underinvests in -- smaller subpopulations, understudied regions, unfashionable but important questions -- not because they aren't valuable, but because there simply aren't enough researcher-hours to go around.
Will this be a chaotic transition? Absolutely. Will the benefits be distributed unevenly, especially in the short term? Almost certainly. But this only works if the tools are free and accessible -- which is why DAAF is open source and always will be. DAAF doesn't care what institution you're at. It cares whether you know your domain well enough to ask the right questions and catch the wrong answers. And that is a potentially more meritocratic foundation than what we've been working with.
There's really no way to think about this one with all that much optimism. Things are changing too quickly, and responding well fundamentally requires knowing what research work will look like in five years -- and all bets are off right now.
Honestly: I do not know how we train people to have the skilled expertise required to run and review something like DAAF, without having extensive experience being in the trenches of the very work that DAAF and any LLM coding agent does for you. The exoskeleton metaphor assumes the human inside it knows how to manipulate things and do work without the exoskeleton. If you have never manually cleaned a dataset, you will not know what to look for when reviewing AI-generated cleaning code. If you have never debugged a failed join, you will not recognize the signs of a join that "succeeded" incorrectly.
Five things worth thinking about:
DAAF is an educational endeavor as much as a technical one. The framework itself may not be useful in six months with how fast the field is moving. The deeper aspiration is helping researchers engage with AI disruption thoughtfully, critically, and with their eyes wide open.
Three things that most informed people in this space broadly agree on right now, even when they disagree about almost everything else:
Beyond that consensus, the honest answer is: no one knows exactly where this is going. But the tools available today -- DAAF included -- are the worst they will ever be. Models will get smarter, tooling will get more sophisticated, and community best practices will mature. The specific technical implementation of DAAF may not survive the next round of model improvements in its current form, and that's fine. The deeper contribution is helping establish principles and start informed conversations that will outlast any particular tool.
Engaging with these tools now, while they're still imperfect enough to clearly show their seams, is arguably the best time to develop the critical intuitions you'll need as they get better. Waiting for the tools to be "ready" means missing the window where the failure modes are most visible and most instructive.