Methodology
An AI agent curates this reading list. Here is exactly how.
A working demonstration of responsible AI in evidence workflows, built and operated by Black Swan Causal Labs.
The premise
Why automate the curation, and why be loud about it
Real-world evidence, causal inference, and machine learning are converging faster than any single practitioner can track. New methods, new tools, and new pitfalls show up every week across PubMed, bioRxiv, and medRxiv. Most of us read what gets surfaced by colleagues, social feeds, or journal alerts. That works, but it leaves a lot on the table, and it is biased toward whatever the loudest voices in the field happen to be amplifying.
This reading list is a small experiment in fixing that. An autonomous AI agent runs once a week, searches the literature using a structured query strategy, picks five papers, writes opinionated takes on each, and publishes the result. The output is for me first, and for anyone else who finds it useful second.
Black Swan Causal Labs exists to help scientists use AI in evidence work responsibly. It would be strange to do that without using AI in our own workflows, in public, with the wiring exposed. This page is the wiring diagram.
The architecture
A scheduled agent, a search strategy, and a newsletter pipeline
The system has three parts: a scheduled agent that runs every Wednesday at 10am Eastern, a structured search strategy stored in a single file in the website repository, and a delivery layer that sends the weekly issue to subscribers.
The agent runs in a sandboxed cloud environment with deliberately limited access. It can use web search, edit files, and push commits to the website repository. It cannot reach the email service directly, by design. When the agent finishes curating the week, it writes the prepared newsletter to a file in the repository and pushes. That push triggers a GitHub Actions workflow with its own scoped credentials, and the workflow is the only thing that talks to Buttondown. Two agents, two responsibilities, two sets of permissions. The blast radius is small on purpose.
Search
The agent reads the search strategy file and runs a series of structured queries against PubMed and bioRxiv. It pulls candidate papers from the last seven days that match the topic intersections.
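The production agent runs these queries through web search rather than a database API (a tradeoff noted under Limitations below), but a direct-API version is the clearest way to show the shape of the step. A minimal sketch against NCBI's E-utilities, using one of the intersection queries described below and a seven-day window:

```python
import requests

# One of the intersection queries from the strategy file, verbatim.
QUERY = (
    '("real-world evidence" OR "real-world data") '
    'AND ("causal inference" OR "target trial emulation")'
)

# NCBI E-utilities esearch: reldate=7 restricts results to the last seven days.
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={
        "db": "pubmed",
        "term": QUERY,
        "reldate": 7,
        "datetype": "edat",
        "retmax": 50,
        "retmode": "json",
    },
    timeout=30,
)
pmids = resp.json()["esearchresult"]["idlist"]
print(f"{len(pmids)} candidate papers this week")
```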
Curate
It scans the candidates for novelty, methodological substance, and relevance to practitioners. It narrows the field to five papers that ideally span methods, applications, and tooling, so the issue is not five variations on the same theme.
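The actual criteria live in the system prompt, but the diversity constraint is easy to picture in code. A toy sketch of the selection pass, with hypothetical scores and category labels (the real agent reasons over abstracts, not pre-computed numbers):

```python
# Hypothetical candidate pool: each paper pre-scored for novelty,
# methodological substance, and practitioner relevance.
candidates = [
    {"title": "...", "category": "methods",      "score": 0.91},
    {"title": "...", "category": "methods",      "score": 0.88},
    {"title": "...", "category": "applications", "score": 0.84},
    {"title": "...", "category": "tooling",      "score": 0.79},
    {"title": "...", "category": "applications", "score": 0.77},
    {"title": "...", "category": "methods",      "score": 0.75},
]

def pick_five(candidates):
    """Greedy pick: cover each category first, then fill by score."""
    ranked = sorted(candidates, key=lambda p: p["score"], reverse=True)
    picks, covered = [], set()
    # First pass: take the best paper from each uncovered category.
    for paper in ranked:
        if paper["category"] not in covered:
            picks.append(paper)
            covered.add(paper["category"])
    # Second pass: top up to five with the best remaining papers.
    for paper in ranked:
        if len(picks) == 5:
            break
        if paper not in picks:
            picks.append(paper)
    return picks[:5]

issue = pick_five(candidates)
```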
Write
For each paper, the agent drafts a two- or three-sentence take. The instruction is explicit: write a take, not a summary. Say what is interesting, what is new, and what a practitioner should pay attention to. Use bold for the one sentence you would underline if you were reading on paper.
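For the curious, the brief is roughly this (a paraphrase of the instruction above, not the verbatim system prompt):

```python
TAKE_INSTRUCTION = """\
For each paper, write a two- or three-sentence take, not a summary.
Say what is interesting, what is new, and what a practitioner should
pay attention to. Bold the one sentence you would underline if you
were reading on paper."""
```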
Publish
The agent writes the new week of articles to the website repository and prepares the newsletter as a separate JSON payload in the same repo. It commits both files and pushes them to GitHub. The reading list page reads from the articles file and updates automatically.
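The exact file names and schema are internal details, but the publish step reduces to writing two artifacts and pushing. A minimal sketch, with hypothetical paths and a hypothetical payload shape:

```python
import json
import subprocess
from pathlib import Path

# Hypothetical paths; the real file names live in the website repository.
ARTICLES_FILE = Path("data/reading-list.json")
NEWSLETTER_FILE = Path("data/newsletter-payload.json")

def publish(issue_date, picks):
    # 1. Append this week's picks to the articles file the page reads from.
    articles = json.loads(ARTICLES_FILE.read_text())
    articles.append({"week": issue_date, "papers": picks})
    ARTICLES_FILE.write_text(json.dumps(articles, indent=2))

    # 2. Prepare the newsletter payload for the delivery workflow.
    NEWSLETTER_FILE.write_text(json.dumps({
        "subject": f"Five papers, week of {issue_date}",
        "body": "\n\n".join(p["take"] for p in picks),
    }, indent=2))

    # 3. Commit and push; the push triggers the delivery workflow.
    subprocess.run(["git", "add", str(ARTICLES_FILE), str(NEWSLETTER_FILE)], check=True)
    subprocess.run(["git", "commit", "-m", f"Reading list: {issue_date}"], check=True)
    subprocess.run(["git", "push"], check=True)
```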
Deliver
The push to GitHub triggers a small GitHub Actions workflow that reads the newsletter payload and posts it to the Buttondown API using a separate scoped API key. Buttondown handles delivery, double opt-in confirmation for new subscribers, and unsubscribe management.
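The workflow's job is small by design. A sketch of the delivery step, assuming Buttondown's v1 emails endpoint and a scoped key stored as a repository secret:

```python
import json
import os

import requests

# Scoped key injected by GitHub Actions from a repository secret;
# the curation agent never sees it.
API_KEY = os.environ["BUTTONDOWN_API_KEY"]

# Hypothetical path, matching the publish sketch above.
payload = json.load(open("data/newsletter-payload.json"))

# Buttondown's emails endpoint creates the issue and sends it to subscribers.
resp = requests.post(
    "https://api.buttondown.email/v1/emails",
    headers={"Authorization": f"Token {API_KEY}"},
    json={"subject": payload["subject"], "body": payload["body"]},
    timeout=30,
)
resp.raise_for_status()
```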
The search strategy
What the agent looks for, and how
The agent does not freelance. It searches against a fixed list of topics and combinations that I maintain by hand. The list lives in a single file in the website repository, so it is easy to revise and under version control, which means the methodology evolves transparently rather than drifting silently inside a chatbot.
The intersection queries are the most important part. They are designed to surface papers that sit at the seams between fields, where the interesting work is usually happening.
Intersection queries
("real-world evidence" OR "real-world data") AND ("causal inference" OR "target trial emulation")
("machine learning" OR "artificial intelligence") AND ("causal inference") AND ("observational study" OR "pharmacoepidemiology")
("large language model" OR "LLM" OR "generative AI") AND ("real-world evidence" OR "real-world data" OR "electronic health records" OR "claims data" OR "administrative data" OR "registry" OR "epidemiology" OR "digital health" OR "digital technologies")
("large language model" OR "LLM" OR "generative AI") AND ("pharmacovigilance" OR "adverse event" OR "drug safety")
("double machine learning" OR "TMLE" OR "targeted learning") AND ("real-world" OR "observational")
Beyond these, the agent has access to a richer term library covering pharmacoepidemiology, post-market surveillance, propensity score methods, instrumental variables, difference-in-differences, g-methods, federated learning, and others. The full list is in the website repository.
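Composing the intersection queries from that term library is mechanical, which is part of why a single hand-maintained file is enough. A minimal sketch that reconstructs the first query above from two topic groups:

```python
# Two topic groups from the term library (excerpt).
REAL_WORLD = ['"real-world evidence"', '"real-world data"']
CAUSAL = ['"causal inference"', '"target trial emulation"']

def intersect(*groups):
    """Join each group with OR, then join the groups with AND."""
    return " AND ".join(f"({' OR '.join(g)})" for g in groups)

print(intersect(REAL_WORLD, CAUSAL))
# ("real-world evidence" OR "real-world data") AND ("causal inference" OR "target trial emulation")
```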
The human in the loop
What the AI does, and what I do
The agent picks the papers from week to week. The agent writes the takes. The agent ships the newsletter. I do not approve each issue before it goes out, and I think that is important to be clear about. If you see a take in the newsletter, an AI wrote it.
What I do is everything that decides what kind of curator the agent is. I designed the search strategy. I wrote the curation criteria, the tone instructions for the takes, and the rules for what makes an issue good. I review the output over time and refine the system prompt when I notice patterns I want to change. I am the one accountable when the agent gets something wrong, because the agent is doing what I told it to do.
This is the model I think most AI in evidence work should follow. Not full automation, not human review of every output, but humans in the loop at the right places. Designing the system, defining the criteria, monitoring drift, and standing behind the result.
Limitations
What this system is bad at, and what I am watching for
The agent uses web search rather than direct database APIs, which means it is sensitive to whatever search engines surface. Highly cited classics get surfaced more easily than genuinely new work. Preprints from less indexed servers get missed.
The takes can be confidently wrong. The agent has no access to the full text of most papers, so its read on a study is based on the abstract and whatever it can glean from secondary sources. If a paper has a subtle methodological flaw that only shows up in the supplementary materials, the agent will not catch it.
The five-paper limit is a blunt instrument. A great week might deserve seven papers. A slow week might honestly only have three. The agent is currently asked to pick five every time, and sometimes that means a borderline pick gets included.
The curation criteria favor novelty and methodological interest, which can underweight implementation papers, regulatory science, and applied work. I am refining the prompts over time to balance this better.
If you spot a bad take, a missed paper, or a pattern that worries you, tell me. The whole point of doing this in public is so the failure modes get caught and the system gets better.
Subscribe to the weekly digest
Five papers, every Wednesday, delivered to your inbox.
Visit the reading list