
At GiveWell, we've been experimenting with using AI to red team our global health intervention research—searching for weaknesses, blind spots, or alternative interpretations that might significantly affect our conclusions. We've just published a write-up on what we’ve learned, both about the programs we fund through donor support and about how to use AI in our research.

We're sharing this to invite critiques of our approach and to see if others have found methods for critiquing research with AI that work better. Specifically, we'd love to see people try their own AI red teaming approaches on our published intervention reports or grant pages. If you generate critiques we haven't considered or find prompting strategies that work better than ours, please share them in the comments—we'd be interested to see both your methodology and the specific critiques you uncover.

Our process

Our research team spends more than 70,000 hours each year reviewing academic evidence and investigating programs to determine how much good they accomplish per dollar spent. This in-depth analysis informs our grantmaking, directing hundreds of millions in funding annually to highly cost-effective, evidence-backed programs.

Our current approach for supplementing that research with AI red teaming:

  1. Literature review stage: An AI using "Deep Research" mode synthesizes recent academic literature on the intervention[1]

  2. Critique stage: A second AI reviews both our internal analysis and the literature summary to identify gaps in our analysis[2]
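
To make the two stages concrete, here is a minimal sketch of how they could be chained programmatically. The function names, prompt wording, and the `call_model` helper are illustrative stand-ins rather than our actual implementation; in practice we run these steps through standard chat interfaces rather than code.

```python
# Illustrative two-stage red-teaming pipeline (a sketch, not our actual setup).
# `call_model` is a hypothetical stand-in for whichever model or API is used:
# a Deep Research-enabled model for stage 1, a strong general model for stage 2.

def call_model(prompt: str, *, deep_research: bool = False) -> str:
    """Hypothetical wrapper around a chat model; replace with a real API call."""
    raise NotImplementedError


def literature_review(intervention: str) -> str:
    # Stage 1: synthesize recent academic literature on the intervention.
    prompt = (
        f"Synthesize the recent academic literature on {intervention}, "
        "covering evidence of effectiveness, known limitations, and open questions."
    )
    return call_model(prompt, deep_research=True)


def red_team_critique(internal_analysis: str, lit_summary: str) -> str:
    # Stage 2: review our internal analysis alongside the literature summary
    # and identify gaps, blind spots, or alternative interpretations.
    prompt = (
        "You are red teaming the following intervention analysis.\n\n"
        f"Internal analysis:\n{internal_analysis}\n\n"
        f"Literature summary:\n{lit_summary}\n\n"
        "Identify weaknesses, blind spots, or alternative interpretations "
        "that could significantly change the conclusions."
    )
    return call_model(prompt)
```

In practice, the output of the second stage still goes through the manual relevance filtering described below.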

We applied this approach to six grantmaking areas, and it generated several critiques worth investigating per intervention, including:

  • Whether low partner treatment rates in syphilis programs lead to reinfection (with pointers to relevant studies)
  • Whether children recovering naturally from severe malnutrition could make treatment programs appear more effective than they actually are
  • Whether circulating parasite strains differ from malaria vaccine targets, potentially reducing effectiveness

For more on our current approach and the critiques it identified, see our public write-up.

Our prompting approach

Our red teaming prompt (example here) has a few key features:

  • Explicit thinking process: We instruct the model to generate 20-30 potential critiques first, then filter for novelty and impact before selecting its top 15. This is meant to increase creativity before filtering down.
  • Verifying novelty: The prompt explicitly asks the model to check "Is this already addressed in the report?" for each critique before including it. In practice, this helps but doesn't eliminate redundant critiques.
  • Structured categories: We ask for critiques across specific categories (evidence quality, methodology limitations, alternative interpretations, implementation challenges, external validity, overlooked factors) to encourage broader coverage.
  • Concrete prompts for novel thinking: We include questions like "What would someone from the target community say about why this won't work?" and "What existing local solutions might this disrupt?" to push toward less obvious concerns.
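
Putting these features together, the overall shape of the prompt looks roughly like the sketch below. The wording is paraphrased and the variable names are hypothetical; the actual prompt is in the example linked above.

```python
# Paraphrased sketch of the red-teaming prompt structure described above;
# the exact wording in our actual prompt differs (see the linked example).

CRITIQUE_CATEGORIES = [
    "evidence quality",
    "methodology limitations",
    "alternative interpretations",
    "implementation challenges",
    "external validity",
    "overlooked factors",
]

RED_TEAM_PROMPT_TEMPLATE = """\
You are red teaming the attached intervention report.

Step 1 - Generate: brainstorm 20-30 potential critiques, spread across
these categories: {categories}.

Step 2 - Check novelty: for each critique, ask "Is this already addressed
in the report?" and drop critiques the report already handles.

Step 3 - Filter and rank: select your top 15 critiques by novelty and by
likely impact on the bottom-line conclusions.

Also consider questions such as:
- What would someone from the target community say about why this won't work?
- What existing local solutions might this disrupt?
"""

prompt = RED_TEAM_PROMPT_TEMPLATE.format(categories=", ".join(CRITIQUE_CATEGORIES))
```

As noted below, we haven't tested which of these elements actually matters.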

We arrived at this through trial and error rather than systematic testing. We're uncertain which elements are actually driving the useful output and which are counterproductive.

What we learned about using AI for research critiques

A few initial lessons:

  • Best for closing literature gaps: AI is most useful when substantial academic research exists that we haven't yet incorporated into our models. It found several studies on syphilis that we were unfamiliar with, but added little for interventions we've thoroughly reviewed, like insecticide-treated nets.
  • Quantitative estimates are unreliable: AI often suggested specific impacts ("could reduce cost-effectiveness by 15-25%") without a solid basis, most likely because it can't effectively work with our multi-spreadsheet models.
  • Relevance filtering required: ~85% of critiques were either unlikely to affect our bottom line or represented misunderstandings of our work. AI was not helpful for filtering its own results, and our researchers needed to filter the critiques for relevance and decide which ones were worth digging into.

A note on timing: This evaluation was conducted 4-5 months ago. While we haven't done systematic retesting with the same prompts and context, our impression is that critique relevance has improved, primarily through better alignment with the types of critiques we're looking for. Our rough guess is that the rate of relevant critiques may now be closer to ~30%, a meaningful improvement but not enough to change our research workflows.

Improvements we've considered but not pursued

We've deliberately kept our approach simple—running prompts through standard chat interfaces (Claude, ChatGPT, Gemini) that our researchers are already comfortable with. We've considered but chosen not to pursue:

  • More complex prompt architectures: Breaking the task into more specialized sub-prompts, or using multi-agent workflows where different AI instances debate or build on each other's critiques.
  • Custom tooling: Building dedicated applications using AI automation platforms or command-line tools.
  • Specialized research platforms: We briefly evaluated AI-powered research tools but found them too narrowly focused on specific tasks (e.g., literature landscaping) to perform well at generating research critiques.
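
As an illustration of the first option, a minimal two-agent "debate" loop might look like the sketch below. We have not built or tested anything like this; the `call_model` stub and prompt wording are hypothetical and only meant to make the idea concrete.

```python
# Sketch of a two-agent "debate" workflow we considered but did not pursue.
# `call_model` is a hypothetical wrapper around whichever chat model is used.

def call_model(prompt: str) -> str:
    """Hypothetical wrapper around a chat model; replace with a real API call."""
    raise NotImplementedError


def debate_critiques(internal_analysis: str, rounds: int = 2) -> str:
    # Agent A proposes critiques; Agent B challenges or strengthens them;
    # Agent A then revises its list in light of the review, for a few rounds.
    critiques = call_model(
        "Red team this analysis and list your strongest critiques:\n"
        + internal_analysis
    )
    for _ in range(rounds):
        review = call_model(
            "Review these critiques of the analysis below. Flag which are "
            "misunderstandings, which the analysis already addresses, and which "
            f"deserve more weight.\n\nAnalysis:\n{internal_analysis}\n\n"
            f"Critiques:\n{critiques}"
        )
        critiques = call_model(
            "Revise your critiques in light of this review, dropping weak ones "
            f"and sharpening strong ones.\n\nCritiques:\n{critiques}\n\n"
            f"Review:\n{review}"
        )
    return critiques
```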

We suspect the gains from adding complexity to this workflow would be marginal and unlikely to outweigh the friction of adopting less familiar tools. But we hold this view loosely—if someone has achieved meaningfully better results with more sophisticated approaches, we'd consider investing more time in them.

Why we're sharing this

We think we're still in the early stages of learning how to use AI well, but we've developed preliminary views about what works and what doesn't, and we'd appreciate input from others thinking about similar problems.

Specifically, we’d welcome hearing about:

  • Blind spots in our methodology or prompting strategy
  • Improvements we're missing (including on the approaches above that we've deprioritized)
  • Alternative workflows or prompting strategies that have worked well for similar tasks
  • Pointers to research on AI-assisted critical analysis we might have missed

If you're interested in trying your own approach on one of our published intervention reports, we'd be curious to see what you get–both methodology and output.

  1. ^

     Typically, ChatGPT Pro with Deep Research enabled or similar.

  2. ^

     This step typically uses whichever Anthropic, OpenAI, or Google model is considered the best for research at that moment.


Comments

One other issue I thought of since my other comment: you list several valid critiques the AI made that you'd already identified but that were not in the source materials you provided. You state that this gives additional credence to the helpfulness of the models:

three we were already planning to look into but weren't in the source materials we provided (which gives us some additional confidence in AI’s ability to generate meaningful critiques of our work in the future—especially those we’ve looked at in less depth).

However, just because a critique is not in the provided source materials doesn't mean it's not in the wider training data of the LLM. So for example, if GiveWell talked about the identified issue of "optimal chlorine doses" in a blog comment or something, and that blog got scraped into the LLM's training data, then the critique is not a sign of LLM usefulness: it may just be parroting your own findings back to you.

Overall this seems like a sensible, and appropriately skeptical, way of using LLMs in this sort of work.

Regarding improving the actual AI output, it looks like there is insufficient sourcing of claims in what it puts out, which is going to slow you down when you actually try to check the output. I'm looking at the red team output here on water turbidity. This was highlighted as a real contribution by the AI, but the output has zero sourcing for its claims, which presumably made it much harder to check for validity. If you were to get this critique from a real, human red-teamer, they would make it significantly easier to verify that the critique was valid and sourced.

One question I have to ask is whether you are measuring how much time and effort is being expended on managing the output of these LLMs and sifting out the actually useful recommendations. When assessing whether the technique is a success, you have to consider the counterfactual case where that time was instead spent on human research looking more closely at the literature, for example.

My experience is similar. LLMs are powerful search engines but nearly completely incapable of thinking for themselves. I use these custom instructions for ChatGPT to make it much more useful for my purposes:

When asked for information, focus on citing sources, providing links, and giving direct quotes. Avoid editorializing or doing original synthesis, or giving opinions. Act like a search engine. Act like Google.

There are still limitations:

  • You still have to manually check the cited links to verify the information yourself.
  • ChatGPT is, for some reason, really bad at actually linking to the correct webpage it’s quoting from. This wastes time and is frustrating.
  • ChatGPT is limited to short quotes and often gives even shorter quotes than necessary, which is annoying. This often makes it hard to understand what the quote actually says, which almost defeats the purpose.
  • It’s common for ChatGPT to misunderstand what it’s quoting and take something out of context, or it quotes something inapplicable. This often isn’t obvious until you actually go check the source (especially with the truncated quotes). You can get tricked by ChatGPT this way.
  • Every once in a while, ChatGPT completely fabricates or hallucinates a quote or a source.

The most one-to-one analogy for LLMs in this use case is Google. Google is amazingly useful for finding webpages. But when you Google something (or search on Google Scholar), you get a list of results, many of which are not what you’re looking for, and you have to pick which results to click on. And then, of course, you actually have to read the webpages or PDFs. Google doesn’t think for you; it’s just an intermediary between you and the sources.

I call LLMs SuperGoogle because they can do semantic search on hundreds of webpages and PDFs in a few minutes while you're doing something else. Using LLMs as search engines is a genuine innovation.

On the other hand, when I’ve asked LLMs to respond to the reasoning or argument in a piece of writing or even just do proofreading, they have given incoherent responses, e.g. making hallucinatory "corrections" to words or sentences that aren’t in the text they’ve been asked to review. Run the same text by the same LLM twice and it will often give the opposite opinion of the reasoning or argument. The output is also often self-contradictory, incoherent, incomprehensibly vague, or absurd.

Executive summary: GiveWell reports that using AI to red team its global health research has surfaced some worthwhile critiques—especially by filling literature gaps—but remains limited by low relevance rates, unreliable quantitative claims, and the need for substantial human filtering, and the team invites others to test alternative AI critique methods.

Key points:

  1. GiveWell piloted a two-stage AI red teaming process—AI literature synthesis followed by AI critique of internal analysis—across six grantmaking areas.
  2. The approach generated several critiques worth investigating, such as reinfection risks in syphilis programs, natural recovery bias in malnutrition treatment, and strain mismatch in malaria vaccines.
  3. The prompting strategy emphasized generating many candidate critiques, checking for novelty against the report, using structured categories, and including prompts aimed at less obvious perspectives.
  4. The authors found AI most useful for identifying relevant academic literature they had not yet incorporated, but least useful for interventions already extensively reviewed.
  5. AI-generated quantitative impact estimates were often unsupported, and roughly 85% of critiques were filtered out as irrelevant or based on misunderstandings.
  6. GiveWell chose not to pursue more complex workflows or custom tooling, judging that expected gains would likely be marginal relative to added friction, while remaining open to contrary evidence from others.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
