Thanks, David! I agree that LLM capabilities for working with spreadsheets are not very strong at the moment.
I've tried a few times to get Claude Code to help convert our CEAs (cost-effectiveness analyses) into databases and haven't had much success: it commonly takes shortcuts, or the context window runs out. If you (or anyone else) have advice on converting them, I would love to hear it.
For context, our most complex CEAs (example) are where we'd get the most value, but they're often 1,500+ lines across 10+ tabs, which is where we run into issues.
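For concreteness, here's roughly what I mean by "converting": something that walks each tab and dumps formulas alongside cell references, so an LLM can work on one tab at a time instead of the whole workbook. A minimal sketch, assuming openpyxl (the file path is a placeholder):

```python
# Minimal sketch: dump each tab of a workbook to plain text so an LLM
# can process one tab at a time instead of the whole file at once.
# Assumes openpyxl; "cea.xlsx" is a placeholder path.
from openpyxl import load_workbook

# data_only=False keeps formulas as strings (e.g. "=B2*C2")
wb = load_workbook("cea.xlsx", data_only=False)

for sheet in wb.worksheets:
    lines = []
    for row in sheet.iter_rows():
        for cell in row:
            if cell.value is not None:
                lines.append(f"{cell.coordinate}: {cell.value!r}")
    # One text file per tab keeps each chunk small relative to a context window.
    with open(f"{sheet.title}.txt", "w") as f:
        f.write("\n".join(lines))
```

Even a naive version like this sidesteps the context-window problem, though it loses named ranges and cross-tab references, which is part of why the shortcuts creep in.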
Thanks for your comment!
I would agree that LLMs are much stronger at finding and summarizing information than at original thought (which is a big limitation for red teaming). However, we've gotten a lot of utility out of having "SuperGoogle" research a topic and then look for ways our intervention reports differ from the published literature. You could argue this is still SuperGoogle behavior (search + comparison) rather than genuine critical thinking, but for our purposes it's been enough to surface a handful of worthwhile leads per intervention.
This is why we've found that AI red teaming works best for interventions that are well covered in the published literature but where we at GiveWell haven't done as much research (like syphilis). It doesn't work as well for interventions where we've already done a lot of research (like insecticide-treated bed nets) or that are relatively new and don't have much published literature (like malaria vaccines).
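For anyone curious what that two-step pattern looks like mechanically, here's a minimal sketch assuming the anthropic Python SDK. The model name and prompts are placeholders rather than what we actually use, and in practice the first step would want web search enabled rather than relying on the model's training data:

```python
# Minimal sketch of the search-then-compare pattern described above.
# Assumes the anthropic Python SDK; model name and prompts are
# placeholders, not GiveWell's actual prompts.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def red_team(topic: str, report_text: str) -> str:
    # Step 1: have the model summarize what the published literature says.
    literature = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"Summarize the published evidence on {topic}, "
                       "including effect sizes and major points of disagreement.",
        }],
    ).content[0].text

    # Step 2: ask it to diff that summary against our report.
    critique = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": (
                "Here is a summary of the published literature:\n\n"
                + literature
                + "\n\nHere is our intervention report:\n\n"
                + report_text
                + "\n\nList any places where the report's assumptions or "
                "estimates differ from the literature."
            ),
        }],
    ).content[0].text
    return critique
```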
Thanks for the feedback!
On your first point, if researchers found any of the AI's critiques credible, we would follow up in the same chat to ask for sourcing or additional information (as with the water turbidity point). We found this more successful than asking for heavy citations in the initial output, which led to lower-quality critiques (presumably because we were giving too many instructions at once). The most recent generation of models is better at following complex instructions, so I expect we should revisit this.
Human researchers spent ~75 minutes per intervention reviewing AI output. We haven't rigorously compared this to having researchers dig into the literature themselves, but my sense is that AI review is currently worth it in areas where we haven't done as much research (family planning, syphilis treatment) and not worth it in areas where we have (bed nets, vaccinations). Given the pace of improvement over the past year, I expect a preliminary AI red-teaming pass will be worthwhile in almost all cases by the end of 2026.
Thank you both!
We are planning to keep spreadsheets as the primary format for our models, for the transparency and simplicity reasons you both noted. However, a reliable way to convert spreadsheets to code would be valuable to us, both for LLM digestion and for things like building web apps or running more complex uncertainty analyses (sketched below).
Definitely not asking anyone to spend time on this for us! I was just wondering if anyone was aware of a good way to do the conversion.
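To make the uncertainty-analysis point concrete: once a CEA lives in code rather than in cells, a Monte Carlo pass over the inputs is only a few lines. A minimal sketch with made-up parameters (nothing here reflects an actual GiveWell model):

```python
# Minimal sketch: Monte Carlo uncertainty analysis over a toy CEA.
# All parameters are made up for illustration; nothing here reflects
# an actual GiveWell model.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Uncertain inputs, expressed as distributions instead of point estimates.
cost_per_person = rng.normal(5.0, 1.0, n)            # dollars
coverage = rng.beta(8, 2, n)                          # fraction of target reached
deaths_averted_per_1k = rng.lognormal(np.log(2.0), 0.5, n)

# Toy model: cost per death averted.
cost_per_death_averted = cost_per_person * 1_000 / (coverage * deaths_averted_per_1k)

lo, mid, hi = np.percentile(cost_per_death_averted, [5, 50, 95])
print(f"cost per death averted: ${mid:,.0f} (90% interval ${lo:,.0f}-${hi:,.0f})")
```

The same structure extends to however many uncertain inputs a real model has, which is exactly the part that's painful to do inside a spreadsheet.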