The Unjournal commissioned two evaluations of "Meaningfully reducing consumption of meat and animal products is an unsolved problem: A meta-analysis" by Seth Ariel Green, Benny Smith, and Maya B Mathur. See our evaluation package here.
My take: the research was ambitious and useful, but it seems to have important limitations, as noted in the critical evaluations; Matthew Jané's evaluation provided constructive and actionable insights and suggestions.
I'd like to encourage follow-up research on this same question, starting with this paper's example and its shared database (demonstrating commendable transparency), taking these suggestions on board, and building something even more comprehensive and rigorous.
Do you agree? I come back to some 'cruxes' below:
- Is meta-analysis even useful in these contexts, with heterogeneous interventions, outcomes, and analytical approaches?
- Would a more rigorous and systematic approach really add value? Should it follow academic meta-analysis standards, or "a distinct vision of what meta-analysis is for, and how to conduct it" (as Seth suggests)?
- Will anyone actually do/fund/reward rigorous continued work?
Original paper: evidence that ~the main approaches to this don't work
The authors discussed this paper in a previous post.
We conclude that no theoretical approach, delivery mechanism, or persuasive message should be considered a well-validated means of reducing MAP [meat and animal products] consumption
The authors characterize this as evidence of "consistently small effects ... upper confidence bounds are quite small" for most categories of intervention.[1]
Unjournal's evaluators: ~this meta-analysis is limited and could be improved[2]
From the Evaluation Manager's summary (Tabaré Capitan)
... The evaluators identified a range of concerns regarding the transparency, design logic, and robustness of the paper’s methods—particularly in relation to its search strategy, outcome selection, and handling of missing data. Their critiques reflect a broader tension within the field: while meta-analysis is often treated as a gold standard for evidence aggregation, it remains highly sensitive to subjective decisions at multiple stages.
Evaluators' substantive critiques
Paraphrasing these -- mostly from E2, Matthew Jané, though many of the critiques were mentioned by both evaluators.
Improper missing data handling: Assigning SMD = 0.01 to non-significant unreported effects introduces systematic bias by ignoring imputation variance
Single outcome selection wastes data: Extracting only one effect per study discards valuable information despite authors having multilevel modeling capacity
Risk-of-bias assessment is inadequate: The informal approach omits critical bias sources like selective reporting and attrition
Missing documentation: "a fully reproducible search strategy, clearly articulated inclusion and exclusion criteria ..., and justification for screening decisions are not comprehensively documented in the manuscript or supplement."
No discussion of attrition bias in RCTs... "concerning given the known non-randomness of attrition in dietary interventions"
... And a critique that we hear often in evaluations of meta-analyses: "The authors have not followed standard methods for systematic reviews..."
Epistemic audit: Here is RoastMyPost's epistemic and factual audit of Jané's evaluation. It gets a B- grade (which seems to be the modal grade with this tool). RMP is largely positive, but offers some constructive criticism (asking for "more explicit discussion of how each identified flaw affects the magnitude and direction of potential bias in the meta-analysis results.")
One author's response
Seth Ariel Green responded here.
Epistemic/factual audit: Here is RoastMyPost's epistemic and factual audit of Seth's response. It gets a C- grade, and it raises some (IMO) useful critiques of the response, along with a few factual disagreements about the cited methodological examples (these should be double-checked). It flags "defensive attribution bias" and emphasizes that "the response treats innovation as self-justifying rather than requiring additional evidence of validity."
Highlighting some of Seth's responses to the substantive critiques:
"Why no systematic search?"
...We were looking at an extremely heterogeneous, gigantic literature — think tens of thousands of papers — where sifting through it by terms was probably going to be both extremely laborious and also to yield a pretty low hit rate on average.
we employed what could be called a ‘prior-reviews-first’ search strategy. Of the 985 papers we screened, a full 73% came from prior reviews. ... we employed a multitude of other search strategies to fill in our dataset, one of which was systematic search.
David Reinstein:
Seth's response to these issues might be characterized as ~"the ivory tower protocol is not practical, you need to make difficult choices if you want to learn anything in these messy but important contexts and avoid 'only looking under the streetlamp' -- so we did what seemed reasonable."
I'm sympathetic to this. The description intuitively seems like a reasonable approach to me. I'm genuinely uncertain as to whether 'following the meta-analysis rules' is the most useful approach for researchers aiming to make practical recommendations. I'm not sure the rules were built for the contexts and purposes we're dealing with.
On the other hand, I think a lack of a systematic protocol limits our potential to build and improve on this work, and to make transparent fair comparisons.
And I would have liked the response to take on the methodological issues raised directly -- yes, there are always tradeoffs, but you can justify your choices explicitly, especially when you are departing from convention.
"Why no formal risk of bias assessment?"
The main way we try to address bias is with strict inclusion criteria, which is a non-standard way to approach this, but in my opinion, a very good one (Simonsohn, Simmons & Nelson (2023) articulates this nicely).
After that baseline level of focusing our analysis on the estimates we thought most credible, we thought it made more sense to focus on the risks of bias that seemed most specific to this literature.
... I hope that our transparent reporting would let someone else replicate our paper and do this kind of analysis if that was of interest to them.
David: Again, this seems reasonable, but it is also a bit of a false dichotomy and merits greater explanation. You can have both strict inclusion criteria and a risk-of-bias assessment, although every step takes time and brings challenges.
"About all that uncertainty"
Matthew Jané raises many issues about ways in which he thinks our analyses could (or in his opinion, should) have been done differently. Now I happen to think our judgment calls on each of the raised questions were reasonable and defensible. Readers are welcome to disagree.
Matthew raises an interesting point about the sheer difficulty in calculating effect sizes and how much guesswork went into it for some papers. In my experience, this is fundamental to doing meta-analysis. I’ve never done one where there wasn’t a lot of uncertainty, for at least some papers, in calculating an SMD.
More broadly, if computing effect sizes or variance differently is of interest, by all means, please conduct the analysis, we’d love to read it!
David: This characterizes Seth's response to a number of the issues: 1. This is challenging, 2. You need to make judgment calls, 3. We are being transparent, and allowing others to follow up.
I agree with this, to a point. But again, I'd like to see them explicitly engage with the issues, the careful and formal treatments, and the specific practical solutions that Matthew provided. And as I get to below – there are some systemic barriers to anyone actually following up on this. [Update 10 Nov 2025: I appreciate Seth's response encouraging future work and inviting inquiries from other researchers, including graduate students.]
Where does this leave us – can meta-analysis be practically useful in heterogeneous domains like this? What are the appropriate standards?
Again from the evaluation manager's synthesis (mostly Tabaré Capitan)
... the authors themselves acknowledge many of these concerns, including the resource constraints that shaped the final design. Across the evaluations and the author response, there is broad agreement on a central point: that a high degree of researcher judgment was involved throughout the study. Again, this may reflect an important feature of synthesis work beyond the evaluated paper—namely, that even quantitative syntheses often rest on assumptions and decisions that are not easily separable from the analysts' own interpretive frameworks. These shared acknowledgements may suggest that the field currently faces limits in its ability to produce findings with the kind of objectivity and replicability expected in other domains of empirical science.
David Reinstein:
... I’m more optimistic than Tabaré about the potential for meta-analysis. I’m deeply convinced that there are large gains from trying to systematically combine evidence across papers, and even (carefully) across approaches and outcomes. Yes, there are deep methodological differences over the best approaches. But I believe that appropriate meta-analysis will yield more reliable understanding than ad-hoc approaches like ‘picking a single best study’ or ‘giving one’s intuitive impressions based on reading’. Meta-analysis could be made more reliable through robustness-checking, estimating a range of bounded estimates under a wide set of reasonable choices, and enabling data and dashboards for multiverse analysis, replication, and extensions.
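To make "a range of bounded estimates under a wide set of reasonable choices" a bit more concrete, here is a minimal, purely illustrative multiverse-style sketch (Python, synthetic numbers, hypothetical analytic choices -- not the paper's data, code, or methods):

```python
# Illustrative multiverse-style robustness sketch; all numbers are synthetic.
import itertools
import numpy as np

smd = np.array([0.05, 0.60, -0.02, 0.15, 0.08, np.nan])  # toy SMDs; one unreported
se = np.array([0.06, 0.10, 0.05, 0.12, 0.07, 0.09])      # toy standard errors

def pool(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    w = 1.0 / ses**2
    return np.sum(w * effects) / np.sum(w), np.sqrt(1.0 / np.sum(w))

# Two hypothetical analytic choices: how to handle the unreported effect, and
# whether to trim implausibly large effects. Each combination is one "specification".
for missing_rule, trim_large in itertools.product(["impute_0.01", "drop"], [False, True]):
    e, s = smd.copy(), se.copy()
    if missing_rule == "impute_0.01":
        e[np.isnan(e)] = 0.01                   # one defensible-but-debatable rule
    else:
        keep = ~np.isnan(e); e, s = e[keep], s[keep]
    if trim_large:
        keep = np.abs(e) < 0.5; e, s = e[keep], s[keep]
    est, pooled_se = pool(e, s)
    print(f"{missing_rule:>12}, trim={trim_large}: pooled SMD = {est:.3f} (SE {pooled_se:.3f})")
```

A real multiverse over the many judgment calls in the actual dataset would be far larger, but even a small grid like this makes the sensitivity (or stability) of a pooled estimate transparent and easy to re-run.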
I believe a key obstacle to this careful, patient, open work is the current system of incentives and tools offered by academia, and the current system of traditional journal publications as a career outcome and ‘end state’. The author’s response “But at some point, you declare a paper ‘done’ and submit it” exemplifies this challenge. The Unjournal aims to build and facilitate a better system.
Will anyone actually follow up on this? Once the "first paper" is published in an academic journal, can anyone be given a career incentive, or direct compensation, to improve upon it? Naturally, this gets at one of my usual gripes with the traditional academic journal model, a problem that The Unjournal's continuous evaluation tries to solve.
This also depends on... whether the animal welfare and EA community believes that rigorous/academic-style research is useful in this area. And wants to fund and support a program to gradually and continually improve our understanding and evidence on perhaps a small number of crucial questions like this.
I also think it depends on good epistemic norms.
Cross-posted to LessWrong here
- ^
However, they say "the largest effect size, ... choice architecture, comes from too few studies to say anything meaningful about the approach in general." So for that case we're dealing with an absence of evidence, i.e., wide posteriors. [Added 10 Nov 2025] Some other parts of the authors' discussion also suggest they're making a case for an absence of evidence rather than evidence of a 'tightly bounded near-zero impact'.
- ^
10 Nov 2025: I adjusted this header in response to Geoffrey's comment that I had characterized this somewhat too harshly/negatively, which I accept.

Really enjoyed this. Not much public debate in this space as far as I can see. To two of your cruxes:
I've sometimes wondered if it'd be worth funding a "mega study" like Milkman et al. (2021). They tested 54 different interventions to boost exercise among 61,000 gym members. Something similar for meat reduction could allow for some clean apples-to-apples comparisons.
I've seen the number $2.6 million floating around for how much that megastudy cost. Granted, that's probably on top of convincing the mega-team of researchers to work on the project, which might only happen through the prestige of an academic lab. But it's also not an astronomical cost. And there'd still be some learning value from a smaller set of interventions and a smaller sample.
This might be a better use of resources than striving for the "ideal" meta-analysis, since that sounds expensive too.
@geoffrey We'd love to run a megastudy! My lab put in a grant proposal with collaborators at a different Stanford lab to do just that, but we ultimately went a different direction. Today, however, I generally believe that we don't even know what is the right question to be asking -- though if I had to choose one it would be, what ballot initiative does the most for animal welfare while also getting the highest levels of public support, e.g. is there some other low-hanging fruit equivalent to "cage free" like "no mutilation" that would be equally popular. But in general I think we're back to the drawing board in terms of figuring out what study we want to run and getting a version of it off the ground, before we start thinking about scaling up to tens of thousands of people.
@david_reinstein, I suppose any press is good press so I should be happy that you are continuing to mull on the lessons of our paper 😃 but I am disappointed to see that the core point of my responses is not getting through. I'll frame it explicitly here: when we did one check and not another, or one search protocol and not another, the reason, every single time, is opportunity costs. When I say "we thought it made more sense to focus on the risks of bias that seemed most specific to this literature," I am using the word 'focus' deliberately, in the sense of "focus means saying no," i.e. 'we are always triaging.' At every juncture, navigating the explore/exploit dilemma requires judgment calls. You don't have to like that I said no to you, but it's not a false dichotomy, and I do not care for that characterization.
To the second question of whether anyone will do the kind of extension work, I personally see this as a great exercise for grad students. I did all kinds of replication and extension work in grad school. A deep dive into a subset of contact hypothesis literature I did in a political psychology class in 2014, which started with a replication attempt, eventually morphed into The Contact Hypothesis Re-evaluated. If you, a grad student, want to do this kind of project, please be in touch, I'd love to hear from you. (I'd recommend starting by downloading the repo and asking claude code about robustness checks that do and do not require gathering additional data).
That is clearly the case, and I accept that there are tradeoffs. Still, ideally I would have liked to see a more direct response to the substance of the points made by the evaluators, though I understand that there are tradeoffs there as well.
Perhaps 'false dichotomy' was too strong, given the opportunity costs (not an excuse: I got that phrasing from RoastMyPost's take on this). But as I understand it, there are clear rubrics and guidelines for meta-analyses like this. In cases where you choose to depart from standard practice, maybe it's reasonable to give a more detailed and grounded explanation of why you did so. And the evaluators did present very specific arguments for different practices you could have followed, and could still follow in future work. I think judgment calls based on experience get you somewhere, but it would be better to explicitly defend why you made a particular judgment call, and to respond to and consider the analytical points made by the evaluators. And ideally follow up with the checks they suggest, although I understand that it's hard to do this given how busy you are and the nature of academic incentives.
I hope I am being fair here; I'm trying to be even-handed and sympathetic to both sides. Of course, for this exercise to be useful, we have to allow for constructive expert criticism, which I think these evaluations do indeed embody. I appreciate you having responded to these at all. I'd be happy to get others' opinions on whether we've been fair here.
I had previously responded "casting this as 'for graduate students' makes it seem less valuable and prestigious," which I still stand by. But I appreciate that you adjusted your response to note "If a grad student wanted to do this kind of project, please be in touch, I'd love to hear from you," which I think helps a lot.
The point I was making -- perhaps preaching to the choir here:
These extensions, replications, and follow-up steps may be needed to make a large project deeply credible and useful, and to capture a large part of the value. Why not give equal esteem and career rewards for that? The current system of journals tends not to do so (at least not in economics, the field I'm most familiar with). This is one of the things that we hope credible evaluation separated from journal publications can improve upon.
Chiming in here with my outsider impressions on how fair the process seems
@david_reinstein If I were to rank the evaluator reports, evaluation summary, and the EA Forum post by which seemed the most fair, I would have ranked the Forum post last. It wasn't until I clicked through to the evaluation reports that I felt the process wasn't so cutting.
Let me focus on one very specific framing in the Forum post, since it feels representative. One heading includes the phrase "this meta-analysis is not rigorous enough". This has a few connotations that you probably didn't mean. One, this meta-analysis is much worse than others. Two, the claims are questionable. Three, there's a universally correct level of quality that meta-analyses should reach and anything that falls short of that is inadmissible as evidence.
In reality, it seems this meta-analysis is par for the course in terms of quality. And it was probably more difficult to do, given the heterogeneity in the literature. And the central claim of the meta-analysis doesn't seem like something either evaluator disputed (though one evaluator was hesitant).
Again, I know that's not what you meant and there are many caveats throughout the post. But it's one of a few editorial choices that make the Forum post seem much more critical than the evaluation reports, which is a bit unusual since the Evaluators are the ones who are actually critiquing the paper.
Finally, one piece of context that felt odd not to mention was the fundamental difficulty of finding an expert in both food consumption and meta-analysis. That limits the ability of any reviewer to make a fair evaluation. This is acknowledged at the bottom of the Evaluation Summary. Elsewhere, I'm not sure where it's said. Without that mentioned, I think it's easy for a casual reader to leave thinking the two Evaluators are the "most correct".
Thanks for the detailed feedback, this seems mostly reasonable. I'll take a look again at some of the framings, and try to adjust. (Below and hopefully later in more detail).
This was my take on how to succinctly depict the evaluators' reports (not my own take), in a way the casual reader would be able to digest. Maybe this was rounding down too much, but not by a lot, I think. Some quotes from Jané's evaluation that I think are representative:
This doesn't seem to reflect 'par for the course' to me, but it depends on what the course is; i.e., what the comparison group is. My own sense/guess is that this is more rigorous and careful than most work in this area of meat consumption interventions (and adjacent), but less rigorous than the meta-analyses the evaluators are used to seeing in their academic contexts and the practices they espouse. But academic meta-analysts will tend to focus on areas where they can find a proliferation of high-quality, more homogeneous research, not necessarily the highest-impact areas.
Note that the evaluators rated this 40th and 25th percentile for methods and 75th and 39th percentile overall.
To be honest, I'm having trouble pinning down what the central claim of the meta-analysis is. Is it a claim that "the main approaches being used to motivate reduced meat consumption don't seem to work", i.e., that we can bound the effects as very small, at best? That's how I'd interpret the reporting of the pooled effect's 95% CI as a standardized mean difference of 0.02 to 0.12. I would say that both evaluators are sort of disputing that claim.
However, the authors hedge this in places, and sometimes it sounds more like they're saying that ~"even the best meta-analysis possible leaves a lot of uncertainty" ... an absence of evidence more than evidence of absence, and this is something the evaluators seem to agree with.
That is/was indeed challenging. Let me try to adjust this post to note that.
My goal for this post was to fairly represent the evaluators' take, to provide insights to people who might want to use this for decision-making and future research, and to raise the question of standards in meta-analysis in EA-related areas. I will keep thinking about whether I missed the mark here. One possible clarification, though: we don't frame the evaluators' role as (only) looking to criticize or find errors in the paper. We ask them to give a fair assessment of it, evaluating its strengths, weaknesses, credibility, and usefulness. These evaluations can also be useful if they give people more confidence in the paper and its conclusions, and thus reason to update more on this for their own decision-making.
This does indeed look interesting, and promising. Some quick (maybe naive) thoughts on that particular example, at a skim.
The "cost of convincing researchers to work on it" is uncertain to me. If it were already a very well-funded, high-quality study in an interesting area that is 'likely to publish well' (apologies), I assume that academics would have some built-in 'publish or perish' incentives from their universities.
Certainly there is some trade-off here: investing intellectual resources and time into a more careful, systematic, and robust meta-analysis of a large body of work of potentially varying quality and great heterogeneity comes at the cost of academics and interested researchers organizing better and more systematic new studies. There might be some middle ground where a central funder requires future studies to follow common protocols and reporting standards to enable better future meta-analysis (perhaps along with outreach to authors of past research to try to systematically dig out missing information).
Seems like there are some key questions here
For what it's worth, I thought David's characterization of the evaluations was totally fair, even a bit toned down. E.g. this is the headline finding of one of them:
David characterizes these as "constructive and actionable insights and suggestions". I would say they are tantamount to asking for a new paper, especially the excluding of small studies, which was core to our design and would require a whole new search, which would take months. To me, it was obvious that I was not going to do that (the paper had already been accepted for publication at that point). The remaining suggestions also implied dozens (hundreds?) of hours of work. Spending weeks satisfying two critics didn't pass a cost-benefit test.[1] It wasn't a close call.
I really need to follow my own advice now and go actually do other projects 😃
I meant "constructive and actionable" in that he explained why the practices used in the paper had potentially important limitations (see here on "assigning an effect size of .01 for n.s. results where effects are incalculable")...
And he suggested a practical response, including a specific statistical package that could be applied to the existing data:
"An option to mitigate this is through multiple imputation, which can be done through the metansue (i.e., meta-analysis of non-significant and unreported effects) package."
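To illustrate the statistical point (a minimal toy sketch with synthetic numbers -- not the metansue algorithm itself and not the paper's data): pinning every unreported non-significant effect at SMD = 0.01 treats a guess as if it had been observed exactly, whereas multiple imputation draws a spread of values consistent with "not significant" and carries that extra uncertainty into the pooled standard error via Rubin's rules.

```python
# Toy comparison: fixed imputation at 0.01 vs. multiple imputation of
# non-significant unreported effects. Synthetic numbers, simple fixed-effect
# pooling, and a deliberately crude uniform draw over the non-significant
# region -- metansue uses a more principled scheme, but the variance point is the same.
import numpy as np

rng = np.random.default_rng(1)
reported = [(0.12, 0.06), (0.05, 0.08), (0.20, 0.10)]  # (SMD, SE) pairs
unreported_se = [0.09, 0.07]                           # studies reporting only "n.s."
z_crit = 1.96

def pool(effects, ses):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    w = 1.0 / np.asarray(ses) ** 2
    return float(np.sum(w * np.asarray(effects)) / np.sum(w)), float(np.sqrt(1.0 / np.sum(w)))

all_ses = [s for _, s in reported] + unreported_se

# (a) The criticized shortcut: treat every unreported n.s. effect as exactly 0.01.
est_a, se_a = pool([e for e, _ in reported] + [0.01] * len(unreported_se), all_ses)
print(f"fixed 0.01:          SMD = {est_a:.3f} (SE {se_a:.3f})")

# (b) Multiple imputation: draw observed effects from the non-significant region
#     (|z| < 1.96), pool each completed dataset, then combine with Rubin's rules.
M = 500
ests, within_vars = [], []
for _ in range(M):
    imputed = [rng.uniform(-z_crit * s, z_crit * s) for s in unreported_se]
    est_m, se_m = pool([e for e, _ in reported] + imputed, all_ses)
    ests.append(est_m)
    within_vars.append(se_m ** 2)
qbar = np.mean(ests)                 # pooled point estimate across imputations
within = np.mean(within_vars)        # average within-imputation variance
between = np.var(ests, ddof=1)       # between-imputation variance
total_se = np.sqrt(within + (1 + 1 / M) * between)
print(f"multiple imputation: SMD = {qbar:.3f} (SE {total_se:.3f})")
```

The point estimates can land close together, but the second standard error is necessarily larger, because it reflects how little we actually know about the unreported effects -- which bears directly on the distinction between "evidence of small effects" and "indeterminate evidence" discussed below.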
In terms of the cost-benefit test, it depends on which benefit we are considering here. Addressing these concerns might indeed take months and cost hundreds of hours. It's hard to justify this in terms of current academic/career incentives alone, as the paper had already been accepted for publication. If this were directly tied to grants there might be a case, but as it stands I understand that it could be very difficult for you to take this further.
But I wouldn't characterize doing this as simply "satisfying two critics". The critiques themselves might be sound and relevant, and potentially affect the conclusion (at least in differentiating between "we have evidence the effects are small" and "the evidence is indeterminate", which I think is an important difference). And the value of the underlying policy question (~'Should animal welfare advocates be funding existing approaches to reducing MAP consumption?') seems high to me. So I would suggest that the benefit exceeds the cost here on net, even if we might not have a formula for making it worth your while to make these adjustments right now.
I also think there might be value in setting an example and a standard that, particularly for high-value questions like this, we strive for a high level of robustness, following up on a range of potential concerns and critiques, etc. I'd like to see these things as long-run living projects that can be continuously improved and updated (and re-evaluated). The current research reward system doesn't encourage this, which is a gap we are trying to help fill.
David, there are two separate questions here: whether these analyses should be done, and whether I should have done them in response to the evaluations. If you think these analyses are worth doing, by all means, go ahead!
Seth, for what it's worth, I found your hourly estimates (provided in these forum comments but not something I saw in the evaluator response) on how long the extensions would take to be illuminating. Very rough numbers like this meta-analysis taking 1000 hours for you or a robustness check taking dozens / hundreds of hours more to do properly helps contextualize how reasonable the critiques are.
It's easy for me (even now while pursuing research, but especially before when I was merely consuming it) to think these changes would take a few days.
It also gives me insight into the research production process. How long does it take to do a meta-analysis? How much does rigor cost? How much insight does rigor buy? What insight is possible given current studies? Questions like that help me figure out whether a project is worth pursuing and whether it's compatible with career incentives or more of a non-promotable task.
Love talking nitty gritty of meta-analysis 😃
> Too often, research syntheses focus solely on estimating effect sizes, regardless of whether the treatments are realistic, the outcomes are assessed unobtrusively, and the key features of the experiment are presented in a transparent manner. Here we focus on what we term landmark studies, which are studies that are exceptionally well-designed and executed (regardless of what they discover). These studies provide a glimpse of what a meta-analysis would reveal if we could weight studies by quality as well as quantity. [the point being, meta-analysis is not well-suited for weighing by quality.]
A final reflective note: David, I want to encourage you to think about the optics/politics of this exchange from the point of view of prospective Unjournal participants/authors. There are no incentives to participate. I did it because I thought it would be fun and I was wondering if anyone would have ideas or extensions that improved the paper. Instead, I got some rather harsh criticisms implying we should have written a totally different paper. Then I got this essay, which was unexpected/unannounced and used, again, rather harsh language to which I objected. Do you think this exchange looks like an appealing experience to others? I'd say the answer is probably not.
A potential alternative: I took a grad school seminar where we replicated and extended other people's papers. Typically the assignment was to do the robustness checks in R or whatever, and then the author would come in and we'd discuss. It was a great setup. It worked because the grad students actually did the work, which provided an incentive to participate for authors. The co-teachers also pre-selected papers that they thought were reasonably high-quality, and I bet that if they got a student response like Matthew's, they would have counseled them to be much more conciliatory, to remember that participation is voluntary, to think through the risks of making enemies (as I counseled in my original response), etc. I wonder if something like that would work here too. Like, the expectation is that reviewers will computationally reproduce the paper, conduct extensions and robustness checks, ask questions if they have them, work collaboratively with authors, and then publish a review summarizing the exchange. That would be enticing! Instead what I got here was like a second set of peer reviewers, and unusually harsh ones at that, and nobody likes peer review.
It might be the case that meta-analyses aren't good candidates for this kind of work, because the extensions/robustness checks would probably also have taken Matthew and the other responder weeks, e.g. a fine end of semester project for class credit but not a very enticing hobby.
Just a thought.
I appreciate the feedback. I'm definitely aware that we want to make this attractive to authors and others, both to submit their work and to engage with our evaluations. Note that in addition to asking for author submissions, our team nominates and prioritizes high-profile and potentially high-impact work, and contacts authors to get their updates, suggestions, and (later) responses. (We generally only require author permission to do these evaluations from early-career authors at a sensitive point in their career.) We are grateful to you for having responded to these evaluations.
I would disagree with this. We previously had author prizes (financial and reputational) focusing on authors who submitted work for our evaluation, although these prizes are not currently active. I'm keen to revive these prizes when the situation permits (funding and partners).
But there are a range of other incentives (not directly financial) for authors to submit their work, respond to evaluations and engage in other ways. I provide a detailed author FAQ here. This includes getting constructive feedback, signaling your confidence in your paper and openness to criticism, the potential for highly positive evaluations to help your paper's reputation, visibility, unlocking impact and grants, and more. (Our goal is that these evaluations will ultimately become the object of value in and of themselves, replacing "publication in a journal" for research credibility and career rewards. But I admit that's a long path.)
I would not characterize the evaluators' reports in this way. Yes, there was some negative-leaning language, which, as you know, we encourage the evaluators to tone down. But there were a range of suggestions (especially from Jané) which I see as constructive, detailed, and useful, both for this paper and for your future work. And I don't see this as them suggesting "a totally different paper." To a large extent they agreed with the importance of this project, with the data collected, and with many of your approaches. They praised your transparency. They suggested some different methods for transforming and analyzing the data and interpreting the results.
I think it's important to communicate the results of our evaluations to wider audiences, and not only on our own platform. As I mentioned, I tried to fairly characterize your paper, the nature of the evaluations, and your response. I've adjusted my post above in response to some of your points where there was a case to be made that I was using loaded language, etc.
Would you recommend that I share any such posts with both the authors and the evaluators before posting them? It's a genuine question (to you and to anyone else reading these comments) -- I'm not sure of the correct answer.
As to your suggestion at the bottom, I will read and consider it more carefully -- it sounds good.
Aside: I'm still concerned with the connotation of replication, extension, and robustness checking being something that should be relegated to graduate students. This seems to diminish the value and prestige of work that I believe to be of the highest practical value for important decisions in the animal welfare space and beyond.
In the replication/robustness-checking domain, I think what i4replication.org is doing is excellent. They're working with everyone from graduate students to senior professors to do this work, and treating it as a high-value output meriting direct career rewards. I believe they encourage the replicators to be fair (neither excessively conciliatory nor harsh) and to focus on the methodology. We are in contact with i4replication.org and hoping to work with them more closely, with our evaluations and “evaluation games” offering grounded suggestions for robustness and replication checks.
Yes. But zooming back out, I don't know if these EA Forum posts are necessary.
A practice I saw from i4replication (or some other replication lab) is that the editors didn't provide any "value-added" commentary on any given paper. At least, I didn't see this in any of their tweets. They link to the evaluation reports plus a response from the author and then leave it at that.
Once in a while, there will be a retrospective on how the replications are going as a whole. But I think they refrain from commenting on any paper.
If I had to rationalize why they did that, my guess is that replications are already an opt-in thing with lots of downside. And psychologically, editor commentary has a lot more potential for unpleasantness. Peer review tends to be anonymous, so it doesn't feel as personal because the critics are kept secret. But editor commentary isn't secret... it actually feels personal, and editors tend to have more clout.
Basically, I think the bar for an editor commentary post like this should be even higher than the usual process. And the usual evaluation process already allows for author review and response. So I think a "value-added" post like this should pass a higher bar of diplomacy and insight.
Thanks for the thoughts. Note that I'm trying to engage/report here because we're working hard to make our evaluations visible and impactful, and this forum seems like one of the most promising interested audiences. But also eager to hear about other opportunities to promote and get engagement with this evaluation work, particularly in non-EA academic and policy circles.
I generally aim to just summarize and synthesize what the evaluators had written and the authors' response, bringing in what seemed like some specific relevant examples, and using quotes or paraphrases where possible. I generally didn't give these as my own opinions but rather as the author's and the evaluators'. Although I did specifically give 'my take' in a few parts. If I recall my motivation, I was trying to make this a little bit less dry to get a bit more engagement within this forum. But maybe that was a mistake.
And to this I added an opportunity to discuss the potential value of doing and supporting rigorous, ambitious, and 'living/updated' meta-analysis here and in EA-adjacent areas. I think your response was helpful there, as was the author's. I'd like to see others' takes.
Some clarifications:
The i4replication group does put out replication papers/reports in each case, submits these to journals, and reports on the outcome on social media. But IIRC they only 'weigh in' centrally when they find a strong case suggesting systematic issues/retractions.
Note that their replications are not 'opt-in': they aim to replicate every paper coming out in a set of 'top journals'. (And now they are moving towards research focusing on a set of global issues like deforestation, but still not opt-in.)
I'm not sure what works for them would work for us, though. It's a different exercise. I don't see an easy route towards our evaluations getting attention through 'submitting them to journals' (which, naturally, would also be a bit counter to our core mission of moving research output and rewards away from 'journal publication as a static output').
Also: I wouldn't characterize this post as 'editor commentary', and I don't think I have a lot of clout here. Also note that typical peer review is both anonymous and never made public. We're making all our evaluations public, but the evaluators have the option to remain anonymous.
But your point about a higher bar is well taken. I'll keep this under consideration.
Executive summary: The Unjournal’s evaluations of a meta-analysis on reducing meat/animal-product consumption found the project ambitious but methodologically limited; the author argues meta-analysis can still be valuable in this heterogeneous area if future work builds on the shared dataset with more systematic protocols, robustness checks, and clearer bias handling—while noting open cruxes and incentive barriers to actually doing that follow-up (exploratory, cautiously optimistic).
Key points:
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.