If you would like to get people to stop eating animals, there are a lot of things you could do: protest meat-serving restaurants, hand out leaflets, show online ads, lobby companies to adopt "Meatless Mondays", etc. To compare these, it would be useful to know how much of an impact each has. A while ago I proposed a simple survey to measure the impact of online ads:

  • Show your ads, as usual
  • Randomly divide people into control and experimental groups, 50-50.
  • Experimental group sees the anti-meat page, control group sees something irrelevant.
  • Use retargeting cookies to pull people back in for a follow-up.
  • Ask people whether they eat meat.

Well, some people planned a study along these lines (methodology) and the results are now out. They randomized who saw the anti-meat videos, followed up with retargeting cookies, and asked people questions about their consumption of various animal products. This is the biggest study of its type I know of, and I'm very excited that it's now complete.

The biggest problem I see is that they ended up surveying many fewer people than they set out to. The methodology considered how many people would need to complete the survey to pick up changes of varying sizes, and concluded:

We need to get at minimum 3.2k people to take the survey to have any reasonable hope of finding an effect. Ideally, we'd want say 16k people or more.

They only got 2k responses, however, and only 1.8k valid ones. This means the study is "underpowered": even if an effect exists at the size the experimenters expected, there's a large chance the study wouldn't be able to clearly show the effect.
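For intuition, here is a rough sketch of the kind of sample-size calculation the methodology describes. The baseline vegetarian rate and the hoped-for lift below are made-up illustrative numbers, not the study's actual assumptions:

```python
# Rough sample-size sketch for detecting a difference between two proportions.
# The 6% baseline and 2-percentage-point lift are illustrative assumptions,
# not the numbers the study's methodology actually used.
from scipy.stats import norm

p_control, p_treated = 0.06, 0.08   # assumed vegetarian rates
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
z_beta = norm.ppf(power)            # ~0.84

# Standard normal-approximation formula for two independent proportions:
n_per_group = (z_alpha + z_beta) ** 2 * (
    p_control * (1 - p_control) + p_treated * (1 - p_treated)
) / (p_treated - p_control) ** 2

print(round(n_per_group))  # roughly 2.5k per group with these made-up inputs
```

With inputs like these you'd want a few thousand completed surveys per group, so it's not surprising that 1.8k valid responses in total leaves the study underpowered.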

Still, let's work with what we have. To compensate for the limited data, we should run a single test, the one we think is most applicable. Running multiple tests would mean using a Bonferroni correction or something similar, which dramatically decreases statistical power.
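To see why, here is a toy illustration of the power cost; the number of tests and the effect size are made up for illustration:

```python
# Toy illustration of how a Bonferroni correction reduces power.
# The number of tests (10) and the true-effect z of 1.6 are made-up values.
from scipy.stats import norm

alpha, m = 0.05, 10
z_crit_single = norm.ppf(1 - alpha / 2)      # ~1.96
z_crit_bonf = norm.ppf(1 - alpha / (2 * m))  # ~2.81

true_z = 1.6  # assumed size of the real effect, in z units

# Approximate power: the chance the observed z clears the critical value
# (ignoring the far tail, which contributes almost nothing here).
power_single = norm.sf(z_crit_single - true_z)      # ~0.36
power_bonferroni = norm.sf(z_crit_bonf - true_z)    # ~0.11

print(round(power_single, 2), round(power_bonferroni, 2))
```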

Before looking at the data or reading the writeup, I committed (via email to David and Allison) to an approach: what I thought was the simplest, most straightforward way of looking at it. I would categorize each respondent as "meat-eating" or "vegetarian" based on whether they reported eating any meat in the past two days, compute an effect size as the difference in vegetarianism between the two groups, and compute a p-value with a standard two-tailed t-test.

So what do we have to work with for questions? The survey asked, among other things:

In the past two days, how many servings have you had of the following foods? Please give your best guess.
  • Pork (ham, bacon, ribs, etc.)
  • Beef (hamburgers, meatballs, in tacos, etc.)
  • Dairy (milk, yogurt, cheese, etc.)
  • Eggs (omelet, in salad, etc.)
  • Chicken and Turkey (fried chicken, turkey sandwich, in soup, etc.)
  • Fish and Seafood (tuna, crab, baked fish, etc.)

This is potentially rich data, except I don't expect people's responses to be very accurate. If I tried to answer it, I'm sure I'd miss things for silly reasons, like forgetting what I had for dinner yesterday or not being sure what counts as a serving. On the other hand, if I had a policy for myself of not eating meat, it would be very easy to answer those questions! So I categorized people simply as "eats meat" vs "doesn't eat meat".

There were 970 control and 1054 experimental responses in the dataset they released. Of these, only 864 (89%) and 934 (89%) fully filled out this set of questions. I counted someone as a meat-eater if they answered anything other than "0 servings" to any of the four meat-related questions, and a vegetarian otherwise. Totaling up responses I see:

               valid responses   vegetarians   %
control        864               55            6.4%
experimental   934               78            8.4%
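
Here is a minimal sketch of how this tally might be computed from the released data; the filename, column names, and group labels are hypothetical placeholders, not the dataset's actual headers:

```python
# Sketch of the "eats meat" vs. "doesn't eat meat" tally described above.
# The CSV filename and column names are hypothetical placeholders.
import pandas as pd

MEAT_COLS = ["pork", "beef", "chicken_turkey", "fish_seafood"]

responses = pd.read_csv("study_responses.csv")

# Only count people who answered all four meat questions.
complete = responses.dropna(subset=MEAT_COLS).copy()

# Vegetarian = reported 0 servings for every meat question in the past two days.
complete["vegetarian"] = (complete[MEAT_COLS] == 0).all(axis=1)

# Tally valid responses and vegetarians for the control and experimental groups.
summary = complete.groupby("group")["vegetarian"].agg(
    valid_responses="count", vegetarians="sum", rate="mean")
print(summary)
```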

The bottom line is, about 2 percentage points more people in the experimental group were vegetarians than in the control group (p=0.108; originally miscomputed as 0.053, see the update below). Honestly, this is far higher than I expected. We're surveying people who saw a single video four months ago, and we're seeing that about 2 percentage points more of them are vegetarian than they would have been otherwise.

Update 2016-02-20: I computed the p-value wrong; 0.053 was from a one-tailed test instead of a two-tailed test. The right p-value is 0.108. (I had used an online calculator intended for evaluating A/B tests that give you conversion numbers. It didn't specify one- or two-tailed, but since two-tailed is what you should use for A/B tests, that's what I assumed it was using. After Alexander, Michael, and Dan pointed out that it looked wrong, I computed the p-value computationally. [1])
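
For reference, a standard two-sample test of proportions on the counts above reproduces both numbers (a sketch using statsmodels, not necessarily the exact calculation behind the update):

```python
# Two-sided z-test for the difference in vegetarian rates:
# 78/934 in the experimental group vs. 55/864 in the control group.
from statsmodels.stats.proportion import proportions_ztest

count = [78, 55]    # vegetarians: experimental, control
nobs = [934, 864]   # valid responses: experimental, control

z, p_two_sided = proportions_ztest(count, nobs, alternative='two-sided')
_, p_one_sided = proportions_ztest(count, nobs, alternative='larger')

print(round(z, 2), round(p_two_sided, 3), round(p_one_sided, 3))
# ~1.61, ~0.108 (two-tailed), ~0.054 (the one-tailed value, close to the 0.053 above)
```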

This is a very different way of interpreting the study results than any of the writeups I've seen. Edge's Report, Mercy for Animals, and Animal Charity Evaluators all conclude that there was basically no effect. I think this mostly comes from two things: they asked questions where I'd expect the data to be noisier, like how much of various foods people think they eat or their attitudes toward meat consumption, and they asked lots of different questions, which means correcting downward to compensate for the multiple comparisons.

(There's probably something interesting you could do comparing the responses to the attitude questions with whether people reported eating any meat. I started looking at this some, just roughly, but didn't get very far. Maybe there are hints that the ads do their work by reducing recidivism instead of convincing people to give up meat, but I'm too sleepy to figure this out. My work is all in this sheet.)


[1] See footnote on jefftk.com version; e-a.com doesn't preserve indentation in pre-tags.

Comments



Thanks very much for doing this Jeff. It's useful to have an independent re-analysis. My credence that these ads work is increased from knowing that the data has been re-analysed by someone who would have expected no effect, and in fact did find one. Even if the effect was reducing recidivism, that still seems pretty useful! Hopefully in the future there will be more studies done that actually get statistically significant results.

This is a very different way of interpreting the study results than any of the writeups I've seen. Edge's Report, Mercy for Animals, and Animal Charity Evaluators all conclude that there was basically no effect.

From your results it still would be reasonable to conclude that there's "no effect" since the p-value is >0.05, but the p-value is low enough that I would give a follow-up study a reasonably high chance of getting a statistically significant result.

Also: it looks like you're using a one-sided t-test to get your p-value. I don't know much about significance testing but wouldn't it be better to use a two-sided t-test? My understanding is that one-sided tests are sort of cheating by making your p-value half of what it really should be.

it looks like you're using a one-sided t-test to get your p-value.

I agree that a two-sided test would be the right thing to use here, and p-value calculations aren't something I fully understand. Is this calculation one-sided or two-sided?

It looks like the NORMDIST function on your sheet is taking the integral from 0 to z_score, which is one-sided. A two-sided test would take

(integral from z_score to infinity) + (integral from -infinity to -z_score)

I can't tell what's being done in that calculation.

I'm getting a p-value of 0.108 from a Pearson chi-square test (with cell values 55, 809; 78, 856). A chi-square test and a two-tailed t-test should give very similar results with these data, so I agree with Michael that it looks like your p=0.053 comes from a one-tailed test.

Yes, you're right. Sorry! I redid it computationally and also got 0.108. Post updated.
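
For anyone curious, one way to do such a computational check is a permutation test on the difference in vegetarian rates (a sketch, not necessarily the exact approach used):

```python
# Permutation test: shuffle the group labels many times and see how often a
# difference in vegetarian rates at least as large as the observed one
# (78/934 vs. 55/864) arises by chance.
import numpy as np

rng = np.random.default_rng(0)
outcomes = np.array([1] * 78 + [0] * (934 - 78) +   # experimental group
                    [1] * 55 + [0] * (864 - 55))    # control group
labels = np.array([1] * 934 + [0] * 864)            # 1 = experimental

observed = outcomes[labels == 1].mean() - outcomes[labels == 0].mean()

diffs = np.empty(100_000)
for i in range(diffs.size):
    shuffled = rng.permutation(labels)
    diffs[i] = outcomes[shuffled == 1].mean() - outcomes[shuffled == 0].mean()

p_two_sided = np.mean(np.abs(diffs) >= abs(observed))
print(round(p_two_sided, 3))  # lands near 0.11
```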
