
Net Promoter Score (NPS) is a widely used measure of customer satisfaction. It asks “How likely is it that you would recommend [brand] to a friend or colleague?”, and the response is (usually) a number between 0 and 10. However, the aggregate score is not an average but a nonlinear function of the responses (conventionally, the percentage of respondents answering 9 or 10 minus the percentage answering 0 to 6). CEA has moved away from this function in favor of simply taking the arithmetic mean. Briefly, this is because the original results don’t replicate, NPS is not empirically supported, it requires larger sample sizes, and it violates survey best practices.

Summary

  1. NPS is widely used, but the research behind it has failed to replicate, even when the replication used data reconstructed from the originally published scatterplots (!).
  2. Measures of satisfaction are more predictive than NPS of outcomes such as firm growth and whether the respondent actually recommends the product to others.
  3. The American Customer Satisfaction Index is an alternative which has stronger empirical grounding, as well as a huge number of publicly available benchmarks. It uses 3 questions, on a 10 point scale, whose scores are averaged and normalized to a 0-100 scale:[1]
    1. What is your overall satisfaction with X?
    2. To what extent has X met your expectations?
    3. How well did X compare with the ideal (type of offering)?
  4. CEA mostly still asks the NPS question, but switched to taking the arithmetic mean of the results. We call this the “likelihood to recommend” (LTR).[2]
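To make the difference between the two aggregates concrete, here is a minimal Python sketch. It assumes the conventional NPS definition, which this post does not spell out: respondents answering 9 or 10 count as promoters, those answering 0 to 6 count as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function names `nps` and `ltr` and the example responses are illustrative only.

```python
from statistics import mean


def nps(scores):
    """Conventional NPS: % promoters (9-10) minus % detractors (0-6).

    Assumes responses are integers on a 0-10 scale.
    """
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def ltr(scores):
    """Likelihood to recommend: the plain arithmetic mean of the responses."""
    return mean(scores)


# Two made-up samples with the same mean but different NPS.
sample_a = [9, 9, 7, 7]
sample_b = [10, 10, 6, 6]

print(ltr(sample_a), nps(sample_a))  # mean 8, NPS 50.0
print(ltr(sample_b), nps(sample_b))  # mean 8, NPS 0.0
```

Both samples have the same mean, yet their NPS differs by 50 points, because NPS only looks at which of three buckets each response falls into.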

More information

  1. NPS was introduced in 2003 with the claim that it was the best predictor of growth across a data set of companies. This data set was small and subject to p-hacking. The raw data has not been published (including, ironically, the pieces the author says should always be published when reporting NPS scores). The original research methodology was:

“We then obtained a purchase history for each person surveyed and asked those people to name specific instances in which they had referred someone else to the company in question… The data allowed us to determine which survey questions had the strongest statistical correlation with repeat purchases or referrals….
One question was best for most industries. “How likely is it that you would recommend [company X] to a friend or colleague?” ranked first or second in 11 of the 14 cases studies”[3]

  2. Replication attempts (including ones which reverse engineered the original data set from published scatterplots) have failed to find significant predictive value from NPS. A wide variety of alternative measures exist, some of which have stronger empirical grounding.
    1. Notably, NPS is worse than satisfaction measures at predicting whether the respondent will actually recommend the product.
  3. Replication attempts find alternate definitions of the NPS scale to be more predictive than the commonly used one, even if the question is kept the same (e.g. using a 7-point scale).
  4. The weird way NPS is calculated means that it requires substantially larger sample sizes than a simple mean.
  5. The NPS question conflicts with commonly accepted best practices in survey design (e.g. it uses an 11-point scale instead of a 5-point one).
  6. There doesn’t seem to be any particular reason to think that NPS is good, apart from it being widely used.
  7. So if it’s so terrible, why does everyone use it? A Wall Street Journal article implies that it is used precisely because it’s so easy to manipulate: “Out of all the mentions the Journal tracked on earnings calls, no executive has ever said the score declined.”[4]
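The sample-size point can be made a little more concrete. The sketch below is not taken from the replication papers. It assumes the conventional NPS coding, in which each response counts as +1 (promoter), 0 (passive) or -1 (detractor), and it assumes independent responses. Under those assumptions the sampling error of NPS follows from the variance of that three-valued variable. The function names and the proportions in the usage note are hypothetical.

```python
import math


def nps_variance(p_promoter, p_detractor):
    """Per-response variance of the +1 / 0 / -1 coding behind NPS."""
    expected = p_promoter - p_detractor
    return p_promoter + p_detractor - expected ** 2


def nps_standard_error(p_promoter, p_detractor, n):
    """Standard error of NPS (on the -100..100 scale) from n responses."""
    return 100 * math.sqrt(nps_variance(p_promoter, p_detractor) / n)


def responses_needed(p_promoter, p_detractor, margin, z=1.96):
    """Responses needed so that z standard errors fit within `margin` NPS points."""
    variance = nps_variance(p_promoter, p_detractor)
    return math.ceil((z * 100) ** 2 * variance / margin ** 2)
```

For instance, with a made-up 50% promoters and 20% detractors, `responses_needed(0.5, 0.2, margin=5)` comes out at roughly 940 responses for a margin of plus or minus 5 NPS points at about 95% confidence. To compare this with the mean, you would also need the standard deviation of the raw 0-10 responses, which this sketch does not model.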

Further Reading

  1. https://www.van-haaften.nl/images/documents/pdf/Measuring%20customer%20satisfaction%20and%20loyalty.pdf
  2. https://www.researchgate.net/publication/228660597_A_Longitudinal_Examination_of_Net_Promoter_and_Firm_Revenue_Growth
  3. https://www.researchgate.net/publication/239630908_The_Value_of_Different_Customer_Satisfaction_and_Loyalty_Metrics_in_Predicting_Customer_Retention_Recommendation_and_Share-of-Wallet
  4. https://community.verint.com/b/customer-engagement/posts/acsi-american-customer-satisfaction-index-score-its-calculation
  5. https://www.jmir.org/2008/1/e4/
  6. https://en.wikipedia.org/wiki/American_Customer_Satisfaction_Index
  7. https://pubsonline.informs.org/doi/10.1287/mksc.1070.0292
  8. http://www.tsisurveys.com/morgan-rego.pdf
[1] (Note: different sources seem to use slightly different wording, and I’m not sure what the “official” wording is because it’s proprietary. Also, the official version uses a proprietary weighting of these three questions, but people online seem to think the weights are approximately equal.)

[2] We usually do this because we don’t want to take people’s time up by asking three questions. I haven’t done a very rigorous analysis of the trade-offs here though, and it could be that we are making a mistake and should use ACSI instead.

[3] “ranked first or second in 11 of the 14 cases studies” should already be setting off alarm bells.

[4] Of course, this doesn’t explain why investors allow executives to tie their compensation to easily hackable metrics.

Comments



Cool! Glad to see this, I've been harping on about the NPS for some time (1, 2, 3, 4).

> We usually do this because we don’t want to take people’s time up by asking three questions. I haven’t done a very rigorous analysis of the trade-offs here though, and it could be that we are making a mistake and should use ACSI instead.

As you may have considered, you could ask just one of the ACSI items, rather than asking the one NPS item. This would have lower reliability than asking all three ACSI items, but I suspect that one ACSI item would have higher validity than the one NPS item. (This is particularly the case when trying to elicit general satisfaction with the EA community, but maybe less so if you literally want to know whether people are likely to recommend an event to their friends).

The added value of using three items to generate a composite measure is potentially pretty straightforward to estimate, especially if you have prior data with the items. Happy to talk more about this.

Thanks David! If you have references or could say more about the virtues of asking one ACSI question versus the NPS question, I would love to read/hear them.

Hi Ben.

There are two broad reasons why I would prefer the ACSI items (considered individually) over the NPS (style) item:

  • The ACSI items are (mostly) more face valid
  • The ACSI items generally performed better than the NPS when we ran both of these in the EAS 2020

Face validity

This depends on what you are trying to measure, so I’ll start with the context in the EAS, where (as I understand it) we are trying to measure general satisfaction with or evaluation of the EA community.

Here, I think the ACSI items we used (“How well does the EA community compare to your ideal? [(1) Not very close to the ideal - (10) Very close to the ideal]” and “What is your overall satisfaction with the EA community? [(1) Very dissatisfied - (10) Very satisfied]”) more closely and cleanly reflect the construct of interest.

In contrast, I think the NPS style item (“If you had a friend who you thought would agree with the core principles of EA, how excited would you be to introduce them to the EA community?”) does not very clearly or cleanly reflect general satisfaction. Rather, we should expect it to be confounded with:

  • Attitudes about introducing people to the EA community (different people have different views about how positive growing the EA community more broadly is)
  • Perceived/projected personal “excitement” (related to one’s (perceived) emotionality, excitability etc.)
  • Sociability/extraversion/interest in introducing friends to things in general, as well as one’s own level of social engagement with EA (if one is socially embedded in EA, introducing friends might make more sense than if you are very pro EA, but your interaction with it is entirely non-social)

I think some of these issues are due to the general inferiority of the NPS as a measure of what it’s supposed to be measuring.

And some of them are due to the peculiarities of the context where we’re using NPS (generally used to measure satisfaction with a consumer product) to measure attitudes towards a social movement one is a part of (hence the need to add the caveat about “a friend who you thought would agree with the core principles of EA”).

Some of the other contexts where you’re using NPS might differ. Likelihood to recommend may make more sense when you’re trying to measure evaluations of an event someone attended. But note that the ‘NPS’ question may simply be measuring qualitatively different things when used in these different contexts, despite the same instrument being presented. That is, asking about recommending the EA community as a whole elicits judgments about whether it’s good to recommend EA to people (does spreading EA seem impactful or harmful etc?), whereas asking about recommending an event someone attended mostly just reflects positive evaluation of the event. Still, I slightly prefer a simple ACSI satisfaction measure over NPS style items, since I think it will be clearer, as well as more consistent across contexts.

Performance of measures

Since we included both the NPS item and two ACSI items in EAS 2020 we can say a little about how they performed, although with only 1-2 items and not much to compare them to, there’s not a huge amount we can do to evaluate them.

Still, the general impression I got from the performance of the items last year confirms my view that the two ACSI measures cohere as a clean measure of satisfaction, while NPS and the other items are more of a mess. As noted, we see that the two ACSI measures are closely correlated with each other (presumably measuring satisfaction), while the NPS measure is moderately correlated with the ‘bespoke’ measures (e.g. “I feel that I am part of the EA community”), which seem to be (noisily) measuring engagement more than satisfaction or positive evaluation. I think it’s ultimately unclear what any of those three items is measuring, since they’re all just imperfectly correlated with each other, with engagement, and with satisfaction, so I think they are measuring a mix of things, some of which are unknown. Theoretically, one could simply run a larger suite of items, designed to measure satisfaction, engagement, and other things which we think might be related (such as what the bespoke measures are intended to measure) and tease out what the measures are tracking. But there’s not a huge amount we can do with just 5-6 items and 2-3 apparent factors they are measuring.

Benefits of multiple measures

As an aside, we put together some illustrations of the possible concrete benefits of using a composite measure of multiple items, rather than a single measure.

The plot below shows the error (differences between the measured value and the true value: higher values, in absolute terms, are worse) with a single item vs an average made from two or three items. Naturally, this depends on assumptions about how noisy each item is and how correlated the items are, but it is generally the case that using multiple items helps to reduce error and ensure that estimates come closer to the true value.

This next image shows the power to detect a correlation of around r = 0.3 using 1, 2 or 3 items. The composite of more items should have lower measurement error. When only a single item is used, the higher measurement error means that a true relationship between the measured variable and another variable of interest can be harder to detect. With the average of 2 or 3 items, the measure is less noisy, and so the same underlying effect can be detected more easily (i.e., with fewer participants). (The three different images just show different standards for significance.)
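The error-reduction point can be reproduced with a small simulation. This is only a sketch under simple assumptions of its own, not the analysis behind the plots: each respondent has a true value drawn from a standard normal distribution, each item equals that true value plus independent normal noise of the same size, and the composite is the plain average of the items. The function name `composite_rmse` and its parameters are hypothetical.

```python
import math
import random


def composite_rmse(n_items, noise_sd=1.0, n_respondents=20_000, seed=0):
    """Root-mean-square error of an n_items average against the true value.

    Assumes independent, equally noisy items; correlated item errors
    would shrink the benefit of averaging.
    """
    rng = random.Random(seed)
    squared_error = 0.0
    for _ in range(n_respondents):
        true_value = rng.gauss(0.0, 1.0)
        items = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_items)]
        composite = sum(items) / n_items
        squared_error += (composite - true_value) ** 2
    return math.sqrt(squared_error / n_respondents)


for k in (1, 2, 3):
    print(k, round(composite_rmse(k), 3))
```

Under these assumptions the error shrinks roughly in proportion to 1/sqrt(k) as the number of items k grows, so going from one item to three cuts it by a bit more than 40%.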


 

I just wanted to say that I always appreciate your in-depth responses David! They are always really easy to follow and informative :)

I'd also be interested in this!

Hello, since I saw this post, I switched a couple of things to using ACSI. I always thought NPS seemed pretty bad, and mostly only included it for comparison with groups like CEA who were using it.

Do you have any data you're able to share publicly yet?

 

Additionally:

> The American Customer Satisfaction Index is an alternative which has stronger empirical grounding, as well as a huge number of publicly available benchmarks. It uses 3 questions, on a 10 point scale, whose scores are averaged and normalized to a 0-100 scale:[1]

How exactly are you calculating it? The Wikipedia formula seems wrong to me, unless I'm misunderstanding it.

(I have 9 answers for each of the three questions. The average responses are 9.4, 9.6, and 9.3. So I think what I'm supposed to do is =((9.4*1+9.6*1+9.3*1)-1)/9*100 . This gives me "303.7037037" which clearly seems wrong.)

My interpretation of what it should be: 

=(((9.4+9.6+9.3)-3)/27)*100

Which equals 93.8. The simpler but slightly less accurate =((9.4+9.6+9.3)/3)*10 comes out similarly, at 94.4.

Which seems very good. E.g. "Full-Service Restaurants", "Financial Advisors", and "Online News and Opinion" all seem to hover around 70-80, while government services range a bit more widely from 60 to 90.

(Caveat that I didn't realise that you were supposed to include labels on 1 and 10 for each of the questions until I checked the Wikipedia entry just now to calculate it, and I'm not sure how this would affect the results. The labels seem pretty weird to me, so I suspect it does affect it somehow.)

Thanks!
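The calculation discussed in this comment can be sketched in Python as follows. It assumes the three items are each answered on a 1-10 scale and that the weights are normalized to sum to 1, with equal weights as the default. The official ACSI weights are proprietary (see footnote 1), so the equal weighting is an assumption, and `acsi_score` is a hypothetical helper name.

```python
def acsi_score(item_means, weights=None):
    """Rescale a weighted average of 1-10 item means to a 0-100 score.

    Weights are normalized to sum to 1; equal weights are an assumption,
    since the official ACSI weighting is proprietary.
    """
    if weights is None:
        weights = [1.0] * len(item_means)
    if len(weights) != len(item_means):
        raise ValueError("need one weight per item")
    total_weight = sum(weights)
    weighted_mean = sum(w * x for w, x in zip(weights, item_means)) / total_weight
    # A mean of 1 maps to 0 and a mean of 10 maps to 100.
    return (weighted_mean - 1) / 9 * 100


print(round(acsi_score([9.4, 9.6, 9.3]), 1))
```

With equal weights this is algebraically the same as the `(sum - 3) / 27 * 100` form above. Weights of 1 each, without normalizing, are what push the first formula above 100. Using the rounded means 9.4, 9.6 and 9.3 gives about 93.7. The 93.8 quoted above is consistent with unrounded means being used in the spreadsheet.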

Appreciate this update! 

> NPS [...] violates survey best practices.

Agree. For our EA retreats in Germany, we've also always just used the mean. I'm surprised that NPS is so widely used in industry. 
