In short: Building weighted factor models using a regular weighted average (“weighted arithmetic mean”) allows high values in one factor to compensate for very low values or a zero in another. This poorly reflects the multiplicative nature of many impact-oriented weighted factor models. Using the weighted geometric mean avoids this issue by penalising low values more strongly.
Weighted factor models (WFMs) are a useful prioritisation tool when comparing multiple options based on multiple criteria. They’re simple mathematical models used to make decisions by assigning different levels of importance (“weights”) to the criteria. Each option is scored on all criteria, and a final score for each option is calculated by taking the weighted average of the scores. In effective altruism, WFMs are used to, for example, evaluate one’s career focus, determine the beneficiaries of a development programme, or compare different interventions.
In many applications of a WFM useful for effective altruism, I believe that taking the simple weighted average (“weighted arithmetic mean”) can lead to overly optimistic conclusions for options where one or more of the criteria are scored low (near zero). The weighted arithmetic mean assumes that a score of 0 on one criterion can be compensated for by high scores on other criteria. When the criteria have a multiplicative rather than an additive nature (such as some version of importance, tractability, and neglectedness), this can lead to options being ranked as impactful even though they aren’t in practice.
I think many - but not all - weighted factor models would be more accurate if they used the weighted geometric mean instead of the weighted arithmetic mean. This applies to WFMs with criteria with a multiplicative nature where one very low score cannot be linearly compensated for with a high score elsewhere, i.e. one score going from 3 to 1 cannot be compensated for by the score of a criterion with equal weight going from 7 to 9. In these cases, the final rank should often be proportional to the product of the factors, rather than the sum. This is especially true for models that capture some type of “importance, tractability, neglectedness” (ITN) data, or most models where it would be unacceptable if the value of one criterion approaches zero.
I sometimes see impact-oriented organisations and individuals use the weighted arithmetic mean where I think the weighted geometric mean would be preferable. However, there are definitely impact-oriented organisations that are already using the geometric mean in their prioritisation. (Shoutout to Animal Charity Evaluators, footnote 16.) I'm not the first person prioritising with multiplicative models or geometric means, but I think it’s good to raise attention to this issue and explain when and why to use the geometric mean.
In this post, I will first explain the difference between the weighted arithmetic mean and the weighted geometric mean, then explain why the arithmetic mean can be the wrong method to use, and finally provide three examples. To make your life easier, I’ll end with formulas for the weighted geometric mean for your favourite spreadsheet software.
Some background: weighted arithmetic and weighted geometric means
If you’re already familiar with the maths behind means, just skip this section.
Weighted arithmetic mean
The arithmetic mean is based on the addition of the values. It is the mean you’ve probably learned in school and is often just called the “average”. You simply sum up all the values in your list, and then divide the result by the number of values in your list. The weighted arithmetic mean is useful when we don’t find all the values equally important. The weighted arithmetic mean (WAM) is calculated by:
- Multiplying each value by its respective weight
- Summing the multiplied values
- Dividing the result by the sum of the weights.[1] (In most cases, the sum of the weights is simply one, so the denominator can be ignored.)
Mathematically, where $x_i$ are the data points and $w_i$ their respective weights:

$$\mathrm{WAM} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
A key characteristic of the WAM in the context of this post is that it allows high values in one factor to compensate for low values in another. Even if one or more values are zero, the final result is not (as long as another value with a positive weight is above zero). For example, the (unweighted) average of 0 and 5 is equal to the (unweighted) average of 1 and 4 (namely 2.5).
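The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_arithmetic_mean` is my own and does not come from any particular library.

```python
def weighted_arithmetic_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Equal weights: a score of 0 is fully compensated for by a 5 elsewhere.
zero_and_five = weighted_arithmetic_mean([0, 5], [0.5, 0.5])
one_and_four = weighted_arithmetic_mean([1, 4], [0.5, 0.5])
print(zero_and_five, one_and_four)  # 2.5 2.5
```

Both pairs end up with the same score, even though one of them contains a zero.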
Weighted geometric mean
The weighted geometric mean uses a multiplicative method rather than an additive method. It is calculated by:
- Raising all values to the power of their respective weight
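- Taking the product of the results of step 1.
- Taking the root of that product, with the sum of the weights as the degree of the root (equivalently, raising the product to the power of 1 divided by the sum of the weights).[2]
Mathematically:

$$\mathrm{WGM} = \left(\prod_{i=1}^{n} x_i^{w_i}\right)^{1 / \sum_{i=1}^{n} w_i}$$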
A key characteristic of the geometric mean is that it yields low values if any factor is low, and yields zero when any factor is zero. This reflects the reality that, in many decision models, a very poor result in one criterion cannot simply be offset by better performance in others. This aligns with the multiplicative nature of many prioritisation frameworks.
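A minimal Python sketch of the same calculation, assuming Python 3.8 or later for `math.prod`. The function name `weighted_geometric_mean` is illustrative, and the check for negative values is my own addition.

```python
import math


def weighted_geometric_mean(values, weights):
    """Product of value ** weight, raised to 1 / (sum of the weights)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(v < 0 for v in values):
        raise ValueError("the geometric mean needs non-negative values")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / total_weight)


# The same two pairs that tie at 2.5 under the arithmetic mean:
print(weighted_geometric_mean([0, 5], [0.5, 0.5]))  # 0.0
print(weighted_geometric_mean([1, 4], [0.5, 0.5]))  # 2.0
```

The pair containing a zero drops to zero, and the pair containing a low score ends up below its arithmetic mean of 2.5.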
The problem with weighted arithmetic means
The arithmetic mean in weighted factor models can be problematic because it treats factors as additive, even when the criteria interact in a multiplicative way. This allows a high score in one criterion to offset a very low score in another, which can lead to misleading results. For example, in importance-tractability-neglectedness models, a cause that is highly important but extremely intractable might still receive a relatively high score using the arithmetic mean, even though its intractability should severely limit its potential impact.
The issue arises because many of the criteria used in these models aren’t truly additive; they often interact multiplicatively. High impact typically requires all factors—such as importance, tractability, and neglectedness—to contribute meaningfully. If one factor is near zero, the overall potential should decrease significantly. The geometric mean addresses this by multiplying the factors, which gives more weight to low values and prevents a high score in one area from compensating for weaknesses elsewhere. This makes the geometric mean a more accurate reflection of real-world decision-making, where the overall impact depends on strong performance across all criteria. The difference between the geometric and arithmetic means is especially pronounced when the data contain very low values.
Below I illustrate the importance of using the geometric mean using three hypothetical and exaggerated examples. (I created these tables for the sake of the argument, and they’re not representative of my actual views.)
Example 1: Which group of animals should we advocate for?
This example is based on a modified version of the one in the prioritisation report by Animal Advocacy Africa. I applied the model to the Dutch context, removed a few criteria for simplicity, and changed the 1-5 scale to a 0-10 scale.
Imagine you want to choose whether to work on improving the welfare of farmed animals, companion animals, or working animals in the Netherlands. Obviously, the scale of farmed animal suffering is much bigger than that of companion animals and working animals. In fact, there are hardly any working animals in the Netherlands, bar some canine units. Besides the scale, we also want to know about the evidence base for interventions, their cost-effectiveness, the neglectedness of work for the type of animal, the societal receptivity of animal advocacy work, and funding and talent availability.
With these criteria and options we can construct a weighted factor model, with our scores on a 0 to 10 scale.
Criteria | Weight | Farmed animals | Companion animals | Working animals |
Scale | 20% | 10 | 2 | 1 |
Evidence-base | 15% | 8 | 8 | 10 |
Cost-effectiveness | 20% | 8 | 7 | 10 |
Neglectedness | 15% | 8 | 3 | 10 |
Societal receptivity | 10% | 4 | 8 | 7 |
Funding and talent availability | 20% | 6 | 6 | 6 |
Weighted arithmetic mean | | 7.60 | 5.45 | 7.10 |
Weighted geometric mean | | 7.37 | 4.81 | 5.50 |
Regardless of the type of mean that we use, farmed animals take the top spot in this prioritisation exercise. The weighted arithmetic mean, however, would suggest that working animals are a good second option, even though there is hardly any work to be done in this area. This is because cost-effectiveness, neglectedness, and societal receptivity of the type of work can outweigh the small scale of the problem, even if this is near-zero.
The geometric mean first multiplies the scores of all criteria for any option, which means that the low value for the scale of working animal suffering becomes much more important in the decision-making process. Since there is no point in working on a cause with a near zero scale, which has zero room for more efforts (low neglectedness), or which has zero availability of funding and talent, the weighted geometric mean is a better prioritisation method than the weighted arithmetic mean, in my opinion.
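To check the table, here is a short Python sketch that recomputes both means from the weights and scores in Example 1. The helper names `wam` and `wgm` are my own shorthand.

```python
import math

# Weights and 0-10 scores copied from the table in Example 1.
weights = [0.20, 0.15, 0.20, 0.15, 0.10, 0.20]
scores = {
    "Farmed animals": [10, 8, 8, 8, 4, 6],
    "Companion animals": [2, 8, 7, 3, 8, 6],
    "Working animals": [1, 10, 10, 10, 7, 6],
}


def wam(values, weights):
    """Weighted arithmetic mean."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def wgm(values, weights):
    """Weighted geometric mean."""
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / sum(weights))


for option, values in scores.items():
    print(f"{option}: WAM={wam(values, weights):.2f}, "
          f"WGM={wgm(values, weights):.2f}")
```

Rounded to two decimals, the results should match the last two rows of the table. Working animals fall from 7.10 to 5.50 when moving to the geometric mean, while farmed animals barely move.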
Example 2: Which job should I pick?
This example is based on a modified version of Impactful Government Careers’ Career Decision Tool. I adjusted the criteria, weights and changed the 1-10 scale to a 0-10 scale.
I can use a weighted factor model to decide between jobs that I’m considering applying to. I would look for a job where I can make a social impact (35%), that fits me well (20%), and - since I’m in the early stages of my career - provides good personal growth potential (20%). I also attach weights to prestige (10%) and salary (15%).
The weighted factor model below compares becoming a corporate consultant, becoming a charity researcher, becoming the Prime Minister of the Netherlands, and becoming President of the United States, with scores between 0 and 10.
Criteria | Weight | Corporate consultant | Charity researcher | Prime Minister of NL | President of the United States |
Social impact | 35.0% | 0 | 8 | 9 | 10 |
Personal fit | 20.0% | 8 | 8 | 1 | 0 |
Personal growth potential | 20.0% | 9 | 8 | 10 | 10 |
Prestige | 10.0% | 8 | 7 | 10 | 10 |
Salary | 15.0% | 9 | 7 | 9 | 10 |
Weighted arithmetic mean | | 5.55 | 7.75 | 7.70 | 8.00 |
Weighted geometric mean | | 0.00 | 7.74 | 5.99 | 0.00 |
Using the arithmetic mean for these criteria suggests that I should become the President of the United States. However, even though the job is impactful, prestigious, well paid, and a good personal growth opportunity, I don’t think that I should try to become the President of the United States, for the simple reason that I’m ineligible. Likewise, I shouldn’t become a corporate consultant because helping big companies become monopolies and dodge taxes contributes nothing to society, assuming that I'm not earning to give. Using the arithmetic mean assumes that a personal fit or social impact of 0 can be compensated for by high scores on other criteria. Using the weighted geometric mean acknowledges the multiplicative nature of these criteria, and therefore this method gives both the corporate consultant and the President of the United States a final score of 0.
Similarly, the score based on the weighted arithmetic mean suggests that being the Prime Minister of the Netherlands is a job that would suit me well. While I’m eligible to become the PM, I am way underqualified to do this job (although I’d still do a better job than our current PM lol). The score based on the weighted geometric mean penalises the low personal fit more, and suggests that PM is probably not the best job for me.
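The career table can be reproduced with the following Python sketch, which shows the best option under each mean and the options that the geometric mean sets to zero. The helper names are my own.

```python
import math

# Weights and 0-10 scores copied from the career table in Example 2
# (social impact, personal fit, growth potential, prestige, salary).
weights = [0.35, 0.20, 0.20, 0.10, 0.15]
jobs = {
    "Corporate consultant": [0, 8, 9, 8, 9],
    "Charity researcher": [8, 8, 8, 7, 7],
    "Prime Minister of NL": [9, 1, 10, 10, 9],
    "President of the United States": [10, 0, 10, 10, 10],
}


def wam(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def wgm(values, weights):
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / sum(weights))


best_by_wam = max(jobs, key=lambda job: wam(jobs[job], weights))
best_by_wgm = max(jobs, key=lambda job: wgm(jobs[job], weights))
zeroed_out = [job for job, values in jobs.items() if wgm(values, weights) == 0]

print(best_by_wam)  # President of the United States
print(best_by_wgm)  # Charity researcher
print(zeroed_out)   # ['Corporate consultant', 'President of the United States']
```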
Example 3: Geographic weighted factor model
Cryonitis syndrome is a horrible painful (and fictional) disease that causes people’s limbs to slowly freeze when they get infected with a virus. Luckily, it can be easily treated with some simple cheap antiviral medicine. The (also fictional) Warmth for Life Foundation delivers medicines to those most in need and is considering whether to start a new programme in Abiertia, Astheniland, the Vitaris Islands, or Saludistan. Someone in their strategy team creates the following model to help the organisation decide:
Criteria | Weight | Abiertia | Astheniland | Vitaris Islands | Saludistan |
Presence of cryonitis syndrome | 30.0% | 7 | 9 | 1 | 0 |
Lack of diagnosis tools | 25.0% | 5 | 6 | 8 | 10 |
Ease of starting an NGO | 20.0% | 10 | 5 | 10 | 10 |
Low presence of similar NGOs | 25.0% | 3 | 6 | 10 | 10 |
Weighted arithmetic mean | | 6.10 | 6.70 | 6.80 | 7.00 |
Weighted geometric mean | | 5.59 | 6.53 | 4.74 | 0.00 |
The weighted arithmetic mean suggests that the most promising country is Saludistan - a country that has no cryonitis infections whatsoever. The next best country would be the Vitaris Islands, which hardly has any infections. Astheniland and Abiertia, where cryonitis syndrome is actually a large problem, come last. Conversely, the weighted geometric mean suggests that the most promising country is Astheniland because it scores quite high on all criteria. Abiertia gets a lower final score than under the arithmetic mean because it scores low on “low presence of similar NGOs” (that is, similar NGOs are already active there). The Vitaris Islands get a lower score because the disease presence is low. Saludistan gets a score of 0 because the disease does not exist.
By emphasising the need for high scores across all criteria, the geometric mean steers the programme towards regions where there is both high demand and room for an additional organisation.
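The reversal in ranking is easy to see when the countries are sorted by each mean. Below is a small Python sketch using the numbers from the table. The function `rank_by` is a hypothetical helper, not part of the original model.

```python
import math

# Weights and 0-10 scores copied from the table in Example 3.
weights = [0.30, 0.25, 0.20, 0.25]
countries = {
    "Abiertia": [7, 5, 10, 3],
    "Astheniland": [9, 6, 5, 6],
    "Vitaris Islands": [1, 8, 10, 10],
    "Saludistan": [0, 10, 10, 10],
}


def wam(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def wgm(values, weights):
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / sum(weights))


def rank_by(mean):
    """Return country names from highest to lowest score under `mean`."""
    return sorted(countries, key=lambda c: mean(countries[c], weights),
                  reverse=True)


print(rank_by(wam))
# ['Saludistan', 'Vitaris Islands', 'Astheniland', 'Abiertia']
print(rank_by(wgm))
# ['Astheniland', 'Abiertia', 'Vitaris Islands', 'Saludistan']
```

With these scores, the order under the geometric mean is close to the reverse of the order under the arithmetic mean.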
When to use the weighted geometric mean
I think the weighted geometric mean can be best used if these criteria are met:
- Your criteria have a multiplicative nature. Your factors interact multiplicatively, such that a low value in any factor should pull down the overall result: all factors must contribute for the final decision to be good. This is often the case for cause prioritisation and career decisions.
- Your data is on a scale that starts at zero. The weighted geometric mean penalises low values best if the lowest possible value is 0, and is represented by the number 0. For this reason, using a scale between 0 and 10 is more useful than a scale between 1 and 10. (Or, alternatively, 0 to 1 or 0 to 100).
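To illustrate the point about the scale, the sketch below scores the same hypothetical option twice: once with the worst possible score written as 0 (a 0-10 scale) and once with it written as 1 (a 1-10 scale). The numbers are made up for illustration.

```python
import math


def wgm(values, weights):
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / sum(weights))


weights = [1 / 3, 1 / 3, 1 / 3]

# One criterion gets the worst possible score, the other two the best.
on_zero_to_ten = wgm([0, 10, 10], weights)  # worst score encoded as 0
on_one_to_ten = wgm([1, 10, 10], weights)   # worst score encoded as 1

print(round(on_zero_to_ten, 2), round(on_one_to_ten, 2))  # 0.0 4.64
```

On the 1-10 scale the option still gets a middling score, even though it received the worst possible rating on one criterion.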
When not to use the weighted geometric mean
- Not all of your criteria need to have high scores for it to be the best option. For example, if you’re considering which animal welfare mass media campaign to choose, you could use criteria such as “catchy slogan”, “good imagery”, “nice design”, and “applicable in many countries”. Ideally, you want a campaign to tick all four boxes, but you’d still consider options without a catchy slogan, or ones that have bad design. Another example: if you’re choosing which book to read, you ideally want a book that is both exciting and funny, but the best option could still be a very exciting book that is not funny at all.
- Your dataset contains negative values. The weighted geometric mean is not defined as a real number for a set of scores containing a negative value, because it involves fractional powers and roots of the scores. If, for example, you’re evaluating a career option with a negative social impact, the weighted geometric mean cannot be calculated.
- You’ve standardised the data using Z-scores. Standardisation using Z-scores is useful when you have multiple data sources on different scales and you’re interested in relative differences between your options. However, Z-scores represent the number of standard deviations from the mean, so they centre around zero and typically include negative values.
- You’ve normalised the data using min-max rescaling. If you normalise a dataset on a 0-1 scale using min-max rescaling, a 0 may represent the lowest observed value rather than an absolute absence of the factor. In such cases, using the geometric mean could lead to skewed results since a normalised zero is not the same as a ‘true zero’ in the data.
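The last two points can be made concrete with a small Python sketch. The case counts below are hypothetical, and I use the population standard deviation (`statistics.pstdev`) for the Z-scores, which is only one of several possible conventions.

```python
import statistics

# Hypothetical raw data: number of cases per region (none of them is zero).
raw_cases = {"Region A": 120, "Region B": 150, "Region C": 200}
values = list(raw_cases.values())

# Min-max rescaling: the lowest observed value becomes 0, the highest 1.
lowest, highest = min(values), max(values)
min_max = {k: (v - lowest) / (highest - lowest) for k, v in raw_cases.items()}

# Z-scores: distance from the mean in standard deviations.
mean = statistics.mean(values)
sd = statistics.pstdev(values)
z_scores = {k: (v - mean) / sd for k, v in raw_cases.items()}

print(min_max["Region A"])       # 0.0, although Region A has 120 cases
print(z_scores["Region A"] < 0)  # True: below-average regions go negative
```

Feeding the rescaled 0 for Region A into a geometric mean would give it a final score of zero, although the problem is far from absent there. The negative Z-score cannot be used in a geometric mean at all.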
How much does this matter?
Choosing the right mean in a weighted factor model can matter a lot when the factors have significant variation or when one or more factors can approach zero. In these cases, the arithmetic mean can mask weaknesses by allowing strong values in one area to compensate for very low values in another. This can lead to suboptimal decision-making, especially in models for selecting interventions, causes, and careers, where each factor needs to meet a minimum threshold for meaningful impact. The geometric mean better handles these situations by ensuring low values reduce the overall score significantly.
However, if all factors are relatively similar in magnitude and none are close to zero, the choice between arithmetic and geometric means may matter less, as the overall scores will be more similar regardless of the mean used. In such cases, either mean could provide a reasonable estimate of impact, although the geometric mean still reflects multiplicative criteria more faithfully.
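A quick way to get a feel for this is to compare two hypothetical options that have the same arithmetic mean but a different spread of scores. The scores below are invented for the illustration.

```python
import math


def wam(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def wgm(values, weights):
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / sum(weights))


weights = [0.25, 0.25, 0.25, 0.25]
similar_scores = [7, 8, 7, 8]   # all criteria score about the same
spread_scores = [1, 10, 10, 9]  # one criterion scores very low

gap_similar = wam(similar_scores, weights) - wgm(similar_scores, weights)
gap_spread = wam(spread_scores, weights) - wgm(spread_scores, weights)

print(f"{gap_similar:.2f} {gap_spread:.2f}")  # 0.02 2.02
```

Both options have an arithmetic mean of 7.5. The geometric mean is never higher than the arithmetic mean for non-negative scores, and the gap between the two grows as the scores become more uneven.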
Implementation in spreadsheet software
Google Sheets
In Google Sheets, you can calculate the weighted geometric mean using the following formula. Replace ‘values’ and ‘weights’ with the applicable ranges (e.g. E2:E5), or define named ranges with these names.
=PRODUCT(ARRAYFORMULA(values^weights))^(1/SUM(weights))
I also created a simple template spreadsheet to get started without entering any formulas, feel free to use or adapt.
Excel
In Excel, the formula is a bit longer to handle the edge case where both values and weights are zero:
=IF(COUNTIFS(weights, ">0", values, 0)>0, 0, PRODUCT(IF(values=0, IF(weights=0, 1, 0), values)^weights)^(1/SUM(weights)))
LibreOffice Calc
In LibreOffice Calc, enter the following array formula. Array formulas should be entered using Ctrl+Shift+Enter, otherwise Calc returns an error.
=IF(COUNTIFS(weights; ">0"; values; 0)>0; 0; PRODUCT((IF(values=0; IF(weights=0; 1; 0); values))^weights)^(1/SUM(weights)))
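If you want to cross-check your spreadsheet outside of it, the same logic can be sketched in Python. This is my own rough equivalent rather than a translation of the formulas above: a criterion with zero weight is skipped (so a zero score with zero weight does not zero out the result), and any zero score with a positive weight gives a final score of 0.

```python
def weighted_geometric_mean(values, weights):
    """Weighted geometric mean with explicit handling of zeros."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    product = 1.0
    for value, weight in zip(values, weights):
        if weight == 0:
            continue  # criterion does not count, whatever its score
        if value < 0:
            raise ValueError("scores must not be negative")
        if value == 0:
            return 0.0  # a zero on a weighted criterion zeroes the result
        product *= value ** weight
    return product ** (1 / total_weight)


print(weighted_geometric_mean([0, 5], [0, 1]))      # 5.0
print(weighted_geometric_mean([0, 5], [0.5, 0.5]))  # 0.0
```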
Note: This post is inspired by my experience working with weighted factor models at Ambitious Impact (formerly Charity Entrepreneurship) and the Good Food Institute Europe. This post represents my personal views and not necessarily those of my (previous) employer. I am grateful to Vicky Cox, Johan Lugthart, Karam Elabd, Filip Murár and Koen van Pelt for feedback and proofreading. Any mistakes are mine.
You can view the weighted factor models used as examples in this spreadsheet, and you can find a template to get started with weighted geometric means here.
FWIW, when I have a weighted factor model to build, I think about how I can turn it into a BOTEC, and try to get it close(r) to a BOTEC. I did this for my career comparison and a geographic weighted factor model.
And I think this usually means some factors, in their units, like scale (e.g. number of individuals, years of life, DALYs, amount of suffering) and probability of success (%), should be multiplied. And usually not weighted at all, except when you want to calculate a factor multiple ways and average them. Otherwise, you'll typically get weird units.
And what is the unit conversion between DALYs and a % chance of success, say? This doesn’t make much sense, and probably neither will any weights, in a weighted sum. Adding factors with different units together doesn't make much sense if you wanted to interpret the final results in a scope-sensitive way.
This all makes most sense if you only have one effect you're estimating, e.g. one direct effect and no indirect effects. Different effects should be added. A more complete model could then be the sum of multiplicative models, one multiplicative model for each effect.
EDIT: But also BOTECs and multiplicative models may be more sensitive to their factors, and more sensitive to errors in factor values when ranking. So, it may be best to do sensitivity analysis, with a range of values for the factors. But that's more work.
I sometimes do this, but I wonder if it defeats one of the key benefits of a WFM -- that it accounts for multiple criteria and prevents any single consideration dominating.
(With BOTECs, sometimes the final ranking/conclusion is very dependent on one or two very uncertain or arbitrary criteria.)
If a single consideration dominates, it might be for good reason. The relative insensitivity of WFMs can reflect poor scaling of the score with impact.
I might be inclined to do sensitivity analysis to the parameters and multiple different BOTECs/models in these cases, but that's also more work. At some point, it's not really a BOTEC anymore, because the model is too complicated to fit on the back of an envelope. And it can no longer be practical to use the same BOTEC/model structure across interventions that are too different.
Yeah I agree in principle it "might be for good reason", though I still have some sense that it seems desirable to reduce overdependence on your ratings for one or two criteria. Similar to the reasoning for sequence thinking vs. cluster thinking.
I did the same, so I predictably agree-voted. I'm curious if disagree-voters can explain why.
That makes sense to me, Michael. Relatedly, GiveWell bases its geographic prioritisation on cost-effectiveness analyses of the most promising countries.
I generally agree, and CEARCH uses geomeans for our geographic prioritization WFMs, but I would also express caution - multiplicative WFMs are also more sensitive to errors in individual parameters, so if your data is poor you might prefer the additive model.
Also general comment on geomeans vs normal means - I think of geomeans as useful when you have different estimates of some true value, and the differences reflect methodological differences (vs cases where you are looking to average different estimates that reflect real actual differences, like strength of preference or whatever)
Naively, is there a case for using the average of the two?
I don't see any strong theoretical reason to do so, but I might be wrong. In a way it doesn't matter, because you can always rejig your weights to penalize/boost one estimate over another.
Good point on the error sensitivity. The geometric mean penalizes low scores more so it increases the probability of a false negative/type II error: an alternative that should be prioritised is not prioritised.
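As a rough illustration of this error sensitivity, the sketch below re-estimates the lowest score of a hypothetical option one point higher and compares how much each mean moves. The scores and equal weights are made up.

```python
import math


def wam(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def wgm(values, weights):
    product = math.prod(v ** w for v, w in zip(values, weights))
    return product ** (1 / sum(weights))


weights = [1, 1, 1]
before = [1, 9, 9]
after = [2, 9, 9]  # same option, lowest score re-estimated one point higher

wam_change = wam(after, weights) / wam(before, weights) - 1
wgm_change = wgm(after, weights) / wgm(before, weights) - 1

print(f"WAM changes by {wam_change:.0%}, WGM by {wgm_change:.0%}")
# WAM changes by 5%, WGM by 26%
```

In this example a one-point error at the low end of the scale moves the geometric mean about five times as much as the arithmetic mean.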
Note that the logarithm of a positive weighted geometric mean is the weighted arithmetic mean of the logarithms:
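$$\log\left(\left(x_1^{w_1} \cdots x_n^{w_n}\right)^{\frac{1}{w_1 + \cdots + w_n}}\right) = \frac{w_1 \log(x_1) + \cdots + w_n \log(x_n)}{w_1 + \cdots + w_n}$$

So, instead of switching to the weighted geometric mean, you could just take the logarithm of your factors.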
EDIT: Well, the weighted geometric mean is easier than taking logarithms, but it can be useful to remember this equivalence. With a weighted arithmetic mean, you might want your factors to be able to go arbitrarily negative, corresponding to the logarithms of values close to 0 in the geomean.
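The equivalence can be checked numerically with a few lines of Python. The scores and weights are arbitrary positive numbers chosen for the check, since the logarithm is only defined for positive values.

```python
import math

values = [8, 3, 6]
weights = [0.5, 0.3, 0.2]

# Weighted geometric mean, computed directly.
geometric = math.prod(v ** w for v, w in zip(values, weights)) ** (1 / sum(weights))

# Weighted arithmetic mean of the logarithms, then transformed back.
mean_of_logs = sum(w * math.log(v) for v, w in zip(values, weights)) / sum(weights)
via_logs = math.exp(mean_of_logs)

print(math.isclose(geometric, via_logs))  # True
```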
Thanks for writing this up and for highlighting this weakness in our prioritisation report (example 1).
Since the publication of this report (which was quite an early piece of research for me), I've built a lot more of these models and strongly agree that it's important to not just blindly use a weighted average. (Didn't change anything about our research outcomes in this case, but it could have important effects elsewhere.) Geometric mean is important. I also sometimes use completely different scoring tools (e.g., multiplication, more BOTEC style, as MichaelStJules has commented). It's always helpful from my experience to experiment with different methods/perspectives.