Valuing research works by eliciting comparisons from EA researchers

NunoSempere

Comments 22

Hauke Hillebrandt

I think this has a lot of potential. Excellent work!

Some suggestions / questions / comments for discussion:

1. Could one shorten the procedure by something like the final version perfected, i.e. by assuming that things are transitive?
2. Could one have people rate pairs several times to tap into the 'crowd within'?
3. Could one restrict people's option space, e.g. to just order-of-magnitude options: 2x, 5x, 10x, 100x, 1000x, 10000x?
4. This also reminded me of the piece 'Putting Logarithmic-Quality Scales On Time'... I wonder whether this tool could help with project prioritization.
5. Could one have people rank each item on several dimensions, e.g. scale, solvability, neglectedness?
6. I'd also like to hear people's opinions on 'Why Charities Usually Don't Differ Astronomically in Expected Cost-Effectiveness', which seems relevant here. It argues that 'many charities differ by at most ~10 to ~100 times, and within a given field, the multipliers are probably less than a factor of ~5.'

technicalities

Wonderful comment.

But on #6, isn't it simple? The wiki stub has 100k times fewer readers than Superintelligence (and even fewer influence-weighted readers), it bears on an issue >1k times less important, and it contains 1k times fewer novel points... 11 OOMs, easy.

Linch

My main critique of this process is that I have very little intuition for the ex post counterfactual impact of research unless I've thought about it deeply, because I think most of the impact of research is very heavy-tailed and depends on whether very specific lines of impact materialize (e.g. "this model of information theory coming out earlier than other models of information theory increases/decreases more dangerous paradigms in AI," or "this specific funder made a counterfactually good/bad decision as a result of this research report" or "this specific set of people got more involved in EA as a result of this research blog post.")

But of course this could just be copium for my own shitty orderings.

Jonas Moss

FYI: I wrote a post about the statistics used in pairwise comparison experiments of the sort used in this post.

technicalities

The obvious improvement is to do the above, followed by a discussion on each of the largest divergent ratings, followed by a post-test where I expect more of a consensus.*

This is because it's hard to generate all the relevant facts on your own. Divergences are likely to be due to some crucial factual consideration (e.g. "Thinking Fast & Slow was only read by 5% of the people who bought it"; "Thinking Fast & Slow is >40% false") or a value disagreement. (Most value disagreements are inert on short timescales, but not over years.)

* This fails to be useful to the extent we're not all equally persuasive, biddable, high-status.

So-Low Growth

Sharing information can sometimes lead to more conservative funding, because people weight the weak points that others raise more heavily than the strong points. See here for a really fascinating paper in the economics of science: https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.2021.4107

technicalities

This is a risk, but we'll still have the pre-test rankings and can probably do something clever here.

So-Low Growth

Fwiw, I'd imagine you are all less prone to over-weighting other evaluators' negative points (there are different interests at play than for journal reviewers), but there may still be a bias here.

Piqued my curiosity: what sort of clever thing?

technicalities

Babbling:

Allocating some of the funding using the pre-test rankings;
or the other way, using the diff between pre and post as a measure for how bad/fragile the pre was;
otherwise working out whether each evaluator leans under- or over-confident and using this to correct their post ranking.

So-Low Growth

Thanks Gavin!

RyanCarey

It would be very nice to get a transitive ordering at the end, for visualisation purposes if nothing else (so that you can display the papers on one scale without needing to read the individual numbers). If you have more comparisons than you need, then I think a natural thing to try would be to solve the optimisation problem where each paper must be placed on a log scale, and you try to minimize the MSE between the estimated gaps and the labels of the gaps (with all gaps measured using the log scale). In practice, you can probably get a satisfactory solution to this problem with any popular convex optimisation algorithm, like lbfgs (there probably isn't a literal guarantee of global optimality, but I think problematic local optima would be rare.)
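A minimal sketch of the fit described above, assuming each comparison is stored as a tuple (i, j, ratio) meaning "item i is judged ratio times as valuable as item j", with all ratios positive. The function name, the data layout, and the choice to pin item 0 at log-value 0 are illustrative assumptions, not something taken from the post. Because the squared error is linear in the log-values, this sketch uses ordinary linear least squares (`numpy.linalg.lstsq`) instead of L-BFGS.

```python
import numpy as np

def fit_log_scale(n_items, comparisons):
    """Place items on a log10 scale from pairwise ratio judgments.

    comparisons: list of (i, j, ratio), read as
    "item i is `ratio` times as valuable as item j" (ratio > 0).
    Returns an array of log10 values, with item 0 fixed at 0 (assumed anchor).
    """
    n_rows = len(comparisons) + 1
    A = np.zeros((n_rows, n_items))
    b = np.zeros(n_rows)
    for row, (i, j, ratio) in enumerate(comparisons):
        # Each comparison asks for: score[i] - score[j] ~= log10(ratio)
        A[row, i] = 1.0
        A[row, j] = -1.0
        b[row] = np.log10(ratio)
    # The scale is only defined up to a shift, so pin item 0 at zero.
    A[-1, 0] = 1.0
    scores, *_ = np.linalg.lstsq(A, b, rcond=None)
    return scores

# Hypothetical, mutually inconsistent judgments: 10x, 10x, but 1000x overall.
example = [(1, 0, 10.0), (2, 1, 10.0), (2, 0, 1000.0)]
print(fit_log_scale(3, example))
```

With inconsistent judgments like these, the fit spreads the disagreement across the three gaps and still returns a single ordering that can be plotted on one axis.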

NunoSempere

Thanks for the pointers!

Adam Binksmith

I just came across a paper which mentions a loosely related method - pairwise rating for model elicitation. See p13 of this PDF (or ctrl-F for pairwise), might be of interest:

NunoSempere

Thanks

david_reinstein

Did you use or are you aware of any good quantifiable conceptualizations and breakdowns of the value of research that could be applied to empirical and applied work?

E.g., (probability of being true)*(value if true) seems inadequate, as what is important is the VOI gain the research yielded. And good research generally doesn't state "we proved X is true with 100% probability" but reports parameters, confidence/credible intervals, etc.

One might consider a VOI model in terms of the 'increase in value of the optimal funding and policy decisions as informed by the research'. But even if doable, this would ignore the impact of the research on other researchers and on less trackable decisionmakers.

(Context for asking this: I am considering ways to make The Unjournal's evaluation metrics more useful.)

Update: OK, I see you have done some more work in this area, reported in this post; I will try to dig into that. However, I'm not sure you expressed a specific 'value model' there?

NunoSempere

See this post also on the Nonlinear library.

Stan Pinsent

This sounds amazing, but the link utility-function-extractor.quantifieduncertainty.org is no longer working :(

Vasco Grilo🔸

Thanks for the post!

Have you considered using z-scores:

z("researcher A", "text X") = ("geometric mean of text X according to researcher A" - mu("geometric means of researcher A"))/sigma("geometric means of researcher A")?

This would ensure that each researcher has the same weight on the combined scores, as the sum of the z-scores for each researcher would be null.
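A small sketch of the z-score proposal, assuming each researcher's geometric means are available as a plain list of numbers. The researcher labels and values below are made up for illustration, and using the population standard deviation (rather than the sample one) is an assumption.

```python
import numpy as np

def z_scores(geometric_means):
    """z = (value - mean) / std, computed within one researcher's estimates."""
    values = np.asarray(geometric_means, dtype=float)
    # Population standard deviation (ddof=0); an assumption of this sketch.
    return (values - values.mean()) / values.std()

# Hypothetical geometric means for three texts, one list per researcher.
estimates = {
    "researcher A": [1.0, 10.0, 100.0],
    "researcher B": [2.0, 4.0, 6.0],
}

per_researcher = {name: z_scores(vals) for name, vals in estimates.items()}
# Combined score per text: average the z-scores across researchers.
combined = np.mean(list(per_researcher.values()), axis=0)
print(combined)
```

Each researcher's z-scores sum to zero and have unit spread, which is what gives every researcher the same weight in the combined score.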

NunoSempere

So right now (i.e., in the next version), I am normalizing by share of total impact that each project takes according to each researcher, which feels like a more adequate normalization. But thanks for the tip.

NunoSempere

One problem I'm having is that the means are not always positive, and hence the geommean isn't well defined. I'm solving this by taking the mixture of the distributions, rather than working with the means, but it's not all that trivial (e.g., it's a bit computationally expensive)

Vasco Grilo🔸

That also makes sense to me: since the shares of total impact add up to 1, each researcher has the same weight on the combined scores (as with the z-scores).
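For comparison, a sketch of the share-of-total normalization, assuming point estimates that are all positive (the thread notes that this does not always hold, which is where mixtures of distributions come in). The names and numbers are hypothetical.

```python
import numpy as np

def impact_shares(values):
    """Divide each estimate by the researcher's total, so shares sum to 1."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        # Shares are only meaningful here for strictly positive estimates.
        raise ValueError("this sketch assumes strictly positive estimates")
    return values / values.sum()

# Hypothetical estimates for the same three projects from two researchers.
shares_a = impact_shares([1.0, 10.0, 100.0])
shares_b = impact_shares([2.0, 4.0, 6.0])
combined_shares = (shares_a + shares_b) / 2
print(combined_shares)
```

Averaging the shares keeps the combined scores summing to 1 as well, so no researcher dominates simply by using larger numbers.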

quinn

One particularly worrying difference in opinions is the difference in the range of values. Moorhouse’s range is 5.1 orders of magnitude, whereas Leech’s is 12.6 (the participants’ average is 7.6).

What about taking exp(normalize(log(x))) for some normalization function that behaves roughly like vector normalization?
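One way to read this suggestion, as a sketch: take logs, centre them, divide by their Euclidean norm, and exponentiate. The centring step and the choice of the L2 norm are assumptions made here to give "normalize" a concrete meaning. The input values are assumed positive.

```python
import numpy as np

def normalize_log_range(values):
    """exp(normalize(log(x))), with normalize = centre, then divide by the L2 norm."""
    logs = np.log(np.asarray(values, dtype=float))
    centered = logs - logs.mean()
    norm = np.linalg.norm(centered)
    if norm == 0:
        # All values equal: nothing to rescale.
        return np.ones_like(logs)
    return np.exp(centered / norm)

# Hypothetical researchers who agree on the ordering but not on the range.
narrow = normalize_log_range([1.0, 10.0, 100.0])  # 2 orders of magnitude
wide = normalize_log_range([1.0, 1e3, 1e6])       # 6 orders of magnitude
print(narrow, wide)
```

Under this reading, two researchers whose log-estimates are proportional end up with identical normalized values, so differences in range alone no longer drive the aggregate.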
