TL;DR
Submit ideas for “interesting evaluations” in the comments by December 5th. The best submission wins $50; all of them will be highly appreciated.
Motivation
A few of us (myself, Nuño Sempere, and Ozzie Gooen) have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (e.g., setting up scalable forecasting systems on qualitative parameters). To understand this problem, we've been investigating both evaluations specific to research and evaluations in a much broader sense.
We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation.
Existing writing on “evaluations” tends to be domain-specific (focused only on education or nonprofits, for example), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.
I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what aids or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list of evaluation systems, posted to LessWrong. While that is probably impractical, we can hopefully make one comprehensive enough, especially with your help.
Task
Suggest an interesting example (or examples) of an evaluation system. For these purposes, “evaluation” means “a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards”. If you think of something that doesn't seem to fit that definition, err on the side of inclusion.
Prize
The prize is $50 for the top submission.
Rules
To enter, submit a comment below suggesting an interesting example before the 5th of December. This post is on both LessWrong and the EA Forum, so comments on either count.
Rubric
To hold true to the spirit of the project, we are using a rubric to score this competition. Entries will be evaluated on the following criteria:
- Usefulness/uniqueness of the lesson from the example
- Novelty or surprise of the entry itself, for Elizabeth
- Novelty of the lessons learned from the entry, for Elizabeth
Accepted Submission Types
I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:
- A single example in one of the categories already mentioned
- Four paragraphs on an unusual exam and its interesting impacts
- A babbled list of 104 things that vaguely sound like evaluations
Examples of Interesting Evaluations
We have a full list here, but below is only a subset, so as not to anchor you too much. Don't worry about submitting duplicates: I’d rather risk a duplicate than miss an example.
- Chinese Imperial Examination
- Westminster Dog Show
- Turing Test
- Consumer Reports Product Evaluations
- Restaurant Health Grades
- Art or Jewelry Appraisal
- ESGs/Socially Responsible Investing Company Scores
- “Is this porn?”
  - Legally?
  - For purposes of posting on Facebook?
- Charity Cost-Effectiveness Evaluations
- Judged Sports (e.g. Gymnastics)
Motivating Research
These are some of our previous related posts:
- Shallow Review of Consistency in Statement Evaluation
- Can we hold intellectuals to similar public standards as athletes?
- Prediction-Augmented Evaluation Systems
- Can We Place Trust in Post-AGI Forecasting Evaluations?
- ESC Process Notes: Claim Evaluation vs. Syntheses
- Predicting the Value of Small Altruistic Projects: A Proof of Concept Experiment
Winner
Last week we announced a prize for the best example of an evaluation. The winner is David Manheim, for his detailed suggestions on quantitative measures in psychology. I selected this answer because, although the IAT was already on my list, David provided novel information about multiple tests, which saved me a lot of work in evaluating them. David has been involved with QURI (which funded this work) in the past and may be again in the future, so this feels a little awkward, but his was ultimately the best suggestion, and it didn’t feel right to withhold the prize from him.
Honorable mentions go to Orborde, for financial stress tests, a very relevant suggestion that I was unfortunately already familiar with, and to alexrjl, for rock climbing route grades, which I would never have thought of in a million years but which has less transferability to the kinds of things we want to evaluate.
Post-Mortem
How useful was this prize? I think running the contest was more useful than $50 of my time; however, it was not as useful as it could have been, because the target moved after we announced the contest. I went from writing about evaluations as a whole to focusing specifically on evaluations that worked, and I’m sure that if I’d asked for examples of those, they would have been provided. So possibly I should have waited to refine my question before asking for examples. On the other hand, the project was refined in part by looking at a wide array of examples (generated here and elsewhere), and it might have taken longer to home in on a specific facet without the contest.
Thanks! I'm happy to see that this was useful, and I strongly encourage prize-based crowdsourcing like this in the future, as it seems to work well.
That said, given my association with QURI, I elected to have the prize money donated to GiveWell.