
TLDR

Submit ideas of “interesting evaluations” in the comments. The best one by December 5th will get $50. All of them will be highly appreciated.

Motivation

A few of us (myself, Nuño Sempere, and Ozzie Gooen) have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (as in, setting up scalable forecasting systems on qualitative parameters). In order to understand this problem, we've been investigating evaluations specific to research and evaluations in a much broader sense.

We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation. 

Existing writing on “evaluations” seems to be fairly domain-specific (only focused on Education or Nonprofits), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.  

I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what aids or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list that will be posted to LessWrong. While that is probably impractical, hopefully we can make one comprehensive enough, especially with your help.

Task

Suggest an interesting example (or examples) of an evaluation system. For these purposes, evaluation means "a systematic determination of a subject's merit, worth and significance, using criteria governed by a set of standards", but if you think of something that doesn't seem to fit, err on the side of inclusion.

Prize

The prize is $50 for the top submission.

Rules

To enter, submit a comment suggesting an interesting example below, before the 5th of December. This post is both on LessWrong and the EA Forum, so comments on either count.

Rubric

To hold true to the spirit of the project, we have a rubric evaluation system to score this competition. Entries will be evaluated using the following criteria:

  • Usefulness/uniqueness of lesson from the example
  • Novelty or surprise of the entry itself, for Elizabeth
  • Novelty of the lessons learned from the entry, for Elizabeth

Accepted Submission Types

I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:

  • A single example in one of the categories already mentioned
  • Four paragraphs on an unusual exam and its interesting impacts
  • A babbled list of 104 things that vaguely sound like evaluations

Examples of Interesting Evaluations

We have a full list here, but below is a subset to not anchor you too much. Don't worry about submitting duplicates: I’d rather risk a duplicate than miss an example. 
 

  1. Chinese Imperial Examination
  2. Westminster Dog Show
  3. Turing Test
  4. Consumer Reports Product Evaluations
  5. Restaurant Health Grades
  6. Art or Jewelry Appraisal
  7. ESGs/Socially Responsible Investing Company Scores
  8. “Is this porn?”
    1. Legally?
    2. For purposes of posting on Facebook?
  9. Charity Cost-Effectiveness Evaluations
  10. Judged Sports (e.g. Gymnastics)

Motivating Research

These are some of our previous related posts:


Babble!

  1. Psychological evaluation
  2. Job interview
  3. Debate competition judge
  4. Using emotion recognition (say, by image recognition) to find out consumers' preferences
  5. Measuring Pavlov's dog saliva
  6. Debate in ancient Greece
  7. Factored cognition
  8. forecasting
  9. karma on the forum / reddit 
  10. democratic voting
  11. Stock prices as an evaluation of a company's value
  12. Bibliometrics. Impact factor.
  13. Using written recommendations to evaluate candidates.
  14. Measuring truth-telling using a polygraph
  15. Justice system evaluation of how bad crimes are based on previous cases
  16. Justice system use of a Jury
  17. Lottery - random evaluation
  18. Measuring dopamine signals as a proxy to a fly's brain valence (which is an evaluation of its situation)
  19. throw stuff into a neural net
  20. python
  21. Discrete Valuation Rings
  22. Signaling value using jewels.
  23. Evaluation based on social class
  24. fight to the death
  25. torturing people until they confess
  26. market price
  27. A mathematical exam
  28. A high-school history exam
  29. An ADHD test
  30. Stress testing a phone by putting it in extreme situations
  31. checking if a car is safe by using a crash dummy and checking impact force
  32. Software testing
  33. Open source as a signal of "someone had looked into me and I'm still fine"
  34. Colonoscopy
  35. number of downloads for an app
  36. running polls
  37. running stuff by experts
  38. asking god what she thinks of it
  39. The choice of a pope
  40. Public consensus, 50 years down the line
  41. RCT
  42. broad population study
  43. Nobel prize committee
  44. Testing purity of chemical ingredients
  45. Testing problems in chip manufacturing
  46. Reproduce a study/project and see if the results replicate
  47. Set quantitative criteria in advance, and check the results after the fact
  48. Ask people in advance what they think the results will be, and ask people to evaluate the results afterward. Focus the evaluation on the parameters that the people before the test did not consider
  49. Adequacy analysis (like in Inadequate Equilibria)
  50. a flagging system for moderators
  51. New York Times Best Seller list
  52. subjective evaluation
  53. subjective evaluation when on drugs
  54. subjective evaluation by psychopaths (who are also perfect utilitarians!)
  55. subjective evaluation by a color choosing octopus
  56. Managerial decisions (a 15-minute powerpoint presentation and then an arbitrary decision)
  57. Share-holder reports
  58. bottleneck/limiting-factor analysis
  59. Crucial considerations
  60. Theory of Change model
  61. Taking a set amount of time to critically analyze the subject, focusing on trying to find as many downsides as possible.
  62. Using weights and a two-sided scale to measure goods.
  63. Setting a single benchmark and evaluating only against it.
  64. A referee evaluating a Boxing match
  65. Using score for football
  66. Buying a car - getting the information from the seller and assessing their truthfulness
  67. Looking at a fancy report and judging based on length, images, and businessy words
  68. Grammarly
  1. peer review in science
  2. citations count
  3. journal status
  4. grant making - assessing requests, say by scoring according to a fixed scoring template
  5. evaluating a scoring template by comparing similarity of different people's scoring of the same text
  6. Code review
  7. Fact-Checking
  8. Editor going through a text
  9. 360 peer feedback - sociometry
  10. gut intuition after long relationship/experience
  11. Amazon Reviews
  12. ELO
  13. Chess engine position assessment
  14. Theoretical assessment of a chess position - experts explaining what is good or bad about the position
  15. Running a tournament starting with this position, evaluating based on success percentage
  16. multiple choice exam
  17. political lobbying for or against something
  18. the grandma test

 

It was fun! Hope that something here might be helpful :)

Difficulty ratings in outdoor rock-climbing
Common across all types of climbing are the following features of grades:

  • A subjective difficulty assessment of the climb by the first person to climb it, which they use to "propose" a grade.
  • Other people who manage the same climb may suggest a different grade. Often the grade of a climb will not be agreed upon in the community until several ascents have been made.
  • Climbing guidebooks publish grades, typically based on the authors' opinion of the current consensus, though there are also online platforms where people can vote on grades.
  • Grades can change even after a consensus has appeared stable. This might be due to a hold breaking; however, it may also be due to a new sequence being discovered.
  • Grades tend to approach a single stable point, even though body shape and size (particularly height and armspan) can make a large difference to difficulty.

There are many different grading systems for different types of climb; a good overview is here. Some differences of interest:

  • While most systems grade the overall difficulty of the entire climb, British trad climbs have two grades, neither of which purely maps to overall difficulty. The first describes a combination of overall difficulty and safety (so an unsafe but easy climb may have a higher rating than a safe one); the second describes the difficulty only of the hardest move or short sequence (which can be very different from the overall difficulty, as endurance is a factor).
  • Aid climbs, which allow climbers to use ropes to aid their movement rather than only for protection, are graded separately. However, other technology is not considered "aid". In particular, climbing grades have steadily increased over time, at least in part due to the development of better shoe technology. More recently, the development of rubberised kneepads has led to several notable downgrades of hard boulders and routes, as the kneepads make much longer rests possible.

I think climbing grading is interesting because the grades emerge out of a complex set of social interactions, and despite most climbers frequently saying things like "grades are subjective" and "grades don't really matter", grades in general remain remarkably stable and important to many climbers.

Correlating subjective metrics with objective outcomes to provide better intuitions about what an additional point on a scale might mean. The resulting intuitions still suffer from "correlation ≠ causation" and all the curses of self-reported data (which, in my opinion, makes such measurements close to useless), but it is a step forward.

See this tweet and the whole thread https://twitter.com/JessieSunPsych/status/1333086463232258049 h/t Guzey
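As a toy illustration of that approach (all data and numbers below are invented for illustration, not taken from the linked thread), here is a minimal Python sketch that pairs a 1–7 self-report scale with an objective outcome and looks at the mean outcome at each scale point:

```python
# Minimal sketch (invented data): anchoring what "one more point" on a
# subjective 1-7 scale roughly corresponds to in an objective outcome.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
life_satisfaction = rng.integers(1, 8, size=n)                  # subjective 1-7 rating
income = 20_000 * life_satisfaction + rng.normal(0, 30_000, n)  # noisy objective proxy

df = pd.DataFrame({"life_satisfaction": life_satisfaction, "income": income})

# Mean objective outcome at each scale point; the gaps between rows give a rough
# sense of what "an additional point" means, with all the usual caveats about
# correlation vs. causation and self-report noise.
print(df.groupby("life_satisfaction")["income"].mean().round(0))
print("correlation:", round(df["life_satisfaction"].corr(df["income"]), 2))
```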

Huh! The thread I linked to and David Manheim's winning comment cite the same paper :)

Simple linear models, including improper ones (!!). In Chapter 21 of Thinking, Fast and Slow, Kahneman writes about Meehl's book Clinical vs. Statistical Prediction: A Theoretical Analysis and a Review, which finds that simple algorithms made by getting some factors related to the final judgement and weighting them give you surprisingly good results.

The number of studies reporting comparisons of clinical and statistical predictions has increased to roughly two hundred, but the score in the contest between humans and algorithms has not changed. About 60% of the studies have shown significantly better accuracy for the algorithms. The other comparisons scored a draw in accuracy [...]

If they are weighted optimally to predict the training set, they're called proper linear models, and otherwise they're called improper linear models. Kahneman says about Dawes' The Robust Beauty of Improper Linear Models in Decision Making that

A formula that combines these predictors with equal weights is likely to be just as accurate in predicting new cases as the multiple-regression formula that was optimal in the original sample. More recent research went further: formulas that assign equal weights to all the predictors are often superior, because they are not affected by accidents of sampling.

That is to say: to evaluate something, you can get very far just by coming up with a set of criteria that positively correlate with the overall result and with each other and then literally just adding them together.
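To make that concrete, here is a minimal sketch (my own illustration on synthetic data, not code from Dawes or Kahneman) comparing regression weights fitted on a small training sample against an equal-weight "improper" model on held-out data:

```python
# Minimal sketch on synthetic data: a "proper" linear model (least-squares
# weights) vs. an "improper" one (equal weights on standardized predictors).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, n_predictors=5, noise=1.5):
    """Outcome is a noisy weighted sum of the predictors (weights unknown to the modeler)."""
    true_weights = np.linspace(0.2, 1.0, n_predictors)
    x = rng.normal(size=(n, n_predictors))
    y = x @ true_weights + rng.normal(scale=noise, size=n)
    return x, y

x_train, y_train = simulate(60)    # small training sample, as in many clinical studies
x_test, y_test = simulate(5000)    # large held-out set to estimate real accuracy

# "Proper" model: weights chosen by least squares on the training sample.
beta, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
proper_pred = x_test @ beta

# "Improper" model: standardize each predictor, give it weight +1 or -1
# depending on the sign of its correlation with the outcome, and add them up.
signs = np.sign([np.corrcoef(x_train[:, j], y_train)[0, 1]
                 for j in range(x_train.shape[1])])
z_test = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)
improper_pred = z_test @ signs

print("out-of-sample correlation, fitted weights:",
      round(np.corrcoef(proper_pred, y_test)[0, 1], 3))
print("out-of-sample correlation, equal weights: ",
      round(np.corrcoef(improper_pred, y_test)[0, 1], 3))
```

In this toy setup the equal-weight sum comes close to the fitted model's out-of-sample accuracy; how close depends entirely on the made-up parameters, but it illustrates the robustness to sampling accidents that Dawes describes.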


Winner

Last week we announced a prize for the best example of an evaluation. The winner of the evaluations prize is David Manheim, for his detailed suggestions on quantitative measures in psychology.  I selected this answer because, although IAT was already on my list, David provided novel information about multiple tests that saved me a lot of work in evaluating them. David has had involvement with QURI (which funded this work) in the past and may again in the future, so this feels a little awkward, but ultimately it was the best suggestion so it didn’t feel right to take the prize away from him.

Honorable mentions to Orborde on financial stress tests, which was a very relevant suggestion that I was unfortunately already familiar with, and alexrjl on rock climbing route grades, which I would never have thought of in a million years but has less transferability to the kinds of things we want to evaluate.

Post-Mortem

How useful was this prize? I think running the contest was more useful than $50 of my time; however, it was not as useful as it could have been because the target moved after we announced the contest. I went from writing about evaluations as a whole to specifically evaluations that worked, and I’m sure if I’d asked for examples of that they would have been provided. So possibly I should have waited to refine my question before asking for examples. On the other hand, the project was refined in part by looking at a wide array of examples (generated here and elsewhere), and it might have taken longer to hone in on a specific facet without the contest.

Thanks - I'm happy to see that this was useful, and strongly encourage prize-based crowdsourcing like this in the future, as it seems to work well.

That said, given my association with QURI, I elected to have the prize money donated to GiveWell.
