Shallow evaluations of longtermist organizations

NunoSempere

Quick bits of info / thoughts on the questions you raise re CLR

(I spent 3 months there as a Summer Research Fellow, but don't work there anymore, and am not suffering-focused, so might be well-positioned to share one useful perspective.)

Is most of their research only useful from a suffering-focused ethics (SFE) perspective?
- I think all of the research that was being done while I was there would probably be important from a non-SFE longtermist perspective if it was important from a SFE longtermist perspective
  - It might also be important from neither perspective if some other premise underpinning the work was incorrect or the work was just low-quality. But:
    - I think it was all at least plausibly important
    - I think each individual line of work would be unlikely to turn out to be important from an SFE longtermist perspective but not from a non-SFE longtermist perspective
  - This is partly because much of it could be useful for non-s-risk scenarios
    - E.g., much of their AI work may also help reduce extinction risks, even if that isn't CLR as an organisation's focus (it may be the focus of some individual researchers, e.g. Daniel K - not sure)
  - This is also partly because s-risks are also really bad from a non-SFE perspective (relative to the same future scenario but minus the suffering)
- All that said, work that's motivated by a SFE longtermist perspective should be expected to be higher priority from that perspective than from another perspective, and I do think that that's the case for CLR's work
  - That said, if CLR had a substantial room for more funding and I had a bunch of money to donate, I'd seriously consider them (even if I pretend that I give SFE views 0 credence, whereas in reality I give them some small-ish credence)
Is there a better option for suffering-focused donors?
- I think the key consideration here is actually room for more funding rather than how useful CLR's work is
  - I haven't looked into their room for more funding
Is the probability of astronomical suffering comparable to that of other existential risks?
- Personally I'd say so (but "comparable" is vague), and I think it's plausibly more likely (though "plausibly" is vague and I haven't tried to put specific numbers on s-risks as a whole)
  - (Note that astronomical suffering could occur even in a future scenario that's overall better than extinction from a non-SFE or weakly SFE perspective. So my claim is simultaneously somewhat less surprising and somewhat less important than one might think.)
Is CLR figuring out important aspects of reality?
- I think so, but this is a vague question
Is CLR being cost-effective at producing research?
- I haven't really thought about that
Is CLR's work on their "Cooperation, conflict, and transformative artificial intelligence"/"bargaining in artificial learners" agenda likely to be valuable?
- I think so, but I'm not an expert on AI, game theory, etc.
Will CLR's future research on malevolence be valuable?
- I think so, conditional on them doing a notable amount of such work (I don't know their current plans on that front)
- And this I know more about since it was one of my focuses during my fellowship
How effective is CLR at leveling up researchers?
- I think I learned a lot while I was there, and I think the other summer research fellows whose views I have a sense of felt the same
- But I haven't thought about that from the perspective of "Ok, but how much, and what was the counterfactual" in the way that I would if considering donating a large amount to CLR
  - (I've noticed that my habits of thinking are different when I'm just having regular conversations or reading stuff or whatever vs evaluating a grant)
- Two signals of my views on this:
  - I've recommended several people apply to the CLR summer research fellowship (along with other research training programs)
  - I've drawn on some aspects of or materials from CLR's summer research fellowship when informing the design of one or more other research training programs
"I get the impression that they are fairly disconnected from other longtermist groups (though CLR moved to London last year, which might remedy this.)"
- I don't think CLR are fairly disconnected from other longtermist groups
- Some data points:
  - Stefan Torges did a stint at GovAI
  - Max Daniel used to work there, still interacts with them in some ways, and works at FHI and is involved in a bunch of other longtermist stuff
  - I worked there and am still in touch with them semi-regularly
  - Daniel Kokotajlo used to work at AI Impacts
  - Alfredo Parra, who used to work there, now works at Legal Priorities Project
  - Jonas Vollmer used to work there and now runs EA Funds
  - I know of various other people who've interacted in large or small ways with both CLR and other longtermist orgs

I am not intending here to convince anyone to donate to CLR or work for CLR. I'm not personally donating to them, nor working there. Though I do think they'd be a plausibly good donation target if they have room for more funding (I don't know about that) and that they'd be a good place to work for many longtermists (depending on personal fit, career plans, etc.).

Personal views only, as always.

Anthony DiGiovanni 🔸

I think I learned a lot while I was there, and I think the other summer research fellows whose views I have a sense of felt the same

+1. I'd say that applying for and participating in their fellowship was probably the best career decision I've made so far. Maybe 60-70% of this was due to the benefits of entering a network of people whose altruistic efforts I greatly respect; the rest was the direct value of the fellowship itself. (I haven't thought a lot about this point, but on a gut level it seems like the right breakdown.)

Ozzie Gooen

Thanks for both comments here. Personal anecdotes are really valuable, and I assume would be useful to later people trying to get some idea of the value from CLR.

Sadly, I imagine there's a significant bias for positive comments (I assume that people with negative experiences would be cautious of offending anyone), but positive comments still have signal.

MichaelA🔸

Sadly, I imagine there's a significant bias for positive comments (I assume that people with negative experiences would be cautious of offending anyone), but positive comments still have signal.

Yeah, I think that this is true and that it's good that you noted it.

Though that brings to mind another data point, which is that several people who did the summer research fellowship at the same time as me are now still working at CLR. I also think that there might be a bias against the people who still work at an org commenting, since they wouldn't want to look defensive or like they're just saying it to make their employer happy, or something. But overall I do think there's more bias towards positive comments.

(And there are also other people I haven't stayed in touch with and who aren't working there anymore, who for all I know could perhaps have had worse experiences.)

NunoSempere

Thanks Michael, beautiful comment.

Denkenberger🔸

Thanks for considering ALLFED. We try to respond to inquiries quickly. We have looked back, and have not been able to locate any such inquiries. We will be finalizing our 2020 report with financial details soon.

Thanks a lot for the engagement in the cost-effectiveness model. To clarify, the cost of preparation does not include the scale up in a catastrophe. The idea is that the resilient foods (we are rebranding away from “alternative foods”) could be scaled up without large-scale preparation (e.g. countries would repurpose the paper factories to produce food after the catastrophe, rather than spending billions of dollars ahead of time). Most of the promising resilient foods have already been commercialized. In this paper, we found that if there were no resilient foods, expenditure on stored foods in a catastrophe would be approximately $90 trillion and about 10% of people would survive. However, if resilient foods could be produced at $2.5 per dry kilogram retail, 97% of people would survive but the total expenditure would only be ~$20 trillion. So one could argue that resilient foods would actually save money in a catastrophe. But we did not include that effect in the cost-effectiveness model.

I expect that affecting a large amount of the Earth's future impact (i.e., 3 to 50% of the future impact of humanity) would be very hard even in extreme circumstances.

Just to make sure we are on the same page, if there were a 10% probability of full-scale nuclear war in the next 30 years and there were a 10% reduction in the long-term future potential of humanity given nuclear war, and if planning and R&D for resilient foods mitigated the far future impact of nuclear war by 50%, then that would improve the long-term potential of humanity by 0.5 percentage points (the product of the three percentages).
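
As a quick check of that arithmetic, here is a minimal Python sketch. The variable names are hypothetical, and the inputs are only the illustrative figures from the paragraph above, not estimates taken from the cost-effectiveness model.

```python
# Illustrative inputs from the paragraph above (names are hypothetical).
p_nuclear_war = 0.10   # probability of full-scale nuclear war in the next 30 years
loss_given_war = 0.10  # reduction in humanity's long-term potential, given nuclear war
mitigation = 0.50      # share of that far-future impact mitigated by resilient foods

# Improvement in long-term potential is the product of the three fractions.
improvement = p_nuclear_war * loss_given_war * mitigation
improvement_pp = improvement * 100  # expressed in percentage points

print(f"{improvement_pp:.2f} percentage points")
```

Changing any one input scales the result proportionally, which is why disagreements about a single factor carry straight through to the bottom line.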

MichaelA🔸

[I'll put some thoughts on the ALLFED section here to keep discussion organised, but this is responding to Nuno's section rather than David's comment.]

I feel that that 50% is still pretty good, but the contrast between it and the model's initial 95% is pretty noticeable to me, and makes me feel that the 95% is uncalibrated/untrustworthy. On the other hand, my probabilities above can also be seen as a sort of sensitivity analysis, which shows that the case for an organization working on ALLFED's cause area is somewhat more robust than one might have thought.
[...]
In conclusion, I disagree strongly with ALLFED's estimates (probability of cost overruns, impact of ALLFED's work if deployed, etc.), however, I feel that the case for an organization working in this area is relatively solid. My remaining uncertainty is about ALLFED's ability to execute competently and cost-effectively; independent expert evaluation might resolve most of it.

I think this mostly sounds similar to my independent impression, as expressed here, though I didn't specifically worry particularly about their ability to execute competently and cost-effectively. (I'm not saying I felt highly confident about that; it just didn't necessarily stand out much to me as a key uncertainty, for whatever reason.)

E.g., I wrote in the linked comment:

Their cost-effectiveness estimates seem remarkably promising (see here and here).
But it does seem quite hard to believe that the cost-effectiveness is really that good. And many of the quantities are based on a survey of GCR researchers, with somewhat unclear methodology (e.g., how were the researchers chosen?)
I also haven’t analysed the models very closely
But, other than perhaps the reliance on that survey, I can’t obviously see major flaws, and haven’t seen comments that seem to convincingly point out major flaws. So maybe the estimates are in the right ballpark?

One thing I'd add is that most of your (Nuno's) section on ALLFED sounds like it's seeing ALLFED's impact as mostly being about their research & advocacy itself. But I think it's worth also giving a fair amount of emphasis to this question of yours: "Given that ALLFED has a large team, is it a positive influence on its team members? How would we expect employees and volunteers to rate their experience with the organization?"

I'd see a substantial fraction of the value of ALLFED as coming from how it might work as a useful talent pipeline. And I think that this could also be a source of nontrivial downside risk from ALLFED, e.g. if their training is low-quality for some reason, or if people implicitly learn bad habits of thinking/research/modelling, or if their focuses aren't good focuses and they make their volunteers more likely to stay focused on that long-term.

(I'm not saying that these things are the case. I'd currently guess that ALLFED produces notable impact as a talent pipeline. But I haven't looked closely and think it'd be worth doing so if one wanted to do a "thorough" evaluation of ALLFED.)

NunoSempere

Thanks for considering ALLFED. We try to respond to inquiries quickly. We have looked back, and have not been able to locate any such inquiries. We will be finalizing our 2020 report with financial details soon.

This is most likely my fault; I think I got confused between allfed.org and allfed.info

To clarify, the cost of preparation does not include the scale up in a catastrophe

For clarity:
1. Your guesstimate model: 3% to 50% mitigation of the impact of war with a 30M to 200M investment, where the war has a probability of 0.02% to 5% per year. You also say that so far, you've already mitigated the impact of such a war by 1% to 20%.
2. My model: a 0% to 15% mitigation (previously 0 to 5%, see below) of the impact of such a war with a 50M to 50B investment, where this is maybe not being fully prepared, but does include some serious paranoid preparation, some factories running, supply chains established, etc.
3. Objection: You're planning to go with the 30M to 200M path; my estimates should be for that path.
4. Answer: I'd have to think about it. Maybe 2x to 10x lower. Essentially I'd expect any preparation to at least fail partially, fail to get implemented, be ignored, not survive in institutional memory, etc.

In this paper, we found that if there were no resilient foods, expenditure on stored foods in a catastrophe would be approximately $90 trillion and about 10% of people would survive. However, if resilient foods could be produced at $2.5 per dry kilogram retail, 97% of people would survive but the total expenditure would only be ~$20 trillion. So one could argue that resilient foods would actually save money in a catastrophe

I'll read the paper.

Just to make sure we are on the same page, if there were a 10% probability of full-scale nuclear war in the next 30 years and there were a 10% reduction in the long-term future potential of humanity given nuclear war, and if planning and R&D for resilient foods mitigated the far future impact of nuclear war by 50%, then that would improve the long-term potential of humanity by 0.5 percentage points (the product of the three percentages).

I see, thanks, I think I was getting this wrong (I've changed this in the guesstimate, but not in the post). With that in mind, your estimates now seem less high (but still very high). It changes my estimates slightly.

Separately, your numbers still seem fairly high. Suppose that in 1980 you had $100M and knew that there was going to be a pandemic (or another global financial crisis) in the next 100 years, but didn't know the details; it seems unlikely that you could have made the covid pandemic or the 2008 financial crisis more than 10% better.

Denkenberger🔸

This is most likely my fault; I think I got confused between allfed.org and allfed.info

We tried to buy the .org domain, but unfortunately it was not for sale.

Essentially I'd expect any preparation to at least fail partially, fail to get implemented, be ignored, not survive in institutional memory, etc.

There are definitely a lot of failure modes, though part of the money should go to updating institutions as staff turn over.

Thanks for updating the Guesstimate.

Separately, your numbers still seem fairly high. Suppose that in 1980 you had $100M and knew that there was going to be a pandemic (or another global financial crisis) in the next 100 years, but didn't know the details; it seems unlikely that you could have made the covid pandemic or the 2008 financial crisis more than 10% better.

Good question. I think these are quite different because billions of dollars had been put into preparedness, at least for a pandemic. Though billions of dollars have been put into preventing a nuclear war (and reducing weapon stockpiles), we could not find anything preparing for feeding populations for a multiyear catastrophe. I think generally there are logarithmic returns, which means the first amount of money spent on a problem has much greater marginal cost effectiveness.
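
To illustrate what logarithmic returns imply, here is a rough Python sketch. The curve `log(1 + spend / scale)`, the `scale_musd` parameter, and the dollar amounts are all assumptions chosen for illustration; none of them come from ALLFED's model.

```python
import math

def value(total_spend_musd: float, scale_musd: float = 1.0) -> float:
    """Hypothetical log-returns curve: value grows with log(1 + spend / scale)."""
    return math.log1p(total_spend_musd / scale_musd)

def marginal_value(prior_spend_musd: float, extra_musd: float) -> float:
    """Value added by extra spending on top of what has already been spent."""
    return value(prior_spend_musd + extra_musd) - value(prior_spend_musd)

# Same $100M, spent on a problem with no prior funding vs. one that has
# already received $5B (both figures are made up for illustration).
neglected = marginal_value(prior_spend_musd=0.0, extra_musd=100.0)
crowded = marginal_value(prior_spend_musd=5000.0, extra_musd=100.0)

print(f"neglected: {neglected:.3f}, crowded: {crowded:.3f}")
```

Under these assumptions the same $100M buys far more in the neglected case, which is the sense in which the first money spent on a problem has much greater marginal cost-effectiveness.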

Davidmanheim

[re: FHI Bio] Nonetheless, I'm somewhat surprised by the size of the team. In particular, I imagine that to meaningfully reduce bio-risk, one would need a bigger team. It's therefore possible that failing to expand is a mistake.

Specifically on the point about FHI's bio team, as a semi-insider providing information that isn't on the web site but isn't private, I'll note that the team is actually larger, in several ways. First, they have summer fellows and Oxford PhD students not officially hired by FHI that they work with. They also have people working jointly at/with other organizations (e.g. Piers is at iGem, as is Tessa, and they both talk with FHI folks a lot. Greg is a CHS ELBI fellow, and works with lots of people at CHS/NTI/etc. I work with/for them as a contractor. And Andrew Snyder-Beattie at OpenPhil used to be in charge of the team, and has coordinated projects with other groups.) Lastly, they have also recently brought on board additional people, not reflected on the web page. (Not sure about timing or announcements, so I won't say anything.)

MichaelA🔸

Strong upvoted - I found this very interesting, both for various parts of the specific evaluations and more generally as an example of one way to do longtermist charity evaluation (which currently seems to be rarely done in anything beyond a cursory or solely qualitative way, at least in public writings).

One question I have is what you saw as the key purposes of this post. Some possibilities:

Inform decisions about donations that are each in something like the $10-$5000 dollar range
Inform decisions about donations/grants that are each in something like the >$50,000 dollar range
- (Obviously I'm missing the $5,000-$50,000 range; I have a vague sense that the more interesting question is which of those two buckets I pointed to you're more focused on, if either)
Inform decisions about which of these orgs (if any) to work for
Provide feedback to these orgs that causes them to improve
Provide an accountability mechanism for these orgs that causes them to work harder or smarter so that they look better on such evaluations in future
Just see if this sort of evaluation can be done, learn more about how to do that, and share that meta-level info with the EA public
[something else]

I ask partly because:

My sense is that, traditionally, public charity evaluations are mostly focused on informing decisions by individual donors giving non-huge sums each
But this seems somewhat less relevant in longtermism than in other EA cause areas
- Longtermism seems somewhat less funding-constrained than other EA cause areas
- In longtermism compared to in other areas, evaluation seems harder for various reasons, and so the case for giving to a donation lottery or a fund whose dollars are distributed by specialist grantmakers seems stronger
  - But I guess your post, or other things like it, could mitigate this
    - But I still think some of the bottlenecks aren't addressed, e.g. I think nonpublic info is more often relevant for evaluating longtermist orgs than for evaluating animal welfare orgs
Also, you rarely talked about room for more funding, which seems to imply you weren't focused primarily on informing donors?

(To be clear, "what did you see as the key purposes of this post?" is a sincere rather than rhetorical question, and I think this post is great.)

(I work for two of the orgs discussed in this post and as a grantmaker for a fund, but this comment - as usual - expresses personal views only.)

Jonas_

I actually think it would be cool to have more posts that explicitly discuss which organizations people should go work at (and what might make it a good personal fit for them).

NunoSempere

Thanks Michael. Going through your options one by one.

Inform decisions about donations that are each in something like the $10-$5000 dollar range. Not an aim I had, but sure, why not.
Inform decisions about donations/grants that are each in something like the >$50,000 dollar range. So rather than inform those directly, inform the kind of research that you can either do or buy with money to inform that donation. $50,000 feels a little bit low for commissioning research to make a decision, though (could a $5k to $10k investment in a better version of this post make a $50k donation more than 10-20% better? Plausibly.)
- That said, I'd be curious if any largish donations are changed as a result of this post, and why, and in particular why they didn't defer to the LTF fund.
Inform decisions about which of these orgs (if any) to work for. Not really for myself, but I'd be happy for people to read this post as part of their decisions. Also, 80,000 hours exists.
Provide feedback to these orgs that causes them to improve. Sure, but not a primary aim.
Provide an accountability mechanism for these orgs that causes them to work harder or smarter so that they look better on such evaluations in future. No, not really.
Just see if this sort of evaluation can be done, learn more about how to do that, and share that meta-level info with the EA public. Yep.
[Something else]. Show the kind of thing that an organization like QURI can do! In particular, you can't do this kind of thing using software other than foretold (Metaculus is great, but the questions are too ambiguous; getting them approved takes time & in the case of a tournament, money, and for this post I only needed my own predictions (not that you can't run a tournament on foretold.))
[Something else]. Learn more about the longtermist ecosystem myself
[Something else]. So this was sort of on the edges of this project, but for making large amounts of predictions, one does need a pipeline, and improving that pipeline has been on my mind (and on Ozzie Gooen's). For instance, creating the 27 predictions one by one would be kind of a pain, so instead I use a Google doc script which feeds them to foretold.

I also think that 4. and 5. are too strongly worded. To the extent I'm providing feedback, I imagine it's more of a) of the sanity check variety or b) about how a relatively sane person perceives these organizations. For instance, if I don't get pushback about it in the comments, I'll think that it's a good idea for the APPGFG to expand, but I doubt it's something that they themselves haven't thought about.

Ozzie Gooen

+1, to both the questions and the answers.

In an ideal world we'd have intense evaluations of all organizations that are specific to all possible uses, done in communications styles relevant to all people.

Unfortunately this is an impossible amount of work, so we have to find some messy shortcuts that get much of the benefit at a decent cost.

I'm not sure how to best focus longtermist organization evaluations to maximize gains for a diversity of types of decisions. Fortunately I think whenever one makes an evaluation for one specific thing (funding decisions), these wind up relevant for other things (career decisions, organization decisions).

My primary interest at this point is in evaluations of the following:

How much total impact is an organization having, positive or negative?
How can such impact be improved?
How efficient is the organization (in terms of money and talent)?
How valuable is it to other groups or individuals to read / engage with the work of this organization? (Think Yelp or Amazon reviews)

My guess is that such investigations will help answer a wide assortment of different questions.

To echo what Nuño said, some of my interest in this specific task was in making a fairly general-purpose attempt. I think that increasingly substantial attempts are a pretty good bet, because either a whole lot could go wrong (this work upsets some group or includes falsities) or new ideas could be figured out (particularly by commenters, such as those on this post).

In the longer term my preference isn't for QURI/Nuño to be doing the majority of public evaluations of longtermist orgs, but instead for others to do most of this work. Perhaps this could be something of a standard blog post type, and/or there could be 1-2 small organizations dedicated to it. I think it really should be done independently from other large orgs (to be less biased and more isolated), so it probably wouldn't make sense for this work to be done as part of a much bigger organization.

Ozzie Gooen

Also, I'd agree that <$1Mil funding decisions aren't the main thing I'm interested in. I think that talent and larger allocations are much more exciting.

For example, perhaps it's realized that one small nonprofit's work is much more valuable than expected, so future donors wind up spending $200Mil on related work down the line. Or there could be many systemic effects, such as new founders being inspired by trends identified in the evaluations and making better new nonprofits because of it.

kokotajlod

Hey! Thanks for doing this, strong-upvoted.

I just wrote a post about the terms "outside view" and "inside view" and I figured I'd apply my own advice and see where it leads me. I noticed you used the term here:

CSER
Epistemic status for this section: Unmitigated inside view.

and so I thought I'd try my hand at saying what I think you meant, but using less ambiguous terms. You probably didn't just mean "I'm not using reference classes in this section," because that's true of most sections I'd guess. You also probably didn't mean that you are using a gears-level model, though arguably you are using a model of some sort? Idk, could also classify it as intuition. My guess is that you meant "This is just how things seem to me," i.e. this section doesn't attempt to defer to anyone else or correct for any biases you might have. How does all this sound? What would you say you meant?

NunoSempere

So what I specifically meant was: It's interesting that the current leadership probably thinks that CSER is valuable (e.g., valuable enough to keep working at it, rather than directing their efforts somewhere else, and presumably valuable enough to absorb EA funding and talent). This presents a tricky updating problem, where I should probably average my own impressions from my shallow review with their (probably more informed) perspective. But in the review, I didn't do that, hence the "unmitigated inside view" label.

kokotajlod

Hmmm, this surprises me a bit because doesn't it apply to pretty much all of your evaluations on this list? Presumably for each of them, the leadership of the org has somewhat different opinions than your independent impression, and your overall view should be an average of the two. I didn't get the impression that you were averaging your impression with those of other org's leadership.

NunoSempere

Sure, but it was particularly salient to me in this case because the evaluation was so negative

kokotajlod

Ah, OK, that makes sense.

Misha_Yagudin

I guess (p=.75) Nuño would say that the following interpretation is mostly reasonable: "inside view" here means that Nuño presents his impressions which rely a lot on stories he tells himself about various research directions being valuable or not, which others might reasonably disagree with him about.

I am thinking that because Nuño uses a simple model to estimate a fraction of researchers doing "valuable" work, the subjectivity is rooted in his takes on how valuable their individual research directions are.

[Phrasing this kinda weirdly as I want to get a visceral update on my belief in "when thinking is clearly described, I can guess what the author means by inside/outside view." I also think that (p=.33) Nuño was just not very careful and will say something like "I have no idea what I really meant at the time of writing it."]

NunoSempere

To resolve that prediction, yes, I would say that that interpretation is correct.

Steven Byrnes

Just one guy, but I have no idea how I would have gotten into AGI safety if not for LW ... I had a full-time job and young kids and not-obviously-related credentials. But I could just come out of nowhere in 2019 and start writing LW blog posts and comments, and I got lots of great feedback, and everyone was really nice. I'm full-time now, here's my writings, I guess you can decide whether they're any good :-P

BrianTan

Thanks for this, it's really interesting. I'm curious about how much time you spent on

Each org you evaluated
- Maybe an average time you spent per org, and you can also mention any outliers. It feels like you spent more time on a couple of organizations compared to the others?
On this project as a whole so far

Charity Entrepreneurship usually does this for their reports, so people get a sense of how much time was put in and how deep the evaluation/effort was. Maybe you can include these time estimates into the post?

NunoSempere

This is a good question, and in hindsight, something I should have recorded. For the project as a whole, maybe two weeks to a month, but not of full-time work. I don't remember the times for each organization.

Linch

I'm providing numerical context for RP's longtermism team here because a) it's easier to evaluate the costs than research(er) quality when you have the data, and b) the costs are by default more invisible when you don't have the data.

Rethink Priorities has recently been expanding into the longtermist sphere, and it did so by hiring Linch Zhang and Michael Aird, the latter part-time, as well as some volunteers/interns.

Just for some numerical context, I'm full-time at RP right now and Michael is half-time at RP and half-time at FHI RSP (so we have 1.5 FTEs). At the beginning of 2021, DaveRhysBenard was half-time and Michael was full-time (2.5 FTEs). That said, David was poached internally for some neartermist work (so not sure how to count that, FTE wise). I also spent ~1.5 months on neartermist work (not public). In total, this will be about 1.9 FTE-years by EOY (unless you count neartermist work as not part of the longtermist team), not including new hires.

We currently have 4 longtermist summer interns (all paid), 2 of whom are ~half-time (3 FTEs total). They're here for approximately 3 months and are paid for by the EAIF + an additional donation which we earmarked for longtermist interns. Our internal theory of change for the internship focuses more on (a) helping the interns test and improve their fit for EA-style research and (b) helping their managers build management experience to facilitate further scaling, rather than on producing longtermist research outputs. 3 FTEs x 3 months ~= .75 FTE-years.

Finally, we have one volunteer, Charles Dillon (special circumstances). He's approximately half-time for us and started in May. We may have had other volunteers but if so I think they dropped off quickly without time investment on either our part or theirs, so low cost on both ends.* If we assume Charles will be with us until EOY (I sure hope so!), then we have .5 x 2/3 ~= .33 FTE-years in volunteers by EOY 2021. We do not expect to take on additional volunteers.

In total, by EOY 2021 on the RP longtermist team we'll have ~1.9 FTE years if you only count employees, ~2.65 if you also count paid interns, and ~3 FTE years if you count both interns and unpaid volunteers. This is not including additional hires, which we may want to make in late Fall 2021 (See Footnote 9).

*(If I'm missing someone, sincere apologies! Let me know and I'll add you)
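
For anyone who wants to re-run the totals above with different assumptions, here is a minimal Python sketch of the same arithmetic. The variable names are hypothetical; the inputs are the figures stated in this comment (with May to EOY counted as 8 months).

```python
# FTE-years on the RP longtermist team by EOY 2021, using the figures above.
employees = 1.9               # stated directly above
interns = 3 * (3 / 12)        # 3 FTEs for ~3 months
volunteers = 0.5 * (8 / 12)   # one half-time volunteer, May to EOY (~2/3 of a year)

employees_only = employees
with_interns = employees + interns
with_volunteers = with_interns + volunteers

print(f"employees only:        {employees_only:.2f}")
print(f"+ paid interns:        {with_interns:.2f}")
print(f"+ unpaid volunteers:   {with_volunteers:.2f}")
```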

Nathan Young

Superb work. Thanks for doing it. Would love to see more of this kind of thing.

Michael St Jules 🔸

For instance, requiring psychopathy tests for politicians, or psychological evaluation, seems very unrealistic.

Seems like you could do polling and start a ballot initiative where it looks promising, if anywhere. Starting small can get the momentum rolling and more attention to the issue, and then pick up support elsewhere.

Is there any particular reason you think it would be too unpopular or not work well? People might not like it in case it becomes a weapon used by the state to shut out political opponents, but maybe there are ways to prevent this, with bipartisan testers, or letting the subject choose at least one of the testers (who must have appropriate credentials). It could be like jury selection, with subjects allowed to challenge/strike potential testers (see strike for cause, peremptory challenge).

Also, we wouldn't need to require them to pass these tests; we could just publish the results so the public can be informed.

Maybe, in the US, it wouldn't be very effective other than in primaries, given how partisan things are.

Or do you think no useful tests could be made?

MichaelA🔸

I'd welcome comments about the overall method, about whether I'm asking the right questions for any particular organization, or about whether my tentative answers to those questions are correct, and about whether this kind of evaluation seems valuable. For instance, it's possible that I would have done better by evaluating all organizations using the same rubric (e.g., leadership quality, ability to identify talent, working on important problems, operational capacity, etc.)

FWIW:

I think I thought the questions you asked about each org seemed good
- I say "I think I thought" because I wasn't actively trying to find questions I thought weren't useful, come up with more relevant questions, etc.
I think it seems reasonable to use different questions for each org
- It seems reasonable for the questions to be guided by what the org's theory of change is, what seem the major plausible upside scenarios for the org, what seem the major plausible downside risks for the org, what seem the major uncertainties about or potential weaknesses of the org, etc., and these things differ a lot between orgs
- It might be useful to also have some questions or rubric that is used across all orgs (I'm not sure), but I think it'd still be good to have questions for each org tailored to that specific org
  - Or perhaps the common questions/rubric elements could be broad enough that the tailored questions all fit under one question/element
    - Toy example: You have a broad question about each of importance, tractability, and neglectedness for each org, and then tailored sub-questions for each of those factors for each org

NunoSempere

(Edited to add Centre for the Study of Existential Risk Four Month Report June - September 2020 to the CSER sources)

Question Mark

Would you consider reviewing the Center for Reducing Suffering? They are an organization similar to the Center on Long-Term Risk in the sense that their main focus is reducing S-risks, i.e. risks of astronomical suffering, but are less focused on AI. CRS is currently Brian Tomasik's top charity recommendation.

NunoSempere

In what capacity are you asking? I'd be more likely to do so if you were asking as a team member, because the organization right now looks fairly small and I would almost be evaluating individuals.

MichaelA🔸

Despite living under the FHI umbrella, each of these projects has a different pathway to impact, and thus they should most likely be evaluated separately. [...]
Consider in comparison 80,000 hours' annual review, which outlines what the different parts of the organization are doing, and why each project is probably valuable. I think having or creating such an annual review probably adds some clarity of thought when choosing strategic decisions (though one could also cargo-cult such a review solely in order to be more persuasive to donors), and it would also make shallow evaluations easier.

I think I agree with this, and it reminds me of my question post "Do research organisations make theory of change diagrams? Should they?" and some of the views expressed by commenters there (e.g., by Max Daniel). (Though really the relevant thing here is more like "explicit, clear theory of change", rather than it necessarily being in the form of a diagram.)

(Personal view only.)

Davidmanheim

There has been discussion about this for FHI, and I have spoken to a number of people there. They do have some specific ideas, but I agree that it would be beneficial for it to be 1) public, 2) explicit, and 3) actually used for evaluation. Unfortunately, I think that doing so would require a lot of work on their part, and it hasn't been a big priority.
