
Last updated: April 26, 2024

This is a reading list on the long reflection and the closely related, more recently coined notions of ASI governance, reflective governance and grand challenges.

I claim that this area outscores regular AI safety on importance[1] while being significantly more neglected (and roughly the same in terms of tractability), making it perhaps the highest priority EA cause area.

I don’t claim to be the ideal person to have made this reading list. The story behind how it came about is that two months ago, Will MacAskill wrote: “I think there’s a lot of excitement about work in this broad area that isn’t yet being represented in places like the Forum. I’d be keen for more people to start learning about and thinking about these issues.” Intrigued, I spent some time trying to learn about the issues Will[2] was pointing to. I then figured I’d channel the spirit of “EAs should post more summaries and collections”: this reading list is an attempt to make the path easier for others to follow. Accordingly, it starts at the introductory level, but by the end the reader will be at the frontier of publicly available knowledge. (The frontier as far as I’m aware, at least, and at the time of writing.[3])

DALL·E depiction of some people reflecting on what to do with the cosmos. They’ve been at it a long time.


What might we be aiming for?

Is there moral truth? What should we do if not? What are human values, and how might they fit in?

The intention with this section is to overview some of the philosophical background to the long reflection idea. Note that there is a large body of literature on the moral realism vs. antirealism debate, and metaethics more broadly, that exists beyond what’s listed here.

How to think about utopia?

Ideally, I would include some readings on how division or aggregation might work for building a utopia, since this seems like an obvious and important point. For instance, should the light cone be divided such that every person (or every moral patient more broadly, perhaps with the division taking moral weight into account) gets to live in a sliver of the light cone that’s optimized for their preferences? Should everybody’s preferences be aggregated somehow, so that everyone can live together happily in the overall light cone? Something else? However, I was unable to find any real discussion of this point. Let me know in the comments if there are writings I’m missing. For now, I’ll include the two most relevant things I could find as well as a more run-of-the-mill piece on preference aggregation theory.

How to think about avoiding worst-case futures?

How large could the future be?

How to think about counterfactuals?

How much of our future light cone will be colonized by aliens if we don’t colonize it ourselves?

There is a rich literature in the vicinity of this question. For those wanting to work on the long reflection it’s probably not necessary to get into the details of the models/arguments: a sense for the state of the debate should suffice.

The Fermi paradox

Grabby aliens

What, if anything, can we say about alien values?[5]

Corruption from within

In “Beyond Maxipok”, Cotton-Barratt writes “perhaps there are some ingredients which could be cancerous and distort an otherwise good reflective process from the inside.” This section aims to be about the most salient ways that could arise.

Human safety problems

Metaphilosophy; AI philosophical competence

Values hijack; epistemic hijack

The previous section was about how a reflective process could get distorted from within. This section is about one way—perhaps the most salient way—a reflective process could become corrupted by bad actors / from the outside.

On AGI deployment

The readings in this section discuss what the world around the time of AGI might look like. Things could be very alien, and move very fast. How can we raise the chance that good deliberative processes prevail in such a world?

Economic explosion

Intelligence explosion

What do the leading AI companies say they’d do with AGI?

(Ideally this subsection would include all the leading AI companies’ stated plans for how they’d employ AGI if they developed it, but to date only OpenAI has stated such plans, as far as I’m aware.)

The deployment problem

Decisive strategic advantage

Who (or what) will control the future?

The singleton hypothesis

A race to the bottom?

Institutional design for the long reflection

What can we learn from existing institutions?

There are a few institutions whose design we could maybe bootstrap off for designing the long reflection. Below, I try to point out these institutions; there are likely some (many?) I’m missing—let me know in the comments if so. Questions to keep in mind as you read are: What exactly happened to result in the creation of [institution]? How was [institution]’s precise nature determined? (For example, for the U.S. constitution, how were the details of its articles decided?) How long did it all take? (In other words, given these historical examples, and given our best guesses as to AI timelines, when do we need to start building infrastructure for the long reflection?)

Case study writing on lessons we can learn from the above institutions’ formations is a particularly promising direction for work on the long reflection, in my view. The following is an example of excellent writing in a similar vein.

Existing proposals for new AI governance institutions

The proposals here are aimed at reducing catastrophic risk from AI, but institutions for that purpose could(?) quite naturally extend into supporting a democratic reflective process aided by advanced AI—i.e., the long reflection.

International coordination on regulation

Democratic AGI development

The issue of credible compliance

The motivating question for this subsection is: supposing that an AI leader (e.g., the U.S.) wanted to make an agreement to share power and respect other actors’ (e.g., other countries’) sovereignty in the event that it develops superintelligence first, how could it legibly guarantee future compliance with that agreement so that the commitment is credible to these other actors?[7]

(The AI leader might want to make such an agreement partly for ethical reasons, and partly to decrease the incentive for competitors to race and cut corners on safety.)

There doesn’t appear to be any work on this exact issue. However, some work in adjacent areas may be applicable.

Credible compliance: Potential directions forward

The Windfall Clause

The windfall clause is a proposed mechanism for distributing AGI profits for social good. It seems somewhat applicable to the issue of credible compliance: there are presumably learnings to be had from understanding the clause and how it has been received.

AIs making credible commitments

AI systems can in theory precommit[8] to actions in a way that humans cannot. I cannot show you my source code to prove that I will act in a certain way in a certain situation, whereas an AI might be able to.

The path through which this is leveraged to solve the issue of credible compliance might look something like the leading AI company building into its frontier model the commitment to share power and respect other actors’ sovereignty,[9] and displaying this for other actors to see. (Exactly how such a commitment could be built into a model is likely a thorny technical challenge.)

Noteworthy subarea: Cryptographic technologies

Cryptographic technologies like blockchain are potentially relevant to credible commitment.

  • A Tour of Emerging Cryptographic Technologies – Garfinkel (2021)
    • Just “Non-intrusive agreement verification could become feasible” (pp. 48–49) and “It could become possible to solve collection [sic] action problems that existing institutions cannot” (pp. 53–57). (Follow the page numbers at the bottom of each page rather than those in the left sidebar.)

The issue of opposers and defectors

Interstellar coordination

I would ideally include a reading on the problems—coordination or otherwise—an institution or governance regime might run into if its constituents are light-years apart, given that it seems plausible humanity should start expanding before the long reflection is over. (On the latter point, Cotton-Barratt writes: “I’m not using that term [‘the long reflection’] because it’s not obvious that ‘the reflection’ is a distinct phase: it could be that reflection is an ongoing process into the deep future, tackling new (perhaps local) questions as they arise.”) However, I was unable to find anything of great relevance. Let me know in the comments if there are writings I’m missing. I list the one semi-relevant piece I have seen below.

Note that work in this niche is less pressing than other work on institutional design, in my view, as it’s the kind of thing that could be solved during the long reflection (whereas the other things like democratic control and credible compliance need to be in place from the outset). Work in this niche is more pressing the shorter and less controlled one believes the long reflection is likely to be.

  • Succession – Ngo (2023)
    • A couple of points to be aware of:
      • Ngo’s story involves long-distance communication both within a civilization and between civilizations. It’s the former that’s relevant—potentially—to the long reflection.
      • The being/civilization that’s the first-person character seems to mostly be executing on a plan rather than coming up with a plan (i.e., they’ve mostly finished their long reflection).
        • Although it’s interesting to note that their knowledge exchange with the aliens does change their plan a little, which resonates with Cotton-Barratt’s idea about reflection being an ongoing process into the deep future.


These are topics that aren’t directly part of establishing a long reflection, but which those doing work or aiming to do work on the long reflection might want to be aware of.[10] Essentially, each of these topics points to an open problem that is arguably a crucial consideration which has to be solved, or dissolved, for the future to be close to maximally valuable. A successful long reflection is a meta-solution, a process that enables these problems to be solved. Under each topic I list the top resource(s), in my opinion, for getting up to speed.[11]

Welfare and moral weights


Different biological species

Digital minds

Societal considerations

Technical considerations

Is positive experience possible?[12]

Infinite ethics

Decision theory

Commitment races


Acausal trade / Evidential cooperation in large worlds

Acausal threats



Key problems

The simulation hypothesis

Boltzmann brains

Everett branches / Quantum many-worlds

  • David Wallace on the many-worlds theory of quantum mechanics and its implications – Wiblin & Harris (2021)
    • One way this topic could be important, which wasn’t emphasized so much by Wallace, is that if it’s true that the amount of “stuff”—for want of a better word—increases as the number of branches increases (rather than the total amount of stuff remaining constant, with individual branches becoming ever smaller slivers of the total), then, if humanity succeeds at creating a utopia, it could become a moral priority for us to trigger as many branch splittings as possible.

Subtopic: Quantum immortality

Content warning: eternity; suffering. See Raemon’s note.

Cause X

  1. ^

    A crude simplification of my model, which I don’t currently have time to write up in full: if a middling outcome is 1% the value of a great outcome, then going from extinction to a middling outcome is 1/99 as valuable as going from a middling outcome to a great one.

  2. ^

  3. ^

    If this reading list turns out to be useful then maybe I’ll keep it updated, or maybe someone more qualified than me will step into that role.

  4. ^

    Original discussion of the long reflection indicated that it could be a lengthy process of 10,000 years or more. More recent discussion I’m aware of, which is nonpublic hence no corresponding reading, i) takes seriously the possibility that the long reflection could last just weeks rather than years or millennia, and ii) notes that wall clock time is probably not the most useful way to think about the length of the reflection, given that the reflection process, if it happens at all, will likely involve many superfast AIs doing the bulk of the cognitive labor.

  5. ^

    The key overall question here is “how much worse is a light cone colonized by aliens compared to a light cone colonized by humanity, by our lights?” I don’t have a post to link to that answers this question, but I’ve been involved in some nonpublic discussion. My all-things-considered belief is that, according to our values, an alien-colonized light cone would be something like 30% as valuable as a human-colonized light cone. (This 30% figure is highly uncertain and low resilience, though.)

  6. ^

    Some readers may have seen Karnofsky’s “Racing through a minefield: the AI deployment problem” and wonder how that post is different to the one I’ve included in the reading list. So, “Racing through a minefield” is a distillation of “Nearcast-based ‘deployment problem’ analysis”: I include the latter because I consider the extra detail worth knowing about.

  7. ^

    This motivating question is closely adapted from MacAskill’s quick take.

  8. ^

    The terminological difference between “commitment” and “precommitment” is explained here.

  9. ^

    Compliance that routes through AI precommitment in this way is more complicated than standard credible commitment between AIs, which does not have a step involving humans. For compliance to be credible, the humans behind the leading AI model presumably must not be able to override its precommitments. There is a tension here with corrigibility that may render this direction a non-starter.

  10. ^

    Mesa- is a Greek prefix that means the opposite of meta-. To “go meta” is to go one level up; to “go mesa” is to go one level down. So a mesa-topic is a topic one level down from the one you were on.

  11. ^

    With a bias towards resources that explain why their topic is decision-relevant. In practice, this means that more of the resources I list are EA-sphere writings than would otherwise be the case.

  12. ^

    For readers who think the answer is obviously “yes,” I point you to “Narrative Self-Deception: The Ultimate Elephant in the Brain?” (Vinding, 2018). (I used to think the answer was obviously yes; I changed my mind when I became proficient at meditation / paying close attention to experience.)

  13. ^

    H/T Michael St. Jules for making me aware of this lecture.

  14. ^

    H/T Ben West for making me aware of FNC.
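
The arithmetic in footnote 1 can be spelled out as a short worked example. This is only a sketch: it assumes, beyond what the footnote states, that extinction is assigned value 0 and that a great outcome has some positive value V; the symbols V and Δ are introduced here purely for illustration.

```latex
\begin{align*}
V_{\text{extinction}} &= 0, \qquad V_{\text{middling}} = 0.01\,V, \qquad V_{\text{great}} = V \\
\Delta_{\text{extinction} \to \text{middling}} &= 0.01\,V - 0 = 0.01\,V \\
\Delta_{\text{middling} \to \text{great}} &= V - 0.01\,V = 0.99\,V \\
\frac{\Delta_{\text{extinction} \to \text{middling}}}{\Delta_{\text{middling} \to \text{great}}} &= \frac{0.01\,V}{0.99\,V} = \frac{1}{99}
\end{align*}
```

Under these assumptions the ratio does not depend on V, which is why the footnote can state it without fixing a scale for value.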






Many of those posts in the list seem really relevant to me for the cluster of things you're pointing at!

On some of the philosophical background assumptions, I would consider adding my ambitiously-titled post The Moral Uncertainty Rabbit Hole, Fully Excavated. (It's the last post in my metaethics/anti-realism sequence.)

Since the post is long and it says that it doesn't work maximally well as a standalone piece (without two other posts from earlier in my sequence), it didn't get much engagement when I published it, so I feel like I should do some advertising for it here.

As the title indicates, I'm trying to answer questions in that post that many EAs don't ask themselves because they think about moral uncertainty or moral reflection in an IMO somewhat lazy way.

The post starts with a conundrum for the concept of moral uncertainty: 

In an earlier post, I argued that moral uncertainty and confident moral realism don’t go together. Accordingly, if we’re morally uncertain, we must either endorse moral anti-realism or at least put significant credence on it.

This insight has implications because we're now conflating a few different things under the "moral uncertainty" label: 

  • Metaethical uncertainty (i.e., our remaining probability on moral realism) and the strength of possible wagers for acting as though moral realism is true even if our probability in it is low.
  • Uncertainty over the values we'd choose after long reflection (our "idealized values", which most people would be motivated to act upon even if moral realism is false).
  • Related to how we'd get to idealized values, the possibility of having under-defined values, i.e., the possibility that, because moral realism is false, even idealized moral reflection may lead to different endpoints based on very small changes to the procedure, or that a person's reflection doesn't "terminate" because their subjective feeling of uncertainty never goes away inside the envisioned reflection procedure.

My post is all about further elaborating on these distinctions and spelling out their implications for effective altruists.

I start out by introducing the notion of a moral reflection procedure to explain what moral reflection in an idealized setting could look like:

To specify the meaning of “perfectly wise and informed,” we can envision a suitable procedure for moral reflection that a person would hypothetically undergo. Such a reflection procedure comprises a reflection environment and a reflection strategy. The reflection environment describes the options at one’s disposal; the reflection strategy describes how a person would use those options.

Here’s one example of a reflection environment:

  • My favorite thinking environment: Imagine a comfortable environment tailored for creative intellectual pursuits (e.g., a Google campus or a cozy mansion on a scenic lake in the forest). At your disposal, you find a well-intentioned, superintelligent AI advisor fluent in various schools of philosophy and programmed to advise in a value-neutral fashion. (Insofar as that’s possible – since one cannot do philosophy without a specific methodology, the advisor must already endorse certain metaphilosophical commitments.) Besides answering questions, they can help set up experiments in virtual reality, such as ones with emulations of your brain or with modeled copies of your younger self. For instance, you can design experiments for learning what you'd value if you first encountered the EA community in San Francisco rather than in Oxford or started reading Derek Parfit or Peter Singer after the blog Lesswrong, instead of the other way around.[2] You can simulate conversations with select people (e.g., famous historical figures or contemporary philosophers). You can study how other people’s reflection concludes and how their moral views depend on their life circumstances. In the virtual-reality environment, you can augment your copy’s cognition or alter its perceptions to have it experience new types of emotions. You can test yourself for biases by simulating life as someone born with another gender(-orientation), ethnicity, or into a family with a different socioeconomic status. At the end of an experiment, your (near-)copies can produce write-ups of their insights, giving you inputs for your final moral deliberations. You can hand over authority about choosing your values to one of the simulated (near-)copies (if you trust the experimental setup and consider it too difficult to convey particular insights or experiences via text). 
Eventually, the person with the designated authority has to provide to your AI assistant a precise specification of values (the format – e.g., whether it’s a utility function or something else – is up to you to decide on). Those values then serve as your idealized values after moral reflection.

(Two other, more rigorously specified reflection procedures are indirect normativity and HCH.[3] Indirect normativity outputs a utility function whereas HCH attempts to formalize “idealized judgment,” which we could then consult for all kinds of tasks or situations.)[4]

“My favorite thinking environment” leaves you in charge as much as possible while providing flexible assistance. Any other structure is for you to specify: you decide the reflection strategy.[5] This includes what questions to ask the AI assistant, what experiments to do (if any), and when to conclude the reflection.

For reflection strategies (how to behave inside a reflection procedure), I discuss a continuum from "conservative" to "open-minded" reflection strategies.

Someone with a conservative reflection strategy is steadfast in their moral reasoning framework. (What I mean by “moral reasoning framework” is similar to what Wei Dai calls “metaphilosophy” – it implies having confidence in a particular metaphilosophical stance and using that stance to form convictions about one’s reasoning methodology or object-level moral views.) They guard their opinions, which turns these into convictions (“convictions” being opinions that one safeguards against goal drift). At its extreme, someone with a maximally conservative reflection strategy has made up their mind and no longer benefits from any moral reflection. People can have moderately conservative reflection strategies where they have formed convictions on some issues but not others.

By contrast, people with open-minded moral reflection strategies are uncertain about either their moral reasoning framework or (at least) their object-level moral views. As the defining feature, they take a passive (“open-minded”) reflection approach focused on learning as much as possible without privileging specific views[7] and without (yet) entering a mode where they form convictions.

That said, “forming convictions” is not an entirely voluntary process – sometimes, we can’t help but feel confident about something after learning the details of a particular debate. As I’ll elaborate below, it is partly for this reason that I think no reflection strategy is inherently superior.

Comparing these two reflection strategies is a core theme of the post, and one takeaway I arrive at is that neither end of the spectrum is superior to the other. Instead, I see moral reflection as a bit of an art, and we just have to find our personal point on the spectrum.

Relatedly, there's also the question of "What's the benefit of reflection now?" vs. "How much do we want to just leave things to future selves or hypothetical future selves in a reflection procedure?" (The point being that it is not obvious by default that moral reflection has to be postponed!)

Reflection procedures are thinking-and-acting sequences we'd undergo if we had unlimited time and resources. While we cannot properly run a moral reflection procedure right now in everyday life, we can still narrow down our uncertainty over the hypothetical reflection outcome. Spending time on that endeavor is worth it if the value of information – gaining clarity on one’s values – outweighs the opportunity cost from acting under one’s current (less certain) state of knowledge.

Gaining clarity on our values is easier for those who would employ a more conservative reflection strategy in their moral reflection procedure. After all, that means their strategy involves guarding some pre-existing convictions, which gives them advance knowledge of the direction of their moral reflection.[9]

By contrast, people who would employ more open-minded reflection strategies may not easily be able to move past specific layers of indecision. Because they may be uncertain how to approach moral reasoning in the first place, they can be “stuck” in their uncertainty. (Their hope is to get unstuck once they are inside the reflection procedure, once it becomes clearer how to proceed.)


If moral realism were true, the timing of that transition (“the reflection strategy becoming increasingly conservative as the person forms more convictions”) would be obvious. It would happen once the person knows enough to see the correct answers, once they see the correct way of narrowing down their reflection or (eventually) the correct values to adopt at the very end of it.

In the moral realist picture, expressions like “safeguarding opinions” or “forming convictions” (which I use interchangeably) seem out of place. Obviously, the idea is to “form convictions” about only the correct principles!

However, as I’ve argued in previous posts, moral realism is likely false.

This is then followed by a discussion on whether "idealized values" are chosen or discovered.

Under moral anti-realism, there are two empirical possibilities[10] for “When is someone ready to form convictions?” In the first possibility, things work similarly to naturalist moral realism but on a personal/subjectivist basis. We can describe this option as “My idealized values are here for me to discover.” By this, I mean that, at any given moment, there’s a fact of the matter to “What I’d conclude with open-minded moral reflection.” (Specifically, a unique fact – it cannot be that I would conclude vastly different things in different runs of the reflection procedure or that I would find myself indifferent about a whole range of options.)

The second option is that my idealized values aren’t “here for me to discover.” In this view, open-minded reflection is too passive – therefore, we have to create our values actively. Arguments for this view include that (too) open-minded reflection doesn’t reliably terminate; instead, one must bring normative convictions to the table. “Forming convictions,” according to this second option, is about making a particular moral view/outlook a part of one’s identity as a morality-inspired actor. Finding one’s values, then, is not just about intellectual insights.

I will argue that the truth is somewhere in between.

Why do I think this? There's more in my post, but here are some of the interesting bits, which seem especially relevant to the topic of "long reflection":

There are two reasons why I think open-minded reflection isn’t automatically best:

  1. We have to make judgment calls about how to structure our reflection strategy. Making those judgment calls already gets us in the business of forming convictions. So, if we are qualified to do that (in “pre-reflection mode,” setting up our reflection procedure), why can’t we also form other convictions similarly early?
  2. Reflection procedures come with an overwhelming array of options, and they can be risky (in the sense of having pitfalls – see later in this section). Arguably, we are closer (in the sense of our intuitions being more accustomed and therefore, arguably, more reliable) to many of the fundamental issues in moral philosophy than to matters like “carefully setting up a sequence of virtual reality thought experiments to aid an open-minded process of moral reflection.” Therefore, it seems reasonable/defensible to think of oneself as better positioned to form convictions about object-level morality (in places where we deem it safe enough).

Reflection strategies require judgment calls

In this section, I’ll elaborate on how specifying reflection strategies requires many judgment calls. The following are some dimensions along which judgment calls are required (many of these categories are interrelated/overlapping):

  • Social distortions: Spending years alone in the reflection environment could induce loneliness and boredom, which may have undesired effects on the reflection outcome. You could add other people to the reflection environment, but who you add is likely to influence your reflection (e.g., because of social signaling or via the added sympathy you may experience for the values of loved ones).
  • Transformative changes: Faced with questions like whether to augment your reasoning or capacity to experience things, there’s always the question “Would I still trust the judgment of this newly created version of myself?”
  • Distortions from (lack of) competition: As Wei Dai points out in this Lesswrong comment: “Current human deliberation and discourse are strongly tied up with a kind of resource gathering and competition.” By competition, he means things like “the need to signal intelligence, loyalty, wealth, or other ‘positive’ attributes.” Within some reflection procedures (and possibly depending on your reflection strategy), you may not have much of an incentive to compete. On the one hand, a lack of competition or status considerations could lead to “purer” or more careful reflection. On the other hand, perhaps competition functions as a safeguard, preventing people from adopting values where they cannot summon sufficient motivation under everyday circumstances. Without competition, people’s values could become decoupled from what ordinarily motivates them and more susceptible to idiosyncratic influences, perhaps becoming more extreme.
  • Lack of morally urgent causes: In the blogpost On Caring, Nate Soares writes: “It's not enough to think you should change the world — you also need the sort of desperation that comes from realizing that you would dedicate your entire life to solving the world's 100th biggest problem if you could, but you can't, because there are 99 bigger problems you have to address first.”
    In that passage, Soares points out that desperation can strongly motivate why some people develop an identity around effective altruism. Interestingly enough, in some reflection environments (including “My favorite thinking environment”), the outside world is on pause. As a result, the phenomenology of “desperation” that Soares described would be out of place. If you suffered from poverty, illnesses, or abuse, these hardships are no longer an issue. Also, there are no other people to lift out of poverty and no factory farms to shut down. You’re no longer in a race against time to prevent bad things from happening, seeking friends and allies while trying to defend your cause against corrosion from influence seekers. This constitutes a massive change in your “situation in the world.” Without morally urgent causes, you arguably become less likely to go all-out by adopting an identity around solving a class of problems you’d deem urgent in the real world but which don’t appear pressing inside the reflection procedure. Reflection inside the reflection procedure may feel more like writing that novel you’ve always wanted to write – it has less the feel of a “mission” and more of “doing justice to your long-term dream.”[11]
  • Ordering effects: The order in which you learn new considerations can influence your reflection outcome. (See page 7 in this paper. Consider a model of internal deliberation where your attachment to moral principles strengthens whenever you reach reflective equilibrium given everything you already know/endorse.)
  • Persuasion and framing effects: Even with an AI assistant designed to give you “value-neutral” advice, there will be free parameters in the AI’s reasoning that affect its guidance and how it words things. Framing effects may also play a role when interacting with other humans (e.g., epistemic peers, expert philosophers, friends, and loved ones).

Pitfalls of reflection procedures

There are also pitfalls to avoid when picking a reflection strategy. The failure modes I list below are avoidable in theory,[12] but they could be difficult to avoid in practice:

  • Going off the rails: Moral reflection environments could be unintentionally alienating (enormous option space; time spent reflecting could be unusually long). Failure modes related to the strangeness of the moral reflection environment include existential breakdown and impulsively deciding to lock in specific values to be done with it.
  • Issues with motivation and compliance: When you set up experiments in virtual reality, the people in them (including copies of you) may not always want to play along.
  • Value attacks: Attackers could simulate people’s reflection environments in the hope of influencing their reflection outcomes.
  • Addiction traps: Superstimuli in the reflection environment could cause you to lose track of your goals. For instance, imagine you started asking your AI assistant for an experiment in virtual reality to learn about pleasure-pain tradeoffs or different types of pleasures. Then, next thing you know, you’ve spent centuries in pleasure simulations and have forgotten many of your lofty ideals.
  • Unfairly persuasive arguments: Some arguments may appeal to people because they exploit design features of our minds rather than because they tell us “what humans truly want.” Reflection procedures with argument search (e.g., asking the AI assistant for arguments that are persuasive to lots of people) could run into these unfairly compelling arguments. For illustration, imagine a story like “Atlas Shrugged” but highly persuasive to most people. (We can also think of “arguments” as sequences of experiences: Inspired by the Narnia story, perhaps there exists a sensation of eating a piece of candy so delicious that many people become willing to sell out all their other values for eating more of it. Internally, this may feel like becoming convinced of some candy-focused morality, but looking at it from the outside, we’ll feel like there’s something problematic about how the moral update came about.)
  • Subtle pressures exerted by AI assistants: AI assistants trained to be “maximally helpful in a value-neutral fashion” may not be fully neutral, after all. (Complete) value-neutrality may be an illusory notion, and if the AI assistants mistakenly think they know our values better than we do, their advice could lead us astray. (See Wei Dai’s comments in this thread for more discussion and analysis.)

Conclusion: “One has to actively create oneself”

“Moral reflection” sounds straightforward – naively, one might think that the right path of reflection will somehow reveal itself. However, as we think of the complexities of setting up a suitable reflection environment and how we’d proceed inside it, what it would be like and how many judgment calls we’d have to make, we see that things can get tricky.

Joe Carlsmith summarized it as follows in an excellent post (what Carlsmith calls “idealizing subjectivism” corresponds to what I call “deferring to moral reflection”):

> My current overall take is that especially absent certain strong empirical assumptions, idealizing subjectivism is ill-suited to the role some hope it can play: namely, providing a privileged and authoritative (even if subjective) standard of value. Rather, the version of the view I favor mostly reduces to the following (mundane) observations:

  • If you already value X, it’s possible to make instrumental mistakes relative to X.
  • You can choose to treat the outputs of various processes, and the attitudes of various hypothetical beings, as authoritative to different degrees.

> This isn’t necessarily a problem. To me, though, it speaks against treating your “idealized values” the way a robust meta-ethical realist treats the “true values.” That is, you cannot forever aim to approximate the self you “would become”; you must actively create yourself, often in the here and now. Just as the world can’t tell you what to value, neither can your various hypothetical selves — unless you choose to let them. Ultimately, it’s on you.

In my ((Lukas's)) words, the difficulty with deferring to moral reflection too much is that the benefits of reflection procedures (having more information and more time to think; having access to augmented selves, etc.) don’t change what it feels like, fundamentally, to contemplate what to value. For all we know, many people would continue to feel apprehensive about doing their moral reasoning “the wrong way” since they’d have to make judgment calls left and right. Plausibly, no “correct answers” would suddenly appear to us. To avoid leaving our views under-defined, we have to – at some point – form convictions by committing to certain principles or ways of reasoning. As Carlsmith describes it, one has to – at some point – “actively create oneself.” (The alternative is to accept the possibility that one’s reflection outcome may be under-defined.)

It is possible to delay the moment of “actively creating oneself” to a time within the reflection procedure. (This would correspond to an open-minded reflection strategy; there are strong arguments to keep one’s reflection strategy at least moderately open-minded.) However, note that, in doing so, one “actively creates oneself” as someone who trusts the reflection procedure more than one’s object-level moral intuitions or reasoning principles. This may be true for some people, but it isn’t true for everyone. Alternatively, it could be true for someone in some domains but not others.[13]

I further discuss the notion of "having under-defined values." This happens if someone defers to moral reflection with the expectation that it'll terminate with a specific answer, but they're predisposed to following reflection strategies that are open-ended enough that the reflection will, in practice, have under-defined outcomes.

Having under-defined values isn't necessarily a problem – I discuss the pros and cons of it in the post.

Towards the end of the post, there's a section where I discuss what is IMO the most sophisticated wager for "acting as though moral realism is true" (the wager for naturalist moral realism, rather than the one for non-naturalist/irreducible-normativity-based moral realism, which I discussed earlier in my sequence). In that discussion, I conclude that this naturalist moral realism wager often doesn't overpower what we'd do anyway under anti-realism. (The reasoning here is that naturalist moral realism feels somewhat watered down compared to non-naturalist moral realism, so it's "built on the same currency" as how we'd structure our reasoning anyway under moral anti-realism. Consequently, whether naturalist moral realism is true isn't too different from the question of whether idealized values are chosen or discovered – it's just that now we're also asking about the degree of moral convergence between different people's reflection.)

Anyway, that section is hard to summarize, so I recommend just reading it in full in the post (it has pictures and a fun "mountain analogy").

Lastly, I end the post with some condensed takeaways in the form of advice for someone's moral reflection:

Selected takeaways: good vs. bad reasons for deferring to (more) moral reflection

To list a few takeaways from this post, I made a list of good and bad reasons for deferring (more) to moral reflection. (Note, again, that deferring to moral reflection comes on a spectrum.)

In this context, it’s important to note that deferring to moral reflection would be wise if moral realism is true or if idealized values are ((on the far end of the spectrum of)) “here for us to discover.” In this sequence, I argued that neither of those is true – but some (many?) readers may disagree.

Assuming that I’m right about the flavor of moral anti-realism I’ve advocated for in this sequence, below are my “good and bad reasons for deferring to moral reflection.”

(Note that this is not an exhaustive list, and it’s pretty subjective. Moral reflection feels more like an art than a science.)

Bad reasons for deferring strongly to moral reflection:

  • You haven’t contemplated the possibility that the feeling of “everything feels a bit arbitrary; I hope I’m not somehow doing moral reasoning the wrong way” may never go away unless you get into a habit of forming your own views. Therefore, you never practiced the steps that could lead to you forming convictions. Because you haven’t practiced those steps, you assume you’re far from understanding the option space well enough, which only reinforces your belief that it’s too early for you to form convictions.
  • You observe that other people’s fundamental intuitions about morality differ from yours. You consider that an argument for trusting your reasoning and your intuitions less than you otherwise would. As a result, you lack enough trust in your reasoning to form convictions early.
  • You have an unreflected belief that things don’t matter if moral anti-realism is true. You want to defer strongly to moral reflection because there’s a possibility that moral realism is true. However, you haven’t thought about the argument that naturalist moral realism and moral anti-realism use the same currency, i.e., that the moral views you’d adopt if moral anti-realism were true might matter just as much to you.

Good reasons for deferring strongly to moral reflection:

  • You don’t endorse any of the bad reasons, and you still feel drawn to deferring to moral reflection. For instance, you feel genuinely unsure how to reason about moral views or what to think about a specific debate (despite having tried to form opinions).
  • You think your present way of visualizing the moral option space is unlikely to be a sound basis for forming convictions. You suspect that it is likely to be highly incomplete or even misguided compared to how you’d frame your options after learning more science and philosophy inside an ideal reflection environment.

Bad reasons for forming some convictions early:

  • You think moral anti-realism means there’s no for-you-relevant sense in which you can be wrong about your values.
  • You think of yourself as a rational agent, and you believe rational agents must have well-specified “utility functions.” Hence, ending up with under-defined values (which is a possible side-effect of deferring strongly to moral reflection) seems irrational/unacceptable to you.

Good reasons for forming some convictions early:

  • You can’t help it, and you think you have a solid grasp of the moral option space (e.g., you’re likely to pass Ideological Turing tests of some prominent reasoners who conceptualize it differently).
  • You distrust your ability to guard yourself against unwanted opinion drift inside moral reflection procedures ((if you were to follow a more open-minded reflection strategy)), and the views you already hold feel too important to expose to that risk.

Thanks, lots of interesting articles in this list that I missed despite my interest in this area.

One suggestion I have is to add some studies of failed attempts at building/reforming institutions, otherwise one might get a skewed view of the topic. (Unfortunately I don't have specific readings to suggest.)

A related topic you don't mention here (maybe due to lack of writings on it?) is whether humanity should pause AI development and have a long (or even short!) reflection about what it wants to do next, e.g. resume AI development or do something else like subsidize intelligence enhancement (e.g. embryo selection) for everyone who wants it so more people can meaningfully participate in deciding the fate of our world. (I note that many topics on this reading list are impossible for most humans to fully understand, perhaps even with AI assistance.)

I claim that this area outscores regular AI safety on importance while being significantly more neglected

This neglect is itself perhaps one of the most important puzzles of our time. With AGI very plausibly just a few years away, why aren't more people throwing money or time/effort at this cluster of problems just out of self interest? Why isn't there more intellectual/academic interest in these topics, many of which seem so intrinsically interesting to me?

This neglect is itself perhaps one of the most important puzzles of our time. With AGI very plausibly just a few years away, why aren't more people throwing money or time/effort at this cluster of problems just out of self interest? Why isn't there more intellectual/academic interest in these topics, many of which seem so intrinsically interesting to me?

I think all of:

  • Many people seem to believe in something like "AI will be a big deal, but the singularity is much further off (or will never happen)".
  • People treat the singularity in far mode even if they admit belief.
  • Previously committed people (especially academics) don't shift their interests or research areas much based on events in the world, though they do rebrand their prior interests. It requires new people entering fields to actually latch onto new areas, and there hasn't been enough time for this.
  • People who approach these topics from an altruistic perspective often come away with the view "probably we can mostly let the AIs/future figure this out; other topics seem more pressing and more possible to make progress on."
  • There aren't clear shovel-ready projects.

Ideally, I would include at this point some readings on how aggregation might work for building a utopia, since this seems like an obvious and important point. For instance, should the light cone be divided such that every person (or every moral patient more broadly, perhaps with the division taking moral weight into account) gets to live in a sliver of the light cone that’s optimized to fit their preferences? Should everybody’s preferences be aggregated somehow, so that everyone can live together happily in the overall light cone? Something else? However, I was unable to find any real discussion of this point. Let me know in the comments if there are writings I’m missing. For now, I’ll include the most relevant thing I could find as well as a more run-of-the-mill reading on preference aggregation theory.

It would probably be worth it for someone to write out the ethical implications of K-complexity-weighted utilitarianism/UDASSA on how to think about far-future ethics.

A few things that come to mind about this question (these are all ~hunches and maybe only semi-related, sorry for the braindump):

  • The description length of earlier states of the universe is probably shorter, which means that the "claw" that locates minds earlier in a simple universe is also shorter. This implies that lives earlier in time in the universe would be more important, and that we don't have to care about exact copies as much.
    • This is similar to the reasons why not to care too much about Boltzmann brains.
  • We might have to aggregate preferences of agents with different beliefs (possible) and different ontologies/metaphysical stances (not sure about this), probably across ontological crises.
    • I have some preliminary writings on this, but nothing publishable yet.
  • The outcome of UDASSA depends on the choice of Turing machine. (People say it's only up to a constant, but that constant can be pretty big.)
    • So we either find a way of classifying Turing machines by simplicity without relying on a single Turing machine to give us that notion, or we start out with some probability distribution over Turing machines and do some "2-level-Solomonoff induction", where we update both the probability of each Turing machine and the probability of each hypothesis for each Turing machine.
    • This leads to selfishness for whoever is computing Solomonoff induction, because the Turing machine where the empty program just outputs their observations receives the highest posterior probability.
  • If we use UDASSA/K-utilitarianism to weigh minds, there's a pressure/tradeoff to simplify one's preferences.
  • If we endorse some kind of total utilitarianism, and there are increasing marginal returns to energymatter or spacetime investment into minds with respect to degree of moral patienthood, then we'd expect to end up with very few large minds; if there are decreasing marginal returns, we end up with many small minds.
  • Theorems like Gibbard-Satterthwaite and Hylland imply that robust preference aggregation that resists manipulation is really hard. You can circumvent this by randomly selecting a dictator, but I think this would become unnecessary if we operate in an open-source game theory context, where algorithms can inspect each others' reasons for a vote.
  • I'm surprised you didn't mention reflective equilibrium! Formalising reflective equilibrium and value formation with meta-preferences would be major steps in a long reflection.
  • I have the intuition that Grand Futures talks about this problem somewhere[1], but I don't remember/know where.

  1. Which, given its length, isn't that out there. ↩︎
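
To make the marginal-returns hunch in the list above a bit more concrete, here is a toy sketch. Everything in it is an illustrative assumption rather than anything established: the power-law form `r ** alpha` for a mind's moral weight, the equal split of a fixed resource budget, and the function names `total_moral_value` and `best_number_of_minds` are all hypothetical.

```python
def total_moral_value(budget: float, n_minds: int, alpha: float) -> float:
    """Total value when `budget` is split equally among `n_minds`.

    Hypothetical assumption: a mind given r units of resources has moral
    weight r ** alpha. alpha > 1 models increasing marginal returns,
    alpha < 1 models decreasing marginal returns.
    """
    per_mind = budget / n_minds
    return n_minds * per_mind ** alpha


def best_number_of_minds(budget: float, max_minds: int, alpha: float) -> int:
    """Number of equally sized minds (1..max_minds) that maximizes total value."""
    return max(
        range(1, max_minds + 1),
        key=lambda n: total_moral_value(budget, n, alpha),
    )


if __name__ == "__main__":
    # Increasing returns: total value is budget**2 / n, so fewer minds is better.
    print(best_number_of_minds(100.0, 50, alpha=2.0))
    # Decreasing returns: total value is 10 * sqrt(n), so more minds is better.
    print(best_number_of_minds(100.0, 50, alpha=0.5))
```

Under these assumptions the total is `budget ** alpha * n ** (1 - alpha)`, which falls with `n` when `alpha > 1` and rises with `n` when `alpha < 1`. That matches the "few large minds" versus "many small minds" intuition, though a real analysis would need a defensible model of how moral patienthood scales with resources.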

Are there any readings about how a long reflection could be realistically and concretely achieved?

Great resource, thanks for putting this together!

I think collections like this are helpful, but it's misleading to say it presents the "frontier of publicly available knowledge."

Taking just the first section on moral truth as an example, it seems like a huge overstatement to say this collection of podcasts and forum posts gets people to the frontier of this subject. Philosophers have spent a long time on this, writing thousands of papers. And at a glance, it seems like all of OP's linked resources don't even intend to give an overview of the literature on meta-ethics. They instead present their own personal perspectives.

And all of the resources in this section are EA/rationalist affiliated. Surely there have been some people who've said intelligent things about the nature of morality prior to Yudkowsky's birth, right? Neglecting these voices seems like an oversight, especially given the stated goal of getting readers to the frontier of publicly available knowledge.

Going forward, I'd suggest making more modest claims about what can be accomplished by a reading list like this and expanding the range of perspectives that's considered worth listening to.
