
Originally written in September 2024. I’ve only made some light style edits (and trimmed some unnecessary content) before posting here. Hence, this piece does not necessarily reflect my current views. I did not update this piece in light of the (rare few) other posts on the topic that have come out since then. The only new reference added is DiGiovanni’s Unawareness sequence since it was highly relevant on multiple occasions.

0. Introduction

Let’s start with a little bit of s-risk macrostrategy context. What parameters can we affect that seem to strongly correlate with the significance of s-risks?

S-risks seem very tightly tied to – if not 100% correlated with – the ability and willingness of agents in the world to bring about (or risk bringing about) astronomical disvalue, which itself seems to strongly correlate with the factors below, all else equal:[1]

  A. Number of agents in the world and the extent to which they are overall preserving their capacity to have and retain control over the long-term future on astronomical scales.
  B. Relative power of agents with knowledge and/or features conducive to the creation of astronomical disvalue.
    a. Relative power of s-risk-conducive preferences.
      i. Relative power of sadism.
      ii. Relative power of other s-risk-conducive preferences.
    b. Other things.

In this post, we’ll discuss how to affect factor B.a.i in our above typology – the relative power of sadism – in a way that reduces s-risks. §1 specifies sub-factors of the relative power of sadism (and reviews and contextualizes some related work). §2 and §3, respectively, discuss the implementation and outcome robustness – as defined by DiGiovanni (2025) – of reducing the relative power of sadism. §4 is a research agenda built with the most cruxy research questions this doc identifies and leaves unanswered.

But before all of this, §0.1 right below properly defines sadism in our context.

  0.1 What I exactly mean by “sadism”

From my post Understanding Sadism:

I here define sadism as an intrinsic preference for there to be more suffering (in a given context). This includes both unconditional cases (i.e., regardless of the context) and the ones conditional on, e.g., some belief that the suffering is “deserved” (tribalistic and/or retributivist sadism). This excludes “sadistic-looking” behaviors that aren’t actually well-explained by a terminal preference for suffering per se (see §2 for examples that help distinguish the two). This also excludes cases often associated with sadism, although meaningfully different, such as non-sadistic manifestations of other Dark Tetrad traits (narcissism, Machiavellianism, and psychopathy) and of antisocial personalities.

Such a definition is fairly different from what people usually have in mind when they discuss sadism – see e.g. this Wikipedia page and Foulkes (2019), section 2, which discusses different academic definitions.

In particular, my definition doesn’t solely focus on cases where the sadist derives pleasure from the suffering of some being(s). It also includes those where the satisfied preference for there to be more suffering doesn’t necessarily make the sadist experience hedonic value, at least not in the way it is most commonly defined and/or interpreted. For example, someone might deeply want someone else to suffer, thinking “they deserve it” (for whatever reason), try to make that more likely, and find it fair if it happens, without necessarily taking pleasure in knowing they suffer. Although it may be very much worth differentiating these two types of phenomena on a psychological level for practical reasons, I believe it makes sense to use sadism as an umbrella term including both.

I landed on this definition to focus on what seems most concerning to me: a direct, although sometimes contextual, preference for there to be more suffering.

1. The enabled-sadism recipe

Take the typology of s-risk factors just presented in the introduction. This section specifies factor B.a.i – the relative power of sadism – and identifies sub-factors predictive of it, all else equal.[2] What are the necessary “ingredients” for sadism to gain influence?

  1. Extent to which sadistic agents exist to begin with (or to which agents’ sadistic preferences are exacerbated).
    a. Saliency and spread of values and ideas conducive to sadism. (See some work related to this in footnote.)[3]
    b. Extent to which TAIs are trained in ways that might make them develop sadistic preferences. (See some work related to this in footnote.)[4]
    c. Extent to which other agents (e.g., humans) are born and/or raised in ways that might make them develop sadistic preferences. (See some work related to this in footnote.)[5]
    d. Other things?
  2. Extent to which sadistic agents can gain decisive influence.
    a. Saliency of reasons why sadistic agents might want to gain influence and how to do it.
    b. Effectiveness of implemented measures to prevent or reduce the empowerment of sadistic actors. (See some work related to this in footnote.)[6]

Now that we have painted a hopefully clearer picture of how sadistic preferences could gain decisive influence, let’s evaluate the implementation and outcome robustness of reducing the relative power of sadism in the next two sections.

2. The saliency hazard challenge

The main challenge to this cause area being implementation-robust is that most interventions we might think of – and even the very fact that we are researching this topic in the first place – may raise the saliency and spread of values and ideas conducive to sadism (ingredient 1.a in our above “recipe”), such that it isn’t obvious whether we’d actually reduce rather than increase the relative power of sadism overall. After all, it seems by default very unlikely that any agent would ever sadistically create astronomical disvalue, so it isn’t obvious that worrying about these scenarios makes them less rather than more likely.

This seems like a very fair concern and raises these cruxy research questions:

  • What exactly are the “values and ideas conducive to sadism”? How does raising their saliency affect the relative power of sadism?[7] For example, can we find any historical data on how sadism and adjacent traits becoming well-known affected their prevalence? Can we run informative experiments/surveys? What about potential analogs to sadism?

  • How can we reduce the relative power of sadism without non-trivially raising the saliency of the above values and ideas?

Let’s now move on to how outcome-robust reducing the relative power of sadism is (assuming it is done in an implementation-robust way, i.e., that we are actually reducing it rather than merely attempting to).

3. When does reducing the relative power of sadism reduce s-risks overall?

“Well, pretty much always,” you may think, naively. “You suggest yourself in your ‘enabled-sadism recipe’ that the relative power of sadism is correlated with s-risks all else equal.” The key lies in the last three words. The problem is that the actions we might consider taking to reduce sadism very rarely keep everything else equal. More specifically:

  • Sadism may be – in a given agent’s architecture – correlated with some characteristics conducive to reducing the number of Earth-originated agents and the extent to which they are overall preserving their capacity to have and retain control over the long-term future on astronomical scales. Hence, reducing the relative power of sadism (factor B.a.i in our typology in the introduction) may increase the expected number of agents in the world (factor A), which may increase s-risks (e.g., because of more potential for conflict), as the toy sketch after this list illustrates.
  • Reducing the relative power of sadism might also badly affect one or several s-risk factors for reasons we are unaware of[8] and can’t just conveniently set aside while forming our beliefs regarding whether this cause area reduces s-risks overall.
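To make the first point more concrete, here is a minimal toy sketch. The functional form and all the numbers are illustrative assumptions of mine, not anything implied by the typology: expected disvalue is simply treated as increasing in both the number of agents (factor A) and the relative power of sadism (factor B.a.i), and a hypothetical intervention that halves the latter is assumed to also raise the former. Whether s-risks go down overall then depends entirely on how large that side effect is.

```python
# Toy illustration only: the functional form and numbers are arbitrary
# assumptions, not a claim about how s-risk factors actually combine.

def expected_disvalue(num_agents: float, sadism_power: float,
                      conflict_weight: float = 0.05) -> float:
    """Stylized s-risk proxy that increases with both factors.

    num_agents      -- factor A: agents retaining long-term control (arbitrary units)
    sadism_power    -- factor B.a.i: relative power of sadism, in [0, 1]
    conflict_weight -- how much raw agent count contributes via conflict potential
    """
    return num_agents * sadism_power + conflict_weight * num_agents


baseline = expected_disvalue(num_agents=10, sadism_power=0.10)

# Hypothetical intervention: halves the relative power of sadism but, per the
# first bullet above, also increases the expected number of agents.
mild_side_effect = expected_disvalue(num_agents=12, sadism_power=0.05)
large_side_effect = expected_disvalue(num_agents=40, sadism_power=0.05)

print(f"baseline:          {baseline:.2f}")           # 1.50
print(f"mild side effect:  {mild_side_effect:.2f}")   # 1.20 -> net reduction
print(f"large side effect: {large_side_effect:.2f}")  # 4.00 -> net increase
```

Nothing hangs on the specific numbers; the point is only that the sign of the overall effect is not settled by the fact that one factor went down.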

Buhler (2023, section Counter-considerations and overall take, this comment) briefly touches upon the first problem. On the second problem, see DiGiovanni’s Unawareness sequence.

Appendix A: Disentangling sadism and spite

Between 2021 and early 2024, CLR had a very substantial research interest in spite. How is spite different from sadism? And how does this matter?

Nicolas Macé et al. (2023) give this informal definition of spite:

An agent is spiteful if, at least under some conditions, they are intrinsically motivated to frustrate other agents’ preferences.  

I defined sadism (§0.1) as an intrinsic preference for there to be more suffering (given a certain context).

If we take these definitions very literally, spite would be to sadism what preference utilitarianism is to hedonistic utilitarianism (see e.g. Tomasik 2016). The sadist ultimately cares about the suffering their victim subjectively experiences. The spiteful agent ultimately cares about whether a given preference their victim has is violated[9] (independently of whether this makes the victim experience pain / hedonic disvalue).

One problem with this literal interpretation is that spite would then exclude cases of spiteful sadism, where one values the violation of preferences only to the extent that it causes suffering in the target (see §2 of Understanding Sadism).[10] Let’s therefore interpret Macé et al.’s “intrinsically” flexibly enough to include these cases. This means sadism and spite are not mutually exclusive: cases of spiteful sadism are both cases of sadism and cases of spite.

Why bother differentiating between sadism, spite, and spiteful sadism, though? Well, unlike sadism, spite may be desirable in some cases (e.g., to prevent getting systematically bullied by other agents). At least in some contexts, we might want to reduce the relative power of sadism without reducing that of spite. This necessitates clarifying the difference between sadism and spite both in theory and in practice. We’ve identified the theoretical difference above. Let’s now see how much we can disentangle spiteful sadism and other instances of sadism.

In what situations does sadism seem unusually unlikely to go along with or involve spite? We’ll say these are potential cases of “spiteless sadism”:

  • The sadist hurts those who consent to it and is not “turned off” by the associated perceived absence of violated preference.
  • They inflict physical rather than psychological pain.
  • They don’t target agents they believe have preferences but are not sentient (e.g., Alexa, ChatGPT, plants).
  • They want to witness the suffering or find another way to be certain it happened rather than content themselves with knowing the victim’s preference has been violated.

One way forward would be to evaluate how much spiteless sadism and spite correlate in humans.

Lasko (2021, Figure 6) unsurprisingly shows a correlation between sadism and spite. Marcus et al. (2014) show something similar less directly by identifying a correlation between spite, Dark Triad traits, and other antisocial preferences, which themselves correlate with sadism (see e.g. Plouffe et al. 2018; Book et al. 2016; Tran et al. 2018; Pfattheicher et al. 2019; Rogers et al. 2018). I couldn’t find any other work discussing the relationship between sadism and spite, besides Zeigler-Hill and Vonk (2015), from which I don’t take away anything insightful on this question.

It is unclear how informative these results are when it comes to the relationship between spiteless sadism (specifically) and spite. New surveys could be conducted there.
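If such surveys were run, the statistical side would be simple. Here is a minimal sketch of the kind of check they would feed into, assuming hypothetical scale scores – one for “spiteless sadism” (e.g., items built from the markers listed above) and one for spite – and toy data I made up purely for illustration:

```python
# Sketch of how survey data on spiteless sadism and spite could be compared.
# The scales and all scores below are hypothetical placeholders, not real data.
from scipy.stats import spearmanr

# One row per respondent: (spiteless-sadism scale score, spite scale score),
# e.g., each averaged over Likert items targeting the markers listed above.
responses = [
    (1.2, 1.0), (2.5, 2.1), (3.1, 1.8), (1.0, 1.4),
    (4.0, 3.2), (2.2, 2.6), (3.5, 2.0), (1.8, 1.1),
]

spiteless_sadism = [r[0] for r in responses]
spite = [r[1] for r in responses]

# Spearman is a reasonable default for ordinal, Likert-style scale scores.
rho, p_value = spearmanr(spiteless_sadism, spite)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

The hard part, of course, is not the correlation itself but constructing items that actually isolate spiteless sadism from the spiteful kind; that is what the four markers above are meant to help with.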

  1. ^

     In other words, we have A) the absolute power of agents that could in theory directly or indirectly cause astronomical disvalue, and B) within these, the relative power of those that are particularly likely to cause such a thing. I think that’s exhaustive (assuming we also consider “Nature” as an agent or ignore natural suffering) although very coarse-grained. A and B could be specified a lot more.

  2. ^

     It also reviews the work that has been done so far that is most relevant to reducing the relative power of sadism and contextualizes it, in footnotes. Let me also briefly mention here Althaus and Baumann’s (2020) “advancing the science of malevolence” idea as a meta project that could be instrumental in reducing several of the below factors we are worried about. Tobias Baumann and CLR have done work within the scope of this.

  3. ^

     I list cruxy research questions vis-à-vis the implementation-robustness of reducing the relative power of malevolent actors given infohazards in Some governance research ideas to prevent malevolent control over AGI and why this might matter a hell of a lot.

  4. ^

Althaus and Baumann (2020) briefly say that “the fact that malevolent traits such as psychopathy or sadism evolved in some humans suggests that those traits provided fitness advantages, at least in certain contexts” and give references. Mia Taylor’s (2024) measurement research agenda, which is focused on how AI might develop conflict-conducive preferences, could fairly easily be adapted into a version focused on sadism instead.

  5. ^

     David Althaus and Tobias Baumann (2020) discuss how future technologies such as whole brain emulation and genetic enhancement could be used to select against malevolent traits and link to work addressing similar topics in their section Previous discussion on reducing risks from malevolent actors within their Appendix B: Reducing long-term risks from malevolent actors.

  6. ^

Althaus and Baumann (2020, section Political interventions) and Buhler (2023, section Evaluating the promisingness of various governance interventions) propose contenders for such effective measures.

  7. ^

Maybe raising their saliency actually reduces the relative power of sadism, e.g., by persuading everyone that most forms of sadism are absolutely horrible, so that people with sadistic tendencies successfully overcome them.

  8. ^

See e.g. Tarsney (2024, section 3); Roussos (2021, slides); Tomasik (2015).

  9. ^

I use “violated” instead of “frustrated” to make it extra clear that the spiteful agent terminally cares about doing something that goes against the preference and not necessarily about whether the agent holding that preference feels frustrated because of this, in which case this would technically be sadism and not (pure) spite.

  10. ^

     Arguably, the vast majority of spiteful-looking acts we observe in humans are of this type. It is rare to see someone violate someone’s preference for the sake of it without caring about the reaction.
