How Technical AI Safety Researchers Can Help Implement Punitive Damages to Mitigate Catastrophic AI Risk

Gabriel Weil

S = N * (C_P / C_T) * (E_P / E_A)

N is the total expected practically non-compensable harm arising from the defendant’s conduct, represented by the blue-shaded area in the graph below. C_P is the plaintiff’s compensatory damages: the harm actually suffered by the plaintiff due to the defendant’s actions. Dividing this by C_T, the total expected practically compensable harm caused by the defendant’s actions (the area shaded in red below), gives the plaintiff’s share of expected compensatory damages.

E_P is the elasticity of uninsurable risk with respect to the plaintiff’s injury. If precautionary measures that cut the plaintiff’s expected harm in half also cut the uninsurable risk in half, then the elasticity is equal to 1. If they only reduce the uninsurable risk by 10%, then the elasticity is 0.2. E_A is the average elasticity of the uninsurable risk with respect to all practically compensable harms. So, E_P/E_A is the relative elasticity of uninsurable risk with respect to the plaintiff’s injury. Putting it together, the plaintiff’s share of punitive damages, S, is equal to the total expected practically non-compensable harm, N, times the plaintiff’s share of expected compensatory damages, C_P/C_T, times the relative elasticity of the uninsurable risk with respect to the plaintiff’s injury, E_P/E_A.
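To make the arithmetic concrete, here is a minimal Python sketch of the formula as described above. The function and parameter names are my own illustrative choices, and the dollar figures in the usage note are hypothetical. It assumes the elasticity is simply the fractional reduction in uninsurable risk divided by the fractional reduction in the plaintiff's expected harm, which matches the two examples given above.

```python
def elasticity(risk_reduction: float, harm_reduction: float) -> float:
    """Fractional cut in uninsurable risk per fractional cut in expected harm.

    Assumption: a simple ratio of the two fractional reductions,
    e.g. a 10% risk cut from a 50% harm cut gives 0.1 / 0.5 = 0.2.
    """
    if harm_reduction <= 0:
        raise ValueError("harm_reduction must be positive")
    return risk_reduction / harm_reduction


def punitive_share(n: float, c_p: float, c_t: float, e_p: float, e_a: float) -> float:
    """Plaintiff's punitive damages: S = N * (C_P / C_T) * (E_P / E_A)."""
    if c_t <= 0 or e_a <= 0:
        raise ValueError("c_t and e_a must be positive")
    return n * (c_p / c_t) * (e_p / e_a)
```

For example, with hypothetical values N = 1000, C_P = 10, C_T = 100, and equal elasticities, `punitive_share(1000.0, 10.0, 100.0, 1.0, 1.0)` gives the plaintiff one tenth of N.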

If juries are able to approximately implement this formula, then reductions in the practically compensable harm, C_T, would only be rewarded to the extent that they reduce the total expected harm, N + C_T. Measures that reduce C_T without reducing N would lower the defendant’s expected compensatory damages, but would not reduce their expected punitive damages payout unless they eliminate practically compensable damages entirely. By contrast, measures that reduce N would be rewarded even if they do not reduce C_T.
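The incentive claim can be checked with a small sketch. It rests on two assumptions that are mine rather than the paper's: that the plaintiffs' compensatory damages sum to C_T, and that E_A is the harm-weighted average of the per-plaintiff elasticities. All names and numbers are hypothetical.

```python
def total_punitive_payout(n: float, harms: list[float], elasticities: list[float]) -> float:
    """Sum S = N * (C_P / C_T) * (E_P / E_A) over all plaintiffs.

    Assumptions (not from the paper): C_T is the sum of the plaintiffs'
    harms, and E_A is the harm-weighted average of their elasticities.
    """
    if len(harms) != len(elasticities) or not harms:
        raise ValueError("need one elasticity per plaintiff")
    c_t = sum(harms)
    e_a = sum(c * e for c, e in zip(harms, elasticities)) / c_t
    return sum(n * (c / c_t) * (e / e_a) for c, e in zip(harms, elasticities))


# Halving every plaintiff's compensable harm leaves the punitive total at N.
before = total_punitive_payout(1000.0, [10.0, 30.0, 60.0], [1.0, 0.5, 0.2])
after = total_punitive_payout(1000.0, [5.0, 15.0, 30.0], [1.0, 0.5, 0.2])
```

Under these assumptions the punitive total equals N however large or small C_T is, so only measures that lower N lower the expected punitive payout.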

Of course, in order to implement this formula, we will need credible estimates of the various parameters. Juries routinely estimate plaintiffs’ compensatory damages in tort cases, so that doesn’t present any novel issues, and estimating the total expected compensatory damages should also not be too difficult. Presumably, once a system has revealed its misalignment in a non-catastrophic way, it will be pulled off the market, and the risks associated with any modified version of the model would be addressed separately. So this is really just a matter of estimating how much legally compensable harm the system has actually done.

Estimating the total expected practically non-compensable harm, N, and the elasticity parameters, E_P and E_A, presents greater challenges. In the paper, I suggest that model evaluations could be used to appraise specific potential causal pathways that could produce an uninsurable catastrophe.

“For each pathway, one set of evaluations could estimate the probability that the system is capable of instantiating that pathway. Another set of evaluations could estimate the conditional probability that the system would take that pathway, should it be capable of doing so. The sum of the expected harm across these catastrophic misalignment and misuse scenarios would then represent a lower bound estimate of N, the uninsurable risks generated by training and deploying the system, since other scenarios not included in the evaluation might contribute to the total uninsurable risk. Beyond specific scenario analysis, other relevant indicators of uninsurable risk include the model’s power-seeking tendencies, inclinations to engage in deception and tool use, pursuit of long-term goals, resistance to being shut down, tendency to collude with other advanced AI systems, breadth of capabilities, capacity for self-modification, and degree of alignment.

The knowledge that a practically compensable harm has happened could help model evaluators like those at Model Evaluation & Threat Research (formerly part of the Alignment Research Center) and Apollo Research select catastrophic risk pathways to analyze, but their estimates of the probability of those pathways should not update based on the knowledge that the practically compensable harm event happened, since that knowledge was not available at the time the decisions to train and deploy the model were made. The fact that a practically compensable harm has occurred simultaneously raises the probability that the system was significantly misaligned or vulnerable to misuse and lowers the probability that catastrophic harm will arise from this specific system since it is now likely to be recalled and retrained. N is the expected uninsurable harm at the time that the key tortious act (training or deployment) occurred. That is, N represents what a reasonable person, with access to the information that the defendant had or reasonably should have had at the time of the tortious conduct, would have estimated to be the expected uninsurable harm arising from their conduct. There may not be one uniquely correct value of this uninsurable risk, but juries, relying on expert testimony, should nonetheless be able to select an estimate within the range of estimates that a reasonable person should have arrived at.”
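The pathway-based lower bound described in the quoted passage can be sketched as follows. This is only an illustration: the `Pathway` fields, the function name, and the assumption that each pathway comes with a single harm estimate are mine, and the sketch assumes the listed pathways do not double count the same harm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """One evaluated catastrophic pathway (all fields are hypothetical inputs)."""
    name: str
    p_capable: float              # P(system can instantiate the pathway)
    p_takes_given_capable: float  # P(system takes the pathway | capable)
    harm: float                   # estimated harm if the pathway is realized


def lower_bound_n(pathways: list[Pathway]) -> float:
    """Sum expected harm over evaluated pathways.

    This is a lower bound on N because unevaluated scenarios may add
    further uninsurable risk.
    """
    total = 0.0
    for p in pathways:
        if not (0.0 <= p.p_capable <= 1.0 and 0.0 <= p.p_takes_given_capable <= 1.0):
            raise ValueError(f"probabilities for {p.name!r} must lie in [0, 1]")
        total += p.p_capable * p.p_takes_given_capable * p.harm
    return total
```

The probabilities fed into such a calculation should be the ex ante estimates, that is, what could reasonably have been estimated at the time of training or deployment, as the quoted passage emphasizes.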

More work is needed to implement these suggestions, and I invite members of this community to take it up. I’m happy to talk with anyone interested in working on this, and may even be able to help you secure funding. Feel free to contact me at gweil2 at tourolaw.edu. The full draft paper is available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006.
