[Epistemic Status: Speculative but plausible, consistent with my personal experience, and very important if true. Specific pharmacology statistics and methods are for illustration only. The core argument about measurement scales doesn’t hinge on them.]
The puzzle is simple. Clinical trials for psychiatric drugs show modest improvements: effect sizes typically cluster in the small-to-moderate range depending on the condition and measurement. SSRIs for depression, antipsychotics for psychosis, benzodiazepines for panic: all follow this pattern. But in practice, people taking the same drugs report everything from clear benefit to total neutrality to severe deterioration. Ask someone with panic disorder about benzodiazepines and they’ll tell you either “it works wonders” or “it made everything worse”—nobody says with a straight face that they experienced a “small-to-moderate improvement”.
The standard explanation treats the extremely negative responses as “outliers” or “side-effects”. Squint and you can sort of see two overlapping Gaussians: most people get better, a few unlucky ones get side-effects. Either you “got better in expectation” (and were merely unlucky if you got worse in practice), or you were among the rare few who drew the “side-effects”.
But what if there is a much more elegant description of what is going on? The distribution doesn’t look like two Gaussians even once you remove the most severe cases. It stays skewed, heavy-tailed in both directions. The wide range of responses isn’t noise around a true average but the very thing we need to explain and account for if we want to make informed decisions.
Let’s start simple, with a concrete example:
Imagine two people in a drug trial who start at the same baseline: 3/10 sadness and 3/10 sense of inner restlessness (akathisia). During the trial, the first person’s sadness moved from 3/10 to 6/10 and inner restlessness moved from 3 to 6 (also on a 0 to 10 scale). Now compare that with the other person, whose sadness stayed at 3 while their akathisia skyrocketed from 3 to 9. On a psychiatric evaluation form, both patterns might add up to the same “total change in symptom scores”. But as far as phenomenology goes, the case where akathisia shoots up to 9/10 is overwhelmingly worse. Anyone who has been near that state knows that a “9” on an akathisia scale is not three times worse than a “3.” It is another category of sensation altogether; indeed, one on another level of moral significance.
Some of us working in this space—people like Chris Percy, Alfredo Parra, and myself (see: 1, 2, 3, 4)—have pointed to a pattern that standard psychiatric measurements seemingly miss. Symptoms have long-tailed distributions at the level of actual intensity. When someone reports their akathisia as a “9,” that likely reflects being in a genuinely steep part of the distribution—a “9” is not simply three times as intense as what a “3” feels like. The problem emerges when trials collect these reports and add them arithmetically. But actual suffering seems to add up differently—not through simple addition of the scores, but through something closer to exponential weighting of the underlying intensities, and only then summation[1]. To a first approximation, a person’s experienced valence might be described as coming from summing the weighted contribution of each symptom, where the weights themselves depend on reported intensity level. When you account for this structure before adding, you get a different picture than when you add the reported scores directly.
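To make the aggregation point concrete, here is a minimal sketch of the two ways of adding up symptom reports. The exponential mapping with base 2 is purely illustrative (the post commits to no particular base or functional form); the point is only that two symptom profiles identical under linear summation can diverge sharply once reports are mapped back to something like underlying intensity before summing:

```python
# Hypothetical mapping from a 0-10 report to underlying intensity.
# Base 2 is illustrative only; the argument does not depend on the exact value.
def intensity(report, base=2.0):
    return base ** report

def linear_total(reports):
    # What a symptom form effectively does: add the raw scores.
    return sum(reports)

def weighted_total(reports, base=2.0):
    # Map each report to an (assumed) underlying intensity, then sum.
    return sum(intensity(r, base) for r in reports)

# The two trial participants from the example above.
patient_a = [6, 6]   # sadness 6, akathisia 6
patient_b = [3, 9]   # sadness 3, akathisia 9

print(linear_total(patient_a), linear_total(patient_b))      # 12 12  (identical on the form)
print(weighted_total(patient_a), weighted_total(patient_b))  # 128.0 520.0  (very different)
```

Under this (assumed) weighting, patient B’s single “9” dominates the total, which matches the phenomenological claim that a 9/10 akathisia is categorically worse than two 6/10 symptoms.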
Mixed valence complicates the picture. Say an antipsychotic drug reduces delusions from a 7/10 to a 5/10—a clear improvement on a steep region of the distribution of subjective discomfort. But to achieve this effect, it also raises akathisia from a 6/10 to an 8/10 as a side-effect. On a symptom scale, these changes might look like they roughly cancel out: you’ve gained 2 points on one domain and lost 2 on another. But the person’s actual experience isn’t well described by this simple arithmetic. Delusions at a 5 are genuinely better than delusions at a 7, but not by some fixed amount: the improvement sits on a steep part of the distribution and is much larger than the numbers suggest. Akathisia at an 8 is worse than at a 6, and that cost sits on a yet steeper part of the distribution. The underlying intensities don’t offset the way the numbers imply: the actual suffering from akathisia going from 6 to 8 may well exceed the actual relief from delusions going from 7 to 5 by a wide margin. For those to whom this happens, the drug is a net worsening, even though the trial might report such cases as “net neutral”. A logarithmic scale is being confused for a linear one, and as a consequence the side-effects are drastically minimized in the studies.
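The same arithmetic can be run on this mixed-valence case. Again, the exponential mapping with base 2 is a hypothetical stand-in for whatever the true report-to-intensity curve is; the sketch only shows that two changes that cancel linearly need not cancel in intensity terms:

```python
def intensity(report, base=2.0):
    # Hypothetical exponential report-to-intensity mapping; base is illustrative.
    return base ** report

# Antipsychotic example: delusions improve 7 -> 5, akathisia worsens 6 -> 8.
linear_delta = (5 - 7) + (8 - 6)        # 0: "net neutral" on the symptom sheet

relief = intensity(7) - intensity(5)    # 128 - 32  = 96 units of intensity removed
cost   = intensity(8) - intensity(6)    # 256 - 64  = 192 units of intensity added

print(linear_delta)     # 0
print(cost - relief)    # 96.0: a clear net worsening once intensities are compared
</wbr>```

Because the akathisia change happens higher on the (assumed) curve, the cost is double the relief even though the point deltas are equal and opposite.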
This repeats across psychiatric medications. Some people on a given drug experience genuine improvement. Others experience net worsening. Many sit in between. Trials aggregate across all these outcomes and produce an average that obscures both the clear beneficiaries and the clear sufferers. A 0.3 standard deviation improvement can emerge from a population where a substantial minority got substantially worse while others got modestly better. The net global valence: down the drain.
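The claim that a roughly 0.3 SD average improvement can coexist with a substantial worse-off minority is easy to check by simulation. The mixture parameters below (80% modest improvers, 20% sharp deteriorators) are hypothetical, chosen only to show that such a population produces exactly the kind of trial headline the post describes:

```python
import random
import statistics

random.seed(0)

# Hypothetical mixture of individual responses to a drug (positive = improvement):
# 80% improve modestly, 20% deteriorate substantially.
changes = []
for _ in range(10_000):
    if random.random() < 0.8:
        changes.append(random.gauss(1.0, 1.0))    # modest improvement
    else:
        changes.append(random.gauss(-1.5, 1.5))   # substantial worsening

mean = statistics.mean(changes)
sd = statistics.stdev(changes)
frac_worse = sum(c < 0 for c in changes) / len(changes)

print(f"standardized effect size ~ {mean / sd:.2f} SD")  # lands near 0.3
print(f"fraction who got worse:   {frac_worse:.2f}")     # roughly 3 in 10
```

The trial reports a tidy “small-to-moderate” average effect while close to a third of the simulated population ends up below their baseline.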
SSRIs often reduce rumination, mood instability, and behavioral volatility—important changes that often sit in shallow parts of the distribution yet show up clearly on symptom forms. At the same time, SSRIs also raise activation, nervous energy, autonomic instability, sexual frustration, sleep fragmentation, nausea, and in some users a restlessness that borders on akathisia. These might appear as one- or two-point increases on a scale that tracks symptoms. But a one-point increase in akathisia from baseline 7 sits on a much steeper part of the curve than the same increase from baseline 2. Trials treat these deltas as morally equivalent. Users experiencing them from elevated baselines, however, would describe them as central to their deteriorating condition.
Antipsychotics suppress delusions, racing thoughts, manic pressure—states that typically sit at the very high end of the negative tail, so reducing them matters enormously. Trials capture this. The same medications produce akathisia, inner motor tension, and affective flattening. Reported severity is often mild. The underlying intensities, however, can occupy steep regions of their distributions. Antipsychotic-induced akathisia can barely register on standard scales despite being a very high-intensity state for those who start at a high baseline (e.g. due to low dopaminergic tone) or who respond poorly to the drug.
Benzodiazepines make these trade-offs most transparent. Acute use suppresses panic, autonomic arousal, early akathisia, sensory overwhelm—all typically in the steep regions of the valence distribution. Relief is immediate. However, frequent (“as prescribed”) use can cause severe rebounds, and here the overall picture becomes rather grim for many. Multiple symptom spikes that happen at once in the steep region of the valence scale can snowball into “benzo hell”. In the aggregate, these symptoms might be recorded as minor shifts across multiple items. But the person who experiences them as multiple intense sensations returning at once will tell a different story. When several long-tailed symptoms rebound in concert, the effect compounds in ways an arithmetic mean can’t possibly capture.
The mismatch follows directly from how trials measure and aggregate. Psychiatric tools collect compressed reports. Trials add and average them. People live inside the full intensity structure—correlated, exponential, and with complex interactions between symptoms. When a drug improves shallow domains while perturbing steep ones, the average score often moves upward while a meaningful subset of patients experiences clear net worsening. This isn’t deception or incompetence—the measurement scale and lived experience run on different geometries.
To sum up. Modest average improvements in trials coexist with large individual harms in practice for straightforward reasons: psychiatric symptoms are long-tailed at the level of actual intensity, these tails often cluster, compressed scores systematically underrepresent the steepest domains, and states like akathisia consume enormous experiential bandwidth while barely registering in the arithmetic mean that drives clinical conclusions.
What changes if we take this seriously? Map individual response patterns separately instead of averaging into groups. Track steep regions of the distribution as a strong signal rather than business as usual. Use criticality and complex systems modeling tools to deal with symptom interactions. With these changes the same drugs would look very different on paper. I am not making a call to abandon psychiatric medication. This is a call to see it more clearly. To build evaluation around the actual geometry of subjective experience rather than the convenience of linear aggregation. Because linear aggregation is fatally misguided.
The data is already there. The variation is already visible to clinicians. The question is whether we organize our measurement to capture it and finally take phenomenology seriously.
[1] Plus some interaction terms between the symptoms, but we’ll leave a deep discussion on that topic for another day.

I think I agree with the central theses here, as I read them: indeed, ideally we would (1) measure what happens to people individually, rather than on average, due to taking psychiatric drugs, and (2) measure an outcome that reflects people's aggregate preference for their experience of life with the drug versus the counterfactual experience of life without the drug.
However, I think these problems are harder to resolve than the post suggests. Neither can be measured directly (outside circumscribed / assumption-laden situations) due to the fundamental problem of causal inference, which is not resolved by people's self-reported estimates of individual causal effects. There are better approaches to consider than comparing averages, but, in my opinion, this is the default for practical causal inference reasons, rather than a failure to take phenomenology seriously.
I agree that (2) is more tractable; however, these improvements are non-trivial to implement. Continuing your example, if we reanalyze a trial to focus on patients with high baseline akathisia, who may be most affected by either a benefit or a harm, we have far fewer patients to analyze. What was once an adequately powered trial to detect a moderate effect in the full sample is now under-powered. The same issue arises when analyzing complex interactions: precisely estimating interaction effects generally requires far larger sample sizes than estimating main effects. So a trial designed to measure a main effect of a drug is unlikely to be sufficiently powered to estimate several interaction effects.
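The power concern can be made quantitative with the standard normal-approximation sample-size rule for a two-arm comparison. This sketch uses textbook values (two-sided alpha = 0.05, power = 0.80); the factor-of-four penalty for an interaction contrast follows because a difference of two subgroup effects, with each arm split in half, has four times the variance of a main effect:

```python
# Rule-of-thumb sample size per arm to detect a standardized effect d
# with two-sided alpha = 0.05 and power = 0.80 (normal approximation).
Z_ALPHA = 1.96  # z for two-sided alpha = 0.05
Z_BETA = 0.84   # z for power = 0.80

def n_per_arm(d):
    return 2 * (Z_ALPHA + Z_BETA) ** 2 / d ** 2

d = 0.3  # a typical "modest" psychiatric trial effect size

print(round(n_per_arm(d)))      # 174 per arm for the main effect
# An interaction contrast of the same magnitude, with each arm split
# into two subgroups, needs roughly 4x the sample for equal power.
print(round(4 * n_per_arm(d)))  # 697 per arm for the interaction
```

So a trial sized to detect a 0.3 SD main effect is, as the comment argues, nowhere near powered to estimate even one interaction of comparable size, let alone several.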
For either issue, the data is not already there in my view. That said, I may not be fully understanding what exactly you propose doing; are there examples of "[using] criticality and complex systems modeling tools to deal with symptom interactions" in a healthcare context that illustrate this sort of analysis?
Executive summary: The author argues in a speculative but plausible way that psychiatric drug trials obscure real harms and benefits because they use linear symptom scales that compress long-tailed subjective intensities, causing averages to hide large individual improvements and large individual deteriorations.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.
What is the practical effect of the changes you propose?
On the regulatory end, a drug that makes 50% of people better and 50% of people worse is probably a useful drug, conditioned on the worsening being reversible with discontinuation.[1] It isn't "no net effect." The high variance of the effects would, on balance, make me more likely to approve medications and to recommend trialing them. But the general tone of your post seems to focus more on harms, so I'm wondering if I am missing something here.
On the clinical end, I hope no one is saying things to patients like "oh, your delusions are 2 points better on a 10-point scale, but your akathisia is 2 points worse, so those cancel each other out mathematically." In the process of shared decision-making, the patient should be able to describe and balance the subjectively-experienced benefits and harms of the medication directly, without having to resort to 10-point scales to make decisions.
[1] I recognize the harms do not always resolve on discontinuation.
Thank you! This is a genuinely good question. (Note: I answered via voice and then edited the transcript below with Chat. I can circle back if the style is an issue, but this covers every point I discussed; if doing it this way is a problem for some reason, I'm happy to write anew. The content is correct.)
Your question surfaces the key misunderstanding. The claim isn’t that we should fear drugs that help some people and hurt others. It’s that our measurement architecture is set up in a way that systematically misclassifies who is helped, who is harmed, and by how much, because the scales themselves flatten the underlying geometry of experience. Once you compress long-tailed intensities into a 1–10 box and then average them, you lose the structure that actually matters for real-world well-being.
In a world where symptoms behave linearly and add up nicely, a drug that helps half and hurts half is perfectly intelligible: you imagine two overlapping Gaussians, shrug, and say “worth a try.” But that isn’t the world we actually inhabit. If rumination goes from 6 to 4, the subjective win might be modest because you’re moving along a shallow part of the curve. If akathisia goes from 6 to 8, the subjective loss might be massive because you’ve crossed into a steep tail where each step carries exponential experiential weight. On the form, these are both “two-point changes.” In lived reality, they belong to different moral universes. This asymmetry in the tails means that “50% better, 50% worse” is not a neutral mixture; the average hides the fact that the extremes on one side can dominate the arithmetic.
I don't think it is abstract or merely theoretical, or too complex to do anything about. It has immediate practical consequences. Trials and regulators work with compressed reports, so the deepest harms appear as mild perturbations in the dataset. Drugs whose side-effect profiles involve steep-tail states like akathisia or mixed autonomic rebound look safer than they really are for a meaningful minority of users. Clinicians then inherit an evidence base where the worst experiential states have been squashed into “mild adverse events,” and that shapes expectations, heuristics, and prescribing norms. The problem is not clinician negligence — it’s that the underlying data they rely on has already thrown away the signal.
If we took the geometry seriously, we’d end up with a very different picture. High-variance drugs can be extraordinarily useful when we know how to identify responders and anti-responders. What we’re missing is the mapping. With better instruments, you’d get early detection of bad trajectories, N-of-1 response curves, and a more honest sense of which symptom profiles are compatible with which medications. The same drug could be life-changing for one subgroup and acutely harmful for another, and we could actually see that, instead of blending the two together into a 0.3σ effect size. This is less “anti-medication” and more “finally doing the epistemology correctly.”
Good clinical practice already tries to rely on patient narratives, but even that is downstream of the larger culture of interpretation we’ve built on top of flattened scales. When the scientific literature underweights the steepest affective states, everyone downstream learns to underweight them too. The patient who says “this made my inner restlessness unbearable” is intuitively competing with a literature that reports only “mild activation” for the same phenomenon. Countless victims of this dynamic could be named and documented.
The upshot is simple: this is not pessimism toward psychiatric meds. The core point is about epistemic clarity. The experiential landscape is long-tailed, clustered, and nonlinear; our measurement system is linear, additive, and tidy. When you force one onto the other, you get averages that obscure the very variation we need to guide good decisions. A better measurement pipeline wouldn’t make us more cautious or more reckless; it would make us more accurate. And accuracy is the only way to use high-variance interventions wisely — whether you’re trying to help one patient or setting policy for millions.
If the world ran fully on the arithmetic of symptom sheets, none of this would matter. But the world runs on compounding long-tail distributions of suffering and relief, and that geometry is strange, heavy-tailed, and morally lopsided. Our tools need to catch up.