This essay was originally written to supply a writing sample for a fellowship application. It was drafted without LLM assistance. Views represented here are the author's own.

Introduction

In order for AI safety to become a "systemic science" (Anthropic 2025, OpenAI 2025), the field must move beyond in silico evaluations of AI capabilities alone and examine the effect of AI access on human performance. Such research falls under the category of human uplift studies: research that seeks to understand the causal relationship between AI use and human performance across a variety of real-world domains. Human uplift studies pose unique methodological challenges that set them apart from other AI capability evaluations. Unlike benchmarks, where the experimental unit is a single, fixed AI model evaluated under tightly controlled conditions, in a human uplift study the experimental unit is a person. Human beings are highly heterogeneous, differing in baseline skill, motivation, knowledge, and prior AI familiarity (i.e. skill at "elicitation"). Furthermore, any attempt to evaluate performance in a real-world domain necessarily requires tasks that vary in complexity and kind. This combination of human and task heterogeneity yields a landscape of unknown covariates, varying by unknown amounts, any of which can confound the causal relation under investigation (Dean et al., 2017).

In light of these challenges, randomized controlled trials (RCTs) are uniquely valuable for human uplift studies. RCTs are the only experimental design that can unambiguously attribute changes in human performance to an AI intervention under field conditions. Although observational designs (such as prospective cohort studies) can provide descriptive data or identify correlations, they cannot control for all the confounding variables a human uplift study must contend with (Browner et al., 2022, p. 4). In an RCT, randomization distributes covariates across arms so that, in expectation, confounding variables cancel out and prognostic variables are balanced. Blinding and prospective data collection minimize bias, while a control group drawn from the same sample population supplies a true counterfactual against which to test causality (Piantadosi, 2024, p. 21). Together, these features isolate the treatment effect of an AI intervention and solve the challenge of "unknown covariates of unknown magnitudes," making RCTs uniquely well-suited for testing hypotheses of human uplift.
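To make the balancing effect of randomization concrete, here is a minimal simulation (all names and numbers are invented for illustration). Each simulated participant carries a latent baseline-skill covariate that the experimenter never measures, yet simple random allocation alone brings the two arms into close balance on it:

```python
import random
import statistics

def randomize(participants, seed=0):
    """Randomly assign each participant to a 'treatment' or 'control' arm."""
    rng = random.Random(seed)
    arms = {"treatment": [], "control": []}
    shuffled = participants[:]
    rng.shuffle(shuffled)
    for i, p in enumerate(shuffled):
        arms["treatment" if i % 2 == 0 else "control"].append(p)
    return arms

# Hypothetical cohort: baseline skill varies widely (a latent covariate
# the experimenter never observes or matches on).
rng = random.Random(42)
cohort = [{"skill": rng.gauss(50, 15)} for _ in range(2000)]
arms = randomize(cohort)

t_mean = statistics.mean(p["skill"] for p in arms["treatment"])
c_mean = statistics.mean(p["skill"] for p in arms["control"])
# Arm means end up close despite no explicit matching on skill.
print(round(t_mean, 1), round(c_mean, 1))
```

The same logic extends to every unmeasured covariate at once, which is precisely what no amount of post-hoc statistical adjustment in an observational design can guarantee.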

Challenges of Designing Randomized Controlled Trials

However, the very features that give RCTs their causal power also complicate their design, conduct, and interpretation. The internal and external validity of an RCT is highly sensitive to both initial design decisions and post-hoc analytic choices. A poorly designed RCT does not merely yield poorly generalizable, high-variance data; it yields data with bias of unknown (and unknowable) magnitude and direction.

For example, any exclusion of participants once a study begins compromises the covariate balance achieved by randomization, introducing bias. Such exclusions also undermine the distributional assumptions that most parametric hypothesis tests rely upon in the first place (McCoy 2017). Attrition and loss to follow-up can erode statistical power or lead to over-optimistic estimates of treatment effect (Akl et al., 2012). Studies with multiple hypotheses must account for the corresponding rise in false discoveries, but only when those hypotheses are truly independent; otherwise, inappropriate "adjustment" may lead to overly conservative estimates, as was noted in a critique of a 2024 OpenAI study (Marcus 2024).
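On the multiple-hypothesis point, the Benjamini-Hochberg step-up procedure is one standard way to control the false discovery rate while avoiding the conservatism of a blanket Bonferroni correction (it assumes independent or positively dependent hypotheses). A minimal sketch, with invented p-values standing in for five uplift outcome measures:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of hypotheses rejected by the BH step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    # Step-up rule: reject ALL hypotheses ranked at or below k.
    return sorted(order[:k])

# Hypothetical p-values from five pre-registered outcome measures.
pvals = [0.001, 0.008, 0.039, 0.041, 0.30]
print(benjamini_hochberg(pvals))  # → [0, 1]
```

By contrast, a Bonferroni threshold of 0.05/5 = 0.01 would also reject only the first two hypotheses here, but its family-wise error control becomes increasingly punishing as the number of (possibly correlated) outcomes grows.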

If RCTs are to be used effectively for human uplift studies, then, we must observe standards both for their initial design and for their subsequent results reporting. Poor trial design yields low-signal data and wastes valuable time and resources, while poor results reporting makes it difficult for stakeholders and policy-makers to evaluate results and make evidence-based decisions. At present, the AI safety community has no consensus standards for RCTs used in human uplift studies. McCaslin et al. have done work on standards for human baselines and model reporting, but they do not focus on field experiments per se.

Inspirations from Medicine and Public Health

Hence, we may take inspiration from the field of public health research, where RCTs form the core of its own safety-critical "evals": clinical trials. RCTs are considered the "gold standard" for evidence-driven interventions and sit near the top of the hierarchy of evidence for public health and medicine (Wallace et al., 2022). Clinical trial reporting standards are rigorous, battle-tested, and developed through decades of consensus-building, representing best-in-class standards with near-universal adoption and use (Equator 2025).

As an emerging discipline, AI safety can reap substantial benefits by adapting applicable RCT standards from the clinical trials domain. Despite differences in subject matter, both domains must evaluate safety-critical technologies, and both produce data ultimately intended to help policy-makers make informed, evidence-driven decisions.

Beyond the conventions of public health and clinical research, I would like to acknowledge that other scientific disciplines have their own rich traditions of conducting RCTs and field experiments. Domains such as econometrics and social psychology have bodies of practice established through similarly battle-tested processes to address the unique challenges of their fields. While I can only write from my own experience in public health, I believe there is much to learn from these domains, and I would be grateful to hear from others in allied fields.

Clinical Trial Standards for Human Uplift Trials

The primary RCT standards are the SPIRIT and CONSORT 2025 statements, developed by the Equator Network (Equator, 2025). The Consolidated Standards of Reporting Trials (CONSORT) is a checklist comprising an "evidence-based, minimum set of recommendations for reporting the results of randomised trials."

CONSORT is specifically a reporting standard: it defines a standard way to report the results of a clinical trial, ensuring that readers, reviewers, and regulators have all the information necessary to critically appraise the data from a study. It aims to solve the problem of ambiguous or under-specified results reporting, which complicates the interpretability of a study's outcomes.

AI safety researchers seeking to publish the results of a human uplift RCT should ensure that their main results paper contains every item specified in the CONSORT checklist.

Similarly, the Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) is a reporting standard for clinical trial protocols. In medicine, entire trial protocols are often published for transparency and reproducibility, and SPIRIT specifies the minimum items necessary to facilitate "appraisal of trial validity, feasibility, and ethical rigour" (Chan et al. 2025).

Beyond reporting standards, which are concerned only with transparency, there are additional best practices that improve study design and rigour. As the use of RCTs in human uplift studies is still an emerging practice, it is difficult to make definitive recommendations, and standards from the clinical domain may not always apply to AI safety. Nonetheless, the following resources may serve as a good starting point for any human uplift trial:

The International Council for Harmonisation's ICH E9 guideline specifies standards for study statistics, including a standardized vocabulary and approach for analysis sets; researchers intending to run sensitivity analyses can benefit from this standardized approach (ICH 1998). Furthermore, the ICH E9 Addendum on Estimands provides a framework that allows a study's objectives and endpoints to be defined unambiguously (ICH 2019, Kahan et al. 2023). Estimands will prove especially useful for human uplift studies that pre-register their statistical analysis plans.
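As a sketch of how the estimand framework might be pinned down in a pre-registered analysis plan, the ICH E9(R1) addendum specifies five attributes: population, treatment, endpoint, strategy for intercurrent events, and population-level summary. All field values below are hypothetical, invented for a notional AI-coding-assistant uplift trial:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Estimand:
    """The five attributes of an estimand per the ICH E9(R1) addendum."""
    population: str                  # whom the effect is estimated for
    treatment: str                   # the intervention and its comparator
    endpoint: str                    # the variable measured on each participant
    intercurrent_event_strategy: str # how post-randomization events are handled
    summary_measure: str             # the population-level comparison

# Hypothetical primary estimand (all values illustrative, not prescriptive).
primary = Estimand(
    population="professional software developers with >= 2 years of experience",
    treatment="access to an AI coding assistant vs. no assistant",
    endpoint="time to complete the assigned issue, in minutes",
    intercurrent_event_strategy=(
        "treatment policy: dropouts and non-adherers analysed "
        "as randomized (ITT)"
    ),
    summary_measure="difference in mean completion time between arms",
)
print(primary.endpoint)
```

Writing the estimand down in a structured form like this, before data collection begins, is what makes the later statistical analysis plan unambiguous and auditable.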

Conclusion

In conclusion, robust and systematic study of human uplift effects is crucial for advancing AI safety as a mature scientific discipline, and randomized controlled trials stand out as uniquely powerful tools for this purpose. To achieve meaningful progress, however, RCTs must be designed appropriately and their results communicated unambiguously. Although the AI safety field currently lacks consensus-based design and reporting standards for RCTs, established frameworks from clinical trials (CONSORT, SPIRIT, and the ICH E9 guidelines) provide rigorous, battle-tested foundations that can and should inform practice. As future work, human-uplift-specific recommendations should be developed, tailored explicitly to the outcomes and unique domain considerations of AI safety. Until then, leveraging established best practices from clinical research remains a robust and pragmatic foundation.