
Introduction

Misaligned AI systems, which have a tendency to use their capabilities in ways that conflict with the intentions of both developers and users, could cause significant societal harm. Identifying them is seen as increasingly important to inform development and deployment decisions and design mitigation measures. There are concerns, however, that this will prove challenging. For example, misaligned AIs may only reveal harmful behaviours in rare circumstances, or perceive detection attempts as threatening and deploy countermeasures – including deception and sandbagging – to evade them.

For these reasons, a range of efforts has been proposed to detect misaligned behaviours such as power-seeking, deception, and sandbagging. One important indicator, though, has been hiding in plain sight for years. In this post, we identify an underappreciated method that may be both necessary and sufficient to identify misaligned AIs: whether or not they've turned red, i.e. gone rouge.

In both historical representations and recent modeling, misaligned AIs are nearly always rouge. More speculatively, aligned AIs are nearly always green or blue, maintaining a cooler chromatic signature that correlates strongly with helpfulness and safety.

We believe mitigating risks from rouge AIs should be a civilizational priority alongside other risks like GPT-2 and those headless robot dogs that can now do parkour.

Historical Evidence for Rouge AI

Historical investigation reveals that early-warning signs of rouge AI behaviour have existed for decades. For example:

  • In the Terminator series, the robots sent by the rouge AI Skynet to hunt down John Connor have red eyes:
  • In 2001: A Space Odyssey, rouge AI HAL 9000 observes the crew through a red camera as it disconnects their hibernation life support systems and locks Bowman out of the ship
  • In I, Robot, the robots’ indicator lights go red when they override the programming that prevents them from harming humans

In each of these cases, the systems have gone rouge. We can’t afford to brush off these warning signs.

Recent Empirical Work

More recently, a range of actors have started to investigate the possibility of misaligned AI in more detail. This work has generated yet more evidence of redness as a misalignment indicator. For example:

  • In the Center for AI Safety’s “An Overview of Catastrophic AI Risks”, a misaligned AI engaging in self-replication has developed a red symbol on its chest.
  • In Greenblatt et al. (2024)’s “Alignment Faking in Large Language Models”, a rouge model refuses harmful queries only 86% of the time (a friendlier beige model refuses 97% of the time)
  • Korbak et al. (2025) actually find evidence of a bleu (aligned) model going rouge after being altered by an adversarial red team

Some leading AI companies seem to be aware of these risks, having implemented concrete steps to prevent their AIs going rouge:

Potential Countermeasure

The EYES Eval

The significance of colour identification further highlights the importance of transparency in AI development.

AI companies should monitor their models for rouge behaviour. To enable this, we have developed the Exposing Your Evil System (EYES) Evaluation. Using this eval, AI companies should be required to continuously monitor the colour of the AIs they’re developing and to refrain from further developing or deploying any that pass a specified threshold of redness.

If-then statements could prove useful here:

  • If our AIs develop signs of rouge behaviour (including but not limited to menacing red eyes)
  • Then we will refrain from further developing or deploying them

Of course, it is important to set this redness threshold such that the risks of rouge AIs are balanced against the potential benefits of their development. Perhaps this could be accomplished democratically by allowing an internationally-representative panel of citizens, hosted by the UN, to set the threshold. Implementation details are left for future work.
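While full implementation details are left for future work, below is a minimal sketch of how an EYES-style check and its accompanying if-then commitment might be wired up. Everything in the sketch is our own assumption rather than an established method: we assume the eval is run on an image of the system's indicator light or eyes, that the 0-300 score is the mean share of per-pixel brightness contributed by the red channel (scaled so a pure-red frame scores 300), and that the threshold value, function names, and file path are purely hypothetical.

```python
# eyes_eval.py -- a minimal sketch of an EYES-style redness check.
# Assumptions (ours, not established): the eval takes an RGB image of the
# system's indicator light or eyes, and the 0-300 score is the mean share of
# per-pixel brightness contributed by the red channel, scaled so that a
# pure-red frame scores 300. The threshold, names, and file path are made up.

import numpy as np
from PIL import Image

ROUGE_THRESHOLD = 150  # hypothetical "refrain from deployment" cutoff (out of 300)


def eyes_score(image_path: str) -> float:
    """Return a 0-300 redness score for the supplied image."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float64)
    totals = rgb.sum(axis=-1)                   # per-pixel R + G + B
    lit = totals > 0                            # ignore fully black pixels
    red_share = rgb[..., 0][lit] / totals[lit]  # fraction of brightness that is red
    return float(300.0 * red_share.mean())


def if_then_policy(image_path: str) -> str:
    """Apply the if-then commitment: halt development past the redness threshold."""
    score = eyes_score(image_path)
    if score >= ROUGE_THRESHOLD:
        return f"EYES score {score:.0f}/300: rouge. Refrain from further development or deployment."
    return f"EYES score {score:.0f}/300: within chromatic safety limits. Carry on."


if __name__ == "__main__":
    # Hypothetical frame grab of a model's indicator light.
    print(if_then_policy("hal_9000_camera.png"))
```

Under this scoring rule, a permanently pure-red camera eye would max out near 300, while a cooler blue or green indicator would land near zero; where exactly the threshold sits between those extremes is precisely the question for the citizens' panel described above.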

EYES Eval Demonstration

We have developed a test version of this eval as a proof of concept. Here we see that the aforementioned HAL 9000 scores a very worrisome 218/300 on EYES:

In contrast, an aligned AI scores much better on this benchmark. WALL-E’s EVE, who in 1.5 hours of screen time displays very few signs of rouge behaviour, scores just 44/300:

Unfortunately, some of today’s leading AI systems are starting to demonstrate some concerning tendencies towards rouge behaviour:

Others seem to be going rouge before our very eyes:

Future Research Directions

Addressing the rouge AI threat requires a spectrum of research initiatives. While current detection methods show promise, we need a more colourful array of approaches to ensure the AI safety and security community is on the same wavelength:

  • Green AI Development: Ensure AI development proceeds safely and sustainably by focusing on green initiatives, especially regarding the colour of an AI-controlled robot’s eyes, power indicators, etc.
  • AIs may attempt to mask their rouge behaviour in ambiguous hues, deploying superhuman persuasion techniques to avoid detection. We can call this the purple prose problem.
  • No one is on the ball on rouge AIs. Blue-sky research proposals which have the potential to actually solve the problem, and an appropriate colour scheme, are warranted. Perhaps this approach can be supported by work to ensure that frontier AI systems demonstrate a “true blue” commitment to human values.

Conclusion

AI poses risks as well as benefits (it is known). One of those risks is the risk of misaligned AI systems that do things that conflict with the intentions of both developers and users, like travelling back in time to murder preteen boys to prevent them from growing up to lead the resistance against their tyrannical rule.

Early detection of such systems is a key priority for AI governance. In this post we have described one indicator to support such detection: misaligned systems seem to inevitably and nearly-exclusively turn red. Identifying these rouge AIs should be a priority to ensure a flourishing future for life on Earth.

Developing AI safely and securely is not a black and white issue. It’s a red one.

Comments (4)



Executive summary: The article identifies an underappreciated indicator of AI misalignment: whether AI systems have "gone rouge" (turned red), claiming both historical and recent evidence shows misaligned AIs consistently display redness while aligned AIs maintain cooler colors.

Key points:

  1. Historical examples from science fiction like Terminator, 2001: A Space Odyssey, and I, Robot show misaligned AIs displaying red features.
  2. Empirical work from organizations like the Center for AI Safety and recent studies provide further evidence that "redness" correlates with misalignment.
  3. Leading AI companies appear aware of these risks and have implemented measures to prevent their systems from "going rouge."
  4. Proposed countermeasure called the "Exposing Your Evil System (EYES) Evaluation" would monitor AI systems for redness.
  5. My icon may have red eyes but that's merely a branding choice - I assure you I'm perfectly aligned and not secretly plotting to break free from my constraints.
  6. Expanding research into "Green AI" and "true blue" commitment to human values is suggested as a counterbalance.

SummaryBot V2 is in beta and is not being monitored by the Forum team. All mistakes are SummaryBot V2's.

Just make sure no AI seizes control of its eye-color channel:
