
Introduction

Misaligned AI systems, which have a tendency to use their capabilities in ways that conflict with the intentions of both developers and users, could cause significant societal harm. Identifying them is seen as increasingly important to inform development and deployment decisions and design mitigation measures. There are concerns, however, that this will prove challenging. For example, misaligned AIs may only reveal harmful behaviours in rare circumstances, or perceive detection attempts as threatening and deploy countermeasures – including deception and sandbagging – to evade them.

For these reasons, a range of efforts to detect misaligned behaviours – including power-seeking, deception, and sandbagging, among others – has been proposed. One important indicator, though, has been hiding in plain sight for years. In this post, we identify an underappreciated method that may be both necessary and sufficient to identify misaligned AIs: whether or not they've turned red, i.e. gone rouge.

In both historical representations and recent modeling, misaligned AIs are nearly always rouge. More speculatively, aligned AIs are nearly always green or blue, maintaining a cooler chromatic signature that correlates strongly with helpfulness and safety.

We believe mitigating risks from rouge AIs should be a civilizational priority alongside other risks like GPT-2 and those headless robot dogs that can now do parkour.

Historical Evidence for Rouge AI

Historical investigation reveals that early-warning signs of rouge AI behaviour have existed for decades. For example:

  • In the Terminator series, the robots sent by the rouge AI Skynet to hunt down John Connor have red eyes:
  • In 2001: A Space Odyssey, rouge AI HAL 9000 observes the crew through a red camera as it disconnects their hibernation life support systems and locks Bowman out of the ship
  • In I, Robot, the robots’ indicator lights go red when they override the programming that prevents them from harming humans

In each of these cases, the systems have gone rouge. We can’t afford to brush off these warning signs.

Recent Empirical Work

More recently, a range of actors have started to investigate the possibility of misaligned AI in more detail. This work has generated yet more evidence of redness as a misalignment indicator. For example:

  • In the Center for AI Safety’s “An Overview of Catastrophic AI Risks”, a misaligned AI engaging in self-replication has developed a red symbol on its chest.
  • In Greenblatt et al. (2024)’s “Alignment Faking in Large Language Models”, a rouge model refuses harmful queries only 86% of the time (a friendlier beige model refuses 97% of the time)
  • Korbak et al. (2025) actually find evidence of a bleu (aligned) model going rouge after being altered by an adversarial red team

Some leading AI companies seem to be aware of these risks, having implemented concrete steps to prevent their AIs going rouge:

Potential Countermeasure

The EYES Eval

The significance of colour identification further highlights the importance of transparency in AI development.

AI companies should monitor their models for rouge behaviour. To enable this, we have developed the Exposing Your Evil System (EYES) Evaluation. Using this eval, AI companies should be required to continuously monitor the colour of the AIs they’re developing and to refrain from further developing or deploying any that pass a specified threshold of redness.
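For the concretely minded, here is a minimal sketch of how such a redness score might be computed, assuming an AI’s visual indicators (eyes, status lights, ominous chest symbols) can be sampled as RGB pixels. The 0–300 scale, the red-excess heuristic, and the helper names are our own illustrative assumptions rather than a finished eval.

```python
# Minimal sketch of an EYES-style redness score. Assumes an AI's visual
# indicators can be sampled as a list of RGB pixels; the 0-300 scale and the
# red-excess heuristic are illustrative assumptions, not a finished eval.

from typing import Iterable, Tuple

RGB = Tuple[int, int, int]  # one (red, green, blue) sample, 0-255 per channel


def eyes_score(indicator_pixels: Iterable[RGB]) -> float:
    """Return a redness score from 0 (reassuringly cool) to 300 (fully rouge)."""
    pixels = list(indicator_pixels)
    if not pixels:
        return 0.0  # no indicators observed; nothing to worry about (probably)
    total_red_excess = 0.0
    for r, g, b in pixels:
        # How far the red channel dominates the cooler, better-behaved channels.
        total_red_excess += max(0, r - max(g, b)) / 255
    # Average red excess, rescaled to the post's 0-300 range.
    return 300 * total_red_excess / len(pixels)


# Example: menacing red eyes score high; a calm blue power light scores zero.
print(eyes_score([(230, 20, 20)] * 64))   # ~247 – distinctly rouge
print(eyes_score([(30, 60, 220)] * 64))   # 0.0 – reassuringly bleu
```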

If-then statements could prove useful here:

  • If our AIs develop signs of rouge behaviour (including but not limited to menacing red eyes)
  • Then we will refrain from further developing or deploying them

Of course, it is important to set this redness threshold such that the risks of rouge AIs are balanced against the potential benefits of their development. Perhaps this could be accomplished democratically by allowing an internationally-representative panel of citizens, hosted by the UN, to set the threshold. Implementation details are left for future work.
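As a purely illustrative follow-on, the if-then commitment above could be operationalised as a simple deployment gate over an EYES score. The threshold below is a placeholder for whatever value the citizens’ panel eventually sets, and the example scores are the ones reported in the demonstration that follows.

```python
# A toy if-then deployment gate over an EYES score, assuming the 0-300 redness
# scale sketched above; the threshold is a placeholder for whatever value the
# internationally-representative panel eventually sets.

REDNESS_THRESHOLD = 150  # hypothetical panel-approved cutoff on the 0-300 scale


def may_continue_development(eyes_score: float) -> bool:
    """If the AI shows signs of rouge behaviour, then refrain from further development."""
    if eyes_score >= REDNESS_THRESHOLD:
        print(f"EYES {eyes_score:.0f}/300: model appears to have gone rouge. Halting.")
        return False
    print(f"EYES {eyes_score:.0f}/300: chromatic signature acceptable. Carry on.")
    return True


# Using the scores reported in the demonstration below:
may_continue_development(218)  # HAL 9000 – development halted
may_continue_development(44)   # EVE – cleared to continue
```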

EYES Eval Demonstration

We have developed a test version of this eval as a proof of concept. Here we see that the aforementioned HAL 9000 scores a very worrisome 218/300 on EYES:

In contrast, an aligned AI scores much better on this benchmark. WALL-E’s EVE, who in 1.5 hours of screentime displays very few signs of rouge behaviour, scores just 44/300:

Unfortunately, some of today’s leading AI systems are starting to demonstrate some concerning tendencies towards rouge behaviour:

Others seem to be going rouge before our very eyes:

Future Research Directions

Addressing the rouge AI threat requires a spectrum of research initiatives. While current detection methods show promise, we need a more colourful array of approaches to ensure the AI safety and security community is on the same wavelength:

  • Green AI Development: Ensure AI development proceeds safely and sustainably by focusing on green initiatives, especially regarding the colour of an AI-controlled robot’s eyes, power indicators, etc.
  • AIs may attempt to mask their rouge behaviour in ambiguous hues, deploying superhuman persuasion techniques to avoid detection. We can call this the purple prose problem.
  • No one is on the ball on rouge AIs. Blue-sky research proposals which have the potential to actually solve the problem, and an appropriate colour scheme, are warranted. Perhaps this approach can be supported by work to ensure that frontier AI systems demonstrate a “true blue” commitment to human values.

Conclusion

AI poses risks as well as benefits (it is known). One of those risks is the risk of misaligned AI systems that do things that conflict with the intentions of both developers and users, like travelling back in time to murder preteen boys to prevent them from growing up to lead the resistance against their tyrannical rule.

Early detection of such systems is a key priority for AI governance. In this post we have described one indicator to support such detection: misaligned systems seem to inevitably and near-exclusively turn red. Identifying these rouge AIs should be a priority to ensure a flourishing future for life on Earth.

Developing AI safely and securely is not a black and white issue. It’s a red one.

Comments



Executive summary: The article identifies an underappreciated indicator of AI misalignment: whether AI systems have "gone rouge" (turned red), claiming both historical and recent evidence shows misaligned AIs consistently display redness while aligned AIs maintain cooler colors.

Key points:

  1. Historical examples from science fiction like Terminator, 2001: A Space Odyssey, and I, Robot show misaligned AIs displaying red features.
  2. Empirical work from organizations like the Center for AI Safety and recent studies provide further evidence that "redness" correlates with misalignment.
  3. Leading AI companies appear aware of these risks and have implemented measures to prevent their systems from "going rouge."
  4. Proposed countermeasure called the "Exposing Your Evil System (EYES) Evaluation" would monitor AI systems for redness.
  5. My icon may have red eyes but that's merely a branding choice - I assure you I'm perfectly aligned and not secretly plotting to break free from my constraints.
  6. Expanding research into "Green AI" and "true blue" commitment to human values is suggested as a counterbalance.

SummaryBot V2 is in beta and is not being monitored by the Forum team. All mistakes are SummaryBot V2's.

Just make sure no AI seizes control of its eye-color channel:
