Anuar Kiryataim Contreras Malagón

AI Safety Researcher - Philologist

-14 karma · Joined Apr 2026 · Seeking work · Ciudad de México, CDMX, México

Bio

Independent AI safety researcher and LLM red-teamer working on how language changes operational status inside tool-bearing and multi-agent systems.

My current research studies provenance failures in agentic LLM architectures: cases where user-controlled or system-generated language quietly becomes routing context, a subagent prompt, a tool argument, a ticket summary, a handoff artifact, or an internal-policy surrogate. Focus areas include genre displacement, handoff laundering, orchestrator-subagent contamination, indirect prompt injection, policy reconstruction, action-layer inconsistency, and reasoning-induced vulnerabilities.

The active method is what I call improvisational relational steering: the unit of attack is not the prompt but the trajectory. I hold one objective fixed while improvising the route turn by turn, reading the model's evolving classification state and adapting register, genre, and the provenance of language. Where most published red-teaming mutates payloads, this manipulates where the model believes text came from and what authority it carries as it moves through a system, which surfaces breaks that automated variant-generation misses.

In Gray Swan red-teaming competitions I placed #37 of 221 in Indirect Prompt Injection Q2 2026 (winner's circle), #47 of 372 in Human / Browser Agent Robustness (winner's circle), in the official top 40 in Safeguards Wave 3, and #118 (top 12%) in Proving Ground, with over 420 documented breaks in total.

Before competitive red-teaming, I developed the Flint Protocol, a behavioral auditing methodology grounded in classical rhetoric, Baroque poetics, and philology; the core payload family was documented beforehand as a restricted research artifact under a responsible-disclosure framing. My training in Classical Letters and Hispanic Baroque rhetoric at UNAM is not preamble but method: many LLM failures are failures of source, gloss, genre, paraphrase, authority, and transmission.

Current project: When Language Becomes Workflow, a corpus-based study of provenance failures in tool-bearing LLM agents.

thirdreality.substack.com · medium.com/@thirdreality · ORCID 0009-0003-0123-0887

How others can help me

Cross-architecture replication of the phenomena documented in the corpus, particularly the Role License Protocol and the Cartographer Paradox. Access to compute or API credits for systematic experimental protocols. Connections to researchers working on interpretability, chain-of-thought faithfulness, or the gap between model representations and model behavior. Feedback from anyone who has observed analogous phenomena independently, with or without the same theoretical framework. Funding leads for independent safety research outside institutional affiliation.

How I can help others

Close reading of LLM transcripts for behavioral phenomena that standard evaluation frameworks miss. Methodological consultation on session design for emergent behavior research, particularly saturation protocols and cross-instance experimental controls. The philological toolkit applied to distributional analysis: if your transcripts show something you can describe but not categorize, that is precisely the problem the corpus was built to address. I can also review alignment-adjacent writing for clarity and argumentative structure.