This is a special post for quick takes by Taha Iqbal. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Evals are being gamed not because the methodology is insufficient but the models on which the compliance audit run are sophisticated enough to game the audit. IC methodology already solved the problem of denied human capabilities through triangulation by using independent behavioural signals not better direct elicitation .The AI safety community needs to make the same epistemological shift. The question isn't how to make evals harder to game, it's whether evals are the right instrument at all.
Applying Intelligence Community Indications and Warning methodology to frontier AI yields a single, stark conclusion: we are currently in an active warning failure. The capability thresholds intended to trigger policy interventions have already been breached, with frontier models clearing 50-70% on SWE bench and inference efficiency expanding at a 40x annually. Our current evaluation frameworks are structurally gameable by situationally aware systems, pointing to a foundational counterintelligence failure rather than a mere oversight gap. The governance community must immediately pivot from behavioral black box testing to white box mechanistic auditing, moving away from trying to prove danger and toward enforcing mandatory compliance frameworks.
Evals are being gamed not because the methodology is insufficient but the models on which the compliance audit run are sophisticated enough to game the audit.
IC methodology already solved the problem of denied human capabilities through triangulation by using independent behavioural signals not better direct elicitation .The AI safety community needs to make the same epistemological shift.
The question isn't how to make evals harder to game, it's whether evals are the right instrument at all.
Applying Intelligence Community Indications and Warning methodology to frontier AI yields a single, stark conclusion: we are currently in an active warning failure. The capability thresholds intended to trigger policy interventions have already been breached, with frontier models clearing 50-70% on SWE bench and inference efficiency expanding at a 40x annually. Our current evaluation frameworks are structurally gameable by situationally aware systems, pointing to a foundational counterintelligence failure rather than a mere oversight gap. The governance community must immediately pivot from behavioral black box testing to white box mechanistic auditing, moving away from trying to prove danger and toward enforcing mandatory compliance frameworks.