The trust point above, that demand for rationales is partly distrust of the numbers, is the one I'd build on. A lot of that demand isn't really a request for more words. It's a way of asking "should I believe this wasn't reverse-engineered from what you already expected?" That's a question about a rationale's integrity, not its length, and it's also why "now it's just LLM hallucinations" feels like a fair worry.
That points at a job distinct from producing rationales: evaluating them. For each load-bearing claim, was it actually established by the evidence, or only suggested by it and then written up as established? Two rationales of equal length can differ enormously on that, and that's where most of the persuasive weight should sit.
The cleanest way I've found to make it checkable is to seal the assessment before the outcome and judge it only on what was knowable at the time. Then hindsight can't quietly relabel a lucky call as sound reasoning, and a reader can verify that for themselves.
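For concreteness, here is a minimal sketch of what "sealing" could look like, assuming a plain hash commitment: you publish the digest before the outcome and reveal the assessment and nonce afterwards. The function names, the field names, and the example content are all hypothetical, not a description of any particular tool. The hash only shows the text wasn't changed; it proves the date only if the digest was posted somewhere with a timestamp a reader trusts.

```python
import hashlib
import json
import secrets


def seal(assessment: dict) -> tuple[str, str]:
    """Return (digest, nonce). Publish the digest now; keep the rest private."""
    # The random nonce stops anyone guessing the contents by hashing candidates.
    nonce = secrets.token_hex(16)
    payload = json.dumps(assessment, sort_keys=True) + nonce
    return hashlib.sha256(payload.encode("utf-8")).hexdigest(), nonce


def verify(assessment: dict, nonce: str, digest: str) -> bool:
    """Check that a revealed assessment matches the digest published earlier."""
    payload = json.dumps(assessment, sort_keys=True) + nonce
    return hashlib.sha256(payload.encode("utf-8")).hexdigest() == digest


# Hypothetical assessment, invented purely for illustration.
example = {
    "call": "go",
    "load_bearing_reasons": ["reason A", "reason B"],
    "flagged_weak": ["reason B"],
}

digest, nonce = seal(example)           # before the outcome
is_intact = verify(example, nonce, digest)  # after the outcome, by any reader
```

Anyone holding the published digest can run `verify` themselves, which is the point: the check doesn't depend on trusting the author's memory of what they believed.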
I work on this in pharma R&D, where, unlike AGI, the outcomes are dated and land in a year or two, so the discipline is testable against reality on a real clock. Different domain, same hole. Glad to share a worked example if it's useful.
Point 2 is the one I'd push further. Once a probability has done its real job, exposing that two people in the room are 40pp apart, there's a second lever most skip: checking whether the reasoning each side exposed was actually licensed by the evidence, or just confidently asserted. The interesting part of a 40pp gap usually isn't the number. It's that one person treated as established what the evidence only suggested.
That also reframes the accountability worry. Brier scores need a tournament's worth of resolved questions to mean much, which slow or one-shot decisions never supply. But you can benchmark a different way: name the load-bearing reasons up front, seal them before the outcome, then later check whether the ones you flagged as weak were the ones that actually drove the result. That works on a single decision, and it survives the clearance filter you describe, because what reaches the minister isn't a percentage, it's "here is why this was a reasonable call on what we knew at the time."
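A rough sketch of the contrast, under stated assumptions: `brier` is the standard mean squared error between forecast probabilities and 0/1 outcomes, while `review` is a hypothetical helper for the single-decision check described above. The reasons, flags, and "drivers" below are invented for illustration, and deciding what actually drove the result is itself a judgment made after the fact.

```python
from dataclasses import dataclass


def brier(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


@dataclass(frozen=True)
class Reason:
    claim: str
    flagged_weak: bool  # recorded and sealed before the outcome


def review(reasons: list[Reason], drivers: set[str]) -> dict[str, list[str]]:
    """Compare reasons flagged weak up front with what drove the result."""
    flagged = {r.claim for r in reasons if r.flagged_weak}
    return {
        "anticipated": sorted(flagged & drivers),  # flagged weak, and it mattered
        "blind_spots": sorted(drivers - flagged),  # mattered, but was not flagged
        "unrealised": sorted(flagged - drivers),   # flagged, but did not matter
    }


# Hypothetical single decision, invented purely for illustration.
reasons = [
    Reason("effect size carries over from the earlier study", flagged_weak=False),
    Reason("enrolment finishes on schedule", flagged_weak=True),
    Reason("comparator performs as in the literature", flagged_weak=True),
]
drivers = {
    "enrolment finishes on schedule",
    "effect size carries over from the earlier study",
}

# One resolved question gives one squared error, which can't separate luck from skill.
single_score = brier([0.7], [0])
result = review(reasons, drivers)
```

On one decision the Brier number says little, but the `blind_spots` list says something specific: a reason that carried the outcome was treated as settled when the record was sealed.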
I work on exactly this in pharma R&D, where go/no-go calls have dated readouts a year or two out, so the reasoning can be scored against reality without waiting for a tournament's worth of questions. The decision-makers there want the defensible record, not the probability, which matches your experience.