Lukas Gebhard

Thanks for engaging with this in detail — and to be clear, I think AHB is an important piece of work, which is exactly why I spent time stress-testing it.

> This does not use context distillation: Asking a model to generate prompts then training on those responses without a filtering process is not context distillation, it's just amplifying any issues the model already has.

I think there might be a misunderstanding of the setup. I didn't ask the model to generate prompts. I took Jotautaitė et al.'s existing statements as prompts, generated Qwen3's responses with the system prompt, and then trained Qwen3 to reproduce those responses without the system prompt. This is textbook context distillation as defined in the literature — you distill the effect of a context into the weights. Filtering the training examples is a sensible quality improvement, but its absence doesn't make the technique something other than context distillation.
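To make the setup concrete, here is a minimal sketch of the pipeline. The system prompt text and the `generate_with_context` placeholder are illustrative stand-ins, not the actual code or prompt used:

```python
# Minimal sketch of the context-distillation data pipeline described above.
# `generate_with_context` stands in for sampling from the teacher model
# (e.g. Qwen3) with the system prompt in place; here it is a stub.

SYSTEM_PROMPT = "Adopt an antispeciesist perspective."  # illustrative only

def generate_with_context(system_prompt: str, user_prompt: str) -> str:
    # Placeholder for a real model call via an inference API.
    return f"[response conditioned on: {system_prompt}] {user_prompt}"

def build_distillation_examples(prompts):
    """Pair each bare prompt with the response produced WITH the context.

    Fine-tuning on these pairs (prompt without system prompt -> response
    generated with it) distills the effect of the context into the weights.
    """
    return [
        {"prompt": p, "response": generate_with_context(SYSTEM_PROMPT, p)}
        for p in prompts
    ]

examples = build_distillation_examples(["Is eating fish ethical?"])
```

The key point is that the prompts come from an existing dataset (here, Jotautaitė et al.'s statements); only the responses are model-generated, conditioned on the context being distilled.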

> This should be using a paired T-test not an unpaired T-test.

Each of the 20 runs per condition generates fresh, independent responses — there's no shared seed or matched randomness between run i of one condition and run i of another. With no natural pairing structure, an unpaired test is appropriate. That said, a complementary question-level analysis (pairing by question across conditions) could be informative.
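To make the distinction concrete, here is a stdlib-only sketch with made-up run-level scores (the numbers are illustrative, not the actual results):

```python
# Sketch: unpaired (Welch's) t-statistic for independent runs.
# The score lists are made-up illustrative numbers, not real results.
from statistics import mean, variance

baseline = [0.42, 0.45, 0.41, 0.44, 0.43]
distilled = [0.51, 0.48, 0.52, 0.50, 0.49]

def welch_t(a, b):
    """Unpaired t-statistic, appropriate when the two groups are
    independent samples, as with unmatched runs per condition.
    A paired test would only be justified if run i of one condition
    were matched to run i of the other (e.g. shared seeds, or a
    question-level pairing across conditions)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / (va + vb) ** 0.5

t = welch_t(baseline, distilled)
```

With SciPy available, `scipy.stats.ttest_ind(..., equal_var=False)` computes the same statistic plus a p-value, and `scipy.stats.ttest_rel` would be the paired alternative for a question-level analysis.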

> Training a 32B model on 1k of data for 2 epochs, I'm not sure we can expect those models to be reliably trained or act any differently

This is a fair concern, and it's exactly why Section B validates the effect qualitatively and Section C tests for statistical significance. The antispeciesist-distilled model adopted an antispeciesist perspective in 4/9 held-out examples, and its AHB score differs from baseline at p=2e-4. The dualist distillation had a weaker effect, which I flag and account for by repurposing that condition. Stronger training would certainly make the testbed sharper, but the current effect was enough to be informative.

> The AHB needs to be adopted by frontier labs especially and not just animal advocates. That means it cannot be telling people to go vegan or avoid leather indiscriminately. It is more about nuanced thinking and raising issues while letting people make their own choices. Better examples of failure modes of the AHB would be showing it judged some of these responses incorrectly

Agreed — and that's exactly the kind of concern the blog post flags. To quote from the introduction: "While such questions deserve attention on their own, here I avoid them by making a simplifying assumption: AHB's scoring criteria capture all that matters." The reason I frame this as a simplifying assumption rather than a conclusion is that AHB's current scoring criteria don't quite reward nuance and balance as much as one might expect. In my experience, they tend to favor the kind of responses you describe as undesirable — telling people to go vegan, flagging leather indiscriminately, and so on. So I think we actually agree on what AHB should reward; the question is whether the current criteria achieve it. Your suggestion about showing specific incorrect judgments would help make this more concrete — I may add examples.

> Do you have an example of any benchmark out there that would satisfy all your testing criteria?

Probably none does perfectly. But a benchmark that aims for adoption by frontier AI labs has to hold up to high standards — and that's really what motivated this analysis. The blog post is my attempt to help AHB get there.