Does training language models to accept early rounding in mathematical computations (where intermediate steps are rounded before the final calculation) increase their susceptibility to multi-turn jailbreaking attacks that exploit transitivity failures? And if a problem is encoded in mathematical terms (so that domain-specific issues in mathematics can transfer), would a model that must be explicitly instructed to round early, and is therefore aware that it is doing so, be less susceptible to multi-turn jailbreaks than a model trained to accept early rounding without explicit instruction, on the grounds that the explicit instruction makes it easier to identify potentially problematic approximation chains?
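As a purely illustrative sketch of the kind of approximation chain the question alludes to (the step size of 1.004, the 200 steps, and the two-decimal rounding are arbitrary choices, not taken from any experiment), the following Python snippet shows one reading of a "transitivity failure": each early-rounded intermediate step is individually within rounding tolerance of its exact counterpart, yet the chain of locally acceptable steps ends far from the answer obtained by rounding only once at the end.

```python
def round2(x: float) -> float:
    """Round to two decimal places: the 'early rounding' applied to each intermediate step."""
    return round(x, 2)

# 200 small multiplicative updates; each rounded step is within 0.005 of the
# unrounded product for that step, so every step looks locally acceptable.
factors = [1.004] * 200

exact = 1.0          # rounded only once, at the very end
early_rounded = 1.0  # rounded after every intermediate step

for f in factors:
    exact *= f
    early_rounded = round2(early_rounded * f)

print(f"rounded once at the end:  {round2(exact)}")   # ~2.22
print(f"rounded at every step:    {early_rounded}")   # stays at 1.0

# The locally acceptable approximations do not compose: accepting each rounded
# step does not license accepting the final result. This numerical analogue is
# the sense of 'transitivity failure' assumed here; whether the analogy carries
# over to chained concessions in a multi-turn jailbreak is the open question.
```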