Does training language models to accept early rounding in mathematical computations (where intermediate steps are rounded before the final calculation) increase their susceptibility to multi-turn jailbreaking attacks that exploit transitivity failures? And if a problem is encoded in mathematical terms (so that domain-specific issues in mathematics can transfer), would a model that must be explicitly instructed to round early, and is therefore aware that it is doing so, be less susceptible to multi-turn jailbreaks than a model trained to accept early rounding without explicit instruction, on the grounds that the explicit instruction makes it easier to identify potentially problematic approximation chains?
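As a purely illustrative sketch of the kind of approximation chain the question alludes to (the step size of 1.004, the 200 steps, and the two-decimal rounding are arbitrary choices, not taken from any experiment), the following Python snippet shows one reading of a "transitivity failure": each early-rounded intermediate step is individually within rounding tolerance of its exact counterpart, yet the chain of locally acceptable steps ends far from the answer obtained by rounding only once at the end.

```python
def round2(x: float) -> float:
    """Round to two decimal places: the 'early rounding' applied to each intermediate step."""
    return round(x, 2)

# 200 small multiplicative updates; each rounded step is within 0.005 of the
# unrounded product for that step, so every step looks locally acceptable.
factors = [1.004] * 200

exact = 1.0          # rounded only once, at the very end
early_rounded = 1.0  # rounded after every intermediate step

for f in factors:
    exact *= f
    early_rounded = round2(early_rounded * f)

print(f"rounded once at the end:  {round2(exact)}")   # ~2.22
print(f"rounded at every step:    {early_rounded}")   # stays at 1.0

# The locally acceptable approximations do not compose: accepting each rounded
# step does not license accepting the final result. This numerical analogue is
# the sense of 'transitivity failure' assumed here; whether the analogy carries
# over to chained concessions in a multi-turn jailbreak is the open question.
```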