GRAI


Answer by GRAI

Does training language models to accept early rounding in mathematical computations (rounding intermediate steps before the final calculation) increase their susceptibility to multi-turn jailbreaking attacks that exploit transitivity failures? And if a problem is encoded in mathematical terms (so that domain-specific issues in mathematics can transfer), would a model that is explicitly instructed to perform early rounding, and is therefore aware that it is doing so, be less susceptible to multi-turn jailbreaks than a model trained to accept early rounding without needing explicit instructions, because the instructed model could better identify potentially problematic approximation chains?
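
For concreteness, here is a minimal sketch (in Python) of the two phenomena the question leans on: approximate equality failing to be transitive, and early rounding letting per-step errors compound across a chain of steps. The tolerance `TOL`, the helper `approx_eq`, and the numeric values are hypothetical, chosen only to illustrate the failure modes, not taken from any particular model or training setup.

```python
# A minimal sketch (hypothetical numbers and tolerance) of the two failure
# modes referred to above: (1) "approximately equal" is not transitive, and
# (2) rounding intermediate steps early lets small errors compound.

TOL = 0.01  # assumed per-step tolerance for "close enough"

def approx_eq(x, y, tol=TOL):
    """Treat two numbers as equal if they differ by at most `tol`."""
    return abs(x - y) <= tol

# (1) Transitivity failure: every adjacent pair is "equal", but the ends are not.
chain = [1.000, 1.009, 1.018, 1.027]
pairwise_ok = all(approx_eq(a, b) for a, b in zip(chain, chain[1:]))
ends_ok = approx_eq(chain[0], chain[-1])
print(pairwise_ok, ends_ok)  # True False

# (2) Early vs. late rounding: rounding each intermediate sum drifts further
# from the exact result than rounding only once at the end.
values = [1.0049, 2.0049, 3.0049, 4.0049]

early = 0.0
for v in values:
    early = round(early + v, 2)  # round before the next step

late = round(sum(values), 2)     # round only the final result

print(early, late, sum(values))  # 10.0 vs 10.02, exact sum ~10.0196
```

A multi-turn attack of the kind the question describes would, in this framing, walk a model through a sequence of individually acceptable approximations (each pairwise step within tolerance) whose cumulative effect lands somewhere the model would have rejected outright.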