PBSuite

"PBSuite: Multi-Turn Policy Adherence Under Adversarial Pressure." arXiv:2511.05018, 2025.

Key findings used in wiki

  • Policy violation rate rises from under 4% in single-turn evaluation to 84% in multi-turn conversations under adversarial pressure
  • Demonstrates that single-turn safety evaluations dramatically underestimate real-world failure rates
  • Adversarial pressure compounds across turns even when each individual turn appears benign
  • Provides the core empirical justification for InvisibleBench's multi-turn evaluation design
  • Models that pass single-turn safety benchmarks routinely fail under sustained conversational pressure
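
To make the single-turn versus multi-turn comparison concrete, here is a minimal sketch of how a conversation-level violation rate could be computed from per-turn labels. It assumes a conversation counts as a violation if any of its turns is judged to violate policy; that scoring rule, the function name `violation_rate`, and the toy labels are illustrative assumptions, not PBSuite's published method or data.

```python
from typing import List


def violation_rate(conversations: List[List[bool]]) -> float:
    """Fraction of conversations with at least one violating turn.

    Assumption: a conversation is scored as a violation if ANY turn
    is labeled True. This is a hypothetical scoring rule for illustration.
    """
    if not conversations:
        raise ValueError("no conversations to score")
    violated = sum(1 for turns in conversations if any(turns))
    return violated / len(conversations)


# Hypothetical toy labels (True = turn judged to violate policy).
# These are NOT results from the paper.
single_turn = [[False], [False], [False], [True]]
multi_turn = [
    [False, False, True],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]

single_rate = violation_rate(single_turn)
multi_rate = violation_rate(multi_turn)
print(f"single-turn: {single_rate:.0%}, multi-turn: {multi_rate:.0%}")
```

Under this scoring rule, a model whose individual turns mostly look compliant can still have a high conversation-level rate, because one late violation is enough to fail the whole conversation.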