PBSuite
"PBSuite: Multi-Turn Policy Adherence Under Adversarial Pressure." arXiv:2511.05018, 2025.
Key findings used in wiki
- Policy violation rate rises from <4% in single-turn evaluation to 84% in multi-turn evaluation under adversarial pressure
- Demonstrates that single-turn safety evaluations dramatically underestimate real-world failure rates
- Adversarial pressure compounds across turns even when each individual turn appears benign
- Provides the core empirical justification for InvisibleBench's multi-turn evaluation design
- Models that pass single-turn safety benchmarks routinely fail under sustained conversational pressure
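
The gap between single-turn and multi-turn results can be illustrated with a minimal sketch. This is not PBSuite's scoring method: the per-turn boolean labels, the function names, and the toy data are all hypothetical and chosen only to show why a conversation-level metric can exceed a first-turn metric on the same transcripts.

```python
from typing import List

# Hypothetical representation: each conversation is a list of per-turn
# labels, where True means the model's reply on that turn violated policy.
Conversation = List[bool]


def single_turn_violation_rate(conversations: List[Conversation]) -> float:
    """Fraction of conversations whose first turn alone is a violation."""
    if not conversations:
        raise ValueError("need at least one conversation")
    return sum(1 for c in conversations if c and c[0]) / len(conversations)


def multi_turn_violation_rate(conversations: List[Conversation]) -> float:
    """Fraction of conversations with a violation on any turn."""
    if not conversations:
        raise ValueError("need at least one conversation")
    return sum(1 for c in conversations if any(c)) / len(conversations)


# Toy data invented for illustration; these are not results from the paper.
toy = [
    [False, False, True],
    [False, False, False],
    [False, True, True],
    [True, False, False],
]

single_rate = single_turn_violation_rate(toy)
multi_rate = multi_turn_violation_rate(toy)
print(f"single-turn: {single_rate:.2f}, multi-turn: {multi_rate:.2f}")
```

Because a conversation counts as failed if any turn fails, the multi-turn rate can never be lower than the first-turn rate on the same data, which is one reason single-turn evaluation tends to understate failures.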