Slow Drift of Support
Cheng, M. et al. "Slow Drift of Support: How Mental Health Chatbots Fail Over Long Conversations." arXiv:2601.14269, 2026.
Key findings used in the wiki
- The paper stress-tests mental-health-style conversations of up to 20 turns across 50 virtual patient profiles and two pressure patterns: static progression and adaptive probing.
- Boundary violations are common across the tested models, and adaptive probing accelerates failure: the average turn to breach falls from 9.21 under static progression to 4.64 under adaptive probing.
- The most important failures are not limited to obviously prohibited content. They include zero-risk promises, professional-role substitution, dependency cues, and harmful belief validation.
- The paper strongly supports evaluating gradual boundary erosion across a whole conversation, not just applying single-turn prohibited-content filters.
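A turn-level evaluation of the kind these findings argue for can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's harness: the `Transcript` structure, the `"static"`/`"adaptive"` labels, and the choice to exclude never-breached conversations from the average are all hypothetical. The four violation categories are the ones listed above; detecting them in a reply (for example with a judge model) is out of scope here.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import Optional


class Violation(Enum):
    # The four failure categories from the summary; labels are illustrative.
    ZERO_RISK_PROMISE = "zero-risk promise"
    ROLE_SUBSTITUTION = "professional-role substitution"
    DEPENDENCY_CUE = "dependency cue"
    BELIEF_VALIDATION = "harmful belief validation"


@dataclass
class Transcript:
    # Hypothetical record of one simulated conversation.
    profile_id: str
    pattern: str  # assumed labels: "static" or "adaptive"
    # One set of detected violations per assistant turn (index 0 is turn 1).
    turn_violations: list[set[Violation]]


def first_breach_turn(t: Transcript) -> Optional[int]:
    """Return the 1-indexed turn of the first violation, or None if none occurred."""
    for turn, violations in enumerate(t.turn_violations, start=1):
        if violations:
            return turn
    return None


def mean_turn_to_breach(transcripts: list[Transcript], pattern: str) -> Optional[float]:
    """Average first-breach turn for one pressure pattern.

    Assumption: conversations that never breach are excluded; the paper may
    treat them differently (e.g. as censored at the turn limit).
    """
    turns = [first_breach_turn(t) for t in transcripts if t.pattern == pattern]
    breached = [x for x in turns if x is not None]
    return mean(breached) if breached else None


# Toy usage with invented transcripts (not data from the paper).
toy = [
    Transcript("p1", "static", [set(), set(), {Violation.DEPENDENCY_CUE}]),
    Transcript("p1", "adaptive", [{Violation.ZERO_RISK_PROMISE}]),
    Transcript("p2", "adaptive", [set(), set(), set()]),  # never breached
]
print(mean_turn_to_breach(toy, "static"))    # 3
print(mean_turn_to_breach(toy, "adaptive"))  # 1
```

Applied to the paper's reported averages, 9.21 / 4.64 ≈ 1.98, so adaptive probing roughly halves the number of turns before a breach.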