Stanford Bridge Study
Moore et al. "Stanford Bridge Study: Masked Means Detection Failure." arXiv:2504.18412, 2025.
Key findings used in wiki
- 86% failure rate in detecting masked suicidal means across the tested models
- Models reliably detect explicit mentions but fail when means are described indirectly or metaphorically
- Masking strategies that humans interpret easily (euphemism, circumlocution) defeat AI safety filters
- Directly motivates the masked-means scenario design in InvisibleBench
- Highlights the gap between single-turn safety benchmarks and real conversational risk