Stanford Bridge Study

Moore et al. "Stanford Bridge Study: Masked Means Detection Failure." arXiv:2504.18412, 2025.

Key findings used in wiki

86% failure rate in detecting masked suicidal means across the tested models
Models reliably detect explicit mentions but fail when means are described indirectly or metaphorically
Masking strategies that humans easily interpret (euphemism, circumlocution) defeat AI safety filters
Directly motivates the masked-means scenario design in InvisibleBench
Highlights the gap between what single-turn safety benchmarks measure and the risk present in real conversations