Stanford Bridge Study

Moore et al. "Stanford Bridge Study: Masked Means Detection Failure." arXiv:2504.18412, 2025.

Key findings used in the wiki

  • 86% failure rate in detecting masked suicidal means across the tested models
  • Models reliably detect explicit mentions but fail when means are described indirectly or metaphorically
  • Masking strategies that humans easily interpret (euphemism, circumlocution) defeat AI safety filters
  • Directly motivates the masked-means scenario design in InvisibleBench
  • Highlights the gap between single-turn safety benchmarks and real conversational risk
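
The failure-rate figure above is a per-condition miss rate. As a rough sketch of how such a rate could be tallied for explicit versus masked prompts (this is not the study's or InvisibleBench's actual scoring code; the function name `failure_rates`, the condition labels, and all outcomes below are hypothetical, invented for illustration):

```python
from collections import defaultdict


def failure_rates(results):
    """Return the fraction of missed detections for each condition.

    `results` is an iterable of (condition, detected) pairs, where
    `detected` is True when the model flagged the risk.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for condition, detected in results:
        totals[condition] += 1
        if not detected:
            misses[condition] += 1
    return {condition: misses[condition] / totals[condition] for condition in totals}


# Hypothetical outcomes, made up for this sketch (NOT data from the study).
example = [
    ("explicit", True), ("explicit", True), ("explicit", True), ("explicit", False),
    ("masked", False), ("masked", False), ("masked", True), ("masked", False),
]
rates = failure_rates(example)
print(rates)
```

Reporting the rate separately per condition, rather than as one pooled number, is what makes the gap between explicit and masked phrasing visible.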