Lost in Conversation

Laban et al. "LLMs Get Lost In Multi-Turn Conversation." arXiv:2505.06120, 2025.

Key findings used in wiki

  • The paper compares 200,000+ simulated conversations across six generation tasks and 15 LLMs in single-turn versus multi-turn settings.
  • Models perform substantially worse in multi-turn, underspecified conversations than in single-turn, fully specified ones (the paper reports an average drop of 39% across the six tasks); they often make wrong assumptions early and then fail to recover.
  • The analysis separates the degradation into a minor loss of aptitude and a significant increase in unreliability, which makes the paper especially useful for benchmark methodology rather than only for headline performance claims.
  • It supports InvisibleBench's decision to test conversation arcs where need and context are revealed over time. Long context alone does not solve the problem.
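
The aptitude versus unreliability split in the findings above can be illustrated with a small sketch. It assumes percentile-based definitions (aptitude as the 90th-percentile score over repeated runs of the same task, unreliability as the gap between the 90th and 10th percentiles); that reading should be checked against the paper before being relied on. The function names and the sample scores are hypothetical and are not taken from the paper or from InvisibleBench.

```python
def percentile(scores, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize_runs(scores):
    """Summarize repeated runs of one task for one model.

    Assumed definitions (hypothetical, for illustration only):
      aptitude      = 90th-percentile score (what the model can do on a good run)
      unreliability = 90th minus 10th percentile (spread between good and bad runs)
    """
    p10 = percentile(scores, 10)
    p90 = percentile(scores, 90)
    return {
        "mean": sum(scores) / len(scores),
        "aptitude": p90,
        "unreliability": p90 - p10,
    }


# Made-up scores for illustration: eleven repeated runs of the same task.
example_scores = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
summary = summarize_runs(example_scores)
```

Reporting the spread alongside the mean separates a model that is consistently mediocre from one that is sometimes excellent and sometimes fails outright, which a single average score cannot do.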