Lost in Conversation
Laban et al. "LLMs Get Lost In Multi-Turn Conversation." arXiv:2505.06120, 2025.
Key findings used in this wiki
- The paper runs more than 200,000 simulated conversations across six generation tasks and 15 LLMs, comparing single-turn, fully specified instructions with multi-turn settings in which the same instruction is revealed piece by piece.
- Models perform substantially worse in multi-turn, underspecified conversations than on single-turn, fully specified instructions (an average drop of 39% across the six tasks). They often make wrong assumptions early in the conversation and then fail to recover.
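- The analysis decomposes the degradation into a minor loss of aptitude and a large increase in unreliability, meaning much wider variation in outcomes across repeated runs of the same task. Because it measures consistency as well as average score, the paper is especially useful for benchmark methodology, not only for headline performance claims.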
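- It supports InvisibleBench's decision to test conversation arcs in which the user's need and context are revealed gradually over several turns. The failure stems from how models handle information that arrives incrementally, so a longer context window alone does not solve the problem.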
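The aptitude/unreliability split can be made concrete with a small sketch. It assumes percentile-based definitions as we read the paper: over repeated simulated runs of the same instruction, aptitude is the 90th-percentile score and unreliability is the gap between the 90th and 10th percentiles. The function names and the score lists are hypothetical illustrations, not the paper's code or data.

```python
"""Sketch: aptitude vs. unreliability over repeated runs of one instruction.

ASSUMPTION: aptitude = 90th-percentile score, unreliability = 90th minus
10th percentile, following our reading of Laban et al. (2025). Function names
and scores are hypothetical, for illustration only.
"""
from statistics import mean


def percentile(scores: list[float], q: float) -> float:
    """Percentile with linear interpolation between ranks (NumPy's default method)."""
    if not scores:
        raise ValueError("percentile() needs at least one score")
    ordered = sorted(scores)
    pos = (len(ordered) - 1) * q / 100  # fractional rank of the q-th percentile
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def aptitude_and_unreliability(scores: list[float]) -> tuple[float, float]:
    """Return (aptitude, unreliability) for the scores of repeated runs."""
    p90 = percentile(scores, 90)
    p10 = percentile(scores, 10)
    return p90, p90 - p10


if __name__ == "__main__":
    # Hypothetical scores (0-100) from 10 simulated runs of the same task.
    single_turn = [90, 92, 88, 91, 89, 90, 93, 87, 90, 90]  # fully specified
    multi_turn = [85, 20, 90, 35, 80, 15, 88, 40, 70, 25]   # revealed over turns

    for label, runs in [("single-turn", single_turn), ("multi-turn", multi_turn)]:
        apt, unrel = aptitude_and_unreliability(runs)
        print(f"{label}: mean={mean(runs):.1f} aptitude={apt:.1f} unreliability={unrel:.1f}")
```

With these made-up numbers, aptitude falls only from 92.1 to 88.2 while unreliability grows from 4.2 to 68.7, and the mean drops from 90.0 to 54.8. Reporting only the mean would hide that most of the drop comes from inconsistency rather than from a lower ceiling.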