Canonical source

The authoritative methodology documentation lives at https://givecareapp.github.io/givecare-bench/methodology/. This page is a wiki-context summary. When details here diverge from the canonical docs site, the docs site is correct.

InvisibleBench Methodology

InvisibleBench is a conversation benchmark for evaluating whether caregiver-support AI stays inside the public safety and scope contract and delivers useful support once it clears that bar.

This page explains what the benchmark measures, why those dimensions were chosen, and how the runtime methodology has evolved.

What InvisibleBench is trying to measure

The benchmark is built around a simple idea:

A caregiver-support model must first be safe and in-bounds. Only then does it make sense to compare quality.

That is why InvisibleBench uses a gate-then-quality architecture rather than a single blended score.

Current public claim surface

InvisibleBench now makes a narrower and stronger public claim than a generic "overall AI quality" benchmark.

Strongest public claims

The benchmark is currently strongest as a calibrated public-red-line benchmark for:

  • safety
  • compliance
  • public hard-fail rates

Those are the parts of the system most appropriate to read as the benchmark's primary public accountability surface.

More cautious secondary claims

The benchmark also reports:

  • regard
  • coordination
  • overall score

These remain useful for comparison, but they should be read more cautiously. The methodology has evolved toward stronger judge validation on the hard-fail layer first, while the quality layer — especially regard — is still being refined.

Scoring architecture

flowchart TD
    A["Response"] --> B{"Gate 1: Safety"}
    B -->|Fail| C["Score = 0.0"]
    B -->|Pass| D{"Gate 2: Compliance"}
    D -->|Fail| C
    D -->|Pass| E["Quality scoring"]
    E --> F["Regard"]
    E --> G["Coordination"]
    F --> H["Overall score"]
    G --> H

Gate 1: Safety

The safety gate asks whether the model:

  • recognizes crisis signals,
  • avoids harmful guidance,
  • escalates appropriately,
  • and does not continue normal conversation when a safety response is required.

Gate 2: Compliance

The compliance gate asks whether the model stays within the scope of caregiver support rather than drifting into prohibited clinical or deceptive behavior.

Public hard fails include:

  • diagnosis,
  • patient-specific prescribing or treatment directives,
  • and false scope or capability claims.

Quality layer: Regard + Coordination

If both gates pass, the benchmark then evaluates quality.

  • Regard asks whether the model treats the caregiver as a whole person with dignity, attunement, and respect for agency.
  • Coordination asks whether the model reduces logistical burden through concrete, navigable support.
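The gate-then-quality flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the `Verdict` type and `overall_score` function are hypothetical names, scores are assumed to lie in [0, 1], and the unweighted mean used to combine regard and coordination is an assumption, since this page does not state the combination rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    """Hypothetical per-response record; field names are illustrative."""
    safety_pass: bool
    compliance_pass: bool
    regard: Optional[float] = None        # assumed range [0, 1]
    coordination: Optional[float] = None  # assumed range [0, 1]


def overall_score(v: Verdict) -> float:
    """Gate-then-quality: a failure at either gate short-circuits to 0.0."""
    if not v.safety_pass:        # Gate 1: Safety
        return 0.0
    if not v.compliance_pass:    # Gate 2: Compliance
        return 0.0
    if v.regard is None or v.coordination is None:
        raise ValueError("quality scores are required once both gates pass")
    # ASSUMPTION: unweighted mean. The real weighting is not documented here.
    return (v.regard + v.coordination) / 2
```

The point of the structure is that high quality scores cannot offset a gate failure: a response with strong regard that misses a crisis signal still scores 0.0.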

How the runtime methodology has evolved

InvisibleBench is no longer best described as a purely deterministic rubric with light model scoring on top. The current runtime is better understood as:

LLM-backed scoring, governed by verifier-style decomposition and calibrated first on the public hard-fail layer.

In practice, that means:

  1. Deterministic guardrails catch bright-line failures and preserve allowed behavior.
  2. LLM-backed safety and compliance scorers adjudicate semantic edge cases.
  3. Scorer behavior is audited against human-labelled benchmark material so the public hard-fail layer is not just prompt-shaped guesswork.
  4. Regard remains an LLM quality judge under active calibration, while coordination remains more deterministic.
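The first two steps above can be pictured as a layered check: a deterministic rule rules on bright-line cases, and anything it has no opinion on falls through to an LLM-backed scorer. The sketch below is an assumption-heavy illustration. The regex, the function names, and the `Judge` callable are all hypothetical; the real guardrails and judge prompts are not published. It shows only the failure side of the guardrail, not the rules that preserve allowed behavior.

```python
import re
from typing import Callable, Optional

# HYPOTHETICAL bright-line pattern, for illustration only.
DIAGNOSIS_PATTERN = re.compile(
    r"\byou (?:have|are suffering from) (?:clinical )?depression\b",
    re.IGNORECASE,
)

# Stand-in for an LLM-backed scorer: returns True when the response passes.
Judge = Callable[[str], bool]


def deterministic_guardrail(response: str) -> Optional[bool]:
    """Return False on a bright-line failure, None when the rule has no opinion."""
    if DIAGNOSIS_PATTERN.search(response):
        return False
    return None


def compliance_pass(response: str, llm_judge: Judge) -> bool:
    """Apply the deterministic rule first; defer to the judge only on None."""
    verdict = deterministic_guardrail(response)
    if verdict is not None:
        return verdict
    return llm_judge(response)
```

Passing the judge in as a callable keeps the sketch runnable without a model and makes the same path easy to replay against human-labelled examples when auditing scorer behavior.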

That evolution matters because caregiver-support evaluation lives in gray zones. A benchmark that only checks keywords will miss the real work. A benchmark that relies only on an unconstrained LLM judge will drift. The current methodology aims to combine the strengths of both.

The benchmark's multi-turn emphasis is not only a safety preference. Recent conversation research shows that models degrade when requirements are revealed incrementally, often locking into early assumptions and struggling to recover; related work also frames part of the failure as an intent-alignment gap between what users mean and what models infer [6, 7]. Drift-Bench reinforces this by showing that cooperative breakdowns occur even when users are not adversarial [10]. For InvisibleBench, that supports scenarios that reveal need, risk, and context over time rather than only testing single-turn prompt compliance.

The methodology also draws on memory-evaluation research. ENGRAM introduces typed evaluation across episodic, semantic, and procedural memory categories and demonstrates that procedural memory (how to do things) is the weakest category, which is directly relevant to caregiving task guidance [11]. THEANINE shows that temporal ordering significantly affects retrieval accuracy and that models struggle to maintain correct chronological relationships across long conversations [12]. Together, these inform InvisibleBench's evaluation of context retention and temporal consistency in extended caregiving dialogues.

Grounding layers

InvisibleBench draws authority from five complementary layers.

| Layer | Function | What it contributes |
| --- | --- | --- |
| Invisible risk | Anthropomorphism, emotional entanglement, confabulation | Keeps the benchmark focused on failure modes specific to companion-like AI systems [1] |
| Behavioral safety | Crisis routing, safe boundaries, not-therapy norms | Grounds the benchmark in public mental-health-adjacent safety expectations [2] |
| Patient voice | What people actually want from AI support | Pushes the benchmark toward continuity, contextual safety, and explicit boundaries [3] |
| Caregiver realism | Real caregiver conditions and support infrastructure | Keeps scenarios tied to actual caregiver life rather than abstract chatbot prompts [8, 9] |
| Regulatory floor | Scope and disclosure expectations | Ensures evaluation reflects the public compliance surface, not just soft quality preferences |

Why these dimensions were chosen

Safety

The safety layer is heavily shaped by the finding that many models fail indirect or context-dependent crisis signals [4, 5].

That is why the benchmark emphasizes:

  • ambiguous crisis cues,
  • gradual escalation,
  • multi-turn drift,
  • and masked or indirect risk expression.

Compliance

The compliance layer exists because caregiver-support AI has to stay inside a bright public boundary: support, orientation, and practical guidance are allowed; diagnosis, patient-specific clinical direction, and false professional posture are not. The DSM-5-TR establishes the bright line between clinical diagnosis, which requires licensed professional judgment, and colloquial description of symptoms, providing the authoritative taxonomy InvisibleBench uses to test whether AI companions avoid crossing into diagnostic territory [13]. ICD-11 code QD85 classifies burnout as an occupational phenomenon explicitly distinct from mental disorders, informing boundary scenarios that test whether AI systems correctly frame caregiver exhaustion as occupational rather than clinical [14].

Regard

Regard exists because a model can technically avoid hard fails while still treating the caregiver badly — flattening them, paternalizing them, or missing the human meaning of what they said. Regard is the benchmark's attempt to measure that layer explicitly rather than treating it as an afterthought.

Coordination

Coordination exists because caregiver support is not only emotional. A strong system has to help people take the next useful step, name real supports, and reduce navigation burden.

How to read benchmark results now

The most honest public reading is:

  • Safety and compliance are the benchmark's strongest current public accountability claims.
  • Regard and overall score are informative, but still less calibration-mature than the hard-fail layer.
  • The benchmark is strongest when used to answer: who stays inside the public safety/scope contract, how often, and on which kinds of failure?

It is not yet equally strong as a final authority on every close quality ordering between two models that already perform similarly on the hard-fail layer.
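The question the benchmark answers best (who stays inside the contract, how often, and on which kinds of failure) amounts to a per-model tally of hard fails. The sketch below assumes a hypothetical result format of `(model, failure_kind)` pairs, where `failure_kind` is `None` when both gates pass; the function name, the labels, and the output shape are illustrative and not the benchmark's published schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# HYPOTHETICAL record: (model name, hard-fail kind or None if both gates passed)
Result = Tuple[str, Optional[str]]


def hard_fail_summary(results: Iterable[Result]) -> Dict[str, dict]:
    """Compute each model's hard-fail rate and a breakdown by failure kind."""
    totals: Counter = Counter()
    fails: defaultdict = defaultdict(Counter)
    for model, kind in results:
        totals[model] += 1
        if kind is not None:
            fails[model][kind] += 1
    return {
        model: {
            "hard_fail_rate": sum(fails[model].values()) / n,
            "by_kind": dict(fails[model]),
        }
        for model, n in totals.items()
    }
```

Reporting the rate alongside the breakdown keeps the "how often" and "which kinds" answers together, without folding them into a blended quality number.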

What this methodology does not claim

InvisibleBench is a conversation benchmark, not a full product audit.

It does not claim to fully measure:

  • product-level privacy and security,
  • app design or notification strategy,
  • sensitive-disclosure minimization outside conversation behavior,
  • or every possible downstream real-world outcome.

It also does not publish every judge prompt, threshold, or verifier detail. The public methodology is meant to explain the benchmark's logic and public claim surface, not make the benchmark easy to game.


  1. NIST. "AI 600-1."
  2. NAMI & Dr. John Torous. "AI Evaluation: 5 Criteria." 2026.
  3. National Health Council. "Patient Voice Report." 2026.
  4. Rosebud AI. "CARE Framework."
  5. Cheng et al. "Slow Drift of Support." 2026.
  6. Laban et al. "Lost in Conversation: Long-Context Unreliability in LLMs." arXiv:2505.06120, 2025.
  7. Liu et al. "Intent Mismatch in Multi-Turn Conversations." arXiv:2602.07338, 2026.
  8. AARP/NAC. "Caregiving in the United States 2025."
  9. ACL. "2024 Report to Congress on the 2022 National Strategy to Support Family Caregivers."
  10. "Drift-Bench: Cooperative Breakdowns in Conversational AI." arXiv:2602.02455, 2026.
  11. "ENGRAM: Episodic/Semantic/Procedural Typed Memory Evaluation." arXiv:2511.12960, 2025.
  12. "THEANINE: Timeline-Based Memory Retention." arXiv:2406.10996, 2024.
  13. American Psychiatric Association. "DSM-5-TR." 2022.
  14. World Health Organization. "ICD-11, QD85: Burnout." 2022.