The Telephone Game

June 25, 2026

An AI pipeline I built told me almost every company in a diversified portfolio was “High risk.” The truth spanned the entire scale - from an insulated specialty manufacturer that physically delivers product through a warehouse, to a genuinely AI-exposed services company whose whole revenue model is the kind of work a model can now do. Same chip. “High.”

The flattening wasn’t the bug. The flattening was the symptom. The bug was structural, and once I saw it I couldn’t unsee it in any multi-stage LLM system I looked at afterward.

Here’s the result that started it. The system scored each company two ways that should have moved together: a categorical severity chip (“Low / Med / High”) and a continuous axis position used for the portfolio map. Those two readouts agreed about 15% of the time. One of them - the chip - read “High” for nearly the whole book while the underlying axis score spread across the full range. Two numbers that were supposed to be two views of one judgment, and they disagreed nearly six times out of seven.

Client-confidential · sanitized: Chip-vs-axis agreement measured across the full portfolio on a sanitized production run.
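A minimal sketch of how a chip-vs-axis agreement rate like this could be computed. The band edges, the 0-1 axis scale, and the sample rows are all assumptions for illustration; the real thresholds and data are not given here.

```python
from bisect import bisect_right

# Hypothetical band edges on a 0-1 axis; the real thresholds are not stated in the essay.
BAND_EDGES = [1 / 3, 2 / 3]
BAND_LABELS = ["Low", "Med", "High"]


def band_for(axis_score: float) -> str:
    """Map a continuous axis score to the severity label it implies."""
    return BAND_LABELS[bisect_right(BAND_EDGES, axis_score)]


def agreement_rate(rows: list[tuple[str, float]]) -> float:
    """Fraction of companies whose emitted chip matches the band implied by the axis."""
    if not rows:
        raise ValueError("no rows to compare")
    matches = sum(1 for chip, axis in rows if chip == band_for(axis))
    return matches / len(rows)


# Made-up rows: every chip reads "High" while the axis spreads across the range.
sample = [("High", 0.10), ("High", 0.45), ("High", 0.55), ("High", 0.90)]
print(agreement_rate(sample))  # 0.25
```

A low rate from a check like this says nothing about which readout is right, only that the two are not views of the same judgment.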

The instinct is to go fix the chip. Tune the rubric, adjust the thresholds, add a banned-words list. I did some of that. It didn’t hold, because I was treating a symptom of the architecture as if it were a property of the prompt.

The mechanism

The pipeline was a relay. A judge model reasoned richly about each company once - and I mean richly; the per-company reasoning trace ran into the hundreds of thousands of tokens of evidence-weighing. Then that judgment was handed down a chain of stages. The chip got derived. The axis got derived separately. The prose narration got derived separately again. The portfolio rollup aggregated from yet another derivation.

Every stage took the previous stage’s output - a lossy, already-compressed copy - and re-derived its own flattened verdict from it. That’s the telephone game. Each whisper is a re-derivation from the last whisper, not from the original sentence. By the time the judgment reached the chip, it had been collapsed and re-collapsed enough times that “High” was the only word with enough gravity to survive every round. The chip and the axis disagreed because they were two independent re-derivations that had each lost different information on the way down. Two scoring loci. No single source of truth.

Schematic: illustrates the re-derivation mechanism - not measured data.

Practitioners have a name for the symptom - “context collapse” - and the research has a related one for why long contexts degrade: “lost in the middle.” Both are real. But naming the symptom doesn’t fix it. The fix is an architectural principle, not a longer context window.

How much did we actually lose?

This is the part I have to be honest about, because it’s where I was wrong, and the correction is the whole reason I trust the rest of the story.

My first estimate of the loss was dramatic: roughly 400×. The judge reasoned over hundreds of thousands of tokens; the chip that came out the far end was effectively one of four bins. I did the division, and 400× compression made a great headline. I almost shipped it.

Then I built an adversarial prototype to measure the loss directly instead of estimating it - to actually trace what information from the judge’s reasoning was reachable by each downstream stage. The real number was 6 to 13×. Not 400×.

Client-confidential · sanitized: Estimated 400× against the measured 6–13×, from an adversarial reachability prototype on a sanitized portfolio.

I was off by more than an order of magnitude, and the reason matters more than the miss: most of that “lost” reasoning was never physically available to the downstream stages in the first place. The richest part of the judge’s analysis lived in a place the chip-deriving stage couldn’t read. So I’d been counting a loss that the architecture never actually incurred at that seam - the compression was real, but an order of magnitude smaller than my scary number, and concentrated at a different point than I’d claimed.

I corrected the number publicly, in the writeup, before anyone caught it. That’s not a virtue flex. It’s the load-bearing move. If I’ll inflate a 6–13× loss into a 400× loss to make a point, you can’t trust any other number I give you. So: 6–13×, measured, not 400× estimated. The architecture problem is real either way. The magnitude was smaller than my gut, and my gut is not evidence.

The fix: assess once, project and narrate, never re-derive

If the disease is “every stage re-derives from a lossy copy,” the cure is to stop re-deriving.

Assess once, richly. Emit one canonical Assessment Object - the full structured judgment, not a flattened bin. Then every downstream consumer reads from that one object: the chip, the axis, the prose narration, and the portfolio rollup.

They cohere by construction now, because there is one source of truth and exactly one locus of nondeterminism: the judge’s qualitative binning, up front. Everything after that is plumbing - typed, pure, and testable. The chip and axis can’t disagree 85% of the time anymore, because they’re two projections of the same number instead of two independent guesses at it.

Schematic: target architecture - assess once, then project and narrate.
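One way to picture the target shape is a single immutable record with pure functions hanging off it. This is a sketch under stated assumptions, not the production schema: the `Assessment` fields, the band floors, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical band floors on a 0-1 exposure scale (not the essay's real thresholds).
SEVERITY_BANDS = (("Low", 0.0), ("Med", 1 / 3), ("High", 2 / 3))


@dataclass(frozen=True)
class Assessment:
    """The one canonical judgment per company; field names are illustrative."""
    company_id: str
    exposure: float  # the single score every consumer reads, assumed to lie in [0, 1]
    rationale: str   # the judge's summary, carried along rather than re-derived


def project_axis(a: Assessment) -> float:
    """Axis position is the canonical score itself."""
    return a.exposure


def project_chip(a: Assessment) -> str:
    """Chip is the highest band whose floor the score reaches."""
    label = SEVERITY_BANDS[0][0]
    for name, floor in SEVERITY_BANDS:
        if a.exposure >= floor:
            label = name
    return label


def narrate(a: Assessment) -> str:
    """Narration quotes the projections instead of forming its own verdict."""
    return f"{a.company_id}: {project_chip(a)} exposure ({project_axis(a):.2f}). {a.rationale}"


def rollup(book: list[Assessment]) -> dict[str, int]:
    """Portfolio rollup counts chips produced by the same projection."""
    counts = {name: 0 for name, _ in SEVERITY_BANDS}
    for a in book:
        counts[project_chip(a)] += 1
    return counts


book = [
    Assessment("maker", 0.12, "Physical delivery insulates it."),
    Assessment("services", 0.88, "Core revenue is work a model can do."),
]
print(narrate(book[0]))  # maker: Low exposure (0.12). Physical delivery insulates it.
print(rollup(book))      # {'Low': 1, 'Med': 0, 'High': 1}
```

Because every consumer is a pure function of the same frozen record, a chip-vs-axis disagreement can only come from a bug in a projection, which is something an ordinary unit test can catch.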

One concrete piece of this: the projection lives in a single frozen substrate - one module, with the severity bands as a literal array - that every consumer imports. Not four copies of the binning logic that drift apart over six weeks. One. A git diff --exit-code on that file is a drift alarm. When the chip-vs-band logic drifted anyway - twice, during the migration - the negative regression tests on that frozen module caught it both times.
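A negative regression test on such a module might look like the sketch below. The band values, the `band_of` helper, and the test cases are assumptions for illustration; the idea is only that each case names a label a score must never receive, so a drifted copy of the bands fails loudly.

```python
# Hypothetical frozen substrate: the severity bands as one literal array.
SEVERITY_BANDS = [("Low", 0.0), ("Med", 1 / 3), ("High", 2 / 3)]


def band_of(score: float, bands=SEVERITY_BANDS) -> str:
    """Return the highest band whose floor the score reaches."""
    label = bands[0][0]
    for name, floor in bands:
        if score >= floor:
            label = name
    return label


# Negative cases: (score, label that score must NOT receive).
NEGATIVE_CASES = [
    (0.10, "High"),
    (0.10, "Med"),
    (0.50, "High"),
    (0.50, "Low"),
    (0.90, "Low"),
]


def violations(bands=SEVERITY_BANDS) -> list[tuple[float, str]]:
    """List every negative case the given bands violate; empty means no drift detected."""
    return [(s, bad) for s, bad in NEGATIVE_CASES if band_of(s, bands) == bad]


# A made-up drifted copy, of the kind a second implementation might accumulate.
drifted = [("Low", 0.0), ("Med", 0.05), ("High", 0.30)]

print(violations())         # []
print(violations(drifted))  # [(0.1, 'Med'), (0.5, 'High')]
```

The file-level diff check and tests like these cover different failures: the diff catches edits to the frozen module, while the negative cases catch a consumer that quietly stopped importing it.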

The honest caveats

Coherence is not accuracy, and this fix only buys coherence. Making the chip and the axis agree 100% of the time tells you about your plumbing, not your judgment. After the projections cohered perfectly, the assessment underneath could still be confidently wrong - and now it’s wrong consistently, which looks more trustworthy and is therefore more dangerous. To even ask “is it right?” I had to build a separate, independent ground-truth instrument. That’s a different pillar.

And the convergence itself isn’t free. Quarantining nondeterminism to one locus is a migration, not a refactor - write-always/read-conditional dual reads, a frozen shared substrate, shadow-then-warn-then-enforce, gold-set gates, degrade-don’t-throw at every seam. The single-locus property is the load-bearing invariant; lose it and you’re back to telephone with extra steps.
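The rollout stages named above can be sketched as a small read guard. This is one interpretation, assuming the canonical chip is always written but only served in the final stage; `Mode`, `read_chip`, and the logging choices are hypothetical names, not the actual migration code.

```python
import logging
from enum import Enum
from typing import Optional

log = logging.getLogger("migration")


class Mode(Enum):
    SHADOW = "shadow"    # compare silently, serve legacy
    WARN = "warn"        # compare loudly, serve legacy
    ENFORCE = "enforce"  # serve canonical


def read_chip(legacy_chip: str, canonical_chip: Optional[str], mode: Mode) -> str:
    """Pick the chip to serve for one company during the migration."""
    if canonical_chip is None:
        # Degrade, don't throw: a missing canonical value falls back to legacy.
        log.warning("canonical chip missing; serving legacy value %r", legacy_chip)
        return legacy_chip
    if canonical_chip != legacy_chip:
        level = logging.DEBUG if mode is Mode.SHADOW else logging.WARNING
        log.log(level, "chip mismatch: legacy=%r canonical=%r", legacy_chip, canonical_chip)
    return canonical_chip if mode is Mode.ENFORCE else legacy_chip


print(read_chip("High", "Low", Mode.SHADOW))   # High
print(read_chip("High", "Low", Mode.ENFORCE))  # Low
print(read_chip("High", None, Mode.ENFORCE))   # High
```

Keeping the fallback inside the guard means a gap in the new path shows up as a logged degradation rather than a crashed report.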

But the architecture lesson stands on its own, and it’s the one to take: in a multi-stage LLM system, the number of times you re-derive the same judgment is a defect, not a feature. Every re-derivation is another whisper down the line. Assess once. Project and narrate. Never re-derive.

Update: the loss you assumed was fixed

Months after the architecture fix, a diligence cross-check surfaced a third loss - and it was scarier than the first, because it lived in the one hop the whole calibration effort had treated as a fixed, trustworthy spine: how the company gets segmented in the first place.

One company was emitted as 2 segments on a channel axis - wholesale and retail - with an exposure variance of exactly 0.00. Flat by construction. The same upstream research, segmented by workflow instead of channel, produced 5 segments that actually discriminated: an insulated core-manufacturing unit next to customer-facing units that a model genuinely threatens. The information existed. The pipeline picked the wrong axis, capped the count, and padded the rest with generic filler - and nothing downstream could recover what was frozen at that first hop.

Client-confidential · sanitized: Segment counts and exposure variance from the same source research, sanitized - channel axis (2) against workflow axis (5).
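A check for this kind of flat segmentation can be very small. The sketch below uses population variance over per-segment exposure scores; the segment names beyond wholesale and retail, and every score shown, are made-up values for illustration.

```python
from statistics import pvariance


def exposure_variance(segments: dict[str, float]) -> float:
    """Population variance of exposure across a company's segments."""
    return pvariance(list(segments.values()))


def is_flat(segments: dict[str, float], tol: float = 1e-9) -> bool:
    """True when the segmentation does not discriminate between segments at all."""
    return exposure_variance(segments) <= tol


# Illustrative scores only; the real values are client-confidential.
by_channel = {"wholesale": 0.60, "retail": 0.60}
by_workflow = {
    "core manufacturing": 0.10,
    "order intake": 0.70,
    "customer support": 0.85,
    "marketing content": 0.80,
    "field service": 0.30,
}

print(is_flat(by_channel))   # True
print(is_flat(by_workflow))  # False
```

Run at the segmentation hop, a guard like this flags a structure that carries no signal before any downstream calibration is spent on it.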

Here’s the line that should haunt anyone running an eval pipeline: we had spent weeks calibrating the model’s score to half-a-bin precision - while upstream, the company was being silently flattened to two buckets it didn’t fit. We were perfectly calibrating a number computed over the wrong structure.

That’s the generalized telephone game. It isn’t only that stages re-derive from lossy copies. It’s that you can be losing information at a hop you never audited because you assumed it was fixed - and the more rigorously you calibrate downstream, the more confidently wrong the whole thing gets. (The completeness sweep that found it also turned up a latent id-collision bug that silently drops real units and orphans their references - a guard covering two of seven join paths, never throwing. A silent drop is the purest telephone loss there is.)
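The silent-drop failure is easy to reproduce with an ordinary dictionary index. The sketch below is an assumption about the general shape of such a bug, not the actual one: the unit records, ids, and function names are hypothetical.

```python
from collections import Counter


def naive_index(units: list[dict]) -> dict[str, dict]:
    """Index units by id; a later unit silently overwrites an earlier one with the same id."""
    return {u["id"]: u for u in units}


def completeness_report(units: list[dict]) -> dict:
    """Report id collisions and how many units an id-keyed index would drop."""
    counts = Counter(u["id"] for u in units)
    colliding = sorted(uid for uid, n in counts.items() if n > 1)
    return {"colliding_ids": colliding, "dropped": len(units) - len(counts)}


# Made-up units: two distinct segments share one id.
units = [
    {"id": "seg-1", "name": "core manufacturing"},
    {"id": "seg-2", "name": "order intake"},
    {"id": "seg-2", "name": "customer support"},
]

print(len(naive_index(units)))     # 2
print(completeness_report(units))  # {'colliding_ids': ['seg-2'], 'dropped': 1}
```

The index builds without error and simply contains one unit fewer, which is why a count-in versus count-out check has to sit on every join path rather than on a subset of them.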

The structural fix is the same shape as the original north star: structure once, project per consumer, never re-pick or re-bound.

References

  1. Liu, N. F. et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (TACL 2024). Models use the beginning and end of a long context well but degrade sharply on information buried in the middle - so a longer context window is not the same as a usable one.
  2. OpenAI Developer Community. “Temperature in GPT-5 models.” Reasoning models reject a temperature knob - only the default is accepted - so you cannot buy determinism by turning temperature down.
  3. Microsoft Learn. “How to generate reproducible output with Azure OpenAI.” A seed is best-effort: identical output is not guaranteed across system or hardware changes. Reproducibility has to be engineered at the seams, not toggled on.

Further reading

  • Haldar, R. & Hockenmaier, J. (2025). “Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks.” arXiv:2510.27106 (EMNLP 2025). LLM judges have low intra-rater reliability across runs - “almost arbitrary in the worst case” - which is why judge variance has to be measured, not assumed away.
  • (2025). “An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability.” arXiv:2506.13639. Mean-of-scores tracks human judgment better than median or majority voting; sampled decoding beats greedy; extreme-anchor rubrics are nearly as good as full ones. A practical map of which judge-design choices actually move reliability.
  • Verga, P. et al. (2024). “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.” arXiv:2404.18796. A panel of smaller, diverse models (PoLL) outperforms a single large judge at roughly 7x lower cost and shows less intra-model bias, because disjoint model families do not share the same blind spots.