Learn in Public

Reading

The work in these essays stands on a lot of other people's. This is the running library underneath it - the papers and references the methodology is built on, with a line on why each one matters. It is the same bibliography the essays cite, kept in one place so you can read past my framing to the sources themselves.

Evaluation & LLM-as-a-judge

How to score model output reliably - panels over single judges, the reliability of LLM judges, and where consensus helps and where it stops.

Verga, P. et al. (2024). “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.” arXiv:2404.18796.

A panel of smaller, diverse models (PoLL) outperforms a single large judge at over 7x lower cost and shows less intra-model bias, because disjoint model families do not share the same blind spots.
(2025). “An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability.” arXiv:2506.13639.

Mean-of-scores tracks human judgment better than median or majority voting; sampled decoding beats greedy; extreme-anchor rubrics are nearly as good as full ones. A practical map of which judge-design choices actually move reliability.
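To make the aggregation choice concrete, here is a minimal sketch of the three rules applied to one item's repeated judge scores. The scores are made up and the `aggregate` helper is hypothetical; it illustrates how the rules differ, not the paper's experimental setup.

```python
from collections import Counter
from statistics import mean, median


def aggregate(scores):
    """Summarize repeated judge scores for one item three ways."""
    counts = Counter(scores)
    top = max(counts.values())
    # Majority vote: the most frequent score; ties go to the lower score
    # (an arbitrary tie-break chosen for this sketch).
    majority = min(s for s, c in counts.items() if c == top)
    return {"mean": mean(scores), "median": median(scores), "majority": majority}


# Hypothetical 1-5 rubric scores from five sampled judge runs.
samples = [3, 3, 3, 4, 5]
result = aggregate(samples)
print(result)
```

Median and majority both collapse to 3 and discard the two higher samples; the mean keeps that signal, which is one plausible reading of why it would track human judgment more closely.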
(2025). “Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs.” arXiv:2511.00751.

On strong models, self-consistency gains are small (about 0.4-1.6%) and plateau by 10-15 samples while cost scales linearly - consensus tightens the spread, not the center.
Haldar, R. & Hockenmaier, J. (2025). “Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks.” arXiv:2510.27106 (EMNLP 2025).

LLM judges have low intra-rater reliability across runs - "almost arbitrary in the worst case" - which is why judge variance has to be measured, not assumed away.
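One simple way to start measuring judge variance is to re-run the same judge on the same item and count how often two runs agree. The sketch below assumes hypothetical item names and scores, and uses raw pairwise agreement, which is not chance-corrected and is not necessarily the statistic the paper reports.

```python
from itertools import combinations


def pairwise_agreement(runs):
    """Fraction of run pairs that gave the same score to one item."""
    pairs = list(combinations(runs, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Hypothetical scores from four repeated runs of one judge per item.
repeat_scores = {
    "item-1": [4, 4, 4, 4],
    "item-2": [2, 4, 3, 4],
}
agreement = {item: pairwise_agreement(runs) for item, runs in repeat_scores.items()}
print(agreement)
```

An item like `item-2`, where only one pair of runs out of six agrees, is the kind of case that a single judge call would hide.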

Measurement & validity

The older science underneath the scores: comparative judgment, paired-comparison models, construct validity, and when a number is even allowed to be averaged.

Krippendorff, K. “Content Analysis: An Introduction to Its Methodology” (and Krippendorff's alpha).

A reliability coefficient where alpha >= 0.80 is "reliable enough to draw conclusions" and 0.667 <= alpha < 0.80 supports only tentative ones - a principled gate for when agreement is strong enough to trust.
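As a rough sketch of how that gate could be applied, the code below computes alpha for nominal labels from the coincidence-matrix definition and maps it onto the 0.80 and 0.667 thresholds. The function names, the gate labels, and the example ratings are all assumptions for illustration; ordinal or interval data would need a different distance function.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list of labels per rated item; items with fewer than
    two labels are skipped because they cannot be paired.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Disagreement actually observed vs. disagreement expected by chance.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1 - observed / expected


def gate(alpha):
    """Map alpha onto the thresholds quoted above (label names are hypothetical)."""
    if alpha >= 0.80:
        return "reliable"
    if alpha >= 0.667:
        return "tentative"
    return "unreliable"


# Hypothetical: four items, each labelled by two judges.
ratings = [["a", "a"], ["b", "b"], ["a", "b"], ["a", "a"]]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3), gate(alpha))
```

Three of four items agree here, yet alpha lands well below the tentative threshold, which shows how much stricter a chance-corrected coefficient is than raw percent agreement on a small sample.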
MeasuringU; Statistics By Jim “Can You Take the Mean of Ordinal Data? / Analyzing Likert Scale Data.”

Averaging ordinal codes assumes equal intervals that do not exist; median and quantile summaries are preferred. This is why a Low/Med/High scale should not be collapsed into a mean.
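A small sketch of the alternative, assuming a hypothetical three-level scale: summarize by the median level (and, if needed, the full distribution) instead of a mean of integer codes.

```python
from collections import Counter
from statistics import median_low

# Hypothetical ordinal scale, listed from lowest to highest.
LEVELS = ["Low", "Med", "High"]


def ordinal_median(labels):
    """Median level of ordinal labels; always returns a level that was used."""
    ranks = sorted(LEVELS.index(label) for label in labels)
    return LEVELS[median_low(ranks)]


labels = ["Low", "Low", "High", "High", "High"]
summary = ordinal_median(labels)
distribution = Counter(labels)
print(summary, dict(distribution))
```

Coding these labels 0/1/2 and averaging gives 1.2, which rounds to "Med", a level no rater chose. The median reports "High" and the distribution shows the split.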
Thurstone, L. L. (1927). “A Law of Comparative Judgment.” Psychological Review 34(4):273-286.

Absolute ratings compress a latent trait (scale-usage bias), but it can be recovered from pairwise comparisons - the theoretical basis for preferring comparative or anchored judgments over absolute bins.
Bradley, R. A. & Terry, M. E. (1952). “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.” Biometrika 39(3/4):324-345.

The paired-comparison model that recovers a latent scale from pairwise wins - the practical tool for turning comparisons back into a continuous score.
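For a sense of how pairwise wins become a continuous score, here is a minimal sketch that fits Bradley-Terry strengths with a standard minorization-maximization iteration. That fitting routine is a common choice I am swapping in for illustration, not something taken from the 1952 paper. The win counts are made up, and the sketch assumes every item has at least one win and one loss.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a win-count matrix.

    wins[i][j] = number of times item i beat item j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        scale = sum(updated)
        p = [x / scale for x in updated]
    return p


# Hypothetical: item 0 beat item 1 three times and lost once.
two_items = bradley_terry([[0, 3], [1, 0]])

# Hypothetical three-way comparison counts.
three_items = bradley_terry([[0, 4, 5], [2, 0, 4], [1, 2, 0]])
print(two_items, three_items)
```

Under the model, the probability that item i beats item j is p_i / (p_i + p_j), so the two-item fit of 0.75 and 0.25 reproduces the observed 3-to-1 record.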
Messick, S. (1995). “Validity of Psychological Assessment.” American Psychologist 50(9):741-749.

Construct under-representation and construct-irrelevant variance as the core threats to measurement validity - the frame for "your instrument may be blind to part of what it is scoring."

Determinism & long context

Why you cannot simply dial in reproducibility, and how long contexts degrade in ways a bigger window does not fix.

OpenAI Developer Community “Temperature in GPT-5 models.”

GPT-5 reasoning models reject a custom temperature - only the default is accepted - so you cannot buy determinism by turning temperature down.
Microsoft Learn “How to generate reproducible output with Azure OpenAI.”

A seed is best-effort: identical output is not guaranteed across system or hardware changes. Reproducibility has to be engineered at the seams, not toggled on.
Liu, N. F. et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (TACL 2024).

Models use the beginning and end of a long context well but degrade sharply on information buried in the middle - so a longer context window is not the same as a usable one.
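A position effect like this can be probed by placing one relevant passage at different depths among distractors and checking whether the answer survives. The helper below is a hypothetical sketch of the prompt-construction step only; the passage strings are placeholders and the model call is left out.

```python
def place_at_depth(distractors, needle, depth):
    """Insert `needle` among `distractors` at a fractional depth.

    depth 0.0 puts it first, 1.0 puts it last, 0.5 puts it in the middle.
    """
    index = round(depth * len(distractors))
    return distractors[:index] + [needle] + distractors[index:]


# Hypothetical passages; a real probe would use retrieved documents.
distractors = ["d0", "d1", "d2", "d3"]
contexts = {
    depth: place_at_depth(distractors, "NEEDLE", depth)
    for depth in (0.0, 0.5, 1.0)
}
print(contexts[0.5])
```

Scoring the same question across the three contexts gives a first read on whether accuracy dips when the passage sits in the middle.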