Learn in Public
Reading
The work in these essays stands on a lot of other people's. This is the running library underneath it - the papers and references the methodology is built on, with a line on why each one matters. It is the same bibliography the essays cite, kept in one place so you can read past my framing to the sources themselves.
Evaluation & LLM-as-a-judge
How to score model output reliably - panels over single judges, the reliability of LLM judges, and where consensus helps and where it stops.
-
Verga, P. et al. (2024). “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.” arXiv:2404.18796.
A panel of smaller, diverse models (PoLL) outperforms a single large judge at roughly 7x lower cost and shows less intra-model bias, because disjoint model families do not share the same blind spots.
-
(2025). “An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability.” arXiv:2506.13639.
Mean-of-scores tracks human judgment better than median or majority voting; sampled decoding beats greedy; extreme-anchor rubrics are nearly as good as full ones. A practical map of which judge-design choices actually move reliability.
-
(2025). “Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs.” arXiv:2511.00751.
On strong models, self-consistency gains are small (about 0.4-1.6%) and plateau by 10-15 samples while cost scales linearly - consensus tightens the spread, not the center.
-
Haldar, R. & Hockenmaier, J. (2025). “Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks.” arXiv:2510.27106 (EMNLP 2025).
LLM judges have low intra-rater reliability across runs - "almost arbitrary in the worst case" - which is why judge variance has to be measured, not assumed away.
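As a rough illustration of two ideas from the entries above, the sketch below aggregates one item's panel scores three ways (mean, median, majority vote) and measures how far a single judge's score moves across repeated runs of the same item. The function names, the 1-5 rubric, and the sample scores are hypothetical and chosen only for illustration; none of it is taken from the cited papers.

```python
from collections import Counter
from statistics import mean, median, pstdev


def aggregate(scores):
    """Summarise one item's panel scores as mean, median, and majority vote."""
    counts = Counter(scores)
    top = max(counts.values())
    # Assumption: ties in the vote are broken toward the lower score.
    majority = min(s for s, c in counts.items() if c == top)
    return {"mean": mean(scores), "median": median(scores), "majority": majority}


def run_spread(repeated_scores):
    """Population standard deviation of one judge's scores across repeated runs."""
    return pstdev(repeated_scores)


# Hypothetical data: three judges score one answer on a 1-5 rubric.
panel = [4, 4, 2]
summary = aggregate(panel)

# Hypothetical data: the same judge scores the same answer five times.
repeats = [3, 5, 2, 4, 3]
spread = run_spread(repeats)

print(summary)
print(round(spread, 3))
```

Tracking the per-item spread alongside the aggregate is one simple way to measure judge variance rather than assume it away; a real setup would likely report it over many items, not one.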
Measurement & validity
The older science underneath the scores: comparative judgment, paired-comparison models, construct validity, and when a number is even allowed to be averaged.
-
Krippendorff, K. “Content Analysis: An Introduction to Its Methodology (and Krippendorff's alpha).”
A reliability coefficient where alpha >= 0.80 is "reliable enough to draw conclusions" and around 0.667 supports only tentative ones - a principled gate for when agreement is strong enough to trust.
-
MeasuringU; Statistics By Jim “Can You Take the Mean of Ordinal Data? / Analyzing Likert Scale Data.”
Averaging ordinal codes assumes equal intervals that do not exist; median and quantile summaries are preferred. This is why a Low/Med/High scale should not be reduced to a mean.
-
Thurstone, L. L. (1927). “A Law of Comparative Judgment.” Psychological Review 34(4):273-286.
A latent trait can be recovered from pairwise comparisons, whereas absolute ratings compress it (scale-usage bias) - the theoretical basis for preferring comparative or anchored judgments over absolute bins.
-
Bradley, R. A. & Terry, M. E. (1952). “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.” Biometrika 39(3/4):324-345.
The paired-comparison model that recovers a latent scale from pairwise wins - the practical tool for turning comparisons back into a continuous score.
-
Messick, S. (1995). “Validity of Psychological Assessment.” American Psychologist 50(9):741-749.
Construct under-representation and construct-irrelevant variance as the core threats to measurement validity - the frame for "your instrument may be blind to part of what it is scoring."
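To make the paired-comparison entry above concrete, here is a minimal sketch of fitting Bradley-Terry strengths from a table of pairwise wins. It uses the standard iterative (minorize-maximize) update, which is one common way to fit the model and not necessarily how the 1952 paper presents it. The function name, the iteration count, and the win counts are hypothetical. It assumes every item has at least one win and one loss; without that the estimates degenerate.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of times item i beat item j.
    Returns a list of strengths normalised to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]
    return p


# Hypothetical comparisons between three outputs A, B, C.
wins = [
    [0, 7, 9],  # A beat B 7 times and C 9 times
    [3, 0, 6],  # B beat A 3 times and C 6 times
    [1, 4, 0],  # C beat A once and B 4 times
]
strengths = fit_bradley_terry(wins)
print([round(s, 3) for s in strengths])
```

Under the model, the probability that item i beats item j is p_i / (p_i + p_j), so the fitted strengths place the items on one continuous scale instead of in absolute bins.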
Determinism & long context
Why you cannot simply dial in reproducibility, and how long contexts degrade in ways a bigger window does not fix.
-
OpenAI Developer Community “Temperature in GPT-5 models.”
Reasoning models reject a temperature knob - only the default is accepted - so you cannot buy determinism by turning temperature down.
-
Microsoft Learn “How to generate reproducible output with Azure OpenAI.”
A seed is best-effort: identical output is not guaranteed across system or hardware changes. Reproducibility has to be engineered at the seams, not toggled on.
-
Liu, N. F. et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (TACL 2024).
Models use the beginning and end of a long context well but degrade sharply on information buried in the middle - so a longer context window is not the same as a usable one.
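One mitigation that follows from the positional effect described above, offered as a sketch and not as something the paper prescribes, is to reorder retrieved passages so the most relevant ones sit at the start and end of the prompt and the weakest land in the middle. The function name and passage labels are hypothetical, and the input is assumed to be already sorted most-relevant first.

```python
def reorder_for_edges(passages_by_relevance):
    """Place the most relevant passages at the edges of the context.

    Input is assumed sorted most-relevant first. Items are dealt
    alternately to the front and the back, so the least relevant
    passages end up in the middle of the returned list.
    """
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        if i % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]


# Hypothetical passages, r1 most relevant and r5 least.
ordered = reorder_for_edges(["r1", "r2", "r3", "r4", "r5"])
print(ordered)
```

Whether this helps for a given model and task is an empirical question; it changes where information sits, not how much of the window the model can actually use.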