Learn in Public

Reading

The work in these essays stands on a lot of other people's. This is the running library underneath it - the papers and references the methodology is built on, with a line on why each one matters. It is the same bibliography the essays cite, kept in one place so you can read past my framing to the sources themselves.

Evaluation & LLM-as-a-judge

How to score model output reliably - panels over single judges, the reliability of LLM judges, and where consensus helps and where it stops.

  • Verga, P. et al. (2024). “Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models.” arXiv:2404.18796. link ↗

    A panel of smaller, diverse models (PoLL) outperforms a single large judge at roughly 7x lower cost and shows less intra-model bias, because disjoint model families do not share the same blind spots.

  • (2025). “An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability.” arXiv:2506.13639. link ↗

    Mean-of-scores tracks human judgment better than median or majority voting; sampled decoding beats greedy; extreme-anchor rubrics are nearly as good as full ones. A practical map of which judge-design choices actually move reliability.

  • (2025). “Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs.” arXiv:2511.00751. link ↗

    On strong models, self-consistency gains are small (about 0.4-1.6%) and plateau by 10-15 samples while cost scales linearly - consensus tightens the spread, not the center.

  • Haldar, R. & Hockenmaier, J. (2025). “Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks.” arXiv:2510.27106 (EMNLP 2025). link ↗

    LLM judges have low intra-rater reliability across runs - "almost arbitrary in the worst case" - which is why judge variance has to be measured, not assumed away.
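
The aggregation and variance points above can be made concrete with a small sketch. It assumes each item already has a list of integer scores, either from several judges or from repeated runs of one judge; the `aggregate` helper and the example scores are hypothetical and do not come from any of the cited papers.

```python
from collections import Counter
from statistics import mean, median, pstdev


def aggregate(scores):
    """Summarize one item's scores from several judges or repeated runs."""
    counts = Counter(scores)
    top = max(counts.values())
    # Break majority ties toward the lower score so the result is deterministic.
    majority = min(s for s, c in counts.items() if c == top)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "majority": majority,
        # Population standard deviation: a simple measure of disagreement.
        "spread": pstdev(scores),
    }


# Hypothetical 1-5 scores for one answer from a five-member panel.
panel_scores = [4, 4, 5, 2, 4]
summary = aggregate(panel_scores)
print(summary)
```

Here the median and majority vote both land on 4 and hide the dissenting 2, while the mean and the spread both register it. Which summary is appropriate depends on whether the scale can reasonably be treated as interval, which is the question the next section takes up.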

Measurement & validity

The older science underneath the scores: comparative judgment, paired-comparison models, construct validity, and when a number is even allowed to be averaged.

  • Krippendorff, K. “Content Analysis: An Introduction to Its Methodology (and Krippendorff's alpha).” link ↗

    A reliability coefficient with conventional thresholds: alpha >= 0.80 is "reliable enough to draw conclusions," while values between about 0.667 and 0.80 support only tentative ones - a principled gate for when agreement is strong enough to trust.

  • MeasuringU; Statistics By Jim “Can You Take the Mean of Ordinal Data? / Analyzing Likert Scale Data.” link ↗

    Averaging ordinal codes assumes equal intervals between categories, which an ordinal scale does not guarantee; median and quantile summaries are preferred. This is why a Low/Med/High scale should not be averaged into a mean.

  • Thurstone, L. L. (1927). “A Law of Comparative Judgment.” Psychological Review 34(4):273-286. link ↗

    A latent trait can be recovered from pairwise comparisons, whereas absolute ratings compress it (scale-usage bias) - the theoretical basis for preferring comparative or anchored judgments over absolute bins.

  • Bradley, R. A. & Terry, M. E. (1952). “Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons.” Biometrika 39(3/4):324-345. link ↗

    The paired-comparison model that recovers a latent scale from pairwise wins - the practical tool for turning comparisons back into a continuous score.

  • Messick, S. (1995). “Validity of Psychological Assessment.” American Psychologist 50(9):741-749. link ↗

    Construct under-representation and construct-irrelevant variance as the core threats to measurement validity - the frame for "your instrument may be blind to part of what it is scoring."
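
As an illustration of turning pairwise wins back into a continuous score, here is a minimal sketch of a Bradley-Terry fit using the commonly used minorization-maximization (MM) update. The function name, the iteration count, and the win matrix are assumptions for illustration; it also assumes a zero diagonal and that every item has at least one win and one loss.

```python
def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i beat item j (diagonal is 0).
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pair contributes its comparison count over the summed strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p


# Hypothetical results: three outputs, ten comparisons per pair.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
# Under the model, P(i beats j) = p_i / (p_i + p_j).
p_0_beats_1 = strengths[0] / (strengths[0] + strengths[1])
print(strengths, p_0_beats_1)
```

The fitted strengths are only defined up to a common scale, which is why the sketch normalizes them; the ratios between items are what carry the information.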

Determinism & long context

Why you cannot simply dial in reproducibility, and how long contexts degrade in ways a bigger window does not fix.

  • OpenAI Developer Community “Temperature in GPT-5 models.” link ↗

    Reasoning models reject a temperature knob - only the default is accepted - so you cannot buy determinism by turning temperature down.

  • Microsoft Learn “How to generate reproducible output with Azure OpenAI.” link ↗

    A seed is best-effort: identical output is not guaranteed across system or hardware changes. Reproducibility has to be engineered at the seams, not toggled on.

  • Liu, N. F. et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv:2307.03172 (TACL 2024). link ↗

    Models use the beginning and end of a long context well but degrade sharply on information buried in the middle - so a longer context window is not the same as a usable one.
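
One way to check the position effect on your own setup is to sweep a known fact through different slots in the context and compare retrieval accuracy at each slot. The sketch below only builds the contexts; the function name, the document labelling, and the example strings are hypothetical, and the model call and scoring are left out.

```python
def build_position_sweep(needle, distractors, positions):
    """Return {position: context} with the needle inserted at each index."""
    contexts = {}
    for pos in positions:
        if not 0 <= pos <= len(distractors):
            raise ValueError(f"position {pos} out of range")
        # Slice so the original distractor list is never mutated.
        docs = distractors[:pos] + [needle] + distractors[pos:]
        contexts[pos] = "\n\n".join(
            f"Document {i + 1}: {doc}" for i, doc in enumerate(docs)
        )
    return contexts


# Hypothetical needle and filler passages.
needle = "The access code is 7431."
distractors = [f"Filler passage {k}." for k in range(1, 5)]
sweep = build_position_sweep(needle, distractors, positions=[0, 2, 4])
print(sweep[2])
```

Asking the same question against each context and plotting accuracy by position shows whether the middle slots underperform the first and last ones for a given model and context length.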