
Academic Literature

Tally is heavily inspired by these academic works - or rather, the LLM summaries of them from a research project in February '26.

So thankful to these researchers for solving a thorny problem and making their findings public!

“Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation (DeCE).” EMNLP 2025. arXiv:2509.16093.

LLM Summary: Breaking a holistic judgment (“rate relevance 0-1”) into 3-5 specific binary sub-questions more than doubles correlation with expert judgment: r=0.78 vs r=0.35 for holistic scoring.

Additionally, only 11.95% of LLM-generated sub-criteria required expert revision, meaning LLMs can help design the decomposition themselves. The approach also makes scoring transparent/debuggable — you can see exactly which sub-questions an item failed on.
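
A minimal sketch of what decomposed scoring in the spirit of DeCE can look like in practice. The sub-questions and the `ask_llm()` helper are hypothetical placeholders for illustration, not the paper's actual rubric or API.

```python
# Decomposed, criteria-based relevance scoring: several binary sub-questions
# instead of one holistic "rate relevance" prompt. Sub-questions and ask_llm()
# are illustrative assumptions, not the DeCE paper's rubric.

SUB_QUESTIONS = [
    "Does the item directly address the user's stated topic? (yes/no)",
    "Does it contain information the user has not already seen? (yes/no)",
    "Is the information current enough to act on? (yes/no)",
    "Is the source specific rather than generic? (yes/no)",
]

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM client the app uses."""
    raise NotImplementedError

def decomposed_relevance(item_text: str) -> float:
    """Score relevance as the fraction of binary sub-criteria answered 'yes'."""
    answers = []
    for question in SUB_QUESTIONS:
        reply = ask_llm(f"Item:\n{item_text}\n\n{question} Answer yes or no.")
        answers.append(reply.strip().lower().startswith("yes"))
    return sum(answers) / len(answers)
```

A side benefit the paper highlights: because each sub-question is answered separately, a low score is immediately explainable by listing which sub-criteria failed.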

Citation: Hashemi, H., et al. (2024). “LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts.” ACL 2024. arXiv:2501.00274. (ACL Anthology)

Rather than treating LLM outputs as scores, the authors treat them as features for a calibration model. Raw LLM scores are poorly calibrated, but the probability distributions over rubric response categories contain rich signal.

While the authors use a more sophisticated approach for assembling scores out of features, their overall idea of decomposition and reassembly influenced Tally's digest.
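
A minimal sketch of the "LLM outputs as features, not scores" idea, assuming the per-question probability distributions have already been extracted (e.g. from logprobs). The flattening scheme and the use of logistic regression are simplifying assumptions; the paper fits a more sophisticated calibrated network.

```python
# Treat rubric-question probability distributions as features for a small
# calibration model fit against expert judgments. Feature layout and the
# choice of logistic regression are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def features_from_distributions(dists: list[list[float]]) -> np.ndarray:
    """Flatten one probability distribution per rubric question into a feature vector."""
    return np.concatenate([np.asarray(d, dtype=float) for d in dists])

def fit_calibrator(dists_per_item, human_labels):
    """dists_per_item: per item, one distribution over answer categories per rubric question.
    human_labels: expert judgments (e.g. 0 = not relevant, 1 = relevant)."""
    X = np.stack([features_from_distributions(d) for d in dists_per_item])
    model = LogisticRegression(max_iter=1000)
    model.fit(X, human_labels)
    return model
```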

G-Eval (Liu et al., 2023, arXiv:2303.16634) — “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.”

This paper uses token probabilities (logprobs) over the candidate scoring tokens to compute a probability-weighted score, and shows that this achieves higher correlation with human judgments than the directly generated scores.
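
A minimal sketch of that weighting, assuming the logprobs over the score token have already been pulled from the LLM response (the exact field depends on the API in use; the input format here is an assumption).

```python
# G-Eval-style weighted scoring: instead of taking the single generated score
# token, weight each candidate score (here "1".."5") by its token probability.

import math

def weighted_score(score_token_logprobs: dict[str, float]) -> float:
    """Compute sum(score * p(score)) over candidate score tokens '1'..'5'."""
    probs = {tok: math.exp(lp) for tok, lp in score_token_logprobs.items()
             if tok.strip() in {"1", "2", "3", "4", "5"}}
    total = sum(probs.values())
    return sum(int(tok.strip()) * p for tok, p in probs.items()) / total

# Example: logprobs over the score token from one judgment call.
print(weighted_score({"4": -0.22, "5": -1.75, "3": -3.9}))  # ≈ 4.15
```

The effect is a continuous score that reflects the model's uncertainty, rather than a coarse integer that collapses it.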