About Tally

Tally is a tool for scoring text content using local LLMs. It lets you build robust evaluation rubrics through fluid conversations, and then uses log probabilities to produce reliable, calibrated, numerical scores.

Why Tally Exists

LLMs are great at giving qualitative feedback on text content, but the moment you ask them for a numerical score, their apparent intelligence evaporates. This is a fundamental constraint of the technology: it's unlikely anyone will ever build an LLM that gives a good score through plain token prediction.

However, academics have found robust ways around this limitation. The two main ideas Tally incorporates from the literature are using LLM-generated "yes/no" rubrics, and using the probabilities of different token outputs, rather than simply taking the single most likely output. See influencing academic literature...

How Tally Works

Tally helps you evaluate text content at scale by combining human judgment with machine efficiency:

Rubrics decompose human taste/judgement into an (often large) set of yes/no criteria called Predicates.
- If figuring that out sounds complicated, don't worry: Tally guides you through the process in a conversational, chatbot-style format. The chatbot elicits what you're trying to score, and you collaborate with it to create a good rubric.
- When the model evaluates a predicate, its response is read not only as a Yes or a No but also by the relative probabilities of Yes and No, which lets us understand and interpret the uncertainty in each response.
Rubrics can then be run over arbitrary content (docs, PDFs, webpages, you name it). Each predicate is evaluated against each piece of content, and the scores are then aggregated.
All scoring runs on local hardware via PoolDo, which keeps costs minimal (free if the hardware is your own) and accelerates compute.
Compare runs, view score distributions, and drill into individual results to understand how your evaluation works and tweak it until you're happy.

Then, you have a reliable, mechanistic way of scoring text!
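
To make the probability idea concrete, here is a minimal Python sketch of how a Yes/No log-probability pair could be turned into a predicate score, and how predicate scores could be combined. It is an illustration under stated assumptions, not Tally's actual implementation: the function names (yes_probability, aggregate_scores) are hypothetical, and the unweighted mean is only one plausible way to aggregate.

```python
import math


def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Return P(Yes) renormalized over just the Yes and No tokens.

    Assumption: the model exposes a log probability for each of the
    two answer tokens. Any probability mass on other tokens is ignored.
    """
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    p_yes = math.exp(logprob_yes - m)
    p_no = math.exp(logprob_no - m)
    return p_yes / (p_yes + p_no)


def aggregate_scores(predicate_scores: list[float]) -> float:
    """Combine per-predicate scores into one score.

    Assumption: an unweighted mean. A real rubric might weight
    predicates differently.
    """
    if not predicate_scores:
        raise ValueError("need at least one predicate score")
    return sum(predicate_scores) / len(predicate_scores)


# A model that puts 60% on "Yes" and 20% on "No" (the rest elsewhere)
# yields a predicate score of 0.6 / (0.6 + 0.2) = 0.75.
confident = yes_probability(math.log(0.6), math.log(0.2))

# A model split evenly between the two answers yields 0.5,
# which signals uncertainty rather than a hard Yes or No.
uncertain = yes_probability(math.log(0.4), math.log(0.4))

overall = aggregate_scores([confident, uncertain])
```

The point of the sketch is that a hard "Yes" from each of these two responses would look identical, while the probabilities separate a fairly confident answer (0.75) from a coin flip (0.5).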

Who It's For

Tally is built for anyone who needs to evaluate text content systematically — researchers validating datasets, teams auditing content quality, educators assessing student work, or developers testing LLM outputs.

You don't need any software background to run Tally or benefit from it, though familiarity with basic statistics will be helpful.

Tally is intended as shared infrastructure - I (the developer) use it in other projects as the "tinkering layer" when I need content scoring mechanisms. It is accessible over an API, and client libraries are coming soon.

About the Creator

Tally is a project by Grady Ward. You can learn more about Grady and his other work at grady.dev.