Tally

Score text rigorously.
Trust the numbers.

LLMs are terrible at giving quantitative scores, even when they can give good qualitative analysis. Tally uses rubrics and log-probs to build evaluation pipelines you can actually rely on - all running locally on your hardware.

Get Started

How It Works

01

Collect

Build a collection of items you want to evaluate - URLs, PDFs, DOCX, or blobs of text.

02

Qualify

You have a conversation with a chatbot that elicits what “good” means to you — then Tally turns that into a rubric with measurable criteria.

03

Iterate

You don't have to be a statistician to fine-tune your rubric - running evaluations and comparing results is a snap with intuitive tools.

04

Scale

Run evaluations at scale across fleets of M-series Macs via PoolDo - it scales to arbitrarily large corpora and rubrics.

Why Tally

“Just rate it 1–10.”

LLMs just can’t do that well. Tally decomposes scoring into binary rubric predicates and uses log-probs to extract real quantitative signal from language models.
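For the curious, here's a minimal sketch of the general technique - not Tally's actual implementation. Each rubric predicate is posed to a local model as a Yes/No question, and the log-probabilities of the "Yes" and "No" answer tokens are renormalized into a score between 0 and 1. The model choice, prompt format, and function names below are illustrative assumptions only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small local causal LM works for this sketch; the model choice is arbitrary.
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def predicate_score(text: str, predicate: str) -> float:
    """Score one binary rubric predicate as P("Yes") renormalized over Yes/No."""
    prompt = (
        f"Text:\n{text}\n\n"
        f"Question: {predicate}\n"
        "Answer with Yes or No.\nAnswer:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    p_yes, p_no = logprobs[yes_id].exp(), logprobs[no_id].exp()
    return (p_yes / (p_yes + p_no)).item()

def rubric_score(text: str, predicates: list[str]) -> float:
    """Average per-predicate probabilities into one quantitative score."""
    return sum(predicate_score(text, p) for p in predicates) / len(predicates)
```

Averaging per-predicate probabilities is just one way to aggregate; weighted rubrics follow the same pattern.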

“We changed the model. How did that change scores?”

Run-over-run comparison shows exactly how changes to your rubric or model affect your quantitative outcomes, and by how much. No more guessing after a prompt tweak or model upgrade.

“We have no way of comparing 10,000 outputs.”

Tally can do that. Once you've validated that a rubric aligns with your taste, scaling it is effortless. Takkt distributes evaluation across your fleet of M-series Macs. Score everything, not a sample.

“I don’t know how to write a good rubric.”

You don’t have to. Tally’s conversational rubric builder iterates with you to make sure the rubric encodes the criteria that matter to you.

What can you do when you measure text reliably?

Check out our examples below to see how folks use Tally in Scientific Publishing, Trust and Safety, Lead Generation, and Content Marketing.

Create an Account