June 2026 · AI Evaluation

Evaluating LLMs on Custom Benchmarks: Lessons from a Hybrid Evaluator

Large language models are impressive, but measuring their real capabilities requires more than simple string matching. We recently built a custom evaluation framework for a diverse “easy” benchmark covering pattern recognition, commonsense reasoning, logic, math, language, and knowledge. Here’s what we learned, and why per-category hybrid scoring matters.

The Benchmark

The dataset contains hundreds of straightforward questions across 15+ categories.

Each entry follows a simple JSON structure with category, difficulty, question, and answer. Many answers are semantic, making naive exact-match evaluation misleading.
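To make the structure concrete, here is a minimal sketch of loading and validating one entry. Only the four field names come from the description above; the example values (including which category the analogy question belongs to) and the `load_entry` helper are hypothetical.

```python
import json

# Hypothetical entry: the field names match the benchmark, the values are illustrative.
raw = """
{
  "category": "Pattern-matching",
  "difficulty": "easy",
  "question": "cat is to kitten as dog is to ?",
  "answer": "puppy"
}
"""

REQUIRED_FIELDS = {"category", "difficulty", "question", "answer"}


def load_entry(text: str) -> dict:
    """Parse one benchmark entry and check that the expected fields are present."""
    entry = json.loads(text)
    missing = REQUIRED_FIELDS - set(entry)
    if missing:
        raise ValueError(f"entry is missing fields: {sorted(missing)}")
    return entry


entry = load_entry(raw)
print(entry["category"], "->", entry["answer"])
```

Validating at load time keeps malformed entries from silently counting as model failures later on.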

Why Standard Tools Fall Short

Exact string matching severely underestimates performance on reasoning tasks. A model might correctly grasp “cat is to kitten as dog is to ?” but respond with an explanation or slight phrasing variation.
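A tiny illustration of the problem, using the analogy above. The responses are made up for the example; the point is only that a raw equality check rejects answers a human would accept.

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Naive scoring: the raw response must equal the gold answer exactly."""
    return prediction == gold


gold = "puppy"
responses = [
    "puppy",
    "Puppy.",
    "A kitten is a young cat, so the answer is puppy.",
]

for response in responses:
    print(exact_match(response, gold), "|", response)
```

Only the first response scores as correct, even though all three contain the right answer.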

Key challenges:

The Hybrid Evaluation System

We implemented a category-aware evaluator router with mixed 0–1 scoring:

| Category Type | Scoring Method | Examples |
| --- | --- | --- |
| Strict Exact | Normalized string equality | Math-arithmetic, Logic, Knowledge-basic |
| Flexible Exact | Lowercase + punctuation removal | Pattern-matching, Language-structure |
| Semantic | Embedding cosine similarity (threshold ~0.88) | Commonsense, Language-comprehension |
| Hybrid | Combination of above | Language-transformation |
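The routing idea can be sketched as follows. This is not our exact implementation: the function names, the whitespace-only normalization for strict matching, the "best of flexible and semantic" rule for hybrid categories, and the fallback scorer are all assumptions. `toy_embed` is a bag-of-words stand-in so the sketch runs without dependencies; in practice you would pass a real sentence-embedding function.

```python
import math
import string
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def strict_exact(pred: str, gold: str) -> float:
    """Equality after collapsing whitespace (assumed meaning of 'normalized')."""
    return float(" ".join(pred.split()) == " ".join(gold.split()))


def _flex_normalize(text: str) -> str:
    # Lowercase, drop ASCII punctuation, collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def flexible_exact(pred: str, gold: str) -> float:
    return float(_flex_normalize(pred) == _flex_normalize(gold))


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (word counts). Replace with a real embedding model."""
    return dict(Counter(_flex_normalize(text).split()))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic(pred: str, gold: str,
             embed: Callable[[str], Vector] = toy_embed,
             threshold: float = 0.88) -> float:
    """1.0 if the cosine similarity of the two embeddings reaches the threshold."""
    return float(cosine(embed(pred), embed(gold)) >= threshold)


def hybrid(pred: str, gold: str) -> float:
    # Assumption: "combination" means the best of flexible exact and semantic.
    return max(flexible_exact(pred, gold), semantic(pred, gold))


ROUTES: Dict[str, Callable[[str, str], float]] = {
    "Math-arithmetic": strict_exact,
    "Logic": strict_exact,
    "Knowledge-basic": strict_exact,
    "Pattern-matching": flexible_exact,
    "Language-structure": flexible_exact,
    "Commonsense": semantic,
    "Language-comprehension": semantic,
    "Language-transformation": hybrid,
}


def score(category: str, pred: str, gold: str) -> float:
    """Dispatch to the scorer for this category (flexible exact as assumed fallback)."""
    return ROUTES.get(category, flexible_exact)(pred, gold)


print(score("Math-arithmetic", "42.", "42"))
print(score("Pattern-matching", "Puppy.", "puppy"))
```

With this routing, a trailing period costs the point in a strict category but not in a flexible one, which is exactly the per-category behavior the table describes.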

Scores are aggregated as category averages, plus overall micro (every sample weighted equally) and macro (every category weighted equally) percentages.
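The difference between the two overall numbers is easiest to see in code. Below is a minimal sketch, assuming results arrive as `(category, score)` pairs with scores in the 0–1 range; the `aggregate` name and the sample data are illustrative only. Multiply by 100 to report percentages.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate(results: Iterable[Tuple[str, float]]) -> Dict[str, object]:
    """Compute per-category means plus micro and macro averages."""
    by_category = defaultdict(list)
    for category, value in results:
        by_category[category].append(value)
    if not by_category:
        raise ValueError("no results to aggregate")

    per_category = {c: sum(v) / len(v) for c, v in by_category.items()}
    all_scores = [s for scores in by_category.values() for s in scores]
    return {
        "per_category": per_category,
        # Micro: pool all samples, so large categories dominate.
        "micro": sum(all_scores) / len(all_scores),
        # Macro: average the category means, so each category counts once.
        "macro": sum(per_category.values()) / len(per_category),
    }


# Illustrative data: a large category at 6/8 and a small one at 1/2.
results = [("Commonsense", 1.0)] * 6 + [("Commonsense", 0.0)] * 2
results += [("Logic", 1.0), ("Logic", 0.0)]
summary = aggregate(results)
print(summary)
```

Here the micro average is 0.7 while the macro average is 0.625, because the weaker, smaller category pulls the macro number down more than the pooled one.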

Results on Qwen2.5-1.5B-Instruct

On this compact instruction-tuned model we observed clear patterns:

Strengths (75–85%)

Weaknesses

The biggest insight: many “failures” were actually format issues rather than reasoning errors. Models understood the pattern but wrapped the answer in explanations.
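One way to separate the two kinds of failure is a quick triage pass over the raw responses. The sketch below is a heuristic we assume for illustration, not the framework's actual logic: if the gold answer appears as a whole word or phrase inside a longer response, the miss is labeled a format issue rather than a wrong answer.

```python
import re


def _normalize(text: str) -> str:
    # Lowercase and turn every run of non-alphanumeric characters into one space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def triage(response: str, gold: str) -> str:
    """Label a response as 'correct', 'format_issue', or 'wrong'."""
    resp, ans = _normalize(response), _normalize(gold)
    if resp == ans:
        return "correct"
    # Gold answer is present as a whole word/phrase inside a longer response.
    if ans and re.search(rf"\b{re.escape(ans)}\b", resp):
        return "format_issue"
    return "wrong"


for response in ["Puppy", "The pattern is young animals, so the answer is puppy.", "kitten"]:
    print(triage(response, "puppy"), "|", response)
```

The labels are meant for manual review, not scoring: a response such as "not puppy" would also be flagged as a format issue, so it is worth spot-checking that bucket.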

Key Takeaways

  1. Answer extraction is critical — strip explanations, use last sentence/word heuristics, or category-specific parsers.
  2. One-size-fits-all metrics hide truth. Per-category scoring reveals real strengths and weaknesses.
  3. Embeddings provide cheap, reliable semantic judgment without relying on another LLM judge.
  4. Partial credit systems improve signal — reward conceptual correctness even when formatting is imperfect.
  5. Small models handle everyday reasoning surprisingly well but struggle with strict symbolic manipulation.
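The first takeaway can be sketched as a small extraction step that runs before scoring. The marker phrases, the trimmed characters, and the `extract_answer` name are assumptions for illustration; a real evaluator would likely add category-specific parsers on top.

```python
import re

# Hypothetical marker phrases such as "the answer is ..." or "Answer: ...".
_MARKER = re.compile(r"answer\s*(?:is|:)\s*", re.IGNORECASE)
_TRIM = " \t\n.,;:!?\"'`*"


def extract_answer(response: str, last_word: bool = False) -> str:
    """Pull a short answer out of a verbose response.

    1. If an explicit marker is present, keep the text after the last one.
    2. Otherwise fall back to the last sentence.
    3. Optionally reduce the candidate to its last word.
    """
    text = response.strip()
    parts = _MARKER.split(text)
    if len(parts) > 1:
        tail = parts[-1].strip()
        candidate = tail.splitlines()[0] if tail else ""
    else:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        candidate = sentences[-1] if sentences else ""

    candidate = candidate.strip(_TRIM)
    words = candidate.split()
    if last_word and words:
        candidate = words[-1].strip(_TRIM)
    return candidate


print(extract_answer("A kitten is a young cat, so the answer is puppy."))
print(extract_answer("Let me think. Dogs have young. Puppy."))
print(extract_answer("The total is 3.5.", last_word=True))
```

Running extraction first and then applying the category's scorer keeps the two concerns separate, so a formatting fix never requires touching the scoring rules.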

Better Benchmarks Ahead

Robust LLM evaluation is as much an engineering challenge as a modeling one. Start with a clean dataset, invest early in a flexible evaluator, and iterate based on real runs.

Future improvements could include adversarial paraphrasing, multi-answer gold sets, confidence-weighted scoring, and interactive dashboards.

Custom benchmarks like this one give far more actionable insights than generic leaderboards. What evaluation tricks have you discovered in your own work?


Built with a hybrid Python framework combining direct inference and category-specific scoring logic.