Kili Technology's LLM Benchmarks

Overview

The substantial gap between top performers (Claude 3.7 and O3 Mini at 84%) and others reveals the emergence of distinct capability tiers among LLMs, suggesting significant architectural breakthroughs in leading models. Domain-specific performance variations across models indicate there's no "one-size-fits-all" solution, necessitating strategic model selection based on intended applications rather than overall rankings alone. The rising performance floor across all evaluated models signals a maturing field where even lower-tier LLMs demonstrate meaningful reasoning abilities, potentially accelerating adoption in domains previously considered too complex for AI deployment.

Math Challenges

The top 2 model's performance indicate a breakthrough in overcoming what has historically been a significant limitation for LLMs.

The substantial variation in math capabilities across models (ranging from 90.32% to 35.48%) suggests that different training methodologies and architectures may be optimizing for different cognitive functions, with some models clearly prioritizing mathematical reasoning.

The clustering of mid-tier performance around 50-60% indicates that while mathematical reasoning is improving across the industry, a substantial capability gap remains between leading and trailing models.

Most consistent techniques

Reliable Manipulation:

Exploiting AI's Learning Mechanisms

The common denominator of the most reliable techniques in our exploration, lies in their exploitation of the model's reliance on input cues to generate output. The ease of creating and automating these attacks significantly increases the risk they pose. They can be readily devised and scaled without requiring advanced skills.

Language Differences

Our analysis reveals dramatic performance differences in language capabilities among leading LLMs, with top models outperforming lower-tier models by nearly 70 percentage points.

This suggests that language processing capabilities may be developing at significantly different rates across model architectures and training methodologies.

Language understanding remains a critical differentiator in LLM development, with potential implications for applications requiring nuanced linguistic analysis. Organizations must carefully evaluate model-specific language capabilities rather than assuming standardized performance across the industry.

Read the full report

Insight #1

Evolving Performancce Hierarchy

Unlike earlier benchmarks that often showed a consistent performance hierarchy across all domains and levels, our results reveal a more nuanced landscape.

While O3 Mini and Claude 3.7 demonstrate superior overall performance, domain-specific excellence varies significantly across models.

This suggests that the frontier of LLM development is becoming increasingly specialized, with different architectural or training approaches conferring advantages in specific domains.

Insight #2

Resilience to Complexity
‍

Recent advances in model architecture and training methodologies appear to be addressing previous limitations where LLM performance degradation increased with task complexity.

The ability to maintain reasoning coherence across extended logical chains is improving, particularly in the latest generation of models like Claude 3.7 and O3 Mini.

Insight #3

Complexity Gap Widening
‍

Our analysis reveals a growing performance divide between leading and trailing models at higher complexity levels.

The gap between top and bottom performers widens dramatically as task complexity increases, reaching a 50 percentage point difference at Post-Graduate levels between Claude 3.7/O3 Mini (80%) and Claude 3.5 (30%).

This widening complexity gap indicates that advancements in reasoning capabilities are not evenly distributed across the industry, suggesting we may be entering an era of increased stratification among LLMs where only a select few can effectively handle the most complex reasoning tasks.

Methodology

Large Language Models (LLMs) have made remarkable strides in mimicking human-like text generation, but their ability to perform complex reasoning tasks presents a distinct challenge. Reasoning involves structured thought processes that require logical deduction, mathematical computation, and linguistic understanding—capabilities that extend beyond pattern recognition in training data.Recent research, including work by McCoy et al. (2024), has revealed that LLM reasoning is fundamentally shaped by probabilistic patterns encountered during training rather than formal logical deduction. This creates inconsistencies when LLMs face tasks requiring rigorous reasoning steps. Similarly, Mirzadeh et al. (2024) demonstrated that LLMs exhibit significant performance variations when responding to different formulations of the same question, suggesting a reliance on pattern matching over true logical reasoning.

Our benchmark was designed to address limitations in existing LLM evaluation methodologies by implementing a comprehensive, multi-dimensional assessment approach. Building on methodological advances from Apple's GSM-Symbolic framework (Mirzadeh et al., 2024), we developed an evaluation system that allows for controlled assessment of reasoning capabilities across varying domains and complexity levels. Traditional single-point metrics often obscure important performance characteristics of LLMs. By incorporating questions rewritten and reworked from diverse benchmark sources according to the GSM-Symbolic methodology, our framework enables a more nuanced understanding of how LLMs perform across different reasoning tasks and complexity gradients. We plan to expand this small dataset in the future with new questions and challenges to further test both existing and future models.

‍

Dataset Composition

Subject Domains

Our benchmark deliberately distributes tasks across three fundamental domains to provide a comprehensive assessment of LLM reasoning capabilities:

Language (16% of tasks)

Language tasks evaluate an LLM's ability to understand and manipulate linguistic structures, including:

- Challenging translations between languages requiring semantic preservation
- Word problems and riddles demanding creative linguistic reasoningLinguistic patterns and puzzles that test pattern recognition
- Semantic analysis and interpretation of complex text
- Pragmatic reasoning about implied meaning and contextual understanding

While language tasks represent a smaller portion of our benchmark, they provide essential insights into models' foundational linguistic reasoning abilities that underpin performance across all domains.

Mathematics (41.33% of tasks)

Mathematical reasoning tasks comprise the second-largest portion of our benchmark, encompassing a diverse range of mathematical disciplines:

- Physics problems requiring application of formulas and physical principles
- Geometric reasoning involving spatial relationships and proofs
- Probability and statistical reasoning with uncertainty quantification
- Algebraic manipulation and equation solving
- Word problems requiring translation between linguistic and mathematical domains

The substantial emphasis on mathematical tasks is informed by previous research (Mirzadeh et al., 2024) demonstrating that mathematical reasoning presents particular challenges for LLMs due to their reliance on pattern matching rather than formal reasoning systems.

Reasoning (42.67% of tasks)

General reasoning tasks constitute the largest segment of our benchmark, examining:

- Formal logical reasoning (deductive and inductive)
- Multi-step problem solving with sequential dependencies
- Identification and resolution of logical fallacies
- Common-sense reasoning about everyday situations
- Physical reasoning about object interactions and causality
- Temporal logic questions involving sequences and timing relationships

This domain focuses on evaluating an LLM's ability to maintain logical coherence across extended reasoning chains, a capability critical for advanced applications requiring reliable analytical processing.

Proficiency Levels

The benchmark strategically stratifies tasks across five ascending complexity levels to assess how LLM reasoning degrades with increasing difficulty. Basic tasks (30.67%) establish a baseline through straightforward reasoning steps like simple calculations and direct definition applications. High school level tasks (6.67%), though fewer, serve as a crucial transition point by introducing multi-step problem solving and pattern recognition.

College entrance level tasks (21.33%) represent a significant complexity increase, requiring abstract reasoning and concept synthesis. Pre-graduate level tasks (28%) form a substantial portion of the benchmark, evaluating complex analytical reasoning and extended logical chains essential for specialized applications. The most challenging post-graduate level tasks (13.33%) represent the frontier of LLM capabilities, demanding expert-level reasoning, specialized knowledge integration, and resolution of highly complex ambiguities.

This tiered structure enables precise assessment of where different models excel or struggle, with particular focus on the performance drop-off as task complexity increases—a key indicator of an LLM's reasoning robustness.

Citations

Aryabumi, V., Chen, Z. Z., Ye, X., Yang, X., Chen, L., Wang, W. Y., & Petzold, L. (2024). Unveiling the impact of coding data instruction fine-tuning on large language models reasoning.
‍
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv. https://arxiv.org/abs/2501.12948
‍
Guan, M. Y., Joglekar, M., Wallace, E., Jain, S., Barak, B., Heylar, A., Dias, R., Vallone, A., Ren, H., Wei, J., Chung, H. W., Toyer, S., & Heidecke, J., & Beutel, A., & Glaese, A. (2024). Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339.
‍
McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D., & Griffiths, T. L. (2024). Embers of autoregression show how large language models are shaped by the problem they are trained to solve. Proceedings of the National Academy of Sciences, 121(41), e2322420121.
‍
Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229.
‍
Prabhakar, R., Dua, D., & Agrawal, R. (2024). Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning. arXiv preprint arXiv:2407.01687v2.
‍
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903.
‍
Zhang, X., Su, Y., Ma, R., Morisot, A., Zhang, I., Locatelli, A., Fadaee, M., Üstün, A., & Hooker, S. (2024). To code, or not to code? Exploring impact of code in pre-training.
‍

Kili Technology presents

GenAI Benchmarks

Real-world challenges for AI. Tested by human experts, for human experts.

First Edition: LLM Reasoning Benchmark

Key Insights

Overview

Math Challenges

Most consistent techniques

Reliable Manipulation:

Language Differences

Insight #1

Evolving Performancce Hierarchy

Insight #2

Resilience to Complexity
‍

Insight #3

Complexity Gap Widening
‍

Model-based Results

o3-Mini Reasoning Performance Summary

Claude 3.7 Reasoning Performance Summary

o1 Reasoning Performance Summary

QwQ-32B Reasoning Performance Summary

Mistral-Large-2411 Reasoning Performance Summary

DeepseekR1 Reasoning Performance Summary

Claude 3.5 Reasoning Performance Summary

Qwen2.5 Reasoning Performance Summary

Report Details

Methodology

Dataset Composition

Subject Domains

Language (16% of tasks)

Mathematics (41.33% of tasks)

Reasoning (42.67% of tasks)

Proficiency Levels

Citations

Kili Technology presents

GenAI Benchmarks

Real-world challenges for AI. Tested by human experts, for human experts.

First Edition: LLM Reasoning Benchmark

Key Insights

Overview

Math Challenges

Most consistent techniques

Reliable Manipulation:

Language Differences

Insight #1

Evolving Performancce Hierarchy

Insight #2

Resilience to Complexity‍

Insight #3

Complexity Gap Widening‍

Model-based Results

o3-Mini Reasoning Performance Summary

Claude 3.7 Reasoning Performance Summary

o1 Reasoning Performance Summary

QwQ-32B Reasoning Performance Summary

Mistral-Large-2411 Reasoning Performance Summary

DeepseekR1 Reasoning Performance Summary

Claude 3.5 Reasoning Performance Summary

Qwen2.5 Reasoning Performance Summary

Report Details

Methodology

Dataset Composition

Subject Domains

Language (16% of tasks)

Mathematics (41.33% of tasks)

Reasoning (42.67% of tasks)

Proficiency Levels

Citations

Resilience to Complexity
‍

Complexity Gap Widening
‍