Large Language Models (LLMs) have made remarkable strides in generating human-like text, but their ability to perform complex reasoning tasks remains a distinct challenge. Reasoning involves structured thought processes that require logical deduction, mathematical computation, and linguistic understanding, capabilities that extend beyond pattern recognition in training data.

Recent research, including work by McCoy et al. (2024), has revealed that LLM reasoning is fundamentally shaped by probabilistic patterns encountered during training rather than by formal logical deduction. This produces inconsistencies when LLMs face tasks requiring rigorous reasoning steps. Similarly, Mirzadeh et al. (2024) demonstrated that LLMs exhibit significant performance variation across different formulations of the same question, suggesting a reliance on pattern matching over true logical reasoning.
Our benchmark was designed to address limitations in existing LLM evaluation methodologies by implementing a comprehensive, multi-dimensional assessment approach. Building on methodological advances from Apple's GSM-Symbolic framework (Mirzadeh et al., 2024), we developed an evaluation system that allows for controlled assessment of reasoning capabilities across varying domains and complexity levels. Traditional single-point metrics often obscure important performance characteristics of LLMs. By incorporating questions rewritten and reworked from diverse benchmark sources according to the GSM-Symbolic methodology, our framework enables a more nuanced understanding of how LLMs perform across different reasoning tasks and complexity gradients. We plan to expand this small dataset in the future with new questions and challenges to further test both existing and future models.
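To make the template-based rewriting concrete, the sketch below illustrates the kind of symbolic templating popularized by GSM-Symbolic: names and numeric values in a seed question become placeholders that are resampled to produce many variants of the same underlying problem, with the answer recomputed from the sampled values. The template text, variable ranges, and helper names here are illustrative assumptions, not the actual generation code behind our benchmark.

```python
import random

# Illustrative GSM-Symbolic-style template: placeholders stand in for the
# proper name and the numeric quantities in a seed question.
TEMPLATE = (
    "{name} buys {n_packs} packs of pencils with {per_pack} pencils in each pack. "
    "After giving away {given} pencils, how many pencils does {name} have left?"
)

NAMES = ["Ava", "Liam", "Noah", "Mia"]


def sample_instance(seed: int) -> dict:
    """Sample one concrete question/answer pair from the symbolic template."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    n_packs = rng.randint(2, 9)
    per_pack = rng.randint(3, 12)
    given = rng.randint(1, n_packs * per_pack - 1)  # keep the answer positive
    question = TEMPLATE.format(
        name=name, n_packs=n_packs, per_pack=per_pack, given=given
    )
    answer = n_packs * per_pack - given  # ground truth recomputed per instance
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    for seed in range(3):
        inst = sample_instance(seed)
        print(inst["question"], "->", inst["answer"])
```

Because each variant shares the same reasoning structure but different surface values, comparing a model's accuracy across variants separates genuine reasoning from memorized answers to a single phrasing.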
Our benchmark deliberately distributes tasks across three fundamental domains to provide a comprehensive assessment of LLM reasoning capabilities:
Language tasks evaluate an LLM's ability to understand and manipulate linguistic structures, including:
- Challenging translations between languages requiring semantic preservation
- Word problems and riddles demanding creative linguistic reasoning
- Linguistic patterns and puzzles that test pattern recognition
- Semantic analysis and interpretation of complex text
- Pragmatic reasoning about implied meaning and contextual understanding
While language tasks represent a smaller portion of our benchmark, they provide essential insights into models' foundational linguistic reasoning abilities that underpin performance across all domains.
Mathematical reasoning tasks comprise the second-largest portion of our benchmark, encompassing a diverse range of mathematical disciplines:
- Physics problems requiring application of formulas and physical principles
- Geometric reasoning involving spatial relationships and proofs
- Probability and statistical reasoning with uncertainty quantification
- Algebraic manipulation and equation solving
- Word problems requiring translation between linguistic and mathematical domains
The substantial emphasis on mathematical tasks is informed by previous research (Mirzadeh et al., 2024) demonstrating that mathematical reasoning presents particular challenges for LLMs due to their reliance on pattern matching rather than formal reasoning systems.
General reasoning tasks constitute the largest segment of our benchmark, examining:
- Formal logical reasoning (deductive and inductive)
- Multi-step problem solving with sequential dependencies
- Identification and resolution of logical fallacies
- Common-sense reasoning about everyday situations
- Physical reasoning about object interactions and causality
- Temporal logic questions involving sequences and timing relationships
This domain focuses on evaluating an LLM's ability to maintain logical coherence across extended reasoning chains, a capability critical for advanced applications requiring reliable analytical processing.
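Although the paper does not prescribe a storage format, each item in the three domains above can be summarized with a minimal task record like the sketch below; the field names and enumeration values are our own illustrative choices, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    LANGUAGE = "language"
    MATHEMATICS = "mathematics"
    GENERAL_REASONING = "general_reasoning"


class Complexity(Enum):
    BASIC = 1
    HIGH_SCHOOL = 2
    COLLEGE_ENTRANCE = 3
    PRE_GRADUATE = 4
    POST_GRADUATE = 5


@dataclass
class BenchmarkTask:
    """One benchmark item: a prompt, its reference answer, and its labels."""
    task_id: str
    domain: Domain
    complexity: Complexity
    prompt: str
    reference_answer: str


# Hypothetical item from the general reasoning domain, for illustration only.
example = BenchmarkTask(
    task_id="gr-0042",
    domain=Domain.GENERAL_REASONING,
    complexity=Complexity.COLLEGE_ENTRANCE,
    prompt="If all widgets are gadgets and no gadgets are gizmos, can a widget be a gizmo?",
    reference_answer="No",
)
```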
The benchmark strategically stratifies tasks across five ascending complexity levels to assess how LLM reasoning degrades with increasing difficulty. Basic tasks (30.67%) establish a baseline through straightforward reasoning steps like simple calculations and direct definition applications. High school level tasks (6.67%), though fewer, serve as a crucial transition point by introducing multi-step problem solving and pattern recognition.
College entrance level tasks (21.33%) represent a significant complexity increase, requiring abstract reasoning and concept synthesis. Pre-graduate level tasks (28%) form a substantial portion of the benchmark, evaluating complex analytical reasoning and extended logical chains essential for specialized applications. The most challenging post-graduate level tasks (13.33%) represent the frontier of LLM capabilities, demanding expert-level reasoning, specialized knowledge integration, and resolution of highly complex ambiguities.
This tiered structure enables precise assessment of where different models excel or struggle, with particular focus on the performance drop-off as task complexity increases—a key indicator of an LLM's reasoning robustness.
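One way to operationalize this drop-off analysis is sketched below: given per-task correctness records labeled with the five complexity levels, we compute accuracy per level and the absolute decline from the basic tier to the post-graduate tier. The record format, level labels, and function names are assumptions made for illustration rather than the benchmark's scoring code.

```python
from collections import defaultdict

# Complexity tiers in ascending order, mirroring the stratification above.
LEVELS = ["basic", "high_school", "college_entrance", "pre_graduate", "post_graduate"]


def accuracy_by_level(results: list[dict]) -> dict[str, float]:
    """Map records like {"level": "basic", "correct": True} to per-level accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for record in results:
        totals[record["level"]] += 1
        hits[record["level"]] += int(record["correct"])
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


def complexity_dropoff(accuracy: dict[str, float]) -> float:
    """Absolute accuracy decline from the basic tier to the post-graduate tier."""
    return accuracy["basic"] - accuracy["post_graduate"]


if __name__ == "__main__":
    # Illustrative records only; real records come from scoring model outputs.
    demo = [
        {"level": "basic", "correct": True},
        {"level": "basic", "correct": True},
        {"level": "post_graduate", "correct": True},
        {"level": "post_graduate", "correct": False},
    ]
    acc = accuracy_by_level(demo)
    print(acc, "drop-off:", complexity_dropoff(acc))
```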