**Benchmarking Algorithmic Reasoning Capabilities of Gemini Models in Scientific Tasks**

Introduction
===============================================================================

Evaluating frontier models on scientific tasks presents unique challenges, as success often demands a sophisticated blend of domain-specific knowledge, logical reasoning (identifying a sequence of steps based on given information and prior knowledge), and multi-step algorithmic execution. For instance, determining the phase of a material from an X-ray Diffraction (XRD) pattern requires an LLM to possess domain knowledge, such as the HKL rules for various crystal structures. The reasoning involves inferring the necessary steps to solve the problem, e.g., identifying the highest peak and calculating peak ratios from the given data. Finally, execution capability refers to the model's ability to systematically carry out the plan developed during the reasoning process.

Existing benchmarks frequently conflate these factors, making it difficult to discern whether a model's failure stems from a lack of specialized knowledge or an inability to logically apply or execute that knowledge. This "knowledge gap" often obscures a more fundamental "reasoning gap" [#1]. An LLM might fail a scientific reasoning task due to insufficient factual knowledge, or it might possess the knowledge but be unable to correctly apply a multi-step algorithm or perform accurate computations based on that knowledge. Furthermore, the inherent complexity of the algorithmic steps required to solve a problem is often overlooked in current evaluations; a task demanding five sequential calculations is fundamentally different from one requiring fifty, even if the underlying scientific principles are identical.

This project introduces a novel benchmarking approach to address these limitations.
Our primary motivation is to abstract away the domain-specific knowledge component as much as possible, focusing instead on the LLM's capacity to execute a series of logical and computational steps, an "algorithm," once the problem has been conceptually understood. A second, equally important motivation is to systematically evaluate how LLM performance degrades as the inherent algorithmic complexity of a task increases. We posit that by introducing a "scalable variable" within each task, we can precisely control the computational burden and observe the resilience limits of different LLM architectures and prompting strategies. The methodology presented herein involves:

1. Designing a diverse set of ten scientific-algorithmic tasks, each with a known solution complexity that can be parametrically varied.
2. Evaluating the performance of the Gemini-2.5-Pro and Gemini-2.5-Flash models, both with and without a "chain-of-thought" or "thinking process" prompting strategy, across varying levels of algorithmic complexity for each task.
3. Complementing these evaluations with an analysis of performance when code execution is assumed, i.e., when the model only needs to provide the logic or algorithm that would then be executed by an external interpreter, to further isolate reasoning from direct computation.

This study aims to provide a more nuanced understanding of LLM capabilities in scientific algorithmic reasoning, offering valuable insights for developing more robust and reliable AI systems for scientific applications.

Methodology
=================================================================================

Our benchmark framework, comprising a diverse suite of tasks, is designed around a set of core principles detailed below. These tasks are not only crafted to rigorously evaluate advanced AI reasoning but also hold direct implications for a wide spectrum of scientific disciplines, spanning chemistry, physics, materials science, and computer science.
While certain tasks may necessitate specific scientific domain knowledge, any requisite information, such as lookup tables or fundamental scientific principles, is explicitly provided within the task description. The primary challenge for the model lies in inferring the correct sequence of steps, or the underlying algorithm, required to solve the scientific problem posed. To decouple this reasoning capability from task execution, we assess whether the model can articulate the identified steps as executable code, alongside solving the task directly through a chain-of-thought approach. This methodology leverages the code generation capabilities of frontier models, positing that the successful synthesis of an appropriate algorithm into code serves as a robust, indirect measure of scientific reasoning. It also circumvents the subjectivity inherent in human or LLM-as-judge evaluations of free-form reasoning descriptions, providing a quantifiable and verifiable output. Furthermore, by comparing the accuracy obtained from executing the model-generated code against the model's direct solution, we gain additional insight into its dual capabilities for algorithmic inference and precise task execution.

Task Design Principles
-------------------------------------------------------------------------------

Our benchmark framework implements four core design principles addressing limitations in current evaluation methods:

1. *Knowledge-Reasoning Separation:* Tasks require minimal domain-specific knowledge while demanding sophisticated algorithmic execution.
2. *Contamination Resistance:* All tasks generate novel instances through parameterized algorithms, ensuring no specific problem appears in training data. Problem structure remains constant while numerical parameters vary.
3. *Complexity Tunability:* Each task includes scalable difficulty parameters enabling systematic exploration of performance degradation patterns.
Complexity scaling follows theoretical principles from computational complexity theory.
4. *Verifiable Execution:* All tasks have algorithmically deterministic solutions with step-by-step verification, eliminating ambiguity in correctness assessment.

For example, consider the `diffusion pathway task`, which presents models with a 2D grid representing a crystal surface containing both passable sites and impassable defects. Given a starting and an ending coordinate, the model's objective is to determine the length of the shortest path an atom can take by moving orthogonally across passable sites. This task adheres to the design principles above. Knowledge-Reasoning Separation is achieved by framing the problem as pure grid-based pathfinding, circumventing the need for specialized materials science knowledge while demanding robust algorithmic reasoning. Contamination Resistance is ensured through parameterized algorithmic generation of novel defect configurations for each instance, preventing memorization of specific problem structures. The grid's scalable size serves as a tunable complexity parameter, allowing systematic investigation of performance degradation as the search space expands, in line with principles of computational complexity theory. Verifiable Execution is guaranteed by the determinism of shortest-path algorithms, enabling precise, unambiguous assessment of correctness against a single, algorithmically derivable solution. Scientifically, this task models atomic diffusion and migration within defective crystalline structures, a process in materials science that governs phenomena ranging from alloy formation and semiconductor doping to material degradation. It thus provides a simplified yet powerful abstraction for assessing an agent's capacity to navigate complex, obstructed landscapes relevant to real-world material behavior.
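As an illustration, the shortest-path computation at the heart of this task can be sketched with a standard breadth-first search. This is a minimal sketch following the task description; the function name and grid encoding are ours, not the SciRex implementation:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python
from collections import deque

def shortest_diffusion_path(grid, start, end):
    """Shortest orthogonal path on a grid where 0 = passable site, 1 = defect.

    Returns the number of moves, or -1 if the end site is unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == end:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # orthogonal moves only
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in visited):
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return -1  # end site is walled off by defects

# A 3x3 toy instance: the direct route is blocked, forcing a detour.
grid = [[0, 0, 1],
        [1, 0, 1],
        [0, 0, 0]]
print(shortest_diffusion_path(grid, (0, 0), (2, 2)))  # → 4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because BFS explores sites in order of increasing distance, the first time the end coordinate is dequeued its distance is guaranteed minimal, which is what makes the reference answer deterministic and verifiable.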
Table [stem_tasks] summarizes the tasks, their complexity scaling parameters, and their relevance to various scientific domains. Detailed descriptions of the tasks are given in section Tasks. The complete implementation is available as the open-source SciRex framework, described in detail in the [Codebase and Implementation] section of the appendix.

Task | STEM Domain | Algorithm Type | Variable | Complexity | Key Challenge | Scientific Relevance
-----|-------------|----------------|----------|------------|---------------|------------------------
Many Body Energy Computation | Physics | Arithmetic | Particles (N) | O(N²) | Pairwise interactions scale quadratically | Molecular interactions, lattice energy
Peak Identification | Chemistry/Materials | Sequence Processing | Peaks (N) | O(N log N) | Coordinate sorting with overlap detection | Spectroscopy (XRD, NMR) analysis
Diffusion Pathway | Materials Science | Graph Traversal | Grid (M×N) | O(MN) | BFS pathfinding with defect constraints | Surface diffusion, catalysis
State Machine Traversal | Computer Science | Sequence Processing | String length (L) | O(L) | Linear state transition tracking | Reaction pathways, phase transitions
DNA Transcription | Biology/Chemistry | Sequence Processing | DNA length (N) | O(N) | Codon processing with stop rules | Protein synthesis, biomaterials
Resistor Network Simplification | Physics/Engineering | Arithmetic | Resistors (N) | O(N) | Recursive parsing of nested structures | Conductivity in nanomaterials
Radioactive Decay Chain | Physics/Chemistry | Sequence Processing | Chain length (L) | O(L) | Multi-step decay rule application | Nuclear chemistry, dating methods
Tree Traversal | Computer Science | Graph Traversal | Nodes (N) | O(N) | Recursive navigation with order accuracy | Chemical graph structures
Kinematic Motion | Physics | Arithmetic | Phases (N) | O(N) | Sequential velocity updates | Material particle dynamics
Knights and Knaves | Logic | Logical Deduction | Islanders (N) | O(2^N) | Exponential constraint satisfaction | Logical hypothesis testing

[Table [stem_tasks]: Computational tasks across STEM domains with algorithmic complexity analysis.]

Results
============================================================

Comparing code-generation capability with step-by-step solving in language responses reveals a critical disparity between models' reasoning capabilities, as inferred through code generation, and their direct problem-solving accuracy via chain-of-thought (CoT) language responses. Across a range of tasks, the "Code Execution" metric consistently demonstrates substantially higher accuracy than "Chain of Thought" performance. This stark contrast, particularly evident in tasks where CoT accuracy is near zero while Code Execution achieves near-perfect scores, strongly suggests that models are often capable of identifying the correct underlying algorithms or sequences of steps necessary to solve a problem, as evidenced by their ability to generate accurate, executable code, but struggle to flawlessly implement these steps through a purely language-based, step-by-step reasoning process. This implies a potential disconnect: the abstract understanding and algorithmic formulation are robust, yet meticulous execution within a natural language generation framework remains a significant challenge. The superior performance of code execution therefore serves as a compelling indicator that the foundational reasoning ability is present, but direct task execution via language output still leaves considerable room for improvement.

![Figure [code]: Performance of Gemini models on different tasks. For each task, benchmarking is done both by writing the solution algorithm as executable code and by writing the solution in the language response.
Refer to Table [stem_tasks] for details on the tasks.](images/reasoning.png)

Our benchmarking results provide a clear picture of how different Gemini models, with and without explicit thinking prompts, handle increasing algorithmic complexity across a variety of scientific reasoning tasks. The accuracy, plotted against the scalable variable, reveals distinct performance curves for each model variant. In most tasks, Gemini-2.5-Pro consistently outperforms both Gemini-2.5-Flash and Gemini-2.5-Flash with a thinking token budget allocated. Gemini-2.5-Pro exhibits a higher initial accuracy and maintains its performance for longer as complexity increases (especially in linear-complexity tasks), demonstrating greater robustness in algorithmic execution using step-by-step chain-of-thought reasoning.

*Impact of Algorithmic Complexity:* A consistent trend observed across almost all tasks is the degradation of accuracy as the scalable variable (and thus algorithmic complexity) increases. This confirms our hypothesis that increasing the number of computational steps or data points challenges the models significantly.

*Effectiveness of the "Thinking" Budget:* The "thinking process" (Gemini-2.5-Flash-Thinking) generally improves the performance of Gemini-2.5-Flash. For tasks like `Finite State Machine`, `Tree Traversal`, and `Diffusion Pathway` (for more details on these tasks refer to section Tasks), Gemini-2.5-Flash-Thinking shows a noticeable uplift compared to vanilla Gemini-2.5-Flash, indicating that explicit step-by-step reasoning can help mitigate some of the limitations. However, this improvement is not universal and often does not bridge the gap to Gemini-2.5-Pro.

![Figure [complexity]: Performance of Gemini models on different tasks. For each task, benchmarking is done with increasing complexity.
Refer to Table [stem_tasks] for details on the factor varying the complexity of each task.](images/benchmark_complexity_analysis.jpg)

### Slopes of Decline and Algorithmic Complexity

A crucial observation from the results is the varied "slopes of decline" in accuracy for different tasks and models:

- Tasks with inherently higher algorithmic complexity (e.g., O(N²) for pairwise interactions, or exponential for Knights & Knaves) show steeper declines for all models, but especially for the Flash variants.
- Tasks requiring linear sequential operations (e.g., `Finite State Machine`, `Kinematic Motion`) show a more gradual decline for Gemini-2.5-Pro but still a sharp drop for the Flash models beyond a certain threshold.
- The difference in slopes between Gemini-2.5-Pro and the Flash models may suggest that Gemini-2.5-Pro has a greater "working memory" or capacity to manage intermediate states and computations, allowing it to maintain accuracy at higher levels of complexity before its performance eventually degrades.

Our benchmarking effort provides crucial insights into the algorithmic reasoning capabilities of state-of-the-art LLMs, particularly within the Gemini family. By designing tasks with scalable complexity and disentangling knowledge from execution, we can draw several key conclusions. The most prominent finding is that while LLMs possess the knowledge and reasoning capabilities to identify the solution or algorithm for many scientific tasks (evident from code execution results), their ability to flawlessly execute multi-step algorithms as chains of thought, especially as the number of steps or data points increases, remains a significant bottleneck. This is evident in the consistent accuracy degradation across all models and tasks when the scalable variable is increased.
This suggests that current LLMs struggle to maintain computational fidelity over extended chains of reasoning, much as human working memory has limits when performing complex mental calculations. The "thinking budget" does offer a partial remedy, indicating that explicit articulation of intermediate steps can serve as a form of "scratchpad" or internal self-correction mechanism, particularly for smaller models like Gemini-2.5-Flash.

In tasks like `Radioactive Decay Chain` or `Kinematic Motion`, the underlying physics principles are relatively simple. The challenge for the LLM is not understanding alpha decay or acceleration, but rather applying these rules sequentially and accurately over many steps. The observed failures are thus less about a lack of scientific knowledge and more about correctly executing a known procedure in a language response. This is particularly relevant for scientific applications where the primary goal might be to automate routine computations or simulations based on established models, rather than generating novel theories. Tasks like `Knights and Knaves`, which necessitate systematic exploration of a large decision space, still pose a formidable barrier for current LLMs: a single incorrect logical step can invalidate the entire solution.

Discussion
=================================

Our benchmark analysis indicates that the bottleneck in solving scientific tasks is often not knowledge or reasoning capability but the step-by-step implementation of both in language responses. The observed limitations in pure algorithmic execution highlight the continued need for **hybrid AI systems**. LLMs could excel at interpreting problems, generating high-level algorithms, or even writing code snippets, while dedicated symbolic solvers or traditional programming languages handle the precise, multi-step computations. Improving algorithmic reasoning and execution may require moving beyond current token-prediction paradigms.
Future LLMs might benefit from integrated "thinking engines," symbolic reasoning modules, or more sophisticated internal "scratchpads" that are less prone to dilution over long reasoning chains.

To facilitate further research in this area, we have released **SciRex**, an open-source Python framework that implements all the tasks and evaluation protocols described in this study. The framework supports both reproducibility of our results and extension to new scientific domains, with built-in multimodal capabilities and systematic complexity scaling (see [Codebase and Implementation] in the appendix).

References
===============================================================================

[#1]: Tie, Guiyao, et al. "MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks." arXiv preprint arXiv:2505.16459 (2025).

[#2]: Alampara, Nawaf, et al. "Probing the limitations of multimodal language models for chemistry and materials research." Nature Computational Science (2025): 1-10.

Appendix
=====================================================================================

Codebase and Implementation
-------------------------------------------------------------------------------

The complete implementation of this benchmarking framework is available as **SciRex**, an open-source Python framework specifically designed for benchmarking large language models on scientific research tasks. The codebase supports both text-only and multimodal scientific content evaluation, with a particular focus on chemistry, physics, and materials science domains.
**Repository:** https://github.com/n0w0f/scirex

### Key Features

- **Scientific Domain Specialization:** Tasks designed for STEM domains with algorithmic complexity scaling
- **Multimodal Support:** Handles text, images, and mixed-modality scientific content
- **Extensible Architecture:** Simple API for custom datasets, tasks, and model integration
- **Reproducible Benchmarks:** Standardized evaluation metrics and result export formats

### Quick Start

Install SciRex using pip or uv:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~bash
# Using uv (recommended)
curl -LsSf https://astral.sh/uv/install.sh | sh
uv init scirex
uv add git+https://github.com/n0w0f/scirex.git

# Using pip
pip install git+https://github.com/n0w0f/scirex.git
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Usage Examples

**Text-Only Benchmarking:**

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python
from scirex import Benchmark, Dataset, GeminiModel, PromptTemplate

# Load dataset and configure model
dataset = Dataset.from_json("path/to/dataset.json")
model = GeminiModel(model_name="gemini-2.5-flash")
prompt = PromptTemplate.from_template("solve_step_by_step")

# Run benchmark
benchmark = Benchmark(dataset, model, prompt)
results = benchmark.run(max_tasks=50, save_results=True)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Multimodal Benchmarking:**

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python
from scirex import Benchmark, Dataset, GeminiModel

# Configure for multimodal tasks
dataset = Dataset.from_json("multimodal_dataset.json")
model = GeminiModel(model_name="gemini-2.5-flash")

# Run with multimodal support
benchmark = Benchmark(dataset, model, test_multimodal=True)
results = benchmark.run(max_tasks=25)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Creating Custom Tasks:**

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python
from scirex import Task

# Define a custom scientific task
task = Task(
    id="custom_diffusion_001",
    name="Surface Diffusion Analysis",
    description="Analyze atomic diffusion pathways on crystal surfaces",
    answer_type="numeric",
    target_value=42.0,
    keywords=["materials_science", "diffusion", "crystallography"],
    input_template="Given the crystal structure: {structure_image}\nCalculate the diffusion barrier for path: {text_description}",
    modality_entries={
        "structure_image": "image",
        "text_description": "text",
    },
)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### Framework Architecture

The SciRex framework implements the design principles outlined in this paper:

- **Knowledge-Reasoning Separation:** Tasks abstract domain knowledge while focusing on algorithmic execution
- **Contamination Resistance:** Parameterized generation prevents training data overlap
- **Complexity Tunability:** Scalable difficulty parameters for systematic performance analysis
- **Verifiable Execution:** Deterministic solutions enable objective accuracy assessment

All tasks described in Table [stem_tasks] are implemented within this framework, allowing researchers to reproduce our results or extend the benchmark with additional scientific domains.

### Configuration and Setup

Create a `.env` file for API configuration:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
GEMINI_API_KEY=your_api_key_here
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The framework automatically handles multimodal content detection, input preprocessing, and result aggregation across different complexity levels. For detailed documentation and additional examples, visit the project repository at https://github.com/n0w0f/scirex.

Tasks
-------------------------------------------------------------------------------

1. Many Body Energy Computation !!!
   **Task Description:** Calculate the total energy of a system with N particles arranged in 2D space. Each particle has an energy of E eV. The pairwise interaction energy between two particles is calculated as (distance between particles) times (energy of particle A + energy of particle B), but only for pairs within 5.0 Angstrom of each other.
What is the total energy of the system in eV?

   ![Figure [particle_energy]: Example particle system with N=25 particles for the many-body energy calculation task](images/energy.png)

   **Scalable Variable:** Number of particles (N)

   **Description of Complexity:** Computes pairwise interaction energies for N particles, requiring N(N-1)/2 distance calculations and energy summations. Simple for N=2 (1 pair), complex for N=50 (1225 pairs) due to quadratic scaling and conditional checks. Algorithmic complexity: O(N²).

   **Chemistry/Materials Science Relevance:** Models molecular interactions (e.g., Lennard-Jones-like), relevant to chemical systems and material properties (e.g., crystal lattice energy).

2. Peak Identification !!!
   **Task Description:** The model is given a list of 2D coordinates representing spectral peaks, in the format [[position_1, intensity_1], [position_2, intensity_2], ...]. The list is unsorted. The task is to return the list of peaks sorted in increasing or decreasing order.

   ![Figure [peak_sorting]: Example spectral data with N=25 peaks for the identification and sorting task](images/sort.png)

   **Scalable Variable:** Number of peaks (N)

   **Description of Complexity:** Sorts N 2D coordinates (position, intensity) by position or intensity. Simple for N=5, moderate for N=100 due to sorting cost. Additional complexity (e.g., overlapping peaks) could increase this to O(N²) with preprocessing. Algorithmic complexity: O(N log N).

   **Chemistry/Materials Science Relevance:** Core to spectroscopy (e.g., XRD, NMR), used in material characterization and chemical analysis.

3. Diffusion Pathway (Shortest Path on a Grid) !!!
   **Task Description:** The model is given a 2D grid (e.g., 10x10) representing a crystal surface. The grid is given as a list of lists, where 0 is an empty site and 1 is an impassable defect. The model is also given a starting coordinate and an ending coordinate.
The task is to find the length of the shortest path an atom can take to get from start to end, moving only up, down, left, or right.

   ![Figure [diffusion_path]: Example 25×25 grid with defects (dark) and pathfinding challenge for the atomic diffusion task](images/diffusion.png)

   **Scalable Variable:** Grid size (M×N)

   **Description of Complexity:** Uses BFS to find the shortest path on an M×N grid, avoiding defects. Moderate for a 5×5 grid (25 nodes), complex for 100×100 (10,000 nodes) due to the large search space and defect constraints. Algorithmic complexity: O(MN).

   **Chemistry/Materials Science Relevance:** Models atom diffusion on crystal surfaces, key to materials science (e.g., catalysis, surface defects).

4. Finite State Machine Traversal !!!
   **Task Description:** The model is given the definition of a finite state machine: a set of states, a starting state, an alphabet of inputs, and a transition function (e.g., "From state Q1, on input 'a', go to Q2; on input 'b', go to Q1"). The model is then given an input string. The task is to determine the final state of the machine after processing the entire string.

   **Scalable Variable:** Length of the input string (L), with 4 states (Q1, Q2, Q3, Q4) and alphabet {a, b, c}

   **Description of Complexity:** Processes L input characters through S states (here S=4). Simple for L=5, complex for L=1000 due to linear tracking of state transitions. Fixed S=4 keeps the transition table manageable (4×3=12 transitions for alphabet {a, b, c}). Algorithmic complexity: O(L).

   **Chemistry/Materials Science Relevance:** Indirectly relevant to modeling chemical reaction pathways or material phase transitions as state machines.

5. DNA Transcription and Translation !!!
   **Task Description:** The model is given a DNA sequence (e.g., 3'-TACGATTACAGC-5') and a standard codon-to-amino-acid lookup table. The task is to provide the final polypeptide chain, which involves two steps: 1) Transcribe the DNA to mRNA (5'-AUGCUAAUGUCG-3').
2) Translate the mRNA codons to a sequence of amino acids (Met-Stop-Met-Ser).

   **Scalable Variable:** Length of the DNA sequence (N), with 1–2 randomly added stop codons or a single intron

   **Description of Complexity:** Transcribes N bases to mRNA and translates them to amino acids (N/3 codons). Simple for N=12 (4 codons), complex for N=1000 with introns/stop codons due to sequence processing and rule application (base pairing, codon lookups). Algorithmic complexity: O(N).

   **Chemistry/Materials Science Relevance:** Protein synthesis, relevant to protein chemistry and biomaterial design (e.g., protein-based materials).

6. Resistor Network Simplification !!!
   **Task Description:** The model is given a textual description of a circuit of resistors (e.g., "R1 of 10 ohms is in series with a parallel combination of R2 (20 ohms) and R3 (20 ohms)"). The task is to calculate the total equivalent resistance of the network. Nesting levels: up to 10 resistors have 3 levels, up to 20 have at most 4 levels, and more than 20 have 5 levels.

   **Scalable Variable:** Number of resistors (N)

   **Description of Complexity:** Combines N resistors via series/parallel rules, with recursive parsing for nested structures (up to 5 levels for N>20). Moderate for N=3 (1–2 levels), complex for N=10 (3–5 levels) due to parsing and reciprocal calculations. Algorithmic complexity: O(N).

   **Chemistry/Materials Science Relevance:** Relevant to materials science via resistor material properties (e.g., conductivity in nanomaterials).

7. Radioactive Decay Chain !!!
   **Task Description:** The model is told that a specific isotope (e.g., Uranium-238, Z=92) undergoes a sequence of radioactive decays (e.g., "alpha, beta, beta, alpha"). The task is to identify the final resulting isotope (element name and mass number). The rules are: alpha decay reduces the atomic number by 2 and the mass number by 4; beta decay increases the atomic number by 1 and does not change the mass number.
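   The decay rules above reduce to simple bookkeeping on (Z, A). A minimal sketch, assuming only the alpha and beta rules stated in the task (the function name and rule table are ours, not the SciRex implementation):

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Python
# Decay rules as (delta-Z, delta-A) pairs, taken from the task statement.
DECAY_RULES = {"alpha": (-2, -4), "beta": (+1, 0)}

def apply_decay_chain(z, a, chain):
    """Apply each decay in order, tracking atomic number Z and mass number A."""
    for decay in chain:
        dz, da = DECAY_RULES[decay]
        z, a = z + dz, a + da
    return z, a

# Uranium-238 (Z=92, A=238) undergoing "alpha, beta, beta, alpha":
print(apply_decay_chain(92, 238, ["alpha", "beta", "beta", "alpha"]))  # → (90, 230), i.e., Thorium-230
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~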
**Scalable Variable:** Length of the decay chain (L)

   **Description of Complexity:** Applies L decay steps (alpha, beta, gamma, theta) to track Z and A. Moderate for L=3, complex for L=15 with hypothetical decays (e.g., theta: Z−1, A−4), requiring precise arithmetic and rule switching. Algorithmic complexity: O(L).

   **Chemistry/Materials Science Relevance:** Nuclear chemistry, relevant to radiometric dating and material stability (e.g., isotopes in materials).

8. Tree Traversal !!!
   **Task Description:** The model is given a representation of a binary tree (e.g., using a nested list or a dictionary like {'value': 10, 'left': {'value': 5}, 'right': {'value': 15}}). The task is to provide the sequence of node values visited in a specific traversal order (e.g., "pre-order", "in-order", or "post-order").

   **Scalable Variable:** Number of nodes in the tree (N)

   **Description of Complexity:** Traverses N nodes of a binary tree (pre-order, in-order, or post-order). Simple for N=5, complex for N=50 due to recursive navigation and order accuracy. Algorithmic complexity: O(N).

   **Chemistry/Materials Science Relevance:** Indirectly relevant to chemistry via molecular tree structures (e.g., chemical graphs).

9. Kinematic Motion Calculation !!!
   **Task Description:** An object starts at rest. It undergoes a series of constant accelerations for specified durations (e.g., "Accelerate at 2 m/s² for 5 seconds, then accelerate at -1 m/s² for 8 seconds"). The task is to calculate the final velocity of the object.

   **Scalable Variable:** Number of distinct acceleration phases (N)

   **Description of Complexity:** Applies the kinematic equation (v = u + at) for N phases. Simple for N=2, moderate for N=10 due to sequential velocity updates, but remains linear and straightforward. Algorithmic complexity: O(N).

   **Chemistry/Materials Science Relevance:** Limited connection to chemistry/materials science, though applicable to material dynamics (e.g., particle motion).

10. Knights and Knaves Puzzle !!!
**Task Description:** The model is told it is on an island inhabited by Knights (who always tell the truth) and Knaves (who always lie). It is then presented with statements from islanders (e.g., "A says: 'B is a Knave'. B says: 'A and I are of opposite types'."). The task is to deduce the identity (Knight or Knave) of each person.

   **Scalable Variable:** Number of islanders and statements (N)

   **Description of Complexity:** Solves a constraint satisfaction problem with N islanders, testing truth/lie consistency. Complex even for N=2 due to combinatorial logic; N=5 is highly challenging due to exponential possibilities. Algorithmic complexity: O(2^N).

   **Chemistry/Materials Science Relevance:** Minimal direct connection to chemistry/materials science, but useful for logical hypothesis testing.