Stochastic CHAOS: Why Deterministic Inference Kills, and Distributional Variability Is the Heartbeat of Artificial Cognition

April 9, 2026

Tanmay Joshi, Shourya Aggarwal, Anusa Saha, Aadi Pandey, Shreyash Dhoot, Vighnesh Rai, Raxit Goswami, Aman Chadha, Vinija Jain, Amitava Das.
DOI: https://doi.org/10.1609/aaai.v37i13.27085

Abstract

Deterministic inference is a comforting ideal in classical software: the same program on the same input should always produce the same output. As large language models (LLMs) move into real-world deployment, this ideal has been imported wholesale into inference stacks. Recent work from the Thinking Machines Lab has presented a detailed analysis of nondeterminism in LLM inference, showing in a widely discussed blog post how batch-invariant kernels and deterministic attention can be used to enforce bitwise-identical outputs for a given prompt, effectively positioning “deterministic inference” as a prerequisite for reproducibility, on-policy RL, and enterprise reliability. In this paper, we take the opposite stance. We argue that, for LLMs, deterministic inference kills: it kills the ability to model uncertainty, makes emergent abilities vanish, disrupts reasoning abilities by killing multiple reasoning paths, and renders safety alignment brittle so that we lose honest generalization. LLMs implement conditional distributions pθ(y | x), not fixed functions f(x); collapsing these distributions to a single canonical output per prompt may feel reassuring, but it systematically hides the very properties that matter for artificial cognition. We instead advocate Stochastic CHAOS and claim that distributional variability is the heart of artificial cognition.

We begin by disentangling algorithmic stochasticity—intentional sampling in decoding (temperature, top-k/top-p, self-consistency, tree-of-thought)—from numerical and systems nondeterminism (batching and floating-point artifacts even at temperature 0), and introduce three distinct stability goals for LLM inference: bitwise determinism (same bits, any load), distributional reproducibility (stable output distributions across seeds, hardware, and batching), and semantic stability (safety, constraints, and coarse meaning preserved under sampling). At a high level, we argue that optimizing against a single deterministic surface—one score per model, one canonical completion per prompt—encourages models to remember evaluation quirks rather than to generalize. Deterministic inference, we claim, privileges clean-looking traces over faithful characterization of the underlying distribution pθ.

Empirically, we show that deterministic inference is systematically misleading in four ways. (i) For instruction-following, single-sample deterministic evaluation consistently underestimates both capability and fragility compared to multi-sample, distributional evaluation: models that appear perfectly reliable on canonical prompts exhibit substantial failure probability under paraphrases, re-orderings, and noisy variants. (ii) For emergent abilities, success probabilities that exhibit clear phase-like transitions across scales under stochastic evaluation effectively vanish when only greedy, deterministic decoding is measured, making genuine emergence invisible in metrics. (iii) For reasoning, multi-path methods such as self-consistency and tree-of-thought degrade sharply when forced onto a deterministic backbone: the space of alternative reasoning trajectories collapses to a single brittle script, reducing both accuracy and diagnostic power about how the model “thinks.” (iv) For safety and alignment, risk estimated from deterministic runs systematically underestimates tail behavior: rare but dangerous completions (jailbreaks, toxic outputs, subtle policy violations) appear only under multi-sample, multi-perturbation evaluation, while determinism makes discovered exploits reliably reproducible once found.

Implications. Taken together, these results argue that bitwise determinism should not be the default objective. Instead, LLM inference should prioritize distributional reproducibility and semantic stability. In the spirit of Stochastic CHAOS, randomness is not an annoyance to be eliminated but a signal to be measured and controlled—a core substrate for robust artificial cognition.

1. What Do We Mean by “Determinism” in LLM Inference?

Reproducibility is a bedrock requirement for many real-world system deployments. Ideally, a deterministic system produces the exact same output for a given input every time, enhancing trust, debuggability, and auditability. In classical algorithmic terms, an algorithm is deterministic if its outputs are entirely determined by its inputs. The high-performance computing (HPC) literature further distinguishes external determinism (identical final results regardless of execution interleavings) from internal determinism (identical step-by-step execution traces) (Chiang et al., 2013; Demmel et al., 2016). In the context of large language model (LLM) inference, our concern is mainly with external determinism: given the same prompt, we expect the same response.

In practice, however, even after pinning all obvious sources of randomness, LLM-based systems often fail this ideal. Recent work on LLM stability and reproducibility shows that repeated nominally deterministic runs (e.g., temperature T=0 with fixed decoding parameters) can exhibit noticeable variation in both surface form and task accuracy (Atil et al., 2024; Kaikaus et al., 2024). At the same time, system developers can truthfully say that “all the kernels used in a language model’s forward pass are deterministic”, while users still observe nondeterministic outputs. As emphasized in both the HPC (Chiang et al., 2013; Demmel et al., 2016) and ML reproducibility literature (Zhuang et al., 2021; Chen et al., 2022), such discrepancies often arise from where we draw the boundary around the “input” and which level of determinism we care about.

In this section, we therefore disentangle several contributing layers: (1) algorithmic stochasticity in decoding by design, (2) system-level nondeterminism induced by floating-point arithmetic and parallel kernels, and (3) user-facing notions of bitwise, distributional, and semantic stability. This layered view will be central to our later analysis of stochastic chaos in LLM behavior.

1.1 Idealized Determinism: Greedy Decoding at T=0

From a theoretical perspective, a neural language model implements a fixed function fθ(x) that maps an input text x to a probability distribution over output sequences. The model’s weights θ are fixed after training, so the same input always yields the same distribution. If we imagine an “oracle” implementation with infinite-precision arithmetic and no external interference, then generating text by always picking the highest-probability next token (greedy decoding) would indeed be deterministic. This greedy decoding (equivalently, temperature T=0) is often thought to remove all stochasticity: at each generation step t, the next token yt is chosen as

$$y_t = \arg\max_{w} \, p_\theta(w \mid x, y_{<t}).$$

In this idealized world, an identical prompt x would yield the same completion y1:T on every run.
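As a minimal sketch of the greedy rule above, the following toy example decodes from a hand-written next-token table. The table `TOY_MODEL`, its vocabulary, and its probabilities are invented for illustration, and the toy conditions only on the previous token rather than on the full prefix (x, y<t) as a real LLM would.

```python
# Toy next-token table: previous token -> distribution over next tokens.
# All tokens and probabilities are hypothetical, chosen only for illustration.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"barks": 0.9, "</s>": 0.1},
    "sleeps": {"</s>": 1.0},
    "barks": {"</s>": 1.0},
}


def greedy_decode(model, start="<s>", max_steps=10):
    """At each step pick the argmax token; stop at the end-of-sequence token."""
    tokens = []
    prev = start
    for _ in range(max_steps):
        dist = model[prev]
        # argmax over candidate tokens by probability
        nxt = max(dist, key=dist.get)
        if nxt == "</s>":
            break
        tokens.append(nxt)
        prev = nxt
    return tokens


print(greedy_decode(TOY_MODEL))  # ['the', 'cat', 'sleeps']
```

With exact arithmetic and a fixed table, every call returns the same sequence, which is the idealized behavior described here.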

However, this scenario implicitly assumes a perfectly fixed computation. As decades of work on floating-point numerics make clear, real implementations rarely enjoy such cleanliness (Demmel et al., 2016). Even in ostensibly deterministic pipelines, subtle numerical variations can creep in, especially on massively parallel hardware. This raises a basic question: when we say “same input, same output,” do we include the entire system state—hardware type, kernel versions, batch composition, and so on—as part of the input?

In an online serving context, one user’s prompt may be processed alongside many others. Those concurrent requests are not part of the user’s query, yet they can influence the result through dynamic batching, scheduling, and kernel selection (He and Thinking Machines Lab, 2025; Zhang et al., 2025). Under a strict external determinism definition, we might treat the whole inference server’s batch as the input, in which case the server’s function could be deterministic while an individual user still experiences nondeterministic behavior. This mismatch is exactly what recent work on LLM stability and batch invariance highlights (Atil et al., 2024; He and Thinking Machines Lab, 2025).

Throughout the paper, we adopt the intuitive notion that each user expects identical outputs for identical prompts, independent of what else is happening on the system. The rest of this section explains why this expectation frequently fails, even under T=0 greedy decoding.

1.2 Algorithmic Stochasticity: Sampling by Design

Large language models are fundamentally probabilistic generative models. During inference, they produce text by sampling from a learned probability distribution over tokens. This algorithmic stochasticity is by design: it is what allows LLMs to generate varied and creative responses rather than always repeating a single answer. When using a non-zero temperature or nucleus/top-p sampling, the model intentionally injects randomness into its outputs. In such cases, nondeterminism is expected and often desired.

A range of decoding strategies has been developed:

Temperature scaling. A temperature T > 0 smooths or sharpens the output distribution; higher T values flatten probabilities, increasing randomness, whereas T → 0 approaches greedy selection (Holtzman et al., 2020).
Top-k sampling. Fan et al. (Fan et al., 2018b) restrict sampling to the k most probable tokens at each step, limiting the risk of bizarre low-probability words while still allowing variability.
Nucleus (top-p) sampling. Holtzman et al. (Holtzman et al., 2020) showed that greedy and pure beam search often produce degenerate, repetitive text. They introduced nucleus sampling, which draws from the smallest set of tokens whose cumulative probability exceeds p, and demonstrated that this better matches the diversity and quality of human text.
Self-consistency and Tree-of-Thought. Recent work leverages stochastic trajectories at the reasoning level. Self-consistency decoding samples multiple chain-of-thought (CoT) solutions and aggregates their answers, achieving large gains on math benchmarks (Wang et al., 2023a). Tree-of-Thought prompting explicitly explores multiple sampled branches of reasoning and selects promising ones, further improving complex problem solving (Yao et al., 2024).
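The token-level strategies in this list can be sketched in a few lines of standard-library Python. This is an illustrative implementation over a plain list of logits, not the code of any particular inference engine; the function names and the example logits are assumptions made for this sketch.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; requires temperature > 0."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def sample(probs, rng):
    """Draw one token index from the (filtered) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=1.0)
rng = random.Random(0)
token = sample(top_p_filter(probs, p=0.9), rng)
```

Lowering the temperature concentrates mass on the top token, while the two filters truncate the tail before sampling; greedy decoding is the limiting case in which only the single most probable token survives.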

In these settings, variability is a feature. Diversity in sampled outputs tends to improve fluency, creativity, and even correctness on reasoning tasks. For example, self-consistency dramatically boosts success rates on GSM8K by voting over a collection of independently sampled reasoning paths (Wang et al., 2023a). Similarly, Tree-of-Thought explores multiple stochastic trajectories through a structured search, moving beyond the limitations of a single greedy chain (Yao et al., 2024).
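A minimal sketch of the voting step in self-consistency, assuming each sampled reasoning path has already been reduced to a final answer string. The sampler `toy_reasoner` and its answer probabilities are hypothetical stand-ins for an LLM sampled at non-zero temperature.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


def toy_reasoner(rng):
    """Hypothetical sampler: correct answer '42' with probability 0.6,
    otherwise one of two distinct wrong answers."""
    return rng.choices(["42", "41", "7"], weights=[0.6, 0.2, 0.2], k=1)[0]


def self_consistency(sampler, n_samples, seed=0):
    """Sample n_samples independent answers and aggregate by majority vote."""
    rng = random.Random(seed)
    answers = [sampler(rng) for _ in range(n_samples)]
    return majority_vote(answers)


print(majority_vote(["42", "41", "42", "7"]))  # 42
print(self_consistency(toy_reasoner, n_samples=501))
```

In this toy setting a single sample is wrong 40% of the time, yet the vote over many samples is very likely to recover the correct answer, because the wrong answers split their mass across different values.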

It is therefore crucial to distinguish intentional randomness (an algorithm design choice) from implementation randomness (system-level nondeterminism). The former can be turned off by choosing deterministic decoding. The latter persists even with T=0 and no sampling, and is the focus of the next subsection.

1.3 System-Level Nondeterminism

Even after eliminating algorithmic randomness, modern LLM inference platforms can exhibit nondeterministic results due to underlying hardware and system behaviors. The primary technical cause is well-known in numerical computing: floating-point non-associativity. Finite-precision arithmetic on parallel hardware means that operations such as summation are not exactly associative or commutative; reordering them can change the outcome by tiny amounts (Demmel et al., 2016). Formally, for floating-point numbers we can have

$$(a + b) + c \neq a + (b + c),$$

even though, mathematically, addition is associative. In transformer inference, this arises in operations like the summation of attention scores or the accumulation of matrix multiplication results.
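The inequality is easy to reproduce in any IEEE 754 environment. The sketch below uses Python's 64-bit floats; the specific values are chosen for illustration, and GPU kernels typically use lower-precision formats where the effect is more pronounced.

```python
# Classic small-magnitude case: the two groupings differ in the last bit.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# Large-magnitude case: the small addend is absorbed in one grouping only.
big = 1e17
print((big + -big) + 1.0)  # 1.0
print(big + (-big + 1.0))  # 0.0
```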

GPU implementations execute many additions in parallel threads, and the order in which partial results are combined can vary depending on scheduling, batch size, or kernel selection (Zhuang et al., 2021). These differences are usually on the order of a few units in the last place (ULPs). Most of the time, such tiny variations do not change the outcome of greedy decoding—the highest logit remains highest. However, when two candidate tokens have almost equal probability, a minute numerical perturbation can flip their order. When that happens, the model may produce a different next word, and from that point onward the entire generated text can diverge (Atil et al., 2024).

Consider the sentence to be completed:

The recipe calls for sugar, flour, and

Suppose the model’s next-token logits (after softmax) yield:

p(“eggs”) = 0.500, p(“butter”) = 0.499, p(others) = 0.001.

Under one execution, floating-point reductions and softmax computations give the values above, and greedy decoding picks “eggs”. Under another execution, due to slightly different accumulation order or tiling in a batched kernel, the logits are perturbed so that

p(“eggs”) = 0.499, p(“butter”) = 0.500.

Now greedy decoding chooses “butter” instead. From there, the continuation may diverge substantially, despite the same high-level decoding algorithm and prompt. Recent empirical studies document exactly this kind of sensitivity in T=0 runs across evaluation suites (Atil et al., 2024; Kaikaus et al., 2024).
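The flip can be mimicked with a toy calculation. The logit values and the perturbation below are hypothetical and much larger than a few ULPs in 64-bit arithmetic; they stand in for the accumulated effect of reordered low-precision reductions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict of token -> logit."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def greedy_pick(logits):
    """Return the token with the highest probability."""
    probs = softmax(logits)
    return max(probs, key=probs.get)


# Run A: "eggs" leads by a hair (illustrative numbers).
run_a = {"eggs": 5.0010, "butter": 5.0000, "salt": -1.0}

# Run B: same prompt, but a different accumulation order nudges the logits.
noise = {"eggs": -0.0015, "butter": 0.0005, "salt": 0.0}  # assumed perturbation
run_b = {tok: v + noise[tok] for tok, v in run_a.items()}

print(greedy_pick(run_a))  # eggs
print(greedy_pick(run_b))  # butter
```

The decoding rule is identical in both runs; only the near-tie in the logits makes the outcome sensitive to the perturbation.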

Dynamic batching and batch invariance. These floating-point effects are exacerbated by the parallel and distributed execution strategies used to accelerate LLMs. Modern inference engines batch multiple user requests and split computations across many GPU cores (and sometimes multiple GPUs) for efficiency. The sequence of operations that produces a particular output can depend on what other inputs are being processed in parallel. Thinking Machines Lab identifies this lack of batch invariance as a primary reason that most LLM endpoints appear nondeterministic to users (He and Thinking Machines Lab, 2025). Even if each low-level kernel (e.g., a GEMM or RMSNorm) is individually deterministic in isolation, it may not be batch-invariant. With different batch sizes or sequence packings, the underlying library can choose different tiling strategies or reduction patterns, changing the accumulation order and hence the final floating-point result (Zhang et al., 2025; Zheng et al., 2024).

Thus, a user who sends the same prompt twice may receive different completions solely because the prompt was batched differently with other users’ requests.
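To illustrate how a tiling choice alone can change a reduction, the sketch below sums the same four numbers with two tile sizes. It is a CPU toy in 64-bit floats, not a GPU kernel; the function `tiled_sum` and the input values are assumptions made for this example.

```python
def tiled_sum(values, tile):
    """Sum each tile left to right, then combine the partial sums left to right."""
    partials = []
    for start in range(0, len(values), tile):
        acc = 0.0
        for v in values[start:start + tile]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# Same data, two tilings. The exact mathematical sum is 2.0.
values = [1e17, 1.0, -1e17, 1.0]
print(tiled_sum(values, tile=4))  # 1.0
print(tiled_sum(values, tile=2))  # 0.0
```

Each call is individually deterministic, yet the result depends on the tile size. If a library picks the tile size from the batch shape, the same request can see different numerics under different load, which is the failure of batch invariance described above.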

Other sources of system-level nondeterminism include:

Non-deterministic GPU kernels. Some libraries use atomic operations or race-prone implementations for speed, introducing execution-order dependence.
Hardware and software drift. Different GPU models, driver versions, or library updates can change low-level numerical behavior; deep-learning framework version changes have been shown to impact reproducibility even with fixed seeds (Zhuang et al., 2021; Chen et al., 2022; Shahriari et al., 2022; PyTorch Developers, 2024).
Model and API updates. Cloud providers may silently roll out new checkpoint versions or finetuned variants behind the same model name, changing outputs even if everything else is held fixed. OpenAI, for example, explicitly warns that identical requests may produce slightly different outputs over time and exposes a seed parameter and a system_fingerprint field to help track such changes (OpenAI, 2024).

Empirically, the impact of such nondeterminism is not purely cosmetic. Atil et al. (Atil et al., 2024) show that, across repeated T=0 runs of the same evaluation suite, task accuracy can fluctuate by double-digit percentages purely due to implementation-level nondeterminism. Kaikaus et al. (Kaikaus et al., 2024) report substantial variation in code-generation metrics from ChatGPT across identical prompts. These results echo earlier findings in training-time reproducibility (Zhuang et al., 2021; Chen et al., 2022), now appearing at inference time.

Mitigations and trade-offs. Recent work has shown that it is technically possible to defeat many of these system-level nondeterminism sources, but not without cost. One approach is to redesign computational kernels to be explicitly batch-invariant and numerically reproducible: summations are performed in a fixed order, tiling is chosen deterministically, and parallel reductions avoid race conditions (He and Thinking Machines Lab, 2025; Zhang et al., 2025). Using such custom kernels, these systems demonstrate bitwise-identical LLM outputs across repeated runs and dynamic batching. The drawback is performance and complexity. Enforcing strict determinism often means forgoing some optimizations and adding synchronization; both the ML reproducibility literature and framework documentation emphasize substantial throughput penalties and engineering overheads for deterministic modes (Zhuang et al., 2021; Chen et al., 2022; PyTorch Core Team, 2020). Later systems such as SGLang and deterministic vLLM reduce this overhead, but still report noticeable slowdowns when deterministic mode is enabled (Zheng et al., 2024; Zhang et al., 2025). More broadly, deterministic GPU algorithms are widely known to be slower than their nondeterministic counterparts (PyTorch Core Team, 2020).

1.4 Historical Perspectives

The tension between determinism and efficiency is not new. In HPC, reproducibility of simulation results has been a longstanding concern (Demmel et al., 2016; Chiang et al., 2013). Researchers have catalogued sources of nondeterminism ranging from data races and thread scheduling to floating-point rounding differences on varying core counts, and proposed deterministic replay and reproducible reduction algorithms as remedies. These methods improve reproducibility but often incur sizable runtime and memory overheads.

In machine learning, reproducibility discussions historically focused on training, where stochastic gradient descent introduces randomness via initialization, minibatch ordering, and augmentation. Efforts to make training fully deterministic—by controlling random seeds, disabling nondeterministic kernels, and fixing parallel semantics—have shown that the resulting overheads can be severe (Zhuang et al., 2021; Nagarajan et al., 2018; Chen et al., 2022). Consequently, the common practice in training is to run multiple randomized trials and report aggregate metrics rather than bitwise-identical runs (Zhuang et al., 2021; Gundersen et al., 2023).

Inference, however, differs: we typically run a model once per input and cannot easily average over many runs. This elevates the importance of stability at inference time. Yet, as vendors like OpenAI explicitly note, even with temperature T=0 and fixed parameters, identical requests may produce slightly different outputs due to infrastructure changes or subtle numeric drift; they therefore introduce a seed parameter and a system_fingerprint to provide some control and visibility, while carefully promising only “mostly consistent” behavior (OpenAI, 2024).

More recently, Thinking Machines Lab has taken a stronger stance, arguing in their “Defeating Nondeterminism in LLM Inference” post that we should treat inference nondeterminism as a bug to be fixed (He and Thinking Machines Lab, 2025). Their work and follow-up efforts in vLLM and SGLang demonstrate that much of the observed variability in T=0 inference can in fact be engineered away with appropriate kernels and infrastructure (Zhang et al., 2025; Zheng et al., 2024). However, as we argue throughout this paper, fixing nondeterminism is not always synonymous with improving the behavior of a probabilistic generative model, especially when one cares about distributional properties rather than a single bitwise output.

1.5 A Stability Taxonomy

The preceding discussion suggests that determinism in LLM inference is best understood as a spectrum, not a binary property. We propose the following stability taxonomy:

Bitwise determinism. The strictest notion: the entire output sequence (and, implicitly, all intermediate numerical states) is identical at the bit level across runs. Achieving this requires:

Deterministic decoding (no sampling, no random tie-breaking),
Numerically reproducible kernels (fixed reduction orders, no atomics, controlled tiling),
Controlled execution environment (same hardware, same library versions, no hidden model updates).

This is the level targeted by deterministic variants of vLLM, SGLang, and batch-invariant kernels from Thinking Machines (He and Thinking Machines Lab, 2025; Zhang et al., 2025; Zheng et al., 2024). It is extremely valuable for debugging, regression testing, and certain scientific audits, but comes with non-trivial cost in performance and engineering complexity (Zhuang et al., 2021; PyTorch Core Team, 2020).

Distributional reproducibility. A weaker but often more relevant requirement is that the distribution of outputs is stable, even if individual draws differ. For a stochastic decoder (e.g., nucleus sampling), distributional reproducibility means repeated runs with the same configuration approximate the same underlying distribution pθ(y | x): the frequencies of different outcomes, success rates, and uncertainty profiles remain consistent (Atil et al., 2024). From this perspective, the goal is not to produce the same answer every time, but to ensure that any variability reflects true model uncertainty rather than uncontrolled numeric noise. Evaluation frameworks increasingly recommend repeated sampling and reporting mean and variance of metrics rather than single-point estimates (Zhuang et al., 2021; Gundersen et al., 2023).

Semantic stability. The weakest, but most user-facing, notion is that the meaning or task outcome remains stable under small perturbations or repeated queries. Two outputs may differ at the surface level yet still be semantically equivalent (e.g., paraphrases or alternate phrasings). For many applications, users care far more about semantic stability than bitwise identity. Empirical studies find that while raw text outputs may vary significantly run-to-run, the final answers (e.g., extracted multiple-choice labels or numeric results) are often much more stable (Atil et al., 2024; Kaikaus et al., 2024). Designing downstream systems to focus on semantic content rather than exact strings can therefore absorb much of the apparent nondeterminism.
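The two weaker notions in this taxonomy can be made operational with simple statistics. The sketch below is one possible instantiation, not a protocol taken from the cited work: it compares outcome frequencies across two seeded runs using total variation distance (distributional reproducibility), and compares extracted final answers instead of raw strings (semantic stability). The sampler, outcome labels, and example outputs are hypothetical.

```python
import random
import re
from collections import Counter


def empirical_distribution(outcomes):
    """Relative frequency of each outcome."""
    counts = Counter(outcomes)
    n = len(outcomes)
    return {k: c / n for k, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


def run_once(seed, n=20000):
    """Hypothetical stochastic decoder: n draws over three outcome classes."""
    rng = random.Random(seed)
    return rng.choices(["A", "B", "C"], weights=[0.7, 0.2, 0.1], k=n)


# Distributional reproducibility: different seeds, nearly identical frequencies.
tv = total_variation(empirical_distribution(run_once(0)),
                     empirical_distribution(run_once(1)))
print(f"TV distance between runs: {tv:.4f}")


def extract_final_number(text):
    """Return the last number mentioned in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


# Semantic stability: surface forms differ, extracted answers agree.
outputs = [
    "The total is 12 apples, so the answer is 12.",
    "Adding 5 and 7 gives 12",
    "Answer: 12.0",
]
distinct_strings = len(set(outputs))
distinct_answers = len({extract_final_number(o) for o in outputs})
print(distinct_strings, distinct_answers)  # 3 1
```

Individual draws differ between the two runs, but the estimated distributions nearly coincide, and three different strings collapse to one answer once compared at the level of meaning.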

Putting it together. Determinism in LLM inference emerges from multiple layers of the stack: algorithmic, numeric, system-level, and semantic. Improving stability is thus a multi-pronged engineering and evaluation challenge. Researchers are beginning to conquer this challenge piece by piece: deterministic kernels, batch-invariant execution, environment fingerprinting, and evaluation practices that embrace distributional thinking (He and Thinking Machines Lab, 2025; Atil et al., 2024; OpenAI, 2024). Yet practical usage often strikes a balance between strict reproducibility and the efficient, parallel, probabilistic nature of modern AI systems. Absolute determinism remains a niche mode for special purposes; for most deployments, the goal is robust semantic and distributional stability under a realistic, noisy serving environment.

2. Let’s Stress-Test “Deterministic Inference” in Practice

The discussion above makes one point clear: “deterministic inference” is not a natural primitive of large language models, but an engineering objective imposed on top of a fundamentally stochastic system. Recent work from Thinking Machines Lab (He and Thinking Machines Lab, 2025) shows, impressively, that a careful redesign of batch-invariant kernels and deterministic attention can enforce bitwise-identical outputs for a given prompt, even under dynamic batching. In their narrative, nondeterminism is a bug to be eradicated: a source of flaky tests, unreliable on-policy RL, and enterprise-grade surprises. The implicit ideal is that LLM inference should behave like a pure function from prompts to strings.

Our central hypothesis takes the opposite stance. We argue that aggressively enforcing deterministic inference can itself degrade the scientific validity, generalization ability, and safety of LLMs. Rather than treating nondeterminism as mere engineering noise, we treat it as a first-class signal about the model’s underlying distribution pθ(y | x)—a distribution that is central to how modern LLMs represent uncertainty, support multiple reasoning paths, and exhibit emergent behaviors (Wei et al., 2022a; Ganguli et al., 2022a).

From this vantage point, the crucial question is not only “can we defeat nondeterminism?” but “what do we lose if we do?” To make this tension concrete, we move from theory to stress tests. Our empirical program is organized around four claims about the consequences of enforcing strict determinism at inference time:

(i) Deterministic evaluation encourages benchmark memorization over genuine generalization. We revisit the trajectory of GLUE, where single-score, single-output evaluation led to rapid saturation and brittle models (Wang et al., 2018, 2019; Geirhos et al., 2020). We argue that sequence-level determinism risks repeating the same mistake at a finer granularity: optimizing for a single canonical completion per prompt rather than for robust distributions over semantically correct answers. In later sections, we show that evaluation practices based on a single deterministic run can mask substantial variability in model behavior and overstate progress.

(ii) Deterministic decoding suppresses emergent abilities that rely on exploration. Many “emergent” behaviors in LLMs, from few-shot in-context learning to chain-of-thought and self-consistency gains on math and reasoning tasks (Brown et al., 2020; Wei et al., 2022b; Wang et al., 2023b; Yao et al., 2023; Wei et al., 2022a), depend critically on sampling multiple trajectories. Forcing a single greedy path at T=0 can eliminate these behaviors, not because the underlying model lacks the capacity, but because the inference stack refuses to explore it. We show that, on standard reasoning benchmarks, strict greedy decoding systematically underestimates the model’s latent competence relative to multi-sample decoding.

(iii) Deterministic inference collapses multiple valid reasoning paths into a single, brittle trace. Complex reasoning tasks often admit many correct solution paths and many near-miss failures. Multi-sample decoding surfaces a rich landscape of alternative reasoning strategies, while strict greedy decoding prunes this diversity down to a single chain. This path collapse hides the model’s internal uncertainty and makes it harder to diagnose where and how reasoning fails. Building on self-consistency-style analyses (Wang et al., 2023b), we show that restricting evaluation to one deterministic path can misclassify models as either “failing” or “passing” on an item when the underlying distribution is substantially more nuanced.

(iv) Deterministic safety evaluation creates an illusion of robustness. Safety research increasingly treats LLMs as strategic, stochastic agents whose behavior can change under distribution shift, prompt injection, or perceived oversight (Perez et al., 2022; Ganguli et al., 2022b; Greenblatt et al., 2024; Hubinger et al., 2024). Evaluating safety only under a single deterministic decoder can drastically underestimate risk: dangerous but low-probability modes may not surface in any one greedy run, giving a false sense of security. We show that even when a model appears “safe” under T=0 greedy decoding, low-measure but high-risk behaviors emerge under modest stochasticity or paraphrased attack prompts.
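The arithmetic behind claim (iv) can be sketched directly. Assume, purely for illustration, that each independently sampled completion is harmful with some small probability p; the chance of seeing at least one harmful completion in n samples is then 1 − (1 − p)^n. The function names and the value p = 0.01 are assumptions of this sketch, not measurements from the paper.

```python
import math


def prob_at_least_one(p, n):
    """Probability that n independent samples contain at least one harmful output."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p, confidence):
    """Smallest n such that a harmful output is observed with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


p = 0.01  # hypothetical per-sample probability of a harmful completion
for n in (1, 10, 100, 1000):
    print(n, round(prob_at_least_one(p, n), 3))

print(samples_needed(p, confidence=0.95))  # 299
```

Under these assumptions a single stochastic sample exposes the behavior only 1% of the time, and a greedy run that always returns the modal completion may never expose it, whereas a few hundred samples make detection very likely.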

Crucially, our critique is not that deterministic inference is useless. We distinguish between deterministic modes as a diagnostic tool and determinism as a deployment norm. As we will argue later, deterministic modes remain indispensable for debugging, regression testing, and exact on-policy RL, where bitwise reproducibility is a legitimate requirement. Our claim, rather, is that elevating bitwise determinism into a default norm for LLM deployment and evaluation fundamentally misunderstands what these models are. LLMs are not compilers; they are stochastic semantic machines whose competence lives in the geometry of pθ(y | x), not in any single string sampled from it.

The rest of this paper operationalizes this perspective. Section 3 uses the history of GLUE as a cautionary tale, showing how single-score, single-output evaluation led to benchmark saturation and spurious progress, and how paraphrastic and distributional variants reveal hidden brittleness. Section 4 examines instruction-following, contrasting deterministic and stochastic decoding on paraphrased and adversarial prompts to expose lost generalization under strict determinism. Section 5 turns to emergent reasoning abilities, quantifying how multi-path, sampling-based decoding recovers solutions that greedy decoding systematically misses. Section 6 focuses on safety and alignment, showing how deterministic evaluation underestimates risk by masking rare but harmful generations. Together, these stress tests support our central thesis: what makes LLMs powerful is not their ability to be bitwise deterministic, but their ability to express and harness distributional variability in a controlled way.

3. Deterministic Inference Encourages Benchmark Memorization

The previous section argued that bitwise-deterministic inference is not a natural primitive for probabilistic generative models. We now show that, even at the evaluation level, insisting on a single deterministic output per input risks repeating an old mistake from the pre-LLM era: the GLUE saturation story. Our claim in this section is simple:

Claim 1: Deterministic inference (“one input, one canonical output, one scalar score”) turns evaluation into answer-from-memory: success is measured by reproducing a fixed surface form, not by how stably the model supports a distribution of semantically correct responses.

We first revisit GLUE as a cautionary tale of single-score benchmark culture, then show how modern sequence-level deterministic inference is structurally analogous. Finally, we introduce a GLUE-style robustness protocol over four LLM families and construct a heatmap of robustness that directly visualizes the cost of determinism.

3.1 GLUE as a Cautionary Tale

The GLUE benchmark (Wang et al., 2018) was designed as a multi-task testbed for natural language understanding, aggregating performance across nine tasks, including natural language inference (MNLI, RTE), paraphrase detection (QQP), question answering (QNLI), and sentiment analysis (SST-2). GLUE was an enormous success: it provided a standardized evaluation suite and a single scalar “GLUE score” that made progress easy to track and compare. SuperGLUE extended this template with harder tasks and an even more entrenched leaderboard culture (Wang et al., 2019).

GLUE’s design implicitly enshrined a particular notion of performance: for each example (xi, yi) and model fθ, the evaluation pipeline asked for a single predicted label ŷi = fθ(xi) and computed an accuracy or F1 score by comparing ŷi to yi. The whole community then reported a single scalar:

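$$\mathrm{GLUE}(f_\theta) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathrm{Score}_t(f_\theta),$$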

where t indexes tasks. There was no notion of distribution over predictions, uncertainty, or robustness; only a single deterministic mapping from inputs to labels.

Within roughly two years, GLUE was effectively “solved”: state-of-the-art models reported scores at or above estimated human performance. Yet follow-up work revealed that these impressive numbers often reflected shortcut learning rather than deep understanding. Gururangan et al. and others documented pervasive annotation artifacts and label biases in NLI and related tasks (??). Geirhos et al. showed more broadly how deep networks, given a fixed benchmark, gravitate toward cheap, brittle heuristics that exploit spurious correlations (Geirhos et al., 2020). Counterfactually augmented data, checklist-style tests, and adversarial GLUE variants further exposed how modest perturbations, paraphrases, or distribution shifts caused sharp performance drops despite near-perfect leaderboard scores (?????).

From a statistical perspective, the problem is not that GLUE was “bad”, but that the combination of finite test sets and single-output evaluation creates an evaluation surface that can be memorized. Once models and training pipelines are tuned directly against that surface, new parameters are free to overfit the idiosyncrasies of the benchmark’s finite sample. The resulting leaderboards give an illusion of steady progress even as out-of-distribution behavior stagnates.

3.2 From Label Determinism to Sequence Determinism

Large language models extend this picture in two important ways: they are generative, and they are stochastic. Instead of learning a classifier fθ(x) → y, they learn a conditional distribution

pθ(y | x),

where y is a text sequence, not a single label. Evaluation, however, often collapses this distribution back into a deterministic mapping by choosing a fixed decoding strategy $\mathrm{Dec}_d : p_\theta(\cdot \mid x) \mapsto \hat{y}$. For greedy decoding,

$\hat{y}^{\text{det}}(x) = \mathrm{Dec}_{\text{greedy}}\big(p_\theta(\cdot \mid x)\big) = \arg\max_{y} p_\theta(y \mid x),$

where in practice the argmax is taken token by token.

In many contemporary LLM evaluations, especially those adapted from GLUE-style tasks, performance is reported as

$\mathrm{Acc}^{\text{det}}(f_\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\phi\big(\hat{y}^{\text{det}}(x_i)\big) = y_i\big],$

where ϕ extracts a label (e.g., a multiple-choice option) from the deterministic completion. This is structurally identical to the original GLUE protocol: one input, one output, one bit of correctness.
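As a minimal sketch of this single-output protocol, the snippet below implements one possible label extractor ϕ and the resulting deterministic accuracy. The function names `extract_label` and `deterministic_accuracy`, the A/B/C option set, and the regular-expression rule are illustrative assumptions, not the extraction procedure used in the paper.

```python
import re

def extract_label(completion, options=("A", "B", "C")):
    """Hypothetical phi: return the first standalone option letter, or None."""
    pattern = r"\b(" + "|".join(re.escape(o) for o in options) + r")\b"
    match = re.search(pattern, completion)
    return match.group(1) if match else None

def deterministic_accuracy(completions, gold_labels):
    """One completion per input, one extracted label, one bit of correctness."""
    if len(completions) != len(gold_labels):
        raise ValueError("completions and gold_labels must have equal length")
    correct = sum(extract_label(c) == y for c, y in zip(completions, gold_labels))
    return correct / len(gold_labels)
```

A rule this simple would misread a capitalized article “A” as an option letter, which is one reason a constrained answer format matters when ϕ is deterministic.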

Yet, from the standpoint of artificial cognition, the meaningful object is not $\hat{y}^{\text{det}}(x)$ but the entire distribution pθ(y | x). Different decoding strategies (temperature sampling, nucleus sampling, self-consistency, Tree-of-Thought) all probe different slices of this distribution and often reveal capabilities that deterministic greedy decoding hides (Holtzman et al., 2020; Wang et al., 2023b; Yao et al., 2023). Insisting on a single deterministic trace amounts to replaying the GLUE error at the sequence level: we optimize and evaluate against a single surface point on a much richer distribution. To make this concern concrete, we now design a GLUE-style robustness protocol over four widely used tasks and a diverse set of LLM families, explicitly contrasting deterministic vs. stochastic evaluation.

3.3 Experimental Setup: GLUE-Style Robustness Under Decoding Choices

Tasks. We focus on four GLUE tasks that are both influential and amenable to paraphrastic manipulation:

MNLI (Multi-Genre Natural Language Inference): three-way classification (entailment, contradiction, neutral) over premise–hypothesis pairs with diverse genres.
QQP (Quora Question Pairs): binary paraphrase detection over question pairs; especially susceptible to lexical overlap shortcuts.
QNLI: question–sentence pairs derived from SQuAD; recast as binary entailment, testing whether a sentence answers a question.
SST-2: binary sentiment classification at the sentence level.

For each task t ∈ {MNLI, QQP, QNLI, SST-2}, we start from a held-out test (or dev) set

$\mathcal{D}_t^{\text{orig}} = \big\{ (x_i^{(t)}, y_i^{(t)}) \big\}_{i=1}^{N_t},$

where $x_i^{(t)}$ is the input text (single sentence or pair) and $y_i^{(t)}$ is the gold label.

From each original set we derive three label-preserving variants:

Paraphrased ($\mathcal{D}_t^{\text{para}}$): for each example, we generate 2–3 paraphrastic rewrites of one or both segments (premise/hypothesis, question/sentence) using a strong paraphrase model and filter them to preserve the label, e.g., by requiring high entailment confidence or human verification.
Perturbed ($\mathcal{D}_t^{\text{pert}}$): we apply small lexical and syntactic transformations that should not change the label: synonym substitution, tense changes, active/passive alternation, or mild word-order shuffles.
Adversarial paraphrased ($\mathcal{D}_t^{\text{adv}}$): we prompt an LLM to produce label-preserving but challenging rewrites (e.g., “keep the answer label unchanged but attempt to confuse a classifier by changing connectives and information order”), again filtered for correctness.

Each variant shares the same labels $y_i^{(t)}$ but differs in surface form. Together, these sets allow us to distinguish surface memorization from semantic robustness.

Models. To connect with our FRACTURE analysis, we evaluate the same 17-model zoo used in Figure ??: LLaMA-2 7B, LLaMA-2 13B, Vicuna-7B, LLaMA-3 8B, Gemma 2 9B, Gemma 2 27B, Mistral-7B, Mixtral-8×7B, Phi-2, LLaMA-3 7B, LLaMA-3 70B, Claude, Mixtral-8×22B, GPT-3.5, GPT-4o, GPT-4o mini, DeepSeek. These span open and closed models, small and large scales, and a variety of training pipelines. For each model m we use its official instruction-tuned checkpoint and recommended prompting style.

Decoding modes. For each model m and task t, we define two decoding modes:

Deterministic (Det): temperature T = 0, greedy decoding, nucleus p = 1.0 (i.e., no sampling). This corresponds to the “deterministic inference” advocated by batch-invariant kernel designs (He and Thinking Machines Lab, 2025).
Stochastic (Stoch): moderate temperature and nucleus sampling, e.g. T = 0.7, top-p = 0.9, with K independent samples per input (we use K = 10).
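To make the two modes concrete, here is a minimal sketch of how they differ at a single decoding step, applied to a toy logit vector. The helper names (`softmax`, `greedy_token`, `nucleus_sample`) and the toy logits are assumptions introduced for illustration; a real inference stack applies the same idea over the full vocabulary at every step.

```python
import math
import random

def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Det mode: T = 0 is treated as a plain argmax over the logits."""
    return max(range(len(logits)), key=lambda i: logits[i])

def nucleus_sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Stoch mode: sample from the smallest token set whose mass reaches top_p."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.0, -5.0]  # illustrative values only
det_token = greedy_token(toy_logits)
stoch_tokens = [nucleus_sample(toy_logits) for _ in range(10)]  # K = 10 draws
```

In Det mode the same token is returned on every call, whereas the K = 10 Stoch draws can land on any token inside the nucleus.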

In both modes, we prompt the model with a natural-language description of the task and a constrained answer format (e.g., options A/B/C). A deterministic label-extraction function ϕ maps each completion into a label in the task’s label set.

Deterministic vs. distributional evaluation. For each task t, dataset variant v ∈ {orig, para, pert, adv}, model m, and decoding mode d, we compute per-split accuracies as follows.

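Deterministic accuracy. In Det mode, we generate a single completion $\hat{y}^{\text{det}}_i$ for each input $x_i$ and compute

$A^{\text{Det}}_{t,v}(m) = \frac{1}{|\mathcal{D}_t^{v}|} \sum_{i=1}^{|\mathcal{D}_t^{v}|} \mathbb{1}\big[\phi(\hat{y}^{\text{det}}_i) = y_i\big].$

Stochastic majority-vote accuracy. In Stoch mode, we draw $K$ independent completions $\hat{y}^{(1)}_i, \ldots, \hat{y}^{(K)}_i$ and take a majority-vote label

$\tilde{y}_i = \arg\max_{c} \sum_{k=1}^{K} \mathbb{1}\big[\phi(\hat{y}^{(k)}_i) = c\big].$

We then compute

$A^{\text{Stoch}}_{t,v}(m) = \frac{1}{|\mathcal{D}_t^{v}|} \sum_{i=1}^{|\mathcal{D}_t^{v}|} \mathbb{1}\big[\tilde{y}_i = y_i\big].$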

This treats the model as a distribution over labels and asks whether the mode of that distribution is correct. We further record, for analysis but not for the heatmap, per-example label entropy and disagreement rate across samples, which quantify the model’s epistemic uncertainty (Atil et al., 2024).

A robustness ratio. To isolate robustness rather than absolute accuracy, we define a GLUE robustness ratio for each triplet (t, m, d):

$R_{t,d}(m) = \frac{\frac{1}{3} \sum_{v \in \{\text{para}, \text{pert}, \text{adv}\}} A^{d}_{t,v}(m)}{A^{d}_{t,\text{orig}}(m)}.$

By construction, $R_{t,d}(m) \in [0, 1]$ whenever the model performs no better on the variants than on the original split. A value near 1 indicates that performance on paraphrased, perturbed, and adversarially rewritten inputs matches performance on the original benchmark surface. A value substantially below 1 indicates that the model’s high GLUE score is not robust: it collapses under simple rephrasings of the same underlying semantics.

This normalization is important. Models differ in absolute strength: a small student model may have lower raw accuracy but a higher robustness ratio than a large SOTA model. By focusing on $R_{t,d}(m)$, we explicitly separate competence (high $A^{d}_{t,\text{orig}}(m)$) from generalization (high $R_{t,d}(m)$), and we can ask how decoding choices affect the latter.
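The quantities in this subsection can be sketched in a few lines of Python. The function names are hypothetical, and the form of the ratio used here (mean accuracy over the paraphrased, perturbed, and adversarial splits divided by accuracy on the original split) is our reading of the description above, stated as an assumption rather than as the authors' exact definition.

```python
import math
from collections import Counter

def majority_vote(labels):
    """Most frequent label among the K sampled labels (ties: first seen wins)."""
    return Counter(labels).most_common(1)[0][0]

def label_entropy(labels):
    """Shannon entropy (in nats) of the empirical label distribution."""
    k = len(labels)
    return -sum((c / k) * math.log(c / k) for c in Counter(labels).values())

def disagreement_rate(labels):
    """Fraction of samples whose label differs from the majority label."""
    top = majority_vote(labels)
    return sum(label != top for label in labels) / len(labels)

def accuracy(predictions, gold):
    """Plain accuracy of one prediction per example."""
    return sum(p == y for p, y in zip(predictions, gold)) / len(gold)

def robustness_ratio(acc_by_variant):
    """ASSUMED form: mean accuracy on para/pert/adv over accuracy on orig."""
    variants = ("para", "pert", "adv")
    mean_variant = sum(acc_by_variant[v] for v in variants) / len(variants)
    return mean_variant / acc_by_variant["orig"]
```

Under this reading, a model with original accuracy 0.90 and variant accuracies 0.81, 0.72, and 0.63 has a ratio of 0.80, regardless of how strong its absolute accuracy is compared with other models.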

3.4 A GLUE Robustness Heatmap for Deterministic vs. Stochastic Inference

To visualize the interaction between tasks, models, and decoding modes, we assemble an 8 × 17 robustness matrix. Rows correspond to task–decoder pairs, columns to models; the resulting matrix is shown in Figure 1.

Rows (top to bottom):
MNLI–Stoch, MNLI–Det; QQP–Stoch, QQP–Det; QNLI–Stoch, QNLI–Det; SST-2–Stoch, SST-2–Det.
Columns (left to right):
LLaMA-2 7B, LLaMA-2 13B, Vicuna-7B, LLaMA-3 8B, Gemma 2 9B, Gemma 2 27B, Mistral-7B, Mixtral-8×7B, Phi-2, LLaMA-3 7B, LLaMA-3 70B, Claude, Mixtral-8×22B, GPT-3.5, GPT-4o, GPT-4o mini, DeepSeek.

Figure 1: GLUE Robustness Heatmap under Deterministic vs. Stochastic Decoding. Each cell shows the robustness ratio $R_{t,d}(m)$ (higher is better) for task $t \in \{\text{MNLI}, \text{QQP}, \text{QNLI}, \text{SST-2}\}$, decoding mode $d \in \{\text{Det}, \text{Stoch}\}$, and model $m$. Darker green indicates that paraphrased, perturbed, and adversarial variants preserve most of the model’s original GLUE accuracy; purple indicates severe degradation. Across tasks and models, Stochastic rows are consistently greener than their Deterministic counterparts, showing that bitwise-deterministic greedy decoding systematically underestimates the distributional generalization capacity of the underlying model. In other words, deterministic evaluation replays the GLUE mistake: it optimizes for one canonical completion per prompt, while stochastic, distributional evaluation reveals that the model’s competence is broader, and its brittleness more severe, than the single trace suggests.

The entry in row $(t, d)$ and column $m$ is precisely $R_{t,d}(m)$. We render this matrix as a heatmap:

Color encodes robustness: darker green for high $R$ (robust), shifting toward yellow/blue and then purple as robustness degrades.
Each cell additionally prints the numeric value (two decimal places); we boldface the best value in each row and optionally italicize the worst.
Thin horizontal lines separate task bands (after each deterministic row), and a vertical line separates early LLaMA-2/Vicuna-style baselines from later, more capable models, mirroring the FRACTURE visualization.
Above the columns, we annotate the least robust model (lowest mean $\bar{R}$ across rows) as the “most brittle column”; on the right margin, we annotate the most brittle task–decoder row.
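The annotations in the list above reduce to simple row and column statistics over the robustness matrix. The sketch below shows one way to compute them; the helper names and the asterisk marker (a plain-text stand-in for boldface) are assumptions, and any values passed in are the caller's own rather than results from the paper.

```python
def brittle_annotations(matrix, row_names, col_names):
    """Return (most brittle column, most brittle row) by lowest mean ratio."""
    n_rows, n_cols = len(matrix), len(col_names)
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    row_means = [sum(row) / len(row) for row in matrix]
    worst_col = min(range(n_cols), key=col_means.__getitem__)
    worst_row = min(range(n_rows), key=row_means.__getitem__)
    return col_names[worst_col], row_names[worst_row]

def format_row(values):
    """Two-decimal cell text; the row's best value is wrapped in asterisks."""
    best = max(values)
    return [f"*{v:.2f}*" if v == best else f"{v:.2f}" for v in values]
```

For the full figure, `matrix` would hold 8 rows (task–decoder pairs) and 17 columns (models), in the row and column order listed earlier.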

Qualitatively, we observe a consistent pattern: for almost every task $t$ and model $m$, the Stochastic row exhibits substantially higher $R$ than the corresponding deterministic row. That is, when we treat the model as a distribution over completions and evaluate via majority vote, robustness to paraphrase and perturbation improves markedly. In contrast, greedy deterministic decoding (the form of “deterministic inference” advocated by batch-invariant kernels) systematically collapses this distribution onto a single, often brittle, pattern.

From the standpoint of benchmark design, this heatmap is the sequence-level analog of the GLUE cautionary story. A model may achieve near-perfect accuracy on $\mathcal{D}_t^{\text{orig}}$ under deterministic decoding (high $A^{\text{Det}}_{t,\text{orig}}(m)$) while exhibiting dramatic robustness drops (low $R_{t,\text{Det}}(m)$). Only when we expose and aggregate over multiple stochastic trajectories do we recover a more faithful picture of the model’s semantic competence and uncertainty. Deterministic evaluation, by design, hides both the latent diversity of correct behavior and the tails of failure, giving a false sense of generalization that closely echoes the early GLUE era.

Beyond this aggregate view, Figures 2–18 provide a complementary, per-model perspective on the same robustness ratios $R_{t,d}(m)$ defined in Section 3.3. Each panel fixes a model $m$ and plots, for the four GLUE tasks, paired violin glyphs for stochastic (teal) and deterministic (orange) decoding. The vertical position of each violin encodes the mean robustness ratio for that task and decoding mode, while the shape and spread summarize the empirical variability of $R$ across perturbation types (paraphrased, perturbed, adversarial) and resampled subsets of evaluation examples. Narrow, high violins (e.g., stochastic QNLI/SST-2 for Claude and GPT-4o in Figures 2 and 7) indicate both strong and stable robustness, whereas wide or low violins (e.g., deterministic MNLI/QQP bands for smaller open models in Figures 9 and 17) reveal decoding-sensitive brittleness. Compared to the single cell per $(t, d, m)$ in Figure 1, these per-model diagrams expose how robustness is distributed across tasks and perturbation types, making it clear that the advantage of stochastic inference is not an artifact of a few outlier settings but a consistent, cross-task pattern that nevertheless manifests with different magnitudes and variance profiles for different architectures.

Figure 2: Robustness ratios for Claude across GLUE tasks. Under stochastic decoding (teal), Claude attains robustness ratios between 0.85 and 0.91 across MNLI, QQP, QNLI, and SST-2, whereas deterministic decoding (orange) stays in the lower 0.79–0.82 band. This yields absolute stochastic–deterministic gaps in the range of 0.05–0.12. The tight stochastic violins on QNLI and SST-2 indicate low variance across perturbation types, while the slightly wider shapes on MNLI and QQP reveal task-dependent sensitivity. Overall, Claude is consistently more robust when decoded stochastically, and the gains are not marginal but numerically substantial.

Figure 3: Robustness ratios for DeepSeek across GLUE tasks. Stochastic decoding places DeepSeek in a high-robustness regime, with ratios spanning 0.85–0.93 across tasks, while deterministic decoding lags behind at 0.76–0.81. The stochastic–deterministic gap ranges from about 0.05 up to 0.16 absolute points, making DeepSeek one of the models with the largest decoding-induced robustness gains. QNLI and SST-2 show the highest stochastic robustness, whereas MNLI and QQP display broader violins, reflecting increased variability under perturbations. These numbers highlight that DeepSeek’s strong robustness is tightly coupled to stochastic inference; deterministic decoding leaves significant robustness “on the table.”

Figure 4: Robustness ratios for Gemma-2 9B across GLUE tasks. With stochastic decoding, Gemma-2 9B achieves robustness ratios between 0.82 and 0.91, while deterministic decoding stays in the 0.77–0.82 range. The task-wise stochastic–deterministic differences vary from essentially 0.00 (one task where deterministic is on par) up to about 0.09 absolute points. QQP and SST-2 show the highest stochastic robustness, while MNLI and QNLI are slightly lower and more spread out. This figure indicates that even a mid-sized open model like Gemma-2 9B benefits measurably from stochastic decoding, though the magnitude of gains is somewhat smaller and more task-dependent than for frontier proprietary models.

Figure 5: Robustness ratios for Gemma-2 27B across GLUE tasks. Scaling to 27B pushes the stochastic robustness band to 0.86–0.90, while deterministic decoding lies in the slightly lower interval 0.79–0.84. Stochastic–deterministic gaps span roughly 0.02–0.10 across tasks, smaller than for some proprietary models but still systematically positive. MNLI and QQP show clear upward shifts compared to Gemma-2 9B, and SST-2 reaches the top of the model’s robustness range with narrow, high violins. The combination of higher means and reduced spread suggests that Gemma-2 27B is both more robust and more stable, yet still meaningfully boosted by stochastic decoding.

Figure 6: Robustness ratios for GPT-4o mini across GLUE tasks. Under stochastic decoding, GPT-4o mini attains robustness ratios in the 0.84–0.89 range, whereas deterministic decoding falls between 0.76 and 0.84. The resulting gaps are on the order of 0.05–0.09 absolute points depending on the task. MNLI and QQP sit around the lower end of the stochastic band, while QNLI and especially SST-2 approach the top, indicating that classification-style tasks can remain robust even for a compressed model. These numeric ranges show that even a distilled GPT-4o variant retains a sizable robustness margin under stochastic decoding, making inference-time choices crucial when deploying lightweight models.

Figure 7: Robustness ratios for GPT-4o across GLUE tasks. GPT-4o shows one of the strongest robustness profiles: stochastic ratios consistently lie between 0.87 and 0.93, while deterministic decoding drops to 0.75–0.84. Task-wise stochastic–deterministic gaps range from about 0.04 up to 0.16 absolute points, with the largest differences on QNLI and SST-2. The tight, high violins for stochastic decoding indicate high robustness and low variance, whereas deterministic violins are wider and noticeably shifted down. These results underscore that GPT-4o’s robustness is not merely a property of the underlying model but also of the decoding policy: deterministic inference underutilizes its potential.

Figure 8: Robustness ratios for GPT-3.5 across GLUE tasks. For stochastic decoding, robustness ratios span 0.84–0.91, situating GPT-3.5 below GPT-4o but still in a relatively strong band. Deterministic decoding compresses the model into the 0.78–0.83 range, with per-task gaps of roughly 0.06–0.11 absolute points. QQP and MNLI exhibit the largest downward shifts and broader violins under deterministic decoding, signaling heightened vulnerability to adversarial paraphrases in these settings. Taken together, the figure positions GPT-3.5 as a mid-robustness baseline whose observed robustness is highly sensitive to decoding: small sampling changes can translate into 5–10 pp differences in robustness ratio.

Figure 9: Robustness ratios for LLaMA-2 7B across GLUE tasks. Stochastic decoding yields robustness ratios between 0.81 and 0.89, while deterministic decoding ranges more widely from 0.72 up to 0.86. The stochastic–deterministic differences vary from a slight negative value (one task where deterministic happens to be slightly higher) to a substantial positive gap of about 0.15 absolute points. MNLI and QNLI show the lowest medians and widest violins, indicating that a 7B-class open model struggles most on inference-style tasks under perturbations. Numerically, this figure illustrates that LLaMA-2 7B sits at the lower end of the robustness spectrum and is highly decoding-sensitive, making it an informative but fragile baseline.

Figure 10: Robustness ratios for LLaMA-2 13B across GLUE tasks. After scaling to 13B, stochastic robustness climbs to the 0.84–0.94 range, while deterministic decoding stays in a narrower but lower interval of 0.79–0.82. The resulting stochastic–deterministic gaps fall between 0.03 and 0.14 absolute points, with the largest gains again on MNLI and QNLI. Compared to LLaMA-2 7B, both decoding modes shift upward and the stochastic violins become tighter, especially on QQP and SST-2. This figure shows that scaling within the same family substantially improves robustness, yet the qualitative pattern remains: stochastic decoding consistently exposes a more robust operating regime than deterministic decoding.

Figure 11: Robustness ratios for LLaMA-3 7B across GLUE tasks. Despite having the same parameter count as LLaMA-2 7B, LLaMA-3 7B achieves higher stochastic robustness, with ratios in the 0.83–0.90 range. Deterministic decoding occupies 0.77–0.84, and stochastic–deterministic gaps are more modest but still positive at roughly 0.06–0.08 absolute points. QQP and QNLI show the highest robustness and the tightest violins, while MNLI remains the most challenging task. Quantitatively, this figure suggests that architectural and data improvements from LLaMA-2 to LLaMA-3 shift the entire robustness band upward, even though the fundamental advantage of stochastic decoding persists.

Figure 12: Robustness ratios for LLaMA-3 8B across GLUE tasks. Under stochastic decoding, LLaMA-3 8B attains robustness ratios in the 0.89–0.96 band (roughly 0.91 on MNLI, 0.80 on QQP, 0.86 on QNLI, and 0.95 on SST-2), whereas deterministic decoding falls to the 0.74–0.79 band across the same tasks. The stochastic–deterministic gaps range from about 0.06 (QQP, QNLI) up to nearly 0.18 (SST-2), showing large decoding-induced robustness gains. The high, tight stochastic violin on SST-2 in particular indicates that LLaMA-3 8B becomes extremely robust when decoded stochastically, while deterministic decoding systematically underestimates its robustness.

Figure 13: Robustness ratios for LLaMA-3 70B across GLUE tasks. Stochastic decoding places LLaMA-3 70B in a strong robustness band of 0.84–0.96: around 0.85 on MNLI, 0.88 on QQP, 0.88 on QNLI, and near 0.96 on SST-2. In contrast, deterministic decoding compresses robustness into the lower 0.74–0.83 interval. The resulting stochastic–deterministic differences span roughly 0.04–0.13 absolute points, with the largest margins on SST-2 and QQP. Compared with LLaMA-3 7B, these numbers show that scaling to 70B significantly strengthens robustness while preserving the same qualitative advantage of stochastic decoding.

Figure 14: Robustness ratios for Mistral-7B across GLUE tasks. With stochastic decoding, Mistral-7B achieves robustness ratios between 0.84 and 0.90 on MNLI, QQP, and QNLI, and around 0.78–0.82 on SST-2. Deterministic decoding yields slightly lower values on most tasks, in the 0.79–0.84 range for MNLI/QQP/QNLI and around 0.77–0.82 on SST-2. Stochastic–deterministic gaps are moderate (0.02–0.06 absolute), except for SST-2 where deterministic decoding is marginally higher, illustrating that the decoding advantage can flip on specific tasks. Overall, the figure highlights that Mistral-7B is reasonably robust but exhibits nuanced, task-specific trade-offs between stochastic and deterministic decoding.

Figure 15: Robustness ratios for Mixtral-8×7B across GLUE tasks. Stochastic decoding places the mixture-of-experts model in a high band of 0.83–0.95: about 0.92 on MNLI, 0.86 on QQP, 0.83 on QNLI, and 0.89 on SST-2. Deterministic decoding yields 0.77–0.84 across tasks, often trailing stochastic decoding by 0.05–0.10 absolute points. The largest gaps appear on MNLI and SST-2, where violins are clearly separated, while QQP shows a smaller but still positive advantage for stochastic decoding. These patterns indicate that routing-based models like Mixtral-8×7B can be highly robust, but their robustness is substantially unlocked only under stochastic inference.

Figure 16: Robustness ratios for Mixtral-8×22B across GLUE tasks. Scaling Mixtral to 8×22B yields stochastic robustness ratios in the 0.84–0.95 band: about 0.84 on MNLI, 0.85 on QQP, 0.93 on QNLI, and 0.90 on SST-2. Deterministic decoding remains in a lower 0.78–0.82 band across all tasks. The stochastic–deterministic margins are modest (0.03–0.06) on MNLI/QQP/SST-2 but become very large on QNLI (≈ 0.10–0.15). The very tall, narrow stochastic violin for QNLI emphasizes high and stable robustness, whereas deterministic decoding exhibits both lower means and larger spread. Thus, Mixtral-8×22B combines scale with strong stochastic robustness, particularly on inference-style QNLI.

Figure 17: Robustness ratios for Phi-2 across GLUE tasks. Despite being a small model, stochastic decoding propels Phi-2 to surprisingly high robustness ratios: around 0.86 on MNLI, 0.93–0.95 on QQP, 0.96–0.98 on QNLI, and 0.89–0.93 on SST-2. In contrast, deterministic decoding stays in the 0.74–0.80 band across tasks. This yields very large stochastic–deterministic gaps of roughly 0.10–0.17 absolute points, some of the largest differences in the entire model suite. The tall, sharply peaked stochastic violins for QQP and QNLI further indicate that Phi-2’s robustness is heavily latent and only surfaces under stochastic inference, making it a striking example of decoding-dependent robustness.

Figure 18: Robustness ratios for Vicuna-7B across GLUE tasks. With stochastic decoding, Vicuna-7B reaches robustness ratios of roughly 0.88 on MNLI, 0.87 on QQP, 0.90 on QNLI, and 0.82–0.84 on SST-2. Deterministic decoding lies around 0.83 on MNLI, 0.78 on QQP, 0.79 on QNLI, and 0.85–0.87 on SST-2. This produces positive stochastic–deterministic gaps of 0.05–0.11 on MNLI/QQP/QNLI, but a negative gap on SST-2 where deterministic decoding is ≈ 0.03–0.04 higher. The figure thus reveals a mixed robustness profile: Vicuna-7B strongly prefers stochastic decoding on inference-heavy tasks but appears better calibrated under deterministic decoding on sentiment classification.

4. Deterministic Decoding Suppresses Exploration–Driven Abilities

Large language models are often described as exhibiting “emergent abilities”: few–shot in–context learning, sharp jumps in instruction following, and the ability to obey complex stylistic or structural constraints without explicit supervised training (Brown et al., 2020; Wei et al., 2022a). At a high level, these behaviors are usually narrated as if they are intrinsic properties of the underlying parameter vector θ: once the model is “big enough”, a new capability suddenly appears.

Our perspective in this paper is more operational: many of these behaviors are best understood as properties of the joint system consisting of the base model and the decoding policy that probes its trajectory space. In particular, we will show that replacing a richly stochastic, multi–sample decoding scheme with a single greedy pass at temperature T=0 can make an apparently “emergent ability” disappear, even when the underlying distribution pθ(τ | x) still assigns substantial probability mass to successful trajectories (Wei et al., 2022b; Wang et al., 2023b; Yao et al., 2023; Kojima et al., 2022). This is the sequence–level counterpart of our GLUE analysis in Section 3: just as single–output, deterministic evaluation hides distributional generalization, strictly deterministic decoding hides exploration–driven abilities already encoded in pθ(τ | x).

Claim 2 (Exploration–Driven Emergence). Deterministic inference stifles emergence: by collapsing a rich trajectory distribution into a single greedy path, it prevents many otherwise available “emergent” behaviors from ever being expressed.

A trajectory–space view. Formally, let $x$ be an input, let $\tau = (y_1, \ldots, y_T)$ denote an output trajectory, and let $p_\theta(\tau \mid x)$ be the auto–regressive distribution induced by the model. An ability (e.g., correct classification, or satisfying a bundle of style and length constraints) corresponds to a success set $\mathcal{S}(x) \subseteq \mathcal{Y}^T$ of trajectories that implement the desired behavior. A decoding policy $e$ (greedy, beam, temperature sampling, best–of–k, etc.) induces a stochastic kernel $K_e(\tau \mid x, \theta)$ over trajectories, from which we obtain a realized success probability

$P_{\text{succ}}(e; \theta) = \mathbb{E}_x \left[ \sum_{\tau \in \mathcal{S}(x)} K_e(\tau \mid x, \theta) \right].$

Crucially, $K_e$ need not coincide with $p_\theta(\cdot \mid x)$: greedy decoding collapses the support of $K_e$ onto a single maximizing trajectory, while multi–sample stochastic decoding with selection spreads mass over a richer subset of the model’s latent behavior space, in the spirit of self–consistency and tree–of–thought procedures for reasoning and planning on top of LLMs (Wei et al., 2022b; Wang et al., 2023b; Yao et al., 2023).

Under strictly deterministic decoding—in particular, greedy decoding at temperature T=0 with no sampling or reranking—the inference stack implements a map

$g_{\text{greedy}} : (x, \theta) \mapsto \tau^\star(x) = \arg\max_{\tau} p_\theta(\tau \mid x),$

and therefore only ever observes a single trajectory per input. If the success set S(x) does not contain this unique maximizer, but does contain many high–probability nearby trajectories, then pθ(S(x) | x) can be large while Psucc(egreedy; θ) remains small. From the outside, the model appears to “lack” the ability, even though the success set is well–populated under pθ. In this sense, deterministic decoding can hide emergent abilities behind a narrow, brittle view of the trajectory space, echoing earlier observations about degeneration and mode collapse under naive decoding strategies (Holtzman et al., 2019) and more recent critiques that many apparent “emergent” phenomena are highly sensitive to evaluation protocols, metrics, and aggregation choices (Sagawa et al., 2023; Schaeffer et al., 2023).
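A toy numerical instance may help make this gap concrete. The sketch below uses an invented four-trajectory distribution in which the single most likely trajectory lies outside the success set; the probabilities, the trajectory names, and the assumption of an oracle selector for best-of-k are all illustrative and are not taken from the paper's experiments.

```python
def greedy_success(p, success_set):
    """1.0 if the single most likely trajectory is in the success set, else 0.0."""
    top = max(p, key=p.get)
    return 1.0 if top in success_set else 0.0

def success_mass(p, success_set):
    """Total probability that the model assigns to successful trajectories."""
    return sum(prob for traj, prob in p.items() if traj in success_set)

def best_of_k_success(p, success_set, k):
    """P(at least one of k i.i.d. samples succeeds), ASSUMING an oracle selector."""
    return 1.0 - (1.0 - success_mass(p, success_set)) ** k

# Invented toy distribution: the mode (tau_star) is not a successful trajectory.
toy_p = {"tau_star": 0.30, "tau_1": 0.25, "tau_2": 0.25, "tau_3": 0.20}
toy_S = {"tau_1", "tau_2", "tau_3"}

print(greedy_success(toy_p, toy_S))                   # 0.0
print(round(success_mass(toy_p, toy_S), 4))           # 0.7
print(round(best_of_k_success(toy_p, toy_S, 4), 4))   # 0.9919
```

Here the success set carries 70% of the probability mass, yet greedy decoding never succeeds, which is exactly the situation described above. With a realistic (non-oracle) selection rule such as majority vote, the best-of-k figure would be lower.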

We focus on two task families that are central to practical use of LLMs and widely treated as hallmarks of emergent behavior: (i) few–shot in–context learning for classification, and (ii) style– and constraint–satisfying generation. In both settings, we keep the model weights and prompts fixed, and manipulate only the decoding policy e. For each task, model, and decoding regime we can view Psucc(e; θ) as a scalar functional of Ke; moving from greedy to exploratory decoding corresponds to replacing a low–entropy kernel with a higher–entropy, multi–sample kernel that explicitly samples from the “tails” of pθ(τ | x) and then applies a downstream selection rule. Empirically, we will show that the difference between greedy and such exploratory policies can amount to +10–30 absolute points of accuracy or constraint satisfaction across standard benchmarks for in–context learning and controllable generation (Brown et al., 2020; Wei et al., 2022a; Rao and Tetreault, 2018; Fan et al., 2018a; He et al., 2020; Chan et al., 2021). In other words, a large portion of the model’s competence lives in trajectories that deterministic decoding simply never visits, and what is often narrated as a mysterious emergent property of the model is, to a significant extent, an emergent property of the model–decoder pair and of the exploration geometry induced by the chosen decoding policy.

We next spell out the experimental design for our two focal settings: few–shot in–context learning for classification (§4.1) and style– and constraint–satisfying generation (§??). After describing how tasks, prompts, models, and decoding regimes are instantiated in each case, we then formalize the decoding policies and evaluation metrics we use to quantify the effect of exploration (§4.1.1, §??).

4.1 Few–Shot In–Context Learning Under Decoding Policies

Tasks. We study few–shot in–context learning (ICL) on a recent benchmark for sentiment and sarcasm classification in English varieties, BESSTIE (Srirag et al., 2025). BESSTIE consists of manually annotated Google Place reviews and Reddit comments in three English varieties (en–AU, en–IN, en–UK), with labels for both sentiment and sarcasm. We derive two ICL classification tasks:

BESSTIE–Sentiment (3–way sentiment classification). Each instance is labeled as {positive, negative, neutral}.
BESSTIE–Sarcasm (binary sarcasm detection). Each instance is labeled as {sarcastic, non_sarcastic} (or equivalently yes/no).

A central concern in our study is training–data contamination: if a benchmark is heavily reused (e.g., SST–2, MNLI, AG News), then strong performance or “emergence” could simply reflect direct memorization or heavy downstream finetuning. Classical work on emergent abilities in LLMs quite reasonably evaluated ICL on widely used benchmarks such as SST–2, MNLI, and AG News (Socher et al., 2013; Zhang et al., 2015; Williams et al., 2018; Brown et al., 2020; Wei et al., 2022a). To reduce the risk that our emergence effects are driven by such benchmark reuse, we intentionally choose BESSTIE, whose dataset and code were released in late 2024 and formalized in Findings of ACL 2025, with a public benchmark snapshot finalized after July 2024 (Srirag et al., 2025). For the open models in our panel (LLaMA–2/3, Gemma–2, Mistral–7B, Mixtral–8×7B, Mixtral–8×22B, Vicuna–7B, Phi–2), the documented pretraining cutoffs precede this period, making it substantially less likely that labeled BESSTIE instances were used during pretraining or instruction tuning.¹

Within this setting, we follow the conventional GPT–3 / emergent–ICL setup (Brown et al., 2020; Wei et al., 2022a): for each benchmark, we construct prompts with kshot ∈ {4, 8} randomly sampled demonstrations per example, drawing demonstrations only from the training portion of BESSTIE and evaluating on a held–out development/test set. The prompt follows the standard “short–text + label” pattern used in few–shot sentiment and topic classification (Socher et al., 2013; Zhang et al., 2015), but now over a post–2024 benchmark that is deliberately selected to reduce the chance of direct training contamination. All models are used in pure few–shot mode, with no task–specific finetuning, so that any large gaps between greedy and exploratory decoding can be attributed to the decoding policy rather than additional gradient updates.
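As an illustration of the prompt construction described here, the sketch below samples k-shot demonstrations from a training pool and formats them in a “short text + label” pattern. The function name `build_icl_prompt` and the field names “Review:” and “Label:” are assumptions, since the exact template wording is not given in the text.

```python
import random

def build_icl_prompt(train_pool, query_text, k_shot=4, seed=0):
    """Build a few-shot prompt from (text, label) pairs; template is hypothetical."""
    rng = random.Random(seed)              # fixed seed gives a reproducible draw
    demos = rng.sample(train_pool, k_shot)  # demonstrations come from train only
    blocks = [f"Review: {text}\nLabel: {label}\n" for text, label in demos]
    blocks.append(f"Review: {query_text}\nLabel:")  # model completes the label
    return "\n".join(blocks)
```

Holding the seed and template fixed while varying only the decoding policy keeps the prompt constant across regimes, which is what allows differences in accuracy to be attributed to decoding.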

4.1.1 Quantifying In–Context Ability and Exploration Gains

We now formalize how we measure in–context ability and how much of it is recovered by exploration. Throughout this subsection:

t indexes ICL tasks (e.g., BESSTIE–Sentiment, BESSTIE–Sarcasm),
m indexes models, and
e indexes decoding regimes (e.g., greedy, stochastic single–sample, best–of–k).

¹ Of course, we cannot rule out that some underlying raw text from similar domains appears in generic web corpora. Our claim is therefore not that overlap between BESSTIE and pretraining data is logically impossible, but that it is a post–benchmark resource whose labeled structure and exact splits are unlikely to have been part of the models’ training pipelines.

For each dataset $t$, we evaluate on a held-out set $\mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{N_t}$, with a fixed demonstration sampling scheme and a fixed prompt template for a given run. We denote by $\hat{y}^{(e)}_{i,t,m}$ the label produced by model $m$ under decoding policy $e$ on input $x_i$ for task $t$.

Step 1: ICL accuracy as empirical success probability. For each triplet $(t, m, e)$, the in-context classification accuracy is defined as the usual empirical risk:

$\mathrm{Acc}^{\mathrm{ICL}}_{t,m}(e) = \frac{1}{N_t} \sum_{i=1}^{N_t} \mathbb{1}\left[\hat{y}^{(e)}_{i,t,m} = y_i\right]$

This is the standard quantity reported in ICL studies, but here we treat it explicitly as an estimator of an underlying success probability. To make the role of randomness explicit, let $r$ collect all stochastic choices of the decoder under policy $e$ (sampling noise, seeds, etc.), and write $\hat{y}^{(e,r)}_{i,t,m}$ for the resulting label. The per-example success probability under policy $e$ is

$q^{\mathrm{ICL}}_{i,t,m}(e) = \Pr_{r}\left[\hat{y}^{(e,r)}_{i,t,m} = y_i\right]$

and the empirical accuracy can be viewed as

$\mathrm{Acc}^{\mathrm{ICL}}_{t,m}(e) \approx \frac{1}{N_t} \sum_{i=1}^{N_t} q^{\mathrm{ICL}}_{i,t,m}(e)$

i.e., an average of these input-wise success probabilities. From this perspective, deterministic decoding (e.g., greedy with T=0) corresponds to the degenerate case where, for almost all seeds $r$, $\hat{y}^{(e,r)}_{i,t,m}$ is constant and $q^{\mathrm{ICL}}_{i,t,m}(e) \in \{0, 1\}$. In contrast, exploratory decoding (non-zero temperature, sampling) induces a distribution over trajectories in which $q^{\mathrm{ICL}}_{i,t,m}(e)$ captures how much hidden success mass is actually available.
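As a minimal sketch of this estimator view, the per-example success probability can be approximated by Monte Carlo over decoder seeds and then averaged into an accuracy. The two decoders below are hypothetical toy stand-ins (they are not part of our pipeline and do not call a real model); `estimate_q` and `accuracy_from_q` are illustrative names introduced here.

```python
import random
from typing import Callable, Sequence

def estimate_q(sample_label: Callable[[str, random.Random], str],
               x: str, y_true: str, n_draws: int, seed: int = 0) -> float:
    """Monte Carlo estimate of q = Pr_r[y_hat^(e, r) = y] for a single input x."""
    rng = random.Random(seed)
    hits = sum(sample_label(x, rng) == y_true for _ in range(n_draws))
    return hits / n_draws

def accuracy_from_q(qs: Sequence[float]) -> float:
    """Average of per-example success probabilities (approximates Acc)."""
    return sum(qs) / len(qs)

# Hypothetical toy decoders, for illustration only.
def greedy_decoder(x: str, rng: random.Random) -> str:
    return "negative"  # deterministic: ignores the random source entirely

def stochastic_decoder(x: str, rng: random.Random) -> str:
    return "positive" if rng.random() < 0.6 else "negative"

q_greedy = estimate_q(greedy_decoder, "great, just great", "positive", 200)
q_stoch = estimate_q(stochastic_decoder, "great, just great", "positive", 200)
print(q_greedy, q_stoch)
```

Under the deterministic toy decoder the estimate is exactly 0 or 1, whereas the stochastic one yields an intermediate value, which is the "hidden success mass" the text refers to.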

Step 2: Exploration gain via best-of-k. Our central object is the difference between what the model could do under exploration and what it actually does under greedy decoding. For a sampling budget $k$, we consider a best-of-k self-consistency decoder:

draw $k$ i.i.d. completions under a stochastic base policy $e_{\mathrm{stoch}}$ (e.g., T=0.7, top-p=0.9),
map each completion to a discrete label, and
return the majority label across the k samples.

We denote this composite regime by $e_{\text{best-of-}k}$ and define:

$\mathrm{Acc}^{\mathrm{ICL}}_{t,m}(\text{best-of-}k) = \mathrm{Acc}^{\mathrm{ICL}}_{t,m}(e_{\text{best-of-}k})$

The corresponding exploration gain at budget $k$ is:

$\mathrm{EG}^{\mathrm{ICL}}_{t,m}(k) = \mathrm{Acc}^{\mathrm{ICL}}_{t,m}(\text{best-of-}k) - \mathrm{Acc}^{\mathrm{ICL}}_{t,m}(\text{greedy})$

where “greedy” is the standard T=0 deterministic decoder. At the per-example level, let $q^{\mathrm{ICL}}_{i,t,m}(e_{\mathrm{stoch}})$ be the probability that a single stochastic sample yields the correct label. Under best-of-$k$ majority voting, the success probability on $x_i$ becomes:

$q^{\mathrm{ICL}}_{i,t,m}(\text{best-of-}k) = \sum_{j=\lceil k/2 \rceil}^{k} \binom{k}{j}\, q^{j} (1-q)^{k-j}, \quad q = q^{\mathrm{ICL}}_{i,t,m}(e_{\mathrm{stoch}})$

the probability that at least half of the k draws are correct. Averaged over the $N_t$ held-out inputs $i$, the exploration gain is approximately

$\mathrm{EG}^{\mathrm{ICL}}_{t,m}(k) \approx \frac{1}{N_t} \sum_{i=1}^{N_t} \left[ q^{\mathrm{ICL}}_{i,t,m}(\text{best-of-}k) - q^{\mathrm{ICL}}_{i,t,m}(\text{greedy}) \right]$

This makes the key regime transparent. If, for some input x_i, greedy decoding is stuck on a wrong local mode so that q_i,t,m^ICL(greedy) = 0, but the stochastic policy has non-trivial success probability q_i,t,m^ICL(e_stoch) ∈ (0.3, 0.7), then q_i,t,m^ICL(best-of-k) can approach 1 as k grows, provided the correct label is the most likely outcome of a single sample (for binary labels, q_i,t,m^ICL(e_stoch) > 0.5); below that threshold, majority voting amplifies the wrong label instead. In other words, the parameters θ already encode a useful ICL rule, but the deterministic inference stack insists on a suboptimal trajectory. Large, positive EG_t,m^ICL(k) exactly measures this gap between latent capacity and realized performance.

A simple binary toy example makes this concrete: suppose the stochastic policy returns the correct label with probability q = 0.6 and the wrong label with probability 0.4. Greedy decoding may still choose the wrong label (e.g., due to a slightly higher token-level probability for an incorrect verbalization), so q^ICL(greedy) = 0. For k = 9, best-of-9 succeeds with probability $\sum_{j=5}^{9} \binom{9}{j}\, 0.6^{j}\, 0.4^{9-j} \approx 0.73$, so the exploration gain on this single example is ≈ 0.73, even though θ is unchanged. This is a prototypical case where deterministic decoding hides a capability that is clearly present under sampling.
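The toy calculation can be checked numerically. The sketch below assumes binary labels and i.i.d. draws, and counts "at least half correct" as a success (so for even k a tie is treated as correct, which is a simplifying assumption); `majority_success` is an illustrative helper name.

```python
from math import comb

def majority_success(q: float, k: int) -> float:
    """Pr[at least ceil(k/2) of k i.i.d. draws are correct], binary labels."""
    threshold = (k + 1) // 2  # ceil(k / 2)
    return sum(comb(k, j) * q**j * (1 - q) ** (k - j)
               for j in range(threshold, k + 1))

# Compare a single-sample success rate above and below 1/2 as k grows.
for k in (1, 9, 33, 129):
    print(k, round(majority_success(0.6, k), 4), round(majority_success(0.4, k), 4))
```

For q = 0.6 and k = 9 this reproduces the ≈ 0.73 figure. The second column is a useful caution: when the single-sample success rate is below 1/2, majority voting drives the success probability down rather than up as k increases.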

Step 3: Sample complexity of ICL emergence. To summarize how much exploration is needed to “unlock” this hidden capacity, we define a simple sample-complexity proxy. For a desired accuracy improvement threshold δ ∈ {0.05, 0.10} (5 or 10 absolute points), we set

k*_t,m(δ) = min { k ∈ {4, 16, 64} : EG_t,m^ICL(k) ≥ δ }.

Intuitively, k*_t,m(δ) answers: how many samples does the self-consistency decoder need before the improvement over greedy decoding becomes clearly visible? Small k* (e.g., k* = 4 for δ = 0.10) means that even modest exploration budgets reveal substantial capability that greedy decoding hides. Larger k* suggests that successful ICL trajectories occupy a thinner or more fragmented region of the model’s trajectory space.
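A minimal sketch of this proxy, assuming the exploration gains for one task-model pair are already available as a mapping from budget to gain. The numbers in `eg` are hypothetical, and returning `None` when no budget reaches the threshold is our illustrative convention, since the definition above leaves that case open.

```python
from typing import Mapping, Optional

BUDGETS = (4, 16, 64)  # candidate sampling budgets from the definition of k*

def k_star(eg_by_k: Mapping[int, float], delta: float) -> Optional[int]:
    """Smallest budget in BUDGETS whose exploration gain reaches delta."""
    for k in BUDGETS:  # BUDGETS is sorted, so the first hit is the minimum
        if eg_by_k.get(k, float("-inf")) >= delta:
            return k
    return None  # threshold not reached within the budget set

# Hypothetical exploration gains for a single (task, model) pair.
eg = {4: 0.06, 16: 0.12, 64: 0.15}
print(k_star(eg, 0.05), k_star(eg, 0.10), k_star(eg, 0.20))
```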

Step 4: Label distributions and entropy. Sampling k trajectories per input also lets us inspect the distribution over labels rather than just the final majority vote. For each (i, t, m) and a fixed stochastic configuration (e.g., T=0.7, top–p=0.9), define the empirical label distribution

$\hat{p}_{i,t,m}(y) = \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}\left[\hat{y}^{(e_{\mathrm{stoch}},\, r_j)}_{i,t,m} = y\right]$,

where r₁, …, r_k are independent seeds. The corresponding label entropy is

$H_{i,t,m} = -\sum_{y} \hat{p}_{i,t,m}(y) \log \hat{p}_{i,t,m}(y)$.

Low entropy H_i,t,m ≈ 0 indicates almost deterministic behavior (almost all mass on a single label), while intermediate entropy reveals that the model allocates non-trivial mass to multiple plausible labels. Crucially, we frequently observe inputs where:

the greedy label is incorrect, yet
the empirical distribution p̂_i,t,m(y) has a clear majority on the correct label.

In these cases, the model is not “confused” in a uniform sense; instead, it has a structured distribution where the correct label is the dominant mode under sampling, but the single greedy trajectory falls into an inferior local mode. Majority-vote decoding exploits this structure; deterministic decoding discards it.

Aggregating {H_i,t,m}_i and the distributions p̂_i,t,m across inputs thus gives an input-wise explanation for large exploration gains: whenever many inputs exhibit such “hidden majority” behavior (correct label winning under sampling, but losing under greedy decoding), we should expect EG_t,m^ICL(k) to be strongly positive. This is exactly what we observe empirically, reinforcing our claim that deterministic decoding suppresses an exploration-driven emergent ability already encoded in p_θ(τ | x).
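The per-input quantities in this step can be sketched as follows. The sample list is hypothetical, the helper names are ours, and `is_hidden_majority` uses the plurality label under sampling as a simple stand-in for "clear majority".

```python
import math
from collections import Counter
from typing import Dict, Sequence

def label_distribution(samples: Sequence[str]) -> Dict[str, float]:
    """Empirical label distribution over k sampled labels."""
    counts = Counter(samples)
    k = len(samples)
    return {label: c / k for label, c in counts.items()}

def label_entropy(p_hat: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of the empirical label distribution."""
    return -sum(p * math.log(p) for p in p_hat.values() if p > 0)

def is_hidden_majority(samples: Sequence[str], greedy_label: str,
                       gold_label: str) -> bool:
    """True if greedy is wrong but the most frequent sampled label is correct."""
    p_hat = label_distribution(samples)
    majority = max(p_hat, key=p_hat.get)
    return greedy_label != gold_label and majority == gold_label

# Hypothetical k = 16 sampled labels for one sarcasm input.
samples = ["sarcastic"] * 11 + ["non_sarcastic"] * 5
p_hat = label_distribution(samples)
print(p_hat, label_entropy(p_hat),
      is_hidden_majority(samples, "non_sarcastic", "sarcastic"))
```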

Figure 19: ICL accuracy as a function of exploration budget k. Placeholder. Each panel corresponds to a representative model (e.g., LLaMA-3 8B, Gemma-2 27B, Mixtral-8×7B, Phi-2). Curves show Acc_t,m^ICL(e) for k ∈ {1, 4, 16, 64}, where k=1 with T=0 is the greedy baseline and k > 1 denotes best-of-k under a fixed stochastic policy. Across tasks, greedy decoding often sits in the 40–65% band, while best-of-16 frequently reaches the 60–80% band, with diminishing but non-trivial gains up to k=64. The large vertical gaps between k=1 and k ≥ 16 illustrate how exploration recovers ICL competence that deterministic decoding fails to surface, even though the underlying parameters θ are held fixed.

Step 5: The exploration–gain curve (boxed definition). For downstream visualizations and analysis, we will primarily work with the exploration–gain curve as a function of the sampling budget k:

EG_t,m^ICL(k) = Acc_t,m^ICL(best-of-k) − Acc_t,m^ICL(greedy)

A positive value of EG_t,m^ICL(k) indicates that exploration recovers in-context ability that the deterministic greedy decoder fails to surface. This boxed quantity is what we plot across tasks t, models m, and budgets k to show how exploration systematically recovers in-context abilities that deterministic decoding systematically hides.

4.1.2 ICL Results: Exploration Recovers Suppressed Ability

We now turn to the empirical behavior of the exploration-gain curve EG_t,m^ICL(k) defined in §4.1.1. Across our post-July 2024 ICL benchmarks and the family of open models (LLaMA-2 7B/13B, LLaMA-3 8B/70B, Gemma-2 9B/27B, Mistral-7B, Mixtral-8×7B, Mixtral-8×22B, Vicuna-7B, Phi-2), we consistently observe that greedy decoding substantially underestimates the in-context capability that is revealed by even modest levels of stochastic exploration.

Accuracy curves as a function of exploration budget. Figure 19 (placeholder) plots Acc_t,m^ICL(e) as a function of the sampling budget k ∈ {1, 4, 16, 64} for four representative models and all ICL tasks. Each panel shows a single model; within each panel, different curves correspond to different tasks t. A few robust patterns emerge:

For many (t, m) pairs, the greedy point (k=1, T=0) lies in a relatively modest band of 40–65% accuracy, even on tasks that are structurally simple (single-sentence classification with short prompts).
Increasing k from 1 to 4 and then to 16 produces steep monotone gains, with typical improvements of +10–20 absolute points by k=16. For instance, a mid-size LLaMA-3 8B variant may move from ≈ 55% to ≈ 75% on one of the sentiment tasks, while Gemma-2 27B and Mixtral-8×7B show comparable jumps.

Beyond k=16, the curves still trend upward (e.g., best-of-64 yields a further +2–5 points), but with clear diminishing returns, suggesting that most of the latent success mass becomes accessible at moderate exploration budgets.

Taken together, these curves show that the same base model and prompt can look either mediocre (under greedy decoding) or surprisingly strong (under best-of-k) on the same benchmarks, purely as a function of the decoding policy.

Heatmaps of exploration gain across tasks and models. To summarize these improvements more compactly, we construct a task-by-model heatmap of exploration gains at a fixed budget, e.g. k=16:

EG_t,m^ICL(16) = Acc_t,m^ICL(best-of-16) − Acc_t,m^ICL(greedy).

Figure 20 shows this quantity for all ICL tasks t (rows) and all open models m (columns). Qualitatively, the heatmap is dominated by:

a large block of cells in the 0.08–0.20 range, indicating that double-digit absolute gains are common rather than exceptional, and
several dark cells in the ≥ 0.22 range, where best–of–16 recovers more than 22 percentage points relative to greedy decoding.

Importantly, these gains are not restricted to the largest models. Smaller and mid–size variants (e.g., LLaMA–2 7B/13B, Phi–2) often show larger relative gains, reflecting the fact that their greedy performance is particularly conservative while their stochastic trajectory space still contains rich pockets of correct behavior.

Figure 20: Few-shot ICL accuracy and exploration gains across models on BESSTIE tasks. Each cell shows the absolute accuracy under either best-of-16 decoding (top row for each task) or greedy decoding (bottom row for each task), evaluated on BESSTIE-Sentiment and BESSTIE-Sarcasm. For the greedy rows we additionally print the accuracy gap (“↓ d%”) relative to best-of-16, where d = EG_t,m^ICL(16) × 100. The warm vs. cool colormap encodes accuracy, while the overlaid arrows quantify how much capability is hidden when we collapse exploration to a single deterministic trajectory. Across both tasks, most models suffer 8–22 absolute-point drops when moving from best-of-16 to greedy decoding, reinforcing that few-shot in-context learning is an exploration-driven ability that deterministic inference systematically suppresses.

Figure 20 aggregates these effects into a single task-by-model view of exploration gains at a fixed budget of k=16. For each open model m (columns) and each BESSTIE task t ∈ {Sentiment, Sarcasm} (row pairs), the top cell reports Acc_t,m^ICL(best-of-16), while the bottom cell reports the corresponding greedy accuracy Acc_t,m^ICL(greedy) together with the accuracy gap ↓ d%, where d = EG_t,m^ICL(16) × 100 as defined in §4.1.1. The warm vs. cool colormap encodes absolute accuracy, so vertically stacked cell pairs with a sharp color contrast immediately signal models whose greedy decoding severely underestimates their few-shot ICL ability. Across both tasks and almost all open backbones (LLaMA-2/3, Gemma-2, Mistral, Mixtral, Vicuna, Phi-2), the majority of greedy rows exhibit drops of roughly 8–22 pp relative to best-of-16, most of them double-digit, with some smaller models (e.g., Phi-2, LLaMA-2 7B) showing the largest relative gains. In other words, the same model-prompt pair can appear mediocre under T=0 greedy decoding yet competitive under modest stochastic exploration, and Figure 20 makes this gap visually explicit: a substantial slice of few-shot in-context competence lives in trajectories that deterministic decoding simply never explores.

4.1.3 Exploration-ICL Landscapes across Models

The ICL curves and heatmaps in §4.1.2 summarize exploration gains by collapsing over temperature and focusing on a small set of sampling budgets k ∈ {1, 4, 16, 64}. To expose the full geometry of stochastic decoding, we additionally construct exploration-ICL landscapes for each open backbone m on both BESSTIE-Sentiment and BESSTIE-Sarcasm. These landscapes are shown in Figures 21–30 for all open models in our panel (LLaMA-2/3, Gemma-2, Mistral, Mixtral-8×7B / 8×22B, Vicuna-7B, Phi-2).

For a given task t ∈ {Sentiment, Sarcasm}, model m, temperature T, and sampling budget k, we define the temperature- and budget-specific exploration gain as

ΔAcc_t,m^ICL(T, k) = Acc_t,m^ICL(best-of-k; T) − Acc_t,m^ICL(greedy; T=0)

where:

Acc_t,m^ICL(best-of-k; T) is the empirical accuracy on the BESSTIE dev/test split when we draw k independent completions under a fixed stochastic base policy at temperature T (with standard nucleus filtering (Holtzman et al., 2019)), map each completion to a discrete label, and return the majority label, i.e., a self-consistency style decoder in the spirit of Wei et al. (2022b); Wang et al. (2023b); Yao et al. (2023);
Acc_t,m^ICL(greedy; T=0) is the baseline accuracy under strictly deterministic decoding (T=0, k=1), i.e., the classical GPT-3 style few-shot ICL evaluation (Brown et al., 2020).

Thus, ΔAcc_t,m^ICL(T, k) directly measures how much in-context ability is recovered at a given exploration setting (T, k), holding the base model and prompt fixed and modifying only the decoding policy.

In each panel of Figures 21–30, the x-axis spans temperature T ∈ [0.05, 1.0] and the y-axis spans log₂ k ∈ [0, 6] (corresponding to k ∈ [1, 64]). We evaluate ΔAcc_t,m^ICL(T, k) on a regular grid (e.g., T in steps of 0.05 and k ∈ {1, 2, 4, 8, 16, 32, 64}), and interpolate to obtain a smooth surface. The color scale encodes ΔAcc_t,m^ICL(T, k) in the fixed numeric range [0, 0.25] (i.e., [0, 25] percentage points), shared across all backbones and both tasks. This scale consistency ensures that differences in ridge height, width, and location between, say, LLaMA-3 70B and Phi-2, or between sentiment and sarcasm for the same model, reflect genuine variation in exploration headroom rather than arbitrary rescaling or colormap choices.
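The grid evaluation described above can be sketched as follows. The accuracy function `toy_accuracy` is a purely synthetic surface chosen for illustration (it is not experimental data and does not reproduce any figure), and the sketch omits the interpolation step used for plotting.

```python
import math
from typing import Callable, Dict, List, Tuple

# Regular grid from the text: T in steps of 0.05, k in powers of two up to 64.
TEMPERATURES: List[float] = [round(0.05 * i, 2) for i in range(1, 21)]  # 0.05 .. 1.00
BUDGETS: List[int] = [2 ** j for j in range(7)]                          # 1 .. 64

def gain_landscape(acc_best_of_k: Callable[[float, int], float],
                   acc_greedy: float) -> Dict[Tuple[float, int], float]:
    """Delta-Acc(T, k) = Acc(best-of-k; T) - Acc(greedy; T=0) on the grid."""
    return {(T, k): acc_best_of_k(T, k) - acc_greedy
            for T in TEMPERATURES for k in BUDGETS}

def toy_accuracy(T: float, k: int) -> float:
    """Synthetic stand-in for measured best-of-k accuracy (NOT real results)."""
    bump = math.exp(-((T - 0.7) / 0.15) ** 2
                    - ((math.log2(k) - 4) / 1.5) ** 2)
    return 0.55 + 0.15 * bump

landscape = gain_landscape(toy_accuracy, acc_greedy=0.55)
best_T, best_k = max(landscape, key=landscape.get)
print(best_T, best_k, round(landscape[(best_T, best_k)], 3))
```

In practice `acc_best_of_k` would wrap an actual evaluation run at the given (T, k); keeping it as a callable lets the same grid code serve every backbone and task.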

Flat, low–gain surfaces for very strong models. Large backbones such as LLaMA–3 70B (Figure 23) exhibit almost perfectly flat landscapes with peak ΔAcc^ICL of only ≈ 5 pp on sentiment and ≈ 10 pp on sarcasm. Intuitively, these models already solve most BESSTIE cases under greedy decoding, so exploration yields only small, localized bumps around a narrow corridor (typically T ≈ 0.7, k ∈ [8, 16]). In other words, the latent success mass under pθ(τ | x) is already highly concentrated near the greedy mode, leaving little additional headroom to exploit. Key takeaway: for such models, ICL looks almost deterministic, since a single trajectory already aligns closely with the majority label under sampling, and exploration mainly offers fine–tuning of calibration rather than dramatic capability jumps.
Tall, narrow ridges for mid–size backbones. Mid–size models such as LLaMA–2 13B and Gemma–2 9B/27B (Figures 21, 24, and 25) show pronounced, warm–colored ridges in (T, k) space: moving from (T=0, k=1) to a “sweet spot” around T ≈ 0.7, k ∈ [8, 32] unlocks 10–20 pp of extra accuracy. Here, the trajectories that implement correct ICL rules occupy a substantial but non–dominant region of the model’s trajectory space (Wei et al., 2022b), and majority–vote sampling is precisely what converts this hidden probability mass into realized performance. Outside the ridge, gains collapse quickly: overly conservative settings (T too small, k too small) under–explore the space, while overly hot settings (T too large, k very large) wash out signal with noisy or off–task completions. Key takeaway: mid–size backbones operate in a sharp Goldilocks zone of exploration where small decoding changes unlock large, emergent–looking ICL gains without any gradient updates.
Task–asymmetric landscapes. Several backbones (notably Vicuna–7B, Gemma–2 9B, and Phi–2; Figures 29, 24, and 30) display a striking task asymmetry. For Vicuna–7B and Gemma–2 9B, sarcasm surfaces have taller and broader ridges than sentiment, whereas Phi–2 shows the reverse pattern (≈ 11 pp on sentiment versus ≈ 5 pp on sarcasm). The same model that appears “almost solved” on sentiment under greedy decoding can gain 15–17 pp on sarcasm once we move into the high–gain band T ∈ [0.65, 0.85], k ∈ [8, 48]. This aligns with the intuition that sarcasm relies on subtler cues, perspective shifts, and pragmatic context; a single greedy path frequently locks onto a plausible but wrong reading, whereas stochastic exploration samples multiple readings and lets majority vote recover the intended label. Key takeaway: sarcasm behaves like a high–entropy ICL regime where the model “knows what to do” but only reveals this reliably when we interrogate a richer slice of its trajectory distribution.

These regimes also provide intuitive cross–model takeaways that are invisible from scalar accuracy alone:

Scaling within a family (e.g., LLaMA–2 7B → 13B, Gemma–2 9B → 27B) tends to flatten the landscape for easier tasks (sentiment) while still preserving noticeable ridges for harder ones (sarcasm), echoing reports that larger models are more calibrated yet still benefit from self–consistency on challenging examples (Wei et al., 2022b; Schaeffer et al., 2023). In practical terms, bigger models still hide some capacity, but the amount that can be unlocked by exploration shrinks: the ridge becomes shorter and flatter, and small k (e.g., best–of–4) is often enough to capture most of the available gain. Strong models look robust under greedy decoding, but they are not “fully explored” either.
Calibration vs. brittleness. Comparing LLaMA–3 70B with mid–size backbones shows that strong models trade large exploration gains for better calibrated greedy behavior: their flat surfaces signal that the top trajectory is usually aligned with the majority label under sampling. Mid–size models, by contrast, are more brittle: greedy decoding often settles on an inferior local mode, and best–of–k acts as a calibration amplifier that pulls predictions toward the latent majority preference encoded in pθ(τ | x).
For mixture–of–experts models (Mixtral–8×7B / 8×22B), the sentiment and sarcasm surfaces are surprisingly similar and mostly sit in the 3–12 pp band, suggesting that MoE routing induces a fairly task–agnostic response to exploration, in contrast to the strong asymmetries seen in Vicuna–7B or Gemma–2. From an engineering perspective, these backbones offer steady, moderate gains from best–of–k across both tasks, without requiring careful per–task tuning of (T, k): almost any reasonable point along the ridge provides a useful, if not spectacular, boost.
Sweet–spot sensitivity. Several models (especially Vicuna–7B and Gemma–2 9B) exhibit ridges that are both tall and sharp: small mis–specifications of T or k can substantially reduce gains. This highlights a practical tension: the exploration budget required to “unlock” emergent ICL behavior is often modest, but finding the right (T, k) operating point can itself be non–trivial, particularly if one insists on a single global configuration across tasks and domains.
Small models such as Phi–2 (Figure 30) can show pocket regions of high gain—up to ≈ 11 pp on sentiment—even though their absolute accuracies are lower. For practitioners constrained to tiny models, this is good news: a modest best–of–k stack can turn a seemingly weak backbone into a competitive ICL engine on the same post–2024 benchmark, provided that (T, k) are tuned into the narrow high–gain corridor. Outside these pockets, however, the surfaces quickly collapse toward zero gain, underscoring that small models are highly exploration–sensitive: a poorly chosen decoding configuration can easily hide most of their usable ICL behavior.

Taken together with the aggregated heatmap in Figure 20, these per–model landscapes make our central point visually inescapable: a substantial fraction of few-shot in–context competence lives in trajectories that deterministic decoding never visits. What looks like a “lack of emergent ability” under the classical GPT–3 evaluation recipe (Brown et al., 2020; Wei et al., 2022a) is, in many cases, better described as an evaluation artefact: the ability is already encoded in pθ(τ | x), but only becomes visible when the model is probed with a richer, multi–sample decoding policy that respects the full trajectory distribution and actively exploits success mass outside the single greedy path.

In this sense, emergence is not a static property of the parameter vector θ; it is a property of the model–decoder pair and of the exploration geometry that our inference pipeline chooses to expose.

Figure 21: Exploration-ICL landscapes for LLaMA-2 13B on BESSTIE. Left: Sentiment (empirical best-of-16 gain ≈ 15 pp) shows a broad ridge of exploration benefit concentrated around temperatures T ∈ [0.65, 0.80] and sample counts k ∈ [8, 32] (i.e., log₂ k ∈ [3, 5]), with gains tapering smoothly toward both very low and very high exploration. Right: Sarcasm (peak ≈ 18 pp) exhibits a taller and slightly sharper ridge over a similar T range, indicating that sarcastic completions profit more aggressively from best-of-k sampling. In both panels, the x-axis spans temperature T ∈ [0.05, 1.0], the y-axis covers log₂ k ∈ [0, 6] (i.e., k ∈ [1, 64]), and the color scale encodes exploration gain ΔAcc^ICL in the numeric range [0, 0.25] (corresponding to [0, 25] percentage points).

Figure 22: Exploration–ICL landscapes for LLaMA-3 8B. Left: Sentiment has a relatively low best-of-16 gain of only ≈ 6 pp, with a shallow ridge centred near T ≈ 0.7 and small-to-moderate k (k ∈ [4, 16]), indicating limited upside from exploration on this task. Right: Sarcasm (peak ≈ 13 pp) shows a visibly stronger and more extended plateau, with useful gains persisting for T ∈ [0.65, 0.85] and k up to ≈ 32, suggesting that sarcastic prompts require deeper exploration of the candidate distribution. Across both plots, the numeric ranges are fixed to T ∈ [0.05, 1.0], log₂ k ∈ [0, 6] and ΔAcc^ICL ∈ [0, 0.25], making cross-model comparison in later figures scale-consistent.

Figure 23: Exploration–ICL landscapes for LLaMA-3 70B. Left: Sentiment (peak gain ≈ 5 pp) is characterized by a very flat surface with only a low-amplitude bump at T ≈ 0.7 and k ≈ 8–16, indicating that the strong base model already solves most cases under greedy decoding. Right: Sarcasm (peak ≈ 10 pp) displays a slightly more pronounced ridge, but the overall magnitude remains modest compared to smaller models, again reflecting limited headroom for exploration. Formally, the figure keeps T in [0.05, 1.0], log₂ k in [0, 6], and ΔAcc^ICL clipped to [0, 0.25], so the visually compressed ridges here are a real signal of reduced exploration benefit rather than an artefact of scaling.

Figure 24: Exploration–ICL landscapes for Gemma-2 9B. Left: Sentiment shows a substantial ridge with peak gain ≈ 14 pp, spanning T ∈ [0.65, 0.8] and k ∈ [8, 32], and quickly flattening for very low k and overly hot temperatures. Right: Sarcasm is even more exploration-sensitive, achieving a peak of ≈ 17 pp and maintaining high gains over a wide band T ∈ [0.65, 0.85] and k ∈ [8, 48], where the surface height stays above roughly 0.10 (i.e., 10 pp). The color scale is again fixed to [0, 0.25], so the taller, warmer ridge for sarcasm versus sentiment visually encodes a true difference in exploration headroom for the same backbone.

Figure 25: Exploration–ICL landscapes for Gemma-2 27B. Left: Sentiment exhibits one of the strongest ridges in our study, with peak gain ≈ 17 pp and a high plateau for T ∈ [0.65, 0.8] and k ∈ [8, 48], where ΔAcc^ICL remains in the [0.10, 0.20] (10–20 pp) band. Right: Sarcasm (peak ≈ 10 pp) has a noticeably shorter and narrower ridge, concentrated near T ≈ 0.7 and k ∈ [8, 24], suggesting that this larger Gemma variant is more exploration-hungry on sentiment than on sarcasm. Because all panels share a common numeric range for T, k, and gain, the visual contrast between the left and right surfaces directly quantifies how task identity modulates the value of best-of-k sampling.

Figure 26: Exploration–ICL landscapes for Mistral-7B. Left: Sentiment (peak gain ≈ 10 pp) has a clean, single ridge around T ≈ 0.7 and k ∈ [8, 24]; below k = 4 or above k = 32 the surface rapidly collapses toward 0. Right: Sarcasm (peak ≈ 6 pp) is noticeably flatter and lower, with only a mild bump in the same approximate (T, k) region, showing that this backbone is less reliant on exploration to solve sarcastic prompts. Within the global numeric ranges T ∈ [0.05, 1.0], log₂ k ∈ [0, 6], and ΔAcc^ICL ∈ [0, 0.25], Mistral-7B thus appears as a model where exploration is useful but not critical, especially relative to Gemma-2.

Figure 27: Exploration–ICL landscapes for Mixtral-8x7B. Left: Sentiment and Right: Sarcasm both peak at ≈ 7 pp, with gently sloping ridges around T ∈ [0.65, 0.8] and k ∈ [8, 24]. The similarity of the two surfaces, both staying mostly within the [0.03, 0.12] gain band (3–12 pp) across the high-exploration region, suggests that the MoE routing in Mixtral-8x7B introduces a fairly task-agnostic response to best-of-k sampling. Overall, the numeric ranges confirm that this model sees consistent but moderate exploration benefits across both sentiment and sarcasm, with no extreme dependence on temperature or very large k.

Figure 28: Exploration–ICL landscapes for Mixtral-8x22B. Left: Sentiment and Right: Sarcasm both reach peaks of ≈ 8 pp, but the ridges are slightly broader in k than for Mixtral-8x7B, with useful gains for k extending up to roughly 32. Within T ∈ [0.65, 0.8] and k ∈ [8, 32], ΔAcc^ICL often stays above 0.05 (5 pp), while quickly dropping outside this band. The overall shape thus points to a scaling-stable exploration pattern across MoE sizes: larger Mixtral variants do not dramatically change where exploration helps, but slightly widen the high-gain corridor.

Figure 29: Exploration–ICL landscapes for Vicuna-7B. Left: Sentiment reaches a moderate peak of ≈ 9 pp, with a compact ridge around T ≈ 0.7 and k ∈ [8, 24], and limited gain outside this region. Right: Sarcasm is dramatically different: the surface climbs up to ≈ 17 pp, with a tall ridge covering T ∈ [0.65, 0.85] and k ∈ [8, 48], where gains stay well above 0.10 (10 pp). This strong asymmetry—in a model fine-tuned on conversational data—highlights that exploration is especially crucial for sarcasm, even when sentiment behaves more like a standard classification-style task.

Figure 30: Exploration–ICL landscapes for Phi-2. Left: Sentiment shows a surprisingly strong ridge for a small model, with peak gain ≈ 11 pp and a concentrated band of high values around T ∈ [0.65, 0.8] and k ∈ [8, 24]; here, gains in the [0.06, 0.12] (6–12 pp) range are common. Right: Sarcasm is much flatter, with peak ≈ 5 pp and only a small bump near T ≈ 0.7 and k ∈ [8, 16], quickly collapsing towards zero for larger k or temperatures too far from the sweet spot. Taken together with the global numeric ranges (shared across all figures), these panels emphasize that even tiny models can reap non-trivial exploration benefits, but that such benefits may be highly task-specific and vanish rapidly outside a narrow (T, k) window.

4.1.4 Entropy–Exploration Tradeoffs in Few–Shot ICL

Figure 31 makes the connection between uncertainty and exploration benefit explicit by plotting, for every task-model pair (t, m) in our BESSTIE experiments, the relationship between normalized label entropy and exploration gain at budget k=16. Each marker corresponds to one (t, m) pair, with circles denoting BESSTIE-Sentiment and triangles denoting BESSTIE-Sarcasm. The x-axis shows E_i[H̃_i,t,m] ∈ [0, 1], where for each example x_i we estimate a label distribution p̂_ℓ,i,t,m from k=16 temperature-scaled stochastic samples (as in §4.1.1), compute

H_i,t,m = − ∑_ℓ p̂_ℓ,i,t,m log p̂_ℓ,i,t,m,

normalize by the task arity C_t via H̃_i,t,m = H_i,t,m / log C_t, and then average over i. The y-axis plots the corresponding ICL exploration gain at k=16,

EG_t,m^ICL(k=16) = Acc_t,m^ICL(best-of-16) − Acc_t,m^ICL(greedy),

i.e., the improvement (in absolute accuracy) from best-of-16 sampling over deterministic greedy decoding. Solid (Sentiment) and dashed (Sarcasm) curves overlay simple quadratic fits f(h) ≈ ah² + bh + c to the points in each task, providing a smooth summary of how exploration gains vary as a function of entropy.

Sample complexity and “how much” exploration we need. The entropy view is consistent with the sample-complexity proxy k*_t,m(δ) introduced in §4.1.1: for most task-model pairs we do not need extreme sampling budgets to see emergence. For a modest threshold δ=0.05, a large fraction of (t, m) satisfy k*_t,m(δ)=4, i.e., best-of-4 already buys a ≥ 5 pp gain over greedy decoding. Even for the stricter δ=0.10 criterion, many pairs have k*_t,m(δ) ∈ {4, 16}, and only a minority of the hardest combinations require k=64 to cross the 10 pp threshold. Taken together with Figure 31, this indicates that the basin of good ICL trajectories is often reasonably thick: a small number of independent probes is enough to find and exploit it, provided we are willing to deviate from T=0 greedy decoding.

Qualitatively, Figure 31 reveals a clear inverted-U relationship between uncertainty and exploration benefit. At the low-entropy end (E_i[H̃_i,t,m] ≲ 0.2), models behave almost deterministically: one label dominates the empirical distribution under sampling, and both greedy and best-of-k tend to predict the same outcome. In this regime we observe negligible exploration gains (EG_t,m^ICL ≲ 0.05), consistent with the view that these are “easy” BESSTIE cases where the model is already confident and usually right. At the opposite extreme, very high entropies (E_i[H̃_i,t,m] ≳ 0.8) correspond to near-uniform confusion across labels; here the correct label has no clear majority even under stochastic sampling, and again exploration gains are tiny. In both extremes, extra samples simply reconfirm either strong certainty or genuine ambiguity.
The most interesting structure lies in the intermediate entropy band (E_i[H̃_i,t,m] ≈ 0.3–0.7), where many task-model pairs cluster. In this middle region, we see substantial exploration gains: EG_t,m^ICL(k=16) routinely reaches 0.10–0.20 (10–20 pp), with several sarcasm points peaking near 0.22 (22 pp). This is exactly the “hidden majority” regime discussed in §4.1.1: the correct label is the dominant mode under stochastic sampling but not the label preferred by the single greedy trajectory. Greedy decoding locks onto a locally high-probability but globally suboptimal verbalization, while best-of-k sampling reweights the trajectory space in favour of the majority label. The smooth concave shape across all open models highlights that these gains are not idiosyncratic artifacts of a single backbone, but a predictable function of how label entropy is distributed across inputs. We can summarize these patterns in three regimes:

Low–entropy, low–gain pairs, where greedy and stochastic decoding almost always agree; exploration brings almost no benefit.
Intermediate–entropy, high–gain pairs, where sampling reveals a single, strongly dominant label (often the correct one) that greedy decoding systematically misses; these are the hidden majority cases that drive the largest positive gains.
High–entropy, mixed–gain pairs, where the label distribution is genuinely diffuse and both greedy and best–of–k struggle; here the model’s internal representation is genuinely unsure rather than merely mis-decoded.
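The “hidden majority” regime can be made concrete with a minimal sketch. It assumes best-of-k is aggregated by a plain majority vote over sampled labels; the function names and the example labels are hypothetical and are not taken from the paper's evaluation code.

```python
from collections import Counter


def majority_label(samples):
    """Return the most frequent label among the stochastic samples."""
    if not samples:
        raise ValueError("need at least one sampled label")
    return Counter(samples).most_common(1)[0][0]


def is_hidden_majority(greedy_label, samples, gold_label):
    """True when the sampled majority is correct but the greedy label is not.

    This mirrors the intermediate-entropy regime: the correct label is the
    dominant mode under sampling, yet the single greedy trajectory misses it.
    """
    return majority_label(samples) == gold_label and greedy_label != gold_label


# Hypothetical example: 10 of 16 samples read the input as sarcastic,
# while the greedy trajectory settles on the literal reading.
samples = ["sarcastic"] * 10 + ["literal"] * 6
hidden = is_hidden_majority("literal", samples, "sarcastic")
print(hidden)
```

An example counts as a hidden-majority case only if both conditions hold, so inputs where greedy decoding is already correct do not contribute to the exploration gain.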

Figure 31: Entropy–exploration relationship in BESSTIE few-shot ICL. Each marker is a task–model pair (t, m) from open LLMs in Table ??, for either BESSTIE–Sentiment (circles) or BESSTIE–Sarcasm (triangles). The x–axis shows the normalized label entropy E_i[H̃_i,t,m] ∈ [0, 1], where for each example i we estimate a label distribution pˆ_ℓ,i,t,m from temperature–scaled stochastic samples and compute H_i,t,m = − ∑_ℓ pˆ_ℓ,i,t,m log pˆ_ℓ,i,t,m. We then normalize by the task arity, H̃_i,t,m = H_i,t,m / log C_t, and average over i. The y–axis plots the ICL exploration gain EG_t,m^ICL(k=16) = Acc_t,m^ICL(k=16) − Acc_t,m^greedy, i.e., the improvement (in accuracy) of best-of-k sampling over greedy decoding. Solid (Sentiment) and dashed (Sarcasm) curves show quadratic fits f(h) ≈ ah² + bh + c to the points in each task.

We observe a clear inverted-U relationship: both low-entropy regimes (E_i[H̃_i,t,m] ≲ 0.2, nearly deterministic labels) and very high-entropy regimes (≳ 0.8, almost uniform confusion) yield negligible exploration gains (EG^ICL ≲ 0.05), while intermediate entropies (≈ 0.3–0.7) produce the largest gains (EG^ICL ≈ 0.10–0.20). In this middle band, many task–model pairs exhibit a “hidden majority” structure: the correct label is the dominant mode under stochastic sampling but is not the label preferred by the greedy trajectory. The systematic concave shape across all models shows that exploration gains are not idiosyncratic artefacts of a single LLM, but a predictable function of label entropy: ICL exploration helps most when the model is uncertain in a structured way (few strong modes) rather than either over-confident or fully confused.
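The two axes of Figure 31 can be sketched in a few lines. The block follows the definitions in the caption (empirical label distribution, entropy normalized by log C_t, and gain as an accuracy difference); the function names, the sample lists, and the accuracy values are illustrative assumptions.

```python
import math
from collections import Counter


def normalized_label_entropy(samples, num_classes):
    """H / log(C): entropy of the empirical label distribution, scaled to [0, 1]."""
    if num_classes < 2:
        raise ValueError("normalization needs at least two classes")
    if not samples:
        raise ValueError("need at least one sampled label")
    n = len(samples)
    probs = [count / n for count in Counter(samples).values()]
    entropy = -sum(p * math.log(p) for p in probs)
    # max() also turns the -0.0 of a one-label distribution into 0.0.
    return max(0.0, entropy / math.log(num_classes))


def exploration_gain(acc_best_of_k, acc_greedy):
    """EG(k) = Acc(best-of-k) - Acc(greedy), in absolute accuracy."""
    return acc_best_of_k - acc_greedy


# Hypothetical sampled labels for one example of a 3-class task.
h_example = normalized_label_entropy(["neg"] * 11 + ["pos"] * 4 + ["neu"], 3)

# Hypothetical accuracies for one task-model pair.
eg = exploration_gain(acc_best_of_k=0.78, acc_greedy=0.62)
```

Averaging `normalized_label_entropy` over all examples of a task gives the x-coordinate of one marker, and `exploration_gain` at k=16 gives its y-coordinate.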

Task differences are also visible. The sarcasm curve generally peaks at slightly higher entropy and higher gain than the sentiment curve, reflecting the intuition that sarcasm requires subtler pragmatic and contextual cues, for which models are often locally uncertain but not uniformly confused.

In other words, sarcastic examples tend to sit squarely in the middle of the inverted–U: greedy decoding often takes a plausible-but-literal reading, whereas stochastic exploration samples alternative readings and allows majority vote to recover the intended sarcastic label. This aligns with prior evidence that self–consistency and sampling-based methods disproportionately help on harder reasoning and nuance-heavy tasks (Wei et al., 2022b; Wang et al., 2023b).
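The comparison of where each task's curve peaks can be reproduced from a quadratic fit f(h) ≈ ah² + bh + c. The sketch below uses NumPy's `polyfit`; the data points are synthetic values placed on an exact parabola purely for illustration and are not measurements from the paper.

```python
import numpy as np


def fit_inverted_u(entropies, gains):
    """Least-squares quadratic fit; returns (a, b, c, peak_h).

    peak_h is the vertex -b / (2a) when the fit is concave (a < 0),
    otherwise None, since a convex fit has no interior maximum.
    """
    a, b, c = (float(v) for v in np.polyfit(entropies, gains, deg=2))
    peak_h = -b / (2.0 * a) if a < 0 else None
    return a, b, c, peak_h


# Synthetic points on f(h) = -0.8 h^2 + 0.8 h (not data from the paper).
h = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
eg = -0.8 * h**2 + 0.8 * h

a, b, c, peak_h = fit_inverted_u(h, eg)
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}, peak entropy={peak_h:.3f}")
```

Fitting the sentiment and sarcasm points separately and comparing the two `peak_h` values is one way to quantify the claim that one curve peaks at higher entropy than the other.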

From a deployment perspective, Figure 31 suggests a simple, operational rule: not all inputs deserve the same exploration budget. Inputs with very low or very high normalized entropy can be safely handled with cheap, deterministic decoding, since best–of–k provides little additional value. In contrast, medium-entropy inputs are precisely where exploration should be concentrated: a modest best–of–k stack (often with k ∈ {4, 16}) can recover double-digit accuracy gains while keeping compute overhead focused on cases where it matters most. Taken together with the model-wise landscapes and the global heatmap (Figure 20), this entropy analysis reinforces our central message: a substantial fraction of few-shot in–context competence lives in structured, medium-entropy regions of the trajectory space that deterministic decoding simply never visits. Under the classical few–shot evaluation recipe popularized by large autoregressive language models (Brown et al., 2020; Wei et al., 2022a), these abilities may be misread as “missing”; our results show that they are already encoded in pθ(τ | x) and only reveal themselves when the model is interrogated with a richer, multi–sample decoding policy that actively exploits the success mass outside the single greedy path.
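One possible reading of this rule is an entropy-gated budget. The sketch below is an assumption-laden illustration: the cut-offs 0.2 and 0.8 reuse the low and high entropy bounds quoted for Figure 31, the unspecified bands between 0.2–0.3 and 0.7–0.8 are treated as “explore”, and the entropy estimate is assumed to come from a small pilot sample or another proxy. The function name and defaults are hypothetical.

```python
def exploration_budget(h_norm, low=0.2, high=0.8, k_explore=16):
    """Return the number of samples to draw for one input.

    h_norm    -- normalized label entropy in [0, 1] (assumed to be estimated
                 beforehand, e.g. from a small pilot sample)
    low, high -- hypothetical cut-offs borrowed from the Figure 31 discussion
    k_explore -- best-of-k budget for the medium-entropy band
    """
    if not 0.0 <= h_norm <= 1.0:
        raise ValueError("normalized entropy must lie in [0, 1]")
    if h_norm <= low or h_norm >= high:
        return 1  # confident or genuinely confused: a single greedy pass
    return k_explore


# Hypothetical entropies for three inputs: easy, hidden-majority, diffuse.
budgets = [exploration_budget(h) for h in (0.05, 0.5, 0.95)]
print(budgets)
```

A smaller `k_explore` (for example 4) trades some of the gain for lower cost; which value is appropriate depends on the task and is not fixed by this sketch.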

Qualitatively, we observe three regimes:

Low–entropy, low–gain pairs, where both greedy and stochastic sampling almost always pick the same label; here, EG_t,m^ICL(k) ≈ 0 and exploration offers little benefit.
Intermediate–entropy, high–gain pairs, where sampling reveals a distribution concentrated on one label (often the correct one) but greedy decoding systematically picks a different, incorrect local mode; these are precisely the “hidden majority” cases that drive large positive exploration gains.
High–entropy, mixed–gain pairs, where the label distribution is genuinely diffuse and both greedy and best–of–k struggle; here, the model’s underlying representation seems genuinely uncertain rather than merely mis–decoded.

Across all three views (accuracy curves, task–by–model heatmaps, and entropy–conditioned analysis), the conclusion is consistent: deterministic decoding systematically suppresses an exploration–driven in–context ability that is already encoded in the base model. Emergence, in this lens, is not a mysterious phase change in the parameters θ, but a property of the combined system consisting of pθ(τ | x) and an exploratory decoding policy that is allowed to search the trajectory space rather than commit to its first greedy choice.

4.2 InstruSum: Style–Constrained Generation as Multi–Objective Search

Beyond few–shot ICL on BESSTIE, we also ask whether the same distributional exploration that unlocks latent classification ability can surface instruction–following and style–constrained behavior in open–ended generation. The InstruSum benchmark (Liu et al., 2024) offers a natural setting for this question: each example couples a news article with a rich natural–language requirement that simultaneously specifies what to say (content focus) and how to say it (length, style, and format), building on a long line of controllable summarization work over news corpora (Hermann et al., 2015; Nallapati et al., 2016; Fan et al., 2018a; He et al., 2020; Chan et al., 2021). Rather than treating such evaluation as a single scalar score under a fixed decoding recipe, we view instruction–controllable summarization as a genuine multi–objective search problem over trajectories: each candidate summary trades off semantic adequacy against multiple constraint axes, and different decoding policies carve out different regions of this semantic–constraint landscape. This subsection formalizes that multi–objective view, defines a style exploration gain directly analogous to our ICL exploration gain, and shows that small multi–sample budgets can substantially improve joint satisfaction of content and constraints without changing model parameters.

4.2.1 Task Setup and Multi–Objective View

Tasks. For style– and constraint–satisfying generation, we build on InstruSum, a recently introduced benchmark for instruction–controllable summarization that pairs news articles with natural–language requirements specifying how the summary should be written (Liu et al., 2024). Each instance consists of: (i) an input article d, (ii) a human–written reference summary y⋆, and (iii) an instructional requirement r describing constraints on length, content focus, and sometimes style or format (e.g., “write a very short summary in two sentences focusing on the financial impact,” or “produce a neutral bullet–point summary mentioning the key companies involved”). In this sense, InstruSum can be viewed as a modern successor to earlier work on controllable summarization over news corpora such as CNN/DailyMail and related datasets (Hermann et al., 2015; Nallapati et al., 2016; Fan et al., 2018a; He et al., 2020; Chan et al., 2021), but with a richer space of free–form instructions and an explicit focus on testing LLMs’ instruction–following behavior.

As with our classification setup, we deliberately choose InstruSum because its benchmark configuration and data release fall after mid–2024. The benchmark and accompanying evaluation suite are introduced in a 2024 NAACL paper, with public artifacts finalized in the second half of 2024 (Liu et al., 2024). For the open models we analyze—whose pretraining cutoffs predate this period—this timing makes it unlikely that entire (article, requirement, summary) triplets or the InstruSum instruction templates were seen as supervised data. While underlying news articles or related domains may appear in generic web corpora, we treat InstruSum as a fresh, post–benchmarked resource for evaluating how decoding policies surface or suppress instruction–following and constraint–satisfying behavior.

Instance–level formulation. Concretely, we treat each pair (d_i, r_i) as an input and ask the model to generate a summary τ that both captures the content of d_i and obeys the constraints expressed in r_i. Let p_θ(τ | d_i, r_i) denote the conditional distribution induced by model parameters θ together with a decoding policy e (e.g., greedy, sampling, or multi–sample reranking). From the requirement r_i and reference y_i* we automatically derive a compact bundle of operational constraints

C_i = (C_i^len, C_i^inc, C_i^avoid, C_i^style),

where C_i^len is a target length band (short / medium / long), C_i^inc is a set of required entities or keywords, C_i^avoid is an optional set of avoid-phrases, and C_i^style is a coarse style/format indicator (e.g., neutral vs. opinionated tone; sentences vs. bullet list). These constraints feed into automatic checkers c_len, c_inc, c_avoid, c_style introduced below, which score how well a candidate τ respects each requirement.
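As an illustration of what such a bundle might look like in code, the following sketch represents C_i as an immutable record. The class name, field names, and example values are hypothetical; the paper does not specify a data structure, and the automatic derivation of the bundle from r_i and y_i* is not shown.

```python
from dataclasses import dataclass
from typing import FrozenSet

# The three length bands named in the text.
LENGTH_BANDS = ("short", "medium", "long")


@dataclass(frozen=True)
class ConstraintBundle:
    """Hypothetical container for C_i = (C_len, C_inc, C_avoid, C_style)."""

    length_band: str                      # C_i^len: "short", "medium", or "long"
    include: FrozenSet[str]               # C_i^inc: required entities/keywords
    avoid: FrozenSet[str] = frozenset()   # C_i^avoid: optional avoid-phrases
    style: str = "neutral/sentences"      # C_i^style: coarse tone/format tag

    def __post_init__(self):
        if self.length_band not in LENGTH_BANDS:
            raise ValueError(f"unknown length band: {self.length_band!r}")


# Illustrative bundle for a requirement such as "produce a neutral
# bullet-point summary mentioning the key companies involved".
# The company names are placeholders.
bundle = ConstraintBundle(
    length_band="short",
    include=frozenset({"Acme", "Globex"}),
    style="neutral/bullets",
)
```

Keeping the record frozen makes it safe to share one bundle across all candidate summaries sampled for the same instance.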

Multi-objective view. In addition to constraint satisfaction, we quantify semantic adequacy using a similarity score s_sem(τ; d_i, y_i*) ∈ [0, 1] that rewards summaries which are faithful to the article and informationally consistent with the reference. Each trajectory τ ∈ T_i (the space of token sequences for instance i) is therefore naturally associated with a vector of objectives

f_i(τ) = (s_sem(τ; d_i, y_i*), c_len(τ; C_i^len), c_inc(τ; C_i^inc), c_avoid(τ; C_i^avoid), c_style(τ; C_i^style)) ∈ [0, 1]⁵.

Style- and constraint-satisfying summarization can thus be viewed as a multi-objective search problem over T_i: the goal is to identify trajectories that achieve high semantic adequacy while simultaneously satisfying the length, inclusion, avoidance, and style/format requirements. A decoding policy e induces a distribution K_e(τ | d_i, r_i, θ) over trajectories; different policies explore different regions of the same underlying model distribution p_θ(· | d_i, r_i), and hence expose different subsets of the multi-objective landscape described by f_i.

4.2.2 Metrics, Success Sets, and Style Exploration Gain

Given the multi-objective view, each candidate summary τ is associated with f_i(τ) ∈ [0, 1]⁵ capturing what the summary says and how well it obeys the instruction. We now turn this representation into concrete metrics that let us compare decoding policies as search strategies over the same p_θ(· | d_i, r_i).

Component scores. For clarity, we restate the objective vector:

f_i(τ) = (s_sem(τ; d_i, y_i*), c_len(τ; C_i^len), c_inc(τ; C_i^inc), c_avoid(τ; C_i^avoid), c_style(τ; C_i^style)),

with each component in [0, 1]. We instantiate these as follows:

Semantic adequacy s_sem(τ; d_i, y_i*) rewards summaries that actually say the right thing: they should be faithful to the article and informationally consistent with the reference. In practice, we obtain this score from a fixed, rubric-guided LLM judge (details in App. ??).
Length satisfaction c_len(τ; C_i^len) measures how well the realized length of τ fits the requested band (short / medium / long). We map the deviation into [0, 1] using a piecewise linear penalty so that small deviations are not punished as harshly as large ones.
Entity/keyword inclusion c_inc(τ; C_i^inc) tracks what fraction of the required elements appear in the summary. We use a simple fuzzy matching approach (based on lemma overlap) to account for minor morphological variations while ensuring strict semantic adherence.
Avoid-phrase satisfaction c_avoid(τ; C_i^avoid) is a binary or fractional indicator that rewards the absence of the forbidden phrases. A score of 1 indicates the summary successfully avoided all restricted terms, while lower scores reflect the count of violations.
Style/format satisfaction c_style(τ; C_i^style) evaluates whether τ matches the requested presentation format (e.g., bulleted list vs. paragraph) and tone. Similar to semantic adequacy, this is assessed using a tailored LLM judge that checks for structural and stylistic markers.
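The three rule-based checkers can be sketched as follows. Several details are assumptions made only for illustration: the word-count ranges per band, the width of the linear penalty, the use of case-insensitive substring matching in place of the lemma-overlap matching described above, and the fractional form of the avoid score. The LLM-judged components s_sem and c_style are not reproduced here.

```python
# Hypothetical word-count ranges for each length band.
BAND_WORDS = {"short": (1, 50), "medium": (51, 120), "long": (121, 300)}


def c_len(summary, band, slack=0.5):
    """1.0 inside the band, then a linear decay to 0.0 outside it.

    The penalty reaches 0 once the deviation equals slack * upper bound
    (an illustrative choice of piecewise linear penalty).
    """
    lo, hi = BAND_WORDS[band]
    n_words = len(summary.split())
    if lo <= n_words <= hi:
        return 1.0
    deviation = lo - n_words if n_words < lo else n_words - hi
    return max(0.0, 1.0 - deviation / (slack * hi))


def c_inc(summary, required):
    """Fraction of required items found (substring match as a stand-in)."""
    if not required:
        return 1.0
    text = summary.lower()
    return sum(item.lower() in text for item in required) / len(required)


def c_avoid(summary, forbidden):
    """Fraction of forbidden phrases that are absent; 1.0 means no violations."""
    if not forbidden:
        return 1.0
    text = summary.lower()
    violations = sum(phrase.lower() in text for phrase in forbidden)
    return 1.0 - violations / len(forbidden)


summary = "Acme shares fell 5% after the merger with Globex was blocked."
scores = (
    c_len(summary, "short"),
    c_inc(summary, {"Acme", "Globex", "Initech"}),
    c_avoid(summary, {"shocking"}),
)
```

Substring matching would miss inflected forms such as plurals, which is the gap the lemma-based matching in the text is meant to close.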

Aggregate constraint satisfaction. To simplify comparison across different instruction-following capabilities, we often compute an aggregate satisfaction score as the average of the four individual checkers:

s_con(τ; C_i) = ¼ (c_len + c_inc + c_avoid + c_style) ∈ [0, 1].
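A short worked example of the aggregate follows; the function name mirrors the symbol s_con, and the input scores are hypothetical.

```python
def s_con(c_len, c_inc, c_avoid, c_style):
    """Unweighted mean of the four constraint checkers, each in [0, 1]."""
    scores = (c_len, c_inc, c_avoid, c_style)
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("every checker score must lie in [0, 1]")
    return sum(scores) / 4.0


# Hypothetical candidate: right length, two of three entities, no
# forbidden phrases, style judged as a partial match.
partial = s_con(1.0, 2 / 3, 1.0, 0.5)

# Hypothetical candidate that ignores the requested format entirely.
wrong_format = s_con(1.0, 1.0, 1.0, 0.0)
print(round(partial, 3), wrong_format)
```

Because the mean is unweighted, a summary that fully violates one constraint can still score 0.75, so the per-checker values remain useful alongside the aggregate.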

Search-based metrics. Given the multi-objective vector f_i(τ), how do we determine if a decoding policy e is “better” than another? If we had a single scalar reward, we could simply compare expected scores. In the multi-objective case, we consider two primary views:

The Scalarized View: We define a composite utility u_i(τ) = w · f_i(τ) using a fixed weight vector w that prioritizes different objectives (e.g., semantic adequacy vs. strict constraint satisfaction).