Interactive Research Paper

The Consistency Tax

Why per-step variance — not average accuracy — determines end-to-end reliability in agentic AI pipelines

Ihnaee Choi · AI Solution Architect, ServiceNow · May 2026

Enterprise AI models increasingly report per-step accuracy at or above 99%. This paper demonstrates that such metrics create a dangerous illusion of reliability. We identify two compounding problems. First, we formalize the mental playground — the cognitive space where users unconsciously convert "99% accurate" into instance-level certainty, grounded in Sunstein's probability neglect (2002) and Kahneman & Tversky's base rate neglect (1973). Second, we derive the consistency tax (−σ²/(2μ²)) — a structural penalty for per-step variance that compounds linearly with pipeline length. Monte Carlo simulations (50,000 runs) confirm: a 99% accurate model traversing 20 enterprise workflow steps achieves 81.8% end-to-end success under deterministic conditions, but drops to 59.6% under high variance (σ=0.05). The 1% failure that users dismiss in their mental playground compounds into 40% failure across an enterprise pipeline.

Introduction

A 99% accurate AI model sounds nearly perfect. In the mental playground of human cognition, 99% is functionally indistinguishable from 100%. The 1% failure probability is acknowledged intellectually but processed as zero operationally. This paper demonstrates that this cognitive shortcut, combined with a mathematical property of sequential stochastic processes, creates a reliability crisis invisible until it manifests at organizational scale.

Even under deterministic conditions (each step succeeds at exactly 99%), a 20-step enterprise pipeline yields only 0.99²⁰ = 81.8% end-to-end success. The 1% that seemed negligible has compounded into 18.2% failure. But real AI systems are not deterministic — they are stochastic. When we model this variance, a 99% accurate model with high variance (σ=0.05) drops to 59.6% at 20 steps. The 1% failure that users dismiss becomes 40% failure across an enterprise pipeline.
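The deterministic compounding arithmetic is a two-line check:

```python
# Deterministic compounding: each of N steps must succeed,
# each with independent probability p, so end-to-end success is p**N.
p, N = 0.99, 20
print(f"{p ** N:.1%}")  # 81.8%
```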

This paper argues that the deterministic assumption conceals two compounding problems.

The first is cognitive. We introduce the mental playground — the cognitive space where users unconsciously convert aggregate accuracy ("99% accurate") into instance-level certainty ("this output is correct"). Discovering that you believe 99% confidence is 100% confidence is extraordinarily difficult — precisely because you consciously know it's 99%. The gap occurs entirely within the mental playground, invisible to the user.

The second is mathematical. Real LLM systems are fundamentally stochastic. When per-step accuracy follows Xi ~ N(μ, σ²), the typical (median) end-to-end accuracy carries a structural per-step penalty of −σ²/(2μ²) that we term the consistency tax.

These two problems reinforce each other. The mental playground causes users to ignore variance. The consistency tax ensures that ignored variance extracts a compounding mathematical toll. This dual blindness — not knowing the problem exists while the problem worsens — is a primary driver of the gap between enterprise AI adoption (88%) and performance (6% high performers).

The Mental Playground Problem

2.1 The Deterministic Privilege

The history of technology adoption reveals a widening gap between understanding and use. Users operate microwave ovens, light switches, and smartphones with no comprehension of underlying mechanisms. This is possible because these technologies are deterministic: the same input reliably produces the same output.

AI systems appear to operate identically. A user enters a prompt and receives a response. However, one critical difference exists: AI produces different outputs from identical inputs. The same prompt, submitted to the same model, yields meaningfully different responses across runs. This single property invalidates the deterministic mental model that users unconsciously apply.

2.2 Defining the Mental Playground

Definition

Mental playground — the unexamined cognitive space in which users form implicit expectations about AI outputs, accept or reject results without conscious probabilistic reasoning, and process stated accuracy metrics as instance-level certainty rather than aggregate-level distributions.

The mental playground has three defining properties: (1) invisible — users do not recognize they are making probabilistic assumptions; (2) non-falsifiable in real-time — any individual output could be correct; and (3) trust-reinforcing — each apparently successful interaction reduces the motivation to verify the next.

2.3 Probability Neglect in AI Contexts

Sunstein (2002) demonstrated probability neglect through experiments where subjects warned of an electric shock showed fear responses that varied with shock intensity but not with shock probability — even when probability ranged from 1% to 50%. People respond to the possibility of an outcome, not its likelihood.

In AI contexts, we observe the inverse pattern. Sunstein's subjects over-weighted negative possibilities regardless of probability; AI users under-weight the possibility of incorrect output regardless of probability. When told a system is "99% accurate," users process each individual output as near-certain. The 1% failure is acknowledged intellectually but neglected operationally. When the annotation reads "99% confidence," verifying every result feels like wasted effort.

Kahneman and Tversky's (1973) base rate neglect compounds this: users focus on the salient individuating information (the plausible AI response) while ignoring the base rate (the probability that any given output is incorrect).

2.4 Organizational Amplification

At the individual level, the mental playground is manageable. At the organizational level, containment breaks down. One person's AI output becomes another's input. The moment an unverified result becomes the premise for a subsequent decision, individual probabilistic error converts to system-level risk.

Each person in the chain applies their own mental playground independently. Verification responsibility diffuses across the organization while the underlying probabilistic error compounds across steps.

The Consistency Tax: Mathematical Framework

3.1 Problem Formulation

Let each step i in an N-step pipeline have accuracy Xi drawn independently from:

Xi ~ N(μ, σ²), μ ∈ (0, 1], σ ≥ 0

End-to-end accuracy is the product of all per-step accuracies (clamped to [0, 1]):

Ae2e = ∏ min(max(Xi, 0), 1) for i = 1 to N
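This formulation can be simulated directly. A minimal Monte Carlo sketch in NumPy, using the paper's high-variance parameters (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, runs = 0.99, 0.05, 20, 50_000  # high-variance scenario

# Per-step accuracies X_i ~ N(mu, sigma^2), clamped to [0, 1]
X = np.clip(rng.normal(mu, sigma, size=(runs, N)), 0.0, 1.0)

# End-to-end accuracy is the product over the N steps of each run
e2e = X.prod(axis=1)
print(f"mean E2E: {e2e.mean():.1%}, median E2E: {np.median(e2e):.1%}")
```

With these settings the mean lands near the paper's simulated 59.6% figure, versus 81.8% for the deterministic case.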

3.2 Log-Normal Approximation

Taking the natural logarithm: ln(Ae2e) = Σ ln(Xi). By the central limit theorem, this sum of independent terms is approximately normal for moderate-to-large N, so Ae2e is approximately log-normal.
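A quick empirical check of the log-normal claim, under the same simulation model: if ln(Ae2e) is approximately normal, its mean and median should nearly coincide, since a normal distribution is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, runs = 0.99, 0.02, 20, 50_000

# Lower clamp of 1e-9 (rather than 0) avoids log(0); with sigma = 0.02
# the probability mass near zero is negligible anyway.
X = np.clip(rng.normal(mu, sigma, size=(runs, N)), 1e-9, 1.0)
log_e2e = np.log(X).sum(axis=1)  # ln(A_e2e) = sum of ln(X_i)

print(abs(log_e2e.mean() - np.median(log_e2e)))  # small gap => near-symmetric
```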

3.3 Deriving the Consistency Tax

A second-order Taylor expansion of ln(Xi) around μ gives E[ln(Xi)] ≈ ln μ − σ²/(2μ²). Exponentiating N · E[ln(Xi)] yields the median of the log-normal approximation, i.e. the typical end-to-end accuracy:

median[Ae2e] ≈ exp(N · (ln μ − σ²/(2μ²)))

The term −σ²/(2μ²) is the consistency tax per step. It is strictly negative whenever σ > 0, its cumulative effect in the exponent scales linearly with N, and it grows quadratically with σ. This is a structural property of sequential stochastic processes, independent of any specific failure mode. Because the approximation ignores the clamping of Xi to [0, 1], it understates the simulated penalty at larger σ.
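A sketch of the closed form at the "Five Sigma" setting (σ = 0.008); the printed value differs slightly from the simulated 81.1% because the approximation omits the [0, 1] clamping:

```python
import math

mu, sigma, N = 0.99, 0.008, 20  # the "Five Sigma" variance level

tax_per_step = sigma**2 / (2 * mu**2)               # consistency tax per step
approx = math.exp(N * (math.log(mu) - tax_per_step))

print(f"no-variance baseline: {mu ** N:.1%}")       # 81.8%
print(f"with consistency tax: {approx:.1%}")        # 81.7% (simulation: 81.1%)
```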

Run the Simulation Yourself

The interactive version of this paper includes a simulation where you can adjust the mean per-step accuracy (default 0.99), pipeline length (default 20), and run count (default 10,000) to see how variance affects end-to-end pipeline reliability. Two charts summarize the output: "End-to-End Accuracy Distribution by Variance Level" (each curve is 10,000 simulated pipeline runs with the same mean accuracy but different consistency) and "Pipeline Length vs. Median E2E Success Rate" (how consistency degrades as workflows get longer), alongside a full results table.
Key Findings

- Deterministic (σ = 0): 81.8% end-to-end success over 20 steps
- Five Sigma (σ = 0.008): 81.1% end-to-end success over 20 steps
- High variance (σ = 0.05): 59.6% end-to-end success over 20 steps
- Consistency tax: −σ²/(2μ²) per step

The Interaction Effect

The mental playground causes users to ignore variance ("it's 99% accurate, what could go wrong?"). The consistency tax ensures that ignored variance produces a structural penalty that compounds across every step. The 1% failure that users dismiss in their mental playground becomes 40% failure across a 20-step enterprise pipeline. Organizations that do not measure per-step variance are simultaneously (a) unaware of the problem and (b) accumulating compounding reliability debt.

This dual blindness — not knowing the problem exists while the problem worsens — is a primary driver of the gap between enterprise AI adoption (88%) and performance (6% high performers).

Implications

For enterprise AI strategy: A 99% accurate model sounds nearly perfect. But the 1% you dismiss compounds into 40% failure across 20 steps. Average accuracy is the most dangerous metric. The question is not "how accurate is your model?" but "how consistent is it?"
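As an illustration, consider two hypothetical models under the paper's simulation setup: a perfectly consistent model with a lower mean can beat a more accurate but high-variance one end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)
N, runs = 20, 50_000

# Model A: higher mean per-step accuracy, but high variance
a = np.clip(rng.normal(0.99, 0.05, size=(runs, N)), 0, 1).prod(axis=1).mean()
# Model B: lower mean per-step accuracy, zero variance (hypothetical)
b = 0.98 ** N

print(f"A (mu=0.99, sigma=0.05): {a:.1%}")  # ~59-60%
print(f"B (mu=0.98, sigma=0.00): {b:.1%}")  # 66.8%
```

Despite a full percentage point less mean accuracy per step, the deterministic model B wins end-to-end, which is the "consistency over accuracy" point in numbers.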

Platforms that reduce per-step variance — through deterministic workflows, validated tool libraries, structured output schemas — deliver compounding reliability gains that raw model capability cannot match. This provides a mathematical foundation for why governed platform deployments consistently outperform DIY API connections.

References

[1] AgencyBench (2026). Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts. ACL 2026. arXiv:2601.11044.

[2] BPI Challenge 2018. TU Eindhoven. 4TU.ResearchData.

[3] DenoiseFlow (2026). Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows. Tsinghua Univ.

[4] Kahneman, D. & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80(4).

[5] McKinsey (2025). The State of AI in 2025.

[6] MindStudio (2026). Best AI Models for Agentic Workflows in 2026.

[7] Patel, K. et al. (2026). The Six Sigma Agent. arXiv:2601.22290.

[8] Sunstein, C. R. (2002). Probability Neglect: Emotions, Worst Cases, and Law. Yale Law Journal, 112.

[9] Tangoe / Vanson Bourne (2024). GenAI Cloud Spending Survey.

[10] Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis (2026). arXiv:2602.13320.