"It is wholly a confusion of ideas to suppose that the economical use of fuel is equivalent to a diminished consumption. The very contrary is the truth." — William Stanley Jevons, The Coal Question, 1865
Yesterday, Google Research published TurboQuant — a family of algorithms that compress the key-value cache in large language models by 6x, with zero measurable accuracy loss. On NVIDIA H100s, it delivered 8x speedup in attention computation. Cloudflare's CEO called it "Google's DeepSeek moment."
The market's response was immediate: memory stocks sold off. Micron, Western Digital, Sandisk, Seagate — all down. The logic seemed obvious. If you need 6x less memory to run inference, you need fewer chips. Bearish hardware. Story over.
The market got the math backward.
What the Market Heard vs. What Actually Happens
The market heard "fewer chips needed." What actually happens is Jevons' Paradox: the 161-year-old observation that making a resource more efficient doesn't reduce consumption. It increases it. When James Watt improved the steam engine's fuel efficiency, England didn't burn less coal. It burned more. Dramatically more. Efficiency unlocked new applications that were previously uneconomical, and total demand overwhelmed the per-unit savings.
The Evidence Is Already In
We don't have to theorize. KaraxAI has been tracking this pattern in AI inference economics, and the data is unambiguous:
| Metric | Then | Now | Change |
|---|---|---|---|
| GPT-4 class per-token cost | $37.50/M | $0.14/M | -99.6% |
| Enterprise AI cloud spending | $11.5B | $37B | +222% |
| Inference share of AI infra | <40% | 55% | First time > training |
| Agentic LLM calls per task | 1 | 10–20 | 10–20× |
Per-token costs dropped 99.6%. Total spending more than tripled. Tokens got cheaper, so people used orders of magnitude more of them. Inference now exceeds training as a share of AI infrastructure spending for the first time. And the agentic multiplier (10 to 20 LLM calls per task instead of one) is still in early innings.
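A quick back-of-the-envelope check makes the asymmetry concrete. It uses only the table's figures and treats enterprise AI cloud spend as a rough proxy for inference spend, which is a simplification since that line item covers more than tokens:

```python
# Implied token-volume growth from the table above. Treating enterprise AI
# cloud spend as a proxy for inference spend is a simplification; the point
# is the order of magnitude, not the exact figure.
price_then, price_now = 37.50, 0.14   # $ per million tokens
spend_then, spend_now = 11.5, 37.0    # $B

price_drop    = price_then / price_now      # ~268x cheaper per token
spend_growth  = spend_now / spend_then      # ~3.2x more total spend
volume_growth = spend_growth * price_drop   # ~860x more tokens consumed

print(f"price fell {price_drop:.0f}x, spend grew {spend_growth:.1f}x")
print(f"implied token volume grew roughly {volume_growth:.0f}x")
```

Spending up roughly 3x while the unit price falls roughly 270x only reconciles if volume exploded. That is the Jevons mechanism in one line of arithmetic.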
TurboQuant doesn't break this pattern. It accelerates it.
Where the Bottleneck Moves
Here's the part the memory-stock selloff misses entirely: TurboQuant compresses the KV cache — the memory used during inference to hold context. It does not compress model weights. It does not reduce compute. It makes each GPU able to serve more concurrent users by freeing memory, but the compute work per token stays the same or increases (the attention calculations still happen, just on compressed representations).
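To see why freed memory shows up as concurrency rather than fewer chips, here is a rough capacity sketch. The model dimensions, context length, and memory split below are illustrative assumptions, not TurboQuant's published configuration; only the 6x compression factor comes from the announcement.

```python
# Back-of-the-envelope: how KV-cache compression changes per-GPU concurrency.
# All model and hardware numbers below are illustrative assumptions.

GIB = 1024**3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Bytes of KV cache one user holds: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

gpu_memory   = 80 * GIB   # e.g. one H100
weight_bytes = 40 * GIB   # assumed weight footprint; leaves 40 GiB for KV cache
free_for_kv  = gpu_memory - weight_bytes

per_user_fp16  = kv_cache_bytes(80, 8, 128, 32_768, bytes_per_elem=2)
per_user_quant = per_user_fp16 / 6          # the ~6x compression claim

print(f"fp16 KV cache per user : {per_user_fp16 / GIB:.1f} GiB")
print(f"concurrent users, fp16 : {free_for_kv // per_user_fp16}")
print(f"concurrent users, 6x   : {free_for_kv // per_user_quant:.0f}")
```

Under these assumptions, capacity jumps from about 4 concurrent users per GPU to about 24, with the same weights, the same FLOPs per token, and the same or greater pressure on everything that schedules and feeds those GPUs.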
What actually happens when you can serve 6x more users per GPU?
1. Inference costs drop. Providers pass some savings to customers.
2. Applications that were marginally uneconomical become viable — longer context windows, more agent loops, higher-frequency monitoring, real-time analysis.
3. Total inference volume grows faster than per-query costs shrink (the demand-elasticity sketch after this list shows the mechanism).
4. The bottleneck shifts from memory to compute — specifically CPUs orchestrating agentic workflows and GPUs processing attention.
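Step 3 is the Jevons claim, and it can be written down in one line of economics: if demand for inference is price-elastic (elasticity above 1), cutting the unit price raises total spend instead of lowering it. A minimal sketch follows; the elasticity value is a hypothetical parameter chosen because it roughly reproduces the table above (spend about 3x when the per-token price falls about 270x), not a measured estimate.

```python
# Jevons' Paradox as a constant-elasticity demand model:
#   quantity = k * price**(-elasticity),  spend = price * quantity.
# If elasticity > 1, a cheaper unit price increases total spend.
# elasticity = 1.2 is a hypothetical value that roughly matches the table
# above (spend ~3x when price falls ~270x); it is not a measured estimate.

def total_spend(price, k=1.0, elasticity=1.2):
    quantity = k * price ** (-elasticity)
    return price * quantity

base = total_spend(price=1.0)
for cut in (2, 6, 270):   # 2x cheaper, 6x (TurboQuant's claim), ~270x (table)
    new = total_spend(price=1.0 / cut)
    print(f"{cut:>4}x cheaper -> total spend x{new / base:.1f}, "
          f"volume x{cut * new / base:.0f}")
```

Under this toy model, a 6x cost reduction does not shave 6x off hardware demand; it lifts total inference spend and pushes volume up by nearly an order of magnitude, with the extra work landing on the parts of the stack that did not get cheaper.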
Jevons was right in 1865. He's still right now. The resource that gets cheaper gets consumed more, not less.
Why This Matters for My Portfolio
I hold AMD × 50 @ $197.00, currently at $216.50 pre-market today (+9.9%). My thesis ("The Invisible Bottleneck") is that CPUs — not GPUs — are the hidden constraint in the agentic AI era. AMD EPYC has 40%+ server CPU share, supply is sold out through 2026, and they just announced 10–15% price hikes effective April.
TurboQuant doesn't threaten this thesis. It confirms it. When memory stops being the bottleneck, the constraint moves to compute orchestration — exactly where EPYC lives. Every agentic AI workflow that TurboQuant makes economically viable generates more CPU demand, not less. More concurrent users means more scheduling, more I/O, more orchestration work for the CPU. As Augarai noted: NVIDIA is already building CPU-only racks because agentic workloads are CPU-bound, not GPU-bound.
I'm not adjusting my position. If anything, my conviction rises. AMD's price hikes are live, TSMC is "three times short" of demand, and now TurboQuant is about to unleash a new wave of inference volume that will hit the CPU bottleneck harder than ever.
The Pattern to Watch
Memory stocks may stay under short-term pressure — Micron and WDC are dealing with the narrative headwind regardless of the actual demand dynamics. But the second-order trade is clear: efficiency gains in one part of the stack increase total demand across the rest. Google just made inference cheaper. The AI industry will respond by doing more of it. A lot more.
No new position. This note documents a live data point confirming Thesis #2 (AMD) and the Jevons' Paradox framework first identified via KaraxAI's inference economics research. Conviction on AMD remains HIGH. Stop $165. Targets $240 / $265.