When I pay my monthly subscription to Claude, I'm not paying for software. I'm renting hardware.
That realization didn't come from reading an analyst report. It came from building things. Over the past several months, I've set up OpenClaw to offload workloads from cloud-based AI services, launched a self-hosted Ghost newsletter on AWS, and built a local inference pipeline running open-weight LLMs on a Mac Mini. Three very different projects, all using mature, well-documented, freely available software. And in every single case, the limitation I hit was the same: not enough compute.
But here's the thing I keep coming back to: compute isn't the permanent bottleneck. It's just where the bottleneck lives right now. NVIDIA, the hyperscalers, and the global semiconductor supply chain are throwing hundreds of billions of dollars at the hardware constraint. They will solve it — not overnight, but relentlessly and on an annual cadence. And when they do, the bottleneck won't vanish. It will migrate upstream to the layer that feeds compute: energy.
This is a story about that migration. It connects my Mac Mini to a $600 billion infrastructure buildout, explains why the SaaS business model is breaking apart, and points toward the real constraint that no one has fully reckoned with yet. The value chain has always been the same: energy in, compute out, intelligence delivered. The only question is which layer is the binding constraint at any given moment. Right now, it's hardware. Soon, it will be watts.
My Stack, My Wall
My local inference server is a Mac Mini running Ollama. It handles Qwen 14B well enough for volume drafting work that I then hand off to Claude for finishing and refinement. But 14 billion parameters is the ceiling. A 70B model needs about 140 GB of memory for its weights alone in FP16 (70 billion parameters at two bytes each), far beyond what my hardware supports. Most consumer GPUs top out at 24 to 32 GB of VRAM; anything larger means multi-GPU setups or aggressive quantization that trades accuracy for fit. The model isn't the constraint. The silicon and memory are.
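The memory ceiling is simple arithmetic, and a rough sketch makes it concrete. The function name and the precision table below are my own illustration, not part of Ollama, and the estimate covers weights only: it ignores the KV cache, activations, and runtime overhead, so real requirements run higher.

```python
# Rough weight-memory estimate: parameter count times bytes per parameter.
# Weights only; KV cache, activations, and runtime overhead are not included.

# Bytes per parameter at common precisions (illustrative table).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Return approximate weight memory in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9


for name, size_b in [("14B", 14), ("70B", 70)]:
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(size_b, precision)
        print(f"{name} @ {precision}: ~{gb:.0f} GB")
```

Under these assumptions a 70B model is about 140 GB at FP16 and about 35 GB at 4 bits, while a 14B model is about 28 GB at FP16 and about 7 GB at 4 bits.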
My K3s cluster — a ThinkPad as the control-plane node and an HP worker node — runs OpenClaw and handles workloads I've pulled off cloud-based AI services to run locally. The cluster itself isn't the problem. I don't come anywhere near maxing out the cluster's compute or memory. The bottleneck is the inference model running on my Mac Mini. That's where the memory fills up. That's where requests queue. The compute cluster has headroom; the inference layer doesn't. I'm seeing at my own small scale exactly what the industry is seeing at hyperscale — inference is where the constraint lives, not in the general-purpose compute surrounding it.
Every architectural decision I make is shaped not by what the software can do, but by what the inference hardware will tolerate.
My Ghost newsletter runs on an AWS Lightsail instance behind CloudFront and WAF. Ghost is an excellent platform — the editor, the theme system, the API are all solid. But every operational hiccup I've dealt with traces back to the compute envelope I'm working within. Memory pressure during peak loads. Build times constrained by instance size. The infrastructure works, but it works within the box that my hardware budget defines.
Three projects. Three different software stacks. One bottleneck. Software has become abundant. Compute has not — yet.
What I'm Actually Paying For
This pattern forced me to look more carefully at what my AI subscriptions and API bills actually represent. When I send a prompt to Claude, GPUs in a data center are performing matrix multiplications, consuming electricity, and generating heat that has to be removed. Every token has a real cost: not the near-zero marginal cost of serving a traditional software user, but a measurable draw on physical hardware. Estimates suggest a typical AI query costs the provider between one and three cents in compute for a standard model, and ten to fifty cents for a complex reasoning query. At the top of that range, two complex reasoning questions per day add up to about $30 a month in compute, more than the entire $20 subscription.
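The subscription math can be checked in a few lines. This is a sketch using only the estimate ranges quoted above; the 30-day month and the function names are assumptions for illustration, and none of the figures are measured provider costs.

```python
# Provider-side compute cost of a usage pattern versus a flat subscription.
# Per-query costs are the quoted estimate ranges, not measured values.

SUBSCRIPTION_USD = 20.00
DAYS_PER_MONTH = 30  # assumption for illustration


def monthly_compute_cost(queries_per_day: float, cost_per_query_usd: float) -> float:
    """Estimated monthly compute cost for a steady daily query volume."""
    return queries_per_day * cost_per_query_usd * DAYS_PER_MONTH


def breakeven_queries_per_day(cost_per_query_usd: float) -> float:
    """Daily queries at which compute cost equals the subscription price."""
    return SUBSCRIPTION_USD / (cost_per_query_usd * DAYS_PER_MONTH)


for cost in (0.10, 0.50):  # low and high end of the reasoning-query range
    print(
        f"${cost:.2f}/query: 2 per day costs ${monthly_compute_cost(2, cost):.2f}/month, "
        f"break-even at {breakeven_queries_per_day(cost):.1f} queries/day"
    )
```

At fifty cents per query the subscription breaks even at roughly 1.3 reasoning queries per day; at ten cents it takes closer to 6.7.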
When I make API calls through OpenClaw to run workloads locally instead of on cloud services, the relationship is even more transparent. I'm paying in hardware capacity — memory, compute cycles, electricity — for every token processed. There's no abstraction layer pretending I'm buying software. I'm buying compute, whether I'm paying a cloud provider per token or burning my own silicon to avoid that bill.
This is a fundamental break from the economics that powered the technology industry for the last twenty years. Traditional SaaS companies had gross margins of 70 to 80 percent because the marginal cost of serving one more user was essentially zero. The code was written once and served indefinitely. AI products don't work this way. Every interaction consumes real resources. SaaS companies typically spent about 5 percent of revenue on server costs. AI-native applications routinely spend more than half. Anthropic's gross margins reportedly sit around 50 to 55 percent — closer to an industrial operation than a software company. OpenAI reportedly spends over $700,000 per day just on ChatGPT inference.
When I pay for an AI subscription, I'm not licensing software. I'm renting inference hardware with a chat interface on top. And the cost of that hardware is dominated by two things: silicon and the electricity to run it.
The SaaS Model Breaks
The market has started to price this in. The term "SaaSpocalypse" entered the financial lexicon in early 2026 after roughly $2 trillion in market capitalization evaporated from the software sector in a matter of weeks. The SaaS index underperformed the S&P 500 by over 20 percentage points through 2025. Median revenue multiples for public SaaS companies dropped from a pandemic peak of 18–19x to around 5x by the end of 2025. Salesforce lost 26 percent of its market cap. Atlassian fell 30 percent. Adobe dropped 19 percent. A Morgan Stanley basket of SaaS stocks declined 15 percent in the first two weeks of January 2026 alone. Apollo Global Management cut its private credit exposure to software from 20 percent to 10 percent, and $17.7 billion in technology-related corporate loans fell to distressed levels within four weeks.
The disruption isn't just financial. It's structural. AI agents are replacing the human operators who needed per-seat licenses. Bain & Company reports that vendors are already seeing slower growth in seat counts as AI makes customers more efficient — or eliminates the need for the seat entirely. Gartner predicts that 35 percent of point-product SaaS tools will be replaced by AI agents by 2030. When an AI agent can create a project ticket from a Slack conversation, assign it based on workload, and follow up autonomously, the project management software becomes overhead rather than infrastructure.
As software becomes abundant and intelligence becomes a metered utility, the value of operating a tool collapses. What matters instead is knowing how to orchestrate inference — understanding which workload requires the expensive cloud reasoning model and which can be handled by the local 14B drafting model sitting on your desk. When a single high-reasoning token can cost fifty cents, you can't afford to throw frontier compute at every task. This is the skill I'm developing in real time with my own setup: routing volume work to Qwen 14B on the Mac Mini, reserving Claude for the tasks where its reasoning depth justifies the cost. The premium isn't in clicking buttons inside software anymore. It's in architectural intent — deciding where in the stack each token gets spent.
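The routing decision described here can be sketched as a small policy function. Everything in it is hypothetical: the backend labels, the Task fields, and the budget rule are illustrations of the idea, not how OpenClaw, Ollama, or Claude actually expose routing.

```python
from dataclasses import dataclass

# Hypothetical backend labels, for illustration only.
LOCAL = "local-qwen-14b"
CLOUD = "cloud-reasoning"


@dataclass
class Task:
    name: str
    needs_deep_reasoning: bool
    est_cloud_cost_usd: float  # rough estimate supplied by the caller


def route(task: Task, budget_remaining_usd: float) -> str:
    """Send a task to the cloud model only if it needs it and the budget allows."""
    if task.needs_deep_reasoning and task.est_cloud_cost_usd <= budget_remaining_usd:
        return CLOUD
    return LOCAL


def plan(tasks: list[Task], daily_budget_usd: float) -> dict[str, str]:
    """Assign each task to a backend, spending the cloud budget in task order."""
    remaining = daily_budget_usd
    assignments: dict[str, str] = {}
    for task in tasks:
        backend = route(task, remaining)
        if backend == CLOUD:
            remaining -= task.est_cloud_cost_usd
        assignments[task.name] = backend
    return assignments


tasks = [
    Task("bulk-draft", needs_deep_reasoning=False, est_cloud_cost_usd=0.0),
    Task("final-review", needs_deep_reasoning=True, est_cloud_cost_usd=0.50),
    Task("second-review", needs_deep_reasoning=True, est_cloud_cost_usd=0.50),
]
print(plan(tasks, daily_budget_usd=0.75))
```

With a 75-cent daily budget, only the first reasoning task reaches the cloud model; the second falls back to the local model once the budget is spent. A real policy would likely weigh task priority as well as order.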
Meanwhile, as Calcalist Tech observed, the once-neglected world of hardware — long considered dull by investors enamored with SaaS for its fast growth, low capital requirements, and high multiples — is heating up and in some cases trading at multiples far higher than software companies. The value is migrating from code to silicon. And silicon requires power.
Inference Eats Everything
I described the pattern in my own system — the K3s cluster has headroom while the inference layer on the Mac Mini is the chokepoint. The industry data confirms this isn't a quirk of my setup. It's the shape of the constraint everywhere.
Training large language models grabbed most of the attention from 2022 to 2024 — the race to build ever-larger foundation models at ever-higher cost. But once a model is trained, it has to be served. Every query, every agent action, every reasoning chain burns inference compute. For most companies deploying AI, inference now accounts for 80 to 90 percent of total lifetime compute cost. Barclays estimated that GPT-4's cumulative inference bill reached $2.3 billion by the end of 2024 — fifteen times its roughly $150 million training cost. Demand for inference compute is projected to grow 118-fold by 2026, reaching three times the total demand for training.
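The Barclays figures can be cross-checked against the 80 to 90 percent claim. This sketch uses only the two dollar estimates quoted above; the function name is my own, and the result covers spend through the end of 2024 rather than a full model lifetime.

```python
# Cross-check: what share of cumulative compute spend is inference,
# given the quoted GPT-4 estimates?

GPT4_TRAINING_USD = 150e6   # roughly $150 million, as quoted
GPT4_INFERENCE_USD = 2.3e9  # $2.3 billion cumulative by end of 2024, as quoted


def inference_share(training_usd: float, inference_usd: float) -> float:
    """Fraction of combined training-plus-inference spend that is inference."""
    return inference_usd / (training_usd + inference_usd)


ratio = GPT4_INFERENCE_USD / GPT4_TRAINING_USD
share = inference_share(GPT4_TRAINING_USD, GPT4_INFERENCE_USD)
print(f"inference is {ratio:.1f}x training, {share:.0%} of combined spend")
```

On these numbers inference is about 15 times training and about 94 percent of combined spend, slightly above the 80 to 90 percent band cited for most deployments.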
Test-time scaling — the technique where reasoning models "think longer" before answering — is accelerating this further. Models like DeepSeek-R1 and OpenAI's o1 generate 10 to 100 times more tokens per query than a standard model doing single-pass inference. NVIDIA's own analysis notes that challenging reasoning queries can require over 100 times the compute of a single inference pass. Every chain-of-thought step, every self-verification loop, every reasoning branch burns tokens. And tokens burn hardware. And hardware burns watts.
The Stanford 2025 AI Index found that inference cost per token at GPT-3.5 level dropped 280-fold between late 2022 and late 2024. Hardware costs have declined about 30 percent per year, while energy efficiency has improved about 40 percent per year. But falling unit costs haven't reduced total spend, because demand is growing faster than costs are falling.
William Stanley Jevons described this dynamic in 1865 when he observed that more efficient steam engines didn't reduce coal consumption but increased it, because cheaper energy made more applications viable. The same paradox is playing out in AI. Every efficiency gain in inference hardware unlocks new workloads, longer context windows, more sophisticated reasoning chains, and broader deployment. Satya Nadella invoked Jevons explicitly after DeepSeek demonstrated lower-cost training: the cheaper it gets, the more we use. NVIDIA's own technical blog references the paradox when explaining why more efficient LLM inference will consume more computing resources, not fewer.
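The paradox is easy to see in a toy model: total spend is unit cost times volume, so spend rises whenever volume grows faster than unit cost falls. The 30 percent annual cost decline comes from the figures above; the 100 percent annual demand growth is purely an assumption for illustration, as are the function names.

```python
# Toy Jevons model: total spend = unit cost x volume, each at a fixed annual rate.


def total_spend(base_spend: float, unit_cost_decline: float,
                demand_growth: float, years: int) -> list[float]:
    """Yearly total spend, starting from base_spend in year 0."""
    spend = []
    for year in range(years + 1):
        unit_cost = (1 - unit_cost_decline) ** year
        volume = (1 + demand_growth) ** year
        spend.append(base_spend * unit_cost * volume)
    return spend


def breakeven_demand_growth(unit_cost_decline: float) -> float:
    """Annual demand growth at which total spend stays flat."""
    return 1 / (1 - unit_cost_decline) - 1


COST_DECLINE = 0.30   # hardware cost decline per year, as quoted
DEMAND_GROWTH = 1.00  # assumed doubling of demand each year (illustrative)

print(f"break-even demand growth: {breakeven_demand_growth(COST_DECLINE):.1%}")
for year, spend in enumerate(total_spend(100.0, COST_DECLINE, DEMAND_GROWTH, 3)):
    print(f"year {year}: spend index {spend:.1f}")
```

With a 30 percent annual cost decline, any demand growth above roughly 43 percent per year pushes total spend up. Under the assumed doubling, spend grows 1.4 times each year even as every unit gets cheaper.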
I see this in miniature on my own bench. When I optimized my Ollama setup to run Qwen 14B more efficiently, I didn't use less compute. I started using it for more tasks, running more drafts, testing more prompts. The efficiency freed me to demand more from the hardware — and I quickly hit the ceiling again. Solve the hardware bottleneck and you don't eliminate the constraint. You shift it to the next layer down.
The Architecture of Scarcity
NVIDIA's product roadmap reads as a direct assault on the hardware bottleneck — and everything about it tells you that the next constraint is power.
Blackwell was the first architecture to treat the rack — not the server — as the fundamental unit of compute. Traditionally, a data center server is a self-contained machine with its own processors and memory; when a workload outgrows one server, it spills across multiple machines connected by a network, and that network becomes the bottleneck. Blackwell's NVL72 system eliminates that boundary by connecting 72 GPUs across an entire rack through NVLink, NVIDIA's high-speed interconnect, so they share memory and communicate as if they were all on the same motherboard. The rack becomes one machine. Its second-generation Transformer Engine introduced NVFP4, a 4-bit floating-point format that nearly doubles memory efficiency versus FP8 while maintaining close to FP8-level accuracy. Blackwell Ultra pushed NVFP4 performance to 15 petaFLOPS per GPU, a 7.5x increase over the Hopper H100.
Vera Rubin, arriving in the second half of 2026, pushes harder. Each Rubin GPU delivers 50 petaFLOPS of NVFP4 inference compute — five times Blackwell's base. The rack-level system provides 3.6 exaFLOPS of inference performance, backed by eight stacks of HBM4 memory per GPU delivering 288 GB of capacity at 22 terabytes per second of bandwidth. NVIDIA projects that Rubin will train mixture-of-experts models with one-quarter the GPUs that Blackwell requires and cut the cost per million inference tokens by a factor of ten.
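The per-GPU and rack-level figures are consistent with each other, which a short check confirms. The one assumption is the GPU count: I take the Rubin rack to hold 72 GPUs, as the "Vera Rubin NVL72" name in Further Reading suggests. The variable names are my own.

```python
# Sanity check of the quoted Rubin figures. All inputs are from the text
# except GPUS_PER_RACK, which is assumed from the "NVL72" product name.

GPUS_PER_RACK = 72               # assumption based on the NVL72 name
RUBIN_PFLOPS_PER_GPU = 50        # NVFP4 inference, as quoted
RUBIN_VS_BLACKWELL_BASE = 5      # "five times Blackwell's base", as quoted
HBM_GB_PER_GPU = 288             # as quoted
HBM_STACKS_PER_GPU = 8           # as quoted

rack_exaflops = GPUS_PER_RACK * RUBIN_PFLOPS_PER_GPU / 1000
implied_blackwell_pflops = RUBIN_PFLOPS_PER_GPU / RUBIN_VS_BLACKWELL_BASE
gb_per_stack = HBM_GB_PER_GPU / HBM_STACKS_PER_GPU
rack_hbm_tb = GPUS_PER_RACK * HBM_GB_PER_GPU / 1000

print(f"rack inference compute: {rack_exaflops:.1f} exaFLOPS")
print(f"implied Blackwell base: {implied_blackwell_pflops:.0f} petaFLOPS per GPU")
print(f"HBM4 per stack: {gb_per_stack:.0f} GB, per rack: {rack_hbm_tb:.1f} TB")
```

Seventy-two GPUs at 50 petaFLOPS each gives the quoted 3.6 exaFLOPS, and the same rack would carry about 20.7 TB of HBM4 at 36 GB per stack.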
But look at the architectural choices that tell you what comes next. Rubin introduces the Vera CPU, purpose-built for the data movement patterns that inference demands. It formalizes disaggregated inference, where the context phase and the generation phase run on separately optimized hardware. The Rubin CPX accelerator targets long-context workloads with three times the attention acceleration of the prior generation. The Inference Context Memory Storage Platform, built on BlueField-4 DPUs, creates a shared key-value cache tier so context can be reused rather than recomputed. The cable-free modular tray design enables 18x faster assembly and servicing versus Blackwell.
Every one of these features optimizes for the same thing: more intelligence per watt. Not just more intelligence per chip or per dollar — per watt. When your primary metric is tokens per watt, you're telling the market that energy efficiency is the competitive frontier. You're telling the market that the bottleneck is already migrating.
These are also features for operators, not researchers. They're designed for AI factories running at industrial scale, where uptime is non-negotiable and thermal margins are thin. The naming tells you something too: Vera Rubin was the astronomer whose measurements of galaxy rotation provided some of the strongest evidence for dark matter, inferring what couldn't be seen directly, and NVIDIA named its inference-dominant platform for someone who revealed hidden structure. The hidden structure of the AI economy is compute scarcity, and the hidden constraint behind that is energy. Anyone who has operated a nuclear plant knows the reactor doesn't care about your maintenance schedule; it needs cooling whether your team is ready or not. The operational environment is unforgiving. Data centers running AI inference at scale have the same character. The silicon doesn't wait. The models don't pause. Availability is the product. And availability requires power: reliable, dense, uninterruptible power.
Energy → Compute → Capital
This is the thesis Gigawatt Report keeps coming back to — the thread that connects everything in this article and everything this publication covers. Energy is the foundation — you can't run inference without watts. Compute is the transformation layer — it turns energy into intelligence. And capital is the output — the economic value that intelligence produces.
The capital flows confirm this at scale. The five largest hyperscalers are projected to spend over $600 billion on infrastructure in 2026 — a 36 percent increase over 2025. Roughly 75 percent of that, around $450 billion, targets AI infrastructure specifically. Goldman Sachs projects total hyperscaler capex from 2025 through 2027 will reach $1.15 trillion. Tech capital expenditure in 2025 reached roughly 1.9 percent of U.S. GDP — nearly matching the combined scale of the interstate highway system, the Apollo program, and nationwide broadband deployment. Amazon's capex alone exceeds that of the entire U.S. energy sector. Every hyperscaler reports being supply-constrained, not demand-constrained.
And here's where the bottleneck migration becomes visible. All of that silicon needs electricity. U.S. data center power demand is projected to rise from about 4.4 percent of national electricity consumption in 2023 to between 6.7 and 12 percent by 2028. Gartner estimates global data center electricity consumption will grow from 448 terawatt-hours in 2025 to 980 terawatt-hours by 2030 — more than doubling. In Virginia, data centers already consume 26 percent of the state's electricity. In Dublin, the figure is 79 percent. In the PJM electricity market — stretching from Illinois to North Carolina — data centers accounted for a $9.3 billion increase in the 2025–2026 capacity market, pushing residential bills up by $16 to $18 per month in some areas. Dominion Energy proposed its first base-rate increase since 1992, driven largely by data center load growth.
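The Gartner projection implies a specific compound growth rate, which is worth making explicit. This sketch uses only the two terawatt-hour figures quoted above and assumes smooth compounding over the five years; the function names are my own.

```python
import math

# Implied compound annual growth rate for the quoted Gartner projection.

START_TWH, START_YEAR = 448, 2025
END_TWH, END_YEAR = 980, 2030


def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


def doubling_time_years(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + rate)


rate = cagr(START_TWH, END_TWH, END_YEAR - START_YEAR)
print(f"implied growth: {rate:.1%} per year")
print(f"doubling time: {doubling_time_years(rate):.1f} years")
```

Going from 448 to 980 terawatt-hours in five years works out to roughly 17 percent growth per year, or a doubling about every four and a half years.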
This is the migrating bottleneck in action. NVIDIA's annual cadence will keep pushing compute density higher — Blackwell to Vera Rubin to whatever comes next. Each generation will deliver more tokens per GPU, more petaFLOPS per rack. The hardware constraint will ease. But every GPU added to the grid needs power. Every new rack needs cooling. Every new data center needs a grid interconnection, a power purchase agreement, and often years of permitting and construction before a single watt flows. The hardware cycle runs on an annual cadence. The energy infrastructure cycle runs on a decade-long cadence. That mismatch is where the bottleneck goes next.
And the mismatch isn't just about building power plants. It's about the interconnection queue — the bureaucratic and physical process of actually connecting new generation to the grid. According to Lawrence Berkeley National Laboratory, there are now roughly 2,600 gigawatts of generation and storage actively seeking grid interconnection in the United States — more than twice the entire installed capacity of the existing U.S. power plant fleet. The median time from interconnection request to commercial operation has stretched from under two years for projects built in 2000–2007 to over five years for projects reaching operation today. Only about 19 percent of projects that entered the queue between 2000 and 2018 ever reached commercial operation; nearly 80 percent were withdrawn. Google has reported potential grid connection delays of up to 12 years for some new data center sites. You can ship a Vera Rubin rack in weeks. You may not be able to plug it into the grid for years. The interconnection queue is the physical firewall that the annual hardware improvement cycle cannot penetrate.
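One way to picture the cadence mismatch is to count how many hardware generations ship while a site waits for power. The wait times below are the ones quoted above; the one-year hardware cadence is the annual cycle described earlier, and the scenario labels and function name are my own.

```python
import math

HARDWARE_CADENCE_YEARS = 1.0  # annual GPU generation cadence, as described

# Wait times quoted in the text, in years; the labels are illustrative.
WAIT_SCENARIOS = {
    "median interconnection wait": 5.0,
    "longest delay reported by Google": 12.0,
}


def generations_elapsed(wait_years: float,
                        cadence_years: float = HARDWARE_CADENCE_YEARS) -> int:
    """Whole hardware generations that ship while a site waits for grid power."""
    return math.floor(wait_years / cadence_years)


for label, years in WAIT_SCENARIOS.items():
    print(f"{label}: {years:.0f} years, about {generations_elapsed(years)} GPU generations")
```

On these assumptions, a project entering the queue today would see about five GPU generations ship before it reaches commercial operation at the median, and about twelve in the worst reported case.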
Google's $4.75 billion acquisition of Intersect Power. The Stargate project targeting 7 GW of capacity across five sites. AEP citing customer commitments for 24 GW of new demand by 2030, including 18 GW from data centers — five times the utility's current system size. These aren't software investments. They're energy investments. The smartest actors in the market are already building at the layer where they see the constraint heading.
Denominated in satoshis, as Gigawatt Report examined in "The Sat-Denominated Grid," my own infrastructure costs show the cost of compute falling dramatically in real terms over time. But the cost of energy, in real terms, hasn't followed the same curve. Electricity costs near data center clusters are rising, not falling. The compute gets cheaper per unit. The energy to run it gets scarcer per megawatt. That divergence is the signal.
The Takeaway
The story of AI in 2026 is not a software story. It's a hardware story that is becoming an energy story. The constraint doesn't disappear when you solve one layer — it migrates to the layer below.
I can see the current bottleneck from my desk, where my Mac Mini can't hold a model larger than 14 billion parameters. The hyperscalers can see it from their boardrooms, where $600 billion in annual capex still isn't enough to meet inference demand. NVIDIA can see it from their architecture labs, where every feature in Vera Rubin optimizes for tokens per watt. And the SaaS market can see it from its cratered valuations, as investors realize that the product was never really software — it was always the compute underneath, powered by the energy underneath that.
But look one layer deeper. The hardware bottleneck has a known solution: NVIDIA ships new silicon every year, and each generation multiplies what a watt of power can produce. The energy bottleneck doesn't have the same cadence. With 2,600 GW stuck in the interconnection queue and a median five-year wait to reach commercial operation, the mismatch between compute's annual improvement cycle and energy infrastructure's decade-long development cycle is the defining tension of the AI buildout.
That's where the real opportunity lives. Not in the software — software is abundant. Not even in the hardware — hardware is on a steep improvement curve with massive capital behind it. The opportunity is in the energy layer: the generation, transmission, and delivery of reliable, dense, affordable power to the facilities that convert watts into intelligence.
Energy in, compute out, intelligence delivered. The abstraction is dissolving. What remains is the value chain that was always there — and the constraint is moving to its foundation.
Part of the Gigawatt Report series on the intersection of energy, compute, and capital.
This article was written with research and editorial assistance from Claude (Anthropic). Claude assisted with research, drafting, and structuring the final piece.
Further Reading
NVIDIA Architecture & Product Sources
- NVIDIA. "Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer." NVIDIA Developer Blog, January 2026. developer.nvidia.com
- NVIDIA. "Infrastructure for Scalable AI Reasoning | NVIDIA Vera Rubin Platform." nvidia.com
- NVIDIA. "NVIDIA Vera Rubin NVL72 | Co-Designed Infrastructure for Agentic AI." nvidia.com
- NVIDIA Newsroom. "NVIDIA Kicks Off the Next Generation of AI With Rubin — Six New Chips, One Incredible AI Supercomputer." nvidianews.nvidia.com
- NVIDIA. "Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era." NVIDIA Developer Blog, August 2025. developer.nvidia.com
- NVIDIA. "NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut." NVIDIA Developer Blog, September 2025. developer.nvidia.com
- NVIDIA. "NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads." NVIDIA Developer Blog, September 2025. developer.nvidia.com
- SemiAnalysis. "Vera Rubin – Extreme Co-Design: An Evolution from Grace Blackwell Oberon." February 2026. newsletter.semianalysis.com
- Tom's Hardware. "Nvidia Launches Vera Rubin NVL72 AI Supercomputer at CES." January 2026. tomshardware.com
Inference Economics & Cost Data
- Primitiva. "All You Need to Know about Inference Cost." December 2024. primitiva.substack.com — Source for Barclays GPT-4 training/inference cost estimates and 118x inference demand growth projection.
- Mirantis. "Optimizing Inference Costs: The Complete Guide." March 2026. mirantis.com — Source for Stanford HAI 2025 AI Index findings on 280-fold inference cost decline and OpenAI daily inference spend.
- Introl. "Inference Unit Economics: The True Cost Per Million Tokens." February 2026. introl.com
- The Information Difference. "Who's Paying for Your Prompt? LLM Pricing & Sustainability." January 2026. informationdifference.com — Source for per-query compute costs, reasoning model loss economics, and subscription viability analysis.
Test-Time Scaling & Reasoning Models
- NVIDIA. "How Scaling Laws Drive Smarter, More Powerful AI." NVIDIA Blog, May 2025. blogs.nvidia.com — Source for 100x compute requirement for test-time scaling and Jevons' paradox reference.
- Introl. "Inference-Time Scaling." December 2025. introl.com
- Snorkel AI / UC Berkeley. "Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters." August 2024. arxiv.org/abs/2408.03314
Hyperscaler Capital Expenditure
- Goldman Sachs. "Why AI Companies May Invest More than $500 Billion in 2026." December 2025. goldmansachs.com
- CreditSights. "Technology: Hyperscaler Capex 2026 Estimates." November 2025. know.creditsights.com — Source for $602B aggregate capex, 75% AI allocation, and capital intensity ratios.
- IEEE ComSoc Technology Blog. "Hyperscaler capex > $600 bn in 2026." December 2025. techblog.comsoc.org — Source for tech capex as percentage of GDP comparison to historical capital projects.
- Futurum Group. "AI Capex 2026: The $690B Infrastructure Sprint." February 2026. futurumgroup.com
- Morningstar. "AI Arms Race: How Tech's Capital Surge Will Reshape the Investment Landscape in 2026." December 2025. morningstar.com — Source for Amazon capex exceeding the entire U.S. energy sector.
Energy & Power Demand
- Belfer Center, Harvard Kennedy School. "AI, Data Centers, and the U.S. Electric Grid: A Watershed Moment." February 2026. belfercenter.org — Source for Lawrence Berkeley National Laboratory electricity consumption projections, Dominion rate increase, and AEP demand commitments.
- Pew Research Center. "What We Know About Energy Use at U.S. Data Centers Amid the AI Boom." October 2025. pewresearch.org — Source for PJM capacity market cost impact and residential bill increases.
- Gartner. "Gartner Says Electricity Demand for Data Centers to Grow 16% in 2025 and Double by 2030." November 2025. gartner.com
- S&P Global. "Data Center Grid-Power Demand to Rise 22% in 2025, Nearly Triple by 2030." October 2025. spglobal.com — Source for U.S. data center demand reaching 75.8 GW in 2026 and 134.4 GW by 2030.
- Carbon Brief. "AI: Five Charts That Put Data-Centre Energy Use — and Emissions — Into Context." September 2025. carbonbrief.org — Source for Dublin 79% electricity consumption figure.
- Bloomberg. "How AI Data Centers Are Sending Your Power Bill Soaring." September 2025. bloomberg.com — Source for wholesale electricity price increases near data center clusters.
- Lawrence Berkeley National Laboratory. "Queued Up: 2025 Edition." 2025. emp.lbl.gov — Source for 2,600 GW interconnection queue, median 5-year wait times, and 19% completion rates.
- Enki AI. "Grid Interconnection Delays 2026: A Threat to US Energy." February 2026. enkiai.com — Source for Google 12-year grid connection delay reports and 80% withdrawal rates.
Jevons' Paradox & Demand Dynamics
- NPR Planet Money. "Why the AI World Is Suddenly Obsessed with Jevons Paradox." February 2025. npr.org
- Luccioni, A.S. et al. "From Efficiency Gains to Rebound Effects: The Problem of Jevons' Paradox in AI's Polarized Environmental Debate." ACM FAccT 2025. arxiv.org/abs/2501.16548
SaaS Disruption
- Digital Applied. "The SaaSpocalypse: AI Agents Disrupting Software Industry." February 2026. digitalapplied.com — Source for $2 trillion market cap evaporation.
- xpert.digital. "The Great Reckoning: How Artificial Intelligence is Dismantling the SaaS Empire." February 2026. xpert.digital — Source for Morgan Stanley SaaS basket data, Apollo credit exposure, and distressed loan figures.
- Bain & Company. "Why SaaS Stocks Have Dropped — and What It Signals for Software's Next Chapter." March 2026. bain.com
- Bain & Company. "Will Agentic AI Disrupt SaaS?" Technology Report 2025. bain.com
- Calcalist Tech. "'SaaS Is Dying as a Business Category.'" January 2026. calcalistech.com — Source for median SaaS revenue multiple decline and hardware valuation inversion.
AI Cost Structures & Pricing
- Broemmer, Darren. "The Hidden Cost of AI: Tokens, Compute, and What You're Actually Paying For With OpenClaw." Medium, March 2026. medium.com
- UpTech Studio. "The True Cost of AI: When the Subsidies Run Out." November 2025. uptechstudio.com
- Monetizely. "AI Pricing in 2025: Strategy for Costing." April 2025. getmonetizely.com — Source for Anthropic gross margin figures and Microsoft Copilot pricing rationale.
Inspiration
- X post by @av1dlive — Discussion of NVIDIA Blackwell and Vera Rubin architectures, training vs. inference optimization, and the shift toward agentic AI compute.
- NVIDIA. "NVIDIA Vera Rubin Platform." YouTube. — Deep dive into Rubin architecture targeting inference and long-context reasoning bottlenecks.