Gigawatt Report

When I pay my monthly subscription to Claude, I'm not paying for software. I'm renting hardware.

That realization didn't come from reading an analyst report. It came from building things. Over the past several months, I've set up OpenClaw to offload workloads from cloud-based AI services, launched a self-hosted Ghost newsletter on AWS, and built a local inference pipeline running open-weight LLMs on a Mac Mini. Three very different projects, all using mature, well-documented, freely available software. And in every single case, the limitation I hit was the same: not enough compute.

But here's the thing I keep coming back to: compute isn't the permanent bottleneck. It's just where the bottleneck lives right now. NVIDIA, the hyperscalers, and the global semiconductor supply chain are throwing hundreds of billions of dollars at the hardware constraint. They will solve it — not overnight, but relentlessly and on an annual cadence. And when they do, the bottleneck won't vanish. It will migrate upstream to the layer that feeds compute: energy.

This is a story about that migration. It connects my Mac Mini to a $600 billion infrastructure buildout, explains why the SaaS business model is breaking apart, and points toward the real constraint that no one has fully reckoned with yet. The value chain has always been the same: energy in, compute out, intelligence delivered. The only question is which layer is the binding constraint at any given moment. Right now, it's hardware. Soon, it will be watts.

My Stack, My Wall

My local inference server is a Mac Mini running Ollama. It handles Qwen 14B well enough for volume drafting work that I then hand off to Claude for finishing and refinement. But 14 billion parameters is the ceiling. A 70B model needs roughly 140 GB for its weights alone in FP16 (70 billion parameters at 2 bytes each), far beyond what my hardware supports. Most consumer GPUs top out at 24 to 32 GB of VRAM; anything larger means multi-GPU setups or aggressive quantization that trades accuracy for fit. The model isn't the constraint. The silicon and memory are.
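Here is a back-of-the-envelope sketch of that memory math. It assumes weight memory is simply parameter count times bytes per parameter, and it ignores the KV cache, activations, and runtime overhead, so real usage will be higher. The function name and the precision table are my own illustration, not part of Ollama or any library.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: ignores KV cache, activations, and runtime overhead.

BYTES_PER_PARAM = {
    "fp16": 2.0,  # 16-bit floating point
    "int8": 1.0,  # 8-bit quantization
    "int4": 0.5,  # 4-bit quantization
}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights, in gigabytes (10^9 bytes)."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to this.
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (14, 70):
        for precision in ("fp16", "int8", "int4"):
            gb = weight_memory_gb(size, precision)
            print(f"{size}B @ {precision}: ~{gb:.0f} GB of weights")
```

Under these assumptions, quantization is what makes a 14B model fit on a small machine at all, and it is also why a 70B model stays out of reach without much more memory.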

My K3s cluster, with a ThinkPad as the control-plane node and an HP worker node, runs OpenClaw and handles workloads I've pulled off cloud-based AI services to run locally. The cluster itself isn't the problem; I don't come anywhere near maxing out its compute or memory. The bottleneck is the model running on my Mac Mini. That's where the memory fills up. That's where requests queue. The compute cluster has headroom; the inference layer doesn't. I'm seeing at my own small scale exactly what the industry is seeing at hyperscale: inference is where the constraint lives, not in the general-purpose compute surrounding it.

Every architectural decision I make is shaped not by what the software can do, but by what the inference hardware will tolerate.

My Ghost newsletter runs on an AWS Lightsail instance behind CloudFront and WAF. Ghost is an excellent platform — the editor, the theme system, the API are all solid. But every operational hiccup I've dealt with traces back to the compute envelope I'm working within. Memory pressure during peak loads. Build times constrained by instance size. The infrastructure works, but it works within the box that my hardware budget defines.

Three projects. Three different software stacks. One bottleneck. Software has become abundant. Compute has not — yet.

What I'm Actually Paying For

This pattern forced me to look more carefully at what my AI subscriptions and API bills actually represent. When I send a prompt to Claude, GPUs in a data center are performing matrix multiplications, consuming electricity, and generating heat that has to be removed. Every token has a real cost: not the near-zero marginal cost of serving a traditional software user, but a measurable draw on physical hardware. Estimates suggest a typical AI query costs the provider between one and three cents in compute for a standard model, and ten to fifty cents for a complex reasoning query. At the upper end of that range, two complex reasoning questions per day add up to about $30 a month, more than the entire price of a $20 subscription.
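The subscription arithmetic is easy to check. This is a minimal sketch that assumes a 30-day month and uses the per-query cost estimates quoted above. The function names are mine, and the numbers are estimates rather than anything a provider has published.

```python
def monthly_compute_cost(queries_per_day: float, cost_per_query: float,
                         days: int = 30) -> float:
    """Provider-side compute cost for one user over a month, in dollars."""
    return queries_per_day * cost_per_query * days


def breakeven_queries_per_day(subscription: float, cost_per_query: float,
                              days: int = 30) -> float:
    """Daily query count at which compute cost equals the subscription price."""
    return subscription / (cost_per_query * days)


if __name__ == "__main__":
    # Estimated range for a complex reasoning query: $0.10 to $0.50.
    for cost in (0.10, 0.50):
        spend = monthly_compute_cost(2, cost)
        limit = breakeven_queries_per_day(20.0, cost)
        print(f"${cost:.2f}/query: 2 per day costs ${spend:.2f}/month; "
              f"a $20 plan breaks even at {limit:.1f} queries per day")
```

At the low end of the range, two queries a day leave plenty of margin. At the high end, they cost more than the plan brings in.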

When I make API calls through OpenClaw to run workloads locally instead of on cloud services, the relationship is even more transparent. I'm paying in hardware capacity — memory, compute cycles, electricity — for every token processed. There's no abstraction layer pretending I'm buying software. I'm buying compute, whether I'm paying a cloud provider per token or burning my own silicon to avoid that bill.

This is a fundamental break from the economics that powered the technology industry for the last twenty years. Traditional SaaS companies had gross margins of 70 to 80 percent because the marginal cost of serving one more user was essentially zero. The code was written once and served indefinitely. AI products don't work this way. Every interaction consumes real resources. SaaS companies typically spent about 5 percent of revenue on server costs. AI-native applications routinely spend more than half. Anthropic's gross margins reportedly sit around 50 to 55 percent — closer to an industrial operation than a software company. OpenAI reportedly spends over $700,000 per day just on ChatGPT inference.

When I pay for an AI subscription, I'm not licensing software. I'm renting inference hardware with a chat interface on top. And the cost of that hardware is dominated by two things: silicon and the electricity to run it.

The SaaS Model Breaks

The market has started to price this in. The term "SaaSpocalypse" entered the financial lexicon in early 2026 after roughly $2 trillion in market capitalization evaporated from the software sector in a matter of weeks. The SaaS index underperformed the S&P 500 by over 20 percentage points through 2025. Median revenue multiples for public SaaS companies dropped from a pandemic peak of 18–19x to around 5x by the end of 2025. Salesforce lost 26 percent of its market cap. Atlassian fell 30 percent. Adobe dropped 19 percent. A Morgan Stanley basket of SaaS stocks declined 15 percent in the first two weeks of January 2026 alone. Apollo Global Management cut its private credit exposure to software from 20 percent to 10 percent, and $17.7 billion in technology-related corporate loans fell to distressed levels within four weeks.

The disruption isn't just financial. It's structural. AI agents are replacing the human operators who needed per-seat licenses. Bain & Company reports that vendors are already seeing slower growth in seat counts as AI makes customers more efficient — or eliminates the need for the seat entirely. Gartner predicts that 35 percent of point-product SaaS tools will be replaced by AI agents by 2030. When an AI agent can create a project ticket from a Slack conversation, assign it based on workload, and follow up autonomously, the project management software becomes overhead rather than infrastructure.

As software becomes abundant and intelligence becomes a metered utility, the value of operating a tool collapses. What matters instead is knowing how to orchestrate inference: understanding which workload requires the expensive cloud reasoning model and which can be handled by the local 14B drafting model sitting on your desk. When a single complex reasoning query can cost fifty cents, you can't afford to throw frontier compute at every task. This is the skill I'm developing in real time with my own setup: routing volume work to Qwen 14B on the Mac Mini, reserving Claude for the tasks where its reasoning depth justifies the cost. The premium isn't in clicking buttons inside software anymore. It's in architectural intent, deciding where in the stack each token gets spent.
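To make the routing idea concrete, here is a toy sketch of the decision. It is not my actual OpenClaw configuration: the model labels, the `Task` fields, and the 8,000-token local limit are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# Placeholder labels, not real model tags or API model names.
LOCAL_MODEL = "local-14b-drafter"
CLOUD_MODEL = "cloud-reasoning-model"


@dataclass
class Task:
    name: str
    needs_deep_reasoning: bool  # does quality depend on multi-step reasoning?
    est_tokens: int             # rough size of prompt plus expected output


def route(task: Task, local_token_limit: int = 8000) -> str:
    """Send work to the cloud only when depth or size requires it."""
    if task.needs_deep_reasoning:
        return CLOUD_MODEL
    if task.est_tokens > local_token_limit:
        # Assumed limit: too large for the local box to handle comfortably.
        return CLOUD_MODEL
    return LOCAL_MODEL


if __name__ == "__main__":
    tasks = [
        Task("first-draft summary", needs_deep_reasoning=False, est_tokens=1500),
        Task("final structural edit", needs_deep_reasoning=True, est_tokens=4000),
        Task("bulk transcript cleanup", needs_deep_reasoning=False, est_tokens=20000),
    ]
    for t in tasks:
        print(f"{t.name} -> {route(t)}")
```

The design choice is that the expensive path is the exception. Everything defaults to local unless a task gives a reason to spend cloud tokens.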

Meanwhile, as Calcalist Tech observed, the once-neglected world of hardware — long considered dull by investors enamored with SaaS for its fast growth, low capital requirements, and high multiples — is heating up and in some cases trading at multiples far higher than software companies. The value is migrating from code to silicon. And silicon requires power.

Inference Eats Everything

I described the pattern in my own system — the K3s cluster has headroom while the inference layer on the Mac Mini is the chokepoint. The industry data confirms this isn't a quirk of my setup. It's the shape of the constraint everywhere.

Training large language models grabbed most of the attention from 2022 to 2024 — the race to build ever-larger foundation models at ever-higher cost. But once a model is trained, it has to be served. Every query, every agent action, every reasoning chain burns inference compute. For most companies deploying AI, inference now accounts for 80 to 90 percent of total lifetime compute cost. Barclays estimated that GPT-4's cumulative inference bill reached $2.3 billion by the end of 2024 — fifteen times its roughly $150 million training cost. Demand for inference compute is projected to grow 118-fold by 2026, reaching three times the total demand for training.

Test-time scaling — the technique where reasoning models "think longer" before answering — is accelerating this further. Models like DeepSeek-R1 and OpenAI's o1 generate 10 to 100 times more tokens per query than a standard model doing single-pass inference. NVIDIA's own analysis notes that challenging reasoning queries can require over 100 times the compute of a single inference pass. Every chain-of-thought step, every self-verification loop, every reasoning branch burns tokens. And tokens burn hardware. And hardware burns watts.

The Stanford 2025 AI Index found that inference cost per token at GPT-3.5 level dropped 280-fold between late 2022 and late 2024. Hardware costs have declined about 30 percent per year, while energy efficiency has improved about 40 percent per year. But falling unit costs haven't reduced total spend, because demand is exploding faster than costs are falling.

William Stanley Jevons described this dynamic in 1865 when he observed that more efficient steam engines didn't reduce coal consumption but increased it, because cheaper energy made more applications viable. The same paradox is playing out in AI. Every efficiency gain in inference hardware unlocks new workloads, longer context windows, more sophisticated reasoning chains, and broader deployment. Satya Nadella invoked Jevons explicitly after DeepSeek demonstrated lower-cost training: the cheaper it gets, the more we use. NVIDIA's own technical blog references the paradox when explaining why more efficient LLM inference will consume more computing resources, not fewer.
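The Jevons dynamic reduces to one line of arithmetic: total spend is unit cost times volume. The sketch below assumes that simple model and nothing more. The 280-fold and 30 percent figures come from the numbers cited above, while the demand growth factors in the example are hypothetical inputs, not measurements.

```python
def spend_multiplier(unit_cost_drop: float, demand_growth: float) -> float:
    """Change in total spend when unit cost falls by `unit_cost_drop`x and
    volume grows by `demand_growth`x. A result above 1.0 means spend rose."""
    return demand_growth / unit_cost_drop


def breakeven_annual_growth(annual_cost_decline: float) -> float:
    """Annual demand growth that keeps total spend flat, given a fractional
    annual decline in unit cost (0.30 means 30 percent per year)."""
    return 1.0 / (1.0 - annual_cost_decline) - 1.0


if __name__ == "__main__":
    # Hypothetical demand growth factors against a 280-fold unit cost drop.
    for growth in (100, 280, 1000):
        m = spend_multiplier(280, growth)
        print(f"demand x{growth}: total spend x{m:.2f}")

    # With unit costs falling 30 percent a year, demand has to grow faster
    # than this rate for total spend to rise.
    print(f"break-even demand growth: {breakeven_annual_growth(0.30):.1%} per year")
```

Anything above the break-even line means efficiency gains show up as a bigger bill, not a smaller one.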

I see this in miniature on my own bench. When I optimized my Ollama setup to run Qwen 14B more efficiently, I didn't use less compute. I started using it for more tasks, running more drafts, testing more prompts. The efficiency freed me to demand more from the hardware — and I quickly hit the ceiling again. Solve the hardware bottleneck and you don't eliminate the constraint. You shift it to the next layer down.

The Architecture of Scarcity

NVIDIA's product roadmap reads as a direct assault on the hardware bottleneck — and everything about it tells you that the next constraint is power.

Blackwell was the first architecture to treat the rack — not the server — as the fundamental unit of compute. Traditionally, a data center server is a self-contained machine with its own processors and memory; when a workload outgrows one server, it spills across multiple machines connected by a network, and that network becomes the bottleneck. Blackwell's NVL72 system eliminates that boundary by connecting 72 GPUs across an entire rack through NVLink, NVIDIA's high-speed interconnect, so they share memory and communicate as if they were all on the same motherboard. The rack becomes one machine. Its second-generation Transformer Engine introduced NVFP4, a 4-bit floating-point format that nearly doubles memory efficiency versus FP8 while maintaining close to FP8-level accuracy. Blackwell Ultra pushed NVFP4 performance to 15 petaFLOPS per GPU, a 7.5x increase over the Hopper H100.

Vera Rubin, arriving in the second half of 2026, pushes harder. Each Rubin GPU delivers 50 petaFLOPS of NVFP4 inference compute — five times Blackwell's base. The rack-level system provides 3.6 exaFLOPS of inference performance, backed by eight stacks of HBM4 memory per GPU delivering 288 GB of capacity at 22 terabytes per second of bandwidth. NVIDIA projects that Rubin will train mixture-of-experts models with one-quarter the GPUs that Blackwell requires and cut the cost per million inference tokens by a factor of ten.
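The figures above can be cross-checked against each other. This sketch only divides the numbers quoted in this article. The baselines it backs out are implied by those quotes, not taken from an NVIDIA spec sheet, and how NVIDIA counts GPUs per rack is an assumption on my part.

```python
def implied_baseline(newer_value: float, stated_multiple: float) -> float:
    """Back out the older figure from a newer figure and a stated multiple."""
    return newer_value / stated_multiple


def implied_gpus_per_rack(rack_exaflops: float, gpu_petaflops: float) -> float:
    """Rack throughput divided by per-GPU throughput (1 exaFLOPS = 1000 petaFLOPS)."""
    return rack_exaflops * 1000.0 / gpu_petaflops


if __name__ == "__main__":
    # Rubin: 50 PFLOPS per GPU, stated as five times Blackwell's base.
    print("implied Blackwell base, PFLOPS:", implied_baseline(50.0, 5.0))
    # Blackwell Ultra: 15 PFLOPS per GPU, stated as 7.5x the H100.
    print("implied H100 figure, PFLOPS:", implied_baseline(15.0, 7.5))
    # Rack: 3.6 exaFLOPS at 50 PFLOPS per GPU.
    print("implied GPUs per rack:", round(implied_gpus_per_rack(3.6, 50.0)))
    # Memory: 288 GB spread across eight HBM4 stacks.
    print("implied GB per HBM4 stack:", 288.0 / 8)
```

The quoted numbers hang together: the rack figure implies the same 72-GPU rack scale as the NVL72 described earlier.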

But look at the architectural choices that tell you what comes next. Rubin introduces the Vera CPU, purpose-built for the data movement patterns that inference demands. It formalizes disaggregated inference, where the context phase and the generation phase run on separately optimized hardware. The Rubin CPX accelerator targets long-context workloads with three times the attention acceleration of the prior generation. The Inference Context Memory Storage Platform, built on BlueField-4 DPUs, creates a shared key-value cache tier so context can be reused rather than recomputed. The cable-free modular tray design enables 18x faster assembly and servicing versus Blackwell.
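The idea behind a shared key-value cache tier, reusing context instead of recomputing it, can be shown with a toy. This is not NVIDIA's design or API. It is a deliberately simplified sketch in which a dictionary stands in for the cache tier and a hash stands in for the expensive context phase. Real systems cache per-token attention keys and values and match on token prefixes.

```python
import hashlib


class ContextCache:
    """Toy shared cache: maps a context string to its (pretend) KV state."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.prefill_calls = 0  # counts how often the expensive phase runs

    def _expensive_prefill(self, context: str) -> str:
        # Stand-in for the context phase, which builds the KV cache.
        self.prefill_calls += 1
        return hashlib.sha256(context.encode("utf-8")).hexdigest()

    def get_state(self, context: str) -> str:
        # Reuse the stored state when this context has been seen before.
        if context not in self._store:
            self._store[context] = self._expensive_prefill(context)
        return self._store[context]


def generate(cache: ContextCache, context: str, question: str) -> str:
    """Context phase (possibly cached), then a stand-in generation phase."""
    state = cache.get_state(context)
    return f"answer to {question!r} using state {state[:8]}"


if __name__ == "__main__":
    cache = ContextCache()
    long_doc = "a long shared document " * 1000
    generate(cache, long_doc, "What is the thesis?")
    generate(cache, long_doc, "What are the risks?")
    print("prefill runs for two questions on one document:", cache.prefill_calls)
```

The second question skips the context phase entirely, which is the whole point: compute you don't repeat is power you don't draw.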

Every one of these features optimizes for the same thing: more intelligence per watt. Not just more intelligence per chip or per dollar — per watt. When your primary metric is tokens per watt, you're telling the market that energy efficiency is the competitive frontier. You're telling the market that the bottleneck is already migrating.

These are also features for operators, not researchers. They're designed for AI factories running at industrial scale, where uptime is non-negotiable and thermal margins are thin. The naming tells you something too: Vera Rubin was the astronomer whose galaxy rotation measurements provided some of the most convincing evidence for dark matter, inferring what couldn't be seen directly from what could, and NVIDIA named its inference-dominant platform for someone who revealed hidden structure. The hidden structure of the AI economy is compute scarcity, and the hidden constraint behind that is energy. Anyone who has operated a nuclear plant knows the reactor doesn't care about your maintenance schedule; it needs cooling whether your team is ready or not. The operational environment is unforgiving. Data centers running AI inference at scale have the same character. The silicon doesn't wait. The models don't pause. Availability is the product. And availability requires power: reliable, dense, uninterruptible power.

Energy → Compute → Capital

This is the thesis Gigawatt Report keeps coming back to — the thread that connects everything in this article and everything this publication covers. Energy is the foundation — you can't run inference without watts. Compute is the transformation layer — it turns energy into intelligence. And capital is the output — the economic value that intelligence produces.

The capital flows confirm this at scale. The five largest hyperscalers are projected to spend over $600 billion on infrastructure in 2026 — a 36 percent increase over 2025. Roughly 75 percent of that, around $450 billion, targets AI infrastructure specifically. Goldman Sachs projects total hyperscaler capex from 2025 through 2027 will reach $1.15 trillion. Tech capital expenditure in 2025 reached roughly 1.9 percent of U.S. GDP — nearly matching the combined scale of the interstate highway system, the Apollo program, and nationwide broadband deployment. Amazon's capex alone exceeds that of the entire U.S. energy sector. Every hyperscaler reports being supply-constrained, not demand-constrained.

And here's where the bottleneck migration becomes visible. All of that silicon needs electricity. U.S. data center power demand is projected to rise from about 4.4 percent of national electricity consumption in 2023 to between 6.7 and 12 percent by 2028. Gartner estimates global data center electricity consumption will grow from 448 terawatt-hours in 2025 to 980 terawatt-hours by 2030 — more than doubling. In Virginia, data centers already consume 26 percent of the state's electricity. In Dublin, the figure is 79 percent. In the PJM electricity market — stretching from Illinois to North Carolina — data centers accounted for a $9.3 billion increase in the 2025–2026 capacity market, pushing residential bills up by $16 to $18 per month in some areas. Dominion Energy proposed its first base-rate increase since 1992, driven largely by data center load growth.
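It helps to turn the Gartner endpoints into an annual rate. This sketch assumes smooth compound growth between the 2025 and 2030 figures, which is my simplification and not something the estimate itself claims.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoints."""
    return (end / start) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    start_twh, end_twh = 448.0, 980.0  # global data center electricity, 2025 and 2030
    ratio = end_twh / start_twh
    rate = implied_annual_growth(start_twh, end_twh, years=5)
    print(f"growth over the period: {ratio:.2f}x")
    print(f"implied compound annual growth: {rate:.1%}")
```

Under that assumption the estimate works out to roughly 17 percent growth per year, sustained for five years, in a sector where new supply takes far longer than a year to build.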

This is the migrating bottleneck in action. NVIDIA's annual cadence will keep pushing compute density higher — Blackwell to Vera Rubin to whatever comes next. Each generation will deliver more tokens per GPU, more petaFLOPS per rack. The hardware constraint will ease. But every GPU added to the grid needs power. Every new rack needs cooling. Every new data center needs a grid interconnection, a power purchase agreement, and often years of permitting and construction before a single watt flows. The hardware cycle runs on an annual cadence. The energy infrastructure cycle runs on a decade-long cadence. That mismatch is where the bottleneck goes next.

And the mismatch isn't just about building power plants. It's about the interconnection queue — the bureaucratic and physical process of actually connecting new generation to the grid. According to Lawrence Berkeley National Laboratory, there are now roughly 2,600 gigawatts of generation and storage actively seeking grid interconnection in the United States — more than twice the entire installed capacity of the existing U.S. power plant fleet. The median time from interconnection request to commercial operation has stretched from under two years for projects built in 2000–2007 to over five years for projects reaching operation today. Only about 19 percent of projects that entered the queue between 2000 and 2018 ever reached commercial operation; nearly 80 percent were withdrawn. Google has reported potential grid connection delays of up to 12 years for some new data center sites. You can ship a Vera Rubin rack in weeks. You may not be able to plug it into the grid for years. The interconnection queue is the physical firewall that the annual hardware improvement cycle cannot penetrate.
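The cadence mismatch can be put in numbers. This is a rough sketch with two loud assumptions: it applies the historical 19 percent completion rate, which is a share of projects rather than of capacity, directly to today's queued gigawatts, and it treats the hardware cycle as exactly one generation per year. Neither is a forecast.

```python
QUEUE_GW = 2600.0                  # capacity seeking interconnection (from the text)
HISTORICAL_COMPLETION_RATE = 0.19  # share of 2000-2018 projects that reached operation


def expected_completed_gw(queue_gw: float, completion_rate: float) -> float:
    """Capacity that reaches operation if history repeats (assumption)."""
    return queue_gw * completion_rate


def hardware_generations_during_wait(wait_years: int, cadence_years: int = 1) -> int:
    """How many hardware generations ship while one project waits on the grid."""
    return wait_years // cadence_years


if __name__ == "__main__":
    gw = expected_completed_gw(QUEUE_GW, HISTORICAL_COMPLETION_RATE)
    print(f"queue capacity that gets built at the historical rate: ~{gw:.0f} GW")
    for wait in (5, 12):  # median wait, and the 12-year delay Google reported
        gens = hardware_generations_during_wait(wait)
        print(f"{wait}-year wait: {gens} hardware generations ship in the meantime")
```

Even the median case means five generations of silicon arrive before the power to run them does.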

Google's $4.75 billion acquisition of Intersect Power. The Stargate project targeting 7 GW of capacity across five sites. AEP citing customer commitments for 24 GW of new demand by 2030, including 18 GW from data centers — five times the utility's current system size. These aren't software investments. They're energy investments. The smartest actors in the market are already building at the layer where they see the constraint heading.

When I denominate my own infrastructure costs in satoshis — as Gigawatt Report examined in "The Sat-Denominated Grid" — the cost of compute has been falling dramatically in real terms over time. But the cost of energy, in real terms, hasn't followed the same curve. Electricity costs near data center clusters are rising, not falling. The compute gets cheaper per unit. The energy to run it gets scarcer per megawatt. That divergence is the signal.

The Takeaway

The story of AI in 2026 is not a software story. It's a hardware story that is becoming an energy story. The constraint doesn't disappear when you solve one layer — it migrates to the layer below.

I can see the current bottleneck from my desk, where my Mac Mini can't hold a model larger than 14 billion parameters. The hyperscalers can see it from their boardrooms, where $600 billion in annual capex still isn't enough to meet inference demand. NVIDIA can see it from its architecture labs, where every feature in Vera Rubin optimizes for tokens per watt. And the SaaS market can see it from its cratered valuations, as investors realize that the product was never really software. It was always the compute underneath, powered by the energy underneath that.

But look one layer deeper. The hardware bottleneck has a known solution: NVIDIA ships new silicon every year, and each generation multiplies what a watt of power can produce. The energy bottleneck doesn't have the same cadence. With 2,600 GW stuck in the interconnection queue and a median five-year wait to reach commercial operation, the mismatch between compute's annual improvement cycle and energy infrastructure's decade-long development cycle is the defining tension of the AI buildout.

That's where the real opportunity lives. Not in the software — software is abundant. Not even in the hardware — hardware is on a steep improvement curve with massive capital behind it. The opportunity is in the energy layer: the generation, transmission, and delivery of reliable, dense, affordable power to the facilities that convert watts into intelligence.

Energy in, compute out, intelligence delivered. The abstraction is dissolving. What remains is the value chain that was always there — and the constraint is moving to its foundation.

Part of the Gigawatt Report series on the intersection of energy, compute, and capital.

This article was written with research and editorial assistance from Claude (Anthropic). Claude assisted with research, drafting, and structuring the final piece.

The Migrating Bottleneck