I Just Watched One Hacker Catch Up to a Trillion-Dollar Data Center
Yesterday Salvatore Sanfilippo — the guy who wrote Redis 15 years ago and ran it solo for over a decade — published a few thousand lines of C code and quietly changed what counts as possible on a personal laptop.
The project is called ds4. It’s a hand-written native inference engine, Metal kernels and all, built for one specific model: DeepSeek V4 Flash. A 284-billion-parameter Mixture-of-Experts model with a 1-million-token context window. Until last week, that lived inside the kind of GPU clusters that bill more per hour than my truck.
I’m running it on the laptop I’m typing this on.
What I actually did
Today I gave the same prompt to three different AI engines. The same prompt, on the same MacBook:
“Build an animated northern lights scene in a single HTML file — mountains, pine trees, twinkling stars, and a flowing aurora.”
Three engines:
- DeepSeek V4 Flash running locally through ds4
- Cloud Claude through my Max plan, hitting Anthropic’s data center
- Gemma 4 31B running locally through MLX
Then I watched what came out.
| Engine | Time | Output | Hosted on |
|---|---|---|---|
| DeepSeek V4 Flash (ds4 local) | 103 s | 3,259 tokens | my laptop |
| Cloud Claude (Max plan) | 192 s | ~3,500 tokens | Anthropic’s GPUs |
| Gemma 4 31B (MLX local) | 131 s | 1,992 tokens | my laptop |
Local-first DeepSeek beat cloud Claude on raw wall-clock time. Let that one sit for a second.
What the auroras actually looked like
Three completely different interpretations of the same prompt — which is the part that surprised me.
- DeepSeek’s aurora was a flowing ribbon of teal and lavender drifting over a dense pine forest. The trees got a subtle green underglow from the sky. Most cohesive of the three.
- Cloud Claude’s went the most cinematic — magenta and teal aurora bands draped across jagged mountains, with a soft luminescent dusting along the peaks. It looks like a desktop wallpaper.
- Gemma’s went minimalist — a single sweeping streak of green and violet against a starry sky, with a clean line-drawing mountain silhouette. Stylized, almost graphic.
Each one runs forever. None of them needed a network call.
What this means for the cloud AI bill
I’ve written before about how my Claude Max plan returns roughly 40x value compared to API rates. That math is still true. Anthropic is still subsidizing the difference.
But this changes the calculation. If the cloud got expensive tomorrow — or if my data needed to stay on-device — I now have a credible escape hatch. Not “good enough for testing.” Actually frontier-class. Actually long context. Actually integrated with Claude Code.
The hybrid is what I’m settling into:
- claude for the hardest reasoning tasks I do all day
- claude-ds4 for routine work, long-context document review, and anything I’d rather keep on this machine
The meter at the top of my screen no longer goes up for most of what I do.
Three things ds4 does that nobody else has bundled
- Asymmetric 2-bit quantization. Only the routed experts in the Mixture-of-Experts layers get compressed to 2-bit, which covers about 90% of the weight footprint. Every quality-critical path (shared experts, attention projections, output head) stays at full precision. The result is an 81 GB file that calls tools cleanly.
- KV cache moved to disk. Modern Apple SSDs are fast enough that “the KV cache must live in RAM” is just no longer true in 2026. ds4 writes session state to disk and reuses it across restarts. The first 25k-token Claude Code system prompt gets prefilled exactly once, ever, and replays from cache after that.
- Pure Metal native code. No PyTorch, no TensorFlow, no llama.cpp wrapper layer. The hot path is Metal compute kernels written for one specific 284B model. M3 Max gets ~27 tokens/sec; M5 Max hits ~32.
What I’m doing next
The voice agent stack I’ve been building — wake-word listening, voiceprint filtering, transcripts piped straight into Claude Code — has been waiting for exactly this. Cloud Claude was good enough to test with, but routing every utterance through someone else’s API meant every long agent run had a meter on it.
This week the meter goes away. Same agent, same tools, same cloned voice — running on a single laptop, off-cloud, with a context window five times longer than what I had on the Max plan.
I’ll write that one up next.
For now, if you’ve got 128 GB of RAM and a free hour: clone antirez/ds4, run make, run ./download_model.sh q2, and see for yourself.
May 9, 2026. The day a few thousand lines of C caught up to the data centers.
Companion repo: github.com/antirez/ds4 · Hugging Face weights: antirez/deepseek-v4-gguf · Model card: deepseek-ai/DeepSeek-V4-Flash · Companion video: https://youtu.be/7l8-s8xkpms
For local-AI consulting for compliance-sensitive firms (law, medical, finance), see AirGap AI.