Posts

The Day Local AI Caught the Cloud: ds4, DeepSeek V4 Flash, and What Just Changed for Devs

If you write code for a living and you’ve been watching the local-AI space, May 9, 2026 is the date to circle. Salvatore Sanfilippo (yes, the guy who wrote Redis) shipped ds4 — a few thousand lines of hand-written C with Metal compute kernels, built for exactly one model: DeepSeek V4 Flash.

I ran the same prompt through three engines on the same 128 GB MacBook Pro:

- DeepSeek V4 Flash via ds4 — fully local, off-cloud
- Cloud Claude through my Max plan
- Gemma 4 31B via MLX, also local

Local DeepSeek beat cloud Claude on wall-clock time. That sentence used to be science fiction.

▶ Watch the companion video — three engines, one prompt, three completely different aurora animations rendered in real time on the same machine.

## The benchmark, for people who don’t want filler

| Engine | Time | Output | Where it ran |
| --- | --- | --- | --- |
| DeepSeek V4 Flash (ds4, local) | 103 s | 3,259 tokens | Apple Silicon GPU |
| Cloud Claude (Max plan) | 192 s | ~3,500 tokens | Anthropic data center |
| Gemma 4 ... | | | |
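For scale, here is the same table as tokens per second, a quick back-of-the-envelope in Python using only the numbers above (the Claude token count is approximate):

```python
# Tokens-per-second from the benchmark table above.
runs = {
    "DeepSeek V4 Flash (ds4, local)": (3259, 103),  # (output tokens, seconds)
    "Cloud Claude (Max plan)": (3500, 192),         # ~3,500 tokens is approximate
}

for engine, (tokens, seconds) in runs.items():
    print(f"{engine}: {tokens / seconds:.1f} tok/s")

# DeepSeek V4 Flash (ds4, local): 31.6 tok/s
# Cloud Claude (Max plan): 18.2 tok/s
```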

I Just Watched One Hacker Catch Up to a Trillion-Dollar Data Center

Three engines · same prompt · one MacBook

Yesterday Salvatore Sanfilippo — the guy who wrote Redis 15 years ago and ran it solo for over a decade — published a few thousand lines of C code and quietly changed what counts as possible on a personal laptop.

The project is called ds4. It’s a hand-written native inference engine, Metal kernels and all, built for one specific model: DeepSeek V4 Flash. A 284-billion-parameter Mixture-of-Experts model with a 1-million-token context window. Until last week, that lived inside the kind of GPU clusters that bill more per hour than my truck. I’m running it on the laptop I’m typing this on.

## What I actually did

Today I gave the same prompt to three different AI engines. The same prompt, on the same MacBook:

“Build an animated northern lights scene in a single HTML file — mountains, pine trees, twinkling stars, and a flowing aurora.”

Three engines:

- DeepSeek V4 Flash running locally through ds4
- Clo...

HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription. I just ran HumanEval on it. Wi-Fi off the entire run.

- 81.7% pass@1 on the full 164-problem benchmark
- Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
- 14 minutes wall-clock, $0/month after the model download

YouTube walkthrough (three real problems, code streaming live, tests going green): https://www.youtube.com/watch?v=muq7VdgxqRk

## Why this number matters

The Qwen team didn’t publish HumanEval scores for any Qwen3-Coder variant — they consider the benchmark saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. Pass rate stable since n=120; the full 427-problem run was impractical because a fe...
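For anyone who wants to reproduce the number, a minimal sketch of what the pass@1 loop looks like, assuming the official human-eval harness and mlx-lm; the Hugging Face repo id below is an assumption, not necessarily the exact file I ran:

```python
# One completion per problem, then score with the official checker.
# Assumes: pip install mlx-lm human-eval; the repo id is illustrative.
from mlx_lm import load, generate
from human_eval.data import read_problems, write_jsonl

model, tokenizer = load("mlx-community/Qwen3-Coder-30B-A3B-Instruct-8bit")  # assumed id

problems = read_problems()  # the 164 HumanEval problems
samples = []
for task_id, problem in problems.items():
    completion = generate(model, tokenizer, prompt=problem["prompt"], max_tokens=512)
    samples.append({"task_id": task_id, "completion": completion})

write_jsonl("samples.jsonl", samples)
# Then run the official scorer:
#   evaluate_functional_correctness samples.jsonl
# With one sample per task, pass@1 is just (problems passed) / 164.
```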

Pulling 10x My Subscription Value Out of Claude — While Quietly Building the Backup Plan

The math, visualized: every blue bar is one day’s API-equivalent token consumption. The green dashed line at the bottom is what I actually paid (pro-rated). April 14 alone — $454 in one day — was more than four months of subscription.

Every Sunday night I watch the meter tick toward 100% again. That’s been the rhythm for months — five days of heavy work, one day of cleanup, one day of waiting for the weekly reset. I’m on Claude Max — usually the 5x tier at $100 a month — and I burn through nearly every token they give me.

Out of curiosity I ran the math last week. I’m not sure I should have. Over the last three weeks, the tokens I’ve put through Claude Code added up to about $2,976 worth of API usage at Anthropic’s published rates. Pro-rated, my subscription cost over that same window was about $70. One Tuesday in mid-April, I spent $454 of token-equivalent value in a single day — more than four months of subscription, in one sitting.

The math do...
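The multiplier itself is one line of arithmetic; a sketch using only the totals quoted above (the per-token rates are whatever Anthropic publishes, not reproduced here):

```python
# API-equivalent value vs. actual subscription cost, over the three-week window above.
api_equivalent_usd = 2976        # Claude Code tokens priced at published API rates
prorated_subscription_usd = 70   # Max plan cost pro-rated over the same window

print(f"{api_equivalent_usd / prorated_subscription_usd:.1f}x")  # ~42.5x this window

# The single biggest day, measured against the $100/month tier:
print(f"{454 / 100:.2f} months of subscription in one sitting")  # 4.54 months
```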

Free AI on a MacBook vs $100-a-Month Claude Code — Hexagon Shootout

▶ Watch the race on YouTube: https://www.youtube.com/watch?v=2KeTDDodE0A

April 22, 2026. Anthropic’s Claude Code Max plan jumped to $100 a month. I ran a live three-way AI race on the exact same prompt — Gemma 31B local, Llama 70B local, and Claude cloud — on a single MacBook, to see how close a free local stack gets to the paid cloud. Two of three contestants finished with zero cloud calls.

If you just want the video, it’s here: FREE AI on a MacBook vs Claude Cloud — Hexagon Shootout. If you want the repo, it’s here: github.com/nicedreamzapp/claude-code-local. Keep reading for the setup, the numbers, and the three things that surprised me.

## The setup — same prompt, three contestants

Hardware: M5 Max MacBook Pro, 128 GB unified memory, Apple Silicon.

- Gemma 31B — local, Apple MLX, 4-bit quantized (Google’s code-specialized model)
- Llama 70B — local, Apple MLX, 8-bit quantized (Meta’s generalist)
- Claude cloud — the real Anthropic API, using Claude C...
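Timing a local contestant is nothing fancier than a wall clock around one mlx-lm generate call; a sketch, with an illustrative repo id (not necessarily the exact quant I loaded) and a paraphrase of the prompt:

```python
# Time a single local generation with mlx-lm on Apple Silicon.
# Assumes: pip install mlx-lm; the model repo id below is illustrative.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-27b-it-4bit")  # assumed repo id

prompt = "Write a single-file HTML page with balls bouncing inside a spinning hexagon."
start = time.perf_counter()
output = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
elapsed = time.perf_counter() - start

print(f"{elapsed:.1f} s, {len(tokenizer.encode(output))} output tokens")
```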

The Era of Hunched-Over-A-Screen Computing Is Ending — Here’s What’s Replacing It

Look around any coffee shop, any office, any living room. Everyone is bent forward at the same angle, staring into a glowing rectangle, with one hand on a small slab and the other on a bigger slab. The whole posture is wrong. We know it’s wrong — that’s why ergonomic chairs are a $2 billion industry — but we keep doing it because the computers we built require it.

I think we’re at the end of that era. Not because somebody invented a magic new screen. Because computing itself is finally able to leave the rectangle.

I call what’s coming ambient computing. The phrase isn’t new, but most uses of it are about smart speakers or watches — small devices that ask you to look at them too. That’s not what I mean. I mean a way of working with computers that doesn’t require you to face a screen at all. Where the machine listens, talks back, sees what you see, and the keyboard becomes optional rather than mandatory.

The pieces of it are already shipping. ...

What It’s Actually Like to Code By Voice — With the AI Replying In My Own Cloned Voice

The closest analogy I can give for what this feels like is having a quiet co-worker in the room who happens to sound exactly like you. You think out loud. They respond out loud. You both work on the same code. Neither of you is touching a keyboard. It’s still a little uncanny. But it’s also the most natural way to work I’ve found in twenty-plus years of writing software.

The setup runs entirely on my MacBook. Apple’s on-device speech recognition listens for me. A local language model thinks. A cloned-voice text-to-speech says the response back. Nothing leaves the laptop. Nothing requires a network. The whole loop is on-device, and that turns out to matter for reasons I didn’t expect.

## How it actually works

A compiled Swift binary wraps Apple’s SFSpeechRecognizer — the same engine that powers macOS dictation — in a continuous-listening daemon. It transcribes everything I say into the active terminal window where Claude Code is running. End-of-utter...
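The real daemon is Swift, but the loop shape is simple; here is a Python sketch where listen_once, ask_model, and speak are hypothetical stand-ins for the three on-device pieces, not real APIs:

```python
# Shape of the voice loop, not the author's Swift daemon: the three helpers
# below are hypothetical stand-ins for on-device STT (SFSpeechRecognizer in
# the real setup), the local language model, and cloned-voice TTS.
import time

def listen_once() -> str:
    """Hypothetical stand-in for on-device speech recognition."""
    return input("you (typed stand-in for speech): ")

def ask_model(text: str) -> str:
    """Hypothetical stand-in for the local language model."""
    return f"echo: {text}"

def speak(text: str) -> None:
    """Hypothetical stand-in for cloned-voice text-to-speech."""
    print(f"[spoken] {text}")

while True:
    utterance = listen_once()     # continuous listening, one utterance at a time
    if not utterance:
        time.sleep(0.1)           # idle briefly when nothing was said
        continue
    reply = ask_model(utterance)  # the local model "thinks"
    speak(reply)                  # the response comes back as audio
```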