HumanEval on a MacBook — 81.7% pass@1, Wi-Fi off

The M5 Max MacBook Pro with 128 GB of unified memory is the first laptop that can hold a frontier-class coding agent entirely in RAM. No GPU rack. No cloud. No subscription.

I just ran HumanEval on it. Wi-Fi off the entire run.

- 81.7% pass@1 on the full 164-problem benchmark
- Qwen 3 Coder 30B-A3B-Instruct (8-bit MLX)
- 14 minutes wall-clock, $0/month after the model download

YouTube walkthrough (three real problems, code streaming live, tests going green): https://www.youtube.com/watch?v=muq7VdgxqRk

## Why this number matters

The Qwen team didn't publish HumanEval scores for any Qwen3-Coder variant — they consider the benchmark saturated and went straight to agentic ones (SWE-bench Verified, BFCL, Aider-Polyglot). For the 30B variant — the one that actually fits on a laptop — there were no published HumanEval/MBPP numbers. Until this run.

I also ran MBPP (sanitized): 83.3% pass@1 on a 168-problem sample. The pass rate had been stable since n=120. Running all 427 problems was impractical because a few outlier tasks trigger very long model responses (10+ minutes each).
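To make "stable since n=120" concrete, here is a minimal sketch of how a running pass rate can be tracked as results come in. It is not the script from the run; the function names `running_pass_rate` and `max_drift_since` are hypothetical, and the input is assumed to be one boolean per task in the order the tasks were evaluated.

```python
def running_pass_rate(results):
    """Return the cumulative pass rate after each task, in evaluation order."""
    rates = []
    passed = 0
    for n, ok in enumerate(results, start=1):
        passed += int(ok)
        rates.append(passed / n)
    return rates


def max_drift_since(rates, start_n):
    """Spread (max minus min) of the running pass rate from task start_n onward.

    start_n is 1-indexed, so start_n=120 looks at the rate after the 120th
    task and every task after it.
    """
    if not 1 <= start_n <= len(rates):
        raise ValueError("start_n must be between 1 and len(rates)")
    tail = rates[start_n - 1:]
    return max(tail) - min(tail)
```

With the per-task results from a run, `max_drift_since(running_pass_rate(results), 120)` gives the size of the band the pass rate stayed within after the 120th task. What counts as "stable" is a judgment call; this sketch only measures the spread.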

## Methodology

| Setting | Value |
|---|---|
| Benchmark | HumanEval, 164 Python tasks (full) |
| Metric | pass@1 (first attempt only) |
| Temperature | 0.0 (deterministic) |
| Sampling | single sample per problem, no best-of-N |
| Execution | Python subprocess, 10s timeout |
| Hardware | M5 Max MacBook Pro · 128 GB unified memory |
| Model | Qwen3-Coder-30B-A3B-Instruct-MLX-8bit |
| Network | Wi-Fi OFF the entire run |
| Wall clock | 14 minutes |
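The execution and metric settings above (Python subprocess, 10s timeout, one sample per problem) can be sketched roughly as follows. This is an illustration, not the exact harness from the run. It assumes the standard HumanEval record fields (`prompt`, `test`, `entry_point`) and assumes the model's output has already been reduced to a function body that continues the prompt; `run_candidate` and `pass_at_1` are hypothetical names.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(prompt, completion, test, entry_point, timeout_s=10.0):
    """Return True if prompt + completion passes the task's unit tests."""
    # HumanEval's `test` field defines check(candidate); calling it on the
    # entry point runs the asserts. Any failure exits with a non-zero code.
    program = f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # A hung or too-slow solution counts as a failure.
            return False
    return result.returncode == 0


def pass_at_1(results):
    """With a single sample per problem, pass@1 is the fraction that passed."""
    return sum(results) / len(results)
```

A subprocess with a timeout is not a sandbox: generated code still runs with your user's permissions, so run it on a machine where that is acceptable. If the model returns a whole function or wraps its answer in a Markdown fence, an extraction step is needed before calling `run_candidate`.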

## For context — Qwen3-Coder 480B's official agentic benchmarks

The Qwen team's published numbers for the 480B flagship, the larger sibling of the 30B model running on this MacBook:

| Benchmark | Qwen3-Coder 480B | Claude Sonnet 4 | GPT-4.1 |
|---|---|---|---|
| SWE-bench Verified (500-turn) | 69.6 | 70.4 | n/a |
| Terminal-Bench | 37.5 | 35.5 | 25.3 |
| BFCL-v3 | 68.7 | 73.3 | 62.9 |
| Aider-Polyglot | 61.8 | 56.4 | 52.4 |

Source: Qwen team's official blog.

## Why the offline part matters

If a tool needs the internet, three things are true:

1. Someone else can read what you sent.
2. Someone else can charge you for it.
3. Someone else can take it away.

If the same tool runs locally, none of those are true. That's a different category of software — and for law firms, medical practices, and accountants handling client material, it's the only legal one.

## Reproduce it yourself

- Open-source launchers: github.com/nicedreamzapp/claude-code-local
- HumanEval dataset: github.com/openai/human-eval
- Hardware: any M-series MacBook with ≥32 GB RAM (128 GB Max preferred for full 8-bit weights)
- Total monthly cost: $0 after the model download
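As a starting point for the dataset step, here is a small sketch that reads the HumanEval tasks from disk. It assumes the gzipped JSON Lines file shipped in the openai/human-eval repository (`data/HumanEval.jsonl.gz` at the time of writing), with one task per line carrying fields such as `task_id`, `prompt`, `entry_point`, and `test`. `load_tasks` is a hypothetical helper name, not part of either repository.

```python
import gzip
import json


def load_tasks(path):
    """Read a gzipped JSON Lines file and return a list of task dicts."""
    tasks = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                tasks.append(json.loads(line))
    return tasks
```

Assuming a local clone, `tasks = load_tasks("human-eval/data/HumanEval.jsonl.gz")` should return 164 records if the file matches the published benchmark. Each record's `prompt` is what gets sent to the model.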

For law firms, medical practices, and accountants who want help getting this stack running on their own hardware — that's what AirGap is. 14-day pilot, fixed scope, the data never leaves your machines.

— matt


Originally published at Marijuana Union.
