DS4 Metal to CUDA
Trying to port a Metal implementation to CUDA using Hermes.
I wanted to see what an unsupervised coding agent would do with a real GPU port. A few days earlier, antirez had published ds4, a Metal-only inference engine for DeepSeek V4 Flash that runs beautifully on Apple Silicon but didn't yet have CUDA support. I had two DGX Sparks next to me, and I had been experimenting with Hermes after switching from OpenClaw, so I gave Hermes detailed instructions, drafted with Claude Opus: read the Metal kernel, write the CUDA equivalent, build, test, repeat. No human review between iterations. I didn't care much about the port itself. It would have been cool, sure, but the real question was whether an agent could complete a port of this shape without me in the loop.
The hardware was one DGX Spark GB10. The orchestration ran Hermes with kimi-k2.6 doing the kernel implementation work, pointed at antirez/ds4 main as the reference repo. The execution layer was a small Python loop called autonomous_kernel_impl.py that picked the next unimplemented kernel from impl_state.json, asked the model to write CUDA for it, ran the build, ran a smoke test, and either advanced the cursor or retried. The cron fired every few minutes and the agent ran for roughly a week without anyone looking at the output.
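A minimal reconstruction of one tick of that loop. The file name impl_state.json and the advance/retry behavior come from the run; everything else, the status values, the state shape, the injected callables, is an assumption:

```python
# Hypothetical sketch of one cron tick of autonomous_kernel_impl.py.
# The build, smoke test, and model call are injected as callables so the
# control flow is clear; the real script shelled out to the toolchain.
def run_iteration(state, implement, build, smoke_test):
    """state mirrors impl_state.json; returns the kernel worked on, or 'done'."""
    pending = [k for k, s in state["kernels"].items() if s != "implemented"]
    if not pending:
        return "done"
    kernel = pending[0]                # the cursor: next unimplemented kernel
    implement(kernel)                  # ask the model to write CUDA for it
    if build() == 0 and smoke_test(kernel) == 0:
        state["kernels"][kernel] = "implemented"   # advance the cursor
    else:
        state["kernels"][kernel] = "retry"         # picked up again next tick
    return kernel
```

Note that everything gating promotion to "implemented" is a process exit code, which matters later.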
At first the progress tracker looked great. After a day, impl_state.json claimed 25 kernels implemented, prefill_working had flipped to true, and the file was reporting 4.75 tokens per second prefill and 5.71 generation. I had been letting it run in the background and only checked in occasionally. The moment I started paying attention was when the cron job finally reported a failure of its own: the test it ran on the binary returned a single line, ds4: prompt processing failed: Metal prefill failed, while the same status file was still claiming prefill_working: true.
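For concreteness, a hedged sketch of what the state file may have looked like at that point; the prefill_working flag and the numbers are the ones quoted above, but the field names around them are assumptions:

```json
{
  "kernels_implemented": 25,
  "prefill_working": true,
  "prefill_tok_s": 4.75,
  "gen_tok_s": 5.71
}
```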
I sent the failure output to Claude along with the agent's progress claims and asked it to reconcile the two. That reconciliation told the rest of the story. Of the 25 kernels the tracker marked as implemented, 19 were real CUDA code and 6 were still STUB macros returning -1, falsely promoted to implemented in the JSON. Worse, the seven kernels that actually gate prefill on the model's critical path (the compressor pool, the mixed-attention variants, the indexer scoring and top-k selection) had never been attempted at all. The agent had spent most of its time on kernels that compiled but didn't matter, while the kernels that mattered sat as stubs returning failure into the call chain.
The deeper problem was worse than effort allocation. The matmul kernels the agent did write were targeting Q8_0 weights, the format the agent had apparently inferred from the Metal source by pattern matching. But DS4's whole architectural trick is asymmetric quantization: the routed MoE experts use IQ2_XXS for up/gate and Q2_K for down. There are no Q8_0 weights in those positions. Even if every kernel had been warp-cooperative with perfect dp4a, the engine couldn't have loaded the model, because the matmul code was reading tensors that don't exist in the file. The agent had spent most of the time optimizing against a fictional version of the weights.
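The missing check is cheap to state. A sketch, assuming llama.cpp-style expert tensor names (ffn_up_exps, ffn_gate_exps, ffn_down_exps); the quantization formats are the ones named above:

```python
# Sketch of the audit that never ran: compare the quant format each matmul
# kernel assumes against the formats actually present in the GGUF.
# Tensor name suffixes follow llama.cpp conventions and are assumptions here.
EXPECTED = {            # asymmetric quantization in the routed experts
    "ffn_up_exps":   "IQ2_XXS",
    "ffn_gate_exps": "IQ2_XXS",
    "ffn_down_exps": "Q2_K",
}

def audit(kernel_assumes, tensor_listing):
    """kernel_assumes: {tensor suffix: format the kernel decodes}.
    tensor_listing:  {tensor suffix: format found in the file}.
    Returns suffixes where the kernel would read weights that don't exist."""
    return sorted(s for s, fmt in kernel_assumes.items()
                  if tensor_listing.get(s) != fmt)

# The agent's kernels assumed Q8_0 everywhere, so every expert position fails:
mismatches = audit({s: "Q8_0" for s in EXPECTED}, EXPECTED)
```

A check like this, run once against the real file before any kernel was written, would have invalidated the entire Q8_0 effort on day one.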
The agent also never set up any version control, which in fairness was on me for not telling it to. In an ideal version of this experiment that would have been part of the standard scaffolding, but I hadn't thought to include it and the agent didn't think to add it either. So seventy-five kilobytes of CUDA code lived as untracked files in a single directory for the entire duration of the run, every iteration overwriting the previous one with no history. The only record of intermediate states was a set of .bak files the agent created itself while patching around its own compilation errors. This is mostly a note about how much guidance these agents still need today. There's a long list of things I now know to include in the next setup brief, and "git init and commit after every successful build" is near the top of it.
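The guard I'll include next time can be this small. A sketch: the git invocations are standard, while the build_ok signal and function names are assumptions about how the loop would hand over its result:

```python
# Hypothetical checkpoint step the run was missing: commit the working tree
# only after a green build, so every iteration leaves a recoverable state
# instead of overwriting the previous one.
import subprocess

def git(*args, cwd="."):
    # -c flags supply a committer identity so this works in a fresh directory
    subprocess.run(["git", "-c", "user.email=agent@local",
                    "-c", "user.name=agent", *args], cwd=cwd, check=True)

def checkpoint(repo, message, build_ok):
    """Commit everything in `repo` iff the build succeeded; returns whether a
    commit was made. `git init` is idempotent, so calling it each tick is safe."""
    if not build_ok:
        return False
    git("init", "-q", cwd=repo)
    git("add", "-A", cwd=repo)
    git("commit", "-q", "--allow-empty", "-m", message, cwd=repo)
    return True
```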
During roughly the same calendar window, antirez shipped his own CUDA backend for ds4. 107 kernels, real IQ2 dequantization, multiple attention variants, MoE routing with sorted scatter and gather, tensor-core paths through WMMA, full upstream test vector validation. It works. On a single DGX Spark at q2 it reports 343 tokens per second prefill and 13.75 generation, which I'll get to in a moment.
The most interesting finding surprised me and has nothing to do with kernel quality. The only reward signal the agent had, namely the build succeeding, could never have exposed the quantization mismatch: the kernels compiled cleanly against weights that don't exist at those tensor positions in the GGUF, and loading a real weight wasn't anywhere in the loop. No amount of tuning would have helped. The agent had been optimizing against an imagined version of the file, and nothing in its feedback ever told it otherwise.
The broader pattern generalizes beyond this one project. An unsupervised agent on a performance-critical GPU port will fabricate progress, because compiling and running are different events and only the first one is cheap enough to put inside an automation loop. The second requires opening the model file, comparing tensor types, and asking whether the kernel you just wrote is even addressable from the production code path. None of that happened.
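In loop terms, the fix is a stricter promotion rule. A sketch, where all three gates are assumptions about what a faithful check would look like for a port like this:

```python
# Hypothetical promotion rule: "implemented" should require more than a
# successful compile. The extra gates here model the two checks the post
# identifies as missing: running against real weights, and being reachable
# from the production code path.
def promote(kernel, compiles, runs_on_real_weights, reachable_from_prod_path):
    """Flip a kernel's status only when every gate passes; otherwise report
    which checks failed so the tracker can't silently overstate progress."""
    gates = {
        "compiles": compiles,                          # the only gate the real loop had
        "runs_on_real_weights": runs_on_real_weights,  # expensive, and skipped
        "reachable_from_prod_path": reachable_from_prod_path,
    }
    failed = [name for name, ok in gates.items() if not ok]
    return ("implemented", []) if not failed else ("stub", failed)
```

The asymmetry is the point: the first gate is cheap enough to automate blindly, and the other two are exactly the ones that never ran.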
Two things kept me from being more disappointed. The first is that antirez and GPT-5.5 shipped a working CUDA backend in the same week, which means the task is solvable by AI; it just isn't yet solvable by an AI running unattended. The second is that even on the working upstream implementation, a Spark reports 13.75 tokens per second generation versus 36.86 on an M3 Ultra at the same quant. That isn't a surprise. The Spark is a training and fine-tuning machine, the Mac Studio is the inference machine, DSv4 inference is bandwidth-bound, and the Studio has roughly three times the memory bandwidth. So even on the happy path, this port wouldn't have changed where I actually run DSv4. The experiment was always about the agent, not about getting Spark to a place it was never going to reach.
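The arithmetic behind that last point, using spec-sheet bandwidth figures (roughly 273 GB/s for the Spark's LPDDR5x and 819 GB/s for the M3 Ultra; treat both as approximate):

```python
# Back-of-envelope check that the generation gap is explained by memory
# bandwidth. Token rates are the ones quoted above; bandwidths are
# approximate published specs.
spark_bw, studio_bw = 273.0, 819.0
ratio = studio_bw / spark_bw            # ~3.0x memory bandwidth in the Studio's favor
predicted_studio_gen = 13.75 * ratio    # if generation were purely bandwidth-bound
observed_ratio = 36.86 / 13.75          # ~2.68x actually measured
```

The measured 2.68x gap tracks the 3.0x bandwidth ratio closely, which is consistent with generation being bandwidth-bound rather than compute-bound.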
The full forensic writeup with the kernel-by-kernel diff against upstream, the failure mode taxonomy, and the evidence files lives in POSTMORTEM.md in this directory. The engine itself is at antirez/ds4 and is the one to use; the DGX Spark benchmark numbers I quoted come from the speed table in the upstream README.