PAW makes tiny models rival bigger ones locally
A 0.6B-parameter model can now match direct prompting of a 32B model while using far less memory. The approach, called Program-as-Weights (PAW), compiles a fuzzy specification into a compact, locally executable neural artifact. The team reports a 4B compiler, trained on FuzzyBench, a 10-million-example dataset they released, that emits parameter-efficient adapters for a frozen, lightweight interpreter. In tests, a 0.6B Qwen3 interpreter executing PAW programs matches direct prompting of Qwen3-32B while using roughly one-fiftieth of the inference memory and running at about 30 tokens per second on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: it is invoked once per function definition to produce a small reusable artifact, and every later function application is cheap and runs offline.
The paper shows that fuzzy, high-level requirements can be turned into practical, locally executable tooling. Instead of sending every question to a giant model, PAW lets engineers compile a domain-specific function from natural language into a compact, reusable artifact. The result is a two-part stack: a 4B compiler that translates a spec into an adapter, and a 0.6B interpreter that runs the adapter against inputs, with the original model’s capabilities folded into the artifact. Benchmarks indicate that, for tasks such as log alerting, JSON repair, and ranking by intent, this approach can reach parity with a much larger model while sharply reducing memory footprint and latency. The team reports that the 0.6B Qwen3 interpreter, paired with PAW adapters, achieves accuracy comparable to prompting a 32B model in the same family, at about 1/50 the memory and a throughput of around 30 tokens per second on a consumer laptop.
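As a rough sanity check on the reported memory figure, the parameter counts alone give a ratio close to one-fiftieth. The sketch below assumes both models store weights at the same numeric precision (2 bytes per parameter is an illustrative choice, not a figure from the paper) and ignores adapter weights, activations, and the KV cache.

```python
# Back-of-envelope comparison of weight memory only.
# Assumption: both models use the same bytes per parameter; 2 bytes
# (16-bit weights) is illustrative and not taken from the paper.
BYTES_PER_PARAM = 2


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate memory for model weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


small = weight_memory_gb(0.6e9)  # 0.6B interpreter
large = weight_memory_gb(32e9)   # 32B model prompted directly
ratio = large / small            # independent of the precision chosen above

print(f"small: {small:.1f} GB, large: {large:.1f} GB, ratio: {ratio:.1f}x")
```

Under these assumptions the ratio is 32 / 0.6, or about 53, which is in line with the "roughly one-fiftieth" figure; the measured number in the paper may differ because it reflects full inference memory rather than weights alone.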
From an engineering standpoint, the PAW paradigm shifts the optimization target. Rather than chasing end-to-end accuracy on every input, the effort is concentrated on building a robust, reusable artifact that encodes a function’s behavior for a class of inputs. The compiler ingests a natural-language description and emits a compact neural artifact that can be dropped into a frozen interpreter; that artifact can then be reused across calls at negligible per-invocation cost. For product teams, the implication is a path to local, reproducible tooling that preserves privacy and reduces cloud costs while still delivering the flexibility of large language models.
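The compile-once, call-many pattern described above can be sketched with stand-ins. Everything in this example is hypothetical: `ToyCompiler` and `Artifact` are illustrative names, and the "artifact" is a plain Python closure rather than adapter weights. It shows only the cost structure, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Artifact:
    """Stand-in for a compiled PAW program (in the paper, an adapter)."""
    spec: str
    run: Callable[[str], str]


class ToyCompiler:
    """Hypothetical stand-in for the 4B compiler; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0
        self._cache: Dict[str, Artifact] = {}

    def compile(self, spec: str) -> Artifact:
        # One expensive call per function definition; repeats hit the cache.
        if spec not in self._cache:
            self.calls += 1
            # Toy behavior: flag lines containing "ERROR". A real compiler
            # would emit adapter weights rather than a Python closure.
            self._cache[spec] = Artifact(
                spec, lambda line: "alert" if "ERROR" in line else "ok"
            )
        return self._cache[spec]


compiler = ToyCompiler()
SPEC = "Flag log lines that indicate an error"

alert_fn = compiler.compile(SPEC)  # compile once
logs = ["ERROR disk full", "INFO started", "ERROR timeout"]
results = [alert_fn.run(line) for line in logs]  # apply many times, cheaply

compiler.compile(SPEC)  # same spec again: served from cache, no recompile
print(results, compiler.calls)
```

The point of the design is that the expensive step is tied to the number of distinct function definitions, not to the number of inputs processed.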
Two concrete practitioner takeaways emerge. First, the approach introduces both a new bottleneck and a new opportunity in the tool-building stage: the quality and generality of the PAW adapters hinge on the compiler and the underlying FuzzyBench data. The 4B compiler and the 10M-example dataset are central assets; any drift in the domain or in the interpreted tasks will demand updated specifications or retraining. Second, maintenance and versioning become a new discipline. If the base model evolves or the target tasks shift, the PAW artifact may need to be redefined and recompiled to preserve behavior. That tradeoff is offset by offline reuse and the drop in memory, but it is not free.
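One way to make the versioning discipline concrete is to key each artifact on everything its behavior depends on, so that a change in any input triggers recompilation. This is a minimal sketch of that idea, not something the paper prescribes; the function names and the identifiers such as `interpreter-v1` are hypothetical.

```python
import hashlib
import json


def artifact_key(spec: str, interpreter_id: str, compiler_version: str) -> str:
    """Derive a stable key from everything the artifact's behavior depends on."""
    payload = json.dumps(
        {"spec": spec, "interpreter": interpreter_id, "compiler": compiler_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def needs_recompile(
    stored_key: str, spec: str, interpreter_id: str, compiler_version: str
) -> bool:
    """True when the spec, base interpreter, or compiler differs from what was stored."""
    return stored_key != artifact_key(spec, interpreter_id, compiler_version)


# Hypothetical identifiers, for illustration only.
SPEC = "Repair malformed JSON and return valid JSON"
stored = artifact_key(SPEC, "interpreter-v1", "compiler-v1")

unchanged = needs_recompile(stored, SPEC, "interpreter-v1", "compiler-v1")
after_upgrade = needs_recompile(stored, SPEC, "interpreter-v2", "compiler-v1")
print(unchanged, after_upgrade)
```

Storing the key alongside the artifact lets a build step detect stale artifacts automatically when the base interpreter or the spec changes.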
Looking ahead, the natural next steps are to broaden the range of tasks that PAW can efficiently support, test across more base models and interpreters, and quantify failure modes in real products. Watch for how well the approach scales to multi-task pipelines, how robust adapters are to distribution shifts, and whether engineers can standardize a workflow for compiling domain-specific tools from natural language while keeping latency consistently low.
- Program-as-Weights: A Programming Paradigm for Fuzzy Functions / arXiv LLM/Foundation Query / Primary source / Published JUL 02, 2026 / Accessed JUL 04, 2026