Ollama is now powered by MLX on Apple Silicon in preview

Tags: ollama · mlx · apple-silicon · local-llm · nvfp4 · quantization · coding-agents · inference-performance

Summary

Ollama 0.18 now uses Apple’s MLX framework on Apple Silicon, delivering major speedups for local LLM inference. The update includes NVFP4 quantization support for production-parity results, smarter KV cache management for agentic workflows, and a preview launch with Qwen3.5-35B-A3B optimized for coding tasks.

Key Insights

  • MLX integration delivers significant prefill and decode speed improvements across all Apple Silicon chips, with M5/M5 Pro/M5 Max gaining additional GPU Neural Accelerator support
  • Benchmarks used Qwen3.5-35B-A3B at NVFP4 precision; the upcoming Ollama 0.19 promises even higher numbers (1,851 tok/s prefill, 134 tok/s decode with int4)
  • NVFP4 (NVIDIA’s 4-bit floating-point format) is the key quantization format here: it preserves accuracy better than traditional Q4_K_M while reducing memory bandwidth, and it matches what cloud inference providers use, so local results mirror production
  • Cache improvements are specifically designed for agentic/coding use cases: cross-conversation cache reuse (critical for tools like Claude Code that share system prompts), intelligent checkpoint snapshots, and smarter eviction that preserves shared prefixes
  • The ollama launch command now natively supports Claude Code and OpenClaw as first-class targets, signaling Ollama’s pivot from chat-only to agentic infrastructure
  • Requires 32+ GB unified memory, which limits this to higher-end Mac configurations
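
To make the NVFP4 point concrete: the format stores each weight as a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit) with a shared per-block scale. Below is a minimal NumPy sketch of that block-scaled rounding; it is an illustration only, not Ollama’s or NVIDIA’s implementation, and it omits the bit-packing and FP8 scale encoding the real format uses.

```python
import numpy as np

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(x, block_size=16):
    """Simulate NVFP4 quantization: per-block scale plus nearest-E2M1 rounding.

    Returns the dequantized values so the rounding error is easy to inspect.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Per-block scale maps the largest magnitude onto E2M1's max (6.0).
        scale = np.max(np.abs(block)) / 6.0
        if scale == 0.0:
            scale = 1.0  # all-zero block: any scale works
        scaled = np.abs(block) / scale
        # Round each element to the nearest representable E2M1 magnitude.
        idx = np.argmin(np.abs(scaled[:, None] - E2M1_VALUES[None, :]), axis=1)
        out[start:start + block_size] = np.sign(block) * E2M1_VALUES[idx] * scale
    return out
```

Because the scale is chosen per small block rather than per tensor, outliers in one block do not degrade precision elsewhere, which is the main reason block-scaled 4-bit floats hold up better than coarser schemes.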
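
The cross-conversation cache reuse described above can be sketched as a longest-prefix lookup: conversations that share a system prompt share the tokens at the start of their context, so a new request only needs prefill for the tokens past the longest cached prefix. The class below is a hypothetical illustration of that idea; the names and structure are assumptions, not Ollama’s internals.

```python
class PrefixCache:
    """Toy cache mapping token prefixes to opaque KV states."""

    def __init__(self):
        self._entries = {}  # tuple(tokens) -> KV state

    def put(self, tokens, kv_state):
        self._entries[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_state) for the longest cached prefix
        of `tokens`, or (0, None) if nothing matches."""
        best_len, best_state = 0, None
        for prefix, state in self._entries.items():
            n = len(prefix)
            if n > best_len and tuple(tokens[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state
```

With a shared system prompt cached once, every agent turn that starts with the same prompt tokens skips straight to decoding the new suffix, which is why this matters most for tools that replay a large fixed preamble on each call.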