Why My Mac Mini M4 Outperforms Dual RTX 3090s for LLM Inference

I built a dual RTX 3090 server for local LLM inference. A Mac Mini M4 turned out to be 27% faster and 22× more efficient. Here's why memory bandwidth beats raw GPU power.

I spent over a year building and optimizing a dual RTX 3090 server for local LLM inference. 48GB of VRAM. Custom cooling. A lot of tuning.

Then I tested a Mac Mini M4. It was faster.

This is what I learned.

The Hardware

GPU Server:

  • 2× RTX 3090 24GB (48GB VRAM total)
  • Ollama on Ubuntu Server
  • ~700W power draw under load

Mac Mini M4:

  • M4 chip with 64GB unified memory
  • Ollama on macOS
  • ~40W power draw under load

The Benchmark

I ran Qwen3 32B with identical prompts on both machines. This is a large model that needs serious hardware.

Machine             Average Speed
Mac Mini M4 64GB    11.7 tokens/sec
Dual RTX 3090s      9.2 tokens/sec

The Mac Mini is 27% faster. I ran the test multiple times to confirm.
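
If you want to reproduce the number, here's a minimal sketch of the measurement against Ollama's local REST API. The prompt, run count, and endpoint are illustrative assumptions, not my exact setup; the API's non-streaming response includes eval_count and eval_duration, which is all you need to compute tokens per second.

```python
# Minimal benchmark sketch against Ollama's default local API.
# Prompt and run count are placeholders, not the exact ones I used.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:32b"
PROMPT = "Explain memory bandwidth in one paragraph."  # placeholder prompt
RUNS = 5

speeds = []
for _ in range(RUNS):
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": PROMPT, "stream": False},
        timeout=600,
    )
    data = resp.json()
    # eval_count = generated tokens, eval_duration = generation time in ns
    speeds.append(data["eval_count"] / data["eval_duration"] * 1e9)

print(f"average: {sum(speeds) / len(speeds):.1f} tokens/sec")
```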

Why This Happens

Unified memory eliminates the multi-GPU bottleneck.

LLM inference (at batch size 1) is memory-bandwidth bound, not compute-bound. Generating each token means streaming essentially all of the model's weights from memory to the compute cores, so speed depends on how fast you can move that data.
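
A back-of-envelope calculation makes the point: if every token has to stream roughly the whole model through memory once, bandwidth divided by model size gives a ceiling on tokens per second. A minimal sketch, assuming a ~20 GB 4-bit quantization of a 32B model (an assumed footprint, not a measured one):

```python
# Rough single-stream decoding ceiling: each token reads ~all weights once,
# so max tokens/sec ≈ memory bandwidth (GB/s) / model footprint (GB).
def decode_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 20  # assumed footprint of a 4-bit quantized 32B model

# One RTX 3090 in isolation has 936 GB/s of memory bandwidth.
print(f"single 3090 ceiling: {decode_ceiling(936, MODEL_GB):.0f} tok/s")
# The dual-3090 box measured 9.2 tok/s, far below that ceiling:
# the PCIe hop between the cards is what eats the difference.
```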

Each RTX 3090 has 936 GB/s of memory bandwidth. Impressive on paper. But when a model spans two GPUs, the cards have to exchange activations over PCIe, which moves data at a small fraction of that speed. That's the bottleneck.

Apple's M4 has 64GB of unified memory. One pool. No inter-GPU communication needed. Consistent bandwidth for both CPU and GPU cores.

Power Efficiency

                   Mac Mini M4   GPU Server
Speed              11.7 t/s      9.2 t/s
Power              ~40W          ~700W
Tokens per Watt    0.29          0.013

The M4 is 22× more power efficient. Over a year of heavy use, that adds up.
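
Here's a rough sketch of what "adds up" means, assuming four hours a day under load and $0.30/kWh. Both are assumptions; plug in your own duty cycle and electricity rate.

```python
# Rough yearly energy comparison under an assumed duty cycle.
# 4 hours/day of inference load and $0.30/kWh are illustrative
# assumptions, not figures from the benchmark itself.
HOURS_PER_DAY = 4
PRICE_PER_KWH = 0.30  # USD, assumed

def yearly_cost(watts: float) -> tuple[float, float]:
    kwh = watts / 1000 * HOURS_PER_DAY * 365
    return kwh, kwh * PRICE_PER_KWH

for name, watts in [("Mac Mini M4", 40), ("Dual RTX 3090s", 700)]:
    kwh, usd = yearly_cost(watts)
    print(f"{name}: {kwh:.0f} kWh/year, about ${usd:.0f}")
```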

When NVIDIA Still Wins

GPUs still dominate certain workloads:

  • Training and fine-tuning (CUDA ecosystem, tensor cores)
  • Batch inference (multiple parallel requests)
  • Image generation (Stable Diffusion, ComfyUI)

My GPU server now handles fine-tuning, text-to-speech, and image generation. It's good at those.

Takeaway

For local LLM inference, memory bandwidth matters more than raw compute. Apple's unified memory architecture currently delivers that better than multi-GPU setups.

The Mac Mini M4 with 64GB is faster, silent, cheaper to run, and costs less than my GPU rig.

Sometimes the right tool isn't the most powerful one.

Benchmarked with Ollama running Qwen3:32B, February 2025.