Diagnosing and Fixing TrueNAS Performance Issues: A Complete Case Study (homelab)

A TrueNAS system with sluggish file access was bottlenecked by Samsung 870 QVO drives: consumer QLC NAND SSDs that collapse to around 80 MB/s under sustained load in RAIDZ1 configurations.

When storage performance doesn't meet expectations, the root cause isn't always obvious. This case study documents my complete troubleshooting journey that transformed a poorly performing TrueNAS system into a high-performance storage solution delivering 1.8 GB/s writes and 4.4 GB/s reads.

Initial Symptoms

The starting point was a report of slow "file access on DATA": a vague but common complaint that could indicate numerous underlying issues. Without specific metrics or error messages, I needed systematic diagnostics to identify the bottlenecks.

System Overview

The TrueNAS 25.04 system consisted of:

  • DATA Pool: 8 drives in 2x RAIDZ1 configuration (14.5TB total)
  • VMs Pool: 4 drives in 2x mirror configuration with dedicated SLOG
  • Cache: 2x 1TB NVMe drives (initially causing problems)
  • Network: Bonded 10GbE interfaces
  • Memory: 376GB RAM

Phase 1: Initial Diagnostics

Cache Device Crisis

The first major discovery came from examining pool statistics:

zpool iostat -v 5 3
arc_summary | grep -E "ARC|L2ARC|Cache"

Critical Finding: The L2ARC (cache) reported 2.8 TiB of cached data against only 1.86 TiB of physical cache capacity. This impossible situation indicated severe cache thrashing: the cache devices were constantly evicting data, hurting performance rather than helping it.
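
Before removing anything, the cache vdevs and their sizes can be confirmed from the pool layout (a quick check; device names and identifiers will differ per system):

# List per-vdev capacities, including the cache devices
zpool list -v DATA
zpool list -v VMs

# zpool status shows the cache devices and the identifiers needed for removal
zpool status DATA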

Immediate Action: Removed the cache devices entirely to stop the thrashing:

zpool remove DATA 4dsfb6fb-8d2a-4c87-83da-5b4d5fr98561
zpool remove VMs 57d85g8a-4c22-4519-a866-b9e8qk394c91

Memory Pressure from Deduplication

Even though deduplication was disabled, a 643 MB dedup table was still consuming ARC memory that should have been available for file caching.
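
The dedup table's footprint can be confirmed directly from the pool, since zpool status -D prints the DDT histogram and its in-core size:

# Show dedup table (DDT) statistics, including entry count and in-core size
sudo zpool status -D DATA

The table persists until the blocks written while dedup was enabled are rewritten or destroyed; simply turning the property off does not shrink it.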

ARC Hit Rates Analysis

The system showed concerning patterns:

  • Total hits: 55,988,564
  • Total misses: 588,126
  • Hit rate: ~99% (seemingly good, but masking underlying issues)
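
For a quick check without the full arc_summary output, the same hit rate can be computed straight from the kernel counters (a minimal sketch for a Linux-based TrueNAS SCALE system):

# Compute the overall ARC hit rate from the raw kstat counters
awk '/^hits / {h=$3} /^misses / {m=$3} END {printf "ARC hit rate: %.2f%%\n", h*100/(h+m)}' \
    /proc/spl/kstat/zfs/arcstats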

Phase 2: Hardware Investigation

Drive Performance Testing

Individual drive testing revealed the core problem:

# WARNING: destructive raw-device write; only run against a drive that is
# offline or already removed from the pool
sudo dd if=/dev/zero of=/dev/sda bs=1M count=100 oflag=direct
# Result: 76 MB/s, far below SSD expectations
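
A read-only sequential pass is a non-destructive alternative for a quick per-drive sanity check (a sketch; iflag=direct bypasses the page cache):

# Non-destructive: read the first 1000 MiB of the raw device
sudo dd if=/dev/sda of=/dev/null bs=1M count=1000 iflag=direct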

SMART Analysis

sudo smartctl -i /dev/sdd | grep "Model"
# Result: Samsung SSD 870 QVO 4TB
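
With eight drives in the pool, a quick loop identifies every model at once (a sketch, assuming the SATA disks enumerate as /dev/sda through /dev/sdz):

# Print the model string for every SATA disk in the system
for d in /dev/sd[a-z]; do
    echo -n "$d: "
    sudo smartctl -i "$d" | grep "Device Model"
done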

Root Cause Identified: The DATA pool was built on Samsung 870 QVO drives: consumer QLC NAND SSDs completely unsuitable for sustained server workloads.

Phase 3: Understanding the QVO Problem

QLC vs TLC NAND Technology

The Samsung 870 QVO drives used QLC (Quad-Level Cell) NAND, which has severe limitations:

  • SLC Cache Dependency: Fast writes only while SLC cache isn't full
  • QLC Performance Collapse: Sustained writes drop to 80-160 MB/s
  • RAIDZ1 Bottleneck: All drives must complete writes before the operation finishes

In RAIDZ1, the slowest drive determines overall performance. The QVO drives were bottlenecking the entire 8-drive array to their worst-case performance.
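
The collapse is reproducible: push a sustained write well past what the SLC cache can absorb and watch per-vdev throughput fall off. A sketch (file path and size are illustrative; /dev/urandom is used so ZFS compression does not inflate the numbers):

# Sustained write far beyond the SLC cache size; throughput starts high
# and drops sharply once writes land directly on the QLC NAND
sudo dd if=/dev/urandom of=/mnt/DATA/slc_test bs=1M count=50000 conv=fsync

# In a second shell, watch per-vdev write throughput over time
zpool iostat -v DATA 5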

Performance Impact

iostat -x 1 5

Revealed severe issues:

  • Write latencies: 286ms (should be <1ms for SSDs)
  • Queue depths: Drives constantly saturated
  • Utilization: 45% on individual drives during light workloads

Phase 4: The Solution

Drive Replacement Strategy

Replaced the Samsung 870 QVO drives with Samsung 870 EVO drives:

870 QVO (Problem):

  • QLC NAND technology
  • Performance collapses under sustained load
  • 80 MB/s sustained writes

870 EVO (Solution):

  • TLC NAND technology
  • 530 MB/s sustained writes
  • Consistent sustained-write performance and a higher endurance rating
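
Replacing one drive at a time keeps the pool online during the swap; a minimal sketch with hypothetical device identifiers, waiting for each resilver to finish before starting the next:

# Swap a single QVO for its EVO replacement and let RAIDZ1 resilver it
sudo zpool replace DATA <old-qvo-guid> /dev/disk/by-id/<new-870-evo>

# Check resilver progress before moving on to the next drive
zpool status -v DATA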

Verification Testing

# NOTE: destructive raw-device write; run only before the drive joins the pool
sudo dd if=/dev/zero of=/dev/sdd bs=1M count=500 oflag=direct
# Before (870 QVO): ~80 MB/s
# After (870 EVO): 403 MB/s

5x performance improvement on raw drive speed.

Phase 5: Intelligent Cache Implementation

NVMe Cache Strategy

With the primary bottleneck resolved, I strategically re-added the NVMe drives as L2ARC cache:

sudo zpool add DATA cache nvme0n1p1
sudo zpool add VMs cache nvme1n1p1

Key Improvements:

  • Proper sizing: 1TB cache per pool (not overcommitted)
  • Clean drives: No legacy cache metadata causing conflicts (see the label-clearing sketch below)
  • Fast NVMe: Samsung 990 EVO Plus drives for cache duties
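
Starting from clean devices meant wiping the old ZFS labels before the zpool add above; a sketch, assuming the NVMe partitions are no longer part of any pool at that point:

# Remove stale ZFS labels left over from the previous cache configuration
sudo zpool labelclear -f /dev/nvme0n1p1
sudo zpool labelclear -f /dev/nvme1n1p1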

Cache Effectiveness

arc_summary | grep -A 10 "L2ARC"

Results after optimization:

  • L2ARC Hit Rate: 75% (excellent)
  • Cache Size: 1.1 TiB populated
  • No errors: Clean operation across all components
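
As with the ARC, the L2ARC hit rate can be derived from the raw counters when a quick number is all that is needed (a sketch):

# L2ARC hit rate from the kernel counters
awk '/^l2_hits / {h=$3} /^l2_misses / {m=$3} END {printf "L2ARC hit rate: %.2f%%\n", h*100/(h+m)}' \
    /proc/spl/kstat/zfs/arcstats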

Final Performance Results

Storage Performance

sudo dd if=/dev/zero of=/mnt/DATA/test_write bs=1M count=1000 conv=fsync
# Write: 1.8 GB/s

sudo dd if=/mnt/DATA/test_write of=/dev/null bs=1M  
# Read: 4.4 GB/s
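
Because /dev/zero compresses almost perfectly, the write figure can be flattering on a dataset with compression enabled, and a freshly written file is read back largely from ARC. An fio run with incompressible data is a useful cross-check (a sketch, assuming fio is installed):

# Sequential write with incompressible data, fsync at the end
fio --name=seqwrite --filename=/mnt/DATA/fio_test --size=4G --bs=1M \
    --rw=write --ioengine=psync --end_fsync=1

# Sequential read of the same file (still largely served from ARC with 376GB of RAM)
fio --name=seqread --filename=/mnt/DATA/fio_test --size=4G --bs=1M \
    --rw=read --ioengine=psync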

Network Verification

cat /proc/net/bonding/bond0

Confirmed 2x 10GbE bonded interfaces providing 20 Gbps of aggregate theoretical bandwidth, enough headroom for the storage performance (depending on the bonding mode, a single TCP stream may still be limited to one 10GbE link).
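
Link configuration is not the same as achievable throughput, so an iperf3 run from a client on the 10GbE segment is worth the extra minute (a sketch, with a hypothetical server address):

# On the TrueNAS host
iperf3 -s

# On a 10GbE client: four parallel streams for 30 seconds (address is illustrative)
iperf3 -c 192.168.1.50 -P 4 -t 30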

Key Lessons Learned

Drive Selection Matters

Consumer QLC drives should never be used in RAIDZ configurations. The performance characteristics make them unsuitable for sustained server workloads.

Proper Cache Sizing

L2ARC cache devices must be properly sized. Overcommitting cache can cause performance degradation rather than improvement.
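
Part of the cost is structural: every L2ARC record keeps a small header in RAM, so an oversized L2ARC directly shrinks the ARC it is meant to complement. The current overhead is visible in the kstats (a sketch):

# RAM currently consumed by L2ARC headers
awk '/^l2_hdr_size / {printf "L2ARC header overhead: %.0f MiB\n", $3/1024/1024}' \
    /proc/spl/kstat/zfs/arcstats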

Systematic Diagnostics

Performance issues require methodical investigation:

  1. Pool-level metrics (iostat, status)
  2. Drive-level analysis (SMART data, individual testing)
  3. Cache effectiveness (ARC statistics)
  4. Network capabilities (interface configuration)

The Compound Effect

Multiple small issues can create severe performance problems:

  • Cache thrashing + QLC drives + memory pressure = 80 MB/s performance
  • Proper drives + sized cache + clean memory = 1.8 GB/s performance

Performance Transformation Summary

Metric            Before             After      Improvement
Write Speed       ~80 MB/s           1.8 GB/s   22.5x
Read Speed        ~100 MB/s          4.4 GB/s   44x
Drive Latency     286 ms             <1 ms      286x
Cache Hit Rate    N/A (thrashing)    75%        Effective

This case demonstrates that storage performance issues often result from multiple compounding factors. The combination of inappropriate drive technology, cache misconfiguration and memory pressure created a perfect storm of poor performance.

The systematic approach of identifying each issue, understanding root causes and implementing targeted solutions transformed a poorly performing system into a high-performance storage platform. The final configuration delivers enterprise-class performance suitable for demanding workloads.

When troubleshooting storage performance, look beyond obvious metrics. Drive technology choices, cache implementation and memory utilization all play critical roles in overall system performance. A methodical diagnostic approach will reveal the true bottlenecks and guide effective solutions.