It started with subtle freezes. Then my entire 9TB SSD TrueNAS pool suspended with the dreaded "UNAVAIL - insufficient replicas" status. This is the story of how six budget Crucial BX500 SSDs nearly destroyed my NAS and how I replaced them all without losing a single byte of data.
Spoiler: The system went from 300-1100ms disk latencies to 0.1ms. Load average dropped from 68 to 1.2. But getting there required navigating pool suspensions, device-name reshuffling, and some of the most stressful moments of my NAS career 😄
The Symptoms: When "ONLINE" Means Nothing
My TrueNAS server appeared healthy in the web UI. zpool status showed:
```
  pool: DATA
 state: ONLINE
  scan: scrub repaired 0B with 0 errors
```
But the system was **unusable**. Intermittent freezes. Minutes of unresponsiveness. Something was catastrophically wrong.
The kernel logs revealed the horror:
```bash
[Jan 14 21:57:56 2026] INFO: task txg_sync blocked for more than 120 seconds.
[Jan 18 15:23:41 2026] INFO: task txg_sync blocked for more than 120 seconds.
```
What is txg_sync? ZFS writes data in transaction groups (TXGs). When a disk freezes for 120+ seconds, the entire pool waits. Everything stops.
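If you want to check whether your own pool is hitting this, the kernel ring buffer and the per-pool TXG kstat are the places to look. A minimal sketch, assuming OpenZFS on Linux and a pool named DATA:
```bash
# Any hung ZFS sync threads reported by the kernel?
sudo dmesg -T | grep "task txg_sync blocked"

# Recent transaction groups for pool DATA; long sync times here mean the disks are stalling
sudo tail -n 5 /proc/spl/kstat/zfs/DATA/txgs
```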
The Smoking Gun
Running iostat -x revealed the culprit:
```
Device   r/s    w/s     await   svctm   %util
sdk      0.35   12.51   984ms   324ms   92.43%
sdl      0.35   11.85   982ms   126ms   81.24%
sdj      0.37   12.65   979ms   43ms    32.87%
```
Nearly a full second of await and 324 milliseconds of service time per write! For context, a healthy SATA SSD should sit under 2ms. These drives were more than 160x slower than they should have been.
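For anyone reproducing this measurement, a single extended iostat run over the suspect devices is enough; the await column is the one to watch (the device names below are just the ones from my setup):
```bash
# Extended device stats every 5 seconds, limited to the suspect pool members
# (the first report is the average since boot; watch the later intervals)
iostat -xd 5 sdj sdk sdl
```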
Meanwhile:
- Load average: 30+ (should be 2-5)
- RAM usage: 357GB/376GB (ARC consuming everything)
- Pool operations blocked: Every few minutes
Something was fundamentally broken.

The Diagnosis: Write Amplification From Hell
I examined the SMART data of the worst offender:
```bash
sudo smartctl -a /dev/sdk
```
The results were shocking:
```
Model:  CT2000BX500SSD1 (Crucial BX500 2TB)
Serial: 2505E9A423CD

SMART Attributes:
177 Wear_Leveling_Count        94
202 Percent_Lifetime_Remain    94
246 Total_LBAs_Written         27,937,743   # 13.5 TB
247 Host_Program_NAND_Pages    1,583,830    # 767 GB
```
Do the math:
- Host wrote to drive: 767 GB
- Drive actually wrote to NAND: 13.5 TB
- Write amplification: 17.6x
For comparison, quality SSDs have write amplification of 1.5-3x. Enterprise SSDs achieve 1.1-1.5x.
17.6x means this drive was doing 17.6x more work than necessary.
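The arithmetic is trivial to script; this one-liner just performs the division from the two values above (plug in your own drive's numbers):
```bash
# Write amplification = NAND writes / host writes, both expressed in GB
awk 'BEGIN { nand_tb = 13.5; host_gb = 767; printf "Write amplification: %.1fx\n", (nand_tb * 1000) / host_gb }'
```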
Why the BX500 is a Disaster for NAS
The Crucial BX500 is a budget consumer SSD designed for light desktop use. It has:
❌ No DRAM cache - the mapping table lives in slow NAND (HMB only exists on NVMe drives, and this is SATA)
❌ QLC NAND - 4 bits per cell, slow write performance
❌ Aggressive garbage collection - Causes multi-second freezes
❌ 360 TBW endurance - vs 1000+ TBW for NAS drives
❌ Not rated for 24/7 operation
When these drives perform garbage collection under sustained write load, they freeze completely for seconds at a time. ZFS's transaction group sync waits for them, blocking the entire pool.
The result: system-wide freezes every few minutes.
The Scale of the Problem
I didn't have just one BX500. My first pass at mapping the pool already looked grim:
```
Pool DATA (14.5 TB total):
  raidz1-0 (4 disks):
    ✅ Samsung 860 EVO 4TB
    ✅ Samsung 860 EVO 4TB
    ✅ Crucial MX500 2TB
    ❌ Crucial BX500 2TB   ← Problem
  raidz1-1 (4 disks):
    ✅ Crucial MX500 2TB
    ❌ Crucial BX500 2TB   ← Problem
    ❌ Crucial BX500 2TB   ← Problem
    ❌ Crucial BX500 2TB   ← Problem
```
After mapping every drive by serial number, the full horror emerged: six BX500s in total, spread across the pool. Both vdevs were affected.

Emergency Triage: Surviving Until Replacement
New drives would take 4 days to arrive. The system was barely usable. I needed emergency workarounds.
The Nuclear Option: Cripple Performance to Restore Stability
```bash
# Reduce ARC from 357GB to 80GB
echo 85899345920 | sudo tee /sys/module/zfs/parameters/zfs_arc_max

# Limit burst writes to 4GB
echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_dirty_data_max

# Flush more frequently (reduce TXG timeout)
echo 10 | sudo tee /sys/module/zfs/parameters/zfs_txg_timeout

# Force cache drop
echo 3 | sudo tee /proc/sys/vm/drop_caches
```
Result: RAM usage dropped from 357GB to 18GB immediately.
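One caveat: these sysfs writes are runtime-only and reset on reboot. On a plain Linux/OpenZFS box the usual way to persist such limits is a modprobe options file; a minimal sketch, and treat it as illustrative since TrueNAS normally manages ZFS tunables itself:
```bash
# Persist the emergency limits as ZFS module options (illustrative; adjust or skip on TrueNAS)
cat <<'EOF' | sudo tee /etc/modprobe.d/zfs-emergency.conf
options zfs zfs_arc_max=85899345920
options zfs zfs_dirty_data_max=4294967296
options zfs zfs_txg_timeout=10
EOF
```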
Stop the Bleeding: Eliminate Write Load
I discovered a single application had written 2.28 TB and was still writing heavily. Combined with Time Machine backups over SMB, the BX500s were being hammered.
```bash
# Stop application1
midclt call app.stop application1

# Disable SMB (Time Machine)
midclt call service.stop cifs
```
Result: Load dropped from 68 → 7 within minutes. System became usable.
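To confirm the write load was really gone rather than just displaced, zpool iostat gives a live per-vdev view:
```bash
# Per-vdev bandwidth and IOPS for pool DATA, refreshed every 5 seconds
zpool iostat -v DATA 5
```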
Critical realization: The BX500s were in a death spiral. High write load → garbage collection → freezes → ZFS retries → more write load → worse garbage collection.
These workarounds kept the system alive for 4 days until replacement drives arrived.
The Replacement Plan
Hardware Selection
After extensive research, I chose WD Red SA500 2TB (WDS200T1R0A):
✅ 3D TLC NAND (not QLC)
✅ DRAM cache with Marvell 88SS1074 controller
✅ 1000 TBW endurance (vs 360 TBW for BX500)
✅ NAS-optimized firmware
✅ 24/7 operation rated with 5-year warranty
✅ Price: ~€120/drive
Total cost: €720 for 6 drives
The Strategy
Critical rules for RAIDZ1:
- ⚠️ RAIDZ1 = single parity - Can lose 1 disk per vdev
- ⚠️ Losing 2 disks = total data loss
- ✅ Replace ONE disk at a time
- ✅ Wait for 100% resilver completion before the next swap (see the guard sketch below)
- ✅ NEVER replace 2+ disks in same vdev simultaneously
Estimated time: 6 disks × 2-4h resilver = 12-24 hours total
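A tiny guard before each swap helps enforce the one-at-a-time rule mechanically; a minimal sketch that simply refuses to continue while anything is degraded or resilvering (adjust the pool name as needed):
```bash
#!/bin/sh
# Refuse to continue unless pool DATA is fully healthy and no resilver is running
if zpool status DATA | grep -qE "DEGRADED|UNAVAIL|resilver in progress"; then
  echo "Pool not ready - wait for the current resilver to finish" >&2
  exit 1
fi
echo "Pool clean - safe to replace the next disk"
```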
Disk Identification Challenge
Problem: Which physical drive is which Linux device?
My server uses an LSI MegaRAID SAS-3 3108 controller with 24 bays. Device names like /dev/sdk don't directly map to physical slots.
Solution: storcli + LED locate
```bash
# Find a disk's enclosure/slot by serial number (serials only appear in the "show all" output)
sudo /usr/local/sbin/storcli /c0 /eall /sall show all | grep -E "Drive /c0|SN ="

# Light up the LED on a specific slot
sudo /usr/local/sbin/storcli /c0 /e21 /s10 start locate

# Walk to server, identify blinking drive

# Turn off LED
sudo /usr/local/sbin/storcli /c0 /e21 /s10 stop locate
```
This became absolutely critical later when device names started reshuffling.
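Pairing the controller's slot/serial listing with the kernel's view of each /dev/sdX makes the mapping table almost build itself. A rough helper for the kernel side (SMART output formats vary slightly between drives, so treat the awk as a starting point):
```bash
# Kernel view: /dev/sdX -> serial number, straight from SMART
for d in /dev/sd?; do
  printf '%s  ' "$d"
  sudo smartctl -i "$d" | awk -F: '/Serial Number/ { gsub(/ /, "", $2); print $2 }'
done
```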

The Replacement Journey: 16 Hours of Surgery
Replacement 1: The Accidental Start (sdj)
Date: Jan 23, 20:58
Target: sdk (worst performer)
What happened: Replaced sdj instead 😅
In the confusion of identifying disks, I lit up the wrong slot and pulled sdj instead of sdk.
Recovery:
```bash
# Pool went DEGRADED
sudo zpool clear DATA
sudo zpool replace DATA 279a133d-75a9-4736-aecb-4819ceae5272 sdj
```
Resilver time: 1h18min
Outcome: ✅ Success (but learned a painful lesson about verification)
Lesson 1: ALWAYS verify serial numbers. ALWAYS use LED locate. ALWAYS double-check before pulling a drive.
Crisis 1: Pool Suspension
Date: Jan 23, ~22:30
Trigger: Attempted to replace sdk while sdj resilver was finishing
The horror:
```
  pool: DATA
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.

        raidz1-1                UNAVAIL
          279a133d... (sdj)     REMOVED
          6819c81d... (sdk)     OFFLINE
```
Two disks down in RAIDZ1 = data loss territory.
My stomach dropped. Years of data, family photos, documents - potentially gone.
With nothing left to lose, I cleared the errors:
```bash
sudo zpool clear DATA
```
Miraculous recovery: Both disks came back ONLINE. Pool showed:
```
errors: 254 data errors, use '-v' for a list
```
254 corrupted files, but ZFS scrub repaired them all. Zero data loss.
Lesson 2: ZFS is incredibly resilient, but don't test its limits. The BX500s were literally dying mid-replacement.
Crisis 2: Device Name Reshuffling
Date: Jan 24, ~00:00
Problem: Rebooted to fix stuck device detection
Before reboot:
- sdk = Crucial BX500 (target for replacement)
- sdh = Samsung 860 PRO
After reboot:
- sdk = Crucial BX500 (wrong one!)
- sdh = WD Red SA500 (NEW disk!)
- sdi = Samsung 860 PRO
All device names changed!
I had swapped sdk for a new drive, but after reboot it showed as sdh. Meanwhile a different BX500 took the sdk name.
Solution: Stop trusting device names. Create a mapping table.
```bash
# ALWAYS verify by serial
sudo smartctl -i /dev/sdX | grep "Serial Number"

# ALWAYS verify by slot
sudo /usr/local/sbin/storcli /c0 /e21 /sXX show
```
Created comprehensive mapping:
| Slot | Device Name | Serial | Model | Status |
|---|---|---|---|---|
| 21:10 | sdj | 2548TLD00764 | WD Red SA500 | Replaced |
| 21:8 | sdh | 2545EBD00307 | WD Red SA500 | Replacing |
| 21:11 | sdk | 2548TLD00870 | WD Red SA500 | Next |
Lesson 3: Device names (sdX) are NOT stable across reboots or hot-swaps. Always use serial numbers and controller slot IDs.
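An even cheaper safeguard is to stop looking at sdX names entirely: the /dev/disk/by-id symlinks embed model and serial number and stay stable across reboots. A quick way to see the current mapping:
```bash
# Map stable by-id names (model + serial) to whatever sdX they currently point at
ls -l /dev/disk/by-id/ | grep '^l' | grep -v -- '-part' | awk '{ print $9, "->", $11 }'
```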
Replacement 2-6: Methodical Execution
After learning from the first two disasters, I developed a bulletproof procedure:
The Process (Repeated 5 Times)
1. Create Mapping Entry
```bash
# Document BEFORE touching anything
sudo /usr/local/sbin/storcli /c0 /eall /sall show | grep BX500
sudo smartctl -i /dev/sdX | grep "Serial Number"
```
2. LED Locate
```bash
sudo /usr/local/sbin/storcli /c0 /e21 /sXX start locate
# Walk to server, verify blinking
# Return to terminal
sudo /usr/local/sbin/storcli /c0 /e21 /sXX stop locate
```
3. Hot-Swap
- Remove caddy
- Swap BX500 → WD Red SA500
- Reinsert caddy
- Wait 20 seconds
4. Rescan & Verify
```bash
# Force SCSI rescan
for host in /sys/class/scsi_host/host*; do
  echo "- - -" | sudo tee "$host/scan"
done
sleep 15

# Verify detection
sudo /usr/local/sbin/storcli /c0 /e21 /sXX show
sudo smartctl -i /dev/sdX | grep "Device Model"
```
5. ZFS Replace
```bash
sudo zpool replace DATA <UUID> <device>
watch -n 10 'zpool status DATA'
```
6. Wait for 100% Resilver
Average resilver time: 1h30min
The Final Tally
| # | Device | Slot | Serial | Resilver Time | Issues |
|---|---|---|---|---|---|
| 1 | sdj | 21:10 | 2548TLD00764 | 1h18min | Wrong disk initially |
| 2 | sdh | 21:8 | 2545EBD00307 | ~2h00min | Device reshuffling |
| 3 | sdk | 21:11 | 2548TLD00870 | 1h58min | Disk invisible to Linux |
| 4 | sdo | 21:14 | 2548TLD00561 | 1h27min | Clean |
| 5 | sdg | 21:15 | 2548TLD00906 | 1h08min | Clean |
| 6 | sdl | 21:13 | 2548TLD00912 | 1h07min | Clean ✅ |
Total time: 16 hours over 2 days
Data lost: 0 bytes
Pool suspensions survived: 2
Gray hairs gained: Many
The Final Test: Restoration
With all 6 drives replaced, I restored normal operation:
```bash
# Remove emergency limits
# (0 lets ARC revert to its default size; 4 GiB and 5 s are the stock defaults
#  for zfs_dirty_data_max and zfs_txg_timeout)
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
echo 4294967296 | sudo tee /sys/module/zfs/parameters/zfs_dirty_data_max
echo 5 | sudo tee /sys/module/zfs/parameters/zfs_txg_timeout

# Re-enable services
sudo midclt call service.start cifs
```
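Before looking at the pool, a quick sanity check that the ARC ceiling really reverted (the arcstats kstat reports c_max in bytes; the path assumes OpenZFS on Linux):
```bash
# Current ARC ceiling in GiB; it should climb back toward the default after writing 0 above
awk '$1 == "c_max" { printf "%.0f GiB\n", $3 / 1024 / 1024 / 1024 }' /proc/spl/kstat/zfs/arcstats
```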
Verification:
```
zpool status DATA

  pool: DATA
 state: ONLINE
  scan: resilvered 1.10T in 01:07:47 with 0 errors

config:
        DATA                  ONLINE
          raidz1-0            ONLINE
            sdk (WD Red)      ONLINE
            sdc (MX500)       ONLINE
            sdg (WD Red)      ONLINE
            sdo (WD Red)      ONLINE
          raidz1-1            ONLINE
            sde (MX500)       ONLINE
            sdj (WD Red)      ONLINE
            sdh (WD Red)      ONLINE
            sdl (WD Red)      ONLINE

errors: No known data errors
```
Perfect. Zero errors. Six new drives. Mission accomplished.
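For extra reassurance, a final scrub re-reads every block and checks it against parity. Not strictly required after clean resilvers, but cheap insurance:
```bash
# Kick off a full scrub and keep an eye on progress
sudo zpool scrub DATA
watch -n 60 'zpool status DATA | grep -E "scan:|scrub"'
```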
Performance: The Numbers Don't Lie
Before vs After
| Metric | Before (BX500) | After (WD Red SA500) |
|---|---|---|
| Load average | 68.85 | 1.22 |
| Disk latency | 300-1100ms | 0.1-0.2ms |
| txg_sync blocks | Every 2-5 minutes | 0 (for 48h+) |
| Write amplification | 17.6x | ~2x (normal) |
| System state | Unusable | Perfect |
iostat Comparison
BEFORE:
```
Device   await    svctm    %util
sdk      984ms    324ms    92.43%   ← Disaster
sdl      982ms    126ms    81.24%   ← Disaster
sdj      979ms    43ms     32.87%   ← Still bad
```
AFTER:
```
Device   await    svctm    %util
sdk      0.13ms   0.13ms   0.01%    ← Perfect
sdl      0.12ms   0.12ms   0.00%    ← Perfect
sdj      0.10ms   0.10ms   0.00%    ← Perfect
sdg      0.13ms   0.13ms   0.00%    ← Perfect
sdh      0.11ms   0.11ms   0.00%    ← Perfect
sdo      0.13ms   0.13ms   0.00%    ← Perfect
```
Improvement: 3000x faster disk latencies
Real-World Impact
File copy test (1TB):
- Before: 45 minutes with frequent pauses
- After: 12 minutes, steady throughput
VM boot time:
- Before: 3-5 minutes (if it worked)
- After: 25 seconds
System responsiveness:
- Before: Freezes every few minutes
- After: Instant, no freezes
Conclusion: The €720 Lesson
What I learned:
- Budget consumer SSDs are not NAS drives
- Write amplification is a real problem
- Device identification is critical
- ZFS is incredibly resilient
- Patience and methodology prevent disasters
Current status (48h post-completion):
- Pool: ONLINE
- Errors: 0
- Load: 1.2
- Latency: 0.1ms
- Uptime: Rock solid
- Stress level: Finally normal
Would I do it again? No - I'd buy the right drives from the start.
Was it worth it? Absolutely. My NAS went from unusable to lightning-fast and I have a much deeper understanding of ZFS than I ever wanted.

Final Advice
If you have Crucial BX500 drives in your NAS:
- If write amplification > 5x, replace immediately
- If you see txg_sync blocks, it's already too late
- Don't wait for disaster
Check write amplification NOW:
```bash
sudo smartctl -a /dev/sdX | grep -E "246|247"
```
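To sweep every drive in one pass rather than one device at a time, a small loop does the job (the 246/247 attribute numbers match Crucial's reporting as shown earlier; other vendors use different IDs):
```bash
# Dump the write counters and wear indicator for every SATA disk in the box
for d in /dev/sd?; do
  echo "== $d =="
  sudo smartctl -a "$d" | grep -E "246|247|Percent_Lifetime_Remain"
done
```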
The €720 I spent on WD Red SA500 drives was expensive. But it was cheaper than data recovery and infinitely cheaper than losing irreplaceable data.
Buy the right drives. Your data is worth it.