When the array comes back online but the data no longer tells the truth
You replaced a failed drive in your RAID 60.
On paper, everything went as expected:
- rebuild started,
- array stayed online,
- no obvious new failures.
Then you noticed:
- directories missing,
- files opening with the wrong content,
- databases crashing or refusing to start,
- virtual machines that used to boot now stuck in loops.
The controller says the rebuild completed.
Your applications say: something is very wrong.
1. What Actually Broke (Plain English)
In RAID 60, replacing a drive triggers parity reconstruction inside the affected RAID-6 group while the other group(s) keep serving I/O.
If, before or during that rebuild:
- parity in that group was already mismatched,
- a survivor had latent sector errors,
- epochs or headers were out of sync with the other group,
- or previous power events left cache states inconsistent,
then the rebuild may have:
- written incorrect data back to the new drive,
- “fixed” stripes using the wrong parity math,
- zeroed or overwritten blocks the filesystem still depended on.
The end result: the RAID 60 structure may look formally “healthy,” but the content is internally corrupted.
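To make this concrete, here is a minimal Python sketch of the XOR (P-parity) half of RAID-6 reconstruction. It is an illustration of the arithmetic, not any controller's firmware logic; the chunk values and the stale-parity scenario are invented to show how a rebuild that trusts an out-of-date parity block silently writes the wrong content onto the replacement drive.

```python
# Illustrative only: RAID-6 keeps two syndromes per stripe (P = XOR of the data
# chunks, Q = a Reed-Solomon syndrome). A single lost chunk can be rebuilt from
# P alone, which is exactly why stale P is so dangerous.

def xor_chunks(*chunks: bytes) -> bytes:
    """Byte-wise XOR of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

# One stripe, three data chunks (toy sizes).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p_correct = xor_chunks(d0, d1, d2)     # parity written while everything was healthy

# Later, d2 is updated on disk but the matching parity update is lost
# (write hole / cache loss), so the on-disk parity is now stale.
d2_new = b"DDDD"
p_on_disk = p_correct                  # stale: still reflects the old d2

# The drive holding d2 fails and is replaced. The rebuild recomputes d2 from
# the survivors and the *on-disk* parity:
d2_rebuilt = xor_chunks(d0, d1, p_on_disk)

print(d2_rebuilt == d2_new)   # False: the rebuild resurrects the OLD content
print(d2_rebuilt)             # b'CCCC' -- cemented onto the new drive as "good"
```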
2. How This Shows Up to Users
Common post-replacement symptoms:
- Volume mounts, but:
  - directories are missing or incomplete,
  - file counts seem wrong,
  - files open with the wrong data,
  - backups fail verification.
- Databases:
  - fail integrity checks,
  - refuse to start,
  - or show index/table corruption.
- Virtual machines:
  - bluescreen or kernel panic,
  - reboot loops,
  - filesystem repair inside the guest OS fails repeatedly.
- Controller logs:
  - report prior parity inconsistencies,
  - aborted consistency checks,
  - or earlier degraded states you assumed were “resolved.”
The pattern: the array “works,” but the higher layers no longer trust what they read.
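One way to turn “files open with the wrong data” into a measurable signal is to hash a sample of files and compare them against any checksum manifest or older backup you still trust. A minimal, read-only sketch, assuming a `sha256sum`-style manifest exists (the path below is a placeholder):

```python
# Hedged sketch: compare current file hashes against a trusted sha256 manifest
# (format: "<hex digest>  <path>" per line, as produced by sha256sum).
# Read-only: this only reads files, it does not repair anything.
import hashlib
from pathlib import Path

MANIFEST = Path("trusted_manifest.sha256")   # placeholder path
mismatched, missing = [], []

for line in MANIFEST.read_text().splitlines():
    digest, _, name = line.partition("  ")
    path = Path(name)
    if not path.exists():
        missing.append(name)
        continue
    if hashlib.sha256(path.read_bytes()).hexdigest() != digest:
        mismatched.append(name)

print(f"{len(mismatched)} files differ from the manifest, {len(missing)} missing")
```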
3. Why Drive Replacement Can Trigger Visible Corruption
Drive replacement in a RAID 60 array assumes:
- the surviving members contain correct data and parity, and
- the group’s metadata and epochs are internally consistent.
When either assumption is false, the rebuild logic:
- recomputes stripes using already-bad parity,
- cements inconsistent blocks into the new member,
- and removes the last chance to read what was previously there.
From the controller’s perspective, the group is now “clean.”
From the filesystem’s perspective, too many critical regions no longer match any valid history.
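When the set is Linux md software RAID rather than a proprietary controller, the “epochs out of sync” idea can be checked directly: each member superblock carries an event counter, and members that disagree were not updated together. The sketch below is a hedged example that parses `mdadm --examine` output; the device paths are placeholders, and hardware-RAID metadata would need vendor tooling instead.

```python
# Hedged sketch: compare per-member event counters on a Linux md RAID-6 set.
# Members whose counters differ were not updated together -- a sign of mixed
# epochs. Device paths below are placeholders for your members.
import re
import subprocess

MEMBERS = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # placeholder paths

def member_events(device: str) -> int | None:
    """Return the md superblock event counter for one member, if readable."""
    try:
        out = subprocess.run(
            ["mdadm", "--examine", device],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    match = re.search(r"^\s*Events\s*:\s*(\d+)", out, re.MULTILINE)
    return int(match.group(1)) if match else None

counters = {dev: member_events(dev) for dev in MEMBERS}
print(counters)
if len({c for c in counters.values() if c is not None}) > 1:
    print("WARNING: members disagree on their event counter -- mixed epochs.")
```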
4. What NOT To Do
When corruption appears after a drive replacement:
Do not:
- run repeated filesystem repairs on the live array,
- start another rebuild,
- reinitialize or “expand” the array,
- replace yet another disk to “see if that helps,”
- move the disks to a new controller and import them as a “foreign” configuration,
- restore incomplete backups over the damaged volume.
All of these risk:
- amplifying corruption,
- overwriting surviving remnants of metadata,
- mixing multiple incompatible histories in one array.
5. Correct Post-Replacement Triage
Step 1 — Stop all non-essential writes
- Take critical applications offline.
- Avoid heavy I/O that touches more stripes.
Step 2 — Clone all member disks
- Capture exact post-rebuild state.
- Preserve the slot → serial → WWN mapping (an inventory sketch follows this step).
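A hedged sketch of the inventory step, using `lsblk` JSON output to record serials and WWNs before any disk leaves its slot. The column names are standard lsblk fields but should be verified on your system, and the slot numbers are entered by hand from the enclosure or controller UI (placeholder values shown):

```python
# Hedged sketch: dump a slot -> serial -> WWN map to JSON before cloning.
# The SLOT_BY_DEVICE table is filled in manually from the enclosure or
# controller UI (placeholder values shown). Image the disks themselves with a
# sector-level tool such as GNU ddrescue afterwards.
import json
import subprocess

SLOT_BY_DEVICE = {"sda": 0, "sdb": 1, "sdc": 2, "sdd": 3}  # placeholders

lsblk = json.loads(
    subprocess.run(
        ["lsblk", "-d", "-J", "-o", "NAME,MODEL,SERIAL,WWN,SIZE"],
        capture_output=True, text=True, check=True,
    ).stdout
)

inventory = []
for disk in lsblk["blockdevices"]:
    if disk["name"] in SLOT_BY_DEVICE:
        inventory.append({"slot": SLOT_BY_DEVICE[disk["name"]], **disk})

with open("member_inventory.json", "w") as fh:
    json.dump(sorted(inventory, key=lambda d: d["slot"]), fh, indent=2)
print(f"Recorded {len(inventory)} members.")
```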
Step 3 — Analyze each RAID-6 group separately
- Validate (a first-pass parity-scan sketch follows this step):
  - parity alignment,
  - stripe order,
  - epoch alignment,
  - presence of sectors that don’t match parity expectations.
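A hedged, first-pass version of the parity-alignment check, run against the cloned images rather than the live disks. The chunk size, member order, and P/Q rotation below are assumptions that must be confirmed against the real layout, and only the XOR (P) syndrome is verified; a real analysis also checks the Reed-Solomon Q syndrome:

```python
# Hedged sketch: first-pass P-parity scan across cloned member images of ONE
# RAID-6 group. Everything geometric here is an assumption: the chunk size,
# the member order in IMAGES, and the p_index()/q_index() rotation are
# placeholders, and the Q syndrome is ignored.

CHUNK = 64 * 1024                    # assumed chunk size
IMAGES = ["m0.img", "m1.img", "m2.img", "m3.img", "m4.img", "m5.img"]  # clones
N = len(IMAGES)

def p_index(stripe: int) -> int:
    """Assumed rotation: P walks backwards one member per stripe."""
    return (N - 1 - stripe) % N

def q_index(stripe: int) -> int:
    """Assumed rotation: Q sits on the member right after P (wrapping)."""
    return (p_index(stripe) + 1) % N

def xor(chunks: list[bytes]) -> bytes:
    acc = bytearray(chunks[0])
    for chunk in chunks[1:]:
        for i, b in enumerate(chunk):
            acc[i] ^= b
    return bytes(acc)

handles = [open(path, "rb") for path in IMAGES]
bad_stripes, stripe = [], 0
while True:
    chunks = [fh.read(CHUNK) for fh in handles]
    if any(len(c) < CHUNK for c in chunks):
        break                                        # end of the images
    p, q = p_index(stripe), q_index(stripe)
    data = [c for i, c in enumerate(chunks) if i not in (p, q)]
    if xor(data) != chunks[p]:
        bad_stripes.append(stripe)                   # P disagrees with the data
    stripe += 1
for fh in handles:
    fh.close()

print(f"{stripe} stripes scanned, {len(bad_stripes)} with a P-parity mismatch")
```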
Step 4 — Roll back to the last coherent parity epoch (virtually)
- Using the cloned images, reconstruct the array at the last mathematically consistent point you can prove.
- This may involve:
  - excluding known-bad sectors,
  - biasing toward the parity domain that predated the replacement event (a planning sketch follows this step).
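There is no single formula for choosing the rollback point, so the sketch below only illustrates the bookkeeping, not the decision itself. It assumes you already have a bad-sector map and the per-stripe parity verdicts from Step 3; the file names, their shapes, and the member index are invented for illustration:

```python
# Hedged sketch: turn Step-3 results into a per-stripe recovery plan for the
# virtual rebuild. The inputs and their shapes are invented for illustration:
#   parity_verdicts.json : {"<stripe number>": "ok" | "mismatch"}
#   bad_stripes.json     : {"<member index>": [stripes with latent sector errors]}
import json

NEW_MEMBER = 5   # index of the replaced drive within this RAID-6 group (assumed)

verdicts = json.load(open("parity_verdicts.json"))
bad = {int(m): set(s) for m, s in json.load(open("bad_stripes.json")).items()}

plan = {}
for stripe_str, verdict in verdicts.items():
    stripe = int(stripe_str)
    damaged = {m for m, stripes in bad.items() if stripe in stripes}
    if verdict == "ok" and not damaged:
        plan[stripe] = "trust-on-disk"                 # stripe is self-consistent
    elif damaged <= {NEW_MEMBER}:
        plan[stripe] = "reconstruct-from-survivors"    # bias toward pre-replacement data
    else:
        plan[stripe] = "flag-for-manual-analysis"      # survivors themselves are suspect

json.dump(plan, open("recovery_plan.json", "w"), indent=2)
print({label: list(plan.values()).count(label) for label in set(plan.values())})
```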
Step 5 — Mount and recover from the reconstructed image
- Work on the reconstructed array, not on the damaged live set.
- Run filesystem repair offline as needed.
- Extract data onto new storage (a read-only mount sketch follows this step).
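Once a reconstructed image exists, keep every mount read-only so nothing can write back into it. A minimal sketch using standard `losetup` and `mount` options; the image path and mount point are placeholders:

```python
# Hedged sketch: attach the reconstructed image read-only and mount it
# read-only. Paths are placeholders; run with sufficient privileges.
import os
import subprocess

IMAGE = "reconstructed_raid60.img"   # output of the virtual rebuild (placeholder)
MOUNTPOINT = "/mnt/recovered"        # placeholder mount point

# --read-only makes the loop device itself refuse writes; --find --show picks
# a free /dev/loopN and prints it.
loopdev = subprocess.run(
    ["losetup", "--read-only", "--find", "--show", IMAGE],
    capture_output=True, text=True, check=True,
).stdout.strip()

os.makedirs(MOUNTPOINT, exist_ok=True)
subprocess.run(["mount", "-o", "ro", loopdev, MOUNTPOINT], check=True)
print(f"{IMAGE} attached as {loopdev}, mounted read-only at {MOUNTPOINT}")
# Copy data out, then unmount and detach:
#   umount /mnt/recovered && losetup -d <loop device>
```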
Diagnostic Overview
- Device: RAID 60 array (two RAID-6 groups under RAID-0)
- Observed State: Array online after drive replacement, but files, databases, or VMs show corruption or wrong content
- Likely Cause: Rebuild ran against already mismatched parity or stale metadata, cementing bad stripes into the new drive
- Do NOT: Keep running filesystem repairs, rebuild again, reinitialize, or move disks to a new controller for another import
- Recommended Action: Clone all disks, reconstruct parity domains offline, identify last coherent epoch, and recover from a virtual RAID 60 image instead of the live array