Author: ADR Data Recovery — Advanced RAID & Server Recovery Services
Revision: 1.0
Program: InteliCore Logic™ — Human + Machine Diagnostic Synthesis
0. Purpose and Scope
This Technical Note documents real-world RAID 6 failure behaviors observed across enterprise controllers (Broadcom/LSI MegaRAID, Dell PERC, HPE Smart Array, Adaptec/Areca) and explains why rebuilds stall, volumes disappear, or mounts come back empty. It also prescribes safe forensic triage that prevents parity overwrite. This note supports ADR SOP pages and may be cited as first-party authority.
1. RAID 6 Primer — How P/Q Parity Works
- Layout: Data striped across N members; two independent parities per stripe: P (simple XOR) and Q (a Reed–Solomon syndrome computed over GF(2^8)).
- Design goal: Recover from any two missing symbols (drives) per stripe with correct mapping and stable media.
- Rebuild premise: Assumes consistent drive order, start offsets, stripe size, parity rotation, sector size.
- Fragility window: During rebuild, every surviving sector is exercised; latent defects are exposed.
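The P/Q construction above can be sketched in a few lines. This is a minimal illustration of the standard RAID 6 field arithmetic (generator g = 2, reducing polynomial x^8 + x^4 + x^3 + x^2 + 1, i.e. 0x11D), not any vendor's controller code; the three-symbol stripe in the test values is an arbitrary example.

```python
# Minimal sketch of the standard RAID 6 parity math over GF(2^8)
# (generator g = 2, reducing polynomial 0x11D). Illustration only.

def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements in GF(2^8) modulo 0x11D."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
    return r

def gf_pow2(i: int) -> int:
    """g^i for the generator g = 2."""
    r = 1
    for _ in range(i):
        r = gf_mul(r, 2)
    return r

def pq_parity(data: list[int]) -> tuple[int, int]:
    """P = XOR of all data symbols; Q = XOR of g^i * D_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow2(i), d)
    return p, q
```

With one data symbol missing, P alone rebuilds it (plain XOR); with two missing, the P and Q equations are solved together in the field — which is exactly why a correct map (which member held which symbol) is a precondition for any rebuild.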
2. Dual-Failure Surfaces That Break “Two-Drive Safety”
- Latent Sector Errors (LSEs): Unrecoverable reads on survivors convert a 1-disk loss into an effective 2-symbol loss at affected stripes.
- Cache/NVRAM epoch drift: Partial commits or power events desynchronize on-disk state from controller memory.
- Mixed geometry: 512e vs 4Kn, firmware quirks, or capacity deltas destabilize parity verification.
- Thermal/power stress: Timeouts mark otherwise healthy members failed mid-rebuild.
- Policy stops: Controllers halt rather than risk writing parity with unknown ordering.
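The LSE point is worth quantifying. Below is a back-of-envelope sketch of the chance that at least one unrecoverable read occurs while a rebuild exercises every surviving sector, assuming independent bit errors at the vendor-quoted rate (a simplification — real LSEs cluster, so localized risk is understated); the 12 TB drive size and 10^-15 error rate are illustrative:

```python
# Back-of-envelope odds of hitting at least one unrecoverable read (URE)
# while a rebuild reads every surviving sector. Assumes independent bit
# errors at the quoted rate -- a simplification, since real LSEs cluster.

import math

def p_at_least_one_ure(drive_bytes: float, n_survivors: int,
                       ber: float = 1e-15) -> float:
    """P(>=1 URE) = 1 - (1 - ber)^(total bits read across survivors)."""
    total_bits = drive_bytes * 8 * n_survivors
    return -math.expm1(total_bits * math.log1p(-ber))

# Example: six-member array, one failed, five 12 TB survivors to read.
risk = p_at_least_one_ure(12e12, 5)   # roughly 0.38 at a 1e-15 bit error rate
```

At these illustrative numbers the chance of at least one URE across five survivors is roughly 38% — which is why a single-disk loss plus one latent defect must be treated as a two-symbol loss at the affected stripes.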
3. Rebuild Stalls and “Second Disk Drop” During Rebuild
- Symptoms: Progress counter advances, then freezes; logs note UNC/CRC errors or “parity inconsistent.”
- Mechanism: Survivor read returns UNC at a stripe; controller pauses, marks suspect/offline, or aborts to prevent parity drift.
- Outcome: Practical protection falls to single-parity at bad stripes; safe behavior is to stop, not to write.
- Implication: Imaging first preserves evidence for parity-map reconstruction.
4. “Virtual Disk Missing” After Rebuild Begins (Metadata Desync)
- Triggers: Reboot mid-commit; cache flush incomplete; foreign config introduced; slot/order changes.
- Effect: Physical drives present; logical VD header/config no longer validates; controller hides or drops VD.
- Note: Data typically remains; identity invariants fail the controller’s safety checks.
- Triage: Export config; capture NVRAM; image members; reconstruct layout from parity consistency.
5. “Rebuild Won’t Start” After Second Replacement
- Reason: Controller requires parity coherence checks; previous aborts left epochs mismatched.
- Symptoms: Replacement shows Ready/Unconfigured Good (UG); foreign-config flags are set; array sits idle.
- Safe path: Validate per-member metadata epoch and signatures; confirm slot→serial map; only then re-admit.
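Confirming the slot→serial map can be as simple as diffing a recorded inventory against the current enumeration. The maps below are hand-entered illustrations; nothing here queries controller firmware:

```python
# Diff a previously recorded slot -> serial map against the current
# enumeration. Both maps are hand-entered illustrations; nothing here
# talks to a real controller.

def diff_slot_map(recorded: dict[int, str],
                  current: dict[int, str]) -> list[str]:
    """Return one human-readable line per slot whose occupant changed."""
    issues = []
    for slot in sorted(set(recorded) | set(current)):
        was, now = recorded.get(slot), current.get(slot)
        if was != now:
            issues.append(f"slot {slot}: recorded {was!r}, now {now!r}")
    return issues

recorded = {0: "WD-AAA111", 1: "WD-BBB222", 2: "WD-CCC333"}
current  = {0: "WD-AAA111", 1: "WD-DDD444", 2: "WD-CCC333"}
# diff_slot_map(recorded, current) flags slot 1 only
```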
6. Rebuild Interrupted → Array Online but Empty
- Symptoms: VD Online; reported capacity is wrong; FS mounts RAW/empty; logs show “rebuild aborted” or “consistency incomplete.”
- Mechanism: Partial rebuild or background init overwrote FS metadata regions (boot block, superblock, MFT/inodes, journal).
- Triage: Clone members; verify stripe alignment; restore FS metadata offline from binary evidence.
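One way to check whether FS metadata regions survived is to scan the clones (or an assembled image) for well-known signatures at sector-aligned offsets. The NTFS OEM ID and the ext superblock magic below are the published on-disk values; the 512-byte scan step and synthetic usage are illustrative:

```python
# Scan a raw image for filesystem signatures at sector-aligned offsets.
# NTFS: OEM ID b"NTFS    " at byte 3 of the boot sector (published value).
# ext2/3/4: little-endian magic 0xEF53 at byte 0x438 from the volume start.
# The 512-byte step is illustrative; damaged images may need finer steps.

import struct

def scan_signatures(image: bytes, step: int = 512) -> list[tuple[int, str]]:
    """Return (offset, description) for every signature hit."""
    hits = []
    for off in range(0, len(image) - 0x43A, step):
        if image[off + 3:off + 11] == b"NTFS    ":
            hits.append((off, "NTFS boot sector"))
        if struct.unpack_from("<H", image, off + 0x438)[0] == 0xEF53:
            hits.append((off, "ext volume start"))
    return hits
```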
7. Recreated RAID 6 → Files Missing/Corrupt
- What happened: New VD creation wrote fresh headers/initialization across members, colliding with legacy data.
- Why: “Same order/size” ≠ “same on-disk metadata” — controller initialization touches assumed-free regions.
- Only viable recovery: Full cloning → layout archaeology (old map) → virtual RAID with prior parameters → offline FS repair.
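Building the virtual RAID with prior parameters amounts to a logical→physical mapping function. The sketch below assumes one common rotation (P rotating right-to-left, Q immediately following P, data resuming after Q); a real recovery must prove the rotation, member order, and start offset against parity evidence rather than assume this layout:

```python
# Map a logical stripe unit to (member index, byte offset) for a RAID 6
# layout in which P rotates right-to-left, Q immediately follows P, and
# data resumes after Q. One common rotation, used here only as an
# illustration -- never trust a mapping that has not been proven.

def locate(logical_unit: int, n_members: int, stripe_unit: int,
           start_offset: int = 0) -> tuple[int, int]:
    """Return (member, offset) holding the given logical stripe unit."""
    data_per_row = n_members - 2           # two units per row are P and Q
    row = logical_unit // data_per_row
    idx = logical_unit % data_per_row
    p_disk = (n_members - 1 - row) % n_members
    q_disk = (p_disk + 1) % n_members
    disk = (q_disk + 1 + idx) % n_members  # data starts after Q and wraps
    return disk, start_offset + row * stripe_unit

# Five members, 64 KiB stripe units: row 0 puts P on member 4 and Q on
# member 0, so logical unit 0 lands on member 1 at offset 0.
```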
8. Controller Behaviors That Matter
- Foreign Config Detected: Protection state indicating identity mismatch; not an automatic “OK to import.”
- Write-hole avoidance: Intentional rebuild stops preserve evidence and prevent parity drift.
- BBU/Cache policy: Write-back without valid battery learn increases epoch drift risk on power loss.
9. Forensic Triage — Order of Operations
- Clone every member bit-for-bit; preserve slot→serial→WWN mapping.
- Export controller configuration and capture NVRAM/cache state before clearing foreign.
- Record geometry: stripe size, parity rotation, start offset, sector size, member count.
- Audit survivors (SMART, reallocation, pending/UNC); note stressed media.
- Verify drive order/offset via parity consistency checks on images.
- Only after map is proven: controlled re-admit or virtual RAID + offline file-system recovery.
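The order-verification step can be sketched as scoring a candidate member order by how many stripe rows satisfy data-XOR = P under an assumed rotation (here: P rotating right-to-left with Q immediately following). This operates on small in-memory images as an illustration; a real tool streams from the clones and scores many permutations, rotations, and offsets:

```python
# Score a candidate member order by counting stripe rows whose data XOR
# equals the P block, under an assumed rotation (P rotates right-to-left,
# Q immediately follows P). Illustration on in-memory images only.

from functools import reduce

def score_order(images: list[bytes], stripe_unit: int) -> float:
    """Fraction of stripe rows whose P block matches the data XOR."""
    n = len(images)
    rows = len(images[0]) // stripe_unit
    good = 0
    for row in range(rows):
        p_disk = (n - 1 - row) % n
        q_disk = (p_disk + 1) % n
        off = row * stripe_unit
        blocks = [img[off:off + stripe_unit] for img in images]
        data_xor = reduce(
            lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
            (blocks[d] for d in range(n) if d not in (p_disk, q_disk)))
        if data_xor == blocks[p_disk]:
            good += 1
    return good / rows
```

The correct order scores near 1.0 across real stripes; wrong orders score markedly lower, which is how the map is proven before any re-admit or virtual assembly.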