SuperNova Disaster Recovery Guide

This document outlines the disaster recovery capabilities and procedures for SuperNova nodes. It is intended for system operators and administrators who need to manage recovery from data corruption or system failures.

Overview

SuperNova incorporates comprehensive disaster recovery mechanisms designed to detect, diagnose, and repair various types of data corruption. The system prioritizes data integrity and availability, with multiple strategies for recovery depending on the nature and severity of the issue.

Types of Corruption

| Corruption Type | Description | Severity |
| --------------- | ----------- | -------- |
| FileLevelCorruption | Physical damage to database files | Critical |
| RecordCorruption | Individual records within the database are corrupted | High |
| IndexCorruption | Database indexes are damaged but core data is intact | Medium |
| LogicalCorruption | Blockchain state is inconsistent with consensus rules | High |
| CheckpointCorruption | Recovery points are damaged or invalid | Medium |
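
To make the classification concrete, the sketch below restates the table as a small Rust enum with a severity helper. The types are illustrative only and are not SuperNova's actual API:

```rust
/// Illustrative sketch; not SuperNova's actual types.
#[derive(Debug, Clone, Copy)]
enum CorruptionType {
    FileLevelCorruption,  // Physical damage to database files
    RecordCorruption,     // Individual records are corrupted
    IndexCorruption,      // Indexes damaged, core data intact
    LogicalCorruption,    // State inconsistent with consensus rules
    CheckpointCorruption, // Recovery points damaged or invalid
}

#[derive(Debug, Clone, Copy)]
enum Severity {
    Medium,
    High,
    Critical,
}

impl CorruptionType {
    /// Severity ranking as given in the table above.
    fn severity(self) -> Severity {
        match self {
            CorruptionType::FileLevelCorruption => Severity::Critical,
            CorruptionType::RecordCorruption | CorruptionType::LogicalCorruption => Severity::High,
            CorruptionType::IndexCorruption | CorruptionType::CheckpointCorruption => Severity::Medium,
        }
    }
}

fn main() {
    let all = [
        CorruptionType::FileLevelCorruption,
        CorruptionType::RecordCorruption,
        CorruptionType::IndexCorruption,
        CorruptionType::LogicalCorruption,
        CorruptionType::CheckpointCorruption,
    ];
    for corruption in all {
        println!("{:?}: {:?}", corruption, corruption.severity());
    }
}
```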

Recovery Strategies

The system employs several recovery strategies based on the detected corruption:

RestoreFromBackup

Used for severe file-level corruption where the database files are extensively damaged.

Process:

  1. The system identifies the most recent valid backup
  2. The current database is archived (if possible)
  3. The backup is restored to the primary data location
  4. The node performs validation of the restored data
  5. The node rejoins the network and syncs any missing blocks

Configuration:

```toml
[storage]
backup_directory = "/path/to/backups"
backup_retention_days = 30
backup_frequency_hours = 6
```
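
To make the sequence concrete, here is a minimal Rust sketch of the restore flow under the assumption of one backup file per snapshot. It is not SuperNova's implementation: the database path, the `.corrupt.bak` archive suffix, and the use of modification time to pick the latest backup are all illustrative, and validity checks are elided.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of the RestoreFromBackup sequence described above.
fn restore_from_backup(backup_directory: &Path, db_path: &Path) -> io::Result<()> {
    // Step 1: identify the most recent backup (integrity checks elided here).
    let latest = fs::read_dir(backup_directory)?
        .filter_map(Result::ok)
        .map(|entry| entry.path())
        .filter(|path| path.is_file())
        .max_by_key(|path| fs::metadata(path).and_then(|m| m.modified()).ok())
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no backups found"))?;

    // Step 2: archive the current database, if it is still present.
    if db_path.exists() {
        let archived: PathBuf = db_path.with_extension("corrupt.bak");
        fs::rename(db_path, &archived)?;
    }

    // Step 3: restore the backup to the primary data location.
    fs::copy(&latest, db_path)?;

    // Steps 4-5 (validation and block resync) would follow here.
    Ok(())
}

fn main() -> io::Result<()> {
    // Paths are illustrative; adjust for your deployment.
    restore_from_backup(Path::new("/path/to/backups"), Path::new("/var/lib/supernova/db"))
}
```

A production restore would verify the chosen backup against a checksum manifest rather than trusting modification time alone.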

RebuildCorruptedRecords

Used when specific records are identified as corrupted but the overall database structure is intact.

Process:

  1. Corrupted records are identified and logged
  2. The system retrieves correct values from peers if available
  3. Records are reconstructed and validated
  4. Indexes are updated to reflect the repaired records
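
The per-record repair loop can be sketched as follows. This is not SuperNova's storage API: a `HashMap` stands in for the record store and a closure stands in for peer retrieval, with validation and index updates (steps 3-4) only noted in comments.

```rust
use std::collections::HashMap;

/// Illustrative repair loop; returns the IDs that could not be repaired.
fn rebuild_corrupted_records<F>(
    store: &mut HashMap<u64, Vec<u8>>,
    corrupted_ids: &[u64],
    fetch_from_peers: F,
) -> Vec<u64>
where
    F: Fn(u64) -> Option<Vec<u8>>,
{
    let mut unrepaired = Vec::new();
    for &id in corrupted_ids {
        match fetch_from_peers(id) {
            // A real implementation would validate the record (hashes,
            // signatures) before accepting it and then update indexes.
            Some(record) => {
                store.insert(id, record);
            }
            None => unrepaired.push(id),
        }
    }
    unrepaired
}

fn main() {
    let mut store = HashMap::from([(1, b"ok".to_vec()), (2, b"corrupt".to_vec())]);
    let unrepaired = rebuild_corrupted_records(&mut store, &[2], |_id| Some(b"repaired".to_vec()));
    println!("unrepaired records: {:?}", unrepaired);
}
```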

RebuildIndexes

Applied when database indexes are corrupted but the underlying data is valid.

Process:

  1. Damaged indexes are identified
  2. Read operations are routed to alternative access methods
  3. Indexes are rebuilt from primary data
  4. System validates the rebuilt indexes before resuming normal operation
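
As an illustration of step 3, the sketch below rebuilds a simple height-to-hash index from primary block data. The data layout is an assumption for the example, not SuperNova's actual index format.

```rust
use std::collections::BTreeMap;

/// Illustrative rebuild of a height -> block hash index from primary data.
fn rebuild_height_index(blocks: &[(u64, [u8; 32])]) -> BTreeMap<u64, [u8; 32]> {
    let mut index = BTreeMap::new();
    for &(height, hash) in blocks {
        // A real rebuild would stream blocks from disk and validate each entry.
        index.insert(height, hash);
    }
    index
}

fn main() {
    let blocks = vec![(0, [0u8; 32]), (1, [1u8; 32])];
    let index = rebuild_height_index(&blocks);
    // Step 4: validate the rebuilt index before resuming (here, a size check only).
    assert_eq!(index.len(), blocks.len());
    println!("rebuilt {} index entries", index.len());
}
```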

RevertToCheckpoint

Used when logical corruption is detected, allowing the system to roll back to a known-good state.

Process:

  1. The system identifies the most recent valid checkpoint before the corruption
  2. The current state is archived for later analysis
  3. The system reverts to the selected checkpoint
  4. The node rejoins the network and syncs from the checkpoint forward

Configuration:

```toml
[checkpoints]
checkpoint_frequency_blocks = 1000
max_checkpoint_age_days = 14
automatic_reversion = true
```
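
Checkpoint selection (step 1) can be illustrated with a short Rust sketch. The `Checkpoint` struct and its validity flag are assumptions for the example, not SuperNova's checkpoint format.

```rust
/// Illustrative checkpoint record; `valid` is the result of an integrity check.
#[derive(Debug, Clone, Copy)]
struct Checkpoint {
    height: u64,
    valid: bool,
}

/// Step 1: pick the most recent valid checkpoint strictly below the height
/// at which corruption was detected.
fn select_checkpoint(checkpoints: &[Checkpoint], corruption_height: u64) -> Option<Checkpoint> {
    checkpoints
        .iter()
        .copied()
        .filter(|c| c.valid && c.height < corruption_height)
        .max_by_key(|c| c.height)
}

fn main() {
    // With checkpoint_frequency_blocks = 1000, checkpoints land every 1000 blocks.
    let checkpoints = [
        Checkpoint { height: 1000, valid: true },
        Checkpoint { height: 2000, valid: true },
        Checkpoint { height: 3000, valid: false }, // a damaged checkpoint is skipped
    ];
    println!("revert to: {:?}", select_checkpoint(&checkpoints, 3500)); // height 2000
}
```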

RebuildChainState

The most intensive recovery method, used when other strategies fail.

Process:

  1. The blockchain is reconstructed from genesis
  2. Each block is reprocessed and validated
  3. A new UTXO set and other derived data are created
  4. The system performs integrity checks on the rebuilt state
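
Conceptually, this is a full replay from genesis. The sketch below shows the shape of that replay with a toy block type that only tracks spent and created outputs; real blocks carry full transactions and far richer validation.

```rust
use std::collections::HashSet;

/// Toy block for illustration; not SuperNova's block format.
struct Block {
    spends: Vec<String>,
    creates: Vec<String>,
}

/// Steps 1-3: replay every block from genesis and derive a fresh UTXO set.
/// Returns None if a block spends an output that does not exist.
fn rebuild_utxo_set(chain: &[Block]) -> Option<HashSet<String>> {
    let mut utxos = HashSet::new();
    for block in chain {
        for spent in &block.spends {
            // Reject the rebuild if the replayed chain is internally inconsistent.
            if !utxos.remove(spent) {
                return None;
            }
        }
        for created in &block.creates {
            utxos.insert(created.clone());
        }
    }
    Some(utxos)
}

fn main() {
    let chain = vec![
        Block { spends: vec![], creates: vec!["coinbase-0".into()] },
        Block { spends: vec!["coinbase-0".into()], creates: vec!["tx1-out0".into()] },
    ];
    // Step 4 (integrity checks on the rebuilt state) would go well beyond this print.
    println!("rebuilt UTXO set: {:?}", rebuild_utxo_set(&chain));
}
```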

Automated Recovery Process

SuperNova implements a CorruptionHandler that manages the detection and recovery process:

  1. Detection Phase: Regular integrity checks identify potential corruption
  2. Diagnosis Phase: The system determines the type and extent of corruption
  3. Strategy Selection: Based on the diagnosis, an appropriate recovery strategy is selected
  4. Recovery Execution: The selected strategy is executed with detailed logging
  5. Validation Phase: The repaired system undergoes integrity verification
  6. Reporting: Detailed reports are generated for operator review
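
The sketch below illustrates how the diagnosis result might drive strategy selection (phase 3). The `CorruptionHandler` name comes from this guide, but the types and the mapping for damaged checkpoints are assumptions of the example, not SuperNova's actual logic.

```rust
/// Illustrative phase flow only; SuperNova's real CorruptionHandler differs.
#[derive(Debug)]
enum RecoveryStrategy {
    RestoreFromBackup,
    RebuildCorruptedRecords,
    RebuildIndexes,
    RevertToCheckpoint,
    RebuildChainState,
}

#[derive(Debug, Clone, Copy)]
enum Corruption {
    FileLevel,
    Record,
    Index,
    Logical,
    Checkpoint,
}

struct CorruptionHandler;

impl CorruptionHandler {
    /// Phase 3: map the diagnosed corruption to a recovery strategy,
    /// mirroring the strategy descriptions earlier in this guide.
    fn select_strategy(&self, corruption: Corruption) -> RecoveryStrategy {
        match corruption {
            Corruption::FileLevel => RecoveryStrategy::RestoreFromBackup,
            Corruption::Record => RecoveryStrategy::RebuildCorruptedRecords,
            Corruption::Index => RecoveryStrategy::RebuildIndexes,
            Corruption::Logical => RecoveryStrategy::RevertToCheckpoint,
            // The guide does not specify a mapping for damaged checkpoints;
            // falling back to a backup restore is an assumption of this sketch.
            Corruption::Checkpoint => RecoveryStrategy::RestoreFromBackup,
        }
    }
}

fn main() {
    let handler = CorruptionHandler;
    // Phases 1-2 (detection and diagnosis) are assumed to have produced these results.
    for diagnosed in [
        Corruption::FileLevel,
        Corruption::Record,
        Corruption::Index,
        Corruption::Logical,
        Corruption::Checkpoint,
    ] {
        // Phases 4-6 would execute the strategy, re-verify integrity, and report.
        println!("{:?} -> {:?}", diagnosed, handler.select_strategy(diagnosed));
    }
    // RebuildChainState remains the last resort when a selected strategy fails.
    println!("fallback: {:?}", RecoveryStrategy::RebuildChainState);
}
```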

Manual Recovery Procedures

In cases where automated recovery fails, operators can invoke manual recovery:

```bash
# Check database integrity
supernova-cli db check-integrity

# Create a manual checkpoint
supernova-cli db checkpoint create --label "pre-manual-recovery"

# View available checkpoints
supernova-cli db checkpoint list

# Restore from a specific checkpoint
supernova-cli db checkpoint restore --id <checkpoint_id>

# Rebuild indexes
supernova-cli db rebuild-indexes

# Restore from backup
supernova-cli db restore --backup-path /path/to/backup
```

Best Practices for Operators

  1. Regular Backups: Configure automatic backups and periodically verify their integrity (see the freshness-check sketch after this list)
  2. Multiple Backup Locations: Store backups in geographically diverse locations
  3. Checkpoint Management: Regularly review available checkpoints and their validity
  4. Monitoring: Configure alerts for integrity check failures and corruption detection
  5. Testing: Periodically test recovery procedures in a staging environment
  6. Documentation: Maintain detailed logs of any recovery operations performed
  7. Node Redundancy: Deploy multiple nodes to ensure availability during recovery
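
As a concrete example of the first practice, the sketch below checks that the newest file in the backup directory is no staler than twice the configured cadence (backup_frequency_hours = 6 in the example configuration above). It is a minimal freshness check rather than full integrity verification, and the path is illustrative.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Illustrative only: reports the age of the newest file in the backup directory.
fn newest_backup_age(backup_directory: &Path) -> io::Result<Duration> {
    let newest = fs::read_dir(backup_directory)?
        .filter_map(|entry| {
            let meta = entry.ok()?.metadata().ok()?;
            if meta.is_file() { meta.modified().ok() } else { None }
        })
        .max()
        .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no backups found"))?;
    Ok(SystemTime::now().duration_since(newest).unwrap_or(Duration::ZERO))
}

fn main() -> io::Result<()> {
    let age = newest_backup_age(Path::new("/path/to/backups"))?;
    // Alert if the newest backup is older than twice the 6-hour cadence.
    let limit = Duration::from_secs(2 * 6 * 3600);
    if age > limit {
        eprintln!("WARNING: newest backup is {} hours old", age.as_secs() / 3600);
    } else {
        println!("backups look current ({} hours old)", age.as_secs() / 3600);
    }
    Ok(())
}
```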

Logging and Monitoring

SuperNova writes detailed logs during the recovery process:

```text
/var/log/supernova/recovery.log     # Main recovery log
/var/log/supernova/corruption.log   # Detailed corruption events
/var/log/supernova/checkpoints.log  # Checkpoint creation and restoration
```

Configure monitoring to alert on:

  • Failed integrity checks
  • Detection of any corruption
  • Failed recovery attempts
  • Recovery operations that succeed but take significantly longer than expected

Recovery Performance Considerations

| Recovery Strategy | Typical Duration | Network Impact |
| ----------------- | ---------------- | -------------- |
| RestoreFromBackup | Minutes to hours | Node offline during recovery |
| RebuildCorruptedRecords | Seconds to minutes | Degraded performance |
| RebuildIndexes | Minutes to hours | Read-heavy operations degraded |
| RevertToCheckpoint | Minutes | Node offline during recovery |
| RebuildChainState | Hours to days | Node offline during recovery |

Future Improvements

The SuperNova team is working on enhancing disaster recovery capabilities:

  • Implementation of a dedicated ResilienceManager with advanced self-healing
  • Distributed recovery across a cluster of nodes
  • Machine learning-based early corruption detection
  • Automated testing of recovery procedures