Variable descriptions for drive tickets

2025-03-03 19:14:29 -05:00
parent 0bf29c44e8
commit 2be4f9072c
1 changed file with 105 additions and 10 deletions
@@ -150,90 +150,185 @@ class SystemHealthMonitor:
        # Add SMART attribute explanations
        SMART_DESCRIPTIONS = {
            'Reported_Uncorrect': """
            Number of errors that could not be recovered using hardware ECC.
            Impact:
            - Indicates permanent data loss in affected sectors
            - High correlation with drive hardware failure
            - Critical reliability indicator
            Recommended Actions:
            1. Backup critical data immediately
            2. Check drive logs for related errors
            3. Plan for drive replacement
            4. Monitor for error count increases
            """,
            'Reallocated_Sector_Ct': """
            Number of sectors that have been reallocated due to errors.
            Impact:
            - High counts indicate degrading media
            - Each reallocation uses one of the drive's limited spare sectors
            - Rapid increases suggest accelerating drive wear
            Recommended Actions:
            1. Monitor rate of increase
            2. Check drive temperature
            3. Plan replacement if count grows rapidly
            """,
            'Current_Pending_Sector': """
            Sectors waiting to be reallocated due to read/write errors.
            Impact:
            - Indicates potentially unstable sectors
            - May result in data loss if unrecoverable
            - Should be monitored for increases
            Recommended Actions:
            1. Backup affected files
            2. Run extended SMART tests
            3. Monitor for conversion to reallocated sectors
            """,
            'Offline_Uncorrectable': """
            Count of uncorrectable errors detected during offline data collection.
            Impact:
            - Direct indicator of media reliability issues
            - May affect data integrity
            - High values suggest drive replacement needed
            """,
-            'Reported_Uncorrect': """
-            Number of errors that could not be recovered using hardware ECC.
-            - Critical indicator of drive health
-            - Directly impacts data reliability
+            Recommended Actions:
+            1. Run extended SMART tests
+            2. Check drive logs
+            3. Plan replacement if count is increasing
            - Any non-zero value requires attention
            """,
            'Spin_Retry_Count': """
            Number of spin start retry attempts.
            Impact:
            - Indicates potential motor or bearing issues
            - May predict imminent mechanical failure
            - Increasing values suggest degrading drive health
            Recommended Actions:
            1. Monitor for rapid increases
            2. Check drive temperature
            3. Plan replacement if count grows rapidly
            """,
            'Power_On_Hours': """
            Total number of hours the device has been powered on.
            Impact:
            - Normal aging metric
            - Used to gauge overall drive lifetime
            - Compare against the manufacturer's rated service life (MTBF is a fleet-wide statistic, not a per-drive lifetime)
            Recommended Actions:
            1. Compare to warranty period
            2. Plan replacement if approaching rated lifetime
            """,
            'Media_Wearout_Indicator': """
            Percentage of drive's rated life remaining (SSDs).
            Impact:
            - 100 indicates new drive
            - 0 indicates exceeded rated writes
            - Critical for SSD lifecycle management
            Recommended Actions:
            1. Plan replacement below 20%
            2. Monitor write workload
            3. Consider workload redistribution
            """,
            'Temperature_Celsius': """
            Current drive temperature.
            Impact:
            - High temperatures accelerate wear
            - Optimal range: 20-45°C
            - Sustained high temps reduce lifespan
            Recommended Actions:
            1. Check system cooling
            2. Verify airflow
            3. Monitor for sustained high temperatures
            """,
            'Available_Spare': """
            Percentage of spare blocks remaining (SSDs).
            Impact:
            - Critical for SSD endurance
            - Low values indicate approaching end-of-life
            - Rapid decreases suggest excessive writes
            Recommended Actions:
            1. Plan replacement if below 20%
            2. Monitor write patterns
            3. Consider workload changes
            """,
            'Program_Fail_Count': """
            Number of flash program operation failures.
            Impact:
            - Indicates NAND cell reliability
            - Important for SSD health assessment
            - Increasing values suggest flash degradation
            Recommended Actions:
            1. Monitor rate of increase
            2. Check firmware updates
            3. Plan replacement if rapidly increasing
            """,
            'Erase_Fail_Count': """
            Number of flash erase operation failures.
            Impact:
            - Related to NAND block health
            - Critical for SSD reliability
            - High counts suggest failing flash blocks
            Recommended Actions:
            1. Monitor count increases
            2. Check firmware version
            3. Plan replacement if count is high
            """,
            'Load_Cycle_Count': """
            Number of head load/unload cycles (heads parked and re-loaded).
            Impact:
            - Normal operation metric
            - High counts may indicate power management issues
            - Compare against rated cycles (typically 600k-1M)
            Recommended Actions:
            1. Review power management settings
            2. Monitor rate of increase
            3. Plan replacement near rated limit
            """,
            'Wear_Leveling_Count': """
            SSD block erase distribution metric.
            Impact:
            - Indicates wear pattern uniformity
            - Scale and direction are vendor-specific; check the drive's documentation
            - Critical for SSD longevity
            Recommended Actions:
            1. Monitor trend over time
            2. Compare with similar drives
            3. Check workload distribution
            """
        }
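Because each description is an indented triple-quoted string, the leading whitespace of every line is carried into the ticket text. A minimal sketch of one way to normalize it, assuming the standard-library `inspect.cleandoc` is acceptable here; the helper name `clean_description` and the sample string are hypothetical and not part of this commit:

```python
import inspect


def clean_description(raw: str) -> str:
    """Strip the common indentation and the surrounding blank lines."""
    return inspect.cleandoc(raw)


# Sample entry written the same way as the dictionary values above.
example = """
            Number of spin start retry attempts.
            Impact:
            - Indicates potential motor or bearing issues
            """

print(clean_description(example))
```

Applying such a helper when the description is appended would keep the dictionary readable in source while producing left-aligned text in the ticket.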
        # Add relevant SMART descriptions
        for attr in SMART_DESCRIPTIONS:
            if attr in issue:
                description += f"\n{attr}:\n{SMART_DESCRIPTIONS[attr]}\n"
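The loop above is a plain substring test of each attribute name against the issue text. A self-contained sketch of that matching, using a trimmed-down dictionary and a made-up issue string; `describe_issue` and the sample data are illustrative assumptions, not code from this commit:

```python
# Shortened stand-in for the full SMART_DESCRIPTIONS dictionary.
SMART_DESCRIPTIONS = {
    "Reallocated_Sector_Ct": "Number of sectors that have been reallocated due to errors.",
    "Current_Pending_Sector": "Sectors waiting to be reallocated due to read/write errors.",
    "Spin_Retry_Count": "Number of spin start retry attempts.",
}


def describe_issue(issue: str) -> str:
    """Collect the description of every attribute whose name occurs in `issue`."""
    description = ""
    for attr, text in SMART_DESCRIPTIONS.items():
        if attr in issue:  # substring match, the same test the loop above uses
            description += f"\n{attr}:\n{text}\n"
    return description


# Hypothetical issue string; the real format is not shown in this hunk.
issue = "SMART issue on /dev/sda: Reallocated_Sector_Ct=12, Spin_Retry_Count=3"
print(describe_issue(issue))
```

Since the match is by substring, an issue string only picks up a description when it contains the attribute name spelled exactly as the dictionary key.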
        if "SMART" in issue:
            description += """
-            SMART (Self-Monitoring, Analysis, and Reporting Technology) issues indicate potential drive reliability problems.
-            - Reallocated sectors indicate bad blocks that have been remapped
+            SMART (Self-Monitoring, Analysis, and Reporting Technology) Attribute Details:
+            - Possible drive failure!
            - Pending sectors are potentially failing blocks waiting to be remapped
            - Uncorrectable errors indicate data that could not be read
            """
        if "Temperature" in issue: