Variable descriptions for drive tickets
This commit is contained in:
115
hwmonDaemon.py
115
hwmonDaemon.py
@ -150,90 +150,185 @@ class SystemHealthMonitor:
|
|||||||
|
|
||||||
# Add SMART attribute explanations
|
# Add SMART attribute explanations
|
||||||
SMART_DESCRIPTIONS = {
|
SMART_DESCRIPTIONS = {
|
||||||
|
'Reported_Uncorrect': """
|
||||||
|
Number of errors that could not be recovered using hardware ECC.
|
||||||
|
Impact:
|
||||||
|
- Indicates permanent data loss in affected sectors
|
||||||
|
- High correlation with drive hardware failure
|
||||||
|
- Critical reliability indicator
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Backup critical data immediately
|
||||||
|
2. Check drive logs for related errors
|
||||||
|
3. Plan for drive replacement
|
||||||
|
4. Monitor for error count increases
|
||||||
|
""",
|
||||||
|
|
||||||
'Reallocated_Sector_Ct': """
|
'Reallocated_Sector_Ct': """
|
||||||
Number of sectors that have been reallocated due to errors.
|
Number of sectors that have been reallocated due to errors.
|
||||||
|
Impact:
|
||||||
- High counts indicate degrading media
|
- High counts indicate degrading media
|
||||||
- Each reallocation uses one of the drive's limited spare sectors
|
- Each reallocation uses one of the drive's limited spare sectors
|
||||||
- Rapid increases suggest accelerating drive wear
|
- Rapid increases suggest accelerating drive wear
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Monitor rate of increase
|
||||||
|
2. Check drive temperature
|
||||||
|
3. Plan replacement if count grows rapidly
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Current_Pending_Sector': """
|
'Current_Pending_Sector': """
|
||||||
Sectors waiting to be reallocated due to read/write errors.
|
Sectors waiting to be reallocated due to read/write errors.
|
||||||
|
Impact:
|
||||||
- Indicates potentially unstable sectors
|
- Indicates potentially unstable sectors
|
||||||
- May result in data loss if unrecoverable
|
- May result in data loss if unrecoverable
|
||||||
- Should be monitored for increases
|
- Should be monitored for increases
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Backup affected files
|
||||||
|
2. Run extended SMART tests
|
||||||
|
3. Monitor for conversion to reallocated sectors
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Offline_Uncorrectable': """
|
'Offline_Uncorrectable': """
|
||||||
Count of uncorrectable errors detected during offline data collection.
|
Count of uncorrectable errors detected during offline data collection.
|
||||||
|
Impact:
|
||||||
- Direct indicator of media reliability issues
|
- Direct indicator of media reliability issues
|
||||||
- May affect data integrity
|
- May affect data integrity
|
||||||
- High values suggest drive replacement needed
|
- High values suggest drive replacement needed
|
||||||
""",
|
|
||||||
|
|
||||||
'Reported_Uncorrect': """
|
Recommended Actions:
|
||||||
Number of errors that could not be recovered using hardware ECC.
|
1. Run extended SMART tests
|
||||||
- Critical indicator of drive health
|
2. Check drive logs
|
||||||
- Directly impacts data reliability
|
3. Plan replacement if count is increasing
|
||||||
- Any non-zero value requires attention
|
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Spin_Retry_Count': """
|
'Spin_Retry_Count': """
|
||||||
Number of spin start retry attempts.
|
Number of spin start retry attempts.
|
||||||
|
Impact:
|
||||||
- Indicates potential motor or bearing issues
|
- Indicates potential motor or bearing issues
|
||||||
- May predict imminent mechanical failure
|
- May predict imminent mechanical failure
|
||||||
- Increasing values suggest degrading drive health
|
- Increasing values suggest degrading drive health
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Monitor for rapid increases
|
||||||
|
2. Check drive temperature
|
||||||
|
3. Plan replacement if count grows rapidly
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Power_On_Hours': """
|
'Power_On_Hours': """
|
||||||
Total number of hours the device has been powered on.
|
Total number of hours the device has been powered on.
|
||||||
|
Impact:
|
||||||
- Normal aging metric
|
- Normal aging metric
|
||||||
- Used to gauge overall drive lifetime
|
- Used to gauge overall drive lifetime
|
||||||
- Compare against manufacturer's MTBF rating
|
- Compare against manufacturer's MTBF rating
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Compare to warranty period
|
||||||
|
2. Plan replacement if approaching rated lifetime
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Media_Wearout_Indicator': """
|
'Media_Wearout_Indicator': """
|
||||||
Percentage of drive's rated life remaining (SSDs).
|
Percentage of drive's rated life remaining (SSDs).
|
||||||
|
Impact:
|
||||||
- 100 indicates new drive
|
- 100 indicates new drive
|
||||||
- 0 indicates exceeded rated writes
|
- 0 indicates exceeded rated writes
|
||||||
- Critical for SSD lifecycle management
|
- Critical for SSD lifecycle management
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Plan replacement below 20%
|
||||||
|
2. Monitor write workload
|
||||||
|
3. Consider workload redistribution
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Temperature_Celsius': """
|
'Temperature_Celsius': """
|
||||||
Current drive temperature.
|
Current drive temperature.
|
||||||
|
Impact:
|
||||||
- High temperatures accelerate wear
|
- High temperatures accelerate wear
|
||||||
- Optimal range: 20-45°C
|
- Optimal range: 20-45°C
|
||||||
- Sustained high temps reduce lifespan
|
- Sustained high temps reduce lifespan
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Check system cooling
|
||||||
|
2. Verify airflow
|
||||||
|
3. Monitor for sustained high temperatures
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Available_Spare': """
|
'Available_Spare': """
|
||||||
Percentage of spare blocks remaining (SSDs).
|
Percentage of spare blocks remaining (SSDs).
|
||||||
|
Impact:
|
||||||
- Critical for SSD endurance
|
- Critical for SSD endurance
|
||||||
- Low values indicate approaching end-of-life
|
- Low values indicate approaching end-of-life
|
||||||
- Rapid decreases suggest excessive writes
|
- Rapid decreases suggest excessive writes
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Plan replacement if below 20%
|
||||||
|
2. Monitor write patterns
|
||||||
|
3. Consider workload changes
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Program_Fail_Count': """
|
'Program_Fail_Count': """
|
||||||
Number of flash program operation failures.
|
Number of flash program operation failures.
|
||||||
|
Impact:
|
||||||
- Indicates NAND cell reliability
|
- Indicates NAND cell reliability
|
||||||
- Important for SSD health assessment
|
- Important for SSD health assessment
|
||||||
- Increasing values suggest flash degradation
|
- Increasing values suggest flash degradation
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Monitor rate of increase
|
||||||
|
2. Check firmware updates
|
||||||
|
3. Plan replacement if rapidly increasing
|
||||||
""",
|
""",
|
||||||
|
|
||||||
'Erase_Fail_Count': """
|
'Erase_Fail_Count': """
|
||||||
Number of flash erase operation failures.
|
Number of flash erase operation failures.
|
||||||
|
Impact:
|
||||||
- Related to NAND block health
|
- Related to NAND block health
|
||||||
- Critical for SSD reliability
|
- Critical for SSD reliability
|
||||||
- High counts suggest failing flash blocks
|
- High counts suggest failing flash blocks
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Monitor count increases
|
||||||
|
2. Check firmware version
|
||||||
|
3. Plan replacement if count is high
|
||||||
|
""",
|
||||||
|
|
||||||
|
'Load_Cycle_Count': """
|
||||||
|
Number of power cycles and head load/unload events.
|
||||||
|
Impact:
|
||||||
|
- Normal operation metric
|
||||||
|
- High counts may indicate power management issues
|
||||||
|
- Compare against rated cycles (typically 600k-1M)
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Review power management settings
|
||||||
|
2. Monitor rate of increase
|
||||||
|
3. Plan replacement near rated limit
|
||||||
|
""",
|
||||||
|
|
||||||
|
'Wear_Leveling_Count': """
|
||||||
|
SSD block erase distribution metric.
|
||||||
|
Impact:
|
||||||
|
- Indicates wear pattern uniformity
|
||||||
|
- Higher values show more balanced wear
|
||||||
|
- Critical for SSD longevity
|
||||||
|
|
||||||
|
Recommended Actions:
|
||||||
|
1. Monitor trend over time
|
||||||
|
2. Compare with similar drives
|
||||||
|
3. Check workload distribution
|
||||||
"""
|
"""
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Add relevant SMART descriptions
|
||||||
|
for attr in SMART_DESCRIPTIONS:
|
||||||
|
if attr in issue:
|
||||||
|
description += f"\n{attr}:\n{SMART_DESCRIPTIONS[attr]}\n"
|
||||||
|
|
||||||
if "SMART" in issue:
|
if "SMART" in issue:
|
||||||
description += """
|
description += """
|
||||||
SMART (Self-Monitoring, Analysis, and Reporting Technology) issues indicate potential drive reliability problems.
|
SMART (Self-Monitoring, Analysis, and Reporting Technology) Attribute Details:
|
||||||
- Reallocated sectors indicate bad blocks that have been remapped
|
- Possible drive failure!
|
||||||
- Pending sectors are potentially failing blocks waiting to be remapped
|
|
||||||
- Uncorrectable errors indicate data that could not be read
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
if "Temperature" in issue:
|
if "Temperature" in issue:
|
||||||
|
|||||||
Reference in New Issue
Block a user