13 Commits

03374fa784 Add USB drive SMART support with multiple bridge chipset attempts
**Issue**: osd.2 is a USB-connected 1TB drive whose SMART data couldn't be read

Error was: "Read Device Identity failed: scsi error unsupported field"
This is typical for USB-attached drives that need bridge-specific flags.

**Solution**: Added USB transport detection and multiple fallback methods:
- SAT (SCSI-ATA Translation) - most common USB bridges
- usbjmicron - JMicron USB bridge chipsets
- usbcypress - Cypress USB bridge chipsets
- Generic USB fallback
- SCSI passthrough

Also added a USB/SAT attempt for unknown transport types as a fallback.
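The fallback chain above can be sketched as a loop over smartctl's standard bridge device types. This is a minimal illustration, not the script's actual code; the try order, the `read_usb_smart` name, and the "usable reply" heuristic are assumptions:

```python
import json
import subprocess

# Bridge/device-type flags tried in order. These are standard smartmontools
# device types; the exact order here is an assumption.
USB_DEVICE_TYPES = ["sat", "usbjmicron", "usbcypress", "scsi"]

def read_usb_smart(device, runner=None):
    """Try each smartctl device-type flag until one yields usable SMART JSON.

    `runner` is injectable for testing; by default it invokes smartctl.
    """
    if runner is None:
        def runner(dev_type):
            result = subprocess.run(
                ["smartctl", "-a", "--json", "-d", dev_type, device],
                capture_output=True, text=True)
            return result.stdout
    for dev_type in USB_DEVICE_TYPES:
        try:
            data = json.loads(runner(dev_type) or "{}")
        except json.JSONDecodeError:
            continue
        # Treat a reply with a device section and no error messages as usable.
        if data.get("device") and not data.get("smartctl", {}).get("messages"):
            return dev_type, data
    return None, None  # USB bridge incompatible with smartmontools
```

The injectable `runner` keeps the retry logic testable without real hardware.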

**Debug Enhancement**:
- Now shows detected transport type in debug output
- Helps diagnose why SMART fails

**Note**: USB drives in Ceph clusters are unconventional but functional.
This OSD appears to be temporary/supplemental storage capacity.

If SMART still fails after this update, the USB bridge may be incompatible
with smartmontools, which is acceptable for temporary storage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:16:35 -05:00
3d498a4092 CRITICAL FIX: Parse nested Ceph device health metrics for NVMe drives
**Root Cause Found**: All 6 NVMe SMART failures were due to a parsing bug!

Ceph's `device query-daemon-health-metrics` returns data in nested format:
```json
{
  "DEVICE_ID": {
    "nvme_smart_health_information_log": { ... }
  }
}
```

The script was checking for `nvme_smart_health_information_log` at the top
level, so it always failed and fell back to SSH smartctl (which also failed).

**Fix**:
- Extract first device entry from nested dict structure
- Maintain backward compatibility for direct format
- Now correctly parses NVMe SMART from Ceph's built-in metrics
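The fix amounts to one level of unwrapping. A minimal sketch (the function name and return convention are illustrative, not the script's actual code):

```python
import json

def extract_nvme_smart(metrics_json):
    """Return the NVMe SMART log from `ceph device query-daemon-health-metrics`
    output, handling both the nested per-device format and the flat format."""
    data = json.loads(metrics_json)
    # Backward-compatible direct format: log already at the top level.
    if "nvme_smart_health_information_log" in data:
        return data["nvme_smart_health_information_log"]
    # Nested format: the log sits one level down, keyed by device ID.
    for entry in data.values():
        if isinstance(entry, dict) and "nvme_smart_health_information_log" in entry:
            return entry["nvme_smart_health_information_log"]
    return None
```

Checking the flat format first preserves backward compatibility; the nested branch just takes the first device entry that carries the log.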

**Expected Impact**:
- All 6 NVMe drives will now successfully read SMART data
- Should drop from "CRITICAL: No SMART data" to proper health scores
- Only truly healthy NVMe drives will show 100/100 health
- Failing NVMe drives will be properly detected and ranked

**Testing**:
Verified `ceph device query-daemon-health-metrics osd.0` returns full
NVMe SMART data including:
- available_spare: 100%
- percentage_used: 12%
- media_errors: 0
- temperature: 38°C

This data was always available but wasn't being parsed!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:11:53 -05:00
35a16a1793 Fix reallocated sector scoring - drives with bad sectors now rank correctly
**Problem**: osd.28, with 16 reallocated sectors, ranked only #7 with a score of 40.8.
This is a CRITICAL failing drive that should rank just below failed SMART reads.

**Changes**:
- Reallocated sectors now use tiered penalties:
  * 10+ sectors: -95 points (health = 5/100) - DRIVE FAILING
  * 5-9 sectors: -85 points (health = 15/100) - CRITICAL
  * 1-4 sectors: -70 points (health = 30/100) - SERIOUS
- Added critical_issues detection for sector problems
- Critical issues get a +20 scoring bonus (large drives) or +25 (small drives)
- Updated issue text to "DRIVE FAILING" for clarity
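The tiered penalties translate directly into a threshold ladder. A sketch under the thresholds listed above (the function name and `(health, issue)` return shape are assumptions):

```python
def reallocated_sector_health(sectors):
    """Tiered health score for reallocated sectors: health starts at 100 and
    the penalty grows with how many sectors have been remapped."""
    if sectors >= 10:
        return 5, "DRIVE FAILING"   # -95 points
    if sectors >= 5:
        return 15, "CRITICAL"       # -85 points
    if sectors >= 1:
        return 30, "SERIOUS"        # -70 points
    return 100, None                # no reallocations
```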

**Expected Result**:
- osd.28 will now score ~96/100 and rank #7 (right after 6 failed SMART)
- Any drive with reallocated/pending/uncorrectable sectors gets top priority
- Matches priority: Failed SMART > Critical sectors > Small failing > Rest

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:08:46 -05:00
1848b71c2a Optimize OSD analyzer: prioritize failing drives and improve SMART collection
Major improvements to scoring and data collection:

**Scoring Changes:**
- Failed SMART reads now return 0/100 health (was 50/100)
- Critical health issues get much higher penalties:
  * Reallocated sectors: -50 pts, 5x multiplier (was -20, 2x)
  * Pending sectors: -60 pts, 10x multiplier (was -25, 5x)
  * Uncorrectable sectors: -70 pts, 15x multiplier (was -30, 5x)
  * NVMe media errors: -60 pts, 10x multiplier (was -25, 5x)
- Revised weights: 80% health, 15% capacity, 5% resilience (was 60/30/10)
- Added priority bonuses:
  * Failed SMART + small drive (<5TB): +30 points
  * Failed SMART alone: +20 points
  * Health issues + small drive: +15 points

**Priority Order Now Enforced:**
1. Failed SMART drives (score 90-100)
2. Small drives beginning to fail (70-85)
3. Small healthy drives (40-60)
4. Large failing drives (60-75)

**Enhanced SMART Collection:**
- Added metadata.devices field parsing
- Enhanced dm-device and /dev/mapper/ resolution
- Added ceph-volume lvm list fallback
- Retry logic with 3 command variations per device
- Try with/without sudo, different device flags
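The retry logic described above (multiple command variations per device, with and without sudo) can be sketched as a small generator plus a collector. The specific flag variations and function names here are illustrative assumptions:

```python
import subprocess

def smartctl_variants(device):
    """Yield smartctl command variations for one device: each device flag,
    each tried with and without sudo. Ordering is illustrative."""
    flag_sets = [[], ["-d", "sat"], ["-d", "auto"]]
    for prefix in ([], ["sudo"]):
        for extra in flag_sets:
            yield prefix + ["smartctl", "-a", "--json"] + extra + [device]

def collect_smart(device, run=None):
    """Return the first successful smartctl output, or None if every
    variation fails. `run` is injectable for testing."""
    if run is None:
        def run(cmd):
            r = subprocess.run(cmd, capture_output=True, text=True)
            return r.stdout if r.returncode == 0 else None
    for cmd in smartctl_variants(device):
        out = run(cmd)
        if out:
            return out
    return None
```

Stopping at the first success keeps the per-device cost low while still covering drives that only answer with elevated privileges or a specific device flag.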

**Expected Impact:**
- osd.28 with reallocated sectors jumps from #14 to top 3
- SMART collection failures should drop from 6 to 0-2
- All failing drives rank above healthy drives regardless of size

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-06 15:05:25 -05:00
3b15377821 separate smartctl depending on device class 2025-12-22 18:23:06 -05:00
c315fa3efc Updated readme again 2025-12-22 18:15:46 -05:00
c252dbcdc4 resolves /dev/dm-* now 2025-12-22 18:05:32 -05:00
1610aa2606 removed pg and latency counters 2025-12-22 17:14:02 -05:00
db757345fb Better patterns and error handling 2025-12-22 17:08:13 -05:00
e12b53238e Pushed malformed code, whoops 2025-12-22 17:00:45 -05:00
559ed9fc94 adds /dev in front of block devices 2025-12-22 16:57:53 -05:00
43d35feb46 Enables ssh to all hosts to gather smart data 2025-12-22 16:50:04 -05:00
7dab2591b1 First test 2025-12-22 16:40:19 -05:00