Improved device type detection:
- Use anchored regex (^Rotation Rate:) to avoid false matches
- Check for actual RPM values (e.g., "7200 rpm") to confirm HDD
- Only match SSD in model name field, not anywhere in output
- Default to HDD when Rotation Rate field is missing
This fixes drives like WDC WD80EFZZ being incorrectly detected
as SSDs when the Rotation Rate field wasn't being matched.
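A minimal sketch of the detection order described above. It assumes the text comes from `smartctl -i`; the function name `detect_device_type` and the sample text are hypothetical, not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a drive from `smartctl -i` text passed as $1.
detect_device_type() {
    local info="$1" rate model
    # Anchored match: only a line that starts with "Rotation Rate:" counts.
    rate=$(printf '%s\n' "$info" | grep -E '^Rotation Rate:' | head -n 1)
    # Only the model line is searched for "SSD", never the whole output.
    model=$(printf '%s\n' "$info" | grep -E '^(Device Model|Model Number):' | head -n 1)

    if printf '%s\n' "$rate" | grep -Eq '[1-9][0-9]* rpm'; then
        echo "HDD"   # a real RPM value such as "7200 rpm" confirms a spinning disk
    elif printf '%s\n' "$rate" | grep -q 'Solid State'; then
        echo "SSD"
    elif printf '%s\n' "$model" | grep -q 'SSD'; then
        echo "SSD"
    else
        echo "HDD"   # Rotation Rate missing and no SSD in the model: default to HDD
    fi
}

# Sample text is made up for illustration.
sample=$(printf 'Device Model:     WDC WD80EFZZ\nRotation Rate:    5640 rpm\n')
detect_device_type "$sample"
```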
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split SMART data handling into two functions:
- parse_smart_data(): Parses raw smartctl output (no I/O)
- get_drive_smart_info(): Fetches and parses (wrapper)
Changed parallel collection to save raw smartctl output to cache
files, then parse during the display loop. This avoids issues
with function availability in background subshells when running
from process substitution (bash <(curl ...)).
Also fixed:
- Removed orphan code that was outside function scope
- Fixed lsblk caching to use separate calls for SIZE and MOUNTPOINT
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split lsblk queries into two separate calls:
1. lsblk -dn for disk sizes (whole disk only, simpler parsing)
2. lsblk -rn for mount points (handles partition-to-parent mapping)
This fixes issues where:
- SIZE was empty due to delimiter confusion
- Mount points with spaces caused parsing errors
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed 'local' keyword from colored_warnings variable assignment
in the main script body. The 'local' keyword can only be used
inside functions in bash.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added path constant for disk-by-path location:
DISK_BY_PATH="/dev/disk/by-path"
Updated build_drive_map() to use the constant instead of
hardcoded path strings.
Note: LOG_DIR not added as the script does not currently use
logging to files. Can be added if logging feature is implemented.
Fixes: #24
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added 'set -o pipefail' so that a pipeline reports failure when any
command in it fails, not only the last one.
Not using -e (errexit) as the script is designed for graceful
degradation when optional tools (smartctl, ceph) are missing.
Many commands intentionally redirect stderr to /dev/null.
Not using -u (nounset) as the script uses ${var:-default}
patterns extensively for optional variables.
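The effect of pipefail without errexit can be seen in a small sketch. `no_such_tool_xyz` is a hypothetical stand-in for a missing optional tool, and the variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline reports only the exit status of its last command.
false | cat
status_default=$?

set -o pipefail
false | cat
status_pipefail=$?

echo "default: $status_default, pipefail: $status_pipefail"

# Graceful degradation without errexit: the failed pipeline is caught and a
# fallback value is used instead of aborting the script.
temp=$(no_such_tool_xyz 2>/dev/null | head -n 1) || temp="N/A"
echo "temp: $temp"
```

Because errexit is off, the failing pipeline does not stop the script; the `||` fallback decides what happens next.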
Fixes: #23
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Improved quoting consistency throughout the script:
- Array subscripts now quoted: DEVICE_TO_BAY["$device"]="$bay"
- Command substitution quoted: all_bays="$(cmd)"
- Function arguments already fixed in earlier commits
Most variable assignments were already properly quoted. The
remaining unquoted uses (like 'for x in $var') are intentional
for word-splitting on whitespace-separated lists.
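A short sketch of the three cases above. The device, bay, and list values are made up, and `printf` stands in for whatever command produces the bay list.

```shell
#!/usr/bin/env bash
declare -A DEVICE_TO_BAY

device="sda"
bay="3"
# Quoted subscript and value, as in the change described above.
DEVICE_TO_BAY["$device"]="$bay"

# Quoted command substitution keeps the output as one string.
all_bays="$(printf '%s\n' 1 2 3)"

# Intentionally unquoted: word splitting turns the list into separate items.
count=0
for b in $all_bays; do
    count=$((count + 1))
done

echo "bay for $device: ${DEVICE_TO_BAY[$device]}, bays seen: $count"
```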
Fixes: #22
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The magic numbers mentioned have been addressed:
- grep -B 20 for Ceph: Fixed in issue #9 with proper block parsing
in build_ceph_cache() that reads structured output
- awk column 10 for temperature: Fixed in issue #2 with dynamic
last-numeric-field extraction that doesn't rely on column position
- SMART thresholds: Added as named constants in issue #12:
SMART_TEMP_WARN, SMART_TEMP_CRIT, SMART_REALLOCATED_WARN, etc.
Fixes: #21
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Converted all echo -e commands to printf for better portability
across different systems and shells. echo -e is not specified by
POSIX and its behavior varies between shells; printf is specified
by POSIX and behaves consistently.
Updated functions:
- colorize_health(): Uses printf %b for escape sequences
- colorize_temp(): Uses printf %b for escape sequences
- colorize_header(): Uses printf with newline
- log_error(), log_warn(), log_info(): Uses printf for stderr
Also simplified header output by calling colorize_header directly
since it now handles its own newline.
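A sketch of what a printf-based colorize helper might look like. The color variable names and the exact escape codes are assumptions; only `colorize_health`, `USE_COLOR`, and the use of `%b` come from the notes above.

```shell
#!/usr/bin/env bash
USE_COLOR=1
GREEN='\033[0;32m'   # assumed color codes
RED='\033[0;31m'
RESET='\033[0m'

colorize_health() {
    local status="$1"
    if [ "${USE_COLOR:-0}" -ne 1 ]; then
        printf '%s' "$status"
        return
    fi
    case "$status" in
        # %b expands the backslash escapes stored in the color variables;
        # %s prints the status text literally.
        "✓") printf '%b%s%b' "$GREEN" "$status" "$RESET" ;;
        "✗") printf '%b%s%b' "$RED" "$status" "$RESET" ;;
        *)   printf '%s' "$status" ;;
    esac
}

colorize_health "✓"
printf '\n'
```

Keeping the data in `%s` and the escapes in `%b` means a status string containing a backslash is never interpreted as an escape sequence.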
Fixes: #19
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The build_ceph_cache() function added in commit 9d39332 already
addresses this issue by:
- Querying ceph-volume lvm list once, building CEPH_DEVICE_TO_OSD map
- Querying ceph osd tree once, building CEPH_OSD_STATUS and CEPH_OSD_IN maps
- Eliminating per-device Ceph queries in the main loop
Fixes: #18
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The lspci command is now called only once on first invocation of
get_storage_controllers, with results cached in LSPCI_CACHE.
Subsequent calls from different layout generators (10bay, large1,
micro) reuse the cached output, reducing subprocess overhead.
Also added function documentation.
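The caching pattern could be sketched as follows. `run_lspci` is a hypothetical stand-in for the real lspci call so the example runs anywhere, `CALL_LOG` exists only to show how often the query runs, and the sample lines and grep pattern are illustrative.

```shell
#!/usr/bin/env bash
LSPCI_CACHE=""
CALL_LOG=$(mktemp)   # hypothetical: records each real invocation for the demo

# Stand-in for the real `lspci` call.
run_lspci() {
    echo "called" >> "$CALL_LOG"
    printf '%s\n' \
        "01:00.0 Serial Attached SCSI controller: LSI SAS3008" \
        "0d:00.0 SATA controller: AMD" \
        "00:1f.3 Audio device: example"
}

get_storage_controllers() {
    # Populate the cache on first use only.
    if [ -z "$LSPCI_CACHE" ]; then
        LSPCI_CACHE=$(run_lspci)
    fi
    printf '%s\n' "$LSPCI_CACHE" | grep -Ei 'sas|sata|raid|nvme'
}

get_storage_controllers              # first call runs the query
get_storage_controllers > /dev/null  # second call reuses LSPCI_CACHE
```

One caveat with this pattern: if the first call happens inside `$(...)`, the cache is filled in a subshell and lost, so the first call should run in the main shell.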
Fixes: #17
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of calling lsblk twice per device (once for size, once for
mount points), now performs a single lsblk call at start and caches:
- LSBLK_SIZE: Device sizes
- LSBLK_MOUNTS: Mount points (accumulated for partitions)
This reduces the number of subprocess calls significantly,
especially on systems with many drives.
Fixes: #16
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SMART queries are now run in parallel using background jobs:
1. First pass launches background jobs for all devices
2. Each job writes to a temp file in SMART_CACHE_DIR
3. Wait for all jobs to complete
4. Second pass reads cached data for display
This significantly reduces script runtime when multiple drives
are present, as SMART queries can take 1-2 seconds each.
Cache directory is automatically cleaned up after use.
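The two-pass flow above can be sketched like this. `query_smart` is a hypothetical stand-in for the slow smartctl call, and the device list is made up.

```shell
#!/usr/bin/env bash
# Stand-in for the slow per-device query; a real script would call smartctl here.
query_smart() {
    sleep 1
    echo "result for $1: PASSED"
}

SMART_CACHE_DIR=$(mktemp -d)
devices="sda sdb sdc"   # hypothetical device list

# Pass 1: one background job per device, each writing its own cache file.
for dev in $devices; do
    query_smart "$dev" > "$SMART_CACHE_DIR/$dev.txt" 2>&1 &
done

# Block until every background job has finished.
wait

# Pass 2: read the cached results in display order.
report=$(for dev in $devices; do cat "$SMART_CACHE_DIR/$dev.txt"; done)
printf '%s\n' "$report"

# Clean up the cache directory after use.
rm -rf "$SMART_CACHE_DIR"
```

With three devices the whole run takes about one second instead of three, since the jobs overlap.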
Fixes: #15
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added --verbose flag that enables detailed logging:
- log_error(): Always shown, critical errors
- log_warn(): Shown in verbose mode, potential issues
- log_info(): Shown in verbose mode, informational messages
Now provides helpful feedback for:
- SMART query failures with specific error messages
- Missing drive mappings for the current host
- Empty bays (no device at configured PCI path)
- Ceph command availability and query status
- Drive mapping statistics (mapped vs empty)
Color-coded output when using --color with --verbose.
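A minimal sketch of the three log levels, assuming a `VERBOSE` variable set by the --verbose flag. The variable name and message prefixes are assumptions, and color handling is left out.

```shell
#!/usr/bin/env bash
VERBOSE=0   # assumed to be set to 1 by --verbose

# Always shown.
log_error() {
    printf 'ERROR: %s\n' "$*" >&2
}

# Shown only in verbose mode.
log_warn() {
    if [ "${VERBOSE:-0}" -eq 1 ]; then printf 'WARN: %s\n' "$*" >&2; fi
}

log_info() {
    if [ "${VERBOSE:-0}" -eq 1 ]; then printf 'INFO: %s\n' "$*" >&2; fi
}

log_warn "hidden unless VERBOSE=1"
VERBOSE=1
log_info "shown in verbose mode"
```

All three write to stderr, so the drive table on stdout stays clean when the output is piped.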
Fixes: #14
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The --show-pci flag, added in commit 71a4e3b, displays the PCI
path for each drive in the output table.
Fixes: #13
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New warning detection for concerning SMART values:
- Temperature: Warning at 50°C, Critical at 60°C
- Reallocated sectors: Warning at >= 1
- Pending sectors: Warning at >= 1
- UDMA CRC errors: Warning at >= 100
- Power-on hours: Warning at >= 43800 (5 years)
Health indicator now shows ⚠ when SMART passed but has warnings.
Added WARNINGS column to output showing codes like:
TEMP_WARN, TEMP_CRIT, REALLOC:5, PENDING:2, CRC:150, HOURS:50000
Thresholds are configurable via constants at top of script.
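The threshold checks could be sketched as below. `SMART_TEMP_WARN`, `SMART_TEMP_CRIT`, and `SMART_REALLOCATED_WARN` are named in this log; the other constant names, the function name `smart_warnings`, and the comma-joined output are assumptions.

```shell
#!/usr/bin/env bash
SMART_TEMP_WARN=50
SMART_TEMP_CRIT=60
SMART_REALLOCATED_WARN=1
SMART_PENDING_WARN=1      # name assumed
SMART_CRC_WARN=100        # name assumed
SMART_HOURS_WARN=43800    # name assumed; 5 years of power-on time

# Arguments: temp realloc pending crc hours (all integers).
smart_warnings() {
    local temp="$1" realloc="$2" pending="$3" crc="$4" hours="$5" out=""
    if [ "$temp" -ge "$SMART_TEMP_CRIT" ]; then
        out="TEMP_CRIT"
    elif [ "$temp" -ge "$SMART_TEMP_WARN" ]; then
        out="TEMP_WARN"
    fi
    # ${out:+$out,} adds a comma only when a code is already present.
    if [ "$realloc" -ge "$SMART_REALLOCATED_WARN" ]; then out="${out:+$out,}REALLOC:$realloc"; fi
    if [ "$pending" -ge "$SMART_PENDING_WARN" ]; then out="${out:+$out,}PENDING:$pending"; fi
    if [ "$crc" -ge "$SMART_CRC_WARN" ]; then out="${out:+$out,}CRC:$crc"; fi
    if [ "$hours" -ge "$SMART_HOURS_WARN" ]; then out="${out:+$out,}HOURS:$hours"; fi
    printf '%s\n' "$out"
}

smart_warnings 61 5 2 150 50000
```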
Fixes: #12
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When enabled, colors are applied to:
- Headers: Blue/bold for section titles
- Health status: Green for ✓ (passed), Red for ✗ (failed)
- Temperature: Green (<50°C), Yellow (50-59°C), Red (≥60°C)
Added colorize_health, colorize_temp, and colorize_header helper
functions that respect the USE_COLOR flag.
Fixes: #11
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added comprehensive command-line interface with:
- -h, --help: Show usage information
- -v, --version: Show version
- -d, --debug: Enable debug output
- -s, --skip-smart: Skip SMART data collection (faster)
- --no-ceph: Skip Ceph OSD information
- --show-pci: Display PCI paths for debugging
The script now properly respects these flags throughout execution.
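One way to parse the flags listed above is a `while`/`case` loop. This is only a sketch: the function name `parse_args` and all variable names except `DEBUG` are hypothetical.

```shell
#!/usr/bin/env bash
parse_args() {
    SHOW_HELP=0; SHOW_VERSION=0; DEBUG=0
    SKIP_SMART=0; NO_CEPH=0; SHOW_PCI=0
    while [ $# -gt 0 ]; do
        case "$1" in
            -h|--help)       SHOW_HELP=1 ;;
            -v|--version)    SHOW_VERSION=1 ;;
            -d|--debug)      DEBUG=1 ;;
            -s|--skip-smart) SKIP_SMART=1 ;;
            --no-ceph)       NO_CEPH=1 ;;
            --show-pci)      SHOW_PCI=1 ;;
            *)
                printf 'Unknown option: %s\n' "$1" >&2
                return 1
                ;;
        esac
        shift
    done
}

parse_args -s --show-pci
echo "skip_smart=$SKIP_SMART show_pci=$SHOW_PCI no_ceph=$NO_CEPH"
```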
Fixes: #10
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace fragile per-device ceph-volume parsing (grep -B 20) with a
single upfront query that builds lookup tables.
New build_ceph_cache function:
- Parses ceph-volume lvm list output using proper block detection
- Extracts OSD IDs by matching "====== osd.X =======" headers
- Maps block devices to their corresponding OSDs
- Queries ceph osd tree once for all status info
- Creates CEPH_DEVICE_TO_OSD, CEPH_OSD_STATUS, CEPH_OSD_IN arrays
This is both more reliable and more efficient.
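The block detection can be sketched as a single pass over the listing. The sample text is abbreviated and made up, the function name `build_ceph_cache_from` is hypothetical, and keying on the `devices` line is an assumption about which field the script reads.

```shell
#!/usr/bin/env bash
declare -A CEPH_DEVICE_TO_OSD

# Reads `ceph-volume lvm list`-style text on stdin.
build_ceph_cache_from() {
    local line current_osd=""
    local header_re='^=+ (osd\.[0-9]+) =+$'
    local device_re='^[[:space:]]+devices[[:space:]]+(/dev/[^[:space:],]+)'
    while IFS= read -r line; do
        if [[ $line =~ $header_re ]]; then
            # A new "====== osd.X =======" header starts a new block.
            current_osd="${BASH_REMATCH[1]}"
        elif [[ -n $current_osd && $line =~ $device_re ]]; then
            CEPH_DEVICE_TO_OSD["${BASH_REMATCH[1]}"]="$current_osd"
        fi
    done
}

build_ceph_cache_from <<'EOF'
====== osd.3 =======

  [block]       /dev/ceph-aaa/osd-block-bbb

      osd id                    3
      devices                   /dev/sdc

====== osd.7 =======

      osd id                    7
      devices                   /dev/sdd
EOF

echo "/dev/sdc -> ${CEPH_DEVICE_TO_OSD[/dev/sdc]}"
```

The function is fed through a redirect rather than a pipe so that the array is filled in the current shell and survives the call.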
Fixes: #9
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use lsblk instead of mount command to detect mount points. This
properly detects mounts on partitions (e.g., /dev/sda1) rather
than only whole-device mounts.
- Shows multiple mount points (up to 3) comma-separated
- Correctly identifies BOOT drives with root partition
- Handles NVMe partition naming (nvme0n1p1, etc.)
Fixes: #8
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The device type detection was updated in commit 90055be to:
- Check for NVMe devices by name prefix first
- Handle "Solid State" and "0 rpm" in Rotation Rate field
- Fall back to checking for SSD/Solid State in SMART output
Fixes: #7
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The serial number parsing was updated in commit 90055be to use
'cut -d: -f2 | xargs' which captures the full serial including
spaces, instead of 'awk {print $3}' which only got the first word.
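The difference between the two approaches shows up on a serial that contains spaces. The sample line below is made up.

```shell
#!/usr/bin/env bash
line='Serial Number:    WD-ABC 123 XYZ'   # made-up serial containing spaces

# Old approach: the third whitespace-separated field is only the first word.
first_word=$(printf '%s\n' "$line" | awk '{print $3}')

# New approach: everything after the first colon, trimmed by xargs.
full=$(printf '%s\n' "$line" | cut -d: -f2 | xargs)

echo "awk: '$first_word'"
echo "cut: '$full'"
```

Note that `-f2` keeps only the text up to the next colon, so a serial that itself contains a colon would need `-f2-` instead.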
Fixes: #6
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NVMe drives mapped to m2-1, m2-2 slots now appear in the main drive
table with their bay position, instead of in a separate unmapped
section.
- Extended bay loop to include m2-* slots after numeric bays
- NVMe section now only shows truly unmapped NVMe drives
- Mapped NVMe drives show full SMART data like other drives
Fixes: #5
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use awk BEGIN block for comparing Ceph OSD reweight values instead
of bc. Awk is more universally available and the previous fallback
to "echo 0" could incorrectly evaluate to true.
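A sketch of a float comparison done in an awk BEGIN block. The function name `is_osd_in` is hypothetical, and treating any reweight greater than zero as "in" is an assumption.

```shell
#!/usr/bin/env bash
# Returns success (0) when the reweight value is greater than zero.
is_osd_in() {
    # The BEGIN block runs without reading input; the awk exit status
    # carries the comparison result back to the shell.
    awk -v rw="$1" 'BEGIN { exit !(rw + 0 > 0) }'
}

for rw in 1.00000 0.5 0; do
    if is_osd_in "$rw"; then
        echo "$rw: in"
    else
        echo "$rw: out"
    fi
done
```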
Fixes: #4
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Script now verifies required (lsblk, lspci, etc.) and optional
(smartctl, ceph, bc, nvme) dependencies at startup.
- Exits with clear error if required dependencies are missing
- Warns about missing optional dependencies and continues with reduced functionality
- Directs users to freshStartScript for easy installation
- Checks for sudo access needed for SMART operations
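A dependency check along these lines might look like the sketch below. The function name `check_dependencies` and the message wording are assumptions; the sudo check and the installation hint are omitted.

```shell
#!/usr/bin/env bash
# $1: space-separated required tools, $2: space-separated optional tools.
check_dependencies() {
    local required="$1" optional="$2" cmd missing=""
    for cmd in $required; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    for cmd in $optional; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf 'Warning: optional tool %s not found, some output will be reduced\n' "$cmd" >&2
        fi
    done
    if [ -n "$missing" ]; then
        printf 'Error: missing required tools:%s\n' "$missing" >&2
        return 1
    fi
}

# `ls` stands in for a required tool that exists on any system.
if check_dependencies "ls" ""; then
    echo "all required tools found"
fi
```

A caller would typically exit when the function returns non-zero, while a missing optional tool only produces a warning.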
Fixes: #3
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Temperature parsing now correctly handles:
- SATA Temperature_Celsius attribute (extracts last numeric value)
- Simple "Temperature: XX Celsius" format
- "Current Temperature: XX Celsius" format
- NVMe temperature reporting
Also improved device type detection for NVMe, SSD (including 0 RPM),
and fixed serial number parsing to capture full serial with spaces.
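The formats listed above can be handled by one awk pass. This is a sketch: the function name `parse_temperature` is hypothetical, and it assumes a SATA raw value followed by a `(Min/Max ...)` suffix.

```shell
#!/usr/bin/env bash
# Prints the temperature found in smartctl-style text passed as $1.
parse_temperature() {
    printf '%s\n' "$1" | awk '
        /Temperature_Celsius/ {
            # SATA attribute line: scan from the right for the last purely
            # numeric field instead of relying on a fixed column position.
            for (i = NF; i >= 1; i--)
                if ($i ~ /^[0-9]+$/) { print $i; exit }
        }
        /^(Current )?Temperature:/ {
            # "Temperature: XX Celsius" and "Current Temperature: XX Celsius".
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) { print $i; exit }
        }'
}

parse_temperature '194 Temperature_Celsius 0x0022 036 053 000 Old_age Always - 36 (Min/Max 20/53)'
```

Drives that append bare numbers after the raw value would defeat the right-to-left scan, so this should be checked against real output.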
Fixes: #2
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Declare DRIVE_MAP as global at function start and populate directly,
instead of creating a local array and copying to global. Also added
proper variable quoting and function documentation.
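Declaring the array global from inside the function might look like the sketch below. It assumes bash 4.2 or later for `declare -g`; the bay-to-path layout and the bay 2 path are hypothetical.

```shell
#!/usr/bin/env bash
# Reads "bay path" pairs from stdin and fills the global DRIVE_MAP directly.
build_drive_map() {
    # -g makes the array global even though it is declared inside a function,
    # so no local copy has to be transferred afterwards.
    declare -gA DRIVE_MAP=()
    local bay path
    while read -r bay path; do
        if [ -n "$bay" ]; then
            DRIVE_MAP["$bay"]="$path"
        fi
    done
}

build_drive_map <<'EOF'
1 pci-0000:0d:00.0-ata-2
2 pci-0000:0d:00.0-ata-1
EOF

echo "bay 1 -> ${DRIVE_MAP[1]}"
```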
Fixes: #1
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added get_storage_controllers() function that detects SAS, SATA, RAID,
and NVMe controllers via lspci. Updated all layout functions (10bay,
large1, micro) to display detected storage controllers with their
PCI address and model info.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reflects actual physical layout: 3 stacks x 5 rows (15 bays)
- Added note that physical bay mapping is TBD
- Shows Stack A, B, C columns
- Keeps 2x M.2 NVMe slots
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PCI path mappings for large1 (12 SATA/SAS + 2 NVMe drives)
- Update large1 layout to show actual drive assignments
- Controllers: LSI SAS2008 (7), AMD SATA (3), ASMedia SATA (2), NVMe (2)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PCI path mappings for storage-01 (4 SATA drives on AMD controller)
- Fix NVMe serial: use smartctl instead of nvme list for accurate serial numbers
- NVMe now shows actual serial number instead of /dev/ng device path
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated README.md:
- Added feature list with emojis for visual clarity
- Documented all output columns with descriptions
- Added Ceph integration details
- Included troubleshooting for common issues
- Updated example output with current format
- Added status indicators (✅⚠️) for server mapping status
Created CLAUDE.md:
- Documented AI-assisted development process
- Chronicled evolution from basic script to comprehensive tool
- Detailed technical challenges and solutions
- Listed all phases of development
- Provided metrics and future enhancement ideas
- Lessons learned for future AI collaboration
This documents the complete journey from broken PCI paths to a
production-ready storage infrastructure management tool.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed parsing of ceph osd tree output:
- Column 5 is STATUS (up/down), not column 6
- Column 6 is REWEIGHT (1.0 = in, 0 = out)
- Now correctly shows up/in for active OSDs
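The column layout described above can be illustrated with a small awk filter. The sample rows are made up, the helper names are hypothetical, and the sketch assumes every OSD row has a device class so that the name lands in column 4.

```shell
#!/usr/bin/env bash
# Made-up sample in the shape of `ceph osd tree` output.
sample_tree() {
    printf '%s\n' \
        "ID  CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF" \
        "-1         21.83217  root default" \
        "-3          7.27739      host node1" \
        " 0    hdd   7.27739          osd.0       up   1.00000  1.00000" \
        " 1    hdd   7.27739          osd.1     down         0  1.00000"
}

# Prints "osd.N status/in-or-out" for every OSD row on stdin.
parse_osd_tree() {
    awk '$4 ~ /^osd\.[0-9]+$/ {
        # Column 5 is STATUS, column 6 is REWEIGHT (0 means out).
        state = ($6 + 0 > 0) ? "in" : "out"
        print $4, $5 "/" state
    }'
}

sample_tree | parse_osd_tree
```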
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
New features:
- Added STATUS column showing Ceph OSD up/down and in/out status
Format: "up/in", "up/out", "down/in", etc.
- Added USAGE column to identify boot drives and mount points
Shows "BOOT" for root filesystem, mount point for others, "-" for OSDs
- Improved table layout with all relevant drive information
Now you can see at a glance:
- Which drives are boot drives
- Which OSDs are up and in the cluster
- Any problematic OSDs that are down or out
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Major enhancements:
- Drive details now sorted by physical bay position (1-10) instead of alphabetically
- Added BAY column to show physical location
- Added CEPH OSD column to show which OSD each drive hosts
- Fixed ASCII art right border alignment (final fix)
- Drives now display in logical order matching physical layout
This makes it much easier to correlate physical drives with their Ceph OSDs
and understand the layout at a glance.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Added SERIAL column to SATA/SAS drive details table
- Added SERIAL column to NVMe drive details table
- Updated get_drive_smart_info() to extract and return serial numbers
- Widened output format to accommodate serial numbers
- NVMe serials now display correctly from nvme list output
This makes it much easier to match drives to their physical locations
by comparing visible serial numbers on drive labels with the output.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed issues:
- ASCII art boxes now render correctly with fixed-width layout
- Corrected bay 1 mapping: ata-2 -> bay 1 (sdh SSD confirmed)
- Adjusted mobo SATA port mappings based on physical verification
- Simplified layout to use consistent 10-character wide bay boxes
Bay 1 is confirmed to contain sdh (Crucial SSD boot drive) which maps
to pci-0000:0d:00.0-ata-2, so the mapping has been corrected.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Hardware discovered:
- LSI SAS3008 HBA at 01:00.0 (bays 5-10 via mini-SAS HD cables)
- AMD SATA controller at 0d:00.0 (bays 1-4)
- NVMe at 0e:00.0 (M.2 slot)
Changes:
- Updated SERVER_MAPPINGS with correct PCI paths based on actual hardware
- Fixed diagnose-drives.sh CRLF line endings (was causing script errors)
- Updated README with accurate controller information
- Mapped all 10 bays plus M.2 NVMe slot
- Added detailed cable mapping comments from user documentation
The old mapping referenced non-existent controller 0c:00.0. Now uses
actual SAS PHY paths and ATA port numbers that match physical bays.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Major improvements:
- Separated chassis types from server hostnames for better reusability
- Implemented template-based layout system (10bay, large1)
- Renamed medium2 to compute-storage-01 for clarity
- All three Sliger CX471225 servers now use unified 10bay layout
- Added comprehensive PCI path-based drive mapping system
- Created diagnose-drives.sh helper script for mapping new servers
- Added DEBUG mode for troubleshooting drive mappings
Technical changes:
- Replaced DRIVE_MAPPINGS with separate SERVER_MAPPINGS and CHASSIS_TYPES
- Removed spare-10bay layout (all Sliger chassis use same template)
- Improved drive detection and SMART data collection
- Better error handling for missing drives and unmapped servers
- Cleaner code structure with sectioned comments
Documentation:
- Complete rewrite of README with setup guide and troubleshooting
- Added detailed todo.txt with action plan and technical notes
- Documented Sliger CX471225 4U chassis specifications
- Included step-by-step instructions for mapping new servers
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Major improvements:
- Separated chassis types from server hostnames for better reusability
- Implemented template-based layout system (10bay, large1, spare-10bay)
- Renamed medium2 to compute-storage-01 for clarity
- Added comprehensive PCI path-based drive mapping system
- Created diagnose-drives.sh helper script for mapping new servers
- Added DEBUG mode for troubleshooting drive mappings
- Documented Sliger CX471225 4U chassis model
Technical changes:
- Replaced DRIVE_MAPPINGS with separate SERVER_MAPPINGS and CHASSIS_TYPES
- Improved drive detection and SMART data collection
- Better error handling for missing drives and unmapped servers
- Cleaner code structure with sectioned comments
Documentation:
- Complete rewrite of README with setup guide and troubleshooting
- Added detailed todo.txt with action plan and technical notes
- Included step-by-step instructions for mapping new servers
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>