- Add --diagnose option that shows all PCI paths, storage controllers,
block devices, and validates current mappings. Replaces the separate
diagnose-drives.sh script.
- Remove diagnose-drives.sh (incorporated into --diagnose).
- Remove get-serials.sh (redundant with SMART data in main table).
- Remove test-paths.sh (referenced non-existent 0c:00.0 controller).
- Remove todo.md (massively outdated).
- Fix storage controller text overflowing box borders in large1 and
micro layouts by adding truncation (%-69.69s, %-57.57s).
- Fix chassis name to CX4712 in README.
- Update server mapping statuses from "Requires mapping" to actual
partially-mapped states.
- Add ⚠ health indicator to README output column docs.
- Update Claude.md metrics to match current state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Border was 130 columns wide but bay lines were 138. Widened border
and all interior format strings to match the bay content width (136
interior = 138 total). Long controller descriptions are now truncated
to prevent overflow.
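
A rough illustration of the truncation: giving %s both a width and a precision (as in %-69.69s) pads short strings and clips long ones to the same length. The helper name fit_field and the 20-column width are hypothetical and not taken from the script.

```bash
# Hypothetical helper: left-justify text and force it to exactly `width` characters.
fit_field() {
    local width="$1" text="$2"
    # %-W.Ws pads short strings to W columns and clips long ones at W.
    printf '%-*.*s' "$width" "$width" "$text"
}

short="$(fit_field 20 "AHCI controller")"
long="$(fit_field 20 "Serial Attached SCSI controller: SAS3416")"
printf '| %s |\n| %s |\n' "$short" "$long"
```

Both rows come out the same width, so the right-hand border stays aligned regardless of the description length.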
Ref #25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verified via ls -la /dev/disk/by-path/ and physical inspection
that HBA SAS3416 phy9 maps to bay 5 (C0 SATA breakout).
Remaining C0 bays 6-8 and C1 bays 9-10 still need drives to verify.
Ref #25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove `local` from max_parallel_jobs/job_count (not inside a function)
- Document storage-01 physical layout: mobo SATA ports, HBA Mini-SAS HD
ports C0-C3, U.2 NVMe serial numbers
Ref #25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add bash 4.2+ version check since script uses declare -g -A
- Add cleanup trap (EXIT/INT/TERM) for SMART_CACHE_DIR temp directory
- Sanitize hostname to strip unexpected characters
- Limit parallel SMART collection to 10 concurrent jobs
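
A minimal sketch of how the version check, hostname sanitizing, and cleanup trap could look. The helper names (version_at_least, require_bash_4_2, sanitize_hostname, cleanup) and the allowed character set are assumptions; only SMART_CACHE_DIR and the 4.2 requirement come from the notes above.

```bash
# Hypothetical helper: true if have_major.have_minor >= want_major.want_minor.
version_at_least() {
    local have_major="$1" have_minor="$2" want_major="$3" want_minor="$4"
    (( have_major > want_major || (have_major == want_major && have_minor >= want_minor) ))
}

# declare -g needs bash 4.2, so report older shells up front instead of failing later.
require_bash_4_2() {
    version_at_least "${BASH_VERSINFO[0]}" "${BASH_VERSINFO[1]}" 4 2 && return 0
    printf 'bash 4.2+ required, found %s\n' "$BASH_VERSION" >&2
    return 1
}

# Keep letters, digits, dot, underscore and hyphen; drop anything else (assumed policy).
sanitize_hostname() {
    printf '%s' "${1//[^A-Za-z0-9._-]/}"
}

# Remove the temp directory on exit; turn signals into an exit so the EXIT trap runs.
SMART_CACHE_DIR="$(mktemp -d)"
cleanup() { rm -rf -- "$SMART_CACHE_DIR"; }
trap cleanup EXIT
trap 'exit 130' INT
trap 'exit 143' TERM
```

A script would typically call `require_bash_4_2 || exit 1` before any other work.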
Fixes #25
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ceph-volume lvm list output varies the number of trailing equals
signs based on OSD number length:
- Single digit: "====== osd.5 =======" (7 equals)
- Double digit: "====== osd.19 ======" (6 equals)
Changed regex to require at least 6 trailing equals, which matches
both formats.
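
One way to accept both header widths is a pattern that allows six or more trailing equals signs. The sketch below is illustrative only: the variable and function names are hypothetical, and the actual regex in the script may differ.

```bash
# Assumed pattern: leading equals, "osd.<id>", then at least six trailing equals.
osd_header_re='^=+ osd\.([0-9]+) ={6,}$'

# Hypothetical helper: print the OSD id if the line is a header, fail otherwise.
parse_osd_id() {
    if [[ "$1" =~ $osd_header_re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        return 1
    fi
}

parse_osd_id '====== osd.5 ======='    # single-digit header, 7 trailing equals
parse_osd_id '====== osd.19 ======'    # double-digit header, 6 trailing equals
```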
Fixes: #17
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SMART output for Temperature_Celsius often includes extra sensor data
in parentheses like "26 (0 14 0 0 0)". The previous awk command was
finding "0" from the parenthetical instead of the actual temperature.
Now strips parenthetical content with sed before extracting the last
numeric value.
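
The fix can be sketched roughly as follows, assuming a typical smartctl attribute row. The function name extract_temp and the sample line are illustrative, not copied from the script.

```bash
# Hypothetical helper: drop "(...)" groups, then print the last purely numeric field.
extract_temp() {
    sed 's/([^)]*)//g' <<<"$1" |
        awk '{ t = ""; for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) t = $i; if (t != "") print t }'
}

line='194 Temperature_Celsius 0x0022 026 014 000 Old_age Always - 26 (0 14 0 0 0)'
extract_temp "$line"
```

Without the sed step, the last purely numeric field on this line is the "0" inside the parentheses, which is the wrong value the earlier code reported.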
Fixes: #11
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added support for SAS drive temperature format "Current Drive Temperature:"
and made temperature extraction more robust by:
- Removing ^ anchor that was preventing matches with leading whitespace
- Using awk to find the first numeric value in the line
- Adding explicit SAS drive temperature format handling
Fixes: #11
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The "block device" line in ceph-volume output shows LVM paths like
ceph-xxx/osd-block-xxx, not physical device names. Changed to parse
the "devices" line which contains the actual physical device path
like /dev/sda.
Also reset current_osd after match to avoid duplicate matches.
Fixes: #17
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The smartctl output has leading whitespace before field names:
"Rotation Rate: 7200 rpm"
Removed the ^ anchor from the regex so it matches lines with
leading whitespace. This fixes HDD detection for drives that
have proper Rotation Rate fields in their SMART data.
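
A small sketch of an unanchored match, assuming the field is followed by a numeric RPM value on spinning drives. The function name is_rotational is hypothetical.

```bash
# Hypothetical helper: true if the text contains a Rotation Rate with an RPM value.
# No ^ anchor, so leading whitespace before the field name is tolerated.
is_rotational() {
    grep -Eq 'Rotation Rate:[[:space:]]*[0-9]+ rpm' <<<"$1"
}

if is_rotational '    Rotation Rate:    7200 rpm'; then
    printf 'HDD\n'
fi
```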
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added log_info messages to show:
- Count of OSDs found
- Each device-to-OSD mapping as discovered
Also fixed array subscript quoting in CEPH_DEVICE_TO_OSD.
Run with --verbose to see Ceph detection diagnostics.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Improved device type detection:
- Use anchored regex (^Rotation Rate:) to avoid false matches
- Check for actual RPM values (e.g., "7200 rpm") to confirm HDD
- Only match SSD in model name field, not anywhere in output
- Default to HDD when Rotation Rate field is missing
This fixes drives like WDC WD80EFZZ being incorrectly detected
as SSDs when the Rotation Rate field wasn't being matched.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split SMART data handling into two functions:
- parse_smart_data(): Parses raw smartctl output (no I/O)
- get_drive_smart_info(): Fetches and parses (wrapper)
Changed parallel collection to save raw smartctl output to cache
files, then parse during the display loop. This avoids issues
with function availability in background subshells when running
from process substitution (bash <(curl ...)).
Also fixed:
- Removed orphan code that was outside function scope
- Fixed lsblk caching to use separate calls for SIZE and MOUNTPOINT
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Split lsblk queries into two separate calls:
1. lsblk -dn for disk sizes (whole disk only, simpler parsing)
2. lsblk -rn for mount points (handles partition-to-parent mapping)
This fixes issues where:
- SIZE was empty due to delimiter confusion
- Mount points with spaces caused parsing errors
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removed 'local' keyword from colored_warnings variable assignment
in the main script body. The 'local' keyword can only be used
inside functions in bash.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added path constant for disk-by-path location:
DISK_BY_PATH="/dev/disk/by-path"
Updated build_drive_map() to use the constant instead of
hardcoded path strings.
Note: LOG_DIR not added as the script does not currently use
logging to files. Can be added if logging feature is implemented.
Fixes: #24
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added 'set -o pipefail' to ensure pipe failures are detected.
Not using -e (errexit) as the script is designed for graceful
degradation when optional tools (smartctl, ceph) are missing.
Many commands intentionally redirect stderr to /dev/null.
Not using -u (nounset) as the script uses ${var:-default}
patterns extensively for optional variables.
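
To illustrate the trade-off, the sketch below enables pipefail and handles a failing optional tool explicitly instead of relying on errexit. query_optional_tool is a hypothetical stand-in for a missing command such as smartctl.

```bash
set -o pipefail

# Stand-in for an optional tool that is missing or fails (hypothetical).
query_optional_tool() { return 1; }

# With pipefail, the first stage's failure is visible even though head succeeds.
pipeline_failed=no
if ! model="$(query_optional_tool 2>/dev/null | head -n 1)"; then
    pipeline_failed=yes
fi

# ${var:-default} supplies a fallback, which is why nounset is not enabled.
printf 'model: %s\n' "${model:-unknown}"
```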
Fixes: #23
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Improved quoting consistency throughout the script:
- Array subscripts now quoted: DEVICE_TO_BAY["$device"]="$bay"
- Command substitution quoted: all_bays="$(cmd)"
- Function arguments already fixed in earlier commits
Most variable assignments were already properly quoted. The
remaining unquoted uses (like 'for x in $var') are intentional
for word-splitting on whitespace-separated lists.
Fixes: #22
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The magic numbers mentioned have been addressed:
- grep -B 20 for Ceph: Fixed in issue #9 with proper block parsing
in build_ceph_cache() that reads structured output
- awk column 10 for temperature: Fixed in issue #2 with dynamic
last-numeric-field extraction that doesn't rely on column position
- SMART thresholds: Added as named constants in issue #12:
SMART_TEMP_WARN, SMART_TEMP_CRIT, SMART_REALLOCATED_WARN, etc.
Fixes: #21
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Converted all echo -e commands to printf for better portability
across different systems and shells. printf is specified by POSIX
and behaves consistently, whereas echo -e handling varies by shell.
Updated functions:
- colorize_health(): Uses printf %b for escape sequences
- colorize_temp(): Uses printf %b for escape sequences
- colorize_header(): Uses printf with newline
- log_error(), log_warn(), log_info(): Uses printf for stderr
Also simplified header output by calling colorize_header directly
since it now handles its own newline.
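
A hedged sketch of the printf %b approach. colorize_health and USE_COLOR are names from this log; the color variables, the PASSED/FAILED status strings, and the function body are assumptions.

```bash
USE_COLOR=true
GREEN='\033[0;32m'   # assumed escape sequences, stored unexpanded
RED='\033[0;31m'
NC='\033[0m'

# Sketch only: %b expands the backslash escapes, %s prints the status verbatim.
colorize_health() {
    local status="$1" color="$RED"
    if [[ "$status" == "PASSED" ]]; then
        color="$GREEN"
    fi
    if [[ "$USE_COLOR" == true ]]; then
        printf '%b%s%b' "$color" "$status" "$NC"
    else
        printf '%s' "$status"
    fi
}

colorize_health PASSED
printf '\n'
```

Passing the status through %s rather than the format string keeps any percent signs or backslashes in the data from being interpreted.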
Fixes: #19
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The build_ceph_cache() function added in commit 9d39332 already
addresses this issue by:
- Querying ceph-volume lvm list once, building CEPH_DEVICE_TO_OSD map
- Querying ceph osd tree once, building CEPH_OSD_STATUS and CEPH_OSD_IN maps
- Eliminating per-device Ceph queries in the main loop
Fixes: #18
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The lspci command is now called only once on first invocation of
get_storage_controllers, with results cached in LSPCI_CACHE.
Subsequent calls from different layout generators (10bay, large1,
micro) reuse the cached output, reducing subprocess overhead.
Also added function documentation.
Fixes: #17
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Instead of calling lsblk twice per device (once for size, once for
mount points), now performs a single lsblk call at start and caches:
- LSBLK_SIZE: Device sizes
- LSBLK_MOUNTS: Mount points (accumulated for partitions)
This reduces the number of subprocess calls significantly,
especially on systems with many drives.
Fixes: #16
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SMART queries are now run in parallel using background jobs:
1. First pass launches background jobs for all devices
2. Each job writes to a temp file in SMART_CACHE_DIR
3. Wait for all jobs to complete
4. Second pass reads cached data for display
This significantly reduces script runtime when multiple drives
are present, as SMART queries can take 1-2 seconds each.
Cache directory is automatically cleaned up after use.
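
The two-pass flow can be sketched as below. query_smart is a hypothetical stand-in for the real SMART query so the example runs without smartctl; the device list and file naming are also assumptions. The 10-job cap reflects a later change noted elsewhere in this log.

```bash
SMART_CACHE_DIR="$(mktemp -d)"
max_parallel_jobs=10
job_count=0
devices=(sda sdb sdc)

# Stand-in for the real SMART query (hypothetical).
query_smart() { printf 'SMART data for %s\n' "$1"; }

# First pass: one background job per device, each writing its own cache file.
for dev in "${devices[@]}"; do
    query_smart "$dev" > "$SMART_CACHE_DIR/$dev.smart" 2>&1 &
    job_count=$((job_count + 1))
    # Throttle: once a batch is full, let it drain before launching more.
    if (( job_count >= max_parallel_jobs )); then
        wait
        job_count=0
    fi
done
wait  # block until every remaining job has finished

# Second pass: read the cached results in bay order for display.
report=""
for dev in "${devices[@]}"; do
    report+="$(cat "$SMART_CACHE_DIR/$dev.smart")"$'\n'
done
printf '%s' "$report"

rm -rf -- "$SMART_CACHE_DIR"
```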
Fixes: #15
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added --verbose flag that enables detailed logging:
- log_error(): Always shown, critical errors
- log_warn(): Shown in verbose mode, potential issues
- log_info(): Shown in verbose mode, informational messages
Now provides helpful feedback for:
- SMART query failures with specific error messages
- Missing drive mappings for the current host
- Empty bays (no device at configured PCI path)
- Ceph command availability and query status
- Drive mapping statistics (mapped vs empty)
Color-coded output when using --color with --verbose.
Fixes: #14
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The --show-pci flag, added in commit 71a4e3b, displays the PCI path
for each drive in the output table.
Fixes: #13
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New warning detection for concerning SMART values:
- Temperature: Warning at 50°C, Critical at 60°C
- Reallocated sectors: Warning at >= 1
- Pending sectors: Warning at >= 1
- UDMA CRC errors: Warning at >= 100
- Power-on hours: Warning at >= 43800 (5 years)
Health indicator now shows ⚠ when SMART passed but has warnings.
Added WARNINGS column to output showing codes like:
TEMP_WARN, TEMP_CRIT, REALLOC:5, PENDING:2, CRC:150, HOURS:50000
Thresholds are configurable via constants at top of script.
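
A minimal sketch of threshold-based warning codes using the values listed above. SMART_TEMP_WARN, SMART_TEMP_CRIT, and SMART_REALLOCATED_WARN appear elsewhere in this log; the other constant names, the smart_warnings function, and the comma separator are assumptions.

```bash
SMART_TEMP_WARN=50
SMART_TEMP_CRIT=60
SMART_REALLOCATED_WARN=1
SMART_PENDING_WARN=1      # hypothetical name
SMART_CRC_WARN=100        # hypothetical name
SMART_HOURS_WARN=43800    # hypothetical name (5 years)

# Hypothetical helper: args are temp, reallocated, pending, CRC errors, power-on hours.
smart_warnings() {
    local temp="$1" realloc="$2" pending="$3" crc="$4" hours="$5"
    local -a codes=()
    if (( temp >= SMART_TEMP_CRIT )); then
        codes+=("TEMP_CRIT")
    elif (( temp >= SMART_TEMP_WARN )); then
        codes+=("TEMP_WARN")
    fi
    if (( realloc >= SMART_REALLOCATED_WARN )); then codes+=("REALLOC:$realloc"); fi
    if (( pending >= SMART_PENDING_WARN )); then codes+=("PENDING:$pending"); fi
    if (( crc >= SMART_CRC_WARN )); then codes+=("CRC:$crc"); fi
    if (( hours >= SMART_HOURS_WARN )); then codes+=("HOURS:$hours"); fi
    local IFS=,
    printf '%s' "${codes[*]}"   # join the codes with commas
}

smart_warnings 61 5 0 150 100
printf '\n'
```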
Fixes: #12
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When enabled, colors are applied to:
- Headers: Blue/bold for section titles
- Health status: Green for ✓ (passed), Red for ✗ (failed)
- Temperature: Green (<50°C), Yellow (50-59°C), Red (≥60°C)
Added colorize_health, colorize_temp, and colorize_header helper
functions that respect the USE_COLOR flag.
Fixes: #11
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added comprehensive command-line interface with:
- -h, --help: Show usage information
- -v, --version: Show version
- -d, --debug: Enable debug output
- -s, --skip-smart: Skip SMART data collection (faster)
- --no-ceph: Skip Ceph OSD information
- --show-pci: Display PCI paths for debugging
The script now properly respects these flags throughout execution.
Fixes: #10
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace fragile per-device ceph-volume parsing (grep -B 20) with a
single upfront query that builds lookup tables.
New build_ceph_cache function:
- Parses ceph-volume lvm list output using proper block detection
- Extracts OSD IDs by matching "====== osd.X =======" headers
- Maps block devices to their corresponding OSDs
- Queries ceph osd tree once for all status info
- Creates CEPH_DEVICE_TO_OSD, CEPH_OSD_STATUS, CEPH_OSD_IN arrays
This is both more reliable and more efficient.
Fixes: #9
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use lsblk instead of mount command to detect mount points. This
properly detects mounts on partitions (e.g., /dev/sda1) rather
than only whole-device mounts.
- Shows multiple mount points (up to 3) comma-separated
- Correctly identifies BOOT drives with root partition
- Handles NVMe partition naming (nvme0n1p1, etc.)
Fixes: #8
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The device type detection was updated in commit 90055be to:
- Check for NVMe devices by name prefix first
- Handle "Solid State" and "0 rpm" in Rotation Rate field
- Fall back to checking for SSD/Solid State in SMART output
Fixes: #7
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The serial number parsing was updated in commit 90055be to use
'cut -d: -f2 | xargs' which captures the full serial including
spaces, instead of 'awk {print $3}' which only got the first word.
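
The difference between the two approaches, shown on a made-up serial number containing a space:

```bash
line='Serial Number:    WD-WX 12345'   # illustrative value, not from a real drive

# Everything after the colon, with surrounding whitespace trimmed by xargs.
full_serial="$(printf '%s\n' "$line" | cut -d: -f2 | xargs)"

# Third whitespace-separated field only, which stops at the first space.
first_word="$(printf '%s\n' "$line" | awk '{print $3}')"

printf 'cut/xargs: %s\nawk $3:    %s\n' "$full_serial" "$first_word"
```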
Fixes: #6
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NVMe drives mapped to m2-1, m2-2 slots now appear in the main drive
table with their bay position, instead of in a separate unmapped
section.
- Extended bay loop to include m2-* slots after numeric bays
- NVMe section now only shows truly unmapped NVMe drives
- Mapped NVMe drives show full SMART data like other drives
Fixes: #5
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use awk BEGIN block for comparing Ceph OSD reweight values instead
of bc. Awk is more universally available and the previous fallback
to "echo 0" could incorrectly evaluate to true.
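
A small sketch of a floating-point comparison done in an awk BEGIN block. The function name float_gt is hypothetical, and the script's actual comparison may be written differently.

```bash
# Hypothetical helper: succeed (exit 0) when $1 is numerically greater than $2.
float_gt() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !((a + 0) > (b + 0)) }'
}

reweight="0.85"
if float_gt "$reweight" 0; then
    printf 'OSD is weighted in (reweight %s)\n' "$reweight"
fi
```

Because the result is carried in awk's exit status, there is no output to parse and no fallback string that could be mistaken for a true value.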
Fixes: #4
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Script now verifies required (lsblk, lspci, etc.) and optional
(smartctl, ceph, bc, nvme) dependencies at startup.
- Exits with clear error if required dependencies are missing
- Warns about missing optional dependencies with reduced functionality
- Directs users to freshStartScript for easy installation
- Checks for sudo access needed for SMART operations
Fixes: #3
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Temperature parsing now correctly handles:
- SATA Temperature_Celsius attribute (extracts last numeric value)
- Simple "Temperature: XX Celsius" format
- "Current Temperature: XX Celsius" format
- NVMe temperature reporting
Also improved device type detection for NVMe, SSD (including 0 RPM),
and fixed serial number parsing to capture full serial with spaces.
Fixes: #2
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Declare DRIVE_MAP as global at function start and populate directly,
instead of creating a local array and copying to global. Also added
proper variable quoting and function documentation.
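
A sketch of the pattern, assuming bash 4.2 or newer. DRIVE_MAP and build_drive_map are names from this log; the bay numbers and PCI paths are made-up examples.

```bash
# Builds the bay-to-path map (sketch only).
build_drive_map() {
    # -g makes the associative array global even though it is declared in a function.
    declare -g -A DRIVE_MAP
    DRIVE_MAP["1"]="pci-0000:00:17.0-ata-1"   # example path, not from the real mapping
    DRIVE_MAP["2"]="pci-0000:00:17.0-ata-2"   # example path, not from the real mapping
}

build_drive_map
printf 'bay 1 -> %s\n' "${DRIVE_MAP["1"]}"
```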
Fixes: #1
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 11:23:29 -05:00
6 changed files with 964 additions and 216 deletions
ls -la /dev/disk/by-path/ | grep -v "part" | grep "pci-0000:0c:00.0" | head -20
echo ""
echo "=== Checking if paths exist from mapping ==="
echo "pci-0000:0c:00.0-ata-3:"
ls -la /dev/disk/by-path/pci-0000:0c:00.0-ata-3 2>&1
echo "pci-0000:0c:00.0-ata-1:"
ls -la /dev/disk/by-path/pci-0000:0c:00.0-ata-1 2>&1