Compare commits

15 Commits

Author SHA1 Message Date
c6ea28c5d6 Add --diagnose flag, remove obsolete helper scripts, fix docs
- Add --diagnose option that shows all PCI paths, storage controllers,
  block devices, and validates current mappings. Replaces the separate
  diagnose-drives.sh script.
- Remove diagnose-drives.sh (incorporated into --diagnose).
- Remove get-serials.sh (redundant with SMART data in main table).
- Remove test-paths.sh (referenced non-existent 0c:00.0 controller).
- Remove todo.md (massively outdated).
- Fix storage controller text overflowing box borders in large1 and
  micro layouts by adding truncation (%-69.69s, %-57.57s).
- Fix chassis name to CX4712 in README.
- Update server mapping statuses from "Requires mapping" to actual
  partially-mapped states.
- Add ⚠ health indicator to README output column docs.
- Update Claude.md metrics to match current state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:50:37 -05:00
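
The overflow fix in commit c6ea28c5d6 relies on printf precision: `%-69.69s` pads short strings to 69 columns and cuts longer ones at 69. A minimal sketch of the difference follows. The `box_line` helper is hypothetical, and ASCII `|` borders stand in for the script's box-drawing characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one box row whose interior is exactly $1 columns.
# The precision (.N) truncates long text; the width (N) pads short text.
box_line() {
    local width="$1" text="$2"
    printf "|%-${width}.${width}s|\n" "$text"
}

# Width only: a long controller description pushes the right border out.
printf '|%-20s|\n' "01:00.0 Serial Attached SCSI controller"

# Width plus precision: the row is 22 characters wide in both cases.
box_line 20 "01:00.0 Serial Attached SCSI controller"
box_line 20 "02:00.1 SATA"
```

Truncation silently drops the tail of a long description, so it fits best where the full text is available elsewhere, for example in the `--diagnose` output.
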
555ecd54b2 Fix 10-bay ASCII box alignment
Border was 130 columns wide but bay lines were 138. Widened border
and all interior format strings to match the bay content width (136
interior = 138 total). Long controller descriptions are now truncated
to prevent overflow.

Ref #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:27:33 -05:00
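
The width arithmetic in commit 555ecd54b2 (136 interior columns plus two border columns) can be kept consistent by deriving every row from one variable. The sketch below assumes the layout constants quoted in the diff comment (4 + 10*13 + 2). The helper names are hypothetical, and ASCII characters replace the box-drawing glyphs.

```bash
#!/usr/bin/env bash
# Assumed layout constants, taken from the commit's description:
# 4 columns of left margin, 13 columns per bay, 2 columns of right margin.
bays=10
interior_width=$((4 + bays * 13 + 2))   # 136
total_width=$((interior_width + 2))     # 138, counting both borders

# Hypothetical helpers that derive every row from interior_width,
# so the border and the content rows cannot drift apart.
border_line() {
    local dashes
    dashes="$(printf "%${interior_width}s" "" | tr ' ' '-')"
    printf '+%s+\n' "$dashes"
}
content_line() {
    printf "|%-${interior_width}.${interior_width}s|\n" "$1"
}

border_line
content_line " Storage Controllers:"
border_line
```
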
4a98a6f6f8 Add storage-01 HBA bay 5 mapping (phy9)
Verified via ls -la /dev/disk/by-path/ and physical inspection
that HBA SAS3416 phy9 maps to bay 5 (C0 SATA breakout).
Remaining C0 bays 6-8 and C1 bays 9-10 still need drives to verify.

Ref #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:20:44 -05:00
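
The `ls -la /dev/disk/by-path/` check mentioned in commit 4a98a6f6f8 can be reduced to a list of link-to-device pairs. This is a sketch only: `list_whole_disks` is a hypothetical helper, and the directory is a parameter so it can be pointed at a test directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list whole-disk entries in a by-path style directory
# as "link-name -> device-name", skipping partition links.
list_whole_disks() {
    local dir="${1:-/dev/disk/by-path}" path
    for path in "$dir"/*; do
        [ -L "$path" ] || continue                       # only symlinks
        case "$path" in *-part[0-9]*) continue ;; esac   # skip partitions
        printf '%s -> %s\n' "$(basename "$path")" \
            "$(basename "$(readlink -f "$path")")"
    done
}

# On a Linux host this prints one line per whole disk.
list_whole_disks
```

Matching a printed link name such as `pci-0000:01:00.0-sas-phy9-lun-0` against the drive that is physically in a bay gives the PHY-to-bay pair recorded in the mapping.
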
2177ae9092 Fix local keyword used outside function, document storage-01 HBA layout
- Remove `local` from max_parallel_jobs/job_count (not inside a function)
- Document storage-01 physical layout: mobo SATA ports, HBA Mini-SAS HD
  ports C0-C3, U.2 NVMe serial numbers

Ref #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:15:45 -05:00
71f83e82c5 Add robustness improvements: bash version check, cleanup trap, hostname sanitization, parallel job limit
- Add bash 4.2+ version check since script uses declare -g -A
- Add cleanup trap (EXIT/INT/TERM) for SMART_CACHE_DIR temp directory
- Sanitize hostname to strip unexpected characters
- Limit parallel SMART collection to 10 concurrent jobs

Fixes #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:09:40 -05:00
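
The diff does not show how commit 71f83e82c5 enforces its limit of 10 concurrent jobs, so the following is an assumed batch-style sketch rather than the script's code. `collect_one` is a stand-in for the real per-drive SMART query, and `collect_all` is hypothetical. The variable names `max_parallel_jobs` and `job_count` are borrowed from the commit messages.

```bash
#!/usr/bin/env bash
# Sketch of a batch-style concurrency cap. collect_one is a stand-in for the
# real per-drive SMART query and only writes a marker file.
max_parallel_jobs=10

collect_one() {
    local dev="$1" out_dir="$2"
    echo "raw data for $dev" > "$out_dir/$dev"
}

collect_all() {
    local out_dir="$1"; shift
    local job_count=0 dev
    for dev in "$@"; do
        collect_one "$dev" "$out_dir" &
        job_count=$((job_count + 1))
        # Once the cap is reached, wait for the whole batch before continuing.
        if [ "$job_count" -ge "$max_parallel_jobs" ]; then
            wait
            job_count=0
        fi
    done
    wait   # collect any remaining jobs
}

cache="$(mktemp -d)"
collect_all "$cache" sda sdb sdc
ls "$cache"
rm -rf "$cache"
```

A batch wait leaves slots idle while the slowest job in a batch finishes. `wait -n` would keep the slots full, but it needs Bash 4.3, one minor version above the 4.2 minimum this commit checks for.
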
b79c69be99 Fix OSD header regex to match double-digit OSD numbers
The ceph-volume lvm list output varies the number of trailing equals
signs based on OSD number length:
- Single digit: "====== osd.5 =======" (7 equals)
- Double digit: "====== osd.19 ======" (6 equals)

Changed regex to require at least 6 trailing equals (the pattern is
not anchored at the end), which matches both formats.

Fixes: #17

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:23:10 -05:00
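
The effect of the regex change in commit b79c69be99 can be checked against the two header formats quoted in its message. In this sketch the two patterns are copied from the diff, and `osd_from_header` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# The old pattern required 7 trailing '='; the new one requires 6 and is not
# anchored, so 7 also matches. Patterns are stored in variables so bash
# treats them as regular expressions rather than quoted literals.
old_re='======[[:space:]]+osd\.([0-9]+)[[:space:]]+======='
new_re='======[[:space:]]+osd\.([0-9]+)[[:space:]]+======'

# Hypothetical helper: print "osd.N" if the line is an OSD header.
osd_from_header() {
    local line="$1" re="${2:-$new_re}"
    [[ "$line" =~ $re ]] || return 1
    echo "osd.${BASH_REMATCH[1]}"
}

for header in "====== osd.5 =======" "====== osd.19 ======"; do
    printf '%-22s old=%-7s new=%s\n' "$header" \
        "$(osd_from_header "$header" "$old_re" || echo none)" \
        "$(osd_from_header "$header" "$new_re" || echo none)"
done
```
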
eb73e03495 Fix temperature parsing with parenthetical data
SMART output for Temperature_Celsius often includes extra sensor data
in parentheses like "26 (0 14 0 0 0)". The previous awk command was
finding "0" from the parenthetical instead of the actual temperature.

Now strips parenthetical content with sed before extracting the last
numeric value.

Fixes: #11

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:19:24 -05:00
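
The failure described in commit eb73e03495 is easy to reproduce on a single attribute line. The sketch below uses an illustrative line with abbreviated column spacing. The helper names are hypothetical, while the sed and awk expressions follow the diff.

```bash
#!/usr/bin/env bash
# Illustrative attribute line; column spacing is abbreviated.
line='194 Temperature_Celsius 0x0022 026 040 000 Old_age Always - 26 (0 14 0 0 0)'

# Scan fields right to left and print the first purely numeric one.
last_number() {
    awk '{for (i = NF; i > 0; i--) if ($i ~ /^[0-9]+$/) {print $i; exit}}'
}

# Without stripping, the scan stops inside the parenthetical.
echo "$line" | last_number      # prints 0

# With the parenthetical removed first, the raw temperature is found.
extract_temp() {
    echo "$1" | sed 's/([^)]*)//g' | last_number
}
extract_temp "$line"            # prints 26
```
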
f3785c13bc Fix temperature parsing for SAS drives
Added support for SAS drive temperature format "Current Drive Temperature:"
and made temperature extraction more robust by:
- Removing ^ anchor that was preventing matches with leading whitespace
- Using awk to find the first numeric value in the line
- Adding explicit SAS drive temperature format handling

Fixes: #11

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:14:23 -05:00
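
The two ideas in commit f3785c13bc, an unanchored match and "first numeric field wins", fit in a few lines. This sketch is an assumption-laden simplification: `temp_from_smart` is a hypothetical helper, and a single awk program stands in for the script's grep, head, and awk pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper: take raw SMART text, find the first line containing
# "Temperature:" (no ^ anchor, so leading whitespace is fine) and print the
# first purely numeric field on it.
temp_from_smart() {
    printf '%s\n' "$1" | awk '
        /Temperature:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) { print $i; exit }
        }'
}

temp_from_smart "Current Drive Temperature:     35 C"        # prints 35
temp_from_smart "    Temperature:               42 Celsius"  # prints 42
```
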
7579f371d7 Fix Ceph device parsing to use devices line
The "block device" line in ceph-volume output shows LVM paths like
ceph-xxx/osd-block-xxx, not physical device names. Changed to parse
the "devices" line which contains the actual physical device path
like /dev/sda.

Also reset current_osd after match to avoid duplicate matches.

Fixes: #17

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:10:46 -05:00
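
The parsing change in commit 7579f371d7 can be exercised without a Ceph cluster by feeding canned text through the same two regexes. In this sketch the sample text is abbreviated and made up (real `ceph-volume lvm list` output carries more fields per OSD), and `map_devices_to_osds` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ceph-volume lvm list` style text on stdin and
# print "device=osd.N" for each OSD whose "devices" line names a disk.
map_devices_to_osds() {
    local header_re='======[[:space:]]+osd\.([0-9]+)[[:space:]]+======'
    local devices_re='devices[[:space:]]+/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+)'
    local current_osd="" line
    while IFS= read -r line; do
        if [[ "$line" =~ $header_re ]]; then
            current_osd="osd.${BASH_REMATCH[1]}"
        elif [[ -n "$current_osd" && "$line" =~ $devices_re ]]; then
            echo "${BASH_REMATCH[1]}=$current_osd"
            current_osd=""   # reset so a later line cannot match again
        fi
    done
}

# Abbreviated, made-up sample input.
map_devices_to_osds <<'EOF'
====== osd.5 =======
      block device              /dev/ceph-aaaa/osd-block-bbbb
      devices                   /dev/sda
====== osd.19 ======
      block device              /dev/ceph-cccc/osd-block-dddd
      devices                   /dev/nvme0n1
EOF
```

The "block device" lines are skipped because the pattern requires the literal word `devices` followed by a `/dev/sdX` or `/dev/nvmeXnY` name.
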
e6cc9a3853 Fix Rotation Rate regex to handle leading whitespace
The smartctl output has leading whitespace before field names:
  "Rotation Rate:    7200 rpm"

Removed the ^ anchor from the regex so it matches lines with
leading whitespace. This fixes HDD detection for drives that
have proper Rotation Rate fields in their SMART data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:07:35 -05:00
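
With the anchor removed in commit e6cc9a3853, drive type detection follows a fixed priority: NVMe by device name, then the Rotation Rate field, then "SSD" in the model field, otherwise HDD. The sketch below is a simplified, hypothetical `detect_drive_type`, not the script's function. It uses `case` patterns where the script uses grep.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the detection order described in the commits.
detect_drive_type() {
    local device="$1" smart_info="$2" rotation
    case "$device" in nvme*) echo "NVMe"; return ;; esac

    # No ^ anchor, so an indented "Rotation Rate:" line still matches.
    rotation="$(printf '%s\n' "$smart_info" | awk '/Rotation Rate:/ {print; exit}')"
    if [ -n "$rotation" ]; then
        case "$rotation" in
            *[Ss]olid\ [Ss]tate*) echo "SSD" ;;
            *) echo "HDD" ;;   # an RPM value or anything unrecognized
        esac
    elif printf '%s\n' "$smart_info" | grep -qE "Device Model:.*SSD|Model Number:.*SSD"; then
        echo "SSD"
    else
        echo "HDD"
    fi
}

detect_drive_type sda "  Rotation Rate:    7200 rpm"           # prints HDD
detect_drive_type sdb "Rotation Rate:    Solid State Device"   # prints SSD
detect_drive_type nvme0n1 ""                                   # prints NVMe
```
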
51cc739da5 Add debug logging for Ceph OSD detection
Added log_info messages to show:
- Count of OSDs found
- Each device-to-OSD mapping as discovered

Also fixed array subscript quoting in CEPH_DEVICE_TO_OSD.

Run with --verbose to see Ceph detection diagnostics.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:04:49 -05:00
2b9871d887 Fix HDD/SSD detection to be more accurate
Improved device type detection:
- Use anchored regex (^Rotation Rate:) to avoid false matches
- Check for actual RPM values (e.g., "7200 rpm") to confirm HDD
- Only match SSD in model name field, not anywhere in output
- Default to HDD when Rotation Rate field is missing

This fixes drives like WDC WD80EFZZ being incorrectly detected
as SSDs when the Rotation Rate field wasn't being matched.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:02:25 -05:00
4a86cdd167 Refactor SMART parsing for parallel collection compatibility
Split SMART data handling into two functions:
- parse_smart_data(): Parses raw smartctl output (no I/O)
- get_drive_smart_info(): Fetches and parses (wrapper)

Changed parallel collection to save raw smartctl output to cache
files, then parse during the display loop. This avoids issues
with function availability in background subshells when running
from process substitution (bash <(curl ...)).

Also fixed:
- Removed orphan code that was outside function scope
- Fixed lsblk caching to use separate calls for SIZE and MOUNTPOINT

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:57:37 -05:00
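
The split introduced in commit 4a86cdd167, fetching raw text in the background and parsing it later in the main shell, can be sketched as two stages. Everything here is illustrative: `parse_serial` is a hypothetical pure parser, and `printf` stands in for the real `smartctl` call.

```bash
#!/usr/bin/env bash
# Hypothetical pure parser: takes raw text as an argument and does no I/O,
# so it can be exercised with canned input.
parse_serial() {
    printf '%s\n' "$1" | awk -F: '/Serial [Nn]umber:/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2; exit
    }'
}

cache_dir="$(mktemp -d)"

# Stage 1: background jobs only write raw text to per-device cache files.
for dev in sda sdb; do
    printf 'Serial Number:    FAKE-%s\n' "$dev" > "$cache_dir/$dev" &
done
wait

# Stage 2: parsing happens later, in the main shell.
for dev in sda sdb; do
    echo "$dev: $(parse_serial "$(cat "$cache_dir/$dev")")"
done

rm -rf "$cache_dir"
```

Because the background jobs call no shell functions, the design does not depend on function definitions being visible in the subshells.
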
58897b1f3a Fix lsblk caching to properly parse SIZE and MOUNTPOINT
Split lsblk queries into two separate calls:
1. lsblk -dn for disk sizes (whole disk only, simpler parsing)
2. lsblk -rn for mount points (handles partition-to-parent mapping)

This fixes issues where:
- SIZE was empty due to delimiter confusion
- Mount points with spaces caused parsing errors

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:54:41 -05:00
fbd9965fb1 Fix 'local' used outside function context
Removed 'local' keyword from colored_warnings variable assignment
in the main script body. The 'local' keyword can only be used
inside functions in bash.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 19:53:22 -05:00
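
The restriction behind commit fbd9965fb1 can be demonstrated in isolation. In this sketch `format_warnings` is a hypothetical function, and the child `bash -c` keeps the failing statement from affecting the current shell.

```bash
#!/usr/bin/env bash
# In the main script body, `local` is an error. Running it in a child bash
# shows this without aborting the current script.
if bash -c 'local colored_warnings=""' 2>/dev/null; then
    echo "local accepted at top level"
else
    echo "local rejected at top level"
fi

# Inside a function the same declaration is valid and scoped to the call.
format_warnings() {
    local colored_warnings="[$1]"
    echo "$colored_warnings"
}
format_warnings "CRC errors"

# At top level, a plain assignment is the replacement.
colored_warnings=""
```
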
6 changed files with 257 additions and 187 deletions


@@ -159,9 +159,9 @@ The project began with:
 ## Metrics
-- **Lines of Code:** ~330 (main script)
-- **Supported Chassis Types:** 4 (10-bay, large1, micro, spare)
-- **Mapped Servers:** 1 fully (compute-storage-01), 3 pending
+- **Lines of Code:** ~1178 (main script)
+- **Supported Chassis Types:** 3 (10bay, large1, micro)
+- **Mapped Servers:** 1 fully (compute-storage-01), 3 partially (storage-01, large1, compute-storage-gpu-01), 2 stubs (micro1, monitor-02)
 - **Features Added:** 10+
 - **Bugs Fixed:** 6 major, multiple minor
 - **Documentation:** Comprehensive README + this file
@@ -206,4 +206,4 @@ The result is a robust infrastructure management tool that provides instant visi
 - **Human Developer:** LotusGuild
 - **AI Assistant:** Claude Sonnet 4.5 (Anthropic)
 - **Development Date:** January 6, 2026
-- **Project:** Drive Atlas v1.0
+- **Project:** Drive Atlas v1.1.0


@@ -41,14 +41,14 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
 | Chassis Type | Description | Servers Using It |
 |-------------|-------------|------------------|
-| **10-Bay Hot-swap** | Sliger CX471225 4U 10x 3.5" NAS (with unused 2x 5.25" bays) | compute-storage-01, compute-storage-gpu-01, storage-01 |
+| **10-Bay Hot-swap** | Sliger CX4712 4U 10x 3.5" NAS (with unused 2x 5.25" bays) | compute-storage-01, compute-storage-gpu-01, storage-01 |
 | **Large1 Grid** | Unique 3x5 grid layout (1/1 configuration) | large1 |
 | **Micro** | Compact 2-drive layout | micro1, monitor-02 |
 ### Server Details
 #### compute-storage-01 (formerly medium2)
-- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
+- **Chassis:** Sliger CX4712 4U (10-Bay Hot-swap)
 - **Motherboard:** B650D4U3-2Q/BCM
 - **Controllers:**
 - 01:00.0 - LSI SAS3008 HBA (bays 5-10 via 2x mini-SAS HD)
@@ -57,20 +57,20 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
 - **Status:** ✅ Fully mapped and verified
 #### storage-01
-- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
-- **Motherboard:** Different from compute-storage-01
-- **Controllers:** Motherboard SATA only (no HBA currently)
-- **Status:** ⚠️ Requires PCI path mapping
+- **Chassis:** Sliger CX4712 4U (10-Bay Hot-swap)
+- **Motherboard:** ASRock A320M-HDV R4.0
+- **Controllers:** AMD SATA (bays 1-4), LSI SAS3416 HBA (bays 5+, U.2 NVMe)
+- **Status:** ⚠️ Partially mapped (5 of 10 bays)
 #### large1
 - **Chassis:** Unique 3x5 grid (15 bays total)
 - **Note:** 1/1 configuration, will not be replicated
-- **Status:** ⚠️ Requires PCI path mapping
+- **Status:** ⚠️ Partially mapped (14 bays + 2 M.2)
 #### compute-storage-gpu-01
-- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
-- **Motherboard:** Same as compute-storage-01
-- **Status:** ⚠️ Requires PCI path mapping
+- **Chassis:** Sliger CX4712 4U (10-Bay Hot-swap)
+- **Motherboard:** ASUS PRIME B550-PLUS
+- **Status:** ⚠️ Partially mapped (5 SATA + 1 M.2)
 ## Output Example
@@ -130,15 +130,15 @@ declare -A SERVER_MAPPINGS=(
 ## Setting Up a New Server
-### Step 1: Run Diagnostic Script
+### Step 1: Run Diagnostic Mode
 First, gather PCI path information:
 ```bash
-bash diagnose-drives.sh > server-diagnostic.txt
+bash driveAtlas.sh --diagnose
 ```
-This will show all available PCI paths and their associated drives.
+This will show all available PCI paths, storage controllers, and their associated drives.
 ### Step 2: Physical Bay Identification
@@ -192,7 +192,7 @@ DEBUG=1 bash driveAtlas.sh
 | **SIZE** | Drive capacity |
 | **TYPE** | SSD or HDD (detected via SMART) |
 | **TEMP** | Current temperature from SMART |
-| **HEALTH** | SMART health status (✓ = passed, ✗ = failed) |
+| **HEALTH** | SMART health status (✓ = passed, ⚠ = passed with warnings, ✗ = failed) |
 | **MODEL** | Drive model number |
 | **SERIAL** | Drive serial number (for physical verification) |
 | **CEPH OSD** | Ceph OSD ID if drive hosts an OSD |
@@ -235,17 +235,15 @@ DEBUG=1 bash driveAtlas.sh
 ## Files
-- [driveAtlas.sh](driveAtlas.sh) - Main script
-- [diagnose-drives.sh](diagnose-drives.sh) - PCI path diagnostic tool
+- [driveAtlas.sh](driveAtlas.sh) - Main script (includes `--diagnose` mode for PCI path discovery)
 - [README.md](README.md) - This file
 - [CLAUDE.md](CLAUDE.md) - AI-assisted development notes
-- [todo.txt](todo.txt) - Development notes and task tracking
 ## Contributing
 When adding support for a new server:
-1. Run `diagnose-drives.sh` and save output
+1. Run `driveAtlas.sh --diagnose` and save output
 2. Physically label or identify drives by serial number
 3. Create mapping in `SERVER_MAPPINGS`
 4. Test thoroughly


@@ -1,59 +0,0 @@
#!/bin/bash
# Drive Atlas Diagnostic Script
# Run this on each server to gather PCI path information
echo "=== Server Information ==="
echo "Hostname: $(hostname)"
echo "Date: $(date)"
echo ""
echo "=== All /dev/disk/by-path/ entries ==="
ls -la /dev/disk/by-path/ | grep -v "part" | sort
echo ""
echo "=== Organized by PCI Address ==="
for path in /dev/disk/by-path/*; do
if [ -L "$path" ]; then
# Skip partitions
if [[ "$path" =~ -part[0-9]+$ ]]; then
continue
fi
basename_path=$(basename "$path")
target=$(readlink -f "$path")
device=$(basename "$target")
echo "Path: $basename_path"
echo " -> Device: $device"
# Try to get size
if [ -b "$target" ]; then
size=$(lsblk -d -n -o SIZE "$target" 2>/dev/null)
echo " -> Size: $size"
fi
# Try to get SMART info for model
if command -v smartctl >/dev/null 2>&1; then
model=$(sudo smartctl -i "$target" 2>/dev/null | grep "Device Model\|Model Number" | cut -d: -f2 | xargs)
if [ -n "$model" ]; then
echo " -> Model: $model"
fi
fi
echo ""
fi
done
echo "=== PCI Devices with Storage Controllers ==="
lspci | grep -i "storage\|raid\|sata\|sas\|nvme"
echo ""
echo "=== Current Block Devices ==="
lsblk -d -o NAME,SIZE,TYPE,TRAN | grep -v "rbd\|loop"
echo ""
echo "=== Recommendations ==="
echo "1. Note the PCI addresses (e.g., 0c:00.0) of your storage controllers"
echo "2. For each bay, physically identify which drive is in it"
echo "3. Match the PCI path pattern to the bay number"
echo "4. Example: pci-0000:0c:00.0-ata-1 might be bay 1 on controller 0c:00.0"


@@ -11,8 +11,25 @@
 # Note: Not using -u (nounset) as script uses ${var:-default} patterns
 set -o pipefail
+# Require bash 4.2+ for declare -g -A (global associative arrays)
+if ((BASH_VERSINFO[0] < 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] < 2))); then
+echo "ERROR: This script requires Bash 4.2 or higher (current: $BASH_VERSION)" >&2
+exit 1
+fi
 VERSION="1.1.0"
+#------------------------------------------------------------------------------
+# Cleanup Trap
+# Ensures temporary directories are removed on exit or interruption
+#------------------------------------------------------------------------------
+cleanup() {
+if [[ -n "${SMART_CACHE_DIR:-}" && -d "$SMART_CACHE_DIR" ]]; then
+rm -rf "$SMART_CACHE_DIR"
+fi
+}
+trap cleanup EXIT INT TERM
 #------------------------------------------------------------------------------
 # Path Constants
 # Centralized path definitions to avoid hardcoding throughout the script
@@ -43,6 +60,7 @@ OPTIONS:
 --verbose Show detailed error messages and warnings
 --no-ceph Skip Ceph OSD information
 --show-pci Show PCI paths in output
+--diagnose Show all PCI paths and block devices (for mapping new servers)
 EXAMPLES:
 $(basename "$0") # Normal run with all features
@@ -50,6 +68,7 @@ EXAMPLES:
 $(basename "$0") --color # Run with colored output
 $(basename "$0") --verbose # Show all errors and warnings
 $(basename "$0") --debug # Show mapping debug info
+$(basename "$0") --diagnose # Gather PCI paths for new server setup
 ENVIRONMENT VARIABLES:
 DEBUG=1 Same as --debug flag
@@ -66,6 +85,7 @@ SKIP_CEPH=false
 SHOW_PCI=false
 USE_COLOR=false
 VERBOSE=false
+RUN_DIAGNOSE=false
 while [[ $# -gt 0 ]]; do
 case "$1" in
@@ -101,6 +121,10 @@ while [[ $# -gt 0 ]]; do
 VERBOSE=true
 shift
 ;;
+--diagnose)
+RUN_DIAGNOSE=true
+shift
+;;
 *)
 echo "Unknown option: $1" >&2
 echo "Use --help for usage information." >&2
@@ -304,6 +328,68 @@ check_dependencies() {
 # Run dependency check at script start
 check_dependencies
+#------------------------------------------------------------------------------
+# run_diagnose
+#
+# Displays all PCI disk paths, storage controllers, and block devices.
+# Used to gather information needed when mapping a new server.
+#------------------------------------------------------------------------------
+run_diagnose() {
+local hostname
+hostname="$(hostname)"
+echo "=== Server Information ==="
+echo "Hostname: $hostname"
+echo "Date: $(date)"
+echo ""
+echo "=== Storage Controllers ==="
+lspci 2>/dev/null | grep -iE "SAS|SATA|RAID|Mass storage|NVMe"
+echo ""
+echo "=== All /dev/disk/by-path/ entries (whole disks only) ==="
+for path in "${DISK_BY_PATH}"/*; do
+[[ -L "$path" ]] || continue
+# Skip partitions
+[[ "$path" =~ -part[0-9]+$ ]] && continue
+local basename_path target device size serial model
+basename_path="$(basename "$path")"
+target="$(readlink -f "$path")"
+device="$(basename "$target")"
+size="$(lsblk -d -n -o SIZE "$target" 2>/dev/null | xargs)"
+printf " %-55s -> %-10s %s\n" "$basename_path" "$device" "${size:+($size)}"
+done
+echo ""
+echo "=== Block Devices ==="
+lsblk -d -o NAME,SIZE,TYPE,TRAN 2>/dev/null | grep -v "rbd\|loop"
+echo ""
+# Check if this server has a mapping
+local sanitized
+sanitized="$(echo "$hostname" | tr -cd '[:alnum:]-_.')"
+if [[ -n "${SERVER_MAPPINGS[$sanitized]:-}" ]]; then
+echo "=== Current Mapping for $sanitized ==="
+echo "${SERVER_MAPPINGS[$sanitized]}" | while read -r pci_path bay; do
+[[ -z "$pci_path" || -z "$bay" ]] && continue
+if [[ -L "${DISK_BY_PATH}/$pci_path" ]]; then
+local dev
+dev="$(readlink -f "${DISK_BY_PATH}/$pci_path" | sed 's/.*\///')"
+printf " Bay %-5s %-55s -> %s\n" "$bay" "$pci_path" "$dev"
+else
+printf " Bay %-5s %-55s -> (not connected)\n" "$bay" "$pci_path"
+fi
+done
+else
+echo "NOTE: No mapping exists yet for '$sanitized'."
+echo "Use the PCI paths above to create a SERVER_MAPPINGS entry."
+fi
+exit 0
+}
 #------------------------------------------------------------------------------
 # Chassis Layout Generator Functions
 # These define the physical layout and display formatting for each chassis type
@@ -327,26 +413,29 @@ generate_10bay_layout() {
# Fixed width for consistent box drawing (fits device names like "nvme0n1") # Fixed width for consistent box drawing (fits device names like "nvme0n1")
local drive_width=10 local drive_width=10
# Box interior width = 136 (determined by 10 bay boxes: 4 + 10*13 + 2)
# Total box width = 138 (136 interior + 2 for │ borders)
# Main chassis section # Main chassis section
printf "┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐\n" printf "┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐\n"
printf "│ %-126s │\n" "$hostname - Sliger CX4712 (10x 3.5\" Hot-swap)" printf "│ %-132s │\n" "$hostname - Sliger CX4712 (10x 3.5\" Hot-swap)"
printf "│ │\n" printf "│%-136s│\n" ""
# Show storage controllers # Show storage controllers
printf "│ Storage Controllers: │\n" printf "│ %-134s│\n" "Storage Controllers:"
while IFS= read -r ctrl; do while IFS= read -r ctrl; do
[[ -n "$ctrl" ]] && printf "│ %-126s│\n" "$ctrl" [[ -n "$ctrl" ]] && printf "│ %-134.134s│\n" "$ctrl"
done < <(get_storage_controllers) done < <(get_storage_controllers)
printf "│ │\n" printf "│%-136s│\n" ""
# M.2 NVMe slot if present # M.2 NVMe slot if present
if [[ -n "${DRIVE_MAP[m2-1]}" ]]; then if [[ -n "${DRIVE_MAP[m2-1]}" ]]; then
printf "│ M.2 NVMe: %-10s │\n" "${DRIVE_MAP[m2-1]}" printf "│ %-134s│\n" " M.2 NVMe: ${DRIVE_MAP[m2-1]}"
printf "│ │\n" printf "│%-136s│\n" ""
fi fi
printf "│ Front Hot-swap Bays: │\n" printf "│ %-134s│\n" " Front Hot-swap Bays:"
printf "│ │\n" printf "│%-136s│\n" ""
# Bay top borders # Bay top borders
printf "│ " printf "│ "
@@ -369,7 +458,7 @@ generate_10bay_layout() {
done done
printf " │\n" printf " │\n"
printf "└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n" printf "└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n"
} }
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
@@ -398,7 +487,7 @@ generate_micro_layout() {
printf "│ │\n" printf "│ │\n"
printf "│ Storage Controllers: │\n" printf "│ Storage Controllers: │\n"
while IFS= read -r ctrl; do while IFS= read -r ctrl; do
[[ -n "$ctrl" ]] && printf "│ %-57s│\n" "$ctrl" [[ -n "$ctrl" ]] && printf "│ %-57.57s│\n" "$ctrl"
done < <(get_storage_controllers) done < <(get_storage_controllers)
printf "│ │\n" printf "│ │\n"
@@ -440,7 +529,7 @@ generate_large1_layout() {
printf "│ │\n" printf "│ │\n"
printf "│ Storage Controllers: │\n" printf "│ Storage Controllers: │\n"
while IFS= read -r ctrl; do while IFS= read -r ctrl; do
[[ -n "$ctrl" ]] && printf "│ %-69s│\n" "$ctrl" [[ -n "$ctrl" ]] && printf "│ %-69.69s│\n" "$ctrl"
done < <(get_storage_controllers) done < <(get_storage_controllers)
printf "│ │\n" printf "│ │\n"
printf "│ M.2 NVMe: M1: %-10s M2: %-10s │\n" "${DRIVE_MAP[m2-1]:-EMPTY}" "${DRIVE_MAP[m2-2]:-EMPTY}" printf "│ M.2 NVMe: M1: %-10s M2: %-10s │\n" "${DRIVE_MAP[m2-1]:-EMPTY}" "${DRIVE_MAP[m2-2]:-EMPTY}"
@@ -503,13 +592,26 @@ declare -A SERVER_MAPPINGS=(
 "
 # storage-01
-# Motherboard: ASRock A320M-HDV R4.0 with AMD SATA controller at 02:00.1
-# 4 SATA ports used (ata-1, ata-2, ata-5, ata-6) - ata-3/4 empty
+# Motherboard: ASRock A320M-HDV R4.0
+# AMD SATA controller at 02:00.1 (bays 1-4)
+# Mobo SATA physical layout:
+# top-left=bay 1, bottom-left=bay 2, top-right=bay 3, bottom-right=bay 4
+# HBA: LSI SAS3416 at 01:00.0 (4x Mini-SAS HD ports, top=C0 to bottom=C3)
+# C0 (top): 4x SATA breakout → bays 5-8
+# C1: 4x SATA breakout → bays 9-10 (2 of 4 ports used)
+# C2: U.2 NVMe (serial ends in 0d66) → u2-1
+# C3: U.2 NVMe (serial ends in 0d4f) → u2-2
+# C0 verified: phy9=bay5 (remaining phy8/10/11 → bays 6-8 TBD)
+# C1: PHY-to-bay mapping TBD (bays 9-10)
+# C2: U.2 NVMe (serial ends in 0d66) → u2-1 (needs FW update)
+# C3: U.2 NVMe (serial ends in 0d4f) → u2-2 (needs FW update)
+# Also present: 09:00.0 AMD FCH SATA Controller [AHCI mode]
 ["storage-01"]="
 pci-0000:02:00.1-ata-1 1
 pci-0000:02:00.1-ata-2 2
 pci-0000:02:00.1-ata-5 3
 pci-0000:02:00.1-ata-6 4
+pci-0000:01:00.0-sas-phy9-lun-0 5
 "
 # large1
@@ -607,7 +709,7 @@ get_storage_controllers() {
 # Values: PCI path strings (for --show-pci option)
 #------------------------------------------------------------------------------
 build_drive_map() {
-local host="$(hostname)"
+local host="$(hostname | tr -cd '[:alnum:]-_.')"
 local mapping="${SERVER_MAPPINGS[$host]}"
 # Declare global arrays directly
@@ -615,7 +717,7 @@ build_drive_map() {
 declare -g -A BAY_TO_PCI_PATH=()
 if [[ -z "$mapping" ]]; then
-log_warn "No drive mapping found for host '$host'. Run diagnose-drives.sh to create one."
+log_warn "No drive mapping found for host '$host'. Run with --diagnose to gather PCI path info."
 return
 fi
@@ -665,16 +767,23 @@ build_ceph_cache() {
 # Parse ceph-volume lvm list output
 # Format: blocks starting with "====== osd.X =======" followed by device info
 local current_osd=""
+local osd_count=0
 while IFS= read -r line; do
-# Match OSD header: "====== osd.5 ======="
-if [[ "$line" =~ ======[[:space:]]+osd\.([0-9]+)[[:space:]]+======= ]]; then
+# Match OSD header: "====== osd.5 =======" or "====== osd.19 ======"
+# Number of trailing equals varies based on OSD number length
+if [[ "$line" =~ ======[[:space:]]+osd\.([0-9]+)[[:space:]]+====== ]]; then
 current_osd="osd.${BASH_REMATCH[1]}"
-# Match block device line: " block device /dev/sda"
-elif [[ -n "$current_osd" && "$line" =~ block[[:space:]]device[[:space:]]+/dev/([^[:space:]]+) ]]; then
+# Match "devices" line which has the actual physical device: " devices /dev/sda"
+# This is more reliable than "block device" which may show LVM paths
+elif [[ -n "$current_osd" && "$line" =~ devices[[:space:]]+/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+) ]]; then
 local dev_name="${BASH_REMATCH[1]}"
-CEPH_DEVICE_TO_OSD[$dev_name]="$current_osd"
+CEPH_DEVICE_TO_OSD["$dev_name"]="$current_osd"
+((osd_count++))
+log_info "Found $current_osd on $dev_name"
+current_osd="" # Reset to avoid duplicate matches
 fi
 done < <(ceph-volume lvm list 2>/dev/null)
+log_info "Cached $osd_count Ceph OSDs"
 # Skip if ceph command is not available
 if ! command -v ceph &>/dev/null; then
@@ -714,24 +823,19 @@ readonly SMART_CRC_ERROR_WARN=100 # UDMA CRC error warning threshold
 readonly SMART_POWER_ON_HOURS_WARN=43800 # ~5 years of continuous use
 #------------------------------------------------------------------------------
-# get_drive_smart_info
+# parse_smart_data
 #
-# Retrieves SMART data for a given device.
+# Parses raw SMART data and returns formatted info string.
 #
 # Args:
 # $1 - Device name (e.g., sda, nvme0n1)
+# $2 - Raw smartctl output string
 #
 # Returns: Pipe-delimited string: TYPE|TEMP|HEALTH|MODEL|SERIAL|WARNINGS
-# TYPE: SSD, HDD, or NVMe
-# TEMP: Temperature in Celsius (or "-" if unavailable)
-# HEALTH: ✓ for passed, ✗ for failed, ⚠ for passed with warnings
-# MODEL: Drive model string
-# SERIAL: Drive serial number
-# WARNINGS: Comma-separated warning codes (or empty)
 #------------------------------------------------------------------------------
-get_drive_smart_info() {
+parse_smart_data() {
 local device="$1"
-local smart_info
+local smart_info="$2"
 local temp="-"
 local type="HDD"
 local health="✗"
@@ -739,48 +843,51 @@ get_drive_smart_info() {
local serial="-" local serial="-"
local warnings="" local warnings=""
# Capture both stdout and stderr for better error reporting
local smart_stderr
smart_stderr="$(mktemp)"
smart_info="$(sudo smartctl -A -i -H "/dev/$device" 2>"$smart_stderr")"
local smart_exit=$?
if [[ $smart_exit -ne 0 && -s "$smart_stderr" ]]; then
log_warn "SMART query failed for $device: $(head -1 "$smart_stderr")"
fi
rm -f "$smart_stderr"
if [[ -z "$smart_info" ]]; then if [[ -z "$smart_info" ]]; then
log_info "No SMART data available for $device"
echo "HDD|-|✗|-|-|" echo "HDD|-|✗|-|-|"
return return
fi fi
# Temperature parsing - handles multiple formats: # Temperature parsing - handles multiple formats:
# - SATA: "194 Temperature_Celsius ... 35" (value at end of line) # - SATA: "194 Temperature_Celsius ... 26 (0 14 0 0 0)" (value before parenthetical)
# - SATA: "Temperature: 42 Celsius" # - SATA: "Temperature: 42 Celsius"
# - SATA: "Current Temperature: 35 Celsius" # - SATA: "Current Temperature: 35 Celsius"
# - SAS: "Current Drive Temperature: 35 C"
# - NVMe: "Temperature: 42 Celsius" # - NVMe: "Temperature: 42 Celsius"
if echo "$smart_info" | grep -q "Temperature_Celsius"; then if echo "$smart_info" | grep -q "Temperature_Celsius"; then
# SMART attribute format - temperature is typically the 10th field (raw value) # Strip parenthetical data like "(0 14 0 0 0)" before finding last number
-# But we use the last numeric field before any parentheses for reliability
-temp="$(echo "$smart_info" | grep "Temperature_Celsius" | head -1 | awk '{for(i=NF;i>0;i--) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
-elif echo "$smart_info" | grep -qE "^(Current )?Temperature:"; then
-# Simple "Temperature: XX Celsius" format
-temp="$(echo "$smart_info" | grep -E "^(Current )?Temperature:" | head -1 | awk '{print $2}')"
+temp="$(echo "$smart_info" | grep "Temperature_Celsius" | head -1 | sed 's/([^)]*)//g' | awk '{for(i=NF;i>0;i--) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
+elif echo "$smart_info" | grep -qE "Current Drive Temperature:"; then
+# SAS drives: "Current Drive Temperature: 35 C"
+temp="$(echo "$smart_info" | grep -E "Current Drive Temperature:" | head -1 | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
+elif echo "$smart_info" | grep -qE "(Current )?Temperature:"; then
+# SATA/NVMe: "Temperature: 42 Celsius" (may have leading whitespace)
+temp="$(echo "$smart_info" | grep -E "(Current )?Temperature:" | head -1 | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
 fi
 # Device type detection - handles SSD, HDD, and NVMe
+# Priority: 1) NVMe by name, 2) Rotation Rate field, 3) Model name hints, 4) Default HDD
 if [[ "$device" == nvme* ]]; then
 type="NVMe"
-elif echo "$smart_info" | grep -q "Rotation Rate"; then
-if echo "$smart_info" | grep "Rotation Rate" | grep -qiE "solid state|0 rpm"; then
+elif echo "$smart_info" | grep -qE "Rotation Rate:"; then
+# Check the Rotation Rate field value (may have leading whitespace)
+local rotation_rate
+rotation_rate="$(echo "$smart_info" | grep -E "Rotation Rate:" | head -1)"
+if echo "$rotation_rate" | grep -qiE "solid state"; then
 type="SSD"
+elif echo "$rotation_rate" | grep -qE "[0-9]+ rpm"; then
+# Has actual RPM value (e.g., "7200 rpm") - it's an HDD
+type="HDD"
 else
+# Unknown rotation rate, default to HDD
 type="HDD"
 fi
-elif echo "$smart_info" | grep -qiE "SSD|Solid State"; then
+elif echo "$smart_info" | grep -qE "Device Model:.*SSD|Model Number:.*SSD"; then
+# Match SSD in the model name field
 type="SSD"
+else
+# Default to HDD for spinning rust
+type="HDD"
 fi
 # Health status (basic SMART check)
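Review note: the temperature branches above all reduce to "find a purely numeric field on the matching line". The sketch below isolates that idea in a hypothetical helper, `extract_temp`, which is not part of the commit. As a simplification it applies the strip-parentheses, last-numeric-field approach to every format, whereas the script scans from the first field for the SAS and SATA/NVMe lines. The sample lines are illustrative, not captured from real drives.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): extract a temperature value
# from a single line of smartctl output.
extract_temp() {
    local line="$1"
    # Remove parenthesised suffixes such as "(Min/Max 20/45)" so their
    # numbers cannot be mistaken for the reading, then print the last
    # field that consists only of digits.
    printf '%s\n' "$line" \
        | sed 's/([^)]*)//g' \
        | awk '{for (i = NF; i > 0; i--) if ($i ~ /^[0-9]+$/) {print $i; exit}}'
}

# Illustrative sample lines (assumed formats, one per drive family).
extract_temp "194 Temperature_Celsius 0x0022 035 045 000 Old_age Always - 35 (0 18 0 0 0)"
extract_temp "Current Drive Temperature:     35 C"
extract_temp "Temperature:                        42 Celsius"
```

Without the `sed` step, the first sample would yield the trailing `0` inside the parentheses rather than the reading, which appears to be the case the added `sed` call guards against.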
@@ -859,11 +966,34 @@ get_drive_smart_info() {
 echo "${type}|${temp_display}|${health}|${model}|${serial}|${warnings}"
 }
+#------------------------------------------------------------------------------
+# get_drive_smart_info
+#
+# Retrieves SMART data for a given device (fetches and parses).
+#
+# Args:
+# $1 - Device name (e.g., sda, nvme0n1)
+#
+# Returns: Pipe-delimited string: TYPE|TEMP|HEALTH|MODEL|SERIAL|WARNINGS
+#------------------------------------------------------------------------------
+get_drive_smart_info() {
+local device="$1"
+local smart_info
+smart_info="$(sudo smartctl -A -i -H "/dev/$device" 2>/dev/null)"
+parse_smart_data "$device" "$smart_info"
+}
 #------------------------------------------------------------------------------
 # Main Display Logic
 #------------------------------------------------------------------------------
-HOSTNAME=$(hostname)
+# Run diagnose mode if requested (exits after printing)
+if [[ "$RUN_DIAGNOSE" == true ]]; then
+run_diagnose
+fi
+HOSTNAME=$(hostname | tr -cd '[:alnum:]-_.')
 CHASSIS_TYPE=${CHASSIS_TYPES[$HOSTNAME]:-"unknown"}
 # Display chassis layout
@@ -881,7 +1011,7 @@ case "$CHASSIS_TYPE" in
 echo "┌─────────────────────────────────────────────────────────┐"
 echo "│ Unknown server: $HOSTNAME"
 echo "│ No chassis mapping defined yet"
-echo "│ Run diagnose-drives.sh to gather PCI path information"
+echo "│ Run with --diagnose to gather PCI path information"
 echo "└─────────────────────────────────────────────────────────┘"
 ;;
 esac
@@ -919,39 +1049,57 @@ done
 all_bays="$(printf '%s\n' "${!DRIVE_MAP[@]}" | grep -E '^[0-9]+$' | sort -n; printf '%s\n' "${!DRIVE_MAP[@]}" | grep -E '^m2-' | sort)"
 # Cache lsblk data to reduce redundant calls
-# Single call gets all info we need: size and mount points
+# Get device sizes (whole disk only)
 declare -A LSBLK_SIZE=()
 declare -A LSBLK_MOUNTS=()
 log_info "Caching block device information..."
-while IFS='|' read -r name size mounts; do
+# Get sizes for whole disks only
+while read -r name size; do
 [[ -z "$name" ]] && continue
-LSBLK_SIZE[$name]="$size"
-# Accumulate mount points for parent device
-parent="${name%%[0-9]}" # Strip partition number
-if [[ -n "$mounts" ]]; then
-if [[ -n "${LSBLK_MOUNTS[$parent]}" ]]; then
-LSBLK_MOUNTS[$parent]+=",${mounts}"
-else
-LSBLK_MOUNTS[$parent]="$mounts"
-fi
+LSBLK_SIZE["$name"]="$size"
+done < <(lsblk -dn -o NAME,SIZE 2>/dev/null)
+# Get mount points (including partitions) and map back to parent device
+while read -r name mounts; do
+[[ -z "$name" || -z "$mounts" ]] && continue
+# Strip partition suffix (sda1 -> sda, nvme0n1p1 -> nvme0n1)
+if [[ "$name" =~ ^(nvme[0-9]+n[0-9]+)p[0-9]+$ ]]; then
+parent="${BASH_REMATCH[1]}"
+elif [[ "$name" =~ ^([a-z]+)[0-9]+$ ]]; then
+parent="${BASH_REMATCH[1]}"
+else
+parent="$name"
 fi
-done < <(lsblk -rn -o NAME,SIZE,MOUNTPOINT 2>/dev/null)
+if [[ -n "${LSBLK_MOUNTS[$parent]:-}" ]]; then
+LSBLK_MOUNTS["$parent"]+=",${mounts}"
+else
+LSBLK_MOUNTS["$parent"]="$mounts"
+fi
+done < <(lsblk -rn -o NAME,MOUNTPOINT 2>/dev/null | grep -v '^ ')
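Review note: the partition-to-parent mapping in this hunk can be tried in isolation. The sketch below wraps the two regex branches in a hypothetical function, `parent_device` (the commit does this inline, not in a function), and runs it over a few assumed device names.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not in the commit) around the regex logic that
# maps a partition name to its whole-disk parent.
parent_device() {
    local name="$1"
    if [[ "$name" =~ ^(nvme[0-9]+n[0-9]+)p[0-9]+$ ]]; then
        # NVMe partition: nvme0n1p1 -> nvme0n1
        printf '%s\n' "${BASH_REMATCH[1]}"
    elif [[ "$name" =~ ^([a-z]+)[0-9]+$ ]]; then
        # SCSI/SATA partition: sda1 -> sda, sdb12 -> sdb
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        # Already a whole disk (sda, nvme0n1): leave unchanged
        printf '%s\n' "$name"
    fi
}

# Assumed sample names, for illustration only.
for name in sda sda1 sdb12 nvme0n1 nvme0n1p1; do
    printf '%s -> %s\n' "$name" "$(parent_device "$name")"
done
```

For comparison, the removed expansion `${name%%[0-9]}` strips only a single trailing digit, so it would turn `sdb12` into `sdb1` and `nvme0n1p1` into `nvme0n1p`.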
 # Parallel SMART data collection for faster execution
-# Collect SMART data in background jobs, store in temp files
+# Collect raw smartctl output in background jobs, parse later
 if [[ "$SKIP_SMART" != true ]]; then
 SMART_CACHE_DIR="$(mktemp -d)"
 log_info "Collecting SMART data in parallel..."
+max_parallel_jobs=10
+job_count=0
 for bay in $all_bays; do
 device="${DRIVE_MAP[$bay]}"
 if [[ -n "$device" && "$device" != "EMPTY" && -b "/dev/$device" ]]; then
-# Launch background job for each device
-(get_drive_smart_info "$device" > "$SMART_CACHE_DIR/$device") &
+# Launch background job to collect raw smartctl data
+(sudo smartctl -A -i -H "/dev/$device" > "$SMART_CACHE_DIR/${device}.raw" 2>/dev/null) &
+((job_count++))
+if ((job_count >= max_parallel_jobs)); then
+wait -n 2>/dev/null || wait # wait -n requires bash 4.3+, fall back to wait
+((job_count--))
+fi
 fi
 done
-# Wait for all background SMART queries to complete
+# Wait for all remaining background SMART queries to complete
 wait
 log_info "SMART data collection complete"
 fi
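Review note: the job throttle added in this hunk can be exercised without `smartctl`. The sketch below is a minimal stand-in, assuming a short `sleep` plus a file write in place of the real query. The loop bound, the limit of 3, and the names `out_dir` and `files` are illustrative.

```shell
#!/usr/bin/env bash
# Minimal sketch of a bounded background-job loop. The subshell body is a
# stand-in for the smartctl call; all values here are assumptions.
max_parallel_jobs=3
job_count=0
out_dir="$(mktemp -d)"

for i in 1 2 3 4 5 6 7 8; do
    # Launch one background job that writes its result to a file.
    ( sleep 0.1; echo "result $i" > "$out_dir/$i.raw" ) &
    job_count=$((job_count + 1))
    if (( job_count >= max_parallel_jobs )); then
        # Block until one job exits; plain wait is the fallback for
        # bash older than 4.3, where wait -n is unavailable.
        wait -n 2>/dev/null || wait
        job_count=$((job_count - 1))
    fi
done

# Wait for whatever is still running, then count the result files.
wait
files=("$out_dir"/*.raw)
echo "collected ${#files[@]} files"
rm -rf "$out_dir"
```

Two points may be worth checking against the real script. `((job_count++))` returns a non-zero status when the value before the increment is 0, which would matter if the script ever runs under `set -e`; the sketch uses plain arithmetic assignment instead. Also, `wait -n` returns the exit status of the job that finished, so a non-zero `smartctl` status would trigger the fallback `wait`, which waits for every job while the counter drops by only one. That leaves the count conservative rather than incorrect.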
@@ -971,13 +1119,23 @@ for bay in $all_bays; do
 serial="-"
 warnings=""
 else
-# Read from cached SMART data
-if [[ -f "$SMART_CACHE_DIR/$device" ]]; then
-smart_info="$(cat "$SMART_CACHE_DIR/$device")"
-else
-smart_info=""
+# Read from cached raw SMART data and parse it
+raw_smart=""
+if [[ -f "$SMART_CACHE_DIR/${device}.raw" ]]; then
+raw_smart="$(cat "$SMART_CACHE_DIR/${device}.raw")"
+fi
+# Parse the raw data using get_drive_smart_info logic inline
+if [[ -n "$raw_smart" ]]; then
+smart_info="$(parse_smart_data "$device" "$raw_smart")"
+IFS='|' read -r type temp health model serial warnings <<< "$smart_info"
+else
+type="-"
+temp="-"
+health="-"
+model="-"
+serial="-"
+warnings=""
 fi
-IFS='|' read -r type temp health model serial warnings <<< "$smart_info"
 fi
 # Check for Ceph OSD using cached data
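Review note: the hunk above splits the `TYPE|TEMP|HEALTH|MODEL|SERIAL|WARNINGS` record with `IFS='|' read` and falls back to placeholders when nothing was cached. A small sketch of that pattern, using a made-up record (the model and serial values are hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical record in the TYPE|TEMP|HEALTH|MODEL|SERIAL|WARNINGS layout;
# the trailing empty field means "no warnings".
record="SSD|35C|PASSED|Example Model 1TB|SN0001|"

if [[ -n "$record" ]]; then
    # IFS is set only for this read, so the rest of the script keeps
    # the default field splitting.
    IFS='|' read -r type temp health model serial warnings <<< "$record"
else
    # No cached data: use the same placeholders as the hunk above.
    type="-" temp="-" health="-" model="-" serial="-" warnings=""
fi

printf 'type=%s temp=%s health=%s model=%s serial=%s warnings=%s\n' \
    "$type" "$temp" "$health" "$model" "$serial" "${warnings:--}"
```

Because `|` is not whitespace, empty fields in the middle of a record stay empty instead of collapsing, so the columns remain aligned when a value is missing.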
@@ -1020,7 +1178,7 @@ for bay in $all_bays; do
 colored_health="$(colorize_health "$health")"
 # Colorize warnings if present
-local colored_warnings="${warnings:--}"
+colored_warnings="${warnings:--}"
 if [[ "$USE_COLOR" == true && -n "$warnings" ]]; then
 colored_warnings="${COLOR_YELLOW}${warnings}${COLOR_RESET}"
 fi
@@ -1034,11 +1192,6 @@ for bay in $all_bays; do
 fi
 done
-# Clean up SMART cache directory
-if [[ -n "${SMART_CACHE_DIR:-}" && -d "$SMART_CACHE_DIR" ]]; then
-rm -rf "$SMART_CACHE_DIR"
-fi
 # NVMe drives (only show unmapped ones - mapped NVMe drives appear in main table)
 nvme_devices=$(lsblk -d -n -o NAME,SIZE | grep "^nvme" 2>/dev/null)
 if [[ -n "$nvme_devices" ]]; then


@@ -1,11 +0,0 @@
#!/bin/bash
echo "=== Drive Serial Numbers ==="
for dev in sd{a..j}; do
if [ -b "/dev/$dev" ]; then
serial=$(sudo smartctl -i /dev/$dev 2>/dev/null | grep "Serial Number" | awk '{print $3}')
model=$(sudo smartctl -i /dev/$dev 2>/dev/null | grep "Device Model\|Model Number" | cut -d: -f2 | xargs)
size=$(lsblk -d -n -o SIZE /dev/$dev 2>/dev/null)
echo "/dev/$dev: $serial ($size - $model)"
fi
done


@@ -1,11 +0,0 @@
#!/bin/bash
echo "=== Checking /dev/disk/by-path/ ==="
ls -la /dev/disk/by-path/ | grep -v "part" | grep "pci-0000:0c:00.0" | head -20
echo ""
echo "=== Checking if paths exist from mapping ==="
echo "pci-0000:0c:00.0-ata-3:"
ls -la /dev/disk/by-path/pci-0000:0c:00.0-ata-3 2>&1
echo "pci-0000:0c:00.0-ata-1:"
ls -la /dev/disk/by-path/pci-0000:0c:00.0-ata-1 2>&1