Compare commits

...

9 Commits

Author SHA1 Message Date
c6ea28c5d6 Add --diagnose flag, remove obsolete helper scripts, fix docs
- Add --diagnose option that shows all PCI paths, storage controllers,
  block devices, and validates current mappings. Replaces the separate
  diagnose-drives.sh script.
- Remove diagnose-drives.sh (incorporated into --diagnose).
- Remove get-serials.sh (redundant with SMART data in main table).
- Remove test-paths.sh (referenced non-existent 0c:00.0 controller).
- Remove todo.md (massively outdated).
- Fix storage controller text overflowing box borders in large1 and
  micro layouts by adding truncation (%-69.69s, %-57.57s).
- Fix chassis name to CX4712 in README.
- Update server mapping statuses from "Requires mapping" to actual
  partially-mapped states.
- Add ⚠ health indicator to README output column docs.
- Update Claude.md metrics to match current state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:50:37 -05:00
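The by-path walk that `--diagnose` performs can be sketched as follows. This is a minimal illustration, not the script's `run_diagnose`. The `list_whole_disks` helper and its directory argument are hypothetical. They let the logic run against a scratch directory, whereas the real script iterates `${DISK_BY_PATH}`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list whole-disk entries under a by-path directory,
# skipping partition links such as "...-ata-1-part1".
list_whole_disks() {
    local dir="${1:-/dev/disk/by-path}" path target
    for path in "$dir"/*; do
        # Also skips the unexpanded glob when the directory is empty or missing.
        [[ -L "$path" ]] || continue
        [[ "$path" =~ -part[0-9]+$ ]] && continue
        # by-path links are relative (e.g. ../../sda); basename gives the kernel name.
        target="$(readlink "$path")"
        printf '%s -> %s\n' "$(basename "$path")" "$(basename "$target")"
    done
}
```

On a real host, `list_whole_disks` with no argument reads `/dev/disk/by-path`. Because the links are checked with `-L`, dangling links are still listed, which is convenient when testing against fake symlinks.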
555ecd54b2 Fix 10-bay ASCII box alignment
Border was 130 columns wide but bay lines were 138. Widened the border
and all interior format strings to match the bay content width (136
interior columns + 2 border characters = 138 total). Long controller
descriptions are now truncated to prevent overflow.

Ref #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:27:33 -05:00
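The width arithmetic above can be checked with a small sketch. The helpers `fit_width`, `box_line`, and `box_rule` are hypothetical and are not functions in `driveAtlas.sh`. They show that a `%-N.Ns` conversion both pads and truncates, so an interior line can never run past its border. The sketch assumes ASCII content, because printf field widths may count bytes rather than display columns for multibyte characters.

```bash
#!/usr/bin/env bash
# Interior width from the commit: 4 + 10 bays * 13 columns + 2 = 136.
BOX_INTERIOR=$((4 + 10 * 13 + 2))

# Pad or cut text to exactly $1 columns: "-" left-aligns, the width pads,
# and the precision truncates anything longer.
fit_width() {
    printf '%-*.*s' "$1" "$1" "$2"
}

# One interior line between the │ borders (136 + 2 = 138 columns total).
box_line() {
    printf '│%s│\n' "$(fit_width "$BOX_INTERIOR" "$1")"
}

# Top/bottom border built from the same width, so the two cannot drift apart.
box_rule() {
    local rule
    rule="$(printf '%*s' "$BOX_INTERIOR" '')"
    printf '%s%s%s\n' "$1" "${rule// /─}" "$2"
}

box_rule '┌' '┐'
box_line " Storage Controllers:"
box_line " $(printf 'x%.0s' {1..200})"   # 200 x's: truncated at the border, not wrapped
box_rule '└' '┘'
```

Deriving the border from the same constant as the interior is the design point. The original bug came from the two widths being maintained separately.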
4a98a6f6f8 Add storage-01 HBA bay 5 mapping (phy9)
Verified via ls -la /dev/disk/by-path/ and physical inspection
that HBA SAS3416 phy9 maps to bay 5 (C0 SATA breakout).
Remaining C0 bays 6-8 and C1 bays 9-10 cannot be verified until drives
are installed in them.

Ref #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:20:44 -05:00
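A mapping entry such as the storage-01 `phy9` to bay 5 line is a list of "by-path name, bay" pairs. The sketch below shows one way to resolve such a string. It uses a hypothetical `load_mapping` helper and a hypothetical `BAY_TO_DEV` array, and it takes the directory as a parameter so the logic can be tried against a scratch directory. Reading with a here-string rather than `echo ... | while` keeps the loop in the current shell, so the associative array is still populated after the loop ends.

```bash
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays.
declare -A BAY_TO_DEV=()

# Hypothetical helper: $1 = mapping text ("<by-path name> <bay>" per line),
# $2 = directory holding the by-path symlinks.
load_mapping() {
    local mapping="$1" dir="$2" pci_path bay
    while read -r pci_path bay; do
        [[ -z "$pci_path" || -z "$bay" ]] && continue   # skip blank lines
        if [[ -L "$dir/$pci_path" ]]; then
            BAY_TO_DEV["$bay"]="$(basename "$(readlink "$dir/$pci_path")")"
        else
            BAY_TO_DEV["$bay"]="EMPTY"                     # nothing plugged in
        fi
    done <<< "$mapping"
}

# Same shape as the storage-01 entry in SERVER_MAPPINGS.
storage01_mapping="
pci-0000:02:00.1-ata-1 1
pci-0000:01:00.0-sas-phy9-lun-0 5
"
load_mapping "$storage01_mapping" /dev/disk/by-path
for bay in $(printf '%s\n' "${!BAY_TO_DEV[@]}" | sort -n); do
    printf 'Bay %-3s -> %s\n' "$bay" "${BAY_TO_DEV[$bay]}"
done
```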
2177ae9092 Fix local keyword used outside function, document storage-01 HBA layout
- Remove `local` from max_parallel_jobs/job_count (not inside a function)
- Document storage-01 physical layout: mobo SATA ports, HBA Mini-SAS HD
  ports C0-C3, U.2 NVMe serial numbers

Ref #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:15:45 -05:00
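The `local` fix can be reproduced in isolation. Outside a function, bash rejects `local` with an error such as "local: can only be used in a function". The committed change therefore uses plain assignments at top level and keeps `local` for function bodies. A minimal sketch:

```bash
#!/usr/bin/env bash
# Top-level `local` is an error in bash; run it in a child shell to show the failure.
if bash -c 'local max_parallel_jobs=10' 2>/dev/null; then
    echo "top-level local: accepted"
else
    echo "top-level local: rejected"
fi

# Top-level variables are simply assigned (as in the fix) ...
max_parallel_jobs=10
job_count=0

# ... while `local` stays valid, and scoped, inside a function.
show_scope() {
    local job_count=5
    echo "inside function: job_count=$job_count"
}
show_scope
echo "top level: job_count=$job_count"
```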
71f83e82c5 Add robustness improvements: bash version check, cleanup trap, hostname sanitization, parallel job limit
- Add bash 4.2+ version check since script uses declare -g -A
- Add cleanup trap (EXIT/INT/TERM) for SMART_CACHE_DIR temp directory
- Sanitize hostname, keeping only alphanumerics, '-', '_' and '.'
- Limit parallel SMART collection to 10 concurrent jobs

Fixes #25

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 18:09:40 -05:00
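The cleanup trap and the job limit can be sketched together. Two observations about the diff further down motivate this variant, and both are offered tentatively:

- A trap that only calls `cleanup` on INT/TERM lets bash resume the script after the handler returns.
- The `wait -n 2>/dev/null || wait` fallback also fires whenever a job exits non-zero. smartctl's exit status is a bit mask that is often non-zero even when its output is usable. After that fallback, the counter is decremented by one even though the plain `wait` reaped every job.

In the sketch, `fake_query` and `run_bounded` are hypothetical stand-ins, and `wait -n` is used only when bash is 4.3 or newer.

```bash
#!/usr/bin/env bash
CACHE_DIR="$(mktemp -d)"

cleanup() {
    if [[ -n "${CACHE_DIR:-}" && -d "$CACHE_DIR" ]]; then
        rm -rf "$CACHE_DIR"
    fi
}
# EXIT covers every way out; INT/TERM just exit (128 + signal number),
# which in turn fires the EXIT trap instead of resuming the script.
trap cleanup EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

# True when this bash has `wait -n` (added in 4.3).
have_wait_n() {
    (( BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 3) ))
}

# Hypothetical stand-in for: sudo smartctl -A -i -H "/dev/$1" > "$CACHE_DIR/$1.raw"
fake_query() {
    echo "data for $1" > "$CACHE_DIR/$1.raw"
    return 4    # non-zero, like many real smartctl runs
}

# Run fake_query for each device with at most $1 jobs in flight.
run_bounded() {
    local max_jobs="$1" running=0 dev
    shift
    for dev in "$@"; do
        fake_query "$dev" &
        running=$((running + 1))
        if (( running >= max_jobs )); then
            if have_wait_n; then
                wait -n || true      # status ignored: we only need a free slot
                running=$((running - 1))
            else
                wait                 # bash 4.2: drain the batch, then start over
                running=0
            fi
        fi
    done
    wait    # reap whatever is still running
}

run_bounded 2 sda sdb sdc sdd sde
ls "$CACHE_DIR"
```

The version check replaces the error-driven fallback. As a result, a job's exit status never changes how many slots are considered free.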
b79c69be99 Fix OSD header regex to match double-digit OSD numbers
The ceph-volume lvm list output varies the number of trailing equals
signs based on OSD number length:
- Single digit: "====== osd.5 =======" (7 equals)
- Double digit: "====== osd.19 ======" (6 equals)

Changed the regex to require six trailing equals signs instead of
seven. The match is unanchored, so a seventh sign is still allowed and
both formats match.

Fixes: #17

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:23:10 -05:00
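The two header formats can be checked directly against the old and new patterns. This sketch stores each regex in a variable, a common idiom that avoids quoting pitfalls with `[[ =~ ]]`. Because neither pattern is anchored, six trailing `=` signs also match a header that has seven.

```bash
#!/usr/bin/env bash
old_re='======[[:space:]]+osd\.([0-9]+)[[:space:]]+======='   # seven trailing '='
new_re='======[[:space:]]+osd\.([0-9]+)[[:space:]]+======'    # six trailing '='

for line in '====== osd.5 =======' '====== osd.19 ======'; do
    old=no
    new=no
    [[ "$line" =~ $old_re ]] && old=yes
    [[ "$line" =~ $new_re ]] && new="yes (osd.${BASH_REMATCH[1]})"
    printf '%-22s old: %-3s new: %s\n' "$line" "$old" "$new"
done
```

Only the new pattern matches the `osd.19` header, and it still captures the OSD number from both lines.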
eb73e03495 Fix temperature parsing with parenthetical data
SMART output for Temperature_Celsius often includes extra sensor data
in parentheses like "26 (0 14 0 0 0)". The previous awk command was
finding "0" from the parenthetical instead of the actual temperature.

Now strips parenthetical content with sed before extracting the last
numeric value.

Fixes: #11

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:19:24 -05:00
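The two-stage extraction can be tried on sample attribute lines. `temp_from_attr_line` is a hypothetical wrapper around the sed and awk stages described above. The sample line is illustrative and follows the shape quoted in the commit message.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the two-stage pipeline:
#   sed  deletes every "( ... )" group on the line;
#   awk  walks fields from the last one backwards and prints the first
#        field that is all digits (the raw temperature value).
temp_from_attr_line() {
    printf '%s\n' "$1" \
        | sed 's/([^)]*)//g' \
        | awk '{for (i = NF; i > 0; i--) if ($i ~ /^[0-9]+$/) { print $i; exit }}'
}

# Illustrative attribute line in the shape quoted in the commit message.
temp_from_attr_line '194 Temperature_Celsius 0x0022 026 040 000 Old_age Always - 26 (0 14 0 0 0)'
```

Without the `sed` stage, the backwards scan stops at the last bare `0` inside the parentheses, which is the bug described above.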
f3785c13bc Fix temperature parsing for SAS drives
Added support for SAS drive temperature format "Current Drive Temperature:"
and made temperature extraction more robust by:
- Removing ^ anchor that was preventing matches with leading whitespace
- Using awk to find the first numeric value in the line
- Adding explicit SAS drive temperature format handling

Fixes: #11

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:14:23 -05:00
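The following is a sketch of the SAS-first lookup, using a hypothetical `drive_temp` helper. Checking the SAS pattern first mirrors the branch order in the diff. Two further changes matter:

- Dropping the `^` anchor lets indented lines match.
- Scanning for the first all-digit field, instead of printing `$2`, also handles multi-word labels such as "Current Temperature:". For those labels, `$2` would be the word "Temperature:".

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first integer on the first line matching the
# SAS pattern, else the SATA/NVMe pattern; return 1 if neither is present.
drive_temp() {
    local smart_info="$1" pattern
    for pattern in 'Current Drive Temperature:' '(Current )?Temperature:'; do
        if grep -qE "$pattern" <<< "$smart_info"; then
            grep -E "$pattern" <<< "$smart_info" | head -1 \
                | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) { print $i; exit }}'
            return 0
        fi
    done
    return 1
}

# Abridged sample lines; leading whitespace no longer defeats the match.
drive_temp 'Current Drive Temperature:     35 C'
drive_temp '  Temperature:                        42 Celsius'
```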
7579f371d7 Fix Ceph device parsing to use devices line
The "block device" line in ceph-volume output shows LVM paths like
ceph-xxx/osd-block-xxx, not physical device names. Changed to parse
the "devices" line which contains the actual physical device path
like /dev/sda.

Also reset current_osd after match to avoid duplicate matches.

Fixes: #17

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 20:10:46 -05:00
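The device-line parsing can be exercised without a Ceph host by feeding sample text to the same two patterns. The `parse_ceph_volume_list` function and the `DEVICE_TO_OSD` array are hypothetical names. The heredoc is an abridged illustration of `ceph-volume lvm list` output rather than a verbatim capture. Its `block device` lines carry LVM paths, which the `devices` pattern ignores.

```bash
#!/usr/bin/env bash
declare -A DEVICE_TO_OSD=()

parse_ceph_volume_list() {
    local line current_osd=""
    local header_re='======[[:space:]]+osd\.([0-9]+)[[:space:]]+======'
    local devices_re='devices[[:space:]]+/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+)'
    while IFS= read -r line; do
        if [[ "$line" =~ $header_re ]]; then
            current_osd="osd.${BASH_REMATCH[1]}"
        elif [[ -n "$current_osd" && "$line" =~ $devices_re ]]; then
            DEVICE_TO_OSD["${BASH_REMATCH[1]}"]="$current_osd"
            current_osd=""   # reset so later lines cannot re-match this OSD
        fi
    done
}

parse_ceph_volume_list <<'EOF'
====== osd.5 =======

  [block]       /dev/ceph-aaaa/osd-block-bbbb

      block device              /dev/ceph-aaaa/osd-block-bbbb
      devices                   /dev/sda

====== osd.19 ======

  [block]       /dev/ceph-cccc/osd-block-dddd

      block device              /dev/ceph-cccc/osd-block-dddd
      devices                   /dev/nvme0n1
EOF

for dev in "${!DEVICE_TO_OSD[@]}"; do
    printf '%s -> %s\n' "$dev" "${DEVICE_TO_OSD[$dev]}"
done
```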
6 changed files with 171 additions and 136 deletions

@@ -159,9 +159,9 @@ The project began with:
 ## Metrics
-- **Lines of Code:** ~330 (main script)
-- **Supported Chassis Types:** 4 (10-bay, large1, micro, spare)
-- **Mapped Servers:** 1 fully (compute-storage-01), 3 pending
+- **Lines of Code:** ~1178 (main script)
+- **Supported Chassis Types:** 3 (10bay, large1, micro)
+- **Mapped Servers:** 1 fully (compute-storage-01), 3 partially (storage-01, large1, compute-storage-gpu-01), 2 stubs (micro1, monitor-02)
 - **Features Added:** 10+
 - **Bugs Fixed:** 6 major, multiple minor
 - **Documentation:** Comprehensive README + this file
@@ -206,4 +206,4 @@ The result is a robust infrastructure management tool that provides instant visi
 - **Human Developer:** LotusGuild
 - **AI Assistant:** Claude Sonnet 4.5 (Anthropic)
 - **Development Date:** January 6, 2026
-- **Project:** Drive Atlas v1.0
+- **Project:** Drive Atlas v1.1.0

@@ -41,14 +41,14 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
 | Chassis Type | Description | Servers Using It |
 |-------------|-------------|------------------|
-| **10-Bay Hot-swap** | Sliger CX471225 4U 10x 3.5" NAS (with unused 2x 5.25" bays) | compute-storage-01, compute-storage-gpu-01, storage-01 |
+| **10-Bay Hot-swap** | Sliger CX4712 4U 10x 3.5" NAS (with unused 2x 5.25" bays) | compute-storage-01, compute-storage-gpu-01, storage-01 |
 | **Large1 Grid** | Unique 3x5 grid layout (1/1 configuration) | large1 |
 | **Micro** | Compact 2-drive layout | micro1, monitor-02 |
 ### Server Details
 #### compute-storage-01 (formerly medium2)
-- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
+- **Chassis:** Sliger CX4712 4U (10-Bay Hot-swap)
 - **Motherboard:** B650D4U3-2Q/BCM
 - **Controllers:**
 - 01:00.0 - LSI SAS3008 HBA (bays 5-10 via 2x mini-SAS HD)
@@ -57,20 +57,20 @@ bash <(wget -qO- http://10.10.10.63:3000/LotusGuild/driveAtlas/raw/branch/main/d
 - **Status:** ✅ Fully mapped and verified
 #### storage-01
-- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
-- **Motherboard:** Different from compute-storage-01
-- **Controllers:** Motherboard SATA only (no HBA currently)
-- **Status:** ⚠️ Requires PCI path mapping
+- **Chassis:** Sliger CX4712 4U (10-Bay Hot-swap)
+- **Motherboard:** ASRock A320M-HDV R4.0
+- **Controllers:** AMD SATA (bays 1-4), LSI SAS3416 HBA (bays 5+, U.2 NVMe)
+- **Status:** ⚠️ Partially mapped (5 of 10 bays)
 #### large1
 - **Chassis:** Unique 3x5 grid (15 bays total)
 - **Note:** 1/1 configuration, will not be replicated
-- **Status:** ⚠️ Requires PCI path mapping
+- **Status:** ⚠️ Partially mapped (14 bays + 2 M.2)
 #### compute-storage-gpu-01
-- **Chassis:** Sliger CX471225 4U (10-Bay Hot-swap)
-- **Motherboard:** Same as compute-storage-01
-- **Status:** ⚠️ Requires PCI path mapping
+- **Chassis:** Sliger CX4712 4U (10-Bay Hot-swap)
+- **Motherboard:** ASUS PRIME B550-PLUS
+- **Status:** ⚠️ Partially mapped (5 SATA + 1 M.2)
 ## Output Example
@@ -130,15 +130,15 @@ declare -A SERVER_MAPPINGS=(
 ## Setting Up a New Server
-### Step 1: Run Diagnostic Script
+### Step 1: Run Diagnostic Mode
 First, gather PCI path information:
 ```bash
-bash diagnose-drives.sh > server-diagnostic.txt
+bash driveAtlas.sh --diagnose
 ```
-This will show all available PCI paths and their associated drives.
+This will show all available PCI paths, storage controllers, and their associated drives.
 ### Step 2: Physical Bay Identification
@@ -192,7 +192,7 @@ DEBUG=1 bash driveAtlas.sh
 | **SIZE** | Drive capacity |
 | **TYPE** | SSD or HDD (detected via SMART) |
 | **TEMP** | Current temperature from SMART |
-| **HEALTH** | SMART health status (✓ = passed, ✗ = failed) |
+| **HEALTH** | SMART health status (✓ = passed, ⚠ = passed with warnings, ✗ = failed) |
 | **MODEL** | Drive model number |
 | **SERIAL** | Drive serial number (for physical verification) |
 | **CEPH OSD** | Ceph OSD ID if drive hosts an OSD |
@@ -235,17 +235,15 @@ DEBUG=1 bash driveAtlas.sh
 ## Files
-- [driveAtlas.sh](driveAtlas.sh) - Main script
-- [diagnose-drives.sh](diagnose-drives.sh) - PCI path diagnostic tool
+- [driveAtlas.sh](driveAtlas.sh) - Main script (includes `--diagnose` mode for PCI path discovery)
 - [README.md](README.md) - This file
 - [CLAUDE.md](CLAUDE.md) - AI-assisted development notes
-- [todo.txt](todo.txt) - Development notes and task tracking
 ## Contributing
 When adding support for a new server:
-1. Run `diagnose-drives.sh` and save output
+1. Run `driveAtlas.sh --diagnose` and save output
 2. Physically label or identify drives by serial number
 3. Create mapping in `SERVER_MAPPINGS`
 4. Test thoroughly

@@ -1,59 +0,0 @@
-#!/bin/bash
-# Drive Atlas Diagnostic Script
-# Run this on each server to gather PCI path information
-echo "=== Server Information ==="
-echo "Hostname: $(hostname)"
-echo "Date: $(date)"
-echo ""
-echo "=== All /dev/disk/by-path/ entries ==="
-ls -la /dev/disk/by-path/ | grep -v "part" | sort
-echo ""
-echo "=== Organized by PCI Address ==="
-for path in /dev/disk/by-path/*; do
-if [ -L "$path" ]; then
-# Skip partitions
-if [[ "$path" =~ -part[0-9]+$ ]]; then
-continue
-fi
-basename_path=$(basename "$path")
-target=$(readlink -f "$path")
-device=$(basename "$target")
-echo "Path: $basename_path"
-echo " -> Device: $device"
-# Try to get size
-if [ -b "$target" ]; then
-size=$(lsblk -d -n -o SIZE "$target" 2>/dev/null)
-echo " -> Size: $size"
-fi
-# Try to get SMART info for model
-if command -v smartctl >/dev/null 2>&1; then
-model=$(sudo smartctl -i "$target" 2>/dev/null | grep "Device Model\|Model Number" | cut -d: -f2 | xargs)
-if [ -n "$model" ]; then
-echo " -> Model: $model"
-fi
-fi
-echo ""
-fi
-done
-echo "=== PCI Devices with Storage Controllers ==="
-lspci | grep -i "storage\|raid\|sata\|sas\|nvme"
-echo ""
-echo "=== Current Block Devices ==="
-lsblk -d -o NAME,SIZE,TYPE,TRAN | grep -v "rbd\|loop"
-echo ""
-echo "=== Recommendations ==="
-echo "1. Note the PCI addresses (e.g., 0c:00.0) of your storage controllers"
-echo "2. For each bay, physically identify which drive is in it"
-echo "3. Match the PCI path pattern to the bay number"
-echo "4. Example: pci-0000:0c:00.0-ata-1 might be bay 1 on controller 0c:00.0"

@@ -11,8 +11,25 @@
 # Note: Not using -u (nounset) as script uses ${var:-default} patterns
 set -o pipefail
+# Require bash 4.2+ for declare -g -A (global associative arrays)
+if ((BASH_VERSINFO[0] < 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] < 2))); then
+echo "ERROR: This script requires Bash 4.2 or higher (current: $BASH_VERSION)" >&2
+exit 1
+fi
 VERSION="1.1.0"
+#------------------------------------------------------------------------------
+# Cleanup Trap
+# Ensures temporary directories are removed on exit or interruption
+#------------------------------------------------------------------------------
+cleanup() {
+if [[ -n "${SMART_CACHE_DIR:-}" && -d "$SMART_CACHE_DIR" ]]; then
+rm -rf "$SMART_CACHE_DIR"
+fi
+}
+trap cleanup EXIT INT TERM
 #------------------------------------------------------------------------------
 # Path Constants
 # Centralized path definitions to avoid hardcoding throughout the script
@@ -43,6 +60,7 @@ OPTIONS:
 --verbose Show detailed error messages and warnings
 --no-ceph Skip Ceph OSD information
 --show-pci Show PCI paths in output
+--diagnose Show all PCI paths and block devices (for mapping new servers)
 EXAMPLES:
 $(basename "$0") # Normal run with all features
@@ -50,6 +68,7 @@ EXAMPLES:
 $(basename "$0") --color # Run with colored output
 $(basename "$0") --verbose # Show all errors and warnings
 $(basename "$0") --debug # Show mapping debug info
+$(basename "$0") --diagnose # Gather PCI paths for new server setup
 ENVIRONMENT VARIABLES:
 DEBUG=1 Same as --debug flag
@@ -66,6 +85,7 @@ SKIP_CEPH=false
 SHOW_PCI=false
 USE_COLOR=false
 VERBOSE=false
+RUN_DIAGNOSE=false
 while [[ $# -gt 0 ]]; do
 case "$1" in
@@ -101,6 +121,10 @@ while [[ $# -gt 0 ]]; do
 VERBOSE=true
 shift
 ;;
+--diagnose)
+RUN_DIAGNOSE=true
+shift
+;;
 *)
 echo "Unknown option: $1" >&2
 echo "Use --help for usage information." >&2
@@ -304,6 +328,68 @@ check_dependencies() {
 # Run dependency check at script start
 check_dependencies
+#------------------------------------------------------------------------------
+# run_diagnose
+#
+# Displays all PCI disk paths, storage controllers, and block devices.
+# Used to gather information needed when mapping a new server.
+#------------------------------------------------------------------------------
+run_diagnose() {
+local hostname
+hostname="$(hostname)"
+echo "=== Server Information ==="
+echo "Hostname: $hostname"
+echo "Date: $(date)"
+echo ""
+echo "=== Storage Controllers ==="
+lspci 2>/dev/null | grep -iE "SAS|SATA|RAID|Mass storage|NVMe"
+echo ""
+echo "=== All /dev/disk/by-path/ entries (whole disks only) ==="
+for path in "${DISK_BY_PATH}"/*; do
+[[ -L "$path" ]] || continue
+# Skip partitions
+[[ "$path" =~ -part[0-9]+$ ]] && continue
+local basename_path target device size serial model
+basename_path="$(basename "$path")"
+target="$(readlink -f "$path")"
+device="$(basename "$target")"
+size="$(lsblk -d -n -o SIZE "$target" 2>/dev/null | xargs)"
+printf " %-55s -> %-10s %s\n" "$basename_path" "$device" "${size:+($size)}"
+done
+echo ""
+echo "=== Block Devices ==="
+lsblk -d -o NAME,SIZE,TYPE,TRAN 2>/dev/null | grep -v "rbd\|loop"
+echo ""
+# Check if this server has a mapping
+local sanitized
+sanitized="$(echo "$hostname" | tr -cd '[:alnum:]-_.')"
+if [[ -n "${SERVER_MAPPINGS[$sanitized]:-}" ]]; then
+echo "=== Current Mapping for $sanitized ==="
+echo "${SERVER_MAPPINGS[$sanitized]}" | while read -r pci_path bay; do
+[[ -z "$pci_path" || -z "$bay" ]] && continue
+if [[ -L "${DISK_BY_PATH}/$pci_path" ]]; then
+local dev
+dev="$(readlink -f "${DISK_BY_PATH}/$pci_path" | sed 's/.*\///')"
+printf " Bay %-5s %-55s -> %s\n" "$bay" "$pci_path" "$dev"
+else
+printf " Bay %-5s %-55s -> (not connected)\n" "$bay" "$pci_path"
+fi
+done
+else
+echo "NOTE: No mapping exists yet for '$sanitized'."
+echo "Use the PCI paths above to create a SERVER_MAPPINGS entry."
+fi
+exit 0
+}
 #------------------------------------------------------------------------------
 # Chassis Layout Generator Functions
 # These define the physical layout and display formatting for each chassis type
@@ -327,26 +413,29 @@ generate_10bay_layout() {
 # Fixed width for consistent box drawing (fits device names like "nvme0n1")
 local drive_width=10
+# Box interior width = 136 (determined by 10 bay boxes: 4 + 10*13 + 2)
+# Total box width = 138 (136 interior + 2 for │ borders)
 # Main chassis section
-printf "┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐\n"
-printf "│ %-126s │\n" "$hostname - Sliger CX4712 (10x 3.5\" Hot-swap)"
-printf "│ │\n"
+printf "┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐\n"
+printf "│ %-132s │\n" "$hostname - Sliger CX4712 (10x 3.5\" Hot-swap)"
+printf "│%-136s│\n" ""
 # Show storage controllers
-printf "│ Storage Controllers: │\n"
+printf "│ %-134s│\n" "Storage Controllers:"
 while IFS= read -r ctrl; do
-[[ -n "$ctrl" ]] && printf "│ %-126s│\n" "$ctrl"
+[[ -n "$ctrl" ]] && printf "│ %-134.134s│\n" "$ctrl"
 done < <(get_storage_controllers)
-printf "│ │\n"
+printf "│%-136s│\n" ""
 # M.2 NVMe slot if present
 if [[ -n "${DRIVE_MAP[m2-1]}" ]]; then
-printf "│ M.2 NVMe: %-10s │\n" "${DRIVE_MAP[m2-1]}"
-printf "│ │\n"
+printf "│ %-134s│\n" " M.2 NVMe: ${DRIVE_MAP[m2-1]}"
+printf "│%-136s│\n" ""
 fi
-printf "│ Front Hot-swap Bays: │\n"
-printf "│ │\n"
+printf "│ %-134s│\n" " Front Hot-swap Bays:"
+printf "│%-136s│\n" ""
 # Bay top borders
 printf "│ "
@@ -369,7 +458,7 @@ generate_10bay_layout() {
done done
printf " │\n" printf " │\n"
printf "└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n" printf "└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘\n"
} }
#------------------------------------------------------------------------------ #------------------------------------------------------------------------------
@@ -398,7 +487,7 @@ generate_micro_layout() {
 printf "│ │\n"
 printf "│ Storage Controllers: │\n"
 while IFS= read -r ctrl; do
-[[ -n "$ctrl" ]] && printf "│ %-57s│\n" "$ctrl"
+[[ -n "$ctrl" ]] && printf "│ %-57.57s│\n" "$ctrl"
 done < <(get_storage_controllers)
 printf "│ │\n"
@@ -440,7 +529,7 @@ generate_large1_layout() {
 printf "│ │\n"
 printf "│ Storage Controllers: │\n"
 while IFS= read -r ctrl; do
-[[ -n "$ctrl" ]] && printf "│ %-69s│\n" "$ctrl"
+[[ -n "$ctrl" ]] && printf "│ %-69.69s│\n" "$ctrl"
 done < <(get_storage_controllers)
 printf "│ │\n"
 printf "│ M.2 NVMe: M1: %-10s M2: %-10s │\n" "${DRIVE_MAP[m2-1]:-EMPTY}" "${DRIVE_MAP[m2-2]:-EMPTY}"
@@ -503,13 +592,26 @@ declare -A SERVER_MAPPINGS=(
 "
 # storage-01
-# Motherboard: ASRock A320M-HDV R4.0 with AMD SATA controller at 02:00.1
-# 4 SATA ports used (ata-1, ata-2, ata-5, ata-6) - ata-3/4 empty
+# Motherboard: ASRock A320M-HDV R4.0
+# AMD SATA controller at 02:00.1 (bays 1-4)
+# Mobo SATA physical layout:
+# top-left=bay 1, bottom-left=bay 2, top-right=bay 3, bottom-right=bay 4
+# HBA: LSI SAS3416 at 01:00.0 (4x Mini-SAS HD ports, top=C0 to bottom=C3)
+# C0 (top): 4x SATA breakout → bays 5-8
+# C1: 4x SATA breakout → bays 9-10 (2 of 4 ports used)
+# C2: U.2 NVMe (serial ends in 0d66) → u2-1
+# C3: U.2 NVMe (serial ends in 0d4f) → u2-2
+# C0 verified: phy9=bay5 (remaining phy8/10/11 → bays 6-8 TBD)
+# C1: PHY-to-bay mapping TBD (bays 9-10)
+# C2: U.2 NVMe (serial ends in 0d66) → u2-1 (needs FW update)
+# C3: U.2 NVMe (serial ends in 0d4f) → u2-2 (needs FW update)
+# Also present: 09:00.0 AMD FCH SATA Controller [AHCI mode]
 ["storage-01"]="
 pci-0000:02:00.1-ata-1 1
 pci-0000:02:00.1-ata-2 2
 pci-0000:02:00.1-ata-5 3
 pci-0000:02:00.1-ata-6 4
+pci-0000:01:00.0-sas-phy9-lun-0 5
 "
 # large1
@@ -607,7 +709,7 @@ get_storage_controllers() {
 # Values: PCI path strings (for --show-pci option)
 #------------------------------------------------------------------------------
 build_drive_map() {
-local host="$(hostname)"
+local host="$(hostname | tr -cd '[:alnum:]-_.')"
 local mapping="${SERVER_MAPPINGS[$host]}"
 # Declare global arrays directly
@@ -615,7 +717,7 @@ build_drive_map() {
 declare -g -A BAY_TO_PCI_PATH=()
 if [[ -z "$mapping" ]]; then
-log_warn "No drive mapping found for host '$host'. Run diagnose-drives.sh to create one."
+log_warn "No drive mapping found for host '$host'. Run with --diagnose to gather PCI path info."
 return
 fi
@@ -667,15 +769,18 @@ build_ceph_cache() {
 local current_osd=""
 local osd_count=0
 while IFS= read -r line; do
-# Match OSD header: "====== osd.5 ======="
-if [[ "$line" =~ ======[[:space:]]+osd\.([0-9]+)[[:space:]]+======= ]]; then
+# Match OSD header: "====== osd.5 =======" or "====== osd.19 ======"
+# Number of trailing equals varies based on OSD number length
+if [[ "$line" =~ ======[[:space:]]+osd\.([0-9]+)[[:space:]]+====== ]]; then
 current_osd="osd.${BASH_REMATCH[1]}"
-# Match block device line: " block device /dev/sda"
-elif [[ -n "$current_osd" && "$line" =~ block[[:space:]]device[[:space:]]+/dev/([^[:space:]]+) ]]; then
+# Match "devices" line which has the actual physical device: " devices /dev/sda"
+# This is more reliable than "block device" which may show LVM paths
+elif [[ -n "$current_osd" && "$line" =~ devices[[:space:]]+/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+) ]]; then
 local dev_name="${BASH_REMATCH[1]}"
 CEPH_DEVICE_TO_OSD["$dev_name"]="$current_osd"
 ((osd_count++))
 log_info "Found $current_osd on $dev_name"
+current_osd="" # Reset to avoid duplicate matches
 fi
 done < <(ceph-volume lvm list 2>/dev/null)
 log_info "Cached $osd_count Ceph OSDs"
@@ -744,14 +849,20 @@ parse_smart_data() {
 fi
 # Temperature parsing - handles multiple formats:
-# - SATA: "194 Temperature_Celsius ... 35" (value at end of line)
+# - SATA: "194 Temperature_Celsius ... 26 (0 14 0 0 0)" (value before parenthetical)
 # - SATA: "Temperature: 42 Celsius"
 # - SATA: "Current Temperature: 35 Celsius"
+# - SAS: "Current Drive Temperature: 35 C"
 # - NVMe: "Temperature: 42 Celsius"
 if echo "$smart_info" | grep -q "Temperature_Celsius"; then
-temp="$(echo "$smart_info" | grep "Temperature_Celsius" | head -1 | awk '{for(i=NF;i>0;i--) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
-elif echo "$smart_info" | grep -qE "^(Current )?Temperature:"; then
-temp="$(echo "$smart_info" | grep -E "^(Current )?Temperature:" | head -1 | awk '{print $2}')"
+# Strip parenthetical data like "(0 14 0 0 0)" before finding last number
+temp="$(echo "$smart_info" | grep "Temperature_Celsius" | head -1 | sed 's/([^)]*)//g' | awk '{for(i=NF;i>0;i--) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
+elif echo "$smart_info" | grep -qE "Current Drive Temperature:"; then
+# SAS drives: "Current Drive Temperature: 35 C"
+temp="$(echo "$smart_info" | grep -E "Current Drive Temperature:" | head -1 | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
+elif echo "$smart_info" | grep -qE "(Current )?Temperature:"; then
+# SATA/NVMe: "Temperature: 42 Celsius" (may have leading whitespace)
+temp="$(echo "$smart_info" | grep -E "(Current )?Temperature:" | head -1 | awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+$/) {print $i; exit}}')"
 fi
# Device type detection - handles SSD, HDD, and NVMe # Device type detection - handles SSD, HDD, and NVMe
@@ -877,7 +988,12 @@ get_drive_smart_info() {
 # Main Display Logic
 #------------------------------------------------------------------------------
-HOSTNAME=$(hostname)
+# Run diagnose mode if requested (exits after printing)
+if [[ "$RUN_DIAGNOSE" == true ]]; then
+run_diagnose
+fi
+HOSTNAME=$(hostname | tr -cd '[:alnum:]-_.')
 CHASSIS_TYPE=${CHASSIS_TYPES[$HOSTNAME]:-"unknown"}
 # Display chassis layout
@@ -895,7 +1011,7 @@ case "$CHASSIS_TYPE" in
 echo "┌─────────────────────────────────────────────────────────┐"
 echo "│ Unknown server: $HOSTNAME"
 echo "│ No chassis mapping defined yet"
-echo "│ Run diagnose-drives.sh to gather PCI path information"
+echo "│ Run with --diagnose to gather PCI path information"
 echo "└─────────────────────────────────────────────────────────┘"
 ;;
 esac
@@ -968,15 +1084,22 @@ if [[ "$SKIP_SMART" != true ]]; then
 SMART_CACHE_DIR="$(mktemp -d)"
 log_info "Collecting SMART data in parallel..."
+max_parallel_jobs=10
+job_count=0
 for bay in $all_bays; do
 device="${DRIVE_MAP[$bay]}"
 if [[ -n "$device" && "$device" != "EMPTY" && -b "/dev/$device" ]]; then
 # Launch background job to collect raw smartctl data
 (sudo smartctl -A -i -H "/dev/$device" > "$SMART_CACHE_DIR/${device}.raw" 2>/dev/null) &
+((job_count++))
+if ((job_count >= max_parallel_jobs)); then
+wait -n 2>/dev/null || wait # wait -n requires bash 4.3+, fall back to wait
+((job_count--))
+fi
 fi
 done
-# Wait for all background SMART queries to complete
+# Wait for all remaining background SMART queries to complete
 wait
 log_info "SMART data collection complete"
 fi
@@ -1069,11 +1192,6 @@ for bay in $all_bays; do
 fi
 done
-# Clean up SMART cache directory
-if [[ -n "${SMART_CACHE_DIR:-}" && -d "$SMART_CACHE_DIR" ]]; then
-rm -rf "$SMART_CACHE_DIR"
-fi
 # NVMe drives (only show unmapped ones - mapped NVMe drives appear in main table)
 nvme_devices=$(lsblk -d -n -o NAME,SIZE | grep "^nvme" 2>/dev/null)
 if [[ -n "$nvme_devices" ]]; then

@@ -1,11 +0,0 @@
-#!/bin/bash
-echo "=== Drive Serial Numbers ==="
-for dev in sd{a..j}; do
-if [ -b "/dev/$dev" ]; then
-serial=$(sudo smartctl -i /dev/$dev 2>/dev/null | grep "Serial Number" | awk '{print $3}')
-model=$(sudo smartctl -i /dev/$dev 2>/dev/null | grep "Device Model\|Model Number" | cut -d: -f2 | xargs)
-size=$(lsblk -d -n -o SIZE /dev/$dev 2>/dev/null)
-echo "/dev/$dev: $serial ($size - $model)"
-fi
-done

@@ -1,11 +0,0 @@
-#!/bin/bash
-echo "=== Checking /dev/disk/by-path/ ==="
-ls -la /dev/disk/by-path/ | grep -v "part" | grep "pci-0000:0c:00.0" | head -20
-echo ""
-echo "=== Checking if paths exist from mapping ==="
-echo "pci-0000:0c:00.0-ata-3:"
-ls -la /dev/disk/by-path/pci-0000:0c:00.0-ata-3 2>&1
-echo "pci-0000:0c:00.0-ata-1:"
-ls -la /dev/disk/by-path/pci-0000:0c:00.0-ata-1 2>&1