- _collect_snapshot() and _process_interfaces() now skip any Prometheus
instance not explicitly listed in config.json hosts[]. LXC app servers
(postgresql, matrix, etc.) report node_exporter metrics but are not
infrastructure hosts Gandalf should display or alert on.
- Add PBS (10.10.10.3) to config hosts[] with prometheus_instance;
remove from ping_hosts (node_exporter already running on PBS, now
added to Prometheus scrape config as job pbs-node).
- The _instance_map membership check is now consistent across snapshot,
alerting, and ethtool SSH collection.
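
A minimal sketch of the membership check described above, assuming the instance map is a plain dict keyed by Prometheus instance label. The helper name filter_known_instances, the sample layout, and the 10.10.10.50 address are illustrative only and not taken from the code.

```python
def filter_known_instances(samples, instance_map):
    """Keep only samples whose Prometheus instance is listed in instance_map."""
    return [s for s in samples if s["instance"] in instance_map]


# Hypothetical map built from config.json hosts[] (instance label -> host name).
instance_map = {"10.10.10.11:9100": "storage-01"}

samples = [
    {"instance": "10.10.10.11:9100", "device": "eth0", "up": 1},
    # Made-up LXC app server that reports node_exporter metrics but is not in hosts[].
    {"instance": "10.10.10.50:9100", "device": "eth0", "up": 1},
]

known = filter_known_instances(samples, instance_map)
```

Applying the same predicate in snapshot, alerting, and SSH collection is what keeps the three code paths consistent.
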
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
LinkStatsCollector.collect() was SSHing into every host reporting
node_network_* metrics to Prometheus, including unrelated app servers
like postgresql and matrix. Add instance_map membership check so ethtool
collection via Pulse only runs on hosts defined in config.json.
Prometheus metrics (traffic rates, errors) are still collected for all
instances — only the SSH/ethtool step is gated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
monitor.py / diagnose.py PulseClient.run_command:
- Add automatic single retry on submit failure, explicit Pulse failure
(status=failed/timed_out), and poll timeout — handles transient SSH
or Pulse hiccups without dropping the whole collection cycle
- Log execution_id and full Pulse URL on every failure so failed runs
can be found in the Pulse UI immediately
- Handle 'timed_out' and 'cancelled' Pulse statuses explicitly (previously
only 'failed' was caught; others would spin until local deadline)
- Poll every 2s instead of 1s to reduce Pulse API chatter
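
The retry and polling behaviour listed above could look roughly like this. It is a sketch under assumptions: submit and get_status are stand-in callables, the "completed" status name and the outcome strings are hypothetical, and the real PulseClient.run_command may be structured differently.

```python
import time

# Pulse statuses treated as terminal failures (named in the list above).
TERMINAL_FAILURES = {"failed", "timed_out", "cancelled"}
# Outcomes that trigger the single automatic retry; "cancelled" is not retried here.
RETRYABLE = {"submit_failed", "failed", "timed_out", "poll_timeout"}


def run_once(submit, get_status, timeout=45.0, poll_interval=2.0,
             sleep=time.sleep, clock=time.monotonic):
    """Submit one execution and poll until it finishes or the local deadline passes."""
    execution_id = submit()
    if execution_id is None:
        return None, "submit_failed"
    deadline = clock() + timeout
    while clock() < deadline:
        status = get_status(execution_id)
        if status == "completed" or status in TERMINAL_FAILURES:
            return execution_id, status
        sleep(poll_interval)
    return execution_id, "poll_timeout"


def run_with_retry(submit, get_status, retries=1, **kwargs):
    """Run once, then retry up to `retries` times on a retryable outcome."""
    for _ in range(retries + 1):
        execution_id, outcome = run_once(submit, get_status, **kwargs)
        if outcome not in RETRYABLE:
            break
    return execution_id, outcome
```

Returning the execution_id alongside the outcome is what lets the caller log it, together with the Pulse URL, on every failure.
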
SSH command options (_ssh_batch + diagnose.py):
- Add BatchMode=yes: aborts immediately instead of hanging on a
password prompt if key auth fails
- Add ServerAliveInterval=10 ServerAliveCountMax=2: SSH drops an
unresponsive connection within ~20s instead of sitting silent until the
45s Pulse timeout expires
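
For reference, a sketch of how the options above might be assembled into a command line. The three option names are standard OpenSSH client options; build_ssh_command, the root user, and the example remote command are assumptions, not the actual _ssh_batch code.

```python
import shlex

# Options named in this commit; values as described above.
SSH_OPTIONS = {
    "BatchMode": "yes",          # fail fast instead of prompting for a password
    "ServerAliveInterval": "10", # send a keepalive after 10s of silence
    "ServerAliveCountMax": "2",  # give up after 2 unanswered keepalives
}


def build_ssh_command(host, remote_command, user="root"):
    """Return a shell-quoted ssh command line with the hardening options applied."""
    argv = ["ssh"]
    for key, value in SSH_OPTIONS.items():
        argv += ["-o", f"{key}={value}"]
    argv += [f"{user}@{host}", remote_command]
    return shlex.join(argv)


cmd = build_ssh_command("10.10.10.11", "ethtool eth0")
```
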
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Topology:
- Correct series layout: UDM-Pro → USW-Agg → Pro 24 PoE (not a fork)
- Remove CSS fork divs, replace with straight vertical connectors
- Labels: WAN · 10G SFP+ (UDM→Agg), 10G trunk (Agg→PoE)
- Remove ISL from legend (no parallel switch pair)
Inspector:
- Fix USW-Agg port blocks appearing narrower than other switches
- SFP ports in rows now use same width (34px) as copper ports;
all-SFP switches like USL8A no longer look undersized
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove SVG fork with preserveAspectRatio="none" (caused line width
distortion and stretched 10G DAC label like a tube TV)
- Replace with pure CSS .topo-fork: stem + horizontal bar + left/right
drops, all absolutely positioned at consistent 2px width
- Use .topo-sw-row with two 50% halves so switch centres land at
exactly 25% and 75% — matching fork drop positions mathematically
- ISL rendered via ::before/::after on .topo-sw-row (switch boxes
with solid bg cover the line at their edges, leaving only the gap)
- Add .topo-sw-drops: two vertical stubs from switch centres to bus rails
- All lines are now exactly 2px, no distortion, no misalignment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace flat topology with tiered bus-bar layout: Internet → UDM-Pro → SVG fork → USW-Agg + Pro 24 PoE → dual-homed servers
- Show 10G VLAN90 (Ceph) bus from USW-Agg and 1G DHCP management bus from Pro 24 PoE per host
- Add per-host drop wires (solid 10G + dashed 1G) with correct rack positions
- Mark large1 as off-rack (dashed border), ZimaBoards as off-rack mon-01/mon-02
- Add topology legend, inter-switch 10G ISL indicator
- Add recently resolved events section (last 24h) to dashboard
- Add last_seen column and relative timestamps to events table
- Add stale data banner when monitoring data >15 min old
- Improve inspector chassis with port speed labels, LLDP neighbor info, mounting ears, chassis legend
- Add duplex/speed mismatch warnings and carrier changes to path debug panel
- Bump updateTopology() to handle both topo-v2-status-* and topo-status-* classes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All rack servers (and large1 on table) have both a 10G link to USW-Agg
and a 1G management link to Pro 24 PoE. Update topology:
- Move all 6 hosts into single row (including large1)
- Update sublabels to "10G+1G" for all nodes
- large1 dashed-border (off-rack) with "table · 10G+1G"
- Add dashed amber "1G mgmt (PoE)" horizontal band above hosts
to represent the PoE switch management connections
- 10G primary fan-out lines still drop from Agg switch above
- large1 primary line rendered as dashed green (off-rack run)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace linear Internet→UDM→Agg→PoE→all-hosts chain with accurate topology:
- USW-Aggregation and Pro 24 PoE switch shown side-by-side with horizontal
10G SFP+ link between them (not in series)
- 5 compute/storage/monitor nodes fanned out under Agg Switch with 10G labels
and rack unit positions (RU4–12, RU14–17) as sublabels
- large1 shown separately under PoE switch, dashed border = off-rack (table)
- Add device specs as subtitles on all nodes (Dream Machine Pro · RU24, etc.)
- Shorter display names: csg-01 / cs-01 instead of full hostnames
- Live status badges still updated by JS via data-host attributes
- New CSS: .topo-node-sub, .topo-switch-tier, .topo-h-link, .topo-host-tier,
.topo-host-table (dashed), .topo-badge-unknown
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dashboard: pass recent_resolved (last 24h, limit 10) to index template;
render "Recently Resolved" section showing type, target, resolved time,
and calculated duration (first_seen → resolved_at)
- dashboard: event-age spans now also update via setInterval; duration
shown for resolved events (e.g. "2h 15m")
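
The calculated duration (first_seen to resolved_at) can be sketched as below. The function name and the day-level format are assumptions; only the "2h 15m" style comes from the description above.

```python
from datetime import datetime


def format_duration(first_seen, resolved_at):
    """Render the time between two datetimes as a short string such as '2h 15m'."""
    total_minutes = max(0, int((resolved_at - first_seen).total_seconds() // 60))
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    if days:
        return f"{days}d {hours}h"  # assumed format for multi-day events
    if hours:
        return f"{hours}h {minutes}m"
    return f"{minutes}m"


example = format_duration(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 12, 15))
```
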
- links page: link health summary panel shows server iface count,
error/flap counts, switch port up/down, PoE total draw/capacity bar;
only shows problematic stats if non-zero; shows "All OK ✔" when clean
- style.css: new classes for summary panel, resolved row/badge
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- events table: add Last Seen column; show relative times ("3h ago") with
absolute timestamp on hover; update updateEventsTable() in app.js to match
- links.html: add error/drop/flap alert badges to interface and port card headers
- links.html: PoE power bar (draw/max ratio with colour-coded fill) and poe_mode
- links.html: stale data warning banner when link_stats are >2 minutes old
- links.html: improved error handler shows HTTP status instead of generic message
- links.html: persist collapse state in localStorage (was sessionStorage,
which is lost on browser restart); collapseAll/expandAll now persist state too
- inspector.html: duplex mismatch and speed mismatch warnings in path debug panel
- inspector.html: carrier changes added to server column of path debug
- style.css: new classes — .link-alert-badge, .poe-bar-*, .path-mismatch-alert,
.error-state; fix .stale-banner to use CSS variables
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
static/app.js:
- Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF'
- Stale monitoring banner: injected into .main if last_check > 15 min old,
warns operator that the monitor daemon may be down
static/style.css:
- .stale-banner: amber top-border warning strip
app.py:
- /health now checks DB connectivity and monitor freshness (last_check age)
Returns 503 + degraded status if DB unreachable or monitor stale >20min
db.py:
- cleanup_expired_suppressions(): marks time-limited suppressions inactive when
expires_at <= NOW() (was only filtered in SELECTs, never marked inactive)
- purge_old_resolved_events(days=90): deletes old resolved events to prevent
unbounded table growth
monitor.py:
- Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle
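
The /health decision described under app.py above might reduce to a small pure function like this one. It is a sketch: the payload keys and the helper name health_status are hypothetical, and only the 503/degraded behaviour and the 20-minute threshold come from this commit.

```python
def health_status(db_ok, last_check_age_s, stale_after_s=20 * 60):
    """Return (http_status, payload) for the health endpoint."""
    problems = []
    if not db_ok:
        problems.append("db_unreachable")
    # A missing last_check is treated as stale (assumption).
    if last_check_age_s is None or last_check_age_s > stale_after_s:
        problems.append("monitor_stale")
    if problems:
        return 503, {"status": "degraded", "problems": problems}
    return 200, {"status": "ok"}
```

Keeping the decision separate from the Flask route makes the thresholds easy to test without a database.
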
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
app.py:
- Context processor injects config.ticket_api.web_url into all templates
(falls back to 'http://t.lotusguild.org/ticket/' if not set in config)
templates/base.html:
- Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads
static/app.js:
- Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain
templates/index.html:
- Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain
monitor.py:
- CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name
from config monitor.cluster_name, falling back to the constant
- All CLUSTER_NAME references inside class methods replaced with self.cluster_name
templates/inspector.html:
- pollDiagnostic() .catch() now clears interval and shows error message instead
of silently ignoring network failures during active polling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
network_events: idx_event_lookup (event_type, target_name, target_detail, resolved_at)
- Covers the upsert_event SELECT which runs every cycle per monitored entity
- Replaces three separate single-column index scans with one covering lookup
suppression_rules: idx_sup_lookup (active, target_type, target_name, target_detail)
- Covers is_suppressed() queries (now redundant for runtime due to in-memory
check_suppressed, but ensures fast get_active_suppressions() loading per cycle)
Both indexes created on live DB (MariaDB LXC 149).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
db.py:
- Add check_suppressed(suppressions, ...) for in-memory suppression lookups
against pre-loaded list (eliminates N*M DB queries per monitoring cycle)
- get_baseline(): log error instead of silently swallowing JSON parse failure
monitor.py:
- Load active suppressions once per cycle at the top of the alert loop
- Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts
- Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...)
- Reduces DB queries from 100-600+ per cycle down to 1
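
A minimal sketch of the in-memory lookup, assuming each suppression is a dict with the same columns as the suppression_rules table. The parameter names and the matching rule (an empty target_detail covers the whole target) are assumptions; the real check_suppressed may match differently.

```python
def check_suppressed(suppressions, target_type, target_name, target_detail=""):
    """Return True if any pre-loaded rule covers the given target."""
    for rule in suppressions:
        if rule["target_type"] != target_type or rule["target_name"] != target_name:
            continue
        # Assumption: a rule with an empty target_detail covers the whole target.
        if not rule.get("target_detail") or rule["target_detail"] == target_detail:
            return True
    return False


# Loaded once per cycle; stands in for the result of get_active_suppressions().
suppressions = [
    {"target_type": "interface", "target_name": "storage-01", "target_detail": "eth1"},
    {"target_type": "host", "target_name": "pbs", "target_detail": ""},
]
```

Each per-entity check then costs a list scan in memory instead of a database round trip.
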
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- suppressions.html: setDur() now takes explicit element param instead of relying
on implicit global event.target (which fails outside direct click handlers)
- suppressions.html: removeSuppression() now shows error toast on failed DELETE
- templates/index.html: escape description in title attribute with |e filter
to prevent attribute breakout on quotes in description text
- diagnose.py: derive Pulse execution URL from pulse_client.url instead of
hardcoding http://pulse.lotusguild.org
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- port_idx is now coerced with int(); invalid values return HTTP 400 (prevents string/int mismatch)
- api_network and api_links bare except blocks now log errors instead of silently passing
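
The port_idx coercion could be sketched as follows. The helper name parse_port_idx and the error body are hypothetical; only the int() coercion and the 400 status come from this commit.

```python
def parse_port_idx(raw):
    """Return (port_idx, None) on success or (None, (error_body, 400)) on bad input."""
    try:
        return int(raw), None
    except (TypeError, ValueError):
        return None, ({"error": "port_idx must be an integer"}, 400)
```

A route handler would return the error tuple directly when the second element is not None.
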
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder
- H6: Upgrade Pulse failure logging from debug to error so operators see outages
- M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min)
- M7: Background thread marks jobs stuck in 'running' > 5 min as errored
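
A sketch of the C5 input validation, using the standard ipaddress and re modules. The exact allowlist pattern is an assumption (letters, digits, underscore, dot, hyphen, up to the 15-character Linux interface name limit); the regex used in the code may differ.

```python
import ipaddress
import re

# Hypothetical allowlist for interface names.
IFACE_RE = re.compile(r"[A-Za-z0-9_.-]{1,15}")


def is_valid_ipv4(value):
    """True only for a string that parses as a dotted-quad IPv4 address."""
    if not isinstance(value, str):
        return False
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True


def is_valid_iface(value):
    """True only if the whole string matches the allowlist (no trailing newline)."""
    return isinstance(value, str) and IFACE_RE.fullmatch(value) is not None
```

Using fullmatch rather than match with a trailing `$` avoids accepting a name that ends in a newline.
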
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds comprehensive per-port link troubleshooting triggered from the
Inspector panel when a port has an LLDP-identified server counterpart.
- diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier,
operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link,
ip addr, ip route, dmesg, lldpctl); parsers for all sections; health
analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH,
SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.)
- monitor.py: PulseClient now tracks last_execution_id so callers can
link back to the raw Pulse execution URL
- app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon
thread background execution and 10-minute in-memory job store
- inspector.html: "Run Link Diagnostics" button (shown only when LLDP
host is resolvable); full results panel: health banner, physical layer,
SFP/DOM with power bars, NIC error counters, collapsible ethtool -S,
flow control/ring buffers, driver info, LLDP 2-col validation,
collapsible dmesg, switch port summary, "View in Pulse" link
- style.css: all .diag-* CSS classes with terminal aesthetic
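
The 10-minute in-memory job store might look roughly like this. It is a sketch only: the class name JobStore, its method names, and the job dict layout are hypothetical, and the clock is injectable purely so expiry can be exercised without waiting.

```python
import threading
import time

JOB_TTL_S = 10 * 60  # jobs are kept for 10 minutes


class JobStore:
    """Thread-safe dict of diagnostic jobs that expire after JOB_TTL_S seconds."""

    def __init__(self, clock=time.time):
        self._jobs = {}
        self._lock = threading.Lock()
        self._clock = clock

    def create(self, job_id):
        with self._lock:
            self._jobs[job_id] = {"status": "running", "created": self._clock(), "result": None}

    def finish(self, job_id, result):
        with self._lock:
            if job_id in self._jobs:
                self._jobs[job_id].update(status="done", result=result)

    def get(self, job_id):
        """Return a copy of the job, or None if unknown or expired."""
        with self._lock:
            job = self._jobs.get(job_id)
            if job is None:
                return None
            if self._clock() - job["created"] > JOB_TTL_S:
                del self._jobs[job_id]
                return None
            return dict(job)
```

The lock matters because the job is written by the background daemon thread and read by the GET /api/diagnose/<job_id> handler.
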
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- storage-01 now monitored via Prometheus node_exporter (10.10.10.11:9100),
removed from ping_hosts
- Updated data sources table (6 hosts via Prometheus, pbs only via ping)
- Added storage-01 to monitored hosts table
- Fixed Authelia reload command (restart, not reload)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two-service architecture: Flask web app (gandalf.service) + background
polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards
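
The smart-baseline and cluster-wide rules above can be sketched as two small functions. The data layout ({host: {iface: "up" | "down"}}) and the function names are assumptions; only the UP→DOWN rule and the 3-host threshold come from this commit.

```python
def find_regressions(baseline, current):
    """Return {host: [ifaces]} for interfaces that were up in the baseline and are down now."""
    regressions = {}
    for host, ifaces in current.items():
        for iface, state in ifaces.items():
            # Interfaces first seen as down (or never seen) are not alerted on.
            if baseline.get(host, {}).get(iface) == "up" and state == "down":
                regressions.setdefault(host, []).append(iface)
    return regressions


def is_cluster_event(regressions, threshold=3):
    """True when enough distinct hosts regress at once to warrant one cluster-wide ticket."""
    return len(regressions) >= threshold


baseline = {"h1": {"eth0": "up", "eth1": "down"}, "h2": {"eth0": "up"}}
current = {"h1": {"eth0": "down", "eth1": "down"}, "h2": {"eth0": "up"}}
regressed = find_regressions(baseline, current)
```

Counting hosts rather than interfaces keeps a single multi-NIC failure from being mistaken for a cluster-wide outage.
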
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>