static/app.js:
- Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF'
- Stale monitoring banner: injected into .main when last_check is more than
15 min old; warns the operator that the monitor daemon may be down
static/style.css:
- .stale-banner: amber top-border warning strip
app.py:
- /health now checks DB connectivity and monitor freshness (last_check age);
returns 503 with a degraded status if the DB is unreachable or the monitor
has been stale for more than 20 min
db.py:
- cleanup_expired_suppressions(): marks time-limited suppressions inactive when
expires_at <= NOW() (previously they were only filtered out in SELECTs and
never marked inactive)
- purge_old_resolved_events(days=90): deletes old resolved events to prevent
unbounded table growth
monitor.py:
- Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle
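
The /health decision described above can be sketched roughly as follows. This is
an illustration only, not the code in app.py: evaluate_health, its parameters,
and the payload shape are hypothetical names; only the 20-minute threshold and
the 503/degraded behaviour come from these notes.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the ">20min" figure in the notes; the name is assumed.
MAX_MONITOR_AGE = timedelta(minutes=20)


def evaluate_health(db_ok, last_check, now=None, max_age=MAX_MONITOR_AGE):
    """Return (http_status, payload) for a /health-style endpoint.

    db_ok      -- bool result of a DB connectivity probe (supplied by the caller)
    last_check -- timezone-aware datetime of the monitor's last cycle, or None
    """
    now = now or datetime.now(timezone.utc)
    problems = []
    if not db_ok:
        problems.append("db_unreachable")
    # A missing timestamp is treated the same as a stale one.
    if last_check is None or now - last_check > max_age:
        problems.append("monitor_stale")
    status = "degraded" if problems else "ok"
    return (503 if problems else 200), {"status": status, "problems": problems}
```

Keeping the decision in a pure function like this makes the staleness rule easy
to unit-test without a database or a running monitor.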
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
app.py:
- Context processor injects config.ticket_api.web_url into all templates
(falls back to 'http://t.lotusguild.org/ticket/' if not set in config)
templates/base.html:
- Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads
static/app.js:
- Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain
templates/index.html:
- Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain
monitor.py:
- CLUSTER_NAME constant kept as the default; NetworkMonitor now reads
monitor.cluster_name from config, falling back to the constant
- All CLUSTER_NAME references inside class methods replaced with self.cluster_name
templates/inspector.html:
- pollDiagnostic() .catch() now clears interval and shows error message instead
of silently ignoring network failures during active polling
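
The config-with-fallback pattern used for ticket_api.web_url and
monitor.cluster_name might look something like the sketch below. The helper
get_nested and the CLUSTER_NAME value shown here are hypothetical; the real
constant's value is not stated in these notes, and "not set" is assumed to mean
missing, None, or an empty string.

```python
DEFAULT_TICKET_WEB_URL = "http://t.lotusguild.org/ticket/"  # fallback from the notes
CLUSTER_NAME = "example-cluster"  # placeholder; the real default is not given


def get_nested(config, *keys, default=None):
    """Walk nested dicts, returning default if any key is missing or unset."""
    node = config
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    # Assumption: None and "" both count as "not set".
    return default if node in (None, "") else node


def ticket_web_url(config):
    return get_nested(config, "ticket_api", "web_url", default=DEFAULT_TICKET_WEB_URL)


def cluster_name(config):
    return get_nested(config, "monitor", "cluster_name", default=CLUSTER_NAME)
```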
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- port_idx now coerced to int() with 400 on invalid type (prevents string/int mismatch)
- api_network and api_links bare except blocks now log errors instead of silently passing
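
A minimal sketch of the port_idx coercion, assuming the handler turns the
failure case into a 400 response. parse_port_idx and its return shape are
hypothetical and not taken from the actual route code.

```python
def parse_port_idx(raw):
    """Return (port_idx, None) on success or (None, (body, 400)) on failure."""
    error = ({"error": "port_idx must be an integer"}, 400)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool):
        return None, error
    try:
        return int(raw), None
    except (TypeError, ValueError):
        return None, error
```

With this shape, "3" and 3 both yield the integer 3, so later lookups keyed by
int do not miss because of a string/int mismatch.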
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder
- H6: Upgrade Pulse failure logging from debug to error so operators see outages
- M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min)
- M7: Background thread marks jobs stuck in 'running' > 5 min as errored
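
The C5 input checks could be sketched as below, using the standard-library
ipaddress module. The exact allowlist pattern is an assumption (the real regex
is not shown in these notes); this one permits up to 15 characters, matching
the usual Linux interface-name limit.

```python
import ipaddress
import re

# Assumed allowlist: starts with an alphanumeric, then letters, digits, '.', '_' or '-'.
IFACE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,14}")


def is_valid_ipv4(value):
    """True only for a dotted-quad IPv4 string."""
    if not isinstance(value, str):
        return False
    try:
        ipaddress.IPv4Address(value)
    except ValueError:  # AddressValueError is a ValueError subclass
        return False
    return True


def is_valid_iface(value):
    """True only if the whole string matches the allowlist."""
    # fullmatch avoids accepting a trailing newline, which '$' would allow.
    return isinstance(value, str) and IFACE_RE.fullmatch(value) is not None


def validate_target(host_ip, iface):
    """Raise ValueError before any value reaches the SSH command builder."""
    if not is_valid_ipv4(host_ip):
        raise ValueError("invalid host_ip")
    if not is_valid_iface(iface):
        raise ValueError("invalid iface")
    return host_ip, iface
```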
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds comprehensive per-port link troubleshooting triggered from the
Inspector panel when a port has an LLDP-identified server counterpart.
- diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier,
operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link,
ip addr, ip route, dmesg, lldpctl); parsers for all sections; health
analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH,
SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.)
- monitor.py: PulseClient now tracks last_execution_id so callers can
link back to the raw Pulse execution URL
- app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon
thread background execution and 10-minute in-memory job store
- inspector.html: "Run Link Diagnostics" button (shown only when LLDP
host is resolvable); full results panel: health banner, physical layer,
SFP/DOM with power bars, NIC error counters, collapsible ethtool -S,
flow control/ring buffers, driver info, LLDP 2-col validation,
collapsible dmesg, switch port summary, "View in Pulse" link
- style.css: all .diag-* CSS classes with terminal aesthetic
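
The POST /api/diagnose + GET /api/diagnose/<job_id> flow relies on an in-memory
job store with daemon-thread execution. A rough sketch of that idea follows;
the JobStore class, its method names, and the status strings are hypothetical,
and only the 10-minute retention comes from these notes.

```python
import threading
import time
import uuid

JOB_TTL_SECONDS = 600  # 10-minute retention, per the notes


class JobStore:
    def __init__(self, ttl=JOB_TTL_SECONDS, clock=time.monotonic):
        self._jobs = {}
        self._lock = threading.Lock()
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested

    def submit(self, func, *args):
        """Start func(*args) in a daemon thread and return a job id."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None,
                                  "created": self._clock()}
        threading.Thread(target=self._run, args=(job_id, func, args),
                         daemon=True).start()
        return job_id

    def _run(self, job_id, func, args):
        try:
            outcome = {"status": "done", "result": func(*args)}
        except Exception as exc:  # record the failure instead of losing it
            outcome = {"status": "error", "result": str(exc)}
        with self._lock:
            if job_id in self._jobs:
                self._jobs[job_id].update(outcome)

    def get(self, job_id):
        """Return a copy of the job record, or None if unknown or purged."""
        with self._lock:
            job = self._jobs.get(job_id)
            return dict(job) if job else None

    def purge_expired(self):
        """Drop jobs older than the TTL; return how many were removed."""
        cutoff = self._clock() - self._ttl
        with self._lock:
            expired = [j for j, job in self._jobs.items() if job["created"] < cutoff]
            for j in expired:
                del self._jobs[j]
        return len(expired)
```

In this sketch the POST handler would call submit() and return the job id, and
the GET handler would call get() and report 404 when it returns None.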
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two-service architecture: Flask web app (gandalf.service) + background
polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (currently only pbs)
- Smart baseline: interfaces first seen as down are never alerted on;
only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards
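
The smart-baseline and cluster-wide rules can be illustrated with the sketch
below. It is a simplified model, not the monitor's actual code: the function
names and the dict-of-(host, iface) representation are assumptions, while the
"first seen down is never alerted" rule and the 3-host threshold come from
these notes.

```python
CLUSTER_THRESHOLD = 3  # "3+ hosts" from the notes


def update_baseline(baseline, current):
    """Record the first-seen state of each (host, iface); never overwrite it."""
    for key, is_up in current.items():
        baseline.setdefault(key, is_up)


def find_regressions(baseline, current):
    """Return interfaces that were first seen UP and are now DOWN."""
    return sorted(key for key, is_up in current.items()
                  if baseline.get(key) is True and not is_up)


def is_cluster_event(regressions, threshold=CLUSTER_THRESHOLD):
    """True when regressions span at least `threshold` distinct hosts."""
    return len({host for host, _iface in regressions}) >= threshold
```

For example, an interface that is down on the very first poll enters the
baseline as down, so it never shows up in find_regressions(), which is what
keeps unused ports from generating tickets at startup.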
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>