gandalf

Author	SHA1	Message	Date
Jared Vititoe	271c3c4373	Exclude LXC IPs from link stats collection Add links_exclude_ips to monitor config; collect() skips any Prometheus instance whose IP is in that list, preventing LXC containers from appearing on the links/inspector pages as phantom hosts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 20:39:47 -04:00
Jared Vititoe	b80fda7cb2	Fix host filtering: only show/monitor configured hosts; add PBS - _collect_snapshot() and _process_interfaces() now skip any Prometheus instance not explicitly listed in config.json hosts[]. LXC app servers (postgresql, matrix, etc.) report node_exporter metrics but are not infrastructure hosts Gandalf should display or alert on. - Add PBS (10.10.10.3) to config hosts[] with prometheus_instance; remove from ping_hosts (node_exporter already running on PBS, now added to Prometheus scrape config as job pbs-node). - The _instance_map membership check is now consistent across snapshot, alerting, and ethtool SSH collection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 17:17:40 -04:00
Jared Vititoe	eb8c0ded5e	Fix: only SSH into explicitly configured hosts for ethtool collection LinkStatsCollector.collect() was SSHing into every host reporting node_network_* metrics to Prometheus, including unrelated app servers like postgresql and matrix. Add instance_map membership check so ethtool collection via Pulse only runs on hosts defined in config.json. Prometheus metrics (traffic rates, errors) are still collected for all instances — only the SSH/ethtool step is gated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 18:35:21 -04:00
Jared Vititoe	b29b70d88b	Improve Pulse execution reliability: retry logic, better logging, SSH hardening monitor.py / diagnose.py PulseClient.run_command: - Add automatic single retry on submit failure, explicit Pulse failure (status=failed/timed_out), and poll timeout — handles transient SSH or Pulse hiccups without dropping the whole collection cycle - Log execution_id and full Pulse URL on every failure so failed runs can be found in the Pulse UI immediately - Handle 'timed_out' and 'cancelled' Pulse statuses explicitly (previously only 'failed' was caught; others would spin until local deadline) - Poll every 2s instead of 1s to reduce Pulse API chatter SSH command options (_ssh_batch + diagnose.py): - Add BatchMode=yes: aborts immediately instead of hanging on a password prompt if key auth fails - Add ServerAliveInterval=10 ServerAliveCountMax=2: SSH detects a hung remote command within ~20s instead of sitting silent until the 45s Pulse timeout expires Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 09:19:07 -04:00
Jared Vititoe	17d3b7d227	New features: stale banner, tab title alerts, health checks, DB housekeeping static/app.js: - Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF' - Stale monitoring banner: injected into .main if last_check > 15 min old, warns operator that the monitor daemon may be down static/style.css: - .stale-banner: amber top-border warning strip app.py: - /health now checks DB connectivity and monitor freshness (last_check age) Returns 503 + degraded status if DB unreachable or monitor stale >20min db.py: - cleanup_expired_suppressions(): marks time-limited suppressions inactive when expires_at <= NOW() (was only filtered in SELECTs, never marked inactive) - purge_old_resolved_events(days=90): deletes old resolved events to prevent unbounded table growth monitor.py: - Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:35:32 -04:00
Jared Vititoe	14eaa6a8c9	De-hardcode ticket URL and cluster name; improve diagnostic polling UX app.py: - Context processor injects config.ticket_api.web_url into all templates (falls back to 'http://t.lotusguild.org/ticket/' if not set in config) templates/base.html: - Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads static/app.js: - Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain templates/index.html: - Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain monitor.py: - CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name from config monitor.cluster_name, falling back to the constant - All CLUSTER_NAME references inside class methods replaced with self.cluster_name templates/inspector.html: - pollDiagnostic() .catch() now clears interval and shows error message instead of silently ignoring network failures during active polling Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:31:57 -04:00
Jared Vititoe	85a018ff6c	Optimize suppression checks: load once per cycle, add error logging db.py: - Add check_suppressed(suppressions, ...) for in-memory suppression lookups against pre-loaded list (eliminates N*M DB queries per monitoring cycle) - get_baseline(): log error instead of silently swallowing JSON parse failure monitor.py: - Load active suppressions once per cycle at the top of the alert loop - Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts - Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...) - Reduces DB queries from 100-600+ per cycle down to 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:13:54 -04:00
Jared Vititoe	0335845101	Security and reliability fixes: input validation, logging, job cleanup - C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder - H6: Upgrade Pulse failure logging from debug to error so operators see outages - M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min) - M7: Background thread marks jobs stuck in 'running' > 5 min as errored Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:30:50 -04:00
Jared Vititoe	b1dd5f9cad	feat: deep link diagnostics via Pulse SSH Adds comprehensive per-port link troubleshooting triggered from the Inspector panel when a port has an LLDP-identified server counterpart. - diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier, operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link, ip addr, ip route, dmesg, lldpctl); parsers for all sections; health analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH, SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.) - monitor.py: PulseClient now tracks last_execution_id so callers can link back to the raw Pulse execution URL - app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon thread background execution and 10-minute in-memory job store - inspector.html: "Run Link Diagnostics" button (shown only when LLDP host is resolvable); full results panel: health banner, physical layer, SFP/DOM with power bars, NIC error counters, collapsible ethtool -S, flow control/ring buffers, driver info, LLDP 2-col validation, collapsible dmesg, switch port summary, "View in Pulse" link - style.css: all .diag-* CSS classes with terminal aesthetic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 16:03:54 -05:00
Jared Vititoe	0278dad502	feat: inspector page, link debug enhancements, security hardening - Add /inspector page: visual model-accurate switch chassis diagrams (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks with color coding (green=up, amber=PoE, cyan=uplink, grey=down), detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side - Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max, collapsible host/switch panels with sessionStorage persistence - monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch port; PulseClient uses requests.Session() for HTTP keep-alive; add shlex.quote() around interface names (defense-in-depth) - Security: suppress buttons use data-* attrs + delegated click handler instead of inline onclick with Jinja2 variable interpolation; remove | safe filter from user-controlled fields in suppressions.html; setDuration() takes explicit el param instead of implicit event global - db.py: thread-local connection reuse with ping(reconnect=True) to avoid a new TCP handshake per query - .gitignore: add config.json (contains credentials), __pycache__ - README: full rewrite covering architecture, all 4 pages, alert logic, config reference, deployment, troubleshooting, security notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:39:48 -05:00
Jared Vititoe	fa7512a2c2	feat: terminal aesthetic rewrite + link debug page - Full dark terminal aesthetic (Pulse/TinkerTickets style): - #0a0a0a background, #00ff41 green, #ffb000 amber, #00ffff cyan - CRT scanline overlay, phosphor glow, ASCII corner pseudoelements - Bracket-notation badges [CRITICAL], monospace font throughout - style.css, base.html, index.html, suppressions.html all rewritten - New Link Debug page (/links, /api/links): - Per-host, per-interface cards with speed/duplex/port type/auto-neg - Traffic bars (TX cyan, RX green) with rate labels - Error/drop counters, carrier change history - SFP/DOM optical panel: vendor, temp, voltage, bias, TX/RX power dBm bars - RX-TX delta shown; color-coded warn/crit thresholds - Auto-refresh every 60s, anchor-jump to #hostname - LinkStatsCollector in monitor.py: - SSHes to each host (one connection, all ifaces batched) - Parses ethtool + ethtool -m (SFP DOM) output - Merges with Prometheus traffic/error/carrier metrics - Stores as link_stats in monitor_state table - config.json: added ssh section for ethtool collection - app.js: terminal chip style consistency (uppercase, ● bullet) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 12:43:11 -05:00
Jared Vititoe	0c0150f698	Complete rewrite: full-featured network monitoring dashboard - Two-service architecture: Flask web app (gandalf.service) + background polling daemon (gandalf-monitor.service) - Monitor polls Prometheus node_network_up for physical NIC states on all 6 hypervisors (added storage-01 at 10.10.10.11:9100) - UniFi API monitoring for switches, APs, and gateway device status - Ping reachability for hosts without node_exporter (pbs only now) - Smart baseline: interfaces first seen as down are never alerted on; only UP→DOWN regressions trigger tickets - Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous interface regressions (guards against false positives on startup) - Tinker Tickets integration with 24-hour hash-based deduplication - Alert suppression: manual toggle or timed windows (30m/1h/4h/8h) - Authelia SSO via forward-auth headers, admin group required - Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) → PoE Switch (10G DAC) → Hosts - MariaDB schema, suppression management UI, host/interface cards Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 23:03:18 -05:00
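
The "smart baseline" and cluster-wide rules described in 0c0150f698 can be sketched roughly as follows. This is an illustration only, not the code in monitor.py: the names find_regressions, is_cluster_event, and seen_up are hypothetical, and the sketch assumes the baseline is simply the set of (host, interface) pairs that have been observed up at least once.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical key type: (hostname, interface name).
Key = Tuple[str, str]


def find_regressions(seen_up: Set[Key], current: Dict[Key, bool]) -> List[Key]:
    """Return interfaces that went UP -> DOWN; update seen_up in place.

    Assumption: an interface is only eligible for alerting once it has
    been observed up, so interfaces first seen down never alert.
    """
    regressions = []
    for key, is_up in sorted(current.items()):
        if is_up:
            # Remember that this interface has been up at least once.
            seen_up.add(key)
        elif key in seen_up:
            # Previously up, now down: a genuine regression.
            regressions.append(key)
    return regressions


def is_cluster_event(regressions: List[Key], threshold: int = 3) -> bool:
    """True when regressions span at least `threshold` distinct hosts."""
    hosts = {host for host, _iface in regressions}
    return len(hosts) >= threshold
```

Counting distinct hosts rather than interfaces means several ports failing on a single host would not, under this assumption, escalate to the cluster-wide ticket.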

12 Commits
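
Several commits above (0335845101, 0278dad502, b29b70d88b) harden the SSH command path: an IPv4 check on host_ip, an allowlist regex on iface, shlex.quote() around interface names, and the BatchMode/ServerAlive options. A minimal sketch of how those pieces might fit together is below. The function names and the exact regex are assumptions for illustration; the real _ssh_batch and diagnose.py builders are not shown in this log and may differ.

```python
import ipaddress
import re
import shlex
from typing import List

# Assumed allowlist: Linux interface names are at most 15 characters;
# the permitted character set here is a guess, not the project's regex.
IFACE_RE = re.compile(r"[A-Za-z0-9_.:-]{1,15}")


def is_valid_ipv4(value: str) -> bool:
    """Return True only for a well-formed dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def build_ethtool_command(host_ip: str, iface: str) -> List[str]:
    """Build an ssh argv list, rejecting unvalidated input first."""
    if not is_valid_ipv4(host_ip):
        raise ValueError(f"invalid host_ip: {host_ip!r}")
    if not IFACE_RE.fullmatch(iface):
        raise ValueError(f"invalid iface: {iface!r}")
    # Quote anyway as defense-in-depth, even though the regex already
    # excludes shell metacharacters.
    remote = f"ethtool {shlex.quote(iface)}"
    return [
        "ssh",
        "-o", "BatchMode=yes",           # fail fast instead of prompting
        "-o", "ServerAliveInterval=10",  # probe the connection every 10s
        "-o", "ServerAliveCountMax=2",   # give up after 2 missed probes
        host_ip,
        remote,
    ]
```

Returning an argv list rather than a single string keeps the local side free of shell parsing; only the remote command string needs quoting.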