gandalf

Author	SHA1	Message	Date
jared	b80fda7cb2	Fix host filtering: only show/monitor configured hosts; add PBS - _collect_snapshot() and _process_interfaces() now skip any Prometheus instance not explicitly listed in config.json hosts[]. LXC app servers (postgresql, matrix, etc.) report node_exporter metrics but are not infrastructure hosts Gandalf should display or alert on. - Add PBS (10.10.10.3) to config hosts[] with prometheus_instance; remove from ping_hosts (node_exporter already running on PBS, now added to Prometheus scrape config as job pbs-node). - The _instance_map membership check is now consistent across snapshot, alerting, and ethtool SSH collection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 17:17:40 -04:00
jared	eb8c0ded5e	Fix: only SSH into explicitly configured hosts for ethtool collection LinkStatsCollector.collect() was SSHing into every host reporting node_network_* metrics to Prometheus, including unrelated app servers like postgresql and matrix. Add instance_map membership check so ethtool collection via Pulse only runs on hosts defined in config.json. Prometheus metrics (traffic rates, errors) are still collected for all instances — only the SSH/ethtool step is gated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 18:35:21 -04:00
jared	b29b70d88b	Improve Pulse execution reliability: retry logic, better logging, SSH hardening monitor.py / diagnose.py PulseClient.run_command: - Add automatic single retry on submit failure, explicit Pulse failure (status=failed/timed_out), and poll timeout — handles transient SSH or Pulse hiccups without dropping the whole collection cycle - Log execution_id and full Pulse URL on every failure so failed runs can be found in the Pulse UI immediately - Handle 'timed_out' and 'cancelled' Pulse statuses explicitly (previously only 'failed' was caught; others would spin until local deadline) - Poll every 2s instead of 1s to reduce Pulse API chatter SSH command options (_ssh_batch + diagnose.py): - Add BatchMode=yes: aborts immediately instead of hanging on a password prompt if key auth fails - Add ServerAliveInterval=10 ServerAliveCountMax=2: SSH detects a hung remote command within ~20s instead of sitting silent until the 45s Pulse timeout expires Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 09:19:07 -04:00
jared	2c67944b4b	Fix topology chain order and inspector SFP port width Topology: - Correct series layout: UDM-Pro → USW-Agg → Pro 24 PoE (not a fork) - Remove CSS fork divs, replace with straight vertical connectors - Labels: WAN · 10G SFP+ (UDM→Agg), 10G trunk (Agg→PoE) - Remove ISL from legend (no parallel switch pair) Inspector: - Fix USW-Agg port blocks appearing narrower than other switches - SFP ports in rows now use same width (34px) as copper ports; all-SFP switches like USL8A no longer look undersized Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 22:42:38 -04:00
jared	e8314b5ba3	Fix topology diagram: replace SVG fork with CSS, fix line alignment - Remove SVG fork with preserveAspectRatio="none" (caused line width distortion and stretched 10G DAC label like a tube TV) - Replace with pure CSS .topo-fork: stem + horizontal bar + left/right drops, all absolutely positioned at consistent 2px width - Use .topo-sw-row with two 50% halves so switch centres land at exactly 25% and 75% — matching fork drop positions mathematically - ISL rendered via ::before/::after on .topo-sw-row (switch boxes with solid bg cover the line at their edges, leaving only the gap) - Add .topo-sw-drops: two vertical stubs from switch centres to bus rails - All lines are now exactly 2px, no distortion, no misalignment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 22:35:02 -04:00
jared	3dce602938	Redesign topology diagram with dual-homed bus layout and improve inspector chassis - Replace flat topology with tiered bus-bar layout: Internet → UDM-Pro → SVG fork → USW-Agg + Pro 24 PoE → dual-homed servers - Show 10G VLAN90 (Ceph) bus from USW-Agg and 1G DHCP management bus from Pro 24 PoE per host - Add per-host drop wires (solid 10G + dashed 1G) with correct rack positions - Mark large1 as off-rack (dashed border), ZimaBoards as off-rack mon-01/mon-02 - Add topology legend, inter-switch 10G ISL indicator - Add recently resolved events section (last 24h) to dashboard - Add last_seen column and relative timestamps to events table - Add stale data banner when monitoring data >15 min old - Improve inspector chassis with port speed labels, LLDP neighbor info, mounting ears, chassis legend - Add duplex/speed mismatch warnings and carrier changes to path debug panel - Bump updateTopology() to handle both topo-v2-status-* and topo-status-* classes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 22:22:19 -04:00
jared	6eb21055ef	fix: topology — reflect VLAN90 Ceph network and DHCP management separation 10G SFP+ ports on USW-Agg are VLAN90 (10.10.90.x/24, static IPs, Ceph storage). 1G ports on Pro 24 PoE are DHCP management. Update topology to show this: - USW-Agg sublabel shows VLAN90 · 10.10.90.x (cyan) - Pro 24 PoE sublabel shows DHCP mgmt (cyan) - Host sublabels changed from "10G+1G" to "VLAN90" for the 10G Agg connection - 1G management band label updated to "← 1G DHCP mgmt (Pro 24 PoE) →" - Add .topo-vlan-tag CSS for cyan VLAN annotation on switch nodes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 22:10:17 -04:00
jared	f2541eb45c	fix: topology — all servers dual-homed 10G+1G, show mgmt band All rack servers (and large1 on table) have both a 10G link to USW-Agg and a 1G management link to Pro 24 PoE. Update topology: - Move all 6 hosts into single row (including large1) - Update sublabels to "10G+1G" for all nodes - large1 dashed-border (off-rack) with "table · 10G+1G" - Add dashed amber "1G mgmt (PoE)" horizontal band above hosts to represent the PoE switch management connections - 10G primary fan-out lines still drop from Agg switch above - large1 primary line rendered as dashed green (off-rack run) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 22:08:48 -04:00
jared	e779b21db4	feat: redesign network topology diagram with accurate rack layout Replace linear Internet→UDM→Agg→PoE→all-hosts chain with accurate topology: - USW-Aggregation and Pro 24 PoE switch shown side-by-side with horizontal 10G SFP+ link between them (not in series) - 5 compute/storage/monitor nodes fanned out under Agg Switch with 10G labels and rack unit positions (RU4–12, RU14–17) as sublabels - large1 shown separately under PoE switch, dashed border = off-rack (table) - Add device specs as subtitles on all nodes (Dream Machine Pro · RU24, etc.) - Shorter display names: csg-01 / cs-01 instead of full hostnames - Live status badges still updated by JS via data-host attributes - New CSS: .topo-node-sub, .topo-switch-tier, .topo-h-link, .topo-host-tier, .topo-host-table (dashed), .topo-badge-unknown Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 22:06:03 -04:00
jared	c1fd53f9bd	Remove aesthetic_diff.md reference from README — convergence complete Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:50:02 -04:00
jared	0ca6b1f744	feat: link health summary, recently resolved panel, event duration - dashboard: pass recent_resolved (last 24h, limit 10) to index template; render "Recently Resolved" section showing type, target, resolved time, and calculated duration (first_seen → resolved_at) - dashboard: event-age spans now also update via setInterval; duration shown for resolved events (e.g. "2h 15m") - links page: link health summary panel shows server iface count, error/flap counts, switch port up/down, PoE total draw/capacity bar; only shows problematic stats if non-zero; shows "All OK ✔" when clean - style.css: new classes for summary panel, resolved row/badge Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:48:40 -04:00
jared	6b6eaa6227	feat: UI improvements — event ages, error badges, PoE bars, mismatch detection - events table: add Last Seen column; show relative times ("3h ago") with absolute timestamp on hover; update updateEventsTable() in app.js to match - links.html: add error/drop/flap alert badges to interface and port card headers - links.html: PoE power bar (draw/max ratio with colour-coded fill) and poe_mode - links.html: stale data warning banner when link_stats are >2 minutes old - links.html: improved error handler shows HTTP status instead of generic message - links.html: fix collapse state persisted to localStorage (was sessionStorage, lost on browser restart); fix collapseAll/expandAll to also persist state - inspector.html: duplex mismatch and speed mismatch warnings in path debug panel - inspector.html: carrier changes added to server column of path debug - style.css: new classes — .link-alert-badge, .poe-bar-*, .path-mismatch-alert, .error-state; fix .stale-banner to use CSS variables Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:46:11 -04:00
jared	9c9acbb023	Apply LotusGuild design system convergence (aesthetic_diff.md) CSS (style.css): - §1: Add unified naming aliases (--terminal-green, --bg-primary, etc.) - §2: Upgrade borders: modal 1px→3px double, btn/btn-sm/inputs 1px→2px - §3: Add [ ] bracket decorations to .btn classes; primary keeps > prefix; hover lift -1px→-2px; padding 6px 14px→5px 12px - §4: Fix glow definitions from 2-layer rgba to 3-layer solid stack - §5: Section headers now symmetric ╠═══ TITLE ═══╣ (was one-sided) - §6+§7: Modal border 3px double, corners ┌┐→╔╗, add glow shadow - §11: Nav active state now amber tint (was green); hover remains green - §15: Scanline opacity 0.13→0.15; flicker delay 45s→30s JS (app.js): - §18: Replace custom showToast() with lt.toast.* delegate wrapper Templates (base.html): - Load base.css and base.js (symlinked from web_template) - Add lt-boot overlay for boot sequence animation (§13) README: Remove completed pending convergence items Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:40:20 -04:00
jared	17d3b7d227	New features: stale banner, tab title alerts, health checks, DB housekeeping static/app.js: - Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF' - Stale monitoring banner: injected into .main if last_check > 15 min old, warns operator that the monitor daemon may be down static/style.css: - .stale-banner: amber top-border warning strip app.py: - /health now checks DB connectivity and monitor freshness (last_check age) Returns 503 + degraded status if DB unreachable or monitor stale >20min db.py: - cleanup_expired_suppressions(): marks time-limited suppressions inactive when expires_at <= NOW() (was only filtered in SELECTs, never marked inactive) - purge_old_resolved_events(days=90): deletes old resolved events to prevent unbounded table growth monitor.py: - Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:35:32 -04:00
jared	14eaa6a8c9	De-hardcode ticket URL and cluster name; improve diagnostic polling UX app.py: - Context processor injects config.ticket_api.web_url into all templates (falls back to 'http://t.lotusguild.org/ticket/' if not set in config) templates/base.html: - Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads static/app.js: - Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain templates/index.html: - Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain monitor.py: - CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name from config monitor.cluster_name, falling back to the constant - All CLUSTER_NAME references inside class methods replaced with self.cluster_name templates/inspector.html: - pollDiagnostic() .catch() now clears interval and shows error message instead of silently ignoring network failures during active polling Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:31:57 -04:00
jared	8f852ed830	Add compound DB indexes for hot query paths network_events: idx_event_lookup (event_type, target_name, target_detail, resolved_at) - Covers the upsert_event SELECT which runs every cycle per monitored entity - Replaces three separate single-column index scans with one covering lookup suppression_rules: idx_sup_lookup (active, target_type, target_name, target_detail) - Covers is_suppressed() queries (now redundant for runtime due to in-memory check_suppressed, but ensures fast get_active_suppressions() loading per cycle) Both indexes created on live DB (MariaDB LXC 149). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:24:40 -04:00
jared	85a018ff6c	Optimize suppression checks: load once per cycle, add error logging db.py: - Add check_suppressed(suppressions, ...) for in-memory suppression lookups against pre-loaded list (eliminates N*M DB queries per monitoring cycle) - get_baseline(): log error instead of silently swallowing JSON parse failure monitor.py: - Load active suppressions once per cycle at the top of the alert loop - Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts - Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...) - Reduces DB queries from 100-600+ per cycle down to 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:13:54 -04:00
jared	af26407363	Fix setDur implicit event, title XSS, hardcoded pulse URL, suppress error toast - suppressions.html: setDur() now takes explicit element param instead of relying on implicit global event.target (which fails outside direct click handlers) - suppressions.html: removeSuppression() now shows error toast on failed DELETE - templates/index.html: escape description in title attribute with \|e filter to prevent attribute breakout on quotes in description text - diagnose.py: derive Pulse execution URL from pulse_client.url instead of hardcoding http://pulse.lotusguild.org Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-13 14:36:55 -04:00
jared	f8395dcd24	Fix port_idx type coercion and add logging to silent except blocks - port_idx now coerced to int() with 400 on invalid type (prevents string/int mismatch) - api_network and api_links bare except blocks now log errors instead of silently passing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:35:41 -04:00
jared	0335845101	Security and reliability fixes: input validation, logging, job cleanup - C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder - H6: Upgrade Pulse failure logging from debug to error so operators see outages - M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min) - M7: Background thread marks jobs stuck in 'running' > 5 min as errored Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:30:50 -04:00
jared	b1dd5f9cad	feat: deep link diagnostics via Pulse SSH Adds comprehensive per-port link troubleshooting triggered from the Inspector panel when a port has an LLDP-identified server counterpart. - diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier, operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link, ip addr, ip route, dmesg, lldpctl); parsers for all sections; health analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH, SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.) - monitor.py: PulseClient now tracks last_execution_id so callers can link back to the raw Pulse execution URL - app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon thread background execution and 10-minute in-memory job store - inspector.html: "Run Link Diagnostics" button (shown only when LLDP host is resolvable); full results panel: health banner, physical layer, SFP/DOM with power bars, NIC error counters, collapsible ethtool -S, flow control/ring buffers, driver info, LLDP 2-col validation, collapsible dmesg, switch port summary, "View in Pulse" link - style.css: all .diag-* CSS classes with terminal aesthetic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 16:03:54 -05:00
jared	0278dad502	feat: inspector page, link debug enhancements, security hardening - Add /inspector page: visual model-accurate switch chassis diagrams (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks with color coding (green=up, amber=PoE, cyan=uplink, grey=down), detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side - Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max, collapsible host/switch panels with sessionStorage persistence - monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch port; PulseClient uses requests.Session() for HTTP keep-alive; add shlex.quote() around interface names (defense-in-depth) - Security: suppress buttons use data-* attrs + delegated click handler instead of inline onclick with Jinja2 variable interpolation; remove \| safe filter from user-controlled fields in suppressions.html; setDuration() takes explicit el param instead of implicit event global - db.py: thread-local connection reuse with ping(reconnect=True) to avoid a new TCP handshake per query - .gitignore: add config.json (contains credentials), __pycache__ - README: full rewrite covering architecture, all 4 pages, alert logic, config reference, deployment, troubleshooting, security notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:39:48 -05:00
jared	fa7512a2c2	feat: terminal aesthetic rewrite + link debug page - Full dark terminal aesthetic (Pulse/TinkerTickets style): - #0a0a0a background, #00ff41 green, #ffb000 amber, #00ffff cyan - CRT scanline overlay, phosphor glow, ASCII corner pseudoelements - Bracket-notation badges [CRITICAL], monospace font throughout - style.css, base.html, index.html, suppressions.html all rewritten - New Link Debug page (/links, /api/links): - Per-host, per-interface cards with speed/duplex/port type/auto-neg - Traffic bars (TX cyan, RX green) with rate labels - Error/drop counters, carrier change history - SFP/DOM optical panel: vendor, temp, voltage, bias, TX/RX power dBm bars - RX-TX delta shown; color-coded warn/crit thresholds - Auto-refresh every 60s, anchor-jump to #hostname - LinkStatsCollector in monitor.py: - SSHes to each host (one connection, all ifaces batched) - Parses ethtool + ethtool -m (SFP DOM) output - Merges with Prometheus traffic/error/carrier metrics - Stores as link_stats in monitor_state table - config.json: added ssh section for ethtool collection - app.js: terminal chip style consistency (uppercase, ● bullet) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 12:43:11 -05:00
jared	4356af1d84	chore: remove deploy test line from README	2026-03-02 12:08:16 -05:00
jared	56f86f6169	chore: test auto-deploy pipeline	2026-03-02 12:05:59 -05:00
jared	4600229207	chore: clean up deploy test line from README	2026-03-02 12:00:46 -05:00
jared	ff1edb5e0f	chore: trigger deploy test	2026-03-02 11:58:42 -05:00
jared	67072099ca	docs: update README for storage-01 Prometheus migration - storage-01 now monitored via Prometheus node_exporter (10.10.10.11:9100), removed from ping_hosts - Updated data sources table (6 hosts via Prometheus, pbs only via ping) - Added storage-01 to monitored hosts table - Fixed Authelia reload command (restart, not reload) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 23:05:27 -05:00
jared	0c0150f698	Complete rewrite: full-featured network monitoring dashboard - Two-service architecture: Flask web app (gandalf.service) + background polling daemon (gandalf-monitor.service) - Monitor polls Prometheus node_network_up for physical NIC states on all 6 hypervisors (added storage-01 at 10.10.10.11:9100) - UniFi API monitoring for switches, APs, and gateway device status - Ping reachability for hosts without node_exporter (pbs only now) - Smart baseline: interfaces first seen as down are never alerted on; only UP→DOWN regressions trigger tickets - Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous interface regressions (guards against false positives on startup) - Tinker Tickets integration with 24-hour hash-based deduplication - Alert suppression: manual toggle or timed windows (30m/1h/4h/8h) - Authelia SSO via forward-auth headers, admin group required - Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) → PoE Switch (10G DAC) → Hosts - MariaDB schema, suppression management UI, host/interface cards Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 23:03:18 -05:00
jared	4ed5ecacbb	added git ignore	2025-03-01 13:34:25 -05:00
jared	004c97f492	interface update	2025-02-08 00:32:25 -05:00
jared	9f92ac5c1a	fixed indent	2025-02-08 00:16:45 -05:00
jared	ea5e86ef33	lots logs	2025-02-08 00:16:06 -05:00
jared	19224d14df	added raw back	2025-02-08 00:13:06 -05:00
jared	68beb7b1c4	more dynamic	2025-02-08 00:11:28 -05:00
jared	dc117b276e	fix syntax error	2025-02-08 00:04:42 -05:00
jared	b67a5d10c2	dynamic devices	2025-02-08 00:03:01 -05:00
jared	4c90fbb168	interfaces update	2025-02-07 23:57:34 -05:00
jared	da59d50560	update index	2025-02-07 23:54:28 -05:00
jared	610f55710d	updated index html	2025-02-07 23:51:13 -05:00
jared	067ce4d316	update html	2025-02-07 23:38:49 -05:00
jared	02d03f4f3f	json update	2025-02-07 23:26:41 -05:00
jared	1549f39c2c	v2 api	2025-02-07 23:24:36 -05:00
jared	a2c8368439	wrong indentation	2025-02-07 23:20:14 -05:00
jared	75cdef709f	Bearer token	2025-02-07 23:19:50 -05:00
jared	0417106e88	Auth order plz	2025-02-07 23:17:12 -05:00
jared	3c4a9651b5	CSRF token	2025-02-07 23:14:36 -05:00
jared	de24b9ef98	bearer token	2025-02-07 23:12:29 -05:00
jared	37022b132f	updated stats	2025-02-07 23:09:11 -05:00
jared	5d5aea3cf4	acquire site id	2025-02-07 23:06:49 -05:00
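
Commit 85a018ff6c describes replacing per-entity db.is_suppressed() queries with an in-memory check against a list loaded once per cycle. A minimal sketch of that idea follows. The function name, its parameters beyond the suppression list, the rule dictionary shape, and the "empty detail means wildcard" behaviour are all assumptions made for illustration, not the actual db.py code.

```python
from typing import Iterable, Mapping, Optional


def check_suppressed_sketch(
    suppressions: Iterable[Mapping[str, Optional[str]]],
    target_type: str,
    target_name: str,
    target_detail: Optional[str] = None,
) -> bool:
    """Return True if any pre-loaded rule matches the target.

    Hypothetical rule shape: dicts with target_type, target_name and
    target_detail keys. A rule with an empty target_detail is treated
    as matching every detail of that target (assumed behaviour).
    """
    for rule in suppressions:
        if rule.get("target_type") != target_type:
            continue
        if rule.get("target_name") != target_name:
            continue
        rule_detail = rule.get("target_detail")
        if not rule_detail or rule_detail == target_detail:
            return True
    return False


# Loaded once at the top of a cycle; a literal list stands in for the
# single DB query here.
active = [
    {"target_type": "interface", "target_name": "host-a", "target_detail": "eth0"},
    {"target_type": "host", "target_name": "host-b", "target_detail": None},
]

for name, detail in [("host-a", "eth0"), ("host-a", "eth1")]:
    print(name, detail, check_suppressed_sketch(active, "interface", name, detail))
```

Because the list is passed in, every check in the cycle reuses the same snapshot, which is what removes the per-entity database round trips the commit mentions.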


90 Commits
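
Commits 0335845101, 0278dad502 and b29b70d88b together describe validating host_ip and iface, quoting interface names with shlex.quote(), and adding BatchMode=yes, ServerAliveInterval=10 and ServerAliveCountMax=2 to the SSH options. The sketch below shows how those pieces could fit together. The function name, the allowlist pattern, the default user, and the single ethtool call are illustrative assumptions, not the project's _ssh_batch implementation.

```python
import ipaddress
import re
import shlex

# Assumed allowlist: Linux interface names are at most 15 characters.
IFACE_RE = re.compile(r"[A-Za-z0-9_.:-]{1,15}")


def build_ethtool_ssh_command(host_ip: str, iface: str, user: str = "root") -> list:
    """Build an argv list for a hardened, non-interactive SSH call."""
    # Raises ValueError (AddressValueError) if host_ip is not valid IPv4.
    ipaddress.IPv4Address(host_ip)
    if not IFACE_RE.fullmatch(iface):
        raise ValueError(f"invalid interface name: {iface!r}")

    # Quote anyway as defense-in-depth, even after the allowlist check.
    remote_cmd = f"ethtool {shlex.quote(iface)}"
    return [
        "ssh",
        "-o", "BatchMode=yes",           # fail fast instead of prompting
        "-o", "ServerAliveInterval=10",  # probe the server every 10s
        "-o", "ServerAliveCountMax=2",   # give up after 2 missed probes (~20s)
        f"{user}@{host_ip}",
        remote_cmd,
    ]


cmd = build_ethtool_ssh_command("10.10.10.3", "eth0")
print(" ".join(cmd))
```

Returning an argv list rather than a single string keeps the local side free of shell parsing; only the remote command string is interpreted by a shell, which is why the interface name is both allowlisted and quoted.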