gandalf

Author	SHA1	Message	Date
jared	ed5ba5c59e	Remove unused is_new parameter from ticket helper methods After fixing the is_new guard bug, is_new is no longer used inside _ticket_interface, _ticket_unifi, or _ticket_unreachable. Drop it from their signatures and call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-13 11:10:32 -04:00
jared	2be44d8b24	Fix ticket_id never stored when fail_thresh>1; guard sessionStorage JSON.parse monitor.py: _ticket_interface/_ticket_unifi/_ticket_unreachable all used `if tid and is_new` to guard db.set_ticket_id(). Since is_new is True only on the first upsert (consec=1) but tickets are created at consec>=fail_thresh (default 2), is_new is always False when the ticket is created, so the ticket link never appeared in the UI. Changed to `if tid:`. links.html: JSON.parse(sessionStorage.getItem(...)) in togglePanel and restoreCollapseState had no try-catch. Corrupt/stale session storage would throw an uncaught SyntaxError. Also wrapped all sessionStorage.setItem calls in try-catch to defend against storage-full / private-browsing errors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 23:45:20 -04:00
jared	1a53718cc5	fix: SSH shell quoting bug breaks ethtool collection; ticket_id KeyError monitor.py _ssh_batch(): the remote command was wrapped in double-quotes (f'root@{ip} "{shell_cmd}"') but shell_cmd itself contains double-quoted echo sentinels ("___IFACE:eth0___"). When Pulse's shell parses the full ssh invocation, the nested double-quotes cause mis-parsing — the remote command is split incorrectly, silently breaking all ethtool/SFP DOM collection. Fix: use shlex.quote(shell_cmd) so the entire remote command is single-quoted, leaving inner double-quotes untouched. TicketClient.create(): data['ticket_id'] raises KeyError if the Tinker Tickets API returns success=true without a ticket_id field (malformed response). Use data.get('ticket_id') with an explicit warning log. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 13:41:09 -04:00
jared	afaeb64636	fix: UTC timezone suffix missing from all isoformat() timestamp outputs db.py returned all datetime columns (first_seen, last_seen, resolved_at, created_at, expires_at) as bare ISO strings like "2026-03-14T14:14:21" with no timezone marker. Per the ECMAScript spec, new Date() on a datetime string without timezone treats it as LOCAL time, not UTC. This made lt.time.ago() and stale-detection wrong for any user whose browser is not in UTC — event ages and stale warnings would be off by the client's UTC offset. monitor.py had the same issue on the network_snapshot 'updated' field. Fix: append 'Z' to all isoformat() calls (UTC datetimes confirmed by MySQL server timezone and _now_utc() pattern used throughout codebase). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 13:28:49 -04:00
jared	61408645a5	fix: LLDP input validation, mgmt_ip early validation, poll timer cleanup, monitor backoff - app.py: validate server_name from LLDP with fullmatch before use in logs/lookups (prevents log injection) - app.py: validate each mgmt_ip candidate before assigning host_ip (avoids assigning non-IP string that then fails later check) - app.py: log actual exception in link_stats JSON parse error - inspector.html: clear _diagPollTimer in closePanel() so timer doesn't orphan when panel is closed mid-poll - monitor.py: sleep 30s after a monitor loop exception before resuming normal poll interval Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 08:45:28 -04:00
jared	25baec67ac	fix: diagnostic rate limiting, lock-held ownership check, iface name length cap - app.py: add per-user diagnostic rate limit (5/min) enforced atomically under _diag_lock - app.py: move diagnostic job ownership check inside _diag_lock to close TOCTOU window; snapshot result before releasing lock - monitor.py: cap interface name regex to 15 chars (Linux IFNAMSIZ limit) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 08:42:50 -04:00
jared	c71d0da97d	security: harden exception exposure, SSL config, and Pulse response parsing - app.py: replace raw str(e) in diagnostic _run() with generic client message; log internally only - app.py: /health endpoint no longer leaks exception strings to unauthenticated callers; errors logged server-side - monitor.py: UniFi SSL verification now defaults True, configurable via config.json unifi.verify_ssl; urllib3 warning suppression scoped to verify=False only (removed global disable) - monitor.py: Pulse execution_id extracted with .get() + explicit None check to avoid KeyError on malformed response - monitor.py: interface name regex drops '@' (not a valid kernel interface char) to match app.py and fix inconsistency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 08:40:25 -04:00
jared	38297e616f	arch+security: route all server contact through Pulse, harden SSH Architecture: - Remove direct subprocess ping from Gandalf; add PulseClient.ping() which runs the ping via the Pulse worker instead - Remove standalone ping() function and subprocess import from monitor.py - Add self.pulse alias to NetworkMonitor for convenience - Both _process_ping_hosts() and snapshot builder now use self.pulse.ping() Security: - Change StrictHostKeyChecking=no → accept-new in both SSH command builders (monitor.py _ssh_batch, diagnose.py build_ssh_command). The Pulse worker's known_hosts is now authoritative; host keys are recorded on first connection and verified on all subsequent ones. MITM attacks after initial key exchange are now detectable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 23:58:16 -04:00
jared	9d6583a08a	Add LDAP avatar photos, UX polish, and TDS component upgrades - Add /api/avatar endpoint querying lldap for user jpegPhoto; disk cache with sentinel pattern avoids repeat LDAP hits for users without photos - Add ldap3 dependency and ldap config block to config.json - Wire lt-avatar img overlay in base.html with capture-phase error fallback (lt-avatar-img-err) to reveal initials when image is absent - Fix lt-avatar CSS shim: position:relative + absolute inset on img (local base.css was missing these; added to style.css) - Replace all empty-state paragraphs with proper lt-empty-state markup (icon + title + body) across index, suppressions, inspector, app.js - Add lt-spinner--cyan next to refresh button; shows during refreshAll() - Replace inspector panel-section-title with lt-divider throughout - Add data-tooltip attributes to SFP DOM metrics, TX/RX/Carrier/Duplex/ Auto-neg/Error labels in links.html and inspector panel - Add tooltips to events table column headers (Sev, First Seen, Failures) - Fix links.html host panel timestamp (was reading sample.updated which is always undefined; now uses data.updated) - Fix UniFi status text casing (Online→ONLINE to match server render) - Remove dead topo-status-* class manipulation from updateTopology() - Always render alert-count-badge; toggle display:none when count is 0 - Fix double UniFi get_devices() call in monitor.py run loop - Fix chip-critical animation (was using green pulse-glow; now red) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 21:09:56 -04:00
jared	c45dd007d1	Fix field name mismatches, add events filter, in-place suppression refresh - links.html: fix all field name bugs (auto_negotiation→autoneg, full_duplex, tx/rx_errors/drops_per_sec→_rate, tx/rx_bytes_per_sec→_rate, poe_total_w/poe_max_w computed from ports, renderUnifiSwitches uses top-level updated timestamp) - suppressions.html: in-place DOM refresh after create/remove (no page reload), datalist autocomplete for target names, form reset after submit - inspector.html: ESC key closes detail panel via lt.keys.on - index.html: events filter bar with search input + severity pills (All/Critical/Warning), MutationObserver re-applies filter after dynamic updates - style.css: g-section-actions, events-filter-bar, sev-pills layout - app.js/db.py/monitor.py: carry forward prior session fixes (Promise.allSettled, daemon_ok, stale connection handling, double Prometheus call, self.cfg fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 23:35:02 -04:00
jared	271c3c4373	Exclude LXC IPs from link stats collection Add links_exclude_ips to monitor config; collect() skips any Prometheus instance whose IP is in that list, preventing LXC containers from appearing on the links/inspector pages as phantom hosts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 20:39:47 -04:00
jared	b80fda7cb2	Fix host filtering: only show/monitor configured hosts; add PBS - _collect_snapshot() and _process_interfaces() now skip any Prometheus instance not explicitly listed in config.json hosts[]. LXC app servers (postgresql, matrix, etc.) report node_exporter metrics but are not infrastructure hosts Gandalf should display or alert on. - Add PBS (10.10.10.3) to config hosts[] with prometheus_instance; remove from ping_hosts (node_exporter already running on PBS, now added to Prometheus scrape config as job pbs-node). - The _instance_map membership check is now consistent across snapshot, alerting, and ethtool SSH collection. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 17:17:40 -04:00
jared	eb8c0ded5e	Fix: only SSH into explicitly configured hosts for ethtool collection LinkStatsCollector.collect() was SSHing into every host reporting node_network_* metrics to Prometheus, including unrelated app servers like postgresql and matrix. Add instance_map membership check so ethtool collection via Pulse only runs on hosts defined in config.json. Prometheus metrics (traffic rates, errors) are still collected for all instances — only the SSH/ethtool step is gated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 18:35:21 -04:00
jared	b29b70d88b	Improve Pulse execution reliability: retry logic, better logging, SSH hardening monitor.py / diagnose.py PulseClient.run_command: - Add automatic single retry on submit failure, explicit Pulse failure (status=failed/timed_out), and poll timeout — handles transient SSH or Pulse hiccups without dropping the whole collection cycle - Log execution_id and full Pulse URL on every failure so failed runs can be found in the Pulse UI immediately - Handle 'timed_out' and 'cancelled' Pulse statuses explicitly (previously only 'failed' was caught; others would spin until local deadline) - Poll every 2s instead of 1s to reduce Pulse API chatter SSH command options (_ssh_batch + diagnose.py): - Add BatchMode=yes: aborts immediately instead of hanging on a password prompt if key auth fails - Add ServerAliveInterval=10 ServerAliveCountMax=2: SSH detects a hung remote command within ~20s instead of sitting silent until the 45s Pulse timeout expires Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-15 09:19:07 -04:00
jared	17d3b7d227	New features: stale banner, tab title alerts, health checks, DB housekeeping static/app.js: - Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF' - Stale monitoring banner: injected into .main if last_check > 15 min old, warns operator that the monitor daemon may be down static/style.css: - .stale-banner: amber top-border warning strip app.py: - /health now checks DB connectivity and monitor freshness (last_check age) Returns 503 + degraded status if DB unreachable or monitor stale >20min db.py: - cleanup_expired_suppressions(): marks time-limited suppressions inactive when expires_at <= NOW() (was only filtered in SELECTs, never marked inactive) - purge_old_resolved_events(days=90): deletes old resolved events to prevent unbounded table growth monitor.py: - Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:35:32 -04:00
jared	14eaa6a8c9	De-hardcode ticket URL and cluster name; improve diagnostic polling UX app.py: - Context processor injects config.ticket_api.web_url into all templates (falls back to 'http://t.lotusguild.org/ticket/' if not set in config) templates/base.html: - Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads static/app.js: - Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain templates/index.html: - Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain monitor.py: - CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name from config monitor.cluster_name, falling back to the constant - All CLUSTER_NAME references inside class methods replaced with self.cluster_name templates/inspector.html: - pollDiagnostic() .catch() now clears interval and shows error message instead of silently ignoring network failures during active polling Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:31:57 -04:00
jared	85a018ff6c	Optimize suppression checks: load once per cycle, add error logging db.py: - Add check_suppressed(suppressions, ...) for in-memory suppression lookups against pre-loaded list (eliminates N*M DB queries per monitoring cycle) - get_baseline(): log error instead of silently swallowing JSON parse failure monitor.py: - Load active suppressions once per cycle at the top of the alert loop - Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts - Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...) - Reduces DB queries from 100-600+ per cycle down to 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:13:54 -04:00
jared	0335845101	Security and reliability fixes: input validation, logging, job cleanup - C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder - H6: Upgrade Pulse failure logging from debug to error so operators see outages - M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min) - M7: Background thread marks jobs stuck in 'running' > 5 min as errored Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-12 17:30:50 -04:00
jared	b1dd5f9cad	feat: deep link diagnostics via Pulse SSH Adds comprehensive per-port link troubleshooting triggered from the Inspector panel when a port has an LLDP-identified server counterpart. - diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier, operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link, ip addr, ip route, dmesg, lldpctl); parsers for all sections; health analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH, SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.) - monitor.py: PulseClient now tracks last_execution_id so callers can link back to the raw Pulse execution URL - app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon thread background execution and 10-minute in-memory job store - inspector.html: "Run Link Diagnostics" button (shown only when LLDP host is resolvable); full results panel: health banner, physical layer, SFP/DOM with power bars, NIC error counters, collapsible ethtool -S, flow control/ring buffers, driver info, LLDP 2-col validation, collapsible dmesg, switch port summary, "View in Pulse" link - style.css: all .diag-* CSS classes with terminal aesthetic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 16:03:54 -05:00
jared	0278dad502	feat: inspector page, link debug enhancements, security hardening - Add /inspector page: visual model-accurate switch chassis diagrams (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks with color coding (green=up, amber=PoE, cyan=uplink, grey=down), detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side - Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max, collapsible host/switch panels with sessionStorage persistence - monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch port; PulseClient uses requests.Session() for HTTP keep-alive; add shlex.quote() around interface names (defense-in-depth) - Security: suppress buttons use data-* attrs + delegated click handler instead of inline onclick with Jinja2 variable interpolation; remove | safe filter from user-controlled fields in suppressions.html; setDuration() takes explicit el param instead of implicit event global - db.py: thread-local connection reuse with ping(reconnect=True) to avoid a new TCP handshake per query - .gitignore: add config.json (contains credentials), __pycache__ - README: full rewrite covering architecture, all 4 pages, alert logic, config reference, deployment, troubleshooting, security notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:39:48 -05:00
jared	fa7512a2c2	feat: terminal aesthetic rewrite + link debug page - Full dark terminal aesthetic (Pulse/TinkerTickets style): - #0a0a0a background, #00ff41 green, #ffb000 amber, #00ffff cyan - CRT scanline overlay, phosphor glow, ASCII corner pseudoelements - Bracket-notation badges [CRITICAL], monospace font throughout - style.css, base.html, index.html, suppressions.html all rewritten - New Link Debug page (/links, /api/links): - Per-host, per-interface cards with speed/duplex/port type/auto-neg - Traffic bars (TX cyan, RX green) with rate labels - Error/drop counters, carrier change history - SFP/DOM optical panel: vendor, temp, voltage, bias, TX/RX power dBm bars - RX-TX delta shown; color-coded warn/crit thresholds - Auto-refresh every 60s, anchor-jump to #hostname - LinkStatsCollector in monitor.py: - SSHes to each host (one connection, all ifaces batched) - Parses ethtool + ethtool -m (SFP DOM) output - Merges with Prometheus traffic/error/carrier metrics - Stores as link_stats in monitor_state table - config.json: added ssh section for ethtool collection - app.js: terminal chip style consistency (uppercase, ● bullet) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-02 12:43:11 -05:00
jared	0c0150f698	Complete rewrite: full-featured network monitoring dashboard - Two-service architecture: Flask web app (gandalf.service) + background polling daemon (gandalf-monitor.service) - Monitor polls Prometheus node_network_up for physical NIC states on all 6 hypervisors (added storage-01 at 10.10.10.11:9100) - UniFi API monitoring for switches, APs, and gateway device status - Ping reachability for hosts without node_exporter (pbs only now) - Smart baseline: interfaces first seen as down are never alerted on; only UP→DOWN regressions trigger tickets - Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous interface regressions (guards against false positives on startup) - Tinker Tickets integration with 24-hour hash-based deduplication - Alert suppression: manual toggle or timed windows (30m/1h/4h/8h) - Authelia SSO via forward-auth headers, admin group required - Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) → PoE Switch (10G DAC) → Hosts - MariaDB schema, suppression management UI, host/interface cards Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 23:03:18 -05:00
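
The quoting fix described in commit 1a53718cc5 can be illustrated with a minimal sketch. This is not the code from monitor.py: the helper names `naive_remote_command` and `quoted_remote_command`, the IP address, and the sample remote command are assumptions made for illustration. `shlex.split` stands in for the parsing a POSIX shell would do on the full ssh invocation.

```python
import shlex


def naive_remote_command(ip: str, shell_cmd: str) -> str:
    """Hypothetical pre-fix form: wrap the remote command in double quotes."""
    return f'ssh root@{ip} "{shell_cmd}"'


def quoted_remote_command(ip: str, shell_cmd: str) -> str:
    """Hypothetical post-fix form: let shlex.quote single-quote the command."""
    return f"ssh root@{ip} {shlex.quote(shell_cmd)}"


# Sample remote command containing a double-quoted sentinel (illustrative only).
remote = 'echo "___IFACE:eth0___"; ethtool eth0'

# shlex.split approximates how a POSIX shell tokenizes each invocation.
naive_parts = shlex.split(naive_remote_command("10.10.10.2", remote))
quoted_parts = shlex.split(quoted_remote_command("10.10.10.2", remote))

# With nested double quotes the inner quotes are consumed by the local shell,
# so the remote side no longer receives the command as written.
print(naive_parts[-1] == remote)
# With shlex.quote the remote command arrives as one unchanged argument.
print(quoted_parts[-1] == remote)
```

Because `shlex.quote` wraps the whole string in single quotes and escapes any embedded single quotes, the same helper stays correct if the remote command later gains other shell metacharacters.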

22 Commits
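
The timestamp fix in commit afaeb64636 (appending 'Z' to isoformat() output) can be sketched as follows. The helper name `to_utc_iso` is hypothetical and not taken from db.py. The sketch assumes, as that commit message states for the codebase, that naive datetimes already hold UTC values.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt: datetime) -> str:
    """Serialize a datetime as ISO 8601 with an explicit 'Z' (UTC) suffix."""
    if dt.tzinfo is not None:
        # Convert aware datetimes to UTC, then drop tzinfo so isoformat()
        # does not emit '+00:00' in addition to the 'Z' appended below.
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    # Naive datetimes are ASSUMED to already be in UTC.
    return dt.isoformat() + "Z"


# A naive UTC value, such as one read from a database column.
naive = datetime(2026, 3, 14, 14, 14, 21)
# The same instant expressed with a -04:00 offset.
aware = datetime(2026, 3, 14, 10, 14, 21, tzinfo=timezone(timedelta(hours=-4)))

print(to_utc_iso(naive))  # 2026-03-14T14:14:21Z
print(to_utc_iso(aware))  # 2026-03-14T14:14:21Z
```

With the suffix present, a browser's `new Date()` parses the string as UTC rather than local time, which is the behavior the commit message relies on.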