gandalf

Author	SHA1	Message	Date
jared	9c5a88fbce	Guard ticket creation against duplicates using event's existing ticket_id upsert_event now returns ticket_id (4th element) so callers can skip ticket creation when one already exists. This prevents calling the ticket API every poll cycle for ongoing issues while still retrying if the previous creation attempt failed (ticket_id stays NULL until success). Cluster events use (is_new or not ticket_id) so they too get retried on failure rather than relying solely on is_new. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-14 11:09:50 -04:00
jared	31747c4bd3	Replace deprecated datetime.utcnow() with datetime.now(timezone.utc) datetime.utcnow() is deprecated as of Python 3.12 and slated for removal in a future release. Replace all four call sites with timezone-aware equivalents so the codebase is ready for Python 3.12+. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-13 15:34:41 -04:00
jared	afaeb64636	fix: UTC timezone suffix missing from all isoformat() timestamp outputs db.py returned all datetime columns (first_seen, last_seen, resolved_at, created_at, expires_at) as bare ISO strings like "2026-03-14T14:14:21" with no timezone marker. Per the ECMAScript spec, new Date() on a datetime string without timezone treats it as LOCAL time, not UTC. This made lt.time.ago() and stale-detection wrong for any user whose browser is not in UTC — event ages and stale warnings would be off by the client's UTC offset. monitor.py had the same issue on the network_snapshot 'updated' field. Fix: append 'Z' to all isoformat() calls (UTC datetimes confirmed by MySQL server timezone and _now_utc() pattern used throughout codebase). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 13:28:49 -04:00
jared	cd0b725f3e	fix: LLDP port label bug, suppression SQL dead code, avatar path hardening - inspector.html: fix LLDP neighbor label in port blocks: port.lldp_table never exists; data is at port.lldp (dict with system_name/chassis_id); both port block renderers corrected - db.py: remove dead 'target_detail IS NULL' branch in suppression check; target_detail is always stored as '' not NULL; query simplified to target_detail='' - app.py: resolve cache_dir/cache_file/sentinel to absolute paths; guard against path escape before use - app.py: wrap sentinel os.path.getmtime() in try/except OSError to handle TOCTOU deletion race Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 09:31:25 -04:00
jared	40a0c2af78	Dynamic resolved count, host search filter, lt-divider for UniFi section - db.py: add resolved_24h to get_status_summary() so each /api/status poll carries the fresh 24h resolved count - app.js: wire stat-resolved-val to update from summary.resolved_24h so the Resolved 24h card stays accurate after auto-refresh - index.html: add lt-toolbar/lt-search above host grid for quick client-side host filtering by name - links.html: replace custom unifi-section-header div with lt-divider - style.css: remove unused .unifi-section-header rules Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 18:36:57 -04:00
jared	c45dd007d1	Fix field name mismatches, add events filter, in-place suppression refresh - links.html: fix all field name bugs (auto_negotiation→autoneg, full_duplex, tx/rx_errors/drops_per_sec→_rate, tx/rx_bytes_per_sec→_rate, poe_total_w/poe_max_w computed from ports, renderUnifiSwitches uses top-level updated timestamp) - suppressions.html: in-place DOM refresh after create/remove (no page reload), datalist autocomplete for target names, form reset after submit - inspector.html: ESC key closes detail panel via lt.keys.on - index.html: events filter bar with search input + severity pills (All/Critical/Warning), MutationObserver re-applies filter after dynamic updates - style.css: g-section-actions, events-filter-bar, sev-pills layout - app.js/db.py/monitor.py: carry forward prior session fixes (Promise.allSettled, daemon_ok, stale connection handling, double Prometheus call, self.cfg fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-19 23:35:02 -04:00
jared	e2b65db2fc	Add pagination to event queries, input validation, daily event purge - get_active_events() now takes limit/offset (default 200) to cap unbounded queries - count_active_events() added to return total for pagination display - /api/events supports ?limit=, ?offset=, ?status= query params (max 1000) - /api/status includes total_active count alongside paginated events list - index() route passes total_active to template for server-side truncation notice - Show "Showing X of Y" notice in dashboard when events are truncated - Suppression POST validates: reason ≤500 chars, target_name/detail ≤255 chars - _purge_old_jobs_loop runs purge_old_resolved_events(90d) once per day Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-17 20:32:32 -04:00
jared	17d3b7d227	New features: stale banner, tab title alerts, health checks, DB housekeeping static/app.js: - Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF' - Stale monitoring banner: injected into .main if last_check > 15 min old, warns operator that the monitor daemon may be down static/style.css: - .stale-banner: amber top-border warning strip app.py: - /health now checks DB connectivity and monitor freshness (last_check age) Returns 503 + degraded status if DB unreachable or monitor stale >20min db.py: - cleanup_expired_suppressions(): marks time-limited suppressions inactive when expires_at <= NOW() (was only filtered in SELECTs, never marked inactive) - purge_old_resolved_events(days=90): deletes old resolved events to prevent unbounded table growth monitor.py: - Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 21:35:32 -04:00
jared	85a018ff6c	Optimize suppression checks: load once per cycle, add error logging db.py: - Add check_suppressed(suppressions, ...) for in-memory suppression lookups against pre-loaded list (eliminates N*M DB queries per monitoring cycle) - get_baseline(): log error instead of silently swallowing JSON parse failure monitor.py: - Load active suppressions once per cycle at the top of the alert loop - Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts - Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...) - Reduces DB queries from 100-600+ per cycle down to 1 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-14 14:13:54 -04:00
jared	0278dad502	feat: inspector page, link debug enhancements, security hardening - Add /inspector page: visual model-accurate switch chassis diagrams (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks with color coding (green=up, amber=PoE, cyan=uplink, grey=down), detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side - Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max, collapsible host/switch panels with sessionStorage persistence - monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch port; PulseClient uses requests.Session() for HTTP keep-alive; add shlex.quote() around interface names (defense-in-depth) - Security: suppress buttons use data-* attrs + delegated click handler instead of inline onclick with Jinja2 variable interpolation; remove \| safe filter from user-controlled fields in suppressions.html; setDuration() takes explicit el param instead of implicit event global - db.py: thread-local connection reuse with ping(reconnect=True) to avoid a new TCP handshake per query - .gitignore: add config.json (contains credentials), __pycache__ - README: full rewrite covering architecture, all 4 pages, alert logic, config reference, deployment, troubleshooting, security notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 15:39:48 -05:00
jared	0c0150f698	Complete rewrite: full-featured network monitoring dashboard - Two-service architecture: Flask web app (gandalf.service) + background polling daemon (gandalf-monitor.service) - Monitor polls Prometheus node_network_up for physical NIC states on all 6 hypervisors (added storage-01 at 10.10.10.11:9100) - UniFi API monitoring for switches, APs, and gateway device status - Ping reachability for hosts without node_exporter (pbs only now) - Smart baseline: interfaces first seen as down are never alerted on; only UP→DOWN regressions trigger tickets - Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous interface regressions (guards against false positives on startup) - Tinker Tickets integration with 24-hour hash-based deduplication - Alert suppression: manual toggle or timed windows (30m/1h/4h/8h) - Authelia SSO via forward-auth headers, admin group required - Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) → PoE Switch (10G DAC) → Hosts - MariaDB schema, suppression management UI, host/interface cards Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-01 23:03:18 -05:00
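
Commits 31747c4bd3 and afaeb64636 both concern UTC timestamps: the first swaps datetime.utcnow() for a timezone-aware call, the second makes serialized timestamps carry an explicit UTC marker so that the browser's new Date() does not read them as local time. A minimal sketch of how the two ideas might combine is shown here. The helper names now_utc and to_iso_z are hypothetical and are not taken from the repository; the sketch also assumes, as the commit message states, that naive datetimes coming out of the database are already in UTC.

```python
from datetime import datetime, timezone


def now_utc() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


def to_iso_z(dt: datetime) -> str:
    """Serialize a datetime as an ISO 8601 string ending in 'Z'.

    Assumption: a naive datetime (no tzinfo) is already in UTC, which is
    what the commit message says about values read from the database.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    else:
        dt = dt.astimezone(timezone.utc)
    # isoformat() renders UTC as '+00:00'; swap that for the shorter 'Z'.
    return dt.isoformat().replace("+00:00", "Z")


if __name__ == "__main__":
    print(to_iso_z(datetime(2026, 3, 14, 14, 14, 21)))
    print(to_iso_z(now_utc()))
```

The repository itself appends 'Z' directly to isoformat() output according to the commit message; normalizing through astimezone() is a variation chosen here so that aware and naive inputs produce the same shape of string.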

11 Commits
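
Commit 9c5a88fbce describes skipping ticket creation when an event already has a ticket_id, while retrying on later poll cycles if the previous attempt failed (ticket_id stays NULL until success). The sketch below illustrates that guard under simplifying assumptions: the events table is replaced by an in-memory dict, and ensure_ticket and create_ticket are hypothetical names, not the repository's upsert_event or its ticket API client.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for the events table: event key -> ticket_id.
# None plays the role of a NULL ticket_id (no ticket created yet).
_events: Dict[str, Optional[str]] = {}


def ensure_ticket(
    event_key: str,
    create_ticket: Callable[[str], str],
) -> Optional[str]:
    """Return the event's ticket id, creating a ticket only if none exists.

    create_ticket is an assumed callable that returns a ticket id or raises
    on failure. A failure leaves the stored value as None, so the next poll
    cycle tries again.
    """
    ticket_id = _events.setdefault(event_key, None)
    if ticket_id:
        # A ticket already exists for this ongoing event: skip the API call.
        return ticket_id
    try:
        ticket_id = create_ticket(event_key)
    except Exception:
        # Broad catch for the sketch only; real code would log the error.
        ticket_id = None
    _events[event_key] = ticket_id
    return ticket_id
```

Calling ensure_ticket once per poll cycle for an ongoing event would reach the ticket API only until the first successful creation; after that the stored id short-circuits the call.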