Commit Graph

11 Commits

Author SHA1 Message Date
jared 9c5a88fbce Guard ticket creation against duplicates using event's existing ticket_id
Lint / Python (flake8) (push) Successful in 41s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 40s
Test / Python Tests (pytest) (push) Successful in 1m18s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 4s
upsert_event now returns ticket_id (4th element) so callers can skip
ticket creation when one already exists. This prevents calling the ticket
API every poll cycle for ongoing issues while still retrying if the
previous creation attempt failed (ticket_id stays NULL until success).

Cluster events use (is_new or not ticket_id) so they too get retried
on failure rather than relying solely on is_new.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 11:09:50 -04:00
jared 31747c4bd3 Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)
Lint / Python (flake8) (push) Successful in 1m9s
Lint / JS (eslint) (push) Successful in 11s
Security / Python Security (bandit) (push) Successful in 44s
Test / Python Tests (pytest) (push) Successful in 58s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 3s
datetime.utcnow() is deprecated in Python 3.12 and removed in 3.13.
Replace all four call sites with timezone-aware equivalents so the
codebase is ready for Python 3.12+.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 15:34:41 -04:00
jared afaeb64636 fix: UTC timezone suffix missing from all isoformat() timestamp outputs
db.py returned all datetime columns (first_seen, last_seen, resolved_at,
created_at, expires_at) as bare ISO strings like "2026-03-14T14:14:21"
with no timezone marker. Per the ECMAScript spec, new Date() on a
datetime string without timezone treats it as LOCAL time, not UTC.
This made lt.time.ago() and stale-detection wrong for any user whose
browser is not in UTC — event ages and stale warnings would be off by
the client's UTC offset.

monitor.py had the same issue on the network_snapshot 'updated' field.

Fix: append 'Z' to all isoformat() calls (UTC datetimes confirmed by
MySQL server timezone and _now_utc() pattern used throughout codebase).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 13:28:49 -04:00
jared cd0b725f3e fix: LLDP port label bug, suppression SQL dead code, avatar path hardening
Lint / Python (flake8) (push) Successful in 1m13s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 42s
Test / Python Tests (pytest) (push) Successful in 50s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 3s
- inspector.html: fix LLDP neighbor label in port blocks — port.lldp_table never exists; data is at port.lldp (dict with system_name/chassis_id); both port block renderers corrected
- db.py: remove dead 'target_detail IS NULL' branch in suppression check — target_detail is always stored as '' not NULL; query simplified to target_detail=''
- app.py: resolve cache_dir/cache_file/sentinel to absolute paths; guard against path escape before use
- app.py: wrap sentinel os.path.getmtime() in try/except OSError to handle TOCTOU deletion race

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 09:31:25 -04:00
jared 40a0c2af78 Dynamic resolved count, host search filter, lt-divider for UniFi section
Lint / Python (flake8) (push) Successful in 38s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 38s
Test / Python Tests (pytest) (push) Successful in 50s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 3s
- db.py: add resolved_24h to get_status_summary() so each /api/status
  poll carries the fresh 24h resolved count
- app.js: wire stat-resolved-val to update from summary.resolved_24h
  so the Resolved 24h card stays accurate after auto-refresh
- index.html: add lt-toolbar/lt-search above host grid for quick
  client-side host filtering by name
- links.html: replace custom unifi-section-header div with lt-divider
- style.css: remove unused .unifi-section-header rules

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 18:36:57 -04:00
jared c45dd007d1 Fix field name mismatches, add events filter, in-place suppression refresh
Lint / Python (flake8) (push) Failing after 50s
Lint / JS (eslint) (push) Successful in 7s
Test / Python Tests (pytest) (push) Successful in 51s
Lint / Notify on failure (push) Successful in 2s
Lint / Deploy (push) Has been skipped
Security / Python Security (bandit) (push) Failing after 59s
- links.html: fix all field name bugs (auto_negotiation→autoneg, full_duplex,
  tx/rx_errors/drops_per_sec→_rate, tx/rx_bytes_per_sec→_rate, poe_total_w/poe_max_w
  computed from ports, renderUnifiSwitches uses top-level updated timestamp)
- suppressions.html: in-place DOM refresh after create/remove (no page reload),
  datalist autocomplete for target names, form reset after submit
- inspector.html: ESC key closes detail panel via lt.keys.on
- index.html: events filter bar with search input + severity pills (All/Critical/Warning),
  MutationObserver re-applies filter after dynamic updates
- style.css: g-section-actions, events-filter-bar, sev-pills layout
- app.js/db.py/monitor.py: carry forward prior session fixes (Promise.allSettled,
  daemon_ok, stale connection handling, double Prometheus call, self.cfg fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 23:35:02 -04:00
jared e2b65db2fc Add pagination to event queries, input validation, daily event purge
- get_active_events() now takes limit/offset (default 200) to cap unbounded queries
- count_active_events() added to return total for pagination display
- /api/events supports ?limit=, ?offset=, ?status= query params (max 1000)
- /api/status includes total_active count alongside paginated events list
- index() route passes total_active to template for server-side truncation notice
- Show "Showing X of Y" notice in dashboard when events are truncated
- Suppression POST validates: reason ≤500 chars, target_name/detail ≤255 chars
- _purge_old_jobs_loop runs purge_old_resolved_events(90d) once per day

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 20:32:32 -04:00
jared 17d3b7d227 New features: stale banner, tab title alerts, health checks, DB housekeeping
static/app.js:
- Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF'
- Stale monitoring banner: injected into .main if last_check > 15 min old,
  warns operator that the monitor daemon may be down

static/style.css:
- .stale-banner: amber top-border warning strip

app.py:
- /health now checks DB connectivity and monitor freshness (last_check age)
  Returns 503 + degraded status if DB unreachable or monitor stale >20min

db.py:
- cleanup_expired_suppressions(): marks time-limited suppressions inactive when
  expires_at <= NOW() (was only filtered in SELECTs, never marked inactive)
- purge_old_resolved_events(days=90): deletes old resolved events to prevent
  unbounded table growth

monitor.py:
- Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 21:35:32 -04:00
jared 85a018ff6c Optimize suppression checks: load once per cycle, add error logging
db.py:
- Add check_suppressed(suppressions, ...) for in-memory suppression lookups
  against pre-loaded list (eliminates N*M DB queries per monitoring cycle)
- get_baseline(): log error instead of silently swallowing JSON parse failure

monitor.py:
- Load active suppressions once per cycle at the top of the alert loop
- Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts
- Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...)
- Reduces DB queries from 100-600+ per cycle down to 1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 14:13:54 -04:00
jared 0278dad502 feat: inspector page, link debug enhancements, security hardening
- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:39:48 -05:00
jared 0c0150f698 Complete rewrite: full-featured network monitoring dashboard
- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 23:03:18 -05:00