db.py:
- Add check_suppressed(suppressions, ...) for in-memory suppression lookups
against pre-loaded list (eliminates N*M DB queries per monitoring cycle)
- get_baseline(): log error instead of silently swallowing JSON parse failure
monitor.py:
- Load active suppressions once per cycle at the top of the alert loop
- Pass suppressions list to _process_interfaces, _process_unifi, _process_ping_hosts
- Replace all db.is_suppressed() calls with db.check_suppressed(suppressions, ...)
- Reduces DB queries from 100-600+ per cycle down to 1
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two-service architecture: Flask web app (gandalf.service) + background
polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>