Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability checks for hosts without node_exporter (currently only pbs)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards
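
The baseline, cluster-wide, and deduplication rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in monitor.py: the function names, the `(host, interface)` key layout, the SHA-256 fingerprint, and the in-memory `seen` dict are all hypothetical.

```python
import hashlib
from typing import Dict, Set, Tuple

Key = Tuple[str, str]            # (host, interface); hypothetical layout
CLUSTER_THRESHOLD = 3            # "3+ hosts" from the commit message
DEDUP_WINDOW_S = 24 * 3600       # 24-hour deduplication window


def find_regressions(baseline: Dict[Key, bool], current: Dict[Key, bool]) -> Set[Key]:
    """Return interfaces whose first-seen state was up and that are down now."""
    regressions = set()
    for key, is_up in current.items():
        # setdefault records the state only on first sighting, so an
        # interface first seen as down never counts as a regression.
        first_seen_up = baseline.setdefault(key, is_up)
        if first_seen_up and not is_up:
            regressions.add(key)
    return regressions


def is_cluster_event(regressions: Set[Key]) -> bool:
    """True when regressions span at least CLUSTER_THRESHOLD distinct hosts."""
    return len({host for host, _ in regressions}) >= CLUSTER_THRESHOLD


def should_open_ticket(seen: Dict[str, float], key: Key, now: float) -> bool:
    """Suppress a ticket if the same fingerprint was sent within the window."""
    digest = hashlib.sha256("|".join(key).encode()).hexdigest()
    last = seen.get(digest)
    if last is not None and now - last < DEDUP_WINDOW_S:
        return False
    seen[digest] = now
    return True
```

Because the baseline is filled on the first poll, a freshly started monitor reports no regressions even if some interfaces are already down, which is the startup false-positive guard the message describes.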

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 23:03:18 -05:00
parent 4ed5ecacbb
commit 0c0150f698
13 changed files with 2787 additions and 512 deletions

gandalf-monitor.service Normal file

@@ -0,0 +1,22 @@
[Unit]
Description=Gandalf Network Monitor Daemon
Documentation=https://gitea.lotusguild.org/LotusGuild/gandalf
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=www-data
WorkingDirectory=/var/www/html/prod
ExecStart=/usr/bin/python3 /var/www/html/prod/monitor.py
Restart=on-failure
RestartSec=30
TimeoutStopSec=10
# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=gandalf-monitor
[Install]
WantedBy=multi-user.target
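
With `Type=simple` and `TimeoutStopSec=10`, systemd by default sends SIGTERM on stop and escalates to SIGKILL if the process has not exited after 10 seconds, so the daemon's polling loop needs to wake up promptly. A minimal sketch of such a loop follows; the real monitor.py is not shown in this file, and `run_loop`, `poll_once`, and the 30-second interval are hypothetical names and values chosen for illustration.

```python
import signal
import threading
from typing import Callable

# Set when the process is asked to stop; the loop checks it every cycle.
stop_event = threading.Event()


def _handle_sigterm(signum, frame):
    """Signal handler: request a clean shutdown instead of dying mid-poll."""
    stop_event.set()


def install_signal_handlers() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _handle_sigterm)


def run_loop(poll_once: Callable[[], None], interval_s: float = 30.0) -> int:
    """Call poll_once until stop_event is set; return the number of cycles run."""
    cycles = 0
    while not stop_event.is_set():
        poll_once()
        cycles += 1
        # Event.wait returns as soon as the event is set, so a stop request
        # interrupts the sleep rather than waiting out the full interval.
        stop_event.wait(interval_s)
    return cycles
```

A daemon entry point would call `install_signal_handlers()` once and then `run_loop(...)` with its own polling function. Sleeping on the event rather than `time.sleep` keeps shutdown well inside the 10-second stop timeout, provided a single poll cycle is itself short.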