# GANDALF (Global Advanced Network Detection And Link Facilitator)

*Because it shall not let problems pass!*

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.
## Architecture

Gandalf is two processes that share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090] ──▶
                       monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶
```
## Data Sources

| Source | What it monitors |
|---|---|
| Prometheus (10.10.10.48:9090) | Physical NIC link state (`node_network_up`) for 5 Proxmox hypervisors |
| UniFi API (https://10.10.10.1) | Switch, AP, and gateway device status |
| Ping | pbs (10.10.10.3) and storage-01 (10.10.10.11), which lack node_exporter |
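The Prometheus side of the polling loop boils down to one instant query plus response parsing. A hedged sketch, assuming the stock Prometheus HTTP API; the device regex and all function names here are illustrative, not lifted from monitor.py:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://10.10.10.48:9090"  # prometheus.url from config.json

# Assumed filter for physical NICs (enp*/eno*/eth*); bridges and bonds skipped.
QUERY = 'node_network_up{device=~"en.*|eth.*"}'

def build_query_url(base: str, query: str) -> str:
    """Build the Prometheus instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": query})

def parse_link_states(payload: dict) -> dict:
    """Map (instance, device) -> link-up bool from a /api/v1/query response."""
    states = {}
    for result in payload.get("data", {}).get("result", []):
        labels = result["metric"]
        # value is [timestamp, "0" | "1"]; "1" means the link is up
        states[(labels["instance"], labels["device"])] = result["value"][1] == "1"
    return states

def fetch_link_states() -> dict:
    """One poll: query Prometheus and return current link states."""
    with urllib.request.urlopen(build_query_url(PROM_URL, QUERY), timeout=10) as resp:
        return parse_link_states(json.load(resp))
```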
## Monitored Hosts (Prometheus / node_exporter)
| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
## Features
- Interface monitoring – tracks link state for all physical NICs via Prometheus
- UniFi device monitoring – detects offline switches, APs, and gateways
- Ping reachability – covers hosts without node_exporter
- Cluster-wide detection – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- Smart baseline tracking – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- Ticket creation – integrates with Tinker Tickets (t.lotusguild.org) with 24-hour deduplication
- Alert suppression – manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
- Authelia SSO – restricted to the `admin` group via forward-auth headers
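The smart baseline rule above can be sketched as a small state machine. This is a minimal illustration, not the actual monitor.py code; the state names (`"initial_down"` is the only one taken from this README) and return labels are assumptions:

```python
def classify(device_key, is_up, baseline):
    """Classify one interface observation against the stored baseline.

    baseline maps (host, device) -> "up" | "initial_down".
    Interfaces first seen down are recorded as "initial_down" and never
    alerted on; only an UP -> DOWN transition counts as a regression.
    """
    if device_key not in baseline:
        # First observation establishes the baseline; never alert on it.
        baseline[device_key] = "up" if is_up else "initial_down"
        return "new"
    if baseline[device_key] == "initial_down":
        if is_up:
            baseline[device_key] = "up"   # start tracking once it comes up
            return "recovered"
        return "ignored"                  # unused port stays silent
    return "ok" if is_up else "regression"
```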
## Alert Logic

### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
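The trigger thresholds and the 24-hour deduplication can be sketched as below. The thresholds mirror the config defaults in this README; the hash choice (SHA-256), the in-memory `recent` map, and all function names are assumptions — the real daemon persists this state in MariaDB:

```python
import hashlib
import time

FAILURE_THRESHOLD = 2      # monitor.failure_threshold default
CLUSTER_THRESHOLD = 3      # monitor.cluster_threshold default
DEDUP_WINDOW = 24 * 3600   # 24-hour ticket deduplication

def should_ticket(consecutive_failures: int) -> bool:
    """P2 tickets fire only after 2+ consecutive failed checks."""
    return consecutive_failures >= FAILURE_THRESHOLD

def is_cluster_event(hosts_with_regressions: set) -> bool:
    """3+ hosts failing at once points at shared infrastructure (P1)."""
    return len(hosts_with_regressions) >= CLUSTER_THRESHOLD

def dedup_key(subject: str) -> str:
    """Hash identifying a ticket subject for deduplication."""
    return hashlib.sha256(subject.encode()).hexdigest()

def is_duplicate(subject: str, recent: dict, now=None) -> bool:
    """True if an identical ticket was created within the last 24 hours."""
    now = now if now is not None else time.time()
    key = dedup_key(subject)
    last = recent.get(key)
    if last is not None and now - last < DEDUP_WINDOW:
        return True
    recent[key] = now
    return False
```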
### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
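Matching an alert against the four suppression types might look like the sketch below. The dict shapes and field names (`expires_at`, the tuple target for `interface`) are assumptions for illustration, not the actual schema:

```python
import time

def is_suppressed(alert: dict, suppressions: list, now=None) -> bool:
    """Return True if any active suppression covers this alert.

    alert: {"host": ..., "interface": ..., "unifi_device": ...} (any may be None).
    suppressions: dicts with "type", "target", and "expires_at"
    (None means a manual suppression that persists until removed).
    """
    now = now if now is not None else time.time()
    for s in suppressions:
        if s["expires_at"] is not None and s["expires_at"] <= now:
            continue  # timed window already expired
        if s["type"] == "all":
            return True  # global maintenance mode
        if s["type"] == "host" and s["target"] == alert.get("host"):
            return True
        if s["type"] == "interface" and s["target"] == (alert.get("host"),
                                                        alert.get("interface")):
            return True
        if s["type"] == "unifi_device" and s["target"] == alert.get("unifi_device"):
            return True
    return False
```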
## Configuration

`config.json` – shared by both processes:

| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from the controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger a cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
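A hypothetical `config.json` skeleton assembled from the key table above. Only the key names and the addresses come from this README; the nesting, extra fields (`unifi.url`, `ticket_api.url`, `database.name`), and the shape of `ping_hosts` are assumptions — check the shipped config.json for the real structure:

```json
{
  "unifi": {
    "url": "https://10.10.10.1",
    "api_key": "REPLACE_ME"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "database": {
    "host": "10.10.10.50",
    "user": "gandalf",
    "password": "REPLACE_ME",
    "name": "gandalf"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org",
    "api_key": "REPLACE_ME"
  },
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": {
      "pbs": "10.10.10.3",
      "storage-01": "10.10.10.11"
    }
  },
  "hosts": {
    "10.10.10.2:9100": "large1",
    "10.10.10.4:9100": "compute-storage-01",
    "10.10.10.8:9100": "micro1",
    "10.10.10.9:9100": "monitor-02",
    "10.10.10.10:9100": "compute-storage-gpu-01"
  }
}
```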
## Deployment (LXC 157)

### 1. Database (MariaDB LXC 149 at 10.10.10.50)
```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```
Then import the schema:
```sh
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. LXC 157 – Install dependencies

```sh
pip3 install -r requirements.txt
```
### 3. Deploy files

```sh
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```

(`-r` is needed because `templates/` and `static/` are directories.)
### 4. Configure secrets in config.json

- `database.password` – set the gandalf DB password
- `ticket_api.api_key` – copy from the Tinker Tickets admin panel
### 5. Install the monitor service

```sh
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```
Update the existing gandalf.service to use a single worker:

```ini
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```
### 6. Authelia rule

Add to the `access_control` rules in /etc/authelia/configuration.yml:

```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - "group:admin"
```

Reload Authelia: `systemctl reload authelia`
### 7. NPM proxy host

- Domain: gandalf.lotusguild.org
- Forward to: http://10.10.10.61:80 (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: not required
## Service Management

```sh
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```
## Troubleshooting

### Monitor not creating tickets

- Check that `ticket_api.api_key` is set in `config.json`
- Check `journalctl -u gandalf-monitor` for errors
### Baseline re-initializing on every restart

`interface_baseline` is stored in the `monitor_state` DB table; it persists across restarts.
### Interface always showing as "initial_down"

- That interface was down on the first poll after the monitor started
- It will begin tracking once the interface comes up; alternatively, manually update the baseline in the DB
### Prometheus data missing for a host

- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: http://10.10.10.48:9090/targets