GANDALF (Global Advanced Network Detection And Link Facilitator)
Because it shall not let problems pass!
Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.
Architecture
Gandalf is two processes that share a MariaDB database:
| Process | Service | Role |
|---|---|---|
| app.py | gandalf.service | Flask web dashboard (gunicorn, port 8000) |
| monitor.py | gandalf-monitor.service | Background polling daemon |
    [Prometheus :9090]  ──▶
                            monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
    [UniFi Controller]  ──▶
Data Sources
| Source | What it monitors |
|---|---|
| Prometheus (10.10.10.48:9090) | Physical NIC link state (node_network_up) for 6 Proxmox hosts |
| UniFi API (https://10.10.10.1) | Switch, AP, and gateway device status |
| Ping | pbs (10.10.10.3) — no node_exporter |
Monitored Hosts (Prometheus / node_exporter)
| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
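As a rough illustration of how link state could be read for the hosts above, the sketch below queries the Prometheus instant-query endpoint (`/api/v1/query`) for `node_network_up` and maps each `instance` label to a hostname. The function names and the `HOSTS` dict are hypothetical and not taken from monitor.py, and the sketch does not filter out virtual interfaces, which the real monitor presumably does.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Instance-to-hostname map copied from the table above.
# In Gandalf this presumably comes from the "hosts" key in config.json.
HOSTS = {
    "10.10.10.2:9100": "large1",
    "10.10.10.4:9100": "compute-storage-01",
    "10.10.10.8:9100": "micro1",
    "10.10.10.9:9100": "monitor-02",
    "10.10.10.10:9100": "compute-storage-gpu-01",
    "10.10.10.11:9100": "storage-01",
}


def query_url(base_url, promql="node_network_up"):
    """Build a Prometheus instant-query URL."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_link_states(payload, hosts):
    """Convert an instant-query response into {hostname: {device: is_up}}."""
    states = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        host = hosts.get(labels.get("instance"))
        if host is None:
            continue  # instance is not in the monitored-hosts map
        # sample["value"] is [timestamp, "<value as a string>"]
        is_up = float(sample["value"][1]) == 1.0
        states.setdefault(host, {})[labels["device"]] = is_up
    return states


def fetch_link_states(base_url, hosts, timeout=10):
    """Query Prometheus over HTTP and return parsed link states."""
    with urlopen(query_url(base_url), timeout=timeout) as resp:
        return parse_link_states(json.load(resp), hosts)
```

Calling `fetch_link_states("http://10.10.10.48:9090", HOSTS)` would return one dict of interface states per monitored host.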
Features
- Interface monitoring – tracks link state for all physical NICs via Prometheus
- UniFi device monitoring – detects offline switches, APs, and gateways
- Ping reachability – covers hosts without node_exporter
- Cluster-wide detection – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- Smart baseline tracking – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- Ticket creation – integrates with Tinker Tickets (t.lotusguild.org) with 24-hour deduplication
- Alert suppression – manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
- Authelia SSO – restricted to admin group via forward-auth headers
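The 24-hour ticket deduplication in the list above could work roughly as follows. This is a minimal in-memory sketch; `TicketDeduplicator` and the event-key format are hypothetical, and the real monitor may instead track this in MariaDB or through the ticket API.

```python
from datetime import datetime, timedelta

DEDUP_WINDOW = timedelta(hours=24)


class TicketDeduplicator:
    """Allow at most one ticket per event key within the dedup window."""

    def __init__(self, window=DEDUP_WINDOW):
        self.window = window
        self._last_ticket = {}  # event key -> time the last ticket was created

    def should_create(self, key, now):
        last = self._last_ticket.get(key)
        if last is not None and now - last < self.window:
            return False  # a ticket for this event is still within the window
        self._last_ticket[key] = now
        return True


dedup = TicketDeduplicator()
start = datetime(2024, 1, 1, 12, 0)
print(dedup.should_create("large1:eno1", start))                        # first ticket
print(dedup.should_create("large1:eno1", start + timedelta(hours=3)))   # suppressed
```

Keying on host and interface (an assumed format) keeps a flapping NIC from opening a new ticket on every poll while still letting a different interface raise its own.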
Alert Logic
Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
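The trigger table can be read as two thresholds: a failure must persist for 2 consecutive checks before it counts, and 3 or more hosts with counted failures escalate to a cluster alert. The sketch below shows one way to express that; `FailureTracker` is a hypothetical name, and counting only confirmed failures toward the cluster alert is an assumption rather than documented behavior.

```python
from collections import defaultdict

FAILURE_THRESHOLD = 2  # consecutive failed checks before a P2 condition is met
CLUSTER_THRESHOLD = 3  # hosts with confirmed failures before a P1 condition is met


class FailureTracker:
    def __init__(self, failure_threshold=FAILURE_THRESHOLD,
                 cluster_threshold=CLUSTER_THRESHOLD):
        self.failure_threshold = failure_threshold
        self.cluster_threshold = cluster_threshold
        self.counts = defaultdict(int)  # (host, interface) -> consecutive failures

    def record_poll(self, down):
        """Record one poll. `down` holds the (host, interface) pairs failing now.

        Returns the (priority, target) conditions met after this poll.
        The result is not deduplicated; that is a separate concern.
        """
        down = set(down)
        # A recovered interface loses its streak.
        for key in list(self.counts):
            if key not in down:
                del self.counts[key]
        for key in down:
            self.counts[key] += 1

        confirmed = sorted(k for k, n in self.counts.items()
                           if n >= self.failure_threshold)
        alerts = [("P2", f"{host}:{iface}") for host, iface in confirmed]
        if len({host for host, _ in confirmed}) >= self.cluster_threshold:
            alerts.append(("P1", "cluster"))
        return alerts
```

A single failed poll returns nothing; the same interface failing on the next poll yields a P2 entry.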
Suppression Targets
| Type | Suppresses |
|---|---|
| host | All interface alerts for a named host |
| interface | A specific NIC on a specific host |
| unifi_device | A specific UniFi device |
| all | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
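To make the target types and the manual/timed distinction concrete, here is a small sketch of how an alert might be checked against a list of suppressions. The `Suppression` dataclass and its field names are assumptions for illustration; the actual schema lives in the database and may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Suppression:
    target_type: str                       # "host", "interface", "unifi_device" or "all"
    name: Optional[str] = None             # host or device name; unused for "all"
    detail: Optional[str] = None           # interface name, only for "interface"
    expires_at: Optional[datetime] = None  # None means manual (never auto-expires)

    def active(self, now):
        return self.expires_at is None or now < self.expires_at

    def matches(self, kind, name, detail=None):
        """`kind` is the alert kind: "interface" or "unifi_device"."""
        if self.target_type == "all":
            return True
        if self.target_type == "host":
            return kind == "interface" and name == self.name
        if self.target_type == "interface":
            return (kind == "interface" and name == self.name
                    and detail == self.detail)
        if self.target_type == "unifi_device":
            return kind == "unifi_device" and name == self.name
        return False


def is_suppressed(suppressions, now, kind, name, detail=None):
    """True if any active suppression covers the given alert."""
    return any(s.active(now) and s.matches(kind, name, detail)
               for s in suppressions)


now = datetime(2024, 1, 1, 12, 0)
rules = [Suppression("interface", "micro1", "eno2",
                     expires_at=now + timedelta(hours=1))]
print(is_suppressed(rules, now, "interface", "micro1", "eno2"))
```

Once the one-hour window passes, the same call returns False without anyone having to remove the rule.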
Configuration
config.json – shared by both processes:
| Key | Description |
|---|---|
| unifi.api_key | UniFi API key from controller |
| prometheus.url | Prometheus base URL |
| database.* | MariaDB credentials |
| ticket_api.api_key | Tinker Tickets Bearer token |
| monitor.poll_interval | Seconds between checks (default: 120) |
| monitor.failure_threshold | Consecutive failures before ticketing (default: 2) |
| monitor.cluster_threshold | Hosts with failures to trigger cluster alert (default: 3) |
| monitor.ping_hosts | Hosts checked via ping (no node_exporter) |
| hosts | Maps Prometheus instance labels to hostnames |
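A loader for this file might look like the sketch below. It assumes the dotted keys in the table map to nested JSON objects (for example `{"monitor": {"poll_interval": 120}}`), which the table suggests but does not state; the helper names and the empty-list default for `ping_hosts` are hypothetical.

```python
import json

# Defaults taken from the table above; the ping_hosts default is an assumption.
MONITOR_DEFAULTS = {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": [],
}


def apply_defaults(cfg):
    """Return a copy of cfg with missing monitor.* keys filled in."""
    merged = dict(cfg)
    merged["monitor"] = {**MONITOR_DEFAULTS, **cfg.get("monitor", {})}
    return merged


def missing_secrets(cfg):
    """List the secret keys that are absent or empty."""
    required = [("unifi", "api_key"), ("database", "password"),
                ("ticket_api", "api_key")]
    return [f"{section}.{key}" for section, key in required
            if not cfg.get(section, {}).get(key)]


def load_config(path="config.json"):
    """Read config.json and apply the monitor defaults."""
    with open(path, encoding="utf-8") as fh:
        return apply_defaults(json.load(fh))
```

Running `missing_secrets(load_config())` before starting either service gives a quick check that no credential was left blank.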
Deployment (LXC 157)
1. Database (MariaDB LXC 149 at 10.10.10.50)
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
Then import the schema:
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
2. LXC 157 – Install dependencies
pip3 install -r requirements.txt
3. Deploy files
cp app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
4. Configure secrets in config.json
- database.password – set the gandalf DB password
- ticket_api.api_key – copy from the Tinker Tickets admin panel
5. Install the monitor service
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
Update existing gandalf.service to use a single worker:
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
6. Authelia rule
Add to /etc/authelia/configuration.yml access_control rules:
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin
Restart Authelia (a reload is not sufficient): systemctl restart authelia
7. NPM proxy host
- Domain: gandalf.lotusguild.org
- Forward to: http://10.10.10.61:80 (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: not required
Service Management
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f
# Web server
systemctl status gandalf
journalctl -u gandalf -f
# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
Troubleshooting
Monitor not creating tickets
- Check config.json → ticket_api.api_key is set
- Check journalctl -u gandalf-monitor for errors
Baseline re-initializing on every restart
- interface_baseline is stored in the monitor_state DB table; it persists across restarts
Interface always showing as "initial_down"
- That interface was down on the first poll after the monitor started
- Tracking begins once the interface comes up; if needed, update the baseline in the DB manually
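The "initial_down" behavior follows from the baseline rule: an interface that is down when first seen is recorded but never alerted on until it has been up at least once. A minimal sketch of that state machine is below; `update_baseline` and the "up" and "down" state labels are hypothetical, and only "initial_down" comes from the monitor itself.

```python
def update_baseline(baseline, key, is_up):
    """Update the baseline for one interface and report regressions.

    `baseline` maps an interface key to "up", "down" or "initial_down".
    Returns True only when an interface that has been up is now down.
    """
    previous = baseline.get(key)
    if previous is None:
        # First observation: a down port is assumed to be unused.
        baseline[key] = "up" if is_up else "initial_down"
        return False
    if is_up:
        baseline[key] = "up"  # from here on, a drop counts as a regression
        return False
    if previous == "initial_down":
        return False  # still never seen up, keep ignoring it
    baseline[key] = "down"
    return True


baseline = {}
update_baseline(baseline, "large1:eno2", False)  # recorded as initial_down
update_baseline(baseline, "large1:eno2", True)   # port comes up, now tracked
print(update_baseline(baseline, "large1:eno2", False))  # regression
```

Because the function returns True on every poll while a tracked interface stays down, a caller can count consecutive True results against the failure threshold.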
Prometheus data missing for a host
- Verify node_exporter is running: systemctl status prometheus-node-exporter
- Check Prometheus targets: http://10.10.10.48:9090/targets