# GANDALF (Global Advanced Network Detection And Link Facilitator)

*Because it shall not let problems pass!*

Network monitoring dashboard for the LotusGuild Proxmox cluster. Deployed on LXC 157 (monitor-02 / 10.10.10.9), reachable at gandalf.lotusguild.org.
## Architecture

Gandalf is two processes that share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090] ──▶
                       monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶
```
## Data Sources

| Source | What it monitors |
|---|---|
| Prometheus (10.10.10.48:9090) | Physical NIC link state (`node_network_up`) for 5 Proxmox hypervisors |
| UniFi API (https://10.10.10.1) | Switch, AP, and gateway device status |
| Ping | pbs (10.10.10.3) and storage-01 (10.10.10.11), which lack node_exporter |
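The Prometheus side of the polling loop boils down to one instant query plus response parsing. A hedged sketch, assuming the stock Prometheus HTTP API; the device regex and all function names here are illustrative, not lifted from monitor.py:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://10.10.10.48:9090"  # prometheus.url from config.json

# Assumed filter for physical NICs (enp*/eno*/eth*); bridges and bonds skipped.
QUERY = 'node_network_up{device=~"en.*|eth.*"}'

def build_query_url(base: str, query: str) -> str:
    """Build the Prometheus instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": query})

def parse_link_states(payload: dict) -> dict:
    """Map (instance, device) -> link-up bool from a /api/v1/query response."""
    states = {}
    for result in payload.get("data", {}).get("result", []):
        labels = result["metric"]
        # value is [timestamp, "0" | "1"]; "1" means the link is up
        states[(labels["instance"], labels["device"])] = result["value"][1] == "1"
    return states

def fetch_link_states() -> dict:
    """One poll: query Prometheus and return current link states."""
    with urllib.request.urlopen(build_query_url(PROM_URL, QUERY), timeout=10) as resp:
        return parse_link_states(json.load(resp))
```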
## Monitored Hosts (Prometheus / node_exporter)
| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
## Features
- Interface monitoring – tracks link state for all physical NICs via Prometheus
- UniFi device monitoring – detects offline switches, APs, and gateways
- Ping reachability – covers hosts without node_exporter
- Cluster-wide detection – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- Smart baseline tracking – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- Ticket creation – integrates with Tinker Tickets (t.lotusguild.org) with 24-hour deduplication
- Alert suppression – manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
- Authelia SSO – restricted to the `admin` group via forward-auth headers
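The smart baseline rule above can be sketched as a small state machine. This is a minimal illustration, not the actual monitor.py code; the state names (`"initial_down"` is the only one taken from this README) and return labels are assumptions:

```python
def classify(device_key, is_up, baseline):
    """Classify one interface observation against the stored baseline.

    baseline maps (host, device) -> "up" | "initial_down".
    Interfaces first seen down are recorded as "initial_down" and never
    alerted on; only an UP -> DOWN transition counts as a regression.
    """
    if device_key not in baseline:
        # First observation establishes the baseline; never alert on it.
        baseline[device_key] = "up" if is_up else "initial_down"
        return "new"
    if baseline[device_key] == "initial_down":
        if is_up:
            baseline[device_key] = "up"   # start tracking once it comes up
            return "recovered"
        return "ignored"                  # unused port stays silent
    return "ok" if is_up else "regression"
```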
## Alert Logic

### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
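The trigger thresholds and the 24-hour deduplication can be sketched as below. The thresholds mirror the config defaults in this README; the hash choice (SHA-256), the in-memory `recent` map, and all function names are assumptions — the real daemon persists this state in MariaDB:

```python
import hashlib
import time

FAILURE_THRESHOLD = 2      # monitor.failure_threshold default
CLUSTER_THRESHOLD = 3      # monitor.cluster_threshold default
DEDUP_WINDOW = 24 * 3600   # 24-hour ticket deduplication

def should_ticket(consecutive_failures: int) -> bool:
    """P2 tickets fire only after 2+ consecutive failed checks."""
    return consecutive_failures >= FAILURE_THRESHOLD

def is_cluster_event(hosts_with_regressions: set) -> bool:
    """3+ hosts failing at once points at shared infrastructure (P1)."""
    return len(hosts_with_regressions) >= CLUSTER_THRESHOLD

def dedup_key(subject: str) -> str:
    """Hash identifying a ticket subject for deduplication."""
    return hashlib.sha256(subject.encode()).hexdigest()

def is_duplicate(subject: str, recent: dict, now=None) -> bool:
    """True if an identical ticket was created within the last 24 hours."""
    now = now if now is not None else time.time()
    key = dedup_key(subject)
    last = recent.get(key)
    if last is not None and now - last < DEDUP_WINDOW:
        return True
    recent[key] = now
    return False
```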
### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
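Matching an alert against the four suppression types might look like the sketch below. The dict shapes and field names (`expires_at`, the tuple target for `interface`) are assumptions for illustration, not the actual schema:

```python
import time

def is_suppressed(alert: dict, suppressions: list, now=None) -> bool:
    """Return True if any active suppression covers this alert.

    alert: {"host": ..., "interface": ..., "unifi_device": ...} (any may be None).
    suppressions: dicts with "type", "target", and "expires_at"
    (None means a manual suppression that persists until removed).
    """
    now = now if now is not None else time.time()
    for s in suppressions:
        if s["expires_at"] is not None and s["expires_at"] <= now:
            continue  # timed window already expired
        if s["type"] == "all":
            return True  # global maintenance mode
        if s["type"] == "host" and s["target"] == alert.get("host"):
            return True
        if s["type"] == "interface" and s["target"] == (alert.get("host"),
                                                        alert.get("interface")):
            return True
        if s["type"] == "unifi_device" and s["target"] == alert.get("unifi_device"):
            return True
    return False
```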
## Configuration

`config.json` – shared by both processes:

| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from the controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger a cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
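A hypothetical `config.json` skeleton assembled from the key table above. Only the key names and the addresses come from this README; the nesting, extra fields (`unifi.url`, `ticket_api.url`, `database.name`), and the shape of `ping_hosts` are assumptions — check the shipped config.json for the real structure:

```json
{
  "unifi": {
    "url": "https://10.10.10.1",
    "api_key": "REPLACE_ME"
  },
  "prometheus": {
    "url": "http://10.10.10.48:9090"
  },
  "database": {
    "host": "10.10.10.50",
    "user": "gandalf",
    "password": "REPLACE_ME",
    "name": "gandalf"
  },
  "ticket_api": {
    "url": "https://t.lotusguild.org",
    "api_key": "REPLACE_ME"
  },
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": {
      "pbs": "10.10.10.3",
      "storage-01": "10.10.10.11"
    }
  },
  "hosts": {
    "10.10.10.2:9100": "large1",
    "10.10.10.4:9100": "compute-storage-01",
    "10.10.10.8:9100": "micro1",
    "10.10.10.9:9100": "monitor-02",
    "10.10.10.10:9100": "compute-storage-gpu-01"
  }
}
```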
## Deployment (LXC 157)

### 1. Database (MariaDB LXC 149 at 10.10.10.50)
```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```
Then import the schema:
```sh
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. LXC 157 – Install dependencies

```sh
pip3 install -r requirements.txt
```
### 3. Deploy files

```sh
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```

(`-r` is needed because `templates/` and `static/` are directories.)
### 4. Configure secrets in config.json

- `database.password` – set the gandalf DB password
- `ticket_api.api_key` – copy from the Tinker Tickets admin panel
### 5. Install the monitor service

```sh
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```
Update the existing gandalf.service to use a single worker:

```ini
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```
### 6. Authelia rule

Add to the `access_control` rules in /etc/authelia/configuration.yml:

```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - "group:admin"
```

Reload Authelia: `systemctl reload authelia`
### 7. NPM proxy host

- Domain: gandalf.lotusguild.org
- Forward to: http://10.10.10.61:80 (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: not required
## Service Management

```sh
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```
## Troubleshooting

### Monitor not creating tickets

- Check that `ticket_api.api_key` is set in `config.json`
- Check `journalctl -u gandalf-monitor` for errors
### Baseline re-initializing on every restart

`interface_baseline` is stored in the `monitor_state` DB table; it persists across restarts.
### Interface always showing as "initial_down"

- That interface was down on the first poll after the monitor started
- It will begin tracking once the interface comes up; alternatively, manually update the baseline in the DB
### Prometheus data missing for a host

- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: http://10.10.10.48:9090/targets