# GANDALF (Global Advanced Network Detection And Link Facilitator)

> Because it shall not let problems pass!

Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.

---

## Architecture

Gandalf is two processes that share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090]  ──▶ monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller]  ──▶
```

### Data Sources

| Source | What it monitors |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state (`node_network_up`) for 6 Proxmox hosts |
| **UniFi API** (`https://10.10.10.1`) | Switch, AP, and gateway device status |
| **Ping** | pbs (10.10.10.3), which has no node_exporter |

### Monitored Hosts (Prometheus / node_exporter)

| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |

---

## Features

- **Interface monitoring** – tracks link state for all physical NICs via Prometheus
- **UniFi device monitoring** – detects offline switches, APs, and gateways
- **Ping reachability** – covers hosts without node_exporter
- **Cluster-wide detection** – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- **Smart baseline tracking** – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- **Ticket creation** – integrates with Tinker Tickets (`t.lotusguild.org`) with 24-hour deduplication
- **Alert suppression** – manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
- **Authelia SSO** – restricted to the `admin` group via forward-auth headers

---

## Alert Logic

### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).

---

## Configuration

**`config.json`** – shared by both processes:

| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Number of hosts with failures that triggers a cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |

---

## Deployment (LXC 157)

### 1. Database (MariaDB LXC 149 at 10.10.10.50)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Then import the schema:

```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```

### 2. LXC 157 – Install dependencies

```bash
pip3 install -r requirements.txt
```

### 3. Deploy files

```bash
# -r is required because templates/ and static/ are directories
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```

### 4. Configure secrets in `config.json`

- `database.password` – set the gandalf DB password
- `ticket_api.api_key` – copy from the Tinker Tickets admin panel

### 5. Install the monitor service

```bash
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```

Update the existing `gandalf.service` to use a single worker:

```
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```

### 6. Authelia rule

Add to the `access_control` rules in `/etc/authelia/configuration.yml`:

```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - group:admin
```

Restart Authelia: `systemctl restart authelia`

### 7. NPM proxy host

- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:80` (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: **not required**

---

## Service Management

```bash
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```

---

## Troubleshooting

**Monitor not creating tickets**
- Check that `config.json` → `ticket_api.api_key` is set
- Check `journalctl -u gandalf-monitor` for errors

**Baseline re-initializing on every restart**
- `interface_baseline` is stored in the `monitor_state` DB table, so it persists across restarts

**Interface always showing as "initial_down"**
- That interface was down on the first poll after the monitor started
- Tracking begins once the interface comes up; if needed, update the baseline in the DB manually

**Prometheus data missing for a host**
- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: `http://10.10.10.48:9090/targets`

# auto-deploy: webhook + deploy script
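
For the "Prometheus data missing for a host" case in Troubleshooting, it can help to ask Prometheus directly what it holds for one instance. The following is a minimal sketch using the standard Prometheus instant-query endpoint (`/api/v1/query`) and the `node_network_up` metric named under Data Sources. The helper names (`build_query_url`, `parse_link_state`, `fetch_link_state`) are hypothetical and are not part of Gandalf's code; the base URL is taken from the Data Sources table.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus base URL as listed in the Data Sources table.
PROMETHEUS_URL = "http://10.10.10.48:9090"


def build_query_url(base_url: str, instance: str) -> str:
    """Build the instant-query URL for node_network_up on one instance."""
    promql = f'node_network_up{{instance="{instance}"}}'
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def parse_link_state(payload: dict) -> dict:
    """Map each NIC (the `device` label) to True (up) or False (down).

    An empty dict means Prometheus returned no samples for the instance,
    which is the "data missing" symptom.
    """
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    state = {}
    for sample in payload["data"]["result"]:
        device = sample["metric"].get("device", "unknown")
        # Each instant-vector sample carries value = [timestamp, "<number as string>"].
        state[device] = float(sample["value"][1]) == 1.0
    return state


def fetch_link_state(instance: str, base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> dict:
    """Query Prometheus over HTTP and return the parsed link state."""
    url = build_query_url(base_url, instance)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_link_state(json.load(resp))
```

Calling `fetch_link_state("10.10.10.2:9100")` from a machine that can reach Prometheus returns one entry per NIC on large1. An empty result suggests the target is not being scraped, so the node_exporter and targets-page checks above are the next step.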