Complete rewrite: full-featured network monitoring dashboard

- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 23:03:18 -05:00
parent 4ed5ecacbb
commit 0c0150f698
13 changed files with 2787 additions and 512 deletions

README.md

> Because it shall not let problems pass!
Network monitoring dashboard for the LotusGuild Proxmox cluster.
Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.
---
## Architecture
Gandalf is two processes that share a MariaDB database:
| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |
```
[Prometheus :9090] ──▶
monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶
```
### Data Sources
| Source | What it monitors |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state (`node_network_up`) for 6 Proxmox hypervisors |
| **UniFi API** (`https://10.10.10.1`) | Switch, AP, and gateway device status |
| **Ping** | pbs (10.10.10.3) only (no node_exporter) |
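As an illustration, the NIC link states can be pulled from Prometheus's instant-query HTTP API and reduced to a per-host map. This is a minimal sketch, not the actual `monitor.py` implementation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROM_URL = "http://10.10.10.48:9090"

def query_nic_states(prom_url=PROM_URL):
    """Run an instant query for node_network_up; return the raw result list."""
    qs = urlencode({"query": "node_network_up"})
    with urlopen(f"{prom_url}/api/v1/query?{qs}") as resp:
        return json.load(resp)["data"]["result"]

def parse_nic_states(result):
    """Reduce Prometheus vector results to {(instance, device): is_up}."""
    states = {}
    for sample in result:
        labels = sample["metric"]
        key = (labels["instance"], labels["device"])
        states[key] = sample["value"][1] == "1"  # value is [timestamp, "0"|"1"]
    return states
```

The parsed map is what the daemon would compare against its stored baseline on each poll.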
### Monitored Hosts (Prometheus / node_exporter)
| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |
---
## Features
- **Interface monitoring** tracks link state for all physical NICs via Prometheus
- **UniFi device monitoring** detects offline switches, APs, and gateways
- **Ping reachability** covers hosts without node_exporter
- **Cluster-wide detection** creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- **Smart baseline tracking**: interfaces that are down on first observation (unused ports) are never alerted on; only UP→DOWN regressions trigger tickets
- **Ticket creation** integrates with Tinker Tickets (`t.lotusguild.org`) with 24-hour deduplication
- **Alert suppression**: manual toggle or timed windows (30 min / 1 hr / 4 hr / 8 hr / manual)
- **Authelia SSO** restricted to `admin` group via forward-auth headers
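The baseline rule above can be expressed as a small pure function. This is a sketch of the idea only, not the actual `monitor.py` code:

```python
def classify_interface(iface, is_up, baseline):
    """Classify an interface against the stored baseline.

    baseline maps iface -> state first observed ("up" or "down").
    Interfaces first seen down are 'initial_down' and never alerted on;
    only an UP -> DOWN transition counts as a regression worth a ticket.
    """
    if iface not in baseline:
        baseline[iface] = "up" if is_up else "down"
        return "baseline_recorded"
    if baseline[iface] == "down":
        return "initial_down"                    # unused port, never alert
    return "up" if is_up else "regression"       # was up at baseline time
```

Because the baseline persists in MariaDB, a port that was down when the monitor first started stays classified as `initial_down` across restarts.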
---
## Alert Logic
### Ticket Triggers
| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
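The threshold and cluster escalation rules can be sketched as follows (illustrative only; the constants mirror the config defaults, and the ticket tuples are a hypothetical shape, not the Tinker Tickets payload):

```python
FAILURE_THRESHOLD = 2   # monitor.failure_threshold: consecutive failed polls
CLUSTER_THRESHOLD = 3   # monitor.cluster_threshold: hosts for a P1 escalation

def tickets_for(fail_counts):
    """Given {(host, check): consecutive_failure_count}, decide what to ticket.

    A check must fail on 2+ consecutive polls before a P2 ticket is cut;
    3+ hosts with confirmed failures collapse into a single P1 instead.
    """
    confirmed = [k for k, n in fail_counts.items() if n >= FAILURE_THRESHOLD]
    hosts = {host for host, _ in confirmed}
    if len(hosts) >= CLUSTER_THRESHOLD:
        # Likely a shared switch failure: one cluster-wide P1, not N P2s.
        return [("P1", "cluster-wide interface failure", sorted(hosts))]
    return [("P2", f"{check} down on {host}", [host]) for host, check in confirmed]
```

Requiring two consecutive failed polls filters out single-scrape blips; the cluster collapse guards against a flood of P2s when the real fault is upstream.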
### Suppression Targets
| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |
Suppressions can be manual (persist until removed) or timed (auto-expire).
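Suppression matching reduces to a simple precedence check over the active rules. A minimal sketch, assuming a flat list of suppression records (field names here are hypothetical, not the actual schema):

```python
import time

def is_suppressed(alert, suppressions, now=None):
    """Return True if any active suppression covers this alert.

    alert: {"host": ..., "interface": ..., "unifi_device": ...}
    suppressions: list of {"type", "target", "expires_at"}; an expires_at
    of None is a manual suppression that persists until removed.
    """
    now = now or time.time()
    for s in suppressions:
        if s["expires_at"] is not None and s["expires_at"] <= now:
            continue  # timed window already expired
        if s["type"] == "all":
            return True
        if s["type"] == "host" and s["target"] == alert.get("host"):
            return True
        if s["type"] == "interface" and s["target"] == (alert.get("host"),
                                                        alert.get("interface")):
            return True
        if s["type"] == "unifi_device" and s["target"] == alert.get("unifi_device"):
            return True
    return False
```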
---
## Configuration
**`config.json`** shared by both processes:
| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures to trigger cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
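A minimal `config.json` might look like this. Values are illustrative only; the nested key layout beyond what the table above names (`unifi.url`, `database.name`, the `ping_hosts` map shape) is an assumption, so check the shipped `config.json` for the real structure.

```json
{
  "unifi": {"url": "https://10.10.10.1", "api_key": "REDACTED"},
  "prometheus": {"url": "http://10.10.10.48:9090"},
  "database": {"host": "10.10.10.50", "user": "gandalf",
               "password": "REDACTED", "name": "gandalf"},
  "ticket_api": {"api_key": "REDACTED"},
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": {"pbs": "10.10.10.3"}
  },
  "hosts": {"10.10.10.2:9100": "large1"}
}
```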
---
## Deployment (LXC 157)
### 1. Database (MariaDB LXC 149 at 10.10.10.50)
```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```
Then import the schema:
```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
### 2. Install dependencies (LXC 157)
```bash
pip3 install -r requirements.txt
```
### 3. Deploy files
```bash
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```
### 4. Configure secrets in `config.json`
- `database.password`: the gandalf DB password
- `ticket_api.api_key`: copy from the Tinker Tickets admin panel
### 5. Install the monitor service
```bash
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```
Update existing `gandalf.service` to use a single worker:
```
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```
### 6. Authelia rule
Add to `/etc/authelia/configuration.yml` access_control rules:
```yaml
  - domain: gandalf.lotusguild.org
    policy: one_factor
    subject:
      - "group:admin"
```
Reload Authelia: `systemctl reload authelia`
### 7. NPM proxy host
- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:80` (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: **not required**
---
## Service Management
```bash
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f
# Web server
systemctl status gandalf
journalctl -u gandalf -f
# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```
---
## Troubleshooting
**Monitor not creating tickets**
- Check that `ticket_api.api_key` is set in `config.json`
- Check `journalctl -u gandalf-monitor` for errors
**Baseline re-initializing on every restart**
- `interface_baseline` is stored in the `monitor_state` DB table; it persists across restarts
**Interface always showing as "initial_down"**
- That interface was down on the first poll after the monitor started
- It will begin tracking once it comes up; or manually update the baseline in DB if needed
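If a port that is genuinely in use got baselined as down, the stored baseline can be edited directly. A sketch only: it assumes `monitor_state` is a key/value table with the baseline stored as JSON under the `interface_baseline` key; check `schema.sql` for the actual layout before running anything.

```sql
-- Inspect the stored baseline (hypothetical key/value layout)
SELECT value FROM monitor_state WHERE `key` = 'interface_baseline';
-- Edit the JSON for the affected host/interface, then write it back:
UPDATE monitor_state SET value = '<edited JSON>' WHERE `key` = 'interface_baseline';
```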
**Prometheus data missing for a host**
- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: `http://10.10.10.48:9090/targets`