Complete rewrite: full-featured network monitoring dashboard
- Two-service architecture: Flask web app (gandalf.service) + background polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all 6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on; only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) → PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
README.md (218 changed lines)
> Because it shall not let problems pass!

Network monitoring dashboard for the LotusGuild Proxmox cluster.

Deployed on **LXC 157** (monitor-02 / 10.10.10.9), reachable at `gandalf.lotusguild.org`.

---

## Architecture

Gandalf is two processes that share a MariaDB database:

| Process | Service | Role |
|---|---|---|
| `app.py` | `gandalf.service` | Flask web dashboard (gunicorn, port 8000) |
| `monitor.py` | `gandalf-monitor.service` | Background polling daemon |

```
[Prometheus :9090] ──▶
                       monitor.py ──▶ MariaDB ◀── app.py ──▶ nginx ──▶ Authelia ──▶ Browser
[UniFi Controller] ──▶
```
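
The data flow can be sketched as a single poll cycle. This is a minimal illustration, assuming hypothetical collector callables (`collect_prometheus`, `collect_unifi`) and a `store` writer; the real `monitor.py` wiring may differ:

```python
# Illustrative sketch of the monitor daemon's poll cycle. The function
# names and tuple shapes here are assumptions, not the actual monitor.py API.
import time

def run_once(collect_prometheus, collect_unifi, store):
    """One poll cycle: gather interface/device states, persist them."""
    states = []
    states.extend(collect_prometheus())  # e.g. [(host, iface, up), ...]
    states.extend(collect_unifi())       # e.g. [(device, None, up), ...]
    store(states)                        # write results to the shared DB
    return len(states)

def run_forever(poll_interval=120, **collectors):
    """Daemon loop: poll, sleep, repeat (gandalf-monitor.service)."""
    while True:
        run_once(**collectors)
        time.sleep(poll_interval)
```

The web app (`app.py`) never polls anything itself; it only reads what the daemon wrote, which is why the two services can restart independently.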

### Data Sources

| Source | What it monitors |
|---|---|
| **Prometheus** (`10.10.10.48:9090`) | Physical NIC link state (`node_network_up`) for all 6 Proxmox hypervisors |
| **UniFi API** (`https://10.10.10.1`) | Switch, AP, and gateway device status |
| **Ping** | pbs (10.10.10.3) — no node_exporter |
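
For reference, a sketch of fetching and parsing `node_network_up` from Prometheus's standard `/api/v1/query` endpoint. The helper names and the device filter are illustrative, not the actual `monitor.py` code:

```python
# Query Prometheus for NIC link state and flatten the JSON response into
# {(instance, device): up}. The label filter is an assumption.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

def query_node_network_up(base_url="http://10.10.10.48:9090"):
    """Fetch link-state samples for physical NICs from Prometheus."""
    url = base_url + "/api/v1/query?" + urlencode(
        {"query": 'node_network_up{device!~"lo|veth.*"}'})
    with urlopen(url, timeout=10) as resp:
        return parse_link_states(json.load(resp))

def parse_link_states(payload):
    """Map each (instance, device) sample to a boolean link state."""
    states = {}
    for sample in payload.get("data", {}).get("result", []):
        labels = sample["metric"]
        # Prometheus instant vectors carry the value as [timestamp, "<str>"]
        states[(labels["instance"], labels["device"])] = sample["value"][1] == "1"
    return states
```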

### Monitored Hosts (Prometheus / node_exporter)

| Host | Instance |
|---|---|
| large1 | 10.10.10.2:9100 |
| compute-storage-01 | 10.10.10.4:9100 |
| micro1 | 10.10.10.8:9100 |
| monitor-02 | 10.10.10.9:9100 |
| compute-storage-gpu-01 | 10.10.10.10:9100 |
| storage-01 | 10.10.10.11:9100 |

---

## Features

- **Interface monitoring** – tracks link state for all physical NICs via Prometheus
- **UniFi device monitoring** – detects offline switches, APs, and gateways
- **Ping reachability** – covers hosts without node_exporter
- **Cluster-wide detection** – creates a separate P1 ticket when 3+ hosts have simultaneous interface failures (likely a switch failure)
- **Smart baseline tracking** – interfaces that are down on first observation (unused ports) are never alerted on; only regressions from UP→DOWN trigger tickets
- **Ticket creation** – integrates with Tinker Tickets (`t.lotusguild.org`) with 24-hour deduplication
- **Alert suppression** – manual toggle or timed windows (30min / 1hr / 4hr / 8hr / manual)
- **Authelia SSO** – restricted to `admin` group via forward-auth headers
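
The baseline rule can be illustrated with a small state machine. This is a sketch only; the real monitor persists its baseline in the `monitor_state` MariaDB table, and the function name is hypothetical:

```python
# Smart baseline sketch: an interface first observed DOWN is recorded as
# "initial_down" and never alerted on; only an UP→DOWN transition against
# a known-UP baseline counts as a regression.

def evaluate(baseline, key, up):
    """Return 'regression' | 'ok' | 'initial_down', updating the baseline."""
    if key not in baseline:
        # First observation establishes the baseline for this interface.
        baseline[key] = "up" if up else "initial_down"
        return "ok" if up else "initial_down"
    if baseline[key] == "initial_down":
        if up:                      # unused port finally came up: start tracking
            baseline[key] = "up"
        return "ok" if up else "initial_down"
    return "ok" if up else "regression"
```

This is what guards against alert storms on startup: a freshly initialized baseline can only produce `initial_down`, never `regression`.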

---

## Alert Logic

### Ticket Triggers

| Condition | Priority |
|---|---|
| UniFi device offline (2+ consecutive checks) | P2 High |
| Proxmox host NIC link-down regression (2+ consecutive checks) | P2 High |
| Host unreachable via ping (2+ consecutive checks) | P2 High |
| 3+ hosts simultaneously reporting interface failures | P1 Critical |
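
A minimal sketch of how these triggers compose, together with a date-bucketed hash key approximating the 24-hour deduplication. Function names and the exact dedup scheme are assumptions, not the actual `monitor.py` implementation:

```python
# Per-target P2 after `failure_threshold` consecutive bad checks, escalated
# to a single P1 when `cluster_threshold`+ hosts fail in the same cycle.
import hashlib

def tickets_for_cycle(streaks, failure_threshold=2, cluster_threshold=3):
    """streaks: {host: consecutive_failed_checks}. Returns (priority, target)."""
    failing = sorted(h for h, n in streaks.items() if n >= failure_threshold)
    if len(failing) >= cluster_threshold:
        return [("P1", "cluster")]          # likely a shared switch failure
    return [("P2", h) for h in failing]

def dedup_key(subject, day):
    """Same subject on the same day hashes identically, so the second
    ticket attempt can be skipped."""
    return hashlib.sha256(f"{subject}|{day}".encode()).hexdigest()
```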

### Suppression Targets

| Type | Suppresses |
|---|---|
| `host` | All interface alerts for a named host |
| `interface` | A specific NIC on a specific host |
| `unifi_device` | A specific UniFi device |
| `all` | Everything (global maintenance mode) |

Suppressions can be manual (persist until removed) or timed (auto-expire).
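
Matching an alert against this table can be sketched as below. The row shape `(type, target, expires_at)` is illustrative, not the actual DB schema:

```python
# Suppression matcher sketch: expires_at=None models a manual suppression
# (persists until removed); a past expires_at means the timed window lapsed.
from datetime import datetime, timedelta

def is_suppressed(rules, now, host=None, iface=None, device=None):
    for rtype, target, expires_at in rules:
        if expires_at is not None and expires_at <= now:
            continue                                # timed window expired
        if rtype == "all":
            return True
        if rtype == "host" and target == host:
            return True
        if rtype == "interface" and target == (host, iface):
            return True
        if rtype == "unifi_device" and target == device:
            return True
    return False
```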

---

## Configuration

**`config.json`** – shared by both processes:

| Key | Description |
|---|---|
| `unifi.api_key` | UniFi API key from the controller |
| `prometheus.url` | Prometheus base URL |
| `database.*` | MariaDB credentials |
| `ticket_api.api_key` | Tinker Tickets Bearer token |
| `monitor.poll_interval` | Seconds between checks (default: 120) |
| `monitor.failure_threshold` | Consecutive failures before ticketing (default: 2) |
| `monitor.cluster_threshold` | Hosts with failures needed to trigger a cluster alert (default: 3) |
| `monitor.ping_hosts` | Hosts checked via ping (no node_exporter) |
| `hosts` | Maps Prometheus instance labels to hostnames |
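
An example of what the file might look like, derived from the key table above. The exact nesting and any keys not listed in the table (such as `unifi.url`) are assumptions; check the shipped `config.json` for the authoritative shape:

```json
{
  "unifi": { "api_key": "REDACTED", "url": "https://10.10.10.1" },
  "prometheus": { "url": "http://10.10.10.48:9090" },
  "database": {
    "host": "10.10.10.50",
    "user": "gandalf",
    "password": "REDACTED",
    "database": "gandalf"
  },
  "ticket_api": { "api_key": "REDACTED" },
  "monitor": {
    "poll_interval": 120,
    "failure_threshold": 2,
    "cluster_threshold": 3,
    "ping_hosts": { "pbs": "10.10.10.3" }
  },
  "hosts": { "10.10.10.2:9100": "large1" }
}
```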

---

## Deployment (LXC 157)

### 1. Database (MariaDB LXC 149 at 10.10.10.50)

```sql
CREATE DATABASE gandalf CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'gandalf'@'10.10.10.61' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON gandalf.* TO 'gandalf'@'10.10.10.61';
FLUSH PRIVILEGES;
```

Then import the schema:

```bash
mysql -h 10.10.10.50 -u gandalf -p gandalf < schema.sql
```
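
To give a feel for what `schema.sql` provides, here is an illustrative shape only, based on the tables mentioned elsewhere in this README (`monitor_state`, suppressions); the authoritative definitions live in `schema.sql`:

```sql
-- Hypothetical sketch, not the shipped schema.
CREATE TABLE IF NOT EXISTS monitor_state (
  `key`      VARCHAR(64) PRIMARY KEY,   -- e.g. 'interface_baseline'
  `value`    JSON NOT NULL,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS suppressions (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  type       ENUM('host','interface','unifi_device','all') NOT NULL,
  target     VARCHAR(255),
  expires_at DATETIME NULL               -- NULL = manual, until removed
);
```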

### 2. LXC 157 – Install dependencies

```bash
pip3 install -r requirements.txt
```

### 3. Deploy files

```bash
cp -r app.py db.py monitor.py config.json templates/ static/ /var/www/html/prod/
```

### 4. Configure secrets in `config.json`

- `database.password` – set the gandalf DB password
- `ticket_api.api_key` – copy from the Tinker Tickets admin panel

### 5. Install the monitor service

```bash
cp gandalf-monitor.service /etc/systemd/system/
systemctl daemon-reload
systemctl enable gandalf-monitor
systemctl start gandalf-monitor
```

Update the existing `gandalf.service` to use a single worker:

```
ExecStart=/usr/bin/python3 -m gunicorn --workers 1 --bind 127.0.0.1:8000 app:app
```

### 6. Authelia rule

Add to the `access_control` rules in `/etc/authelia/configuration.yml`:

```yaml
- domain: gandalf.lotusguild.org
  policy: one_factor
  subject:
    - "group:admin"
```

Reload Authelia: `systemctl reload authelia`

### 7. NPM proxy host

- Domain: `gandalf.lotusguild.org`
- Forward to: `http://10.10.10.61:80` (nginx on LXC 157)
- Enable Authelia forward auth
- WebSockets: **not required**

---

## Service Management

```bash
# Monitor daemon
systemctl status gandalf-monitor
journalctl -u gandalf-monitor -f

# Web server
systemctl status gandalf
journalctl -u gandalf -f

# Restart both after config/code changes
systemctl restart gandalf-monitor gandalf
```

---

## Troubleshooting

**Monitor not creating tickets**
- Check that `ticket_api.api_key` is set in `config.json`
- Check `journalctl -u gandalf-monitor` for errors

**Baseline re-initializing on every restart**
- `interface_baseline` is stored in the `monitor_state` DB table; it persists across restarts

**Interface always showing as "initial_down"**
- That interface was down on the first poll after the monitor started
- It will begin tracking once it comes up; or manually update the baseline in the DB if needed

**Prometheus data missing for a host**
- Verify node_exporter is running: `systemctl status prometheus-node-exporter`
- Check Prometheus targets: `http://10.10.10.48:9090/targets`