Commit Graph

80 Commits

Author SHA1 Message Date
jared cd0b725f3e fix: LLDP port label bug, suppression SQL dead code, avatar path hardening
Lint / Python (flake8) (push) Successful in 1m13s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 42s
Test / Python Tests (pytest) (push) Successful in 50s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 3s
- inspector.html: fix LLDP neighbor label in port blocks — port.lldp_table never exists; data is at port.lldp (dict with system_name/chassis_id); both port block renderers corrected
- db.py: remove dead 'target_detail IS NULL' branch in suppression check — target_detail is always stored as '' not NULL; query simplified to target_detail=''
- app.py: resolve cache_dir/cache_file/sentinel to absolute paths; guard against path escape before use
- app.py: wrap sentinel os.path.getmtime() in try/except OSError to handle TOCTOU deletion race

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 09:31:25 -04:00
jared 77c74098a3 fix: flake8 E701 in avatar handler; update SSH test to match accept-new
Lint / Python (flake8) (push) Successful in 55s
Lint / JS (eslint) (push) Successful in 11s
Security / Python Security (bandit) (push) Successful in 1m15s
Test / Python Tests (pytest) (push) Successful in 59s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 3s
- app.py: split 'with open(sentinel): pass' onto two lines (flake8 E701)
- tests/test_diagnose.py: rename test and assert StrictHostKeyChecking=accept-new (not =no which was fixed earlier)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 09:23:06 -04:00
jared aa52047016 fix: cache_ttl config validation; ticket_web_url via tojson in base.html
Lint / Python (flake8) (push) Failing after 44s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 42s
Test / Python Tests (pytest) (push) Failing after 1m13s
Lint / Notify on failure (push) Successful in 4s
Lint / Deploy (push) Has been skipped
- app.py: wrap int(cache_ttl) in try/except so a misconfigured non-integer value falls back to 3600 instead of raising ValueError
- base.html: use Jinja2 tojson filter for ticket_web_url to ensure proper JS string escaping regardless of URL contents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 09:05:53 -04:00
jared e166e3fcb4 fix: LDAP conn leak, health timing info, security headers, link_stats size guard
Lint / Python (flake8) (push) Failing after 51s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 1m2s
Test / Python Tests (pytest) (push) Failing after 1m21s
Lint / Notify on failure (push) Successful in 2s
Lint / Deploy (push) Has been skipped
- app.py: move conn.unbind() into finally block in api_avatar() so connection is always closed even if conn.search() throws
- app.py: remove elapsed-time strings from /health response (unauthenticated endpoint no longer leaks monitor timing)
- app.py: add after_request hook setting X-Content-Type-Options, X-Frame-Options, Referrer-Policy on all responses
- app.py: add 10 MB size guard on link_stats before JSON parse; log actual exception on parse failure
- app.py: wrap suppressions_page network_snapshot parse in try/except (same protection as index page)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 08:50:51 -04:00
jared d4d4208145 fix: LDAP empty-password guard, expires_minutes bounds, snapshot JSON safety, rate dict cleanup
Lint / Python (flake8) (push) Failing after 39s
Lint / JS (eslint) (push) Failing after 12s
Security / Python Security (bandit) (push) Successful in 41s
Test / Python Tests (pytest) (push) Failing after 1m28s
Lint / Notify on failure (push) Successful in 2s
Lint / Deploy (push) Has been skipped
- app.py: fail loudly if LDAP bind_pw is not configured rather than attempting anonymous bind
- app.py: validate expires_minutes is 1–43200 (max 30 days) before storing suppression
- app.py: wrap network_snapshot JSON parse in try/except so a corrupt DB value returns degraded page instead of 500
- app.py: prune _diag_rate entries inactive for >1h to prevent unbounded growth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 08:47:43 -04:00
jared 61408645a5 fix: LLDP input validation, mgmt_ip early validation, poll timer cleanup, monitor backoff
Lint / Python (flake8) (push) Failing after 41s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 42s
Test / Python Tests (pytest) (push) Failing after 1m35s
Lint / Notify on failure (push) Successful in 5s
Lint / Deploy (push) Has been skipped
- app.py: validate server_name from LLDP with fullmatch before use in logs/lookups (prevents log injection)
- app.py: validate each mgmt_ip candidate before assigning host_ip (avoids assigning non-IP string that then fails later check)
- app.py: log actual exception in link_stats JSON parse error
- inspector.html: clear _diagPollTimer in closePanel() so timer doesn't orphan when panel is closed mid-poll
- monitor.py: sleep 30s after a monitor loop exception before resuming normal poll interval

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 08:45:28 -04:00
jared 25baec67ac fix: diagnostic rate limiting, lock-held ownership check, iface name length cap
Lint / Python (flake8) (push) Failing after 47s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 43s
Test / Python Tests (pytest) (push) Failing after 1m22s
Lint / Notify on failure (push) Successful in 3s
Lint / Deploy (push) Has been skipped
- app.py: add per-user diagnostic rate limit (5/min) enforced atomically under _diag_lock
- app.py: move diagnostic job ownership check inside _diag_lock to close TOCTOU window; snapshot result before releasing lock
- monitor.py: cap interface name regex to 15 chars (Linux IFNAMSIZ limit)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 08:42:50 -04:00
jared c71d0da97d security: harden exception exposure, SSL config, and Pulse response parsing
Lint / Python (flake8) (push) Failing after 42s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 1m22s
Test / Python Tests (pytest) (push) Failing after 1m23s
Lint / Notify on failure (push) Successful in 3s
Lint / Deploy (push) Has been skipped
- app.py: replace raw str(e) in diagnostic _run() with generic client message; log internally only
- app.py: /health endpoint no longer leaks exception strings to unauthenticated callers; errors logged server-side
- monitor.py: UniFi SSL verification now defaults True, configurable via config.json unifi.verify_ssl; urllib3 warning suppression scoped to verify=False only (removed global disable)
- monitor.py: Pulse execution_id extracted with .get() + explicit None check to avoid KeyError on malformed response
- monitor.py: interface name regex drops '@' (not a valid kernel interface char) to match app.py and fix inconsistency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 08:40:25 -04:00
jared ca41486c45 security+a11y: job ownership check, aria-live chips, aria-hidden topo
Lint / Python (flake8) (push) Failing after 45s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 1m5s
Test / Python Tests (pytest) (push) Successful in 49s
Lint / Notify on failure (push) Successful in 3s
Lint / Deploy (push) Has been skipped
security:
- Fix bare open(sentinel, 'w').close() file descriptor leak; use
  context manager instead
- Store requesting username in _diag_jobs at creation time; return 403
  from api_diagnose_poll if the polling user does not match the job owner

accessibility:
- Add aria-live="polite" aria-atomic="true" to .status-chips container
  so screen readers announce critical/warning count changes on refresh
- Add aria-controls="events-table-wrap" to critical and warning stat
  cards so assistive tech knows these buttons control the events table
- Add aria-hidden sync to topology setCollapsed() — hidden topology
  content is now removed from the accessibility tree when collapsed,
  preventing keyboard focus from entering invisible elements

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 23:53:17 -04:00
jared 41695a3faa security: escape user input in 403 error response to prevent XSS
Lint / Python (flake8) (push) Successful in 40s
Lint / JS (eslint) (push) Successful in 9s
Security / Python Security (bandit) (push) Successful in 1m0s
Test / Python Tests (pytest) (push) Successful in 51s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 2s
The require_auth decorator was interpolating user['username'] and the
allowed_groups list directly into HTML strings. An attacker with a
crafted username or control over group names could inject arbitrary HTML.

Use html.escape() on both values before insertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 23:41:31 -04:00
jared c0e59cfa9e refactor: extract _annotate_suppressions helper, remove orphaned CSS
Lint / Python (flake8) (push) Successful in 45s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 1m0s
Test / Python Tests (pytest) (push) Successful in 57s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 2s
- Extract identical suppression-annotation loop from index() and
  api_status() into _annotate_suppressions() helper to eliminate DRY
  violation
- Improve stuck-job error message: 'thread crash' → 'no activity for
  5 minutes' (less alarming, more accurate)
- Remove orphaned .events-filter-bar CSS class (never referenced in
  any template or JS file)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 23:39:52 -04:00
jared c027b5422a Feature: show suppression status on active alert rows
Lint / Python (flake8) (push) Successful in 45s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Successful in 47s
Test / Python Tests (pytest) (push) Successful in 1m11s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 3s
Active events now carry an is_suppressed boolean (added in api_status()
and the index() route via check_suppressed() against the pre-loaded
suppression list). The events table renders a muted '🔕 sup' badge next
to the severity and dims the entire row (.row-suppressed) so operators
can immediately see which firing alerts are silenced.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 23:11:15 -04:00
jared 08543ac25a Fix B108: replace hardcoded /tmp with tempfile.gettempdir()
Lint / Python (flake8) (push) Successful in 40s
Lint / JS (eslint) (push) Successful in 8s
Security / Python Security (bandit) (push) Successful in 42s
Test / Python Tests (pytest) (push) Successful in 1m18s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 8s
Bandit flags hardcoded /tmp strings as CWE-377 (insecure temp file).
Use tempfile.gettempdir() for the avatar cache dir default so the
path resolves correctly on all platforms and passes the security scan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 13:34:37 -04:00
jared 9d6583a08a Add LDAP avatar photos, UX polish, and TDS component upgrades
Lint / Python (flake8) (push) Successful in 1m13s
Lint / JS (eslint) (push) Successful in 9s
Security / Python Security (bandit) (push) Failing after 45s
Test / Python Tests (pytest) (push) Successful in 57s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 5s
- Add /api/avatar endpoint querying lldap for user jpegPhoto; disk cache
  with sentinel pattern avoids repeat LDAP hits for users without photos
- Add ldap3 dependency and ldap config block to config.json
- Wire lt-avatar img overlay in base.html with capture-phase error
  fallback (lt-avatar-img-err) to reveal initials when image is absent
- Fix lt-avatar CSS shim: position:relative + absolute inset on img
  (local base.css was missing these; added to style.css)
- Replace all empty-state paragraphs with proper lt-empty-state markup
  (icon + title + body) across index, suppressions, inspector, app.js
- Add lt-spinner--cyan next to refresh button; shows during refreshAll()
- Replace inspector panel-section-title with lt-divider throughout
- Add data-tooltip attributes to SFP DOM metrics, TX/RX/Carrier/Duplex/
  Auto-neg/Error labels in links.html and inspector panel
- Add tooltips to events table column headers (Sev, First Seen, Failures)
- Fix links.html host panel timestamp (was reading sample.updated which
  is always undefined; now uses data.updated)
- Fix UniFi status text casing (Online→ONLINE to match server render)
- Remove dead topo-status-* class manipulation from updateTopology()
- Always render alert-count-badge; toggle display:none when count is 0
- Fix double UniFi get_devices() call in monitor.py run loop
- Fix chip-critical animation (was using green pulse-glow; now red)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 21:09:56 -04:00
jared cabdbc24ad fix: resolve bandit B324/B104 and flake8 E302/E303/E501 in app.py
Lint / Python (flake8) (push) Successful in 42s
Lint / JS (eslint) (push) Successful in 6s
Test / Python Tests (pytest) (push) Successful in 51s
Lint / Notify on failure (push) Has been skipped
Lint / Deploy (push) Successful in 2s
Security / Python Security (bandit) (push) Successful in 1m10s
- Add nosec B324 to md5 avatar-colour call (non-security deterministic hash)
- Extend nosec on host='0.0.0.0' to cover B104 alongside existing B201
- Fix E302 (missing blank line before template_filter decorator)
- Fix E303 (4 blank lines → 2 before _purge_old_jobs_loop)
- Add extend-exclude = node_modules to .flake8 so CI --exclude flag
  doesn't override config and third-party JS Python helpers stay ignored

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-25 20:51:41 -04:00
jared c45dd007d1 Fix field name mismatches, add events filter, in-place suppression refresh
Lint / Python (flake8) (push) Failing after 50s
Lint / JS (eslint) (push) Successful in 7s
Test / Python Tests (pytest) (push) Successful in 51s
Lint / Notify on failure (push) Successful in 2s
Lint / Deploy (push) Has been skipped
Security / Python Security (bandit) (push) Failing after 59s
- links.html: fix all field name bugs (auto_negotiation→autoneg, full_duplex,
  tx/rx_errors/drops_per_sec→_rate, tx/rx_bytes_per_sec→_rate, poe_total_w/poe_max_w
  computed from ports, renderUnifiSwitches uses top-level updated timestamp)
- suppressions.html: in-place DOM refresh after create/remove (no page reload),
  datalist autocomplete for target names, form reset after submit
- inspector.html: ESC key closes detail panel via lt.keys.on
- index.html: events filter bar with search input + severity pills (All/Critical/Warning),
  MutationObserver re-applies filter after dynamic updates
- style.css: g-section-actions, events-filter-bar, sev-pills layout
- app.js/db.py/monitor.py: carry forward prior session fixes (Promise.allSettled,
  daemon_ok, stale connection handling, double Prometheus call, self.cfg fix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-19 23:35:02 -04:00
jared c1c3905179 Add avatar color, initials structure, and admin nav dropdown
Lint / Python (flake8) (push) Failing after 1m7s
Lint / JS (eslint) (push) Successful in 10s
Security / Python Security (bandit) (push) Failing after 1m31s
Test / Python Tests (pytest) (push) Successful in 1m8s
Lint / Notify on failure (push) Successful in 2s
Lint / Deploy (push) Has been skipped
- app.py: avatar_color Jinja filter using deterministic hash → lt-avatar--orange/green/purple
- base.html: proper lt-avatar--sm with lt-avatar-initials span and color class; multi-word initials support
- base.html: admin users get lt-nav-dropdown for Suppressions; non-admins see flat link; mobile drawer hides Suppressions for non-admins

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-18 23:54:01 -04:00
jared b10eded514 Suppress bandit B201 false positive in dev runner
Lint / Python (flake8) (push) Failing after 20s
Lint / JS (eslint) (push) Successful in 7s
Security / Python Security (bandit) (push) Failing after 22s
Test / Python Tests (pytest) (push) Successful in 29s
Lint / Deploy (push) Has been skipped
app.run(debug=True) is only reached via __main__ (dev mode).
Production runs gunicorn, never this block.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-14 12:40:10 -04:00
jared 3dfcd5903a ci: add flake8 lint workflow; fix unused imports
Lint / Python (flake8) (push) Failing after 4s
Adds .gitea/workflows/lint.yml running flake8 with .flake8 config.
Removes unused imports (flask.redirect, flask.url_for, time, typing.Tuple).
Config ignores E221/E305 (intentional column alignment and function spacing).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-13 22:23:26 -04:00
jared e2b65db2fc Add pagination to event queries, input validation, daily event purge
- get_active_events() now takes limit/offset (default 200) to cap unbounded queries
- count_active_events() added to return total for pagination display
- /api/events supports ?limit=, ?offset=, ?status= query params (max 1000)
- /api/status includes total_active count alongside paginated events list
- index() route passes total_active to template for server-side truncation notice
- Show "Showing X of Y" notice in dashboard when events are truncated
- Suppression POST validates: reason ≤500 chars, target_name/detail ≤255 chars
- _purge_old_jobs_loop runs purge_old_resolved_events(90d) once per day

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 20:32:32 -04:00
jared 0ca6b1f744 feat: link health summary, recently resolved panel, event duration
- dashboard: pass recent_resolved (last 24h, limit 10) to index template;
  render "Recently Resolved" section showing type, target, resolved time,
  and calculated duration (first_seen → resolved_at)
- dashboard: event-age spans now also update via setInterval; duration
  shown for resolved events (e.g. "2h 15m")
- links page: link health summary panel shows server iface count,
  error/flap counts, switch port up/down, PoE total draw/capacity bar;
  only shows problematic stats if non-zero; shows "All OK ✔" when clean
- style.css: new classes for summary panel, resolved row/badge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 21:48:40 -04:00
jared 17d3b7d227 New features: stale banner, tab title alerts, health checks, DB housekeeping
static/app.js:
- Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF'
- Stale monitoring banner: injected into .main if last_check > 15 min old,
  warns operator that the monitor daemon may be down

static/style.css:
- .stale-banner: amber top-border warning strip

app.py:
- /health now checks DB connectivity and monitor freshness (last_check age)
  Returns 503 + degraded status if DB unreachable or monitor stale >20min

db.py:
- cleanup_expired_suppressions(): marks time-limited suppressions inactive when
  expires_at <= NOW() (was only filtered in SELECTs, never marked inactive)
- purge_old_resolved_events(days=90): deletes old resolved events to prevent
  unbounded table growth

monitor.py:
- Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 21:35:32 -04:00
jared 14eaa6a8c9 De-hardcode ticket URL and cluster name; improve diagnostic polling UX
app.py:
- Context processor injects config.ticket_api.web_url into all templates
  (falls back to 'http://t.lotusguild.org/ticket/' if not set in config)

templates/base.html:
- Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads

static/app.js:
- Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain

templates/index.html:
- Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain

monitor.py:
- CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name
  from config monitor.cluster_name, falling back to the constant
- All CLUSTER_NAME references inside class methods replaced with self.cluster_name

templates/inspector.html:
- pollDiagnostic() .catch() now clears interval and shows error message instead
  of silently ignoring network failures during active polling

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 14:31:57 -04:00
jared f8395dcd24 Fix port_idx type coercion and add logging to silent except blocks
- port_idx now coerced to int() with 400 on invalid type (prevents string/int mismatch)
- api_network and api_links bare except blocks now log errors instead of silently passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 17:35:41 -04:00
jared 0335845101 Security and reliability fixes: input validation, logging, job cleanup
- C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder
- H6: Upgrade Pulse failure logging from debug to error so operators see outages
- M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min)
- M7: Background thread marks jobs stuck in 'running' > 5 min as errored

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 17:30:50 -04:00
jared b1dd5f9cad feat: deep link diagnostics via Pulse SSH
Adds comprehensive per-port link troubleshooting triggered from the
Inspector panel when a port has an LLDP-identified server counterpart.

- diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier,
  operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link,
  ip addr, ip route, dmesg, lldpctl); parsers for all sections; health
  analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH,
  SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.)
- monitor.py: PulseClient now tracks last_execution_id so callers can
  link back to the raw Pulse execution URL
- app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon
  thread background execution and 10-minute in-memory job store
- inspector.html: "Run Link Diagnostics" button (shown only when LLDP
  host is resolvable); full results panel: health banner, physical layer,
  SFP/DOM with power bars, NIC error counters, collapsible ethtool -S,
  flow control/ring buffers, driver info, LLDP 2-col validation,
  collapsible dmesg, switch port summary, "View in Pulse" link
- style.css: all .diag-* CSS classes with terminal aesthetic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 16:03:54 -05:00
jared 0278dad502 feat: inspector page, link debug enhancements, security hardening
- Add /inspector page: visual model-accurate switch chassis diagrams
  (USF5P, USL8A, US24PRO, USPPDUP, USMINI), clickable port blocks
  with color coding (green=up, amber=PoE, cyan=uplink, grey=down),
  detail panel with stats/PoE/LLDP, LLDP-based path debug side-by-side

- Link Debug: port number badges (#N), LLDP neighbor line, PoE class/max,
  collapsible host/switch panels with sessionStorage persistence

- monitor.py: collect LLDP neighbor map + PoE class/max/mode per switch
  port; PulseClient uses requests.Session() for HTTP keep-alive; add
  shlex.quote() around interface names (defense-in-depth)

- Security: suppress buttons use data-* attrs + delegated click handler
  instead of inline onclick with Jinja2 variable interpolation; remove
  | safe filter from user-controlled fields in suppressions.html;
  setDuration() takes explicit el param instead of implicit event global

- db.py: thread-local connection reuse with ping(reconnect=True) to
  avoid a new TCP handshake per query

- .gitignore: add config.json (contains credentials), __pycache__

- README: full rewrite covering architecture, all 4 pages, alert logic,
  config reference, deployment, troubleshooting, security notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 15:39:48 -05:00
jared fa7512a2c2 feat: terminal aesthetic rewrite + link debug page
- Full dark terminal aesthetic (Pulse/TinkerTickets style):
  - #0a0a0a background, #00ff41 green, #ffb000 amber, #00ffff cyan
  - CRT scanline overlay, phosphor glow, ASCII corner pseudoelements
  - Bracket-notation badges [CRITICAL], monospace font throughout
  - style.css, base.html, index.html, suppressions.html all rewritten

- New Link Debug page (/links, /api/links):
  - Per-host, per-interface cards with speed/duplex/port type/auto-neg
  - Traffic bars (TX cyan, RX green) with rate labels
  - Error/drop counters, carrier change history
  - SFP/DOM optical panel: vendor, temp, voltage, bias, TX/RX power dBm bars
  - RX-TX delta shown; color-coded warn/crit thresholds
  - Auto-refresh every 60s, anchor-jump to #hostname

- LinkStatsCollector in monitor.py:
  - SSHes to each host (one connection, all ifaces batched)
  - Parses ethtool + ethtool -m (SFP DOM) output
  - Merges with Prometheus traffic/error/carrier metrics
  - Stores as link_stats in monitor_state table

- config.json: added ssh section for ethtool collection
- app.js: terminal chip style consistency (uppercase, ● bullet)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-02 12:43:11 -05:00
jared 0c0150f698 Complete rewrite: full-featured network monitoring dashboard
- Two-service architecture: Flask web app (gandalf.service) + background
  polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
  6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
  only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
  interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
  PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 23:03:18 -05:00
jared 004c97f492 interface update 2025-02-08 00:32:25 -05:00
jared 9f92ac5c1a fixed indent 2025-02-08 00:16:45 -05:00
jared ea5e86ef33 lots logs 2025-02-08 00:16:06 -05:00
jared 19224d14df added raw back 2025-02-08 00:13:06 -05:00
jared 68beb7b1c4 more dynamic 2025-02-08 00:11:28 -05:00
jared dc117b276e fix syntax error 2025-02-08 00:04:42 -05:00
jared b67a5d10c2 dynamic devices 2025-02-08 00:03:01 -05:00
jared 4c90fbb168 interfaces update 2025-02-07 23:57:34 -05:00
jared 02d03f4f3f json update 2025-02-07 23:26:41 -05:00
jared 1549f39c2c v2 api 2025-02-07 23:24:36 -05:00
jared a2c8368439 wrong indentation 2025-02-07 23:20:14 -05:00
jared 75cdef709f Bearer token 2025-02-07 23:19:50 -05:00
jared 0417106e88 Auth order plz 2025-02-07 23:17:12 -05:00
jared 3c4a9651b5 CSRF token 2025-02-07 23:14:36 -05:00
jared de24b9ef98 bearer token 2025-02-07 23:12:29 -05:00
jared 37022b132f updated stats 2025-02-07 23:09:11 -05:00
jared 5d5aea3cf4 acquire site id 2025-02-07 23:06:49 -05:00
jared de7b731269 site id to default 2025-02-07 23:01:58 -05:00
jared ac3eaf4f92 DEBUG BROSKI 2025-02-07 22:54:43 -05:00
jared c939a37344 fixed endpoint 2025-02-07 22:52:19 -05:00
jared baf7d23cd0 new header 2025-02-07 22:50:52 -05:00