- inspector.html: fix LLDP neighbor label in port blocks — port.lldp_table never exists; data is at port.lldp (dict with system_name/chassis_id); both port block renderers corrected
- db.py: remove dead 'target_detail IS NULL' branch in suppression check — target_detail is always stored as '' not NULL; query simplified to target_detail=''
- app.py: resolve cache_dir/cache_file/sentinel to absolute paths; guard against path escape before use
- app.py: wrap sentinel os.path.getmtime() in try/except OSError to handle TOCTOU deletion race
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: split 'with open(sentinel): pass' onto two lines (flake8 E701)
- tests/test_diagnose.py: rename test and assert StrictHostKeyChecking=accept-new (not =no which was fixed earlier)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: wrap int(cache_ttl) in try/except so a misconfigured non-integer value falls back to 3600 instead of raising ValueError
- base.html: use Jinja2 tojson filter for ticket_web_url to ensure proper JS string escaping regardless of URL contents
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: move conn.unbind() into finally block in api_avatar() so connection is always closed even if conn.search() throws
- app.py: remove elapsed-time strings from /health response (unauthenticated endpoint no longer leaks monitor timing)
- app.py: add after_request hook setting X-Content-Type-Options, X-Frame-Options, Referrer-Policy on all responses
- app.py: add 10 MB size guard on link_stats before JSON parse; log actual exception on parse failure
- app.py: wrap suppressions_page network_snapshot parse in try/except (same protection as index page)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: fail loudly if LDAP bind_pw is not configured rather than attempting anonymous bind
- app.py: validate expires_minutes is 1–43200 (max 30 days) before storing suppression
- app.py: wrap network_snapshot JSON parse in try/except so a corrupt DB value returns degraded page instead of 500
- app.py: prune _diag_rate entries inactive for >1h to prevent unbounded growth
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: validate server_name from LLDP with fullmatch before use in logs/lookups (prevents log injection)
- app.py: validate each mgmt_ip candidate before assigning host_ip (avoids assigning non-IP string that then fails later check)
- app.py: log actual exception in link_stats JSON parse error
- inspector.html: clear _diagPollTimer in closePanel() so timer doesn't orphan when panel is closed mid-poll
- monitor.py: sleep 30s after a monitor loop exception before resuming normal poll interval
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: replace raw str(e) in diagnostic _run() with generic client message; log internally only
- app.py: /health endpoint no longer leaks exception strings to unauthenticated callers; errors logged server-side
- monitor.py: UniFi SSL verification now defaults True, configurable via config.json unifi.verify_ssl; urllib3 warning suppression scoped to verify=False only (removed global disable)
- monitor.py: Pulse execution_id extracted with .get() + explicit None check to avoid KeyError on malformed response
- monitor.py: interface name regex drops '@' (not a valid kernel interface char) to match app.py and fix inconsistency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
security:
- Fix bare open(sentinel, 'w').close() file descriptor leak; use
context manager instead
- Store requesting username in _diag_jobs at creation time; return 403
from api_diagnose_poll if the polling user does not match the job owner
accessibility:
- Add aria-live="polite" aria-atomic="true" to .status-chips container
so screen readers announce critical/warning count changes on refresh
- Add aria-controls="events-table-wrap" to critical and warning stat
cards so assistive tech knows these buttons control the events table
- Add aria-hidden sync to topology setCollapsed() — hidden topology
content is now removed from the accessibility tree when collapsed,
preventing keyboard focus from entering invisible elements
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The require_auth decorator was interpolating user['username'] and the
allowed_groups list directly into HTML strings. An attacker with a
crafted username or control over group names could inject arbitrary HTML.
Use html.escape() on both values before insertion.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract identical suppression-annotation loop from index() and
api_status() into _annotate_suppressions() helper to eliminate DRY
violation
- Improve stuck-job error message: 'thread crash' → 'no activity for
5 minutes' (less alarming, more accurate)
- Remove orphaned .events-filter-bar CSS class (never referenced in
any template or JS file)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Active events now carry an is_suppressed boolean (added in api_status()
and the index() route via check_suppressed() against the pre-loaded
suppression list). The events table renders a muted '🔕 sup' badge next
to the severity and dims the entire row (.row-suppressed) so operators
can immediately see which firing alerts are silenced.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bandit flags hardcoded /tmp strings as CWE-377 (insecure temp file).
Use tempfile.gettempdir() for the avatar cache dir default so the
path resolves correctly on all platforms and passes the security scan.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add /api/avatar endpoint querying lldap for user jpegPhoto; disk cache
with sentinel pattern avoids repeat LDAP hits for users without photos
- Add ldap3 dependency and ldap config block to config.json
- Wire lt-avatar img overlay in base.html with capture-phase error
fallback (lt-avatar-img-err) to reveal initials when image is absent
- Fix lt-avatar CSS shim: position:relative + absolute inset on img
(local base.css was missing these; added to style.css)
- Replace all empty-state paragraphs with proper lt-empty-state markup
(icon + title + body) across index, suppressions, inspector, app.js
- Add lt-spinner--cyan next to refresh button; shows during refreshAll()
- Replace inspector panel-section-title with lt-divider throughout
- Add data-tooltip attributes to SFP DOM metrics, TX/RX/Carrier/Duplex/
Auto-neg/Error labels in links.html and inspector panel
- Add tooltips to events table column headers (Sev, First Seen, Failures)
- Fix links.html host panel timestamp (was reading sample.updated which
is always undefined; now uses data.updated)
- Fix UniFi status text casing (Online→ONLINE to match server render)
- Remove dead topo-status-* class manipulation from updateTopology()
- Always render alert-count-badge; toggle display:none when count is 0
- Fix double UniFi get_devices() call in monitor.py run loop
- Fix chip-critical animation (was using green pulse-glow; now red)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- app.py: avatar_color Jinja filter using deterministic hash → lt-avatar--orange/green/purple
- base.html: proper lt-avatar--sm with lt-avatar-initials span and color class; multi-word initials support
- base.html: admin users get lt-nav-dropdown for Suppressions; non-admins see flat link; mobile drawer hides Suppressions for non-admins
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
app.run(debug=True) is only reached via __main__ (dev mode).
Production runs gunicorn, never this block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- get_active_events() now takes limit/offset (default 200) to cap unbounded queries
- count_active_events() added to return total for pagination display
- /api/events supports ?limit=, ?offset=, ?status= query params (max 1000)
- /api/status includes total_active count alongside paginated events list
- index() route passes total_active to template for server-side truncation notice
- Show "Showing X of Y" notice in dashboard when events are truncated
- Suppression POST validates: reason ≤500 chars, target_name/detail ≤255 chars
- _purge_old_jobs_loop runs purge_old_resolved_events(90d) once per day
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- dashboard: pass recent_resolved (last 24h, limit 10) to index template;
render "Recently Resolved" section showing type, target, resolved time,
and calculated duration (first_seen → resolved_at)
- dashboard: event-age spans now also update via setInterval; duration
shown for resolved events (e.g. "2h 15m")
- links page: link health summary panel shows server iface count,
error/flap counts, switch port up/down, PoE total draw/capacity bar;
only shows problematic stats if non-zero; shows "All OK ✔" when clean
- style.css: new classes for summary panel, resolved row/badge
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
static/app.js:
- Browser tab title updates to show alert count: '(3 CRIT) GANDALF' or '(2 WARN) GANDALF'
- Stale monitoring banner: injected into .main if last_check > 15 min old,
warns operator that the monitor daemon may be down
static/style.css:
- .stale-banner: amber top-border warning strip
app.py:
- /health now checks DB connectivity and monitor freshness (last_check age)
Returns 503 + degraded status if DB unreachable or monitor stale >20min
db.py:
- cleanup_expired_suppressions(): marks time-limited suppressions inactive when
expires_at <= NOW() (was only filtered in SELECTs, never marked inactive)
- purge_old_resolved_events(days=90): deletes old resolved events to prevent
unbounded table growth
monitor.py:
- Calls cleanup_expired_suppressions() and purge_old_resolved_events() each cycle
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
app.py:
- Context processor injects config.ticket_api.web_url into all templates
(falls back to 'http://t.lotusguild.org/ticket/' if not set in config)
templates/base.html:
- Inject GANDALF_CONFIG JS global with ticket_web_url before app.js loads
static/app.js:
- Use GANDALF_CONFIG.ticket_web_url instead of hardcoded domain
templates/index.html:
- Use {{ config.ticket_api.web_url }} Jinja var instead of hardcoded domain
monitor.py:
- CLUSTER_NAME constant kept as default; NetworkMonitor now reads cluster_name
from config monitor.cluster_name, falling back to the constant
- All CLUSTER_NAME references inside class methods replaced with self.cluster_name
templates/inspector.html:
- pollDiagnostic() .catch() now clears interval and shows error message instead
of silently ignoring network failures during active polling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- port_idx now coerced to int() with 400 on invalid type (prevents string/int mismatch)
- api_network and api_links bare except blocks now log errors instead of silently passing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- C5: Validate host_ip (IPv4 check) and iface (allowlist regex) before SSH command builder
- H6: Upgrade Pulse failure logging from debug to error so operators see outages
- M6: Replace per-request O(n) purge with background daemon thread (runs every 2 min)
- M7: Background thread marks jobs stuck in 'running' > 5 min as errored
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds comprehensive per-port link troubleshooting triggered from the
Inspector panel when a port has an LLDP-identified server counterpart.
- diagnose.py: DiagnosticsRunner with 15-section SSH command (carrier,
operstate, sysfs counters, ethtool, ethtool -i/-a/-g/-S/-m, ip link,
ip addr, ip route, dmesg, lldpctl); parsers for all sections; health
analyzer with 14 check codes (NO_CARRIER, HALF_DUPLEX, SPEED_MISMATCH,
SFP_RX_CRITICAL, CARRIER_FLAPPING, CRC_ERRORS_HIGH, LLDP_MISMATCH, etc.)
- monitor.py: PulseClient now tracks last_execution_id so callers can
link back to the raw Pulse execution URL
- app.py: POST /api/diagnose + GET /api/diagnose/<job_id> with daemon
thread background execution and 10-minute in-memory job store
- inspector.html: "Run Link Diagnostics" button (shown only when LLDP
host is resolvable); full results panel: health banner, physical layer,
SFP/DOM with power bars, NIC error counters, collapsible ethtool -S,
flow control/ring buffers, driver info, LLDP 2-col validation,
collapsible dmesg, switch port summary, "View in Pulse" link
- style.css: all .diag-* CSS classes with terminal aesthetic
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Two-service architecture: Flask web app (gandalf.service) + background
polling daemon (gandalf-monitor.service)
- Monitor polls Prometheus node_network_up for physical NIC states on all
6 hypervisors (added storage-01 at 10.10.10.11:9100)
- UniFi API monitoring for switches, APs, and gateway device status
- Ping reachability for hosts without node_exporter (pbs only now)
- Smart baseline: interfaces first seen as down are never alerted on;
only UP→DOWN regressions trigger tickets
- Cluster-wide P1 ticket when 3+ hosts have genuine simultaneous
interface regressions (guards against false positives on startup)
- Tinker Tickets integration with 24-hour hash-based deduplication
- Alert suppression: manual toggle or timed windows (30m/1h/4h/8h)
- Authelia SSO via forward-auth headers, admin group required
- Network topology: Internet → UDM-Pro → Agg Switch (10G DAC) →
PoE Switch (10G DAC) → Hosts
- MariaDB schema, suppression management UI, host/interface cards
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>