
Observability

Monitor the health and performance of your TopGun cluster using Prometheus metrics and structured logging.

Prometheus Metrics

The TopGun server exposes a /metrics endpoint that is compatible with Prometheus.

By default, this is available on the main server port (or a dedicated metrics port if configured).
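
To make the endpoint's output concrete, the sketch below parses the Prometheus text exposition format into typed samples. It is a minimal illustration, not TopGun's own tooling: `MetricSample` and `parseMetrics` are hypothetical names, and the map name `users` in the sample text is invented. The parser covers only plain `name value` and `name{label="value"} value` lines.

```typescript
// One parsed sample line from a /metrics response.
interface MetricSample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Convert a sample value, including the spellings Prometheus uses for infinity.
function toNumber(raw: string): number {
  if (raw === "+Inf") return Infinity;
  if (raw === "-Inf") return -Infinity;
  return Number(raw); // "NaN" parses to NaN
}

// Minimal parser for the Prometheus text format (hypothetical helper).
// Comment lines (# HELP, # TYPE) and unparseable lines are skipped.
function parseMetrics(text: string): MetricSample[] {
  const samples: MetricSample[] = [];
  const lineRe = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/;
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const m = lineRe.exec(line);
    if (m === null) continue;
    const labels: Record<string, string> = {};
    const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let lm: RegExpExecArray | null;
    while ((lm = labelRe.exec(m[2] ?? "")) !== null) {
      labels[lm[1]] = lm[2];
    }
    samples.push({ name: m[1], labels, value: toNumber(m[3]) });
  }
  return samples;
}

// Example input; the "users" map name is made up for illustration.
const exampleBody = [
  "# TYPE topgun_connected_clients gauge",
  "topgun_connected_clients 42",
  'topgun_ops_total{type="PUT",map="users"} 1027',
].join("\n");
const exampleSamples = parseMetrics(exampleBody);
```

In practice the input would be the response body of a GET request to /metrics on whichever port the server exposes it.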

Key Metrics

Metric Name | Type | Description
topgun_connected_clients | Gauge | Number of currently connected WebSocket clients.
topgun_map_size_items | Gauge | Number of items in a specific map (labeled by map name).
topgun_ops_total | Counter | Total number of operations processed. Labels: type (PUT, GET, DELETE, SUBSCRIBE), map.
topgun_memory_usage_bytes | Gauge | Current heap memory usage of the server process.
topgun_cluster_members | Gauge | Number of active nodes in the cluster.

Standard Node.js metrics (CPU, Event Loop, GC) are also exported with the topgun_ prefix.

Event Routing Metrics

Metric Name | Type | Description
topgun_events_routed_total | Counter | Total events processed for routing to subscribers.
topgun_events_filtered_by_subscription | Counter | Events not sent because no clients were subscribed.
topgun_subscribers_per_event | Summary | Distribution of subscriber count per event (quantiles).

Event Queue Metrics

Metric Name | Type | Description
topgun_event_queue_size | Gauge | Current number of events in the bounded queue.
topgun_event_queue_enqueued_total | Counter | Total events successfully enqueued.
topgun_event_queue_dequeued_total | Counter | Total events dequeued for processing.
topgun_event_queue_rejected_total | Counter | Events rejected because the queue was at capacity. Alert if increasing.

Backpressure Metrics

Metric Name | Type | Description
topgun_backpressure_sync_forced_total | Counter | Number of times synchronous processing was forced.
topgun_backpressure_pending_ops | Gauge | Current number of pending async operations.
topgun_backpressure_waits_total | Counter | Number of times processing had to wait for capacity.
topgun_backpressure_timeouts_total | Counter | Number of backpressure timeouts. Alert if increasing.

Connection Rate Limiting Metrics

Metric Name | Type | Description
topgun_connections_accepted_total | Counter | Total connections accepted by the rate limiter.
topgun_connections_rejected_total | Counter | Connections rejected due to rate limiting.
topgun_connections_pending | Gauge | Current pending connection handshakes.
topgun_connection_rate_per_second | Gauge | Current connection rate (connections/second).

Alert Recommendations

Key Metrics to Alert On

  • topgun_event_queue_rejected_total: events are being dropped because the queue is at capacity.
  • topgun_backpressure_timeouts_total: operations are timing out while waiting for capacity.
  • topgun_connections_rejected_total: clients are being rejected (rate limit reached or a possible DDoS).
  • topgun_event_queue_size: queue depth is approaching eventQueueCapacity.
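
The recommendations above can be sketched as a small check over two consecutive scrapes. This is an illustrative assumption about how such alerts might be evaluated, not a TopGun feature: `MetricSnapshot`, `evaluateAlerts`, and the 0.8 queue threshold are hypothetical, and `queueCapacity` stands for whatever value eventQueueCapacity is set to.

```typescript
// Metric values from one scrape, keyed by metric name (hypothetical shape).
type MetricSnapshot = Record<string, number>;

// Counters for which any increase between scrapes should raise an alert.
const ALERT_ON_INCREASE: string[] = [
  "topgun_event_queue_rejected_total",
  "topgun_backpressure_timeouts_total",
  "topgun_connections_rejected_total",
];

function evaluateAlerts(
  prev: MetricSnapshot,
  curr: MetricSnapshot,
  queueCapacity: number,
  queueThreshold = 0.8, // assumed default: alert at 80% of capacity
): string[] {
  const alerts: string[] = [];
  for (const name of ALERT_ON_INCREASE) {
    const before = prev[name] ?? 0;
    const after = curr[name] ?? 0;
    // A counter that went down was reset (for example by a restart),
    // so the new value itself is the increase since the reset.
    const delta = after >= before ? after - before : after;
    if (delta > 0) alerts.push(`${name} increased by ${delta}`);
  }
  const depth = curr["topgun_event_queue_size"] ?? 0;
  if (queueCapacity > 0 && depth / queueCapacity >= queueThreshold) {
    alerts.push(`topgun_event_queue_size at ${depth}/${queueCapacity}`);
  }
  return alerts;
}
```

The counters are compared as deltas rather than absolute values because a counter only ever grows; a large total says nothing about whether the problem is happening now.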

Structured Logging

TopGun uses Pino for high-performance, structured JSON logging.

This makes it easy to ingest logs into systems like ELK, Datadog, or Loki.

Log Example

server.log
{
  "level": 30,
  "time": 1678901234567,
  "pid": 12345,
  "hostname": "topgun-server-0",
  "msg": "Server started on port 8080",
  "nodeId": "node-xyz-123"
}

Logs include context such as nodeId, clientId, and requestId where applicable, allowing you to trace requests across the distributed system.
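
As a rough sketch of how these fields support tracing, the code below reads newline-delimited JSON (Pino's default output, one object per line), keeps the entries for one requestId, and orders them by time. `LogEntry`, `parseLogLines`, and `traceRequest` are hypothetical helpers, not part of TopGun. The level names follow Pino's default numeric levels, so the 30 in the example above is info.

```typescript
// Pino's default numeric log levels.
const PINO_LEVELS: Record<number, string> = {
  10: "trace",
  20: "debug",
  30: "info",
  40: "warn",
  50: "error",
  60: "fatal",
};

// Fields used here; real entries may carry more (pid, hostname, ...).
interface LogEntry {
  level: number;
  time: number;
  msg: string;
  nodeId?: string;
  clientId?: string;
  requestId?: string;
}

// Parse newline-delimited JSON, skipping blank and non-JSON lines.
function parseLogLines(ndjson: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    try {
      entries.push(JSON.parse(line) as LogEntry);
    } catch {
      // Not a JSON log line; ignore it.
    }
  }
  return entries;
}

// Collect one request's entries from all nodes in chronological order.
function traceRequest(entries: LogEntry[], requestId: string): string[] {
  return entries
    .filter((e) => e.requestId === requestId)
    .sort((a, b) => a.time - b.time)
    .map((e) => {
      const level = PINO_LEVELS[e.level] ?? String(e.level);
      const when = new Date(e.time).toISOString();
      return `${when} [${level}] ${e.nodeId ?? "?"}: ${e.msg}`;
    });
}
```

In a real deployment the same filter would normally be expressed as a query in the log system (ELK, Datadog, or Loki) rather than in application code.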