Observability

Scotty includes a comprehensive observability stack for monitoring application health, performance, and behavior. The stack provides metrics, distributed tracing, and visualization through industry-standard tools.

Architecture

Scotty Application
       ↓ (OTLP over gRPC)
OpenTelemetry Collector (port 4317)
       ├─→ Jaeger (distributed traces)
       └─→ VictoriaMetrics (metrics storage)
              ↓
           Grafana (visualization & dashboards)

Components

  • OpenTelemetry Collector: Receives telemetry data from Scotty via OTLP and routes it to the appropriate backends
  • VictoriaMetrics: High-performance time-series database for metrics storage (30-day retention)
  • Jaeger: Distributed tracing backend for request traces and spans
  • Grafana: Visualization platform with pre-configured dashboards

Resource Usage

The observability stack requires approximately:

  • Memory: 180-250 MB total
  • CPU: Minimal (< 5% on modern systems)
  • Disk: ~1-2 GB for 30 days of metrics retention

Prometheus Compatibility & Flexibility

All metrics are fully Prometheus-compatible. The stack uses open standards (OTLP, PromQL, W3C Trace Context) and components are interchangeable.

Metric Format

  • OpenTelemetry names map to Prometheus names: dots become underscores, and counters gain a _total suffix (scotty.metric.name → scotty_metric_name_total)
  • Standard types: Counter, Gauge, Histogram, UpDownCounter
  • Attributes become labels (method, status, path)
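For example, the HTTP request counter described later in this guide ends up as a labeled Prometheus series; the label values here are illustrative:

# OpenTelemetry attributes appear as Prometheus labels
scotty_http_requests_total{method="GET", status="200"}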

Replace Components as Needed

Use Prometheus instead of VictoriaMetrics: In otel-collector-config.yaml, switch the metrics exporter from prometheusremotewrite to the prometheus (scrape) exporter, then swap VictoriaMetrics for Prometheus in docker-compose.
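A minimal sketch of that exporter change, assuming the stock prometheus exporter of the OpenTelemetry Collector (the port is an arbitrary choice):

# otel-collector-config.yaml - expose a scrape endpoint instead of remote write
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # Prometheus then scrapes otel-collector:8889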

Alternative backends: Thanos, Cortex, M3DB, InfluxDB, Datadog, New Relic, Honeycomb, Grafana Cloud

Alternative visualization: Prometheus UI, VictoriaMetrics vmui, Chronograf, commercial dashboards

Alternative tracing: Zipkin, Tempo, Elasticsearch + Jaeger, Lightstep, Honeycomb

Multi-backend export example:

# otel-collector-config.yaml - export to multiple destinations
service:
  pipelines:
    metrics:
      exporters: [prometheusremotewrite/victoria, prometheusremotewrite/thanos, otlp/datadog]

Integration Patterns

Remote write to an existing Prometheus:

exporters:
  prometheusremotewrite:
    endpoint: "https://your-prometheus.company.com/api/v1/write"

Federation from VictoriaMetrics:

# prometheus.yml
scrape_configs:
  - job_name: 'scotty'
    metrics_path: '/api/v1/export/prometheus'
    params:
      match[]: ['{__name__=~"scotty_.*"}']
    static_configs:
      - targets: ['victoriametrics:8428']

Service discovery: Standard Kubernetes/Consul Prometheus SD works with VictoriaMetrics API.
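For example, a standard Kubernetes pod-discovery job (a sketch; the prometheus.io/scrape annotation is the common convention, but your cluster may use different annotations):

# prometheus.yml - scrape pods that opt in via annotation
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'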

Why VictoriaMetrics Is the Default

VictoriaMetrics was chosen for development convenience: it uses less memory than Prometheus, ships as a single binary, is fully Prometheus-compatible, and is free. Swap in Prometheus for production if you prefer.

Quick Start

Prerequisites

The observability stack requires Traefik for .ddev.site domain routing. Start Traefik first:

cd apps/traefik
docker-compose up -d

Starting the Observability Stack

cd observability
docker-compose up -d

This will start all four services:

  • OpenTelemetry Collector
  • VictoriaMetrics
  • Jaeger
  • Grafana

Enabling Metrics in Scotty

Configure Scotty to export telemetry data using the SCOTTY__TELEMETRY environment variable:

Enable both metrics and traces:

SCOTTY__TELEMETRY=metrics,traces cargo run --bin scotty

Enable only metrics:

SCOTTY__TELEMETRY=metrics cargo run --bin scotty

Production deployment (in docker-compose.yml or .env):

environment:
  - SCOTTY__TELEMETRY=metrics,traces

Accessing Services

Once running, access the services at:

Service          URL                        Credentials
Grafana          http://grafana.ddev.site   admin/admin
Jaeger UI        http://jaeger.ddev.site    (none)
VictoriaMetrics  http://vm.ddev.site        (none)

Available Metrics

Scotty exports comprehensive metrics covering all major subsystems. All metrics use the scotty. prefix (which appears as scotty_ after Prometheus conversion).

Log Streaming Metrics

Metric Name                          Type       Description
scotty_log_streams_active            Gauge      Number of active log streams
scotty_log_streams_total             Counter    Total log streams created
scotty_log_stream_duration_seconds   Histogram  Duration of log streaming sessions
scotty_log_stream_lines_total        Counter    Total log lines streamed to clients
scotty_log_stream_errors_total       Counter    Log streaming errors

Use Cases:

  • Monitor concurrent log stream load
  • Detect log streaming errors
  • Analyze log stream duration patterns
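For instance, error rates and session-length percentiles over these metrics (standard PromQL; windows are illustrative):

# Log stream errors per second, 5-minute window
rate(scotty_log_stream_errors_total[5m])

# P95 log stream session duration
histogram_quantile(0.95, rate(scotty_log_stream_duration_seconds_bucket[5m]))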

Shell Session Metrics

Metric Name                             Type       Description
scotty_shell_sessions_active            Gauge      Number of active shell sessions
scotty_shell_sessions_total             Counter    Total shell sessions created
scotty_shell_session_duration_seconds   Histogram  Shell session duration
scotty_shell_session_errors_total       Counter    Shell session errors
scotty_shell_session_timeouts_total     Counter    Sessions ended due to timeout

Use Cases:

  • Monitor active shell connections
  • Track session timeout rates
  • Identify shell session errors
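For example, the timeout rate can be expressed as a ratio of the two counters above:

# Fraction of shell sessions ending in a timeout
rate(scotty_shell_session_timeouts_total[5m]) / rate(scotty_shell_sessions_total[5m])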

WebSocket Metrics

Metric Name                                Type     Description
scotty_websocket_connections_active        Gauge    Active WebSocket connections
scotty_websocket_messages_sent_total       Counter  Messages sent to clients
scotty_websocket_messages_received_total   Counter  Messages received from clients
scotty_websocket_auth_failures_total       Counter  WebSocket authentication failures

Use Cases:

  • Monitor real-time connection count
  • Track message throughput
  • Detect authentication issues
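For example, total message throughput combines the two message counters:

# WebSocket messages per second, both directions
rate(scotty_websocket_messages_sent_total[5m]) + rate(scotty_websocket_messages_received_total[5m])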

Task Output Streaming Metrics

Metric Name                      Type       Description
scotty_tasks_active              Gauge      Active task output streams
scotty_tasks_total               Counter    Total tasks executed
scotty_task_duration_seconds     Histogram  Task execution duration
scotty_task_failures_total       Counter    Failed tasks
scotty_task_output_lines_total   Counter    Task output lines streamed

Use Cases:

  • Monitor task execution load
  • Track task failure rates
  • Analyze output streaming performance
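For example, the failure rate is a ratio of the two task counters:

# Fraction of tasks that fail
rate(scotty_task_failures_total[5m]) / rate(scotty_tasks_total[5m])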

HTTP Server Metrics

Metric Name                            Type           Description
scotty_http_requests_active            UpDownCounter  Currently processing requests
scotty_http_requests_total             Counter        Total HTTP requests
scotty_http_request_duration_seconds   Histogram      Request processing time

Attributes:

  • method: HTTP method (GET, POST, etc.)
  • path: Request path
  • status: HTTP status code

Use Cases:

  • Monitor API endpoint performance
  • Track request rates by endpoint
  • Identify slow requests
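For example, to break request rates down per endpoint using the path label:

# Requests per second per endpoint
sum by (path) (rate(scotty_http_requests_total[5m]))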

Memory Metrics

Metric Name                  Type   Description
scotty_memory_rss_bytes      Gauge  Resident Set Size (RSS) in bytes
scotty_memory_virtual_bytes  Gauge  Virtual memory size in bytes

Use Cases:

  • Monitor memory consumption
  • Detect memory leaks
  • Capacity planning

Application Metrics

Metric Name                        Type       Description
scotty_apps_total                  Gauge      Total managed applications
scotty_apps_by_status              Gauge      Apps grouped by status
scotty_app_services_count          Histogram  Services per application distribution
scotty_app_last_check_age_seconds  Histogram  Time since last health check

Attributes:

  • status: Application status (running, stopped, etc.)

Use Cases:

  • Monitor application fleet size
  • Track application health check timeliness
  • Analyze service distribution
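For example, to see the fleet broken down by status using the label above:

# Applications per status (running, stopped, ...)
sum by (status) (scotty_apps_by_status)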

Tokio Runtime Metrics

Metric Name                             Type       Description
scotty_tokio_workers_count              Gauge      Number of Tokio worker threads
scotty_tokio_tasks_active               Gauge      Active instrumented tasks
scotty_tokio_tasks_dropped_total        Counter    Completed/dropped tasks
scotty_tokio_poll_count_total           Counter    Total task polls
scotty_tokio_poll_duration_seconds      Histogram  Task poll duration
scotty_tokio_slow_poll_count_total      Counter    Slow task polls (>1ms)
scotty_tokio_idle_duration_seconds      Histogram  Task idle time between polls
scotty_tokio_scheduled_count_total      Counter    Task scheduling events
scotty_tokio_first_poll_delay_seconds   Histogram  Delay from creation to first poll

Use Cases:

  • Monitor async runtime health
  • Detect slow tasks blocking the runtime
  • Optimize task scheduling
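For example, the share of slow polls is a useful runtime-health signal:

# Fraction of polls exceeding 1ms
rate(scotty_tokio_slow_poll_count_total[5m]) / rate(scotty_tokio_poll_count_total[5m])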

Grafana Dashboard

Scotty includes a pre-configured Grafana dashboard (scotty-metrics.json) that visualizes all available metrics.

Dashboard Sections

  1. Log Streaming: Active streams, throughput, duration percentiles, errors
  2. Shell Sessions: Active sessions, creation rate, duration, errors & timeouts
  3. WebSocket & Tasks: Connection metrics, message rates, task execution
  4. Memory Usage: RSS and virtual memory trends
  5. HTTP Server: Request rates, active requests, latencies
  6. Tokio Runtime: Worker threads, task lifecycle, poll metrics
  7. Application Metrics: App count, status distribution, health checks

Accessing the Dashboard

  1. Open Grafana: http://grafana.ddev.site
  2. Login with admin / admin (change on first login)
  3. Navigate to Dashboards → Scotty Metrics

The dashboard auto-refreshes every 5 seconds and shows data from the last hour by default.

PromQL Query Examples

Request Rate by HTTP Status

sum by (status) (rate(scotty_http_requests_total[5m]))

P95 Request Latency

histogram_quantile(0.95, rate(scotty_http_request_duration_seconds_bucket[5m]))

WebSocket Connection Churn

rate(scotty_websocket_connections_total[5m])

Memory Growth Rate

deriv(scotty_memory_rss_bytes[10m])

Active Resources Summary

# All active resources
scotty_log_streams_active +
scotty_shell_sessions_active +
scotty_websocket_connections_active +
scotty_tasks_active

Distributed Tracing

When traces are enabled (SCOTTY__TELEMETRY=traces or metrics,traces), Scotty exports distributed traces to Jaeger.

Viewing Traces

  1. Open Jaeger UI: http://jaeger.ddev.site
  2. Select the scotty service
  3. Search for traces by operation or timeframe

Key Operations

  • HTTP POST /apps/create: Application creation
  • HTTP GET /apps/info/{name}: Application info retrieval
  • log_stream_handler: Log streaming operations
  • shell_session_handler: Shell session management

Traces include timing information, error status, and contextual metadata for debugging request flows.

Troubleshooting

No Metrics Appearing in Grafana

  1. Check Scotty is exporting metrics:

    # Verify SCOTTY__TELEMETRY is set
    echo $SCOTTY__TELEMETRY
    
    # Should be 'metrics' or 'metrics,traces'
  2. Verify OpenTelemetry Collector is receiving data:

    docker logs otel-collector
    # Look for: "Trace received"
  3. Check VictoriaMetrics has data:

    curl http://vm.ddev.site/api/v1/label/__name__/values | jq
    # Should list scotty_* metrics
  4. Restart the stack:

    cd observability
    docker-compose restart

High Memory Usage

If VictoriaMetrics uses too much memory, adjust retention:

# observability/docker-compose.yml
services:
  victoriametrics:
    command:
      - '-retentionPeriod=14d'  # Reduce from 30d

Connection Refused Errors

Ensure Traefik is running:

docker ps | grep traefik

# If Traefik is not listed, start it:
cd apps/traefik
docker-compose up -d

Grafana Dashboard Not Loading

  1. Check dashboard file exists: observability/grafana/dashboards/scotty-metrics.json
  2. Restart Grafana: docker-compose restart grafana
  3. Check Grafana logs: docker logs grafana

Configuration

OpenTelemetry Collector

Configuration file: observability/otel-collector-config.yaml

Key settings:

  • OTLP Receiver: Port 4317 (gRPC)
  • Exporters: Jaeger (traces), Prometheus Remote Write (metrics to VictoriaMetrics)
  • Batch Processor: Batches telemetry for efficiency
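Put together, a minimal metrics pipeline matching these settings might look like the following sketch (the victoriametrics hostname matches the docker-compose service used elsewhere in this guide; the shipped config may differ in detail):

# otel-collector-config.yaml - minimal metrics pipeline (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]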

VictoriaMetrics

Configuration via docker-compose environment:

  • Retention: 30 days (-retentionPeriod=30d)
  • Storage path: /victoria-metrics-data
  • HTTP port: 8428

Grafana

Configuration in observability/grafana/provisioning/:

  • Datasources: VictoriaMetrics (Prometheus type)
  • Dashboards: Auto-provisioned from dashboards/ directory
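A datasource provisioning file for this setup might look like the following sketch (the file name is illustrative; Grafana loads any YAML placed under provisioning/datasources/):

# provisioning/datasources/victoriametrics.yaml (name illustrative)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus           # VictoriaMetrics speaks the Prometheus query API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true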

Production Recommendations

Resource Allocation

For production deployments, allocate resources based on scale:

Small deployment (< 10 apps):

  • VictoriaMetrics: 256 MB memory
  • OpenTelemetry Collector: 128 MB memory
  • Grafana: 256 MB memory

Medium deployment (10-50 apps):

  • VictoriaMetrics: 512 MB memory
  • OpenTelemetry Collector: 256 MB memory
  • Grafana: 512 MB memory

Large deployment (50+ apps):

  • VictoriaMetrics: 1 GB+ memory
  • OpenTelemetry Collector: 512 MB memory
  • Grafana: 512 MB memory
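These allocations can be pinned in docker-compose; a sketch for the small tier (mem_limit applies to plain docker-compose deployments, Swarm uses deploy.resources instead):

# observability/docker-compose.yml - memory caps for a small deployment (sketch)
services:
  victoriametrics:
    mem_limit: 256m
  otel-collector:
    mem_limit: 128m
  grafana:
    mem_limit: 256m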

Alerting

Configure Grafana alerts for critical metrics:

  • High error rate: rate(scotty_http_requests_total{status="500"}[5m]) > 0.1
  • Memory leak: deriv(scotty_memory_rss_bytes[30m]) > 1000000 (sustained growth above ~1 MB/s)
  • High WebSocket failures: rate(scotty_websocket_auth_failures_total[5m]) > 1
  • Task failures: rate(scotty_task_failures_total[5m]) > 0.5
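In Prometheus-format rule files (which VictoriaMetrics' vmalert also accepts), the first of these could be expressed roughly as:

# alert-rules.yml (illustrative) - Prometheus-format alerting rule
groups:
  - name: scotty
    rules:
      - alert: HighHttpErrorRate
        expr: rate(scotty_http_requests_total{status="500"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Scotty HTTP 500 rate above 0.1/s for 5 minutes"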

Data Retention

Adjust retention based on compliance and capacity:

# observability/docker-compose.yml
services:
  victoriametrics:
    command:
      - '-retentionPeriod=90d'  # 3 months for compliance

Security

Production checklist:

  • [ ] Change Grafana default password
  • [ ] Enable Grafana authentication (OAuth, LDAP, etc.)
  • [ ] Use TLS for Grafana access
  • [ ] Restrict Jaeger UI access
  • [ ] Firewall VictoriaMetrics port (8428)
  • [ ] Use secure networks for OTLP traffic

Further Reading