Observability

Scotty includes a comprehensive observability stack for monitoring application health, performance, and behavior. The stack provides metrics, distributed tracing, and visualization through industry-standard tools.

Architecture

Scotty Application
       ↓ (OTLP over gRPC)
OpenTelemetry Collector (port 4317)
       ├─→ Jaeger (distributed traces)
       └─→ VictoriaMetrics (metrics storage)
              ↓
           Grafana (visualization & dashboards)

Components

  • OpenTelemetry Collector: Receives telemetry data from Scotty via OTLP and routes it to the appropriate backends
  • VictoriaMetrics: High-performance time-series database for metrics storage (30-day retention)
  • Jaeger: Distributed tracing backend for request traces and spans
  • Grafana: Visualization platform with pre-configured dashboards

Resource Usage

The observability stack requires approximately:

  • Memory: 180-250 MB total
  • CPU: Minimal (< 5% on modern systems)
  • Disk: ~1-2 GB for 30 days of metrics retention

Prometheus Compatibility & Flexibility

All metrics are fully Prometheus-compatible. The stack uses open standards (OTLP, PromQL, W3C Trace Context) and components are interchangeable.

Metric Format

  • OpenTelemetry names map to Prometheus names: dots become underscores, and counters gain a _total suffix (scotty.metric.name → scotty_metric_name_total)
  • Standard types: Counter, Gauge, Histogram, UpDownCounter
  • Attributes become labels (method, status, path)
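For example, the HTTP request counter described later in this guide ends up as a labeled Prometheus series; the label values here are illustrative:

# OpenTelemetry attributes appear as Prometheus labels
scotty_http_requests_total{method="GET", status="200"}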

Replace Components as Needed

Use Prometheus instead of VictoriaMetrics: In otel-collector-config.yaml, switch the metrics exporter from prometheusremotewrite to the prometheus (scrape) exporter, then swap VictoriaMetrics for Prometheus in docker-compose.
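A minimal sketch of that exporter change, assuming the stock prometheus exporter of the OpenTelemetry Collector (the port is an arbitrary choice):

# otel-collector-config.yaml - expose a scrape endpoint instead of remote write
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # Prometheus then scrapes otel-collector:8889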

Alternative backends: Thanos, Cortex, M3DB, InfluxDB, Datadog, New Relic, Honeycomb, Grafana Cloud

Alternative visualization: Prometheus UI, VictoriaMetrics vmui, Chronograf, commercial dashboards

Alternative tracing: Zipkin, Tempo, Elasticsearch + Jaeger, Lightstep, Honeycomb

Multi-backend export example:

# otel-collector-config.yaml - export to multiple destinations
service:
  pipelines:
    metrics:
      exporters: [prometheusremotewrite/victoria, prometheusremotewrite/thanos, otlp/datadog]

Integration Patterns

Remote write to an existing Prometheus:

exporters:
  prometheusremotewrite:
    endpoint: "https://your-prometheus.company.com/api/v1/write"

Federation from VictoriaMetrics:

# prometheus.yml
scrape_configs:
  - job_name: 'scotty'
    metrics_path: '/api/v1/export/prometheus'
    params:
      match[]: ['{__name__=~"scotty_.*"}']
    static_configs:
      - targets: ['victoriametrics:8428']

Service discovery: Standard Kubernetes/Consul Prometheus SD works with VictoriaMetrics API.
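For example, a standard Kubernetes pod-discovery job (a sketch; the prometheus.io/scrape annotation is the common convention, but your cluster may use different annotations):

# prometheus.yml - scrape pods that opt in via annotation
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'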

Why VictoriaMetrics Is the Default

VictoriaMetrics was chosen for development convenience: it uses less memory than Prometheus, ships as a single binary, is fully Prometheus-compatible, and is free. Swap in Prometheus for production if you prefer.

Quick Start

Prerequisites

The observability stack requires Traefik for .ddev.site domain routing. Start Traefik first:

cd apps/traefik
docker-compose up -d

Starting the Observability Stack

cd observability
docker-compose up -d

This will start all four services:

  • OpenTelemetry Collector
  • VictoriaMetrics
  • Jaeger
  • Grafana

Enabling Metrics in Scotty

Configure Scotty to export telemetry data using the SCOTTY__TELEMETRY environment variable:

Enable both metrics and traces:

SCOTTY__TELEMETRY=metrics,traces cargo run --bin scotty

Enable only metrics:

SCOTTY__TELEMETRY=metrics cargo run --bin scotty

Production deployment (in docker-compose.yml or .env):

environment:
  - SCOTTY__TELEMETRY=metrics,traces

Accessing Services

Once running, access the services at:

Service          URL                        Credentials
Grafana          http://grafana.ddev.site   admin/admin
Jaeger UI        http://jaeger.ddev.site    (none)
VictoriaMetrics  http://vm.ddev.site        (none)

Available Metrics

Scotty exports comprehensive metrics covering all major subsystems. All metrics use the scotty. prefix (which appears as scotty_ after Prometheus conversion).

Log Streaming Metrics

Metric Name                          Type       Description
scotty_log_streams_active            Gauge      Number of active log streams
scotty_log_streams_total             Counter    Total log streams created
scotty_log_stream_duration_seconds   Histogram  Duration of log streaming sessions
scotty_log_stream_lines_total        Counter    Total log lines streamed to clients
scotty_log_stream_errors_total       Counter    Log streaming errors

Use Cases:

  • Monitor concurrent log stream load
  • Detect log streaming errors
  • Analyze log stream duration patterns
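For instance, error rates and session-length percentiles over these metrics (standard PromQL; windows are illustrative):

# Log stream errors per second, 5-minute window
rate(scotty_log_stream_errors_total[5m])

# P95 log stream session duration
histogram_quantile(0.95, rate(scotty_log_stream_duration_seconds_bucket[5m]))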

Shell Session Metrics

Metric Name                             Type       Description
scotty_shell_sessions_active            Gauge      Number of active shell sessions
scotty_shell_sessions_total             Counter    Total shell sessions created
scotty_shell_session_duration_seconds   Histogram  Shell session duration
scotty_shell_session_errors_total       Counter    Shell session errors
scotty_shell_session_timeouts_total     Counter    Sessions ended due to timeout

Use Cases:

  • Monitor active shell connections
  • Track session timeout rates
  • Identify shell session errors
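For example, the timeout rate can be expressed as a ratio of the two counters above:

# Fraction of shell sessions ending in a timeout
rate(scotty_shell_session_timeouts_total[5m]) / rate(scotty_shell_sessions_total[5m])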

WebSocket Metrics

Metric Name                                Type     Description
scotty_websocket_connections_active        Gauge    Active WebSocket connections
scotty_websocket_messages_sent_total       Counter  Messages sent to clients
scotty_websocket_messages_received_total   Counter  Messages received from clients
scotty_websocket_auth_failures_total       Counter  WebSocket authentication failures

Use Cases:

  • Monitor real-time connection count
  • Track message throughput
  • Detect authentication issues
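For example, total message throughput combines the two message counters:

# WebSocket messages per second, both directions
rate(scotty_websocket_messages_sent_total[5m]) + rate(scotty_websocket_messages_received_total[5m])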

Task Output Streaming Metrics

Metric Name                      Type       Description
scotty_tasks_active              Gauge      Active task output streams
scotty_tasks_total               Counter    Total tasks executed
scotty_task_duration_seconds     Histogram  Task execution duration
scotty_task_failures_total       Counter    Failed tasks
scotty_task_output_lines_total   Counter    Task output lines streamed

Use Cases:

  • Monitor task execution load
  • Track task failure rates
  • Analyze output streaming performance
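For example, the failure rate is a ratio of the two task counters:

# Fraction of tasks that fail
rate(scotty_task_failures_total[5m]) / rate(scotty_tasks_total[5m])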

HTTP Server Metrics

Metric Name                            Type           Description
scotty_http_requests_active            UpDownCounter  Currently processing requests
scotty_http_requests_total             Counter        Total HTTP requests
scotty_http_request_duration_seconds   Histogram      Request processing time

Attributes:

  • method: HTTP method (GET, POST, etc.)
  • path: Request path
  • status: HTTP status code

Use Cases:

  • Monitor API endpoint performance
  • Track request rates by endpoint
  • Identify slow requests
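For example, to break request rates down per endpoint using the path label:

# Requests per second per endpoint
sum by (path) (rate(scotty_http_requests_total[5m]))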

Memory Metrics

Metric Name                  Type   Description
scotty_memory_rss_bytes      Gauge  Resident Set Size (RSS) in bytes
scotty_memory_virtual_bytes  Gauge  Virtual memory size in bytes

Use Cases:

  • Monitor memory consumption
  • Detect memory leaks
  • Capacity planning

Application Metrics

Metric Name                        Type       Description
scotty_apps_total                  Gauge      Total managed applications
scotty_apps_by_status              Gauge      Apps grouped by status
scotty_app_services_count          Histogram  Services per application distribution
scotty_app_last_check_age_seconds  Histogram  Time since last health check

Attributes:

  • status: Application status (running, stopped, etc.)

Use Cases:

  • Monitor application fleet size
  • Track application health check timeliness
  • Analyze service distribution
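For example, to see the fleet broken down by status using the label above:

# Applications per status (running, stopped, ...)
sum by (status) (scotty_apps_by_status)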

Tokio Runtime Metrics

Metric Name                             Type       Description
scotty_tokio_workers_count              Gauge      Number of Tokio worker threads
scotty_tokio_tasks_active               Gauge      Active instrumented tasks
scotty_tokio_tasks_dropped_total        Counter    Completed/dropped tasks
scotty_tokio_poll_count_total           Counter    Total task polls
scotty_tokio_poll_duration_seconds      Histogram  Task poll duration
scotty_tokio_slow_poll_count_total      Counter    Slow task polls (>1ms)
scotty_tokio_idle_duration_seconds      Histogram  Task idle time between polls
scotty_tokio_scheduled_count_total      Counter    Task scheduling events
scotty_tokio_first_poll_delay_seconds   Histogram  Delay from creation to first poll

Use Cases:

  • Monitor async runtime health
  • Detect slow tasks blocking the runtime
  • Optimize task scheduling
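For example, the share of slow polls is a useful runtime-health signal:

# Fraction of polls exceeding 1ms
rate(scotty_tokio_slow_poll_count_total[5m]) / rate(scotty_tokio_poll_count_total[5m])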

Grafana Dashboard

Scotty includes a pre-configured Grafana dashboard (scotty-metrics.json) that visualizes all available metrics.

Dashboard Sections

  1. Log Streaming: Active streams, throughput, duration percentiles, errors
  2. Shell Sessions: Active sessions, creation rate, duration, errors & timeouts
  3. WebSocket & Tasks: Connection metrics, message rates, task execution
  4. Memory Usage: RSS and virtual memory trends
  5. HTTP Server: Request rates, active requests, latencies
  6. Tokio Runtime: Worker threads, task lifecycle, poll metrics
  7. Application Metrics: App count, status distribution, health checks

Accessing the Dashboard

  1. Open Grafana: http://grafana.ddev.site
  2. Login with admin / admin (change on first login)
  3. Navigate to Dashboards → Scotty Metrics

The dashboard auto-refreshes every 5 seconds and shows data from the last hour by default.

PromQL Query Examples

Request Rate by HTTP Status

sum by (status) (rate(scotty_http_requests_total[5m]))

P95 Request Latency

histogram_quantile(0.95, rate(scotty_http_request_duration_seconds_bucket[5m]))

WebSocket Connection Churn

rate(scotty_websocket_connections_total[5m])

Memory Growth Rate

deriv(scotty_memory_rss_bytes[10m])

Active Resources Summary

# All active resources
scotty_log_streams_active +
scotty_shell_sessions_active +
scotty_websocket_connections_active +
scotty_tasks_active

Distributed Tracing

When traces are enabled (SCOTTY__TELEMETRY=traces or metrics,traces), Scotty exports distributed traces to Jaeger.

Viewing Traces

  1. Open Jaeger UI: http://jaeger.ddev.site
  2. Select the scotty service
  3. Search for traces by operation or timeframe

Key Operations

  • HTTP POST /apps/create: Application creation
  • HTTP GET /apps/info/{name}: Application info retrieval
  • log_stream_handler: Log streaming operations
  • shell_session_handler: Shell session management

Traces include timing information, error status, and contextual metadata for debugging request flows.

Troubleshooting

No Metrics Appearing in Grafana

  1. Check Scotty is exporting metrics:

    # Verify SCOTTY__TELEMETRY is set
    echo $SCOTTY__TELEMETRY
    
    # Should be 'metrics' or 'metrics,traces'
  2. Verify OpenTelemetry Collector is receiving data:

    docker logs otel-collector
    # Look for: "Trace received"
  3. Check VictoriaMetrics has data:

    curl http://vm.ddev.site/api/v1/label/__name__/values | jq
    # Should list scotty_* metrics
  4. Restart the stack:

    cd observability
    docker-compose restart

High Memory Usage

If VictoriaMetrics uses too much memory, adjust retention:

# observability/docker-compose.yml
services:
  victoriametrics:
    command:
      - '-retentionPeriod=14d'  # Reduce from 30d

Connection Refused Errors

Ensure Traefik is running:

docker ps | grep traefik

# If Traefik is not listed, start it:
cd apps/traefik
docker-compose up -d

Grafana Dashboard Not Loading

  1. Check dashboard file exists: observability/grafana/dashboards/scotty-metrics.json
  2. Restart Grafana: docker-compose restart grafana
  3. Check Grafana logs: docker logs grafana

Configuration

OpenTelemetry Collector

Configuration file: observability/otel-collector-config.yaml

Key settings:

  • OTLP Receiver: Port 4317 (gRPC)
  • Exporters: Jaeger (traces), Prometheus Remote Write (metrics to VictoriaMetrics)
  • Batch Processor: Batches telemetry for efficiency
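Put together, a minimal metrics pipeline matching these settings might look like the following sketch (the victoriametrics hostname matches the docker-compose service used elsewhere in this guide; the shipped config may differ in detail):

# otel-collector-config.yaml - minimal metrics pipeline (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]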

VictoriaMetrics

Configuration via docker-compose environment:

  • Retention: 30 days (-retentionPeriod=30d)
  • Storage path: /victoria-metrics-data
  • HTTP port: 8428

Grafana

Configuration in observability/grafana/provisioning/:

  • Datasources: VictoriaMetrics (Prometheus type)
  • Dashboards: Auto-provisioned from dashboards/ directory
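A datasource provisioning file for this setup might look like the following sketch (the file name is illustrative; Grafana loads any YAML placed under provisioning/datasources/):

# provisioning/datasources/victoriametrics.yaml (name illustrative)
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus           # VictoriaMetrics speaks the Prometheus query API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true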

Production Recommendations

Resource Allocation

For production deployments, allocate resources based on scale:

Small deployment (< 10 apps):

  • VictoriaMetrics: 256 MB memory
  • OpenTelemetry Collector: 128 MB memory
  • Grafana: 256 MB memory

Medium deployment (10-50 apps):

  • VictoriaMetrics: 512 MB memory
  • OpenTelemetry Collector: 256 MB memory
  • Grafana: 512 MB memory

Large deployment (50+ apps):

  • VictoriaMetrics: 1 GB+ memory
  • OpenTelemetry Collector: 512 MB memory
  • Grafana: 512 MB memory
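These allocations can be pinned in docker-compose; a sketch for the small tier (mem_limit applies to plain docker-compose deployments, Swarm uses deploy.resources instead):

# observability/docker-compose.yml - memory caps for a small deployment (sketch)
services:
  victoriametrics:
    mem_limit: 256m
  otel-collector:
    mem_limit: 128m
  grafana:
    mem_limit: 256m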

Alerting

Configure Grafana alerts for critical metrics:

  • High error rate: rate(scotty_http_requests_total{status="500"}[5m]) > 0.1
  • Memory leak: deriv(scotty_memory_rss_bytes[30m]) > 1000000 (sustained growth above ~1 MB/s)
  • High WebSocket failures: rate(scotty_websocket_auth_failures_total[5m]) > 1
  • Task failures: rate(scotty_task_failures_total[5m]) > 0.5
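In Prometheus-format rule files (which VictoriaMetrics' vmalert also accepts), the first of these could be expressed roughly as:

# alert-rules.yml (illustrative) - Prometheus-format alerting rule
groups:
  - name: scotty
    rules:
      - alert: HighHttpErrorRate
        expr: rate(scotty_http_requests_total{status="500"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Scotty HTTP 500 rate above 0.1/s for 5 minutes"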

Data Retention

Adjust retention based on compliance and capacity:

# observability/docker-compose.yml
services:
  victoriametrics:
    command:
      - '-retentionPeriod=90d'  # 3 months for compliance

Security

Production checklist:

  • [ ] Change Grafana default password
  • [ ] Enable Grafana authentication (OAuth, LDAP, etc.)
  • [ ] Use TLS for Grafana access
  • [ ] Restrict Jaeger UI access
  • [ ] Firewall VictoriaMetrics port (8428)
  • [ ] Use secure networks for OTLP traffic

Further Reading