# Observability
Scotty includes a comprehensive observability stack for monitoring application health, performance, and behavior. The stack provides metrics, distributed tracing, and visualization through industry-standard tools.
## Architecture
Scotty Application
    ↓ (OTLP over gRPC)
OpenTelemetry Collector (port 4317)
    ├─→ Jaeger (distributed traces)
    └─→ VictoriaMetrics (metrics storage)
            ↓
        Grafana (visualization & dashboards)
### Components
- OpenTelemetry Collector: Receives telemetry data from Scotty via OTLP protocol and routes it to appropriate backends
- VictoriaMetrics: High-performance time-series database for metrics storage (30-day retention)
- Jaeger: Distributed tracing backend for request traces and spans
- Grafana: Visualization platform with pre-configured dashboards
### Resource Usage
The observability stack requires approximately:
- Memory: 180-250 MB total
- CPU: Minimal (< 5% on modern systems)
- Disk: ~1-2 GB for 30 days of metrics retention
## Prometheus Compatibility & Flexibility
All metrics are fully Prometheus-compatible. The stack uses open standards (OTLP, PromQL, W3C Trace Context) and components are interchangeable.
### Metric Format

- OpenTelemetry format (scotty.metric.name) → Prometheus format (scotty_metric_name_total)
- Standard types: Counter, Gauge, Histogram, UpDownCounter
- Attributes become labels (method, status, path)
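For illustration (the exact OpenTelemetry instrument name below is an assumption), an HTTP request counter translates roughly like this:

```
# OpenTelemetry instrument with attributes (assumed name, for illustration only)
scotty.http.requests          attributes: method=POST, path=/apps/create, status=200

# The same data as a Prometheus series after export: underscores, _total suffix, attributes as labels
scotty_http_requests_total{method="POST", path="/apps/create", status="200"}
```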
### Replace Components as Needed
Use Prometheus instead of VictoriaMetrics:
Change the metrics exporter in otel-collector-config.yaml from prometheusremotewrite to a prometheus endpoint, then swap VictoriaMetrics for Prometheus in docker-compose.
Alternative backends: Thanos, Cortex, M3DB, InfluxDB, Datadog, New Relic, Honeycomb, Grafana Cloud
Alternative visualization: Prometheus UI, VictoriaMetrics vmui, Chronograf, commercial dashboards
Alternative tracing: Zipkin, Tempo, Elasticsearch + Jaeger, Lightstep, Honeycomb
Multi-backend export example:
# otel-collector-config.yaml - export to multiple destinations
service:
  pipelines:
    metrics:
      exporters: [prometheusremotewrite/victoria, prometheusremotewrite/thanos, otlp/datadog]
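The pipeline above only references exporter names; a matching exporters section could look roughly like the sketch below. The Thanos and Datadog endpoints are placeholders, not part of the shipped configuration.

```yaml
exporters:
  prometheusremotewrite/victoria:
    endpoint: "http://victoriametrics:8428/api/v1/write"            # VictoriaMetrics remote-write API
  prometheusremotewrite/thanos:
    endpoint: "https://thanos-receive.example.com/api/v1/receive"   # placeholder
  otlp/datadog:
    endpoint: "https://otlp.example.com:4317"                       # placeholder OTLP intake
```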
### Integration Patterns
Remote write to existing Prometheus:
exporters:
  prometheusremotewrite:
    endpoint: "https://your-prometheus.company.com/api/v1/write"
Federation from VictoriaMetrics:
# prometheus.yml
scrape_configs:
  - job_name: 'scotty'
    metrics_path: '/api/v1/export/prometheus'
    params:
      match[]: ['{__name__=~"scotty_.*"}']
    static_configs:
      - targets: ['victoriametrics:8428']
Service discovery: Standard Kubernetes/Consul Prometheus SD works with VictoriaMetrics API.
### Why VictoriaMetrics Is the Default

VictoriaMetrics was chosen for development convenience: lower memory usage, a single binary, Prometheus compatibility, and no licensing cost. Swap it for Prometheus in production if preferred.
## Quick Start
### Prerequisites
The observability stack requires Traefik for .ddev.site domain routing. Start Traefik first:
cd apps/traefik
docker-compose up -d
### Starting the Observability Stack
cd observability
docker-compose up -d
This will start all four services:
- OpenTelemetry Collector
- VictoriaMetrics
- Jaeger
- Grafana
### Enabling Metrics in Scotty
Configure Scotty to export telemetry data using the SCOTTY__TELEMETRY environment variable:
Enable both metrics and traces:
SCOTTY__TELEMETRY=metrics,traces cargo run --bin scotty
Enable only metrics:
SCOTTY__TELEMETRY=metrics cargo run --bin scotty
Production deployment (in docker-compose.yml or .env):
environment:
  - SCOTTY__TELEMETRY=metrics,traces
### Accessing Services
Once running, access the services at:
| Service | URL | Credentials |
|---|---|---|
| Grafana | http://grafana.ddev.site | admin/admin |
| Jaeger UI | http://jaeger.ddev.site | (none) |
| VictoriaMetrics | http://vm.ddev.site | (none) |
## Available Metrics
Scotty exports comprehensive metrics covering all major subsystems. All metrics use the scotty. prefix.
### Log Streaming Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_log_streams_active | Gauge | Number of active log streams |
| scotty_log_streams_total | Counter | Total log streams created |
| scotty_log_stream_duration_seconds | Histogram | Duration of log streaming sessions |
| scotty_log_stream_lines_total | Counter | Total log lines streamed to clients |
| scotty_log_stream_errors_total | Counter | Log streaming errors |
Use Cases:
- Monitor concurrent log stream load
- Detect log streaming errors
- Analyze log stream duration patterns
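For example, the duration histogram and error counter support queries like the following (PromQL sketch; the range windows are arbitrary):

```promql
# P95 log stream duration over the last 5 minutes
histogram_quantile(0.95, rate(scotty_log_stream_duration_seconds_bucket[5m]))

# Log streaming error rate (errors per second)
rate(scotty_log_stream_errors_total[5m])
```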
### Shell Session Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_shell_sessions_active | Gauge | Number of active shell sessions |
| scotty_shell_sessions_total | Counter | Total shell sessions created |
| scotty_shell_session_duration_seconds | Histogram | Shell session duration |
| scotty_shell_session_errors_total | Counter | Shell session errors |
| scotty_shell_session_timeouts_total | Counter | Sessions ended due to timeout |
Use Cases:
- Monitor active shell connections
- Track session timeout rates
- Identify shell session errors
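A query such as the following (illustrative PromQL) approximates the share of sessions that end in a timeout:

```promql
# Ratio of sessions ended by timeout to sessions created, over 15 minutes
rate(scotty_shell_session_timeouts_total[15m])
  / rate(scotty_shell_sessions_total[15m])
```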
### WebSocket Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_websocket_connections_active | Gauge | Active WebSocket connections |
| scotty_websocket_messages_sent_total | Counter | Messages sent to clients |
| scotty_websocket_messages_received_total | Counter | Messages received from clients |
| scotty_websocket_auth_failures_total | Counter | WebSocket authentication failures |
Use Cases:
- Monitor real-time connection count
- Track message throughput
- Detect authentication issues
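Message throughput and authentication failures can be derived from the counters above, for example:

```promql
# Total WebSocket message throughput (sent + received, per second)
rate(scotty_websocket_messages_sent_total[5m])
  + rate(scotty_websocket_messages_received_total[5m])

# Authentication failures per second
rate(scotty_websocket_auth_failures_total[5m])
```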
### Task Output Streaming Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_tasks_active | Gauge | Active task output streams |
| scotty_tasks_total | Counter | Total tasks executed |
| scotty_task_duration_seconds | Histogram | Task execution duration |
| scotty_task_failures_total | Counter | Failed tasks |
| scotty_task_output_lines_total | Counter | Task output lines streamed |
Use Cases:
- Monitor task execution load
- Track task failure rates
- Analyze output streaming performance
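For example, a task failure ratio can be computed from the two counters (PromQL sketch):

```promql
# Fraction of tasks that failed over the last 15 minutes
rate(scotty_task_failures_total[15m]) / rate(scotty_tasks_total[15m])
```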
### HTTP Server Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_http_requests_active | UpDownCounter | Currently processing requests |
| scotty_http_requests_total | Counter | Total HTTP requests |
| scotty_http_request_duration_seconds | Histogram | Request processing time |

Attributes:

- method: HTTP method (GET, POST, etc.)
- path: Request path
- status: HTTP status code
Use Cases:
- Monitor API endpoint performance
- Track request rates by endpoint
- Identify slow requests
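Because method and path are attached as labels, latency can be broken down per endpoint, for example:

```promql
# P95 request latency per endpoint over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (path, le) (rate(scotty_http_request_duration_seconds_bucket[5m]))
)
```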
### Memory Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_memory_rss_bytes | Gauge | Resident Set Size (RSS) in bytes |
| scotty_memory_virtual_bytes | Gauge | Virtual memory size in bytes |
Use Cases:
- Monitor memory consumption
- Detect memory leaks
- Capacity planning
### Application Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_apps_total | Gauge | Total managed applications |
| scotty_apps_by_status | Gauge | Apps grouped by status |
| scotty_app_services_count | Histogram | Services per application distribution |
| scotty_app_last_check_age_seconds | Histogram | Time since last health check |

Attributes:

- status: Application status (running, stopped, etc.)
Use Cases:
- Monitor application fleet size
- Track application health check timeliness
- Analyze service distribution
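The status label makes fleet breakdowns straightforward, for example:

```promql
# Number of applications per status (running, stopped, etc.)
sum by (status) (scotty_apps_by_status)
```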
### Tokio Runtime Metrics

| Metric Name | Type | Description |
|---|---|---|
| scotty_tokio_workers_count | Gauge | Number of Tokio worker threads |
| scotty_tokio_tasks_active | Gauge | Active instrumented tasks |
| scotty_tokio_tasks_dropped_total | Counter | Completed/dropped tasks |
| scotty_tokio_poll_count_total | Counter | Total task polls |
| scotty_tokio_poll_duration_seconds | Histogram | Task poll duration |
| scotty_tokio_slow_poll_count_total | Counter | Slow task polls (>1ms) |
| scotty_tokio_idle_duration_seconds | Histogram | Task idle time between polls |
| scotty_tokio_scheduled_count_total | Counter | Task scheduling events |
| scotty_tokio_first_poll_delay_seconds | Histogram | Delay from creation to first poll |
Use Cases:
- Monitor async runtime health
- Detect slow tasks blocking the runtime
- Optimize task scheduling
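A useful derived signal is the share of polls that are slow, which points at tasks blocking the runtime:

```promql
# Fraction of task polls slower than 1ms over the last 5 minutes
rate(scotty_tokio_slow_poll_count_total[5m])
  / rate(scotty_tokio_poll_count_total[5m])
```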
## Grafana Dashboard
Scotty includes a pre-configured Grafana dashboard (scotty-metrics.json) that visualizes all available metrics.
### Dashboard Sections
- Log Streaming: Active streams, throughput, duration percentiles, errors
- Shell Sessions: Active sessions, creation rate, duration, errors & timeouts
- WebSocket & Tasks: Connection metrics, message rates, task execution
- Memory Usage: RSS and virtual memory trends
- HTTP Server: Request rates, active requests, latencies
- Tokio Runtime: Worker threads, task lifecycle, poll metrics
- Application Metrics: App count, status distribution, health checks
### Accessing the Dashboard

1. Open Grafana: http://grafana.ddev.site
2. Log in with admin/admin (change on first login)
3. Navigate to Dashboards → Scotty Metrics
The dashboard auto-refreshes every 5 seconds and shows data from the last hour by default.
## PromQL Query Examples
### Request Rate by HTTP Status
sum by (status) (rate(scotty_http_requests_total[5m]))
### P95 Request Latency
histogram_quantile(0.95, rate(scotty_http_request_duration_seconds_bucket[5m]))
### WebSocket Connection Churn
rate(scotty_websocket_connections_total[5m])
### Memory Growth Rate
deriv(scotty_memory_rss_bytes[10m])
### Active Resources Summary
# All active resources
scotty_log_streams_active +
scotty_shell_sessions_active +
scotty_websocket_connections_active +
scotty_tasks_active
## Distributed Tracing
When traces are enabled (SCOTTY__TELEMETRY=traces or metrics,traces), Scotty exports distributed traces to Jaeger.
### Viewing Traces
- Open Jaeger UI: http://jaeger.ddev.site
- Select scotty service
- Search for traces by operation or timeframe
### Key Operations

- HTTP POST /apps/create: Application creation
- HTTP GET /apps/info/{name}: Application info retrieval
- log_stream_handler: Log streaming operations
- shell_session_handler: Shell session management
Traces include timing information, error status, and contextual metadata for debugging request flows.
## Troubleshooting
### No Metrics Appearing in Grafana

1. Check that Scotty is exporting metrics:

   # Verify SCOTTY__TELEMETRY is set
   echo $SCOTTY__TELEMETRY
   # Should be 'metrics' or 'metrics,traces'

2. Verify the OpenTelemetry Collector is receiving data:

   docker logs otel-collector
   # Look for: "Trace received"

3. Check that VictoriaMetrics has data:

   curl http://vm.ddev.site/api/v1/label/__name__/values | jq
   # Should list scotty_* metrics

4. Restart the stack:

   cd observability
   docker-compose restart
### High Memory Usage
If VictoriaMetrics uses too much memory, adjust retention:
# observability/docker-compose.yml
services:
  victoriametrics:
    command:
      - '-retentionPeriod=14d'  # Reduce from 30d
### Connection Refused Errors
Ensure Traefik is running:
docker ps | grep traefik
cd apps/traefik
docker-compose up -d
### Grafana Dashboard Not Loading

1. Check that the dashboard file exists: observability/grafana/dashboards/scotty-metrics.json
2. Restart Grafana: docker-compose restart grafana
3. Check the Grafana logs: docker logs grafana
## Configuration
### OpenTelemetry Collector
Configuration file: observability/otel-collector-config.yaml
Key settings:
- OTLP Receiver: Port 4317 (gRPC)
- Exporters: Jaeger (traces), Prometheus Remote Write (metrics to VictoriaMetrics)
- Batch Processor: Batches telemetry for efficiency
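The following is an illustrative sketch of what a configuration with these settings looks like; the exporter names, endpoints, and Jaeger ingestion port are assumptions, so treat observability/otel-collector-config.yaml as the source of truth:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317          # OTLP receiver used by Scotty

processors:
  batch: {}                             # batch telemetry before export

exporters:
  otlp/jaeger:                          # traces to Jaeger (assumed OTLP ingestion)
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheusremotewrite:                # metrics to VictoriaMetrics
    endpoint: http://victoriametrics:8428/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```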
### VictoriaMetrics

Configuration via docker-compose:

- Retention: 30 days (-retentionPeriod=30d)
- Storage path: /victoria-metrics-data
- HTTP port: 8428
### Grafana

Configuration in observability/grafana/provisioning/:

- Datasources: VictoriaMetrics (Prometheus type)
- Dashboards: Auto-provisioned from the dashboards/ directory
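A minimal datasource provisioning file of this kind typically looks like the sketch below; the file path and exact values are hypothetical, and the shipped provisioning files may differ:

```yaml
# e.g. observability/grafana/provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus              # VictoriaMetrics speaks the Prometheus query API
    access: proxy
    url: http://victoriametrics:8428
    isDefault: true
```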
## Production Recommendations
### Resource Allocation
For production deployments, allocate resources based on scale:
Small deployment (< 10 apps):
- VictoriaMetrics: 256 MB memory
- OpenTelemetry Collector: 128 MB memory
- Grafana: 256 MB memory
Medium deployment (10-50 apps):
- VictoriaMetrics: 512 MB memory
- OpenTelemetry Collector: 256 MB memory
- Grafana: 512 MB memory
Large deployment (50+ apps):
- VictoriaMetrics: 1 GB+ memory
- OpenTelemetry Collector: 512 MB memory
- Grafana: 512 MB memory
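One way to enforce these budgets is with memory limits in observability/docker-compose.yml, sketched below for the small-deployment numbers (service names are assumed to match the compose file):

```yaml
services:
  victoriametrics:
    mem_limit: 256m
  otel-collector:
    mem_limit: 128m
  grafana:
    mem_limit: 256m
```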
### Alerting

Configure Grafana alerts for critical metrics:

- High error rate: rate(scotty_http_requests_total{status="500"}[5m]) > 0.1
- Memory leak: deriv(scotty_memory_rss_bytes[30m]) > 1000000
- High WebSocket failures: rate(scotty_websocket_auth_failures_total[5m]) > 1
- Task failures: rate(scotty_task_failures_total[5m]) > 0.5
### Data Retention
Adjust retention based on compliance and capacity:
# observability/docker-compose.yml
services:
  victoriametrics:
    command:
      - '-retentionPeriod=90d'  # 3 months for compliance
### Security
Production checklist:
- [ ] Change Grafana default password
- [ ] Enable Grafana authentication (OAuth, LDAP, etc.)
- [ ] Use TLS for Grafana access
- [ ] Restrict Jaeger UI access
- [ ] Firewall VictoriaMetrics port (8428)
- [ ] Use secure networks for OTLP traffic