Observability beyond metrics: logs, traces, and alerts that actually help
CPU graphs tell you something broke. Structured logs and distributed traces tell you why. Here is how we wired observability into every StackBlaze service by default.
Sarah Kim
Co-founder & CTO
Most platforms give you a metrics dashboard and call it observability. That is enough when your app is healthy and boring. When something breaks at 2am, you need logs you can search, traces that show which service dropped the request, and alerts that fire on symptoms users feel, not on "CPU is 73%."
We shipped unified observability for every StackBlaze service last quarter. This post covers what is included, how it is wired under the hood, and the patterns we recommend for getting signal without drowning in noise.
The three pillars, one place
Every service on StackBlaze automatically gets metrics (Prometheus-compatible), structured logs (JSON, shipped to our log store), and distributed traces (OpenTelemetry, sampled at 10% by default). You do not install agents or sidecars: the platform injects an OpenTelemetry SDK configuration at deploy time and scrapes metrics from a /metrics endpoint if your framework exposes one.
- Metrics: request rate, error rate, latency histograms, plus container CPU/memory
- Logs: stdout/stderr from every container, parsed as JSON when possible
- Traces: HTTP and outbound client spans, correlated with logs via trace_id
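To make the metrics bullet concrete, here is a rough sketch of the kind of text a /metrics endpoint returns in the Prometheus exposition format. The metric name `http_requests_total`, the label names, and the helper functions are assumptions for illustration; the post does not specify which series the platform scrapes.

```typescript
// Hypothetical in-memory counter sample, keyed by route and status code.
type CounterSample = { route: string; statusCode: number; count: number };

// Escape backslashes, double quotes, and newlines inside a label value.
function escapeLabelValue(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render samples as Prometheus text exposition lines.
// The metric name http_requests_total is an illustrative assumption.
function renderRequestCounter(samples: CounterSample[]): string {
  const lines = [
    "# HELP http_requests_total Total HTTP requests handled.",
    "# TYPE http_requests_total counter",
  ];
  for (const s of samples) {
    lines.push(
      `http_requests_total{route="${escapeLabelValue(s.route)}",status="${s.statusCode}"} ${s.count}`
    );
  }
  return lines.join("\n") + "\n";
}

const metricsBody = renderRequestCounter([
  { route: "/checkout", statusCode: 200, count: 1042 },
  { route: "/checkout", statusCode: 500, count: 7 },
]);
console.log(metricsBody);
```

In practice a metrics library for your framework would produce this output for you; the sketch only shows the shape of what gets scraped.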
Structured logging that survives production
Unstructured logs are fine in development. In production they are nearly useless at scale: grep does not work across fifty replicas, and log lines that say "error" without context waste everyone's time.
```typescript
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    // Emit the level as a string label ("info") instead of pino's numeric default
    level: (label) => ({ level: label }),
  },
  base: {
    service: process.env.STACKBLAZE_SERVICE_NAME,
    env: process.env.STACKBLAZE_ENV,
  },
});

// Returns a child logger that stamps trace_id on every line it writes.
// Call it once per request with the active trace ID.
export function logWithTrace(traceId: string) {
  return logger.child({ trace_id: traceId });
}
```

StackBlaze indexes common fields (level, service, env, trace_id, and user_id if you log it), so you can filter in the dashboard without writing LogQL. We also redact known secret patterns (API keys, bearer tokens) at ingest time so a misconfigured log line does not become a credential leak.
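Redaction at ingest can be pictured as a pass of pattern substitutions over each log line. The sketch below is only an illustration: the two patterns (a bearer-token shape and a hypothetical `sk_`-prefixed key) and the `redactSecrets` helper are assumptions, since the post does not list the rules the platform actually applies.

```typescript
// Illustrative patterns only; the platform's real redaction rules are not described here.
const SECRET_PATTERNS: RegExp[] = [
  /Bearer\s+[A-Za-z0-9\-._~+\/]+=*/g, // Authorization header values
  /\bsk_[A-Za-z0-9]{16,}\b/g, // hypothetical "sk_"-prefixed API keys
];

// Replace every match of every pattern with a fixed placeholder.
function redactSecrets(line: string): string {
  return SECRET_PATTERNS.reduce(
    (acc, pattern) => acc.replace(pattern, "[REDACTED]"),
    line
  );
}

console.log(
  redactSecrets('{"msg":"auth failed","header":"Bearer abc.def.ghi"}')
);
```

Pattern-based redaction is a safety net, not a guarantee: a secret in an unfamiliar format will pass through, so it is still worth keeping credentials out of log calls in the first place.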
Log one line per request
Emit a single structured "request completed" log at the end of each HTTP handler with status, duration_ms, route, and trace_id. That one line answers 80% of incident questions.
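A minimal sketch of that summary line, using the four fields named above. The `buildRequestCompletedLog` helper and the status-to-level mapping are assumptions for illustration; in a real handler you would pass the resulting object to your logger rather than `console.log`.

```typescript
// Fields named in the text: status, duration_ms, route, trace_id.
type RequestCompletedLog = {
  level: "info" | "warn" | "error";
  msg: "request completed";
  status: number;
  duration_ms: number;
  route: string;
  trace_id: string;
};

// Build the single summary line for a finished request.
// The level mapping (5xx -> error, 4xx -> warn) is an assumption.
function buildRequestCompletedLog(
  status: number,
  startedAtMs: number,
  finishedAtMs: number,
  route: string,
  traceId: string
): RequestCompletedLog {
  const level = status >= 500 ? "error" : status >= 400 ? "warn" : "info";
  return {
    level,
    msg: "request completed",
    status,
    duration_ms: finishedAtMs - startedAtMs,
    route,
    trace_id: traceId,
  };
}

const summaryLine = buildRequestCompletedLog(
  200,
  1_000,
  1_356,
  "/checkout",
  "4bf92f3577b34da6a3ce929d0e0e4736"
);
console.log(JSON.stringify(summaryLine));
```

Logging the route template (for example "/orders/:id") rather than the raw URL keeps the field low-cardinality, which makes it far easier to group and filter on.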
Distributed tracing without the yak shave
If your API calls Postgres and Redis over the private network, a slow request could be slow anywhere. Traces show the breakdown: 12ms in middleware, 340ms in a database query, 4ms in serialization.
```typescript
// Auto-injected by StackBlaze, shown for transparency
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new PgInstrumentation(),
  ],
  // Sample 10% of new traces; child spans follow the parent's sampling decision
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();
```

Sampling at 10% keeps cost predictable on high-traffic services while still giving you enough traces to debug intermittent failures. You can raise the sample rate per service in the dashboard for short windows during an incident.
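Ratio sampling works because the decision is derived from the trace ID itself, so every service that sees the same trace reaches the same answer. The sketch below is a simplified illustration of that idea, not the OpenTelemetry SDK's exact algorithm; `shouldSampleTrace` and `parentBasedDecision` are hypothetical names.

```typescript
// Simplified illustration of trace-ID ratio sampling.
// This is not the OpenTelemetry SDK's exact algorithm.
function shouldSampleTrace(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex characters as an unsigned 32-bit integer.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // Keep the trace when its bucket falls in the lowest `ratio` share of the range.
  return bucket < Math.floor(ratio * 0x100000000);
}

// Parent-based wrapper: honor the caller's decision when one exists,
// and fall back to ratio sampling only for new root traces.
function parentBasedDecision(
  traceId: string,
  ratio: number,
  parentSampled?: boolean
): boolean {
  return parentSampled ?? shouldSampleTrace(traceId, ratio);
}

console.log(shouldSampleTrace("0a000000000000000000000000000001", 0.1));
console.log(parentBasedDecision("ff000000000000000000000000000001", 0.1, true));
```

The parent-based part is what keeps traces whole: once the entry service decides to sample a request, downstream services record their spans too instead of rolling the dice again.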
Alerts that match user pain
We ship default alert rules for every web service: error rate above 1% for five minutes, p99 latency above 2s for ten minutes, and zero healthy replicas. You can add custom rules in blueprint.yaml or the dashboard.
```yaml
services:
  api:
    type: web
    alerts:
      - name: high-error-rate
        condition: error_rate > 0.01
        for: 5m
        notify: [pagerduty, slack-oncall]
      - name: slow-checkout
        condition: p99_latency{route="/checkout"} > 3s
        for: 10m
        notify: [slack-team-commerce]
```

| Alert type | Good for | Avoid |
|---|---|---|
| Error rate SLO | User-visible failures | Alerting on single 500s |
| Latency SLO | Degraded experience | Raw CPU thresholds |
| Saturation | Capacity planning | Primary on-call signal |
| Synthetic checks | Critical paths | Replacing real traffic metrics |
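The `for:` window in the alert rules is what keeps a single bad scrape from paging anyone. A minimal sketch of one common interpretation (the condition must hold continuously for the whole window, as in Prometheus-style rules) follows; whether StackBlaze evaluates rules exactly this way is an assumption, and `RatePoint` and `firesAfterSustainedBreach` are hypothetical names.

```typescript
// One error-rate sample at a point in time (milliseconds since some epoch).
type RatePoint = { atMs: number; errorRate: number };

// Walk samples in ascending time order. The alert fires once the condition
// has held continuously for at least forMs; any sample at or below the
// threshold resets the clock.
function firesAfterSustainedBreach(
  points: RatePoint[],
  threshold: number,
  forMs: number
): boolean {
  let breachStartedAt: number | null = null;
  for (const p of points) {
    if (p.errorRate > threshold) {
      if (breachStartedAt === null) breachStartedAt = p.atMs;
      if (p.atMs - breachStartedAt >= forMs) return true;
    } else {
      breachStartedAt = null;
    }
  }
  return false;
}

const MINUTE_MS = 60_000;
// Six samples, one per minute, all above the 1% threshold.
const sustained: RatePoint[] = [0, 1, 2, 3, 4, 5].map((m) => ({
  atMs: m * MINUTE_MS,
  errorRate: 0.02,
}));
console.log(firesAfterSustainedBreach(sustained, 0.01, 5 * MINUTE_MS));
```

A brief dip below the threshold resets the window under this model, which is why flapping services can stay silent; if that matters for a route, a shorter window or a burn-rate style rule may suit it better.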
Debugging a real incident
- Start from the alert: note the service, env, and time window.
- Open the metrics view: confirm error rate or latency spike aligns with user reports.
- Pick a failing request from the sample and filter logs by its trace_id.
- Open the trace: identify the slowest span (often a missing index or upstream timeout).
- Ship a fix; watch the same dashboards to confirm recovery.
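The "identify the slowest span" step is something the trace view does for you, but it is simple enough to sketch. The example below assumes a flat list of sibling spans with hypothetical names and reuses the 12ms / 340ms / 4ms breakdown from the tracing section; `TraceSpan` and `slowestSpan` are illustrative, not a platform API.

```typescript
// A flattened view of one trace: sibling spans with start and end times in ms.
type TraceSpan = { spanName: string; startMs: number; endMs: number };

// Return the span with the longest duration, or undefined for an empty trace.
function slowestSpan(spans: TraceSpan[]): TraceSpan | undefined {
  let slowest: TraceSpan | undefined = undefined;
  for (const s of spans) {
    if (
      slowest === undefined ||
      s.endMs - s.startMs > slowest.endMs - slowest.startMs
    ) {
      slowest = s;
    }
  }
  return slowest;
}

// Hypothetical spans matching the breakdown described earlier in the post.
const checkoutTrace: TraceSpan[] = [
  { spanName: "middleware", startMs: 0, endMs: 12 },
  { spanName: "db.query", startMs: 12, endMs: 352 },
  { spanName: "serialize", startMs: 352, endMs: 356 },
];
console.log(slowestSpan(checkoutTrace)?.spanName);
```

Real traces are nested, so a parent span always looks at least as slow as its children; comparing sibling spans, or each span's own time minus its children, gives a fairer picture of where the time went.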
Observability is not a product checkbox. It is the difference between a thirty-minute incident and a three-hour one. We built it into the platform so every team gets that baseline without hiring a platform engineer first.