
Observability beyond metrics: logs, traces, and alerts that actually help

CPU graphs tell you something broke. Structured logs and distributed traces tell you why. Here is how we wired observability into every StackBlaze service by default.

Sarah Kim

Co-founder & CTO

May 15, 2026 · 10 min read

Most platforms give you a metrics dashboard and call it observability. That is enough when your app is healthy and boring. When something breaks at 2am, you need logs you can search, traces that show which service dropped the request, and alerts that fire on symptoms users feel, not on "CPU is 73%."

We shipped unified observability for every StackBlaze service last quarter. This post covers what is included, how it is wired under the hood, and the patterns we recommend for getting signal without drowning in noise.

The three pillars, one place

Every service on StackBlaze automatically gets metrics (Prometheus-compatible), structured logs (JSON, shipped to our log store), and distributed traces (OpenTelemetry, sampled at 10% by default). You do not install agents or sidecars: the platform injects an OpenTelemetry SDK configuration at deploy time and scrapes metrics from a /metrics endpoint if your framework exposes one.

Metrics: request rate, error rate, latency histograms, plus container CPU/memory
Logs: stdout/stderr from every container, parsed as JSON when possible
Traces: HTTP and outbound client spans, correlated with logs via trace_id
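
To make "parsed as JSON when possible" concrete, here is a minimal sketch of that kind of fallback parsing. The LogRecord shape and the parseLogLine helper are hypothetical names for illustration, not the platform's actual ingest code.

```typescript
// Hypothetical record shape; not StackBlaze's actual ingest schema.
type LogRecord = { structured: boolean; fields: Record<string, unknown> };

function parseLogLine(line: string): LogRecord {
  try {
    const parsed: unknown = JSON.parse(line);
    // Only plain JSON objects count as structured; arrays, numbers and strings do not.
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return { structured: true, fields: parsed as Record<string, unknown> };
    }
  } catch {
    // Not valid JSON: fall through to the raw-text wrapper below.
  }
  // Keep the original text so nothing is lost when a line is unstructured.
  return { structured: false, fields: { message: line } };
}
```

A JSON line keeps its fields, such as trace_id, available for filtering, while a plain-text line is still retained as a single message field.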

Structured logging that survives production

Unstructured logs are fine in development. In production they are nearly useless at scale: grep does not work across fifty replicas, and log lines that say "error" without context waste everyone's time.

lib/logger.ts

import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: process.env.STACKBLAZE_SERVICE_NAME,
    env: process.env.STACKBLAZE_ENV,
  },
});

// Returns a child logger that stamps every line with the given trace_id.
// Call it once per request, passing the active trace ID.
export function logWithTrace(traceId: string) {
  return logger.child({ trace_id: traceId });
}

StackBlaze indexes common fields (level, service, env, trace_id, and user_id if you log it), so you can filter in the dashboard without writing LogQL. We also redact known secret patterns (API keys, bearer tokens) at ingest time so a misconfigured log line does not become a credential leak.
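
Ingest-time redaction can be pictured as a pattern-based rewrite of each line. The sketch below is an assumption about the general technique, not the platform's real rule set: the two patterns, the made-up sbz_ key prefix, and the redactSecrets helper are all hypothetical.

```typescript
// Illustrative patterns only; a real redactor would cover many more formats.
const SECRET_PATTERNS: RegExp[] = [
  /Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, // bearer tokens, e.g. from Authorization headers
  /\bsbz_[A-Za-z0-9]{8,}\b/g, // made-up API key prefix, for illustration
];

function redactSecrets(line: string): string {
  // Apply each pattern in turn, replacing every match with a fixed marker.
  return SECRET_PATTERNS.reduce((out, re) => out.replace(re, '[REDACTED]'), line);
}

const cleaned = redactSecrets('auth=Bearer abc.def-123 ok');
```

Pattern matching like this is a safety net rather than a guarantee, so it is still worth keeping secrets out of log calls in the first place.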

Log one line per request

Emit a single structured "request completed" log at the end of each HTTP handler with status, duration_ms, route, and trace_id. That one line answers 80% of incident questions.
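
As a rough sketch of that "request completed" line, the helper below builds the four fields named above. The RequestSummary type and requestCompleted function are hypothetical names, and the timestamps are passed in explicitly so the example stays framework-agnostic.

```typescript
type RequestSummary = {
  msg: 'request completed';
  status: number;
  duration_ms: number;
  route: string;
  trace_id: string;
};

function requestCompleted(
  route: string,
  status: number,
  startedAtMs: number,
  endedAtMs: number,
  traceId: string,
): RequestSummary {
  return {
    msg: 'request completed',
    status,
    duration_ms: endedAtMs - startedAtMs, // wall-clock time spent in the handler
    route,
    trace_id: traceId,
  };
}

// Serialize as a single JSON line; writing it to stdout is left to the logger.
const summaryLine = JSON.stringify(requestCompleted('/checkout', 200, 1000, 1042, 'abc123'));
```

In a real handler you would hand the object to the logger from lib/logger.ts instead of serializing it yourself.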

Distributed tracing without the yak shave

If your API calls Postgres and Redis over the private network, a slow request could be slow anywhere. Traces show the breakdown: 12ms in middleware, 340ms in a database query, 4ms in serialization.
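
The breakdown above is simple arithmetic over span durations. A minimal sketch, assuming a simplified Span type that carries only a name and a duration (real OpenTelemetry spans hold much more):

```typescript
type Span = { name: string; durationMs: number };

// Returns the span with the largest duration, or undefined for an empty trace.
function slowestSpan(spans: Span[]): Span | undefined {
  return spans.reduce<Span | undefined>(
    (worst, s) => (worst === undefined || s.durationMs > worst.durationMs ? s : worst),
    undefined,
  );
}

// Fraction of the summed span time spent in one span.
function shareOfTotal(span: Span, spans: Span[]): number {
  const total = spans.reduce((sum, s) => sum + s.durationMs, 0);
  return total === 0 ? 0 : span.durationMs / total;
}

const exampleSpans: Span[] = [
  { name: 'middleware', durationMs: 12 },
  { name: 'db.query', durationMs: 340 },
  { name: 'serialize', durationMs: 4 },
];
```

With the example numbers, the database query accounts for 340 of 356 ms, roughly 96% of the measured time, which is where to look first.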

instrumentation.ts

// Auto-injected by StackBlaze, shown for transparency
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new PgInstrumentation(),
  ],
  // Sample 10% of new traces; child spans follow their parent's decision.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});

sdk.start();

Sampling at 10% keeps cost predictable on high-traffic services while still giving you enough traces to debug intermittent failures. You can raise the sample rate per service in the dashboard for short windows during an incident.
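
To show why ratio sampling keeps whole traces together, here is a simplified stand-in for a parent-based, trace-ID-ratio decision. It is not the exact OpenTelemetry algorithm, and the shouldSample helper is a hypothetical name; it only illustrates the idea that the decision is derived from the trace ID and inherited downstream.

```typescript
// Simplified sketch, NOT the exact OpenTelemetry implementation.
function shouldSample(traceId: string, ratio: number, parentSampled?: boolean): boolean {
  // Parent-based: if an upstream service already decided, follow that decision.
  if (parentSampled !== undefined) return parentSampled;
  // Root span: map the first 32 bits of the hex trace ID onto [0, 1).
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

const lowIdSampled = shouldSample('00000000aaaaaaaaaaaaaaaaaaaaaaaa', 0.1);
const highIdSampled = shouldSample('80000000aaaaaaaaaaaaaaaaaaaaaaaa', 0.1);
```

Because the decision depends only on the trace ID and the parent's choice, every service in a request path agrees on whether to record it, so sampled traces are complete rather than missing spans.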

Alerts that match user pain

We ship default alert rules for every web service: error rate above 1% for five minutes, p99 latency above 2s for ten minutes, and zero healthy replicas. You can add custom rules in blueprint.yaml or the dashboard.

blueprint.yaml

services:
  api:
    type: web
    alerts:
      - name: high-error-rate
        condition: error_rate > 0.01
        for: 5m
        notify: [pagerduty, slack-oncall]
      - name: slow-checkout
        condition: p99_latency{route="/checkout"} > 3s
        for: 10m
        notify: [slack-team-commerce]
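
The for: field is what separates a sustained problem from a blip. The sketch below assumes the rule fires only when the condition holds for every one-minute sample in the window, similar to how Prometheus-style alerting treats a for clause; the post does not spell out the platform's exact evaluation semantics, and the Sample type and shouldFire helper are hypothetical.

```typescript
type Sample = { requests: number; errors: number }; // one entry per minute, oldest first

function shouldFire(samples: Sample[], threshold: number, forMinutes: number): boolean {
  if (samples.length < forMinutes) return false; // not enough history yet
  const recent = samples.slice(-forMinutes);
  // Every minute in the window must breach the threshold; idle minutes never breach.
  return recent.every((s) => s.requests > 0 && s.errors / s.requests > threshold);
}

const sustained: Sample[] = Array.from({ length: 5 }, () => ({ requests: 1000, errors: 20 }));
const recovered: Sample[] = [...sustained.slice(0, 4), { requests: 1000, errors: 2 }];
```

Here sustained (2% errors for five straight minutes) would fire the high-error-rate rule, while recovered would not, because the last minute dropped back to 0.2%.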

Alert type	Good for	Avoid
Error rate SLO	User-visible failures	Alerting on single 500s
Latency SLO	Degraded experience	Raw CPU thresholds
Saturation	Capacity planning	Primary on-call signal
Synthetic checks	Critical paths	Replacing real traffic metrics

Debugging a real incident

Start from the alert: note the service, env, and time window.
Open the metrics view: confirm that the error-rate or latency spike aligns with user reports.
Filter logs by trace_id from a failing request sample.
Open the trace: identify the slowest span (often a missing index or upstream timeout).
Ship a fix; watch the same dashboards to confirm recovery.
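
The log-filtering step in the list above amounts to selecting entries by trace_id and reading them in time order. A minimal sketch over already-parsed entries, assuming a hypothetical LogEntry shape with ISO 8601 UTC timestamps (the dashboard does this for you; the code only shows the logic):

```typescript
type LogEntry = { time: string; level: string; msg: string; trace_id?: string };

function logsForTrace(entries: LogEntry[], traceId: string): LogEntry[] {
  return entries
    .filter((e) => e.trace_id === traceId) // filter() copies, so the input is not reordered
    // Same-format ISO 8601 UTC timestamps sort correctly as plain strings.
    .sort((a, b) => (a.time < b.time ? -1 : a.time > b.time ? 1 : 0));
}

const entries: LogEntry[] = [
  { time: '2026-05-15T02:00:03Z', level: 'error', msg: 'query timeout', trace_id: 't1' },
  { time: '2026-05-15T02:00:02Z', level: 'info', msg: 'request completed', trace_id: 't2' },
  { time: '2026-05-15T02:00:01Z', level: 'info', msg: 'request started', trace_id: 't1' },
];
const failing = logsForTrace(entries, 't1');
```

Reading failing top to bottom gives the story of one request: it started, then hit a query timeout two seconds later.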

Observability is not a product checkbox; it is the difference between a thirty-minute incident and a three-hour one. We built it into the platform so every team gets that baseline without hiring a platform engineer first.

Sarah Kim

Co-founder & CTO at StackBlaze

Member of the founding team at StackBlaze. Writes about infrastructure, engineering culture, and the systems that keep production running.
