Scaling

Horizontal scaling and autoscaling

7 min read · Updated April 2026

StackBlaze manages a Kubernetes HorizontalPodAutoscaler (HPA) for every service with autoscaling enabled. The HPA continuously monitors CPU and memory metrics from the cluster's Metrics Server and adjusts the Deployment's replica count to match actual demand.

Scale-up is fast: new pods typically start within 30 seconds. Scale-down is intentionally slow: the HPA waits for 5 minutes of below-threshold utilization before removing pods, which prevents thrashing during bursty traffic.

Autoscaling architecture

[Diagram: HTTP traffic reaches the load balancer, which distributes load round-robin across replica-1 (CPU: 78%), replica-2 (CPU: 65%), and replica-3 (CPU: 71%). The HPA sees CPU at 78% and scales up.]

Autoscaling configuration

  • Minimum replicas: 2. Always running, even at zero traffic.
  • Maximum replicas: 10. Hard ceiling on scale-out.
  • Target CPU threshold: 60%. Scale up when average CPU exceeds this.
  • Monthly spend cap: $150 / mo. Pause scaling when the limit is reached.
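
The settings above correspond closely to fields of a standard Kubernetes autoscaling/v2 HorizontalPodAutoscaler. The sketch below builds such a manifest as a Python dictionary. It is an illustration only: the function name build_hpa_manifest is hypothetical, the manifest StackBlaze actually generates is not shown in this guide and may differ, and the monthly spend cap is a StackBlaze feature with no HPA field, so it is left out.

```python
import json


def build_hpa_manifest(service, min_replicas=2, max_replicas=10,
                       cpu_target_percent=60, scale_down_window_seconds=300):
    """Return an autoscaling/v2 HPA manifest as a plain dict.

    Defaults mirror the example values on this page. Assumes the
    Deployment has the same name as the service (an assumption).
    """
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": service},
        "spec": {
            # The workload whose replica count the HPA adjusts.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": service,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": cpu_target_percent,
                    },
                },
            }],
            # The 5-minute scale-down window described on this page.
            "behavior": {
                "scaleDown": {
                    "stabilizationWindowSeconds": scale_down_window_seconds,
                },
            },
        },
    }


manifest = build_hpa_manifest("my-web-service")
print(json.dumps(manifest, indent=2))
```

Printing the result as JSON gives a document that could be compared field by field against the HPA object in a cluster, for example to confirm that the dashboard values were applied.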

Scale-up event log

HPA event log, my-web-service

14:02:31 HPA metrics collected: CPU avg 42% across 2 replicas

14:07:14 HPA metrics collected: CPU avg 58% across 2 replicas

14:08:01 HPA metrics collected: CPU avg 78% across 2 replicas

14:08:01 HPA: scaling from 2 to 4 replicas (CPU 78% > threshold 60%)

14:08:18 replica-3: scheduled on node worker-07

14:08:24 replica-4: scheduled on node worker-02

14:08:41 replica-3: ready, readiness probe passed

14:08:48 replica-4: ready, readiness probe passed

14:08:48 HPA: all 4 replicas healthy, load balancer updated

14:09:02 HPA metrics collected: CPU avg 41% across 4 replicas
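
To check the "within 30 seconds" startup figure against a log like this one, the time between each replica's "scheduled" and "ready" lines can be computed with a few lines of Python. This is a minimal sketch that assumes the log is available as plain text in exactly the format shown above; the function name startup_seconds is hypothetical, and whether StackBlaze offers a text export of the event log is not stated in this guide.

```python
import re
from datetime import datetime

# Replica lines copied from the event log above.
LOG = """\
14:08:18 replica-3: scheduled on node worker-07
14:08:24 replica-4: scheduled on node worker-02
14:08:41 replica-3: ready, readiness probe passed
14:08:48 replica-4: ready, readiness probe passed
"""

# Matches "HH:MM:SS replica-N: scheduled ..." and "HH:MM:SS replica-N: ready ...".
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) (replica-\d+): (scheduled|ready)\b")


def startup_seconds(log_text):
    """Return {replica: seconds from 'scheduled' to 'ready'}.

    Lines that do not describe a replica event are ignored. Timestamps
    carry no date, so a log that crosses midnight is not handled.
    """
    scheduled = {}
    durations = {}
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        stamp, replica, event = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        if event == "scheduled":
            scheduled[replica] = when
        elif replica in scheduled:
            delta = when - scheduled[replica]
            durations[replica] = int(delta.total_seconds())
    return durations


print(startup_seconds(LOG))  # {'replica-3': 23, 'replica-4': 24}
```

For the sample log, both new replicas became ready 23 to 24 seconds after being scheduled.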

Under the hood

  • HorizontalPodAutoscaler: a Kubernetes-native HPA resource targets your Deployment. It polls the Metrics Server every 15 seconds and uses a proportional algorithm to calculate the ideal replica count: ceil(currentReplicas * currentMetric / desiredMetric).
  • Metrics Server: a lightweight aggregator that collects CPU and memory usage from each node's kubelet. StackBlaze keeps it running on every cluster. Custom metrics (requests/second, queue depth) are available on enterprise plans via the Prometheus adapter.
  • Scale-down stabilisation: the HPA uses a 5-minute stabilisation window before removing pods. This prevents flapping when traffic is bursty. Scale-up has no delay; it reacts immediately to protect user experience.
  • Pod Disruption Budget: StackBlaze automatically creates a PDB ensuring at least 50% of replicas remain available during node drains and cluster upgrades. Your service stays up even during maintenance windows.
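
The proportional formula and the scale-down window from the list above can be sketched in Python as follows. This is an approximation under stated assumptions, not StackBlaze's or Kubernetes' actual code: the function names are hypothetical, the replica bounds default to the example values on this page (2 and 10), and the tolerance band that upstream Kubernetes applies around the target is omitted. The stabilisation function follows the upstream behaviour of acting on the highest recommendation seen within the window.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=2, max_replicas=10):
    """ceil(current * currentMetric / desiredMetric), clamped to the bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


def stabilized_recommendation(history, now, window_seconds=300):
    """Pick the replica count to act on when scaling down.

    history is a list of (timestamp_seconds, recommended_replicas).
    The highest recommendation inside the window wins, so one quiet
    reading cannot remove pods on its own.
    """
    recent = [replicas for stamp, replicas in history
              if now - stamp <= window_seconds]
    if not recent:
        raise ValueError("no recommendations inside the window")
    return max(recent)


# Per-replica CPU figures from the architecture diagram: 78%, 65%, 71%.
average_cpu = (78 + 65 + 71) / 3
print(desired_replicas(3, average_cpu, 60))   # 4

# After the scale-up in the event log: 4 replicas averaging 41% CPU.
print(desired_replicas(4, 41, 60))            # 3

# The recommendation of 3 is held back while a 4 is still in the window.
history = [(0, 4), (60, 3), (120, 3)]
print(stabilized_recommendation(history, now=120))  # 4
print(stabilized_recommendation(history, now=301))  # 3
```

The last two calls show why scale-down lags: the lower recommendation only takes effect once the older, higher one is more than 300 seconds old.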

Step by step

01

Set minimum and maximum replicas

In the StackBlaze dashboard, open your service and go to the "Scaling" tab. Set a minimum replica count (we recommend at least 2 for production to survive a node failure) and a maximum that fits your budget and expected load.

02

Set a CPU or memory threshold

Choose the metric that best predicts load for your service type. CPU works well for compute-bound services (API servers, workers). Memory is better for data-heavy workloads. StackBlaze defaults to 60% CPU utilization: scale-up triggers when average usage across all pods exceeds the threshold.

03

Optionally set a spend cap

Enable the monthly spend cap to prevent runaway scaling costs. When your service hits the cap, scaling pauses and you receive an alert. You can then raise the cap or investigate the traffic spike from the dashboard, without facing a surprise bill.

04

Test with a load generator

Use hey, k6, or Locust to send sustained load to your service. Watch the "Scaling" tab in real time as the HPA adds replicas. Check the event log to see exactly when scale-up and scale-down events fired and which metric triggered them.
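
If none of those tools is installed, a rough stand-in can be written with the Python standard library alone. The sketch below is not a replacement for hey, k6, or Locust: it has no ramp-up, rate limiting, or latency reporting. The function names are hypothetical, and the URL, duration, and concurrency are placeholders to adjust for your own service.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def _hit(url, timeout):
    """Send one GET; return the HTTP status, or None on a network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses still count as answers
    except OSError:
        return None  # refused connection, timeout, DNS failure


def run_load(url, duration_seconds=60.0, concurrency=20, timeout=5.0):
    """Send GET requests from `concurrency` threads until the deadline.

    Returns {status_code_or_None: count} summed over all threads.
    """
    deadline = time.monotonic() + duration_seconds

    def worker(_):
        counts = {}
        while time.monotonic() < deadline:
            status = _hit(url, timeout)
            counts[status] = counts.get(status, 0) + 1
        return counts

    totals = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for counts in pool.map(worker, range(concurrency)):
            for status, count in counts.items():
                totals[status] = totals.get(status, 0) + count
    return totals
```

A call such as run_load("https://my-web-service.example.com/", duration_seconds=600) (a placeholder URL) keeps load on the service for ten minutes, long enough to observe a scale-up and, once the load stops, the 5-minute scale-down window.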