
Inside StackBlaze's private networking layer

How we built a 100 Gbps isolated fabric that connects services with sub-millisecond latency without a VPN.

Sarah Kim

Co-founder & CTO

March 6, 2026 | 14 min read

When we were designing StackBlaze's multi-service architecture, private networking was the feature I cared most about getting right. I've worked at companies where inter-service communication went over the public internet: load balancers calling other load balancers, encrypted but expensive, slow, and full of unnecessary hops. We built something different, and this post explains how it works.

Why private networking matters

When two of your services need to talk to each other (say, your API calling your Redis cache, or your backend querying Postgres), the path that request takes has significant consequences for latency, cost, and security.

Most platforms route inter-service traffic over the public internet by default. Your API makes a TLS connection to a public hostname, traffic exits the data center, traverses the internet, and arrives back at another service in the same physical building. You pay egress fees on both sides. You add 5–30ms of unnecessary latency. And you're relying on TLS to protect what should be internal traffic.

The naive approach

The simplest way to build private networking is one VPC per customer, shared by all of that customer's services. Every customer gets their own virtual network, their services get private IPs, and traffic stays on-fabric. This works fine at small scale.

But it doesn't scale operationally. You end up managing thousands of VPCs, VPC peering becomes a nightmare as customers add cross-region services, and you can't easily do traffic shaping or enforce network policies between tenants. Cloud providers charge for VPC peering and NAT gateways in ways that add up quickly.

Our architecture

StackBlaze runs on bare metal in our own data centers across four regions. This matters because it gives us control over the physical network layer that you simply don't have in a cloud VPC environment.

Every compute node is connected to a dedicated private fabric, a separate physical network from the public-facing uplinks. Services on the same platform communicate over this fabric, never touching the internet. Traffic is routed at Layer 3 using BGP, with each region announcing its own prefixes.
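To make the per-region prefix idea concrete, here is a minimal sketch of how a destination address maps to the region whose prefix contains it. The region names and 10.x prefixes are hypothetical, chosen only for illustration; the actual addressing plan is not described here, and real route selection happens in the BGP speakers on the fabric, not in application code.

```python
import ipaddress
from typing import Optional

# Hypothetical prefixes, one per region (illustrative values only).
REGION_PREFIXES = {
    "region-a": ipaddress.ip_network("10.16.0.0/12"),
    "region-b": ipaddress.ip_network("10.32.0.0/12"),
    "region-c": ipaddress.ip_network("10.48.0.0/12"),
    "region-d": ipaddress.ip_network("10.64.0.0/12"),
}


def region_for(address: str) -> Optional[str]:
    """Return the region whose prefix contains the address, or None."""
    ip = ipaddress.ip_address(address)
    matches = [
        (net.prefixlen, name)
        for name, net in REGION_PREFIXES.items()
        if ip in net
    ]
    if not matches:
        return None
    # Longest-prefix match: prefer the most specific containing network.
    return max(matches)[1]
```

With these made-up prefixes, an address such as 10.33.4.7 falls inside region-b's range, while an address outside every prefix returns None.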

The fabric design

Each node has dual 100 Gbps uplinks to the private fabric (active/active bonding). The top-of-rack switches are connected via 400 Gbps links to the spine layer. In practice, inter-service bandwidth is limited by the service's CPU and memory long before it hits network limits.

Tenant isolation is enforced at two layers: Kubernetes network policy and physical VLANs. Each tenant's traffic is tagged at ingress, and the tags are enforced all the way through the fabric. A service in tenant A cannot reach a service in tenant B even if it somehow obtained the correct IP address; the fabric drops the packet.
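The drop rule can be modelled in a few lines. This is a conceptual sketch only, not the switch or network policy configuration: the `Packet` type, the `ENDPOINT_TENANT` table, and the tag values are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    tenant_tag: int  # tag applied at ingress (hypothetical field name)
    dst_ip: str      # destination address on the private fabric


# Hypothetical table: which tenant tag owns each destination address.
ENDPOINT_TENANT = {
    "10.16.0.10": 101,  # a service belonging to tenant A
    "10.16.0.20": 202,  # a service belonging to tenant B
}


def should_forward(packet: Packet) -> bool:
    """Forward only when the packet's tag matches the destination's tenant."""
    owner = ENDPOINT_TENANT.get(packet.dst_ip)
    # Default deny: unknown destinations and cross-tenant traffic are dropped.
    return owner is not None and owner == packet.tenant_tag
```

Knowing tenant B's address is not enough for tenant A: `should_forward(Packet(101, "10.16.0.20"))` is false because the tag and the destination's owner differ.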

DNS resolution

Rather than exposing raw IP addresses, every service on StackBlaze gets an .internal hostname: your-service-name.internal. These hostnames resolve via a per-tenant DNS zone served by CoreDNS instances running in each region.

services/api.js
const { Pool } = require('pg');

// Use the .internal hostname, which resolves to a private fabric IP.
// No public internet involved.
const pool = new Pool({
  host: process.env.DB_HOST, // set to: my-postgres.internal
  port: 5432,
  database: process.env.DB_NAME,
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
});

services/worker.py
import os

import redis

# Redis running on the private fabric, no TLS overhead, no egress fees.
r = redis.Redis(
    host=os.environ['REDIS_HOST'],  # e.g. my-redis.internal
    port=6379,
    decode_responses=True,
)
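If you want to confirm that a hostname really resolves to a private address before connecting, a small check like the following can help. It is a sketch under one assumption: that private fabric addresses fall inside the ranges Python's `ipaddress` module classifies as private (for example RFC 1918 space), which is not stated above. The function names are hypothetical.

```python
import ipaddress
import socket
from typing import List


def is_private_address(address: str) -> bool:
    """True if the address is in a range the ipaddress module treats as private."""
    return ipaddress.ip_address(address).is_private


def resolve_private(hostname: str, port: int) -> List[str]:
    """Resolve a hostname and fail if any returned address is public."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry's sockaddr tuple starts with the address string.
    addresses = sorted({info[4][0] for info in infos})
    public = [a for a in addresses if not is_private_address(a)]
    if public:
        raise RuntimeError(f"{hostname} resolved to public address(es): {public}")
    return addresses
```

A service could call `resolve_private(os.environ["DB_HOST"], 5432)` at startup and refuse to start if the hostname was accidentally set to a public endpoint.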

Latency numbers

| Metric | Public internet routing | StackBlaze private network |
| --- | --- | --- |
| p50 round-trip latency | 8–25 ms | 0.15–0.4 ms |
| p99 round-trip latency | 40–120 ms | 0.8–2 ms |
| Bandwidth (service to service) | Limited by NAT/LB | Up to 10 Gbps per service |
| Egress fees | Yes (cloud rates) | None |
| TLS required for security | Yes | No (optional) |

The latency difference is meaningful for high-QPS services. A microservice that makes 5 sequential database calls per request at 20 ms each spends 100 ms per request on network time alone. At 0.3 ms each, that drops to 1.5 ms, a 98.5% reduction in network latency for that request.
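The arithmetic is easy to rerun with your own numbers. This sketch assumes the calls happen one after another, so their round trips add up; calls issued in parallel would overlap and cost less.

```python
def network_time_ms(calls: int, round_trip_ms: float) -> float:
    """Total network time for sequential calls, in milliseconds."""
    return calls * round_trip_ms


public_ms = network_time_ms(5, 20.0)   # five calls over the public internet
private_ms = network_time_ms(5, 0.3)   # the same calls on the private fabric

# Fraction of network time removed by moving to the private fabric.
reduction = (public_ms - private_ms) / public_ms

print(f"public: {public_ms:.1f} ms, private: {private_ms:.1f} ms, "
      f"reduction: {reduction:.1%}")
```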

Security model

We use a zero-trust model between tenants. The default posture is deny-all: your services cannot reach another tenant's services, and no inbound connections from outside your environment are allowed on private fabric addresses.

Do I still need TLS for internal traffic?

For most applications, no. Traffic on the private fabric is isolated at the physical and VLAN layers, so other tenants cannot intercept it. That said, if you have compliance requirements (PCI-DSS, HIPAA) that mandate encryption in transit regardless of the network path, you can still use TLS for internal connections.
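For teams that do need encryption in transit, here is a minimal sketch of a client-side TLS context built with Python's standard `ssl` module. The `internal_tls_context` name and the `ca_file` parameter are hypothetical, and the sketch assumes you run your own certificate authority for internal hostnames; how certificates for .internal names are issued is not covered here.

```python
import ssl
from typing import Optional


def internal_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client context that verifies the server certificate and hostname."""
    # With ca_file=None, the system's default trust store is used.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

How the context is handed to a database or cache client depends on that client library, so check its documentation for the relevant TLS options.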

How to use it

There's nothing to configure. When you add a second service to your environment, StackBlaze automatically registers its .internal hostname in your tenant's DNS zone. You'll see the internal hostname on the service's Overview page. Set the hostname as an environment variable in the service that needs to connect to it, and use it in your code exactly as you would a hostname in development.
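As a sketch of that workflow, the helper below reads a hostname from an environment variable and falls back to localhost for local development. The function names and the `DB_HOST` variable are illustrative assumptions, not part of any StackBlaze SDK.

```python
import os


def service_host(env_var: str, default: str = "localhost") -> str:
    """Read a service hostname from the environment, with a development default."""
    host = os.environ.get(env_var, "").strip()
    return host or default


def is_internal_hostname(host: str) -> bool:
    """True if the hostname uses the .internal suffix."""
    return host.lower().endswith(".internal")


db_host = service_host("DB_HOST")
if not is_internal_hostname(db_host):
    # Expected in local development; worth a second look in production.
    print(f"note: {db_host} is not an .internal hostname")
```

The same code runs unchanged in both places: in development `DB_HOST` is unset or points at a local database, and on the platform it holds a value such as my-postgres.internal.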

Private networking is available on all plans, including the free tier. We didn't make it a paid feature because it's better for everyone when services communicate efficiently; it reduces load on our infrastructure too.

Sarah Kim

Co-founder & CTO at StackBlaze

Member of the founding team at StackBlaze. Writes about infrastructure, engineering culture, and the systems that keep production running.
