Why we ship fast even when it feels uncomfortable

Our approach to iteration, how we handle mistakes in public, and why we think perfectionism is a trap.

Alex Chen

Co-founder & CEO

March 20, 2026 · 6 min read

Last month we shipped a billing change that broke invoice emails for about 200 accounts. We found it four hours after deploy, fixed it in twenty minutes, and sent a personal email to every affected account that same afternoon. A few of those customers wrote back to say thank you.

I've been thinking about why that happened: not the bug, but the thank-you emails. I think it's because moving fast, when it's done honestly, builds more trust than moving slowly. Here's how we think about speed at StackBlaze.

Why we move fast

The cloud infrastructure market is not one you win by being correct on the first try. You win it by learning faster than everyone else. The companies that move slowly are not building better software; they are accumulating opinions about what users might want instead of learning what users actually need.

We ship every Tuesday and Thursday, regardless of what's ready. Not because we're reckless, but because a fixed ship cadence forces a kind of discipline that variable shipping does not. If you know you're shipping Tuesday, you scope your work to fit Tuesday. You cut scope, not quality. You ship something real rather than waiting for the perfect version.

The 70% rule

We ship when something is 70% of where we want it to be, as long as it's genuinely useful to someone at that point. The last 30% takes twice as long and teaches you half as much as shipping and watching how people actually use it.

What fast actually means

Fast does not mean undisciplined. Every StackBlaze deploy goes through a full test suite, a staging environment, and a canary rollout that watches error rates before going to 100% of traffic. The speed comes from our architecture and processes, not from skipping steps.
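As a rough illustration of the canary step, here is a minimal sketch in Python. It is not StackBlaze's actual rollout code. The function name `canary_rollout`, the traffic steps, and the 1% error threshold are all assumptions made for the example. The idea is only that traffic widens in stages and the rollout stops as soon as the observed error rate crosses the threshold.

```python
from typing import Callable, Sequence, Tuple


def canary_rollout(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (1, 10, 50, 100),  # hypothetical traffic percentages
    max_error_rate: float = 0.01,             # hypothetical threshold: 1% errors
) -> Tuple[bool, int]:
    """Widen traffic step by step, stopping at the first unhealthy step.

    error_rate_at(percent) stands in for whatever monitoring query reports
    the error rate while `percent` of traffic is on the new version.

    Returns (promoted, last_healthy_percent). If promoted is False, the
    caller should roll back to the last healthy percentage.
    """
    last_healthy = 0
    for percent in steps:
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            # Stop before exposing more users to the bad build.
            return False, last_healthy
        last_healthy = percent
    return True, last_healthy


if __name__ == "__main__":
    # A build that only misbehaves under heavier load.
    flaky = lambda percent: 0.05 if percent >= 50 else 0.0
    print(canary_rollout(flaky))
```

In this sketch a bad build that only fails at 50% of traffic is caught after reaching 10% of users, which is the point of watching error rates before going to 100%.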

We write tests while we write the feature, not after, because retrofitting tests is slower than writing them alongside.
We use feature flags for anything that touches billing, auth, or data storage, so we can ship code dark and turn it on for a subset of users first.
We run post-incident reviews for anything that caused user-facing impact, but we keep them blameless and forward-looking.
We don't bikeshed. If two approaches are both reasonable, we pick one and ship it. We can always change it later.
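To make the feature-flag practice concrete, here is a small Python sketch of one common way to turn a flag on for a subset of users: hash the flag name and account ID into a stable bucket from 0 to 99 and compare it to a rollout percentage. This is an illustration under assumptions, not a description of StackBlaze's flag system. The function `is_enabled`, the flag name, and the account IDs are hypothetical.

```python
import hashlib


def is_enabled(flag: str, account_id: str, rollout_percent: int) -> bool:
    """Return True if this account falls inside the flag's rollout percentage.

    Hashing flag + account ID gives each account a stable bucket (0-99) per
    flag, so the same account gets the same answer on every request, and
    raising rollout_percent only ever adds accounts.
    """
    digest = hashlib.sha256(f"{flag}:{account_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag and accounts: code ships dark at 0%, then widens.
    accounts = [f"acct-{n}" for n in range(1000)]
    for percent in (0, 5, 50, 100):
        enabled = sum(is_enabled("new-invoice-emails", a, percent) for a in accounts)
        print(f"{percent}% rollout: {enabled} of {len(accounts)} accounts enabled")
```

A rollout percentage of 0 is what shipping code dark looks like here: the code is deployed but no account sees it until the percentage is raised.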

How we handle public mistakes

When something goes wrong, and it will, the only two things that matter are: how fast did you fix it, and how honestly did you communicate? We write public post-mortems for any incident that caused more than 15 minutes of user-facing impact. We name the cause specifically, not vaguely. 'Our database query exceeded its timeout due to a missing index on the events table' is better than 'an infrastructure issue caused elevated error rates.'

Vague incident communications are a form of disrespect. Your users are technical people. They deserve a specific explanation, a clear timeline, and a concrete plan to prevent recurrence. That's what we try to deliver.

The perfectionism trap

I've seen teams spend three weeks building the perfect version of a feature that their users would have been perfectly happy with after three days. The extra two and a half weeks did not go into quality; they went into convincing themselves that the feature was ready. That's not engineering. That's anxiety management.

Perfectionism is especially dangerous in infrastructure products because the failure modes are so visible. A bug in a billing system or a deployment tool is immediately apparent. That visibility creates pressure to be perfect before shipping, which creates long cycles, which creates bigger changes, which creates bigger blast radii when something does go wrong. Speed, paradoxically, makes shipping safer, because your changes are smaller.

What we ship next

We're shipping PR preview environments this week. We announced the feature in our changelog before the backend was fully stable, because a handful of teams were already asking for it and we wanted to set accurate expectations. A few things about that feature are still rough: the seeding time for large databases is longer than we want, and the URL format is not final.

We shipped it anyway. We'll fix the seeding time next week. We'll finalize the URL format the week after. By the time we've iterated three times, we'll know things about how teams use preview environments that we could never have learned by waiting.

That's the deal: we move fast, we communicate honestly when we get something wrong, and we iterate until we get it right. If that sounds like a company you want to build on top of, we'd love to have you.

Alex Chen

Co-founder & CEO at StackBlaze

Member of the founding team at StackBlaze. Writes about infrastructure, engineering culture, and the systems that keep production running.
