TutorialsAIPostgreSQLTutorial

Building a RAG chatbot with pgvector on StackBlaze

Step-by-step guide to deploying a retrieval-augmented generation app with a pgvector-powered embedding store.

Priya Patel

Head of Engineering

Feb 18, 202611 min read

Retrieval-augmented generation (RAG) is the pattern behind most production LLM applications: rather than relying solely on the model's training data, you retrieve relevant context from your own documents and include it in the prompt. The quality of that retrieval step determines the quality of the answers. This tutorial walks through building a complete RAG pipeline using pgvector on StackBlaze, from document ingestion to a working API endpoint.

What we're building

A FastAPI service that answers questions about a corpus of documents. Users ask a question; the service finds the most semantically similar document chunks, assembles them into a prompt, and calls OpenAI's GPT-4o to generate an answer grounded in your docs.

Architecture overview

Postgres 16 with pgvector extension, stores document chunks and their embeddings
Python ingestion script, reads documents, generates embeddings via OpenAI, writes to Postgres
FastAPI service, handles user queries, runs similarity search, calls the LLM
OpenAI API, generates embeddings (text-embedding-3-small) and completions (gpt-4o)

Everything except the OpenAI API runs on StackBlaze. The Postgres database and the FastAPI service communicate over the private fabric, so embedding storage and retrieval are fast and free of egress costs.

Step 1: Deploy the stack

Create a new environment in StackBlaze. Add a Postgres 16 database (the Standard plan gives you enough storage for a few million document chunks). Add a Web Service for the FastAPI app, pointing at your GitHub repo. Set the runtime to Python 3.11.

Add your OPENAI_API_KEY as a secret environment variable on the web service. StackBlaze automatically injects DATABASE_URL when the Postgres resource is attached.

Step 2: Enable pgvector

CREATE EXTENSION IF NOT EXISTS vector;

Choosing the right embedding dimensions

OpenAI's text-embedding-3-small model outputs 1536-dimensional vectors by default, but supports dimensionality reduction down to 256. Smaller dimensions use less storage and make similarity search faster, with a small accuracy tradeoff. For most RAG applications, 512 dimensions is a good balance. Whatever you choose, the number must match your vector column definition - you can't change it without re-embedding your entire corpus.

Step 3: Create the embeddings table

migrations/001_create_documents.sql

CREATE TABLE documents (
  id          BIGSERIAL PRIMARY KEY,
  source      TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,
  content     TEXT NOT NULL,
  embedding   VECTOR(1536) NOT NULL,
  created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest-neighbor search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Step 4: Ingest documents

scripts/ingest.py

import os
import psycopg2
from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
conn = psycopg2.connect(os.environ['DATABASE_URL'])

CHUNK_SIZE = 800
CHUNK_OVERLAP = 100

def chunk_text(text: str) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + CHUNK_SIZE
        chunks.append(text[start:end])
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model='text-embedding-3-small', input=texts)
    return [item.embedding for item in response.data]

def ingest_file(path: Path) -> None:
    text = path.read_text(encoding='utf-8')
    chunks = chunk_text(text)
    embeddings = embed(chunks)
    with conn.cursor() as cur:
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                'INSERT INTO documents (source, chunk_index, content, embedding) VALUES (%s, %s, %s, %s)',
                (str(path), i, chunk, embedding),
            )
    conn.commit()
    print(f'Ingested {len(chunks)} chunks from {path}')

for doc in Path('docs').rglob('*.md'):
    ingest_file(doc)

conn.close()

Step 5: Query with similarity search

app/retrieval.py

def retrieve(question: str, top_k: int = 5) -> list[dict]:
    embedding = client.embeddings.create(
        model='text-embedding-3-small', input=question
    ).data[0].embedding
    with conn.cursor() as cur:
        cur.execute(
            'SELECT source, chunk_index, content, 1 - (embedding <=> %s::vector) AS similarity FROM documents ORDER BY embedding <=> %s::vector LIMIT %s',
            (embedding, embedding, top_k),
        )
        rows = cur.fetchall()
    return [{'source': r[0], 'chunk_index': r[1], 'content': r[2], 'similarity': r[3]} for r in rows]

Step 6: Wire up the LLM

app/main.py

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
import os
from .retrieval import retrieve

app = FastAPI()
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

class QuestionRequest(BaseModel):
    question: str
    top_k: int = 5

@app.post('/ask')
def ask(req: QuestionRequest):
    chunks = retrieve(req.question, top_k=req.top_k)
    context = '\n\n---\n\n'.join(c['content'] for c in chunks)
    sources = list(dict.fromkeys(c['source'] for c in chunks))
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant that answers questions based on the provided documentation. If the answer is not in the documentation, say so.'},
            {'role': 'user', 'content': f'Documentation:\n\n{context}\n\nQuestion: {req.question}'},
        ],
        temperature=0.2,
    )
    return {'answer': response.choices[0].message.content, 'sources': sources}

Production considerations

Connection pooling is important at scale

The code above opens a module-level psycopg2 connection, which is fine for development. In production, use a connection pool (psycopg2.pool.ThreadedConnectionPool or psycopg[pool]) and set pool size to match your FastAPI worker count. StackBlaze's Postgres instances support up to 200 concurrent connections on Standard plans.

Chunk size matters: 800-character chunks work well for technical docs. Conversational content chunks better at 400–600 characters.
Re-embed when the model changes: OpenAI's embedding models are versioned. If you switch models, you must re-embed your entire corpus.
Index maintenance: The IVFFlat index needs to be rebuilt periodically as your corpus grows. Use REINDEX CONCURRENTLY to do this without downtime.
Cache popular queries: A simple Redis cache keyed on the question string can dramatically reduce OpenAI API costs.
Monitor retrieval quality: Log the similarity scores of retrieved chunks. If top results are consistently below 0.6, your chunks may be too large or your documents don't cover the question domain.

Priya Patel

Head of Engineering at StackBlaze

Member of the founding team at StackBlaze. Writes about infrastructure, engineering culture, and the systems that keep production running.

How Calico network policies isolate tenants on shared hosting

Shared Kubernetes does not have to mean shared trust boundaries. Calico enforces network isolation, Linkerd provides automatic mTLS between services, and Falco detects runtime threats, three layers that keep tenants separated on shared infrastructure.

Sarah Kim

Security16 min read

Shared platform vs dedicated clusters: control plane isolation and policy-as-code

Policy-as-code on a shared platform gives you guardrails without operational overhead. Dedicated clusters add an isolated control plane, single-tenant nodes, and customer-owned policy boundaries, here is how to choose and what changes under the hood.

Priya Patel

Security18 min read

Regulatory compliance and data governance on StackBlaze

SOC 2, GDPR, HIPAA-readiness, data residency, encryption, audit logs, and DPAs, a detailed map of how StackBlaze controls align with common regulatory frameworks and what you own vs what the platform certifies.

Nina Okoye