Building a RAG chatbot with pgvector on StackBlaze
Step-by-step guide to deploying a retrieval-augmented generation app with a pgvector-powered embedding store.
Priya Patel
Head of Engineering
Retrieval-augmented generation (RAG) is the pattern behind most production LLM applications: rather than relying solely on the model's training data, you retrieve relevant context from your own documents and include it in the prompt. The quality of that retrieval step determines the quality of the answers. This tutorial walks through building a complete RAG pipeline using pgvector on StackBlaze, from document ingestion to a working API endpoint.
What we're building
A FastAPI service that answers questions about a corpus of documents. Users ask a question; the service finds the most semantically similar document chunks, assembles them into a prompt, and calls OpenAI's GPT-4o to generate an answer grounded in your docs.
Architecture overview
- Postgres 16 with pgvector extension, stores document chunks and their embeddings
- Python ingestion script, reads documents, generates embeddings via OpenAI, writes to Postgres
- FastAPI service, handles user queries, runs similarity search, calls the LLM
- OpenAI API, generates embeddings (text-embedding-3-small) and completions (gpt-4o)
Everything except the OpenAI API runs on StackBlaze. The Postgres database and the FastAPI service communicate over the private fabric, so embedding storage and retrieval are fast and free of egress costs.
Step 1: Deploy the stack
Create a new environment in StackBlaze. Add a Postgres 16 database (the Standard plan gives you enough storage for a few million document chunks). Add a Web Service for the FastAPI app, pointing at your GitHub repo. Set the runtime to Python 3.11.
Add your OPENAI_API_KEY as a secret environment variable on the web service. StackBlaze automatically injects DATABASE_URL when the Postgres resource is attached.
Step 2: Enable pgvector
CREATE EXTENSION IF NOT EXISTS vector;Choosing the right embedding dimensions
OpenAI's text-embedding-3-small model outputs 1536-dimensional vectors by default, but supports dimensionality reduction down to 256. Smaller dimensions use less storage and make similarity search faster, with a small accuracy tradeoff. For most RAG applications, 512 dimensions is a good balance. Whatever you choose, the number must match your vector column definition - you can't change it without re-embedding your entire corpus.
Step 3: Create the embeddings table
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL,
chunk_index INTEGER NOT NULL,
content TEXT NOT NULL,
embedding VECTOR(1536) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- IVFFlat index for approximate nearest-neighbor search
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);Step 4: Ingest documents
import os
import psycopg2
from openai import OpenAI
from pathlib import Path
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
conn = psycopg2.connect(os.environ['DATABASE_URL'])
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100
def chunk_text(text: str) -> list[str]:
chunks = []
start = 0
while start < len(text):
end = start + CHUNK_SIZE
chunks.append(text[start:end])
start += CHUNK_SIZE - CHUNK_OVERLAP
return chunks
def embed(texts: list[str]) -> list[list[float]]:
response = client.embeddings.create(model='text-embedding-3-small', input=texts)
return [item.embedding for item in response.data]
def ingest_file(path: Path) -> None:
text = path.read_text(encoding='utf-8')
chunks = chunk_text(text)
embeddings = embed(chunks)
with conn.cursor() as cur:
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
cur.execute(
'INSERT INTO documents (source, chunk_index, content, embedding) VALUES (%s, %s, %s, %s)',
(str(path), i, chunk, embedding),
)
conn.commit()
print(f'Ingested {len(chunks)} chunks from {path}')
for doc in Path('docs').rglob('*.md'):
ingest_file(doc)
conn.close()Step 5: Query with similarity search
def retrieve(question: str, top_k: int = 5) -> list[dict]:
embedding = client.embeddings.create(
model='text-embedding-3-small', input=question
).data[0].embedding
with conn.cursor() as cur:
cur.execute(
'SELECT source, chunk_index, content, 1 - (embedding <=> %s::vector) AS similarity FROM documents ORDER BY embedding <=> %s::vector LIMIT %s',
(embedding, embedding, top_k),
)
rows = cur.fetchall()
return [{'source': r[0], 'chunk_index': r[1], 'content': r[2], 'similarity': r[3]} for r in rows]Step 6: Wire up the LLM
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
import os
from .retrieval import retrieve
app = FastAPI()
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
class QuestionRequest(BaseModel):
question: str
top_k: int = 5
@app.post('/ask')
def ask(req: QuestionRequest):
chunks = retrieve(req.question, top_k=req.top_k)
context = '\n\n---\n\n'.join(c['content'] for c in chunks)
sources = list(dict.fromkeys(c['source'] for c in chunks))
response = client.chat.completions.create(
model='gpt-4o',
messages=[
{'role': 'system', 'content': 'You are a helpful assistant that answers questions based on the provided documentation. If the answer is not in the documentation, say so.'},
{'role': 'user', 'content': f'Documentation:\n\n{context}\n\nQuestion: {req.question}'},
],
temperature=0.2,
)
return {'answer': response.choices[0].message.content, 'sources': sources}Production considerations
Connection pooling is important at scale
The code above opens a module-level psycopg2 connection, which is fine for development. In production, use a connection pool (psycopg2.pool.ThreadedConnectionPool or psycopg[pool]) and set pool size to match your FastAPI worker count. StackBlaze's Postgres instances support up to 200 concurrent connections on Standard plans.
- Chunk size matters: 800-character chunks work well for technical docs. Conversational content chunks better at 400–600 characters.
- Re-embed when the model changes: OpenAI's embedding models are versioned. If you switch models, you must re-embed your entire corpus.
- Index maintenance: The IVFFlat index needs to be rebuilt periodically as your corpus grows. Use REINDEX CONCURRENTLY to do this without downtime.
- Cache popular queries: A simple Redis cache keyed on the question string can dramatically reduce OpenAI API costs.
- Monitor retrieval quality: Log the similarity scores of retrieved chunks. If top results are consistently below 0.6, your chunks may be too large or your documents don't cover the question domain.
Priya Patel
Head of Engineering at StackBlaze
Member of the founding team at StackBlaze. Writes about infrastructure, engineering culture, and the systems that keep production running.
More from the blog
How Calico network policies isolate tenants on shared hosting
Shared Kubernetes does not have to mean shared trust boundaries. Calico enforces network isolation, Linkerd provides automatic mTLS between services, and Falco detects runtime threats, three layers that keep tenants separated on shared infrastructure.
Shared platform vs dedicated clusters: control plane isolation and policy-as-code
Policy-as-code on a shared platform gives you guardrails without operational overhead. Dedicated clusters add an isolated control plane, single-tenant nodes, and customer-owned policy boundaries, here is how to choose and what changes under the hood.
Regulatory compliance and data governance on StackBlaze
SOC 2, GDPR, HIPAA-readiness, data residency, encryption, audit logs, and DPAs, a detailed map of how StackBlaze controls align with common regulatory frameworks and what you own vs what the platform certifies.