If you're self-hosting Supabase, you've already made a choice: data ownership matters. But when it comes to AI features, most tutorials point you straight to OpenAI or Anthropic APIs—sending your documents to external servers, paying per-token costs, and introducing a dependency that defeats the purpose of self-hosting.
There's another path. Running embedding models locally with Ollama gives you fully private AI capabilities. Your documents never leave your infrastructure. Your costs are fixed to hardware, not usage. And with pgvector already bundled in self-hosted Supabase, you have everything needed for production RAG applications.
## Why Local Embeddings Matter for Self-Hosters
The standard approach to building AI features involves calling external embedding APIs. This works, but it creates friction for teams who chose self-hosting specifically for data control:
- **Data leaves your infrastructure:** Every document you embed gets sent to a third party. For customer data, internal knowledge bases, or proprietary content, this may violate compliance requirements or internal policies.
- **Variable costs at scale:** Embedding APIs charge per token. A 10,000-document knowledge base might cost dollars; a million documents costs hundreds. Usage spikes become budget surprises.
- **Network dependency:** Your AI features stop working when the API provider has issues, and provider outages happen more often than status pages suggest.
- **Latency adds up:** Each embedding request involves a network round-trip. When processing thousands of documents, this compounds into significant ingestion delays.
Local embeddings solve all four problems. Ollama runs on your server, processing text without network calls. Once you've invested in hardware, marginal cost per embedding drops to near-zero.
## Setting Up Ollama with Self-Hosted Supabase
Ollama provides a streamlined way to run open-source language models locally. For embeddings, models like nomic-embed-text and mxbai-embed-large perform competitively with OpenAI's text-embedding-3-small on common retrieval benchmarks, without the per-token costs.
### Step 1: Install Ollama
On the same server running your Docker Compose Supabase stack, install Ollama:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Pull an embedding model
ollama pull nomic-embed-text
```
On Linux, the install script typically registers Ollama as a systemd service. For production, make sure it's enabled so Ollama starts on boot:

```bash
sudo systemctl enable ollama
sudo systemctl start ollama
```
### Step 2: Verify the API
Ollama exposes a REST API on port 11434. Test embedding generation:
```bash
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Test embedding generation"
}'
```
You should receive a JSON response with a 768-dimensional vector (nomic-embed-text's default dimension).
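The response body has the shape `{"embedding": [0.52, -0.13, …]}`; counting the array elements is a quick way to confirm the dimension your table will need.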
### Step 3: Add Ollama to Docker Network (Optional)
If you want Edge Functions or other Supabase services to access Ollama, add it to your Docker network:
```yaml
# Add to your docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    container_name: supabase-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - supabase_network
    restart: unless-stopped

volumes:
  ollama_data:
```
This keeps Ollama reachable at http://ollama:11434 from within the Supabase container network. Adjust `supabase_network` to match the network name your stack actually uses, and note that the container's 11434 port mapping will conflict with the host install from Step 1 if both run at once.
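A Supabase Edge Function can then call Ollama over the shared network. A minimal sketch, assuming the current Deno-based Edge Function runtime (the function name and request shape here are illustrative):

```typescript
// supabase/functions/embed/index.ts (hypothetical function name)
// Calls Ollama over the shared Docker network rather than localhost.
Deno.serve(async (req: Request) => {
  const { text } = await req.json();

  const res = await fetch('http://ollama:11434/api/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
  });

  if (!res.ok) {
    return new Response('Embedding failed', { status: 502 });
  }

  const { embedding } = await res.json();
  return new Response(JSON.stringify({ embedding }), {
    headers: { 'Content-Type': 'application/json' },
  });
});
```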
## Configuring pgvector for Local Embedding Dimensions
If you've followed our pgvector setup guide, you likely have tables configured for 1536-dimensional OpenAI embeddings. Local models use different dimensions:
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| nomic-embed-text | 768 | Excellent | Fast |
| mxbai-embed-large | 1024 | Excellent | Medium |
| all-minilm | 384 | Good | Very Fast |
Create a table matching your chosen model:
```sql
-- For nomic-embed-text (768 dimensions)
CREATE TABLE documents_local (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(768),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for similarity search
CREATE INDEX ON documents_local
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
If you're migrating from OpenAI embeddings, you'll need to re-embed your documents—dimension mismatches prevent querying across different models.
## Building a Local RAG Pipeline
With Ollama and pgvector configured, here's a complete TypeScript implementation for local RAG:
```typescript
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

const OLLAMA_URL = process.env.OLLAMA_URL || 'http://localhost:11434';

// Generate an embedding using local Ollama
async function getLocalEmbedding(text: string): Promise<number[]> {
  const response = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'nomic-embed-text',
      prompt: text,
    }),
  });

  if (!response.ok) {
    throw new Error(`Ollama embedding failed: ${response.statusText}`);
  }

  const data = await response.json();
  return data.embedding;
}

// Naive word-based chunker with overlap; swap in a smarter splitter as needed
function splitIntoChunks(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
  }
  return chunks;
}

// Store a document with local embeddings, one row per chunk
async function ingestDocument(content: string, metadata: Record<string, any>) {
  const chunks = splitIntoChunks(content, 500, 50);

  // Track the index explicitly: indexOf() is O(n) and wrong for duplicate chunks
  for (const [index, chunk] of chunks.entries()) {
    const embedding = await getLocalEmbedding(chunk);
    await supabase.from('documents_local').insert({
      content: chunk,
      metadata: { ...metadata, chunk_index: index },
      embedding,
    });
  }
}

// Similarity search with local embeddings
async function semanticSearch(query: string, limit = 5) {
  const queryEmbedding = await getLocalEmbedding(query);

  const { data, error } = await supabase.rpc('match_documents_local', {
    query_embedding: queryEmbedding,
    match_threshold: 0.5,
    match_count: limit,
  });

  if (error) throw error;
  return data;
}
```
Create the matching function in PostgreSQL:
```sql
CREATE OR REPLACE FUNCTION match_documents_local(
  query_embedding vector(768),
  match_threshold float,
  match_count int
)
RETURNS TABLE (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
)
LANGUAGE sql STABLE
AS $$
  SELECT
    documents_local.id,
    documents_local.content,
    documents_local.metadata,
    1 - (documents_local.embedding <=> query_embedding) AS similarity
  FROM documents_local
  WHERE 1 - (documents_local.embedding <=> query_embedding) > match_threshold
  ORDER BY documents_local.embedding <=> query_embedding
  LIMIT match_count;
$$;
```
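With both pieces in place, a round trip looks like this (the content and query are illustrative):

```typescript
// Ingest once, then query as often as needed; nothing leaves the server
await ingestDocument(
  'pgvector ships with self-hosted Supabase and stores embedding vectors...',
  { source: 'internal-docs' }
);

const results = await semanticSearch('How do I store embeddings in Supabase?');
for (const row of results) {
  console.log(row.similarity.toFixed(3), row.content.slice(0, 80));
}
```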
## Adding Local LLM Generation
For a fully private RAG system, pair local embeddings with local generation. Ollama supports conversational models like Llama 3, Mistral, and Qwen:
```typescript
async function generateLocalResponse(
  question: string,
  context: string
): Promise<string> {
  const response = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      prompt: `Answer the question based on the context below. If the context doesn't contain relevant information, say so.
Context:
${context}
Question: ${question}
Answer:`,
      stream: false,
    }),
  });

  const data = await response.json();
  return data.response;
}

// Complete RAG query - fully local
async function localRagQuery(question: string): Promise<string> {
  const relevantDocs = await semanticSearch(question, 5);

  const context = relevantDocs
    .map((doc: any) => doc.content)
    .join('\n\n---\n\n');

  return generateLocalResponse(question, context);
}
```
Pull a conversational model first:
```bash
ollama pull llama3.2
```
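The whole pipeline then runs with a single call:

```typescript
// End-to-end: embed the question, retrieve context, generate locally
const answer = await localRagQuery('What are the benefits of running embeddings locally?');
console.log(answer);
```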
## Performance Considerations
Local inference trades API costs for hardware requirements. Here's what to expect:
### Hardware Requirements
| Use Case | Minimum RAM | Recommended | GPU |
|---|---|---|---|
| Embeddings only | 4GB | 8GB | Optional |
| Small LLMs (7B) | 8GB | 16GB | Helpful |
| Medium LLMs (13B) | 16GB | 32GB | Recommended |
| Large LLMs (70B) | 64GB+ | 128GB+ | Required |
For embedding-focused workloads, a modest server handles thousands of embeddings per minute. Add a GPU (even a consumer RTX card) and throughput jumps dramatically.
### Embedding Throughput
On a 4-core CPU server with 8GB RAM, expect roughly:
- nomic-embed-text: ~15-20 embeddings/second
- mxbai-embed-large: ~8-12 embeddings/second
With an NVIDIA GPU (RTX 3060 or better):
- nomic-embed-text: ~100+ embeddings/second
- mxbai-embed-large: ~60+ embeddings/second
These numbers matter mainly for initial ingestion. Once documents are embedded, a query costs a single embedding call plus an index lookup, so latency typically stays under 100ms even on CPU-only hardware.
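If you want numbers for your own hardware, a rough sequential benchmark against the `getLocalEmbedding` helper from earlier is a sketch like this (the sample texts are whatever corpus slice you have handy):

```typescript
// Time N sequential embeddings; results vary widely with hardware and model
async function benchmarkEmbeddings(sampleTexts: string[]): Promise<number> {
  const start = performance.now();
  for (const text of sampleTexts) {
    await getLocalEmbedding(text);
  }
  const seconds = (performance.now() - start) / 1000;
  const rate = sampleTexts.length / seconds;
  console.log(`${rate.toFixed(1)} embeddings/second`);
  return rate;
}
```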
### Batch Processing for Ingestion
When processing large document sets, batch your embedding requests:
```typescript
async function batchEmbed(texts: string[]): Promise<number[][]> {
  const embeddings: number[][] = [];

  // Fire off batches of 10 concurrent requests, waiting for each batch to finish
  const batchSize = 10;
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const results = await Promise.all(batch.map(getLocalEmbedding));
    embeddings.push(...results);
  }

  return embeddings;
}
```
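Note that concurrency only helps if the server accepts it: Ollama limits how many requests it processes in parallel (configurable via the OLLAMA_NUM_PARALLEL environment variable), so tune that setting alongside `batchSize`.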
## Hybrid Approach: Local Embeddings with Cloud Generation
Not ready to commit fully to local inference? A hybrid approach offers flexibility:
- Embeddings: Run locally via Ollama (no data leaves your server)
- Generation: Use cloud APIs when needed (only the query + context leaves)
This protects your document corpus while allowing powerful generation models for responses:
```typescript
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function hybridRagQuery(question: string): Promise<string> {
  // Local embeddings - documents stay private
  const relevantDocs = await semanticSearch(question, 5);
  const context = relevantDocs.map((d: any) => d.content).join('\n\n');

  // Cloud generation - only query + retrieved context sent
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'Answer based on the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  return completion.choices[0].message.content || '';
}
```
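Keep in mind that the retrieved chunks themselves are sent to the cloud provider at query time, so this protects the corpus as a whole but not the specific passages relevant to each question. Filter or redact sensitive chunks before including them in the prompt if that matters for your data.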
## Managing Ollama Models
Keep your embedding models updated and your disk usage manageable:
```bash
# List installed models
ollama list

# Update a model to the latest version
ollama pull nomic-embed-text

# Remove unused models
ollama rm mistral

# Check model details
ollama show nomic-embed-text
```
Model files live in ~/.ollama/models by default. Each 7B parameter model consumes roughly 4GB of disk space.
## Monitoring Local AI Workloads
Add observability to your self-hosted stack to track AI performance. Ollama doesn't expose a Prometheus metrics endpoint out of the box, so in practice you either put a community exporter in front of it or instrument at the application layer, as sketched after the list below.
Key metrics to watch:
- Embedding latency distribution
- Memory usage during inference
- Queue depth for concurrent requests
- GPU utilization (if applicable)
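The simplest application-layer starting point is to time each call and emit structured logs that your existing metrics pipeline can pick up. A minimal sketch (the event name is arbitrary):

```typescript
// Wrap the embedding call to record latency; ship these logs to your
// existing log/metrics pipeline
async function timedEmbedding(text: string): Promise<number[]> {
  const start = performance.now();
  try {
    return await getLocalEmbedding(text);
  } finally {
    const ms = Math.round(performance.now() - start);
    console.log(JSON.stringify({ event: 'embedding_latency_ms', ms }));
  }
}
```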
## The Self-Hosting Advantage for AI
Running AI workloads on your own infrastructure represents the natural evolution of the self-hosting philosophy. You've already chosen to own your database, authentication, and storage. Extending that ownership to AI capabilities closes the loop—your entire application stack runs on hardware you control.
For startups building AI products, this matters: GDPR compliance is far simpler when documents never leave your data center. For indie hackers, swapping per-token fees for fixed hardware costs means AI features no longer scale your bill with usage.
## Getting Started with Supascale
Managing self-hosted Supabase with local AI adds operational complexity. Supascale helps by automating backups (including your embedding tables), managing SSL certificates, and providing a unified dashboard for multiple projects—all without requiring you to become a DevOps specialist.
Check our pricing to see how Supascale fits your self-hosting strategy. One-time purchase, unlimited projects, no usage fees.
## Further Reading
- Setting Up pgvector for Self-Hosted Supabase - Foundation for vector storage
- PostgreSQL Performance Tuning - Optimize for AI workloads
- Self-Hosted Supabase for Indie Hackers - Cost-effective deployment strategies
- Ollama Documentation - Full model library and API reference
