If you're self-hosting Supabase, you've already made a choice: data ownership matters. But when it comes to AI features, most tutorials point you straight to OpenAI or Anthropic APIs—sending your documents to external servers, paying per-token costs, and introducing a dependency that defeats the purpose of self-hosting.
There's another path. Running embedding models locally with Ollama gives you fully private AI capabilities. Your documents never leave your infrastructure. Your costs are fixed to hardware, not usage. And with pgvector already bundled in self-hosted Supabase, you have everything needed for production RAG applications.
## Why Local Embeddings Matter for Self-Hosters
The standard approach to building AI features involves calling external embedding APIs. This works, but it creates friction for teams who chose self-hosting specifically for data control:
- **Data leaves your infrastructure:** Every document you embed gets sent to a third party. For customer data, internal knowledge bases, or proprietary content, this may violate compliance requirements or internal policies.
- **Variable costs at scale:** Embedding APIs charge per token. A 10,000-document knowledge base might cost dollars; a million documents costs hundreds. Usage spikes become budget surprises.
- **Network dependency:** Your AI features stop working when the API provider has issues, and provider outages happen more often than status pages suggest.
- **Latency adds up:** Each embedding request involves a network round-trip. When processing thousands of documents, this compounds into significant ingestion delays.
Local embeddings solve all four problems. Ollama runs on your server, processing text without network calls. Once you've invested in hardware, marginal cost per embedding drops to near-zero.
## Setting Up Ollama with Self-Hosted Supabase
Ollama provides a streamlined way to run open-source language models locally. For embeddings, models like nomic-embed-text and mxbai-embed-large perform competitively with OpenAI's text-embedding-3-small on common retrieval benchmarks, without the per-token costs.
### Step 1: Install Ollama
On the same server running your Docker Compose Supabase stack, install Ollama:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Pull an embedding model
ollama pull nomic-embed-text
```
On Linux, the install script typically registers Ollama as a systemd service. For production, make sure it's enabled so Ollama starts on boot:

```bash
sudo systemctl enable ollama
sudo systemctl start ollama
```
### Step 2: Verify the API
Ollama exposes a REST API on port 11434. Test embedding generation:
```bash
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Test embedding generation"
}'
```
You should receive a JSON response with a 768-dimensional vector (nomic-embed-text's default dimension).
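The response body has the shape `{"embedding": [0.52, -0.13, …]}`; counting the array elements is a quick way to confirm the dimension your table will need.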
### Step 3: Add Ollama to Docker Network (Optional)
If you want Edge Functions or other Supabase services to access Ollama, add it to your Docker network:
```yaml
# Add to your docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    container_name: supabase-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    networks:
      - supabase_network
    restart: unless-stopped

volumes:
  ollama_data:
```
This keeps Ollama reachable at http://ollama:11434 from within the Supabase container network. Adjust `supabase_network` to match the network name your stack actually uses, and note that the container's 11434 port mapping will conflict with the host install from Step 1 if both run at once.
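A Supabase Edge Function can then call Ollama over the shared network. A minimal sketch, assuming the current Deno-based Edge Function runtime (the function name and request shape here are illustrative):

```typescript
// supabase/functions/embed/index.ts (hypothetical function name)
// Calls Ollama over the shared Docker network rather than localhost.
Deno.serve(async (req: Request) => {
  const { text } = await req.json();

  const res = await fetch('http://ollama:11434/api/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
  });

  if (!res.ok) {
    return new Response('Embedding failed', { status: 502 });
  }

  const { embedding } = await res.json();
  return new Response(JSON.stringify({ embedding }), {
    headers: { 'Content-Type': 'application/json' },
  });
});
```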
## Configuring pgvector for Local Embedding Dimensions
If you've followed our pgvector setup guide, you likely have tables configured for 1536-dimensional OpenAI embeddings. Local models use different dimensions:
| Model | Dimensions | Quality | Speed |
|---|---|---|---|
| nomic-embed-text | 768 | Excellent | Fast |
| mxbai-embed-large | 1024 | Excellent | Medium |
| all-minilm | 384 | Good | Very Fast |
Create a table matching your chosen model:
```sql
-- For nomic-embed-text (768 dimensions)
CREATE TABLE documents_local (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  embedding vector(768),
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for similarity search
CREATE INDEX ON documents_local
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
If you're migrating from OpenAI embeddings, you'll need to re-embed your documents—dimension mismatches prevent querying across different models.
## Building a Local RAG Pipeline
With Ollama and pgvector configured, here's a complete TypeScript implementation for local RAG:
```typescript
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

const OLLAMA_URL = process.env.OLLAMA_URL || 'http://localhost:11434';

// Generate an embedding using local Ollama
async function getLocalEmbedding(text: string): Promise<number[]> {
  const response = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'nomic-embed-text',
      prompt: text,
    }),
  });

  if (!response.ok) {
    throw new Error(`Ollama embedding failed: ${response.statusText}`);
  }

  const data = await response.json();
  return data.embedding;
}

// Naive word-based chunker with overlap; swap in a smarter splitter as needed
function splitIntoChunks(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += chunkSize - overlap) {
    chunks.push(words.slice(i, i + chunkSize).join(' '));
  }
  return chunks;
}

// Store a document with local embeddings, one row per chunk
async function ingestDocument(content: string, metadata: Record<string, any>) {
  const chunks = splitIntoChunks(content, 500, 50);

  // Track the index explicitly: indexOf() is O(n) and wrong for duplicate chunks
  for (const [index, chunk] of chunks.entries()) {
    const embedding = await getLocalEmbedding(chunk);
    await supabase.from('documents_local').insert({
      content: chunk,
      metadata: { ...metadata, chunk_index: index },
      embedding,
    });
  }
}

// Similarity search with local embeddings
async function semanticSearch(query: string, limit = 5) {
  const queryEmbedding = await getLocalEmbedding(query);

  const { data, error } = await supabase.rpc('match_documents_local', {
    query_embedding: queryEmbedding,
    match_threshold: 0.5,
    match_count: limit,
  });

  if (error) throw error;
  return data;
}
```
Create the matching function in PostgreSQL:
```sql
CREATE OR REPLACE FUNCTION match_documents_local(
  query_embedding vector(768),
  match_threshold float,
  match_count int
)
RETURNS TABLE (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
)
LANGUAGE sql STABLE
AS $$
  SELECT
    documents_local.id,
    documents_local.content,
    documents_local.metadata,
    1 - (documents_local.embedding <=> query_embedding) AS similarity
  FROM documents_local
  WHERE 1 - (documents_local.embedding <=> query_embedding) > match_threshold
  ORDER BY documents_local.embedding <=> query_embedding
  LIMIT match_count;
$$;
```
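With both pieces in place, a round trip looks like this (the content and query are illustrative):

```typescript
// Ingest once, then query as often as needed; nothing leaves the server
await ingestDocument(
  'pgvector ships with self-hosted Supabase and stores embedding vectors...',
  { source: 'internal-docs' }
);

const results = await semanticSearch('How do I store embeddings in Supabase?');
for (const row of results) {
  console.log(row.similarity.toFixed(3), row.content.slice(0, 80));
}
```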
## Adding Local LLM Generation
For a fully private RAG system, pair local embeddings with local generation. Ollama supports conversational models like Llama 3, Mistral, and Qwen:
```typescript
async function generateLocalResponse(
  question: string,
  context: string
): Promise<string> {
  const response = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      prompt: `Answer the question based on the context below. If the context doesn't contain relevant information, say so.
Context:
${context}
Question: ${question}
Answer:`,
      stream: false,
    }),
  });

  const data = await response.json();
  return data.response;
}

// Complete RAG query - fully local
async function localRagQuery(question: string): Promise<string> {
  const relevantDocs = await semanticSearch(question, 5);

  const context = relevantDocs
    .map((doc: any) => doc.content)
    .join('\n\n---\n\n');

  return generateLocalResponse(question, context);
}
```
Pull a conversational model first:
```bash
ollama pull llama3.2
```
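The whole pipeline then runs with a single call:

```typescript
// End-to-end: embed the question, retrieve context, generate locally
const answer = await localRagQuery('What are the benefits of running embeddings locally?');
console.log(answer);
```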
## Performance Considerations
Local inference trades API costs for hardware requirements. Here's what to expect:
### Hardware Requirements
| Use Case | Minimum RAM | Recommended | GPU |
|---|---|---|---|
| Embeddings only | 4GB | 8GB | Optional |
| Small LLMs (7B) | 8GB | 16GB | Helpful |
| Medium LLMs (13B) | 16GB | 32GB | Recommended |
| Large LLMs (70B) | 64GB+ | 128GB+ | Required |
For embedding-focused workloads, a modest server handles thousands of embeddings per minute. Add a GPU (even a consumer RTX card) and throughput jumps dramatically.
### Embedding Throughput
On a 4-core CPU server with 8GB RAM, expect roughly:
- nomic-embed-text: ~15-20 embeddings/second
- mxbai-embed-large: ~8-12 embeddings/second
With an NVIDIA GPU (RTX 3060 or better):
- nomic-embed-text: ~100+ embeddings/second
- mxbai-embed-large: ~60+ embeddings/second
These numbers matter mainly for initial ingestion. Once documents are embedded, a query costs a single embedding call plus an index lookup, so latency typically stays under 100ms even on CPU-only hardware.
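If you want numbers for your own hardware, a rough sequential benchmark against the `getLocalEmbedding` helper from earlier is a sketch like this (the sample texts are whatever corpus slice you have handy):

```typescript
// Time N sequential embeddings; results vary widely with hardware and model
async function benchmarkEmbeddings(sampleTexts: string[]): Promise<number> {
  const start = performance.now();
  for (const text of sampleTexts) {
    await getLocalEmbedding(text);
  }
  const seconds = (performance.now() - start) / 1000;
  const rate = sampleTexts.length / seconds;
  console.log(`${rate.toFixed(1)} embeddings/second`);
  return rate;
}
```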
### Batch Processing for Ingestion
When processing large document sets, batch your embedding requests:
```typescript
async function batchEmbed(texts: string[]): Promise<number[][]> {
  const embeddings: number[][] = [];

  // Fire off batches of 10 concurrent requests, waiting for each batch to finish
  const batchSize = 10;
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const results = await Promise.all(batch.map(getLocalEmbedding));
    embeddings.push(...results);
  }

  return embeddings;
}
```
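Note that concurrency only helps if the server accepts it: Ollama limits how many requests it processes in parallel (configurable via the OLLAMA_NUM_PARALLEL environment variable), so tune that setting alongside `batchSize`.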
## Hybrid Approach: Local Embeddings with Cloud Generation
Not ready to commit fully to local inference? A hybrid approach offers flexibility:
- Embeddings: Run locally via Ollama (no data leaves your server)
- Generation: Use cloud APIs when needed (only the query + context leaves)
This protects your document corpus while allowing powerful generation models for responses:
```typescript
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function hybridRagQuery(question: string): Promise<string> {
  // Local embeddings - documents stay private
  const relevantDocs = await semanticSearch(question, 5);
  const context = relevantDocs.map((d: any) => d.content).join('\n\n');

  // Cloud generation - only query + retrieved context sent
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      { role: 'system', content: 'Answer based on the provided context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  return completion.choices[0].message.content || '';
}
```
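Keep in mind that the retrieved chunks themselves are sent to the cloud provider at query time, so this protects the corpus as a whole but not the specific passages relevant to each question. Filter or redact sensitive chunks before including them in the prompt if that matters for your data.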
## Managing Ollama Models
Keep your embedding models updated and your disk usage manageable:
```bash
# List installed models
ollama list

# Update a model to the latest version
ollama pull nomic-embed-text

# Remove unused models
ollama rm mistral

# Check model details
ollama show nomic-embed-text
```
Model files live in ~/.ollama/models by default. Each 7B parameter model consumes roughly 4GB of disk space.
## Monitoring Local AI Workloads
Add observability to your self-hosted stack to track AI performance. Ollama doesn't expose a Prometheus metrics endpoint out of the box, so in practice you either put a community exporter in front of it or instrument at the application layer, as sketched after the list below.
Key metrics to watch:
- Embedding latency distribution
- Memory usage during inference
- Queue depth for concurrent requests
- GPU utilization (if applicable)
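The simplest application-layer starting point is to time each call and emit structured logs that your existing metrics pipeline can pick up. A minimal sketch (the event name is arbitrary):

```typescript
// Wrap the embedding call to record latency; ship these logs to your
// existing log/metrics pipeline
async function timedEmbedding(text: string): Promise<number[]> {
  const start = performance.now();
  try {
    return await getLocalEmbedding(text);
  } finally {
    const ms = Math.round(performance.now() - start);
    console.log(JSON.stringify({ event: 'embedding_latency_ms', ms }));
  }
}
```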
## The Self-Hosting Advantage for AI
Running AI workloads on your own infrastructure represents the natural evolution of the self-hosting philosophy. You've already chosen to own your database, authentication, and storage. Extending that ownership to AI capabilities closes the loop—your entire application stack runs on hardware you control.
For startups building AI products, this matters: GDPR compliance is far simpler when documents never leave your data center. For indie hackers, swapping per-token fees for fixed hardware costs means AI features no longer scale your bill with usage.
## Getting Started with Supascale
Managing self-hosted Supabase with local AI adds operational complexity. Supascale helps by automating backups (including your embedding tables), managing SSL certificates, and providing a unified dashboard for multiple projects—all without requiring you to become a DevOps specialist.
Check our pricing to see how Supascale fits your self-hosting strategy. One-time purchase, unlimited projects, no usage fees.
## Further Reading
- Setting Up pgvector for Self-Hosted Supabase - Foundation for vector storage
- PostgreSQL Performance Tuning - Optimize for AI workloads
- Self-Hosted Supabase for Indie Hackers - Cost-effective deployment strategies
- Ollama Documentation - Full model library and API reference
