NODE.JS 4 MIN READ 16 MAR 2026

Building a RAG Pipeline in Node.js

by Theodor

QUEST LOG ENTRY

WARNING · DRAGON AHEAD

LLMs hallucinate on data they’ve never seen. Your private documents? Your company knowledge? The model is flying blind. RAG fixes this by retrieving relevant context first, then asking the LLM to answer grounded in that context.

The RAG Problem and Why It Matters

Large language models have a knowledge cutoff. Ask Claude about your internal API? It guesses. Ask it to summarize your private contracts? Hallucination. RAG (Retrieval-Augmented Generation) solves this:

Chunk your documents into searchable pieces
Embed them into vectors using an embedding model
Store vectors in a database
Retrieve the top-k most relevant chunks when the user asks a question
Prompt the LLM with those chunks + the user’s question

Result: Accurate answers grounded in your data.

Architecture at a Glance

┌─────────────────────────────────────────┐
│  Your Documents (PDFs, code, wikis)    │
└──────────────┬──────────────────────────┘
               │
               ▼
        ┌─────────────┐
        │   Chunking  │  (split into ~512-1024 tokens)
        └──────┬──────┘
               │
               ▼
    ┌──────────────────────┐
    │  OpenAI Embeddings   │  (text-embedding-3-small)
    └──────────┬───────────┘
               │
               ▼
   ┌────────────────────────┐
   │  PostgreSQL+pgvector   │  (cosine similarity search)
   └───────────┬────────────┘
               │
    ┌──────────┴──────────┐
    │   User Query        │
    │   Embedding         │
    │   Vector Search     │
    └──────────┬──────────┘
               │
               ▼
       ┌──────────────────┐
       │ Top-K Chunks     │
       │ + Prompt Context │
       └────────┬─────────┘
                │
                ▼
          ┌──────────────┐
          │ Claude (LLM) │  (answer with grounded facts)
          └──────────────┘

Chunking Strategy: Fixed vs. Semantic

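Fixed-size chunking is simple, but dumb. It cuts wherever the character count lands, even mid-sentence:

function chunkBySize(text: string, size: number, overlap: number): string[] {
  // Guard: a step of zero or less would make the loop run forever
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('Require size > 0 and 0 <= overlap < size');
  }
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each iteration
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}

const chunks = chunkBySize(documentText, 1024, 200); // 1024 chars, 200 overlap (characters, not tokens)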

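Semantic chunking breaks at logical boundaries (sentences, paragraphs), so each chunk stays readable on its own. Better quality:

function chunkBySentences(text: string, maxChars = 1024): string[] {
  // Split after sentence-ending punctuation (keeping it attached), then
  // recombine sentences until a chunk reaches ~maxChars characters.
  // A single sentence longer than maxChars becomes its own oversized chunk.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  const chunks: string[] = [];
  let current = '';

  for (const sent of sentences) {
    const candidate = current ? `${current} ${sent}` : sent;
    if (current && candidate.length > maxChars) {
      chunks.push(current);
      current = sent;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}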
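
The diagram talks about tokens, while both chunkers above count characters. One way to bridge the two is a rough estimate. The sketch below assumes the common rule of thumb of about 4 characters per token for English text, which is an approximation and not a real tokenizer. `estimateTokens` and `splitByTokenBudget` are hypothetical helpers, not part of any library.

```typescript
// Assumption: ~4 characters per token for English text.
// A real tokenizer would give exact counts; this is only an estimate.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Greedily pack paragraphs (separated by blank lines) into chunks that
// stay under an estimated token budget. A single paragraph larger than
// the budget is emitted as its own oversized chunk.
function splitByTokenBudget(text: string, maxTokens: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current && estimateTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Splitting on paragraphs first and falling back to sentences only for oversized paragraphs is a reasonable next step if your documents have long sections.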

Generating Embeddings with OpenAI

Convert chunks into 1536-dimensional vectors:

import { OpenAI } from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function embedChunks(chunks: string[]) {
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks,
  });

  return response.data.map(item => item.embedding);
}

const embeddings = await embedChunks(chunks);
console.log(embeddings[0].length); // 1536

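text-embedding-3-small is fast (~100 chunks/second, though real throughput depends on your rate limits) and cheap. The call above sends every chunk in a single request, which stops working once a document produces more chunks than the API accepts per request. For massive datasets, split the chunks into batches and embed one batch at a time.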
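
Batching can be sketched as follows. The batch size of 100 is an arbitrary assumption, not a documented API value. `embedBatch` stands in for any function that embeds an array of strings (such as `embedChunks` above), and `toBatches` and `embedInBatches` are hypothetical names.

```typescript
// Any function that turns an array of texts into one vector per text
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split an array into consecutive slices of at most batchSize items
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed batches one after another so output order matches input order
async function embedInBatches(
  chunks: string[],
  embedBatch: EmbedFn,
  batchSize = 100, // assumption: tune to your rate limits
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const result = await embedBatch(batch);
    vectors.push(...result);
  }
  return vectors;
}
```

Running batches sequentially keeps the example simple and gentle on rate limits. Parallel requests would be faster but need their own throttling.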

Storing Vectors in PostgreSQL with pgvector

Enable the extension:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
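
An ivfflat index partitions vectors into lists. pgvector's README suggests building it once the table has data, starting with roughly rows / 1000 lists for up to 1M rows and sqrt(rows) above that, with probes around sqrt(lists) at query time (worth checking against the version you run). A small sketch of that heuristic follows. `suggestIvfflatParams` and `ivfflatSql` are hypothetical helpers, and the numbers are starting points to tune, not guarantees.

```typescript
interface IvfflatParams {
  lists: number;  // number of partitions in the index
  probes: number; // partitions scanned per query
}

// Starting-point heuristic: rows / 1000 up to 1M rows, sqrt(rows) above
function suggestIvfflatParams(rowCount: number): IvfflatParams {
  if (!Number.isFinite(rowCount) || rowCount < 0) {
    throw new RangeError('rowCount must be a non-negative number');
  }
  const rawLists = rowCount <= 1_000_000 ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}

// Render the SQL; table and column names match the schema above
function ivfflatSql(params: IvfflatParams): string[] {
  return [
    `CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = ${params.lists});`,
    `SET ivfflat.probes = ${params.probes};`,
  ];
}
```

More probes improve recall at the cost of query speed, so measure both on your own data before settling on values.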

Insert vectors from Node.js:

import pg from 'pg';
const pool = new pg.Pool();

async function storeChunks(chunks: string[], embeddings: number[][]) {
  const query = `
    INSERT INTO documents (content, embedding, metadata)
    VALUES ($1, $2::vector, $3)
  `;

  for (let i = 0; i < chunks.length; i++) {
    await pool.query(query, [
      chunks[i],
      JSON.stringify(embeddings[i]), // yields '[0.1,0.2,...]', which matches pgvector's text input format
      { source: 'user_upload', chunk_index: i },
    ]);
  }
}
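
One round trip per chunk gets slow for large documents. A minimal sketch of building a single multi-row INSERT instead, assuming the same `documents` schema and metadata shape as above. `buildBulkInsert` is a hypothetical helper, and its output would be passed to `pool.query(text, values)`.

```typescript
interface BulkInsert {
  text: string;      // parameterized SQL
  values: unknown[]; // flat list of bind parameters, 3 per row
}

function buildBulkInsert(chunks: string[], embeddings: number[][]): BulkInsert {
  if (chunks.length === 0) {
    throw new Error('Nothing to insert');
  }
  if (chunks.length !== embeddings.length) {
    throw new Error('chunks and embeddings must have the same length');
  }

  const placeholders: string[] = [];
  const values: unknown[] = [];

  chunks.forEach((chunk, i) => {
    const base = i * 3; // each row consumes three placeholders
    placeholders.push(`($${base + 1}, $${base + 2}::vector, $${base + 3})`);
    values.push(
      chunk,
      JSON.stringify(embeddings[i]), // '[...]' text form for pgvector
      JSON.stringify({ source: 'user_upload', chunk_index: i }),
    );
  });

  return {
    text: `INSERT INTO documents (content, embedding, metadata) VALUES ${placeholders.join(', ')}`,
    values,
  };
}
```

PostgreSQL limits how many bind parameters one statement can carry (65,535), so very large documents still need to be inserted in several statements.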

Retrieval: Vector Similarity Search

Find the top-k most relevant chunks:

async function retrieveContext(userQuery: string, k: number = 5) {
  // 1. Embed the user's question
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });
  const queryVector = response.data[0].embedding;

  // 2. Search PostgreSQL for closest vectors
  const results = await pool.query(
    `
    SELECT content, embedding <=> $1::vector AS distance
    FROM documents
    ORDER BY distance
    LIMIT $2
    `,
    [JSON.stringify(queryVector), k]
  );

  return results.rows.map(row => row.content);
}

The <=> operator is pgvector's cosine distance, which matches the vector_cosine_ops index created earlier (<-> is Euclidean distance and would not use that index). Smaller distance = higher relevance.
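
To make "smaller distance = higher relevance" concrete, here is a minimal in-memory sketch of cosine distance (1 minus cosine similarity) and a brute-force top-k over it. It illustrates the math only and is not how pgvector implements the search. `cosineDistance` and `topK` are hypothetical helper names.

```typescript
// Cosine distance: 0 = same direction, 1 = orthogonal, 2 = opposite
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('Vectors must be non-empty and the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine distance is undefined for a zero vector');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbours: score every row, sort ascending, keep k
function topK<T extends { embedding: number[] }>(query: number[], rows: T[], k: number): T[] {
  return rows
    .map(row => ({ row, distance: cosineDistance(query, row.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map(scored => scored.row);
}
```

A scan like this is fine for a few hundred vectors in tests. The database index exists so you do not have to score every row.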

Assembling the Prompt Context

Combine retrieved chunks into a system prompt:

import Anthropic from '@anthropic-ai/sdk';

// The answer comes from Claude, so it needs its own client (the OpenAI client only handles embeddings)
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function answerWithRAG(userQuery: string) {
  const context = await retrieveContext(userQuery);

  const systemPrompt = `You are a helpful assistant. Answer the user's question using ONLY the following context. If the context doesn't contain the answer, say so.

Context:
${context.map((chunk, i) => `[${i + 1}] ${chunk}`).join('\n\n')}`;

  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: 'user', content: userQuery }],
  });

  // The first content block holds the text answer
  const first = response.content[0];
  return first?.type === 'text' ? first.text : '';
}
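
Five chunks of about 1024 characters fit comfortably, but a larger k or bigger chunks can bloat the prompt. A minimal sketch of capping the context by size, assuming chunks arrive most-relevant first (as `retrieveContext` returns them) and using a character budget as a stand-in for a token budget. `buildContextBlock` is a hypothetical helper.

```typescript
// Build the numbered context block, stopping before the budget is exceeded.
// Chunks are assumed to be ordered from most to least relevant, so the
// ones dropped are the least relevant.
function buildContextBlock(chunks: string[], maxChars: number): string {
  const parts: string[] = [];
  let used = 0;

  for (const chunk of chunks) {
    const entry = `[${parts.length + 1}] ${chunk}`;
    // Every entry after the first also pays for the "\n\n" separator
    const cost = entry.length + (parts.length > 0 ? 2 : 0);
    if (used + cost > maxChars) break;
    parts.push(entry);
    used += cost;
  }

  return parts.join('\n\n');
}
```

The numbering matches the `[1]`, `[2]` labels in the system prompt, so the model can still cite chunks by index.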

Full End-to-End Pipeline

Tying it together:

async function setupRAG(documentText: string) {
  // 1. Chunk the document
  const chunks = chunkBySentences(documentText);
  console.log(`Chunked into ${chunks.length} pieces`);

  // 2. Generate embeddings
  const embeddings = await embedChunks(chunks);
  console.log(`Generated ${embeddings.length} embeddings`);

  // 3. Store in PostgreSQL
  await storeChunks(chunks, embeddings);
  console.log('Stored in database');

  // 4. Answer a question
  const answer = await answerWithRAG('What is your main service?');
  console.log(answer);
}

// Load a document and run (top-level await requires an ES module, so use import, not require)
import { readFileSync } from 'node:fs';
const docText = readFileSync('company_handbook.txt', 'utf-8');
await setupRAG(docText);

Summary

RAG transforms LLMs from guessing machines into fact machines. Chunk strategically. Embed with OpenAI. Store in pgvector. Search by cosine distance. Retrieve the top 5. Prompt with context. Your LLM now answers with precision grounded in your data.

This pipeline is a solid starting point for thousands of documents. Before calling it production-ready, add batching, error handling, and an index tuned to your data size.
