Building a RAG Pipeline in Node.js
LLMs hallucinate on data they’ve never seen. Your private documents? Your company knowledge? They’re flying blind. RAG fixes this by retrieving relevant context first, then asking the LLM to answer grounded in facts.
The RAG Problem and Why It Matters
Large language models have a knowledge cutoff. Ask Claude about your internal API? It guesses. Ask it to summarize your private contracts? Hallucination. RAG (Retrieval-Augmented Generation) solves this:
- Chunk your documents into searchable pieces
- Embed them into vectors using an embedding model
- Store vectors in a database
- Retrieve the top-k most relevant chunks when the user asks a question
- Prompt the LLM with those chunks + the user’s question
Result: Accurate answers grounded in your data.
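The query-time half of the steps above (retrieve, then prompt) can be sketched as a single function with its dependencies passed in. This is a minimal sketch, not a definitive design: the names `RagDeps`, `ragAnswer`, and `stubDeps` are hypothetical and come from no library, and the stubs stand in for a real embedding model, vector store, and LLM so the flow runs without any external service.

```typescript
// Hypothetical types for illustration; none of these come from a library.
type Embed = (texts: string[]) => Promise<number[][]>;
type Search = (queryVector: number[], k: number) => Promise<string[]>;
type Generate = (prompt: string) => Promise<string>;

interface RagDeps {
  embed: Embed;       // text -> vectors
  search: Search;     // query vector -> top-k chunk texts
  generate: Generate; // prompt -> answer
}

// Embed the question, retrieve the top-k chunks, then prompt the model.
async function ragAnswer(question: string, deps: RagDeps, k = 5): Promise<string> {
  const [queryVector] = await deps.embed([question]);
  const chunks = await deps.search(queryVector, k);
  const context = chunks.map((chunk, i) => `[${i + 1}] ${chunk}`).join('\n\n');
  const prompt = `Context:\n${context}\n\nQuestion: ${question}`;
  return deps.generate(prompt);
}

// Stub dependencies (assumptions for the demo, not real services).
const stubDeps: RagDeps = {
  embed: async (texts) => texts.map((t) => [t.length]),
  search: async (_vector, k) => ['Refunds take 5 days.', 'Support is 24/7.'].slice(0, k),
  generate: async (prompt) => prompt, // echo the prompt instead of calling an LLM
};
```

Passing the three stages in as functions keeps the flow testable: the stubs can later be swapped for a real embedding call, a database query, and an LLM call without touching `ragAnswer`.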
Architecture at a Glance
    ┌───────────────────────────────────────┐
    │ Your Documents (PDFs, code, wikis)    │
    └───────────────────┬───────────────────┘
                        │
                        ▼
    ┌───────────────────────────────────────┐
    │ Chunking                              │  (split into ~512-1024 tokens)
    └───────────────────┬───────────────────┘
                        │
                        ▼
    ┌───────────────────────────────────────┐
    │ OpenAI Embeddings                     │  (text-embedding-3-small)
    └───────────────────┬───────────────────┘
                        │
                        ▼
    ┌───────────────────────────────────────┐
    │ PostgreSQL+pgvector                   │  (cosine similarity search)
    └───────────────────┬───────────────────┘
                        │
    ┌───────────────────┴───────────────────┐
    │ User Query                            │
    │ Embedding                             │
    │ Vector Search                         │
    └───────────────────┬───────────────────┘
                        │
                        ▼
    ┌───────────────────────────────────────┐
    │ Top-K Chunks                          │
    │ + Prompt Context                      │
    └───────────────────┬───────────────────┘
                        │
                        ▼
    ┌───────────────────────────────────────┐
    │ LLM (Claude)                          │  (answer with grounded facts)
    └───────────────────────────────────────┘

Chunking Strategy: Fixed vs. Semantic
Fixed-size chunking is simple, but dumb:
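    function chunkBySize(text: string, size: number, overlap: number): string[] {
      // The step (size - overlap) must be positive, otherwise the loop never advances.
      if (size <= 0 || overlap < 0 || overlap >= size) {
        throw new RangeError('Require size > 0 and 0 <= overlap < size');
      }
      const chunks: string[] = [];
      for (let i = 0; i < text.length; i += size - overlap) {
        chunks.push(text.slice(i, i + size));
      }
      return chunks;
    }

    const chunks = chunkBySize(documentText, 1024, 200); // 1024 chars, 200 overlap

Semantic chunking breaks at logical boundaries (sentences, paragraphs). Better quality: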
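    function chunkBySentences(text: string): string[] {
      // Split into sentences (keeping their punctuation), then recombine
      // until a chunk reaches ~1024 characters. Note: characters, not tokens.
      const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
      const chunks: string[] = [];
      let current = '';

      for (const sent of sentences) {
        // Only flush a non-empty chunk, so an oversized first sentence
        // does not produce an empty entry.
        if (current && (current + sent).length > 1024) {
          chunks.push(current.trim());
          current = sent;
        } else {
          current += sent;
        }
      }
      if (current.trim()) chunks.push(current.trim());
      return chunks;
    }

Generating Embeddings with OpenAI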
Convert chunks into 1536-dimensional vectors:
    import { OpenAI } from 'openai';

    const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

    async function embedChunks(chunks: string[]): Promise<number[][]> {
      // A single request can carry an array of inputs.
      const response = await client.embeddings.create({
        model: 'text-embedding-3-small',
        input: chunks,
      });

      return response.data.map(item => item.embedding);
    }

    const embeddings = await embedChunks(chunks);
    console.log(embeddings[0].length); // 1536

text-embedding-3-small is fast (~100 chunks/second) and cheap. For massive datasets, split the chunks into several smaller requests instead of sending everything at once.
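One way to split a large chunk list into several embedding requests is a small batching helper. This is a sketch under stated assumptions: `toBatches` and `embedInBatches` are hypothetical names, the default batch size of 100 is an arbitrary illustrative value, and `embedBatch` stands in for a function such as `embedChunks`.

```typescript
// Split an array into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed chunks batch by batch, concatenating results in batch order.
// `embedBatch` is an assumption: any function that maps texts to vectors.
async function embedInBatches(
  chunks: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 100, // illustrative default, not a documented limit
): Promise<number[][]> {
  const all: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embedBatch(batch);
    all.push(...vectors);
  }
  return all;
}
```

Usage would look like `const embeddings = await embedInBatches(chunks, embedChunks);`. Batches are sent one after another here, which is slower than parallel requests but less likely to run into rate limits.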
Storing Vectors in PostgreSQL with pgvector
Enable the extension:
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
      id SERIAL PRIMARY KEY,
      content TEXT NOT NULL,
      embedding vector(1536) NOT NULL,
      metadata JSONB,
      created_at TIMESTAMP DEFAULT NOW()
    );

    CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);

Insert vectors from Node.js:
    import pg from 'pg';

    // With no arguments, the pool reads connection settings from the PG* environment variables.
    const pool = new pg.Pool();

    async function storeChunks(chunks: string[], embeddings: number[][]) {
      const query = `
        INSERT INTO documents (content, embedding, metadata)
        VALUES ($1, $2::vector, $3)
      `;

      for (let i = 0; i < chunks.length; i++) {
        await pool.query(query, [
          chunks[i],
          JSON.stringify(embeddings[i]), // '[0.1,0.2,...]' matches pgvector's text input format
          { source: 'user_upload', chunk_index: i },
        ]);
      }
    }

Retrieval: Vector Similarity Search
Find the top-k most relevant chunks:
    async function retrieveContext(userQuery: string, k: number = 5) {
      // 1. Embed the user's question
      const response = await client.embeddings.create({
        model: 'text-embedding-3-small',
        input: userQuery,
      });
      const queryVector = response.data[0].embedding;

      // 2. Search PostgreSQL for the closest vectors (<=> is pgvector's cosine distance operator)
      const results = await pool.query(
        `
        SELECT content, embedding <=> $1::vector AS distance
        FROM documents
        ORDER BY distance
        LIMIT $2
        `,
        [JSON.stringify(queryVector), k]
      );

      return results.rows.map(row => row.content);
    }

The cosine distance operator <=> finds the vectors closest to the query. Smaller distance = higher relevance. (pgvector's <-> operator is Euclidean distance and would not use the vector_cosine_ops index.)
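To make "smaller distance = higher relevance" concrete: cosine distance is one minus cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction). The sketch below computes it in plain TypeScript on toy 2-dimensional vectors; `cosineDistance` is a hypothetical helper written for illustration, not a pgvector or OpenAI function.

```typescript
// Cosine distance = 1 - (a · b) / (|a| * |b|).
function cosineDistance(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('Vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine distance is undefined for a zero vector');
  }
  return 1 - dot / Math.sqrt(normA * normB);
}

console.log(cosineDistance([1, 0], [2, 0]));  // 0: same direction, length is ignored
console.log(cosineDistance([1, 0], [0, 3]));  // 1: orthogonal, unrelated
console.log(cosineDistance([1, 0], [-1, 0])); // 2: opposite direction
```

Because only direction matters, a long chunk and a short chunk about the same topic can end up equally close to the query.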
Assembling the Prompt Context
Combine retrieved chunks into a system prompt:
    import Anthropic from '@anthropic-ai/sdk';

    // Claude is called through the Anthropic SDK, separate from the OpenAI client used for embeddings.
    const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

    async function answerWithRAG(userQuery: string) {
      const context = await retrieveContext(userQuery);

      const systemPrompt = `You are a helpful assistant. Answer the user's question using ONLY the following context. If the context doesn't contain the answer, say so.

    Context:
    ${context.map((chunk, i) => `[${i + 1}] ${chunk}`).join('\n\n')}`;

      const response = await anthropic.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        system: systemPrompt,
        messages: [{ role: 'user', content: userQuery }],
      });

      // The response is a list of content blocks; return the first one if it is text.
      return response.content[0].type === 'text' ? response.content[0].text : '';
    }

Full End-to-End Pipeline
Tying it together:
    async function setupRAG(documentText: string) {
      // 1. Chunk the document
      const chunks = chunkBySentences(documentText);
      console.log(`Chunked into ${chunks.length} pieces`);

      // 2. Generate embeddings
      const embeddings = await embedChunks(chunks);
      console.log(`Generated ${embeddings.length} embeddings`);

      // 3. Store in PostgreSQL
      await storeChunks(chunks, embeddings);
      console.log('Stored in database');

      // 4. Answer a question
      const answer = await answerWithRAG('What is your main service?');
      console.log(answer);
    }

    // Load a document and run (ES module syntax, to match the imports and top-level await above)
    import fs from 'node:fs';
    const docText = fs.readFileSync('company_handbook.txt', 'utf-8');
    await setupRAG(docText);

Summary
RAG transforms LLMs from guessing machines into fact machines. Chunk strategically. Embed with OpenAI. Store in pgvector. Search by cosine distance. Retrieve the top 5. Prompt with context. Your LLM now answers with precision grounded in your data.
This pipeline works at scale for thousands of documents. Ready for production.