
Building a RAG Pipeline in Node.js

LLMs hallucinate on data they’ve never seen. Your private documents? Your company knowledge? They’re flying blind. RAG fixes this by retrieving relevant context first, then asking the LLM to answer grounded in facts.

The RAG Problem and Why It Matters

Large language models have a knowledge cutoff. Ask Claude about your internal API? It guesses. Ask it to summarize your private contracts? Hallucination. RAG (Retrieval-Augmented Generation) solves this:

  1. Chunk your documents into searchable pieces
  2. Embed them into vectors using an embedding model
  3. Store vectors in a database
  4. Retrieve the top-k most relevant chunks when the user asks a question
  5. Prompt the LLM with those chunks + the user’s question

Result: Accurate answers grounded in your data.

Architecture at a Glance

┌────────────────────────────────────┐
│ Your Documents (PDFs, code, wikis) │
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ Chunking                           │  (split into ~512-1024 tokens)
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ OpenAI Embeddings                  │  (text-embedding-3-small)
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ PostgreSQL + pgvector              │  (cosine similarity search)
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ User Query                         │
│ Embedding                          │
│ Vector Search                      │
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ Top-K Chunks                       │
│ + Prompt Context                   │
└─────────────────┬──────────────────┘
                  ▼
┌────────────────────────────────────┐
│ LLM                                │  (answer with grounded facts)
└────────────────────────────────────┘

Chunking Strategy: Fixed vs. Semantic

Fixed-size chunking is simple, but dumb:

function chunkBySize(text: string, size: number, overlap: number) {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size));
  }
  return chunks;
}
const chunks = chunkBySize(documentText, 1024, 200); // 1024 chars, 200 overlap
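
To see what the overlap does, here is a small worked sketch. It uses a guarded variant of the function above (the name `chunkWithOverlap` is hypothetical): it rejects an overlap that is not smaller than the chunk size, since the window would otherwise never advance, and it stops once a chunk reaches the end of the text so no redundant tail chunk is emitted.

```typescript
// Hypothetical guarded variant of fixed-size chunking (illustrative only).
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('expected size > 0 and 0 <= overlap < size');
  }
  const step = size - overlap; // how far the window advances each iteration
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + size));
    if (i + size >= text.length) break; // this chunk already covers the tail
  }
  return chunks;
}

// Window of 4 characters, 1 character shared between neighbours.
const overlapDemo = chunkWithOverlap('abcdefghij', 4, 1);
// overlapDemo is ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the point of overlap: a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.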

Semantic chunking breaks at logical boundaries (sentences, paragraphs). Better quality:

function chunkBySentences(text: string, maxChars = 1024) {
  // Split after sentence-ending punctuation (keeping it), then recombine
  // sentences until a chunk would exceed maxChars characters
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  const chunks: string[] = [];
  let current = '';
  for (const sent of sentences) {
    if (current && (current + ' ' + sent).length > maxChars) {
      chunks.push(current);
      current = sent;
    } else {
      current = current ? current + ' ' + sent : sent;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
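
The architecture sizes chunks in tokens, while the code above counts characters. A minimal sketch of packing sentences against a token budget follows. `estimateTokens` and `packSentences` are hypothetical names, and the 4-characters-per-token ratio is only a rough rule of thumb for English text, not a real tokenizer; swap in an actual tokenizer if the budget needs to be exact.

```typescript
// Rough heuristic (an assumption, not a real tokenizer): ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Pack already-split sentences into chunks that stay within a token budget.
function packSentences(sentences: string[], maxTokens: number): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current && estimateTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = sentence; // a single oversized sentence becomes its own chunk
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Budget of 5 estimated tokens (about 20 characters) per chunk.
const packed = packSentences(['One two.', 'Three four.', 'Five six seven.'], 5);
// packed is ['One two. Three four.', 'Five six seven.']
```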

Generating Embeddings with OpenAI

Convert chunks into 1536-dimensional vectors:

import { OpenAI } from 'openai';
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
async function embedChunks(chunks: string[]) {
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks,
  });
  return response.data.map(item => item.embedding);
}
const embeddings = await embedChunks(chunks);
console.log(embeddings[0].length); // 1536

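text-embedding-3-small is fast (roughly 100 chunks/second in my experience, though throughput depends on chunk size and your rate limits) and cheap. For large datasets, send the chunks in batches instead of one oversized request.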
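
Batching can be sketched as follows. The helper takes the embedding call as a parameter so it stays independent of any SDK; `EmbedFn`, `toBatches`, `embedInBatches`, and the default batch size of 100 are all assumptions for illustration, not values taken from the OpenAI documentation.

```typescript
// Any function that embeds a list of texts, e.g. a wrapper around embedChunks.
// Hypothetical type for illustration.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split an array into consecutive slices of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new RangeError('size must be positive');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed batch by batch, preserving the input order of the chunks.
async function embedInBatches(
  chunks: string[],
  embed: EmbedFn,
  batchSize = 100, // arbitrary default; tune to your rate limits
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    vectors.push(...(await embed(batch)));
  }
  return vectors;
}
```

Usage would look like `await embedInBatches(chunks, embedChunks)`. The batches run sequentially here, which is the gentlest option for rate limits; running a few in parallel is a possible refinement.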

Storing Vectors in PostgreSQL with pgvector

Enable the extension, then create the table and a cosine-distance index:

CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  metadata JSONB,
  created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
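
An ivfflat index clusters vectors into lists, so it works best when it is built after the table already holds data and when the list count fits the row count. The sketch below encodes the starting points suggested in the pgvector README (rows / 1000 lists up to a million rows, the square root of the row count beyond that, and probes near the square root of the list count). The function names are hypothetical, and the numbers are starting points to tune, not guarantees.

```typescript
// Starting point for the ivfflat `lists` option, per the pgvector README.
function suggestIvfflatLists(rowCount: number): number {
  const lists = rowCount <= 1e6 ? rowCount / 1000 : Math.sqrt(rowCount);
  return Math.max(1, Math.round(lists));
}

// Starting point for the query-time `ivfflat.probes` setting.
function suggestIvfflatProbes(lists: number): number {
  return Math.max(1, Math.round(Math.sqrt(lists)));
}

// Build the index statement for the `documents` table defined above.
function buildIvfflatIndexSql(rowCount: number): string {
  const lists = suggestIvfflatLists(rowCount);
  return `CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = ${lists});`;
}

const indexSql = buildIvfflatIndexSql(200000);
// indexSql ends with "WITH (lists = 200);"
```

More probes improve recall at the cost of speed; they can be set per session with `SET ivfflat.probes = 10;`.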

Insert vectors from Node.js:

import pg from 'pg';
const pool = new pg.Pool();
async function storeChunks(chunks: string[], embeddings: number[][]) {
  const query = `
    INSERT INTO documents (content, embedding, metadata)
    VALUES ($1, $2::vector, $3)
  `;
  for (let i = 0; i < chunks.length; i++) {
    await pool.query(query, [
      chunks[i],
      JSON.stringify(embeddings[i]), // yields '[0.1,0.2,...]', which matches pgvector's text format
      { source: 'user_upload', chunk_index: i },
    ]);
  }
}
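
The loop above makes one database round trip per chunk. For larger uploads, a multi-row INSERT is one option. The sketch below only builds the statement and its parameters, so it can be checked without a database; `ChunkRow`, `toVectorLiteral`, and `buildBulkInsert` are hypothetical names, and the column list assumes the `documents` table defined earlier.

```typescript
interface ChunkRow {
  content: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

// pgvector accepts vectors in a bracketed text form, e.g. '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// Build one parameterized multi-row INSERT for a batch of chunks.
function buildBulkInsert(rows: ChunkRow[]): { text: string; values: unknown[] } {
  if (rows.length === 0) throw new RangeError('rows must not be empty');
  const values: unknown[] = [];
  const tuples = rows.map((row, i) => {
    const base = i * 3; // three placeholders per row
    values.push(row.content, toVectorLiteral(row.embedding), JSON.stringify(row.metadata));
    return `($${base + 1}, $${base + 2}::vector, $${base + 3}::jsonb)`;
  });
  return {
    text: `INSERT INTO documents (content, embedding, metadata) VALUES ${tuples.join(', ')}`,
    values,
  };
}
```

With the pool from above, usage would be `const { text, values } = buildBulkInsert(rows); await pool.query(text, values);`. PostgreSQL caps the number of bind parameters per statement, so very large uploads still need to be split into several such inserts.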

Find the top-k most relevant chunks:

async function retrieveContext(userQuery: string, k: number = 5) {
  // 1. Embed the user's question
  const response = await client.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuery,
  });
  const queryVector = response.data[0].embedding;
  // 2. Search PostgreSQL for closest vectors (<=> is pgvector's cosine distance)
  const results = await pool.query(
    `
    SELECT content, embedding <=> $1::vector AS distance
    FROM documents
    ORDER BY distance
    LIMIT $2
    `,
    [JSON.stringify(queryVector), k]
  );
  return results.rows.map(row => row.content);
}

The <=> operator computes cosine distance, which matches the vector_cosine_ops index created earlier; <-> would compute Euclidean (L2) distance instead and would not use that index. Smaller distance = higher relevance.
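
For intuition, the same ranking can be reproduced in plain TypeScript. pgvector's cosine distance operator (`<=>`) computes 1 minus the cosine similarity, so 0 means the vectors point the same way and 2 means they point in opposite directions. The sketch below is an in-memory illustration only; `StoredChunk`, `cosineDistance`, and `topKByCosine` are hypothetical names, and a real deployment would leave this work to the database.

```typescript
interface StoredChunk {
  content: string;
  embedding: number[];
}

// Cosine distance = 1 - cosine similarity. Undefined (NaN) for zero vectors.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new RangeError('vectors must have equal length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the contents of the k chunks closest to the query vector.
function topKByCosine(query: number[], docs: StoredChunk[], k: number): string[] {
  return docs
    .map((doc) => ({ content: doc.content, distance: cosineDistance(query, doc.embedding) }))
    .sort((x, y) => x.distance - y.distance) // smallest distance first
    .slice(0, k)
    .map((hit) => hit.content);
}

const nearest = topKByCosine(
  [1, 0],
  [
    { content: 'a', embedding: [1, 0] },
    { content: 'b', embedding: [0, 1] },
    { content: 'c', embedding: [0.9, 0.1] },
  ],
  2,
);
// nearest is ['a', 'c']
```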

Assembling the Prompt Context

Combine retrieved chunks into a system prompt:

import Anthropic from '@anthropic-ai/sdk';
// Claude is called through the Anthropic SDK, not the OpenAI client used for embeddings
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
async function answerWithRAG(userQuery: string) {
  const context = await retrieveContext(userQuery);
  const systemPrompt = `You are a helpful assistant. Answer the user's question using ONLY the following context. If the context doesn't contain the answer, say so.
Context:
${context.map((chunk, i) => `[${i + 1}] ${chunk}`).join('\n\n')}`;
  const response = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: 'user', content: userQuery }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}
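
Retrieved chunks can add up to more text than you want to send on every request. One possible refinement, sketched below under stated assumptions, is to assemble the prompt in a separate function that keeps chunks in relevance order until a size budget is reached. `buildSystemPrompt` is a hypothetical helper, and the 8000-character default is an arbitrary illustration, not a model limit.

```typescript
const PROMPT_HEADER =
  "You are a helpful assistant. Answer the user's question using ONLY the following context. " +
  "If the context doesn't contain the answer, say so.";

// Keep chunks (assumed sorted most relevant first) until the budget is used up.
function buildSystemPrompt(chunks: string[], maxChars = 8000): string {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.length > maxChars) break; // drop this and all less relevant chunks
    kept.push(`[${kept.length + 1}] ${chunk}`);
    used += chunk.length;
  }
  return `${PROMPT_HEADER}\nContext:\n${kept.join('\n\n')}`;
}

const promptDemo = buildSystemPrompt(['aaa', 'bbbb', 'cc'], 7);
// promptDemo contains "[1] aaa" and "[2] bbbb"; "cc" no longer fits the budget
```

Numbering the chunks also lets you ask the model to cite the entries it relied on, which makes answers easier to verify.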

Full End-to-End Pipeline

Tying it together:

import { readFileSync } from 'node:fs';
async function setupRAG(documentText: string) {
  // 1. Chunk the document
  const chunks = chunkBySentences(documentText);
  console.log(`Chunked into ${chunks.length} pieces`);
  // 2. Generate embeddings
  const embeddings = await embedChunks(chunks);
  console.log(`Generated ${embeddings.length} embeddings`);
  // 3. Store in PostgreSQL
  await storeChunks(chunks, embeddings);
  console.log('Stored in database');
  // 4. Answer a question
  const answer = await answerWithRAG('What is your main service?');
  console.log(answer);
}
// Load a document and run (ES module, so top-level await is available)
const docText = readFileSync('company_handbook.txt', 'utf-8');
await setupRAG(docText);

Summary

RAG transforms LLMs from guessing machines into fact machines. Chunk strategically. Embed with OpenAI. Store in pgvector. Search by cosine distance. Retrieve the top 5. Prompt with context. Your LLM now answers with precision grounded in your data.

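This pipeline is a solid baseline for thousands of documents. Before you take it to production, batch the embedding calls and inserts, add retries around the API calls, and tune the pgvector index for your data size.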
