btheo.com > press start to play
NODE.JS 4 MIN READ

Local LLMs in Node.js with Ollama

WARNING · DRAGON AHEAD

Hosted LLM APIs cost money per token and send your data to third parties. Local LLMs run on your machine, offline, free, and private. You own the data.

Why Local LLMs Matter

Hosted APIs (OpenAI, Anthropic):

  • ⚠️ $0.01-0.15 per 1K tokens (costs pile up)
  • ⚠️ Your data goes to a third party’s servers
  • ⚠️ Client data compliance becomes regulatory risk
  • ⚠️ Rate limits, API downtime

Local LLMs (Ollama):

  • ✔ Zero marginal cost (runs on your hardware)
  • ✔ No data leaves your machine
  • ✔ Easier HIPAA/GDPR compliance (data never leaves your infrastructure)
  • ✔ Works offline (no internet required)
  • ✔ Perfect for internal tools, document processing, code review
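The "costs pile up" point is easy to check with back-of-envelope arithmetic. A quick sketch — `estimateMonthlyCost` is a hypothetical helper, and the per-1K-token rates are just the range quoted above, not any provider's actual pricing:

```typescript
// Rough monthly cost of a hosted API at a given per-1K-token rate.
function estimateMonthlyCost(
  tokensPerDay: number,
  pricePer1kTokens: number,
  days: number = 30
): number {
  return (tokensPerDay / 1000) * pricePer1kTokens * days;
}

// A hypothetical internal tool processing 500K tokens/day:
const lowEnd = estimateMonthlyCost(500_000, 0.01); // $150/month
const highEnd = estimateMonthlyCost(500_000, 0.15); // $2,250/month
console.log(`Hosted: $${lowEnd}-$${highEnd}/month; local: hardware + electricity only`);
```

Even at the cheapest rate, a modest internal tool costs real money every month — which is the whole case for moving that workload onto hardware you already own.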

Setting Up Ollama

Install Ollama (macOS/Linux/Windows):

```sh
# macOS
brew install ollama
# Or download from https://ollama.ai
```

Pull a model:

```sh
ollama pull llama3.2   # ~5GB, best quality
ollama pull mistral    # ~4GB, faster
ollama pull phi3       # ~2GB, tiny but capable
```

Run the Ollama server (stays in background):

```sh
ollama serve
# Listens on http://localhost:11434
```

Test it manually:

```sh
ollama run llama3.2
# Type a prompt, get responses locally
```

Querying Ollama from Node.js

Use the built-in fetch (Node 18+) to hit the REST API:

```ts
async function queryOllama(prompt: string, model: string = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: false, // Get full response at once
    }),
  });
  const data = await response.json();
  return data.response; // The LLM's answer
}

const answer = await queryOllama('What is Node.js?');
console.log(answer);
```

Streaming Responses from Ollama

Chunk-based streaming for long responses:

```ts
async function queryOllamaStream(prompt: string, model: string = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: true, // Stream tokens as they generate
    }),
  });
  const reader = response.body?.getReader();
  if (!reader) return '';
  const decoder = new TextDecoder();
  let buffer = ''; // Holds any partial line between chunks
  let fullResponse = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true }); // Don't split multi-byte chars
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // Last element may be an incomplete line
    for (const line of lines) {
      if (line.trim()) {
        const data = JSON.parse(line);
        if (data.response) {
          process.stdout.write(data.response); // Print token immediately
          fullResponse += data.response;
        }
      }
    }
  }
  return fullResponse;
}

await queryOllamaStream('Write a haiku about Node.js');
```
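Ollama streams newline-delimited JSON, but the chunks that `reader.read()` hands you don't necessarily align with line boundaries — a JSON object can arrive split across two chunks. The buffering logic can be isolated into a small parser. This is an illustrative helper (`NdjsonParser` is not something Ollama ships), shown here because it's easy to unit-test in isolation:

```typescript
// Accumulates raw stream chunks and emits only complete NDJSON lines.
class NdjsonParser {
  private buffer = '';

  // Feed one decoded chunk; returns the parsed objects of any complete lines.
  push(chunk: string): Array<{ response?: string; done?: boolean }> {
    this.buffer += chunk;
    const lines = this.buffer.split('\n');
    this.buffer = lines.pop() ?? ''; // Last piece may be an incomplete line
    return lines.filter((l) => l.trim()).map((l) => JSON.parse(l));
  }
}

const parser = new NdjsonParser();
// A JSON object split across two network chunks still parses correctly:
const first = parser.push('{"response":"Hel');
const second = parser.push('lo"}\n{"response":" world"}\n');
console.log(first.length); // 0 — nothing complete yet
console.log(second.map((o) => o.response).join('')); // "Hello world"
```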

Model Comparison: Speed vs. Capability

| Model | Size | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| Phi-3 | 2.7GB | 40 tokens/sec | Good | Lightweight tasks, CPU-only |
| Mistral 7B | 4.1GB | 25 tokens/sec | Great | Balanced: speed + quality |
| Llama 3.2 | 4.7GB | 15 tokens/sec | Excellent | Complex reasoning, accuracy |
| Llama 3.1 70B | 40GB | 3 tokens/sec | Best | Research, production tasks |

Pick based on your hardware:

  • CPU-only: Phi-3
  • 4GB GPU: Mistral 7B
  • 8GB+ GPU: Llama 3.2
  • 24GB+ VRAM: Llama 70B (split across GPUs)
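That decision list can be encoded as a tiny helper. A sketch only — `pickModel` and its VRAM thresholds are an illustrative heuristic based on the list above, not an official mapping, and the `llama3.1:70b` tag is an assumption about how you'd pull that model:

```typescript
// Pick a model tag from available GPU memory, following the guide above.
function pickModel(vramGb: number): string {
  if (vramGb >= 24) return 'llama3.1:70b';
  if (vramGb >= 8) return 'llama3.2';
  if (vramGb >= 4) return 'mistral';
  return 'phi3'; // CPU-only or very small GPUs
}

console.log(pickModel(0)); // "phi3"
console.log(pickModel(8)); // "llama3.2"
```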

Production-Ready Wrapper Class

TypeScript wrapper for clean error handling:

```ts
class OllamaClient {
  private baseUrl = 'http://localhost:11434';

  async query(prompt: string, model: string = 'llama3.2'): Promise<string> {
    try {
      const response = await fetch(`${this.baseUrl}/api/generate`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt, stream: false }),
      });
      if (!response.ok) {
        throw new Error(`Ollama API error: ${response.status}`);
      }
      const data = await response.json();
      return data.response;
    } catch (error) {
      console.error('Ollama query failed:', error);
      throw error;
    }
  }

  async *streamQuery(prompt: string, model: string = 'llama3.2') {
    const response = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, prompt, stream: true }),
    });
    const reader = response.body?.getReader();
    if (!reader) return;
    const decoder = new TextDecoder();
    let buffer = '';
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // Keep any partial line for the next chunk
      for (const line of lines) {
        if (line.trim()) {
          const data = JSON.parse(line);
          if (data.response) {
            yield data.response;
          }
        }
      }
    }
  }

  async health(): Promise<boolean> {
    try {
      const response = await fetch(`${this.baseUrl}/api/tags`);
      return response.ok;
    } catch {
      return false;
    }
  }
}

// Usage
const ollama = new OllamaClient();
const isHealthy = await ollama.health();
if (isHealthy) {
  const answer = await ollama.query('Explain closures in JavaScript');
  console.log(answer);
}
```

Real-World Use Cases

Document Summarization:

```ts
import fs from 'node:fs';

// Note: a PDF must be converted to text first (e.g. with a library like
// pdf-parse) — reading a .pdf file as UTF-8 yields binary garbage.
const docText = fs.readFileSync('report.txt', 'utf-8');
const summary = await ollama.query(
  `Summarize this report in 3 sentences:\n\n${docText}`
);
console.log(summary);
```
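Local models have a limited context window, so very long documents need to be split before summarizing. A naive character-based chunker — a sketch only: the 8,000-character default is an arbitrary assumption to tune to your model's context size, and real chunkers typically split on tokens rather than characters:

```typescript
// Split text into chunks of at most maxChars, breaking on paragraph
// boundaries. A paragraph longer than maxChars is kept whole.
function chunkText(text: string, maxChars: number = 8000): string[] {
  const paragraphs = text.split('\n\n');
  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    if (current && current.length + p.length + 2 > maxChars) {
      chunks.push(current); // Current chunk is full — start a new one
      current = p;
    } else {
      current = current ? `${current}\n\n${p}` : p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Map-reduce style: summarize each chunk, then summarize the summaries:
// const partials = await Promise.all(
//   chunkText(docText).map((c) => ollama.query(`Summarize:\n\n${c}`))
// );
// const final = await ollama.query(`Combine into 3 sentences:\n\n${partials.join('\n')}`);
```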

Code Review Assistant:

```ts
import fs from 'node:fs';

const code = fs.readFileSync('app.ts', 'utf-8');
const review = await ollama.query(
  `Review this code for bugs, performance issues, and best practices:\n\n${code}`
);
```

Internal Chatbot:

```ts
const ollama = new OllamaClient();
for await (const token of ollama.streamQuery('What is TypeScript?', 'mistral')) {
  process.stdout.write(token);
}
```
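The `/api/generate` endpoint is stateless, so a multi-turn chatbot has to carry the conversation history itself. One simple approach is folding prior turns back into the prompt — `buildPrompt` is an illustrative helper, and the `User:`/`Assistant:` framing is just one common convention, not something Ollama requires (Ollama also has a `/api/chat` endpoint that accepts a messages array):

```typescript
// Fold prior turns into a single prompt for the stateless /api/generate.
type Turn = { role: 'user' | 'assistant'; content: string };

function buildPrompt(history: Turn[], nextMessage: string): string {
  const transcript = history
    .map((t) => `${t.role === 'user' ? 'User' : 'Assistant'}: ${t.content}`)
    .join('\n');
  return `${transcript}\nUser: ${nextMessage}\nAssistant:`;
}

const history: Turn[] = [
  { role: 'user', content: 'What is TypeScript?' },
  { role: 'assistant', content: 'A typed superset of JavaScript.' },
];
console.log(buildPrompt(history, 'Who maintains it?'));
// User: What is TypeScript?
// Assistant: A typed superset of JavaScript.
// User: Who maintains it?
// Assistant:
```

Append each model reply to `history` after it finishes streaming, and the next turn sees the full conversation.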

Limitations vs. Hosted APIs

Local LLMs are good for:

  • Private data (no external transmission)
  • Offline scenarios
  • Cost-sensitive applications
  • Internal tooling

But they lose to hosted APIs on:

  • 🔹 Quality (GPT-4 > Llama 3.2)
  • 🔹 Latest knowledge (local models aren’t updated)
  • 🔹 Complex reasoning (Claude 3.5 Sonnet is better)
  • 🔹 Multimodal tasks (vision, audio — limited in local models)

Hybrid approach: Use Ollama for cheap, private tasks. Reserve Claude/GPT-4 for production reasoning.
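That hybrid policy can be made explicit in code. A sketch of one possible routing heuristic — `chooseBackend`, the `Task` shape, and the rules are all illustrative assumptions, not a library API:

```typescript
// Route a task to local or hosted inference based on the trade-offs above.
type Task = { containsPrivateData: boolean; needsComplexReasoning: boolean };

function chooseBackend(task: Task): 'ollama' | 'hosted' {
  if (task.containsPrivateData) return 'ollama'; // Privacy wins outright
  if (task.needsComplexReasoning) return 'hosted'; // Pay for quality when it matters
  return 'ollama'; // Default to free and local
}

console.log(chooseBackend({ containsPrivateData: true, needsComplexReasoning: true })); // "ollama"
console.log(chooseBackend({ containsPrivateData: false, needsComplexReasoning: true })); // "hosted"
```

Note that privacy is checked first: even a hard reasoning task stays local if it touches client data, which matches the compliance argument earlier in the post.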

Summary

Local LLMs are production-ready for the right workloads. Ollama makes setup trivial. Llama 3.2 and Mistral offer excellent quality for their size. All data stays private, with zero API costs. Perfect for internal tools, document processing, and compliance-heavy domains.

Don’t pay per token for work you can run offline.
