# Local LLMs in Node.js with Ollama
Hosted LLM APIs cost money per token and send your data to third parties. Local LLMs run on your own machine: offline, free, and private. You own the data.
## Why Local LLMs Matter
Hosted APIs (OpenAI, Anthropic):
- ⚠️ $0.01-0.15 per 1K tokens (costs pile up)
- ⚠️ Your data goes to a third party’s servers
- ⚠️ Client data compliance becomes regulatory risk
- ⚠️ Rate limits, API downtime
Local LLMs (Ollama):
- ✔ Zero marginal cost (runs on your hardware)
- ✔ No data leaves your machine
- ✔ Easier HIPAA/GDPR compliance (data stays on infrastructure you control)
- ✔ Works offline (no internet required)
- ✔ Perfect for internal tools, document processing, code review
## Setting Up Ollama
Install Ollama (macOS/Linux/Windows):
```bash
# macOS
brew install ollama

# Or download from https://ollama.ai
```

Pull a model:

```bash
ollama pull llama3.2   # ~5GB, best quality
ollama pull mistral    # ~4GB, faster
ollama pull phi-3      # ~2GB, tiny but capable
```

Run the Ollama server (stays in the background):

```bash
ollama serve
# Listens on http://localhost:11434
```

Test it manually:

```bash
ollama run llama3.2
# Type a prompt, get responses locally
```

## Querying Ollama from Node.js
Use the built-in `fetch` (Node 18+) to hit Ollama's REST API:
```typescript
async function queryOllama(prompt: string, model: string = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: false, // Get full response at once
    }),
  });

  const data = await response.json();
  return data.response; // The LLM's answer
}

const answer = await queryOllama('What is Node.js?');
console.log(answer);
```

## Streaming Responses from Ollama
Chunk-based streaming for long responses:
```typescript
async function queryOllamaStream(prompt: string, model: string = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: true, // Stream tokens as they generate
    }),
  });

  const reader = response.body?.getReader();
  if (!reader) return '';

  const decoder = new TextDecoder();
  let fullResponse = '';
  let buffered = ''; // Carries any partial JSON line across chunk boundaries

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffered += decoder.decode(value, { stream: true });
    const lines = buffered.split('\n');
    buffered = lines.pop() ?? ''; // Last element may be an incomplete line

    for (const line of lines) {
      if (!line) continue;
      const data = JSON.parse(line);
      if (data.response) {
        process.stdout.write(data.response); // Print each token immediately
        fullResponse += data.response;
      }
    }
  }

  return fullResponse;
}

await queryOllamaStream('Write a haiku about Node.js');
```

## Model Comparison: Speed vs. Capability
| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| Phi-3 | 2.7GB | 40 tokens/sec | Good | Lightweight tasks, CPU-only |
| Mistral 7B | 4.1GB | 25 tokens/sec | Great | Balanced: speed + quality |
| Llama 3.2 | 4.7GB | 15 tokens/sec | Excellent | Complex reasoning, accuracy |
| Llama 3.1 70B | 40GB | 3 tokens/sec | Best | Research, production tasks |
Pick based on your hardware (a sketch for checking which models you already have pulled follows this list):
- CPU-only: Phi-3
- 4GB GPU: Mistral 7B
- 8GB+ GPU: Llama 3.2
- 24GB+ VRAM: Llama 70B (split across GPUs)
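Whichever tier fits your hardware, it's worth confirming the model is actually pulled before you query it. Here's a minimal sketch using the same `/api/tags` endpoint the `health()` check uses later; it assumes the documented response shape (`{ models: [{ name, ... }] }`), and `pickModel` plus the preference list are made up for illustration:

```typescript
// Return the first preferred model that is already pulled locally.
// Assumes GET /api/tags responds with { models: [{ name: string, ... }] }.
async function pickModel(
  preferred: string[] = ['llama3.2', 'mistral', 'phi-3']
): Promise<string> {
  const response = await fetch('http://localhost:11434/api/tags');
  if (!response.ok) {
    throw new Error(`Ollama not reachable: ${response.status}`);
  }

  const { models } = (await response.json()) as { models: { name: string }[] };
  const installed = new Set(models.map((m) => m.name.split(':')[0])); // strip ":latest" tags

  const match = preferred.find((name) => installed.has(name));
  if (!match) {
    throw new Error(`None of [${preferred.join(', ')}] are pulled. Run "ollama pull <model>".`);
  }
  return match;
}

const model = await pickModel();
console.log(`Using ${model}`);
```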
## Production-Ready Wrapper Class
TypeScript wrapper for clean error handling:
```typescript
class OllamaClient {
  private baseUrl = 'http://localhost:11434';

  async query(prompt: string, model: string = 'llama3.2'): Promise<string> {
    try {
      const response = await fetch(`${this.baseUrl}/api/generate`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt, stream: false }),
      });

      if (!response.ok) {
        throw new Error(`Ollama API error: ${response.status}`);
      }

      const data = await response.json();
      return data.response;
    } catch (error) {
      console.error('Ollama query failed:', error);
      throw error;
    }
  }

  async *streamQuery(prompt: string, model: string = 'llama3.2') {
    const response = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, prompt, stream: true }),
    });

    const reader = response.body?.getReader();
    if (!reader) return;

    const decoder = new TextDecoder();
    let buffered = ''; // Carries partial JSON lines across chunk boundaries

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffered += decoder.decode(value, { stream: true });
      const lines = buffered.split('\n');
      buffered = lines.pop() ?? '';

      for (const line of lines) {
        if (!line) continue;
        const data = JSON.parse(line);
        if (data.response) {
          yield data.response;
        }
      }
    }
  }

  async health(): Promise<boolean> {
    try {
      const response = await fetch(`${this.baseUrl}/api/tags`);
      return response.ok;
    } catch {
      return false;
    }
  }
}

// Usage
const ollama = new OllamaClient();
const isHealthy = await ollama.health();

if (isHealthy) {
  const answer = await ollama.query('Explain closures in JavaScript');
  console.log(answer);
}
```
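One gap in the wrapper above: a request to a cold model can hang for a long time while weights load, and `fetch` will wait indefinitely. A hedged sketch of adding a timeout with `AbortSignal.timeout()` (available in recent Node releases; the 30-second figure is an arbitrary choice, not an Ollama default):

```typescript
// Variant of query() that aborts if Ollama doesn't answer in time.
// AbortSignal.timeout() requires a recent Node release (18.17+/20+).
async function queryWithTimeout(
  prompt: string,
  model: string = 'llama3.2',
  timeoutMs: number = 30_000, // arbitrary; tune for your hardware and model size
): Promise<string> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: false }),
    signal: AbortSignal.timeout(timeoutMs), // throws a TimeoutError if exceeded
  });

  if (!response.ok) {
    throw new Error(`Ollama API error: ${response.status}`);
  }

  const data = await response.json();
  return data.response;
}
```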
## Real-World Use Cases

Document Summarization:
```typescript
import fs from 'node:fs';

// Summarize a plain-text document (extract text from PDFs/Word docs first)
const docText = fs.readFileSync('report.txt', 'utf-8');
const summary = await ollama.query(
  `Summarize this report in 3 sentences:\n\n${docText}`
);
console.log(summary);
```
Code Review Assistant:

```typescript
const code = fs.readFileSync('app.ts', 'utf-8');
const review = await ollama.query(
  `Review this code for bugs, performance issues, and best practices:\n\n${code}`
);
console.log(review);
```
Internal Chatbot:

```typescript
const ollama = new OllamaClient();

for await (const token of ollama.streamQuery('What is TypeScript?', 'mistral')) {
  process.stdout.write(token);
}
```
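The chatbot snippet above is single-turn: `/api/generate` takes one prompt and keeps no memory of earlier exchanges. For multi-turn conversations, Ollama also exposes a `/api/chat` endpoint that accepts a `messages` array; the sketch below assumes the documented non-streaming response shape (`data.message.content`), and the in-memory `history` array is just one simple way to hold context:

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep the conversation history and send it on every turn.
const history: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant for our engineering team.' },
];

async function chat(userMessage: string, model: string = 'mistral'): Promise<string> {
  history.push({ role: 'user', content: userMessage });

  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages: history, stream: false }),
  });

  if (!response.ok) throw new Error(`Ollama API error: ${response.status}`);

  // Non-streaming /api/chat responses carry the reply in data.message.content.
  const data = await response.json();
  history.push(data.message);
  return data.message.content;
}

console.log(await chat('What is TypeScript?'));
console.log(await chat('Show me a one-line example.')); // follow-up uses prior context
```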
## Limitations vs. Hosted APIs

Local LLMs are good for:
- Private data (no external transmission)
- Offline scenarios
- Cost-sensitive applications
- Internal tooling
But they lose to hosted APIs on:
- 🔹 Quality (GPT-4 > Llama 3.2)
- 🔹 Latest knowledge (local models are frozen at their training cutoff until you pull a newer release)
- 🔹 Complex reasoning (Claude 3.5 Sonnet is better)
- 🔹 Multimodal tasks (vision, audio — limited in local models)
Hybrid approach: Use Ollama for cheap, private tasks. Reserve Claude/GPT-4 for production reasoning.
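A minimal sketch of that routing decision, reusing the `OllamaClient` from earlier. The `Task` shape, the `isSensitive`/`needsDeepReasoning` flags, and `callHostedModel` are all placeholders for whatever classification and hosted-API client you already have; this is one way to wire it, not a prescribed pattern:

```typescript
// `callHostedModel` stands in for your hosted API client (OpenAI, Anthropic, etc.).
declare function callHostedModel(prompt: string): Promise<string>;

interface Task {
  prompt: string;
  isSensitive: boolean;        // client data, internal docs: must not leave the machine
  needsDeepReasoning: boolean; // complex multi-step work worth paying for
}

const ollama = new OllamaClient();

async function routeTask(task: Task): Promise<string> {
  if (task.isSensitive || !task.needsDeepReasoning) {
    return ollama.query(task.prompt);  // free, private, runs locally
  }
  return callHostedModel(task.prompt); // pay per token for harder reasoning
}
```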
## Summary
Local LLMs are ready for production. Ollama makes setup trivial. Llama 3.2 and Mistral offer excellent quality. You keep all data private. Zero API costs. Perfect for internal tools, document processing, and compliance-heavy domains.
Don’t pay per token for work you can run offline.