# Local LLMs in Node.js with Ollama
Hosted LLM APIs cost money per token and send your data to third parties. Local LLMs run on your own machine: offline, free, and private. You own the data.
## Why Local LLMs Matter
Hosted APIs (OpenAI, Anthropic):
- ⚠️ $0.01-0.15 per 1K tokens (costs pile up)
- ⚠️ Your data goes to a third party’s servers
- ⚠️ Client data compliance becomes regulatory risk
- ⚠️ Rate limits, API downtime
Local LLMs (Ollama):
- ✔ Zero marginal cost (runs on your hardware)
- ✔ No data leaves your machine
- ✔ Easier HIPAA/GDPR compliance (data stays on infrastructure you control)
- ✔ Works offline (no internet required)
- ✔ Perfect for internal tools, document processing, code review
## Setting Up Ollama
Install Ollama (macOS/Linux/Windows):
```bash
# macOS
brew install ollama

# Or download from https://ollama.ai
```

Pull a model:

```bash
ollama pull llama3.2   # ~5GB, best quality
ollama pull mistral    # ~4GB, faster
ollama pull phi-3      # ~2GB, tiny but capable
```

Run the Ollama server (stays in the background):

```bash
ollama serve
# Listens on http://localhost:11434
```

Test it manually:

```bash
ollama run llama3.2
# Type a prompt, get responses locally
```

## Querying Ollama from Node.js
Use the built-in `fetch` (Node 18+) to hit Ollama's REST API:
```typescript
async function queryOllama(prompt: string, model: string = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: false, // Get full response at once
    }),
  });

  const data = await response.json();
  return data.response; // The LLM's answer
}

const answer = await queryOllama('What is Node.js?');
console.log(answer);
```

## Streaming Responses from Ollama
Chunk-based streaming for long responses:
```typescript
async function queryOllamaStream(prompt: string, model: string = 'llama3.2') {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      prompt,
      stream: true, // Stream tokens as they generate
    }),
  });

  const reader = response.body?.getReader();
  if (!reader) return '';

  const decoder = new TextDecoder();
  let fullResponse = '';
  let buffered = ''; // Carries any partial JSON line across chunk boundaries

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffered += decoder.decode(value, { stream: true });
    const lines = buffered.split('\n');
    buffered = lines.pop() ?? ''; // Last element may be an incomplete line

    for (const line of lines) {
      if (!line) continue;
      const data = JSON.parse(line);
      if (data.response) {
        process.stdout.write(data.response); // Print each token immediately
        fullResponse += data.response;
      }
    }
  }

  return fullResponse;
}

await queryOllamaStream('Write a haiku about Node.js');
```

## Model Comparison: Speed vs. Capability
| Model | Size | Speed | Quality | Best For |
|---|---|---|---|---|
| Phi-3 | 2.7GB | 40 tokens/sec | Good | Lightweight tasks, CPU-only |
| Mistral 7B | 4.1GB | 25 tokens/sec | Great | Balanced: speed + quality |
| Llama 3.2 | 4.7GB | 15 tokens/sec | Excellent | Complex reasoning, accuracy |
| Llama 3.1 70B | 40GB | 3 tokens/sec | Best | Research, production tasks |
Pick based on your hardware (a sketch for checking which models you already have pulled follows this list):
- CPU-only: Phi-3
- 4GB GPU: Mistral 7B
- 8GB+ GPU: Llama 3.2
- 24GB+ VRAM: Llama 70B (split across GPUs)
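Whichever tier fits your hardware, it's worth confirming the model is actually pulled before you query it. Here's a minimal sketch using the same `/api/tags` endpoint the `health()` check uses later; it assumes the documented response shape (`{ models: [{ name, ... }] }`), and `pickModel` plus the preference list are made up for illustration:

```typescript
// Return the first preferred model that is already pulled locally.
// Assumes GET /api/tags responds with { models: [{ name: string, ... }] }.
async function pickModel(
  preferred: string[] = ['llama3.2', 'mistral', 'phi-3']
): Promise<string> {
  const response = await fetch('http://localhost:11434/api/tags');
  if (!response.ok) {
    throw new Error(`Ollama not reachable: ${response.status}`);
  }

  const { models } = (await response.json()) as { models: { name: string }[] };
  const installed = new Set(models.map((m) => m.name.split(':')[0])); // strip ":latest" tags

  const match = preferred.find((name) => installed.has(name));
  if (!match) {
    throw new Error(`None of [${preferred.join(', ')}] are pulled. Run "ollama pull <model>".`);
  }
  return match;
}

const model = await pickModel();
console.log(`Using ${model}`);
```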
## Production-Ready Wrapper Class
TypeScript wrapper for clean error handling:
```typescript
class OllamaClient {
  private baseUrl = 'http://localhost:11434';

  async query(prompt: string, model: string = 'llama3.2'): Promise<string> {
    try {
      const response = await fetch(`${this.baseUrl}/api/generate`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ model, prompt, stream: false }),
      });

      if (!response.ok) {
        throw new Error(`Ollama API error: ${response.status}`);
      }

      const data = await response.json();
      return data.response;
    } catch (error) {
      console.error('Ollama query failed:', error);
      throw error;
    }
  }

  async *streamQuery(prompt: string, model: string = 'llama3.2') {
    const response = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, prompt, stream: true }),
    });

    const reader = response.body?.getReader();
    if (!reader) return;

    const decoder = new TextDecoder();
    let buffered = ''; // Carries partial JSON lines across chunk boundaries

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffered += decoder.decode(value, { stream: true });
      const lines = buffered.split('\n');
      buffered = lines.pop() ?? '';

      for (const line of lines) {
        if (!line) continue;
        const data = JSON.parse(line);
        if (data.response) {
          yield data.response;
        }
      }
    }
  }

  async health(): Promise<boolean> {
    try {
      const response = await fetch(`${this.baseUrl}/api/tags`);
      return response.ok;
    } catch {
      return false;
    }
  }
}

// Usage
const ollama = new OllamaClient();
const isHealthy = await ollama.health();

if (isHealthy) {
  const answer = await ollama.query('Explain closures in JavaScript');
  console.log(answer);
}
```
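One gap in the wrapper above: a request to a cold model can hang for a long time while weights load, and `fetch` will wait indefinitely. A hedged sketch of adding a timeout with `AbortSignal.timeout()` (available in recent Node releases; the 30-second figure is an arbitrary choice, not an Ollama default):

```typescript
// Variant of query() that aborts if Ollama doesn't answer in time.
// AbortSignal.timeout() requires a recent Node release (18.17+/20+).
async function queryWithTimeout(
  prompt: string,
  model: string = 'llama3.2',
  timeoutMs: number = 30_000, // arbitrary; tune for your hardware and model size
): Promise<string> {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: false }),
    signal: AbortSignal.timeout(timeoutMs), // throws a TimeoutError if exceeded
  });

  if (!response.ok) {
    throw new Error(`Ollama API error: ${response.status}`);
  }

  const data = await response.json();
  return data.response;
}
```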
## Real-World Use Cases

Document Summarization:
```typescript
import fs from 'node:fs';

// Summarize a plain-text document (extract text from PDFs/Word docs first)
const docText = fs.readFileSync('report.txt', 'utf-8');
const summary = await ollama.query(
  `Summarize this report in 3 sentences:\n\n${docText}`
);
console.log(summary);
```
Code Review Assistant:

```typescript
const code = fs.readFileSync('app.ts', 'utf-8');
const review = await ollama.query(
  `Review this code for bugs, performance issues, and best practices:\n\n${code}`
);
console.log(review);
```
Internal Chatbot:

```typescript
const ollama = new OllamaClient();

for await (const token of ollama.streamQuery('What is TypeScript?', 'mistral')) {
  process.stdout.write(token);
}
```
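The chatbot snippet above is single-turn: `/api/generate` takes one prompt and keeps no memory of earlier exchanges. For multi-turn conversations, Ollama also exposes a `/api/chat` endpoint that accepts a `messages` array; the sketch below assumes the documented non-streaming response shape (`data.message.content`), and the in-memory `history` array is just one simple way to hold context:

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep the conversation history and send it on every turn.
const history: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant for our engineering team.' },
];

async function chat(userMessage: string, model: string = 'mistral'): Promise<string> {
  history.push({ role: 'user', content: userMessage });

  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages: history, stream: false }),
  });

  if (!response.ok) throw new Error(`Ollama API error: ${response.status}`);

  // Non-streaming /api/chat responses carry the reply in data.message.content.
  const data = await response.json();
  history.push(data.message);
  return data.message.content;
}

console.log(await chat('What is TypeScript?'));
console.log(await chat('Show me a one-line example.')); // follow-up uses prior context
```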
## Limitations vs. Hosted APIs

Local LLMs are good for:
- Private data (no external transmission)
- Offline scenarios
- Cost-sensitive applications
- Internal tooling
But they lose to hosted APIs on:
- 🔹 Quality (GPT-4 > Llama 3.2)
- 🔹 Latest knowledge (local models are frozen at their training cutoff until you pull a newer release)
- 🔹 Complex reasoning (Claude 3.5 Sonnet is better)
- 🔹 Multimodal tasks (vision, audio — limited in local models)
Hybrid approach: Use Ollama for cheap, private tasks. Reserve Claude/GPT-4 for production reasoning.
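A minimal sketch of that routing decision, reusing the `OllamaClient` from earlier. The `Task` shape, the `isSensitive`/`needsDeepReasoning` flags, and `callHostedModel` are all placeholders for whatever classification and hosted-API client you already have; this is one way to wire it, not a prescribed pattern:

```typescript
// `callHostedModel` stands in for your hosted API client (OpenAI, Anthropic, etc.).
declare function callHostedModel(prompt: string): Promise<string>;

interface Task {
  prompt: string;
  isSensitive: boolean;        // client data, internal docs: must not leave the machine
  needsDeepReasoning: boolean; // complex multi-step work worth paying for
}

const ollama = new OllamaClient();

async function routeTask(task: Task): Promise<string> {
  if (task.isSensitive || !task.needsDeepReasoning) {
    return ollama.query(task.prompt);  // free, private, runs locally
  }
  return callHostedModel(task.prompt); // pay per token for harder reasoning
}
```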
## Summary
Local LLMs are ready for production. Ollama makes setup trivial. Llama 3.2 and Mistral offer excellent quality. You keep all data private. Zero API costs. Perfect for internal tools, document processing, and compliance-heavy domains.
Don’t pay per token for work you can run offline.