Ollama + Local Models

Run powerful models locally on your machine — free, private, offline — and route smartly between local and Claude.

Why run local models alongside Claude?

Ollama is an open-source tool that lets you run large language models locally on your own machine — Llama 3, Mistral, Qwen, Gemma, DeepSeek, and dozens more. Running local models has real advantages:

Zero cost per token — local inference is free after the one-time hardware cost
Complete privacy — sensitive data never leaves your machine
Offline capability — works without internet (on planes, secure environments)
High volume tasks — batch processing thousands of records without API cost
Experimentation — test open-source models before committing to the Claude API

The smart approach: use local models for high-volume, low-stakes tasks, and Claude API for complex reasoning and production-quality outputs.

Install Ollama

macOS

brew install ollama
# Or download from https://ollama.ai

Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows

# Download the Windows installer from https://ollama.ai/download
# Requires Windows 10/11, 64-bit

Pull and run models

Terminal

# Pull a model (one-time download)
ollama pull llama3.2          # Meta Llama 3.2 (3B, fast, general)
ollama pull qwen2.5-coder     # Alibaba Qwen 2.5 Coder (coding specialist)
ollama pull deepseek-coder-v2 # DeepSeek Coder V2 (strong coding)
ollama pull mistral-nemo      # Mistral NeMo (12B, multilingual)
ollama pull phi4              # Microsoft Phi-4 (14B, reasoning)

# Run interactively
ollama run llama3.2

# List downloaded models
ollama list

# Check if server is running
ollama serve   # runs at http://localhost:11434

Hardware guide — which model fits your machine?

Your RAM / VRAM	Recommended models	Quality level
8 GB RAM	llama3.2 (3B), phi3.5, gemma2:2b	Good for simple tasks
16 GB RAM	llama3.2 (7B), mistral, qwen2.5-coder	Solid general-purpose
32 GB RAM	llama3.1 (13B), deepseek-coder-v2, phi4	Strong, near GPT-3.5 level
64 GB RAM / 24 GB VRAM	llama3.3 (70B), qwen2.5 (72B)	Near Claude Haiku quality
Mac M1/M2/M3 (unified memory)	Use your full RAM — all models above	Excellent efficiency on Apple Silicon

Access Ollama from Python (same API as Claude)

Python — Ollama has an OpenAI-compatible API

from ollama import Client

client = Client(host='http://localhost:11434')

response = client.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain a star schema in 3 sentences.'}
    ]
)
print(response['message']['content'])

Connect Ollama to Claude Code via MCP

The most powerful setup: Claude Code as your reasoning engine, with Ollama providing cheap local inference for repetitive subtasks. Use the MCP Ollama server:

Install the Ollama MCP server

npm install -g mcp-server-ollama

claude_desktop_config.json — add Ollama MCP

{
  "mcpServers": {
    "ollama": {
      "command": "mcp-server-ollama",
      "args": [],
      "env": {
        "OLLAMA_HOST": "http://localhost:11434"
      }
    }
  }
}

Now in Claude Desktop you can say: "Use the local llama3.2 model to classify all these 10,000 customer support tickets into categories" — Claude orchestrates the task, Ollama does the bulk inference cheaply.

Use Ollama with Claude Code directly

Terminal — add Ollama to Claude Code

claude mcp add ollama mcp-server-ollama
claude mcp list  # verify it appears

Inside a Claude Code session, you can now ask: "Use the local qwen2.5-coder model to generate boilerplate for all 20 of these API endpoints, then I'll review them".

Switching between Claude API and Ollama in your code

Python — smart routing: Claude for complex, Ollama for bulk

import anthropic
from ollama import Client as OllamaClient

claude  = anthropic.Anthropic()
ollama  = OllamaClient(host='http://localhost:11434')

def classify(text: str, use_local: bool = False) -> str:
    '''Route to local or cloud model based on task importance.'''
    if use_local:
        # Cheap, fast, private — good for bulk classification
        r = ollama.chat(
            model='llama3.2',
            messages=[{'role':'user', 'content': f'Classify: {text}'}]
        )
        return r['message']['content']
    else:
        # High quality — for production output or complex reasoning
        r = claude.messages.create(
            model='claude-sonnet-4-5',
            max_tokens=256,
            messages=[{'role':'user', 'content': f'Classify: {text}'}]
        )
        return r.content[0].text

# Bulk: use local
results = [classify(ticket, use_local=True) for ticket in tickets]

# High-stakes: use Claude
final_summary = classify(combined_output, use_local=False)

Recommended local models by use case

Use case	Best local model	Why
Code generation / review	qwen2.5-coder, deepseek-coder-v2	Trained specifically on code
Text classification	llama3.2, phi3.5	Fast, small, accurate
Summarization	mistral-nemo	Strong at long-form compression
SQL generation	qwen2.5-coder	Excellent SQL benchmark scores
Multilingual tasks	qwen2.5, mistral-nemo	Strong non-English performance
Math / reasoning	phi4, deepseek-r1	Chain-of-thought reasoning

THE HYBRID WORKFLOW Think of local models and Claude as a team. Local Ollama handles: bulk classification, data cleaning, boilerplate generation, offline work, private data. Claude API handles: architectural decisions, complex reasoning, final output quality, anything customer-facing. The sweet spot is routing automatically based on task complexity — and you can build that routing with a few lines of Python.

← Computer Use Crack the Claude Exam →