Ollama + Local Models
Run powerful models locally on your machine — free, private, offline — and route smartly between local and Claude.
Why run local models alongside Claude?
Ollama is an open-source tool that lets you run large language models locally on your own machine — Llama 3, Mistral, Qwen, Gemma, DeepSeek, and dozens more. Running local models has real advantages:
- Zero cost per token — local inference is free after the one-time hardware cost
- Complete privacy — sensitive data never leaves your machine
- Offline capability — works without internet (on planes, secure environments)
- High volume tasks — batch processing thousands of records without API cost
- Experimentation — test open-source models before committing to the Claude API
The smart approach: use local models for high-volume, low-stakes tasks, and Claude API for complex reasoning and production-quality outputs.
Install Ollama
brew install ollama # Or download from https://ollama.ai
curl -fsSL https://ollama.ai/install.sh | sh
# Download the Windows installer from https://ollama.ai/download # Requires Windows 10/11, 64-bit
Pull and run models
# Pull a model (one-time download) ollama pull llama3.2 # Meta Llama 3.2 (3B, fast, general) ollama pull qwen2.5-coder # Alibaba Qwen 2.5 Coder (coding specialist) ollama pull deepseek-coder-v2 # DeepSeek Coder V2 (strong coding) ollama pull mistral-nemo # Mistral NeMo (12B, multilingual) ollama pull phi4 # Microsoft Phi-4 (14B, reasoning) # Run interactively ollama run llama3.2 # List downloaded models ollama list # Check if server is running ollama serve # runs at http://localhost:11434
Hardware guide — which model fits your machine?
| Your RAM / VRAM | Recommended models | Quality level |
|---|---|---|
| 8 GB RAM | llama3.2 (3B), phi3.5, gemma2:2b | Good for simple tasks |
| 16 GB RAM | llama3.2 (7B), mistral, qwen2.5-coder | Solid general-purpose |
| 32 GB RAM | llama3.1 (13B), deepseek-coder-v2, phi4 | Strong, near GPT-3.5 level |
| 64 GB RAM / 24 GB VRAM | llama3.3 (70B), qwen2.5 (72B) | Near Claude Haiku quality |
| Mac M1/M2/M3 (unified memory) | Use your full RAM — all models above | Excellent efficiency on Apple Silicon |
Access Ollama from Python (same API as Claude)
from ollama import Client
client = Client(host='http://localhost:11434')
response = client.chat(
model='llama3.2',
messages=[
{'role': 'user', 'content': 'Explain a star schema in 3 sentences.'}
]
)
print(response['message']['content'])
Connect Ollama to Claude Code via MCP
The most powerful setup: Claude Code as your reasoning engine, with Ollama providing cheap local inference for repetitive subtasks. Use the MCP Ollama server:
npm install -g mcp-server-ollama
{
"mcpServers": {
"ollama": {
"command": "mcp-server-ollama",
"args": [],
"env": {
"OLLAMA_HOST": "http://localhost:11434"
}
}
}
}
Now in Claude Desktop you can say: "Use the local llama3.2 model to classify all these 10,000 customer support tickets into categories" — Claude orchestrates the task, Ollama does the bulk inference cheaply.
Use Ollama with Claude Code directly
claude mcp add ollama mcp-server-ollama claude mcp list # verify it appears
Inside a Claude Code session, you can now ask: "Use the local qwen2.5-coder model to generate boilerplate for all 20 of these API endpoints, then I'll review them".
Switching between Claude API and Ollama in your code
import anthropic
from ollama import Client as OllamaClient
claude = anthropic.Anthropic()
ollama = OllamaClient(host='http://localhost:11434')
def classify(text: str, use_local: bool = False) -> str:
'''Route to local or cloud model based on task importance.'''
if use_local:
# Cheap, fast, private — good for bulk classification
r = ollama.chat(
model='llama3.2',
messages=[{'role':'user', 'content': f'Classify: {text}'}]
)
return r['message']['content']
else:
# High quality — for production output or complex reasoning
r = claude.messages.create(
model='claude-sonnet-4-5',
max_tokens=256,
messages=[{'role':'user', 'content': f'Classify: {text}'}]
)
return r.content[0].text
# Bulk: use local
results = [classify(ticket, use_local=True) for ticket in tickets]
# High-stakes: use Claude
final_summary = classify(combined_output, use_local=False)
Recommended local models by use case
| Use case | Best local model | Why |
|---|---|---|
| Code generation / review | qwen2.5-coder, deepseek-coder-v2 | Trained specifically on code |
| Text classification | llama3.2, phi3.5 | Fast, small, accurate |
| Summarization | mistral-nemo | Strong at long-form compression |
| SQL generation | qwen2.5-coder | Excellent SQL benchmark scores |
| Multilingual tasks | qwen2.5, mistral-nemo | Strong non-English performance |
| Math / reasoning | phi4, deepseek-r1 | Chain-of-thought reasoning |