🎯 SQL Practice

Ollama + Local Models

Run powerful models locally on your machine — free, private, offline — and route smartly between local and Claude.

Why run local models alongside Claude?

Ollama is an open-source tool that lets you run large language models locally on your own machine — Llama 3, Mistral, Qwen, Gemma, DeepSeek, and dozens more. Running local models has real advantages:

The smart approach: use local models for high-volume, low-stakes tasks, and Claude API for complex reasoning and production-quality outputs.

Install Ollama

macOS
brew install ollama
# Or download from https://ollama.ai
Linux
curl -fsSL https://ollama.ai/install.sh | sh
Windows
# Download the Windows installer from https://ollama.ai/download
# Requires Windows 10/11, 64-bit

Pull and run models

Terminal
# Pull a model (one-time download)
ollama pull llama3.2          # Meta Llama 3.2 (3B, fast, general)
ollama pull qwen2.5-coder     # Alibaba Qwen 2.5 Coder (coding specialist)
ollama pull deepseek-coder-v2 # DeepSeek Coder V2 (strong coding)
ollama pull mistral-nemo      # Mistral NeMo (12B, multilingual)
ollama pull phi4              # Microsoft Phi-4 (14B, reasoning)

# Run interactively
ollama run llama3.2

# List downloaded models
ollama list

# Check if server is running
ollama serve   # runs at http://localhost:11434

Hardware guide — which model fits your machine?

Your RAM / VRAMRecommended modelsQuality level
8 GB RAMllama3.2 (3B), phi3.5, gemma2:2bGood for simple tasks
16 GB RAMllama3.2 (7B), mistral, qwen2.5-coderSolid general-purpose
32 GB RAMllama3.1 (13B), deepseek-coder-v2, phi4Strong, near GPT-3.5 level
64 GB RAM / 24 GB VRAMllama3.3 (70B), qwen2.5 (72B)Near Claude Haiku quality
Mac M1/M2/M3 (unified memory)Use your full RAM — all models aboveExcellent efficiency on Apple Silicon

Access Ollama from Python (same API as Claude)

Python — Ollama has an OpenAI-compatible API
from ollama import Client

client = Client(host='http://localhost:11434')

response = client.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain a star schema in 3 sentences.'}
    ]
)
print(response['message']['content'])

Connect Ollama to Claude Code via MCP

The most powerful setup: Claude Code as your reasoning engine, with Ollama providing cheap local inference for repetitive subtasks. Use the MCP Ollama server:

Install the Ollama MCP server
npm install -g mcp-server-ollama
claude_desktop_config.json — add Ollama MCP
{
  "mcpServers": {
    "ollama": {
      "command": "mcp-server-ollama",
      "args": [],
      "env": {
        "OLLAMA_HOST": "http://localhost:11434"
      }
    }
  }
}

Now in Claude Desktop you can say: "Use the local llama3.2 model to classify all these 10,000 customer support tickets into categories" — Claude orchestrates the task, Ollama does the bulk inference cheaply.

Use Ollama with Claude Code directly

Terminal — add Ollama to Claude Code
claude mcp add ollama mcp-server-ollama
claude mcp list  # verify it appears

Inside a Claude Code session, you can now ask: "Use the local qwen2.5-coder model to generate boilerplate for all 20 of these API endpoints, then I'll review them".

Switching between Claude API and Ollama in your code

Python — smart routing: Claude for complex, Ollama for bulk
import anthropic
from ollama import Client as OllamaClient

claude  = anthropic.Anthropic()
ollama  = OllamaClient(host='http://localhost:11434')

def classify(text: str, use_local: bool = False) -> str:
    '''Route to local or cloud model based on task importance.'''
    if use_local:
        # Cheap, fast, private — good for bulk classification
        r = ollama.chat(
            model='llama3.2',
            messages=[{'role':'user', 'content': f'Classify: {text}'}]
        )
        return r['message']['content']
    else:
        # High quality — for production output or complex reasoning
        r = claude.messages.create(
            model='claude-sonnet-4-5',
            max_tokens=256,
            messages=[{'role':'user', 'content': f'Classify: {text}'}]
        )
        return r.content[0].text

# Bulk: use local
results = [classify(ticket, use_local=True) for ticket in tickets]

# High-stakes: use Claude
final_summary = classify(combined_output, use_local=False)

Recommended local models by use case

Use caseBest local modelWhy
Code generation / reviewqwen2.5-coder, deepseek-coder-v2Trained specifically on code
Text classificationllama3.2, phi3.5Fast, small, accurate
Summarizationmistral-nemoStrong at long-form compression
SQL generationqwen2.5-coderExcellent SQL benchmark scores
Multilingual tasksqwen2.5, mistral-nemoStrong non-English performance
Math / reasoningphi4, deepseek-r1Chain-of-thought reasoning
THE HYBRID WORKFLOW Think of local models and Claude as a team. Local Ollama handles: bulk classification, data cleaning, boilerplate generation, offline work, private data. Claude API handles: architectural decisions, complex reasoning, final output quality, anything customer-facing. The sweet spot is routing automatically based on task complexity — and you can build that routing with a few lines of Python.