Ollama Complete Guide | Run LLMs Locally, API, Open WebUI
TL;DR
Ollama lets you run powerful open-source LLMs (Llama 3, Mistral, Gemma, Phi) on your own hardware — no API keys, no usage costs, full privacy. This guide covers everything from first install to production API integration.
What This Guide Covers
Ollama makes running open-source LLMs as simple as ollama run llama3. This guide covers model management, the REST API, integration with Python and Node.js, and running a local ChatGPT-like UI.
Real-world insight: Running Llama 3.1 8B locally with Ollama on an M2 Mac costs $0/month and delivers GPT-3.5-level quality for most coding and writing tasks.
Installation
# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com
Start the server:
ollama serve
# Server runs on http://localhost:11434
On macOS, Ollama runs as a menu bar app automatically.
1. Running Models
# Download and run (interactive chat)
ollama run llama3.2
# Specific model size
ollama run llama3.2:3b # 3 billion params (~2GB)
ollama run llama3.1:8b # 8 billion params (~5GB)
ollama run llama3.1:70b # 70 billion params (~40GB, needs high-end GPU)
# Other popular models
ollama run mistral # Mistral 7B — fast, good for code
ollama run gemma3 # Google Gemma 3
ollama run phi4 # Microsoft Phi-4 — small but capable
ollama run qwen2.5-coder # Best for code generation
ollama run deepseek-r1 # Reasoning model
Once the model downloads, you get an interactive chat. Type /bye to exit.
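The download sizes quoted above follow a rough rule of thumb: a 4-bit quantized model weighs roughly 0.6 GB per billion parameters, plus a little overhead. A minimal sketch of that estimate (the 0.6 factor is an approximation for illustration, not an official figure):

```python
def approx_download_gb(params_billion, gb_per_billion=0.6):
    """Rough size estimate for a 4-bit quantized model: ~0.6 GB per billion params."""
    return round(params_billion * gb_per_billion, 1)

print(approx_download_gb(3))   # ≈ llama3.2:3b's ~2 GB
print(approx_download_gb(8))   # ≈ llama3.1:8b's ~5 GB
print(approx_download_gb(70))  # ≈ llama3.1:70b's ~40 GB
```

Useful for a quick sanity check before pulling a model onto a small disk; the real sizes are shown by ollama list after download.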
2. Model Management
# List downloaded models
ollama list
# Pull a model without running
ollama pull mistral
# Show model info
ollama show llama3.2
# Delete a model
ollama rm llama3.2:3b
# Copy/rename a model
ollama cp llama3.2 my-llama
# Check running models
ollama ps
3. REST API
Ollama exposes a REST API at http://localhost:11434 — its own endpoints under /api, plus an OpenAI-compatible endpoint under /v1:
Generate (completion)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain WebSockets in one paragraph.",
"stream": false
}'
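With "stream": true (the default), /api/generate returns newline-delimited JSON: one chunk per line, each carrying a piece of the answer in its "response" field, with a final object where "done" is true. A minimal parser sketch (the sample chunks below are illustrative, not captured server output):

```python
import json

def collect_stream(ndjson_lines):
    """Concatenate the 'response' fields of streamed NDJSON chunks until done=true."""
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative chunks in the shape /api/generate streams:
sample = [
    '{"model":"llama3.2","response":"Web","done":false}',
    '{"model":"llama3.2","response":"Sockets...","done":false}',
    '{"model":"llama3.2","response":"","done":true}',
]
print(collect_stream(sample))  # WebSockets...
```

In real use you would feed this the lines of the HTTP response body as they arrive, printing each chunk for a typewriter effect instead of buffering.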
Chat (multi-turn conversation)
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a senior software engineer."},
{"role": "user", "content": "What is the difference between REST and GraphQL?"}
],
"stream": false
}'
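Note that /api/chat is stateless: the server does not remember earlier turns, so you carry the conversation yourself by appending each assistant reply and the next user message to the messages array. A sketch of that bookkeeping (send_fn stands in for the actual HTTP call to /api/chat):

```python
def next_turn(messages, user_text, send_fn):
    """Append the user message, call the model, append and return its reply."""
    messages.append({"role": "user", "content": user_text})
    reply = send_fn(messages)  # e.g. POST /api/chat and read message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Stub send_fn so the flow is visible without a running server:
history = [{"role": "system", "content": "You are a senior software engineer."}]
echo = lambda msgs: f"({len(msgs)} messages seen)"
print(next_turn(history, "What is REST?", echo))  # (2 messages seen)
print(len(history))  # 3
```

Because the full history is resent every turn, long conversations eventually hit the model's context window (num_ctx) and older turns must be truncated or summarized.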
OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
This endpoint is drop-in compatible with the OpenAI SDK.
4. Python Integration
With the ollama package
pip install ollama
import ollama
# Simple generation
response = ollama.generate(
model='llama3.2',
prompt='Write a Python function to parse JSON safely.'
)
print(response['response'])
# Chat with history
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'How do I reverse a list in Python?'},
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.generate(model='llama3.2', prompt='Tell me a story', stream=True):
print(chunk['response'], end='', flush=True)
With OpenAI SDK (drop-in replacement)
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # required but ignored
)
response = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
This lets you switch between Ollama and OpenAI by changing only base_url and api_key.
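One way to exploit that compatibility is a small factory that picks the backend from an environment variable, so the rest of your code never knows which it is talking to. A sketch (the LLM_BACKEND variable name is this guide's invention, not a standard):

```python
import os

def client_kwargs(backend=None):
    """Build OpenAI() constructor kwargs for either a local or a cloud backend."""
    backend = backend or os.environ.get("LLM_BACKEND", "ollama")
    if backend == "ollama":
        # Ollama ignores the key, but the SDK requires one.
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    # Cloud OpenAI: default base_url, real key from the environment.
    return {"api_key": os.environ.get("OPENAI_API_KEY", "")}

print(client_kwargs("ollama")["base_url"])  # http://localhost:11434/v1
```

Then client = OpenAI(**client_kwargs()) works identically against either backend; only the model name needs to change.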
5. Node.js Integration
npm install ollama
import ollama from 'ollama';
// Generate
const response = await ollama.generate({
model: 'llama3.2',
prompt: 'Explain async/await in JavaScript.',
stream: false,
});
console.log(response.response);
// Chat
const chat = await ollama.chat({
model: 'llama3.2',
messages: [
{ role: 'user', content: 'Write a TypeScript interface for a User object.' }
],
});
console.log(chat.message.content);
// Streaming
const stream = await ollama.generate({
model: 'llama3.2',
prompt: 'Write a blog post about TypeScript.',
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.response);
}
6. Custom Modelfiles
Create custom models with system prompts and parameters:
# Modelfile
FROM llama3.2
SYSTEM """
You are a senior TypeScript developer. Always provide type-safe code examples.
Respond concisely and include practical examples.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
# Build the custom model
ollama create typescript-expert -f Modelfile
# Run it
ollama run typescript-expert
Common parameters:
- temperature — creativity (0.0–1.0, lower = more deterministic)
- num_ctx — context window size (tokens)
- top_p — nucleus sampling (0.0–1.0)
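If you maintain several custom models, the Modelfile can be generated rather than hand-written. A hypothetical helper (the function name and structure are this guide's invention, not part of Ollama):

```python
def render_modelfile(base, system_prompt, **params):
    """Render a Modelfile string from a base model, system prompt, and PARAMETERs."""
    lines = [f"FROM {base}", 'SYSTEM """', system_prompt.strip(), '"""']
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"

mf = render_modelfile(
    "llama3.2",
    "You are a senior TypeScript developer. Always provide type-safe code examples.",
    temperature=0.3,
    num_ctx=8192,
)
print(mf)
```

Write the result to a file and build it with ollama create my-model -f Modelfile, exactly as in the manual workflow above.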
7. Open WebUI — Local ChatGPT UI
# Run with Docker (easiest)
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 — you get a full ChatGPT-like interface that connects to your local Ollama models.
Features: model switching, conversation history, file uploads, image understanding (with vision models).
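If you would rather run Ollama and Open WebUI together, a docker-compose sketch can wire them up on one network (OLLAMA_BASE_URL is Open WebUI's setting for pointing at an external Ollama instance; verify the details against the Open WebUI docs for your version):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
```

With this layout Open WebUI reaches Ollama over the compose network by service name, so the --add-host workaround from the single-container command is not needed.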
8. LangChain Integration
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOllama(model="llama3.2", temperature=0.3)
messages = [
SystemMessage(content="You are a helpful Python expert."),
HumanMessage(content="Show me how to use dataclasses in Python."),
]
response = llm.invoke(messages)
print(response.content)
# Streaming
for chunk in llm.stream(messages):
print(chunk.content, end="", flush=True)
9. GPU Acceleration
NVIDIA GPU (Linux/Windows)
Ollama auto-detects CUDA if drivers are installed:
# Verify GPU is being used
ollama run llama3.2
# Check the logs printed by ollama serve for GPU detection,
# or run ollama ps — its PROCESSOR column shows whether a model is on GPU or CPU
Apple Silicon (macOS)
Metal GPU acceleration is automatic on M1/M2/M3/M4 Macs — no configuration needed.
Check GPU usage
# macOS
sudo powermetrics --samplers gpu_power -i 1000
# Linux
nvidia-smi dmon -s u
Model Recommendations
| Use case | Model |
|---|---|
| General chat | llama3.2:3b (fast) or llama3.1:8b (smarter) |
| Code generation | qwen2.5-coder:7b or deepseek-coder:6.7b |
| Reasoning tasks | deepseek-r1:8b |
| Vision (images) | llava:7b or llama3.2-vision |
| Embeddings (RAG) | nomic-embed-text or mxbai-embed-large |
| Small / fast | phi4-mini (3.8B) or gemma3:1b |
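As a rough heuristic when choosing a size, pick the largest model whose quantized weights fit comfortably in your RAM or VRAM. A sketch of that rule for the general-chat row (the thresholds are illustrative, not benchmarks):

```python
def pick_chat_model(ram_gb):
    """Suggest a general-chat model size for the available memory (rough heuristic)."""
    if ram_gb >= 64:
        return "llama3.1:70b"   # ~40 GB of weights, high-end GPU / Mac Studio class
    if ram_gb >= 16:
        return "llama3.1:8b"    # ~5 GB of weights, comfortable on most modern machines
    return "llama3.2:3b"        # ~2 GB of weights, fine for laptops

print(pick_chat_model(16))  # llama3.1:8b
```

Leave headroom beyond the raw weight size: the context window's KV cache and the OS itself also need memory.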
Key Takeaways
- Zero cost after initial hardware — no per-token billing
- Full privacy — data never leaves your machine
- OpenAI-compatible API — swap between local and cloud easily
- Custom Modelfiles — bake in system prompts and tune parameters
- Open WebUI — instant ChatGPT-like UI in Docker
Ollama is the fastest way to get a local LLM running. Start with ollama run llama3.2, explore the REST API, then integrate with LangChain or the OpenAI SDK for production use cases.