Ollama Complete Guide | Run LLMs Locally, API, Open WebUI
TL;DR
Ollama lets you run powerful open-source LLMs (Llama 3, Mistral, Gemma, Phi) on your own hardware — no API keys, no usage costs, full privacy. This guide covers everything from first install to production API integration.
What This Guide Covers
Ollama makes running open-source LLMs as simple as ollama run llama3. This guide covers model management, the REST API, integration with Python and Node.js, and running a local ChatGPT-like UI.
Real-world insight: Running Llama 3.1 8B locally with Ollama on an M2 Mac costs $0/month and delivers GPT-3.5-level quality for most coding and writing tasks.
Installation
# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer from ollama.com
Start the server:
ollama serve
# Server runs on http://localhost:11434
On macOS, Ollama runs as a menu bar app automatically.
1. Running Models
# Download and run (interactive chat)
ollama run llama3.2
# Specific model size
ollama run llama3.2:3b # 3 billion params (~2GB)
ollama run llama3.1:8b # 8 billion params (~5GB)
ollama run llama3.1:70b # 70 billion params (~40GB, needs high-end GPU)
# Other popular models
ollama run mistral # Mistral 7B — fast, good for code
ollama run gemma3 # Google Gemma 3
ollama run phi4 # Microsoft Phi-4 — small but capable
ollama run qwen2.5-coder # Best for code generation
ollama run deepseek-r1 # Reasoning model
Once the model downloads, you get an interactive chat. Type /bye to exit.
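The download sizes quoted above follow a rough rule of thumb: a 4-bit quantized model weighs roughly 0.6 GB per billion parameters, plus a little overhead. A minimal sketch of that estimate (the 0.6 factor is an approximation for illustration, not an official figure):

```python
def approx_download_gb(params_billion, gb_per_billion=0.6):
    """Rough size estimate for a 4-bit quantized model: ~0.6 GB per billion params."""
    return round(params_billion * gb_per_billion, 1)

print(approx_download_gb(3))   # ≈ llama3.2:3b's ~2 GB
print(approx_download_gb(8))   # ≈ llama3.1:8b's ~5 GB
print(approx_download_gb(70))  # ≈ llama3.1:70b's ~40 GB
```

Useful for a quick sanity check before pulling a model onto a small disk; the real sizes are shown by ollama list after download.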
2. Model Management
# List downloaded models
ollama list
# Pull a model without running
ollama pull mistral
# Show model info
ollama show llama3.2
# Delete a model
ollama rm llama3.2:3b
# Copy/rename a model
ollama cp llama3.2 my-llama
# Check running models
ollama ps
3. REST API
Ollama exposes a REST API at http://localhost:11434 — its own endpoints under /api, plus an OpenAI-compatible endpoint under /v1:
Generate (completion)
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2",
"prompt": "Explain WebSockets in one paragraph.",
"stream": false
}'
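With "stream": true (the default), /api/generate returns newline-delimited JSON: one chunk per line, each carrying a piece of the answer in its "response" field, with a final object where "done" is true. A minimal parser sketch (the sample chunks below are illustrative, not captured server output):

```python
import json

def collect_stream(ndjson_lines):
    """Concatenate the 'response' fields of streamed NDJSON chunks until done=true."""
    text = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative chunks in the shape /api/generate streams:
sample = [
    '{"model":"llama3.2","response":"Web","done":false}',
    '{"model":"llama3.2","response":"Sockets...","done":false}',
    '{"model":"llama3.2","response":"","done":true}',
]
print(collect_stream(sample))  # WebSockets...
```

In real use you would feed this the lines of the HTTP response body as they arrive, printing each chunk for a typewriter effect instead of buffering.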
Chat (multi-turn conversation)
curl http://localhost:11434/api/chat \
-d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a senior software engineer."},
{"role": "user", "content": "What is the difference between REST and GraphQL?"}
],
"stream": false
}'
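Note that /api/chat is stateless: the server does not remember earlier turns, so you carry the conversation yourself by appending each assistant reply and the next user message to the messages array. A sketch of that bookkeeping (send_fn stands in for the actual HTTP call to /api/chat):

```python
def next_turn(messages, user_text, send_fn):
    """Append the user message, call the model, append and return its reply."""
    messages.append({"role": "user", "content": user_text})
    reply = send_fn(messages)  # e.g. POST /api/chat and read message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Stub send_fn so the flow is visible without a running server:
history = [{"role": "system", "content": "You are a senior software engineer."}]
echo = lambda msgs: f"({len(msgs)} messages seen)"
print(next_turn(history, "What is REST?", echo))  # (2 messages seen)
print(len(history))  # 3
```

Because the full history is resent every turn, long conversations eventually hit the model's context window (num_ctx) and older turns must be truncated or summarized.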
OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}]
}'
This endpoint is drop-in compatible with the OpenAI SDK.
4. Python Integration
With the ollama package
pip install ollama
import ollama
# Simple generation
response = ollama.generate(
model='llama3.2',
prompt='Write a Python function to parse JSON safely.'
)
print(response['response'])
# Chat with history
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'How do I reverse a list in Python?'},
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.generate(model='llama3.2', prompt='Tell me a story', stream=True):
print(chunk['response'], end='', flush=True)
With OpenAI SDK (drop-in replacement)
from openai import OpenAI
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama', # required but ignored
)
response = client.chat.completions.create(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)
This lets you switch between Ollama and OpenAI by changing only base_url and api_key.
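One way to exploit that compatibility is a small factory that picks the backend from an environment variable, so the rest of your code never knows which it is talking to. A sketch (the LLM_BACKEND variable name is this guide's invention, not a standard):

```python
import os

def client_kwargs(backend=None):
    """Build OpenAI() constructor kwargs for either a local or a cloud backend."""
    backend = backend or os.environ.get("LLM_BACKEND", "ollama")
    if backend == "ollama":
        # Ollama ignores the key, but the SDK requires one.
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    # Cloud OpenAI: default base_url, real key from the environment.
    return {"api_key": os.environ.get("OPENAI_API_KEY", "")}

print(client_kwargs("ollama")["base_url"])  # http://localhost:11434/v1
```

Then client = OpenAI(**client_kwargs()) works identically against either backend; only the model name needs to change.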
5. Node.js Integration
npm install ollama
import ollama from 'ollama';
// Generate
const response = await ollama.generate({
model: 'llama3.2',
prompt: 'Explain async/await in JavaScript.',
stream: false,
});
console.log(response.response);
// Chat
const chat = await ollama.chat({
model: 'llama3.2',
messages: [
{ role: 'user', content: 'Write a TypeScript interface for a User object.' }
],
});
console.log(chat.message.content);
// Streaming
const stream = await ollama.generate({
model: 'llama3.2',
prompt: 'Write a blog post about TypeScript.',
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.response);
}
6. Custom Modelfiles
Create custom models with system prompts and parameters:
# Modelfile
FROM llama3.2
SYSTEM """
You are a senior TypeScript developer. Always provide type-safe code examples.
Respond concisely and include practical examples.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
# Build the custom model
ollama create typescript-expert -f Modelfile
# Run it
ollama run typescript-expert
Common parameters:
- temperature — creativity (0.0–1.0, lower = more deterministic)
- num_ctx — context window size (tokens)
- top_p — nucleus sampling (0.0–1.0)
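If you maintain several custom models, the Modelfile can be generated rather than hand-written. A hypothetical helper (the function name and structure are this guide's invention, not part of Ollama):

```python
def render_modelfile(base, system_prompt, **params):
    """Render a Modelfile string from a base model, system prompt, and PARAMETERs."""
    lines = [f"FROM {base}", 'SYSTEM """', system_prompt.strip(), '"""']
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines) + "\n"

mf = render_modelfile(
    "llama3.2",
    "You are a senior TypeScript developer. Always provide type-safe code examples.",
    temperature=0.3,
    num_ctx=8192,
)
print(mf)
```

Write the result to a file and build it with ollama create my-model -f Modelfile, exactly as in the manual workflow above.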
7. Open WebUI — Local ChatGPT UI
# Run with Docker (easiest)
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 — you get a full ChatGPT-like interface that connects to your local Ollama models.
Features: model switching, conversation history, file uploads, image understanding (with vision models).
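If you would rather run Ollama and Open WebUI together, a docker-compose sketch can wire them up on one network (OLLAMA_BASE_URL is Open WebUI's setting for pointing at an external Ollama instance; verify the details against the Open WebUI docs for your version):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
```

With this layout Open WebUI reaches Ollama over the compose network by service name, so the --add-host workaround from the single-container command is not needed.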
8. LangChain Integration
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage
llm = ChatOllama(model="llama3.2", temperature=0.3)
messages = [
SystemMessage(content="You are a helpful Python expert."),
HumanMessage(content="Show me how to use dataclasses in Python."),
]
response = llm.invoke(messages)
print(response.content)
# Streaming
for chunk in llm.stream(messages):
print(chunk.content, end="", flush=True)
9. GPU Acceleration
NVIDIA GPU (Linux/Windows)
Ollama auto-detects CUDA if drivers are installed:
# Verify GPU is being used
ollama run llama3.2
# Check the logs printed by ollama serve for GPU detection,
# or run ollama ps — its PROCESSOR column shows whether a model is on GPU or CPU
Apple Silicon (macOS)
Metal GPU acceleration is automatic on M1/M2/M3/M4 Macs — no configuration needed.
Check GPU usage
# macOS
sudo powermetrics --samplers gpu_power -i 1000
# Linux
nvidia-smi dmon -s u
Model Recommendations
| Use case | Model |
|---|---|
| General chat | llama3.2:3b (fast) or llama3.1:8b (smarter) |
| Code generation | qwen2.5-coder:7b or deepseek-coder:6.7b |
| Reasoning tasks | deepseek-r1:8b |
| Vision (images) | llava:7b or llama3.2-vision |
| Embeddings (RAG) | nomic-embed-text or mxbai-embed-large |
| Small / fast | phi4-mini (3.8B) or gemma3:1b |
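As a rough heuristic when choosing a size, pick the largest model whose quantized weights fit comfortably in your RAM or VRAM. A sketch of that rule for the general-chat row (the thresholds are illustrative, not benchmarks):

```python
def pick_chat_model(ram_gb):
    """Suggest a general-chat model size for the available memory (rough heuristic)."""
    if ram_gb >= 64:
        return "llama3.1:70b"   # ~40 GB of weights, high-end GPU / Mac Studio class
    if ram_gb >= 16:
        return "llama3.1:8b"    # ~5 GB of weights, comfortable on most modern machines
    return "llama3.2:3b"        # ~2 GB of weights, fine for laptops

print(pick_chat_model(16))  # llama3.1:8b
```

Leave headroom beyond the raw weight size: the context window's KV cache and the OS itself also need memory.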
Key Takeaways
- Zero cost after initial hardware — no per-token billing
- Full privacy — data never leaves your machine
- OpenAI-compatible API — swap between local and cloud easily
- Custom Modelfiles — bake in system prompts and tune parameters
- Open WebUI — instant ChatGPT-like UI in Docker
Ollama is the fastest way to get a local LLM running. Start with ollama run llama3.2, explore the REST API, then integrate with LangChain or the OpenAI SDK for production use cases.