
Ollama: The 'Docker for LLMs' You Need to Know About

Run Llama 3, Gemma, and other LLMs locally with a single command. A technical guide to Ollama's features, Modelfiles, and API.

Running a large language model locally used to be a headache. You had to deal with Python dependencies, PyTorch versions, quantization formats, and obscure C++ compilation flags. If you wanted to switch from Llama 2 to Mistral, you often had to start over.

Then Ollama showed up and changed the game.

If you've used Docker, you already understand Ollama. Just as Docker standardized how we package and run applications, Ollama standardizes how we package and run LLMs. It abstracts away the messy details of hardware acceleration and model weights, leaving you with a clean, simple command line interface.

Here is why it has become the default tool for local AI development.

The "Docker for LLMs" Analogy

The comparison isn't just marketing fluff; it maps directly onto how Ollama is architected.

In the Docker world, you have a Dockerfile that defines your environment. You build an image from it, and then you run that image as a container. Ollama works exactly the same way:

  • Modelfile: This is your Dockerfile. It defines the base model (e.g., FROM llama3), sets parameters like temperature, and includes your system prompt.
  • Images: When you run ollama pull llama3, you are downloading an image that contains the model weights and configuration.
  • Instances: When you run ollama run llama3, you are spinning up an instance of that model, ready to accept queries.

This standardization means you can share a Modelfile with a colleague, and they can run the exact same custom model on their machine, regardless of whether they are on macOS, Linux, or Windows.
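To make the parallel concrete, here is a minimal Python sketch that drives the same lifecycle through Ollama's REST API. It assumes the server is running on its default port, uses the requests library for brevity, and the request payload keys follow the current API docs, so adjust them if your version differs.

import requests

BASE = "http://localhost:11434"

# "docker pull" equivalent: download the llama3 image (weights + config).
# stream=False tells the server to reply once the download has finished.
requests.post(f"{BASE}/api/pull", json={"model": "llama3", "stream": False}, timeout=None)

# "docker images" equivalent: list the models stored locally and their size.
for m in requests.get(f"{BASE}/api/tags", timeout=10).json()["models"]:
    print(m["name"], f'{m["size"] / 1e9:.1f} GB')

# "docker run" equivalent: spin up an instance by sending it a prompt.
r = requests.post(
    f"{BASE}/api/generate",
    json={"model": "llama3", "prompt": "Say hello.", "stream": False},
    timeout=300,
)
print(r.json()["response"])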

Key Features That Matter

Hardware Acceleration "Just Works"

This is the big one. Getting CUDA (NVIDIA) or Metal (Apple Silicon) working with raw PyTorch scripts can be a nightmare of version mismatches. Ollama detects your hardware on startup and automatically selects the best acceleration library.

If you are on a Mac with an M1/M2/M3 chip, it uses Apple's Metal API natively. If you are on a Linux box with an NVIDIA GPU, it spins up the CUDA backend. It even supports AMD GPUs via ROCm. You don't configure anything; it just runs fast.
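If you want to check where a loaded model actually landed, you can query the server's /api/ps endpoint, which reports the same information the ollama ps command prints. This is a rough sketch; the field names (size, size_vram) are taken from the current API docs and should be treated as assumptions if your version reports them differently.

import requests

# Ask the local server which models are currently loaded and where they live.
ps = requests.get("http://localhost:11434/api/ps", timeout=5).json()
for m in ps.get("models", []):
    total, vram = m.get("size", 0), m.get("size_vram", 0)
    if vram and vram >= total:
        placement = "fully on GPU"
    elif vram:
        placement = "split between GPU and CPU"
    else:
        placement = "CPU only"
    print(f'{m["name"]}: {placement} ({vram / 1e9:.1f} GB in VRAM)')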

Modelfiles

The Modelfile is powerful because it lets you bake prompt engineering into the model itself. Instead of pasting a massive "You are a coding assistant..." prompt every time you start a chat, you save it into a custom model.

FROM llama3

# Set the temperature to be low for coding
PARAMETER temperature 0.1

# Set the system message
SYSTEM """
You are a senior Python engineer. You answer concisely and favor modern 3.12+ syntax.
"""

You build this with ollama create my-coder -f Modelfile, and suddenly you have a dedicated coding assistant available via ollama run my-coder.
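Because the system prompt and temperature are baked into my-coder, any client talking to the Ollama server gets them for free. Here is a minimal sketch using the /api/chat endpoint, assuming the model was created as above and the server is on its default port:

import requests

# The Modelfile's SYSTEM prompt and temperature are applied automatically,
# so the request only needs the user message.
r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "my-coder",
        "messages": [{"role": "user", "content": "Write a function that deduplicates a list."}],
        "stream": False,
    },
    timeout=300,
)
print(r.json()["message"]["content"])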

The New ollama launch

Recent updates have introduced ollama launch. While ollama run is for interactive chat, ollama launch is designed to spin up models specifically for integrations and agents.

For example, you can use ollama launch claude or ollama launch openclaw (if you are using those tools) to automatically download the necessary dependencies and start the model with the correct context window settings for that specific agentic workflow. It handles the "infrastructure" side of connecting a model to an external application.

For Developers: Why You Should Care

The API

Ollama isn't just a CLI tool; it's a server. By default, it runs a REST API on port 11434. This is huge for developers. You can spin up Ollama in the background and write a simple Python or Node.js script to query it.
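For example, a bare-bones completion call looks like this, assuming the server is already running and llama3 has been pulled (requests is used just for brevity):

import requests

# One-shot completion against the local server; stream=False returns a
# single JSON object instead of a stream of chunks.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain what a Modelfile is in one sentence.", "stream": False},
    timeout=120,
)
r.raise_for_status()
print(r.json()["response"])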

Ollama also exposes an OpenAI-compatible endpoint out of the box. This means you can often take an app designed for GPT-4, point its base_url at http://localhost:11434/v1, and run it against a local Llama 3 model without touching the rest of the code.
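A sketch using the official openai Python package pointed at the local server. Ollama ignores the API key, but the client requires one, so any placeholder string works:

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Give me one reason to run models locally."}],
)
print(resp.choices[0].message.content)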

Privacy and Security

We all know the risk of sending sensitive code or data to a cloud API. With Ollama, the data never leaves your machine. This makes it viable for analyzing PII (Personally Identifiable Information), proprietary codebases, or internal documentation that you are strictly forbidden from uploading to ChatGPT.

Cost

Local inference is free (minus electricity). If you are building a feature that requires processing thousands of documents or running automated tests 24/7, the API costs from cloud providers stack up fast. A local 8B-parameter model running on consumer hardware can chew through text at impressive speeds for zero marginal cost.
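As a rough illustration, here is what a zero-marginal-cost batch job might look like, assuming a hypothetical docs/ folder of text files and the same /api/generate endpoint used above:

import pathlib
import requests

def summarize(text: str, model: str = "llama3") -> str:
    """One local completion; the only marginal cost is electricity."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": f"Summarize in two sentences:\n\n{text}", "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

# Grind through an entire folder overnight without watching an API bill.
for path in sorted(pathlib.Path("docs").glob("*.txt")):
    print(path.name, "->", summarize(path.read_text())[:100])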

Getting Hands-On

If you haven't installed it yet, just grab it from the official site. Once installed, here are the commands you'll use 90% of the time:

1. Run a model
This downloads the model if you don't already have it (the bare llama3 tag defaults to the 8B variant) and drops you into an interactive chat.

ollama run llama3

2. List your models
See what images you have locally and how much disk space they are taking.

ollama list

3. Run a Google Gemma model
Ollama supports almost all major open-weights models, including Google's Gemma series.

ollama run gemma:7b

4. Check the server status
Since Ollama runs as a background service, you can check if it's listening with curl.

curl http://localhost:11434/api/tags

Conclusion

Ollama has done for local AI what VS Code did for code editing—it took something complex and made it accessible without sacrificing power. Whether you are a privacy-conscious developer, a hobbyist experimenting with the latest Llama 3 release, or an engineer building a local RAG pipeline, it is the most robust starting point available today.

The ecosystem is moving fast. With features like ollama launch and expanding support for multimodal models, it's clear that the future of AI isn't just in the cloud—it's running right there on your laptop.