Local runtimes
Private runtimes and inference servers for local model serving.
This page groups runtimes and inference servers that expose models on your own hardware or private cloud with predictable deployment patterns.
Ollama and LocalAI
Good defaults when you want simple local model serving or an OpenAI-compatible API on private infrastructure.
llama.cpp and vLLM
Use these when GGUF inference, performance tuning, or higher-throughput serving matters more than a friendly wrapper.
Why it works
Ollama: easiest private runtime
A simple default when you want local model management and a friendly developer experience.
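For a feel of that developer experience, here is a minimal sketch of calling Ollama's local HTTP API from Python. It assumes an Ollama server is already running on its default port (11434) and that llama3.2 is a model you have pulled; both are illustrative assumptions, not requirements.

    # Minimal sketch: one-off generation against a local Ollama server.
    # Assumes the default port (11434) and an already-pulled model.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["response"])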
LocalAI and llama.cpp: compatibility and control
Better when you need OpenAI-style APIs, GGUF support, or tighter control over serving behavior.
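Because LocalAI (and llama.cpp's llama-server) expose OpenAI-compatible endpoints, existing OpenAI client code can usually be repointed at them by swapping the base URL. A sketch using the official openai Python package, assuming LocalAI's default port (8080) and a placeholder model name:

    # Sketch: reusing the openai client against a local OpenAI-compatible
    # server (LocalAI here; llama.cpp's llama-server works the same way).
    # The port and model name are assumptions; match them to your setup.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    completion = client.chat.completions.create(
        model="local-model",  # whatever name your server has configured
        messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    )
    print(completion.choices[0].message.content)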
vLLM: throughput-oriented serving
Use it when model serving performance matters more than wrapper convenience.
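Throughput shows up most clearly in vLLM's offline batch API, where continuous batching schedules many prompts together. A minimal sketch, assuming a Hugging Face model vLLM supports (the model name here is illustrative):

    # Sketch: batched generation with vLLM's offline API.
    # The model name is an assumption; any supported HF model works.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Prompt one", "Prompt two", "Prompt three"], params)
    for out in outputs:
        print(out.outputs[0].text)  # first completion for each prompt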
Curated repositories
ollama/ollama
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
mudler/LocalAI
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
ggml-org/llama.cpp
LLM inference in C/C++
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Related pages
Self-hosted ChatGPT alternatives
Private assistant apps and team chat portals for people who want a familiar front end around local or private models.
Self-hosted RAG tools
Document search, connectors, and knowledge assistants for private corpora and retrieval-heavy AI products.
Vector databases and retrieval storage
Storage and search layers for embeddings, filtering, persistence, and semantic retrieval at scale.
Agents, workflows, and app builders
Workflow engines, agent systems, and app builders for repeatable internal automation instead of one-off chat.
AI developer tools
Self-hostable coding assistants and repo-aware tools for local or private developer workflows.
Self-hosted AI tools
Browse open source AI tools you can run on your own infrastructure, from local LLM apps to RAG, agents, inference, and production tooling.
FAQ
What is a local LLM runtime?
It is the layer that loads models, serves inference, and exposes an API or UI for private use on your own hardware or cloud.
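As a concrete illustration of that layer, the sketch below asks a runtime which models it currently has available to serve. It uses Ollama's local API and assumes the server is on its default port; other runtimes expose equivalent endpoints.

    # Illustration: list the models a local Ollama server can serve.
    # Assumes the default port (11434).
    import requests

    tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
    for model in tags.get("models", []):
        print(model["name"])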
Which runtime should I choose?
Ollama is the easiest default for most users. Choose LocalAI when OpenAI-compatible endpoints matter, llama.cpp when GGUF-level control matters, and vLLM when serving throughput matters.