ollama/ollama
ollama
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
Local runtimes
Use this page when you need to load models, expose inference endpoints, or run local LLMs without depending on a hosted inference provider.
Easy local serving
Ollama and LocalAI are practical starting points when you want simple model management or OpenAI-compatible endpoints.
Performance and control
llama.cpp and vLLM fit cases where model format, serving behavior, or throughput matters more than a friendly wrapper.
Why it works
Ollama: easiest private runtime
A simple default for local model management, quick setup, and a friendly developer experience.
LocalAI and llama.cpp: compatibility and control
Good fits when you need OpenAI-style APIs, GGUF support, or closer control over serving behavior.
vLLM, SGLang, and TensorRT-LLM: throughput-oriented serving
Best suited to deployments where throughput, concurrency, GPU utilization, or production serving controls matter more than wrapper convenience.
Curated repositories
ollama
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
mudler
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
ggml-org
LLM inference in C/C++
vllm-project
A high-throughput and memory-efficient inference and serving engine for LLMs
sgl-project
SGLang is a high-performance serving framework for large language models and multimodal models.
xorbitsai
Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.
mlc-ai
Universal LLM Deployment Engine with ML Compilation
NVIDIA
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.
Selection guide
Pick based on whether you need simple local development, OpenAI-compatible APIs, GGUF control, or high-throughput production serving.
Ollama for simple local use
The easiest starting point for running local models from a CLI, desktop environment, or local API.
LocalAI for OpenAI-compatible private APIs
Useful when apps expect OpenAI-style endpoints and you want a self-hosted backend with multiple model backends.
vLLM for throughput
A better fit for concurrent serving and production workloads than for a minimal local-first setup. SGLang and TensorRT-LLM are also strong candidates for this higher-throughput lane.
Model formats
Model format and hardware support matter as much as the API. CPU-only, Apple Silicon, NVIDIA, AMD, and server GPU deployments point to different tools.
llama.cpp for GGUF and low-level control
A strong fit for CPU, edge, and offline inference where model format and runtime control matter.
Ollama for model management
A simpler abstraction when you want to pull, run, and switch models without managing every serving detail. MLC LLM and llamafile are worth comparing when portability or platform coverage matters more.
Deployment
Local model runners are not all production servers. Compare auth, concurrency, GPU utilization, monitoring, API compatibility, and operational burden before choosing.
Solo developer or homelab
Ollama and llama.cpp are usually easier to start with.
Shared service or app backend
LocalAI, vLLM, SGLang, Xinference, and TensorRT-LLM usually fit better when other apps depend on a stable private inference endpoint.
Suggested additions
llamafile
mozilla-ai/llamafile
Runs and distributes LLMs as single-file executables. Useful for portable local inference and GGUF workflows where packaging matters.
View repositoryGPT4All
nomic-ai/gpt4all
A mature local LLM project for running models on personal devices. It is more app/runtime hybrid than pure inference server, but still fits local-model intent.
View repositoryRelated pages
Self-hosted ChatGPT alternatives
Chat interfaces and assistant apps you can run with local models, private endpoints, or your own hosted providers.
Self-hosted RAG tools
Knowledge-base apps, retrieval frameworks, and document pipelines for private data and production AI systems.
Vector databases and retrieval storage
Databases and search layers for embeddings, metadata filtering, persistence, and semantic retrieval.
Agents, workflows, and app builders
Agent frameworks, workflow engines, and app builders for repeatable AI-powered processes.
AI developer tools
Coding assistants and repo-aware tools that can run locally or inside private development environments.
FAQ
It is the layer that loads models, serves inference, and exposes an API or local interface on your own machine, server, or cloud environment.
Ollama is the easiest default for most users. LocalAI is useful when OpenAI-compatible endpoints matter. llama.cpp is better when GGUF control matters. vLLM is better when throughput matters.
Use vLLM when serving throughput, concurrency, and production inference performance matter more than the simple local model-management experience Ollama provides.
Yes, depending on the model size and performance expectations. llama.cpp and LocalAI can be useful for CPU-oriented or no-GPU setups, while GPU-backed runtimes are usually better for larger or faster workloads.