Local model runtimes and inference servers

Private runtimes and inference servers for local model serving.

Projects — 4

Updated regularly

Local runtimes

Model serving for teams that want local control instead of a hosted inference layer.

This page groups runtimes and inference servers that can expose models on your hardware or private cloud with predictable deployment patterns.

Ollama and LocalAI

Good defaults when you want simple local model serving or an OpenAI-compatible API on private infrastructure.

llama.cpp and vLLM

Use these when GGUF inference, performance tuning, or higher-throughput serving matters more than a friendly wrapper.

Why it works

Ollama: easiest private runtime
A simple default when you want local model management and a friendly developer experience.
LocalAI and llama.cpp: compatibility and control
Better when you need OpenAI-style APIs, GGUF support, or tighter control over serving behavior.
vLLM: throughput-oriented serving
Use it when model serving performance matters more than wrapper convenience.

Curated repositories

Local model runtimes and servers

4 projects

ollama/ollama

ollama

170k

Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.

15.8k|Go

MIT

llamallmllms

mudler/LocalAI

mudler

45.8k

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

4k|Go

MIT

llamaaillm

ggml-org/llama.cpp

ggml-org

106.6k

LLM inference in C/C++

17.4k|C++

MIT

ggml

vllm-project/vllm

vllm-project

78.2k

A high-throughput and memory-efficient inference and serving engine for LLMs

16.1k|Python

Apache-2.0

gptllmpytorch

Keep browsing

Self-hosted ChatGPT alternatives

Private assistant apps and team chat portals for people who want a familiar front end around local or private models.

Self-hosted RAG tools

Document search, connectors, and knowledge assistants for private corpora and retrieval-heavy AI products.

Vector databases and retrieval storage

Storage and search layers for embeddings, filtering, persistence, and semantic retrieval at scale.

Agents, workflows, and app builders

Workflow engines, agent systems, and app builders for repeatable internal automation instead of one-off chat.

AI developer tools

Self-hostable coding assistants and repo-aware tools for local or private developer workflows.

Self-hosted AI tools

Browse open source AI tools you can run on your own infrastructure, from local LLM apps to RAG, agents, inference, and production tooling.

FAQ

Questions answered

What is a local model runtime?

It is the layer that loads models, serves inference, and exposes an API or UI for private use on your own hardware or cloud.

Which runtime should I start with?

Ollama is the easiest default for most users. LocalAI is good when OpenAI-compatible endpoints matter. llama.cpp is better when GGUF control matters. vLLM is better when throughput matters.