Local model runtimes and inference servers

Private runtimes and inference servers for local model serving.

Projects — 4

Local runtimes

Model serving for teams that want local control instead of a hosted inference layer.

This page groups runtimes and inference servers that can expose models on your hardware or private cloud with predictable deployment patterns.

Ollama and LocalAI

Good defaults when you want simple local model serving or an OpenAI-compatible API on private infrastructure.
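
Both expose OpenAI-style endpoints, so the standard OpenAI Python client can talk to them once the base URL points at your own machine. A minimal sketch, assuming Ollama is running on its default port (11434) and a model has already been pulled; the model name below is illustrative, not a value from this page.

```python
# Minimal sketch: point the standard OpenAI Python client at a local
# runtime instead of a hosted API. Assumes Ollama is serving on its
# default port; local runtimes ignore the API key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="llama3",  # hypothetical model name; use whatever you have pulled
    messages=[{"role": "user", "content": "Summarize what a local runtime does."}],
)
print(response.choices[0].message.content)
```

Pointing the same client at a LocalAI instance only requires changing base_url to that server's address, since LocalAI also exposes OpenAI-style routes.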

llama.cpp and vLLM

Use these when GGUF inference, performance tuning, or higher-throughput serving matters more than a friendly wrapper.
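
A minimal GGUF inference sketch using the llama-cpp-python bindings (a separate wrapper around llama.cpp); the model path and generation settings below are assumptions for illustration, not values from this page.

```python
# Minimal sketch: load a local GGUF file and run a single completion,
# assuming the llama-cpp-python bindings are installed and a GGUF file
# exists at the (hypothetical) path below.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-7b-q4_k_m.gguf",  # hypothetical local GGUF file
    n_ctx=2048,       # context window to allocate
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
)

out = llm(
    "Explain in one sentence what GGUF is.",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```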

Why it works

  • Ollama: easiest private runtime

    A simple default when you want local model management and a friendly developer experience.

  • LocalAI and llama.cpp: compatibility and control

    Better when you need OpenAI-style APIs, GGUF support, or tighter control over serving behavior.

  • vLLM: throughput-oriented serving

    Use it when model serving performance matters more than wrapper convenience; a short sketch follows this list.
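
A minimal throughput-oriented sketch using vLLM's offline generate API, assuming a CUDA-capable machine and a model vLLM can fetch from the Hugging Face Hub; the model name and sampling values are illustrative.

```python
# Minimal sketch: batch several prompts through vLLM's offline API.
# The small facebook/opt-125m model is only a smoke-test choice.
from vllm import LLM, SamplingParams

prompts = [
    "What is an inference server?",
    "Name one reason to run models locally.",
]
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling)  # batches all prompts in one pass

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())
```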

Curated repositories

Local model runtimes and servers

4 projects
ollama/ollama
Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.
170k stars | 15.8k forks | Go | MIT | tags: llama, llm, llms

mudler/LocalAI
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
45.8k stars | 4k forks | Go | MIT | tags: llama, ai, llm

ggml-org/llama.cpp
LLM inference in C/C++
106.6k stars | 17.4k forks | C++ | MIT | tags: ggml

vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
78.2k stars | 16.1k forks | Python | Apache-2.0 | tags: gpt, llm, pytorch


FAQ

Questions answered

What is a local model runtime?

It is the layer that loads models, serves inference, and exposes an API or UI for private use on your own hardware or cloud.

Which runtime should I start with?

Start with Ollama; it is the easiest default for most users. Choose LocalAI when OpenAI-compatible endpoints matter, llama.cpp when GGUF-level control matters, and vLLM when throughput matters.