Local runtimes

Model serving tools for local machines, private servers, and controlled cloud environments.

Use this page when you need to load models, expose inference endpoints, or run local LLMs without depending on a hosted inference provider.

Easy local serving

Ollama and LocalAI are practical starting points when you want simple model management or OpenAI-compatible endpoints.

Performance and control

llama.cpp and vLLM fit cases where model format, serving behavior, or throughput matters more than a friendly wrapper.

Why it works

  • Ollama: easiest private runtime

    A simple default for local model management, quick setup, and a friendly developer experience.

  • LocalAI and llama.cpp: compatibility and control

    Good fits when you need OpenAI-style APIs, GGUF support, or closer control over serving behavior.

  • vLLM, SGLang, and TensorRT-LLM: throughput-oriented serving

    Best suited to deployments where throughput, concurrency, GPU utilization, or production serving controls matter more than wrapper convenience.

Curated repositories

Local model runtimes and servers

8 projects
ollama

ollama/ollama

ollama

170.1k

Get up and running with Kimi-K2.5, GLM-5, MiniMax, DeepSeek, gpt-oss, Qwen, Gemma and other models.

15.8k|Go
MIT
llamallmllms
mudler

mudler/LocalAI

mudler

45.9k

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

4k|Go
MIT
llamaaillm
ggml-org

ggml-org/llama.cpp

ggml-org

106.8k

LLM inference in C/C++

17.4k|C++
MIT
ggml
vllm-project

vllm-project/vllm

vllm-project

78.3k

A high-throughput and memory-efficient inference and serving engine for LLMs

16.1k|Python
Apache-2.0
gptllmpytorch
sgl-project

sgl-project/sglang

sgl-project

26.5k

SGLang is a high-performance serving framework for large language models and multimodal models.

5.6k|Python
Apache-2.0
cudainferencellama
xorbitsai

xorbitsai/inference

xorbitsai

9.3k

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

824|Python
Apache-2.0
ggmlpytorchchatglm
mlc-ai

mlc-ai/mlc-llm

mlc-ai

22.5k

Universal LLM Deployment Engine with ML Compilation

2k|Python
Apache-2.0
llmmachine-learning-compilationlanguage-model
NVIDIA

NVIDIA/TensorRT-LLM

NVIDIA

13.5k

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

2.3k|Python
NOASSERTION
blackwellcudamoe

Selection guide

How to choose a local model runtime

Pick based on whether you need simple local development, OpenAI-compatible APIs, GGUF control, or high-throughput production serving.

  • Ollama for simple local use

    The easiest starting point for running local models from a CLI, desktop environment, or local API.

  • LocalAI for OpenAI-compatible private APIs

    Useful when apps expect OpenAI-style endpoints and you want a self-hosted backend with multiple model backends.

  • vLLM for throughput

    A better fit for concurrent serving and production workloads than for a minimal local-first setup. SGLang and TensorRT-LLM are also strong candidates for this higher-throughput lane.

Model formats

GGUF, Hugging Face models, and hardware fit

Model format and hardware support matter as much as the API. CPU-only, Apple Silicon, NVIDIA, AMD, and server GPU deployments point to different tools.

  • llama.cpp for GGUF and low-level control

    A strong fit for CPU, edge, and offline inference where model format and runtime control matter.

  • Ollama for model management

    A simpler abstraction when you want to pull, run, and switch models without managing every serving detail. MLC LLM and llamafile are worth comparing when portability or platform coverage matters more.

Deployment

Local runtime vs production inference server

Local model runners are not all production servers. Compare auth, concurrency, GPU utilization, monitoring, API compatibility, and operational burden before choosing.

  • Solo developer or homelab

    Ollama and llama.cpp are usually easier to start with.

  • Shared service or app backend

    LocalAI, vLLM, SGLang, Xinference, and TensorRT-LLM usually fit better when other apps depend on a stable private inference endpoint.

Suggested additions

Strong candidates not yet in the registry

llamafile

mozilla-ai/llamafile

8.4/10

Runs and distributes LLMs as single-file executables. Useful for portable local inference and GGUF workflows where packaging matters.

View repository

GPT4All

nomic-ai/gpt4all

8/10

A mature local LLM project for running models on personal devices. It is more app/runtime hybrid than pure inference server, but still fits local-model intent.

View repository

Related pages

Keep browsing

FAQ

Questions answered

What is a local model runtime?

It is the layer that loads models, serves inference, and exposes an API or local interface on your own machine, server, or cloud environment.

Which runtime should I start with?

Ollama is the easiest default for most users. LocalAI is useful when OpenAI-compatible endpoints matter. llama.cpp is better when GGUF control matters. vLLM is better when throughput matters.

When should I use vLLM instead of Ollama?

Use vLLM when serving throughput, concurrency, and production inference performance matter more than the simple local model-management experience Ollama provides.

Can I run local LLMs without a GPU?

Yes, depending on the model size and performance expectations. llama.cpp and LocalAI can be useful for CPU-oriented or no-GPU setups, while GPU-backed runtimes are usually better for larger or faster workloads.