[Live visualization. Block pool legend: free, cached (evictable), in use, shared (refcount>1), hashed (border). Scheduler stats: tokens / step, prefill / decode, step (ms), prefix cache, free blocks, preemptions.]
A minimal continuous-batching LLM engine: paged KV-cache, automatic prefix caching, chunked prefill, SSE streaming. Built to be read end-to-end. The live visualization shows the scheduler and the memory pool.
git clone https://github.com/surajsharan/tiny_vLLM
cd tiny_vLLM && pip install -r requirements.txt
python -m tiny_vllm.server --model Qwen/Qwen2.5-0.5B-Instruct
# then open http://localhost:8000
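
The block states named in the visualization (free, cached, in use, shared) can be illustrated with a small reference-counted pool. This is only a sketch under stated assumptions, not the tiny_vLLM implementation: `Block`, `BlockPool`, `block_hashes`, `admit`, and `BLOCK_SIZE` are hypothetical names, the chained hash over token IDs is one common way to key prefix blocks, and LRU eviction of cached blocks is an assumed policy.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Optional

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


@dataclass
class Block:
    block_id: int
    refcount: int = 0
    content_hash: Optional[int] = None  # set once the block is full and hashed


class BlockPool:
    """Hypothetical pool; names are illustrative, not taken from tiny_vLLM."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_ids = list(range(num_blocks))  # unused, unhashed blocks
        self.cached = OrderedDict()  # hash -> block_id, refcount 0, LRU order
        self.by_hash = {}            # hash -> block_id for every hashed block

    def state(self, block_id: int) -> str:
        b = self.blocks[block_id]
        if b.refcount > 1:
            return "shared"
        if b.refcount == 1:
            return "in use"
        return "cached" if b.content_hash is not None else "free"

    def lookup(self, content_hash: int) -> Optional[int]:
        """Reuse a hashed block if one exists, taking a reference to it."""
        block_id = self.by_hash.get(content_hash)
        if block_id is None:
            return None
        self.cached.pop(content_hash, None)  # referenced again: not evictable
        self.blocks[block_id].refcount += 1
        return block_id

    def allocate(self) -> int:
        """Take a free block, or evict the least recently cached one."""
        if self.free_ids:
            block_id = self.free_ids.pop()
        elif self.cached:
            old_hash, block_id = self.cached.popitem(last=False)
            del self.by_hash[old_hash]
            self.blocks[block_id].content_hash = None
        else:
            # A real engine would preempt a running request here instead.
            raise MemoryError("no free or evictable blocks")
        self.blocks[block_id].refcount = 1
        return block_id

    def mark_hashed(self, block_id: int, content_hash: int) -> None:
        if content_hash in self.by_hash:
            return  # identical content already registered; leave unhashed
        self.blocks[block_id].content_hash = content_hash
        self.by_hash[content_hash] = block_id

    def release(self, block_id: int) -> None:
        b = self.blocks[block_id]
        b.refcount -= 1
        if b.refcount == 0:
            if b.content_hash is not None:
                self.cached[b.content_hash] = block_id  # keep for reuse
            else:
                self.free_ids.append(block_id)


def block_hashes(tokens: List[int]) -> List[int]:
    """Chained hash per full block, so each hash covers the whole prefix."""
    hashes, parent = [], 0
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        parent = hash((parent, tuple(tokens[i:i + BLOCK_SIZE])))
        hashes.append(parent)
    return hashes


def admit(pool: BlockPool, tokens: List[int]) -> List[int]:
    """Return block IDs for a prompt, reusing the longest cached prefix."""
    ids, hit = [], True
    for h in block_hashes(tokens):
        block_id = pool.lookup(h) if hit else None
        if block_id is None:
            hit = False  # stop matching after the first miss
            block_id = pool.allocate()
            pool.mark_hashed(block_id, h)
        ids.append(block_id)
    if len(tokens) % BLOCK_SIZE:
        ids.append(pool.allocate())  # partial tail block stays unhashed
    return ids


pool = BlockPool(num_blocks=8)
first = admit(pool, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
second = admit(pool, [1, 2, 3, 4, 5, 6, 7, 8, 99])  # shares two full blocks
```

In this sketch, two prompts with the same first eight tokens end up pointing at the same two full blocks, which then report as shared. Once every reference is released, hashed blocks move to the cached set and remain reusable until an allocation evicts them, while unhashed tail blocks go straight back to the free list.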