Designing a Multi-Tenant LLM Inference Platform
Why serving LLMs breaks classic API intuitions, and how to design around the physics: KV cache, continuous batching, placement under uncertainty, and fairness.