Inferencing differs from training in that models are often quantized to reduced precision, lowering the memory and compute required to serve requests.
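A minimal sketch of the idea, using symmetric per-tensor int8 quantization (one of several common schemes; the function names and shapes here are illustrative, not any particular library's API):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes, w.nbytes)                 # int8 storage is 4x smaller than float32
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)  # round-to-nearest error bound
```

In practice quantization is usually done per-channel or per-group rather than per-tensor, and activations may be quantized as well as weights.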

From the DeepSpeed-FastGen paper:

  • prefill or prompt processing
    • input is user-provided text (the prompt)
    • output is a key-value cache for attention
    • compute-bound and scales with the input length
  • decode or token generation
    • input is the key-value cache plus the most recently generated token
    • output is one new token per forward pass
    • memory-bandwidth-bound, since the full key-value cache is read at every step
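The two phases can be sketched with a toy single-head attention layer (hypothetical dimensions and random weights; this illustrates the cache flow, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # hypothetical head dimension
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """One query attends over all cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    p = np.exp(scores - scores.max())
    return (p / p.sum()) @ V

# Prefill: process the whole prompt in one batch (compute-bound, scales with
# prompt length); the artifact we keep is the key-value cache.
prompt = rng.normal(size=(16, d))         # 16 prompt "tokens" as embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one token per step; each step appends a single K/V row and reads
# the entire cache (memory-bandwidth-bound).
x = prompt[-1]
for _ in range(4):
    out = attend(x @ Wq, K_cache, V_cache)
    x = out                               # stand-in for the next token's embedding
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])

print(K_cache.shape)                      # cache grew one row per decoded token: (20, 8)
```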

Efficient Memory Management for Large Language Model Serving with PagedAttention by Kwon et al. describes how GPU memory is consumed during inferencing.
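The dominant dynamic allocation is the key-value cache, whose size follows from the model shape. A back-of-the-envelope estimate, using an OPT-13B-like configuration (40 layers, hidden size 5120, fp16) similar to the paper's analysis:

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   batch: int, bytes_per_param: int = 2) -> int:
    """2 tensors (K and V) x layers x hidden size x tokens x precision."""
    return 2 * num_layers * hidden_size * seq_len * batch * bytes_per_param

# OPT-13B-like shape: 40 layers, hidden size 5120, fp16 (2 bytes)
per_token = kv_cache_bytes(40, 5120, seq_len=1, batch=1)
print(per_token)                                   # 819200 bytes ~= 0.8 MB per token
print(kv_cache_bytes(40, 5120, 2048, 1) / 2**30)   # ~1.6 GiB for one 2048-token sequence
```

Because sequences grow unpredictably, naive contiguous pre-allocation of this cache wastes memory; PagedAttention instead allocates it in fixed-size blocks, analogous to virtual-memory pages.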

Disaggregated inferencing

See disaggregated inferencing.

In practice

Open-source

See inferencing frameworks.

Azure

AI Foundry has an inferencing-as-a-service feature. Not sure how this works as of 2025.

ChatGPT

ChatGPT stores conversations, prompts, and metadata in Azure Cosmos DB.1

ChatGPT is built on Azure Kubernetes Service.1

A little more detail is quoted in a blog post:2

Consider what happens when you chat with ChatGPT: Your prompt and conversation state are stored in an open-source database (Azure Database for PostgreSQL) so the AI can remember context. The model runs in containers across thousands of AKS nodes. Azure Cosmos DB then replicates data in milliseconds to the datacenter closest to the user, ensuring low latency. All of this is powered by open-source technologies under the hood and delivered as cloud services on Azure.

Footnotes

  1. Scott Guthrie’s keynote at Microsoft Build 2025 - Unpacking the tech

  2. https://azure.microsoft.com/en-us/blog/microsofts-open-source-journey-from-20000-lines-of-linux-code-to-ai-at-global-scale/