OpenAI

This page serves as a locus for everything related to OpenAI.

Software stack

OpenAI has disclosed the following about their training software stack:

They have used Ray for training GPT-3.5 and GPT-4.¹ It is unclear whether they have used it for training since then, or whether they use it for inferencing at all.
They have used Kubernetes on their large training clusters.² See HPC vs AI > Kubernetes.
They have used Apache Spark for data preprocessing. This was mentioned in the GPT-3 paper.

For inferencing, their stack appears to include:

Cosmos DB for conversation state (see LLM inferencing > ChatGPT)
Codex’s web UI uses Temporal to store workflow state³

In addition, they have disclosed:

They use a private monorepo for their code. This was mentioned in a video they posted about testing models with data known not to be in the training dataset.
Their observability platform is built on:
- ClickHouse,⁴ Fluent Bit, and Azure Blob Storage⁵
- Datadog (Vector?) agents and the OpenTelemetry Collector, which together process over 9 petabytes of logs daily⁶
- Envoy, to route to storage⁶
They use Redis for generic session and user data caching, although this isn't LLM-specific.⁷
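
To put the 9 petabytes per day figure in perspective, here is a rough back-of-the-envelope conversion to a sustained ingest rate. It assumes decimal units (1 PB = 10^15 bytes) and a perfectly even load across the day, neither of which the source confirms; the function name is made up for illustration.

```python
# Back-of-the-envelope: convert a daily log volume into an average ingest rate.
# Assumptions (not from the source): decimal units and a flat, non-bursty load.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
BYTES_PER_PB = 10**15           # decimal petabyte
BYTES_PER_GB = 10**9            # decimal gigabyte


def sustained_rate_gb_per_s(petabytes_per_day: float) -> float:
    """Average ingest rate in GB/s for a given daily volume in PB."""
    bytes_per_day = petabytes_per_day * BYTES_PER_PB
    return bytes_per_day / SECONDS_PER_DAY / BYTES_PER_GB


rate = sustained_rate_gb_per_s(9)
print(f"{rate:.1f} GB/s")  # prints "104.2 GB/s"
```

Since the disclosed figure is "over 9 petabytes," this is a lower bound on the average, and real log traffic is bursty, so peak rates are presumably higher.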

Training techniques

They used multicluster training for GPT-4.5.

In 2026, Esha Chouke et al. published a paper in collaboration with OpenAI that contained power traces for a typical training loop.⁸ I annotated the figure with my best guess at what was going on:

[Figure: power trace of a training loop from the paper, with my annotations]

Infrastructure

See Microsoft supercomputers and Stargate.

Business

See OpenAI x Microsoft.

Glenn's Digital Garden
