Glenn's Digital Garden
Tag: anecdotes
2 items with this tag.
Nov 04, 2024
LLM training at scale
anecdotes
evergreen
Oct 22, 2024
OPT-175B
anecdotes
model