This digital garden is an experiment in publishing some of the material I maintain in a private Obsidian vault as a supplement to my personal website.

I work as a system architect who helps design some of the largest supercomputers in the world. I spent seven years doing this in the U.S. Department of Energy complex, where I contributed to the design and evaluation of supercomputers including Perlmutter, Frontier, and El Capitan. I then joined Microsoft Azure, where I now work on the design of supercomputers built for LLM training, such as Eagle.

Designing supercomputers is a complex process that requires a broad understanding of where key technologies (CPUs, GPUs, networking, and storage) are headed, how they can be combined to create systems, and how those systems will behave when different workloads run on them.

Most recently, I have been working on understanding and predicting the reliability of supercomputers for LLM training at scale. This means analyzing the failures on production AI supercomputers that result in job interrupts, predicting the failure of future systems using component-level reliability and statistical modeling, and devising software and hardware improvements that can increase the stability of long-running, full-system jobs.

From 2015 to 2023, I specialized in storage for HPC and actively participated in that community. I wrote a bunch of papers on file system performance analysis, reliability, utilization, and architectural philosophy. I also managed a team of storage engineers for a year and dabbled in the broader area of data management for scientific computing.