This digital garden is an experiment in publishing some of the material I maintain in a private Obsidian vault as a supplement to my personal website.

Who am I?

I work as a system architect who helps design some of the largest supercomputers in the world. I spent seven years doing this in the U.S. Department of Energy complex, where I contributed to the design and evaluation of supercomputers including Perlmutter, Frontier, and El Capitan. I then joined Microsoft, where I now work on supercomputers built for LLM training, such as Eagle.

The long version of my life story can be found on my personal website’s About Me page.

What do I do?

As of October 2024, here are a few pages I’ve been developing that reflect how I’ve been spending my time:

  1. Scaling laws, the simple equations that are making everyone build massive AI training clusters.
  2. Revisiting Reliability in Large-Scale Machine Learning Research Clusters, a paper published by Meta on the long-term reliability of their A100 clusters.

More generally, I also track the state of the art in supercomputing and AI worldwide so I understand how my work fits into the broader direction of the AI industry. A few pages I’ve been working on that reflect this are:

  1. Everything I know about xAI’s Colossus cluster.
  2. My personal thoughts in response to the FASST RFI. This is part of a personal passion project around the government’s role in AI.
  3. Energy and nuclear power, especially as it pertains to sustainability in HPC and the energy demands of AI.

What have I done?

From 2015 to 2023, I specialized in storage for HPC and actively participated in that community. I wrote a number of papers on file system performance analysis, reliability, utilization, and architectural philosophy. I also managed a team of storage engineers for a year and dabbled in the broader area of data management for scientific computing.