I’ve spent most of my career in HPC functioning as a system architect. It’s hard to put a finger on exactly what that means, but I tried to describe some of that in a blog post I wrote about my job at Microsoft.
I once spoke to a real architect (who designs buildings) who described his job as gathering requirements from the people who will use a building, understanding external constraints governing the building, and then designing something that will work based on that. That’s surprisingly similar to what I’ve done as a system architect. The big difference is that I design computing infrastructure rather than buildings.
This typically means the job involves three major activities:
- Gathering requirements from HPC users: What do their workloads benefit from the most and least? What’s working well and not working well today? How will their needs change in the future?
- Gathering requirements from workload analysis: Sometimes users don’t understand what they are really doing, or they overstate or understate what they think they will need. Other times, users conflict with each other in ways that cause problems that can be addressed through better system design.
- Understanding technology trends: You ultimately can’t build a supercomputer out of parts that will never exist or are not profitable to produce. And supercomputers need to be installed in buildings with enough power and cooling available, today and in the future.
These three activities translate to different types of day-to-day responsibilities:
- Gathering requirements from HPC users: This means talking to a lot of people and asking them questions about what they’re doing and how they do it. This requires being able to carry on a pretty in-depth conversation about a broad range of topics in both computing and science domains.
- Gathering requirements from workload analysis: This means gathering data from HPC systems, analyzing it, and making actionable recommendations based on those findings. This can lean heavily on data science.
- Understanding technology trends: This means talking to a lot of technology providers and asking them questions to help figure out how their technology might fit into a larger supercomputer design. A lot of this is also keeping vendors honest by asking hard questions, critically analyzing their claims, and benchmarking and evaluating their products.
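To make the workload analysis part a little more concrete, here is a minimal sketch of the kind of question it tries to answer: are users asking for far more memory than their jobs actually use? Everything in it is hypothetical. The `JobRecord` fields, the sample jobs, and the 50% threshold are assumptions made for illustration, not data or tooling from any real system.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class JobRecord:
    """One finished job (hypothetical schema, not a real scheduler format)."""
    user: str
    requested_mem_gb: float  # what the user asked for; assumed to be > 0
    peak_mem_gb: float       # what the job actually used at its peak


def memory_use_by_user(jobs):
    """Return {user: median fraction of requested memory actually used}."""
    ratios = {}
    for job in jobs:
        ratios.setdefault(job.user, []).append(
            job.peak_mem_gb / job.requested_mem_gb
        )
    return {user: median(values) for user, values in ratios.items()}


def flag_overstated(jobs, threshold=0.5):
    """List users whose typical job uses less than `threshold` of its request."""
    usage = memory_use_by_user(jobs)
    return sorted(user for user, ratio in usage.items() if ratio < threshold)


# Made-up records standing in for real accounting data.
jobs = [
    JobRecord("alice", 256.0, 32.0),
    JobRecord("alice", 256.0, 64.0),
    JobRecord("bob", 128.0, 120.0),
    JobRecord("bob", 128.0, 96.0),
]

print(flag_overstated(jobs))
```

In practice the records would come from whatever accounting data the system keeps, and a result like this is where a conversation with the user starts, not a conclusion on its own.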