Salishan 2026 notes

These are raw notes that I took when attending the Salishan HPC Conference in 2026. It’s an invitation-only conference, but I don’t think it’s necessarily a closed-door one since the slides are public.

That said, if you see something attributed to you on this page and you don’t like it, please contact me and I will remove it.

I had neither the time nor permission to turn this into my usual post-conference recap blog post, but I am making this set of notes public for two reasons:

It was a single-track conference of expert speakers that had a wide range of insightful things to share.
It might be interesting for someone to see the raw material that turns into one of my post-conference blog posts. This is what I come home with.

Thomas Sterling - Opening Keynote

history: 1000x increase in the sum of Top500 every 11 years
…shown in ~~two~~ three different graphs

Thomas Sterling’s plenary talk felt like a 101 course meandering through quantum and neuromorphic computing. Both are topics that have been discussed since at least when I entered the field around ten years ago.

Presented his “Active Memory Architecture”

latency determines execution order
own communication protocol because…why? They are RPCs + data payloads

Dan Reed: Selling ideas to technical people is easy - how do you sell these ideas to a person on the street? (or a congressperson)

Session 1

Estela Suarez - JSC/FZ Jülich

Top500 slide on energy efficiency being grossly outpaced by FLOPS

Nice data on how underutilized the JUWELS CPUs and GPUs really are. Most apps do not utilize the full node - so try to downclock?

Nice way to frame to improve energy efficiency:

x axis: observe —> act
y axis: system —> user

“define a monitoring standard system” - do we need another standard? Based around sensors pushing telemetry which can be queried.

Many parts of envisioned system overlap with capabilities of VAST. Lots of streaming telemetry connected to actions and “MLOps” - through agents?

Scheduler: Flux + on-node resource manager (so like k8s? or like PMIx?)

“SEANERGYS” is the name. Started in 2025, through 2029 when it will be deployed in production on a midrange system. Data plane is Kafka + client/server access?

Thesis is that automatically coscheduling jobs and automatically scaling power up and down will optimize the aggregate throughput of the machine. Didn’t address the fact that this will throw any SLA around performance out the window.

Kyle Niemeyer - Oregon State

Reactive fluid simulations. Combustion. One ODE per gridpoint turns into massive system of ODEs.

direct numerical simulation resolves turbulence to the smallest scales (because it is performed in Fourier space). Very expensive to compute.
large eddy simulation is much cheaper to compute, but “sub-grid phenomena” are approximated.

Nice survey of combustion apps worldwide, including Chinese DeepFlame.

SIAMPP 7K node Frontier run of PeleC:

60 billion cells
simplified species (the things that react)
0.2 microseconds took an hour of compute

Scaling estimates: a grand challenge combustion process requires simulating 200 microseconds per wall-month.
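To put the two rates side by side, here is a rough back-of-the-envelope sketch. It assumes "an hour of compute" means one wall-clock hour on the whole 7K-node allocation and that a wall-month is 30 days of uninterrupted running; neither was stated in the talk. It also holds the 60-billion-cell mesh and simplified chemistry fixed, which a real grand challenge problem probably would not.

```python
# Figures quoted in the notes above.
SIM_US_PER_WALL_HOUR = 0.2        # simulated microseconds per hour of compute
TARGET_US_PER_WALL_MONTH = 200.0  # stated grand challenge requirement

# Assumption (not from the talk): a wall-month is 30 days of nonstop running
# and the quoted hour is wall-clock time on the full allocation.
HOURS_PER_WALL_MONTH = 24 * 30

current_us_per_month = SIM_US_PER_WALL_HOUR * HOURS_PER_WALL_MONTH
speedup_needed = TARGET_US_PER_WALL_MONTH / current_us_per_month

print(f"current rate:   {current_us_per_month:.0f} us per wall-month")
print(f"speedup needed: {speedup_needed:.2f}x at fixed problem size")
```

Under those assumptions the gap at fixed problem size is small, so the real cost of the grand challenge presumably comes from larger meshes and fuller chemistry rather than from simulated time alone.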

Claim: DNS is the only way to build effective surrogate models.

“Stiff chemistry” is 80-90% of the workload cost. “Transport” is the remainder. AI may help make the chemistry cheaper.

“PINNs” (physics-informed neural networks) currently suck. Need “manual preconditioning” to be effective.

Building AI surrogate models requires data (“BLASTNet”) and code stewardship.

Brian Ryujin - ARES code

Abstract

Brian gave a great application-centric history of computing at Livermore Computing as they moved from MPPs to GPU-accelerated supercomputers.

“Memory per core” of L is the constraint on how complex the multiphysics can be. L was never used for ARES; those users did their work on CTS machines.

Q was not time-to-solution optimal, but it offered throughput for ensemble workflows. This gave rise to 2 GB/core requirement given to Livermore Computing.

Sierra drove move to RAJA. Required NVIDIA support to do this because the only Fortran support for GPUs was OpenMP, which was immature on V100.

3D was intense enough to use GPUs, but 2D problems went back to CTS. LC’s CTS-2 machine remains almost as good as it gets for 2D multiphysics problems.

El Capitan comes close to “better” except for single-threaded and 2D problems.

Also, performance per node allows users to request O(10-20) nodes instead of O(100-200). But El Cap has 10K nodes. Sounds like ARES users aren't really using El Capitan for capability jobs then?
Also, Bronis' claim that LLNL is an ensemble shop is not aligned with Brian's narrative. Ensembles were a consolation prize to get _some_ utility out of the system, not the workload that drove the system design.

It sounded like the tail wags the dog at LC; scientists think of what they could do with the next ATS when LC tells the application scientists what it is. The mission (e.g., 3D vs. ensemble) was defined by the capabilities of the next system. This is surprising and similar to how LLM training systems are designed (see scaling laws > Scaling law experiments).

David Flynn - Hammerspace

Abstract

Talk started out as a “here’s how smart I am and have been.” It was an unusual marketing/sales pitch, even though Hammerspace wasn’t a sponsor.

Session 1 Panel

Scott Atchley: What happens when the next HPC has less memory than the current one? The answers exposed a general reticence to being smarter (vs. just scaling existing approaches).

Kyle and Dan Ernst both suggested that people buying systems are increasingly disconnected from the needs of the end users. Kyle accused “IT people” of following the bubble of AI.

Panel was an unusual split of two apps people and two infrastructure people.

Session 2

Tony James - RedHat

Argued the case for open software but seemed to really talk about

standard APIs rather than open-source software
auditability - but this deviates from outcome-driven approach to problem solving

Linux Foundation, CNCF, etc prevent any one organization from owning. But what about DAOS? That’s the opposite case, where the foundation was created because nobody wanted to own it when Intel divested its business.

Unconvincing argument around “infrastructure must be open and foundation-led.” Implies that API and implementation are indistinguishable.

I asked a question to clarify, and he agreed that open APIs are more important than open-sourcing everything. Interoperability was what he was really driving at.

Christian Trott - Sandia

Why does software fragmentation happen? Not-invented-here syndrome (sure), but funding too - cannot ask EuroHPC for funding to contribute a feature to a library made at Sandia. And this is worse for codes developed by industry.

| Implementation-first | Standards-first |
| --- | --- |
| PyTorch | C/C++ |
| Kokkos | SYCL |
| CUDA/ROCm | BLAS |

His position is that coming together around an implementation is usually better than an open standard. Compelling.

Nick Malaya

Nick Malaya: HPC is not really as bound by memory bandwidth as people claim. Every GPU gets 3x more memory bandwidth generation-over-generation, so you add more FLOPS. “Many HPC kernels remain compute-bound for FP64.”
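The compute-bound versus bandwidth-bound distinction can be sketched with a simple roofline model. The accelerator numbers and kernel intensities below are made up for illustration and do not describe any real GPU or anything shown in the talk.

```python
def attainable_tflops(peak_tflops, mem_bw_tbs, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_tflops, mem_bw_tbs * flops_per_byte)

def is_compute_bound(peak_tflops, mem_bw_tbs, flops_per_byte):
    """A kernel is compute-bound once its intensity reaches the machine balance."""
    machine_balance = peak_tflops / mem_bw_tbs  # FLOPs per byte moved
    return flops_per_byte >= machine_balance

# Hypothetical accelerator, NOT a real part.
PEAK_FP64_TFLOPS = 80.0
MEM_BW_TBS = 5.0

# Hypothetical arithmetic intensities (FLOPs per byte), for illustration only.
kernels = {"low-intensity stencil": 0.5, "dense matrix multiply": 64.0}

for name, intensity in kernels.items():
    bound = "compute" if is_compute_bound(PEAK_FP64_TFLOPS, MEM_BW_TBS, intensity) else "memory"
    rate = attainable_tflops(PEAK_FP64_TFLOPS, MEM_BW_TBS, intensity)
    print(f"{name}: {rate:.1f} TFLOPS attainable, {bound}-bound")
```

If bandwidth grows faster than FP64 peak from one generation to the next, the machine balance drops and more kernels land on the compute-bound side, which is one way to read the claim.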

Had a good slide showing estimated tokens for common scientific datasets.

MATEY surrogate model anecdote: Nick threw Claude Code at profiling and optimizing an MI355X over two hours after spending twenty minutes developing a prompt.

Mike Heroux

PESO/CASS - appears to be yet another consortium for scientific software. Unclear what its purpose is beyond that. Not funded like HPSF is, but more a brand around ideas and people with common interests?

Seemed like a primitive view of the work to be done to enable agents in scientific discovery. No acknowledgment of security, provenance, auditability, interpretability, or any of the other features required by a production agentic system.

Vibe-coded code quality - does it matter if the code is junk? When was the last time you looked at compiler-generated ASM?

Lots of hand-waving with feel-good statements, but not much depth.

Session 2 Panel

Nobody seemed to argue to open-source everything. Christian, Heroux, and Tony were surprisingly pragmatic. Wish these people made storage purchasing decisions.

Trilinos has closed components (due to national security)
Christian doesn’t care if nvcc is closed as long as someone picks up the phone when he has a problem. cf. Google anecdote, where they patched an OSS release and nobody at Google responded to the pull request.
Tony: APIs are good enough, because they are what ensures interoperability.

Nick Malaya - we have fallen into thinking that software engineering is an end, and we have lost sight of it being a means to something more useful. As a physics student, he realized 90% of his work was not physics.

Salman Habib - us scientists have high standards and insist on the right answer. “We are not morons writing AI slop.”

The OSS panel eventually became an AI software panel after the initial few questions.

Session 3

Nathan DeBardeleben - URSA

“ArtIMis scientific foundation models”

URSA - thinking agents + task agents

Thinking: Hypothesize + critique agent. Planning agent to break problem into tasks.
Tasking: He got a little vague here. Model cards inspired by agent cards. Are they accessed via MCP? Also mentioned “data cards.”

Talk turned into a static demo. Throw a “find a dataset in our data repository, create a surrogate model, and write a report” prompt in, and show thumbnail. Unclear if this is actually valuable to a scientist trying to accomplish work. A scientist would show up with a problem in mind a priori, not just ask AI to cook up problems to be solved.

URSA is a “major component” of Genesis agent platform including LLNL’s “MADA.”

Katie’s question: How do you deal with security? “We have an agent …” - but they haven’t really thought about security yet.

All the talk of agentic seems to be not really related to HPC and more about all the drudgery that is peripheral to HPC or science. It's all about improving productivity, not solving problems. This is no different than the discourse around agentic in any other knowledge work.

Bill Magro - Google

Gemini “AI Co-scientist” - don’t ask it a question; give it an objective.

Gemini does a quick background research and proposes a plan for human review, feedback, and iteration.
Gemini spawns many agents with different seeds, temperatures, etc.
After a time limit is reached, agent hypotheses are paired up and go through an elimination tournament.
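The pairing-and-elimination step can be sketched as a single-elimination bracket. This is only my guess at the general shape, not how Gemini implements it: the `judge` function, the score field, and the seed/temperature values are all hypothetical placeholders.

```python
import random

def judge(a, b):
    # Hypothetical stand-in for a model-based comparison of two hypotheses:
    # here the higher made-up score simply wins.
    return a if a["score"] >= b["score"] else b

def run_tournament(hypotheses, judge_fn, rng):
    """Single elimination: pair up hypotheses, keep each pairing's winner, repeat."""
    pool = list(hypotheses)
    rng.shuffle(pool)
    while len(pool) > 1:
        next_round = []
        # With an odd count, one hypothesis gets a bye into the next round.
        if len(pool) % 2 == 1:
            next_round.append(pool.pop())
        for a, b in zip(pool[0::2], pool[1::2]):
            next_round.append(judge_fn(a, b))
        pool = next_round
    return pool[0]

rng = random.Random(42)
# Each "agent" gets its own seed and temperature; the score is a random
# placeholder for whatever quality signal a real judge would produce.
hypotheses = [
    {"seed": seed, "temperature": 0.2 + 0.1 * seed, "score": rng.random()}
    for seed in range(7)
]
winner = run_tournament(hypotheses, judge, rng)
print(f"winning hypothesis came from seed {winner['seed']}")
```

With a deterministic, transitive judge like this one the best-scoring hypothesis always wins; a model-based judge would be noisier, so the bracket order would start to matter.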

Top500 is 1000x every 11 years. AI training FLOPS needed is 300,000x in 5(?) years.
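To compare the two growth rates on the same footing, this small sketch converts each into a per-year factor and a doubling time. It takes the 5-year figure at face value even though I wasn't sure of it.

```python
import math

def annual_factor(total_growth, years):
    """Constant per-year multiplier that compounds to total_growth over the period."""
    return total_growth ** (1.0 / years)

def doubling_time_months(total_growth, years):
    """Months needed to double, assuming constant exponential growth."""
    return 12.0 * years * math.log(2) / math.log(total_growth)

trends = {
    "Top500 sum": (1_000, 11),            # 1000x every 11 years
    "AI training FLOPS": (300_000, 5),    # 300,000x in 5(?) years
}

for name, (growth, years) in trends.items():
    print(
        f"{name}: {annual_factor(growth, years):.2f}x per year, "
        f"doubles every {doubling_time_months(growth, years):.1f} months"
    )
```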

Demo generated hundreds of pages of slop. What is the gain here? Humans now have to read a ton of cheap ideas?

Amanda Bullock - OpenAI

Abstract

Amanda (former DOD researcher/PhD) gave a talk about the basic shape of the deployment of OpenAI’s models on-prem and behind-the-wire at LANL on its Venado supercomputer for use in classified work.

January 2025 - August 2025 = o3 deployment at LANL. GPT-5.4 is being installed on Venado right now.

Guardrails are removed on LANL’s version of these OpenAI models. Public versions of each model would refuse too many requests.
Sounds like they had to fine-tune models specifically for national security. OpenAI’s partnership with LANL required OpenAI Research, not just Applied.

Model has access to 2,600 GH200 GPUs and 900 Grace CPUs.

"Partnering across national labs" slide is essentially discovering how to work with the DOE like Cray has always done. Shows how weird working with [[doe|DOE]] really is.

"High knowledge identity can lead to subconscious desire for AI to fail" - interesting observation that people with PhDs are more resistant to finding ways to be productive with advanced models.

Claim: fine-tuning isn’t as good as RAG, and RAG isn’t as good as agentic. Glad to hear that I'm not the only one who sees this progression in capability.

Rob Rieben - LLNL

“Cognitive debt” - gap between code generation and understanding what it does.

“Solow productivity paradox” - in late 90s, couldn’t observe productivity benefits of computers in the workplace.

MADA = Multi Agent Design Assistant

Vision: automated inverse design; start with outcome and determine optimal initial conditions to reach that. currently requires ensembles; gradient-free approach (expensive)

“differentiable multiphysics models” would allow you to “train” an optimal solution in the same way you use gradients to find the optimal model weights. avoids ensembles and uses gradient descent learned from AI world.
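A toy version of the gradient-based idea, as a minimal sketch: the forward model here is a textbook projectile range formula standing in for a multiphysics code, and the derivative is written by hand rather than obtained from automatic differentiation. The function names, learning rate, and starting guess are all my own illustrative choices.

```python
import math

G = 9.81                    # gravitational acceleration, m/s^2
ANGLE = math.radians(45.0)  # launch angle held fixed

def outcome(v):
    """Forward model: range in meters of a projectile launched at speed v."""
    return v * v * math.sin(2 * ANGLE) / G

def d_outcome_dv(v):
    """Hand-written derivative of the forward model with respect to v."""
    return 2 * v * math.sin(2 * ANGLE) / G

def inverse_design(target, v0=10.0, lr=0.005, steps=200):
    """Gradient descent on the initial condition v to minimize (outcome - target)^2."""
    v = v0
    for _ in range(steps):
        residual = outcome(v) - target
        grad = 2 * residual * d_outcome_dv(v)  # chain rule through the model
        v -= lr * grad
    return v

v_opt = inverse_design(target=100.0)
print(f"launch speed {v_opt:.3f} m/s gives range {outcome(v_opt):.3f} m")
```

A gradient-free approach would instead run an ensemble of forward simulations over many candidate speeds and pick the best, which is where the expense comes from when each forward run is a full simulation.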

Session 3 Panel

As Claude Code and Codex keep improving to solve for general white collar work, does URSA etc erode in value?

The panel kind of went back into “is AI going to take my job” along many dimensions.

Random access

Torsten Hoefler: MAIA is 30% cheaper $/token than NVIDIA.

Assertion that MAIA is dataflow with matrix, vector, and DMA.

Kwasi Ankomah (SambaNova)

SambaNova -> ClickHouse for agent tracing in a queryable format.

“lessons” retrieved from ClickHouse and injected into each agent prompt

Wonsuk Lee (SK Hynix)

Incoherence is 1.5 across all trained LLMs

Interesting softmax talk: it works for language, but nobody knows why. do we keep optimizing for it?

Session 4

Scott Pakin

Nice explanation of how qubits work:

rotate one qubit in a space with 0-ness and 1-ness
can rotate one qubit based on the rotation of another qubit
converting to classical requires measurement which eliminates probabilistic nature, snapping the qubit into a single state (modulo entanglement effects)
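The rotation and measurement bullets above can be sketched for a single qubit as follows. This is a simplification that assumes real-valued amplitudes and ignores entanglement and the controlled rotation entirely.

```python
import math
import random

def rotate_y(state, theta):
    """Rotate a single-qubit state (a0, a1) by theta about the Y axis."""
    a0, a1 = state
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return (c * a0 - s * a1, s * a0 + c * a1)

def measure(state, rng):
    """Snap to a classical bit: 1 with probability a1^2, otherwise 0."""
    _, a1 = state
    return 1 if rng.random() < a1 * a1 else 0

ZERO = (1.0, 0.0)                    # all 0-ness, no 1-ness
state = rotate_y(ZERO, math.pi / 2)  # equal parts 0-ness and 1-ness
p_one = state[1] ** 2

rng = random.Random(0)
samples = [measure(state, rng) for _ in range(10_000)]
frac_ones = sum(samples) / len(samples)
print(f"P(1) = {p_one:.3f}, observed fraction of 1s = {frac_ones:.3f}")
```

Each call to `measure` returns a definite 0 or 1; the probabilities only show up in the statistics over many repeated preparations.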

he had a 1+1 example in C, CUDA, and QASM - much harder to do in quantum because you have to do the addition in Fourier space and transform in and out

quantum compilers are very complex because they have to do things like circuit synthesis, place and route, etc.

synergy between quantum and HPC:

offload physics to quantum
use HPC for compilation and error correction

Laura Schulz (ANL)

Shotgunned the quantum ecosystem - emphasis on developing everything concurrently, not waiting for qubits alone to catch up.

Robin Blume-Kohout (Sandia)

Abstract

Was a very useful and instructive talk about quantum computing. Very engaging and very detailed in why things in quantum aren’t as they seem.

Quantum just lets you take shortcuts that are leveraging the fact that quantum behavior is different from classical.

Quantum errors are 1/1000. So you can run 1K ops before the first error. But quantum error is like aircraft error: when one happens, the flight is over. The error propagates for the rest of the ops, limiting quantum to around 1000 ops today.

Quantum programs are circuits. Circuits map to qubits and have a defined number of quops. Error rate is a probability per group, so big problems have a smaller chance of error-free completion. Can run multiple times to eventually get an error-free example.
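A quick sketch of why roughly 1,000 ops is the practical limit at a 1/1000 error rate. It assumes every op fails independently with the same probability, which is my simplification and not something Robin claimed.

```python
def p_error_free(num_ops, error_rate=1e-3):
    """Probability that all num_ops complete without a single error."""
    return (1.0 - error_rate) ** num_ops

def expected_runs(num_ops, error_rate=1e-3):
    """Average number of repeated runs needed to see one error-free run."""
    return 1.0 / p_error_free(num_ops, error_rate)

for ops in (100, 1_000, 10_000):
    print(
        f"{ops:>6} ops: P(error-free) = {p_error_free(ops):.2e}, "
        f"about {expected_runs(ops):,.1f} runs per clean result"
    )
```

Under this model the number of retries grows exponentially with circuit size, so rerunning helps near 1,000 ops but stops being practical not far beyond it.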

QEC analogized to DRAM refresh cycle topping each cell’s capacitor up to ~50K electrons; use many physical qubits (around 49) to form one error-corrected qubit (Shor developed this in 1996). But like locking Rapunzel into a tower, decoherence can’t get to it, but neither can the programmer.

RIKEN found theoretical basis for never tightly coupling HPC with quantum computing. Don’t need quantum computers on a high-speed network; “pigeons would be good enough.” Like airplanes versus cars. One doesn’t replace the other; still need to drive to the airport.

Benchmarks: Don’t benchmark a baby by asking it to lift weights. Benchmark capability - benchmark forms a pareto frontier with circuit widths and depths. Robin pitched a new benchmark that measures the largest program a quantum system could run. Sadly, capability in storage doesn't work like this; I/O capability is subservient to the compute capability.

He does boil his pareto to a single scalar using a “cone of utility” defined by two points of equality in the frontier.

Elica Kyoseva (NVIDIA)

Ising model - 35B parameter(!) to “calibrate a QPU”

vision encoder
decoder (unclear on the architecture)

This is a big model for a science workload unless this is just a transformer backbone.

Decoder is tiny - 0.9M parameters or 1.8M parameters. Is this the “pre-decoder” that she talked about?

Pitched “NVQLink” as low-latency connection between GPU and QPU, specifically for quantum error correction.

Session 4 Panel

How does this jibe with the RIKEN finding that tightly coupling is useless?

need high bandwidth/low latency for quantum error correction
low bandwidth needed for the actual quantum computation

100 ns/qubit $\times$ 100M physical qubits = TB/s bandwidth

Robin: 1M physical qubits
1 microsecond per qubit
1M bits / $10^{-6}$ seconds = 1 Tbit/sec
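Both bandwidth estimates from the panel follow the same arithmetic, sketched here. It assumes each physical qubit yields one classical bit per readout interval, which is my reading of the numbers rather than something stated outright.

```python
def readout_bits_per_second(physical_qubits, seconds_per_readout):
    """Classical data rate if every physical qubit yields one bit per readout."""
    return physical_qubits / seconds_per_readout

# Robin's figures: 1M physical qubits, 1 microsecond per qubit.
robin_bps = readout_bits_per_second(1e6, 1e-6)

# The larger figures: 100M physical qubits, 100 ns per qubit.
large_bps = readout_bits_per_second(100e6, 100e-9)

print(f"1M qubits @ 1 us:     {robin_bps / 1e12:.0f} Tbit/s")
print(f"100M qubits @ 100 ns: {large_bps / 8 / 1e12:.0f} TB/s")
```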

Robin: but people want to hear “tight coupling between HPC + QC” so people find ways to do this. The reality is:

QC generates 1,000 bits per hour/day/week
Next 1000 bits needs another week
Right now coupling is “someone gets a result from QC, publishes a paper, then someone else implements a classical version of that outcome” - coupling through Nature papers, not software!

“FeMo-cofactor” is a problem that was claimed to be only tractable for QC, but Garnet Chan proved it’s solvable with classical. Goalpost of need for quantum keeps getting pushed back as clever classical chemists/computationalists work on it.

Session 5 Megapanel

Thuc Hoang (NNSA)

Top risks:

workforce development/talent pipeline
short horizon - can’t pull things in too much, but also can’t push things out
unstable budget - decrease year over year is bad, but so is increase. Getting +$500M in one year causes nonlinear bureaucratic overhead!
decreased buying power (thanks hyperscale)
conservative user community

Katie Antypas (NSF)

NSF was priced out of leadership HPC before DOE, so that forced NSF to innovate in other dimensions. Call: engage with public and do as much listening as talking.

Michael Krajecki (GENCI)

Risks:

Unsustainable energy trajectory
Talent pipeline
Code fragmentation due to different accelerators
Supply chain - US company dependencies in EU

Andrew Jones (Microsoft)

Risks: Fear, other human elements

David Keyes (KAUST)

Risks:

mix of human and technical
analytical/theoretical capabilities (of people)
nature physics letters asked David and Jack to write about the physics-vs-CS paradigm after realizing Gordon Bell prizes are overwhelmingly being won by physicists.

Discussion

Horst Simon agreed with Ian Karlin that HPL is not relevant. Also supportive of Ozaki scheme. Jack Dongarra is adamant that Ozaki “changes the algorithm,” which is why it is not allowed on Top500.

David Keyes argued that peak FLOPS in Gordon Bell winners also correlates with Top1 HPL score, with a few dips in capability year-over-year.

Robin: Physicists drive HPC. Is it a bad thing that HPC’s biggest entanglement is with science? Shouldn’t science serve society, not computing?

Katie Antypas: Not exciting to just differentiate by being cheaper than industry. We should take bigger risks to differentiate; this might mean shedding some workloads to private industry. Model of open science HPC facilities should change to be more R&D-focused.

Andrew Jones: Leadership computing facilities can learn a lot from midrange computing facilities now who have always had to innovate in a constrained environment.

Christian Trott: Sandia had to deal with this and focused on testbeds and Kokkos despite not having a leadership machine.

Jay Boisseau: What about “rules-based methods” and “active inference” instead of just “data-based methods” like LLMs?

“Rules-based methods” - fall within the scope of symbolic AI
“Active inference” - see Active inference and epistemic value - PubMed
“Data-based methods” - traditional LLMs

Salman Habib wants to go back to the days of a Lab datacenter having multiple exotic machines on the floor. If DE Shaw can build Anton, why can’t DOE?

Katie loves Vanguard program for this.
Thuc does too, but there’s ten years of lead time before product can be realized at scale.
NERSC consolidated its testbeds into one big system - but now hyperscale has the biggest systems.
Andrew Jones says this community must stop defining itself by the size of its machine.

DOE can’t receive anything for free. But DOE cannot give anything for free, either. DOE M&O model complicates this.

Audience agreed that CRADAs suck:

take forever to get legal signoff
requires outcomes to be public, not IP-locked
Hal says there is a new “lightweight CRADA” that is only two pages; shouldn’t take 12 months to ink a 3-month partnership

Question: What is everyone wrong about?

Andrew Jones: assuming geopolitics, people, finance, etc will be the same in ten years.
David: Whole hemisphere is underestimated - and possibly more ambitious. Going to work in the USA was their biggest ambition, but they wouldn’t have a shot without giving them exposure at, e.g., SIAM.
Katie: Youth have critical thinking skills despite AI; they will build on these new tools.
Michael: What will be a university in a few decades?
