These are some personal thoughts I’ve had in response to the Notice of Request for Information (RFI) on the Frontiers in AI for Science, Security, and Technology (FASST) Initiative. In the spirit of working with the garage door up, I’m working on these thoughts out in the open, and my intent is to publish an open response on my blog (and maybe even formally submit a response as a concerned citizen) before the November 11 deadline.

Warning

This is a working document and is being updated up until the deadline of November 11. Thoughts and prose will appear and disappear during this time.

This is part of my broader topic on the government’s role in AI. The premise of the RFI includes the following:

Quote

The Department of Energy’s Office of Critical and Emerging Technologies (CET) seeks public comment to inform how DOE and its 17 national laboratories can leverage existing assets to provide a national AI capability for the public interest.

Quote

This RFI seeks public input to inform how DOE can partner with outside institutions and leverage its assets to implement and develop the roadmap for FASST, based on the four pillars of FASST: AI-ready data; Frontier-Scale AI Computing Infrastructure and Platforms; Safe, Secure, and Trustworthy AI Models and Systems; and AI Applications; as well as considerations for workforce and FASST governance.

FASST’s position is that DOE is uniquely equipped to achieve this mission due to its existing “key enabling infrastructure” which includes:

  • Data
  • Computing infrastructure
  • Workforce
  • Partnerships

2. Compute

Quote

(a) How can DOE ensure FASST investments support a competitive hardware ecosystem and maintain American leadership in AI compute, including through DOE’s existing AI and high-performance-computing testbeds?

DOE must first define “American leadership in AI compute” very precisely. To date, American leadership in AI has happened in parallel with the US Exascale efforts; the race to achieve artificial general intelligence (and the AI innovation resulting from it) is being funded exclusively by private industry. For example, NVIDIA’s Tensor Cores made their flagship-scale debut in the Summit supercomputer in 2018, yet the absence of this capability from Summit’s launch press release1 and from its subsequent scientific accomplishments2 paints a clear picture: despite being the flagship supercomputer to feature Volta GPUs, Summit had no bearing on the hardware innovation that produced the now-indispensable Tensor/Matrix Cores found in today’s GPUs.

Directly supporting a competitive hardware ecosystem for AI compute will be a challenge for FASST. Consider that NVIDIA, which holds an overwhelming majority of the AI accelerator market, recently disclosed in a 10-Q filing that almost half of its quarterly revenue came from four customers, each purchasing in volumes that exceed the purchasing power of the ASCR and NNSA programs.3 It follows that the hardware ecosystem is largely shaped by the needs of a few key corporations, and that DOE no longer serves as a market maker with the purchasing power to sustain competition by itself.

Thus, the DOE should acknowledge this reality and align its approach to AI technology with the needs of the AI industry to the fullest extent possible. Areas for alignment include:

  • Computational approaches, such as using the same model architectures, the same approaches to scaling jobs, and the arithmetic logic units that are actually available.
  • Orchestration and management of resources, including adopting existing approaches to security, authentication, and federation.
  • Infrastructural philosophies, such as optimizing holistically across the entire AI technology value chain by codesigning hardware together with power, cooling, data centers, real estate, energy providers, and the global supply chain.
  • Policy approaches that avoid the substantial oversight and lengthy reviews preceding one-time capital acquisitions, which inhibit the agility needed to adapt to the rapidly changing technology that accompanies the breakneck pace of AI innovation.

Quote

(b) How can DOE improve awareness of existing allocation processes for DOE’s AI-capable supercomputers and AI testbeds for smaller companies and newer research teams? How should DOE evaluate compute resource allocation strategies for large-scale foundation-model training and/or other AI use cases?

The DOE’s ERCAP model for allocations is already aligned with the way the private sector matches AI compute consumers with AI compute providers. When an AI startup closes its first funding round, the investment often comes with connections to one or more GPU service providers, since such startups’ success is contingent upon having access to reliable, high-performance computing capabilities.45 Continuing this model through FASST is the most direct way to raise awareness amongst those small businesses and researchers who stand to benefit most from FASST resources.

Evaluating allocation strategies should follow a different model, though. Recognizing that the centroid of AI expertise in the country lies outside of the government research space, FASST allocations should be reviewed by AI experts from outside that space as well. This approach will have several benefits:

  • It reduces the odds of allocated resources being squandered on research projects that, while novel to the scientific research community, have flaws already well known to the AI community.
  • It also keeps DOE-sponsored AI research grounded in the mainstream momentum of AI research, which occurs beyond the ken of federal sponsorship.

DOE should also make the process fast, because AI moves quickly. This may require DOE to accept the higher risk of failure that comes with less oversight and higher research velocity.

Quote

(d) How can DOE continue to support the development of AI hardware, algorithms, and platforms tailored for science and engineering applications in cases where the needs of those applications differ from the needs of commodity AI applications?

To the extent that scientific uses for AI diverge from industry’s uses for AI, the DOE should consider partnering with other like-minded consumers of AI technology with similarly high risk tolerances to create a meaningful market for competition.

Collaborations like the now-defunct APEX and CORAL programs seemed like a step in this direction, and cross-agency efforts such as NAIRR also hold the potential for the government to send a unified signal to industry that there is a market for alternate technologies. If formally aligning FASST with parallel government efforts proves untenable, FASST should do all in its power to avoid contradicting those other efforts and causing destructive interference in the signal the government sends to industry.

The DOE should also be very deliberate to differentiate:

  1. places where science and engineering applications truly diverge from industry AI applications, and
  2. places where science and engineering applications merely prefer conveniences that are not offered by hardware, algorithms, and platforms tailored for industry AI applications.

This distinction is critical because FASST cannot succeed if it moves at the pace of traditional scientific computing. Maintaining support for the multi-decadal legacy of traditional HPC is not a constraint carried by the AI industry, so the outdated, insecure, and inefficient usage modalities that have grown up around HPC resources must not find their way into the requirements of FASST investments.

As a specific example, FP64 is often stated as a hard requirement of science and engineering applications, but investments in algorithmic innovation have shown that lower-precision data types can deliver scientifically meaningful results at very high performance.6 Rather than starting from the position that “FP64 is required,” FASST investments should start from the question, “what will it take to achieve the desired outcomes using BFLOAT16?”

This aligns with the AI industry’s approach to problems, where the latest model architectures and algorithms are never perfectly matched to the latest AI hardware and platforms because each progresses at a different pace. AI model developers accept that their ideas must be made to work on existing or near-future compute platforms, and hard work through innovation is always required to close the gap between ambition and available tools.
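To make the precision point concrete, here is a toy NumPy sketch of the general idea (my own illustration, not the specific algorithm in the cited work): it approximates a float64 matrix product by splitting each operand into a few bfloat16-precision slices and accumulating the partial products in higher precision, which is the flavor of technique that lets low-precision arithmetic units serve workloads that nominally “require” FP64.

```python
import numpy as np

def to_bf16(x):
    """Truncate values to bfloat16 precision (keep the top 16 bits of float32)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32).astype(np.float64)

def sliced_matmul(A, B, slices=3):
    """Approximate a float64 GEMM as a sum of products of bf16-precision slices,
    accumulating in float64 (emulating low-precision inputs, wide accumulation)."""
    a_parts, b_parts = [], []
    ra, rb = A.copy(), B.copy()
    for _ in range(slices):
        a, b = to_bf16(ra), to_bf16(rb)
        a_parts.append(a)
        b_parts.append(b)
        ra, rb = ra - a, rb - b          # carry the residual into the next slice
    C = np.zeros((A.shape[0], B.shape[1]))
    for a in a_parts:
        for b in b_parts:
            C += a @ b                   # each partial product has bf16-representable inputs
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
ref = A @ B
for s in (1, 2, 3):
    err = np.max(np.abs(sliced_matmul(A, B, s) - ref)) / np.max(np.abs(ref))
    print(f"{s} slice(s): max relative error ~ {err:.1e}")
```

Each added slice costs another round of low-precision GEMMs but recovers several more digits of accuracy, which is exactly the trade-off a “start from BFLOAT16” posture forces an application team to quantify.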

Quote

How can DOE partner with other compute capability providers, including both on-premises and cloud solution providers, to support various hardware technologies and provide a portfolio of compute capabilities for its mission areas?

The DOE may choose to continue its current approach to partnership in which, to a first-order approximation, it is a customer who buys goods and services from compute capability providers. The role of those providers is to reliably deliver those goods and services and, as a part of that, to periodically perform non-recurring engineering or codesign to align their products with their customers’ needs.

However, partnership with AI technology providers will bring forth two new challenges: misalignment of mission and mismatch of pace.

Misalignment of mission

The DOE Office of Science’s mission is “to deliver scientific discoveries and major scientific tools to transform our understanding of nature and advance the energy, economic, and national security of the United States.”7 Put more broadly, its mission is to benefit society.

This mission naturally maps to the mission statements of the technology companies that have traditionally partnered with DOE:

  • HPE does not have a clear mission statement, but their goals involve “helping you connect, protect, analyze, and act on all your data and applications wherever they live, from edge to cloud, so you can turn insights into outcomes at the speed required to thrive in today’s complex world.”8
  • IBM also does not have a clear mission statement, but they state they “bring together all the necessary technology and services to help our clients solve their business problems.”9
  • AMD’s mission is to “build great products that accelerate next-generation computing experiences.”10

These technology companies’ missions are to help other companies realize their visions for the world. Partnership comes naturally, as these companies can help advance the mission of the DOE.

However, consider the mission statements of a few prominent AI companies:

  • OpenAI’s mission is “to ensure that artificial general intelligence benefits all of humanity.”
  • Microsoft’s mission is “to empower every person and every organization on the planet to achieve more.”
  • Anthropic’s mission is “to ensure transformative AI helps people and society flourish.”

AI companies’ missions are to benefit society directly, not businesses or customers. The AI industry does not need to partner with the DOE to realize its vision, because its mission is to “do,” not “help those who are doing.”

As such, the DOE and the AI industry are on equal footing in their ambition to directly impact everyday lives. It is not self-evident why the AI industry would want to partner with DOE, so if it is the ambition of the DOE to partner with the AI industry, it is incumbent upon DOE to define its role as the “helper” and not the “doer.” The DOE must answer the question: how will the DOE help the AI industry achieve its mission?

The tempting, cynical answer may be “revenue,” but this would only be true for companies whose mission is to “help” (and sell), not “do.” Consider what Microsoft CEO Satya Nadella said on Microsoft’s Q1 FY2025 earnings call:11

Quote

One of the things that may not be as evident is that we are not actually selling raw GPUs for other people to train. In fact, that’s a business we turn away, because we have so much demand on inference…

The motives of the AI industry are to solve problems through inferencing using world-class models. Selling AI infrastructure is not a motive.

Mismatch of pace

Once the DOE has made the case for how it will help the AI industry achieve its mission,

3. Models

Quote

(b) How can DOE support investment and innovation in energy efficient AI model architectures and deployment, including potentially through prize-based competitions?

Energy efficiency is a red herring when the true concern is carbon emissions through scope 3. DOE should not focus solely on energy efficiency, because efficiency is a geographically constrained subset of the true problem: it says nothing about the carbon intensity of the grid supplying the power, nor about the embodied emissions of the hardware itself.
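As a toy illustration (all numbers below are hypothetical placeholders, not measured data), the sketch shows how a training run that uses 30% less energy can still emit far more CO2 once grid carbon intensity (scope 2) and embodied hardware emissions (scope 3) enter the accounting.

```python
def total_emissions_tco2e(energy_mwh: float,
                          grid_intensity_kg_per_mwh: float,
                          embodied_tco2e: float) -> float:
    """Operational emissions from electricity plus amortized embodied hardware emissions."""
    return energy_mwh * grid_intensity_kg_per_mwh / 1000.0 + embodied_tco2e

# Hypothetical run A: less energy-efficient training, sited on a low-carbon grid.
run_a = total_emissions_tco2e(energy_mwh=1200, grid_intensity_kg_per_mwh=50, embodied_tco2e=100)

# Hypothetical run B: 30% more energy-efficient, sited on a carbon-heavy grid.
run_b = total_emissions_tco2e(energy_mwh=840, grid_intensity_kg_per_mwh=600, embodied_tco2e=100)

print(f"Run A (less efficient, clean grid): {run_a:.0f} tCO2e")   # ~160 tCO2e
print(f"Run B (more efficient, dirty grid): {run_b:.0f} tCO2e")   # ~604 tCO2e
```

An efficiency-only metric would declare run B the winner; a scope-3-aware metric reaches the opposite conclusion.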

4. Applications

Quote

(b) How can DOE ensure foundation AI models are effectively developed to realize breakthrough applications, in partnership with industry, academia, and other agencies?

“Foundation models” are being used disingenuously to justify capex. Realize that frontier models for science are different from foundation models for language, and question whether a science-specific foundation model is truly worthwhile given practitioner comments at SMC and the BloombergGPT example.

5. Workforce

Quote

(a) DOE has an inventory of AI workforce training programs underway through our national labs. What other partnerships or convenings could DOE host or develop to support an AI ready scientific workforce in the United States?

The largest threats to American leadership in AI compute lie in factors that are secondary to the hardware ecosystem and include:

  • Heavy-handed or misguided regulation that hampers the pace of innovation
  • The domestic talent pool being constrained by uneven access to high-quality, affordable education that emphasizes STEM and critical thinking skills across the nation
  • The global talent pool being inaccessible to American companies due to geopolitical tensions and immigration policy

It is not clear that FASST can address any of these threats directly, but it can elevate the level of discourse around AI within DOE and, by extension, enable the federal government to respond meaningfully to these threats.

6. Governance

Quote

(a) How can DOE effectively engage and partner with industry and civil society? What are convenings, organizational structures, and engagement mechanisms that DOE should consider for FASST?

DOE needs to keep pace with the AI industry, which does not encumber itself with months-long FOAs before work can begin. Nor does the industry limit itself to judging progress through traditional peer review in conferences and journals; consider how many foundational AI papers were only ever published on arXiv.

Quote

(b) What role should public-private partnerships play in FASST? What problems or topics should be the focus of these partnerships?

FASST should establish vehicles for public-private partnerships that go beyond the conventional large-scale system procurements and large-scale NRE projects. Both of these vehicles reinforce a customer-supplier relationship where money is exchanged for goods and services, but the new reality is that the AI industry is least constrained by money. FASST must provide incentives that outweigh the opportunity cost that partners face when dedicating resources to DOE and its mission instead of the commercial AI industry, and the promise of low-margin large-scale capital acquisition contracts is simply not sufficient.

To put this in concrete terms, I am frequently faced with a dilemma: is my time better spent writing a response to a government RFI such as this one, or developing insight that will improve the overall reliability of our next flagship training cluster?

Realistically, investing my time in the government does little to move the needle in a meaningful way. It may slightly increase the chances that DOE would award a major system procurement to the AI company I work for, but even then, that business would carry low gross margins due to the extensive requirements and oversight that accompany such procurements. A similarly sized opportunity could just as easily surface with a twenty-person, private-equity-backed AI startup offering higher margins and a more agile approach to partnership.

Investing that same time in the training cluster, by contrast, may reduce the overall training time for the next frontier model by 5%, save millions of dollars in spend, and extend our competitive lead by weeks. The answer is always to choose the latter, so the RFI response gets relegated to a passion project that consumes my nights and weekends. FASST should find ways to make this dilemma much less black and white.

Response guidelines

Quote

Commenters are welcome to comment on any question. RFI responses shall include:

  1. RFI title;
  2. Name(s), phone number(s), and email address(es) for the principal point(s) of contact;
  3. Institution or organization affiliation and postal address; and
  4. Clear indication of the specific question(s) to which you are responding.

Responses to this RFI must be submitted electronically to FASST@hq.doe.gov with the subject line “FASST RFI” no later than 5:00 p.m. (ET) on November 11, 2024. Responses must be provided as attachments to an email. It is recommended that attachments with file sizes exceeding 25 MB be compressed (i.e., zipped) to ensure message delivery. Responses must be provided as a Microsoft Word (.docx) or Adobe Acrobat (.pdf) attachment to the email and should be no more than 15 pages in length, 12-point font, 1-inch margins. Only electronic responses will be accepted. Only one response per individual or organization will be accepted.

A response to this RFI will not be viewed as a binding commitment to develop or pursue the project or ideas discussed. DOE may engage in post-response conversations with interested parties.

Footnotes

  1. ORNL Launches Summit Supercomputer

  2. 2019 OLCF OAR

  3. FORM 10-Q; see also Nearly half of Nvidia’s revenue comes from just four mystery whales each buying $3 billion–plus

  4. The Desperate Hunt for the A.I. Boom’s Most Indispensable Prize

  5. Startups to access high-performance Azure infrastructure, accelerating AI breakthroughs

  6. DGEMM on integer matrix multiplication unit | International Journal of High Performance Computing Applications

  7. Office of Science | Department of Energy

  8. About Hewlett Packard Enterprise: Information and Strategic Vision | HPE

  9. About | IBM

  10. About AMD

  11. Microsoft FY25 First Quarter Earnings Conference Call