VAST Data is a company that develops the “VAST AI Operating System.” Although the name is vague and marketing-y, this “product” reflects both where VAST came from and where its ambitions are headed.

| Year | Product | Product position | Features | Target workloads |
| --- | --- | --- | --- | --- |
| 2019 | Universal Storage | storage | file and object storage | Enterprise storage, HPC |
| 2023 | AI Data Platform | storage + database | + tables | + analytics |
| 2025 | AI Operating System | storage + database + microservice infrastructure | + events and services runtime | + AI agents |

Well-trained VAST employees will say that VAST is no longer a storage company, but rather a company that does storage.

The VAST AI Operating System is a collection of software infrastructure that includes:

  1. VAST DataStore (formerly VAST Universal Storage) is an all-flash storage system that exposes NFS, S3, and SMB endpoints to store and retrieve data.
  2. VAST DataBase is a tabular data store. This is what defined the VAST AI Data Platform.
  3. VAST DataEngine is a serverless execution environment that can act on events triggered by other parts of the Data Platform. This is what defined the VAST AI Operating System.

In addition, there are some features that overlay or underlay these data-centric dimensions to what VAST is selling:

  • VAST DataSpace is the machinery that enables federation of VAST clusters across multiple sites and provides a single namespace across all of them.
  • VAST Element Store is the low-level data store underlying all of this. It is a key-value store built on a B-tree-like data structure.

VAST has also started developing specific applications on top of DataEngine to address specific enterprise needs:

  • VAST InsightEngine automatically indexes all data written into a VAST system, inserting the resulting embeddings into VAST’s integrated vector database.
  • VAST AgentEngine is an inferencing framework that is implemented atop DataEngine. As of July 2025, it is under development.

Core components

DataStore

I wrote a description about some of the unique aspects of the VAST DataStore architecture here: VAST Data’s storage system architecture (glennklockwood.com). In brief,

  1. NFS clients: It allows standard NFS clients to connect for modest performance, but it also supports enhanced NFS clients (NFSoRDMA and NFS multipathing) that allow a single client to issue reads and writes to multiple VAST frontend servers in parallel.1
  2. Novel write path: It lands all writes to storage-class memory (formerly Optane, but now Kioxia FL6) to preserve low latency at the cost of write bandwidth. Data accumulates in this write layer over time, building massive stripes that contain similar data (using a locality-sensitive hash on incoming data). When a write stripe is full, it is compressed and flushed to low-durability, cost-effective QLC flash.2
  3. Novel data protection: VAST uses extremely wide erasure stripes of up to 146+4 to protect data with very high efficiency. It uses locally decodable codes instead of standard Reed-Solomon codes so that a rebuild does not have to read from all ~150 surviving drives; instead, only a subset of the surviving drives is needed.3

Items 2 and 3 above allow VAST to get away with using all low-endurance QLC flash at an effective price per gigabyte that is competitive with hybrid disk+flash storage systems. VAST is neither the fastest nor the cheapest, but it does both reasonably well.
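The write path in item 2 can be sketched as a toy model. Everything here (the stripe size, the "similarity hash," the flush policy) is a made-up illustration of the idea, not VAST's implementation:

```python
import zlib
from collections import defaultdict

STRIPE_SIZE = 4  # toy stripe: flush once 4 chunks accumulate in one bucket

def similarity_hash(chunk: bytes) -> int:
    # Toy stand-in for a locality-sensitive hash: bucket chunks by their
    # most common byte so that similar data lands in the same stripe.
    return max(set(chunk), key=chunk.count)

class WriteBuffer:
    """Toy model: land writes in SCM, then flush compressed stripes to QLC."""
    def __init__(self):
        self.scm = defaultdict(list)   # storage-class memory write buffer
        self.qlc = []                  # flushed, compressed stripes

    def write(self, chunk: bytes):
        bucket = similarity_hash(chunk)
        self.scm[bucket].append(chunk)  # low-latency acknowledgment happens here
        if len(self.scm[bucket]) >= STRIPE_SIZE:
            stripe = b"".join(self.scm.pop(bucket))
            # Similar data grouped into one stripe compresses well together:
            self.qlc.append(zlib.compress(stripe))

buf = WriteBuffer()
for _ in range(4):
    buf.write(b"a" * 64)          # four similar chunks fill one stripe
print(len(buf.qlc))               # -> 1 (one stripe flushed to "QLC")
print(len(buf.qlc[0]) < 256)      # -> True (well under the 256 raw bytes)
```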

DataBase

VAST DataBase is a column store that supports row-wise operations, deriving its performance from the all-flash nature of the underlying Element Store. Data can be accessed via

  1. pyarrow4
  2. SQL via Trino, either through an embedded Trino server that runs on CNodes5 or through the VAST Trino connector running on your own Trino cluster.6
  3. Spark, via the VAST Connector for Spark6

The DataBase connectors (Trino, Spark) offer predicate pushdowns to make some operations (filtering, column projection, etc) run in the storage system, preventing your analytics job from having to read an entire Parquet file to perform these operations.
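Predicate pushdown is easiest to see with a toy scan API. The `scan` function below is a hypothetical stand-in for the storage-side evaluation the real connectors perform; the point is that non-matching rows and unrequested columns never cross the wire:

```python
# Toy illustration of predicate pushdown: the filter and column projection
# run "inside the storage system" (here, the scan function), so the client
# receives only matching rows with only the requested columns.
TABLE = [
    {"id": 1, "region": "us", "bytes": 100},
    {"id": 2, "region": "eu", "bytes": 250},
    {"id": 3, "region": "us", "bytes": 175},
]

def scan(columns, predicate):
    """Storage-side scan: apply the predicate, then project the columns."""
    return [{c: row[c] for c in columns} for row in TABLE if predicate(row)]

# The client ships the predicate and column list down with the query...
result = scan(columns=["id", "bytes"], predicate=lambda r: r["region"] == "us")
print(result)  # only rows 1 and 3, and without the "region" column
```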

Queries and interactions with data are managed through pyarrow and SQL (as implemented by Spark and Trino), both of which implement predicate pushdowns via the VAST DB Python SDK.7

The Python SDK also provides APIs for managing “buckets,” which are a quirky part of the way VAST DataBase manages collections of schemata and tables. These buckets are expressed via the VAST S3 Bucket view to enable snapshotting, and individual tables appear as Parquet objects via the S3 interface. DataBase “buckets” are analogous to DAOS pools or Azure Storage Accounts.

DataBase can also expose the VAST Event Broker via a Kafka interface.

DataEngine

VAST DataEngine is a serverless execution environment that is accessible via a Python SDK. Behind the scenes, it is a container execution framework that orchestrates execution across federated VAST clusters and avoids unnecessary data movement as functions execute.

As of July 2025, it supports running Apache Spark master and worker nodes on CNodes as a managed application.8

The VAST Event Broker, which is a Kafka-compatible event broker, is rolled into the DataEngine branding.

For now, the best public description is about halfway down the official DataEngine landing page.
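A serverless, event-triggered execution model of this shape can be sketched in a few lines. This is a conceptual toy, not the DataEngine API; the decorator and event names are invented for illustration:

```python
from collections import defaultdict

handlers = defaultdict(list)  # event type -> registered functions

def on(event_type):
    """Decorator registering a function to run when a matching event fires."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def emit(event_type, payload):
    """Dispatch an event to every registered handler, collecting results."""
    return [fn(payload) for fn in handlers[event_type]]

@on("object_created")   # hypothetical event name, for illustration only
def index_object(payload):
    return f"indexed {payload['key']}"

print(emit("object_created", {"key": "s3://bucket/a.parquet"}))
# -> ['indexed s3://bucket/a.parquet']
```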

Element Store

VAST’s Element Store is the key-value store that underlies the higher-level interfaces (DataBase, DataStore, etc). A few salient features:

  • It is based on a customized B-tree which VAST calls the “V-tree.” I don’t actually know how this differs from B-trees.
  • It supports locally decodable erasure codes, allowing it to go up to 146+4 parity and supports fail-in-place. The erasure coding algorithm is patented.9 VAST always uses +4 parity blocks per stripe.
  • Each element has a flavor, similar to DAOS object classes.10 However, each element maps directly to a file, table, etc.
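The storage-efficiency advantage of such wide stripes is simple arithmetic. Comparing a 146+4 stripe to a more conventional 8+2 layout:

```python
def efficiency(data, parity):
    """Fraction of raw capacity available for user data in a data+parity stripe."""
    return data / (data + parity)

print(round(efficiency(146, 4), 3))  # -> 0.973 (~2.7% parity overhead)
print(round(efficiency(8, 2), 3))    # -> 0.8   (20% parity overhead)
```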

Additional capabilities

DataSpace

VAST DataSpace is the set of functionality that allows multiple VAST clusters to operate as a single namespace. It handles all of the replication and synchronization of data across clusters.

InsightEngine

VAST InsightEngine is a framework, developed in partnership with NVIDIA, that implements automatic vectorization as an integral part of a VAST system. It does the following:

  1. Data is written into VAST DataStore through any of its interfaces (file, object, …)
  2. An event is generated, which triggers a function in VAST DataEngine
  3. The function generates embeddings for the data that was just ingested using NIM
  4. The embeddings are inserted into VAST’s integrated vector database

Once indexed (which is near real-time), an inferencing application can perform RAG against the data.
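The four steps above amount to an event-driven embedding pipeline. The sketch below is a self-contained toy: the hash-derived "embedding" stands in for a NIM model, and a plain list stands in for the vector database:

```python
import hashlib
import math

vector_db = []  # toy stand-in for VAST's integrated vector database

def embed(text: str) -> list:
    # Toy stand-in for a NIM embedding model: derive a small unit vector
    # deterministically from the text's hash.
    digest = hashlib.sha256(text.encode()).digest()
    v = [b / 255 for b in digest[:8]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def on_write_event(key: str, data: str):
    # Steps 2-4: the write event triggers a function that embeds the new
    # object and inserts the embedding into the vector index.
    vector_db.append((key, embed(data)))

def rag_search(query: str) -> str:
    # Cosine similarity against the index (vectors are already unit-length).
    q = embed(query)
    return max(vector_db, key=lambda kv: sum(a * b for a, b in zip(q, kv[1])))[0]

on_write_event("doc1", "storage systems")   # step 1: a write lands in DataStore
on_write_event("doc2", "cooking recipes")
print(rag_search("storage systems"))        # -> doc1 (identical text, identical embedding)
```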

Event Broker

VAST Event Broker is an alternate, Kafka-compatible API that can be exposed. Under the hood, it is built on DataBase, and it can both produce events (based on DataStore activity generated through the S3 interface;11 file interfaces do not generate events) and consume events.
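"A Kafka-compatible broker built on a database" boils down to an append-only log plus per-consumer-group offsets. A minimal toy of that idea (not VAST's implementation; the event shapes are invented):

```python
class Broker:
    """Toy log-backed broker: producers append, consumer groups track offsets."""
    def __init__(self):
        self.log = []        # append-only event log (the "table")
        self.offsets = {}    # consumer group -> next offset to read

    def produce(self, event):
        self.log.append(event)

    def consume(self, group, max_events=10):
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_events]
        self.offsets[group] = start + len(batch)  # commit the new offset
        return batch

broker = Broker()
broker.produce({"type": "s3:ObjectCreated", "key": "a.txt"})  # e.g. S3 PUT activity
broker.produce({"type": "s3:ObjectCreated", "key": "b.txt"})
print(broker.consume("indexer"))   # both events, in order
print(broker.consume("indexer"))   # -> [] (offset already committed)
```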

Physical infrastructure

VAST is sold as software, but it has specific hardware requirements, and that hardware must be qualified. VAST can be deployed on hardware in one of two broad modes:

  • CNodes and DNodes: CNodes are general-purpose x86 or ARM-based servers, and DNodes are separate flash-rich servers which function as JBOFs.
  • ENodes: The CNode function is integrated into the same physical server as the DNode function. This approach is how VAST is delivered through OEMs like Supermicro.

When delivered as CNodes and DNodes, VAST uses a YYxZZ nomenclature to describe the cluster size, where YY is the number of CBoxes and ZZ is the number of DBoxes. For example, a “60x42” means 60 CBoxes and 42 DBoxes. Unfortunately, the number of CNodes and DNodes in a CBox or DBox depends on the specific version of C/DBox being deployed, so this nomenclature is not very helpful without additional context.
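The nomenclature itself is easy to decode mechanically; it is the box-to-node conversion that needs outside context. The helper below treats the per-box densities as caller-supplied parameters for exactly that reason (the values passed in the example are hypothetical, not a spec):

```python
def cluster_size(name: str, cnodes_per_cbox: int, dnodes_per_dbox: int) -> dict:
    """Decode a YYxZZ cluster name into box and node counts.

    Per-box densities vary by hardware generation, so they must be
    supplied by the caller.
    """
    cboxes, dboxes = (int(n) for n in name.split("x"))
    return {
        "cboxes": cboxes,
        "dboxes": dboxes,
        "cnodes": cboxes * cnodes_per_cbox,
        "dnodes": dboxes * dnodes_per_dbox,
    }

# A "60x42" cluster, assuming (hypothetically) 4 CNodes/CBox and 2 DNodes/DBox:
print(cluster_size("60x42", cnodes_per_cbox=4, dnodes_per_dbox=2))
# -> {'cboxes': 60, 'dboxes': 42, 'cnodes': 240, 'dnodes': 84}
```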

Alon Horev posted a photo of a VAST cluster on LinkedIn:12

It looks like six racks, each with:

  • 18 CBoxes (HPE; middle of rack)
  • 9 DBoxes (Ceres; bottom of rack)

So this is a 108x54 cluster(?)

Customers

VAST is seeing broad success in midrange public HPC, in private sectors such as finance, and among GPU-as-a-Service cloud providers.

VAST’s most notable customers include:

  • the Texas Advanced Computing Center (TACC)13
  • CoreWeave14
  • Lambda15
  • Core4216
  • xAI (Grok)17
  • Hypertec Cloud18

Footnotes

  1. Meet Your Need for Speed with NFS and VAST Data

  2. Client-side Compression Techniques and their Impact on Similarity Reduction – VAST Data (zendesk.com)

  3. Breaking Resiliency Trade-Offs with Locally Decodable Codes (vastdata.com)

  4. https://github.com/vast-data/vastdb_sdk?tab=readme-ov-file

  5. https://www.vastdata.com/blog/trino-on-vast-real-time-sql-ai-ready-infrastructure

  6. https://github.com/vast-data/vast-db-connectors

  7. https://vastdb-sdk.readthedocs.io/en/latest/

  8. https://support.vastdata.com/s/document-item?bundleId=vast-cluster-tenant-administrator-s-guide5.3&topicId=vast-data-engine/managed-applications-on-clusters.html&_LANG=enus

  9. https://patents.google.com/patent/US11239864B2

  10. https://docs.daos.io/v2.0/overview/storage/#t4.2

  11. https://support.vastdata.com/s/document-item?bundleId=vast-cluster-administrator-s-guide5.3&topicId=managing-data/event-publishing/publishing-events-to-vast-event-broker.html&_LANG=enus

  12. https://www.linkedin.com/posts/alonhorev_performance-functionality-resiliency-activity-7352686684855066624-BYCX

  13. VAST Data wins TACC supercomputer storage contract – Blocks and Files

  14. CoreWeave Partners With VAST Data

  15. Lambda Partners with VAST Data

  16. Core42 Partners with VAST Data

  17. Grok is being built on the VAST Data platform!

  18. VAST Data Accelerates Large-Scale AI Deployments with Hypertec Cloud