WEKA is a company that makes a scale-out parallel file system. It delivers extremely high performance and extremely low latency thanks to the optimizations it puts into its client, but this comes at the cost of that client being rather heavyweight and consuming CPU, memory, and local SSD resources.
NeuralMesh Axon
I think WEKA NeuralMesh Axon is a rebranding of WEKA’s hyperconverged mode, where the WEKA file system is instantiated across the node-local SSDs of compute nodes.
Challenges
I document weird limitations in WEKA here. I don’t mean for this section to be a hit piece, but I need to save these notes somewhere.
Performance
WEKA relies on transparently tiering data from its flash layer down to an underlying object store to achieve an economical $/TB. However, once data has been tiered down, its read performance becomes very poor, resulting in unexpected and unpredictable performance when operating at scale. Hugging Face showed that WEKA's reliance on tiering resulted in a 40% loss of training throughput when training against a 24 TB dataset.1
Security
WEKA cannot use RDMA or GPUDirect Storage on encrypted file systems. This suggests that its multitenancy is incompatible with RDMA and GDS as well.2
Client scalability
Every WEKA file system mount requires its own instance of the WEKA client container, and each of those containers consumes its own memory (5 GB), network port, and CPU core. In addition, only seven client containers can run on a single compute node at a time, so you cannot mount more than seven WEKA file systems at once.3
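To put those per-mount numbers in perspective, here is a rough sketch of the client-side footprint of a node mounting several WEKA file systems. The 5 GB, one-core, one-port, and seven-container figures are the ones cited above; the rest is just arithmetic, not vendor guidance.

```python
# Rough client-side footprint for N concurrent WEKA mounts on one compute node,
# using the per-mount figures cited above (one client container per mount).
MAX_CLIENT_CONTAINERS = 7   # hard cap on client containers per node
MEM_PER_MOUNT_GB = 5        # memory consumed by each client container
CORES_PER_MOUNT = 1         # dedicated CPU core per client container

def client_footprint(num_mounts: int) -> dict:
    """Estimate host resources consumed by WEKA clients for num_mounts mounts."""
    if num_mounts > MAX_CLIENT_CONTAINERS:
        raise ValueError("a node cannot mount more than seven WEKA file systems")
    return {
        "memory_gb": num_mounts * MEM_PER_MOUNT_GB,
        "cpu_cores": num_mounts * CORES_PER_MOUNT,
        "network_ports": num_mounts,
    }

print(client_footprint(7))
# {'memory_gb': 35, 'cpu_cores': 7, 'network_ports': 7}
```

In other words, a node running at the seven-mount limit gives up 35 GB of memory and seven cores to WEKA clients alone.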
Space efficiency
WEKA has a 0.39% capacity loss due to the fact that it must store a 4 KiB “metadata unit” for every 1 MiB of data stored.4 This isn’t much though; 2 TiB of big files would only require 8 GiB of metadata. If most files are between 0.5 and 1.0 MiB though, this overhead approaches 0.78% because the second half-megabyte of each file requires a second metadata unit.
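As a sanity check on those numbers, here is a quick back-of-the-envelope calculation. The 4 KiB-per-1-MiB figure is the one cited above; how many metadata units a given file actually consumes depends on WEKA's allocation rules, so that is left as a parameter rather than assumed.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

METADATA_UNIT = 4 * KIB  # 4 KiB of metadata per 1 MiB of data stored

# Steady-state overhead for large files: 4 KiB per 1 MiB of data.
base_overhead = METADATA_UNIT / MIB
print(f"{base_overhead:.2%}")                       # 0.39%

# 2 TiB of big files at that ratio.
print(f"{2 * TIB * base_overhead / GIB:.0f} GiB")   # 8 GiB

def per_file_overhead(file_size_bytes: int, metadata_units: int) -> float:
    """Metadata overhead for one file given how many 4 KiB metadata units it
    consumes (the exact per-file rounding is the part I haven't verified)."""
    return metadata_units * METADATA_UNIT / file_size_bytes

# A file of about 1 MiB that ends up holding two metadata units:
print(f"{per_file_overhead(1 * MIB, 2):.2%}")       # 0.78%
```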