3FS is a parallel file system developed by DeepSeek. Until I write up how it works, see:

An Intro to DeepSeek’s Distributed File System | Some blog

I also took a quick scan through the paper that describes DeepSeek’s 10,000-GPU A100 cluster.1 Here are the notes:

From Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning:

Quote

3FS is our in-house developed high performance distributed file system, akin to WekaFS [78], DAOS [79], [80], and BeeGFS [81].

Quote

The total 2880 NVMe SSDs provide over 20PiB storage space with an mirror data redundancy.

They do full (mirror) replication; with 2,880 × 15.36 TB drives, raw capacity is about 44 PB.
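Quick arithmetic on those numbers (the paper doesn’t say whether “over 20 PiB” is counted before or after mirroring, so treat the usable figure as approximate):

```python
# Back-of-envelope for the capacity figures quoted above.
drives = 2880
drive_tb = 15.36                              # TB per NVMe SSD
raw_pb = drives * drive_tb / 1000             # ~44.2 PB raw
usable_pib = raw_pb * 1e15 / 2 / 2**50        # halve for mirroring, then PB -> PiB
print(f"raw ~{raw_pb:.1f} PB, mirrored usable ~{usable_pib:.1f} PiB")
# -> raw ~44.2 PB, mirrored usable ~19.6 PiB, i.e. on the order of the cited 20 PiB
```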

Quote

File system meta data are stored in tables of a distributed key-value storage system. Each file or directory has a unique inode ID. The File inode/directory ID and meta data, such as file size and location information of the file content data, are stored as key-value pairs in the inode table. A separate directory entry table stores key-value pairs of (parent dir inode id, entry name) : (entry inode id, …) to support iterating entries in a directory and resolving file/directory paths.

So they don’t incur the rename penalty associated with emulating hierarchy with a pure key-value metadata system.
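A toy model of that two-table layout (the table names and value fields are my own, not the 3FS API). The point is that the directory entry table keys on (parent inode, name), so renaming a directory moves one entry instead of rewriting every descendant key, which is what a flat path-to-metadata scheme would require:

```python
# Toy model of the metadata layout described in the quote (illustrative only).
inode_table = {
    1: {"type": "dir"},                               # root
    2: {"type": "dir"},                               # /ckpt
    3: {"type": "file", "size": 4096, "chunks": []},  # /ckpt/step100
}
dirent_table = {
    (1, "ckpt"): 2,          # (parent dir inode id, entry name) -> entry inode id
    (2, "step100"): 3,
}

def resolve(path):
    """Walk the directory entry table one component at a time."""
    inode = 1
    for name in filter(None, path.split("/")):
        inode = dirent_table[(inode, name)]
    return inode

def rename(parent, old, new):
    """Only one dirent key moves; keys under the renamed directory are untouched."""
    dirent_table[(parent, new)] = dirent_table.pop((parent, old))

print(resolve("/ckpt/step100"))     # 3
rename(1, "ckpt", "ckpt_v2")
print(resolve("/ckpt_v2/step100"))  # still 3
```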

Quote

The storage service has an implementation of Chain Replication with Apportioned Queries (CRAQ) [82] to provide strong consistency. CRAQ’s write-all-read-any approach helps to unleash the throughput and IOPS of all SSDs.

This is interesting and requires more analysis. It’s a fundamentally different approach to consistency from Lustre’s, but it appears to trade capacity for IOPS: every replica in a chain stores a full copy, and in exchange reads can be served by any replica rather than just one.
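A rough sketch of the contrast (my simplification; in real CRAQ a replica holding a dirty version asks the tail which version is committed rather than serving stale data): plain chain replication serves all reads from the tail, while CRAQ lets any clean replica serve reads, so read throughput scales with the replicas you paid for.

```python
# Toy contrast between plain chain replication and CRAQ (heavily simplified).
import random

class Replica:
    def __init__(self):
        self.value, self.dirty = None, False

chain = [Replica() for _ in range(3)]      # head .. tail, one full copy each
tail = chain[-1]

def write(value):
    for node in chain:                     # propagate down the chain
        node.value, node.dirty = value, True
    for node in chain:                     # tail commits; acks clear dirty flags
        node.dirty = False

def read_chain_replication():
    return tail.value                      # every read lands on the tail's SSD

def read_craq():
    node = random.choice(chain)            # reads spread across all replicas
    if node.dirty:                         # apportioned query: ask the tail
        return tail.value
    return node.value

write("v1")
assert read_chain_replication() == read_craq() == "v1"
```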

Quote

To distribute read/write traffic evenly to all SSDs, each SSD serves multiple storage targets from different chains. The storage service runs on every storage node and manages a few storage targets.

Not sure I fully understand, but this seems WEKA-like.
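My reading of that sentence, as a sketch: each chain is a set of storage targets, each target lives on some SSD, and every SSD hosts targets belonging to several different chains, so no chain’s traffic is pinned to a single drive. The drive, chain, and replica counts below are made up:

```python
# Illustrative placement only: every chain's replicas (storage targets) land on
# distinct SSDs, and every SSD ends up serving targets from several chains.
from collections import defaultdict

NUM_SSDS, NUM_CHAINS, REPLICAS = 6, 12, 3        # made-up sizes
stride = NUM_SSDS // REPLICAS                    # spread a chain's replicas apart
targets_on_ssd = defaultdict(list)

for chain_id in range(NUM_CHAINS):
    for replica in range(REPLICAS):
        ssd = (chain_id + replica * stride) % NUM_SSDS
        targets_on_ssd[ssd].append((chain_id, replica))

for ssd in sorted(targets_on_ssd):
    chains = sorted({c for c, _ in targets_on_ssd[ssd]})
    print(f"SSD {ssd}: {len(targets_on_ssd[ssd])} targets from chains {chains}")
```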

Quote

Parameters and optimization states are asynchronously transferred from GPU to CPU host memory, with checkpoint saving performed periodically (typically every 5 minutes).

Hierarchical asynchronous checkpointing.
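A minimal sketch of that two-stage pattern (names and structure are mine, not DeepSeek’s implementation): stage the state from GPU into host memory without blocking training, then let a background thread flush the host copy to the parallel file system on the checkpoint interval.

```python
# Sketch of hierarchical async checkpointing: stage 1 snapshots state into host
# memory without blocking training; stage 2 periodically writes the host copy
# to shared storage (3FS in the paper). All names here are illustrative.
import pickle
import threading
import time

class Checkpointer:
    def __init__(self, path, interval_s=300):            # ~5 minutes in the paper
        self.path, self.interval_s = path, interval_s
        self.host_copy, self.lock = None, threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def stage_to_host(self, state):
        """Stage 1: stands in for an async GPU -> CPU memory copy."""
        with self.lock:
            self.host_copy = dict(state)                  # snapshot in host memory

    def _flush_loop(self):
        """Stage 2: persist the latest host copy every interval."""
        while True:
            time.sleep(self.interval_s)
            with self.lock:
                snapshot = self.host_copy
            if snapshot is not None:
                with open(self.path, "wb") as f:          # stand-in for a 3FS write
                    pickle.dump(snapshot, f)

ckpt = Checkpointer("/tmp/model.ckpt")
ckpt.stage_to_host({"step": 100, "weights": [0.1, 0.2]})  # called from the trainer
```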

Quote

When a storage service is granted the permission to transfer, it sends the data with a RDMA WRITE followed by a RDMA SEND to notify the client. The request-to-send control increases end-to-end IO latency but it’s required to achieve sustainable high throughput.

So they implement QoS (transfer admission control) within their own protocol.
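My paraphrase of that read path as a toy sequence (the function names are placeholders, not the 3FS wire protocol): the storage service blocks until the client grants it a transfer slot, pushes the payload with an RDMA WRITE, then fires an RDMA SEND as the completion notification. Admitting transfers only when the client has credit available is the congestion-control piece:

```python
# Toy version of the request-to-send flow: the RDMA verbs are stand-ins, and a
# client-side semaphore plays the role of the "permission to transfer" grant.
import threading

class Client:
    def __init__(self, max_inflight=2):
        self.credits = threading.Semaphore(max_inflight)   # transfer slots
        self.buffers = {}

    def request_to_send(self):
        # The storage service blocks here until the client grants a slot.
        self.credits.acquire()

    def on_rdma_send(self, req_id):
        # Completion notification: the preceding RDMA WRITE has landed.
        print(f"request {req_id} done: {len(self.buffers[req_id])} bytes")
        self.credits.release()

def storage_service_read(client, req_id, data):
    client.request_to_send()           # wait for permission to transfer
    client.buffers[req_id] = data      # stand-in for the RDMA WRITE
    client.on_rdma_send(req_id)        # stand-in for the RDMA SEND notification

c = Client()
storage_service_read(c, req_id=1, data=b"chunk-0")
```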

Quote

Thanks to the high write throughput of 3FS, periodic saving operations can be completed asynchronously in a matter of seconds, without impacting the training process. In the event of hardware failures that interrupt training, only the last 5 minutes of progress are lost.

At 10,000 GPUs, losing 5 minutes of progress per failure is acceptable. This will not work at larger scales though, since failures become more frequent while the whole cluster rolls back together, highlighting the fundamental problem with using shared storage at massive scale.
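Back-of-envelope on why the fixed 5-minute loss stops being cheap (the per-GPU failure rate below is an assumption for illustration, not from the paper): failure frequency grows with GPU count and every failure rolls the whole cluster back, so the wasted fraction of compute grows roughly linearly with cluster size, and the absolute wasted GPU-hours roughly quadratically.

```python
# Illustrative only: the per-GPU failure rate is assumed, not from the paper,
# and restart/reload time on top of the lost progress is ignored.
per_gpu_failures_per_year = 0.2        # assumption
lost_minutes_per_failure = 5           # the checkpoint interval

for n_gpus in (10_000, 100_000):
    failures_per_day = n_gpus * per_gpu_failures_per_year / 365
    fraction_redone = failures_per_day * lost_minutes_per_failure / (24 * 60)
    print(f"{n_gpus:>7} GPUs: ~{failures_per_day:.1f} failures/day, "
          f"~{fraction_redone:.1%} of all compute redone")
# ~5.5 failures/day and ~1.9% redone at 10k GPUs; ~54.8/day and ~19% at 100k
```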

Quote

3FS-KV is a shared-storage distributed data processing system built on top of 3FS, currently supporting three models: key-value, message queue, and object storage. It supports read-write separation and on-demand startup, allowing it to fully leverage the extremely high I/O throughput provided by 3FS. 3FS-KV supports DeepSeek’s KV Context Caching on Disk technology [84],

File, KV, message queue, and object. Assuming the “KV” in “KV Context Caching” means Transformer KV cache, not a key-value API (which is just an object API with minor differences).

Unified support for protocols just like VAST. WEKA is missing messages. Azure Blob is missing KV and file, and it’s unclear how good Blob’s message queue is.

Quote

Parameters and optimization states are divided into chunks and written to 3FS using the 3FS batch write API, which is significantly faster than normal writes, achieving over 10 GiB/s per node.

They implement QoS in the protocol (and sacrifice some latency for it), but this lets them checkpoint to shared storage (3FS) using a non-POSIX API.
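A sketch of what that checkpoint write path might look like (the client interface and chunk size here are hypothetical; the paper only names a “batch write API”): cut the optimizer state into chunks and submit all the writes as one batch rather than issuing a POSIX write() per chunk.

```python
# Hypothetical client interface: the paper names a "batch write API" but does
# not document it, so the call names and chunk size here are made up.
CHUNK_SIZE = 64 * 1024 * 1024                     # 64 MiB per chunk (assumed)

def chunk(buf, size=CHUNK_SIZE):
    return [buf[i:i + size] for i in range(0, len(buf), size)]

def save_checkpoint(fs_client, path, state_blob):
    """Submit all chunk writes as one batch instead of per-chunk write() calls."""
    writes = [(f"{path}/chunk_{i:05d}", data)
              for i, data in enumerate(chunk(state_blob))]
    fs_client.batch_write(writes)                 # stand-in for the 3FS batch API

class FakeBatchClient:
    """Records what would be written; a real client would target 3FS."""
    def batch_write(self, writes):
        for name, data in writes:
            print(f"write {name}: {len(data)} bytes")

save_checkpoint(FakeBatchClient(), "/ckpt/step100", b"\0" * (200 * 1024 * 1024))
```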

Quote

our NVLink-related issues, primarily Xid-74 Errors, as mentioned in Section VII-C1, account for about 42.57% of GPU failures.

Footnotes

  1. Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning, arXiv:2408.14158 (https://arxiv.org/abs/2408.14158)