Kioxia's 5TB, 64 GB/s flash module redefines storage for AI GPUs with high-bandwidth flash, offering massive capacity and efficiency while challenging HBM dominance in datacenter performance.

Kioxia has unveiled a prototype 5 TB high-bandwidth flash memory module capable of delivering 64 GB/s of sustained bandwidth, a significant step toward integrating NAND flash directly into the memory hierarchy of AI accelerators. The technology, dubbed High Bandwidth Flash (HBF), redefines the role of flash storage by combining vast capacity with performance previously reserved for DRAM-based HBM.

Unlike traditional flash storage, which prioritizes capacity over speed, Kioxia's prototype flips the script. It achieves 64 GB/s of throughput over PCIe 6.0, more than four times the roughly 14 GB/s of current PCIe 5.0 SSDs such as Samsung's 9100 Pro, and comparable to the bandwidth of a single channel of an HBM2E stack. The module uses PAM4 signaling (four-level pulse amplitude modulation), which doubles the data rate per symbol compared with conventional NRZ but demands advanced signal-integrity techniques. To maintain reliability, Kioxia implements equalization, robust error correction, and strong pre-emphasis, all essential at PCIe 6.0 speeds. The 64 GB/s target sits comfortably within the limits of a PCIe 6.0 x16 link, which provides roughly 128 GB/s of raw bandwidth in each direction, leaving ample headroom for protocol overhead and error correction (a worked check of these numbers appears below).

Latency remains the key challenge. HBM responds in hundreds of nanoseconds, close to GPU-register territory, while NAND flash still operates in the tens of microseconds. To mitigate this, Kioxia employs aggressive prefetching and intelligent caching at the controller level, reducing the impact on sequential workloads such as AI model training, large dataset streaming, and graph analytics (an illustrative sketch of the technique follows below).

Power efficiency is another major advantage. The module draws under 40 W, more in absolute terms than a typical Gen5 SSD's 15 W, but it delivers more than four times the bandwidth, so on a GB/s-per-watt basis it comes out roughly 70 percent ahead (the arithmetic is checked below). That margin is critical in hyperscale AI data centers, where clusters of GPUs such as Nvidia's H100 already drive massive energy demands.

The design also enables scalable system architecture. With daisy-chained controllers, additional modules can be added without each one claiming its own set of host lanes, enabling near-linear performance scaling. A full configuration of 16 modules could deliver over 80 TB of storage and more than 1 TB/s of aggregate throughput, capabilities once limited to high-end parallel file systems or expensive DRAM-based scratchpads.

This is not Kioxia's first push toward high-bandwidth flash. The company has previously explored long-reach PCIe SSDs and GPU peer-to-peer flash links, including collaborations with Nvidia on XL-Flash drives optimized for 10 million IOPS. Its recent expansion of manufacturing facilities in Japan further signals long-term commitment, driven by projections that flash demand will nearly triple by 2028.

The module is still a prototype, and real-world performance under mixed random workloads, ECC overhead, and AI training conditions remains to be validated. But the broader vision is clear: flash is evolving from a back-end storage layer into a near-memory component. If realized, this shift could redefine data center architecture, positioning high-capacity flash modules as key performance components alongside the GPUs themselves.
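For readers who want to verify the link-budget claim, the short sketch below reproduces the arithmetic from public PCIe 6.0 parameters. The 64 GB/s figure is Kioxia's stated target; everything else is standard spec math, not anything disclosed about the prototype itself.

```python
# Link-budget sanity check from public PCIe 6.0 parameters.
# PCIe 6.0 runs each lane at 64 GT/s using PAM4: 32 Gbaud with
# 2 bits per symbol, versus NRZ's 1 bit per symbol.
LANE_RATE_GT_S = 64                  # PCIe 6.0 per-lane transfer rate
BITS_PER_PAM4_SYMBOL = 2
symbol_rate_gbaud = LANE_RATE_GT_S / BITS_PER_PAM4_SYMBOL  # 32 Gbaud

LANES = 16
raw_gb_per_s = (LANE_RATE_GT_S * LANES) / 8  # 128 GB/s per direction

MODULE_TARGET_GB_S = 64              # Kioxia's stated sustained bandwidth
headroom = 1 - MODULE_TARGET_GB_S / raw_gb_per_s

print(f"PAM4 symbol rate: {symbol_rate_gbaud:.0f} Gbaud/lane")
print(f"Raw x16 bandwidth: {raw_gb_per_s:.0f} GB/s per direction")
print(f"Headroom left for FLIT/FEC overhead: {headroom:.0%}")
```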
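Kioxia has not published how its controller's prefetching works, so the following is only a minimal illustrative sketch of the general read-ahead technique: on each access, the next few blocks are pulled into a small cache so that a sequential stream rarely pays the full NAND latency. The ReadAheadCache class, the backend_read callback, and the window depth are all hypothetical names and parameters chosen for the example.

```python
from collections import OrderedDict

class ReadAheadCache:
    """Illustrative read-ahead cache (hypothetical, not Kioxia's design).

    On every access, the next `depth` blocks are fetched into an LRU
    cache, so a sequential stream (model weights, dataset shards)
    mostly hits in cache instead of paying full NAND latency. A real
    controller would issue these prefetches asynchronously; here they
    run inline to keep the sketch short."""

    def __init__(self, backend_read, depth=8, capacity=1024):
        self.backend_read = backend_read  # block_id -> bytes, the slow path
        self.depth = depth                # prefetch window, in blocks
        self.capacity = capacity          # max cached blocks
        self.cache = OrderedDict()        # block_id -> data, LRU order

    def read(self, block_id):
        if block_id in self.cache:
            data = self.cache.pop(block_id)     # hit: refresh LRU position
        else:
            data = self.backend_read(block_id)  # miss: pay NAND latency
        self.cache[block_id] = data
        for nxt in range(block_id + 1, block_id + 1 + self.depth):
            if nxt not in self.cache:           # speculative read-ahead
                self.cache[nxt] = self.backend_read(nxt)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return data

# A sequential scan: only block 0 misses; blocks 1..31 arrive via prefetch.
cache = ReadAheadCache(backend_read=lambda b: f"block-{b}".encode())
for blk in range(32):
    cache.read(blk)
```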
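The efficiency comparison follows directly from the figures quoted in the article:

```python
# GB/s-per-watt comparison using the article's own numbers
hbf_gb_s, hbf_watts = 64, 40    # HBF prototype: 64 GB/s at just under 40 W
ssd_gb_s, ssd_watts = 14, 15    # typical Gen5 SSD: ~14 GB/s at up to 15 W

hbf_eff = hbf_gb_s / hbf_watts  # 1.60 GB/s per watt
ssd_eff = ssd_gb_s / ssd_watts  # ~0.93 GB/s per watt
print(f"HBF {hbf_eff:.2f} vs SSD {ssd_eff:.2f} GB/s per watt "
      f"({hbf_eff / ssd_eff:.1f}x advantage)")
```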
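The 16-module projection is likewise straightforward multiplication, assuming daisy-chaining really does preserve each module's full throughput as described:

```python
MODULES = 16
TB_PER_MODULE = 5
GB_S_PER_MODULE = 64

print(f"Capacity:  {MODULES * TB_PER_MODULE} TB")      # 80 TB total
print(f"Bandwidth: {MODULES * GB_S_PER_MODULE} GB/s")  # 1024 GB/s, ~1 TB/s
```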
