Much of what we call hardware is really software – firmware – that uses the underlying hardware for speed and – this is important – a stable operating environment. The engineering dilemma is a juggling act: do we build a faster product by a) running easily updated software on a merchant processor; b) running faster-booting firmware on a merchant processor; or c) building a hardware engine that runs the protocol on bare metal for maximum speed?
Each choice requires a different level of investment, and budgets usually rule out the costliest option, a dedicated hardware engine. But it’s hard to know the true potential of each, because engineers and academics often work from heuristics or from simulations that attempt to model real-world conditions. And simulations are only as good as their design assumptions and test workloads.
NVMe (Non-Volatile Memory Express), the latest and greatest storage interconnect, is based primarily on PCIe, a standard undergoing rapid evolution. The PCIe spec is currently at 5.0, with 32 gigatransfers per second (GT/s) per lane. PCIe 6.0, due next year, is expected to raise that to 64 GT/s. But today even PCIe 5.0-compliant products are scarce.
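To put those transfer rates in storage terms, here is a rough back-of-the-envelope sketch in Python. It assumes the per-lane rates published by PCI-SIG (32 GT/s for PCIe 5.0, 64 GT/s for 6.0) and the 128b/130b line encoding used by generations 3.0 through 5.0. For 6.0 it simply assumes no encoding loss and ignores FLIT and error-correction overhead, so treat that figure as an upper bound. The function name and the four-lane default (typical for an NVMe SSD) are my own choices for illustration.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency by PCIe generation.
PCIE_GENERATIONS = {
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
    "6.0": (64.0, 1.0),  # assumption: FLIT/FEC overhead ignored, so an upper bound
}


def link_bandwidth_gb_per_s(generation, lanes=4):
    """Approximate one-direction link bandwidth in GB/s, before protocol overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[generation]
    # One transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return rate_gt * efficiency * lanes / 8
```

By this estimate a four-lane PCIe 5.0 link tops out just under 16 GB/s in each direction, and every generation doubles the one before it. Real devices deliver less once packet and protocol overhead are counted.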
With each boost in PCIe performance, the trade-offs change. What is the optimal queue depth? How do different memory technologies, such as PRAM or MRAM, affect interleaving strategies? If the clock frequency of a controller’s FPGA is increased for performance, how does that affect timing across a multi-core processor? Optimizing for higher performance, or simply taking advantage of it, requires research.
NVMe controllers present a complex case of these trade-offs. How can engineers untangle this bleeding-edge mess of architectures, protocols, interconnects, clock frequencies, workloads, and strategies?
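The queue depth question has a useful first approximation in Little's Law: the number of requests in flight equals the request rate times the time each request takes. The Python sketch below applies it under a loud assumption, namely that device latency is a fixed number, which real controllers do not honor as queues fill. The function and parameter names are hypothetical.

```python
import math


def required_queue_depth(target_bytes_per_s, request_bytes, latency_us):
    """Little's Law estimate of outstanding requests needed to hit a target rate.

    Assumes a constant per-request latency (in microseconds), which is a
    simplification: real device latency varies with load and access pattern.
    """
    iops = target_bytes_per_s / request_bytes  # requests per second
    return math.ceil(iops * latency_us / 1_000_000)  # requests in flight
```

For example, `required_queue_depth(4_096_000_000, 4096, 100)` says that sustaining about 4 GB/s with 4 KiB requests at 100 microseconds each takes 100 requests in flight. Double the link speed and the depth doubles too, unless latency falls. That is why the answer moves with every PCIe generation.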
At last week’s USENIX Annual Technical Conference, a paper from the Korea Advanced Institute of Science and Technology (KAIST) proposed an answer: “OpenExpress, a fully hardware automated framework that has no software intervention to process concurrent NVMe requests while supporting scalable data submission, rich outstanding I/O command queues, and submission/completion queue management.”
Unlike costly vendor research tools, the implementation uses a low-cost FPGA design that allows engineers and researchers to play quickly and cheaply with multiple parameters, such as request sizes and queue depths, to see how controller architecture and firmware changes affect real-world performance.
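The kind of experiment this enables looks, in spirit, like a parameter sweep. The Python sketch below is not OpenExpress code and uses no OpenExpress API. It is a hypothetical harness with a toy throughput model standing in for a real measurement, and the 80-microsecond latency and 15,750 MB/s link limit are made-up defaults for illustration only.

```python
from itertools import product


def modeled_throughput_mb_s(request_bytes, queue_depth,
                            latency_us=80.0, link_limit_mb_s=15_750.0):
    """Toy stand-in for a measurement: throughput rises with the number of
    bytes in flight until the link saturates. Not a model of any real device."""
    unconstrained = queue_depth * request_bytes / latency_us  # bytes/us == MB/s
    return min(unconstrained, link_limit_mb_s)


def sweep(request_sizes, queue_depths, measure=modeled_throughput_mb_s):
    """Run `measure` over every (request size, queue depth) combination."""
    return {(size, depth): measure(size, depth)
            for size, depth in product(request_sizes, queue_depths)}
```

Calling `sweep([4096, 16384, 65536], [1, 8, 64])` returns nine results keyed by (size, depth). On a real testbed you would pass your own `measure` function that drives the hardware, and the point of a cheap FPGA platform is that each such run costs very little.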
This means that for any given engineering budget, designers will be able to run many more experiments on different design trade-offs to bring you the most performant (screaming!) storage controllers.
This is how the future is invented. Imagine a storage controller and medium almost as fast as the fastest DRAM, as some NVRAM startups are promising, paired with a PCIe 7.0 or 8.0 capable of well over 100 GT/s per lane. That blurs the line between storage and memory.
We could, then, dump the entire virtual memory overhead, with its context switches and virtual-to-physical address translation, in favor of a totally flat address space. There’d be no logical difference between “memory” and “storage”. All accesses would be direct to the data. And data, my friend, is why we do all this.
IBM pioneered the flat address space concept decades ago in the System/38. If I were an Apple engineer looking at future system architectures, I’d definitely be investigating this. Especially since Intel would be slow to support it.
Comments welcome. Did you ever program a System/38? Thoughts?