Last Updated: November 27, 2021
VAST Data’s Universal Storage redefines the economics of flash storage, making flash affordable, for the first time, for all applications, from the highest-performance databases to the largest data archives. The Universal Storage concept blends game-changing storage innovations that lower the acquisition cost of flash with an exabyte-scale file and object storage architecture that breaks decades of storage tradeoffs.
With the advantage of new enabling technologies that were not available before 2018, the Universal Storage concept achieves a previously impossible architectural design point. The system combines low-cost QLC Flash Drives and Storage Class Memory with stateless, containerized storage services, all connected over new low-latency NVMe over Fabrics networks, to create VAST’s Disaggregated Shared Everything (DASE) scale-out architecture. Next-generation global algorithms are applied to this DASE architecture to deliver new levels of storage efficiency, resilience, and scale.
While the architectural concepts are sophisticated, the intent and vision of Universal Storage are simple: to bring an end to the data center HDD era and to the complexity of storage tiering, a byproduct of decades of compromises forced by mechanical media. This white paper introduces VAST Data’s Universal Storage and the DASE architecture, and explains how this new architecture defies conventional definitions of storage. By breaking the classic price/performance tradeoff, the system delivers all-flash performance at archive economics, simplifying the data center and accelerating all modern applications.
Why Universal Storage?
The Tyranny of Tiers
Over 30 years ago, Gartner introduced the storage tiering model as a means to optimize data center costs, advising customers to demote older, less-valuable data to lower-cost (and slower) tiers of storage. Fast forward 30 years, and the sprawl of storage technologies within organizations has grown to unmanageable proportions: many of the world’s largest companies manage dozens of different types of storage. The sprawl spans both storage classes (for example: all-flash, hybrid, all-HDD, tape) and protocol classes (block, file, object, big data, etc.), all of it creating a complex pyramid of storage technologies.
While the savings are clear when applying this model with legacy storage architectures, the idea that data should exist on a specific storage tier according to its current value creates multiple challenges:
The Demands of Artificial Intelligence Render Storage Tiering Obsolete
Arguably the greatest problem with storage tiering is that the concept assumes the applications accessing data have a narrow, predefined view of their data access requirements. While that is true for some applications, such as traditional database engines, game-changing AI and analytics tools, such as machine learning and deep learning, see value in all data and want the fastest access to the largest amounts of data. For example, when a deep learning system trains its neural network model for facial recognition, the model becomes more accurate only when it is run against all the photos in the dataset, not just the 15-30% that may fit in some expensive flash tier. The value these applications deliver is proportional to the corpus of data they are exposed to; they thrive on large data sets.
Defining Universal Storage
Universal Storage is a next-generation, scale-out file and object storage concept that breaks decades of storage tradeoffs, and in so doing defies classical storage definitions. Universal Storage is:
New Technologies Lay A New Storage Foundation
There are points in time when the introduction of new technologies makes it possible to rethink fundamental approaches to system architecture. To realize the Universal Storage architecture vision, VAST made a bet on a trio of underlying technologies that were not available to previous storage architecture efforts and that, in fact, only all became commercially viable in 2018. These are:
Scale-out Beyond Shared Nothing
For more than a decade, the storage industry has convinced itself that a shared-nothing storage architecture is the best approach to achieving storage scale and cost savings. Following the release of the Google File System architecture whitepaper in 2003, it became table stakes for storage architectures of almost every variety to be built on a shared-nothing model, from hyper-converged storage to scale-out file storage, object storage, data warehouse systems, and beyond. Today, the basic principles on which shared-nothing systems were founded are much less valid, for the following reasons:
Quad-Level Cell Flash (QLC) is the fourth and latest generation in flash memory density and therefore costs the least to manufacture. QLC stores 33% more data in the same space than Triple-Level Cell (TLC). Each cell in a QLC flash chip stores four bits, requiring 16 different voltage levels.
While QLC brings the cost per GB of flash down to unprecedentedly low levels, squeezing more bits in each cell comes with a cost. As each successive generation of flash chips reduced cost by fitting more bits in a cell, each generation also had lower endurance, wearing out after fewer write/erase cycles. The differences in endurance across flash generations are huge – while the first generation of NAND (SLC) could be overwritten 100,000 times, QLC endurance is 100x lower.
Erasing flash memory requires high voltage that physically damages the flash cell’s insulating layer. After many cycles, enough damage accumulates to allow electrons to leak through the silicon’s insulating layer, and this wear is the cause of QLC’s lower endurance. To hold a four-bit value, a QLC cell must hold one of 16 discrete charge/voltage levels, all between roughly 0 and 3 volts. Packing that many values into slightly different voltage levels makes QLC especially sensitive to leakage: the stored value in a cell can change from 1101 to 1100 when just a few electrons leak out. Because earlier flash types use fewer voltage levels, more electron leakage is required to change one value into another, so they can survive more of the damage each write/erase cycle causes.
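The arithmetic behind this sensitivity can be illustrated with a short sketch (assuming the roughly 3-volt range described above; exact figures vary by vendor and process):

```python
# Illustrative sketch (not VAST code): why denser flash cells are more
# sensitive to electron leakage. Assumes a ~3 V usable charge range,
# as described above; real figures vary by vendor and process.
VOLTAGE_RANGE_V = 3.0

FLASH_GENERATIONS = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4, "PLC": 5}

for name, bits_per_cell in FLASH_GENERATIONS.items():
    levels = 2 ** bits_per_cell                   # distinct charge levels required
    margin_mv = VOLTAGE_RANGE_V / levels * 1000   # window between adjacent levels
    print(f"{name}: {bits_per_cell} bits/cell, {levels} levels, "
          f"~{margin_mv:.0f} mV between levels")
```

Under these assumptions, an SLC cell enjoys roughly a 1,500 mV window between its two states, while a QLC cell has only about 190 mV between adjacent levels, so far fewer leaked electrons are needed to flip a stored value.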
VAST’s Universal Storage systems were designed to minimize flash wear both by using innovative new data structures that align with the internal geometry of low-cost QLC SSDs in ways never before attempted, and by using a large Storage Class Memory write buffer that absorbs writes, providing the time and space needed to minimize wear. The combination allows VAST Data to warranty QLC or PLC flash systems for 10 years, which has its own impact on system ownership economics.
Several flash vendors have now started talking about an even denser, 5-bit-per-cell flash generation. While designs are preliminary, this new PLC (Penta-Level Cell) flash will store 25% more data per cell than QLC and is projected to endure only a few hundred write/erase cycles per cell, which the VAST architecture is also designed to accommodate.
The logic of VAST’s Universal Storage cluster runs in stateless containers. Thanks to NVMe-oF and NVMe Flash and Storage Class Memory, each container enjoys direct-attached levels of storage performance without having any direct-attached stateful storage. Containers make it simple to deploy and scale VAST as a software-defined microservice while also laying the foundation for a much more resilient architecture where container failures are non-disruptive to system operation.
Storage Class Memory
For the first time in 30 years, a new type of media has been introduced into the classic media hierarchy. Storage Class Memory is a new persistent memory technology that offers both lower latency and higher endurance than the NAND flash memory used in SSDs, while, like flash, retaining data without external power.
Universal Storage systems use Storage Class Memory both as a high-performance write buffer that enables the deployment of low-cost QLC flash for the system’s data store and as a global metadata store. Storage Class Memory was selected for its low write latency and long endurance. A Universal Storage cluster includes tens to hundreds of terabytes of Storage Class Memory capacity, which provides the VAST DASE architecture with several architectural benefits:
NVMe over Fabrics
NVMe (Non-Volatile Memory Express) is the software interface that replaced the SCSI command set for accessing PCIe SSDs. Greater parallelism and lower command queue overhead make NVMe SSDs significantly faster than their SAS or SATA equivalents.
NVMe over Fabrics (NVMe-oF) extends the NVMe API over commodity Ethernet and InfiniBand networks to provide PCIe levels of performance for remote storage access at data center scale. VAST’s DASE architecture disaggregates CPUs from media and connects them to a globally accessible pool of Storage Class Memory and QLC Flash SSDs. This enables a system architecture that scales controllers independently from storage and provides the foundation for a new class of global storage algorithms intended to drive the effective cost of the system below the sum of its cost of goods. With NVMe-oF, VAST Containers enjoy the advantages of statelessness and shared-everything access to a global pool of Storage Class Memory and Flash, with direct-attached levels of storage access performance.
The DASE Architecture
VAST Universal Storage is based on a new scale-out architecture concept consisting of two building blocks scaled across a common NVMe Fabric. First, the state (and storage capacity) of the system resides in resilient, high-density NVMe-oF storage enclosures. Second, the logic of the system is implemented by stateless Docker containers, each of which can connect to and manage all of the media in the enclosures. Because the compute elements are disaggregated from the media across a data-center-scale Fabric, each can scale independently, thereby decoupling capacity and performance.
In this Disaggregated Shared Everything (DASE) architecture, every VAST Server in the cluster has direct access to all of the cluster’s storage media at PCIe levels of low latency.
VAST Servers provide the intelligence to transform enclosures full of Storage Class Memory and QLC SSDs into an enterprise storage cluster. VAST Servers serve file and object protocol requests from NFS, S3, and SMB clients and manage the global namespace, called the VAST Element Store.
The VAST Server Operating System (VASTOS) provides multi-protocol access to the VAST Element Store by treating file and object protocols as interchangeable peers. Clients can write a file to an NFS mount or an SMB share and read the same data as an object from an S3 bucket (and vice versa). Today, VASTOS supports the NFS v3 file protocol, including NFSoRDMA (NFS over RDMA), and SMB (Server Message Block, the Microsoft protocol previously known as CIFS), along with the de facto cloud-standard S3 object storage protocol. Each server manages a collection of virtual IP addresses (VIPs) that clients mount via round-robin DNS services to balance load across the cluster.
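The load-balancing effect of round-robin DNS can be illustrated with a short sketch (the VIP addresses below are hypothetical, chosen only for illustration):

```python
from itertools import cycle

# Hypothetical VIP pool; a real cluster's DNS service would return these
# addresses in rotating order for a single cluster hostname.
vip_pool = ["10.1.0.11", "10.1.0.12", "10.1.0.13"]
dns_round_robin = cycle(vip_pool)

# Six clients resolving the cluster name get VIPs spread evenly
# across the pool, and therefore across the VAST Servers behind it.
client_mounts = [next(dns_round_robin) for _ in range(6)]
print(client_mounts)
```

With six clients and three VIPs, each VIP (and thus each server) ends up with exactly two client mounts, without any coordination among the clients.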
All the VAST Servers in a cluster mount all the storage devices in the cluster via NVMe-oF, providing global and direct access to all the data and metadata in the system. With this global view, VASTOS distributes data management services (erasure encoding, data reduction, etc.) across the cluster’s CPUs so that cluster performance scales linearly as more CPUs are added.
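VAST’s actual erasure codes are far wider and more sophisticated than anything shown here; the following minimal single-parity sketch (hypothetical Python, not VAST code) illustrates only the general principle behind erasure encoding: parity computed across data blocks allows a lost block to be rebuilt from the survivors.

```python
def xor_parity(data_blocks: list) -> bytes:
    """Compute a single parity block as the byte-wise XOR of equal-sized blocks."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild(surviving_blocks: list, parity: bytes) -> bytes:
    """Recover a single lost block by XORing the parity with the survivors."""
    return xor_parity(surviving_blocks + [parity])

blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(blocks)

# Lose the middle block, then rebuild it from the other two plus parity.
recovered = rebuild([blocks[0], blocks[2]], parity)
assert recovered == b"BBBB"
```

Because the parity computation is independent per stripe, work like this can be spread across many CPUs, which is the property the DASE architecture exploits to scale data services linearly.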
VASTOS is deployed in stateless containers that simplify software management across VAST Server appliances. Legacy systems must reboot nodes to instantiate a new software version, which can take a minute or more as the node’s BIOS performs a power-on self-test (POST) on the node’s 768GB of DRAM. The upgrade process for VASTOS instantiates a new VASTOS container without restarting the underlying OS, reducing the time a VAST Server is offline to a few seconds.
Under VAST’s Gemini model, customers purchase capacity-based subscription licenses from VAST that allow them to run VASTOS containers on qualified hardware they own. VAST has arranged for customers to buy fully integrated VAST Server appliances and VAST Enclosures from its manufacturing partners at VAST’s negotiated cost. Hardware maintenance, including SSD replacement, is included in the Gemini subscription, and VAST will support server appliances and/or enclosures for 10 years from the appliance’s initial install date.
The Advantage of a Stateless Design
When a VAST Server receives a read request, it accesses persistent metadata housed in Storage Class Memory across the Fabric to locate a file or object’s data, then reads the data from QLC flash (or from XPoint, if the data has not yet been migrated out of the buffer) before forwarding it to the client. For write requests, the VAST Server writes both data and metadata directly to multiple XPoint SSDs before acknowledging the write. This direct access to shared devices over an ultra-low-latency fabric eliminates the need for VAST Servers to talk to each other to service an I/O request; no machine talks to any other machine in the synchronous read or write path. Shared Everything makes it easy to scale performance linearly just by adding CPUs, overcoming the law of diminishing returns so often encountered when shared-nothing architectures are scaled up. Clusters can be built from thousands of VAST Servers to provide extreme levels of aggregate performance; the only scalability limit is the size of the Fabric that customers configure.
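This write path can be sketched as follows (all class and function names are hypothetical illustrations, not VAST’s implementation): a stateless server mirrors each write to multiple shared Storage Class Memory devices and only then acknowledges the client, never contacting a peer server.

```python
class ScmBuffer:
    """Stands in for one shared NVMe-oF Storage Class Memory device."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        # In a real system the data is persistent before the ack.
        self.store[key] = value

def handle_write(key, data, scm_buffers, copies=2):
    """Mirror the write to N shared devices, then acknowledge the client.

    Note: no server-to-server communication appears anywhere in this path;
    durability comes entirely from the shared, persistent devices.
    """
    for buf in scm_buffers[:copies]:
        buf.write(key, data)
    return "ack"

buffers = [ScmBuffer(), ScmBuffer(), ScmBuffer()]
status = handle_write("/file/a", b"payload", buffers)
```

The key design point the sketch captures is that durability is a property of the shared media, not of any server, so any server can acknowledge any write and any server can later serve the read.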
Storing all the system’s metadata in shared, persistent XPoint SSDs eliminates the need to maintain any cache coherency between Servers and eliminates the need for the power-failure-protection hardware that would otherwise be required by volatile, expensive DRAM write-back caches. VAST’s DASE architecture pairs 100% nonvolatile media with transactional storage semantics to ensure that updates to the Element Store are always consistent and persistent.
VAST Servers do not, themselves, maintain any local state, making it easy to scale services and to fail over around any Server outage. When a VAST Server joins a cluster, it executes a consistent hashing function to locate the roots of the various metadata trees. As Server resources are added, the cluster leader rebalances responsibility for shared functions. Should a Server go offline, other Servers adopt its VIPs, and clients reconnect to the new servers within standard timeout ranges upon retry.
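The consistent hashing step can be sketched generically as a single hash ring (this is a textbook illustration, not VAST’s actual metadata layout); the property that matters is that removing a server reassigns only the keys that server owned, so the rest of the cluster is undisturbed:

```python
import hashlib
from bisect import bisect_right

def ring_pos(key: str) -> int:
    """Map a key to a deterministic position on the hash ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        self.ring = sorted((ring_pos(s), s) for s in servers)
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key: str) -> str:
        """Return the first server clockwise from the key's ring position."""
        idx = bisect_right(self.positions, ring_pos(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["server-1", "server-2", "server-3"])
root_owner = ring.owner("/metadata/tree/root")
```

Because every server computes the same function over the same shared state, a joining server can locate metadata roots without asking any peer, consistent with the stateless design described above.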
This Shared Everything cluster concept breaks the rigid association that storage systems have historically built around specific storage devices within a cluster. A VAST Cluster will continue to operate and provide all data services with even a single VAST Server running, since all state is stored in a set of globally accessible and resilient storage enclosures. If, for example, a cluster consisted of 100 Servers, it could lose as many as 99 machines and still be 100% online.
VAST Enclosures are resilient NVMe-oF storage enclosures that connect XPoint and QLC flash SSDs to a high-throughput Ethernet or InfiniBand network. The VAST Enclosure has no single point of failure: Fabric Modules, NICs, fans, and power supplies are all fully redundant, so VAST Clusters can be built from as few as one Enclosure and scale to 1,000 Enclosures.
Per the figure above, each VAST Enclosure houses two Fabric Modules that route NVMe-oF requests from Ethernet or InfiniBand ports to the Enclosure’s SSDs through a complex of PCIe switch chips. With no single point of failure from network port to SSD, VAST Enclosures combine enterprise-grade resiliency with high-throughput connectivity. While at face value the architecture of a VAST Enclosure appears similar to a dual-controller storage array, there are in reality several fundamental differences:
Fabric Failover With Single Ported SSDs
VAST Enclosures use a dual port 100Gbps network interface card (NIC) in each fabric module to provide redundant paths from the cluster’s NVMe fabric to the enclosure’s SSDs. This ensures that the cluster can continue to operate through switch and fabric module failures.
While the more than 20GB/s of read bandwidth a VAST Enclosure provides is more than enough for most applications, the latest GPU computing applications, from AI to rendering, need more. VAST LightSpeed enclosures provide twice the read bandwidth of VAST’s standard enclosure by replacing the dual-port 100Gbps network interface card in each fabric module with two single-port cards. This eliminates the single x16 PCIe slot as a bottleneck, providing each NVMe fabric port with 16 dedicated PCIe lanes to the PCIe switch complex and, through the switch, to the enclosure’s SSDs.
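The back-of-envelope arithmetic behind the doubling can be sketched as follows (this assumes PCIe Gen3 lane rates for illustration; the actual hardware generation and exact figures may differ):

```python
# Back-of-envelope sketch (assumes PCIe Gen3; real hardware may differ):
# why moving from one dual-port NIC in a single x16 slot to two
# single-port NICs roughly doubles usable read bandwidth per module.
PCIE_GEN3_LANE_GBPS = 0.985  # ~GB/s per Gen3 lane after encoding overhead

def slot_bandwidth(lanes: int) -> float:
    """Approximate usable bandwidth of a PCIe slot with the given lane count."""
    return lanes * PCIE_GEN3_LANE_GBPS

standard = slot_bandwidth(16)       # one x16 slot shared by both 100Gb ports
lightspeed = 2 * slot_bandwidth(16) # one full x16 slot per port
print(f"standard: ~{standard:.1f} GB/s, lightspeed: ~{lightspeed:.1f} GB/s")
```

Under these assumptions, a single x16 slot tops out around 15-16 GB/s regardless of how many network ports share it, so giving each port its own x16 slot is what unlocks the second port’s full bandwidth.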