This post was originally published on this siteAt Pure//Accelerate 2018 Cody Hosterman and I delivered a session on “Moving Data Between Cloud and On-Premises Virtualized Environments“....
The objective of this article is to explore the advantages of running Hadoop MapReduce on FlashBlade™, a shared file/object storage tier using NFS, instead of traditional HDFS. FlashBlade is a high-performance, scale-out architecture that challenges the traditional view of shared storage and NFS as too slow or unreliable for big-data analytics.
A follow-up post explains in detail how to configure MapReduce for FlashBlade and compares the performance of FlashBlade versus traditional HDFS.
In short, a high-performance, scale-out NFS service both improves MapReduce performance and greatly simplifies operating and scaling the storage tier of a Hadoop cluster:
Disclaimer: I helped build FlashBlade.
This article will dive deeper into the requirements and candidates for Hadoop’s storage tier and look at why it makes sense to replace HDFS with a high-performance, scale-out storage device like FlashBlade.
There have been others who have suggested the benefits of non-HDFS alternatives, most notably AWS’s S3 (more here), but also GlusterFS and proprietary solutions like MapR-FS, but none of these have the small-file performance, general flexibility, and wide adoption of NFS and the high-performance, scale-out architecture of FlashBlade.
NFS over FlashBlade can either be used in full on-premises environments or hybrid clouds environments where EC2 compute instances access the storage device using AWS direct connect. This article focuses on MapReduce, but the limitations and inflexibility of HDFS hinder other applications, like Apache Spark, in the same way.
The figure below contrasts the architectures of HDFS and shared storage (NAS). The key difference is the unit of scaling; in an HDFS cluster, each node is both compute and storage and hence scaling the cluster means always scaling both proportionally. In contrast, the shared storage architecture allows adding additional compute nodes separately from the storage tier.
Storage Layer Requirements
The Hadoop project combines three main components: a storage tier (HDFS), a parallel computation engine (MapReduce), and a resource scheduler (YARN). MapReduce simplifies large, batch computations by allowing the programmer to write single-threaded code and it automatically parallelizes the computation to run on 1000s of machines. Though the default in Hadoop, HDFS can be replaced by a different storage tier such as a networked-filesystem (e.g., NFS) or object store (e.g., S3). Each job’s storage requirement depends on data size and the computational complexity of the task, but IO-dominant workloads require reading and writing data at 1) high aggregate bandwidth, 2) predictable performance and 3) high concurrency.
Protection against failures is also required from the storage tier. The MapReduce model helps prevent transient failures from aborting an entire, long-running job by retrying individual failed tasks. Most MapReduce input data is very expensive or impossible to reproduce if lost so the storage tier must also provide durable storage.
There are several critical requirements from general-purpose filesystem that are missing: efficient storage of small files, very large numbers of files, and overwrite support. Though not required for MapReduce, these features greatly increase the flexibility to use multiple and new applications (e.g., Apache Spark) with the same datasets; MapReduce is just one tool in a quickly growing space.
HDFS: Motivations and Relevance Today
Hadoop’s HDFS design is inspired by the 2003 “Google File System” paper, which described a system for clustering large numbers of commodity servers with unreliable hard disks together into a reliable and high performance storage tier. Namespace metadata is managed by a single namenode server and data blocks are highly replicated across datanodes in the cluster to provide both aggregate bandwidth and fault tolerance.
When HDFS was originally created, its initial assumptions included:
Since the original design over 10 years ago, several underlying premises in the datacenter have changed or shifted:
The largest fundamental issue with the implementation of HDFS is the serialization of metadata accesses on the single namenode. Each client read must access the namenode in order to direct a block operation (read/write) to the nearest datanode with the desired block. This means all data accesses, whether read or write, must go through the namenode. This makes the namenode a performance bottleneck in larger systems or systems with many small files. There has been effort to eliminate the namenode as a single-point-of-failure (more here), but it is still a performance chokepoint.
Additional benefits of moving to a networked storage tier like NFS:
When expanding a cluster, adding HDFS nodes adds both compute and storage proportionally.
By decoupling, you can add compute by the addition of new nodes with only minimal local storage. The storage tier scales either organically in a scale-out storage system, e.g. adding additional blades, or by mounting multiple storage systems.
There are other features of modern storage appliances that HDFS can or will be able to match. Many NAS systems use erasure-coding to protect user data without the inefficiencies of mirroring. In the upcoming Hadoop 3.0 release, HDFS is planned to support erasure-coding which would protect data from node failure with more efficient use of space than 3x replication. Additionally, modern storage appliances support inline compression which is also possible to enable to HDFS.
HDFS was originally designed 15 years ago to solve an application-specific set of storage problems.However, today’s data centers must deliver faster networking and lower storage latency to accommodate the demands of mixed workloads and web-scale applications. With the introduction of scale-out, all-flash file and object storage platforms like FlashBlade, NFS-based storage is able to offer 1) similar or better performance, 2) higher densities, 3) disaggregation of compute and storage, and 4) simpler administration versus traditional HDFS.
Read the second blog post, which explores the details of how to run MapReduce on FlashBlade and presents performance comparisons against HDFS backed by local SSDs.