The Pure Good Foundation was established in late 2015 with the mission to accelerate positive change by harnessing the power of people, technology, and community to uplift communities ...
The technologies that comprise data analytics pipelines have changed over the past few decades, which has had a massive effect on the infrastructure required to support them. Newer technologies include support for File and Object Store protocols, which enables the use of centralized storage instead of DDAS (Distributed Direct Attached Storage), avoiding the difficulties of scaling, tuning, and maintaining infrastructure.
Relational databases have been around for many decades as the norm across many companies. They have evolved to be easy to create, maintain, and manage with reliable performance directly proportional to the number of queries and amount of data they store. However, relational databases have two main limitations
The infrastructure to support a relational database model frequently looks something like this:
You have a large cluster of compute to run queries on top of your Relational Database, and this compute cluster is attached to a centralized storage platform over the network. As long as your storage can keep up with bandwidth and latency requirements, you can add more compute nodes in order to make your queries run faster (up to a point). Conversely, if you just need more capacity, you can add more storage to your SAN array.
Storage is relatively inexpensive, so enterprises are saving larger and larger amounts of semi-structured “data exhaust” from logs, devices, and dozens of other IoT sources. But it’s not just about storing the data for archival/backup any more – companies want to extract insights from their data to gain competitive advantages. New software apps have been created that can analyze all these data: HDFS, Elastic, Kafka, S3, Spark. With these open source building blocks in place, the challenge for enterprises becomes how best to architect their modern data warehouses to serve their needs, while still meeting strict performance and cost requirements.
The most common deployment of infrastructure to support modern data analytic pipelines is the DDAS model – this is mostly for historical reasons. At the time that modern data analytics technologies were being developed, there wasn’t a storage platform big enough for such large amounts of data nor fast enough to meet the high bandwidth requirements from big data software.
When Hadoop, the most commonly used big data analytics platform, was created, its distributed filesystem (HDFS) was modeled after Google File System – which was based on a DDAS model. HDFS and the use of DDAS allowed data scientists to use commodity off-the-shelf systems/components for their analytics pipeline, like X86 processors and standard hard disk drives. While the entry point for big data analytic pipelines was lowered through DDAS and HDFS, it also created a list of significant management problems:
At Pure Engineering, we have developed a big data analytics pipeline that processes over 18TB of data per day to triage failures from our automated testing. Since we are part of the team that built the FlashBlade™, we decided to use it as the centralized storage solution for all the steps in our pipeline (rsyslog servers, Kafka, Spark, and ElasticSearch). Having a fast, dense, and simple-to-manage centralized storage eliminates all of the above infrastructure headaches that traditionally come with this ecosystem. Reach out to us if you want to learn more about what the FlashBlade can do for your data analytics pipelines!