The technologies that comprise data analytics pipelines have changed over the past few decades, which has had a massive effect on the infrastructure required to support them. Newer ...
Faster software development = faster feature delivery = happier Pure customers.
At the same time, quality of our software has always been our number one priority. From day one, all software development at Pure adopted a continuous integration model and made a heavy investment in automated testing to ensure software quality. Test failures provide insights, but they can be hard to debug due to the complexity of software systems as well as failure injection tests. As we grow our engineering team, the infrastructure to support them, and the numbers tests we run per day, manual log review for debugging test failures does not scale due to the immense amount of work. We have built a software QA infrastructure using modern big data tools like Spark and Kafka, enabling us to process 50 billion events per day and provide feedback in real-time under five seconds.
This blog post describes how we’ve leveraged automated software testing from the beginning, the challenges our processes faced when scaling, and how we’ve augmented our automated testing infrastructure with a real-time big data analytics pipeline to solve these problems.
Early on in the development of the FlashBlade, a scale-out file and object storage platform, we were a small team of engineers with a handful of prototypes for the product and a small amount of servers to simulate our customers’ load. At this stage, we made a conscious decision to do as much automated testing during development as possible. Our goal was to have a fast iteration cycle for software development that included a continuous integration model.
During these early days of development, we had dozens of tests that amounted to hundreds of test runs per day. This model served us well, as it enabled an agile development cycle during prototyping and until we released FlashBlade. We were able to quickly identify and fix bugs, which in turn increased the quality of our product.
Customer demand for FlashBlade ramped up very quickly, so we started to increase our engineering force accordingly. As the number of engineers grew, the number of features that we were developing in parallel also grew. And with more features come more tests which required more infrastructure to support them (FlashBlades, Servers, Jenkinses, etc). We even had to invest in a feature simulation system of the FlashBlade to allow us to scale these testing efforts. Each factor grew so rapidly that we quickly found ourselves running over 20 thousand tests per day.
For any given test, there are multiple points of failure: FlashBlade, network, client, code line, test case, etc. These multiple components make identifying the cause of a test failure – also referred to as triage – a complicated task. For the sake of discussion, let’s assume it takes 10 minutes for an engineer to root-cause a test failure. Now, if we assume a 5 percent failure rate, we end up with 1,000 failures per day. This means we would need a full time team of (1000 failures per day) * (10 minutes / failure) / (hour / 60 minutes) * (day / 8 hours) ~ 20 engineers to triage these failures.
With all of the opportunity that FlashBlade has for innovating and bringing more solutions to market, we plan to double our engineering team making this problem larger and more complex.
Our triage team today only has a handful of members. Without any automated triage to supplement those people, a huge percentage of test failures would go untriaged. Valuing software quality as we do, it was unacceptable to not be able to reap the benefits of having our automated testing framework for continuous integrations. This is the reason we set out to build an automated solution that would help with this process.
Our goal was to come up with a system to:
Because of the requirements laid out above, we decided to go with a big data analytics pipeline that streams and processes all of the logs from our infrastructure. Here are the steps from our pipeline:
All of our infrastructure uses logger and we have a pre-task in jenkins that configures each component with an rsyslog configuration to send the logs into a centralized cluster of rsyslog servers. This pre-task polls a service to find the least utilized rsyslog server for load balancing.
This is the first and last step of the pipeline where the logs are processed in the order they are created. Because of this, each line needs to be tagged with enough metadata so that it can be uniquely identified after being processed. Rsyslog converts each log line into json extracting some fields that describe its origin (host, component, time, etc.) and adding other fields about the test it was running at the time (jenkins job, jenkins build id, VM, branch, etc.).
These rsyslog servers are also in charge of preserving all of our logs for posterity – in json format. Because of all the metadata we are adding, the amount of data being sent out from rsyslog triples in size. We preserve all of our logs for posterity in a pre-classified format using the filesystem’s directory tree structure <FlashBlade>/<year>/<month>/<day>/<host>/<log>. This format helps us lower the granularity of our log files for re-processing. On the other hand, we create half a million files per day to be stored on a FlashBlade. Processing all of that metadata for re-processing our logs takes advantage of the high IOPS provided by the FlashBlade.
A property we found valuable for a data analytics pipeline was the decoupling of consumer and producer of the data. Having kafka as a broker in between any 2 components of the pipeline (e.g. rsyslog and Spark) allows us to stop/start the consumer and continue ingesting data from the point at which it previously stopped. Also, based on our retention policy (3 days), we are able to attach other consumers (e.g. our test pipeline, look-backs for new signatures) and replay certain periods repeatedly. Kafka is configured to use the FlashBlade via NFS as its backing storage; this means that we can dynamically change our retention policy without having to worry about the infrastructure supporting the Kafka cluster.
The compute core of our pipeline that classifies and filters the streamed logs and outputs to the filesystem (NFS/S3). We use the filesystem directory structure to hold the metadata of the filtered lines in the following format: <filtered_logs>/<FlashBlade>/<year>/<month>/<day>/<hour>/<host>/<signature_id>/<text>. This classification system allows us to have an hourly granularity per signature (and all of its relevant patterns) which speeds up our post-processing of the filtered log lines. We found a large part of our use cases were going to be met with the processing of unique lines and we optimized for the real-time aspect of it: Spark does not perform any aggregations, only filtering and classification.
We use python consumers from a Kafka queue to do post-processing and aggregation of unique patterns for multi-line and scripted signatures. Scalability of our post-processing layers comes intrinsically from the ability to have multiple concurrent consumers pulling from a given Kafka topic.
We use JIRA internally to track our engineering work items, bugs, and test failures. Thus, JIRA is the ideal location for our users (software engineers) to encode their signatures and for our analytics pipeline to report new reproductions of a given signature. We use Jenkins for resource management of our hardware as well as test scheduling; our analytics pipeline also marks test failures as ‘known failure’ by augmenting the description of a given Jenkins build.
The analytics pipeline that we’ve built has a significant impact to our engineering organization across multiple fronts:
While modern analytics tools were critical in building our QA pipeline, the speed and agility of the experimentation and growth of our pipeline would not have been possible without a disaggregated compute and storage layer. This architecture allows us to spin up large amounts of compute nodes on demand and point them at the FlashBlade for a large batch analytics job without disrupting our existing ones. FlashBlade is the infrastructure on which the pipeline is built. It consolidates all the data silos into a single, shared system and enables real-time performance for any data, any access patterns, and any applications. To learn more, please read our blog on why FlashBlade for modern analytics here.