We’ve all witnessed the hype around Artificial Intelligence, and the Financial Services (FS) industry is no exception. Many are relatively new to the topic and there is much to learn. At a detailed level the maths, models and data science involved are very complex and can be daunting. That said, it’s important not to get lost in these details. Whilst they play a central part, successfully delivering AI @ scale also has several other major dependencies and challenges. It’s these on which this blog focuses.
With AI @ scale we are talking supercomputer levels of performance, so it is unsurprising that standard, generic legacy IT (and more specifically standard, generic legacy storage) is simply not up to the task. As a result, many FS organizations are evaluating, or preparing to evaluate, new technology for this purpose. Having been through many AI engagements, it’s clear that success requires a good high-level understanding of AI and of the technical requirements (or critical capabilities) that the planned AI project will place on the infrastructure. This is by no means trivial, as this knowledge and experience is scarce.
The purpose of this blog is to share the top 10 lessons from our experiences to date:
- Don’t underestimate the data challenge
Training AI algorithms requires huge volumes of data, and one of the key challenges we consistently see within FS organizations is simply getting access to the data needed to train your algorithm (navigating Information Security policies and teams). As a result, sourcing data internally almost always takes much longer than expected. It’s worth pointing out that this applies to on-premises deployments in the customer’s own datacenter; for customers considering the public cloud, you can expect the process to take significantly longer. So, identify and request the data you need as early in your project as possible.
- Don’t be rigid about tools
Tools and frameworks are evolving quickly, and it’s very easy to get sucked into proprietary tools and systems. As a customer told me a few months back, “you don’t want to be right for six months and then wrong forever”. Adopting open standards and avoiding vendor lock-in should be a key architectural principle, allowing tools to be swapped in and out as better alternatives become available. From an infrastructure perspective, ensure that your infrastructure facilitates this flexibility by making data shareable via open standards (e.g. NFS or S3).
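To make the idea concrete, here is a minimal sketch of what "open standards, no lock-in" can look like in practice: training code refers to datasets by URI, and a small resolver maps each URI onto a storage backend. The function name and the returned dictionary shape are our own illustration, not any particular vendor's API; the point is that the same code path can serve an S3-compatible object store or a plain NFS mount.

```python
from urllib.parse import urlparse

def resolve_dataset(uri: str) -> dict:
    """Map a dataset URI onto a storage backend without binding the
    training code to any one vendor's client library (illustrative)."""
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        # Object storage: any S3-compatible client (boto3, MinIO, etc.)
        # can serve this bucket/key pair.
        return {"backend": "s3", "bucket": parsed.netloc,
                "key": parsed.path.lstrip("/")}
    if parsed.scheme in ("", "file"):
        # Plain POSIX path: typically an NFS mount shared across the cluster.
        return {"backend": "nfs", "path": parsed.path or uri}
    raise ValueError(f"unsupported scheme: {parsed.scheme}")

# The same training code can point at either backend:
print(resolve_dataset("s3://training-data/images/batch-001"))
print(resolve_dataset("/mnt/nfs/training-data/images/batch-001"))
```

Because the tools only ever see a URI, swapping a framework (or a storage tier) later means changing configuration, not rewriting data access code.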
- Is your initiative strategic or tactical?
This is a valuable question to ask right at the outset. If it’s a tactical solution, that’s fine but you need to recognize that shifting to a strategic solution down the road could mean discarding the tactical set-up (and associated costs) and starting all over again. If it’s a strategic plan from the outset, then you should invest and build on solid, proven foundations that enable linear scale and performance.
- AI is challenging to proof of concept (POC)
At small scale (single GPU), AI experiments can be serviced by pretty much any hardware. But this changes quickly with scale! Some of the challenges we’ve seen working on AI deployments include:
- Compute – Lack of sufficient compute resources in lab environments to effectively simulate and test large AI workloads.
- Data – As discussed above, accessing meaningful data sets is problematic and time consuming. So, we often see customers pivot to smaller data sets. This introduces significant risk! It’s important to recognize that “toy” datasets generate “toy” results. If your real-world requirement is a 1PB data set, testing on a 1TB data set exposes you to a lot of risk. Ideally you want to minimize the “gap” between your real-world requirements and the POC data set. For this reason, we would always recommend using real world data and real-world use cases wherever possible, even if it takes longer to do so.
- Synthetic testing – Customers often cannot be certain of all their future AI workload requirements, so it’s important to test a wide range of data types and sizes. Historically, storage solutions have been good at either small files or large files, rarely both. It’s therefore a good idea to look for solutions that can cope with the “chaos factor”, so that whatever the business throws at you, you can be confident your solution will deliver.
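One practical way to exercise both ends of the file-size spectrum in a POC is to generate a deliberately mixed synthetic workload. The sketch below (our own illustration; the size ranges and 50/50 split are arbitrary assumptions, not a benchmark standard) writes a blend of KB-range and MB-range files so a storage test is not accidentally skewed toward one profile.

```python
import os
import random
import tempfile

def generate_mixed_workload(root: str, n_files: int = 20, seed: int = 42):
    """Write a mix of small (KB-range) and large (MB-range) files so a
    POC stresses both small-file and large-file I/O paths."""
    rng = random.Random(seed)  # seeded for repeatable test runs
    sizes = []
    for i in range(n_files):
        # Roughly half small files, half large, to mimic the "chaos factor".
        if rng.random() < 0.5:
            size = rng.randint(1, 64) * 1024          # 1-64 KB
        else:
            size = rng.randint(1, 8) * 1024 * 1024    # 1-8 MB
        path = os.path.join(root, f"sample_{i:04d}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))  # incompressible payload
        sizes.append(size)
    return sizes

with tempfile.TemporaryDirectory() as d:
    sizes = generate_mixed_workload(d)
    print(f"{len(sizes)} files, {min(sizes) // 1024} KB smallest, "
          f"{max(sizes) // (1024 * 1024)} MB largest")
```

In a real POC you would point `root` at the storage under test and scale `n_files` and the size ranges toward your expected production mix, rather than the toy values used here.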
- Document your infrastructure’s key objectives
Typically, we see the key priorities of infrastructure being:
- To keep the GPUs busy; and
- To keep the data scientists busy (working on data science, not on systems integration, optimization and tuning).
- Identify your critical capabilities up front
Based on your pre-identified infrastructure objectives, it’s also worthwhile documenting the critical capabilities you require up front, so you don’t lose sight of them.
- Scale Capacity – Where do you want to start? And what do you want the ability to scale to?
- Scale GPUs – Where do you want to start? And what do you want the ability to scale to?
- AI as a Service – Do you have a single very well-defined training data set? Or are you trying to build an infrastructure that can support multiple as yet undefined AI workloads – “AI as a Service” if you will. We commonly see customers wanting to build capabilities for the latter. If your goal is similar your infrastructure should be able to cope with the chaos factor, such as random, ad hoc, concurrent workloads, different data types (large and small file) and changing business priorities.
Other considerations might include the ability to scale non-disruptively in cost-effective increments, to deliver linear scale and performance, and so on.
- Recognize the cost of data science specialists
The average cost of a data scientist in New York City, for example, is $150K a year, so you don’t want these resources standing around idle. Don’t make them wait for data. If you buy into the idea that “data is the world’s most valuable resource” and AI is the “fourth industrial revolution,” then organizations should invest appropriately to get maximum value out of these resources.
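The cost of idle time is easy to put a rough number on. The sketch below uses the $150K salary figure from above; the idle hours, working hours and weeks per year are illustrative assumptions you would replace with your own.

```python
def idle_cost(annual_salary: float, idle_hours_per_week: float,
              work_hours_per_week: float = 40,
              weeks_per_year: float = 48) -> float:
    """Rough annual cost of a data scientist waiting on infrastructure.

    All parameters are illustrative assumptions, not survey data.
    """
    hourly_rate = annual_salary / (work_hours_per_week * weeks_per_year)
    return hourly_rate * idle_hours_per_week * weeks_per_year

# A $150K data scientist who spends 10 hours a week waiting for data:
print(f"${idle_cost(150_000, 10):,.0f} per year")  # prints $37,500 per year
```

Even under these conservative assumptions, a quarter of one salary is burned on waiting; multiply that across a team and the case for infrastructure that keeps data flowing makes itself.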
- AI is a pipeline
About 80 percent of AI is data preparation, with the other 20 percent made up by training, and yet the vast majority of infrastructure focus is on training. As a result, we see that the data preparation stages of the pipeline are often neglected from an infrastructure perspective, with data being deployed, copied and duplicated on whatever infrastructure happens to be available. This is inefficient and potentially problematic from a number of perspectives. In our opinion, customers should seize the opportunity to build a robust, efficient, scalable, consolidated infrastructure to support the entire data pipeline as per the diagram below – what we call a “data hub”.
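The "data hub" idea can be sketched in a few lines: every pipeline stage reads from and writes to the same shared location, instead of each stage copying data onto its own silo. The stage functions and file names below are a toy illustration (the "training" step is a stand-in average, not a model fit); the structural point is the single shared `hub` path threaded through all stages.

```python
import json
import os
import tempfile

def ingest(hub: str):
    """Land raw records on the shared data hub."""
    raw = [{"id": i, "value": i * 1.5} for i in range(5)]
    with open(os.path.join(hub, "raw.json"), "w") as f:
        json.dump(raw, f)

def prepare(hub: str):
    """Clean/transform in place on the hub -- no copy to a second silo."""
    with open(os.path.join(hub, "raw.json")) as f:
        raw = json.load(f)
    prepared = [r for r in raw if r["value"] > 0]  # toy cleaning rule
    with open(os.path.join(hub, "prepared.json"), "w") as f:
        json.dump(prepared, f)

def train(hub: str):
    """Training reads from the same hub the earlier stages wrote to."""
    with open(os.path.join(hub, "prepared.json")) as f:
        data = json.load(f)
    # Stand-in for model training: just average the prepared values.
    return sum(r["value"] for r in data) / len(data)

with tempfile.TemporaryDirectory() as hub:
    ingest(hub)
    prepare(hub)
    print(train(hub))  # prints 3.75
```

With one consolidated store, the 80 percent of the work that is data preparation runs on the same infrastructure as the training it feeds, rather than on whatever happens to be lying around.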
- Recognize that few people understand the end-to-end solution
It’s important to recognize that delivering an integrated system capable of supercomputer levels of performance is not standard IT, and the nature of AI/ML/DL today is that few, if any, people understand the full end-to-end picture. Data scientists understand the software tools and frameworks but seldom understand the impact of these layers on the infrastructure. For example, TensorFlow (one of the most widely adopted open-source libraries) has hundreds of tunables that may have downstream impact on the infrastructure, and when it comes to optimizing and tuning the performance of the end-to-end solution there’s little to go on other than trial and error. Likewise, “infrastructure” itself is typically broken down into compute, storage and network specializations, where each individual specialist may have limited knowledge of the others. The difficulty of building an end-to-end solution capable of supercomputer performance should therefore not be underestimated.
- Acknowledge the size of the challenge for large organizations
The monolithic silos in most large organizations do not typically lend themselves to the close collaboration required to make a successful, strategic, integrated, linearly scalable AI infrastructure. Building a successful AI infrastructure requires close collaboration between multiple specialists in the compute, storage, network, data science, Kubernetes and Docker fields. Without that, it’s not going to happen.
Pure finds itself, with NVIDIA, at the centre of many of the world’s largest and most sophisticated AI deployments. It’s this collective learning, experience and IP, allied with best-in-class GPU technology from NVIDIA, storage from Pure and networking from Cisco/Arista, that we’ve leveraged to develop AIRI™ – the world’s first AI-ready infrastructure for AI @ scale. The purpose of AIRI is to minimize many of the risks, complexities and potential missteps discussed in this blog and give enterprises a state-of-the-art, pre-optimized and pre-integrated platform for AI that can be stood up in less than three weeks, allowing you to hit the ground running. It also ensures your solution has enterprise support (something a DIY solution does not) and shields you from the ongoing burden of open-source software integration and management.
Of course, it’s worth pointing out that for customers who still prefer the DIY approach, all the performance, simplicity and efficiency benefits of FlashBlade™ can still be leveraged to provide a rock-solid foundation for your AI venture.
Find out more about AIRI:
- Pure Storage And NVIDIA Announce AIRI Converged Infrastructure Reference Architecture
- Announcing AIRI: Industry’s First Integrated AI-Ready Infrastructure for Deploying Deep Learning at Scale
- Reference Architecture by Pure Storage and NVIDIA with Arista 7060X Switch
- How to cross the AI chasm: from vision to an AI-first business