We’ve all witnessed the hype around Artificial Intelligence, and the Financial Services (FS) industry is no exception. Many are relatively new to the topic and there is much to learn. At a detailed level the maths, models and data science involved are very complex and can be daunting. That said, it’s important not to get lost in these details. Whilst they play a central part, successfully delivering AI @ scale also has several other major dependencies and challenges. It’s these on which this blog focuses.
With AI @ scale we are talking supercomputer levels of performance, so it is no surprise that standard, generic legacy IT (and more specifically standard, generic legacy storage) is simply not up to the task. As a result, many FS organizations are in the process of evaluating new technology for this purpose. Having been through many AI engagements, it’s clear to us that success requires a good high-level understanding of AI and an understanding of the technical requirements (or critical capabilities) that the planned AI project will place on the infrastructure. This is by no means trivial, as this knowledge and experience are scarce.
The purpose of this blog is to share the top 10 lessons from our experiences to date:
Training AI algorithms requires huge volumes of data, and invariably one of the key challenges we see within FS organizations is simply getting access to the data needed to train the algorithm, which means navigating Information Security policies and teams. As a result, we consistently see internal data sourcing take much longer than expected. It’s worth pointing out that this is for on-premises deployments in the customer’s own datacenter. If you are considering the public cloud, you can expect this process to take significantly longer. So you should identify and request the data you need as early in your project as possible.
Tools and frameworks are evolving quickly, and it’s very easy to get sucked into proprietary tools and systems, so it’s worth considering a few principles here. As a customer told me a few months back, “you don’t want to be right for six months and then wrong forever”. Adopting open standards and avoiding vendor lock-in should be a key architectural principle, allowing tools to be swapped in and out as better alternatives become available. From an infrastructure perspective, you should ensure that your infrastructure supports this flexibility by making sure data can be shared via open standards (e.g. NFS or S3).
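To make the “swap tools in and out” principle concrete, here is a minimal Python sketch of keeping training code independent of where the data lives. Everything in it is illustrative rather than taken from any particular product: `DatasetStore`, `FileStore`, `InMemoryObjectStore` and `total_bytes` are hypothetical names, the file-based store assumes an NFS export is already mounted as an ordinary directory, and the in-memory class merely stands in for an S3-style object store.

```python
from pathlib import Path
from typing import Dict, Iterator, Protocol


class DatasetStore(Protocol):
    """Hypothetical minimal interface that the training code depends on."""

    def list_keys(self, prefix: str) -> Iterator[str]: ...

    def read_bytes(self, key: str) -> bytes: ...


class FileStore:
    """Reads from a directory tree, e.g. a locally mounted NFS export."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def list_keys(self, prefix: str) -> Iterator[str]:
        # Treat the prefix as a sub-directory and walk it recursively.
        base = self.root / prefix
        for path in sorted(base.rglob("*")):
            if path.is_file():
                yield path.relative_to(self.root).as_posix()

    def read_bytes(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectStore:
    """Stand-in for an S3-style object store (a dict of key -> bytes).

    Assumption: a real deployment would wrap an S3 client here instead.
    """

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = dict(objects)

    def list_keys(self, prefix: str) -> Iterator[str]:
        return iter(sorted(k for k in self.objects if k.startswith(prefix)))

    def read_bytes(self, key: str) -> bytes:
        return self.objects[key]


def total_bytes(store: DatasetStore, prefix: str) -> int:
    """Example consumer: works with any store that satisfies the interface."""
    return sum(len(store.read_bytes(key)) for key in store.list_keys(prefix))
```

Because `total_bytes` only depends on the two-method interface, the same pipeline code could be pointed at a file share today and an object store later without being rewritten, which is the kind of flexibility the principle above is aiming for.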
Whether your AI project is a tactical or a strategic solution is a valuable question to ask right at the outset. If it’s a tactical solution, that’s fine, but you need to recognize that shifting to a strategic solution down the road could mean discarding the tactical set-up (and its associated costs) and starting all over again. If it’s a strategic plan from the outset, then you should invest and build on solid, proven foundations that enable linear scale and performance.
At small scale (single GPU), AI experiments can be serviced by pretty much any hardware. But this changes quickly with scale! Some of the challenges we’ve seen working on AI deployments include:
Typically, we see the key priorities of infrastructure being:
Based on your pre-identified infrastructure objectives, it’s also worthwhile documenting the critical capabilities you require up front, so you don’t lose sight of them.
Other considerations might include the ability to scale non-disruptively in cost-effective increments and to deliver linear scale and performance.
The average cost of a data scientist in New York City, for example, is $150K a year, so you don’t want these resources standing around idle. Don’t make them wait for data. If you buy into the idea that “data is the world’s most valuable resource” and AI is the “fourth industrial revolution,” then organizations should invest appropriately to get maximum value out of these resources.
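A rough back-of-the-envelope calculation shows how quickly waiting for data adds up. The sketch below uses the $150K figure quoted above; the 40-hour week, 52-week year, team size and idle hours are illustrative assumptions, not measurements, and `idle_cost_per_year` is a hypothetical helper.

```python
# Assumption: a 40-hour working week over 52 weeks (2,080 hours a year).
HOURS_PER_YEAR = 40 * 52


def idle_cost_per_year(
    annual_cost: float,
    team_size: int,
    idle_hours_per_week: float,
    weeks_per_year: int = 52,
) -> float:
    """Estimate the yearly cost of a team's time spent waiting for data."""
    hourly_cost = annual_cost / HOURS_PER_YEAR
    return hourly_cost * idle_hours_per_week * weeks_per_year * team_size


if __name__ == "__main__":
    # Hypothetical scenario: 10 data scientists each lose 5 hours a week.
    cost = idle_cost_per_year(annual_cost=150_000, team_size=10, idle_hours_per_week=5)
    print(f"Estimated idle cost per year: ${cost:,.0f}")
```

Under these assumed numbers, one eighth of each working week is lost, which comes to roughly $187,500 a year for the team. Your own figures will differ, but the calculation is a useful sanity check when sizing infrastructure investment.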
About 80 percent of AI is data preparation with the other 20 percent made up by training, and yet the vast majority of focus, when it comes to infrastructure, is on training. As a result, we see the data preparation steps of building the pipeline are often neglected from an infrastructure perspective, with data being deployed, copied and duplicated on whatever infrastructure may be available. This is inefficient and potentially problematic from a number of perspectives. In our opinion, customers should seize the opportunity to build a robust, efficient, scalable consolidated infrastructure to support the entire data pipeline as per the diagram below – what we call a “data hub”.
It’s important to recognize that delivering an integrated system capable of supercomputer levels of performance is not standard IT. The nature of AI/ML/DL today is that few people, if any, are able to understand the full end-to-end picture. Data scientists understand the software tools and frameworks but seldom understand the impact of these layers on the infrastructure. For example, TensorFlow (one of the most widely adopted open source libraries) has hundreds of tunables that may have some downstream impact on the infrastructure, and when it comes to optimizing and tuning the performance of the end-to-end solution there is little to go on other than trial and error. Likewise, “infrastructure” itself is typically broken down into compute, storage and network specializations, where each specialist may have limited knowledge of the others’ domains. The difficulty of building an end-to-end solution capable of supercomputer performance should therefore not be underestimated.
The monolithic silos in most large organizations do not typically lend themselves to the close collaboration required to make a successful, strategic, integrated, linearly scalable AI infrastructure. Building a successful AI infrastructure requires close collaboration between multiple specialists in the compute, storage, network, data science, Kubernetes and Docker fields. Without that, it’s not going to happen.
Pure, together with Nvidia, finds itself at the centre of many of the world’s largest, most sophisticated AI deployments. It’s this collective learning, experience and IP, allied with best-in-class GPU technology from Nvidia, storage from Pure and networking from Cisco/Arista, that we’ve leveraged to develop AIRI™, the world’s first AI-ready infrastructure for AI @ scale. The purpose of AIRI is to minimize many of the risks, complexities and potential missteps discussed in this blog and to give enterprises a state-of-the-art, pre-optimized and pre-integrated platform for AI that can be stood up in less than 3 weeks, allowing you to hit the ground running. It also ensures that your solution has enterprise support (something that a DIY solution does not) and shields you from the ongoing burden of open source software integration and management.
Of course, it’s worth pointing out for those customers who still prefer the DIY approach, all the performance, simplicity and efficiency benefits of FlashBlade™ can still be leveraged to provide a rock-solid foundation for your AI venture.