A Win for Intel and AWS
| NEWS |
At re:Invent 2020, AWS introduced compute instances (virtual servers) powered by two new ML training chipsets: Gaudi from Intel's Habana Labs and AWS's own Trainium. These chipsets will be available by mid-to-late 2021. Their inclusion is not only a success story for Intel in its ongoing competition with NVIDIA in the cloud ML chipset market, but also another showcase of AWS's chipset development strength and its desire for tighter vertical integration. It also marks the first time a developer can distribute multiple training frameworks across multiple hardware architectures using a single AI engine.
At the moment, the majority of ML workloads on AWS are powered by AI chipsets from NVIDIA and, increasingly, by Inferentia, AWS's own inference chipset. COVID-19 has exposed the weaknesses of a single-vendor strategy: the disruption in the global supply chain has left many chipset vendors struggling to fulfill orders and has caused ML chipset prices to skyrocket. The inclusion of new AI chipsets gives ML developers more choices, diversifies supply chain risk, and allows AWS to continue its trend of lowering costs for ML developers.
Focus on Distributed Training and MLOps
| IMPACT |
As more enterprises migrate their ML workloads to the cloud, especially complex workloads built on large datasets, AWS needs to stay competitive by making these workloads cost-efficient and easy to manage and deploy. This means more focus on ML Operations (MLOps), and the upgrades AWS brings to Amazon SageMaker play a big role in enabling that. New features aim to make MLOps easier: SageMaker Data Wrangler for data preparation; SageMaker Feature Store for feature storage, synchronization, and sharing; SageMaker Pipelines for continuous integration and continuous delivery (CI/CD); and SageMaker Clarify for ML model bias monitoring. Tighter integration of Amazon SageMaker with AWS databases and data warehouse solutions also makes AI datasets more accessible to developers.
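To make the feature storage and sharing workflow concrete, below is a minimal sketch of registering and ingesting features with the SageMaker Python SDK's Feature Store module. The feature group name, DataFrame columns, S3 bucket, and IAM role are invented placeholders; the FeatureGroup calls follow the SDK's documented API.

```python
# Hedged sketch: registering features in SageMaker Feature Store.
# All names (bucket, role, feature group, columns) are placeholders.
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role

df = pd.DataFrame({
    "customer_id": [1, 2],              # record identifier
    "event_time": [1607040000.0] * 2,   # required event-time feature
    "avg_order_value": [42.0, 17.5],
})

fg = FeatureGroup(name="customer-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer the schema from the DataFrame
fg.create(
    s3_uri="s3://my-bucket/feature-store",  # offline store location (placeholder)
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,  # also sync records to the low-latency online store
)
fg.ingest(data_frame=df, max_workers=1, wait=True)
```

Enabling the online store alongside the offline S3 store is what allows the same features to be shared between low-latency inference and batch training, which is the synchronization use case AWS highlights.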
Nonetheless, the key highlight is AWS's attempt to further accelerate distributed training through data parallelism. The current industry-leading distributed training framework, Horovod, was open sourced by Uber in 2017; it lets developers scale AI training across multiple Graphics Processing Units (GPUs) with a single script for TensorFlow, Keras, PyTorch, and MXNet. Horovod has the GPUs process the training data, share gradient updates among themselves, and update the model in a decentralized manner. However, this creates a huge bottleneck in network bandwidth as AI datasets increase in size. AWS's solution adds CPUs to the mix and requires the GPUs to communicate only with the CPUs: once the GPUs have finished processing the training data, the updates are sent to the CPUs, which update the AI model and share the result back with the GPUs. This minimizes communication among GPUs and lets them focus mainly on data processing.
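For reference, here is a minimal sketch of the decentralized Horovod approach described above, using PyTorch. The model, dataset, and hyperparameters are invented placeholders; the Horovod calls (hvd.init, DistributedOptimizer, broadcast_parameters) reflect the framework's actual API. The gradient averaging in the backward pass is exactly the GPU-to-GPU traffic that AWS's CPU-mediated approach aims to reduce.

```python
# Hedged sketch: Horovod data-parallel training, one process per GPU.
# Launch with, e.g., `horovodrun -np 4 python train.py`.
import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = nn.Linear(128, 10).cuda()        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all GPUs (allreduce).
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Placeholder dataset; each rank reads only its own shard.
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)

loss_fn = nn.CrossEntropyLoss()
for inputs, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs.cuda()), targets.cuda())
    loss.backward()   # gradient allreduce among GPUs happens here
    optimizer.step()  # every GPU applies the same averaged update
```

Because every backward pass triggers an allreduce among all GPUs, the volume of inter-GPU traffic grows with both model size and worker count, which is the bandwidth bottleneck the paragraph above describes.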
Shortening ML model development time is the competitive moat that AWS aims to bring to the table. Granted, other cloud competitors offer or will offer a similar set of capabilities, and AI chipset vendors themselves are working closely with the developer community to make their hardware solutions more accessible to ML developers. A prime example is Intel's oneAPI strategy, which allows developers to deploy their workloads across Intel's portfolio of heterogeneous chipsets in the data center environment. Developers need to maintain only a single code base, using Data Parallel C++ as the programming language for cross-architecture programming.
Not Just the Cloud or the Edge—It Is Both
| RECOMMENDATIONS |
While 2020 was a stressful year for many industries, it has been a great year for public cloud companies. Benefiting greatly from remote working and a distributed workforce, AWS has been growing at around 30% year-over-year, and the growth figure is even higher for many of its smaller competitors. Instead of just talking about the cloud, AWS also took the opportunity to launch several edge AI–focused services. Amazon SageMaker Edge Manager can manage multiple edge AI models on a single edge device. The company is also doubling down on industrial and manufacturing applications through Amazon Monitron and Amazon Lookout for Equipment for predictive maintenance, as well as Amazon Lookout for Vision for machine vision applications at the edge. AWS has also introduced the AWS Panorama Appliance, an edge AI gateway designed for legacy Internet Protocol (IP) cameras. As a public cloud vendor, AWS has yet to bring innovative solutions to edge AI deployment platforms and custom tools when compared with the likes of Edge Impulse, SensiML, Laneyes, and Blaize, so companies working on edge AI development and deployment still have a strong value proposition. ABI Research expects more public cloud vendors to follow this strategy and launch more edge AI–focused services.