Habana Labs' Gaudi Now Available on AWS
NEWS
In October 2021, AWS announced the availability of Amazon EC2 DL1 instances, virtual servers powered by Gaudi accelerators from Habana Labs. Founded in 2016 and acquired by Intel in 2019 for US$2 billion, Israel-based Habana Labs designs programmable Machine Learning (ML) accelerators for model inference and training in the data center and cloud. Intended for ML training, Amazon’s EC2 DL1 instances feature eight Gaudi accelerators with 32 Gigabytes (GB) of High Bandwidth Memory (HBM) per accelerator, 768 gibibytes of system memory, 400 Gigabits per second (Gbps) of networking throughput, and four terabytes of local Non-Volatile Memory Express (NVMe) storage.
In terms of software, developers can build their ML models with Habana’s SynapseAI Software Development Kit (SDK). The SDK features a graph compiler and runtime, kernel library, firmware, drivers, and tools. Most importantly, the SynapseAI SDK is integrated with TensorFlow and PyTorch, the leading ML frameworks, so that models can be migrated from existing Graphics Processing Unit (GPU)-based services with minimal code changes.
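As a rough illustration of what a small migration change can look like on the PyTorch side, the sketch below chooses a device string depending on whether Habana's PyTorch bridge appears to be installed. The package name `habana_frameworks` and the device string `"hpu"` are assumptions taken from Habana's public PyTorch integration rather than from this article, and the helper `pick_device` is hypothetical.

```python
import importlib.util


def pick_device(prefer_gaudi: bool = True, fallback: str = "cuda") -> str:
    """Return a PyTorch-style device string for a training script.

    Assumption: Habana's PyTorch bridge installs a top-level package
    named `habana_frameworks` and exposes Gaudi as the "hpu" device.
    """
    # find_spec returns None when a top-level package is not installed,
    # so this check works on machines without any Habana software.
    has_habana = importlib.util.find_spec("habana_frameworks") is not None
    if prefer_gaudi and has_habana:
        return "hpu"
    # Otherwise keep whatever device the existing GPU script used.
    return fallback


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

A training script could then pass `pick_device()` wherever it previously hard-coded `"cuda"`. A real migration may need further changes that this sketch does not cover.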
Cloud ML Training Is Becoming More Costly
IMPACT
In the early days of ML, most cloud ML applications revolved around machine vision. ML has become more accessible in recent years thanks to open-source frameworks, high-performing training and inference hardware, and third-party optimization software and services. As a result, ML is accelerating digital transformation across multiple verticals and applications. Major ML applications hosted in the cloud include natural language understanding in virtual assistants, credit risk assessment and regulatory compliance in banking and finance, threat intelligence and network protection in cybersecurity, generative design in smart manufacturing, service assurance and network optimization in telecommunications, and route scheduling and optimization in transport and logistics. ABI Research estimates that the cloud ML training processor market is worth US$4.2 billion in 2021 and forecasts that it will grow to US$6.9 billion in 2026, with a compound annual growth rate of 13.1%.
More importantly, ML models are becoming larger and more complex, and they consume considerable computing resources. Because such models take longer to train, the cost of training and maintenance has also increased significantly. As a result, ML developers often need to dedicate more resources to maintaining their models while facing a constant battle against limited research and development budgets.
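For readers who want to check growth figures like these, the following is a minimal sketch of the standard compound annual growth rate formula, (end / start)^(1 / periods) - 1. The function name `cagr` is illustrative. Applied to the rounded endpoints quoted above, it gives about 10.4% over five annual periods and about 13.2% over four, so the reported 13.1% may reflect unrounded figures or a different base period than the rounded numbers suggest.

```python
def cagr(start_value: float, end_value: float, periods: int) -> float:
    """Compound annual growth rate between two values over `periods` years."""
    if start_value <= 0 or end_value <= 0 or periods <= 0:
        raise ValueError("values and periods must be positive")
    return (end_value / start_value) ** (1.0 / periods) - 1.0


if __name__ == "__main__":
    # Rounded market sizes in US$ billions, as quoted in the text.
    start, end = 4.2, 6.9
    for years in (4, 5):
        print(f"{years} periods: {cagr(start, end, years):.1%}")
```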
Intel and AWS hope that the Amazon EC2 DL1 instance will be a strong fit for cloud ML developers looking for greater Artificial Intelligence (AI) training cost-efficiency. By combining AI-customized programmable Tensor Processor Cores, an AI-optimized general matrix multiplication engine, and 32 GB of on-chip HBM2 memory, Gaudi delivers high-efficiency AI compute. In addition, the new instance derives both performance and cost benefits from the native integration of ten 100 Gigabit Ethernet ports with Remote Direct Memory Access over Converged Ethernet (RoCE) on every Gaudi, which is intended to eliminate networking bottlenecks within the DL1 server. As a result of these and other customizations, AWS and Habana Labs claim that DL1 instances provide up to 40% better price performance for training deep-learning models than current-generation GPU-based AWS EC2 instances. The instance also offers 400 Gbps of networking throughput and connectivity to Amazon’s Elastic Fabric Adapter and Elastic Network Adapter for applications that require access to high-speed networking.
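To put the price-performance claim in budget terms, the sketch below converts a price-performance gain into the implied cost reduction for a fixed training job. It assumes price performance means training work completed per dollar, which the article does not define, and the function name is hypothetical. Under that assumption, a gain of up to 40% corresponds to a cost reduction of up to roughly 28.6% for the same work, not 40%.

```python
def cost_saving_from_price_performance(gain: float) -> float:
    """Fractional cost reduction for the same work, given a price-performance gain.

    Assumption: price performance = work per dollar, so a gain of `gain`
    (0.40 for "40% better") scales cost for fixed work by 1 / (1 + gain).
    """
    if gain <= -1.0:
        raise ValueError("gain must be greater than -1")
    return 1.0 - 1.0 / (1.0 + gain)


if __name__ == "__main__":
    saving = cost_saving_from_price_performance(0.40)
    print(f"Implied cost reduction for the same training job: {saving:.1%}")
```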
An Increasingly Competitive Landscape
RECOMMENDATIONS
Currently, NVIDIA’s GPUs remain the go-to chipset option for ML training in the cloud. Intel’s partnership with AWS, the most popular public cloud service provider, enables it to compete head-to-head with NVIDIA and other prominent ML training chipset vendors. This is critical for Intel, a company that is trying to expand its presence in the ML training market, since Amazon’s EC2 DL1 is AWS’s first non-GPU ML training service.
Nonetheless, Gaudi is not going to be the only non-GPU option on AWS. In December 2020, AWS announced Trainium, its own custom ML training chipset. Trainium supports popular ML frameworks, including TensorFlow, PyTorch, and MXNet, via AWS’s Neuron SDK, and developers can migrate to AWS Trainium from GPU-based instances with minimal code changes. Amazon SageMaker support will be native to Trainium as well.
Facing such competitive pressure, Intel and Habana Labs need to move beyond cloud service providers. ABI Research defines cloud ML as all ML training and inference workloads hosted in a cloud environment. This can include public clouds operated by cloud service providers; private clouds exclusively owned, maintained, and operated by private enterprises for their own needs; or telco clouds deployed by telecommunication service providers for their core network, information technology, and edge computing workloads.
Not surprisingly, Habana Labs has been actively partnering with hyperscalers and high-performance computing developers. For example, in 2019, the company collaborated with Facebook to foster awareness and cultivate the developer community via Facebook’s open-source organizations, such as the Open Compute Project (OCP) and Telecom Infra Project. The company even launched the Habana HL-205, a mezzanine card compliant with OCP specifications. In addition, in April 2021, the San Diego Supercomputer Center at the University of California San Diego selected Gaudi as the AI compute chipset for its Voyager supercomputer.