AWS launches Trainium2 chips for advanced AI model training
- AWS unveiled its Trainium2 chips capable of training large language models at the re:Invent conference.
- The chips deliver four times the performance of their predecessors, with a 16-chip instance providing up to 20.8 petaflops of compute.
- AWS aims to continue innovating in AI computing with upcoming Trainium3 chips expected in late 2025.
In late 2024, Amazon Web Services (AWS) announced the general availability of its Trainium2 chips at its annual re:Invent conference. The chips, first introduced a year earlier, are purpose-built for training and deploying large language models (LLMs). AWS says Trainium2 is four times faster than its predecessor, with a single 16-chip instance delivering up to 20.8 petaflops of compute performance, positioning the offering as highly competitive in the cloud computing space for LLM workloads.

The new chips also raise token-generation throughput for very large models, including Meta's 405-billion-parameter Llama model; AWS claims triple the throughput of comparable offerings from other leading cloud providers. To house the chips, AWS is deploying EC2 Trn2 UltraServers, each interconnecting 64 chips for 83.2 peak petaflops of compute, an attractive option for customers training large models on extensive datasets.

In collaboration with Anthropic, an LLM provider, AWS plans to build a massive cluster of UltraServers comprising hundreds of thousands of Trainium2 chips. The cluster will be used to train Anthropic's models and is expected to deliver five times the computational power of Anthropic's previous training cluster; AWS predicts it will be the largest AI compute cluster reported to date.

Despite these advances, AWS acknowledges that Nvidia's GPUs remain in high demand, and those current GPUs are themselves expected to be outmatched by Nvidia's upcoming Blackwell chips, scheduled for release early next year.
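The instance and UltraServer figures quoted above are internally consistent. A minimal sanity check, assuming compute scales linearly with chip count (an assumption for illustration, not an AWS claim):

```python
# Figures quoted in the article: a 16-chip Trn2 instance reaches
# 20.8 petaflops, and a 64-chip UltraServer reaches 83.2 peak petaflops.
INSTANCE_CHIPS = 16
INSTANCE_PETAFLOPS = 20.8
ULTRASERVER_CHIPS = 64

# Implied per-chip performance.
per_chip = INSTANCE_PETAFLOPS / INSTANCE_CHIPS  # 1.3 petaflops per chip

# Scaling linearly to 64 chips reproduces the UltraServer figure.
ultraserver_peak = per_chip * ULTRASERVER_CHIPS

print(f"per chip: {per_chip:.1f} PF, UltraServer peak: {ultraserver_peak:.1f} PF")
```

This works out to 1.3 petaflops per chip and 83.2 petaflops per 64-chip UltraServer, matching the announced numbers.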
Looking ahead, AWS has also announced Trainium3, which is anticipated to deliver another fourfold increase in performance, with availability expected in late 2025. This rapid development cycle underscores AWS's commitment to customers who need powerful infrastructure for training expansive AI models, and signals intensifying competition in the AI and cloud computing markets.