Pengfei Huo, Sr. Network Architect - ByteDance
S. Kamran Naqvi, Chief Network Architect - Broadcom
Large-scale AI training clusters, hosting tens of thousands of GPUs, are designed to deliver unparalleled computational power for a variety of AI workloads. To fully unleash this power, a highly efficient network fabric connecting these GPUs is essential.
The fabric should support extensive GPU scale-out while maintaining high performance, handle diverse parallel workloads with efficient multi-tenancy and job segregation, be resilient against link failures and topology changes to reduce the need for checkpoint-recovery intervention, and be grounded in an open ecosystem for innovation and adaptability.
In this presentation, we will explain how the Scheduled Fabric addresses these essential requirements. We will also describe how ByteDance has benchmarked the fabric in its AI clusters, examining its measured performance, deployment plans, and thoughts on broader collaboration within the community.