Large scale distributed training has become an essential element to scaling the productivity for ML engineers. Today, ML models are getting larger and more complex in terms of compute and memory requirements. The amount of data we train on at Facebook is huge. In this talk, we will learn about the Distributed Training Platform to support large scale data and model parallelism. We will touch base on Distributed Training support for PyTorch and how we are offering a flexible training platform for ML engineers to increase their productivity at the Facebook scale.
Overview and Author Bio
Distributed Training Platform at Facebook
Senior Engineering Manager | Facebook
Software Engineer | Facebook