DevConf.US 2021 is the 5th annual, free, Red Hat-sponsored technology conference for community-project and professional contributors to Free and Open Source technologies, coming to Boston!
Is a single GPU node no longer enough for your machine learning (ML) workflow? As datasets and models grow, the demand for more powerful and efficient GPUs is rapidly increasing. When a single GPU is no longer adequate for an ML use case, the immediate response is often to throw a bigger GPU at the problem, an expensive proposition. An alternative to upgrading the GPU hardware is to distribute the ML workload, either across several GPUs on one node or across multiple nodes, each containing one or more GPUs. The latter is especially attractive, since a single machine can fit only so many GPUs.

In this talk, we will explore how to distribute a machine learning workflow across several GPU-equipped nodes in a cloud environment. We will use PyTorch and Kubeflow to distribute the ML workload and carry out the training in Open Data Hub. By the end of this talk, attendees will understand how to overcome the GPU limits of single-node training by taking advantage of GPUs on other machines, thereby maximizing GPU utilization in an open cloud environment.
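The core idea behind distributing training this way is synchronous data parallelism: each worker computes gradients on its own shard of the data, the gradients are averaged across workers (an all-reduce), and every replica applies the same update. The toy sketch below illustrates that pattern using only the Python standard library; it is an assumption-free simplification, not the PyTorch or Kubeflow API, and fits a one-parameter linear model y = w*x by averaged-gradient SGD across two simulated workers.

```python
# Toy illustration of synchronous data parallelism, the pattern that
# PyTorch's distributed training implements with real all-reduce
# communication between GPU workers. Here the "workers" are just
# shards of a Python list and the "all-reduce" is a plain average.

def local_gradient(w, shard):
    # Gradient of mean squared error over this worker's data shard.
    n = len(shard)
    return sum(2 * (w * x - y) * x for x, y in shard) / n

def distributed_step(w, shards, lr=0.01):
    # "All-reduce": average per-worker gradients, then every replica
    # applies the identical update, keeping model copies in sync.
    grads = [local_gradient(w, s) for s in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad

# Data generated from y = 3x, split across two simulated workers.
data = [(float(x), 3.0 * float(x)) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w = 0.0
for _ in range(200):
    w = distributed_step(w, shards)
print(round(w, 3))  # converges toward the true weight 3.0
```

In a real deployment the sharding is handled by a distributed sampler, the gradient averaging by the collective-communication backend, and the worker pods are launched and supervised by Kubeflow's training operator, but the synchronization logic is the same.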