Choosing the right instance on a public cloud for model training is not an easy task. Hundreds of different virtual machines are available, with a wide variety of core counts, RAM, disk types, network speeds, and, of course, GPUs.
The latter often becomes the differentiating factor for choosing one VM over another. The goal of this work is to scrutinize how neural network architecture influences training performance and cost on different cloud VMs.
We compared the BERT, Mask R-CNN, and DLRM architectures on AWS EC2 instances and showed that architecture and model implementation can cause significant variation in training time and cost, with different optimal configurations for different architectures.
We showed that a simple rule of thumb (e.g., always choosing the latest generation or the most performant GPU) can increase training cost and time by an order of magnitude in the worst-case scenario.
We also showed that the components surrounding the GPU (e.g., RAM and CPU) can cause significant performance bottlenecks and should be considered carefully in conjunction with the architecture and implementation of the model being trained.
The overall results show that even in a single-GPU training setup, costs can vary significantly. This encourages our group to proceed further with this research, adding driver stacks, multi-GPU instances, and clusters to future work.