Deep learning has become increasingly important to academic research, and cloud computing makes it cheaper, more efficient, and more accessible to institutions of all sizes. But most research into using cloud computing for deep learning has focused on on-demand servers, which allow researchers to reserve processing units for their exclusive use for as long as they need them.
A team of computer scientists at Worcester Polytechnic Institute (WPI) set out to investigate whether it is feasible to train deep learning models on transient servers, which are cheaper but can be pre-empted, or revoked, at any time. Using Google's pre-emptible virtual machine (VM) instances, Tian Guo and Robert Walls, both Assistant Professors of Computer Science at WPI, teamed up with Ph.D. student Shijian Li and their collaborator Lijie Xu to conduct one of the first large-scale empirical studies on how to use transient servers to gain the benefits of distributed training while avoiding the challenges of revocation. "Our high-level goal is to provide more efficient training for deep learning researchers. We know that training can take a long time and cost a lot of money if you don't do it carefully," says Guo.
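The defining risk of a transient server is that the cloud provider can take it back mid-training. A common way to tolerate this, sketched below as a minimal plain-Python illustration, is to checkpoint training state periodically so a replacement worker can resume rather than restart; the file paths, intervals, and placeholder model state here are assumptions for illustration, not details of the WPI team's implementation.

```python
"""
Hypothetical sketch: periodic checkpointing so a training job can survive
revocation of a transient (pre-emptible) server. Names and intervals are
illustrative assumptions, not the WPI study's actual setup.
"""
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"   # assumed location on durable storage
CHECKPOINT_EVERY = 10                # assumed checkpoint interval (steps)
TOTAL_STEPS = 100                    # stand-in for a real training schedule


def load_checkpoint():
    """Resume from the last saved state if a prior run was revoked."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "weights": 0.0}   # placeholder model state


def save_checkpoint(state):
    """Write state atomically so a revocation mid-write leaves a valid file."""
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_PATH)


def train_step(state):
    """Stand-in for one gradient step; a real job would update model weights."""
    state["weights"] += 0.01
    state["step"] += 1
    return state


def main():
    state = load_checkpoint()
    while state["step"] < TOTAL_STEPS:
        state = train_step(state)
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)   # bounds lost work if the VM is revoked
    save_checkpoint(state)
    print(f"finished at step {state['step']}")


if __name__ == "__main__":
    main()
```

With checkpoints written every few steps, a revoked VM costs at most the work done since the last save, which is what makes the lower price of transient servers attractive despite the interruptions.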