MirroredStrategy can help us scale up to about 8 GPUs per compute instance; however, we are likely to need 16 instances with 8 GPUs each to train ImageNet in a reasonable time (see Jeremy Howard’s post on Training Imagenet in 18 Minutes). So where do we go from here?
MultiWorkerMirroredStrategy: This strategy can use not only multiple GPUs, but also multiple GPUs across multiple computers. To configure it, all we have to do is define a TF_CONFIG environment variable with the right addresses and run the exact same code in each compute instance.
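A minimal sketch of that configuration might look as follows; the worker IP addresses, the port, and the partition index are placeholders you would adapt to your own cluster:

```r
library(tensorflow)

# Placeholder: change this index on each compute instance so that
# every worker in the cluster is uniquely identified.
partition <- 0

# TF_CONFIG describes the whole cluster and this worker's role in it;
# the addresses below stand in for your actual machines.
Sys.setenv(TF_CONFIG = jsonlite::toJSON(list(
  cluster = list(
    worker = c("10.100.10.100:10090", "10.100.10.101:10090")
  ),
  task = list(type = "worker", index = partition)
), auto_unbox = TRUE))

strategy <- tf$distribute$MultiWorkerMirroredStrategy()

with(strategy$scope(), {
  # define, compile, and fit the model here, just as before
})
```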
Please note that partition must change for each compute instance to uniquely identify it, and that the IP addresses also need to be adjusted. In addition, data should point to a different partition of ImageNet, which we can retrieve with pins; although, for convenience, alexnet contains similar code under alexnet::imagenet_partition(), as sketched below. Other than that, the code that you need to run in each compute instance is exactly the same.
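For instance, loading this worker's slice of the data might look roughly like the sketch below; we are assuming imagenet_partition() accepts the partition index, so check the alexnet package for the actual signature:

```r
library(alexnet)

# Each worker loads only the slice of ImageNet assigned to it.
data <- imagenet_partition(partition)

# Alternatively, the slice could be fetched with pins from a board you
# have registered yourself (the names here are hypothetical):
# data <- pins::pin_get(paste0("imagenet-", partition), board = "mycompany")
```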
However, if we were to use 16 machines with 8 GPUs each to train ImageNet, it would be quite time-consuming and error-prone to run code manually in each R session. So instead, we should make use of cluster-computing frameworks, like Apache Spark with barrier execution. If you are new to Spark, there are many resources available at sparklyr.ai. To learn specifically about running Spark and TensorFlow together, watch our Deep Learning with Spark, TensorFlow and R video.
Putting it all together, training ImageNet in R with TensorFlow and Spark looks as follows:
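What follows is a condensed sketch rather than a verbatim recipe: the Spark master URL, executor sizing, and the body of the barrier function are assumptions you would adapt to your cluster.

```r
library(sparklyr)

# Placeholders: request 16 executors with 8 cores each, one executor
# per compute instance; adjust the master URL for your cluster.
sc <- spark_connect(master = "yarn", config = list(
  "sparklyr.shell.num-executors" = 16,
  "sparklyr.shell.executor-cores" = 8
))

# Barrier execution starts all 16 tasks at once and hands each task the
# addresses of its peers, exactly the information TF_CONFIG requires.
sdf_len(sc, 16, repartition = 16) %>%
  spark_apply(function(df, barrier) {
    # Assumed pattern: build TF_CONFIG from barrier$address and
    # barrier$partition, then run the MultiWorkerMirroredStrategy
    # training code shown earlier against this worker's data partition.
    tibble::tibble(result = "done")
  }, barrier = TRUE, columns = c(result = "character")) %>%
  collect()
```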
We hope this post gave you a reasonable overview of what training on large datasets in R looks like. Thanks for reading along!