Distributed training for the CPU backend is not supported. Distributed training support is provided only with the Intel® Nervana™ Neural Network Processor for Training (Intel® Nervana™ NNP-T). Run nnptool and see the system software documentation for details.
Distributed training with Docker/Kubernetes is available with the Intel® Nervana™ Neural Network Processor for Training (Intel® Nervana™ NNP-T). See the User Guide for details.
Distributed training for deep learning involves defining a per-node application workload (such as a distributed TensorFlow script) and leveraging infrastructure (container-, Docker-, or Kubernetes-specific) to spawn processes on many nodes that communicate and work together.
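As a concrete illustration of such a per-node workload, the sketch below uses TensorFlow's multi-worker distribution strategy. The toy model and dataset are assumptions for illustration only, and the launching infrastructure is assumed to set the TF_CONFIG environment variable on each node so that every process knows the cluster layout and its own role in it.

```python
# Minimal sketch of a per-node distributed TensorFlow workload. The launching
# infrastructure (e.g. a Kubernetes job) is assumed to set TF_CONFIG on every
# node; without it, this runs as a single worker.
import tensorflow as tf

# The strategy reads TF_CONFIG and keeps model replicas in sync across workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset():
    # Toy data; a real job would read and shard its input per worker.
    x = tf.random.uniform((1024, 32))
    y = tf.random.uniform((1024, 1))
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

with strategy.scope():
    # The same model is built on every node; gradients are averaged each step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

model.fit(make_dataset(), epochs=1)
```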
There are a number of assumptions or constraints imposed by the DL workload that the infrastructure has to accommodate. For example, synchronous SGD workloads typically require all N processes to run; if one process fails, the whole job fails. Jobs using asynchronous techniques (a parameter server, for example), however, can tolerate a dynamically varying number of worker processes. Jobs that use only the data center fabric for communication can be scheduled with more leniency on physical placement, while NNP-T jobs that use Inter-Chip Links (ICL) in a ring or mesh topology have to be scheduled on adjacent accelerators.
In nGraph, we enable High-Performance Computing (HPC) techniques that use MPI to launch distributed training, providing excellent scaling efficiency with very little overhead. See the Distribute training across multiple nGraph backends documentation for details on how to deploy data-parallel training. Currently, nGraph launches a series of duplicated graphs, one on each device; communication happens on the device without copying data to the host. nGraph currently supports data parallelism on two frameworks: TensorFlow* and PaddlePaddle.
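To make the data-parallel pattern concrete, here is a small MPI sketch; it uses mpi4py and NumPy purely for illustration and is not nGraph's internal implementation. Each of the N processes launched by mpirun computes gradients on its own data shard, and an allreduce averages them so every replica applies the same update.

```python
# Conceptual sketch of data-parallel synchronous SGD with MPI (mpi4py + NumPy,
# illustration only). Launch with something like:
#     mpirun -np 4 python data_parallel_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's index
size = comm.Get_size()   # total number of workers

# Hypothetical model parameters, identical on every worker at the start.
params = np.zeros(10)

for step in range(5):
    # Each worker computes gradients on its own data shard
    # (random values stand in for a real backward pass here).
    local_grad = np.random.rand(10)

    # Allreduce sums the gradients across all workers; dividing by the
    # worker count yields the average used by synchronous SGD.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= size

    # Every worker applies the same averaged update, keeping replicas in sync.
    params -= 0.01 * global_grad

if rank == 0:
    print("final params:", params)
```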
The Intel® Nervana™ Neural Network Processor for Training (NNP-T) includes kube-nnp software that enables cluster-level observability and management (orchestration).
kube-nnp extends the functionality of a default installation of Kubernetes, a container orchestration system, to manage the life cycle of machine learning jobs in a cluster of machines that contains NNP T-1000 accelerators, in addition to other compute devices such as CPUs, GPUs, or FPGAs. The orchestration system provides fair sharing, fault tolerance, bin-packing, and hardware abstraction, making large compute clusters easy to use while giving users a standardized experience. For more detail on kube-nnp, see the documentation provided with the software.
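For context on the baseline that kube-nnp builds on, the sketch below submits an ordinary Kubernetes batch Job with the official Python client. The image name, command, and job name are placeholders, and kube-nnp's NNP-T-specific scheduling and resource types are not shown here; refer to the kube-nnp documentation for those.

```python
# Minimal sketch of submitting a training job to a plain Kubernetes cluster
# using the official Python client. All names below are illustrative.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig with cluster access

container = client.V1Container(
    name="trainer",
    image="example.com/train:latest",   # hypothetical training image
    command=["python", "train.py"],     # hypothetical per-node workload
)
spec = client.V1JobSpec(
    template=client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never")
    ),
    backoff_limit=0,  # do not retry: a failed pod fails the whole job
)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="train-job"),
    spec=spec,
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```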