Optimizer
This module includes a set of optimizers for updating model parameters. It replaces the old optimizers from optimizer.py.

class singa.opt.Optimizer(config)
Bases: object
Base optimizer.
 Parameters
config (Dict) – specify the default values of configurable variables.

update(param, grad)
Update the param values with given gradients.

step()
Increments the step counter.

register(param_group, config)

load()

save()

class singa.opt.SGD(lr=0.1, momentum=0, dampening=0, weight_decay=0, nesterov=False)
Bases: singa.opt.Optimizer
Implements stochastic gradient descent (optionally with momentum).
Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.
 Args:
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
 Typical usage example:
>>> from singa import opt
>>> optimizer = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer.update(param, grad)
Note
The implementation of SGD with Momentum / Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.
Considering the specific case of Momentum, the update can be written as
\[v = \rho * v + g\]
\[p = p - lr * v\]
where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.
This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form
\[v = \rho * v + lr * g\]
\[p = p - v\]
The Nesterov version is analogously modified.
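The difference between the two formulations can be checked numerically. The following is a minimal sketch in plain Python, not SINGA code; the function names and the scalar parameter are hypothetical, and the velocity is assumed to start at zero. With a constant learning rate the two forms give the same parameters, while they diverge once the learning rate changes between steps.

```python
def momentum_lr_on_velocity(p, grads, lrs, rho):
    """Form used in this note: v = rho * v + g; p = p - lr * v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def momentum_lr_on_gradient(p, grads, lrs, rho):
    """Form of Sutskever et al.: v = rho * v + lr * g; p = p - v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


grads = [1.0, 0.5, -0.25, 2.0]

# Constant learning rate: both forms end at the same parameter value.
const_a = momentum_lr_on_velocity(1.0, grads, [0.1] * 4, rho=0.9)
const_b = momentum_lr_on_gradient(1.0, grads, [0.1] * 4, rho=0.9)

# Decaying learning rate: the results differ, because the first form
# rescales the whole accumulated velocity by the current learning rate.
decay = [0.1, 0.05, 0.025, 0.0125]
decay_a = momentum_lr_on_velocity(1.0, grads, decay, rho=0.9)
decay_b = momentum_lr_on_gradient(1.0, grads, decay, rho=0.9)

print(abs(const_a - const_b), abs(decay_a - decay_b))
```

In practice this means a learning-rate schedule affects past gradients immediately in the first form, but only future gradients in the second.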

update(param, grad)
Performs a single optimization step.

backward_and_update(loss)
Performs backward propagation from the loss and parameter update.
From the loss, it performs backward propagation to get the gradients and then updates the parameters.
 Parameters
loss (Tensor) – the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

class singa.opt.DistOpt(opt=<singa.opt.SGD object>, nccl_id=None, gpu_num=None, gpu_per_node=None, buffSize=4194304)
Bases: object
The class is designed to wrap an optimizer to do distributed training.
This class wraps an optimizer object to perform distributed training based on multiprocessing. Each process has an individual rank, which indicates which GPU the process is using. The training data is partitioned so that each process evaluates the subgradient on its own partition. Once every process has calculated its subgradient, the overall stochastic gradient is obtained by allreducing the subgradients evaluated by all processes. The allreduce operation is supported by the NVIDIA Collective Communications Library (NCCL).
 Parameters
opt (Optimizer) – The optimizer to be wrapped.
nccl_id (NcclIdHolder) – an nccl id holder object for a unique communication id
gpu_num (int) – the GPU id in a single node
gpu_per_node (int) – the number of GPUs in a single node
buffSize (int) – the buffSize in terms of number of elements used in nccl communicator
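The data-parallel scheme described for this class can be illustrated without GPUs or NCCL. The sketch below is plain Python and only simulates the idea: `local_gradient` and `all_reduce_mean` are hypothetical helpers, the model is a single scalar weight, and averaging (rather than summing) the subgradients is an assumption made for the illustration.

```python
def local_gradient(w, shard):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for allreduce: every rank receives the same averaged value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
world_size = 2

# Partition the training data: rank r takes every world_size-th sample.
shards = [data[rank::world_size] for rank in range(world_size)]

w = 0.0
subgradients = [local_gradient(w, shard) for shard in shards]
reduced = all_reduce_mean(subgradients)

# With equally sized shards, the averaged subgradient equals the
# gradient computed over the whole data set.
full = local_gradient(w, data)
print(subgradients, reduced, full)
```

Each simulated rank ends up holding the same gradient, so applying the same optimizer step on every rank keeps the replicas in sync.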

world_size
Total number of processes.
 Type
int

rank_in_local
Local rank of a process on the current node.
 Type
int

rank_in_global
Global rank of a process.
 Type
int
 Typical usage example:
>>> from singa import opt
>>> sgd = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer = opt.DistOpt(sgd)

update(param, grad)
Performs a single optimization step.

all_reduce(tensor)
Performs all reduce of a tensor for distributed training.
 Parameters
tensor (Tensor) – a tensor to be allreduced

fused_all_reduce(tensor, send=True)
Performs all reduce of the tensors after fusing them in a buffer.
 Parameters
tensor (List of Tensors) – a list of tensors to be allreduced
send (bool) – When send is False, the tensor won't be sent to the target device immediately; it will be copied to the buffer first.
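The idea of fusing tensors in a buffer before a single all reduce can be sketched in plain Python. This is an illustration only, not the SINGA implementation: `fuse`, `unfuse` and `all_reduce_sum` are hypothetical helpers, tensors are modelled as flat lists, and two ranks are simulated in one process.

```python
def fuse(tensors):
    """Copy several flat tensors into one contiguous buffer and record sizes."""
    buffer, sizes = [], []
    for t in tensors:
        buffer.extend(t)
        sizes.append(len(t))
    return buffer, sizes


def unfuse(buffer, sizes):
    """Split a fused buffer back into tensors of the recorded sizes."""
    tensors, offset = [], 0
    for n in sizes:
        tensors.append(buffer[offset:offset + n])
        offset += n
    return tensors


def all_reduce_sum(buffers):
    """Element-wise sum across ranks, standing in for one allreduce call."""
    return [sum(column) for column in zip(*buffers)]


# Two small gradient tensors on each of two simulated ranks.
rank0 = [[1.0, 2.0], [3.0]]
rank1 = [[10.0, 20.0], [30.0]]

fused0, sizes = fuse(rank0)
fused1, _ = fuse(rank1)

# One communication call for the fused buffer instead of one per tensor.
reduced = unfuse(all_reduce_sum([fused0, fused1]), sizes)
print(reduced)
```

Fusing trades an extra copy into the buffer for fewer communication calls, which is why it helps mostly with many small tensors.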

all_reduce_half(tensor)
Performs all reduce of a tensor after converting to FP16.
 Parameters
tensor (Tensor) – a tensor to be allreduced
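Converting to FP16 halves the communication volume but loses precision. The sketch below uses only the Python standard library (the `'e'` half-precision format of `struct`) to show the rounding a value undergoes; `to_fp16_and_back` is a hypothetical helper and says nothing about how SINGA performs the conversion on the GPU.

```python
import struct


def to_fp16_and_back(x):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


gradients = [1.0, 0.1, 1000.123, 1e-8]
rounded = [to_fp16_and_back(g) for g in gradients]

for g, r in zip(gradients, rounded):
    # Large values lose their fractional digits and very small values
    # are flushed to zero, so tiny gradient entries may vanish.
    print(g, r, abs(g - r))
```

Values that are exactly representable, such as 1.0, pass through unchanged, while entries far below the FP16 range are lost entirely.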

fused_all_reduce_half(tensor, send=True)
Performs all reduce of the tensors after fusing and converting them to FP16.
 Parameters
tensor (List of Tensors) – a list of tensors to be allreduced
send (bool) – When send is False, the tensor won't be sent to the target device immediately; it will be copied to the buffer first.

sparsification(tensor, accumulation, spars, topK)
Performs all reduce of a tensor after sparsification.
 Parameters
tensor (Tensor) – a tensor to be allreduced
accumulation (Tensor) – local gradient accumulation
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total gradient number equal to spars, e.g. when spars = 0.01, it sparsifies 1% of the total gradient elements.
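The two modes selected by topK can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SINGA kernel: the helper names are hypothetical, gradients are flat lists, and "sparsifies" is interpreted here as selecting the elements that are kept for transmission while all others are zeroed.

```python
def sparsify_by_value(grad, spars):
    """topK=False: keep elements whose absolute value is >= spars."""
    return [g if abs(g) >= spars else 0.0 for g in grad]


def sparsify_top_k(grad, spars):
    """topK=True: keep the fraction `spars` of elements largest in magnitude."""
    k = max(1, int(len(grad) * spars))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


grad = [0.5, -0.02, 0.03, -0.9, 0.001, 0.2, -0.04, 0.07, 0.0, -0.3]

# Threshold mode: the number of kept elements depends on the data.
by_value = sparsify_by_value(grad, spars=0.1)

# Top-K mode: exactly 20% of the 10 elements (2 elements) are kept.
by_count = sparsify_top_k(grad, spars=0.2)

print(by_value)
print(by_count)
```

The threshold mode is cheaper because it needs no sorting, whereas the top-K mode gives a predictable communication volume.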

fused_sparsification(tensor, accumulation, spars, topK)
Performs all reduce of the tensors after fusing and sparsification.
 Parameters
tensor (List of Tensors) – a list of tensors to be allreduced
accumulation (Tensor) – local gradient accumulation
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total gradient number equal to spars, e.g. when spars = 0.01, it sparsifies 1% of the total gradient elements.

wait()
Wait for the CUDA streams used by the communicator to finish their operations.

backward_and_update(loss, threshold=2097152)
Performs backward propagation from the loss and parameter update.
From the loss, it performs backward propagation to get the gradients and then updates the parameters. For gradient communication, it fuses all the tensors smaller than the threshold value to reduce network latency.
 Parameters
loss (Tensor) – the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
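The role of threshold can be pictured as a planning step that decides, per gradient tensor, whether it is reduced on its own or joins a fused group. The sketch below is a hypothetical illustration in plain Python: `plan_communication` is not a SINGA function, and the rule that a fused group is flushed once adding another tensor would exceed threshold is an assumption.

```python
def plan_communication(sizes, threshold):
    """Return a list of ("direct" | "fused", [tensor indices]) entries."""
    plan, group, group_size = [], [], 0
    for index, size in enumerate(sizes):
        if size > threshold:
            # Large tensors are reduced directly, without fusion.
            plan.append(("direct", [index]))
            continue
        if group and group_size + size > threshold:
            # The pending group is full: reduce it in one fused call.
            plan.append(("fused", group))
            group, group_size = [], 0
        group.append(index)
        group_size += size
    if group:
        plan.append(("fused", group))
    return plan


# Element counts of six gradient tensors and a small threshold for the demo.
sizes = [10, 20, 500, 30, 50, 5]
plan = plan_communication(sizes, threshold=64)

for kind, indices in plan:
    print(kind, indices)
```

A larger threshold yields fewer, bigger fused calls; a smaller one sends more tensors directly.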

backward_and_update_half(loss, threshold=2097152, clipping=False, clip_Value=100)
Performs backward propagation and parameter update, with FP16 precision communication.
THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: From the loss, it performs backward propagation to get the gradients and then updates the parameters. For gradient communication, it fuses all the tensors smaller than the threshold value to reduce network latency, and converts them to FP16 half precision format before sending them out. To assist training, this function provides an option to perform gradient clipping.
 Parameters
loss (Tensor) – the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
clipping (bool) – a boolean flag to choose whether to clip the gradient value
clip_Value (float) – the clip value to be used when clipping is True
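Gradient clipping bounds the gradient values before they are converted and sent. The following sketch assumes element-wise clipping to the interval [-clip_value, clip_value]; that interpretation, and the helper name `clip_gradient`, are assumptions for illustration and are not confirmed by the description above.

```python
def clip_gradient(grad, clip_value):
    """Limit every element to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]


grad = [-250.0, 3.0, 100.0, 10000.0]
clipped = clip_gradient(grad, clip_value=100.0)

# Elements already inside the range are left unchanged.
print(clipped)
```

Keeping values in a bounded range is one way to avoid very large magnitudes in a reduced-precision format.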

backward_and_partial_update(loss, threshold=2097152)
Performs backward propagation from the loss and parameter update using asynchronous training.
THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: From the loss, it performs backward propagation to get the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and performs asynchronous training where one parameter partition is allreduced per iteration. The size of the parameter partition depends on the threshold value.
 Parameters
loss (Tensor) – the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all reduce operation. Tensors larger than the threshold value are reduced directly without fusion.

self.partial
A counter to determine which partition to perform allreduce.
 Type
int

This counter resets to zero automatically after an update cycle of the full parameter set.
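The behaviour of such a counter can be sketched as a round-robin schedule over parameter partitions. The class below is a hypothetical stand-in written in plain Python; how SINGA actually forms the partitions from threshold is not modelled, and the number of partitions is simply given.

```python
class PartialSchedule:
    """Round-robin selection of the partition to allreduce in each iteration."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partial = 0  # index of the partition to synchronise next

    def next_partition(self):
        current = self.partial
        # Wrap around to zero once every partition has been synchronised.
        self.partial = (self.partial + 1) % self.num_partitions
        return current


schedule = PartialSchedule(num_partitions=3)
order = [schedule.next_partition() for _ in range(7)]
print(order)
```

Every partition is synchronised once per cycle, so each parameter is allreduced every num_partitions iterations.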

backward_and_spars_update(loss, threshold=2097152, spars=0.05, topK=False, corr=True)
Performs backward propagation from the loss and parameter update with sparsification.
THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: From the loss, it performs backward propagation to get the gradients and then updates the parameters. It fuses the tensors with size smaller than the threshold value to reduce network latency, and uses sparsification schemes to transfer only the gradient elements which are significant.
 Parameters
loss (Tensor) – the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total gradient number equal to spars, e.g. when spars = 0.01, it sparsifies 1% of the total gradient elements.
corr (bool) – whether to use the locally accumulated gradient for correction
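One common way to use a local gradient accumulation for correction is to keep the elements that were not transmitted and add them to the next gradient. The sketch below illustrates that idea in plain Python with a value threshold; it is an assumption about what corr does, not a description of the SINGA implementation, and `sparse_step` is a hypothetical helper.

```python
def sparse_step(grad, accumulation, spars, corr=True):
    """Return (values to transmit, updated local accumulation)."""
    if corr:
        # Add what was left over from earlier iterations.
        candidate = [a + g for a, g in zip(accumulation, grad)]
    else:
        candidate = list(grad)
    sent = [c if abs(c) >= spars else 0.0 for c in candidate]
    if corr:
        # Whatever was not sent stays in the local accumulation.
        accumulation = [c - s for c, s in zip(candidate, sent)]
    return sent, accumulation


# With correction: the small first element is sent once it has accumulated.
acc = [0.0, 0.0]
sent_1, acc = sparse_step([0.06, 0.5], acc, spars=0.1)
sent_2, acc = sparse_step([0.06, 0.0], acc, spars=0.1)

# Without correction: the small element is dropped in every iteration.
plain_1, _ = sparse_step([0.06, 0.5], [0.0, 0.0], spars=0.1, corr=False)
plain_2, _ = sparse_step([0.06, 0.0], [0.0, 0.0], spars=0.1, corr=False)

print(sent_1, sent_2, plain_1, plain_2)
```

With the correction enabled, small but persistent gradient components are delayed rather than discarded.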

self.sparsInit
A counter to determine which partition to perform allreduce.

self.gradAccumulation
Local gradient accumulation.