Optimizer¶

This module includes a set of optimizers for updating model parameters. It replaces the old optimizers from optimizer.py.

class singa.opt.Optimizer(config)¶

Bases: object

Base optimizer.

Parameters: config (Dict) – specify the default values of configurable variables.

update(param, grad)¶

Update the param values with given gradients.

Parameters

param (Tensor) – param values to be updated in-place
grad (Tensor) – param gradients; the values may be modified inside this function, so do not use grad after the call

step()¶: Increments the step counter.

register(param_group, config)¶

load()¶

save()¶

class singa.opt.SGD(lr=0.1, momentum=0, dampening=0, weight_decay=0, nesterov=False)¶

Bases: singa.opt.Optimizer

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Args:
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)

Typical usage example:
>>> from singa import opt
>>> optimizer = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer.update(param, grad)  # param and grad are Tensors

Note

The implementation of SGD with Momentum / Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\[v = \rho * v + g\]

\[p = p - lr * v\]

where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form

\[v = \rho * v + lr * g\]

\[p = p - v\]

The Nesterov version is analogously modified.
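The difference between the two update forms in this note can be checked numerically. The following is a minimal scalar sketch in plain Python, not SINGA code; the function names and the sample gradients are made up for illustration. With a constant learning rate the two forms give the same parameters (the velocities differ only by a factor of lr), while a learning rate that changes between steps makes them diverge.

```python
def run_unscaled(p, grads, lrs, rho):
    """Form used in this note: v = rho * v + g, then p = p - lr * v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def run_scaled(p, grads, lrs, rho):
    """Sutskever et al. form: v = rho * v + lr * g, then p = p - v."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


grads = [1.0, 0.5, -0.25, 2.0]          # illustrative gradients
constant_lr = [0.1, 0.1, 0.1, 0.1]
decaying_lr = [0.1, 0.05, 0.025, 0.0125]

same_a = run_unscaled(1.0, grads, constant_lr, rho=0.9)
same_b = run_scaled(1.0, grads, constant_lr, rho=0.9)
diff_a = run_unscaled(1.0, grads, decaying_lr, rho=0.9)
diff_b = run_scaled(1.0, grads, decaying_lr, rho=0.9)

print(abs(same_a - same_b))  # agreement up to floating-point rounding
print(abs(diff_a - diff_b))  # clearly non-zero once lr changes per step
```

In practice this means a learning-rate schedule affects the already accumulated velocity in the first form, but only newly added gradients in the second.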

update(param, grad)¶

Performs a single optimization step.

Parameters

param (Tensor) – param values to be updated in place
grad (Tensor) – param gradients; the values may be modified inside this function, so do not use grad after the call
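To show how the constructor arguments could interact in a single step, here is a scalar sketch in plain Python. It is an assumption, not the SINGA source: it follows the semantics commonly used by frameworks whose momentum update has the form given in the note above, and the function name `sgd_step` and its buffer handling are hypothetical.

```python
def sgd_step(p, g, buf, lr=0.1, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """One scalar SGD step; returns (new_param, new_momentum_buffer).

    Assumed semantics only; the real implementation works on Tensors
    and may differ in detail.
    """
    if weight_decay != 0:
        g = g + weight_decay * p              # L2 penalty folded into the gradient
    if momentum != 0:
        if buf is None:
            buf = g                           # first step: buffer starts as the gradient
        else:
            buf = momentum * buf + (1 - dampening) * g
        # Nesterov looks ahead along the buffer; plain momentum uses the buffer itself
        g = g + momentum * buf if nesterov else buf
    return p - lr * g, buf


p, buf = 1.0, None
for g in [0.5, 0.5]:                          # two steps with the same gradient
    p, buf = sgd_step(p, g, buf, lr=0.1, momentum=0.9)
print(p, buf)
```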

backward_and_update(loss)¶

Performs backward propagation from the loss and parameter update.

Starting from the loss, it runs backward propagation to obtain the gradients and then updates the parameters.

Parameters

loss (Tensor) – the objective function of the deep learning model optimization; e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

class singa.opt.DistOpt(opt=<singa.opt.SGD object>, nccl_id=None, gpu_num=None, gpu_per_node=None, buffSize=4194304)¶

Bases: object

The class is designed to wrap an optimizer to do distributed training.

This class wraps an optimizer object to perform distributed training based on multiprocessing. Each process has an individual rank, which indicates which GPU that process is using. The training data is partitioned so that each process evaluates a sub-gradient on its own partition. Once the sub-gradients have been calculated on all processes, the overall stochastic gradient is obtained by all-reducing the sub-gradients evaluated by all processes. The all-reduce operation is supported by the NVIDIA Collective Communications Library (NCCL).
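The all-reduce step described above can be illustrated without GPUs. The sketch below simulates three processes in plain Python; it is not NCCL or SINGA code, and it assumes that the reduction is a sum which is then divided by the number of processes to obtain the averaged gradient.

```python
def all_reduce_sum(per_process):
    """Element-wise sum across processes; every process gets the same result."""
    total = [sum(vals) for vals in zip(*per_process)]
    return [list(total) for _ in per_process]


# One sub-gradient per simulated process (values are illustrative)
sub_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
world_size = len(sub_grads)

reduced = all_reduce_sum(sub_grads)
# Assumption: the summed gradient is averaged before the parameter update
averaged = [[x / world_size for x in g] for g in reduced]
print(averaged[0])
```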

Parameters

opt (Optimizer) – The optimizer to be wrapped.
nccl_id (NcclIdHolder) – an nccl id holder object for a unique communication id
gpu_num (int) – the GPU id in a single node
gpu_per_node (int) – the number of GPUs in a single node
buffSize (int) – the buffSize in terms of number of elements used in nccl communicator

world_size¶

total number of processes

Type: int

rank_in_local¶

local rank of a process on the current node

Type: int

rank_in_global¶

global rank of a process

Type: int

Typical usage example:
>>> from singa import opt
>>> sgd = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer = opt.DistOpt(sgd)

update(param, grad)¶

Performs a single optimization step.

Parameters

param (Tensor) – param values to be updated
grad (Tensor) – param gradients

all_reduce(tensor)¶

Performs all reduce of a tensor for distributed training.

Parameters: tensor (Tensor) – a tensor to be all-reduced

fused_all_reduce(tensor, send=True)¶

Performs all reduce of the tensors after fusing them in a buffer.

Parameters

tensor (List of Tensors) – a list of tensors to be all-reduced
send (bool) – When send is False, the tensor won’t be sent to the target device immediately; it will be copied to the buffer first.
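The idea of fusing can be sketched as packing several small tensors into one flat buffer, reducing the buffer once, and unpacking the result. The example below is plain Python with two simulated processes; the helper names `fuse` and `unfuse` are hypothetical, and the `send` flag and the device buffer are not modelled.

```python
def fuse(tensors):
    """Concatenate flat tensors into one buffer and remember each length."""
    buffer, sizes = [], []
    for t in tensors:
        buffer.extend(t)
        sizes.append(len(t))
    return buffer, sizes


def unfuse(buffer, sizes):
    """Split a flat buffer back into tensors of the recorded lengths."""
    out, offset = [], 0
    for n in sizes:
        out.append(buffer[offset:offset + n])
        offset += n
    return out


proc0 = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
proc1 = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]

buf0, sizes = fuse(proc0)
buf1, _ = fuse(proc1)
# A single reduction over the fused buffer replaces three separate ones
reduced_buffer = [a + b for a, b in zip(buf0, buf1)]
result = unfuse(reduced_buffer, sizes)
print(result)
```

One large transfer usually pays the per-message latency only once, which is the motivation for fusing small tensors.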

all_reduce_half(tensor)¶

Performs all reduce of a tensor after converting to FP16.

Parameters: tensor (Tensor) – a tensor to be all-reduced
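Converting to FP16 halves the communication volume at the cost of precision. The effect of the conversion on a single value can be seen with Python's standard `struct` module, which supports the IEEE half-precision format code `e`; this is only an illustration of the rounding, not how SINGA performs the conversion.

```python
import struct


def fp16_round_trip(x):
    """Store x as IEEE 754 half precision and read it back as a Python float."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


exact = fp16_round_trip(1.0)      # representable exactly in FP16
rounded = fp16_round_trip(0.1)    # not representable; comes back slightly off
print(exact, rounded, abs(rounded - 0.1))
```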

fused_all_reduce_half(tensor, send=True)¶

Performs all reduce of the tensors after fusing and converting them to FP16.

Parameters

tensor (List of Tensors) – a list of tensors to be all-reduced
send (bool) – When send is False, the tensor won’t be sent to the target device immediately; it will be copied to the buffer first.

sparsification(tensor, accumulation, spars, topK)¶

Performs all reduce of a tensor after sparsification.

Parameters

tensor (Tensor) – a tensor to be all-reduced
accumulation (Tensor) – local gradient accumulation
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total number of gradient elements equal to spars, e.g. when spars = 0.01, it sparsifies 1 % of the total gradient elements.
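The two sparsification modes can be sketched on a plain Python list. This is an illustration under an assumption: "sparsifies" is read here as selecting the elements that are kept for transmission, with all other elements set to zero. The function name `sparsify` is hypothetical.

```python
def sparsify(values, spars, topK):
    """Return a copy of values with only the selected elements kept."""
    if topK:
        # Keep the fraction `spars` of elements with the largest magnitude
        k = max(1, int(len(values) * spars))
        order = sorted(range(len(values)),
                       key=lambda i: abs(values[i]), reverse=True)
        keep = set(order[:k])
    else:
        # Keep every element whose absolute value reaches the threshold
        keep = {i for i, v in enumerate(values) if abs(v) >= spars}
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


grad = [0.5, -0.02, 0.03, -0.9, 0.01, 0.2, -0.04, 0.0, 0.07, -0.3]
by_threshold = sparsify(grad, spars=0.1, topK=False)
by_fraction = sparsify(grad, spars=0.2, topK=True)
print(by_threshold)
print(by_fraction)
```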

fused_sparsification(tensor, accumulation, spars, topK)¶

Performs all reduce of the tensors after fusing and sparsification.

Parameters

tensor (List of Tensors) – a list of tensors to be all-reduced
accumulation (Tensor) – local gradient accumulation
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total number of gradient elements equal to spars, e.g. when spars = 0.01, it sparsifies 1 % of the total gradient elements.

wait()¶: Waits for the CUDA streams used by the communicator to finish their operations.

backward_and_update(loss, threshold=2097152)¶

Performs backward propagation from the loss and parameter update.

Starting from the loss, it runs backward propagation to obtain the gradients and then updates the parameters. For gradient communication, it fuses all tensors smaller than the threshold value to reduce network latency.

Parameters

loss (Tensor) – the objective function of the deep learning model optimization; e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation; tensors larger than the threshold are reduced directly without fusion.
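The role of threshold can be sketched as a simple partition of the gradients by element count. The example is plain Python; the parameter names and sizes are invented for illustration, and treating a tensor whose size equals the threshold as "not smaller" is an assumption.

```python
def split_by_threshold(sizes, threshold):
    """Partition tensor names into a fused group and a directly reduced group."""
    fused, direct = [], []
    for name, num_elements in sizes.items():
        if num_elements < threshold:
            fused.append(name)       # accumulated into the fusion buffer
        else:
            direct.append(name)      # all-reduced on its own, without fusion
    return fused, direct


# Hypothetical gradient sizes in number of elements
grad_sizes = {"conv1.W": 9408, "fc.W": 2048000, "big.W": 4194304, "fc.b": 1000}
fused, direct = split_by_threshold(grad_sizes, threshold=2097152)
print(fused, direct)
```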

backward_and_update_half(loss, threshold=2097152, clipping=False, clip_Value=100)¶

Performs backward propagation and parameter update, with FP16 precision communication.

THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: Starting from the loss, it runs backward propagation to obtain the gradients and then updates the parameters. For gradient communication, it fuses all tensors smaller than the threshold value to reduce network latency, and converts them to FP16 half-precision format before sending them out. To assist training, this function provides an option to perform gradient clipping.

Parameters

loss (Tensor) – the objective function of the deep learning model optimization; e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation; tensors larger than the threshold are reduced directly without fusion.
clipping (bool) – a boolean flag to choose whether to clip the gradient value
clip_Value (float) – the clip value to be used when clipping is True
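Gradient clipping before a half-precision conversion can be sketched as follows. The example is plain Python and assumes element-wise clipping of each gradient value to the range [-clip_value, clip_value]; whether SINGA clips by value or by norm is not stated here, so treat this as one possible reading.

```python
def clip_by_value(values, clip_value):
    """Limit every element to the closed range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, v)) for v in values]


grad = [150.0, -3.0, -10000.0, 99.5]
clipped = clip_by_value(grad, clip_value=100.0)
print(clipped)
```

Keeping values in a bounded range is helpful for FP16, whose largest finite value is 65504.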

backward_and_partial_update(loss, threshold=2097152)¶

Performs backward propagation from the loss and parameter update using asynchronous training.

THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: Starting from the loss, it runs backward propagation to obtain the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and performs asynchronous training in which one parameter partition is all-reduced per iteration. The size of the parameter partition depends on the threshold value.

Parameters

loss (Tensor) – the objective function of the deep learning model optimization; e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation; tensors larger than the threshold are reduced directly without fusion.

self.partial¶

A counter to determine which partition to perform all-reduce.

Type: int

This counter resets to zero automatically after an update cycle of the full parameter set.
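The behaviour of such a counter can be sketched as a round-robin index over the parameter partitions. The class below is a hypothetical stand-in written in plain Python; only the attribute name `partial` is taken from this page, and the number of partitions is passed in directly rather than derived from threshold.

```python
class PartialCounter:
    """Round-robin selector: one partition is all-reduced per iteration."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partial = 0

    def next_partition(self):
        index = self.partial
        # Wrap around to zero once every partition has been synchronised
        self.partial = (self.partial + 1) % self.num_partitions
        return index


counter = PartialCounter(num_partitions=3)
schedule = [counter.next_partition() for _ in range(7)]
print(schedule)
```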

backward_and_spars_update(loss, threshold=2097152, spars=0.05, topK=False, corr=True)¶

Performs backward propagation from the loss and parameter update with sparsification.

THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: Starting from the loss, it runs backward propagation to obtain the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and uses sparsification schemes to transfer only the gradient elements that are significant.

Parameters

loss (Tensor) – the objective function of the deep learning model optimization; e.g. for a classification problem it can be the output of the softmax_cross_entropy function.
threshold (int) – a parameter to control performance in fusing the tensors. Tensors smaller than the threshold are accumulated and fused before the all-reduce operation; tensors larger than the threshold are reduced directly without fusion.
spars (float) – a parameter to control sparsity as defined below
topK (bool) – When topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total number of gradient elements equal to spars, e.g. when spars = 0.01, it sparsifies 1 % of the total gradient elements.
corr (bool) – whether to use the local accumulate gradient for correction
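One common way to use a local accumulated gradient for correction is to carry the elements that were not transmitted into the next iteration. The sketch below shows that idea in plain Python with threshold-based selection; it is an assumption about what corr does, not the SINGA implementation, and the function name is hypothetical.

```python
def sparse_step(grad, accumulation, spars, corr=True):
    """Return (values to transmit, accumulation carried to the next step)."""
    if corr:
        # Add what was left unsent in earlier iterations
        candidate = [a + g for a, g in zip(accumulation, grad)]
    else:
        candidate = list(grad)
    sent = [c if abs(c) >= spars else 0.0 for c in candidate]
    if corr:
        residual = [c - s for c, s in zip(candidate, sent)]
    else:
        residual = [0.0] * len(grad)   # unsent values are dropped
    return sent, residual


accumulation = [0.0, 0.0]
history = []
for _ in range(2):                      # same small gradient twice
    sent, accumulation = sparse_step([0.03, 0.2], accumulation, spars=0.05)
    history.append(sent)
print(history)
```

In this run the first element is too small to be sent in the first iteration, but its accumulated value crosses the threshold in the second.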

self.sparsInit¶: A counter to determine which partition to perform all-reduce.

self.gradAccumulation¶: Local gradient accumulation