Optimizer

This module provides a set of optimizers for updating model parameters. It replaces the old optimizers from optimizer.py.

class singa.opt.Optimizer(config)

Bases: object

Base optimizer.

Parameters

config (Dict) – specifies the default values of configurable variables.

update(param, grad)

Update the param values with given gradients.

Parameters
  • param (Tensor) – param values to be updated in-place

  • grad (Tensor) – param gradients; the values may be modified in this function, so do not use grad after the call

step()

Increments the step counter.

register(param_group, config)
load()
save()
class singa.opt.SGD(lr=0.1, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Bases: singa.opt.Optimizer

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters
  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

Typical usage example:

>>> from singa import opt
>>> optimizer = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer.update(param, grad)  # param and grad are Tensors for one parameter

Note

The implementation of SGD with Momentum / Nesterov subtly differs from Sutskever et al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\[v = \rho * v + g\]

\[p = p - lr * v\]

where p, g, v and \(\rho\) denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et al. and other frameworks, which employ an update of the form

\[v = \rho * v + lr * g\]

\[p = p - v\]

The Nesterov version is analogously modified.
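The difference between the two forms can be checked numerically. The following is a minimal sketch in plain Python on scalars, not SINGA code; the function names and the sample gradients are hypothetical. It suggests that the two forms agree when the learning rate is constant and diverge once the learning rate changes between steps.

```python
def momentum_lr_outside(p, grads, lrs, rho):
    """v = rho * v + g;  p = p - lr * v  (the form described in this note)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + g
        p = p - lr * v
    return p


def momentum_lr_inside(p, grads, lrs, rho):
    """v = rho * v + lr * g;  p = p - v  (the contrasting form)."""
    v = 0.0
    for g, lr in zip(grads, lrs):
        v = rho * v + lr * g
        p = p - v
    return p


# Hypothetical gradient sequence, used only for illustration.
grads = [1.0, 0.5, -0.25]
constant_lr = [0.1, 0.1, 0.1]
decaying_lr = [0.1, 0.05, 0.025]

# Constant learning rate: the velocity is only rescaled by lr.
same_a = momentum_lr_outside(1.0, grads, constant_lr, rho=0.9)
same_b = momentum_lr_inside(1.0, grads, constant_lr, rho=0.9)

# Changing learning rate: the two forms weight past gradients differently.
diff_a = momentum_lr_outside(1.0, grads, decaying_lr, rho=0.9)
diff_b = momentum_lr_inside(1.0, grads, decaying_lr, rho=0.9)
```

In the first form a change in lr rescales the whole accumulated velocity at once, whereas in the second form each gradient keeps the learning rate it was added with.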

update(param, grad)

Performs a single optimization step.

Parameters
  • param (Tensor) – param values to be updated in-place

  • grad (Tensor) – param gradients; the values may be modified in this function, so do not use grad after the call

backward_and_update(loss)

Performs backward propagation from the loss and parameter update.

From the loss, it performs backward propagation to get the gradients and then updates the parameters.

Parameters
  • loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

class singa.opt.DistOpt(opt=<singa.opt.SGD object>, nccl_id=None, gpu_num=None, gpu_per_node=None, buffSize=4194304)

Bases: object

The class is designed to wrap an optimizer to do distributed training.

This class wraps an optimizer object to perform distributed training based on multiprocessing. Each process has an individual rank, which identifies the GPU that the process is using. The training data is partitioned so that each process evaluates the sub-gradient on its own partition. Once the sub-gradient has been calculated on each process, the overall stochastic gradient is obtained by all-reducing the sub-gradients evaluated by all processes. The all-reduce operation is supported by the NVIDIA Collective Communications Library (NCCL).

Parameters
  • opt (Optimizer) – The optimizer to be wrapped.

  • nccl_id (NcclIdHolder) – an NCCL id holder object for a unique communication id

  • gpu_num (int) – the GPU id in a single node

  • gpu_per_node (int) – the number of GPUs in a single node

  • buffSize (int) – the buffer size, in number of elements, used by the NCCL communicator

world_size

total number of processes

Type

int

rank_in_local

local rank of a process on the current node

Type

int

rank_in_global

global rank of a process

Type

int

Typical usage example:

>>> from singa import opt
>>> sgd = opt.SGD(lr=0.1, momentum=0.9)
>>> optimizer = opt.DistOpt(sgd)
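The data-parallel scheme described for this class can be illustrated without GPUs or NCCL. The sketch below is plain Python, not SINGA code: the helper names, the toy least-squares loss, and the data are hypothetical, and it assumes a sum all-reduce followed by division by the world size. With equally sized partitions, the averaged sub-gradients match the gradient over the full data set.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w * x - y)^2 over one process's data shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_sum(values):
    """Stand-in for a sum all-reduce: every process receives the same total."""
    total = sum(values)
    return [total for _ in values]


# Hypothetical training data, split evenly across two simulated processes.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
world_size = 2
shards = [data[rank::world_size] for rank in range(world_size)]

w = 0.5
sub_grads = [local_gradient(w, shard) for shard in shards]

# Assumption: the reduced sum is divided by the number of processes.
reduced = all_reduce_sum(sub_grads)
averaged = [g / world_size for g in reduced]

full_batch = local_gradient(w, data)
```

Every simulated process ends up with the same averaged gradient, so all replicas apply an identical parameter update.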

update(param, grad)

Performs a single optimization step.

Parameters
  • param (Tensor) – param values to be updated

  • grad (Tensor) – param gradients

all_reduce(tensor)

Performs all reduce of a tensor for distributed training.

Parameters

tensor (Tensor) – a tensor to be all-reduced

fused_all_reduce(tensor, send=True)

Performs all reduce of the tensors after fusing them in a buffer.

Parameters
  • tensor (List of Tensors) – a list of tensors to be all-reduced

  • send (bool) – when send is False, the tensor is not sent to the target device immediately; it is copied to the buffer first
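The idea of fusing tensors into one buffer before a single all-reduce can be sketched as follows. This is an illustration in plain Python with lists standing in for tensors; fuse, unfuse and all_reduce_sum are hypothetical helpers, not part of the SINGA API.

```python
def fuse(tensors):
    """Copy several tensors (lists of floats) into one flat buffer."""
    buffer, sizes = [], []
    for t in tensors:
        buffer.extend(t)
        sizes.append(len(t))
    return buffer, sizes


def unfuse(buffer, sizes):
    """Split a flat buffer back into tensors of the recorded sizes."""
    out, offset = [], 0
    for n in sizes:
        out.append(buffer[offset:offset + n])
        offset += n
    return out


def all_reduce_sum(buffers):
    """Element-wise sum across processes (stand-in for one collective call)."""
    return [sum(column) for column in zip(*buffers)]


# Gradients of three small tensors on two simulated processes.
worker0 = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
worker1 = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]

buf0, sizes = fuse(worker0)
buf1, _ = fuse(worker1)

# One reduction over the fused buffer replaces three separate reductions.
reduced = unfuse(all_reduce_sum([buf0, buf1]), sizes)
```

Fusing trades an extra copy for fewer collective calls, which helps when many tensors are small and per-call latency dominates.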

all_reduce_half(tensor)

Performs all reduce of a tensor after converting to FP16.

Parameters

tensor (Tensor) – a tensor to be all-reduced
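The effect of converting values to FP16 before communication can be previewed with the Python standard library, whose struct module supports the IEEE 754 half-precision format code "e". This is only an illustration of the number format, not of how SINGA performs the conversion; the sample values are hypothetical.

```python
import struct


def to_fp16_and_back(values):
    """Round each float to IEEE 754 half precision and read it back."""
    fmt = f"<{len(values)}e"
    return list(struct.unpack(fmt, struct.pack(fmt, *values)))


# Hypothetical gradient values of very different magnitudes.
grads = [0.1, 1.0, 1e-8, 1024.5]
halved = to_fp16_and_back(grads)

# Half precision uses 2 bytes per element, single precision uses 4.
payload_fp32 = len(struct.pack(f"<{len(grads)}f", *grads))
payload_fp16 = len(struct.pack(f"<{len(grads)}e", *grads))
```

The payload is halved, at the cost of rounding error and of very small values underflowing to zero.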

fused_all_reduce_half(tensor, send=True)

Performs all reduce of the tensors after fusing and converting them to FP16.

Parameters
  • tensor (List of Tensors) – a list of tensors to be all-reduced

  • send (bool) – when send is False, the tensor is not sent to the target device immediately; it is copied to the buffer first

sparsification(tensor, accumulation, spars, topK)

Performs all reduce of a tensor after sparsification.

Parameters
  • tensor (Tensor) – a tensor to be all-reduced

  • accumulation (Tensor) – local gradient accumulation

  • spars (float) – a parameter to control sparsity as defined below

  • topK (bool) – when topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total gradient number equal to spars, e.g. when spars = 0.01, it sparsifies 1 % of the total gradient elements
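The two selection modes controlled by topK can be sketched in plain Python. The helpers below are hypothetical and operate on a list of floats; they assume that the selected elements are the ones transmitted as (index, value) pairs, which is an interpretation of the description above rather than a statement about the SINGA implementation.

```python
def sparsify_threshold(grad, spars):
    """topK=False: keep (index, value) pairs with absolute value >= spars."""
    return [(i, g) for i, g in enumerate(grad) if abs(g) >= spars]


def sparsify_topk(grad, spars):
    """topK=True: keep the fraction `spars` of elements largest in magnitude."""
    k = max(1, int(len(grad) * spars))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return sorted((i, grad[i]) for i in order[:k])


# Hypothetical gradient with ten elements.
grad = [0.02, -0.5, 0.003, 0.9, -0.04, 0.0, 0.3, -0.001, 0.06, 0.2]

by_threshold = sparsify_threshold(grad, spars=0.1)  # magnitude cut-off
by_topk = sparsify_topk(grad, spars=0.2)            # 20 % of 10 elements
```

In threshold mode the number of selected elements varies with the gradient, while in top-K mode it is fixed by spars and the tensor size.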

fused_sparsification(tensor, accumulation, spars, topK)

Performs all reduce of the tensors after fusing and sparsification.

Parameters
  • tensor (List of Tensors) – a list of tensors to be all-reduced

  • accumulation (Tensor) – local gradient accumulation

  • spars (float) – a parameter to control sparsity as defined below

  • topK (bool) – when topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total gradient number equal to spars, e.g. when spars = 0.01, it sparsifies 1 % of the total gradient elements

wait()

Waits for the CUDA streams used by the communicator to finish their operations.

backward_and_update(loss, threshold=2097152)

Performs backward propagation from the loss and parameter update.

From the loss, it performs backward propagation to get the gradients and then updates the parameters. For gradient communication, it fuses all the tensors smaller than the threshold value to reduce network latency.

Parameters
  • loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

  • threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.
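How threshold separates fused from direct communication can be sketched as a small planning function. This is a hypothetical illustration in plain Python, not the SINGA implementation; in particular, flushing the pending group once it would exceed the threshold is an assumption made for this sketch.

```python
def plan_communication(sizes, threshold):
    """Return a list of ("direct", [index]) and ("fused", [indices]) calls."""
    calls, pending, pending_size = [], [], 0
    for idx, size in enumerate(sizes):
        if size >= threshold:
            # Large tensor: reduce it directly, without fusion.
            calls.append(("direct", [idx]))
            continue
        if pending and pending_size + size > threshold:
            # Assumption: flush the fused group before it grows too large.
            calls.append(("fused", pending))
            pending, pending_size = [], 0
        pending.append(idx)
        pending_size += size
    if pending:
        calls.append(("fused", pending))
    return calls


# Hypothetical tensor sizes (number of elements) in backward order.
sizes = [100, 5000, 200, 300, 9000, 50]
plan = plan_communication(sizes, threshold=1000)
```

Here six tensors need only three collective calls: two direct reductions for the large tensors and one fused reduction for the four small ones.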

backward_and_update_half(loss, threshold=2097152, clipping=False, clip_Value=100)

Performs backward propagation and parameter update, with FP16 precision communication.

THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: From the loss, it performs backward propagation to get the gradients and then updates the parameters. For gradient communication, it fuses all the tensors smaller than the threshold value to reduce network latency, and converts them to FP16 half-precision format before sending them out. To assist training, this function provides an option to perform gradient clipping.

Parameters
  • loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

  • threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.

  • clipping (bool) – a boolean flag to choose whether to clip the gradient value

  • clip_Value (float) – the clip value to be used when clipping is True
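Gradient clipping as controlled by clipping and clip_Value can be sketched in a few lines. The helper below is hypothetical and assumes element-wise clipping to the range [-clip_value, clip_value]; the documentation above does not state which clipping rule SINGA applies.

```python
def clip(values, clip_value):
    """Limit every element to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, v)) for v in values]


# Hypothetical gradient values, two of which exceed the clip value.
grads = [250.0, -3.5, -1e4, 99.9]
clipped = clip(grads, clip_value=100.0)
```

Bounding the magnitudes this way also keeps values well inside the finite range of FP16, whose largest finite value is 65504.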

backward_and_partial_update(loss, threshold=2097152)

Performs backward propagation from the loss and parameter update using asynchronous training.

THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: From the loss, it performs backward propagation to get the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and performs asynchronous training where one parameter partition is all-reduced per iteration. The size of the parameter partition depends on the threshold value.

Parameters
  • loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

  • threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.

self.partial

A counter that determines which partition is all-reduced in the current iteration.

Type

int

This counter resets to zero automatically after an update cycle of the full parameter set.
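The role of the partition counter can be sketched as a round-robin schedule. The code below is a hypothetical illustration in plain Python: make_partitions, the grouping rule and the tensor sizes are assumptions made for this sketch, and only the idea of a counter that wraps back to zero after a full cycle comes from the description above.

```python
def make_partitions(sizes, threshold):
    """Group consecutive parameter indices until a group reaches threshold."""
    partitions, current, current_size = [], [], 0
    for idx, size in enumerate(sizes):
        current.append(idx)
        current_size += size
        if current_size >= threshold:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)
    return partitions


# Hypothetical parameter sizes (number of elements).
sizes = [400, 700, 300, 900, 200]
partitions = make_partitions(sizes, threshold=1000)

partial = 0    # plays the role of the partition counter
schedule = []  # which partition is all-reduced in each iteration
for iteration in range(5):
    schedule.append(partitions[partial])
    # Wrap to zero after every partition has been all-reduced once.
    partial = (partial + 1) % len(partitions)
```

Each iteration communicates only one partition, so the full parameter set is synchronized once every len(partitions) iterations.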
backward_and_spars_update(loss, threshold=2097152, spars=0.05, topK=False, corr=True)

Performs backward propagation from the loss and parameter update with sparsification.

THIS IS AN EXPERIMENTAL FUNCTION FOR RESEARCH PURPOSES: From the loss, it performs backward propagation to get the gradients and then updates the parameters. It fuses the tensors smaller than the threshold value to reduce network latency, and uses sparsification schemes to transfer only the gradient elements that are significant.

Parameters
  • loss (Tensor) – loss is the objective function of the deep learning model optimization, e.g. for a classification problem it can be the output of the softmax_cross_entropy function.

  • threshold (int) – threshold is a parameter to control performance in fusing the tensors. Tensors smaller than threshold are accumulated and fused before the all-reduce operation. Tensors larger than the threshold value are reduced directly without fusion.

  • spars (float) – a parameter to control sparsity as defined below

  • topK (bool) – when topK is False, it sparsifies the gradient with absolute value >= spars. When topK is True, it sparsifies a fraction of the total gradient number equal to spars, e.g. when spars = 0.01, it sparsifies 1 % of the total gradient elements

  • corr (bool) – whether to use the local accumulated gradient for correction

self.sparsInit

A counter to determine which partition to perform all-reduce.

self.gradAccumulation

Local gradient accumulation
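One way to read the corr option together with the local gradient accumulation is as a residual correction: elements that are too small to be sent are kept locally and added to the next gradient. The sketch below illustrates that reading in plain Python for the threshold mode; sparse_step is a hypothetical helper and the exact correction rule used by SINGA is an assumption here.

```python
def sparse_step(grad, accumulation, spars, corr=True):
    """Return (sent, new_accumulation) for one iteration in threshold mode."""
    # With correction, previously unsent residuals are added back first.
    if corr:
        candidate = [g + a for g, a in zip(grad, accumulation)]
    else:
        candidate = list(grad)
    # Only elements with absolute value >= spars are transmitted.
    sent = [c if abs(c) >= spars else 0.0 for c in candidate]
    # Whatever was not sent stays in the local accumulation.
    residual = [c - s for c, s in zip(candidate, sent)]
    return sent, residual


accumulation = [0.0, 0.0]
history = []
# Hypothetical gradients over three iterations.
for grad in ([0.04, 0.5], [0.04, 0.0], [0.04, 0.0]):
    sent, accumulation = sparse_step(grad, accumulation, spars=0.1)
    history.append(sent)
```

The first element is below the threshold in every single iteration, yet it is transmitted in the third iteration once the accumulated value reaches spars, so small gradients are delayed rather than lost.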