stable_pretraining.optim package#

Submodules#

stable_pretraining.optim.lars module#

class stable_pretraining.optim.lars.LARS(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, eta=0.001, eps=1e-08, clip_lr=False, exclude_bias_n_norm=False)[source]#

Bases: Optimizer

Extends PyTorch's SGD with LARS (layer-wise adaptive rate scaling).

Implementation based on the paper Large Batch Training of Convolutional Networks.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float) – learning rate

  • momentum (float, optional) – momentum factor (default: 0)

  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

  • dampening (float, optional) – dampening for momentum (default: 0)

  • nesterov (bool, optional) – enables Nesterov momentum (default: False)

  • eta (float, optional) – trust coefficient for computing LR (default: 0.001)

  • eps (float, optional) – eps for division denominator (default: 1e-8)

  • clip_lr (bool, optional) – whether to clip the computed layer-wise learning rate (default: False)

  • exclude_bias_n_norm (bool, optional) – whether to exclude bias and normalization parameters from layer-wise LR scaling (default: False)

Example

>>> import torch
>>> from stable_pretraining.optim.lars import LARS
>>> model = torch.nn.Linear(10, 1)
>>> input = torch.randn(10)
>>> target = torch.Tensor([1.0])
>>> loss_fn = lambda input, target: (input - target) ** 2
>>> optimizer = LARS(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()

Note

The application of momentum in the SGD part is modified according to the PyTorch standards. LARS scaling fits into the equation in the following fashion:

\begin{aligned}
    g_{t+1} & = \text{lars\_lr} * (\beta * p_{t} + g_{t+1}), \\
    v_{t+1} & = \mu * v_{t} + g_{t+1}, \\
    p_{t+1} & = p_{t} - \text{lr} * v_{t+1},
\end{aligned}

where \(p\), \(g\), \(v\), \(\mu\), and \(\beta\) denote the parameters, gradient, velocity, momentum, and weight decay respectively. The \(\text{lars\_lr}\) is defined by Eq. 6 in the paper. The Nesterov version is analogously modified.
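
For intuition, the update above can be written out for a single parameter tensor as follows. This is an illustrative sketch only, not the optimizer's actual implementation; in particular, the exact trust-ratio formula for \(\text{lars\_lr}\) (Eq. 6 with weight decay in the denominator) is an assumption.

>>> import torch
>>> p, g, v = torch.randn(10), torch.randn(10), torch.zeros(10)
>>> lr, momentum, weight_decay, eta, eps = 0.1, 0.9, 1e-4, 0.001, 1e-8
>>> # trust ratio (Eq. 6): eta * ||p|| / (||g|| + weight_decay * ||p|| + eps)
>>> lars_lr = eta * p.norm() / (g.norm() + weight_decay * p.norm() + eps)
>>> g = lars_lr * (weight_decay * p + g)  # scaled gradient with weight decay folded in
>>> v = momentum * v + g  # PyTorch-style momentum buffer
>>> p = p - lr * v  # parameter update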

Warning

Parameters with weight decay set to 0 will automatically be excluded from layer-wise LR scaling. This is to ensure consistency with papers like SimCLR and BYOL.

step(closure=None)[source]#

Performs a single optimization step.

Parameters:

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

stable_pretraining.optim.lr_scheduler module#

class stable_pretraining.optim.lr_scheduler.CosineDecayer(total_steps, n_cycles=3, gamma=0.2)[source]#

Bases: object

Apply cosine decay with multiple cycles for learning rate scheduling.

This class implements a cosine decay function with multiple cycles that can be used as a learning rate scheduler. The decay follows a cosine curve with additional cyclic variations.

Parameters:
  • total_steps (int) – Total number of training steps.

  • n_cycles (int, optional) – Number of cycles in the cosine decay. Defaults to 3.

  • gamma (float, optional) – Gamma parameter for cycle amplitude. Defaults to 0.2.

Example

>>> decayer = CosineDecayer(total_steps=1000, n_cycles=3)
>>> lr_factor = decayer(step=500)
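
Because the decayer is a callable mapping a step index to a multiplicative factor, it can be plugged into torch.optim.lr_scheduler.LambdaLR. This is a usage sketch; it assumes the callable also accepts the step positionally, as LambdaLR passes it.

>>> import torch
>>> model = torch.nn.Linear(10, 1)
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
>>> decayer = CosineDecayer(total_steps=1000, n_cycles=3)
>>> scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decayer)
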
stable_pretraining.optim.lr_scheduler.LinearWarmup(optimizer, total_steps, start_factor=0.01, peak_step=0.1)[source]#

Create a linear warmup learning rate scheduler.

This function creates a linear warmup scheduler that gradually increases the learning rate from a small value to the full learning rate over a specified number of steps.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer to schedule.

  • total_steps (int) – Total number of training steps.

  • start_factor (float, optional) – Initial learning rate factor. Defaults to 0.01.

  • peak_step (float, optional) – Step at which warmup ends, expressed as a fraction of total_steps. Defaults to 0.1.

Returns:

Linear warmup scheduler.

Return type:

torch.optim.lr_scheduler.LinearLR

Example

>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmup(optimizer, total_steps=1000, start_factor=0.01)
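
The scheduler is step-based, so it is stepped once per optimization step rather than once per epoch. A minimal continuation of the example above (the gradient computation of a real training loop is omitted):

>>> for step in range(1000):
...     optimizer.step()  # in practice preceded by zero_grad() and backward()
...     scheduler.step()  # advances the linear warmup by one training step
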
stable_pretraining.optim.lr_scheduler.LinearWarmupCosineAnnealing(optimizer, total_steps, start_factor=0.01, end_lr=0.0, peak_step=0.01)[source]#

Combine linear warmup with cosine annealing decay.

This function creates a scheduler that first linearly warms up the learning rate, then applies cosine annealing decay. This is commonly used in self-supervised learning to achieve better convergence.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer to schedule.

  • total_steps (int) – Total number of training steps.

  • start_factor (float, optional) – Initial learning rate factor for warmup. Defaults to 0.01.

  • end_lr (float, optional) – Final learning rate after annealing. Defaults to 0.0.

  • peak_step (float, optional) – Step at which warmup ends (as fraction of total_steps). Defaults to 0.01.

Returns:

Combined warmup and annealing scheduler.

Return type:

torch.optim.lr_scheduler.SequentialLR

Example

>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCosineAnnealing(optimizer, total_steps=1000)
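
To inspect the shape of the combined schedule, one can record the learning rate while stepping; the exact values depend on start_factor, end_lr, and peak_step. A sketch continuing the example above:

>>> lrs = []
>>> for step in range(1000):
...     lrs.append(optimizer.param_groups[0]["lr"])
...     optimizer.step()
...     scheduler.step()
>>> # lrs rises linearly to the base lr during warmup, then follows a cosine decay
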
class stable_pretraining.optim.lr_scheduler.LinearWarmupCosineAnnealingLR(optimizer, warmup_steps, max_steps, warmup_start_lr=0.0, eta_min=0.0, last_epoch=-1)[source]#

Bases: _LRScheduler

Learning rate scheduler with linear warmup followed by cosine annealing.

This scheduler implements a custom learning rate schedule that combines linear warmup with cosine annealing. It provides more control over the warmup and annealing phases compared to the factory function approach.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer to schedule.

  • warmup_steps (int) – Number of steps for linear warmup.

  • max_steps (int) – Total number of training steps.

  • warmup_start_lr (float, optional) – Starting learning rate for warmup. Defaults to 0.0.

  • eta_min (float, optional) – Minimum learning rate after annealing. Defaults to 0.0.

  • last_epoch (int, optional) – The index of last epoch. Defaults to -1.

Example

>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCosineAnnealingLR(
...     optimizer, warmup_steps=100, max_steps=1000
... )
get_lr()[source]#

Compute the learning rate for the current epoch.

Returns:

List of learning rates for each parameter group.

Return type:

list

stable_pretraining.optim.lr_scheduler.LinearWarmupCyclicAnnealing(optimizer, total_steps, start_factor=0.01, peak_step=0.1)[source]#

Combine linear warmup with cyclic cosine annealing.

This function creates a scheduler that combines linear warmup with cyclic cosine annealing. The cyclic annealing provides multiple learning rate cycles which can help escape local minima during training.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer to schedule.

  • total_steps (int) – Total number of training steps.

  • start_factor (float, optional) – Initial learning rate factor for warmup. Defaults to 0.01.

  • peak_step (float, optional) – Step at which warmup ends (as fraction of total_steps). Defaults to 0.1.

Returns:

Combined warmup and cyclic annealing scheduler.

Return type:

torch.optim.lr_scheduler.SequentialLR

Example

>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCyclicAnnealing(optimizer, total_steps=1000)
stable_pretraining.optim.lr_scheduler.LinearWarmupThreeStepsAnnealing(optimizer, total_steps, start_factor=0.001, gamma=0.3, peak_step=0.05)[source]#

Combine linear warmup with a three-step learning rate annealing.

This function creates a scheduler that combines linear warmup with a three-step annealing schedule. The annealing reduces the learning rate at three predefined milestones, which can help with fine-tuning and convergence.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer to schedule.

  • total_steps (int) – Total number of training steps.

  • start_factor (float, optional) – Initial learning rate factor for warmup. Defaults to 0.001.

  • gamma (float, optional) – Multiplicative factor for learning rate reduction. Defaults to 0.3.

  • peak_step (float, optional) – Step at which warmup ends (as fraction of total_steps). Defaults to 0.05.

Returns:

Combined warmup and three-step annealing scheduler.

Return type:

torch.optim.lr_scheduler.SequentialLR

Example

>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupThreeStepsAnnealing(optimizer, total_steps=1000)
stable_pretraining.optim.lr_scheduler.create_scheduler(optimizer: Optimizer, scheduler_config: str | dict | partial | type, module: Any = None) → LRScheduler[source]#

Create a learning rate scheduler with flexible configuration.

This function provides a unified way to create schedulers from various configuration formats, used by both Module and OnlineProbe for consistency.

Parameters:
  • optimizer – The optimizer to attach the scheduler to

  • scheduler_config – Can be:
      - str: Name of scheduler (e.g., “CosineAnnealingLR”)
      - dict: {“type”: “CosineAnnealingLR”, “T_max”: 1000, …}
      - partial: Pre-configured scheduler (e.g., partial(CosineAnnealingLR, T_max=1000))
      - class: Direct scheduler class (will use smart defaults)

  • module – Optional module instance for accessing trainer properties (for smart defaults)

Returns:

Configured scheduler instance

Examples

>>> # Simple string (uses smart defaults)
>>> scheduler = create_scheduler(opt, "CosineAnnealingLR")
>>> # With custom parameters
>>> scheduler = create_scheduler(
...     opt, {"type": "StepLR", "step_size": 30, "gamma": 0.1}
... )
>>> # Using partial for full control
>>> from functools import partial
>>> scheduler = create_scheduler(
...     opt, partial(torch.optim.lr_scheduler.ExponentialLR, gamma=0.95)
... )

stable_pretraining.optim.utils module#

Shared utilities for optimizer and scheduler configuration.

stable_pretraining.optim.utils.create_optimizer(params, optimizer_config: str | dict | partial | type) → Optimizer[source]#

Create an optimizer from flexible configuration.

This function provides a unified way to create optimizers from various configuration formats, used by both Module and OnlineProbe for consistency.

Parameters:
  • params – Parameters to optimize (e.g., model.parameters())

  • optimizer_config – Can be:
      - str: optimizer name from torch.optim or stable_pretraining.optim (e.g., “AdamW”, “LARS”)
      - dict: {“type”: “AdamW”, “lr”: 1e-3, …}
      - partial: pre-configured optimizer factory
      - class: optimizer class (e.g., torch.optim.AdamW)

Returns:

Configured optimizer instance

Examples

>>> # String name (uses default parameters)
>>> opt = create_optimizer(model.parameters(), "AdamW")
>>> # Dict with parameters
>>> opt = create_optimizer(
...     model.parameters(), {"type": "SGD", "lr": 0.1, "momentum": 0.9}
... )
>>> # Using partial
>>> from functools import partial
>>> opt = create_optimizer(
...     model.parameters(), partial(torch.optim.Adam, lr=1e-3)
... )
>>> # Direct class
>>> opt = create_optimizer(model.parameters(), torch.optim.RMSprop)

Module contents#