stable_pretraining.optim package#
Submodules#
stable_pretraining.optim.lars module#
- class stable_pretraining.optim.lars.LARS(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, eta=0.001, eps=1e-08, clip_lr=False, exclude_bias_n_norm=False)[source]#
Bases: Optimizer
Extends SGD in PyTorch with LARS scaling. Implementation based on the paper Large Batch Training of Convolutional Networks.
- Parameters:
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float) – learning rate
momentum (float, optional) – momentum factor (default: 0)
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
dampening (float, optional) – dampening for momentum (default: 0)
nesterov (bool, optional) – enables Nesterov momentum (default: False)
eta (float, optional) – trust coefficient for computing LR (default: 0.001)
eps (float, optional) – eps for division denominator (default: 1e-8)
Example
>>> model = torch.nn.Linear(10, 1)
>>> input = torch.Tensor(10)
>>> target = torch.Tensor([1.0])
>>> loss_fn = lambda input, target: (input - target) ** 2
>>> optimizer = LARS(model.parameters(), lr=0.1, momentum=0.9)
>>> optimizer.zero_grad()
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()
Note
The application of momentum in the SGD part is modified according to the PyTorch standards. LARS scaling fits into the equation in the following fashion:
\[
\begin{aligned}
g_{t+1} & = \text{lars\_lr} * (\beta * p_{t} + g_{t+1}), \\
v_{t+1} & = \mu * v_{t} + g_{t+1}, \\
p_{t+1} & = p_{t} - \text{lr} * v_{t+1},
\end{aligned}
\]
where \(p\), \(g\), \(v\), \(\mu\), and \(\beta\) denote the parameters, gradient, velocity, momentum, and weight decay respectively. The \(\text{lars\_lr}\) factor is defined by Eq. 6 in the paper. The Nesterov version is analogously modified.
Warning
Parameters with weight decay set to 0 will automatically be excluded from layer-wise LR scaling. This is to ensure consistency with papers like SimCLR and BYOL.
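The exclusion described above is typically exercised by giving biases and normalization parameters their own parameter group with weight_decay=0. A minimal sketch (the 1-D parameter heuristic below is an illustrative convention, not part of this API):
>>> import torch
>>> from stable_pretraining.optim.lars import LARS
>>> model = torch.nn.Sequential(
...     torch.nn.Linear(10, 10), torch.nn.BatchNorm1d(10), torch.nn.Linear(10, 1)
... )
>>> # Biases and norm weights are 1-D; weight_decay=0 also excludes them
>>> # from layer-wise LR scaling, per the warning above.
>>> decay = [p for p in model.parameters() if p.ndim > 1]
>>> no_decay = [p for p in model.parameters() if p.ndim <= 1]
>>> optimizer = LARS(
...     [
...         {"params": decay, "weight_decay": 1e-6},
...         {"params": no_decay, "weight_decay": 0.0},
...     ],
...     lr=0.1,
...     momentum=0.9,
... )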
stable_pretraining.optim.lr_scheduler module#
- class stable_pretraining.optim.lr_scheduler.CosineDecayer(total_steps, n_cycles=3, gamma=0.2)[source]#
Bases: object
Apply cosine decay with multiple cycles for learning rate scheduling.
This class implements a cosine decay function with multiple cycles that can be used as a learning rate scheduler. The decay follows a cosine curve with additional cyclic variations.
- Parameters:
total_steps (int) – Total number of training steps.
n_cycles (int, optional) – Number of cosine cycles over total_steps. Defaults to 3.
gamma (float, optional) – Defaults to 0.2.
Example
>>> decayer = CosineDecayer(total_steps=1000, n_cycles=3)
>>> lr_factor = decayer(step=500)
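Because the decayer is a plain callable mapping a step index to a learning rate factor, it can be plugged into torch.optim.lr_scheduler.LambdaLR. A minimal sketch, assuming the returned value is a multiplicative factor on the base learning rate:
>>> import torch
>>> from stable_pretraining.optim.lr_scheduler import CosineDecayer
>>> model = torch.nn.Linear(10, 1)
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
>>> decayer = CosineDecayer(total_steps=1000, n_cycles=3)
>>> # LambdaLR multiplies the base lr by the factor returned for each step
>>> scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=decayer)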
- stable_pretraining.optim.lr_scheduler.LinearWarmup(optimizer, total_steps, start_factor=0.01, peak_step=0.1)[source]#
Create a linear warmup learning rate scheduler.
This function creates a linear warmup scheduler that gradually increases the learning rate from a small value to the full learning rate over a specified number of steps.
- Parameters:
optimizer (torch.optim.Optimizer) – The optimizer to schedule.
total_steps (int) – Total number of training steps.
start_factor (float, optional) – Initial learning rate factor. Defaults to 0.01.
peak_step (float, optional) – Step at which warmup peaks (as fraction of total_steps). Defaults to 0.1.
- Returns:
Linear warmup scheduler.
- Return type:
Example
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmup(optimizer, total_steps=1000, start_factor=0.01)
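A usage sketch, assuming the scheduler is stepped once per training step (the random batch and squared-output loss are placeholders):
>>> import torch
>>> from stable_pretraining.optim.lr_scheduler import LinearWarmup
>>> model = torch.nn.Linear(10, 1)
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmup(optimizer, total_steps=1000, start_factor=0.01)
>>> for step in range(1000):
...     loss = model(torch.randn(32, 10)).pow(2).mean()  # placeholder loss
...     optimizer.zero_grad()
...     loss.backward()
...     optimizer.step()
...     scheduler.step()  # advance the warmup once per training step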
- stable_pretraining.optim.lr_scheduler.LinearWarmupCosineAnnealing(optimizer, total_steps, start_factor=0.01, end_lr=0.0, peak_step=0.01)[source]#
Combine linear warmup with cosine annealing decay.
This function creates a scheduler that first linearly warms up the learning rate, then applies cosine annealing decay. This is commonly used in self-supervised learning to achieve better convergence.
- Parameters:
optimizer (torch.optim.Optimizer) – The optimizer to schedule.
total_steps (int) – Total number of training steps.
start_factor (float, optional) – Initial learning rate factor for warmup. Defaults to 0.01.
end_lr (float, optional) – Final learning rate after annealing. Defaults to 0.0.
peak_step (float, optional) – Step at which warmup ends (as fraction of total_steps). Defaults to 0.01.
- Returns:
Combined warmup and annealing scheduler.
- Return type:
Example
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCosineAnnealing(optimizer, total_steps=1000)
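To sanity-check the schedule shape before a long run, the scheduler can be stepped in a dry loop while recording the learning rate. A minimal sketch (PyTorch warns when scheduler.step() is called before optimizer.step(), which is harmless here):
>>> import torch
>>> from stable_pretraining.optim.lr_scheduler import LinearWarmupCosineAnnealing
>>> model = torch.nn.Linear(10, 1)
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCosineAnnealing(optimizer, total_steps=1000)
>>> lrs = []
>>> for _ in range(1000):
...     lrs.append(optimizer.param_groups[0]["lr"])  # LR of the first param group
...     scheduler.step()
>>> # max(lrs) should sit near the base lr of 0.001 at the warmup peak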
- class stable_pretraining.optim.lr_scheduler.LinearWarmupCosineAnnealingLR(optimizer, warmup_steps, max_steps, warmup_start_lr=0.0, eta_min=0.0, last_epoch=-1)[source]#
Bases: _LRScheduler
Learning rate scheduler with linear warmup followed by cosine annealing.
This scheduler implements a custom learning rate schedule that combines linear warmup with cosine annealing. It provides more control over the warmup and annealing phases compared to the factory function approach.
- Parameters:
optimizer (torch.optim.Optimizer) – The optimizer to schedule.
warmup_steps (int) – Number of steps for linear warmup.
max_steps (int) – Total number of training steps.
warmup_start_lr (float, optional) – Starting learning rate for warmup. Defaults to 0.0.
eta_min (float, optional) – Minimum learning rate after annealing. Defaults to 0.0.
last_epoch (int, optional) – The index of last epoch. Defaults to -1.
Example
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCosineAnnealingLR(
...     optimizer, warmup_steps=100, max_steps=1000
... )
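Since the class derives from PyTorch's _LRScheduler, the inherited get_last_lr() can be used to log the schedule during training. A minimal sketch with the forward/backward code elided:
>>> import torch
>>> from stable_pretraining.optim.lr_scheduler import LinearWarmupCosineAnnealingLR
>>> model = torch.nn.Linear(10, 1)
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCosineAnnealingLR(
...     optimizer, warmup_steps=100, max_steps=1000
... )
>>> for step in range(1000):
...     # ... forward, backward, optimizer.step() ...
...     scheduler.step()
...     if step % 250 == 0:
...         print(step, scheduler.get_last_lr())  # one LR per param group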
- stable_pretraining.optim.lr_scheduler.LinearWarmupCyclicAnnealing(optimizer, total_steps, start_factor=0.01, peak_step=0.1)[source]#
Combine linear warmup with cyclic cosine annealing.
This function creates a scheduler that combines linear warmup with cyclic cosine annealing. The cyclic annealing provides multiple learning rate cycles which can help escape local minima during training.
- Parameters:
optimizer (torch.optim.Optimizer) – The optimizer to schedule.
total_steps (int) – Total number of training steps.
start_factor (float, optional) – Initial learning rate factor for warmup. Defaults to 0.01.
peak_step (float, optional) – Step at which warmup ends (as fraction of total_steps). Defaults to 0.1.
- Returns:
Combined warmup and cyclic annealing scheduler.
- Return type:
Example
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupCyclicAnnealing(optimizer, total_steps=1000)
- stable_pretraining.optim.lr_scheduler.LinearWarmupThreeStepsAnnealing(optimizer, total_steps, start_factor=0.001, gamma=0.3, peak_step=0.05)[source]#
Combine linear warmup with a three-step learning rate annealing.
This function creates a scheduler that combines linear warmup with a three-step annealing schedule. The annealing reduces the learning rate at three predefined milestones, which can help with fine-tuning and convergence.
- Parameters:
optimizer (torch.optim.Optimizer) – The optimizer to schedule.
total_steps (int) – Total number of training steps.
start_factor (float, optional) – Initial learning rate factor for warmup. Defaults to 0.001.
gamma (float, optional) – Multiplicative factor for learning rate reduction. Defaults to 0.3.
peak_step (float, optional) – Step at which warmup ends (as fraction of total_steps). Defaults to 0.05.
- Returns:
Combined warmup and three-step annealing scheduler.
- Return type:
Example
>>> optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
>>> scheduler = LinearWarmupThreeStepsAnnealing(optimizer, total_steps=1000)
- stable_pretraining.optim.lr_scheduler.create_scheduler(optimizer: Optimizer, scheduler_config: str | dict | partial | type, module: Any = None) → LRScheduler[source]#
Create a learning rate scheduler with flexible configuration.
This function provides a unified way to create schedulers from various configuration formats, used by both Module and OnlineProbe for consistency.
- Parameters:
optimizer – The optimizer to attach the scheduler to
scheduler_config – Configuration for the scheduler. Can be:
- str: name of a scheduler (e.g., "CosineAnnealingLR")
- dict: {"type": "CosineAnnealingLR", "T_max": 1000, ...}
- partial: pre-configured scheduler (e.g., partial(CosineAnnealingLR, T_max=1000))
- class: direct scheduler class (will use smart defaults)
module – Optional module instance for accessing trainer properties (for smart defaults)
- Returns:
Configured scheduler instance
Examples
>>> # Simple string (uses smart defaults)
>>> scheduler = create_scheduler(opt, "CosineAnnealingLR")

>>> # With custom parameters
>>> scheduler = create_scheduler(
...     opt, {"type": "StepLR", "step_size": 30, "gamma": 0.1}
... )

>>> # Using partial for full control
>>> from functools import partial
>>> scheduler = create_scheduler(
...     opt, partial(torch.optim.lr_scheduler.ExponentialLR, gamma=0.95)
... )
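The two factory functions compose when both the optimizer and the scheduler come from plain config dicts (e.g., loaded from a YAML file). A minimal sketch using the module paths documented on this page:
>>> import torch
>>> from stable_pretraining.optim.utils import create_optimizer
>>> from stable_pretraining.optim.lr_scheduler import create_scheduler
>>> model = torch.nn.Linear(10, 1)
>>> optimizer = create_optimizer(model.parameters(), {"type": "AdamW", "lr": 1e-3})
>>> scheduler = create_scheduler(
...     optimizer, {"type": "CosineAnnealingLR", "T_max": 1000}
... )
>>> # call optimizer.step() then scheduler.step() once per training step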
stable_pretraining.optim.utils module#
Shared utilities for optimizer and scheduler configuration.
- stable_pretraining.optim.utils.create_optimizer(params, optimizer_config: str | dict | partial | type) → Optimizer[source]#
Create an optimizer from flexible configuration.
This function provides a unified way to create optimizers from various configuration formats, used by both Module and OnlineProbe for consistency.
- Parameters:
params – Parameters to optimize (e.g., model.parameters())
optimizer_config – Configuration for the optimizer. Can be:
- str: optimizer name from torch.optim or stable_pretraining.optim (e.g., "AdamW", "LARS")
- dict: {"type": "AdamW", "lr": 1e-3, ...}
- partial: pre-configured optimizer factory
- class: optimizer class (e.g., torch.optim.AdamW)
- Returns:
Configured optimizer instance
Examples
>>> # String name (uses default parameters)
>>> opt = create_optimizer(model.parameters(), "AdamW")

>>> # Dict with parameters
>>> opt = create_optimizer(
...     model.parameters(), {"type": "SGD", "lr": 0.1, "momentum": 0.9}
... )

>>> # Using partial
>>> from functools import partial
>>> opt = create_optimizer(
...     model.parameters(), partial(torch.optim.Adam, lr=1e-3)
... )

>>> # Direct class
>>> opt = create_optimizer(model.parameters(), torch.optim.RMSprop)
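Names are also resolved against stable_pretraining.optim, so the LARS optimizer documented above can be requested by name. A minimal sketch, assuming the remaining dict keys are forwarded as keyword arguments as in the examples above:
>>> # "LARS" is looked up in stable_pretraining.optim rather than torch.optim
>>> opt = create_optimizer(
...     model.parameters(), {"type": "LARS", "lr": 0.1, "momentum": 0.9}
... )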