strategy_adapters

Fine-Tuning Scheduler Strategy Adapters

Strategy adapters extend Fine-Tuning Scheduler support for complex or custom training strategies. The built-in adapters (FSDPStrategyAdapter, ModelParallelStrategyAdapter) handle PyTorch’s advanced distributed training strategies.

Plugin Support

Warning

This is an experimental feature which is still in development.

Fine-Tuning Scheduler supports custom strategy adapters via Python entry points. Third-party packages can register custom strategy adapters that will be automatically discovered at runtime.

To register a custom strategy adapter, add an entry point in your package’s pyproject.toml:

[project.entry-points."finetuning_scheduler.strategy_adapters"]
my_adapter = "my_package.adapters:MyStrategyAdapter"

The entry point name (my_adapter in this example), automatically lowercased, is used to reference the adapter. Once registered, the adapter can be used by mapping Lightning strategy flags to it via the custom_strategy_adapters parameter. Each mapping value can be the entry point name, a fully qualified class path with a colon separator (module:Class), or one with a dot separator (module.Class):

from finetuning_scheduler import FinetuningScheduler

# Map strategy flags to adapters using entry point name
fts = FinetuningScheduler(
    custom_strategy_adapters={
        "single_device": "my_adapter",  # Entry point name
        "ddp": "my_package.adapters:MyStrategyAdapter",  # Colon-separated
        "fsdp": "my_package.adapters.MyStrategyAdapter",  # Dot-separated
    }
)
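
For orientation, the object referenced by an entry point is simply a StrategyAdapter subclass. A minimal sketch of the hypothetical my_package.adapters module used in the examples above might look like the following (module and class names are illustrative):

# my_package/adapters.py (hypothetical module matching the entry point example above)
from finetuning_scheduler.strategy_adapters import StrategyAdapter


class MyStrategyAdapter(StrategyAdapter):
    """Minimal custom adapter sketch: inherits all default FTS hook behavior.

    Override the relevant StrategyAdapter hooks (e.g. ``on_before_init_fts`` or
    ``fts_optim_transform``) to adapt FTS to the target strategy's requirements.
    """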

See Strategy Adapter Entry Points for complete documentation and examples.

class finetuning_scheduler.strategy_adapters.FSDPStrategyAdapter(awp_overrides=None, *args, **kwargs)[source]

A StrategyAdapter that extends FinetuningScheduler (FTS) to support flexible, multi-phase, scheduled fine-tuning with the Fully Sharded Data Parallel (FSDP) strategy (FSDPStrategy).

As with standard FSDP usage, FSDP wrapping of a LightningModule can be performed either by providing an auto_wrap_policy or (for maximal control) by overriding the configure_model method of LightningModule and manually wrapping the module.

To support multi-phase scheduled fine-tuning with FSDP in use_orig_params=False mode, FTS's key precondition is that the defined fine-tuning schedule phases have disjoint sets of FSDP-flattened parameters (i.e. FlatParameters, which are created when wrapping a set of modules in an FSDP instance/unit). This constraint derives from the fact that, in use_orig_params=False mode, the requires_grad attribute must be the same for all parameters flattened into the same FlatParameter.

To facilitate module wrapping in alignment with fine-tuning schedule phases, FTS provides the awp_overrides feature which allows users to provide module name-based complements to a given auto_wrap_policy. See the Example: Multi-Phase Scheduled Fine-Tuning with FSDP tutorial for a concrete example and additional guidance.

FTS will attempt to validate that the module is wrapped in a manner that aligns with the defined fine-tuning schedule phases prior to the start of training and will provide detailed feedback to the user if a misalignment is discovered.

Note

The no_decay attribute that FTS supports on LightningModule with the base StrategyAdapter is not currently supported in the context of FSDP fine-tuning.

Tip

Because of inter-module dependencies (among other reasons), wrapping every submodule in its own separate FSDP instance is often not a viable approach to ensuring fine-tuning schedule/module wrapping alignment. Starting with a provided auto_wrap_policy (e.g. transformer_auto_wrap_policy) and providing module name-based complements as needed using awp_overrides is often the most expedient approach to auto-wrapping in alignment with a fine-tuning schedule. As always, if needed, one can override configure_model and manually wrap a given LightningModule to align with a desired fine-tuning schedule.

The only user-facing configuration for FSDPStrategyAdapter is awp_overrides, an optional list of module names that should be wrapped in separate FSDP instances, complementing the modules that would be individually wrapped by auto_wrap_policy provided in the FSDPStrategy strategy configuration.

Parameters:

awp_overrides (List | None) – A list of module names to wrap in separate FSDP instances (i.e., auto_wrap_policy overrides). Only applicable when complementing/overriding an auto_wrap_policy provided in the FSDPStrategy strategy configuration. Override lists will be ignored when manually wrapping modules via a configure_model method. If the named modules cannot be found, an exception will be thrown. Defaults to None.

awp_overrides

A list of module names to wrap in separate FSDP instances.
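
A minimal configuration sketch follows, assuming adapter kwargs are passed to FinetuningScheduler via its strategy_adapter_cfg parameter; the wrapped layer class and the override module names below are illustrative placeholders, not a prescribed configuration:

from functools import partial

import lightning.pytorch as pl
import torch.nn as nn
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from finetuning_scheduler import FinetuningScheduler

# Auto-wrap each transformer block (layer class shown is illustrative) in its own FSDP instance
policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={nn.TransformerEncoderLayer})

# Complement the policy with name-based overrides so FSDP instances align with schedule phases
# ("model.pooler" and "model.classifier" are hypothetical module names)
fts = FinetuningScheduler(strategy_adapter_cfg={"awp_overrides": ["model.pooler", "model.classifier"]})

trainer = pl.Trainer(
    strategy=FSDPStrategy(auto_wrap_policy=policy),
    accelerator="gpu",
    devices=2,
    callbacks=[fts],
)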

fsdp_param_transform(orig_thaw_pl, inspect_only)[source]

The parameter transformation function currently used by fts_optim_transform() to transform original parameter lists for optimizer operations.

Parameters:
  • orig_thaw_pl (List) – The original parameter name list before FSDP’s transformation of them.

  • inspect_only (bool) – Whether to use the specified transform in read-only (i.e. inspect_only) mode, avoiding any persistent state transformation that may accompany normal usage. Typically useful for state inspection and validation contexts.

Returns:

A transformed parameter name list that matches the current optimizer's view of them after FSDP's transformation of the original parameter names.

Return type:

List

fts_optim_transform(orig_pl, inspect_only=False)[source]

Because FSDP performs parameter transformations that cause the current optimizer’s view of parameter names to diverge from the original parameter names, this parameter transformation is required for optimizer operations.

Parameters:
  • orig_pl (List) – The original parameter name list before FSDP’s transformation of them.

  • inspect_only (bool) – Whether to use the specified transform in read-only (i.e. inspect_only) mode, avoiding any persistent state transformation that may accompany normal usage. Typically useful for state inspection and validation contexts.

Returns:

A transformed parameter name list that matches the current optimizer's view of them after FSDP's transformation of the original parameter names.

Return type:

List

load_optimizer_state_dict(checkpoint_connector)[source]

Override the default load_optimizer_state_dict method so that we can allow FSDP to manage the movement of restored optimizer states to the relevant devices.

Parameters:

checkpoint_connector (_CheckpointConnector) – The _CheckpointConnector associated with the current training session.

Return type:

None

logical_param_translation(param_names)[source]

Effectively the reverse transformation of fts_optim_transform().

Parameters:

param_names (List) – A parameter name list from the current optimizer’s view of them after FSDP’s transformation of the original parameter names.

Returns:

The original parameter name list before FSDP's transformation of them.

Return type:

List

on_after_init_fts()[source]

To accommodate FSDP, we defer executing the first fine-tuning phase that would otherwise be executed in this hook, which fires in FinetuningScheduler setup immediately after init_fts().

Return type:

None

on_before_fts_fit_start()[source]

In this hook executed immediately before the FinetuningScheduler on_fit_start() hook begins, we ensure the provided fine-tuning schedule and FSDP wrapped LightningModule are appropriately aligned and valid. If the fine-tuning schedule and wrapped module are detected to be incompatible, detailed feedback is provided to the user (which is why multiple checks are aggregated before returning any alignment exceptions).

Raises:

MisconfigurationException – If any FTS FSDP fine-tuning schedule/module wrapping alignment exceptions are thrown. The provided exceptions provide detailed feedback for the user to address the misalignment.

Return type:

None

on_before_init_fts()[source]

In this hook, executed immediately before init_fts(), we accommodate FSDP by performing the following steps:

  1. Disable Lightning’s restoration of the optimizer to allow us to implement special handling

  2. Prune the no_decay specification since it is not currently supported in the context of FSDP fine-tuning

  3. Validate the awp_overrides configuration

  4. Configure FTS wrapping of the provided LightningModule to either use the provided LightningModule.configure_model method (if present) or a provided auto_wrap_policy.

Return type:

None

on_before_restore_optimizers_and_lrs()[source]

Allow the FSDPStrategyAdapter to override the default load_optimizer_state_dict method.

This is necessary so we can allow FSDP to manage the movement of restored optimizer states to the relevant devices.

Return type:

None

optimizer_state(optimizer)[source]

Override the default optimizer_state method so that we can unify use_orig_params code-paths and save a full, consolidated optimizer state dict to be restored via load_optimizer_state_dict.

Parameters:

optimizer (Optimizer) – The optimizer instance for which a full optimizer state dict will be captured.

Returns:

The consolidated full optimizer state dict (if on rank 0, otherwise an empty dict).

Return type:

dict[str, Tensor]

property lightning_restore_optimizer

Disable Lightning’s restoration of the optimizer to allow FTS to implement special handling.

Returns:

Returns False to allow FTS control over optimizer restoration.

Return type:

bool

class finetuning_scheduler.strategy_adapters.ModelParallelStrategyAdapter(fsdp_default_kwargs=None, fsdp_plan=None, *args, **kwargs)[source]

A StrategyAdapter that extends FinetuningScheduler (FTS) to support flexible, multi-phase, scheduled fine-tuning with PyTorch’s composable distributed (e.g. fully_shard) and Tensor Parallelism APIs. FTS augments Lightning’s Model Parallel strategy (ModelParallelStrategy) by allowing users to apply the fully_shard API using module name/pattern-based configuration instead of manually inspecting modules and applying the API in LightningModule.configure_model (see fsdp_plan).

See the FTS Distributed Composable API Training Examples tutorial for a concrete example and additional guidance.

Note

fsdp_plan module name/pattern-based fully_shard directives are applied after any preceding Tensor Parallel or explicit fully_shard directives in LightningModule.configure_model. FTS will only apply fully_shard to a specified module if it was not already applied to that module.

Note

In addition to all valid fully_shard API kwargs, fsdp_plan also supports the act_ckpt and cpu_offload_policy kwargs.

For specified module/patterns (or fsdp_default_kwargs), act_ckpt allows one to pass a string alias specifying the use of the desired activation checkpointing API (e.g. “composable”, “wrapped”, “wrapped_offload”) as well as an optional Dict of activation checkpointing kwargs. The specified checkpointing APIs will be applied to the matching module(s) before fully_shard.

cpu_offload_policy is a convenience alias that will apply CPUOffloadPolicy to the matching module(s) along with any provided Dict of policy kwargs.

The only user-facing configuration options for ModelParallelStrategyAdapter are fsdp_plan and fsdp_default_kwargs.

Parameters:
  • fsdp_plan (Dict | None) –

    An optional dictionary of module names or regex pattern keys with associated fully_shard composable distributed API kwargs to apply to matching modules.

    • Allows users to apply the fully_shard API using module name/pattern-based configuration instead of manually inspecting modules and applying the API in LightningModule.configure_model.

    • fsdp_plan directives can also be composed with explicit fully_shard calls in LightningModule.configure_model, as the fsdp_plan directives will only invoke fully_shard on a specified module if it was not already applied to that module.

    • All valid fully_shard API kwargs are supported.

    • fsdp_plan directives are applied in the order provided in the fsdp_plan dictionary.

    Additionally, fsdp_plan supports act_ckpt and cpu_offload_policy kwargs. For specified module/patterns (or fsdp_default_kwargs):

    • act_ckpt (Sequence [ str, Dict | None ] | ActCkptCfg): pass an alias specifying the use of the desired activation checkpointing API (e.g. “composable”, “wrapped”, “wrapped_offload”) as well as an optional Dict of activation checkpointing kwargs. The specified checkpointing APIs will be applied to the matching module(s) before fully_shard.

    • cpu_offload_policy (Dict [ Optional [ str , Any ]]) is a convenience alias that will apply CPUOffloadPolicy to the matching module(s) along with any provided Dict of policy kwargs. Defaults to None.

  • fsdp_default_kwargs (Dict | None) – An optional dictionary of default fully_shard API kwargs to apply to each matching module in fsdp_plan. Module-name/pattern specific kwargs will take precedence over these. All kwargs valid for fsdp_plan above are supported. Defaults to None.

fsdp_plan

An optional dictionary of module names or regex pattern keys with associated fully_shard composable distributed API kwargs to apply to matching modules.

  • Allows users to apply the fully_shard API using module name/pattern-based configuration instead of manually inspecting modules and applying the API in LightningModule.configure_model.

  • fsdp_plan directives can also be composed with explicit fully_shard calls in LightningModule.configure_model, as the fsdp_plan directives will only invoke fully_shard on a specified module if it was not already applied to that module.

  • All valid fully_shard API kwargs are supported.

  • fsdp_plan directives are applied in the order provided in the fsdp_plan dictionary.

Additionally, fsdp_plan supports act_ckpt and cpu_offload_policy kwargs. For specified module/patterns (or fsdp_default_kwargs):

  • act_ckpt (Sequence [ str, Dict | None ] | ActCkptCfg): pass an alias specifying the use of the desired activation checkpointing API (e.g. “composable”, “wrapped”, “wrapped_offload”) as well as an optional Dict of activation checkpointing kwargs. The specified checkpointing APIs will be applied to the matching module(s) before fully_shard.

  • cpu_offload_policy (Dict [ Optional [ str , Any ]]) is a convenience alias that will apply CPUOffloadPolicy to the matching module(s) along with any provided Dict of policy kwargs.

fsdp_default_kwargs

An optional dictionary of default fully_shard API kwargs to apply to each matching module in fsdp_plan. Module-name/pattern specific kwargs will take precedence over these. All kwargs valid for fsdp_plan above are supported.
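
A configuration sketch follows, again assuming adapter kwargs are passed to FinetuningScheduler via its strategy_adapter_cfg parameter; the module name/regex patterns are illustrative and must match modules in your LightningModule:

import lightning.pytorch as pl
from lightning.pytorch.strategies import ModelParallelStrategy

from finetuning_scheduler import FinetuningScheduler

fts = FinetuningScheduler(
    strategy_adapter_cfg={
        "fsdp_plan": {
            # fully_shard each matching (hypothetical) block, applying composable activation
            # checkpointing first via the (alias, optional kwargs) form described above
            r"model\.layers\.\d+": {"act_ckpt": ("composable", None)},
            # fully_shard the (hypothetical) embedding module with the default kwargs only
            "model.embed": {},
        },
        # defaults applied to every matching module unless overridden per-pattern above
        "fsdp_default_kwargs": {"reshard_after_forward": True},
    }
)

trainer = pl.Trainer(strategy=ModelParallelStrategy(), accelerator="gpu", devices=2, callbacks=[fts])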

on_before_fts_fit_start()[source]

In this hook executed immediately before the FinetuningScheduler on_fit_start() hook begins, we ensure the provided fine-tuning schedule and FSDP2 composed LightningModule are appropriately aligned.

If the fine-tuning schedule and composed modules yield parameter group configurations that may not be supported by some optimizer group operations, detailed feedback on potential remediation is provided to the user.

Return type:

None

on_before_init_fts()[source]

In this hook, executed immediately before init_fts(), we accommodate enhanced Model Parallel functionality by performing the following steps:

  1. Validate the fsdp_plan configuration

  2. Configure FTS wrapping of the provided LightningModule to either use the provided LightningModule.configure_model method (if present) or a provided fsdp_plan.

Return type:

None

class finetuning_scheduler.strategy_adapters.StrategyAdapter[source]

Base class for all strategy adapters. Implements the default FinetuningScheduler hooks. Can be subclassed to extend FinetuningScheduler support for a complex or custom Strategy via an associated StrategyAdapter.

Tip

If you want to extend FTS to use a custom, currently unsupported strategy or override current FTS behavior in the context of a given training strategy, subclassing StrategyAdapter is a way to do so. See FSDPStrategyAdapter for an example implementation.

The default fine-tuning phase execution function is set on StrategyAdapter initialization.

This can be overridden by StrategyAdapter subclasses to adapt fine-tuning phase execution to meet strategy-specific requirements.
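
A minimal subclass sketch follows, showing the typical extension points; the name-prefixing strategy it adapts to is purely hypothetical:

from finetuning_scheduler.strategy_adapters import StrategyAdapter


class PrefixingStrategyAdapter(StrategyAdapter):
    """Hypothetical adapter for a strategy that prefixes optimizer parameter names with "_shard."."""

    def fts_optim_transform(self, orig_pl, inspect_only=False):
        # Translate scheduled (original) names to the optimizer's (hypothetically prefixed) view
        return [f"_shard.{n}" for n in orig_pl]

    def logical_param_translation(self, param_names):
        # Reverse transformation: optimizer view back to the original names
        return [n.removeprefix("_shard.") for n in param_names]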

static base_ft_phase(module, thaw_pl, translation_func=None, init_thaw=False)[source]

Thaw/unfreeze the provided list of parameters in the given Module.

Parameters:
  • module (Module) – The Module that will have parameters selectively unfrozen/thawed.

  • thaw_pl (list) – The list of parameters that should be thawed/unfrozen in the Module

  • init_thaw (bool) – If True, modifies message to user accordingly. Defaults to False.

Returns:

A Tuple of two lists.
  1. The list of newly thawed/unfrozen parameters thawed by this function

  2. A list of all currently thawed/unfrozen parameters in the target Module

Return type:

tuple[List, List]
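
Conceptually, the thaw step amounts to something like the following simplified sketch; this is not the actual implementation, which also handles optional name translation and user feedback:

from typing import List, Tuple

import torch.nn as nn


def naive_ft_phase(module: nn.Module, thaw_pl: List[str]) -> Tuple[List[str], List[str]]:
    """Simplified sketch: unfreeze only the named parameters and report thawed sets."""
    newly_thawed = []
    for name, param in module.named_parameters():
        if name in thaw_pl and not param.requires_grad:
            param.requires_grad = True
            newly_thawed.append(name)
    currently_thawed = [n for n, p in module.named_parameters() if p.requires_grad]
    return newly_thawed, currently_thawed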

before_restore_model(checkpoint)[source]

Adapter hook executed before model restore.

Strategy adapters can override this to modify or translate the checkpoint contents (e.g. for state-dict translations) before the model’s load path is executed.

Parameters:

checkpoint (dict[str, Any]) – The full checkpoint dict loaded by the Trainer.

Returns:

The checkpoint dictionary to be used for restore.

Return type:

dict[str, Any]
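
For example, an adapter could translate state-dict keys before restore. The sketch below is hedged: the "_orig_mod." prefix handling is hypothetical, only the standard "state_dict" checkpoint key is assumed:

from typing import Any

from finetuning_scheduler.strategy_adapters import StrategyAdapter


class KeyTranslatingAdapter(StrategyAdapter):
    """Hypothetical adapter that strips a wrapper-added prefix from checkpoint keys."""

    def before_restore_model(self, checkpoint: dict[str, Any]) -> dict[str, Any]:
        state_dict = checkpoint["state_dict"]  # standard Lightning checkpoint key
        checkpoint["state_dict"] = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
        return checkpoint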

connect(fts_parent)[source]

Create a handle for the associated FinetuningScheduler instance.

Parameters:

fts_parent (Callback) – The associated FinetuningScheduler instance

Return type:

None

fts_optim_transform(orig_pl, inspect_only=False)[source]

A method that can be overridden by a StrategyAdapter if a Strategy performs parameter transformations that cause the current optimizer’s view of parameter names to diverge from the original parameter names. By default, no transformation of schedule parameter names is required for optimizer operations.

Parameters:
  • orig_pl (List) – The original parameter name list before a given Strategy’s transformation of them.

  • inspect_only (bool) – Whether to use the specified transform in read-only (i.e. inspect_only) mode, avoiding any persistent state transformation that may accompany normal usage. Typically useful for state inspection and validation contexts.

Returns:

A transformed parameter name list that matches the current optimizer's view of them after a given Strategy's transformation of the original parameter names.

Return type:

List

gen_ft_schedule(dump_loc)[source]

Generate the default fine-tuning schedule using a naive, 2-parameters-per-level heuristic.

This method can be overridden by StrategyAdapter subclasses to customize schedule generation for specific strategies (e.g., using strategy-specific parameter naming conventions).

Parameters:

dump_loc (str | PathLike) – The directory to which the generated schedule (.yaml) should be written

Returns:

The path to the generated schedule, which is written by default to Trainer.log_dir and named after the LightningModule subclass in use, with the suffix _ft_schedule.yaml.

Return type:

os.PathLike

get_named_params_for_schedule_validation()[source]

Get named parameters for schedule validation.

This method can be overridden by StrategyAdapter subclasses to customize parameter iteration for schedule validation (e.g., returning TL-style parameter names instead of canonical names).

Note

Strategy adapters can override validation behavior at two levels of abstraction:

  1. Parameter naming only (simpler): Override this method to provide custom parameter names while using the default validation logic from _validate_ft_sched().

  2. Full validation logic (more control): Override validate_ft_sched() to completely customize the validation process.

Choose the approach that best suits your use case. Most adapters only need to override this method to provide custom parameter names.

Returns:

A dictionary mapping parameter names to parameter tensors.

By default, returns the standard named_parameters() dict.

Return type:

dict[str, torch.nn.Parameter]
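
For instance, an adapter whose strategy exposes parameters under transformed names might override only this naming level. The sketch below is hypothetical (including the "_tl." prefix) and relies only on the documented pl_module property and default validation logic:

from finetuning_scheduler.strategy_adapters import StrategyAdapter


class TLNameAdapter(StrategyAdapter):
    """Hypothetical adapter whose strategy exposes parameters under transformed ("TL-style") names."""

    def get_named_params_for_schedule_validation(self):
        # Return the strategy's view of parameter names (prefix here is hypothetical) so the
        # default validation logic compares schedule entries against those names
        return {f"_tl.{name}": param for name, param in self.pl_module.named_parameters()}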

logical_param_translation(param_names)[source]

Effectively the reverse transformation of fts_optim_transform(). Can be overridden by a StrategyAdapter if a Strategy performs parameter transformations that cause the original user view of parameter names to diverge from the current optimizer’s view. By default, no transformation of optimizer parameter names is required.

Parameters:

param_names (List) – A parameter name list from the current optimizer’s view of them after a Strategy’s transformation of the original parameter names.

Returns:

The original parameter name list before a given Strategy's transformation.

Return type:

List

on_after_init_fts()[source]

Hook executed in FinetuningScheduler setup immediately after init_fts().

Return type:

None

on_before_fts_fit_start()[source]

Hook executed immediately before the FinetuningScheduler on_fit_start() hook begins.

Return type:

None

on_before_init_fts()[source]

Hook executed in FinetuningScheduler setup immediately before init_fts().

Return type:

None

on_before_restore_optimizers_and_lrs()[source]

Hook executed immediately before FinetuningScheduler restores optimizers and schedulers.

Return type:

None

phase0_optimizer_override()[source]

Reconfigure the user-configured optimizer (configured via configure_optimizers) to optimize the parameters (and only those parameters) scheduled to be optimized in phase 0 of the current fine-tuning schedule.

Reconfiguration only takes place here if FTS discovers that the set of parameters initially thawed and present in the optimizer differs from the set specified in phase 0. Only the parameters included in the optimizer are affected; the choice of optimizer, lr_scheduler, etc. remains unaltered.

Return type:

None

validate_ft_sched()[source]

Validate the fine-tuning schedule configuration.

This method can be overridden by StrategyAdapter subclasses to customize schedule validation for specific strategies (e.g., strategies that require substantially different validation logic beyond just custom parameter naming).

Note

Strategy adapters can override validation behavior at two levels of abstraction:

  1. Parameter naming only (simpler): Override get_named_params_for_schedule_validation() to provide custom parameter names while using the default validation logic from _validate_ft_sched().

  2. Full validation logic (more control): Override this method to completely customize the validation process.

Choose the approach that best suits your use case. Most adapters only need to override get_named_params_for_schedule_validation() to provide custom parameter names.

Returns:

A tuple of ints specifying:
  1. The depth of the final scheduled phase

  2. The maximum epoch watermark explicitly specified in the schedule

Return type:

tuple[int, int]

property pl_module

Convenient access to the LightningModule being fine-tuned.

Returns:

The user’s LightningModule

Return type:

LightningModule

property pls_handle

Convenient access to the current Strategy in use.

Returns:

The Strategy in use.

Return type:

Strategy

property trainer

Convenient access to the Trainer instance.

Returns:

The Trainer instance

Return type:

Trainer

property using_sharded_optimizer

Whether the currently used optimizer is a supported sharded optimizer.

Returns:

Returns True if the current optimizer is a supported sharded optimizer.

Return type:

bool