Accelerate
Accelerate is a library that simplifies distributed PyTorch training on any type of setup by uniting the most common distributed training frameworks, Fully Sharded Data Parallel (FSDP) and DeepSpeed, under a single interface. Trainer is powered by Accelerate under the hood, enabling big model loading and distributed training.
This guide will show you two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with Trainer, and the second method demonstrates adapting a PyTorch training loop. For more detailed information about Accelerate, please refer to the documentation.
Start by installing Accelerate.

```bash
pip install accelerate
```

Then run accelerate config in the command line to answer a series of prompts about your training system. This creates and saves a configuration file so Accelerate can correctly set up training for your hardware.
```bash
accelerate config
```

Depending on your setup and the answers you provide, an example configuration file for distributed training with FSDP on one machine with two GPUs may look like the following.
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Trainer
Pass the path to the saved configuration file to TrainingArguments, and from there, pass your TrainingArguments to Trainer.
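The Trainer example below references a model, tokenizer, dataset, data collator, and compute_metrics function that are assumed to already exist. A minimal sketch of one possible setup is shown here; the choice of bert-base-uncased (matching the BertLayer wrap policy in the config above), the yelp_review_full dataset, and the accuracy metric are illustrative assumptions rather than part of the original example.

```py
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding

# Model and tokenizer; BERT matches the fsdp_transformer_layer_cls_to_wrap: BertLayer setting above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5)

# Tokenize a text classification dataset with train/test splits.
dataset = load_dataset("yelp_review_full")
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

# Dynamically pad each batch to its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Report accuracy on the evaluation set.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
```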
```py
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    fsdp_config="path/to/fsdp_config",
    fsdp="full_shard",
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```

Native PyTorch
Accelerate can also be added to any PyTorch training loop to enable distributed training. The Accelerator is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don’t need to explicitly place your model on a device because Accelerator knows which device to move your model to.
```py
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device
```

All PyTorch objects (model, optimizer, scheduler, dataloaders) should now be passed to the prepare method. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use Accelerate's AcceleratedOptimizer and AcceleratedScheduler, and creates a new shardable dataloader.
```py
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)
```

Replace loss.backward in your training loop with Accelerate's backward method to scale the gradients and determine the appropriate backward method to use depending on your framework (for example, DeepSpeed or Megatron).
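The training loop below also references an optimizer, a learning-rate scheduler, a progress bar, and an epoch count that aren't defined in this snippet. A minimal sketch of that setup follows; the specific choices (AdamW, a linear schedule from get_scheduler, and a tqdm progress bar) are illustrative assumptions, and in a full script the optimizer would be created before the accelerator.prepare call above.

```py
from torch.optim import AdamW
from tqdm.auto import tqdm
from transformers import get_scheduler

# Assumed hyperparameters and helpers; adjust to your task.
num_epochs = 2
optimizer = AdamW(model.parameters(), lr=2e-5)  # in practice, create this before accelerator.prepare

# Linear decay over the total number of optimization steps.
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# Progress bar ticked once per optimization step in the loop below.
progress_bar = tqdm(range(num_training_steps))
```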
```py
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
```

Combine everything into a function and make it callable as a script.
```py
from accelerate import Accelerator

def main():
    accelerator = Accelerator()

    model, optimizer, training_dataloader, scheduler = accelerator.prepare(
        model, optimizer, training_dataloader, scheduler
    )

    for batch in training_dataloader:
        optimizer.zero_grad()
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    main()
```

From the command line, call accelerate launch to run your training script. Any additional arguments or parameters can be passed here as well.
To launch your training script on two GPUs, add the --num_processes argument.
```bash
accelerate launch --num_processes=2 your_script.py
```

Refer to the Launching Accelerate scripts guide for more details.
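You can also point accelerate launch at the FSDP configuration file created earlier instead of passing options on the command line; the path below is a placeholder.

```bash
accelerate launch --config_file path/to/fsdp_config.yaml your_script.py
```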